Over the weekend, Annie Swafford published another installment in her ongoing critique of Syuzhet, the R package that I released in early February. In her recent blog post, an interesting approach for testing the get_transformed_values function is proposed[1].

Previously, Annie had noted that using the default values for the low-pass filter may result in too much information loss, to which I replied that information loss is the point. (Readers hung up on this point are advised to go back and watch the Vonnegut video again.) With any kind of smoothing, there is going to be information loss. The function is designed to allow the user to tune the low-pass filter for greater or lesser degrees of noise (an important point that I shall return to in a moment).

In the new post, Annie explores the efficacy of leaving the low-pass filter at its default value of 3; she demonstrates how this value appears to produce a ringing artifact. This is something that the two of us had discussed at some length in an email correspondence prior to this blogging frenzy. In that correspondence, I promised to explore adding a Gaussian filter to the package, a filter she believes would be more appropriate. Based on her advice, I have explored that option, and will do so further, but for now I remain unconvinced that there is a problem for Gauss to solve.[2]

As I said in my previous post, I believe the true test of the method lies in assessing whether or not the shapes produced by the transformation are a good approximation of the shape of the story. But remember, too, that the primary point of the transformation function is to solve the problem of length; it is hard to compare the plot shape of a long novel to that of a short one. The low-pass argument is essentially a visualization and noise reduction parameter. Users who want a closer, scene-by-scene or sentence-by-sentence representation of the sentiment data will likely gravitate to the get_percentage_values function (and a very large number of bins) as, for example, Lincoln Mullen has done on RPubs.[3]

The downside to that approach, of course, is that you cannot compare two sentiment arcs mathematically; you can only do so by eye.  You cannot compare them mathematically because the amount of text inside each percentage segment will be quite different if the novels are of different lengths, and that would not be a fair comparison.  The transformation function is my attempt at solving this time domain conundrum.  While I believe that it solves the problem well, I’m certainly open to other options.  If we decide that the transformation function is no good, that it produces too much ringing, etc. then we should look for a more attractive alternative.  Until an alternative is found and demonstrated, I’m not going to allow the perfect to become the enemy of the good.
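To make the length problem concrete, here is a minimal Python sketch of percentage-based binning. It is an illustration only, not the package's R code: the names `percentage_means` and `chunk_size` are hypothetical, and the sentence counts in the usage note are made up.

```python
def percentage_means(values, bins=10):
    """Mean of each of `bins` roughly equal, consecutive chunks of `values`."""
    n = len(values)
    if not 1 <= bins <= n:
        raise ValueError("bins must be between 1 and len(values)")
    means = []
    for b in range(bins):
        # Integer arithmetic keeps the chunk boundaries contiguous and non-empty.
        chunk = values[b * n // bins:(b + 1) * n // bins]
        means.append(sum(chunk) / len(chunk))
    return means


def chunk_size(n_sentences, bins=10):
    """Approximate number of sentences that land in each percentage bin."""
    return n_sentences / bins


if __name__ == "__main__":
    # Hypothetical sentence counts for a short and a long novel.
    for n_sentences in (2000, 10000):
        print(n_sentences, "sentences ->", chunk_size(n_sentences, 100), "per bin")
```

With 100 bins, a hypothetical 2,000-sentence novel averages 20 sentences per bin while a 10,000-sentence novel averages 100, so bin 37 in one book summarizes five times as much text as bin 37 in the other.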

But, alas, here we are once again on the question of defining what is “good” and what is “good enough.”  So let us turn now to that question and this matter of ringing artifacts.

The problem of ringing artifacts is well understood in the signal processing literature if a bit less so in the narratological literature:-)  Annie has done a fine job of explicating the nature of this problem, and I can’t help thinking that this is a very clever idea of hers.  In fact, I wrote to Annie acknowledging this and noting how I wish I had thought of it myself.

But after repeating her experiment a number of times, with greater and lesser degrees of success, I decided that this exercise is ultimately a bit of a red herring. Among other things, there are no books in which an entire third consists of nothing but neutral (zero) values, but more importantly the exercise has more to do with the setting of a particular user parameter than it does with the package.

I’d like to now offer a bit of cake and eat it too.  This most recent criticism has focused on the default values for the low-pass filter that I set for the function. There is, of course, nothing preventing adjustment of that parameter by those with a taste for adventure.  The higher the number, the greater the number of components that are retained; the more components we retain, the less ringing and the closer we get to reproducing the original signal.
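The relationship between the number of retained components and fidelity to the original signal can be sketched in a few lines of Python. This is a generic, idealized low-pass filter written for illustration, not the package's implementation; the function names and the three-part toy signal are assumptions of the sketch.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform (fine for short toy signals)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def low_pass(values, keep):
    """Keep the `keep` lowest frequencies (plus their mirror images) and invert."""
    n = len(values)
    spectrum = dft(values)
    # Index k and index n - k describe the same frequency for a real signal,
    # so both must be retained for the output to stay real-valued.
    kept = {k: c for k, c in enumerate(spectrum) if k < keep or k > n - keep}
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in kept.items()).real / n
        for t in range(n)
    ]


def rms_error(original, filtered):
    """Root-mean-square difference between two equal-length sequences."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, filtered)) / n)


# Hypothetical toy signal: a positive third, a neutral third, a negative third.
signal = [1.0] * 100 + [0.0] * 100 + [-1.0] * 100

if __name__ == "__main__":
    for keep in (3, 20, 100):
        error = rms_error(signal, low_pass(signal, keep))
        print(f"keep={keep:3d}  rms error={error:.4f}")
```

Because each retained component removes its share of the error, the root-mean-square error can only fall as `keep` grows, and it reaches zero (up to rounding) once every component is kept.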

So let us assume for a moment that the sentiment detection methods all work perfectly. We know as a matter of fact that they don’t work perfectly (you know, like human beings), but this matter of imprecision is something we have already covered in a previous post where I showed that the three dictionary based methods tend to agree with each other and with the more sophisticated Stanford method.  So even though we know we are not getting every sentence’s sentiment just right, let’s pretend that we are, if only for a moment.

With that assumed, let us now recall the primary rationale for the Fourier transformation: to normalize the length of the x-axis.  As it happens, we can do that normalization (the cake) and also retain a great many more components than the 3 default components (eating it).  Figure 1 shows Joyce’s Portrait of the Artist transformed using a low pass filter size of 100.

This produces a graph with a lot more noise, but we have effectively eliminated any objectionable ringing. With the addition of a smoothing line (lowess function in R), what we see once again (ta da) is a beautiful, if rather less dramatic, example of Vonnegut’s Man in Hole! And this is precisely the goal, to reveal the plot shape latent in the noise. The smaller low-pass filter accentuates this effect, while the higher low-pass filter provides more information; both show the same essential shape.

Figure 1: Portrait with low pass at 100

Figure 2: Portrait with low pass at 3

Figure 3: Portrait with low pass at 20
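For readers curious about how a Fourier transformation can normalize the x-axis, here is a minimal Python sketch of the general idea: compute a handful of low-frequency components and then rebuild the curve on a fixed grid of 100 points, whatever the original length. It is an illustration under stated assumptions, not the package's code; `low_frequencies`, `normalized_shape`, and the two sine-wave "novels" are hypothetical.

```python
import cmath
import math


def low_frequencies(values, keep):
    """First `keep` DFT coefficients of a real-valued sequence, scaled by 1/n."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values)) / n
        for k in range(keep)
    ]


def normalized_shape(values, keep=3, points=100):
    """Rebuild the low-frequency curve on a fixed grid of `points` positions."""
    if not 1 <= keep <= len(values) // 2:
        raise ValueError("keep must be between 1 and half the input length")
    coeffs = low_frequencies(values, keep)
    shape = []
    for j in range(points):
        # Component 0 is the mean; every other component also stands in for
        # its mirror-image (conjugate) partner, hence the factor of 2.
        total = coeffs[0].real
        for k in range(1, keep):
            total += 2 * (coeffs[k] * cmath.exp(2j * cmath.pi * k * j / points)).real
        shape.append(total)
    return shape


if __name__ == "__main__":
    # Two hypothetical "novels" with the same underlying shape but different lengths.
    short_novel = [math.sin(2 * math.pi * t / 200) for t in range(200)]
    long_novel = [math.sin(2 * math.pi * t / 500) for t in range(500)]
    a = normalized_shape(short_novel)
    b = normalized_shape(long_novel)
    print(len(a), len(b), max(abs(x - y) for x, y in zip(a, b)))
```

Both outputs have exactly 100 values, so they can be compared point by point even though the inputs had 200 and 500 values.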

In the course of this research, I have hand examined the transformed shapes for several dozen novels.  The number of novels I have examined corresponds to the number that I feel I know well enough to assess (and also happen to possess in digital form).  These include such old and new favorites as:

  • Portrait of the Artist
  • Picture of Dorian Gray
  • Ulysses
  • Blood Meridian
  • Gone Girl
  • Finnegans Wake (nah, just kidding)
  • . . .
  • And many more.

As I noted in my previous post, the only way to determine the efficacy of this model is to see if it approximates reality.  We have now plotted Portrait of the Artist six ways to Sunday, and every time we have seen a version of the same man in hole shape.  I’ve read this book 20 times, I have taught this book a dozen times.  It is a man in hole plot.

In my (admittedly) anecdotal evaluations, I have continued to see convincing graphs, such as the one above (and the one below in figure 4).  I have found a few special books that don’t do very well, but that is a story you will have to wait for (spoiler alert, they are not works of satire or dark humor, but they are multi-plot novels involving parallel stories).

Still, I am open to the possibility that some confirmation bias is at work here. And this is why I wanted to release the package in the first place. I had hoped that putting the code on GitHub would entice others toward innovation within the code, but the unexpected criticism has been healthy too, and this conversation has certainly made me think of ways that the functions could be improved.

In retrospect, it may have been better to wait until the full paper was complete before distributing the code.  Most of the things we have covered in the last few weeks on this blog are things that get discussed in finer detail in the paper. Despite more details to come, I believe, as Dryden might say, that the last (plot) line is now sufficiently explicated.

Bonus Images:

dorian_100

Figure 4

In terms of basic shape, Figure 4 is remarkably similar to the more dramatized version seen in Figure 5 below. If you can’t see it, you aren’t reading enough Vonnegut.

dorian_3

Figure 5

[1] How’s that for some awkward passive voice? A few on Twitter have expressed some thoughts on my use of Annie’s first name in my earlier response. Regular readers of this blog will know that I am consistent in referring to people by their full names upon first mention and by their first names thereafter. Previous victims of my “house style” have included David Mimno, David; Dana Mackenzie, Dana; Ben Schmidt, Ben; Franco Moretti, Franco; and Julia Flanders, Julia. There are probably others.

[2] Anyone losing sleep over this Gaussian filter business is welcome to grab the code and give it a whirl.

[3] In the essay I am writing about this work, I address a number of the nuances that I have skipped over in these blog posts.  One of the nuances I discuss is an automated process for the selection of a low-pass filter size.