## A Critique of Function Word Adjacency Networks

My article 'Authorship Attribution for Early Modern Plays using Function Word Adjacency Networks: A Critical View',
published in *American Notes and Queries* in 2018, sets out my main arguments against the function word adjacency networks method. Using that method, the New Oxford Shakespeare editors had divided the *Henry VI* trilogy between Shakespeare and Marlowe (Segarra et al 2016). I argued in my article that the method is badly defined and that its claimed success is illusory and not to be relied on.

The mathematical definition of the method has been given by its inventors in several articles in the last few years. I cited the latest definition, in Eisen et al 2018. However, to make my article suitable for non-specialist readers, I included in it only my non-technical objections. I omitted the more mathematical parts of my arguments against the method. These are given below, under separate headings.

References below to formulae are to the ones given in Eisen et al 2018 and I have assumed that the reader has read section 2 of that article, which is where the formulae are given.

### Formula 1 conflates frequency and proximity

The method quite correctly aims to give less weight to function words that are far apart than to those that are close together. However, its formula 1 cannot distinguish a text in which a pair of function words occurs only a few times but close together from a text in which the pair occurs often but further apart. The formula is liable to give the same answer, or very similar answers, for both texts, making it impossible for the subsequent steps in the method to tell them apart.
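To make the objection concrete, here is a minimal sketch in Python. It uses a hypothetical exponentially decaying weight (`ALPHA ** (d - 1)` for a co-occurrence at distance `d`) as a stand-in for formula 1; the names and the value of `ALPHA` are my own illustration, not the published formula, which is given in Eisen et al 2018. The point is that one close co-occurrence and several distant ones can produce identical scores:

```python
# Hypothetical decaying weight standing in for formula 1:
# each co-occurrence at distance d contributes ALPHA ** (d - 1).
ALPHA = 0.5

def similarity(distances):
    """Sum of decayed weights over all co-occurrences of a word pair."""
    return sum(ALPHA ** (d - 1) for d in distances)

few_but_close = [1]       # one adjacent co-occurrence
many_but_far = [2, 2]     # two co-occurrences, each one word apart

print(similarity(few_but_close))  # 1.0
print(similarity(many_but_far))   # 1.0 -- identical; the two cases collapse
```

Once the two cases yield the same number, nothing downstream in the method can recover the difference between them.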

This defect was apparently not understood when the method was invented, since it is not mentioned in the first published definition of it (Segarra et al 2013). It was understood later, for Segarra et al 2015 contains the following admission: "Notice that [formula 1] combines into one similarity number the frequency of co-appearance of two words and the distance between these two words in each appearance, making both effects *indistinguishable*" (Segarra et al 2015, 5466; my emphasis).

It is important that readers, especially scholars who may be thinking of using the method in their own research, be alerted to this defect and, therefore, it was wrong of Eisen et al 2018 to omit the admission that Segarra et al 2015 had made, especially as the 2015 article is found in a journal that is unlikely to be read by humanities scholars.

### Normalization that ignores text size

In analytical work that seeks to compare data sets of differing lengths, we need to ensure that we compare like with like. For example, suppose a certain collocation of words occurs the same number of times in each of two texts. If one text is ten times as long as the other, then our intuition tells us that it is misleading to say that both texts use the collocation equally. Conversely, suppose that the text which is ten times as long also uses the collocation ten times as often as the other text. In this case, although the collocation occurs much more often in one text than in the other, we may fairly say that both texts use it equally. If we do not have samples of approximately equal size, we might compensate for the disparity in size by dividing the number of occurrences of the collocation by the number of words in the corresponding text, in effect turning the raw numbers into proportions or percentages, so it becomes possible to compare them fairly.
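The familiar compensation described above can be sketched in a few lines of Python (the function name and the per-thousand scaling are my own choices for illustration):

```python
def rate_per_thousand(occurrences, text_length):
    """Normalize a raw count to occurrences per 1,000 words of text."""
    return 1000 * occurrences / text_length

# Same raw count, texts of very different lengths:
print(rate_per_thousand(5, 1_000))   # 5.0 per thousand words
print(rate_per_thousand(5, 10_000))  # 0.5 -- ten times rarer, as intuition says
```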

What this method does instead is to divide each adjacency distance from a word by the sum of all the distances from that word. It calls this normalization and it is defined by its formula 3. Put like that, it sounds reasonable, but some further thought reveals the problem with it. For example, suppose one text gives us the distances {2, 3} because it is a very short text. The method divides each number by the sum, in this case 5, to obtain {0.4, 0.6}. If another, much larger, text gives us the distances {200, 300} then the method divides each number by 500 and again obtains {0.4, 0.6}. As soon as the normalization is performed, all knowledge about the sizes of the texts being tested is obliterated, and it can therefore play no part in subsequent steps. A method like this, which simply disregards the sizes of the texts it is attributing, is liable to go wrong by treating small amounts of evidence as being on a par with large amounts.
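The worked example above can be reproduced directly (a minimal sketch; the function name is my own):

```python
def normalize(distances):
    """Divide each distance by the sum of all of them, as formula 3 does."""
    total = sum(distances)
    return [d / total for d in distances]

print(normalize([2, 3]))      # [0.4, 0.6]
print(normalize([200, 300]))  # [0.4, 0.6] -- the scale information is gone
```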

### Evidence excluded, evidence invented

My ANQ article already explains that, in its entropy calculation, the method disregards pairs of function words that are present in some texts but absent from others. I showed that, in the case of Shakespeare and Marlowe, this unwise decision leads to about 28% of the evidence -- *ex hypothesi* the most important evidence -- being excluded. What I explain below is how, in its normalization procedure, the method also does the opposite: it deems pairs of words to be present that are in fact absent.

The normalization formula (formula 3) breaks down if some function word happens not to be followed by any of the function words being searched for within the ten-word windows that the method examines. That is because the denominator is then zero, and it is impossible to divide by zero. To get around this problem, the method deems that, in such a case, the function word *is followed by every other function word in equal proportion!* (Eisen et al 2018, 502). The inventors make no attempt to explain why this is a reasonable thing to do, when we know even from casual observation that in no text do words follow other words in equal proportion.
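A sketch of that rule as the source describes it, assuming one function word's outgoing adjacency weights are held in a list (the representation and names are my own):

```python
def normalize_row(row):
    """Row-normalize one function word's outgoing adjacency weights.

    When the word is never followed by any tracked function word, the
    row sums to zero; the rule described in Eisen et al 2018 (502) then
    assigns every transition an equal share -- the questionable step.
    """
    total = sum(row)
    if total == 0:
        return [1 / len(row)] * len(row)  # deemed uniform, not observed
    return [w / total for w in row]

print(normalize_row([3, 1, 0]))  # [0.75, 0.25, 0.0]
print(normalize_row([0, 0, 0]))  # three equal shares of 1/3, never observed in the text
```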

I also want to draw attention to the fact that the inventors of this method were originally candid in admitting to the fault that led them to exclude all the evidence that would cause their entropy calculation, given in formula 7, to fail because of an attempted division by zero. Segarra et al 2015 told us correctly: "This is undesirable because the often [sic] appearance of this transition in the text network P1 is a strong indication that this text was not written by the author whose profile network is P2" (Segarra et al 2015, 5467). That was a technical way of saying what is intuitively obvious, that if one function word is often followed by another in one text but never in the other, the explanation might be that they were written by different authors. Regrettably, Eisen et al 2018 omits this correct explanation and substitutes an incorrect one in its place, by saying that the purpose of the rule that excludes the evidence is to avoid "potential biasing for smaller profiles" (Eisen et al 2018, 503). This is nonsense, because the exclusion of the evidence takes place even when the texts being tested are large. The exclusion is triggered not by the text being too small, but by its not containing function words that the other text does contain. As I have shown with the examples of Shakespeare and Marlowe, even large canons have some function words that never follow some other function words. The point about excluding this evidence in order to avoid a bias because of small samples is thus seen to be false. The original, correct, explanation in Segarra et al 2015 should have been disclosed to readers of Eisen et al 2018.
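The failure mode that triggers the exclusion can be sketched as follows. This is an illustration of the general relative-entropy idea, not the published formula 7: when a transition has positive probability in the text network but zero probability in the profile network, the corresponding term is undefined, and it is exactly this evidence that the method throws away:

```python
import math

def relative_entropy_terms(p1, p2):
    """Per-transition terms of a relative-entropy comparison between a
    text network p1 and a profile network p2 (a sketch of the idea
    behind formula 7, not the published formula itself)."""
    terms = []
    for a, b in zip(p1, p2):
        if a == 0:
            terms.append(0.0)   # by convention, 0 * log(0/b) counts as 0
        elif b == 0:
            terms.append(None)  # log(a/0) is undefined: this evidence is dropped
        else:
            terms.append(a * math.log(a / b))
    return terms

# A transition common in the text but absent from the profile produces
# exactly the undefined case that gets excluded:
print(relative_entropy_terms([0.5, 0.5], [0.0, 1.0]))
```

Note that the `None` arises whenever the profile lacks a transition the text has, however large both texts are -- which is the point made above against the "smaller profiles" explanation.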

### The unproven Markov chain assumption

In order to calculate entropy values, which is what the method needs in order to make authorship attributions, the word adjacency network it has constructed must be a Markov chain: if it is not, then the entropy formula cannot be used, and the method fails before it gets to the authorship attribution stage. It was therefore essential for the inventors to prove that their networks are in fact Markov chains. They make no attempt to do this. Instead, they simply say that their networks "can be interpreted" as Markov chains (Eisen et al 2018, 502) and carry on from there. The reader should look up a definition of Markov chains, for example on Wikipedia, to understand just how surreal it is to treat a play as a Markov chain.

To demonstrate that the networks are in fact Markov chains, the inventors had to show that their normalization calculations had given them a probability distribution, since that is a condition for a Markov chain. But they simply assert it, instead of proving it, and I showed in my ANQ article, with examples like *devoid of*, how far from obvious it is that the networks are the probability distributions they need to be. What's worse, the way that their normalization formula (formula 3) has calculated the values used in subsequent steps -- by dividing each value by the total of all of them -- guarantees that their data always *looks* like a probability distribution, because all the numbers are between 0 and 1 and they add up to 1. Even if they were to pick adjacency numbers out of a hat, instead of measuring them from the play texts, their formula 3 would turn them into what *looks* like a probability distribution, allowing them to get away with claiming that the network "can be interpreted" as a Markov chain.
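The point that formula 3's output always looks like a probability distribution, whatever goes in, can be sketched as follows (numbers drawn at random rather than measured from any text; the names are my own):

```python
import random

def normalize(values):
    """Divide each value by the total of all of them, as formula 3 does."""
    total = sum(values)
    return [v / total for v in values]

# Numbers "picked out of a hat" rather than measured from any play:
arbitrary = [random.uniform(0.1, 10) for _ in range(8)]
normalized = normalize(arbitrary)

print(all(0 <= v <= 1 for v in normalized))  # True
print(abs(sum(normalized) - 1.0) < 1e-9)     # True -- looks like a distribution regardless
```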

The complete absence of proof, or even a plausible argument, means that it is invalid to assume that the word adjacency networks are Markov chains, and the method therefore collapses before it gets to the attribution stage.

### References

**Eisen, M, Ribeiro, A, Segarra, S, and Egan, G**. (2018). Stylometric Analysis of Early Modern Period English Plays, *Digital Scholarship in the Humanities*, 500-528.

**Segarra, S, Eisen, M, Egan, G, and Ribeiro, A**. (2016). Attributing the Authorship of the Henry VI Plays by Word Adjacency, *Shakespeare Quarterly*, 67: 232-56.

**Segarra, S, Eisen, M, and Ribeiro, A**. (2015). Authorship Attribution Through Function Word Adjacency Networks, *Institute of Electrical and Electronics Engineers (IEEE) Transactions on Signal Processing*, 63.20: 5464-78.

**Segarra, S, Eisen, M, and Ribeiro, A**. (2013). Authorship Attribution Using Function Words Adjacency Networks [accessed 9 February 2019].