Home | ## The Counting of N-grams |

There is more than one way to count matching N-grams. On this site, I have provided counts based on two methods that I consider to be reasonable. It is essential to understand these methods, to avoid the risk of misusing the counts and obtaining invalid results.

Consider two texts that share the phrase ABCD, where A, B, C and D are four words. We ask: how many 1-grams, 2-grams, 3-grams and 4-grams do these texts have in common? The answers are easy to work out:

- Four 1-gram matches: A, B, C, D
- Three 2-gram matches: AB, BC, CD
- Two 3-gram matches: ABC, BCD
- One 4-gram match: ABCD

We thus see that there are ten N-gram matches in total. I shall call these formal N-gram matches.

It would be absurd for a researcher to list all ten N-gram matches here, because they are nested within each other. The custom is to list only the longest match, which is the 4-gram ABCD. I shall call this a maximal N-gram match. A maximal match is also a formal match, but the reverse is not true.

I have provided lists of N-gram matches for every play. These are of course maximal matches, as is customary. The lists are accompanied by counts, which are therefore counts of maximal matches.

Independently of that, I have provided counts of formal N-gram matches, for use in computational stylistic tests. As I explain below, it makes more sense to count formal matches for this purpose.

In some cases, the decision is dictated by the nature of the work. If our aim is to list N-gram matches, perhaps for qualitative analysis, then maximal matches, and therefore counts of maximal matches, are appropriate. On the other hand, if we are looking at, say, 4-grams, in isolation to other N-grams, then of course we must use formal matches. In some other cases, there may be a genuine choice whether to count formal or maximal matches, as the following examples illustrate.

Suppose we are using maximal matches. We are comparing many plays, perhaps to perform authorship attribution. Suppose we compare the plays by asking how many 3-grams they have in common with each other. We would have to report that the two texts in the ABCD example above do not have any 3-grams in common (because the 3-gram they do share is nested inside a 4-gram and is therefore not maximal). This is clearly unsatisfactory, because it causes a match between the two texts to be missed.

A more satisfactory alternative is to count maximal 3-grams-and-above; here, that would allow us to report one match between the two texts, because the maximal 3-grams-and-above count includes the maximal 4-gram. From this argument we see that it can seldom or never make sense to use maximal N-gram match counts singly. We must use counts of maximal 1-grams-and-above, maximal 2-grams-and-above, maximal 3-grams-and-above, and so on. That is what the Summary spreadsheets that go with my lists of maximal N-gram matches contain (except for 1-grams, for which I did not list maximal matches at all).

That is not the end of the matter. Suppose we had another pair of texts, containing the phrases ABCD and ABCE respectively. For these, the formal N-gram matches are as below:

- Three 1-gram matches: A, B, C
- Two 2-gram matches: AB, BC
- One 3-gram matches: ABC
- Zero 4-gram matches

There is just one maximal 3-gram here, and no 4-grams. If we were using the maximal 3-grams-and-above test, we would have to report one match between these texts. Thus, we see that we report one maximal 3-grams-and-above match for the earlier texts, with ABCD and ABCD, but also one match for the texts with ABCD and ABCE. That is not satisfactory either: the texts are different, and a good test should be able to tell them apart.

Counts of formal N-gram matches do not suffer from the problem illustrated above. For the pair of texts both containing ABCD, there are two formal 3-gram matches, while for the pair of texts containing ABCD and ABCE, we have just one formal 3-gram match. The pairs of texts look different, and our formal 3-gram test duly reports them as different.

Formal N-gram counts do better in the above example because, unlike maximal N-gram counts, they have a built-in cascading effect. If, for example, we are counting maximal 3-grams-and-above, then a maximal 3-gram counts as one, as does a maximal 4-gram, as does a maximal 5-gram, and so on. By contrast, a formal 4-gram contributes two to the formal 3-gram count, a formal 5-gram contributes three to the formal 3-gram count, and so on. This means that a formal N-gram contains information not just about N-grams but also information about (N+1)-grams, (N+2)-grams, and so on. Moreover, it gives a higher weighting to higher order N-grams, rather than treating them all equally, as maximal N-gram counts do.

A further argument in favour of formal N-gram matches is that they are easier to find than maximal matches. If two texts have N successive words in common then that is a formal N-gram match, without further ado. But is it also a maximal N-gram match? We cannot decide without looking at (N+1)-gram matches: if there is an (N+1)-gram match containing the N successive words, then we do not have a maximal N-gram match; otherwise, we do.

My conclusion from the above discussion is that formal N-gram counts should (continue to) be used for computational stylistics.

The Summary spreadsheets that accompany my lists of N-gram matches contain counts of the matches listed. Since the matches listed are maximal, it means that those Summaries contain counts of maximal N-gram matches. Listing matches means listing the tokens, so these counts are counts of tokens, not types.

With formal N-grams, I have provided separate sets of counts, one set produced by counting tokens, the other produced by counting types.

If types make better authorial markers than tokens, then of course we need to use them. But unless that is the case, it makes more sense to use token-based counts, for one very important reason. Token-based counts are *scalable*. What does that mean?

Consider an example. There are 17,778 tokens in *1 Tamburlaine* and 18,158 tokens in *2 Tamburlaine*. How many tokens are there in the two plays together? The answer is easy to work out by addition: it is 35,936. Now, there are 2,422 types in *1 Tamburlaine* and 2,474 types in *2 Tamburlaine*. It would be wildly wrong to add these numbers together and say that there are 4,896 types in the two plays together. That is because the plays have many words in common, which are counted separately as tokens but just once as types. The actual number of types in the two plays is 3,420. It cannot be derived from any other numbers, by addition or any other method, and it must be worked out separately.

If we want to do authorship attribution using large amounts of data, perhaps testing many alternative attributions, then our methods must be scalable. That is, it must be possible for us to scale up from using a handful of numbers to using hundreds, thousands or even millions of numbers. Counts of tokens make this possible; counts of types do not. If we have counted tokens for a large set of plays, or parts of plays, then we can treat the counts as building blocks. We can assemble them in different ways, according to the authorship theories we are testing, simply by adding the right combinations of counts, without having to go back to the texts and make new counts every time. That is why, in tests involving counts, we should always use tokens, unless there is a compelling reason to use types.

My database contains 527 plays, consisting of 9,577,660 words, including stage directions but excluding speech prefixes (of which there are 347,547). A reader of the Summary spreadsheets I have provided, particularly those for formal N-gram matches, may be astonished at the very large numbers of matching N-grams. When a play contains only few tens of thousand words, how can there be tens of millions of matching N-grams for it?

The answer lies in the effects of multiplication. Suppose a phrase occurs three times in one play, and four times in another play. Each of the three occurrences of the phrase in the first play matches with each of its four occurrences in the other play, so we have twelve N-gram matches. When counting formal N-gram matches, I have not excluded any word or phrase from consideration, as I did with maximal matches. When we consider that very common phrases like *and the* and *of the* occur many times in every play, we can understand how the counts become so large. In an extreme case, if *the* occurs a thousand times in each of two plays, it alone contributes a million to the count of formal 1-gram matches between those plays.

The phenomenon is amplified by the fact that my N-gram searches use lemmatized forms of words, not the words themselves. This is the right thing to do, because it allows, for example, singular words to be matched with their plurals. If the searches were not lemma-based, we would miss many thousands of important N-gram matches which ought to be counted. Doing lemma-based searches allows more matches to be found; however, a side-effect is that it inflates counts of 1-gram and, to a lesser extent, 2-gram matches to levels even larger than they would be otherwise.