The following 151 words are regarded as common words for the purpose of listing N-gram and collocation matches. A bigram is excluded from the published lists if either of its words is among them. A trigram is excluded if it contains two or more of them. Tetragrams and longer N-grams are always listed. Similarly, a collocation is listed only if it contains at least two words that are not common words.
'tis a about after against all am an and another any are as at away bar be because before both but by can close come could dare did do down enough enter every for from given go good had hath have he hence her here him his how i i'll if in into is it know let like little lord love make man many may me might more most much must my need neither never next no none nor not nothing now o of off on once one or other our out over part past see shall she should since sir so some such take than that the thee their them then there therefore these they this those thou though through thy till to too until unto up upon us was we well were what when where which while who whom whose why will with within without would yet you your
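The exclusion rules can be sketched in Python as follows. This is only a minimal illustration, not the code used to produce the published lists: the function names are hypothetical, the input is assumed to be a sequence of already-lemmatized words, and only a small excerpt of the common-word list is used. Treating single words as never listed is also an assumption, since the rules above say nothing about them.

```python
# Excerpt of the common-word list, for illustration only (the full list has 151 entries).
COMMON = {"the", "of", "and", "my", "lord", "good", "to", "be", "i", "love"}


def count_common(words, common=COMMON):
    """Count how many of the given words are common words."""
    return sum(1 for w in words if w.lower() in common)


def ngram_is_listed(words, common=COMMON):
    """Apply the N-gram rules to a sequence of lemmatized words (hypothetical helper)."""
    n = len(words)
    k = count_common(words, common)
    if n == 2:
        return k == 0   # a bigram is excluded if either word is common
    if n == 3:
        return k < 2    # a trigram is excluded if two or more words are common
    return n >= 4       # tetragrams and above are always listed;
                        # single words are assumed never to be listed


def collocation_is_listed(words, common=COMMON):
    """A collocation needs at least two words that are not common words."""
    return len(words) - count_common(words, common) >= 2
```

Under these assumptions, `ngram_is_listed(("kind", "heart"))` is true, while `ngram_is_listed(("good", "heart"))` is false because 'good' is a common word.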
All matches are based on the lemmatized forms of words, rather than the words themselves; for example, 'kind hearts' is matched with 'kind-hearted'. Consistent with that, every word that shares a lemma with one of the above words is also treated as a common word. For example, although only 'O' is listed above, 'Oh' is also treated as a common word, since the two share the same lemma.
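The lemma-based test for common words might look like the sketch below. The tiny lemma table is an assumption made for illustration; it stands in for whatever lemmatizer is actually used, and the names `LEMMAS`, `lemma` and `is_common` are hypothetical.

```python
# Toy lemma table (an assumption; a real lemmatizer would take its place).
LEMMAS = {
    "o": "o",
    "oh": "o",
    "give": "give",
    "given": "give",
}


def lemma(word):
    """Return the lemma of a word, falling back to the lower-cased word itself."""
    w = word.lower()
    return LEMMAS.get(w, w)


def is_common(word, common_words):
    """A word is common if it shares a lemma with any word on the common-word list."""
    common_lemmas = {lemma(w) for w in common_words}
    return lemma(word) in common_lemmas
```

With `common_words = {"o", "given"}`, both `is_common("Oh", common_words)` and `is_common("give", common_words)` are true, even though neither 'Oh' nor 'give' appears on the list itself.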
My published web pages giving lists of matches state the number of common words as 154. That was an error: unaccountably, 'an' was listed four times instead of once in the list I originally made.
Why 151 (or 154) words?
The original list of common words consisted of the one hundred most common words in my database of plays. I found that this was not enough: too many very common bigrams and trigrams were being admitted to the published lists, making the files even larger than they are now, and hard to navigate. I had to increase the number of common words on the list. I therefore merged my list with the separate list of one hundred function words given in the following paper: Segarra et al., 'Attributing the Authorship of the Henry VI Plays by Word Adjacency', Shakespeare Quarterly, vol. 67, no. 2 (Summer 2016), 232-256. As the two lists already had many words in common, the merged list contains 151 words. Segarra et al.'s list is, as I now realise, not the best one to use. For example, it contains 'given' but not 'give'; however, as both words have the same lemma, 'give' is also treated as a common word.