Collocations and N-grams

Introduction

This page provides downloadable results from fully automated searches for matching N-grams and collocations among early modern plays, carried out with programs I wrote myself.

The plays I have searched cover English drama written in the years 1552 to 1657. I searched 527 plays (the Additions to The Spanish Tragedy being counted as a separate little play).

My searches were done using modern-spelling lemmatized texts. This allows for the widest possible discovery of matches; for example, kind hearts is matched with kind-hearted. My plays were taken from EarlyPrint and the Folger Digital Texts website, and I am grateful to both for providing free access to their texts. All my files are also freely provided here, under the Creative Commons Attribution-NonCommercial license.

I have provided two very large sets of files to download. One set contains lists of matches and will mainly be of interest to researchers doing qualitative analysis, for example reviewing all N-gram matches between two plays. The other set gives only counts of all N-gram matches between all plays, for values of N from 1 to 10, and will therefore be of interest to researchers doing computational stylistics. Each set is self-contained and self-consistent, and the two are also consistent with each other, being produced from the same texts.

If you intend to make any serious use of the counts I have provided, then please first read my short article The Counting of N-grams.

Lists of Collocation and N-gram Matches

For each play, search results for both N-grams and collocations are provided in three formats. (i) For casual browsing and qualitative analysis, an HTML page is provided, giving what my search program considers to be the best few thousand matches for that play. (ii) A CSV file is provided containing the full set of matches. This may be opened in Excel or other tools and used for quantitative analysis. (iii) A summary is provided, also as a CSV file, giving the number of matches with each other play.

Listings of N-gram matches are complete for 4-grams and above. However, there are far too many 1-gram, 2-gram and 3-gram matches to list: they run to hundreds of millions. I have therefore disregarded 1-grams entirely and listed 2-grams and 3-grams only if they contain at least two words not on the list of the most common words used in these plays. That list contains mainly function words such as the, and, of, to, and so on. Similarly, collocation search results are listed only if they contain at least two words that are not on the common words list.
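The listing rule above can be sketched in a few lines of Python. This is only an illustration, not my search program: the COMMON_WORDS set is a tiny hypothetical stand-in for the real common words list, the function names are invented for the example, and I assume here that "at least two words" counts word positions rather than distinct words.

```python
# Hypothetical stand-in for the common words list; the real list is
# much longer and is not reproduced here.
COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "i", "you", "my", "is"}


def count_uncommon(words):
    """Count the words (by position) that are not on the common words list."""
    return sum(1 for w in words if w.lower() not in COMMON_WORDS)


def is_listed(ngram):
    """Apply the listing rule to one N-gram given as a tuple of lemmas.

    4-grams and above are always listed, 1-grams never, and 2-grams and
    3-grams only if at least two of their words are uncommon.
    """
    n = len(ngram)
    if n >= 4:
        return True
    if n == 1:
        return False
    return count_uncommon(ngram) >= 2


candidates = [("kind", "heart"), ("of", "the"), ("the", "kind", "heart"),
              ("kind", "of", "the"), ("heart",), ("to", "be", "or", "not")]
listed = [g for g in candidates if is_listed(g)]
```

With this toy list, ("kind", "heart") and ("the", "kind", "heart") are kept, while ("of", "the") and ("kind", "of", "the") are dropped. The same test of two uncommon words would apply to a collocation, treated as a set of words rather than a contiguous sequence.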

Counts of N-gram Matches

For every play, I have provided counts of all its N-gram matches with every other play, for values of N from 1 to 10. The N-grams cover all words, except those in speech prefixes, but with no other exclusions, not even of very common words. These counts are provided separately for each of the following three categories:

All N-gram matches.
Unique matches; that is, N-grams found in just two plays.
Function word skip N-gram matches. These are N-grams found by skipping over any word that is not a function word.
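To make the second and third categories concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the code that produced the counts: FUNCTION_WORDS is a small hypothetical set, the function names are invented, and I assume a skip N-gram is formed by deleting every non-function word and then taking ordinary N-grams over what remains.

```python
# Hypothetical function word list, for illustration only.
FUNCTION_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it"}


def ngrams(words, n):
    """Return all contiguous N-grams in a list of words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def function_word_skip_ngrams(words, n):
    """Skip every word that is not a function word, then form N-grams."""
    kept = [w for w in words if w in FUNCTION_WORDS]
    return ngrams(kept, n)


def unique_matches(plays, a, b, n):
    """N-grams shared by plays a and b and found in no other play.

    `plays` maps a play name to its list of words.
    """
    sets = {name: set(ngrams(words, n)) for name, words in plays.items()}
    shared = sets[a] & sets[b]
    for name, grams in sets.items():
        if name not in (a, b):
            shared -= grams
    return shared


line = "the king of the north and the queen".split()
skips = function_word_skip_ngrams(line, 2)

plays = {
    "A": "kind heart and true".split(),
    "B": "a kind heart indeed".split(),
    "C": "heart and soul".split(),
}
only_ab = unique_matches(plays, "A", "B", 2)
```

In the toy corpus, the 2-gram kind heart occurs only in plays A and B, so it is a unique match for that pair; heart and is shared by A and C instead.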

Moreover, I have performed each count twice, once for tokens and once for types.
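The difference between the two counts can be shown with a short Python sketch. The type count below is simply the number of distinct N-grams the two plays share. For tokens, the sketch assumes one possible convention, namely that each shared N-gram contributes the smaller of its two occurrence counts; this is an assumption made for illustration, and the function names are invented for the example.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count how often each contiguous N-gram occurs in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def match_counts(words_a, words_b, n):
    """Return (types, tokens) for the N-gram matches between two texts.

    types:  number of distinct N-grams found in both texts.
    tokens: sum over shared N-grams of the smaller occurrence count
            (an assumed convention, used here only for illustration).
    """
    a = ngram_counts(words_a, n)
    b = ngram_counts(words_b, n)
    shared = a.keys() & b.keys()
    types = len(shared)
    tokens = sum(min(a[g], b[g]) for g in shared)
    return types, tokens


text_a = "my lord my lord i go".split()
text_b = "my lord i stay my lord my lord".split()
types, tokens = match_counts(text_a, text_b, 2)
```

Here the two texts share three distinct 2-grams (my lord, lord my, lord i), so the type count is 3. Because my lord occurs twice in the first text and three times in the second, it contributes 2 under the assumed convention, giving a token count of 4.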

I have done some experiments with the above data, to give an example of how it can be further processed.

Download Search Results

You can do your own online search for N-gram matches.

If you want to run your own SQL queries on this data, you can do it here without needing to log in.

If you want to create your own copy of the database of plays, go to the Database folder and read the README file.