Questions to Ask

Every computational, quantitative, or "non-traditional" authorship attribution study must be considered on its own merits. Nonetheless, there are several fundamental issues common to all studies of this kind. Framed around these issues, this page offers a series of questions one might ask when evaluating scholarship.

This page should be read in conjunction with the Glossary of Terms.

Material presented here was originally prepared by Brett Greatley-Hirsch for an earlier version of his chapter, "Computational Studies", in The Arden Research Handbook of Contemporary Shakespeare Criticism (2020).

Is the data source accurate and reliable?

Data provenance is a crucial – and an all-too-often overlooked – aspect of authorship attribution study (and quantitative criticism more broadly), because the accuracy and reliability of data are essential prerequisites for the analysis to have any validity. For example, an attribution study of early modern drama in which data for Shakespeare's plays derives from modern edited texts but the data for other playwrights comes from unedited sources proceeds on questionable grounds. These conditions also apply to metadata. In the previous example, the associated metadata might include details about each play's authorship, date, and genre; errors in any of these fields could render the analysis inaccurate.

Is the data fit for purpose?

To be persuasive, arguments must be logically sound and supported by relevant evidence, and computational studies are not exempt from this standard. A stylometric analysis of patterns in early modern play-texts, for example, should not unwittingly include samples of non-dramatic verse and prose in the data. Fitness for purpose also means ensuring that the data, if not comprehensive, is a sufficiently representative sample from which to draw statistically significant results. For example, data from a small selection of Ben Jonson comedies is an insufficient sample from which to make statistically significant observations about the whole canon of early modern drama.

What features are selected?

The range of features that scholars have analyzed in authorship attribution study is impressive – characters, words, phrases, n-grams, punctuation, contractions, oaths, word-length, vocabulary size, spelling, collocates, syntactic forms, line-length, sentence-length, pause patterns, rhymes, distances between classes of words – and the list continues to grow. What is counted, however, is often a more complicated matter than first meets the eye. Take the word as a feature, for instance: there is significant disparity between the various figures cited for Shakespeare’s vocabulary. This fact alone is not, as Jeffrey Kahan wrongly concludes, evidence that "scholars could not even count words properly" (2015: 829); rather, it confirms the effects of sourcing data from different texts and adopting different criteria for what constitutes a word.

Words may be counted severally as unique forms ("types") or concrete instances of those forms ("tokens"), to which additional criteria apply. Are words counted as lemmas (dictionary-type headwords, such as counting shaking, shaken and shook as instances of the lemma shake)? Are homographs (words with shared spelling but different meaning or grammatical function, such as the noun and verb forms of will) counted separately? Are contractions expanded or retained (such as treating she’ll as a single form or as instances of she and will respectively)? Are compound words separated or retained? Are variants in orthography and spelling retained or normalized (treating tother and t’other as distinct forms, for instance, or murther and murder as the same word)? There are further questions of inclusion and exclusion. For example, an attempt to quantify a playwright's vocabulary might limit the analysis to English words, thereby excluding any foreign-language passages. Likewise, an authorship attribution study might exclude from the analysis all editorial insertions, paratextual matter, and other elements of uncertain authorial status, such as stage directions, speech prefixes, and preliminaries. Other analyses may focus exclusively on words of a particular class (such as function or content words), syntactic form (such as personal pronouns), or frequency range (such as common or rare words).
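The effect of these criteria on type and token counts can be illustrated with a minimal Python sketch. The lookup tables, the `count_words` function, and the sample sentence are all hypothetical and invented for illustration; a real study would rely on a full lexicon and carefully prepared texts.

```python
import re
from collections import Counter

# Hypothetical, hand-made lookup tables (assumptions for illustration only).
CONTRACTIONS = {"she'll": ["she", "will"]}
SPELLING = {"murther": "murder"}
LEMMAS = {"shaking": "shake", "shaken": "shake", "shook": "shake"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)

def count_words(text, normalize=False, expand=False, lemmatize=False):
    """Count word forms under the chosen criteria; returns a Counter of forms."""
    forms = []
    for token in tokenize(text):
        if normalize:
            token = SPELLING.get(token, token)
        parts = CONTRACTIONS.get(token, [token]) if expand else [token]
        for part in parts:
            if lemmatize:
                part = LEMMAS.get(part, part)
            forms.append(part)
    return Counter(forms)

# An invented sentence, not a quotation from any play.
sample = "She'll murther him; the murder shook her, shaken and shaking."

for options in (
    {},
    {"normalize": True},
    {"normalize": True, "expand": True, "lemmatize": True},
):
    counts = count_words(sample, **options)
    print(options, "tokens:", sum(counts.values()), "types:", len(counts))
```

The same sentence yields different totals of tokens and types depending on which criteria are switched on, which is why a study should state its counting rules explicitly.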

The data source should be prepared in a way that supports accurate counting of the selected features. In the case of texts, this typically involves textual encoding or markup – appending annotations or tags to classify units of text into categories, such as structural elements (dialogue, stage direction, speech prefixes), linguistic features (grammatical part of speech, syntactic function, normalized spelling, regularization, lemmata), and so on. Without textual encoding to provide explicit instructions, a computer will generally treat documents as undifferentiated sequences of alphanumeric characters and whitespace. Some tags (such as part-of-speech) can be added algorithmically, but these procedures are designed primarily for use with modern text (and correspondingly modern usages of grammar, vocabulary, and syntax). The orthography and spelling of early modern English text present an additional pre-processing challenge. Although software to fully or partly automate the process of normalizing spelling in early modern English text is available (such as VARD and MorphAdorner), the results may not be sufficiently accurate without extensive human intervention. Thus, significant pre-processing of data is often required: proof-reading transcriptions, adding textual encoding, normalizing or regularizing features, checking the accuracy of these processes, and so on. Given the heavy costs of time and effort involved, scholars may be tempted to cut corners by using existing datasets indiscriminately and without further preparation or customization to properly fit their investigations.
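As a rough illustration of how markup supports selective counting, the following Python sketch extracts only the dialogue from a small encoded fragment. The fragment is invented, and its element names (`scene`, `sp`, `speaker`, `l`, `stage`) are a simplified assumption loosely modelled on common play-text encoding practice; the `dialogue_text` helper is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# An invented, simplified fragment (assumption: these tag names mark
# speeches, speech prefixes, verse lines, and stage directions).
SAMPLE = """<scene>
  <stage>Enter two Gentlemen</stage>
  <sp><speaker>First</speaker><l>Good morrow, sir.</l></sp>
  <sp><speaker>Second</speaker><l>Well met. <stage>Aside</stage> I go to the hall.</l></sp>
</scene>"""

def dialogue_text(element, excluded=("stage", "speaker")):
    """Return the text inside an element, skipping the content of excluded
    tags while keeping any text that follows them in the parent element."""
    if element.tag in excluded:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(dialogue_text(child, excluded))
        parts.append(child.tail or "")  # text after the child belongs to the parent
    return "".join(parts)

root = ET.fromstring(SAMPLE)

# Without markup-aware handling, every string in the document is counted.
everything = " ".join("".join(root.itertext()).split())

# With it, stage directions and speech prefixes are left out.
dialogue = " ".join(dialogue_text(root).split())

print(everything)
print(dialogue)
```

Which elements belong in the excluded set is an editorial decision that depends on the research question, not a property of the encoding itself.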

How are the selected features counted?

How features are counted is just as important a consideration as their selection. For example, the asymmetry of canon formation renders total word-types a misleading basis for comparing Shakespeare's dramatic vocabulary with his contemporaries’, whereas calculating the rate at which each playwright adopts new words by treating each play as if it were a new work gives more meaningful results (Craig 2011; Elliott and Valenza 2011). In addition to raw counts and rates, features may also be counted as ratios and proportions – especially useful when comparing frequencies across unequally sized samples. Another important consideration when counting textual features is segmentation: in the case of plays, for example, features might be counted across the whole text, or in smaller, logical segments of unequal size (such as by act, scene, character) or smaller, arbitrary segments of equal size (such as overlapping or non-overlapping blocks of n words).
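Segmentation into equal-sized blocks and counting by proportion can be sketched in a few lines of Python. The `blocks` and `proportions` functions and the token list are hypothetical; the sketch also assumes that a trailing block shorter than the chosen size is simply dropped, which is only one of several possible conventions.

```python
from collections import Counter

def blocks(tokens, size, step=None):
    """Split tokens into blocks of `size` words. A `step` smaller than `size`
    gives overlapping blocks. A short trailing block is dropped (assumption)."""
    step = step or size
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]

def proportions(tokens, features):
    """Express each feature's count as a share of all tokens in the segment."""
    counts = Counter(tokens)
    return {feature: counts[feature] / len(tokens) for feature in features}

# Invented tokens, standing in for a tokenized play.
tokens = "the cat and the dog and the bird saw the fox".split()

non_overlapping = blocks(tokens, size=4)
overlapping = blocks(tokens, size=4, step=2)

for block in non_overlapping:
    print(block, proportions(block, ["the", "and"]))
```

Because each block has the same length, the proportions are directly comparable between blocks, which raw counts over whole plays of different lengths are not.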

How does the method process and/or manipulate the data?

Unless standard statistical or mathematical procedures are used, readers should expect investigators to describe their methods in sufficient detail or to provide appropriate citations where such details can be found. The description should make clear whether the machine-learning algorithms are "supervised" or "unsupervised" – that is, whether or not the algorithm is supplied with metadata and examples from which to "learn" how to discriminate the data. Many authorship attribution methods, for example, involve training an algorithm to build a classifier by selecting and counting features that discriminate between a set of sample texts whose authorship is made known to the algorithm. The classifier can then be tested and refined by treating texts of known authorship as if they were unknown before being used to predict the authorship of anonymous texts. (The process is much the same for classifying categories besides authorship, such as genre, mode, period, and so on.) While each method will manipulate data differently, there are some common operations, such as data reduction and visualization.
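The train-then-test workflow for a supervised method can be outlined with a deliberately simple Python sketch. It uses a nearest-centroid classifier over the proportions of a few marker words, chosen here only because it is short and easy to follow; it is not the method of any particular study. The feature list, the labels "A" and "B", and the sample texts are all invented.

```python
import math
from collections import Counter

FEATURES = ["the", "and", "of", "to"]  # hypothetical marker words

def profile(text):
    """Represent a text as the proportion of tokens taken up by each feature."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[feature] / len(tokens) for feature in FEATURES]

def train(samples):
    """Build one averaged profile (centroid) per known author label."""
    by_label = {}
    for label, text in samples:
        by_label.setdefault(label, []).append(profile(text))
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }

def predict(model, text):
    """Assign the label whose centroid lies closest to the text's profile."""
    vector = profile(text)
    return min(model, key=lambda label: math.dist(vector, model[label]))

def leave_one_out(samples):
    """Treat each known text in turn as if unknown; return the share
    of texts that the classifier assigns to their actual label."""
    hits = 0
    for i, (label, text) in enumerate(samples):
        model = train(samples[:i] + samples[i + 1:])
        hits += predict(model, text) == label
    return hits / len(samples)

# Invented training texts with invented author labels.
SAMPLES = [
    ("A", "the king of the north"),
    ("A", "the crown of the realm of men"),
    ("B", "come and go to war and peace"),
    ("B", "to sing and to dance and play"),
]

print("leave-one-out accuracy:", leave_one_out(SAMPLES))
print("prediction:", predict(train(SAMPLES), "the fall of the house"))
```

An unsupervised method would instead receive the profiles without any labels and group them by similarity alone. In either case, a reader should expect the testing step to be reported before any attribution of an anonymous text is offered.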

References

Craig, Hugh. (2011), ‘Shakespeare’s Vocabulary: Myth and Reality’, Shakespeare Quarterly, 62(1): 53–74.

Elliott, Ward E.Y. and Robert J. Valenza. (2011), ‘Shakespeare’s Vocabulary: Did It Dwarf All Others?’ in Mireille Ravassat and Jonathan Culpeper (eds), Stylistics and Shakespeare’s Language: Transdisciplinary Approaches, 34–57, London: Continuum.

Greatley-Hirsch, Brett. (2020), ‘Computational Studies’, in Evelyn Gajowski (ed), The Arden Research Handbook of Contemporary Shakespeare Criticism, 205–21, London: Arden Shakespeare.

Kahan, Jeffrey. (2015), ‘“I tell you what mine author says”: A Brief History of Stylometrics’, ELH, 82(3): 815–44.