Empirical evaluations of language-based author identification techniques

Authors

  • Carole E. Chaski, Institute for Linguistic Evidence, Inc.

DOI:

https://doi.org/10.1558/sll.2001.8.1.1

Keywords:

language-based author identification, document examination, author attribution, questioned document

Abstract

Recent court decisions in the United States call for the empirical testing of language-based author identification techniques. This article reports the results of such testing. The tested hypotheses include syntactic analysis, syntactically-classified punctuation, sentential complexity, vocabulary richness, readability, content analysis, spelling errors, punctuation errors, word form errors, and grammatical errors. These hypotheses are tested on a set of documents written by four women who are similar in age, educational level, and dialectal background; two of the women are Euro-American and two are Afro-American. Each hypothesis is tested separately to determine its ability to differentiate documents from different authors and to cluster documents from the same author. Hypotheses that quantify linguistic features are tested statistically using the chi-square statistic, and discrimination error rates are calculated. Only two hypotheses successfully differentiate and cluster documents: syntactic analysis and syntactically-classified punctuation.
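As an illustration of the chi-square testing mentioned in the abstract, the following minimal sketch computes a Pearson chi-square statistic for a table of feature counts from two documents and compares it with the standard critical value at the 0.05 level. The function names, the feature categories, and the counts are all hypothetical and are not taken from the article; the article's actual feature definitions and test procedure may differ.

```python
# Standard chi-square critical values at alpha = 0.05 for 1 to 4 degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts.

    Assumes every row and every column has a nonzero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the assumption that rows and columns are independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def differs(table, critical=CRITICAL_05):
    """True if the documents' feature distributions differ at the 0.05 level."""
    statistic, dof = chi_square(table)
    return statistic > critical[dof]


# Hypothetical counts: each row is one document, each column one feature category.
doc_pair = [[30, 10, 20],
            [12, 28, 20]]
statistic, dof = chi_square(doc_pair)
print(round(statistic, 2), dof, differs(doc_pair))
```

Under this sketch, a pair of documents whose statistic exceeds the critical value would be treated as differentiated, and a pair below it as clustered; a discrimination error rate could then be counted as the share of document pairs for which that decision disagrees with known authorship.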

Author Biography

  • Carole E. Chaski, Institute for Linguistic Evidence, Inc.
    Executive Director, Institute for Linguistic Evidence, Inc.

Published

2001-02-28

Section

Articles

How to Cite

Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. International Journal of Speech, Language and the Law, 8(1), 1-65. https://doi.org/10.1558/sll.2001.8.1.1