Native language influence detection for forensic authorship analysis: Identifying L1 Persian bloggers

Authors

  • Ria Perkins, Aston University
  • Tim Grant, Aston University

DOI:

https://doi.org/10.1558/ijsll.30844

Keywords:

native language identification, authorship analysis, linguistic profiling, native language influence detection, Persian

Abstract

This article demonstrates and examines the potential use of interlingual identifiers for forensic authorship analysis and native language influence detection (NLID). The work focuses on the practical application of native language (L1) identifiers by a human analyst in investigative situations. Using naturally occurring blog posts in which the writer self-identifies as a native Persian speaker, a human analyst derived and coded sets of non-native features. Two logistic regression models were built: the first was used to select features that distinguish L1 Persian speakers from L1 English speakers in their English writing; the second was used to develop a feature list to contrast L1 languages that are geographically and linguistically close to Persian. The results clearly demonstrate that interlingual identifiers have the potential to aid in determining the L1 of an anonymous author and can be used by a human analyst on a short, forensically realistic example text. This article demonstrates that NLID is possible beyond the more common computational approaches and can form a useful tool in the forensic linguist’s toolbox. This study is not a statistical validation study; instead, it demonstrates how a sociolinguistic approach can complement more traditional computational approaches.

Author Biographies

  • Ria Perkins, Aston University
    Dr Ria Perkins is a Research Associate at the Centre for Forensic Linguistics at Aston University. Her research focuses predominantly on authorship analysis, in particular native language influence detection. Her research interests also include power and persuasive communication, online influence, computer-mediated communication, sociolinguistics and the application of forensic linguistics to security and intelligence investigations. Her casework speciality is authorship profiling, and she has undertaken and assisted with work for law enforcement, private companies and international NGOs.
  • Tim Grant, Aston University
    Professor Tim Grant is the Director of the Centre for Forensic Linguistics and Aston University's 50th Anniversary Chair in Forensic Linguistics. He publishes mostly in the area of forensic authorship analysis and has most recently been researching how forensic linguistic techniques can assist in darkweb investigations. As a practitioner, his casework has helped resolve numerous cases involving murder, sexual crime and terrorism. He has appeared in the press, on television and on radio programmes in the UK and internationally, including appearances on BBC Crimewatch and BBC Radio 4 Word of Mouth.

Published

2018-09-10

Section

Articles

How to Cite

Perkins, R., & Grant, T. (2018). Native language influence detection for forensic authorship analysis: Identifying L1 Persian bloggers. International Journal of Speech, Language and the Law, 25(1), 1–20. https://doi.org/10.1558/ijsll.30844