Native language influence detection for forensic authorship analysis: Identifying L1 Persian bloggers
DOI:
https://doi.org/10.1558/ijsll.30844
Keywords:
native language identification, authorship analysis, linguistic profiling, native language influence detection, Persian
Abstract
This article demonstrates and examines the potential use of interlingual identifiers for forensic authorship analysis and native language influence detection (NLID). The work focuses on the practical application of native language (L1) identifiers by a human analyst in investigative situations. Using naturally occurring blog posts in which the writer self-identifies as a native Persian speaker, a human analyst derived and coded sets of non-native features. Two logistic regression models were built: the first was used to select features that distinguish L1 Persian speakers from L1 English speakers in their English writing; the second developed a feature list to contrast Persian with L1 languages that are geographically and linguistically close to it. The results demonstrate that interlingual identifiers have the potential to aid in determining the L1 of an anonymous author and can be used by a human analyst on a short, forensically realistic example text. This article shows that NLID is possible beyond the more common computational approaches and can form a useful tool in the forensic linguist’s toolbox. This study is not a statistical validation study; instead, it demonstrates how a sociolinguistic approach can complement more traditional computational approaches.