Authorship attribution and feature testing for Chinese short emails

Authors

  • Shaomin Zhang, Guangdong Polytechnic Normal University

DOI:

https://doi.org/10.1558/ijsll.v23i1.20300

Keywords:

authorship attribution, authorship features, pragmatic features, discourse semantic features, discourse information features, Chinese short emails

Abstract

Features developed for English-based authorship attribution shed much light on the attribution of Chinese texts, but they show some constraints when applied to them. Authorship attribution for short Chinese texts that are not handwritten would therefore help to promote the progress of legislation in China. This study explores and tests pragmatic, discourse semantic and discourse information features for the authorship attribution of short Chinese emails, which are representative of communication tools. The aim is to find effective features for attributing such emails. The texts used in the study comprise 72 short emails written by six authors. All 57 possible combinations of the six authors are tested and attributed based on the extracted features. Discriminant analysis is employed, and the results demonstrate significant predictions in all the tests. It is concluded that the extracted pragmatic, discourse semantic and discourse information features can significantly distinguish short Chinese emails by author, and it is suggested that the number of suspect authors in authorship attribution of short emails should not exceed five.
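
The figure of 57 combinations is consistent with counting every group of two or more of the six authors. This reading is an assumption, since the abstract does not state how the combinations are formed. A minimal sketch of the arithmetic in Python, where `count_author_groups` is a hypothetical helper name introduced only for illustration:

```python
from math import comb


def count_author_groups(n_authors: int, min_size: int = 2) -> int:
    """Count the subsets of n_authors that contain at least min_size authors.

    Assumption: a "combination" is any unordered group of candidate
    authors with at least two members (a single author cannot be
    discriminated from anyone).
    """
    return sum(comb(n_authors, k) for k in range(min_size, n_authors + 1))


if __name__ == "__main__":
    # Number of groups for each group size from 2 to 6 authors.
    per_size = {k: comb(6, k) for k in range(2, 7)}
    print(per_size)                # {2: 15, 3: 20, 4: 15, 5: 6, 1 group of 6}
    print(count_author_groups(6))  # 57
```

Under this assumption the 57 tests break down into 15 pairs, 20 groups of three, 15 groups of four, 6 groups of five and 1 group of all six authors.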

Author Biography

  • Shaomin Zhang, Guangdong Polytechnic Normal University
    Shaomin Zhang is a PhD graduate from the National Key Research Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies and works as a full-time lecturer at the School of Foreign Languages, Guangdong Polytechnic Normal University. His academic interests lie in forensic linguistics, discourse analysis and authorship attribution.

Published

2016-07-08

Section

Articles

How to Cite

Zhang, S. (2016). Authorship attribution and feature testing for Chinese short emails. International Journal of Speech, Language and the Law, 23(1), 71-97. https://doi.org/10.1558/ijsll.v23i1.20300