Language technologies for detecting the author's personal profile

Research activity

Code Science Field Subfield
7.00.00  Interdisciplinary research     

Code Science Field
H350  Humanities  Linguistics 

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords: author profiling, forensic linguistics, corpus linguistics, data mining
Researchers (1)
No.  Code  Name and surname  Research area  Role  Period  No. of publications
1.  24440  PhD Ana Zwitter Vitez  Linguistics  Head  2011 - 2014  118 
Organisations (1)
No.  Code  Research organisation  City  Registration number  No. of publications
1.  2923  Trojina, zavod za uporabno slovenistiko (Slovene)  Škofja Loka  1914642  57 
Authorship attribution has developed rapidly over the last two decades because public figures and private individuals are increasingly exposed to threatening letters, sent either in traditional form or over the internet (recent examples include G. Bush, J. Janša and K. Kresal). Plagiarism has also increased because so many texts are accessible on the web (for example, the PhD thesis of the German defence minister K.-T. Guttenberg).
Because identifying an author's distinguishing linguistic features in a text is so important, authorship attribution is particularly well developed in authorship and copyright law (Grant, 2007), literary studies (Hoover, 2004), criminology (Coulthard, 2005) and customer profiling for commercial purposes (Shaw et al., 2001).
Many studies on authorship attribution have been conducted, but in Slovenia the field remains relatively under-researched: only two studies using statistical methods have been carried out (Dović, 2002; Limbek, 2008). There is nevertheless potential for quality research, because language tools and resources for Slovene are available.
The aim of the project Language technologies for determining the author’s personal profile is therefore to acquire the knowledge needed to answer the following question:
- What is the profile of the author (gender, age, level of education, region, psychological profile) of a text of unknown authorship when no candidate authors are available?
This knowledge will be gained using the following methodology:
- we will design and build a reference corpus,
- we will determine and evaluate a set of lexical, character, syntactic and semantic features for author profiling,
- we will design and evaluate a feature-based model for author profiling.
We will also analyse how linguistic features differ by genre and identify how methods for author profiling are affected by characteristics such as the number of candidate authors and text length.
The result of the project Language technologies for determining the author’s personal profile will be the definition of linguistic features for Slovene that enable author profiling (identifying the gender, age, education, region and psychometric traits of the author).
The research results are expected to considerably improve the quality of work in criminology, authorship law, literary studies, and customer profiling in market research. For this reason, the results and the methods for authorship attribution and author profiling will be published and forwarded to the Centre for forensic research, the Institute for criminology at the Faculty of Law, the Faculty of criminal justice and security, companies working on language technologies (Amebis, Alpineon), and institutions interested in authorship attribution for purposes such as advertising, plagiarism detection, or human resource management.
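As an illustration of what lexical and character features for author profiling can look like in practice, the following is a minimal sketch in Python. It is not the project's implementation: the function names, the choice of average word length and type-token ratio as lexical features, and the use of character trigrams are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Return lowercased word tokens.

    \\w is Unicode-aware in Python 3, so Slovene letters such as
    č, š and ž are kept inside tokens.
    """
    return re.findall(r"\w+", text.lower())


def lexical_features(text):
    """Compute two simple lexical features (illustrative choice only)."""
    tokens = tokenize(text)
    if not tokens:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        # Mean number of characters per word token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Distinct word forms divided by all word tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


def char_ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams."""
    # Lowercase and collapse whitespace so layout does not affect counts.
    normalized = " ".join(text.lower().split())
    grams = Counter(
        normalized[i:i + n] for i in range(len(normalized) - n + 1)
    )
    total = sum(grams.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in grams.items()}


if __name__ == "__main__":
    sample = "Danes je lep dan. Jutri bo dan še lepši."
    print(lexical_features(sample))
    print(sorted(char_ngram_profile(sample).items())[:5])
```

In a feature-based model, values like these would typically be collected for every text in the reference corpus and passed, together with the authors' profile metadata, to a classifier. Syntactic and semantic features would need language tools for Slovene that this sketch does not include.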
Significance for science
The results of the project have contributed to the field of authorship attribution at the following levels:
- methodology: the research was conducted as an interdisciplinary dialogue between linguistics (corpus design and analysis), computer science (machine learning), and criminology (forensic linguistics),
- results: a corpus with authors’ profile metadata, feature values and classification results are available for further studies,
- terminology: national and international publications have contributed to the unification of terminology in the field of authorship attribution,
- transfer of knowledge: results of the survey were incorporated into the study process and presented in the media.
Significance for the country
The research results contribute to the realization of two strategic documents:
- the European initiative Digital Agenda for Europe, which promotes the use of digital technologies,
- the Resolution on Research and Innovation Strategy of Slovenia 2011-2020 (3.2 Transfer of knowledge, 4.3 Research infrastructure development, 4.5 Information infrastructure supporting innovation system).
The research complements the quantitative analysis of everyday language production and highlights the importance of applied linguistic analyses. The results can be exploited in different scientific disciplines (linguistics, informatics, criminology) and in the following fields:
- economy (market analysis): the methodology can be adapted to the needs of enterprises that base advertising strategies and product development on their clients’ language production (Shaw et al., 2001),
- human resource management: linguistic features for author profiling support the selection of suitable candidates in large companies (Schuler et al., 1999),
- state authorities: in criminal investigation, calculating and evaluating lexical and readability features makes it possible to detect whether one of the possible authors has written an anonymous text,
- cultural heritage: the resulting corpus with annotated metadata facilitates further analyses on authorship attribution and author profiling (with careful protection of copyright and personal data),
- intercultural dialogue: awareness of the power of authentic language production analysis can improve the understanding of social relations and conflicts in everyday life.
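The criminal-investigation use case mentioned above, comparing an anonymous text with writing samples from possible authors, can be sketched as follows. This is a toy illustration under stated assumptions, not the project's method: the three features (average sentence length, average word length, type-token ratio), the function names and the unscaled Euclidean distance are all hypothetical choices. A real analysis would need feature scaling, much longer samples and a statistical evaluation before drawing any conclusion.

```python
import math
import re


def readability_features(text):
    """Return (avg sentence length, avg word length, type-token ratio)."""
    # Naive sentence split on terminal punctuation; an assumption that
    # ignores abbreviations and similar cases.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),             # words per sentence
        sum(len(w) for w in words) / len(words), # characters per word
        len(set(words)) / len(words),            # lexical variety
    )


def rank_candidates(anonymous_text, candidate_texts):
    """Rank candidate authors by feature distance to the anonymous text.

    candidate_texts maps a (hypothetical) author label to a writing sample.
    Returns (label, distance) pairs, closest first.
    """
    target = readability_features(anonymous_text)
    distances = {
        name: math.dist(target, readability_features(sample))
        for name, sample in candidate_texts.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


if __name__ == "__main__":
    candidates = {
        "author_A": "Kratko pismo. Zelo kratko.",
        "author_B": "To je precej daljša poved z več besedami in vejicami.",
    }
    print(rank_candidates("Kratek stavek. Še eden.", candidates))
```

The ranking only orders candidates by similarity. It does not by itself establish that any of them wrote the text, which is why evaluation of the features on a reference corpus matters.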
