Resources, methods and tools for the understanding, identification and classification of various forms of socially unacceptable discourse in the information society
Code | Science | Field | Subfield
6.05.02 | Humanities | Linguistics | Theoretical and applied linguistics
Code | Science | Field
H350 | Humanities | Linguistics
Code | Science | Field
6.02 | Humanities | Languages and Literature
Socially unacceptable discourse; Computer Mediated Communication; Corpus Linguistics; Critical Discourse Analysis; Language Technologies
Researchers (17)
Organisations (4)
Abstract
Socially unacceptable discourse, such as hate, discriminatory, offensive or threatening speech, is by no means a new phenomenon. It has, however, recently gained significant momentum due to a number of substantial societal, cultural and economic changes. Furthermore, the boom in information and communication technology and the speed at which information spreads on the Internet have given such discourse practices an unprecedented reach and impact that can only be studied and efficiently mitigated with interdisciplinary methods and automatic approaches.
The project combines state-of-the-art quantitative and qualitative multidisciplinary approaches which will be employed to investigate the use of socially unacceptable discourse in its sociocultural context. The use of novel data-driven approaches on unstructured and semi-structured data will move the frontiers of the traditional humanities and social sciences. As a side-effect, the project will also support the development of the new field of Digital Humanities and Social Sciences, which combines tools and methods from computer science with those of humanities and social sciences.
Within the scope of the project we will construct large corpora of Slovene computer-mediated communication in general and socially unacceptable discourse in particular, which will serve as the basis for our empirically based research. The collected corpora will be highly structured, and their texts will be linguistically processed and enriched with various metadata.
We will develop a typology of socially unacceptable discourse and its targets, and manually annotate a representative sample of texts with this typology. This will result in a gold-standard dataset for researching such communication. By using machine learning techniques on this dataset, an automatic method to flag and categorise SUD texts and their targets will be developed and applied to the compiled corpora.
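As a purely illustrative sketch of the step from a manually annotated dataset to an automatic flagging method, the following trains a small multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes, the two labels, the toy English sentences and the names `GOLD`, `train` and `classify` are all assumptions made for this example; the project description does not specify its typology, its learning algorithm or its data format.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenisation; real corpora would be linguistically processed.
    return text.lower().split()


def train(samples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-in for a gold-standard sample (labels are assumptions).
GOLD = [
    ("you people are awful and should leave", "unacceptable"),
    ("awful people like you ruin everything", "unacceptable"),
    ("thank you for the helpful article", "acceptable"),
    ("great article and a helpful summary", "acceptable"),
]

model = train(GOLD)
print(classify(model, "helpful article"))
print(classify(model, "awful people"))
```

In practice the label set would follow the project's typology of discourse types and targets, and the trained model would then be applied to the full compiled corpora rather than to single sentences.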
Interdisciplinary sociolinguistic analyses will be performed on the basis of the collected and processed resources, focusing on migrants and Islamophobia, homophobia and gay rights, and sexism and misogyny. We will use the methodologies and instruments of corpus linguistics, critical discourse analysis and inferential statistics. These approaches will be supplemented with a corpus analysis of legal aspects of socially unacceptable discourse and surveys on its perception in Slovene society.
The project will organise an international interdisciplinary workshop and publish a monograph. Importantly, the project will provide free and open access to the research results through the research infrastructure CLARIN.SI and the Social Science Data Archive. The research data will consist of the developed language resources and software. All legal and ethical issues with regard to personal data distribution will be taken into account. In this way the project will also support the move to open science, enabling reproducibility of its research results.
Significance for science
The proposed project is an important milestone in Slovene humanities and social sciences, as there have been no previous attempts at comprehensive, inter- and multidisciplinary, data-driven research on SUD.
The relevance and impact of the project for the development of science is four-fold:
A tangible result of the project is the large, richly annotated corpora of socially unacceptable CMC and of general CMC, as well as manually annotated datasets recording the SUD type and the target the SUD is aimed at. These language resources will enable comprehensive insight into the characteristics of various forms of SUD practices in the information society and will facilitate a number of novel research approaches in the fields of linguistics, sociolinguistics, critical discourse analysis and anthropology, as well as support the development of technologies for content analysis and text analytics for Slovene that can be widely employed in the Digital Humanities and Social Sciences.
The project will result in a theoretically grounded and thoroughly tested tool for automatic detection and classification of socially unacceptable web content that will be directly applicable in social science, law and criminology. Such services are becoming increasingly important in the knowledge-based, hi-tech society, where Slovene is still lagging far behind most European languages, which puts it in an unequal position compared to others and consequently hinders the development of Slovene society and the Slovene language.
In contrast to many Slovene projects, its research results, in terms of created resources, manually annotated datasets, models and technologies, will be published, taking into account legal and ethical limitations, under an open-source research licence (Creative Commons) according to the EU open science guidelines. This enables other interested researchers, in Slovenia or abroad, to reproduce and enhance the research results achieved in the scope of the project.
Combining methods in humanities and social sciences with those from computer science, the project will support the development of the new research field of Digital Humanities and Social Sciences.
In the field of legal sciences, the novel contribution of the project will be the legal analysis of the corpus, showing the extent of legally prosecutable SUD that remains untackled, the legal characteristics of the SUD corpus, its targets and its level of severity. This will contribute significantly to the knowledge and understanding of SUD and its prevention, which is the preferred tool as opposed to criminal prosecution.
Significance for the country
With a combination of methods and approaches from various fields of Digital Humanities and Social Sciences, the project facilitates the perception of SUD in society and provides tools and guidelines to combat elements of extremism and intolerance in our society. Apart from society in general, the direct beneficiaries will be newspaper publishers and online content providers as well as governmental and non-governmental institutions.
In the field of legal sciences, the novel contribution of the project will be the legal analysis of the corpus, showing the extent of legally prosecutable SUD that remains untackled, the legal characteristics of the SUD corpus, its targets and its level of severity. This will contribute significantly to the knowledge and understanding of SUD and its prevention, which is the preferred tool as opposed to criminal prosecution.
The project will also produce a number of open-source language technology tools and resources for dealing with Slovene CMC, which will significantly outperform existing ones. Furthermore, the project will compile annotated datasets of Slovene, a key resource for training language analysis tools. These tools and resources will be directly accessible for use by other researchers and, where possible, for commercial use as well. This will facilitate further development of language technologies for Slovene.
Most important scientific results
Interim report, final report
Most important socioeconomically and culturally relevant results
Interim report, final report