Resources, methods and tools for the understanding, identification and classification of various forms of socially unacceptable discourse in the information society

Code

J7-8280 (B) - included in ARIS records

Head

PhD Tomaž Erjavec

Period

5/1/2017 - 4/30/2020

Range in 2020

0.3 FTE

Science

Engineering sciences and technologies (2)
Social sciences (7)
Humanities (7)
Other (1)

Researcher status

Researcher (16)
Junior expert or technical associate (1)

Education

Doctoral degree (14)
Other (3)

Sex

Woman (7)
Man (10)

Status

Employed at RO (1)
Employed at RO and RRD (14)
No data on employment in RO (1)
Retired (1)

No. of publications

0 (1)
10–99 (5)
100–999 (11)

Projects / Programmes source: ARIS

Resources, methods and tools for the understanding, identification and classification of various forms of socially unacceptable discourse in the information society

Research activity

Code	Science	Field	Subfield
6.05.02	Humanities	Linguistics	Theoretical and applied linguistics

Code	Science	Field
H350	Humanities	Linguistics

Code	Science	Field
6.02	Humanities	Languages and Literature

Keywords

Socially unacceptable discourse; Computer-Mediated Communication; Corpus Linguistics; Critical Discourse Analysis; Language Technologies

Evaluation (methodology)

Evaluation of bibliographic research performance indicators according to ARIS methodology

Citations for bibliographic records in COBIB.SI that are linked to records in citation databases

Organisations (4), Researchers (17)

0106 Jožef Stefan Institute

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	05023	PhD Tomaž Erjavec	Linguistics	Head	2017 - 2020	710
2.	36871	PhD Nikola Ljubešić	Linguistics	Researcher	2017 - 2020	492
3.	03323	PhD Igor Mozetič	Computer science and informatics	Researcher	2017 - 2020	186
4.	37977	PhD Jasmina Smailović	Computer science and informatics	Researcher	2017	40

0366 Peace institute Ljubljana

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	28195	PhD Veronika Bajt	Social sciences	Researcher	2017 - 2020	281
2.	53338	Monika Bohinec	Criminology and social work	Technical associate	2020	16
3.	27894	PhD Neža Kogovšek Šalamon	Law	Researcher	2017 - 2020	438
4.	20544	Irena Salmič		Technical associate	2017 - 2020	0

0581 University of Ljubljana, Faculty of Arts

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	30672	PhD Maja Bitenc	Linguistics	Researcher	2018	82
2.	36914	PhD Jaka Čibej	Linguistics	Researcher	2017 - 2020	227
3.	26294	PhD Darja Fišer	Linguistics	Researcher	2017 - 2020	436
4.	14681	PhD Vojko Gorjanc	Linguistics	Researcher	2017 - 2020	514
5.	50983	PhD Jakob Lenardič	Linguistics	Researcher	2018 - 2020	75

0582 University of Ljubljana, Faculty of Social Sciences

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	30704	PhD Jernej Berzelak	Sociology	Researcher	2017 - 2020	140
2.	24365	PhD Dejan Jontes	Social sciences	Researcher	2019 - 2020	337
3.	39534	Andrej Motl	Sociology	Researcher	2017 - 2020	24
4.	10155	PhD Vasja Vehovar	Sociology	Researcher	2017 - 2020	887

Abstract

Socially unacceptable discourse (SUD), such as hate, discriminatory, offensive or threatening speech, is by no means a new phenomenon. It has, however, recently gained significant momentum due to a number of substantial societal, cultural and economic changes. Furthermore, the boom in information and communication technology and the speed at which information spreads on the Internet have given such discourse practices an unprecedented reach and impact that can only be studied and efficiently mitigated with interdisciplinary methods and automatic approaches. The project combines state-of-the-art quantitative and qualitative multidisciplinary approaches to investigate the use of socially unacceptable discourse in its sociocultural context. The use of novel data-driven approaches on unstructured and semi-structured data will move the frontiers of the traditional humanities and social sciences. As a side effect, the project will also support the development of the new field of Digital Humanities and Social Sciences, which combines tools and methods from computer science with those of the humanities and social sciences. Within the project we will construct large corpora of Slovene computer-mediated communication (CMC) in general and of socially unacceptable discourse in particular, which will serve as the basis for our empirical research. The collected corpora will be highly structured, and their texts will be linguistically processed and enriched with various metadata. We will develop a typology of socially unacceptable discourse and its targets, and manually annotate a representative sample of texts with this typology. This will result in a gold-standard dataset for researching such communication. By applying machine learning techniques to this dataset, we will develop an automatic method to flag and categorise SUD texts and their targets, and apply it to the compiled corpora.
Interdisciplinary sociolinguistic analyses will be performed on the basis of the collected and processed resources, focusing on migrants and Islamophobia, homophobia and gay rights, and sexism and misogyny. We will use the methodologies and instruments of corpus linguistics, critical discourse analysis and inferential statistics. These approaches will be supplemented with a corpus analysis of the legal aspects of socially unacceptable discourse and with surveys on its perception in Slovene society. The project will organise an international interdisciplinary workshop and publish a monograph. The project will enable free and open access to its research results through the research infrastructure CLARIN.SI and the Social Science Data Archive. The research data will consist of the developed language resources and software. All legal and ethical issues with regard to personal data distribution will be taken into account. In this way, the project will also support the move to open science, enabling reproducibility of its research results.

Significance for science

The proposed project is an important milestone in Slovene humanities and social sciences, as there have been no previous attempts at comprehensive, inter- and multidisciplinary, data-driven research of SUD. The relevance and impact of the project for the development of science is four-fold:
Tangible results of the project are the large, richly annotated corpora of socially unacceptable CMC and of general CMC, as well as manually annotated datasets giving the SUD type and the target the SUD is aimed at. These language resources will enable a comprehensive insight into the characteristics of various forms of SUD practices in the information society and will facilitate a number of novel research approaches in the fields of linguistics, sociolinguistics, critical discourse analysis and anthropology. They will also support the development of technologies for content analysis and text analytics for Slovene that can be widely employed in the Digital Humanities and Social Sciences.
The project will result in a theoretically grounded and thoroughly tested tool for automatic detection and classification of socially unacceptable web content that will be directly applicable in social science, law and criminology. Such services are becoming increasingly important in the knowledge-based, hi-tech society, where Slovene is still lagging far behind most European languages, which puts it in an unequal position compared to others and consequently hinders the development of Slovene society and the Slovene language.
In contrast to many Slovene projects, its research results (the created resources, manually annotated datasets, models and technologies) will be published, taking into account legal and ethical limitations, under an open-source research licence (Creative Commons) according to the EU open science guidelines. This enables other interested researchers, in Slovenia or abroad, to reproduce and enhance the research results achieved in the scope of the project.
Combining methods from the humanities and social sciences with those from computer science, the project will support the development of the new research field of Digital Humanities and Social Sciences.
In the field of legal sciences, the novel contribution of the project will be the legal analysis of the corpus, showing the extent of legally prosecutable SUD that remains unaddressed, the legal characteristics of the SUD corpus, its targets and level of severity. This will contribute significantly to the knowledge and understanding of SUD and its prevention, which is the preferred tool as opposed to criminal prosecution.

Significance for the country

With a combination of methods and approaches from various fields of Digital Humanities and Social Sciences, the project helps society recognise SUD and provides tools and guidelines to combat elements of extremism and intolerance in our society. Apart from society in general, the direct beneficiaries will be newspaper publishers and online content providers as well as governmental and non-governmental institutions.
In the field of legal sciences, the novel contribution of the project will be the legal analysis of the corpus, showing the extent of legally prosecutable SUD that remains unaddressed, the legal characteristics of the SUD corpus, its targets and level of severity. This will contribute significantly to the knowledge and understanding of SUD and its prevention, which is the preferred tool as opposed to criminal prosecution.
The project will also produce a number of open-source language technology tools and resources for dealing with Slovene CMC, which will significantly outperform existing ones. Furthermore, the project will compile annotated datasets of Slovene, a key resource for training language analysis tools. These tools and resources will be directly accessible for use by other researchers and, where possible, for commercial use as well. This will facilitate further development of language technologies for Slovene.

Most important scientific results

Interim report, final report

Most important socioeconomically and culturally relevant results

Interim report, final report
