Improving Reproducibility of Experiments and Reusability of Research Outputs in Complex Data Analysis

Code

J2-9230 (B) - included in ARIS records

Head

PhD Panče Panov

Period

7/1/2018 - 6/30/2022

Range in 2022

0.86 FTE

Science

Engineering sciences and technologies (11)

Researcher status

Researcher (11)
Junior expert or technical associate (0)

Education

Doctoral degree (10)
Other (1)

Sex

Woman (2)
Man (9)

Status

Employed at RO and RRD (9)
No data on employment in RO (2)

No. of publications

10–99 (7)
100–999 (3)
1,000–9,999 (1)

Projects / Programmes source: ARIS

Research activity

Code	Science	Field	Subfield
2.07.00	Engineering sciences and technologies	Computer science and informatics

Code	Science	Field
P170	Natural sciences and mathematics	Computer science, numerical analysis, systems, control

Code	Science	Field
1.02	Natural Sciences	Computer and information sciences

Keywords

reproducible research; reuse of research results; machine learning; data mining; complex data analysis; semantic technologies

Evaluation (methodology)

Evaluation of bibliographic research performance indicators according to ARIS methodology

Citations for bibliographic records in COBIB.SI that are linked to records in citation databases

Organisations (1), Researchers (11)

0106 Jožef Stefan Institute

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	53798	PhD Jure Brence	Computer science and informatics	Researcher	2020 - 2022	27
2.	36220	PhD Martin Breskvar	Computer science and informatics	Researcher	2018 - 2022	38
3.	11130	PhD Sašo Džeroski	Computer science and informatics	Researcher	2018 - 2022	1,314
4.	31050	PhD Dragi Kocev	Computer science and informatics	Researcher	2018 - 2022	241
5.	53530	Ana Kostovska	Computer science and informatics	Young researcher	2020 - 2022	59
6.	28291	PhD Petra Kralj Novak	Computer science and informatics	Researcher	2018 - 2022	134
7.	36356	PhD Aljaž Osojnik	Computer science and informatics	Researcher	2018 - 2022	49
8.	27759	PhD Panče Panov	Computer science and informatics	Head	2018 - 2022	175
9.	38206	PhD Matej Petković	Computer science and informatics	Young researcher	2018 - 2020	76
10.	34452	PhD Nikola Simidjievski	Computer science and informatics	Researcher	2018 - 2022	60
11.	39156	PhD Tomaž Stepišnik	Computer science and informatics	Young researcher	2018 - 2022	28

Abstract

Advances in science rest on the premise of trusted discovery: the research must be performed correctly and be reproducible by other scientists. To increase the reusability of research outputs, such as developed models and produced data, these outputs should be Findable, Accessible, Interoperable and Reusable (the FAIR principles). The main point of FAIR is to ensure that research outputs are reusable and will actually be used by others, thus becoming more valuable. The DG for Research and Innovation of the EC has adopted the reusability of research data as one of its priorities, which led to the rapid endorsement of the FAIR principles by different stakeholders. Research outputs that are to fulfill the FAIR principles must be represented in a widely accepted machine-readable framework. Currently, a popular solution to data sharing that fulfills the FAIR requirements is the use of semantic web technologies. Complex data analysis methods, originating from machine learning (ML) and data mining (DM), are increasingly being used in applications from various domains of science (e.g., life sciences, space research, etc.). To provide reproducibility of experiments (e.g., executions of methods) and reuse of research outputs (e.g., predictive models), one needs to formally describe the entities involved in the process of analysis and store them, together with their descriptions (e.g., metadata), as digital objects in a database-like structure. Having “semantically aware” stores of entities for complex data analytics, enhanced with automatic reasoning capabilities, would be beneficial for improving the reproducibility of experiments and the reuse of research outputs. In this way we would move closer towards a FAIR data analysis process. The main objective of the proposed project is to improve the repeatability of experiments and the reusability of research outputs in complex data analysis.
We will address this objective by combining approaches and ideas from the areas of complex data analytics, ontologies for science, semantic web and inductive databases. More specifically, we will develop a modular system for executing complex data analysis experiments, and for semantically annotating, storing, querying and reusing their outputs. To meet the main objective of the project, we plan to: (1) design, implement and populate ontologies for complex data analysis to be used for semantic annotation; (2) design and implement a prototype system for storing semantically annotated data, experiments and models; (3) develop querying strategies and test the querying capabilities of the prototype system; and (4) test the developed system in different use-case scenarios from various domains, such as machine learning, life sciences, space research and chemoinformatics. The proposed research will significantly advance the state of the art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models produced by the analytics methods. This is of particular importance for the application domains that heavily use data analytics tools in their work. The proposed project will also have a large impact in the context of automating data science. The experiments will be repeatable, since they will be performed in a sound, documented fashion using an architecture designed for such analysis. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck.
Finally, in a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.
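The idea of storing experiments together with their semantic descriptions and then querying them can be illustrated with a minimal sketch. It is not the project's system: it assumes a tiny in-memory store of subject-predicate-object statements written in plain Python, and every identifier in it (the `ex:` prefix, `ex:usesDataset`, `ex:producesModel`, `TripleStore`, `annotate_experiment`, `models_for_dataset`) is hypothetical and chosen only for illustration. A real implementation would more likely rely on an RDF store and published ontologies.

```python
from collections import namedtuple

# One statement about an entity: (subject, predicate, object), all plain strings.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


class TripleStore:
    """Tiny in-memory store of statements (illustrative stand-in for an RDF store)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


def annotate_experiment(store, exp_id, dataset, algorithm, params, model_id):
    """Record what an experiment used and produced (hypothetical vocabulary)."""
    store.add(exp_id, "rdf:type", "ex:Experiment")
    store.add(exp_id, "ex:usesDataset", dataset)
    store.add(exp_id, "ex:executesAlgorithm", algorithm)
    store.add(exp_id, "ex:producesModel", model_id)
    for name, value in params.items():
        # Parameter settings are kept as strings so the run can be repeated.
        store.add(exp_id, "ex:hasParameterSetting", f"{name}={value}")


def models_for_dataset(store, dataset):
    """Query: which stored models were produced from the given dataset?"""
    models = []
    for t in store.match(predicate="ex:usesDataset", obj=dataset):
        models.extend(
            m.obj for m in store.match(subject=t.subject, predicate="ex:producesModel")
        )
    return models


if __name__ == "__main__":
    store = TripleStore()
    annotate_experiment(
        store, "ex:exp1", "ex:datasetA", "ex:decisionTree",
        {"max_depth": 5}, "ex:model1",
    )
    print(models_for_dataset(store, "ex:datasetA"))
```

Because the dataset, algorithm, parameter settings and resulting model are all recorded as statements about the experiment, the same store can answer reuse questions (which models exist for a dataset) and repeatability questions (which settings produced a model).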

Significance for science

Within the proposed project, we will develop a new architecture for executing, storing, semantically annotating, and querying experiments and results in complex data analytics, to improve the reproducibility of experiments and the reusability of data, experiments and models. This is a highly relevant topic that can have a large impact in application domains that heavily use data analytics tools in their work. For example, given our involvement in the Medical Informatics Platform of the H2020 FET Flagship Human Brain Project, we can instantiate the architecture to support reproducibility and reuse for the tasks of discovering biomarkers and biological signatures of diseases. Moreover, we are also involved in an Interreg project with the Italian partner ICGEB, Trieste, where we will be able to instantiate the architecture to support reproducibility and reuse for the task of disease modeling in the context of building a cross-border platform for validated biotech industry kits.

The proposed research will significantly advance the state of the art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models, which is of particular importance for the application areas that use analytics as a service. In a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.

The proposed project can have a large impact in the context of automating data science. The experiments will be repeatable, since they will be performed in a sound, documented fashion using an architecture designed for such analysis. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck.

Finally, the project results will be disseminated and communicated to achieve maximal outreach. This will be done by attending workshops and conferences as well as by publishing the research results in journals. The proposed architecture and the produced resources (ontologies, RDF stores) will be made publicly available to ensure larger outreach. The obtained results will also be communicated to external stakeholders in Europe (such as the Research Data Alliance) that can help increase the outreach of the project and exploit the obtained knowledge and resources.

Significance for the country

Within the proposed project, we will develop a new architecture for executing, storing, semantically annotating, and querying experiments and results in complex data analytics, to improve the reproducibility of experiments and the reusability of data, experiments and models. This is a highly relevant topic that can have a large impact in application domains that heavily use data analytics tools in their work. For example, given our involvement in the Medical Informatics Platform of the H2020 FET Flagship Human Brain Project, we can instantiate the architecture to support reproducibility and reuse for the tasks of discovering biomarkers and biological signatures of diseases. Moreover, we are also involved in an Interreg project with the Italian partner ICGEB, Trieste, where we will be able to instantiate the architecture to support reproducibility and reuse for the task of disease modeling in the context of building a cross-border platform for validated biotech industry kits.

The proposed research will significantly advance the state of the art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models, which is of particular importance for the application areas that use analytics as a service. In a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.

The proposed project can have a large impact in the context of automating data science. The experiments will be repeatable, since they will be performed in a sound, documented fashion using an architecture designed for such analysis. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck.

Finally, the project results will be disseminated and communicated to achieve maximal outreach. This will be done by attending workshops and conferences as well as by publishing the research results in journals. The proposed architecture and the produced resources (ontologies, RDF stores) will be made publicly available to ensure larger outreach. The obtained results will also be communicated to external stakeholders in Europe (such as the Research Data Alliance) that can help increase the outreach of the project and exploit the obtained knowledge and resources.

Most important scientific results

Interim report

Most important socioeconomically and culturally relevant results
