Conquering the Curse of Dimensionality by Using Background Knowledge
Code | Science | Field | Subfield
2.07.07 | Engineering sciences and technologies | Computer science and informatics | Intelligent systems - software

Code | Science | Field
P176 | Natural sciences and mathematics | Artificial intelligence

Code | Science | Field
1.02 | Natural Sciences | Computer and information sciences
data mining, statistics, machine learning, dimensionality reduction, background knowledge
Researchers (19)
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 36469 | PhD Niko Colnerič | Computer science and informatics | Junior researcher | 2015 - 2016 | 3
2. | 23399 | PhD Tomaž Curk | Computer science and informatics | Researcher | 2013 - 2016 | 253
3. | 16324 | PhD Janez Demšar | Computer science and informatics | Head | 2013 - 2016 | 340
4. | 31035 | MSc Marjana Erdelji | Computer science and informatics | Researcher | 2015 | 19
5. | 35424 | PhD Tomaž Hočevar | Computer science and informatics | Junior researcher | 2015 - 2016 | 30
6. | 38462 | Jernej Kernc | Computer science and informatics | Technical associate | 2015 | 0
7. | 25792 | PhD Minca Mramor | Human reproduction | Researcher | 2013 - 2016 | 61
8. | 32042 | PhD Matija Polajnar | Computer science and informatics | Junior researcher | 2013 - 2014 | 0
9. | 38461 | PhD Ajda Pretnar Žagar | Computer science and informatics | Technical associate | 2015 | 46
10. | 33189 | Anže Starič | Computer science and informatics | Junior researcher | 2013 - 2016 | 8
11. | 29630 | PhD Miha Štajdohar | Computer science and informatics | Researcher | 2013 | 21
12. | 38464 | Vesna Tanko | Computer science and informatics | Researcher | 2015 | 0
13. | 30142 | PhD Marko Toplak | Computer science and informatics | Researcher | 2013 - 2016 | 27
14. | 37693 | MSc Maja Vodopivec | Computer science and informatics | Researcher | 2014 - 2015 | 3
15. | 23987 | PhD Martin Vuk | Mathematics | Researcher | 2013 - 2014 | 25
16. | 12536 | PhD Blaž Zupan | Computer science and informatics | Researcher | 2013 - 2016 | 531
17. | 30921 | PhD Lan Žagar | Computer science and informatics | Researcher | 2013 - 2015 | 17
18. | 32929 | Jure Žbontar | Computer science and informatics | Researcher | 2013 - 2015 | 9
19. | 35422 | PhD Marinka Žitnik | Computer science and informatics | Researcher | 2015 | 83
Organisations (2)
Abstract
We live in a data-driven society whose functioning depends on gathering and analyzing huge quantities of data. Since collecting and storing data has become very cheap, we no longer observe small sets of well-chosen variables; we routinely collect large numbers of measurements for each data instance. This holds equally true for any field of human endeavor, from science, with, for instance, genome-wide sequencing and expression profiling, to business and economics, where, say, share prices or currency exchange rates are recorded at short time intervals.
In principle, this should enable us to find much more complex and unexpected patterns in the data than before. In practice, this abundance of data is like a huge haystack, and we lack efficient methods for finding the needles or, worse still, for distinguishing needles from straws. Formally, given the huge dimensionality of the data, current data mining methods find a great number of models and patterns that fit the data equally well. Although most of them are spurious, it is mathematically impossible to distinguish them from true phenomena.
We argue that this problem is inherent in the current approach to data mining, which mostly builds new theories from the data alone, a practice initially denounced as data fishing. So far, the field has worked around the problem by biasing theories towards simplicity (e.g. using linear models, various regularizations, Occam's principle, etc.). In high-dimensional problems this approach fails because many simple theories fit the data equally well.
We intend to research what we believe to be the only viable solution to the problem. Just as classical science does not build theories from observations alone, we believe that the search for models, patterns and visualizations in data mining should build on existing knowledge about the domain. For the purpose of the project, this prior knowledge can take any machine-readable form that describes the relations between variables: for instance, an ontology or a network of entities corresponding to the variables, correlations between the variables observed in past experiments, rules explicitly given by an expert, or text documents related to the topic, which can be used to statistically relate the variables.
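To make the idea of machine-readable prior knowledge concrete, one such form, a weighted network over the variables, can be derived from ontology annotations. The following sketch (hypothetical names, not code from the project) connects two variables with an edge weighted by the Jaccard similarity of their annotation term sets:

```python
def jaccard_graph(annotations, threshold=0.0):
    """Build a weighted graph over variables from their annotations.

    annotations: dict mapping a variable name to a set of ontology terms.
    Returns a dict mapping (var_a, var_b) pairs to the Jaccard similarity
    of their term sets, keeping only edges with weight above threshold.
    """
    names = list(annotations)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ta, tb = annotations[a], annotations[b]
            union = ta | tb
            w = len(ta & tb) / len(union) if union else 0.0
            if w > threshold:
                edges[(a, b)] = w
    return edges
```

A network built this way could then serve any of the knowledge-guided methods discussed below, regardless of which ontology supplied the terms.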
Prior knowledge should be used in all phases of the data mining process. We propose to develop methods for data transformation that will, for instance, decrease the dimensionality of the problem by using prior knowledge to construct new meaningful variables from the observed ones; note that this differs from traditional dimensionality reduction, which reduces the dimensionality of the data using the data itself. In visualization, we will develop methods for constructing useful visualizations based on available background knowledge of the problem. Predictive modeling, especially in machine learning, involves a search through a huge space of models; again, this search can be guided to incorporate the known relations between the variables. Finally, prior knowledge can be used to choose among the many models and patterns found that fit the data equally well.
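The knowledge-based data transformation mentioned above can be illustrated with a minimal sketch, assuming the prior knowledge arrives as named groups of variables (for example, gene sets or ontology terms); the function names and the averaging scheme are illustrative stand-ins, not the project's actual methods:

```python
import numpy as np

def group_features(X, feature_names, groups):
    """Collapse columns of X into one meta-feature per prior-knowledge
    group by averaging the member columns; a crude stand-in for
    set-level scoring.

    X: (instances x features) array; feature_names labels its columns.
    groups: dict mapping a group name to a set of member feature names.
    Returns the reduced matrix and the names of the groups kept.
    """
    col = {name: j for j, name in enumerate(feature_names)}
    new_cols, kept = [], []
    for gname, members in groups.items():
        js = [col[m] for m in members if m in col]
        if js:  # skip groups with no measured members
            new_cols.append(X[:, js].mean(axis=1))
            kept.append(gname)
    return np.column_stack(new_cols), kept
```

Unlike PCA or similar projections, the resulting dimensions are defined by the domain knowledge rather than by the data, so each new variable keeps a domain-level interpretation.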
The project will borrow from recent advances in genetics, the field that has made the most progress on dimensionality problems by using prior knowledge, and from statistical techniques for dimensionality reduction and machine-learning techniques for limiting the search space, neither of which currently makes much use of background knowledge. For this reason, the core of the project group consists of a PI with a background in machine learning and two members with PhDs in statistics and in medicine, in particular genetics.
All methods will be implemented in open source data mining tools, so they will be available for practical use on real-world problems and serve as a test bed for immediate testing and improvement of all algorithms developed within the project.
Significance for science
The basic premise of the project proposal - which was also reflected in its title - was that the use of prior or background knowledge can improve the analysis of high-dimensional data. As a result of the work on this project, we no longer think in terms of distinguishing between "prior knowledge" and "data", but prefer to consider these as different, heterogeneous data sources that can be fused together. One of the most important scientific achievements of the project team was the development of methods for data fusion that work on an (in principle) arbitrary number of data sources of any type that can be represented with matrices and connected into a graph. The technique is highly versatile and can be adapted to many different problems, as we demonstrated in a number of well-cited works.
Second, in the era of big data, networks are becoming a prominent structure for representing data, since data collections often describe sets of objects that can be (pairwise) related in different ways. As such - and in particular in the data fusion setup described above - networks played an important role in the project. We developed new techniques that make network-analytic methods practical which were previously infeasible due to their time complexity. A particular achievement was a fast combinatorial algorithm for counting graphlet orbits in large sparse networks.
Scientific progress requires tools. The group continued the development of one of the most popular open-source data mining platforms, Orange. In the past three years, it was extended with modules for working with extremely large (e.g. several terabytes) data sets, analysis of time series, spectral images, text mining, image embedding, and many other methods related, or at least tangential, to the work done within this project.
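The graphlet-orbit idea can be illustrated with a deliberately naive baseline, not the project's fast combinatorial algorithm: counting, for every node, the triangles it lies on, which corresponds to one of the simplest graphlet orbits. Function and variable names here are illustrative:

```python
def triangle_counts(edges):
    """For each node of a simple undirected graph, count the triangles
    it participates in, via per-edge neighbour-set intersections."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = dict.fromkeys(adj, 0)
    for u, v in edges:
        # Every common neighbour w closes the triangle {u, v, w}.
        # Each node of a triangle is the "third vertex" of exactly one
        # of the triangle's three edges, so it is counted exactly once.
        for w in adj[u] & adj[v]:
            counts[w] += 1
    return counts
```

For example, `triangle_counts([(0, 1), (1, 2), (0, 2), (2, 3)])` assigns one triangle each to nodes 0, 1 and 2 and none to node 3. Schemes of this intersection-based kind scale poorly on large networks, which is precisely the bottleneck the project's combinatorial algorithm addressed.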
Significance for the country
Besides the core research team, much of the work on the project was done by graduate and undergraduate students, who thus had an opportunity to experience state-of-the-art scientific work. Most of the team members come from the Bioinformatics Laboratory at the Faculty of Computer and Information Science, University of Ljubljana. The group has -- also thanks to this project -- grown stronger, and we also obtained funding from other Slovenian and foreign agencies and companies. The Bioinformatics Laboratory is currently one of the largest and most productive Slovenian research groups in this field. Through seminars, workshops and other presentations we promoted Slovenian scientific achievements abroad. Team members were also active promoters of our field among both adults and younger generations.
Most important scientific results
Annual report 2013, 2014, 2015, final report
Most important socioeconomically and culturally relevant results
Annual report 2014, 2015, final report