Conquering the Curse of Dimensionality by Using Background Knowledge
Code | Science | Field | Subfield
2.07.07 | Engineering sciences and technologies | Computer science and informatics | Intelligent systems - software

Code | Science | Field
P176 | Natural sciences and mathematics | Artificial intelligence

Code | Science | Field
1.02 | Natural Sciences | Computer and information sciences
data mining, statistics, machine learning, dimensionality reduction, background knowledge
Researchers (19)
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 36469 | PhD Niko Colnerič | Computer science and informatics | Junior researcher | 2015 - 2016 | 3
2. | 23399 | PhD Tomaž Curk | Computer science and informatics | Researcher | 2013 - 2016 | 253
3. | 16324 | PhD Janez Demšar | Computer science and informatics | Head | 2013 - 2016 | 340
4. | 31035 | MSc Marjana Erdelji | Computer science and informatics | Researcher | 2015 | 19
5. | 35424 | PhD Tomaž Hočevar | Computer science and informatics | Junior researcher | 2015 - 2016 | 30
6. | 38462 | Jernej Kernc | Computer science and informatics | Technical associate | 2015 | 0
7. | 25792 | PhD Minca Mramor | Human reproduction | Researcher | 2013 - 2016 | 61
8. | 32042 | PhD Matija Polajnar | Computer science and informatics | Junior researcher | 2013 - 2014 | 0
9. | 38461 | PhD Ajda Pretnar Žagar | Computer science and informatics | Technical associate | 2015 | 46
10. | 33189 | Anže Starič | Computer science and informatics | Junior researcher | 2013 - 2016 | 8
11. | 29630 | PhD Miha Štajdohar | Computer science and informatics | Researcher | 2013 | 21
12. | 38464 | Vesna Tanko | Computer science and informatics | Researcher | 2015 | 0
13. | 30142 | PhD Marko Toplak | Computer science and informatics | Researcher | 2013 - 2016 | 27
14. | 37693 | MSc Maja Vodopivec | Computer science and informatics | Researcher | 2014 - 2015 | 3
15. | 23987 | PhD Martin Vuk | Mathematics | Researcher | 2013 - 2014 | 25
16. | 12536 | PhD Blaž Zupan | Computer science and informatics | Researcher | 2013 - 2016 | 531
17. | 30921 | PhD Lan Žagar | Computer science and informatics | Researcher | 2013 - 2015 | 17
18. | 32929 | Jure Žbontar | Computer science and informatics | Researcher | 2013 - 2015 | 9
19. | 35422 | PhD Marinka Žitnik | Computer science and informatics | Researcher | 2015 | 83
Organisations (2)
Abstract
We live in a data-driven society whose functioning depends on gathering and analyzing huge quantities of data. Since collecting and storing data has become very cheap, we no longer observe small sets of well-chosen variables; we routinely collect large numbers of measurements for each data instance. This holds equally true for any field of human endeavor, from science, with, for instance, genome-wide sequencing and expression profiling, to business and economics, where, say, share prices or currency exchange rates are recorded at short time intervals.
In principle, this should enable us to find much more complex and unexpected patterns in the data than before. In practice, this abundance of data is like a huge haystack, and we lack efficient methods for finding the needles or, worse still, for distinguishing needles from straws. Formally, given the huge dimensionality of the data, current data mining methods find a great number of models and patterns that fit the data equally well. Although most of them are spurious, it is mathematically impossible to distinguish them from true phenomena.
We argue that this problem is inherent in the current approach to data mining, which mostly builds new theories from the data alone, a practice initially denounced as data fishing. So far, the field has worked around the problem by biasing theories towards simplicity (e.g. using linear models, various regularizations, Occam's principle, etc.). In high-dimensional problems this approach fails because many simple theories fit the data equally well.
We intend to research what we believe to be the only viable solution to the problem. Just as classical science does not build theories from observations alone, we believe that the search for models, patterns and visualizations in data mining should build on existing knowledge about the domain. For the purpose of the project, this prior knowledge can take any machine-readable form that describes the relations between variables: for instance, an ontology or a network of entities corresponding to the variables, correlations between the variables observed in past experiments, rules explicitly given by an expert, or text documents related to the topic, which can be used to statistically relate the variables.
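To make the idea of machine-readable prior knowledge concrete, one such form, a weighted network over the variables, can be derived from ontology annotations. The following sketch (hypothetical names, not code from the project) connects two variables with an edge weighted by the Jaccard similarity of their annotation term sets:

```python
def jaccard_graph(annotations, threshold=0.0):
    """Build a weighted graph over variables from their annotations.

    annotations: dict mapping a variable name to a set of ontology terms.
    Returns a dict mapping (var_a, var_b) pairs to the Jaccard similarity
    of their term sets, keeping only edges with weight above threshold.
    """
    names = list(annotations)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ta, tb = annotations[a], annotations[b]
            union = ta | tb
            w = len(ta & tb) / len(union) if union else 0.0
            if w > threshold:
                edges[(a, b)] = w
    return edges
```

A network built this way could then serve any of the knowledge-guided methods discussed below, regardless of which ontology supplied the terms.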
Prior knowledge should be used in all phases of the data mining process. We propose to develop methods for data transformation that will, for instance, decrease the dimensionality of the problem by using prior knowledge to construct new meaningful variables from the observed ones; note that this differs from traditional dimensionality reduction, which reduces the dimensionality of the data using the data itself. In visualization, we will develop methods for constructing useful visualizations based on available background knowledge of the problem. Predictive modeling, especially in machine learning, involves a search through a huge space of models; again, this search can be guided to incorporate the known relations between the variables. Finally, prior knowledge can be used to choose among the many models and patterns found that fit the data equally well.
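The knowledge-based data transformation mentioned above can be illustrated with a minimal sketch, assuming the prior knowledge arrives as named groups of variables (for example, gene sets or ontology terms); the function names and the averaging scheme are illustrative stand-ins, not the project's actual methods:

```python
import numpy as np

def group_features(X, feature_names, groups):
    """Collapse columns of X into one meta-feature per prior-knowledge
    group by averaging the member columns; a crude stand-in for
    set-level scoring.

    X: (instances x features) array; feature_names labels its columns.
    groups: dict mapping a group name to a set of member feature names.
    Returns the reduced matrix and the names of the groups kept.
    """
    col = {name: j for j, name in enumerate(feature_names)}
    new_cols, kept = [], []
    for gname, members in groups.items():
        js = [col[m] for m in members if m in col]
        if js:  # skip groups with no measured members
            new_cols.append(X[:, js].mean(axis=1))
            kept.append(gname)
    return np.column_stack(new_cols), kept
```

Unlike PCA or similar projections, the resulting dimensions are defined by the domain knowledge rather than by the data, so each new variable keeps a domain-level interpretation.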
The project will borrow from recent advances in genetics, the field that has made the most progress on dimensionality problems by using prior knowledge, and from statistical techniques for dimensionality reduction and machine-learning techniques for limiting the search space, neither of which currently makes much use of background knowledge. For this reason, the core of the project group consists of a PI with a background in machine learning and two members with PhDs in statistics and in medicine, in particular genetics.
All methods will be implemented in open source data mining tools, so they will be available for practical use on real-world problems and serve as a test bed for immediate testing and improvement of all algorithms developed within the project.
Significance for science
The basic premise of the project proposal - which was also reflected in its title - was that the use of prior or background knowledge can improve the analysis of high-dimensional data. As a result of the work on this project, we no longer think in terms of distinguishing between "prior knowledge" and "data", but prefer to consider these as different, heterogeneous data sources that can be fused together. One of the most important scientific achievements of the project team was the development of methods for data fusion that work on an (in principle) arbitrary number of data sources of any type that can be represented with matrices and connected into a graph. The technique is highly versatile and can be adapted to many different problems, as we demonstrated in a number of well-cited works.
Second, in the era of big data, networks are becoming a prominent structure for representing data, since data collections often describe sets of objects that can be (pairwise) related in different ways. As such - and in particular in the data fusion setup described above - networks played an important role in the project. We developed new techniques that make network-analytic methods practical which were previously infeasible due to their time complexity. A particular achievement was a fast combinatorial algorithm for counting graphlet orbits in large sparse networks.
Scientific progress requires tools. The group continued the development of one of the most popular open-source data mining platforms, Orange. In the past three years, it was extended with modules for working with extremely large (e.g. several terabytes) data sets, analysis of time series, spectral images, text mining, image embedding, and many other methods related, or at least tangential, to the work done within this project.
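The graphlet-orbit idea can be illustrated with a deliberately naive baseline, not the project's fast combinatorial algorithm: counting, for every node, the triangles it lies on, which corresponds to one of the simplest graphlet orbits. Function and variable names here are illustrative:

```python
def triangle_counts(edges):
    """For each node of a simple undirected graph, count the triangles
    it participates in, via per-edge neighbour-set intersections."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = dict.fromkeys(adj, 0)
    for u, v in edges:
        # Every common neighbour w closes the triangle {u, v, w}.
        # Each node of a triangle is the "third vertex" of exactly one
        # of the triangle's three edges, so it is counted exactly once.
        for w in adj[u] & adj[v]:
            counts[w] += 1
    return counts
```

For example, `triangle_counts([(0, 1), (1, 2), (0, 2), (2, 3)])` assigns one triangle each to nodes 0, 1 and 2 and none to node 3. Schemes of this intersection-based kind scale poorly on large networks, which is precisely the bottleneck the project's combinatorial algorithm addressed.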
Significance for the country
Besides the core research team, much of the work on the project was done by graduate and undergraduate students, who thus had an opportunity to experience state-of-the-art scientific work. Most of the team members come from the Bioinformatics Laboratory at the Faculty of Computer and Information Science, University of Ljubljana. The group has -- also thanks to this project -- grown stronger, and we also obtained funding from other Slovenian and foreign agencies and companies. The Bioinformatics Laboratory is currently one of the largest and most productive Slovenian research groups in this field. Through seminars, workshops and other presentations we promoted Slovenian scientific achievements abroad. Team members were also active promoters of our field among both adults and younger generations.
Most important scientific results
Annual report 2013, 2014, 2015, final report
Most important socioeconomically and culturally relevant results
Annual report 2014, 2015, final report