Projects / Programmes source: ARIS

Advancement of computationally intensive methods for efficient modern general-purpose statistical analysis and inference

Research activity

Code Science Field Subfield
1.07.01  Natural sciences and mathematics  Computer intensive methods and applications  Algorithms 

Code Science Field
P160  Natural sciences and mathematics  Statistics, operations research, programming, actuarial mathematics 

Code Science Field
1.01  Natural Sciences  Mathematics 
Keywords
applied statistics, Bayesian statistics, Markov Chain Monte Carlo, parallelization, graphical processing units, hierarchical models, clustering
Evaluation (rules)
source: COBISS
Researchers (24)
no. Code Name and surname Research area Role Period No. of publications
1.  28779  PhD Zoran Bosnić  Computer science and informatics  Researcher  2016 - 2019  214 
2.  27510  PhD Mateja Breg Valjavec  Geography  Researcher  2016 - 2019  192 
3.  30791  PhD Rok Ciglič  Geography  Researcher  2016 - 2019  422 
4.  33795  Rok Češnovar  Computer science and informatics  Researcher  2016 - 2019  23 
5.  37645  PhD Jure Demšar  Computer science and informatics  Researcher  2018  86 
6.  14970  PhD Frane Erčulj  Educational studies  Researcher  2016 - 2019  594 
7.  29485  PhD Jana Faganeli Pucer  Computer science and informatics  Researcher  2016 - 2019  37 
8.  28473  Štefan Furlan  Computer science and informatics  Researcher  2016  35 
9.  13179  PhD Mauro Hrvatin  Humanities  Researcher  2016 - 2019  364 
10.  07553  PhD Drago Kladnik  Geography  Researcher  2016 - 2019  1,163 
11.  21464  PhD Blaž Komac  Geography  Researcher  2016 - 2019  634 
12.  04242  PhD Igor Kononenko  Computer science and informatics  Researcher  2016 - 2019  475 
13.  14565  PhD Matjaž Kukar  Computer science and informatics  Researcher  2016 - 2018  219 
14.  22723  PhD Polona Oblak  Mathematics  Researcher  2016 - 2018  138 
15.  08294  PhD Drago Perko  Geography  Researcher  2016 - 2019  1,046 
16.  37430  PhD Vida Ana Politakis  Psychology  Junior researcher  2016 - 2019  27 
17.  17893  PhD Grega Repovš  Psychology  Researcher  2016 - 2019  490 
18.  15295  PhD Marko Robnik Šikonja  Computer science and informatics  Researcher  2016 - 2018  421 
19.  36162  PhD Anka Slana Ozimič  Neurobiology  Researcher  2016 - 2018  126 
20.  33385  PhD Davor Sluga  Computer science and informatics  Researcher  2018 - 2019  31 
21.  29486  PhD Erik Štrumbelj  Computer science and informatics  Head  2016 - 2019  116 
22.  36887  Gaber Terseglav  Computer science and informatics  Researcher  2016 
23.  33290  PhD Gašper Zadnik  Computer science and informatics  Researcher  2016 - 2019  10 
24.  22245  PhD Matija Zorn  Geography  Researcher  2016 - 2019  1,231 
Organisations (5)
no. Code Research organisation City Registration number No. of publications
1.  0581  University of Ljubljana, Faculty of Arts  Ljubljana  1627058  97,992 
2.  0587  University of Ljubljana, Faculty of Sport  Ljubljana  1627040  19,184 
3.  0618  Research Centre of the Slovenian Academy of Sciences and Arts  Ljubljana  5105498000  62,991 
4.  1539  University of Ljubljana, Faculty of Computer and Information Science  Ljubljana  1627023  16,242 
5.  3346  OPTILAB d.o.o., information technology and business services  Ajdovščina  2367335  45 
Abstract
It is difficult to overstate the importance of statistical data analysis in today's world: the empirical sciences, health, finance, fraud detection, telecommunications, social networking, and marketing are just a few of the areas that rely heavily on data and their analysis. While applied statistics, especially modern Bayesian statistics, has progressed tremendously and has become much more accessible, progress has recently been slowing down, because current state-of-the-art computation cannot handle the models and volumes of data we want to analyze today. The issue of inefficient statistical computation has recently been highlighted as one of the top 5 open problems in statistics. Our primary objective is to contribute to solving this problem by researching an approach to more efficient general-purpose computation and implementing the findings in a tool that would allow us to analyze ever-growing volumes of data at a reasonable cost. We plan to achieve this objective by automatically parallelizing the most expensive parts of general-purpose Markov Chain Monte Carlo computation algorithms (in particular, Metropolis-Hastings and Hamiltonian Monte Carlo) and running them on graphical processing units. As a result of our project, we anticipate at least 100-fold speedups at a low cost (less than €1,000). Furthermore, we have attracted top researchers and experts from the University of Ljubljana, the Slovenian Academy of Sciences and Arts, and industry to participate in the project. Every data set and statistical inference problem we use to gain insight and to develop, evaluate, and validate our methodology will be part of a relevant practical problem faced by Slovenian researchers. There have been successful attempts at efficient statistical computation for very limited cases, but what we are aiming for - general-purpose inference that is automatically parallelized for highly efficient computation - is novel and has so far not been achieved.
This makes the project extremely relevant both as a significant scientific achievement in the field of computation and because of the numerous practical benefits of low-cost, accessible, high-performance statistical inference. Indications from related work suggest that the speedups we are aiming for are achievable. While this is a research project and several technical details and implementation issues remain to be resolved, we are confident of the project's feasibility: we have a set of well-defined and directly measurable requirements, we have laid out a clear plan for achieving them, and we have assembled a project team of experts from varied backgrounds with all the required knowledge and know-how. We have also attracted co-financing from industry to supplement our budget, and we will actively promote student participation. The main contributions of the project will be the theoretical research that leads to efficient computation, the practical implementation of this research in a software tool for general-purpose statistical computation, and, as a by-product, empirical research achievements in other fields of science made possible by our methodological research. Efficient computation will cut time and costs, which will directly benefit industry and, given the ubiquity and growing volumes of data, everyday life. Last but not least, the collaboration between researchers, applied researchers, industry, and students will raise the general level of applied statistical knowledge, a field that is extremely underdeveloped in Slovenia.
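To illustrate why the expensive part of Metropolis-Hastings lends itself to parallelization, here is a minimal sketch; it is not the project's tool or method. It assumes a toy model (Gaussian data with unit variance, a flat prior on the mean) and a random-walk proposal. The names `chunk_log_lik`, `log_posterior`, and `metropolis_hastings` are hypothetical. The log-likelihood is a sum of independent per-chunk terms, so the `map` step is the one that could in principle be spread across GPU threads; here it runs serially in plain Python.

```python
import math
import random


def chunk_log_lik(chunk, mu):
    """Unit-variance Gaussian log-likelihood of one data chunk, up to a constant."""
    return -0.5 * sum((x - mu) ** 2 for x in chunk)


def log_posterior(mu, chunks):
    # Flat prior assumed, so the log-posterior equals the log-likelihood.
    # The per-chunk terms are independent: this map is the data-parallel step
    # (run serially here; a GPU version would evaluate chunks concurrently).
    return sum(map(lambda chunk: chunk_log_lik(chunk, mu), chunks))


def metropolis_hastings(chunks, n_iter=5000, step=0.2, seed=0):
    """Random-walk Metropolis sampler for the mean parameter mu."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, chunks)
    samples = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, chunks)
        # Symmetric proposal: accept with probability min(1, exp(candidate - current)).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


# Usage on synthetic data (assumed true mean 3.0), split into 4 chunks of 50.
data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(200)]
chunks = [data[i:i + 50] for i in range(0, len(data), 50)]
samples = metropolis_hastings(chunks)
kept = samples[1000:]  # discard burn-in
print("posterior mean estimate:", sum(kept) / len(kept))
```

Each iteration evaluates the full log-likelihood once, so the cost per step grows with the data size while the accept/reject logic stays trivial; this is why the likelihood evaluation is the natural target for parallel hardware.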
Significance for science
Successful realization of the research project objectives in scientific terms is directly relevant to the fields of computationally intensive methods and statistical computing. Statistical computing has also been highlighted as the 2nd most important open problem in applied statistics. Our research is therefore also an important step forward in the field of applied statistics at the world-class level. The key contribution will be the research on how to automatically parallelize MCMC methods for a general class of statistical models and a practical implementation of the research findings in a tool for efficient general-purpose statistical inference. While the primary focus will be on functions that are relevant to statistics, our findings will be relevant to any field that deals with the computation of high-dimensional integrals. The computational tools developed in this project will also have a secondary effect on development in other scientific fields. Statistical inference has broad applicability and is the methodological foundation of all empirical investigation. The statistical analyses in the applied problems we have identified together with our project partners will be made more efficient, which will allow us to investigate research questions faster. More importantly, it will allow Slovenian empirical scientists, who in most cases do not have supercomputers or substantial budgets for computation, to investigate new research questions that were until now prohibitive due to model complexity and/or scale of data. All our applications deal with open research questions in their respective fields, so each of them has the potential to lead to significant scientific progress. We anticipate at least one substantial research achievement in each of the following fields: geography, neuroscience, and sports.
Significance for the country
Successful realization of the project will have a strong impact on applied statistical analysis and will have long-term benefits for Slovenian research and industry. Our results will improve the flexibility of statistical inference and cut computational costs. Statistical analysis has as broad an applicability in industry as it does in the sciences - fraud detection, marketing, and finance are just some of the industries that rely heavily on data analysis. Specifically, at least one of the application areas in the project will be in collaboration with our partner from industry, Optilab, where we plan for our methods to improve one of their services or products in either direct marketing or fraud detection. This will directly contribute to the quality of the partner's services and to their know-how, and is very likely to lead to new employment opportunities and future projects. The results of this project can just as easily be applied in health, medical prognostics and diagnostics, natural disaster prediction, and other similar fields where statistical analysis can have a strong positive impact on society. To give an example, our research group has in the past worked with the Institute of Oncology, Ljubljana, on automated breast cancer recurrence prediction, leading to positive results on a small data set of approximately 1,000 patients. However, in Slovenia alone there are today over 10,000 new cancer patients, each with a medical history, gene profile, test results, and so on, which amounts to large volumes of data. Statistical analysis of this data could lead to substantial improvement in cancer prognosis or recurrence prediction and in turn to more efficient treatment, which would benefit patients as well as reduce healthcare costs. However, in order to analyze such data, we require efficient computation.
Furthermore, the project is a collaboration between computational and statistical scientists, researchers from other applied sciences, and a partner from industry, which will facilitate the flow of ideas. Several applications included in the project have substantial economic value, such as direct marketing, predicting natural disasters (landslides, etc.), and automated player scouting. These, or potentially some of the other applications, are likely to lead to new projects or a spin-off company.
Most important scientific results Interim report, final report
Most important socioeconomically and culturally relevant results Final report