1.

Ensemble-based noise detection

NoiseRank is an ensemble-based method for detecting and ranking noise, errors and outliers in data. The method can combine arbitrary noise detection algorithms and supports exploration of the detected noisy instances. NoiseRank was successfully applied in a medical domain to detect atypical and falsely diagnosed cases, and in the analysis of textual data to detect unusual articles and errors in the corpus collection process. The method was made publicly available through its implementation in the web-based data mining platform ClowdFlows. In addition, the ViperCharts web environment was developed for evaluating the performance of noise and outlier detection algorithms, as well as of other machine learning and data mining algorithms. The work was published in Data Mining and Knowledge Discovery, the journal with the highest impact factor in the research area of data mining.
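The ensemble idea can be illustrated with a minimal sketch. It is not the published NoiseRank implementation: the two toy detectors, the function names, and the plain vote counting used for ranking are assumptions made only for this example. Each detector returns the indices it considers noisy, and instances are ranked by how many detectors flag them.

```python
import statistics
from collections import Counter


def far_from_mean(data, k=1.5):
    """Hypothetical detector: flag values more than k standard deviations from the mean."""
    values = [v for v, _ in data]
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if sd and abs(v - mean) > k * sd]


def label_differs_from_nearest(data):
    """Hypothetical detector: flag instances whose nearest neighbour has another label.

    Assumes at least two instances.
    """
    flagged = []
    for i, (value, label) in enumerate(data):
        others = [(abs(value - v), j) for j, (v, _) in enumerate(data) if j != i]
        _, nearest = min(others)
        if data[nearest][1] != label:
            flagged.append(i)
    return flagged


def rank_noise(data, detectors):
    """Rank instances by the number of detectors that flag them (assumed voting scheme)."""
    votes = Counter()
    for detect in detectors:
        votes.update(set(detect(data)))  # at most one vote per detector
    # Most-flagged first; ties are broken by index for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy (value, label) data: index 5 looks mislabelled, index 6 is an outlier.
data = [(1.0, "a"), (1.2, "a"), (0.9, "a"),
        (5.0, "b"), (5.1, "b"), (5.3, "a"), (20.0, "b")]
ranking = rank_noise(data, [far_from_mean, label_differs_from_nearest])
print(ranking)  # list of (instance index, number of votes)
```

Because a detector is just a function from the data to a set of indices, further detectors can be added to the list without changing the ranking code.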

F.15 Development of a new information system/databases

COBISS.SI-ID: 26557479

2.

Real-time data analysis on the ClowdFlows platform

ClowdFlows is an open cloud-based platform for composition, execution, and sharing of interactive data mining workflows. In the paper, we extend the ClowdFlows platform with the ability to mine real-time data streams. This functionality was implemented by creating a specialized type of workflow component and a stream mining daemon that delegates the execution of workflows in real time. In this way, we transformed a batch data processing platform into a real-time stream mining platform with an intuitive user interface. The real-time analytics aspect of the platform is demonstrated with a Twitter sentiment analysis use case in which the sentiment of tweets about whistleblower Edward Snowden was monitored for approximately one month.

F.15 Development of a new information system/databases

COBISS.SI-ID: 27392039

3.

Semantic data mining of financial news articles

The Department of Knowledge Technologies was the main technological partner in the project FIRST (Large scale information extraction and integration infrastructure for supporting financial decision making), which received an excellent score in its final evaluation. The novelty of the project was the analysis of large amounts of financial news, blogs and tweets. We developed prototypes for evaluating the reputation of financial institutions (project partner Banca Monte dei Paschi di Siena from Italy), detecting financial market manipulations (project partner b-next from Germany), assisting in stock trading (project partner Interactive Data Managed Solutions from Germany), and monitoring events connected to the current financial crisis.

F.15 Development of a new information system/databases

COBISS.SI-ID: 27322151

4.

A decision model for fraud detection in financial operation

In the framework of the EU project FIRST (Large scale information extraction and integration infrastructure for supporting financial decision making), we developed, in collaboration with experts from Germany, a multi-attribute model for fraud detection in financial operation. Specifically, we addressed the type of fraud known as "Pump and Dump", which refers to the illegal manipulation of financial instrument values by spreading false information. The novelty of our approach lies in combining internal financial information with the analysis of sentiment in internet documents. The proposed solution was presented in an award-winning conference paper and has been included in a well-known information system produced by a German project partner. This indicates that the solution is practically applicable and can substantially support financial institutions in detecting fraud and diminishing its consequences.

F.15 Development of a new information system/databases

COBISS.SI-ID: 26828583

5.

URL Tree method for Content Extraction from Streams of Web Documents

The URL Tree method was developed for content extraction from streams of HTML documents. The method is a component of the infrastructure that converts continuously acquired HTML documents into a stream of plain text documents. The URL Tree method is a novel content extraction algorithm that is efficient, unsupervised, and language-independent. It is based on the observation that HTML documents from the same source normally share a common template. The core of the proposed content extraction algorithm is a simple data structure called the URL Tree. The performance of the algorithm was evaluated in a stream setting on a time-stamped, semi-automatically annotated dataset, which was made publicly available, and was compared with that of several open-source content extraction algorithms. The evaluation results show that our stream-based algorithm outperforms the other algorithms after only 10 to 100 analyzed documents from a specific domain.
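The observation that documents from the same source share a template can be illustrated with a rough sketch. The report does not specify the algorithm's details, so everything below is an assumption made for illustration: the class names, the `min_docs` and `template_ratio` parameters, the representation of a document as a list of text blocks, and the rule that a block counts as template when it recurs in most documents under the same URL prefix. It should not be read as the published URL Tree method.

```python
from collections import Counter
from urllib.parse import urlsplit


class Node:
    """One URL prefix (host or directory) with statistics on the text blocks seen below it."""

    def __init__(self):
        self.children = {}
        self.doc_count = 0
        self.block_counts = Counter()


class UrlTree:
    def __init__(self, min_docs=2, template_ratio=0.8):
        self.root = Node()
        self.min_docs = min_docs              # evidence required before a node is trusted
        self.template_ratio = template_ratio  # share of documents that marks a block as template

    @staticmethod
    def _keys(url):
        # Host followed by the directory segments; the final segment (the page) is dropped.
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        return [parts.netloc] + segments[:-1]

    def process(self, url, blocks):
        """Update the tree with one document and return the blocks judged to be content."""
        unique = set(blocks)
        node, best = self.root, None
        for key in self._keys(url):
            node = node.children.setdefault(key, Node())
            node.doc_count += 1
            node.block_counts.update(unique)
            if node.doc_count >= self.min_docs:
                best = node  # deepest prefix that has seen enough documents so far
        if best is None:
            return list(blocks)  # no evidence yet: keep everything
        return [b for b in blocks
                if best.block_counts[b] / best.doc_count < self.template_ratio]


tree = UrlTree()
first = tree.process("http://example.com/news/a.html", ["Menu", "Story A", "Footer"])
second = tree.process("http://example.com/news/b.html", ["Menu", "Story B", "Footer"])
third = tree.process("http://example.com/sport/c.html",
                     ["Menu", "Sport nav", "Story C", "Footer"])
print(first, second, third)
```

In this sketch the first document from a source passes through unfiltered, and filtering improves as more documents from the same URL prefix arrive, which matches the stream setting described above.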

F.15 Development of a new information system/databases

COBISS.SI-ID: 27245863

P2-0103 — Annual report 2013
