Background Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. Results While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact the class-prediction for high-dimensional data. Conclusions In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
COBISS.SI-ID: 30528217
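The interpolation step behind SMOTE, referred to in the abstract above, can be illustrated with a minimal sketch. It is not the authors' code: the function name `smote_sketch` and its parameters are hypothetical, and it assumes Euclidean nearest neighbors computed within the minority class only. Each synthetic sample is a random point on the segment joining a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot use more neighbors than there are other samples

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample, one of its k neighbors, and a uniform weight in [0, 1).
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))

    # Synthetic sample = point on the segment between base and neighbor.
    return X_min[base] + u * (X_min[chosen] - X_min[base])
```

Because every synthetic sample is a convex combination of two observed samples, it stays within the range of the observed minority data. This is consistent with the abstract's statement that SMOTE reduces variability and makes samples correlated. Comparing `X_min.var(axis=0)` with the variance of the output on simulated data is one way to examine this empirically.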
Background PAM, a nearest shrunken centroid method (NSC), is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or class-imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data.
COBISS.SI-ID: 30458841
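The selection criterion proposed in the abstract above, the geometric mean of the class-specific accuracies (g-means), can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `g_mean` and `pick_threshold` are hypothetical, and the cross-validated predictions for each candidate shrinkage value are assumed to have been computed elsewhere and passed in as a dictionary.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Accuracy within each true class (i.e. per-class recall).
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(accs) ** (1.0 / len(accs)))

def pick_threshold(y_true, cv_preds_by_threshold):
    """Return the shrinkage value whose CV predictions maximize g-means.

    cv_preds_by_threshold: dict mapping a candidate shrinkage value to the
    cross-validated class predictions obtained with it (assumed given).
    """
    return max(cv_preds_by_threshold,
               key=lambda t: g_mean(y_true, cv_preds_by_threshold[t]))
```

With 9 majority and 1 minority samples, a rule that always predicts the majority class has an overall error rate of only 10% but a g-means of 0, because the minority-class accuracy is 0. This is why maximizing g-means penalizes the majority-class bias that minimizing the overall error rate tolerates.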
When the age distribution at the time of cancer diagnosis changes over time, the administrative censoring due to study end may be informative. This problem has been mentioned frequently in the relative survival field, and an estimator aiming to correct this problem has been developed. In this paper, we review the existing methods for estimation in relative survival, demonstrate their deficiencies, and propose weighting to correct both the recently introduced net survival estimator and the Ederer I estimator. Using simulations and real cancer registry data, we evaluate the magnitude of the informative censoring problem. We clarify the assumptions behind the reviewed methods and provide guidance on their use in practice.
COBISS.SI-ID: 30655961
Background: Previous analyses concerning health components of European Union (EU)-funded research have shown low project participation levels of the 12 newest member states (EU-12). Additionally, there has been a lack of subject-area analysis. In the Health Research for Europe project, we screened all projects of the EU's Framework Programmes for research FP5 and FP6 (1998–2006) to identify health research projects and describe participation by country and subject area. Methods: FP5 and FP6 project databases were acquired and screened by coders to identify health-related projects, which were then categorized according to the 47 divisions of the EU Health Portal (N = 2728 projects) plus an extra group of basic/biotech projects (N = 1743). Country participation and coordination rates for projects were also analyzed. Results: Approximately 20% of the 26 946 projects (value €29.2bn) were health-related (N = 4756; value €6.04bn). Within the health categories, the largest expenditures were cancer (11.9%), other (i.e. not mental health or cardiovascular) non-communicable diseases (9.5%) and food safety (9.4%). One hundred thirty-two countries participated in these projects. Of the 27 EU countries (and five partner countries), north-western and Nordic states acquired more projects per capita. The UK led coordination with >20% of projects. EU-12 countries were generally under-represented for participation and coordination. Conclusions: Combining our findings with the associated literature, we comment on drivers determining distribution of participation and funds across countries and subject areas. Additionally, we discuss changes needed in the core EU projects database to provide greater transparency, data exploitation and return on investment in health research.
COBISS.SI-ID: 30835673
Outlier detection among over-dispersed proportions is important in healthcare quality monitoring. We had previously introduced control limits for the double-square-root chart on the basis of prediction intervals from regression-through-origin and compared our approach to common outlier detection tests. In this study, we develop our approach further by adjusting the confidence level (in the spirit of Chauvenet's criterion and Bayesian thinking) and transforming the chart into an asymmetric funnel plot. We compare it to Laney's approach (p'-chart adapted for cross-sectional data), Spiegelhalter's approach (funnel plots based on multiplicative or additive regression models) and Carling's median rule. Comparisons are performed on simulated and real data. The simulations comprise small (<0.2; highly right-skewed) and large (>0.5; symmetrically distributed) proportions, drawn in samples of size 10-100 from a lognormal distribution, either without outliers or with one outlier added. The real data comprise hospital readmissions from the UK (used by Laney and Spiegelhalter) and business indicators of healthcare quality for Slovenian hospitals. In the simulations, Spiegelhalter's approach tended to yield very high false alarm rates, except the multiplicative version in very small samples. Laney's approach produced the fewest false alarms but could not detect the outlier in very small samples among small proportions, and regardless of sample size among large proportions. The median rule performed similarly. Our approach performed best overall: although it is slightly less liberal than the median rule for small proportions, it appears to be the only generally useful approach for large proportions.
COBISS.SI-ID: 1848681