a Department of Electrical, Systems and Automatic Engineering, Universidad de León, Campus de Vegazana s/n, León 24071, Spain
b Grupo Investigación Interacciones Gen-Ambiente y Salud (GIIGAS), Centro de Investigación Biomédica en Red (CIBER), Spain
c Unit of Biomarkers and Susceptibility, Cancer Prevention and Control Programme, Catalan Institute of Oncology-IDIBELL, L’Hospitalet de Llobregat, Spain
d Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
e CIBER Epidemiologia y Salud Publica (CIBERESP), Madrid, Spain
f Epidemiology Section, Public Health Division, Department of Health of Madrid, CIBER Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
Risk prediction model
Background and objective: Risk prediction models aim to identify people at higher risk of developing a target disease. Feature selection is particularly important both to improve prediction model performance while avoiding overfitting and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with the most predictive power.
Methods: This work focuses on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work to study both the similarity among feature ranking techniques and their individual stability. A comparative analysis is carried out between the most relevant features found in this study and the features provided by experts according to state-of-the-art knowledge.
Results: The two best results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and Logistic Regression with the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that performing feature selection improves AUC by 3.9% and 1.9% for the SVM and Logistic Regression classifiers, respectively, relative to the results using the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable, while Random Forest is the most stable.
Conclusions: This study demonstrates that stability and model performance should be studied jointly: Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, whereas the SVM wrapper and the Pearson correlation coefficient are moderately stable while achieving good model performance.
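As an illustration of what a scalar stability metric for feature ranking can look like, the following is a minimal sketch and not the procedure used in this work. It assumes a Pearson-correlation filter ranking, bootstrap resampling of the training data, and the mean pairwise Jaccard index over top-k subsets as the stability score; the function names `pearson_ranking`, `jaccard` and `topk_stability` are hypothetical.

```python
import numpy as np

def pearson_ranking(X, y):
    """Return feature indices sorted by |Pearson correlation with y|, best first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features (zero denominator) get correlation 0 instead of NaN.
    corr = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(corr), kind="stable")

def jaccard(a, b):
    """Jaccard index between two collections of feature indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topk_stability(X, y, k, n_resamples=20, seed=0):
    """Mean pairwise Jaccard index of top-k subsets over bootstrap resamples.

    Assumption: bootstrap resampling and Jaccard are illustrative choices;
    other resampling schemes and similarity measures are equally possible.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        subsets.append(pearson_ranking(X[idx], y[idx])[:k])
    scores = [
        jaccard(subsets[i], subsets[j])
        for i in range(n_resamples)
        for j in range(i + 1, n_resamples)
    ]
    return float(np.mean(scores))
```

A score close to 1 means the same top-k features are selected regardless of the resample, while a score close to 0 means the selected subset changes almost completely. Reading such a score alongside AUC reflects the point made in the conclusions that stability and performance should be examined together.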
∗ Corresponding author. E-mail addresses: [email protected] (N. Cueto-López), [email protected] (M.T. García-Ordás), [email protected] (V. Dávila-Batista), [email protected] (V. Moreno), [email protected] (N. Aragonés), [email protected] (R.
Colorectal cancer (CRC) ranks third and second in incidence among all cancers in men and women, respectively, worldwide. It is the fourth leading cause of cancer death in the world, accounting for over one million new cases diagnosed and more than 880,000 deaths in 2018. CRC has increased steadily worldwide since the 1960s, but there is substantial geographical variation in incidence and mortality rates. The distribution of CRC varies widely, with more than two-thirds of all cases and about 60% of all deaths occurring in countries with a high human development index. Thus, CRC rates are rising in countries that are undergoing rapid economic development, owing to economic transitions and their relation to lifestyle factors such as diet, physical inactivity and obesity. On the other hand, preventive screening and specialized care are