Keywords: Heart disease, K-Nearest Neighbors (KNN), random forest, feature selection
Abstract
Heart disease is one of the leading causes of death worldwide, claiming millions of lives each year. To address this public health challenge, early prediction of heart disease with machine learning has become an active area of research. This study examines how the number of input features affects the performance of a K-Nearest Neighbors (KNN) model in predicting heart disease. First, a random forest algorithm was used to rank the importance of a large set of candidate features and identify those most influential in predicting heart disease. Then, starting with the most important features, the number of features supplied to the KNN model was increased incrementally, and the model’s accuracy and recall were compared across the resulting feature subsets. The results show that predictive performance does not improve consistently as the number of features grows. Accuracy drops sharply when features are first added; it recovers slightly afterwards but never returns to the level observed with fewer features. Recall, by contrast, improves significantly with the first added features, then fluctuates, and decreases noticeably once a certain number of features is reached. These findings demonstrate that simply adding features does not guarantee better model performance; it may instead introduce redundant information or noise that weakens the model’s effectiveness.
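The procedure summarized above (rank features with a random forest, then refit KNN on progressively larger top-ranked subsets and record accuracy and recall) can be sketched roughly as follows. This is an illustrative sketch using scikit-learn, not the study's actual code. The synthetic data from `make_classification`, the helper names `rank_features` and `evaluate_feature_counts`, the choice of `k=5`, the 70/30 split, the forest size, the one-feature-at-a-time step, and the standardization step are all assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rank_features(X_train, y_train, seed=0):
    """Return feature indices sorted from most to least important.

    Importance is the impurity-based score of a random forest
    (hypothetical helper; forest size is an assumed setting).
    """
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_train, y_train)
    return np.argsort(forest.feature_importances_)[::-1]


def evaluate_feature_counts(X_train, X_test, y_train, y_test, ranking, k=5):
    """Fit KNN on the top-1, top-2, ... ranked features.

    Returns a list of (n_features, accuracy, recall) tuples.
    Standardizing inputs before KNN is an assumption of this sketch.
    """
    results = []
    for n in range(1, len(ranking) + 1):
        cols = ranking[:n]  # the n most important features
        model = make_pipeline(StandardScaler(),
                              KNeighborsClassifier(n_neighbors=k))
        model.fit(X_train[:, cols], y_train)
        pred = model.predict(X_test[:, cols])
        results.append((n,
                        accuracy_score(y_test, pred),
                        recall_score(y_test, pred)))
    return results


# Synthetic stand-in for a heart disease dataset (assumption):
# 12 features, of which only a few carry signal.
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

ranking = rank_features(X_train, y_train)
results = evaluate_feature_counts(X_train, X_test, y_train, y_test, ranking)

for n, acc, rec in results:
    print(f"{n:2d} features  accuracy={acc:.3f}  recall={rec:.3f}")
```

Plotting the two metrics against the feature count makes it easy to see whether performance peaks at a small subset, as the abstract reports. The ranking is computed on the training split only so that the test split does not influence feature selection.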