Application of Fuzzy K-Nearest Neighbor (FKNN) to Detect Parkinson’s Disease

Parkinson’s disease is a neurological disorder characterized by a gradual loss of the brain cells that make and store dopamine. Researchers estimate that four to six million people worldwide are living with Parkinson’s disease. The average age of patients is 60 years, but some are diagnosed at age 40 or even younger, and some patients learn too late that they have the disease. In this paper, we present a diagnosis system based on Fuzzy K-Nearest Neighbor (FKNN) to detect Parkinson’s disease. We use the Parkinson’s disease dataset from the UCI Machine Learning Repository. The dataset is first normalized and then analyzed using Principal Component Analysis (PCA). The analysis yields four new factors that together explain 85.719% of the total variance. In the classification step, we use 50%, 60%, 70%, 75%, 80% and 90% of the data for training, with k = 3, 5, 7, and 9. The highest accuracy is obtained with 90% training data and k = 5, where 19 test data are classified correctly (14 positive and 5 negative), while 1 positive datum is classified incorrectly.


INTRODUCTION
Parkinson's disease is a neurological disease that causes the loss of the brain cells that make and store dopamine, which is needed to send the messages that control movement in the body. Researchers estimate that 4-6 million people worldwide live with Parkinson's disease. The average age of sufferers is 60 years, but some are aged 40 or even younger. Some sufferers find out too late that they have Parkinson's disease [1]. Parkinson's disease was first described by Dr. James Parkinson in 1817. Common symptoms are muscle weakness, slow and stiff movements, blood pressure problems, tremors, and loss of balance. The cause is still unknown, although researchers believe that Parkinson's disease can be caused by a combination of environmental and genetic factors. So far no treatment can cure Parkinson's disease; the available therapies and drugs only inhibit cell damage [2]. Therefore, we need a program that can detect Parkinson's disease earlier.
The use of computer-based systems as an analytical technique in diagnosing disease is important. Machine learning is an analytical approach that helps deal with big data by developing computer algorithms. Machine learning is broadly classified into supervised learning and unsupervised learning [3]. Fuzzy methods used for clustering include Fuzzy C-Means (FCM) and Subtractive Clustering, while those used for classification include Sugeno, Tsukamoto, and Mamdani. Hybrid fuzzy methods include the Adaptive Neuro Fuzzy Inference System (ANFIS), Fuzzy K-Nearest Neighbor (FKNN), Fuzzy Neural Network (FNN), and others.
Many studies have used either classification or clustering. First, Novitasari et al. [4] used the Fast Fourier Transform (FFT) and ANFIS to classify epilepsy. Their EEG signal classification system achieved an accuracy, sensitivity, and precision of 100% with two classes (Normal, Epilepsy), and an accuracy of 89.33%, a sensitivity of 89.37%, and a precision of 89.33% with three classes (Normal, Not Seizure Epilepsy, Epilepsy). Second, Novitasari et al. [5] used Fuzzy C-means, the gray level co-occurrence matrix, and a support vector machine to classify Alzheimer's disease, with an accuracy of 93.33%. Third, Afifah et al. [6] used Fuzzy C-means to cluster rice fields in Indonesia as an evaluation of the availability of food production; the results indicate that the rice fields with the greatest potential are in East Java, Central Java, and West Java. Fourth, Novitasari et al. [7] used Fuzzy C-means and Adaptive Neighborhood Modified Backpropagation (ANMBP) to classify EEG signals, with a preliminary system accuracy of 74.37%. Fifth, Novitasari et al. [8] used Fuzzy C-means and ANFIS to classify EEG signals, obtaining an accuracy of 89.19% using a 2-level wavelet and FCM with 3 clusters. Finally, Febrianti et al. [9] compared the K-means method with Fuzzy C-means for clustering the iris data. Fuzzy C-means gave an RMSE of 2.2122E-14 with 80 training data and 70 checking data, and the study concluded that Fuzzy C-means is more accurate than K-means.
In this research, we use FKNN to classify healthy subjects and Parkinson's patients based on 22 attributes. The Fuzzy K-Nearest Neighbor (FKNN) method combines fuzzy logic with the KNN method. Like KNN, FKNN classifies test data based on metric similarity; its advantage is that it assigns each test datum a membership value in every class instead of a single hard label [10] [11]. Feature extraction has been widely used to reduce datasets that have a very large number of attributes, so that the dataset can be simplified. Many methods can be used for feature extraction, one of which is Principal Component Analysis (PCA). PCA is one of the oldest and most widely used multivariate statistical analysis techniques [12].
In this paper, we develop a diagnosis system based on Fuzzy K-Nearest Neighbor (FKNN) to detect Parkinson's disease, with k = 3, 5, 7, and 9 and PCA as feature extraction. We use the Parkinson's disease dataset from the UCI Machine Learning Repository. The data are divided into training and testing sets, with 50%, 60%, 70%, 75%, 80% and 90% of the data used for training. In the classification step, we use the confusion matrix to compare the accuracy.

METHOD
We use the Parkinson dataset from the UCI Machine Learning Repository. The purpose of this dataset is to distinguish healthy people from those suffering from Parkinson's disease on the basis of various medical tests. The dataset has 22 attributes and consists of 195 data samples divided into 2 classes: 147 positive Parkinson data, indicated by label 1, and 48 negative Parkinson (healthy/normal) data, indicated by label 0. Table 1 shows an example of the Parkinson dataset.
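The training percentages used in this study imply the following split sizes for 195 samples. This is only a sketch: the shuffling, the seed, and the rounding rule (integer floor of the training size) are assumptions, since the text does not say how the split was drawn, and `split_indices` is a hypothetical helper.

```python
import numpy as np

def split_indices(n_samples, train_percent, seed=0):
    """Shuffle the sample indices and split them into training and test parts.

    The rounding rule (integer floor of the training size) and the seed are
    assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = n_samples * train_percent // 100
    return order[:n_train], order[n_train:]

# Class labels as described for the dataset: 147 positive (1), 48 negative (0).
labels = np.array([1] * 147 + [0] * 48)

split_sizes = {}
for percent in (50, 60, 70, 75, 80, 90):
    train_idx, test_idx = split_indices(len(labels), percent)
    split_sizes[percent] = (len(train_idx), len(test_idx))
    print(f"{percent}% training: {len(train_idx)} train, {len(test_idx)} test")
```

With this rounding rule the 90% split leaves 20 test samples, which agrees with the 19 correct and 1 incorrect classifications reported for that setting.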

Pre-processing Data
Pre-processing consists of three steps: data normalization, variable reduction using PCA, and division of the data into training and testing sets. PCA is one of the oldest and most widely used multivariate statistical techniques. It is used to find out which variables are most influential in a dataset and to reduce datasets that have a very large number of variables, so that the dataset can be simplified. Let the dataset be a matrix of size (n × D), where n is the number of observations $x_i$, $i \in \{1, 2, 3, \ldots, n\}$, and D is the number of variables in the dataset. The general PCA algorithm is [13]:
1. Perform the KMO test:
$$\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}$$
where $r_{ij}$ is the correlation coefficient between variables $i$ and $j$ and $a_{ij}$ is the partial correlation coefficient between them.
2. Compute the covariance matrix $C$ of the normalized dataset.
3. Calculate the eigenvalues $\lambda$ using $\det(C - \lambda I) = 0$, and the eigenvectors $v$ using $(C - \lambda I)\,v = 0$.
4. The eigenvectors obtained are the principal components; each new variable is formed as the product of an eigenvector $v$ and the normalized dataset matrix.
5. Calculate the variance explained by component $i$ using $\frac{\lambda_i}{\sum_{j=1}^{D} \lambda_j} \times 100\%$.
6. Determine the number of new variables $q$ from the cumulative contribution $\frac{\sum_{i=1}^{q} \lambda_i}{\sum_{j=1}^{D} \lambda_j} \times 100\%$, where $\lambda_1 > \lambda_2 > \ldots > \lambda_D$.
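The eigen-decomposition, variance, and cumulative-contribution steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: z-score normalization, the covariance matrix as the matrix being decomposed, the `threshold` parameter, and the synthetic demo data are all assumptions.

```python
import numpy as np

def zscore(X):
    """Standardize each column to zero mean and unit variance (assumed normalization)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_eig(X, threshold=85.0):
    """PCA by eigen-decomposition of the covariance matrix of the normalized data.

    Returns the component scores, the variance percentage of every component,
    and the number q of components kept. `threshold` (in percent) is a
    hypothetical cut-off on the cumulative contribution.
    """
    Z = zscore(X)
    C = np.cov(Z, rowvar=False)                  # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh is suitable: C is symmetric
    order = np.argsort(eigvals)[::-1]            # sort so that lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    var_pct = 100.0 * eigvals / eigvals.sum()    # variance explained per component
    cum_pct = np.cumsum(var_pct)                 # cumulative contribution
    q = int(np.searchsorted(cum_pct, threshold)) + 1   # smallest q reaching the threshold
    q = min(q, len(eigvals))
    scores = Z @ eigvecs[:, :q]                  # new variables (component scores)
    return scores, var_pct, q

# Synthetic demo data (not the Parkinson dataset): 195 samples, 6 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(195, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(195, 6))
scores, var_pct, q = pca_eig(X_demo, threshold=85.0)
print(q, np.round(var_pct, 2))
```

Because the columns are standardized first, the covariance matrix used here equals the correlation matrix of the original variables.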

The FKNN algorithm
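The steps in FKNN are [11] [10]:
1. Determine $k$, the number of nearest neighbors, where $1 \le k \le n$, and $n$, the number of training data.
2. Calculate the membership values of the training data using
$$u_j(x) = \begin{cases} 0.51 + 0.49\,(n_j / K), & j = c \\ 0.49\,(n_j / K), & j \neq c \end{cases}$$
where $\sum_j u_j(x) = 1$, $c$ is the known class of training datum $x$, $n_j$ is the number of members of class $j$ among the $K$ training data, and $j$ is the class index.
3. Calculate the Euclidean distance from the training data to the test data.
4. Sort the Euclidean distances in ascending order.
5. Determine the $k$ nearest neighbors of the new (test) data.
6. Calculate the membership value of the new data in each class:
$$u_i(x) = \frac{\sum_{j=1}^{K} u_{ij}\, \lVert x - x_j \rVert^{-2/(m-1)}}{\sum_{j=1}^{K} \lVert x - x_j \rVert^{-2/(m-1)}}$$
where $u_i(x)$ is the membership value of data $x$ in class $i$, $u_{ij} = u_i(x_j)$ is the membership value of neighbor $x_j$ in class $i$, $K$ is the number of nearest neighbors, $\lVert x - x_j \rVert$ is the distance between $x$ and $x_j$ among the $K$ nearest neighbors, and $m > 1$ is the weight exponent.
7. Select the class with the largest membership value as the output.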
These steps are illustrated in Figure 1.
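The FKNN steps can be sketched in NumPy as follows. Several choices here are assumptions rather than details taken from the text: the training memberships in step 2 use the 0.51/0.49 rule of the original FKNN formulation by Keller et al., with $n_j$ counted among the k nearest training neighbors of each training datum; the default `m = 2`; the small `eps` that guards against zero distances; and the synthetic demo data.

```python
import numpy as np

def init_memberships(X_train, y_train, n_classes, k):
    """Step 2: fuzzy memberships of the training data (assumed 0.51/0.49 rule).

    y_train must hold integer class labels 0 .. n_classes-1.
    """
    n = len(X_train)
    dist = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                # a datum is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :k]          # k nearest training neighbors
    U = np.zeros((n, n_classes))
    for j in range(n_classes):
        U[:, j] = 0.49 * (y_train[nn] == j).sum(axis=1) / k
    U[np.arange(n), y_train] += 0.51              # extra weight for the known class
    return U

def fknn_predict(X_train, y_train, X_test, k=5, m=2.0, n_classes=2, eps=1e-12):
    """Steps 3 to 7: distance-weighted class memberships and predicted classes."""
    U = init_memberships(X_train, y_train, n_classes, k)
    preds, memberships = [], []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)   # step 3: Euclidean distances
        nn = np.argsort(d)[:k]                    # steps 4-5: k nearest neighbors
        w = 1.0 / np.maximum(d[nn], eps) ** (2.0 / (m - 1.0))
        u = (U[nn] * w[:, None]).sum(axis=0) / w.sum()   # step 6: memberships
        memberships.append(u)
        preds.append(int(np.argmax(u)))           # step 7: largest membership wins
    return np.array(preds), np.array(memberships)

# Synthetic demo data (not the Parkinson dataset): two well-separated classes.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
                     rng.normal(3.0, 0.5, size=(30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
X_test = np.array([[0.1, -0.2], [2.9, 3.1]])
pred, memb = fknn_predict(X_train, y_train, X_test, k=5, m=2.0)
print(pred, np.round(memb, 3))
```

Unlike a plain KNN vote, `memb` reports how strongly each test datum belongs to every class, and each row sums to 1.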

Performance Evaluation Using the Confusion Matrix
The next step is to evaluate performance using the confusion matrix. A confusion matrix is used to check the performance of a classification model on a set of test data for which the true values are known. Most performance measures, such as precision, recall (sensitivity), accuracy, and specificity, are calculated from the confusion matrix:
1. Accuracy (ACC) is the number of all correct predictions divided by the total number of test data:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
2. Recall/Sensitivity (SN) is the number of correct positive predictions divided by the total number of positives:
$$\mathrm{SN} = \frac{TP}{TP + FN}$$
3. Specificity (SP) is the number of correct negative predictions divided by the total number of negatives:
$$\mathrm{SP} = \frac{TN}{TN + FP}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, as explained in Table 2 [14].
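The three measures can be computed from true and predicted labels as in the following sketch. The label vectors are hypothetical and the function names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p != positive:
            tn += 1
        elif t != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall), and specificity from the four counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return acc, sn, sp

# Hypothetical labels: 1 = Parkinson's, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc, sn, sp = metrics(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc, sn, sp)
```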

The Result of PCA
The first PCA run gives an MSA value of 0.491 for the D2 variable. Because this value is less than 0.5, the D2 variable is removed, after which the KMO value increases to 0.892, with a significance value of 0.000 and a Bartlett's Test of Sphericity value of 13244.618. This means that the data meet the requirements for analysis using PCA. The next step is to choose the variables to be entered into each factor, namely the variables with a loading value of more than 0.5. The results of this analysis can be seen in Table 3. In Table 4, the variance values of the components are 61.610%, 11.176%, 7.089%, and 5.844%, respectively. This means that the four new components explain 85.719% of the diversity of the data. Factor scores for each component can be seen in Table 3. These factor scores are used in the classification process.
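KMO and per-variable MSA values of this kind can be computed from the correlation matrix and the partial correlations, as in the sketch below. It assumes that the partial correlations are obtained from the inverse of the correlation matrix and uses synthetic data, so it will not reproduce the values reported here.

```python
import numpy as np

def kmo_msa(X):
    """Overall KMO and per-variable MSA for the columns of X."""
    R = np.corrcoef(X, rowvar=False)              # correlation coefficients r_ij
    P = np.linalg.inv(R)                          # inverse correlation matrix
    A = -P / np.sqrt(np.outer(np.diag(P), np.diag(P)))   # partial correlations a_ij
    off = ~np.eye(R.shape[0], dtype=bool)         # mask selecting i != j
    r2 = np.where(off, R ** 2, 0.0)
    a2 = np.where(off, A ** 2, 0.0)
    kmo = r2.sum() / (r2.sum() + a2.sum())
    msa = r2.sum(axis=0) / (r2.sum(axis=0) + a2.sum(axis=0))
    return kmo, msa

# Synthetic demo data: four variables driven by one common factor plus noise.
rng = np.random.default_rng(1)
factor = rng.normal(size=(500, 1))
X_demo = factor + 0.5 * rng.normal(size=(500, 4))
kmo, msa = kmo_msa(X_demo)
print(round(float(kmo), 3), np.round(msa, 3))

# Variables whose MSA falls below 0.5 are candidates for removal.
low_msa = np.where(msa < 0.5)[0]
```

After a variable is removed, the KMO and MSA values would be recomputed on the remaining columns.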

Classification Results
To detect Parkinson's disease, we classify the data based on the factors produced by PCA using the FKNN method. In the classification step, we use 50%, 60%, 70%, 75%, 80% and 90% of the data for training. The results of the classification using FKNN can be seen in Table 5 and Figure 2. Figure 2 shows that the Fuzzy K-Nearest Neighbor method obtains its highest accuracy, 95%, with 90% training data and k = 5, where 19 test data are classified correctly (14 positive and 5 negative), while 1 positive datum is classified incorrectly.
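As a check on the best result, the reported counts can be substituted into the confusion-matrix measures. The values TP = 14, TN = 5, and FN = 1 follow from the counts above; FP = 0 is inferred from the fact that the single error is a positive datum, and the sensitivity and specificity below are derived from these counts rather than reported in the text.

```latex
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{14 + 5}{14 + 5 + 0 + 1} = \frac{19}{20} = 95\%

\mathrm{SN} = \frac{TP}{TP + FN} = \frac{14}{15} \approx 93.33\%

\mathrm{SP} = \frac{TN}{TN + FP} = \frac{5}{5} = 100\%
```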