A Backpropagation Artificial Neural Network Approach for Loan Status Prediction



I. INTRODUCTION
Every business aims for profit, and financial institutions such as banks are no exception. A bank collects money from the public through savings and redistributes it to the community as credit [1]. Providing credit has become the main activity of banks, since these transactions generate large profits for the institution. Credit is an arrangement in which one party allows its client to defer repayment for money, goods, property, or services supplied or borrowed [2]. However, lending in financial institutions can also become a major source of loss. Problems arise when creditors extend a large volume of credit without carefully examining prospective debtors. Banks are thus exposed to credit risk: the risk that debtors fail to fulfill their obligations.
Credit risk refers to the probability of default on a loan agreement. The risk grows with the increasing possibility of irrecoverable loans in cases of outright default [3]. An irrecoverable loan occurs when a debtor is unable to repay the loan installments within the given period. Credit risk reduces the value of a company's assets and, in extreme cases, may lead to the insolvency of the bank [3]. Naturally, a bank wants to avoid bankruptcy caused by debtors' irresponsibility. To prevent such incidents, the bank must implement sound credit risk management. Several steps are taken before a credit application is accepted, one of which is applicant assessment. This assessment allows the bank to determine the eligibility and ability of prospective credit recipients to repay their obligations. The aim is thus to minimize credit risk by evaluating the loan status of prospective customers, avoiding unexpected events that might inflict financial loss.
The problem addressed by applicant assessment is to distinguish customers who are eligible for a loan from those who are not. Eligible customers have their loan applications processed, while the bank retains the right to deny non-eligible customers.
In other words, an applicant assessment predicts whether a loan application will be accepted or rejected. Traditional statistical methods have been developed to obtain accurate and reliable predictions, for example Logistic Regression and Linear Discriminant Analysis. Several studies have applied traditional methods to loan status prediction. The study in [4] built a prediction model classifying loan status as accepted or rejected using Logistic Regression and the Naïve Bayes classifier, obtaining accuracies of 85.9% and 84.62%, respectively. A study by [5] using Logistic Regression to classify loan status achieved 81% accuracy. Another study [6] applying Logistic Regression to predict loan safety obtained its best result at 81.11% accuracy. These studies show good model performance, but still below 90%.
Furthermore, with the advancement of technology, information processing systems can now solve such problems with the help of machines. Machine learning has been introduced for typical classification problems, and some studies suggest that it has great classification capability and can replace traditional statistical methods [7], [8]. Machine learning is widely used for loan status prediction; for example, [9] built a loan default prediction model using the Random Forest algorithm with 98% accuracy and the Decision Tree algorithm with 95% accuracy.
The Artificial Neural Network (ANN) is one of the most widely used machine learning methods today. Research on prediction and classification has found that Neural Networks, compared with Logistic Regression, achieve higher accuracy and ROC values, and therefore outperform Logistic Regression [7], [10], [11]. This research applies one ANN algorithm, namely Backpropagation. Backpropagation is known for its ability to minimize error and generate output closer to the desired output with every pass [12], which makes it a strong candidate for classification problems. Another advantage is that Backpropagation is simple, easy to implement, and works well with complex datasets [13], [14]. Apart from the number of inputs, Backpropagation has no complex parameters that must be calculated, and its application requires no prior knowledge of the network, making it convenient to use [14]. These advantages give Backpropagation fast learning convergence for classification. Despite them, there is limited research applying Backpropagation to loan status prediction, and this paper is expected to contribute such knowledge to the audience.
This study aims to construct an applicant assessment that predicts whether a loan is accepted or rejected. The proposed method is Backpropagation: a prediction model is built from available historical data to determine which loans should be accepted or rejected. The prediction results are then used to assess the performance of Backpropagation in classifying applicant loans.
The structure of this paper is as follows. Section II explains Backpropagation and the methods used to analyze the data. Section III demonstrates the process of generating the model and presents the results obtained. Section IV concludes the study.

II. METHODOLOGY
This study applies Backpropagation to build a classification model for loan status. Backpropagation is known for its competence in recognizing data patterns and minimizing output error by optimizing the values of the model parameters.
The first step in building the prediction model with Backpropagation is to collect the dataset. The collected dataset is then transformed to improve its quality, a process called data preprocessing, which consists of six steps explained below. One notable step is feature selection: two data models are formed in order to compare prediction performance on data with fewer or more variables. After preprocessing, the data are used to train the model with the Backpropagation method. The activation function is the sigmoid function, which suits the binary classification goal of this research.

Data Collection
This research uses secondary data obtained from the online repository github.com [15], which has 983 observations. The sample size for this research is determined with Yamane's formula. There are twelve (12) variables in the dataset: eleven (11) independent variables acting as predictors and one (1) target variable, "Loan Status". The variables are summarized in Table 1 below. Descriptive statistics for the categorical variables are shown in Table 2, and for the numerical variables in Table 3.

Data Preprocessing
Data preprocessing is conducted to enhance the quality of the data and improve model performance [16]. Six preprocessing steps are applied in this study.

Data Cleaning
Missing data, expressed as null values or "NA", are eliminated by deleting the affected observations [16]. After elimination, the original 983 observations are reduced to 769.
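The cleaning step above can be sketched as a simple row filter. The rows below are hypothetical toy records, not the paper's actual dataset:

```python
# Drop any observation containing a missing value (None or the string "NA").
rows = [
    {"income": 5849, "credit_history": 1, "loan_status": "Y"},
    {"income": None, "credit_history": 1, "loan_status": "N"},    # missing income
    {"income": 4583, "credit_history": "NA", "loan_status": "N"}, # missing history
]

def is_complete(row):
    """True only when every field of the observation is present."""
    return all(v is not None and v != "NA" for v in row.values())

clean = [r for r in rows if is_complete(r)]
```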

Feature Selection
Feature selection reduces the number of input variables by removing non-informative predictors while retaining the most informative ones [17]. Two data models can be formed: the first consists of all variables in the dataset (more inputs) and the second of only the informative variables (fewer inputs). This allows the study to check Backpropagation's capability of predicting with more or fewer inputs. Since the predictors consist of numerical and categorical variables while the target variable is binary, a supervised method is applied to remove irrelevant variables based on their relationship with the target variable [17]. The Point Biserial Correlation Coefficient is used to check the correlation between the numerical variables and the target variable, computed with the formula in Equation (1) [18]:

r_{pb} = \frac{M_1 - M_t}{s_t} \sqrt{\frac{p}{q}} \quad (1)

where M_t is the overall mean of the numerical variable, M_1 is the mean for the population with an accepted loan, s_t is the overall standard deviation of the numerical variable, p is the proportion of accepted loan status in the population, and q is the proportion of rejected loan status. The Chi-square Test is used to check the correlation between the categorical variables and the target variable, with the Chi-square value obtained from Equation (2) [19]:

\chi^2 = \sum \frac{(O - E)^2}{E} \quad (2)

where O is the observed frequency and E is the expected frequency.
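Both statistics can be computed directly. The sketch below uses hypothetical toy values (an income column and a 0/1 loan status), not the paper's data, and uses the population standard deviation in the point-biserial formula:

```python
import math

def point_biserial(values, labels):
    """r_pb = (M1 - Mt) / s_t * sqrt(p / q): correlation between a
    numeric variable and a binary target (1 = accepted, 0 = rejected)."""
    n = len(values)
    mt = sum(values) / n                                    # overall mean M_t
    st = math.sqrt(sum((v - mt) ** 2 for v in values) / n)  # overall std dev s_t
    ones = [v for v, l in zip(values, labels) if l == 1]
    m1 = sum(ones) / len(ones)                              # mean of accepted group M_1
    p = len(ones) / n                                       # proportion accepted
    q = 1 - p                                               # proportion rejected
    return (m1 - mt) / st * math.sqrt(p / q)

def chi_square(observed, expected):
    """Chi-square statistic: sum((O - E)^2 / E) over all table cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

income = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical numeric predictor
status = [0, 0, 0, 1, 1, 1]               # hypothetical binary target
r = point_biserial(income, status)
x2 = chi_square([30, 70], [50, 50])
```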

Data Encoding
Machine learning requires the input and output variables to be numeric, which means every categorical variable must be encoded into numerical labels. This research applies label encoding for binary variables and dummy encoding for non-binary categorical variables.
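The two encodings can be sketched as follows; the variable values are illustrative stand-ins for the dataset's categories, and dropping the first level in the dummy encoding is a common convention the paper does not explicitly state:

```python
def label_encode(values):
    """Binary categorical variable -> 0/1 labels (e.g. "N"/"Y")."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def dummy_encode(values):
    """Non-binary categorical variable -> k-1 indicator columns,
    dropping the first (reference) level."""
    levels = sorted(set(values))[1:]  # reference level is dropped
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

status = label_encode(["Y", "N", "Y"])
area = dummy_encode(["Urban", "Rural", "Semiurban"])
```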

Data Splitting
The sample size for this study is computed with Yamane's formula in Equation (3) [20]:

n = \frac{N}{1 + N e^2} \quad (3)

where n is the sample size, N is the population size, and e is the margin of error. Yamane's formula gives a sample size of 284 observations. After setting the sample size, the data are divided into training and testing sets. The splitting ratio chosen in this study is 75% for the training set and 25% for the testing set, giving 213 observations for training and 71 for testing.
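The computation can be sketched as below. The paper does not state the margin of error used; assuming the conventional e = 0.05 applied to the full 983 observations reproduces the reported sample of 284, but that choice is an assumption:

```python
def yamane(N, e=0.05):
    """Yamane's sample size: n = N / (1 + N * e^2), truncated to an integer."""
    return int(N / (1 + N * e * e))

n = yamane(983)             # assumed N = 983 and e = 0.05
train_size = int(n * 0.75)  # 75% of the sample for training
test_size = n - train_size  # remaining 25% for testing
```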

Data Normalization
Data normalization is applied to the numerical variables [16]. The previous encoding step transformed the categorical variables into values of 0 and 1, so the numerical variables span a much larger range than the encoded categorical values. Since the dataset has different ranges, normalization is needed to scale the data into a smaller range such as -1 to 1 or 0 to 1 [16]. Min-Max Normalization transforms the data so that the minimum value becomes 0, the maximum value becomes 1, and every other value becomes a decimal between 0 and 1, as in Equation (4) [16]:

x' = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (4)

where x_{max} is the maximum value of the variable, x_{min} is the minimum value, and x is the original value.
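Equation (4) can be applied per variable as a one-line transform; the input values below are illustrative:

```python
def min_max(values):
    """x' = (x - min) / (max - min): scale every value into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max([150, 300, 600])
```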

Outlier Filtering
Noisy data might interfere with information processing. Outliers are filtered with the Interquartile Range (IQR): values outside the IQR fences are replaced so that the dataset contains zero outliers.
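One common way to realize this, sketched below, is to cap every value at the 1.5 x IQR fences. The quartile estimate and the capping rule are assumptions, since the paper does not specify how the outlier values are changed:

```python
def iqr_cap(values):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the nearest fence,
    so the returned data contain zero outliers."""
    s = sorted(values)
    q1 = s[int(0.25 * (len(s) - 1))]  # simple index-based quartile estimate
    q3 = s[int(0.75 * (len(s) - 1))]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, lo), hi) for v in values]

capped = iqr_cap([1, 2, 3, 4, 100])  # the extreme value is pulled to the fence
```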

Backpropagation Algorithm
This section explains how a problem is solved with Backpropagation. The training process requires three phases: feed-forward propagation, backward propagation, and weight adjustment [21]. These phases are defined as follows. a. Feed-forward: the inputs are inserted into the network and the model output is obtained. b. Backward propagation: the errors of the weights and biases are computed in the backward direction, starting from the output layer (the error of the output), then the hidden layer (the error of the weights and biases from the hidden units to the output unit), and finally the input layer (the error of the weights and biases from the input units to the hidden units). c. Weight adjustment: the weights and biases are updated with the computed error corrections, and the new values are used in the final model. For the testing process, feed-forward is the only phase, since the appropriate parameters were already obtained during training.
In detail, the training process of Backpropagation proceeds step by step [21]:
Step 0: Initialize the weights with small random numbers. Determine the "STOP" condition by setting the target error and the maximum number of epochs.
Step 1: While the stopping condition is false, perform Steps 2-9.
Step 2: For every training pair, perform Steps 3-8.

Phase I: Feed-forward Propagation
Step 3: Each input unit x_i (i = 1, …, n) receives an input signal and forwards it to all units in the next layer (the hidden layer).
Step 4: Each hidden unit z_j (j = 1, …, p) sums its weighted input signals, including the bias:

z\_in_j = v_{0j} + \sum_{i=1}^{n} x_i v_{ij} \quad (5)

The output signal of the hidden layer is then calculated with the predetermined activation function:

z_j = f(z\_in_j) \quad (6)

Moving forward, the hidden layer sends these signals to the units in the output layer.
Step 5: Each output unit y_k (k = 1, …, m) sums its weighted input signals, including the bias:

y\_in_k = w_{0k} + \sum_{j=1}^{p} z_j w_{jk} \quad (7)

The output signal of the output layer is then calculated with the same activation function:

y_k = f(y\_in_k) \quad (8)

which is the output of the network.

Phase II: Backward Propagation
Step 6: Each output unit y_k (k = 1, …, m) receives a target pattern t_k that corresponds to the training input pattern and calculates the error between the target and the output generated by the network:

\delta_k = (t_k - y_k) f'(y\_in_k) \quad (9)

From this, the weight correction \Delta w_{jk} = \alpha \delta_k z_j (10) and the bias correction \Delta w_{0k} = \alpha \delta_k (11) are computed. The factor \delta_k is then sent to the layer below, the hidden layer.
Step 7: Each hidden unit z_j (j = 1, …, p) sums its delta inputs from the units of the output layer:

\delta\_in_j = \sum_{k=1}^{m} \delta_k w_{jk} \quad (12)

This delta input is multiplied by the derivative of the activation function to calculate the error information:

\delta_j = \delta\_in_j f'(z\_in_j) \quad (13)

The factor \delta_j is used to calculate the error adjustment \Delta v_{ij} that later updates v_{ij}:

\Delta v_{ij} = \alpha \delta_j x_i \quad (14)

Then the bias adjustment \Delta v_{0j}, used to correct v_{0j}, is calculated:

\Delta v_{0j} = \alpha \delta_j \quad (15)

Phase III: Weight Adjustment
Step 8: Each output unit y_k (k = 1, …, m) adjusts its weights and bias from every hidden unit z_j (j = 0, …, p):

w'_{jk} = w_{jk} + \Delta w_{jk} \quad (16)

In the same way, each hidden unit z_j (j = 1, …, p) adjusts and updates its weights and bias from each input unit x_i (i = 0, …, n):

v'_{ij} = v_{ij} + \Delta v_{ij} \quad (17)

Step 9: Test the "STOP" condition.
The symbols in the equations are defined as follows: t_k is the target output, x_i is the input unit, v_{0j} is the bias on the hidden unit, z_j is the hidden unit, w_{0k} is the bias on the output unit, y_k is the output unit, \alpha is the learning rate, \Delta w_{jk} is the error-correction weight adjustment for w_{jk}, and \Delta v_{ij} is the error-correction weight adjustment for v_{ij}.
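The three phases above can be sketched as a minimal pure-Python network with one hidden layer and a single sigmoid output unit. The toy data, layer sizes, learning rate, and epoch count below are illustrative assumptions, not the paper's actual configuration:

```python
import math
import random

def f(x):
    """Sigmoid activation used in both layers."""
    return 1.0 / (1.0 + math.exp(-x))

def train(data, n_in, n_hidden, alpha=0.5, epochs=5000):
    # Step 0: initialize weights and biases with small random numbers.
    random.seed(0)
    v = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
         for _ in range(n_in + 1)]            # row 0 holds the hidden biases v_0j
    w = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]  # w[0] = bias w_0k
    for _ in range(epochs):                   # Step 1: repeat until the stop condition
        for x, t in data:                     # Step 2: for every training pair
            # Phase I: feed-forward (Steps 3-5)
            z_in = [v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n_in))
                    for j in range(n_hidden)]
            z = [f(s) for s in z_in]          # hidden signals z_j = f(z_in_j)
            y = f(w[0] + sum(z[j] * w[j + 1] for j in range(n_hidden)))
            # Phase II: backward propagation (Steps 6-7), with f' = f(1 - f)
            dk = (t - y) * y * (1 - y)        # output error term delta_k
            dj = [dk * w[j + 1] * z[j] * (1 - z[j]) for j in range(n_hidden)]
            # Phase III: weight adjustment (Step 8)
            w[0] += alpha * dk
            for j in range(n_hidden):
                w[j + 1] += alpha * dk * z[j]
                v[0][j] += alpha * dj[j]
                for i in range(n_in):
                    v[i + 1][j] += alpha * dj[j] * x[i]
    return v, w

def predict(v, w, x):
    """Testing uses only the feed-forward phase."""
    n_in, n_hidden = len(v) - 1, len(w) - 1
    z = [f(v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n_in)))
         for j in range(n_hidden)]
    return f(w[0] + sum(z[j] * w[j + 1] for j in range(n_hidden)))

# Toy separable problem standing in for the encoded loan data:
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
v, w = train(data, n_in=2, n_hidden=2)
preds = [round(predict(v, w, x)) for x, _ in data]
```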

Activation Function
The activation function in this research is the sigmoid function. The sigmoid has range y = (0, 1), so it returns a probability and is very suitable for solving classification problems [22]. Equation (17) [22] is the formula of the sigmoid:

f(x) = \frac{1}{1 + e^{-x}} \quad (17)

The derivative of this function is used in the backward propagation phase to find the error adjustments of the weights and biases, as shown in Equation (18) [22]:

f'(x) = f(x) (1 - f(x)) \quad (18)
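The sigmoid and its derivative translate directly into code:

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)): maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """f'(x) = f(x) * (1 - f(x)): used in the backward-propagation phase."""
    s = sigmoid(x)
    return s * (1 - s)
```

At x = 0 the sigmoid returns 0.5 and its derivative peaks at 0.25.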

Loss Function
The loss of model training for the binary classification problem is calculated with Binary Cross Entropy, defined in Equation (19) [23]:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \quad (19)

where y_i is the actual label, \hat{y}_i is the predicted probability, and N is the number of observations.
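Equation (19) can be computed as below; the labels and predicted probabilities are illustrative:

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """L = -(1/N) * sum(y*log(p) + (1-y)*log(1-p)) over all observations."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

loss = binary_cross_entropy([1, 0], [0.9, 0.1])  # both predictions near-correct
```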

Confusion Matrix
The confusion matrix is a performance evaluation tool for machine learning classification, both binary and multi-class [4]. It has four classification terms: TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative). These four terms are shown in the confusion matrix in Table 4.
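The four cells and the metrics derived from them (used in Section III) can be sketched as follows; the actual and predicted labels below are hypothetical, not the paper's test results:

```python
def confusion_metrics(actual, predicted):
    """Count TP, TN, FP, FN and derive the evaluation metrics."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                # also called sensitivity
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

m = confusion_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```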

III. RESULTS AND DISCUSSION
The data obtained from preprocessing are now ready for building the Backpropagation model. In preprocessing, the number of observations was reduced to 769 after eliminating missing data. The distribution of the dataset is shown in Figure 1. Table 5 summarizes the estimates and p-values.
Every numerical variable has a p-value larger than 0.05, indicating no correlation between these variables and the loan status. The Chi-square Test results are shown in Table 6. Four variables have p-values larger than 0.05, namely Gender, Dependents, Education, and Self Employed; these show no correlation with loan status. Four other variables have p-values less than 0.05: Married, Loan Term, Credit History, and Property Area. A p-value below 0.05 indicates that the two variables are correlated. The feature selection with the Point Biserial correlation and the Chi-square test therefore shows that only four variables are significant with respect to the target variable: Married, Loan Term, Credit History, and Property Area. Two data models are formed in this study: one consists of the four significant variables with loan status, and the other of every variable in the dataset. Table 7 lists the variables included in each data model.

Building Backpropagation Model
This section discusses building the Backpropagation model. Applying the formula in (3) to the 769 observations gives a sample size of 284, with 213 observations for training and 71 for testing. An important decision in building a Backpropagation model is the number of units in the hidden layer. To date, no theoretical formula has been found for the appropriate number of hidden units. However, several rules of thumb help researchers decide [24]: 1. the number of hidden units is 2/3 of the size of the input and output layers; 2. the number of hidden units is between the sizes of the input and output layers; 3. the number of hidden units is less than twice the size of the input layer. The numbers of hidden units obtained from these rules for models A and B are summarized in Table 8. From Table 8, the appropriate number of hidden units for model A by the first rule of thumb is 9; for model B it is 3. Additional models were created by reducing the number of hidden units. The learning rate is also an important hyperparameter in training. No formula for the appropriate learning rate has been found either, so trial and error is practical: a traditional starting value is 0.1, reduced on a logarithmic scale [25]. After several trials, this research settled on the learning rate most appropriate for each model. For model A, with a total of 4 models, a learning rate of 0.01 is used for training and testing. The parameters of every model A variant are summarized in Table 9. After the prediction models are generated, they are tested in classifying the loan status of applicants, and the performance of the Backpropagation models is checked with the evaluation metrics of the confusion matrix.
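One possible reading of the first rule of thumb, sketched below, is 2/3 of the input size plus the output size, truncated to an integer; this is an assumption, but it reproduces the paper's counts of 9 (model A) and 3 (model B):

```python
def hidden_units_rule(n_in, n_out=1):
    """Rule of thumb 1: hidden units ~ 2/3 * input units + output units,
    truncated to an integer (assumed interpretation)."""
    return int(2 / 3 * n_in + n_out)

a = hidden_units_rule(13)  # model A: 13 encoded inputs
b = hidden_units_rule(4)   # model B: 4 significant inputs
```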

Backpropagation Model Testing
In testing, the Backpropagation model should produce output that matches the actual output. The output of this research is a number between 0 and 1, i.e., a probability, which is then converted into accepted or rejected. A probability less than 0.5 is rounded to 0 and interpreted as a rejected loan; a probability greater than 0.5 is rounded to 1 and interpreted as an accepted loan. The testing process uses 71 observations, with the distribution shown in Figure 2.

Figure 2. Distribution of loan status in testing data
Since this study predicts a binary output, the performance of each model is examined with a 2 × 2 confusion matrix. The evaluation metrics mentioned in the previous section are accuracy, precision, recall, specificity, and F1 score. The proposed loan status classification is carried out with different data models and different Backpropagation architectures. The performance of model A in classifying the loan status is summarized in Table 11.
Model A includes every variable in the dataset, both variables correlated with Loan Status and variables without correlation. The lowest accuracy is obtained by model A4 with 80.28% and the highest by model A3 with 94.37%. The highest sensitivity is obtained by model A3 with 78.57%, followed by model A1 with 71.43% and model A2 with 64.29%. As for specificity, all model A variants obtain very high values above 90%. Additionally, model A4 cannot be considered optimal given its poor testing results. As Table 11 and Table 12 show, the Backpropagation models predict and classify the loan status of applicants with good accuracy. The smallest accuracy was obtained by model A4, with an architecture of 6 hidden nodes, which classifies the loan status correctly 80.28% of the time; the highest accuracy obtained is 94.37%.
From Table 11 and Table 12, each data model yields at least one Backpropagation model with 94.37% accuracy. In addition, model B gives identical results in predicting loan status, because its variables consist only of the inputs significant to loan status. For the given dataset, based on the results of models A and B, forming model B is more efficient because it requires a simpler architecture and less data than model A. It is therefore important to analyze the correlation between the predictor variables and the target variable before the Backpropagation model is constructed.
As for limitations, this research did not optimize the performance metrics of the model by varying parameters such as the learning rate. In addition, the research was limited to binary classification of loan applicant status, owing to the limited information available in the dataset: it contains fewer than one thousand observations and relatively few variables.

IV. CONCLUSION
This study applied the Backpropagation algorithm to predict loan status using the dataset from [15]. Good model performance metrics help financial institutions reduce credit risk.
Two simulation experiments, model A and model B, were presented. Model A uses all input variables in the dataset, involving 13 independent inputs and four model architectures distinguished by the number of hidden units. Model B filters the predictor variables, keeping only those that show an association with the target variable; as a result, it involves only 4 independent inputs and three model architectures.
The best performance metrics of model A are 94.37% accuracy, 78.57% sensitivity, 98.25% specificity, 91.67% precision, and 84.62% F1 score. The other simulation produced the same result. For the given dataset, this means that analyzing the correlation between the predictor and target variables is important, since it makes the constructed Backpropagation model more efficient.
Comparing these results with previous work, for example [4], where the authors used Logistic Regression and a Naïve Bayes classifier for loan prediction and obtained accuracies of 85.9% and 84.62% respectively, it can be concluded that the Backpropagation algorithm performs better than both.
Further studies might vary the model parameters, for example the learning rate, to find the optimal performance metrics. In addition, to obtain more realistic results, future research should use historical datasets of credit applicants from financial institutions, which would reflect a more complex dataset with more variables and a larger size.
These results are expected to encourage financial institutions to consider the Backpropagation algorithm in predictive modeling before providing credit to clients.