Sentiment Analysis of Public Opinion Covid-19 Vaccine Using Naïve Bayes and Random Forest Methods

ABSTRACT
Coronavirus disease 2019 (COVID-19) is a new disease caused by SARS-CoV-2. It has spread to 223 countries and 25 areas around the world, including Indonesia. COVID-19 has deeply affected many aspects of life, including the environment, mental health, and the economy. The Government has implemented policies such as large-scale social restrictions to curb the spread of COVID-19, and the elevated spread has prompted the Government of Indonesia to encourage the production of a COVID-19 vaccine. Twitter is one of the media where news about the COVID-19 vaccine is actively discussed. The provision of the vaccine has drawn both support and opposition among the people of Indonesia. Many people do not want to be vaccinated because news about the effects of vaccination is spreading on social media, even though that news is not necessarily true. The Government is looking for ways to continue vaccinating the community, including by collaborating with community leaders, influencers, and others. The purpose of this study is to identify the community response to the vaccine so that the right strategy can be used. The study obtained an accuracy of 89.79% for Naïve Bayes and 84.62% for Random Forest. Indonesians tend to respond positively to the administration of the COVID-19 vaccine.


I. INTRODUCTION
COVID-19, or coronavirus disease 2019, is a pandemic disease caused by SARS-CoV-2. The coronavirus has spread around the world, including Indonesia. COVID-19 was first detected in Wuhan, China, at the end of 2019 [1]. The Indonesian government is trying to stop the spread of COVID-19 by supplying vaccines to the Indonesian population. Vaccination should protect both those who have received the vaccine and the community as a whole [2]. It therefore remains important for the Indonesian government to adopt the right policies and decisions on the allocation of human resources, funds, and vaccines, and on vaccine delivery strategies in each region of Indonesia. It is expected that this policy may include a comprehensive vaccination campaign. Twitter is one of the media widely used by the public, both by those who support and by those who oppose the administration of the COVID-19 vaccine in Indonesia. Twitter can inform policy because it carries a great deal of information and public opinion. It offers many features to its users, including the ability to post and view text messages, videos, pictures, and links. Twitter is used to obtain data in this study because textual data can be searched easily using hashtags, and the most recent and popular topics can be selected so that the resulting data is up to date. Twitter also allows researchers to collect a good sample because the number of tweets posted per day can reach 500 million [3].
In earlier research, [4] analyzed public opinion on COVID-19 vaccination using a Naïve Bayes classifier on 3,780 tweets and reported an accuracy of 93%. The present study differs in that it uses 500 data items and in the methods and RapidMiner parameters used. That study also included an emoticon conversion process, whereas this study does not because of the missing-value cleaning step. [5] analyzed sentiment related to the influence of PSBB (large-scale social restrictions) on Twitter using the KNN, Decision Tree, and Naïve Bayes algorithms. The three algorithms achieved accuracy values of 83.3% for Decision Tree, 80.80% for KNN, and 80.03% for Naïve Bayes, so Decision Tree gave the most accurate predictions. The differences from the present study lie in the main process used in RapidMiner, the use of the SMOTE Upsampling operator, and the models used.
[6] analyzed Twitter sentiment about anti-LGBT campaigns in Indonesia using the Naïve Bayes, Decision Tree, and Random Forest algorithms. The problem addressed was the heated anti-LGBT campaign being discussed by Indonesians on Twitter. The results showed mostly neutral comments from Twitter users. Processing the data with the Naïve Bayes algorithm in RapidMiner achieved an accuracy of 86.43%, higher than the Decision Tree and Random Forest algorithms, which produced an accuracy of 82.91%. The present study uses the same Naïve Bayes method. Random Forest is also considered for sentiment analysis because it is derived from the decision tree and, by combining many trees, is a more capable method than a single decision tree. Random Forest takes more time to run because it builds many trees, in which every leaf holds exactly one value.
Based on these problems and the results of previous studies, Naïve Bayes and Random Forest are the methods used in this study to analyze sentiment toward the coronavirus vaccine. This research draws on scientific literature and relevant data related to the analysis of coronavirus vaccination sentiment on Twitter. RapidMiner version 9.9 is used to handle the Twitter text data and to process the data, following previous research. The dataset was obtained by crawling Twitter directly through a connection to the RapidMiner application.
This study was conducted to analyze the public response to the COVID-19 vaccine on Twitter and to categorize it into three classes: positive, negative, and neutral. The results of this sentiment analysis can be considered by the Indonesian government in making the right policies and decisions on the allocation of human resources, funds, and vaccines, and on vaccine delivery strategies in each region of Indonesia.
This research is an original work that develops previous research on sentiment analysis.

II. METHODOLOGY
The research methodology consists of the steps taken to obtain the results of the sentiment analysis using Naïve Bayes and Random Forest. The first step is to gather a dataset on the COVID-19 vaccine by crawling Twitter. The data consist of popular responses from December 2021. In all, 500 public comments were collected using the keyword "COVID-19 vaccine". The research method is quantitative. The quantitative method studies social problems by testing theories composed of variables, which are measured numerically and analyzed statistically to determine whether the theoretical predictions hold.
The data collection techniques include:
a) The COVID-19 vaccine tweet dataset is the result of searching and retrieving tweets from Twitter.
b) The collected data are stored in Excel format and then stored in the system database.
c) Documentation from literature such as journals, papers, proceedings, and articles, selected for its relevance to the research being studied.
The steps of retrieving data from Twitter with the crawling process are as follows:
1. Determine the type of connection in this process with the help of Twitter social networks.

Sentiment Analysis
Sentiment analysis is part of data mining. It covers the computational analysis of opinions, emotions, and sentiments expressed in a text. A collection of textual documents contains sentiments about certain objects. The purpose of sentiment analysis is to identify the attributes and aspects of the topic discussed in each document and to decide whether the response is positive, negative, or neutral. Sentiment analysis is used to determine the attitude or opinion of an author toward a specific object. The attitude can indicate an opinion, a reason, or a judgment, or a tendency (how the author wants to affect the reader) [7].

Text Mining
Text mining is part of data mining. Text mining is used to process large amounts of textual data. It aims to extract the words that carry the main meaning of a document so that the relationships between documents can be analyzed [8]. All documents are preprocessed before the text extraction process so that the resulting documents are much easier to process [9].

Case Folding
Case folding, or transforming cases, is the process of converting all upper-case letters in a document to lower case.
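As a minimal sketch of this step (the study itself performs it in RapidMiner, so the function name and the sample tweet below are illustrative assumptions), case folding can be expressed with Python's built-in string method:

```python
def case_fold(text: str) -> str:
    """Convert every upper-case letter in the text to lower case."""
    return text.lower()


# Hypothetical tweet, not taken from the study's dataset.
print(case_fold("COVID-19 Vaccine Is SAFE"))  # prints: covid-19 vaccine is safe
```

Digits and punctuation are left unchanged, so tokens such as "covid-19" survive this step intact.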

Stopword Removal
Stopword removal is the process of discarding words that do not relate to the subject of the document. Because such words carry no meaning on their own, removing them does not affect the accuracy of the classification.
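The idea can be sketched as a simple filter over a list of tokens. The stopword list here is a small English example chosen for illustration; the list actually used in the study is not given, and Indonesian tweets would need an Indonesian list.

```python
# Illustrative English stopword list; the study's actual list is not given.
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}


def remove_stopwords(tokens: list) -> list:
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


print(remove_stopwords(["the", "vaccine", "is", "safe"]))  # prints: ['vaccine', 'safe']
```

The filter assumes the tokens have already been case folded, since the lookup is case-sensitive.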

Tokenizing
Tokenizing is the process of breaking text down into words, each of which becomes a token that can be parsed. The general strategy of the tokenizing step is to remove punctuation characters.
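A minimal sketch of this strategy, assuming a regular-expression tokenizer (one of several possible approaches, and not necessarily the one RapidMiner applies), treats every punctuation character as a separator:

```python
import re


def tokenize(text: str) -> list:
    """Split text into word tokens, dropping punctuation characters."""
    # \w+ matches runs of letters, digits, and underscores, so punctuation
    # and whitespace act as separators and are discarded.
    return re.findall(r"\w+", text)


print(tokenize("Vaccines work, stay safe!"))  # prints: ['Vaccines', 'work', 'stay', 'safe']
```

One consequence of this choice is that hyphenated terms are split: "COVID-19" becomes the two tokens "COVID" and "19".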

Stemming
Stemming is the process of reducing the various forms of a word to its root word. Stemming is one of the stages in pre-processing. The stemming process has an impact on the accuracy of information retrieval.
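To illustrate the principle only, the sketch below strips a few English suffixes. It is a toy rule set chosen for this example, not the stemmer used in the study; Indonesian text would need rules for Indonesian prefixes and suffixes.

```python
# Illustrative English suffixes, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def simple_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["vaccinated", "vaccinating", "vaccinates"]
print([simple_stem(w) for w in words])  # prints: ['vaccinat', 'vaccinat', 'vaccinat']
```

All three inflected forms collapse to the same stem, which is what lets the classifier count them as a single feature.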

TF-IDF
TF-IDF is a popular term-weighting scheme. The TF-IDF method is known to be effective, simple, and accurate. It calculates the Term Frequency (TF) and Inverse Document Frequency (IDF) of each term in each document of the corpus (a large, authentic, systematically collected body of text that can be stored and processed electronically).
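The weighting can be sketched as follows. This example assumes one common variant, raw term count for TF and log(N / df) for IDF, where N is the number of documents and df the number of documents containing the term; RapidMiner may apply a different normalization, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list) -> list:
    """Return one {term: weight} dict per tokenized document.

    Uses raw term counts for TF and log(N / df) for IDF, one common variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical tokenized tweets, not taken from the study's dataset.
docs = [["vaccine", "safe"], ["vaccine", "scary"], ["vaccine", "safe", "safe"]]
for row in tf_idf(docs):
    print(row)
```

In this example "vaccine" occurs in every document, so its weight is zero, while the rarer word "scary" receives the highest IDF.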

Naïve Bayes
The Naïve Bayes classification method was used in the sentiment analysis. This approach is theoretically based on both data coherence and computational classification. Naïve Bayes is widely used as a classification technique, especially for Twitter data. The Naïve Bayes classifier makes a strong (naive) assumption of independence between conditions or events. It computes the probability of each class using Bayes' theorem [10].
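Bayes' theorem states that the probability of a class given the observed words is proportional to the prior probability of the class multiplied by the probability of the words given the class. A minimal sketch of a multinomial Naïve Bayes text classifier built on that rule is shown below. The add-one (Laplace) smoothing, the function names, and the training tweets are assumptions made for this example; the RapidMiner operator used in the study may differ in its details.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore words never seen in training
            # log likelihood P(word | class), treating words as independent
            count = word_counts[label][t]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled tweets, not taken from the study's dataset.
train_docs = [["vaccine", "safe"], ["vaccine", "good"],
              ["vaccine", "scary"], ["side", "effect", "scary"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["safe", "good"]))  # prints: positive
```

Working with log probabilities avoids numerical underflow when many small word probabilities are multiplied together.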

Random Forest
The Random Forest algorithm is one of many methods used to classify data. It is an ensemble learning technique based on the decision tree algorithm. Random Forest grows many classification trees, together called a forest. To classify new data, each tree gives a prediction, and its predicted category counts as one vote. The forest chooses the category that receives the greatest number of votes [11].
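The voting step described above can be sketched in a few lines. The example covers only the aggregation of votes, not the training of the individual trees, and the tree predictions are hypothetical values made up for illustration.

```python
from collections import Counter


def forest_vote(tree_predictions: list) -> str:
    """Return the category predicted by the largest number of trees."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]


# Hypothetical predictions from five trees for one tweet.
print(forest_vote(["positive", "negative", "positive", "neutral", "positive"]))  # prints: positive
```

If two categories tie, this sketch returns the one that was encountered first; a real implementation may break ties differently.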

Confusion Matrix
The confusion matrix is used in this assessment as an evaluation tool to estimate how many objects the classification model classifies correctly and incorrectly. The matrix compares the classification results with the actual data. The precision value is calculated as:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
The recall value is calculated as:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.
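For a three-class problem such as this one, precision and recall are computed per class: precision is the share of items predicted as a class that truly belong to it, and recall is the share of items truly in a class that were predicted as such. The sketch below assumes the matrix is stored as a nested dictionary indexed by actual and then predicted label; the counts are hypothetical and are not the study's results.

```python
def per_class_metrics(matrix: dict, label: str) -> tuple:
    """Return (precision, recall) for one class.

    matrix[actual][predicted] holds the number of items with that pair.
    """
    tp = matrix[label][label]
    predicted_as_label = sum(row[label] for row in matrix.values())  # TP + FP
    actually_label = sum(matrix[label].values())                     # TP + FN
    return tp / predicted_as_label, tp / actually_label


def accuracy(matrix: dict) -> float:
    """Share of all items that lie on the diagonal of the matrix."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Hypothetical counts, not the study's results.
cm = {
    "positive": {"positive": 8, "negative": 1, "neutral": 1},
    "negative": {"positive": 2, "negative": 6, "neutral": 0},
    "neutral":  {"positive": 0, "negative": 1, "neutral": 1},
}
print(per_class_metrics(cm, "positive"))  # prints: (0.8, 0.8)
print(accuracy(cm))                       # prints: 0.75
```

The sketch does not guard against a class that is never predicted, in which case the precision denominator would be zero.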

K-Fold Cross Validation
An algorithm's performance is measured and evaluated through this process. Validation consists of separating the data into two subsets, training data and test data. The data must be randomized (shuffled) before the split so that the order of the data does not bias the result.
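In k-fold cross validation the data is divided into k folds, and each fold serves once as the test data while the remaining folds form the training data. The sketch below shows one way to generate such splits after shuffling. The value of k, the seed, and the function name are assumptions for this example, since the text does not state the settings used in RapidMiner.

```python
import random


def k_fold_splits(n_items: int, k: int, seed: int = 0) -> list:
    """Return k (train_indices, test_indices) pairs over range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # shuffle so folds ignore data order
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Ten items split into five folds: each test fold holds two items.
for train, test in k_fold_splits(n_items=10, k=5):
    print(sorted(test), len(train))
```

Every item appears in exactly one test fold, so averaging the k scores uses all of the data for evaluation without ever testing on an item the model was trained on.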

Data Collection
The data used in this research are public data obtained directly by crawling Twitter with the RapidMiner 9.9 application. The Twitter search operator is used, which first requires connecting the RapidMiner application to Twitter. The dataset is a sample of 500 public comments on the COVID-19 vaccine, collected using the keyword "COVID-19 vaccine". The data consist of popular responses from December 2021.
A dataset is a collection of data. It contains more than one variable related to a specific subject. A dataset can also be a collection of data derived from previous data that is ready to be processed into new information. The dataset in this study is publicly available data from Twitter; the attribute used is the text of the tweets.

Data Labeling
The data are manually labeled with positive, negative, or neutral sentiment. A label communicates information about the item it is attached to. Labeling is part of supervised learning, a machine learning approach in which a model learns from data with labels or targets and is evaluated against those targets.

Naïve Bayes Classification
Performance assessment results of the Naïve Bayes classification algorithm:
Figure 12. Performance of Naïve Bayes

Random Forest Classification
Performance evaluation results of the Random Forest classification algorithm:
Based on the above results, it can be concluded that, in this study, Naïve Bayes is the better classifier because it produces more accurate and precise predictions, with an accuracy of 89.79%.

IV. CONCLUSION
This study analyzed COVID-19 vaccine sentiment in Indonesia on Twitter using the Naïve Bayes and Random Forest methods. Its purpose was to analyze the public response to the administration of the COVID-19 vaccine on Twitter and to classify it as positive, negative, or neutral.
Comparing the Naïve Bayes and Random Forest algorithms shows accuracies that are not far apart: Naïve Bayes achieved an accuracy of 89.79%, a precision of 84.04%, and a recall of 81.64%, while Random Forest achieved an accuracy of 84.62%, a precision of 83.47%, and a recall of 69.70%.
Public opinion on the provision of the COVID-19 vaccine includes positive, negative, and neutral opinions. The analysis shows that responses to COVID-19 vaccine administration tend to be positive, with 377 positive, 78 negative, and 45 neutral responses, according to the more accurate Naïve Bayes method.
Subsequent studies should use larger datasets than this study to broaden the information on the COVID-19 vaccine and to improve precision relative to earlier studies. Other methods should be used to verify that the research process is accurate. Further research may also draw on other data sources such as Facebook, Instagram, and YouTube.

ACKNOWLEDGMENTS
Praise and gratitude are due to Allah SWT, whose blessing made it possible to complete this thesis. This publication of the bachelor's degree final project is submitted as a requirement for completing undergraduate education in the advanced information engineering department at STMIK IKMI Cirebon. Thanks go to the supervisors, who provided the motivation to finish the final project properly and on time.