Spam Filtering with Machine Learning

Faculty of Engineering and Environment

MSc Computer Science and Digital Technologies Project

KF7029

Project Dissertation

Project Title: Machine Learning Techniques in E-Mail Spam Filtering

Tutor Name:

Student Name:

Student ID:

Abstract

Email is a widely used channel for professional communication. Because so many people rely on it, email users are heavily targeted by marketing and promotional messages. These emails exist only to attract the user's attention and carry unwanted, useless information. A spam filtering technique based on machine learning identifies such emails and keeps them in a separate section. Earlier systems relied on keywords that the system considered threatening or harmful, and the user was alerted to a threat based on the content of the email. Such systems also move legitimate emails to the spam section when a majority of users report them, and the sender's email ID is then treated as spam and harmful. This project presents an approach to filtering spam emails using machine learning.

The system is implemented with machine learning and artificial intelligence techniques. Supervised machine learning is used to train the system so that it can classify emails. The system analyses the keywords in the carbon copy (Cc) and blind carbon copy (Bcc) fields, together with the email ID in the header of each received mail, and classifies the mail accordingly. The implementation relies on several libraries for data preprocessing. The data is collected within the project, and the conclusions and results are based on the project's outcomes. The data analysis method is experimental: the data is analysed through the findings and outcomes of the project. The complete system is also monitored and controlled. The method is based on machine learning algorithms such as KNN and SVM. The system is written in the Python programming language, and the data is pre-processed using Python libraries such as Pandas, NumPy and Matplotlib.

The resulting model filters spam emails and leaves a clean inbox containing only legitimate emails. The results of the research are based on the working project, whose outcomes and findings are listed and analysed. They give a clear understanding of how the system works. The research is based on a study of various techniques that can be used to identify and avoid spam emails. The Naïve Bayes algorithm is among the most effective machine learning techniques for this task.

1. Abstract

2. Introduction

2.1 Background

2.2 Aim and Objectives

2.3 Research Questions

2.4 Rationale

2.5 Description/Overview

3. Literature Review

Machine learning: An Overview

Reinforcement machine learning

E-mail Overview

E-Mail Structure

Email spam attacks

The framework of Investigation of Spam Classification

Email Spam Filtering techniques

Types of algorithm for machine learning

Design methodology

Detection of poisoning points

Experimental evaluation

4. Practical Work & Analysis

Spam Filter Methodology

5. Research design

6. Implementation

Importing libraries

Natural language resources

Reading dataset

Checking null values

Count of Ham or Spam

Length checking for the mail

Pre-Processing the text

Stemming

Converting the text in the vocabulary

Testing and training split

Creation of model and evaluating the model

Accuracy calculation

7. Experimental results and discussion

8. References

List of Figures

Figure 1: Machine learning tree

Figure 2: E-mail attack scenario

Figure 3: Framework of Investigation of Spam Filtering

Figure 4: Case Base Spam Filtering System

Figure 5: Linear SVM model

Figure 6: Proposed architecture of the system

Figure 7: Flowchart of defense algorithm from poisoning attack

Figure 8: MNIST result after evaluation

Figure 9: Importing libraries

Figure 10: Natural language resources

Figure 11: Reading dataset

Figure 12: Checking null values

Figure 13: Count of Ham or Spam

Figure 14: Length checking for the mail

Figure 15: Pre-Processing the text

Figure 16: Stemming

Figure 17: Converting the text in the vocabulary

Figure 18: Testing and training split

Figure 19: Creation of model

Figure 20: classification report

Figure 21: curve plotting

Figure 22: confusion matrix plotting

Figure 23: Accuracy calculation

Figure 24: checking the count of spam or ham

Figure 25: true positive rate and the false positive rate

Figure 26: confusion matrix

Figure 27: accuracy

Introduction

2.1 Background

Email is one of the most effective channels for professional communication, and many people use it for both professional and personal purposes. Engagement on the platform is very high, which makes its users a target for advertising and digital marketing emails. Many technologies are used for email filtering. Past research on email filtering identifies spam emails by the keywords found in the mail: the platform automatically marks an email as spam when specific keywords are detected. Past research also describes emails being moved to the spam section when a majority of users report them. This approach places even legitimate emails in spam if a majority of people have marked or reported them as spam.

The email platform alerts the user to spam and threats. This project is based on artificial intelligence and machine learning, and identifies digital marketing and promotional mails using a dataset. The system uses supervised machine learning, which learns from labelled examples. It is trained on labelled mails and then places mails classified as spam in the spam section. This provides a better user experience, because the user gets a clear view of legitimate emails in the inbox. This research builds on past work by using more advanced technologies, which provide more security and a clearer application of filtering techniques in the system. The project can be useful in many fields. The project and report show how the technology has evolved over time, and compare the newer techniques with the past techniques used in spam email filtering systems.

2.2 Aim and Objectives

Aim

The aim of this research and project is to create and analyse an artificially intelligent machine learning model, built with the help of a dataset, that can detect spam emails and distinguish them from genuine emails.

Objectives

To gain a basic understanding of email spam filtering techniques.
To gain a basic understanding of machine learning and datasets.
To understand the integration of artificial intelligence with machine learning.
To identify different email spam filtering techniques.
To understand the different libraries of Python.
To understand the different machine learning mechanisms to be implemented.
To implement machine learning and artificial intelligence to design the email spam filtering system.
To analyze the outcomes and control the complete system.

2.3 Research Questions

Do spam emails have a distinct underlying pattern that can be detected by an artificially intelligent model?
Which preprocessing techniques make text processing most efficient?
Which model is best suited to this text classification scenario?
Why is a supervised machine learning algorithm better suited to this project?
Which methodology is best suited to make this research most efficient?
Can AI approaches to the management of email be improved using current methodologies?

2.4 Rationale

Research into email spam filtering with machine learning can benefit many platforms. Providers of email services can study and implement this integrated system to offer a better user experience. The research will also help computer science students understand the technologies used in the project, and students can build the project themselves to understand the system and technologies better. The report will be useful to researchers working on machine learning, and provides a basic and clear understanding of the complete system.

Many companies can use this algorithm and implement the system to protect their employees from spam mails. The report can also be used to teach students about machine learning. The artificial intelligence techniques used in the project can serve researchers as a basis for further study: they can analyse the system and carry out further research on it if required. Overall, the project and the integrated system can be useful to many people in many fields.

2.5 Description/Overview

The model is built on artificial intelligence to detect spam emails in the user's inbox, filter them, and store them in a separate section. The project draws on several technologies, including machine learning, artificial intelligence and spam filtering. The topic of email spam filtering was selected because spam is a major issue faced by everyone who uses email services: users receive spam emails all day long from many different websites and servers.

This project will benefit all email users by giving them a better experience of the platform. Spam emails are filtered automatically by the system and stored in a separate spam section, where the user can still access them whenever required.

Literature Review

This section of the report discusses previous research on machine learning techniques in email spam filtering.

Machine learning: An Overview

According to Janiesch et al. 2021, artificial intelligence and machine learning are among the fastest-growing fields. Several machine learning algorithms are used to make human life easier, and machine learning is applied in many procedures. Machine learning works best alongside data science: it analyses the data presented to it and takes action accordingly. Machine learning programs become more effective over time because they adjust their behaviour with each experience and learn on their own. Tasks such as object detection and natural language translation are now performed by machine learning techniques. Instead of explicitly programming every rule in a system, machine learning can be used to learn meaningful relationships and patterns from examples, which are then useful when performing operations (Janiesch et al. 2021). Machine learning can therefore be used to develop a system or method that identifies spam mail and takes appropriate action. This technique can identify spam mails automatically without any human interaction (Janiesch et al. 2021).

Figure 1: Machine learning tree

(Source: Mahesh 2020).

According to Mahesh 2020, machine learning methods are classified into three main categories:

Supervised machine learning: Also known as supervised learning, it uses a labeled database to train algorithms to classify data and predict outcomes. As input data is fed into the model, the model adjusts its weights until it fits the data appropriately, and the cross-validation process makes sure that the model is neither underfitting nor overfitting. It is used to identify the spam mails among incoming mails. Methods used in supervised learning include random forest, linear regression, neural networks, and logistic regression (Mahesh 2020).
Unsupervised machine learning: Also known as unsupervised learning, it is used to cluster and analyze unlabeled datasets. These algorithms can discover differences and similarities in any information, and can find hidden patterns without the need for any human intervention. It is mainly used for pattern and image recognition, customer segmentation, and data analysis. Two common approaches used here are singular value decomposition (SVD) and principal component analysis (PCA). With the help of these approaches, this machine learning method can help to identify spam mails by finding the differences and similarities in the email data (Barushka 2020).
Semi-supervised Learning: This is a combination of unsupervised and supervised learning. It uses a smaller labeled dataset to guide classification and feature extraction from a larger unlabeled dataset. It can solve the problem of not having enough labeled data to train a supervised learning algorithm. This type of learning can be very useful for identifying spam mails in the mailbox because, as a combination of both kinds of learning, it can analyze both labeled and unlabeled data (Barushka 2020).
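The cross-validation check mentioned under supervised learning can be illustrated with a minimal sketch. The function name k_fold_indices and the use of contiguous, unshuffled folds are assumptions made here for illustration only; they are not taken from Mahesh 2020. Each fold is held out once for validation while the remaining samples are used for training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper: folds are contiguous and unshuffled, and the
    first n_samples % k folds receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Ten labelled emails split into three folds.
folds = list(k_fold_indices(10, 3))
```

A model that scores well on the training indices but poorly on the held-out indices of most folds is likely overfitting; poor scores on both suggest underfitting.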

Reinforcement machine learning

Reinforcement machine learning is a behavioral machine learning model that is quite similar to supervised learning. The model learns from experience through trial and error. Trial and error improves its decision-making ability and leads it to the best recommendation or policy for a given problem. Spam mails are often resent regularly with small modifications, which can make them difficult for other learning techniques to identify, but a reinforcement learning system adapts through its trials and errors. This type of learning is well suited to problems that change over time, and the quality of its decisions improves with experience (IBM 2022).

E-mail Overview

According to Dada et al. 2019, unwanted mails and commercial bulk emails, known as spam, have recently become a big problem on the internet. Communication has become simple and easy with the help of electronic devices and the internet. Email is one such mode of communication and is usually used professionally, to contact or inform another person about something. Because email is so important for communication, it has become a common medium for companies to promote their brands (Dada et al. 2019).

E-Mail Structure

According to Bhowmick and Hazarika 2018, an email is divided into two elements: the header and the body. The body of an email contains unstructured data such as text, HTML mark-up, and multimedia objects. The header of the mail contains structured data such as the sender's name, recipient information, email ID, and more (Bhowmick and Hazarika 2018).

-Received: Contains information such as the IP address, email ID and email server.

-From: The sender's name and email ID, e.g. <e-mail@example.com>

-To: The recipient's information, such as name and email ID, e.g. <e-mail@example.com>

-Return-Path: An optional address that is used if an error occurs while sending the email.

-Message-ID: A single unique message identifier generated by the mail system.

-X-Mailer: The mail software used to send the message.

-Subject: The purpose of the mail.

-Content-Type: The format of the mail.
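The header fields listed above can be read programmatically before any classification takes place. The following is a minimal sketch using Python's standard email package; the sample message, the addresses and the chosen feature names are invented for illustration and do not come from the project dataset.

```python
from email import message_from_string
from email.utils import parseaddr

# Invented sample message: a header block, a blank line, then the body.
RAW_MESSAGE = """\
Received: from mail.example.org (192.0.2.10) by mx.example.com
From: Alice Sender <alice@example.org>
To: Bob Recipient <bob@example.com>
Return-Path: <bounce@example.org>
Message-ID: <1234@mail.example.org>
X-Mailer: ExampleMailer 1.0
Subject: Quarterly report
Content-Type: text/plain; charset="utf-8"

Please find the quarterly figures below.
"""


def extract_features(raw):
    """Parse a raw message and return a few header-based features."""
    msg = message_from_string(raw)
    sender_name, sender_address = parseaddr(msg["From"])
    return {
        "sender_name": sender_name,
        "sender_address": sender_address,
        "sender_domain": sender_address.rpartition("@")[2],
        "subject": msg["Subject"],
        "content_type": msg.get_content_type(),
        "has_message_id": msg["Message-ID"] is not None,
        "body": msg.get_payload().strip(),
    }


features = extract_features(RAW_MESSAGE)
```

Structured header values such as the sender domain can be passed to a classifier alongside the words extracted from the unstructured body.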

Email spam attacks

Spam emails are often used by spammers to attack a person's data and computer. A spam mail can affect the working of the computer and breach the privacy of the user. Numerous attacks take place against users worldwide to gather the data stored on a system, to hack the system, and to gain access to it. Spammers use many different techniques to get attention, and if a user clicks on such an email the spammer may gain access to the user's system. According to Karim et al. 2019, several spam attacks cause problems for users; some of them are given below.

Figure 2: E-mail attack scenario

(Source: Basit et al. 2020)

Email Phishing: This is the most common type of spam attack. The received mail is made to look as if it comes from a different person and source, typically a known, genuine user. Spammers also tamper with the domain to make the mail look real, and manipulate the data in the mail to get into the user's system. Using this technique, attackers caused a Malaysian oil distribution company a financial loss of over USD 1 million in 2017 (Karim et al. 2019).
Spear Phishing: This type of attack is harder to detect with traditional filters. The email looks as if it came from a genuine source and may contain a link to a bogus website. It is used to gather the personal and financial information of the user. The email may also carry an attachment containing malware that can infect the system and help hackers get into the computer. In 2018, sensitive financial information of a USA-based company was compromised through phishing email scams (Lin et al. 2019).
Whaling: These attacks target higher-level employees who have full authority over a company or system, such as the CEO, upper-level managers and senior executives. The emails contain serious-looking company-related information that attracts the recipient; after the recipient clicks on the mail, the scammer gains access to the computer with high-level clearance. In an attack in 2018, the French cinema chain Pathé was targeted by scammers and lost over USD 21 million (Ramprasad et al. 2019).

Email scammers rely on certain common words to scam people, because these words attract everyone's attention.

The spammer’s Trick: According to Mohammed, Ibrahim, and Salman 2021, spam mails are also known as junk mails and can be described as “unsolicited bulk email”. Spammers apply their tricks to hack or steal data across categories such as health, education, politics, finance, marketing, and adult content (Mohammed, Ibrahim and Salman 2021).

The framework of Investigation of Spam Classification

According to Sattu 2020, classifying emails with different techniques or methods involves four major steps, which are discussed below:

Figure 3: Framework of Investigation of Spam Filtering

(Source: Sattu 2020).

Dataset collection: In this step, the data of the mail is collected by a specific technique or method. All the words, text, and data present in the mail are collected and passed to the next step (Sattu 2020).
Pre-processing of the dataset: After the data has been collected, the system sends it to the next step, where it is classified and separated into different categories in preparation for the following step (Gibson et al. 2020).
Training and Testing models: This is the most important part of the process and includes steps such as word tokenization, N-gram generation, stop word removal, and feature extraction. In the training process, machine learning algorithms such as SVM, KNN, ANN, RF, DT, NB and LR are used, which allows each email to be assigned to a class such as SPAM or HAM (Gibson et al. 2020).
Result Comparison and Analysis: This is the last part of the spam classification process. The emails are compared and fully analysed in this step, and the result can be presented in two ways: graphically or in tabular form (Sattu 2020).
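The text-processing steps named in the framework (word tokenization, stop word removal, N-grams and feature extraction) can be sketched in a few lines. This is an illustrative sketch only: the stop-word list is a tiny invented sample, and the function names are assumptions rather than part of the framework described by Sattu 2020.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "you", "your"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little information."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, n=2):
    """Bag-of-words counts over single words and n-grams."""
    tokens = remove_stop_words(tokenize(text))
    return Counter(tokens + ngrams(tokens, n))


features = extract_features("Claim your FREE prize: the prize is free for you!")
```

The resulting counts form the feature vector that a classifier such as SVM, KNN or NB would be trained on.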

Email Spam Filtering techniques

In the paper “Machine learning for email filtering: review, approaches and open research problems”, Dada et al. 2019 explain different types of techniques for avoiding spam emails. Some of the filtering techniques that are applied to emails are described below.

Case Base Spam Filtering Method: This method, also known as the sample base filtering method, is a popular filtering method. First, emails are extracted from each user's mailbox using a collection method. Pre-processing then handles the emails according to the client interface, feature selection, and email data. The data of the email is then classified into two vector sets. Finally, a machine learning process analyses the data of the mails and decides whether the received mail is spam or not (Bhuiyan et al. 2018).

Figure 4: Case Base Spam Filtering System

(Source: Bhuiyan et al. 2018).

Rule-Based Spam Filtering Method: This system uses rules created in advance to evaluate emails. If an email matches a rule, that is, the words and techniques in the mail are the same as those the rule describes, the mail goes directly to spam. The rules need to be modified from time to time so that the system can identify different patterns. Some ranking rules change with time and some do not change at all. Developing new rules is necessary to catch spammers, who continuously introduce new spam messages. This technique is useful where the pattern of spam emails repeats (Dada et al. 2019).
Content-Based Filtering Technique: This method is normally used to create automatic filtering rules with the help of machine learning approaches such as Naïve Bayesian classification, neural networks, and Support Vector Machines. It identifies the most common words used in emails and filters emails based on the content of the whole mail. The technique analyses the words and phrases in the content of an email and generates rules accordingly to apply to emails (Sheikhi, Kheirabadi, and Bazzazi 2020).
Adaptive Spam Filtering Method: This technique groups emails according to their content and the words used in them. It divides the email corpus into different groups, each of which has an emblematic text. Each incoming mail is compared with the groups, and the percentage of similarity decides the group to which the mail most probably belongs (Junnarkar et al. 2021).
Previous Likeness-Based Spam Filtering Method: This approach uses instance-based, memory-based machine learning methods. It segregates incoming mails based on their resemblance to the training emails, so the filter system classifies mails according to their similarity to known examples. This method uses KNN for filtering spam emails (Dada et al. 2019).
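Of the techniques above, rule-based filtering is the simplest to illustrate. The sketch below scores a message against a small set of hand-written rules and flags it once the total passes a threshold. The patterns, scores and threshold are all hypothetical values chosen for the example; they are not taken from Dada et al. 2019 or from any real filter.

```python
import re

# Hypothetical rules as (pattern, score) pairs. In practice the rule set
# has to be revised over time as spammers change their wording.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 2.5),
    (re.compile(r"\bclick here\b", re.IGNORECASE), 1.5),
    (re.compile(r"\$\d+"), 1.0),
]
SPAM_THRESHOLD = 3.0  # assumed cut-off for this example


def spam_score(text):
    """Sum the scores of every rule whose pattern occurs in the text."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def is_spam(text):
    """Flag the message when its total score reaches the threshold."""
    return spam_score(text) >= SPAM_THRESHOLD
```

For example, is_spam("You are a WINNER! Click here for a free gift") matches three rules and is flagged, while a message that only mentions an invoice amount matches one rule and is not. The weakness described in the text is visible here: a spammer who avoids the listed words passes unnoticed until the rules are updated.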

Types of algorithm for machine learning

Some machine learning algorithms are described below:

Support Vector machine (SVM)

According to Huang et al. 2018, the support vector machine is a linear model used for classification and regression problems. It works well in email spam classification for separating different types of mails. It uses the kernel trick to transform the input data into a form in which the classes can be separated. The SVM algorithm is very useful in data classification: it can classify different types of data and use the result to make predictions. The kernel function lets the SVM handle higher-dimensional feature spaces while keeping the calculations tractable.

This algorithm can solve both linear and non-linear problems, which helps to classify the mails (Huang et al. 2018).
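To make the linear case concrete, the following minimal sketch trains a linear SVM by sub-gradient descent on the regularised hinge loss. This training procedure, the hyperparameter values and the toy data are assumptions chosen for illustration; Huang et al. 2018 do not prescribe them, and no kernel is used here. Labels are +1 for spam and -1 for ham, and each sample is a pair of invented word counts.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b for labels in {-1, +1}.

    Minimises lam/2 * ||w||^2 plus the hinge loss, one sample at a time.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample is inside the margin: move the boundary away from it.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely classified: only apply the regulariser.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (ham) from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy features: (count of "free", count of "meeting"). Invented data.
X = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 4]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

The learned weight for "free" is positive and the weight for "meeting" is negative, so the sign of w·x + b separates the two classes in this toy data.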

Figure 5: Linear SVM model

(Source: Huang et al. 2018)

KNN algorithm

Larijani et al. 2019 explain in their paper that KNN is a supervised machine learning method that can be used to solve regression and classification problems. The algorithm combines three kinds of learning: instance-based learning, lazy learning and non-parametric learning. It predicts the class of a data point from the classes of its nearest neighbours. One drawback is that it becomes slow as the size of the data grows. The algorithm can help the system classify received emails and separate them accordingly, and can provide highly accurate results in spam filtration (Larijani et al. 2019).
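A minimal sketch of KNN classification is given below to show why the method is called lazy and instance-based: there is no training step, and every prediction compares the query with all stored examples. The Euclidean distance, k = 3 and the toy feature vectors are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy features: (count of "free", count of "meeting"). Invented data.
train_X = [(5, 0), (4, 1), (6, 0), (0, 3), (1, 4), (0, 5)]
train_y = ["spam", "spam", "spam", "ham", "ham", "ham"]
```

Because each call scans the whole training set, prediction time grows with the amount of stored data, which is the drawback noted above.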

Decision Tree

According to Patel and Prajapati 2018, a decision tree is used to make decisions with the help of a flowchart-like structure. A decision tree can help the system determine what needs to be done when it finds a spam email, and shows which decisions will be made and what difference each will make to the system. In spam mail detection, the decision tree is used to find the output of every decision made by the system. The tree can be developed according to the outcomes, and the required action can be taken accordingly. Each branch in the decision tree represents an outcome (Patel and Prajapati 2018).

XGBoost algorithm

Mussa and Jameel 2019 explain that XGBoost is a decision-tree-based ensemble machine learning algorithm. A good tree model makes the decision process easy and effective, and the XGBoost algorithm is used to find the best tree model for a system. It is used to increase model performance and execution speed, which in turn speeds up the detection of spam mails. It works well with sparse data and can provide better results. It uses a gradient boosting framework that is designed to be highly flexible, efficient and portable. It can help to make predictions and detect spam in received emails, and can help to handle the unstructured data in the mail (Mussa and Jameel 2019).

Design methodology

The spam email filtering system is based on artificial intelligence and machine learning. The system first accesses the data and reads it. According to Taylor and Ezekiel 2020, the data is processed using min_max_scaler, since values should be scaled properly for better and more efficient results. The data is then split into two parts, and the model is trained using different machine learning algorithms. The authors use the Random Forest classifier and the Support Vector Classifier to train the system. The accuracy of the system in distinguishing spam emails from legit emails is then checked and monitored (Taylor and Ezekiel 2020).
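The scaling and splitting steps can be illustrated without any external library. The sketch below is not the code of Taylor and Ezekiel 2020: the function names, the 25% test ratio and the fixed random seed are assumptions chosen for the example. Min-max scaling maps each feature column to the range 0 to 1 using (value - min) / (max - min).

```python
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle the sample indices and split them into training and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Invented features: (message length, number of links).
rows = [[10, 200], [20, 400], [30, 300], [40, 100]]
labels = ["ham", "spam", "spam", "ham"]
scaled = min_max_scale(rows)
X_train, X_test, y_train, y_test = train_test_split(scaled, labels)
```

In practice the minimum and maximum are usually computed on the training part only and then applied to the test part, so that no information from the test data leaks into training; the sketch scales everything at once only to keep the example short.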

Figure 6: Proposed architecture of the system

(Source: Taylor and Ezekiel 2020)

The above architecture, proposed by the authors, shows the complete process of the system. The dataset is pre-processed and split into two parts, which are used to train and test the model and check the accuracy of the system. This process classifies the emails and filters them accordingly. The support vector classifier achieved 89.21% accuracy with kernel = 1, and the random forest classifier achieved 95.36% accuracy (Taylor and Ezekiel 2020). After comparing the accuracies, the Random Forest classifier was used for the rest of the system.

According to Karim et al. 2019, a DT-based framework has been designed that works in conjunction with spam keywords. The framework searches for the spam keywords online to identify spam emails and filter them accordingly. Random forest is also described as one of the most successful supervised machine learning classifiers. Random forest involves more complex calculations but gives better accuracy (Karim et al. 2019).

According to Bhowmick and Hazarika 2018, the automatic classification of spam emails is based on statistical or machine learning techniques that build a model for the task. The model is trained to filter spam emails from the user’s inbox. A set of pre-classified documents is required to build the model using the machine learning technique (Bhowmick and Hazarika 2018).

Detection of poisoning points

A poisoning attack corrupts the training dataset with fake data in order to manipulate the trained model. Machine learning models are trained and tested on data to produce the expected responses; data poisoning is carried out to make the trained model produce undesired outcomes.

According to Dada et al. 2019, Bayesian poisoning is a technique used by spammers to reduce the efficiency of filters based on Bayesian filtering. The random forest algorithm is one of the best-performing algorithms used to build models for spam detection. In certain cases, manipulating as little as 1% of the training data is enough to change the outcome of the system/model (Dada et al. 2019).

According to Paudice et al. 2018, poisoning attacks are among the most significant attacks on systems that depend on collected data. In the case of linear classifiers, the loss function for poisoning can be made arbitrarily large by increasing the value of $|w^{\top} x_{p_j}|$. Samples are then examined to identify and detect the poisoning attack points. The changes are observed by adding one poisoning point to the dataset used for training the system: the decision boundaries are seen to change. A colour map represents the cost on the validation dataset as a function of the position of the point, and it shows that the cost on the validation dataset increases when the poisoning point moves away from the distribution of the data (Paudice et al. 2018).

To avoid such attacks, the dataset is used to retrain several machine learning models, and the changes in the outcome of the trained system are monitored. The strategy used for poisoning does not place any restriction on the attacker’s side when generating the point.

Figure 7: Flowchart of defense algorithm from poisoning attack

(Source: Paudice et al. 2018)

The above image explains the defence algorithm, in which the training dataset for the system is checked and monitored. The data is split, and the parts are used to obtain the desired output. Outlier detection is applied to both parts of the data, which are then combined to give a clean and accurate dataset for the system. It is also important to use a trusted outlier detection method for the dataset. The assumptions made for the outlier detection or defence mechanism should not be violated; if they are, the detector itself can be targeted by the attacker and poisoned (Paudice et al. 2018).
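The idea of filtering a training set with an outlier detector can be sketched as follows. This is a simplified illustration and not the detector of Paudice et al. 2018: it assumes a small trusted sample of one class, scores each point by its Euclidean distance to the centroid of that sample, and rejects points whose score exceeds a chosen percentile of the trusted scores. All names and data are hypothetical.

```python
import math


def centroid(points):
    """Mean of each coordinate over the given points."""
    return [sum(col) / len(points) for col in zip(*points)]


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def fit_detector(trusted_points, q=95):
    """Learn a centre and a distance threshold from trusted points of one class."""
    center = centroid(trusted_points)
    scores = [math.dist(p, center) for p in trusted_points]
    return center, percentile(scores, q)


def filter_outliers(points, detector):
    """Keep only the points that lie within the learned distance threshold."""
    center, threshold = detector
    return [p for p in points if math.dist(p, center) <= threshold]


# Invented data: trusted spam samples and newly collected "spam" samples,
# one of which, (0, 6), lies far from the rest and may be a poisoning point.
trusted_spam = [(5, 0), (4, 1), (6, 1), (5, 2), (4, 0)]
collected_spam = [(5, 1), (4.5, 0.5), (0, 6)]
clean_spam = filter_outliers(collected_spam, fit_detector(trusted_spam, q=95))
```

One detector would be fitted per class, and the filtered classes are then combined into the cleaned training set. The sketch also shows why the trusted sample matters: if the attacker can poison it, the centre and threshold shift and the detector itself becomes unreliable.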

Experimental evaluation

According to Paudice et al. 2018, an experimental evaluation is performed to check the validity of the outlier detector used in the system to identify poisoning attacks. The process used 2 different datasets for the experiment. The label flipping attack strategy, which was seen to be effective against machine learning classifiers, is also included in the process. The evaluation also shows the lack of control over the values used for the poisoning points. The experimental evaluation, which uses real datasets, also shows the effectiveness of the defence methodology used in the system.

The evaluation process also shows that the optimal attacks are more effective than the label flipping attack. However, the label flipping attack can still be harmful to the whole system, because it is hard to identify or detect with the defence mechanism used. The experimental evaluation uses the MNIST dataset of handwritten digits. The results of the evaluation are monitored and noted down. A small training dataset results in a more effective poisoning attack (Paudice et al. 2018).
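A label flipping attack can be simulated in a few lines, which helps to show why it is hard to detect: the feature vectors stay genuine and only the labels change. The sketch below flips a random fraction of labels; the function name, the 20% fraction and the fixed seed are assumptions for illustration and do not reproduce the attack strategy evaluated by Paudice et al. 2018.

```python
import random


def flip_labels(labels, fraction, seed=0):
    """Return a copy of labels with a random fraction switched between spam and ham."""
    rng = random.Random(seed)
    n_flip = int(len(labels) * fraction)
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = "ham" if flipped[i] == "spam" else "spam"
    return flipped


labels = ["spam"] * 5 + ["ham"] * 5
poisoned = flip_labels(labels, fraction=0.2)
```

Training the same classifier on the original and the poisoned labels and comparing the test accuracy gives a simple measure of how much damage a given fraction of flipped labels causes.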

Figure 8: MNIST result after evaluation

(Source: Paudice et al. 2018)

The above image shows the effectiveness of the detector used in the system and the effect of poisoning attacks. The first row in the image shows the effect of poisoning when no countermeasure is applied in the system. The outlier detector shows 3 different results, for the 90th, 95th and 99th percentiles respectively.

The system shows the evaluation of the final experiment, which was done to test the system against the attack. The system identifies spam emails with the help of a supervised machine learning technique using the random forest algorithm (Paudice et al. 2018).

Practical Work & Analysis

Spam Filter Methodology

E-mail is considered one of the fastest, cheapest and most effective ways of transferring data of any kind over the internet. It can be used for various purposes, formal or informal, and is therefore one of the most preferred means of communication. Billions of users worldwide use e-mail, but this has also led to various problems, the major one being the legitimacy of mail, generally known as spam. By definition, spam is unsolicited mail sent indiscriminately over the internet; it can carry malware and can take the form of fraud schemes, advertisements, phishing messages, promotions, explicit content and more. Spam is thus a broad and still complex concept, and it is important to filter such mails in order to protect the privacy of the individual and prevent potential attacks (Agarwal and Kumar, 2018).

According to the Intelligence report by Message Labs, spam accounts for approximately 88% of e-mail traffic, which causes many problems, and the most widely adopted response is to filter spam e-mail using the available technologies. One such technology is machine learning, which offers various techniques for filtering spam mails, such as the K-Nearest Neighbour algorithm, which classifies a new mail according to the labels of the most similar mails in the training data (Mallampati et al., 2019).

Because of the advances made in artificial intelligence and machine learning, researchers are using a combination of these technologies for e-mail spam filtering. Another method is the Naïve Bayesian model, a statistical method that filters spam by categorising messages. Studies have used 5 different versions of the Naïve Bayesian technique for spam filtering, including a hybrid approach combined with a memory-based approach to make the filtering of spam mail more accurate. There are also methods based on the artificial immune system that build on the Naïve Bayes methodology to produce a hybrid algorithm that filters mails with more accurate outputs (Agarwal and Kumar, 2018).
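
The statistical idea behind the Naïve Bayesian filter can be written compactly. As a sketch, assume a mail is represented by its words w_1, ..., w_n and that the words are treated as conditionally independent given the class (the "naïve" assumption). The symbols below are introduced here for illustration only.

```latex
P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}),
\qquad
P(\text{ham} \mid w_1, \dots, w_n) \;\propto\; P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})
```

The mail is assigned to whichever class gives the larger value. The probabilities on the right-hand side are estimated from word counts in the labelled training mails.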

Other techniques include the support vector machine (SVM), which maps the training data into a high-dimensional feature space using a nonlinear mapping and then computes a separating hyperplane that maximises the margin between the data points of the positive class and those of the negative class. That hyperplane can then be used to classify new instances. The SVM has also been combined with logistic regression, a linear kernel and multiple instance logistic regression, in which a number of features of the mail were investigated in order to build an algorithm for filtering spam, ham or phishing mails (Mallampati et al., 2019).
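
The margin maximisation described above corresponds to the standard hard-margin SVM problem. As a sketch, assume training points x_i with labels y_i in {-1, +1}, a weight vector w and a bias b (notation introduced here, not taken from the cited work):

```latex
\min_{\mathbf{w},\, b} \ \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

The width of the margin is 2 / ||w||, so minimising ||w|| maximises the margin. A new instance x is then classified by the sign of w^T x + b. With a nonlinear kernel, x is replaced by its image in the feature space.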

However, with these machine learning approaches to spam filtering, adversaries may be present who can launch attacks against the deployed spam detectors during the training or prediction phase of the algorithm. Thus, even if machine learning algorithms can successfully detect spam mails, the presence of adversaries can affect the performance of the spam filter algorithms. This is because the adversary exposes the vulnerabilities of the machine learning model's security mechanism, which can also be described as reverse engineering in the machine learning context (Kuchipudi et al., 2020).

Because machine learning models can contain various loopholes, they are open to adversarial examples. A recent survey mentions a new type of spam tweet that can be used to attack online social media networks: the spam is mixed with supposedly legitimate content in order to create camouflaged messages. Using such techniques, an attacker can send mails that mix spam with legitimate content, which are hard to detect, in order to plan an attack. Adversarial machine learning also has a limitation, however, because the attacker's level of knowledge about the deployed model plays an important role in whether the attack succeeds. For example, researchers have shown that an adversary might exploit a statistical machine learning algorithm deployed in a spam filter even if only 1% of the training data is known to the attacker. For dictionary-based attacks, which are generally well-informed and focused attacks, the impact can be reduced by adding classifier weights (Imam and Vassilakis, 2019).

Research design

Many machine learning algorithms are available for spam mail filtering. Of these, the Naïve Bayesian algorithm was used to build the spam filtering model. The first step is to import the required libraries and resources and then read the dataset. The dataset consists of the subject of each mail and its status, spam or not, which is needed to test the accuracy of the model; the implementation is shown in the next section. The next step is to read the data and check whether it contains null values.

The implementation of the model then starts by calculating the length of each mail and storing the values in a newly created column. After that, the text is pre-processed: stop words and punctuation marks are removed using a function. Stemming is then performed on the text, and a further function converts the string data into a bag of words. Lastly, the data is split for training and testing, the model is created, and the model is tested to determine its accuracy. The next section presents the implementation step by step, followed by the results and their discussion.

Implementation

To implement the spam filter, the Naïve Bayes algorithm was applied to a dataset consisting of the subject of each mail and its spam status. The process was divided into the following steps:

Importing libraries

The first step is to import all the necessary libraries into the code, as shown in the following image.

Figure 9: Importing libraries

Natural language resources

To run the algorithm, the natural language resources needed by the detection method had to be imported, as shown in the following image.

Figure 10: Natural language resources

Reading dataset

In this step, the function for reading the dataset is applied, as shown in the following image.

Figure 11: Reading dataset
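
As the figure is not reproduced in this text, the following is a minimal sketch of the reading step using only the Python standard library. The inline sample data, the column names subject and label, and the helper read_dataset are assumptions made for illustration; they are not the project's dataset. With Pandas the same step is normally a single call to pandas.read_csv.

```python
import csv
import io

# Hypothetical sample standing in for the project's dataset file
RAW = """subject,label
Win a free prize now,spam
Meeting agenda for Monday,ham
Cheap loans approved today,spam
"""

def read_dataset(handle):
    """Read rows from an open CSV handle into a list of dictionaries."""
    return list(csv.DictReader(handle))

rows = read_dataset(io.StringIO(RAW))
print(len(rows), "rows; first label:", rows[0]["label"])
```

For a file on disk, the same function can be called with open(path, newline="", encoding="utf-8") in place of the in-memory handle.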

Checking null values

It is important to pre-process the data before operating on it. Null values are therefore checked first, because unclean data might not give accurate results. The function used for this is shown in the following image.

Figure 12: Checking null values
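
A null check of this kind can be sketched as follows. The rows and column names are hypothetical, and empty strings are treated as missing, which is an assumption of this sketch. In Pandas the usual equivalent is df.isnull().sum().

```python
def count_missing(rows, columns):
    """Count empty or missing values per column."""
    return {
        col: sum(1 for row in rows if not (row.get(col) or "").strip())
        for col in columns
    }

rows = [
    {"subject": "Win a free prize now", "label": "spam"},
    {"subject": "", "label": "ham"},
    {"subject": "Lunch tomorrow?", "label": None},
]
columns = ["subject", "label"]
missing = count_missing(rows, columns)

# Keep only rows in which every column has a value
clean_rows = [r for r in rows if all((r.get(c) or "").strip() for c in columns)]
print(missing, len(clean_rows))
```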

Count of Ham or Spam

Next, a function is applied to count how many mails are spam and how many are ham, and the counts are plotted as a graph, as shown in the following image.

Figure 13: Count of Ham or Spam
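
The counting step can be sketched with collections.Counter. The label list is made up for illustration, and a text bar chart stands in for the plotted graph, which in the project would have been drawn with a plotting library such as Matplotlib.

```python
from collections import Counter

labels = ["ham", "spam", "ham", "ham", "spam", "ham"]  # illustrative labels
counts = Counter(labels)

# Simple text bar chart, most frequent class first
for label, n in counts.most_common():
    print(f"{label:<5}{'#' * n} {n}")
```

Checking the class balance at this point matters because a heavily imbalanced dataset can make plain accuracy look better than the filter really is.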

Length checking for the mail

The next step is to compute the length of each mail in the dataset and store this length in another column, using the function shown in the following image.

Figure 14: Length checking for the mail
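
A minimal sketch of this step, assuming each row is a dictionary with a subject field (a hypothetical layout), adds the character length as a new field:

```python
rows = [
    {"subject": "Win a free prize now", "label": "spam"},
    {"subject": "Hi", "label": "ham"},
]

# Store the number of characters of each subject in a new "length" field
for row in rows:
    row["length"] = len(row["subject"])

lengths = [row["length"] for row in rows]
print(lengths)
```

With a Pandas DataFrame, the same column could be produced by applying len to the text column.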

Pre-Processing the text

The next step is to pre-process the text by removing punctuation and stop words, using the function shown in the following image.

This function is then applied to the text to pre-process it, as follows.

Figure 15: Pre-Processing the text
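
The pre-processing function could look like the sketch below. The short stop-word list is an assumption made so that the example is self-contained; a fuller list, such as the English stop words distributed with NLTK, would normally be used.

```python
import string

# Small illustrative stop-word list (an assumption, not the project's list)
STOP_WORDS = {"a", "an", "the", "is", "to", "for", "of", "and", "you", "your"}

def preprocess(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in STOP_WORDS]

tokens = preprocess("Congratulations! You have WON a prize, claim it now.")
print(tokens)
```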

Stemming

After the pre-processing of the data, stemming was performed. Stemming is the process of reducing a word to its word stem, the part to which affixes such as prefixes and suffixes attach, or to the root of the word, which is also known as the lemma. It belongs to the study of morphology in linguistics and to information retrieval in AI, where it helps in retrieving and extracting meaningful information from large sources of data such as big data or the internet. Thus, after cleaning the data, the following functions were applied for stemming (Techtarget. 2022).

Figure 16: Stemming
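
To show what stemming does to individual words, the following is a deliberately crude suffix-stripping sketch. The suffix list and the function crude_stem are assumptions for illustration only; a real implementation would normally rely on an established stemmer such as NLTK's PorterStemmer, which handles many more cases.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # checked in this order

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["winning", "offers", "claimed", "prizes", "free"]
stems = [crude_stem(w) for w in words]
print(stems)
```

The stems need not be dictionary words. What matters for the filter is that different forms of the same word map to the same token.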

Converting the text in the vocabulary

The next step is to use the count vectorizer function, which converts the string data into a bag of words, also known as the vocabulary. This step helps in determining whether a mail contains spam content or not, as shown in the following image.

Figure 17: Converting the text in the vocabulary
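
The bag-of-words conversion can be sketched without any library, to show what a count vectorizer produces. The helpers build_vocabulary and vectorize are hypothetical names. In scikit-learn this step is provided by the CountVectorizer class.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(doc, vocabulary):
    """Return the count vector of `doc`; unknown tokens are ignored."""
    counts = Counter(tok for tok in doc if tok in vocabulary)
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        vector[vocabulary[tok]] = n
    return vector

docs = [["free", "prize", "free"], ["meeting", "agenda"]]
vocab = build_vocabulary(docs)
matrix = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
```

Each row of the matrix describes one mail, and each column counts how often one vocabulary word occurs in it.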

Testing and training split

After this, the data is split into training and testing sets using the following function.

Figure 18: Testing and training split
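
A simple shuffled split can be sketched as follows. The 25% test share and the seed are assumptions for illustration, and split_train_test is a hypothetical helper rather than the function shown in the figure.

```python
import random

def split_train_test(items, test_size=0.25, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

data = [(f"mail {i}", "spam" if i % 3 == 0 else "ham") for i in range(12)]
train, test = split_train_test(data, test_size=0.25)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that repeated runs evaluate the model on the same held-out mails.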

Creation of model and evaluating the model

The last step is to create the model and then evaluate it, which is done using the following function.

Figure 19: Creation of model
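
To illustrate what the model does internally, the following is a small multinomial Naïve Bayes classifier written from scratch. It is a sketch under stated assumptions: mails are given as token lists, add-one (Laplace) smoothing is used, and words never seen in training are skipped. The class name and the training mails are made up; the project itself would more likely have used a library classifier.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for counter in self.word_counts.values() for w in counter}
        return self

    def log_scores(self, doc):
        """Log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for word in doc:
                if word in self.vocab:  # unseen words are skipped
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

train_docs = [["free", "prize", "now"], ["cheap", "loan", "free"],
              ["meeting", "agenda"], ["lunch", "tomorrow", "meeting"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "loan"]), model.predict(["meeting", "tomorrow"]))
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied together.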

Printing the classification report gives the following report.

Figure 20: classification report
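
The quantities in a classification report can be computed directly from the true and predicted labels. The sketch below uses made-up labels and a hypothetical helper per_class_report; it reports precision, recall, F1 score and support for each class.

```python
def per_class_report(y_true, y_pred, labels=("ham", "spam")):
    """Precision, recall, F1 and support for each label."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, {k: round(v, 2) for k, v in row.items()})
```

For a spam filter, precision on the spam class shows how many flagged mails were really spam, while recall shows how much of the spam was caught.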

Next, the curve of the true positive rate against the false positive rate is plotted using the following function.

Figure 21: curve plotting
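
The points of such a curve, usually called a ROC curve, can be computed as sketched below. The sketch assumes labels of 1 for spam and 0 for ham, a score per mail where a higher score means more spam-like, and no tied scores. The scores are made up, and roc_points and auc are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs obtained by
    lowering the decision threshold one score at a time."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, y_true), key=lambda pair: -pair[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(round(auc(points), 3))
```

A curve that rises quickly towards the top-left corner, with an area close to 1, indicates that the model separates spam from ham well.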

Lastly, the confusion matrix of predicted labels against true labels is plotted using the following function.

Figure 22: confusion matrix plotting
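
Before it is plotted, the confusion matrix is simply a table of counts. The following sketch builds it from made-up labels, with rows for the true label and columns for the predicted label; build_confusion_matrix is a hypothetical helper.

```python
def build_confusion_matrix(y_true, y_pred, labels=("ham", "spam")):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix

y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
matrix = build_confusion_matrix(y_true, y_pred)
for label, row in zip(("ham", "spam"), matrix):
    print(f"{label:<5}{row}")
```

The diagonal holds the correctly classified mails. The off-diagonal cells separate ham wrongly flagged as spam from spam that slipped through as ham.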

Accuracy calculation

Finally, the accuracy of the developed model is calculated using the function shown in the following image.

Figure 23: Accuracy calculation
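
Accuracy under cross-validation can be sketched as follows. To keep the example short and self-contained, a majority-class baseline stands in for the Naïve Bayes model; this substitution, the contiguous folds and the made-up labels are all assumptions of the sketch.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_items) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(labels, k):
    """Mean accuracy of a majority-class baseline over k folds."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        y_true = [labels[i] for i in test_idx]
        fold_scores.append(accuracy(y_true, [majority] * len(y_true)))
    return sum(fold_scores) / k

labels = ["ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
mean_accuracy = cross_val_accuracy(labels, k=5)
print(round(mean_accuracy, 2))
```

Comparing the model's cross-validated accuracy with such a baseline shows how much of the reported accuracy comes from the model rather than from the class balance of the dataset.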

Experimental results and discussion

The first graph plotted in the code shows the count of spam and ham mails, as shown in the following image.

Figure 24: checking the count of spam or ham

The next graph plots the true positive rate against the false positive rate, which gave the following result:

Figure 25: true positive rate and the false positive rate

After that, a confusion matrix of predicted labels against true labels was created, as shown in the following image.

Figure 26: confusion matrix

Lastly, the accuracy of the created model was calculated using cross-validation, which gave the following result.

Figure 27: accuracy

This shows that the model created for spam mail detection has good accuracy. However, this accuracy can change if an adversary is present during training or testing. With these machine learning methods for filtering spam mails, adversaries may launch attacks against the spam detector during the training or prediction phase of the algorithm. In other words, even if machine learning algorithms succeed in detecting spam e-mail, the presence of adversaries may affect the performance of spam filter algorithms in general. The adversary exposes the vulnerabilities of the machine learning model's security mechanism, which may also be referred to as reverse engineering in the context of machine learning (Kuchipudi et al., 2020).

Because machine learning models may contain several gaps, they are open to adversarial examples. According to a recent survey, there is a new type of spam tweet that can be used to plan an attack against online social media networks. The spam tweet is mixed with supposedly legitimate content in order to generate camouflaged messages. Using such approaches, an attacker can send e-mails that mix spam with legitimate material, which are difficult to distinguish from genuine mail, in order to prepare an attack on the model created for spam filtering (Wang et al., 2019).

References

Agarwal, K. and Kumar, T., 2018, June. Email spam detection using integrated approach of naïve Bayes and particle swarm optimization. In 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS) (pp. 685-690). IEEE.

Basit, A., Zafar, M., Javed, A.R. and Jalil, Z., 2020, November. A novel ensemble machine learning method to detect phishing attack. In 2020 IEEE 23rd International Multitopic Conference (INMIC) (pp. 1-5). IEEE. shorturl.at/lvBFS

Bhowmick, A. and Hazarika, S.M., 2018. E-mail spam filtering: a review of techniques and trends. Advances in electronics, communication and computing, pp.583-590. DOI: 10.1007/978-981-10-4765-7_61

Bhuiyan, H., Ashiquzzaman, A., Juthi, T.I., Biswas, S. and Ara, J., 2018. A survey of existing e-mail spam filtering methods considering machine learning techniques. Global Journal of Computer Science and Technology. https://www.researchgate.net/publication/332865507_A_Survey_of_Existing_E-Mail_Spam_Filtering_Methods_Considering_Machine_Learning_Techniques

Dada, E.G., Bassi, J.S., Chiroma, H., Adetunmbi, A.O. and Ajibuwa, O.E., 2019. Machine learning for email spam filtering: review, approaches and open research problems. Heliyon, 5(6), pp.1-23. DOI: https://doi.org/10.1016/j.heliyon.2019.e01802

Imam, N.H. and Vassilakis, V.G., 2019. A survey of attacks against twitter spam detectors in an adversarial environment. Robotics, 8(3), p.50.

Janiesch, C., Zschech, P. and Heinrich, K., 2021. Machine learning and deep learning. Electronic Markets, 31(3), pp.685-695. https://doi.org/10.1007/s12525-021-00475-2

Karim, A., Azam, S., Shanmugam, B., Kannoorpatti, K. and Alazab, M., 2019. A Comprehensive Survey for Intelligent Spam Email Detection. IEEE Access, 7, pp.168261-168295. https://doi.org/10.1109/ACCESS.2019.2954791

Kuchipudi, B., Nannapaneni, R.T. and Liao, Q., 2020, August. Adversarial machine learning for spam filters. In Proceedings of the 15th International Conference on Availability, Reliability and Security (pp. 1-6).

Mahesh, B., 2020. Machine learning algorithms-a review. International Journal of Science and Research (IJSR).[Internet], 9, pp.381-386. DOI: 10.21275/ART20203995

Mallampati, D., Chandra Shekar, K. and Ravikanth, K., 2019. Supervised machine learning classifier for Email spam filtering. In Innovations in Computer Science and Engineering (pp. 357-363). Springer, Singapore.

Mohammed, M.A., Ibrahim, D.A. and Salman, A.O., 2021. Adaptive intelligent learning approach based on visual anti-spam email model for multi-natural language. Journal of Intelligent Systems, 30(1), pp.774-792. https://doi.org/10.1515/jisys-2021-0045

Paudice, A., Muñoz-González, L., Gyorgy, A. and Lupu, E.C., 2018. Detection of adversarial training examples in poisoning attacks through anomaly detection. arXiv preprint arXiv:1802.03041. pp.1-10. https://arxiv.org/abs/1802.03041

Sattu, N., 2020. A study of machine learning algorithms on email spam classification (Doctoral dissertation, Southeast Missouri State University). 69, pp.170-179. https://ww.easychair.org/publications/download/Jvsw

Sheikhi, S., Kheirabadi, M.T. and Bazzazi, A., 2020. An effective model for SMS spam detection using content-based features and averaged neural network. International Journal of Engineering, 33(2), pp.221-228. Doi: 10.5829/ije.2020.33.02b.06

Techtarget. 2022. Stemming. [Online]. Available at: https://www.techtarget.com/searchenterpriseai/definition/stemming [Accessed on 19 February 2022]

Wang, X., Li, J., Kuang, X., Tan, Y.A. and Li, J., 2019. The security of machine learning in an adversarial setting: A survey. Journal of Parallel and Distributed Computing, 130, pp.12-23.

Huang, S., Cai, N., Pacheco, P.P., Narrandes, S., Wang, Y. and Xu, W., 2018. Applications of support vector machine (SVM) learning in cancer genomics. Cancer genomics & proteomics, 15(1), pp.41-51. doi:10.21873/cgp.20063

Larijani, M.R., Asli-Ardeh, E.A., Kozegar, E. and Loni, R., 2019. Evaluation of image processing technique in identifying rice blast disease in field conditions based on KNN algorithm improvement by K-means. Food science & nutrition, 7(12), pp.3922-3930. DOI: 10.1002/fsn3.1251

Patel, H.H. and Prajapati, P., 2018. Study and analysis of decision tree based classification algorithms. International Journal of Computer Sciences and Engineering, 6(10), pp.74-78. DOI: 10.26438/ijcse/v6i10.7478

Mussa, D.J. and Jameel, N.G.M., 2019. Relevant SMS spam feature selection using wrapper approach and XGBoost algorithm. Kurdistan Journal of Applied Research, 4(2), pp.110-120. https://www.spu.edu.iq/kjar/index.php/kjar/article/download/338/237

Lin, T., Capecci, D.E., Ellis, D.M., Rocha, H.A., Dommaraju, S., Oliveira, D.S. and Ebner, N.C., 2019. Susceptibility to spear-phishing emails: Effects of internet user demographics and email content. ACM Transactions on Computer-Human Interaction (TOCHI), 26(5), pp.1-28. https://doi.org/10.1145/3336141
