Using Natural Language Processing for Spam Detection in Emails
September 26, 2019
Source: https://lionbridge.ai/articles/using-natural-language-processing-for-spam-detection-in-emails/
Have you ever wondered how a machine translates language? Or how voice assistants respond to questions? Or how mail gets automatically classified into spam or not spam?
All these tasks are done through Natural Language Processing (NLP), which processes text into useful insights that can be applied to future data. In the field of artificial intelligence, NLP is one of the most complex areas of research because text data is contextual. Text needs modification to make it machine-interpretable, and requires multiple stages of processing for feature extraction.
Classification problems can be broadly split into two categories: binary classification problems, and multi-class classification problems. Binary classification means there are only two possible label classes, e.g. a patient’s condition is cancerous or it isn’t, or a financial transaction is fraudulent or it is not. Multi-class classification refers to cases where there are more than two label classes. An example of this is classifying the sentiment of a movie review into positive, negative, or neutral.
There are many types of NLP problems, and one of the most common types is the classification of strings. Examples of this include the classification of movies/news articles into different genres, and the automated classification of emails into spam or not spam. I’ll be looking into this last example in more detail for this article.
Problem Description
Understanding the problem is a crucial first step in solving any machine learning problem. In this article, we will explore and understand the process of classifying emails as spam or not spam. This is called Spam Detection, and it is a binary classification problem.
The reason to do this is simple: by detecting unsolicited and unwanted emails, we can prevent spam messages from creeping into the user’s inbox, thereby improving user experience.

Dataset
Let’s start with our spam detection data. We’ll be using the open-source Spambase dataset from the UCI machine learning repository, a dataset that contains 5569 emails, of which 745 are spam.
The target variable for this dataset is ‘spam’ in which a spam email is mapped to 1 and anything else is mapped to 0. The target variable can be thought of as what you are trying to predict. In machine learning problems, the value of this variable will be modeled and predicted by other variables.
A snapshot of the data is presented in figure 1.

Task: To classify an email into spam or not spam.
To get to our solution we need to understand the four processing concepts below. Please note that the concepts discussed here can also be applied to other text classification problems.
- Text Processing
- Text Sequencing
- Model Selection
- Implementation
1. Text Processing
Data usually comes from a variety of sources and often in different formats. For this reason, transforming your raw data is essential. However, this transformation is not a simple process, as text data often contains redundant and repetitive words. This means that processing the text data is the first step in our solution.
The fundamental steps involved in text preprocessing are:
- Cleaning the raw data
- Tokenizing the cleaned data
a. Cleaning the Raw Data
This phase involves the deletion of words or characters that do not add value to the meaning of the text. Some of the standard cleaning steps are listed below:
- Lowering case
- Removal of special characters
- Removal of stopwords
- Removal of hyperlinks
- Removal of numbers
- Removal of whitespaces
Lowering Case
Lowering the case of text is essential for the following reasons:
- The words, ‘TEXT’, ‘Text’, ‘text’ all add the same value to a sentence
- Lowering the case of all the words is very helpful for reducing the dimensions by decreasing the size of the vocabulary
def to_lower(word):
    result = word.lower()
    return result
Removal of special characters
This is another text processing technique that will help to treat words like ‘hurray’ and ‘hurray!’ in the same way.
import string

def remove_special_characters(word):
    # Delete every ASCII punctuation character from the text
    result = word.translate(str.maketrans(dict.fromkeys(string.punctuation)))
    return result
Removal of stop words
Stopwords are commonly occurring words in a language like ‘the’, ‘a’, and so on. Most of the time they can be removed from the text because they don’t provide valuable information.
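from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # assumed source of the stop word set

def remove_stop_words(words):
    # words is a list of tokens; keep only those that are not stop words
    result = [i for i in words if i not in ENGLISH_STOP_WORDS]
    return result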
Removal of hyperlinks
Next, we remove any URLs in the data. There is a good chance that an email will contain some URLs. We don’t need them for our further analysis, as they do not add any value to the results.
import re

def remove_hyperlink(word):
    return re.sub(r"http\S+", "", word)
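The list of cleaning steps above also mentions removing numbers and whitespace, for which no snippet is shown. The sketch below is one possible way to write those two helpers and to chain all the steps together. The function names `remove_numbers`, `remove_whitespace`, and `clean_text` are our own, and `DEMO_STOP_WORDS` is a deliberately tiny, hypothetical stop word list used only to keep the example self-contained.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list used only for illustration;
# a real pipeline would use a much larger set.
DEMO_STOP_WORDS = {"the", "a", "an", "is", "to", "at", "and"}

def remove_numbers(text):
    # Drop every run of digits.
    return re.sub(r"\d+", "", text)

def remove_whitespace(text):
    # Collapse runs of whitespace into one space and trim both ends.
    return " ".join(text.split())

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    text = text.lower()
    # Strip hyperlinks first, while the URLs are still intact.
    text = re.sub(r"http\S+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = remove_numbers(text)
    text = remove_whitespace(text)
    # Split into words and drop the stop words.
    return [word for word in text.split() if word not in stop_words]

print(clean_text("WIN a FREE prize!!! Visit http://example.com at 9 am"))
# ['win', 'free', 'prize', 'visit', 'am']
```

The order of the steps is a design choice: hyperlinks are removed before punctuation so that the URL pattern still matches the original link text.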
b. Tokenizing the Cleaned Data
Tokenization is the process of splitting text into smaller chunks, called tokens. Each token is an input to the machine learning algorithm as a feature.
keras.preprocessing.text.Tokenizer is a utility class that converts text into sequences of token indices while keeping only the words that occur the most in the text corpus. When we tokenize the text, we end up with a massive dictionary of words, and they won’t all be essential. We can set the ‘num_words’ argument (here through the variable ‘max_feature’) to the number of most frequent words that we want to consider.
from keras.preprocessing.text import Tokenizer

max_feature = 50000  # number of unique words to consider
tokenizer = Tokenizer(num_words=max_feature)
tokenizer.fit_on_texts(x_train)
# Emails differ in length, so the sequences stay as lists of lists until they are padded
x_train_features = tokenizer.texts_to_sequences(x_train)
x_test_features = tokenizer.texts_to_sequences(x_test)
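To make the idea of frequency-based tokenization concrete, here is a simplified pure-Python illustration. It is not the Keras implementation: the helpers `build_word_index` and `texts_to_ids` are hypothetical names, ties between equally frequent words are broken alphabetically for determinism, and the cutoff is modeled as "keep words whose index is below num_words".

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency: index 1 is the most frequent word.
    # Index 0 is left unused so it can serve as the padding value later.
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index, num_words):
    # Replace each word by its index, dropping words that are too rare.
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

emails = ["free prize now", "call me now", "free free entry"]
word_index = build_word_index(emails)
print(word_index)
# {'free': 1, 'now': 2, 'call': 3, 'entry': 4, 'me': 5, 'prize': 6}
print(texts_to_ids(emails, word_index, num_words=3))
# [[1, 2], [2], [1, 1]]
```

Notice that the resulting sequences have different lengths, which is why the next step pads them.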

2. Text Sequencing
a. Padding
Making the token sequences for all emails an equal length is called padding.
We send input to the model in batches of data points, and every sequence in a batch must have the same length so that the batch can be stored as a single array. So, we pad shorter sequences and truncate longer ones to a common length, which eases batch updates.
The length of all tokenized emails post-padding is set using ‘max_len’.

Code snippet for padding:
from keras.preprocessing.sequence import pad_sequences
# max_len is the target length chosen for every tokenized email
x_train_features = pad_sequences(x_train_features, maxlen=max_len)
x_test_features = pad_sequences(x_test_features, maxlen=max_len)
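The effect of padding can be sketched in a few lines of plain Python. The helper `pad_front` below is a hypothetical stand-in that mimics what we understand to be the default behaviour of pad_sequences: zeros are added at the front of short sequences, and long sequences lose tokens from the front. The value max_len=3 is chosen only for the example.

```python
def pad_front(sequences, max_len, value=0):
    # Assumes max_len is a positive integer.
    padded = []
    for seq in sequences:
        seq = list(seq)[-max_len:]  # keep only the last max_len tokens
        padded.append([value] * (max_len - len(seq)) + seq)
    return padded

print(pad_front([[5, 8, 2, 9], [7]], max_len=3))
# [[8, 2, 9], [0, 0, 7]]
```

After this step every row has the same length, so the whole batch can be stored as one rectangular array.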
b. Label encoding the target variable
The model expects the target variable as a number, not a string. We can use LabelEncoder from sklearn to convert our target variable as below.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_y = le.fit_transform(target_train.values)
test_y = le.transform(target_test.values)
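Conceptually, label encoding just assigns an integer to each distinct class. The sketch below imitates that idea in plain Python by numbering the sorted unique labels. The helpers `fit_label_mapping` and `encode` are hypothetical names, and the labels "ham" and "spam" are assumed for illustration rather than taken from the dataset.

```python
def fit_label_mapping(labels):
    # Number the distinct labels in sorted order, starting at 0.
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

def encode(labels, mapping):
    # A label that was never seen during fitting raises a KeyError.
    return [mapping[label] for label in labels]

mapping = fit_label_mapping(["ham", "spam", "ham", "ham"])
print(mapping)                            # {'ham': 0, 'spam': 1}
print(encode(["spam", "ham"], mapping))   # [1, 0]
```

As in the snippet above, the mapping is fitted on the training labels only and then reused unchanged for the test labels.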
3. Model Selection
A movie consists of a sequence of scenes. When we watch a particular scene, we don’t try to understand it in isolation, but rather in connection with previous scenes. In a similar fashion, a machine learning model has to understand text by drawing on the text it has already seen, much as a human brain does.
Traditional machine learning models do not carry information from one input step to the next. However, Recurrent Neural Networks (commonly called RNNs) can do this for us. Let’s take a closer look at RNNs below.

An RNN has a repeating module that takes input from the previous stage and gives its output as input to the next stage. However, in practice a plain RNN mainly retains information from the most recent stages, because the influence of earlier inputs fades as the sequence gets longer. To learn long-term dependencies, our network needs memorization power. Here’s where Long Short Term Memory Networks (LSTMs) come to the rescue.
LSTMs are a special case of RNNs. They have the same chain-like structure as RNNs, but with a different repeating module structure.

To process each sequence in both forward and reverse order, we’ll use a Bi-directional LSTM.
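The recurrence and the bi-directional idea can be illustrated with a toy example. Note that this is a plain RNN cell with a single scalar hidden state, not an LSTM, and the weights `w_x`, `w_h`, and `bias` are arbitrary values picked for the demonstration. It only shows how a hidden state is carried from step to step, and how a second pass over the reversed sequence gives the model a view from the other direction.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    # h is the hidden state carried from one step to the next.
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states

def run_bidirectional(inputs):
    # One pass reads the sequence left to right, the other right to left;
    # the two final states are returned side by side.
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])
    return forward[-1], backward[-1]

sequence = [1.0, 0.0, 0.0, 0.0]
print(run_rnn(sequence))            # the effect of the first input fades step by step
print(run_bidirectional(sequence))  # the backward pass sees that input last
```

In the forward pass the signal from the first token shrinks at every step, while the backward pass meets the same token at its final step, so its final state still reflects it strongly.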
4. Implementation
Embedding
Text data can be easily interpreted by humans. But for machines, reading and analyzing is a very complex task. To accomplish this task, we need to convert our text into a machine-understandable format.
Embedding is the process of converting formatted text data into numerical values/vectors which a machine can interpret.
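At its core, an embedding layer can be thought of as a lookup table with one row of numbers per token index. The following sketch shows that lookup in plain Python. The helpers `make_embedding_table` and `embed` are hypothetical, the table is filled with small random values, and the sizes are chosen only for the example. In a real model the values in the table are learned during training.

```python
import random

def make_embedding_table(vocab_size, vector_length, seed=0):
    # One row of vector_length small random numbers per token index.
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(vector_length)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    # Replace every token index by its row in the table.
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=10, vector_length=4)
vectors = embed([0, 3, 3, 7], table)
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[1] == vectors[2])       # True: the same token always gets the same vector
```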

import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, Dropout, Bidirectional, LSTM

# size of the output vector from the embedding layer
embedding_vector_length = 32
# Creating a sequential model
model = tf.keras.Sequential()
# Creating an embedding layer to vectorize the token indices
model.add(Embedding(max_feature, embedding_vector_length, input_length=max_len))
# Adding a Bi-directional LSTM
model.add(Bidirectional(LSTM(64)))
# ReLU in the hidden dense layer helps the model converge quickly
model.add(Dense(16, activation='relu'))
# Deep learning models can overfit easily; to reduce this, we add randomization using dropout
model.add(Dropout(0.1))
# Adding a sigmoid activation function to squash the output into the range 0 to 1
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

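history = model.fit(x_train_features, train_y, batch_size=512, epochs=20, validation_data=(x_test_features, test_y))
# model.predict returns one probability per email with shape (n, 1); flatten it before thresholding at 0.5
y_predict = [1 if o > 0.5 else 0 for o in model.predict(x_test_features).ravel()]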

Through the above, we have successfully fit a bi-directional LSTM model on our email data, and detected 125 of 1114 emails as spam.
Since the percentage of spam in data is often low, measuring the model’s performance by accuracy alone is not recommended. We need to evaluate it using other performance metrics as well, which we’ll look at below.
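To see why accuracy alone can mislead, consider a hypothetical baseline that never flags anything as spam. The sketch below applies it to the class counts quoted in the Dataset section (5569 emails, 745 of them spam). The baseline itself is our own illustration and is not part of the model built in this article.

```python
total_emails = 5569
spam_emails = 745

# 1 marks spam, 0 marks not spam.
true_labels = [1] * spam_emails + [0] * (total_emails - spam_emails)
predictions = [0] * total_emails  # hypothetical baseline: never flag spam

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / total_emails

caught_spam = sum(1 for t, p in zip(true_labels, predictions) if t == 1 and p == 1)
spam_recall = caught_spam / spam_emails

print(f"accuracy = {accuracy:.3f}, spam recall = {spam_recall:.3f}")
```

The baseline scores roughly 87% accuracy while catching no spam at all, which is why we also need metrics that focus on the spam class.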
Performance Metrics
Precision and recall are the two most widely used performance metrics for a classification problem, and together they give a better understanding of how a model behaves. Precision is the fraction of retrieved instances that are relevant. Precision helps us to understand how useful the results are. Recall is the fraction of all relevant instances that were retrieved. Recall helps us understand how complete the results are.
The F1 Score is the harmonic mean of precision and recall.
For example, consider that a search query results in 30 pages, of which 20 are relevant, but the results fail to display 40 other relevant results. In this case, the precision is 20/30, and recall is 20/60. Therefore, our F1 Score is 4/9.
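The arithmetic of this search example can be checked with a few lines of Python. Exact fractions are used so that the result matches the 4/9 quoted above; the variable names are our own.

```python
from fractions import Fraction

retrieved = 30           # pages returned by the search
relevant_retrieved = 20  # returned pages that are relevant
relevant_missed = 40     # relevant pages the search failed to return

precision = Fraction(relevant_retrieved, retrieved)
recall = Fraction(relevant_retrieved, relevant_retrieved + relevant_missed)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 2/3 1/3 4/9
```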
Using F1-score as a performance metric for spam detection problems is a good choice.
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

cf_matrix = confusion_matrix(test_y, y_predict)
# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = cf_matrix.ravel()
print("Precision: {:.2f}%".format(100 * precision_score(test_y, y_predict)))
print("Recall: {:.2f}%".format(100 * recall_score(test_y, y_predict)))
print("F1 Score: {:.2f}%".format(100 * f1_score(test_y, y_predict)))

import seaborn as sns
import matplotlib.pyplot as plt

ax = plt.subplot()
# annot=True to annotate cells
sns.heatmap(cf_matrix, annot=True, ax=ax, cmap='Blues', fmt='')
# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Not Spam', 'Spam'])
ax.yaxis.set_ticklabels(['Not Spam', 'Spam'])

With an F1 score of 94%, the model performs well on this dataset. Keep in mind, however, that these results are based on the data we used here. When applying a model like this to real world data, we still need to actively monitor the model’s performance over time. We can also continue to improve the model by responding to results and feedback, for example by adding features and removing misspelled words.
Summary
In this article, we created a spam detection model by converting text data into vectors, creating a BiLSTM model, and fitting the model with the vectors. We also explored a variety of text processing techniques, text sequencing techniques, and deep learning models, namely RNN, LSTM, and BiLSTM. You can find all the code for the project on my GitHub.
The concepts and techniques learned in this article can be applied to a variety of natural language processing problems, such as building chatbots, text summarization, and language translation models. We hope to have more articles about such NLP problems in the future.