Elasticsearch Machine Learning and Spam Email Identification

Elasticsearch is well known as a powerful search engine that helps users collect and transform data in real time. Its data analytics capabilities, especially machine learning, are less widely known. Cyber criminals continually develop new ways to steal data from organisations, and phishing emails remain one of their preferred methods. Identifying and managing spam is becoming harder as unwanted messages fill inboxes and consume time and resources. Business intelligence consultants and cyber security professionals are using Elasticsearch to automate spam identification and free that time for other work. This blog shows how to create an ML-based classification model with Elasticsearch and walks through a practical example of spam email identification.

Background

Finding a pile of unwanted emails in your inbox is annoying, and filtering them out is tiresome because the indicators of “junk” are fuzzy. Traditionally, spam is blocked by sender domain or email address, but maintaining a list of suspicious senders is an endless process. Among the available alternatives, supervised machine learning techniques have proven fast and reliable at detecting spam based on the message content.

Supervised machine learning uses a labelled training dataset to teach an algorithm to assign data to a specific category. For spam detection, we will use an example set of spam and ham (legitimate) emails to create a classification model. The model learns the underlying patterns in those examples and uses them to predict the category of new emails.

Elastic Stack makes it easy to ingest the dataset into Elasticsearch and access the Kibana UI to use machine learning features.

 

 

Elasticsearch machine learning supports the end-to-end workflow, from training and evaluation to deployment. The next section demonstrates, step by step, how to identify spam using Elasticsearch.

Creating a Classification Model with Elasticsearch

In order to use supervised machine learning to identify spam, we will use Elasticsearch to prepare and extract features from an email dataset. The next step is to use Kibana to develop a classification machine learning job based on the dataset. Finally, we will evaluate how well the model can identify spam emails and explore how to deploy the model at the ingestion level.

Preparing data

The dataset we are going to use was collected from the SpamAssassin site. We can upload the CSV file and view the data directly in the Kibana UI. The dataset contains two fields, “Body” and “Label”. A “Label” of “1” means the corresponding email is spam, and a “Label” of “0” means it is not.

 

 

After indexing the data into Elasticsearch under the name “email-sample”, we can explore the data distribution in Kibana. As the bar chart shows, around 4000 records are tagged as “not spam” and around 2000 records are tagged as “spam”.

 

 

To validate the quality of the public dataset, we next deal with missing values. All emails with no “Body” or “Label” are removed from the training dataset. After this clean-up, 5982 records are ready for analysis.
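The clean-up rule above can be sketched outside Elasticsearch as a simple filter. This is only an illustration in Python, assuming each record is a dictionary with “Body” and “Label” keys and that a missing value is either absent, `None`, or a blank string; the function names and sample records are made up for the example.

```python
def has_value(value):
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is not None and str(value).strip() != ""


def clean_records(records):
    """Keep only records that have both a Body and a Label."""
    return [
        record
        for record in records
        if has_value(record.get("Body")) and has_value(record.get("Label"))
    ]


# Illustrative records, not taken from the real dataset
sample = [
    {"Body": "Win a free gift now", "Label": "1"},
    {"Body": "", "Label": "0"},                      # missing Body
    {"Body": "Meeting moved to 3pm", "Label": None},  # missing Label
    {"Body": "See you tomorrow", "Label": 0},
]

cleaned = clean_records(sample)
print(len(cleaned), "of", len(sample), "records kept")
```

Note that a numeric label of `0` is kept, since only absent or blank values count as missing here.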

 

 

Feature Engineering

Before we dive into model development, we will extract additional attributes from the raw data, a process called feature engineering. Giving the machine learning algorithm more characteristics that describe each email should help improve the performance of our classification model.

The new features to be generated are:

  1. The length of an email message

    We can use a script to calculate the length of the message for each record in our training data.

    PUT _scripts/email_length
    {
      "script": {
        "lang": "painless",
        "source": """
          // Store the number of characters in the message body
          def text = ctx['Body'].toLowerCase();
          def length = text.length();
          ctx['email_length'] = length;
        """
      }
    }

    To apply the script to the dataset, we need to create an ingest pipeline to transform the data.

    PUT _ingest/pipeline/count_email_length
    {
      "description": "This is to count the length of a text",
      "processors": [
        {
          "script": {
            "id": "email_length"
          }
        }
      ]
    }

  2. The number of spam trigger words in an email

Spam emails normally try to persuade the reader to take action, so certain attention-grabbing words appear frequently in them. We can therefore also estimate how likely an email is to be spam from the number of such keywords it contains.

To keep things simple, we will use a selected list of keywords as an example. Here is the tag cloud of the selected corpus.

 

 

Similarly, we can use a script to count how many of the keywords occur in each email message.

PUT _scripts/num_of_keyword
{
  "script": {
    "lang": "painless",
    "source": """
      def keywordList = ['act', 'apply', 'bonus', 'buy', 'call', 'cheap', 'click', 'earn', 'free', 'get', 'gift', 'limited', 'offer', 'order', 'save'];
      def keywordCount = 0;
      // Lowercase once so that matching is case-insensitive
      def text = ctx['Body'].toLowerCase();
      for (def i = 0; i < keywordList.size(); i++) {
        // Each keyword is counted at most once per message
        if (text.contains(keywordList[i])) {
          keywordCount += 1;
        }
      }
      ctx['keyword_count'] = keywordCount;
    """
  }
}

 
We also build a pipeline to run the script.

PUT _ingest/pipeline/count_keyword_freq
{
  "description": "This is to count the occurrence of keywords in a text",
  "processors": [
    {
      "script": {
        "id": "num_of_keyword"
      }
    }
  ]
}
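To sanity-check the two features before reindexing, the logic of the `email_length` and `num_of_keyword` scripts can be mirrored locally. The sketch below is a plain Python approximation, not part of the Elasticsearch workflow: the function names and the sample document are illustrative, and Python counts Unicode code points where Java counts UTF-16 code units, so lengths can differ for some non-ASCII text.

```python
KEYWORDS = [
    "act", "apply", "bonus", "buy", "call", "cheap", "click", "earn",
    "free", "get", "gift", "limited", "offer", "order", "save",
]


def email_length(body):
    """Number of characters in the message body."""
    return len(body.lower())


def keyword_count(body, keywords=KEYWORDS):
    """Number of distinct keywords that appear as substrings of the body."""
    text = body.lower()
    return sum(1 for keyword in keywords if keyword in text)


def enrich(doc):
    """Return a copy of the document with both features added."""
    enriched = dict(doc)
    enriched["email_length"] = email_length(doc["Body"])
    enriched["keyword_count"] = keyword_count(doc["Body"])
    return enriched


# Illustrative document, not taken from the real dataset
doc = {"Body": "Click here to get your FREE gift", "Label": "1"}
print(enrich(doc))
```

Because the match is a substring test, a word such as “contact” also counts as a hit for “act”. Matching on whole tokens would avoid this, at the cost of a slightly more involved script.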

The last step in pre-processing data is to add those new attributes to the dataset.

Let’s create a top-level pipeline that includes the pipelines built earlier and drops unnecessary columns.

PUT _ingest/pipeline/reindex_pipeline
{
  "description": "The top-level pipeline",
  "processors": [
    {
      "pipeline": {
        "name": "count_keyword_freq"
      }
    },
    {
      "pipeline": {
        "name": "count_email_length"
      }
    },
    {
      "remove": {
        "field": "column1"
      }
    }
  ]
}

 
Finally, we can create a new index for our enriched training dataset using the pipeline.

POST _reindex
{
  "source": {
    "index": "email-sample"
  },
  "dest": {
    "index": "enriched-email-sample",
    "pipeline": "reindex_pipeline"
  }
}

 

Here is how the new dataset looks with the extra fields.

 

 

Training Data in Elasticsearch

Building a classification model with Elasticsearch is quite straightforward. We can access the Kibana Data Frame Analytics tab to use the machine learning wizard.

Here we select “Classification” as the job type, “enriched-email-sample” as the source index, and “Label” as the dependent variable we want to predict. All the other fields will be included in the analysis.
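The same choices can also be written down as a request body for the create data frame analytics jobs API (`PUT _ml/data_frame/analytics/<job_id>`). The Python sketch below only assembles and prints such a body. The job ID, destination index name and training percentage are assumptions made for illustration, and the source index is the one written by the reindex step; none of these values are confirmed as the author's actual configuration.

```python
import json


def build_classification_job(source_index, dest_index, dependent_variable, training_percent):
    """Assemble a request body for a classification data frame analytics job."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "analysis": {
            "classification": {
                "dependent_variable": dependent_variable,
                "training_percent": training_percent,
            }
        },
    }


job_body = build_classification_job(
    source_index="enriched-email-sample",        # index created by the reindex step
    dest_index="email-classification-results",   # hypothetical destination index
    dependent_variable="Label",
    training_percent=80,                         # assumed split, adjust as needed
)

# "spam_email_classification" is an assumed job ID
print("PUT _ml/data_frame/analytics/spam_email_classification")
print(json.dumps(job_body, indent=2))
```

The printed request can be pasted into the Kibana console; the wizard remains the simpler route when exploring a dataset for the first time.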

 

 

We can also look at the scatterplot matrix to understand the relationships between the fields.

 

The job will go through several phases to finish analysing the data and generating results.

 

 

Evaluating the model

After the training process is complete, we will use the confusion matrix to evaluate the classification model. It shows the percentage of correctly predicted values in each category.

For example, from the table we can see that 78% of label 1 emails were predicted as “spam” by the model, which is the True Positive Rate. Similarly, 79% of label 0 emails were predicted as “not spam”, which is the True Negative Rate. The matrix also shows how many label 1 emails were identified as “not spam” (the False Negative Rate) and how many label 0 emails were assigned to the “spam” category (the False Positive Rate).
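To make the four rates concrete, here is a small Python sketch that derives them from raw confusion-matrix counts. The counts are hypothetical, chosen only so that the resulting rates match the 78% and 79% figures quoted above; they are not the actual counts from the trained model.

```python
def confusion_rates(tp, fn, tn, fp):
    """Derive per-class rates and overall accuracy from confusion-matrix counts."""
    actual_spam = tp + fn
    actual_not_spam = tn + fp
    return {
        "true_positive_rate": tp / actual_spam,
        "false_negative_rate": fn / actual_spam,
        "true_negative_rate": tn / actual_not_spam,
        "false_positive_rate": fp / actual_not_spam,
        "accuracy": (tp + tn) / (actual_spam + actual_not_spam),
    }


# Hypothetical counts: 100 spam emails and 200 non-spam emails
rates = confusion_rates(tp=78, fn=22, tn=158, fp=42)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

The True Positive Rate and False Negative Rate always sum to 1, as do the True Negative Rate and False Positive Rate, so each row of the matrix only needs one of the pair to be read.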

In addition, a ROC (Receiver Operating Characteristic) curve is provided. It plots the True Positive Rate against the False Positive Rate at different classification thresholds. The higher the AUC (Area Under the Curve), the better the model separates the two classes.
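As a rough illustration of how a ROC curve and its AUC are obtained, the Python sketch below sweeps a threshold over predicted spam probabilities and integrates the resulting curve with the trapezoidal rule. The labels and scores are invented for the example, and Kibana's own calculation may differ in detail.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more emails as spam
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented example: 1 = spam, 0 = not spam, scores are predicted spam probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

An AUC of 1.0 means every spam email scores higher than every non-spam email, while 0.5 is what random guessing would achieve. The sketch assumes both classes are present in the labels.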

In practice, this model still needs improvement. For example, we could use a larger training dataset or extract more features, such as the number of bigram-based keywords in an email. Tuning is an ongoing process, and there is usually a tradeoff between False Positives and False Negatives.

 

Deploying the model

Once we are satisfied with the evaluation results, we can deploy the model in Elasticsearch, for example, as a processor in an ingest pipeline, to make predictions against new data.

PUT _ingest/pipeline/spam_prediction
{
  "description": "Filter spam emails using ML",
  "processors": [
    {
      "inference": {
        "model_id": "spam_email_classification-1618899599824"
      }
    }
  ]
}

 

Conclusion

In this blog, we have seen what Elasticsearch machine learning is and how simple it can be to build a classification model with Elastic Stack. Through our experimentation, we created a spam email detection model that achieved close to 80% accuracy. Although further refinement is necessary before it can be deployed in real-world scenarios, it serves as a promising foundation for businesses seeking to enhance their email filtering mechanisms. Elastic, combined with machine learning, has the potential to complement existing algorithms adopted by email service providers, contributing to a more comprehensive approach to combating spam and ensuring genuine communications reach their intended recipients.

Apart from spam email detection, we can extend the classification method to many more use cases, including identifying whether a domain is risky or determining whether an application is malicious. One of the main advantages of the integrated solution provided by Elastic Stack is that it enables us to accelerate the process of machine learning deployment and adapt quickly to changes. By leveraging the power of Elasticsearch’s indexing and querying capabilities, machine learning models can tap into vast amounts of data, resulting in more accurate predictions and more effective decision-making. 

This blog post underscores the potential of Elastic and machine learning in transforming the landscape of data analysis and application. The incorporation of a console operator and continuous performance testing further accentuates the value of Elastic Stack as a dynamic tool for addressing diverse challenges across various domains. As technology continues to evolve, embracing the capabilities of Elastic Stack and its integration with machine learning will undoubtedly be pivotal in propelling innovation and advancing solutions for a data-driven world. 

FAQ

What is character mapping in Elasticsearch? 

Character mapping refers to the process of transforming or mapping characters in text data during the indexing and search operations. It involves defining how specific characters or sequences of characters should be treated, normalized, or transformed in the context of text analysis.

Character mapping is an important aspect of text analysis because it helps improve the accuracy of search results by handling variations in character representations. For example, character mapping can be used to:

  1. **Normalize Accents and Diacritics**: In some languages, characters might have accents or diacritics that can be represented in different ways. Character mapping can be used to transform these variations into a consistent representation, ensuring that searching for a term with or without accents yields the same results.
  2. **Case Insensitivity**: Character mapping can convert all characters to lowercase or uppercase during indexing and searching. This ensures that searches are case-insensitive, so users don’t need to worry about matching the exact case of letters.

In Elasticsearch, character mapping is often configured through analyzers, which define a series of transformations applied to text data during indexing and searching. Analyzers can include character filters, which perform character mapping tasks such as accent normalization, and token filters, such as the lowercase filter used for case-insensitive matching.

Here’s a simplified example of how character mapping might be configured in Elasticsearch’s index settings:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["normalize_chars", "remove_punctuation"],
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "normalize_chars": {
          "type": "mapping",
          "mappings": ["à => a", "é => e"]
        },
        "remove_punctuation": {
          "type": "pattern_replace",
          "pattern": "[\\p{Punct}]",
          "replacement": ""
        }
      }
    }
  }
}
```

In this example, the custom analyzer applies character mapping through the “normalize_chars” char filter, removes punctuation using the “remove_punctuation” char filter, and lowercases the resulting tokens with the “lowercase” token filter.
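To see what such an analyzer does to a piece of text, the chain can be imitated in a few lines of Python. This is a simplified stand-in rather than Elasticsearch's implementation: the standard tokenizer is approximated by splitting on whitespace, and the function names are invented for the sketch.

```python
import string

# Same two mappings as the "normalize_chars" char filter above
CHAR_MAPPINGS = {"à": "a", "é": "e"}


def apply_char_mapping(text, mappings=CHAR_MAPPINGS):
    """Replace each mapped character with its target."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def remove_punctuation(text):
    """Drop ASCII punctuation, similar to the [\\p{Punct}] pattern."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def analyze(text):
    """Char filters first, then tokenize, then lowercase each token."""
    text = apply_char_mapping(text)
    text = remove_punctuation(text)
    tokens = text.split()  # rough approximation of the standard tokenizer
    return [token.lower() for token in tokens]


print(analyze("Café déjà vu, à la carte!"))
```

Because only the lowercase “à” and “é” are mapped, an uppercase “É” would pass through the char filter unchanged and merely be lowercased to “é” afterwards, so a real configuration would need a fuller mapping list.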


Author:

Ziqing(Astrid) Liu
 

About Skillfield:

Skillfield is a Melbourne-based Cyber Security and Data Services consultancy and professional services company. We provide solutions that help our customers discover, protect and optimise big data in a way that works for them.
