When cybersecurity meets text mining - an introductory approach


Applying text mining to cybersecurity is not a new idea. However, the two terms rarely appear together in cybersecurity product descriptions. In most cases the industry prefers the term Natural Language Processing (NLP); in others it reaches for a fancier one: Artificial Intelligence (AI).

At iCybersec, we follow Feldman and Sanger's definition: text mining seeks to extract useful information from a collection of documents by identifying interesting patterns [1]. It is worth highlighting that the collection consists of unstructured and semi-structured documents, which explains the "text" in text mining. We believe that text mining better describes the pattern identification that cybersecurity solutions can implement, so we will usually use this term. We will differentiate the three terms in future posts, since this one aims to present the intersection of text mining and cybersecurity.

Nowadays, text mining plays a significant role in several cybersecurity activities. The first example presented in this post is cyber situational awareness, which gives organizations (or governments) a perception of the ecosystem surrounding their presence in cyberspace. This perception provides the elements needed to understand the different contexts in which an enterprise operates, making it possible to project what may happen in the near future. To achieve cyber situational awareness, it is necessary to collect data from different sources and to design solutions that fuse and process that data into information that supports strategic decisions. As unstructured data is predominant in cyberspace, text mining plays a central role in analyzing it to extract relevant information.

Cyber situational awareness should use Cyber Threat Intelligence (CTI), which provides evidence-based knowledge about threats to support decisions [2]. Through CTI, an organization can obtain information about cyber threats with the potential to cause a cybersecurity incident. CTI can also deliver information about threat actors, clarifying the tactics, techniques, and procedures (TTPs) they use to launch cyber-attacks. An organization should understand threat actors' behavior in order to implement security controls that prevent those attacks. The intelligence provided by CTI can be based on multiple data sources, and many of them feed CTI with unstructured data. As with cyber situational awareness, CTI handles a massive amount of data and relies on text mining to deal with it.
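
As a small illustration of what dealing with unstructured CTI data can look like, the sketch below pulls two kinds of indicators out of free text: CVE identifiers and IPv4 addresses. It is a minimal example, not a description of any particular CTI product. The function name `extract_indicators`, the choice of indicators, and the sample report text are all assumptions made for this post.

```python
import ipaddress
import re

# CVE identifiers follow the pattern CVE-<year>-<4 or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# Candidate IPv4 addresses; each candidate is validated separately below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_indicators(report):
    """Return the CVE identifiers and valid IPv4 addresses found in free text."""
    cves = sorted({match.upper() for match in CVE_RE.findall(report)})
    ips = set()
    for candidate in IPV4_RE.findall(report):
        try:
            # IPv4Address rejects impossible values such as 999.10.10.10.
            ips.add(str(ipaddress.IPv4Address(candidate)))
        except ValueError:
            continue
    return {"cves": cves, "ipv4": sorted(ips)}


# Hypothetical report text, invented for this example.
sample_report = (
    "The actor exploited CVE-2021-44228 and cve-2017-0144, "
    "sending traffic to 203.0.113.7 and 999.10.10.10."
)
indicators = extract_indicators(sample_report)
print(indicators)
```

Pattern matching like this only covers indicators with a rigid format. Describing a threat actor's behavior from prose would need richer text mining techniques than regular expressions.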

Text mining can also support cyber-attack detection when the attack involves unstructured data. This is the case, for example, in phishing and in misinformation/disinformation campaigns. These cyber-attacks can be partially or fully based on unstructured data, so text mining can learn the patterns used in past campaigns and use them to identify new ones. In a similar context, text mining can also detect online scams and spam messages. Additionally, information about a cyber-attack can reverberate through cyberspace in real time, and these signals can also be used for detection.
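
The idea of learning patterns from past campaigns to flag new messages can be sketched with a very small text classifier. The example below is a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. Naive Bayes is our choice for illustration, not the only option, and the class name, labels, and training messages are all invented for this post. A real detector would need far more data and features.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of token counts
        self.doc_counts = Counter(labels)  # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data standing in for messages from past campaigns (hypothetical).
training_texts = [
    "Your account has been suspended, verify your password now",
    "Urgent: confirm your bank details to avoid account closure",
    "Click this link to verify your login and claim your refund",
    "Meeting notes from the quarterly planning session attached",
    "The build server will be down for maintenance on Friday",
    "Please review the attached report before our call tomorrow",
]
training_labels = ["phishing"] * 3 + ["legitimate"] * 3

classifier = NaiveBayesTextClassifier().fit(training_texts, training_labels)
print(classifier.predict("Urgent: verify your account password"))
print(classifier.predict("Please review the maintenance report"))
```

With this toy data, the first message is scored as closer to the phishing examples and the second as closer to the legitimate ones, because of the words each shares with the training messages.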

Recently, we have been flooded with news reports of hundreds of data leakage incidents. Although these incidents usually involve structured data records, many also include unstructured data. Therefore, organizations need solutions that identify and prevent data leakage. Text mining can help organizations create an inventory of the information assets that hold sensitive data. This visibility is crucial for cybersecurity teams, which can then implement security controls to protect essential data. Moreover, Data Leakage Prevention (DLP) systems can use text mining to identify which information is being exchanged through computer networks and block it if necessary.
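
A minimal sketch of the kind of check a DLP system could run on outgoing text is shown below. It looks for e-mail addresses and for digit sequences that pass the Luhn checksum used by payment card numbers. The function names, the two categories, and the sample message are assumptions made for this example. Real DLP products combine many more detectors and policies.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Return True if a string of digits passes the Luhn checksum."""
    checksum = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0


def find_sensitive_data(text):
    """Return (category, value) pairs for sensitive-looking items in text."""
    findings = [("email", match.group()) for match in EMAIL_RE.finditer(text)]
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # The checksum filters out most random digit runs.
        if luhn_valid(digits):
            findings.append(("card_number", digits))
    return findings


# Hypothetical outgoing message; 4111 1111 1111 1111 is a common test number.
message = (
    "Please bill alice@example.com, card 4111 1111 1111 1111. "
    "Order reference: 4111111111111112."
)
findings = find_sensitive_data(message)
print(findings)
```

A policy layer would then decide what to do with the findings, for example blocking the message or alerting the security team.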

As presented in this post, many cybersecurity activities require dealing with unstructured data, and a report published by IDC [3] states that the amount of data generated will continue to grow in the coming years. This growth will demand that organizations adopt new strategies to separate the wheat from the chaff, and text mining offers powerful tools to face this challenge. The cybersecurity domain will be no different, and its leaders will soon have to find the solutions that will protect their organizations in the coming years.


[1] Ronen Feldman and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press. 2006.

[2] https://www.gartner.com/en/documents/2487216/definition-threat-intelligence

[3] International Data Corporation (IDC). The Digitization of the World - From Edge to Core. 2018.