
Machine learning for cyber security

The goal of machine learning is to enable computers to learn on their own, solving many complex, real-world problems. A machine learning algorithm may be able to identify patterns in observed data, build models that explain the world, and predict things without explicitly pre-programmed rules and models. The benefits of machine learning include speed, accuracy, the ability to ingest large volumes of data, automation of tasks, and ultimately significant cost savings.

The application of machine learning in cyber security has many uses: it can enhance our detection and response capabilities and automate the analysis of user and device behaviour.


There are two main machine learning approaches, supervised and unsupervised, which have unique benefits and applications.


Supervised machine learning

Machine learning can be guided, or ‘supervised’. Training data is labelled with the characteristics or attributes the model should learn. For example, you could tag pictures of cats with a label of ‘cats’ and tag pictures of dogs with a label of ‘dogs’. The machine would learn the differing characteristics of cats and dogs from that set of training data and produce an algorithm that could then tell the difference between pictures of cats and dogs that are not labelled.

A common application of supervised machine learning is identifying spam emails: large amounts of both spam and genuine emails (ham) are fed into supervised machine learning algorithms to increase their accuracy. This enables the system to detect spam emails without being explicitly programmed to do so. Spam filtering is a fairly easy application of supervised machine learning for cyber defence, given the large amount of labelled data that is available to train the models that the algorithms produce.
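The spam example can be sketched as a tiny naive Bayes classifier built from scratch. The corpus below is a hypothetical toy example; real filters train on millions of labelled messages and use mature libraries, but the principle of learning from labelled spam/ham examples is the same:

```python
# Minimal sketch of supervised spam classification: a naive Bayes text
# classifier trained on a tiny hand-labelled corpus (illustrative data only).
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability for the given text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for label, prior in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(prior / sum(label_counts.values()))
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("please review the attached report", "ham"),
]
wc, lc = train(training)
print(classify("claim your free prize", wc, lc))  # → spam
```

Given enough labelled examples, the model generalises to messages it has never seen, which is exactly the property the paragraph above describes.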


However, there are limits to the widespread usefulness and applicability of supervised machine learning in other scenarios. Crucially, most enterprises do not retain labelled datasets of previous attacks, which undercuts the ability to use supervised models. Even when supervised models are possible, attackers continuously evolve their approach, which decays the value of those models.


Senseon has developed a system that combines analyst feedback on Senseon threat cases with unsupervised outlier detection methods, to create supervised models. This continual synthesis of new models that are automatically generated by the system enables Senseon to grow and adapt its detection capabilities within an organisation whilst minimising false positives and optimising analyst time.
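In principle, a feedback loop of this kind might look like the sketch below, where analyst verdicts on raised cases gradually teach the system which case types are persistent false positives. This is a generic illustration of the pattern, not Senseon's actual implementation; the class name, thresholds, and verdict labels are all hypothetical:

```python
# Hypothetical sketch: analyst feedback on unsupervised detections is
# accumulated per case type, and case types that are overwhelmingly false
# positives stop being raised, freeing analyst time.
from collections import Counter, defaultdict

class FeedbackLoop:
    def __init__(self, min_verdicts=3, fp_rate_cutoff=0.8):
        self.verdicts = defaultdict(Counter)   # case_type -> verdict counts
        self.min_verdicts = min_verdicts       # feedback needed before acting
        self.fp_rate_cutoff = fp_rate_cutoff   # suppress above this FP rate

    def record(self, case_type, verdict):
        """verdict is 'true_positive' or 'false_positive', from an analyst."""
        self.verdicts[case_type][verdict] += 1

    def should_raise(self, case_type):
        counts = self.verdicts[case_type]
        total = sum(counts.values())
        if total < self.min_verdicts:
            return True  # not enough feedback yet; keep raising the case
        return counts["false_positive"] / total < self.fp_rate_cutoff

fb = FeedbackLoop()
for _ in range(3):
    fb.record("noisy_beacon", "false_positive")
print(fb.should_raise("noisy_beacon"))  # → False (suppressed)
print(fb.should_raise("rare_login"))    # → True (no feedback yet)
```

The key idea is that the unsupervised layer supplies candidate detections while labelled analyst feedback supplies the supervision, so the supervised component improves continuously without needing a pre-existing labelled attack dataset.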

Unsupervised machine learning

Machine learning can also operate without any guidance. This is known as unsupervised machine learning. Because it hasn’t been trained to understand or identify specific characteristics and isn’t trying to find specific outcomes, it is able to take sets of data and find patterns within them that would be very hard for a human to find, especially when dealing with very large sets of data.

The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about that data. Unlike supervised learning, there are no correct answers and there is no teacher.


A common application of unsupervised machine learning in cyber defence is outlier detection for spotting anomalies. These algorithms can detect combinations of data that may be indicative of anomalous behaviour by users or devices, which can improve the detection rate of new or novel attacks. The disadvantage of a purely unsupervised approach is that it may trigger more false positive alerts. Senseon has overcome this by leveraging AI Triangulation algorithms to increase explainability and verify outputs from multiple perspectives.
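As a minimal sketch of the outlier-detection idea, the following flags data points that sit far from the rest of a sample using a robust (median-based) z-score. No labels or training are involved; the data and threshold are illustrative examples, not Senseon's method:

```python
# Minimal sketch of unsupervised anomaly detection: flag values far from the
# bulk of the sample. Median/MAD is used rather than mean/stdev so a single
# extreme value cannot mask itself by inflating the spread estimate.
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values whose robust z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)      # median absolute deviation
    return [i for i, v in enumerate(values)
            if mad and abs(v - med) / (1.4826 * mad) > threshold]

# Hypothetical daily outbound data volumes per device, in MB: one device
# transfers vastly more than its peers, which could indicate exfiltration.
daily_mb = [52, 48, 61, 55, 50, 47, 53, 49, 940, 51]
print(mad_outliers(daily_mb))  # → [8]
```

Note that nothing told the algorithm what "exfiltration" looks like; it simply found the value that does not fit the rest of the data, which is why unsupervised methods can surface attacks never seen before.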

An example of unsupervised machine learning is the analysis of DNS traffic to identify malware communicating with its command and control server. Whilst an unsupervised algorithm may flag traffic as an outlier, gathering corroborating data from multiple perspectives (such as features of the domain itself) validates the initial unsupervised finding.
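A hypothetical sketch of this two-step idea: first flag domains that are rare in the environment's query log (the outlier step), then corroborate using features of the domain string itself, such as length and character entropy, which tend to be high for machine-generated command and control domains. The functions, thresholds, and query log below are all illustrative assumptions:

```python
# Hypothetical two-step check on DNS queries: an outlier step (rarity in the
# environment) corroborated by features of the domain string itself.
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character; random-looking DGA labels score high."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def is_rare(domain, query_log, max_seen=2):
    """Outlier step: domains queried very few times in this environment."""
    return query_log.count(domain) <= max_seen

def looks_generated(domain, entropy_threshold=3.5, length_threshold=20):
    """Verification step: does the domain label itself look machine-made?"""
    label = domain.split(".")[0]
    return shannon_entropy(label) > entropy_threshold or len(label) > length_threshold

query_log = ["example.com"] * 50 + ["intranet.local"] * 30 + ["kq3vx9zr1pty7w.net"]
suspect = "kq3vx9zr1pty7w.net"
print(is_rare(suspect, query_log) and looks_generated(suspect))  # → True
```

Neither signal alone is conclusive — rare domains are often legitimate, and some benign domains look random — but agreement between independent perspectives gives far higher confidence than the initial unsupervised flag on its own.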