Anomaly detection is one of the most important classes of problems in the modern world: from telecommunication signals to manufacturing operations to user behavior, all sorts of systems expect to work in a regular pattern, and deviations from this pattern can represent problems such as errors, failures, attacks or even changes that, despite not being negative per se, may require action from the system keepers.
Like most modern problems, human supervision alone is not enough to keep up with the sheer volume of data and complexity of patterns that should be followed, and even automated approaches designed by specialists – such as handcrafted filters or thresholds – rely too much on the observation and analysis of people. For very well-known problems and systems, this human approach might be enough, but in many cases the definition of regular behavior is not clear. This may happen because too many elements are involved, because their relations are too complex, or simply because there are too many distinct behaviors that are valid. In some cases, these behaviors are not static either, and may evolve over time and adapt due to both seasonal events (such as holidays) and unpredictable ones (such as legislative changes or even global events).
The modern approach for overly complex problems like these usually comes in the form of Artificial Intelligence or, more specifically, Machine Learning techniques. With enough data, it’s possible to train models that can learn all sorts of patterns and behaviors and automate the decision-making process that would otherwise be too costly or complex for human operations.
Key Features for Anomaly Detection
Sadly, “machine learning” does not necessarily hold a solution for every single problem, or at least not a complete solution. For example, the most common Machine Learning techniques are based on classification – i.e., inferring the label of an entry based on its attributes – but this approach requires many known examples from each class. Since anomalies are, by definition, deviations from normal behavior, acquiring a reliable set of labeled anomalies to train a regular classification model is no easy task, especially when both regular examples and anomalies can have very distinct behaviors. In machine learning, problems with these characteristics are usually solved with unsupervised learning: patterns are learned without the usage of labels.
The same data can be organized in different ways, depending on whether label information is available
Unsupervised learning techniques can also be used to detect anomalies – either by training models that expect most of the presented data to belong to a single class, and everything else to be outliers, or models that organize data into groups based on their relative structure, with some of these groups being anomalous while others are regular. In both cases, though, it’s hard to guarantee if the resulting groupings are relevant, especially when there’s a large number of factors to consider. It’s possible that the most important variables have very subtle variations, while irrelevant or spurious ones might dominate the model. An anomaly detection model should be robust to the nature of features that are used, otherwise, it will rely too much on the insight of data analysts and domain specialists during feature selection and engineering.
Alternatively, anomaly detection can be solved with Machine Learning as a regression problem: instead of classifying examples in ‘regular’ and ‘anomalous’, the next value of a series could be inferred and then compared to the actual value. This approach is very popular with time series, but it brings its own problems: most of the techniques will focus on isolated variables that are tracked over time, and then predict the next value of them. If a problem is multivariate, with multiple characteristics influencing each other, regression modeling becomes increasingly complex, since data representation becomes a key issue or lesser features end up being ignored for this analysis.
Another common feature of many machine learning algorithms, which is detrimental to anomaly detection (especially unsupervised anomaly detection) is the “opaqueness” of resulting models: many of them don’t provide a way to understand how a certain decision was taken or a classification was performed, nor which factors affected it. This is particularly problematic for anomaly detection because the reasons behind an anomaly are as important as identifying it. For example, without knowing why a machine’s behavior was considered anomalous makes it hard to decide which action should be taken (Preventive maintenance? Cleaning? Verification of the input materials?), while lack of explainability on the detection of anomalous users might lead to lawsuits if action is taken against them without a credible explanation.
Finally, in all cases, it is important to have efficient models that can learn and, especially, be applied in a quick, distributed and scalable way. This is particularly true for systems that consider millions of new events happening online, and with a very diverse set of possible behaviors. These attributes are common in telecommunication systems but are also relevant to most areas that deal with “big data” in general, including large pools of users, equipment, or sensors.
As described previously, regression models, such as ARIMA or Facebook’s Prophet [Taylor, Lethan 2017], are still considered to be a staple in the detection of anomalies over temporal series, with known implementations that can deliver very efficient inferences. These techniques, however, are mostly focused on predicting a single series of data points over time. They provide certain levels of robustness by considering different possible seasonalities and behaviors, but their final predictions are still strictly monovariate. This problem can be addressed by either engineering a single variable that maps all the behaviors through the use of dimension reduction techniques, such as PCA, or by running multiple anomaly detectors in parallel, but either approach fails in preserving internal relations between these variables. These approaches are commonly applied for domains in which there is already a well-established model for data processing, but for open-ended anomaly detection – for example, determining if a cellphone user is acting out of the ordinary – there are too many features to be tracked at once.
With classification-based techniques, trees are some of the most popular (and successful) learning techniques, thanks to their intuitive design and, in the case of forest ensembles, scalability. However, traditional classification relies on at least some level of supervision (in order to define what is anomalous vs what is regular), and these labels are hardly ever available. Unsupervised tree techniques for detection are also known, such as Isolation Forests, but these techniques – as well as other one-class classifiers, such as Support Vector Machines – end up being very reliant on the chosen features to describe data. The more spurious features available, the harder it will be for these techniques to tell actual anomalies from data points that are distant from others based on random noise alone. These techniques also struggle with anomalies that are not merely outliers: if a well-defined group of events or examples is anomalous (for example, a group of malicious agents or a subset of nodes in a system that has been infected by a virus), these behaviors might be perceived as ‘regular’ due to their similarity to other known behaviors.
Single-Class classifiers will define a boundary around known data and separate normal data from outliers
On the other end of the spectrum, we have Neural Networks, which can deal with complex sets of features and identify connections – even nonlinear connections – between them. Most of these techniques are based on supervised classification or regression, whose problems have been addressed before, but other than that they also pose a serious limitation: the resulting models are among the least explainable, and this obfuscation severely restricts which problems can realistically be solved by these models. Even if they can train models and identify results, the opaque nature of the result means that no further insight can be taken from the result itself, relying on either human specialists to analyze it or simply avoiding any kind of explanation.
While no technique is perfect and the specific conditions of each use case are important, one technique has shown great results in very flexible and diverse applications: Robust Random Cut Forests (RRCF). This technique can be used to provide most of the features desired in anomaly detection problems, as well as avoid the pitfalls of other techniques. In the next part of this 3-part article, we will explore the key characteristics of RRCF and how they can help with anomaly detection problems.
- Robust Random Cut Forests. Sudipto Guha, Nina Mishra, Gourav Roy, Okke Schrijvers Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2712-2721, 2016.
- Forecasting at scale. Taylor SJ, Letham B. 2017. Forecasting at scale. PeerJ Preprints 5:e3190v2 https://doi.org/10.7287/peerj.preprints.3190v2
- Anomaly Detection: a Survey. Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58 pages.
This article was written by Ivan Caramello, Data Scientist and member of the Data Science & Engineering Tech Practices at Encora. Thanks to Daniel Franco, Alicia Tasseli, Caio Vinicius Dadauto, João Caleffi and Kathleen McCabe for reviews and insights.
Fast-growing tech companies partner with Encora to outsource product development and drive growth. Contact us to learn more about our software engineering capabilities.