Anomaly Detection Principles and Algorithms

Authors: Kishan G. Mehrotra, Chilukuri K. Mohan, and HuaMing Huang

ISBN: 978-3-319-67526-8

Part I: Principles

1. Introduction

1.1 What's an Anomaly?

Definition
  • An anomaly (outlier) is a substantial variation from the norm.
Primary Assumption of Anomaly Detection
  • Anomaly detection approaches are based on models and predictions from past data.
  • The primary assumption is that the underlying processes that generated the data have not changed significantly.
  • Hence, the statistics that characterized a system in the past continue to characterize the system in the future.
Models/Processes and Anomaly Detection
  • In some cases, common-sense knowledge or a model of the underlying process is available to detect anomalies.
  • In other cases, a model or a process must be identified prior to detecting anomalies.
  • In still other cases, the effects of an underlying (unknown) process on observable data must be characterized prior to detecting anomalies.
  • Hence, one can estimate the likelihood with which a particular data point could have been generated by the underlying (unknown) process, based on a characterization of what is non-anomalous from other observable data generated by the same process.
Inefficiency of Traditional Machine Learning and Classification Algorithms
  • Traditional machine learning and classification algorithms are ineffective at classifying data as anomalous or non-anomalous for the following reasons.
    1. Anomalous data are much rarer than non-anomalous data.
    2. Classification algorithms therefore tend to produce too many false negatives (i.e., they fail to recognize anomalies).
    3. Various anomalous cases may have very little in common.
    4. The occurrence of an anomaly may well be within the same bounds as those characterizing non-anomalous data, and hence not distinguishable directly by attribute values.

2. Anomaly Detection

2.1 Anomalies

Anomaly Detection Algorithms Results
  • When an anomaly detection algorithm is applied, three possible cases need to be considered:
    1. Correct Detection: Detected abnormalities in data do correspond exactly to abnormalities in the process.
    2. False Positives: The process continues to be normal, but unexpected data values are observed, e.g., due to intrinsic system noise.
    3. False Negatives: The process becomes abnormal, but the consequences are not registered in the abnormal data, e.g., due to the signal of the abnormality being insufficiently strong compared to the noise in the system.
  • It is impossible to achieve 100% correct detection; hence, it is important to minimize both false positives and false negatives and to find an acceptable trade-off between them.
Precision

Given a data set $\mathcal{D}$, suppose an outlier detection algorithm identifies $m > 0$ potential anomalies, of which $m_t \leq m$ are known to be true outliers. Then precision, which measures the proportion of true outliers among the top $m$ most suspicious instances, is:

$$Pr = \frac{m_t}{m}$$

and equals $1.0$ if all the points identified by the algorithm are true outliers.

Recall

If $\mathcal{D}$ contains $d_t \geq m_t$ true outliers, then recall is defined as:

$$Re = \frac{m_t}{d_t}$$

which equals $1.0$ if all true outliers are discovered by the algorithm.

RankPower

If $R_i$ denotes the rank of the $i$th true outlier in the sorted list of most suspicious objects, then the RankPower is given by:

$$RP = \frac{m_t (m_t + 1)}{2 \sum_{i=1}^{m_t} R_i}$$

which takes its maximum value $1$ when the detected true outliers occupy the top $m_t$ positions of the list (in particular, when all $d_t$ true outliers appear in the top $d_t$ positions).
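
A minimal Python sketch of these three metrics, assuming the detector returns objects sorted from most to least suspicious and that the set of true outliers is known for evaluation (the function and variable names are illustrative, not from the book):

```python
def evaluate(ranked_ids, true_outliers, m):
    """Compute precision, recall, and RankPower for the top-m suspicious objects.

    ranked_ids    : object ids sorted from most to least suspicious
    true_outliers : set of ids known to be true outliers (d_t = len(true_outliers))
    m             : number of top-ranked objects reported by the algorithm
    """
    top_m = ranked_ids[:m]
    # 1-based ranks of the true outliers that appear among the top m
    ranks = [i + 1 for i, obj in enumerate(top_m) if obj in true_outliers]
    m_t, d_t = len(ranks), len(true_outliers)
    precision = m_t / m
    recall = m_t / d_t
    rank_power = m_t * (m_t + 1) / (2 * sum(ranks)) if m_t > 0 else 0.0
    return precision, recall, rank_power

# Objects 3 and 7 are the true outliers; the detector reports its top m = 5 suspects.
print(evaluate([3, 9, 7, 1, 4], {3, 7}, m=5))   # -> (0.4, 1.0, 0.75)
```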

Comparing Anomaly Detection Algorithms
  • When algorithms that report the same number $m$ of suspicious points are compared, larger values of all three measures (precision, recall, and RankPower) imply better performance.
Old Problems vs. New Problems
  • Traditional machine learning and classification algorithms are effective at identifying known (old) types of anomalies, yet they are ineffective at identifying unknown (new) types of anomalies.
What Kind of Data?
  • Assumption #1: Some process exists, and there are rules, guidelines, or principles governing the possible randomness in the data.
  • Assumption #2: The characteristics of this process hold in the entire set of data.
  • The choice of the anomaly detection algorithm should depend on the nature of the process generating the anomaly.
What's a Norm?
  • The norm is a set of points; for a point to be considered abnormal, it must be substantially distant from every point in the norm.

2.2 Outliers in One-Dimensional Data

Uniform Distribution
  • When data is distributed uniformly over a finite range, the neighborhood of any data point is (in expectation) as richly populated as that of any other point, so it can be argued that there are no anomalous data points.
  • However, a small neighborhood that contains substantially fewer or more data points than expected from a uniform distribution can be an indication of anomalous behavior.
Normal Distribution
  • When data is distributed normally, the density of points decreases substantially as we move away from the mean.
  • About $0.3\%$ of the points are more than $3\sigma$ away from the mean.
  • About $2 \times 10^{-7}\%$ of the points are more than $6\sigma$ away from the mean.
  • Perspective #1: A point beyond $3\sigma$ from the mean is anomalous.
  • Perspective #2: A set of points is anomalous if and only if its size is substantially larger than would be expected if the data were normally distributed.
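
As an illustration of Perspective #1, a short sketch that flags points beyond $3\sigma$, assuming the data can be treated as approximately normal and estimating the mean and standard deviation from the sample itself (the synthetic data is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)   # synthetic one-dimensional data

mu, sigma = data.mean(), data.std()
# Perspective #1: flag any point farther than 3*sigma from the mean as anomalous.
anomalies = data[np.abs(data - mu) > 3 * sigma]
print(f"{len(anomalies)} of {len(data)} points lie beyond 3 sigma")
```
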
Other Unimodal Distributions
  • Perspective #1: If the nature and characteristics of the distribution are known, one may seek to find thresholds beyond which a relatively small number of the data points are found.
  • Perspective #2: Otherwise, a collection of points is anomalous if their number is larger than predicted by the statistics of the distribution.
Multimodal Distributions
  • With data sets containing multiple modes, it is useful to think of the data as consisting of a collection of clusters of data points.
  • Points which do not belong to any cluster are candidates to be considered anomalous.
Varying Definitions of Clustering
  1. If the relative number of points (per unit distance) is substantially higher in a small region than the entire data set, the points in that region can be considered a cluster.
    • The distribution of densities can itself be analyzed, and (if unimodal) we can identify the density threshold beyond which the density is high enough to consider a region to contain a cluster.
  2. If points are substantially closer to each other (on average) than they are to the nearest points outside the cluster, then the points form a cluster.

2.3 Outliers in Multidimensional Data

  • The ideas regarding outliers in one-dimensional data extend to multidimensional data with the following complications.
    1. The choice of a distance measure can affect how the data points are clustered.
    2. The dimensions can have varying ranges, so the choice of a normalization technique can affect how the data points are clustered.
    3. Some dimensions, such as time, require special consideration when clustering.

2.4 Anomaly Detection Approaches

Primary Approaches to Anomaly Detection
  1. Distance-Based: Points that are farther away from others are considered more anomalous.
  2. Density-Based: Points that are in relatively low density regions are considered more anomalous.
  3. Rank-Based: The most anomalous points are those that are not ranked highly as neighbors by their own nearest neighbors (i.e., their nearest neighbors have other points as their nearest neighbors).
Nature of Data
  1. Supervised: Classification labels are known for a set of "training" data, and all comparisons and distances are with respect to such training data.
  2. Unsupervised: No such labels are known, so distances and comparisons are with respect to the entire data set.
  3. Semi-Supervised: Labels are known for some data, but not for most others.
Characteristics of Unsupervised Anomaly Detection Algorithms
  1. Normal behaviors have to be dynamically defined. No prior training data set or reference data set for normal behavior is needed.
  2. Outliers must be detected effectively even if the distribution of data is unknown.
  3. The algorithm should be adaptable to different domain characteristics; it should be applicable or modifiable for outlier detection in different domains, without requiring substantial domain knowledge.

2.5 Evaluation Criteria

  1. Can quantitative metrics be devised so that we can unambiguously say which of two data points in a given data set is "more anomalous" - without appealing to human intuition? An answer is required in order to minimize the number of false positives generated by anomaly detection algorithms, particularly in the unsupervised and semi-supervised contexts.
  2. Each anomaly detection algorithm answers the above question in a procedural way. Can this implicit choice be justified on mathematical or rational grounds?
  3. In some cases, algorithms also appeal to the desire to compute the results in a "reasonable" amount of time, ruling out the search for optimal solutions. In such cases, can we say anything about the quality of the obtained solutions when compared to the optimal solutions?

3. Distance-Based Anomaly Detection Approaches

3.1 Introduction

Mathematical Notations
  • $\mathfrak{D}$ is the $n$-dimensional data set, presumed to be in continuous unbounded real-valued space $\mathfrak{R}^{n}$.
  • $p$ and $q$ are data points in $\mathfrak{D}$.
  • $P$ and $Q$ are sets of data points in $\mathfrak{D}$.
  • $d(p,q)$ is the distance between two points $p,q \in \mathfrak{D}$.
Primary Questions
  1. Measurement: How anomalous is a given point? This requires transforming points into a one-dimensional scale, e.g., defining a function $\alpha$ such that $\alpha(p) \in \mathfrak{R}$ measures the anomalousness of $p \in \mathfrak{D}$.
  2. Absolute: Is a given point anomalous? This requires finding a threshold $\theta > 0$ so that we say a point $p \in \mathfrak{D}$ is anomalous if $\alpha(p) \gt \theta$.
  3. Relative: Is one point "more anomalous" than another? This requires comparing points, so that $\alpha(p) \gt \alpha(q)$ if a point $p \in \mathfrak{D}$ is more anomalous than another point $q \in \mathfrak{D}$. We may denote this relationship with the symbol "$\rhd$", i.e., $p \rhd q$ if $\alpha(p) \gt \alpha(q)$.
Secondary Questions
  1. Subset: What are the $m$ most anomalous points in a given data set? To answer this question, we may use the "relative" criterion to find the most anomalous, the second most anomalous, etc., perhaps by sorting if $m$ is large.
  2. Hybrid: What are the $m$ most anomalous points in a given data set, which are also absolutely anomalous? The answer can be obtained by applying an absolute threshold to the "Subset" answer above.

3.2 Similarity Measures

Similarity and Distance
  • A similarity measure can be obtained by essentially considering the inverse of a distance measure.
Mahalanobis Distance
$$d(p,q) = \sqrt{ (p - q)^{T} S^{-1} (p - q) }$$

where $S$ is the covariance matrix measuring the mutual correlations between dimensions for all points in the data set $\mathfrak{D}$.

Euclidean Distance
$$d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

The Euclidean Distance is the Mahalanobis Distance where $S$ is the identity matrix.

Minkowski Distance (Order $l$)
$$d(p,q) = \left( \sum_{i=1}^{n} \left| p_i - q_i \right|^l \right)^{\frac{1}{l}}$$

If $l = 2$, the Minkowski Distance is equal to the Euclidean Distance; $l = 1$ gives the Manhattan (city-block) distance.

Cosine Similarity
$$ \cos (\theta) = \frac{p \cdot q}{\left|\left| p \right|\right| \left|\left| q \right|\right|} = \frac{\sum_{i=1}^{n} p_{i} q_{i}}{\sqrt{\sum_{i=1}^{n} p_{i}^{2}} \sqrt{\sum_{i=1}^{n} q_{i}^{2}}}$$

The Cosine Similarity ranges from $-1$ to $1$, where $1$ implies $p$ and $q$ point in the same direction and $-1$ implies they point in exactly opposite directions.

Jaccard Index
$$J(A,B) = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}$$

where $A$ and $B$ are two datasets.
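
These measures are straightforward to compute; a hedged NumPy sketch follows (the toy data set `D` and the helper names are illustrative only, and the covariance matrix for the Mahalanobis distance is estimated from `D` itself):

```python
import numpy as np

def mahalanobis(p, q, S_inv):
    d = p - q
    return float(np.sqrt(d @ S_inv @ d))

def minkowski(p, q, l):
    return float(np.sum(np.abs(p - q) ** l) ** (1.0 / l))

def cosine_similarity(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def jaccard(A, B):
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

# Toy data set, used only to estimate the covariance matrix S.
D = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 6.2], [4.0, 7.9]])
S_inv = np.linalg.inv(np.cov(D, rowvar=False))

p, q = D[0], D[3]
print(mahalanobis(p, q, S_inv))
print(minkowski(p, q, l=2))           # l = 2 recovers the Euclidean distance
print(cosine_similarity(p, q))
print(jaccard({1, 2, 3}, {2, 3, 4}))  # |{2,3}| / |{1,2,3,4}| = 0.5
```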

3.3 Distance-Based Approaches

Distance to All Points
  • The most anomalous point is the farthest from all points in the data set.
$$\alpha(p) = \sum_{q \in \mathfrak{D}} d(p,q)$$
Distance to Nearest Neighbor
  • The most anomalous point is one whose distance to nearest neighbor is the greatest.
$$\alpha(p) = \min_{q \in \mathfrak{D}, q \neq p} d(p,q)$$
Average Distance to $k$ Nearest Neighbors
  • Let $k < N = \left| \mathfrak{D} \right|$ be the number of nearest neighbors to be considered.
  • Let $Near(p,j)$ be the $j$th nearest neighbor of a point $p \in \mathfrak{D}$, where $j < N = \left| \mathfrak{D} \right|$, breaking ties arbitrarily.
  • The most anomalous point is one whose average distance to $k$ nearest neighbors is the greatest.
$$\alpha(p) = \frac{1}{k} \sum_{j = 1}^{k} d(p, Near(p,j))$$
Median Distance to $k$ Nearest Neighbors
  • As the median is less sensitive to noise in the data than the average, the median distance to $k$ nearest neighbors can be used as the anomalousness metric.
  • If $k = 2m-1$ is odd, the median distance to the $k$ nearest neighbors is $d(p, Near(p,m))$.
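
A compact sketch of the three neighbor-distance scores above, assuming Euclidean distance; the synthetic data with one planted outlier is only for illustration:

```python
import numpy as np

def knn_scores(D, k):
    """For every point p, return its distance to the nearest neighbor, and the
    average and median distances to its k nearest neighbors (Euclidean)."""
    diff = D[:, None, :] - D[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]    # the k smallest distances per point
    return nearest[:, 0], nearest.mean(axis=1), np.median(nearest, axis=1)

rng = np.random.default_rng(1)
D = np.vstack([rng.normal(0, 1, (50, 2)), [[8.0, 8.0]]])   # one planted outlier
nn_dist, avg_k, med_k = knn_scores(D, k=3)
print("most anomalous point by average 3-NN distance:", int(np.argmax(avg_k)))  # index 50
```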

4. Clustering-Based Anomaly Detection Approaches

4.1 Identifying Clusters

  • Distance-based Clustering: Points within the same cluster are separated by relatively small distances, whereas points in different clusters are at greater distance from each other.
  • Similarity-based Clustering: Points that are similar to each other should belong in the same cluster because points at smaller distances from each other are presumed to be more similar.
Types of Clustering Algorithms
  • Assume that the data lies in a bounded continuous multi-dimensional space, and that a similarity or distance measure has already been chosen.
  • In general, each point $p_i$ is assigned a "degree of membership" $\mu(p_i, C_j) \in [0, 1]$ to any cluster $C_j$.
    • Most clustering algorithms partition the data into clusters, so that each $\mu(p_i, C_j) \in \{0, 1\}$ and $\sum_{j} \mu(p_i, C_j) = 1$ for each $p_i$.
    • Non-partitioning algorithms allow for the possibility that some points do not belong to any cluster, i.e., each $\mu(p_i, C_j) \in \{0, 1\}$ and $\sum_{j} \mu(p_i, C_j) \le 1$ for each $p_i$.
    • Some algorithms allow clusters to overlap, so that a point may belong to multiple clusters, i.e., each $\mu(p_i, C_j) \in \{0, 1\}$ and $\sum_{j} \mu(p_i, C_j) \le \text{the number of clusters}$ for each $p_i$.
    • Fuzzy clustering algorithms allow non-binary membership values, i.e., $0.0 \le \mu(p_i, C_j) \le 1.0$, although they often restrict $\sum_{j} \mu(p_i, C_j) = 1$ for each $p_i$.
$k$-Nearest Neighbor Clustering
  • Points are labeled with the same label as a majority of their $k$ nearest neighbors.
  • A region-growing heuristic is to start with a single new point, label it with a new cluster ID, and iteratively label all immediate neighbors of labeled points whose distance is smaller than a threshold.
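
A sketch of this region-growing heuristic; the distance threshold and the synthetic two-cluster data are assumptions for illustration:

```python
import numpy as np

def region_growing_clusters(D, threshold):
    """Assign a new cluster ID to an unlabeled seed point, then iteratively label
    every still-unlabeled point within `threshold` of an already-labeled point."""
    labels = np.full(len(D), -1)          # -1 means "not yet labeled"
    current = 0
    for seed in range(len(D)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        frontier = [seed]
        while frontier:
            i = frontier.pop()
            near = np.where((np.linalg.norm(D - D[i], axis=1) < threshold)
                            & (labels == -1))[0]
            labels[near] = current
            frontier.extend(near.tolist())
        current += 1
    return labels

rng = np.random.default_rng(2)
D = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
print(np.bincount(region_growing_clusters(D, threshold=1.0)))   # roughly [30 30]
```
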
$k$-Means Clustering

$k$-Means Clustering Algorithm
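
A minimal sketch of the standard (Lloyd-style) $k$-means iteration, not the book's exact pseudocode; the random initialization and stopping rule are simple assumptions:

```python
import numpy as np

def k_means(D, k, iters=100, seed=0):
    """Alternate between assigning every point to its nearest centroid and
    recomputing each centroid as the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(D[:, None] - centroids[None], axis=-1), axis=1)
        new_centroids = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(3)
D = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])
labels, centroids = k_means(D, k=2)
```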

Advantages
  1. This algorithm is the most widely used clustering algorithm.
Disadvantages
  1. The choice of the initial set of $k$ centroids can affect the final result.
  2. The number of clusters $k$ is difficult to determine.
Heuristics for $k$
  • Successively increment an initially small value of $k$ while significant measurable progress is achieved:
    1. The ratio of the intra-cluster distance to the inter-cluster distance should be small.
    2. Silhouette Measure: $s = 1 - (\text{distance to own centroid}) / (\text{distance to next nearest centroid})$
References
  • S. Lloyd, “Least squares quantization in pcm.” IEEE Trans. Inform. Theory 28(2), 129–137 (1982)
Fuzzy Clustering

Fuzzy $k$-Means Clustering Algorithm
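
A minimal fuzzy $k$-means (fuzzy $c$-means) sketch illustrating the membership-weighted update; the fuzzifier $m = 2$ and the initialization are assumptions, and this is not necessarily the book's exact formulation:

```python
import numpy as np

def fuzzy_k_means(D, k, m=2.0, iters=100, seed=0, eps=1e-9):
    """Memberships are non-binary; points close to a centroid receive large
    membership, and centroids are updated as membership-weighted means."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(D[:, None] - centroids[None], axis=-1) + eps
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)   # rows sum to 1 (fuzzy partition)
        W = U ** m                                 # distant points contribute little
        centroids = (W.T @ D) / W.sum(axis=0)[:, None]
    return U, centroids
```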

Advantages
  1. Points that are closer to a centroid are given greater weight in the minimization process.
  2. The cluster allocation depends less on points that are distant from cluster centroids.
References
  • J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Kluwer Academic Publishers, Norwell, 1981)
Agglomerative Clustering

Agglomerative Clustering Algorithm
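
A naive single-link agglomerative sketch, purely for illustration (efficient variants such as SLINK, BIRCH, and CURE are cited below). Stopping at a requested number of clusters is an assumption; in practice the full merge hierarchy (dendrogram) is built and cut afterwards:

```python
import numpy as np

def single_link_agglomerative(D, num_clusters):
    """Start with every point as its own cluster, then repeatedly merge the two
    clusters whose closest pair of points is nearest, until num_clusters remain."""
    clusters = [[i] for i in range(len(D))]
    dist = np.linalg.norm(D[:, None] - D[None], axis=-1)
    while len(clusters) > num_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()   # single-link distance
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```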

Advantages
  1. This algorithm does not require an externally determined parameter for the number of clusters $k$.
  2. This algorithm is deterministic.
Disadvantages
  1. This algorithm is more computationally complex than the $k$-means clustering algorithm.
  2. This algorithm requires external decision-making to determine how far down from the root each path of the cluster hierarchy should be traversed before declaring a node to represent a genuine cluster.
References
  • R. Sibson, “Slink: an optimally efficient algorithm for the single-link cluster method.” Comput. J. 16(1), 30–34 (1973)
  • T. Zhang, R. Ramakrishnan, M. Livny, “Birch: an efficient data clustering method for very large databases,” in ACM Sigmod Record, vol. 25(2) (ACM, New York, 1996), pp. 103–114
  • S. Guha, R. Rastogi, K. Shim, “Cure: An efficient clustering algorithm for large databases,” in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, pp. 73–84, 1998
Density-Based Agglomerative Clustering

Density-Based Agglomerative Clustering Algorithm
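
A DBSCAN-style sketch of density-based clustering (the Ester et al. reference below is the DBSCAN paper); `eps` and `min_pts` are the usual radius and minimum-neighbor parameters, and points reached by no cluster are labeled noise ($-1$):

```python
import numpy as np

def dbscan_like(D, eps, min_pts):
    """Core points have at least min_pts neighbors within radius eps; clusters
    grow outward from core points; points reached by no cluster remain noise."""
    n = len(D)
    dist = np.linalg.norm(D[:, None] - D[None], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)                      # -1 = noise / unassigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for q in neighbors[j]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:                  # only core points keep expanding
                        frontier.append(q)
        cluster += 1
    return labels
```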

Advantages
  1. This algorithm discovers clusters of arbitrary shape from noisy data sets, requiring only one scan of the dataset.
Disadvantages
  1. This algorithm has difficulties when clusters have widely varying densities.
  2. This algorithm relies on the concept of density being defined as the number of points in unit volume. This fails for features that are discrete or categorical or that have high dimensionality.
References
  • M. Ester, H.P. Kriegel, J. Sander, X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (AAAI Press, 1996), pp. 226–231
Divisive Clustering

Divisive Clustering Algorithm
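
A top-down sketch using the "Small" termination condition listed below; the bisection rule (split along the dimension of largest variance at its median) is an assumption chosen to keep the example short:

```python
import numpy as np

def divisive_clustering(D, max_size):
    """Start with all points in one cluster and repeatedly bisect any cluster
    containing more than max_size points."""
    pending, result = [np.arange(len(D))], []
    while pending:
        idx = pending.pop()
        if len(idx) <= max_size:                    # termination condition: "Small"
            result.append(idx)
            continue
        dim = int(np.argmax(D[idx].var(axis=0)))    # most spread-out dimension
        cut = np.median(D[idx, dim])
        left, right = idx[D[idx, dim] <= cut], idx[D[idx, dim] > cut]
        if len(left) == 0 or len(right) == 0:       # cannot split further
            result.append(idx)
        else:
            pending.extend([left, right])
    return result
```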

Advantages
  1. This algorithm avoids extensive distance computations required at the fine-grain levels, so it is more computationally efficient than agglomerative clustering.
Termination Conditions
  1. Small: The number of elements in each cluster is less than or equal to a threshold; or
  2. Tight: The maximum distance between data points in a cluster is less than or equal to a threshold; or
  3. Relatively Compact: The average distance between data points within the cluster is much smaller than the average distance in the entire data set.

4.2 Anomaly Detection Using Clusters

Cluster Membership or Size
  • The points that do not belong to any cluster can be considered anomalous.
  • The points that belong to clusters that contain a relatively small number of points can be considered anomalous.
Proximity to Other Points
  • Points that are substantially far from all other points in their own cluster can be considered anomalous.
  • When some points do not belong to any cluster, let $\alpha(p) = \min_{j} d(p, \mu_j)$, where $\mu_j$ is the centroid of cluster $C_j$.
    • $\mu_j = \sum_{q \in C_j} q / \left| C_j \right|$
    • If $\alpha(p)$ is large, $p$ is considered to be an anomaly.
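
A short sketch of this centroid-distance score, assuming cluster labels have already been produced by some clustering algorithm (e.g., $k$-means):

```python
import numpy as np

def centroid_distance_scores(D, labels):
    """alpha(p) = distance from p to the nearest cluster centroid; a large value
    means p is far from every cluster and is therefore a candidate anomaly."""
    centroids = np.array([D[labels == j].mean(axis=0) for j in np.unique(labels)])
    dist = np.linalg.norm(D[:, None] - centroids[None], axis=-1)
    return dist.min(axis=1)
```
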
Proximity to Nearest Neighbor
  • Any clustering algorithm that relies on distances to cluster centroids implicitly assumes that clusters are symmetric.
  • Non-trivial data sets are characterized by asymmetric clusters whose asymmetries cannot be removed by any simple linear or nonlinear transformations.
  • Use the distance to the nearest neighbor as a useful indicator of the degree to which a data point is anomalous.
Boundary Distance
  • In nearest-neighbor based anomaly detection, a set of outliers which are far from all other points can be clustered together and be mislabeled.
  • Use the distance to the nearest cluster boundary as a useful indicator of the degree to which a data point is anomalous.
When Cluster Sizes Differ
  • Remove the bias between different cluster sizes by normalizing $\alpha(p)$ based on some function of the cluster sizes.
    • Scaling Factor: $\max_{p \in C_i} \alpha(p) / \max_{p \in C_j} \alpha(p)$
Distances from Multiple Points
  • Consider distances to multiple ($k = 3$ or $k = 5$) nearest neighbors, multiple cluster centroids, or multiple cluster boundaries, and consider the averages or the medians over a collection of values to reduce the effects of noise.

Part II: Algorithms

6. Distance and Density Based Approaches

References

$k$-Nearest Neighbors (KNN)

Ramaswamy, Sridhar, Rajeev Rastogi, and Kyuseok Shim. "Efficient algorithms for mining outliers from large data sets." ACM SIGMOD Record. Vol. 29. No. 2. ACM, 2000.

Connectivity-Based Outlier Factor (COF)

Tang, Jian, et al. "Capabilities of outlier detection schemes in large datasets, framework and methodologies." Knowledge and Information Systems 11.1 (2007): 45-84.

Influential Measure of Outlierness (INFLO)

Jin, Wen, et al. "Ranking outliers using symmetric neighborhood relationship." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, 2006.

Local Correlation Integral (LOCI)

Papadimitriou, Spiros, et al. "Loci: Fast outlier detection using the local correlation integral." Proceedings 19th International Conference on Data Engineering (Cat. No. 03CH37405). IEEE, 2003.

Local Outlier Factor (LOF)

Breunig, Markus M., et al. "LOF: identifying density-based local outliers." ACM SIGMOD Record. Vol. 29. No. 2. ACM, 2000.

Local Outlier Probabilities (LOOP)

Kriegel, Hans-Peter, et al. "LoOP: local outlier probabilities." Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 2009.

7. Rank Based Approaches

References

Density-Based Clustering and Outlier Detection Algorithm (DBCOD)

Tao, Yunxin, and Dechang Pi. "Unifying density-based clustering and outlier detection." 2009 Second International Workshop on Knowledge Discovery and Data Mining. IEEE, 2009.

Rank-Based Detection Algorithm (RBDA)

Huang, Huaming, Kishan Mehrotra, and Chilukuri K. Mohan. "Rank-based outlier detection." Journal of Statistical Computation and Simulation 83.3 (2013): 518-531.