Unsupervised Anomaly Detection on Time Series

Understanding the normal behaviour of a signal over time and detecting anomalous situations is one of the prominent fields in data-driven studies. These studies are mostly conducted in an unsupervised manner, because labelling data in real-life projects is very tough: if you do not already have label information, it requires a deep retrospective analysis. Keep in mind that the terms outlier detection and anomaly detection are used interchangeably most of the time.

There is no magical silver bullet that performs well in all anomaly detection use cases. In this article, I touch on the fundamental methodologies mainly used to detect anomalies in time series in an unsupervised way, and briefly describe how each one works. In this sense, it can be read as an overview of anomaly detection on time series, including real-life experience.

Probability Based Approaches

Using the z-score is one of the most straightforward methodologies. The z-score is the number of standard deviations by which a sample value lies below or above the mean of the distribution. The method assumes that each feature fits a normal distribution, and calculating the z-score of each feature of a sample gives insight for detecting anomalies. Samples with many features whose values lie far from the means are likely to be anomalies.
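
As a minimal sketch of the idea, the snippet below computes z-scores for a single feature and flags values beyond a cut-off. The function names, the sample values and the threshold of 2.5 are my own assumptions for illustration; note that in a small sample a single extreme value inflates the standard deviation, so its own z-score stays bounded.

```python
from statistics import mean, stdev

def z_scores(values):
    """Z-score of every value relative to the sample mean and standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant series: nothing deviates
    return [(v - mu) / sigma for v in values]

def flag_anomalies(values, threshold=3.0):
    """Indices whose absolute z-score exceeds the (assumed) threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

# Hypothetical traffic values in Mbps; the last one is the injected anomaly.
traffic = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.4, 2.6, 2.5, 2.4, 2.6, 9.0]
print(flag_anomalies(traffic, threshold=2.5))  # → [11]
```

For multivariate samples, the same function can be applied per feature and the per-feature scores combined, for example by counting how many features exceed the threshold.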

When estimating z-scores, you should take into account the factors that shape the pattern in order to get more robust inferences. As an example, suppose you aim to detect anomalies in the traffic values of devices in the telco domain. Hour, weekday and device information (if the dataset contains multiple devices) are likely to shape the pattern of traffic values. For this reason, in this example the z-score should be estimated separately for each device, hour and weekday. For instance, if you expect 2.5 Mbps of average traffic on device A at 8 p.m. on weekends, you should use that value as the baseline when making a decision for that device and time slot.
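
The per-device, per-weekday, per-hour baseline described above can be sketched as follows. This is only an illustration under assumptions: the record layout `(device, weekday, hour, value)`, the function name and the sample numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_z_scores(records):
    """records: list of (device, weekday, hour, value) tuples.

    Returns one z-score per record, computed only against the other
    values that share the same (device, weekday, hour) context.
    """
    groups = defaultdict(list)
    for device, weekday, hour, value in records:
        groups[(device, weekday, hour)].append(value)

    stats = {}
    for key, vals in groups.items():
        sigma = stdev(vals) if len(vals) > 1 else 0.0
        stats[key] = (mean(vals), sigma)

    scores = []
    for device, weekday, hour, value in records:
        mu, sigma = stats[(device, weekday, hour)]
        scores.append((value - mu) / sigma if sigma > 0 else 0.0)
    return scores

# Device A normally carries about 2.5 Mbps on Saturdays at 20:00, device B about 100 Mbps.
a_vals = [2.4, 2.6, 2.5, 2.5, 2.4, 2.6, 2.5, 2.6, 2.4, 9.0]
b_vals = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
records = ([("A", "Sat", 20, v) for v in a_vals]
           + [("B", "Sat", 20, v) for v in b_vals])

scores = contextual_z_scores(records)
print([i for i, z in enumerate(scores) if abs(z) > 2.5])  # → [9]
```

A value of 9.0 Mbps is unremarkable next to device B's traffic, but it stands out clearly against device A's own baseline, which is the point of grouping by context.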

One drawback of this approach is that it assumes the features fit a normal distribution, which is not always true. Another is that the solution described above ignores the correlations between features. One important point is that z-scores can also be used as inputs for other anomaly detection models.

A quartile-based solution implements a very similar idea to the z-score; the difference is that it is built around the median rather than the mean. Depending on the distribution of the data, it sometimes achieves better results than the z-score.
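
One common way to turn quartiles into an outlier rule is the interquartile range (IQR) fence. The article does not say which variant is meant, so treat the rule below, the multiplier `k=1.5` and the sample data as assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical traffic values in Mbps; the last one is the injected anomaly.
traffic = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.4, 2.6, 2.5, 2.4, 2.6, 9.0]
print(iqr_outliers(traffic))  # → [11]
```

Unlike the mean and standard deviation, the quartiles barely move when one extreme value is added, which is why this rule can behave better on skewed or contaminated data.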

Elliptic Envelope is another option for outlier detection; it fits a multivariate Gaussian distribution to the data. However, it might not perform well on high-dimensional data.

Forecasting Based Approaches

In this methodology, a forecasting model predicts the next time period, and if the observed value falls outside the confidence interval around the forecast, the sample is flagged as an anomaly. LSTM and ARIMA models can be used as the forecasting model. The advantage of these methods is that they perform well on time series and can usually be applied directly, without a feature engineering step. On the other hand, estimating the confidence interval is not a trivial task. Furthermore, the accuracy of the forecasting model directly affects the success of anomaly detection. The good news is that you can evaluate the accuracy of the forecasting model in a supervised manner, even if you are performing anomaly detection without any label information.
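
To make the mechanics concrete, here is a minimal sketch in which a moving average stands in for the forecasting model (it is not ARIMA or LSTM) and the band is simply `k` standard deviations of the past forecast errors. The function name, `window`, `k` and the synthetic series are assumptions for illustration.

```python
from statistics import mean, stdev

def forecast_anomalies(series, window=5, k=3.0):
    """Flag points that fall outside forecast +/- k * std(past forecast errors).

    Forecast = mean of the last `window` accepted values (a stand-in forecaster).
    """
    history = list(series[:window])  # values the forecaster is allowed to see
    residuals, flagged = [], []
    for t in range(window, len(series)):
        forecast = mean(history[-window:])
        error = series[t] - forecast
        # Only start judging once a few forecast errors have been collected.
        is_anomaly = (len(residuals) >= window
                      and abs(error) > k * stdev(residuals))
        if is_anomaly:
            flagged.append(t)
            history.append(forecast)  # keep the anomaly out of later forecasts
        else:
            history.append(series[t])
            residuals.append(error)
    return flagged

# Synthetic series: a repeating pattern around 10 with one injected spike.
pattern = [10.0, 10.3, 9.8, 10.1, 9.9, 10.2, 9.7]
series = pattern * 5
series[25] = 15.0
print(forecast_anomalies(series))
```

Replacing a flagged value with its forecast is a design choice: otherwise the spike would leak into the next few forecasts and trigger false alarms right after the real anomaly.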

Prophet is also worth a look. It is basically a forecasting algorithm designed for time series and developed by Facebook, but I have encountered many applications of it in anomaly detection use cases.

Neural Network Based Approaches

An autoencoder is an unsupervised type of neural network, mainly used for feature extraction and dimension reduction. It is also a good option for anomaly detection problems. An autoencoder consists of an encoding part and a decoding part. The encoding part extracts the main features that represent the patterns in the data, and the decoding part then reconstructs each sample from them. The reconstruction error will be minimal for normal samples. On the other hand, the model is not able to reconstruct a sample that behaves abnormally, which results in a high reconstruction error. So, basically, the higher the reconstruction error of a sample, the more likely it is to be an anomaly.

Autoencoders are very convenient for time series, so they can be considered among the preferred alternatives for anomaly detection on time series. Note that the layers of an autoencoder can also be composed of LSTMs, so that dependencies in sequential data such as time series can be captured.

Self-Organizing Maps (SOM) are another unsupervised neural network based approach, with a simpler working principle than other neural network models. Although they are not widely used in anomaly detection use cases, it is good to keep in mind that they are also an alternative.

Clustering Based Approaches

The idea behind using clustering in anomaly detection is that outliers do not belong to any cluster or form their own clusters. k-means is one of the best-known clustering algorithms and is easy to implement. However, it brings some limitations, such as having to pick an appropriate k value. Moreover, it forms spherical clusters, which is not appropriate in all cases. Another drawback is that it cannot supply a probability when assigning samples to clusters, which matters especially because clusters can overlap in some situations.

The Gaussian Mixture Model (GMM) addresses the abovementioned weaknesses of k-means and offers a probabilistic approach. It attempts to find a mixture of a finite number of Gaussian distributions in the dataset.

DBSCAN is a density-based clustering algorithm. It determines the core points in the dataset, that is, the points that have at least min_samples points within epsilon distance, and creates clusters from them. After that, it finds all points that are density-reachable (within epsilon distance) from any core sample in the cluster and adds them to the cluster. It then iteratively performs the same procedure for the newly added samples and extends the cluster. DBSCAN determines the number of clusters by itself, and outlier samples are labelled -1. In other words, it serves anomaly detection directly. Note that it might suffer from performance issues on large datasets.
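
The procedure above can be sketched in plain Python as follows. This is a simplified, quadratic-time illustration rather than a production implementation; the function name and the toy points are assumptions, and a point counts as its own neighbour, which is the usual convention.

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point is included in its own neighbourhood.
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # first dense group
          (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
          (5, 5)]                                   # isolated point
print(dbscan(points, eps=1.5, min_samples=3))
# → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```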

Proximity Based Approaches

The first algorithm that comes to mind is the k-nearest neighbor (k-NN) algorithm. The simple logic behind it is that outliers lie far away from the rest of the samples in the feature space. The distances from each sample to its nearest neighbors are computed, and samples located far from the others can be flagged as outliers. k-NN can use different distance metrics such as Euclidean, Manhattan, Minkowski and Hamming distance.
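
A minimal sketch of this scoring, assuming Euclidean distance and using the mean distance to the k nearest neighbours as the outlier score (other variants use the distance to the k-th neighbour only). The function name and the toy points are hypothetical.

```python
from math import dist

def knn_scores(points, k=3):
    """Outlier score of each point = mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

# A 3x3 grid of "normal" points plus one far-away point at index 9.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = knn_scores(points, k=3)
print(scores.index(max(scores)))  # → 9
```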

Another alternative is the Local Outlier Factor (LOF), which identifies local outliers with respect to their local neighbors rather than the global data distribution. It uses a metric called local reachability density (lrd) to represent the density level of each point. The LOF of a sample is simply the ratio of the average lrd of the sample's neighbours to the lrd of the sample itself. If the density of a point is much smaller than the average density of its neighbors, it is likely to be an anomaly.
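
The ratio described above can be sketched directly. This is a simplified illustration: it takes exactly k neighbours even when distances tie, and it assumes there are no duplicate points (which would make a density infinite). The function name and the toy points are assumptions.

```python
from math import dist

def local_outlier_factor(points, k=3):
    """LOF per point: average lrd of its k neighbours divided by its own lrd."""
    n = len(points)
    # For every point, its k nearest neighbours as (distance, index) pairs.
    knn = []
    for i in range(n):
        order = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        knn.append(order[:k])
    k_distance = [neigh[-1][0] for neigh in knn]  # distance to the k-th neighbour

    def lrd(i):
        # Reachability distance from i to neighbour j: max(k_distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in knn[i]]
        return len(reach) / sum(reach)

    densities = [lrd(i) for i in range(n)]
    return [sum(densities[j] for _, j in knn[i]) / (k * densities[i])
            for i in range(n)]

# A 3x3 grid of "normal" points plus one far-away point at index 9.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
lof = local_outlier_factor(points, k=3)
print(lof.index(max(lof)))  # → 9
```

Values close to 1 mean a point is about as dense as its neighbours; the isolated point gets a score several times larger.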

Tree Based Approaches

Isolation Forest is a tree-based, very effective algorithm for detecting anomalies. It builds multiple trees. To build a tree, it randomly picks a feature and a split value between the minimum and maximum values of that feature, and repeats this splitting recursively on the resulting partitions until the samples are isolated. Finally, the results of all trees in the forest are averaged to form the ensemble.

The idea behind Isolation Forest is that outliers are easy to separate from the rest of the samples in the dataset. For this reason, we expect abnormal samples to have shorter paths from the root to a leaf node in a tree (that is, fewer splits are required to isolate the sample) than the rest of the samples.
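
The path-length intuition can be illustrated with a stripped-down forest. This sketch deliberately omits parts of the full algorithm, such as sub-sampling and the normalisation of path lengths into a score; the function names, parameters and toy data are my own assumptions.

```python
import random

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}  # leaf
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}  # cannot split on this feature: stop (simplification)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(row, node, depth=0):
    """Number of splits needed before `row` ends up in a leaf."""
    if "size" in node:
        return depth
    branch = "left" if row[node["feature"]] < node["split"] else "right"
    return path_length(row, node[branch], depth + 1)

def mean_path_lengths(rows, n_trees=100, max_depth=8, seed=0):
    """Average path length of every row over a forest of random trees."""
    rng = random.Random(seed)
    trees = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    return [sum(path_length(r, t) for t in trees) / n_trees for r in rows]

# A tight 5x4 grid of "normal" points plus one far-away point at index 20.
data = [(x * 0.1, y * 0.1) for x in range(5) for y in range(4)] + [(10.0, 10.0)]
scores = mean_path_lengths(data)
print(scores.index(min(scores)))  # index of the shortest average path
```

The far-away point is almost always cut off by the very first split, so its average path is the shortest in the dataset.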

Extended Isolation Forest improves the splitting process of Isolation Forest. In Isolation Forest, splitting is performed parallel to the axes, in other words horizontally or vertically, which results in many redundant regions in the domain and, similarly, in the over-construction of many trees. Extended Isolation Forest remedies these shortcomings by allowing splits in every direction: instead of selecting a random feature with a random split value, it selects a random normal vector along with a random intercept point.

Dimension Reduction Based Approaches

Principal Component Analysis (PCA) is mainly used as a dimension reduction method for high-dimensional data. Basically, it captures most of the variance in the data in fewer dimensions by extracting the eigenvectors that have the largest eigenvalues. Therefore, it is able to keep most of the information in the data with a much smaller dimension.

When used for anomaly detection, PCA follows an approach very similar to autoencoders. First, it projects the data into a lower dimension, and then it reconstructs the data from that reduced representation. Abnormal samples tend to have a high reconstruction error because they behave differently from the other observations in the data, so it is difficult to recover the same observation from the reduced version. PCA can be a good option for multivariate anomaly detection scenarios.
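
As a rough sketch of the reconstruction-error idea, the snippet below keeps only the first principal component, found by power iteration on the covariance matrix, and measures how far each sample is from its reconstruction. Keeping a single component, the function names and the toy data are assumptions made for brevity; power iteration also assumes the starting vector is not orthogonal to the leading eigenvector.

```python
from math import sqrt

def first_component(rows, iterations=200):
    """Mean vector and leading principal direction, via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def reconstruction_errors(rows):
    """Distance between each sample and its reconstruction from one component."""
    mean, v = first_component(rows)
    errors = []
    for r in rows:
        c = [x - m for x, m in zip(r, mean)]
        score = sum(a * b for a, b in zip(c, v))   # project to 1 dimension
        recon = [score * b for b in v]             # map back to d dimensions
        errors.append(sqrt(sum((a - b) ** 2 for a, b in zip(c, recon))))
    return errors

# Ten points that follow y = 2x, plus one point (index 10) that breaks the relationship.
rows = [(float(x), 2.0 * x) for x in range(10)] + [(5.0, 0.0)]
errors = reconstruction_errors(rows)
print(errors.index(max(errors)))  # → 10
```

Note that the point at index 10 is not extreme in either feature on its own; it is only unusual because it violates the correlation between them, which per-feature z-scores would miss.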

[Figure: Anomaly Detection on Time Series]

REAL LIFE EXPERIENCES

Before starting the study, answer the following questions: How much historical data do you have? Is the data univariate or multivariate? How frequently will anomaly detection run (near real time, hourly, weekly)? On what unit are you supposed to detect anomalies? (For instance, if you are studying traffic values, you might detect anomalies per device only, or for each slot/port of the devices.)
Does your data have multiple items? To clarify, assume that you are supposed to perform anomaly detection on the traffic values of devices, as in the earlier telco example. You probably have traffic values for many devices (maybe thousands of different devices), each with a different pattern, and you should avoid designing a separate model for each device because of the complexity and maintenance issues in production. In such situations, selecting the right features is more useful than trying different models. Determine the pattern of each device considering properties such as hour and weekday/weekend information, extract the deviations from these patterns (such as z-scores), and feed the models with these features. Note that the anomalies tackled in time series are mostly contextual anomalies. This way you can handle the problem with only one model, which is really valuable. From the forecasting perspective, a multi-head neural network based model can be adopted as an advanced solution.
Before starting, if possible, be sure to ask the client for a few examples of past anomalies. They will give you insight into what is expected from you.
The number of anomalies is another concern. Most anomaly detection algorithms have an internal scoring process, so you can tune the number of anomalies by selecting an optimum threshold. Most of the time, clients don't want to be disturbed by too many anomalies, even if they are real anomalies. Therefore, you might need a separate false positive elimination module. For example, if a device has a traffic pattern of 10 Mbps and it increases to 30 Mbps at some point, that is definitely an anomaly. However, it might not attract as much attention as an increase from 1 Gbps to 1.3 Gbps.
Before making any decision about methodologies, I recommend visualizing the data, at least for a sub-sample; this gives deep insight into the data.
While some methods accept the time series directly without any preprocessing step, for others you need to implement a preprocessing or feature extraction step to turn the data into a convenient format.
Note that novelty detection and anomaly detection are different concepts. In short, in novelty detection you have a dataset that consists entirely of normal observations, and you decide whether a newly received observation fits the data in the training set. In contrast, in anomaly detection the training set consists of both normal and abnormal samples. One-class SVM might be a good option for novelty detection problems.
I encourage you to take a look at the pyod and pycaret libraries in Python, which provide off-the-shelf solutions for anomaly detection.
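
The point above about tuning the number of anomalies through a score threshold can be sketched as follows. The idea of flagging a fixed fraction of the highest scores, the name `contamination` and the sample scores are assumptions for illustration; the scores themselves could come from any of the methods in this article.

```python
def top_fraction_threshold(scores, contamination=0.01):
    """Threshold chosen so that roughly `contamination` of the scores reach it."""
    ranked = sorted(scores, reverse=True)
    n_flagged = max(1, round(len(scores) * contamination))
    return ranked[n_flagged - 1]

def flag(scores, contamination=0.01):
    """Indices of the samples whose score reaches the threshold."""
    threshold = top_fraction_threshold(scores, contamination)
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical anomaly scores for ten samples (higher = more anomalous).
scores = [0.10, 0.20, 0.90, 0.30, 0.15, 0.80, 0.05, 0.20, 0.10, 0.25]
print(flag(scores, contamination=0.2))  # → [2, 5]
```

Lowering `contamination` is the simplest way to send fewer alerts to a client; tied scores at the threshold are all flagged, so the count can be slightly higher than requested.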

USEFUL LINKS

Original article: https://towardsdatascience.com/unsupervised-anomaly-detection-on-time-series-9bcee10ab473