The terms outlier and anomaly are often used interchangeably, but they are subtly different: an outlier is a legitimate point that lies far from the mean or median, while an anomaly is an illegitimate point generated by a different process than the one that produced the rest of the data. In practice, anomalies tend to show up as outliers, which is why many outlier detection methods are also used to detect anomalies.
Anomalous events occur in many fields, which makes early and accurate detection important. In finance, for example, fraudulent transactions need to be identified; wherever machinery is used, it is valuable to detect when a machine is about to fail. Accurate anomaly detection algorithms also help anticipate cyber attacks. Other applications include detecting deforestation, glacier melting, and cancer in images.
In this post we show two algorithms for identifying fraudulent financial transactions: One-Class Support Vector Machines (OCSVM) and Autoencoders. OCSVM is a classical anomaly detection method based on geometrically finding the boundary that best separates normal data from anomalous data, while Autoencoders are a type of neural network that learns the general behavior of the data and reconstructs it. When a new observation cannot be reconstructed well, it is flagged as anomalous.
The dataset contains credit card transactions made by European cardholders over two days in September 2013. It contains 492 frauds among 284,807 transactions, so only 0.172% of the transactions are fraudulent.
Almost all columns are anonymized, which means that we cannot know what each variable represents. The anonymized columns are 28 numeric variables named V1, V2, …, V28. Only two columns were not anonymized: Time and Amount. Time is the number of seconds elapsed between a transaction and the first record in the dataset (it is not used in the analysis), while Amount is the amount of the transaction. Finally, the Class column is the response variable, which takes the value 1 in case of fraud and 0 otherwise. An extract of the data is shown in the following table:
Because the dataset is unbalanced, it is resampled to 20,000 normal records and 400 abnormal ones, so that about 2% of the data are fraudulent transactions. With the original ratio (0.172%), the algorithms may not be sensitive enough to detect an anomalous observation. Slightly increasing the ratio gives the algorithms enough normal and anomalous data to learn the patterns present in the dataset. For model evaluation, 80% of this new dataset is used for training and 20% for testing.
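The sampling and split described above could be sketched roughly as follows. The post does not show its own code, so the helper name `downsample_and_split`, the random seed, and the use of a stratified split are assumptions of this sketch, and the data frame is a random stand-in that only mimics the column layout of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def downsample_and_split(df, n_normal=20_000, n_fraud=400, test_size=0.2, seed=42):
    """Sample a fixed number of normal and fraud rows, then split 80/20.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    normal = df[df["Class"] == 0].sample(n=n_normal, random_state=seed)
    fraud = df[df["Class"] == 1].sample(n=n_fraud, random_state=seed)
    sample = pd.concat([normal, fraud])

    X = sample.drop(columns=["Class", "Time"])  # Time is not used in the analysis
    y = sample["Class"]
    # Assumption: stratify so that both partitions keep the ~2% fraud share
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)

# Random stand-in with the same columns as the real data (values are not real)
rng = np.random.default_rng(0)
n_rows = 30_000
demo = pd.DataFrame(rng.normal(size=(n_rows, 28)),
                    columns=[f"V{i}" for i in range(1, 29)])
demo["Time"] = np.arange(n_rows)
demo["Amount"] = rng.exponential(scale=50.0, size=n_rows)
demo["Class"] = 0
demo.loc[demo.index[:492], "Class"] = 1  # 492 rows marked as fraud

X_train, X_test, y_train, y_test = downsample_and_split(demo)
print(X_train.shape, X_test.shape, round(y_test.mean(), 4))
```

With a stratified split the fraud share is the same in both partitions; a plain random split would let the number of frauds in the test set vary from run to run.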
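The idea behind a Support Vector Machine is to find the hyperplane that separates the two classes while maximizing the margin between them, as shown in the figure below. In the one-class variant, the model is fitted without labels and learns a boundary that encloses most of the data; points that fall outside it are treated as anomalies.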
Often a straight line does not separate the classes well. In these cases a kernel transformation is applied: the coordinates are mapped into a different space, where the classes can be separated by a plane, as shown in the following figure.
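As a small illustration of the idea of transforming the coordinates, the sketch below uses made-up data: two concentric rings that no straight line can separate in two dimensions. Adding the squared distance from the origin as a third coordinate makes them separable by a plane. A kernel achieves this kind of mapping implicitly; the explicit feature map here is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)

# Two concentric rings: radius 1 (class 0) and radius 3 (class 1)
radii = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (radii > 2.0).astype(int)

# Explicit feature map: append the squared distance from the origin
z = (X ** 2).sum(axis=1)
X_mapped = np.column_stack([X, z])

# In the mapped space, the plane z = 5 separates the two rings
pred = (X_mapped[:, 2] > 5.0).astype(int)
accuracy = (pred == y).mean()
print(accuracy)  # 1.0
```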
After sampling, fraudulent transactions make up about 2% of the data. This estimated share of anomalies is passed to the model as the parameter “nu”. The metrics obtained on the test set are shown in the following table.
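A minimal sketch of this step with scikit-learn's `OneClassSVM` is shown below. The post does not state which library, kernel or preprocessing was used, so the RBF kernel, the standardization step and the synthetic data are all assumptions; only `nu=0.02` comes from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in for the sampled data: 2% of the rows are "anomalies"
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(1960, 5)),  # normal points
    rng.uniform(low=-8.0, high=8.0, size=(40, 5)),   # scattered anomalies
])
y = np.r_[np.zeros(1960, dtype=int), np.ones(40, dtype=int)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_train)
# The labels are not used for fitting; nu is the expected share of anomalies
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.02)
model.fit(scaler.transform(X_train))

# predict returns +1 for inliers and -1 for outliers; map that to 0/1
y_pred = (model.predict(scaler.transform(X_test)) == -1).astype(int)
# Negated decision_function, so that a higher score means "more anomalous"
scores = -model.decision_function(scaler.transform(X_test))
print(f"{y_pred.sum()} of {len(y_pred)} test points flagged as anomalies")
```

The 0/1 predictions can be compared with `y_test` to build a confusion matrix, and the continuous `scores` are what a ROC curve would be computed from.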
What we are interested in is the model’s ability to find anomalies. Looking at the table, we have a precision of 40% and a recall of 60%. (Accuracy is not a good metric for unbalanced data: a model that outputs only 0s would still reach an accuracy of ~1.) Another important metric is the AUC, the area under the curve that plots the true positive rate against the false positive rate as the decision threshold for labeling an observation as an anomaly is varied. An AUC of 0.5 means that the model classifies at random, while a value close to 1 means that it classifies almost perfectly. An example of an AUC of 0.75 is shown below.
In our case, an AUC of 0.79 is obtained on the test set. Finally, we can look at the confusion matrix, from which the above metrics are calculated. Out of the 93 fraudulent transactions, 56 were identified. There were 83 false positives and 37 false negatives.
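To make these two points concrete, here is a toy example with made-up labels and scores (not the post's data). It computes an AUC with scikit-learn and shows how a model that predicts only 0s still gets a high accuracy on unbalanced data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and anomaly scores (higher score = more anomalous)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the threshold and returns the points of the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75

# Accuracy on unbalanced data: predicting "normal" for every row
y_unbalanced = np.r_[np.zeros(980, dtype=int), np.ones(20, dtype=int)]
all_zeros = np.zeros(1000, dtype=int)
accuracy = (all_zeros == y_unbalanced).mean()
print(accuracy)  # 0.98, although no anomaly was found
```

The AUC of 0.75 can be checked by hand: of the four (anomaly, normal) pairs, the anomaly receives the higher score in three.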
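The 40% and 60% figures quoted earlier can be recovered from these confusion-matrix counts; the 40% value is the precision, 56 / (56 + 83). The helper below is a hypothetical name used only for this check.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of flagged transactions that are fraud
    recall = tp / (tp + fn)     # share of frauds that were flagged
    return precision, recall

# Counts reported for the OCSVM on the test set
precision, recall = precision_recall(tp=56, fp=83, fn=37)
print(f"precision={precision:.3f} recall={recall:.3f}")
# precision=0.403 recall=0.602
```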
Autoencoders are a type of neural network capable of discovering lower-dimensional representations of high-dimensional data and reconstructing the input from them. They consist of two parts, an Encoder and a Decoder: the former reduces the dimensionality and the latter expands this representation back to the original dimensionality. Applications of this type of network include image compression, classification, anomaly detection and generative models. The ability to reconstruct the input is what makes anomaly detection possible: the algorithm learns what normal data looks like and is able to reconstruct it. When a new record is presented and cannot be reconstructed, it is classified as an anomaly.
The figure above shows a basic Autoencoder diagram: in the first stage (Encoder), the data are encoded, then their dimensionality is reduced in the compression stage, and in the last stage (Decoder), they are reconstructed.
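The encode, compress and decode stages can be sketched in a few lines. The post does not say which framework or architecture was used, so this example swaps in scikit-learn's `MLPRegressor` trained to reproduce its own input, which is one simple way to build a small autoencoder. The layer sizes, the training on normal data only, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic "normal" data: 10 features driven by 3 latent factors
latent = rng.normal(size=(2000, 3))
mixing = rng.normal(size=(3, 10))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(2000, 10))
# Synthetic "anomalies": points that do not follow the latent structure
X_anom = rng.normal(scale=3.0, size=(40, 10))

scaler = StandardScaler().fit(X_normal)
Xn = scaler.transform(X_normal)
Xa = scaler.transform(X_anom)

# Encoder 10 -> 6 -> 3 (bottleneck), decoder 3 -> 6 -> 10
autoencoder = MLPRegressor(hidden_layer_sizes=(6, 3, 6), activation="tanh",
                           max_iter=500, random_state=0)
autoencoder.fit(Xn, Xn)  # the target is the input itself

def reconstruction_mse(model, X):
    """Mean squared reconstruction error per row."""
    return ((model.predict(X) - X) ** 2).mean(axis=1)

mse_normal = reconstruction_mse(autoencoder, Xn)
mse_anom = reconstruction_mse(autoencoder, Xa)
print(f"mean MSE normal: {mse_normal.mean():.3f}, anomalies: {mse_anom.mean():.3f}")
```

Because the network only ever sees normal rows, it should reconstruct rows that break the learned structure less well, and that gap in reconstruction error is what an MSE cut-off exploits.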
The trained Autoencoder is then applied to the test data. Each observation is scored by its reconstruction error, and several cut-off values are tried to find the one that maximizes recall. In our case, we use MSE cut-offs of 5, 10 and 15. The results are shown in the table below:
In the above table, Recall increases when MSE 10 is used as the error limit instead of MSE 5. When the limit is raised to MSE 15, however, Recall for the anomaly class drops from 0.87 to 0.59. MSE 10 is therefore considered the best option in this case. The figure below shows the cutoff graphically: the red crosses are the anomalous points and the green ones are the normal ones. The Autoencoder classifies the points above the blue line as anomalies.
For this model with MSE 10, an AUC of 0.92 is obtained, an improvement over the OCSVM of the previous section. Precision also improved, from 40% to 44%, as did Recall, from 60% to 87%. The confusion matrix shows that, of the 93 fraudulent transactions, 81 were identified this time, compared to 56 for the OCSVM. On the other hand, false positives increased from 83 to 102. Varying the error limit can help reduce these false positives and find the balance that best suits the needs of the case. Here it is preferable to find as many fraudulent transactions as possible, so Recall is the metric that guides the choice of model.
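The mechanics of applying a cut-off to the reconstruction error can be sketched as follows. The error values and labels are invented for illustration and are not the post's results; the helper name `precision_recall_at` is likewise hypothetical.

```python
import numpy as np

def precision_recall_at(errors, y_true, threshold):
    """Flag rows whose reconstruction error exceeds the threshold."""
    flagged = errors > threshold
    tp = (flagged & (y_true == 1)).sum()
    precision = tp / flagged.sum()
    recall = tp / (y_true == 1).sum()
    return precision, recall

# Hypothetical reconstruction errors (MSE) and true labels (1 = anomaly)
errors = np.array([0.5, 1.2, 3.0, 7.5, 12.0, 18.0, 2.2, 9.1])
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])

results = {t: precision_recall_at(errors, y_true, t) for t in (5, 10, 15)}
for threshold, (precision, recall) in results.items():
    print(f"MSE > {threshold}: precision={precision:.2f} recall={recall:.2f}")
# MSE > 5: precision=0.75 recall=1.00
# MSE > 10: precision=1.00 recall=0.67
# MSE > 15: precision=1.00 recall=0.33
```

In this toy data, raising the cut-off flags fewer points, trading recall for precision; which balance is right depends on the cost of a missed fraud versus a false alarm.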
This post showed two methods for identifying anomalies: OCSVMs and Autoencoders. Anomalies usually represent a small percentage of the data, yet identifying them can be of great importance in a large number of applications and industries.
Both models were tested on a transactional dataset with the objective of finding fraudulent transactions. In this case, the Autoencoder proved superior in terms of the metrics: AUC, precision and recall on the test set all improved. The OCSVM nevertheless remains a good choice because of its simplicity and interpretability.
There are other classical methods, such as Isolation Forest, and in Deep Learning there are other types of Autoencoders, such as Deep Autoencoders and Sparse Autoencoders. For anomalies in time series, LSTM networks and Temporal Convolutional Networks are also quite common.
Alla, S., & Adari, S. K. (2019). Beginning Anomaly Detection Using Python-Based Deep Learning. Apress, Berkeley, CA.