Anomaly detection is one of the most important aspects of network security monitoring. Most current network intrusion detection systems (NIDS) use signature-based techniques. Recently, data-mining approaches have become popular, and many papers have been published on the topic. Supervised data-mining approaches rely on labeled training data, but labeled network traffic is expensive to obtain, and these methods also have difficulty detecting new types of attacks. Unsupervised clustering techniques, by contrast, can find anomalies in the traffic without labels. In their paper, Leung et al. present a new grid-based and density-based clustering algorithm called fpMAFIA, which they claim works very efficiently on network data.

Due to the increasing amount of malicious traffic, network intrusion detection systems are gaining importance and attracting more attention. A network intrusion can cause catastrophic damage to a network. Most NIDS use signature-based methods to identify malicious traffic. Signature-based methods apply predefined signatures or rules to netflow data in order to classify traffic as malicious or legitimate. However, signature-based NIDS are vulnerable to zero-day attacks, which have no signature in the signature database. Snort and Bro are the most widely used IDS software packages that use signature-based methods.

There are also data-mining and machine-learning techniques for detecting and classifying intrusions. These techniques fall into two distinct categories:

1. Anomaly detection
2. Misuse detection

Misuse detection methods first define abnormal system behaviour and then treat all other behaviour as normal. The signature-based techniques described above, for instance, are misuse detection methods. Anomaly detection methods work the opposite way: they build a model of normal data and then attempt to detect deviations from that model in observed data. Those deviations are the anomalies.
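The contrast between the two categories can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the signature strings, the single "bytes per connection" feature, and the three-standard-deviation threshold are hypothetical and do not come from any particular IDS or from the paper.

```python
from statistics import mean, stdev

# Hypothetical signature database for the misuse detector (illustrative strings only).
SIGNATURES = ["/etc/passwd", "cmd.exe"]

def misuse_detect(payload):
    """Flag a payload only if it contains a known-bad signature."""
    return any(sig in payload for sig in SIGNATURES)

def fit_normal_model(values):
    """Model normal behaviour as the mean and standard deviation of one feature."""
    return mean(values), stdev(values)

def anomaly_detect(value, model, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma

# Assumed feature: bytes per connection observed during normal operation.
normal_bytes = [480, 510, 495, 505, 500, 490, 515, 498]
model = fit_normal_model(normal_bytes)

print(misuse_detect("GET /../../etc/passwd"))  # known signature
print(misuse_detect("GET /new-exploit"))       # unseen attack, no signature
print(anomaly_detect(9000, model))             # far from the normal model
print(anomaly_detect(502, model))              # close to the normal model
```

The misuse detector misses the unseen payload because nothing in its database matches, while the anomaly detector needs no signatures at all and depends only on how well the normal model was fitted.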

Unsupervised anomaly detection relies on two main assumptions about the data:

Assumption 1: The majority of network connections are normal traffic. Only X% of the traffic is malicious.
Assumption 2: The attack traffic is significantly different from the normal traffic.

Unsupervised learning techniques are well suited to anomaly detection because they cluster the data according to its statistical properties and, under Assumptions 1 and 2, the statistical properties of attack traffic are significantly different from those of normal traffic. Attack traffic should therefore fall into small clusters that are separate from the large clusters of normal traffic.

Many data-mining algorithms have been tested against network payload data, including the fixed-width clustering algorithm, the k-nearest neighbour algorithm, and one-class support vector machines. All of these approaches are reported to work well and accurately on network traffic.
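As an illustration of the first of these algorithms, here is a simplified sketch of fixed-width clustering. It assumes Euclidean distance and keeps each cluster centre fixed at the first point assigned to it; published variants differ in details such as normalization and centre updates, and the `width` value and sample points below are made up.

```python
import math

def fixed_width_clustering(points, width):
    """Single pass: each point joins the nearest centre within `width`,
    otherwise it becomes the centre of a new cluster."""
    centres = []   # one centre per cluster
    members = []   # members[i] holds the points assigned to centres[i]
    for p in points:
        best, best_dist = None, None
        for i, c in enumerate(centres):
            d = math.dist(p, c)
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= width:
            members[best].append(p)
        else:
            centres.append(p)
            members.append([p])
    return centres, members

# Hypothetical two-feature connection records: four similar ones and one outlier.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (5.0, 5.0)]
centres, members = fixed_width_clustering(points, width=1.0)
sizes = [len(m) for m in members]
print(sizes)
```

Under the two assumptions above, the large cluster would be treated as normal traffic and the singleton cluster as a candidate anomaly.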

Leung et al. developed a technique called fpMAFIA, which combines density-based and grid-based clustering. fpMAFIA is a successor of the pMAFIA and CLIQUE algorithms. It differs from pMAFIA in that it uses FP-trees (frequent-pattern trees) in the intermediate step.
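The general idea behind grid-based, density-based clustering can be sketched as follows. This is not the fpMAFIA algorithm: it assumes a uniform grid with a fixed cell width, considers only the full-dimensional space, and omits the FP-tree step entirely. The helper names, cell width, and density threshold are hypothetical.

```python
from collections import Counter

def cell_of(point, cell_width):
    """Return the grid cell (a tuple of integer indices) that contains `point`."""
    return tuple(int(x // cell_width) for x in point)

def dense_cells(points, cell_width, min_points):
    """Count points per grid cell and keep cells holding at least `min_points`."""
    counts = Counter(cell_of(p, cell_width) for p in points)
    return {cell: n for cell, n in counts.items() if n >= min_points}

# Hypothetical two-feature records: four fall in one cell, one lies far away.
points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.9), (0.8, 0.1), (7.5, 7.5)]
cells = dense_cells(points, cell_width=1.0, min_points=3)

# Points outside every dense cell are candidate anomalies.
outliers = [p for p in points if cell_of(p, 1.0) not in cells]
print(cells, outliers)
```

Counting points per cell needs only one pass over the data, which is one reason grid-based methods are attractive for large traffic volumes.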

Unsupervised anomaly detection faces three challenges:

1) Processing packets efficiently
2) Working with high-dimensional data
3) Accurately determining the boundaries of clusters and ensuring that the clusters cover 95% of the data
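The coverage requirement in the third challenge can be made concrete with a small sketch. It shows one possible reading, not necessarily the procedure used in the paper: clusters are taken largest first until they cover the target share of the data, and whatever remains is treated as candidate anomalies. The function name, cluster names, and sizes are hypothetical.

```python
def split_by_coverage(cluster_sizes, coverage=0.95):
    """Pick the largest clusters until they hold at least `coverage` of all points.

    Returns (normal_clusters, anomalous_clusters, achieved_coverage).
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return [], [], 0.0
    normal, covered = [], 0
    ranked = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ranked:
        if covered / total >= coverage:
            break
        normal.append(name)
        covered += size
    anomalous = [name for name in cluster_sizes if name not in normal]
    return normal, anomalous, covered / total

# Hypothetical cluster sizes for 1000 connection records.
sizes = {"c0": 700, "c1": 260, "c2": 30, "c3": 10}
normal, anomalous, achieved = split_by_coverage(sizes)
print(normal, anomalous, achieved)
```

With these sizes the two largest clusters cover 96% of the records, so the two small clusters are left over as candidate anomalies.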