Analyzing ML-Based IDS over Real Traffic

The rapid growth of computer networks has caused a significant increase in malicious traffic, promoting the use of Intrusion Detection Systems (IDSs) to protect against ever-growing attack traffic. A great number of IDSs have been developed, each with its own strengths and weaknesses. Most IDS development and research relies on simulated, outdated datasets because real datasets are unavailable; for instance, KDD '99 and CIC-IDS-18, two of the datasets most widely used by researchers, are not sufficient to represent real-traffic scenarios. Moreover, these one-time, statically generated datasets cannot keep up with rapidly changing network patterns. To overcome these problems, we propose a framework for generating a full-featured, unbiased, updated custom dataset based on real traffic that addresses the limitations of existing datasets. In this paper, the complete methodology of the network testbed, data acquisition, and attack scenarios is discussed. The generated dataset contains more than 70 features and covers several attack types, namely DoS, DDoS, Portscan, Brute-Force, and Web attacks. The custom-generated dataset is then compared to various available datasets on seven factors: updates, practicality of generation, realness, attack diversity, flexibility, availability, and interoperability. Additionally, we train different ML-based classifiers on the custom-generated dataset and evaluate them using standard performance metrics. The generated dataset is publicly available and accessible to all users. We anticipate that this research will enable researchers to develop effective IDSs and updated, real-traffic-based datasets.


Figure 1. Flow Diagram of Methodology
The total duration of the experiment is five days, as shown in Table 1, running from Monday to Friday; each scenario, such as capturing normal or attack traffic, is assigned to a different day. We used the Wireshark tool [29] to capture network traffic in .pcap format at the attacker's side. Further details of each scenario are described in the following sections.

Figure 2 shows the complete configuration of the network. We used 5 machines: 2 Kali Linux machines, 1 Windows machine, 1 Ubuntu-based Metasploitable 2 machine, and a web server, all connected through a switch. Both Kali machines were chosen for performing attacks, as they provide over 600 penetration-testing tools [30], while the remaining machines act as victims. Victim 1 is Metasploitable 2 [31], an intentionally vulnerable virtual machine that comes with 3 security levels: low, medium, and impossible. The web server is set up on Metasploitable 2, which provides different login pages and several application-layer services such as HTTP, HTTPS, FTP, and SSH. The complete topology is configured using VirtualBox [32].

Working with realistic traffic is one of the priorities of this research. To achieve it, we captured the complete network traffic of a user on the Windows machine on three different days at different times. The captured traffic includes routine activities such as surfing the internet, attempting logins on different web pages, transferring files, and sending emails, covering a variety of protocols, for instance HTTP, HTTPS, FTP, and mail protocols. We set up an antivirus and an IDS tool to ensure that the normal traffic does not contain any intrusions or malicious traffic.

3. Attack Profiles.
Since this paper addresses network security and intrusion detection, it should cover a diverse range of attacks. Below, we define the list of common attacks, their related tools, and the commands to execute them. Each attack is performed using Kali Linux, so most attacking tools are pre-installed or can easily be found on GitHub [33]. These attacks are mainly CLI-based and are easy to use.

i. DoS Attack.
For generating DoS attacks, we used 4 different tools chosen for their specific characteristics: GoldenEye, Hulk, SlowHttptest, and Slowloris. These tools are easily accessible from GitHub. With GoldenEye [34], a single machine is enough to take down another machine: it floods the target with seemingly legitimate HTTP traffic to overwhelm web resources by frequently requesting multiple URLs. Before generating this attack, we started the Tor service to anonymize the attacker; typing ./goldeneye.py -h lists the parameters to be supplied. Figure 3 shows that we started the attack on IP 192.168.18.73 with 10 workers generating 1000 requests each, while the proxychains command routes traffic through multiple proxies to avoid identification.

Slowloris [36] is a DoS attack known for its low bandwidth consumption and high impact. The tool opens partial requests to a server and tries to keep each connection alive as long as possible. We used the slowloris module provided by the Metasploit framework built into Kali Linux. Figure 5 shows how we set up the target for the attack, where the socket count is the number of sockets used during the attack. After each interval, keep-alive headers are sent by the attacker to maintain a persistent connection with the host.

Figure 5. SlowLoris DoS Attack
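
To make the mechanism concrete, the following is a minimal Python sketch of the Slowloris keep-alive behaviour described above, intended only for a lab target. The host, port, socket count, and timing values are illustrative assumptions; the actual experiment used the Metasploit slowloris module.

    # Minimal sketch of the Slowloris keep-alive mechanism (lab target only).
    # Host, port, socket count, and timing are illustrative assumptions; the
    # experiment itself used the slowloris module shipped with Metasploit.
    import socket
    import time

    TARGET, PORT, SOCKETS = "192.168.18.73", 80, 150

    def open_partial(host, port):
        s = socket.create_connection((host, port), timeout=4)
        # An incomplete request makes the server hold the connection open
        s.send(b"GET /?%d HTTP/1.1\r\nHost: %s\r\n" % (id(s) % 10000, host.encode()))
        return s

    conns = [open_partial(TARGET, PORT) for _ in range(SOCKETS)]
    while True:
        time.sleep(10)
        for s in list(conns):
            try:
                s.send(b"X-a: b\r\n")  # periodic bogus header keeps the socket alive
            except OSError:
                conns.remove(s)
                try:
                    conns.append(open_partial(TARGET, PORT))
                except OSError:
                    pass

Because every held socket occupies a connection slot on the server, a few hundred such sockets can exhaust a small web server while consuming almost no bandwidth, which is exactly the low-bandwidth, high-impact behaviour noted above.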
ii. DDoS Attack.
A DDoS attack involves multiple systems, collectively called a botnet, that try to overwhelm the target by attacking it simultaneously. The SynFlood tool [37] is used on both Kali machines at the same time; it bombards the victim with thousands of TCP connection requests without replying to the corresponding acknowledgements. We used the synflood module provided by the Metasploit tool. Figure 6 shows an attack from a single source; the impact on the victim depends on the number of attackers.

Figure 6. SynFlood DDoS Attack

LOIC [38], short for Low Orbit Ion Cannon, is another GUI-based tool used for DDoS attacks, capable of generating three different types of requests: TCP, UDP, and HTTP. In Figure 7, the target is set for an attack by sending HTTP requests. Parameters such as the port number, request type, and transfer rate of request packets can be adjusted as required.

iii. Brute-Force Attack.
In this attack, the attacker tries to guess login information by trial and error. According to [39], most people prefer simple and common passwords such as their names, dates of birth, or "12345", "password", "admin", etc., which can be guessed easily. Several tools are available to perform brute-force attacks, such as Patator, Hydra, Ncrack, Medusa, Nmap NSE scripts, and Metasploit modules. We used Patator [40] because of its simplicity and reliability, as it writes a separate log file for each response that can be reviewed later. Moreover, Patator supports more than 30 different services, such as SSH, FTP, Telnet, SMTP, and so on. In our case, we set up FTP and SSH vulnerabilities on the Metasploitable 2 machine and executed Patator as shown in Figure 8. Before launching the attack, lists of common usernames and passwords are provided separately in text format.

iv. Portscan Attack.
Port scanning is a common technique used by attackers to scan a target machine and find vulnerabilities. Nmap [41] is one of the most popular scanning tools and is pre-installed in Kali Linux. It helps identify the hosts on a network, their open ports, and the services running on them. Figure 9 shows a scan of a victim over ports 0-1000; the result lists the open port numbers and the services available for exploitation.

Figure 9. Portscan Attack using Nmap

v. Web Attack.
Brute force is the first technique an attacker tries before proceeding to other attacks. Burp Suite [42], a tool used for penetration testing and the analysis of web attacks, is applied here to attack the web pages. Different sample login web pages can be accessed through Metasploitable 2. Figure 10 shows the trials of different passwords; the highlighted area indicates that the correct password for an authentic username has been found.

Feature Extraction.
Traffic was captured using the Wireshark tool, which produces packet capture files in .pcap format. Since .pcap files cannot be fed directly to ML models, we used CICFlowMeter [43] for further processing. Figure 11 shows the flowchart of how the .pcap files are processed. CICFlowMeter, developed by the Canadian Institute for Cybersecurity, is a traffic analyzer that can extract more than 70 network features from a .pcap file; Table 2 lists the extracted features. CICFlowMeter outputs a .csv file, which can easily be used for machine learning models.
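
As a rough illustration of the flow aggregation CICFlowMeter performs, the Python sketch below groups packets from a capture into bidirectional flows keyed by the 5-tuple and derives a few toy features. The real tool computes 70+ features per flow; the file name and the use of scapy here are our assumptions.

    # Toy illustration of the flow aggregation CICFlowMeter performs; the real
    # tool extracts 70+ features per flow. "capture.pcap" is a hypothetical
    # file name and the use of scapy here is our assumption.
    from collections import defaultdict
    from scapy.all import IP, TCP, UDP, rdpcap

    flows = defaultdict(list)
    for pkt in rdpcap("capture.pcap"):
        if IP not in pkt:
            continue
        if TCP in pkt:
            l4, proto = pkt[TCP], "TCP"
        elif UDP in pkt:
            l4, proto = pkt[UDP], "UDP"
        else:
            continue
        # Sort the endpoints so both directions of a conversation share one key
        a, b = (pkt[IP].src, l4.sport), (pkt[IP].dst, l4.dport)
        flows[(min(a, b), max(a, b), proto)].append(pkt)

    for key, pkts in flows.items():
        times = [float(p.time) for p in pkts]
        duration = max(times) - min(times)          # flow duration in seconds
        total_bytes = sum(len(p) for p in pkts)     # bytes in both directions
        print(key, len(pkts), "packets", total_bytes, "bytes", round(duration, 3), "s")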

Labeling.
It is the last part of the dataset creation process: each raw data point is assigned a class. Labeling was performed manually on each .csv file according to its scenario. Figure 12 shows the total distribution of each traffic type, while Figure 13 shows the classes and the number of samples fed to the ML models. Figure 14 shows the complete methodology.

Figure 14. Methodology of ML-based IDS

Further processing of the dataset, such as cleaning, normalization, and feature selection, is performed before forwarding the dataset to the ML models. The generated dataset may contain null or infinite values that can affect the final results [44], so we eliminated them.
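
A minimal sketch of how the per-scenario labeling and cleaning could be scripted with pandas is shown below; the file names and label strings are hypothetical, and the mapping simply mirrors the manual per-scenario labeling described above.

    # Sketch of per-scenario labeling and cleaning with pandas. File names and
    # label strings are hypothetical; the mapping mirrors our manual labeling.
    import numpy as np
    import pandas as pd

    scenarios = {  # one CICFlowMeter CSV per capture scenario (names assumed)
        "normal.csv": "Benign",
        "dos.csv": "DoS",
        "ddos.csv": "DDoS",
        "portscan.csv": "Portscan",
        "bruteforce.csv": "Brute-Force",
        "web.csv": "Web Attack",
    }
    frames = []
    for path, label in scenarios.items():
        df = pd.read_csv(path)
        df["Label"] = label  # every flow in a file belongs to that scenario
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)

    # Drop rows containing null or infinite values, as described above
    data = data.replace([np.inf, -np.inf], np.nan).dropna()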
Furthermore, the independent variables of the dataset have widely varying magnitudes, so for feature scaling we used min-max normalization, which maps all independent values into the range [0, 1] [45].
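
For instance, scikit-learn's MinMaxScaler implements this kind of [0, 1] scaling; the snippet below, which assumes the cleaned DataFrame and "Label" column from the previous sketch, is one possible realization.

    # One possible realization with scikit-learn's MinMaxScaler; "data" and
    # the "Label" column carry over from the previous sketch.
    from sklearn.preprocessing import MinMaxScaler

    X = data.drop(columns=["Label"])
    y = data["Label"]
    X_scaled = MinMaxScaler().fit_transform(X)  # every feature now lies in [0, 1]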
The dataset contains more than 70 independent features, and using all of them is not feasible because it would increase computational cost and introduce efficiency variations, so we applied the chi-squared test for feature selection [46]. In Figure 15, each feature has a score representing its correlation with the labels. By analyzing the graph, we found that around 99% of the information is contained in 40 features.
Further, Figure 16 shows the cutoff point that determines whether a feature is included or eliminated; for better accuracy and results, we eliminated the features that fall after the cutoff point.

Figure 16. Cutoff Point of Features
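
A sketch of the chi-squared selection with scikit-learn follows; keeping k=40 matches the cutoff discussed above, and the variable names carry over from the earlier preprocessing sketches.

    # Chi-squared feature selection with scikit-learn; k=40 follows the cutoff
    # discussed above, and X_scaled/y carry over from the preprocessing
    # sketches. chi2 requires non-negative inputs, which min-max scaling
    # guarantees.
    from sklearn.feature_selection import SelectKBest, chi2

    selector = SelectKBest(score_func=chi2, k=40)
    X_selected = selector.fit_transform(X_scaled, y)
    scores = selector.scores_                   # per-feature scores (cf. Figure 15)
    kept = X.columns[selector.get_support()]    # names of the 40 retained features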
After completing the data processing, we used the split-validation method and divided the dataset into three parts, a training, a validation, and a testing set, with a ratio of 60:20:20. We selected 3 different algorithms based on their performance, Support Vector Machine (SVM), Decision Tree (DT), and Naive Bayes, with their default hyperparameters, namely LinearSVC, DecisionTreeClassifier(random_state=0), and MultinomialNB(), respectively [12][47][48], and analyzed their performance using evaluation metrics such as Accuracy, Precision, Recall, and F-measure, which are computed from the values of the confusion matrix: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) [49].

i. Accuracy. The ratio of the number of correct predictions to the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

ii. Precision. Also known as the positive predictive value; the number of correct positive predictions divided by the number of predicted positives:
Precision = TP / (TP + FP)

iii. Recall. Also known as sensitivity or the true positive rate; the number of correct positive predictions divided by the total number of positive samples:
Recall = TP / (TP + FN)

iv. F-measure. The harmonic mean of Precision and Recall:
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
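
The pipeline below sketches the 60:20:20 split and the three classifiers with the default parameters named above. The fixed random seed and stratified splitting are our additions, assumed for reproducibility; X_selected and y carry over from the feature-selection sketch.

    # 60:20:20 split via two calls to train_test_split, then the three
    # classifiers with the default parameters listed above.
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report

    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X_selected, y, test_size=0.4, random_state=0, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp)

    models = {
        "SVM": LinearSVC(),
        "DT": DecisionTreeClassifier(random_state=0),
        "NB": MultinomialNB(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name)
        print(classification_report(y_test, model.predict(X_test)))

classification_report prints the per-class Precision, Recall, and F-measure defined above, alongside the overall Accuracy.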
Table 5 presents the classification report of DT, which shows values above 90% for every metric. Table 6 depicts the confusion matrix, which shows better results compared to SVM, with the highest numbers of correct predictions, 2029 and 2219, and very few falsely predicted instances. As Figure 17 shows, the Decision Tree outperforms the Support Vector Machine and Naive Bayes, reaching around 99% in all metrics. SVM is the second-best performer, with more than 90% in all metrics, while Naive Bayes shows the lowest percentages across all metrics.

Table 7 shows the comparative analysis between DARPA, KDD '99, Kyoto, ISCX2012, UNSW-NB15, CIC-IDS-18, and the proposed dataset. All the datasets mentioned were chosen based on their popularity. For the comparison, we chose seven parameters: realistic traffic, practical to generate, updates, publicly available, attack diversity, flexibility, and interoperability.
By analyzing the real-traffic column, it is obvious that most of the datasets are based on simulated traffic: either they generated synthetic normal traffic, or their attacks merely replicate real attacks. We therefore tried to provide a dataset based on realistic scenarios by capturing the actual normal traffic flowing through the network, and we generated each attack manually to capture real attack traffic.
Another noticeable problem with these datasets is that no document clearly defines how a particular dataset was generated or what tools and methods were used. The researchers' favorite datasets, KDD '99 and DARPA, which are more than 22 years old, cannot be reproduced because no detailed methodology is available. In addition to the dataset itself, we provide the complete methodology, including full information about the scenarios, tools, methods, and processes, which a user can apply to generate a new dataset based on their requirements.
The updates column shows that most datasets are not updated. The Kyoto dataset released updates until 2015, but none have appeared since. Further, CIC-IDS-18 releases only partial updates; its last release, in 2022, contains only obfuscated malware traffic. In our case, we have tried to provide an updated dataset whose traffic reflects current trends and patterns. Analyzing the attack diversity column, widely used datasets such as KDD '99, DARPA, and UNSW-NB15 cover a broad range of attacks, but most of these attacks are no longer in use. Over time, different techniques and tools have evolved, so we have tried to adopt the latest tools in our dataset.
The flexibility column shows that most of the datasets contain diverse traffic, except for Kyoto and ISCX-12, which targeted specific scenarios. We have tried to include a wide range of traffic patterns, enabling the dataset to be adapted to different objectives and scenarios.
We have tried to close this research gap by providing a complete dataset based on realistic scenarios, along with the complete methodology that enables the community to reproduce new datasets based on their needs.
To improve the performance and accuracy of IDSs, a reliable, authentic dataset is essential. In this paper, we have discussed datasets dating back to 1998, which prove insufficient in terms of traffic diversity, ground truth, updated versions, and diverse, up-to-date attacks. The problem with these static, one-time generated datasets is that they cannot adjust to ongoing changes in networks. Based on this research gap and the stated requirements, we have generated a new, practical dataset that meets the needs of researchers who want to test their IDSs on realistic data. The generated dataset is publicly available on GitHub [50][51]. Moreover, we have provided a complete, detailed methodology that helps analysts generate their own datasets based on their needs and objectives. Further, we developed three different ML models on the custom-generated, unbalanced dataset and compared the performance of each model. In our case, the Decision Tree (DT) showed much better results than the Support Vector Machine and Naive Bayes, with a higher accuracy of around 99.67%. Various studies on machine learning have been conducted that show the emergence of ML in daily dynamics [52,53,54,55,56,57,58,59].
In future work, we plan to use our custom dataset as a benchmark: we will train different machine-learning models on the various available datasets and test them using our custom dataset. Moreover, we will extend this research by training the same ML models on the custom dataset with different class-balancing techniques applied, and then analyze and compare how each balancing technique affects the performance of each model.