Download
WARNING! The link will download an encrypted zip file that contains real malicious samples. Handle the contents with care, and use them only for legitimate purposes. I have removed the file extensions from all of the samples in order to prevent accidental execution; however, I still highly recommend opening the archive in a sandboxed environment. As an additional precaution, you should also change the permissions of the folder to deny “Execute” permissions to all files in the folder. Conducting your analysis on a non-Windows operating system will also help reduce risk.
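On a POSIX analysis machine, the permission precaution can be approximated by clearing the execute bits on every extracted file. This is only a sketch: the function name `strip_execute_bits` is hypothetical, it assumes a Linux or macOS filesystem, and it does not touch Windows ACLs, which are managed differently.

```python
import os
import stat

# Execute bits for owner, group, and others.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH

def strip_execute_bits(folder):
    """Clear execute bits on every regular file under folder; return how many changed."""
    changed = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # leave symlinks alone; chmod would follow them
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode & EXEC_BITS:
                os.chmod(path, mode & ~EXEC_BITS)
                changed += 1
    return changed
```

Calling `strip_execute_bits("samples")` after extraction would cover the whole samples folder. It complements a sandbox and does not replace one.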
OneDrive: https://1drv.ms/u/s!AsaC1RPcfUL1oB5qbdWOm-PIk2jX?e=mOlo6J
Password: infected
SHA256: A8B02407A1F8C77DD9DCCC229503A4F668083271EDCBB0289D53C28EBF51215E
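Before extracting anything, it is worth confirming that the download is intact. The sketch below assumes the SHA256 above is the hash of the downloaded zip file itself; the helper names and the file name `dataset.zip` are placeholders, not part of the dataset.

```python
import hashlib

# Hash published above; assumed here to be the hash of the zip archive.
EXPECTED_SHA256 = "A8B02407A1F8C77DD9DCCC229503A4F668083271EDCBB0289D53C28EBF51215E"

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare case-insensitively, since hex digests may be upper or lower case."""
    return sha256_of_file(path).lower() == expected.lower()
```

For example, `matches_expected("dataset.zip")` should return `True` for an uncorrupted download if the assumption about what was hashed holds.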
Terms of Use
If you use this dataset, please adhere to the following rules:
- Do not use the files for malicious purposes.
- Let me know via comment, email, or tweet if you found this dataset useful.
- Cite me as a source in any academic paper that leverages this dataset using the following contact information:
  - Name: Michael Lester
  - Email: [email protected]
  - Website: https://www.practicalsecurityanalytics.com
- Provide feedback or recommendations on how to improve the dataset.
Purpose
The purpose of this dataset is to provide raw labeled portable executables to security and AI researchers in order to improve cyber security in the industry. Many of the datasets that I have seen (such as this dataset from a Microsoft-sponsored Kaggle competition) do not provide the raw binary files themselves, but rather metadata that has already been pre-extracted from the samples. This limits what researchers can learn, because they cannot explore other features that could be extracted from the raw samples themselves.
About the Dataset
Statistics
Statistic | Value |
Samples | 201,549 |
Legitimate | 86,812 |
Malicious | 114,737 |
Compressed Size | 43.8 GB |
Uncompressed Size | 117 GB |
File Types | All are Portable Executable files. Most are user-mode Portable Executable files (e.g. .exe, .dll, .scr). |
Layout
The dataset has the following folder structure:
- samples
  - 1
  - 2
  - 3
  - …
- samples.csv
Each file in the “samples” folder is named after the id field of its corresponding entry in the samples.csv file. The samples.csv file contains the labels for each of the samples in the samples folder.
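The mapping between files and labels can be sketched as follows. This assumes samples.csv has a header row whose column names match the field names in the Labels section (`id`, `list`) and that it is UTF-8 encoded; neither is confirmed here, and the helper names are hypothetical.

```python
import csv
import os

def load_labels(csv_path):
    """Map each sample id (as a string) to True if blacklisted, False if whitelisted."""
    labels = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Compare case-insensitively; the example value is capitalized.
            labels[row["id"].strip()] = row["list"].strip().lower() == "blacklist"
    return labels

def sample_path(samples_dir, sample_id):
    """Build the path of a sample file, which is named after its id with no extension."""
    return os.path.join(samples_dir, str(sample_id))
```

Taking ids from the CSV, rather than counting up from 1, avoids depending on the ids being contiguous. The resulting dictionary can also be used to split the files into benign and malicious sets.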
Note: The extension has been removed from all the files in the samples directory in order to prevent accidental execution. In most cases, the extension would have to be restored manually in order to get the malware to execute properly. The proper extension can be determined by parsing the PE header.
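As a rough illustration of that PE header parsing, the sketch below reads the COFF Characteristics field and the optional-header Subsystem field to guess between `.dll`, `.sys`, and `.exe`. The function name is hypothetical and the mapping is a heuristic: the header alone cannot distinguish `.exe` from `.scr`, for example.

```python
import struct

IMAGE_FILE_DLL = 0x2000      # COFF Characteristics flag: image is a DLL
IMAGE_SUBSYSTEM_NATIVE = 1   # Optional-header Subsystem: drivers and native code

def guess_pe_extension(data):
    """Guess '.dll', '.sys', or '.exe' from PE header bytes; None if not a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 24 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    coff = pe_offset + 4
    # SizeOfOptionalHeader and Characteristics are the last two COFF header fields.
    opt_size, characteristics = struct.unpack_from("<HH", data, coff + 16)
    if characteristics & IMAGE_FILE_DLL:
        return ".dll"
    optional = coff + 20
    # Subsystem sits at offset 68 of the optional header for both PE32 and PE32+.
    if opt_size >= 70 and len(data) >= optional + 70:
        (subsystem,) = struct.unpack_from("<H", data, optional + 68)
        if subsystem == IMAGE_SUBSYSTEM_NATIVE:
            return ".sys"
    return ".exe"
```

The function only inspects bytes and never writes or renames anything, so it can be run over the samples without restoring an executable extension.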
Labels
Each entry in the samples.csv file contains the following metadata fields:
Field | Description | Example |
id | The identifier for the sample that corresponds to the name of the file in the samples directory. | 5 |
md5 | The MD5 hash of the file. | ad27f1a72dda61d1659810c406f37ab8 |
sha1 | The SHA1 hash of the file. | f8fd630c880257c7e74c1f87929993477453d989 |
sha256 | The SHA256 hash of the file. | 984d732c9f32197232918f2fce0aa9cedc1011d93e32acb4ad01e13f2f76d599 |
total | The total number of antivirus engines that scanned this file at the time of the query. | 67 |
positives | The number of antivirus engines that flagged this file as malicious at the time of the query. | 0 |
list | Either Blacklist or Whitelist, indicating whether the file is malicious or legitimate, respectively. | Whitelist |
filetype | This field will always be exe for this dataset. | exe |
submitted | The date that the sample was entered into my database. | 6/24/2018 4:18:38 PM |
user_id | Redacted. | 1 |
length | The length of the file in bytes. | 211,456 |
entropy | The Shannon entropy of the file. The values will range from 0 to 8. | 2.231824 |
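The exact method used to compute the entropy column is not stated here. A common definition that matches the 0 to 8 range is byte-level Shannon entropy in bits per byte, sketched below under that assumption (the function name is hypothetical).

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value present in the data
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A file made of one repeated byte scores 0, and a file in which all 256 byte values are equally frequent scores 8. Packed or encrypted executables tend toward the upper end of the range.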
Sources
Malicious samples in the dataset come primarily from the sources linked below.
Potential Biases
The majority of the samples came from easy-to-acquire locations. A small number of very similar malware families tend to dominate the dataset. While the dataset does contain samples of more sophisticated malware from Advanced Persistent Threat actors, there are far fewer of those samples than there are of generic adware, spyware, and ransomware. As a result, the dataset may not be reflective of malware used in actual intrusions. Models trained on the dataset may be able to generalize to more advanced malware, or they may not.
The majority of legitimate files came from instances of various versions of Windows 7 and above with a variety of different software downloaded and installed. There were only so many applications that I downloaded and tested. Other legitimate files come from the sources listed above, but were false positives. This may give a particular bias towards Microsoft-produced software, as those binaries dominate the legitimate file dataset.
I appreciate the effort in assembling this, but I genuinely don’t understand why you couldn’t separate the benign files from the malware files into separate folders. Your Microsoft SDK guide doesn’t allow new requested accounts, and the IDs of the files aren’t as straightforward as 1, 2, 3, 4, etc. The IDs in samples.csv stop incrementing by 1 at 64657 and start incrementing by seemingly random amounts (often 12, 7, 2, or 3) until the ID reaches 612,220, even though there are only 201,549 files.
How can we tell which malware family the samples belong to?
The only thing provided is the name of the signature from the AV which is sometimes the family name, but is often just a particular technique or code that may be shared amongst multiple families.