PE Malware Machine Learning Dataset

Download

WARNING! The link will download an encrypted zip file that contains real malicious samples. Handle the contents with care and use them only for legitimate purposes. I have removed the file extensions from all of the samples in order to prevent accidental execution; however, I still highly recommend opening the archive in a sandboxed environment. As an additional precaution, you should also change the permissions of the folder to deny “Execute” permissions to all files in the folder. Conducting your analysis on a non-Windows operating system will also help reduce the risk.
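If you extract the archive on Linux or macOS, the permission change can be scripted. The following is a minimal sketch, assuming a POSIX filesystem; the function name `strip_execute_bits` is my own for illustration. On Windows, Python's `os.chmod` only toggles the read-only flag, so you would need to set a deny ACL on the folder instead.

```python
import os
import stat
from pathlib import Path

# All three execute bits (owner, group, other).
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def strip_execute_bits(folder):
    """Clear the execute bits on every regular file under `folder`.

    Directories are left alone because they need the execute bit
    to remain traversable. Returns the number of files changed.
    """
    changed = 0
    for path in Path(folder).rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        new_mode = mode & ~EXEC_BITS
        if new_mode != mode:
            os.chmod(path, new_mode)
            changed += 1
    return changed
```

Calling `strip_execute_bits("samples")` after extraction would cover every file in the samples folder. This is a safety net, not a substitute for a sandbox.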

OneDrive: https://1drv.ms/u/s!AsaC1RPcfUL1oB5qbdWOm-PIk2jX?e=mOlo6J
Password: infected

Terms of Use

If you use this dataset, please adhere to the following rules:

  1. Do not use the files for malicious purposes.
  2. Let me know via comment, email, or tweet if you found this dataset useful.
  3. Cite me as a source in any academic paper that leverages this dataset using the following contact information:
  4. Provide feedback or recommendations on how to improve the dataset.

Purpose

The purpose of this dataset is to provide raw, labeled Portable Executable files to security and AI researchers in order to improve cyber security in the industry. Many of the datasets that I have seen (such as this dataset from a Microsoft-sponsored Kaggle competition) do not provide the raw binary files themselves, only metadata that has already been extracted from the samples. That rules out much of what could be learned by exploring other features extracted from the raw samples.

About the Dataset

Statistics

Samples | 201,549
Legitimate | 86,812
Malicious | 114,737
Compressed Size | 43.8 GB
Uncompressed Size | 117 GB
File Types | All are Portable Executable files. Most are user-mode Portable Executable files (e.g. .exe, .dll, .scr).

Layout

The dataset has the following folder structure:

  • samples
    • 1
    • 2
    • 3
  • samples.csv

The files in the “samples” folder are given the name of their corresponding entry in the ID field of the samples.csv file. The samples.csv file contains the labels for each of the samples in the samples folder.
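To pair each file with its label, something like the following should work. It is a sketch under assumptions: that samples.csv has a header row with the lower-case column names `id` and `list` described in the Labels section, and that it is UTF-8 encoded. The function name `load_labels` is hypothetical.

```python
import csv
from pathlib import Path


def load_labels(csv_path, samples_dir):
    """Map each sample's path to True (malicious) or False (legitimate)."""
    labels = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed column names: "id" and "list".
            verdict = row["list"].strip().lower()
            if verdict not in ("blacklist", "whitelist"):
                continue  # skip rows with an unexpected label
            sample_path = Path(samples_dir) / row["id"].strip()
            labels[sample_path] = verdict == "blacklist"
    return labels
```

For example, `load_labels("samples.csv", "samples")` would return a dictionary keyed by paths such as `samples/5`. The label is compared case-insensitively because the example value in the Labels table is capitalized.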

Note: The extension has been removed from all the files in the samples directory in order to prevent accidental execution. In most cases, a file would have to be manually renamed with its extension restored before the malware would execute properly. The proper extension can be determined by parsing the PE header.
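For analysis purposes, the file type can be read from the header without renaming anything. The sketch below is a heuristic of my own, not the method used to build the dataset: it checks the `IMAGE_FILE_DLL` flag in the COFF Characteristics field and the Subsystem field of the optional header, using the offsets from the PE format specification. It cannot tell .scr from .exe, since both are ordinary executables at the header level.

```python
import struct

IMAGE_FILE_DLL = 0x2000      # COFF Characteristics flag: image is a DLL
IMAGE_SUBSYSTEM_NATIVE = 1   # Subsystem value used by drivers and native images


def guess_pe_extension(data):
    """Return a likely extension for a PE image, or None if it is not a PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    coff = pe_offset + 4     # 20-byte COFF file header
    optional = coff + 20     # optional header follows it
    if len(data) < optional + 70:
        return None
    # Characteristics is the last 2 bytes of the COFF header.
    (characteristics,) = struct.unpack_from("<H", data, coff + 18)
    # Subsystem sits at offset 68 in both PE32 and PE32+ optional headers.
    (subsystem,) = struct.unpack_from("<H", data, optional + 68)
    if subsystem == IMAGE_SUBSYSTEM_NATIVE:
        return ".sys"
    if characteristics & IMAGE_FILE_DLL:
        return ".dll"
    return ".exe"
```

Only the first few hundred bytes of a file are needed, so reading something like `open(path, "rb").read(4096)` is usually enough input.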

Labels

Each entry in the samples.csv file contains the following metadata fields:

Field | Description | Example
id | The identifier for the sample, which corresponds to the name of the file in the samples directory. | 5
md5 | The MD5 hash of the file. | ad27f1a72dda61d1659810c406f37ab8
sha1 | The SHA1 hash of the file. | f8fd630c880257c7e74c1f87929993477453d989
sha256 | The SHA256 hash of the file. | 984d732c9f32197232918f2fce0aa9cedc1011d93e32acb4ad01e13f2f76d599
total | The total number of antivirus engines that scanned this file at the time of the query. | 67
positives | The number of antivirus engines that flagged this file as malicious at the time of the query. | 0
list | Either blacklist or whitelist, indicating whether the file is malicious or legitimate respectively. | Whitelist
filetype | This field will always be exe for this dataset. | exe
submitted | The date that the sample was entered into my database. | 6/24/2018 4:18:38 PM
user_id | Redacted. | 1
length | The length of the file in bytes. | 211,456
entropy | The Shannon entropy of the file. The values will range from 0 to 8. | 2.231824
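The hash, length, and entropy fields can be recomputed from a sample's bytes, which is a handy check that a file was extracted intact and matches its row. This is a sketch under assumptions: I am treating entropy as base-2 Shannon entropy over byte values, which is consistent with the stated 0 to 8 range, but the exact calculation behind the CSV values is not documented here. The function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def describe_sample(data):
    """Recompute the file-derived metadata fields for one sample's bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "length": len(data),
        "entropy": shannon_entropy(data),
    }
```

A file made of one repeated byte scores 0, and a file in which all 256 byte values are equally frequent scores 8. Packed or encrypted samples tend toward the high end of that range.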

Sources

Malicious samples in the dataset come primarily from the sources linked below.

Potential Biases

The majority of the samples came from easy-to-acquire locations. A few very similar families of malware tend to dominate the dataset. While the dataset does contain samples of more sophisticated malware from Advanced Persistent Threat actors, there are far fewer of those samples than there are of generic adware, spyware, and ransomware. As a result, the dataset may not be reflective of malware used in actual intrusions. Models trained on it may or may not generalize to more advanced malware.

The majority of legitimate files came from instances of various versions of Windows 7 and above with a variety of different software downloaded and installed. There were only so many applications that I downloaded and tested. Other legitimate files come from the sources listed above, but were false positives. This may bias the dataset towards Microsoft-produced software, as those binaries dominate the legitimate files.

8 thoughts on “PE Malware Machine Learning Dataset”

  1. RetardedGenius

    I appreciate the effort in assembling this but I genuinely don’t understand why you couldn’t separate the benign files from the malware files into separate folders. Your Microsoft SDK guide doesn’t allow new requested accounts and the IDs of the files aren’t as straightforward as 1, 2, 3, 4 etc. The IDs in samples.csv stop incrementing by 1 from 64657 and start incrementing by irregular amounts, often 12, 7, 2, 3, until the ID reaches 612,220 even though there are only 201,549 files.

    1. The only thing provided is the name of the signature from the AV which is sometimes the family name, but is often just a particular technique or code that may be shared amongst multiple families.
