PE Malware Machine Learning Dataset

Download

WARNING! The link will download an encrypted zip file that contains real malicious samples. Handle the contents with care and use them only for legitimate purposes. I have removed the file extensions from all of the samples to prevent accidental execution; however, I still highly recommend opening the archive in a sandboxed environment. As an additional precaution, you should change the permissions of the folder to deny “Execute” permissions on all files in it. Conducting your analysis on a non-Windows operating system will also help reduce risk.
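On a Linux or macOS analysis machine, the permission change suggested above could be sketched as follows. This is a minimal example, not part of the dataset tooling: the function name `strip_execute_bits` and the folder path are hypothetical, and the approach only applies to POSIX permission bits (on Windows, `os.chmod` does not control execute access).

```python
import os
import stat
from pathlib import Path


def strip_execute_bits(folder):
    """Clear the owner/group/other execute bits on every regular file under `folder`.

    Returns the number of files whose mode was changed.
    """
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    changed = 0
    for path in Path(folder).rglob("*"):
        # Skip directories (they need the execute bit to be traversable) and symlinks.
        if path.is_file() and not path.is_symlink():
            mode = stat.S_IMODE(path.stat().st_mode)
            if mode & exec_bits:
                os.chmod(path, mode & ~exec_bits)
                changed += 1
    return changed
```

For example, `strip_execute_bits("samples")` would be run once after extracting the archive, assuming it was extracted into a folder named `samples`.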

OneDrive: https://1drv.ms/u/s!AsaC1RPcfUL1oB5qbdWOm-PIk2jX?e=mOlo6J
Password: infected
SHA256: A8B02407A1F8C77DD9DCCC229503A4F668083271EDCBB0289D53C28EBF51215E
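Before extracting anything, it may be worth checking the download against the hash above. The sketch below assumes the published SHA256 is the hash of the downloaded zip archive itself; the function names and the local file name are placeholders, not something specified in this post.

```python
import hashlib

# Hash published in the Download section of this post.
EXPECTED_SHA256 = "A8B02407A1F8C77DD9DCCC229503A4F668083271EDCBB0289D53C28EBF51215E"


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare case-insensitively, since hex digests may be printed in either case."""
    return sha256_of_file(path).lower() == expected.lower()
```

Calling `matches_published_hash("dataset.zip")` (with `dataset.zip` replaced by wherever the archive was saved) should return `True` for an intact download under that assumption.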

Terms of Use

If you use this dataset, please adhere to the following rules:

Do not use the files for malicious purposes.
Let me know via comment, email, or tweet if you found this dataset useful.
Cite me as a source in any academic paper that leverages this dataset using the following contact information:
- Name: Michael Lester
- Email: [email protected]
- Website: https://www.practicalsecurityanalytics.com
Provide feedback or recommendations on how to improve the dataset.

Purpose

The purpose of this dataset is to provide raw, labeled portable executables to security and AI researchers in order to improve cyber security in the industry. Many of the datasets that I have seen (such as this dataset from a Microsoft-sponsored Kaggle competition) do not provide the raw binary files themselves, but rather metadata that has already been extracted from the samples. This rules out exploring any other features that could be extracted from the raw samples.

About the Dataset

Statistics

Samples	201,549
Legitimate	86,812
Malicious	114,737
Compressed Size	43.8GB
Uncompressed Size	117GB
File Types	All are Portable Executable files. Most are user-mode Portable Executable files (e.g. .exe, .dll, .scr).

Layout

The dataset has the following folder structure:

samples
- 1
- 2
- 3
- …
samples.csv

The files in the “samples” folder are given the name of their corresponding entry in the ID field of the samples.csv file. The samples.csv file contains the labels for each of the samples in the samples folder.
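The pairing of files and labels described above might be loaded as in the following sketch. It assumes that samples.csv is comma-separated with a header row and that the column names match the `id` and `list` fields documented in the Labels section; the function name `load_labels` is hypothetical.

```python
import csv
from pathlib import Path


def load_labels(csv_path, samples_dir):
    """Return a list of (sample_path, is_malicious) pairs read from samples.csv."""
    root = Path(samples_dir)
    pairs = []
    # "utf-8-sig" tolerates an optional byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            label = row["list"].strip().lower()
            if label not in ("blacklist", "whitelist"):
                continue  # skip rows with an unexpected label instead of guessing
            # Each file in the samples folder is named after its id.
            pairs.append((root / row["id"].strip(), label == "blacklist"))
    return pairs
```

Keeping the label as a boolean next to the path makes it straightforward to split the list into benign and malicious subsets without moving any files on disk.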

Note: The extension has been removed from all the files in the samples directory in order to prevent accidental execution. In most cases, a file would have to be manually renamed with its proper extension before the malware would execute properly. The proper extension can be determined by parsing the PE header.
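As a rough illustration of what parsing the PE header can recover, the sketch below reads the `Characteristics` field of the COFF header and the `Subsystem` field of the optional header. The mapping to extensions is a heuristic assumed for this example: the DLL flag suggests `.dll`, the native subsystem suggests a driver (`.sys`), and everything else is treated as `.exe`. Extensions such as `.scr` share the `.exe` format and cannot be told apart this way. The function name is hypothetical.

```python
import struct

IMAGE_FILE_DLL = 0x2000        # COFF Characteristics flag set on DLLs
IMAGE_SUBSYSTEM_NATIVE = 1     # Subsystem value typically used by kernel drivers


def guess_pe_extension(data):
    """Guess a likely extension from the raw bytes of a PE file (heuristic only)."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4
    if len(data) < coff + 20:
        raise ValueError("truncated COFF header")
    # SizeOfOptionalHeader and Characteristics are the last two COFF fields.
    size_optional, characteristics = struct.unpack_from("<HH", data, coff + 16)
    if characteristics & IMAGE_FILE_DLL:
        return ".dll"
    optional = coff + 20
    # Subsystem sits at offset 68 of the optional header for both PE32 and PE32+.
    if size_optional >= 70 and len(data) >= optional + 70:
        (subsystem,) = struct.unpack_from("<H", data, optional + 68)
        if subsystem == IMAGE_SUBSYSTEM_NATIVE:
            return ".sys"
    return ".exe"
```

Only the first few hundred bytes of each sample are needed, so reading a small prefix of the file is enough for this check. For analysis purposes there is usually no reason to actually rename the files.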

Labels

Each entry in the samples.csv file contains the following metadata fields:

Field	Description	Example
id	The identifier for the sample that corresponds to the name of the file in the samples directory.	5
md5	The MD5 hash of the file.	ad27f1a72dda61d1659810c406f37ab8
sha1	The SHA1 hash of the file.	f8fd630c880257c7e74c1f87929993477453d989
sha256	The SHA256 hash of the file.	984d732c9f32197232918f2fce0aa9cedc1011d93e32acb4ad01e13f2f76d599
total	The total number of antivirus engines that scanned this file at the time of the query.	67
positives	The number of antivirus engines that flagged this file as malicious at the time of the query.	0
list	Either Blacklist or Whitelist, indicating whether the file is malicious or legitimate, respectively.	Whitelist
filetype	This field will always be exe for this dataset.	exe
submitted	The date that the sample was entered into my database.	6/24/2018 4:18:38 PM
user_id	Redacted.	1
length	The length of the file in bytes.	211,456
entropy	The Shannon entropy of the file. The values will range from 0 to 8.	2.231824
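The 0 to 8 range of the entropy field is consistent with Shannon entropy computed over byte values in bits per byte, H = -Σ p(b) · log2 p(b), where p(b) is the fraction of bytes in the file equal to b. The sketch below shows that calculation; whether the values in samples.csv were produced in exactly this way is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A file consisting of a single repeated byte scores 0, while a file in which all 256 byte values are equally frequent scores 8. Packed or encrypted executables tend to sit near the top of the range.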

Sources

Malicious samples in the dataset come primarily from the sources linked below.

Potential Biases

The majority of the samples came from easy-to-acquire locations. Many samples belong to a small number of very similar malware families, which tend to dominate the dataset. While the dataset does contain samples of more sophisticated malware from Advanced Persistent Threat actors, there are far fewer of those samples than there are of generic adware, spyware, and ransomware. As a result, the dataset may not be reflective of malware used in actual intrusions. Models trained on this dataset may or may not generalize to more advanced malware.

The majority of legitimate files came from instances of various versions of Windows 7 and above with a variety of different software downloaded and installed. There were only so many applications that I downloaded and tested. Other legitimate files come from the sources listed above, but were false positives. This may bias the dataset towards Microsoft-produced software, as those binaries dominate the legitimate portion of the dataset.

Mashal Zainab

March 17, 2025 at 11:51 am

I have a question regarding the timeframe for collecting malware samples. Specifically, if I am working on creating a dataset where malware files are categorized based on the year they were collected, where do I put these files (dates)?

I have also sent an email regarding this inquiry and would appreciate any guidance you can provide.

pracsec
March 19, 2025 at 4:27 am

The dataset was made in 2018 I believe, so most of the samples were compiled before that. There are likely files that were compiled over many years, but it is hard to say which ones. The best you can do is probably scrape metadata from the internal timestamps in the PE headers and other locations, though that will not be a perfect solution either.
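Reading the internal timestamp mentioned in the reply above could look like the sketch below, which extracts the `TimeDateStamp` field from the COFF header. The function name is hypothetical, and the caveat from the reply applies: this value is set by the linker and can be zeroed, forged, or replaced by a build hash, so it is only a rough signal.

```python
import struct
from datetime import datetime, timezone


def pe_compile_time(data):
    """Return the COFF TimeDateStamp of a PE file as a UTC datetime."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < pe_offset + 12:
        raise ValueError("truncated COFF header")
    # TimeDateStamp follows the 4-byte signature, Machine (2) and NumberOfSections (2).
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)
```

Grouping samples by `pe_compile_time(data).year` gives an approximate per-year split, with implausible years (for example 1970 or dates in the future) best treated as unknown.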

RetardedGenius

June 17, 2023 at 9:33 am

I appreciate the effort in assembling this, but I genuinely don’t understand why you couldn’t separate the benign files from the malware files into separate folders. Your Microsoft SDK guide doesn’t allow new requested accounts, and the IDs of the files aren’t as straightforward as 1, 2, 3, 4, etc. The IDs in samples.csv stop incrementing by 1 from 64657 and start incrementing by seemingly random amounts, often 12, 7, 2, or 3, until the ID reaches 612,220, even though there are only 201,549 files.

kupgnd

March 1, 2023 at 5:11 am

How can we tell which malware family the samples belong to?

pracsec
April 30, 2023 at 12:02 am

The only thing provided is the name of the signature from the AV which is sometimes the family name, but is often just a particular technique or code that may be shared amongst multiple families.

