What is the PE Checksum?
When the Portable Executable (PE) format was developed, network connections were much less reliable than they are today. It was not uncommon for a connection to be compromised and for the data being transferred to become corrupted. It was also difficult for a client to detect whether a file had been corrupted in transit. This was a significant problem, especially when downloading operating system files such as executables and drivers: a one-byte error in a driver could cause an unrecoverable system crash.
As a result, checksums were built into the PE format with the express intent of detecting data corruption and reducing the probability of corrupted code being executed.
What is a Checksum?
A checksum is a simple function that condenses the contents of a file into a small, fixed-size value. Because the value is calculated over every byte of the file, most changes to the contents, even small ones, produce a different checksum. This is not guaranteed, however: two files with different contents can produce the same checksum, which is known as a collision. The probability of an accidental collision is relatively low, so a checksum catches corruption the majority of the time but is not a guarantee.
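To make the idea concrete, here is a minimal Python sketch of a toy checksum. It is deliberately simpler than the PE algorithm (it just sums bytes), and the function name and sample strings are made up for illustration. It shows one corruption being caught and one collision slipping through.

```python
def additive_checksum(data: bytes) -> int:
    """Toy 16-bit checksum: the sum of all bytes, modulo 2**16."""
    return sum(data) & 0xFFFF


original = b"driver code"
corrupted = b"driver cpde"   # one byte changed in transit
reordered = b"driver cdoe"   # same bytes, two of them swapped

# A single changed byte alters the sum, so the corruption is detected.
print(additive_checksum(original) != additive_checksum(corrupted))  # True

# Swapping two bytes leaves the sum unchanged: a collision that goes unnoticed.
print(additive_checksum(original) == additive_checksum(reordered))  # True
```

The second case is why a checksum lowers the odds of running corrupted code rather than eliminating them.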
How are Checksums Implemented in PE Files?
In a PE file, the optional header contains a 4-byte CheckSum field for the checksum of the file. Typically, the build toolchain (usually the linker rather than the compiler itself) computes the checksum at build time and writes the value into that field.
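As a rough illustration of where that field lives, the Python sketch below reads the stored value from a byte buffer. It assumes the standard PE layout: the DOS header stores the offset of the PE signature at 0x3C, and the checksum sits 88 bytes after that signature (4 bytes of signature, 20 bytes of COFF header, then 64 bytes into the optional header). The function name and the synthetic buffer are hypothetical and only serve the example.

```python
import struct
from typing import Tuple

E_LFANEW_OFFSET = 0x3C       # DOS header field holding the offset of the PE signature
CHECKSUM_FIELD_OFFSET = 88   # 4 (signature) + 20 (COFF header) + 64 (into optional header)


def read_pe_checksum(image: bytes) -> Tuple[int, int]:
    """Return (file offset of the checksum field, stored checksum value)."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", image, E_LFANEW_OFFSET)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    checksum_offset = pe_offset + CHECKSUM_FIELD_OFFSET
    (stored,) = struct.unpack_from("<I", image, checksum_offset)
    return checksum_offset, stored


# Synthetic header-only buffer for demonstration; this is not a loadable executable.
fake = bytearray(0x80 + 96)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, E_LFANEW_OFFSET, 0x80)
fake[0x80:0x84] = b"PE\0\0"
struct.pack_into("<I", fake, 0x80 + CHECKSUM_FIELD_OFFSET, 0x0001A2B3)

offset, stored = read_pe_checksum(bytes(fake))
print(offset, hex(stored))  # 216 0x1a2b3
```

On a real file you would pass the full file contents instead of the synthetic buffer.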
According to the PE format specification by Microsoft, the purpose of the checksum field is:
The image file checksum. The algorithm for computing the checksum is incorporated into IMAGHELP.DLL. The following are checked for validation at load time: all drivers, any DLL loaded at boot time, and any DLL that is loaded into a critical Windows process.
https://docs.microsoft.com/en-us/windows/win32/debug/pe-format
Basically, before loading drivers and certain DLLs, Windows will use a function inside of IMAGHELP.DLL to calculate the checksum of the executable. It will then compare that checksum to the value inside of the PE header. If the two checksums match, then the driver or DLL will be loaded. If not, the Windows loader will assume the file was corrupted and prevent the driver or DLL from being loaded.
One thing that is interesting to note is that ordinary executables are not validated against the checksum field, so that field does not have to contain a valid checksum for Windows to run them.
Algorithm
The checksum fills the same role as a CRC32, but the algorithm is different. It is a custom algorithm developed by Microsoft that is not officially published, and very few libraries implement it. I ended up building my own version from a few posts on StackOverflow. The implementation below is written in C++/CLI, but it gives a good overview of how the algorithm works: the file is summed as 32-bit words, skipping the checksum field itself, with carries folded back into the total; the total is then folded down to 16 bits and the file length is added.
UInt32 GetChecksum(cli::array<Byte>^ data, Int32 checksumOffset) {
    // Pin the managed array so it can be read as raw 32-bit words.
    // Assumes checksumOffset is the 4-byte-aligned file offset of the checksum field.
    pin_ptr<unsigned char> pin = &data[0];
    unsigned int* pointer = (unsigned int*)pin;
    long long int checksum = 0;
    const long long int top = 1LL << 32; // 2^32

    // Sum the 32-bit words before the checksum field, folding carries back in.
    for (int i = 0; i < checksumOffset / 4; i++) {
        unsigned int temp = pointer[i];
        checksum = (checksum & 0xffffffff) + temp + (checksum >> 32);
        if (checksum > top) {
            checksum = (checksum & 0xffffffff) + (checksum >> 32);
        }
    }

    // Skip the checksum field itself and sum the remaining whole words.
    int stop = data->Length / 4;
    for (int i = checksumOffset / 4 + 1; i < stop; i++) {
        unsigned int temp = pointer[i];
        checksum = (checksum & 0xffffffff) + temp + (checksum >> 32);
        if (checksum > top) {
            checksum = (checksum & 0xffffffff) + (checksum >> 32);
        }
    }

    // Perform the same calculation on the zero-padded remainder, if any.
    int remainder = data->Length % 4;
    if (remainder != 0) {
        cli::array<Byte>^ a = gcnew cli::array<Byte>(4);
        int index = data->Length - remainder;
        for (int i = 0; i < 4; i++) {
            if (i < remainder) {
                a[i] = data[index + i];
            }
            else {
                a[i] = 0;
            }
        }
        pin_ptr<unsigned char> pin2 = &a[0];
        unsigned int* pointer2 = (unsigned int*)pin2;
        unsigned int temp = pointer2[0];
        checksum = (checksum & 0xffffffff) + temp + (checksum >> 32);
        if (checksum > top) {
            checksum = (checksum & 0xffffffff) + (checksum >> 32);
        }
    }

    // Fold the 32-bit sum down to 16 bits, then add the file length.
    checksum = (checksum & 0xffff) + (checksum >> 16);
    checksum = checksum + (checksum >> 16);
    checksum = checksum & 0xffff;
    checksum += (unsigned int)data->Length;
    return (unsigned int)checksum;
}
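For readers without a C++/CLI toolchain, here is a rough Python port of the routine above, plus a small validity check that compares the stored value with the computed one. It mirrors the listing step by step and has not been compared against the output of Microsoft's own implementation, so treat it as a sketch. The function names, the stand-in buffer, and the field offset of 8 are hypothetical, and the code assumes the checksum offset is 4-byte aligned and lies inside the data.

```python
import struct

TOP = 1 << 32  # 2^32, same threshold as in the listing above


def pe_checksum(data: bytes, checksum_offset: int) -> int:
    """Sum 32-bit little-endian words with carry folding, then add the length."""
    padded = data + b"\0" * (-len(data) % 4)   # zero-pad the tail to a whole word
    total = 0
    for i in range(0, len(padded), 4):
        if i // 4 == checksum_offset // 4:     # skip the stored checksum field
            continue
        (word,) = struct.unpack_from("<I", padded, i)
        total = (total & 0xFFFFFFFF) + word + (total >> 32)
        if total > TOP:
            total = (total & 0xFFFFFFFF) + (total >> 32)
    total = (total & 0xFFFF) + (total >> 16)   # fold the sum down to 16 bits
    total = (total + (total >> 16)) & 0xFFFF
    return (total + len(data)) & 0xFFFFFFFF


def has_valid_checksum(data: bytes, checksum_offset: int) -> bool:
    """Compare the value stored in the field with a freshly computed one."""
    (stored,) = struct.unpack_from("<I", data, checksum_offset)
    return stored == pe_checksum(data, checksum_offset)


image = bytearray(range(64))   # stand-in bytes, not a real PE file
offset = 8                     # hypothetical checksum field position
struct.pack_into("<I", image, offset, pe_checksum(bytes(image), offset))
print(has_valid_checksum(bytes(image), offset))   # True

image[40] ^= 0x01              # simulate a post-build modification
print(has_valid_checksum(bytes(image), offset))   # False
```

The last two lines show why patching or packing a binary after the build leaves a stale checksum unless the tool recomputes it.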
How does this apply to intrusion detection?
By now, you probably are wondering, “So what? How does this help me identify and triage malware?” Well, as it turns out there is a strong correlation between invalid PE Checksums and malware. The graph below illustrates the disparity between malicious and legitimate executables with respect to valid and invalid checksums.
The graph shows two datasets: good and bad executables. The x-axis shows the two possible results (valid or invalid) of the PE checksum validation, and the y-axis shows the percent of each dataset.
The key takeaways from this graph are:
- 83% of malware had invalid checksums
- 90% of legitimate files had valid checksums.
As it turns out, the PE header checksum is the single greatest stand-alone indicator of malware we will discuss in this series, even more so than digital signatures (we’ll talk about why that is in the post about digital signatures).
There are two main reasons a checksum may be invalid: (1) some compilers used by malware authors don’t support generating the checksum, and (2) some authors modify the executable post-compilation, which invalidates the checksum. Generally speaking, most malware authors do not go back and update the checksum after making modifications. Malware authors also have a strong incentive to modify executables in order to evade detection by AV. Packing, encryption, encoding, and compression can all be used to obfuscate the parts of the malware that AV signatures match on. Many of these tools are designed to work on executables post-compilation so that the malware itself does not need to be rebuilt, and many of them do not update the checksum after modifying the executable.
The end result is that 83% of executable malware has invalid PE checksums, which is huge! It is rare for so many variants of malware to share a common suspicious trait. Even with such a widespread trait, PE checksums by themselves will not always land you positive detections.
Let’s say you are trying to defend a small network where you have found 100,000 unique executables but only one of those is malicious. If you were to try to identify malware based solely on the PE checksum, you would have approximately 10,000 false positives (10% of all legitimate files have invalid checksums) and, assuming the malicious file is among the 83% with an invalid checksum, 1 true positive. In other words, about 99.99% of the files you flag would be false positives.
At the same time, you’ve reduced the amount of hay in your haystack by 90%, making it much easier to find that one needle. Combining the PE checksum with other features will continue to narrow your focus.
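The arithmetic behind that example can be checked with a few lines of Python. This sketch uses the 10% and 83% figures quoted above and treats the single malicious file as an expected value (0.83 of a detection) rather than assuming it is always flagged. The function name is made up for illustration.

```python
def triage_estimate(total_files, malicious,
                    benign_invalid_rate=0.10, malware_invalid_rate=0.83):
    """Estimate what flagging every invalid PE checksum would return."""
    benign = total_files - malicious
    false_positives = benign * benign_invalid_rate      # legitimate files flagged
    true_positives = malicious * malware_invalid_rate   # expected malware flagged
    flagged = false_positives + true_positives
    share_benign = false_positives / flagged            # fraction of flags that are wrong
    reduction = 1 - flagged / total_files               # how much hay was removed
    return flagged, share_benign, reduction


flagged, share_benign, reduction = triage_estimate(100_000, 1)
print(f"files flagged:        {flagged:,.0f}")
print(f"flags that are benign: {share_benign:.2%}")
print(f"haystack reduction:    {reduction:.0%}")
```

With these inputs, roughly 10,000 files are flagged, nearly all of them benign, while the search space shrinks by about 90%.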
Summary
The PE checksum was designed to reduce the probability of data corruption in a DLL or driver leading to crashes in the operating system. The checksum is calculated by the build toolchain after it produces the executable, and any modification to the binary after that point invalidates the checksum. Malware authors commonly encode, encrypt, compress, or pack their malware post-compilation, but often do not update the checksum. As a result, 83% of malware samples have invalid PE checksums, versus only 10% of legitimate files.