How to find data hidden at the end of an OLE file

"Would it be possible to add a method to olefile that returns bytes that are appended to an OLE file? I have a sample that has encoded EXE appended."

When Didier Stevens asked me that question some time ago, I thought it would be easy, a matter of minutes. Indeed, the OLE format (aka Microsoft Compound File Binary Format) is structured and well specified in MS-CFB.

After a quick look at the specifications and my code from olefile, I realized it was not obvious at all...

OK, challenge accepted. :-)

When is there extra data at the end of OLE files?

This is not the normal case when an OLE file is saved by Microsoft Word or other MS Office applications. It usually happens when a malicious payload, such as an executable file, is deliberately appended to the end of an MS Office file. This can be done very easily with a command such as this:

copy /b document.doc+malware.exe newdoc.doc

When the file is opened in MS Word or Excel, a first stage of the malware is triggered by a VBA macro, an OLE object or an exploit in the document. That code locates the second stage data at the end of the file (using a specific marker or knowing the data length), saves it to another file on disk and runs it.

Another example is the old steganography tool Camouflage, which scrambles a file and hides it at the end of a document.

So why is it difficult to find that extra data?

Unlike other file formats such as ZIP, PDF or RTF, the OLE format does not have a trailer or a specific structure marking the end of data in a file.

The OLE format is designed like a filesystem within a file: an OLE file is divided into fixed-length sectors. The sector size is usually 512 bytes, or rarely 4096 bytes for very large files. The first sector contains the OLE header, and the rest is used to store data in streams and other structures.
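To make the layout concrete, here is a minimal Python sketch (not the actual olefile code) that reads the sector size from a header and converts a sector index into a file offset. The function names are mine, and the field offset 0x1E for the sector shift is my reading of the MS-CFB header layout. Since the header occupies the first sector, sector number n starts at (n + 1) times the sector size.

```python
import struct

# Magic bytes found at the start of every OLE file
OLE_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"

def sector_size_from_header(header):
    """Return the sector size in bytes, as declared in the OLE header."""
    if header[:8] != OLE_MAGIC:
        raise ValueError("not an OLE file")
    # Sector shift: 16-bit little-endian integer at offset 0x1E
    # (9 means 512-byte sectors, 12 means 4096-byte sectors)
    (sector_shift,) = struct.unpack_from("<H", header, 0x1E)
    return 1 << sector_shift

def sector_offset(index, sector_size):
    """Return the file offset of a sector; the header comes first."""
    return (index + 1) * sector_size

# Demo with a synthetic header (hypothetical data, not a real document)
demo_header = bytearray(512)
demo_header[:8] = OLE_MAGIC
struct.pack_into("<H", demo_header, 0x1E, 9)

demo_size = sector_size_from_header(bytes(demo_header))
print(demo_size)                          # 512
print(hex(sector_offset(0, demo_size)))   # 0x200
```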

The FAT (File Allocation Table) is used to keep track of all sectors. It is an array of 32-bit integers. Each integer in the FAT indicates whether the corresponding sector is used to store stream data or other structures, or whether the sector is free.

The FAT itself is stored in sectors, called FAT sectors. If the sector size is 512 bytes, then each FAT sector contains 128 32-bit integers, which map to 128 actual sectors. So usually, the FAT covers more sectors than the file actually contains. If the sector size is 4096 bytes, the same applies with 1024 sectors per FAT sector.
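As an illustration, a FAT sector can be decoded with the struct module. This is only a sketch: parse_fat_sector is a hypothetical helper, and the demo sector is made up. The special values are the ones defined in MS-CFB.

```python
import struct

# Special FAT values defined in MS-CFB
FREESECT = 0xFFFFFFFF    # sector is free
ENDOFCHAIN = 0xFFFFFFFE  # last sector of a chain
FATSECT = 0xFFFFFFFD     # sector holds part of the FAT itself

def parse_fat_sector(raw):
    """Decode one FAT sector into a list of 32-bit little-endian integers."""
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[:count * 4]))

# Demo: a synthetic 512-byte FAT sector where sector 0 is the FAT sector,
# sectors 1 and 2 form a chain, and everything else is free
demo_entries = [FATSECT, 2, ENDOFCHAIN] + [FREESECT] * 125
demo_raw = struct.pack("<128I", *demo_entries)

fat_entries = parse_fat_sector(demo_raw)
print(len(fat_entries))               # 128
print(fat_entries[1])                 # 2
print(fat_entries[3] == FREESECT)     # True
```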

The issue is, the OLE format does NOT store anywhere the number of sectors that are actually used. This is the main reason why we cannot determine for sure where OLE data ends, and where appended data starts.

The only indication we have is the number of FAT sectors, stored in the OLE header. By multiplying this number by 128 or 1024, we can determine the maximum number of sectors that the FAT can cover.
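That calculation can be sketched as follows. fat_capacity is a hypothetical helper, and the header offsets (0x1E for the sector shift, 0x2C for the number of FAT sectors) are my reading of the MS-CFB header layout, so double-check them against the specification.

```python
import struct

def fat_capacity(header):
    """Return the maximum number of sectors the FAT can cover."""
    # Sector shift: 16-bit integer at offset 0x1E (9 -> 512 bytes, 12 -> 4096)
    (sector_shift,) = struct.unpack_from("<H", header, 0x1E)
    # Number of FAT sectors: 32-bit integer at offset 0x2C
    (num_fat_sectors,) = struct.unpack_from("<I", header, 0x2C)
    # Each FAT entry takes 4 bytes
    entries_per_fat_sector = (1 << sector_shift) // 4
    return num_fat_sectors * entries_per_fat_sector

# Demo with a synthetic header: 512-byte sectors and a single FAT sector
demo_header = bytearray(512)
struct.pack_into("<H", demo_header, 0x1E, 9)
struct.pack_into("<I", demo_header, 0x2C, 1)

capacity = fat_capacity(bytes(demo_header))
print(capacity)  # 128
```

For example, a file that ends at offset 0x5800 holds the header plus 43 sectors of 512 bytes, so a single FAT sector still has 85 entries that point past the end of the file.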

According to the MS-CFB specifications, "The last FAT sector can have more entries that span past the actual size of the compound file. In this case, the entries that cover past end-of-file MUST be marked with FREESECT (0xFFFFFFFF)."

Therefore, when we look at the end of the FAT, the last sectors are all marked as free sectors. The issue is, we have no way to know for sure if those sectors were originally present in the file and actually free, or if they were not present in the original file before someone appended extra data.

How to determine the end of OLE data

According to MS-CFB, the following solution should be used: "The size of a compound file is determined by the index of the last non-free FAT array entry."

So we need to start from the end of the FAT, and look backwards to find the last sector that is not marked as free. That should give us the original end of the data stored in the OLE file. If there is data in the file past that point, then it must be extra data that has been appended afterwards.
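Here is a minimal sketch of that backward scan, assuming the FAT has already been loaded as a list of integers. It is not the actual olemap code; find_extra_data and the demo values are only for illustration.

```python
FREESECT = 0xFFFFFFFF    # FAT value for a free sector
ENDOFCHAIN = 0xFFFFFFFE  # FAT value for the last sector of a chain

def find_extra_data(fat, sector_size, file_size):
    """Return (end_of_ole_data, extra_bytes) for a FAT given as a list."""
    # Walk the FAT backwards to find the last entry that is not free
    last_used = -1
    for index in range(len(fat) - 1, -1, -1):
        if fat[index] != FREESECT:
            last_used = index
            break
    # The header takes the first sector, so sector n ends at (n + 2) * size
    end_of_ole_data = (last_used + 2) * sector_size
    extra_bytes = max(0, file_size - end_of_ole_data)
    return end_of_ole_data, extra_bytes

# Demo: a synthetic FAT with 43 used sectors out of 128 entries
demo_fat = [ENDOFCHAIN] * 43 + [FREESECT] * 85

clean = find_extra_data(demo_fat, 512, 0x5800)
appended = find_extra_data(demo_fat, 512, 0x5800 + 768)
print(hex(clean[0]), clean[1])        # 0x5800 0
print(hex(appended[0]), appended[1])  # 0x5800 768
```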

This is the algorithm that I recently implemented in my tool olemap, to detect and display extra data. See here if you want to install the development version of oletools to test it yourself.

Let's look at a normal OLE file saved by MS Word:

olemap mydocument.doc

olemap shows that the offset of the first free sector at the end of the FAT is 0x5800, which corresponds to the end of the file on disk. Therefore, there does not seem to be extra data in the file after the last used OLE sector.

Now, let's append an executable file at the end of the document, and save the result into another MS Word file:

copy /b mydocument.doc+test.exe mydocument2.doc
mydocument.doc
test.exe
        1 file(s) copied.

olemap will detect that 768 bytes are present after the end of OLE data:

olemap mydocument2.doc

And using the new option --exdata, olemap shows a hex dump of the extra data:

olemap --exdata mydocument2.doc

Great, is that it?

Well, that would be too easy. ;-)

This method works fine with extra data simply appended to well-formed OLE files. But nothing prevents you from altering the last FAT records so that they do not appear as free sectors. Since those sectors are not referenced elsewhere, the application opening the OLE file should not complain about it.

In the next versions of olemap, I plan to address that case by checking which sectors are actually referenced in the normal OLE structures (streams, FAT, directory, etc). Any non-free sector that is not referenced should then be identified as potential extra data. Stay tuned!
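To give an idea of what such a check could look like, here is a rough sketch. It is not how olemap does it (that part is not written yet) and all names are hypothetical. It assumes the caller already knows the first sector of each chain (directory, streams, etc.) and the sectors that hold the FAT itself.

```python
FREESECT = 0xFFFFFFFF    # FAT value for a free sector
ENDOFCHAIN = 0xFFFFFFFE  # FAT value for the last sector of a chain
FATSECT = 0xFFFFFFFD     # FAT value for a sector holding the FAT

def unreferenced_sectors(fat, chain_starts, structure_sectors=()):
    """Return the non-free sectors that no known chain or structure uses."""
    referenced = set(structure_sectors)
    for start in chain_starts:
        sector = start
        # Follow the chain until a special value or an already seen sector;
        # the second condition also protects against looping chains
        while sector < len(fat) and sector not in referenced:
            referenced.add(sector)
            sector = fat[sector]
    return [index for index, entry in enumerate(fat)
            if entry != FREESECT and index not in referenced]

# Demo: sector 0 holds the FAT, 1 -> 2 is the directory chain, 3 is a
# one-sector stream, and 4 -> 5 is a chain that nothing points to
demo_fat = [FATSECT, 2, ENDOFCHAIN, ENDOFCHAIN, 5, ENDOFCHAIN,
            FREESECT, FREESECT]

suspicious = unreferenced_sectors(demo_fat, chain_starts=[1, 3],
                                  structure_sectors=[0])
print(suspicious)  # [4, 5]
```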