This article presents several tools that can be used to extract VBA Macros source code from MS Office Documents, for malware analysis and forensics. It also provides an overview of how VBA Macros are stored.
A few years ago, it was not easy to find tools to extract VBA Macros, mainly because the file format was not documented. VBA Macros had also fallen out of favor among malware writers.
But since Microsoft published official specifications of the VBA Macro storage [MS-OVBA], several tools have been developed to extract VBA code. Moreover, in 2013-2014 several reports highlighted a resurgence of malware involving VBA macros. So it may be useful to take a fresh look at how to extract those macros.
The first section below provides an overview of how VBA Macros are stored in various types of documents. The following sections present several tools that can be used to extract VBA source code.
Most of the MS Office 97-2003 documents use the same underlying file format called Microsoft Compound File Binary (CFB) file format, or simply OLE2 file format.
An OLE file can be seen as a mini file system or a Zip archive: it contains streams of data that look like files embedded within the OLE file. Each stream has a name. For example, the main stream of an MS Word document containing its text is named "WordDocument".
An OLE file can also contain storages. A storage is a folder that contains streams or other storages. For example, an MS Word document with VBA macros has a storage called "Macros".
A typical MS Word document with VBA macros may look like this:
VBA macros are normally contained in a VBA project structure, located in different places depending on the document type:
According to [MS-OVBA], a VBA project root (e.g. "Macros" or "_VBA_PROJECT_CUR") must contain at least the following elements (case-insensitive names):
The VBA source code is stored in one or more streams located in the VBA storage (for example "ThisDocument" in the sample above). The code is not stored in clear text: it is compressed using a specific run-length encoding algorithm described in [MS-OVBA]. Moreover, the compressed content does not start at the beginning of those streams. It is necessary to parse binary structures in the VBA/dir stream (also compressed with the same RLE algorithm) in order to find the exact offset of the compressed VBA content in the code streams.
This is why extracting VBA source code is not straightforward. Luckily, several open-source tools are now available for this task.
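To make the decompression step more concrete, here is a minimal Python sketch of it, written from the published [MS-OVBA] description rather than taken from any of the tools presented in this article. The function name `decompress_ovba` is hypothetical, validation is deliberately minimal, and the input is assumed to start exactly at the compressed content (that is, at the offset obtained from the VBA/dir stream).

```python
import struct


def decompress_ovba(data: bytes) -> bytes:
    """Decompress an MS-OVBA compressed container (sketch, minimal checks)."""
    if not data or data[0] != 0x01:
        raise ValueError("missing 0x01 container signature byte")
    out = bytearray()
    pos = 1
    while pos + 2 <= len(data):
        # Chunk header: 12 bits (size - 3), 3 signature bits, 1 compressed flag.
        (header,) = struct.unpack_from("<H", data, pos)
        chunk_size = (header & 0x0FFF) + 3  # total size, header included
        if (header >> 12) & 0x07 != 0b011:
            raise ValueError("bad chunk signature")
        compressed = bool(header & 0x8000)
        chunk_end = min(pos + chunk_size, len(data))
        pos += 2
        if not compressed:
            # Raw chunk: 4096 bytes stored as-is.
            out += data[pos:pos + 4096]
            pos += 4096
            continue
        chunk_start = len(out)  # copy tokens are relative to this chunk
        while pos < chunk_end:
            flags = data[pos]  # one flag bit for each of the next 8 tokens
            pos += 1
            for bit in range(8):
                if pos >= chunk_end:
                    break
                if not (flags >> bit) & 1:
                    out.append(data[pos])  # literal byte
                    pos += 1
                    continue
                # Copy token: the offset/length bit split depends on how much
                # of the current chunk has already been decompressed.
                (token,) = struct.unpack_from("<H", data, pos)
                pos += 2
                difference = len(out) - chunk_start
                bit_count = max((difference - 1).bit_length(), 4)
                length = (token & (0xFFFF >> bit_count)) + 3
                offset = (token >> (16 - bit_count)) + 1
                for _ in range(length):
                    out.append(out[-offset])  # byte-wise, ranges may overlap
    return bytes(out)


# A 7-byte container holding one literal "a" and one copy token.
sample = bytes.fromhex("0103B002614500")
print(len(decompress_ovba(sample)))  # 73
```

The copy loop works one byte at a time on purpose: a token may reference bytes that it is itself producing, which is how a single literal followed by one token can expand into a long run of the same character.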
Some tools such as oledump (see below) use a simpler heuristic, looking for any stream containing the string "\x00Attribut". This is the start of "Attribute", the very first VBA keyword found at the beginning of the code of most macros. But since this keyword is VBA code, it may be possible to tweak macros to evade detection.
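The heuristic can be illustrated with a few lines of Python. This is only a sketch of the idea, not oledump's actual code, and the function name is hypothetical. It relies on the layout of the compressed data: the "\x00" in the marker is the flag byte announcing eight literal bytes ("Attribut"), and it is preceded by a 2-byte chunk header and the 0x01 signature byte, so the compressed content is assumed to start 3 bytes before the match.

```python
MARKER = b"\x00Attribut"


def find_vba_container_offset(stream: bytes) -> int:
    """Guess where compressed VBA code starts in a stream, or return -1."""
    hit = stream.find(MARKER)
    if hit < 3:
        # No marker, or not enough room before it for signature + header.
        return -1
    start = hit - 3
    if stream[start] != 0x01:
        # The container signature byte is missing: probably a false positive.
        return -1
    return start


# Hypothetical stream: 10 bytes of unrelated data, then a compressed container.
stream = b"\xAA" * 10 + b"\x01\x03\xB0" + MARKER + b"..."
print(find_vba_container_offset(stream))  # 10
```

A scan like this avoids parsing the VBA/dir stream, but it inherits the weakness mentioned above: it only works if the code really starts with that keyword.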
MS Office 2007+ file formats, also called MS Open XML, are quite different because they are made of XML files stored in Zip archives.
However, VBA macros are usually stored in a binary OLE file within the Zip archive, called "vbaProject.bin". This vbaProject.bin OLE file contains the same VBA project structure as described above for MS Office 97-2003 documents.
Here again, the vbaProject.bin file may be stored in different places in the Zip archive, depending on the document type:
Here is the content of a sample Word 2007+ document with VBA macros:
And here is the content of the OLE file vbaProject.bin:
Important note: the name "vbaProject.bin" is used by default by MS Office, but the Open XML standard and MS Office allow any file name, as long as the relationships are defined accordingly in the XML files (see this article page 18 for details). Therefore, it is not safe to locate this file by name alone.
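One way to avoid depending on the file name is to look at the content of each Zip member instead. The following Python sketch, using only the standard library, lists every member that starts with the 8-byte OLE signature. The function name and the sample member names are hypothetical, and this is a simplification: an Open XML document may also embed other OLE files that are not VBA projects, so each match still has to be inspected, and following the relationships declared in the XML files remains the more rigorous approach.

```python
import io
import zipfile

# Signature found at the very beginning of every OLE (CFB) file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_members(zip_source):
    """Return the names of all Zip members that start with the OLE signature."""
    found = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            with archive.open(info) as member:
                if member.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    found.append(info.filename)
    return found


# Demo on an in-memory archive where the OLE file has a non-default name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/renamed.bin", OLE_MAGIC + b"\x00" * 16)
buffer.seek(0)
print(find_ole_members(buffer))
```

`find_ole_members` accepts a path or any file-like object, so it can be pointed directly at a .docm, .xlsm or .pptm file.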