Anti-Analysis Tricks in Weaponized RTF

This article describes several anti-analysis tricks found in recent malicious RTF documents, and how I improved rtfobj to handle them.

Weaponized RTF documents

The RTF format has always been considered quite safe compared to other MS Office formats. It is true that RTF cannot contain macros. However, it is much less widely known that RTF documents may contain OLE objects with potentially malicious content. More specifically, OLE “Package” objects may store any file, including executables and scripts. If the end user double-clicks on such an object, the embedded file will be opened by the system. This feature is actively used to deliver malware in real life, as we will see later on.

Worse, it turns out that many antivirus engines do not even scan files stored in RTF documents: an earlier article warned about this in 2007, and according to my tests, it is still the case for half of the antivirus products referenced on VirusTotal.

RTF documents may also contain exploits targeting vulnerabilities in MS Word to execute code. And finally, they may contain files with exploits for other software, for example Flash objects (SWF files).

RTF structure

An RTF document is mostly made of control words with parameters, enclosed in curly brackets {...} (braces). A control word starts with a backslash, for example “\fonttbl”. Braces may contain one or several control words, forming a group. Braces may also contain plain text and other, nested braces. Here are a few examples of RTF control words:

{\version1}
{\fonttbl{\f0\froman Tms Rmn;}{\f1\fdecor Symbol;}{\f2\fswiss Helv;}}

According to the Microsoft RTF Specifications page 12, section “RTF Version”, an RTF document should always start with the control word “{\rtf1”, which means RTF version 1.x.

Here is a sample RTF document (excerpt from the RTF specifications):

{\rtf1\ansi\deff0{\fonttbl{\f0\froman Tms Rmn;}{\f1\fdecor Symbol;}{\f2\fswiss Helv;}}
{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;
\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;}
{\stylesheet{\fs20 \snext0 Normal;}}{\info{\author John Doe}{\creatim\yr1990\mo7\dy30\hr10\min48}
{\version1}{\edmins0}{\nofpages1}{\nofwords0}{\nofchars0}{\vern8351}}
\widoctrl\ftnbj \sectd\linex0\endnhere\pard\plain \fs20 This is plain text.\par}

Embedded Objects

OLE objects are stored using several nested control words, as described in the Microsoft RTF Specifications page 150, section “Objects”:

{\object … {\objdata …}}

The object data is serialized to a byte string using the OLESaveToStream format (which I will explain in a future article), and that string is encoded in hexadecimal after the “\objdata” control word.

The hexadecimal encoding is not strictly defined in the specifications. Judging by the RTF documents produced by MS Word or WordPad, it is simply a string of hexadecimal digits, with optional spaces and newline characters to split long lines.

Here is the beginning of a sample OLE object containing a MS Word document within an RTF file:

{\object\objemb{\*\objclass Word.Document.12}\objw9355\objh1018{\*\objdata
01050000
02000000
11000000
576f72642e446f63756d656e742e313200
00000000
00000000
003a0000
d0cf11e0a1b11ae1000000000000000000000000000000003e000300feff090006000000000000
0000000000010000000100000000000000001000000200000001000000feffffff000000000000
0000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
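The first fields of this sample can be decoded by hand. The sketch below assumes the OLE 1.0 embedded object layout (version, format ID, three length-prefixed strings, data size, then the data); the function name `parse_ole1_header` and the dictionary keys are illustrative only and are not taken from rtfobj.

```python
import struct

def parse_ole1_header(data: bytes) -> dict:
    """Parse the start of an OLE 1.0 embedded object (field layout is an assumption)."""
    pos = 0

    def read_u32() -> int:
        # All integers are read as 32-bit little-endian values.
        nonlocal pos
        (value,) = struct.unpack_from("<I", data, pos)
        pos += 4
        return value

    def read_string() -> bytes:
        # Strings are prefixed by their length, which includes the trailing NUL.
        nonlocal pos
        length = read_u32()
        raw = data[pos:pos + length]
        pos += length
        return raw.rstrip(b"\x00")

    header = {
        "ole_version": read_u32(),
        "format_id": read_u32(),
        "class_name": read_string(),
        "topic_name": read_string(),
        "item_name": read_string(),
        "data_size": read_u32(),
    }
    header["data_offset"] = pos  # where the embedded data starts
    return header

# The hex lines from the sample above, up to the first 8 bytes of the payload.
sample = bytes.fromhex(
    "01050000" "02000000" "11000000"
    "576f72642e446f63756d656e742e313200"
    "00000000" "00000000" "003a0000"
    "d0cf11e0a1b11ae1"
)
info = parse_ole1_header(sample)
print(info["class_name"], hex(info["data_size"]))
```

On this sample the class name field holds "Word.Document.12", and the payload starts with d0cf11e0a1b11ae1, the signature of an OLE2 compound file.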

It is therefore quite easy to identify and extract those objects, simply by looking for long strings of hexadecimal characters. This is what I implemented in my tool rtfobj.
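As a rough illustration of that naive approach, here is a minimal sketch and not rtfobj's actual code. The name `find_hex_blobs` and the 32-digit threshold are arbitrary choices made for this example. Note that this simple approach is exactly what the tricks described later in this article are designed to defeat.

```python
import re

# Hypothetical threshold: shorter runs (e.g. numeric parameters) are ignored.
MIN_HEX_CHARS = 32

# A run of hex digits, optionally split by spaces or newlines. The lookbehind
# prevents a run from starting in the middle of a control word such as
# "\objdata", whose last letter "a" is also a valid hex digit.
HEX_RUN = re.compile(rb"(?<![0-9A-Za-z])[0-9A-Fa-f](?:[0-9A-Fa-f \r\n]*[0-9A-Fa-f])?")

def find_hex_blobs(rtf: bytes, min_chars: int = MIN_HEX_CHARS) -> list:
    """Return the decoded bytes of every long, well-formed hex run found in rtf."""
    blobs = []
    for match in HEX_RUN.finditer(rtf):
        digits = re.sub(rb"[ \r\n]", b"", match.group())
        # Only well-formed runs are kept: long enough, even number of digits.
        if len(digits) >= min_chars and len(digits) % 2 == 0:
            blobs.append(bytes.fromhex(digits.decode("ascii")))
    return blobs

rtf = (b"{\\object\\objemb{\\*\\objdata\n01050000\n02000000\n"
       + b"d0cf11e0a1b11ae1" * 4 + b"}}")
blobs = find_hex_blobs(rtf)
print(len(blobs), blobs[0][:8].hex())
```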

Anti-Analysis Tricks

However, I recently came across several malicious RTF samples (thanks to @iHeartMalware, @r00tbsd and @Sebdraven) which could not be properly parsed by rtfobj.

After investigation, it turned out that they were specially crafted to fool security tools, by using various anti-analysis tricks. Let's have a closer look.

Trick #1: Incomplete RTF Header

Instead of starting with “{\rtf1” as described in the MS specifications, some samples start with “{\rt0” or “{\rtvpn”. Such samples may be found using my custom malware search engine, for example this one (SHA256: 04beed90f6a7762d84a455f8855567906de079f48ddaabe311a6a281e90bd36f).

MS Word accepts them as RTF documents, whereas many analysis tools will only see text files and fail to parse them as RTF. For example, this is the case for malwr.com, as shown on the following screenshot:

It looks like MS Word's RTF parser only looks for “{\rt” to decide if a file is actually RTF.
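The difference between a strict check and one that mimics the behaviour observed in MS Word can be sketched as follows. Both function names are made up for this example, and the lenient prefix is only an assumption based on the observation above, not on any documented behaviour.

```python
STRICT_MAGIC = b"{\\rtf1"   # what the specifications require
LENIENT_MAGIC = b"{\\rt"    # what MS Word appears to accept (assumption)

def is_rtf_strict(data: bytes) -> bool:
    """Check for the header mandated by the RTF specifications."""
    return data.startswith(STRICT_MAGIC)

def is_rtf_lenient(data: bytes) -> bool:
    """Check for the shorter prefix that MS Word seems to rely on."""
    return data.startswith(LENIENT_MAGIC)

for header in (b"{\\rtf1\\ansi", b"{\\rt0\\ansi", b"{\\rtvpn\\ansi"):
    print(header, is_rtf_strict(header), is_rtf_lenient(header))
```

A tool that only applies the strict check would treat the last two headers as plain text and skip RTF parsing altogether.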

Trick #2: Odd Number of Hex Digits

Data encoded as bytes in hexadecimal should always have an even number of hex digits. This may look obvious, but if one adds one extra hex digit at the end of hex-encoded data, some parsers may fail when decoding the data. It used to be the case for my tool rtfobj.

Here again, MS Word is permissive, and any extra hex digit is simply ignored.
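A decoder can imitate this behaviour by dropping a trailing unpaired digit before decoding. This is a minimal sketch with a hypothetical function name, not the code used in rtfobj:

```python
def hex_decode_lenient(hex_digits: str) -> bytes:
    """Decode hex digits, ignoring one extra digit at the end if there is any."""
    if len(hex_digits) % 2:
        # Odd number of digits: drop the last one, as MS Word appears to do.
        hex_digits = hex_digits[:-1]
    return bytes.fromhex(hex_digits)

try:
    bytes.fromhex("0105000")        # a strict decoder rejects 7 digits
except ValueError as error:
    print("strict decoding failed:", error)

print(hex_decode_lenient("0105000").hex())
```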

Trick #3: Extra Whitespace Between Hex Digits

Hex-encoded data produced by MS Word is neatly aligned, with all hex digits grouped together. But according to my tests with MS Word, it is actually possible to insert whitespace characters (space, tab, newline, etc.) between bytes, and even between the two hex digits representing a single byte.

This is therefore equivalent to the previous example:

{\object\objemb{\*\objclass Word.Document.12}\objw9355\objh1018{\*\objdata
01050000
02000000
11000000
5 7    6 f  7 2 6 4         2 e44 6f 637  56d
656e742e313200
00000000
00000000
003a0000
d0cf11e0a1b11ae1000000000000000000000000000000003e000300feff090006000000000000
0000000000010000000100000000000000001000000200000001000000feffffff000000000000
0000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
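One way to cope with this is to remove all whitespace before pairing the digits, rather than relying on a decoder that expects each byte's two digits to be adjacent. The sketch below uses a hypothetical helper name and is applied to the two obfuscated lines of the example above:

```python
import re

def hex_decode_ignoring_whitespace(text: str) -> bytes:
    """Strip every whitespace character, then decode the remaining hex digits."""
    digits = re.sub(r"\s+", "", text)
    return bytes.fromhex(digits)

obfuscated = "5 7    6 f  7 2 6 4         2 e44 6f 637  56d\n656e742e313200"

try:
    bytes.fromhex(obfuscated)       # fails: a byte's two digits are split apart
except ValueError as error:
    print("direct decoding failed:", error)

print(hex_decode_ignoring_whitespace(obfuscated))
```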


Trick #4: Dummy RTF Control Words Within the Hex-encoded Data

Looking at the same sample shown above, we can see there is an extra control word “{\object}” located inside hex-encoded data:

This case is not mentioned in the RTF specifications, but according to my tests it looks like MS Word simply ignores any such unexpected control word. To make it even more complex for parsers, those control words may be nested, for example “{\foo{}{\bar}}”.
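A parser can handle this by tracking the brace depth and keeping only the hex digits found at the top level of the object data. The following sketch makes several simplifying assumptions: the input starts right after “\objdata”, dummy control words are always wrapped in braces, and escaped braces are not handled. The function name is hypothetical.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def hex_digits_outside_groups(objdata: str) -> str:
    """Keep only the hex digits that are not inside a nested {...} group."""
    depth = 0
    kept = []
    for char in objdata:
        if char == "{":
            depth += 1                  # entering a (possibly nested) dummy group
        elif char == "}":
            if depth == 0:
                break                   # closing brace of the \objdata group itself
            depth -= 1
        elif depth == 0 and char in HEX_DIGITS:
            kept.append(char)           # whitespace and other characters are dropped
    return "".join(kept)

print(hex_digits_outside_groups("0105{\\object}0000{\\foo{}{\\bar}}02"))
```

Tracking the depth matters because letters such as “b”, “e” or “f” inside the dummy control words are valid hex digits, and would corrupt the decoded data if they were kept.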

Trick #5: Binary Data Within the Hex-encoded Data

Other weaponized RTF samples, such as this one (SHA256: 65cd5120aa5133659bf190cb4d97609435d059b88f9e3a969efbad070d4126bd), contain a control word “\bin” followed by lots of zeros and a few binary characters, right in the middle of a hex-encoded object:

Looking at the RTF Specifications page 145, section “Pictures”, picture data may either be stored as hex-encoded data, or raw binary data following the “\binN” control word. Although the specifications do not mention that possibility for OLE objects, it looks like MS Word supports it. In practice, any combination of hex-encoded and binary data is allowed.

The “\bin” control word is followed by a decimal number corresponding to the number of raw binary characters that follow it directly. Leading zeros are allowed. According to the specifications page 8, “An RTF parser must allow for up to 10 digits optionally preceded by a minus sign.”

From my tests with MS Word, the decimal number may in fact contain up to 250 digits.

Finally, the \bin control word with its decimal parameter may be followed by an optional space character, before the raw binary data starts.
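The rules above can be sketched as a small decoder that mixes hex-encoded and raw binary data. The function name is hypothetical, and the handling of an unpaired hex digit right before “\bin” is a guess made for this example, since I have not verified what MS Word does in that case.

```python
import re

# "\bin", a decimal count (leading zeros allowed), then one optional space.
BIN_WORD = re.compile(rb"\\bin(\d+) ?")
HEX_DIGITS = b"0123456789abcdefABCDEF"

def decode_hex_and_bin(objdata: bytes) -> bytes:
    """Decode object data mixing hex digits and \\binN raw binary chunks."""
    out = bytearray()
    pending = bytearray()           # hex digits seen since the last \bin

    def flush() -> None:
        digits = pending.decode("ascii")
        if len(digits) % 2:
            digits = digits[:-1]    # assumption: an unpaired digit is dropped
        out.extend(bytes.fromhex(digits))
        pending.clear()

    pos = 0
    while pos < len(objdata):
        match = BIN_WORD.match(objdata, pos)
        if match:
            flush()
            count = int(match.group(1).decode("ascii"))
            start = match.end()
            out.extend(objdata[start:start + count])   # raw bytes, copied as-is
            pos = start + count
            continue
        if objdata[pos] in HEX_DIGITS:
            pending.append(objdata[pos])
        pos += 1                    # anything else (whitespace, etc.) is skipped
    flush()
    return bytes(out)

data = b"0105\\bin0000004 \xd0\xcf\x11\xe00200"
print(decode_hex_and_bin(data).hex())
```

Python integers have no size limit, so a count written with 250 digits is parsed without any special handling.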

OK, but does it really matter?

Well, many tools need to parse RTF documents and to extract the files they may contain in OLE objects. This includes security analysis tools such as rtfobj and OfficeMalScanner/RTFScan, but also any antivirus engine.

I made a very quick test using a simple RTF document with an OLE Package object containing the EICAR test file: on VirusTotal, 30 out of the 56 antivirus engines detect the EICAR file. This may be surprising, as one would expect 100% detection, but it has been known for years that some antivirus vendors deliberately choose not to analyse files embedded in RTF, because those files will be caught as soon as the user tries to open them.

However, after applying all the tricks described in this article, only 6 antivirus engines out of 56 detect something suspicious. But out of these 6, only two actually detect the EICAR file properly. The other 4 report that the RTF file is malformed or obfuscated.

The consequence is that all the tricks described above can be used to hide malicious payloads that will not be picked up by most e-mail gateways, web proxies and intrusion detection systems. And they have already been used in the wild for several malware campaigns.

Rtfobj

My tool rtfobj is a fairly simple Python script I wrote to identify and extract OLE objects in RTF documents. I recently improved it to handle all the cases described in this article.

It was especially used by @r00tbsd and @Sebdraven to analyze a recent malware sample exploiting the MS Word vulnerability CVE-2015-1641, as described in this article: http://www.sekoia.fr/blog/ms-office-exploit-analysis-cve-2015-1641/.

The new version of rtfobj may now be downloaded as part of the oletools package.