PDF files may be used to trigger malicious content, as described here. PDFiD is a Python tool to analyze and sanitize PDF files, written by Didier Stevens. Here is PDFiD_PL, a version that I have slightly modified so that it can be imported as a module in Python applications (originally for ExeFilter).
The modified version is named pdfid_PL.py. The main differences from the original tool are in the signature of the PDFiD function:
def PDFiD(file, allNames=False, extraData=False, disarm=False, force=False, output_file=None, raise_exceptions=False, return_cleaned=False, active_keywords=ACTIVE_KEYWORDS):
The following parameters have been added:
All these parameters are optional, so that pdfid_PL.py runs exactly like the original pdfid.py when they are not set.
pdfid_PL is updated each time Didier Stevens modifies pdfid:
Download the attached file below.
Alternatively, you may get the latest version from the ExeFilter SVN.
import pdfid_PL as pdfid

# Disarm active content, write the result to cleaned.pdf,
# and report whether anything had to be changed.
xmldoc, cleaned = pdfid.PDFiD('file.pdf', disarm=True, output_file='cleaned.pdf',
                              raise_exceptions=True, return_cleaned=True)
if cleaned:
    print('PDF has been cleaned.')
else:
    print('PDF is clean.')