ElementTree is a "pythonic" XML parser interface developed by Fredrik Lundh that has been included in the Python standard library since version 2.5. It provides a simple and intuitive API for processing XML (well, much simpler and more intuitive than the usual parsers). lxml is a more efficient parser with a compatible interface. Here are some useful tips for using ElementTree and lxml.
ElementTree
The ElementTree documentation included in the Python 2.5 manual is far from complete. Fredrik Lundh's pages on his effbot website are still necessary to take advantage of all useful ElementTree features:
- Overview: http://effbot.org/zone/element-index.htm
- Documentation: http://effbot.org/zone/element.htm
- API reference: http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm
How to import ElementTree:
To keep code portable, it is necessary to support ElementTree both before and after Python 2.5. This can be done as follows:
try:
    # Python 2.5+: batteries included
    import xml.etree.ElementTree as ET
except ImportError:
    try:
        # Python <2.5: standalone ElementTree install
        import elementtree.ElementTree as ET
    except ImportError:
        # call-style raise works on both Python 2 and Python 3
        raise ImportError("ElementTree is not installed, "
                          "see http://effbot.org/zone/element-index.htm")
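You may also replace ElementTree with cElementTree to get an optimized version of the parser written in C. See below for a performance comparison. (On Python 3.3 and later, xml.etree.ElementTree uses the C accelerator automatically when it is available, and the separate cElementTree module was removed in Python 3.9.)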
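To give an idea of what the API looks like once the module is imported as ET, here is a minimal sketch. The sample document and the names SAMPLE, titles and new_book are made up for illustration; only standard ElementTree calls (fromstring, findall, findtext, get, SubElement, tostring) are used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
SAMPLE = """<library>
  <book id="1"><title>Dive Into Python</title></book>
  <book id="2"><title>Python in a Nutshell</title></book>
</library>"""

# Parse a string into an Element (the root of the tree).
root = ET.fromstring(SAMPLE)

# Collect the title of each <book> child, keyed by its "id" attribute.
titles = {}
for book in root.findall("book"):        # direct children named "book"
    titles[book.get("id")] = book.findtext("title")

# Add a new element to the tree, then serialize the tree again.
new_book = ET.SubElement(root, "book", id="3")
ET.SubElement(new_book, "title").text = "Learning Python"
xml_bytes = ET.tostring(root)

print(titles)
```

Elements behave much like lists of their children with a dictionary of attributes attached, which is most of what there is to learn for everyday use.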
lxml
lxml is another module providing an ElementTree-compatible API, with additional features thanks to the libxml2 and libxslt libraries:
- full XPath support
- XSLT
- XML Schema and RELAX NG validation
- canonicalization (C14N)
- XInclude
- namespace preservation
- and many other features and subtleties
Official website: http://codespeak.net/lxml/
How to import lxml:
Switching from ElementTree to lxml is usually just a matter of changing the import lines:
try:
    import lxml.etree as ET
except ImportError:
    # call-style raise works on both Python 2 and Python 3
    raise ImportError("lxml is not installed, see http://codespeak.net/lxml/")
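As a rough illustration of that compatibility, the sketch below runs unchanged with either module. It assumes lxml may or may not be installed and falls back to the standard library instead of raising; the sample data and the names SAMPLE, values and backend are hypothetical.

```python
try:
    import lxml.etree as ET               # used when lxml is installed
except ImportError:
    import xml.etree.ElementTree as ET    # standard-library fallback

# Hypothetical sample data, made up for this illustration.
SAMPLE = b"<config><item name='a'>1</item><item name='b'>2</item></config>"

root = ET.fromstring(SAMPLE)

# findall(), get() and .text are available in both implementations.
values = {item.get("name"): int(item.text) for item in root.findall("item")}

# The module name tells which implementation was actually imported.
backend = ET.__name__

print(backend, values)
```

Code that sticks to this common subset does not need to know which parser it got. Features specific to lxml, such as full XPath or XSLT, are of course only available when lxml itself was imported.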
Performance comparison
When parsing large XML files, performance matters. For example, I parsed a large and complex 11 MB XML file using ElementTree, cElementTree and lxml, first in a normal environment and then with psyco enabled. Here are the results:
1) parsing with lxml...
lxml: 1.231 s
2) parsing with cElementTree...
cElementTree: 4.416 s
3) parsing with ElementTree...
ElementTree: 15.927 s

same tests with psyco.full() enabled:
4) parsing with lxml...
lxml: 4.486 s
5) parsing with cElementTree...
cElementTree: 2.731 s
6) parsing with ElementTree...
ElementTree: 14.419 s
This simple test may not be very representative, but it clearly shows two things:
- lxml is roughly 3.6 times faster than cElementTree and about 13 times faster than ElementTree when parsing a large XML file.
- using psyco improves cElementTree performance (4.416 s down to 2.731 s), but it slows lxml down (1.231 s up to 4.486 s)!
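A comparison of this kind can be reproduced with a small timing harness. The following is only a sketch under stated assumptions: it is not the script used for the figures above, it parses a small synthetic document instead of the 11 MB file, and the helper names make_sample and time_parse are hypothetical. lxml is timed only if it happens to be installed, and the timings it prints are not expected to match the ones listed here.

```python
import io
import time
import xml.etree.ElementTree as ET

def make_sample(n_items):
    """Build a synthetic XML document as bytes (a stand-in for a real file)."""
    parts = ["<root>"]
    for i in range(n_items):
        parts.append('<item id="%d"><name>item %d</name></item>' % (i, i))
    parts.append("</root>")
    return "".join(parts).encode("utf-8")

def time_parse(module, data):
    """Return (seconds, element count) for one parse with the given module."""
    start = time.perf_counter()
    tree = module.parse(io.BytesIO(data))
    elapsed = time.perf_counter() - start
    # Count every element so we can check that both parsers saw the same tree.
    count = sum(1 for _ in tree.getroot().iter())
    return elapsed, count

# Parsers to compare; lxml is optional.
parsers = {"ElementTree": ET}
try:
    import lxml.etree as LET
    parsers["lxml"] = LET
except ImportError:
    pass

data = make_sample(2000)
results = {name: time_parse(mod, data) for name, mod in parsers.items()}

for name, (seconds, count) in sorted(results.items()):
    print("%s: %.3f s (%d elements)" % (name, seconds, count))
```

A single run on a small document is noisy, so for real measurements it is worth repeating each parse several times on a file representative of your own data.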
In conclusion, I would recommend lxml for most XML processing, with a fallback to cElementTree for portability, like this:
try:
    # lxml: best performance for XML processing
    import lxml.etree as ET
except ImportError:
    try:
        # Python 2.5+: batteries included
        import xml.etree.cElementTree as ET
    except ImportError:
        try:
            # Python <2.5: standalone ElementTree install
            import elementtree.cElementTree as ET
        except ImportError:
            # call-style raise works on both Python 2 and Python 3
            raise ImportError("lxml or ElementTree are not installed, "
                              "see http://codespeak.net/lxml "
                              "or http://effbot.org/zone/element-index.htm")
To be continued...