Python vs XML

Python lifers seem to have something against XML. Maybe it is because people who come from a Java background (as I do) tend to be quite fond of it and, you know, Python isn't Java ($DEITY forbid). Or maybe Pythonistas actually like the god-awful INI file parser that Python ships with and find it good enough. Maybe it is some other reason entirely. I don't know why, though: the bindings for GNOME's libxml2 make working with XML in Python quite easy, as I found out while working on Republic.

Libxml2 is a C library for parsing XML documents that has been around for some time now. There is also libxslt, which builds on libxml2 to provide an XSLT processor; it has Python bindings too. Note that the bindings are not shipped as part of the standard Python installation and need to be installed separately.

The Good

The thing that strikes me about libxml2 is how simple it is to do common things. For example, parsing an XML file is a two-line process:

import libxml2

doc = libxml2.parseFile(filename)

It hardly gets easier than that. It is almost as easy to query the contents of the XML document using XPath. This requires a few more lines of code, but it's not that bad:

xpath_context = doc.xpathNewContext()

value = xpath_context.xpathEval('/doc/child/@some-attr')
value = value \
    and value[0].xpathCastNodeToString() \
    or None

Okay, admittedly the last few lines are a bit sucky, but they can easily be refactored into a helper function, or removed altogether if we are happy to make the XPath query a bit more clunky:

xpath_context = doc.xpathNewContext()

value = xpath_context.xpathEval('string(/doc/child/@some-attr)')

All this really beats using the SAX or DOM APIs to try to pull values out of an XML document. It isn't the most efficient approach, but hey, the code is much more readable and I'll take that over premature optimisation any day.
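The helper function mentioned above might look something like the following. This is only a sketch: the name xpath_string is made up for illustration, and it assumes the query selects a node set (as in the first XPath example) rather than a string or number.

```python
def xpath_string(xpath_context, query):
    """Return the string value of the first node matched by query, or None.

    xpath_string is a hypothetical helper name, not part of libxml2.
    xpath_context is expected to be the object returned by
    doc.xpathNewContext(); query is assumed to select a node set.
    """
    nodes = xpath_context.xpathEval(query)
    if not nodes:
        # Nothing matched the query.
        return None
    return nodes[0].xpathCastNodeToString()
```

With that in place, the query shrinks to value = xpath_string(xpath_context, '/doc/child/@some-attr'). Unlike the and/or version, this returns an empty string rather than None when the attribute exists but is empty.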

The Bad

Like everything, both libxml2 and libxslt have some warts. My first complaint is that you need to dispose of documents explicitly. I presume this is because the bindings aren't smart enough or because the underlying C API makes it impossible to guess when a document is no longer needed. I would like to look into this when I have the time. The API is pretty odd in other places, too - like calling the attributes of an element (accessed as strings) "props", and sometimes having to manually create actual Python objects from the values returned by function/method calls.
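One way to make the explicit disposal less error-prone is to wrap it in a context manager. The sketch below is an assumption about how you might do that, not something the bindings provide: freeing is a made-up name, and it relies only on the document object having the freeDoc() method.

```python
from contextlib import contextmanager


@contextmanager
def freeing(doc):
    """Yield doc, then call doc.freeDoc() even if the body raises.

    freeing is a hypothetical helper; doc is expected to be a document
    returned by something like libxml2.parseFile(filename).
    """
    try:
        yield doc
    finally:
        doc.freeDoc()
```

It would be used as: with freeing(libxml2.parseFile(filename)) as doc: followed by the code that works with doc. XPath contexts need the same treatment via their xpathFreeContext() method, so a similar wrapper could be written for those.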

The API itself is huge and pretty poorly organised. The libxml2 module contains almost 40 classes, a huge number of functions and even more variables/constants. It would be handy if they were organised into packages of related functionality, if only for making it possible to find the function/class that you are looking for. I guess the size of the API is the price paid for all the built-in convenience.

Lastly, the documentation could certainly be improved. Most functions and methods have a one-line description, which sometimes just repeats the method name and is sometimes actually useful. The classes themselves aren't documented and there is only a single, small tutorial for both libraries. More comprehensive tutorials would be great, as would useful documentation comments for running through Epydoc.

The Ugly

Despite the above, libxml2 and libxslt are useful tools for a Python developer. Maybe if they were part of the core distribution, Pythoneers would not dislike XML so much.
