Parsing Large XML files, Serially, in Python
So you've been told you have to read this bioinformatic data format, and you just realized that it's essentially one cluster-fuck of XML that's 750MB large.
As you might have figured out already, to read large XML files in one go:
import xml.etree.ElementTree as etree
data = etree.parse('a_very_big.xml')
will kill your memory and freeze your computer.
Before you slit your wrists in despair, let me tell you that in Python, it's not that bad. The good news is that the Python ElementTree module has a great API to handle this situation. The bad news is that nobody has put all the information in one place. Hopefully this article fixes that.
Iterate it
You need to read it in a chunk at a time. All the XML docs say use SAX, but don't. The ElementTree library has a perfectly good Python iterator that does the job:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
print event, elem
This will parse the XML file in chunks at a time and give it to you at every step of the way. start
will trigger when a tag is first encountered. At this point elem
will be empty except for elem.attrib
that contains the properties of the tag. end
will trigger when the closing tag is encountered, and everything in-between has been read.
Namespaces
start-ns
is probably the most important thing you will encounter to process XML serially. ElementTree has wisely provided this call for you to gather all the namespaces in the file. You will need to store this in a special namespace dictionary nsmap
:
nsmap = {}
for event, elem in etree.iterparse(xmL, events=('start-ns')):
ns, url = elem
nsmap[ns] = url
print nsmap
This nsmap
is the key to your sanity. Basically ElementTree reports all tags in the fully expanded name, where namespaces are expanded out in full:
GAML:attribute -> {http://www.bioml.com/gaml/}attribute
However, once you've collected your nsmap, you can construct the tags in a straightforward using the fixtag
function:
def fixtag(ns, tag, nsmap):
return '{' + nsmap[ns] + '}' + tag
So in the GAML
example above, to look for closing tags:
nsmap = {}
results = []
for event, elem in etree.iterparse(xmL, events=('end', 'start-ns')):
if event == 'start-ns':
ns, url = elem
nsmap[ns] = url
if event == 'end':
if elem.tag == fixtag('GAML', 'peptide', nsmap):
result = process_peptide(elem, nsmap)
results.append(result)
And we send the peptide
subbranch of the XML file into a parsing function. This structure will be nicely parsed.
In one of my XML files, I have some implicit top-level namespace, where tag expands out to:
peptide -> {http://regis-web.systemsbiology.net/protXML}peptide
Turns out the namespace is actually ''
, so for fixtag, I use:
fixtag('', peptide, nsmap)
Pruning the branch
Now here's the most important part for serial reading, once
you've parsed your chunk, you have to clear it with .clear()
:
for event, elem in etree.iterparse(xmL, events=('end')):
if event == 'end':
if elem.tag == fixtag('GAML', 'peptide', nsmap):
process_peptide(elem, nsmap)
elem.clear()
The garbage collection will then claim the freed memory. It is thus very important that you choose independent sub-trees to parse, so that clearing that sub-tree won't affect other parts.
You may have noticed that in the parsing function process_peptide
, I pass in both elem
and nsmap
. That's because you need to generate the full tag to find things even in your sub-tree:
ydata = etree.findall(fixtag('GAML', 'ydata', nsmap))
Another alternative is that the find
functions in etree can take nsmap
and do some parsing, allowing perhaps clearer strings in the search path:
data = etree.findall('GAML:ydata/GAML:xdata', namespaces=nsmap)
Going down the rabbit hole
Now if your XML document creator has created a well thought-out document, then it's quite easy to traverse the tree by following the tags. But hey, there's no such thing as a well-thought out XML document. Have you ever seen shit like this?
<group>
<group>
<group>
<parameter name="this" value="shit"/>
<parameter name="always" value="happens"/>
<parameter name="to" value="me"/>
</group>
</group>
</group>
Well I have, and this kind of nesting screws up your tag matching if you're looking for a <group>
tag. The answer is that you need to keep track of the level of nesting using the start
event in iterparse
:
group_nesting = 0
for event, elem in etree.iterparse(xmL, events=('start', 'end')):
if event == 'start':
if elem.tag == 'group':
group_nesting += 1
if event == 'end':
if elem.tag == 'group':
if group_nesting == 3:
process_params(elem, nsmap)
elem.clear()
group_nesting -= 1
Reading XML serially in Python: it's not so bad, is it?