On 6 January 2011 13:51, Tom Deckert <[email protected]> wrote:
>
> G'Day,
>
> Any easy XML (Python or otherwise) tools for splitting a 750M
> XML file down into smaller portions?
>
> Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool.  On IBM DeveloperWorks site, I found an article detailing
> using XSLT, but in other places it states XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (xalan, saxon) will succeed.  (Not to mention it's been more than
> 10 years since I last worked with XSLT.)
>
> Original file looks like:
> <?xml version="1.0"?>
> <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> <BigFile>
> <TrivialHeader> blah </TrivialHeader>
> <Datum> A couple hundred thousand Datum elements.</Datum>
> <Datum> 'Datum' are non-trivial, containing extensive subtrees.</Datum>
> <Datum> ...etc... </Datum>
> <TrivialFooter> blah </TrivialFooter>
> </BigFile>
>
>
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which have the <BigFile>,
> <TrivialHeader> and <TrivialFooter> tags,
> but 1/10th as many <Datum>s per file.
>
>
> The problem is that on my 4Gig laptop, I run out of memory
> for any tool which tries to read in the whole tree at
> one time.  In my case, Python's ElementTree fails, à la:
>
>> fin  = open("BigFile.xml", "r")
>> tree = xml.etree.ElementTree.parse(fin)  --> Out of Memory
>
>
> Solution doesn't have to be Python, but it would be nicest
> if it were, as rest of the processing is all done in
> a Python script.

Out of interest, is it just one large XML document, or multiple XML
documents concatenated within one file?

Also, have you tried lxml? [0]

[0] - http://codespeak.net/lxml/
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
