G'Day,

Apologies for not responding sooner - I've been too embarrassed. 
Re-googling instantly gave the answer - xml_split.  On my
Ubuntu Linux desktop, it's in package xml-twig-tools.

Thanks to Peter, who reminded me about awk (I hadn't actually
forgotten it), and thanks to Chris for writing 160 lines of
shell code, but I knew there had to be a trivially easy
tool out there.  
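For the archives, here's a sketch of driving xml_split from Python, since
the rest of my processing is a Python script anyway.  The -c and -g flags
are recalled from the xml_split man page (-c splits on each named element,
-g groups that many of them per output file), so double-check them against
your version before relying on this:

```python
import shutil
import subprocess

def xml_split_command(path, element="Datum", per_file=20000):
    """Build the xml_split argument list (flags assumed, see above)."""
    return ["xml_split", "-c", element, "-g", str(per_file), path]

cmd = xml_split_command("BigFile.xml")
print(" ".join(cmd))  # -> xml_split -c Datum -g 20000 BigFile.xml

# Only invoke the tool if it is actually on PATH
if shutil.which("xml_split"):
    subprocess.run(cmd, check=True)
```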

A thing about Python I just learned and really love is:

> import sys
> import apt
> cache = apt.Cache()
> if not cache['xml-twig-tools'].isInstalled:
>     print "Please install xml-twig-tools and rerun"
>     sys.exit(1)

This makes it mind-bogglingly easy for a Python script to check 
whether a tool it needs is installed.  Fantastic!
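For the record, since the original question was about streaming:
ElementTree can in fact stream via iterparse, which sidesteps the
out-of-memory parse.  A minimal sketch (Python 3 syntax; element names
taken from the sample document, and the clear() call is what keeps
memory flat):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the real 750M file; same element names as the sample.
sample = b"""<?xml version="1.0"?>
<BigFile>
<TrivialHeader> blah </TrivialHeader>
<Datum> one </Datum>
<Datum> two </Datum>
<Datum> three </Datum>
<TrivialFooter> blah </TrivialFooter>
</BigFile>"""

count = 0
# iterparse yields each element as its end tag arrives, so only one
# <Datum> subtree is in memory at a time -- provided we clear() it.
for event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "Datum":
        count += 1    # write the subtree out to the current chunk here
        elem.clear()  # drop the children so memory stays flat
print(count)  # -> 3
```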

Cheers,
Tom



On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> G'Day,
> 
> Any easy XML (Python or otherwise) tools for splitting a 750M 
> XML file down into smaller portions?  
> 
> Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool.  On IBM DeveloperWorks site, I found an article detailing 
> using XSLT, but in other places it states XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (xalan, saxon) will succeed.  (Not to mention its been more than
> 10 years since I last worked with XSLT.)
> 
> Original file looks like:
> <?xml version="1.0"?>
> <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> <BigFile> 
> <TrivialHeader> blah </TrivialHeader>
> <Datum> A couple hundred thousand Datum elements.</Datum>
> <Datum> 'Datum' are non-trivial, containing extensive subtrees.</Datum>
> <Datum> ...etc... </Datum> 
> <TrivialFooter> blah </TrivialFooter>
> </BigFile>
> 
> 
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which have the <BigFile>,
> <TrivialHeader> and <TrivialFooter> tags, 
> but 1/10th as many <Datum>s per file.  
> 
> 
> The problem is that on my 4Gig laptop, I run out of memory
> for any tool which tries to read in the whole tree at
> one time.  In my case, Python's ElementTree fails, like so:
> 
> > import xml.etree.ElementTree
> > fin  = open("BigFile.xml", "r")
> > tree = xml.etree.ElementTree.parse(fin)  --> Out of Memory
> 
> 
> Solution doesn't have to be Python, but it would be nicest 
> if it were, as rest of the processing is all done in
> a Python script.
> 
> 
> Cheers,
> Tom

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
