G'Day,
Apologies for not responding sooner - I've been too embarrassed.
Re-googling instantly gave the answer - xml_split. On my
Ubuntu Linux desktop, it's in package xml-twig-tools.
Thanks to Peter, who reminded me about awk (I'd not forgotten
about it), and thanks to Chris for writing 160 lines of
shell code, but I knew there had to be a trivially easy
tool out there.
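
For anyone who would rather stay inside Python, the same kind of split
can probably be done with the standard library's
xml.etree.ElementTree.iterparse, which hands back elements as they are
parsed instead of building the whole tree. What follows is only a
minimal two-pass sketch, not what xml_split does internally. The names
split_xml, _top_level_children, _finish, item_tag, out_pattern and
prolog are made up for illustration. It assumes the layout from my
original mail: no namespaces, every <Datum> a direct child of the root,
header elements before the first <Datum> and footer elements after the
last. It does not copy the DOCTYPE line; that would have to be passed
in through prolog.

```python
import xml.etree.ElementTree as ET


def _top_level_children(src):
    """Yield (root_tag, child) for each direct child of the root, streaming."""
    depth, root = 0, None
    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            if depth == 0:
                root = elem
            depth += 1
        else:
            depth -= 1
            if depth == 1:            # a direct child of the root just closed
                elem.tail = None      # ignore whitespace between elements
                yield root.tag, elem
                root.clear()          # drop the finished child to keep memory flat


def _finish(out, footer, root_tag):
    """Write the footer elements and the closing root tag, then close the file."""
    out.write("".join(f + "\n" for f in footer) + "</%s>\n" % root_tag)
    out.close()


def split_xml(src, n_parts, item_tag="Datum", out_pattern="part_{:02d}.xml",
              prolog='<?xml version="1.0"?>\n'):
    """Split the file at path `src` into at most n_parts files; return their paths."""
    # Pass 1: count the items and keep the (small) header/footer elements as text.
    header, footer, count, root_tag = [], [], 0, None
    for root_tag, elem in _top_level_children(src):
        if elem.tag == item_tag:
            count += 1
        else:
            text = ET.tostring(elem, encoding="unicode")
            (footer if count else header).append(text)
    if count == 0:
        return []
    per_file = -(-count // n_parts)   # ceiling division

    # Pass 2: stream the items into the output files, per_file items each.
    paths, out, written = [], None, 0
    for _, elem in _top_level_children(src):
        if elem.tag != item_tag:
            continue
        if written % per_file == 0:   # time to start a new output file
            if out:
                _finish(out, footer, root_tag)
            paths.append(out_pattern.format(len(paths) + 1))
            out = open(paths[-1], "w", encoding="utf-8")
            out.write(prolog + "<%s>\n" % root_tag)
            out.write("".join(h + "\n" for h in header))
        out.write(ET.tostring(elem, encoding="unicode") + "\n")
        written += 1
    _finish(out, footer, root_tag)
    return paths
```

Calling split_xml("BigFile.xml", 10) would write part_01.xml,
part_02.xml and so on, and return the list of paths. Because the
per-file count is rounded up, it can produce fewer than 10 files. Two
passes keep the footer handling simple at the cost of reading the input
twice.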
A thing about Python I just learned and really love is:
> import sys
> import apt
>
> # python-apt: look the package up in the local APT cache
> cache = apt.Cache()
> if not cache['xml-twig-tools'].is_installed:
>     print("Please install xml-twig-tools and rerun")
>     sys.exit(1)
This makes it mind-bogglingly easy for a Python script to check
whether a tool it needs is installed. Fantastic!
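
The apt check only works on Debian-style systems where python-apt is
available. Where it is enough for the executable to be somewhere on
PATH, a more portable variant might use shutil.which from the standard
library (Python 3.3 or later). This is just a sketch; the helper name
require_tool and its package_hint parameter are hypothetical, not part
of any library.

```python
import shutil
import sys


def require_tool(name, package_hint=None):
    """Return the full path of executable `name`, or exit with an install hint."""
    path = shutil.which(name)         # None when `name` is not on PATH
    if path is None:
        hint = " (package %s)" % package_hint if package_hint else ""
        # sys.exit with a string prints it to stderr and exits with status 1
        sys.exit("Please install %s%s and rerun" % (name, hint))
    return path
```

A script would then call require_tool("xml_split", "xml-twig-tools")
once at startup, before doing any real work.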
Cheers,
Tom
On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> G'Day,
>
> Any easy XML (Python or otherwise) tools for splitting a 750M
> XML file down into smaller portions?
>
> Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool. On IBM DeveloperWorks site, I found an article detailing
> using XSLT, but in other places it states XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (xalan, saxon) will succeed. (Not to mention it's been more than
> 10 years since I last worked with XSLT.)
>
> Original file looks like:
> <?xml version="1.0"?>
> <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> <BigFile>
> <TrivialHeader> blah </TrivialHeader>
> <Datum> A couple hundred thousand Datum elements.</Datum>
> <Datum> 'Datum' are non-trivial, containing extensive subtrees.</Datum>
> <Datum> ...etc... </Datum>
> <TrivialFooter> blah </TrivialFooter>
> </BigFile>
>
>
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which have the <BigFile>,
> <TrivialHeader> and <TrivialFooter> tags,
> but 1/10th as many <Datum>s per file.
>
>
> The problem is that on my 4Gig laptop, I run out of memory
> for any tool which tries to read in the whole tree at
> one time. In my case, Python's ElementTree fails, a la:
>
> > import xml.etree.ElementTree
> > fin = open("BigFile.xml", "r")
> > tree = xml.etree.ElementTree.parse(fin)  # --> Out of Memory
>
>
> Solution doesn't have to be Python, but it would be nicest
> if it were, as the rest of the processing is all done in
> a Python script.
>
>
> Cheers,
> Tom
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html