G'Day,

An option designed for exactly this situation is STX (Streaming
Transformations for XML), an XSLT-like transformation language that
works in streaming fashion.  There is a Java implementation called
Joost.

Here is a link to the STX sourceforge page (http://stx.sourceforge.net/).
That page includes a link to Joost.
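If a pure-Python route is still preferred, here is a minimal sketch
using ElementTree's iterparse, based on the <BigFile>/<Datum> layout in
Tom's example below.  The function name, the output file names, and the
round-robin dealing are my own illustrative choices, not something I've
run against the real 750M file:

```python
import xml.etree.ElementTree as ET

def split_big_file(src, n_chunks=10):
    # Open all chunk files up front and give each the XML prologue
    # and the opening <BigFile> tag.
    outs = [open("BigFile-%02d.xml" % k, "wb") for k in range(n_chunks)]
    for f in outs:
        f.write(b'<?xml version="1.0"?>\n<BigFile>\n')
    i = 0
    # iterparse fires an "end" event as each element closes, so we
    # never hold more than one <Datum> subtree in memory at a time.
    for event, elem in ET.iterparse(src, events=("end",)):
        if elem.tag == "TrivialHeader":
            # The header closes before the first Datum, so it can be
            # copied into every chunk as soon as we see it.
            for f in outs:
                f.write(ET.tostring(elem))
        elif elem.tag == "Datum":
            # Deal the Datum elements out round-robin.
            outs[i % n_chunks].write(ET.tostring(elem))
            i += 1
            elem.clear()  # free the subtree we just serialised
        elif elem.tag == "TrivialFooter":
            for f in outs:
                f.write(ET.tostring(elem))
    for f in outs:
        f.write(b"</BigFile>\n")
        f.close()
```

Each cleared <Datum> leaves an empty element attached to the root, so
memory isn't perfectly flat, but that residue is tiny compared with the
subtrees themselves.  If the original Datum ordering matters within
each output file, you'd write them out sequentially instead, which
needs the total count up front.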

Regards,
Bill Donoghoe

--------- Forwarded message ----------

> From: Tom Deckert <[email protected]>
> To: [email protected]
> Date: Mon, 31 Jan 2011 12:32:48 +1100
> Subject: [SLUG] Re: Python, XML, and Splitting a 750M XML File?
>
> G'Day,
>
> Apologies for not responding sooner - I've been too embarrassed.
> Re-googling instantly gave the answer - xml_split.  On my
> Ubuntu Linux desktop, it's in package xml-twig-tools.
>
> Thanks to Peter, who reminded me about awk (I hadn't forgotten
> about it), and thanks to Chris for writing 160 lines of
> shell code, but I knew there had to be a trivially easy
> tool out there.
>
> A thing about Python I just learned and really love is:
>
> > import apt
> > import sys
> > cache = apt.Cache()
> > if not cache['xml-twig-tools'].isInstalled:
> >     print "Please install xml-twig-tools and rerun"
> >     sys.exit(1)
>
> This makes it mind-bogglingly easy for a Python script to check
> whether a tool it needs is installed.  Fantastic!
>
> Cheers,
> Tom
>
>
>
> On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> > G'Day,
> >
> > Any easy XML (Python or otherwise) tools for splitting a 750M
> > XML file down into smaller portions?
> >
> > Because the file is so large that it exceeds available memory,
> > I think the tool needs to be a 'streaming' tool.  On IBM's
> > DeveloperWorks site, I found an article detailing how to use
> > XSLT, but other sources state that XSLT tools usually aren't
> > streaming, so I'm guessing none of the XSLT processors
> > (xalan, saxon) will succeed.  (Not to mention it's been more
> > than 10 years since I last worked with XSLT.)
> >
> > Original file looks like:
> > <?xml version="1.0"?>
> > <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> > <BigFile>
> > <TrivialHeader> blah </TrivialHeader>
> > <Datum> A couple hundred thousand Datum elements.</Datum>
> > <Datum> 'Datum' are non-trivial, containing extensive subtrees.</Datum>
> > <Datum> ...etc... </Datum>
> > <TrivialFooter> blah </TrivialFooter>
> > </BigFile>
> >
> >
> > I'd like a tool to split that into maybe
> > 10 different, valid XML files, all of which have the <BigFile>,
> > <TrivialHeader> and <TrivialFooter> tags,
> > but 1/10th as many <Datum>s per file.
> >
> >
> > The problem is that on my 4Gig laptop, I run out of memory
> > for any tool which tries to read in the whole tree at
> > one time.  In my case, Python's ElementTree fails, à la:
> >
> > > import xml.etree.ElementTree
> > > fin  = open("BigFile.xml", "r")
> > > tree = xml.etree.ElementTree.parse(fin)  --> Out of Memory
> >
> >
> > The solution doesn't have to be Python, but it would be nicest
> > if it were, as the rest of the processing is done in
> > a Python script.
> >
> >
> > Cheers,
> > Tom
> >
> >
> >
> >
> >
>
>
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
