G'Day,

An option based on an XPath-style language is STX (Streaming Transformations for XML). There is a Java implementation called Joost.
Here is a link to the STX SourceForge page (http://stx.sourceforge.net/). That page includes a link to Joost.

Regards,
Bill Donoghoe

---------- Forwarded message ----------
> From: Tom Deckert <[email protected]>
> To: [email protected]
> Date: Mon, 31 Jan 2011 12:32:48 +1100
> Subject: [SLUG] Re: Python, XML, and Splitting a 750M XML File?
>
> G'Day,
>
> Apologies for not responding sooner - I've been too embarrassed.
> Re-googling instantly gave the answer - xml_split. On my
> Ubuntu Linux desktop, it's in package xml-twig-tools.
>
> Thanks to Peter who reminded me about awk (I'd not forgotten
> about it), and thanks to Chris for writing 160 lines of
> shell code, but I knew there had to be a trivially easy
> tool out there.
>
> A thing about Python I just learned and really love is:
>
> > import sys
> > import apt
> > cache = apt.Cache()
> > if not cache['xml-twig-tools'].isInstalled:
> >     print "Please install xml-twig-tools and rerun"
> >     sys.exit(1)
>
> This makes it mind-bogglingly easy for a Python script to check
> whether a tool it needs is installed. Fantastic!
>
> Cheers,
> Tom
>
>
> On Thu, 2011-01-06 at 13:51 +1100, Tom Deckert wrote:
> > G'Day,
> >
> > Are there any easy XML tools (Python or otherwise) for splitting
> > a 750M XML file down into smaller portions?
> >
> > Because the file is so large and exceeds memory size, I think the
> > tool needs to be a 'streaming' tool. On the IBM developerWorks
> > site, I found an article detailing using XSLT, but other sources
> > state that XSLT tools usually aren't streaming, so I'm guessing
> > none of the XSLT processors (Xalan, Saxon) will succeed. (Not to
> > mention it's been more than 10 years since I last worked with XSLT.)
> >
> > The original file looks like:
> >
> > <?xml version="1.0"?>
> > <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> > <BigFile>
> >   <TrivialHeader> blah </TrivialHeader>
> >   <Datum> A couple hundred thousand Datum elements. </Datum>
> >   <Datum> 'Datum' elements are non-trivial, containing extensive subtrees. </Datum>
> >   <Datum> ...etc... </Datum>
> >   <TrivialFooter> blah </TrivialFooter>
> > </BigFile>
> >
> > I'd like a tool to split that into maybe 10 different, valid XML
> > files, all of which have the <BigFile>, <TrivialHeader> and
> > <TrivialFooter> tags, but 1/10th as many <Datum>s per file.
> >
> > The problem is that on my 4 Gig laptop, I run out of memory with
> > any tool which tries to read in the whole tree at one time. In my
> > case, Python's ElementTree fails, a la:
> >
> > > import xml.etree.ElementTree
> > > fin = open("BigFile.xml", "r")
> > > tree = xml.etree.ElementTree.parse(fin)  --> Out of Memory
> >
> > The solution doesn't have to be Python, but it would be nicest if
> > it were, as the rest of the processing is all done in a Python
> > script.
> >
> > Cheers,
> > Tom

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
