On 6 January 2011 13:51, Tom Deckert <[email protected]> wrote:
> G'Day,
>
> Any easy XML (Python or otherwise) tools for splitting a 750M
> XML file down into smaller portions? Because the file is so large
> and exceeds memory size, I think the tool needs to be a 'streaming'
> tool. On the IBM developerWorks site, I found an article detailing
> using XSLT, but in other places it states that XSLT tools usually
> aren't streaming, so I'm guessing none of the XSLT processors
> (Xalan, Saxon) will succeed. (Not to mention it's been more than
> 10 years since I last worked with XSLT.)
>
> The original file looks like:
>
> <?xml version="1.0"?>
> <!DOCTYPE BigFile SYSTEM "BigFile.dtd">
> <BigFile>
> <TrivialHeader> blah </TrivialHeader>
> <Datum> A couple hundred thousand Datum elements. </Datum>
> <Datum> 'Datum' elements are non-trivial, containing extensive subtrees. </Datum>
> <Datum> ...etc... </Datum>
> <TrivialFooter> blah </TrivialFooter>
> </BigFile>
>
> I'd like a tool to split that into maybe
> 10 different, valid XML files, all of which have the <BigFile>,
> <TrivialHeader> and <TrivialFooter> tags,
> but 1/10th as many <Datum>s per file.
>
> The problem is that on my 4 GB laptop, I run out of memory
> with any tool which tries to read the whole tree in at
> once. In my case, Python's ElementTree fails:
>
> > fin = open("BigFile.xml", "r")
> > tree = xml.etree.ElementTree.parse(fin)  --> Out of Memory
>
> The solution doesn't have to be Python, but it would be nicest
> if it were, as the rest of the processing is all done in
> a Python script.
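A streaming split along these lines can be sketched with the standard library's xml.etree.ElementTree.iterparse, which yields elements as their end tags are parsed, so the full tree is never held in memory. This is only an illustrative sketch: the function names and the hard-coded header/footer strings are my assumptions, not from the original post, and a real version would write each part straight to its own file.

```python
import xml.etree.ElementTree as ET

def iter_datum_chunks(source, chunk_size):
    """Yield lists of serialized <Datum> elements, chunk_size at a time.

    iterparse streams the file, so only one <Datum> subtree (plus a
    small amount of bookkeeping) is in memory at any moment.
    """
    chunk = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Datum":
            chunk.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # discard the finished subtree to keep memory flat
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

def split_bigfile(source, chunk_size,
                  header="<TrivialHeader> blah </TrivialHeader>",
                  footer="<TrivialFooter> blah </TrivialFooter>"):
    """Return a list of small, valid XML documents, each wrapped in
    <BigFile> with the header/footer and at most chunk_size <Datum>s.

    The header/footer defaults are placeholders; in practice you would
    capture them from the stream, and write each part to disk instead
    of collecting the documents in a list.
    """
    parts = []
    for chunk in iter_datum_chunks(source, chunk_size):
        parts.append("<BigFile>%s%s%s</BigFile>" % (header, "".join(chunk), footer))
    return parts
```

One caveat with this pattern: elem.clear() empties each finished <Datum>, but the (now empty) elements still accumulate on the root; for a file of this size that overhead is small, but it can also be trimmed by removing children from the root as you go.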
Out of interest, is it just one large XML file, or multiple XML documents within one file? Also, have you tried lxml? [0]

[0] - http://codespeak.net/lxml/

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
