RE: XML Parsing for large XML documents

2004-02-26 Thread Robert Fox
Peter and Ed-

Thanks for the replies.

Your suggestions are very good. Here is my problem, though: I don't think 
that I can process this document in a serial fashion, which is what SAX 
requires. I need to do a lot of node hopping in order to create somewhat 
complex data structures for import into the database, and that means a lot 
of jumping around from one part of the node tree to another. Thus, it seems 
as though I need a DOM parser to accomplish this. Scanning an entire 
document of this size to perform very specific event handling for each 
operation (using SAX) seems like it would be just as time consuming as 
having the entire node tree represented in memory. Please correct me if 
I'm wrong here.

On the plus side, I am running this process on a machine that seems to have 
enough RAM to represent the entire document and my code structures (arrays, 
etc.) without the need for virtual memory and heavy disk I/O. However, the 
process is VERY CPU intensive because of all of the sorting and lookups 
that occur for many of the operations. I'm going to see today if I can make 
those more efficient as well.
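
One idea I'm going to test is building a hash index over the records in a 
single pass, so that every later lookup is constant time instead of another 
scan. Roughly this (untested; using rdf:about as the key is just my guess 
at a usable identifier):

    use strict;
    use warnings;
    use XML::XPath;

    my $xp = XML::XPath->new(filename => 'resources.rdf');

    # one pass to index every record by its identifier...
    my %by_id;
    for my $node ($xp->findnodes('//rdf:Description')) {
        $by_id{ $node->getAttribute('rdf:about') } = $node;
    }

    # ...then each cross-reference becomes a hash lookup instead of
    # another full-document XPath scan
    my $target  = 'http://example.org/resource/42';   # invented
    my $related = $by_id{$target};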

Someone else has suggested that it might be a good idea to break the larger 
document into smaller parts and only work on those parts in a serial mode 
during processing. It was also suggested that XML::LibXML is an efficient 
tool because of its C library core (libxml2). And I've also now heard of 
hybrid parsers that combine the ease of use and flexibility of DOM with the 
efficiency of SAX (RelaxNGCC).

For those of you that haven't heard of these tools before, you might want 
to check out:

XML::Sablotron (similar to XML::LibXML)
XMLPull (http://www.xmlpull.org)
Piccolo Parser (http://piccolo.sourceforge.net)
RelaxNGCC (http://relaxngcc.sourceforge.net/en/index.htm)
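
To make the chunking idea concrete, the hybrid I have in mind would stream 
through the file, capture one record at a time as text, and hand each small 
fragment to XML::LibXML so that full XPath is available per record. A 
rough, untested sketch (it assumes each rdf:Description opens and closes on 
its own line, which is fragile, and any other prefixes used inside a record 
would need declaring on the wrapper too):

    use strict;
    use warnings;
    use XML::LibXML;

    my $parser = XML::LibXML->new;
    open my $fh, '<', 'resources.rdf' or die "open: $!";

    my $chunk = '';
    while (my $line = <$fh>) {
        $chunk .= $line if $chunk or $line =~ /<rdf:Description\b/;
        if ($chunk and $line =~ m{</rdf:Description>}) {
            # wrap the fragment so the rdf: prefix is declared, then
            # parse just this one record into a small DOM
            my $doc = $parser->parse_string(
                qq{<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n}
                . $chunk . q{</rdf:RDF>} );
            my ($record) = $doc->documentElement->findnodes('./rdf:Description');
            # full XPath works on $record, but only one record is ever
            # in memory at a time
            print $record->findvalue('@rdf:about'), "\n";
            $chunk = '';
        }
    }
    close $fh;
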
I get the impression that if I tried to use SAX parsing for a relatively 
complex RDF document, the programming load would be rather significant. 
But, if it speeds up processing by several orders of magnitude, then it 
would be worth it. I'm concerned, though, that I won't have the ability to 
crawl the document nodes using conditionals and revert to previous portions 
of the document that need further processing. What is your experience in 
this regard?

Thanks again for the responses. This is great.

Rob



At 11:07 AM 2/26/2004, Peter Corrigan wrote:
On 25 February 2004 20:31, Robert Fox wrote...
 1. Am I using the best XML processing module that I can for this sort of
 task?
If it must be faster, then it might be worth porting what you have to
work with XML::LibXML, which has all-round impressive benchmarks,
especially for DOM work.
Useful comparisons may be found at:
http://xmlbench.sourceforge.net/results/benchmark/index.html
Remember that the size of the internal representation used to manipulate
the XML data for DOM can be up to 5 times the original size, i.e. 270MB in
your case. Simply adding RAM or porting your existing code to another
machine might be enough to give you the speed-up you require.
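
The port can be fairly mechanical, since XML::LibXML speaks XPath too.
Something along these lines (an untested sketch; register whatever
prefixes your existing expressions use, and the file name is invented):

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXML::XPathContext;

    my $doc = XML::LibXML->new->parse_file('resources.rdf');

    my $xc = XML::LibXML::XPathContext->new($doc);
    $xc->registerNs(rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

    # findnodes/findvalue read much as they do in XML::XPath, but the
    # tree building and XPath evaluation happen in C (libxml2)
    for my $node ($xc->findnodes('//rdf:Description')) {
        my $about = $node->findvalue('@rdf:about');
        # ...same node hopping as before, just faster
    }
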
 3. What is the most efficient way to process through such a large
 document no matter what XML processor one uses?
SAX-type processing will be faster and use less memory. If you need
random access to any point of the tree after the document has been read,
you will need DOM, and hence you will need lots of memory.
If this is a one-off load, I guess you have to balance the cost of your
time recoding against the cost of waiting for the data to load using what
you have already. Machines usually work cheaper :-)
Best of luck

Peter Corrigan
Head of Library Systems
James Hardiman Library
NUI Galway
IRELAND
Tel: +353-91-524411 Ext 2497
Mobile: +353-87-2798505

XML Parsing for large XML documents

2004-02-25 Thread Robert Fox
I'm cross posting this question to perl4lib and xml4lib, hoping that 
someone will have a suggestion.

I've created a very large (~54MB) XML document in RDF format for the 
purpose of importing related records into a database. Not only does the RDF 
document contain many thousands of individual records for electronic 
resources (web resources), but it also contains all of the relationships 
between those resources encoded in such a way that the document itself 
represents a rather large database of these resources. The relationships 
are multi-tiered. I've also written a Perl script which can parse this 
large document and process through all of the XML data in order to import 
the data, along with all of the various relationships, into the database. 
The Perl script uses XML::XPath, and XML::XPath::XMLParser. I use these 
modules to find the appropriate document nodes as needed while the 
processing is going on and the database is being populated. The database is 
not a flat file: several data tables and linking tables are involved.
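
In outline, the script does something like the following (heavily
simplified, with invented property and file names), which may make the
shape of the problem clearer:

    use strict;
    use warnings;
    use XML::XPath;

    my $xp = XML::XPath->new(filename => 'resources.rdf');

    # every findnodes call walks the in-memory tree that
    # XML::XPath::XMLParser built from the whole 54MB file
    for my $rec ($xp->findnodes('//rdf:Description')) {
        my $about = $rec->getAttribute('rdf:about');

        # finding the records related to this one means another
        # full-document scan, once per record
        my @related = $xp->findnodes(
            qq{//rdf:Description[dc:relation/\@rdf:resource = '$about']});

        # ...insert $rec and its relationships into the data and
        # linking tables
    }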

I've run into a problem, though: my Perl script runs very slowly. I've done 
just about everything I can to optimize the script so that it is 
memory-efficient, and nothing seems to have significantly helped. 
Therefore, I have a couple of questions for the list(s):

1. Am I using the best XML processing module that I can for this sort of task?
2. Has anyone else processed documents of this size, and what have they used?
3. What is the most efficient way to process through such a large document 
no matter what XML processor one uses?

The processing on this is so amazingly slow that it is likely to take many 
hours if not days(!) to process through the bulk of records in this XML 
document. There must be a better way.

Any suggestions or help would be much appreciated,

Rob Fox

Robert Fox
Sr. Programmer/Analyst
University Libraries of Notre Dame
(574)631-3353
[EMAIL PROTECTED]


Re: XML Parsing for large XML documents

2004-02-25 Thread Ed Summers
Hi Rob:

On Wed, Feb 25, 2004 at 03:31:07PM -0500, Robert Fox wrote:
 1. Am I using the best XML processing module that I can for this sort of 
 task?

XPath expressions require building a document object model (DOM) of your XML 
file. Building a DOM for a huge file is extremely expensive, since it converts
your XML file into an in-memory tree structure where each element is a node.
Your system is probably digging into virtual memory (out to disk) to keep the 
monster in memory...which means slow. And you need to slurp the whole thing
in before any work can actually start.

When processing large XML files you'll want to use a stream based parser like 
XML::SAX. 

 2. Has anyone else processed documents of this size, and what have they 
 used?

Yep, I've used XML::SAX recently and XML::Parser back in the day. XML::Parser
use is deprecated now, but once upon a time it was cutting edge :)

 3. What is the most efficient way to process through such a large document 
 no matter what XML processor one uses?

Use a stream based parser instead of one that is DOM based. This applies in 
any language (Java, Python, etc...). There is a series of good articles on 
SAX parsing from Perl on xml.com [1]. The nice thing about SAX is that it is 
not Perl specific, so what you learn about SAX can be applied in lots of other
languages. SAX filters [2] are also incredibly useful. 
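
A filter is just a handler that forwards the events it receives to the next
handler in the chain, usually by subclassing XML::SAX::Base. A quick sketch
(assumes XML::SAX::Writer is installed to sit at the end of the chain; the
title uppercasing and the file name are arbitrary examples):

    use strict;
    use warnings;

    package UppercaseTitles;
    use base qw(XML::SAX::Base);

    # change the events you care about, forward everything with SUPER::
    sub start_element {
        my ($self, $el) = @_;
        $self->{in_title} = 1 if $el->{LocalName} eq 'title';
        $self->SUPER::start_element($el);
    }

    sub characters {
        my ($self, $chars) = @_;
        $chars->{Data} = uc $chars->{Data} if $self->{in_title};
        $self->SUPER::characters($chars);
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->{in_title} = 0 if $el->{LocalName} eq 'title';
        $self->SUPER::end_element($el);
    }

    package main;
    use XML::SAX::ParserFactory;
    use XML::SAX::Writer;

    # chain: parser -> UppercaseTitles -> writer
    my $writer = XML::SAX::Writer->new( Output => \*STDOUT );
    my $parser = XML::SAX::ParserFactory->parser(
        Handler => UppercaseTitles->new( Handler => $writer ) );
    $parser->parse_uri('resources.rdf');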

Good luck!

//Ed

[1] http://www.xml.com/pub/a/2001/02/14/perlsax.html
[2] http://www.xml.com/pub/a/2001/10/10/sax-filters.html