RE: Repost: Xerces XML performance problems

Erik Rydgren Wed, 26 May 2004 07:46:02 -0700

First of all, do not use getElementsByTagName. It traverses and looks at
ALL nodes in the tree. Think of it as a string compare on each tag in
the document, which takes a long time on large documents. Use
getFirstChild and getNextSibling instead. If you are SURE you have a
valid document structure you can skip the tag name check altogether.
Just get the first child of the document element. It should be a word
element. Then loop with getNextSibling until it returns NULL. (Don't
forget to check the node type if you have whitespace between tags in the
XML document.)
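
A minimal sketch of that loop follows. To keep it compilable without
the library, it uses a hypothetical Node struct as a stand-in for the
Xerces DOMNode (Node, appendChild and elementChildNames are names made
up for this illustration). With Xerces itself the loop should have the
same shape, using DOMNode::getFirstChild(), DOMNode::getNextSibling()
and DOMNode::getNodeType().

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node (NOT the Xerces class). The accessor
// names mirror DOMNode so the loop below reads like the real traversal.
struct Node {
    enum Type { ELEMENT_NODE = 1, TEXT_NODE = 3 };

    Type type;
    std::string text;            // tag name for elements, character data for text
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;

    Node(Type t, std::string s) : type(t), text(std::move(s)) {}

    Node* getFirstChild() const { return firstChild; }
    Node* getNextSibling() const { return nextSibling; }
    Type getNodeType() const { return type; }
};

// Helper for building a small tree by hand: links child as the last child
// of parent. The child must not already be part of a sibling chain.
void appendChild(Node& parent, Node& child) {
    assert(child.nextSibling == nullptr);
    if (parent.firstChild == nullptr) {
        parent.firstChild = &child;
        return;
    }
    Node* last = parent.firstChild;
    while (last->nextSibling != nullptr) {
        last = last->nextSibling;
    }
    last->nextSibling = &child;
}

// Visits each direct child of root exactly once by following sibling links.
// Whitespace between tags shows up as text nodes, so non-elements are skipped.
std::vector<std::string> elementChildNames(const Node& root) {
    std::vector<std::string> names;
    for (const Node* n = root.getFirstChild(); n != nullptr;
         n = n->getNextSibling()) {
        if (n->getNodeType() != Node::ELEMENT_NODE) {
            continue;
        }
        names.push_back(n->text);
    }
    return names;
}
```

Each child is touched once, so the cost grows linearly with the number
of children, and no node list has to be built along the way.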


The rest of the code looks alright to me. I wonder if creation of the
node lists takes time as well. If I'm not mistaken, the node list must
be allocated and populated each time you call the method. Try using
getFirstChild and getNextSibling there as well. They do not require a
node list.

/ Erik

> -----Original Message-----
> From: Nath [mailto:[EMAIL PROTECTED]
> Sent: den 26 maj 2004 16:22
> To: [EMAIL PROTECTED]
> Subject: Re: Repost: Xerces XML performance problems
> 
> Sure thing,
> 
> // Member variables
> XercesDOMParser *cXMLParser;
> XERCES_CPP_NAMESPACE_QUALIFIER DOMDocument *cXMLDoc;
> DOMNodeList *cXMLNodeList,
>             *cChildNodeList;
> DOMNode *cXMLNode;
> DOMNamedNodeMap *cXMLNamedNode;
> 
> // Initialization
> XMLPlatformUtils::Initialize();
> cXMLParser = new XercesDOMParser();
> cXMLParser->setValidationScheme(XercesDOMParser::Val_Never);
> cXMLParser->setLoadExternalDTD(false);
> 
> // Main code
> cXMLParser->parse(filename);
> cXMLDoc = cXMLParser->getDocument();
> 
> // Get word nodes
> cXMLNodeList =
>    cXMLDoc->getElementsByTagName(XMLString::transcode("word"));
> 
> // Loop through all word nodes
> for (int i = 0; i < cXMLNodeList->getLength(); i++)
> {
>    // Obtain list of child nodes
>    cChildNodeList = cXMLNodeList->item(i)->getChildNodes();
> 
>    // Loop through all child nodes
>    for (int j = 0; j < cChildNodeList->getLength(); j++)
>    {
>       strcpy(name,
>          XMLString::transcode(cChildNodeList->item(j)->getTextContent()));
> 
>       // . . . . definitions and whatnot are also copied here
>    }
> }
> 
> 
> ----- Original Message -----
> From: "Erik Rydgren" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; "'Nath'" <[EMAIL PROTECTED]>
> Sent: Wednesday, May 26, 2004 2:29 AM
> Subject: RE: Repost: Xerces XML performance problems
> 
> 
> > Can you please provide a snippet of your DOM tree traversing code.
> > It is hard to see what the problem is if we do not know what you are
> > doing.
> >
> > Regards
> > Erik
> >
> > > -----Original Message-----
> > > From: Nath [mailto:[EMAIL PROTECTED]
> > > Sent: den 25 maj 2004 19:07
> > > To: [EMAIL PROTECTED]
> > > Subject: Repost: Xerces XML performance problems
> > >
> > > I had a mix-up in mailing lists, so I'm reposting my question here
> > > (with some amendments to make it clearer) for any assistance.
> > >
> > > I converted over a dictionary of words and definitions into XML
> > > files (one file per letter of the alphabet), each weighing around
> > > 1-5 megs (I chose XML for storage and extensibility reasons). I'm
> > > trying to access node information from these files and it's taking
> > > an incredible amount of time to do it. When acquiring node
> > > information from small files (letters X, Y, and Z - a total of 815
> > > words or 151 KB) the DOM document returns results somewhat quickly
> > > and I can process the entire tree in less than 2 seconds. When
> > > parsing the letter A file (11,000 some words or 1.58 megs), it takes
> > > 5 seconds just to process 20 word nodes (see below for a typical
> > > word node). It seems the larger the XML file (ie: the more nodes
> > > within), the longer it takes to process all the nodes. Granted
> > > there's obviously going to be more time involved, but between the 2
> > > files I've tested, there doesn't seem to be a linear process-time
> > > relationship. Can anyone suggest why this is happening and how I can
> > > fix it? I've used xerces c++ 2.4.0 and recently upgraded to
> > > xerces c++ 2.5.0.
> > >
> > >
> > > I'm just following the standard XML start-up and DOM parsing
> > > procedure:
> > > - Initialize platform utils
> > > - Don't validate files
> > > - parse and assign DOM document (fast)
> > > - go through each child node and collect data (slow)
> > >
> > >
> > >
> > > The dictionary format is simply:
> > >
> > > <dictionary>
> > >
> > > <word>
> > >
> > > <name>whatever</name>
> > >
> > > <def> 1 </def>
> > >
> > > <def> 2 </def>
> > >
> > >
> > >
> > > </word>
> > >
> > >
> > >
> > > </dictionary>
> > >
> > > I have a 1600MHz processor, so handling a few meg files should be
> > > fairly quick. I've also tried parsing the file with SAX, and
> > > although the performance is a tad better, the end result is still a
> > > lengthy wait.
> > >
> > >
---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> 