First of all, do not use getElementsByTagName. It traverses the whole tree and looks at ALL nodes. Think of it as a string compare on each tag in the document. It will take a long time on large documents. Use getFirstChild and getNextSibling instead. If you are SURE you have a valid document structure you can skip the tag name check altogether. Just get the first child of the document element. It should be a word element. Then just loop with getNextSibling until it returns NULL. (Don't forget to check the node type if you have whitespace between tags in the XML document.)
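As a rough illustration of the sibling walk described above, here is a minimal sketch. It does not use Xerces itself: Node, NodeType, Tree, append, buildSampleDictionary and collectEntries are hypothetical stand-ins made up for this example so that it compiles on its own. Only the accessor names getFirstChild, getNextSibling and getNodeType mirror the Xerces DOMNode methods; with the real library you would compare getNodeType() against DOMNode::ELEMENT_NODE instead of the stand-in enum.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node. This is NOT the Xerces DOMNode class;
// only the accessor names are borrowed from it.
enum class NodeType { Element, Text };

struct Node {
    NodeType type;
    std::string value;            // tag name for elements, content for text nodes
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;

    Node(NodeType t, std::string v) : type(t), value(std::move(v)) {}

    Node* getFirstChild() const { return firstChild; }
    Node* getNextSibling() const { return nextSibling; }
    NodeType getNodeType() const { return type; }
};

// Owns every node so the raw pointers above stay valid (assumption of this sketch).
class Tree {
public:
    Node* create(NodeType type, const std::string& value) {
        nodes_.push_back(std::unique_ptr<Node>(new Node(type, value)));
        return nodes_.back().get();
    }
private:
    std::vector<std::unique_ptr<Node>> nodes_;
};

// Append child as the last child of parent.
void append(Node* parent, Node* child) {
    assert(parent != nullptr && child != nullptr);
    if (parent->firstChild == nullptr) {
        parent->firstChild = child;
        return;
    }
    Node* last = parent->firstChild;
    while (last->nextSibling != nullptr) last = last->nextSibling;
    last->nextSibling = child;
}

// Build <dictionary><word><name>whatever</name><def> 1 </def><def> 2 </def></word></dictionary>
// with whitespace text nodes between the tags, as a parser would produce.
Node* buildSampleDictionary(Tree& tree) {
    Node* root = tree.create(NodeType::Element, "dictionary");
    Node* word = tree.create(NodeType::Element, "word");
    append(root, tree.create(NodeType::Text, "\n"));
    append(root, word);
    append(root, tree.create(NodeType::Text, "\n"));

    const char* fields[3][2] = {{"name", "whatever"}, {"def", " 1 "}, {"def", " 2 "}};
    for (const auto& f : fields) {
        Node* field = tree.create(NodeType::Element, f[0]);
        append(field, tree.create(NodeType::Text, f[1]));
        append(word, tree.create(NodeType::Text, "\n"));
        append(word, field);
    }
    append(word, tree.create(NodeType::Text, "\n"));
    return root;
}

// Walk word elements and their fields with first-child / next-sibling only.
// No tag name compare and no node list; the node type check skips whitespace.
std::vector<std::string> collectEntries(const Node* root) {
    std::vector<std::string> out;
    for (const Node* word = root->getFirstChild(); word != nullptr;
         word = word->getNextSibling()) {
        if (word->getNodeType() != NodeType::Element) continue;
        for (const Node* field = word->getFirstChild(); field != nullptr;
             field = field->getNextSibling()) {
            if (field->getNodeType() != NodeType::Element) continue;
            const Node* text = field->getFirstChild();
            if (text != nullptr && text->getNodeType() == NodeType::Text) {
                out.push_back(text->value);
            }
        }
    }
    return out;
}
```

Each node is visited exactly once, so the cost grows linearly with the number of nodes. Calling collectEntries on the tree from buildSampleDictionary returns the three strings "whatever", " 1 " and " 2 ".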
The rest of the code looks all right to me. I wonder if creation of the NodeLists takes time as well. The node list must be allocated and populated each time you call the method, if I'm not mistaken. Try using the getFirstChild and getNextSibling methods there as well. They do not require a node list.

/ Erik

> -----Original Message-----
> From: Nath [mailto:[EMAIL PROTECTED]
> Sent: 26 May 2004 16:22
> To: [EMAIL PROTECTED]
> Subject: Re: Repost: Xerces XML performance problems
>
> Sure thing,
>
> // Member variables
> XercesDOMParser *cXMLParser;
> XERCES_CPP_NAMESPACE_QUALIFIER DOMDocument *cXMLDoc;
> DOMNodeList *cXMLNodeList,
>             *cChildNodeList;
> DOMNode *cXMLNode;
> DOMNamedNodeMap *cXMLNamedNode;
>
> // Initialization
> XMLPlatformUtils::Initialize();
> cXMLParser = new XercesDOMParser();
> cXMLParser->setValidationScheme(XercesDOMParser::Val_Never);
> cXMLParser->setLoadExternalDTD(false);
>
> // Main code
> cXMLParser->parse(filename);
> cXMLDoc = cXMLParser->getDocument();
>
> // Get word nodes
> cXMLNodeList = cXMLDoc->getElementsByTagName(XMLString::transcode("word"));
>
> // Loop through all word nodes
> for (int i = 0; i < cXMLNodeList->getLength(); i++)
> {
>     // Obtain list of child nodes
>     cChildNodeList = cXMLNodeList->item(i)->getChildNodes();
>
>     // Loop through all child nodes
>     for (int j = 0; j < cChildNodeList->getLength(); j++)
>     {
>         strcpy(name, XMLString::transcode(cChildNodeList->item(j)->getTextContent()));
>         // . . . . definitions and whatnot are also copied here
>     }
> }
>
> ----- Original Message -----
> From: "Erik Rydgren" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; "'Nath'" <[EMAIL PROTECTED]>
> Sent: Wednesday, May 26, 2004 2:29 AM
> Subject: RE: Repost: Xerces XML performance problems
>
> > Can you please provide a snippet of your DOM tree traversing code. It is
> > hard to see what the problem is if we do not know what you are doing.
> >
> > Regards
> > Erik
> >
> > > -----Original Message-----
> > > From: Nath [mailto:[EMAIL PROTECTED]
> > > Sent: 25 May 2004 19:07
> > > To: [EMAIL PROTECTED]
> > > Subject: Repost: Xerces XML performance problems
> > >
> > > I had a mix-up in mailing lists, so I'm reposting my question here
> > > (with some amendments to make it clearer) for any assistance.
> > >
> > > I converted over a dictionary of words and definitions into XML files
> > > (one file per letter of the alphabet), each weighing around 1-5 megs
> > > (I chose XML for storage and extensibility reasons). I'm trying to
> > > access node information from these files and it's taking an incredible
> > > amount of time to do it. When acquiring node information from small
> > > files (letters X, Y, and Z - a total of 815 words or 151 KB) the DOM
> > > document returns results somewhat quickly and I can process the entire
> > > tree in less than 2 seconds. When parsing the letter A file (11,000
> > > some words or 1.58 megs), it takes 5 seconds just to process 20 word
> > > nodes (see below for a typical word node). It seems the larger the XML
> > > file (ie: the more nodes within), the longer it takes to process all
> > > the nodes. Granted there's obviously going to be more time involved,
> > > but between the 2 files I've tested, there doesn't seem to be a linear
> > > process-time relationship. Can anyone suggest why this is happening
> > > and how I can fix it? I've used xerces c++ 2.4.0 and recently upgraded
> > > to xerces c++ 2.5.0.
> > >
> > > I'm just following the standard XML start-up and DOM parsing procedure
> > > - Initialize platform utils
> > > - Don't validate files
> > > - parse and assign DOM document (fast)
> > > - go through each child node and collect data (slow)
> > >
> > > The dictionary format is simply:
> > >
> > > <dictionary>
> > >   <word>
> > >     <name>whatever</name>
> > >     <def> 1 </def>
> > >     <def> 2 </def>
> > >
> > >   </word>
> > >
> > > </dictionary>
> > >
> > > I have a 1600MHz processor, so handling a few meg files should be
> > > fairly quick. I've also tried parsing the file with SAX, albeit the
> > > performance is a tad better, the end result is still a lengthy wait.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
