Re: Performance question

2004-01-08 Thread Andrzej Bialecki
Dror Matalon wrote:

On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
 

After two rather frustrating days, I find I need to apologize to Lucene.  My
last run of 225 messages averaged around 25 milliseconds per message. That
covers parsing the XML, creating the Document, and putting it in the index
(2.5 GHz CPU, 1 GB RAM).  The performance problem turned out to be the Xerces
SAX parser "helping" me by loading the DTD before it parsed each message, and
the DTD wasn't local to our site.  After seeing Terry's response, I knew there
had to be more going on than what I was assuming.
Thanks for the suggestions.  I wonder how much faster I can go if I
implement some of those?
   

25 msecs to insert a document is on the high side, but it depends of
course on the size of your document. You're probably spending 90% of
your time in the XML parsing. I believe that there are other parsers
that are faster than xerces, you might want to look at these. You might
want to look at http://dom4j.org/.
Dror

 

You may want to check the XML Pull Parser - it offers something between 
SAX and DOM, with performance similar to SAX. 
(http://www.extreme.indiana.edu/xgws/xsoap/xpp)
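
For readers who want to see what the pull style looks like, here is a minimal
sketch. It does not use XPP itself: it swaps in StAX (javax.xml.stream), the
pull API bundled with current JDKs, purely as a stand-in. The class name
PullParseSketch and the helper textOf are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: pull-style extraction of element text using StAX (not XPP). */
final class PullParseSketch {

    /** Returns the text of every element named tag; assumes those elements hold text only. */
    static List<String> textOf(String xml, String tag) {
        List<String> values = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // The caller asks for the next event; nothing is pushed at us via callbacks.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(tag)) {
                    // Reads up to the matching end tag and returns the text in between.
                    values.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<msgs><subject>one</subject><body>x</body><subject>two</subject></msgs>";
        System.out.println(textOf(xml, "subject"));
    }
}
```

The loop reads like ordinary sequential code, which is the main attraction
over SAX callbacks; the document is still streamed rather than held in memory.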

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Performance question

2004-01-08 Thread Tatu Saloranta
On Wednesday 07 January 2004 20:48, Dror Matalon wrote:
 On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
...
  Thanks for the suggestions.  I wonder how much faster I can go if I
  implement some of those?

 25 msecs to insert a document is on the high side, but it depends of
 course on the size of your document. You're probably spending 90% of
 your time in the XML parsing. I believe that there are other parsers
 that are faster than xerces, you might want to look at these. You might
 want to look at http://dom4j.org/.

I think the more significant question is not which DOM or other full-document 
in-memory parser to use, but whether to use a streaming (usually event-based) 
parser, such as one based on SAX. These are generally an order of magnitude 
faster, at least for bigger documents. Fortunately, many standard XML parsers 
can work as both DOM and SAX parsers (I believe Xerces does, at least).

It's a bit more cumbersome to use event-based parsers (push vs. pull; you need 
to explicitly keep track of the current subtree if parent tag order matters), 
but from a performance perspective (memory usage, speed) it may be worth it.
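
As a rough illustration of the bookkeeping mentioned above, the sketch below
keeps a stack of open elements inside a SAX handler so that each callback
knows its parent chain. It uses only the JAXP/SAX classes that ship with the
JDK; the class name PathTrackingHandler and the helper pathsOf are
hypothetical and not from any poster's code.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records the full path of every element it sees. */
final class PathTrackingHandler extends DefaultHandler {
    // Open elements, root first; SAX itself does not remember where we are.
    private final Deque<String> openElements = new ArrayDeque<>();
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.addLast(qName);
        paths.add("/" + String.join("/", openElements));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.removeLast();
    }

    /** Parses the XML string and returns element paths in document order. */
    static List<String> pathsOf(String xml) {
        PathTrackingHandler handler = new PathTrackingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.paths;
    }

    public static void main(String[] args) {
        System.out.println(pathsOf(
                "<message><subject>hi</subject><body><p>text</p></body></message>"));
        // prints [/message, /message/subject, /message/body, /message/body/p]
    }
}
```

A DOM parser would give this parent information for free, at the cost of
building the whole tree first; here the only state kept is the stack.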

-+ Tatu +-





RE: Performance question

2004-01-08 Thread Scott Smith
The parsing I do currently is pretty straightforward.  There are only four
tags I look for (and one of those tags typically encompasses most of the
file).  SAX works great, though I'm not tied to Xerces.  In the short run,
25 milliseconds is quite acceptable (where, for obvious reasons, the
1.2 seconds was not).  In the long run, it sounds like I need to look at
some options besides Xerces.
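
For anyone hitting the same remote-DTD delay described at the start of this
thread, one common workaround is to install an EntityResolver so the parser
never goes to the network. The sketch below is an assumption about how that
could look, not the fix actually applied here: it answers every external
entity request with an empty stream, which only works if the documents do not
rely on entities or defaults declared in the DTD. The class name
DtdShortCircuit and the example.invalid URL are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: parse without fetching the external DTD. */
final class DtdShortCircuit {

    /** Parses xml and returns the system IDs the parser asked to load. */
    static List<String> parseOffline(String xml) {
        List<String> requested = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Answer every external-entity request locally with an empty document.
            reader.setEntityResolver((publicId, systemId) -> {
                requested.add(systemId);
                return new InputSource(new StringReader(""));
            });
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return requested;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE message SYSTEM \"http://example.invalid/dtd/message.dtd\">"
                + "<message><body>hello</body></message>";
        System.out.println("parser asked for: " + parseOffline(xml));
    }
}
```

If the DTD content is actually needed, the resolver can return an InputSource
that reads a local copy instead of an empty one. Xerces also has a feature,
http://apache.org/xml/features/nonvalidating/load-external-dtd, that can be
set to false on a non-validating parser to skip the external DTD entirely.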

Another thing I noticed while doing this is that the Xerces SAX interface
tends to pass small blocks of characters (typically around 50 characters) on
each characters() callback, even when there are several thousand bytes of
character data in the tag.  Currently, I add each block of characters to the
Document separately.  This means I often end up with 100 or more items on the
Document linked list for the same field.  When I get some time, I would like
to see if things work faster if I accumulate these into a StringBuffer and
pass them to the Document as one large block instead of a lot of little
blocks.
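
A minimal sketch of that accumulation idea, assuming a modern JDK: it uses
StringBuilder (the unsynchronized sibling of StringBuffer) and collects one
string per wanted tag into a Map rather than calling Lucene, so adding the
finished value to the Document is left to the caller. The class name
FieldAccumulator and the helper extract are hypothetical, and the sketch
assumes the wanted tags are not nested inside one another.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: joins all characters() chunks of a tag into one string. */
final class FieldAccumulator extends DefaultHandler {
    private final Set<String> wantedTags;
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentTag; // null while outside any wanted tag

    FieldAccumulator(Set<String> wantedTags) {
        this.wantedTags = wantedTags;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (currentTag == null && wantedTags.contains(qName)) {
            currentTag = qName;
            text.setLength(0); // reuse the same buffer for each field
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called many times per element; just append each chunk.
        if (currentTag != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(currentTag)) {
            fields.put(currentTag, text.toString()); // one value per field
            currentTag = null;
        }
    }

    /** Parses xml and returns the accumulated text of each wanted tag. */
    static Map<String, String> extract(String xml, Set<String> wantedTags) {
        FieldAccumulator handler = new FieldAccumulator(wantedTags);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.fields;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("subject", "body"));
        System.out.println(extract(
                "<msg><subject>a &amp; b</subject><body>hello world</body></msg>", wanted));
    }
}
```

Whether this is measurably faster than adding each chunk separately would
still need to be timed against real messages.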

Thanks for all of the suggestions.

Scott
