Re: [Performance] Streaming main memory indexing of single strings

2005-05-03 Thread Erik Hatcher
Applied!! Erik On May 3, 2005, at 1:31 PM, Wolfgang Hoschek wrote: Here's a performance patch for MemoryIndex.MemoryIndexReader that caches the norms for a given field, avoiding repeated recomputation of the norms. Recall that, depending on the query, norms() can be called over and over a

Re: [Performance] Streaming main memory indexing of single strings

2005-05-03 Thread Wolfgang Hoschek
Here's a performance patch for MemoryIndex.MemoryIndexReader that caches the norms for a given field, avoiding repeated recomputation of the norms. Recall that, depending on the query, norms() can be called over and over again with mostly the same parameters. Thus, replace public byte[] norms(S
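The caching idea described in this message can be sketched roughly as follows. This is not the patch itself (the snippet above is cut off before the code appears). The class name `NormsCacheSketch`, the `computeNorms` helper, and the single-entry cache are assumptions made for illustration in plain Java; none of them are Lucene API.

```java
// Minimal sketch of caching norms per field, assuming a single-entry cache.
// Not the MemoryIndex patch; all names here are hypothetical.
class NormsCacheSketch {
    private String cachedField;   // field whose norms were computed last
    private byte[] cachedNorms;   // norms for cachedField, or null if none yet
    public int computations = 0;  // counts recomputations, for demonstration only

    // Hypothetical stand-in for the real (expensive) norm computation.
    private byte[] computeNorms(String fieldName) {
        computations++;
        return new byte[] { (byte) fieldName.length() };
    }

    // Returns the cached array when called again with the same field name.
    public byte[] norms(String fieldName) {
        if (cachedNorms == null || !fieldName.equals(cachedField)) {
            cachedNorms = computeNorms(fieldName);
            cachedField = fieldName;
        }
        return cachedNorms;
    }
}
```

A single-entry cache is enough when, as the message says, norms() is called repeatedly with mostly the same parameters; a map keyed by field name would be the obvious alternative if several fields alternate.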

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
Thanks! Wolfgang. I've committed this change after it successfully worked for me. Thanks! Erik

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Erik Hatcher
On May 2, 2005, at 5:21 PM, Wolfgang Hoschek wrote: Finally found and fixed the bug! The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo() with the following: public boolean skipTo(int target) { if (DEBUG) System.err.println(".skipTo: " + targe

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
The version I sent returns in O(1), if performance was your concern. Or did you mean something else? Since 0 is the only document number in the index, a return target == 0; might be nice for skipTo(). It doesn't really help performance, though, and the next() works just as well. Regards, Paul Elsc
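The point made here, that document 0 is the only document in the index, can be sketched as a tiny enumerator. This is a hedged illustration, not the code from the thread: the class `SingleDocEnumSketch` is hypothetical, and it assumes the contract that skipTo() must advance to a document not yet returned whose number is at least `target`.

```java
// Minimal sketch of next()/skipTo() over an index holding only document 0.
// Hypothetical class; it does not extend or use any Lucene type.
class SingleDocEnumSketch {
    private boolean consumed = false; // true once doc 0 has been returned

    // Advance to the next document; only doc 0 exists.
    boolean next() {
        if (consumed) return false;
        consumed = true;
        return true;
    }

    // Skip to the first not-yet-returned document with number >= target.
    boolean skipTo(int target) {
        if (target > 0) {      // no document beyond 0 exists
            consumed = true;
            return false;
        }
        return next();         // target <= 0: same as a plain next()
    }

    int doc() { return 0; }
}
```

Both methods run in constant time, which matches the remark that a special case in skipTo() would not really change performance.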

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Paul Elschot
On Monday 02 May 2005 23:38, Wolfgang Hoschek wrote: > > Yes, the svn trunk uses skipTo more often than 1.4.3. > > > > However, your implementation of skipTo() needs some improvement. > > See the javadoc of skipTo of class Scorer: > > > > http://lucene.apache.org/java/docs/api/org/apache/lucene/sea

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
Yes, the svn trunk uses skipTo more often than 1.4.3. However, your implementation of skipTo() needs some improvement. See the javadoc of skipTo of class Scorer: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int) What's wrong with the version I sent? Remember t

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Paul Elschot
Wolfgang, On Monday 02 May 2005 23:21, Wolfgang Hoschek wrote: > Finally found and fixed the bug! > The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo() > with the following: > > public boolean skipTo(int target) { >

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
Finally found and fixed the bug! The fix is simply to replace MemoryIndex.MemoryIndexReader skipTo() with the following: public boolean skipTo(int target) { if (DEBUG) System.err.println(".skipTo: " + target);

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
This is what I have as scoring calculation, and it seems to do exactly what lucene-1.4.3 does because the tests pass. public byte[] norms(String fieldName) { if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName); Info info = getInfo(fieldName); int numTokens = info

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang

Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Erik Hatcher
On May 1, 2005, at 10:20 PM, Wolfgang Hoschek wrote: I've uploaded code that now runs against the current SVN, plus junit test cases, plus some minor internal updates to the functionality itself. For details see http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 Be prepared for the test

Re: [Performance] Streaming main memory indexing of single strings

2005-05-01 Thread Wolfgang Hoschek
I've uploaded code that now runs against the current SVN, plus junit test cases, plus some minor internal updates to the functionality itself. For details see http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 Be prepared for the testcases to take some minutes to complete - don't hit CTRL

Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Wolfgang Hoschek
OK. I'll send an update as soon as I get round to it... Wolfgang. On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote: Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Ok... once Wolfgang gives me

Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Erik Hatcher
On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote: Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Ok... once Wolfgang gives me one last round of updates (JUnit tests instead of main() and upgr

Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Wolfgang Hoschek
Whichever place you settle on is fine with me. [In case it might make a difference: Just note that MemoryIndex has a small auxiliary dependency on PatternAnalyzer in addField() because the Analyzer superclass doesn't have a tokenStream(String fieldName, String text) method. And PatternAnalyzer r

Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Doug Cutting
Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/

Re: [Performance] Streaming main memory indexing of single strings

2005-04-26 Thread Erik Hatcher
Wolfgang, You have provided a superb set of patches! I'm in awe of the extensive documentation you've done. There is nothing further you need to do, but be patient while we incorporate it into the contrib area somewhere. Your PatternAnalyzer could fit into the contrib/analyzers area nicely

Re: [Performance] Streaming main memory indexing of single strings

2005-04-26 Thread Wolfgang Hoschek
I've uploaded slightly improved versions of the fast MemoryIndex contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with another contrib - PatternAnalyzer. For a quick overview without downloading code, there's javadoc for it all at http://dsd.lbl.gov/nux/api/o

Re: [Performance] Streaming main memory indexing of single strings

2005-04-22 Thread Wolfgang Hoschek
I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id

Re: [Performance] Streaming main memory indexing of single strings

2005-04-20 Thread Wolfgang Hoschek
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote: On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of th

Re: [Performance] Streaming main memory indexing of single strings

2005-04-20 Thread Erik Hatcher
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary te

Re: [Performance] Streaming main memory indexing of single strings

2005-04-20 Thread Wolfgang Hoschek
eveloper to debug and find the reason in the first place!) Luc -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, April 16, 2005 2:09 AM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings On Apr 15, 2005,

RE: [Performance] Streaming main memory indexing of single strings

2005-04-20 Thread Vanlerberghe, Luc
or the developer to debug and find the reason in the first place!) Luc -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, April 16, 2005 2:09 AM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings On

Re: [Performance] Streaming main memory indexing of single strings

2005-04-17 Thread Wolfgang Hoschek
On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote: Note that "fish*~" is not a valid query expression :) Perhaps the Lucene QueryParser should throw an exception then. Currently 1.4.3 accepts the expression as is without grumbling... Several minor QueryParser weirdnesses like this have turned u

Re: [Performance] Streaming main memory indexing of single strings

2005-04-16 Thread Erik Hatcher
On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote: Note that "fish*~" is not a valid query expression :) Perhaps the Lucene QueryParser should throw an exception then. Currently 1.4.3 accepts the expression as is without grumbling... Several minor QueryParser weirdnesses like this have turned up

Re: [Performance] Streaming main memory indexing of single strings

2005-04-16 Thread Wolfgang Hoschek
On Apr 16, 2005, at 2:58 AM, Erik Hatcher wrote: On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote: So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct? Right, it has no

Re: [Performance] Streaming main memory indexing of single strings

2005-04-16 Thread Erik Hatcher
On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote: So, all the text analyzed is in a given field... that means that anything in the Query not associated with that field has no bearing on whether the text matches or not, correct? Right, it has no bearing. A query wouldn't specify any fields, it

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
cument arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Erik Hatcher
system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
tuations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subje

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Erik Hatcher
s? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
nd the core would be moved into AbstractIndexReader so projects like this would be much easier). Robert -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Friday, April 15, 2005 5:58 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexi

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
On Apr 15, 2005, at 4:15 PM, Doug Cutting wrote: Wolfgang Hoschek wrote: The classic fuzzy fulltext search and similarity matching that Lucene is good for :-) So you need a score that can be compared to other matches? This will be based on nothing but term frequency, which a regex can compute.

RE: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Robert Engels
@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Wolfgang Hoschek wrote: > The classic fuzzy fulltext search and similarity matching that Lucene is > good for :-) So you need a score that can be compared to other matches? This will be based on nothi

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Doug Cutting
Wolfgang Hoschek wrote: The classic fuzzy fulltext search and similarity matching that Lucene is good for :-) So you need a score that can be compared to other matches? This will be based on nothing but term frequency, which a regex can compute. With a single document there'll be no IDFs, so y

RE: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Robert Engels
xReader so projects like this would be much easier). Robert -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Friday, April 15, 2005 5:58 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings A primary reason for the

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
On Apr 15, 2005, at 4:00 PM, Doug Cutting wrote: Erik Hatcher wrote: I think something like this would make a handy addition to our contrib area at least. Perhaps. What use cases cannot be met by regular expression matching? Doug The classic fuzzy fulltext search and similarity matching that Lucen

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Doug Cutting
Erik Hatcher wrote: I think something like this would make a handy addition to our contrib area at least. Perhaps. What use cases cannot be met by regular expression matching? Doug

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gut feeling is th

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Wolfgang Hoschek
Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a promising avenue worth exploring. My gut feeling is that thi

Re: [Performance] Streaming main memory indexing of single strings

2005-04-15 Thread Erik Hatcher
document arrives matching the saved queries? Erik -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings This seems to be a pr

RE: [Performance] Streaming main memory indexing of single strings

2005-04-14 Thread Robert Engels
ement termDocs() and termPositions() to use the structures from > above. > > run searches. > > start again with next document. > > > > -Original Message- > From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] > Sent: Thursday, April 14, 2005 2:56 PM > To: java

Re: [Performance] Streaming main memory indexing of single strings

2005-04-14 Thread Wolfgang Hoschek
earches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings Otis, this might be a misunderstandi

RE: [Performance] Streaming main memory indexing of single strings

2005-04-14 Thread Robert Engels
to use the structures from above. run searches. start again with next document. -Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single stri

Re: [Performance] Streaming main memory indexing of single strings

2005-04-14 Thread Wolfgang Hoschek
Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out if you look again at the code. - The *streaming* use case requires that for each query I add one (and only one) document (aka string) to an empty index: repeat N times (where N is millions or billio
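The streaming pattern described here (one string in, one throwaway index, one query, repeat) can be illustrated with a toy loop. This is only a sketch under stated assumptions: `StreamingMatchSketch`, its whitespace/punctuation tokenizer, and the plain term-frequency map are hypothetical stand-ins, not Lucene classes and not the MemoryIndex code discussed in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of "index one string, query it, discard, repeat".
// All names are hypothetical; no Lucene API is used.
class StreamingMatchSketch {
    // Build a throwaway term -> frequency map for a single string.
    static Map<String, Integer> index(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) freqs.merge(tok, 1, Integer::sum);
        }
        return freqs;
    }

    // Count how many strings in the stream contain the given term.
    static int countMatches(Iterable<String> stream, String term) {
        int hits = 0;
        for (String text : stream) {
            Map<String, Integer> idx = index(text); // fresh index per string
            if (idx.containsKey(term)) hits++;
        }
        return hits;
    }
}
```

Because each index holds exactly one short string and is discarded immediately, the per-iteration setup cost dominates, which is why keeping a single long-lived writer open does not address this use case.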

Re: [Performance] Streaming main memory indexing of single strings

2005-04-13 Thread Otis Gospodnetic
It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents t

[Performance] Streaming main memory indexing of single strings

2005-04-13 Thread Wolfgang Hoschek
Hi, I'm wondering if anyone could let me know how to improve Lucene performance for "streaming main memory indexing of single strings". This would help to effectively integrate Lucene with the Nux XQuery engine. Below is a small microbenchmark simulating STREAMING XQuery fulltext search as typ
