Re: Token termBuffer issues

2007-07-30 Thread Michael McCandless
"Michael McCandless" <[EMAIL PROTECTED]> wrote: > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > > "Yonik Seeley" <[EMA

Re: Token termBuffer issues

2007-07-26 Thread Yonik Seeley
On 7/26/07, Steven Parkes <[EMAIL PROTECTED]> wrote: > First I create a single large file that has one doc per line > from > Wikipedia content, using this alg > > Anybody disagree that the 1-line-per-doc format is better (at least for > Wikipedia)? If so, I'll get rid of the interme

RE: Token termBuffer issues

2007-07-26 Thread Steven Parkes
First I create a single large file that has one doc per line from Wikipedia content, using this alg Anybody disagree that the 1-line-per-doc format is better (at least for Wikipedia)? If so, I'll get rid of the intermediate one-file-per-doc step. --

Re: Token termBuffer issues

2007-07-25 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > > > On 7/24/07, Michael McCandl

Re: Token termBuffer issues

2007-07-25 Thread Yonik Seeley
On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote: "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > OK, I ran some benc

Re: Token termBuffer issues

2007-07-25 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > OK, I ran some benchmarks here. > > > > > > > > The performance gains are sizab

Re: Token termBuffer issues

2007-07-24 Thread Yonik Seeley
On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > OK, I ran some benchmarks here. > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and > > 17.2% speedup usin

Re: Token termBuffer issues

2007-07-24 Thread Michael McCandless
"Doron Cohen" <[EMAIL PROTECTED]> wrote: > "Michael McCandless" wrote: > > > boolean next(Token resToken) > > > > which returns true if it has updated resToken with another token, > > else false if end-of-stream was hit. > > I would actually prefer > Token next(Token resToken) > because: >

Re: Token termBuffer issues

2007-07-24 Thread Doron Cohen
"Michael McCandless" wrote: > boolean next(Token resToken) > > which returns true if it has updated resToken with another token, > else false if end-of-stream was hit. I would actually prefer Token next(Token resToken) because: - this way the API with reuse is very much like the one without
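To make the two signatures under discussion concrete, here is a minimal sketch of how a consumer loop might look with each. It is not Lucene code: `SimpleToken`, `Terms`, `nextInto`, and the `join*` helpers are hypothetical stand-ins invented for illustration, and only the two method shapes (`boolean` return versus `Token` return) come from the messages above.

```java
class NextApiShapes {
    // Hypothetical stand-in for a token; only the term text is modeled.
    static final class SimpleToken {
        String text;
    }

    // Hypothetical token source offering both proposed method shapes.
    static final class Terms {
        private final String[] terms;
        private int pos;

        Terms(String... terms) {
            this.terms = terms;
        }

        // Shape 1: fill the caller's token, return false at end-of-stream.
        boolean nextInto(SimpleToken res) {
            if (pos == terms.length) {
                return false;
            }
            res.text = terms[pos++];
            return true;
        }

        // Shape 2: fill the caller's token and return it, or null at end-of-stream.
        SimpleToken next(SimpleToken res) {
            return nextInto(res) ? res : null;
        }
    }

    // Consumer loop for the boolean-returning shape.
    static String joinBoolean(Terms in) {
        StringBuilder sb = new StringBuilder();
        SimpleToken t = new SimpleToken();
        while (in.nextInto(t)) {
            sb.append(t.text).append(' ');
        }
        return sb.toString().trim();
    }

    // Consumer loop for the Token-returning shape; it reads like a loop
    // over a no-argument next() that returns null at end-of-stream.
    static String joinToken(Terms in) {
        StringBuilder sb = new StringBuilder();
        SimpleToken scratch = new SimpleToken();
        for (SimpleToken t = in.next(scratch); t != null; t = in.next(scratch)) {
            sb.append(t.text).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(joinBoolean(new Terms("quick", "brown", "fox")));
        System.out.println(joinToken(new Terms("quick", "brown", "fox")));
    }
}
```

Both loops allocate a single token for the whole stream; they differ only in how end-of-stream is signalled to the caller.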

Re: Token termBuffer issues

2007-07-24 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > OK, I ran some benchmarks here. > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all > > Wikipedia

Re: Token termBuffer issues

2007-07-24 Thread Yonik Seeley
On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: OK, I ran some benchmarks here. The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all Wikipedia content using LowerCaseTokenizer + StopFilter + PorterStemFi

Re: Token termBuffer issues

2007-07-24 Thread Michael McCandless
"Doron Cohen" <[EMAIL PROTECTED]> wrote: > > Agreed, so we can't change the API. So the next/nextDirect proposal > > should work well: it doesn't change the API yet would allow consumers > > that don't require "full private copy" of every Token, like > > DocumentsWriter, to have good performance.

Re: Token termBuffer issues

2007-07-24 Thread Michael McCandless
OK, I ran some benchmarks here. The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all Wikipedia content using LowerCaseTokenizer + StopFilter + PorterStemFilter. I think it's worth pursuing! Here are the optimizati

Re: Token termBuffer issues

2007-07-22 Thread Doron Cohen
> Agreed, so we can't change the API. So the next/nextDirect proposal > should work well: it doesn't change the API yet would allow consumers > that don't require "full private copy" of every Token, like > DocumentsWriter, to have good performance. If we eventually go this way, my preferred API f

Re: Token termBuffer issues

2007-07-21 Thread eks dev
dev <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Saturday, 21 July, 2007 6:23:25 PM Subject: Re: Token termBuffer issues On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote: >> To further improve "out of the box" performance I would really also >

Re: Token termBuffer issues

2007-07-21 Thread eks dev
On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote: >> To further improve "out of the box" performance I would really also >> like to fix the core analyzers, when possible, to re-use a single >> Token instance for each Token they produce. This would then mean no >> objects are created as yo

Re: Token termBuffer issues

2007-07-21 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > To further improve "out of the box" performance I would really also > > like to fix the core analyzers, when possible, to re-use a single > > Token instance for each Token they produce. This w

Re: Token termBuffer issues

2007-07-21 Thread Mark Miller
The problem is, this would be a sneaky API change and would for example break anyone who gathers the Tokens into an array and processes them later (eg maybe highlighter does?). org.apache.lucene.analysis.CachingTokenFilter does this. - Mark -
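The breakage described above can be sketched in a few lines. This is an illustration under assumptions, not Lucene code: `SimpleToken`, `ReusingStream`, and `gather` are hypothetical names, and the sketch only models a producer that hands back the same instance on every call and a consumer that stores the references for later.

```java
import java.util.ArrayList;
import java.util.List;

class TokenReuseDemo {
    // Hypothetical stand-in for a token; only the term text is modeled.
    static final class SimpleToken {
        String text;

        SimpleToken copy() {
            SimpleToken t = new SimpleToken();
            t.text = text;
            return t;
        }
    }

    // Hypothetical producer that reuses one token instance for every term.
    static final class ReusingStream {
        private final String[] terms;
        private int pos;
        private final SimpleToken shared = new SimpleToken();

        ReusingStream(String... terms) {
            this.terms = terms;
        }

        SimpleToken next() {
            if (pos == terms.length) {
                return null;
            }
            shared.text = terms[pos++]; // overwrites the previous term in place
            return shared;
        }
    }

    // Consumer that gathers tokens first and reads them afterwards.
    // With copy == false it stores the shared reference each time.
    static List<String> gather(ReusingStream in, boolean copy) {
        List<SimpleToken> cached = new ArrayList<>();
        for (SimpleToken t = in.next(); t != null; t = in.next()) {
            cached.add(copy ? t.copy() : t);
        }
        List<String> out = new ArrayList<>();
        for (SimpleToken t : cached) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Without copying, every cached entry aliases the same instance.
        System.out.println(gather(new ReusingStream("quick", "brown", "fox"), false)); // [fox, fox, fox]
        // Copying before caching preserves each term.
        System.out.println(gather(new ReusingStream("quick", "brown", "fox"), true));  // [quick, brown, fox]
    }
}
```

A caching consumer that was correct when each call returned a fresh object silently sees only the last term once the producer starts reusing its instance, which is why the change would not show up as a compile error.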

Re: Token termBuffer issues

2007-07-21 Thread Yonik Seeley
On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote: To further improve "out of the box" performance I would really also like to fix the core analyzers, when possible, to re-use a single Token instance for each Token they produce. This would then mean no objects are created as you step thro

Re: Token termBuffer issues

2007-07-21 Thread Michael McCandless
"Chris Hostetter" <[EMAIL PROTECTED]> wrote: > > : > Currently the char[] wins, but good point: seems like each setter > : > should null out the other one? > : > : Certainly the String setter should null the char[] (that's the only > : way to keep back compatibility), and probably vice-versa. >

Re: Token termBuffer issues

2007-07-20 Thread Chris Hostetter
: > Currently the char[] wins, but good point: seems like each setter : > should null out the other one? : : Certainly the String setter should null the char[] (that's the only : way to keep back compatibility), and probably vice-versa. i haven't really thought about this before today (i missed s
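The "each setter nulls out the other one" idea might look roughly like the sketch below. The field names `termText`, `termBuffer`, `termBufferOffset`, and `termBufferLength` are taken from the patch quoted in this thread; the class `DualToken`, its setter names, and the `term()` accessor are assumptions made for illustration and are not claimed to match Lucene's actual Token methods.

```java
class DualTermDemo {
    // Hypothetical token holding the term either as a String or as a char[] slice.
    static final class DualToken {
        private String termText;
        private char[] termBuffer;
        private int termBufferOffset;
        private int termBufferLength;

        // Setting the String clears the char[] view so it cannot win by accident.
        void setTermText(String text) {
            termText = text;
            termBuffer = null;
        }

        // Setting the char[] slice clears the String for the same reason.
        void setTermBuffer(char[] buffer, int offset, int length) {
            termBuffer = buffer;
            termBufferOffset = offset;
            termBufferLength = length;
            termText = null;
        }

        // Returns whichever representation was set last, or null if neither was.
        String term() {
            if (termText != null) {
                return termText;
            }
            if (termBuffer != null) {
                return new String(termBuffer, termBufferOffset, termBufferLength);
            }
            return null;
        }
    }

    public static void main(String[] args) {
        DualToken t = new DualToken();
        t.setTermBuffer("lucene".toCharArray(), 0, 3);
        t.setTermText("token");
        System.out.println(t.term()); // token
        t.setTermBuffer("lucene".toCharArray(), 0, 3);
        System.out.println(t.term()); // luc
    }
}
```

With both setters clearing the other field, the last write always determines what a reader sees, so older code that only ever calls the String setter keeps its previous behaviour.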

Re: Token termBuffer issues

2007-07-20 Thread Yonik Seeley
On 7/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote: "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > I had previously missed the changes to Token that add support for > using an array (termBuffer): > > + // For better indexing speed, use termBuffer (and > + // termBufferOffset/termBufferLength

Re: Token termBuffer issues

2007-07-19 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > I had previously missed the changes to Token that add support for > using an array (termBuffer): > > + // For better indexing speed, use termBuffer (and > + // termBufferOffset/termBufferLength) instead of termText > + // to save new'ing a String per

Token termBuffer issues

2007-07-19 Thread Yonik Seeley
I had previously missed the changes to Token that add support for using an array (termBuffer): + // For better indexing speed, use termBuffer (and + // termBufferOffset/termBufferLength) instead of termText + // to save new'ing a String per token + char[] termBuffer; + int termBufferOffset;