"Michael McCandless" <[EMAIL PROTECTED]> wrote:
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > > > "Yonik Seeley" <[EMA
On 7/26/07, Steven Parkes <[EMAIL PROTECTED]> wrote:

First I create a single large file that has one doc per line from
Wikipedia content, using this alg [...]

Anybody disagree that the 1-line-per-doc format is better (at least for
Wikipedia)? If so, I'll get rid of the intermediate one-file-per-doc
step.
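A minimal sketch, assuming a hypothetical LineDocSource class and a made-up "body" field name, of how an indexer could consume such a one-doc-per-line file; this is not the actual benchmark code:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Illustrative reader for a one-doc-per-line file: each readLine()
    // yields one Wikipedia document, with no per-doc file open/close cost.
    public class LineDocSource {
      private final BufferedReader in;

      public LineDocSource(String path) throws IOException {
        in = new BufferedReader(new FileReader(path));
      }

      // Returns the next Document, or null once the file is exhausted.
      public Document next() throws IOException {
        String line = in.readLine();
        if (line == null) return null;
        Document doc = new Document();
        doc.add(new Field("body", line, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
      }

      public void close() throws IOException {
        in.close();
      }
    }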
"Doron Cohen" <[EMAIL PROTECTED]> wrote:
> "Michael McCandless" wrote:
>
> > boolean next(Token resToken)
> >
> > which returns true if it has updated resToken with another token,
> > else false if end-of-stream was hit.
>
> I would actually prefer
> Token next(Token resToken)
> because:
>
"Michael McCandless" wrote:
> boolean next(Token resToken)
>
> which returns true if it has updated resToken with another token,
> else false if end-of-stream was hit.
I would actually prefer
Token next(Token resToken)
because:
- this was the API with reuse is very much like the one without
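To make the two options concrete, here is an illustrative sketch using stand-in classes (SketchToken and SketchTokenStream are invented names, not the Lucene Token/TokenStream, and the method bodies are only a guess at the intent):

    import java.io.IOException;

    // Stand-in for Token: just a reusable char[] plus a length.
    class SketchToken {
      char[] buffer = new char[64];
      int length;
    }

    abstract class SketchTokenStream {

      // Existing (non-reuse) style: a new token per call, null at end.
      public abstract SketchToken next() throws IOException;

      // Option 1: boolean next(Token) -- true if "result" was filled in
      // with another token, false once end-of-stream is hit.
      public boolean next(SketchToken result) throws IOException {
        SketchToken t = next();
        if (t == null) return false;
        if (result.buffer.length < t.length) {
          result.buffer = new char[t.length];
        }
        System.arraycopy(t.buffer, 0, result.buffer, 0, t.length);
        result.length = t.length;
        return true;
      }

      // Option 2 (preferred above): Token next(Token) -- returns the
      // reused token, or null at end-of-stream, mirroring plain next().
      // Named differently here only because Java can't overload on
      // return type alone.
      public SketchToken nextReusing(SketchToken result) throws IOException {
        return next(result) ? result : null;
      }
    }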
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > OK, I ran some benchmarks here.
> >
> > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > Wikipedia
On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
OK, I ran some benchmarks here.
The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
Wikipedia content using LowerCaseTokenizer + StopFilter +
PorterStemFi
"Doron Cohen" <[EMAIL PROTECTED]> wrote:
> > Agreed, so we can't change the API. So the next/nextDirect proposal
> > should work well: it doesn't change the API yet would allow consumers
> > that don't require "full private copy" of every Token, like
> > DocumentsWriter, to have good performance.
On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

OK, I ran some benchmarks here.

The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
Wikipedia content using LowerCaseTokenizer + StopFilter +
PorterStemFilter. I think it's worth pursuing!

Here are the optimizations: [...]
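For reference, a sketch of that analysis chain wired into an Analyzer, assuming the 2.x-era constructors (StopFilter taking a String[] stop list); the actual benchmark driver is not shown in the thread:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;

    // LowerCaseTokenizer + StopFilter + PorterStemFilter, the chain used
    // for the Wikipedia indexing run described above.
    public class WikipediaBenchmarkAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new LowerCaseTokenizer(reader);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        return new PorterStemFilter(stream);
      }
    }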
"Doron Cohen" <[EMAIL PROTECTED]> wrote:

> Agreed, so we can't change the API. So the next/nextDirect proposal
> should work well: it doesn't change the API yet would allow consumers
> that don't require "full private copy" of every Token, like
> DocumentsWriter, to have good performance.

If we eventually go this way, my preferred API for [...]
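A sketch of what a next/nextDirect split could look like; only the two method names come from the proposal, and the stand-in classes and method bodies below are guesses for illustration:

    import java.io.IOException;

    // Illustrative token class; stands in for org.apache.lucene.analysis.Token.
    class DirectSketchToken {
      char[] buffer = new char[64];
      int length;
    }

    abstract class DirectSketchTokenStream {

      // Fast path: may hand back an internal, reused token. Consumers that
      // never hold onto Tokens (e.g. DocumentsWriter) call this directly.
      public abstract DirectSketchToken nextDirect() throws IOException;

      // Unchanged public API: always returns a private copy, so existing
      // consumers that cache Tokens keep working.
      public DirectSketchToken next() throws IOException {
        DirectSketchToken shared = nextDirect();
        if (shared == null) return null;
        DirectSketchToken copy = new DirectSketchToken();
        copy.buffer = new char[shared.length];
        System.arraycopy(shared.buffer, 0, copy.buffer, 0, shared.length);
        copy.length = shared.length;
        return copy;
      }
    }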
To: java-dev@lucene.apache.org
Sent: Saturday, 21 July, 2007 6:23:25 PM
Subject: Re: Token termBuffer issues
The problem is, this would be a sneaky API change and would for
example break anyone who gathers the Tokens into an array and
processes them later (eg maybe highlighter does?).
org.apache.lucene.analysis.CachingTokenFilter does this.
- Mark
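A small self-contained illustration of that failure mode in plain Java (no Lucene classes): once the producer reuses a single instance, a consumer that caches references, as CachingTokenFilter does, ends up with many pointers to the last token.

    import java.util.ArrayList;
    import java.util.List;

    // Demonstrates why reuse silently breaks consumers that cache Tokens:
    // every cached entry aliases the single mutated instance.
    public class ReuseCachingHazard {

      static class FakeToken {
        char[] buffer = new char[16];
        int length;
      }

      public static void main(String[] args) {
        FakeToken reused = new FakeToken();
        List<FakeToken> cache = new ArrayList<FakeToken>();

        for (String term : new String[] { "quick", "fox" }) {
          term.getChars(0, term.length(), reused.buffer, 0);
          reused.length = term.length();
          cache.add(reused);               // same object cached twice
        }

        // Prints "fox" twice: the first cached "token" was overwritten.
        for (FakeToken t : cache) {
          System.out.println(new String(t.buffer, 0, t.length));
        }
      }
    }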
On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

To further improve "out of the box" performance I would really also
like to fix the core analyzers, when possible, to re-use a single
Token instance for each Token they produce. This would then mean no
objects are created as you step through [...]
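A rough sketch of the reuse idea on a toy whitespace tokenizer; the class below is illustrative only and not one of the core analyzers:

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    // Toy tokenizer that writes every term into one reused char[] instead
    // of new'ing a Token/String per term.
    public class ReusingWhitespaceTokenizer {
      private final Reader in;
      private final char[] term = new char[255];
      private int termLength;

      public ReusingWhitespaceTokenizer(Reader in) { this.in = in; }

      // Advances to the next term; returns false at end of input.
      // No allocation happens per call.
      public boolean next() throws IOException {
        termLength = 0;
        int c;
        while ((c = in.read()) != -1) {
          if (Character.isWhitespace((char) c)) {
            if (termLength > 0) return true;
          } else if (termLength < term.length) {
            term[termLength++] = (char) c;
          }
        }
        return termLength > 0;
      }

      // Accessors for the shared buffer; only for inspection in the demo.
      public char[] termBuffer() { return term; }
      public int termLength() { return termLength; }

      public static void main(String[] args) throws IOException {
        ReusingWhitespaceTokenizer t =
            new ReusingWhitespaceTokenizer(new StringReader("reuse a single token"));
        while (t.next()) {
          System.out.println(new String(t.termBuffer(), 0, t.termLength()));
        }
      }
    }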
"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
>
> : > Currently the char[] wins, but good point: seems like each setter
> : > should null out the other one?
> :
> : Certainly the String setter should null the char[] (that's the only
> : way to keep back compatibility), and probably vice-versa.
>
: > Currently the char[] wins, but good point: seems like each setter
: > should null out the other one?
:
: Certainly the String setter should null the char[] (that's the only
: way to keep back compatibility), and probably vice-versa.
i haven't really thought baout this before today (i missed s
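A sketch of the rule being discussed, on an illustrative class whose field and method names merely mimic Token (this is not the committed code): each setter clears the other representation so a stale value can never win.

    // Illustrative dual-representation token: the String setter clears the
    // char[] form, and vice-versa, so only the most recent setter "wins".
    public class DualRepToken {
      private String termText;          // legacy String form
      private char[] termBuffer;        // new char[] form
      private int termBufferLength;

      public void setTermText(String text) {
        termText = text;
        termBuffer = null;              // invalidate the char[] form
      }

      public void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = new char[length];
        System.arraycopy(buffer, offset, termBuffer, 0, length);
        termBufferLength = length;
        termText = null;                // invalidate the String form
      }

      // Back-compatible accessor: lazily rebuilds the String if only the
      // char[] form is set.
      public String termText() {
        if (termText == null && termBuffer != null) {
          termText = new String(termBuffer, 0, termBufferLength);
        }
        return termText;
      }
    }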
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

I had previously missed the changes to Token that add support for
using an array (termBuffer):

+ // For better indexing speed, use termBuffer (and
+ // termBufferOffset/termBufferLength) instead of termText
+ // to save new'ing a String per token
+ char[] termBuffer;
+ int termBufferOffset;
[...]
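To show what the comment is getting at, here is an illustrative holder reusing the field names from the diff; setTerm() copies characters into the buffer without creating a String per token (the class itself is a sketch, not the patch):

    // Illustrative holder with the fields from the diff above; setTerm()
    // fills the reusable buffer with no String created along the way.
    public class TermBufferSketch {
      char[] termBuffer = new char[32];
      int termBufferOffset;
      int termBufferLength;

      // Copies one term into the buffer, growing it only when needed.
      void setTerm(char[] source, int offset, int length) {
        if (termBuffer.length < length) {
          termBuffer = new char[length];
        }
        System.arraycopy(source, offset, termBuffer, 0, length);
        termBufferOffset = 0;
        termBufferLength = length;
      }

      public static void main(String[] args) {
        TermBufferSketch t = new TermBufferSketch();
        char[] text = "wikipedia".toCharArray();
        t.setTerm(text, 0, text.length);
        // Only for display does a String get created here:
        System.out.println(
            new String(t.termBuffer, t.termBufferOffset, t.termBufferLength));
      }
    }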