Re: Slow Index Writes

2014-01-08 Thread Michael McCandless
NRTManager was renamed to ControlledRealTimeReopenThread at some point.

But likely simple NRT readers (as Ian described, using
.openIfChanged()) will fit your usage.

ControlledRealTimeReopenThread is only necessary if you require
certain searches to be real-time, e.g. you just indexed a document and
then want to run a search that you know reflects that document.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jan 7, 2014 at 8:41 AM, Klaus Schaefers wrote:
> Hi,
>
>
> I was looking for some examples, but I only found ones that use an NRTManager
> class. In Lucene 4.5 I cannot find that class (am I missing a Maven dependency?).
> Can anyone point me to a working example?
>
> Cheers,
>
> Klaus
>
>
>
> On Fri, Jan 3, 2014 at 11:49 AM, Ian Lea  wrote:
>
>> You will indeed get poor performance if you commit for every doc.  Can
>> you compromise and commit every, say, 1000 docs, or once every few
>> minutes, or whatever makes sense for your app?
>>
>> Or look at Lucene's near-real-time search features.  Google "Lucene
>> NRT" for info.
>>
>> Or use Elasticsearch.
>>
>>
>> --
>> Ian.
>>
>>
>> On Fri, Jan 3, 2014 at 10:21 AM, Klaus Schaefers wrote:
>> > Hi,
>> >
>> > I am trying to use Lucene as a kind of key-value store, but I have run
>> > into serious performance problems. When I add my data as documents to the
>> > index, I get an average write rate of 3 documents per second! This seems
>> > ridiculously slow to me, so I guess there must be an error somewhere.
>> > Please have a look at my code:
>> >
>> >
>> >
>> > Directory dir = new NIOFSDirectory(file);
>> > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
>> > IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45, analyzer);
>> > IndexWriter writer = new IndexWriter(dir, config);
>> >
>> > int eventCount = 1000;
>> > for (int i = 0; i < eventCount; i++) {
>> >     Document doc = new Document();
>> >     doc.add(new StringField("id", i + "id", Store.YES));
>> >     doc.add(new StoredField("b", buildVector()));
>> >     writer.addDocument(doc);
>> >     writer.commit();
>> > }
>> > // close the writer before the directory it writes to
>> > writer.close();
>> > dir.close();
>> >
>> >
>> > Not calling commit() seems to fix the issue, but I guess this would
>> > cause problems if I want to read values in the meantime. My normal use
>> > case would be to read something from the index, maybe alter it, and then
>> > write it back, so roughly 50% of the operations would be reads.
>> >
>> > I also tried an embedded version of Elasticsearch, and it manages about
>> > 2000 documents per second. As it is based on Lucene as well, I guess I am
>> > doing something wrong in my code.
>> >
>> >
>> > THX for the help,
>> >
>> > Klaus
>> >
>> >
>> > --
>> >
>> > --
>> >
>> > Klaus Schaefers
>> > Senior Optimization Manager
>> >
>> > Ligatus GmbH
>> > Hohenstaufenring 30-32
>> > D-50674 Köln
>> >
>> > Tel.:  +49 (0) 221 / 56939 -784
>> > Fax:  +49 (0) 221 / 56 939 - 599
>> > E-Mail: klaus.schaef...@ligatus.com
>> > Web: www.ligatus.de
>> >
>> > HRB Köln 56003
>> > Geschäftsführung:
>> > Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
>> > Dipl.-Wirtschaftsingenieur Arne Wolter
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Slow Index Writes

2014-01-08 Thread Klaus Schaefers
THX!




Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Mindaugas Žakšauskas
Just for interest: I ran into a similar problem, as have other
people [1]. In my project, I extend the Tokenizer class and use
another tokenizer (e.g. ClassicTokenizer) as a delegate.
Unfortunately, properly overriding all public/protected methods is
*not* enough, e.g.:

public void reset() throws IOException {
  super.reset();
  delegate.reset();
}

I was still getting the exception about a broken read()/close() contract.
Half a day and *lots* of debugging later, I realized that the exception is
only thrown when indexing the second document, because the delegate's reader
is internally replaced with ILLEGAL_STATE_READER after .close() is
called. My solution to this problem was to make the reset() method
look like this:

public void reset() throws IOException {
  super.reset();
  delegate.setReader(input);
  delegate.reset();
}

Another thing worth mentioning is that it's crucial to call
super.method() before delegate.method() in all overridden methods.
It would be nice if all of this were somewhere in the Tokenizer Javadoc,
or even nicer if the base class were designed with delegation in mind
(Effective Java (2nd edition), Item 16).
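
To make the failure mode concrete, here is a minimal, self-contained sketch in
plain Java. It does not use Lucene: ToySource is a hypothetical stand-in that
only imitates the behavior described above (a reader becomes usable on reset()
and is swapped for an ILLEGAL_STATE_READER on close()), and DelegatingSource,
rewireOnReset and consumeTwo are names invented for illustration. The real
Tokenizer may well differ in detail.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class DelegateResetSketch {

    /** Placeholder reader that rejects every read, standing in for "not reset / already closed". */
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException("reset() not called, or source already closed");
        }
        @Override public void close() { }
    };

    /** Toy model of a reusable source (an assumption for illustration, NOT Lucene's Tokenizer). */
    static class ToySource {
        protected Reader input = ILLEGAL_STATE_READER;
        private Reader pending;

        ToySource(Reader reader) { this.pending = reader; }

        void setReader(Reader reader) { this.pending = reader; }

        /** The reader handed in earlier only becomes readable here. */
        void reset() throws IOException {
            input = pending;
            pending = ILLEGAL_STATE_READER;
        }

        /** After close(), the source holds no usable reader any more. */
        void close() throws IOException {
            input.close();
            input = pending = ILLEGAL_STATE_READER;
        }

        String readAll() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) sb.append((char) c);
            return sb.toString();
        }
    }

    /** Wrapper that forwards the actual reading to a delegate. */
    static class DelegatingSource extends ToySource {
        private final ToySource delegate;
        private final boolean rewireOnReset;

        DelegatingSource(Reader reader, boolean rewireOnReset) {
            super(reader);
            this.delegate = new ToySource(reader);
            this.rewireOnReset = rewireOnReset;
        }

        @Override void reset() throws IOException {
            super.reset();                                  // sets this.input first
            if (rewireOnReset) delegate.setReader(input);   // hand the current reader to the delegate
            delegate.reset();
        }

        @Override void close() throws IOException {
            super.close();
            delegate.close();
        }

        @Override String readAll() throws IOException { return delegate.readAll(); }
    }

    /** Reuses one wrapper for two "documents" and returns what was read from each. */
    static String[] consumeTwo(boolean rewireOnReset) {
        try {
            DelegatingSource s = new DelegatingSource(new StringReader("first"), rewireOnReset);
            s.reset();
            String a = s.readAll();
            s.close();
            s.setReader(new StringReader("second"));
            s.reset();
            String b = s.readAll();
            s.close();
            return new String[] { a, b };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", consumeTwo(true)));
        try {
            consumeTwo(false);
        } catch (IllegalStateException e) {
            System.out.println("second document failed: " + e.getMessage());
        }
    }
}
```

In this model, consumeTwo(false) reads the first document and then fails on
the second with an IllegalStateException, while consumeTwo(true) reads both.
Because the rewiring uses the wrapper's own input field, super.reset() has to
run before the delegate is touched, which matches the ordering advice above.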

Hope this helps somebody.

[1] http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673

Regards,
Mindaugas

On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies  wrote:
> Yes I Do.
>
>
> On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir  wrote:
>
>> Benson, do you want to open an issue to fix this constructor to not
>> take Reader? (There might be one already, but let's make a new one.)
>>
>> These things are supposed to be reused, and have setReader for that
>> purpose. I think it's confusing, and contributes to bugs, that you have
>> to have logic in e.g. the ctor THEN ALSO in reset().
>>
>> If someone does it correctly in the ctor, but they only test "one
>> time", they might think everything is working.
>>
>> On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies wrote:
>> > For the record of other people who implement tokenizers:
>> >
>> > Say that your tokenizer has a constructor, like:
>> >
>> >  public MyTokenizer(Reader reader, ...) {
>> >    super(reader);
>> >    myWrappedInputDevice = new MyWrappedInputDevice(reader);
>> >  }
>> >
>> > Not a good idea. Tokenizer carefully manages the data flow from the
>> > constructor arg to the 'input' field. The correct form is:
>> >
>> >  public MyTokenizer(Reader reader, ...) {
>> >    super(reader);
>> >    myWrappedInputDevice = new MyWrappedInputDevice(this.input);
>> >  }
>> >
>> >
>> >
>> > On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir  wrote:
>> >
>> >> See Tokenizer.java for the state machine logic. In general you should
>> >> not have to do anything if the tokenizer is well-behaved (e.g. close
>> >> calls super.close() and so on).
>> >>
>> >>
>> >>
>> >> On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies wrote:
>> >> > In 4.6.0,
>> >> > org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
>> >> > fails if incrementToken does not throw when reset() is missing.
>> >> >
>> >> > How am I supposed to organize this in a Tokenizer? A quick look at
>> >> > CharTokenizer did not reveal any code for the purpose.
>> >> >



Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
If you'd like to join in on the doc, see
https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant
you access to push to my fork.




Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Mindaugas Žakšauskas
Hi,

Sure, why not. I'm just not sure whether my approach (setting the reader in
reset()) is preferable to yours (using this.input instead of the constructor
argument in the ctor). Or are they both equally good?

m.





Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
I'm not in the delegate business; I just have a straight subclass. So I think
they are complementary. Give me your GitHub identity, and you are, as far as I
am concerned, more than welcome to add a section on delegates.





Suggesters: payloads and filter predicates

2014-01-08 Thread Oliver Christ
Hi,

It's great to see support for payloads in the suggesters - this is really 
helpful, and pretty much addresses LUCENE-4516. Are there any plans to also 
support them for WFSTs? We have some cases where we don't need the Analyzer's 
capabilities (we look up the completion using the payload information, so don't 
need the automaton to return the surface form, and can benefit from the much 
smaller WFSTs).

I'm also getting back to LUCENE-4517 and wondering about its status. I know that
filtering interferes with pruning, but it generally seems to be less of an
"evil" than requesting a significantly higher number of hits and then filtering
the result set (the size of which is capped, if I remember correctly).
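
To illustrate the concern, here is a toy sketch in plain Java. It is not the
suggester API: Entry, topThenFilter, filterWhileCollecting and the sample data
are all hypothetical, and a list sorted by descending weight stands in for the
automaton traversal. It only shows why filtering a capped result set after the
fact can come back short, while a predicate applied during collection does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class SuggestFilterSketch {

    /** Hypothetical suggestion entry: surface text, weight, and a payload used for filtering. */
    static final class Entry {
        final String text;
        final long weight;
        final String payload;

        Entry(String text, long weight, String payload) {
            this.text = text;
            this.weight = weight;
            this.payload = payload;
        }
    }

    /** Over-request: look only at the top `cap` entries, filter them, keep at most `n`. */
    static List<String> topThenFilter(List<Entry> byWeightDesc, int cap, int n, Predicate<Entry> accept) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(cap, byWeightDesc.size());
        for (int i = 0; i < limit && out.size() < n; i++) {
            Entry e = byWeightDesc.get(i);
            if (accept.test(e)) out.add(e.text);
        }
        return out;
    }

    /** Filter while collecting: skip rejected entries and stop once `n` were accepted. */
    static List<String> filterWhileCollecting(List<Entry> byWeightDesc, int n, Predicate<Entry> accept) {
        List<String> out = new ArrayList<>();
        for (Entry e : byWeightDesc) {
            if (out.size() == n) break;
            if (accept.test(e)) out.add(e.text);
        }
        return out;
    }

    /** Made-up sample data, already sorted by descending weight. */
    static List<Entry> sample() {
        return Arrays.asList(
            new Entry("apple", 10, "fruit"),
            new Entry("apricot", 9, "fruit"),
            new Entry("avocado", 8, "fruit"),
            new Entry("awl", 7, "tool"),
            new Entry("axe", 6, "tool"));
    }

    public static void main(String[] args) {
        Predicate<Entry> onlyTools = e -> e.payload.equals("tool");
        System.out.println(topThenFilter(sample(), 3, 2, onlyTools));
        System.out.println(filterWhileCollecting(sample(), 2, onlyTools));
    }
}
```

With a cap of 3, the over-requesting variant finds no "tool" entries at all,
because the three highest-weighted entries are all rejected by the predicate;
the filtering variant returns "awl" and "axe". The price, in a real suggester,
is the weaker pruning mentioned above.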

Thanks!

Cheers, Oli