Re: Slow Index Writes
NRTManager was renamed to ControlledRealTimeReopenThread at some point.

But likely simple NRT readers (as Ian described, using .openIfChanged()) will fit your usage.

ControlledRealTimeReopenThread is only necessary if you require certain searches to be real-time, e.g. you just indexed a document and then want to run a search that you know reflects that document.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jan 7, 2014 at 8:41 AM, Klaus Schaefers wrote:
> Hi,
>
> I was looking for some examples but I just found some using an NRTManager
> class? In Lucene 4.5 I cannot find the class (missing a maven dependency?).
> Can anyone point me to a working example?
>
> Cheers,
>
> Klaus
>
> On Fri, Jan 3, 2014 at 11:49 AM, Ian Lea wrote:
>
>> You will indeed get poor performance if you commit for every doc. Can
>> you compromise and commit every, say, 1000 docs, or once every few
>> minutes, or whatever makes sense for your app.
>>
>> Or look at lucene's near-real-time search features. Google "Lucene
>> NRT" for info.
>>
>> Or use Elastic Search.
>>
>> --
>> Ian.
>>
>> On Fri, Jan 3, 2014 at 10:21 AM, Klaus Schaefers wrote:
>> > Hi,
>> >
>> > I am trying to use a lucene as a kind of key value store, but I encountered
>> > some bad performance issues. When I try to add my data as documents to the
>> > index I get an average write rate of 3 documents / second!! This seems to
>> > me ridiculously slow and I guess I must have somewhere an error. Please
>> > have a look at my code:
>> >
>> >     Directory dir = new NIOFSDirectory(file);
>> >     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
>> >     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_45, analyzer);
>> >     IndexWriter writer = new IndexWriter(dir, config);
>> >
>> >     int eventCount = 1000;
>> >     for(int i=0; i < eventCount;i++){
>> >         Document doc = new Document();
>> >         doc.add(new StringField("id", i+"id" ,Store.YES));
>> >         doc.add(new StoredField("b", buildVector()));
>> >         writer.addDocument(doc);
>> >         writer.commit();
>> >     }
>> >     dir.close();
>> >     writer.close()
>> >
>> > Not calling the commit function seems to fix the issue, but I guess this
>> > would then have some issues if I want to read values in the mean time. My
>> > normal use case would be to read something from the index, maybe alter it
>> > and then write back. So I would have roughly 50% of reads.
>> >
>> > I tried also an embedded version of elastic search and it manages to go to
>> > 2000 documents/ per second. As its based on lucene as well I guess I do
>> > something wrong in my code.
>> >
>> > THX for the help,
>> >
>> > Klaus
>> >
>> > --
>> > Klaus Schaefers
>> > Senior Optimization Manager
>> >
>> > Ligatus GmbH
>> > Hohenstaufenring 30-32
>> > D-50674 Köln
>> >
>> > Tel.: +49 (0) 221 / 56939 -784
>> > Fax: +49 (0) 221 / 56 939 - 599
>> > E-Mail: klaus.schaef...@ligatus.com
>> > Web: www.ligatus.de
>> >
>> > HRB Köln 56003
>> > Geschäftsführung:
>> > Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann,
>> > Dipl.-Wirtschaftsingenieur Arne Wolter
>>
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
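Ian's suggestion of committing once per batch rather than once per document can be sketched without Lucene at all. The block below is a minimal model under stated assumptions: `Index`, `BatchingWriter`, `CountingIndex` and `BatchingDemo` are hypothetical names invented for illustration, with `Index` standing in for the `addDocument`/`commit` pair that the posted code calls on `IndexWriter`. It only counts how many commits each policy issues; it says nothing about how expensive a real commit is.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two writer calls used in the thread. */
interface Index {
    void addDocument(String doc);
    void commit();
}

/** Wraps an Index and commits once per batch instead of once per document. */
class BatchingWriter implements AutoCloseable {
    private final Index index;
    private final int batchSize;
    private int uncommitted = 0;

    BatchingWriter(Index index, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.index = index;
        this.batchSize = batchSize;
    }

    void add(String doc) {
        index.addDocument(doc);
        if (++uncommitted >= batchSize) flush();
    }

    private void flush() {
        index.commit();
        uncommitted = 0;
    }

    /** Commits whatever is left over from the last, partially filled batch. */
    @Override public void close() {
        if (uncommitted > 0) flush();
    }
}

/** In-memory Index that only counts calls, so the effect of batching is visible. */
class CountingIndex implements Index {
    final List<String> docs = new ArrayList<>();
    int commits = 0;
    @Override public void addDocument(String doc) { docs.add(doc); }
    @Override public void commit() { commits++; }
}

class BatchingDemo {
    /** Adds docCount documents and returns the number of commits issued. */
    static int commitsFor(int docCount, int batchSize) {
        CountingIndex index = new CountingIndex();
        try (BatchingWriter writer = new BatchingWriter(index, batchSize)) {
            for (int i = 0; i < docCount; i++) writer.add(i + "id");
        }
        return index.commits;
    }

    public static void main(String[] args) {
        System.out.println("batch size 1:   " + commitsFor(1000, 1) + " commits");
        System.out.println("batch size 100: " + commitsFor(1000, 100) + " commits");
    }
}
```

With a batch size of 1 this reproduces the commit-per-document loop from the original post; larger batch sizes trade how quickly changes become durable for fewer commits. Readers that must see uncommitted changes are the separate near-real-time topic Mike addresses above.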
Re: Slow Index Writes
THX!

On Wed, Jan 8, 2014 at 10:10 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> NRTManager was renamed to ControlledRealTimeReopenThread at some point.
>
> But likely simple NRT readers (as Ian described, using
> .openIfChanged()) will fit your usage.
>
> ControlledRealTimeReopenThread is only necessary if you require
> certain searches to be real-time, e.g. you just indexed a document and
> then want to run a search that you know reflects that document.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

--
Klaus Schaefers
Re: How is incrementToken supposed to detect the lack of reset()?
Just for the interest, I had a similar problem too as well as other people [1]. In my project, I am extending the Tokenizer class and have another tokenizer (e.g. ClassicTokenizer) as a delegate. Unfortunately, properly overriding all public/protected methods is *not* enough, e.g.:

    public void reset() throws IOException {
        super.reset();
        delegate.reset();
    }

I was still getting the exception of broken read()/close() contract. Half day and *lots* of debugging later, I realized that exception is only thrown when indexing second document only as the delegate reader internally gets replaced with ILLEGAL_STATE_READER after .close() is called. My solution to this problem was to make the reset() method like this:

    public void reset() throws IOException {
        super.reset();
        delegate.setReader(input);
        delegate.reset();
    }

Another thing worth mentioning is that it's crucial to have super.method() before delegate.method() in all overridden methods. Would be nice if all of this was somewhere in the Tokenizer Javadoc, or even nicer if the base class was designed with delegation in mind (Effective Java (2nd edition), Item 16).

Hope this helps somebody.

[1] http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673

Regards,
Mindaugas

On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies wrote:
> Yes I Do.
>
> On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir wrote:
>
>> Benson, do you want to open an issue to fix this constructor to not
>> take Reader? (there might be one already, but lets make a new one).
>>
>> These things are supposed to be reused, and have setReader for that
>> purpose. i think its confusing and contributes to bugs that you have
>> to have logic in e.g. the ctor THEN ALSO in reset().
>>
>> if someone does it correctly in the ctor, but they only test "one
>> time", they might think everything is working..
>>
>> On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies wrote:
>> > For the record of other people who implement tokenizers:
>> >
>> > Say that your tokenizer has a constructor, like:
>> >
>> >     public MyTokenizer(Reader reader, ) {
>> >         super(reader);
>> >         myWrappedInputDevice = new MyWrappedInputDevice(reader);
>> >     }
>> >
>> > Not a good idea. Tokenizer carefully manages the data flow from the
>> > constructor arg to the 'input' field. The correct form is:
>> >
>> >     public MyTokenizer(Reader reader, ) {
>> >         super(reader);
>> >         myWrappedInputDevice = new MyWrappedInputDevice(this.input);
>> >     }
>> >
>> > On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir wrote:
>> >
>> >> See Tokenizer.java for the state machine logic. In general you should
>> >> not have to do anything if the tokenizer is well-behaved (e.g. close
>> >> calls super.close() and so on).
>> >>
>> >> On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies wrote:
>> >> > In 4.6.0,
>> >> > org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
>> >> > fails if incrementToken fails to throw if there's a missing reset.
>> >> >
>> >> > How am I supposed to organize this in a Tokenizer? A quick look at
>> >> > CharTokenizer did not reveal any code for the purpose.
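The failure Mindaugas describes, where the second document fails because the delegate never receives the new reader, can be reproduced with a small stand-alone model. This is a sketch under loud assumptions: `ModelTokenizer`, `DelegatingTokenizer` and `TokenizerDemo` are hypothetical classes written for illustration and are NOT Lucene's `Tokenizer`. They only imitate the behaviour described in the thread: reads fail until `reset()` is called, and `close()` swaps the reader for an illegal-state placeholder.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Simplified stand-in for a reusable tokenizer; not Lucene's Tokenizer. */
class ModelTokenizer {
    /** Reader that rejects every read, modelling the illegal-state placeholder. */
    static final Reader ILLEGAL = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException("reset() or setReader() missing");
        }
        @Override public void close() {}
    };

    protected Reader input = ILLEGAL; // what reads actually use
    private Reader pending;           // handed over by setReader(), activated by reset()

    ModelTokenizer(Reader reader) { this.pending = reader; }

    void setReader(Reader reader) { this.pending = reader; }

    void reset() throws IOException { input = pending; pending = ILLEGAL; }

    void close() throws IOException { input.close(); input = ILLEGAL; pending = ILLEGAL; }

    /** Reads everything that is left; stands in for an incrementToken() loop. */
    String readAll() throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c = input.read(); c != -1; c = input.read()) sb.append((char) c);
        return sb.toString();
    }
}

/** Outer tokenizer that forwards all work to a delegate, as in the thread. */
class DelegatingTokenizer extends ModelTokenizer {
    private final ModelTokenizer delegate;
    private final boolean rewireOnReset;

    DelegatingTokenizer(Reader reader, boolean rewireOnReset) {
        super(reader);
        this.delegate = new ModelTokenizer(reader);
        this.rewireOnReset = rewireOnReset;
    }

    @Override void reset() throws IOException {
        super.reset();                                // first: activates this.input
        if (rewireOnReset) delegate.setReader(input); // then hand the live reader to the delegate
        delegate.reset();
    }

    @Override void close() throws IOException { super.close(); delegate.close(); }

    @Override String readAll() throws IOException { return delegate.readAll(); }
}

class TokenizerDemo {
    /** Runs two documents through one instance; returns what the second pass read. */
    static String secondDocument(boolean rewireOnReset) {
        try {
            DelegatingTokenizer t =
                new DelegatingTokenizer(new StringReader("first doc"), rewireOnReset);
            t.reset(); t.readAll(); t.close();
            t.setReader(new StringReader("second doc"));
            t.reset();
            String text = t.readAll();
            t.close();
            return text;
        } catch (IllegalStateException | IOException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** True if reading before reset() is rejected, the case checkResetException tests. */
    static boolean readWithoutResetFails() {
        try {
            new ModelTokenizer(new StringReader("x")).readAll();
            return false;
        } catch (IllegalStateException | IOException e) {
            return true;
        }
    }
}
```

In this model the first document always works because the delegate was constructed with the same reader; only reuse exposes the missing `setReader` call, which matches Robert's remark that testing "one time" can hide the bug. The order inside `reset()` matters for the same reason Mindaugas gives: `input` is only live after `super.reset()` has run.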
Re: How is incrementToken supposed to detect the lack of reset()?
If you'd like to join in on the doc, see https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant you access to push to my fork.

On Wed, Jan 8, 2014 at 5:37 AM, Mindaugas Žakšauskas wrote:
> Just for the interest, I had a similar problem too as well as other
> people [1]. In my project, I am extending the Tokenizer class and have
> another tokenizer (e.g. ClassicTokenizer) as a delegate.
> Unfortunately, properly overriding all public/protected methods is
> *not* enough, e.g.:
>
>     public void reset() throws IOException {
>         super.reset();
>         delegate.reset();
>     }
>
> I was still getting the exception of broken read()/close() contract.
> Half day and *lots* of debugging later, I realized that exception is
> only thrown when indexing second document only as the delegate reader
> internally gets replaced with ILLEGAL_STATE_READER after .close() is
> called. My solution to this problem was to make the reset() method
> like this:
>
>     public void reset() throws IOException {
>         super.reset();
>         delegate.setReader(input);
>         delegate.reset();
>     }
>
> Another thing worth mentioning is that it's crucial to have
> super.method() before delegate.method() in all overridden methods.
> Would be nice if all of this was somewhere in the Tokenizer Javadoc,
> or even nicer if the base class was designed with delegation in mind
> (Effective Java (2nd edition), Item 16).
>
> Hope this helps somebody.
>
> [1] http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673
>
> Regards,
> Mindaugas
Re: How is incrementToken supposed to detect the lack of reset()?
Hi,

Sure, why not - I'm just not sure if my approach (of setting reader in reset()) is preferred over yours (using this.input instead of input in ctor)? Or are they both equally good?

m.

On Wed, Jan 8, 2014 at 12:18 PM, Benson Margulies wrote:
> If you'd like to join in on the doc, see
> https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant
> you access to push to my fork.
Re: How is incrementToken supposed to detect the lack of reset()?
I'm not in the delegate business, just a straight subclass. So I think they are complementary. Gimme your github identity, and you are, as far as I am concerned, more than welcome to add a section on delegates.

On Wed, Jan 8, 2014 at 7:38 AM, Mindaugas Žakšauskas wrote:
> Hi,
>
> Sure, why not - I'm just not sure if my approach (of setting reader in
> reset()) is preferred over yours (using this.input instead of input in
> ctor)? Or are they both equally good?
>
> m.
>
> On Wed, Jan 8, 2014 at 12:18 PM, Benson Margulies wrote:
> > If you'd like to join in on the doc, see
> > https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant
> > you access to push to my fork.
Suggesters: payloads and filter predicates
Hi,

It's great to see support for payloads in the suggesters - this is really helpful, and pretty much addresses LUCENE-4516. Are there any plans to also support them for WFSTs? We have some cases where we don't need the Analyzer's capabilities (we look up the completion using the payload information, so don't need the automaton to return the surface form, and can benefit from the much smaller WFSTs).

I'm also getting back to LUCENE-4517, wondering about its status. I know that filtering interferes with pruning, but it generally seems to be less of an "evil" than requesting a significantly higher number of hits and then filtering the result set (the size of which is capped, if I remember correctly).

Thanks!

Cheers,
Oli
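The trade-off in the last paragraph, filtering while collecting versus over-requesting hits and filtering afterwards, can be illustrated on a plain list that is already sorted by weight. This is only a sketch under assumptions: `FilteredTopK` and both method names are hypothetical, and a linear scan over a list is not how an FST-based suggester traverses or prunes its paths. It shows just one effect, namely that a capped over-fetch can return fewer than k hits even though enough matching entries exist further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilteredTopK {
    /** Over-fetch: look only at the best `fetch` candidates, filter them, cut to k. */
    static List<String> postFilter(List<String> sortedByWeight, Predicate<String> keep,
                                   int k, int fetch) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(fetch, sortedByWeight.size());
        for (int i = 0; i < limit && out.size() < k; i++) {
            String s = sortedByWeight.get(i);
            if (keep.test(s)) out.add(s);
        }
        return out;
    }

    /** Filter during collection: keep scanning until k accepted hits are found. */
    static List<String> filterWhileCollecting(List<String> sortedByWeight,
                                              Predicate<String> keep, int k) {
        List<String> out = new ArrayList<>();
        for (String s : sortedByWeight) {
            if (out.size() == k) break;
            if (keep.test(s)) out.add(s);
        }
        return out;
    }
}
```

For example, with candidates `a1, b1, b2, b3, a2, a3` (best first), a filter that keeps entries starting with `a`, k = 2 and a fetch cap of 4, the over-fetch variant finds only `a1`, while filtering during collection continues past the cap and also returns `a2`.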