Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
I did some testing for you :) I modified your code to run in a JMH benchmark, and changed the number of retrieved docs to 1,000 out of the 1M in the index. This is what I got:

Lucene 7.5
Benchmark                                 Mode  Cnt   Score   Error  Units
DocRetrievalBenchmark.retrieveDocuments  thrpt    4  37.147 ± 6.218  ops/s

Lucene 8.7
Benchmark                                 Mode  Cnt   Score   Error  Units
DocRetrievalBenchmark.retrieveDocuments  thrpt    4  18.680 ± 5.755  ops/s

This is much in line with your observations (Lucene 8.7 seems almost twice as slow), so something is going on when running out-of-the-box.

The code can be found here (not really beautiful, but it gets the job done; if you want to switch Lucene versions, edit the pom and make sure to set the proper index version):
https://gist.github.com/d2a-raudenaerde/93a490e5b0d17b2fa88862473429aeb3

JMH details:

# JMH version: 1.21
# VM version: JDK 11.0.9.1, OpenJDK 64-Bit Server VM, 11.0.9.1+1-Ubuntu-0ubuntu1.20.04
# VM invoker: /usr/lib/jvm/java-11-openjdk-amd64/bin/java
# VM options: -Xms2G -Xmx2G
# Warmup: 2 iterations, 10 s each
# Measurement: 4 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.audenaerde.lucene.DocRetrievalBenchmark.retrieveDocuments

On Fri, Jan 22, 2021 at 4:22 PM Martynas L wrote:

> Just played with my reading sample. I do not have a goal to show the exact
> numbers, but it is a fact that document retrieval (IndexSearcher.doc(int))
> is much slower.
> All our performance tests showed performance degradation after changing to
> 8.7.0; even without measurement we can "see/feel" that the operations
> involving document retrieval became slower.
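For readers of the archive: the gist linked above has the actual benchmark; the following is only a minimal sketch of what such a JMH benchmark for `IndexSearcher.doc(int)` might look like (class name and the `index` directory are illustrative assumptions, not the gist code):

```java
import java.nio.file.Paths;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class DocRetrievalBenchmark {

    private DirectoryReader reader;
    private IndexSearcher searcher;
    private final Random random = new Random(42);

    @Setup
    public void setup() throws Exception {
        // Open a pre-built index; build it once with the version under test.
        reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        searcher = new IndexSearcher(reader);
    }

    @TearDown
    public void tearDown() throws Exception {
        reader.close();
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    @Warmup(iterations = 2, time = 10)
    @Measurement(iterations = 4, time = 10)
    public void retrieveDocuments(Blackhole bh) throws Exception {
        // Retrieve 1000 random stored documents per invocation; consume them
        // via the Blackhole so the JIT cannot eliminate the work.
        for (int i = 0; i < 1000; i++) {
            int docId = random.nextInt(reader.maxDoc());
            bh.consume(searcher.doc(docId));
        }
    }
}
```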
Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
Hi Martynas,

How did you measure that? I ask because writing a good benchmark is not an easy task, since there are so many factors (class loading times, JIT effects, etc.). You should use the Java Microbenchmark Harness or similar, and set up a random document retrieval task with warm-up, etc.

(I'm not aware of any big slowdowns, but as you see them, the best way is to build a robust benchmark and then start comparing.)

-Rob

On Fri, Jan 22, 2021 at 3:43 PM Martynas L wrote:

> Even retrieving a single document, 8.7.0 is more than 2x slower
>
> On Fri, Jan 22, 2021 at 2:28 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > > I think it will be similar ratio retrieving any number of documents.
> >
> > I'm not sure this is true; if you retrieve a huge amount of documents
> > you might cause trouble for the GC.
> >
> > From: java-user@lucene.apache.org At: 01/22/21 12:11:19
> > To: java-user@lucene.apache.org
> > Subject: Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
> >
> > The accent should not be on the number of retrieved documents, but on
> > the duration ratio - 8.7.0 is 3 times slower. I think it will be a
> > similar ratio retrieving any number of documents.
Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
Hi Martynas,

In your sample code you are retrieving all (1 million!) documents from the index; that surely is not a good match for Lucene :)

Is that a good reflection of your use case?

On Fri, Jan 22, 2021 at 9:52 AM Martynas L wrote:

> Please see the sample at
> https://drive.google.com/drive/folders/1ufVZXzkugBAFnuy8HLAY6mbPWzjknrfE
>
> IndexGenerator - creates a dummy index.
> IndexReader - retrieves documents - duration with the 7.5.0 version is
> ~2s, while ~6s with 8.7.0
>
> Regards,
> Martynas
Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
There is no attachment in the previous email that I can see? Maybe you can post it online?

On Thu, Jan 21, 2021 at 4:54 PM Martynas L wrote:

> Hello,
>
> Are there any comments on this issue?
> If there is no workaround, we will be forced to roll back to the 7.5.0
> version.
>
> Best regards,
> Martynas
>
> On Tue, Jan 12, 2021 at 12:27 PM Martynas L wrote:
>
> > Hi,
> >
> > Please see the attached sample.
> > IndexGenerator - creates a dummy index.
> > IndexReader - retrieves documents - duration with the 7.5.0 version is
> > ~2s, while ~6s with 8.7.0
> >
> > Regards,
> > Martynas
> >
> > On Tue, Dec 22, 2020 at 3:23 PM Vincenzo D'Amore wrote:
> >
> >> I think it would be useful to have an example of a document and, if
> >> possible, an example of a query that takes too long.
> >>
> >> On Mon, Dec 21, 2020 at 1:47 PM Martynas L wrote:
> >>
> >> > Hello,
> >> >
> >> > I am sorry for the delay.
> >> >
> >> > Not sure what you mean by "workload". We have performance tests,
> >> > which started failing after upgrading to 8.7.0.
> >> > So I just tried to query the index (built from the same source) to
> >> > get all documents and compare the performance with 7.5.0.
> >> >
> >> > Document "size" is the sum of all stored string lengths (3402519
> >> > documents):
> >> >
> >> > doc size 903 - 88s vs 22s
> >> >
> >> > doc size 36 (only one field loaded, used searcher.doc(docID,
> >> > Collections.singleton("fieldName"))) - 78s vs 16s
> >> >
> >> > doc size 439 (some fields made not stored) - 46s vs 14.5s
> >> >
> >> > Best regards,
> >> > Martynas
> >> >
> >> > On Fri, Dec 4, 2020 at 12:06 AM Adrien Grand wrote:
> >> >
> >> > > Hello Martynas,
> >> > >
> >> > > There have indeed been changes related to stored fields in 8.7.
> >> > > What does your workload look like and how large are your documents
> >> > > on average?
> >> > >
> >> > > On Thu, Dec 3, 2020 at 3:04 PM Martynas L wrote:
> >> > >
> >> > > > Hi,
> >> > > > We've migrated from 7.5.0 to 8.7.0 and found out that the index
> >> > > > "searching" is significantly (4-5 times) slower in the latest
> >> > > > version.
> >> > > > It seems that org.apache.lucene.search.IndexSearcher#doc(int)
> >> > > > is slower.
> >> > > >
> >> > > > Is it possible to have similar performance with 8.7.0?
> >> > > >
> >> > > > Best regards,
> >> > > > Martynas
> >> > >
> >> > > --
> >> > > Adrien
> >>
> >> --
> >> Vincenzo D'Amore
Fwd: best way (performance wise) to search for field without value?
To follow up: based on a quick JMH test with 2M docs with some random data, I see a speedup of 70% :) That is a nice Friday-afternoon gift, thanks!

For people who are interested, I added a BinaryDocValues field like this:

doc.add(new BinaryDocValuesField("GROUPS_ALLOWED_EMPTY", new BytesRef(new byte[] { 0x01 })));

And used:

finalQuery.add(new DocValuesFieldExistsQuery("GROUPS_ALLOWED_EMPTY"), BooleanClause.Occur.SHOULD);

On Fri, Nov 13, 2020 at 2:09 PM Michael McCandless <luc...@mikemccandless.com> wrote:

> Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must enable
> norms on your field to use that.
>
> TermRangeQuery is indeed a horribly costly way to execute this, but if you
> cache the result on each refresh, perhaps it is OK?
>
> You could also index a dedicated doc values field indicating that the
> field is empty and then use DocValuesFieldExistsQuery.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
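For readers of the archive, here is a hedged sketch of how the marker-field idea above can be wired into the security filter described in the original question. Field names follow the thread; `userQuery`, `userGroups`, and the class/method names are illustrative assumptions, not the actual application code:

```java
import java.util.List;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DocValuesFieldExistsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

class GroupSecurity {

    // Indexing side: documents visible to everyone get a marker doc-values
    // field instead of an empty 'groups_allowed' field.
    static void markVisibleToAll(Document doc) {
        doc.add(new BinaryDocValuesField("GROUPS_ALLOWED_EMPTY",
                new BytesRef(new byte[] { 0x01 })));
    }

    // Query side: a document matches if any of the user's groups is allowed,
    // OR if the "visible to all" marker field exists on it.
    static Query secure(Query userQuery, List<String> userGroups) {
        BooleanQuery.Builder allowed = new BooleanQuery.Builder();
        for (String group : userGroups) {
            allowed.add(new TermQuery(new Term("groups_allowed", group)),
                    BooleanClause.Occur.SHOULD);
        }
        allowed.add(new DocValuesFieldExistsQuery("GROUPS_ALLOWED_EMPTY"),
                BooleanClause.Occur.SHOULD);

        // FILTER: the restriction must match but should not affect scoring.
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(allowed.build(), BooleanClause.Occur.FILTER)
                .build();
    }
}
```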
best way (performance wise) to search for field without value?
Hi all,

We have implemented some security on our index by adding a field 'groups_allowed' to documents, and we wrap a boolean MUST query around the original query that checks whether one of the given user groups matches at least one 'groups_allowed' value.

We chose to leave the 'groups_allowed' field empty when the document should be retrievable by all users, so we also need to select a document if 'groups_allowed' is empty.

What would be the fastest Query construction to do so?

Currently I use a TermRangeQuery that basically matches all values and put that in a MUST_NOT combined with a MatchAllDocsQuery, but that gets rather slow when the number of groups is high.

Thanks!
Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?
Ah, that makes sense. Thanks! (I might re-run on a larger index just to learn how it works in more detail.)

On Tue, Oct 13, 2020 at 1:24 PM Adrien Grand wrote:

> 100,000+ requests per core per second is a lot. :) My initial reaction is
> that the query is likely so fast on that index that the bottleneck might be
> rewriting or the initialization of weights/scorers (which don't get more
> costly as the index gets larger) rather than actual query execution, which
> means that we can't really conclude that the boolean query is faster than
> the TermInSetQuery.
>
> Also beware that IndexSearcher#count will look at index statistics if your
> queries have a single term, which would no longer work if you use this
> query as a filter for another query.
Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?
I reduced the benchmark as far as I could, and now got these results, TermsInSet being a lot slower compared to the Terms/SHOULD:

BenchmarkOrQuery.benchmarkTerms       thrpt    5  190820.510 ± 16667.411  ops/s
BenchmarkOrQuery.benchmarkTermsInSet  thrpt    5  110548.345 ±  7490.169  ops/s

@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTerms(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final BooleanQuery.Builder b = new BooleanQuery.Builder();

        for (final String role : myState.user.getAdditionalRoles()) {
            b.add(new TermQuery(new Term(roles, new BytesRef(role))),
                    BooleanClause.Occur.SHOULD);
        }
        searcher.count(b.build());
    } catch (final IOException e) {
        e.printStackTrace();
    }
}

@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTermsInSet(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final Set<BytesRef> roles = myState.user.getAdditionalRoles().stream()
                .map(BytesRef::new).collect(Collectors.toSet());
        searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
    } catch (final IOException e) {
        e.printStackTrace();
    }
}
Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?
Hello Adrien,

Thanks for the swift reply. I'll add the details:

Lucene version: 8.6.2

The restrictionQuery is indeed a conjunction; it allows a document to be a hit if the 'roles' field is empty as well. It's used within a bigger query builder, so maybe I did something else wrong. I'll rewrite the benchmark to just benchmark the TermsInSet and Terms.

It never occurred (hah) to me to use Occur.FILTER; that is a good point to check as well.

As you put it, I would expect the results to be very similar, as I do not reach the 16 terms in the TermInSet. I'll let you know what I find.

On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand wrote:

> Can you give us a few more details:
> - What version of Lucene are you testing?
> - Are you benchmarking "restrictionQuery" on its own, or its conjunction
> with another query?
>
> You mentioned that you combine your "restrictionQuery" and the user query
> with Occur.MUST. Occur.FILTER feels more appropriate for "restrictionQuery"
> since it should not contribute to scoring.
>
> TermsInSetQuery automatically executes like a BooleanQuery when the number
> of clauses is less than 16, so I would not expect major performance
> differences between a TermInSetQuery over less than 16 terms and a
> BooleanQuery wrapped in a ConstantScoreQuery.
unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?
Hello,

I'm benchmarking an application which implements security on Lucene by adding a multi-value field "roles". If the user has one of these roles, he can find the document.

I implemented this as a boolean AND query: I added the original query and the restriction with Occur.MUST.

I'm having some performance issues when counting the index (>60M docs), so I thought about tweaking this restriction implementation.

I set up a benchmark like this:

I generate 2M documents. Each document has a multi-value "roles" field. The "roles" field in each document has 4 values, taken from (2, 2, 1000, 100) unique values. The user has (1, 1, 2, 1) values for roles (so 1 out of the 2 for the first role, 1 out of 2 for the second, 2 out of the 1000 for the third value, and 1 out of 100 for the fourth).

I got a somewhat unexpected performance difference. At first, I implemented the restriction query like this:

for (final String role : roles) {
    restrictionQuery.add(new TermQuery(new Term("roles", new BytesRef(role))), Occur.SHOULD);
}

I then switched to a TermInSetQuery, which I thought would be faster as it is using constant scores:

final Set<BytesRef> rolesSet = roles.stream().map(BytesRef::new).collect(Collectors.toSet());
restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);

However, the TermInSetQuery has about 25% slower ops/s. Is that to be expected? I did not expect it, as I thought the constant scoring would be faster.
find documents with big stored fields
Hello,

We are currently trying to investigate an issue where the index size is disproportionately large for the number of documents. We see that the .fdt file is more than 10 times the regular size. Reading the docs, I found that this file contains the field data.

I would like to find the documents and/or field names/contents with extreme sizes, so we can delete those from the index without needing to re-index all data.

What would be the best approach for this?

Thanks,
Rob Audenaerde
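One possible approach (a sketch only, against Lucene 8.x era APIs; the class name and threshold are illustrative assumptions) is to scan every document, approximate its stored size by summing the lengths of its stored string and binary fields, and report the outliers:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;

class StoredSizeScan {

    // Prints the id and approximate stored size of each document whose
    // stored fields exceed thresholdBytes. Note: deleted-but-unmerged
    // documents are also visited, since this iterates all doc ids.
    static void reportLargeDocs(IndexReader reader, int thresholdBytes) throws IOException {
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            Document doc = reader.document(docId); // loads stored fields only
            int size = 0;
            for (IndexableField field : doc.getFields()) {
                String s = field.stringValue();
                if (s != null) {
                    size += s.length();
                }
                BytesRef b = field.binaryValue();
                if (b != null) {
                    size += b.length;
                }
            }
            if (size > thresholdBytes) {
                System.out.println("doc=" + docId + " approxStoredSize=" + size);
            }
        }
    }
}
```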
force deletes - terms enum still has deleted terms?
Hi all,

We build an FST on the terms of our index by iterating the terms of the readers for our fields, like this:

for (final LeafReaderContext ctx : leaves) {
    final LeafReader leafReader = ctx.reader();
    for (final String indexField : indexFields) {
        final Terms terms = leafReader.terms(indexField);
        // If the field does not exist in this reader, then we get null, so check for that.
        if (terms != null) {
            final TermsEnum termsEnum = terms.iterator();

However, the building of the FST sometimes seems to find terms that are from documents that are deleted. This is what we expect, checking the javadocs. So, now we switched the IndexWriter to a config with a TieredMergePolicy with setForceMergeDeletesPctAllowed(0). When calling indexWriter.forceMergeDeletes(true) we expect that there will be no more deletes. However, the deleted terms still sometimes appear. We use DirectoryReader.openIfChanged() to refresh the reader before iterating the terms.

Are we forgetting something?

Thanks in advance,
Rob Audenaerde
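One way around this, instead of relying on forceMergeDeletes() to physically prune every deleted document, is to check each term's postings against the segment's live docs and skip terms that occur only in deleted documents. A sketch under that assumption (method and class names are illustrative; the actual FST-building step is left as a comment):

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

class LiveTermsOnly {

    static void collectLiveTerms(LeafReader leafReader, String indexField) throws IOException {
        Terms terms = leafReader.terms(indexField);
        if (terms == null) {
            return; // field not present in this segment
        }
        Bits liveDocs = leafReader.getLiveDocs(); // null means: no deletions in this segment
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null;
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            boolean hasLiveDoc = (liveDocs == null);
            if (!hasLiveDoc) {
                // Walk the term's postings until we find one live document.
                postings = termsEnum.postings(postings, PostingsEnum.NONE);
                for (int doc = postings.nextDoc();
                     doc != DocIdSetIterator.NO_MORE_DOCS && !hasLiveDoc;
                     doc = postings.nextDoc()) {
                    hasLiveDoc = liveDocs.get(doc);
                }
            }
            if (hasLiveDoc) {
                // feed `term` into the FST builder here
            }
        }
    }
}
```

This trades extra postings reads for correctness that does not depend on merge timing.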
Re: Lucene nested query
Your query can be seen as an inner join:

select t0.* from employee t0
inner join employee t1 on t0.dept_no = t1.dept_no
where t1.email = 'a...@email.com'

Maybe JoinUtil can help you:
http://lucene.apache.org/core/7_0_0/join/org/apache/lucene/search/join/JoinUtil.html?is-external=true

On Tue, Apr 10, 2018 at 10:44 AM, Khurram Shehzad wrote:

> Hi guys!
>
> I've a scenario where the Lucene query depends on the result of another
> Lucene query.
>
> For example, find all the employees of the department where one of its
> employee's email address = 'a...@email.com'
>
> SQL would be like:
>
> select * from employee where dept_no in (
>   select dept_no from employee where email = 'a...@email.com'
> )
>
> Please note that employee is a huge table and the inner query can result
> in 5 million rows.
>
> Any thoughts how to replicate this scenario using core Lucene?
>
> Regards,
> Khurram
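A hedged sketch of what that join might look like with JoinUtil, assuming the employee documents index `email` and `dept_no` as single-valued string terms and `searcher` is an IndexSearcher over that index (the class and method names are illustrative):

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

class DeptJoin {

    // Finds all employees in the same department(s) as the employee
    // with the given email address.
    static TopDocs employeesInSameDept(IndexSearcher searcher, String email) throws IOException {
        // "from" side: the employee(s) matching the email.
        Query fromQuery = new TermQuery(new Term("email", email));
        // Join dept_no -> dept_no: select all docs whose dept_no equals a
        // dept_no found on the documents matching fromQuery.
        Query joinQuery = JoinUtil.createJoinQuery(
                "dept_no",      // fromField
                false,          // multipleValuesPerDocument
                "dept_no",      // toField
                fromQuery,
                searcher,       // searcher used to resolve the "from" side
                ScoreMode.None);
        return searcher.search(joinQuery, 100);
    }
}
```

Note the join values are collected per query execution, so a very large "from" result set (the 5 million rows mentioned) will still be the dominant cost.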
Re: indexing performance 6.6 vs 7.1
Hi Adrien,

Thanks for the response. Good points too!

We actually went with a smallish benchmark to be able to profile the application within reasonable time. We will do a larger benchmark (say, 1M documents, without profiling) and I will revisit the commit code as well. (IIRC we actually increased the commit frequency a while back because of issues (maybe out-of-memory issues; it was in the Lucene 4.x time), but this might no longer be relevant.)

What I don't understand yet is how this difference (between 6 and 7) came to be; I was reading the changelog but could not really pinpoint it. Sure, the commits are far from optimal, but we use the same commit strategy between 6.6 and 7.1.

-Rob

On Wed, Jan 31, 2018 at 1:56 PM, Adrien Grand wrote:

> Hi Rob,
>
> I don't think your benchmark is good. If I read it correctly, it only
> indexes between 21k and 22k documents, which is tiny. Plus it should try
> to better replicate the production workload, otherwise we will draw wrong
> conclusions.
>
> I also suspect something is not quite right in your indexing code. When I
> look at the IW logs, 562 out of the 642 flushes only write 1 document. I'm
> not surprised that it exacerbates the cost of checksums, which are cheaper
> to compute on one large file than on many tiny files. For the record, even
> committing every 5k documents still sounds too frequent to me for an
> application that is heavily indexing. Maybe you should consider moving to
> a time-based policy? E.g. commit every 10 minutes?
>
> Le mer. 31 janv. 2018 à 10:25, Rob Audenaerde a écrit :
>
> > Hi all,
> >
> > We ran the benchmarks (6.6 vs 7.1) with the IW info stream, and (as the
> > attachment cannot be too large) I uploaded them to Google Drive. They
> > can be found here:
> >
> > https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh
> >
> > Thanks in advance,
> > -Rob
> >
> > On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde wrote:
> >
> > > Hi Uwe,
> > >
> > > Thanks for the reply. We commit often. Actually, in the benchmark, we
> > > commit every 60 documents (but we will run a larger set with fewer
> > > commits). The number of commits we call does not change between 6.6
> > > and 7.1. In our production systems we commit every 5000 documents.
> > >
> > > We dug deeper into the commit methods, and currently see that the main
> > > difference seems to be the calls to java.util.zip.Checksum.update().
> > > The number of calls to that method in 6.6 is around 11M, and in 7.1
> > > 21M, so almost twice the calls.
> > >
> > > -Rob
> > >
> > > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler wrote:
> > >
> > >> Hi,
> > >>
> > >> How often do you commit? If you index the data initially (that's the
> > >> case where indexing needs to be fast), one would call commit at the
> > >> end of the whole job, so the actual time it takes is not so
> > >> important.
> > >>
> > >> If you have a system where the index is updated all the time, then of
> > >> course committing is also something you have to take into account.
> > >> Systems like Solr or Elasticsearch use a transaction log in parallel
> > >> to indexing, so they commit very seldom. If the system crashes, the
> > >> changes are replayed from the translog since the last commit.
> > >>
> > >> Uwe
> > >>
> > >> -
> > >> Uwe Schindler
> > >> Achterdiek 19, D-28357 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: u...@thetaphi.de
> > >>
> > >> > -----Original Message-----
> > >> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> > >> > Sent: Monday, January 29, 2018 11:29 AM
> > >> > To: java-user@lucene.apache.org
> > >> > Subject: Re: indexing performance 6.6 vs 7.1
> > >> >
> > >> > Hi all,
> > >> >
> > >> > Some follow up (sorry for the delay).
> > >> >
> > >> > We built a benchmark in our application, and profiled it (on a
> > >> > smallish data set). What we currently see in the profiler is that
> > >> > in Lucene 7.1 the calls to `commit()` take much longer.
> > >> >
> > >> > The self-time committing in 6.6: 3,215 ms
> > >> > The self-time committing in
Re: indexing performance 6.6 vs 7.1
Hi all, We ran the benchmarks (6.6 vs 7.1) with IW info stream and (as attachment cannot be too large) I uploaded them to google drive. They can be found here: https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh Thanks in advance, -Rob On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde wrote: > Hi Uwe, > > Thanks for the reply. We commit often. Actually, in the benchmark, we > commit every 60 documents (but we will run a larger set with less commits). > The number of commits we call does not change between 6.6. and 7.1. In our > production systems we commit every 5000 documents. > > We dug deeper into the commit methods, and currently see the main > difference seems to be the calls to the java.util.zit.Checksum.update(). > The number of calls to that method in 6.6 is around 11M , and 7.1 21M, so > almost twice the calls. > > -Rob > > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler wrote: > >> Hi, >> >> How often do you commit? If you index the data initially (that's the case >> where indexing needs to be fast), one would call commit at the end of the >> whole job, so the actual time it takes is not so important. >> >> If you have a system where the index is updated all the time, then of >> course committing is also something you have to take into account. Systems >> like Solr or Elasticsearch use a transaction log in parallel to indexing, >> so they commit very seldom. If the system crashes, the changes are replayed >> from tranlog since last commit. >> >> Uwe >> >> - >> Uwe Schindler >> Achterdiek 19, D-28357 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >> > -Original Message- >> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com] >> > Sent: Monday, January 29, 2018 11:29 AM >> > To: java-user@lucene.apache.org >> > Subject: Re: indexing performance 6.6 vs 7.1 >> > >> > Hi all, >> > >> > Some follow up (sorry for the delay). >> > >> > We built a benchmark in our application, and profiled it (on a smallish >> > data set). 
What we currently see in the profiler is that in Lucene 7.1 >> the >> > calls to `commit()` take much longer. >> > >> > The self-time committing in 6.6: 3,215 ms >> > The self-time committing in 7.1: 10,187 ms. >> > >> > We will try to run a larger data set and also later with the IW info >> > stream. >> > >> > -Rob >> > >> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson < >> erickerick...@gmail.com> >> > wrote: >> > >> > > Robert: >> > > >> > > Ah, right. I keep confusing my gmail lists >> > > "lucene dev" >> > > and >> > > "lucene list" >> > > >> > > Siiih. >> > > >> > > >> > > >> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand >> > wrote: >> > > > If you have sparse data, I would have expected index time to >> *decrease*, >> > > > not increase. >> > > > >> > > > Can you enable the IW info stream and share flush + merge times to >> see >> > > > where indexing time goes? >> > > > >> > > > If you can run with a profiler, this might also give useful >> information. >> > > > >> > > > Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde >> > >> > > a >> > > > écrit : >> > > > >> > > >> Hi all, >> > > >> >> > > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant >> drop >> > > in >> > > >> indexing performace. >> > > >> >> > > >> We have a-typical use of Lucene, as we (also) index some database >> > tables >> > > >> and add all the values as AssociatedFacetFields as well. This >> allows us >> > > to >> > > >> create pivot tables on search results really fast. >> > > >> >> > > >> These tables have some overlapping columns, but also disjoint ones. >> > > >> >> > > >> We anticipated a decrease in index size because of the sparse >> > > docvalues. We >> > > >> see this happening, with decreases to ~50%-80% of the original >> index >> > > size. >> > > >> But we did not expect an drop in indexing performance (client >> systems >> > > >> indexing time increased with +50% to +250%). 
>> > > >> >> > > >> (Our indexing-speed used to be mainly bound by the speed the >> > Taxonomy >> > > could >> > > >> deliver new ordinals for new values, currently we are >> investigating if >> > > this >> > > >> is still the case, will report later when a profiler run has been >> done) >> > > >> >> > > >> Does anyone know if this increase in indexing time is to be >> expected as >> > > >> result of the sparse docvalues change? >> > > >> >> > > >> Kind regards, >> > > >> >> > > >> Rob Audenaerde >> > > >> >> > > >> > > - >> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> > > For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > >> > > >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >
Re: indexing performance 6.6 vs 7.1
Hi Uwe, Thanks for the reply. We commit often. Actually, in the benchmark, we commit every 60 documents (but we will run a larger set with fewer commits). The number of commits we call does not change between 6.6 and 7.1. In our production systems we commit every 5000 documents. We dug deeper into the commit methods, and currently the main difference seems to be the calls to java.util.zip.Checksum.update(). The number of calls to that method in 6.6 is around 11M, and in 7.1 around 21M, so almost twice the calls. -Rob On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler wrote: > Hi, > > How often do you commit? If you index the data initially (that's the case > where indexing needs to be fast), one would call commit at the end of the > whole job, so the actual time it takes is not so important. > > If you have a system where the index is updated all the time, then of > course committing is also something you have to take into account. Systems > like Solr or Elasticsearch use a transaction log in parallel to indexing, > so they commit very seldom. If the system crashes, the changes are replayed > from the translog since the last commit. > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com] > > Sent: Monday, January 29, 2018 11:29 AM > > To: java-user@lucene.apache.org > > Subject: Re: indexing performance 6.6 vs 7.1 > > > > Hi all, > > > > Some follow up (sorry for the delay). > > > > We built a benchmark in our application, and profiled it (on a smallish > > data set). What we currently see in the profiler is that in Lucene 7.1 > the > > calls to `commit()` take much longer. > > > > The self-time committing in 6.6: 3,215 ms > > The self-time committing in 7.1: 10,187 ms. > > > > We will try to run a larger data set and also later with the IW info > > stream. 
> > > > -Rob > > > > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson > > > wrote: > > > > > Robert: > > > > > > Ah, right. I keep confusing my gmail lists > > > "lucene dev" > > > and > > > "lucene list" > > > > > > Siiih. > > > > > > > > > > > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand > > wrote: > > > > If you have sparse data, I would have expected index time to > *decrease*, > > > > not increase. > > > > > > > > Can you enable the IW info stream and share flush + merge times to > see > > > > where indexing time goes? > > > > > > > > If you can run with a profiler, this might also give useful > information. > > > > > > > > Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde > > > > > a > > > > écrit : > > > > > > > >> Hi all, > > > >> > > > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant > drop > > > in > > > >> indexing performace. > > > >> > > > >> We have a-typical use of Lucene, as we (also) index some database > > tables > > > >> and add all the values as AssociatedFacetFields as well. This > allows us > > > to > > > >> create pivot tables on search results really fast. > > > >> > > > >> These tables have some overlapping columns, but also disjoint ones. > > > >> > > > >> We anticipated a decrease in index size because of the sparse > > > docvalues. We > > > >> see this happening, with decreases to ~50%-80% of the original index > > > size. > > > >> But we did not expect an drop in indexing performance (client > systems > > > >> indexing time increased with +50% to +250%). > > > >> > > > >> (Our indexing-speed used to be mainly bound by the speed the > > Taxonomy > > > could > > > >> deliver new ordinals for new values, currently we are investigating > if > > > this > > > >> is still the case, will report later when a profiler run has been > done) > > > >> > > > >> Does anyone know if this increase in indexing time is to be > expected as > > > >> result of the sparse docvalues change? 
> > > >> > > > >> Kind regards, > > > >> > > > >> Rob Audenaerde > > > >> > > > > > > - > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: indexing performance 6.6 vs 7.1
Hi all, Some follow up (sorry for the delay). We built a benchmark in our application, and profiled it (on a smallish data set). What we currently see in the profiler is that in Lucene 7.1 the calls to `commit()` take much longer. The self-time committing in 6.6: 3,215 ms The self-time committing in 7.1: 10,187 ms. We will try to run a larger data set and also later with the IW info stream. -Rob On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson wrote: > Robert: > > Ah, right. I keep confusing my gmail lists > "lucene dev" > and > "lucene list" > > Siiih. > > > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand wrote: > > If you have sparse data, I would have expected index time to *decrease*, > > not increase. > > > > Can you enable the IW info stream and share flush + merge times to see > > where indexing time goes? > > > > If you can run with a profiler, this might also give useful information. > > > > Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde > a > > écrit : > > > >> Hi all, > >> > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant drop > in > >> indexing performace. > >> > >> We have a-typical use of Lucene, as we (also) index some database tables > >> and add all the values as AssociatedFacetFields as well. This allows us > to > >> create pivot tables on search results really fast. > >> > >> These tables have some overlapping columns, but also disjoint ones. > >> > >> We anticipated a decrease in index size because of the sparse > docvalues. We > >> see this happening, with decreases to ~50%-80% of the original index > size. > >> But we did not expect an drop in indexing performance (client systems > >> indexing time increased with +50% to +250%). 
> >> > >> (Our indexing-speed used to be mainly bound by the speed the Taxonomy > could > >> deliver new ordinals for new values, currently we are investigating if > this > >> is still the case, will report later when a profiler run has been done) > >> > >> Does anyone know if this increase in indexing time is to be expected as > >> result of the sparse docvalues change? > >> > >> Kind regards, > >> > >> Rob Audenaerde > >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
indexing performance 6.6 vs 7.1
Hi all, We recently upgraded from Lucene 6.6 to 7.1. We see a significant drop in indexing performance. We have atypical use of Lucene, as we (also) index some database tables and add all the values as AssociationFacetFields as well. This allows us to create pivot tables on search results really fast. These tables have some overlapping columns, but also disjoint ones. We anticipated a decrease in index size because of the sparse docvalues. We see this happening, with decreases to ~50%-80% of the original index size. But we did not expect a drop in indexing performance (client systems indexing time increased by +50% to +250%). (Our indexing speed used to be mainly bound by the speed at which the Taxonomy could deliver new ordinals for new values; we are currently investigating if this is still the case and will report later when a profiler run has been done.) Does anyone know if this increase in indexing time is to be expected as a result of the sparse docvalues change? Kind regards, Rob Audenaerde
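Enabling the IW info stream, as requested earlier in this thread, can be done like this (a sketch; the log path, index path, and analyzer are placeholders, not from the original messages):

```java
import java.io.PrintStream;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.PrintStreamInfoStream;

public class InfoStreamExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // logs flush, merge and commit activity (with timings) to the given stream
        iwc.setInfoStream(new PrintStreamInfoStream(
                new PrintStream("iw-infostream.log", "UTF-8")));
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
            // ... index documents as usual; flush/merge details appear in the log
            writer.commit();
        }
    }
}
```

The resulting log is what Adrien reads above to see, for example, how many flushes wrote only a single document.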
Re: Lucene update performance
As far as I know, the updateDocument method on the IndexWriter does a delete and an add. See also the javadoc: [..] Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add). [..] On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz wrote: > I do update the entire document each time. Furthermore, this sometimes > means deleting compressed archives which are stored as multiple documents > for each compressed archive file and re-adding them. > > Is there an update method, and does it perform better than remove-then-add? I > was simply removing modified files from the index (which doesn't seem to > take long), and re-adding them. > > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde > wrote: > > > Do you update each entire document? (vs updating numeric docvalues?) > > > > That is implemented as 'delete and add' so I guess that will be slower > than > > clean sheet indexing. Not sure if it is 3x slower, that seems a bit much? > > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz > > wrote: > > > > > Hi, > > > > > > For a 5.2.1 index that contains around 1.2 million documents, updating > > the > > > index with 1.3 million files seems to take 3X longer than doing a > scratch > > > indexing. (Files are crawled over NFS, indexes are stored on a > mechanical > > > disk locally (Btrfs)) > > > > > > Is this expected for Lucene's update index logic, or should I further > > debug > > > my part of the code for update performance? > > > > > > Thank you, > > > Kudret > > > > > >
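A minimal sketch of the updateDocument call described above. Using a "path" field as the unique key is an assumption for illustration, not something stated in the thread:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class Reindexer {
    // atomically replaces any existing document(s) whose "path" term matches,
    // then adds the new version (the delete-then-add the javadoc describes)
    public static void reindexFile(IndexWriter writer, String path, String contents)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new TextField("contents", contents, Field.Store.NO));
        writer.updateDocument(new Term("path", path), doc);
    }
}
```

For the archives-stored-as-multiple-documents case, a deleteDocuments(Term) on a shared archive key followed by addDocument calls mirrors what the poster already does; updateDocument only replaces documents matching a single term.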
Re: Lucene update performance
Do you update each entire document? (vs updating numeric docvalues?) That is implemented as 'delete and add' so I guess that will be slower than clean sheet indexing. Not sure if it is 3x slower, that seems a bit much? On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz wrote: > Hi, > > For a 5.2.1 index that contains around 1.2 million documents, updating the > index with 1.3 million files seems to take 3X longer than doing a scratch > indexing. (Files are crawled over NFS, indexes are stored on a mechanical > disk locally (Btrfs)) > > Is this expected for Lucene's update index logic, or should I further debug > my part of the code for update performance? > > Thank you, > Kudret >
Re: Autocomplete using facet labels?
Thanks Erick for your reply. I see you refer to Solr sources while I was hoping for Lucene suggestions. I hadn't thought of the idea of reverse indexing the facet values and will consider it. Meanwhile I will try to explore how I might use the TaxonomyIndex as well, as this should contain the FacetLabels I'd like to use. -Rob On Wed, Apr 12, 2017 at 5:00 PM, Erick Erickson wrote: > First take a look at autosuggest. That does some great stuff if you > can build the autocomplete dictionary only periodically, which can be > somewhat expensive. See: > https://lucidworks.com/2015/03/04/solr-suggester/ > > There are lighter-weight ways to autosuggest that should be extremely > fast, in particular index your stuff backwards in a suggest field as: > John Doe - Author > > and use TermsComponent on that for instance. TermsComponent is pretty > literal, i.e. it's case-sensitive but you can send terms.prefix=jo and > case things properly on the app side. > > Best, > Erick > > On Wed, Apr 12, 2017 at 6:33 AM, Rob Audenaerde > wrote: > > I have a Lucene (6.4.2) index with about 2-5M documents, and each > document > > has about 10 facets (for example 'author', 'publisher', etc). The facets > > might have up to 100.000 different values. > > > > I have a search bar on top of my application, and would like to implement > > autocomplete using the facets. For example, when the user enters 'Jo' I > > would like the options to be: > > > > 'John Doe - Author' > > 'Jonatan Driver - Publisher' > > 'Joan Deville - Author' > > ... > > > > My facets are structured using the FacetFields and Lucene Taxonomy, like > > this: > > > > 'Author / John Doe' > > 'Author / Joan Deville' > > ... > > > > Are there built-in options to create such an autocomplete? Or do I have > to > > build it myself? > > > > I prefer not to do a search on all the matching documents and collect > > facets for those, because that is not very fast > > > > Any hints? 
> > > > Thanks in advance, > > Rob Audenaerde > > > > See also: > > http://stackoverflow.com/questions/43369715/lucene- > autocomplete-using-facet-labels > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Autocomplete using facet labels?
I have a Lucene (6.4.2) index with about 2-5M documents, and each document has about 10 facets (for example 'author', 'publisher', etc). The facets might have up to 100.000 different values. I have a search bar on top of my application, and would like to implement autocomplete using the facets. For example, when the user enters 'Jo' I would like the options to be: 'John Doe - Author' 'Jonatan Driver - Publisher' 'Joan Deville - Author' ... My facets are structured using the FacetFields and Lucene Taxonomy, like this: 'Author / John Doe' 'Author / Joan Deville' ... Are there built-in options to create such an autocomplete? Or do I have to build it myself? I prefer not to do a search on all the matching documents and collect facets for those, because that is not very fast Any hints? Thanks in advance, Rob Audenaerde See also: http://stackoverflow.com/questions/43369715/lucene-autocomplete-using-facet-labels
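One way to build this yourself (my own sketch, not an answer from the thread): walk the taxonomy and feed "label - dimension" entries into an AnalyzingInfixSuggester. The suggester index path is a placeholder, and the assumption is that all facet paths are two components deep (dimension / label):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class FacetLabelSuggester {
    public static AnalyzingInfixSuggester build(TaxonomyReader taxoReader) throws Exception {
        AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
                FSDirectory.open(Paths.get("/tmp/suggest-index")), new StandardAnalyzer());
        // iterate all taxonomy ordinals; getPath() resolves each to its label
        for (int ord = TaxonomyReader.ROOT_ORDINAL + 1; ord < taxoReader.getSize(); ord++) {
            FacetLabel label = taxoReader.getPath(ord);
            if (label.length == 2) { // e.g. ["Author", "John Doe"]
                String entry = label.components[1] + " - " + label.components[0];
                suggester.add(new BytesRef(entry), null, 1, null);
            }
        }
        suggester.refresh(); // make the added entries visible to lookup()
        return suggester;
    }
}
```

A lookup such as suggester.lookup("Jo", 10, true, false) would then return entries like "John Doe - Author". This avoids collecting facets over matching documents at query time, at the cost of rebuilding or incrementally updating the suggester when the taxonomy changes.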
Re: commit frequency guideline?
Thanks for the quick reply! >What do you mean by "Lucene complain about too-many uncommitted docs"? --> good question, I was thoughtlessly echoing words from my colleague. I asked him and he said that it was about taking very long to commit and memory issues. So maybe this wasn't the best opening statement :) For the other part of the question: we need users to see the changed documents immediately, but I think we have this covered by using NRT Readers and the SearcherManager. Am I correct to conclude calling commit() is not necessary for finding recently changed documents? I think we can then switch to a time based commit() where we just call commit every 5 minutes, in effect losing a maximum of 5 minutes of work (which we can mitigate in another way) when the server somehow stops working. Thank you, -Rob On Wed, Nov 30, 2016 at 3:17 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > What do you mean by "Lucene complain about too-many uncommitted docs"? > Lucene does not really care how frequently you commit... > > How frequently you commit is really your choice, i.e. what risk you > see of power loss / OS crash vs the cost (not just in CPU/IO work for > the computer, but in the users not seeing the recently indexed > documents for a while) of replaying those documents since the last > commit when power comes back. > > Pushing durability back into the queue/channel can be a nice option > too, e.g. Kafka, so that your application doesn't need to keep track > of which docs were not yet committed. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Nov 30, 2016 at 8:50 AM, Rob Audenaerde > wrote: > > Hi all, > > > > Currently we call commit() many times on our index (about 5M docs, where > > some 10.000-100.000 modifications during the day). The commit times > > typically get more expensive when the index grows, up to several seconds, > > so we want to reduce the number of calls. 
> > > > (Historically, we had Lucene complain about too-many uncommitted docs > > sometimes, so we went with the commit often approach.) > > > > What is a good strategy for calling commit? Fixed frequency? After X > docs? > > Combination? > > > > I'm curious what is considered 'industry-standard'. Can you share some of > > your expercience? > > > > Thanks! > > > > -Rob >
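The plan above (NRT readers for visibility, commit every 5 minutes for durability) can be sketched like this; the writer and manager are assumed to come from the application, and the refresh interval is an illustrative choice:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

public class CommitScheduler {
    public static ScheduledExecutorService start(IndexWriter writer, SearcherManager manager) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // visibility: refresh the NRT reader frequently; no commit() is needed
        // for recently changed documents to become searchable
        scheduler.scheduleAtFixedRate(() -> {
            try { manager.maybeRefresh(); } catch (Exception e) { /* log */ }
        }, 1, 1, TimeUnit.SECONDS);
        // durability: commit every 5 minutes; a crash loses at most 5 minutes
        // of work, to be replayed from the application's own queue/log
        scheduler.scheduleAtFixedRate(() -> {
            try { writer.commit(); } catch (Exception e) { /* log and alert */ }
        }, 5, 5, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

This keeps the two concerns separate, matching Mike's point: refresh frequency controls what searchers see, commit frequency controls what survives a power loss.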
commit frequency guideline?
Hi all, Currently we call commit() many times on our index (about 5M docs, with some 10.000-100.000 modifications during the day). The commit times typically get more expensive when the index grows, up to several seconds, so we want to reduce the number of calls. (Historically, we had Lucene complain about too-many uncommitted docs sometimes, so we went with the commit-often approach.) What is a good strategy for calling commit? Fixed frequency? After X docs? Combination? I'm curious what is considered 'industry-standard'. Can you share some of your experience? Thanks! -Rob
Re: wicket datatable, row selection, update another component
Whoops! You are correct! Sorry 'bout that. On Fri, Oct 28, 2016 at 1:26 PM, Alan Woodward wrote: > Hi Rob, I think you posted this to the wrong mailing list? > > Alan Woodward > www.flax.co.uk > > > > On 28 Oct 2016, at 12:13, Rob Audenaerde > wrote: > > > > Hi all, > > > > I have a DataTable which, in onConfigure(), sets a selected item. I want > > another (detail) panel, outside of this component, to react on that > > selection i.e. set it's visibility and render details of the selected > item. > > > > What I see is that the onConfigure() of the detail component is called > > BEFORE the DataTable, so I figure it renders before the DataTable is > > rendered, so the detail.setVisible() in the onConfigure() in the > DataTable > > is called too late. > > > > How should I solve this? The only component that know which item is going > > to be selected is the DataTable. > > > > Thanks, > > Rob > >
wicket datatable, row selection, update another component
Hi all, I have a DataTable which, in onConfigure(), sets a selected item. I want another (detail) panel, outside of this component, to react on that selection i.e. set it's visibility and render details of the selected item. What I see is that the onConfigure() of the detail component is called BEFORE the DataTable, so I figure it renders before the DataTable is rendered, so the detail.setVisible() in the onConfigure() in the DataTable is called too late. How should I solve this? The only component that know which item is going to be selected is the DataTable. Thanks, Rob
Re: clone RAMDirectory
Thanks for the quick reply Uwe! I opened https://issues.apache.org/jira/browse/LUCENE-7366 for this. -Rob On Thu, Jun 30, 2016 at 12:06 PM, Uwe Schindler wrote: > Hi, > > I looked at the code: The FSDirectory passed to RAMDirectory could be > changed to Directory easily. The additional check for "not is a directory > inode" is in my opinion lo longer needed, because listFiles should only > return files. > > Can you open an issue about to change the FSDirectory in the RAMDirectory > ctor to be changed to Directory? > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com] > > Sent: Thursday, June 30, 2016 12:00 PM > > To: java-user@lucene.apache.org > > Subject: clone RAMDirectory > > > > Hi all, > > > > For increasing the speed of some of my application tests, I want to > > re-use/copy a pre-populated RAMDirectory over and over. > > > > I'm on Lucene 6.0.1 > > > > It seems an RAMDirectory can be a copy of a FSDirectory, but not of > another > > RAMDirectory. Also RAMDirectory is not Clonable. > > > > What would be the 'proper' approach to re-use (fast copy) pre-populated > > indices over tests? I know I can create a FSDirectory and copy that, but > > then I also need to take into account temporary files etc. > > > > Thanks in advance, > > > > - Rob > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
clone RAMDirectory
Hi all, For increasing the speed of some of my application tests, I want to re-use/copy a pre-populated RAMDirectory over and over. I'm on Lucene 6.0.1. It seems a RAMDirectory can be a copy of an FSDirectory, but not of another RAMDirectory. Also RAMDirectory is not Cloneable. What would be the 'proper' approach to re-use (fast copy) pre-populated indices across tests? I know I can create an FSDirectory and copy that, but then I also need to take into account temporary files etc. Thanks in advance, - Rob
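A possible workaround while the copy-constructor only accepts FSDirectory (the limitation LUCENE-7366, opened in this thread, addresses) is to copy any Directory file-by-file; this is my own sketch, not code from the thread:

```java
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class RamCopy {
    // copies every file of the source directory into a fresh in-memory copy;
    // the source can itself be a RAMDirectory
    public static RAMDirectory copyToRam(Directory source) throws Exception {
        RAMDirectory copy = new RAMDirectory();
        for (String file : source.listAll()) {
            copy.copyFrom(source, file, file, IOContext.READONCE);
        }
        return copy;
    }
}
```

Each test can then take a cheap copy of one pre-populated template directory instead of rebuilding the index.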
Re: GROUP BY in Lucene
Hi Gimantha, You don't need to store the aggregates and don't need to retrieve Documents. The aggregates are calculated during collection using the BinaryDocValues from the facet module. What I do is store values in the facets using AssociationFacetFields (for example FloatAssociationFacetField). I just chose facets because then I can use the facets as well :) I have an implementation of the `Facets` class that does all the aggregation. I cannot paste all the code unfortunately, but here is the idea (it is loosely based on the TaxonomyFacetSumIntAssociations implementation, where you can look up how the BinaryDocValues are translated to ordinals and to facets). This aggregation is used in conjunction with a FacetsCollector, which collects the facets during a search:

    FacetsCollector fc = new FacetsCollector();
    searcher.search(new ConstantScoreQuery(query), fc);

Then, use this FacetsCollector:

    taxoReader = getTaxonomyReaderManager().acquire();
    OnePassTaxonomyFacets facets = new OnePassTaxonomyFacets(taxoReader, LuceneIndexConfig.facetConfig);
    Collection results = facets.aggregateValues(fc.getMatchingDocs(), p.getGroupByListWithoutData(), aggregateFields);

The aggregateValues method (cannot paste it all :( ) :

    public final Collection aggregateValues(List matchingDocs, final List groupByFields,
            final List aggregateFieldNames, EmptyValues emptyValues) throws IOException {
        LOG.info("Starting aggregation for pivot.. EmptyValues=" + emptyValues);
        // We want to group a list of ordinals to a list of aggregates. The taxoReader has the
        // ordinals, so a selection like 'Lang=NL, Region=South' will end up like a MultiIntKey
        // of [13,44]
        Map> aggs = Maps.newHashMap();
        List groupByFieldsNames = Lists.newArrayList();
        for (GroupByField gbf : groupByFields) {
            groupByFieldsNames.add(gbf.getField().getName());
        }
        int groupByCount = groupByFieldsNames.size();
        // We need to know which ordinals are the 'group-by' ordinals, so we can check if an
        // ordinal that is found belongs to one of these fields
        int[] groupByOrdinals = new int[groupByCount];
        for (int i = 0; i < groupByOrdinals.length; i++) {
            groupByOrdinals[i] = this.getOrdinalForListItem(groupByFieldsNames, i);
        }
        // We need to know which ordinals are the 'aggregate-field' ordinals, so we can check
        // if an ordinal that is found belongs to one of these fields
        int[] aggregateOrdinals = new int[aggregateFieldNames.size()];
        for (int i = 0; i < aggregateOrdinals.length; i++) {
            aggregateOrdinals[i] = this.getOrdinalForListItem(aggregateFieldNames, i);
        }
        // Now we go and find all the ordinals in the matching documents. For each ordinal,
        // we check if it is a groupBy-ordinal or an aggregate-ordinal, and act accordingly.
        for (MatchingDocs hitList : matchingDocs) {
            BinaryDocValues dv = hitList.context.reader().getBinaryDocValues(this.indexFieldName);
            // Here find the ordinals of the group-by fields and the aggregate fields.
            // Create a multi-ordinal key MultiIntKey from the group-by ordinals and use that
            // to add the current value of the field to the facet aggregates ..

Hope this helps :) -Rob
Re: Lucene Facets performance problems (version 4.7.2)
Hi Simona, In addition to Erick's questions: Are you talking about *search* time or facet-collection time? And how many results are in your result set? I have some experience with collecting facets from large result sets; these are typically slow (as they have to retrieve all the relevant facet fields for each faceted document). In Lucene 4.8 the RandomSamplingFacetsCollector returned (as per https://issues.apache.org/jira/browse/LUCENE-5476). -Rob On Fri, Feb 26, 2016 at 6:01 AM, Simona Russo wrote: > Hi all, > > we use Lucene *Facet* library version* 4.7.2.* > > We have an *index* with *45 millions *of documents (size about 15 GB) and > a *taxonomy* index with *57* millions of documents (size about 2 GB). > > The total *facet search* time reaches *15 seconds*! > > Is it possible to improve this time? Is there any tips to *configure* the > *taxonomy* index to avoid this waste of time? > > > Thanks in advance >
Re: Profiling lucene 5.2.0 based tool
Hi Sandeep, How many threads do you use to do the indexing? The benchmarks of Lucene are done on >20 threads IIRC. -Rob On Tue, Feb 23, 2016 at 8:01 AM, sandeep das wrote: > Hi, > > I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool > is reading data from CSV files(residing on disk) and creating indexes on > local disk. It is able to process 3.5 MBps data. There are overall 46 > fields being added in one document. They are only of three data types 1. > Integer, 2. Long, 3. String. > All these fields are part of one CSV record and they are parsed using > custom CSV parser which is faster than any split method of string. > > I've configured the following parameters to create indexWriter > 1. setOpenMode(OpenMode.CREATE) > 2. setCommitOnClose(true) > 3. setRAMBufferSizeMB(512) // Tried 256, 312 as well but performance is > almost same. > > I've read over several blogs that lucene works way faster than these > figures. So, I thought there are some bottlenecks in my code and profiled > it using jvisualvm. The application is spending most of the time in > DefaultIndexChain.processField i.e. 53% of total time. > > > Following is the split of CPU usage in this application: > 1. reading data from disk is taking 5% of total duration > 2. adding document is taking 93% of total duration. > >-postUpdate -> 12.8% >-doAfterDocument -> 20.6% >-updateDocument -> 59.8% > - finishDocument -> 1.7% > - finishStoreFields -> 4.8% > - processFields -> 53.1% > > > I'm also attaching the screen shot of call graph generated by jvisualvm. > > I've taken care of following points: > 1. create only one instance of indexWriter > 2. create only one instance of document and reuse it through out the life > time of application > 3. There will be no update in the documents hence only addDocument is > invoked. > Note: After going through the code I found out that addDocument is > internally calling updateDocument only. 
Is there any way by which we can > avoid calling updateDocument and only use the addDocument API? > 4. Using setValue APIs to set the pre-created fields and reusing these > fields to create indexes. > > Any tip to improve the performance will be immensely appreciated. > > Regards, > Sandeep > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org >
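For reference, points 2 and 4 of the list above (one reusable Document per thread, setValue on pre-created fields, one shared IndexWriter) can be sketched like this against the Lucene 5.2 API; the field names and the `CsvRecord` type are hypothetical:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// Each indexing thread calls this with the single shared writer and its
// own batch of parsed CSV records.
void indexRecords(IndexWriter writer, Iterable<CsvRecord> records)
        throws java.io.IOException {
    // Create Document and Field instances once, outside the loop.
    Document doc = new Document();
    StringField name = new StringField("name", "", Field.Store.YES);
    LongField timestamp = new LongField("timestamp", 0L, Field.Store.YES);
    doc.add(name);
    doc.add(timestamp);
    for (CsvRecord rec : records) {
        name.setStringValue(rec.name);          // reuse, don't reallocate
        timestamp.setLongValue(rec.timestamp);
        writer.addDocument(doc);
    }
}
```

On the addDocument question: the delegation to updateDocument inside IndexWriter is just code sharing; with a null delete term no deletion work is performed, so there is nothing to gain by avoiding it.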
RE: debugging growing index size
Thank you all, I will further fix and investigate! On Nov 14, 2015 10:00, "Uwe Schindler" wrote: > I agree. On Linux it is impossible that MMapDirectory is the reason! Only > on windows you cannot delete still open/mapped files. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Michael McCandless [mailto:luc...@mikemccandless.com] > > Sent: Friday, November 13, 2015 8:30 PM > > To: Lucene Users > > Subject: Re: debugging growing index size > > > > So with MMapDir at defaults (unmap is enabled) you see old files, with > > no open file handles as reported by lsof, still existing in your index > > directory, taking lots of space. > > > > But with NIOFSDirectory the issue doesn't happen? Are you sure? > > > > I'll look at the 6.6 GB infoStream to see what it says about the ref > counts. > > > > Did you fix the issue in your app where you're not closing all opened > > NRT readers? > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > > > > > On Fri, Nov 13, 2015 at 12:22 PM, Rob Audenaerde > > wrote: > > > I haven't disabled unmapping, and I am running out-of-the-box > > > FSDirectory.open(). As I can see it tries to pick MMap. For the test I > > > explicitly constructed a NIOFSDIrectoryReader > > > > > > OS is (from the top of my head) CentOS 6.x, Java 1.8.0u33. I can check > > > later for more details. > > > On Nov 13, 2015 18:07, "Uwe Schindler" wrote: > > > > > >> Hi, > > >> > > >> Lucene has the workaround, so it should not happen, UNLESS you > > explicitly > > >> disable the hack using MMapDirectory#setEnableUnmap(false). 
> > >> > > >> Uwe > > >> > > >> - > > >> Uwe Schindler > > >> H.-H.-Meier-Allee 63, D-28213 Bremen > > >> http://www.thetaphi.de > > >> eMail: u...@thetaphi.de > > >> > > >> > -Original Message- > > >> > From: will martin [mailto:wmartin...@gmail.com] > > >> > Sent: Friday, November 13, 2015 6:04 PM > > >> > To: java-user@lucene.apache.org > > >> > Subject: Re: debugging growing index size > > >> > > > >> > Hi Rob: > > >> > > > >> > > > >> > Doesn’t this look like known SE issue JDK-4724038 and discussed by > > Peter > > >> > Levart and Uwe Schindler on a lucene-dev thread 9/9/2015? > > >> > > > >> > MappedByteBuffer …. what OS are you on Rob? What JVM? > > >> > > > >> > http://bugs.java.com/view_bug.do?bug_id=4724038 > > >> > > > >> > http://mail-archives.apache.org/mod_mbox/lucene- > > >> > dev/201509.mbox/%3c55f0461a.2070...@gmail.com%3E > > >> > > > >> > hth > > >> > -will > > >> > > > >> > > > >> > > > >> > > On Nov 13, 2015, at 11:23 AM, Rob Audenaerde > > >> > wrote: > > >> > > > > >> > > I'm currently running using NIOFS. It seems to prevent the issue > from > > >> > > appearing. > > >> > > > > >> > > This is a second run (with applied deletes etc) > > >> > > > > >> > > raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd > > >> > > -rw-r--r--. 1 apache apache 7993 Nov 13 16:09 > _y_Lucene50_0.dvd > > >> > > -rw-r--r--. 1 apache apache 39048886 Nov 13 17:12 > > _xod_Lucene50_0.dvd > > >> > > -rw-r--r--. 1 apache apache 53699972 Nov 13 17:17 > > _110e_Lucene50_0.dvd > > >> > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 > > _12r5_Lucene50_0.dvd > > >> > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 > > _y0s_Lucene50_0.dvd > > >> > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 > > _z20_Lucene50_0.dvd > > >> > > > > >> > > raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd > > >> > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 > > _z20_Lucene50_0.dvd > > >> > > -rw-r--r--. 
1 apache apache 151149886 Nov 13 17:13 > > _y0s_Lucene50_0.dvd > > >> > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 >
RE: debugging growing index size
I haven't disabled unmapping, and I am running out-of-the-box FSDirectory.open(). As far as I can see it tries to pick MMap. For the test I explicitly constructed an NIOFSDirectory reader. OS is (from the top of my head) CentOS 6.x, Java 1.8.0u33. I can check later for more details. On Nov 13, 2015 18:07, "Uwe Schindler" wrote: > Hi, > > Lucene has the workaround, so it should not happen, UNLESS you explicitly > disable the hack using MMapDirectory#setEnableUnmap(false). > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: will martin [mailto:wmartin...@gmail.com] > > Sent: Friday, November 13, 2015 6:04 PM > > To: java-user@lucene.apache.org > > Subject: Re: debugging growing index size > > > > Hi Rob: > > > > > > Doesn’t this look like known SE issue JDK-4724038 and discussed by Peter > > Levart and Uwe Schindler on a lucene-dev thread 9/9/2015? > > > > MappedByteBuffer …. what OS are you on Rob? What JVM? > > > > http://bugs.java.com/view_bug.do?bug_id=4724038 > > > > http://mail-archives.apache.org/mod_mbox/lucene- > > dev/201509.mbox/%3c55f0461a.2070...@gmail.com%3E > > > > hth > > -will > > > > > > > On Nov 13, 2015, at 11:23 AM, Rob Audenaerde > > wrote: > > > > > > I'm currently running using NIOFS. It seems to prevent the issue from > > > appearing. > > > > > > This is a second run (with applied deletes etc) > > > > > > raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd > > > -rw-r--r--. 1 apache apache 7993 Nov 13 16:09 _y_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 39048886 Nov 13 17:12 _xod_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 53699972 Nov 13 17:17 _110e_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd > > > -rw-r--r--. 
1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd > > > > > > raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd > > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 53699972 Nov 13 17:17 _110e_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 39048886 Nov 13 17:12 _xod_Lucene50_0.dvd > > > -rw-r--r--. 1 apache apache 7993 Nov 13 16:09 _y_Lucene50_0.dvd > > > > > > > > > > > > On Thu, Nov 12, 2015 at 3:40 PM, Michael McCandless < > > > luc...@mikemccandless.com> wrote: > > > > > >> Hi Rob, > > >> > > >> A couple more things: > > >> > > >> Can you print the value of MMapDirectory.UNMAP_SUPPORTED? > > >> > > >> Also, can you try your test using NIOFSDirectory instead? Curious if > > >> that changes things... > > >> > > >> Mike McCandless > > >> > > >> http://blog.mikemccandless.com > > >> > > >> > > >> On Thu, Nov 12, 2015 at 7:28 AM, Rob Audenaerde > > >> wrote: > > >>> Curious indeed! > > >>> > > >>> I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate > > the > > >>> logs. Will get back with them in a day hopefully. > > >>> > > >>> Thanks for the extra logging! > > >>> > > >>> -Rob > > >>> > > >>> On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless < > > >>> luc...@mikemccandless.com> wrote: > > >>> > > >>>> Hmm, curious. > > >>>> > > >>>> I looked at the [large] infoStream output and I see segment _3ou7 > > >>>> present on init of IW, a few getReader calls referencing it, then a > > >>>> forceMerge that indeed merges it away, yet I do NOT see IW > > attempting > > >>>> deletion of its files. > > >>>> > > >>>> And indeed I see plenty (too many: many times per second?) of > > commits > > >>>> after that, so the index itself is no longer referencing _3ou7. > > >>>> > > >>>> If you are failing to close all NRT readers then I would
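For readers following this thread: the directory implementations being compared can be selected explicitly, roughly as below (Lucene 5.x API; the index path is hypothetical, and note that in the 5.x sources I know of the unmap setter is named `setUseUnmap`, not `setEnableUnmap`):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

Path indexPath = Paths.get("/var/lib/myapp/index");  // hypothetical location

// Out-of-the-box: picks MMapDirectory on 64-bit Linux/Windows/Solaris.
Directory defaultDir = FSDirectory.open(indexPath);

// Explicit NIO, to rule the mmap/unmap path out of the comparison:
Directory nioDir = new NIOFSDirectory(indexPath);

// Sanity check for the unmap workaround Uwe mentions; it is enabled by
// default whenever it is supported:
System.out.println("unmap supported: " + MMapDirectory.UNMAP_SUPPORTED);
MMapDirectory mmapDir = new MMapDirectory(indexPath);
mmapDir.setUseUnmap(MMapDirectory.UNMAP_SUPPORTED);
```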
Re: debugging growing index size
I'm currently running using NIOFS. It seems to prevent the issue from appearing. This is a second run (with applied deletes etc) raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd -rw-r--r--. 1 apache apache 7993 Nov 13 16:09 _y_Lucene50_0.dvd -rw-r--r--. 1 apache apache 39048886 Nov 13 17:12 _xod_Lucene50_0.dvd -rw-r--r--. 1 apache apache 53699972 Nov 13 17:17 _110e_Lucene50_0.dvd -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd -rw-r--r--. 1 apache apache 53699972 Nov 13 17:17 _110e_Lucene50_0.dvd -rw-r--r--. 1 apache apache 39048886 Nov 13 17:12 _xod_Lucene50_0.dvd -rw-r--r--. 1 apache apache 7993 Nov 13 16:09 _y_Lucene50_0.dvd On Thu, Nov 12, 2015 at 3:40 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > Hi Rob, > > A couple more things: > > Can you print the value of MMapDirectory.UNMAP_SUPPORTED? > > Also, can you try your test using NIOFSDirectory instead? Curious if > that changes things... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Nov 12, 2015 at 7:28 AM, Rob Audenaerde > wrote: > > Curious indeed! > > > > I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate the > > logs. Will get back with them in a day hopefully. > > > > Thanks for the extra logging! > > > > -Rob > > > > On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless < > > luc...@mikemccandless.com> wrote: > > > >> Hmm, curious. 
> >> > >> I looked at the [large] infoStream output and I see segment _3ou7 > >> present on init of IW, a few getReader calls referencing it, then a > >> forceMerge that indeed merges it away, yet I do NOT see IW attempting > >> deletion of its files. > >> > >> And indeed I see plenty (too many: many times per second?) of commits > >> after that, so the index itself is no longer referencing _3ou7. > >> > >> If you are failing to close all NRT readers then I would expect _3ou7 > >> to be in the lsof output, but it's not. > >> > >> The NRT readers close method has logic that notifies IndexWriter when > >> it's done "needing" the files, to emulate "delete on last close" > >> semantics for filesystems like HDFS that don't do that ... it's > >> possible something is wrong here. > >> > >> Can you set the (public, static) boolean > >> IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this > >> log? This causes IW to log the ref count of each file it's tracking > >> ... > >> > >> I'll also add a bit more verbosity to IW when NRT readers are opened > >> and close, for 5.4.0. > >> > >> Mike McCandless > >> > >> http://blog.mikemccandless.com > >> > >> > >> On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde > >> wrote: > >> > Hi all, > >> > > >> > I'm still debugging the growing-index size. I think closing index > readers > >> > might help (work in progress), but I can't really see them holding on > to > >> > files (at least, using lsof ). Restarting the application sheds some > >> light, > >> > I see logging on files that are no longer referenced. > >> > > >> > What I see is that there are files in the index-directory, that seem > to > >> > longer referenced.. > >> > > >> > I put the output of the infoStream online, because is it rather big > (30MB > >> > gzipped): http://www.audenaerde.org/lucene/merges.log.gz > >> > > >> > Output of lsof: (executed 'sudo lsof *' in the index directory ). 
> This > >> is > >> > on an CentOS box (maybe that influences stuff as well?) > >> > > >> > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > >> > java30581 apache memREG 253,0 3176094924 18880508 > >> > _4gs5_Lucene50_0.dvd > >> > java30581 apache memREG 253,0 505758610 18880546 _4gs5.fdt > >> > java30581 apache memREG 253,0 369563337 18880631 > >> > _4gs5_Lucene50_0.tim > >
Re: debugging growing index size
I got the data (beware, it is about a 180MB download, xz-zipped; unpacked it is about 6.6 GB). Unfortunately, I accidentally restarted the application, so the index files and lsof output could not be determined for this run. Hopefully the infoStream log with the extra logging will provide enough information. I will work on that next week if needed. The infoStream can be downloaded here: http://www.audenaerde.org/lucene/merges.log.xz The value of MMapDirectory.UNMAP_SUPPORTED is true. I'm currently trying to create a build with NIOFSDirectory instead. -Rob On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Hmm, curious. > > I looked at the [large] infoStream output and I see segment _3ou7 > present on init of IW, a few getReader calls referencing it, then a > forceMerge that indeed merges it away, yet I do NOT see IW attempting > deletion of its files. > > And indeed I see plenty (too many: many times per second?) of commits > after that, so the index itself is no longer referencing _3ou7. > > If you are failing to close all NRT readers then I would expect _3ou7 > to be in the lsof output, but it's not. > > The NRT readers close method has logic that notifies IndexWriter when > it's done "needing" the files, to emulate "delete on last close" > semantics for filesystems like HDFS that don't do that ... it's > possible something is wrong here. > > Can you set the (public, static) boolean > IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this > log? This causes IW to log the ref count of each file it's tracking > ... > > I'll also add a bit more verbosity to IW when NRT readers are opened > and close, for 5.4.0. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde > wrote: > > Hi all, > > > > I'm still debugging the growing-index size. 
I think closing index readers > > might help (work in progress), but I can't really see them holding on to > > files (at least, using lsof ). Restarting the application sheds some > light, > > I see logging on files that are no longer referenced. > > > > What I see is that there are files in the index-directory, that seem to > > longer referenced.. > > > > I put the output of the infoStream online, because is it rather big (30MB > > gzipped): http://www.audenaerde.org/lucene/merges.log.gz > > > > Output of lsof: (executed 'sudo lsof *' in the index directory ). This > is > > on an CentOS box (maybe that influences stuff as well?) > > > > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > > java30581 apache memREG 253,0 3176094924 18880508 > > _4gs5_Lucene50_0.dvd > > java30581 apache memREG 253,0 505758610 18880546 _4gs5.fdt > > java30581 apache memREG 253,0 369563337 18880631 > > _4gs5_Lucene50_0.tim > > java30581 apache memREG 253,0 176344058 18880623 > > _4gs5_Lucene50_0.pos > > java30581 apache memREG 253,0 378055201 18880606 > > _4gs5_Lucene50_0.doc > > java30581 apache memREG 253,0 372579599 18880400 > > _4i5a_Lucene50_0.dvd > > java30581 apache memREG 253,0 82017447 18880748 _4g37.cfs > > java30581 apache memREG 253,0 85376507 18880721 _4fb3.cfs > > java30581 apache memREG 253,0 363493917 18880533 > > _4ct1_Lucene50_0.dvd > > java30581 apache memREG 253,09421892 18880806 _4gjc.cfs > > java30581 apache memREG 253,0 76877461 18880553 _4ct1.fdt > > java30581 apache memREG 253,0 46271330 18880661 > > _4ct1_Lucene50_0.tim > > java30581 apache memREG 253,0 26911387 18880653 > > _4ct1_Lucene50_0.pos > > java30581 apache memREG 253,0 54678249 18880568 > > _4ct1_Lucene50_0.doc > > java30581 apache memREG 253,0 76556587 18880328 _4i5a.fdt > > java30581 apache memREG 253,0 45032159 18880389 > > _4i5a_Lucene50_0.tim > > java30581 apache memREG 253,0 26486772 18880388 > > _4i5a_Lucene50_0.pos > > java30581 apache memREG 253,0 55411002 18880362 > > _4i5a_Lucene50_0.doc 
> > java30581 apache memREG 253,0 70484185 18880340 _4hkn.cfs > > java30581 apache memREG 253,0 10873921 18880324 _4gpz.cfs > > java30581 apache memREG 253,0 17230506 18880524 _4i11.cfs > > java30581 apache memREG 253,06706969 18880575 _4i0t.cfs > > java30581 apache memREG 253,0 15135578 18880624 _4i0i.cfs > > java3
Re: debugging growing index size
Curious indeed! I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate the logs. Will get back with them in a day hopefully. Thanks for the extra logging! -Rob On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Hmm, curious. > > I looked at the [large] infoStream output and I see segment _3ou7 > present on init of IW, a few getReader calls referencing it, then a > forceMerge that indeed merges it away, yet I do NOT see IW attempting > deletion of its files. > > And indeed I see plenty (too many: many times per second?) of commits > after that, so the index itself is no longer referencing _3ou7. > > If you are failing to close all NRT readers then I would expect _3ou7 > to be in the lsof output, but it's not. > > The NRT readers close method has logic that notifies IndexWriter when > it's done "needing" the files, to emulate "delete on last close" > semantics for filesystems like HDFS that don't do that ... it's > possible something is wrong here. > > Can you set the (public, static) boolean > IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this > log? This causes IW to log the ref count of each file it's tracking > ... > > I'll also add a bit more verbosity to IW when NRT readers are opened > and close, for 5.4.0. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde > wrote: > > Hi all, > > > > I'm still debugging the growing-index size. I think closing index readers > > might help (work in progress), but I can't really see them holding on to > > files (at least, using lsof ). Restarting the application sheds some > light, > > I see logging on files that are no longer referenced. > > > > What I see is that there are files in the index-directory, that seem to > > longer referenced.. 
> > > > I put the output of the infoStream online, because is it rather big (30MB > > gzipped): http://www.audenaerde.org/lucene/merges.log.gz > > > > Output of lsof: (executed 'sudo lsof *' in the index directory ). This > is > > on an CentOS box (maybe that influences stuff as well?) > > > > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > > java30581 apache memREG 253,0 3176094924 18880508 > > _4gs5_Lucene50_0.dvd > > java30581 apache memREG 253,0 505758610 18880546 _4gs5.fdt > > java30581 apache memREG 253,0 369563337 18880631 > > _4gs5_Lucene50_0.tim > > java30581 apache memREG 253,0 176344058 18880623 > > _4gs5_Lucene50_0.pos > > java30581 apache memREG 253,0 378055201 18880606 > > _4gs5_Lucene50_0.doc > > java30581 apache memREG 253,0 372579599 18880400 > > _4i5a_Lucene50_0.dvd > > java30581 apache memREG 253,0 82017447 18880748 _4g37.cfs > > java30581 apache memREG 253,0 85376507 18880721 _4fb3.cfs > > java30581 apache memREG 253,0 363493917 18880533 > > _4ct1_Lucene50_0.dvd > > java30581 apache memREG 253,09421892 18880806 _4gjc.cfs > > java30581 apache memREG 253,0 76877461 18880553 _4ct1.fdt > > java30581 apache memREG 253,0 46271330 18880661 > > _4ct1_Lucene50_0.tim > > java30581 apache memREG 253,0 26911387 18880653 > > _4ct1_Lucene50_0.pos > > java30581 apache memREG 253,0 54678249 18880568 > > _4ct1_Lucene50_0.doc > > java30581 apache memREG 253,0 76556587 18880328 _4i5a.fdt > > java30581 apache memREG 253,0 45032159 18880389 > > _4i5a_Lucene50_0.tim > > java30581 apache memREG 253,0 26486772 18880388 > > _4i5a_Lucene50_0.pos > > java30581 apache memREG 253,0 55411002 18880362 > > _4i5a_Lucene50_0.doc > > java30581 apache memREG 253,0 70484185 18880340 _4hkn.cfs > > java30581 apache memREG 253,0 10873921 18880324 _4gpz.cfs > > java30581 apache memREG 253,0 17230506 18880524 _4i11.cfs > > java30581 apache memREG 253,06706969 18880575 _4i0t.cfs > > java30581 apache memREG 253,0 15135578 18880624 _4i0i.cfs > > java30581 apache memREG 253,0 15368310 
18880717 _4hzp.cfs > > java30581 apache memREG 253,05146140 18880583 _4hze.cfs > > java30581 apache memREG 253,02917380 18880411 _4gs5.nvd > > java30581 apache memREG 253,06871469 18880732 _4hod.cfs > > java30581 apache memREG 253,02860341 18880495 _4i84.cfs > > ja
debugging growing index size
Hi all, I'm still debugging the growing-index size. I think closing index readers might help (work in progress), but I can't really see them holding on to files (at least, using lsof). Restarting the application sheds some light: I see logging on files that are no longer referenced. What I see is that there are files in the index directory that seem to no longer be referenced. I put the output of the infoStream online, because it is rather big (30MB gzipped): http://www.audenaerde.org/lucene/merges.log.gz Output of lsof (executed 'sudo lsof *' in the index directory). This is on a CentOS box (maybe that influences stuff as well?) COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 30581 apache mem REG 253,0 3176094924 18880508 _4gs5_Lucene50_0.dvd java 30581 apache mem REG 253,0 505758610 18880546 _4gs5.fdt java 30581 apache mem REG 253,0 369563337 18880631 _4gs5_Lucene50_0.tim java 30581 apache mem REG 253,0 176344058 18880623 _4gs5_Lucene50_0.pos java 30581 apache mem REG 253,0 378055201 18880606 _4gs5_Lucene50_0.doc java 30581 apache mem REG 253,0 372579599 18880400 _4i5a_Lucene50_0.dvd java 30581 apache mem REG 253,0 82017447 18880748 _4g37.cfs java 30581 apache mem REG 253,0 85376507 18880721 _4fb3.cfs java 30581 apache mem REG 253,0 363493917 18880533 _4ct1_Lucene50_0.dvd java 30581 apache mem REG 253,0 9421892 18880806 _4gjc.cfs java 30581 apache mem REG 253,0 76877461 18880553 _4ct1.fdt java 30581 apache mem REG 253,0 46271330 18880661 _4ct1_Lucene50_0.tim java 30581 apache mem REG 253,0 26911387 18880653 _4ct1_Lucene50_0.pos java 30581 apache mem REG 253,0 54678249 18880568 _4ct1_Lucene50_0.doc java 30581 apache mem REG 253,0 76556587 18880328 _4i5a.fdt java 30581 apache mem REG 253,0 45032159 18880389 _4i5a_Lucene50_0.tim java 30581 apache mem REG 253,0 26486772 18880388 _4i5a_Lucene50_0.pos java 30581 apache mem REG 253,0 55411002 18880362 _4i5a_Lucene50_0.doc java 30581 apache mem REG 253,0 70484185 18880340 _4hkn.cfs java 30581 apache mem REG 253,0 10873921 18880324 _4gpz.cfs java 30581 
apache mem REG 253,0 17230506 18880524 _4i11.cfs java 30581 apache mem REG 253,0 6706969 18880575 _4i0t.cfs java 30581 apache mem REG 253,0 15135578 18880624 _4i0i.cfs java 30581 apache mem REG 253,0 15368310 18880717 _4hzp.cfs java 30581 apache mem REG 253,0 5146140 18880583 _4hze.cfs java 30581 apache mem REG 253,0 2917380 18880411 _4gs5.nvd java 30581 apache mem REG 253,0 6871469 18880732 _4hod.cfs java 30581 apache mem REG 253,0 2860341 18880495 _4i84.cfs java 30581 apache mem REG 253,0 835726 18880660 _4i7z.cfs java 30581 apache mem REG 253,0 1005595 18880648 _4i7w.cfs java 30581 apache mem REG 253,0 5639672 18880401 _4i4o.cfs java 30581 apache mem REG 253,0 4388371 18880440 _4i4a.cfs java 30581 apache mem REG 253,0 1151845 18880512 _4i7v.cfs java 30581 apache mem REG 253,0 941773 18880613 _4i7x.cfs java 30581 apache mem REG 253,0 984023 18880588 _4i7o.cfs java 30581 apache mem REG 253,0 1790005 18880619 _4i7y.cfs java 30581 apache mem REG 253,0 466371 18880515 _4ct1.nvd java 30581 apache mem REG 253,0 723280 18880573 _4i7q.cfs java 30581 apache mem REG 253,0 806289 18880517 _4i7h.cfs java 30581 apache mem REG 253,0 17362 18880520 _4i9s.cfs java 30581 apache mem REG 253,0 698362 18880531 _4i9r.cfs java 30581 apache mem REG 253,0 483215 18880406 _4i5a.nvd java 30581 apache mem REG 253,0 14110 18880416 _4i9v.cfs java 30581 apache mem REG 253,0 6121 18880412 _4i9t.cfs java 30581 apache 30wW REG 253,0 0 18877901 write.lock Output of some of the biggest files in the index directory: -rw-r--r--. 1 apache apache 358684577 Nov 11 08:04 _4fjn.cfs -rw-r--r--. 1 apache apache 363493917 Nov 11 07:54 _4ct1_Lucene50_0.dvd -rw-r--r--. 1 apache apache 369563337 Nov 11 08:06 _4gs5_Lucene50_0.tim -rw-r--r--. 1 apache apache 372579599 Nov 11 08:09 _4i5a_Lucene50_0.dvd -rw-r--r--. 1 apache apache 378055201 Nov 11 08:06 _4gs5_Lucene50_0.doc -rw-r--r--. 1 apache apache 427401813 Nov 10 08:14 _3ou7.cfs -rw-r--r--. 1 apache apache 505758610 Nov 11 08:04 _4gs5.fdt -rw-r--r--. 
1 apache apache 1107391579 Nov 10 07:55 _3k3a_Lucene50_0.dvd -rw-r--r--. 1 apache apache 3176094924 Nov 11 08:10 _4gs5_Lucene50_0.dvd Note that the 3ou7 and 3k3a segments no longer appear to be in use?
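For anyone wanting to reproduce this kind of log: a sketch of how the infoStream above, plus the per-file ref-count logging suggested later in this thread, can be enabled (Lucene 5.x; `dir`, the analyzer choice, and the log file name are assumptions of this sketch):

```java
import java.io.FileOutputStream;
import java.io.PrintStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexFileDeleter;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.PrintStreamInfoStream;

// Log the ref count of every tracked index file (public static flag):
IndexFileDeleter.VERBOSE_REF_COUNTS = true;

// Route IndexWriter's infoStream (merges, flushes, deletes) to a file:
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setInfoStream(new PrintStreamInfoStream(
        new PrintStream(new FileOutputStream("merges.log"), true, "UTF-8")));
IndexWriter writer = new IndexWriter(dir, iwc);
```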
Re: index size growing while deleting
Ah yes, that is the way to go. It is a bit harder here, because we also use a per-user InMemoryIndex that is combined in a multi-reader, so it will be a bit more work, but I think it will be doable. Thanks for all the help. That said, I found it not-so-easy to debug this issue; are there methods (on the IndexWriter / text in the infoStream?) that I could have used to detect what was going on? That might be helpful for others as well. -Rob On Tue, Nov 10, 2015 at 1:32 PM, Jürgen Albert wrote: > Hi Rob, > > we use a SearcherManager to obtain a fresh Searcher for every Query. From > the Searcher we get the Reader. After the query you call > searcherManager.release(searcher). The SearcherManager takes care of the > rest. > > Regards, > > Jürgen. > > > Am 10.11.2015 um 13:27 schrieb Rob Audenaerde: > >> Hi Jürgen, Michael >> >> Thanks! I seem to be able to reduce the index size by closing and >> restarting our application. This reduces the index size from 22G tot 4G, >> with is somewhat the expected size. The infoStream also gives me the >> 'removed unreferenced file (IFD 0 [2015-11-10T12:21:49.293Z; main]: init: >> removing unreferenced file '...) >> >> Now I just need to figure out how to close the IndexReader while keeping >> the application running.. I guess I should/could do something with the >> openIfChanged. Will look further. >> >> -Rob >> >> >> >> On Tue, Nov 10, 2015 at 12:19 PM, Jürgen Albert < >> j.alb...@data-in-motion.biz >> >>> wrote: >>> Hi Rob, >>> >>> we had a similar problem. In our case we had open index readers, that >>> blocked the index from merging its segments and thus deleting the marked >>> segments. >>> >>> Regards, >>> >>> Jürgen. >>> >>> >>> Am 06.11.2015 um 08:59 schrieb Rob Audenaerde: >>> >>> Hi will, others >>>> >>>> Thanks for you reply, >>>> >>>> As far as I understand it, deleting a document is just setting the >>>> deleted >>>> bit, and when segments are merged, then the documents are removed. 
(not >>>> really sure what this means exactly; I guess the document gets removed >>>> from >>>> the store, the terms will no longer refer to that document. Not sure if >>>> terms get removed if no longer needed, etc). If there are resources to >>>> read >>>> to improve my understanding I havo not found them (yet), if you could >>>> point >>>> me to some that be great! >>>> >>>> I use the default IndexWriterConfig, which I see uses >>>> TieredMergePolicy. I >>>> never close my InderWriter; as I use NRT searching I just alwyas keep it >>>> open. >>>> >>>> My two guesses are that: a) old segments are not removed from disk or b) >>>> deletes are not cleaned up as well as I though they would be. >>>> >>>> I have made a testcase which indexes 5 million rows (five iterations, >>>> five >>>> indexing thread, indexing and deleting all such documents after each >>>> iterator with deleteByQuery), the rows randomly generated. I see the >>>> Taxonomy ever growing (which is logical, because facet-ordinals are >>>> never >>>> removed as far as I understand); the index grows, but also shrinks when >>>> deleting. So I cannot reproduce my problem easily :( >>>> >>>> I will start diving into the Lucene source code, but I was hoping I just >>>> did something wrong. . >>>> >>>> Any hints are appreciated! >>>> >>>> -Rob >>>> >>>> >>>> On Thu, Nov 5, 2015 at 2:52 PM, will wrote: >>>> >>>> Hi Rob: >>>> >>>>> Do you understand how deletes work and how an index is compacted? >>>>> >>>>> There's some configuration/runtime activities you don't mention And >>>>> you make testing process sound like a mirror of production? (Including >>>>> configuration?) >>>>> >>>>> >>>>> -will >>>>> >>>>> >>>>> On 11/5/15 7:33 AM, Rob Audenaerde wrote: >>>>> >>>>> Hi all, >>>>> >>>>>> I'm currently investigating an issue we have with our index. It keeps >>>>>> getting bigger, and I don't het why. >>>>>> >>&g
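The SearcherManager pattern Jürgen describes boils down to roughly the following sketch (`writer` and `query` are assumed to exist; Lucene 5.x API):

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

// Create once, next to the long-lived IndexWriter:
SearcherManager mgr = new SearcherManager(writer, true, new SearcherFactory());

// Per query: acquire, search, and always release in a finally block.
IndexSearcher searcher = mgr.acquire();
try {
    TopDocs hits = searcher.search(query, 10);
} finally {
    mgr.release(searcher);  // releasing lets old segment readers close,
    searcher = null;        // so their files become deletable
}

// After a batch of updates/deletes, pick up the new segments:
mgr.maybeRefresh();
```

The manager ref-counts the underlying NRT readers, so searchers no longer pin merged-away segments once released.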
Re: index size growing while deleting
Hi Jürgen, Michael Thanks! I seem to be able to reduce the index size by closing and restarting our application. This reduces the index size from 22 GB to 4 GB, which is somewhat the expected size. The infoStream also gives me the 'removed unreferenced file (IFD 0 [2015-11-10T12:21:49.293Z; main]: init: removing unreferenced file '...) Now I just need to figure out how to close the IndexReader while keeping the application running. I guess I should/could do something with openIfChanged. Will look further. -Rob On Tue, Nov 10, 2015 at 12:19 PM, Jürgen Albert wrote: > Hi Rob, > > we had a similar problem. In our case we had open index readers, that > blocked the index from merging its segments and thus deleting the marked > segments. > > Regards, > > Jürgen. > > > Am 06.11.2015 um 08:59 schrieb Rob Audenaerde: > >> Hi will, others >> >> Thanks for you reply, >> >> As far as I understand it, deleting a document is just setting the deleted >> bit, and when segments are merged, then the documents are removed. (not >> really sure what this means exactly; I guess the document gets removed >> from >> the store, the terms will no longer refer to that document. Not sure if >> terms get removed if no longer needed, etc). If there are resources to >> read >> to improve my understanding I havo not found them (yet), if you could >> point >> me to some that be great! >> >> I use the default IndexWriterConfig, which I see uses TieredMergePolicy. I >> never close my InderWriter; as I use NRT searching I just alwyas keep it >> open. >> >> My two guesses are that: a) old segments are not removed from disk or b) >> deletes are not cleaned up as well as I though they would be. >> >> I have made a testcase which indexes 5 million rows (five iterations, five >> indexing thread, indexing and deleting all such documents after each >> iterator with deleteByQuery), the rows randomly generated. 
I see the >> Taxonomy ever growing (which is logical, because facet-ordinals are never >> removed as far as I understand); the index grows, but also shrinks when >> deleting. So I cannot reproduce my problem easily :( >> >> I will start diving into the Lucene source code, but I was hoping I just >> did something wrong. . >> >> Any hints are appreciated! >> >> -Rob >> >> >> On Thu, Nov 5, 2015 at 2:52 PM, will wrote: >> >> Hi Rob: >>> >>> Do you understand how deletes work and how an index is compacted? >>> >>> There's some configuration/runtime activities you don't mention And >>> you make testing process sound like a mirror of production? (Including >>> configuration?) >>> >>> >>> -will >>> >>> >>> On 11/5/15 7:33 AM, Rob Audenaerde wrote: >>> >>> Hi all, >>>> >>>> I'm currently investigating an issue we have with our index. It keeps >>>> getting bigger, and I don't het why. >>>> >>>> Here is our use case: >>>> >>>> We index a database of about 4 million records; spread over a few >>>> hundred >>>> tables. The data consists of a mix of text, dates, numbers etc. We also >>>> add >>>> all these fields as facets. >>>> Each night we delete about 90% of the data, which in testing reduces the >>>> index size significantly. >>>> We store the data as StoredFields as well, to prevent having to access >>>> the >>>> database at all. >>>> We use FloatAssociatedFacet fields for the facets. >>>> >>>> >>>> In production however, it seems the index is only growing, up to 71 GB >>>> for >>>> these records for a month of running. >>>> >>>> It seems that lucene's index in just getting bigger there. >>>> >>>> We use lucene 5.3 on CentOS, java 8 64 bit. >>>> >>>> The taxonomy-index does not grow significantly. >>>> >>>> How should I go about checking what is wrong? >>>> >>>> Thanks! >>>> >>>> >>>> > > -- > Jürgen Albert > Geschäftsführer > > Data In Motion UG (haftungsbeschränkt) > > Kahlaische Str. 
4 > 07745 Jena > > Mobil: 0157-72521634 > E-Mail: j.alb...@datainmotion.de > Web: www.datainmotion.de > > XING: https://www.xing.com/profile/Juergen_Albert5 > > Rechtliches > > Jena HBR 507027 > USt-IdNr: DE274553639 > St.Nr.: 162/107/04586 > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
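The openIfChanged route mentioned above can be sketched roughly as follows. This is illustrative only: the class name, field handling and synchronization are assumptions, not the application's actual code.

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;

/** Sketch: refresh an NRT reader in place instead of restarting the application. */
public final class ReaderRefresher {
    private DirectoryReader reader; // the application's current reader

    public ReaderRefresher(DirectoryReader initial) {
        this.reader = initial;
    }

    /** Swap in a fresh reader if the index changed, and close the old one
     *  so the segment files it held open become unreferenced and deletable. */
    public synchronized DirectoryReader refresh() throws IOException {
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        if (newReader != null) {
            DirectoryReader old = reader;
            reader = newReader;
            old.close();
        }
        return reader;
    }
}
```

Closing the old reader is the step that actually releases the on-disk segments; openIfChanged alone does not free anything.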
Re: index size growing while deleting
On Fri, Nov 6, 2015 at 11:29 AM, Michael McCandless < luc...@mikemccandless.com> wrote: It's also important to IndexWriter.commit (as well as open new NRT > readers) periodically or after doing a large set of updates, as that > lets Lucene remove any old segments referenced by the prior commit > point. > When re-reading your comment I noticed I skipped over the part in the brackets, and I have an additional question: why is it needed to open new NRT readers? (btw, I use the openIfChanged() approach when maybeRefreshing()) Thanks!
Re: index size growing while deleting
Thanks Mike for the reply, I already commit after every 5000 documents per thread. I also found out today how to enable the InfoStream through the IndexWriterConfig, so I'll have lots of extra information to work on. Will run it on the production environment to find out what's happening there. Any hints are appreciated! -Rob On Fri, Nov 6, 2015 at 11:29 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > It's also important to IndexWriter.commit (as well as open new NRT > readers) periodically or after doing a large set of updates, as that > lets Lucene remove any old segments referenced by the prior commit > point. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Nov 6, 2015 at 2:59 AM, Rob Audenaerde > wrote: > > Hi will, others > > > > Thanks for your reply, > > > > As far as I understand it, deleting a document is just setting the > deleted > > bit, and when segments are merged, the documents are removed. (Not > > really sure what this means exactly; I guess the document gets removed > from > > the store, and the terms will no longer refer to that document. Not sure if > > terms get removed if no longer needed, etc.) If there are resources to > read > > to improve my understanding I have not found them (yet); if you could > point > > me to some, that would be great! > > > > I use the default IndexWriterConfig, which I see uses TieredMergePolicy. > I > > never close my IndexWriter; as I use NRT searching I just always keep it > > open. > > > > My two guesses are that: a) old segments are not removed from disk or b) > > deletes are not cleaned up as well as I thought they would be. > > > > I have made a testcase which indexes 5 million rows (five iterations, > five > > indexing threads, indexing and deleting all such documents after each > > iteration with deleteByQuery), the rows randomly generated. 
I see the > > Taxonomy ever growing (which is logical, because facet-ordinals are never > > removed, as far as I understand); the index grows, but also shrinks when > > deleting. So I cannot reproduce my problem easily :( > > > > I will start diving into the Lucene source code, but I was hoping I just > > did something wrong. > > > > Any hints are appreciated! > > > > -Rob > > > > > > On Thu, Nov 5, 2015 at 2:52 PM, will wrote: > > > >> Hi Rob: > >> > >> Do you understand how deletes work and how an index is compacted? > >> > >> There are some configuration/runtime activities you don't mention. And > >> you make the testing process sound like a mirror of production? (Including > >> configuration?) > >> > >> > >> -will > >> > >> > >> On 11/5/15 7:33 AM, Rob Audenaerde wrote: > >> > >>> Hi all, > >>> > >>> I'm currently investigating an issue we have with our index. It keeps > >>> getting bigger, and I don't get why. > >>> > >>> Here is our use case: > >>> > >>> We index a database of about 4 million records, spread over a few > hundred > >>> tables. The data consists of a mix of text, dates, numbers etc. We also > >>> add > >>> all these fields as facets. > >>> Each night we delete about 90% of the data, which in testing reduces > the > >>> index size significantly. > >>> We store the data as StoredFields as well, to prevent having to access > the > >>> database at all. > >>> We use FloatAssociatedFacet fields for the facets. > >>> > >>> > >>> In production however, it seems the index is only growing, up to 71 GB > for > >>> these records for a month of running. > >>> > >>> It seems that Lucene's index is just getting bigger there. > >>> > >>> We use Lucene 5.3 on CentOS, Java 8 64-bit. > >>> > >>> The taxonomy-index does not grow significantly. > >>> > >>> How should I go about checking what is wrong? > >>> > >>> Thanks! 
> >>> > >>> > >>
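Enabling the InfoStream mentioned above is a one-liner on the writer config. A minimal sketch (the class name and the choice of analyzer are mine, not from the original setup):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;

/** Sketch: turn on IndexWriter's diagnostic logging, which reports
 *  merge, flush and IFD (index file deletion) activity. */
public final class VerboseConfig {
    public static IndexWriterConfig create(Analyzer analyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        // Convenience overload that wraps System.out in an InfoStream.
        iwc.setInfoStream(System.out);
        return iwc;
    }
}
```

The "removing unreferenced file" lines quoted earlier in this thread come from exactly this IFD output.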
Re: index size growing while deleting
Hi will, others Thanks for your reply, As far as I understand it, deleting a document is just setting the deleted bit, and when segments are merged, the documents are removed. (Not really sure what this means exactly; I guess the document gets removed from the store, and the terms will no longer refer to that document. Not sure if terms get removed if no longer needed, etc.) If there are resources to read to improve my understanding I have not found them (yet); if you could point me to some, that would be great! I use the default IndexWriterConfig, which I see uses TieredMergePolicy. I never close my IndexWriter; as I use NRT searching I just always keep it open. My two guesses are that: a) old segments are not removed from disk or b) deletes are not cleaned up as well as I thought they would be. I have made a testcase which indexes 5 million rows (five iterations, five indexing threads, indexing and deleting all such documents after each iteration with deleteByQuery), the rows randomly generated. I see the Taxonomy ever growing (which is logical, because facet-ordinals are never removed, as far as I understand); the index grows, but also shrinks when deleting. So I cannot reproduce my problem easily :( I will start diving into the Lucene source code, but I was hoping I just did something wrong. Any hints are appreciated! -Rob On Thu, Nov 5, 2015 at 2:52 PM, will wrote: > Hi Rob: > > Do you understand how deletes work and how an index is compacted? > > There are some configuration/runtime activities you don't mention. And > you make the testing process sound like a mirror of production? (Including > configuration?) > > > -will > > > On 11/5/15 7:33 AM, Rob Audenaerde wrote: > >> Hi all, >> >> I'm currently investigating an issue we have with our index. It keeps >> getting bigger, and I don't get why. >> >> Here is our use case: >> >> We index a database of about 4 million records, spread over a few hundred >> tables. The data consists of a mix of text, dates, numbers etc. 
We also >> add >> all these fields as facets. >> Each night we delete about 90% of the data, which in testing reduces the >> index size significantly. >> We store the data as StoredFields as well, to prevent having to access the >> database at all. >> We use FloatAssociatedFacet fields for the facets. >> >> >> In production however, it seems the index is only growing, up to 71 GB for >> these records for a month of running. >> >> It seems that Lucene's index is just getting bigger there. >> >> We use Lucene 5.3 on CentOS, Java 8 64-bit. >> >> The taxonomy-index does not grow significantly. >> >> How should I go about checking what is wrong? >> >> Thanks! >> >> >
index size growing while deleting
Hi all, I'm currently investigating an issue we have with our index. It keeps getting bigger, and I don't get why. Here is our use case: We index a database of about 4 million records, spread over a few hundred tables. The data consists of a mix of text, dates, numbers etc. We also add all these fields as facets. Each night we delete about 90% of the data, which in testing reduces the index size significantly. We store the data as StoredFields as well, to prevent having to access the database at all. We use FloatAssociatedFacet fields for the facets. In production however, it seems the index is only growing, up to 71 GB for these records for a month of running. It seems that Lucene's index is just getting bigger there. We use Lucene 5.3 on CentOS, Java 8 64-bit. The taxonomy-index does not grow significantly. How should I go about checking what is wrong? Thanks!
Number of threads in index writer config?
Hi all, I was wondering about the number of threads to use for indexing. There is a setting, getMaxThreadStates(), in the IndexWriterConfig that determines how many threads can write to the index simultaneously. The luceneutil Indexer.java (that is used for the nightly benchmarks) seems to use the default value (8), while it uses 20 indexing threads. Is there a reason not to set the maxThreadStates to the number of indexing threads? Thanks!
Re: GROUP BY in Lucene
You can write a custom (facet) collector to do this. I have done something similar; I'll describe my approach: For all the values that need grouping or aggregating, I have added a FacetField (an AssociatedFacetField, so I can store the value alongside the ordinal). The main search stays the same, in your case for example a NumericRangeQuery (if the date is stored in ms). Then I have a custom facet collector that does the grouping. Basically, it goes through all the MatchingDocs. For each doc, it creates a unique key (composed of X, Y and Z), and makes aggregates as needed (sum D). These are stored in a map. If a key is already in the map, the new value is added to the existing aggregate. Tricky is to make your unique key fast and immutable, so you can precompute the hashcode. This is fast enough if the number of unique keys is smallish (<10,000; index size ± 1M docs). -Rob On Mon, Aug 10, 2015 at 2:47 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > Lucene has a grouping module that has several approaches for grouping > search hits, though it's only by a single field I believe. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Sun, Aug 9, 2015 at 2:55 PM, Gimantha Bandara > wrote: > > Hi all, > > > > Is there a way to achieve $subject? For example, consider the following > SQL > > query. > > > > SELECT A, B, C, SUM(D) as E FROM `table` WHERE time BETWEEN fromDate AND > > toDate *GROUP BY X,Y,Z* > > > > In the above query we can group the records by X,Y,Z. Is there a way to > > achieve the same in Lucene? (I guess Faceting would help, but is it > > possible to get all the categoryPaths along with the matching records?) Is > > there any other way other than using Facets? > > > > -- > > Gimantha Bandara > > Software Engineer > > WSO2. Inc : http://wso2.com > > Mobile : +94714961919
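Collector plumbing aside, the key-plus-map aggregation described above boils down to something like this plain-Java sketch (the GroupKey class, method names and sample rows are mine, for illustration only):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public final class GroupBySketch {

    /** Immutable composite group key (X, Y, Z) with a precomputed hash code. */
    static final class GroupKey {
        private final Object[] parts;
        private final int hash;

        GroupKey(Object... parts) {
            this.parts = parts;
            this.hash = Arrays.hashCode(parts); // computed once, reused on every map lookup
        }
        @Override public int hashCode() { return hash; }
        @Override public boolean equals(Object o) {
            return o instanceof GroupKey && Arrays.equals(parts, ((GroupKey) o).parts);
        }
    }

    /** SUM(D) GROUP BY X,Y,Z over rows shaped as {x, y, z, d}. */
    static Map<GroupKey, Double> sumByKey(Object[][] rows) {
        Map<GroupKey, Double> sums = new HashMap<>();
        for (Object[] r : rows) {
            // If the key is already present, the new value is added to the aggregate.
            sums.merge(new GroupKey(r[0], r[1], r[2]), (Double) r[3], Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Object[][] rows = { {"a", "b", "c", 10.0}, {"a", "b", "c", 5.0}, {"a", "x", "c", 7.0} };
        Map<GroupKey, Double> sums = sumByKey(rows);
        System.out.println(sums.get(new GroupKey("a", "b", "c"))); // 15.0
        System.out.println(sums.size()); // 2
    }
}
```

In the real collector, the per-row values would come from the doc-values/associations of each matching doc rather than from an in-memory array.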
disabling all scoring?
Hi all, I'm doing some analytics with a custom Collector on a fairly large number of search results (±100,000: all the hits that return from a query). I need to retrieve them by a query (so using search), but I don't need any scoring, nor keeping the documents in any order. When profiling the application, I saw that for my tests the entire search takes about 2.4 seconds, of which BulkScorer takes 0.4 seconds. So I figured that without scoring, I would be able to chop off 0.4 seconds (about a 17% speed increase). That seems reasonable. What would be the best approach to disable all the 'search goodies' and just pass the results as fast as possible into my Collector? Thanks for your insights. -Rob
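One low-effort option worth trying here is wrapping the query in a ConstantScoreQuery, which skips per-hit score computation (though not all per-hit bookkeeping). A hedged sketch; the class and method names are mine:

```java
import java.io.IOException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/** Sketch: feed an analytics collector without per-hit scoring by
 *  wrapping the original query; every hit gets the same constant score. */
public final class NoScoringSearch {
    public static void run(IndexSearcher searcher, Query query, Collector collector)
            throws IOException {
        searcher.search(new ConstantScoreQuery(query), collector);
    }
}
```

Whether this recovers the full 0.4 seconds seen in BulkScorer would need to be measured; the collector itself should also avoid calling the scorer.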
fill 'empty' facet-values, sampling, taxoreader
Hi all, I'm building an application in which users can add arbitrary documents, and all fields will be added as facets as well. This allows users to browse their documents by their own defined facets easily. However, when the number of documents gets very large, I switch to random-sampled facets to make sure the application stays responsive. By the nature of sampling, documents (and thus facet-values) will be missed. I let the user select the number of facet-values they want to see for each facet. For example, the default is 10. If a facet contains values 1 to 20, the user will always see 10 values if all documents are returned in the search and no sampling is done. If sampling is done, and the values are non-uniformly distributed, the user might end up with only 5 values instead of 10. I want to 'fill' the 5 empty facet-value slots with existing facet-values and an unknown facet-count (?). The reason behind this is that such a value might exist in the result set, and for interaction purposes it is very nice if this value can be selected and added to the query, to quickly find out whether there are documents that also contain this facet value. It is even more useful if these facet values are not sorted by count, but by label. The user can then quickly see whether there are documents that contain a certain value. I can iterate over the ordinals via the TaxonomyReader and TaxonomyFacets (by leveraging the 'children'), but these ordinals might no longer be used in the documents. What would be a good approach to tackle this issue?
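The children iteration mentioned above looks roughly like this (a sketch with the caveat from the message itself: an ordinal may no longer occur in any live document; the class name and the "(?)" presentation are assumptions):

```java
import java.io.IOException;
import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;

/** Sketch: list all child labels of a facet dimension via the taxonomy,
 *  independent of what the (sampled) search actually matched. */
public final class FacetValueLister {
    public static void listValues(TaxonomyReader taxo, String dim) throws IOException {
        int dimOrd = taxo.getOrdinal(new FacetLabel(dim));
        if (dimOrd == TaxonomyReader.INVALID_ORDINAL) {
            return; // dimension unknown to the taxonomy
        }
        TaxonomyReader.ChildrenIterator it = taxo.getChildren(dimOrd);
        for (int ord = it.next(); ord != TaxonomyReader.INVALID_ORDINAL; ord = it.next()) {
            FacetLabel label = taxo.getPath(ord);
            // Last path component is the value; the count is unknown here.
            System.out.println(label.components[label.length - 1] + " (?)");
        }
    }
}
```

Sorting these labels alphabetically before merging them into the sampled top-N slots would give the label-sorted presentation described above.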
Re: search performance
Hi Jamie, What is included in the 5 minutes? Just the call to the searcher, i.e. searcher.search(...)? Can you show a bit more of the code you use? On Tue, Jun 3, 2014 at 11:32 AM, Jamie wrote: > Vitaly > > Thanks for the contribution. Unfortunately, we cannot use Lucene's > pagination function, because in reality the user can skip pages to start > the search at any point, not just from the end of the previous search. Even > the > first search (without any pagination), with a max of 1000 hits, takes 5 > minutes to complete. > > Regards > > Jamie > > On 2014/06/03, 10:54 AM, Vitaly Funstein wrote: > >> Something doesn't quite add up. >> >> TopFieldCollector fieldCollector = TopFieldCollector.create(sort, >> max, true, >> >>> false, false, true); >>> >>> We use pagination, so only returning 1000 documents or so at a time. >>> >>> >>> You say you are using pagination, yet the API you are using to create >> your >> collector isn't how you would utilize Lucene's built-in "pagination" >> feature (unless I misunderstand the API). If the max in the snippet above is >> 1000, then you're simply returning the top 1000 docs every time you execute >> your search. Otherwise... well, could you actually post a bit more of your >> code that runs the search here, in particular? >> >> Assuming that the max is much larger than 1000, however, you could call >> fieldCollector.topDocs(int, int) after accumulating hits using this >> collector, but this won't work multiple times per query execution, >> according to the javadoc. So you either have to re-execute the full >> search, >> and then get the next chunk of ScoreDocs, or use the proper API for this, >> one that accepts as a parameter the end of the previous page of results, >> i.e. IndexSearcher.searchAfter(ScoreDoc, ...) >> >> > >
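For reference, the built-in sequential pagination Vitaly points at looks roughly like this (a sketch; the class name and parameter handling are mine, and as Jamie notes it only resumes from the previous page, it cannot jump to an arbitrary page):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

/** Sketch of deep paging with searchAfter: each call collects only one
 *  page instead of everything up to the requested offset. */
public final class Pager {
    public static TopDocs nextPage(IndexSearcher searcher, Query query, Sort sort,
                                   ScoreDoc lastHitOfPreviousPage, int pageSize)
            throws IOException {
        if (lastHitOfPreviousPage == null) {
            return searcher.search(query, pageSize, sort); // first page
        }
        return searcher.searchAfter(lastHitOfPreviousPage, query, pageSize, sort);
    }
}
```

The caller keeps the last ScoreDoc of each returned page and passes it back in to fetch the next one.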
Re: Getting multi-values to use in filter?
Hi Shai, I read the article on your blog, thanks for it! It seems to be a natural fit to do multi-values like this, and it is helpful indeed. For my specific problem, I have multiple values that do not have a fixed number, so it can be either 0 or 10 values. I think the best way to solve this is to encode the number of values as the first entry in the BDV. This is not that hard, so I will take this road. -Rob > On 27 Apr 2014 at 21:27, Shai Erera wrote: > > Hi Rob, > > Your question got me interested, so I wrote a quick prototype of what I > think solves your problem (and if not, I hope it solves someone else's! > :)). The idea is to write a special ValueSource, e.g. MaxValueSource, which > reads a BinaryDocValues field, decodes the values and returns the maximum one. It > can then be embedded in an expression quite easily. > > I published a post on Lucene expressions and included some prototype code > which demonstrates how to do it. Hope it's still helpful to you: > http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html. > > Shai > > >> On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera wrote: >> >> I don't think that you should use the facet module. If all you want is to >> encode a bunch of numbers under a 'foo' field, you can encode them into a >> byte[] and index them as a BDV. Then at search time you get the BDV and >> decode the numbers back. The facet module adds complexity here: yes, you >> get the encoding/decoding for free, but at the cost of adding mock >> categories to the taxonomy, or use associations, for no good reason IMO. >> >> Once you do that, you need to figure out how to extend the expressions >> module to support a function like maxValues(fieldName) (cannot use 'max' >> since it's reserved). I read about it some, and still haven't figured out >> exactly how to do it. The JavascriptCompiler can take custom functions to >> compile expressions, but the methods should take only double values. 
So I >> think it should be some sort of binding, but I'm not sure yet how to do it. >> Perhaps it should be a name like max_fieldName, which you add a custom >> Expression to as a binding ... I will try to look into it later. >> >> Shai >> >> >> On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde >> wrote: >> >>> Thanks for all the questions, gives me an opportunity to clarify it :) >>> >>> I want the user to be able to give a (simple) formula (so I don't know it >>> beforehand) and use that formula in the search. The Javascript >>> expressions are really powerful in this use case, but have the >>> single-value >>> limitation. Ideally, I would like to make it really flexible by for >>> example >>> allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > >>> fieldC. >>> >>> Currently, using single values, I can handle expressions in the form of >>> "fieldA - fieldB - fieldC > 0" and evaluate the long-value that I receive >>> from the FunctionValues and the ValueSource. I also optimize the query by >>> assuring the field exists and has a value, etc., so the search is still fast >>> enough. This works well, but single value only. >>> >>> I also looked into the facets Association Fields, as they somewhat look >>> like the thing that I want. Only in the faceting module, all ordinals and >>> values are stored in one field, so there is no easy way to extract the fields >>> that are used in the expression. >>> >>> I like the solution you suggested, to add for all the numeric fields an >>> encoded byte[] like the facets do, but then on a per-field basis, so that >>> each numeric field has a BDV field that contains all multiple values for >>> that field for that document. >>> >>> Now that I am typing this, I think there is another way. I could use the >>> faceting module and add a different facet field ($facetFIELDA, >>> $facetFIELDB) in the FacetsConfig for each field. 
That way it would be >>> relatively straightforward to get all the values for a field, as they are >>> exactly all the values in the BDV for that document's facet field. Only >>> aggregating all facets will be harder, as the >>> TaxonomyFacetSum*Associations >>> would need to do this for all fields that I need facet counts/sums for. >>> >>> What do you think? >>> >>> -Rob >>> >>> >>>> On Wed, Apr 23, 2014 at 5:13 PM, Shai E
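The count-as-first-entry encoding described at the top of this message could look like the following in plain Java: a sketch of the byte[] payload for a BinaryDocValues field, using fixed-width values for simplicity where a real implementation might prefer variable-length encoding. The class name is mine.

```java
import java.nio.ByteBuffer;

/** Sketch: pack a variable number of long values into one byte[] with
 *  the value count as the first entry, suitable for storage in a BDV. */
public final class MultiValueCodec {

    public static byte[] encode(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * values.length);
        buf.putInt(values.length); // count prefix
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    public static long[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] values = new long[buf.getInt()]; // read the count prefix
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getLong();
        }
        return values;
    }

    public static void main(String[] args) {
        long[] roundTripped = decode(encode(new long[] {60, 40}));
        System.out.println(roundTripped.length);                 // 2
        System.out.println(roundTripped[0] + roundTripped[1]);   // 100
    }
}
```

A ValueSource like the MaxValueSource prototype mentioned above would call decode() per document and fold the resulting array with max, sum, etc.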
Re: Getting multi-values to use in filter?
Thanks for all the questions, gives me an opportunity to clarify it :) I want the user to be able to give a (simple) formula (so I don't know it beforehand) and use that formula in the search. The Javascript expressions are really powerful in this use case, but have the single-value limitation. Ideally, I would like to make it really flexible by for example allowing (in-document aggregating) expressions like: max(fieldA) - fieldB > fieldC. Currently, using single values, I can handle expressions in the form of "fieldA - fieldB - fieldC > 0" and evaluate the long-value that I receive from the FunctionValues and the ValueSource. I also optimize the query by assuring the field exists and has a value, etc., so the search is still fast enough. This works well, but single value only. I also looked into the facets Association Fields, as they somewhat look like the thing that I want. Only in the faceting module, all ordinals and values are stored in one field, so there is no easy way to extract the fields that are used in the expression. I like the solution you suggested, to add for all the numeric fields an encoded byte[] like the facets do, but then on a per-field basis, so that each numeric field has a BDV field that contains all multiple values for that field for that document. Now that I am typing this, I think there is another way. I could use the faceting module and add a different facet field ($facetFIELDA, $facetFIELDB) in the FacetsConfig for each field. That way it would be relatively straightforward to get all the values for a field, as they are exactly all the values in the BDV for that document's facet field. Only aggregating all facets will be harder, as the TaxonomyFacetSum*Associations would need to do this for all fields that I need facet counts/sums for. What do you think? -Rob On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera wrote: > A NumericDocValues field can only hold one value. Have you thought about > encoding the values in a BinaryDocValues field? 
Or are you talking about > multiple fields (different names), each has its own single value, and at > search time you sum the values from a different set of fields? > > If it's one field, multiple values, then why do you need to separate the > values? Is it because you sometimes sum and sometimes e.g. avg? Do you > always include all values of a document in the formula, but the formula > changes between searches, or do you sometimes use only a subset of the > values? > > If you always use all values, but change the formula between queries, then > perhaps you can just encode the pre-computed value under different NDV > fields? If you only use a handful of functions (and they are known in > advance), it may not be too heavy on the index, and definitely perform > better during search. > > Otherwise, I believe I'd consider indexing them as a BDV field. For facets, > we basically need the same multi-valued numeric field, and given that NDV > is single-valued, we went w/ BDV. > > If I misunderstood the scenario, I'd appreciate it if you clarify it :) > > Shai > > > On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde >wrote: > > > Hi Shai, all, > > > > I am trying to write that Filter :). But I'm a bit at a loss as to how to > > efficiently grab the multi-values. I can access the > > context.reader().document() that accesses the stored fields, but that > seems > > slow. > > > > For single-value fields I use a compiled JavaScript Expression with > > SimpleBindings as ValueSource, which seems to work quite well. The > downside > > is that I cannot find a way to implement multi-value through that > solution. > > > > These create for example a LongFieldSource, which uses the > > FieldCache.LongParser. These parsers only seem to parse one field. > > > > Is there an efficient way to get -all- of the (numeric) values for a > field > > in a document? 
> > > > > > On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera wrote: > > > > > You can do that by writing a Filter which returns matching documents > > based > > > on a sum of the field's value. However I suspect that is going to be > > slow, > > > unless you know that you will need several such filters and can cache > > them. > > > > > > Another approach would be to write a Collector which serves as a > Filter, > > > but computes the sum only for documents that match the query. Hopefully > > > that would mean you compute the sum for less documents than you would > > have > > > w/ the Filter approach. > > > > > > Shai > > > > > > > > > On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov < > > > msoko...@safariboo
Re: Getting multi-values to use in filter?
Hi Shai, all, I am trying to write that Filter :). But I'm a bit at a loss as to how to efficiently grab the multi-values. I can access the context.reader().document() that accesses the stored fields, but that seems slow. For single-value fields I use a compiled JavaScript Expression with SimpleBindings as ValueSource, which seems to work quite well. The downside is that I cannot find a way to implement multi-value through that solution. These create for example a LongFieldSource, which uses the FieldCache.LongParser. These parsers only seem to parse one field. Is there an efficient way to get -all- of the (numeric) values for a field in a document? On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera wrote: > You can do that by writing a Filter which returns matching documents based > on a sum of the field's value. However I suspect that is going to be slow, > unless you know that you will need several such filters and can cache them. > > Another approach would be to write a Collector which serves as a Filter, > but computes the sum only for documents that match the query. Hopefully > that would mean you compute the sum for fewer documents than you would have > w/ the Filter approach. > > Shai > > > On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov < > msoko...@safaribooksonline.com> wrote: > > > This isn't really a good use case for an index like Lucene. The most > > essential property of an index is that it lets you look up documents very > > quickly based on *precomputed* values. > > > > -Mike > > > > > > On 04/23/2014 06:56 AM, Rob Audenaerde wrote: > > > >> Hi all, > >> > >> I'm looking for a way to use multi-values in a filter. > >> > >> I want to be able to search on sum(field)=100, where field has values > in > >> one document: > >> > >> field=60 > >> field=40 > >> > >> In this case 'field' is a LongField. I examined the code in the > >> FieldCache, > >> but that seems to focus on single-valued fields only, or > >> > >> > >> Is this something that can be done in Lucene? 
And what would be a good > >> approach? > >> > >> Thanks in advance, > >> > >> -Rob > >> > >> > >
Re: Getting multi-values to use in filter?
Hi Mike, Thanks for your reply. I think it is not so much an invalid use case for Lucene. Lucene already has (experimental) support for Dynamic Range Facets, expressions (JavaScript expressions, geospatial haversin, etc.). They are all computed on the fly and work really well. They just depend on the fact that there is only one (numeric) value per field per document. -Rob On Wed, Apr 23, 2014 at 4:11 PM, Michael Sokolov < msoko...@safaribooksonline.com> wrote: > This isn't really a good use case for an index like Lucene. The most > essential property of an index is that it lets you look up documents very > quickly based on *precomputed* values. > > -Mike > > > > On 04/23/2014 06:56 AM, Rob Audenaerde wrote: > >> Hi all, >> >> I'm looking for a way to use multi-values in a filter. >> >> I want to be able to search on sum(field)=100, where field has values in >> one document: >> >> field=60 >> field=40 >> >> In this case 'field' is a LongField. I examined the code in the >> FieldCache, >> but that seems to focus on single-valued fields only, or >> >> >> Is this something that can be done in Lucene? And what would be a good >> approach? >> >> Thanks in advance, >> >> -Rob
Getting multi-values to use in filter?
Hi all, I'm looking for a way to use multi-values in a filter. I want to be able to search on sum(field)=100, where field has values in one document: field=60 field=40 In this case 'field' is a LongField. I examined the code in the FieldCache, but that seems to focus on single-valued fields only, or Is this something that can be done in Lucene? And what would be a good approach? Thanks in advance, -Rob
NRT facet issue (bug?), hard to reproduce, please advise
Hi all, I have an issue using the near-real-time search with the taxonomy. I could really use some advice on how to debug/proceed with this issue. The issue is as follows: I index 100k documents, with about 40 fields each. For each field, I also add a FacetField (the issue arises with both FacetField and FloatAssociationFacetField). Each document has a unique number field (client_no). When just indexing and searching afterwards, all is fine. When searching while indexing, sometimes the number of facets associated with a document is too high, i.e. when collecting facets there is more than one client_no on one document, which of course should not be the case. Before each search, I use manager.maybeRefreshBlocking(), because I want the most up-to-date results. I have a taxonomy reader and an index reader combined in a ReferenceManager (I created this before the SearcherTaxonomyManager existed, but it behaves exactly the same, with similar refcount logic). During indexing I commit every 5000 documents (not needed for the NRT search, but needed to prevent loss should the application shut down). I commit as follows:

public void commit() throws DocumentIndexException {
    try {
        synchronized (GlobalIndexCommitAndCloseLock.LOCK) {
            this.taxonomyWriter.commit();
            this.luceneIndexWriter.commit();
        }
    } catch (final OutOfMemoryError | IOException e) {
        tryCloseWritersOnOOME(this.luceneIndexWriter, this.taxonomyWriter);
        throw new DocumentIndexException(e);
    }
}

I use a standard IndexWriterConfig, and both the IndexWriter and the TaxonomyWriter use a RAMDirectory. My testcase indexes the 100k documents while another thread continuously calls manager.maybeRefreshBlocking(). This is enough to sometimes cause the taxonomy to be incorrect. The number of indexing threads does not seem to influence the issue, as it also appears when I have only one indexing thread. 
I know it is an index problem, because when I write the index to a file instead of RAM and reopen it in a clean application, I see the same behaviour. I could really use some advice on how to debug/proceed with this issue. If more info is needed, just ask. Thanks in advance, -Rob
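As a cross-check, it may be worth temporarily swapping the home-grown ReferenceManager for the stock SearcherTaxonomyManager, which refreshes the index searcher and taxonomy reader as one consistent pair; if the problem disappears, that points at the custom refcount logic. A usage sketch (method and class names are mine):

```java
import java.io.IOException;
import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager;
import org.apache.lucene.facet.taxonomy.SearcherTaxonomyManager.SearcherAndTaxonomy;

/** Sketch: acquire/release a matched searcher + taxonomy-reader pair. */
public final class NrtFacetSearch {
    public static void searchWithFreshReaders(SearcherTaxonomyManager manager)
            throws IOException {
        manager.maybeRefreshBlocking(); // most up-to-date view before searching
        SearcherAndTaxonomy pair = manager.acquire();
        try {
            // run the query and facet collection with pair.searcher
            // and pair.taxonomyReader here
        } finally {
            manager.release(pair); // always release back to the manager, never close directly
        }
    }
}
```

The manager guarantees that the taxonomy reader is refreshed before the index searcher, so ordinals seen in the index always resolve in the taxonomy.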