Re: Slower document retrieval in 8.7.0 comparing to 7.5.0

2021-01-22 Thread Rob Audenaerde
I did some testing for you :)

I modified your code to run in a JMH benchmark and changed the number of
retrieved docs to 1000 out of the 1M in the index. This is what I got:

Lucene 7.5
Benchmark Mode  Cnt   Score   Error  Units
DocRetrievalBenchmark.retrieveDocuments  thrpt4  37.147 ± 6.218  ops/s

Lucene 8.7
Benchmark Mode  Cnt   Score   Error  Units
DocRetrievalBenchmark.retrieveDocuments  thrpt4  18.680 ± 5.755  ops/s

This is much in line with your observations (Lucene 8.7 seems almost twice
as slow), so something is going on when running out-of-the-box.

The code can be found here (not really beautiful, but it gets the job done;
if you want to switch Lucene versions, edit the pom and make sure to set
the proper index version):
https://gist.github.com/d2a-raudenaerde/93a490e5b0d17b2fa88862473429aeb3

JMH details:
# JMH version: 1.21
# VM version: JDK 11.0.9.1, OpenJDK 64-Bit Server VM,
11.0.9.1+1-Ubuntu-0ubuntu1.20.04
# VM invoker: /usr/lib/jvm/java-11-openjdk-amd64/bin/java
# VM options: -Xms2G -Xmx2G
# Warmup: 2 iterations, 10 s each
# Measurement: 4 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.audenaerde.lucene.DocRetrievalBenchmark.retrieveDocuments
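For those who want to reproduce this without digging through the gist, the core of the benchmark looks roughly like this (a sketch from memory, not the exact gist code; the index path and seed are assumptions):

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Random;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class DocRetrievalBenchmark {

    private DirectoryReader reader;
    private IndexSearcher searcher;
    private int[] docIds;

    @Setup
    public void setup() throws IOException {
        // "index" is a placeholder path; the real index holds 1M documents.
        reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        searcher = new IndexSearcher(reader);
        // 1000 random doc ids, fixed seed so every fork retrieves the same docs.
        docIds = new Random(42).ints(1000, 0, reader.maxDoc()).toArray();
    }

    @Benchmark
    public void retrieveDocuments(Blackhole bh) throws IOException {
        for (int docId : docIds) {
            bh.consume(searcher.doc(docId)); // stored-field retrieval under test
        }
    }

    @TearDown
    public void tearDown() throws IOException {
        reader.close();
    }
}
```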


On Fri, Jan 22, 2021 at 4:22 PM Martynas L  wrote:

> Just played with my reading sample. My goal is not to show exact
> numbers, but it is a fact that document retrieval via IndexSearcher.doc(int)
> is much slower.
> All our performance tests showed degradation after upgrading to 8.7.0;
> even without measuring, we can "see/feel" that operations involving
> document retrieval became slower.
>
>
>
> On Fri, Jan 22, 2021 at 4:48 PM Rob Audenaerde 
> wrote:
>
> > Hi Martynas
> >
> > How did you measure that?
> >
> > I ask because writing a good benchmark is not an easy task, since there
> > are so many factors (class loading times, JIT effects, etc.). You should
> > use the Java Microbenchmark Harness (JMH) or similar, and set up a random
> > document retrieval task, with warm-up etc.
> >
> > (I'm not aware of any big slowdowns, but as you see them, the best way is
> > to build a robust benchmark and then start comparing)
> >
> > -Rob
> >
> >
> > On Fri, Jan 22, 2021 at 3:43 PM Martynas L 
> wrote:
> >
> > > Even retrieving a single document, 8.7.0 is more than 2x slower
> > >
> > > On Fri, Jan 22, 2021 at 2:28 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > > dceccarel...@bloomberg.net> wrote:
> > >
> > > > >  I think it will be similar ratio retrieving any number of
> documents.
> > > >
> > > > I'm not sure this is true; if you retrieve a huge number of documents,
> > > > you might cause trouble for the GC.
> > > >
> > > > From: java-user@lucene.apache.org At: 01/22/21 12:11:19 To:
> > > > java-user@lucene.apache.org
> > > > Subject: Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
> > > >
> > > > The emphasis should not be on the number of retrieved documents, but
> > > > on the duration ratio - 8.7.0 is 3 times slower. I think the ratio
> > > > will be similar when retrieving any number of documents.
> > > >
> > > > On Fri, Jan 22, 2021 at 1:39 PM Rob Audenaerde <
> > rob.audenae...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Martynas,
> > > > >
> > > > > In your sample code you are retrieving all (1 million!) documents
> > > > > from the index; that is surely not a good match for Lucene :)
> > > > >
> > > > > Is that a good reflection of your use-case?
> > > > >
> > > > > On Fri, Jan 22, 2021 at 9:52 AM Martynas L  >
> > > > wrote:
> > > > >
> > > > > >  Please see the sample at
> > > > > >
> > > >
> > https://drive.google.com/drive/folders/1ufVZXzkugBAFnuy8HLAY6mbPWzjknrfE
> > > > > >
> > > > > > IndexGenerator - creates a dummy index.
> > > > > > IndexReader - retrieves documents; duration is ~2s with 7.5.0
> > > > > > vs ~6s with 8.7.0.
> > > > > >
> > > > > > Regards,
> > > > > > Martynas
> > > > > >
> > > > > >

Re: Slower document retrieval in 8.7.0 comparing to 7.5.0

2021-01-22 Thread Rob Audenaerde
Hi Martynas

How did you measure that?

I ask because writing a good benchmark is not an easy task, since there
are so many factors (class loading times, JIT effects, etc.). You should use
the Java Microbenchmark Harness (JMH) or similar, and set up a random
document retrieval task, with warm-up etc.

(I'm not aware of any big slowdowns, but as you see them, the best way is
to build a robust benchmark and then start comparing)

-Rob


On Fri, Jan 22, 2021 at 3:43 PM Martynas L  wrote:

> Even retrieving a single document, 8.7.0 is more than 2x slower
>
> On Fri, Jan 22, 2021 at 2:28 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > >  I think it will be similar ratio retrieving any number of documents.
> >
> > I'm not sure this is true; if you retrieve a huge number of documents,
> > you might cause trouble for the GC.
> >
> > From: java-user@lucene.apache.org At: 01/22/21 12:11:19 To:
> > java-user@lucene.apache.org
> > Subject: Re: Slower document retrieval in 8.7.0 comparing to 7.5.0
> >
> > The emphasis should not be on the number of retrieved documents, but on
> > the duration ratio - 8.7.0 is 3 times slower. I think the ratio will be
> > similar when retrieving any number of documents.
> >
> > On Fri, Jan 22, 2021 at 1:39 PM Rob Audenaerde  >
> > wrote:
> >
> > > Hi Martynas,
> > >
> > > In your sample code you are retrieving all (1 million!) documents from
> > > the index; that is surely not a good match for Lucene :)
> > >
> > > Is that a good reflection of your use-case?
> > >
> > > On Fri, Jan 22, 2021 at 9:52 AM Martynas L 
> > wrote:
> > >
> > > >  Please see the sample at
> > > >
> > https://drive.google.com/drive/folders/1ufVZXzkugBAFnuy8HLAY6mbPWzjknrfE
> > > >
> > > > IndexGenerator - creates a dummy index.
> > > > IndexReader - retrieves documents; duration is ~2s with 7.5.0 vs
> > > > ~6s with 8.7.0.
> > > >
> > > > Regards,
> > > > Martynas
> > > >
> > > >
> > > > On Thu, Jan 21, 2021 at 8:21 PM Rob Audenaerde <
> > rob.audenae...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > There is no attachment in the previous email that I can see? Maybe
> > you
> > > > can
> > > > > post it online?
> > > > >
> > > > > On Thu, Jan 21, 2021 at 4:54 PM Martynas L  >
> > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Are there any comments on this issue?
> > > > > > If there is no workaround, we will be forced to rollback to the
> > 7.5.0
> > > > > > version.
> > > > > >
> > > > > > Best regards,
> > > > > > Martynas
> > > > > >
> > > > > > On Tue, Jan 12, 2021 at 12:27 PM Martynas L <
> > martynas@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Please see attached sample.
> > > > > > > IndexGenerator - creates a dummy index.
> > > > > > > IndexReader - retrieves documents; duration is ~2s with 7.5.0
> > > > > > > vs ~6s with 8.7.0.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Martynas
> > > > > > >
> > > > > > > On Tue, Dec 22, 2020 at 3:23 PM Vincenzo D'Amore <
> > > v.dam...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I think it would be useful to have an example of a document
> and,
> > > if
> > > > > > >> possible, an example of query that takes too long.
> > > > > > >>
> > > > > > >> On Mon, Dec 21, 2020 at 1:47 PM Martynas L <
> > > martynas@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hello,
> > > > > > >> >
> > > > > > >> > I am sorry for the delay.
> > > > > > >> >
> > > > > > >> > Not sure what you mean by "workload".

Re: Slower document retrieval in 8.7.0 comparing to 7.5.0

2021-01-22 Thread Rob Audenaerde
Hi Martynas,

In your sample code you are retrieving all (1 million!) documents from the
index; that is surely not a good match for Lucene :)

Is that a good reflection of your use-case?

On Fri, Jan 22, 2021 at 9:52 AM Martynas L  wrote:

>  Please see the sample at
> https://drive.google.com/drive/folders/1ufVZXzkugBAFnuy8HLAY6mbPWzjknrfE
>
> IndexGenerator - creates a dummy index.
> IndexReader - retrieves documents; duration is ~2s with 7.5.0 vs ~6s
> with 8.7.0.
>
> Regards,
> Martynas
>
>
> On Thu, Jan 21, 2021 at 8:21 PM Rob Audenaerde 
> wrote:
>
> > There is no attachment in the previous email that I can see? Maybe you
> can
> > post it online?
> >
> > On Thu, Jan 21, 2021 at 4:54 PM Martynas L 
> wrote:
> >
> > > Hello,
> > >
> > > Are there any comments on this issue?
> > > If there is no workaround, we will be forced to rollback to the 7.5.0
> > > version.
> > >
> > > Best regards,
> > > Martynas
> > >
> > > On Tue, Jan 12, 2021 at 12:27 PM Martynas L 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Please see attached sample.
> > > > IndexGenerator - creates a dummy index.
> > > > IndexReader - retrieves documents; duration is ~2s with 7.5.0 vs
> > > > ~6s with 8.7.0.
> > > >
> > > > Regards,
> > > > Martynas
> > > >
> > > > On Tue, Dec 22, 2020 at 3:23 PM Vincenzo D'Amore  >
> > > > wrote:
> > > >
> > > >> I think it would be useful to have an example of a document and, if
> > > >> possible, an example of query that takes too long.
> > > >>
> > > >> On Mon, Dec 21, 2020 at 1:47 PM Martynas L 
> > > >> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I am sorry for the delay.
> > > >> >
> > > >> > Not sure what you mean by "workload". We have performance tests,
> > > >> > which started failing after upgrading to 8.7.0.
> > > >> > So I just tried to query the index (built from the same source) to
> > > >> > get all documents and compare the performance with 7.5.0.
> > > >> >
> > > >> > Document "size" is a sum of all stored string lengths (3402519
> > > >> documents):
> > > >> >
> > > >> > doc size 903 - 88s vs 22s
> > > >> >
> > > >> > doc size 36 (only one field loaded, used searcher.doc(docID,
> > > >> > Collections.singleton("fieldName"))) - 78s vs 16s
> > > >> >
> > > >> > doc size 439 (some fields made not stored) - 46s vs 14.5s
> > > >> >
> > > >> > Best regards,
> > > >> > Martynas
> > > >> >
> > > >> > On Fri, Dec 4, 2020 at 12:06 AM Adrien Grand 
> > > wrote:
> > > >> >
> > > >> > > Hello Martynas,
> > > >> > >
> > > >> > > There have indeed been changes related to stored fields in 8.7.
> > What
> > > >> does
> > > >> > > your workload look like and how large are your documents on
> > average?
> > > >> > >
> > > >> > > On Thu, Dec 3, 2020 at 3:04 PM Martynas L <
> martynas@gmail.com
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > > We've migrated from 7.5.0 to 8.7.0 and found that the index
> > > >> > > "searching"
> > > >> > > > is significantly (4-5 times) slower in the latest version.
> > > >> > > > It seems that
> > > >> > > > org.apache.lucene.search.IndexSearcher#doc(int)
> > > >> > > > is slower.
> > > >> > > >
> > > >> > > > Is it possible to have similar performance with 8.7.0?
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > > Martynas
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Adrien
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> Vincenzo D'Amore
> > > >>
> > > >
> > >
> >
>


Re: Slower document retrieval in 8.7.0 comparing to 7.5.0

2021-01-21 Thread Rob Audenaerde
There is no attachment in the previous email that I can see? Maybe you can
post it online?

On Thu, Jan 21, 2021 at 4:54 PM Martynas L  wrote:

> Hello,
>
> Are there any comments on this issue?
> If there is no workaround, we will be forced to rollback to the 7.5.0
> version.
>
> Best regards,
> Martynas
>
> On Tue, Jan 12, 2021 at 12:27 PM Martynas L 
> wrote:
>
> > Hi,
> >
> > Please see attached sample.
> > IndexGenerator - creates a dummy index.
> > IndexReader - retrieves documents; duration is ~2s with 7.5.0 vs ~6s
> > with 8.7.0.
> >
> > Regards,
> > Martynas
> >
> > On Tue, Dec 22, 2020 at 3:23 PM Vincenzo D'Amore 
> > wrote:
> >
> >> I think it would be useful to have an example of a document and, if
> >> possible, an example of query that takes too long.
> >>
> >> On Mon, Dec 21, 2020 at 1:47 PM Martynas L 
> >> wrote:
> >>
> >> > Hello,
> >> >
> >> > I am sorry for the delay.
> >> >
> >> > Not sure what you mean by "workload". We have performance tests,
> >> > which started failing after upgrading to 8.7.0.
> >> > So I just tried to query the index (built from the same source) to get
> >> > all documents and compare the performance with 7.5.0.
> >> >
> >> > Document "size" is a sum of all stored string lengths (3402519
> >> documents):
> >> >
> >> > doc size 903 - 88s vs 22s
> >> >
> >> > doc size 36 (only one field loaded, used searcher.doc(docID,
> >> > Collections.singleton("fieldName"))) - 78s vs 16s
> >> >
> >> > doc size 439 (some fields made not stored) - 46s vs 14.5s
> >> >
> >> > Best regards,
> >> > Martynas
> >> >
> >> > On Fri, Dec 4, 2020 at 12:06 AM Adrien Grand 
> wrote:
> >> >
> >> > > Hello Martynas,
> >> > >
> >> > > There have indeed been changes related to stored fields in 8.7. What
> >> does
> >> > > your workload look like and how large are your documents on average?
> >> > >
> >> > > On Thu, Dec 3, 2020 at 3:04 PM Martynas L 
> >> > wrote:
> >> > >
> >> > > > Hi,
> >> > > > We've migrated from 7.5.0 to 8.7.0 and found that the index
> >> > > "searching"
> >> > > > is significantly (4-5 times) slower in the latest version.
> >> > > > It seems that
> >> > > > org.apache.lucene.search.IndexSearcher#doc(int)
> >> > > > is slower.
> >> > > >
> >> > > > Is it possible to have similar performance with 8.7.0?
> >> > > >
> >> > > > Best regards,
> >> > > > Martynas
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Adrien
> >> > >
> >> >
> >>
> >>
> >> --
> >> Vincenzo D'Amore
> >>
> >
>


Fwd: best way (performance wise) to search for field without value?

2020-11-13 Thread Rob Audenaerde
To follow up: based on a quick JMH test with 2M docs with some random data,
I see a speedup of 70% :)
That is a nice Friday-afternoon gift, thanks!

For people who are interested:

I added a BinaryDocValues field like this:

doc.add(new BinaryDocValuesField("GROUPS_ALLOWED_EMPTY", new BytesRef(new byte[] { 0x01 })));

And used:

finalQuery.add(new DocValuesFieldExistsQuery("GROUPS_ALLOWED_EMPTY"), BooleanClause.Occur.SHOULD);
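
For completeness, the two pieces combine roughly like this (a sketch; finalQuery, userGroups and originalQuery are assumed names, not the exact production code):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Sketch: a document matches when "groups_allowed" contains one of the
// user's groups, OR when it was flagged as having no groups at all.
BooleanQuery.Builder finalQuery = new BooleanQuery.Builder();
for (String group : userGroups) {
    finalQuery.add(new TermQuery(new Term("groups_allowed", group)),
        BooleanClause.Occur.SHOULD);
}
finalQuery.add(new DocValuesFieldExistsQuery("GROUPS_ALLOWED_EMPTY"),
    BooleanClause.Occur.SHOULD);

// Wrap around the original query; FILTER skips scoring for the restriction.
Query secured = new BooleanQuery.Builder()
    .add(originalQuery, BooleanClause.Occur.MUST)
    .add(finalQuery.build(), BooleanClause.Occur.FILTER)
    .build();
```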

On Fri, Nov 13, 2020 at 2:09 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Maybe NormsFieldExistsQuery as a MUST_NOT clause?  Though, you must enable
> norms on your field to use that.
>
> TermRangeQuery is indeed a horribly costly way to execute this, but if you
> cache the result on each refresh, perhaps it is OK?
>
> You could also index a dedicated doc values field indicating that the
> field is empty and then use DocValuesFieldExistsQuery.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Nov 13, 2020 at 7:56 AM Rob Audenaerde 
> wrote:
>
>> Hi all,
>>
>> We have implemented some security on our index by adding a field
>> 'groups_allowed' to documents, and wrapping a boolean MUST query around
>> the original query that checks whether one of the given user groups
>> matches at least one groups_allowed value.
>>
>> We chose to leave the groups_allowed field empty when the document should
>> be retrievable by all users, so we also need to select a document if
>> 'groups_allowed' is empty.
>>
>> What would be the faster Query construction to do so?
>>
>>
>> Currently I use a TermRangeQuery that basically matches all values and put
>> that in a MUST_NOT combined with a MatchAllDocsQuery(), but that gets
>> rather slow when the number of groups is high.
>>
>> Thanks!
>>
>


best way (performance wise) to search for field without value?

2020-11-13 Thread Rob Audenaerde
Hi all,

We have implemented some security on our index by adding a field
'groups_allowed' to documents, and wrapping a boolean MUST query around the
original query that checks whether one of the given user groups matches at
least one groups_allowed value.

We chose to leave the groups_allowed field empty when the document should
be retrievable by all users, so we also need to select a document if
'groups_allowed' is empty.

What would be the faster Query construction to do so?


Currently I use a TermRangeQuery that basically matches all values and put
that in a MUST_NOT combined with a MatchAllDocsQuery(), but that gets
rather slow when the number of groups is high.
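
Concretely, the current (slow) construction is roughly this (a sketch; the field name is assumed):

```java
import org.apache.lucene.search.*;

// "No value in groups_allowed" expressed as: all documents, MUST_NOT any
// document that has a term in the field (open-ended range = any value).
BooleanQuery.Builder empty = new BooleanQuery.Builder();
empty.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
empty.add(new TermRangeQuery("groups_allowed", null, null, true, true),
    BooleanClause.Occur.MUST_NOT);
Query noGroupsAllowed = empty.build();
```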

Thanks!


Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
Ah. That makes sense. Thanks!

(I might re-run on a larger index just to learn how it works in more detail)

On Tue, Oct 13, 2020 at 1:24 PM Adrien Grand  wrote:

> 100,000+ requests per core per second is a lot. :) My initial reaction is
> that the query is likely so fast on that index that the bottleneck might be
> rewriting or the initialization of weights/scorers (which don't get more
> costly as the index gets larger) rather than actual query execution, which
> means that we can't really conclude that the boolean query is faster than
> the TermInSetQuery.
>
> Also beware that IndexSearcher#count will look at index statistics if your
> queries have a single term, which would no longer work if you use this
> query as a filter for another query.
>
> On Tue, Oct 13, 2020 at 12:51 PM Rob Audenaerde 
> wrote:
>
> > I reduced the benchmark as far as I could, and now got these results,
> > TermsInSet being a lot slower compared to the Terms/SHOULD.
> >
> >
> > BenchmarkOrQuery.benchmarkTerms   thrpt5  190820.510 ± 16667.411
> > ops/s
> > BenchmarkOrQuery.benchmarkTermsInSet  thrpt5  110548.345 ±  7490.169
> > ops/s
> >
> >
> > @Fork(1)
> > @Measurement(iterations = 5, time = 10)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > @Warmup(iterations = 3, time = 1)
> > @Benchmark
> > public void benchmarkTerms(final MyState myState) {
> >     try {
> >         final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
> >         final BooleanQuery.Builder b = new BooleanQuery.Builder();
> >
> >         for (final String role : myState.user.getAdditionalRoles()) {
> >             b.add(new TermQuery(new Term(roles, new BytesRef(role))),
> >                 BooleanClause.Occur.SHOULD);
> >         }
> >         searcher.count(b.build());
> >     } catch (final IOException e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> > @Fork(1)
> > @Measurement(iterations = 5, time = 10)
> > @OutputTimeUnit(TimeUnit.SECONDS)
> > @Warmup(iterations = 3, time = 1)
> > @Benchmark
> > public void benchmarkTermsInSet(final MyState myState) {
> >     try {
> >         final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
> >         final Set<BytesRef> roles = myState.user.getAdditionalRoles()
> >             .stream().map(BytesRef::new).collect(Collectors.toSet());
> >         searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
> >     } catch (final IOException e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> >
> > On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde <
> rob.audenae...@gmail.com>
> > wrote:
> >
> > > Hello Adrien,
> > >
> > > Thanks for the swift reply. I'll add the details:
> > >
> > > Lucene version: 8.6.2
> > >
> > > The restrictionQuery is indeed a conjunction; it allows for a document
> > > to be a hit if the 'roles' field is empty as well. It's used within a
> > > bigger query builder, so maybe I did something else wrong. I'll rewrite
> > > the benchmark to just benchmark the TermsInSet and Terms.
> > >
> > > It never occurred (hah) to me to use Occur.FILTER, that is a good point
> > to
> > > check as well.
> > >
> > > As you put it, I would expect the results to be very similar, as I do
> > > not reach the 16 terms in the TermInSetQuery. I'll let you know what I
> > > find.
> > >
> > > On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand 
> wrote:
> > >
> > >> Can you give us a few more details:
> > >>  - What version of Lucene are you testing?
> > >>  - Are you benchmarking "restrictionQuery" on its own, or its
> > conjunction
> > >> with another query?
> > >>
> > >> You mentioned that you combine your "restrictionQuery" and the user
> > query
> > >> with Occur.MUST, Occur.FILTER feels more appropriate for
> > >> "restrictionQuery"
> > >> since it should not contribute to scoring.
> > >>
> > >> TermsInSetQuery automatically executes like a BooleanQuery when the
> > number
> > >> of clauses is less than 16, so I would not expect major performance
> > >> differences between a TermInSetQuery over less than 16 terms and a
> > >> BooleanQuery wrapped in a ConstantScoreQuery.
> > >>
> > >> On Tue

Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
I reduced the benchmark as far as I could, and now got these results,
TermsInSet being a lot slower compared to the Terms/SHOULD.


BenchmarkOrQuery.benchmarkTerms   thrpt5  190820.510 ± 16667.411  ops/s
BenchmarkOrQuery.benchmarkTermsInSet  thrpt5  110548.345 ±  7490.169  ops/s


@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTerms(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final BooleanQuery.Builder b = new BooleanQuery.Builder();

        for (final String role : myState.user.getAdditionalRoles()) {
            b.add(new TermQuery(new Term(roles, new BytesRef(role))),
                BooleanClause.Occur.SHOULD);
        }
        searcher.count(b.build());
    } catch (final IOException e) {
        e.printStackTrace();
    }
}

@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTermsInSet(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final Set<BytesRef> roles = myState.user.getAdditionalRoles()
            .stream().map(BytesRef::new).collect(Collectors.toSet());
        searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));
    } catch (final IOException e) {
        e.printStackTrace();
    }
}


On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde 
wrote:

> Hello Adrien,
>
> Thanks for the swift reply. I'll add the details:
>
> Lucene version: 8.6.2
>
> The restrictionQuery is indeed a conjunction; it allows for a document to
> be a hit if the 'roles' field is empty as well. It's used within a
> bigger query builder, so maybe I did something else wrong. I'll rewrite the
> benchmark to just benchmark the TermsInSet and Terms.
>
> It never occurred (hah) to me to use Occur.FILTER, that is a good point to
> check as well.
>
> As you put it, I would expect the results to be very similar, as I do not
> reach the 16 terms in the TermInSetQuery. I'll let you know what I find.
>
> On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand  wrote:
>
>> Can you give us a few more details:
>>  - What version of Lucene are you testing?
>>  - Are you benchmarking "restrictionQuery" on its own, or its conjunction
>> with another query?
>>
>> You mentioned that you combine your "restrictionQuery" and the user query
>> with Occur.MUST, Occur.FILTER feels more appropriate for
>> "restrictionQuery"
>> since it should not contribute to scoring.
>>
>> TermsInSetQuery automatically executes like a BooleanQuery when the number
>> of clauses is less than 16, so I would not expect major performance
>> differences between a TermInSetQuery over less than 16 terms and a
>> BooleanQuery wrapped in a ConstantScoreQuery.
>>
>> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde > >
>> wrote:
>>
>> > Hello,
>> >
>> > I'm benchmarking an application which implements security on lucene by
>> > adding a multivalue field "roles". If the user has one of these roles,
>> he
>> > can find the document.
>> >
>> > I implemented this as a boolean AND query, adding the original query
>> > and the restriction with Occur.MUST.
>> >
>> > I'm having some performance issues when counting the index (>60M docs),
>> so
>> > I thought about tweaking this restriction-implementation.
>> >
>> > I set up a benchmark like this:
>> >
>> > I generate 2M documents. Each document has a multi-value "roles" field.
>> > The "roles" field in each document has 4 values, taken from (2,2,1000,100)
>> > unique values.
>> > The user has (1,1,2,1) values for roles (so, 1 out of the 2, for the
>> first
>> > role, 1 out of 2 for the second, 2 out of the 1000 for the third value,
>> and
>> > 1 / 100 for the fourth).
>> >
>> > I got a somewhat unexpected performance difference. At first, I
>> implemented
>> > the restriction query like this:
>> >
>> > for (final String role : roles) {
>> > restrictionQuery.add(new TermQuery(new Term("roles", new
>> > BytesRef(role))), Occur.SHOULD);
>> > }
>> >
>> > I then switched to a TermInSetQuery, which I thought would be faster
>> > as it is using constant-scores.
>> >
>> > final Set<BytesRef> rolesSet =
>> >     roles.stream().map(BytesRef::new).collect(Collectors.toSet());
>> > restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);
>> >
>> >
>> > However, the TermInSetQuery has about 25% lower ops/s. Is that to be
>> > expected? I did not expect it, as I thought the constant scoring would
>> > be faster.
>> >
>>
>>
>> --
>> Adrien
>>
>


Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
Hello Adrien,

Thanks for the swift reply. I'll add the details:

Lucene version: 8.6.2

The restrictionQuery is indeed a conjunction; it allows for a document to
be a hit if the 'roles' field is empty as well. It's used within a
bigger query builder, so maybe I did something else wrong. I'll rewrite the
benchmark to just benchmark the TermsInSet and Terms.

It never occurred (hah) to me to use Occur.FILTER, that is a good point to
check as well.

As you put it, I would expect the results to be very similar, as I do not
reach the 16 terms in the TermInSetQuery. I'll let you know what I find.
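
For reference, the FILTER variant suggested above would look roughly like this (a sketch; userQuery and restrictionQuery are assumed names):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// FILTER matches like MUST but does not contribute to scoring,
// so the restriction clause skips score computation entirely.
Query secured = new BooleanQuery.Builder()
    .add(userQuery, BooleanClause.Occur.MUST)
    .add(restrictionQuery, BooleanClause.Occur.FILTER)
    .build();
```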

On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand  wrote:

> Can you give us a few more details:
>  - What version of Lucene are you testing?
>  - Are you benchmarking "restrictionQuery" on its own, or its conjunction
> with another query?
>
> You mentioned that you combine your "restrictionQuery" and the user query
> with Occur.MUST, Occur.FILTER feels more appropriate for "restrictionQuery"
> since it should not contribute to scoring.
>
> TermsInSetQuery automatically executes like a BooleanQuery when the number
> of clauses is less than 16, so I would not expect major performance
> differences between a TermInSetQuery over less than 16 terms and a
> BooleanQuery wrapped in a ConstantScoreQuery.
>
> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde 
> wrote:
>
> > Hello,
> >
> > I'm benchmarking an application which implements security on lucene by
> > adding a multivalue field "roles". If the user has one of these roles, he
> > can find the document.
> >
> > I implemented this as a boolean AND query, adding the original query
> > and the restriction with Occur.MUST.
> >
> > I'm having some performance issues when counting the index (>60M docs),
> so
> > I thought about tweaking this restriction-implementation.
> >
> > I set up a benchmark like this:
> >
> > I generate 2M documents. Each document has a multi-value "roles" field.
> > The "roles" field in each document has 4 values, taken from (2,2,1000,100)
> > unique values.
> > The user has (1,1,2,1) values for roles (so, 1 out of the 2, for the
> first
> > role, 1 out of 2 for the second, 2 out of the 1000 for the third value,
> and
> > 1 / 100 for the fourth).
> >
> > I got a somewhat unexpected performance difference. At first, I
> implemented
> > the restriction query like this:
> >
> > for (final String role : roles) {
> > restrictionQuery.add(new TermQuery(new Term("roles", new
> > BytesRef(role))), Occur.SHOULD);
> > }
> >
> > I then switched to a TermInSetQuery, which I thought would be faster
> > as it is using constant-scores.
> >
> > final Set<BytesRef> rolesSet =
> >     roles.stream().map(BytesRef::new).collect(Collectors.toSet());
> > restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);
> >
> >
> > However, the TermInSetQuery has about 25% lower ops/s. Is that to be
> > expected? I did not expect it, as I thought the constant scoring would
> > be faster.
> >
>
>
> --
> Adrien
>


unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

2020-10-13 Thread Rob Audenaerde
Hello,

I'm benchmarking an application which implements security on lucene by
adding a multivalue field "roles". If the user has one of these roles, he
can find the document.

I implemented this as a boolean AND query, adding the original query and
the restriction with Occur.MUST.

I'm having some performance issues when counting the index (>60M docs), so
I thought about tweaking this restriction-implementation.

I set up a benchmark like this:

I generate 2M documents. Each document has a multi-value "roles" field. The
"roles" field in each document has 4 values, taken from (2,2,1000,100)
unique values.
The user has (1,1,2,1) values for roles (so, 1 out of the 2, for the first
role, 1 out of 2 for the second, 2 out of the 1000 for the third value, and
1 / 100 for the fourth).

I got a somewhat unexpected performance difference. At first, I implemented
the restriction query like this:

for (final String role : roles) {
    restrictionQuery.add(new TermQuery(new Term("roles", new BytesRef(role))),
        Occur.SHOULD);
}

I then switched to a TermInSetQuery, which I thought would be faster
as it is using constant-scores.

final Set<BytesRef> rolesSet =
    roles.stream().map(BytesRef::new).collect(Collectors.toSet());
restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);


However, the TermInSetQuery has about 25% lower ops/s. Is that to be
expected? I did not expect it, as I thought the constant scoring would be
faster.


find documents with big stored fields

2019-07-01 Thread Rob Audenaerde
Hello,

We are currently investigating an issue where the index size is
disproportionately large for the number of documents. We see that the .fdt
file is more than 10 times the regular size.

Reading the docs, I found that this file contains the stored field data.

I would like to find the documents and/or field names/contents with extreme
sizes, so we can delete those from the index without needing to re-index
all data.

What would be the best approach for this?
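
One brute-force starting point would be to scan all documents and sum their stored string lengths (a sketch only; the index path and threshold are made up, and non-string stored fields are ignored):

```java
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.FSDirectory;

public class FindBigStoredDocs {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("index")))) {
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                Document doc = reader.document(docId); // loads all stored fields
                int size = 0;
                for (IndexableField field : doc.getFields()) {
                    String value = field.stringValue();
                    if (value != null) {
                        size += value.length();
                    }
                }
                if (size > 100_000) { // arbitrary threshold for "extreme"
                    System.out.println("doc " + docId + " stored size " + size);
                }
            }
        }
    }
}
```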

Thanks,
Rob Audenaerde


force deletes - terms enum still has deleted terms?

2018-09-28 Thread Rob Audenaerde
Hi all,

We build a FST on the terms of our index by iterating the terms of the
readers for our fields, like this:

for (final LeafReaderContext ctx : leaves) {
    final LeafReader leafReader = ctx.reader();

    for (final String indexField : indexFields) {
        final Terms terms = leafReader.terms(indexField);
        // If the field does not exist in this reader, then we get null,
        // so check for that.
        if (terms != null) {
            final TermsEnum termsEnum = terms.iterator();

However, building the FST sometimes finds terms that come only from
documents that are deleted. This is what we expect, going by the javadocs.
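
One way to check whether a term still has any live document is to walk its postings against the reader's live docs (a sketch, assuming a termsEnum positioned on the term in question):

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

// Returns true if the current term of termsEnum occurs in at least one
// non-deleted document of this leaf reader.
static boolean hasLiveDoc(LeafReader leafReader, TermsEnum termsEnum) throws IOException {
    final Bits liveDocs = leafReader.getLiveDocs(); // null means no deletions
    if (liveDocs == null) {
        return true;
    }
    PostingsEnum postings = termsEnum.postings(null, PostingsEnum.NONE);
    for (int doc = postings.nextDoc();
            doc != DocIdSetIterator.NO_MORE_DOCS;
            doc = postings.nextDoc()) {
        if (liveDocs.get(doc)) {
            return true;
        }
    }
    return false;
}
```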

So, now we switched the IndexWriter to a config with a TieredMergePolicy
with: setForceMergeDeletesPctAllowed(0).

When calling indexWriter.forceMergeDeletes(true) we expect that there will
be no more deletes. However, the deleted terms still sometimes appear. We
use the DirectoryReader.openIfChanged() to refresh the reader before
iterating the terms.

Are we forgetting something?

Thanks in advance.
Rob Audenaerde


Re: Lucene nested query

2018-04-10 Thread Rob Audenaerde
Your query can be seen as an inner join:

select t0.* from employee t0 inner join employee t1 on t0.dept_no =
t1.dept_no where t1.email='a...@email.com'

Maybe JoinUtil can help you.

http://lucene.apache.org/core/7_0_0/join/org/apache/lucene/search/join/JoinUtil.html?is-external=true
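
A rough sketch of the join (field names follow the SQL above; the searcher variable is an assumption):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

// "Inner query": employees with the given email address.
Query fromQuery = new TermQuery(new Term("email", "a...@email.com"));

// Join on dept_no: all employees in the matching departments.
Query joinQuery = JoinUtil.createJoinQuery(
    "dept_no",      // fromField
    false,          // multipleValuesPerDocument
    "dept_no",      // toField
    fromQuery,
    indexSearcher,  // searcher over the employee index
    ScoreMode.None);

TopDocs hits = indexSearcher.search(joinQuery, 10);
```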

On Tue, Apr 10, 2018 at 10:44 AM, Khurram Shehzad 
wrote:

> Hi guys!
>
>
> I've a scenario where the lucene query depends on the result of another
> lucene query.
>
>
> For example, find all the employees of the department where one of its
> employee's email address = 'a...@email.com'
>
>
> SQL would be like:
>
>
> select * from employee where dept_no in(
>
> select dept_no from employee where email = 'a...@email.com'
>
> )
>
>
> Please note that employee is a huge data and inner query can result into 5
> million rows
>
>
> Any thoughts how to replicate this scenario using core lucene?
>
>
> Regards,
>
> Khurram
>


Re: indexing performance 6.6 vs 7.1

2018-01-31 Thread Rob Audenaerde
Hi Adrian,

Thanks for the response. Good points too!

We actually went with a smallish benchmark to be able to profile the
application within reasonable time.

We will do a larger benchmark (say, 1M documents, without profiling) and I
will revisit the commit code as well. (IIRC we actually increased the
commit frequency a while back because of issues, possibly out-of-memory
issues, back in the Lucene 4.x days. This might no longer be relevant.)

What I don't understand yet is how this difference (between 6 and 7) came
to be; I was reading the change log but could not really pinpoint it. Sure,
the commits are far from optimal, but we use the same commit strategy
for both 6.6 and 7.1.

-Rob




On Wed, Jan 31, 2018 at 1:56 PM, Adrien Grand  wrote:

> Hi Rob,
>
> I don't think your benchmark is good. If I read it correctly, it only
> indexes between 21k and 22k documents, which is tiny. Plus it should try to
> better replicate production workload, otherwise we will draw wrong
> conclusions.
>
> I also suspect something is not quite right in your indexing code. When I
> look at the IW logs, 562 out of the 642 flushes only write 1 document. I'm
> not surprised that it exacerbates the cost of checksums, which are cheaper
> to compute on one large file than on many tiny files. For the record, even
> committing every 5k documents still sounds too frequent to me for an
> application that is heavily indexing. Maybe you should consider moving to a
> time-based policy? eg. commit every 10 minutes?
>
> Le mer. 31 janv. 2018 à 10:25, Rob Audenaerde  a
> écrit :
>
> > Hi all,
> >
> > We ran the benchmarks (6.6 vs 7.1) with IW info stream and (as attachment
> > cannot be too large) I uploaded them to google drive. They can be found
> > here:
> >
> > https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh
> >
> > Thanks in advance,
> > -Rob
> >
> > On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde <
> rob.audenae...@gmail.com>
> > wrote:
> >
> > > Hi Uwe,
> > >
> > > Thanks for the reply. We commit often. Actually, in the benchmark, we
> > > commit every 60 documents (but we will run a larger set with less
> > commits).
> > > The number of commits we call does not change between 6.6. and 7.1. In
> > our
> > > production systems  we commit every 5000 documents.
> > >
> > > We dug deeper into the commit methods, and currently see the main
> > > difference seems to be the calls to the java.util.zit.Checksum.update(
> ).
> > > The number of calls to that method in 6.6 is around 11M  , and 7.1
> 21M,
> > so
> > > almost twice the calls.
> > >
> > > -Rob
> > >
> > > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler 
> wrote:
> > >
> > >> Hi,
> > >>
> > >> How often do you commit? If you index the data initially (that's the
> > case
> > >> where indexing needs to be fast), one would call commit at the end of
> > the
> > >> whole job, so the actual time it takes is not so important.
> > >>
> > >> If you have a system where the index is updated all the time, then of
> > >> course committing is also something you have to take into account.
> > Systems
> > >> like Solr or Elasticsearch use a transaction log in parallel to
> > indexing,
> > >> so they commit very seldom. If the system crashes, the changes are
> > replayed
> > >> from tranlog since last commit.
> > >>
> > >> Uwe
> > >>
> > >> -
> > >> Uwe Schindler
> > >> Achterdiek 19, D-28357 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: u...@thetaphi.de
> > >>
> > >> > -Original Message-
> > >> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> > >> > Sent: Monday, January 29, 2018 11:29 AM
> > >> > To: java-user@lucene.apache.org
> > >> > Subject: Re: indexing performance 6.6 vs 7.1
> > >> >
> > >> > Hi all,
> > >> >
> > >> > Some follow up (sorry for the delay).
> > >> >
> > >> > We built a benchmark in our application, and profiled it (on a
> > smallish
> > >> > data set). What we currently see in the profiler is that in Lucene
> 7.1
> > >> the
> > >> > calls to `commit()` take much longer.
> > >> >
> > >> > The self-time committing in 6.6: 3,215 ms
> > >> > The self-time committing in 

Re: indexing performance 6.6 vs 7.1

2018-01-31 Thread Rob Audenaerde
Hi all,

We ran the benchmarks (6.6 vs 7.1) with IW info stream and (as attachment
cannot be too large) I uploaded them to google drive. They can be found
here:

https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh

Thanks in advance,
-Rob

On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde 
wrote:

> Hi Uwe,
>
> Thanks for the reply. We commit often. Actually, in the benchmark, we
> commit every 60 documents (but we will run a larger set with less commits).
> The number of commits we call does not change between 6.6. and 7.1. In our
> production systems  we commit every 5000 documents.
>
> We dug deeper into the commit methods, and currently see the main
> difference seems to be the calls to the java.util.zit.Checksum.update().
> The number of calls to that method in 6.6 is around 11M  , and 7.1  21M, so
> almost twice the calls.
>
> -Rob
>
> On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler  wrote:
>
>> Hi,
>>
>> How often do you commit? If you index the data initially (that's the case
>> where indexing needs to be fast), one would call commit at the end of the
>> whole job, so the actual time it takes is not so important.
>>
>> If you have a system where the index is updated all the time, then of
>> course committing is also something you have to take into account. Systems
>> like Solr or Elasticsearch use a transaction log in parallel to indexing,
>> so they commit very seldom. If the system crashes, the changes are replayed
>> from tranlog since last commit.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -Original Message-
>> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
>> > Sent: Monday, January 29, 2018 11:29 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: indexing performance 6.6 vs 7.1
>> >
>> > Hi all,
>> >
>> > Some follow up (sorry for the delay).
>> >
>> > We built a benchmark in our application, and profiled it (on a smallish
>> > data set). What we currently see in the profiler is that in Lucene 7.1
>> the
>> > calls to `commit()` take much longer.
>> >
>> > The self-time committing in 6.6: 3,215 ms
>> > The self-time committing in 7.1: 10,187 ms.
>> >
>> > We will try to run a larger data set and also later with the IW info
>> > stream.
>> >
>> > -Rob
>> >
>> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> > > Robert:
>> > >
>> > > Ah, right. I keep confusing my gmail lists
>> > > "lucene dev"
>> > > and
>> > > "lucene list"
>> > >
>> > > Siiih.
>> > >
>> > >
>> > >
>> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand 
>> > wrote:
>> > > > If you have sparse data, I would have expected index time to
>> *decrease*,
>> > > > not increase.
>> > > >
>> > > > Can you enable the IW info stream and share flush + merge times to
>> see
>> > > > where indexing time goes?
>> > > >
>> > > > If you can run with a profiler, this might also give useful
>> information.
>> > > >
>> > > > Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde
>> > 
>> > > a
>> > > > écrit :
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> We recently upgraded from Lucene 6.6 to 7.1.  We see a significant
>> drop
>> > > in
>> > > >> indexing performace.
>> > > >>
>> > > >> We have a-typical use of Lucene, as we (also) index some database
>> > tables
>> > > >> and add all the values as AssociatedFacetFields as well. This
>> allows us
>> > > to
>> > > >> create pivot tables on search results really fast.
>> > > >>
>> > > >> These tables have some overlapping columns, but also disjoint ones.
>> > > >>
>> > > >> We anticipated a decrease in index size because of the sparse
>> > > docvalues. We
>> > > >> see this happening, with decreases to ~50%-80% of the original
>> index
>> > > size.
>> > > >> But we did not expect an drop in indexing performance (client
>> systems
>> > > >> indexing time increased with +50% to +250%).
>> > > >>
>> > > >> (Our indexing-speed used to be mainly bound by the speed the
>> > Taxonomy
>> > > could
>> > > >> deliver new ordinals for new values, currently we are
>> investigating if
>> > > this
>> > > >> is still the case, will report later when a profiler run has been
>> done)
>> > > >>
>> > > >> Does anyone know if this increase in indexing time is to be
>> expected as
>> > > >> result of the sparse docvalues change?
>> > > >>
>> > > >> Kind regards,
>> > > >>
>> > > >> Rob Audenaerde
>> > > >>
>> > >
>> > > -
>> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > >
>> > >
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>


Re: indexing performance 6.6 vs 7.1

2018-01-29 Thread Rob Audenaerde
Hi Uwe,

Thanks for the reply. We commit often. Actually, in the benchmark, we
commit every 60 documents (but we will run a larger set with fewer commits).
The number of commits we make does not change between 6.6 and 7.1. In our
production systems we commit every 5000 documents.

We dug deeper into the commit methods, and currently the main difference
seems to be the calls to java.util.zip.Checksum.update().
The number of calls to that method is around 11M in 6.6 and 21M in 7.1, so
almost twice as many.

-Rob

On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler  wrote:

> Hi,
>
> How often do you commit? If you index the data initially (that's the case
> where indexing needs to be fast), one would call commit at the end of the
> whole job, so the actual time it takes is not so important.
>
> If you have a system where the index is updated all the time, then of
> course committing is also something you have to take into account. Systems
> like Solr or Elasticsearch use a transaction log in parallel to indexing,
> so they commit very seldom. If the system crashes, the changes are replayed
> from tranlog since last commit.
>
> Uwe
>
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> > Sent: Monday, January 29, 2018 11:29 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: indexing performance 6.6 vs 7.1
> >
> > Hi all,
> >
> > Some follow up (sorry for the delay).
> >
> > We built a benchmark in our application, and profiled it (on a smallish
> > data set). What we currently see in the profiler is that in Lucene 7.1
> the
> > calls to `commit()` take much longer.
> >
> > The self-time committing in 6.6: 3,215 ms
> > The self-time committing in 7.1: 10,187 ms.
> >
> > We will try to run a larger data set and also later with the IW info
> > stream.
> >
> > -Rob
> >
> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson  >
> > wrote:
> >
> > > Robert:
> > >
> > > Ah, right. I keep confusing my gmail lists
> > > "lucene dev"
> > > and
> > > "lucene list"
> > >
> > > Siiih.
> > >
> > >
> > >
> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand 
> > wrote:
> > > > If you have sparse data, I would have expected index time to
> *decrease*,
> > > > not increase.
> > > >
> > > > Can you enable the IW info stream and share flush + merge times to
> see
> > > > where indexing time goes?
> > > >
> > > > If you can run with a profiler, this might also give useful
> information.
> > > >
> > > > Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde
> > 
> > > a
> > > > écrit :
> > > >
> > > >> Hi all,
> > > >>
> > > >> We recently upgraded from Lucene 6.6 to 7.1.  We see a significant
> drop
> > > in
> > > >> indexing performace.
> > > >>
> > > >> We have a-typical use of Lucene, as we (also) index some database
> > tables
> > > >> and add all the values as AssociatedFacetFields as well. This
> allows us
> > > to
> > > >> create pivot tables on search results really fast.
> > > >>
> > > >> These tables have some overlapping columns, but also disjoint ones.
> > > >>
> > > >> We anticipated a decrease in index size because of the sparse
> > > docvalues. We
> > > >> see this happening, with decreases to ~50%-80% of the original index
> > > size.
> > > >> But we did not expect an drop in indexing performance (client
> systems
> > > >> indexing time increased with +50% to +250%).
> > > >>
> > > >> (Our indexing-speed used to be mainly bound by the speed the
> > Taxonomy
> > > could
> > > >> deliver new ordinals for new values, currently we are investigating
> if
> > > this
> > > >> is still the case, will report later when a profiler run has been
> done)
> > > >>
> > > >> Does anyone know if this increase in indexing time is to be
> expected as
> > > >> result of the sparse docvalues change?
> > > >>
> > > >> Kind regards,
> > > >>
> > > >> Rob Audenaerde
> > > >>
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: indexing performance 6.6 vs 7.1

2018-01-29 Thread Rob Audenaerde
Hi all,

Some follow up (sorry for the delay).

We built a benchmark in our application, and profiled it (on a smallish
data set). What we currently see in the profiler is that in Lucene 7.1 the
calls to `commit()` take much longer.

The self-time committing in 6.6: 3,215 ms
The self-time committing in 7.1: 10,187 ms.

We will try to run a larger data set and also later with the IW info
stream.

-Rob

On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson 
wrote:

> Robert:
>
> Ah, right. I keep confusing my gmail lists
> "lucene dev"
> and
> "lucene list"
>
> Siiih.
>
>
>
> On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand  wrote:
> > If you have sparse data, I would have expected index time to *decrease*,
> > not increase.
> >
> > Can you enable the IW info stream and share flush + merge times to see
> > where indexing time goes?
> >
> > If you can run with a profiler, this might also give useful information.
> >
> > Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde 
> a
> > écrit :
> >
> >> Hi all,
> >>
> >> We recently upgraded from Lucene 6.6 to 7.1.  We see a significant drop
> in
> >> indexing performace.
> >>
> >> We have a-typical use of Lucene, as we (also) index some database tables
> >> and add all the values as AssociatedFacetFields as well. This allows us
> to
> >> create pivot tables on search results really fast.
> >>
> >> These tables have some overlapping columns, but also disjoint ones.
> >>
> >> We anticipated a decrease in index size because of the sparse
> docvalues. We
> >> see this happening, with decreases to ~50%-80% of the original index
> size.
> >> But we did not expect an drop in indexing performance (client systems
> >> indexing time increased with +50% to +250%).
> >>
> >> (Our indexing-speed used to be mainly bound by the speed the Taxonomy
> could
> >> deliver new ordinals for new values, currently we are investigating if
> this
> >> is still the case, will report later when a profiler run has been done)
> >>
> >> Does anyone know if this increase in indexing time is to be expected as
> >> result of the sparse docvalues change?
> >>
> >> Kind regards,
> >>
> >> Rob Audenaerde
> >>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


indexing performance 6.6 vs 7.1

2018-01-18 Thread Rob Audenaerde
Hi all,

We recently upgraded from Lucene 6.6 to 7.1. We see a significant drop in
indexing performance.

We have an atypical use of Lucene, as we (also) index some database tables
and add all the values as AssociatedFacetFields as well. This allows us to
create pivot tables on search results really fast.

These tables have some overlapping columns, but also disjoint ones.

We anticipated a decrease in index size because of the sparse docvalues. We
see this happening, with decreases to ~50%-80% of the original index size.
But we did not expect a drop in indexing performance (client systems
indexing time increased with +50% to +250%).

(Our indexing speed used to be mainly bound by the speed at which the Taxonomy
could deliver new ordinals for new values; we are currently investigating whether
this is still the case, and will report back once a profiler run has been done.)

Does anyone know if this increase in indexing time is to be expected as
result of the sparse docvalues change?

Kind regards,

Rob Audenaerde


Re: Lucene update performance

2017-05-09 Thread Rob Audenaerde
As far as I know, the updateDocument method on the IndexWriter does a delete
followed by an add. See also the javadoc:

[..] Updates a document by first deleting the document(s)
containing term and then adding the new
document.  The delete and then add are atomic as seen
by a reader on the same index (flush may happen only after
the add). [..]
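So an update pays the delete plus a full re-add. A small sketch of both flavors (the "id" field and the values are made up); if only numeric doc values change, IndexWriter has a cheaper in-place path:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class Updates {
    // Full update: atomically deletes all docs containing the term, then adds the new version.
    static void fullUpdate(IndexWriter writer, Document rebuilt) throws IOException {
        writer.updateDocument(new Term("id", "file-123"), rebuilt);
    }

    // Doc-values-only update: rewrites a numeric doc value without delete/re-add.
    static void dvOnlyUpdate(IndexWriter writer, long lastModified) throws IOException {
        writer.updateNumericDocValue(new Term("id", "file-123"), "lastModified", lastModified);
    }
}
```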


On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz 
wrote:

> I do update the entire document each time. Furthermore, this sometimes
> means deleting compressed archives which are stores as multiple documents
> for each compressed archive file and readding them.
>
> Is there an update method, is it better performance than remove then add? I
> was simply removing modified files from the index (which doesn't seem to
> take long), and readd them.
>
> On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde 
> wrote:
>
> > Do you update each entire document? (vs updating numeric docvalues?)
> >
> > That is implemented as 'delete and add' so I guess that will be slower
> than
> > clean sheet indexing. Not sure if it is 3x slower, that seems a bit much?
> >
> > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz 
> > wrote:
> >
> > > Hi,
> > >
> > > For a 5.2.1 index that contains around 1.2 million documents, updating
> > the
> > > index with 1.3 million files seems to take 3X longer than doing a
> scratch
> > > indexing. (Files are crawled over NFS, indexes are stored on a
> mechanical
> > > disk locally (Btrfs))
> > >
> > > Is this expected for Lucene's update index logic, or should I further
> > debug
> > > my part of the code for update performance?
> > >
> > > Thank you,
> > > Kudret
> > >
> >
>


Re: Lucene update performance

2017-05-09 Thread Rob Audenaerde
Do you update each entire document? (vs. updating numeric docvalues?)

That is implemented as 'delete and add', so I guess that will be slower than
clean-sheet indexing. Not sure if it is 3x slower, though; that seems a bit much?

On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz 
wrote:

> Hi,
>
> For a 5.2.1 index that contains around 1.2 million documents, updating the
> index with 1.3 million files seems to take 3X longer than doing a scratch
> indexing. (Files are crawled over NFS, indexes are stored on a mechanical
> disk locally (Btrfs))
>
> Is this expected for Lucene's update index logic, or should I further debug
> my part of the code for update performance?
>
> Thank you,
> Kudret
>


Re: Autocomplete using facet labels?

2017-04-12 Thread Rob Audenaerde
Thanks Erick for your reply,

I see you refer to Solr sources, while I was hoping for Lucene suggestions.

I hadn't thought of the idea of reverse-indexing the facet values and will
consider it.

Meanwhile I will try to explore how I might use the TaxonomyIndex as
well, as this should contain the FacetLabels I'd like to use.
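As a first experiment, a brute-force prefix scan over all taxonomy ordinals might do (this assumes the 'Dimension / Value' layout from the question; for 100.000 values I would probably precompute or cache the result):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;

class TaxonomySuggest {
    // Collect "Value - Dimension" suggestions for labels whose value starts with the prefix.
    static List<String> suggest(TaxonomyReader taxo, String prefix) throws IOException {
        List<String> out = new ArrayList<>();
        String p = prefix.toLowerCase(Locale.ROOT);
        // ordinal 0 is the root; every other ordinal resolves to a FacetLabel path
        for (int ord = TaxonomyReader.ROOT_ORDINAL + 1; ord < taxo.getSize(); ord++) {
            FacetLabel label = taxo.getPath(ord);
            // depth-2 paths are 'Dimension / Value', e.g. 'Author / John Doe'
            if (label.length == 2
                    && label.components[1].toLowerCase(Locale.ROOT).startsWith(p)) {
                out.add(label.components[1] + " - " + label.components[0]);
            }
        }
        return out;
    }
}
```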

-Rob

On Wed, Apr 12, 2017 at 5:00 PM, Erick Erickson 
wrote:

> First take a look at autosuggest. That does some great stuff if you
> can build the autocomplete dictionary only periodically, which can be
> somewhat expensive. See:
> https://lucidworks.com/2015/03/04/solr-suggester/
>
> There are lighter-weight ways to autosuggest that should be extremely
> fast, in particular index your stuff backwards in a suggest field as:
> John Doe - Author
>
> and use TermsComponent on that for instance. TermsComponent is pretty
> literal, i.e. it's case-sensitive but you can send terms.prefix=jo and
> case things properly on the app side.
>
> Best,
> Erick
>
> On Wed, Apr 12, 2017 at 6:33 AM, Rob Audenaerde
>  wrote:
> > I have a Lucene (6.4.2) index with about 2-5M documents, and each
> document
> > has about 10 facets (for example 'author', 'publisher', etc). The facets
> > might have up to 100.000 different values.
> >
> > I have a search bar on top of my application, and would like to implement
> > autocomplete using the facets. For example, when the user enters 'Jo' I
> > would like the options to be:
> >
> > 'John Doe - Author'
> > 'Jonatan Driver - Publisher'
> > 'Joan Deville - Author'
> > ...
> >
> > My facets are structured using the FacetFields and Lucene Taxonomy, like
> > this:
> >
> > 'Author / John Doe'
> > 'Author / Joan Deville'
> > ...
> >
> > Are there built-in options to create such an autocomplete? Or do I have
> to
> > build it myself?
> >
> > I prefer not to do a search on all the matching documents and collect
> > facets for those, because that is not very fast
> >
> > Any hints?
> >
> > Thanks in advance,
> > Rob Audenaerde
> >
> > See also:
> > http://stackoverflow.com/questions/43369715/lucene-
> autocomplete-using-facet-labels
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Autocomplete using facet labels?

2017-04-12 Thread Rob Audenaerde
I have a Lucene (6.4.2) index with about 2-5M documents, and each document
has about 10 facets (for example 'author', 'publisher', etc). The facets
might have up to 100.000 different values.

I have a search bar on top of my application, and would like to implement
autocomplete using the facets. For example, when the user enters 'Jo' I
would like the options to be:

'John Doe - Author'
'Jonatan Driver - Publisher'
'Joan Deville - Author'
...

My facets are structured using the FacetFields and Lucene Taxonomy, like
this:

'Author / John Doe'
'Author / Joan Deville'
...

Are there built-in options to create such an autocomplete? Or do I have to
build it myself?

I prefer not to do a search on all the matching documents and collect
facets for those, because that is not very fast

Any hints?

Thanks in advance,
Rob Audenaerde

See also:
http://stackoverflow.com/questions/43369715/lucene-autocomplete-using-facet-labels


Re: commit frequency guideline?

2016-11-30 Thread Rob Audenaerde
Thanks for the quick reply!

>What do you mean by "Lucene complain about too-many uncommitted docs"?

--> good question, I was thoughtlessly echoing words from my colleague. I
asked him and he said that it was about taking very long to commit and
memory issues. So maybe this wasn't the best opening statement :)

For the other part of the question: we need users to see the changed
documents immediately, but I think we have this covered by using NRT
Readers and the SearcherManager.

Am I correct to conclude that calling commit() is not necessary for finding
recently changed documents?

I think we can then switch to a time-based commit() where we just call
commit every 5 minutes, in effect losing a maximum of 5 minutes of work
(which we can mitigate in another way) when the server somehow stops
working.
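A sketch of how I picture that split (names illustrative; the SearcherManager would be built once with `new SearcherManager(writer, new SearcherFactory())`): frequent cheap NRT refreshes for visibility, rare commits for durability:

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

class RefreshAndCommitScheduler {
    static ScheduledExecutorService start(IndexWriter writer, SearcherManager manager) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // visibility: cheap NRT refresh every second; searchers see updates without a commit
        scheduler.scheduleWithFixedDelay(() -> {
            try { manager.maybeRefresh(); } catch (IOException e) { /* log and continue */ }
        }, 1, 1, TimeUnit.SECONDS);
        // durability: the expensive fsync'ing commit only every 5 minutes,
        // bounding replay to at most 5 minutes of work after a crash
        scheduler.scheduleWithFixedDelay(() -> {
            try { writer.commit(); } catch (IOException e) { /* log and continue */ }
        }, 5, 5, TimeUnit.MINUTES);
        return scheduler;
    }
}
```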

Thank you,
-Rob




On Wed, Nov 30, 2016 at 3:17 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> What do you mean by "Lucene complain about too-many uncommitted docs"?
>  Lucene does not really care how frequently you commit...
>
> How frequently you commit is really your choice, i.e. what risk you
> see of power loss / OS crash vs the cost (not just in CPU/IO work for
> the computer, but in the users not seeing the recently indexed
> documents for a while) of replaying those documents since the last
> commit when power comes back.
>
> Pushing durability back into the queue/channel can be a nice option
> too, e.g. Kafka, so that your application doesn't need to keep track
> of which docs were not yet committed.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 30, 2016 at 8:50 AM, Rob Audenaerde
>  wrote:
> > Hi all,
> >
> > Currently we call commit() many times on our index (about 5M docs, where
> > some 10.000-100.000 modifications during the day). The commit times
> > typically get more expensive when the index grows, up to several seconds,
> > so we want to reduce the number of calls.
> >
> > (Historically, we had Lucene complain about too-many uncommitted docs
> > sometimes, so we went with the commit often approach.)
> >
> > What is a good strategy for calling commit? Fixed frequency? After X
> docs?
> > Combination?
> >
> > I'm curious what is considered 'industry-standard'. Can you share some of
> > your expercience?
> >
> > Thanks!
> >
> > -Rob
>


commit frequency guideline?

2016-11-30 Thread Rob Audenaerde
Hi all,

Currently we call commit() many times on our index (about 5M docs, where
some 10.000-100.000 modifications during the day). The commit times
typically get more expensive when the index grows, up to several seconds,
so we want to reduce the number of calls.

(Historically, we had Lucene complain about too-many uncommitted docs
sometimes, so we went with the commit often approach.)

What is a good strategy for calling commit? Fixed frequency? After X docs?
Combination?

I'm curious what is considered 'industry standard'. Can you share some of
your experience?

Thanks!

-Rob


Re: wicket datatable, row selection, update another component

2016-10-28 Thread Rob Audenaerde
Whoops! You are correct! Sorry 'bout that.

On Fri, Oct 28, 2016 at 1:26 PM, Alan Woodward  wrote:

> Hi Rob, I think you posted this to the wrong mailing list?
>
> Alan Woodward
> www.flax.co.uk
>
>
> > On 28 Oct 2016, at 12:13, Rob Audenaerde 
> wrote:
> >
> > Hi all,
> >
> > I have a DataTable which, in onConfigure(), sets a selected item. I want
> > another (detail) panel, outside of this component, to react on that
> > selection i.e. set it's visibility and render details of the selected
> item.
> >
> > What I see is that the onConfigure() of the detail component is called
> > BEFORE the DataTable, so I figure it renders before the DataTable is
> > rendered, so the detail.setVisible() in the onConfigure() in the
> DataTable
> > is called too late.
> >
> > How should I solve this? The only component that know which item is going
> > to be selected is the DataTable.
> >
> > Thanks,
> > Rob
>
>


wicket datatable, row selection, update another component

2016-10-28 Thread Rob Audenaerde
Hi all,

I have a DataTable which, in onConfigure(), sets a selected item. I want
another (detail) panel, outside of this component, to react to that
selection, i.e. set its visibility and render details of the selected item.

What I see is that the onConfigure() of the detail component is called
BEFORE the DataTable, so I figure it renders before the DataTable is
rendered, so the detail.setVisible() in the onConfigure() in the DataTable
is called too late.

How should I solve this? The only component that knows which item is going
to be selected is the DataTable.

Thanks,
Rob


Re: clone RAMDirectory

2016-06-30 Thread Rob Audenaerde
Thanks for the quick reply Uwe!

I opened https://issues.apache.org/jira/browse/LUCENE-7366 for this.

-Rob

On Thu, Jun 30, 2016 at 12:06 PM, Uwe Schindler  wrote:

> Hi,
>
> I looked at the code: The FSDirectory passed to RAMDirectory could be
> changed to Directory easily. The additional check for "not is a directory
> inode" is in my opinion lo longer needed, because listFiles should only
> return files.
>
> Can you open an issue about to change the FSDirectory in the RAMDirectory
> ctor to be changed to Directory?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> > Sent: Thursday, June 30, 2016 12:00 PM
> > To: java-user@lucene.apache.org
> > Subject: clone RAMDirectory
> >
> > Hi all,
> >
> > For increasing the speed of some of my application tests, I want to
> > re-use/copy a pre-populated RAMDirectory over and over.
> >
> > I'm on Lucene 6.0.1
> >
> > It seems an RAMDirectory can be a copy of a FSDirectory, but not of
> another
> > RAMDirectory. Also RAMDirectory is not Clonable.
> >
> > What would be the 'proper' approach to re-use (fast copy) pre-populated
> > indices over tests? I know I can create a FSDirectory and copy that, but
> > then I also need to take into account temporary files etc.
> >
> > Thanks in advance,
> >
> > - Rob
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


clone RAMDirectory

2016-06-30 Thread Rob Audenaerde
Hi all,

For increasing the speed of some of my application tests, I want to
re-use/copy a pre-populated RAMDirectory over and over.

I'm on Lucene 6.0.1

It seems a RAMDirectory can be a copy of an FSDirectory, but not of another
RAMDirectory. Also, RAMDirectory is not Cloneable.

What would be the 'proper' approach to re-use (fast copy) pre-populated
indices over tests? I know I can create a FSDirectory and copy that, but
then I also need to take into account temporary files etc.
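The best workaround I can think of for now is copying file-by-file via Directory.copyFrom; a sketch (untested):

```java
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

class RamDirCopy {
    // Byte-for-byte copy of any Directory into a fresh in-memory RAMDirectory.
    static RAMDirectory copyOf(Directory source) throws IOException {
        RAMDirectory copy = new RAMDirectory();
        for (String file : source.listAll()) {
            copy.copyFrom(source, file, file, IOContext.DEFAULT);
        }
        return copy;
    }
}
```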

Thanks in advance,

- Rob


Re: GROUP BY in Lucene

2016-03-19 Thread Rob Audenaerde
Hi Gimantha,

You don't need to store the aggregates and don't need to retrieve
Documents. The aggregates are calculated during collection using the
BinaryDocValues from the facet module. What I do is store values
in the facets using AssociationFacetFields (for example
FloatAssociationFacetField). I chose facets because then I can use
the facets as well :)

I have an implementation of the `Facets` class that does all the aggregation. I
cannot paste all the code unfortunately, but here is the idea (it is loosely
based on the TaxonomyFacetSumIntAssociations implementation, where you can
look up how the BinaryDocValues are translated to ordinals and to facets).
This aggregation is used in conjunction with a FacetsCollector, which
collects the facets during a search:

FacetsCollector fc = new FacetsCollector();
searcher.search(new ConstantScoreQuery(query), fc);


Then, use this FacetsCollector:

taxoReader = getTaxonomyReaderManager().acquire();
OnePassTaxonomyFacets facets = new OnePassTaxonomyFacets(taxoReader,
        LuceneIndexConfig.facetConfig);
Collection<Aggregate> aggregates = facets.aggregateValues(fc.getMatchingDocs(),
        p.getGroupByListWithoutData(), aggregateFields);


The aggregateValues method (cannot paste it all :(  ) :


public final Collection
aggregateValues(List matchingDocs, final List
groupByFields,
final List aggregateFieldNames, EmptyValues
emptyValues) throws IOException {
LOG.info("Starting aggregation for pivot.. EmptyValues=" +
emptyValues);

//We want to group a list of ordinals to a list of aggregates. The
taxoReader has the ordinals, so a selection like 'Lang=NL, Region=South'
will
//end up like a MultiIntKey of [13,44]
Map> aggs = Maps.newHashMap();

List groupByFieldsNames = Lists.newArrayList();
for (GroupByField gbf : groupByFields) {
groupByFieldsNames.add(gbf.getField().getName());
}
int groupByCount = groupByFieldsNames.size();

//We need to know which ordinals are the 'group-by' ordinals, so we
//can check if an ordinal that is found belongs to one of these fields
int[] groupByOrdinals = new int[groupByCount];
for (int i = 0; i < groupByOrdinals.length; i++) {
groupByOrdinals[i] =
this.getOrdinalForListItem(groupByFieldsNames, i);
}

//We need to know which ordinals are the 'aggregate-field' ordinals,
//so we can check if an ordinal that is found belongs to one of these fields
int[] aggregateOrdinals = new int[aggregateFieldNames.size()];
for (int i = 0; i < aggregateOrdinals.length; i++) {
aggregateOrdinals[i] =
this.getOrdinalForListItem(aggregateFieldNames, i);
}

//Now we go and find all the ordinals in the matching documents.
//For each ordinal, we check if it is a groupBy-ordinal or an
//aggregate-ordinal, and act accordingly.
for (MatchingDocs hitList : matchingDocs) {
BinaryDocValues dv =
hitList.context.reader().getBinaryDocValues(this.indexFieldName);

//Here, find the ordinals of the group-by fields and the
//aggregate fields.
//Create a multi-ordinal key (MultiIntKey) from the
//group-by ordinals and use that to add the current value of the
//fields to aggregate to the facet-aggregates

..


Hope this helps :)
-Rob
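For what it's worth, the per-document decoding that TaxonomyFacetSumIntAssociations performs boils down to reading big-endian (ordinal, value) int pairs out of the BinaryDocValues bytes. A dependency-free sketch of that layout (my own illustration, not Lucene's actual code):

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class AssocDecode {
    // Decode (ordinal, value) int pairs from the byte layout that int
    // association facets use: a big-endian 4-byte ordinal followed by a
    // big-endian 4-byte value, repeated once per association.
    static Map<Integer, Integer> decode(byte[] bytes) {
        Map<Integer, Integer> ordToValue = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(bytes); // big-endian by default
        while (buf.remaining() >= 8) {
            int ord = buf.getInt();
            int value = buf.getInt();
            ordToValue.put(ord, value);
        }
        return ordToValue;
    }

    public static void main(String[] args) {
        // Encode two associations: ordinal 13 -> 42, ordinal 44 -> 7
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(13).putInt(42).putInt(44).putInt(7);
        System.out.println(decode(buf.array())); // {13=42, 44=7}
    }
}
```

Inside the real aggregation loop, `ord` would then be matched against the group-by and aggregate ordinals described above.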


Re: Lucene Facets performance problems (version 4.7.2)

2016-02-26 Thread Rob Audenaerde
Hi Simona,

In addition to Ericks' questions:

Are you talking about *search* time or facet-collection time? And how many
results are in your result set?

I have some experience with collecting facets from large result sets; these
are typically slow (as they have to retrieve all the relevant facet fields
for the faceted documents). In Lucene 4.8 the RandomSamplingFacetsCollector
returned (as per https://issues.apache.org/jira/browse/LUCENE-5476).

-Rob
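For context, the sampling trick is straightforward: instead of aggregating facets over every hit, aggregate over a bounded random subset of the matching doc IDs. The core idea behind RandomSamplingFacetsCollector can be sketched with plain reservoir sampling (a standalone illustration, not the collector's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SampleDocs {
    // One-pass reservoir sampling: keeps a uniform random sample of
    // sampleSize doc IDs no matter how many hits stream past.
    static List<Integer> sample(Iterable<Integer> docIds, int sampleSize, long seed) {
        List<Integer> reservoir = new ArrayList<>(sampleSize);
        Random rnd = new Random(seed);
        int seen = 0; // zero-based index of the current doc
        for (int doc : docIds) {
            if (reservoir.size() < sampleSize) {
                reservoir.add(doc); // fill the reservoir first
            } else {
                // replace an existing slot with probability sampleSize/(seen+1)
                int slot = rnd.nextInt(seen + 1);
                if (slot < sampleSize) reservoir.set(slot, doc);
            }
            seen++;
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) all.add(i);
        System.out.println(sample(all, 5, 42L).size()); // 5
    }
}
```

Facet counts computed over the sample are then scaled up; the trade-off is exactness for collection time on huge result sets.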

On Fri, Feb 26, 2016 at 6:01 AM, Simona Russo  wrote:

> Hi all,
>
> we use Lucene *Facet* library version* 4.7.2.*
>
> We have an *index* with *45 million* documents (size about 15 GB) and
> a *taxonomy* index with *57 million* documents (size about 2 GB).
>
> The total *facet search* time reaches *15 seconds*!
>
> Is it possible to improve this time? Are there any tips to *configure* the
> *taxonomy* index to avoid this waste of time?
>
>
> Thanks in advance
>


Re: Profiling lucene 5.2.0 based tool

2016-02-22 Thread Rob Audenaerde
Hi Sandeep,

How many threads do you use to do the indexing? The Lucene benchmarks
are done with more than 20 threads, IIRC.

-Rob

On Tue, Feb 23, 2016 at 8:01 AM, sandeep das  wrote:

> Hi,
>
> I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool
> reads data from CSV files (residing on disk) and creates indexes on
> local disk. It is able to process 3.5 MBps of data. There are 46 fields
> overall being added to each document. They are only of three data types:
> 1. Integer, 2. Long, 3. String.
> All these fields are part of one CSV record, and they are parsed using a
> custom CSV parser which is faster than any split method of String.
>
> I've configured the following parameters to create indexWriter
> 1. setOpenMode(OpenMode.CREATE)
> 2. setCommitOnClose(true)
> 3. setRAMBufferSizeMB(512)   // Tried 256, 312 as well but performance is
> almost same.
>
> I've read in several blogs that Lucene works way faster than these
> figures. So I thought there were some bottlenecks in my code and profiled
> it using jvisualvm. The application spends most of its time in
> DefaultIndexChain.processField, i.e. 53% of total time.
>
>
> Following is the split of CPU usage in this application:
> 1. reading data from disk is taking 5% of total duration
> 2. adding document is taking 93% of total duration.
>
>-postUpdate  -> 12.8%
>-doAfterDocument -> 20.6%
>-updateDocument  -> 59.8%
>   - finishDocument -> 1.7%
>   - finishStoreFields -> 4.8%
>   - processFields -> 53.1%
>
>
> I'm also attaching the screen shot of call graph generated by jvisualvm.
>
> I've taken care of following points:
> 1. create only one instance of indexWriter
> 2. create only one instance of document and reuse it throughout the
> lifetime of the application
> 3. There will be no update in the documents hence only addDocument is
> invoked.
> Note: After going through the code I found out that addDocument is
> internally calling updateDocument only. Is there any way by which we can
> avoid calling updateDocument and only use addDocument API?
> 4. Using setValue APIs to set the pre created fields and reusing these
> fields to create indexes.
>
> Any tips to improve the performance will be immensely appreciated.
>
> Regards,
> Sandeep
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
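As an aside, the kind of allocation-light CSV parsing Sandeep describes (a single pass with `indexOf`, no regex) can be sketched like this. This is a hypothetical minimal helper, not code from the thread, and it deliberately skips quoting/escaping:

```java
import java.util.ArrayList;
import java.util.List;

public class FastCsv {
    // Split one CSV line on a separator character in a single pass.
    // Unlike String.split, this compiles no regex and allocates only the
    // field substrings and the result list.
    static List<String> fields(String line, char sep) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (true) {
            int next = line.indexOf(sep, start);
            if (next < 0) {
                out.add(line.substring(start)); // last field
                return out;
            }
            out.add(line.substring(start, next));
            start = next + 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(fields("17,123456789,foo", ',')); // [17, 123456789, foo]
    }
}
```

A further step in the same spirit would be parsing int/long fields directly from the char range without creating substrings at all.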


RE: debugging growing index size

2015-11-14 Thread Rob Audenaerde
Thank you all,

I will further fix and investigate!
On Nov 14, 2015 10:00, "Uwe Schindler"  wrote:

> I agree. On Linux it is impossible that MMapDirectory is the reason! Only
> on Windows can you not delete still-open/mapped files.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > Sent: Friday, November 13, 2015 8:30 PM
> > To: Lucene Users 
> > Subject: Re: debugging growing index size
> >
> > So with MMapDir at defaults (unmap is enabled) you see old files, with
> > no open file handles as reported by lsof, still existing in your index
> > directory, taking lots of space.
> >
> > But with NIOFSDirectory the issue doesn't happen?  Are you sure?
> >
> > I'll look at the 6.6 GB infoStream to see what it says about the ref
> counts.
> >
> > Did you fix the issue in your app where you're not closing all opened
> > NRT readers?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Nov 13, 2015 at 12:22 PM, Rob Audenaerde
> >  wrote:
> > > I haven't disabled unmapping, and I am running out-of-the-box
> > > FSDirectory.open(). As far as I can see it tries to pick MMap. For the
> > > test I explicitly constructed a reader over an NIOFSDirectory.
> > >
> > > OS is (from the top of my head)  CentOS 6.x, Java 1.8.0u33. I can check
> > > later for more details.
> > > On Nov 13, 2015 18:07, "Uwe Schindler"  wrote:
> > >
> > >> Hi,
> > >>
> > >> Lucene has the workaround, so it should not happen, UNLESS you
> > explicitly
> > >> disable the hack using MMapDirectory#setEnableUnmap(false).
> > >>
> > >> Uwe
> > >>
> > >> -
> > >> Uwe Schindler
> > >> H.-H.-Meier-Allee 63, D-28213 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: u...@thetaphi.de
> > >>
> > >> > -Original Message-
> > >> > From: will martin [mailto:wmartin...@gmail.com]
> > >> > Sent: Friday, November 13, 2015 6:04 PM
> > >> > To: java-user@lucene.apache.org
> > >> > Subject: Re: debugging growing index size
> > >> >
> > >> > Hi Rob:
> > >> >
> > >> >
> > >> > Doesn’t this look like known SE issue JDK-4724038 and discussed by
> > Peter
> > >> > Levart and Uwe Schindler on a lucene-dev thread 9/9/2015?
> > >> >
> > >> > MappedByteBuffer …. what OS are you on Rob? What JVM?
> > >> >
> > >> > http://bugs.java.com/view_bug.do?bug_id=4724038
> > >> >
> > >> > http://mail-archives.apache.org/mod_mbox/lucene-
> > >> > dev/201509.mbox/%3c55f0461a.2070...@gmail.com%3E
> > >> >
> > >> > hth
> > >> > -will
> > >> >
> > >> >
> > >> >
> > >> > > On Nov 13, 2015, at 11:23 AM, Rob Audenaerde
> > >> >  wrote:
> > >> > >
> > >> > > I'm currently running using NIOFS. It seems to prevent the issue
> from
> > >> > > appearing.
> > >> > >
> > >> > > This is a second run (with applied deletes etc)
> > >> > >
> > >> > > raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd
> > >> > > -rw-r--r--. 1 apache apache  7993 Nov 13 16:09
> _y_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12
> > _xod_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17
> > _110e_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19
> > _12r5_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13
> > _y0s_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17
> > _z20_Lucene50_0.dvd
> > >> > >
> > >> > > raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd
> > >> > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17
> > _z20_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13
> > _y0s_Lucene50_0.dvd
> > >> > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19
> 

RE: debugging growing index size

2015-11-13 Thread Rob Audenaerde
I haven't disabled unmapping, and I am running out-of-the-box
FSDirectory.open(). As far as I can see it tries to pick MMap. For the test
I explicitly constructed a reader over an NIOFSDirectory.

OS is (off the top of my head) CentOS 6.x, Java 1.8.0u33. I can check
later for more details.
On Nov 13, 2015 18:07, "Uwe Schindler"  wrote:

> Hi,
>
> Lucene has the workaround, so it should not happen, UNLESS you explicitly
> disable the hack using MMapDirectory#setEnableUnmap(false).
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: will martin [mailto:wmartin...@gmail.com]
> > Sent: Friday, November 13, 2015 6:04 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: debugging growing index size
> >
> > Hi Rob:
> >
> >
> > Doesn’t this look like known SE issue JDK-4724038 and discussed by Peter
> > Levart and Uwe Schindler on a lucene-dev thread 9/9/2015?
> >
> > MappedByteBuffer …. what OS are you on Rob? What JVM?
> >
> > http://bugs.java.com/view_bug.do?bug_id=4724038
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-
> > dev/201509.mbox/%3c55f0461a.2070...@gmail.com%3E
> >
> > hth
> > -will
> >
> >
> >
> > > On Nov 13, 2015, at 11:23 AM, Rob Audenaerde
> >  wrote:
> > >
> > > I'm currently running using NIOFS. It seems to prevent the issue from
> > > appearing.
> > >
> > > This is a second run (with applied deletes etc)
> > >
> > > raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd
> > > -rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
> > >
> > > raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd
> > > -rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
> > > -rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
> > >
> > >
> > >
> > > On Thu, Nov 12, 2015 at 3:40 PM, Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > >> Hi Rob,
> > >>
> > >> A couple more things:
> > >>
> > >> Can you print the value of MMapDirectory.UNMAP_SUPPORTED?
> > >>
> > >> Also, can you try your test using NIOFSDirectory instead?  Curious if
> > >> that changes things...
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >>
> > >> On Thu, Nov 12, 2015 at 7:28 AM, Rob Audenaerde
> > >>  wrote:
> > >>> Curious indeed!
> > >>>
> > >>> I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate
> > the
> > >>> logs. Will get back with them in a day hopefully.
> > >>>
> > >>> Thanks for the extra logging!
> > >>>
> > >>> -Rob
> > >>>
> > >>> On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless <
> > >>> luc...@mikemccandless.com> wrote:
> > >>>
> > >>>> Hmm, curious.
> > >>>>
> > >>>> I looked at the [large] infoStream output and I see segment _3ou7
> > >>>> present on init of IW, a few getReader calls referencing it, then a
> > >>>> forceMerge that indeed merges it away, yet I do NOT see IW
> > attempting
> > >>>> deletion of its files.
> > >>>>
> > >>>> And indeed I see plenty (too many: many times per second?) of
> > commits
> > >>>> after that, so the index itself is no longer referencing _3ou7.
> > >>>>
> > >>>> If you are failing to close all NRT readers then I would

Re: debugging growing index size

2015-11-13 Thread Rob Audenaerde
I'm currently running using NIOFS. It seems to prevent the issue from
appearing.

This is a second run (with applied deletes etc)

raudenaerd@:/<6>index/index$sudo ls -lSra *.dvd
-rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
-rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
-rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
-rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd

raudenaerde:/<6>index/index$sudo ls -lSaa *.dvd
-rw-r--r--. 1 apache apache 222062059 Nov 13 17:17 _z20_Lucene50_0.dvd
-rw-r--r--. 1 apache apache 151149886 Nov 13 17:13 _y0s_Lucene50_0.dvd
-rw-r--r--. 1 apache apache 112855516 Nov 13 17:19 _12r5_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  53699972 Nov 13 17:17 _110e_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  39048886 Nov 13 17:12 _xod_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  7993 Nov 13 16:09 _y_Lucene50_0.dvd



On Thu, Nov 12, 2015 at 3:40 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hi Rob,
>
> A couple more things:
>
> Can you print the value of MMapDirectory.UNMAP_SUPPORTED?
>
> Also, can you try your test using NIOFSDirectory instead?  Curious if
> that changes things...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Nov 12, 2015 at 7:28 AM, Rob Audenaerde
>  wrote:
> > Curious indeed!
> >
> > I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate the
> > logs. Will get back with them in a day hopefully.
> >
> > Thanks for the extra logging!
> >
> > -Rob
> >
> > On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Hmm, curious.
> >>
> >> I looked at the [large] infoStream output and I see segment _3ou7
> >> present on init of IW, a few getReader calls referencing it, then a
> >> forceMerge that indeed merges it away, yet I do NOT see IW attempting
> >> deletion of its files.
> >>
> >> And indeed I see plenty (too many: many times per second?) of commits
> >> after that, so the index itself is no longer referencing _3ou7.
> >>
> >> If you are failing to close all NRT readers then I would expect _3ou7
> >> to be in the lsof output, but it's not.
> >>
> >> The NRT readers close method has logic that notifies IndexWriter when
> >> it's done "needing" the files, to emulate "delete on last close"
> >> semantics for filesystems like HDFS that don't do that ... it's
> >> possible something is wrong here.
> >>
> >> Can you set the (public, static) boolean
> >> IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this
> >> log?  This causes IW to log the ref count of each file it's tracking
> >> ...
> >>
> >> I'll also add a bit more verbosity to IW when NRT readers are opened
> >> and close, for 5.4.0.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde
> >>  wrote:
> >> > Hi all,
> >> >
> >> > I'm still debugging the growing-index size. I think closing index
> readers
> >> > might help (work in progress), but I can't really see them holding on
> to
> >> > files (at least, using lsof ). Restarting the application sheds some
> >> light,
> >> > I see logging on files that are no longer referenced.
> >> >
> >> > What I see is that there are files in the index directory that seem
> >> > to no longer be referenced.
> >> >
> >> > I put the output of the infoStream online, because it is rather big
> >> > (30MB gzipped):  http://www.audenaerde.org/lucene/merges.log.gz
> >> >
> >> > Output of lsof (executed 'sudo lsof *' in the index directory).
> >> > This is on a CentOS box (maybe that influences stuff as well?)
> >> >
> >> > COMMAND   PID   USER   FD   TYPE DEVICE   SIZE/OFF NODE NAME
> >> > java30581 apache  memREG  253,0 3176094924 18880508
> >> > _4gs5_Lucene50_0.dvd
> >> > java30581 apache  memREG  253,0  505758610 18880546 _4gs5.fdt
> >> > java30581 apache  memREG  253,0  369563337 18880631
> >> > _4gs5_Lucene50_0.tim
> >

Re: debugging growing index size

2015-11-13 Thread Rob Audenaerde
I got the data (beware, it is about a 180MB download, xz-zipped; unpacked it
is about 6.6 GB).

Unfortunately, I accidentally restarted the application, so the index files
and lsof output could not be determined for this run. Hopefully the
infoStream log with the extra logging will provide enough information. I
will work on that next week if needed.

The infoStream can be downloaded here:

http://www.audenaerde.org/lucene/merges.log.xz

The value of MMapDirectory.UNMAP_SUPPORTED = true

I'm currently trying to create a build with NIOFSDirectory instead.

-Rob



On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm, curious.
>
> I looked at the [large] infoStream output and I see segment _3ou7
> present on init of IW, a few getReader calls referencing it, then a
> forceMerge that indeed merges it away, yet I do NOT see IW attempting
> deletion of its files.
>
> And indeed I see plenty (too many: many times per second?) of commits
> after that, so the index itself is no longer referencing _3ou7.
>
> If you are failing to close all NRT readers then I would expect _3ou7
> to be in the lsof output, but it's not.
>
> The NRT readers close method has logic that notifies IndexWriter when
> it's done "needing" the files, to emulate "delete on last close"
> semantics for filesystems like HDFS that don't do that ... it's
> possible something is wrong here.
>
> Can you set the (public, static) boolean
> IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this
> log?  This causes IW to log the ref count of each file it's tracking
> ...
>
> I'll also add a bit more verbosity to IW when NRT readers are opened
> and close, for 5.4.0.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde
>  wrote:
> > Hi all,
> >
> > I'm still debugging the growing-index size. I think closing index readers
> > might help (work in progress), but I can't really see them holding on to
> > files (at least, using lsof ). Restarting the application sheds some
> light,
> > I see logging on files that are no longer referenced.
> >
> > What I see is that there are files in the index directory that seem to
> > no longer be referenced.
> >
> > I put the output of the infoStream online, because it is rather big (30MB
> > gzipped):  http://www.audenaerde.org/lucene/merges.log.gz
> >
> > Output of lsof (executed 'sudo lsof *' in the index directory). This is
> > on a CentOS box (maybe that influences stuff as well?)
> >
> > COMMAND   PID   USER   FD   TYPE DEVICE   SIZE/OFF NODE NAME
> > java30581 apache  memREG  253,0 3176094924 18880508
> > _4gs5_Lucene50_0.dvd
> > java30581 apache  memREG  253,0  505758610 18880546 _4gs5.fdt
> > java30581 apache  memREG  253,0  369563337 18880631
> > _4gs5_Lucene50_0.tim
> > java30581 apache  memREG  253,0  176344058 18880623
> > _4gs5_Lucene50_0.pos
> > java30581 apache  memREG  253,0  378055201 18880606
> > _4gs5_Lucene50_0.doc
> > java30581 apache  memREG  253,0  372579599 18880400
> > _4i5a_Lucene50_0.dvd
> > java30581 apache  memREG  253,0   82017447 18880748 _4g37.cfs
> > java30581 apache  memREG  253,0   85376507 18880721 _4fb3.cfs
> > java30581 apache  memREG  253,0  363493917 18880533
> > _4ct1_Lucene50_0.dvd
> > java30581 apache  memREG  253,09421892 18880806 _4gjc.cfs
> > java30581 apache  memREG  253,0   76877461 18880553 _4ct1.fdt
> > java30581 apache  memREG  253,0   46271330 18880661
> > _4ct1_Lucene50_0.tim
> > java30581 apache  memREG  253,0   26911387 18880653
> > _4ct1_Lucene50_0.pos
> > java30581 apache  memREG  253,0   54678249 18880568
> > _4ct1_Lucene50_0.doc
> > java30581 apache  memREG  253,0   76556587 18880328 _4i5a.fdt
> > java30581 apache  memREG  253,0   45032159 18880389
> > _4i5a_Lucene50_0.tim
> > java30581 apache  memREG  253,0   26486772 18880388
> > _4i5a_Lucene50_0.pos
> > java30581 apache  memREG  253,0   55411002 18880362
> > _4i5a_Lucene50_0.doc
> > java30581 apache  memREG  253,0   70484185 18880340 _4hkn.cfs
> > java30581 apache  memREG  253,0   10873921 18880324 _4gpz.cfs
> > java30581 apache  memREG  253,0   17230506 18880524 _4i11.cfs
> > java30581 apache  memREG  253,06706969 18880575 _4i0t.cfs
> > java30581 apache  memREG  253,0   15135578 18880624 _4i0i.cfs
> > java3

Re: debugging growing index size

2015-11-12 Thread Rob Audenaerde
Curious indeed!

I will turn on the IndexFileDeleter.VERBOSE_REF_COUNTS and recreate the
logs. Will get back with them in a day hopefully.

Thanks for the extra logging!

-Rob

On Thu, Nov 12, 2015 at 11:34 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm, curious.
>
> I looked at the [large] infoStream output and I see segment _3ou7
> present on init of IW, a few getReader calls referencing it, then a
> forceMerge that indeed merges it away, yet I do NOT see IW attempting
> deletion of its files.
>
> And indeed I see plenty (too many: many times per second?) of commits
> after that, so the index itself is no longer referencing _3ou7.
>
> If you are failing to close all NRT readers then I would expect _3ou7
> to be in the lsof output, but it's not.
>
> The NRT readers close method has logic that notifies IndexWriter when
> it's done "needing" the files, to emulate "delete on last close"
> semantics for filesystems like HDFS that don't do that ... it's
> possible something is wrong here.
>
> Can you set the (public, static) boolean
> IndexFileDeleter.VERBOSE_REF_COUNTS to true, and then re-generate this
> log?  This causes IW to log the ref count of each file it's tracking
> ...
>
> I'll also add a bit more verbosity to IW when NRT readers are opened
> and close, for 5.4.0.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 11, 2015 at 6:09 AM, Rob Audenaerde
>  wrote:
> > Hi all,
> >
> > I'm still debugging the growing-index size. I think closing index readers
> > might help (work in progress), but I can't really see them holding on to
> > files (at least, using lsof ). Restarting the application sheds some
> light,
> > I see logging on files that are no longer referenced.
> >
> > What I see is that there are files in the index directory that seem to
> > no longer be referenced.
> >
> > I put the output of the infoStream online, because it is rather big (30MB
> > gzipped):  http://www.audenaerde.org/lucene/merges.log.gz
> >
> > Output of lsof (executed 'sudo lsof *' in the index directory). This is
> > on a CentOS box (maybe that influences stuff as well?)
> >
> > COMMAND   PID   USER   FD   TYPE DEVICE   SIZE/OFF NODE NAME
> > java30581 apache  memREG  253,0 3176094924 18880508
> > _4gs5_Lucene50_0.dvd
> > java30581 apache  memREG  253,0  505758610 18880546 _4gs5.fdt
> > java30581 apache  memREG  253,0  369563337 18880631
> > _4gs5_Lucene50_0.tim
> > java30581 apache  memREG  253,0  176344058 18880623
> > _4gs5_Lucene50_0.pos
> > java30581 apache  memREG  253,0  378055201 18880606
> > _4gs5_Lucene50_0.doc
> > java30581 apache  memREG  253,0  372579599 18880400
> > _4i5a_Lucene50_0.dvd
> > java30581 apache  memREG  253,0   82017447 18880748 _4g37.cfs
> > java30581 apache  memREG  253,0   85376507 18880721 _4fb3.cfs
> > java30581 apache  memREG  253,0  363493917 18880533
> > _4ct1_Lucene50_0.dvd
> > java30581 apache  memREG  253,09421892 18880806 _4gjc.cfs
> > java30581 apache  memREG  253,0   76877461 18880553 _4ct1.fdt
> > java30581 apache  memREG  253,0   46271330 18880661
> > _4ct1_Lucene50_0.tim
> > java30581 apache  memREG  253,0   26911387 18880653
> > _4ct1_Lucene50_0.pos
> > java30581 apache  memREG  253,0   54678249 18880568
> > _4ct1_Lucene50_0.doc
> > java30581 apache  memREG  253,0   76556587 18880328 _4i5a.fdt
> > java30581 apache  memREG  253,0   45032159 18880389
> > _4i5a_Lucene50_0.tim
> > java30581 apache  memREG  253,0   26486772 18880388
> > _4i5a_Lucene50_0.pos
> > java30581 apache  memREG  253,0   55411002 18880362
> > _4i5a_Lucene50_0.doc
> > java30581 apache  memREG  253,0   70484185 18880340 _4hkn.cfs
> > java30581 apache  memREG  253,0   10873921 18880324 _4gpz.cfs
> > java30581 apache  memREG  253,0   17230506 18880524 _4i11.cfs
> > java30581 apache  memREG  253,06706969 18880575 _4i0t.cfs
> > java30581 apache  memREG  253,0   15135578 18880624 _4i0i.cfs
> > java30581 apache  memREG  253,0   15368310 18880717 _4hzp.cfs
> > java30581 apache  memREG  253,05146140 18880583 _4hze.cfs
> > java30581 apache  memREG  253,02917380 18880411 _4gs5.nvd
> > java30581 apache  memREG  253,06871469 18880732 _4hod.cfs
> > java30581 apache  memREG  253,02860341 18880495 _4i84.cfs
> > ja
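The ref counts that IndexFileDeleter.VERBOSE_REF_COUNTS logs implement "delete on last close": every commit point and every open (NRT) reader holds a reference on the files it needs, and a file becomes physically deletable only when its count drops back to zero. Stripped of Lucene specifics, the bookkeeping looks roughly like this (a standalone sketch, not IndexFileDeleter's actual code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RefCountingDeleter {
    // file name -> number of holders (commit points, open readers)
    private final Map<String, Integer> refCounts = new HashMap<>();
    final Set<String> deleted = new HashSet<>(); // stands in for real file deletion

    void incRef(String file) {
        refCounts.merge(file, 1, Integer::sum);
    }

    void decRef(String file) {
        int count = refCounts.merge(file, -1, Integer::sum);
        if (count == 0) {
            refCounts.remove(file);
            deleted.add(file); // last reference gone: delete the file
        }
    }

    public static void main(String[] args) {
        RefCountingDeleter d = new RefCountingDeleter();
        d.incRef("_3ou7.cfs"); // referenced by a commit point
        d.incRef("_3ou7.cfs"); // ...and by an open NRT reader
        d.decRef("_3ou7.cfs"); // segment merged away, commit ref dropped
        System.out.println(d.deleted.contains("_3ou7.cfs")); // false: reader still open
        d.decRef("_3ou7.cfs"); // reader finally closed
        System.out.println(d.deleted.contains("_3ou7.cfs")); // true
    }
}
```

This also shows the failure mode in the thread: an NRT reader that is never closed never drops its references, so merged-away segment files stay on disk forever.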

debugging growing index size

2015-11-11 Thread Rob Audenaerde
Hi all,

I'm still debugging the growing index size. I think closing index readers
might help (work in progress), but I can't really see them holding on to
files (at least, using lsof). Restarting the application sheds some light:
I see logging about files that are no longer referenced.

What I see is that there are files in the index directory that seem to
no longer be referenced.

I put the output of the infoStream online, because it is rather big (30MB
gzipped):  http://www.audenaerde.org/lucene/merges.log.gz

Output of lsof (executed 'sudo lsof *' in the index directory). This is
on a CentOS box (maybe that influences stuff as well?)

COMMAND   PID   USER   FD   TYPE DEVICE   SIZE/OFF NODE NAME
java30581 apache  memREG  253,0 3176094924 18880508
_4gs5_Lucene50_0.dvd
java30581 apache  memREG  253,0  505758610 18880546 _4gs5.fdt
java30581 apache  memREG  253,0  369563337 18880631
_4gs5_Lucene50_0.tim
java30581 apache  memREG  253,0  176344058 18880623
_4gs5_Lucene50_0.pos
java30581 apache  memREG  253,0  378055201 18880606
_4gs5_Lucene50_0.doc
java30581 apache  memREG  253,0  372579599 18880400
_4i5a_Lucene50_0.dvd
java30581 apache  memREG  253,0   82017447 18880748 _4g37.cfs
java30581 apache  memREG  253,0   85376507 18880721 _4fb3.cfs
java30581 apache  memREG  253,0  363493917 18880533
_4ct1_Lucene50_0.dvd
java30581 apache  memREG  253,09421892 18880806 _4gjc.cfs
java30581 apache  memREG  253,0   76877461 18880553 _4ct1.fdt
java30581 apache  memREG  253,0   46271330 18880661
_4ct1_Lucene50_0.tim
java30581 apache  memREG  253,0   26911387 18880653
_4ct1_Lucene50_0.pos
java30581 apache  memREG  253,0   54678249 18880568
_4ct1_Lucene50_0.doc
java30581 apache  memREG  253,0   76556587 18880328 _4i5a.fdt
java30581 apache  memREG  253,0   45032159 18880389
_4i5a_Lucene50_0.tim
java30581 apache  memREG  253,0   26486772 18880388
_4i5a_Lucene50_0.pos
java30581 apache  memREG  253,0   55411002 18880362
_4i5a_Lucene50_0.doc
java30581 apache  memREG  253,0   70484185 18880340 _4hkn.cfs
java30581 apache  memREG  253,0   10873921 18880324 _4gpz.cfs
java30581 apache  memREG  253,0   17230506 18880524 _4i11.cfs
java30581 apache  memREG  253,06706969 18880575 _4i0t.cfs
java30581 apache  memREG  253,0   15135578 18880624 _4i0i.cfs
java30581 apache  memREG  253,0   15368310 18880717 _4hzp.cfs
java30581 apache  memREG  253,05146140 18880583 _4hze.cfs
java30581 apache  memREG  253,02917380 18880411 _4gs5.nvd
java30581 apache  memREG  253,06871469 18880732 _4hod.cfs
java30581 apache  memREG  253,02860341 18880495 _4i84.cfs
java30581 apache  memREG  253,0 835726 18880660 _4i7z.cfs
java30581 apache  memREG  253,01005595 18880648 _4i7w.cfs
java30581 apache  memREG  253,05639672 18880401 _4i4o.cfs
java30581 apache  memREG  253,04388371 18880440 _4i4a.cfs
java30581 apache  memREG  253,01151845 18880512 _4i7v.cfs
java30581 apache  memREG  253,0 941773 18880613 _4i7x.cfs
java30581 apache  memREG  253,0 984023 18880588 _4i7o.cfs
java30581 apache  memREG  253,01790005 18880619 _4i7y.cfs
java30581 apache  memREG  253,0 466371 18880515 _4ct1.nvd
java30581 apache  memREG  253,0 723280 18880573 _4i7q.cfs
java30581 apache  memREG  253,0 806289 18880517 _4i7h.cfs
java30581 apache  memREG  253,0  17362 18880520 _4i9s.cfs
java30581 apache  memREG  253,0 698362 18880531 _4i9r.cfs
java30581 apache  memREG  253,0 483215 18880406 _4i5a.nvd
java30581 apache  memREG  253,0  14110 18880416 _4i9v.cfs
java30581 apache  memREG  253,0   6121 18880412 _4i9t.cfs
java30581 apache   30wW  REG  253,0  0 18877901 write.lock

Output of some of the biggest files in the index directory:

-rw-r--r--. 1 apache apache  358684577 Nov 11 08:04 _4fjn.cfs
-rw-r--r--. 1 apache apache  363493917 Nov 11 07:54 _4ct1_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  369563337 Nov 11 08:06 _4gs5_Lucene50_0.tim
-rw-r--r--. 1 apache apache  372579599 Nov 11 08:09 _4i5a_Lucene50_0.dvd
-rw-r--r--. 1 apache apache  378055201 Nov 11 08:06 _4gs5_Lucene50_0.doc
-rw-r--r--. 1 apache apache  427401813 Nov 10 08:14 _3ou7.cfs
-rw-r--r--. 1 apache apache  505758610 Nov 11 08:04 _4gs5.fdt
-rw-r--r--. 1 apache apache 1107391579 Nov 10 07:55 _3k3a_Lucene50_0.dvd
-rw-r--r--. 1 apache apache 3176094924 Nov 11 08:10 _4gs5_Lucene50_0.dvd

Note that the _3ou7 and _3k3a segments no longer appear to be in use?
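A quick way to cross-check this is to compare the segment names that are still live (readable via SegmentInfos or CheckIndex) against the file names actually on disk; any file whose segment prefix is no longer live is a leak candidate. A rough standalone sketch (hypothetical helper, not a Lucene utility):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OrphanCheck {
    // Derive the segment name from a Lucene file name:
    // "_4gs5_Lucene50_0.dvd" and "_4gs5.fdt" both belong to segment "_4gs5".
    static String segmentOf(String file) {
        for (int i = 1; i < file.length(); i++) {
            char c = file.charAt(i);
            if (c == '.' || c == '_') return file.substring(0, i);
        }
        return file;
    }

    // Files whose segment is not in the live set are candidates for
    // "unreferenced but still on disk".
    static List<String> orphans(Set<String> liveSegments, List<String> files) {
        List<String> out = new ArrayList<>();
        for (String f : files) {
            if (f.startsWith("_") && !liveSegments.contains(segmentOf(f))) {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> live = Set.of("_4gs5", "_4i5a");
        List<String> onDisk = List.of(
            "_4gs5.fdt", "_3ou7.cfs", "_3k3a_Lucene50_0.dvd", "_4i5a.fdt");
        System.out.println(orphans(live, onDisk)); // [_3ou7.cfs, _3k3a_Lucene50_0.dvd]
    }
}
```

Applied to the listing above, _3ou7 and _3k3a would show up as orphans while every file that lsof reports as mapped stays in the live set.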


Re: index size growing while deleting

2015-11-10 Thread Rob Audenaerde
Ah yes, that is the way to go.

It is a bit harder here, because we also use a per-user InMemoryIndex that
is combined in a multi-reader, so it will be a bit more work, but I think
it will be doable. Thanks for all the help.

That said, I found it not so easy to debug this issue; are there methods
(on the IndexWriter / text in the infoStream?) that I could have used to
detect what was going on? That might be helpful for others as well.

-Rob


On Tue, Nov 10, 2015 at 1:32 PM, Jürgen Albert 
wrote:

> Hi Rob,
>
> we use a SearcherManager to obtain a fresh Searcher for every Query. From
> the Searcher we get the Reader. After the query you call
> searcherManager.release(searcher). The SearcherManager takes care of the
> rest.
>
> Regards,
>
> Jürgen.
>
>
> Am 10.11.2015 um 13:27 schrieb Rob Audenaerde:
>
>> Hi Jürgen, Michael
>>
>> Thanks! I seem to be able to reduce the index size by closing and
>> restarting our application. This reduces the index size from 22G to 4G,
>> which is somewhat the expected size. The infoStream also gives me the
>> 'removed unreferenced file (IFD 0 [2015-11-10T12:21:49.293Z; main]: init:
>> removing unreferenced file '...)
>>
>> Now I just need to figure out how to close the IndexReader while keeping
>> the application running..  I guess I should/could do something with the
>> openIfChanged. Will look further.
>>
>> -Rob
>>
>>
>>
>> On Tue, Nov 10, 2015 at 12:19 PM, Jürgen Albert <
>> j.alb...@data-in-motion.biz
>>
>>> wrote:
>>> Hi Rob,
>>>
>>> we had a similar problem. In our case we had open index readers, that
>>> blocked the index from merging its segments and thus deleting the marked
>>> segments.
>>>
>>> Regards,
>>>
>>> Jürgen.
>>>
>>>
>>> Am 06.11.2015 um 08:59 schrieb Rob Audenaerde:
>>>
>>> Hi will, others
>>>>
>>>> Thanks for you reply,
>>>>
>>>> As far as I understand it, deleting a document is just setting the
>>>> deleted
>>>> bit, and when segments are merged, then the documents are removed. (not
>>>> really sure what this means exactly; I guess the document gets removed
>>>> from
>>>> the store, the terms will no longer refer to that document. Not sure if
>>>> terms get removed if no longer needed, etc). If there are resources to
>>>> read
>>>> to improve my understanding I havo not found them (yet), if you could
>>>> point
>>>> me to some that be great!
>>>>
>>>> I use the default IndexWriterConfig, which I see uses
>>>> TieredMergePolicy. I
>>>> never close my IndexWriter; as I use NRT searching I just always keep it
>>>> open.
>>>>
>>>> My two guesses are that: a) old segments are not removed from disk or b)
>>>> deletes are not cleaned up as well as I thought they would be.
>>>>
>>>> I have made a testcase which indexes 5 million rows (five iterations,
>>>> five
>>>> indexing threads, indexing and deleting all such documents after each
>>>> iteration with deleteByQuery), the rows randomly generated. I see the
>>>> Taxonomy ever growing (which is logical, because facet-ordinals are
>>>> never
>>>> removed as far as I understand); the index grows, but also shrinks when
>>>> deleting. So I cannot reproduce my problem easily :(
>>>>
>>>> I will start diving into the Lucene source code, but I was hoping I just
>>>> did something wrong. .
>>>>
>>>> Any hints are appreciated!
>>>>
>>>> -Rob
>>>>
>>>>
>>>> On Thu, Nov 5, 2015 at 2:52 PM, will  wrote:
>>>>
>>>> Hi Rob:
>>>>
>>>>> Do you understand how deletes work and how an index is compacted?
>>>>>
>>>>> There are some configuration/runtime activities you don't mention. And
>>>>> you make the testing process sound like a mirror of production? (Including
>>>>> configuration?)
>>>>>
>>>>>
>>>>> -will
>>>>>
>>>>>
>>>>> On 11/5/15 7:33 AM, Rob Audenaerde wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>>> I'm currently investigating an issue we have with our index. It keeps
>>>>>> getting bigger, and I don't get why.
>>>>>>

Re: index size growing while deleting

2015-11-10 Thread Rob Audenaerde
Hi Jürgen, Michael

Thanks! I seem to be able to reduce the index size by closing and
restarting our application. This reduces the index size from 22G to 4G,
which is somewhat the expected size. The infoStream also gives me the
'removed unreferenced file (IFD 0 [2015-11-10T12:21:49.293Z; main]: init:
removing unreferenced file '...)

Now I just need to figure out how to close the IndexReader while keeping
the application running. I guess I should/could do something with the
openIfChanged. Will look further.
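For reference, the reader-refresh pattern described earlier in the thread (a SearcherManager rather than hand-rolled openIfChanged) might look roughly like this. This is an illustrative sketch assuming a Lucene 5.x classpath; the class and method names around it are made up for the example:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

public class SearchService {
    private final SearcherManager manager;

    public SearchService(IndexWriter writer) throws IOException {
        // applyAllDeletes=true: NRT readers see deletes, so old segment
        // files can be dropped once no acquired reader references them
        this.manager = new SearcherManager(writer, true, null);
    }

    public TopDocs search(Query query, int n) throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            return searcher.search(query, n);
        } finally {
            manager.release(searcher);  // stale readers close on release
        }
    }

    // call periodically (e.g. from a background thread) after indexing
    public void refresh() throws IOException {
        manager.maybeRefresh();
    }
}
```

The key point for the index-size problem: as long as every acquire() is paired with release(), no reader keeps merged-away segments pinned on disk.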

-Rob



On Tue, Nov 10, 2015 at 12:19 PM, Jürgen Albert  wrote:

> Hi Rob,
>
> we had a similar problem. In our case we had open index readers, that
> blocked the index from merging its segments and thus deleting the marked
> segments.
>
> Regards,
>
> Jürgen.
>
>
> Am 06.11.2015 um 08:59 schrieb Rob Audenaerde:
>
>> Hi will, others
>>
>> Thanks for your reply,
>>
>> As far as I understand it, deleting a document is just setting the deleted
>> bit, and when segments are merged, then the documents are removed. (not
>> really sure what this means exactly; I guess the document gets removed
>> from
>> the store, the terms will no longer refer to that document. Not sure if
>> terms get removed if no longer needed, etc). If there are resources to
>> read
>> to improve my understanding I have not found them (yet); if you could
>> point
>> me to some that would be great!
>>
>> I use the default IndexWriterConfig, which I see uses TieredMergePolicy. I
>> never close my IndexWriter; as I use NRT searching I just always keep it
>> open.
>>
>> My two guesses are that: a) old segments are not removed from disk or b)
>> deletes are not cleaned up as well as I thought they would be.
>>
>> I have made a testcase which indexes 5 million rows (five iterations, five
>> indexing threads, indexing and deleting all such documents after each
>> iteration with deleteByQuery), the rows randomly generated. I see the
>> Taxonomy ever growing (which is logical, because facet-ordinals are never
>> removed as far as I understand); the index grows, but also shrinks when
>> deleting. So I cannot reproduce my problem easily :(
>>
>> I will start diving into the Lucene source code, but I was hoping I just
>> did something wrong. .
>>
>> Any hints are appreciated!
>>
>> -Rob
>>
>>
>> On Thu, Nov 5, 2015 at 2:52 PM, will  wrote:
>>
>> Hi Rob:
>>>
>>> Do you understand how deletes work and how an index is compacted?
>>>
>>> There are some configuration/runtime activities you don't mention. And
>>> you make the testing process sound like a mirror of production? (Including
>>> configuration?)
>>>
>>>
>>> -will
>>>
>>>
>>> On 11/5/15 7:33 AM, Rob Audenaerde wrote:
>>>
>>> Hi all,
>>>>
>>>> I'm currently investigating an issue we have with our index. It keeps
>>>> getting bigger, and I don't get why.
>>>>
>>>> Here is our use case:
>>>>
>>>> We index a database of about 4 million records; spread over a few
>>>> hundred
>>>> tables. The data consists of a mix of text, dates, numbers etc. We also
>>>> add
>>>> all these fields as facets.
>>>> Each night we delete about 90% of the data, which in testing reduces the
>>>> index size significantly.
>>>> We store the data as StoredFields as well, to prevent having to access
>>>> the
>>>> database at all.
>>>> We use FloatAssociatedFacet fields for the facets.
>>>>
>>>>
>>>> In production however, it seems the index is only growing, up to 71 GB
>>>> for
>>>> these records for a month of running.
>>>>
>>>> It seems that Lucene's index is just getting bigger there.
>>>>
>>>> We use Lucene 5.3 on CentOS, Java 8 64-bit.
>>>>
>>>> The taxonomy-index does not grow significantly.
>>>>
>>>> How should I go about checking what is wrong?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>
> --
> Jürgen Albert
> Geschäftsführer
>
> Data In Motion UG (haftungsbeschränkt)
>
> Kahlaische Str. 4
> 07745 Jena
>
> Mobil:  0157-72521634
> E-Mail: j.alb...@datainmotion.de
> Web: www.datainmotion.de
>
> XING:   https://www.xing.com/profile/Juergen_Albert5
>
> Rechtliches
>
> Jena HBR 507027
> USt-IdNr: DE274553639
> St.Nr.: 162/107/04586
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: index size growing while deleting

2015-11-08 Thread Rob Audenaerde
On Fri, Nov 6, 2015 at 11:29 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

It's also important to IndexWriter.commit (as well as open new NRT
> readers) periodically or after doing a large set of updates, as that
> lets Lucene remove any old segments referenced by the prior commit
> point.
>

When re-reading your comment I noticed I skipped over the part in
brackets, and I have an additional question:

Why is it needed to open new NRT Readers? (btw I use the openIfChanged()
approach when maybeRefreshing())

Thanks!


Re: index size growing while deleting

2015-11-06 Thread Rob Audenaerde
Thanks Mike for the reply,

I already commit after every 5,000 documents per thread.

I also found out today how to enable the InfoStream through the
IndexWriterConfig, so I'll have lots of extra information to work on. Will
run it on the production environment to find out what's happening there.
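Enabling the infoStream mentioned above is roughly the following; a sketch assuming a Lucene 5.x classpath (in 5.x the IndexWriterConfig constructor no longer takes a Version argument):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.PrintStreamInfoStream;

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
// Logs segment flush/merge and IndexFileDeleter (IFD) activity; the
// IFD lines show when unreferenced files are actually removed.
config.setInfoStream(new PrintStreamInfoStream(System.out));
```

In production you would point the PrintStream at a log file rather than System.out, since the output is verbose.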

Any hints are appreciated!

-Rob

On Fri, Nov 6, 2015 at 11:29 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> It's also important to IndexWriter.commit (as well as open new NRT
> readers) periodically or after doing a large set of updates, as that
> lets Lucene remove any old segments referenced by the prior commit
> point.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Nov 6, 2015 at 2:59 AM, Rob Audenaerde 
> wrote:
> > Hi will, others
> >
> > Thanks for your reply,
> >
> > As far as I understand it, deleting a document is just setting the
> deleted
> > bit, and when segments are merged, then the documents are removed. (not
> > really sure what this means exactly; I guess the document gets removed
> from
> > the store, the terms will no longer refer to that document. Not sure if
> > terms get removed if no longer needed, etc). If there are resources to
> read
> > to improve my understanding I have not found them (yet); if you could
> point
> > me to some that would be great!
> >
> > I use the default IndexWriterConfig, which I see uses TieredMergePolicy.
> I
> > never close my IndexWriter; as I use NRT searching I just always keep it
> > open.
> >
> > My two guesses are that: a) old segments are not removed from disk or b)
> > deletes are not cleaned up as well as I thought they would be.
> >
> > I have made a testcase which indexes 5 million rows (five iterations,
> five
> > indexing threads, indexing and deleting all such documents after each
> > iteration with deleteByQuery), the rows randomly generated. I see the
> > Taxonomy ever growing (which is logical, because facet-ordinals are never
> > removed as far as I understand); the index grows, but also shrinks when
> > deleting. So I cannot reproduce my problem easily :(
> >
> > I will start diving into the Lucene source code, but I was hoping I just
> > did something wrong. .
> >
> > Any hints are appreciated!
> >
> > -Rob
> >
> >
> > On Thu, Nov 5, 2015 at 2:52 PM, will  wrote:
> >
> >> Hi Rob:
> >>
> >> Do you understand how deletes work and how an index is compacted?
> >>
> >> There are some configuration/runtime activities you don't mention. And
> >> you make the testing process sound like a mirror of production? (Including
> >> configuration?)
> >>
> >>
> >> -will
> >>
> >>
> >> On 11/5/15 7:33 AM, Rob Audenaerde wrote:
> >>
> >>> Hi all,
> >>>
> >>> I'm currently investigating an issue we have with our index. It keeps
> >>> getting bigger, and I don't get why.
> >>>
> >>> Here is our use case:
> >>>
> >>> We index a database of about 4 million records; spread over a few
> hundred
> >>> tables. The data consists of a mix of text, dates, numbers etc. We also
> >>> add
> >>> all these fields as facets.
> >>> Each night we delete about 90% of the data, which in testing reduces
> the
> >>> index size significantly.
> >>> We store the data as StoredFields as well, to prevent having to access
> the
> >>> database at all.
> >>> We use FloatAssociatedFacet fields for the facets.
> >>>
> >>>
> >>> In production however, it seems the index is only growing, up to 71 GB
> for
> >>> these records for a month of running.
> >>>
> >>> It seems that Lucene's index is just getting bigger there.
> >>>
> >>> We use Lucene 5.3 on CentOS, Java 8 64-bit.
> >>>
> >>> The taxonomy-index does not grow significantly.
> >>>
> >>> How should I go about checking what is wrong?
> >>>
> >>> Thanks!
> >>>
> >>>
> >>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: index size growing while deleting

2015-11-05 Thread Rob Audenaerde
Hi will, others

Thanks for your reply,

As far as I understand it, deleting a document is just setting the deleted
bit, and when segments are merged, then the documents are removed. (not
really sure what this means exactly; I guess the document gets removed from
the store, the terms will no longer refer to that document. Not sure if
terms get removed if no longer needed, etc). If there are resources to read
to improve my understanding I have not found them (yet); if you could point
me to some that would be great!

I use the default IndexWriterConfig, which I see uses TieredMergePolicy. I
never close my IndexWriter; as I use NRT searching I just always keep it
open.

My two guesses are that: a) old segments are not removed from disk or b)
deletes are not cleaned up as well as I thought they would be.
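One way to test both guesses explicitly is to force the cleanup and watch the directory size before and after. A hedged sketch, assuming an open IndexWriter `writer` and a Query `query` matching the nightly deletes; forceMergeDeletes() is I/O-heavy and normally unnecessary, so this is a diagnostic, not a production pattern:

```java
// Diagnostic: delete, commit, then force-merge away deletions.
writer.deleteDocuments(query);   // only marks documents as deleted
writer.commit();                 // frees segments held by the prior commit point
writer.forceMergeDeletes();      // rewrites segments containing deletions
writer.commit();
```

If the directory shrinks only after this, deletes were being reclaimed too lazily (guess b); if it shrinks only after closing readers, old segments were pinned on disk (guess a).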

I have made a testcase which indexes 5 million rows (five iterations, five
indexing threads, indexing and deleting all such documents after each
iteration with deleteByQuery), the rows randomly generated. I see the
Taxonomy ever growing (which is logical, because facet-ordinals are never
removed as far as I understand); the index grows, but also shrinks when
deleting. So I cannot reproduce my problem easily :(

I will start diving into the Lucene source code, but I was hoping I just
did something wrong. .

Any hints are appreciated!

-Rob


On Thu, Nov 5, 2015 at 2:52 PM, will  wrote:

> Hi Rob:
>
> Do you understand how deletes work and how an index is compacted?
>
> There are some configuration/runtime activities you don't mention. And
> you make the testing process sound like a mirror of production? (Including
> configuration?)
>
>
> -will
>
>
> On 11/5/15 7:33 AM, Rob Audenaerde wrote:
>
>> Hi all,
>>
>> I'm currently investigating an issue we have with our index. It keeps
>> getting bigger, and I don't get why.
>>
>> Here is our use case:
>>
>> We index a database of about 4 million records; spread over a few hundred
>> tables. The data consists of a mix of text, dates, numbers etc. We also
>> add
>> all these fields as facets.
>> Each night we delete about 90% of the data, which in testing reduces the
>> index size significantly.
>> We store the data as StoredFields as well, to prevent having to access the
>> database at all.
>> We use FloatAssociatedFacet fields for the facets.
>>
>>
>> In production however, it seems the index is only growing, up to 71 GB for
>> these records for a month of running.
>>
>> It seems that Lucene's index is just getting bigger there.
>>
>> We use Lucene 5.3 on CentOS, Java 8 64-bit.
>>
>> The taxonomy-index does not grow significantly.
>>
>> How should I go about checking what is wrong?
>>
>> Thanks!
>>
>>
>


index size growing while deleting

2015-11-05 Thread Rob Audenaerde
Hi all,

I'm currently investigating an issue we have with our index. It keeps
getting bigger, and I don't get why.

Here is our use case:

We index a database of about 4 million records; spread over a few hundred
tables. The data consists of a mix of text, dates, numbers etc. We also add
all these fields as facets.
Each night we delete about 90% of the data, which in testing reduces the
index size significantly.
We store the data as StoredFields as well, to prevent having to access the
database at all.
We use FloatAssociatedFacet fields for the facets.


In production however, it seems the index is only growing, up to 71 GB for
these records for a month of running.

It seems that Lucene's index is just getting bigger there.

We use Lucene 5.3 on CentOS, Java 8 64-bit.

The taxonomy-index does not grow significantly.

How should I go about checking what is wrong?

Thanks!


Number of threads in index writer config?

2015-08-27 Thread Rob Audenaerde
Hi all,

I was wondering about the number of threads to use for indexing.

There is a setting: getMaxThreadStates() in the IndexWriterConfig that
determines how many threads can write to the index simultaneously.

The luceneutil Indexer.java (that is used for the nightly benchmarks)
seems to use the default value (8), while it uses 20 indexing threads.

Is there a reason to not set the maxThreadStates to the number of indexing
threads?

Thanks!


Re: GROUP BY in Lucene

2015-08-10 Thread Rob Audenaerde
You can write a custom (facet) collector to do this. I have done something
similar, I'll describe my approach:

For all the values that need grouping or aggregating, I have added a
FacetField (an AssociatedFacetField, so I can store the value alongside
the ordinal). The main search stays the same, in your case for example a
NumericRangeQuery (if the date is stored in ms).

Then I have a custom facet collector that does the grouping.

Basically, it goes through all the MatchingDocs. For each doc, it creates a
unique key (composed of X, Y and Z), and computes aggregates as needed (sum
D). These are stored in a map. If a key is already in the map, the existing
aggregate is added to the new value. The tricky part is making your unique
key fast and immutable, so you can precompute the hashcode.

This is fast enough if the number of unique keys is smallish (<10,000;
index size ±1M docs).
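The fast, immutable composite key with a precomputed hashcode can be sketched in plain Java; the field names X/Y/Z and the demo data are illustrative, not from the thread:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Immutable composite group-by key with a precomputed hash code,
// so repeated map lookups during collection stay cheap.
final class GroupKey {
    final String x, y, z;
    private final int hash;  // computed once at construction

    GroupKey(String x, String y, String z) {
        this.x = x; this.y = y; this.z = z;
        this.hash = Objects.hash(x, y, z);
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GroupKey)) return false;
        GroupKey k = (GroupKey) o;
        return x.equals(k.x) && y.equals(k.y) && z.equals(k.z);
    }
}

public class GroupByDemo {
    // Aggregate (sum of d) per composite key, as the collector would.
    static Map<GroupKey, Double> sumByKey(String[][] rows, double[] d) {
        Map<GroupKey, Double> sums = new HashMap<>();
        for (int i = 0; i < rows.length; i++) {
            GroupKey key = new GroupKey(rows[i][0], rows[i][1], rows[i][2]);
            sums.merge(key, d[i], Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[][] rows = { {"a","b","c"}, {"a","b","c"}, {"a","b","d"} };
        double[] d = { 60.0, 40.0, 5.0 };
        Map<GroupKey, Double> sums = sumByKey(rows, d);
        System.out.println(sums.get(new GroupKey("a","b","c"))); // 100.0
        System.out.println(sums.size()); // 2
    }
}
```

In the real collector, X/Y/Z would be read per matching doc (e.g. from doc values), but the map-merge structure is the same.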

-Rob


On Mon, Aug 10, 2015 at 2:47 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Lucene has a grouping module that has several approaches for grouping
> search hits, though it's only by a single field I believe.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, Aug 9, 2015 at 2:55 PM, Gimantha Bandara 
> wrote:
> > Hi all,
> >
> > Is there a way to achieve $subject? For example, consider the following
> SQL
> > query.
> >
> > SELECT A, B, C SUM(D) as E FROM  `table` WHERE time BETWEEN fromDate AND
> > toDate *GROUP BY X,Y,Z*
> >
> > In the above query we can group the records by, X,Y,Z. Is there a way to
> > achieve the same in Lucene? (I guess Faceting would help, but is it
> > possible to get all the categoryPaths along with the matching records?) Is
> > there any other way other than using Facets?
> >
> > --
> > Gimantha Bandara
> > Software Engineer
> > WSO2. Inc : http://wso2.com
> > Mobile : +94714961919
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


disabling all scoring?

2015-02-04 Thread Rob Audenaerde
Hi all,

I'm doing some analytics with a custom Collector on a fairly large number
of search results (±100,000: all the hits that return from a query). I need
to retrieve them by a query (so using search), but I don't need any scoring,
nor do I need to keep the documents in any order.

When profiling the application, I saw that for my tests, my entire search
takes about 2.4 seconds, and BulkScorer takes 0.4 seconds. So I figured
that without scoring, I would be able to chop off 0.4 seconds (a ~17% speed
increase). That seems reasonable.

What would be the best approach to disable all the 'search-goodies' and
just pass the results as fast as possible into my Collector?
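One commonly suggested direction (a sketch, not verified against this profile): wrap the query so every hit gets a constant score, and have the collector never ask the Scorer for a score:

```java
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;

// Sketch: avoid per-hit scoring work by wrapping the original query.
// `searcher`, `originalQuery` and `myAnalyticsCollector` are assumed
// to exist in the surrounding code; the collector's setScorer() should
// simply ignore the Scorer it is given.
Query noScoring = new ConstantScoreQuery(originalQuery);
searcher.search(noScoring, myAnalyticsCollector);
```

Whether this actually removes the BulkScorer time seen in the profile would need to be measured; matching (as opposed to scoring) still has to happen either way.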

Thanks for your insights.

-Rob


fill 'empty' facet-values, sampling, taxoreader

2015-01-12 Thread Rob Audenaerde
Hi all,

I'm building an application in which users can add arbitrary documents, and
all fields will be added as facets as well. This allows users to browse
their documents by their own defined facets easily.

However, when the number of documents gets very large, I switch to
random-sampled facets to make sure the application stays responsive. By the
nature of sampling, documents (and thus facet-values) will be missed.

I let the user select the number of facet-values he want to see for each
facets. For example, the default is 10. If a facet contains values 1 to 20,
the user will always see 10 values if all documents are returned in the
search and no sampling is done.

If sampling is done, and the values are non-uniformly distributed, the user
might end up with only 5 values instead of 10. I want to 'fill' the empty 5
facet-value-slots with existing facet-values and an unknown facet-count
(?). The reason behind this, is that this value might exist in the
resultset and for interaction purposes, it is very nice if this value can
be selected and added to the query, to quickly find if there are documents
that also contain this facet value.

It is even more useful if these facet values are not sorted by count, but
by label. The user can then quickly see there are document that contain a
certain value.

I can iterate over the ordinals via the TaxonomyReader and TaxonomyFacets
(by leveraging the 'children'), but these ordinals might no longer be used
in the documents.

What would be a good approach to tackle this issue?


Re: search performance

2014-06-03 Thread Rob Audenaerde
Hi Jamie,

What is included in the 5 minutes?

Just the call to the searcher?

seacher.search(...) ?

Can you show a bit more of the code you use?



On Tue, Jun 3, 2014 at 11:32 AM, Jamie  wrote:

> Vitaly
>
> Thanks for the contribution. Unfortunately, we cannot use Lucene's
> pagination function, because in reality the user can skip pages to start
> the search at any point, not just from the end of the previous search. Even
> the
> first search (without any pagination), with a max of 1000 hits, takes 5
> minutes to complete.
>
> Regards
>
> Jamie
>
> On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:
>
>> Something doesn't quite add up.
>>
>> TopFieldCollector fieldCollector = TopFieldCollector.create(sort,
>> max,true,
>>
>>> false, false, true);
>>>
>>> We use pagination, so only returning 1000 documents or so at a time.
>>>
>>>
>>>  You say you are using pagination, yet the API you are using to create
>> your
>> collector isn't how you would utilize Lucene's built-in "pagination"
>> feature (unless misunderstand the API). If the max is the snippet above is
>> 1000, then you're simply returning top 1000 docs every time you execute
>> your search. Otherwise... well, could you actually post a bit more of your
>> code that runs the search here, in particular?
>>
>> Assuming that the max is much larger than 1000, however, you could call
>> fieldCollector.topDocs(int, int) after accumulating hits using this
>> collector, but this won't work multiple times per query execution,
>> according to the javadoc. So you either have to re-execute the full
>> search,
>> and then get the next chunk of ScoreDocs, or use the proper API for this,
>> one that accepts as a parameter the end of the previous page of results,
>> i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Getting multi-values to use in filter?

2014-04-29 Thread Rob Audenaerde
Hi Shai,

I read the article on your blog, thanks for it! It seems to be a natural fit to 
do multi-values like this, and it is helpful indeed. For my specific problem, I 
have multiple values that do not have a fixed number, so it can be anywhere from 0 to 
10 values. I think the best way to solve this is to encode the number of values 
as the first entry in the BDV. This is not that hard, so I will take this road.
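The count-prefix encoding described above can be sketched in plain Java; the resulting byte[] is what one might store per document in a BinaryDocValues field (the class name and fixed-width layout are illustrative assumptions, not code from the thread):

```java
import java.nio.ByteBuffer;

// Encode a variable number of long values into one byte[] with the
// count as the first entry, and decode them back.
public class MultiValueCodec {
    static byte[] encode(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * values.length);
        buf.putInt(values.length);            // number of values comes first
        for (long v : values) buf.putLong(v); // then the values themselves
        return buf.array();
    }

    static long[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] values = new long[buf.getInt()];
        for (int i = 0; i < values.length; i++) values[i] = buf.getLong();
        return values;
    }

    public static void main(String[] args) {
        long[] round = decode(encode(new long[] {60L, 40L}));
        System.out.println(round.length + " " + round[0] + " " + round[1]); // 2 60 40
    }
}
```

A variable-length (vLong) encoding would be more compact, but the fixed-width version keeps the sketch simple.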

-Rob


> Op 27 apr. 2014 om 21:27 heeft Shai Erera  het volgende 
> geschreven:
> 
> Hi Rob,
> 
> Your question got me interested, so I wrote a quick prototype of what I
> think solves your problem (and if not, I hope it solves someone else's!
> :)). The idea is to write a special ValueSource, e.g. MaxValueSource which
> reads a BinaryDocValues field, decodes the values and returns the maximum one. It
> can then be embedded in an expression quite easily.
> 
> I published a post on Lucene expressions and included some prototype code
> which demonstrates how to do it. Hope it's still helpful to you:
> http://shaierera.blogspot.com/2014/04/expressions-with-lucene.html.
> 
> Shai
> 
> 
>> On Thu, Apr 24, 2014 at 1:20 PM, Shai Erera  wrote:
>> 
>> I don't think that you should use the facet module. If all you want is to
>> encode a bunch of numbers under a 'foo' field, you can encode them into a
>> byte[] and index them as a BDV. Then at search time you get the BDV and
>> decode the numbers back. The facet module adds complexity here: yes, you
>> get the encoding/decoding for free, but at the cost of adding mock
>> categories to the taxonomy, or use associations, for no good reason IMO.
>> 
>> Once you do that, you need to figure out how to extend the expressions
>> module to support a function like maxValues(fieldName) (cannot use 'max'
>> since it's reserved). I read about it some, and still haven't figured out
>> exactly how to do it. The JavascriptCompiler can take custom functions to
>> compile expressions, but the methods should take only double values. So I
>> think it should be some sort of binding, but I'm not sure yet how to do it.
>> Perhaps it should be a name like max_fieldName, which you add a custom
>> Expression to as a binding ... I will try to look into it later.
>> 
>> Shai
>> 
>> 
>> On Wed, Apr 23, 2014 at 6:49 PM, Rob Audenaerde 
>> wrote:
>> 
>>> Thanks for all the questions, gives me an opportunity to clarify it :)
>>> 
>>> I want the user to be able to give a (simple) formula (so I don't know it
>>> on beforehand) and use that formula in the search. The Javascript
>>> expressions are really powerful in this use case, but have the
>>> single-value
>>> limitation. Ideally, I would like to make it really flexible by for
>>> example
>>> allowing (in-document aggregating) expressions like: max(fieldA) - fieldB
>>>> 
>>> fieldC.
>>> 
>>> Currently, using single values, I can handle expressions in the form of
>>> "fieldA - fieldB - fieldC > 0" and evaluate the long-value that I receive
>>> from the FunctionValues and the ValueSource. I also optimize the query by
>>> ensuring the field exists and has a value, etc., to keep the search fast
>>> enough. This works well, but single-value only.
>>> 
>>> I also looked into the facets Association Fields, as they somewhat look
>>> like the thing that I want. Only in the faceting module, all ordinals and
>>> values are stored in one field, so there is no easy way to extract the fields
>>> that are used in the expression.
>>> 
>>> I like the solution one you suggested, to add all the numeric fields an
>>> encoded byte[] like the facets do, but then on a per-field basis, so that
>>> each numeric field has a BDV field that contains all multiple values for
>>> that field for that document.
>>> 
>>> Now that I am typing this, I think there is another way. I could use the
>>> faceting module and add a different facet field ($facetFIELDA,
>>> $facetFIELDB) in the FacetsConfig for each field. That way it would be
>>> relatively straightforward to get all the values for a field, as they are
>>> exactly all the values for the BDV for that document's facet field. Only
>>> aggregating all facets will be harder, as the
>>> TaxonomyFacetSum*Associations
>>> would need to do this for all fields that I need facet counts/sums for.
>>> 
>>> What do you think?
>>> 
>>> -Rob
>>> 
>>> 
>>>> On Wed, Apr 23, 2014 at 5:13 PM, Shai E

Re: Getting multi-values to use in filter?

2014-04-23 Thread Rob Audenaerde
Thanks for all the questions, gives me an opportunity to clarify it :)

I want the user to be able to give a (simple) formula (so I don't know it
on beforehand) and use that formula in the search. The Javascript
expressions are really powerful in this use case, but have the single-value
limitation. Ideally, I would like to make it really flexible by for example
allowing (in-document aggregating) expressions like: max(fieldA) - fieldB >
fieldC.

Currently, using single values, I can handle expressions in the form of
"fieldA - fieldB - fieldC > 0" and evaluate the long-value that I receive
from the FunctionValues and the ValueSource. I also optimize the query by
ensuring the field exists and has a value, etc., to keep the search fast
enough. This works well, but single-value only.

I also looked into the facets Association Fields, as they somewhat look
like the thing that I want. Only in the faceting module, all ordinals and
values are stored in one field, so there is no easy way extract the fields
that are used in the expression.

I like the solution one you suggested, to add all the numeric fields an
encoded byte[] like the facets do, but then on a per-field basis, so that
each numeric field has a BDV field that contains all multiple values for
that field for that document.

Now that I am typing this, I think there is another way. I could use the
faceting module and add a different facet field ($facetFIELDA,
$facetFIELDB) in the FacetsConfig for each field. That way it would be
relatively straightforward to get all the values for a field, as they are
exactly all the values for the BDV for that document's facet field. Only
aggregating all facets will be harder, as the TaxonomyFacetSum*Associations
would need to do this for all fields that I need facet counts/sums for.

What do you think?

-Rob


On Wed, Apr 23, 2014 at 5:13 PM, Shai Erera  wrote:

> A NumericDocValues field can only hold one value. Have you thought about
> encoding the values in a BinaryDocValues field? Or are you talking about
> multiple fields (different names), each has its own single value, and at
> search time you sum the values from a different set of fields?
>
> If it's one field, multiple values, then why do you need to separate the
> values? Is it because you sometimes sum and sometimes e.g. avg? Do you
> always include all values of a document in the formula, but the formula
> changes between searches, or do you sometimes use only a subset of the
> values?
>
> If you always use all values, but change the formula between queries, then
> perhaps you can just encode the pre-computed value under different NDV
> fields? If you only use a handful of functions (and they are known in
> advance), it may not be too heavy on the index, and definitely perform
> better during search.
>
> Otherwise, I believe I'd consider indexing them as a BDV field. For facets,
> we basically need the same multi-valued numeric field, and given that NDV
> is single valued, we went w/ BDV.
>
> If I misunderstood the scenario, I'd appreciate if you clarify it :)
>
> Shai
>
>
> On Wed, Apr 23, 2014 at 5:49 PM, Rob Audenaerde  >wrote:
>
> > Hi Shai, all,
> >
> > I am trying to write that Filter :). But I'm a bit at loss as how to
> > efficiently grab the multi-values. I can access the
> > context.reader().document() that accesses the storedfields, but that
> seems
> > slow.
> >
> > For single-value fields I use a compiled JavaScript Expression with
> > simplebindings as ValueSource, which seems to work quite well. The
> downside
> > is that I cannot find a way to implement multi-value through that
> solution.
> >
> > These create for example a LongFieldSource, which uses the
> > FieldCache.LongParser. These parsers only seem to parse one field.
> >
> > Is there an efficient way to get -all- of the (numeric) values for a
> field
> > in a document?
> >
> >
> > On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera  wrote:
> >
> > > You can do that by writing a Filter which returns matching documents
> > based
> > > on a sum of the field's value. However I suspect that is going to be
> > slow,
> > > unless you know that you will need several such filters and can cache
> > them.
> > >
> > > Another approach would be to write a Collector which serves as a
> Filter,
> > > but computes the sum only for documents that match the query. Hopefully
> > > that would mean you compute the sum for less documents than you would
> > have
> > > w/ the Filter approach.
> > >
> > > Shai
> > >
> > >
> > > On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov <
> > > msoko...@safariboo

Re: Getting multi-values to use in filter?

2014-04-23 Thread Rob Audenaerde
Hi Shai, all,

I am trying to write that Filter :). But I'm a bit at loss as how to
efficiently grab the multi-values. I can access the
context.reader().document() that accesses the storedfields, but that seems
slow.

For single-value fields I use a compiled JavaScript Expression with
simplebindings as ValueSource, which seems to work quite well. The downside
is that I cannot find a way to implement multi-value through that solution.
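The compiled-expression setup described above looks roughly like this; a sketch against the Lucene expressions module of that era (4.x), where the field names mirror the examples in the thread:

```java
import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.search.SortField;

// Compile a user-supplied formula and bind each variable to an
// indexed numeric field; each binding yields one value per document.
Expression expr = JavascriptCompiler.compile("fieldA - fieldB - fieldC");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("fieldA", SortField.Type.LONG));
bindings.add(new SortField("fieldB", SortField.Type.LONG));
bindings.add(new SortField("fieldC", SortField.Type.LONG));
ValueSource vs = expr.getValueSource(bindings);  // single value per doc
```

This is exactly where the single-value limitation bites: each bound SortField resolves to one value per document, so multi-valued fields need a custom ValueSource underneath.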

These create for example a LongFieldSource, which uses the
FieldCache.LongParser. These parsers only seem to parse one field.

Is there an efficient way to get -all- of the (numeric) values for a field
in a document?


On Wed, Apr 23, 2014 at 4:38 PM, Shai Erera  wrote:

> You can do that by writing a Filter which returns matching documents based
> on a sum of the field's value. However I suspect that is going to be slow,
> unless you know that you will need several such filters and can cache them.
>
> Another approach would be to write a Collector which serves as a Filter,
> but computes the sum only for documents that match the query. Hopefully
> that would mean you compute the sum for less documents than you would have
> w/ the Filter approach.
>
> Shai
>
>
> On Wed, Apr 23, 2014 at 5:11 PM, Michael Sokolov <
> msoko...@safaribooksonline.com> wrote:
>
> > This isn't really a good use case for an index like Lucene.  The most
> > essential property of an index is that it lets you look up documents very
> > quickly based on *precomputed* values.
> >
> > -Mike
> >
> >
> > On 04/23/2014 06:56 AM, Rob Audenaerde wrote:
> >
> >> Hi all,
> >>
> >> I'm looking for a way to use multi-values in a filter.
> >>
> >> I want to be able to search on sum(field)=100, where field has values
> in
> >> one document:
> >>
> >> field=60
> >> field=40
> >>
> >> In this case 'field' is a LongField. I examined the code in the
> >> FieldCache,
> >> but that seems to focus on single-valued fields only.
> >>
> >>
> >> Is this something that can be done in Lucene? And what would be a good
> >> approach?
> >>
> >> Thanks in advance,
> >>
> >> -Rob
> >>
> >>
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: Getting multi-values to use in filter?

2014-04-23 Thread Rob Audenaerde
Hi Mike,

Thanks for your reply.

I think it is not so much an invalid use case for Lucene. Lucene already
has (experimental) support for dynamic range facets and expressions
(JavaScript expressions, geospatial haversine, etc.). These are all
computed on the fly and work really well; they just depend on there being
only one (numeric) value per field per document.

-Rob


On Wed, Apr 23, 2014 at 4:11 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> This isn't really a good use case for an index like Lucene.  The most
> essential property of an index is that it lets you look up documents very
> quickly based on *precomputed* values.
>
> -Mike
>
>
>
> On 04/23/2014 06:56 AM, Rob Audenaerde wrote:
>
>> Hi all,
>>
>> I'm looking for a way to use multi-values in a filter.
>>
>> I want to be able to search on sum(field)=100, where a field has
>> multiple values in one document:
>>
>> field=60
>> field=40
>>
>> In this case 'field' is a LongField. I examined the code in the
>> FieldCache,
>> but that seems to focus on single-valued fields only.
>>
>>
>> Is this something that can be done in Lucene? And what would be a good
>> approach?
>>
>> Thanks in advance,
>>
>> -Rob
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Getting multi-values to use in filter?

2014-04-23 Thread Rob Audenaerde
Hi all,

I'm looking for a way to use multi-values in a filter.

I want to be able to search on sum(field)=100, where a field has multiple
values in one document:

field=60
field=40

In this case 'field' is a LongField. I examined the code in the FieldCache,
but that seems to focus on single-valued fields only.


Is this something that can be done in Lucene? And what would be a good
approach?
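(The collector approach suggested in the replies — compute the sum only for documents the query already matched — can be modeled without any Lucene API. A toy sketch; the class and method names are made up, and a real version would plug this logic into Lucene's Collector:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model (no Lucene classes) of the collector approach: the sum is
// computed only for documents the query already matched, so fewer
// documents pay the cost than with a standalone filter.
final class SumMatchCollector {

    static List<Integer> collect(int maxDoc, IntPredicate queryMatches,
                                 long[][] fieldValues, long target) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!queryMatches.test(doc)) {
                continue; // non-matches never have their sum computed
            }
            long sum = 0;
            for (long v : fieldValues[doc]) {
                sum += v;
            }
            if (sum == target) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long[][] values = { {60, 40}, {100}, {50} };
        // suppose the base query matched docs 0 and 2 only
        List<Integer> hits = collect(3, doc -> doc != 1, values, 100);
        System.out.println(hits); // [0]: doc 1 is skipped, doc 2 sums to 50
    }
}
```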

Thanks in advance,

-Rob


NRT facet issue (bug?), hard to reproduce, please advise

2014-04-11 Thread Rob Audenaerde
Hi all,

I have an issue using near-real-time search with the taxonomy. I could
really use some advice on how to debug/proceed with this issue.

The issue is as follows:

I index 100k documents, with about 40 fields each. For each field I also
add a FacetField (the issue arises both with FacetField and with
FloatAssociationFacetField). Each document has a unique number field
(client_no).

When just indexing and searching afterwards, all is fine.

When searching while indexing, sometimes the number of facets associated
with a document is too high, i.e. when collecting facets there is more
than one client_no on one document, which of course should not be the case.

Before each search, I call manager.maybeRefreshBlocking(), because I want
the most up-to-date results.

I have a taxonomy reader and an index reader combined in a ReferenceManager
(I created this before SearcherTaxonomyManager existed, but it behaves
exactly the same, with similar refcount logic).

During indexing I commit every 5000 documents (not needed for the NRT
search, but needed to prevent data loss should the application shut down).
I commit as follows:


public void commit() throws DocumentIndexException
{
    try
    {
        synchronized ( GlobalIndexCommitAndCloseLock.LOCK )
        {
            this.taxonomyWriter.commit();
            this.luceneIndexWriter.commit();
        }
    }
    catch ( final OutOfMemoryError | IOException e )
    {
        tryCloseWritersOnOOME( this.luceneIndexWriter, this.taxonomyWriter );
        throw new DocumentIndexException( e );
    }
}
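(For what it's worth, the ordering in the snippet above — taxonomy writer first, then the index writer — matters: the main index stores facet ordinals, so if the index writer were committed first and the process died before the taxonomy commit, the committed index could reference ordinals missing from the committed taxonomy. A toy model of that invariant, with made-up class names and no Lucene dependencies:)

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (no Lucene classes): the committed index stores ordinals
// that must resolve against the committed taxonomy, which is why the
// taxonomy has to reach disk first.
final class CommitOrderModel {

    final List<String> committedTaxonomy = new ArrayList<>(); // position = ordinal
    final List<Integer> committedIndex = new ArrayList<>();   // docs as ordinal refs

    boolean consistent() {
        for (int ord : committedIndex) {
            if (ord < 0 || ord >= committedTaxonomy.size()) {
                return false; // dangling ordinal: facet lookups would break
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CommitOrderModel taxoFirst = new CommitOrderModel();
        taxoFirst.committedTaxonomy.add("client_no/1"); // taxonomy commit...
        taxoFirst.committedIndex.add(0);                // ...then index commit
        System.out.println(taxoFirst.consistent());     // true

        CommitOrderModel indexFirst = new CommitOrderModel();
        indexFirst.committedIndex.add(0);               // index commit, then crash
        System.out.println(indexFirst.consistent());    // false
    }
}
```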

I use a standard IndexWriterConfig, and both the IndexWriter and the
TaxonomyWriter use a RAMDirectory().

My test case indexes the 100k documents while another thread continuously
calls manager.maybeRefreshBlocking(). This is enough to sometimes cause
the taxonomy to be incorrect.

The number of indexing threads does not seem to influence the issue; it
also appears when I have only one indexing thread.

I know it is an index problem, because when I write the index to a file
instead of RAM and reopen it in a clean application, I see the same
behaviour.


I could really use some advice on how to debug/proceed with this issue.
If more info is needed, just ask.

Thanks in advance,

-Rob