Re: Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-09 Thread Adrien Grand
FWIW a related PR was just merged that makes it possible to introspect query
execution: https://issues.apache.org/jira/browse/LUCENE-9965. It's
different from your use-case though, in that it provides debugging information for
a single query rather than statistical information across lots of user
queries (and the approach on that other issue makes things much slower, so
you wouldn't want to enable it in production).

Out of curiosity, what are you doing with this information about which
execution path is chosen?

On Wed, Jun 9, 2021 at 2:14 PM Egor Moraru wrote:

> Hi,
>
> At my current project we wanted to monitor for a specific field the
> fraction of indexed vs doc values queries executed by
> IndexOrDocValuesQuery.
>
> We ended up forking IndexOrDocValuesQuery and passing a listener that
> is notified when the query execution path is decided.
>
> Do you think this is something the community might be interested in?
>
> Kind regards,
> Egor Moraru.
>


-- 
Adrien


Re: Potential bug

2021-06-09 Thread baris . kazar
Yes, I did those, and I believe I am at the best level of performance now.
It is not bad at all, but I want to make it much better.


I see a roughly linear drop in timings as I lower the number of words, but
let me redo that quick study.


Fuzzy search is always expensive, but it seems to suit my needs best.


Thanks, Diego, for these great questions. I had already explored them, but
thanks again.


Best regards


On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

> I have never used fuzzy search, but from the documentation it seems very
> expensive, and if you do it on 10 terms and 1M documents it seems very very
> very expensive.
>
> Are you using the default 'fuzziness' parameter (0.5)? It might end up
> exploring a lot of documents. Did you try to play with that parameter?
>
> Have you tried to see how the performance changes if you do not use fuzzy (just
> to see if it is fuzzy that introduces the slowdown)?
> Or what happens to performance if you do fuzzy with 1, 2, or 5 terms instead of 10?



Re: Potential bug

2021-06-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
I have never used fuzzy search, but from the documentation it seems very
expensive, and if you do it on 10 terms and 1M documents it seems very very
very expensive.

Are you using the default 'fuzziness' parameter (0.5)? It might end up
exploring a lot of documents. Did you try to play with that parameter?

Have you tried to see how the performance changes if you do not use fuzzy (just
to see if it is fuzzy that introduces the slowdown)?
Or what happens to performance if you do fuzzy with 1, 2, or 5 terms instead of 10?
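
To illustrate why fuzzy matching over many terms gets costly, here is a minimal
sketch in plain Java. It is not Lucene code: the names (FuzzyExpansionSketch,
editDistance, expand) and the tiny dictionary are made up for illustration, and
it brute-forces every comparison, whereas Lucene, as far as I know, uses automata
to find candidate terms. The only point is that one fuzzy term can expand into
several index terms, so a 10-term fuzzy query may touch far more postings than
an exact one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FuzzyExpansionSketch {

    /** Classic Levenshtein distance (insert, delete, substitute), two-row DP. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the dictionary terms within maxEdits of queryTerm (brute force). */
    static List<String> expand(String queryTerm, List<String> dictionary, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary) {
            if (editDistance(queryTerm, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical term dictionary; a real index would hold far more terms.
        List<String> dictionary =
            Arrays.asList("search", "serch", "seach", "research", "query");
        System.out.println(expand("search", dictionary, 2));
    }
}
```

Timing the same query with 1, 2, 5, and 10 terms, as suggested above, should show
how much of the cost comes from this kind of expansion.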


From: java-user@lucene.apache.org At: 06/09/21 18:56:31 To: java-user@lucene.apache.org, baris.ka...@oracle.com
Subject: Re: Potential bug

I can't reveal those details, I am very sorry, but it is more than 1 million.

Let me say that I have a lot of code that processes results from Lucene,
but the bottleneck is the Lucene fuzzy search.

Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Potential bug

2021-06-09 Thread baris . kazar

I can't reveal those details, I am very sorry, but it is more than 1 million.

Let me say that I have a lot of code that processes results from Lucene,
but the bottleneck is the Lucene fuzzy search.


Best regards


On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

> How many documents do you have in the index?
> and can you show an example of query?





Re: Potential bug

2021-06-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
How many documents do you have in the index?
And can you show an example query?


From: java-user@lucene.apache.org At: 06/09/21 18:33:25 To: java-user@lucene.apache.org, baris.ka...@oracle.com
Subject: Re: Potential bug

I have only two fields: one is a string and the other is a number (stored as a
string). I guess you can't go simpler than this.

I retrieve the hits, and my major bottleneck is the Lucene fuzzy search.


I take each word from the string, which is usually at most around 10 words,
and I build a fuzzy boolean query out of them.


A simple query is like this 10-word query.


Limit means I want to stop the Lucene search at around 20 hits; I don't want
thousands of hits.


Best regards






Re: Potential bug

2021-06-09 Thread baris . kazar
I have only two fields: one is a string and the other is a number (stored as a
string). I guess you can't go simpler than this.


I retrieve the hits, and my major bottleneck is the Lucene fuzzy search.


I take each word from the string, which is usually at most around 10 words,
and I build a fuzzy boolean query out of them.


A simple query is like this 10-word query.


Limit means I want to stop the Lucene search at around 20 hits; I don't want
thousands of hits.



Best regards






On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:


> Hi Baris,
>
>> what if the user needs to limit the search process?
>
> What do you mean by 'limit'?
>
>> there should be a way to speedup lucene then if this is not possible,
>> since for some simple queries it takes half a second which is too long.
>
> What do you mean by a 'simple' query? There might be multiple reasons behind
> the slowness of a query that are unrelated to the search (for example, if you
> retrieve many documents and for each document you are extracting the content of
> many fields). Would you like to tell us a bit more about your use case?
>
> Regards,
> Diego




Re: Potential bug

2021-06-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi Baris, 

> what if the user needs to limit the search process?

What do you mean by 'limit'? 

> there should be a way to speedup lucene then if this is not possible,
> since for some simple queries it takes half a second which is too long.

What do you mean by a 'simple' query? There might be multiple reasons behind
the slowness of a query that are unrelated to the search (for example, if you
retrieve many documents and for each document you are extracting the content of
many fields). Would you like to tell us a bit more about your use case?

Regards,
Diego

From: java-user@lucene.apache.org At: 06/09/21 18:18:01 To: java-user@lucene.apache.org
Cc: baris.ka...@oracle.com
Subject: Re: Potential bug

Thanks Adrien, but the numbers are too far apart.

I think the algorithm needs to be revised.


What if the user needs to limit the search process?

That leaves no control.

There should be a way to speed up Lucene if this is not possible,

since for some simple queries it takes half a second, which is too long.

Best regards






Re: Potential bug

2021-06-09 Thread baris . kazar

Thanks Adrien, but the numbers are too far apart.

I think the algorithm needs to be revised.


What if the user needs to limit the search process?

That leaves no control.

There should be a way to speed up Lucene if this is not possible,

since for some simple queries it takes half a second, which is too long.

Best regards


On 6/9/21 1:13 PM, Adrien Grand wrote:

> Hi Baris,
>
> totalHitsThreshold is actually a minimum threshold, not a maximum threshold.
>
> The problem is that Lucene cannot directly identify the top matching
> documents for a given query. The strategy it adopts is to start collecting
> hits naively in doc ID order and to progressively raise the bar on the
> minimum score that is required for a hit to be competitive in order to skip
> non-competitive documents. So it's expected that Lucene still collects 100s
> or 1000s of hits, even though the collector is configured to only compute
> the top 10 hits.




Re: Potential bug

2021-06-09 Thread Adrien Grand
Hi Baris,

totalHitsThreshold is actually a minimum threshold, not a maximum threshold.

The problem is that Lucene cannot directly identify the top matching
documents for a given query. The strategy it adopts is to start collecting
hits naively in doc ID order and to progressively raise the bar on the
minimum score that is required for a hit to be competitive in order to skip
non-competitive documents. So it's expected that Lucene still collects 100s
or 1000s of hits, even though the collector is configured to only compute
the top 10 hits.
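
The strategy described above can be modelled with a toy collector in plain Java.
This is a simplified sketch, not Lucene's TopScoreDocCollector: the names
(TopKThresholdSketch, collect, Result) are hypothetical, and the scores are just
an array in doc ID order. It shows why the reported hit count can be far larger
than numHits: every hit is counted until the threshold has been passed, and only
then are non-competitive hits skipped, at which point the count becomes a lower
bound.

```java
import java.util.PriorityQueue;

final class TopKThresholdSketch {

    /** Outcome of one collection pass. */
    static final class Result {
        final int hitsCounted;   // hits that were visited and counted
        final boolean exact;     // false once at least one hit was skipped
        final float[] topScores; // best scores, highest first

        Result(int hitsCounted, boolean exact, float[] topScores) {
            this.hitsCounted = hitsCounted;
            this.exact = exact;
            this.topScores = topScores;
        }
    }

    static Result collect(float[] scoresInDocIdOrder, int numHits, int totalHitsThreshold) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap of the current top scores
        int hitsCounted = 0;
        boolean exact = true;
        for (float score : scoresInDocIdOrder) {
            boolean heapFull = heap.size() == numHits;
            // Skipping is only allowed after more than totalHitsThreshold hits were counted.
            boolean maySkip = heapFull && hitsCounted > totalHitsThreshold;
            if (maySkip && score <= heap.peek()) {
                exact = false; // non-competitive hit: skipped, so the count is now a lower bound
                continue;
            }
            hitsCounted++;
            if (!heapFull) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll(); // evict the weakest of the current top hits
                heap.add(score);
            }
        }
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll(); // the heap yields ascending order, so fill from the back
        }
        return new Result(hitsCounted, exact, top);
    }

    public static void main(String[] args) {
        float[] scores = {1f, 2f, 3f, 4f, 5f, 0.5f, 0.4f, 6f, 0.3f};
        Result r = collect(scores, 2, 3);
        System.out.println(r.hitsCounted + (r.exact ? " hits (exact)" : "+ hits (lower bound)"));
    }
}
```

With numHits = 2 and totalHitsThreshold = 3, the nine scores in main give six
counted hits and a lower-bound count; with a threshold of Integer.MAX_VALUE all
nine are counted exactly. In both cases only two results are returned.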

On Wed, Jun 9, 2021 at 7:07 PM  wrote:

> Hi,-
>
>   i think this is a potential bug
>
>
> i set this time totalHitsThreshold to 10 and i get totalhits reported as
> 1655 but i get 10 results in total.
>
> I think this suggests that there might be a bug with
> TopScoreDocCollector algorithm.
>
>
> Best regards

-- 
Adrien


Potential bug

2021-06-09 Thread baris . kazar

Hi,

I think this is a potential bug.


This time I set totalHitsThreshold to 10, and totalHits is reported as
1655, but I get 10 results in total.


I think this suggests that there might be a bug in the
TopScoreDocCollector algorithm.



Best regards






Re: TopScoreDocCollector class usage

2021-06-09 Thread baris . kazar

OK, I found it:

300 times the number of words in the search string, but this needs to be
precisely documented in the Javadocs.


I don't want to rely on trial and error, and I guess nobody else wants that
either.



Best regards






TopScoreDocCollector class usage

2021-06-09 Thread baris . kazar

Hi,

I now use this class before calling the IndexSearcher.search API (with the
collector as the 2nd argument). (Please see the "an interesting case" thread
before this question.)



But this time I see very weird behavior:


I used to get 4000+ hits with the default TopScoreDocCollector.create(int
numHits, ScoreDoc after, int totalHitsThreshold) as used internally by the
IndexSearcher.search API (where the value is 1000), and I set after to null
here.



Now, when I set both totalHitsThreshold and numHits in
TopScoreDocCollector.create to 300, I get 12200+ hits from the totalHits
object.


Something is not right here, right?

How can it jump to 3 times as many when I set totalHitsThreshold to ~1/3 of
the default value of totalHitsThreshold and numHits?



Best regards



ps.

NOTE: The search(org.apache.lucene.search.Query, int) and 
searchAfter(org.apache.lucene.search.ScoreDoc, 
org.apache.lucene.search.Query, int) methods are configured to only 
count top hits accurately up to 1,000 and may return a lower bound of 
the hit count if the hit count is greater than or equal to 1,000. On 
queries that match lots of documents, counting the number of hits may 
take much longer than computing the top hits so this trade-off allows to 
get some minimal information about the hit count without slowing down 
search too much. The TopDocs.scoreDocs array is always accurate however. 
If this behavior doesn't suit your needs, you should create collectors 
manually with either TopScoreDocCollector.create(int, int) or 
TopFieldCollector.create(org.apache.lucene.search.Sort, int, int) and 
call search(Query, Collector).



at


https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/search/IndexSearcher.html#searchAfter-org.apache.lucene.search.ScoreDoc-org.apache.lucene.search.Query-int-





Monitoring decisions taken by IndexOrDocValuesQuery

2021-06-09 Thread Egor Moraru
Hi,

On my current project we wanted to monitor, for a specific field, the
fraction of index vs. doc-values queries executed by IndexOrDocValuesQuery.

We ended up forking IndexOrDocValuesQuery and passing a listener that
is notified when the query execution path is decided.

Do you think this is something the community might be interested in?

Kind regards,
Egor Moraru.
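
A listener hook of the kind described above could look roughly like this. It is
a standalone sketch in plain Java, not the forked IndexOrDocValuesQuery: all
names (ExecutionPathMonitorSketch, PathListener, CountingListener, choosePath)
are hypothetical, and the cost rule inside choosePath is a placeholder
assumption rather than Lucene's actual heuristic.

```java
import java.util.concurrent.atomic.LongAdder;

final class ExecutionPathMonitorSketch {

    enum Path { INDEX, DOC_VALUES }

    /** Callback invoked once the execution path for a query has been decided. */
    interface PathListener {
        void onPathChosen(String field, Path path);
    }

    /** Thread-safe listener that tracks how often each path was taken. */
    static final class CountingListener implements PathListener {
        private final LongAdder index = new LongAdder();
        private final LongAdder docValues = new LongAdder();

        @Override
        public void onPathChosen(String field, Path path) {
            if (path == Path.INDEX) {
                index.increment();
            } else {
                docValues.increment();
            }
        }

        /** Fraction of decisions that used the index; 0.0 if nothing was recorded. */
        double indexFraction() {
            long i = index.sum();
            long d = docValues.sum();
            return (i + d) == 0 ? 0.0 : (double) i / (i + d);
        }
    }

    /**
     * Stand-in for the point where the query picks its execution path.
     * Placeholder rule (an assumption): use the index unless its cost is
     * more than 8x the cost of the lead iterator.
     */
    static Path choosePath(String field, long leadCost, long indexQueryCost, PathListener listener) {
        long threshold = indexQueryCost >>> 3; // indexQueryCost / 8
        Path path = threshold <= leadCost ? Path.INDEX : Path.DOC_VALUES;
        listener.onPathChosen(field, path);
        return path;
    }

    public static void main(String[] args) {
        CountingListener listener = new CountingListener();
        choosePath("price", 1000L, 800L, listener);
        choosePath("price", 10L, 8000L, listener);
        System.out.println("index fraction: " + listener.indexFraction());
    }
}
```

Keeping the listener to a single cheap callback means the counters can stay
enabled in production, unlike per-query debugging output.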