[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-08 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456048#comment-17456048
 ] 

Greg Miller commented on LUCENE-10281:
--

Yeah, +1 to not considering this a bug (but I'm a little biased I suppose since 
I wrote this). As you point out, the heuristic would be better if we knew how 
many of the hits actually had values in the SSDV field, but it's expensive to 
determine that up front. So the current heuristic (which is only a heuristic 
and could be flawed in a number of ways) assumes that all the hits have a value.
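
For reference, the decision being discussed can be sketched roughly as below. 
This is a minimal, self-contained illustration and not the actual Lucene 
source: the class name CountStorageHeuristic, the method useSparseCounts and 
the constant name are hypothetical. Only the 1024 cutoff and the 
totalHits < totalDocs / 10 test come from the issue description.

```java
// Hypothetical sketch of the dense-vs-sparse decision described in the issue.
// Names are illustrative and are not taken from the Lucene code base.
final class CountStorageHeuristic {

    // Below this cardinality a dense int[] is always used (value from the issue text).
    static final int DENSE_CARDINALITY_LIMIT = 1024;

    // Returns true when a sparse map should hold the per-ordinal counts.
    // Like the heuristic under discussion, this assumes every hit has a value
    // in the field, because counting the hits that really do is expensive.
    static boolean useSparseCounts(int cardinality, int totalHits, int totalDocs) {
        if (cardinality < DENSE_CARDINALITY_LIMIT) {
            return false; // small ordinal space: a dense array is cheap
        }
        // Sparse only when fewer than roughly 10% of the docs were hits.
        return totalHits < totalDocs / 10;
    }
}
```

With these assumptions, useSparseCounts(50_000, 10, 1_000_000) picks the sparse 
map, while useSparseCounts(50_000, 200_000, 1_000_000) and 
useSparseCounts(500, 10, 1_000_000) both pick the dense array.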

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* decides whether to use an 
> IntIntHashMap (the sparse representation) to store term ords and counts.
> But a doc counted in totalHits does not necessarily have an SSDV value, and 
> neither does a doc counted in totalDocs, so the right calculation should be 
> *(totalHits has SSDV) / (totalDocs has SSDV)*. *(totalDocs has SSDV)* is easy 
> to get via SortedSetDocValues#getValueCount(). *totalHits has SSDV* is hard to 
> get: we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (use denseCounts when 
> cardinality < 1024, use IntIntHashMap under the 10% threshold, and use 
> denseCounts in the remaining cases), then we could still use denseCounts 
> if cardinality < 1024 and otherwise start with IntIntHashMap. Once 10% of the 
> unique terms have been collected, switch to denseCounts.
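
The proposed solution in the description above could look roughly like the 
sketch below. It is only an illustration under stated assumptions, not a 
patch: the class AdaptiveOrdCounts and its methods are hypothetical, and 
java.util.HashMap stands in for IntIntHashMap so the example has no external 
dependencies. "10% of the unique terms collected" is read here as the number 
of distinct ords seen reaching cardinality / 10.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: start sparse for large cardinalities and migrate to a
// dense array once enough distinct ordinals have been counted.
final class AdaptiveOrdCounts {
    private final int cardinality;
    private int[] dense;                   // non-null while dense storage is in use
    private Map<Integer, Integer> sparse;  // non-null while sparse storage is in use

    AdaptiveOrdCounts(int cardinality) {
        this.cardinality = cardinality;
        if (cardinality < 1024) {
            dense = new int[cardinality];  // small ordinal space: always dense
        } else {
            sparse = new HashMap<>();      // stand-in for IntIntHashMap
        }
    }

    // Adds one to the count for the given ordinal.
    void increment(int ord) {
        if (dense != null) {
            dense[ord]++;
            return;
        }
        sparse.merge(ord, 1, Integer::sum);
        // Switch once the distinct ords seen reach 10% of the cardinality.
        if (sparse.size() >= cardinality / 10) {
            switchToDense();
        }
    }

    // Copies the counts gathered so far into a dense array.
    private void switchToDense() {
        int[] counts = new int[cardinality];
        for (Map.Entry<Integer, Integer> entry : sparse.entrySet()) {
            counts[entry.getKey()] = entry.getValue();
        }
        dense = counts;
        sparse = null;
    }

    int count(int ord) {
        return dense != null ? dense[ord] : sparse.getOrDefault(ord, 0);
    }

    boolean isDense() {
        return dense != null;
    }
}
```

The one-time copy in switchToDense is the cost of not knowing up front how 
many hits have a value. Whether it pays off in practice would need the kind of 
benchmark asked for elsewhere in this thread.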



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454083#comment-17454083
 ] 

Lu Xugang commented on LUCENE-10281:


Hi [~sokolov], I ran a test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed. Which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I am only 
showing the final comparison.

 !1.png! 
 

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.png, 面试问题.md
>
>  Time Spent: 10m
>  Remaining Estimate: 0h






[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-03 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453134#comment-17453134
 ] 

Michael Sokolov commented on LUCENE-10281:
--

I don't consider this to be a bug since it only affects a heuristic used to 
improve performance. Have you done any performance measurements with this 
change, [~ChrisLu] ?

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h


