Re: Solr facets implementation question
On Tue, 2015-09-22 at 11:56 -0700, Erick Erickson wrote: > FWIW, there is work being done for "high cardinality faceting" with > some of the recent Streaming Aggregation code. The JIRA I can find that seems relevant is https://issues.apache.org/jira/browse/SOLR-7903 I can see streaming faceting work well if one wants to export the full result set. For top-X requests it seems that there is a lot of overhead resolving terms that will not be used in the final result. But my understanding of Solr streams is very shaky. - Toke Eskildsen, State and University Library, Denmark
Re: Solr facets implementation question
FWIW, there is work being done for "high cardinality faceting" with some of the recent Streaming Aggregation code. So it's at least on the way if not already there. Erick On Tue, Sep 22, 2015 at 11:44 AM, Toke Eskildsen wrote: > adfel70 wrote: >> Hi Toke, Thank you for the detailed explanation, thats exactly what I was >> looking for, except this algorithm fit single index only. could you please >> elaborate what adjustments are needed for distributed index? > > Vanilla Solr requests top-X terms from each shard, with over-provisioning. I > do not remember the exact formula (and I think it is adjustable in Solr 5), > but something like X*1.5+10? Yes, that means that correctness is not > guaranteed for distributed faceting. It would be possible to make some sort > of streaming faceting implementation, but the pathological case is that all > shards must deliver all terms to derive the correct top-X. > > The results from the shards are merged and the top-X terms are fine-counted > where needed: If we have 3 shards and asked for top-1, they might answer > shard1: [foo(3), zoo(1)] > shard2: [foo(1), zoo(1)] > shard3: [bar(2),aar(2)] > (remember the over-provisioning). We derive that foo is the top-1 term, but > since shard 3 did not provide a count for foo, we need to ask shard3 for the > count for that specific term to get the correct overall count. > > The fine-counting is performed differently from standard faceting. It is > basically 'original_query AND facet_field:fine_count_term'. Quite fast for a > few terms, but if there is a need for resolving tens or hundreds of terms for > a non-trivial index, the fine-counting phase can take longer than the initial > faceting phase. > > - Toke Eskildsen > (sorry for the delayed answer - my email reader hid your response)
Re: Solr facets implementation question
adfel70 wrote: > Hi Toke, Thank you for the detailed explanation, thats exactly what I was > looking for, except this algorithm fit single index only. could you please > elaborate what adjustments are needed for distributed index? Vanilla Solr requests top-X terms from each shard, with over-provisioning. I do not remember the exact formula (and I think it is adjustable in Solr 5), but something like X*1.5+10? Yes, that means that correctness is not guaranteed for distributed faceting. It would be possible to make some sort of streaming faceting implementation, but the pathological case is that all shards must deliver all terms to derive the correct top-X. The results from the shards are merged and the top-X terms are fine-counted where needed: If we have 3 shards and asked for top-1, they might answer shard1: [foo(3), zoo(1)] shard2: [foo(1), zoo(1)] shard3: [bar(2),aar(2)] (remember the over-provisioning). We derive that foo is the top-1 term, but since shard 3 did not provide a count for foo, we need to ask shard3 for the count for that specific term to get the correct overall count. The fine-counting is performed differently from standard faceting. It is basically 'original_query AND facet_field:fine_count_term'. Quite fast for a few terms, but if there is a need for resolving tens or hundreds of terms for a non-trivial index, the fine-counting phase can take longer than the initial faceting phase. - Toke Eskildsen (sorry for the delayed answer - my email reader hid your response)
Re: Solr facets implementation question
Toke Eskildsen wrote > adfel70 < > adfel70@ > > wrote: >> I am trying to understand why faceting on a field with lots of unique >> values >> has a great impact on query performance. > > Faceting in Solr is performed in different ways. String faceting different > from Numerics faceting, DocValued fields different from non-DocValued, fc > different from enum. Let's look at String faceting with facet.method=fc > and DocValues. > > Strings (aka Terms) are represented in the faceting code with an ordinal, > which is really just a number. The first term has number 0, the next > number 1 and so forth. When doing a faceting call with the above premise, > what happens is > > 1) An counter of int[unique_values] is allocated. > This is fairly fast, but still with a noticeable impact when the number of > unique value creeps into the millions. On our machine it takes several > hundred milliseconds for 100M values. Also relevant is the overall strain > it puts on the garbage collector. > > 2) For each hit in the result set, the corresponding ordinals are resolved > and counter[ordinal]++ is triggered. > This scales with the result set. Small sets are very fast, quite > independent of the size of the counter-structure. Large result sets are > (naturally) equally slow. > > 3) The counter-structure is iterated and top-X are determined. > This scales with the size of the counter-structure, (nearly) independent > of the result set size. > > 4) The Terms for the top-X ordinals are resolved from the index. > This scales with X. > > > Some of these parts has some non-intuitive penalties: Even very tiny > result sets has aa constant overhead from allocation and iteration. Asking > for top-1M hits means that the underlying priority queue will probably no > longer fit in the CPU cache and will slow things down. Resolving Terms > from ordinals relies of fast IO and a large number of unique Terms might > mean that the disk cache is not large enough. > > > Blatant plug: I have spend a fair amount of time trying to make some of > this faster http://tokee.github.io/lucene-solr/ > > - Toke Eskildsen Hi Toke, Thank you for the detailed explanation, thats exactly what I was looking for, except this algorithm fit single index only. could you please elaborate what adjustments are needed for distributed index? The naive solution would count all terms on each shard, but the initial shard (the one that executed the request) must have ALL results for correct aggregation (its easy to find example which shows that top K results from every shard is not good enough). Is that correct? I tried to verify this behaviour, but I didnt see that the process who got the request from the user used more memory than the other shards. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-facets-implementation-question-tp4227604p4229741.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr facets implementation question
adfel70 wrote: > I am trying to understand why faceting on a field with lots of unique values > has a great impact on query performance. Faceting in Solr is performed in different ways. String faceting different from Numerics faceting, DocValued fields different from non-DocValued, fc different from enum. Let's look at String faceting with facet.method=fc and DocValues. Strings (aka Terms) are represented in the faceting code with an ordinal, which is really just a number. The first term has number 0, the next number 1 and so forth. When doing a faceting call with the above premise, what happens is 1) An counter of int[unique_values] is allocated. This is fairly fast, but still with a noticeable impact when the number of unique value creeps into the millions. On our machine it takes several hundred milliseconds for 100M values. Also relevant is the overall strain it puts on the garbage collector. 2) For each hit in the result set, the corresponding ordinals are resolved and counter[ordinal]++ is triggered. This scales with the result set. Small sets are very fast, quite independent of the size of the counter-structure. Large result sets are (naturally) equally slow. 3) The counter-structure is iterated and top-X are determined. This scales with the size of the counter-structure, (nearly) independent of the result set size. 4) The Terms for the top-X ordinals are resolved from the index. This scales with X. Some of these parts has some non-intuitive penalties: Even very tiny result sets has aa constant overhead from allocation and iteration. Asking for top-1M hits means that the underlying priority queue will probably no longer fit in the CPU cache and will slow things down. Resolving Terms from ordinals relies of fast IO and a large number of unique Terms might mean that the disk cache is not large enough. Blatant plug: I have spend a fair amount of time trying to make some of this faster http://tokee.github.io/lucene-solr/ - Toke Eskildsen
Re: Solr facets implementation question
Every faceting implementation I’ve seen (not just Solr/Lucene) makes big in-memory lists. Lots of values means a bigger list. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Sep 8, 2015, at 8:33 AM, Shawn Heisey wrote: > On 9/8/2015 9:10 AM, adfel70 wrote: >> I am trying to understand why faceting on a field with lots of unique values >> has a great impact on query performance. Since Googling for Solr facet >> algorithm did not yield anything, I looked how facets are implemented in >> Lucene. I found out that there are 2 methods - taxonomy-based and >> SortedSetDocValues-based. Does Solr facet capabilities are based on one of >> those methods? if so, I still cant understand why unique values impacts >> query performance... > > Lucene's facet implementation is completely separate (and different) > from Solr's implementation. I am not familiar with the inner workings > of either implementation. Solr implemented faceting long before Lucene > did. I think *Solr* actually contains at least two different facet > implementations, used for different kinds of facets. > > Faceting on a field with many unique values uses a HUGE amount of heap > memory, which is likely why query performance is impacted. > > I have a dev system with all my indexes (each of which has dedicated > hardware for production) on it. Normally it requires 15GB of heap to > operate properly. Every now and then, I get asked to do a duplicate > check on a field that *should* be unique, on an index with 250 million > docs in it. The query that I am asked to do for the facet matches about > 100 million docs. This facet query, on a field that DOES have > docValues, will throw OOM if my heap is less than 27GB. The dev machine > only has 32GB of RAM, so as you might imagine, performance is really > terrible when I do this query. Thankfully it's a dev machine. When I > was doing these queries, it was running 4.9.1. I have since upgraded it > to 5.2.1, as a proof of concept for upgrading our production indexes ... > but I have not attempted the facet query since the upgrade. > > Thanks, > Shawn >
Re: Solr facets implementation question
On 9/8/2015 9:10 AM, adfel70 wrote: > I am trying to understand why faceting on a field with lots of unique values > has a great impact on query performance. Since Googling for Solr facet > algorithm did not yield anything, I looked how facets are implemented in > Lucene. I found out that there are 2 methods - taxonomy-based and > SortedSetDocValues-based. Does Solr facet capabilities are based on one of > those methods? if so, I still cant understand why unique values impacts > query performance... Lucene's facet implementation is completely separate (and different) from Solr's implementation. I am not familiar with the inner workings of either implementation. Solr implemented faceting long before Lucene did. I think *Solr* actually contains at least two different facet implementations, used for different kinds of facets. Faceting on a field with many unique values uses a HUGE amount of heap memory, which is likely why query performance is impacted. I have a dev system with all my indexes (each of which has dedicated hardware for production) on it. Normally it requires 15GB of heap to operate properly. Every now and then, I get asked to do a duplicate check on a field that *should* be unique, on an index with 250 million docs in it. The query that I am asked to do for the facet matches about 100 million docs. This facet query, on a field that DOES have docValues, will throw OOM if my heap is less than 27GB. The dev machine only has 32GB of RAM, so as you might imagine, performance is really terrible when I do this query. Thankfully it's a dev machine. When I was doing these queries, it was running 4.9.1. I have since upgraded it to 5.2.1, as a proof of concept for upgrading our production indexes ... but I have not attempted the facet query since the upgrade. Thanks, Shawn
Solr facets implementation question
I am trying to understand why faceting on a field with lots of unique values has a great impact on query performance. Since Googling for Solr facet algorithm did not yield anything, I looked how facets are implemented in Lucene. I found out that there are 2 methods - taxonomy-based and SortedSetDocValues-based. Does Solr facet capabilities are based on one of those methods? if so, I still cant understand why unique values impacts query performance... -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-facets-implementation-question-tp4227604.html Sent from the Solr - User mailing list archive at Nabble.com.