Re: Solr facets implementation question

2015-09-23 Thread Toke Eskildsen
On Tue, 2015-09-22 at 11:56 -0700, Erick Erickson wrote:
> FWIW, there is work being done for "high cardinality faceting" with
> some of the recent Streaming Aggregation code.

The JIRA I can find that seems relevant is
https://issues.apache.org/jira/browse/SOLR-7903

I can see streaming faceting work well if one wants to export the full
result set. For top-X requests it seems that there is a lot of overhead
resolving terms that will not be used in the final result. But my
understanding of Solr streams is very shaky.

- Toke Eskildsen, State and University Library, Denmark




Re: Solr facets implementation question

2015-09-22 Thread Erick Erickson
FWIW, there is work being done for "high cardinality faceting" with
some of the recent Streaming Aggregation code.

So it's at least on the way if not already there.

Erick

On Tue, Sep 22, 2015 at 11:44 AM, Toke Eskildsen  
wrote:
> adfel70  wrote:
>> Hi Toke, Thank you for the detailed explanation, thats exactly what I was
>> looking for, except this algorithm fit single index only. could you please
>> elaborate what adjustments are needed for distributed index?
>
> Vanilla Solr requests top-X terms from each shard, with over-provisioning. I 
> do not remember the exact formula (and I think it is adjustable in Solr 5), 
> but something like X*1.5+10? Yes, that means that correctness is not 
> guaranteed for distributed faceting. It would be possible to make some sort 
> of streaming faceting implementation, but the pathological case is that all 
> shards must deliver all terms to derive the correct top-X.
>
> The results from the shards are merged and the top-X terms are fine-counted 
> where needed: If we have 3 shards and asked for top-1, they might answer
> shard1: [foo(3), zoo(1)]
> shard2: [foo(1), zoo(1)]
> shard3: [bar(2),aar(2)]
> (remember the over-provisioning). We derive that foo is the top-1 term, but 
> since shard 3 did not provide a count for foo, we need to ask shard3 for the 
> count for that specific term to get the correct overall count.
>
> The fine-counting is performed differently from standard faceting. It is 
> basically 'original_query AND facet_field:fine_count_term'. Quite fast for a 
> few terms, but if there is a need for resolving tens or hundreds of terms for 
> a non-trivial index, the fine-counting phase can take longer than the initial 
> faceting phase.
>
> - Toke Eskildsen
> (sorry for the delayed answer - my email reader hid your response)


Re: Solr facets implementation question

2015-09-22 Thread Toke Eskildsen
adfel70  wrote:
> Hi Toke, Thank you for the detailed explanation, thats exactly what I was
> looking for, except this algorithm fit single index only. could you please
> elaborate what adjustments are needed for distributed index?

Vanilla Solr requests top-X terms from each shard, with over-provisioning. I do 
not remember the exact formula (and I think it is adjustable in Solr 5), but 
something like X*1.5+10? Yes, that means that correctness is not guaranteed for 
distributed faceting. It would be possible to make some sort of streaming 
faceting implementation, but the pathological case is that all shards must 
deliver all terms to derive the correct top-X.

The results from the shards are merged and the top-X terms are fine-counted 
where needed: If we have 3 shards and asked for top-1, they might answer 
shard1: [foo(3), zoo(1)]
shard2: [foo(1), zoo(1)]
shard3: [bar(2),aar(2)]
(remember the over-provisioning). We derive that foo is the top-1 term, but 
since shard 3 did not provide a count for foo, we need to ask shard3 for the 
count for that specific term to get the correct overall count. 

The fine-counting is performed differently from standard faceting. It is 
basically 'original_query AND facet_field:fine_count_term'. Quite fast for a 
few terms, but if there is a need for resolving tens or hundreds of terms for a 
non-trivial index, the fine-counting phase can take longer than the initial 
faceting phase.

- Toke Eskildsen
(sorry for the delayed answer - my email reader hid your response)


Re: Solr facets implementation question

2015-09-17 Thread adfel70
Toke Eskildsen wrote
> adfel70 <

> adfel70@

> > wrote:
>> I am trying to understand why faceting on a field with lots of unique
>> values
>> has a great impact on query performance.
> 
> Faceting in Solr is performed in different ways. String faceting different
> from Numerics faceting, DocValued fields different from non-DocValued, fc
> different from enum. Let's look at String faceting with facet.method=fc
> and DocValues.
> 
> Strings (aka Terms) are represented in the faceting code with an ordinal,
> which is really just a number. The first term has number 0, the next
> number 1 and so forth. When doing a faceting call with the above premise,
> what happens is
> 
> 1) An counter of int[unique_values] is allocated.
> This is fairly fast, but still with a noticeable impact when the number of
> unique value creeps into the millions. On our machine it takes several
> hundred milliseconds for 100M values. Also relevant is the overall strain
> it puts on the garbage collector.
> 
> 2) For each hit in the result set, the corresponding ordinals are resolved
> and counter[ordinal]++ is triggered.
> This scales with the result set. Small sets are very fast, quite
> independent of the size of the counter-structure. Large result sets are
> (naturally) equally slow.
> 
> 3) The counter-structure is iterated and top-X are determined.
> This scales with the size of the counter-structure, (nearly) independent
> of the result set size.
> 
> 4) The Terms for the top-X ordinals are resolved from the index.
> This scales with X.
> 
> 
> Some of these parts has some non-intuitive penalties: Even very tiny
> result sets has aa constant overhead from allocation and iteration. Asking
> for top-1M hits means that the underlying priority queue will probably no
> longer fit in the CPU cache and will slow things down. Resolving Terms
> from ordinals relies of fast IO and a large number of unique Terms might
> mean that the disk cache is not large enough.
> 
> 
> Blatant plug: I have spend a fair amount of time trying to make some of
> this faster http://tokee.github.io/lucene-solr/
> 
> - Toke Eskildsen

Hi Toke, Thank you for the detailed explanation, thats exactly what I was
looking for, except this algorithm fit single index only. could you please
elaborate what adjustments are needed for distributed index?
The naive solution would count all terms on each shard, but the initial
shard (the one that executed the request) must have ALL results for correct
aggregation (its easy to find example which shows that top K results from
every shard is not good enough). 
Is that correct? I tried to verify this behaviour, but I didnt see that the
process who got the request from the user used more memory than the other
shards.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-facets-implementation-question-tp4227604p4229741.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr facets implementation question

2015-09-08 Thread Toke Eskildsen
adfel70  wrote:
> I am trying to understand why faceting on a field with lots of unique values
> has a great impact on query performance.

Faceting in Solr is performed in different ways. String faceting different from 
Numerics faceting, DocValued fields different from non-DocValued, fc different 
from enum. Let's look at String faceting with facet.method=fc and DocValues.

Strings (aka Terms) are represented in the faceting code with an ordinal, which 
is really just a number. The first term has number 0, the next number 1 and so 
forth. When doing a faceting call with the above premise, what happens is

1) An counter of int[unique_values] is allocated.
This is fairly fast, but still with a noticeable impact when the number of 
unique value creeps into the millions. On our machine it takes several hundred 
milliseconds for 100M values. Also relevant is the overall strain it puts on 
the garbage collector.

2) For each hit in the result set, the corresponding ordinals are resolved and 
counter[ordinal]++ is triggered.
This scales with the result set. Small sets are very fast, quite independent of 
the size of the counter-structure. Large result sets are (naturally) equally 
slow.

3) The counter-structure is iterated and top-X are determined.
This scales with the size of the counter-structure, (nearly) independent of the 
result set size.

4) The Terms for the top-X ordinals are resolved from the index.
This scales with X.


Some of these parts has some non-intuitive penalties: Even very tiny result 
sets has aa constant overhead from allocation and iteration. Asking for top-1M 
hits means that the underlying priority queue will probably no longer fit in 
the CPU cache and will slow things down. Resolving Terms from ordinals relies 
of fast IO and a large number of unique Terms might mean that the disk cache is 
not large enough.


Blatant plug: I have spend a fair amount of time trying to make some of this 
faster http://tokee.github.io/lucene-solr/

- Toke Eskildsen


Re: Solr facets implementation question

2015-09-08 Thread Walter Underwood
Every faceting implementation I’ve seen (not just Solr/Lucene) makes big 
in-memory lists. Lots of values means a bigger list.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Sep 8, 2015, at 8:33 AM, Shawn Heisey  wrote:

> On 9/8/2015 9:10 AM, adfel70 wrote:
>> I am trying to understand why faceting on a field with lots of unique values
>> has a great impact on query performance. Since Googling for Solr facet
>> algorithm did not yield anything, I looked how facets are implemented in
>> Lucene. I found out that there are 2 methods - taxonomy-based and
>> SortedSetDocValues-based. Does Solr facet capabilities are based on one of
>> those methods? if so, I still cant understand why unique values impacts
>> query performance...
> 
> Lucene's facet implementation is completely separate (and different)
> from Solr's implementation.  I am not familiar with the inner workings
> of either implementation.  Solr implemented faceting long before Lucene
> did.  I think *Solr* actually contains at least two different facet
> implementations, used for different kinds of facets.
> 
> Faceting on a field with many unique values uses a HUGE amount of heap
> memory, which is likely why query performance is impacted.
> 
> I have a dev system with all my indexes (each of which has dedicated
> hardware for production) on it.  Normally it requires 15GB of heap to
> operate properly.  Every now and then, I get asked to do a duplicate
> check on a field that *should* be unique, on an index with 250 million
> docs in it.  The query that I am asked to do for the facet matches about
> 100 million docs.  This facet query, on a field that DOES have
> docValues, will throw OOM if my heap is less than 27GB.  The dev machine
> only has 32GB of RAM, so as you might imagine, performance is really
> terrible when I do this query.  Thankfully it's a dev machine.  When I
> was doing these queries, it was running 4.9.1.  I have since upgraded it
> to 5.2.1, as a proof of concept for upgrading our production indexes ...
> but I have not attempted the facet query since the upgrade.
> 
> Thanks,
> Shawn
> 



Re: Solr facets implementation question

2015-09-08 Thread Shawn Heisey
On 9/8/2015 9:10 AM, adfel70 wrote:
> I am trying to understand why faceting on a field with lots of unique values
> has a great impact on query performance. Since Googling for Solr facet
> algorithm did not yield anything, I looked how facets are implemented in
> Lucene. I found out that there are 2 methods - taxonomy-based and
> SortedSetDocValues-based. Does Solr facet capabilities are based on one of
> those methods? if so, I still cant understand why unique values impacts
> query performance...

Lucene's facet implementation is completely separate (and different)
from Solr's implementation.  I am not familiar with the inner workings
of either implementation.  Solr implemented faceting long before Lucene
did.  I think *Solr* actually contains at least two different facet
implementations, used for different kinds of facets.

Faceting on a field with many unique values uses a HUGE amount of heap
memory, which is likely why query performance is impacted.

I have a dev system with all my indexes (each of which has dedicated
hardware for production) on it.  Normally it requires 15GB of heap to
operate properly.  Every now and then, I get asked to do a duplicate
check on a field that *should* be unique, on an index with 250 million
docs in it.  The query that I am asked to do for the facet matches about
100 million docs.  This facet query, on a field that DOES have
docValues, will throw OOM if my heap is less than 27GB.  The dev machine
only has 32GB of RAM, so as you might imagine, performance is really
terrible when I do this query.  Thankfully it's a dev machine.  When I
was doing these queries, it was running 4.9.1.  I have since upgraded it
to 5.2.1, as a proof of concept for upgrading our production indexes ...
but I have not attempted the facet query since the upgrade.

Thanks,
Shawn



Solr facets implementation question

2015-09-08 Thread adfel70
I am trying to understand why faceting on a field with lots of unique values
has a great impact on query performance. Since Googling for Solr facet
algorithm did not yield anything, I looked how facets are implemented in
Lucene. I found out that there are 2 methods - taxonomy-based and
SortedSetDocValues-based. Does Solr facet capabilities are based on one of
those methods? if so, I still cant understand why unique values impacts
query performance...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-facets-implementation-question-tp4227604.html
Sent from the Solr - User mailing list archive at Nabble.com.