Re: At a high level how does faceting in SolrCloud work?

2012-10-02 Thread Jamie Johnson
Thanks for this guys, really excellent explanation!

On Thu, Sep 27, 2012 at 12:15 AM, Yonik Seeley yo...@lucidworks.com wrote:
 On Wed, Sep 26, 2012 at 6:21 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 2) the coordinator node sums up the counts for any constraint returned by
 multiple nodes, and then picks the top (facet.limit) constraints based n
 the counts it knows about.

 It's actually more sophisticated than that - we don't limit to the top
 facet.limit constraints at the first phase.
 For *all* constraints we see from the first phase, we calculate if it
 could possibly be in the top facet.limit constraints (based on shards
 we haven't heard from).  If so, we request exact counts from those
 shards we haven't heard from.

 (but i believe this is second query
 is optimized to only ask a shard about a constraint if it didn't already
 get the count in the first request)

 Correct.

 So imagine you have 3 shards, and querying them individually with
 facet.field=catfacet.limit=3 you get...

 shardA: cars(8), books(7), computers(6)
 shardB: toys(8), books(7), garden(5)
 shardC: garden(4), books(3), computers(3)

 If you made a solr cloud query (or an explicit distributed query of those
 three shards), the first request the coordinator would send to each shard
 would specify a higher facet.limit, and might get back something like...

 shardA: cars(8), books(7), computers(6), cleaning(4), ...
 shardB: toys(8), books(7), garden(5), cleaning(4), ...
 shardC: garden(4), books(3), computers(3), plants(3), ...

 ...in which case cleaning pops up as a contender for being in the top
 constraints.  The coordinator sums up the counts for the constraints it
 knows about, and might decide that these are the top 3...

 books(17), computers(9), cleaning(8)

 To extend your example, Solr notices that plants has a count of 3 on
 one shard, and was missing from the other two shards.
 The maximum possible count it *could* have is 11 (3+4+4), which could
 possibly put it in the top 3, hence it will also ask shardA and shardB
 about plants.

 -Yonik
 http://lucidworks.com


Re: At a high level how does faceting in SolrCloud work?

2012-10-02 Thread Jamie Johnson
So does mincount get considered in this as well?

On Tue, Oct 2, 2012 at 10:19 AM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks for this guys, really excellent explanation!

 On Thu, Sep 27, 2012 at 12:15 AM, Yonik Seeley yo...@lucidworks.com wrote:
 On Wed, Sep 26, 2012 at 6:21 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 2) the coordinator node sums up the counts for any constraint returned by
 multiple nodes, and then picks the top (facet.limit) constraints based n
 the counts it knows about.

 It's actually more sophisticated than that - we don't limit to the top
 facet.limit constraints at the first phase.
 For *all* constraints we see from the first phase, we calculate if it
 could possibly be in the top facet.limit constraints (based on shards
 we haven't heard from).  If so, we request exact counts from those
 shards we haven't heard from.

 (but i believe this is second query
 is optimized to only ask a shard about a constraint if it didn't already
 get the count in the first request)

 Correct.

 So imagine you have 3 shards, and querying them individually with
 facet.field=catfacet.limit=3 you get...

 shardA: cars(8), books(7), computers(6)
 shardB: toys(8), books(7), garden(5)
 shardC: garden(4), books(3), computers(3)

 If you made a solr cloud query (or an explicit distributed query of those
 three shards), the first request the coordinator would send to each shard
 would specify a higher facet.limit, and might get back something like...

 shardA: cars(8), books(7), computers(6), cleaning(4), ...
 shardB: toys(8), books(7), garden(5), cleaning(4), ...
 shardC: garden(4), books(3), computers(3), plants(3), ...

 ...in which case cleaning pops up as a contender for being in the top
 constraints.  The coordinator sums up the counts for the constraints it
 knows about, and might decide that these are the top 3...

 books(17), computers(9), cleaning(8)

 To extend your example, Solr notices that plants has a count of 3 on
 one shard, and was missing from the other two shards.
 The maximum possible count it *could* have is 11 (3+4+4), which could
 possibly put it in the top 3, hence it will also ask shardA and shardB
 about plants.

 -Yonik
 http://lucidworks.com


Re: At a high level how does faceting in SolrCloud work?

2012-09-26 Thread Chris Hostetter

: I'd like to wrap my head around how faceting in SolrCloud works, does
: Solr ask each shard for their maximum value and then use that to
: determine what else should be asked for from other shards, or does it
: ask for all values and do the aggregation on the requesting server?

For things like facet.range and facet.query, it's really simple: each 
shard is given the facet params, each shard computes an identical list of 
constraints (they are deterministic) and returns them to the coordinator, 
who sums the counts.

For facet.field things are a little more interesting...

The overall approach taken is designed to ensure that the constraint 
counts you get back are always correct (w/o requiring each shard to 
return the fill list of terms it knows about) even if the document 
distribution is really skewed and the constraints returned aren't the 
constraints with the highest count across the entire index.  

how it works (from memory) is basically..

1) each shard is asked for it's top constraints using an artificitlaly 
inlated facet.limit (i think it's something like 20 + 1.5 the original 
limit)

2) the coordinator node sums up the counts for any constraint returned by 
multiple nodes, and then picks the top (facet.limit) constraints based n 
the counts it knows about.

3) the coordinator node then asks each shard to compute it's exact count 
for the selected constraints (since some of those constraints may not have 
been in the original lists returned by some shards), and it then computes 
the final sum. This ensures that the constraint count will match the 
numFound if filter on that constraint (but i believe this is second query 
is optimized to only ask a shard about a constraint if it didn't already 
get the count in the first request)

So imagine you have 3 shards, and querying them individually with 
facet.field=catfacet.limit=3 you get...

shardA: cars(8), books(7), computers(6)
shardB: toys(8), books(7), garden(5)
shardC: garden(4), books(3), computers(3)

If you made a solr cloud query (or an explicit distributed query of those 
three shards), the first request the coordinator would send to each shard 
would specify a higher facet.limit, and might get back something like...

shardA: cars(8), books(7), computers(6), cleaning(4), ...
shardB: toys(8), books(7), garden(5), cleaning(4), ...
shardC: garden(4), books(3), computers(3), plants(3), ...

...in which case cleaning pops up as a contender for being in the top 
constraints.  The coordinator sums up the counts for the constraints it 
knows about, and might decide that these are the top 3...

books(17), computers(9), cleaning(8)

...at which point it sends a second query to shardB asking for an explicit 
count of constraint computers and to shardC for a count of the 
constraint cleaning.  so that the final results can be exact.  when the 
responses come back, those counts are 
added in to the totals the coordinator already knows about.  So in our 
example, if the results of the second query are...

shardB: computers(0)
shardC: cleaning(2)

..then the final top 3 constriants for the cat field might be...

books(17), cleaning(10), computers(9)

(Note that if we had just done a really naive sum of the original counts, 
cleaning wouldn't have even made the list)


-Hoss


Re: At a high level how does faceting in SolrCloud work?

2012-09-26 Thread Yonik Seeley
On Wed, Sep 26, 2012 at 6:21 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 2) the coordinator node sums up the counts for any constraint returned by
 multiple nodes, and then picks the top (facet.limit) constraints based n
 the counts it knows about.

It's actually more sophisticated than that - we don't limit to the top
facet.limit constraints at the first phase.
For *all* constraints we see from the first phase, we calculate if it
could possibly be in the top facet.limit constraints (based on shards
we haven't heard from).  If so, we request exact counts from those
shards we haven't heard from.

 (but i believe this is second query
 is optimized to only ask a shard about a constraint if it didn't already
 get the count in the first request)

Correct.

 So imagine you have 3 shards, and querying them individually with
 facet.field=catfacet.limit=3 you get...

 shardA: cars(8), books(7), computers(6)
 shardB: toys(8), books(7), garden(5)
 shardC: garden(4), books(3), computers(3)

 If you made a solr cloud query (or an explicit distributed query of those
 three shards), the first request the coordinator would send to each shard
 would specify a higher facet.limit, and might get back something like...

 shardA: cars(8), books(7), computers(6), cleaning(4), ...
 shardB: toys(8), books(7), garden(5), cleaning(4), ...
 shardC: garden(4), books(3), computers(3), plants(3), ...

 ...in which case cleaning pops up as a contender for being in the top
 constraints.  The coordinator sums up the counts for the constraints it
 knows about, and might decide that these are the top 3...

 books(17), computers(9), cleaning(8)

To extend your example, Solr notices that plants has a count of 3 on
one shard, and was missing from the other two shards.
The maximum possible count it *could* have is 11 (3+4+4), which could
possibly put it in the top 3, hence it will also ask shardA and shardB
about plants.

-Yonik
http://lucidworks.com


At a high level how does faceting in SolrCloud work?

2012-09-24 Thread Jamie Johnson
I'd like to wrap my head around how faceting in SolrCloud works, does
Solr ask each shard for their maximum value and then use that to
determine what else should be asked for from other shards, or does it
ask for all values and do the aggregation on the requesting server?