Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread Yonik Seeley
On Fri, Oct 20, 2017 at 2:22 PM, kenny  wrote:

> Thanks for the clear explanation. A couple of follow-up questions:
>
> - Can we tune overrequesting in the JSON API?
>

Yes. I still need to document it, but you can specify how many extra
facet values (buckets) to overrequest from each shard:
{
  type : field,
  field : cat,
  overrequest : 500
}

Also note that the JSON facet API does not do refinement by default (it's
not always desired).
Add refine:true to the field facet if you do want it.
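
As a rough illustration of how the two options combine with paging, here is
a minimal Python sketch that only builds the request body. The helper name
and the facet label "categories" are made up for this example, and the
limit/offset values simply mirror the 500-per-page batches discussed in this
thread; treat it as a sketch, not as a recommended configuration.

```python
import json

def build_facet_request(field, limit, offset=0, overrequest=500, refine=True):
    """Build a JSON request body with one field facet (hypothetical helper)."""
    return {
        "query": "*:*",
        "limit": 0,  # return no documents, only facet buckets
        "facet": {
            "categories": {  # arbitrary label for this facet in the response
                "type": "field",
                "field": field,
                "limit": limit,              # facet values per page
                "offset": offset,            # where this page starts
                "overrequest": overrequest,  # extra values asked of each shard
                "refine": refine,            # second pass to correct counts
            }
        },
    }

# Second page of 500 values for the "cat" field used above.
body = build_facet_request("cat", limit=500, offset=500)
print(json.dumps(body, indent=2))
```

The resulting body would typically be POSTed to the collection's query
handler; the collection name and URL depend on your deployment and are
deliberately left out here.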


> - We do see conflicting counts, but only when the offset is different
> from 0. We have already tested this in Solr 6.6 with the JSON API. We
> sometimes get the same value at different offsets: for example, the ranges of
> constraints [0,500] and [500,1000] might contain the same constraint.
>

That can happen with both regular faceting and the JSON Facet API
(deeper paging "discovers" a new constraint that ranks higher).
Regular faceting overrequests more by default and does refinement by
default, so adding refine:true and a larger overrequest to JSON facets
should give equivalent behavior.
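
To make the "deeper paging discovers a new constraint" effect concrete, here
is a small Python simulation using the three-shard counts from the example
quoted further down. It is a simplified model, not Solr's code: it assumes
each shard is asked for its top offset+limit values with no overrequest,
that ties are broken alphabetically, and the function name is hypothetical.

```python
# Per-shard counts for constraints A-D (same numbers as the quoted example).
SHARD_COUNTS = [
    {"constraintA": 2, "constraintD": 1},
    {"constraintB": 2, "constraintD": 1},
    {"constraintC": 2, "constraintD": 1},
]

def facet_page(shards, offset, limit):
    """Return one page of merged facet values (simplified model)."""
    depth = offset + limit  # how deep each shard has to look for this page
    seen = set()
    for shard in shards:
        ranked = sorted(shard, key=lambda name: (-shard[name], name))
        seen.update(ranked[:depth])
    # Refined totals for every value that at least one shard reported.
    totals = {name: sum(s.get(name, 0) for s in shards) for name in seen}
    merged = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return merged[offset:offset + limit]

page1 = facet_page(SHARD_COUNTS, offset=0, limit=1)
page2 = facet_page(SHARD_COUNTS, offset=1, limit=1)
print(page1)  # [('constraintA', 2)]
print(page2)  # [('constraintA', 2)]
```

The first page never sees constraintD, so constraintA ranks first. The second
page looks one value deeper on every shard, finds constraintD (total 3), and
constraintA slides down to position 1, so it shows up on both pages.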

 -Yonik

> Kenny
>
> On 20-10-17 17:12, Yonik Seeley wrote:
>
> Facet refinement in Solr guarantees that counts for returned
> constraints are correct, but does not guarantee that the top N
> returned isn't missing a constraint.
>
> Consider the following shard counts (3 shards) for the following
> constraints (aka facet values):
> constraintA: 2 0 0
> constraintB: 0 2 0
> constraintC: 0 0 2
> constraintD: 1 1 1
>
> Now for simplicity consider facet.limit=1:
> Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
> back A=2,B=2,C=2)
> Phase 2: refinement: retrieve counts for A,B,C for any shard that did
> not contribute to the count in Phase 1: (for example we ask shard2 and
> shard3 for the count of A)
> The counts are all correct, but we missed "D" because it never
> appeared in Phase #1
>
> Solr actually has overrequesting in the first phase to reduce the
> chances of this happening (i.e. it won't actually happen with the
> exact scenario above), but it can still happen.
>
> You can increase the overrequest amount (see
> https://lucene.apache.org/solr/guide/6_6/faceting.html)
> Or use streaming expressions or the SQL that goes on top of that in
> the latest Solr releases.
>
> -Yonik
>
>
> On Fri, Oct 20, 2017 at 10:19 AM, kenny wrote:
>
> Hi all,
>
> When we run some 'deep' facet counts (e.g. facet values from 0 to 500 and then
> from 500 to 1000), we see a small but disturbing difference in counts between
> the two (for example, last count in the first batch 165, first count in the
> second batch 167).
> We run this on Solr 5.3.1 in cloud mode (3 shards) with the non-JSON facet module.
> Has anyone seen this before? I could not find any bug reported like this.
>
> Thanks
>
> Kenny


Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread kenny

Thanks for the clear explanation. A couple of follow-up questions:

- Can we tune overrequesting in the JSON API?

- We do see conflicting counts, but only when the offset is different
from 0. We have already tested this in Solr 6.6 with the JSON API. We
sometimes get the same value at different offsets: for example, the ranges
of constraints [0,500] and [500,1000] might contain the same constraint.



Kenny


On 20-10-17 17:12, Yonik Seeley wrote:

Facet refinement in Solr guarantees that counts for returned
constraints are correct, but does not guarantee that the top N
returned isn't missing a constraint.

Consider the following shard counts (3 shards) for the following
constraints (aka facet values):
constraintA: 2 0 0
constraintB: 0 2 0
constraintC: 0 0 2
constraintD: 1 1 1

Now for simplicity consider facet.limit=1:
Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
back A=2,B=2,C=2)
Phase 2: refinement: retrieve counts for A,B,C for any shard that did
not contribute to the count in Phase 1: (for example we ask shard2 and
shard3 for the count of A)
The counts are all correct, but we missed "D" because it never
appeared in Phase #1

Solr actually has overrequesting in the first phase to reduce the
chances of this happening (i.e. it won't actually happen with the
exact scenario above), but it can still happen.

You can increase the overrequest amount (see
https://lucene.apache.org/solr/guide/6_6/faceting.html)
Or use streaming expressions or the SQL that goes on top of that in
the latest Solr releases.

-Yonik


On Fri, Oct 20, 2017 at 10:19 AM, kenny  wrote:

Hi all,

When we run some 'deep' facet counts (e.g. facet values from 0 to 500 and then
from 500 to 1000), we see a small but disturbing difference in counts between
the two (for example, last count in the first batch 165, first count in the
second batch 167).
We run this on Solr 5.3.1 in cloud mode (3 shards) with the non-JSON facet module.
Has anyone seen this before? I could not find any bug reported like this.

Thanks

Kenny



--

ONTOFORCE  
Kenny Knecht, PhD
CTO and technical lead
+32 486 75 66 16 
ke...@ontoforce.com 
www.ontoforce.com 

Meetdistrict, Ottergemsesteenweg-Zuid 808, 9000 Gent, Belgium
CIC, One Broadway, MA 02142 Cambridge, United States



Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread Yonik Seeley
Facet refinement in Solr guarantees that counts for returned
constraints are correct, but does not guarantee that the top N
returned isn't missing a constraint.

Consider the following shard counts (3 shards) for the following
constraints (aka facet values):
constraintA: 2 0 0
constraintB: 0 2 0
constraintC: 0 0 2
constraintD: 1 1 1

Now for simplicity consider facet.limit=1:
Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
back A=2,B=2,C=2)
Phase 2: refinement: retrieve counts for A,B,C for any shard that did
not contribute to the count in Phase 1: (for example we ask shard2 and
shard3 for the count of A)
The counts are all correct, but we missed "D" because it never
appeared in Phase #1

Solr actually has overrequesting in the first phase to reduce the
chances of this happening (i.e. it won't actually happen with the
exact scenario above), but it can still happen.
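
The two phases above can be sketched as a small Python simulation. This is
only an illustration under simplifying assumptions, not Solr's actual
implementation: the function names are hypothetical, ties are broken
alphabetically, and refinement is modeled by summing each candidate's count
over all shards (which gives the same totals as asking only the shards that
did not report it in Phase 1).

```python
# One dict per shard, using the counts from the example above.
SHARDS = [
    {"constraintA": 2, "constraintD": 1},
    {"constraintB": 2, "constraintD": 1},
    {"constraintC": 2, "constraintD": 1},
]

def top_n(counts, n):
    """Top n (value, count) pairs: highest count first, then by name."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def distributed_facet(shards, limit, overrequest=0):
    # Phase 1: every shard reports its own top (limit + overrequest) values.
    candidates = set()
    for shard in shards:
        candidates.update(name for name, _ in top_n(shard, limit + overrequest))
    # Phase 2 (refinement): fill in exact counts for every candidate.
    refined = {
        name: sum(shard.get(name, 0) for shard in shards)
        for name in candidates
    }
    return top_n(refined, limit)

print(distributed_facet(SHARDS, limit=1))                 # [('constraintA', 2)]
print(distributed_facet(SHARDS, limit=1, overrequest=1))  # [('constraintD', 3)]
```

Without overrequest the returned count (2) is correct, but constraintD, whose
true total is 3, is never a candidate. Asking each shard for one extra value
is enough to surface it in this toy case; with real data the amount needed
depends on how the values are spread across shards.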

You can increase the overrequest amount (see
https://lucene.apache.org/solr/guide/6_6/faceting.html)
Or use streaming expressions or the SQL that goes on top of that in
the latest Solr releases.

-Yonik


On Fri, Oct 20, 2017 at 10:19 AM, kenny  wrote:
> Hi all,
>
> When we run some 'deep' facet counts (e.g. facet values from 0 to 500 and then
> from 500 to 1000), we see a small but disturbing difference in counts between
> the two (for example, last count in the first batch 165, first count in the
> second batch 167).
> We run this on Solr 5.3.1 in cloud mode (3 shards) with the non-JSON facet module.
> Has anyone seen this before? I could not find any bug reported like this.
>
> Thanks
>
> Kenny


Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread kenny

Hi all,

When we run some 'deep' facet counts (e.g. facet values from 0 to 500 and
then from 500 to 1000), we see a small but disturbing difference in counts
between the two (for example, last count in the first batch 165, first count
in the second batch 167).

We run this on Solr 5.3.1 in cloud mode (3 shards) with the non-JSON facet module.
Has anyone seen this before? I could not find any bug reported like this.

Thanks

Kenny