Re: Solr seems to reserve facet.limit results

2016-12-08 Thread Toke Eskildsen
Markus Jelsma  wrote:
> I tried the overrequest ratio/count and set them to 1.0/0. Oddly enough,
> with these settings high facet.limit and extremely high facet.limit are
> both up to twice as slow as with the 1.5/10 settings.

Not sure if it is the right explanation for your "extremely high 
facet.limit"-case, but here goes...


The two phases in distributed simple String faceting in Solr are very different 
from each other:

The first phase allocates a counter structure, iterates the query hits and 
increments the counters, then extracts the top-X facet terms and returns them.

The second phase receives a list of facet terms to count. The terms are those 
that the shard did not deliver in phase 1. 
An example might help here: For phase 1, shard 1 returns [a:5 b:3 c:3], while 
shard 2 returns [d:2 e:2 c:1]. This is merged to [a:5 c:4 b:3]. Since shard 2 
did not return counts for the terms a and b, these counts are requested from 
shard 2 in phase 2.
In the current implementation, the term counts in the second phase are 
calculated in the same way as enum faceting: Basically one tiny search for each 
term with the query facetfield:term. This does not scale well, so it does not 
take many terms before phase 2 gets _slower_ than phase 1 (you can see for 
yourself in the solr.log). So we want to keep the number of phase 2 term-counts 
down, even if it means that phase 1 gets a bit slower.
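
To put the phase 2 cost in concrete terms, here is a rough SolrJ sketch of
what each term-count amounts to: one tiny search per term. This is only an
illustration (the client setup, collection and field names are placeholders);
the actual refinement happens inside Solr, not through SolrJ:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;

  public class Phase2Illustration {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient client = new CloudSolrClient.Builder()
          .withZkHost("localhost:9983").build()) {
        client.setDefaultCollection("collection1");
        // Terms the merger still needs counts for (a and b from the
        // example above)
        String[] missingTerms = {"a", "b"};
        for (String term : missingTerms) {
          // Effectively one tiny search per term: facetfield:term
          SolrQuery q = new SolrQuery("facetfield:" + term);
          q.setRows(0); // only numFound is needed
          long count = client.query(q).getResults().getNumFound();
          System.out.println(term + ": " + count);
        }
      }
    }
  }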
This is where over-requesting comes into play: The more you over-request, the 
slower phase 1 gets, but it also means that the chance of the merger having to 
ask for extra term-counts gets lower as they were probably returned in phase 1.
I wrote a bit about the phenomenon in 
https://sbdevel.wordpress.com/2014/09/11/even-sparse-faceting-is-limited/

- Toke Eskildsen


RE: Solr seems to reserve facet.limit results

2016-12-08 Thread Markus Jelsma
Thanks Chris, Toke,

I tried the overrequest ratio/count and set them to 1.0/0. Oddly enough, with 
these settings high facet.limit and extremely high facet.limit are both up to 
twice as slow as with the 1.5/10 settings.

Even successive calls don't seem to 'warm anything up'. 

Anyone have an explanation for this? This is counterintuitive, at least to me.

Thanks,
Markus
 
-Original message-
> From:Chris Hostetter <hossman_luc...@fucit.org>
> Sent: Tuesday 6th December 2016 1:47
> To: solr-user@lucene.apache.org
> Subject: RE: Solr seems to reserve facet.limit results
> 
> 
> 
> I think what you're seeing might be a result of the overrequesting done
> in phase #1 of a distributed facet query.
> 
> The purpose of overrequesting is to mitigate the possibility of a 
> constraint which should be in the topN for the collection as a whole, but 
> is just outside the topN on every shard -- so it never makes it to the 
> second phase of the distributed calculation.
> 
> The amount of overrequest is, by default, a multiplicative function of the 
> user-specified facet.limit with a fudge factor (IIRC: 10 + (1.5 * facet.limit))
> 
> If you're using an explicitly high facet.limit, you can try setting the 
> overrequest ratio/count to 1.0/0 respectively to force Solr to only 
> request the # of constraints you've specified from each shard, and then 
> aggregate them...
> 
> https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO
> https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT
> 
> 
> 
> One side note related to the work around you suggested...
> 
> : One simple solution in my case, now that I think of it, would be to run 
> : the query with no facets and no rows, get the numFound, and set that as 
> : facet.limit for the actual query.
> 
> ...that assumes that the number of facet constraints returned is limited 
> by the total number of documents matching the query -- in general there is 
> no such guarantee because of multivalued fields (or faceting on tokenized 
> fields), so this type of approach isn't a good idea as a generalized 
> solution.
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 


Re: Solr seems to reserve facet.limit results

2016-12-06 Thread Toke Eskildsen
On Mon, 2016-12-05 at 17:47 -0700, Chris Hostetter wrote:
> : One simple solution in my case, now that I think of it, would be to
> : run the query with no facets and no rows, get the numFound, and set
> : that as facet.limit for the actual query.
> 
> ...that assumes that the number of facet constraints returned is
> limited by the total number of documents matching the query -- in
> general there is no such guarantee because of multivalued fields (or
> faceting on tokenized fields), so this type of approach isn't a good
> idea as a generalized solution

For simple String/Text faceting, which Markus seems to be using, the
number of repetitions of a term in a document does not matter: Each
term only counts at most once per document.
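
An example: a document with the multi-valued field values [x, x, y] adds 1
(not 2) to the count for x and 1 to the count for y - though, as Chris
points out, that single document still yields two distinct facet terms.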


If there are any common case deviations from this, the preface to the
faceting documentation should be updated: "...along with numerical
counts of how many matching documents were found for each term".
https://cwiki.apache.org/confluence/display/solr/Faceting

- Toke Eskildsen, State and University Library, Denmark


RE: Solr seems to reserve facet.limit results

2016-12-05 Thread Chris Hostetter


I think what you're seeing might be a result of the overrequesting done
in phase #1 of a distriuted facet query.

The purpose of overrequesting is to mitigate the possibility of a 
constraint which should be in the topN for the collection as a whole, but 
is just outside the topN on every shard -- so it never makes it to the 
second phase of the distributed calculation.

The amount of overrequest is, by default, a multiplicative function of the 
user-specified facet.limit with a fudge factor (IIRC: 10 + (1.5 * facet.limit))
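
As a worked example of that default: facet.limit=1000 means each shard is
asked for 10 + (1.5 * 1000) = 1510 constraints in phase 1, and a facet.limit
of 20 million means 30,000,010 constraints per shard.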

If you're using an explicitly high facet.limit, you can try setting the 
overrequest ratio/count to 1.0/0 respectively to force Solr to only 
request the # of constraints you've specified from each shard, and then 
aggregate them...

https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO
https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT
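
A minimal SolrJ sketch of the above (field name and limit are placeholders;
the two params are the string values behind the FacetParams constants linked
above):

  import org.apache.solr.client.solrj.SolrQuery;

  public class NoOverrequest {
    public static SolrQuery build() {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                            // facet counts only
      q.setFacet(true);
      q.addFacetField("facetfield");           // placeholder field
      q.setFacetLimit(20000);                  // the explicitly high limit
      q.set("facet.overrequest.ratio", "1.0"); // no multiplicative over-request
      q.set("facet.overrequest.count", "0");   // no additive over-request
      return q;
    }
  }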



One side note related to the work around you suggested...

: One simple solution in my case, now that I think of it, would be to run 
: the query with no facets and no rows, get the numFound, and set that as 
: facet.limit for the actual query.

...that assumes that the number of facet constraints returned is limited 
by the total number of documents matching the query -- in general there is 
no such guarantee because of multivalued fields (or faceting on tokenized 
fields), so this type of approach isn't a good idea as a generalized 
solution.



-Hoss
http://www.lucidworks.com/


Re: Solr seems to reserve facet.limit results

2016-12-05 Thread Toke Eskildsen
On Fri, 2016-12-02 at 12:17 +, Markus Jelsma wrote:
> I have not considered streaming as I am still completely unfamiliar
> with it and I don't yet know what problems it can solve.

Standard faceting requires all nodes to produce their version of the
full result and send it as one chunk, which is then merged at the
calling node (+ other stuff). For large results that comes with a
significant memory overhead.

Solr streaming is ... well, streaming: the memory overhead is practically
the same whether you request 10K or 10 billion entries.

> One simple solution, in my case would be, now just thinking of it,
> run the query with no facets and no rows, get the numFound, and set
> that as facet.limit for the actual query.

That would work for your case. Still, try issuing a "*:*" search and see if
it breaks your very large facet request.

> Are there any examples / articles about consuming streaming facets
> with SolrJ? 

Sorry, I have little experience with SolrJ.
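
For what it's worth, a hedged sketch of what it could look like on Solr 6.x
(the expression, base URL, collection and field names are assumptions; for
truly unbounded term lists, a rollup over the /export handler may fit better
than facet() with its bucketSizeLimit):

  import org.apache.solr.client.solrj.io.Tuple;
  import org.apache.solr.client.solrj.io.stream.SolrStream;
  import org.apache.solr.client.solrj.io.stream.StreamContext;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class StreamingFacets {
    public static void main(String[] args) throws Exception {
      String expr = "facet(collection1, q=\"*:*\", buckets=\"facetfield\","
          + " bucketSorts=\"count(*) desc\", bucketSizeLimit=20000, count(*))";
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("expr", expr);
      params.set("qt", "/stream");
      SolrStream stream =
          new SolrStream("http://localhost:8983/solr/collection1", params);
      stream.setStreamContext(new StreamContext());
      try {
        stream.open();
        Tuple tuple;
        while (!(tuple = stream.read()).EOF) { // tuples arrive one at a time
          System.out.println(tuple.getString("facetfield")
              + ": " + tuple.getLong("count(*)"));
        }
      } finally {
        stream.close();
      }
    }
  }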

- Toke Eskildsen, State and University Library, Denmark


RE: Solr seems to reserve facet.limit results

2016-12-02 Thread Markus Jelsma
Hello Toke - this is on 6.3 (forgot to mention), rows=0, and we consume the 
response in SolrJ.

I have not considered streaming as I am still completely unfamiliar with it, 
and I don't yet know what problems it can solve.

One simple solution in my case, now that I think of it, would be to run the 
query with no facets and no rows, get the numFound, and set that as facet.limit 
for the actual query.
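
A minimal SolrJ sketch of that two-pass idea (collection, field and query are
placeholders; note Hoss's caveat elsewhere in this thread that numFound only
bounds the number of facet terms for single-valued, non-tokenized fields):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TwoPassFacetLimit {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient client = new CloudSolrClient.Builder()
          .withZkHost("localhost:9983").build()) {
        client.setDefaultCollection("collection1");

        // Pass 1: no rows, no facets - just numFound.
        SolrQuery probe = new SolrQuery("text:something");
        probe.setRows(0);
        long numFound = client.query(probe).getResults().getNumFound();

        // Pass 2: the actual query, with numFound as facet.limit.
        SolrQuery real = new SolrQuery("text:something");
        real.setRows(0);
        real.setFacet(true);
        real.addFacetField("facetfield");
        real.setFacetLimit((int) numFound);
        QueryResponse rsp = client.query(real);
        rsp.getFacetField("facetfield").getValues().forEach(
            c -> System.out.println(c.getName() + ": " + c.getCount()));
      }
    }
  }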

Are there any examples / articles about consuming streaming facets with SolrJ? 

Thanks,
Markus
 
-Original message-
> From:Toke Eskildsen <t...@statsbiblioteket.dk>
> Sent: Friday 2nd December 2016 13:01
> To: solr_user lucene_apache <solr-user@lucene.apache.org>
> Subject: Re: Solr seems to reserve facet.limit results
> 
> On Fri, 2016-12-02 at 11:21 +, Markus Jelsma wrote:
> > Regardless of the number of actual results, queries with a very high
> > facet.limit are three to five times slower compared to much lower
> > values. For example, I have a query that returns roughly 19,000 facet
> > results. Queries with facet.limit=2 return within 200 ms, but
> > queries with facet.limit=20 million return after around 800 ms. This
> > is in a cloud environment.
> 
> First of all, requesting the top 20M facet terms in a multi-node cloud is
> really not advisable, as the transfer+merge overhead is huge. Have you
> considered streaming?
> 
> > I vaguely remember an issue where Solr reserves the requested limit,
> 
> I looked at both simple String faceting and numeric faceting in Solr.
> While there are pre-allocations of the structures involved, they both
> have built-in limiting, so the large performance difference that you
> are seeing is a bit strange. This was with the Solr 5.4 code that I
> happened to have open. Which version are you using?
> 
> Just a thought: For plain search, specifying rows=20M is quite
> different from rows=20K, as that code does not have the same limiting
> as faceting. Are you perchance setting rows together with facet.limit?
> 
> - Toke Eskildsen, State and University Library, Denmark
> 


Re: Solr seems to reserve facet.limit results

2016-12-02 Thread Toke Eskildsen
On Fri, 2016-12-02 at 11:21 +, Markus Jelsma wrote:
> Regardless of the number of actual results, queries with a very high
> facet.limit are three to five times slower compared to much lower
> values. For example, I have a query that returns roughly 19,000 facet
> results. Queries with facet.limit=2 return within 200 ms, but
> queries with facet.limit=20 million return after around 800 ms. This
> is in a cloud environment.

First of all, requesting the top 20M facet terms in a multi-node cloud is
really not advisable, as the transfer+merge overhead is huge. Have you
considered streaming?

> I vaguely remember an issue where Solr reserves the requested limit,

I looked at both simple String faceting and numeric faceting in Solr.
While there are pre-allocations of the structures involved, they both
have built-in limiting, so the large performance difference that you
are seeing is a bit strange. This was with the Solr 5.4 code that I
happened to have open. Which version are you using?

Just a thought: For plain search, specifying rows=20M is quite
different from rows=20K, as that code does not have the same limiting
as faceting. Are you perchance setting rows together with facet.limit?

- Toke Eskildsen, State and University Library, Denmark


Solr seems to reserve facet.limit results

2016-12-02 Thread Markus Jelsma
Hi - in some cases we want all facet values and counts for a given query; it 
can be 10k or even 10m, but also just one thousand.

Regardless of the number of actual results, queries with a very high facet.limit 
are three to five times slower compared to much lower values. For example, I have 
a query that returns roughly 19,000 facet results. Queries with facet.limit=2 
return within 200 ms, but queries with facet.limit=20 million return after 
around 800 ms. This is in a cloud environment.

I vaguely remember an issue where Solr reserves the requested limit - is there 
an open issue about this? 

Thanks,
Markus