Re: How to cap facet counts beyond a specified limit

2012-06-08 Thread Toke Eskildsen
On Thu, 2012-06-07 at 10:01 +0200, Andrew Laird wrote:
 For our needs we don't really need to know that a particular facet has
 exactly 14,203,527 matches - just knowing that there are more than a
 million is enough.  If I could somehow limit the hit counts to a
 million (say) [...]

It should be feasible to stop the collector after 1M documents have been
processed - if nothing else, then simply by ignoring subsequent IDs.
However, the IDs received would be in index order, which normally means
old-to-new. If the nature of the corpus, and thereby the facet values,
changes over time, that change would not be reflected in the facets that
have many hits, as the collector never reaches the newer documents.
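The capping idea above could be sketched roughly like this. Note this is a
standalone illustration of the logic only - CappedCounter is a made-up name,
not a Lucene or Solr class, and a real implementation would hook into the
collector chain:

```java
// Illustrative sketch: stop counting once a cap is reached, and report
// capped values as "N+". Not actual Solr/Lucene API.
public class CappedCounter {
    private final long cap;
    private long count = 0;

    public CappedCounter(long cap) {
        this.cap = cap;
    }

    // Mimics a collector's collect(docId): hits beyond the cap are
    // simply ignored, as suggested above.
    public void collect(int docId) {
        if (count < cap) {
            count++;
        }
    }

    public long count() {
        return count;
    }

    // Report "1000000+" style values once the cap is hit.
    public String display() {
        return count >= cap ? cap + "+" : Long.toString(count);
    }

    public static void main(String[] args) {
        CappedCounter counter = new CappedCounter(1_000_000);
        for (int docId = 0; docId < 1_500_000; docId++) {
            counter.collect(docId);
        }
        System.out.println(counter.display()); // 1000000+
    }
}
```

As noted, the catch is that the ignored tail of the hit list is the newest
part of the index, so capped counts are biased toward older documents.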

 it seems like that could decrease the work required to
 compute the values (just stop counting after the limit is reached) and
 potentially improve faceted search time - especially when we have 20-30
 fields to facet on.  Has anyone else tried to do something like this?

The current Solr facet implementation treats every facet structure
individually. That works fine in a lot of cases, but it also means that
the list of IDs for matching documents is iterated once for every facet:
in the sample case, 14M+ hits * 25 fields = 350M+ hits processed.
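The cost difference between the two approaches works out as follows (a toy
calculation matching the numbers above, not Solr internals):

```java
// Illustrative arithmetic: per-field passes over the hit list vs. a
// single combined pass. Class and method names are made up for the sketch.
public class FacetPassCost {
    // Current approach: one full pass over the matched-doc list per
    // faceted field.
    static long perFieldPasses(long hits, int fields) {
        return hits * fields;
    }

    // Single-structure approach: one pass updates all fields' counters.
    static long singlePass(long hits) {
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(perFieldPasses(14_000_000L, 25)); // 350000000
        System.out.println(singlePass(14_000_000L));         // 14000000
    }
}
```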

I have been experimenting with an alternative approach (SOLR-2412) that
packs the terms in the facets as a single structure under the hood,
which means only 14M+ hits processed in the current case. Unfortunately
it is not yet mature and only works for text fields.

- Toke Eskildsen, State and University Library, Denmark



Re: How to cap facet counts beyond a specified limit

2012-06-07 Thread Jack Krupansky

Sounds like an interesting improvement to propose.

It will also depend on various factors, such as the number of unique terms
in a field, the field type, etc.


Which field types are giving you the most trouble, and how many unique
values do they have? And do you specify a facet.method, or just let it
default?


What release of Solr are you on? Are you using trie for numeric fields? 
Are these mostly string fields? Any boolean fields?


-- Jack Krupansky

-Original Message- 
From: Andrew Laird

Sent: Thursday, June 07, 2012 4:01 AM
To: solr-user@lucene.apache.org
Subject: How to cap facet counts beyond a specified limit

We have an index with ~100M documents and I am looking for a simple way to 
speed up faceted searches.  Is there a relatively straightforward way to 
stop counting the number of matching documents beyond some specifiable 
value?  For our needs we don't really need to know that a particular facet 
has exactly 14,203,527 matches - just knowing that there are more than a 
million is enough.  If I could somehow limit the hit counts to a million 
(say) it seems like that could decrease the work required to compute the 
values (just stop counting after the limit is reached) and potentially 
improve faceted search time - especially when we have 20-30 fields to facet 
on.  Has anyone else tried to do something like this?


Many thanks for comments and info,

Sincerely,


andy laird | gettyimages | 206.925.6728