On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote:

[Custom facet structure]

> Then we sorted those sets of facet fields by total document frequency
> so we enumerate the more frequent facet fields first, and we stop
> looking when we find a facet field which has less total document
> matches than the top N facet counts we are looking for.

So the structure was Facet->docIDs? A bit like enum in Solr? Your top-N
cut-off is an interesting optimization for that.

[...]

> The slaves just pre-load that binary structure directly into ram in one
> shot in the background when opening a new snapshot for search.

We used a similar pre-calculation some years ago but abandoned it as the
cost of "Pre-generate_structure + #duplicates * (distribute_structure +
open_structure)" was just as costly and less flexible than "#duplicates
* generate_structure" for us.

> We have 200 million docs, 10 shards, about 20 facet fields, some of
> which contain about 20,000 unique values.  We show top 10 facets for
> about 10 different fields in results page.   We provide search results
> with lots of facets and date counts in around 200-300ms using this
> technique.
> 
> Currently, we are porting this entire system to SOLR.  For a single
> core index of 8 million docs, using similar documents and facet fields
> from our production indexes, I cant get faceted search to perform
> anywhere close to 300ms for general searches.   More like 1.5-3
> seconds.

Solr fc faceting treats each facet independently and in a docID->facet
manner so what happens is

foreach facet {
  foreach docIDinResultSet {
    foreach tagIDinDocument {
      facet.counter[tagID]++
    }
  }
}

With 10 facets, 8M documents and 1 tag/doc/facet, the total loop count
it 80M. That does not normally take 1.5-3 seconds in, so something seems
off. Do you have a lot of facet tags (aka terms) for each document?

> Is there anything else that I should look into for getting better facet
> performance?

Could you list the part of the Solr log with the facet structures? Just
grep for "UnInverted". They look something like this:
UnInverted multi-valued field
{field=lma_long,memSize=42711405,tindexSize=42,time=979,phase1=964,nTerms=23,bigTerms=6,termInstances=1958468,uses=0}

> Given these metrics (200m docs, 20 facet fields, some fields with
> 20,000 unique values), what kind of facet search performance should I
> expect?

Due to the independent faceting handling in Solr, the facet time will
scale a bit worse than linear to the number of documents, relative to
your test setup. With a loop count of 200M*10 (or 20? I am a bit
confused on how many facets you show at a time) = 2G, this will take
multiple seconds. Unless you go experimental (SOLR-2412 to bang my own
drum), your facet count needs to go down or you need to shard with Solr.

> Also we need to issue frequent commits since we are constantly
> streaming new content into the system.

You could use a setup with a smaller "live" shard and multiple stale
ones, but depending on corpus your ranking might suffer.

- Toke Eskildsen, State and University Library, Denmark

Reply via email to