FYI, I am now using docValues for facet fields with somewhat better performance, at least more consistent performance (especially with frequent commits). Also, I see my main bottleneck seems to be ec2 servers - I am not running on m3.xlarge with provisioned EBS 4000 IOPS and it is looking much better.
________________________________________ From: Toke Eskildsen [t...@statsbiblioteket.dk] Sent: Wednesday, August 07, 2013 7:45 AM To: solr-user@lucene.apache.org Subject: Re: poor facet search performance On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote: [Custom facet structure] > Then we sorted those sets of facet fields by total document frequency > so we enumerate the more frequent facet fields first, and we stop > looking when we find a facet field which has less total document > matches than the top N facet counts we are looking for. So the structure was Facet->docIDs? A bit like enum in Solr? Your top-N cut-off is an interesting optimization for that. [...] > The slaves just pre-load that binary structure directly into ram in one > shot in the background when opening a new snapshot for search. We used a similar pre-calculation some years ago but abandoned it as the cost of "Pre-generate_structure + #duplicates * (distribute_structure + open_structure)" was just as costly and less flexible than "#duplicates * generate_structure" for us. > We have 200 million docs, 10 shards, about 20 facet fields, some of > which contain about 20,000 unique values. We show top 10 facets for > about 10 different fields in results page. We provide search results > with lots of facets and date counts in around 200-300ms using this > technique. > > Currently, we are porting this entire system to SOLR. For a single > core index of 8 million docs, using similar documents and facet fields > from our production indexes, I cant get faceted search to perform > anywhere close to 300ms for general searches. More like 1.5-3 > seconds. Solr fc faceting treats each facet independently and in a docID->facet manner so what happens is foreach facet { foreach docIDinResultSet { foreach tagIDinDocument { facet.counter[tagID]++ } } } With 10 facets, 8M documents and 1 tag/doc/facet, the total loop count it 80M. That does not normally take 1.5-3 seconds in, so something seems off. Do you have a lot of facet tags (aka terms) for each document? > Is there anything else that I should look into for getting better facet > performance? Could you list the part of the Solr log with the facet structures? Just grep for "UnInverted". They look something like this: UnInverted multi-valued field {field=lma_long,memSize=42711405,tindexSize=42,time=979,phase1=964,nTerms=23,bigTerms=6,termInstances=1958468,uses=0} > Given these metrics (200m docs, 20 facet fields, some fields with > 20,000 unique values), what kind of facet search performance should I > expect? Due to the independent faceting handling in Solr, the facet time will scale a bit worse than linear to the number of documents, relative to your test setup. With a loop count of 200M*10 (or 20? I am a bit confused on how many facets you show at a time) = 2G, this will take multiple seconds. Unless you go experimental (SOLR-2412 to bang my own drum), your facet count needs to go down or you need to shard with Solr. > Also we need to issue frequent commits since we are constantly > streaming new content into the system. You could use a setup with a smaller "live" shard and multiple stale ones, but depending on corpus your ranking might suffer. - Toke Eskildsen, State and University Library, Denmark