First, the terms used to subset the index should go into a filter query (fq), not 
into the main query. That may help, because filter query terms are not used for 
relevance scoring.

Have you done any system profiling? Where is the bottleneck: CPU or disk? There 
is no point in optimising things before you know the bottleneck.
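
One quick check (a suggestion, not the only way): watch iostat and top while a 
slow query runs, and compare the QTime in the Solr response header to wall-clock 
time. Heavy disk wait points at IO; a large gap between QTime and wall-clock time 
points at response writing or the client rather than the search itself.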

Also, your latency goals may be impossible. Assume roughly one disk access per 
term in the query. You are not going to be able to do 100,000 random access 
disk IOs in 2 seconds, let alone process the results.
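
Rough numbers, assuming a spinning disk at 5-10ms per random seek (an assumption 
about your hardware): 100,000 seeks x 5ms is about 500 seconds, and at 10ms it is 
about 1,000 seconds. Nearly every lookup would have to come out of RAM to get 
anywhere near 2 seconds.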

wunder

On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

> hi Andy,
> 
> It seems like a common type of operation and I would also be curious what
> others think. My take on this is to create a compressed intbitset and send
> it as a query filter, then have the handler decompress/deserialize it and
> use it as a filter query. We have already done experiments with intbitsets
> and they are fast to send/receive.
> 
> look at page 20:
> http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
> 
> it is not on my immediate list of tasks, but if you want to help, it can be
> done sooner
> 
> roman
> 
> On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester <a...@petdance.com> wrote:
> 
>> We've got an 11,000,000-document index.  Most documents have a unique ID
>> called "flrid", plus a different ID called "solrid" that is Solr's PK.  For
>> some searches, we need to be able to limit the searches to a subset of
>> documents defined by a list of FLRID values.  The list of FLRID values can
>> change with every search, and it is rare enough to call it "never" that any
>> two searches will ever have the same set of FLRIDs to limit on.
>> 
>> What we're doing right now is, roughly:
>> 
>>    q=title:dogs AND
>>        (flrid:(123 125 139 .... 34823) OR
>>         flrid:(34837 ... 59091) OR
>>         ... OR
>>         flrid:(101294813 ... 103049934))
>> 
>> Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
>> have to subgroup to get past Solr's limit on the number of terms that can
>> be ORed together.
>> 
>> The problem with this approach (besides that it's clunky) is that it seems
>> to scale at roughly O(N^2).  With 1,000 FLRIDs, the search comes back in
>> 50ms or so.  With 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
>> FLRIDs, that jumps to about 75,000ms.  We want it to be on the order of
>> 1,000-2,000ms at most in all cases, up to 100,000 FLRIDs.
>> 
>> How can we do this better?
>> 
>> Things we've tried or considered:
>> 
>> * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
>> improvement.
>> * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
>> * Considered: dumping all the FLRIDs for a given search into another core
>> and doing a join between it and the main core, but if we do five or ten
>> searches per second, it seems like Solr would die from all the commits.
>> The set of FLRIDs is unique between searches so there is no reuse possible.
>> * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
>> instead, so that Solr doesn't have to hit the documents in order to
>> translate FLRID->SolrID to do the matching.
>> 
>> What we're hoping for:
>> 
>> * An efficient way to pass a long set of IDs, or for Solr to be able to
>> pull them from the app's Oracle database.
>> * Have Solr do big ORs as a set operation, not as (what we assume is)
>> naive one-at-a-time matching.
>> * A way to create a match vector that gets passed to the query, because
>> strings of fqs in the query seem to be a suboptimal way to do it.
>> 
>> I've searched SO and the web and found people asking about this type of
>> situation a few times, but no answers that I see beyond what we're doing
>> now.
>> 
>> * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
>> * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
>> * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
>> * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
>> 
>> Thanks,
>> Andy
>> 
>> --
>> Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
>> 
>>