:     q=title:dogs AND 
:         (flrid:(123 125 139 .... 34823) OR 
:          flrid:(34837 ... 59091) OR 
:          ... OR 
:          flrid:(101294813 ... 103049934))

: The problem with this approach (besides that it's clunky) is that it 
: seems to scale as O(N^2) or so.  With 1,000 FLRIDs, the search comes 
: back in 50ms or so.  If we have 10,000 FLRIDs, it comes back in 
: 400-500ms.  With 100,000 FLRIDs, that jumps up to about 75000ms.  We 
: want it to be on the order of 1000-2000ms at most in all cases up to 
: 100,000 FLRIDs.

How are these sets of flrids created/defined?  (Understanding the source 
of the filter information may help inspire alternative suggestions, i.e. 
the XY Problem.)

: * Have Solr do big ORs as a set operation not as (what we assume is) a 
: naive one-at-a-time matching.

It's not as naive as it may seem -- scoring a disjunction like this 
isn't a matter of asking each doc whether it matches each query clause.  
What happens is that, for each segment of the index, each clause of the 
disjunction is asked for the "first" doc it matches in that segment -- 
which for TermQueries like these just means a quick lookup on the 
TermEnum -- and the lowest (internal) doc num returned by any of the 
clauses becomes the "first" match of the whole BooleanQuery.  All of the 
clauses are then asked to skip ahead to their next match, the new lowest 
doc num is taken, and so on.

My point being: I don't think your speed observations are driven by the 
number of documents, but by the number of query clauses -- which 
unfortunately happens to be the same number in your situation.
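
To make that concrete, here's a toy sketch of the pattern (plain Java, 
purely illustrative -- not Lucene's actual code): each clause is a 
cursor over its own sorted postings list, and a priority queue 
repeatedly surfaces the clause with the lowest "next" doc, so the cost 
tracks the number of clauses and their postings, not every doc in the 
index...

    import java.util.PriorityQueue;

    public class DisjunctionDemo {
      /** A cursor over one clause's sorted list of matching doc ids. */
      static class Cursor {
        final int[] docs;
        int pos = 0;
        Cursor(int... docs) { this.docs = docs; }
        boolean exhausted() { return pos >= docs.length; }
        int doc() { return docs[pos]; }
        void next() { pos++; }
      }

      public static void main(String[] args) {
        // Three "TermQuery" clauses, each with its own postings list.
        PriorityQueue<Cursor> pq =
            new PriorityQueue<>((a, b) -> a.doc() - b.doc());
        pq.add(new Cursor(2, 9, 40));
        pq.add(new Cursor(5, 9, 17));
        pq.add(new Cursor(1, 40));

        int lastEmitted = -1;
        while (!pq.isEmpty()) {
          Cursor top = pq.poll();   // clause with the lowest "first" doc
          if (top.doc() != lastEmitted) {
            lastEmitted = top.doc();
            System.out.println("match: doc " + lastEmitted);
          }
          top.next();               // skip ahead to its next match
          if (!top.exhausted()) pq.add(top);
        }
      }
    }

With 100,000 clauses, maintaining that queue and doing 100,000 
per-segment term lookups is where the time goes, regardless of how few 
docs actually match.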

: * An efficient way to pass a long set of IDs, or for Solr to be able to 
: pull them from the app's Oracle database.

This can definitely be done; there just isn't a general-purpose, turnkey 
solution for it.  The approach you'd need to take is to implement a 
"PostFilter" containing your custom logic for deciding whether a 
document should be in the result set, and then generate instances of 
your PostFilter implementation in a "QParserPlugin".
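
Here's a rough skeleton of what that looks like (untested, written 
against Solr 7.x-era APIs -- the PostFilter/DelegatingCollector details 
vary a bit by version, and the "flrid" field, the "user" local param, 
and lookupAllowedIds() are all made-up names for the example):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.DelegatingCollector;
    import org.apache.solr.search.ExtendedQueryBase;
    import org.apache.solr.search.PostFilter;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;
    import org.apache.solr.search.SyntaxError;

    public class FlridQParserPlugin extends QParserPlugin {
      @Override
      public QParser createParser(String qstr, SolrParams localParams,
                                  SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
          @Override
          public Query parse() throws SyntaxError {
            // In real life: fetch the allowed IDs from Oracle (see below)
            Set<String> allowed = lookupAllowedIds(localParams.get("user"));
            return new FlridPostFilter(allowed);
          }
        };
      }

      static Set<String> lookupAllowedIds(String user) {
        return new HashSet<>();  // placeholder
      }

      static class FlridPostFilter extends ExtendedQueryBase
          implements PostFilter {
        private final Set<String> allowed;

        FlridPostFilter(Set<String> allowed) { this.allowed = allowed; }

        @Override
        public boolean getCache() { return false; }  // post filters can't be cached

        @Override
        public int getCost() {
          return Math.max(super.getCost(), 100);  // cost >= 100 => post filter
        }

        @Override
        public boolean equals(Object other) {      // Lucene Query requires these
          return other instanceof FlridPostFilter
              && allowed.equals(((FlridPostFilter) other).allowed);
        }

        @Override
        public int hashCode() { return allowed.hashCode(); }

        @Override
        public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
          return new DelegatingCollector() {
            private SortedDocValues flrids;

            @Override
            protected void doSetNextReader(LeafReaderContext context)
                throws IOException {
              super.doSetNextReader(context);
              // assumes flrid is a single-valued string field w/ docValues
              flrids = DocValues.getSorted(context.reader(), "flrid");
            }

            @Override
            public void collect(int doc) throws IOException {
              if (flrids.advanceExact(doc)
                  && allowed.contains(
                      flrids.lookupOrd(flrids.ordValue()).utf8ToString())) {
                super.collect(doc);  // doc passes the filter
              }
            }
          };
        }
      }
    }

You'd register it in solrconfig.xml with something like 
<queryParser name="flrids" class="com.example.FlridQParserPlugin"/> and 
then use it as fq={!flrids user=bob}.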

Here's a blog post with an example of doing this for an ACL type 
situation, where the parser input specifies a "user" and a CSV file is 
consulted to get the list of documents the user is allowed to see...

http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/

...you could follow a similar model where, given some input, you 
generate a query to your Oracle DB to return a Set<String> of IDs to 
consult in the collect method.
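
For the Oracle lookup itself, plain JDBC is enough -- something along 
these lines (untested; the JDBC URL, credentials, table, and column 
names are all invented), which would fill in the lookupAllowedIds() 
placeholder from the skeleton above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashSet;
    import java.util.Set;

    public class FlridLookup {
      /** Fetch the set of FLRIDs a given user is allowed to see. */
      public static Set<String> lookupAllowedIds(String user)
          throws SQLException {
        Set<String> ids = new HashSet<>();
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost:1521/ORCL", "solr_ro", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT flrid FROM user_flrids WHERE username = ?")) {
          ps.setString(1, user);
          try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
              ids.add(rs.getString(1));
            }
          }
        }
        return ids;
      }
    }

Since this runs on every (uncached) request, you'd want a connection 
pool and probably some caching of the resulting sets in practice; the 
per-doc check in collect is then just an O(1) HashSet lookup, even with 
100,000 IDs.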


-Hoss
