Hi Andy,

This seems like a common type of operation, and I would also be curious
what others think. My take on it: build a compressed intbitset of the IDs
on the client, send it along with the query, then have the handler
decompress/deserialize it and use it as a filter query. We have already
done experiments with intbitsets, and they are fast to send and receive.
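
Something like this on the client side (a plain-Java sketch of the idea
only; we actually use intbitset, and the class name and the "idfilter"
parameter mentioned in the comments are made up):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.Base64;
    import java.util.BitSet;
    import java.util.zip.GZIPOutputStream;

    public class FlridFilterEncoder {
        // Pack a set of FLRIDs into a gzipped, base64-encoded bitset
        // small enough to travel as a single request parameter.
        public static String encode(int[] flrids) throws IOException {
            BitSet bits = new BitSet();
            for (int id : flrids) {
                bits.set(id);  // one bit per FLRID
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(bits.toByteArray());
            }
            // URL-safe base64 so it survives as e.g. &idfilter=...
            return Base64.getUrlEncoder().encodeToString(buf.toByteArray());
        }
    }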

For a sense of how fast this can be, have a look at page 20 of:
http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
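
On the handler side, the reverse (again just a sketch; wiring the
resulting bitset into an actual Solr PostFilter or SearchComponent is the
part that would need to be written):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.Base64;
    import java.util.BitSet;
    import java.util.zip.GZIPInputStream;

    public class FlridFilterDecoder {
        // Reverse of the client-side encoding: base64 -> gunzip -> BitSet.
        public static BitSet decode(String param) throws IOException {
            byte[] compressed = Base64.getUrlDecoder().decode(param);
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            try (GZIPInputStream gz = new GZIPInputStream(
                    new ByteArrayInputStream(compressed))) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = gz.read(chunk)) != -1) {
                    raw.write(chunk, 0, n);
                }
            }
            return BitSet.valueOf(raw.toByteArray());
        }
    }

Once the BitSet is rebuilt, bits.get(flrid) is an O(1) membership test, so
the filter can check each candidate document instead of evaluating a
100,000-term boolean query.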

It is not on my immediate list of tasks, but if you want to help, it can
be done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester <a...@petdance.com> wrote:

> We've got an 11,000,000-document index.  Most documents have a unique ID
> called "flrid", plus a different ID called "solrid" that is Solr's PK.  For
> some searches, we need to be able to limit the search to a subset of
> documents defined by a list of FLRID values.  The list of FLRIDs can change
> between every search, and it is rare enough to call it "never" that any two
> searches will have the same set of FLRIDs to limit on.
>
> What we're doing right now is, roughly:
>
>     q=title:dogs AND
>         (flrid:(123 125 139 .... 34823) OR
>          flrid:(34837 ... 59091) OR
>          ... OR
>          flrid:(101294813 ... 103049934))
>
> Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
> have to subgroup like that to get past Solr's maxBooleanClauses limit on the
> number of terms that can be ORed together.
>
> The problem with this approach (besides being clunky) is that it seems to
> scale at roughly O(N^2).  With 1,000 FLRIDs, the search comes back in 50ms
> or so.  With 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
> FLRIDs, that jumps to about 75,000ms.  We want it to be on the order of
> 1,000-2,000ms at most in all cases up to 100,000 FLRIDs.
>
> How can we do this better?
>
> Things we've tried or considered:
>
> * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
> improvement.
> * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
> * Considered: dumping all the FLRIDs for a given search into another core
> and doing a join between it and the main core, but if we do five or ten
> searches per second, it seems like Solr would die from all the commits.
> The set of FLRIDs is unique between searches, so there is no reuse possible.
> * Considered: translating FLRIDs to SolrIDs and then limiting on SolrIDs
> instead, so that Solr doesn't have to hit the documents to translate
> FLRID->SolrID to do the matching.
>
> What we're hoping for:
>
> * An efficient way to pass a large set of IDs, or for Solr to be able to
> pull them from the app's Oracle database.
> * Having Solr treat big ORs as a set operation, not as (what we assume is)
> naive one-at-a-time matching.
> * A way to create a match vector that gets passed to the query, because
> long strings of fq terms in the query seem to be a suboptimal way to do it.
>
> I've searched Stack Overflow and the web and found people asking about this
> type of situation a few times, but no answers that I can see beyond what
> we're doing now:
>
> * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
> * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
> * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
> * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
>
> Thanks,
> Andy
>
> --
> Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance