I think we are speaking of a use case where the user wants to limit the
search to a collection of documents, but there is no unifying (easy) way
to select those documents - besides a long query: id:1 OR id:5 OR id:90...

And no, a latency of several hundred milliseconds is perfectly achievable
with several hundred thousand IDs - you should explore the link...
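
To give a feel for it, here is a rough sketch of the client side (an
illustration only, not our actual code; java.util.BitSet stands in here
for the compressed intbitset I mean, it assumes the IDs are non-negative
ints that fit in a bitset, and it uses Java 8's Base64):

    import java.io.ByteArrayOutputStream;
    import java.util.Base64;
    import java.util.BitSet;
    import java.util.zip.GZIPOutputStream;

    public class IdSetEncoder {
        // Pack a set of numeric ids into a bitset, gzip it, and
        // base64-encode the result so it can travel as a single
        // request parameter.
        public static String encode(int[] ids) throws Exception {
            BitSet bits = new BitSet();
            for (int id : ids) {
                bits.set(id);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                gzip.write(bits.toByteArray());
            }
            return Base64.getUrlEncoder().encodeToString(out.toByteArray());
        }
    }

Clustered ids compress very well, which is why sending hundreds of
thousands of them is not the problem people expect it to be.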

roman



On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> First, terms used to subset the index should be a filter query, not part
> of the main query. That may help, because the filter query terms are not
> used for relevance scoring.
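>
> For example, with the fields from your message (illustrative; the "..."
> stands in for the rest of each list):
>
>     q=title:dogs
>     fq=flrid:(123 125 139 ... 34823)
>     fq=flrid:(34837 ... 59091)
>
> Each fq is intersected with the main query result as a set operation
> and is never scored.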
>
> Have you done any system profiling? Where is the bottleneck: CPU or disk?
> There is no point in optimising things before you know the bottleneck.
>
> Also, your latency goals may be impossible. Assume roughly one disk access
> per term in the query. You are not going to be able to do 100,000 random
> access disk IOs in 2 seconds, let alone process the results.
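>
> To put numbers on that: at a typical 5-10ms per random seek on rotating
> disk, 100,000 seeks works out to 500-1000 seconds of raw IO, unless the
> relevant parts of the index are already cached in RAM.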
>
> wunder
>
> On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:
>
> > hi Andy,
> >
> > It seems like a common type of operation, and I would also be curious
> > what others think. My take on this is to create a compressed intbitset
> > and send it as a query filter, then have the handler
> > decompress/deserialize it and use it as a filter query. We have already
> > done experiments with intbitsets, and they are fast to send/receive.
> >
> > Look at page 20:
> >
> > http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
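> >
> > On the Solr side, a custom query parser would decode that parameter
> > back into a set and turn it into a filter. A very rough, untested
> > sketch (the class name and "idset" registration are illustrative;
> > TermInSetQuery needs a reasonably recent Lucene, and this assumes
> > flrid is indexed as a plain string field):
> >
> >     import java.io.ByteArrayInputStream;
> >     import java.io.IOException;
> >     import java.util.ArrayList;
> >     import java.util.Base64;
> >     import java.util.BitSet;
> >     import java.util.List;
> >     import java.util.zip.GZIPInputStream;
> >
> >     import org.apache.lucene.search.Query;
> >     import org.apache.lucene.search.TermInSetQuery;
> >     import org.apache.lucene.util.BytesRef;
> >     import org.apache.solr.common.params.SolrParams;
> >     import org.apache.solr.request.SolrQueryRequest;
> >     import org.apache.solr.search.QParser;
> >     import org.apache.solr.search.QParserPlugin;
> >     import org.apache.solr.search.SyntaxError;
> >
> >     public class IdSetQParserPlugin extends QParserPlugin {
> >         @Override
> >         public QParser createParser(String qstr, SolrParams localParams,
> >                                     SolrParams params, SolrQueryRequest req) {
> >             return new QParser(qstr, localParams, params, req) {
> >                 @Override
> >                 public Query parse() throws SyntaxError {
> >                     try {
> >                         BitSet bits = decode(qstr);
> >                         // Walk the set bits and collect them as terms.
> >                         List<BytesRef> terms = new ArrayList<>();
> >                         for (int id = bits.nextSetBit(0); id >= 0;
> >                              id = bits.nextSetBit(id + 1)) {
> >                             terms.add(new BytesRef(Integer.toString(id)));
> >                         }
> >                         return new TermInSetQuery("flrid", terms);
> >                     } catch (IOException e) {
> >                         throw new SyntaxError("cannot decode id set: " + e);
> >                     }
> >                 }
> >             };
> >         }
> >
> >         // Reverse of the client-side encoding: base64 -> gunzip -> bitset.
> >         private static BitSet decode(String blob) throws IOException {
> >             byte[] gz = Base64.getUrlDecoder().decode(blob);
> >             try (GZIPInputStream in =
> >                      new GZIPInputStream(new ByteArrayInputStream(gz))) {
> >                 return BitSet.valueOf(in.readAllBytes()); // Java 9+
> >             }
> >         }
> >     }
> >
> > You would register it in solrconfig.xml with something like
> > <queryParser name="idset" class="com.example.IdSetQParserPlugin"/>
> > (class name hypothetical) and query with fq={!idset}<the encoded blob>.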
> >
> > It is not on my immediate list of tasks, but if you want to help, it
> > can be done sooner.
> >
> > roman
> >
> > On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester <a...@petdance.com> wrote:
> >
> >> We've got an 11,000,000-document index.  Most documents have a unique
> >> ID called "flrid", plus a different ID called "solrid" that is Solr's
> >> PK.  For some searches, we need to be able to limit the search to a
> >> subset of documents defined by a list of FLRID values.  The list of
> >> FLRID values can change between every search, and it will be rare
> >> enough that any two searches have the same set of FLRIDs that we may
> >> as well call it "never".
> >>
> >> What we're doing right now is, roughly:
> >>
> >>    q=title:dogs AND
> >>        (flrid:(123 125 139 .... 34823) OR
> >>         flrid:(34837 ... 59091) OR
> >>         ... OR
> >>         flrid:(101294813 ... 103049934))
> >>
> >> Each of those parenthesized groups can be 1,000 FLRIDs strung
> >> together.  We have to subgroup to get past Solr's limit on the number
> >> of terms that can be ORed together.
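> >>
> >> (If I understand right, that limit is the maxBooleanClauses setting
> >> in solrconfig.xml, which defaults to 1024:
> >>
> >>     <maxBooleanClauses>1024</maxBooleanClauses>
> >>
> >> We could raise it and skip the subgrouping, but that wouldn't change
> >> the underlying cost of the big OR.)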
> >>
> >> The problem with this approach (besides that it's clunky) is that it
> >> seems to perform at O(N^2) or so.  With 1,000 FLRIDs, the search comes
> >> back in 50ms or so.  If we have 10,000 FLRIDs, it comes back in
> >> 400-500ms.  With 100,000 FLRIDs, that jumps up to about 75,000ms.  We
> >> want it to be on the order of 1000-2000ms at most in all cases, up to
> >> 100,000 FLRIDs.
> >>
> >> How can we do this better?
> >>
> >> Things we've tried or considered:
> >>
> >> * Tried: Using dismax with minimum-match mm=0 to simulate an OR query.
> >>   No improvement.
> >> * Tried: Putting the FLRIDs into the fq instead of the q.  No
> >>   improvement.
> >> * Considered: Dumping all the FLRIDs for a given search into another
> >>   core and doing a join between it and the main core (syntax sketched
> >>   below), but if we do five or ten searches per second, it seems like
> >>   Solr would die from all the commits.  The set of FLRIDs is unique
> >>   between searches, so there is no reuse possible.
> >> * Considered: Translating FLRIDs to SolrIDs and then limiting on
> >>   SolrID instead, so that Solr doesn't have to hit the documents in
> >>   order to translate FLRID->SolrID to do the matching.
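> >>
> >> (For the record, the cross-core join we considered would look roughly
> >> like this, where "idcore" is a hypothetical core holding the FLRID
> >> list for the current search:
> >>
> >>     fq={!join fromIndex=idcore from=flrid to=flrid}*:*
> >>
> >> The commit-per-search problem above still applies.)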
> >>
> >> What we're hoping for:
> >>
> >> * An efficient way to pass a long set of IDs, or for Solr to be able to
> >> pull them from the app's Oracle database.
> >> * Have Solr do big ORs as a set operation not as (what we assume is) a
> >> naive one-at-a-time matching.
> >> * A way to create a match vector that gets passed to the query,
> >>   because strings of fqs in the query seem to be a suboptimal way to
> >>   do it.
> >>
> >> I've searched SO and the web and found people asking about this type of
> >> situation a few times, but no answers that I see beyond what we're doing
> >> now.
> >>
> >> * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
> >> * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
> >> * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
> >> * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
> >>
> >> Thanks,
> >> Andy
> >>
> >> --
> >> Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
> >>
