I think we're speaking of a use case where the user wants to limit the search
to a collection of documents, but there is no unifying (easy) way to select
those documents - besides a long query: id:1 OR id:5 OR id:90...
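As an aside, the chunked-OR workaround Andy describes below (grouping IDs into sub-clauses to stay under Solr's boolean-clause limit) can be sketched roughly like this; the field name `flrid` and the 1,000-ID chunk size are taken from his example, and the helper name is hypothetical:

```python
def build_flrid_filter(flrids, chunk_size=1000):
    """Group IDs into OR'ed sub-clauses to stay under Solr's
    boolean-clause limit (maxBooleanClauses, 1024 by default)."""
    clauses = []
    for i in range(0, len(flrids), chunk_size):
        chunk = flrids[i:i + chunk_size]
        clauses.append("flrid:(%s)" % " ".join(str(f) for f in chunk))
    return "(" + " OR ".join(clauses) + ")"

print(build_flrid_filter([123, 125, 139], chunk_size=2))
# -> (flrid:(123 125) OR flrid:(139))
```

The query string itself stays valid Solr syntax, but as the thread shows, evaluating it still costs one term lookup per ID, which is where the latency goes.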
And no, the latency of several hundred milliseconds is perfectly achievable
with several hundred thousand ids - you should explore the link...

roman

On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> First, terms used to subset the index should be a filter query, not part
> of the main query. That may help, because the filter query terms are not
> used for relevance scoring.
>
> Have you done any system profiling? Where is the bottleneck: CPU or disk?
> There is no point in optimising things before you know the bottleneck.
>
> Also, your latency goals may be impossible. Assume roughly one disk access
> per term in the query. You are not going to be able to do 100,000 random
> access disk IOs in 2 seconds, let alone process the results.
>
> wunder
>
> On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:
>
> > Hi Andy,
> >
> > It seems like a common type of operation and I would also be curious
> > what others think. My take on this is to create a compressed intbitset
> > and send it as a query filter, then have the handler
> > decompress/deserialize it and use it as a filter query. We have already
> > done experiments with intbitsets and it is fast to send/receive.
> >
> > Look at page 20:
> > http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
> >
> > It is not on my immediate list of tasks, but if you want to help, it
> > can be done sooner.
> >
> > roman
> >
> > On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester <a...@petdance.com> wrote:
> >
> > > We've got an 11,000,000-document index. Most documents have a unique
> > > ID called "flrid", plus a different ID called "solrid" that is Solr's
> > > PK. For some searches, we need to be able to limit the searches to a
> > > subset of documents defined by a list of FLRID values.
> > > The list of FLRID values can change between every search, and it will
> > > be rare enough to call it "never" that any two searches will have the
> > > same set of FLRIDs to limit on.
> > >
> > > What we're doing right now is, roughly:
> > >
> > > q=title:dogs AND
> > >   (flrid:(123 125 139 .... 34823) OR
> > >    flrid:(34837 ... 59091) OR
> > >    ... OR
> > >    flrid:(101294813 ... 103049934))
> > >
> > > Each of those parentheticals can be 1,000 FLRIDs strung together. We
> > > have to subgroup to get past Solr's limit on the number of terms that
> > > can be ORed together.
> > >
> > > The problem with this approach (besides that it's clunky) is that it
> > > seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes
> > > back in 50ms or so. If we have 10,000 FLRIDs, it comes back in
> > > 400-500ms. With 100,000 FLRIDs, that jumps up to about 75,000ms. We
> > > want it to be on the order of 1000-2000ms at most in all cases up to
> > > 100,000 FLRIDs.
> > >
> > > How can we do this better?
> > >
> > > Things we've tried or considered:
> > >
> > > * Tried: Using dismax with minimum-match mm:0 to simulate an OR
> > >   query. No improvement.
> > > * Tried: Putting the FLRIDs into the fq instead of the q. No
> > >   improvement.
> > > * Considered: Dumping all the FLRIDs for a given search into another
> > >   core and doing a join between it and the main core, but if we do
> > >   five or ten searches per second, it seems like Solr would die from
> > >   all the commits. The set of FLRIDs is unique between searches, so
> > >   there is no reuse possible.
> > > * Considered: Translating FLRIDs to SolrIDs and then limiting on
> > >   SolrID instead, so that Solr doesn't have to hit the documents in
> > >   order to translate FLRID->SolrID to do the matching.
> > >
> > > What we're hoping for:
> > >
> > > * An efficient way to pass a long set of IDs, or for Solr to be able
> > >   to pull them from the app's Oracle database.
> > > * Have Solr do big ORs as a set operation, not as (what we assume is)
> > >   naive one-at-a-time matching.
> > > * A way to create a match vector that gets passed to the query,
> > >   because strings of fqs in the query seems to be a suboptimal way to
> > >   do it.
> > >
> > > I've searched SO and the web and found people asking about this type
> > > of situation a few times, but no answers that I see beyond what we're
> > > doing now.
> > >
> > > * http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
> > > * http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
> > > * http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
> > > * http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
> > >
> > > Thanks,
> > > Andy
> > >
> > > --
> > > Andy Lester => a...@petdance.com => www.petdance.com => AIM:petdance
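The compressed-intbitset idea Roman proposes above can be sketched client-side as follows. This is a minimal illustration, not the actual wire format from the linked slides: the choice of zlib + base64 and the little-endian bit order are assumptions, and a real deployment would need a matching decoder inside the Solr handler.

```python
import base64
import zlib

def encode_ids(ids):
    """Serialize a set of non-negative int IDs as a bitset,
    then zlib-compress and base64-encode it for transport."""
    nbits = max(ids) + 1
    buf = bytearray((nbits + 7) // 8)
    for i in ids:
        buf[i >> 3] |= 1 << (i & 7)  # set bit i (little-endian bit order)
    return base64.b64encode(zlib.compress(bytes(buf))).decode("ascii")

def decode_ids(payload):
    """Inverse of encode_ids: recover the ID set from the payload."""
    buf = zlib.decompress(base64.b64decode(payload))
    return {i * 8 + b for i, byte in enumerate(buf)
            for b in range(8) if byte >> b & 1}

ids = {1, 5, 90, 100000}
assert decode_ids(encode_ids(ids)) == ids
```

Because a bitset costs one bit per candidate ID before compression, even a few hundred thousand IDs compress to a payload small enough to POST with the query, which is what makes the sub-second latencies Roman mentions plausible compared with a 100,000-term OR clause.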