Hi Scott, When intersecting the two sets, Lucene has the advantage that both sets are sorted. So Lucene can perform a merge join on the two sets in a single pass. This turns out to be a very fast operation.
Where you'll run into performance issues is if there are a huge number of documents that match the entire query that need to be scored/ranked. I wouldn't start worrying about this either though until your dealing with results with millions of documents. Joel On Fri, Nov 8, 2013 at 8:43 AM, Erick Erickson <erickerick...@gmail.com>wrote: > Have you tried this and measured or is this theoretical? Because > filter queries are _designed_ for this kind of use case. > > bq: If the user has 100 documents, then finding the intersection > requires checking each list ~100 times > > The cached fq is a bitset. Before checking each document, > all that has to happen to "check the list" is index into the bitset and > see if the bit is turned on. If it isn't, the document is bypassed. > > > The lots of cores solution has this drawback. The first time a query > comes in for a particular core, it may be loaded, which will be > noticeably slow, so your users have to be able to tolerate first-time > searches that take a bit of time. That said, test to see if it's > "fast enough" before settling on the solution. > > But really, I'd bypass this and just try the filter query solution and > measure. Because I'd be surprised if you really had performance > issues here, assuming your filter queries are indeed cached and > re-used. > > Best, > Erick > > > On Thu, Nov 7, 2013 at 7:02 PM, Scott Schneider < > scott_schnei...@symantec.com> wrote: > > > Digging a bit more, I think I have answered my own questions. Can > someone > > please say if this sounds right? > > > > http://wiki.apache.org/solr/LotsOfCores looks like a pretty good > > solution. If I give each user his own shard, each query can be run in > only > > one shard. The effect of the filter query will basically be to find that > > shard. The requirements listed on the wiki suggest that performance will > > be good. But in Solr 3.x, this won't scale with the # users/shards. > > > > Prepending a user id to indexed keywords using an analyzer will break > > wildcard search. If there is a wildcard, the query analyzer doesn't run > > filters, so it won't prepend the user id. I could prepend the user id > > myself before calling Solr, but that seems... bad. > > > > Scott > > > > > > > > > -----Original Message----- > > > From: Scott Schneider [mailto:scott_schnei...@symantec.com] > > > Sent: Thursday, November 07, 2013 2:03 PM > > > To: solr-user@lucene.apache.org > > > Subject: RE: fq efficiency > > > > > > Thanks, that link is very helpful, especially the section, "Leapfrog, > > > anyone?" This actually seems quite slow for my use case. Suppose we > > > have 10,000 users and 1,000,000 documents. We search for "hello" for a > > > particular user and let's assume that the fq set for the user is > > > cached. "hello" is a common word and perhaps 10,000 documents will > > > match. If the user has 100 documents, then finding the intersection > > > requires checking each list ~100 times. If the user has 1,000 > > > documents, we check each list ~1,000 times. That doesn't scale well. > > > > > > My searches are usually in one user's data. How can I take advantage > > > of that? I could have a separate index for each user, but loading so > > > many indexes at once seems infeasible; and dynamically loading & > > > unloading indexes is a pain. > > > > > > Or I could create a filter that takes tokens and prepends them with the > > > user id. That seems like a good solution, since my keyword searches > > > always include a user id (and usually just 1 user id). Though I wonder > > > if there is a downside I haven't thought of. > > > > > > Thanks, > > > Scott > > > > > > > > > > -----Original Message----- > > > > From: Shawn Heisey [mailto:s...@elyograg.org] > > > > Sent: Tuesday, November 05, 2013 4:35 PM > > > > To: solr-user@lucene.apache.org > > > > Subject: Re: fq efficiency > > > > > > > > On 11/5/2013 3:36 PM, Scott Schneider wrote: > > > > > I'm wondering if filter queries are efficient enough for my use > > > > cases. I have lots and lots of users in a big, multi-tenant, sharded > > > > index. To run a search, I can use an fq on the user id and pass in > > > the > > > > search terms. Does this scale well with the # users? I suppose > > > that, > > > > since user id is indexed, generating the filter data (which is > > > cached) > > > > will be fast. And looking up search terms is fast, of course. But > > > if > > > > the search term is a common one that many users have in their > > > > documents, then Solr may have to perform an intersection between two > > > > large sets: docs from all users with the search term and all of the > > > > current user's docs. > > > > > > > > > > Also, how about auto-complete and searching with a trailing > > > wildcard? > > > > As I understand it, these work well in a single-tenant index because > > > > keywords are sorted in the index, so it's easy to get all the search > > > > terms that match "foo*". In a multi-tenant index, all users' > > > keywords > > > > are stored together. So if Lucene were to look at all the keywords > > > > from "foo" to "foozzzzz" (I'm not sure if it actually does this), it > > > > would skip over a large majority of keywords that don't belong to > > > this > > > > user. > > > > > > > > From what I understand, there's not really a whole lot of difference > > > > between queries and filter queries when they are NOT cached, except > > > > that > > > > the main query and the filter queries are executed in parallel, which > > > > can save time. > > > > > > > > When filter queries are found in the filterCache, it's a different > > > > story. They get applied *before* the main query, which means that > > > the > > > > main query won't have to work as hard. The filterCache stores > > > > information about which documents in the entire index match the > > > filter. > > > > By storing it as a bitset, the amount of space required is relatively > > > > low. Applying filterCache results is very efficient. > > > > > > > > There are also advanced techniques, like assigning a cost to each > > > > filter > > > > and creating postfilters: > > > > > > > > http://yonik.com/posts/advanced-filter-caching-in-solr/ > > > > > > > > Thanks, > > > > Shawn > > > > > -- Joel Bernstein Search Engineer at Heliosearch