Re: Faceting over limited result set
On Nov 14, 2007 6:44 AM, Mike Klaas [EMAIL PROTECTED] wrote:
> An implementation might look like:
>
>     DocList superlist;
>     int facetDocLimit = params.getInt(DMP.FACET_DOCLIMIT, -1);
>     if (facetDocLimit > 0 && facetDocLimit != req.getLimit()) {
>         superlist = s.getDocList(query, restrictions,
>                                  SolrPluginUtils.getSort(req),
>                                  req.getStart(), facetDocLimit, flags);
>         results.docSet = SearcherUtils.getDocSetFromDocList(superlist, s);
>         results.docList = superlist.subset(0, req.getLimit());
>     } else {
>
> Where getDocSetFromDocList() uses DocSetHitCollector to build a DocSet.
>
> To answer the performance question: There is a gain to be had when
> doing lots of faceting on huge indices, if N is low (say, 500-1000).
> One problem with the implementation above is that it stymies the
> query caching in SolrIndexSearcher (since the generated DocList is
> the cache upper bound).
>
> -Mike

Thanks Mike, that looks like a good place to start.

While I really can't think of any practical use for limiting the size of a DocSet other than simple faceting, the new search component architecture makes it a little more difficult to confine any implementation to only the facet component (unless there is an efficient way to obtain a subset of a DocSet, which there doesn't seem to be).

I'm also aware of the query caching issues arising from SolrIndexSearcher; however, if N is sufficiently low this (hopefully) shouldn't be too much of a problem.

I can't find either the SearcherUtils class or any reference to a getDocSetFromDocList() method in svn trunk. Is this deprecated or custom-built code?

-Piete
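For readers following along, here is a minimal sketch of the idea behind a helper such as getDocSetFromDocList(): take the first N document ids of a ranked list and collect them into an unordered set that faceting can run over. It is written against plain java.util types, not the Solr classes. The class name DocListToSet, the method toDocSet(), and the use of BitSet as a stand-in for a DocSet are all assumptions made for illustration, not the code Mike refers to.

```java
import java.util.BitSet;

// Hypothetical illustration only: BitSet stands in for a Solr DocSet,
// and an int[] of doc ids in sort order stands in for a DocList.
final class DocListToSet {

    // Collect the first `limit` ids of a ranked list into an unordered set.
    static BitSet toDocSet(int[] rankedDocIds, int limit) {
        BitSet docSet = new BitSet();
        int n = Math.max(0, Math.min(limit, rankedDocIds.length));
        for (int i = 0; i < n; i++) {
            docSet.set(rankedDocIds[i]);
        }
        return docSet;
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19, 3, 88};      // ids ordered by score
        BitSet top3 = toDocSet(ranked, 3);      // keeps 42, 7 and 19
        System.out.println(top3.cardinality()); // 3
    }
}
```

Note that the set loses the ranking: once the ids are collected, only membership in the top N matters, which is all that facet counting needs.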
Re: Faceting over limited result set
On 13/11/2007, Chris Hostetter [EMAIL PROTECTED] wrote:
> can you elaborate on your use case ... the only time i've ever seen
> people ask about something like this it was because true facet counts
> were too expensive to compute, so they were doing sampling of the
> first N results. In Solr, sampling like this would likely be just as
> expensive as getting the full count.

It's not really a performance-related issue. The primary goal is to use the facet information to determine the most relevant product category related to the particular search being performed.

Generally the facets returned by simple, generic queries are fine for this purpose (e.g. a search for 'nokia' will correctly return Mobile / Cell Phone as the most frequent facet). However, facet data for more specific searches is not as clear-cut (e.g. 'samsung tv', where TVs will appear at the top of the search results, but the query will also match other 'samsung' products like mobile phones and mp3 players. Obviously I could tweak the 'mm' parameter to fix this particular case, but it wouldn't really solve my problem).

The theory is that facet information generated from the first 'x' (let's say 100) matches to a query (ordered by score / relevance) will be more accurate (for the above purpose) than facets obtained over the entire result set. So ideally, it would be useful to be able to constrain the size of the DocSet somehow (as you mention below).

> matching occurs in increasing order of docid, so even if there was a
> hook to say stop matching after N docs those N wouldn't be a good
> representative sample, they would be biased towards older documents
> (based on when they were indexed, not on any particular date field)
>
> if what you are interested in is stats on the first N docs according
> to a specific sort (score or otherwise) then you could write a custom
> request handler that executed a search with a limit of N, got the
> DocList, iterated over it to build a DocSet, and then used that
> DocSet to do faceting ... but that would probably take even longer
> than just using the full DocSet matching the entire query.

I was hoping to avoid having to write a custom request handler, but your suggestion above sounds like it would do the trick. I'm also debating whether to extract my own facet info from a result set on the client side, but this would be even slower.

Thanks for your suggestions so far,
Piete
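To make the 'samsung tv' theory concrete, here is a small self-contained sketch that counts categories over only the top N matches by score and compares the winner with the one from the full result set. Everything in it is hypothetical: the TopNFacets class, the Doc type, the category names, and the scores are invented for illustration and do not come from Solr or from any real index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: facet counts over the top-N matches by score.
final class TopNFacets {

    static final class Doc {
        final String category;
        final double score;
        Doc(String category, double score) {
            this.category = category;
            this.score = score;
        }
    }

    // Count categories over the n highest-scoring matches only.
    static Map<String, Integer> facetCounts(List<Doc> matches, int n) {
        List<Doc> ranked = new ArrayList<>(matches);
        ranked.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
        int limit = Math.max(0, Math.min(n, ranked.size()));
        Map<String, Integer> counts = new HashMap<>();
        for (Doc d : ranked.subList(0, limit)) {
            counts.merge(d.category, 1, Integer::sum);
        }
        return counts;
    }

    // Most frequent category; ties are broken arbitrarily.
    static String topCategory(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Invented data: TVs score highest, but phones are more numerous.
    static List<Doc> sample() {
        return Arrays.asList(
            new Doc("tv", 9.1), new Doc("tv", 8.7), new Doc("tv", 8.2),
            new Doc("phone", 4.0), new Doc("phone", 3.9),
            new Doc("phone", 3.5), new Doc("phone", 3.1),
            new Doc("mp3", 2.0));
    }

    public static void main(String[] args) {
        List<Doc> matches = sample();
        // Full result set: "phone" wins on sheer volume.
        System.out.println(topCategory(facetCounts(matches, matches.size())));
        // Top 3 by score: "tv" wins.
        System.out.println(topCategory(facetCounts(matches, 3)));
    }
}
```

With this invented data the full result set favours phones, while the top three by score favour TVs, which is the effect described above. Whether real data behaves this way depends on how well the scoring separates the categories.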
Re: Faceting over limited result set
: It's not really a performance-related issue, the primary goal is to use the
: facet information to determine the most relevant product category related to
: the particular search being performed.

ah ... ok, i understand now. the order does matter, you want the top N documents sorted by some criteria (either score, or maybe popularity i would imagine) and then you want to pick the categories based on that.

i had to build this for CNET back before solr went open source, but yes - i did it using a custom subclass of dismax similar to what i described before.

one thing to watch out for is that you probably want to use a consistent sort independent of the user's sort -- if the user re-sorts by price it can be disconcerting for them if that changes the navigation links.

-Hoss
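The point about a consistent sort can be sketched as follows: the categories used for navigation links always come from the top N by score, while the list shown to the user follows whatever sort they picked. This is an illustration under assumed names (StableNavigation, Doc, navCategories, displayOrder) with invented data, not the CNET implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: navigation categories ignore the user's sort.
final class StableNavigation {

    static final class Doc {
        final String category;
        final double score;
        final double price;
        Doc(String category, double score, double price) {
            this.category = category;
            this.score = score;
            this.price = price;
        }
    }

    static final Comparator<Doc> BY_SCORE =
        Comparator.comparingDouble((Doc d) -> d.score).reversed();
    static final Comparator<Doc> BY_PRICE =
        Comparator.comparingDouble((Doc d) -> d.price);

    // Categories for navigation: always taken from the top n by score.
    static Set<String> navCategories(List<Doc> matches, int n) {
        List<Doc> ranked = new ArrayList<>(matches);
        ranked.sort(BY_SCORE);
        int limit = Math.max(0, Math.min(n, ranked.size()));
        Set<String> categories = new TreeSet<>();
        for (Doc d : ranked.subList(0, limit)) {
            categories.add(d.category);
        }
        return categories;
    }

    // The list shown to the user follows whatever sort they chose.
    static List<Doc> displayOrder(List<Doc> matches, Comparator<Doc> userSort) {
        List<Doc> shown = new ArrayList<>(matches);
        shown.sort(userSort);
        return shown;
    }

    public static void main(String[] args) {
        List<Doc> matches = Arrays.asList(
            new Doc("tv", 9.0, 900.0), new Doc("tv", 8.5, 700.0),
            new Doc("phone", 3.0, 100.0), new Doc("mp3", 2.0, 50.0));
        List<Doc> byPrice = displayOrder(matches, BY_PRICE);
        // Same navigation categories before and after the price re-sort.
        System.out.println(navCategories(matches, 2));
        System.out.println(navCategories(byPrice, 2));
    }
}
```

Because navCategories() re-ranks by score internally, re-sorting the displayed list by price leaves the navigation links unchanged. Documents with identical scores could still swap places across calls, so a real implementation would likely want a deterministic tie-breaker.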
Faceting over limited result set
I'm trying to obtain faceting information based on the first 'x' (let's say 100-500) results matching a given (dismax) query. The actual documents matching the query are not important in this case, so intuitively the simplest approach I can think of would be to limit the result set to 'x' documents.

Unfortunately I can't find any easy way to limit the number of documents matched (and returned in the set). It might be possible to achieve the desired result by using a function query + filter query; however, that seems a bit hack-ish, and hopefully I've missed something basic that leads to a simpler solution.

Apologies if this has already been discussed / solved before.

Thanks,
Piete