Re: Faceting over limited result set

2007-11-13 Thread Pieter Berkel

On Nov 14, 2007 6:44 AM, Mike Klaas [EMAIL PROTECTED] wrote:

 An implementation might look like:

  DocList superlist;
  int facetDocLimit = params.getInt(DMP.FACET_DOCLIMIT, -1);
  if (facetDocLimit > 0 && facetDocLimit != req.getLimit()) {
    // Fetch the top facetDocLimit docs, facet over them,
    // and return only the requested page to the client.
    superlist = s.getDocList(query, restrictions,
                             SolrPluginUtils.getSort(req),
                             req.getStart(), facetDocLimit,
                             flags);
    results.docSet = SearcherUtils.getDocSetFromDocList(superlist, s);
    results.docList = superlist.subset(0, req.getLimit());
  } else {

 Where getDocSetFromDocList() uses DocSetHitCollector to build a DocSet.
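
A minimal, Solr-free sketch of that last step, assuming a DocList can be treated as a ranked array of internal doc ids and a DocSet as a bit set over those ids. The class and method names here are made up for illustration; this is not the SearcherUtils helper quoted above, and it does not use DocSetHitCollector.

```java
import java.util.BitSet;

// Hypothetical stand-in, not part of Solr: turns a ranked list of doc ids
// into an unordered set of doc ids suitable for counting facets over.
final class DocListToDocSet {

    /** Sets one bit per doc id in the ranked list; ranking order is discarded. */
    static BitSet fromRankedIds(int[] rankedDocIds, int maxDoc) {
        BitSet docs = new BitSet(maxDoc);
        for (int id : rankedDocIds) {
            if (id < 0 || id >= maxDoc) {
                throw new IllegalArgumentException("doc id out of range: " + id);
            }
            docs.set(id);
        }
        return docs;
    }
}
```

The cost is linear in the length of the list, so for N in the hundreds the conversion itself should be cheap; the expensive part is obtaining the ranked list in the first place.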

 To answer the performance question: There is a gain to be had when
 doing lots of faceting on huge indices, if N is low (say, 500-1000).
 One problem with the implementation above is that it stymies the
 query caching in SolrIndexSearcher (since the generated DocList is
 larger than the cache upper bound).

 -Mike

Thanks Mike, that looks like a good place to start.  While I really
can't think of any practical use for limiting the size of a DocSet other
than simple faceting, the new search component architecture makes it a
little more difficult to confine any implementation to the facet
component alone (unless there is an efficient way to obtain a subset of a
DocSet, which there doesn't seem to be).  I'm also aware of the query
caching issues arising from SolrIndexSearcher; however, if N is
sufficiently low this (hopefully) shouldn't be too much of a problem.

I can find neither the SearcherUtils class nor any reference to a
getDocSetFromDocList() method in svn trunk.  Is this deprecated or
custom-built code?

-Piete

Re: Faceting over limited result set

2007-11-12 Thread Pieter Berkel

On 13/11/2007, Chris Hostetter [EMAIL PROTECTED] wrote:


 can you elaborate on your use case ... the only time i've ever seen people
 ask about something like this it was because true facet counts were too
 expensive to compute, so they were doing sampling of the first N
 results.

 In Solr, sampling like this would likely be just as expensive as getting
 the full count.


It's not really a performance-related issue; the primary goal is to use the
facet information to determine the most relevant product category related to
the particular search being performed.

Generally the facets returned by simple, generic queries are fine for this
purpose (e.g. a search for "nokia" will correctly return "Mobile / Cell
Phone" as the most frequent facet); however, facet data for more specific
searches is not as clear-cut (e.g. for "samsung tv", TVs will appear at
the top of the search results, but the query will also match other Samsung
products like mobile phones and mp3 players.  Obviously I could tweak the
'mm' parameter to fix this particular case, but it wouldn't really solve my
problem).

The theory is that facet information generated from the first 'x' (let's say
100) matches to a query (ordered by score / relevance) will be more accurate
(for the above purpose) than facets obtained over the entire result set.  So
ideally, it would be useful to be able to constrain the size of the DocSet
somehow (as you mention below).
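
To make the idea concrete, here is a small sketch that counts a category facet over only the top-scoring 'x' hits and picks the most frequent value. It is plain Java with no Solr dependency; Hit, TopNFacets, and the tie-breaking rule are assumptions made for the example, not anything Solr provides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: Hit and TopNFacets are hypothetical names, not Solr classes.
final class TopNFacets {

    /** One matching document: its category facet value and its relevance score. */
    static final class Hit {
        final String category;
        final float score;

        Hit(String category, float score) {
            this.category = category;
            this.score = score;
        }
    }

    /** Counts category values over the n highest-scoring hits only. */
    static Map<String, Integer> countTopN(List<Hit> hits, int n) {
        List<Hit> ranked = new ArrayList<>(hits);
        ranked.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.min(Math.max(n, 0), ranked.size());
        Map<String, Integer> counts = new TreeMap<>();
        for (Hit h : ranked.subList(0, limit)) {
            counts.merge(h.category, 1, Integer::sum);
        }
        return counts;
    }

    /** Most frequent category among the top n hits, or null if there are no hits. */
    static String topCategory(List<Hit> hits, int n) {
        String best = null;
        int bestCount = 0;
        // TreeMap iterates in name order, so ties go to the alphabetically first category.
        for (Map.Entry<String, Integer> e : countTopN(hits, n).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With three high-scoring "TV" hits, four lower-scoring "Mobile" hits and one "MP3" hit, topCategory(hits, 3) picks "TV", while counting over all eight hits picks "Mobile", which is the difference described above.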


 matching occurs in increasing order of docid, so even if there was a hook
 to say "stop matching after N docs" those N wouldn't be a good
 representative sample, they would be biased towards older documents
 (based on when they were indexed, not on any particular date field)

 if what you are interested in is stats on the first N docs according to a
 specific sort (score or otherwise) then you could write a custom request
 handler that executed a search with a limit of N, got the DocList,
 iterated over it to build a DocSet, and then used that DocSet to do
 faceting ... but that would probably take even longer than just using the
 full DocSet matching the entire query.



I was hoping to avoid having to write a custom request handler but your
suggestion above sounds like it would do the trick.  I'm also debating
whether to extract my own facet info from a result set on the client side,
but this would be even slower.

Thanks for your suggestions so far,
Piete

Re: Faceting over limited result set

2007-11-12 Thread Chris Hostetter


: It's not really a performance-related issue, the primary goal is to use the
: facet information to determine the most relevant product category related to
: the particular search being performed.

ah ... ok, i understand now.  the order does matter, you want the top N 
documents sorted by some criteria (either score, or maybe popularity i 
would imagine) and then you want to pick the categories based on that.

i had to build this for CNET back before solr went open source, but yes -
i did it using a custom subclass of dismax similar to what i
described before.

one thing to watch out for is that you probably want to use a consistent 
sort independent of the user's sort -- if the user re-sorts by price it 
can be disconcerting for them if that changes the navigation links.
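
A rough sketch of that separation, in plain Java with made-up names (Product, StableNavigation): the navigation categories are always derived from a fixed relevance ordering, while the user's chosen sort only affects the order in which results are displayed. It is an illustration of the idea under those assumptions, not the CNET implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: Product and StableNavigation are hypothetical names.
final class StableNavigation {

    static final class Product {
        final String category;
        final float score;
        final double price;

        Product(String category, float score, double price) {
            this.category = category;
            this.score = score;
            this.price = price;
        }
    }

    /** Categories of the n most relevant products, in relevance order; the user's sort plays no part. */
    static Set<String> navigationCategories(List<Product> matches, int n) {
        List<Product> byScore = new ArrayList<>(matches);
        byScore.sort(Comparator.comparingDouble((Product p) -> p.score).reversed());
        int limit = Math.min(Math.max(n, 0), byScore.size());
        Set<String> categories = new LinkedHashSet<>();
        for (Product p : byScore.subList(0, limit)) {
            categories.add(p.category);
        }
        return categories;
    }

    /** Orders the visible results by whatever sort the user picked. */
    static List<Product> displayOrder(List<Product> matches, Comparator<Product> userSort) {
        List<Product> sorted = new ArrayList<>(matches);
        sorted.sort(userSort);
        return sorted;
    }
}
```

Because navigationCategories never sees the user's comparator, re-sorting by price changes displayOrder but leaves the navigation links as they were.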


-Hoss

Faceting over limited result set

2007-11-11 Thread Pieter Berkel

I'm trying to obtain faceting information based on the first 'x' (let's say
100-500) results matching a given (dismax) query.  The actual documents
matching the query are not important in this case, so intuitively the
simplest approach I can think of would be to limit the result set to 'x'
documents.

Unfortunately I can't find any easy way to limit the number of documents
matched (and returned in the set).  It might be possible to achieve the
desired result by using a function query + filter query; however, that seems
a bit hack-ish and hopefully I've missed something basic that leads to a
simpler solution.

Apologies if this has already been discussed / solved before.

Thanks,
Piete
