[jira] Commented: (SOLR-236) Field collapsing

Yonik Seeley (JIRA) Sat, 19 Dec 2009 14:28:51 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792916#action_12792916
 ]


Yonik Seeley commented on SOLR-236:
-----------------------------------

First, thanks to everyone who has spent so much time working on this - lack of 
committer attention doesn't equate to lack of interest... this is a very much 
needed feature!

I'd agree with Erik that the most important thing is the interface to the 
client, and making it well thought out and semantically "tight". Martijn's 
recent improvements to the response structure is an example of improvements in 
this area. It's also important to think about the interface in terms of how 
easy it will be to add further features, optimizations, and support distributed 
search. If the code isn't sufficiently standalone, we also need to see how 
easily it fits into the rest of Solr (what APIs it adds or modifies, etc). 
Actually implementing performance improvements and more distributed search can 
come later - as long as we've thought about it now so we haven't boxed 
ourselves in.

It seems like field collapsing should just be additional functionality of the 
query component rather than a separate component since it changes the results?

The most basic question about the interface would be how to present groups. Do 
we stick with a linear document list and supplement that with extra info in a 
different part of the response (as the current approach takes)? Or stick that 
extra info in with some of the documents somehow? Or if collapse=true, replace 
the list of documents with a list of groups, each which can contain many 
documents? Which will be easiest for clients to deal with? If you were starting 
from scratch and didn't have to deal with any of Solr's current shortcomings, 
what would it look like?

>From the wiki:
collapse.maxdocs - what does this actually mean? I assume it collects arbitrary 
documents up to the max (normally by index order)? Does this really make sense? 
Does it affect faceting, etc? If it does make sense, it seems like it would 
also make sense for normal non-collapsed query results too, in which case it 
should be implemented at that level.

collapse.info.doc - what does that do? I understand counts per group, but 
what's count per doc?

collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find 
an example on the wiki or blogs. It says "Parameter indicating to return the 
collapsed documents in the response"... but I thought documents were included 
up until collapse.threshold.

collapse.debug - should perhaps just be rolled into debugQuery, or another 
general debug param (someone recently suggested using a comma separated list... 
debug=timings,query, etc.

Should I be able to specify a completely different sort *within* a group? 
collapse.sort=... seems nice... what are the implications? One bit of 
strangeness: it would seem to allow a highly ranked document responsible for 
the group being at the top of the list being dropped from the group due to a 
different sort criteria within the group. It's not necessarily an 
implementation problem though (sort values for the group should be maintained 
separately).

Is there a way to specify the number of groups that I want back instead of the 
number of documents? Or am I supposed to just over-request 
(rows=num_groups_I_want*threshold) and ignore if I get too many documents back?

Random thought: We need a test to make sure this works with multi-select 
faceting (SimpleFacets asks for the docset of be base query...)

Distributed Search: should be able to use the same type of algorithm that 
faceting does to ensure accurate counts.

Performance: yes, it looks like the current code uses a *lot* of memory.
Here's an algorithm that I thought of on my last plane ride that can do much 
better (assuming max() is the aggregation function):

{code}
=================== two pass collapsing algorithm for collapse.aggregate=max 
====================
First pass: pretend that collapseCount=1
  - Use a TreeSet as a priority queue since one can remove and insert entries.
  - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top 
entry in the TreeSet
  - compare new doc with smallest element in treeset. If smaller discard and go 
to the next doc.
  - If new doc is bigger, look up it's group. Use the Map to find if the group 
has been added to the TreeSet and add it if not.
  - If the new bigger doc is already in the TreeSet, compare with the document 
in that group. If bigger, update the node,
    remove and re-add to the TreeSet to re-sort.

efficiency: the treeset and hashmap are both only the size of the top number of 
docs we are looking at (10 for instance)
We will now have the top 10 documents collapsed by the right field with a 
collapseCount of 1. Put another way, we have the top 10 groups.

Second pass (if collapseCount>1):
 - create a priority queue for each group (10) of size collapseCount
 - re-execute the query (or if the sort within the collapse groups does not 
involve score, we could just use the docids gathered during phase 1)
 - for each document, find it's appropriate priority queue and insert
 - optimization: we can use the previous info from phase1 to even avoid 
creating a priority queue if no other items matched.

So instead of creating collapse groups for every group in the set (as is done 
now?), we create it for only 10 groups.
Instead of collecting the score for every document in the set (40MB per request 
for a 10M doc index is *big*) we re-execute the query if needed.
We could optionally store the score as is done now... but I bet aggregate 
throughput on large indexes would be better by just re-executing.

Other thought: we could also cache the first phase in the query cache which 
would allow one to quickly move to the 2nd phase for any collapseCount.
{code} 

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, 
> SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-236) Field collapsing

Reply via email to