[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655750#action_12655750
 ] 

Stephen Weiss commented on SOLR-236:
------------------------------------

I'm using Ivan's patch and running into some trouble with faceting...

Basically, I can tell that faceting happens after the collapse, because the 
facet counts are definitely lower than they would be otherwise.  But they 
don't match the collapsed result counts either.  For example, one search 
returns 196 results with no collapsing and 120 results with collapsing, yet 
the facet count is 119.  In other searches the difference is more drastic: 
another search returns 61 results without collapsing and 61 with collapsing, 
but the facet count is 39.

Looking at it for a while now, I think I can guess what the problem might be...

The incorrect counts seem to happen only when the term in question does not 
occur in every duplicate of a document.  That is, multiple document records 
may exist for the same image (it's an image search engine), but each record 
has different terms in different fields depending on the audience it's 
targeting.  So the facet counts computed after collapsing are lower than they 
should be: when you actually run a search with that facet's term included in 
the query, *all* the documents left after collapsing are ones that have that 
term, whereas in the more general query the document chosen to represent a 
group may not have it.

Here's an illustration:

Collapse field is "link_id", facet field is "keyword":


Doc 1:
id: 123456,
link_id: 2,
keyword: Black, Printed, Dress

Doc 2:
id: 123457,
link_id: 2,
keyword: Black, Shoes, Patent

Doc 3:
id: 123458,
link_id: 2,
keyword: Red, Hat, Felt

Doc 4:
id: 123459,
link_id: 1,
keyword: Felt, Hat, Black

So, when you collapse, only two of these documents are in the result set 
(123456, 123459), and only the keywords Black, Printed, Dress, Felt, and Hat 
are counted.  The facet count for Black is 2, the facet count for Felt is 1.  
If you choose Black and add it to your query, you get 2 results (great).  
However, if you add *Felt* to your query, you also get 2 results, not 1 
(because a different document for link_id 2 is chosen in that query than in 
the more general query from which the facets are produced).
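
To make the mismatch concrete, here is a small Python sketch of the four 
documents above.  It is not the patch's code: the function names are made up 
for illustration, and it assumes collapsing simply keeps the first document 
seen for each link_id, which may differ from how the patch picks the 
representative document.

```python
from collections import Counter

# The four documents from the illustration.
DOCS = [
    {"id": 123456, "link_id": 2, "keyword": ["Black", "Printed", "Dress"]},
    {"id": 123457, "link_id": 2, "keyword": ["Black", "Shoes", "Patent"]},
    {"id": 123458, "link_id": 2, "keyword": ["Red", "Hat", "Felt"]},
    {"id": 123459, "link_id": 1, "keyword": ["Felt", "Hat", "Black"]},
]


def collapse(docs, field="link_id"):
    """Assumption: keep the first document seen for each collapse value."""
    first_seen = {}
    for doc in docs:
        first_seen.setdefault(doc[field], doc)
    return list(first_seen.values())


def facet_counts(docs, field="keyword"):
    """Count each term once per document in the given result set."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc[field]))
    return counts


def search(docs, term, field="keyword"):
    """Hypothetical stand-in for adding a facet term to the query."""
    return [doc for doc in docs if term in doc[field]]


# Facets are computed on the collapsed result set of the general query.
collapsed = collapse(DOCS)
facets = facet_counts(collapsed)

# Adding a term to the query filters first and collapses afterwards.
black_results = collapse(search(DOCS, "Black"))
felt_results = collapse(search(DOCS, "Felt"))

print(facets["Black"], len(black_results))  # facet count vs. actual results
print(facets["Felt"], len(felt_results))    # these two disagree
```

The facet count for Felt comes from document 123456, which does not carry the 
term, while the filtered query collapses link_id 2 onto document 123458, which 
does.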

I think what needs to happen here is that all the terms for all the documents 
that are collapsed together need to be included (just once) with the document 
that gets counted for faceting.  In this example, when the document for link_id 
2 is counted, it would need to appear to the facet counter to have keywords 
Black, Printed, Dress, Shoes, Patent, Red, Hat, and Felt, as opposed to just 
Black, Printed, and Dress.
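
A rough Python sketch of that idea, using the same four documents: gather the 
union of keywords for each link_id and count every term once per group.  The 
function name and data layout are hypothetical and only illustrate the 
counting rule, not how it would be wired into the patch or Solr's facet 
component.

```python
from collections import Counter, defaultdict

# The four documents from the illustration.
DOCS = [
    {"id": 123456, "link_id": 2, "keyword": ["Black", "Printed", "Dress"]},
    {"id": 123457, "link_id": 2, "keyword": ["Black", "Shoes", "Patent"]},
    {"id": 123458, "link_id": 2, "keyword": ["Red", "Hat", "Felt"]},
    {"id": 123459, "link_id": 1, "keyword": ["Felt", "Hat", "Black"]},
]


def group_facet_counts(docs, collapse_field="link_id", facet_field="keyword"):
    """Count each term once per collapse group, over all docs in the group."""
    terms_by_group = defaultdict(set)
    for doc in docs:
        # Union of the terms of every document that collapses together.
        terms_by_group[doc[collapse_field]].update(doc[facet_field])

    counts = Counter()
    for terms in terms_by_group.values():
        counts.update(terms)  # each term contributes 1 per group
    return counts


counts = group_facet_counts(DOCS)
for term, count in sorted(counts.items()):
    print(term, count)
```

With this rule the count for Felt becomes 2, which matches the 2 results you 
get when Felt is added to the query.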


> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.4
>
>         Attachments: collapsing-patch-to-1.3.0-ivan.patch, 
> collapsing-patch-to-1.3.0-ivan_2.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, solr-236.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
