[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795067#action_12795067
 ] 

Stanislaw Osinski commented on SOLR-236:
----------------------------------------

Hi Grant,

{quote}
I would note, in looking at the Carrot2 code, they actually have a 
ByFieldClusteringAlgorithm (what they call synthetic clustering) which does 
field collapsing/clustering on a value of a field. To quote the javadocs:

Clusters documents into a flat structure based on the values of some field of 
the documents. By default the {@link Document#SOURCES} field is used and  
Name of the field to cluster by. Each non-null scalar field value with distinct 
hash code will give rise to a single cluster, named using the {@link 
Object#toString()} value of the field. If the field value is a collection, the 
document will be assigned to all clusters corresponding to the values in the 
collection. Note that arrays will not be 'unfolded' in this way.

I don't know how it performs, but it seems like it would at least be worth 
investigating.
{quote}

Carrot2's {{ByFieldClusteringAlgorithm}} is very simple. It literally throws 
everything into a hash map based on the field value ([source 
code|http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-algorithm-synthetic/src/org/carrot2/clustering/synthetic/ByFieldClusteringAlgorithm.java?r=trunk#l99]).
 This algorithm is used in our live demo to [cluster by news 
source|http://search.carrot2.org/stable/search?source=boss-news&query=iphone&algorithm=source].
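
To make the "hash map" description concrete, here is a minimal sketch of that kind of grouping. It is not the Carrot2 source: the class {{ByFieldSketch}}, the method {{clusterByField}}, and the modelling of documents as plain maps are all assumptions made for illustration. It also keys clusters by the {{toString()}} value, whereas the javadoc quoted above describes distinct hash codes, so treat that as a simplification.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Carrot2's ByFieldClusteringAlgorithm.
final class ByFieldSketch {

    /**
     * Groups documents (modelled here as plain maps) by the value of one field.
     * The returned map goes from cluster label to the documents in that cluster.
     */
    static Map<String, List<Map<String, Object>>> clusterByField(
            List<Map<String, Object>> documents, String fieldName) {
        Map<String, List<Map<String, Object>>> clusters = new LinkedHashMap<>();
        for (Map<String, Object> doc : documents) {
            Object value = doc.get(fieldName);
            if (value == null) {
                continue; // documents without the field join no cluster
            }
            if (value instanceof Collection<?>) {
                // A collection value assigns the document to one cluster per element.
                for (Object element : (Collection<?>) value) {
                    if (element != null) {
                        add(clusters, element, doc);
                    }
                }
            } else {
                // Scalars (and arrays, which are not unfolded) form a single cluster.
                add(clusters, value, doc);
            }
        }
        return clusters;
    }

    private static void add(Map<String, List<Map<String, Object>>> clusters,
                            Object value, Map<String, Object> doc) {
        // The cluster label is the toString() value of the field.
        clusters.computeIfAbsent(String.valueOf(value), k -> new ArrayList<>()).add(doc);
    }
}
```

One pass over the documents is enough, so the cost should be roughly linear in the number of documents plus the number of collection elements.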

{quote}
Note, they also have a synthetic one for collapsing based on URL: 
ByUrlClusteringAlgorithm
{quote}

This one creates a [hierarchy based on the URL 
segments|http://search.carrot2.org/stable/search?source=boss-web&query=solr&algorithm=url&results=200]
 and might be useful to create "by-domain" collapsing if needed.
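
As a rough illustration of how a URL-based hierarchy could support "by-domain" collapsing, the sketch below builds a tree from the host name labels of each URL, read right to left ({{org}}, then {{apache}}, then {{lucene}}). This is an assumption about one possible approach, not a description of how {{ByUrlClusteringAlgorithm}} is implemented; {{UrlHierarchySketch}}, {{Node}}, {{build}} and {{count}} are hypothetical names.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Carrot2's ByUrlClusteringAlgorithm.
final class UrlHierarchySketch {

    /** One level of the hierarchy: child labels plus the URLs that end exactly here. */
    static final class Node {
        final Map<String, Node> children = new TreeMap<>();
        final List<String> urls = new ArrayList<>();
    }

    /** Builds a tree keyed by host labels, most general label first. */
    static Node build(List<String> urls) {
        Node root = new Node();
        for (String url : urls) {
            // URI.create throws IllegalArgumentException on malformed input.
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs without a host part
            }
            String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
            Node node = root;
            for (int i = labels.length - 1; i >= 0; i--) {
                node = node.children.computeIfAbsent(labels[i], k -> new Node());
            }
            node.urls.add(url);
        }
        return root;
    }

    /** Counts all URLs at or below a node, e.g. everything under one domain. */
    static int count(Node node) {
        int total = node.urls.size();
        for (Node child : node.children.values()) {
            total += count(child);
        }
        return total;
    }
}
```

With such a tree, collapsing by domain would amount to picking a depth (two labels for names like {{apache.org}}) and treating each subtree at that depth as one group. Choosing the depth by label count is naive and would mishandle suffixes such as {{co.uk}}.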

In general, my rough guess is that the criteria for content-based 
collapsing would be closer to duplicate detection than to the type of 
grouping Carrot2 produces.

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-5.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
> SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
> SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
