[
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771155#action_12771155
]
Martijn van Groningen commented on SOLR-236:
--------------------------------------------
It certainly has been going on for a long time :-)
Talking about the last miles, there are a few things on my mind about field
collapsing:
* Change the response format. Currently if I look at the response even I get
confused sometimes about the information returned. The response should be more
structured. Something like this:
{code:xml}
<lst name="collapse_counts">
  <str name="field">venue</str>
  <lst name="results">
    <lst name="233238"> <!-- id of most relevant document of the group -->
      <str name="fieldValue">melkweg</str>
      <int name="collapseCount">2</int>
      <!-- and other CollapseCollector specific collapse information -->
    </lst>
    ...
  </lst>
</lst>
{code}
Currently when doing adjacent field collapsing the _collapse_counts_ gives
results that are unusable. The _collapse_counts_ use the field value as key,
which is not unique for adjacent collapsing, as shown in the example:
{code:xml}
<lst name="collapse_counts">
  <int name="hard">1</int>
  <int name="hard">1</int>
  <int name="electronics">1</int>
  <int name="memory">2</int>
  <int name="monitor">1</int>
</lst>
{code}
* Add the notion of a CollapseMatcher, that decides whether document field
values are equal or not and thus whether they are allowed to be collapsed. This
opens the road for more exotic features like fuzzy field collapsing and
collapsing on more than one field. Also this allows users of the patch to
easily implement their own matching rules.
* Distributed field collapsing. Although I have some ideas on how to get
started, from my perspective it is not going to be performant, because somehow
the field collapse state has to be shared between shards in order to do proper
field collapsing. This state can potentially be a lot of data depending on the
specific search and corpus.
* And maybe add a collapse collector that collects statistics about the most
common field value per collapsed group.
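To make the CollapseMatcher idea and the keying problem a bit more concrete, here is a rough sketch. It is not code from the patch: the CollapseMatcher interface, the Group class and the collapseAdjacent method are all hypothetical names, and I assume here that collapseCount means the number of documents hidden behind the head document. It only shows how a pluggable matcher could decide group membership, and why keying groups by the id of the most relevant document stays unique for adjacent collapsing where the field value does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollapseSketch {

    /** Hypothetical interface: decides whether two field values may be collapsed together. */
    interface CollapseMatcher {
        boolean matches(String groupValue, String candidateValue);
    }

    /** One collapsed group, keyed by the id of its first (most relevant) document. */
    static final class Group {
        final String headDocId;
        final String fieldValue;
        int collapseCount; // assumption: number of documents hidden behind the head document

        Group(String headDocId, String fieldValue) {
            this.headDocId = headDocId;
            this.fieldValue = fieldValue;
        }
    }

    /**
     * Adjacent collapsing: walks the ranked documents and merges a document into the
     * current group only if the matcher accepts its value. Each doc is {id, fieldValue}.
     */
    static List<Group> collapseAdjacent(List<String[]> docs, CollapseMatcher matcher) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (String[] doc : docs) {
            if (current != null && matcher.matches(current.fieldValue, doc[1])) {
                current.collapseCount++;
            } else {
                current = new Group(doc[0], doc[1]);
                groups.add(current);
            }
        }
        return groups;
    }

    /** Made-up sample data in ranked order; "hard" occurs in two non-adjacent runs. */
    static List<String[]> sampleDocs() {
        return Arrays.asList(
            new String[] {"1", "hard"},
            new String[] {"2", "electronics"},
            new String[] {"3", "hard"},
            new String[] {"4", "memory"},
            new String[] {"5", "Memory"},
            new String[] {"6", "monitor"});
    }

    public static void main(String[] args) {
        CollapseMatcher exact = (a, b) -> a.equals(b);
        CollapseMatcher ignoreCase = (a, b) -> a.equalsIgnoreCase(b);

        for (Group g : collapseAdjacent(sampleDocs(), exact)) {
            System.out.println(g.headDocId + " " + g.fieldValue + " " + g.collapseCount);
        }
        System.out.println("groups with case-insensitive matcher: "
            + collapseAdjacent(sampleDocs(), ignoreCase).size());
    }
}
```

With the exact matcher the value "hard" shows up in two separate groups, but their head document ids (1 and 3) are distinct, so they can safely be used as keys in the response. Swapping in the case-insensitive matcher merges "memory" and "Memory" without touching the collapsing loop, which is the kind of extension point I have in mind.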
I think that this is more or less the roadmap from my side for field collapsing
at the moment, but feel free to elaborate on this.
Btw I have recently written a
[blog|http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/]
about field collapsing in general, which might be handy for someone who is
implementing field collapsing.
> Field collapsing
> ----------------
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Emmanuel Keller
> Fix For: 1.5
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch,
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch,
> SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)