[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-236:
-------------------------------

    Attachment: SOLR-236-FieldCollapsing.patch

I updated the patch so that it applies cleanly to trunk. While I was at it, I:
* fixed a few spelling errors
* made the "collapse.type" parameter parsing throw an error if the passed 
value is unknown (rather than quietly using 'normal')
* changed the patch name to include the issue number. As we update the patch, 
we should reuse this same name so it is easy to tell which is the most current.
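
As a rough illustration of the stricter "collapse.type" handling described 
above (not the patch code itself, which is Java), here is a minimal Python 
sketch. The function name and the dict-based parameter lookup are 
hypothetical; the two accepted values and the 'normal' default come from the 
issue description.

```python
# Hypothetical sketch of strict "collapse.type" parsing; not the patch's code.
VALID_COLLAPSE_TYPES = ("normal", "adjacent")


def parse_collapse_type(params):
    """Return the collapse type from a dict of request parameters.

    Falls back to 'normal' only when the parameter is absent; an
    unrecognized value raises instead of silently becoming 'normal'.
    """
    value = params.get("collapse.type")
    if value is None:
        return "normal"
    if value not in VALID_COLLAPSE_TYPES:
        raise ValueError(
            "unknown collapse.type %r, expected one of %s"
            % (value, ", ".join(VALID_COLLAPSE_TYPES))
        )
    return value
```

With this behavior a typo such as collapse.type=adjacant fails loudly rather 
than returning results collapsed in a way the caller did not ask for.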

I also made a wiki page so there are direct links to interesting queries:
http://wiki.apache.org/solr/FieldCollapsing

- - - - - - -

Again, I will leave any discussion about the Lucene implementation to others 
more qualified and will just focus on the response interface.

Currently if you send the query:
http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=normal

you get a response that looks like:
<lst name="collapse_counts">
 <int name="hard">1</int>
 <int name="electronics">2</int>
 <int name="memory">2</int>
 <int name="monitor">1</int>
 <int name="software">1</int>
</lst>

It looks like that says: for the field 'cat', there is one more result with 
cat=hard, 2 more results with cat=electronics, ...
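
To make the client's view concrete, here is a small Python sketch that builds 
the query above and reads the current "collapse_counts" block into a dict. The 
helper names are hypothetical, and the sample XML is only the fragment shown 
above, not a full Solr response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Fragment copied from the response shown above (not a complete response).
SAMPLE_RESPONSE = """
<lst name="collapse_counts">
 <int name="hard">1</int>
 <int name="electronics">2</int>
 <int name="memory">2</int>
 <int name="monitor">1</int>
 <int name="software">1</int>
</lst>
"""


def build_collapse_url(base, field, max_docs=1, collapse_type="normal"):
    """Build a select URL using the three collapse.* parameters."""
    params = [
        ("q", "*:*"),
        ("collapse.field", field),
        ("collapse.max", max_docs),
        ("collapse.type", collapse_type),
    ]
    # Keep '*' and ':' unescaped so the query reads like the one above.
    return base + "?" + urlencode(params, safe="*:")


def parse_collapse_counts(xml_text):
    """Map each indexed token to its collapsed-document count."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): int(el.text) for el in root.findall("int")}
```

Note that the keys of the resulting dict are indexed tokens such as "hard", 
not stored field values or document ids, which is the problem raised next.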

How is a client supposed to know how to deal with that?  "hard" is the 
tokenized version of "hard drive". Unless 'cat' were a 'string' field, the 
client would need to know how to tokenize the value itself; otherwise the 
response needs to change.

From a client's point of view, it would be more useful to have output that 
looked something like:
<lst name="collapse_counts">
 <str name="field">cat</str>
 <lst name="doc">
  <int name="SP2514N">1</int>
  <int name="6H500F0">1</int>
  <int name="VS1GB400C3">2</int>
  <int name="VS1GB400C3">1</int>
 </lst>
 <lst name="count">
  <int name="hard">1</int>
  <int name="electronics">1</int>
  <int name="memory">2</int>
  <int name="monitor">1</int>
 </lst>
</lst>

"field" says what field was collapsed on,
"doc" is a map of doc id -> how many more collapsed on that field
"count" is a map of 'token'-> how many more collapsed on that field

This way, the client would know what collapse counts apply to which documents 
without knowing about the schema.
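
A sketch of how a client might consume the proposed layout, assuming it were 
adopted exactly as written above (it is only a proposal, and the function name 
is hypothetical). The sample is abbreviated to three documents.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the proposed (not yet implemented) layout.
PROPOSED_RESPONSE = """
<lst name="collapse_counts">
 <str name="field">cat</str>
 <lst name="doc">
  <int name="SP2514N">1</int>
  <int name="6H500F0">1</int>
  <int name="VS1GB400C3">2</int>
 </lst>
 <lst name="count">
  <int name="hard">1</int>
  <int name="memory">2</int>
 </lst>
</lst>
"""


def parse_proposed_collapse(xml_text):
    """Return (field, per-document counts, per-token counts)."""
    root = ET.fromstring(xml_text)
    field = root.findtext("str[@name='field']")

    def counts(section):
        # Each section is a <lst> of <int name="key">count</int> entries.
        lst = root.find("lst[@name='%s']" % section)
        if lst is None:
            return {}
        return {el.get("name"): int(el.text) for el in lst.findall("int")}

    return field, counts("doc"), counts("count")


field, by_doc, by_token = parse_proposed_collapse(PROPOSED_RESPONSE)
# A client can now look up "how many more like this" by document id alone.
more_like_first = by_doc.get("SP2514N", 0)
```

The "doc" map is keyed by document id, so the lookup needs no knowledge of how 
the 'cat' field is analyzed.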

Thoughts?






> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.2
>            Reporter: Emmanuel Keller
>         Attachments: collapse_field.patch, collapse_field.patch, 
> field_collapsing.patch, field_collapsing.patch, field_collapsing.patch, 
> field_collapsing_1.1.0.patch, SOLR-236-FieldCollapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version (1.2)
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.