[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan McKinley updated SOLR-236:
-------------------------------

    Attachment: SOLR-236-FieldCollapsing.patch

I updated the patch so that it applies cleanly with trunk. While I was at it, I:
 * fixed a few spelling errors
 * made the "collapse.type" parameter parsing throw an error if the passed field is unknown (rather than quietly using 'normal')
 * changed the patch name to include the issue number -- as we update the patch, use this same name again so it is easy to tell which is the most current.

I also made a wiki page so there are direct links to interesting queries:
http://wiki.apache.org/solr/FieldCollapsing

- - - - - - -

Again, I will leave any discussion about the lucene implementation to others more qualified and will just focus on the response interface.

Currently if you send the query:
http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=normal

you get a response that looks like:

<lst name="collapse_counts">
  <int name="hard">1</int>
  <int name="electronics">2</int>
  <int name="memory">2</int>
  <int name="monitor">1</int>
  <int name="software">1</int>
</lst>

It looks like that says: for the field 'cat', there is one more result with cat=hard, 2 more results with cat=electronics, ...

How is a client supposed to know how to deal with that? "hard" is the tokenized version of "hard drive" -- unless it were a 'string' field, the client would need to know how to do that -- or the response needs to change.

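To make the problem concrete, here is a rough client-side sketch (not part of the patch). It uses only the Python standard library; the helper names and the stored 'cat' value "hard drive" for a returned document are assumptions for illustration. It parses the collapse_counts fragment shown above and tries the obvious lookup by stored field value:

```python
import xml.etree.ElementTree as ET

# Response fragment copied from the example query above.
RESPONSE = """
<lst name="collapse_counts">
  <int name="hard">1</int>
  <int name="electronics">2</int>
  <int name="memory">2</int>
  <int name="monitor">1</int>
  <int name="software">1</int>
</lst>
"""


def parse_collapse_counts(xml_text):
    """Return {key: count} from a <lst name="collapse_counts"> element."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): int(el.text) for el in root.findall("int")}


def collapsed_count_for(stored_value, counts):
    """Naive lookup: only works when the key equals the stored field value."""
    return counts.get(stored_value)


counts = parse_collapse_counts(RESPONSE)

# Hypothetical stored value of 'cat' on a returned document.
print(collapsed_count_for("hard drive", counts))   # None: the key is the token "hard"
print(collapsed_count_for("electronics", counts))  # 2
```

The lookup succeeds only when the indexed token happens to equal the stored value, so a client cannot attach a count to a document without knowing how the field is analyzed.
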
From a client, it would be more useful to have output that looked something like:

<lst name="collapse_counts">
  <str name="field">cat</str>
  <lst name="doc">
    <int name="SP2514N">1</int>
    <int name="6H500F0">1</int>
    <int name="VS1GB400C3">2</int>
    <int name="VS1GB400C3">1</int>
  </lst>
  <lst name="count">
    <int name="hard">1</int>
    <int name="electronics">1</int>
    <int name="memory">2</int>
    <int name="monitor">1</int>
  </lst>
</lst>

"field" says what field was collapsed on,
"doc" is a map of doc id -> how many more collapsed on that field
"count" is a map of 'token' -> how many more collapsed on that field

This way, the client would know what collapse counts apply to which documents without knowing about the schema.

thoughts?

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.2
>            Reporter: Emmanuel Keller
>         Attachments: collapse_field.patch, collapse_field.patch, field_collapsing.patch, field_collapsing.patch, field_collapsing.patch, field_collapsing_1.1.0.patch, SOLR-236-FieldCollapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version (1.2)
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.