[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754511#action_12754511 ]
Martijn van Groningen commented on SOLR-236:
--------------------------------------------

Hi Oleg, no, I have not made any progress. I'm still not clear on how to solve it in an efficient manner, as I wrote in my previous comment:

{quote}
I was trying to come up with a solution to implement distributed field collapsing, but I ran into a problem that I could not solve in an efficient manner.

Field collapsing keeps track of the number of documents collapsed per unique field value and the total count of documents encountered per unique field value. If the total count is greater than the specified collapse threshold, then the number of documents collapsed is the difference between the total count and the threshold. Let's say we have two shards, and each shard has one document with the same field value. The collapse threshold is one, meaning that if we run the collapsing algorithm on each shard individually, neither document will ever be collapsed. But when the algorithm is applied to both shards together, one of the documents must be collapsed; however, neither shard knows that its document is the one to collapse.

There are more situations like the one described above, but it all boils down to the fact that each shard does not have meta information about the other shards in the cluster. Sharing the intermediate collapse results between the shards is in my opinion not an option. If you do that, you also need to share information about documents / fields that have a collapse count of zero. This is totally impractical for large indexes.

Besides that, there is another problem with distributed field collapsing. Field collapsing only keeps the most relevant document in the result set and collapses the less relevant ones. If scoring is used to sort, field collapsing will fail to do this properly, because there is no global scoring (idf).

Does anyone have an idea on how to solve this? The first problem seems related to the same kind of problem that implementing global scoring has.
{quote}

I recently read something about Katta and . Katta facilitates distributed search and has support for global scoring. I'm not completely sure how it is implemented in Katta, but maybe with Katta it is relatively efficient to share the intermediate collapse results between shards.

> Field collapsing
> ----------------
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Emmanuel Keller
> Fix For: 1.5
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
> field-collapse-4-with-solrj.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch,
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch,
> SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
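
For illustration, the two-shard case from the comment at the top of this message can be sketched in a few lines of Python. This is only a minimal sketch, not code from any of the attached patches: the collapse_counts helper, the shard contents, and the field value are all hypothetical, and it models nothing beyond the counting rule stated in the comment (collapsed count = total count minus threshold, when the total exceeds the threshold).

```python
from collections import Counter


def collapse_counts(field_values, threshold):
    """Hypothetical helper: map each field value to its number of collapsed documents."""
    totals = Counter(field_values)
    # Documents beyond the threshold are collapsed; never report a negative count.
    return {value: max(0, total - threshold) for value, total in totals.items()}


# Assumed data: two shards, each holding one document with the same field value.
shard_a = ["example.com"]
shard_b = ["example.com"]
threshold = 1

# Collapsing each shard on its own: neither shard exceeds the threshold.
local_a = collapse_counts(shard_a, threshold)
local_b = collapse_counts(shard_b, threshold)

# Collapsing over the combined documents: one document must be collapsed.
global_counts = collapse_counts(shard_a + shard_b, threshold)

print(local_a, local_b, global_counts)
```

Each shard reports a collapse count of zero while the combined view reports one, which is the mismatch the comment describes: the correct result depends on totals that no single shard can see.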