[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751181#action_12751181
 ] 

Martijn van Groningen commented on SOLR-236:
--------------------------------------------

Hi Thomas,

Comparing my format proposal with yours, the difference is in how I output the 
collapsed documents. I chose to add all collapsed values in one element per 
field, because that makes the response more compact and thus cheaper to send 
over the wire (certainly if the number of collapsed documents to return is large). 
This approach is not standard in Solr, and your result structure is more common. 
I think that most of the time is probably spent reading the collapsed field 
values from the index anyway (I/O), so I think that your result structure is 
probably the best way to go right now.

I think that supporting the 'old' format is not a good idea, because it only 
increases complexity in the code. Also, field collapsing is just a patch 
(although it has been around for a while) and is not a core Solr feature. I 
think people using this patch (and patches in general) should always be aware 
that everything in a patch is subject to change. I think that 
_collapse.response_ should be named something like 
_collapse.includeCollapsedDocs_; when it is specified, the response includes 
the collapsed documents. The _collapse.includeCollapsedDocs.fl_ parameter would 
then restrict the collapsed documents to the specified fields. So specifying 
_collapse.includeCollapsedDocs=true_ would result in the following response:
{code:xml}
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results">
        <lst name="233238">
            <str name="fieldValue">melkweg</str>
            <int name="collapseCount">2</int>
            <lst name="collapsedDocs">
                <doc>
                    <str name="id">233239</str>
                    <str name="name">Foo Bar</str>
                    ...
                </doc>
                <doc>
                    <str name="id">233240</str>
                    <str name="name">Foo Bar 2</str>
                    ...
                </doc>
            </lst>
        </lst>
    </lst>
</lst>
{code}
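As a rough illustration of how a client might request this output, here is a small Python sketch that builds the query string. The parameter names _collapse.includeCollapsedDocs_ and _collapse.includeCollapsedDocs.fl_ are the ones proposed in this comment and are not part of any released Solr API; the base URL and the helper name build_collapse_query are assumptions made for the example.

```python
from urllib.parse import urlencode


def build_collapse_query(base_url, query, collapse_field,
                         include_docs=False, doc_fields=None):
    """Build a select URL using the parameter names proposed in this comment.

    Hypothetical helper: the collapse.includeCollapsedDocs* names are a
    proposal, not a released Solr API.
    """
    params = {"q": query, "collapse.field": collapse_field}
    if include_docs:
        params["collapse.includeCollapsedDocs"] = "true"
        if doc_fields:
            # Restrict the collapsed documents to the listed fields.
            params["collapse.includeCollapsedDocs.fl"] = ",".join(doc_fields)
    return base_url + "?" + urlencode(params)


# Assumed local Solr instance; adjust host and path to your setup.
url = build_collapse_query(
    "http://localhost:8983/solr/select", "*:*", "venue",
    include_docs=True, doc_fields=["id", "name"],
)
print(url)
```

Leaving include_docs at its default omits both proposed parameters, which corresponds to the compact response without collapsed documents.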
Not specifying _collapse.includeCollapsedDocs_ would result in the 
following response output:
{code:xml}
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results">
        <lst name="233238">
            <str name="fieldValue">melkweg</str>
            <int name="collapseCount">2</int>
        </lst>
    </lst>
</lst>
{code}
This will be the default and only response format.
When, for example, _collapse.info.doc=false_ is specified, the following 
result will be returned:
{code:xml}
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results"> 
        <!-- we cannot use the head document id any more, so we use the field value -->
        <lst name="melkweg">
            <int name="collapseCount">2</int>
        </lst>
    </lst>
</lst>
{code}
When _collapse.info.count=false_ is specified, this would just remove the 
_fieldValue_ from the response. I do not know whether many people actually set 
these parameters to false, but it is something to keep in mind. I also 
recently added support for field collapsing to SolrJ in the patch; obviously, 
this has to be updated to the latest response format.
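
To illustrate what a client has to cope with, the following Python sketch reads the _collapse_counts_ element in the variants shown above: groups keyed by head document id or by field value, with or without _fieldValue_ and _collapsedDocs_. It is only an illustration under stated assumptions, not the SolrJ code from the patch; the function name read_collapse_counts and the sample strings (copied from the examples above, minus the elided fields) are made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample with collapsed documents, keyed by head document id.
WITH_DOCS = """\
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results">
        <lst name="233238">
            <str name="fieldValue">melkweg</str>
            <int name="collapseCount">2</int>
            <lst name="collapsedDocs">
                <doc>
                    <str name="id">233239</str>
                    <str name="name">Foo Bar</str>
                </doc>
                <doc>
                    <str name="id">233240</str>
                    <str name="name">Foo Bar 2</str>
                </doc>
            </lst>
        </lst>
    </lst>
</lst>
"""

# Sample for collapse.info.doc=false, keyed by field value.
KEYED_BY_VALUE = """\
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results">
        <lst name="melkweg">
            <int name="collapseCount">2</int>
        </lst>
    </lst>
</lst>
"""


def read_collapse_counts(xml_text):
    """Return (collapse field, list of group dicts); absent parts become None or []."""
    root = ET.fromstring(xml_text)
    groups = []
    for group in root.findall("./lst[@name='results']/lst"):
        count = group.findtext("./int[@name='collapseCount']")
        groups.append({
            # Head document id or field value, depending on the request.
            "key": group.get("name"),
            "fieldValue": group.findtext("./str[@name='fieldValue']"),
            "collapseCount": int(count) if count is not None else None,
            "collapsedDocs": [
                {field.get("name"): field.text for field in doc}
                for doc in group.findall("./lst[@name='collapsedDocs']/doc")
            ],
        })
    return root.findtext("./str[@name='field']"), groups


field, groups = read_collapse_counts(WITH_DOCS)
print(field, groups[0]["key"], len(groups[0]["collapsedDocs"]))
```

Because the group key changes meaning between variants, the sketch keeps it as an opaque "key" and leaves its interpretation to the caller.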

In general, it must be made clear to the Solr user that this feature is handy 
but can severely degrade performance. This is because the response can contain 
a lot of documents and each field value has to be read from the index, which 
results in a lot of I/O activity on the Solr side. In addition, because so much 
data is returned, simply viewing the response in a browser can become quite a 
challenge.

But more importantly, do you think that these changes (response format / 
request parameters) are acceptable?


> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
> field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch, 
> SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

