[ 
https://issues.apache.org/jira/browse/SOLR-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833495#action_12833495
 ] 

Koji Sekiguchi commented on SOLR-1773:
--------------------------------------

Random comment on the patch:

- TimeAllowed not supported
- cache not supported
- distributed search is not supported
- sort field is hard-coded in the patch
- collapse.type=adjacent is not supported
- collapse.aggregate is not supported (but supportable)
- not yet, but collapse.sort can be supported

supported parameters:

|collapse|set to on to use field collapsing|
|collapse.field|field name to collapse (required)|
|collapse.limit|maximum number of collapsed docs to return in each collapse 
group|
|collapse.fl|comma- or space- delimited list of fields to return|


> Field Collapsing (lightweight version)
> --------------------------------------
>
>                 Key: SOLR-1773
>                 URL: https://issues.apache.org/jira/browse/SOLR-1773
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-1773.patch
>
>
> I'd like to start another approach for field collapsing suggested by Yonik on 
> 19/Dec/09 at SOLR-236. Re-posting the idea:
> {code}
> =================== two pass collapsing algorithm for collapse.aggregate=max 
> ====================
> First pass: pretend that collapseCount=1
>   - Use a TreeSet as  a priority queue since one can remove and insert 
> entries.
>   - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to 
> top entry in the TreeSet
>   - compare new doc with smallest element in treeset.  If smaller discard and 
> go to the next doc.
>   - If new doc is bigger, look up it's group.  Use the Map to find if the 
> group has been added to the TreeSet and add it if not.
>   - If the new bigger doc is already in the TreeSet, compare with the 
> document in that group.  If bigger, update the node,
>     remove and re-add to the TreeSet to re-sort.
> efficiency: the treeset and hashmap are both only the size of the top number 
> of docs we are looking at (10 for instance)
> We will now have the top 10 documents collapsed by the right field with a 
> collapseCount of 1.  Put another way, we have the top 10 groups.
> Second pass (if collapseCount>1):
>  - create a priority queue for each group (10) of size collapseCount
>  - re-execute the query (or if the sort within the collapse groups does not 
> involve score, we could just use the docids gathered during phase 1)
>  - for each document, find it's appropriate priority queue and insert
>  - optimization: we can use the previous info from phase1 to even avoid 
> creating a priority queue if no other items matched.
> So instead of creating collapse groups for every group in the set (as is done 
> now?), we create it for only 10 groups.
> Instead of collecting the score for every document in the set (40MB per 
> request for a 10M doc index is *big*) we re-execute the query if needed.
> We could optionally store the score as is done now... but I bet aggregate 
> throughput on large indexes would be better by just re-executing.
> Other thought: we could also cache the first phase in the query cache which 
> would allow one to quickly move to the 2nd phase for any collapseCount.
> {code}
> The restriction is:
> {quote}
> one would not be able to tell the total number of collapsed docs, or the 
> total number of hits (or the DocSet) after collapsing. So only 
> collapse.facet=before would be supported.
> {quote}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to