[jira] Commented: (SOLR-236) Field collapsing

Stephen Weiss (JIRA) Fri, 06 Mar 2009 08:38:28 -0800

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679638#action_12679638 ]


Stephen Weiss commented on SOLR-236:
------------------------------------

The machine has 4GB total.  In response to this issue, and especially now that 
we have upgraded it to 64-bit (again, for this issue), we have already 
ordered another 16GB for the machine to try to stave off the problem.  We 
should have it next week.

I restrict commits severely - a commit is only allowed once an hour, and in 
practice they happen even less frequently - perhaps 5 or 6 times a day, and 
very spread out.  We are freakishly paranoid :-)  But honestly that's all we 
need - new documents arrive in chunks and generally they *want* them to go in 
all at once, not piecemeal, so that the site updates cleanly (the commits 
are synchronized with other content updates - new images on the home page, etc).

Some more information... just trying to toss out anything that matters.  We 
have a very small set of possible terms - only 60,000 or so which tokenize to 
perhaps 200,000 total distinct words.  We do not use synonyms at index time 
(only at query time).  We use faceting, collapsing, and sorting - that's about 
it, no MoreLikeThis or spellchecker (although we'd like to, we haven't gotten 
there yet).  We do use faceting heavily, though - there are 16 different fields 
on which we return facet counts.  All these fields together represent no more 
than 15,000 unique terms.  There are approx. 4M documents in the index total, 
and none of them are larger than 1K.

Memory usage on the machine seems to steadily increase - after restart and 
warming, 40% of the RAM on the machine is in use.  Then, as searches come in, 
it steadily increases.  Right now it is using 61%; in an hour it will probably 
be closer to 75% - the danger zone.  This is also unusual because it used to 
stay pretty steady around 52-53%.

This is a multi-core system - we have 2 cores, and the one I'm describing now 
is only one of them.  The other core is very, very small - 8000 documents in 
total, which are also no more than 1K each.  We do use faceting there but no 
collapsing (it is not necessary for that part).  It is essentially irrelevant: 
with or without that core, the machine consumes about the same amount of 
resources.

In response to this problem I have already dramatically reduced the following 
options:

<     <mergeFactor>2</mergeFactor>
<     <maxBufferedDocs>100</maxBufferedDocs>
---
>     <mergeFactor>10</mergeFactor>
>     <maxBufferedDocs>1000</maxBufferedDocs>
42c42
<     <maxFieldLength>2500</maxFieldLength>
---
>     <maxFieldLength>10000</maxFieldLength>
50,51c50,51
<     <mergeFactor>2</mergeFactor>
<     <maxBufferedDocs>100</maxBufferedDocs>
---
>     <mergeFactor>10</mergeFactor>
>     <maxBufferedDocs>1000</maxBufferedDocs>
53c53
<     <maxFieldLength>2500</maxFieldLength>
---
>     <maxFieldLength>10000</maxFieldLength>


(diff of solrconfig.xml - < indicates current values, > indicates the values 
when the problem started happening)

This actually seemed to make the search much faster (strangely enough), but it 
doesn't seem to have helped memory consumption very much.

These are our cache parameters:

    <filterCache
      class="solr.LRUCache"
      size="65536"
      initialSize="4096"
      autowarmCount="2048"/>

    <queryResultCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="256"/>

    <documentCache
      class="solr.LRUCache"
      size="16384"
      initialSize="16384"
      autowarmCount="0"/>

    <cache name="collapseCache"
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="0"/>

I'm actually not sure if the collapseCache even does anything since it does not 
appear in the admin listing.  I'm going to try reducing the filterCache to 32K 
entries and see if that makes a difference.  I think that may be the right 
track since otherwise it seems like a big memory leak is happening.

Is there any way to specify the size of the cache in terms of the actual size 
it should take up in memory, as opposed to the number of entries?  64K sounded 
quite small to me, but now I'm thinking that 64K entries could mean GBs of 
memory depending on what the entries are.  I honestly don't understand what 
the correlation would be between an entry and the size that entry takes in RAM.
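
To put rough numbers on that question, here is a back-of-the-envelope sketch.  
It assumes that a filterCache entry matching many documents is held as a bitset 
with one bit per document in the index; entries that match only a few documents 
may be stored far more compactly, so this is an upper bound and not a 
measurement.  The helper names (bitset_bytes, worst_case_cache_bytes) are made 
up for illustration and are not part of Solr.

```python
def bitset_bytes(max_doc: int) -> int:
    """Bytes for one bit per document, rounded up to whole 64-bit words.

    Assumption: a large cached filter is stored as a bitset over all documents.
    """
    words = (max_doc + 63) // 64
    return words * 8


def worst_case_cache_bytes(entries: int, max_doc: int) -> int:
    """Upper bound: every cache entry is a full-size bitset."""
    return entries * bitset_bytes(max_doc)


MAX_DOC = 4_000_000  # approx. document count quoted in this comment

for entries in (65536, 32768, 4096):
    gib = worst_case_cache_bytes(entries, MAX_DOC) / 2**30
    print(f"{entries:>6} entries -> up to {gib:.1f} GiB")

# prints:
#  65536 entries -> up to 30.5 GiB
#  32768 entries -> up to 15.3 GiB
#   4096 entries -> up to 1.9 GiB
```

Under that assumption a single entry costs about 500 KB at 4M documents, so 
the entry count by itself says little about RAM: what matters is how many 
entries actually get filled and how many documents each one matches.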

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
> collapsing-patch-to-1.3.0-ivan_3.patch, 
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
> SOLR-236-FieldCollapsing.patch, solr-236.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given 
> field to a single entry in the result set. Site collapsing is a special case 
> of this, where all results for a given web site is collapsed into one or two 
> entries in the result set, typically with an associated "more documents from 
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before 
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
