Great writeup, Kent,

All the constants you see in UnInvertedField were best guesses - I
wasn't working with any real data.  It's surprising that a big array
reallocation every 4096 terms is so significant - I had figured that
the work involved in processing that many terms would far outweigh the
realloc+GC cost.

Could you open a JIRA issue with your recommended changes?  It's
simple enough that we should have no problem getting it into Solr 1.4.

Also, are you using a recent Solr build (within the last month)?
LUCENE-1596 should improve uninvert time for non-optimized indexes.

And don't forget to update http://wiki.apache.org/solr/PublicServers
when you go live!

-Yonik
http://www.lucidimagination.com



On Mon, Jun 15, 2009 at 7:43 PM, Kent Fitch <kent.fi...@gmail.com> wrote:
> Hi,
>
> This may be of interest to other users of SOLR's UnInvertedField who
> have a very large number of unique terms in faceted fields.
>
> Our setup is :
>
> - about 34M Lucene documents of bibliographic and full text content
> - index currently 115GB, will at least double over the next 6 months
> - moving to support real-time-ish updates (maybe 5 min delay)
>
> We facet on 8 fields, 6 of which are "normal" with small numbers of
> distinct values.  But 2 faceted fields, creator and subject, are huge,
> with 18M and 9M terms respectively.  (Whether we should be faceting on
> such a huge number of values while also attempting to provide
> real-time-ish updates is another question!  Whether facets should be
> derived from all of the hundreds of thousands of results regardless of
> match quality, as typically happens in a large full text index, is
> yet another question!)  The app is visible here:
> http://sbdsproto.nla.gov.au/
>
> On a server with 2x quad-core AMD 2382 processors and 64GB memory, Java
> 1.6.0_13-b03 (64-bit) run with "-Xmx15192M -Xms6000M -verbose:gc", and
> the index on an Intel X25M SSD, the start-up elapsed time to create
> the 8 facets is 306 seconds (best time).  Following an index reopen,
> the time to recreate them is 318 seconds (best time).
>
> [We have made an independent experimental change to create the facets
> with 3 async threads, that is, in parallel, and also to decouple them
> from the underlying index, so our facets lag the index changes by the
> time to recreate the facets.  With our setup, the 3 threads reduced
> facet creation elapsed time from about 450 secs to around 320 secs,
> but this will depend a lot on the IO capabilities of the device containing
> the index, the amount of file system caching, load, etc.]
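>
> A minimal sketch of that kind of parallel, decoupled rebuild (the
> FacetBuilder interface, field list and thread count here are
> illustrative stand-ins, not our actual change):
>
>    import java.util.*;
>    import java.util.concurrent.*;
>
>    class ParallelFacetLoader {
>      // hypothetical hook standing in for the per-field uninvert step
>      interface FacetBuilder { void buildFacet(String field); }
>
>      // rebuild the big facet fields in parallel after an index reopen;
>      // queries keep serving the previous facets until this completes
>      void reload(final FacetBuilder builder) throws Exception {
>        ExecutorService pool = Executors.newFixedThreadPool(3);
>        List<Future<?>> pending = new ArrayList<Future<?>>();
>        for (final String field : Arrays.asList("creator", "subject")) {
>          pending.add(pool.submit(new Runnable() {
>            public void run() { builder.buildFacet(field); }
>          }));
>        }
>        for (Future<?> f : pending) {
>          f.get();  // wait for every field before swapping the new facets in
>        }
>        pool.shutdown();
>      }
>    }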
>
> Anyway, we noticed that huge amounts of garbage were being collected
> during facet generation for the creator and subject fields, and tracked
> it down to this decision in UnInvertedField uninvert():
>
>      if (termNum >= maxTermCounts.length) {
>        // resize, but conserve memory by not doubling
>        // resize at end??? we waste a maximum of 16K (average of 8K)
>        int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
>        System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
>        maxTermCounts = newMaxTermCounts;
>      }
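>
> As a rough illustration of why this hurts (back-of-the-envelope only,
> assuming the array keeps growing in 4096-term steps to the end): with
> ~18M creator terms the array is reallocated about 4,400 times, and each
> reallocation copies everything accumulated so far, so on the order of
> 4,400 x 9M = ~40 billion ints (~150GB) are copied and then immediately
> become garbage for that one field alone.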
>
> So, we tried the obvious thing:
>
> - allocate 10K terms initially, rather than 1K
> - extend by doubling the current size, rather than adding a fixed 4K
> - free unused space at the end (but only if unused space is
> "significant") by reallocating the array to the exact required size
>
> And also:
>
> - created a static HashMap lookup keyed on field name which remembers
> the previous allocated size of maxTermCounts for that field, and
> initially allocates that size + 1000 entries (see the sketch below)
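>
> A minimal sketch of these allocation changes (the class and method
> names are illustrative only, not the actual patch against
> UnInvertedField):
>
>    import java.util.HashMap;
>    import java.util.Map;
>
>    class MaxTermCountsAllocator {
>      // remembers the final array size per field from the previous load,
>      // so the next load starts close to the right size
>      private static final Map<String,Integer> lastSize =
>          new HashMap<String,Integer>();
>
>      static int[] initialArray(String field) {
>        Integer prev = lastSize.get(field);
>        return new int[prev == null ? 10 * 1024 : prev + 1000];
>      }
>
>      static int[] grow(int[] counts, int termNum) {
>        if (termNum < counts.length) return counts;
>        // double rather than add a fixed 4096, so an 18M-term field
>        // needs ~11 reallocations instead of ~4,400
>        int[] bigger = new int[counts.length * 2];
>        System.arraycopy(counts, 0, bigger, 0, counts.length);
>        return bigger;
>      }
>
>      static int[] trimAndRemember(String field, int[] counts, int numTerms) {
>        lastSize.put(field, numTerms);
>        if (counts.length - numTerms < counts.length / 8) {
>          return counts;  // unused tail is small, not worth reallocating
>        }
>        int[] exact = new int[numTerms];
>        System.arraycopy(counts, 0, exact, 0, numTerms);
>        return exact;
>      }
>    }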
>
> The second change is a minor optimisation, but the first change, by
> eliminating thousands of array reallocations and copies, greatly
> improved load times, down from 306 to 124 seconds on the initial load
> and from 318 to 134 seconds on reloads after index updates.  About
> 60-70 secs is still spent in GC, but it is a significant improvement.
>
> Unless you have very large numbers of facet values, this change won't
> provide any noticeable benefit.
>
> Regards,
>
> Kent Fitch
>
