UnInvertedField performance on faceted fields containing many unique terms

2009-06-15 Thread Kent Fitch
Hi,

This may be of interest to other users of SOLR's UnInvertedField who
have a very large number of unique terms in faceted fields.

Our setup is:

- about 34M Lucene documents of bibliographic and full text content
- index currently 115GB, and it will at least double over the next 6 months
- moving to support real-time-ish updates (maybe 5 min delay)

We facet on 8 fields, 6 of which are normal with small numbers of
distinct values.  But 2 faceted fields, creator and subject, are huge,
with 18M and 9M unique terms respectively.  (Whether we should be
faceting on such a huge number of values while also attempting to
provide real-time-ish updates is one question!  Whether facets should
be derived from all of the hundreds of thousands of results regardless
of match quality, as typically happens in a large full text index, is
yet another!)  The app is visible here:
http://sbdsproto.nla.gov.au/

On a server with 2 x quad-core AMD 2382 processors and 64GB of memory,
running java 1.6.0_13-b03 (64 bit) with -Xmx15192M -Xms6000M
-verbose:gc, and with the index on an Intel X25-M SSD, the elapsed
time to create the 8 facets on start-up is 306 seconds (best time).
Following an index reopen, the time to recreate them is 318 seconds
(best time).

[We have made an independent experimental change to create the facets
with 3 async threads, that is, in parallel, and also to decouple them
from the underlying index, so our facets lag the index changes by the
time to recreate the facets.  With our setup, the 3 threads reduced
facet creation elapsed time from about 450 secs to around 320 secs,
but this will depend a lot on IO capabilities of the device containing
the index, amount of file system caching, load, etc]
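
In case it's useful, here is a minimal sketch of that threading
approach (the class name and the warmFacet() helper are illustrative,
not our actual code, and error handling is omitted):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Rebuild the big facets on a small pool of worker threads.  Searches
    // keep using the previously built facets until the rebuild finishes,
    // which is why the facets lag the index by the rebuild time.
    public class FacetWarmer {
      public void rebuildAll(List<String> facetFields) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (final String field : facetFields) {
          pool.execute(new Runnable() {
            public void run() {
              warmFacet(field);  // build the facet, then swap it into a shared cache
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES);  // wait for all rebuilds
      }

      private void warmFacet(String field) {
        // placeholder for the per-field uninvert work
      }
    }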

Anyway, we noticed that huge amounts of garbage were being collected
during facet generation of the creator and subject fields, and tracked
it down to this decision in UnInvertedField uninvert():

  if (termNum >= maxTermCounts.length) {
// resize, but conserve memory by not doubling
// resize at end??? we waste a maximum of 16K (average of 8K)
int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
maxTermCounts = newMaxTermCounts;
  }

So, we tried the obvious things (sketched in code below):

- allocate 10K terms initially, rather than 1K
- extend by doubling the current size, rather than adding a fixed 4K
- free unused space at the end (but only if unused space is
significant) by reallocating the array to the exact required size
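
In code, those three changes look something like this sketch (the
helper names and the trim threshold are ours for illustration, not the
exact patch; the initial allocation simply becomes new int[10 * 1024]):

    // Grow by doubling instead of adding a fixed 4096 entries.
    static int[] grow(int[] counts, int termNum) {
      if (termNum < counts.length) return counts;
      int newSize = Math.max(counts.length, 1);
      while (newSize <= termNum) {
        newSize *= 2;
      }
      int[] bigger = new int[newSize];
      System.arraycopy(counts, 0, bigger, 0, counts.length);
      return bigger;
    }

    // Once the last term is seen, free a significant unused tail by
    // reallocating to the exact required size.
    static int[] trim(int[] counts, int numTerms) {
      if (counts.length - numTerms <= 1024) return counts;  // waste too small to matter
      int[] exact = new int[numTerms];
      System.arraycopy(counts, 0, exact, 0, numTerms);
      return exact;
    }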

And also:

- created a static HashMap lookup keyed on field name which remembers
the previous allocated size for maxTermCounts for that field, and
initially allocates that size + 1000 entries
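
A sketch of that lookup (again illustrative, and note the clash risk
with multiple indices discussed later in this thread):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Remember how big each field's maxTermCounts array ended up last
    // time, and start the next rebuild at that size plus some headroom.
    class TermCountSizes {
      private static final Map<String, Integer> LAST_SIZE =
          new ConcurrentHashMap<String, Integer>();

      static int[] initialCounts(String field) {
        Integer last = LAST_SIZE.get(field);
        return new int[last != null ? last + 1000 : 10 * 1024];
      }

      static void remember(String field, int size) {
        LAST_SIZE.put(field, size);
      }
    }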

The second change is a minor optimisation, but the first change, by
eliminating thousands of array reallocations and copies, greatly
improved load times, down from 306 to 124 seconds on the initial load
and from 318 to 134 seconds on reloads after index updates.  About
60-70 secs is still spent in GC, but it is a significant improvement.

Unless you have very large numbers of facet values, this change won't
have any noticeable benefit.

Regards,

Kent Fitch


Re: UnInvertedField performance on faceted fields containing many unique terms

2009-06-15 Thread Yonik Seeley
Great writeup, Kent,

All the constants you see in UnInvertedField were a best guess - I
wasn't working with any real data.  It's surprising that a big array
allocation every 4096 terms is so significant - I had figured that the
work involved in processing that many terms would far outweigh
realloc+GC.

Could you open a JIRA issue with your recommended changes?  It's
simple enough we should have no problem getting it in for Solr 1.4.

Also, are you using a recent Solr build (within the last month)?
LUCENE-1596 should improve uninvert time for non-optimized indexes.

And don't forget to update http://wiki.apache.org/solr/PublicServers
when you go live!

-Yonik
http://www.lucidimagination.com


Re: UnInvertedField performance on faceted fields containing many unique terms

2009-06-15 Thread Kent Fitch
Hi Yonik,

On Tue, Jun 16, 2009 at 10:52 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 All the constants you see in UnInvertedField were a best guess - I
 wasn't working with any real data.  It's surprising that a big array
 allocation every 4096 terms is so significant - I had figured that the
 work involved in processing that many terms would far outweigh
 realloc+GC.

Well, they were pretty good guesses!  The code is extremely fast for
reasonably sized term lists.
I think with our 18M terms, the increasingly long array of ints was
being reallocated, copied and garbage collected 18M/4K = 4,500 times,
creating 4500x(18Mx4bytes)/2 = 162GB of garbage to collect.
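
Spelling that estimate out (the division by 2 reflects that the array
grows linearly, so the average reallocated copy is about half the
final size):

    reallocations: 18,000,000 terms / 4,096 per grow = ~4,500
    final array:   18,000,000 ints x 4 bytes         = 72MB
    average copy:  72MB / 2                          = 36MB
    total garbage: ~4,500 x 36MB                     = ~162GB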

 Could you open a JIRA issue with your recommended changes?  It's
 simple enough we should have no problem getting it in for Solr 1.4.

Thanks - just added SOLR-1220.  I haven't mentioned the change to an
initial allocation of 10K (rather than 1024) because I don't think it
is significant.  I also haven't mentioned remembering the sizes to
initially allocate, because the improvement is marginal compared to
the big change, and because, for all I know, a static HashMap keyed on
field names could cause unwanted side effects from field name clashes
when running SOLR with multiple indices.

 Also, are you using a recent Solr build (within the last month)?
 LUCENE-1596 should improve uninvert time for non-optimized indexes.

We're not - but we'll upgrade to the latest version of 1.4 very soon.

 And don't forget to update http://wiki.apache.org/solr/PublicServers
 when you go live!

We will - thanks for your great work improving SOLR performance in
1.4, which makes such outrageous uses of facets even thinkable.

Regards,

Kent Fitch