Great writeup Kent! All the constants you see in UnInvertedField were a best guess - I wasn't working with any real data. It's surprising that a big array allocation every 4096 terms is so significant - I had figured that the work involved in processing that many terms would far outweigh realloc+GC.
Could you open a JIRA issue with your recommended changes? It's simple enough that we should have no problem getting it in for Solr 1.4.

Also, are you using a recent Solr build (within the last month)? LUCENE-1596 should improve uninvert time for non-optimized indexes.

And don't forget to update http://wiki.apache.org/solr/PublicServers when you go live!

-Yonik
http://www.lucidimagination.com

On Mon, Jun 15, 2009 at 7:43 PM, Kent Fitch<kent.fi...@gmail.com> wrote:
> Hi,
>
> This may be of interest to other users of SOLR's UnInvertedField who
> have a very large number of unique terms in faceted fields.
>
> Our setup is:
>
> - about 34M lucene documents of bibliographic and full text content
> - index currently 115GB, will at least double over next 6 months
> - moving to support real-time-ish updates (maybe 5 min delay)
>
> We facet on 8 fields, 6 of which are "normal" with small numbers of
> distinct values. But 2 faceted fields, creator and subject, are huge,
> with 18M and 9M terms respectively. (Whether we should be faceting on
> such a huge number of values, and at the same time attempting to
> provide real time-ish updates is another question! Whether facets
> derived from all of the hundreds of thousands of results regardless of
> match quality which typically happens in a large full text index is
> yet another question!). The app is visible here:
> http://sbdsproto.nla.gov.au/
>
> On a server with 2xquad core AMD 2382 processors and 64GB memory, java
> 1.6.0_13-b03, 64 bit run with "-Xmx15192M -Xms6000M -verbose:gc", with
> the index on Intel X25M SSD, on start-up the elapsed time to create
> the 8 facets is 306 seconds (best time). Following an index reopen,
> the time to recreate them is 318 seconds (best time).
>
> [We have made an independent experimental change to create the facets
> with 3 async threads, that is, in parallel, and also to decouple them
> from the underlying index, so our facets lag the index changes by the
> time to recreate the facets. With our setup, the 3 threads reduced
> facet creation elapsed time from about 450 secs to around 320 secs,
> but this will depend a lot on IO capabilities of the device containing
> the index, amount of file system caching, load, etc]
>
> Anyway, we noticed that huge amounts of garbage were being collected
> during facet generation of the creator and subject fields, and tracked
> it down to this decision in UnInvertedField uninvert():
>
>   if (termNum >= maxTermCounts.length) {
>     // resize, but conserve memory by not doubling
>     // resize at end??? we waste a maximum of 16K (average of 8K)
>     int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
>     System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
>     maxTermCounts = newMaxTermCounts;
>   }
>
> So, we tried the obvious thing:
>
> - allocate 10K terms initially, rather than 1K
> - extend by doubling the current size, rather than adding a fixed 4K
> - free unused space at the end (but only if unused space is
>   "significant") by reallocating the array to the exact required size
>
> And also:
>
> - created a static HashMap lookup keyed on field name which remembers
>   the previous allocated size for maxTermCounts for that field, and
>   initially allocates that size + 1000 entries
>
> The second change is a minor optimisation, but the first change, by
> eliminating thousands of array reallocations and copies, greatly
> improved load times, down from 306 to 124 seconds on the initial load
> and from 318 to 134 seconds on reloads after index updates. About
> 60-70 secs is still spent in GC, but it is a significant improvement.
>
> Unless you have very large numbers of facet values, this change won't
> have any positive benefit.
>
> Regards,
>
> Kent Fitch
>
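For anyone following the thread, here is a rough sketch of the growth policy Kent describes: grow by doubling, trim at the end only when the slack is significant, and remember the last size per field. It is not the actual patch and not Solr code. The class TermCountGrowth, all of its method names, the minSlack parameter and the exact initial sizes (1024 and 10 * 1024) are assumptions made for illustration. The two simulate methods just count how many reallocations and copied ints each policy would need for a given number of terms, which is where the "thousands of array reallocations" for an 18M-term field come from.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; every name here is hypothetical, not Solr's. */
final class TermCountGrowth {

    // Last used size per field, so a rebuild can start close to the right size.
    private static final Map<String, Integer> LAST_SIZE = new ConcurrentHashMap<>();

    /** Previous size + 1000 if the field was seen before, otherwise the fallback. */
    static int initialCapacity(String field, int fallback) {
        Integer previous = LAST_SIZE.get(field);
        return previous == null ? fallback : previous + 1000;
    }

    static void rememberSize(String field, int used) {
        LAST_SIZE.put(field, used);
    }

    /** Returns an array able to hold index termNum, doubling when the array is full. */
    static int[] grow(int[] counts, int termNum) {
        if (termNum < counts.length) {
            return counts;
        }
        // If doubling overflows int, Math.max falls back to termNum + 1.
        int newLength = Math.max(counts.length * 2, termNum + 1);
        return Arrays.copyOf(counts, newLength);
    }

    /** Shrinks to the exact used size, but only if at least minSlack entries are unused. */
    static int[] trim(int[] counts, int used, int minSlack) {
        return counts.length - used >= minSlack ? Arrays.copyOf(counts, used) : counts;
    }

    /** Returns {reallocations, ints copied} when growing by a fixed step. */
    static long[] simulateFixed(long terms, long initial, long step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        long capacity = initial, reallocs = 0, copied = 0;
        while (capacity < terms) {
            copied += capacity;   // the full old array is copied on each resize
            capacity += step;
            reallocs++;
        }
        return new long[] {reallocs, copied};
    }

    /** Returns {reallocations, ints copied} when growing by doubling. */
    static long[] simulateDoubling(long terms, long initial) {
        if (initial <= 0) {
            throw new IllegalArgumentException("initial must be positive");
        }
        long capacity = initial, reallocs = 0, copied = 0;
        while (capacity < terms) {
            copied += capacity;
            capacity *= 2;
            reallocs++;
        }
        return new long[] {reallocs, copied};
    }

    public static void main(String[] args) {
        long terms = 18_000_000L; // roughly the size of the "creator" field
        long[] fixed = simulateFixed(terms, 1024, 4096);
        long[] doubling = simulateDoubling(terms, 10 * 1024);
        System.out.println("fixed +4096: " + fixed[0] + " reallocs, " + fixed[1] + " ints copied");
        System.out.println("doubling:    " + doubling[0] + " reallocs, " + doubling[1] + " ints copied");
    }
}
```

With a fixed step the total copy work grows roughly with the square of the term count, while doubling keeps it proportional to the term count, at the cost of up to about half the array being unused until the final trim.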