Hi Yonik,

On Tue, Jun 16, 2009 at 10:52 AM, Yonik Seeley<yo...@lucidimagination.com> wrote:
> All the constants you see in UnInvertedField were a best guess - I
> wasn't working with any real data. It's surprising that a big array
> allocation every 4096 terms is so significant - I had figured that the
> work involved in processing that many terms would far outweigh
> realloc+GC.

Well, they were pretty good guesses! The code is extremely fast for "reasonable" sized term lists. I think with our 18M terms, the increasingly long array of ints was being reallocated, copied and garbage collected 18M/4K = 4,500 times, creating 4,500 x (18M x 4 bytes) / 2 = 162GB of garbage to collect.

> Could you open a JIRA issue with your recommended changes? It's
> simple enough we should have no problem getting it in for Solr 1.4.

Thanks - just added SOLR-1220. I haven't mentioned the change to the initial allocation of 10K (rather than 1024) because I don't think it is significant. I also haven't mentioned the remembering of sizes to initially allocate, because the improvement is marginal compared to this big change, and for all I know, a static hashmap keyed on field names could cause unwanted side effects from field name clashes if running SOLR with multiple indices.

> Also, are you using a recent Solr build (within the last month)?
> LUCENE-1596 should improve uninvert time for non-optimized indexes.

We're not - but we'll upgrade to the latest version of 1.4 very soon.

> And don't forget to update http://wiki.apache.org/solr/PublicServers
> when you go live!

We will - thanks for your great work in improving SOLR performance with 1.4, which makes such outrageous uses of facets even thinkable.

Regards,

Kent Fitch
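
A rough check of the garbage estimate above, as a minimal sketch only: it assumes one 4-byte int per term and an array that grows by a fixed 4096 entries each time, with the old array discarded after every copy. The class and method names are hypothetical and are not taken from UnInvertedField. The doubling variant is included purely as a comparison point and is an assumption of mine, not a description of the change proposed in SOLR-1220.

```java
public class GrowthGarbage {
    static final int BYTES_PER_INT = 4;

    /** Bytes of discarded arrays when capacity grows by a fixed step until it can hold `needed` ints. */
    public static long fixedStepGarbageBytes(long needed, long step) {
        long garbage = 0;
        long capacity = step;
        while (capacity < needed) {
            garbage += capacity * BYTES_PER_INT; // the old array becomes garbage after the copy
            capacity += step;
        }
        return garbage;
    }

    /** Same accounting, but capacity doubles on each reallocation (comparison only). */
    public static long doublingGarbageBytes(long needed, long initial) {
        long garbage = 0;
        long capacity = initial;
        while (capacity < needed) {
            garbage += capacity * BYTES_PER_INT;
            capacity *= 2;
        }
        return garbage;
    }

    public static void main(String[] args) {
        long terms = 18_000_000L; // roughly the term count mentioned above
        System.out.printf("fixed step of 4096: %.1f GB%n",
                fixedStepGarbageBytes(terms, 4096) / 1e9);
        System.out.printf("doubling from 4096: %.1f MB%n",
                doublingGarbageBytes(terms, 4096) / 1e6);
    }
}
```

Under these assumptions the fixed-step total comes out in the same range as the 162GB figure, because the discarded sizes form an arithmetic series that grows with the square of the number of reallocations, while with doubling the discarded total stays within a small multiple of the final array size.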