With a 4.1 snapshot from a couple of weeks ago, I saw about a 5% drop in index size compared to 3.5.0 when using the same schema. When I updated my 4.1 schema to use ICUTokenizer so I could use CJKBigramFilter, the index shrank further -- to about 10% smaller than 3.5.0, still using the same 4.1 snapshot.

Yesterday I checked out the newest 4.1 snapshot and built the index again. Comparing a recently optimized 3.5.0 index with the same index, also recently optimized, under the new 4.1, I am seeing more than a 30% drop in size -- 15.49 GB instead of 22.7 GB. As noted above, some of that drop can be explained by the change in schema, but not THAT much. I am very impressed.
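
For anyone who wants to check the arithmetic, here is a quick sketch using only the two sizes quoted above. The function and variable names are just illustrative.

```python
def percent_drop(old_size, new_size):
    """Return the reduction from old_size to new_size as a percentage of old_size."""
    return (old_size - new_size) / old_size * 100.0


# Index sizes in GB, as quoted above.
size_35_gb = 22.7
size_41_gb = 15.49

drop = percent_drop(size_35_gb, size_41_gb)
print(f"{drop:.1f}% smaller")  # prints "31.8% smaller"
```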

Looking at the index directories from yesterday compared to what I remember about the directories a couple of weeks ago, it appears that some of the files that had Lucene40 in the filename now have Lucene41 in the filename.
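
Rather than rely on memory, something like the following sketch could tally the files in an index directory by the codec tag in their names. It assumes only that the tag appears in the filename as "Lucene" followed by digits, as described above. The directory path is passed on the command line and is entirely up to you; nothing here reads or interprets the index contents.

```python
import os
import re
import sys
from collections import defaultdict

# Assumption: the codec tag shows up in filenames as "Lucene" plus digits.
CODEC_TAG = re.compile(r"Lucene\d+")


def size_by_codec_tag(index_dir):
    """Sum file sizes (in bytes) in index_dir, grouped by codec tag in the filename."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        match = CODEC_TAG.search(name)
        tag = match.group(0) if match else "(no tag)"
        totals[tag] += os.path.getsize(path)
    return dict(totals)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py /path/to/index
    for tag, total in sorted(size_by_codec_tag(sys.argv[1]).items()):
        print(f"{tag}: {total / 1024 ** 3:.2f} GiB")
```

Running it against the old and new index directories would show whether the Lucene40-tagged files really have been replaced by Lucene41-tagged ones, and how much of the size difference they account for.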

Is there any chance that this is an indication of a problem, or is the expected index reduction really that good?

Thanks,
Shawn
