With a 4.1 snapshot from a couple of weeks ago, I saw about a 5% drop in
index size compared to 3.5.0 when using the same schema. When I updated
my 4.1 schema to use ICUTokenizer so I could use CJKBigramFilter, the
index shrank further, to about 10% smaller than 3.5, still using the same
4.1 snapshot.
Yesterday I checked out the newest 4.1 snapshot and built the index
again. Comparing a recently optimized 3.5.0 index with the equivalent
recently optimized index built under the new 4.1, I am seeing more than
a 30 percent drop in size -- 15.49 GB instead of 22.7 GB. As noted above,
some of that drop can be explained by the change in schema, but not THAT
much. I am very impressed.
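
For anyone who wants to check the arithmetic, here is a minimal sketch using the two sizes quoted above. The function name is made up for illustration and is not part of Solr or Lucene.

```python
def percent_reduction(old_size_gb, new_size_gb):
    """Return how much smaller new_size_gb is than old_size_gb, in percent."""
    return (old_size_gb - new_size_gb) / old_size_gb * 100.0


# Sizes quoted in this message: 3.5.0 index vs. newest 4.1 snapshot index.
reduction = percent_reduction(22.7, 15.49)
print(f"{reduction:.1f}% smaller")
```

That works out to a little under 32 percent, consistent with the "more than 30 percent" figure.
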
Looking at the index directories from yesterday compared to what I
remember about the directories a couple of weeks ago, it appears that
some of the files that had Lucene40 in the filename now have Lucene41 in
the filename.
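
One way to make that comparison less dependent on memory is to total the file sizes per format tag in each index directory. The sketch below assumes only what is described above, namely that some filenames contain a tag such as Lucene40 or Lucene41. The function name and the example path are hypothetical, and files without such a tag are grouped together.

```python
import os
import re
from collections import defaultdict

# Matches tags such as "Lucene40" or "Lucene41" anywhere in a filename.
FORMAT_TAG = re.compile(r"Lucene\d+")


def size_by_format(index_dir):
    """Sum file sizes in index_dir, grouped by the format tag in each filename."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        match = FORMAT_TAG.search(name)
        tag = match.group(0) if match else "(untagged)"
        totals[tag] += os.path.getsize(path)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical path; point this at the actual index directory.
    example_dir = "/path/to/solr/data/index"
    if os.path.isdir(example_dir):
        for tag, size in sorted(size_by_format(example_dir).items()):
            print(f"{tag}: {size / 1024**3:.2f} GB")
```

Running it against the older and newer index directories would show which tagged files account for the size difference.
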
Is there any chance that this is an indication of a problem, or is the
expected index reduction really that good?
Thanks,
Shawn
- Extreme index size reduction on 4.1-SNAPSHOT? Shawn Heisey