I would add, though, that the semantics of keep in this filter are a bit
confusing. The current functionality does not "keep" the original
term; it "keeps" whatever value the keep map holds for that term,
regardless of case. The test case spells this out:
keep set = and
Test token: AnD
Assertion is that the result is And (forceFirstLetter = true)
This is not consistent with my notion of keep. I would think keep
should preserve the original token exactly as it came in and the keep
list should be case-sensitive. Consider the word PhD. This is, IMO, a
prime example of a token that belongs in the keep list, yet the
current functionality would return Phd.
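
To make the difference concrete, here is a rough sketch in plain Java. It is a simplified model of the two behaviors described above, not the actual filter code: the class and method names are made up for illustration, the keep list is assumed to be stored lower-cased (entries "and" and "phd") in the case-insensitive version, and a null return just means "not a keep word, capitalize as usual".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class KeepSemantics {

    // Model of the behavior described above (an assumption, not the real code):
    // look the token up ignoring case and return the stored keep-list value,
    // upper-casing its first letter when forceFirstLetter is set.
    public static String keepCaseInsensitive(String token, Map<String, String> keepMap,
                                             boolean forceFirstLetter) {
        String kept = keepMap.get(token.toLowerCase(Locale.ROOT));
        if (kept == null) {
            return null; // not a keep word
        }
        if (forceFirstLetter && !kept.isEmpty()) {
            return Character.toUpperCase(kept.charAt(0)) + kept.substring(1);
        }
        return kept;
    }

    // Proposed semantics: exact, case-sensitive match, and the original
    // token is returned untouched.
    public static String keepCaseSensitive(String token, Set<String> keepSet) {
        return keepSet.contains(token) ? token : null;
    }

    public static void main(String[] args) {
        Map<String, String> keepMap = new HashMap<>();
        keepMap.put("and", "and");
        keepMap.put("phd", "phd");

        Set<String> keepSet = new HashSet<>();
        keepSet.add("PhD");

        System.out.println(keepCaseInsensitive("AnD", keepMap, true)); // And
        System.out.println(keepCaseInsensitive("PhD", keepMap, true)); // Phd
        System.out.println(keepCaseSensitive("PhD", keepSet));         // PhD
    }
}
```

With the case-sensitive version, the original casing of PhD survives only because the keep list contains exactly "PhD"; an entry of "phd" would no longer match it.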
Should I open a bug for this so that it can be tracked separately?
-Grant
On Jan 31, 2008, at 12:42 PM, Grant Ingersoll wrote:
I have started on SOLR-330 and the first one to tackle is the
CapitalizationFilterFactory (just starting at the top of the
analysis package).
At any rate, there are some optimizations to be made here, but one
thing in the file that is not explicitly stated is that the "keep"
word list is case-insensitive. This is the current, undocumented
behavior. I am fine with documenting it and keeping it so going
forward. However, if we instead make it case-sensitive, we can
then use a CharArraySet (from Lucene) to do quick lookups on the
term buffer char array. The reason this comes up is that
Token.termText() is now deprecated and I am switching over to the
Token.termBuffer() char array. This filter can then just operate
directly on the char array, which should be a lot faster.
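
As a rough illustration of what operating directly on the char array could look like, here is a sketch in plain Java with no Lucene dependency. The isKept method is only a stand-in for a case-sensitive CharArraySet lookup (a linear scan instead of a hash lookup), the capitalization rule is reduced to "first letter upper, rest lower", and all names are hypothetical. The point is just that neither the lookup nor the rewrite needs to create a String.

```java
public class TermBufferSketch {

    // Stand-in for a case-sensitive set lookup: compares buffer[0..length)
    // against each keep entry character by character.
    public static boolean isKept(char[][] keep, char[] buffer, int length) {
        for (char[] word : keep) {
            if (word.length != length) {
                continue;
            }
            int i = 0;
            while (i < length && word[i] == buffer[i]) {
                i++;
            }
            if (i == length) {
                return true;
            }
        }
        return false;
    }

    // Rewrites buffer[0..length) in place: first char upper-cased, rest lower-cased.
    public static void capitalizeInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = (i == 0)
                    ? Character.toUpperCase(buffer[i])
                    : Character.toLowerCase(buffer[i]);
        }
    }

    // Leaves keep words untouched, capitalizes everything else in place.
    public static void process(char[][] keep, char[] buffer, int length) {
        if (!isKept(keep, buffer, length)) {
            capitalizeInPlace(buffer, length);
        }
    }

    public static void main(String[] args) {
        char[][] keep = { "PhD".toCharArray() };
        char[] kept = "PhD".toCharArray();
        char[] other = "gRANT".toCharArray();
        process(keep, kept, kept.length);
        process(keep, other, other.length);
        System.out.println(new String(kept) + " " + new String(other));
    }
}
```

The explicit length parameter is there because a term buffer can be longer than the term it currently holds, so only the first length characters are meaningful.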
Any opinion on this?
-Grant