Re: CapitilizationFilterFactory

Grant Ingersoll Thu, 31 Jan 2008 10:08:10 -0800

I would add, though, that the semantics of keep here are a bit confusing. The current functionality does not "keep" the original term; it "keeps" whatever the value in the keep map is, regardless of case. The test case spells this out:


keep set = and


Test token: AnD

Assertion is that the result is And (forceFirstLetter = true)

This is not consistent with my notion of keep. I would think keep should preserve the original token exactly as it came in, and the keep list should be case-sensitive. Consider the word PhD. This is a prime candidate, IMO, for a word that belongs in the keep list, yet the current functionality would return Phd.
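To make the difference concrete, here is a minimal sketch in plain Java. It is a simplified model of the two behaviors described above, not the actual filter code: the class and method names are hypothetical, and it assumes the current keep entries are stored lowercased, which is consistent with the AnD and PhD examples.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the two "keep" semantics discussed above.
class KeepSemanticsSketch {
    // Uppercase the first letter, lowercase the rest.
    static String capitalize(String s) {
        if (s.isEmpty()) return s;
        return s.substring(0, 1).toUpperCase(Locale.ROOT)
             + s.substring(1).toLowerCase(Locale.ROOT);
    }

    // Assumed current behavior: the keep list is matched ignoring case and
    // the (lowercased) keep entry is emitted, not the incoming token.
    static String currentBehavior(String token, Set<String> lowercasedKeep,
                                  boolean forceFirstLetter) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lowercasedKeep.contains(lower)) {
            return forceFirstLetter ? capitalize(lower) : lower;
        }
        return capitalize(token);
    }

    // Proposed behavior: case-sensitive match, token passed through untouched.
    static String proposedBehavior(String token, Set<String> keep) {
        return keep.contains(token) ? token : capitalize(token);
    }

    public static void main(String[] args) {
        System.out.println(currentBehavior("AnD", Set.of("and"), true)); // And
        System.out.println(currentBehavior("PhD", Set.of("phd"), true)); // Phd
        System.out.println(proposedBehavior("PhD", Set.of("PhD")));      // PhD
    }
}
```

Under the proposed semantics a token only matches the keep list when its case matches exactly, so AnD against a keep list of "and" would fall through to normal capitalization.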


Should I open a bug for this, such that it can be tracked separately?

-Grant

On Jan 31, 2008, at 12:42 PM, Grant Ingersoll wrote:

I have started on SOLR-330, and the first one to tackle is the CapitilizationFilterFactory (just starting at the top of the analysis package).
At any rate, there are some optimizations to be made here, but one thing in the file that is not explicitly stated is that the "keep" word list is case-insensitive. This is the current, undocumented behavior. I am fine with documenting it and making it so going forward. However, if we instead make it case-sensitive, we can then use a CharArraySet (from Lucene) to do quick lookups of the term buffer char array. The reason this comes up is that Token.termText() is now deprecated and I am switching over to the Token.termBuffer() char array. This filter can then just operate directly on the char array, which should be a lot faster.
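A rough sketch of what operating directly on the char array could look like is below. It uses only the JDK so it runs standalone: keepContains is a hypothetical linear-scan stand-in for the case-sensitive lookup that a CharArraySet would provide, and buf/len stand in for the term buffer and term length. None of these names come from the Solr or Lucene code.

```java
// Hypothetical sketch: capitalize a term in place in its char buffer,
// skipping terms that match a case-sensitive keep list exactly.
class TermBufferSketch {
    // Stand-in for a case-sensitive char-array set lookup (linear scan here).
    static boolean keepContains(char[][] keep, char[] buf, int len) {
        for (char[] k : keep) {
            if (k.length != len) continue;
            int i = 0;
            while (i < len && k[i] == buf[i]) i++;
            if (i == len) return true;
        }
        return false;
    }

    // Only the first len chars of buf are the term; the rest is spare capacity.
    static void capitalizeInPlace(char[] buf, int len, char[][] keep) {
        if (len == 0 || keepContains(keep, buf, len)) return; // keep as-is
        buf[0] = Character.toUpperCase(buf[0]);
        for (int i = 1; i < len; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
    }

    public static void main(String[] args) {
        char[][] keep = { "PhD".toCharArray() };
        char[] kept = "PhD".toCharArray();
        char[] other = "aND".toCharArray();
        capitalizeInPlace(kept, kept.length, keep);
        capitalizeInPlace(other, other.length, keep);
        System.out.println(new String(kept));  // PhD
        System.out.println(new String(other)); // And
    }
}
```

The point of the design is that no String is created per token: both the keep check and the case changes work on the buffer the token already owns.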
Any opinion on this?

-Grant
