CapitilizationFilterFactory

Grant Ingersoll Thu, 31 Jan 2008 09:42:43 -0800

I have started on SOLR-330 and the first one to tackle is the CapitilizationFilterFactory (just starting at the top of the analysis package).

At any rate, there are some optimizations to be made here, but one thing in the file that is not explicitly stated is that the "keep" word list is case-insensitive. This is the current, undocumented behavior. I am fine with documenting it and keeping it that way going forward. However, if we instead make it case-sensitive, we can use a CharArraySet (from Lucene) to do quick lookups on the term buffer char array. The reason this comes up is that Token.termText() is now deprecated and I am switching over to the Token.termBuffer() char array. This filter can then just operate directly on the char array, which should be a lot faster.
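To make the trade-off concrete, here is a rough sketch of the case-sensitive variant. It is not the actual filter code: CharSliceSet is a hypothetical, stdlib-only stand-in for Lucene's CharArraySet, and capitalize() is an assumed simplification of what the filter does to each term. The point is only that the keep check and the capitalization both work on a (buffer, offset, length) slice without ever building a String.

```java
public class KeepWordSketch {

    /** Hypothetical stand-in for Lucene's CharArraySet: case-sensitive lookups on a char[] slice. */
    static final class CharSliceSet {
        private final char[][] slots;

        CharSliceSet(String... words) {
            int size = 8;
            while (size < words.length * 2) size <<= 1; // power of two, at most half full
            slots = new char[size][];
            for (String w : words) add(w.toCharArray());
        }

        private static int hash(char[] buf, int off, int len) {
            int h = 0;
            for (int i = off; i < off + len; i++) h = 31 * h + buf[i];
            return h;
        }

        private static boolean same(char[] entry, char[] buf, int off, int len) {
            if (entry.length != len) return false;
            for (int i = 0; i < len; i++) {
                if (entry[i] != buf[off + i]) return false; // exact match, no case folding
            }
            return true;
        }

        private void add(char[] word) {
            int mask = slots.length - 1;
            int i = hash(word, 0, word.length) & mask;
            while (slots[i] != null) {
                if (same(slots[i], word, 0, word.length)) return; // already present
                i = (i + 1) & mask; // linear probing
            }
            slots[i] = word;
        }

        /** Looks up the slice directly; no String is allocated. */
        boolean contains(char[] buf, int off, int len) {
            int mask = slots.length - 1;
            int i = hash(buf, off, len) & mask;
            while (slots[i] != null) {
                if (same(slots[i], buf, off, len)) return true;
                i = (i + 1) & mask;
            }
            return false;
        }
    }

    /** Assumed simplification: capitalize the term in place unless it is a keep word. */
    static void capitalize(char[] buf, int off, int len, CharSliceSet keep) {
        if (len == 0 || keep.contains(buf, off, len)) return;
        buf[off] = Character.toUpperCase(buf[off]);
        for (int i = off + 1; i < off + len; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
    }

    public static void main(String[] args) {
        CharSliceSet keep = new CharSliceSet("and", "the");
        char[] a = "hello".toCharArray();
        char[] b = "and".toCharArray();
        char[] c = "AND".toCharArray(); // not a keep word once the list is case-sensitive
        capitalize(a, 0, a.length, keep);
        capitalize(b, 0, b.length, keep);
        capitalize(c, 0, c.length, keep);
        System.out.println(new String(a) + " " + new String(b) + " " + new String(c));
        // prints "Hello and And"
    }
}
```

Note the third term: with a case-sensitive list, "AND" no longer matches the keep word "and", which is exactly the behavior change being asked about.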


Any opinion on this?

-Grant
