Beware... I just looked at CharArraySet at -rHEAD and it *modifies the input
token* if ignoreCase is set:
  /** Add this char[] directly to the set.
   * If ignoreCase is true for this Set, the text array will be directly modified.
   * The user should never modify this text array after calling this method.
   */
  public boolean add(char[] text) {
    if (ignoreCase)
      for(int i=0;i<text.length;i++)
        text[i] = Character.toLowerCase(text[i]);
    int slot = getSlot(text, 0, text.length);
I'm not sure whether that affects your use for SOLR-468.
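Untested, but just to make the pitfall concrete (assuming the (startSize,
ignoreCase) constructor):

  CharArraySet keep = new CharArraySet(10, true);   // ignoreCase = true

  char[] word = { 'F', 'o', 'o' };
  keep.add(word);             // lowercases the caller's array in place
  System.out.println(word);   // prints "foo", not "Foo"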
I wonder whether this design tradeoff was worth it: getHashCode(...) already
lowercases nondestructively while computing the hash code, so if the line in
both equals(...) methods:

  if (Character.toLowerCase(text1[off+i]) != text2[i])

were changed to also lowercase text2, the destructive one-time lowercasing in
add() could be avoided. Sadly there's no Character.equalsIgnoreCase to avoid
the second method call.
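Something like this, untested (just the ignoreCase branch of the char[]
equals, reusing the existing names):

  private boolean equals(char[] text1, int off, int len, char[] text2) {
    if (len != text2.length)
      return false;
    if (ignoreCase) {
      for (int i = 0; i < len; i++) {
        // two toLowerCase calls per char, since there's no equalsIgnoreCase
        if (Character.toLowerCase(text1[off+i]) != Character.toLowerCase(text2[i]))
          return false;
      }
    } else {
      for (int i = 0; i < len; i++) {
        if (text1[off+i] != text2[i])
          return false;
      }
    }
    return true;
  }

add() could then store the array as given, leaving the caller's chars alone.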
- J.J.
At 12:45 PM -0500 1/31/08, Grant Ingersoll wrote:
>Scratch that. CharArraySet has an ignoreCase option that I missed.
>
>-Grant
>
>On Jan 31, 2008, at 12:42 PM, Grant Ingersoll wrote:
>
>>I have started on SOLR-330 and the first one to tackle is the
>>CapitalizationFilterFactory (just starting at the top of the analysis
>>package).
>>
>>At any rate, there are some optimizations to be made here, but one thing in
>>the file that is not explicitly stated is that the "keep" word list is
>>case-insensitive. This is the current, undocumented behavior. I am fine
>>with documenting it and making it so going forward. However, if, instead, we
>>make it case-sensitive, we can then use a CharArraySet (from Lucene) to do
>>quick lookups of the term buffer char array. The reason this comes up is
>>that Token.termText() is now deprecated and I am switching over to using the
>>Token.termBuffer() char array. This filter can then just operate directly on
>>the char array, which should be a lot faster.
>>
>>Any opinion on this?
>>
>>-Grant
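
For what it's worth, a rough sketch of the termBuffer()-based keep-word lookup
Grant describes above (keepWords and the filter skeleton are made-up names, not
the actual SOLR-330 code; untested):

  public Token next(Token token) throws IOException {
    Token t = input.next(token);
    if (t == null)
      return null;
    // compare the term buffer directly against the set, no String allocation
    if (keepWords.contains(t.termBuffer(), 0, t.termLength()))
      return t;                // keep word: pass through unchanged
    // ... otherwise apply the capitalization logic to t.termBuffer() in place
    return t;
  }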