Beware... I just looked at CharArraySet at -rHEAD and it *modifies the input
token* if ignoreCase is set:
  /** Add this char[] directly to the set.
   * If ignoreCase is true for this Set, the text array will be directly modified.
   * The user should never modify this text array after calling this method.
   */
  public boolean add(char[] text) {
    if (ignoreCase)
      for(int i=0;i<text.length;i++)
        text[i] = Character.toLowerCase(text[i]);
    int slot = getSlot(text, 0, text.length);
I'm not sure whether that affects your use for SOLR-468.
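Untested, but just to make the pitfall concrete (assuming the (startSize,
ignoreCase) constructor):

  CharArraySet keep = new CharArraySet(10, true);   // ignoreCase = true

  char[] word = { 'F', 'o', 'o' };
  keep.add(word);             // lowercases the caller's array in place
  System.out.println(word);   // prints "foo", not "Foo"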
I wonder whether this design tradeoff was worth it: getHashCode(...) already
lowercases nondestructively while computing the hash code, so if the line in
both equals(...) methods:

  if (Character.toLowerCase(text1[off+i]) != text2[i])

were changed to also lowercase text2, the destructive one-time lowercasing in
add() could be avoided. Sadly there's no Character.equalsIgnoreCase to avoid
the second method call.
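Something like this, untested (just the ignoreCase branch of the char[]
equals, reusing the existing names):

  private boolean equals(char[] text1, int off, int len, char[] text2) {
    if (len != text2.length)
      return false;
    if (ignoreCase) {
      for (int i = 0; i < len; i++) {
        // two toLowerCase calls per char, since there's no equalsIgnoreCase
        if (Character.toLowerCase(text1[off+i]) != Character.toLowerCase(text2[i]))
          return false;
      }
    } else {
      for (int i = 0; i < len; i++) {
        if (text1[off+i] != text2[i])
          return false;
      }
    }
    return true;
  }

add() could then store the array as given, leaving the caller's chars alone.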
- J.J.
At 12:45 PM -0500 1/31/08, Grant Ingersoll wrote:
>Scratch that. CharArraySet has an ignoreCase option that I missed.
>
>-Grant
>
>On Jan 31, 2008, at 12:42 PM, Grant Ingersoll wrote:
>
>>I have started on SOLR-330 and the first one to tackle is the
>>CapitalizationFilterFactory (just starting at the top of the analysis
>>package).
>>
>>At any rate, there are some optimizations to be made here, but one thing in
>>the file that is not explicitly stated is that the "keep" word list is
>>case-insensitive. This is the current, undocumented behavior. I am fine
>>with documenting it and making it so going forward. However, if, instead, we
>>make it case-sensitive, we can then use a CharArraySet (from Lucene) to do
>>quick lookups of the term buffer char array. The reason this comes up is
>>that Token.termText() is now deprecated and I am switching over to using the
>>Token.termBuffer() char array. This filter can then just operate directly on
>>the char array, which should be a lot faster.
>>
>>Any opinion on this?
>>
>>-Grant
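
For what it's worth, a rough sketch of the termBuffer()-based keep-word lookup
Grant describes above (keepWords and the filter skeleton are made-up names, not
the actual SOLR-330 code; untested):

  public Token next(Token token) throws IOException {
    Token t = input.next(token);
    if (t == null)
      return null;
    // compare the term buffer directly against the set, no String allocation
    if (keepWords.contains(t.termBuffer(), 0, t.termLength()))
      return t;                // keep word: pass through unchanged
    // ... otherwise apply the capitalization logic to t.termBuffer() in place
    return t;
  }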