On 11/7/06 5:44 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> Grab the code from Lucene in Action, it's got something to get you going, see:
>
> http://www.lucenebook.com/search?query=metaphone

Thanks. I thought about looking that up (I have the book), but the code is really trivial inside Solr. The per-field analyzer takes care of most of the fuss. The meat is a single line of code in the token filter, using the DoubleMetaphone class from commons codec:

    return new Token(dm.encode(token.termText()), token.startOffset(), token.endOffset());

Everything else is just initialization and declaration.

A naming convention question: should the class names end in Filter or TokenFilter (and FilterFactory or TokenFilterFactory)? I see both in org.apache.solr.analysis.

I'm a bit disappointed in the performance, though. Query throughput drops to less than half when I add two phonetic fields to the search: from 300 qps to 130. On the other hand, I never thought I'd be complaining about an engine delivering over 100 qps!

Could that be from searching extra fields? Indexing is the same speed, so it shouldn't be the DoubleMetaphone class. I'm still trying to get a feel for Lucene performance after years with the Ultraseek engine.

Also, the phonetic matches are ranked a bit high, so I'm trying a sub-1.0 boost. I was expecting the lower idf to fix that automatically. The metaphone will almost always have a lower idf because multiple words are mapped to one metaphone, so the encoded term occurs in more documents than any of the surface terms.

One neat trick -- if regular terms are lowercased, they will never collide with the metaphones, which are all upper case.

wunder
--
Walter Underwood
Search Guru, Netflix
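
P.S. To make the idf point concrete, here is a minimal sketch in plain Java. It assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)), and a made-up four-document corpus. The class name PhoneticIdf, the corpus, and the code "SM0" are illustrative stand-ins: the code is hard-coded rather than computed by commons codec. It only shows that a term shared by several spellings has a document frequency at least as high as any one spelling, and so an idf no higher.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhoneticIdf {
    // Assumed formula: classic Lucene idf, 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Counts the documents that contain at least one of the given terms.
    public static int docFreq(List<Set<String>> docs, Set<String> terms) {
        int count = 0;
        for (Set<String> doc : docs) {
            if (!Collections.disjoint(doc, terms)) {
                count++;
            }
        }
        return count;
    }

    private static Set<String> setOf(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        // Hypothetical corpus: each document is its set of lowercased terms.
        List<Set<String>> docs = Arrays.asList(
                setOf("smith"),
                setOf("smyth"),
                setOf("jones"),
                setOf("smith", "jones"));
        int numDocs = docs.size();

        // Surface terms are counted separately.
        int dfSmith = docFreq(docs, setOf("smith"));
        int dfSmyth = docFreq(docs, setOf("smyth"));

        // Assumption: both spellings map to one upper-case phonetic code,
        // so the code's postings are the union of the two spellings' postings.
        int dfCode = docFreq(docs, setOf("smith", "smyth"));

        System.out.println("smith df=" + dfSmith + " idf=" + idf(dfSmith, numDocs));
        System.out.println("smyth df=" + dfSmyth + " idf=" + idf(dfSmyth, numDocs));
        System.out.println("SM0   df=" + dfCode + " idf=" + idf(dfCode, numDocs));
    }
}
```

The lower idf only reduces the phonetic field's contribution; it does not guarantee that phonetic matches rank below exact ones, which is why an explicit sub-1.0 boost can still be needed.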