On 11/7/06 5:44 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> Grab the code from Lucene in Action, it's got something to get you going, see:
> 
>   http://www.lucenebook.com/search?query=metaphone

Thanks. I thought about looking that up (I have the book), but the
code is really trivial inside Solr. The per-field analyzer takes
care of most of the fuss. The meat is a single line of code in the
token filter using the DoubleMetaphone class from commons codec.

  return new Token(dm.encode(token.termText()),
                   token.startOffset(),
                   token.endOffset());

Everything else is just initialization and declaration.

A naming convention question: should the class names end in
Filter or TokenFilter (and FilterFactory or TokenFilterFactory)?
I see both in org.apache.solr.analysis.

I'm a bit disappointed in the performance, though. Throughput drops
by more than half when I add two phonetic fields to the search, from
300 qps to 130. On the other hand, I never thought I'd be complaining
about an engine delivering over 100 qps!

Could that be from searching extra fields? Indexing is the same
speed, so it shouldn't be the DoubleMetaphone class. I'm still
trying to get a feel for Lucene performance after years with the
Ultraseek engine.

Also, the phonetic matches are ranked a bit high, so I'm trying a
sub-1.0 boost. I was expecting the lower idf to fix that automatically.
The metaphone will almost always have a lower idf because multiple
words are mapped to one metaphone, so the encoded term occurs in more
documents than the surface terms.
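
To put rough numbers on the idf argument, here is a minimal sketch. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the class name and all document counts are made up for illustration, not measured from any index.

```java
// Sketch only: compares the idf of two surface spellings with the idf of
// the single phonetic code they both map to. All counts are hypothetical.
class IdfSketch {
    // Classic Lucene-style idf (assumed here): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 100000;  // hypothetical index size
        int dfSmith = 400;     // hypothetical: docs containing "smith"
        int dfSmyth = 50;      // hypothetical: docs containing "smyth"

        // The shared code occurs in every doc that has either spelling.
        // If no doc contains both, its docFreq is the sum of the two.
        int dfCode = dfSmith + dfSmyth;

        System.out.printf("idf(smith) = %.3f%n", idf(dfSmith, numDocs));
        System.out.printf("idf(smyth) = %.3f%n", idf(dfSmyth, numDocs));
        System.out.printf("idf(code)  = %.3f%n", idf(dfCode, numDocs));
    }
}
```

Because the logarithm grows slowly, the code's idf ends up only slightly below that of the more common spelling, which may be why the idf difference alone does not push phonetic matches down very far.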

One neat trick -- if regular terms are lowercased, they will never
collide with the metaphones, which are all upper case.
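
A small sketch of the case trick, using plain string sets rather than a real analyzer. The codes below are hard-coded illustration values, not output computed by DoubleMetaphone, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch only: shows that lowercased surface terms cannot be string-equal
// to upper-case phonetic codes, while terms left as typed can be.
class CaseCollisionSketch {
    // True if any surface term is string-equal to any phonetic code.
    static boolean collides(Set<String> surfaceTerms, Set<String> codes) {
        for (String term : surfaceTerms) {
            if (codes.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Returns a lowercased copy of the terms, as a lowercase filter would.
    static Set<String> lowercased(Set<String> terms) {
        Set<String> out = new HashSet<>();
        for (String term : terms) {
            out.add(term.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Surface terms as typed; "ANT" happens to be written in capitals.
        Set<String> surface = new HashSet<>(Arrays.asList("ANT", "Smith"));
        // Illustrative upper-case codes (assumed, not computed here).
        Set<String> codes = new HashSet<>(Arrays.asList("ANT", "SM0"));

        System.out.println("as typed:   " + collides(surface, codes));
        System.out.println("lowercased: " + collides(lowercased(surface), codes));
    }
}
```

The trick relies on every code containing at least one letter. A code made up only of caseless characters, such as digits, is unchanged by lowercasing and could still match a surface term.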

wunder
-- 
Walter Underwood
Search Guru, Netflix


