On 7/12/08 7:00 PM, "Chris Harris" <[EMAIL PROTECTED]> wrote:
>
> Mike, your idea of indexing bigrams is also interesting. Do you know
> if any text search platforms do this behind the scenes as their
> default way of handling phrase queries?

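Infoseek indexed biwords with their Ultra engine, which lives
on as the Ultraseek enterprise engine. This makes phrase
search very fast, but it isn't exact: for a phrase of three or
more words, a document can contain every biword of the phrase
without containing the whole phrase in sequence.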
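
To make the "fast but not exact" point concrete, here is a minimal sketch
of a biword index in Python. This is not Ultraseek's implementation; the
function names, the whitespace tokenizer, and the sample documents are all
made up for illustration.

```python
from collections import defaultdict

def biwords(text):
    """Return the adjacent word pairs in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def build_biword_index(docs):
    """Map each biword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index

def biword_phrase_search(index, phrase):
    """Intersect the posting sets of every biword in the phrase."""
    pairs = biwords(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "new york city subway",
    2: "new york is not kansas city",
    3: "a new york state of mind in york city",
}
index = build_biword_index(docs)
matches = biword_phrase_search(index, "new york city")
```

Document 3 comes back as a match because it contains both "new york" and
"york city", even though the three words never appear together. That is
the inexactness: no positions are consulted, only set intersections.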

A few years ago, Doug Cutting told me they were doing something a
bit more clever with Nutch, using a biword search to make a candidate
set of phrase matches, then using positional (exact) phrase search
against just that set to get the true matches. That is a really nice
trick that combines the speed of biword matches with the exactness
of positional search.
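
A rough sketch of that two-stage idea, in Python. This is only my reading
of the approach described above, not Nutch's code; every name here is
hypothetical, and a real engine would merge sorted posting lists rather
than scan Python lists.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def build_indexes(docs):
    """Build a biword index (pair -> doc ids) and a positional
    index (word -> doc id -> list of positions)."""
    biword_index = defaultdict(set)
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        words = tokenize(text)
        for pos, word in enumerate(words):
            positions[word][doc_id].append(pos)
        for pair in zip(words, words[1:]):
            biword_index[pair].add(doc_id)
    return biword_index, positions

def candidates(biword_index, words):
    """Stage 1: cheap, approximate. Docs containing every biword."""
    sets = [biword_index.get(p, set()) for p in zip(words, words[1:])]
    return set.intersection(*sets) if sets else set()

def has_phrase(positions, doc_id, words):
    """Stage 2: exact. Is there a start position from which the
    words occur at consecutive positions in this document?"""
    starts = positions.get(words[0], {}).get(doc_id, [])
    return any(
        all(start + i in positions.get(w, {}).get(doc_id, [])
            for i, w in enumerate(words))
        for start in starts
    )

def phrase_search(biword_index, positions, phrase):
    words = tokenize(phrase)
    if len(words) < 2:
        # No biwords to use; fall back to the positional index alone.
        return set(positions.get(words[0], {})) if words else set()
    return {d for d in candidates(biword_index, words)
            if has_phrase(positions, d, words)}

# Hypothetical sample documents.
docs = {
    1: "new york city subway",
    2: "new york is not kansas city",
    3: "a new york state of mind in york city",
}
biword_index, positions = build_indexes(docs)
approx = candidates(biword_index, tokenize("new york city"))
exact = phrase_search(biword_index, positions, "new york city")
```

The positional check only runs on the documents that survive the biword
intersection, so the expensive step touches a small candidate set instead
of every document containing the first word.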

Biwords are also useful for an approximate phrase IDF. Take the
least common biword in the phrase and use its IDF as the IDF for
the whole phrase.
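
A small sketch of that approximation in Python. The IDF formula (a log
ratio with add-one smoothing) is my assumption for illustration, as are
the function names and sample documents; any IDF variant could be
substituted.

```python
import math
from collections import defaultdict

def biword_doc_freqs(docs):
    """Count, for each biword, how many documents contain it."""
    df = defaultdict(int)
    for text in docs:
        words = text.lower().split()
        # set() so a biword repeated in one document counts once
        for pair in set(zip(words, words[1:])):
            df[pair] += 1
    return df

def approx_phrase_idf(phrase, df, num_docs):
    """Use the IDF of the rarest biword as the IDF of the phrase."""
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        raise ValueError("phrase needs at least two words")
    rarest_df = min(df.get(p, 0) for p in pairs)
    # Add-one smoothing (an assumption) avoids dividing by zero
    # when a biword never occurs in the collection.
    return math.log((num_docs + 1) / (rarest_df + 1))

# Hypothetical sample documents.
docs = [
    "new york city subway",
    "new york is not kansas city",
    "a new york state of mind in york city",
    "kansas city barbecue",
]
df = biword_doc_freqs(docs)
idf = approx_phrase_idf("new york city", df, len(docs))
```

The phrase cannot occur in more documents than its rarest biword does, so
that biword's document frequency is an upper bound on the phrase's, and
the resulting IDF is a lower bound on the true phrase IDF.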

I believe Infoseek got the idea through William Chang, who had
been doing genomic search before that. William is at Baidu now.

Infoseek Ultra launched summer of 1996.

wunder


