On 7/12/08 7:00 PM, "Chris Harris" <[EMAIL PROTECTED]> wrote:

> Mike, your idea of indexing bigrams is also interesting. Do you know
> if any text search platforms do this behind the scenes as their
> default way of handling phrase queries?

Infoseek indexed biwords with their Ultra engine, which lives on as the Ultraseek enterprise engine. This makes phrase search very fast, but it isn't exact: a document can contain every biword of a longer phrase without containing the phrase itself.

A few years ago, Doug Cutting told me they were doing something a bit more clever with Nutch: using a biword search to build a candidate set of phrase matches, then running positional (exact) phrase search against just that set to get the true matches. That is a really nice trick that combines the speed of biword matches with the exactness of positional search.

Biwords are also useful for an approximate phrase IDF. Take the least common biword in the phrase and use its IDF as the IDF for the phrase.

I believe Infoseek got the idea through William Chang, who had been doing genomic search before that. William is at Baidu now. Infoseek Ultra launched in the summer of 1996.

wunder
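
P.S. A rough sketch of the two-stage phrase search and the biword IDF approximation described above, in case it helps. This is not the Nutch or Ultraseek code. The class and method names (BiwordIndex, candidates, phrase_search, phrase_idf) are made up for illustration, the tokenizer is a plain whitespace split, and the add-one smoothing in the IDF is my own assumption to avoid dividing by zero.

```python
import math


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


class BiwordIndex:
    """Toy in-memory index: biword postings plus per-term positions."""

    def __init__(self):
        self.biwords = {}    # (w1, w2) -> set of doc ids
        self.positions = {}  # term -> {doc id -> list of positions}
        self.num_docs = 0

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.num_docs += 1
        for pos, tok in enumerate(tokens):
            self.positions.setdefault(tok, {}).setdefault(doc_id, []).append(pos)
        for pair in zip(tokens, tokens[1:]):
            self.biwords.setdefault(pair, set()).add(doc_id)

    def candidates(self, phrase):
        # Stage 1: fast but inexact. Intersect the postings of every biword.
        tokens = tokenize(phrase)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return set()
        return set.intersection(*(self.biwords.get(p, set()) for p in pairs))

    def phrase_search(self, phrase):
        # Stage 2: exact positional check, run only on the candidate set.
        tokens = tokenize(phrase)
        hits = set()
        for doc_id in self.candidates(phrase):
            pos_sets = [set(self.positions[tok][doc_id]) for tok in tokens]
            if any(
                all(start + i in pos_sets[i] for i in range(len(tokens)))
                for start in pos_sets[0]
            ):
                hits.add(doc_id)
        return hits

    def phrase_idf(self, phrase):
        # Approximate phrase IDF: use the least common biword in the phrase.
        tokens = tokenize(phrase)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            raise ValueError("phrase needs at least two words")
        min_df = min(len(self.biwords.get(p, ())) for p in pairs)
        # Assumption: add-one smoothing, not necessarily what any engine uses.
        return math.log((self.num_docs + 1) / (min_df + 1))


index = BiwordIndex()
index.add(1, "new york city has a new mayor")
index.add(2, "york city is older than new york")
index.add(3, "visit york city today")
index.add(4, "nothing relevant here")

print(index.candidates("new york city"))     # docs 1 and 2 share both biwords
print(index.phrase_search("new york city"))  # only doc 1 has the exact phrase
print(index.phrase_idf("new york city"))
```

Document 2 contains both "new york" and "york city" but never "new york city", so it makes it into the candidate set and is dropped by the positional check. That is the false positive the biword-only approach would return.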