Mike Klaas suggested last month that I might be able to improve phrase search performance by indexing word bigrams, aka bigram shingles. I've been playing with this, and the initial results are very promising. (I may post some performance data later.) I wanted to describe my technique, which I'm not sure is what Mike had in mind, and see if anyone has any feedback on it. Let me know if it would be better to address this to the Lucene list.
[Note: These experiments are completely separate from the index corruption case I described very recently.]

Here is an excerpt from my schema.xml:

  <fieldType name="shingleText" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" outputUnigrams="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" outputUnigrams="false" outputUnigramIfNoNgram="true"/>
    </analyzer>
  </fieldType>

For indexing, I've used the stock ShingleFilterFactory with the outputUnigrams option, which tokenizes as follows:

[Exhibit A]

  "please divide this sentence into shingles" ->

  "please", "please divide"
  "divide", "divide this"
  "this", "this sentence"
  "sentence", "sentence into"
  "into", "into shingles"
  "shingles"

(Tokens on the same line share a position; that is, there is no position increment between them.)

Now for querying: I first tried using the exact same Exhibit A analyzer for queries, but this definitely did not help phrase search performance. (The reason makes sense if you delve into the Lucene source, though I don't know how to give a super-brief explanation.) So then I tried outputUnigrams=false with the stock ShingleFilterFactory, thereby tokenizing my queries as follows:

[Exhibit B]

  "please divide this sentence into shingles" ->

  "please divide"
  "divide this"
  "this sentence"
  "sentence into"
  "into shingles"

And when I did this, things got really zippy. The only problem was that it broke queries that were *not* phrase searches. That's because in this setup a single-word query (e.g. "please") gets tokenized into zero tokens, since a single word isn't long enough to form a bigram. So finally I modified the Lucene ShingleFilter class to add an "outputUnigramIfNoNgram" option.
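Before getting to that option: to make the index-time vs. query-time tokenization concrete, here is a rough, self-contained Java sketch of the Exhibit A and Exhibit B behavior. (Illustration only: this is not the actual Lucene ShingleFilter code, and the class and method names are just my own.)

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics the bigram-shingle tokenization from
// Exhibits A and B; it is NOT the real Lucene ShingleFilter.
public class ShingleSketch {

    // Exhibit A (index side, outputUnigrams=true): each word is
    // emitted, followed by the bigram that starts at that word.
    static List<String> indexTokens(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    // Exhibit B (query side, outputUnigrams=false): bigrams only,
    // so a one-word input yields zero tokens.
    static List<String> queryTokens(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "please divide this sentence into shingles";
        System.out.println(indexTokens(s));
        System.out.println(queryTokens(s));
    }
}
```

Note that this sketch ignores positions; in the real filter the unigram and the bigram on the same line of Exhibit A share a position.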
Basically, if you set that option, and also set outputUnigrams=false, then the filter tokenizes just as in Exhibit B, except that if the query is only one word long, it returns the corresponding single token rather than zero tokens. In other words:

[Exhibit C]

  "please" -> "please"

Things were still zippy. And, so far, I think I have seriously improved my phrase search performance without ruining anything. Are there any obvious drawbacks to this approach? I admit I haven't thought through exactly how this would affect relevancy scoring. I'm also not sure whether the new Lucene ShingleMatrixFilter could be made to do this more easily than the standard ShingleFilter. (I don't really understand the former yet.)

Cheers,
Chris
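P.S. In case the outputUnigramIfNoNgram behavior is unclear, here is the same kind of self-contained Java sketch for the query side with the fallback (again, illustration only; this is not my actual ShingleFilter patch, and the names are my own):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: query-side tokenization with outputUnigrams=false
// plus the proposed outputUnigramIfNoNgram fallback (Exhibits B and C).
// This is NOT the actual Lucene ShingleFilter patch.
public class QueryShingleSketch {

    static List<String> queryTokens(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        // Emit adjacent-word bigrams only, as in Exhibit B.
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        // Fallback (Exhibit C): a one-word query yields its single
        // token instead of zero tokens, so non-phrase queries still work.
        if (out.isEmpty() && words.length == 1) {
            out.add(words[0]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(queryTokens("please divide this sentence into shingles"));
        System.out.println(queryTokens("please"));
    }
}
```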