Mike Klaas suggested last month that I might be able to improve phrase
search performance by indexing word bigrams, aka bigram shingles. I've
been playing with this, and the initial results are very promising. (I
may post some performance data later.) I wanted to describe my
technique, which I'm not sure is what Mike had in mind, and see if
anyone has any feedback on it. Let me know if it would be better to
address this to the Lucene list.

[Note: These experiments are completely separate from the index
corruption case I described very recently.]

Here is an excerpt from my schema.xml:

    <fieldType name="shingleText" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
outputUnigrams="false" outputUnigramIfNoNgram="true" />
      </analyzer>
    </fieldType>

For indexing, I've used the stock ShingleFilterFactory with
outputUnigrams=true, which tokenizes as follows:

[Exhibit A]
"please divide this sentence into shingles" ->
  "please", "please divide"
  "divide", "divide this"
  "this", "this sentence"
  "sentence", "sentence into"
  "into", "into shingles"
  "shingles"

(Tokens on the same line share the same position; the position
increment between them is zero.)

Now for querying:

I first tried using the exact same Exhibit A analyzer for queries, but
this definitely did not help phrase search performance. (The reason
makes sense if you delve into the Lucene source, though I don't know
how to give a super-brief explanation.) So I then tried the stock
ShingleFilterFactory with outputUnigrams=false, which tokenizes my
queries as follows:

[Exhibit B]
"please divide this sentence into shingles" ->
  "please divide"
  "divide this"
  "this sentence"
  "sentence into"
  "into shingles"

And when I did this, things got really zippy. The only problem was
that it broke queries that were *not* phrase searches. That's because
in this setup a single-word query (e.g. "please") gets tokenized into
zero tokens: a single word isn't long enough to form a bigram.

So finally I modified the Lucene ShingleFilter class to add an
"outputUnigramIfNoNgram" option. Basically, if you set that option and
also set outputUnigrams=false, the filter tokenizes just as in Exhibit
B, except that a one-word query returns the corresponding single
token, rather than zero tokens. In other words,

[Exhibit C]
"please" ->
  "please"

Things were still zippy. And, so far, I think I have seriously
improved my phrase search performance without ruining anything.
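
My rough understanding of where the speedup comes from (take this
with a grain of salt): a phrase query over bigram shingles involves
fewer terms, and each bigram term is much rarer than its component
words, so there are far fewer postings and positions to intersect at
query time. In Lucene terms, with a made-up field name:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class PhraseComparison {
        public static void main(String[] args) {
            // Phrase "divide this sentence" over plain unigrams: three
            // terms, each potentially very common, whose positions must
            // all line up.
            PhraseQuery unigrams = new PhraseQuery();
            unigrams.add(new Term("text", "divide"));
            unigrams.add(new Term("text", "this"));
            unigrams.add(new Term("text", "sentence"));

            // The same phrase over bigram shingles: only two terms, each
            // much rarer, so there is much less work to do.
            PhraseQuery shingles = new PhraseQuery();
            shingles.add(new Term("text", "divide this"));
            shingles.add(new Term("text", "this sentence"));

            System.out.println(unigrams);
            System.out.println(shingles);
        }
    }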

Are there any obvious drawbacks to this approach? I admit I haven't
thought through exactly how it affects relevance scoring. I'm also not
sure whether the new Lucene ShingleMatrixFilter could do this more
simply than the standard ShingleFilter. (I don't really understand the
former yet.)

Cheers,
Chris
