Time to pull out the chalkboard. :-)
SIPs (statistically improbable phrases), at least in the Amazon sense, are usually found
by means of statistical independence testing. You
can find more info in Chris Manning and Hinrich
Schuetze's statistical NLP book (heads-up: they're
now working on an IR book with more of a focus on
sear
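
To give a feel for what an independence test on word pairs looks like, here is a rough sketch using Pearson's chi-square on a 2x2 contingency table, which is one of the collocation tests the Manning and Schuetze book covers. This is only an illustration, not a claim about how Amazon actually computes SIPs: the function name chi_square_bigrams and the choice of chi-square over other tests (t-test, likelihood ratio) are my own assumptions.

```python
from collections import Counter


def chi_square_bigrams(tokens):
    """Score each adjacent word pair with Pearson's chi-square (2x2 table).

    Hypothetical helper for illustration only. A high score means the two
    words co-occur more (or less) often than independence would predict.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)

    scores = {}
    for (w1, w2), o11 in pair_counts.items():
        o12 = first_counts[w1] - o11    # w1 followed by some other word
        o21 = second_counts[w2] - o11   # some other word followed by w2
        o22 = n - o11 - o12 - o21       # neither w1 first nor w2 second
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        # A degenerate table (zero row or column) carries no evidence.
        scores[(w1, w2)] = (
            n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        )
    return scores


if __name__ == "__main__":
    toy = "a b a b c d".split()
    for pair, score in sorted(chi_square_bigrams(toy).items(),
                              key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

On real text you would want a large corpus and a minimum-count cutoff, since chi-square is unreliable when the expected counts are tiny.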
I didn't make too much progress, and kind of ended up dropping it.
One thing that I played with was creating multiple phrase indexes, one
each for 2, 3, 4, and 5 words. I wrote a tokenizer that would batch up
the words, so, for the input string:
The quick brown fox jumps over the slow lazy