Re: Lucene and SIPs

2006-06-22 Thread Bob Carpenter
Time to pull out the chalkboard. :-) SIPs, at least in the Amazon sense, are usually found by means of statistical independence testing. You can find more info in Chris Manning and Hinrich Schuetze's statistical NLP book (heads-up: they're now working on an IR book with more of a focus on sear
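For anyone who wants to see what independence testing on word pairs can look like, here is a minimal sketch. It is not the method from the post or from Amazon. It assumes Pearson's chi-squared statistic on a 2x2 contingency table of adjacent word pairs, which is one common choice among several. The function name `chi_squared_bigrams` and the toy token list are made up for illustration.

```python
from collections import Counter

def chi_squared_bigrams(tokens):
    """Score each adjacent word pair with Pearson's chi-squared statistic.

    Assumption: a high score means the two words co-occur more often than
    independence would predict, which makes the pair a phrase candidate.
    """
    n = len(tokens) - 1                      # number of adjacent pairs
    first = Counter(tokens[:-1])             # counts of words in first position
    second = Counter(tokens[1:])             # counts of words in second position
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        o12 = first[w1] - o11                # w1 followed by some other word
        o21 = second[w2] - o11               # w2 preceded by some other word
        o22 = n - o11 - o12 - o21            # pairs containing neither
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        # Guard against degenerate tables where a row or column is empty.
        scores[(w1, w2)] = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
    return scores

# Hypothetical toy input: "x y" always occur together, other pairs do not.
scores = chi_squared_bigrams(["x", "y", "a", "b", "x", "y", "c", "d"])
best_pair = max(scores, key=scores.get)
```

On real text the counts would come from the whole corpus, and chi-squared is known to be unreliable for rare pairs, so a frequency cutoff or a different test may be needed.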

RE: Lucene and SIPs

2006-06-22 Thread Larry Ogrodnek
I didn't make too much progress, and kind of ended up dropping it. One thing that I played with was creating multiple phrase indexes, one each for 2, 3, 4, and 5 words. I wrote a tokenizer that would batch up the words, so, for the input string: The quick brown fox jumps over the slow lazy
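A rough sketch of the batching idea, in Python rather than as a Lucene tokenizer. It assumes the batches are overlapping (sliding-window) word groups and that splitting on whitespace is good enough; the post does not say either way. The names `word_ngrams` and `phrase_batches` are hypothetical, and the sample string is just a short illustration.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, joined by spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def phrase_batches(text, sizes=(2, 3, 4, 5)):
    """Build one list of phrases per size, mirroring one index per phrase length."""
    return {n: word_ngrams(text, n) for n in sizes}

# Hypothetical sample input.
batches = phrase_batches("The quick brown fox jumps")
two_word = batches[2]   # phrases destined for the 2-word index
five_word = batches[5]  # phrases destined for the 5-word index
```

Each list would then be fed to its own index, so a lookup for a 3-word phrase only touches the 3-word index.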