On Dec 21, 2004, at 10:41 AM, Ravi wrote:
> I want to be able to use stopwords in exact phrase searches. I have
> looked at Nutch and used the same approach (replace common words with
> n-grams; see net.nutch.analysis.CommonGrams).
> So if "to", "be", "or" and "not" are stop words, then for the string
> "to be or not to be" the analyzer produces the following tokens:
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
> be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
> or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch does. It creates only bigrams,
whereas you've expanded each common word into every longer n-gram as
well.
Are you also using a position increment of 0 for the gram tokens,
like Nutch does?
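
To make the bigram and position-increment point concrete, here is a
minimal sketch in plain Java. It is not the CommonGrams source: the
class BigramIndexSketch, its Token type, and the rule "emit a bigram
when the current word is common and has a successor" are assumptions
made for illustration. The gram is emitted with a position increment
of 0, so it shares a position with the word it starts at.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of index-time bigrams; not the Nutch implementation. */
final class BigramIndexSketch {

    /** A token with its type ("WORD" or "gram") and position increment. */
    static final class Token {
        final String text;
        final String type;
        final int positionIncrement;

        Token(String text, String type, int positionIncrement) {
            this.text = text;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return "[" + text + ":" + type + "]";
        }
    }

    /** Emits every word, plus a bigram (increment 0) after each common word. */
    static List<Token> analyze(List<String> words, Set<String> common) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(new Token(word, "WORD", 1));
            boolean hasNext = i + 1 < words.size();
            // Assumed rule: only a common word starts a gram.
            if (hasNext && common.contains(word)) {
                out.add(new Token(word + "-" + words.get(i + 1), "gram", 0));
            }
        }
        return out;
    }

    /** Absolute positions: the running sum of the position increments. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = 0;
        for (Token token : tokens) {
            position += token.positionIncrement;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        List<Token> tokens = analyze(Arrays.asList("the", "quick", "brown", "fox"), common);
        System.out.println(tokens);
        // prints [[the:WORD], [the-quick:gram], [quick:WORD], [brown:WORD], [fox:WORD]]
        System.out.println(positions(tokens));
        // prints [1, 1, 2, 3, 4]
    }
}
```

With this rule each common word adds exactly one extra token, instead
of one token per longer n-gram.
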
> But I'm having a problem with the search.
> When I search on "not to be", the analyzer converts my search into
> content:"not-to not-to-be to-be", because it produces the tokens
> not-to, not-to-be, to-be.
> I get 0 results because no document in the index contains the token
> sequence "not-to not-to-be to-be".
> I want just not-to-be from the analyzer during the search, so that
> searching on "not to be" finds the document that has not-to-be as a
> token.
> How can I use the same analyzer to get different results in indexing
> and searching?

Nutch does some different stuff between indexing and parsing queries...
[java] 1: [the:WORD] [the-quick:gram]
[java] 2: [quick:WORD]
[java] 3: [brown:WORD]
[java] 4: [fox:WORD]
[java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")
The first four lines show the analysis of "the quick brown fox". The
last line is the resulting Lucene query for "the quick brown". Notice
that only the content field gets analyzed specially, and that at any
position where a gram exists, the query uses only the gram token, not
the WORD token.
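
The query-side selection can be sketched roughly as follows. This is
an illustration only, not Nutch's query code: the class
GramQuerySketch, the method phraseTerms, and the rule that a common
word followed by another word becomes a bigram are all assumptions.
It picks one term per position, preferring the gram over the word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of query-side term selection; not Nutch's query code. */
final class GramQuerySketch {

    /**
     * Picks one term per position: the bigram if the word is common and
     * has a successor (assumed rule), otherwise the plain word.
     */
    static List<String> phraseTerms(List<String> words, Set<String> common) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            boolean hasNext = i + 1 < words.size();
            if (hasNext && common.contains(word)) {
                terms.add(word + "-" + words.get(i + 1)); // gram replaces the word
            } else {
                terms.add(word); // no gram at this position, keep the word
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "to", "be", "or", "not"));
        System.out.println(phraseTerms(Arrays.asList("the", "quick", "brown"), common));
        // prints [the-quick, quick, brown]
        System.out.println(phraseTerms(Arrays.asList("not", "to", "be"), common));
        // prints [not-to, to-be, be]
    }
}
```

Under this assumed bigram-only rule, an index of "to be or not to be"
built the same way holds not-to, to-be and be at consecutive
positions, because each gram shares a position with its first word,
so the phrase for "not to be" would line up with the indexed tokens.
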
Does this help with your situation?
Erik