
Stopwords in phrases

2004-12-21 Thread Ravi

I want to be able to use stopwords in exact phrase searches. I have
looked at Nutch and used the same approach: replace common words with
n-grams (see net.nutch.analysis.CommonGrams).
So if "to", "be", "or" and "not" are stop words, then for the string
"to be or not to be" the analyzer produces the following tokens:

[to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
or-not-to-be, not-to, not-to-be, to-be]
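
As a rough illustration of the token list above, the sketch below emits every contiguous n-gram of two or more words inside a run of consecutive stop words, joined with "-". It is plain Java rather than a Lucene TokenStream, and the class name StopwordGrams and method indexGrams are hypothetical. How the original analyzer treats non-stop words is not stated in the post, so limiting grams to stop-word runs is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Nutch.
class StopwordGrams {

    // Emits every contiguous n-gram (n >= 2) found inside a run of
    // consecutive stop words, ordered by start position, then by length.
    static List<String> indexGrams(String text, Set<String> stopWords) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            if (!stopWords.contains(words[start])) {
                continue; // assumption: a gram only starts on a stop word
            }
            StringBuilder gram = new StringBuilder(words[start]);
            // Extend the gram one stop word at a time until the run ends.
            for (int end = start + 1;
                 end < words.length && stopWords.contains(words[end]);
                 end++) {
                gram.append('-').append(words[end]);
                grams.add(gram.toString());
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("to", "be", "or", "not"));
        // Prints the 15 grams for the example string, from to-be through to-be.
        System.out.println(indexGrams("to be or not to be", stopWords));
    }
}
```

A run of k stop words yields k(k-1)/2 grams under this scheme, so the index grows quickly when long stop-word runs are common.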

This is exactly what I wanted from the analyzer during indexing, but
I'm having a problem with searching. When I search for "not to be",
the analyzer converts my query into

  content:not-to not-to-be to-be

because it produces the tokens not-to, not-to-be and to-be. I get 0
results, since there is no token "not-to not-to-be to-be" in the
index.

I want just not-to-be from the analyzer during the search, so that a
search for "not to be" returns the document that has not-to-be as a
token.

How can I use the same analyzer to get different results for indexing
and searching?
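
One possible direction, offered only as a sketch and not as anything Nutch or Lucene provides: give the analysis a query mode that collapses each run of consecutive stop words into its single longest gram, so that "not to be" becomes just not-to-be. The class QueryGrams and method queryTokens are hypothetical, and dropping a lone stop word (a run of length one) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical query-side helper for illustration; not a Lucene or Nutch API.
class QueryGrams {

    // Collapses each run of consecutive stop words into one gram that spans
    // the whole run; other words pass through unchanged.
    static List<String> queryTokens(String text, Set<String> stopWords) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (!stopWords.contains(words[i])) {
                tokens.add(words[i]);
                i++;
                continue;
            }
            StringBuilder gram = new StringBuilder(words[i]);
            int end = i + 1;
            while (end < words.length && stopWords.contains(words[end])) {
                gram.append('-').append(words[end]);
                end++;
            }
            // Assumption: a lone stop word has no gram and is dropped.
            if (end - i > 1) {
                tokens.add(gram.toString());
            }
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("to", "be", "or", "not"));
        System.out.println(queryTokens("not to be", stopWords)); // [not-to-be]
    }
}
```

Because every sub-gram of a stop-word run is already in the index, the longest gram from the query will match any document whose text contains that exact run.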

Thanks in advance,
Ravi. 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Stopwords in phrases

2004-12-21 Thread Erik Hatcher

On Dec 21, 2004, at 10:41 AM, Ravi wrote:
> I want to be able to use stopwords in exact phrase searches. I have
> looked at Nutch and used the same approach (replace common words with
> n-grams. Look at net.nutch.analysis.CommonGrams).
> So if to,be,or and not are stop words, for the string to be
> or not to be, the analyzer produces the following tokens
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
> be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
> or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch does. It creates only bigrams,
whereas you've expanded that to much longer n-grams.

Are you also using the position increment of 0 for the gram tokens 
like Nutch does?

> But I'm having a problem with the search.
> when I do a search on not to be the analyzer is converting my search
> into
> content:not-to not-to-be to-be because the analyzer produces the
> tokens not-to,not-to-be,to-be
> I'm getting 0 results on this as there is no token not-to not-to-be
> to-be in the index.
> I want just not-to-be from the analyzer during the search so when I
> search on not to be I will get the document which has not-to-be as a
> token.
>
> How can I use the same analyzer to get different results in indexing
> and searching?

Nutch does things differently when indexing and when parsing queries:

 [java] 1: [the:WORD] [the-quick:gram]
 [java] 2: [quick:WORD]
 [java] 3: [brown:WORD]
 [java] 4: [fox:WORD]
 [java] query = (+url:the quick brown^4.0) (+anchor:the quick brown^2.0) (+content:the-quick quick brown)

The first four lines show the analysis of "the quick brown fox". The
last line is the resulting Lucene query for "the quick brown". Notice
that only the content field gets analyzed specially, and that in that
field the gram token is used instead of the WORD token wherever a
gram exists.
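
The behaviour in that output can be modelled roughly as follows. This is a simplified sketch and not Nutch's CommonGrams code: the class BigramSketch, its Token type and both methods are hypothetical, and the only common word assumed is "the". At index time each word is emitted as a WORD token, and a common word also gets a bigram with the next word at the same position (position increment 0). At query time the gram replaces the WORD at any position that has one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model for illustration; not the actual Nutch implementation.
class BigramSketch {

    static final class Token {
        final String text;
        final String type;           // "WORD" or "gram"
        final int positionIncrement; // 0 means same position as previous token

        Token(String text, String type, int positionIncrement) {
            this.text = text;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }
    }

    // Index side: every word, plus a bigram stacked on each common word.
    static List<Token> indexTokens(String text, Set<String> common) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<Token> tokens = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            tokens.add(new Token(words[i], "WORD", 1));
            if (common.contains(words[i]) && i + 1 < words.length) {
                tokens.add(new Token(words[i] + "-" + words[i + 1], "gram", 0));
            }
        }
        return tokens;
    }

    // Query side: one term per position, preferring the gram over the WORD.
    static List<String> queryTerms(String text, Set<String> common) {
        List<String> terms = new ArrayList<>();
        for (Token t : indexTokens(text, common)) {
            if (t.positionIncrement == 0) {
                terms.set(terms.size() - 1, t.text); // gram replaces the WORD
            } else {
                terms.add(t.text);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        System.out.println(queryTerms("the quick brown", common)); // [the-quick, quick, brown]
    }
}
```

With bigrams there is at most one gram per position, which is what makes the "prefer the gram" rule unambiguous at query time.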

Does this help with your situation?
Erik

RE: Stopwords in phrases

2004-12-21 Thread Ravi


> Are you also using the position increment of 0 for the gram tokens
> like Nutch does?

Yes.

I don't think considering only the gram tokens will work for me. It
works for Nutch because Nutch uses only bigrams, so it can have only
one gram per token. In my case there is more than one, so even if I
keep only the grams I will still have the same problem.

Ravi.


