Re: Weird behaviour

2009-08-02 Thread Shai Erera
You write that you index the string under the url field. Do you also index
it under title? If not, that can explain why title:Rahul Dravid does not
work for you.

Also, did you try looking at the index with Luke? It will show you which
terms are in the index.

Another thing that always helps when debugging this kind of problem is to
create a StandardAnalyzer, request a tokenStream() from it, passing a
StringReader with the text you want to parse, and then just print the
tokens returned.

I've done that, using the version from trunk, with Version.LUCENE_24, and
the tokens that are extracted are:
(http,0,4,type=ALPHANUM)
(en.wikipedia.org,7,23,type=HOST)
(wiki,24,28,type=ALPHANUM)
(rahul,29,34,type=ALPHANUM)
(dravid,35,41,type=ALPHANUM)

So:
1) You don't get results for title:Rahul Dravid since you index it under
url and not title.
2) url:wiki/Rahul_Dravid works, since it looks for a phrase that exists in
the index (look at the last 3 tokens produced by the Analyzer, in the output
above).
3) url:<entire string> also works, since you index all of it under the url
field.
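
To make points 2 and 3 concrete, here is a minimal plain-Java sketch (not
Lucene code) that treats a phrase query as a search for a consecutive run of
tokens. It assumes the query text is analyzed into the same tokens as listed
above; the class and method names are made up for illustration, and real
phrase matching in Lucene works on token positions rather than list indices.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: a hypothetical helper, not part of Lucene.
class PhraseMatchSketch {
    // Tokens reported above for http://en.wikipedia.org/wiki/Rahul_Dravid
    static final List<String> INDEXED = Arrays.asList(
            "http", "en.wikipedia.org", "wiki", "rahul", "dravid");

    // True when the query tokens occur as one consecutive run in the indexed tokens.
    static boolean phraseMatches(List<String> indexed, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        // url:wiki/Rahul_Dravid, assumed to analyze to wiki, rahul, dravid
        System.out.println(phraseMatches(INDEXED, Arrays.asList("wiki", "rahul", "dravid")));
        // The entire string, i.e. all five tokens in order
        System.out.println(phraseMatches(INDEXED, INDEXED));
        // Same tokens in a different order do not form the phrase
        System.out.println(phraseMatches(INDEXED, Arrays.asList("rahul", "wiki")));
    }
}
```

The first two checks succeed because the tokens appear in that order in the
url field; the last one fails because the order differs.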

Does this explain the behavior you see?

Shai

On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi prashullega...@gmail.com
 wrote:

 Hi,

 I've indexed some 50 million documents. I've indexed the target URL of each
 document as the url field, using StandardAnalyzer with Index.ANALYZED.
 Suppose there is a Wikipedia page with title:Rahul Dravid and
 url: http://en.wikipedia.org/wiki/Rahul_Dravid.

 But when I search for +title:Rahul Dravid +url:Wikipedia, I get no
 results. I get the document(s) when I search for
 url:http://en.wikipedia.org/wiki/Rahul_Dravid or
 url:en.wikipedia.org/wiki/Rahul_Dravid. I get results even when I search
 for url:wiki/Rahul_Dravid.

 It'd be helpful if somebody could shed some light on this.

 -- Prashant.



Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
First, I'm indexing the string in the url field only.

I've never used Luke, so I don't know how to use it.

What I'm trying to do is search for those documents which are from
some particular site, and have a given title.





Re: Weird behaviour

2009-08-02 Thread Shai Erera
How do you parse/convert the page to a Document object? Are you sure the
title Rahul Dravid is extracted properly and put in the title field?

You can read about Luke here: http://www.getopt.org/luke/.

Can you do System.out.println(document.toString()) before you add it to the
index, and paste the output here?

Shai




Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is
a document relevant to this query as well.
The following query and its results prove it:

Enter query:
Searching for: +title:rahul dravid +url:wiki
4 total matching documents
   trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid
   trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid
   trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid
   trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid
Press (q)uit or enter number to jump to a page.

But see the following query:

Enter query:
+title:rahul dravid +url:wikipedia
Searching for: +title:rahul dravid +url:wikipedia
0 total matching documents
Press (q)uit or enter number to jump to a page.

Isn't it weird?

-- Prashant.




Re: Weird behaviour

2009-08-02 Thread Phil Whelan
Hi Prashant,

I agree with Shai that using Luke, and printing out what the Document
looks like before it goes into the index, is going to be your best
bet for debugging this problem.

The problem you're having is that StandardAnalyzer does not break up
the hostname into separate terms, as it has a special case for
hostnames and acronyms.

This should work...
+title:rahul dravid +url:en.wikipedia.org
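
As a rough illustration (plain Java, not the Lucene API; the class and
method names are hypothetical), a single-term query only matches if that
exact term is in the index. Using the tokens Shai listed earlier in the
thread for this URL:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: a hypothetical stand-in for a term lookup on the url field.
class TermLookupSketch {
    // Terms reported in the thread for http://en.wikipedia.org/wiki/Rahul_Dravid
    static final Set<String> URL_TERMS = new HashSet<>(Arrays.asList(
            "http", "en.wikipedia.org", "wiki", "rahul", "dravid"));

    // Lowercases the query term (StandardAnalyzer lowercases tokens) and
    // checks for an exact match against the indexed terms.
    static boolean hasTerm(String queryTerm) {
        return URL_TERMS.contains(queryTerm.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasTerm("Wikipedia"));        // no such term in the index
        System.out.println(hasTerm("en.wikipedia.org")); // the whole hostname is one term
        System.out.println(hasTerm("wiki"));             // path segment is its own term
    }
}
```

The term "wikipedia" never exists on its own, because the hostname was kept
as the single term en.wikipedia.org.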

Thanks,
Phil

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Hi Phil,

The query you gave did work. Well, that proves StandardAnalyzer has a
different way of tokenizing URLs.

Thanks,
Prashant.




Re: Weird behaviour

2009-08-02 Thread Shai Erera
You can always create your own Analyzer which creates a TokenStream just
like StandardAnalyzer, but instead of using StandardFilter, write another
TokenFilter which receives tokens of type HOST and breaks them further into
their components (e.g., extracts en, wikipedia and org). You can also
return the original HOST token along with its components.
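
Here is a minimal sketch of the splitting step such a filter would perform,
in plain Java rather than as a real Lucene TokenFilter. The class name,
method name, and the "HOST" type string are assumptions for illustration
(the output earlier in the thread prints the type as HOST; check the exact
type string your tokenizer reports). A real filter would also have to decide
on position increments, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: hypothetical helper showing the HOST-splitting logic.
class HostTokenSplitterSketch {
    // Assumed type label, copied from the token output shown in the thread.
    static final String HOST_TYPE = "HOST";

    // For a HOST token, returns the original term followed by its
    // dot-separated components; any other token is returned unchanged.
    static List<String> expand(String term, String type) {
        if (!HOST_TYPE.equals(type)) {
            return Collections.singletonList(term);
        }
        List<String> out = new ArrayList<>();
        out.add(term); // keep the original so url:en.wikipedia.org still matches
        for (String part : term.split("\\.")) {
            if (!part.isEmpty() && !part.equals(term)) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("en.wikipedia.org", "HOST"));
        System.out.println(expand("wiki", "ALPHANUM"));
    }
}
```

With the components indexed as separate terms, a query such as
url:wikipedia would have a term to match.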

I hope this helps.

Shai




Re: Weird behaviour

2009-08-02 Thread prashant ullegaddi
Thank you Phil and Shai.

I will write a different Analyzer.
