Re: Weird behaviour
You write that you index the string under the url field. Do you also index it under title? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look at the index w/ Luke? It will show you what are the terms in the index. Another thing which is always good to debug such things is to create a StandardAnalyzer, then request a tokenStream() from it, passing a StringReader w/ the text you want to parse. Then just print the tokens returned. I've done that, using the version from trunk, w/ Version.2_4, and the tokens that are extracted are: (http,0,4,type=ALPHANUM) (en.wikipedia.org,7,23,type=HOST) (wiki,24,28,type=ALPHANUM) (rahul,29,34,type=ALPHANUM) (dravid,35,41,type=ALPHANUM) So: 1) You don't get results for title:Rahul Dravid since you index it under url and not title. 2) url:wiki/Rahul_Dravid works, since it looks for a phrase that exists in the index (look at the last 3 tokens produced by the Analyzer, in the output above). 3) ur:entire string also works, since you index all of it under the url field. Does this explain the behavior you see? Shai On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I've indexed some 50million documents. I've indexed the target URL of each document as url field by using StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia page with title:Rahul Dravid and url: http://en.wikipedia.org/wiki/Rahul_Dravid. But when I search for +title:Rahul Dravid +url:Wikipedia, I'm getting no results. I get the document(s) when I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url: en.wikipedia.org/wiki/Rahul_Dravid. I get results even when I search for url:wiki/Rahul_Dravid. It'd be helpful if somebody can throw some light on this. -- Prashant.
Re: Weird behaviour
Firstly, I'm indexing the string in url field only. I've never used Luke, I don't know how to use. What I'm trying to do is search for those documents which are from some particular site, and have a given title. On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera ser...@gmail.com wrote: You write that you index the string under the url field. Do you also index it under title? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look at the index w/ Luke? It will show you what are the terms in the index. Another thing which is always good to debug such things is to create a StandardAnalyzer, then request a tokenStream() from it, passing a StringReader w/ the text you want to parse. Then just print the tokens returned. I've done that, using the version from trunk, w/ Version.2_4, and the tokens that are extracted are: (http,0,4,type=ALPHANUM) (en.wikipedia.org,7,23,type=HOST) (wiki,24,28,type=ALPHANUM) (rahul,29,34,type=ALPHANUM) (dravid,35,41,type=ALPHANUM) So: 1) You don't get results for title:Rahul Dravid since you index it under url and not title. 2) url:wiki/Rahul_Dravid works, since it looks for a phrase that exists in the index (look at the last 3 tokens produced by the Analyzer, in the output above). 3) ur:entire string also works, since you index all of it under the url field. Does this explain the behavior you see? Shai On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I've indexed some 50million documents. I've indexed the target URL of each document as url field by using StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia page with title:Rahul Dravid and url: http://en.wikipedia.org/wiki/Rahul_Dravid. But when I search for +title:Rahul Dravid +url:Wikipedia, I'm getting no results. I get the document(s) when I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url: en.wikipedia.org/wiki/Rahul_Dravid. I get results even when I search for url:wiki/Rahul_Dravid. It'd be helpful if somebody can throw some light on this. -- Prashant.
Re: Weird behaviour
How do you parse/convert the page to a Document object? Are you sure the title Rahul Dravid is extracted properly and put in the title field? You can read about Luke here: http://www.getopt.org/luke/. Can you do System.out.println(document.toString()) before you add it to the index, and paste the output here? Shai On Sun, Aug 2, 2009 at 4:47 PM, prashant ullegaddi prashullega...@gmail.com wrote: Firstly, I'm indexing the string in url field only. I've never used Luke, I don't know how to use. What I'm trying to do is search for those documents which are from some particular site, and have a given title. On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera ser...@gmail.com wrote: You write that you index the string under the url field. Do you also index it under title? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look at the index w/ Luke? It will show you what are the terms in the index. Another thing which is always good to debug such things is to create a StandardAnalyzer, then request a tokenStream() from it, passing a StringReader w/ the text you want to parse. Then just print the tokens returned. I've done that, using the version from trunk, w/ Version.2_4, and the tokens that are extracted are: (http,0,4,type=ALPHANUM) (en.wikipedia.org,7,23,type=HOST) (wiki,24,28,type=ALPHANUM) (rahul,29,34,type=ALPHANUM) (dravid,35,41,type=ALPHANUM) So: 1) You don't get results for title:Rahul Dravid since you index it under url and not title. 2) url:wiki/Rahul_Dravid works, since it looks for a phrase that exists in the index (look at the last 3 tokens produced by the Analyzer, in the output above). 3) ur:entire string also works, since you index all of it under the url field. Does this explain the behavior you see? Shai On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I've indexed some 50million documents. I've indexed the target URL of each document as url field by using StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia page with title:Rahul Dravid and url: http://en.wikipedia.org/wiki/Rahul_Dravid. But when I search for +title:Rahul Dravid +url:Wikipedia, I'm getting no results. I get the document(s) when I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url: en.wikipedia.org/wiki/Rahul_Dravid. I get results even when I search for url:wiki/Rahul_Dravid. It'd be helpful if somebody can throw some light on this. -- Prashant.
Re: Weird behaviour
Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is a document relevant to this query as well. The following query and its results proves it: Enter query: Searching for: +title:rahul dravid +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid Press (q)uit or enter number to jump to a page. But see following query: Enter query: +title:rahul dravid +url:wikipedia Searching for: +title:rahul dravid +url:wikipedia 0 total matching documents Press (q)uit or enter number to jump to a page. Isn't it weird? -- Prashant. On Sun, Aug 2, 2009 at 9:13 PM, Shai Erera ser...@gmail.com wrote: How do you parse/convert the page to a Document object? Are you sure the title Rahul Dravid is extracted properly and put in the title field? You can read about Luke here: http://www.getopt.org/luke/. Can you do System.out.println(document.toString()) before you add it to the index, and paste the output here? Shai On Sun, Aug 2, 2009 at 4:47 PM, prashant ullegaddi prashullega...@gmail.com wrote: Firstly, I'm indexing the string in url field only. I've never used Luke, I don't know how to use. What I'm trying to do is search for those documents which are from some particular site, and have a given title. On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera ser...@gmail.com wrote: You write that you index the string under the url field. Do you also index it under title? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look at the index w/ Luke? It will show you what are the terms in the index. Another thing which is always good to debug such things is to create a StandardAnalyzer, then request a tokenStream() from it, passing a StringReader w/ the text you want to parse. Then just print the tokens returned. I've done that, using the version from trunk, w/ Version.2_4, and the tokens that are extracted are: (http,0,4,type=ALPHANUM) (en.wikipedia.org,7,23,type=HOST) (wiki,24,28,type=ALPHANUM) (rahul,29,34,type=ALPHANUM) (dravid,35,41,type=ALPHANUM) So: 1) You don't get results for title:Rahul Dravid since you index it under url and not title. 2) url:wiki/Rahul_Dravid works, since it looks for a phrase that exists in the index (look at the last 3 tokens produced by the Analyzer, in the output above). 3) ur:entire string also works, since you index all of it under the url field. Does this explain the behavior you see? Shai On Sun, Aug 2, 2009 at 1:27 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I've indexed some 50million documents. I've indexed the target URL of each document as url field by using StandardAnalyzer with index.ANALYZED. Suppose, there is a wikipedia page with title:Rahul Dravid and url: http://en.wikipedia.org/wiki/Rahul_Dravid. But when I search for +title:Rahul Dravid +url:Wikipedia, I'm getting no results. I get the document(s) when I search for url:http://en.wikipedia.org/wiki/Rahul_Dravid or url: en.wikipedia.org/wiki/Rahul_Dravid. I get results even when I search for url:wiki/Rahul_Dravid. It'd be helpful if somebody can throw some light on this. -- Prashant.
Re: Weird behaviour
Hi Prashant, I agree with Shai, that using Luke and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break-up the hostname into separate terms, as it has a special case for hostnames and acronyms. This should work... +title:rahul dravid +url:en.wikipedia.org Thanks, Phil On Sun, Aug 2, 2009 at 10:14 AM, prashant ullegaddiprashullega...@gmail.com wrote: Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is a document relevant to this query as well. The following query and its results proves it: Enter query: Searching for: +title:rahul dravid +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid Press (q)uit or enter number to jump to a page. But see following query: Enter query: +title:rahul dravid +url:wikipedia Searching for: +title:rahul dravid +url:wikipedia 0 total matching documents Press (q)uit or enter number to jump to a page. Isn't it weird? -- Prashant. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Weird behaviour
Hi Phil, The query you gave did work. Well, that proves StandardAnalyzer has a different way of tokenizing URLs. Thanks, Prashant. On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan phil...@gmail.com wrote: Hi Prashant, I agree with Shai, that using Luke and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break-up the hostname into separate terms, as it has a special case for hostnames and acronyms. This should work... +title:rahul dravid +url:en.wikipedia.org Thanks, Phil On Sun, Aug 2, 2009 at 10:14 AM, prashant ullegaddiprashullega...@gmail.com wrote: Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is a document relevant to this query as well. The following query and its results proves it: Enter query: Searching for: +title:rahul dravid +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid Press (q)uit or enter number to jump to a page. But see following query: Enter query: +title:rahul dravid +url:wikipedia Searching for: +title:rahul dravid +url:wikipedia 0 total matching documents Press (q)uit or enter number to jump to a page. Isn't it weird? -- Prashant. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Weird behaviour
You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type, and breaks it further to its components (e.g., extract en, wikipedia and org). You can also return the original HOST token and its components. I hope this helps. Shai On Sun, Aug 2, 2009 at 8:58 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi Phil, The query you gave did work. Well, that proves StandardAnalyzer has a different way of tokenizing URLs. Thanks, Prashant. On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan phil...@gmail.com wrote: Hi Prashant, I agree with Shai, that using Luke and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break-up the hostname into separate terms, as it has a special case for hostnames and acronyms. This should work... +title:rahul dravid +url:en.wikipedia.org Thanks, Phil On Sun, Aug 2, 2009 at 10:14 AM, prashant ullegaddiprashullega...@gmail.com wrote: Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is a document relevant to this query as well. The following query and its results proves it: Enter query: Searching for: +title:rahul dravid +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid Press (q)uit or enter number to jump to a page. But see following query: Enter query: +title:rahul dravid +url:wikipedia Searching for: +title:rahul dravid +url:wikipedia 0 total matching documents Press (q)uit or enter number to jump to a page. Isn't it weird? -- Prashant. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Weird behaviour
Thank you Phil and Shai. I will write a different Analyzer. On Sun, Aug 2, 2009 at 11:50 PM, Shai Erera ser...@gmail.com wrote: You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type, and breaks it further to its components (e.g., extract en, wikipedia and org). You can also return the original HOST token and its components. I hope this helps. Shai On Sun, Aug 2, 2009 at 8:58 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi Phil, The query you gave did work. Well, that proves StandardAnalyzer has a different way of tokenizing URLs. Thanks, Prashant. On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan phil...@gmail.com wrote: Hi Prashant, I agree with Shai, that using Luke and printing out what the Document looks like before it goes into the index, are going to be your best bet for debugging this problem. The problem you're having is that StandardAnalyzer does not break-up the hostname into separate terms, as it has a special case for hostnames and acronyms. This should work... +title:rahul dravid +url:en.wikipedia.org Thanks, Phil On Sun, Aug 2, 2009 at 10:14 AM, prashant ullegaddiprashullega...@gmail.com wrote: Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is a document relevant to this query as well. The following query and its results proves it: Enter query: Searching for: +title:rahul dravid +url:wiki 4 total matching documents trec-id: clueweb09-enwp02-13-14368, URL: http://en.wikipedia.org/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-83-11378, URL: http://en.wikipedia.org/wiki/Rahul_S_Dravid trec-id: clueweb09-en0011-08-22737, URL: http://www.reference.com/browse/wiki/Rahul_Dravid trec-id: clueweb09-enwp01-69-13556, URL: http://en.wikipedia.org/wiki/Rahul_Sharad_Dravid Press (q)uit or enter number to jump to a page. But see following query: Enter query: +title:rahul dravid +url:wikipedia Searching for: +title:rahul dravid +url:wikipedia 0 total matching documents Press (q)uit or enter number to jump to a page. Isn't it weird? -- Prashant. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org