Hi,
I've indexed some 50 million documents. I've indexed the target URL of each
document in the url field, using
StandardAnalyzer with Field.Index.ANALYZED. Suppose there is a Wikipedia page
with title:Rahul Dravid and
url: http://en.wikipedia.org/wiki/Rahul_Dravid.
But when I search for +title:Rahul
You write that you index the string under the url field. Do you also index
it under title? If not, that can explain why title:Rahul Dravid does not
work for you.
Also, did you try looking at the index with Luke? It will show you which
terms are in the index.
Another thing which is always good
Firstly, I'm indexing the string in the url field only.
I've never used Luke, and I don't know how to use it.
What I'm trying to do is search for documents that are from
a particular site and have a given title.
On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera ser...@gmail.com wrote:
You write
How do you parse/convert the page to a Document object? Are you sure the
title Rahul Dravid is extracted properly and put in the title field?
You can read about Luke here: http://www.getopt.org/luke/.
Can you do System.out.println(document.toString()) before you add it to the
index, and paste
Yes, I'm sure that title:Rahul Dravid is extracted properly, and there is
a document relevant to this query as well.
The following query and its results prove it:
Enter query:
Searching for: +title:rahul dravid +url:wiki
4 total matching documents
trec-id: clueweb09-enwp02-13-14368, URL:
Hi Prashant,
I agree with Shai that using Luke and printing out what the Document
looks like before it goes into the index are going to be your best
bet for debugging this problem.
The problem you're having is that StandardAnalyzer does not break up
the hostname into separate terms, as it has
Hi Phil,
The query you gave did work. Well, that shows StandardAnalyzer does not
tokenize URLs the way I had assumed.
Thanks,
Prashant.
On Sun, Aug 2, 2009 at 11:22 PM, Phil Whelan phil...@gmail.com wrote:
Hi Prashant,
I agree with Shai, that using Luke and printing out what the Document
looks
You can always create your own Analyzer which creates a TokenStream just
like StandardAnalyzer, but instead of using StandardFilter, write another
TokenFilter which receives the HOST token type, and breaks it further to its
components (e.g., extract en, wikipedia and org). You can also return
the
Thank you Phil and Shai.
I will write a different Analyzer.
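
As a rough illustration of the splitting step suggested above (breaking a
host such as en.wikipedia.org into en, wikipedia and org), here is a minimal
plain-Java sketch. It deliberately does not use the Lucene API: HostSplitter
and splitHost are hypothetical names, not part of Lucene. A real TokenFilter
would presumably apply the same logic to tokens of the HOST type inside the
analyzer chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: shows only the host-splitting logic.
final class HostSplitter {

    // Splits a host name on dots and lowercases each component,
    // e.g. "en.wikipedia.org" becomes "en", "wikipedia", "org".
    static List<String> splitHost(String host) {
        List<String> parts = new ArrayList<>();
        for (String part : host.split("\\.")) {
            // Skip empty pieces produced by leading or doubled dots.
            if (!part.isEmpty()) {
                parts.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [en, wikipedia, org]
        System.out.println(splitHost("en.wikipedia.org"));
    }
}
```

With the components indexed as separate terms, a clause such as
+url:wikipedia could match the term directly, which is the behaviour the
original query was expecting.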
On Sun, Aug 2, 2009 at 11:50 PM, Shai Erera ser...@gmail.com wrote:
You can always create your own Analyzer which creates a TokenStream just
like StandardAnalyzer, but instead of using StandardFilter, write another
TokenFilter which