Re: Luke and Indexes

2005-12-08 Thread Andrzej Bialecki

Bryan Woliner wrote:

> I have a couple of very basic questions about Luke and indexes in
> general. Answers to any of these questions are much appreciated:
>
> 1. In the Luke overview tab, what does Index version refer to?

It's the time (as in System.currentTimeMillis()) when the index was last 
modified.
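
For readers who want to read that value as a date, here is a minimal
sketch. It assumes the version shown by Luke really is a millisecond
timestamp of the kind System.currentTimeMillis() returns. The function
name and the sample value are made up for illustration and are not taken
from a real index.

```python
from datetime import datetime, timezone


def index_version_to_utc(version_ms: int) -> datetime:
    """Interpret an index version as milliseconds since the Unix epoch (UTC)."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(version_ms, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=millis * 1000)


# Hypothetical value, chosen only to demonstrate the conversion.
example_version = 1134000000000
print(index_version_to_utc(example_version).isoformat())
# prints 2005-12-08T00:00:00+00:00
```

If the printed date looks implausible for your index, the assumption
about the value being a plain timestamp probably does not hold for your
Lucene version.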



> 2. Also in the overview tab, if Has Deletions? is equal to yes,
> what are the possible sources of deletions? Dedup? Manual deletions
> through Luke?

Either. Both.


> 3. Is there any way (w/ Luke or otherwise) to get a file listing all
> of the docs in an index? Basically, is there an index equivalent of
> this command (which outputs all the URLs in a segment):
>
> bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

You can browse through documents on the Document tab, but there is no
option to dump all documents to a file. Besides, fields that were not
stored are no longer accessible, so you cannot retrieve their original
values from the index (you may be able to reconstruct them, but it's a
lossy operation).



> 4. Finally, my last question is the one I'm most perplexed by:
>
> I called bin/nutch segread -list -dir for a particular segments
> directory and found out that one directory had 93 entries. BUT, when I
> opened up the index of that segment in Luke, there were only 23
> documents (and 3 deletions)! Where did the rest of the URLs go??

Run segread -dump and check the protocol status and parse status of
the pages that didn't make it into the index. Most likely you
encountered either protocol errors or parsing errors, so there was
nothing to index from these entries.


In addition, if you ran the deduplication, some of the entries in your 
index may have been deleted because they were considered duplicates.
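
The accounting described above can be sketched as a small tally. This is
only an illustration: the record fields (protocol_ok, parse_ok,
duplicate) and the sample entries are hypothetical and do not reflect
the actual segread -dump output format, which you would need to parse
yourself.

```python
from collections import Counter


def tally_outcomes(entries):
    """Count segment entries by the first stage at which they dropped out."""
    counts = Counter()
    for entry in entries:
        if not entry["protocol_ok"]:
            counts["protocol_error"] += 1   # fetch failed, nothing to parse
        elif not entry["parse_ok"]:
            counts["parse_error"] += 1      # fetched, but no parsed text to index
        elif entry.get("duplicate", False):
            counts["deduplicated"] += 1     # indexed, then deleted by dedup
        else:
            counts["indexed"] += 1          # still live in the index
    return counts


# Hypothetical records standing in for parsed dump output.
sample = [
    {"url": "http://example.com/a", "protocol_ok": True, "parse_ok": True},
    {"url": "http://example.com/b", "protocol_ok": False, "parse_ok": False},
    {"url": "http://example.com/c", "protocol_ok": True, "parse_ok": False},
    {"url": "http://example.com/d", "protocol_ok": True, "parse_ok": True,
     "duplicate": True},
]
for reason, count in sorted(tally_outcomes(sample).items()):
    print(reason, count)
```

The per-category counts should add up to the number of entries that
segread -list reports for the segment.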


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Luke and Indexes

2005-12-08 Thread Bryan Woliner
Thank you very much for the helpful answers. Most of the pages that
didn't make it into the index were indeed dropped because of protocol
errors (mostly exceeding http.max.delays).

One quick side note. When I was looking at the Nutch wiki page for
bin/nutch segread, I noticed an error on the page and wasn't sure how
to go about fixing it, or alerting someone who can. The page currently
reads:

...
-nocontent
  ignore content data
-noparsedata
  ignore parse_data data
-nocontent
  ignore parse_text data
...

The 2nd -nocontent should probably be -noparsetext, right?

Thanks again for the help,
Bryan


Luke and Indexes

2005-12-07 Thread Bryan Woliner
I have a couple of very basic questions about Luke and indexes in
general. Answers to any of these questions are much appreciated:

1. In the Luke overview tab, what does Index version refer to?

2. Also in the overview tab, if Has Deletions? is equal to yes,
what are the possible sources of deletions? Dedup? Manual deletions
through Luke?

3. Is there any way (w/ Luke or otherwise) to get a file listing all
of the docs in an index? Basically, is there an index equivalent of
this command (which outputs all the URLs in a segment):

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

4. Finally, my last question is the one I'm most perplexed by:

I called bin/nutch segread -list -dir for a particular segments
directory and found out that one directory had 93 entries. BUT, when I
opened up the index of that segment in Luke, there were only 23
documents (and 3 deletions)! Where did the rest of the URLs go??

Thanks ahead of time for any helpful suggestions,
Bryan