Hi,

the folder "624730206" does indeed point to a failed (or canceled?) updatedb job.
If the job succeeds, the intermediate output path (a random number)
is installed (moved) to "current".
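To illustrate the install step: roughly speaking, the old "current" is rotated to "old" and the temporary output replaces "current". This is a local-filesystem sketch of that rename sequence, not Nutch's actual HDFS code; the directory names are taken from the listing below for illustration only.

```shell
set -e
cd "$(mktemp -d)"
# Stand-in layout: temporary updatedb output plus the existing current/old
mkdir -p crawldb/624730206 crawldb/current crawldb/old
# Install step (sketch): rotate current to old, then install the temp output
rm -rf crawldb/old
mv crawldb/current crawldb/old
mv crawldb/624730206 crawldb/current
ls crawldb   # only "current" and "old" remain
```

Since "624730206" is still present in your listing, that rename sequence never ran to completion for the 2013-07-05 job.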

You should have a look at the logs around 2013-07-05 23:55.

Assuming crawling continues, the missed data will be crawled again
(or already has been, since it happened 3 days ago).

Sebastian

On 07/08/2013 09:24 PM, eakarsu wrote:
> 
> I have a question on the contents of the crawldb folder with Nutch 1.6.
> 
> After I run the updatedb step, the crawldb folder includes the following. Is
> this the correct result I should get?
> If not, how can I fix it?
> 
> If I execute "generate" on this crawldb below, will it generate the full URL
> lists? My concern is that the updatedb process did not complete fully, because
> we see the "624730206" and "current" folders at the same time.
> Does Nutch take care of this?
> 
> I appreciate your help
> 
> 
> hduser@hadoopdev1:~$ hadoop dfs -ls 160milyonurls/crawldb
> Warning: $HADOOP_HOME is deprecated.
> 
> Found 3 items
> drwxr-xr-x   - hduser supergroup          0 2013-07-05 23:55
> /user/hduser/160milyonurls/crawldb/624730206
> drwxr-xr-x   - hduser supergroup          0 2013-07-08 18:59
> /user/hduser/160milyonurls/crawldb/current
> drwxr-xr-x   - hduser supergroup          0 2013-07-03 14:39
> /user/hduser/160milyonurls/crawldb/old
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/crawldb-contents-tp4076345.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 