Hi, the folder "624730206" indeed points to a failed (or canceled?) updatedb job. If the job succeeds, the intermediate output path (a random number) is installed (moved) to "current".
You should have a look at the logs around 2013-07-05 23:55. Assuming crawling is continued, the missed data will be crawled again (or already has been, since this happened three days ago).

Sebastian

On 07/08/2013 09:24 PM, eakarsu wrote:
>
> I have a question on the contents of the crawldb folder with Nutch 1.6.
>
> After I do the updatedb step, the crawldb folder includes the following. Is this
> the correct result I should get? If not, how can I fix it?
>
> If I execute "generate" on this crawldb below, will it generate full URL
> lists? My concern is that the updatedb process did not complete fully, because we
> have both the "624730206" and "current" folders at the same time.
> Does Nutch take care of this?
>
> I appreciate your help.
>
>
> hduser@hadoopdev1:~$ hadoop dfs -ls 160milyonurls/crawldb
> Warning: $HADOOP_HOME is deprecated.
>
> Found 3 items
> drwxr-xr-x - hduser supergroup 0 2013-07-05 23:55
> /user/hduser/160milyonurls/crawldb/624730206
> drwxr-xr-x - hduser supergroup 0 2013-07-08 18:59
> /user/hduser/160milyonurls/crawldb/current
> drwxr-xr-x - hduser supergroup 0 2013-07-03 14:39
> /user/hduser/160milyonurls/crawldb/old
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/crawldb-contents-tp4076345.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
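For illustration, the rotation that the install step performs can be simulated locally with plain mv commands. This is only a sketch: the real updatedb job does the equivalent renames on HDFS, and the local directory names below merely mirror the listing in the question.

```shell
# Simulate the crawldb "install" step with plain mv commands.
# "624730206" stands in for the job's temporary output directory.
mkdir -p crawldb/624730206 crawldb/current
rm -rf crawldb/old
mv crawldb/current crawldb/old         # previous database is kept as "old"
mv crawldb/624730206 crawldb/current   # new database is installed as "current"
ls crawldb                             # current  old
```

Because a successful job leaves no random-number directory behind, seeing "624730206" next to "current" is the sign of an incomplete run. If the logs confirm the job failed, the orphaned intermediate directory can simply be removed (e.g. with hadoop dfs -rmr on the full path) before re-running updatedb.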

