Re: crawldb contents

Sebastian Nagel Tue, 09 Jul 2013 13:31:48 -0700

It should be possible to merge the CrawlDbs,
but not that way. "current" is a hard-wired
subdir of every CrawlDb, so a correct call would not contain "current":
 nutch mergedb <output> crawldb1/ crawldb2/
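
As a rough sketch of the call above: the names merged_crawldb, crawldb1
and crawldb2 are placeholders, and I assume "nutch" is on the PATH. The
small wrapper below strips a trailing "current" before calling mergedb.
With DRY_RUN=1 (the default here) it only prints the command.

```shell
#!/bin/sh
# Sketch: merge two CrawlDbs with "nutch mergedb".
# Assumptions (not from the thread): "nutch" is on PATH, and
# merged_crawldb, crawldb1, crawldb2 are placeholder names.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Return the CrawlDb root: drop trailing slashes and a trailing "/current".
crawldb_root() {
  printf '%s\n' "$1" | sed -e 's:/*$::' -e 's:/current$::'
}

# Usage: merge_crawldbs <output> <crawldb1> <crawldb2>
merge_crawldbs() {
  run nutch mergedb "$1" "$(crawldb_root "$2")" "$(crawldb_root "$3")"
}

merge_crawldbs merged_crawldb crawldb1/current crawldb2/
```

Each input has to be a CrawlDb directory of its own, that is, one that
contains its own "current" subdir.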


I understand you may have lost a lot of data, but again:
> Assumed crawling is continued the missed data will be crawled again
> (or already has been crawled again because it happened 3 days ago).
But that also depends on how you run the crawl.

First, you should check whether entries are really lost.
If so, you had better run the update job again.
The segment to update the CrawlDb with should still be there.
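
A minimal sketch of these two steps, assuming "nutch" is on the PATH.
The CrawlDb path is the one from your job page; the segment path is a
placeholder because the thread does not name the segment. With DRY_RUN=1
(the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: inspect the CrawlDb, then re-apply one segment with "nutch updatedb".
# Assumptions: "nutch" is on PATH; SEGMENT is a placeholder path.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print CrawlDb statistics so the URL counts can be compared with expectations.
show_stats() {
  run nutch readdb "$1" -stats
}

# Re-run the update job. Usage: rerun_update <crawldb> <segment>
rerun_update() {
  run nutch updatedb "$1" "$2"
}

CRAWLDB="${CRAWLDB:-160milyonurls/crawldb}"
SEGMENT="${SEGMENT:-path/to/segment}"   # placeholder, not from the thread

show_stats "$CRAWLDB"                # step 1: are entries really missing?
rerun_update "$CRAWLDB" "$SEGMENT"   # step 2: only if step 1 says so
show_stats "$CRAWLDB"                # compare the counts again
```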

The update job took 1.5h, that's a lot. What is your -topN?
If it's large, reduce it so that one cycle finishes within a few hours.
Then, if a job fails, the loss is tolerable: just run it again.
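
For illustration only, a generate call with an explicit -topN could look
like the sketch below. The segments directory and the value 50000 are my
assumptions, not taken from your setup, so pick a value that fits your
cluster. With DRY_RUN=1 (the default here) the command is only printed.

```shell
#!/bin/sh
# Sketch: generate a fetch list limited by -topN.
# Assumptions: "nutch" is on PATH; the segments directory and the
# topN value below are hypothetical.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: generate_segment <crawldb> <segments_dir> <topN>
generate_segment() {
  case "$3" in
    ''|*[!0-9]*)
      echo "topN must be a whole number" >&2
      return 1
      ;;
  esac
  run nutch generate "$1" "$2" -topN "$3"
}

generate_segment 160milyonurls/crawldb 160milyonurls/segments 50000
```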

On 07/08/2013 10:49 PM, eakarsu wrote:
> Sebastian,
> 
> The Hadoop job result page does not render properly. There was nothing wrong
> with the updatedb job.
> 
> Can we merge the current and 624730206 folders with this command?
> 
> nutch mergedb <output_crawldb> 160milyonurls/crawldb/current 160milyonurls/crawldb/624730206
> 
> 
> User: hduser
> JobName: crawldb 160milyonurls/crawldb
> JobConf:
> hdfs://summitdev1:54310/media/sdb/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201307050940_0002/job.xml
> Job-ACLs: All users are allowed
> Submitted At: 5-Jul-2013 22:14:37
> Launched At: 5-Jul-2013 22:14:38 (0sec)
> Finished At: 5-Jul-2013 23:55:11 (1hrs, 40mins, 33sec)
> Status: SUCCESS
> Failure Info:
> Kind     Total Tasks(successful+failed+killed)  Successful tasks  Failed tasks  Killed tasks  Start Time           Finish Time
> Setup    1                                      1                 0             0             5-Jul-2013 22:15:31  5-Jul-2013 22:15:32 (1sec)
> Map      3043                                   3043              0             0             5-Jul-2013 22:14:41  5-Jul-2013 23:16:56 (1hrs, 2mins, 14sec)
> Reduce   40                                     40                0             0             5-Jul-2013 22:18:25  5-Jul-2013 23:55:35 (1hrs, 37mins, 10sec)
> Cleanup  1                                      1                 0             0             5-Jul-2013 23:55:10  5-Jul-2013 23:55:11 (1sec)
> 
> 
> 
> <http://lucene.472066.n3.nabble.com/file/n4076369/Capture.jpg> 
>
