AFAIK, in local mode Hadoop does not clean up these files at all, but you can safely
clear the directory after each crawl cycle, or even after each map/reduce job if
you want. To state the obvious: do not delete it while a job is still running, or
that job won't survive.
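As a minimal sketch of that cleanup, assuming a local-mode setup where the jobcache lives under the default hadoop.tmp.dir (the JOBCACHE path below is illustrative — check mapred.local.dir / hadoop.tmp.dir in your own config), something like this between crawl cycles would keep the subdirectory count down:

```shell
# Illustrative path; verify against mapred.local.dir in your Hadoop config.
JOBCACHE="${JOBCACHE:-/tmp/hadoop/mapred/local/taskTracker/jobcache}"

# Simulate a couple of stale per-job directories left by earlier crawl cycles
mkdir -p "$JOBCACHE/job_local_0001" "$JOBCACHE/job_local_0002"

# Clear them between cycles -- never while a job is running.
# ${JOBCACHE:?} aborts if the variable is empty, so we can't rm -rf /*
rm -rf "${JOBCACHE:?}"/*

# Confirm the directory is empty again
[ -z "$(ls -A "$JOBCACHE")" ] && echo "jobcache cleared"
```

Wrapping the crawl script's cycle loop with this (or a cron job doing the same) avoids ever approaching the per-directory limit.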

-----Original message-----
> From:yann <[email protected]>
> Sent: Friday 27th December 2013 16:33
> To: [email protected]
> Subject: Too many links in hadoop directory
> 
> Hi guys,
> 
> after crawling multiple sites repeatedly for a long time, I'm getting 31999
> subdirectories in the hadoop/mapred/local/taskTracker/jobcache/ directory.
> 
> After that, the crawler stops because of a 32000-file limit per directory in
> Linux.
> 
> I'm wondering what is the solution to that? When running Nutch at a high
> level, I don't have access to the specific directories that get created
> under jobcache for a given job, so I can't delete them myself easily (unless
> I'm mistaken?).
> 
> Is there an option to either delete these directories / temp files in Hadoop
> when a crawl is complete, or is there a way to configure Hadoop so that it
> won't run into these limitations? Or any other option to keep my crawler
> running?
> 
> Thanks - help much appreciated again.
> 
> Yann
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Too-many-links-in-hadoop-directory-tp4108378.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
