AFAIK in local mode Hadoop does not clear these files at all, but you can safely clear the directory after each crawl cycle, or even after each map/reduce job if you want. To state the obvious: do not delete it while a job is still running, or that job won't survive.
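A minimal cleanup sketch along those lines, run between crawl cycles (never while a job is active). The jobcache path is taken from the report below; the surrounding setup lines only simulate a populated jobcache so the snippet is self-contained:

```shell
#!/bin/sh
# Hypothetical between-cycle cleanup sketch; path assumes the default
# local-mode layout mentioned in the original report.
JOBCACHE="hadoop/mapred/local/taskTracker/jobcache"

# Simulate a populated jobcache for demonstration purposes only.
mkdir -p "$JOBCACHE/job_local_0001" "$JOBCACHE/job_local_0002"

# Remove every per-job subdirectory but keep the jobcache directory itself,
# so the next crawl cycle starts with an empty (not missing) directory.
find "$JOBCACHE" -mindepth 1 -maxdepth 1 -type d -exec rm -rf {} +
```

Dropping this into a wrapper script around your crawl command (or a cron entry) keeps the subdirectory count well under the ext3 limit you hit.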
-----Original message-----
> From: yann <[email protected]>
> Sent: Friday 27th December 2013 16:33
> To: [email protected]
> Subject: Too many links in hadoop directory
>
> Hi guys,
>
> after crawling multiple sites repeatedly for a long time, I'm getting 31999
> subdirectories in the hadoop/mapred/local/taskTracker/jobcache/ directory.
>
> After that, the crawler stops because of a 32000-file limit per directory in
> Linux.
>
> I'm wondering what is the solution to that? When running Nutch at a high
> level, I don't have access to the specific directories that get created
> under jobcache for a given job, so I can't delete them myself easily (unless
> I'm mistaken?).
>
> Is there an option to either delete these directories / temp files in Hadoop
> when a crawl is complete, or is there a way to configure Hadoop so that it
> won't run into these limitations? Or any other option to keep my crawler
> running?
>
> Thanks - help much appreciated again.
>
> Yann
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Too-many-links-in-hadoop-directory-tp4108378.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

