Hi guys,

After crawling multiple sites repeatedly over a long period, I now have 31999
subdirectories in the hadoop/mapred/local/taskTracker/jobcache/ directory.

At that point the crawler stops, because ext3 limits each directory to
roughly 32000 links, i.e. about that many subdirectories.
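
For reference, this is roughly how I count them (a throwaway snippet of mine,
nothing official; the path is just where my setup puts the Hadoop local
directory, so adjust it to yours):

import java.io.File;

public class JobCacheCount {
    public static void main(String[] args) {
        // Path from my local setup; depends on where the Hadoop local dir points.
        File jobcache = new File("hadoop/mapred/local/taskTracker/jobcache");
        // Each past job seems to leave one subdirectory behind, so count those.
        File[] jobDirs = jobcache.listFiles(File::isDirectory);
        System.out.println((jobDirs == null ? 0 : jobDirs.length) + " job directories");
    }
}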

I'm wondering what the solution to this is. When running Nutch at a high
level, I don't have access to the specific directories that get created
under jobcache for a given job, so I can't easily delete them myself (unless
I'm mistaken?).

Is there an option to have Hadoop delete these directories / temp files when
a crawl completes, or a way to configure Hadoop so that it doesn't run into
this limit? Or any other option to keep my crawler running?
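
In case it clarifies what I mean by deleting them myself: the bluntest
workaround I could come up with is the standalone cleanup sketched below,
which removes jobcache subdirectories untouched for more than a day, on the
assumption (which may well be wrong) that those belong to finished jobs and
can be removed between crawls.

import java.io.File;
import java.time.Duration;
import java.time.Instant;

public class JobCacheCleanup {

    // Adjust to wherever your jobcache directory actually lives.
    private static final File JOBCACHE =
            new File("hadoop/mapred/local/taskTracker/jobcache");

    // Assumption: anything untouched for a day belongs to a finished job.
    private static final Duration MAX_AGE = Duration.ofDays(1);

    public static void main(String[] args) {
        File[] jobDirs = JOBCACHE.listFiles(File::isDirectory);
        if (jobDirs == null) {
            return; // jobcache directory not found
        }
        Instant cutoff = Instant.now().minus(MAX_AGE);
        for (File dir : jobDirs) {
            if (Instant.ofEpochMilli(dir.lastModified()).isBefore(cutoff)) {
                deleteRecursively(dir);
            }
        }
    }

    // Delete a directory tree bottom-up, since File.delete() needs empty dirs.
    private static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete();
    }
}

I would obviously prefer a supported Hadoop/Nutch option over running
something like this from cron between crawls.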

Thanks - help much appreciated again.

Yann


