I'm doing web crawling with Nutch, which runs on Hadoop in distributed mode. Once the crawldb reached tens of millions of URLs, I started to see strange failures when generating a new segment and updating the crawldb. For segment generation, the Hadoop select job completes successfully and generate-temp-1285641291765 is created, but the partition job never starts and no segment appears in the segments directory. I'm trying to understand where it fails. There is no error message except a few WARN messages about "connection reset by peer". Hadoop fsck and dfsadmin both report that the nodes and directories are healthy. Is this a Hadoop problem or a Nutch problem? I'd appreciate any suggestions on how to debug this fatal problem.
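Since the jobs themselves report success, one low-tech way I've been narrowing this down is to check, after each step, whether the expected output actually appeared and whether temp directories were left behind. This is only a sketch: CRAWL_DIR and the directory layout (segments/<timestamp>, generate-temp-*) are assumptions for a local-filesystem crawl dir; if the crawl dir lives on HDFS, the equivalent `hadoop fs -ls` commands would be needed instead.

```shell
#!/bin/sh
# Sketch of a post-step sanity check. CRAWL_DIR and the layout assumed
# below (segments/<timestamp>, generate-temp-*) are assumptions; adjust
# for your setup, and use "hadoop fs -ls" if the crawl dir is on HDFS.
CRAWL_DIR=${CRAWL_DIR:-crawl}

# After generate: a successful run should leave a timestamped directory
# under segments/. If only a generate-temp-* dir exists, the partition
# job most likely never ran.
check_generate() {
  if ls -d "$CRAWL_DIR"/segments/2* >/dev/null 2>&1; then
    echo "generate OK: segment created"
  else
    echo "generate FAILED: no new segment (partition job likely never ran)"
  fi
}

# After generate/updatedb: leftover temp dirs suggest the step died
# between writing its temporary output and installing the final result.
check_temps() {
  stale=$(find "$CRAWL_DIR" -maxdepth 2 -type d -name '*temp*' 2>/dev/null)
  if [ -n "$stale" ]; then
    echo "stale temp dirs: $stale"
  else
    echo "no stale temp dirs"
  fi
}

check_generate
check_temps
```

Running this right after each step at least tells me whether the failure is in the step itself or in something downstream that consumes its output.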
A similar problem occurs in the updatedb step: it creates the temp directory but never actually updates the crawldb.

thanks,
aj

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
web2express.org
twitter: @web2express
Palo Alto, CA, USA

