I'm using Nutch 1.1 with the default config files to crawl a list of sites. I generate a segment of 10,000 URLs and then fetch the segment (parsing is on). Several observations and questions:

1. After several thousand pages have been fetched, it starts to throw the following error for almost any page:
2010-07-10 02:49:35,197 INFO fetcher.Fetcher - fetching http://local.yahoo.com/MD/Frederick
2010-07-10 02:49:35,931 INFO fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=999
2010-07-10 02:49:36,933 INFO fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=1000
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - java.io.IOException: Spill failed
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill10.out
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher -   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
2010-07-10 02:49:36,986 ERROR fetcher.Fetcher -   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)

What causes this error?

2. The segment directory contains only the generate dir and a _temporary dir, but no fetched-content directory, so I have to kill the fetch. How do I use the fetched pages in the _temporary dir, or otherwise recover them?

3. Fetching is fast in the beginning but slows down quickly. How do I configure it to keep fetching fast toward the end of the fetch as well?

Thanks,
aj

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
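P.S. For reference, the generate/fetch steps above are run with the standard Nutch command-line tools, roughly as follows (the crawl paths and segment timestamp here are examples, not my exact setup):

```shell
# Generate a segment of at most 10,000 top-scoring URLs from the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 10000

# Fetch the newly generated segment; with fetcher.parse=true (the default
# behavior I'm using), pages are parsed during the fetch itself
bin/nutch fetch crawl/segments/20100710024900
```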

