Hi,
> 1. after several thousand pages are fetched, it starts to throw the
> following error for almost any page:
>
> 2010-07-10 02:49:35,197 INFO fetcher.Fetcher - fetching http://local.yahoo.com/MD/Frederick
> 2010-07-10 02:49:35,931 INFO fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=999
> 2010-07-10 02:49:36,933 INFO fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=1000
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - java.io.IOException: Spill failed
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill10.out
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
> 2010-07-10 02:49:36,986 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
>
> what causes the error?

Looks like there is no space left on the device where the output of the maps is stored. Try specifying a different value for mapred.local.dir in the Hadoop conf:
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html

> 2. the segment dir has only a generate dir and a _temporary dir but no
> fetched content directory. I had to kill the fetch. how do I use the
> fetched pages in the _temporary dir, or recover the fetched pages?

Can't remember off the top of my head, but I am pretty sure this has been mentioned previously on the mailing list.

> 3. fetching is fast in the beginning and slows down quickly. how do I
> configure it to fetch faster toward the end of the fetch as well?

See http://wiki.apache.org/nutch/OptimizingCrawls for hints.

HTH

J.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
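PS: regarding the mapred.local.dir suggestion for question 1, a minimal sketch of what the override looks like in the Hadoop conf (mapred-site.xml in 0.20.x). The /data1 and /data2 paths are placeholders; point them at mounts that actually have free space:

```xml
<!-- mapred-site.xml: mapred.local.dir takes a comma-separated list of
     local directories used for map spill files. Spills are spread
     across all listed dirs, so listing several disks also helps I/O.
     /data1 and /data2 below are hypothetical mount points. -->
<property>
  <name>mapred.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local</value>
</property>
```

A quick `df -h` on those directories before a long fetch will tell you whether you are about to run out of space again.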

