Julien, thanks. You are right: Hadoop writes temp files to /tmp by default, and I need to change that. However, mapred.local.dir cannot be found in any file in the Nutch 1.1 conf dir: there is no hadoop-default.xml file, and mapred-site.xml is empty. Could you tell me which file to edit and what parameter block to add? -aj
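(For anyone hitting the same Spill failed / DiskErrorException, here is a sketch of the kind of parameter block being asked about. This is an assumption based on the stock Hadoop 0.20.x property format, not a confirmed answer from the thread: the block would go in conf/mapred-site.xml, inside a <configuration> element you create if the file is empty, and /mnt/bigdisk/mapred/local is a hypothetical path to replace with a directory that has enough free space.)

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.local.dir</name>
    <!-- Hypothetical path: point this at a disk with free space.
         A comma-separated list of directories is also allowed. -->
    <value>/mnt/bigdisk/mapred/local</value>
    <description>Local directory where MapReduce stores intermediate
    data, including map output spill files.</description>
  </property>
</configuration>
```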
On Sat, Jul 10, 2010 at 11:42 AM, Julien Nioche <[email protected]> wrote:

> Hi,
>
> > 1. after several thousands of pages are fetched, it starts to throw the
> > following error for almost any page:
> >
> > 2010-07-10 02:49:35,197 INFO fetcher.Fetcher - fetching http://local.yahoo.com/MD/Frederick
> > 2010-07-10 02:49:35,931 INFO fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=999
> > 2010-07-10 02:49:36,933 INFO fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=1000
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - java.io.IOException: Spill failed
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill10.out
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
> > 2010-07-10 02:49:36,986 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
> > 201
> >
> > what causes the error?
>
> Looks like there is no space left on the device where the output of the
> maps is stored. Try specifying a different value for mapred.local.dir in
> the hadoop conf.
>
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
>
> > 2. the segment dir has only a generate dir and a _temporary dir but no
> > fetched content directory. I have to kill the fetch. how do I use the
> > fetched pages in the _temporary dir or recover the fetched pages?
>
> Can't remember off the top of my head but I am pretty sure this has been
> mentioned previously on the mailing list
>
> > 3. fetching is fast in the beginning and slows down quickly. how do I
> > configure it to fetch faster toward the end of the fetch as well?
>
> See http://wiki.apache.org/nutch/OptimizingCrawls for hints
>
> HTH
>
> J.
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

