Hi,

> 1. After several thousand pages are fetched, it starts to throw the
> following error for almost any page:
>
> 2010-07-10 02:49:35,197 INFO  fetcher.Fetcher - fetching
> http://local.yahoo.com/MD/Frederick
> 2010-07-10 02:49:35,931 INFO  fetcher.Fetcher - -activeThreads=20,
> spinWaiting=17, fetchQueues.totalSize=999
> 2010-07-10 02:49:36,933 INFO  fetcher.Fetcher - -activeThreads=20,
> spinWaiting=17, fetchQueues.totalSize=1000
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - java.io.IOException: Spill failed
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill10.out
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
> 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
> 2010-07-10 02:49:36,986 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
> 201
>
> What causes this error?
>

Looks like there is no space left on the device where the map output is
stored. Try specifying a different value for mapred.local.dir in the
Hadoop conf.

http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
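
As a sketch, that would look something like this in the Hadoop conf (mapred-site.xml or hadoop-site.xml depending on your version); the directory paths here are placeholders, so substitute disks that actually have free space:

```xml
<!-- Placeholder paths: list one or more comma-separated directories
     on partitions with enough room for the map spill files. -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```

It's worth running `df -h` on the current mapred.local.dir first to confirm it really is a disk-space problem rather than a permissions one.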


>
> 2. The segment dir has only a generate dir and a _temporary dir but no
> fetched content directory. I had to kill the fetch. How do I use the
> fetched pages in the _temporary dir or recover the fetched pages?
>

Can't remember off the top of my head, but I am pretty sure this has been
discussed previously on the mailing list.


>
> 3. Fetching is fast in the beginning but slows down quickly. How do I
> configure it to keep fetching fast toward the end of the fetch as well?
>

See http://wiki.apache.org/nutch/OptimizingCrawls for hints
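
The usual cause of a slow tail is a handful of large or slow hosts left at the end of the queue, since politeness limits throttle how fast a single host can be fetched. As a sketch (the values are illustrative, not recommendations, and assume Nutch 1.x property names), capping URLs per host at generate time in nutch-site.xml keeps any one site from dominating the end of the fetch:

```xml
<!-- Illustrative values only: tune for your crawl. -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <!-- Limits how many URLs from one host go into a segment. -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value>
  <!-- Overall fetcher thread count. -->
</property>
```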

HTH

J.
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
