I use Nutch 1.1 with the default config files to crawl a list of sites: generate a
segment of 10000 URLs, then fetch the segment (parsing is on). Several
observations and questions:
1. After several thousand pages are fetched, it starts to throw the
following error for almost every page:

2010-07-10 02:49:35,197 INFO  fetcher.Fetcher - fetching
http://local.yahoo.com/MD/Frederick
2010-07-10 02:49:35,931 INFO  fetcher.Fetcher - -activeThreads=20,
spinWaiting=17, fetchQueues.totalSize=999
2010-07-10 02:49:36,933 INFO  fetcher.Fetcher - -activeThreads=20,
spinWaiting=17, fetchQueues.totalSize=1000
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - java.io.IOException: Spill
failed
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - Caused by:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill10.out
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
2010-07-10 02:49:36,986 ERROR fetcher.Fetcher - at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)

What causes this error?
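For context: the spill files in the trace are written to Hadoop's local scratch space, and that "Could not find any valid local directory" exception is what Hadoop raises when none of the configured local directories has room (or is writable). A sketch of the override I'd check in conf/hadoop-site.xml (or nutch-site.xml in local mode) -- the path here is purely illustrative:

```xml
<!-- Point Hadoop's local scratch space at a partition with plenty of free
     disk. The spill error above is usually raised when every directory
     under hadoop.tmp.dir / mapred.local.dir is full or unwritable.
     /data/hadoop-tmp is a hypothetical path, not my actual setup. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>
```

If mapred.local.dir is set explicitly, it takes precedence over hadoop.tmp.dir for map-output spills.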

2. The segment dir contains only the generate dir and a _temporary dir, but no
fetched content directory, and I have to kill the fetch. How do I use the
fetched pages in the _temporary dir, or otherwise recover them?

3. Fetching is fast at the beginning but slows down quickly. How do I configure
it so that fetching stays fast toward the end of the fetch as well?

thanks
aj
-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
