Julien, thanks. Hadoop writes temp files to /tmp by default, and I need to
change that. You are right, but mapred.local.dir cannot be found in any
file in the nutch-1.1 conf dir: there is no hadoop-default.xml file, and
mapred-site.xml is empty. Could you tell me which file to edit and the
parameter block to add?
-aj

On Sat, Jul 10, 2010 at 11:42 AM, Julien Nioche <
[email protected]> wrote:

> Hi,
>
>
> 1. after several thousands pages are fetched, it starts to throw the
> > following error for almost any page:
> >
> > 2010-07-10 02:49:35,197 INFO  fetcher.Fetcher - fetching http://local.yahoo.com/MD/Frederick
> > 2010-07-10 02:49:35,931 INFO  fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=999
> > 2010-07-10 02:49:36,933 INFO  fetcher.Fetcher - -activeThreads=20, spinWaiting=17, fetchQueues.totalSize=1000
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - java.io.IOException: Spill failed
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:860)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:898)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill10.out
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
> > 2010-07-10 02:49:36,985 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
> > 2010-07-10 02:49:36,986 ERROR fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
> > 201
> >
> > what causes the error?
> >
>
> Looks like there is no space left on the device where the output of the
> maps is stored. Try specifying a different value for mapred.local.dir in
> the Hadoop conf.
>
> http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
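>
> For reference (going from memory of the 0.20 config layout, so treat this
> as a sketch rather than gospel): the property block can go in
> conf/mapred-site.xml, pointing at a directory on a partition with free
> space. The path below is only an example:
>
> ```xml
> <configuration>
>   <property>
>     <name>mapred.local.dir</name>
>     <!-- example path; use any directory on a disk with enough free space -->
>     <value>/data/hadoop/mapred/local</value>
>   </property>
> </configuration>
> ```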
>
>
> >
> > 2. the segment dir has only a generate dir and a _temporary dir but no
> > fetched content directory. I have to kill the fetch. How do I use the
> > fetched pages in the _temporary dir, or recover them?
> >
>
> Can't remember off the top of my head, but I am pretty sure this has been
> mentioned previously on the mailing list.
>
>
> >
> > 3. fetching is fast in the beginning but slows down quickly. How do I
> > configure it to keep fetching fast toward the end of the fetch as well?
> >
>
> See http://wiki.apache.org/nutch/OptimizingCrawls for hints
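>
> That page lists the knobs; as a rough sketch, the properties I would look
> at first in conf/nutch-site.xml are along these lines (the values are
> illustrative, not recommendations). The usual cause of the slowdown is a
> long tail of a few slow hosts, which capping URLs per host can mitigate:
>
> ```xml
> <configuration>
>   <property>
>     <name>fetcher.threads.fetch</name>
>     <value>20</value>
>   </property>
>   <property>
>     <name>fetcher.server.delay</name>
>     <!-- seconds between requests to the same host -->
>     <value>1.0</value>
>   </property>
>   <property>
>     <name>generate.max.per.host</name>
>     <!-- cap URLs per host per segment so slow hosts don't dominate the tail -->
>     <value>100</value>
>   </property>
> </configuration>
> ```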
>
> HTH
>
> J.
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
