After one day of crawling with Nutch (version 1.4), I finally got the exception below:
...
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
...
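From what I can tell, JobClient.runJob() only reports "Job failed!" and does not print the underlying cause, so the real error is probably only visible in the Hadoop log. In local mode that should be logs/hadoop.log under the Nutch runtime directory (assuming the default log4j configuration), e.g.:

    grep -i -B 2 -A 10 "error" logs/hadoop.log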
I am crawling 20 news sites, and the input arguments to Nutch are depth 3 and topN -1.
There is enough free space in the root directory of my Linux machine, and it has about
4 GB of RAM.
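For reference, I launch the crawl roughly like this (urls is the seed directory and crawl the output directory; these names are placeholders, not my exact paths):

    bin/nutch crawl urls -dir crawl -depth 3 -topN -1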
My nutch-site.xml is configured as below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.ignored</name>
    <value>false</value>
  </property>
  <property>
    <name>file.crawl.parent</name>
    <value>false</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>encodingdetector.charset.min.confidence</name>
    <value>-1</value>
  </property>
  <property>
    <name>parser.timeout</name>
    <value>30</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>36000</value>
  </property>
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>7776000</value>
  </property>
  <property>
    <name>db.max.anchor.length</name>
    <value>20000</value>
  </property>
  <property>
    <name>hadoop.job.history.user.location</name>
    <value>/data/data_solr_site/hadoop-history-user</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>5</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/data_solr_site/hadoop</value>
  </property>
</configuration>
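Since a malformed nutch-site.xml (for example a stray or mismatched closing tag) will make Hadoop's configuration loader fail, it may also be worth verifying that the file is well-formed. A quick check, assuming xmllint is available on the machine:

    xmllint --noout conf/nutch-site.xml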
How can I solve this issue? Thanks.