After about one day of crawling with Nutch (version 1.4), I finally got the exception below:

...

-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
...

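As far as I can tell, the "Job failed!" from JobClient.runJob is Hadoop's generic wrapper for any failed job, so the trace above does not show the underlying cause. If I read the default log4j setup correctly, the real exception should be written to logs/hadoop.log in the Nutch directory, which can be inspected with something like:

    tail -n 200 logs/hadoop.log
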
I am crawling 20 news sites, and the input arguments to Nutch are depth 3 and topN -1 (see the invocation sketch below). I have enough free space on the root partition of my Linux machine, and about 4 GB of RAM.
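For reference, the invocation looks roughly like this; the seed directory ("urls") and output directory ("crawl") are placeholder names, only the -depth and -topN values are my actual settings:

    bin/nutch crawl urls -dir crawl -depth 3 -topN -1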
My nutch-site.xml is configured as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration>
    <property>
        <name>http.content.limit</name>
        <value>-1</value>     
    </property>

    <property>
        <name>file.content.limit</name>
        <value>-1</value>
    </property>

    <property>
        <name>file.content.ignored</name>
        <value>false</value>     
    </property>

    <property>
        <name>file.crawl.parent</name>
        <value>false</value>     
    </property>    

    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>        
    </property> 

    <property>
        <name>encodingdetector.charset.min.confidence</name>
        <value>-1</value>      
    </property>      

    <property>
        <name>parser.timeout</name>
        <value>30</value>      
    </property>

    <property>
        <name>db.fetch.interval.default</name>
        <value>36000</value>       
    </property>

    <property>
        <name>db.fetch.schedule.class</name>
        <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>       
    </property>

    <property>
        <name>db.signature.class</name>
        <value>org.apache.nutch.crawl.TextProfileSignature</value>      
    </property>

    <property>
        <name>fetcher.verbose</name>
        <value>true</value>        
    </property>

    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>      
    </property>

    <property>
        <name>http.timeout</name>
        <value>60000</value>
    </property>

    <property>
        <name>db.max.outlinks.per.page</name>
        <value>-1</value>
    </property>

    <property>
        <name>http.redirect.max</name>
        <value>5</value>       
    </property>

    <property>
        <name>db.fetch.interval.max</name>
        <value>7776000</value>      
    </property>

    <property>
        <name>db.max.anchor.length</name>
        <value>20000</value>
    </property>

    <property>
        <name>hadoop.job.history.user.location</name>
        <value>/data/data_solr_site/hadoop-history-user</value>
    </property>

    <property>
        <name>fetcher.threads.fetch</name>
        <value>5</value>
    </property>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/data_solr_site/hadoop</value>
    </property>

</configuration>
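One thing worth noting: hadoop.tmp.dir above points at /data/data_solr_site/hadoop rather than the root partition, so free space on that partition matters as well; something like the following would confirm it:

    df -h /data/data_solr_site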




How can I solve this issue? Thanks.

