Hi All,

I am experiencing the same problem as Žygimantas with Nutch 2.1 and Cassandra
(with HBase everything works fine).

Here are some details of my setup:

Node1 - NameNode, SecondaryNameNode, JobTracker
Node2..Node4 - TaskTracker, DataNode, Cassandra

All of these are virtual machines.
The CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", with 4 GB RAM.

I am running Nutch with:

hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000

I get only one mapper for the parse job, and parsing of individual pages is
very slow.

I am also getting lots of errors like this:

2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
        at java.util.concurrent.FutureTask.get(FutureTask.java:119)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
        at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
        at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
        at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
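
For what it is worth, the top frames (FutureTask.get called from
ParseUtil.runParser) are what you see when a task is handed to an executor and
the caller waits for it with a bounded get() that expires. Below is a minimal
sketch of that pattern, not Nutch's actual code: the class name
ParseTimeoutSketch, the method finishesInTime, and the sleep standing in for
parsing work are all hypothetical. I believe the wait in Nutch is governed by
the parser.timeout property, but I have not confirmed that against the 2.1
source.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; this is not code from Nutch.
public class ParseTimeoutSketch {

    /**
     * Runs a stand-in "parse" task that needs workMillis to finish and waits
     * at most timeoutMillis for it. Returns true if the task finished within
     * the wait, false if the bounded get() timed out.
     */
    static boolean finishesInTime(long workMillis, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<String> fakeParse = () -> {
                Thread.sleep(workMillis); // stand-in for slow parsing work
                return "parsed";
            };
            Future<String> task = executor.submit(fakeParse);
            try {
                // Bounded wait: throws TimeoutException if the task is not done in time.
                task.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                task.cancel(true); // interrupt the worker so it does not linger
                return false;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("fast task in time: " + finishesInTime(10, 2000));
        System.out.println("slow task in time: " + finishesInTime(2000, 100));
    }
}
```

If that reading is right, the warning only says the parse did not finish within
the allowed wait; it does not say where the time went, so thread dumps of the
map task while it is parsing would probably be more informative than the trace
itself.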

Any suggestions on how to diagnose why it is behaving this way?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
Sent from the Nutch - User mailing list archive at Nabble.com.
