Hi all,

I am experiencing the same problem as Žygimantas with Nutch 2.1 and Cassandra (with HBase everything works OK).
Here are some details of my setup:

Node1 - NameNode, SecondaryNameNode, JobTracker
Node2..Node4 - TaskTracker, DataNode, Cassandra

All of these are virtual machines. The CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", with 4 GB RAM.

I am running Nutch using:

hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000

I get only one mapper for the parse job, and parsing of individual pages is very slow. I am also getting lots of errors like this:

2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
    at java.util.concurrent.FutureTask.get(FutureTask.java:119)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
    at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
    at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
    at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Any suggestions on how to diagnose why it is behaving this way?

Thanks!
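
For context on the trace: the exception comes out of FutureTask.get called from ParseUtil.runParser, which suggests the parser runs on a separate thread and is abandoned once a time limit passes. In Nutch that limit is, as far as I know, set by the parser.timeout property. The sketch below only illustrates the general pattern and is not Nutch's actual code; the class ParseTimeoutSketch and the helper runWithTimeout are hypothetical names made up for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {

    // Hypothetical helper (not a Nutch API): runs the task on a worker thread
    // and waits at most timeoutMs for it. Returns the task's result, or null
    // if the task did not finish in time.
    public static String runWithTimeout(Callable<String> task, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            // Same call that appears in the stack trace: a bounded wait on the future.
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The task is still running; interrupt it and report "no result".
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes well inside the limit.
        String fast = runWithTimeout(() -> "parsed", 1000);

        // A task that sleeps longer than the limit, standing in for a slow parse.
        String slow = runWithTimeout(() -> {
            Thread.sleep(2000);
            return "parsed";
        }, 100);

        System.out.println("fast task: " + fast);
        System.out.println("slow task: " + slow);
    }
}
```

If this is the mechanism at work, the warning by itself only says that a parse took longer than the limit. It does not say whether the parser is slow on that page or whether the worker thread was starved, for example because a single mapper is handling every page on a small VM.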

