Can you dump your webdb and check what the various fields look like? Can you read them in an editor? I think there may be some problems with the serializers in gora-cassandra, but I am not sure yet.

Lewis
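For reference, the TimeoutException in the trace quoted below is thrown from FutureTask.get inside ParseUtil.runParser, which suggests the parser is run as a task with a time limit and did not finish in time. If I remember correctly that limit comes from the parser.timeout property, but please check it against your own configuration. The exception only says the limit was exceeded, not why the parse was slow. Here is a minimal standalone sketch of that pattern, not the actual Nutch code; the class and method names (ParseTimeoutSketch, runWithTimeout) are made up for illustration:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {

    // Hypothetical helper (not part of Nutch): runs a task on a separate
    // thread and waits at most timeoutMillis for its result.
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            // Throws TimeoutException if the task is still running
            // when the time limit expires.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The task is slow or stuck; ask it to stop and report it.
            future.cancel(true);
            return "TIMEOUT";
        } catch (ExecutionException e) {
            // The task itself threw an exception.
            return "FAILED: " + e.getCause();
        } catch (InterruptedException e) {
            // The waiting thread was interrupted; restore the flag.
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast task finishes well inside the limit.
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        // A slow task exceeds the limit and is reported as a timeout.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(2000);
            return "parsed";
        }, 100));
    }
}
```

In this sketch a timeout only tells you that the task took longer than the limit. Whether the time goes into the parser itself or into waiting on the data store has to be found out separately, which is why I would like to see the dumped fields first.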
On Saturday, February 16, 2013, t_gra <[email protected]> wrote:
> Hi All,
>
> Experiencing same problem as Žygimantas with Nutch 2.1 and Cassandra (with
> HBase everything works OK).
>
> Here are some details of my setup:
>
> Node1 - NameNode, SecondaryNameNode, JobTracker
> Node2..Node4 - TaskTracker, DataNode, Cassandra
>
> All these are virtual machines.
> CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4 GB RAM.
>
> Running Nutch using
> hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000
>
> Getting one mapper for parse job and very slow parsing of individual pages.
>
> Getting lots of errors like this:
>
> 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing
> http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
>     at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
>     at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
>     at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:416)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Any suggestions how to diagnose why it is behaving this way?
>
> Thanks!
--
*Lewis*

