Hey Lewis, I am not familiar with Gora, but I am curious how the choice of storage backend could affect parsing performance. With HBase it worked fine for the OP, but Cassandra gave this problem. Is the parsing code separate for these two? Or does the problem occur while writing the parse output? I think it might be the latter, but I am not sure.
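For reference, the top frames of the quoted trace (FutureTask.get called from ParseUtil.runParser) look like the usual pattern of running the parser on a worker thread and waiting for it with a deadline. If I remember right, that deadline is controlled by the parser.timeout property, but please check your nutch-default.xml. Below is a minimal sketch of that pattern only, not the actual Nutch code; the class TimedParseSketch and the helper runWithTimeout are hypothetical names I made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParseSketch {

    // Hypothetical helper (not a Nutch API): runs `work` on a worker thread
    // and waits at most timeoutMillis for its result.
    // Returns the result, or null if the deadline passes first.
    static String runWithTimeout(Callable<String> work, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(work);
        try {
            // A timed get() throws TimeoutException when the task is too slow,
            // which is the exception type at the top of the quoted trace.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow worker
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes well within the deadline.
        String fast = runWithTimeout(() -> "parsed", 1000);
        // Stand-in for a parse that takes longer than the deadline.
        String slow = runWithTimeout(() -> {
            Thread.sleep(2000);
            return "parsed";
        }, 100);
        System.out.println("fast = " + fast);
        System.out.println("slow = " + slow);
    }
}
```

In this sketch anything that slows the worker thread down triggers the timeout, so the trace on its own does not tell us whether the time is spent in the HTML parsing or in something the parser waits on.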
Thanks,
Tejas Patil

On Sat, Feb 16, 2013 at 1:36 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Can you dump your webdb and check what the various fields are like?
> Can you read these in an editor?
> I think there may be some problems with the serializers in gora-cassandra
> but I am not sure yet.
> Lewis
>
> On Saturday, February 16, 2013, t_gra <[email protected]> wrote:
> > Hi All,
> >
> > Experiencing same problem as Žygimantas with Nutch 2.1 and Cassandra
> > (with HBase everything works OK).
> >
> > Here are some details of my setup:
> >
> > Node1 - NameNode, SecondaryNameNode, JobTracker
> > Node2..Node4 - TaskTracker, DataNode, Cassandra
> >
> > All these are virtual machines.
> > CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4 GB RAM.
> >
> > Running Nutch using
> > hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000
> >
> > Getting one mapper for parse job and very slow parsing of individual pages.
> >
> > Getting lots of errors like this:
> >
> > 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing
> > http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
> > java.util.concurrent.TimeoutException
> >     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
> >     at java.util.concurrent.FutureTask.get(FutureTask.java:119)
> >     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
> >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
> >     at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
> >     at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
> >     at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:416)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >
> > Any suggestions how to diagnose why it is behaving this way?
> >
> > Thanks!
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
> --
> *Lewis*

