Can you dump your webdb and check what the various fields are like?
Can you read these in an editor?
I think there may be some problems with the serializers in gora-cassandra,
but I am not sure yet.
Lewis

On Saturday, February 16, 2013, t_gra <[email protected]> wrote:
> Hi All,
>
> Experiencing same problem as Žygimantas with Nutch 2.1 and Cassandra (with
> HBase everything works OK).
>
> Here are some details of my setup:
>
> Node1 - NameNode, SecondaryNameNode, JobTracker
> Node2..Node4 - TaskTracker, DataNode, Cassandra
>
> All these are virtual machines.
> CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4 GB RAM.
>
> Running Nutch using
> hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000
>
> Getting one mapper for parse job and very slow parsing of individual
> pages.
>
> Getting lots of errors like this:
>
> 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
> java.util.concurrent.TimeoutException
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
>         at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
>         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
>         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Any suggestions how to diagnose why it is behaving this way?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
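
For what it's worth, the trace above ends in FutureTask.get called from
ParseUtil.runParser, which suggests each page is parsed under a deadline and
the warning is logged whenever a page misses it. Here is a minimal Java sketch
of that general pattern, assuming nothing about Nutch internals beyond what
the trace shows. The class TimedParseSketch, the method runWithTimeout and
the timeout values are all made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; this is not Nutch code.
public class TimedParseSketch {

    // Runs the task on a helper thread and waits at most timeoutMillis for it.
    // Returns the task's result, or null if the deadline passes first.
    static String runWithTimeout(Callable<String> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        FutureTask<String> future = new FutureTask<>(task);
        Thread worker = new Thread(future, "parse-worker");
        worker.setDaemon(true); // do not keep the JVM alive for a stuck task
        worker.start();
        try {
            // Timed wait: throws TimeoutException if the task is still running.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker and give up on this page
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes well inside its deadline.
        String fast = runWithTimeout(() -> "parsed", 1000);
        // A task that sleeps past its deadline, standing in for a slow parse.
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100);
        System.out.println("fast=" + fast + " slow=" + slow);
    }
}
```

If it is the limit itself that is biting, it should be configurable (I
believe via parser.timeout, but please check nutch-default.xml for your
version). A timeout on nearly every page usually points at something slow
around the parser rather than at the pages themselves, though.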

-- 
*Lewis*
