Hi Tejas,
2013/2/16 Tejas Patil <[email protected]>:
> Hey Lewis,
>
> I am not knowledgeable about the Gora side of things, but am curious to know how parsing
> perf. might be affected if one uses different storage. With HBase it worked fine
> for OP but Cassandra gave this problem. Is the parsing code separate for

This is one thing we (Lewis and I) were just discussing. You set up Nutch to
use Gora to persist data in whichever data store you want, so all writes and
reads are handled separately by each storage module. HBase relies on Avro for
persisting data, but Cassandra does not: Cassandra has its own set of
serializers that write everything into bytes for better performance. We
believe there is something going on between the Cassandra serializers and the
way Gora uses them that is making this specific job not work as desired.

> these two? or is it while writing parse output that the problem occurs? I
> think it might be the latter one causing this but I am not sure.

Could you point me to the parser job code so I can take a look at it? I am a
foreigner @ Nutchland, so I will appreciate your help in order to sort this
out.

Renato M.

> Thanks,
> Tejas Patil
>
>
> On Sat, Feb 16, 2013 at 1:36 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Can you dump your webdb and check what the various fields are like?
>> Can you read these in an editor?
>> I think there may be some problems with the serializers in gora-cassandra
>> but I am not sure yet.
>> Lewis
>>
>> On Saturday, February 16, 2013, t_gra <[email protected]> wrote:
>> > Hi All,
>> >
>> > Experiencing the same problem as Žygimantas with Nutch 2.1 and Cassandra
>> > (with HBase everything works OK).
>> >
>> > Here are some details of my setup:
>> >
>> > Node1 - NameNode, SecondaryNameNode, JobTracker
>> > Node2..Node4 - TaskTracker, DataNode, Cassandra
>> >
>> > All these are virtual machines.
>> > CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4Gb RAM.
>> >
>> > Running Nutch using:
>> > hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 \
>> >   -numTasks 3 -depth 2 -topN 10000
>> >
>> > Getting one mapper for the parse job and very slow parsing of individual
>> > pages.
>> >
>> > Getting lots of errors like this:
>> >
>> > 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing
>> > http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
>> > java.util.concurrent.TimeoutException
>> >     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
>> >     at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>> >     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
>> >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
>> >     at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
>> >     at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
>> >     at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
>> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >     at java.security.AccessController.doPrivileged(Native Method)
>> >     at javax.security.auth.Subject.doAs(Subject.java:416)
>> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>> >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >
>> > Any suggestions how to diagnose why it is behaving this way?
>> >
>> > Thanks!
>> >
>> > --
>> > View this message in context:
>> > http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
>> > Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>> --
>> *Lewis*
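[Editor's context, not part of the original thread] The TimeoutException in the
quoted trace comes from Nutch bounding each document parse with a timed
`Future.get()` inside ParseUtil.runParser (controlled by the `parser.timeout`
setting): when a parse runs past the limit, the task is abandoned and the WARN
above is logged. The sketch below is a hypothetical stand-alone illustration of
that pattern; the class and method names are mine, not Nutch's.

```java
import java.util.concurrent.*;

// Minimal sketch of a timed-parse pattern like Nutch's ParseUtil.runParser:
// submit the parse to an executor, then bound it with a timed Future.get().
// A slow or stalled parse surfaces as a TimeoutException, not a hang.
public class TimedParseSketch {
    static final ExecutorService EXECUTOR = Executors.newCachedThreadPool();

    static String runParser(Callable<String> parser, long timeoutSeconds) {
        Future<String> task = EXECUTOR.submit(parser);
        try {
            return task.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            task.cancel(true); // give up on the stalled parse
            return null;       // caller would log a WARN like the one quoted above
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A "parse" that hangs longer than the 1-second timeout:
        String result = runParser(() -> { Thread.sleep(2_000); return "parsed"; }, 1);
        System.out.println(result == null ? "timed out" : result);
        EXECUTOR.shutdownNow();
    }
}
```

The practical upshot for the thread: if the storage layer (here, the
gora-cassandra serializers) stalls reads or writes during parsing, every page
can hit this per-document timeout, which would match the "very slow parsing of
individual pages" symptom.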

