Re: Slow parse on hadoop

Tejas Patil Sat, 16 Feb 2013 16:23:04 -0800

Hi Renato,

On Sat, Feb 16, 2013 at 2:01 PM, Renato Marroquín Mogrovejo <
[email protected]> wrote:


> Hi Tejas,
>
>
> 2013/2/16 Tejas Patil <[email protected]>:
> > Hey Lewis,
> >
> > I am not knowledgeable about Gora but am curious to know how parsing
> > performance might be affected if one uses different storage. With HBase it
> > worked fine for the OP but Cassandra gave this problem. Is the parsing code
> > separate for
>
> This is one thing we (Lewis and me) were just discussing. Well, you
> set up Nutch to use Gora to persist data in whichever data store you
> want, so all writes and reads are handled separately by each different
> module. HBase relies on Avro for persisting data but Cassandra does
> not. Cassandra has its own series of serializers to write everything
> into bytes to make operations have a better performance. We believe
> there is something going on with Cassandra serializers and the way
> Gora uses them which is keeping this specific job from working as
> desired.


So this issue occurs while writing parsed content to Cassandra. As the
serialization is performed for every URL, the problem should have been seen
for all URLs. But that's not the case. Maybe it has something to do with the
content.

>
> > these two? Or is it while writing parse output that the problem occurs? I
> > think it might be the latter one causing this but I am not sure.
>
> Could you point me to the parser job code so I can take a look at it?
> I am a foreigner @ Nutchland so I will appreciate your help in order
> to sort this out.
>
The core parsing classes are present at [0]. The parser job is implemented
at [1]. Depending on the content, specific parsing is required. This is
done by the parse plugins present at [2].

[0] :
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/
[1] :
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?view=markup
[2] : http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/
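
For anyone reading the stack trace quoted below: the TimeoutException is thrown from FutureTask.get inside ParseUtil.runParser, i.e. the parser is run as a task with a deadline and the caller gives up when the deadline passes. Here is a minimal sketch of that general pattern. It is not the Nutch code; the class name ParseTimeoutSketch, the helper runWithTimeout and the timeout values are made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {

    // Hypothetical helper (not a Nutch API): runs a parse task on a worker
    // thread and returns null if it does not finish within timeoutMillis.
    static String runWithTimeout(Callable<String> parseTask, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parseTask);
        try {
            // Blocks until the task completes or the deadline passes.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Same exception type as in the quoted log: the task was too slow.
            future.cancel(true); // interrupt the worker thread
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes well inside its deadline.
        String fast = runWithTimeout(() -> "parsed", 1000);
        // A task that sleeps past its deadline, standing in for a slow parse.
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100);
        System.out.println("fast=" + fast + ", slow=" + slow);
    }
}
```

With a pattern like this, anything that makes an individual parse slow (large or unusual content, a starved worker thread) shows up as a timeout rather than as a parser error, which would fit the problem appearing only for some URLs.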


> Renato M.
>
> > Thanks,
> > Tejas Patil
> >
> >
> > On Sat, Feb 16, 2013 at 1:36 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Can you dump your webdb and check what the various fields are like?
> >> Can you read these in an editor?
> >> I think there may be some problems with the serializers in gora-cassandra
> >> but I am not sure yet.
> >> Lewis
> >>
> >> On Saturday, February 16, 2013, t_gra <[email protected]> wrote:
> >> > Hi All,
> >> >
> >> > Experiencing the same problem as Žygimantas with Nutch 2.1 and Cassandra
> >> > (with HBase everything works OK).
> >> >
> >> > Here are some details of my setup:
> >> >
> >> > Node1 - NameNode, SecondaryNameNode, JobTracker
> >> > Node2..Node4 - TaskTracker, DataNode, Cassandra
> >> >
> >> > All these are virtual machines.
> >> > CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4 GB RAM.
> >> >
> >> > Running Nutch using
> >> > hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000
> >> >
> >> > Getting one mapper for parse job and very slow parsing of individual pages.
> >> >
> >> > Getting lots of errors like this:
> >> >
> >> > 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
> >> > java.util.concurrent.TimeoutException
> >> >         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
> >> >         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
> >> >         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
> >> >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
> >> >         at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
> >> >         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
> >> >         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
> >> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >> >         at java.security.AccessController.doPrivileged(Native Method)
> >> >         at javax.security.auth.Subject.doAs(Subject.java:416)
> >> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> >> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >> >
> >> > Any suggestions how to diagnose why it is behaving this way?
> >> >
> >> > Thanks!
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> > http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
> >> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >> >
> >>
> >> --
> >> *Lewis*
> >>
>
