Hi Tejas,

2013/2/16 Tejas Patil <[email protected]>:
> Hey Lewis,
>
> I am not too familiar with the Gora side of things, but I am curious how
> parsing performance might be affected by the choice of storage backend. With
> HBase it worked fine for the OP, but Cassandra gave this problem. Is the parsing code separate for

This is one thing we (Lewis and I) were just discussing. You set up
Nutch to use Gora to persist data in whichever data store you want, so
all writes and reads are handled separately by each backend module.
HBase relies on Avro for persisting data, but Cassandra does not:
Cassandra has its own set of serializers that write everything into
bytes for better performance. We believe there is something going on
between the Cassandra serializers and the way Gora uses them that is
preventing this specific job from working as desired.
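For context (since the backend is what changed between the working and failing runs): Gora picks the data store via configuration, not code, so the parse job itself is identical in both cases. A minimal sketch of the relevant Nutch 2.x setting, assuming the stock Gora store class names:

```xml
<!-- conf/nutch-site.xml: select the Gora backend -->
<property>
  <name>storage.data.store.class</name>
  <!-- HBase backend (persists via Avro): -->
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <!-- Cassandra backend (uses its own serializers):
       org.apache.gora.cassandra.store.CassandraStore -->
</property>
```

Swapping only this value while the symptom appears or disappears is what points the suspicion at the serialization layer rather than the parser.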

> these two ? Or is it while writing the parse output that the problem occurs ? I
> think it might be the latter, but I am not sure.

Could you point me to the parser job code so I can take a look at it?
I am still a foreigner in Nutchland, so I would appreciate your help
in sorting this out.
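For anyone reading along: the TimeoutException in the quoted trace comes from ParseUtil bounding each document's parse with a Future. A minimal sketch of that pattern (illustrative names, not Nutch's actual code), showing how a slow parse, or a map task stalled on slow datastore writes, surfaces as exactly this exception:

```java
import java.util.concurrent.*;

public class ParseTimeoutSketch {

    // Run a "parse" with a hard deadline, the way ParseUtil.runParser
    // bounds each document's parse time.
    static String parseWithTimeout(long parseMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> task = pool.submit(() -> {
            Thread.sleep(parseMillis); // stand-in for a slow parse or slow write
            return "parsed";
        });
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            task.cancel(true);         // this is where the WARN in the log is raised
            return "timed out";
        } catch (InterruptedException | ExecutionException e) {
            return "failed";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithTimeout(50, 500));  // completes in time
        System.out.println(parseWithTimeout(500, 50));  // exceeds the deadline
    }
}
```

So a flood of these warnings does not necessarily mean HtmlParser is slow; anything that stalls the mapper long enough to blow the per-document deadline produces the same trace.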


Renato M.

> Thanks,
> Tejas Patil
>
>
> On Sat, Feb 16, 2013 at 1:36 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Can you dump your webdb and check what the various fields are like?
>> Can you read these in an editor?
>> I think there may be some problems with the serializers in gora-cassandra,
>> but I am not sure yet.
>> Lewis
>>
>> On Saturday, February 16, 2013, t_gra <[email protected]> wrote:
>> > Hi All,
>> >
>> > Experiencing same problem as Žygimantas with Nutch 2.1 and Cassandra
>> (with
>> > HBase everything works OK).
>> >
>> > Here are some details of my setup:
>> >
>> > Node1 - NameNode, SecondaryNameNode, JobTracker
>> > Node2..Node4 - TaskTracker, DataNode, Cassandra
>> >
>> > All these are virtual machines.
>> > CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4 GB RAM.
>> >
>> > Running Nutch using
>> > hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10
>> -numTasks
>> > 3 -depth 2 -topN 10000
>> >
>> > Getting one mapper for parse job and very slow parsing of individual
>> pages.
>> >
>> > Getting lots of errors like this:
>> >
>> > 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error
>> parsing
>> > http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
>> > java.util.concurrent.TimeoutException
>> >         at
>> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
>> >         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>> >         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
>> >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
>> >         at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
>> >         at
>> org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
>> >         at
>> org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
>> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>> >         at
>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> >         at java.security.AccessController.doPrivileged(Native Method)
>> >         at javax.security.auth.Subject.doAs(Subject.java:416)
>> >         at
>> >
>>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> >
>> > Any suggestions how to diagnose why it is behaving this way?
>> >
>> > Thanks!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>>
>> http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
>> > Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>>
>> --
>> *Lewis*
>>
