The warning indicates that the parser exceeded the timeout set for parsing
the document. By default the timeout is 30 seconds. You might try
increasing the timeout value in conf/nutch-site.xml:

<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Timeout in seconds for the parsing of a document, otherwise
  treats it as an exception and moves on to the following documents. This
  parameter is applied to any Parser implementation. Set to -1 to
  deactivate, bearing in mind that this could cause the parsing to crash
  because of a very long or corrupted document.
  </description>
</property>
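
For instance, a minimal override in conf/nutch-site.xml might look like the
sketch below; the 120-second value is only an illustrative assumption, to
be tuned against your own documents:

<!-- Illustrative override for conf/nutch-site.xml. The value 120 is an
     assumption, not a recommendation; tune it for your corpus. -->
<property>
  <name>parser.timeout</name>
  <value>120</value>
</property>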

When setting this timeout, pick a sensible value based on the size of the
documents you are dealing with. Since you are seeing this warning many
times, my guess is that you are crawling large files. If not, the content
must contain something that makes the parser spend a lot of time.

After a few trials you will end up with a value good enough that the crawl
rate is not much affected and the percentage of warnings stays low.
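
One rough way to track that percentage is to count the timeout warnings in
the task logs. A minimal sketch, assuming a typical Hadoop userlogs
location (adjust the path to wherever your TaskTrackers write their logs):

# Count parse-timeout warnings per task log; the userlogs path is an
# assumption, point it at your own Hadoop log directory.
grep -rc "ParseUtil: Error parsing" /var/log/hadoop/userlogs/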

Thanks,
Tejas Patil


On Sat, Feb 16, 2013 at 1:16 PM, t_gra <[email protected]> wrote:

> Hi All,
>
> Experiencing same problem as Žygimantas with Nutch 2.1 and Cassandra (with
> HBase everything works OK).
>
> Here are some details of my setup:
>
> Node1 - NameNode, SecondaryNameNode, JobTracker
> Node2..Node4 - TaskTracker, DataNode, Cassandra
>
> All these are virtual machines.
> CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4Gb RAM.
>
> Running Nutch using
> hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10 -numTasks 3 -depth 2 -topN 10000
>
> Getting one mapper for the parse job and very slow parsing of individual pages.
>
> Getting lots of errors like this:
>
> 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing
> http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
> java.util.concurrent.TimeoutException
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
>         at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
>         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
>         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Any suggestions how to diagnose why it is behaving this way?
>
> Thanks!
