The warning indicates that the parser exceeded the timeout set for parsing the document. By default the timeout is 30 seconds. You might try increasing it in conf/nutch-site.xml:
<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and moves on to the following documents. This parameter is applied to any Parser implementation. Set to -1 to deactivate, bearing in mind that this could cause the parsing to crash because of a very long or corrupted document.</description>
</property>

While setting this timeout, pick a sensible value based on the size of the documents you are dealing with. Since you are seeing this warning many times, my guess is that you are crawling large files. If not, the content must contain something that makes the parser spend a lot of time. After a few trials you will end up with a value good enough that the crawl rate isn't much affected and the percentage of warnings is low.

Thanks,
Tejas Patil

On Sat, Feb 16, 2013 at 1:16 PM, t_gra <[email protected]> wrote:
> Hi All,
>
> Experiencing the same problem as Žygimantas with Nutch 2.1 and Cassandra
> (with HBase everything works OK).
>
> Here are some details of my setup:
>
> Node1 - NameNode, SecondaryNameNode, JobTracker
> Node2..Node4 - TaskTracker, DataNode, Cassandra
>
> All these are virtual machines.
> CPU is reported as "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz", 4Gb RAM.
>
> Running Nutch using
> hadoop jar $JAR org.apache.nutch.crawl.Crawler /seeds -threads 10
> -numTasks 3 -depth 2 -topN 10000
>
> Getting one mapper for the parse job and very slow parsing of individual
> pages.
>
> Getting lots of errors like this:
>
> 2013-02-16 01:26:04,217 WARN org.apache.nutch.parse.ParseUtil: Error parsing
> http://someurl.net/ with org.apache.nutch.parse.html.HtmlParser@63a1bc40
> java.util.concurrent.TimeoutException
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:119)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)
>         at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)
>         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:129)
>         at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:78)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Any suggestions how to diagnose why it is behaving this way?
>
> Thanks!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Slow-parse-on-hadoop-tp4040215p4040897.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
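For anyone curious what produces that TimeoutException: as the stack trace shows, ParseUtil runs each parse as a Future and abandons it when parser.timeout expires. Here is a minimal Java sketch of that mechanism — the class and method names below are illustrative, not Nutch's actual API:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {

    // Run a parser call under a time budget, mirroring the role of
    // parser.timeout. Returns the parse result, or null on timeout
    // (Nutch logs the WARN you are seeing and moves on to the next
    // document).
    static String parseWithTimeout(Callable<String> parser, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> task = executor.submit(parser);
        try {
            return task.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            task.cancel(true);   // give up on this document
            return null;
        } catch (Exception e) {
            return null;         // any other parse failure
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast "document" parses fine; a slow one hits the timeout.
        System.out.println(parseWithTimeout(() -> "parsed", 500));
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(2000);  // simulate a huge or pathological document
            return "parsed";
        }, 100));
    }
}
```

This is why raising parser.timeout trades warning rate against crawl throughput: a bigger budget lets slow documents finish, but each one ties up a mapper thread for longer.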

