Hi,

This looks like a bug in Nutch 2.x.

Please open an issue at http://issues.apache.org/jira/NUTCH
and add information about the exact Nutch version and the
configuration.  Invalid URLs should normally be filtered out
by URL filters or corrected by URL normalizers during the parsing step.
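
For illustration only: the outlink in your stack trace is entirely
percent-encoded, so java.net.URL finds no "http://" prefix and
TableUtil.reverseUrl fails. Below is a minimal standalone Java sketch
(not Nutch code, URL shortened) of the failure and of roughly what a
parse-time normalizer or filter would have to handle:

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.net.URLDecoder;

  public class EncodedOutlinkCheck {
      public static void main(String[] args) throws Exception {
          // Outlink as seen in the error: fully percent-encoded, so it does
          // not start with "http://" and java.net.URL rejects it.
          String raw = "http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather";

          try {
              new URL(raw);
          } catch (MalformedURLException e) {
              // Prints: rejected: no protocol: http%3A%2F%2Fwww.smh.com.au...
              System.out.println("rejected: " + e.getMessage());
          }

          // A normalizer could decode such outlinks (or a filter could drop
          // them) before they reach TableUtil.reverseUrl in the update job.
          String decoded = URLDecoder.decode(raw, "UTF-8");
          System.out.println("normalized: " + decoded);
      }
  }

Note that URLDecoder also turns '+' into a space, so a real normalizer
would have to treat the query part of such URLs more carefully; simply
dropping these outlinks via a URL filter may be the safer option.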

Thanks,
Sebastian

On 09/15/2016 08:58 AM, shubham.gupta wrote:
> Hey,
> 
> Whenever the update job is executed the following errors occur:
> 
> INFO mapreduce.Job: Task Id : attempt_1473832356852_0104_m_000000_2, Status : 
> FAILED
> Error: java.net.MalformedURLException: no protocol:
> http%3A%2F%2Fwww.smh.com.au%2Fact-news%2Fcanberra-weather-warm-april-expected-after-record-breaking-march-temperatures-20160401-gnw2pg.html&title=Canberra+weather%3A+warm+April+expected+after+record+breaking+March+temperatures&source=The+Sydney+Morning+Herald&summary=Canberra+can+expect+warmer+than+average+temperatures+to+continue+for+April+after+enjoying+its+equal+second+warmest+March+on+record
> 
>     at java.net.URL.<init>(URL.java:586)
>     at java.net.URL.<init>(URL.java:483)
>     at java.net.URL.<init>(URL.java:432)
>     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:43)
>     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:96)
>     at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:38)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> 
> 
> Job Counters
>         Failed map tasks=4
>         Launched map tasks=4
>         Other local map tasks=4
>         Total time spent by all maps in occupied slots (ms)=417438
>         Total time spent by all reduces in occupied slots (ms)=0
>         Total time spent by all map tasks (ms)=59634
>         Total vcore-seconds taken by all map tasks=59634
>         Total megabyte-seconds taken by all map tasks=213012648
> Exception in thread "main" java.lang.RuntimeException: job failed: name=[]update-table, jobid=job_1473832356852_0104
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
>     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
>     at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
>     at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> 
> As a result, no new URLs are updated in the corresponding tables.
> Please help.
> Thanks in advance.
> 
