Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

This is a lot of information... thank you for compiling it all. Ideally the version of Hadoop used with Nutch should ALWAYS match the Hadoop binaries referenced in https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run into the classpath issues. I would like to encourage you to create a wiki page so we can document this in a user-friendly way... would you be open to that? You can create an account at https://cwiki.apache.org/confluence/display/NUTCH/Home

Thanks for your consideration.
lewismc

On 2021/07/14 18:27:23, Clark Benham wrote:
> Hi All,
>
> Sebastian helped fix my issue: using S3 as a backend I was able to get
> nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an
> oddity that nutch-1.19 shipped 11 hadoop-3.1.3 jars, e.g.
> hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ...; this made `hadoop
> version` report 3.1.3, so I replaced those 3.1.3 jars with the 3.3.0 jars
> from the hadoop download.
> Also, in the main nutch branch
> (https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
> has dependencies on hadoop-3.1.3, e.g.:
>
>   <dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3" conf="*->default">
>     ...
>   </dependency>
>   <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3" conf="*->default" />
>   <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.1.3" conf="*->default" />
>   <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />
>
> I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
>
> I didn't change "mapreduce.job.dir" because there are no namenode or
> datanode processes running when using hadoop with S3, so the UI is blank.
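A minimal sketch of the version alignment lewismc describes, assuming the cluster runs Hadoop 3.3.0 (the sed pattern and target version are assumptions; adjust them to the Hadoop actually installed), followed by a rebuild so that runtime/deploy/apache-nutch-*.job ships matching client jars:

  cd $NUTCH_HOME
  # bump only the org.apache.hadoop dependencies in ivy/ivy.xml from 3.1.3 to 3.3.0
  sed -i '/org="org.apache.hadoop"/s/rev="3.1.3"/rev="3.3.0"/' ivy/ivy.xml
  # rebuild runtime/local and runtime/deploy (including the .job file)
  ant clean runtime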
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark, thanks for summarizing this discussion and sharing the final configuration! Good to know that it's possible to run Nutch on Hadoop using S3A without using HDFS (no namenode/datanodes running). Best, Sebastian
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi All,

Sebastian helped fix my issue: using S3 as a backend I was able to get nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an oddity that nutch-1.19 shipped 11 hadoop-3.1.3 jars, e.g. hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ...; this made `hadoop version` report 3.1.3, so I replaced those 3.1.3 jars with the 3.3.0 jars from the hadoop download.

Also, in the main nutch branch (https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently has dependencies on hadoop-3.1.3.

I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.

I didn't change "mapreduce.job.dir" because there are no namenode or datanode processes running when using hadoop with S3, so the UI is blank.

Copied from email with Sebastian:

> > The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
> > with hadoop-3.2.1 [1].
>
> I had a look into the plugin loader: it can only read from the local file
> system. But that's ok because the Nutch job file is copied to the local
> machine and unpacked. Here is how the paths look on one of the running
> Common Crawl task nodes:

The configs for the working hadoop are as follows:

core-site.xml

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdoop/tmpdata</value>
  </property>

  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <value>KEY_PLACEHOLDER</value>
    <description>AWS access key ID.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_PLACEHOLDER</value>
    <description>AWS secret key.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>

  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value></value>
    <description>
      Comma-separated class names of credential provider classes which implement
      com.amazonaws.auth.AWSCredentialsProvider.

      These are loaded and queried in sequence for a valid set of credentials.
      Each listed class must implement one of the following means of
      construction, which are attempted in order:
      1. a public constructor accepting java.net.URI and
         org.apache.hadoop.conf.Configuration,
      2. a public static method named getInstance that accepts no
         arguments and returns an instance of
         com.amazonaws.auth.AWSCredentialsProvider, or
      3. a public default constructor.

      Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
      anonymous access to a publicly accessible S3 bucket without any credentials.
      Please note that allowing anonymous access to an S3 bucket compromises
      security and therefore is unsuitable for most use cases. It can be useful
      for accessing public data sets without requiring AWS credentials.

      If unspecified, then the default list of credential provider classes,
      queried in sequence, is:
      1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
         Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
      2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
         configuration of AWS access key ID and secret access key in
         environment variables named AWS_ACCESS_KEY_ID and
         AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
      3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
         of instance profile credentials if running in an EC2 VM.
    </description>
  </property>

</configuration>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
</dependency>

hadoop-env.sh

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE
## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.
##
## Precedence rules:
##
## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults
##
## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults
##

# Many of the options here are built from the perspective that users
# may want to provide OVERWRITING values on the command line.
# For example:
#
#  JAVA_HOME=/usr/java/testing hdfs dfs -ls
#
# Therefore, the vast majority (BUT NOT ALL!) of these defaults
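For reference, the yarn.nodemanager.local-dirs value mentioned above would look roughly like this in yarn-site.xml (a sketch; the value is the one reported in this thread, and it also matches the Hadoop default):

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>${hadoop.tmp.dir}/nm-local-dir</value>
  </property>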
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Sebastian,

NUTCH_HOME=~/nutch; the local filesystem. I am using a plain, pre-built hadoop.

There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1, 3.3.0, or Nutch 1.18, 1.19, but mapreduce.job.hdfs-servers defaults to ${fs.defaultFS}, so s3a://temp-crawler in our case. The plugin loader doesn't appear to be able to read from s3 in nutch-1.18 with hadoop-3.2.1 [1].

Using java & javac 11 with hadoop-3.3.0 downloaded and untarred, and a nutch-1.19 I built: I can run a mapreduce job on S3, and a Nutch job on hdfs, but running Nutch on S3 still gives "URLNormalizer not found" with the plugin dir on the local filesystem or on S3a. How would you recommend I go about getting the plugin loader to read from other file systems?

[1] I still get 'x point org.apache.nutch.net.URLNormalizer not found' (same stack trace as the previous email) with `plugin.folders s3a://temp-crawler/user/hdoop/nutch-plugins` set in my nutch-site.xml, while `hadoop fs -ls s3a://temp-crawler/user/hdoop/nutch-plugins` lists all the plugins as there.

For posterity: I got hadoop-3.3.0 working with an S3 backend by running:

  cd ~/hadoop-3.3.0
  cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
  cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar ./share/hadoop/common/lib

to solve "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not found", despite the class existing in ~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar, it showing up on the classpath via `hadoop classpath | tr ":" "\n" | grep share/hadoop/tools/lib/hadoop-aws-3.3.0.jar`, and my adding it to hadoop-env.sh. See https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f

On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel wrote:
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
>
> Also important: the value of "mapreduce.job.dir" - it's usually
> on hdfs:// and I'm not sure whether the plugin loader is able to
> read from other filesystems. At least, I haven't tried.
>
> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> > Hi Clark,
> >
> > sorry, I should have read your mail to the end - you mentioned that
> > you downgraded Nutch to run with JDK 8.
> >
> > Could you share which filesystem NUTCH_HOME points to?
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >
> > Best,
> > Sebastian
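As an aside, on Hadoop 3.x there is an alternative to copying the S3A jars into share/hadoop/common/lib: declaring the optional tools module in hadoop-env.sh, which should put hadoop-aws (and the AWS SDK bundle it depends on) onto the Hadoop classpath. A hedged sketch, not tested in this thread:

  # hadoop-env.sh
  export HADOOP_OPTIONAL_TOOLS="hadoop-aws"

  # verify that the jars are now on the classpath
  hadoop classpath | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'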
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
> The local file system? Or hdfs:// or even s3:// resp. s3a://?

Also important: the value of "mapreduce.job.dir" - it's usually on hdfs:// and I'm not sure whether the plugin loader is able to read from other filesystems. At least, I haven't tried.
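A related, checkable detail: the submitted .job file lands in the MapReduce staging directory (yarn.app.mapreduce.am.staging-dir, default /tmp/hadoop-yarn/staging), which is resolved against fs.defaultFS, so with an S3A default filesystem it sits on S3 rather than HDFS. A hedged sketch of how to inspect it (bucket and user names are assumptions taken from paths elsewhere in this thread):

  hadoop fs -ls s3a://my-bucket/tmp/hadoop-yarn/staging/hdoop/.staging/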
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

sorry, I should have read your mail to the end - you mentioned that you downgraded Nutch to run with JDK 8.

Could you share which filesystem NUTCH_HOME points to? The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. It looks like something is wrong fundamentally, not only with the plugins.

> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3

Are you aware that Nutch 1.19 will require JDK 11? - and the recent Nutch snapshots already do, see NUTCH-2857. Hadoop 3.2.1 does not support JDK 11, you'd need to use 3.3.0.

Is a plain vanilla Hadoop used, or a specific Hadoop distribution (e.g. Cloudera, Amazon EMR)?

Note: the normal way to run Nutch is:

  $NUTCH_HOME/runtime/deploy/bin/nutch ...

But in the end it will also call "hadoop jar apache-nutch-xyz.job ..."

Best,
Sebastian
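To make the note above concrete, the wrapper invocation for the inject step would look like this (a sketch; under the hood it runs the same `hadoop jar apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls` command used elsewhere in this thread):

  $NUTCH_HOME/runtime/deploy/bin/nutch inject crawl/crawldb urls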
Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi,

I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 backend/filesystem; however I get an error ‘URLNormalizer class not found’. I have edited nutch-site.xml so this plugin should be included:

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
  </property>

and then built on both nodes (I only have 2 machines). I’ve successfully run Nutch locally and in distributed mode using HDFS, and I’ve run a mapreduce job with S3 as hadoop’s file system.

I thought it was possible Nutch is not reading nutch-site.xml, because I can resolve an error by setting the config through the CLI, despite this duplicating nutch-site.xml. The command:

  hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments

throws

  java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property

while if I pass a value in for http.agent.name with `-Dhttp.agent.name=myScrapper` (making the command `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=clark crawl/crawldb crawl/segments`), I get an error about there being no input path, which makes sense as I haven’t been able to generate any segments.

However this method of setting Nutch configs doesn’t work for injecting URLs; e.g.:

  hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls

fails with the same “URLNormalizer” not found.

I tried copying the plugin dir to S3 and setting plugin.folders to be a path on S3, without success. (I expect the plugins to be bundled with the .job so this step should be unnecessary.)

The full stack trace for `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
# Took out multiple INFO messages
2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_01_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
        at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
# This error repeats 6 times total, 3 times for each node
2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_01
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
        Job Counters
                Failed map tasks=7
                Killed map tasks=1
                Killed reduce tasks=1
                Launched map tasks=8
                Other local map tasks=6
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=63196
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=31598
                Total vcore-milliseconds taken by all map tasks=31598
                Total megabyte-milliseconds taken by all map tasks=8089088
        Map-Reduce Framework
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_01
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector: java.lang.RuntimeException: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_01
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
        at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
        at
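For completeness, the "No agents listed in 'http.agent.name' property" error mentioned above goes away once the agent name is set in nutch-site.xml rather than on the command line; a sketch, using the placeholder value from earlier in this message:

  <property>
    <name>http.agent.name</name>
    <value>myScrapper</value>
  </property>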