Hi Clark, sorry, I should have read your mail to the end - you mentioned that you downgraded Nutch to run with JDK 8.
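Before anything else, it might be worth checking which configuration the job file actually carries. Assuming a stock 1.x deploy build, where nutch-site.xml is packed at the root of the job archive (the .job file is just a zip), you can read it without unpacking:

`unzip -p $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job nutch-site.xml | grep -A 2 plugin.includes`

If that doesn't show the plugin.includes value you edited, the job was built against a different conf directory than you think.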
Could you share which filesystem NUTCH_HOME points to? The local file system? Or hdfs://, or even s3:// resp. s3a://?

Best,
Sebastian

On 6/15/21 10:24 AM, Clark Benham wrote:
Hi,

I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 backend/filesystem; however, I get an error 'URLNormalizer class not found'. I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

and then built on both nodes (I only have 2 machines). I've successfully run Nutch locally and in distributed mode using HDFS, and I've run a mapreduce job with S3 as hadoop's file system.

I thought it was possible Nutch is not reading nutch-site.xml, because I can resolve one error by setting the config through the CLI, despite this duplicating nutch-site.xml. The command

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property`

while if I pass a value for http.agent.name with `-Dhttp.agent.name=myScrapper` (making the command `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=clark crawl/crawldb crawl/segments`), I get an error about there being no input path, which makes sense as I haven't been able to generate any segments.

However, this method of setting Nutch configs doesn't work for injecting URLs; e.g.

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls`

fails with the same "URLNormalizer not found" error.

I tried copying the plugin dir to S3 and setting <name>plugin.folders</name> to a path on S3, without success. (I expect the plugins to be bundled with the .job file, so this step should be unnecessary.)
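As a sanity check on the packaging side, I assume the bundled plugins can be listed straight out of the job archive (it should just be a zip file), e.g.:

`unzip -l $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job | grep urlnormalizer`

though I'm not sure what layout a healthy 1.19 job file is supposed to show.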
The full stack trace for `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
# Took out multiple INFO messages
2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
        at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
# This error repeats 6 times total, 3 times for each node
2021-06-15 07:06:26,035 INFO mapreduce.Job: map 100% reduce 100%
2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
        Job Counters
                Failed map tasks=7
                Killed map tasks=1
                Killed reduce tasks=1
                Launched map tasks=8
                Other local map tasks=6
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=63196
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=31598
                Total vcore-milliseconds taken by all map tasks=31598
                Total megabyte-milliseconds taken by all map tasks=8089088
        Map-Reduce Framework
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector: java.lang.RuntimeException: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0
        at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
        at org.apache.nutch.crawl.Injector.run(Injector.java:571)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.nutch.crawl.Injector.main(Injector.java:535)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)

P.S. I am using a downloaded hadoop-3.2.1; and the only odd thing about my Nutch build is that I had to replace all instances of "javac.version" with "ant.java.version", as the javac version was 11 against Java 1.8, giving the error 'javac invalid target release: 11':

grep -rl "javac.version" . --include "*.xml" | xargs sed -i s^javac.version^ant.java.version^g
grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
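(A possibly cleaner alternative to the sed edits, assuming build.xml takes the compiler source/target from the javac.version property as in earlier 1.x releases, would be to override it on the ant command line instead:

`ant -Djavac.version=1.8 runtime`

but I haven't verified this against the 1.19 build files.)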