Hi Clark,
This is a lot of information... thank you for compiling it all.
Ideally the version of Hadoop being used with Nutch should ALWAYS match the
Hadoop binaries referenced in
https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't
run into classpath issues.
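For example, a quick way to compare the two versions (assuming a Nutch source
checkout and a Hadoop install on the PATH) is:

  # Hadoop version the cluster is running
  hadoop version | head -1
  # Hadoop version Nutch pulls in at build time
  grep 'org.apache.hadoop' ivy/ivy.xml | grep -o 'rev="[^"]*"' | sort -u

If the two disagree, either rebuild Nutch after bumping the revisions in
ivy/ivy.xml or install the matching Hadoop release.
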
I would like to encourage you to create a wiki page so we can document this in
a user-friendly way... would you be open to that?
You can create an account at 
https://cwiki.apache.org/confluence/display/NUTCH/Home
Thanks for your consideration.
lewismc

On 2021/07/14 18:27:23, Clark Benham <cl...@thehive.ai> wrote: 
> Hi All,
> 
> Sebastian helped fix my issue: using S3 as a backend I was able to get
> nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. One oddity was
> that nutch-1.19 shipped with 11 Hadoop 3.1.3 jars, e.g.
> hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ... (this made running
> `hadoop version` report 3.1.3), so I replaced those 3.1.3 jars with the
> 3.3.0 jars from the Hadoop download.
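> 
> Roughly, the swap looks like this (a sketch only; the exact lib directory
> depends on your layout, e.g. runtime/local/lib or build/lib):
> 
> # drop the bundled Hadoop 3.1.3 jars from the Nutch lib dir
> find ~/nutch/runtime/local/lib -name 'hadoop-*-3.1.3.jar' -delete
> # copy in the matching 3.3.0 jars from the Hadoop download
> cp ~/hadoop-3.3.0/share/hadoop/{common,hdfs,mapreduce,yarn}/hadoop-*-3.3.0.jar \
>    ~/nutch/runtime/local/lib/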
> Also, in the main Nutch branch (
> https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
> has dependencies on hadoop-3.1.3, e.g.:
> <!-- Hadoop Dependencies -->
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3"
> conf="*->default">
> <exclude org="hsqldb" name="hsqldb" />
> <exclude org="net.sf.kosmosfs" name="kfs" />z
> <exclude org="net.java.dev.jets3t" name="jets3t" />
> <exclude org="org.eclipse.jdt" name="core" />
> <exclude org="org.mortbay.jetty" name="jsp-*" />
> <exclude org="ant" name="ant" />
> </dependency>
> <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3"
> conf="*->default" />
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
> rev="3.1.3" conf="*->default" />
> <dependency org="org.apache.hadoop"
> name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />
> <!-- End of Hadoop Dependencies -->
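> 
> If you want the build itself to pull in 3.3.0, a quick sketch (assuming a
> Nutch source checkout; adjust the version as needed) is to bump those
> revisions before building:
> 
> cd ~/nutch
> sed -i 's/rev="3.1.3"/rev="3.3.0"/g' ivy/ivy.xml
> ant clean runtime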
> 
> I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
> 
> I didn't change "mapreduce.job.dir" because there are no namenode or
> datanode processes running when using Hadoop with S3, so the UI is blank.
> 
> Copied from Email with Sebastian:
> >  > The plugin loader doesn't appear to be able to read from s3 in
> >  > nutch-1.18 with hadoop-3.2.1[1].
> 
> > I had a look into the plugin loader: it can only read from the local file
> > system. But that's ok because the Nutch job file is copied to the local
> > machine and unpacked. Here is how the paths look on one of the running
> > Common Crawl task nodes:
> 
> The configs for the working hadoop are as follows:
> 
> core-site.xml
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> 
>   <name>hadoop.tmp.dir</name>
> 
>   <value>/home/hdoop/tmpdata</value>
> 
> </property>
> 
> <property>
> 
>   <name>fs.defaultFS</name>
> 
>   <value>s3a://my-bucket</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>fs.s3a.access.key</name>
> 
>         <value>KEY_PLACEHOLDER</value>
> 
>   <description>AWS access key ID.
> 
>    Omit for IAM role-based or provider-based authentication.</description>
> 
> </property>
> 
> 
> <property>
> 
>   <name>fs.s3a.secret.key</name>
> 
>   <value>SECRET_PLACEHOLDER</value>
> 
>   <description>AWS secret key.
> 
>    Omit for IAM role-based or provider-based authentication.</description>
> 
> </property>
> 
> 
> <property>
> 
>   <name>fs.s3a.aws.credentials.provider</name>
> 
>   <description>
> 
>     Comma-separated class names of credential provider classes which
> implement
> 
>     com.amazonaws.auth.AWSCredentialsProvider.
> 
> 
>     These are loaded and queried in sequence for a valid set of credentials.
> 
>     Each listed class must implement one of the following means of
> 
>     construction, which are attempted in order:
> 
>     1. a public constructor accepting java.net.URI and
> 
>         org.apache.hadoop.conf.Configuration,
> 
>     2. a public static method named getInstance that accepts no
> 
>        arguments and returns an instance of
> 
>        com.amazonaws.auth.AWSCredentialsProvider, or
> 
>     3. a public default constructor.
> 
> 
>     Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
> allows
> 
>     anonymous access to a publicly accessible S3 bucket without any
> credentials.
> 
>     Please note that allowing anonymous access to an S3 bucket compromises
> 
>     security and therefore is unsuitable for most use cases. It can be
> useful
> 
>     for accessing public data sets without requiring AWS credentials.
> 
> 
>     If unspecified, then the default list of credential provider classes,
> 
>     queried in sequence, is:
> 
>     1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
> 
>        Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
> 
>     2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
> 
>         configuration of AWS access key ID and secret access key in
> 
>         environment variables named AWS_ACCESS_KEY_ID and
> 
>         AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
> 
>     3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
> 
>         of instance profile credentials if running in an EC2 VM.
> 
>   </description>
> 
> </property>
> 
> 
> 
> <dependencies>
> 
>   <dependency>
> 
>     <groupId>org.apache.hadoop</groupId>
> 
>     <artifactId>hadoop-client</artifactId>
> 
>     <version>${hadoop.version}</version>
> 
>   </dependency>
> 
>   <dependency>
> 
>     <groupId>org.apache.hadoop</groupId>
> 
>     <artifactId>hadoop-aws</artifactId>
> 
>     <version>${hadoop.version}</version>
> 
>   </dependency>
> 
> </dependencies>
> 
> 
> </configuration>
> 
> 
> 
> hadoop-env.sh
> 
> #
> 
> # Licensed to the Apache Software Foundation (ASF) under one
> 
> # or more contributor license agreements.  See the NOTICE file
> 
> # distributed with this work for additional information
> 
> # regarding copyright ownership.  The ASF licenses this file
> 
> # to you under the Apache License, Version 2.0 (the
> 
> # "License"); you may not use this file except in compliance
> 
> # with the License.  You may obtain a copy of the License at
> 
> #
> 
> #     http://www.apache.org/licenses/LICENSE-2.0
> 
> #
> 
> # Unless required by applicable law or agreed to in writing, software
> 
> # distributed under the License is distributed on an "AS IS" BASIS,
> 
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
> # See the License for the specific language governing permissions and
> 
> # limitations under the License.
> 
> 
> # Set Hadoop-specific environment variables here.
> 
> 
> ##
> 
> ## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
> 
> ## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
> 
> ## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE
> 
> ## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.
> 
> ##
> 
> ## Precedence rules:
> 
> ##
> 
> ## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults
> 
> ##
> 
> ## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults
> 
> ##
> 
> 
> # Many of the options here are built from the perspective that users
> 
> # may want to provide OVERWRITING values on the command line.
> 
> # For example:
> 
> #
> 
> #  JAVA_HOME=/usr/java/testing hdfs dfs -ls
> 
> #
> 
> # Therefore, the vast majority (BUT NOT ALL!) of these defaults
> 
> # are configured for substitution and not append.  If append
> 
> # is preferable, modify this file accordingly.
> 
> 
> ###
> 
> # Generic settings for HADOOP
> 
> ###
> 
> 
> # Technically, the only required environment variable is JAVA_HOME.
> 
> # All others are optional.  However, the defaults are probably not
> 
> # preferred.  Many sites configure these options outside of Hadoop,
> 
> # such as in /etc/profile.d
> 
> 
> # The java implementation to use. By default, this environment
> 
> # variable is REQUIRED on ALL platforms except OS X!
> 
> export HADOOP_HOME=~/hadoop-3.3.0
> 
> export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
> 
> export
> EXTRA_PATH=/home/hdoop/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar:/home/hdoop/nutch:/home/hdoop/nutch/lib/commons-jexl3-3.1.jar:/home/hdoop/nutch/build/plugins:/home/hdoop/nutch/build/lib/*
> 
> export PATH=$JAVA_HOME/bin:$EXTRA_PATH:$PATH
> 
> 
> # Location of Hadoop.  By default, Hadoop will attempt to determine
> 
> # this location based upon its execution path.
> 
> # export HADOOP_HOME=
> 
> 
> # Location of Hadoop's configuration information.  i.e., where this
> 
> # file is living. If this is not defined, Hadoop will attempt to
> 
> # locate it based upon its execution path.
> 
> #
> 
> # NOTE: It is recommend that this variable not be set here but in
> 
> # /etc/profile.d or equivalent.  Some options (such as
> 
> # --config) may react strangely otherwise.
> 
> #
> 
> # export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
> 
> 
> # The maximum amount of heap to use (Java -Xmx).  If no unit
> 
> # is provided, it will be converted to MB.  Daemons will
> 
> # prefer any Xmx setting in their respective _OPT variable.
> 
> # There is no default; the JVM will autoscale based upon machine
> 
> # memory size.
> 
> # export HADOOP_HEAPSIZE_MAX=
> 
> 
> # The minimum amount of heap to use (Java -Xms).  If no unit
> 
> # is provided, it will be converted to MB.  Daemons will
> 
> # prefer any Xms setting in their respective _OPT variable.
> 
> # There is no default; the JVM will autoscale based upon machine
> 
> # memory size.
> 
> # export HADOOP_HEAPSIZE_MIN=
> 
> 
> # Enable extra debugging of Hadoop's JAAS binding, used to set up
> 
> # Kerberos security.
> 
> # export HADOOP_JAAS_DEBUG=true
> 
> 
> # Extra Java runtime options for all Hadoop commands. We don't support
> 
> # IPv6 yet/still, so by default the preference is set to IPv4.
> 
> # export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
> 
> # For Kerberos debugging, an extended option set logs more information
> 
> # export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true
> -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
> 
> 
> # Some parts of the shell code may do special things dependent upon
> 
> # the operating system.  We have to set this here. See the next
> 
> # section as to why....
> 
> export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
> 
> 
> # Extra Java runtime options for some Hadoop commands
> 
> # and clients (i.e., hdfs dfs -blah).  These get appended to HADOOP_OPTS for
> 
> # such commands.  In most cases, # this should be left empty and
> 
> # let users supply it on the command line.
> 
> # export HADOOP_CLIENT_OPTS=""
> 
> 
> #
> 
> # A note about classpaths.
> 
> #
> 
> # By default, Apache Hadoop overrides Java's CLASSPATH
> 
> # environment variable.  It is configured such
> 
> # that it starts out blank with new entries added after passing
> 
> # a series of checks (file/dir exists, not already listed aka
> 
> # de-deduplication).  During de-deduplication, wildcards and/or
> 
> # directories are *NOT* expanded to keep it simple. Therefore,
> 
> # if the computed classpath has two specific mentions of
> 
> # awesome-methods-1.0.jar, only the first one added will be seen.
> 
> # If two directories are in the classpath that both contain
> 
> # awesome-methods-1.0.jar, then Java will pick up both versions.
> 
> 
> # An additional, custom CLASSPATH. Site-wide configs should be
> 
> # handled via the shellprofile functionality, utilizing the
> 
> # hadoop_add_classpath function for greater control and much
> 
> # harder for apps/end-users to accidentally override.
> 
> # Similarly, end users should utilize ${HOME}/.hadooprc .
> 
> # This variable should ideally only be used as a short-cut,
> 
> # interactive way for temporary additions on the command line.
> 
> export HADOOP_CLASSPATH=$EXTRA_PATH:$JAVA_HOME/bin:$HADOOP_CLASSPATH
> 
> 
> # Should HADOOP_CLASSPATH be first in the official CLASSPATH?
> 
> # export HADOOP_USER_CLASSPATH_FIRST="yes"
> 
> 
> # If HADOOP_USE_CLIENT_CLASSLOADER is set, the classpath along
> 
> # with the main jar are handled by a separate isolated
> 
> # client classloader when 'hadoop jar', 'yarn jar', or 'mapred job'
> 
> # is utilized. If it is set, HADOOP_CLASSPATH and
> 
> # HADOOP_USER_CLASSPATH_FIRST are ignored.
> 
> # export HADOOP_USE_CLIENT_CLASSLOADER=true
> 
> 
> # HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES overrides the default definition
> of
> 
> # system classes for the client classloader when
> HADOOP_USE_CLIENT_CLASSLOADER
> 
> # is enabled. Names ending in '.' (period) are treated as package names, and
> 
> # names starting with a '-' are treated as negative matches. For example,
> 
> # export
> HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES="-org.apache.hadoop.UserClass,java.,javax.,org.apache.hadoop."
> 
> 
> # Enable optional, bundled Hadoop features
> 
> # This is a comma delimited list.  It may NOT be overridden via .hadooprc
> 
> # Entries may be added/removed as needed.
> 
> # export
> HADOOP_OPTIONAL_TOOLS="hadoop-aliyun,hadoop-openstack,hadoop-azure,hadoop-azure-datalake,hadoop-aws,hadoop-kafka"
> 
> 
> ###
> 
> # Options for remote shell connectivity
> 
> ###
> 
> 
> # There are some optional components of hadoop that allow for
> 
> # command and control of remote hosts.  For example,
> 
> # start-dfs.sh will attempt to bring up all NNs, DNS, etc.
> 
> 
> # Options to pass to SSH when one of the "log into a host and
> 
> # start/stop daemons" scripts is executed
> 
> # export HADOOP_SSH_OPTS="-o BatchMode=yes -o StrictHostKeyChecking=no -o
> ConnectTimeout=10s"
> 
> 
> # The built-in ssh handler will limit itself to 10 simultaneous connections.
> 
> # For pdsh users, this sets the fanout size ( -f )
> 
> # Change this to increase/decrease as necessary.
> 
> # export HADOOP_SSH_PARALLEL=10
> 
> 
> # Filename which contains all of the hosts for any remote execution
> 
> # helper scripts # such as workers.sh, start-dfs.sh, etc.
> 
> # export HADOOP_WORKERS="${HADOOP_CONF_DIR}/workers"
> 
> 
> ###
> 
> # Options for all daemons
> 
> ###
> 
> #
> 
> 
> #
> 
> # Many options may also be specified as Java properties.  It is
> 
> # very common, and in many cases, desirable, to hard-set these
> 
> # in daemon _OPTS variables.  Where applicable, the appropriate
> 
> # Java property is also identified.  Note that many are re-used
> 
> # or set differently in certain contexts (e.g., secure vs
> 
> # non-secure)
> 
> #
> 
> 
> # Where (primarily) daemon log files are stored.
> 
> # ${HADOOP_HOME}/logs by default.
> 
> # Java property: hadoop.log.dir
> 
> # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
> 
> 
> # A string representing this instance of hadoop. $USER by default.
> 
> # This is used in writing log and pid files, so keep that in mind!
> 
> # Java property: hadoop.id.str
> 
> # export HADOOP_IDENT_STRING=$USER
> 
> 
> # How many seconds to pause after stopping a daemon
> 
> # export HADOOP_STOP_TIMEOUT=5
> 
> 
> # Where pid files are stored.  /tmp by default.
> 
> # export HADOOP_PID_DIR=/tmp
> 
> 
> # Default log4j setting for interactive commands
> 
> # Java property: hadoop.root.logger
> 
> # export HADOOP_ROOT_LOGGER=INFO,console
> 
> 
> # Default log4j setting for daemons spawned explicitly by
> 
> # --daemon option of hadoop, hdfs, mapred and yarn command.
> 
> # Java property: hadoop.root.logger
> 
> # export HADOOP_DAEMON_ROOT_LOGGER=INFO,RFA
> 
> 
> # Default log level and output location for security-related messages.
> 
> # You will almost certainly want to change this on a per-daemon basis via
> 
> # the Java property (i.e., -Dhadoop.security.logger=foo). (Note that the
> 
> # defaults for the NN and 2NN override this by default.)
> 
> # Java property: hadoop.security.logger
> 
> # export HADOOP_SECURITY_LOGGER=INFO,NullAppender
> 
> 
> # Default process priority level
> 
> # Note that sub-processes will also run at this level!
> 
> # export HADOOP_NICENESS=0
> 
> 
> # Default name for the service level authorization file
> 
> # Java property: hadoop.policy.file
> 
> # export HADOOP_POLICYFILE="hadoop-policy.xml"
> 
> 
> #
> 
> # NOTE: this is not used by default!  <-----
> 
> # You can define variables right here and then re-use them later on.
> 
> # For example, it is common to use the same garbage collection settings
> 
> # for all the daemons.  So one could define:
> 
> #
> 
> # export HADOOP_GC_SETTINGS="-verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"
> 
> #
> 
> # .. and then use it as per the b option under the namenode.
> 
> 
> ###
> 
> # Secure/privileged execution
> 
> ###
> 
> 
> #
> 
> # Out of the box, Hadoop uses jsvc from Apache Commons to launch daemons
> 
> # on privileged ports.  This functionality can be replaced by providing
> 
> # custom functions.  See hadoop-functions.sh for more information.
> 
> #
> 
> 
> # The jsvc implementation to use. Jsvc is required to run secure datanodes
> 
> # that bind to privileged ports to provide authentication of data transfer
> 
> # protocol.  Jsvc is not required if SASL is configured for authentication
> of
> 
> # data transfer protocol using non-privileged ports.
> 
> # export JSVC_HOME=/usr/bin
> 
> 
> #
> 
> # This directory contains pids for secure and privileged processes.
> 
> #export HADOOP_SECURE_PID_DIR=${HADOOP_PID_DIR}
> 
> 
> #
> 
> # This directory contains the logs for secure and privileged processes.
> 
> # Java property: hadoop.log.dir
> 
> # export HADOOP_SECURE_LOG=${HADOOP_LOG_DIR}
> 
> 
> #
> 
> # When running a secure daemon, the default value of HADOOP_IDENT_STRING
> 
> # ends up being a bit bogus.  Therefore, by default, the code will
> 
> # replace HADOOP_IDENT_STRING with HADOOP_xx_SECURE_USER.  If one wants
> 
> # to keep HADOOP_IDENT_STRING untouched, then uncomment this line.
> 
> # export HADOOP_SECURE_IDENT_PRESERVE="true"
> 
> 
> ###
> 
> # NameNode specific parameters
> 
> ###
> 
> 
> # Default log level and output location for file system related change
> 
> # messages. For non-namenode daemons, the Java property must be set in
> 
> # the appropriate _OPTS if one wants something other than INFO,NullAppender
> 
> # Java property: hdfs.audit.logger
> 
> # export HDFS_AUDIT_LOGGER=INFO,NullAppender
> 
> 
> # Specify the JVM options to be used when starting the NameNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # a) Set JMX options
> 
> # export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.port=1026"
> 
> #
> 
> # b) Set garbage collection logs
> 
> # export HDFS_NAMENODE_OPTS="${HADOOP_GC_SETTINGS}
> -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"
> 
> #
> 
> # c) ... or set them directly
> 
> # export HDFS_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
> -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"
> 
> 
> # this is the default:
> 
> # export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"
> 
> 
> ###
> 
> # SecondaryNameNode specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the SecondaryNameNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # This is the default:
> 
> # export HDFS_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"
> 
> 
> ###
> 
> # DataNode specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the DataNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # This is the default:
> 
> # export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS"
> 
> 
> # On secure datanodes, user to run the datanode as after dropping
> privileges.
> 
> # This **MUST** be uncommented to enable secure HDFS if using privileged
> ports
> 
> # to provide authentication of data transfer protocol.  This **MUST NOT** be
> 
> # defined if SASL is configured for authentication of data transfer protocol
> 
> # using non-privileged ports.
> 
> # This will replace the hadoop.id.str Java property in secure mode.
> 
> # export HDFS_DATANODE_SECURE_USER=hdfs
> 
> 
> # Supplemental options for secure datanodes
> 
> # By default, Hadoop uses jsvc which needs to know to launch a
> 
> # server jvm.
> 
> # export HDFS_DATANODE_SECURE_EXTRA_OPTS="-jvm server"
> 
> 
> ###
> 
> # NFS3 Gateway specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the NFS3 Gateway.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_NFS3_OPTS=""
> 
> 
> # Specify the JVM options to be used when starting the Hadoop portmapper.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_PORTMAP_OPTS="-Xmx512m"
> 
> 
> # Supplemental options for priviliged gateways
> 
> # By default, Hadoop uses jsvc which needs to know to launch a
> 
> # server jvm.
> 
> # export HDFS_NFS3_SECURE_EXTRA_OPTS="-jvm server"
> 
> 
> # On privileged gateways, user to run the gateway as after dropping
> privileges
> 
> # This will replace the hadoop.id.str Java property in secure mode.
> 
> # export HDFS_NFS3_SECURE_USER=nfsserver
> 
> 
> ###
> 
> # ZKFailoverController specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the ZKFailoverController.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_ZKFC_OPTS=""
> 
> 
> ###
> 
> # QuorumJournalNode specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the QuorumJournalNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_JOURNALNODE_OPTS=""
> 
> 
> ###
> 
> # HDFS Balancer specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the HDFS Balancer.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_BALANCER_OPTS=""
> 
> 
> ###
> 
> # HDFS Mover specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the HDFS Mover.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_MOVER_OPTS=""
> 
> 
> ###
> 
> # Router-based HDFS Federation specific parameters
> 
> # Specify the JVM options to be used when starting the RBF Routers.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_DFSROUTER_OPTS=""
> 
> 
> ###
> 
> # HDFS StorageContainerManager specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the HDFS Storage
> Container Manager.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_STORAGECONTAINERMANAGER_OPTS=""
> 
> 
> ###
> 
> # Advanced Users Only!
> 
> ###
> 
> 
> #
> 
> # When building Hadoop, one can add the class paths to the commands
> 
> # via this special env var:
> 
> # export HADOOP_ENABLE_BUILD_PATHS="true"
> 
> 
> #
> 
> # To prevent accidents, shell commands be (superficially) locked
> 
> # to only allow certain users to execute certain subcommands.
> 
> # It uses the format of (command)_(subcommand)_USER.
> 
> #
> 
> # For example, to limit who can execute the namenode command,
> 
> # export HDFS_NAMENODE_USER=hdfs
> 
> export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
> 
> 
> # Enable s3
> 
> export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
> 
> echo "Ensure AWS Credentials are added to hadoop-env.sh and core-site.xml,
> by running add-aws-keys.sh"
> 
> export AWS_ACCESS_KEY_ID=KEY_PLACEHOLDER
> 
> export AWS_SECRET_ACCESS_KEY=SECRET_PLACEHOLDER
> 
> 
> 
> hdfs-site.xml
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> 
>   <name>dfs.name.dir</name>
> 
>   <value>/home/hdoop/dfsdata/namenode</value>
> 
> </property>
> 
> <property>
> 
>   <name>dfs.data.dir</name>
> 
>   <value>/home/hdoop/dfsdata/datanode</value>
> 
> </property>
> 
> <property>
> 
>   <name>dfs.replication</name>
> 
>   <value>1</value>
> 
> </property>
> 
> </configuration>
> 
> 
> 
> mapred-site.xml
> 
> <?xml version="1.0"?>
> 
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> 
>   <name>mapreduce.framework.name</name>
> 
>   <value>yarn</value>
> 
> </property>
> 
> <property>
> 
>             <name>yarn.app.mapreduce.am.env</name>
> 
>     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
> 
>     </property>
> 
>     <property>
> 
>             <name>mapreduce.map.env</name>
> 
>     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
> 
>     </property>
> 
>     <property>
> 
>             <name>mapreduce.reduce.env</name>
> 
>     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
> 
>     </property>
> 
>     <!--
> 
>     <property>
> 
>     <name>mapreduce.application.classpath</name>
> 
> 
>   
> <value>/home/hdoop/hadoop-3.3.0/etc/hadoop:/home/hdoop/hadoop-3.3.0/share/hadoop/common/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/common/*:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs/*:/home/hdoop/hadoop-3.3.0/share/hadoop/mapreduce/*:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn/*:/home/hdoop/hadoop-3.3.0/bin:/home/hdoop/hadoop-3.3.0/sbin</value>
> 
>     </property>
> 
>  -->
> 
>     <property>
> 
>     <name>mapreduce.application.classpath</name>
> 
> 
>   
> <value>home/hdoop/hadoop-3.3.0/etc/hadoop:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/common/*:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/*:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/*:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/*:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/accessors-smart-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/animal-sniffer-annotations-1.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/asm-5.0.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/audience-annotations-0.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/avro-1.7.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/checker-qual-2.5.2.jar:/home/hdoop/hadoop-3.3.
 
0//share/hadoop/common/lib/commons-beanutils-1.9.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-cli-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-codec-1.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-collections-3.2.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-compress-1.18.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-configuration2-2.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-io-2.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-lang3-3.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-logging-1.1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-math3-3.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-net-3.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-text-1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-client-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/cu
 
rator-framework-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-recipes-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/dnsjava-2.1.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/error_prone_annotations-2.2.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/failureaccess-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/gson-2.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/guava-27.0-jre.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/hadoop-annotations-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/hadoop-auth-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/httpclient-4.5.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/httpcore-4.4.10.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/j2objc-annotations-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-annotations-2.9.8.jar:/h
 
ome/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-core-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-databind-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-xc-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/javax.servlet-api-3.1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jaxb-api-2.2.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jcip-annotations-1.0-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-core-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-json-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-server-1.19.jar:/home/hdoop/hadoop-3.3.0//share/
 
hadoop/common/lib/jersey-servlet-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jettison-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-http-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-io-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-security-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-server-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-servlet-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-util-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-webapp-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-xml-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsch-0.1.54.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/json-smart-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsp-api-2.1.jar:/home/hdoop/hadoop-3.3.
 
0//share/hadoop/common/lib/jsr305-3.0.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsr311-api-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jul-to-slf4j-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-admin-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-client-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-common-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-core-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-crypto-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-identity-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-server-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-simplekdc-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-asn1-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-config-1.0.1.jar:/home/hdoop/hado
 
op-3.3.0//share/hadoop/common/lib/kerby-pkix-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/log4j-1.2.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/metrics-core-3.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/netty-3.10.5.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/nimbus-jose-jwt-4.41.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/paranamer-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/protobuf-java-2.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/re2j-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/slf4j-api-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/l
 
ib/snappy-java-1.0.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/stax2-api-3.1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/token-provider-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/woodstox-core-5.0.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/zookeeper-3.4.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-common-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-kms-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-nfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/common/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/common/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/accessors-smart-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/animal-sniffer-annotations-1.17.jar:/home/hdoop/hadoop-3.3.0//share/hado
 
op/hdfs/lib/asm-5.0.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/audience-annotations-0.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/avro-1.7.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/checker-qual-2.5.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-beanutils-1.9.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-cli-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-codec-1.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-collections-3.2.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-compress-1.18.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-configuration2-2.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-io-2.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-lang3-3.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/home/hdoop/hadoop-3.3.0/
 
/share/hadoop/hdfs/lib/commons-math3-3.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-net-3.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-text-1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-client-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-framework-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-recipes-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/dnsjava-2.1.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/error_prone_annotations-2.2.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/failureaccess-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/gson-2.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/guava-27.0-jre.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/hadoop-annotations-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/hadoop-auth-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/htrace-core4-4.1.0-incubating.jar:/home/hdoo
 
p/hadoop-3.3.0//share/hadoop/hdfs/lib/httpclient-4.5.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/httpcore-4.4.10.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/j2objc-annotations-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-core-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-databind-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-jaxrs-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-xc-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/javax.servlet-api-3.1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jaxb-api-2.2.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jaxb-impl-2.2.3-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jcip-annotat
 
ions-1.0-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-core-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-json-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-server-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-servlet-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jettison-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-http-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-io-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-security-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-server-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-servlet-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-util-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-util-ajax-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-webapp-9.3
 
.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-xml-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsch-0.1.54.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/json-simple-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/json-smart-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsr311-api-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-admin-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-client-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-common-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-core-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-crypto-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-identity-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-server-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-simplekdc-1.0.1.jar:/ho
 
me/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-asn1-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-config-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-pkix-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-xdr-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/log4j-1.2.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/netty-3.10.5.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/netty-all-4.0.52.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/nimbus-jose-jwt-4.41.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/okhttp-2.7.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/okio-
 
1.6.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/paranamer-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/re2j-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/snappy-java-1.0.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/stax2-api-3.1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/token-provider-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/woodstox-core-5.0.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/zookeeper-3.4.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-client-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-httpfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-native-client-3.3
 
.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-native-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-nfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-rbf-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-rbf-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/junit-4.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-app-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-hs-3.
 
3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-nativetask-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-uploader-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib-examples:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/HikariCP-java7-2.4.12.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/y
 
arn/lib/aopalliance-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/bcpkix-jdk15on-1.60.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/bcprov-jdk15on-1.60.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/ehcache-3.3.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/fst-2.50.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/geronimo-jcache_1.0_spec-1.0-alpha-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/guice-4.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/guice-servlet-4.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-jaxrs-base-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-jaxrs-json-provider-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-module-jaxb-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/java-util-1.9.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/javax.inject-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jersey-client-1.19.jar:/home/hdoop/hadoop
 
-3.3.0//share/hadoop/yarn/lib/jersey-guice-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/json-io-2.5.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/metrics-core-3.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/mssql-jdbc-6.2.1.jre7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/objenesis-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/snakeyaml-1.16.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/swagger-annotations-1.5.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-api-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-registry-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/had
 
oop/yarn/hadoop-yarn-server-applicationhistoryservice-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-nodemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-resourcemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-router-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-tests-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-web-proxy-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-services-api-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-services-core-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-submarine-3.3.0.jar:/home/hdoop/ha
 
doop-3.3.0//share/hadoop/yarn/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/test:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/timelineservice:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/yarn-service-examples:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-3.3.0-sources.jar:/home/hdoop/hadoop-3.3.0//hadoop-tools/hadoop-aws/target/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/hadoop-tools/hadoop-aws/target/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/hadoop-tools/hadoop-aws/target/lib/aws-java-sdk-bundle-1.11.563.jar:/home/hdoop/nutch/build/lib/commons-jexl3-3.1.jar:/home/hdoop/nutch/lib/*</value>
> 
> </property>
> 
> 
>  <property>
> 
>         <name>yarn.app.mapreduce.am.resource.mb</name>
> 
>         <value>512</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>mapreduce.map.memory.mb</name>
> 
>         <value>256</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>mapreduce.reduce.memory.mb</name>
> 
>         <value>256</value>
> 
> </property>
> 
> 
> <!--from NutchHadoop Tutorial -->
> 
> <property>
> 
>   <name>mapred.system.dir</name>
> 
>   <value>/home/hdoop/dfsdata/mapreduce/system</value>
> 
> </property>
> 
> 
> <property>
> 
>   <name>mapred.local.dir</name>
> 
>   <value>/home/hdoop/dfsdata/mapreduce/local</value>
> 
> </property>
> 
> 
> </configuration>
> 
> 
> 
> workers
> 
> hadoop02Name
> 
> hadoop01Name
> 
> 
> 
> yarn-site.xml
> 
> <?xml version="1.0"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> <configuration>
> 
> <property>
> 
>   <name>yarn.nodemanager.aux-services</name>
> 
>   <value>mapreduce_shuffle</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
> 
>   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.resourcemanager.hostname</name>
> 
>   <value>IP_PLACEHOLDER</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.acl.enable</name>
> 
>   <value>0</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.nodemanager.env-whitelist</name>
> 
>   
> <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
> 
> </property>
> 
> 
> <!--<property>
> 
>  <name>yarn.application.classpath</name>
> 
>  
> <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$USS_HOME/*,$USS_CONF</value>
> 
> </property>-->
> 
> 
> <property>
> 
>         <name>yarn.nodemanager.resource.memory-mb</name>
> 
>         <value>1536</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>yarn.scheduler.maximum-allocation-mb</name>
> 
>         <value>1536</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>yarn.scheduler.minimum-allocation-mb</name>
> 
>         <value>128</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>yarn.nodemanager.vmem-check-enabled</name>
> 
>         <value>false</value>
> 
> </property>
> 
> <property>
> 
>         <name>yarn.nodemanager.local-dirs</name>
> 
>         <value>${hadoop.tmp.dir}/nm-local-dir</value>
> 
> </property>
> 
> 
> </configuration>
> 
> 
> On Thu, Jun 17, 2021 at 11:55 AM Clark Benham <cl...@thehive.ai> wrote:
> 
> >
> > Hi Sebastian,
> >
> > NUTCH_HOME=~/nutch; the local filesystem. I am using a plain, pre-built
> > Hadoop.
> > There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1, 3.3.0, or
> > Nutch 1.18/1.19, but mapreduce.job.hdfs-servers defaults to
> > ${fs.defaultFS}, so s3a://temp-crawler in our case.
> > The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
> > with hadoop-3.2.1[1].
> >
> > Using java & javac 11, with hadoop-3.3.0 downloaded and untarred and a
> > nutch-1.19 I built:
> > I can run a MapReduce job on S3 and a Nutch job on HDFS, but running
> > Nutch on S3 still gives "URLNormalizer not found", whether the plugin dir
> > is on the local filesystem or on S3a.
> >
> > How would you recommend I go about getting the plugin loader to read from
> > other file systems?
> >
> > [1]  I still get 'x point org.apache.nutch.net.URLNormalizer not found'
> > (same stack trace as previous email) with
> > `<name>plugin.folders</name>
> > <value>s3a://temp-crawler/user/hdoop/nutch-plugins</value>`
> > set in my nutch-site.xml, while `hadoop fs -ls
> > s3a://temp-crawler/user/hdoop/nutch-plugins` shows all the plugins are there.
> >
> >
> > For posterity:
> > I got hadoop-3.3.0 working with a S3 backend by:
> >
> > cd ~/hadoop-3.3.0
> >
> > cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
> >
> > cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar
> > ./share/hadoop/common/lib
> > to solve "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not
> > found" despite the class existing in
> > ~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar  checking it's
> > on the classpath with `hadoop classpath | tr ":" "\n"  | grep
> > share/hadoop/tools/lib/hadoop-aws-3.3.0.jar` as well as adding it to
> > hadoop-env.sh.
> > see
> > https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f
> >
> > On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel
> > <wastl.na...@googlemail.com.invalid> wrote:
> >
> >>  > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >>
> >> Also important: the value of "mapreduce.job.dir" - it's usually
> >> on hdfs:// and I'm not sure whether the plugin loader is able to
> >> read from other filesystems. At least, I haven't tried.
> >>
> >>
> >> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> >> > Hi Clark,
> >> >
> >> > sorry, I should read your mail until the end - you mentioned that
> >> > you downgraded Nutch to run with JDK 8.
> >> >
> >> > Could you share to which filesystem does NUTCH_HOME point?
> >> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >> >
> >> > Best,
> >> > Sebastian
> >> >
> >> >
> >> > On 6/15/21 10:24 AM, Clark Benham wrote:
> >> >> Hi,
> >> >>
> >> >>
> >> >> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> >> >> backend/filesystem; however I get an error ‘URLNormalizer class not
> >> found’.
> >> >> I have edited nutch-site.xml so this plugin should be included:
> >> >>
> >> >> <property>
> >> >>
> >> >>    <name>plugin.includes</name>
> >> >>
> >> >>
> >> >>
> >> <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> >>
> >> >>
> >> >>
> >> >>
> >> >> </property>
> >> >>
> >> >>   and then built on both nodes (I only have 2 machines).  I’ve
> >> successfully
> >> >> run Nutch locally and in distributed mode using HDFS, and I’ve run a
> >> >> mapreduce job with S3 as hadoop’s file system.
> >> >>
> >> >>
> >> >> I thought it was possible Nutch is not reading nutch-site.xml, because I
> >> >> can resolve an error by setting the config through the CLI, despite this
> >> >> duplicating nutch-site.xml.
> >> >>
> >> >> The command:
> >> >>
> >> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.fetcher.Fetcher
> >> >> crawl/crawldb crawl/segments`
> >> >>
> >> >> throws
> >> >>
> >> >> `java.lang.IllegalArgumentException: Fetcher: No agents listed in '
> >> >> http.agent.name' property`
> >> >>
> >> >> while if I pass a value in for http.agent.name with
> >> >> `-Dhttp.agent.name=myScrapper`,
> >> >> (making the command `hadoop jar
> >> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.fetcher.Fetcher
> >> >> -Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an
> >> error
> >> >> about there being no input path, which makes sense as I haven’t been
> >> able
> >> >> to generate any segments.
> >> >>
> >> >>
> >> >> However, this method of setting Nutch configs doesn't work for
> >> >> injecting URLs; e.g.:
> >> >>
> >> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.crawl.Injector
> >> >> -Dplugin.includes=".*" crawl/crawldb urls`
> >> >>
> >> >> fails with the same “URLNormalizer” not found.
> >> >>
> >> >>
> >> >> I tried copying the plugin dir to S3 and setting
> >> >> <name>plugin.folders</name> to be a path on S3 without success. (I
> >> expect
> >> >> the plugin to be bundled with the .job so this step should be
> >> unnecessary)
> >> >>
> >> >>
> >> >> The full stack trace for `hadoop jar
> >> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.crawl.Injector
> >> >> crawl/crawldb urls`:
> >> >>
> >> >> SLF4J: Class path contains multiple SLF4J bindings.
> >> >>
> >> >> SLF4J: Found binding in
> >> >>
> >> [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >> >>
> >> >> SLF4J: Found binding in
> >> >>
> >> [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >> >>
> >> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> >> >> explanation.
> >> >>
> >> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> >> >>
> >> >> # Took out multiple INFO messages
> >> >>
> >> >> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
> >> >> attempt_1623740678244_0001_m_000001_0, Status : FAILED
> >> >>
> >> >> Error: java.lang.RuntimeException: x point
> >> >> org.apache.nutch.net.URLNormalizer not found.
> >> >>
> >> >> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
> >> >>
> >> >> at
> >> org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
> >> >>
> >> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >> >>
> >> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
> >> >>
> >> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
> >> >>
> >> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
> >> >>
> >> >> at java.security.AccessController.doPrivileged(Native Method)
> >> >>
> >> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >>
> >> >> at
> >> >>
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> >> >>
> >> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> >> >>
> >> >>
> >> >> #This error repeats 6 times total, 3 times for each node
> >> >>
> >> >>
> >> >> 2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
> >> >>
> >> >> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
> >> >> failed with state FAILED due to: Task failed
> >> >> task_1623740678244_0001_m_000001
> >> >>
> >> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> >> killedReduces: 0
> >> >>
> >> >>
> >> >> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
> >> >>
> >> >> Job Counters
> >> >>
> >> >> Failed map tasks=7
> >> >>
> >> >> Killed map tasks=1
> >> >>
> >> >> Killed reduce tasks=1
> >> >>
> >> >> Launched map tasks=8
> >> >>
> >> >> Other local map tasks=6
> >> >>
> >> >> Rack-local map tasks=2
> >> >>
> >> >> Total time spent by all maps in occupied slots (ms)=63196
> >> >>
> >> >> Total time spent by all reduces in occupied slots (ms)=0
> >> >>
> >> >> Total time spent by all map tasks (ms)=31598
> >> >>
> >> >> Total vcore-milliseconds taken by all map tasks=31598
> >> >>
> >> >> Total megabyte-milliseconds taken by all map tasks=8089088
> >> >>
> >> >> Map-Reduce Framework
> >> >>
> >> >> CPU time spent (ms)=0
> >> >>
> >> >> Physical memory (bytes) snapshot=0
> >> >>
> >> >> Virtual memory (bytes) snapshot=0
> >> >>
> >> >> 2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not
> >> succeed,
> >> >> job status: FAILED, reason: Task failed
> >> task_1623740678244_0001_m_000001
> >> >>
> >> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> >> killedReduces: 0
> >> >>
> >> >>
> >> >> 2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
> >> >> java.lang.RuntimeException: Injector job did not succeed, job status:
> >> >> FAILED, reason: Task failed task_1623740678244_0001_m_000001
> >> >>
> >> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> >> killedReduces: 0
> >> >>
> >> >>
> >> >> at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
> >> >>
> >> >> at org.apache.nutch.crawl.Injector.run(Injector.java:571)
> >> >>
> >> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> >> >>
> >> >> at org.apache.nutch.crawl.Injector.main(Injector.java:535)
> >> >>
> >> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >>
> >> >> at
> >> >>
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> >>
> >> >> at
> >> >>
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >>
> >> >> at java.lang.reflect.Method.invoke(Method.java:498)
> >> >>
> >> >> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> >> >>
> >> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> P.S.
> >> >>
> >> >> I am using a downloaded hadoop-3.2.1; the only odd thing about my nutch
> >> >> build is that I had to replace all instances of "javac.version" with
> >> >> "ant.java.version", since javac was at version 11 while java was 1.8,
> >> >> giving the error 'javac invalid target release: 11':
> >> >>
> >> >> grep -rl "javac.version" . --include "*.xml" | xargs sed -i
> >> >> s^javac.version^ant.java.version^g
> >> >>
> >> >> grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
> >> >>
> >> >
> >>
> >>
> 
