Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-16 Thread Lewis John McGibbney
Hi Clark,
This is a lot of information... thank you for compiling it all.
Ideally the version of Hadoop being used with Nutch should ALWAYS match the
Hadoop binaries referenced in
https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run
into classpath issues.
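
For example, a sketch of how the Hadoop entries in ivy/ivy.xml might be pinned
to match a 3.3.0 cluster (the version number here is just an example; rebuild
the job file with `ant runtime` afterwards):

<dependency org="org.apache.hadoop" name="hadoop-common" rev="3.3.0" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.3.0" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.3.0" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.3.0" conf="*->default" />
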
I would like to encourage you to create a wiki page so we can document this in
a user-friendly way... would you be open to that?
You can create an account at 
https://cwiki.apache.org/confluence/display/NUTCH/Home
Thanks for your consideration.
lewismc


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-15 Thread Sebastian Nagel

Hi Clark,

thanks for summarizing this discussion and sharing the final configuration!

Good to know that it's possible to run Nutch on Hadoop using S3A without
using HDFS (no namenode/datanodes running).

Best,
Sebastian


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-14 Thread Clark Benham
Hi All,

Sebastian helped fix my issue: using S3 as a backend, I was able to get
nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an
oddity that nutch-1.19 shipped 11 Hadoop 3.1.3 jars (e.g.
hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ...), which made
`hadoop version` report 3.1.3, so I replaced those 3.1.3 jars with the 3.3.0
jars from the Hadoop download.
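
A rough sketch of that jar swap (the paths are assumptions: they presume the
Nutch lib directory is ~/nutch/runtime/local/lib and the Hadoop 3.3.0 download
is unpacked in ~/hadoop-3.3.0, not necessarily what was actually used):

cd ~/nutch/runtime/local/lib
ls hadoop-*-3.1.3.jar    # the stray Hadoop 3.1.3 jars that shipped with Nutch
rm hadoop-*-3.1.3.jar
cp ~/hadoop-3.3.0/share/hadoop/{common,hdfs,mapreduce,yarn}/hadoop-*-3.3.0.jar .
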
Also, in the main Nutch branch
(https://github.com/apache/nutch/blob/master/ivy/ivy.xml), ivy.xml currently
declares dependencies on Hadoop 3.1.3, e.g.:

<dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3" conf="*->default">
  <!-- exclude rules omitted -->
</dependency>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.1.3" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />

I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
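
In yarn-site.xml that setting looks like this (a sketch of the property
described above):

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>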

I didn't change "mapreduce.job.dir" because there are no namenode or
datanode processes running when using Hadoop with S3, so the UI is blank.

Copied from an email exchange with Sebastian:

>> The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
>> with hadoop-3.2.1 [1].

> I had a look into the plugin loader: it can only read from the local file
> system. But that's ok because the Nutch job file is copied to the local
> machine and unpacked. Here is how the paths look on one of the running
> Common Crawl task nodes:

The configs for the working Hadoop setup are as follows:

core-site.xml

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdoop/tmpdata</value>
  </property>

  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <value>KEY_PLACEHOLDER</value>
    <description>AWS access key ID.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_PLACEHOLDER</value>
    <description>AWS secret key.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>

  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value></value>
    <description>
      Comma-separated class names of credential provider classes which implement
      com.amazonaws.auth.AWSCredentialsProvider.

      These are loaded and queried in sequence for a valid set of credentials.
      Each listed class must implement one of the following means of
      construction, which are attempted in order:
      1. a public constructor accepting java.net.URI and
         org.apache.hadoop.conf.Configuration,
      2. a public static method named getInstance that accepts no
         arguments and returns an instance of
         com.amazonaws.auth.AWSCredentialsProvider, or
      3. a public default constructor.

      Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
      anonymous access to a publicly accessible S3 bucket without any
      credentials. Please note that allowing anonymous access to an S3 bucket
      compromises security and therefore is unsuitable for most use cases. It
      can be useful for accessing public data sets without requiring AWS
      credentials.

      If unspecified, then the default list of credential provider classes,
      queried in sequence, is:
      1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
         Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
      2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
         configuration of AWS access key ID and secret access key in
         environment variables named AWS_ACCESS_KEY_ID and
         AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
      3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
         of instance profile credentials if running in an EC2 VM.
    </description>
  </property>

</configuration>

Additional dependencies referenced (hadoop-client and hadoop-aws):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
</dependency>

hadoop-env.sh

#

# Licensed to the Apache Software Foundation (ASF) under one

# or more contributor license agreements.  See the NOTICE file

# distributed with this work for additional information

# regarding copyright ownership.  The ASF licenses this file

# to you under the Apache License, Version 2.0 (the

# "License"); you may not use this file except in compliance

# with the License.  You may obtain a copy of the License at

#

# http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.


# Set Hadoop-specific environment variables here.


##

## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.

## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,

## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE

## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.

##

## Precedence rules:

##

## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults

##

## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults

##


# Many of the options here are built from the perspective that users

# may want to provide OVERWRITING values on the command line.

# For example:

#

#  JAVA_HOME=/usr/java/testing hdfs dfs -ls

#

# Therefore, the vast majority (BUT NOT ALL!) of these defaults


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-17 Thread Clark Benham
Hi Sebastian,

NUTCH_HOME=~/nutch, i.e. the local filesystem. I am using a plain, pre-built
Hadoop.
There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1, 3.3.0, or
Nutch 1.18/1.19, but mapreduce.job.hdfs-servers defaults to
${fs.defaultFS}, so s3a://temp-crawler in our case.
The plugin loader doesn't appear to be able to read from S3 in nutch-1.18
with hadoop-3.2.1 [1].

Using java & javac 11 with hadoop-3.3.0 downloaded and untared and a
nutch-1.19 I built:
I can run a mapreduce job on S3; and a Nutch job on hdfs, but running nutch
on S3 still gives "URLNormalizer not found" with the plugin dir on the
local filesystem or on S3a.

How would you recommend I go about getting the plugin loader to read from
other file systems?

[1] I still get 'x point org.apache.nutch.net.URLNormalizer not found'
(same stack trace as the previous email) with plugin.folders set to
s3a://temp-crawler/user/hdoop/nutch-plugins in my nutch-site.xml, while
`hadoop fs -ls s3a://temp-crawler/user/hdoop/nutch-plugins` lists all the
plugins as present.
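
For reference, a sketch of the default plugin.folders setting from
nutch-default.xml (the value "plugins" is a relative path that resolves inside
the locally unpacked job file, which is the only place the plugin loader reads
from):

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
</property>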


For posterity:
I got hadoop-3.3.0 working with an S3 backend by:

cd ~/hadoop-3.3.0
cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar ./share/hadoop/common/lib

This solved "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not
found", which I was getting even though the class exists in
~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar, that jar shows up
on the classpath when checking with `hadoop classpath | tr ":" "\n" | grep
share/hadoop/tools/lib/hadoop-aws-3.3.0.jar`, and I had added it to
hadoop-env.sh.
See
https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f
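
A quick sanity check that the jars landed where Hadoop picks them up (a
sketch; the bucket name is just the one used elsewhere in this thread):

ls ~/hadoop-3.3.0/share/hadoop/common/lib | grep -E 'hadoop-aws|aws-java-sdk'
hadoop classpath | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'
hadoop fs -ls s3a://temp-crawler/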


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

> The local file system? Or hdfs:// or even s3:// resp. s3a://?

Also important: the value of "mapreduce.job.dir" - it's usually
on hdfs:// and I'm not sure whether the plugin loader is able to
read from other filesystems. At least, I haven't tried.



Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

Hi Clark,

sorry, I should have read your mail to the end - you mentioned that
you downgraded Nutch to run with JDK 8.

Could you share which filesystem NUTCH_HOME points to?
The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian



Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

Hi Clark,

the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like 
there's something wrong fundamentally, not only with the plugins.


> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3

Are you aware that Nutch 1.19 will require JDK 11? The recent Nutch snapshots
already do; see NUTCH-2857. Hadoop 3.2.1 does not support JDK 11, so you'd need
to use 3.3.0. Is a plain vanilla Hadoop used, or a specific Hadoop
distribution (e.g. Cloudera, Amazon EMR)?


Note: the normal way to run Nutch is:
  $NUTCH_HOME/runtime/deploy/bin/nutch  ...
But in the end it will also call "hadoop jar apache-nutch-xyz.job ..."
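
For example, a sketch using the Injector invocation from this thread (the
wrapper script resolves the job file and main class itself):

$NUTCH_HOME/runtime/deploy/bin/nutch inject crawl/crawldb urls
# ... which ends up running roughly:
hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls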

Best,
Sebastian


Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Clark Benham
Hi,


I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

I then built Nutch on both nodes (I only have 2 machines). I've successfully
run Nutch locally and in distributed mode using HDFS, and I've run a
mapreduce job with S3 as Hadoop's file system.


I thought it was possible Nutch is not reading nutch-site.xml, because I can
resolve an error by setting the config through the CLI even though this
duplicates what is already in nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in '
http.agent.name' property`

while if I pass a value in for http.agent.name with
`-Dhttp.agent.name=myScrapper`,
(making the command `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.


However, this method of setting Nutch configs doesn't work for injecting
URLs, e.g.:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same "URLNormalizer not found" error.


I tried copying the plugin dir to S3 and setting plugin.folders to an S3 path,
without success. (I expect the plugins to be bundled with the .job file, so
this step should be unnecessary.)


The full stack trace for `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

# Took out multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_01_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


# This error repeats 6 times total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed
task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken by all map tasks=31598

Total megabyte-milliseconds taken by all map tasks=8089088

Map-Reduce Framework

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Virtual memory (bytes) snapshot=0

2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed,
job status: FAILED, reason: Task failed task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
java.lang.RuntimeException: Injector job did not succeed, job status:
FAILED, reason: Task failed task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


at org.apache.nutch.crawl.Injector.inject(Injector.java:444)

at