Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-16 Thread Lewis John McGibbney
Hi Clark, This is a lot of information... thank you for compiling it all. Ideally the version of Hadoop used with Nutch should ALWAYS match the Hadoop binaries referenced in https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run into classpath issues. I would

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-15 Thread Sebastian Nagel
Hi Clark, thanks for summarizing this discussion and sharing the final configuration! Good to know that it's possible to run Nutch on Hadoop using S3A without using HDFS (no namenode/datanodes running). Best, Sebastian

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-14 Thread Clark Benham
Hi All, Sebastian helped fix my issue: using S3 as a backend, I was able to get nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an oddity: nutch-1.19 had 11 Hadoop 3.1.3 jars, e.g. hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar... ; this made running `hadoop version`
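
[Editor's sketch, not part of the original message] The mismatch described above (Hadoop 3.1.3 jars bundled alongside a Hadoop 3.3.0 install) could be spotted with a small check like the one below. This is only a minimal sketch: the function name `mismatched_hadoop_jars` and the filename pattern are assumptions made for illustration, and jars that follow their own version scheme would also be flagged.

```python
import re
from collections import defaultdict

# Assumed naming scheme: hadoop-<module>-<version>.jar,
# e.g. hadoop-hdfs-3.1.3.jar -> module "hadoop-hdfs", version "3.1.3".
HADOOP_JAR = re.compile(r"^(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)*)\.jar$")


def mismatched_hadoop_jars(jar_names, expected_version):
    """Group Hadoop jars whose version differs from expected_version.

    Returns a dict mapping each unexpected version to a sorted list of jars.
    Filenames that do not look like Hadoop jars are ignored.
    """
    mismatches = defaultdict(list)
    for name in jar_names:
        match = HADOOP_JAR.match(name)
        if match and match.group(2) != expected_version:
            mismatches[match.group(2)].append(name)
    return {version: sorted(names) for version, names in mismatches.items()}


if __name__ == "__main__":
    # Hypothetical listing; in practice the names could come from os.listdir()
    # on whichever directory holds the jars bundled with the Nutch job.
    jars = [
        "hadoop-hdfs-3.1.3.jar",
        "hadoop-yarn-api-3.1.3.jar",
        "hadoop-aws-3.3.0.jar",
        "commons-lang3-3.9.jar",
    ]
    print(mismatched_hadoop_jars(jars, "3.3.0"))
```

An empty result would suggest the bundled Hadoop jars agree with the cluster version; anything else lists the jars worth inspecting.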

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-17 Thread Clark Benham
Hi Sebastian, NUTCH_HOME=~/nutch; the local filesystem. I am using a plain, pre-built Hadoop. There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1, 3.3.0, or Nutch-1.18, 1.19, but mapreduce.job.hdfs-servers defaults to ${fs.defaultFS}, so s3a://temp-crawler in our case. The plugin loader
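
[Editor's sketch, not part of the original message] To illustrate why a property whose default is `${fs.defaultFS}` ends up as `s3a://temp-crawler` here, the snippet below mimics Hadoop-style `${...}` substitution in plain Python. It is a simplified model under stated assumptions, not Hadoop's actual implementation: the helper name `resolve` and the `max_depth` parameter are hypothetical, and only references to other properties in the same dict are expanded.

```python
import re

# A ${name} reference to another property.
VAR = re.compile(r"\$\{([^}$\s]+)\}")


def resolve(props, key, max_depth=20):
    """Expand ${other.key} references in props[key] until nothing changes.

    Unknown references are left untouched. max_depth (an assumed limit for
    this sketch) guards against circular definitions.
    """
    value = props[key]
    for _ in range(max_depth):
        expanded = VAR.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            return value
        value = expanded
    raise ValueError(f"too many substitutions while resolving {key!r}")


if __name__ == "__main__":
    # Values taken from the message above.
    props = {
        "fs.defaultFS": "s3a://temp-crawler",
        "mapreduce.job.hdfs-servers": "${fs.defaultFS}",
    }
    print(resolve(props, "mapreduce.job.hdfs-servers"))
```

With `fs.defaultFS` pointing at an S3A bucket, any setting that defaults to it follows along, which is why the job never references an hdfs:// location in this setup.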

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
> The local file system? Or hdfs:// or even s3:// resp. s3a://? Also important: the value of "mapreduce.job.dir" - it's usually on hdfs:// and I'm not sure whether the plugin loader is able to read from other filesystems. At least, I haven't tried. On 6/15/21 10:53 AM, Sebastian Nagel wrote:

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
Hi Clark, sorry, I should have read your mail to the end - you mentioned that you downgraded Nutch to run with JDK 8. Could you share which filesystem NUTCH_HOME points to? The local file system? Or hdfs:// or even s3:// resp. s3a://? Best, Sebastian On 6/15/21 10:24 AM, Clark Benham

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
Hi Clark, the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like there's something wrong fundamentally, not only with the plugins. > I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 Are you aware that the

Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Clark Benham
Hi, I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 backend/filesystem; however I get an error ‘URLNormalizer class not found’. I have edited nutch-site.xml so this plugin should be included: plugin.includes