I have run into this. The nutch shell scripts do not return an error status,
so you assume bin/crawl has succeeded when in fact it failed. Sometimes the
best approach is to determine whether you need a plugin.xml file and what its
content should be. You could also put a blank xml file in its place and see
whether the error changes.
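To make that concrete, here is a hedged sketch (the plugin name is taken from the logs below; the install path and the stub's content are assumptions, not a known-good plugin.xml):

```shell
# Hedged sketch: drop a minimal placeholder plugin.xml into a plugin directory
# the PluginRepository complains about, then re-run inject and see whether the
# error changes. Real installs would use /opt/apache-nutch-1.13; the path and
# stub content here are assumptions for illustration.
NUTCH_HOME=/tmp/apache-nutch-1.13
PLUGIN=parse-replace

mkdir -p "$NUTCH_HOME/plugins/$PLUGIN"
# A minimal well-formed stub; the attributes Nutch actually requires may differ.
cat > "$NUTCH_HOME/plugins/$PLUGIN/plugin.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-replace" name="placeholder" version="0.0.0" provider-name="test"/>
EOF

# Since bin/nutch may not propagate a failing exit status, check it explicitly:
#   "$NUTCH_HOME/bin/nutch" inject crawl/crawldb urls; echo "exit status: $?"
```

If the warning for that plugin disappears but the injector still reports zero urls, the plugin warnings were probably a red herring.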

On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <[email protected]> wrote:

> This is a little embarrassing: I'm stuck on the very first step of the
> new-user installation.
>
> I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 using Oracle Java 8,
> following the wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt
> file that contains only http://www.hunchmanifest.com
>
> but I get zero urls injected:
>
> $ nutch inject crawl/crawldb urls
> Injector: starting at 2017-07-21 16:17:18
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Total urls rejected by filters: 0
> Injector: Total urls injected after normalization and filtering: 0
> Injector: Total urls injected but already in CrawlDb: 0
> Injector: Total new urls injected: 0
> Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
>
> I've tried other URLs, and none are excluded by the regex rules (though the
> output above doesn't list any rejects either).
>
> What could be wrong with my installation? There's nothing suspicious in the
> logs other than warnings for plugins not found:
>
> 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: crawlDb:
> crawl/crawldb
> 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: urlDir: urls
> 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: Converting
> injected urls to crawl db entries.
> 2017-07-21 14:38:50,047 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2017-07-21 14:38:50,762 WARN  plugin.PluginRepository - Error while loading
> plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
> java.io.FileNotFoundException:
> /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or
> directory)
> 2017-07-21 14:38:50,775 WARN  plugin.PluginRepository - Error while loading
> plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
> java.io.FileNotFoundException:
> /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or
> directory)
> 2017-07-21 14:38:50,791 WARN  plugin.PluginRepository - Error while loading
> plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
> java.io.FileNotFoundException:
> /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or
> directory)
> 2017-07-21 14:38:50,861 WARN  mapred.LocalJobRunner -
> job_local540893461_0001
> java.lang.Exception: java.lang.NullPointerException
>     at
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(
> LocalJobRunner.java:462)
>
> That last error may be a catch-all, or it could be because there are zero
> urls in the database. I get the same behavior if I run the crawler.
>
> crawl/crawldb has no files, although the directories are created. Could an
> environment variable be needed? I am running from the nutch install
> directory with write permissions on all files. Is there something I've
> overlooked? Is there a -D debug switch I can use to gather more information?
>
> (* Also, the content.rdf.u8.gz sample file cited in the NutchTutorial page
> no longer exists; DMOZ is shut down, and the archive site preserves the
> original link, which is now a 404.)
>

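One more quick check before touching plugins: zero injected urls can also come from a malformed seed file, e.g. a stray UTF-8 BOM or Windows line endings leaving lines that no longer parse as URLs. A hedged sketch (the demo writes a throwaway copy under /tmp; for real use, point it at urls/seed.txt):

```shell
# Hedged sketch: normalize a seed file before re-running inject. A trailing CR
# or a leading BOM can make an otherwise valid URL line unparseable.
SEED=/tmp/seed-demo.txt
printf 'http://www.hunchmanifest.com\r\n' > "$SEED"   # simulate a CRLF seed file

# Reveal hidden bytes: a BOM prints as M-oM-;M-? and a carriage return as ^M.
cat -v "$SEED"

# Strip carriage returns so each line is a clean URL.
tr -d '\r' < "$SEED" > "$SEED.clean" && mv "$SEED.clean" "$SEED"
```

If `cat -v` shows `^M` at the end of your real seed lines, re-running inject after the cleanup is worth a try.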