I have run into this. The nutch shell scripts do not return error status, so you assume bin/crawl has done something when in fact it failed. The best next step may be to determine whether you need a plugin.xml file and what its content should be. Possibly put a blank xml file in its place and see if the error changes.
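A minimal sketch of that workaround: drop a small, well-formed placeholder plugin.xml into each plugin directory the loader complains about, so the FileNotFoundException either goes away or turns into a more telling error. The NUTCH_HOME default and the descriptor attributes below are assumptions for illustration, not real plugin metadata; the log in the quoted mail shows the install at /opt/apache-nutch-1.13.

```shell
# Point NUTCH_HOME at your Nutch install before running (assumption: a
# demo path is used here as the fallback).
NUTCH_HOME="${NUTCH_HOME:-/tmp/nutch-demo}"

# Plugin names taken from the WARNs in the quoted log.
for p in parse-replace plugin publish-rabitmq; do
  mkdir -p "$NUTCH_HOME/plugins/$p"
  # Minimal placeholder descriptor; id/name/version are dummies.
  cat > "$NUTCH_HOME/plugins/$p/plugin.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="$p" name="$p (placeholder)" version="0.0.0" provider-name="placeholder"/>
EOF
done
```

If the inject run then fails differently (or the WARNs disappear), you know the missing descriptors were the issue rather than a symptom of something else.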
On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <[email protected]> wrote:
> This is a little embarrassing: I'm stuck on the very first step of the
> new-user installation.
>
> I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 using the Oracle Java 8,
> following the wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt
> file that contains only http://www.hunchmanifest.com
>
> but I get zero urls injected:
>
> $ nutch inject crawl/crawldb urls
> Injector: starting at 2017-07-21 16:17:18
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Total urls rejected by filters: 0
> Injector: Total urls injected after normalization and filtering: 0
> Injector: Total urls injected but already in CrawlDb: 0
> Injector: Total new urls injected: 0
> Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
>
> I've tried other URLs and none are excluded by the regex rules (but the
> above doesn't list any rejects either).
>
> What could be wrong with my installation? There's nothing suspicious in
> the logs other than warnings for plugins not found:
>
> 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: urlDir: urls
> 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2017-07-21 14:38:50,047 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2017-07-21 14:38:50,762 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
> java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
> 2017-07-21 14:38:50,775 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
> java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
> 2017-07-21 14:38:50,791 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
> java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> 2017-07-21 14:38:50,861 WARN mapred.LocalJobRunner - job_local540893461_0001
> java.lang.Exception: java.lang.NullPointerException
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>
> That last error may be a catch-all, or could be because there are zero urls
> in the database. I get the same behavior if I run the crawler.
>
> crawldb has no files although directories are created. Could there be an
> environment variable needed? I am running from the nutch install directory
> with write permissions on all files. Is there something I've overlooked?
> Is there a -D debug switch I can use to gather more information?
>
> (* also, the content.rdf.u8.gz sample file cited in the NutchTutorial page
> no longer exists; DMOZ is shut down and the archive site preserves the
> original link that is now a 404)
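For what it's worth, the WARNs in the quoted log all come from plugin directories that exist but lack a plugin.xml descriptor. A quick way to list every affected directory is a loop like the one below; the demo builds a throwaway layout under a temp dir, and the directory names are stand-ins, so on a real install you would point PLUGINS_DIR at $NUTCH_HOME/plugins instead.

```shell
# Demo layout (assumption: throwaway temp dir standing in for the real
# plugins directory); one plugin has its descriptor, one does not.
PLUGINS_DIR=$(mktemp -d)/plugins
mkdir -p "$PLUGINS_DIR/parse-html" "$PLUGINS_DIR/parse-replace"
printf '<plugin id="parse-html"/>' > "$PLUGINS_DIR/parse-html/plugin.xml"

# Collect the names of plugin directories missing plugin.xml, the
# condition behind the PluginRepository WARNs.
missing=$(for d in "$PLUGINS_DIR"/*/; do
  [ -f "${d}plugin.xml" ] || basename "$d"
done)
echo "missing descriptors: $missing"
```

Running it against the actual plugins directory tells you whether the three plugins in the log are the only ones affected or just the ones that happened to get logged first.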

