What turned out to be the better solution is what I should have done the first time: build from source ;)
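For anyone who lands here with the same symptom, this is roughly the route (a sketch, assuming the 1.13 source tarball from the Apache archive and a working ant install; adjust paths to taste):

$ wget https://archive.apache.org/dist/nutch/1.13/apache-nutch-1.13-src.tar.gz
$ tar xzf apache-nutch-1.13-src.tar.gz
$ cd apache-nutch-1.13
$ ant runtime        # builds a deployable tree under runtime/local
$ cd runtime/local
$ mkdir urls && cp ~/seed.txt urls/seed.txt
$ bin/nutch inject crawl/crawldb urls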
Built with ant from the source tarball and using the same urls/seed.txt file, everything goes fine, which suggests there may be a problem in the 1.13 bin distribution file. Or, more likely, my novice fingers typed the wrong thing in some xml file and messed it up :)

On Fri, Jul 21, 2017 at 10:56 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> I have run into this. The nutch shell scripts do not return an error status,
> so you assume bin/crawl has done something when in fact it failed. Sometimes
> the best way is to determine whether you need a plugin.xml file and what its
> content should be. Possibly put a blank xml file in its place and see if the
> error changes.
>
> On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <g...@schemaapp.com> wrote:
>
> > This is a little embarrassing: I'm stuck on the very first step of the
> > new-user installation.
> >
> > I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 with Oracle Java 8,
> > following the wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt
> > file that contains only http://www.hunchmanifest.com
> >
> > but I get zero urls injected:
> >
> > $ nutch inject crawl/crawldb urls
> > Injector: starting at 2017-07-21 16:17:18
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Total urls rejected by filters: 0
> > Injector: Total urls injected after normalization and filtering: 0
> > Injector: Total urls injected but already in CrawlDb: 0
> > Injector: Total new urls injected: 0
> > Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
> >
> > I've tried other URLs and none are excluded by the regex rules (but the
> > output above doesn't list any rejects either).
> >
> > What could be wrong with my installation? There's nothing suspicious in
> > the logs other than warnings for plugins not found:
> >
> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: urlDir: urls
> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> > 2017-07-21 14:38:50,047 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2017-07-21 14:38:50,762 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
> > 2017-07-21 14:38:50,775 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
> > 2017-07-21 14:38:50,791 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> > 2017-07-21 14:38:50,861 WARN mapred.LocalJobRunner - job_local540893461_0001
> > java.lang.Exception: java.lang.NullPointerException
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> >
> > That last error may be a catch-all, or could be because there are zero
> > urls in the database. I get the same behavior if I run the crawler.
> >
> > crawldb has no files, although the directories are created. Could there
> > be an environment variable needed? I am running from the nutch install
> > directory with write permissions on all files. Is there something I've
> > overlooked? Is there a -D debug switch I can use to gather more
> > information?
> >
> > (* also, the content.rdf.u8.gz sample file cited in the NutchTutorial
> > page no longer exists; DMOZ is shut down and the archive site preserves
> > the original link, which is now a 404)
> >
>
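PS for anyone trying Edward's stub-plugin.xml experiment above: something like the following, dropped into e.g. plugins/parse-replace/plugin.xml, is the kind of placeholder he means (the attribute values here are my own guesses, not the plugin's real metadata):

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-replace" name="parse-replace stub"
        version="0.0.1" provider-name="stub">
</plugin>

No promises that the plugin loader accepts an otherwise empty <plugin> element, but at minimum the FileNotFoundException should change into a different error, which is exactly the signal Edward suggests looking for.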
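And on the question of a -D debug switch: I don't know of a single flag, but Nutch ships small checker tools that help isolate this kind of failure. A sketch, assuming your bin/nutch knows the filterchecker command (older builds may need the full class name org.apache.nutch.net.URLFilterChecker, and the flags vary by release, so check the usage message):

$ echo 'http://www.hunchmanifest.com/' | bin/nutch filterchecker -allCombined

Each input url should be echoed back prefixed with '+' if the configured filter chain accepts it, or '-' if it rejects it, which tells you whether regex-urlfilter.txt is the culprit independently of the injector.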