What has turned out to be a better solution is what I should have done the
first time: build from source ;)

Using the installation package produced by the ant tar-bin target, and the
same urls/seed.txt file, everything goes fine.
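
For anyone who hits the same wall, the build was just the standard one
(assuming the stock 1.13 source tarball and its default ant targets; file
names may differ on your end):

  # unpack the source release and build the local runtime
  tar xzf apache-nutch-1.13-src.tar.gz
  cd apache-nutch-1.13
  ant runtime              # or "ant tar-bin" for a distributable tarball
  cd runtime/local

  # same seed setup as before
  mkdir -p urls
  echo 'http://www.hunchmanifest.com' > urls/seed.txt
  bin/nutch inject crawl/crawldb urls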

Which suggests there may be a problem with the 1.13 bin distribution file?
Or, more likely, my novice fingers typed the wrong thing into some XML file
and messed it up :)


On Fri, Jul 21, 2017 at 10:56 AM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> I have run into this. The Nutch shell scripts do not return an error status,
> so you assume bin/crawl has done something when in fact it failed. Sometimes
> the best way is to determine whether you need a plugin.xml file and what its
> content should be. You could also put a blank XML file in its place and see
> whether the error changes.
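>
> For example (paths and the placeholder content below are only illustrative),
> check the exit status yourself and tail the log, since the wrapper scripts
> may report success regardless:
>
>   bin/nutch inject crawl/crawldb urls
>   echo $?                  # may be 0 even when the job actually failed
>   tail -n 50 logs/hadoop.log
>
> And a hypothetical stub plugin.xml, just to see whether the error changes:
>
>   mkdir -p plugins/parse-replace
>   cat > plugins/parse-replace/plugin.xml <<'EOF'
>   <?xml version="1.0" encoding="UTF-8"?>
>   <plugin id="parse-replace" name="stub" version="0.0.0" provider-name="stub"/>
>   EOF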
>
> On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <g...@schemaapp.com> wrote:
>
> > This is a little embarrassing: I'm stuck on the very first step of the
> > new-user installation.
> >
> > I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 using Oracle Java 8,
> > following the wiki.apache.org/nutch/NutchTutorial (*), with a
> > urls/seed.txt file that contains only http://www.hunchmanifest.com
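> >
> > For reference, the seed setup was literally just this, with nothing else
> > changed from the defaults:
> >
> >   cd /opt/apache-nutch-1.13
> >   mkdir -p urls
> >   echo 'http://www.hunchmanifest.com' > urls/seed.txt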
> >
> > but I get zero URLs injected:
> >
> > $ nutch inject crawl/crawldb urls
> > Injector: starting at 2017-07-21 16:17:18
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Total urls rejected by filters: 0
> > Injector: Total urls injected after normalization and filtering: 0
> > Injector: Total urls injected but already in CrawlDb: 0
> > Injector: Total new urls injected: 0
> > Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
> >
> > I've tried other URLs and none are excluded by the regex rules (though the
> > output above doesn't list any rejects either).
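> >
> > Is something like this the right way to exercise the filters directly? I'm
> > guessing at the checker class and its flag here:
> >
> >   bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt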
> >
> > What could be wrong with my installation? There's nothing suspicious in
> > the logs other than warnings for plugins not found:
> >
> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: urlDir: urls
> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
> > 2017-07-21 14:38:50,047 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2017-07-21 14:38:50,762 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
> > 2017-07-21 14:38:50,775 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
> > 2017-07-21 14:38:50,791 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> > 2017-07-21 14:38:50,861 WARN  mapred.LocalJobRunner - job_local540893461_0001
> > java.lang.Exception: java.lang.NullPointerException
> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> >
> > That last error may be a catch-all, or it could be because there are zero
> > URLs in the database. I get the same behavior if I run the crawl script.
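> >
> > In case the plugin warnings are relevant, would comparing what is on disk
> > with what the default config expects help? For example (the grep options
> > are just one convenient view):
> >
> >   ls /opt/apache-nutch-1.13/plugins/
> >   grep -A 2 'plugin.includes' /opt/apache-nutch-1.13/conf/nutch-default.xml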
> >
> > crawl/crawldb has no files, although the directories are created. Could
> > there be an environment variable that needs to be set? I am running from
> > the Nutch install directory with write permissions on all files. Is there
> > something I've overlooked? Is there a -D debug switch I can use to gather
> > more information?
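> >
> > Would something like this be a sane way to confirm the crawldb really is
> > empty (assuming readdb copes with a freshly created db)?
> >
> >   bin/nutch readdb crawl/crawldb -stats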
> >
> > (* Also, the content.rdf.u8.gz sample file cited in the NutchTutorial
> > page no longer exists; DMOZ is shut down, and the archive site preserves
> > the original link, which is now a 404.)
> >
>
