This is a little embarrassing: I'm stuck on the very first step of the
new-user installation.

I installed apache-nutch-1.13-bin.tar on Ubuntu 16.04 with Oracle Java 8 and
am following wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt
file that contains only http://www.hunchmanifest.com

but I get zero URLs injected:

$ nutch inject crawl/crawldb urls
Injector: starting at 2017-07-21 16:17:18
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01

I've tried other URLs, and none of them are excluded by the regex rules (the
output above doesn't report any rejects either).
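
For reference, the seed setup described above boils down to roughly the
following. This is only a sketch, assuming the working directory is the Nutch
install directory; `seed_count` is just an illustrative variable name, and the
`grep` check assumes that blank lines and lines starting with `#` would not
count as seeds.

```shell
# Sketch of the seed setup described above (run from the Nutch install directory).
mkdir -p urls
printf '%s\n' 'http://www.hunchmanifest.com' > urls/seed.txt

# Sanity check: count lines that are neither blank nor comments.
# Assumption: such lines are the only ones that can become seed URLs.
seed_count=$(grep -cvE '^[[:space:]]*(#|$)' urls/seed.txt)
echo "seed lines: $seed_count"
```

If the count is at least one, the seed file itself is presumably not the
reason for the zero injected URLs.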

What could be wrong with my installation? There's nothing suspicious in the
logs other than warnings for plugins not found:

2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: urlDir: urls
2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2017-07-21 14:38:50,047 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-21 14:38:50,762 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
2017-07-21 14:38:50,775 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
2017-07-21 14:38:50,791 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
2017-07-21 14:38:50,861 WARN  mapred.LocalJobRunner - job_local540893461_0001
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

That last error may be a catch-all, or it could be because there are zero URLs
in the database.  I get the same behavior if I run the crawler.

The crawldb contains no files, although the directories are created. Could there be an
environment variable needed? I am running from the nutch install directory
with write permissions on all files.  Is there something I've overlooked?
Is there a -D debug switch I can use to gather more information?

(* Also, the content.rdf.u8.gz sample file cited in the NutchTutorial page
no longer exists; DMOZ has shut down, and the archive site preserves the
original link, which is now a 404.)
