One hiccup: the invertlinks step fails because /home/ubuntu/apache-nutch-1.14-SNAPSHOT/crawl/segments/20170721202124/parse_data does not exist -- creating that directory manually allows the step-by-step run to continue.
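For reference, the workaround was roughly the following (the segment name is from my run, and the commands assume you are in the Nutch install root, so adjust paths to match your own crawl):

$ mkdir -p crawl/segments/20170721202124/parse_data
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments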
On Fri, Jul 21, 2017 at 1:25 PM, Gary Murphy <[email protected]> wrote:

> What has turned out to be a better solution is what I should have done the
> first time: build from sources ;)
>
> Using the ant tar-bin installation package and the same urls/seed.txt
> file, everything goes fine.
>
> Which suggests there may be a problem in the 1.13 bin distribution file?
> Or, more likely, my novice fingers typed the wrong thing in some xml file
> and messed it up :)
>
>
> On Fri, Jul 21, 2017 at 10:56 AM, Edward Capriolo <[email protected]> wrote:
>
>> I have run into this. The nutch shell scripts do not return error status,
>> so you assume bin/crawl has done something when in truth it failed.
>> Sometimes the best way is to determine if you need a plugin.xml file and
>> what the content should be. Possibly put a blank xml file in its place
>> and see if the error changes.
>>
>> On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <[email protected]> wrote:
>>
>> > This is a little embarrassing: I'm stuck on the very first step of the
>> > new-user installation.
>> >
>> > I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 with Oracle Java 8,
>> > following the wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt
>> > file that contains only http://www.hunchmanifest.com
>> >
>> > but I get zero urls injected:
>> >
>> > $ nutch inject crawl/crawldb urls
>> > Injector: starting at 2017-07-21 16:17:18
>> > Injector: crawlDb: crawl/crawldb
>> > Injector: urlDir: urls
>> > Injector: Converting injected urls to crawl db entries.
>> > Injector: Total urls rejected by filters: 0
>> > Injector: Total urls injected after normalization and filtering: 0
>> > Injector: Total urls injected but already in CrawlDb: 0
>> > Injector: Total new urls injected: 0
>> > Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
>> >
>> > I've tried other URLs and none are excluded by the regex rules (but the
>> > output above doesn't list any rejects either).
>> >
>> > What could be wrong with my installation? There's nothing suspicious in
>> > the logs other than warnings for plugins not found:
>> >
>> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
>> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: urlDir: urls
>> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> > 2017-07-21 14:38:50,047 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> > 2017-07-21 14:38:50,762 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
>> > 2017-07-21 14:38:50,775 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
>> > 2017-07-21 14:38:50,791 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>> > 2017-07-21 14:38:50,861 WARN mapred.LocalJobRunner - job_local540893461_0001
>> > java.lang.Exception: java.lang.NullPointerException
>> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>> >
>> > That last error may be a catch-all, or it could be because there are zero
>> > urls in the database. I get the same behavior if I run the crawler.
>> >
>> > The crawldb has no files, although the directories are created. Is there
>> > an environment variable I need to set? I am running from the nutch
>> > install directory with write permissions on all files. Is there something
>> > I've overlooked? Is there a -D debug switch I can use to gather more
>> > information?
>> >
>> > (* also, the content.rdf.u8.gz sample file cited in the NutchTutorial
>> > page no longer exists; DMOZ is shut down, and the archive site preserves
>> > the original link, which is now a 404)
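
P.S. For anyone else who hits the zero-urls-injected symptom quoted above, two
quick sanity checks worth running before digging deeper (from the Nutch install
directory, assuming the default crawl/ and conf/ layout from the tutorial):

$ bin/nutch readdb crawl/crawldb -stats
$ cat conf/regex-urlfilter.txt

The first reports how many urls actually landed in the crawldb; the second shows
the filter rules the injector applies -- the stock file ends with an
accept-everything rule (+.), so a plain http:// seed should normally pass.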

