One hiccup: the invertlinks step fails because
/home/ubuntu/apache-nutch-1.14-SNAPSHOT/crawl/segments/20170721202124/parse_data
does not exist -- creating that directory manually allows the step-by-step
tutorial to continue.
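
For anyone hitting the same thing, here is a rough sketch of that workaround as a shell function. The name ensure_parse_data is made up for illustration and is not part of Nutch. It only creates an empty parse_data directory when one is missing. A missing parse_data may mean the parse step never completed for that segment, so treat this as a way to get the walkthrough moving rather than a real fix.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_parse_data is a hypothetical helper, not a Nutch command.
# It creates <segment>/parse_data if the segment exists but the directory
# is missing, which is the manual step described above.
ensure_parse_data() {
  local segment="$1"

  # Refuse to invent a segment that is not there at all.
  if [ ! -d "$segment" ]; then
    echo "no such segment: $segment" >&2
    return 1
  fi

  # Create the directory only when it is actually missing.
  if [ ! -d "$segment/parse_data" ]; then
    echo "creating missing $segment/parse_data" >&2
    mkdir -p "$segment/parse_data"
  fi
}
```

Assuming the tutorial layout, I would run ensure_parse_data crawl/segments/20170721202124 from the install directory and then retry the invertlinks step.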

On Fri, Jul 21, 2017 at 1:25 PM, Gary Murphy <[email protected]> wrote:

> What has turned out to be a better solution is what I should have done the
> first time, build from sources ;)
>
> Using the ant tar-bin installation package and the same urls/seed.txt
> file, everything goes fine.
>
> This suggests there may be a problem in the 1.13 bin distribution file.
> Or, more likely, my novice fingers typed the wrong thing in some xml file
> and messed it up :)
>
>
> On Fri, Jul 21, 2017 at 10:56 AM, Edward Capriolo <[email protected]>
> wrote:
>
>> I have run into this. The nutch shell scripts do not return error status,
>> so you assume bin/crawl has done something when truly it failed.
>> Sometimes the best way is to determine if you need a plugin.xml file and
>> what the content should be. Possibly put a blank xml file in its place
>> and see if the error changes.
>>
>> On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <[email protected]> wrote:
>>
>> > This is a little embarrassing: I'm stuck on the very first step of the
>> > new-user installation.
>> >
>> > I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 using the Oracle Java8,
>> > following the wiki.apache.org/nutch/NutchTutorial (*) with a
>> > urls/seed.txt file that contains only http://www.hunchmanifest.com
>> >
>> > but I get zero urls injected:
>> >
>> > $ nutch inject crawl/crawldb urls
>> > Injector: starting at 2017-07-21 16:17:18
>> > Injector: crawlDb: crawl/crawldb
>> > Injector: urlDir: urls
>> > Injector: Converting injected urls to crawl db entries.
>> > Injector: Total urls rejected by filters: 0
>> > Injector: Total urls injected after normalization and filtering: 0
>> > Injector: Total urls injected but already in CrawlDb: 0
>> > Injector: Total new urls injected: 0
>> > Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
>> >
>> > I've tried other URLs and none are excluded by the regex rules (but the
>> > above doesn't list any rejects either)
>> >
>> > What could be wrong with my installation? There's nothing suspicious in
>> > the logs other than warnings for plugins not found:
>> >
>> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
>> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: urlDir: urls
>> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
>> > 2017-07-21 14:38:50,047 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> > 2017-07-21 14:38:50,762 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
>> > 2017-07-21 14:38:50,775 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
>> > 2017-07-21 14:38:50,791 WARN  plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>> > 2017-07-21 14:38:50,861 WARN  mapred.LocalJobRunner - job_local540893461_0001
>> > java.lang.Exception: java.lang.NullPointerException
>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>> >
>> >
>> > That last error may be a catch-all, or could be because there are zero
>> > urls in the database.  I get the same behavior if I run the crawler.
>> >
>> > The crawldb has no files, although directories are created. Could there
>> > be an environment variable needed? I am running from the nutch install
>> > directory with write permissions on all files.  Is there something I've
>> > overlooked? Is there a -D debug switch I can use to gather more
>> > information?
>> >
>> > (* also, the content.rdf.u8.gz sample file cited in the NutchTutorial
>> > page no longer exists; DMOZ is shut down and the archive site preserves
>> > the original link that is now a 404)
>> >
>>
>
>
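
On the point quoted above that the nutch scripts may not return an error status: one rough way to avoid trusting the exit code is to scan the log after each step. The sketch below assumes the log lives at logs/hadoop.log under the install directory (adjust the path if yours differs), and scan_nutch_log is a made-up helper name, not something Nutch ships. It is deliberately noisy: the plugin.xml FileNotFoundException warnings shown in the quoted log would be flagged too.

```shell
#!/usr/bin/env bash
# Sketch only: scan_nutch_log is a hypothetical helper, not a Nutch command.
# Assumption: the log path defaults to logs/hadoop.log; pass another path
# as the first argument if your setup differs.
# Returns 0 if nothing suspicious is found, 1 if ERROR/FATAL lines or Java
# exceptions are present, 2 if the log file itself is missing.
scan_nutch_log() {
  local log="${1:-logs/hadoop.log}"

  if [ ! -f "$log" ]; then
    echo "log not found: $log" >&2
    return 2
  fi

  # Print matching lines with line numbers so they are easy to look up.
  if grep -E -n ' (ERROR|FATAL) |Exception' "$log"; then
    return 1
  fi
  return 0
}
```

Calling scan_nutch_log right after a step such as inject gives a non-zero status when the log contains an exception, even if the step itself appeared to finish normally.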
