Nearly there; a few changes are still required in the NutchTutorial:

The indexing command needs the -D flag to supply solr.server.url:
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter -normalize -deleteGone
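
As a quick sanity check after indexing (a rough sketch, not part of the
tutorial; $CORE is my placeholder for whichever Solr core the documents go
into, since I'm not sure of the core name from the URL above):

curl 'http://localhost:8983/solr/admin/cores?action=STATUS'
curl "http://localhost:8983/solr/$CORE/select?q=*:*&rows=0"

The numFound value in the second response should match the count the indexer reports.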

but dedup does not -- instead it just takes the crawldb:
bin/nutch dedup crawl/crawldb

The clean command needs both the Solr URL and the crawldb -- presumably because
dedup only marks duplicates in the crawldb and clean then pushes those
deletions out to Solr:
bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb
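
If I understand the config layering right, solr.server.url could also be set
once in conf/nutch-site.xml so it doesn't have to be repeated with -D on each
command (just a sketch of that idea, not something the tutorial shows):

<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/</value>
</property>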

And lastly, the crawl example is missing the -s before urls:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ -s urls/ TestCrawl/ 2
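
For completeness, the seed setup this assumes is the one from my earlier mail:

mkdir -p urls
echo 'http://www.hunchmanifest.com' > urls/seed.txt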

This appears to pass the tutorial steps with the 1.14 snapshot from GitHub,
however ... it ends badly:

Indexing 20170721210807 to index
/home/ubuntu/apache-nutch-1.14-SNAPSHOT/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20170721210807
Segment dir is complete: TestCrawl/segments/20170721210807.
Indexer: starting at 2017-07-21 21:08:26
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 4/4 documents
Deleting 0 documents
Indexing 4/4 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Error running:
  /home/ubuntu/apache-nutch-1.14-SNAPSHOT/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20170721210807
Failed with exit value 255.
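
(The console only shows "Job failed!" and exit value 255; assuming the default
log4j setup, the underlying exception should land in logs/hadoop.log under the
Nutch directory, in case that says more than the output above.)

tail -n 100 /home/ubuntu/apache-nutch-1.14-SNAPSHOT/logs/hadoop.log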

So close!

Any and all guidance is eagerly welcome :)


On Fri, Jul 21, 2017 at 1:29 PM, Gary Murphy <[email protected]> wrote:

> One hiccup: the invertlinks step fails because
> /home/ubuntu/apache-nutch-1.14-SNAPSHOT/crawl/segments/20170721202124/parse_data
> did not exist -- creating that dir manually allows the step-by-step to continue.
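>
> (For the record, "creating that dir manually" was something along these
> lines, with the segment timestamp from that particular run:
>
> mkdir -p /home/ubuntu/apache-nutch-1.14-SNAPSHOT/crawl/segments/20170721202124/parse_data
>
> after which invertlinks went through.)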
>
> On Fri, Jul 21, 2017 at 1:25 PM, Gary Murphy <[email protected]> wrote:
>
>> What has turned out to be a better solution is what I should have done
>> the first time, build from sources ;)
>>
>> using the ant tar-bin installation package and the same urls/seed.txt
>> file, everything goes fine.
>>
>> which suggests there may be a problem in the 1.13 bin distribution file?
>> Or more likely my novice fingers typed the wrong thing in some xml file and
>> messed it up :)
>>
>>
>> On Fri, Jul 21, 2017 at 10:56 AM, Edward Capriolo <[email protected]>
>> wrote:
>>
>>> I have run into this. The nutch shell scripts do not return error status
>>> so you assume bin/crawl has done something when truly it failed. Sometimes
>>> the best way is to determine if you need a plugin.xml file and what the
>>> content should be. Possibly put a blank xml file in its place and see if
>>> the error changes.
>>>
>>> On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <[email protected]>
>>> wrote:
>>>
>>> > This is a little embarrassing: I'm stuck on the very first step of the
>>> > new-user installation.
>>> >
>>> > I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 using the Oracle Java 8,
>>> > following the wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt
>>> > file that contains only http://www.hunchmanifest.com
>>> >
>>> > but I get zero urls injected:
>>> >
>>> > $ nutch inject crawl/crawldb urls
>>> > Injector: starting at 2017-07-21 16:17:18
>>> > Injector: crawlDb: crawl/crawldb
>>> > Injector: urlDir: urls
>>> > Injector: Converting injected urls to crawl db entries.
>>> > Injector: Total urls rejected by filters: 0
>>> > Injector: Total urls injected after normalization and filtering: 0
>>> > Injector: Total urls injected but already in CrawlDb: 0
>>> > Injector: Total new urls injected: 0
>>> > Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
>>> >
>>> > I've tried other URLs and none are excluded by the regex rules (but the
>>> > above doesn't list any rejects either)
>>> >
>>> > What could be wrong with my installation? There's nothing suspicious in
>>> > the logs other than warnings for plugins not found:
>>> >
>>> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: crawlDb:
>>> > crawl/crawldb
>>> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: urlDir: urls
>>> > 2017-07-21 14:38:49,966 INFO  crawl.Injector - Injector: Converting
>>> > injected urls to crawl db entries.
>>> > 2017-07-21 14:38:50,047 WARN  util.NativeCodeLoader - Unable to load
>>> > native-hadoop library for your platform... using builtin-java classes
>>> > where applicable
>>> > 2017-07-21 14:38:50,762 WARN  plugin.PluginRepository - Error while loading
>>> > plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
>>> > java.io.FileNotFoundException:
>>> > /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
>>> > 2017-07-21 14:38:50,775 WARN  plugin.PluginRepository - Error while loading
>>> > plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
>>> > java.io.FileNotFoundException:
>>> > /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
>>> > 2017-07-21 14:38:50,791 WARN  plugin.PluginRepository - Error while loading
>>> > plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
>>> > java.io.FileNotFoundException:
>>> > /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> > 2017-07-21 14:38:50,861 WARN  mapred.LocalJobRunner -
>>> > job_local540893461_0001
>>> > java.lang.Exception: java.lang.NullPointerException
>>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>> >
>>> > That last error may be a catch-all, or could be because there are zero
>>> > urls in the database.  I get the same behavior if I run the crawler.
>>> >
>>> > crawldb has no files although directories are created. Could there be an
>>> > environment variable needed? I am running from the nutch install directory
>>> > with write permissions on all files.  Is there something I've overlooked?
>>> > Is there a -D debug switch I can use to gather more information?
>>> >
>>> > (* also, the content.rdf.u8.gz sample file cited in the NutchTutorial page
>>> > no longer exists; DMOZ is shut down and the archive site preserves the
>>> > original link that is now a 404)
>>> >
>>>
>>
>>
>
