Nearly there, a few changes are required in the NutchTutorial:

The indexing command needs -D to pass solr.server.url:

  bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter -normalize -deleteGone
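As an aside, a quick sanity check that might help anyone following along: confirm Solr actually answers at the URL you pass with -D before running the index step. The core name "nutch" below is only an example from a local setup, not something the tutorial specifies -- substitute whichever core you created.

  # rough sketch: check that Solr responds at the URL used for -Dsolr.server.url
  # ("nutch" is an assumed example core name -- replace it with your own)
  curl -sf "http://localhost:8983/solr/nutch/select?q=*:*&rows=0" > /dev/null \
    && echo "Solr core reachable" \
    || echo "Solr core NOT reachable -- check the URL and core name first"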
but dedup does not -- instead it just takes the crawldb:

  bin/nutch dedup crawl/crawldb

The clean command needs both the -D and the crawldb:

  bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb

and lastly the crawl example is missing the -s before urls:

  bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ -s urls/ TestCrawl/ 2

This appears to pass the tutorial steps on the GitHub 1.14 snapshot, however ... it ends badly:

  Indexing 20170721210807 to index
  /home/ubuntu/apache-nutch-1.14-SNAPSHOT/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20170721210807
  Segment dir is complete: TestCrawl/segments/20170721210807.
  Indexer: starting at 2017-07-21 21:08:26
  Indexer: deleting gone documents: false
  Indexer: URL filtering: false
  Indexer: URL normalizing: false
  Active IndexWriters :
  SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
  Indexing 4/4 documents
  Deleting 0 documents
  Indexing 4/4 documents
  Deleting 0 documents
  Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
  Error running:
    /home/ubuntu/apache-nutch-1.14-SNAPSHOT/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ TestCrawl//crawldb -linkdb TestCrawl//linkdb TestCrawl//segments/20170721210807
  Failed with exit value 255.

So close! Any and all guidance is eagerly welcome :)
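If it helps, here is a rough debugging sketch for digging out the underlying cause of that exit 255, assuming the default runtime layout where Nutch logs to logs/hadoop.log:

  # the console output hides the real exception; with the default log4j setup
  # the full stack trace from the SOLRIndexWriter usually ends up in logs/hadoop.log
  tail -n 100 logs/hadoop.log

  # or jump straight to the indexer/Solr errors
  grep -n -B 2 -A 20 -E "SolrException|SolrServerException|IndexingJob" logs/hadoop.log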
On Fri, Jul 21, 2017 at 1:29 PM, Gary Murphy <[email protected]> wrote:

> One hiccup: the invertlinks step fails because /home/ubuntu/apache-nutch-1.14-SNAPSHOT/crawl/segments/20170721202124/parse_data did not exist -- creating that dir manually allows the step-by-step to continue.
>
> On Fri, Jul 21, 2017 at 1:25 PM, Gary Murphy <[email protected]> wrote:
>
>> What has turned out to be a better solution is what I should have done the first time: build from sources ;)
>>
>> Using the ant tar-bin installation package and the same urls/seed.txt file, everything goes fine.
>>
>> Which suggests there may be a problem in the 1.13 bin distribution file? Or, more likely, my novice fingers typed the wrong thing in some xml file and messed it up :)
>>
>> On Fri, Jul 21, 2017 at 10:56 AM, Edward Capriolo <[email protected]> wrote:
>>
>>> I have run into this. The nutch shell scripts do not return error status, so you assume bin/crawl has done something when in truth it failed. Sometimes the best way is to determine whether you need a plugin.xml file and what its content should be. Possibly put a blank xml file in its place and see if the error changes.
>>>
>>> On Fri, Jul 21, 2017 at 12:29 PM, Gary Murphy <[email protected]> wrote:
>>>
>>> > This is a little embarrassing: I'm stuck on the very first step of the new-user installation.
>>> >
>>> > I have apache-nutch-1.13-bin.tar on Ubuntu 16.04 using Oracle Java 8, following the wiki.apache.org/nutch/NutchTutorial (*) with a urls/seed.txt file that contains only http://www.hunchmanifest.com
>>> >
>>> > but I get zero urls injected:
>>> >
>>> > $ nutch inject crawl/crawldb urls
>>> > Injector: starting at 2017-07-21 16:17:18
>>> > Injector: crawlDb: crawl/crawldb
>>> > Injector: urlDir: urls
>>> > Injector: Converting injected urls to crawl db entries.
>>> > Injector: Total urls rejected by filters: 0
>>> > Injector: Total urls injected after normalization and filtering: 0
>>> > Injector: Total urls injected but already in CrawlDb: 0
>>> > Injector: Total new urls injected: 0
>>> > Injector: finished at 2017-07-21 16:17:20, elapsed: 00:00:01
>>> >
>>> > I've tried other URLs and none are excluded by the regex rules (but the output above doesn't list any rejects either).
>>> >
>>> > What could be wrong with my installation? There's nothing suspicious in the logs other than warnings for plugins not found:
>>> >
>>> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
>>> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: urlDir: urls
>>> > 2017-07-21 14:38:49,966 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>>> > 2017-07-21 14:38:50,047 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> > 2017-07-21 14:38:50,762 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml`
>>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/parse-replace/plugin.xml (No such file or directory)
>>> > 2017-07-21 14:38:50,775 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/plugin/plugin.xml`
>>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/plugin/plugin.xml (No such file or directory)
>>> > 2017-07-21 14:38:50,791 WARN plugin.PluginRepository - Error while loading plugin `/opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml`
>>> > java.io.FileNotFoundException: /opt/apache-nutch-1.13/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> > 2017-07-21 14:38:50,861 WARN mapred.LocalJobRunner - job_local540893461_0001
>>> > java.lang.Exception: java.lang.NullPointerException
>>> >     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>> >
>>> > That last error may be a catch-all, or could be because there are zero urls in the database. I get the same behavior if I run the crawl script.
>>> >
>>> > The crawldb has no files, although the directories are created. Could there be an environment variable needed? I am running from the nutch install directory with write permissions on all files. Is there something I've overlooked? Is there a -D debug switch I can use to gather more information?
>>> >
>>> > (* also, the content.rdf.u8.gz sample file cited in the NutchTutorial page no longer exists; DMOZ is shut down and the archive site preserves the original link that is now a 404)
>>> >
>>>
>>
>>
>

