Hi Pau, I have not used the solrindex command, but from the "input path" error message, it sounds like it wants the actual segment directory under segments/.
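As a rough illustration of passing one concrete segment instead of the segments/ parent, here is a minimal shell sketch. It assumes segment directory names are timestamps (as in 20170710131518), so the lexically greatest name is the newest. The helper names latest_segment and print_index_command are hypothetical, not part of Nutch, and the command is only printed, not run.

```shell
#!/bin/sh
# Print the newest segment directory under the given segments/ directory.
# Assumption: segment names are timestamps (e.g. 20170710131518), so
# sorting the names lexically also sorts them chronologically.
latest_segment() {
  ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Hypothetical helper: print (do not run) an index command that points at
# the newest segment rather than at the segments/ parent directory.
#   $1 = crawl directory (contains crawldb, linkdb, segments)
#   $2 = Solr core URL
print_index_command() {
  crawl_dir=$1
  solr_url=$2
  segment=$(latest_segment "$crawl_dir/segments")
  if [ -z "$segment" ]; then
    echo "no segment found under $crawl_dir/segments" >&2
    return 1
  fi
  echo "bin/nutch index -Dsolr.server.url=$solr_url $crawl_dir/crawldb -linkdb $crawl_dir/linkdb $segment"
}
```

For example, `print_index_command crawl http://localhost:8983/solr/name_of_my_core` prints a command you can inspect before running it from the Nutch runtime directory.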
The nutch crawl script uses the following commands:
* inject
* generate
* fetch
* parse
* updatedb
* invertlinks
* dedup
* index
* clean

E.g., this is the nutch index command in my environment:
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/name_of_my_core my_crawl_name/crawldb -linkdb my_crawl_name/linkdb my_crawl_name/segments/20170710131518

Thanks,
Rashmi

-----Original Message-----
From: Pau Paches [mailto:[email protected]]
Sent: Tuesday, July 11, 2017 2:50 PM
To: [email protected]
Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0

Hi Rashmi,
I have followed your suggestions.
Now I'm seeing a different error.

bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb crawl/linkdb crawl/segments
The input path at segments is not a segment... skipping
Indexer: starting at 2017-07-11 20:45:56
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Still I see the disturbing warning
The input path at segments is not a segment... skipping.
And it crashes.
If it had not crashed, the tutorial would ask me to execute
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
which seems redundant with the solrindex command.
I think this is the way to go, but still something is missing.

Thanks,
pau

On 7/11/17, Srinivasa, Rashmi <[email protected]> wrote:
> Hi Pau,
>
> Yes, it took me a while to get things working because the tutorial is
> not complete or up to date.
>
> In conf/nutch-site.xml, the value for plugin.includes uses
> indexer-elastic by default. If you want to use SOLR, you'll have to
> change it to indexer-solr.
>
> I haven't tried SOLR 6.6, but this is what I did in SOLR 5:
> 1. bin/solr create -c name_of_my_core -d basic_configs
> 2. bin/solr stop -all
> 3. Copy schema.xml from the nutch_directory/conf to
>    server/solr/name_of_my_core/conf/
> 4. In schema.xml:
>    * Search for all enablePositionIncrements="true" in the file and
>      remove them.
>    * Change <uniqueKey>id</uniqueKey> to <uniqueKey>url</uniqueKey>
> 5. bin/solr start
>
> Thanks,
> Rashmi
>
> -----Original Message-----
> From: Pau Paches [mailto:[email protected]]
> Sent: Tuesday, July 11, 2017 8:46 AM
> To: [email protected]
> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Yossi and BlackIce,
> many thanks for your tips. However, a tutorial needs to be
> self-contained, or at least link to the documentation/tutorial on how
> to configure the parts it uses.
>
>
> On Tue, Jul 11, 2017 at 1:39 PM BlackIce <[email protected]> wrote:
>
>> I think by default the newer SOLR starts in "schemaless" mode.. One
>> needs to create a config directory with ALL necessary configuration
>> files like schema and solar.conf BEFORE creating the collection and
>> then run a command to create this collection using this conf
>> directory. I don't have access to my nutch set-up at this moment, so
>> I can't check.. but this was explained in the SOLR docs.
>>
>> On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari <[email protected]> wrote:
>>
>> > I struggled with this as well. Eventually I moved to ElasticSearch,
>> > which is much easier.
>> >
>> > What I did manage to find out, is that in newer versions of SOLR
>> > you need to use ZooKeeper to update the conf file. see
>> > https://stackoverflow.com/a/43351358.
>> >
>> > -----Original Message-----
>> > From: Pau Paches [mailto:[email protected]]
>> > Sent: 11 July 2017 13:29
>> > To: [email protected]
>> > Subject: Re: nutch 1.x tutorial with solr 6.6.0
>> >
>> > Hi,
>> > I just crawl a single URL so no whole web crawling.
>> > So I do option 2, fetching, invertlinks successfully. This is just
>> > Nutch 1.x Then I do Indexing into Apache Solr so go to section
>> > Setup Solr for search.
>> > First thing that does not work:
>> > cd ${APACHE_SOLR_HOME}/example
>> > java -jar start.jar
>> > No start.jar at the specified location, but no problem you start
>> > Solr 6.6.0 with bin/solr start.
>> > Then the tutorial says:
>> > Backup the original Solr example schema.xml:
>> > mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
>> > ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
>> >
>> > But in current Solr, 6.6.0, there is no schema.xml file. In the
>> > whole distribution. What should I do here?
>> > if I go directly to run the Solr Index command from
>> > ${NUTCH_RUNTIME_HOME}:
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
>> > -linkdb crawl/linkdb crawl/segments/ which may not make sense since
>> > I have skipped some steps, it crashes:
>> > The input path at segments is not a segment... skipping
>> > Indexer: java.lang.RuntimeException: Missing elastic.cluster and
>> > elastic.host.
>> > At least one of them should be set in nutch-site.xml
>> > ElasticIndexWriter
>> > elastic.cluster : elastic prefix cluster
>> > elastic.host : hostname
>> > elastic.port : port
>> >
>> > Clearly there is some missing configuration in nutch-site.xml,
>> > apart from setting http.agent.name in nutch-site.xml (mentioned)
>> > other fields need to be set up. The segments message above is also
>> > troubling.
>> >
>> > If you follow the steps (if they worked) should we run
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
>> > (this is the last step in Integrate Solr with Nutch) and then
>> >
>> > bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
>> > (this is one of the steps of Using Individual Commands for
>> > Whole-Web Crawling, which in fact also is the section to read if
>> > you are only crawling a URL.
>> >
>> > This is what I found by following the tutorial at
>> > https://wiki.apache.org/nutch/NutchTutorial
>> >
>> > On 7/9/17, lewis john mcgibbney <[email protected]> wrote:
>> > > Hi Pau,
>> > >
>> > > On Sat, Jul 8, 2017 at 6:52 AM, <[email protected]> wrote:
>> > >
>> > >> From: Pau Paches <[email protected]>
>> > >> To: [email protected]
>> > >> Cc:
>> > >> Bcc:
>> > >> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> > >> Subject: nutch 1.x tutorial with solr 6.6.0
>> > >> Hi, I have run the Nutch 1.x Tutorial with Solr 6.6.0.
>> > >> Many things do not work,
>> > >
>> > >
>> > > What does not work? Can you elaborate?
>> > >
>> > >
>> > >> there is a mismatch between the assumed Solr version and the
>> > >> current Solr version.
>> > >>
>> > >
>> > > We support Solr as an indexing backend in the broadest sense
>> > > possible. We do not aim to support the latest and greatest Solr
>> > > version available.
>> > > If you are interested in upgrading to a particular version, if you
>> > > could open a JIRA issue and provide a pull request it would be
>> > > excellent.
>> > >
>> > >
>> > >> I have seen some messages about the same problem for Solr 4.x Is
>> > >> this the right path to go or should I move to Nutch 2.x?
>> > >
>> > >
>> > > If you are new to Nutch, I would highly advise that you stick
>> > > with 1.X
>> > >
>> > >
>> > >> Does it
>> > >> make sense to use Solr 6.6 with Nutch 1.x?
>> > >
>> > >
>> > > Yes... you _may_ have a few configuration options to tweak but
>> > > there have been no backwards incompatibility issues so I see no
>> > > reason for anything to be broken.
>> > >
>> > >
>> > >> If yes, I'm willing to
>> > >> amend the tutorial if someone helps.
>> > >>
>> > >>
>> > > What is broken? Can you elaborate?
>> > >
>> >
>> >
>
> Confidentiality Notice:: This email, including attachments, may
> include non-public, proprietary, confidential or legally privileged
> information. If you are not an intended recipient or an authorized
> agent of an intended recipient, you are hereby notified that any
> dissemination, distribution or copying of the information contained in
> or transmitted with this e-mail is unauthorized and strictly
> prohibited. If you have received this email in error, please notify
> the sender by replying to this message and permanently delete this
> e-mail, its attachments, and any copies of it immediately. You should
> not retain, copy or use this e-mail or any attachment for any purpose, nor
> disclose all or any part of the contents to any other person.
> Thank you.
>

