Hi Pau,

Yes, it took me a while to get things working because the tutorial is not complete or up to date.
In conf/nutch-site.xml, the value for plugin.includes uses indexer-elastic by default. If you want to use SOLR, you'll have to change it to indexer-solr.

I haven't tried SOLR 6.6, but this is what I did in SOLR 5:

1. bin/solr create -c name_of_my_core -d basic_configs
2. bin/solr stop -all
3. Copy schema.xml from nutch_directory/conf to server/solr/name_of_my_core/conf/
4. In schema.xml:
   * Search for every occurrence of enablePositionIncrements="true" in the file and remove it.
   * Change <uniqueKey>id</uniqueKey> to <uniqueKey>url</uniqueKey>
5. bin/solr start

Thanks,
Rashmi

-----Original Message-----
From: Pau Paches [mailto:[email protected]]
Sent: Tuesday, July 11, 2017 8:46 AM
To: [email protected]
Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0

Hi Yossi and BlackIce,

many thanks for your tips. However, a tutorial needs to be self-contained, or at least link to the documentation/tutorial on how to configure the parts it uses.

On Tue, Jul 11, 2017 at 1:39 PM BlackIce <[email protected]> wrote:
> I think by default the newer SOLR starts in "schemaless" mode. One
> needs to create a config directory with ALL necessary configuration
> files like schema and solrconfig.xml BEFORE creating the collection and
> then run a command to create this collection using this conf
> directory. I don't have access to my nutch set-up at this moment, so I
> can't check, but this was explained in the SOLR docs.
>
> On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari <[email protected]> wrote:
>
> > I struggled with this as well. Eventually I moved to ElasticSearch,
> > which is much easier.
> >
> > What I did manage to find out is that in newer versions of SOLR you
> > need to use ZooKeeper to update the conf file. See
> > https://stackoverflow.com/a/43351358.
> >
> > -----Original Message-----
> > From: Pau Paches [mailto:[email protected]]
> > Sent: 11 July 2017 13:29
> > To: [email protected]
> > Subject: Re: nutch 1.x tutorial with solr 6.6.0
> >
> > Hi,
> > I just crawl a single URL so no whole web crawling.
> > So I do option 2, fetching, invertlinks successfully. This is just
> > Nutch 1.x. Then I do Indexing into Apache Solr so go to section Setup
> > Solr for search.
> > First thing that does not work:
> > cd ${APACHE_SOLR_HOME}/example
> > java -jar start.jar
> > No start.jar at the specified location, but no problem, you start Solr
> > 6.6.0 with bin/solr start.
> > Then the tutorial says:
> > Backup the original Solr example schema.xml:
> > mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
> > ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
> >
> > But in current Solr, 6.6.0, there is no schema.xml file. In the
> > whole distribution. What should I do here?
> > if I go directly to run the Solr Index command from ${NUTCH_RUNTIME_HOME}:
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > -linkdb crawl/linkdb crawl/segments/
> > which may not make sense since I have skipped some steps, it crashes:
> > The input path at segments is not a segment... skipping
> > Indexer: java.lang.RuntimeException: Missing elastic.cluster and
> > elastic.host. At least one of them should be set in nutch-site.xml
> > ElasticIndexWriter
> > elastic.cluster : elastic prefix cluster
> > elastic.host : hostname
> > elastic.port : port
> >
> > Clearly there is some missing configuration in nutch-site.xml, apart
> > from setting http.agent.name in nutch-site.xml (mentioned) other
> > fields need to be set up. The segments message above is also troubling.
> >
> > If you follow the steps (if they worked) should we run
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb
> > crawl/segments/ (this is the last step in Integrate Solr with Nutch)
> > and then
> >
> > bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
> > crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
> > (this is one of the steps of Using Individual Commands for Whole-Web
> > Crawling, which in fact also is the section to read if you are only
> > crawling a URL.
> >
> > This is what I found by following the tutorial at
> > https://wiki.apache.org/nutch/NutchTutorial
> >
> > On 7/9/17, lewis john mcgibbney <[email protected]> wrote:
> > > Hi Pau,
> > >
> > > On Sat, Jul 8, 2017 at 6:52 AM, <[email protected]> wrote:
> > >
> > >> From: Pau Paches <[email protected]>
> > >> To: [email protected]
> > >> Cc:
> > >> Bcc:
> > >> Date: Sat, 8 Jul 2017 15:52:46 +0200
> > >> Subject: nutch 1.x tutorial with solr 6.6.0
> > >>
> > >> Hi, I have run the Nutch 1.x Tutorial with Solr 6.6.0.
> > >> Many things do not work,
> > >
> > > What does not work? Can you elaborate?
> > >
> > >> there is a mismatch between the assumed Solr version and the
> > >> current Solr version.
> > >
> > > We support Solr as an indexing backend in the broadest sense possible. We
> > > do not aim to support the latest and greatest Solr version available. If
> > > you are interested in upgrading to a particular version, if you could open
> > > a JIRA issue and provide a pull request it would be excellent.
> > >
> > >> I have seen some messages about the same problem for Solr 4.x. Is
> > >> this the right path to go or should I move to Nutch 2.x?
> > >
> > > If you are new to Nutch, I would highly advise that you stick with 1.X
> > >
> > >> Does it
> > >> make sense to use Solr 6.6 with Nutch 1.x?
> > >
> > > Yes...
> > > you _may_ have a few configuration options to tweak but there have
> > > been no backwards incompatibility issues so I see no reason for anything to
> > > be broken.
> > >
> > >> If yes, I'm willing to
> > >> amend the tutorial if someone helps.
> > >
> > > What is broken? Can you elaborate?
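For reference, the SOLR 5 steps listed at the top of this message can be sketched as a small POSIX shell script. This is only an illustration under stated assumptions: the function names patch_schema and setup_core and their arguments are hypothetical, the directory layout is the one described in the steps, and like the steps themselves it is untested against SOLR 6.6. The bin/solr commands are exactly the ones quoted in the steps.

```shell
#!/bin/sh
# Sketch of the SOLR 5 setup steps from this thread.
# patch_schema and setup_core are hypothetical helper names, not part of
# Nutch or Solr.

# patch_schema FILE
# Step 4: drop every enablePositionIncrements="true" attribute and
# switch the unique key from id to url. Writes via a temp file because
# in-place sed flags differ between platforms.
patch_schema() {
    ps_schema="$1"
    ps_tmp="$ps_schema.tmp"
    sed -e 's/ *enablePositionIncrements="true"//g' \
        -e 's|<uniqueKey>id</uniqueKey>|<uniqueKey>url</uniqueKey>|' \
        "$ps_schema" > "$ps_tmp" && mv "$ps_tmp" "$ps_schema"
}

# setup_core SOLR_DIR NUTCH_DIR CORE_NAME
# Steps 1-5 in order; stops at the first failing step.
setup_core() {
    sc_solr="$1"
    sc_nutch="$2"
    sc_core="$3"
    sc_conf="$sc_solr/server/solr/$sc_core/conf"

    # Steps 1 and 2: create the core, then stop Solr.
    (cd "$sc_solr" && bin/solr create -c "$sc_core" -d basic_configs) || return 1
    (cd "$sc_solr" && bin/solr stop -all) || return 1

    # Step 3: copy the Nutch schema into the core's conf directory.
    cp "$sc_nutch/conf/schema.xml" "$sc_conf/" || return 1

    # Step 4: patch the copied schema.
    patch_schema "$sc_conf/schema.xml" || return 1

    # Step 5: start Solr again.
    (cd "$sc_solr" && bin/solr start)
}

# Example call (paths are placeholders):
#   setup_core /path/to/solr /path/to/nutch name_of_my_core
```

Keeping the schema edit in its own function makes it possible to run patch_schema against a copy of schema.xml first and inspect the result before touching a live core.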

