Re: nutch 1.x tutorial with solr 6.6.0

Pau Paches Tue, 11 Jul 2017 11:50:48 -0700

Hi Rashmi,
I have followed your suggestions.
Now I'm seeing a different error.
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb
crawl/linkdb crawl/segments
The input path at segments is not a segment... skipping
Indexer: starting at 2017-07-11 20:45:56
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication



Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Still I see the disturbing warning
The input path at segments is not a segment... skipping.

And it crashes.
If it had not crashed, the tutorial would ask me to execute
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
-deleteGone
which seems redundant with the solrindex command.
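
A note on the warning, as a guess rather than a confirmed diagnosis: crawl/segments is the parent directory, while the indexer seems to expect individual segment directories such as crawl/segments/20131108063838. Below is a minimal shell sketch for picking the newest segment. The function name latest_segment is made up for illustration, and it assumes segment names are timestamps, so that sorting by name also sorts by time.

```shell
#!/bin/sh
# Print the newest segment directory under a segments parent directory.
# Assumption: segment names are timestamps (e.g. 20131108063838), so
# lexical order matches chronological order.
# latest_segment is a hypothetical helper, not part of Nutch.
latest_segment() {
  segments_dir=${1%/}
  for d in "$segments_dir"/*/; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -d "$d" ] && printf '%s\n' "${d%/}"
  done | sort | tail -n 1
}
```

With that, the call might look like
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb -linkdb crawl/linkdb "$(latest_segment crawl/segments)"
As far as I can tell, the Nutch 1.x usage message also lists a -dir <segments> form that accepts the parent directory instead.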

I think this is the way to go, but still something is missing.

Thanks,
pau

On 7/11/17, Srinivasa, Rashmi <[email protected]> wrote:
> Hi Pau,
>
> Yes, it took me a while to get things working because the tutorial is not
> complete or up to date.
>
> In conf/nutch-site.xml, the value for plugin.includes uses indexer-elastic
> by default. If you want to use SOLR, you'll have to change it to
> indexer-solr.
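
The change described above can be sketched in shell. This assumes the plugin.includes value in the file contains the literal text indexer-elastic. The function name use_solr_indexer and the .bak backup suffix are illustrative choices, not part of Nutch.

```shell
#!/bin/sh
# Swap the Elasticsearch indexer plugin for the Solr one in a Nutch
# configuration file, keeping a backup copy next to it.
# Assumption: the plugin.includes value contains "indexer-elastic".
# use_solr_indexer is a hypothetical helper, not a Nutch command.
use_solr_indexer() {
  conf_file=$1
  cp "$conf_file" "$conf_file.bak" &&
    sed 's/indexer-elastic/indexer-solr/g' "$conf_file.bak" > "$conf_file"
}
```

It would be called as use_solr_indexer conf/nutch-site.xml from the Nutch directory. Checking the result by hand afterwards is advisable, since the property may be written differently in other setups.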
>
> I haven't tried SOLR 6.6, but this is what I did in SOLR 5:
> 1. bin/solr create -c name_of_my_core -d basic_configs
> 2. bin/solr stop -all
> 3. Copy schema.xml from the nutch_directory/conf to
> server/solr/name_of_my_core/conf/
> 4. In schema.xml:
> * Search for all enablePositionIncrements="true" in the file and remove
> them.
> * Change <uniqueKey>id</uniqueKey> to <uniqueKey>url</uniqueKey>
> 5. bin/solr start
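
Step 4 above can be sketched with sed. This is only an illustration of the two edits listed there, run against a copy of schema.xml. The function name patch_schema and the .bak suffix are made up, and the sketch assumes the attribute and the uniqueKey element appear in the file exactly as quoted.

```shell
#!/bin/sh
# Apply the two schema.xml edits from step 4, keeping a backup copy:
#   1. drop every enablePositionIncrements="true" attribute
#   2. change the uniqueKey from id to url
# patch_schema is a hypothetical helper, not part of Nutch or Solr.
patch_schema() {
  schema=$1
  cp "$schema" "$schema.bak" &&
    sed -e 's/[[:space:]]*enablePositionIncrements="true"//g' \
        -e 's|<uniqueKey>id</uniqueKey>|<uniqueKey>url</uniqueKey>|' \
        "$schema.bak" > "$schema"
}
```

For example: patch_schema server/solr/name_of_my_core/conf/schema.xml, run while Solr is stopped (between steps 2 and 5).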
>
> Thanks,
> Rashmi
>
> -----Original Message-----
> From: Pau Paches [mailto:[email protected]]
> Sent: Tuesday, July 11, 2017 8:46 AM
> To: [email protected]
> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Yossi and BlackIce,
> many thanks for your tips. However, a tutorial needs to be self-contained,
> or at least link to the documentation/tutorial on how to configure the parts
> it uses.
>
>
> On Tue, Jul 11, 2017 at 1:39 PM BlackIce <[email protected]> wrote:
>
>> I think by default the newer SOLR starts in "schemaless" mode. One
>> needs to create a config directory with ALL necessary configuration
>> files, like the schema and solrconfig.xml, BEFORE creating the collection,
>> and then run a command to create this collection using this conf
>> directory. I don't have access to my Nutch set-up at this moment, so I
>> can't check, but this was explained in the SOLR docs.
>>
>> On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari <[email protected]>
>> wrote:
>>
>> > I struggled with this as well. Eventually I moved to ElasticSearch,
>> > which is much easier.
>> >
>> > What I did manage to find out is that in newer versions of SOLR you
>> > need to use ZooKeeper to update the conf file. See
>> > https://stackoverflow.com/a/43351358.
>> >
>> > -----Original Message-----
>> > From: Pau Paches [mailto:[email protected]]
>> > Sent: 11 July 2017 13:29
>> > To: [email protected]
>> > Subject: Re: nutch 1.x tutorial with solr 6.6.0
>> >
>> > Hi,
>> > I only crawl a single URL, so no whole-web crawling.
>> > So I do option 2: fetching and invertlinks run successfully. This is just
>> > Nutch 1.x. Then I do "Indexing into Apache Solr", so I go to the section
>> > "Setup Solr for search".
>> > First thing that does not work:
>> > cd ${APACHE_SOLR_HOME}/example
>> > java -jar start.jar
>> > There is no start.jar at the specified location, but no problem: you
>> > start Solr 6.6.0 with bin/solr start.
>> > Then the tutorial says:
>> > Backup the original Solr example schema.xml:
>> > mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
>> > ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
>> >
>> > But in current Solr, 6.6.0, there is no schema.xml file in the
>> > whole distribution. What should I do here?
>> > If I go directly to run the Solr Index command from
>> > ${NUTCH_RUNTIME_HOME}:
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
>> > -linkdb crawl/linkdb crawl/segments/
>> > which may not make sense since I have skipped some steps, it crashes:
>> > The input path at segments is not a segment... skipping
>> > Indexer: java.lang.RuntimeException: Missing elastic.cluster and
>> > elastic.host. At least one of them should be set in nutch-site.xml
>> > ElasticIndexWriter
>> >         elastic.cluster : elastic prefix cluster
>> >         elastic.host : hostname
>> >         elastic.port : port
>> >
>> > Clearly some configuration is missing in nutch-site.xml: apart from
>> > setting http.agent.name (which the tutorial mentions), other
>> > fields need to be set up. The segments message above is also troubling.
>> >
>> > If you follow the steps (if they worked), should we run
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
>> > -linkdb crawl/linkdb crawl/segments/
>> > (this is the last step in Integrate Solr with Nutch) and then
>> >
>> > bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
>> > crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
>> > -deleteGone
>> > (this is one of the steps of Using Individual Commands for Whole-Web
>> > Crawling, which in fact is also the section to read if you are only
>> > crawling a single URL)?
>> >
>> > This is what I found by following the tutorial at
>> > https://wiki.apache.org/nutch/NutchTutorial
>> >
>> > On 7/9/17, lewis john mcgibbney <[email protected]> wrote:
>> > > Hi Pau,
>> > >
>> > > On Sat, Jul 8, 2017 at 6:52 AM,
>> > > <[email protected]> wrote:
>> > >
>> > >> From: Pau Paches <[email protected]>
>> > >> To: [email protected]
>> > >> Cc:
>> > >> Bcc:
>> > >> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> > >> Subject: nutch 1.x tutorial with solr 6.6.0
>> > >>
>> > >> Hi, I have run the Nutch 1.x Tutorial with Solr 6.6.0.
>> > >> Many things do not work,
>> > >
>> > >
>> > > What does not work? Can you elaborate?
>> > >
>> > >
>> > >> there is a mismatch between the assumed Solr version and the
>> > >> current Solr version.
>> > >>
>> > >
>> > > We support Solr as an indexing backend in the broadest sense
>> > > possible. We do not aim to support the latest and greatest Solr
>> > > version available. If you are interested in upgrading to a particular
>> > > version, it would be excellent if you could open a JIRA issue and
>> > > provide a pull request.
>> > >
>> > >
>> > >> I have seen some messages about the same problem for Solr 4.x. Is
>> > >> this the right path to go or should I move to Nutch 2.x?
>> > >
>> > >
>> > > If you are new to Nutch, I would highly advise that you stick with
>> > > 1.X
>> > >
>> > >
>> > >> Does it
>> > >> make sense to use Solr 6.6 with Nutch 1.x?
>> > >
>> > >
>> > > Yes... you _may_ have a few configuration options to tweak, but
>> > > there have been no backwards incompatibility issues, so I see no
>> > > reason for anything to be broken.
>> > >
>> > >
>> > >> If yes, I'm willing to
>> > >> amend the tutorial if someone helps.
>> > >>
>> > >>
>> > > What is broken? Can you elaborate?
>> > >
>> >
>> >
>>
>
