RE: nutch 1.x tutorial with solr 6.6.0

Srinivasa, Rashmi Tue, 11 Jul 2017 12:34:17 -0700

Hi Pau,

I have not used the solrindex command, but from the "input path" error message, 
it sounds like it wants the actual segment directory under segments/.


The nutch crawl script uses the following commands:
* inject
* generate
* fetch
* parse
* updatedb
* invertlinks
* dedup
* index
* clean

E.g., this is the nutch index command in my environment:
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/name_of_my_core 
my_crawl_name/crawldb -linkdb my_crawl_name/linkdb 
my_crawl_name/segments/20170710131518
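
A minimal sketch of passing the actual segment directory instead of segments/ itself. It assumes segment names are 14-digit timestamps like the one above, so the last name in sorted order is the newest. The helper names latest_segment and print_index_command are hypothetical, and the script only prints the index command rather than running it:

```shell
#!/bin/sh
# Hypothetical helper: print the newest segment directory under segments/.
# Assumes timestamp-style names (e.g. 20170710131518), so lexicographic
# order matches chronological order.
latest_segment() {
    segments_dir=$1
    newest=""
    for d in "$segments_dir"/[0-9]*; do
        [ -d "$d" ] || continue   # skip non-directories and an unmatched glob
        newest=$d                 # glob results are sorted; the last match wins
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Hypothetical helper: dry run that prints the index command for a crawl
# directory laid out as crawldb/, linkdb/ and segments/.
print_index_command() {
    crawl_dir=$1
    solr_url=$2
    segment=$(latest_segment "$crawl_dir/segments") || return 1
    printf 'bin/nutch index -Dsolr.server.url=%s %s/crawldb -linkdb %s/linkdb %s\n' \
        "$solr_url" "$crawl_dir" "$crawl_dir" "$segment"
}

# Example call (prints nothing and returns 1 if no segment exists):
# print_index_command my_crawl_name http://localhost:8983/solr/name_of_my_core
```

Printing the command first makes it easy to check the segment path before running the real index step.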

Thanks,
Rashmi

-----Original Message-----
From: Pau Paches [mailto:[email protected]] 
Sent: Tuesday, July 11, 2017 2:50 PM
To: [email protected]
Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0

Hi Rashmi,
I have followed your suggestions.
Now I'm seeing a different error.
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb crawl/linkdb crawl/segments
The input path at segments is not a segment... skipping
Indexer: starting at 2017-07-11 20:45:56
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Still I see the disturbing warning
The input path at segments is not a segment... skipping.

And it crashes.
If it had not crashed, the tutorial would next ask me to execute
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
which seems redundant with the solrindex command.

I think this is the way to go, but still something is missing.

Thanks,
pau

On 7/11/17, Srinivasa, Rashmi <[email protected]> wrote:
> Hi Pau,
>
> Yes, it took me a while to get things working because the tutorial is 
> not complete or up to date.
>
> In conf/nutch-site.xml, the value for plugin.includes uses 
> indexer-elastic by default. If you want to use SOLR, you'll have to 
> change it to indexer-solr.
>
> I haven't tried SOLR 6.6, but this is what I did in SOLR 5:
> 1. bin/solr create -c name_of_my_core -d basic_configs
> 2. bin/solr stop -all
> 3. Copy schema.xml from the nutch_directory/conf to
>    server/solr/name_of_my_core/conf/
> 4. In schema.xml:
>    * Search for all enablePositionIncrements="true" in the file and
>      remove them.
>    * Change <uniqueKey>id</uniqueKey> to <uniqueKey>url</uniqueKey>
> 5. bin/solr start
>
> Thanks,
> Rashmi
>
> -----Original Message-----
> From: Pau Paches [mailto:[email protected]]
> Sent: Tuesday, July 11, 2017 8:46 AM
> To: [email protected]
> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Yossi and BlackIce,
> many thanks for your tips. However, a tutorial needs to be 
> self-contained, or at least link to the documentation/tutorial on how 
> to configure the parts it uses.
>
>
> On Tue, Jul 11, 2017 at 1:39 PM BlackIce <[email protected]> wrote:
>
>> I think by default the newer SOLR starts in "schemaless" mode. One
>> needs to create a config directory with ALL the necessary configuration
>> files, like the schema and solrconfig.xml, BEFORE creating the
>> collection, and then run a command to create the collection using this
>> conf directory. I don't have access to my nutch set-up at the moment,
>> so I can't check, but this was explained in the SOLR docs.
>>
>> On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari <[email protected]> wrote:
>>
>> > I struggled with this as well. Eventually I moved to ElasticSearch, 
>> > which is much easier.
>> >
>> > What I did manage to find out is that in newer versions of SOLR
>> > you need to use ZooKeeper to update the conf file. See
>> > https://stackoverflow.com/a/43351358.
>> >
>> > -----Original Message-----
>> > From: Pau Paches [mailto:[email protected]]
>> > Sent: 11 July 2017 13:29
>> > To: [email protected]
>> > Subject: Re: nutch 1.x tutorial with solr 6.6.0
>> >
>> > Hi,
>> > I just crawl a single URL, so no whole-web crawling.
>> > So I do option 2, fetching, invertlinks successfully. This is just
>> > Nutch 1.x. Then I do Indexing into Apache Solr, so I go to the
>> > section Setup Solr for search.
>> > First thing that does not work:
>> > cd ${APACHE_SOLR_HOME}/example
>> > java -jar start.jar
>> > There is no start.jar at the specified location, but no problem: you
>> > can start Solr 6.6.0 with bin/solr start.
>> > Then the tutorial says:
>> > Backup the original Solr example schema.xml:
>> > mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
>> > ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
>> >
>> > But in current Solr, 6.6.0, there is no schema.xml file anywhere in
>> > the distribution. What should I do here?
>> > If I go directly to running the Solr Index command from ${NUTCH_RUNTIME_HOME}:
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
>> > (which may not make sense since I have skipped some steps), it crashes:
>> > The input path at segments is not a segment... skipping
>> > Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. At least one of them should be set in nutch-site.xml
>> > ElasticIndexWriter
>> >         elastic.cluster : elastic prefix cluster
>> >         elastic.host : hostname
>> >         elastic.port : port
>> >
>> > Clearly some configuration is missing in nutch-site.xml: apart from
>> > setting http.agent.name (which the tutorial mentions), other fields
>> > need to be set up. The segments message above is also troubling.
>> >
>> > If you follow the steps (assuming they worked), should we run
>> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
>> > (this is the last step in Integrate Solr with Nutch) and then
>> >
>> > bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
>> > (this is one of the steps of Using Individual Commands for
>> > Whole-Web Crawling, which in fact is also the section to read if
>> > you are only crawling a single URL)?
>> >
>> > This is what I found by following the tutorial at 
>> > https://wiki.apache.org/nutch/NutchTutorial
>> >
>> > On 7/9/17, lewis john mcgibbney <[email protected]> wrote:
>> > > Hi Pau,
>> > >
>> > > On Sat, Jul 8, 2017 at 6:52 AM, <[email protected]> wrote:
>> > >
>> > >> From: Pau Paches <[email protected]>
>> > >> To: [email protected]
>> > >> Cc:
>> > >> Bcc:
>> > >> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> > >> Subject: nutch 1.x tutorial with solr 6.6.0
>> > >>
>> > >> Hi, I have run the Nutch 1.x Tutorial with Solr 6.6.0.
>> > >> Many things do not work,
>> > >
>> > >
>> > > What does not work? Can you elaborate?
>> > >
>> > >
>> > >> there is a mismatch between the assumed Solr version and the 
>> > >> current Solr version.
>> > >>
>> > >
>> > > We support Solr as an indexing backend in the broadest sense
>> > > possible. We do not aim to support the latest and greatest Solr
>> > > version available. If you are interested in upgrading to a
>> > > particular version, if you could open a JIRA issue and provide a
>> > > pull request it would be excellent.
>> > >
>> > >
>> > >> I have seen some messages about the same problem for Solr 4.x. Is
>> > >> this the right path to go or should I move to Nutch 2.x?
>> > >
>> > >
>> > > If you are new to Nutch, I would highly advise that you stick 
>> > > with 1.X
>> > >
>> > >
>> > >> Does it
>> > >> make sense to use Solr 6.6 with Nutch 1.x?
>> > >
>> > >
>> > > Yes... you _may_ have a few configuration options to tweak but
>> > > there have been no backwards incompatibility issues so I see no
>> > > reason for anything to be broken.
>> > >
>> > >
>> > >> If yes, I'm willing to
>> > >> amend the tutorial if someone helps.
>> > >>
>> > >>
>> > > What is broken? Can you elaborate?
>> > >
>> >
>> >
>>
>
