Hi all,
Still wrestling with this.
Yossi: I checked the Solr parameters and they look OK.
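(To be concrete: by "Solr parameters" I mean the solr.* properties in
nutch-site.xml. Mine boil down to something like the sketch below. The
property name is the one the SOLRIndexWriter help text further down
prints; the value is an assumption for a local Solr, and some setups may
need the core name appended, e.g. .../solr/nutch. After editing the file,
remember Yossi's advice to run `ant runtime` again so the runtime picks
it up.)

  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr</value>
    <description>URL of the Solr instance to index into</description>
  </property>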
For the copy command in the tutorial,

  cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml \
     ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf

I used as the source the managed-schema file from the Jira task Lewis
mentioned (NUTCH-2400, linked below).
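(In case it helps anyone reproduce this: that copy is one step of the
core setup. The whole sequence I believe the tutorial intends is roughly
the sketch below. This is from memory and assumes the stock Solr 6.6.0
layout and the basic_configs configset, so double-check the paths and
configset name against your install.)

  # copy the stock configset as a starting point for a "nutch" core
  cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs \
        ${APACHE_SOLR_HOME}/server/solr/configsets/nutch
  # drop in the Nutch schema
  cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml \
     ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf
  # with the default managed-schema factory, Solr only bootstraps from
  # schema.xml when no managed-schema file is present
  rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
  # create the core from that config directory
  ${APACHE_SOLR_HOME}/bin/solr create -c nutch \
     -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf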
Anyway, the index command I finally used was the one in the tutorial plus
the argument -Dsolr.server.url=http://localhost:8983/solr, since omitting
it results in the error

  Indexer: java.io.IOException: No FileSystem for scheme: http

(presumably the positional Solr URL in the tutorial command is treated as
a Hadoop input path, and Hadoop has no FileSystem for the http scheme;
that is exactly the failure in my earlier mail quoted below). With the
property set, it now advances farther:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr \
  crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments \
  -filter -normalize -deleteGone

Segment dir is complete: file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727171114.
Segment dir is complete: file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727170952.
Segment dir is complete: file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727173137.
Indexer: starting at 2017-07-31 10:23:12
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

At least it indexed 500 documents this time before crashing again. Has
nobody else run the tutorial all the way through?

thanks,
pau

On Thu, Jul 13, 2017 at 1:00 AM, Yossi Tamari <[email protected]> wrote:
> Hi Pau,
>
> I think the tutorial is still not fully up-to-date:
> if you haven't, you should update the solr.* properties in nutch-site.xml
> (and run `ant runtime` again to update the runtime).
> Then the command for the tutorial should be:
>
> bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/
> -filter -normalize -deleteGone
>
> The -dir parameter should save you the need to run `index` for each
> segment. I'm not sure whether you need the final three parameters; it
> depends on your use case.
>
> -----Original Message-----
> From: Pau Paches [mailto:[email protected]]
> Sent: 12 July 2017 23:48
> To: [email protected]
> Subject: Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Lewis et al.,
> I have followed the new tutorial.
> In step "Step-by-Step: Indexing into Apache Solr", the command
>
> bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
>
> should be run for each segment directory (there are 3), I guess, but for
> the first segment it fails:
>
> Indexer: java.io.IOException: No FileSystem for scheme: http
>         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>         at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
>
> thanks,
> pau
>
> On 7/12/17, Pau Paches <[email protected]> wrote:
> > Hi Lewis,
> > Just trying the tutorial again. Doing the third round, it's taking
> > much longer than the other two.
> >
> > What's this schema for?
> > Does the version of Nutch that we run have to have this new schema for
> > compatibility with Solr 6.6.0?
> > Or can we use Nutch 1.13?
> > thanks,
> > pau
> >
> > On 7/12/17, lewis john mcgibbney <[email protected]> wrote:
> >> Hi Folks,
> >> I just updated the tutorial below; if you find any discrepancies
> >> please let me know.
> >>
> >> https://wiki.apache.org/nutch/NutchTutorial
> >>
> >> Also, I have made available a new schema.xml which is compatible with
> >> Solr 6.6.0 at
> >>
> >> https://issues.apache.org/jira/browse/NUTCH-2400
> >>
> >> Please scope it out and let me know what happens.
> >> Thank you
> >> Lewis
> >>
> >> On Wed, Jul 12, 2017 at 6:58 AM, <[email protected]> wrote:
> >>
> >>>
> >>> From: Pau Paches [mailto:[email protected]]
> >>> Sent: Tuesday, July 11, 2017 2:50 PM
> >>> To: [email protected]
> >>> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
> >>>
> >>> Hi Rashmi,
> >>> I have followed your suggestions.
> >>> Now I'm seeing a different error:
> >>>
> >>> bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb
> >>> crawl/linkdb crawl/segments
> >>>
> >>> The input path at segments is not a segment... skipping
> >>> Indexer: starting at 2017-07-11 20:45:56
> >>> Indexer: deleting gone documents: false
> >>
> >> ...
> >
> >
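P.S. If anyone is on a Nutch build where the -dir option of `index` is
unavailable or misbehaves, the per-segment equivalent of Yossi's command
above is a plain shell loop (a sketch, reusing the crawl layout and
solr.server.url value from earlier in this thread):

  for seg in crawl/segments/*; do
    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr \
      crawl/crawldb/ -linkdb crawl/linkdb/ "$seg" \
      -filter -normalize -deleteGone
  done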


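P.P.S. To see how many documents actually made it into Solr before the
crash, a standard select query reports the total (this assumes the core
is named nutch; substitute your core name):

  curl 'http://localhost:8983/solr/nutch/select?q=*:*&rows=0'
  # numFound in the response is the indexed document count;
  # rows=0 skips returning the documents themselves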