Hi all,
Still wrestling with this.
Yossi: I checked the Solr parameters and they look OK.
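(To be concrete: by "Solr parameters" I mean the solr.* properties in
nutch-site.xml. Mine boil down to something like the sketch below. The
property name is the one the SOLRIndexWriter help text further down
prints; the value is an assumption for a local Solr, and some setups may
need the core name appended, e.g. .../solr/nutch. After editing the file,
remember Yossi's advice to run `ant runtime` again so the runtime picks
it up.)

  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr</value>
    <description>URL of the Solr instance to index into</description>
  </property>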
For the copy command in the tutorial,

  cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml \
     ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf

I used as the source the managed-schema file from the Jira task Lewis
mentioned (NUTCH-2400, linked below).
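(In case it helps anyone reproduce this: that copy is one step of the
core setup. The whole sequence I believe the tutorial intends is roughly
the sketch below. This is from memory and assumes the stock Solr 6.6.0
layout and the basic_configs configset, so double-check the paths and
configset name against your install.)

  # copy the stock configset as a starting point for a "nutch" core
  cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs \
        ${APACHE_SOLR_HOME}/server/solr/configsets/nutch
  # drop in the Nutch schema
  cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml \
     ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf
  # with the default managed-schema factory, Solr only bootstraps from
  # schema.xml when no managed-schema file is present
  rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
  # create the core from that config directory
  ${APACHE_SOLR_HOME}/bin/solr create -c nutch \
     -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf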
Anyway, the index command I finally used was the one in the tutorial plus
the argument -Dsolr.server.url=http://localhost:8983/solr, since omitting
it results in the error

  Indexer: java.io.IOException: No FileSystem for scheme: http

(presumably the positional Solr URL in the tutorial command is treated as
a Hadoop input path, and Hadoop has no FileSystem for the http scheme;
that is exactly the failure in my earlier mail quoted below). With the
property set, it now advances farther:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr \
  crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments \
  -filter -normalize -deleteGone

Segment dir is complete: file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727171114.
Segment dir is complete: file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727170952.
Segment dir is complete: file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727173137.
Indexer: starting at 2017-07-31 10:23:12
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance
        solr.zookeeper.hosts : URL of the Zookeeper quorum
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

At least it indexed 500 documents this time before crashing again. Has
nobody else run the tutorial all the way through?

thanks,
pau

On Thu, Jul 13, 2017 at 1:00 AM, Yossi Tamari <[email protected]> wrote:
> Hi Pau,
>
> I think the tutorial is still not fully up-to-date:
> if you haven't, you should update the solr.* properties in nutch-site.xml
> (and run `ant runtime` again to update the runtime).
> Then the command for the tutorial should be:
>
> bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/
> -filter -normalize -deleteGone
>
> The -dir parameter should save you the need to run `index` for each
> segment. I'm not sure whether you need the final three parameters; it
> depends on your use case.
>
> -----Original Message-----
> From: Pau Paches [mailto:[email protected]]
> Sent: 12 July 2017 23:48
> To: [email protected]
> Subject: Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Lewis et al.,
> I have followed the new tutorial.
> In step "Step-by-Step: Indexing into Apache Solr", the command
>
> bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
>
> should be run for each segment directory (there are 3), I guess, but for
> the first segment it fails:
>
> Indexer: java.io.IOException: No FileSystem for scheme: http
>         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>         at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
>
> thanks,
> pau
>
> On 7/12/17, Pau Paches <[email protected]> wrote:
> > Hi Lewis,
> > Just trying the tutorial again. Doing the third round, it's taking
> > much longer than the other two.
> >
> > What's this schema for?
> > Does the version of Nutch that we run have to have this new schema for
> > compatibility with Solr 6.6.0?
> > Or can we use Nutch 1.13?
> > thanks,
> > pau
> >
> > On 7/12/17, lewis john mcgibbney <[email protected]> wrote:
> >> Hi Folks,
> >> I just updated the tutorial below; if you find any discrepancies
> >> please let me know.
> >>
> >> https://wiki.apache.org/nutch/NutchTutorial
> >>
> >> Also, I have made available a new schema.xml which is compatible with
> >> Solr 6.6.0 at
> >>
> >> https://issues.apache.org/jira/browse/NUTCH-2400
> >>
> >> Please scope it out and let me know what happens.
> >> Thank you
> >> Lewis
> >>
> >> On Wed, Jul 12, 2017 at 6:58 AM, <[email protected]> wrote:
> >>
> >>>
> >>> From: Pau Paches [mailto:[email protected]]
> >>> Sent: Tuesday, July 11, 2017 2:50 PM
> >>> To: [email protected]
> >>> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
> >>>
> >>> Hi Rashmi,
> >>> I have followed your suggestions.
> >>> Now I'm seeing a different error:
> >>>
> >>> bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb
> >>> crawl/linkdb crawl/segments
> >>>
> >>> The input path at segments is not a segment... skipping
> >>> Indexer: starting at 2017-07-11 20:45:56
> >>> Indexer: deleting gone documents: false
> >>
> >> ...
> >
> >
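P.S. If anyone is on a Nutch build where the -dir option of `index` is
unavailable or misbehaves, the per-segment equivalent of Yossi's command
above is a plain shell loop (a sketch, reusing the crawl layout and
solr.server.url value from earlier in this thread):

  for seg in crawl/segments/*; do
    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr \
      crawl/crawldb/ -linkdb crawl/linkdb/ "$seg" \
      -filter -normalize -deleteGone
  done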


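P.P.S. To see how many documents actually made it into Solr before the
crash, a standard select query reports the total (this assumes the core
is named nutch; substitute your core name):

  curl 'http://localhost:8983/solr/nutch/select?q=*:*&rows=0'
  # numFound in the response is the indexed document count;
  # rows=0 skips returning the documents themselves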