Hi Pau,

I think the tutorial is still not fully up to date. If you haven't already, update the solr.* properties in nutch-site.xml (and run `ant runtime` again to rebuild the runtime). The tutorial command should then be:

  bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize -deleteGone

The -dir parameter should save you from running `index` once per segment. I'm not sure whether you need the final three parameters; that depends on your use case.
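For reference, the solr.* configuration I mean looks roughly like the fragment below in conf/nutch-site.xml (solr.server.url is the relevant Nutch 1.x property; the port and the core name "nutch" here are only example values for your setup). With this set you no longer pass the Solr URL on the command line, which appears to be what triggered the "No FileSystem for scheme: http" error below: Hadoop treats a positional http://... argument as an input path.

```xml
<!-- Example fragment for conf/nutch-site.xml; the core name "nutch"
     and the port are placeholders - point them at your Solr install. -->
<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/nutch</value>
</property>
```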
-----Original Message-----
From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
Sent: 12 July 2017 23:48
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi Lewis et al.,

I have followed the new tutorial. In the step "Step-by-Step: Indexing into Apache Solr", the command

  bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone

should be run for each segment directory (there are 3), I guess, but for the first segment it fails:

Indexer: java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

thanks,
pau

On 7/12/17, Pau Paches <sp.exstream.t...@gmail.com> wrote:
> Hi Lewis,
> Just trying the tutorial again. Doing the third round, it's taking
> much longer than the other two.
>
> What's this schema for?
> Does the version of Nutch that we run have to have this new schema for
> compatibility with Solr 6.6.0?
> Or can we use Nutch 1.13?
> thanks,
> pau
>
> On 7/12/17, lewis john mcgibbney <lewi...@apache.org> wrote:
>> Hi Folks,
>> I just updated the tutorial below, if you find any discrepancies
>> please let me know.
>>
>> https://wiki.apache.org/nutch/NutchTutorial
>>
>> Also, I have made available a new schema.xml which is compatible with
>> Solr 6.6.0 at
>>
>> https://issues.apache.org/jira/browse/NUTCH-2400
>>
>> Please scope it out and let me know what happens.
>> Thank you,
>> Lewis
>>
>> On Wed, Jul 12, 2017 at 6:58 AM, <user-digest-h...@nutch.apache.org> wrote:
>>
>>>
>>> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
>>> Sent: Tuesday, July 11, 2017 2:50 PM
>>> To: user@nutch.apache.org
>>> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>>>
>>> Hi Rashmi,
>>> I have followed your suggestions.
>>> Now I'm seeing a different error.
>>>
>>>   bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb crawl/linkdb crawl/segments
>>>
>>> The input path at segments is not a segment... skipping
>>> Indexer: starting at 2017-07-11 20:45:56
>>> Indexer: deleting gone documents: false
>>
>> ...
>
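As a footnote to the per-segment question above: if you do index segment by segment instead of using -dir, a loop along these lines builds one command per segment directory. This is only a sketch (it echoes the commands as a dry run; remove the `echo` to execute) and assumes the tutorial's crawl/ layout.

```shell
# Dry run: print one indexing command per segment directory under
# crawl/segments/. Drop the `echo` to actually run the indexer.
for seg in crawl/segments/*/; do
  echo bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ "$seg" \
       -filter -normalize -deleteGone
done
```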