Hi Pau,

I think the tutorial is still not fully up to date. If you haven't already, update the solr.* properties in nutch-site.xml (and run `ant runtime` again to rebuild the runtime). The tutorial command should then be:

  bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize -deleteGone

The -dir parameter should save you from running `index` once per segment. I'm not sure whether you need the final three parameters; that depends on your use case.
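For reference, the solr.* configuration I mean looks roughly like the fragment below in conf/nutch-site.xml (solr.server.url is the relevant Nutch 1.x property; the port and the core name "nutch" here are only example values for your setup). With this set you no longer pass the Solr URL on the command line, which appears to be what triggered the "No FileSystem for scheme: http" error below: Hadoop treats a positional http://... argument as an input path.

```xml
<!-- Example fragment for conf/nutch-site.xml; the core name "nutch"
     and the port are placeholders - point them at your Solr install. -->
<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/nutch</value>
</property>
```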
-----Original Message-----
From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
Sent: 12 July 2017 23:48
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi Lewis et al.,

I have followed the new tutorial. In the step "Step-by-Step: Indexing into Apache Solr", the command

  bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone

should be run for each segment directory (there are 3), I guess, but for the first segment it fails:

Indexer: java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

thanks,
pau

On 7/12/17, Pau Paches <sp.exstream.t...@gmail.com> wrote:
> Hi Lewis,
> Just trying the tutorial again. Doing the third round, it's taking
> much longer than the other two.
>
> What's this schema for?
> Does the version of Nutch that we run have to have this new schema for
> compatibility with Solr 6.6.0?
> Or can we use Nutch 1.13?
> thanks,
> pau
>
> On 7/12/17, lewis john mcgibbney <lewi...@apache.org> wrote:
>> Hi Folks,
>> I just updated the tutorial below, if you find any discrepancies
>> please let me know.
>>
>> https://wiki.apache.org/nutch/NutchTutorial
>>
>> Also, I have made available a new schema.xml which is compatible with
>> Solr 6.6.0 at
>>
>> https://issues.apache.org/jira/browse/NUTCH-2400
>>
>> Please scope it out and let me know what happens.
>> Thank you,
>> Lewis
>>
>> On Wed, Jul 12, 2017 at 6:58 AM, <user-digest-h...@nutch.apache.org> wrote:
>>
>>>
>>> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
>>> Sent: Tuesday, July 11, 2017 2:50 PM
>>> To: user@nutch.apache.org
>>> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>>>
>>> Hi Rashmi,
>>> I have followed your suggestions.
>>> Now I'm seeing a different error.
>>>
>>>   bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb crawl/linkdb crawl/segments
>>>
>>> The input path at segments is not a segment... skipping
>>> Indexer: starting at 2017-07-11 20:45:56
>>> Indexer: deleting gone documents: false
>>
>> ...
>
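As a footnote to the per-segment question above: if you do index segment by segment instead of using -dir, a loop along these lines builds one command per segment directory. This is only a sketch (it echoes the commands as a dry run; remove the `echo` to execute) and assumes the tutorial's crawl/ layout.

```shell
# Dry run: print one indexing command per segment directory under
# crawl/segments/. Drop the `echo` to actually run the indexer.
for seg in crawl/segments/*/; do
  echo bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ "$seg" \
       -filter -normalize -deleteGone
done
```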