Re: nutch 1.x tutorial with solr 6.6.0

2017-07-31 Thread Pau Paches
Hi all,
still wrestling with this.
Yossi: I checked the Solr parameters and they look OK.
For the copy step
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf
I used, as the source, the managed-schema file from the JIRA issue mentioned by
Lewis.
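For reference, the copy step above can be sketched end to end for Solr 6.x as below. The install paths are placeholders under $PWD so the sequence can be tried anywhere (a stand-in schema.xml is created); with a real install, point the variables at it, and note the core-create command is commented out because it needs a running Solr.

```shell
# Stage Nutch's schema.xml as a Solr 6 configset named "nutch" (sketch;
# placeholder paths and a stand-in schema file).
APACHE_SOLR_HOME="${APACHE_SOLR_HOME:-$PWD/solr-6.6.0}"
NUTCH_RUNTIME_HOME="${NUTCH_RUNTIME_HOME:-$PWD/apache-nutch-1.13/runtime/local}"
CONFSET="$APACHE_SOLR_HOME/server/solr/configsets/nutch/conf"

mkdir -p "$CONFSET" "$NUTCH_RUNTIME_HOME/conf"
: > "$NUTCH_RUNTIME_HOME/conf/schema.xml"   # stand-in for the real schema
cp "$NUTCH_RUNTIME_HOME/conf/schema.xml" "$CONFSET/"

# Then, against a running Solr instance:
# "$APACHE_SOLR_HOME/bin/solr" create -c nutch -d nutch
ls "$CONFSET"
```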

Anyway, the index command I finally used was the one in the tutorial plus
the argument -Dsolr.server.url=http://localhost:8983/solr, since not using
this argument results in the error
Indexer: java.io.IOException: No FileSystem for scheme: http
And it now advances farther:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/
-linkdb crawl/linkdb/ -dir crawl/segments -filter -normalize -deleteGone
Segment dir is complete:
file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727171114.
Segment dir is complete:
file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727170952.
Segment dir is complete:
file:/home/paupac/apache-nutch-1.13/crawl/segments/20170727173137.
Indexer: starting at 2017-07-31 10:23:12
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication


Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
At least it indexed 500 documents before failing, but it crashed all the same.
Has nobody else run the tutorial?

thanks,
pau
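A side-by-side sketch of the failing and working invocations discussed above. The likely cause (my reading, consistent with the error): bin/nutch index treats bare arguments as input paths, so a positional http://... URL is parsed as a Hadoop Path and fails with "No FileSystem for scheme: http"; passing the URL as a Java property avoids that. Paths are the ones used in this thread.

```shell
# Working vs failing forms of the index command (sketch; strings only,
# nothing is executed here).
SOLR_URL="http://localhost:8983/solr"

# Works: URL passed as a property, segments swept with -dir.
GOOD="bin/nutch index -Dsolr.server.url=$SOLR_URL crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -filter -normalize -deleteGone"

# Fails: the bare URL is interpreted as an input path.
BAD="bin/nutch index $SOLR_URL crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments"

echo "$GOOD"
```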

On Thu, Jul 13, 2017 at 1:00 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:

> Hi Pau,
>
> I think the tutorial is still not fully up-to-date:
> If you haven't, you should update the solr.* properties in nutch-site.xml
> (and run `ant runtime` again to update the runtime).
> Then the command for the tutorial should be:
> bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/
> -filter -normalize -deleteGone
> The -dir parameter should save you the need to run `index` for each
> segment. I'm not sure if you need the final 3 parameters, depends on your
> use case.
>
> -Original Message-
> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
> Sent: 12 July 2017 23:48
> To: user@nutch.apache.org
> Subject: Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Lewis et al.,
> I have followed the new tutorial.
> In step Step-by-Step: Indexing into Apache Solr
>
> the command
> bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
>
> should be run for each segment directory (there are 3), I guess, but for
> the first segment it fails:
> Indexer: java.io.IOException: No FileSystem for scheme: http
> ...

RE: nutch 1.x tutorial with solr 6.6.0

2017-07-12 Thread Yossi Tamari
Hi Pau,

I think the tutorial is still not fully up-to-date:
If you haven't, you should update the solr.* properties in nutch-site.xml (and 
run `ant runtime` again to update the runtime).
Then the command for the tutorial should be:
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ 
-filter -normalize -deleteGone
The -dir parameter should save you the need to run `index` for each segment. 
I'm not sure if you need the final 3 parameters, depends on your use case.
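The nutch-site.xml changes referred to above would look roughly like the fragment below. This is a sketch: the core name "nutch" and the plugin.includes value are assumptions (keep your existing plugin list and only make sure the solr indexer plugin, not indexer-elastic, is enabled), and `ant runtime` must be re-run afterwards.

```xml
<!-- nutch-site.xml (sketch): point the Solr writer at the server and
     enable the solr indexer plugin. -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/nutch</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```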

-Original Message-
From: Pau Paches [mailto:sp.exstream.t...@gmail.com] 
Sent: 12 July 2017 23:48
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi Lewis et al.,
I have followed the new tutorial.
In step Step-by-Step: Indexing into Apache Solr

the command
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ 
crawl/segments/20131108063838/ -filter -normalize -deleteGone

should be run for each segment directory (there are 3), I guess, but for the
first segment it fails:
Indexer: java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

thanks,
pau

On 7/12/17, Pau Paches <sp.exstream.t...@gmail.com> wrote:
> Hi Lewis,
> Just trying the tutorial again. Doing the third round, it's taking 
> much longer than the other two.
>
> What's this schema for?
> Does the version of Nutch that we run have to have this new schema for 
> compatibility with Solr 6.6.0?
> Or can we use Nutch 1.13?
> thanks,
> pau
>
> On 7/12/17, lewis john mcgibbney <lewi...@apache.org> wrote:
>> Hi Folks,
>> I just updated the tutorial below, if you find any discrepancies 
>> please let me know.
>>
>> https://wiki.apache.org/nutch/NutchTutorial
>>
>> Also, I have made available a new schema.xml which is compatible with 
>> Solr
>> 6.6.0 at
>>
>> https://issues.apache.org/jira/browse/NUTCH-2400
>>
>> Please scope it out and let me know what happens.
>> Thank you
>> Lewis
>>
>> On Wed, Jul 12, 2017 at 6:58 AM, <user-digest-h...@nutch.apache.org>
>> wrote:
>>
>>>
>>> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
>>> Sent: Tuesday, July 11, 2017 2:50 PM
>>> To: user@nutch.apache.org
>>> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>>>
>>> Hi Rashmi,
>>> I have follow




RE: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread Srinivasa, Rashmi
Hi Pau,

I have not used the solrindex command, but from the "input path" error message, 
it sounds like it wants the actual segment directory under segments/.

The nutch crawl script uses the following commands:
* inject
* generate
* fetch
* parse
* updatedb
* invertlinks
* dedup
* index
* clean

E.g., this is the nutch index command in my environment:
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/name_of_my_core 
my_crawl_name/crawldb -linkdb my_crawl_name/linkdb 
my_crawl_name/segments/20170710131518
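Strung together, the step list above makes one crawl round roughly as sketched below. The seed directory, -topN value and segment timestamp are placeholders, and each command is echoed rather than executed so the ordering is visible; drop the echo to run for real from the Nutch runtime directory.

```shell
# One crawl round built from the step list above (dry run: commands are
# echoed, not executed; names and paths follow the example in this message).
CRAWL=my_crawl_name
SOLR="-Dsolr.server.url=http://localhost:8983/solr/name_of_my_core"

echo "bin/nutch inject $CRAWL/crawldb urls"
echo "bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000"
SEGMENT="$CRAWL/segments/20170710131518"   # newest dir created by generate
echo "bin/nutch fetch $SEGMENT"
echo "bin/nutch parse $SEGMENT"
echo "bin/nutch updatedb $CRAWL/crawldb $SEGMENT"
echo "bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments"
echo "bin/nutch dedup $CRAWL/crawldb"
echo "bin/nutch index $SOLR $CRAWL/crawldb -linkdb $CRAWL/linkdb $SEGMENT"
echo "bin/nutch clean $SOLR $CRAWL/crawldb"
```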

Thanks,
Rashmi

-Original Message-
From: Pau Paches [mailto:sp.exstream.t...@gmail.com] 
Sent: Tuesday, July 11, 2017 2:50 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0

Hi Rashmi,
I have followed your suggestions.
Now I'm seeing a different error.
bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb
crawl/linkdb crawl/segments
The input path at segments is not a segment... skipping
Indexer: starting at 2017-07-11 20:45:56
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Still I see the disturbing warning
The input path at segments is not a segment... skipping.

And it crashes.
If it had not crashed, the tutorial would next ask me to execute
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/
crawl/segments/20131108063838/ -filter -normalize -deleteGone
which seems redundant with the solrindex command.

I think this is the way to go, but still something is missing.

Thanks,
pau

On 7/11/17, Srinivasa, Rashmi <rashmi.sriniv...@finra.org> wrote:
> Hi Pau,
>
> Yes, it took me a while to get things working because the tutorial is 
> not complete or up to date.
>
> In conf/nutch-site.xml, the value for plugin.includes uses 
> indexer-elastic by default. If you want to use SOLR, you'll have to 
> change it to indexer-solr.
>
> I haven't tried SOLR 6.6, but this is what I did in SOLR 5:
> 1. bin/solr create -c name_of_my_core -d basic_configs
> 2. bin/solr stop -all
> 3. Copy schema.xml from the nutch_directory/conf to
> server/solr/name_of_my_core/conf/
> 4. In schema.xml:
> * Search for all enablePositionIncrements="true" in the file and
> remove them.
> * Change id to url
> 5. bin/solr start
>
> Thanks,
> Rashmi
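Step 4 of the quoted recipe can be scripted as below. This is a sketch that works on a one-line stand-in schema so it is safe to try anywhere; point the commands at the real schema.xml instead. The id-to-url change is left as a comment because it depends on which element the schema uses (assumed here to be uniqueKey).

```shell
# Scripted form of step 4 in the quoted recipe (sketch; stand-in schema).
cat > schema.xml <<'EOF'
<filter class="solr.StopFilterFactory" enablePositionIncrements="true"/>
EOF
cp schema.xml schema.xml.orig            # keep a backup, as the steps suggest
sed -i 's/ enablePositionIncrements="true"//g' schema.xml
# The "change id to url" step would be something like (assumed uniqueKey):
# sed -i 's|<uniqueKey>id</uniqueKey>|<uniqueKey>url</uniqueKey>|' schema.xml
cat schema.xml
```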
>
> -Original Message-
> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
> Sent: Tuesday, July 11, 2017 8:46 AM
> To: user@nutch.apache.org
> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Yossi and BlackIce,
> many thanks for your tips. However, a tutorial needs to be 
> self-contained, or at least link to the documentation/tutorial on how 
> to configure the parts it uses.
>
>
> On Tue, Jul 11, 2017 at 1:39 PM BlackIce <blackice...@gmail.com> wrote:
>
>> I think by default the newer SOLR starts in "schemaless" mode. One 
>> needs to create a config directory with ALL necessary configuration 
>> files like schema.xml and solrconfig.xml BEFORE creating the collection 
>> and then run a command to create this collection using this conf 
>> directory. I don't have access to my nutch set-up at this moment, so 
>> I can't check, but this was explained in the SOLR docs.
>>
>> On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari 
>> <yossi.tam...@pipl.com>
>> wrote:
>>
>> > I struggled with this as well. Eventually I moved to ElasticSearch, 
>> > which is much easier.
>> >
>> > What I did manage to find out, is that in newer versions of SOLR 
>> > you need to use ZooKeeper to update the conf file. see
>> https://stackoverflow.com/a/
>> > 43351358.
>> >
>> > -Original Message-
>> > From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
>> > Sent: 11 July 2017 13:29
>> > To: user@nutch.apache.org
>> > Subject: Re: nutch 



Re: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread Pau Paches
Hi Yossi and BlackIce,
many thanks for your tips. However, a tutorial needs to be self-contained,
or at least link to the documentation/tutorial on how to configure the
parts it uses.


On Tue, Jul 11, 2017 at 1:39 PM BlackIce <blackice...@gmail.com> wrote:

> I think by default the newer SOLR starts in "schemaless" mode. One needs to
> create a config directory with ALL necessary configuration files like
> schema.xml and solrconfig.xml BEFORE creating the collection and then run a
> command to create this collection using this conf directory. I don't have
> access to my nutch set-up at this moment, so I can't check, but this was
> explained in the SOLR docs.
>
> On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari <yossi.tam...@pipl.com>
> wrote:
>
> > I struggled with this as well. Eventually I moved to ElasticSearch, which
> > is much easier.
> >
> > What I did manage to find out, is that in newer versions of SOLR you need
> > to use ZooKeeper to update the conf file. see
> https://stackoverflow.com/a/
> > 43351358.
> >
> > -Original Message-
> > From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
> > Sent: 11 July 2017 13:29
> > To: user@nutch.apache.org
> > Subject: Re: nutch 1.x tutorial with solr 6.6.0
> >
> > Hi,
> > I just crawl a single URL so no whole web crawling.
> > So I do option 2, fetching, invertlinks successfully. This is just Nutch
> > 1.x Then I do Indexing into Apache Solr so go to section Setup Solr for
> > search.
> > First thing that does not work:
> > cd ${APACHE_SOLR_HOME}/example
> > java -jar start.jar
> > There is no start.jar at the specified location, but no problem: you
> > start Solr 6.6.0 with bin/solr start.
> > Then the tutorial says:
> > Backup the original Solr example schema.xml:
> > mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
> > ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
> >
> > But in current Solr, 6.6.0, there is no schema.xml file anywhere in the
> > distribution. What should I do here?
> > If I go directly to run the Solr Index command from
> ${NUTCH_RUNTIME_HOME}:
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
> > crawl/linkdb crawl/segments/ which may not make sense since I have
> skipped
> > some steps, it crashes:
> > The input path at segments is not a segment... skipping
> > Indexer: java.lang.RuntimeException: Missing elastic.cluster and
> > elastic.host. At least one of them should be set in nutch-site.xml
> > ElasticIndexWriter
> > elastic.cluster : elastic prefix cluster
> > elastic.host : hostname
> > elastic.port : port
> >
> > Clearly there is some missing configuration in nutch-site.xml, apart from
> > setting http.agent.name in nutch-site.xml (mentioned) other fields need
> > to be set up. The segments message above is also troubling.
> >
> > If you follow the steps (if they worked) should we run bin/nutch
> solrindex
> > http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb
> > crawl/segments/ (this is the last step in Integrate Solr with Nutch) and
> > then
> >
> > bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
> > crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
> -deleteGone
> > (this is one of the steps of Using Individual Commands for Whole-Web
> > Crawling, which in fact also is the section to read if you are only
> > crawling a URL.
> >
> > This is what I found by following the tutorial at
> > https://wiki.apache.org/nutch/NutchTutorial
> >
> > On 7/9/17, lewis john mcgibbney <lewi...@apache.org> wrote:
> > > Hi Pau,
> > >
> > > On Sat, Jul 8, 2017 at 6:52 AM, <user-digest-h...@nutch.apache.org>
> > wrote:
> > >
> > >> From: Pau Paches <sp.exstream.t...@gmail.com>
> > >> To: user@nutch.apache.org
> > >> Cc:
> > >> Bcc:
> > >> Date: Sat, 8 Jul 2017 15:52:46 +0200
> > >> Subject: nutch 1.x tutorial with solr 6.6.0 Hi, I have run the Nutch
> > >> 1.x Tutorial with Solr 6.6.0.
> > >> Many things do not work,
> > >
> > >
> > > What does not work? Can you elaborate?
> > >
> > >
> > >> there is a mismatch between the assumed Solr
> > >> version and the current Solr version.
> > >>
> > >
> > > We support Solr as an indexing backend in the broadest sense possible.
> We
> > > do not aim to support the latest and greatest Solr version availabl

Re: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread BlackIce
I think by default the newer SOLR starts in "schemaless" mode. One needs to
create a config directory with ALL the necessary configuration files, like
schema.xml and solrconfig.xml, BEFORE creating the collection, and then run a
command to create the collection using this conf directory. I don't have
access to my Nutch set-up at this moment, so I can't check, but this was
explained in the SOLR docs.
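The recipe above can be sketched as shell commands. This is a minimal sketch under assumptions: the configset name `nutch` is hypothetical, the paths match a default Solr 6.x / Nutch 1.x layout, and since Solr 6 defaults to a managed schema, you may also need to switch solrconfig.xml to the ClassicIndexSchemaFactory before schema.xml is picked up:

```shell
# Build a complete conf directory BEFORE creating the core (paths are assumptions)
mkdir -p "${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf"
cp -r "${APACHE_SOLR_HOME}"/server/solr/configsets/basic_configs/conf/* \
      "${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/"

# Drop in the schema shipped with Nutch
cp "${NUTCH_RUNTIME_HOME}/conf/schema.xml" \
   "${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/"

# Only now create the core from that configset
"${APACHE_SOLR_HOME}/bin/solr" start
"${APACHE_SOLR_HOME}/bin/solr" create -c nutch -d nutch
```

Creating the core only after the conf directory is complete avoids Solr's schemaless defaults taking precedence over the Nutch schema.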

On Tue, Jul 11, 2017 at 12:58 PM, Yossi Tamari <yossi.tam...@pipl.com>
wrote:

> I struggled with this as well. Eventually I moved to ElasticSearch, which
> is much easier.
>
> What I did manage to find out, is that in newer versions of SOLR you need
> to use ZooKeeper to update the conf file. see https://stackoverflow.com/a/
> 43351358.
>
> -Original Message-
> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
> Sent: 11 July 2017 13:29
> To: user@nutch.apache.org
> Subject: Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi,
> I just crawl a single URL so no whole web crawling.
> So I do option 2, fetching, invertlinks successfully. This is just Nutch
> 1.x Then I do Indexing into Apache Solr so go to section Setup Solr for
> search.
> First thing that does not work:
> cd ${APACHE_SOLR_HOME}/example
> java -jar start.jar
> No start.jar at the specified location, but no problem you start Solr
> 6.6.0 with bin/solr start.
> Then the tutorial says:
> Backup the original Solr example schema.xml:
> mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
>
> But in current Solr, 6.6.0, there is no schema.xml file. In the whole
> distribution. What should I do here?
> if I go directly to run the Solr Index command from ${NUTCH_RUNTIME_HOME}:
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
> crawl/linkdb crawl/segments/ which may not make sense since I have skipped
> some steps, it crashes:
> The input path at segments is not a segment... skipping
> Indexer: java.lang.RuntimeException: Missing elastic.cluster and
> elastic.host. At least one of them should be set in nutch-site.xml
> ElasticIndexWriter
> elastic.cluster : elastic prefix cluster
> elastic.host : hostname
> elastic.port : port
>
> Clearly there is some missing configuration in nutch-site.xml, apart from
> setting http.agent.name in nutch-site.xml (mentioned) other fields need
> to be set up. The segments message above is also troubling.
>
> If you follow the steps (if they worked) should we run bin/nutch solrindex
> http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb
> crawl/segments/ (this is the last step in Integrate Solr with Nutch) and
> then
>
> bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
> (this is one of the steps of Using Individual Commands for Whole-Web
> Crawling, which in fact also is the section to read if you are only
> crawling a URL.
>
> This is what I found by following the tutorial at
> https://wiki.apache.org/nutch/NutchTutorial
>
> On 7/9/17, lewis john mcgibbney <lewi...@apache.org> wrote:
> > Hi Pau,
> >
> > On Sat, Jul 8, 2017 at 6:52 AM, <user-digest-h...@nutch.apache.org>
> wrote:
> >
> >> From: Pau Paches <sp.exstream.t...@gmail.com>
> >> To: user@nutch.apache.org
> >> Cc:
> >> Bcc:
> >> Date: Sat, 8 Jul 2017 15:52:46 +0200
> >> Subject: nutch 1.x tutorial with solr 6.6.0 Hi, I have run the Nutch
> >> 1.x Tutorial with Solr 6.6.0.
> >> Many things do not work,
> >
> >
> > What does not work? Can you elaborate?
> >
> >
> >> there is a mismatch between the assumed Solr
> >> version and the current Solr version.
> >>
> >
> > We support Solr as an indexing backend in the broadest sense possible. We
> > do not aim to support the latest and greatest Solr version available. If
> > you are interested in upgrading to a particular version, if you could
> open
> > a JIRA issue and provide a pull request it would be excellent.
> >
> >
> >> I have seen some messages about the same problem for Solr 4.x
> >> Is this the right path to go or should I move to Nutch 2.x?
> >
> >
> > If you are new to Nutch, I would highly advise that you stick with 1.X
> >
> >
> >> Does it
> >> make sense to use Solr 6.6 with Nutch 1.x?
> >
> >
> > Yes... you _may_ have a few configuration options to tweak but there have
> > been no backwards incompatibility issues so I see no reason for anything
> to
> > be broken.
> >
> >
> >> If yes, I'm willing to
> >> amend the tutorial if someone helps.
> >>
> >>
> > What is broken? Can you elaborate?
> >
>
>


RE: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread Yossi Tamari
I struggled with this as well. Eventually I moved to ElasticSearch, which is 
much easier.

What I did manage to find out is that in newer versions of SOLR you need to
use ZooKeeper to update the conf files; see https://stackoverflow.com/a/43351358.
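For SolrCloud deployments, the ZooKeeper route mentioned above looks roughly like this. A sketch, not a definitive recipe: the configset name `nutch`, the local conf path, and the embedded-ZooKeeper port 9983 are assumptions:

```shell
# Upload a local conf directory to ZooKeeper as a named configset (SolrCloud mode)
"${APACHE_SOLR_HOME}/bin/solr" zk upconfig \
    -z localhost:9983 \
    -n nutch \
    -d /path/to/nutch/conf

# Create a collection backed by the uploaded configset
"${APACHE_SOLR_HOME}/bin/solr" create -c nutch -n nutch
```

Any later change to the schema has to go through `zk upconfig` again (plus a collection reload); editing the files on disk has no effect in cloud mode.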

-Original Message-
From: Pau Paches [mailto:sp.exstream.t...@gmail.com] 
Sent: 11 July 2017 13:29
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi,
I just crawl a single URL so no whole web crawling.
So I do option 2, fetching, invertlinks successfully. This is just Nutch 1.x 
Then I do Indexing into Apache Solr so go to section Setup Solr for search.
First thing that does not work:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar
No start.jar at the specified location, but no problem you start Solr
6.6.0 with bin/solr start.
Then the tutorial says:
Backup the original Solr example schema.xml:
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org

But in current Solr, 6.6.0, there is no schema.xml file. In the whole 
distribution. What should I do here?
if I go directly to run the Solr Index command from ${NUTCH_RUNTIME_HOME}:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/ which may not make sense since I have skipped some 
steps, it crashes:
The input path at segments is not a segment... skipping
Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. 
At least one of them should be set in nutch-site.xml ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port

Clearly there is some missing configuration in nutch-site.xml, apart from 
setting http.agent.name in nutch-site.xml (mentioned) other fields need to be 
set up. The segments message above is also troubling.

If you follow the steps (if they worked) should we run bin/nutch solrindex 
http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/ 
(this is the last step in Integrate Solr with Nutch) and then

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ 
crawl/segments/20131108063838/ -filter -normalize -deleteGone (this is one of 
the steps of Using Individual Commands for Whole-Web Crawling, which in fact 
also is the section to read if you are only crawling a URL.

This is what I found by following the tutorial at 
https://wiki.apache.org/nutch/NutchTutorial

On 7/9/17, lewis john mcgibbney <lewi...@apache.org> wrote:
> Hi Pau,
>
> On Sat, Jul 8, 2017 at 6:52 AM, <user-digest-h...@nutch.apache.org> wrote:
>
>> From: Pau Paches <sp.exstream.t...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> Subject: nutch 1.x tutorial with solr 6.6.0 Hi, I have run the Nutch 
>> 1.x Tutorial with Solr 6.6.0.
>> Many things do not work,
>
>
> What does not work? Can you elaborate?
>
>
>> there is a mismatch between the assumed Solr
>> version and the current Solr version.
>>
>
> We support Solr as an indexing backend in the broadest sense possible. We
> do not aim to support the latest and greatest Solr version available. If
> you are interested in upgrading to a particular version, if you could open
> a JIRA issue and provide a pull request it would be excellent.
>
>
>> I have seen some messages about the same problem for Solr 4.x
>> Is this the right path to go or should I move to Nutch 2.x?
>
>
> If you are new to Nutch, I would highly advise that you stick with 1.X
>
>
>> Does it
>> make sense to use Solr 6.6 with Nutch 1.x?
>
>
> Yes... you _may_ have a few configuration options to tweak but there have
> been no backwards incompatibility issues so I see no reason for anything to
> be broken.
>
>
>> If yes, I'm willing to
>> amend the tutorial if someone helps.
>>
>>
> What is broken? Can you elaborate?
>



Re: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread Pau Paches
Hi,
I just crawl a single URL, so no whole-web crawling.
I ran option 2 (fetching, invertlinks) successfully. This is just Nutch 1.x.
Then, to do "Indexing into Apache Solr", I went to the section "Setup Solr for search".
First thing that does not work:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar
There is no start.jar at the specified location, but no problem: you can
start Solr 6.6.0 with bin/solr start.
Then the tutorial says:
Backup the original Solr example schema.xml:
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org

But in current Solr (6.6.0) there is no schema.xml file anywhere in the
distribution. What should I do here?
If I go directly to run the Solr index command from ${NUTCH_RUNTIME_HOME}:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
crawl/linkdb crawl/segments/
which may not make sense since I have skipped some steps, it crashes:
The input path at segments is not a segment... skipping
Indexer: java.lang.RuntimeException: Missing elastic.cluster and
elastic.host. At least one of them should be set in nutch-site.xml
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port

Clearly some configuration is missing in nutch-site.xml: apart from setting
http.agent.name (which the tutorial mentions), other fields need to be set
up. The segments message above is also troubling.
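An ElasticIndexWriter complaining during a Solr indexing run usually means the elastic indexing plugin, rather than the Solr one, is active. A hedged sketch of the relevant nutch-site.xml property follows; the exact plugin list is an assumption (keep whatever other plugins your crawl needs), the key point being `indexer-solr` in place of `indexer-elastic`:

```xml
<!-- nutch-site.xml: enable the Solr index writer instead of the Elastic one -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```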

If you follow the steps (assuming they worked), should we run
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
crawl/linkdb crawl/segments/
(this is the last step in Integrate Solr with Nutch) and then

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb
crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize
-deleteGone
(this is one of the steps of Using Individual Commands for Whole-Web
Crawling, which in fact is also the section to read if you are only
crawling a single URL).
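In newer Nutch 1.x releases the `bin/nutch index` job reads the Solr endpoint from a Java property rather than a positional URL (a bare URL argument tends to be interpreted as a path, producing "No FileSystem for scheme: http"). A sketch of the adjusted invocation, with paths taken from this thread and the property name assumed from the SOLRIndexWriter option listing:

```shell
# Pass the Solr URL as a property; -dir points at the parent segments directory
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr \
    crawl/crawldb/ -linkdb crawl/linkdb/ \
    -dir crawl/segments -filter -normalize -deleteGone
```

With `-dir`, the job picks up every completed segment under crawl/segments, so you do not have to name an individual timestamped segment.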

This is what I found by following the tutorial at
https://wiki.apache.org/nutch/NutchTutorial

On 7/9/17, lewis john mcgibbney <lewi...@apache.org> wrote:
> Hi Pau,
>
> On Sat, Jul 8, 2017 at 6:52 AM, <user-digest-h...@nutch.apache.org> wrote:
>
>> From: Pau Paches <sp.exstream.t...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> Subject: nutch 1.x tutorial with solr 6.6.0
>> Hi,
>> I have run the Nutch 1.x Tutorial with Solr 6.6.0.
>> Many things do not work,
>
>
> What does not work? Can you elaborate?
>
>
>> there is a mismatch between the assumed Solr
>> version and the current Solr version.
>>
>
> We support Solr as an indexing backend in the broadest sense possible. We
> do not aim to support the latest and greatest Solr version available. If
> you are interested in upgrading to a particular version, if you could open
> a JIRA issue and provide a pull request it would be excellent.
>
>
>> I have seen some messages about the same problem for Solr 4.x
>> Is this the right path to go or should I move to Nutch 2.x?
>
>
> If you are new to Nutch, I would highly advise that you stick with 1.X
>
>
>> Does it
>> make sense to use Solr 6.6 with Nutch 1.x?
>
>
> Yes... you _may_ have a few configuration options to tweak but there have
> been no backwards incompatibility issues so I see no reason for anything to
> be broken.
>
>
>> If yes, I'm willing to
>> amend the tutorial if someone helps.
>>
>>
> What is broken? Can you elaborate?
>


Re: nutch 1.x tutorial with solr 6.6.0

2017-07-09 Thread lewis john mcgibbney
Hi Pau,

On Sat, Jul 8, 2017 at 6:52 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Pau Paches <sp.exstream.t...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sat, 8 Jul 2017 15:52:46 +0200
> Subject: nutch 1.x tutorial with solr 6.6.0
> Hi,
> I have run the Nutch 1.x Tutorial with Solr 6.6.0.
> Many things do not work,


What does not work? Can you elaborate?


> there is a mismatch between the assumed Solr
> version and the current Solr version.
>

We support Solr as an indexing backend in the broadest sense possible. We
do not aim to support the latest and greatest Solr version available. If
you are interested in upgrading to a particular version, it would be
excellent if you could open a JIRA issue and provide a pull request.


> I have seen some messages about the same problem for Solr 4.x
> Is this the right path to go or should I move to Nutch 2.x?


If you are new to Nutch, I would highly advise that you stick with 1.X


> Does it
> make sense to use Solr 6.6 with Nutch 1.x?


Yes... you _may_ have a few configuration options to tweak but there have
been no backwards incompatibility issues so I see no reason for anything to
be broken.


> If yes, I'm willing to
> amend the tutorial if someone helps.
>
>
What is broken? Can you elaborate?


Re: nutch 1.x tutorial with solr 6.6.0

2017-07-09 Thread BlackIce
Sometimes it helps to replace the Solr .jar that ships with Nutch with the
one from the Solr version you are actually using.

On Sat, Jul 8, 2017 at 3:52 PM, Pau Paches <sp.exstream.t...@gmail.com>
wrote:

> Hi,
> I have run the Nutch 1.x Tutorial with Solr 6.6.0.
> Many things do not work, there is a mismatch between the assumed Solr
> version and the current Solr version.
> I have seen some messages about the same problem for Solr 4.x
> Is this the right path to go or should I move to Nutch 2.x? Does it
> make sense to use Solr 6.6 with Nutch 1.x? If yes, I'm willing to
> amend the tutorial if someone helps.
>
> thanks,
>
> pau
>


nutch 1.x tutorial with solr 6.6.0

2017-07-08 Thread Pau Paches
Hi,
I have run the Nutch 1.x Tutorial with Solr 6.6.0.
Many things do not work, there is a mismatch between the assumed Solr
version and the current Solr version.
I have seen some messages about the same problem for Solr 4.x
Is this the right path to go or should I move to Nutch 2.x? Does it
make sense to use Solr 6.6 with Nutch 1.x? If yes, I'm willing to
amend the tutorial if someone helps.

thanks,

pau