Re: indexing to Solr

2016-12-17 Thread Michael Coffey
Here is another issue with the official Nutch tutorial.
In the section "Integrate Solr with Nutch" it says to back up the original Solr 
schema.xml and replace it with the one from Nutch. It says that the original 
schema.xml is in the directory example/solr/collection1/conf. But there is no 
such directory. When I search for schema.xml, I get the following.
./solr-5.4.1/example/example-DIH/solr/solr/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/db/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/mail/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/rss/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/tika/conf/schema.xml
./solr-5.4.1/server/solr/configsets/sample_techproducts_configs/conf/schema.xml
./solr-5.4.1/server/solr/configsets/basic_configs/conf/schema.xml

It's not obvious that any one of these is the right one to use.
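
For what it's worth, the closest thing to what the tutorial seems to intend is probably 
something like the following untested sketch for Solr 5.4.1 (the core name "nutch" and 
the ${NUTCH_HOME} location are assumptions, not from the tutorial):

# untested sketch -- start from basic_configs, drop in Nutch's schema, create a core
cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs \
      ${APACHE_SOLR_HOME}/server/solr/configsets/nutch_configs
cp ${NUTCH_HOME}/conf/schema.xml \
   ${APACHE_SOLR_HOME}/server/solr/configsets/nutch_configs/conf/schema.xml
${APACHE_SOLR_HOME}/bin/solr start
${APACHE_SOLR_HOME}/bin/solr create -c nutch -d nutch_configs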



  From: lewis john mcgibbney <lewi...@apache.org>
 To: "user@nutch.apache.org" <user@nutch.apache.org> 
 Sent: Monday, November 21, 2016 10:34 AM
 Subject: Re: indexing to Solr
   
Hi Michael,

On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Fri, 18 Nov 2016 21:15:14 +0000 (UTC)
> Subject: indexing to Solr
> Where can I find up-to-date information on indexing to Solr?


http://wiki.apache.org/nutch/NutchTutorial
in particular
https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Indexing_into_Apache_Solr
If you find any issues with this tutorial then please let us know. Thank
you.


> When I search the web, I find tutorials that use the deprecated solrindex
> command. I also find questions where people want to know why it doesn't
> work.
>

That is because the only official documentation resides at
http://wiki.apache.org/nutch/NutchTutorial


> I have a good nutch 1.12 installation on a working hadoop cluster and a
> Solr 6.3.0 installation which works for their gettingstarted example.
>

You should use the specified version of Solr for the Nutch release. This is
Solr 5.4.1 as defined in the indexer-solr plugin ivy.xml


> I have questions like: Do I need to create a core and a collection in solr?


Yes I would. This is explained at
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


> Do I need http or cloud type server? Do I need solr.zookeeper.url?
>

This is not a Nutch question. This is your preferred Solr configuration. If
you are just starting out then I would say it is not a big deal...
experiment and go with what works best for your requirements and resources
capacity.


> What else needs to be set in nutch-site.xml?
>

Not much. For reference though, here are the Solr configuration options.
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1750-L1826


> What about schema?
>

This is covered in
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


>
> Thanks for all the help so far!
>
>
No problems. Any more issues, ping us here and we will help.
Ta


   

Re: indexing to Solr

2016-12-17 Thread Michael Coffey
Here is an issue with the official Nutch tutorial.
In the "setup Solr for Search" it gives the following instructions.* download 
binary file from here
* unzip to $HOME/apache-solr, we will now refer to this as ${APACHE_SOLR_HOME}
* cd ${APACHE_SOLR_HOME}/example
* java -jar start.jar

Unfortunately, there is no start .jar in the examples directory. When I instead 
try to use the start.jar in the servers directory, Java says "WARNING: Nothing 
to start, exiting ..."
You need something like the following to start solr.$APACHE_SOLR_HOME/bin/solr 
start -e cloud -noprompt 

In this case, I am using solr 5.4.1
Also, as mentioned previously, the tutorial says nothing about which version of 
solr to use.
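
Once Solr is up that way (the cloud example creates a collection named gettingstarted), 
the indexing step would presumably be something like this untested sketch, run from the 
Nutch runtime directory (the crawl paths here are assumptions):

# untested sketch -- collection name comes from "bin/solr start -e cloud -noprompt"
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/gettingstarted \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/*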

  From: lewis john mcgibbney <lewi...@apache.org>
 To: "user@nutch.apache.org" <user@nutch.apache.org> 
 Sent: Monday, November 21, 2016 10:34 AM
 Subject: Re: indexing to Solr
   
Hi Michael,

On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Fri, 18 Nov 2016 21:15:14 +0000 (UTC)
> Subject: indexing to Solr
> Where can I find up-to-date information on indexing to Solr?


http://wiki.apache.org/nutch/NutchTutorial
in particular
https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Indexing_into_Apache_Solr
If you find any issues with this tutorial then please let us know. Thank
you.


> When I search the web, I find tutorials that use the deprecated solrindex
> command. I also find questions where people want to know why it doesn't
> work.
>

That is because the only official documentation resides at
http://wiki.apache.org/nutch/NutchTutorial


> I have a good nutch 1.12 installation on a working hadoop cluster and a
> Solr 6.3.0 installation which works for their gettingstarted example.
>

You should use the specified version of Solr for the Nutch release. This is
Solr 5.4.1 as defined in the indexer-solr plugin ivy.xml


> I have questions like: Do I need to create a core and a collection in solr?


Yes I would. This is explained at
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


> Do I need http or cloud type server? Do I need solr.zookeeper.url?
>

This is not a Nutch question. This is your preferred Solr configuration. If
you are just starting out then I would say it is not a big deal...
experiment and go with what works best for your requirements and resources
capacity.


> What else needs to be set in nutch-site.xml?
>

Not much. For reference though, here are the Solr configuration options.
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1750-L1826


> What about schema?
>

This is covered in
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


>
> Thanks for all the help so far!
>
>
No problems. Any more issues, ping us here and we will help.
Ta


   

Re: indexing to Solr

2016-11-21 Thread Michael Coffey
Thanks for the info, I will try again with Solr 5.4.1!
 I think it would be helpful if the tutorial said something about which 
version(s) of Solr work with Nutch, perhaps calling attention to the ivy 
file you mentioned in your email. The download link in the "Setup Solr for 
Search" section points to a choice of 5.5.3 or 6.3.0 (at the moment). I ran 
into NUTCH-2267 on both of the Solr versions (6.3.0 and 5.5.3) I tried to work 
with.
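
For anyone else hitting this, a quick way to see which SolrJ version the source pins is 
something like the following sketch (path as in the Nutch 1.12 source layout):

# shows the solr-solrj dependency the indexer-solr plugin is built against
grep solr-solrj src/plugin/indexer-solr/ivy.xml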

  From: lewis john mcgibbney <lewi...@apache.org>
 To: "user@nutch.apache.org" <user@nutch.apache.org> 
 Sent: Monday, November 21, 2016 10:34 AM
 Subject: Re: indexing to Solr
   
Hi Michael,

On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Fri, 18 Nov 2016 21:15:14 +0000 (UTC)
> Subject: indexing to Solr
> Where can I find up-to-date information on indexing to Solr?


http://wiki.apache.org/nutch/NutchTutorial
in particular
https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Indexing_into_Apache_Solr
If you find any issues with this tutorial then please let us know. Thank
you.


> When I search the web, I find tutorials that use the deprecated solrindex
> command. I also find questions where people want to know why it doesn't
> work.
>

That is because the only official documentation resides at
http://wiki.apache.org/nutch/NutchTutorial


> I have a good nutch 1.12 installation on a working hadoop cluster and a
> Solr 6.3.0 installation which works for their gettingstarted example.
>

You should use the specified version of Solr for the Nutch release. This is
Solr 5.4.1 as defined in the indexer-solr plugin ivy.xml


> I have questions like: Do I need to create a core and a collection in solr?


Yes I would. This is explained at
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


> Do I need http or cloud type server? Do I need solr.zookeeper.url?
>

This is not a Nutch question. This is your preferred Solr configuration. If
you are just starting out then I would say it is not a big deal...
experiment and go with what works best for your requirements and resources
capacity.


> What else needs to be set in nutch-site.xml?
>

Not much. For reference though, here are the Solr configuration options.
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1750-L1826


> What about schema?
>

This is covered in
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


>
> Thanks for all the help so far!
>
>
No problems. Any more issues, ping us here and we will help.
Ta


   

Re: indexing to Solr

2016-11-21 Thread lewis john mcgibbney
Hi Michael,

On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Fri, 18 Nov 2016 21:15:14 +0000 (UTC)
> Subject: indexing to Solr
> Where can I find up-to-date information on indexing to Solr?


http://wiki.apache.org/nutch/NutchTutorial
in particular
https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Indexing_into_Apache_Solr
If you find any issues with this tutorial then please let us know. Thank
you.


> When I search the web, I find tutorials that use the deprecated solrindex
> command. I also find questions where people want to know why it doesn't
> work.
>

That is because the only official documentation resides at
http://wiki.apache.org/nutch/NutchTutorial


> I have a good nutch 1.12 installation on a working hadoop cluster and a
> Solr 6.3.0 installation which works for their gettingstarted example.
>

You should use the specified version of Solr for the Nutch release. This is
Solr 5.4.1 as defined in the indexer-solr plugin ivy.xml


> I have questions like: Do I need to create a core and a collection in solr?


Yes I would. This is explained at
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


> Do I need http or cloud type server? Do I need solr.zookeeper.url?
>

This is not a Nutch question. This is your preferred Solr configuration. If
you are just starting out then I would say it is not a big deal...
experiment and go with what works best for your requirements and resources
capacity.


> What else needs to be set in nutch-site.xml?
>

Not much. For reference though, here are the Solr configuration options.
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1750-L1826
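
Any of those can also be overridden per run with -D instead of nutch-site.xml; a rough
sketch (the Solr URL and crawl directory names here are assumptions):

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/*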


> What about schema?
>

This is covered in
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


>
> Thanks for all the help so far!
>
>
No problems. Any more issues, ping us here and we will help.
Ta


indexing to Solr

2016-11-18 Thread Michael Coffey
Where can I find up-to-date information on indexing to Solr? When I search the 
web, I find tutorials that use the deprecated solrindex command. I also find 
questions where people want to know why it doesn't work.
I have a good nutch 1.12 installation on a working hadoop cluster and a Solr 
6.3.0 installation which works for their gettingstarted example.
I have questions like: Do I need to create a core and a collection in solr? Do I 
need http or cloud type server? Do I need solr.zookeeper.url?
What else needs to be set in nutch-site.xml?
What about schema?

Thanks for all the help so far!



Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr

2015-11-18 Thread Manish Verma
Thanks for the reply, I just downloaded the source once again and it worked.


Regards

> On Nov 18, 2015, at 6:16 AM, Roannel Fernández Hernández <roan...@uci.cu> 
> wrote:
> 
> Hi
> 
> Check in the folder of the indexer-solr plugin whether the 
> solr-solrj-4.10.2.jar library exists.
> 
> Regards
> 
> - Original Message -
>> From: "Roannel Fernández Hernández" <roan...@uci.cu>
>> To: user@nutch.apache.org
>> Sent: Wednesday, 18 November 2015 9:03:38
>> Subject: Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With 
>> Solr
>> 
>> Hi
>> 
>> What version of Nutch did you download exactly?
>> 
>> Regards
>> 
>> - Original Message -
>>> From: "Manish Verma" <ve...@apple.com>
>>> To: user@nutch.apache.org
>>> Sent: Monday, 16 November 2015 12:36:46
>>> Subject: [MASSMAIL]Crawl Command - Getting Exception While Indexing With
>>> Solr
>>> 
>>> Hi ,
>>> 
>>> I was using bin version of Nutch 1.X and everything was working fine , I
>>> downloaded the source of Nutch 1.x and build it, In indexing phase it
>>> throws
>>> below exception. I am using below crawl command and it run well till
>>> parsing
>>> and fails at indexing with below exception.
>>> 
>>> ./crawl -i -D solr.server.url=http://localhost:8983/solr/
>>> /Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1
>>> 
>>> I see lot of people facing this but no answer . Please suggest.
>>> 
>>> 
>>> Indexing 20151115233845 to index
>>> /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
>>> -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
>>> testCrawl5//linkdb testCrawl5//segments/20151115233845
>>> Indexer: starting at 2015-11-15 23:39:02
>>> Indexer: deleting gone documents: false
>>> Indexer: URL filtering: false
>>> Indexer: URL normalizing: false
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/solr/client/solrj/SolrServerException
>>>at java.lang.Class.getDeclaredConstructors0(Native Method)
>>>at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
>>>at java.lang.Class.getConstructor0(Class.java:2885)
>>>at java.lang.Class.newInstance(Class.java:350)
>>>at
>>>
>>> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
>>>at
>>>org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:55)
>>>at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:121)
>>>at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.solr.client.solrj.SolrServerException
>>>at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>at java.security.AccessController.doPrivileged(Native Method)
>>>at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>>at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>... 10 more
>>> Error running:
>>>  /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
>>>  -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
>>>  testCrawl5//linkdb testCrawl5//segments/20151115233845
>>> 
>>> Thanks
>>> Manish Verma
>>> AML Search
>>> +1 669 224 9924
>>> 
>>> 



Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr

2015-11-18 Thread Manish Verma
Thanks for the reply, I just downloaded the source once again and it worked.

Regards
Manish


> On Nov 18, 2015, at 6:03 AM, Roannel Fernández Hernández <roan...@uci.cu> 
> wrote:
> 
> Hi
> 
> What version of Nutch did you download exactly?
> 
> Regards
> 
> - Original Message -
>> From: "Manish Verma" <ve...@apple.com>
>> To: user@nutch.apache.org
>> Sent: Monday, 16 November 2015 12:36:46
>> Subject: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
>> 
>> Hi ,
>> 
>> I was using bin version of Nutch 1.X and everything was working fine , I
>> downloaded the source of Nutch 1.x and build it, In indexing phase it throws
>> below exception. I am using below crawl command and it run well till parsing
>> and fails at indexing with below exception.
>> 
>> ./crawl -i -D solr.server.url=http://localhost:8983/solr/
>> /Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1
>> 
>> I see lot of people facing this but no answer . Please suggest.
>> 
>> 
>> Indexing 20151115233845 to index
>> /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
>> -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
>> testCrawl5//linkdb testCrawl5//segments/20151115233845
>> Indexer: starting at 2015-11-15 23:39:02
>> Indexer: deleting gone documents: false
>> Indexer: URL filtering: false
>> Indexer: URL normalizing: false
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/solr/client/solrj/SolrServerException
>>at java.lang.Class.getDeclaredConstructors0(Native Method)
>>at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
>>at java.lang.Class.getConstructor0(Class.java:2885)
>>at java.lang.Class.newInstance(Class.java:350)
>>at
>>
>> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
>>at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:55)
>>at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:121)
>>at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
>>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.solr.client.solrj.SolrServerException
>>at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>at java.security.AccessController.doPrivileged(Native Method)
>>at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>... 10 more
>> Error running:
>>  /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
>>  -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
>>  testCrawl5//linkdb testCrawl5//segments/20151115233845
>> 
>> Thanks
>> Manish Verma
>> AML Search
>> +1 669 224 9924
>> 
>> 



Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr

2015-11-18 Thread Roannel Fernández Hernández
Hi

What version of Nutch did you download exactly?

Regards

- Original Message -
> From: "Manish Verma" <ve...@apple.com>
> To: user@nutch.apache.org
> Sent: Monday, 16 November 2015 12:36:46
> Subject: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
> 
> Hi ,
> 
> I was using bin version of Nutch 1.X and everything was working fine , I
> downloaded the source of Nutch 1.x and build it, In indexing phase it throws
> below exception. I am using below crawl command and it run well till parsing
> and fails at indexing with below exception.
> 
> ./crawl -i -D solr.server.url=http://localhost:8983/solr/
> /Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1
> 
> I see lot of people facing this but no answer . Please suggest.
>  
> 
> Indexing 20151115233845 to index
> /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
> -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
> testCrawl5//linkdb testCrawl5//segments/20151115233845
> Indexer: starting at 2015-11-15 23:39:02
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/solr/client/solrj/SolrServerException
> at java.lang.Class.getDeclaredConstructors0(Native Method)
> at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
> at java.lang.Class.getConstructor0(Class.java:2885)
> at java.lang.Class.newInstance(Class.java:350)
> at
> 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
> at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:55)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:121)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.solr.client.solrj.SolrServerException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 10 more
> Error running:
>   /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
>   -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
>   testCrawl5//linkdb testCrawl5//segments/20151115233845
> 
> Thanks
> Manish Verma
> AML Search
> +1 669 224 9924
> 
> 


Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr

2015-11-18 Thread Roannel Fernández Hernández
Hi

Check in the folder of the indexer-solr plugin whether the 
solr-solrj-4.10.2.jar library exists.
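
In a source build that check would be roughly the following sketch, run from the Nutch 
source root after "ant runtime":

ls runtime/local/plugins/indexer-solr/
# expect solr-solrj-4.10.2.jar next to the plugin jar; if it is missing, the build did not pull it in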

Regards

- Original Message -
> From: "Roannel Fernández Hernández" <roan...@uci.cu>
> To: user@nutch.apache.org
> Sent: Wednesday, 18 November 2015 9:03:38
> Subject: Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With 
> Solr
> 
> Hi
> 
> What version of Nutch did you download exactly?
> 
> Regards
> 
> - Original Message -
> > From: "Manish Verma" <ve...@apple.com>
> > To: user@nutch.apache.org
> > Sent: Monday, 16 November 2015 12:36:46
> > Subject: [MASSMAIL]Crawl Command - Getting Exception While Indexing With
> > Solr
> > 
> > Hi ,
> > 
> > I was using bin version of Nutch 1.X and everything was working fine , I
> > downloaded the source of Nutch 1.x and build it, In indexing phase it
> > throws
> > below exception. I am using below crawl command and it run well till
> > parsing
> > and fails at indexing with below exception.
> > 
> > ./crawl -i -D solr.server.url=http://localhost:8983/solr/
> > /Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1
> > 
> > I see lot of people facing this but no answer . Please suggest.
> >  
> > 
> > Indexing 20151115233845 to index
> > /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
> > -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
> > testCrawl5//linkdb testCrawl5//segments/20151115233845
> > Indexer: starting at 2015-11-15 23:39:02
> > Indexer: deleting gone documents: false
> > Indexer: URL filtering: false
> > Indexer: URL normalizing: false
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/apache/solr/client/solrj/SolrServerException
> > at java.lang.Class.getDeclaredConstructors0(Native Method)
> > at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
> > at java.lang.Class.getConstructor0(Class.java:2885)
> > at java.lang.Class.newInstance(Class.java:350)
> > at
> > 
> > org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
> > at
> > org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:55)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:121)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> > Caused by: java.lang.ClassNotFoundException:
> > org.apache.solr.client.solrj.SolrServerException
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > ... 10 more
> > Error running:
> >   /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index
> >   -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb
> >   testCrawl5//linkdb testCrawl5//segments/20151115233845
> > 
> > Thanks
> > Manish Verma
> > AML Search
> > +1 669 224 9924
> > 
> > 


Crawl Command - Getting Exception While Indexing With Solr

2015-11-16 Thread Manish Verma
Hi ,

I was using the binary version of Nutch 1.x and everything was working fine. I 
downloaded the source of Nutch 1.x and built it. In the indexing phase it throws 
the exception below. I am using the crawl command below; it runs well through 
parsing but fails at indexing with the exception below. 

./crawl -i -D solr.server.url=http://localhost:8983/solr/ 
/Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1

I see a lot of people facing this but no answer. Please suggest.
 

Indexing 20151115233845 to index
/Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index 
-Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb 
testCrawl5//linkdb testCrawl5//segments/20151115233845
Indexer: starting at 2015-11-15 23:39:02
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/solr/client/solrj/SolrServerException
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
at java.lang.Class.getConstructor0(Class.java:2885)
at java.lang.Class.newInstance(Class.java:350)
at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:55)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:121)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Caused by: java.lang.ClassNotFoundException: 
org.apache.solr.client.solrj.SolrServerException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 10 more
Error running:
  /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index 
-Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb 
testCrawl5//linkdb testCrawl5//segments/20151115233845

Thanks
Manish Verma
AML Search
+1 669 224 9924



Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-06 Thread Guy McD
Hi Lewis,

Thank you so much. Going in my docs.

Guy McDowell
guymcdow...@gmail.com
http://www.GuyMcDowell.com




On Sat, Sep 5, 2015 at 2:10 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Guy,
>
> The schema is present in the conf directory as shown here
>
> https://github.com/apache/nutch/blob/trunk/conf/schema.xml
>
> Lewis
>
> On Thu, Sep 3, 2015 at 11:13 AM, <user-digest-h...@nutch.apache.org>
> wrote:
>
> >
> > Subject: Re: Problems indexing to solr 3.5 from nutch 1.8
> > Having a similar problem in getting Nutch and Solr integrated. Newest
> > version of both. Downloaded and installed a few days ago.
> >
> > Following the tut tells me to copy over the schema.xml, but it doesn't
> > appear to be in the directory that the tut says. Or anywhere for that
> > matter.
> >
> > This is probably a rookie mistake by me, but I'm just not seeing it.
> Help.
> >
> >
>


Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-05 Thread Lewis John Mcgibbney
Hi Guy,

The schema is present in the conf directory as shown here

https://github.com/apache/nutch/blob/trunk/conf/schema.xml
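
Copying it over would be something like this rough sketch (the target path varies by 
Solr version, e.g. example/solr/conf for 3.x or example/solr/collection1/conf for 4.x; 
$NUTCH_HOME and $APACHE_SOLR_HOME are assumed):

cp $NUTCH_HOME/conf/schema.xml $APACHE_SOLR_HOME/example/solr/conf/schema.xml
# restart Solr afterwards so the new schema is loaded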

Lewis

On Thu, Sep 3, 2015 at 11:13 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> Subject: Re: Problems indexing to solr 3.5 from nutch 1.8
> Having a similar problem in getting Nutch and Solr integrated. Newest
> version of both. Downloaded and installed a few days ago.
>
> Following the tut tells me to copy over the schema.xml, but it doesn't
> appear to be in the directory that the tut says. Or anywhere for that
> matter.
>
> This is probably a rookie mistake by me, but I'm just not seeing it. Help.
>
>


Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-03 Thread Lewis John Mcgibbney
Hi Paddy,

Some comments in addition to my response. You should try upgrading to Nutch
1.10 when we release very shortly. There has been so much work done since
1.8 that you can benefit from. Keep your ears peeled here for a release
candidate and then eventual release.

Please see response below.

On Tue, Sep 1, 2015 at 5:00 AM,  wrote:

>
> I'm running into Problems with indexing documents crawled by nutch 1.8 into
> solr 3.5. Nutch does not report any kind of
> error or warning and seems to run just fine. but the solr index remains
> empty. (The logs do not show any kind of error or warning eather).
>

I would also check your Solr logs. When you copy over the nutch schema.xml
you must make sure that the Solr server logging does not indicate any
issues were encountered during startup. If there are errors then you should
resolve them.


>
> Is there any way to solve this issue? Is nutch 1.8 (uses solrj 3.4)
> compatible with solr 3.5 or are there any known issues?
>
>
As far as I know there are no issues. I would set logging in Nutch and Solr
to DEBUG. This can be done in Nutch by editing various values from INFO -->
DEBUG within conf/log4j.properties.
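A crude sketch of that edit (back up the file first; this flips every INFO logger in the
file to DEBUG, which is more than strictly needed):

cp conf/log4j.properties conf/log4j.properties.bak
sed -i 's/INFO/DEBUG/g' conf/log4j.properties
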
Something very similar will be true for Solr.
Ta
Lewis


Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-03 Thread Guy McD
Having a similar problem in getting Nutch and Solr integrated. Newest
version of both. Downloaded and installed a few days ago.

Following the tut tells me to copy over the schema.xml, but it doesn't
appear to be in the directory that the tut says. Or anywhere for that
matter.

This is probably a rookie mistake by me, but I'm just not seeing it. Help.

Guy McDowell
guymcdow...@gmail.com
http://www.GuyMcDowell.com




On Thu, Sep 3, 2015 at 5:28 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Paddy,
>
> Some comments in addition to my response. You should try upgrading to Nutch
> 1.10 when we release very shortly. There has been so much work done since
> 1.8 that you can benefit from. Keep your ears peeled here for a release
> candidate and then eventual release.
>
> Please see response below.
>
> On Tue, Sep 1, 2015 at 5:00 AM,  wrote:
>
> >
> > I'm running into Problems with indexing documents crawled by nutch 1.8
> into
> > solr 3.5. Nutch does not report any kind of
> > error or warning and seems to run just fine. but the solr index remains
> > empty. (The logs do not show any kind of error or warning eather).
> >
>
> I would also check your Solr logs. When you copy over the nutch schema.xml
> you must make sure that the Solr server logging does not indicate any
> issues were encountered during startup. If there are errors then you should
> resolve them.
>
>
> >
> > Is there any way to solve this issue? Is nutch 1.8 (uses solrj 3.4)
> > compatible with solr 3.5 or are there any known issues?
> >
> >
> As far as I know there are no issues. I would set logging in Nutch and Solr
> to DEBUG. This can be done in Nutch by editing various values from INFO -->
> DEBUG within conf/log4j.properties.
> Something very similar will be true for Solr.
> Ta
> Lewis
>


Problems indexing to solr 3.5 from nutch 1.8

2015-09-01 Thread Patrick Wilmes
Hey there,

I'm running into problems with indexing documents crawled by Nutch 1.8 into
Solr 3.5. Nutch does not report any kind of error or warning and seems to run
just fine, but the Solr index remains empty. (The logs do not show any kind of
error or warning either.)

Is there any way to solve this issue? Is Nutch 1.8 (uses SolrJ 3.4)
compatible with Solr 3.5 or are there any known issues?

Kind regards
paddy


Re: NullPointerException occured during indexing to solr from nutch 1.7 source build.

2014-09-04 Thread atawfik
Hi,

If I am not mistaken, your Solr url is not accurate. You should provide the
Solr url plus the core being used. For instance, if your core is named
collection1 (the default Solr core name), then your url should be
*http://solr-server:8983/solr/collection1*. I believe if you review the Solr or
Nutch logs, you will see that the indexing job has failed.
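
Applied to the command from the earlier message, that would be roughly the following
(assuming the core really is named collection1):

./crawl /user/nutch/urls /tmp/nutch_1_8_first_output http://solr-server:8983/solr/collection1 1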

Regards
Ameer



--
View this message in context: 
http://lucene.472066.n3.nabble.com/NullPointerException-occured-during-indexing-to-solr-from-nutch-1-7-source-build-tp4156343p4157058.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: NullPointerException occured during indexing to solr from nutch 1.7 source build.

2014-09-03 Thread vinay . kashyap





Hi Talat,
Thanks for the information. I tried with Nutch 1.8; it works fine and the job
completed.
However, I was not able to find the data indexed in Solr even though I gave the
below command where the Solr url is mentioned:
./crawl /user/nutch/urls /tmp/nutch_1_8_first_output
http://solr-server:8983/solr 1
I was assuming that after migrating, specifying the solr-server url while
running would ensure that the crawled data gets indexed automatically to Solr.
Is that not the case?
If not, then how do I do it manually?
:)


From: Talat Uyarer ta...@uyarer.com

Sent: user@nutch.apache.org

Date: Tue, September 2, 2014 8:35 pm

Subject: Re: NullPointerException occured during indexing to solr from
nutch 1.7 source build.

> Hi,
>
> This is an issue. Below is the code of SolrDeleteDuplicate class from nutch
> 1.7 trunk where the solr record is deleted by id field. As documents don't
> have the url field therefore the id of the documents empty, so its
> throwing a null pointer exception when it runs.
>
> Now i am writing on my phone. i did not find this issue. But if you
> update from 1.7 to newer version. You will not get this error.
>
> Talat
>
> On Sep 2, 2014 10:22 AM, vinay.kash...@socialinfra.net wrote:
>
>> Hi,
>> I have taken nutch 1.7 source and copied
>> mapred-site.xml,hdfs-site.xml,yarn-site.xml,hadoop-env.sh,core-site.xml
>> from my Hadoop 2.3.0-cdh5.1.0 and did an ant build.
>> Then went on to runtime/deploy/bin to start the crawling. it successfully
>> submitted the jobs to my yarn. But later during indexing to solr, i'm
>> getting below exceptions.
>> I have copied the scheme-solr4.xml to my solr and added exceptions in
>> regex-urlfilter.txt for a particular website which i give for crawling in
>> the directory urls/seed.txt.
>> Error:
>> java.lang.NullPointerException
>>         at org.apache.hadoop.io.Text.encode(Text.java:443)
>>         at org.apache.hadoop.io.Text.set(Text.java:198)
>>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
>>         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
>>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:198)
>>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:184)
>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>
>> Kindly, can any one tell me how to solve this issue? I'm basically stuck
>> here!!


NullPointerException occured during indexing to solr from nutch 1.7 source build.

2014-09-02 Thread vinay . kashyap



Hi,
I have taken the Nutch 1.7 source, copied
mapred-site.xml, hdfs-site.xml, yarn-site.xml, hadoop-env.sh and core-site.xml
from my Hadoop 2.3.0-cdh5.1.0, and did an ant build.
Then I went to runtime/deploy/bin to start the crawling. It successfully
submitted the jobs to my YARN. But later, during indexing to Solr, I'm getting
the exceptions below.
I have copied the schema-solr4.xml to my Solr and added exceptions in
regex-urlfilter.txt for the particular website which I give for crawling in the
directory urls/seed.txt.
Error:
java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:443)
        at org.apache.hadoop.io.Text.set(Text.java:198)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:184)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Kindly, can anyone tell me how to solve this issue? I'm basically stuck
here!!



RE: Errors when indexing to Solr

2012-09-07 Thread Fournier, Danny G
I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
error:

[root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-07 08:41:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
Generator: starting at 2012-09-07 08:41:21
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120907084129
Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-09-07 08:41:36
Fetcher: segment: crawl/segments/20120907084129
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Exception in thread "main" java.io.IOException: Job failed!
at
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I then tried to crawl with 1.5.1 (which was successful) and INDEX with
1.6-SNAPSHOT. I got this error:

[root@w7sp1-x64 nutch]# bin/nutch solrindex
http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
SolrIndexer: starting at 2012-09-07 09:05:21
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://127.0.0.1:8080/solr/core2/update

 -Original Message-
 From: Fournier, Danny G [mailto:danny.fourn...@dfo-mpo.gc.ca]
 Sent: September 6, 2012 4:15 PM
 To: user@nutch.apache.org
 Subject: Errors when indexing to Solr
 
 I'm getting two different errors while trying to index Nutch crawls to
 Solr. I'm running with:
 
 - CentOS 6.3 VM (Virtualbox) (in host Windows XP)
 - Solr 3.6.1
 - Nutch 1.5.1
 
 It would seem that NUTCH-1251 comes rather close to solving my issue?
 Which would mean that I would have to compile Nutch 1.6 to fix this?
 
 Error #1 - When indexing directly to Solr
 
 Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2
 -depth 3 -topN 5
 
 Error:  Exception in thread main java.io.IOException:
 org.apache.solr.client.solrj.SolrServerException: Error executing
query
 
 SolrIndexer: starting at 2012-09-06 14:30:11
 Indexing 8 documents
 java.io.IOException: Job failed!
 SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
 SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
 Exception in thread main java.io.IOException:
 org.apache.solr.client.solrj.SolrServerException: Error executing
 query
   at

org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSp
 lits(SolrDeleteDuplicates.java:200)
   at
 org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
   at
 org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
   at
 org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:416)
   at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInform
 atio
 n.java:1121)
   at

org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
   at
 org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
   at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
   at

org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDupli
 cates.java:373)
   at

org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDupli
 cates.java:353)
   at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error
 executing query
   at

org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.j
 ava:95)
   at
 org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
   at

org.apache.nutch.indexer.solr.SolrDeleteDuplicates

RE: Errors when indexing to Solr

2012-09-07 Thread Markus Jelsma
-Original message-
 From:Fournier, Danny G danny.fourn...@dfo-mpo.gc.ca
 Sent: Fri 07-Sep-2012 14:46
 To: user@nutch.apache.org
 Subject: RE: Errors when indexing to Solr
 
 I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
 error:
 
 [root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
 solrUrl is not set, indexing will be skipped...
 crawl started in: crawl
 rootUrlDir = urls
 threads = 10
 depth = 3
 solrUrl=null
 topN = 5
 Injector: starting at 2012-09-07 08:41:06
 Injector: crawlDb: crawl/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
 Generator: starting at 2012-09-07 08:41:21
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 5
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: crawl/segments/20120907084129
 Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
 Fetcher: Your 'http.agent.name' value should be listed first in
 'http.robots.agents' property.
 Fetcher: starting at 2012-09-07 08:41:36
 Fetcher: segment: crawl/segments/20120907084129
 Using queue mode : byHost
 Fetcher: threads: 10
 Fetcher: time-out divisor: 2
 QueueFeeder finished: total 1 records + hit by time limit :0
 Exception in thread main java.io.IOException: Job failed!
   at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
   at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please post the relevant log

 
 I then tried to crawl with 1.5.1 (which was successful) and INDEX with
 1.6-SNAPSHOT. I got this error:
 
 [root@w7sp1-x64 nutch]# bin/nutch solrindex
 http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
 SolrIndexer: starting at 2012-09-07 09:05:21
 SolrIndexer: deleting gone documents: false
 SolrIndexer: URL filtering: false
 SolrIndexer: URL normalizing: false
 org.apache.solr.common.SolrException: Not Found
 
 Not Found
 
 request: http://127.0.0.1:8080/solr/core2/update

This is no Nutch error, there simply is no Solr running there (404), or a badly 
configured one.
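
A quick way to confirm which it is, as a rough sketch (assumes the ping handler is
enabled in solrconfig.xml):

curl "http://127.0.0.1:8080/solr/core2/admin/ping"
# a 404 here means the core (or Solr itself) is not where Nutch is pointing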

 
  -Original Message-
  From: Fournier, Danny G [mailto:danny.fourn...@dfo-mpo.gc.ca]
  Sent: September 6, 2012 4:15 PM
  To: user@nutch.apache.org
  Subject: Errors when indexing to Solr
  
  I'm getting two different errors while trying to index Nutch crawls to
  Solr. I'm running with:
  
  - CentOS 6.3 VM (Virtualbox) (in host Windows XP)
  - Solr 3.6.1
  - Nutch 1.5.1
  
  It would seem that NUTCH-1251 comes rather close to solving my issue?
  Which would mean that I would have to compile Nutch 1.6 to fix this?
  
  Error #1 - When indexing directly to Solr
  
  Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2
  -depth 3 -topN 5
  
  Error:  Exception in thread main java.io.IOException:
  org.apache.solr.client.solrj.SolrServerException: Error executing
 query
  
  SolrIndexer: starting at 2012-09-06 14:30:11
  Indexing 8 documents
  java.io.IOException: Job failed!
  SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
  SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
  Exception in thread main java.io.IOException:
  org.apache.solr.client.solrj.SolrServerException: Error executing
  query
  at
 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSp
  lits(SolrDeleteDuplicates.java:200)
  at
  org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
  at
  org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
  at
  org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
  at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:416)
  at
  org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInform
  atio
  n.java:1121)
  at
 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
  at
  org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
  at
  org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
  at
 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDupli
  cates.java:373)
  at
 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDupli
  cates.java:353)
  at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
  at org.apache.hadoop.util.ToolRunner.run

RE: Errors when indexing to Solr

2012-09-07 Thread Fournier, Danny G
Markus, 

You were right. My core was set up properly; however, it was labeled something 
different in the conf file. I was able to get rid of that error. Thanks!

I have provided the log you asked for below...

Dan

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: September 7, 2012 9:49 AM
 To: user@nutch.apache.org
 Subject: RE: Errors when indexing to Solr
 
 -Original message-
  From:Fournier, Danny G danny.fourn...@dfo-mpo.gc.ca
  Sent: Fri 07-Sep-2012 14:46
  To: user@nutch.apache.org
  Subject: RE: Errors when indexing to Solr
 
  I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
  error:
 
  [root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  solrUrl is not set, indexing will be skipped...
  crawl started in: crawl
  rootUrlDir = urls
  threads = 10
  depth = 3
  solrUrl=null
  topN = 5
  Injector: starting at 2012-09-07 08:41:06
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
  Generator: starting at 2012-09-07 08:41:21
  Generator: Selecting best-scoring urls due for fetch.
  Generator: filtering: true
  Generator: normalizing: true
  Generator: topN: 5
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls for politeness.
  Generator: segment: crawl/segments/20120907084129
  Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
  Fetcher: Your 'http.agent.name' value should be listed first in
  'http.robots.agents' property.
  Fetcher: starting at 2012-09-07 08:41:36
  Fetcher: segment: crawl/segments/20120907084129
  Using queue mode : byHost
  Fetcher: threads: 10
  Fetcher: time-out divisor: 2
  QueueFeeder finished: total 1 records + hit by time limit :0
  Exception in thread main java.io.IOException: Job failed!
  at
  org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
  at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
 
 Please post the relevant log
 

2012-09-07 09:54:31,418 WARN  mapred.LocalJobRunner - job_local_0005
java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder
        at org.apache.nutch.parse.ParseUtil.<init>(ParseUtil.java:59)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.<init>(Fetcher.java:602)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1186)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.ThreadFactoryBuilder
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
        ... 6 more


Errors when indexing to Solr

2012-09-06 Thread Fournier, Danny G
I'm getting two different errors while trying to index Nutch crawls to
Solr. I'm running with:

- CentOS 6.3 VM (Virtualbox) (in host Windows XP)
- Solr 3.6.1
- Nutch 1.5.1

It would seem that NUTCH-1251 comes rather close to solving my issue?
Which would mean that I would have to compile Nutch 1.6 to fix this?

Error #1 - When indexing directly to Solr

Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2
-depth 3 -topN 5

Error:  Exception in thread "main" java.io.IOException:
org.apache.solr.client.solrj.SolrServerException: Error executing query

SolrIndexer: starting at 2012-09-06 14:30:11
Indexing 8 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
Exception in thread "main" java.io.IOException:
org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error
executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
        ... 16 more
Caused by: org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/core2/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
        ... 18 more

Error #2 - When indexing post-crawl

Command: bin/nutch solrindex http://localhost:8080/solr/core2
crawl/crawldb -linkdb crawl/linkdb

Error: org.apache.solr.common.SolrException: Not Found

SolrIndexer: starting at 2012-09-06 15:39:24
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/core2/update


Regards,

Dan


Re: OutOfMemoryError when indexing into Solr

2011-10-31 Thread Markus Jelsma
Thanks. We should decrease the default setting for commit.size.
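
For reference, a per-run override of that setting would look roughly like the sketch
below (solr.commit.size as discussed further down; the other arguments are placeholders,
and the exact argument order depends on the Nutch version):

bin/nutch solrindex -Dsolr.commit.size=100 <solr url> <crawldb> <linkdb> <segment>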

 Confirming that this worked. Also, times look interesting: to send 73K
 documents in 1000 doc batches (default) took 16 minutes; to send 73K
 documents in 100 doc batches took 15 minutes 24 seconds.
 
 Regards,
 
 Arkadi
 
  -Original Message-
  From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
  Sent: Friday, 28 October 2011 12:11 PM
  To: user@nutch.apache.org; markus.jel...@openindex.io
  Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
  
  Hi Markus,
  
   -Original Message-
   From: Markus Jelsma [mailto:markus.jel...@openindex.io]
   Sent: Thursday, 27 October 2011 11:33 PM
   To: user@nutch.apache.org
   Subject: Re: OutOfMemoryError when indexing into Solr
   
   Interesting, how many records and how large are your records?
  
  There are a bit more than 80,000 documents.

  <property>
    <name>http.content.limit</name> <value>15000</value>
  </property>

  <property>
    <name>indexer.max.tokens</name><value>10</value>
  </property>
  
   How did you increase JVM heap size?
  
  opts=-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled
  
   Do you have custom indexing filters?
  
   Yes. They add a few fields to each document. These fields are small,
   within a hundred bytes per document.
  
   Can you decrease the commit.size?
  
  Yes. Thank you. Good idea. I did not even consider it because, for
  whatever reason, this option was not in my nutch-default.xml. I've put
  it to 100. I hope that Solr commit is not done after sending each
  bunch. Else this would have a very negative impact on performance
  because Solr commits are very expensive.
  
   Do you also index large amounts of anchors (without deduplication)
  
  and pass in a very large linkdb?
  
  I do index anchors, but don't think that there is anything
  extraordinary about them. As I only index less than 100K pages, my
  linkdb should not be nearly as large as in cases when people index
  millions of documents.
  
   The reducer of IndexerMapReduce is a notorious RAM consumer.
  
   If reducing solr.commit.size helps, it would make sense to decrease the
   default value. Sending smaller batches of documents to Solr without
   commits is not so expensive that it is worth risking memory problems.
  
  Thanks again.
  
  Regards,
  
  Arkadi
  
    On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
     Hi,
     
     I am working with a Nutch 1.4 snapshot and having a very strange problem
     that makes the system run out of memory when indexing into Solr. This does
     not look like a trivial lack of memory problem that can be solved by
     giving more memory to the JVM. I've increased the max memory size from 2Gb
     to 3Gb, then to 6Gb, but this did not make any difference.
     
     A log extract is included below.
     
     Would anyone have any idea of how to fix this problem?
     
     Thanks,
     
     Arkadi
     
     
     2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
     2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
     2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
     java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
        at java.nio.CharBuffer.toString(CharBuffer.java:1157)
        at org.apache.hadoop.io.Text.decode(Text.java:350)
        at org.apache.hadoop.io.Text.decode(Text.java:322)
        at org.apache.hadoop.io.Text.readString(Text.java:403)
        at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer

RE: OutOfMemoryError when indexing into Solr

2011-10-30 Thread Arkadi.Kosmynin
Confirming that this worked. Also, times look interesting: to send 73K 
documents in 1000 doc batches (default) took 16 minutes; to send 73K documents 
in 100 doc batches took 15 minutes 24 seconds.

Regards,

Arkadi

 -Original Message-
 From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
 Sent: Friday, 28 October 2011 12:11 PM
 To: user@nutch.apache.org; markus.jel...@openindex.io
 Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
 
 Hi Markus,
 
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: Thursday, 27 October 2011 11:33 PM
  To: user@nutch.apache.org
  Subject: Re: OutOfMemoryError when indexing into Solr
 
  Interesting, how many records and how large are your records?
 
 There are a bit more than 80,000 documents.
 
 <property>
   <name>http.content.limit</name> <value>15000</value>
 </property>
 
 <property>
   <name>indexer.max.tokens</name><value>10</value>
 </property>
 
  How did you increase JVM heap size?
 
 opts=-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10
 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled
 
  Do you have custom indexing filters?
 
 Yes. They add a few fields to each document. These fields are small,
 within a hundred bytes per document.
 
  Can you decrease the commit.size?
 
 Yes. Thank you. Good idea. I did not even consider it because, for
 whatever reason, this option was not in my nutch-default.xml. I've put
 it to 100. I hope that Solr commit is not done after sending each
 bunch. Else this would have a very negative impact on performance
 because Solr commits are very expensive.
 
 
  Do you also index large amounts of anchors (without deduplication)
 and pass in a very large linkdb?
 
 I do index anchors, but don't think that there is anything
 extraordinary about them. As I only index less than 100K pages, my
 linkdb should not be nearly as large as in cases when people index
 millions of documents.
 
  The reducer of IndexerMapReduce is a notorious RAM consumer.
 
 If reducing solr.commit.size helps, it would make sense to decrease the
 default value. Sending smaller batches of documents to Solr without
 commits is not so expensive that it is worth risking memory problems.
 
 Thanks again.
 
 Regards,
 
 Arkadi
 
 
 
  On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
   Hi,
  
   I am working with a Nutch 1.4 snapshot and having a very strange problem
   that makes the system run out of memory when indexing into Solr. This does
   not look like a trivial lack of memory problem that can be solved by
   giving more memory to the JVM. I've increased the max memory size from 2Gb
   to 3Gb, then to 6Gb, but this did not make any difference.
  
   A log extract is included below.
  
   Would anyone have any idea of how to fix this problem?
  
   Thanks,
  
   Arkadi
  
  
   2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
   2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
   2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
   java.lang.OutOfMemoryError: Java heap space
      at java.util.Arrays.copyOfRange(Arrays.java:3209)
      at java.lang.String.<init>(String.java:215)
      at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
      at java.nio.CharBuffer.toString(CharBuffer.java:1157)
      at org.apache.hadoop.io.Text.decode(Text.java:350)
      at org.apache.hadoop.io.Text.decode(Text.java:322)
      at org.apache.hadoop.io.Text.readString(Text.java:403)
      at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
      at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
      at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
      at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
      at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
      at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
      at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
      at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
   2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
 
  --
  Markus Jelsma - CTO - Openindex
  http

Re: OutOfMemoryError when indexing into Solr

2011-10-27 Thread Fred Zimmerman
I'm having the exact same problem. I am trying to isolate whether it is a
Solr problem or a Nutch+Solr problem.

On Wed, Oct 26, 2011 at 11:54 PM, arkadi.kosmy...@csiro.au wrote:

 Hi,

 I am working with a Nutch 1.4 snapshot and having a very strange problem
 that makes the system run out of memory when indexing into Solr. This does
 not look like a trivial lack of memory problem that can be solved by giving
 more memory to the JVM. I've increased the max memory size from 2Gb to 3Gb,
 then to 6Gb, but this did not make any difference.

 A log extract is included below.

 Would anyone have any idea of how to fix this problem?

 Thanks,

 Arkadi


 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
 java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
   at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
   at java.nio.CharBuffer.toString(CharBuffer.java:1157)
   at org.apache.hadoop.io.Text.decode(Text.java:350)
   at org.apache.hadoop.io.Text.decode(Text.java:322)
   at org.apache.hadoop.io.Text.readString(Text.java:403)
   at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
   at
 org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
   at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at
 org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
   at
 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
   at
 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
   at
 org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
   at
 org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
   at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
   at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job
 failed!




Re: OutOfMemoryError when indexing into Solr

2011-10-27 Thread Markus Jelsma
Your problem is not the same judging from the stack trace on the Solr list. 
Your Solr runs OOM, not Nutch.

On Thursday 27 October 2011 14:20:10 Fred Zimmerman wrote:
 I'm having the exact same problem. I am trying to isolate whether it is a
 Solr problem or a Nutch+Solr problem.
 
 On Wed, Oct 26, 2011 at 11:54 PM, arkadi.kosmy...@csiro.au wrote:
  Hi,
  
  I am working with a Nutch 1.4 snapshot and having a very strange problem
  that makes the system run out of memory when indexing into Solr. This
  does not look like a trivial lack of memory problem that can be solved
  by giving more memory to the JVM. I've increased the max memory size
  from 2Gb to 3Gb, then to 6Gb, but this did not make any difference.
  
  A log extract is included below.
  
  Would anyone have any idea of how to fix this problem?
  
  Thanks,
  
  Arkadi
  
  
  2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
  2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
  2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
  java.lang.OutOfMemoryError: Java heap space
  
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:322)
    at org.apache.hadoop.io.Text.readString(Text.java:403)
    at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
  2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: OutOfMemoryError when indexing into Solr

2011-10-27 Thread Markus Jelsma
Interesting, how many records and how large are your records? How did you 
increase JVM heap size? Do you have custom indexing filters? Can you decrease 
the commit.size? Do you also index large amounts of anchors (without 
deduplication) and pass in a very large linkdb?

The reducer of IndexerMapReduce is a notorious RAM consumer.

On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
 Hi,
 
 I am working with a Nutch 1.4 snapshot and having a very strange problem
 that makes the system run out of memory when indexing into Solr. This does
 not look like a trivial lack of memory problem that can be solved by
 giving more memory to the JVM. I've increased the max memory size from 2Gb
 to 3Gb, then to 6Gb, but this did not make any difference.
 
 A log extract is included below.
 
 Would anyone have any idea of how to fix this problem?
 
 Thanks,
 
 Arkadi
 
 
 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
 java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:322)
    at org.apache.hadoop.io.Text.readString(Text.java:403)
    at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: OutOfMemoryError when indexing into Solr

2011-10-27 Thread Arkadi.Kosmynin
Hi Markus,

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Thursday, 27 October 2011 11:33 PM
 To: user@nutch.apache.org
 Subject: Re: OutOfMemoryError when indexing into Solr
 
 Interesting, how many records and how large are your records?

There are a bit more than 80,000 documents.

<property>
  <name>http.content.limit</name> <value>15000</value>
</property>

<property>
  <name>indexer.max.tokens</name><value>10</value>
</property>

 How did you increase JVM heap size?

opts=-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10 
-XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled
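
A side note on where that heap setting actually lands: in the log quoted further down, 
the job ran under Hadoop's LocalJobRunner, so the submitting JVM's -Xmx is what counts. 
If the same indexing job runs on a distributed Hadoop 1.x cluster, the reduce task 
(where IndexerMapReduce runs out of memory) gets its heap from mapred.child.java.opts 
instead. A minimal sketch, assuming mapred-site.xml (or nutch-site.xml) is where job 
properties are kept; -Xmx2000m is only an example value:

<property>
  <name>mapred.child.java.opts</name>
  <!-- heap given to map/reduce child tasks on a cluster; the client JVM's -Xmx does not reach them -->
  <value>-Xmx2000m</value>
</property>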

 Do you have custom indexing filters?

Yes. They add a few fields to each document. These fields are small, within a 
hundred bytes per document.

 Can you decrease the commit.size?

Yes. Thank you. Good idea. I did not even consider it because, for whatever 
reason, this option was not in my nutch-default.xml. I've put it to 100. I hope 
that Solr commit is not done after sending each bunch. Else this would have a 
very negative impact on performance because Solr commits are very expensive.  
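
For reference, the override would go into nutch-site.xml. A minimal sketch, using the 
solr.commit.size property discussed below, with 100 purely as an example batch size:

<property>
  <name>solr.commit.size</name>
  <!-- number of documents buffered per update request the indexer sends to Solr -->
  <value>100</value>
</property>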
 

 Do you also index large amounts of anchors (without deduplication) and pass 
 in a very large linkdb?

I do index anchors, but don't think that there is anything extraordinary about 
them. As I only index less than 100K pages, my linkdb should not be nearly as 
large as in cases when people index millions of documents.
 
 The reducer of IndexerMapReduce is a notorious RAM consumer.

If reducing solr.commit.size helps, it would make sense to decrease the default 
value. Sending smaller batches of documents to Solr without commits is not so 
expensive that it is worth risking memory problems.

Thanks again.

Regards,

Arkadi


 
 On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
  Hi,
 
  I am working with a Nutch 1.4 snapshot and having a very strange problem
  that makes the system run out of memory when indexing into Solr. This does
  not look like a trivial lack of memory problem that can be solved by
  giving more memory to the JVM. I've increased the max memory size from 2Gb
  to 3Gb, then to 6Gb, but this did not make any difference.
 
  A log extract is included below.
 
  Would anyone have any idea of how to fix this problem?
 
  Thanks,
 
  Arkadi
 
 
  2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
  2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
  2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
  java.lang.OutOfMemoryError: Java heap space
     at java.util.Arrays.copyOfRange(Arrays.java:3209)
     at java.lang.String.<init>(String.java:215)
     at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
     at java.nio.CharBuffer.toString(CharBuffer.java:1157)
     at org.apache.hadoop.io.Text.decode(Text.java:350)
     at org.apache.hadoop.io.Text.decode(Text.java:322)
     at org.apache.hadoop.io.Text.readString(Text.java:403)
     at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
     at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
  2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
 
 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350


Re: How to avoid splitting strings when indexing to solr

2011-08-08 Thread Marek Bachmann

On 07.08.2011 15:35, Markus Jelsma wrote:

(nutch-default.xml, lines 700-707)

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the mime-type
  in sub parts, this requires the type field to be multi valued. Set to true
  for backward compatibility. False will not split the mime-type.
  </description>
</property>


Thank you very much Markus,

I have copied this to my nutch-site.xml. It works very well now.

But I didn't have this option in my nutch-default.xml. Is there a standard 
way to find out which options I can pass to a plugin?






Hello people,

I was just wondering how to avoid that the content-type string is split
into multiple values.
For example: If a document has the content-type Application/pdf, it is
broken into three pieces (Application/pdf, Application, pdf) in the
Solr field "type".

I am not sure if this is done by nutch, or if it is an index topic in solr.

Sure someone knows the answer to that.

Thank you.




Re: How to avoid splitting strings when indexing to solr

2011-08-08 Thread Markus Jelsma
It is in the nutch-default.xml of 1.3 only. If you upgraded and copied over the 1.2 
conf you will indeed miss it.

 On 07.08.2011 15:35, Markus Jelsma wrote:
  (nutch-default.xml, lines 700-707)
  
  <property>
    <name>moreIndexingFilter.indexMimeTypeParts</name>
    <value>true</value>
    <description>Determines whether the index-more plugin will split the mime-type
    in sub parts, this requires the type field to be multi valued. Set to true
    for backward compatibility. False will not split the mime-type.
    </description>
  </property>
 
 Thank you very much Markus,
 
 I have copied this to my nutch-site.xml. It works very well now.
 
  But I didn't have this option in my nutch-default.xml. Is there a standard
  way to find out which options I can pass to a plugin?
 
  Hello people,
  
   I was just wondering how to avoid that the content-type string is split
   into multiple values.
   For example: If a document has the content-type Application/pdf, it is
   broken into three pieces (Application/pdf, Application, pdf) in the
   Solr field "type".
  
  I am not sure if this is done by nutch, or if it is an index topic in
  solr.
  
  Sure someone knows the answer to that.
  
  Thank you.


Re: How to avoid splitting strings when indexing to solr

2011-08-07 Thread Markus Jelsma
(nutch-default.xml, lines 700-707)

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the mime-type
  in sub parts, this requires the type field to be multi valued. Set to true
  for backward compatibility. False will not split the mime-type.
  </description>
</property>
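
To actually switch the splitting off, which is what the question asks for, the 
default shown above would be overridden in nutch-site.xml. A minimal sketch, 
assuming nutch-site.xml is where local overrides are kept:

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <!-- false keeps the full mime-type (e.g. application/pdf) as a single value -->
  <value>false</value>
</property>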


 Hello people,
 
 I was just wondering how to avoid that the content-type string is split
 into multiple values.
 For example: If a document has the content-type Application/pdf, it is
 broken into three pieces (Application/pdf, Application, pdf) in the
 Solr field "type".
 
 I am not sure if this is done by nutch, or if it is an index topic in solr.
 
 Sure someone knows the answer to that.
 
 Thank you.


Re: How to avoid splitting strings when indexing to solr

2011-08-05 Thread Gora Mohanty
Hi,

I'm not too familiar with Nutch these days, but my guess is that a Solr
analyser is getting applied. To have a field indexed exactly as is, use
the string fieldtype in Solr's schema.xml rather than the text fieldtype.
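
A minimal sketch of what that looks like in Solr's schema.xml; the field and type
names here are illustrative and depend on the schema.xml shipped with your Nutch
version (the "type" field may need to stay multiValued if the index-more splitting
discussed elsewhere in this thread is left enabled):

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!-- a string field is stored and indexed verbatim, with no tokenisation or analysis -->
<field name="type" type="string" stored="true" indexed="true" multiValued="true"/>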

Regards,
Gora
On 05-Aug-2011 6:35 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote:
 Hello people,

 I was just wondering how to avoid that the content-type string is split
 into multiple values.
 For example: If a document has the content-type Application/pdf, it is
 broken into three pieces (Application/pdf, Application, pdf) in the
 Solr field "type".

 I am not sure if this is done by nutch, or if it is an index topic in
solr.

 Sure someone knows the answer to that.

 Thank you.