Re: indexing to Solr
Here is another issue with the official Nutch tutorial. In the section "Integrate Solr with Nutch" it says to back up the original Solr schema.xml and replace it with the one from Nutch. It says that the original schema.xml is in the directory example/solr/collection1/conf, but there is no such directory. When I search for schema.xml, I get the following:

./solr-5.4.1/example/example-DIH/solr/solr/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/db/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/mail/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/rss/conf/schema.xml
./solr-5.4.1/example/example-DIH/solr/tika/conf/schema.xml
./solr-5.4.1/server/solr/configsets/sample_techproducts_configs/conf/schema.xml
./solr-5.4.1/server/solr/configsets/basic_configs/conf/schema.xml

It's not obvious which of these is the right one to use.
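On the Solr 5.x layout shown above, the configset closest to what the tutorial intends is basic_configs; a minimal sketch of backing up its schema and swapping in the Nutch one, assuming $NUTCH_HOME points at the Nutch install (the choice of basic_configs is an assumption, not something the tutorial states):

cd solr-5.4.1/server/solr/configsets/basic_configs/conf
cp schema.xml schema.xml.org          # keep a backup of the stock schema
cp $NUTCH_HOME/conf/schema.xml .      # drop in the schema shipped with Nutch

Any core created from basic_configs afterwards will pick up the Nutch field definitions.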
Re: indexing to Solr
Here is an issue with the official Nutch tutorial. In the "Setup Solr for search" section it gives the following instructions:

* download binary file from here
* unzip to $HOME/apache-solr, we will now refer to this as ${APACHE_SOLR_HOME}
* cd ${APACHE_SOLR_HOME}/example
* java -jar start.jar

Unfortunately, there is no start.jar in the example directory. When I instead try to use the start.jar in the server directory, Java says "WARNING: Nothing to start, exiting ..." You need something like the following to start Solr:

$APACHE_SOLR_HOME/bin/solr start -e cloud -noprompt

In this case, I am using Solr 5.4.1. Also, as mentioned previously, the tutorial says nothing about which version of Solr to use.
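If the SolrCloud example feels heavyweight, a plain standalone start also works on Solr 5.x; a sketch assuming the basic_configs configset prepared as above and an illustrative core name "nutch":

$APACHE_SOLR_HOME/bin/solr start                     # standalone mode, port 8983
$APACHE_SOLR_HOME/bin/solr create -c nutch \
  -d $APACHE_SOLR_HOME/server/solr/configsets/basic_configs/conf
$APACHE_SOLR_HOME/bin/solr status                    # confirm the instance is up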
Re: indexing to Solr
Thanks for the info, I will try again with Solr 5.4.1! I think it would be helpful if the tutorial said something about which version(s) of Solr can work with Nutch, perhaps calling attention to the ivy file you mentioned in your email. The download link in the "Setup Solr for Search" section points to a choice of 5.5.3 or 6.3.0 (at the moment). I ran into NUTCH-2267 on both of the Solr versions (6.3.0 and 5.5.3) I tried to work with.
Re: indexing to Solr
Hi Michael,

On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Date: Fri, 18 Nov 2016 21:15:14 +0000 (UTC)
> Subject: indexing to Solr
>
> Where can I find up-to-date information on indexing to Solr?

http://wiki.apache.org/nutch/NutchTutorial, in particular https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Indexing_into_Apache_Solr
If you find any issues with this tutorial then please let us know. Thank you.

> When I search the web, I find tutorials that use the deprecated solrindex
> command. I also find questions where people want to know why it doesn't
> work.

That is because the only official documentation resides at http://wiki.apache.org/nutch/NutchTutorial

> I have a good nutch 1.12 installation on a working hadoop cluster and a
> Solr 6.3.0 installation which works for their gettingstarted example.

You should use the specified version of Solr for the Nutch release. This is Solr 5.4.1, as defined in the indexer-solr plugin ivy.xml.

> I have questions like: Do I need to create a core and a collection in solr?

Yes, I would. This is explained at https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search

> Do I need an http or cloud type server? Do I need solr.zookeeper.url?

This is not a Nutch question; it is your preferred Solr configuration. If you are just starting out then I would say it is not a big deal: experiment and go with what works best for your requirements and resource capacity.

> What else needs to be set in nutch-site.xml?

Not much. For reference though, here are the Solr configuration options: https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1750-L1826

> What about schema?

This is covered in https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search

> Thanks for all the help so far!

No problem. Any more issues, ping us here and we will help. Ta
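For the nutch-site.xml question, the one property that usually matters is the Solr URL; a minimal sketch, assuming the illustrative core name "nutch" used above and that the snippet goes inside the <configuration> element of conf/nutch-site.xml:

<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/nutch</value>
  <!-- point this at the core/collection the indexer should write to -->
</property>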
indexing to Solr
Where can I find up-to-date information on indexing to Solr? When I search the web, I find tutorials that use the deprecated solrindex command. I also find questions where people want to know why it doesn't work. I have a good nutch 1.12 installation on a working hadoop cluster and a Solr 6.3.0 installation which works for their gettingstarted example. I have questions like:

Do I need to create a core and a collection in solr?
Do I need an http or cloud type server?
Do I need solr.zookeeper.url?
What else needs to be set in nutch-site.xml?
What about schema?

Thanks for all the help so far!
Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
Thanks for the reply. I just downloaded the source once again and it worked.

Regards

> On Nov 18, 2015, at 6:16 AM, Roannel Fernández Hernández <roan...@uci.cu> wrote:
>
> Hi,
>
> Check in the folder of the indexer-solr plugin whether the solr-solrj-4.10.2.jar library exists.
>
> Regards
Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
Thanks for the reply. I just downloaded the source once again and it worked.

Regards
Manish

> On Nov 18, 2015, at 6:03 AM, Roannel Fernández Hernández <roan...@uci.cu> wrote:
>
> Hi,
>
> What version of Nutch did you download exactly?
>
> Regards
Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
Hi,

What version of Nutch did you download exactly?

Regards

- Original message -
> From: "Manish Verma" <ve...@apple.com>
> To: user@nutch.apache.org
> Sent: Monday, 16 November 2015 12:36:46
> Subject: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
>
> Hi,
>
> I was using the bin version of Nutch 1.x and everything was working fine. I downloaded the
> source of Nutch 1.x and built it; in the indexing phase it throws an exception. I am using
> the crawl command below, and it runs well till parsing and fails at indexing.
>
> ./crawl -i -D solr.server.url=http://localhost:8983/solr/ /Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1
>
> I see a lot of people facing this but no answer. Please suggest.
Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
Hi,

Check in the folder of the indexer-solr plugin whether the solr-solrj-4.10.2.jar library exists.

Regards

- Original message -
> From: "Roannel Fernández Hernández" <roan...@uci.cu>
> To: user@nutch.apache.org
> Sent: Wednesday, 18 November 2015 9:03:38
> Subject: Re: [MASSMAIL]Crawl Command - Getting Exception While Indexing With Solr
>
> Hi,
>
> What version of Nutch did you download exactly?
>
> Regards
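A quick way to make the check Roannel describes, assuming a runtime built from source in the default ant layout (the solrj version number varies by Nutch release):

ls runtime/local/plugins/indexer-solr/
# a healthy build lists the plugin jar plus its libraries, including solr-solrj-*.jar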
Crawl Command - Getting Exception While Indexing With Solr
Hi,

I was using the bin version of Nutch 1.x and everything was working fine. I downloaded the source of Nutch 1.x and built it; in the indexing phase it throws the exception below. I am using the crawl command below, and it runs well till parsing but fails at indexing with this exception:

./crawl -i -D solr.server.url=http://localhost:8983/solr/ /Users/manishverma/Manish/AML/nutch/urls testCrawl5/ 1

I see a lot of people facing this but no answer. Please suggest.

Indexing 20151115233845 to index
/Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb testCrawl5//linkdb testCrawl5//segments/20151115233845
Indexer: starting at 2015-11-15 23:39:02
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/solr/client/solrj/SolrServerException
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
    at java.lang.Class.getConstructor0(Class.java:2885)
    at java.lang.Class.newInstance(Class.java:350)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:161)
    at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:55)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:121)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Caused by: java.lang.ClassNotFoundException: org.apache.solr.client.solrj.SolrServerException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 10 more
Error running: /Users/manishverma/Downloads/nutchSrc/trunk/runtime/local/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ testCrawl5//crawldb -linkdb testCrawl5//linkdb testCrawl5//segments/20151115233845

Thanks
Manish Verma
AML Search
+1 669 224 9924
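This NoClassDefFoundError typically means the built runtime is missing the indexer-solr plugin's libraries, so SolrJ never makes it onto the classpath. A sketch of a clean rebuild, assuming a source checkout with ant and network access for ivy (paths follow the default trunk layout):

cd trunk
ant clean runtime
ls runtime/local/plugins/indexer-solr/   # solr-solrj-*.jar should now be present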
Re: Problems indexing to solr 3.5 from nutch 1.8
Hi Lewis,

Thank you so much. Going in my docs.

Guy McDowell
guymcdow...@gmail.com
http://www.GuyMcDowell.com

On Sat, Sep 5, 2015 at 2:10 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi Guy,
>
> The schema is present in the conf directory as shown here
> https://github.com/apache/nutch/blob/trunk/conf/schema.xml
>
> Lewis
Re: Problems indexing to solr 3.5 from nutch 1.8
Hi Guy,

The schema is present in the conf directory, as shown here:
https://github.com/apache/nutch/blob/trunk/conf/schema.xml

Lewis

On Thu, Sep 3, 2015 at 11:13 AM, <user-digest-h...@nutch.apache.org> wrote:
> Subject: Re: Problems indexing to solr 3.5 from nutch 1.8
>
> Following the tut tells me to copy over the schema.xml, but it doesn't
> appear to be in the directory that the tut says. Or anywhere for that
> matter.
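The copy step itself, as a sketch; the destination differs by Solr version (example/solr/conf for 3.x, example/solr/collection1/conf for the 4.x example core), and both environment variables here are illustrative:

cp $NUTCH_HOME/conf/schema.xml $APACHE_SOLR_HOME/example/solr/conf/schema.xml
# restart Solr afterwards so it loads the new schema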
Re: Problems indexing to solr 3.5 from nutch 1.8
Hi Paddy,

Some comments in addition to my response. You should try upgrading to Nutch 1.10 when we release very shortly. There has been so much work done since 1.8 that you can benefit from. Keep your ears peeled here for a release candidate and then the eventual release.

Please see my responses below.

On Tue, Sep 1, 2015 at 5:00 AM, wrote:

> I'm running into problems with indexing documents crawled by nutch 1.8 into solr 3.5.
> Nutch does not report any kind of error or warning and seems to run just fine, but the
> solr index remains empty. (The logs do not show any kind of error or warning either.)

I would also check your Solr logs. When you copy over the Nutch schema.xml you must make sure that the Solr server logging does not indicate any issues were encountered during startup. If there are errors then you should resolve them.

> Is there any way to solve this issue? Is nutch 1.8 (uses solrj 3.4) compatible with
> solr 3.5 or are there any known issues?

As far as I know there are no issues. I would set logging in Nutch and Solr to DEBUG. This can be done in Nutch by editing various values from INFO --> DEBUG within conf/log4j.properties. Something very similar will be true for Solr.

Ta
Lewis
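A sketch of the log4j edit Lewis describes, assuming the stock conf/log4j.properties shipped with Nutch 1.x (the cmdstdout appender is defined there; the catch-all logger below is an illustrative addition rather than an existing line):

# conf/log4j.properties: raise Nutch logging from INFO to DEBUG
log4j.logger.org.apache.nutch=DEBUG,cmdstdout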
Re: Problems indexing to solr 3.5 from nutch 1.8
Having a similar problem in getting Nutch and Solr integrated. Newest version of both, downloaded and installed a few days ago.

Following the tut tells me to copy over the schema.xml, but it doesn't appear to be in the directory that the tut says. Or anywhere for that matter.

This is probably a rookie mistake by me, but I'm just not seeing it. Help.

Guy McDowell
guymcdow...@gmail.com
http://www.GuyMcDowell.com
Problems indexing to solr 3.5 from nutch 1.8
Hey there,

I'm running into problems with indexing documents crawled by nutch 1.8 into solr 3.5. Nutch does not report any kind of error or warning and seems to run just fine, but the solr index remains empty. (The logs do not show any kind of error or warning either.)

Is there any way to solve this issue? Is nutch 1.8 (uses solrj 3.4) compatible with solr 3.5 or are there any known issues?

Kind regards
paddy
Re: NullPointerException occured during indexing to solr from nutch 1.7 source build.
Hi,

If I am not mistaken, your Solr URL is not accurate. You should provide the Solr URL plus the core used. For instance, if your core is named collection1, the default Solr core name, then your URL should be http://solr-server:8983/solr/collection1. I believe if you review the Solr or Nutch logs, you will see that the indexing job has failed.

Regards
Ameer
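Concretely, the crawl invocation later in this thread would become the following (collection1 stands in for whatever the core is actually called):

./crawl /user/nutch/urls /tmp/nutch_1_8_first_output http://solr-server:8983/solr/collection1 1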
Re: NullPointerException occured during indexing to solr from nutch 1.7 source build.
Hi Talat,

Thanks for the information. I tried with Nutch 1.8 and it works fine; the job completed. However, I was not able to find the data indexed in Solr, even though I gave the command below where the Solr URL is mentioned:

./crawl /user/nutch/urls /tmp/nutch_1_8_first_output http://solr-server:8983/solr 1

I was assuming that, after migrating, specifying the solr-server URL while running would ensure that the crawled data gets indexed automatically into Solr. Is that not the case? If not, then how do I do it manually? :)

From: Talat Uyarer <ta...@uyarer.com>
To: user@nutch.apache.org
Date: Tue, September 2, 2014 8:35 pm
Subject: Re: NullPointerException occured during indexing to solr from nutch 1.7 source build.

Hi,

This is an issue. Below is the code of the SolrDeleteDuplicates class from Nutch 1.7 trunk, where the Solr record is deleted by the id field. As the documents don't have the url field, the id of the documents is empty, so it throws a null pointer exception when it runs. Right now I am writing on my phone, so I did not find this issue. But if you update from 1.7 to a newer version, you will not get this error.

Talat
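To run the indexing step by hand, the index job can be invoked directly, as shown elsewhere in this archive; a sketch using the paths from this thread (the segment timestamp is illustrative, and the URL should include the actual core name):

bin/nutch index -Dsolr.server.url=http://solr-server:8983/solr/collection1 \
  /tmp/nutch_1_8_first_output/crawldb \
  -linkdb /tmp/nutch_1_8_first_output/linkdb \
  /tmp/nutch_1_8_first_output/segments/20140902103000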
NullPointerException occured during indexing to solr from nutch 1.7 source build.
Hi,

I have taken the Nutch 1.7 source, copied mapred-site.xml, hdfs-site.xml, yarn-site.xml, hadoop-env.sh and core-site.xml from my Hadoop 2.3.0-cdh5.1.0, and did an ant build. Then I went to runtime/deploy/bin to start the crawling; it successfully submitted the jobs to my YARN. But later, during indexing to Solr, I'm getting the exceptions below. I have copied the schema-solr4.xml to my Solr and added exceptions in regex-urlfilter.txt for the particular website which I give for crawling in the directory urls/seed.txt.

Error: java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:443)
    at org.apache.hadoop.io.Text.set(Text.java:198)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:198)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:184)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Kindly, can anyone tell me how to solve this issue? I'm basically stuck here!!
RE: Errors when indexing to Solr
I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following error:

[root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-07 08:41:06
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
Generator: starting at 2012-09-07 08:41:21
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120907084129
Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-09-07 08:41:36
Fetcher: segment: crawl/segments/20120907084129
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I then tried to crawl with 1.5.1 (which was successful) and INDEX with 1.6-SNAPSHOT. I got this error:

[root@w7sp1-x64 nutch]# bin/nutch solrindex http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
SolrIndexer: starting at 2012-09-07 09:05:21
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://127.0.0.1:8080/solr/core2/update
RE: Errors when indexing to Solr
-Original message-
From: Fournier, Danny G <danny.fourn...@dfo-mpo.gc.ca>
Sent: Fri 07-Sep-2012 14:46
To: user@nutch.apache.org
Subject: RE: Errors when indexing to Solr

> I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following error:
>
> [root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please post the relevant log.

> I then tried to crawl with 1.5.1 (which was successful) and INDEX with 1.6-SNAPSHOT. I got this error:
>
> [root@w7sp1-x64 nutch]# bin/nutch solrindex http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
> SolrIndexer: starting at 2012-09-07 09:05:21
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> org.apache.solr.common.SolrException: Not Found
>
> Not Found
>
> request: http://127.0.0.1:8080/solr/core2/update

This is no Nutch error; there simply is no Solr running there (404), or a badly configured one.
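A quick way to verify the 404 diagnosis, assuming the stock ping handler is enabled in the core's solrconfig.xml (host, port and core name are the ones from this thread):

curl -i http://127.0.0.1:8080/solr/core2/admin/ping
# an HTTP 404 here means nothing is mounted at /solr/core2 in that servlet container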
RE: Errors when indexing to Solr
Markus,

You were right. My core was set up properly; however, it was labeled something different in the conf file. I was able to get rid of that error. Thanks! I have provided the log you asked for below...

Dan

2012-09-07 09:54:31,418 WARN mapred.LocalJobRunner - job_local_0005
java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder
    at org.apache.nutch.parse.ParseUtil.<init>(ParseUtil.java:59)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.<init>(Fetcher.java:602)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1186)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.ThreadFactoryBuilder
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
    ... 6 more
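ThreadFactoryBuilder lives in Google Guava, so this trace points to a Guava jar missing from the job classpath, which is plausible when a newer job jar is dropped into an older install. A sketch of the check, with $NUTCH_HOME standing in for the install directory (an assumption, as is the rebuild advice):

ls $NUTCH_HOME/lib | grep -i guava
# if nothing is listed, the runtime lacks its ivy-resolved dependencies;
# rebuilding from source with 'ant clean runtime' restores them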
Errors when indexing to Solr
I'm getting two different errors while trying to index Nutch crawls to Solr. I'm running with:

- CentOS 6.3 VM (Virtualbox) (in host Windows XP)
- Solr 3.6.1
- Nutch 1.5.1

It would seem that NUTCH-1251 comes rather close to solving my issue? Which would mean that I would have to compile Nutch 1.6 to fix this?

Error #1 - When indexing directly to Solr

Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2 -depth 3 -topN 5

Error:
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
SolrIndexer: starting at 2012-09-06 14:30:11
Indexing 8 documents
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198)
    ... 16 more
Caused by: org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/core2/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
    ... 18 more

Error #2 - When indexing post-crawl

Command: bin/nutch solrindex http://localhost:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb

Error:
org.apache.solr.common.SolrException: Not Found
SolrIndexer: starting at 2012-09-06 15:39:24
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/solr/core2/update

Regards,
Dan
Re: OutOfMemoryError when indexing into Solr
Thanks. We should decrease the default setting for commit.size.

> Confirming that this worked. Also, times look interesting: to send 73K documents in
> 1000 doc batches (default) took 16 minutes; to send 73K documents in 100 doc batches
> took 15 minutes 24 seconds.
>
> Regards,
> Arkadi

--
Markus Jelsma - CTO - Openindex
http
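The setting under discussion, as a config sketch; it belongs inside the <configuration> element of conf/nutch-site.xml, and the value 100 is the batch size Arkadi reports below:

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <!-- number of documents buffered before being sent to Solr in one batch -->
</property>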
Re: OutOfMemoryError when indexing into Solr
I'm having the exact same problem. I am trying to isolate whether it is a Solr problem or a Nutch+Solr problem.

On Wed, Oct 26, 2011 at 11:54 PM, arkadi.kosmy...@csiro.au wrote:

Hi,

I am working with a Nutch 1.4 snapshot and having a very strange problem that makes the system run out of memory when indexing into Solr. This does not look like a trivial lack of memory problem that can be solved by giving more memory to the JVM. I've increased the max memory size from 2Gb to 3Gb, then to 6Gb, but this did not make any difference. A log extract is included below. Would anyone have any idea of how to fix this problem?

Thanks,
Arkadi

2011-10-27 07:08:22,162 INFO solr.SolrWriter - Adding 1000 documents
2011-10-27 07:08:42,248 INFO solr.SolrWriter - Adding 1000 documents
2011-10-27 07:13:54,110 WARN mapred.LocalJobRunner - job_local_0254
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:322)
    at org.apache.hadoop.io.Text.readString(Text.java:403)
    at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
Re: OutOfMemoryError when indexing into Solr
Your problem is not the same judging from the stack trace on the Solr list. Your Solr runs OOM, not Nutch.

On Thursday 27 October 2011 14:20:10 Fred Zimmerman wrote:

I'm having the exact same problem. I am trying to isolate whether it is a Solr problem or a Nutch+Solr problem.

On Wed, Oct 26, 2011 at 11:54 PM, arkadi.kosmy...@csiro.au wrote:

Hi,

I am working with a Nutch 1.4 snapshot and having a very strange problem that makes the system run out of memory when indexing into Solr. This does not look like a trivial lack of memory problem that can be solved by giving more memory to the JVM. I've increased the max memory size from 2Gb to 3Gb, then to 6Gb, but this did not make any difference. A log extract is included below. Would anyone have any idea of how to fix this problem?

Thanks,
Arkadi

2011-10-27 07:08:22,162 INFO solr.SolrWriter - Adding 1000 documents
2011-10-27 07:08:42,248 INFO solr.SolrWriter - Adding 1000 documents
2011-10-27 07:13:54,110 WARN mapred.LocalJobRunner - job_local_0254
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:322)
    at org.apache.hadoop.io.Text.readString(Text.java:403)
    at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: OutOfMemoryError when indexing into Solr
Interesting, how many records and how large are your records? How did you increase JVM heap size? Do you have custom indexing filters? Can you decrease the commit.size? Do you also index large amounts of anchors (without deduplication) and pass in a very large linkdb? The reducer of IndexerMapReduce is a notorious RAM consumer.

On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:

Hi,

I am working with a Nutch 1.4 snapshot and having a very strange problem that makes the system run out of memory when indexing into Solr. This does not look like a trivial lack of memory problem that can be solved by giving more memory to the JVM. I've increased the max memory size from 2Gb to 3Gb, then to 6Gb, but this did not make any difference. A log extract is included below. Would anyone have any idea of how to fix this problem?

Thanks,
Arkadi

2011-10-27 07:08:22,162 INFO solr.SolrWriter - Adding 1000 documents
2011-10-27 07:08:42,248 INFO solr.SolrWriter - Adding 1000 documents
2011-10-27 07:13:54,110 WARN mapred.LocalJobRunner - job_local_0254
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:322)
    at org.apache.hadoop.io.Text.readString(Text.java:403)
    at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
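On the heap-size question: the trace above shows LocalJobRunner, so the job runs in local mode and the heap that matters is that of the client JVM; on a real cluster the reducer heap comes from the task JVMs instead. A sketch of the Hadoop 0.20/1.x-era knob for the cluster case, with a purely illustrative value:

<property>
  <name>mapred.child.java.opts</name>
  <!-- JVM options for map/reduce task children on a cluster;
       -Xmx2000m is only an example figure, not a recommendation -->
  <value>-Xmx2000m</value>
</property>

In local mode the heap of the launching script applies instead; 1.x-era bin/nutch scripts honor the NUTCH_HEAPSIZE environment variable (in MB) for this, though the exact mechanism may vary by release.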
Re: How to avoid splitting strings when indexing to solr
On 07.08.2011 15:35, Markus Jelsma wrote:

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the mime-type
  into sub parts; this requires the type field to be multi-valued. Set to true for
  backward compatibility. False will not split the mime-type.
  </description>
</property>

Thank you very much, Markus. I have copied this to my nutch-site.xml and it works very well now. But I didn't have this option in my nutch-default.xml. Is there a standard way to get informed about the options that I can pass to a plugin?

Hello people,

I was just wondering how to avoid the content-type string being split into multiple values. For example: if a document has the content-type Application/pdf, it is broken into three pieces (Application/pdf, Application, pdf) in the Solr field "type". I am not sure whether this is done by Nutch or whether it is an indexing topic in Solr. Surely someone knows the answer to that. Thank you.
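For reference, the override Marek describes is the same property placed in nutch-site.xml with the value flipped, so only the full mime-type is indexed:

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <!-- false: keep only e.g. application/pdf in the type field,
       instead of also indexing the "application" and "pdf" parts -->
  <value>false</value>
</property>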
Re: How to avoid splitting strings when indexing to solr
It is in nutch-default of 1.3 only. If you upgraded and copied over the 1.2 conf, you'll miss it indeed.

On 07.08.2011 15:35, Markus Jelsma wrote:

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the mime-type
  into sub parts; this requires the type field to be multi-valued. Set to true for
  backward compatibility. False will not split the mime-type.
  </description>
</property>

Thank you very much, Markus. I have copied this to my nutch-site.xml and it works very well now. But I didn't have this option in my nutch-default.xml. Is there a standard way to get informed about the options that I can pass to a plugin?

Hello people,

I was just wondering how to avoid the content-type string being split into multiple values. For example: if a document has the content-type Application/pdf, it is broken into three pieces (Application/pdf, Application, pdf) in the Solr field "type". I am not sure whether this is done by Nutch or whether it is an indexing topic in Solr. Surely someone knows the answer to that. Thank you.
Re: How to avoid splitting strings when indexing to solr
Hi,

Not too familiar these days with Nutch, but my guess is that a Solr analyser is getting applied. To have a field exactly as is, use the String fieldtype in Solr's schema.xml rather than the text fieldtype.

Regards,
Gora

On 05-Aug-2011 6:35 PM, Marek Bachmann m.bachm...@uni-kassel.de wrote:

Hello people,

I was just wondering how to avoid the content-type string being split into multiple values. For example: if a document has the content-type Application/pdf, it is broken into three pieces (Application/pdf, Application, pdf) in the Solr field "type". I am not sure whether this is done by Nutch or whether it is an indexing topic in Solr. Surely someone knows the answer to that. Thank you.
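A sketch of Gora's suggestion for a Solr 1.4/3.x-era schema.xml. The field name "type" and the multiValued flag follow the index-more description quoted earlier in this thread; treat the exact attributes as illustrative rather than as the stock Nutch schema:

<!-- solr.StrField stores the value verbatim (no analysis),
     so application/pdf is not tokenised into parts -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<field name="type" type="string" stored="true" indexed="true" multiValued="true"/>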