Apache Nutch and Solr Integration
Hello Friends,

I am new to Solr and am trying to integrate Apache Nutch 1.3 with Solr 3.2. I followed the steps explained in these two URLs: http://wiki.apache.org/nutch/RunningNutchAndSolr and http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html

I downloaded both packages; however, when I try to crawl using Cygwin I get the error: *solrUrl is not set, indexing will be skipped..* Can anyone please help me fix this issue? A pointer to any other website covering Apache Nutch and Solr integration would also be greatly helpful.

Thanks,
Serenity
Re: Apache Nutch and Solr Integration
Can you let me know when and where you are getting the error? A screenshot would be helpful.
Re: Apache Nutch and Solr Integration
You are using the crawl job, so you must specify the URL of your Solr instance on the command line. The newly updated wiki has your answer: http://wiki.apache.org/nutch/bin/nutch_crawl
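For later readers: with the Nutch 1.x crawl job, the Solr URL is passed with the -solr option. A minimal sketch (the seed directory, depth, and topN here are example values, not taken from the original post):

```shell
# One-step crawl-and-index: -solr names the Solr instance to post documents to.
# Without -solr, Nutch logs "solrUrl is not set, indexing will be skipped..."
# and finishes the crawl without indexing anything.
bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr
```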
Re: Apache Nutch and Solr Integration
Please find the attached screenshot.
Re: Apache Nutch and Solr Integration
Sorry, Serenity, somehow I don't see the attachment.
Re: [Nutch] and Solr integration
All, I realize that the documentation says to crawl first and then add to Solr, but I spent several hours running the same command through Cygwin with -solrindex http://localhost:8983/solr on the command line (e.g. bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr) and it worked. Does anyone know why it's not working for me anymore? I am using the Lucid build of Solr, which is what I was using before. I neglected to write down the command-line syntax, which is now biting me in the arse. Any tips on this one would be great! Thanks, Adam
Re: [Nutch] and Solr integration
BLEH! *facepalm* This is entirely possible to do in a single step, AS LONG AS YOU GET THE SYNTAX CORRECT ;-) See http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/ — the command is: bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr http://localhost:8983/solr. The correct param is -solr, NOT -solrindex. Cheers, Adam
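To make the fix concrete, a side-by-side sketch of the failing and working invocations (the URL and tuning values are simply the ones used in this thread):

```shell
# Broken: the crawl job has no -solrindex option, so Nutch treats the URL as a
# positional argument (note "rootUrlDir = http://localhost:8983/solr" in the log),
# and Hadoop then fails with "No FileSystem for scheme: http".
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr

# Working: -solr is the option the crawl job actually recognizes.
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr http://localhost:8983/solr
```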
[Nutch] and Solr integration
All, I have a couple of websites that I need to crawl, and the following command line used to work, I think. Solr is up and running and everything is fine there, and I can go through and index the site, but I really need the results added to Solr after the crawl. Does anyone have any idea how to make that happen, or what I'm doing wrong? These errors are being thrown from Hadoop, which I am not using at all.

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
Re: [Nutch] and Solr integration
Why are you using solrindex in the argument? It is used when we need to index already-crawled data into Solr. For more, read http://wiki.apache.org/nutch/NutchTutorial . Also, this blog is very useful for Nutch-Solr integration: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ — I integrated Nutch and Solr this way and it works well. Thanks, Kumar Anurag
Re: [Nutch] and Solr integration
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr — I've run that command before and it worked; that's why I asked. Grab Nutch from trunk, run bin/nutch, and see that it is in fact an option. It looks like Hadoop is the culprit now, and I am at a loss on how to fix it. Thanks for the feedback. Adam
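As Anurag notes, the crawl and the indexing can also be run as two separate jobs, which is what the solrindex job is for. A sketch assuming the default directory layout the crawl command produces (the exact solrindex argument order varies between Nutch 1.x releases, so check the usage message from bin/nutch solrindex on your version):

```shell
# Step 1: crawl only; no -solr option, so nothing is indexed yet.
# Output lands under crawl/ (crawldb, linkdb, segments).
bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 50

# Step 2: push the crawled data into Solr as a separate job.
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
```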