Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer
Hi All, I am very excited now. :) Thanks a lot to everyone for inviting me. I'm a software engineer and crawler team leader at my company in Istanbul. I have been using Apache Nutch 2.x for 10 months. My company works on search technologies. We have a huge Hadoop cluster for crawling and analyzing data. I try to handle a billion pages per month with Nutch. I am focused on developing and improving Nutch 2.x. I hope I will be beneficial to the Nutch community. Thanks, Talat Newbie PMC of Nutch
2014-04-01 17:33 GMT+03:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi Folks, We are pleased to announce that the Nutch PMC recently VOTE'd to extend an invitation to Talat inviting him to join our PMC. His ongoing mailing list contributions and code contributions (mostly) to the 2.X branch have been evident for some time now and we are really glad to have him on the team. @Talat, Please feel free to say a bit about yourself and your current interest in using and developing Nutch. Congratulations on your new role within the Nutch community. Best Lewis (on behalf of the Nutch PMC) -- Lewis
-- Talat UYARER Website: http://talat.uyarer.com Twitter: http://twitter.com/talatuyarer Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer
Congrats Talat! Well deserved! Renato M.
(In reply to Talat Uyarer ta...@uyarer.com, 2014-04-02 8:56 GMT+02:00)
Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer
Congratulations mate! With the contributions of this great team, I can see a brilliant future for Nutch 2.x from now on :) Alparslan
(In reply to Renato Marroquín Mogrovejo, 02-04-2014 09:57)
Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer
Congratulations Talat and welcome on board! Julien
(In reply to Talat Uyarer ta...@uyarer.com, 2 April 2014 07:56)
-- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException...
Hi all, I have followed all the steps to set up Nutch (2.2.1) with HBase (0.90.4) and Solr (4.7.1) as described in the book Web Crawling and Data Mining with Apache Nutch; however, I am getting the following error:

InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �27204@eualin-T430eualin-T430,37745,1396453102781
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �27204@eualin-T430eualin-T430,37745,1396453102781
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: Not a host:port pair: �27204@eualin-T430eualin-T430,37745,1396453102781
    at org.apache.hadoop.hbase.HServerAddress.init(HServerAddress.java:60)
    at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:63)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:354)
    at org.apache.hadoop.hbase.client.HBaseAdmin.init(HBaseAdmin.java:94)
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
    ... 9 more

I have searched extensively but could not find a solution. Any ideas?
Best, Adam.
One site only index.
I have indexed several sites successfully. Now I wish to index a new site without updating any of the sites already indexed. I use Nutch 2.21, MySQL 5.3 and Solr 4.7.0. How would you recommend I go about indexing only the new site? If someone can give examples of command lines, that would be amazingly helpful. Cheers, Shane.
Re: One site only index.
Hi Shane, You could use the same scripts as before and just modify regex-urlfilter.txt to restrict the crawling scope. BR, Remi
(In reply to Shane Wood sh...@cbm8bit.com, Thu, Apr 3, 2014 at 10:52 AM)
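As a rough illustration of the suggestion above, the sketch below shows what a restricted rule set for regex-urlfilter.txt could look like, and emulates in Python how such rules are evaluated as far as I understand the Nutch regex URL filter: rules are tried in order, the first pattern found in the URL decides, and a URL that matches no rule is dropped. The host newsite.example.com is a placeholder, and parse_rules and accepts are hypothetical helpers written for this example; they are not part of Nutch.

```python
import re

# Possible contents of conf/regex-urlfilter.txt for a single-site crawl.
# ASSUMPTION: "newsite.example.com" is a placeholder host, not from the thread.
RULES_TEXT = r"""
# accept only pages on the new site
+^https?://newsite\.example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Emulation only: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    # No rule matched: treat the URL as rejected.
    return False
```

With these rules, accepts(parse_rules(RULES_TEXT), url) is true only for URLs under the placeholder host, so URLs from the previously indexed sites would be filtered out wherever the URL filters are applied.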
Re: One site only index.
Can you choose a custom regex-urlfilter.txt to save editing it each time you wish to index a different site? I am surprised you can't enter a URL when generating a fetch list, i.e. /bin/nutch generate --only someurl.com --job 192833-292837. Then you fetch job 192833-292837, parse job 192833-292837 and finally update dbase job 192833-292837. Now that would be great. Thanks, will be doing it your way for now. :) Shane.
(In reply to remi tassing, 03/04/14 13:24)
Unable to crawl wiki pages through Nutch
Hi All, I am using Apache Nutch 1.7. I am able to crawl and index almost all sites except wiki pages. When I try to crawl wiki pages, it reports that the fetch of http://wiki.ibm.com/ failed with: java.net.UnknownHostException: wiki.ibm.com. Does crawling wiki pages require any additional configuration? Any assistance would be very helpful. Thanks in advance. Reddi Babu
-- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-crawl-wiki-pages-through-Nutch-tp4128772.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Unable to crawl wiki pages through Nutch
Hi reddibabu, java.net.UnknownHostException is a DNS resolution problem. When I tried to open the website, it did not open. Nutch does not have any specific configuration for wikis. Talat
(In reply to reddibabu reddybabu...@gmail.com, 3 Apr 2014 07:54)
Re: Unable to crawl wiki pages through Nutch
reddibabu, I cannot resolve wiki.ibm.com so I'm guessing Nutch can't either. Is that an internal DNS record?
(In reply to reddibabu reddybabu...@gmail.com, Wed, Apr 2, 2014 at 11:54 PM)
Re: Unable to crawl wiki pages through Nutch
If you use a local DNS server for resolving, you should list its nameservers in resolv.conf on the servers where Nutch runs. Make sure Nutch's server can resolve this site. If you only have a console, you can use lynx to check. Talat
(In reply to John Lafitte jlafi...@brandextract.com, 3 Apr 2014 08:02)
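One way to check the point above from the crawl machine itself is to ask the system resolver directly. The following is a minimal Python sketch, not something shipped with Nutch; can_resolve is a hypothetical helper name. It goes through the operating system's resolver, which on Linux normally honours the nameservers in resolv.conf, so a failure here suggests the fetcher will see the same java.net.UnknownHostException.

```python
import socket


def can_resolve(hostname):
    """Return True if the system resolver maps hostname to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved (unknown host, no DNS, etc.).
        return False
```

For example, running print(can_resolve("wiki.ibm.com")) on the server where Nutch runs shows whether that host is resolvable there; if it prints False, the resolver configuration or an internal-only DNS record is the likely cause rather than Nutch.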