Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer

2014-04-02 Thread Talat Uyarer
Hi All,

I am very excited now. :) Thanks a lot to everyone for inviting me.
I'm a software engineer and the crawler team leader at my company in
Istanbul. I have been using Apache Nutch 2.x for 10 months.

My company works on search technologies. We have a huge Hadoop cluster
for crawling and analyzing data. I try to handle a billion pages per
month with Nutch. I am focused on improving Nutch 2.x, and I hope to
be useful to the Nutch community.

Thanks,
Talat
Newbie PMC of Nutch

2014-04-01 17:33 GMT+03:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
 Hi Folks,
 We are pleased to announce that the Nutch PMC recently VOTE'd to extend an
 invitation to Talat to join our PMC. His ongoing mailing list
 contributions and code contributions (mostly to the 2.X branch) have been
 evident for some time now and we are really glad to have him on the team.
 @Talat,
 Please feel free to say a bit about yourself and your current interest in
 using and developing Nutch.
 Congratulations on your new role within the Nutch community.
 Best
 Lewis
 (on behalf of the Nutch PMC)

 --
 Lewis



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer

2014-04-02 Thread Renato Marroquín Mogrovejo
Congrats Talat! Well deserved!


Renato M.



Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer

2014-04-02 Thread Alparslan Avcı
Congratulations, mate! With the contributions of this great team, I can
see a brilliant future for Nutch 2.x from now on :)


Alparslan



Re: [WELCOME] Nutch PMC Welcomes Talat Uyarer to PMC and Committer

2014-04-02 Thread Julien Nioche
Congratulations Talat and welcome on board!

Julien



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException...

2014-04-02 Thread Adamantios Corais

Hi all,

I have followed all the steps to set up Nutch (2.2.1) with HBase (0.90.4)
and Solr (4.7.1) as described in the book Web Crawling and Data Mining
with Apache Nutch; however, I am getting the following error:


InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �27204@eualin-T430eualin-T430,37745,1396453102781
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �27204@eualin-T430eualin-T430,37745,1396453102781
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: Not a host:port pair: �27204@eualin-T430eualin-T430,37745,1396453102781
    at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:60)
    at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:63)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:354)
    at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
    at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
    ... 9 more


I have searched extensively but could not find a solution. Any ideas?

Best,
Adam.
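
For readers of the archive: the value rejected in the trace is not in host:port form. Its tail, eualin-T430,37745,1396453102781, is comma-separated and contains no colon. The Python sketch below shows the kind of check that fails on such a value. It is an illustration only, not HBase's code; parse_host_port is a hypothetical helper and the port 60000 is an arbitrary example.

```python
def parse_host_port(value):
    """Split 'host:port' into (host, port); raise ValueError otherwise.

    Hypothetical helper for illustration; not HBase's implementation.
    """
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("Not a host:port pair: " + value)
    return host, int(port)


if __name__ == "__main__":
    # A well-formed pair parses cleanly (the port is an arbitrary example).
    print(parse_host_port("localhost:60000"))
    try:
        # Tail of the value from the trace: comma-separated, no colon.
        parse_host_port("eualin-T430,37745,1396453102781")
    except ValueError as err:
        print(err)
```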


One site only index.

2014-04-02 Thread Shane Wood

I have indexed several sites successfully.
Now I wish to index a new site without updating any of the sites already
indexed.


I use Nutch 2.21, MySQL 5.3 and Solr 4.7.0. How would you recommend I go
about indexing only the new site?
If someone can give example command lines, that would be amazingly
helpful.


Cheers
Shane.


Re: One site only index.

2014-04-02 Thread remi tassing
Hi Shane,

You could use the same scripts as before but just modify the
regex-urlfilter.txt to restrict the crawling scope.

BR, Remi
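
To illustrate the suggestion: regex-urlfilter.txt holds one rule per line, a + (accept) or - (reject) followed by a regular expression. The first rule that matches a URL decides its fate, and a URL that matches no rule is dropped. The Python sketch below only mimics that first-match behaviour so a rule set can be tried out before crawling. It is not Nutch code; newsite.example.com and the function names are made up for the example.

```python
import re

# Rules in regex-urlfilter.txt style: accept one host, reject everything else.
# newsite.example.com is a placeholder host, not taken from the thread.
RULES_TEXT = r"""
# skip common binary and asset suffixes
-\.(gif|jpg|png|css|js|zip|gz)$
# accept only the new site
+^https?://([a-z0-9-]+\.)*newsite\.example\.com/
# reject anything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, ignoring blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = load_rules(RULES_TEXT)
    for url in ("http://newsite.example.com/page.html",
                "http://other.example.org/"):
        print(url, "->", "accept" if accepts(rules, url) else "reject")
```

Rule order matters: putting the catch-all reject rule last is what limits the scope to the one accepted host.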



Re: One site only index.

2014-04-02 Thread Shane Wood
Can you choose a custom regex-urlfilter.txt, to save editing it each
time you wish to index a different site?


I am surprised you can't enter a URL when generating a fetch list, i.e.
something like:

/bin/nutch generate --only  someurl.com --job 192833-292837

Then you fetch job 192833-292837, parse job 192833-292837 and finally
update the database for job 192833-292837.


Now that would be great.

Thanks, I will be doing it your way for now. :)

Shane.



Unable to crawl wiki pages through Nutch

2014-04-02 Thread reddibabu
Hi All,

I am using Apache Nutch 1.7. I am able to crawl and index almost all
sites except wiki pages.
While trying to crawl wiki pages, it reports that the fetch of
http://wiki.ibm.com/ failed with: java.net.UnknownHostException:
wiki.ibm.com.

Is any additional configuration required for crawling wiki pages?
Any assistance with this would be very helpful.


Thanks in advance.
Reddi Babu



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-crawl-wiki-pages-through-Nutch-tp4128772.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Unable to crawl wiki pages through Nutch

2014-04-02 Thread Talat Uyarer
Hi reddibabu,

java.net.UnknownHostException indicates a DNS resolution problem. When I
tried to open the website, it did not load.

Nutch does not have any wiki-specific configuration.

Talat

Re: Unable to crawl wiki pages through Nutch

2014-04-02 Thread John Lafitte
reddibabu,

I cannot resolve wiki.ibm.com so I'm guessing nutch can't either.  Is that
an internal dns record?



Re: Unable to crawl wiki pages through Nutch

2014-04-02 Thread Talat Uyarer
If you use a local DNS server for resolution, you should add its
nameservers to resolv.conf on the servers where Nutch runs.

Make sure the Nutch server can resolve this site. If you only have a
console, you can use lynx to check.
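
The same check can be scripted on the machine where Nutch runs. This is a minimal Python sketch, assuming Python is available on that server. can_resolve is a hypothetical helper name; it calls socket.gethostbyname, which normally goes through the system resolver and therefore the nameservers configured in resolv.conf.

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if the resolver finds an address for hostname.

    The resolver argument exists only so the check can be exercised
    without network access; by default the system resolver is used.
    """
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


if __name__ == "__main__":
    host = "wiki.ibm.com"
    print(host, "resolves" if can_resolve(host) else "does not resolve")
```

If this reports a failure on the crawl server while the name resolves from another machine, the difference is most likely in the DNS configuration rather than in Nutch.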

Talat