Re: Nutch in production

2016-10-18 Thread lewis john mcgibbney
Hi Sachin,
Answering both of your questions here as I am catching up with some mail.

On Fri, Sep 30, 2016 at 5:04 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Sachin Shaju <sachi...@mstack.com>
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 30 Sep 2016 10:00:04 +0530
> Subject: Re: Nutch in production
> Thank you guys for your replies. I will look into the suggestions you gave.
> But I have one more query. How can I trigger nutch from a queue system in a
> distributed environment ?


Well, this is a bit more tricky, of course. As per my other mailing list
thread, you can easily use the REST API and the NutchServer to publish
Nutch workflows, so I would advise you to look into that.


> Can the REST API be a real option in distributed mode?


As per my other thread... yes :) The one limitation is getting the injected
URLs into HDFS for use within the rest of the workflow.
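To make the workflow concrete, here is a minimal sketch of building an INJECT job request for the NutchServer REST API. This is an illustrative assumption, not Nutch's documented payload verbatim: the field names ("type", "confId", "crawlId", "args", "url_dir") follow my reading of the Nutch 1.x /job/create endpoint and may differ in your release, so check the REST API docs; the server is assumed started with `bin/nutch startserver`.

```python
import json

def build_inject_job(crawl_id, seed_dir, conf_id="default"):
    """Build the JSON body for a hypothetical /job/create INJECT request.

    Field names are assumptions based on the Nutch 1.x REST API; verify
    against your release. seed_dir would need to already be in HDFS in
    distributed mode (the limitation discussed above).
    """
    return {
        "type": "INJECT",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"url_dir": seed_dir},
    }

body = build_inject_job("crawl01", "/user/nutch/seeds")
print(json.dumps(body))

# To actually submit (left commented so the sketch stays self-contained):
# import urllib.request
# req = urllib.request.Request("http://localhost:8081/job/create",
#                              data=json.dumps(body).encode(),
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```

The same pattern applies to the other job types (GENERATE, FETCH, PARSE, UPDATEDB); only the "type" and "args" change.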


> Or whether I will have to go for a command line invocation for nutch ?
>
>
I think that we need to provide a patch for Nutch trunk to enable ingestion
of the injected seeds into HDFS via the REST API. Right now this
functionality is lacking. I've created a ticket for it at
https://issues.apache.org/jira/browse/NUTCH-2327

We will try to address this before the pending Nutch 1.13 release; however, I
cannot promise anything.
Thanks,
Lewis


Re: Nutch in production

2016-09-29 Thread Sachin Shaju
Can I have a link to this ?

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:13 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Yep also check out the work that Sujen Shah just merged (also on my team
> at JPL and
> USC) where you can publish events to an ActiveMQ queue from Nutch
> crawling. That
> should allow all sorts of production dashboards and analytics.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect, Instrument Software and Science Data Systems Section (398)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 9/29/16, 10:41 AM, "Karanjeet Singh"  wrote:
>
> Hi Sachin,
>
> Just a suggestion here - you can use Apache Kafka to generate and catch
> events which are mapped to incoming crawl requests, crawl status and
> much
> more.
>
> I have created a prototype for production queue [0] which runs on top
> of a
> supercomputer (TACC Wrangler) and integrated it with Kafka. Please
> have a
> look and let me know if you have any questions.
>
> [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
>
> P.S. - There can be many solutions to this. I am just giving one.  :)
>
> Regards,
> Karanjeet Singh
> http://irds.usc.edu
>
> On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju 
> wrote:
>
> > Hi,
> >I was experimenting some crawl cycles with nutch and would like
> to setup
> > a distributed crawl environment. But I wonder how can I trigger
> nutch for
> > incoming crawl requests in a production system. I read about nutch
> REST
> > api. Is that the real option that I have ? Or can I run nutch as a
> > continuously running distributed server by any other option ?
> >
> >  My preferred nutch version is nutch 1.12.
> >
> > Regards,
> > Sachin Shaju
> >
> > sachi...@mstack.com
> > +919539887554
> >
> > --
> >
> >
> > The information contained in this electronic message and any
> attachments to
> > this message are intended for the exclusive use of the addressee(s)
> and may
> > contain proprietary, confidential or privileged information. If you
> are not
> > the intended recipient, you should not disseminate, distribute or
> copy this
> > e-mail. Please notify the sender immediately and destroy all copies
> of this
> > message and any attachments.
> >
> > WARNING: Computer viruses can be transmitted via email. The recipient
> > should check this email and any attachments for the presence of
> viruses.
> > The company accepts no liability for any damage caused by any virus
> > transmitted by this email.
> >
> > www.mStack.com
> >
>
>
>
>



Re: Nutch in production

2016-09-29 Thread Mattmann, Chris A (3980)
Yep also check out the work that Sujen Shah just merged (also on my team at JPL 
and
USC) where you can publish events to an ActiveMQ queue from Nutch crawling. That
should allow all sorts of production dashboards and analytics.
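As an illustration of how those published events could feed a dashboard, here is a small sketch of a consumer-side handler. The event fields below ("eventType", "url") are illustrative assumptions, not the exact schema Nutch's publisher emits; a real consumer would subscribe to the ActiveMQ topic with a STOMP/JMS client (e.g. stomp.py) and pass each message body to a handler like this.

```python
import json

def summarize_event(raw):
    """Parse one JSON crawl event and return a one-line summary.

    The field names here are assumed for illustration; check the schema
    actually produced by the Nutch publisher plugin you are running.
    """
    event = json.loads(raw)
    return "{} {}".format(event.get("eventType", "UNKNOWN"),
                          event.get("url", "-"))

# A sample event of the assumed shape:
sample = json.dumps({"eventType": "FETCH_END", "url": "http://example.com/"})
print(summarize_event(sample))  # FETCH_END http://example.com/
```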

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 9/29/16, 10:41 AM, "Karanjeet Singh"  wrote:

Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch
events which are mapped to incoming crawl requests, crawl status and much
more.

I have created a prototype for production queue [0] which runs on top of a
supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju  wrote:

> Hi,
>I was experimenting some crawl cycles with nutch and would like to 
setup
> a distributed crawl environment. But I wonder how can I trigger nutch for
> incoming crawl requests in a production system. I read about nutch REST
> api. Is that the real option that I have ? Or can I run nutch as a
> continuously running distributed server by any other option ?
>
>  My preferred nutch version is nutch 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554
>





Re: Nutch in production

2016-09-29 Thread Karanjeet Singh
Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch
events which are mapped to incoming crawl requests, crawl status and much
more.

I have created a prototype for production queue [0] which runs on top of a
supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
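As a sketch of the pattern described above (an assumed illustration, not the prototype's actual code), an incoming crawl request can be serialized into a queue-ready event record before being handed to a producer. The topic name and field names below are hypothetical; a real producer would use a Kafka client such as kafka-python (`KafkaProducer(...).send(topic, payload)`).

```python
import json
import time

def make_crawl_event(seed_urls, depth=2, topic="crawl-requests"):
    """Map a crawl request to the (topic, payload) pair to publish.

    Topic and field names are assumptions for illustration only.
    """
    payload = {
        "seeds": list(seed_urls),
        "depth": depth,
        "submittedAt": int(time.time()),
        "status": "QUEUED",
    }
    return topic, json.dumps(payload)

topic, payload = make_crawl_event(["http://example.com/"])
print(topic, payload)
```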

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju  wrote:

> Hi,
>I was experimenting some crawl cycles with nutch and would like to setup
> a distributed crawl environment. But I wonder how can I trigger nutch for
> incoming crawl requests in a production system. I read about nutch REST
> api. Is that the real option that I have ? Or can I run nutch as a
> continuously running distributed server by any other option ?
>
>  My preferred nutch version is nutch 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554
>



Nutch in production

2016-09-29 Thread Sachin Shaju
Hi,
   I was experimenting with some crawl cycles with Nutch and would like to set
up a distributed crawl environment, but I wonder how I can trigger Nutch for
incoming crawl requests in a production system. I read about the Nutch REST
API. Is that the real option that I have? Or can I run Nutch as a
continuously running distributed server by any other option?

 My preferred Nutch version is Nutch 1.12.

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554



Re: Please share your experience of using Nutch in production

2014-06-24 Thread Gora Mohanty
On 23 June 2014 01:44, Meraj A. Khan mera...@gmail.com wrote:
 Gora,

 Thanks for sharing your admin perspective , rest assured  I am not trying
 to circumvent any politeness requirements in any way , as I mentioned
 earlier , I am with in the crawl-delay limits that are being set by the web
 masters if any , however , you have confirmed my hunch that I might have to
 reach out to individual webmasters to try and convince them to not block my
 IP address .
[...]

If you are taking the reasonable precautions that you mentioned
earlier, there is
no reason that you should be getting banned by webmasters. Unless a crawler
is actually causing issues for the site performance, it might not even come to
the attention of the webmaster at all.

 By being at a disadvantage , I meant at a disadvantage compared to major
 players like Google, Bing and Yahoo bots , whom the webmasters probably
 would not block access, and by Nutch variant , I meant an instance of a
 customized crawler based on Nutch.

People are unlikely to ban Google et al., as there are clear benefits to
having them search one's site. If you would like special privileges, such as
being able to hit the site hard, you will have to convince the webmaster that
your crawler also brings some such benefit to them.

Regards,
Gora


Re: Please share your experience of using Nutch in production

2014-06-23 Thread Jorge Luis Betancourt Gonzalez
Why are you assuming that the webmasters are effectively going to block you?
In my experience this is the least probable scenario.

On Jun 22, 2014, at 4:14 PM, Meraj A. Khan mera...@gmail.com wrote:

 Gora,
 
 Thanks for sharing your admin perspective , rest assured  I am not trying
 to circumvent any politeness requirements in any way , as I mentioned
 earlier , I am with in the crawl-delay limits that are being set by the web
 masters if any , however , you have confirmed my hunch that I might have to
 reach out to individual webmasters to try and convince them to not block my
 IP address .
 
 Even if I have as small a number as 100 web sites to crawl , it would be a
 huge challenge for us to communicate with each and every webmaster , how
 would one go about doing that ? Also is there a standard way the web
 masters list their contact info so as to sell them the pitch to or persuade
 them to allows us to crawl their websites at a reasonable frequency?
 
 By being at a disadvantage , I meant at a disadvantage compared to major
 players like Google, Bing and Yahoo bots , whom the webmasters probably
 would not block access, and by Nutch variant , I meant an instance of a
 customized crawler based on Nutch.
 
 Thanks.
 
 
 On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty g...@mimirtech.com wrote:
 
 On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote:
 
 Hello Folks,
 
 I have  noticed that Nutch resources and mailing lists are mostly geared
 towards the usage of Nutch in research oriented projects , I would like
 to
 know from those of you who are using Nutch in production for large scale
 crawling (vertical or non-vertical) about what challenges to expect and
 how
 to overcome them.
 
 I will list a few  challenges that  I faced below and would like to hear
 from if you faced these challenges you on how you overcame these.
 
 
  1. If I were to go for a vertical search engine for websites in a
  particular domain  and follow the crawl-delay directive for
 politeness in
  the robots.txt , there is a possibility that the web master could
 still
  block my IP address and I start getting HTTP 403 forbidden/access
 denied
  messages. How can I  overcome these kind of issues , other than
 providing
  full contact info in the nutch-site.xml for the web master to get in
 touch
  with me, before blocking me ?.
 
 Er, providing full access info. is just basic politeness, and IMHO
 should become a requirement for Nutch. If you are going to hit some
 sites particularly hard, with good reasons, try contacting the website
 administrators and explaining to them why you need such access. We
 both administer, and crawl sites, and as an administrator I am quite
 willing to accept reasonable requests. After all, it is also our goal
 to promote our websites, and already most traffic on the web is
 through search engines.
 
  2. The fact that you will be considered as just another Nutch variant
 by
  web master puts you at a great level of dis-advantage , where you
 could be
  blocked from accessing the web site at the whims of the web master.
 
 Not sure what you mean by just another Nutch variant, nor why you
 think that it puts you at a disadvantage. Disadvantage compared to
 whom? Also, whims of the web master? Really? After all, it is their
 resources that you are using, and they are perfectly within their
 rights to ban you if they feel, for whatever reason, that you are
 abusing such resources.
 
  3. Can anyone share info as to how they overcame this issue when they
  were starting out , did you establish a relationship with each website
  owner/master to allows unhindered access ?
  4. Any other tips and suggestions would also be greatly appreciated.
 
 Sorry if I am misreading the above, but what you are asking for smells
 like trying to circumvent reasonable requirements. Yes, do try talking
 to website administrators. You might find them to be surprisingly
 accommodating if you are reasonable in return.
 
 Regards,
 Gora
 

VII International Summer School at UCI, June 30 to July 11, 2014. See
www.uci.cu


Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
Hello Folks,

I have noticed that Nutch resources and mailing lists are mostly geared
towards the usage of Nutch in research-oriented projects. I would like to
know, from those of you who are using Nutch in production for large-scale
crawling (vertical or non-vertical), what challenges to expect and how
to overcome them.

I will list a few challenges that I faced below and would like to hear
from those of you who faced these challenges how you overcame them.


   1. If I were to go for a vertical search engine for websites in a
   particular domain and follow the crawl-delay directive for politeness in
   robots.txt, there is a possibility that the webmaster could still
   block my IP address and I would start getting HTTP 403 Forbidden/access-denied
   messages. How can I overcome these kinds of issues, other than providing
   full contact info in nutch-site.xml for the webmaster to get in touch
   with me before blocking me?
   2. The fact that you will be considered just another Nutch variant by a
   webmaster puts you at a great disadvantage, where you could be
   blocked from accessing the website at the whim of the webmaster.
   3. Can anyone share how they overcame this issue when they
   were starting out? Did you establish a relationship with each website
   owner/master to allow unhindered access?
   4. Any other tips and suggestions would also be greatly appreciated.


Thanks.
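The crawl-delay politeness mentioned in point 1 can be illustrated with a small stand-alone sketch. Nutch handles robots.txt for you in its fetcher; this just shows the mechanics with Python's stdlib parser, using a made-up robots.txt for a hypothetical bot.

```python
import urllib.robotparser

# Parse a (hypothetical) robots.txt with a Crawl-delay directive.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
])

# A polite fetcher checks permission and honours the delay per host.
allowed = rp.can_fetch("mybot", "http://example.com/public/page.html")
blocked = rp.can_fetch("mybot", "http://example.com/private/page.html")
delay = rp.crawl_delay("mybot")  # seconds to wait between requests

print(allowed, blocked, delay)
```

Staying within the advertised delay is exactly what the thread below recommends; exceeding it is what gets IPs banned.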


Re: Please share your experience of using Nutch in production

2014-06-22 Thread Gora Mohanty
On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote:

 Hello Folks,

 I have  noticed that Nutch resources and mailing lists are mostly geared
 towards the usage of Nutch in research oriented projects , I would like to
 know from those of you who are using Nutch in production for large scale
 crawling (vertical or non-vertical) about what challenges to expect and how
 to overcome them.

 I will list a few  challenges that  I faced below and would like to hear
 from if you faced these challenges you on how you overcame these.


1. If I were to go for a vertical search engine for websites in a
particular domain  and follow the crawl-delay directive for politeness in
the robots.txt , there is a possibility that the web master could still
block my IP address and I start getting HTTP 403 forbidden/access denied
messages. How can I  overcome these kind of issues , other than providing
full contact info in the nutch-site.xml for the web master to get in touch
with me, before blocking me ?.

Er, providing full contact info is just basic politeness, and IMHO
should become a requirement for Nutch. If you are going to hit some
sites particularly hard, with good reasons, try contacting the website
administrators and explaining to them why you need such access. We
both administer, and crawl sites, and as an administrator I am quite
willing to accept reasonable requests. After all, it is also our goal
to promote our websites, and already most traffic on the web is
through search engines.
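For reference, the contact details mentioned above go into nutch-site.xml via the `http.agent.*` properties. The values below are placeholders to adapt:

```xml
<property>
  <name>http.agent.name</name>
  <value>MyCompanyBot</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/bot-info.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler-admin@example.com</value>
</property>
```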

2. The fact that you will be considered as just another Nutch variant by
web master puts you at a great level of dis-advantage , where you could be
blocked from accessing the web site at the whims of the web master.

Not sure what you mean by just another Nutch variant, nor why you
think that it puts you at a disadvantage. Disadvantage compared to
whom? Also, whims of the web master? Really? After all, it is their
resources that you are using, and they are perfectly within their
rights to ban you if they feel, for whatever reason, that you are
abusing such resources.

3. Can anyone share info as to how they overcame this issue when they
were starting out , did you establish a relationship with each website
owner/master to allows unhindered access ?
4. Any other tips and suggestions would also be greatly appreciated.

Sorry if I am misreading the above, but what you are asking for smells
like trying to circumvent reasonable requirements. Yes, do try talking
to website administrators. You might find them to be surprisingly
accommodating if you are reasonable in return.

Regards,
Gora


Re: Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
Gora,

Thanks for sharing your admin perspective. Rest assured, I am not trying
to circumvent any politeness requirements in any way. As I mentioned
earlier, I am within the crawl-delay limits being set by the webmasters,
if any. However, you have confirmed my hunch that I might have to
reach out to individual webmasters to try to convince them not to block my
IP address.

Even if I have as small a number as 100 websites to crawl, it would be a
huge challenge for us to communicate with each and every webmaster; how
would one go about doing that? Also, is there a standard way webmasters
list their contact info, so as to make the pitch to them or persuade
them to allow us to crawl their websites at a reasonable frequency?

By being at a disadvantage, I meant at a disadvantage compared to major
players like the Google, Bing and Yahoo bots, whom the webmasters probably
would not block, and by Nutch variant, I meant an instance of a
customized crawler based on Nutch.

Thanks.


On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 22 June 2014 22:07, Meraj A. Khan mera...@gmail.com wrote:
 
  Hello Folks,
 
  I have  noticed that Nutch resources and mailing lists are mostly geared
  towards the usage of Nutch in research oriented projects , I would like
 to
  know from those of you who are using Nutch in production for large scale
  crawling (vertical or non-vertical) about what challenges to expect and
 how
  to overcome them.
 
  I will list a few  challenges that  I faced below and would like to hear
  from if you faced these challenges you on how you overcame these.
 
 
 1. If I were to go for a vertical search engine for websites in a
 particular domain  and follow the crawl-delay directive for
 politeness in
 the robots.txt , there is a possibility that the web master could
 still
 block my IP address and I start getting HTTP 403 forbidden/access
 denied
 messages. How can I  overcome these kind of issues , other than
 providing
 full contact info in the nutch-site.xml for the web master to get in
 touch
 with me, before blocking me ?.

 Er, providing full access info. is just basic politeness, and IMHO
 should become a requirement for Nutch. If you are going to hit some
 sites particularly hard, with good reasons, try contacting the website
 administrators and explaining to them why you need such access. We
 both administer, and crawl sites, and as an administrator I am quite
 willing to accept reasonable requests. After all, it is also our goal
 to promote our websites, and already most traffic on the web is
 through search engines.

 2. The fact that you will be considered as just another Nutch variant
 by
 web master puts you at a great level of dis-advantage , where you
 could be
 blocked from accessing the web site at the whims of the web master.

 Not sure what you mean by just another Nutch variant, nor why you
 think that it puts you at a disadvantage. Disadvantage compared to
 whom? Also, whims of the web master? Really? After all, it is their
 resources that you are using, and they are perfectly within their
 rights to ban you if they feel, for whatever reason, that you are
 abusing such resources.

 3. Can anyone share info as to how they overcame this issue when they
 were starting out , did you establish a relationship with each website
 owner/master to allows unhindered access ?
 4. Any other tips and suggestions would also be greatly appreciated.

 Sorry if I am misreading the above, but what you are asking for smells
 like trying to circumvent reasonable requirements. Yes, do try talking
 to website administrators. You might find them to be surprisingly
 accommodating if you are reasonable in return.

 Regards,
 Gora