Re: Nutch in production

2016-10-18 Thread lewis john mcgibbney
Hi Sachin,
Answering both of your questions here as I am catching up with some mail.

On Fri, Sep 30, 2016 at 5:04 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Sachin Shaju <sachi...@mstack.com>
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 30 Sep 2016 10:00:04 +0530
> Subject: Re: Nutch in production
> Thank you guys for your replies. I will look into the suggestions you gave.
> But I have one more query: how can I trigger Nutch from a queue system in a
> distributed environment?


Well, this is a bit trickier, of course. As per my other mailing list
thread, you can use the REST API and the Nutchserver to publish Nutch
workflows, so I would advise you to look into that.
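
To illustrate the queue-to-REST idea, here is a minimal sketch, not a
definitive implementation. Everything specific in it is an assumption rather
than something stated in this thread: the server address, the /job/create
path, the payload field names, and the in-memory queue standing in for a real
message broker. Please check the REST API documentation for your Nutch version
before relying on any of it.

```python
import json
import queue
import urllib.request

# ASSUMPTION: address of a running Nutch server; adjust to your deployment.
NUTCH_SERVER = "http://localhost:8081"


def build_job_request(crawl_id, job_type, args=None, base_url=NUTCH_SERVER):
    """Build (but do not send) an HTTP POST describing one Nutch job.

    ASSUMPTION: the endpoint path and the payload field names below are
    illustrative and must be verified against your Nutch version.
    """
    payload = {
        "crawlId": crawl_id,
        "type": job_type,
        "confId": "default",
        "args": args or {},
    }
    return urllib.request.Request(
        url=base_url + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def drain(requests_queue, send):
    """Pop every pending crawl request off the queue and pass it to `send`.

    `send` is any callable that accepts a urllib.request.Request, which
    keeps the queue-handling logic testable without a live server.
    Returns the number of requests handed to `send`.
    """
    sent = 0
    while True:
        try:
            item = requests_queue.get_nowait()
        except queue.Empty:
            return sent
        send(build_job_request(item["crawl_id"], item["type"], item.get("args")))
        sent += 1
```

Against a live server, `send` could be `urllib.request.urlopen`, and the
in-memory `queue.Queue` would be replaced by a consumer for whatever broker
you use.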


> Can the REST API be a real option in distributed mode?


As per my other thread... yes :) The one limitation is getting the injected
URLs into HDFS for use within the rest of the workflow.


> Or will I have to go for a command-line invocation of Nutch?

I think that we need to provide a patch for Nutch trunk to enable ingestion
of the injected seeds into HDFS via the REST API. Right now this
functionality is lacking. I've created a ticket for it at
https://issues.apache.org/jira/browse/NUTCH-2327

We will try to address this before the pending Nutch 1.13 release; however, I
cannot promise anything.
Thanks
Lewis


Re: Nutch in production

2016-09-29 Thread Sachin Shaju
Can I have a link to this?

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:13 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Yep also check out the work that Sujen Shah just merged (also on my team
> at JPL and USC) where you can publish events to an ActiveMQ queue from
> Nutch crawling. That should allow all sorts of production dashboards and
> analytics.

-- 
 

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you should not disseminate, distribute or copy this 
e-mail. Please notify the sender immediately and destroy all copies of this 
message and any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient 
should check this email and any attachments for the presence of viruses. 
The company accepts no liability for any damage caused by any virus 
transmitted by this email.

www.mStack.com


Re: Nutch in production

2016-09-29 Thread Mattmann, Chris A (3980)
Yep also check out the work that Sujen Shah just merged (also on my team at
JPL and USC) where you can publish events to an ActiveMQ queue from Nutch
crawling. That should allow all sorts of production dashboards and analytics.

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 





Re: Nutch in production

2016-09-29 Thread Karanjeet Singh
Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch
events which are mapped to incoming crawl requests, crawl status and much
more.

I have created a prototype of a production queue [0] that runs on top of a
supercomputer (TACC Wrangler) and is integrated with Kafka. Please have a
look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju  wrote:

> Hi,
> I was experimenting with some crawl cycles in Nutch and would like to set up
> a distributed crawl environment. But I wonder how I can trigger Nutch for
> incoming crawl requests in a production system. I read about the Nutch REST
> API. Is that the real option that I have? Or can I run Nutch as a
> continuously running distributed server by some other means?
>
> My preferred Nutch version is 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554