Yep, also check out the work that Sujen Shah (also on my team at JPL and USC)
just merged, which lets you publish events from Nutch crawling to an ActiveMQ
queue. That should allow all sorts of production dashboards and analytics.
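
As a rough illustration of the kind of dashboard analytics such events could
feed, here is a minimal Python sketch. The event shape (the `type` and `url`
fields) and the event names are assumptions made for illustration, not the
format Nutch actually publishes, and the queue consumer itself is left out so
that only the aggregation step is shown.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def summarize_events(raw_messages):
    """Count crawl events per type and per host from JSON message bodies.

    Each message is assumed (hypothetically) to look like
    {"type": "fetch_success", "url": "http://example.com/a"}.
    """
    by_type = Counter()
    by_host = Counter()
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Keep a tally of unparseable messages instead of failing.
            by_type["malformed"] += 1
            continue
        if not isinstance(event, dict):
            # Valid JSON that is not an object is also treated as malformed.
            by_type["malformed"] += 1
            continue
        by_type[event.get("type", "unknown")] += 1
        host = urlparse(event.get("url", "")).netloc
        if host:
            by_host[host] += 1
    return by_type, by_host
```

A consumer reading from the queue could pass each batch of message bodies to
`summarize_events` and push the two counters to whatever dashboard is in use.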

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 

On 9/29/16, 10:41 AM, "Karanjeet Singh" <[email protected]> wrote:

    Hi Sachin,
    
    Just a suggestion here - you can use Apache Kafka to generate and catch
    events which are mapped to incoming crawl requests, crawl status and much
    more.
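
A minimal sketch of what such events might look like, in Python. The field
names (`kind`, `crawl_id`, `details`), the topic name, and the `send` callable
are all hypothetical; `send` stands in for whatever Kafka producer client is
used, and this is not the event format of the prototype linked below.

```python
import json
import time


def make_crawl_event(kind, crawl_id, **details):
    """Build one event describing a crawl request or a status change."""
    return {
        "kind": kind,            # e.g. "crawl_requested" (hypothetical name)
        "crawl_id": crawl_id,
        "timestamp": time.time(),
        "details": details,
    }


def publish(send, topic, event):
    """Serialize an event and hand it to send(topic, key_bytes, value_bytes).

    `send` is an assumed callable wrapping a real producer. The crawl ID is
    used as the message key.
    """
    key = event["crawl_id"].encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    send(topic, key, value)
```

Keying by crawl ID is a design choice: with Kafka's default partitioning,
messages with the same key go to the same partition, so the events for one
crawl are consumed in the order they were produced.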
    
    I have created a prototype for production queue [0] which runs on top of a
    supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
    look and let me know if you have any questions.
    
    [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
    
    P.S. - There can be many solutions to this. I am just giving one.  :)
    
    Regards,
    Karanjeet Singh
    http://irds.usc.edu
    
    On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <[email protected]> wrote:
    
    > Hi,
    >    I have been experimenting with some crawl cycles in Nutch and would
    > like to set up a distributed crawl environment. But I wonder how I can
    > trigger Nutch for incoming crawl requests in a production system. I read
    > about the Nutch REST API. Is that the real option that I have? Or can I
    > run Nutch as a continuously running distributed server in some other way?
    >
    >      My preferred Nutch version is 1.12.
    >
    > Regards,
    > Sachin Shaju
    >
    > [email protected]
    > +919539887554
    