Re: Continuous crawling

Julien Nioche Mon, 28 Nov 2011 06:24:18 -0800

Simply run Nutch in pseudo-distributed mode. If you are not sure what
this means, have a look at
http://hadoop.apache.org/common/docs/stable/single_node_setup.html and in
particular the section mentioning http://localhost:50030/jobtracker.jsp


On 28 November 2011 14:09, Bai Shen <[email protected]> wrote:

> We looked at the hadoop reporter and aren't sure how to access it with
> nutch.  Is there a certain way it works?  Can you give me an example?
> Thanks.
>
> On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma
> <[email protected]> wrote:
>
> > > On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
> > > <[email protected]> wrote:
> > >
> > > > > Interesting. How do you tell if the segments have been fetched,
> > > > > etc?
> > > >
> > > > after a job the shell script waits for its completion and return
> > > > code. If it returns 0 all is fine and we move it to another queue.
> > > > If != 0 then there's an error and reports via mail.
> > > >
> > > > Ah, okay. I didn't realize it was returning an error code.
> > > >
> > > > > How do you know if there are any urls that had problems?
> > > >
> > > > Hadoop reporter shows statistics. There are always many errors for
> > > > many reasons. This is normal because we crawl everything.
> > >
> > > How are you running Hadoop reporter?
> >
> > You'll get it for free when operating a Hadoop cluster.
> >
> > > > > Or fetch jobs that errored out, etc.
> > > >
> > > > The non-zero return code.
> >
>
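
For illustration, the return-code handling described in the quoted text
above (wait for the job, move the segment on if it returns 0, report by
mail otherwise) could be sketched roughly as follows. This is only a
sketch under assumptions, not the script actually used: the names
process_segment, next_queue and report_error are hypothetical, the mail
report is replaced by a plain callback, and the command is a stand-in
for whatever Nutch job is being launched.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def run_step(cmd: Sequence[str]) -> int:
    """Run one job, block until it finishes, and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def process_segment(
    segment: str,
    cmd: Sequence[str],
    next_queue: List[str],
    report_error: Callable[[str, int], None],
) -> bool:
    """Route a segment based on the exit code of the job run for it.

    Exit code 0: the segment is appended to next_queue (hypothetical
    stand-in for "move it to another queue").
    Any other code: report_error is called (hypothetical stand-in for
    the mail report) and the segment is not moved.
    """
    code = run_step(cmd)
    if code == 0:
        next_queue.append(segment)
        return True
    report_error(segment, code)
    return False


if __name__ == "__main__":
    # Stand-in command that exits with 0; a real setup would launch
    # the crawl job here instead.
    queue: List[str] = []
    ok = process_segment(
        "segment-1",
        [sys.executable, "-c", "import sys; sys.exit(0)"],
        queue,
        lambda seg, rc: print(f"job for {seg} failed with code {rc}"),
    )
    print(ok, queue)
```

The point of the callback is only to keep the sketch self-contained;
sending the report by mail would go where report_error is called.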



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
