you can search the web for "Nutch Continuous crawling"
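The approach described further down the thread — run each Nutch job from a shell script, wait for its completion, branch on the exit code, and report failures by mail — can be sketched roughly as below. The step commands, paths, and mail recipient are illustrative placeholders, not from any actual Nutch script:

```shell
#!/bin/sh
# Sketch of the exit-code pattern discussed in this thread.
# Step names, paths, and the mail recipient are placeholders.

# run_step CMD... : run one crawl step, report on failure, and
# propagate the exit code so the caller can stop the cycle.
run_step() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        # In a real setup this might pipe details to mail(1), e.g.:
        #   echo "step '$*' failed ($status)" | mail -s "crawl error" ops@example.com
        echo "step '$*' failed with exit code $status" >&2
    fi
    return "$status"
}

# One crawl cycle; each step runs only if the previous one returned 0.
# (Commented out here since it needs a Nutch installation.)
# run_step bin/nutch generate crawl/crawldb crawl/segments &&
# run_step bin/nutch fetch crawl/segments/latest &&
# run_step bin/nutch updatedb crawl/crawldb crawl/segments/latest
```

A segment that finished with code 0 can then be moved to the next queue, which is the "move it to another queue" step mentioned below.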

On Tuesday 29 November 2011 11:01:14 庄名洲 wrote:
> I'd like to know details in continuous crawling, too.
> could anyone fwd me the original email, because i'm new here. thanks to all
> of you.
> 
> 2011/11/29 庄名洲 <[email protected]>
> 
> > no agent is listed in the http.agent.name property.
> > I met this before.
> > Just rebuild with ant~~
> > And maybe you'll need .patch files to fix the source. Good luck
> > 
> > 
> > 2011/11/29 Bai Shen <[email protected]>
> > 
> >> I've changed nutch to use the pseudo-distributed mode, but it keeps
> >> erroring out that no agent is listed in the http.agent.name property.  I
> >> copied over my conf directory from local, but that didn't fix it.  What
> >> am I missing?
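As context for readers hitting the same error: http.agent.name is set in conf/nutch-site.xml, along the lines of the fragment below (the agent name is just an example). When running on Hadoop in (pseudo-)distributed mode, Nutch reads its configuration from the packaged job file rather than the local conf directory, which is presumably why copying conf/ over is not enough and why rebuilding with ant (which repackages the job file) fixes it, as suggested above.

```xml
<!-- conf/nutch-site.xml; the value is an example, pick your own agent name -->
<property>
  <name>http.agent.name</name>
  <value>MyNutchCrawler</value>
</property>
```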
> >> 
> >> On Mon, Nov 28, 2011 at 9:23 AM, Julien Nioche
> >> <[email protected]> wrote:
> >> > Simply run Nutch in pseudo-distributed mode. If you have no idea of
> >> > what this means, then it would be a good idea to have a look at
> >> > http://hadoop.apache.org/common/docs/stable/single_node_setup.html and
> >> > in particular the section mentioning
> >> > http://localhost:50030/jobtracker.jsp
> >> > 
> >> > On 28 November 2011 14:09, Bai Shen <[email protected]> wrote:
> >> > > We looked at the hadoop reporter and aren't sure how to access it
> >> > > with nutch.  Is there a certain way it works?  Can you give me an
> >> > > example? Thanks.
> >> > > 
> >> > > On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma
> >> > > <[email protected]> wrote:
> >> > > > 
> >> > > > > On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
> >> > > > > <[email protected]> wrote:
> >> > > > > > > Interesting. How do you tell if the segments have been
> >> > > > > > > fetched, etc?
> >> > > > > > 
> >> > > > > > After a job the shell script waits for its completion and
> >> > > > > > return code. If it returns 0, all is fine and we move it to
> >> > > > > > another queue. If != 0, there's an error and it reports via
> >> > > > > > mail.
> >> > > > > > 
> >> > > > > > 
> >> > > > > > 
> >> > > > > > Ah, okay. I didn't realize it was returning an error code.
> >> > > > > > 
> >> > > > > > > How do you know if there are any URLs that had problems?
> >> > > > > > 
> >> > > > > > Hadoop reporter shows statistics. There are always many errors
> >> > > > > > for many reasons. This is normal because we crawl everything.
> >> > > > > 
> >> > > > > How are you running Hadoop reporter?
> >> > > > 
> >> > > > You'll get it for free when operating a Hadoop cluster.
> >> > > > 
> >> > > > > > > Or fetch jobs that errored out, etc.
> >> > > > > > 
> >> > > > > > The non-zero return code.
> >> > 
> >> > --
> >> > Open Source Solutions for Text Engineering
> >> > 
> >> > http://digitalpebble.blogspot.com/
> >> > http://www.digitalpebble.com
> > 
> > --
> > Best Regards :-)
> > Mingzhou Zhuang
> > Department of Computer Science & Technology, Tsinghua University,
> > Beijing, China

-- 
Markus Jelsma - CTO - Openindex
