You can search the web for "Nutch continuous crawling".

On Tuesday 29 November 2011 11:01:14 庄名洲 wrote:
> I'd like to know the details of continuous crawling, too. Could anyone fwd
> me the original email, because I'm new here. Thanks to all of you.
>
> 2011/11/29 庄名洲 <[email protected]>:
> > "no agent is listed in the http.agent.name property" - I met this before.
> > Just rebuild with ant. And maybe you'll need .patch files to fix the
> > source. Good luck.
> >
> > 2011/11/29 Bai Shen <[email protected]>:
> > > I've changed Nutch to use the pseudo-distributed mode, but it keeps
> > > erroring out that no agent is listed in the http.agent.name property.
> > > I copied over my conf directory from local, but that didn't fix it.
> > > What am I missing?
> > >
> > > On Mon, Nov 28, 2011 at 9:23 AM, Julien Nioche
> > > <[email protected]> wrote:
> > > > Simply run Nutch in pseudo-distributed mode. If you have no idea what
> > > > this means, then it would be a good idea to have a look at
> > > > http://hadoop.apache.org/common/docs/stable/single_node_setup.html
> > > > and in particular the section mentioning
> > > > http://localhost:50030/jobtracker.jsp
> > > >
> > > > On 28 November 2011 14:09, Bai Shen <[email protected]> wrote:
> > > > > We looked at the Hadoop reporter and aren't sure how to access it
> > > > > with Nutch. Is there a certain way it works? Can you give me an
> > > > > example? Thanks.
> > > > >
> > > > > On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma
> > > > > <[email protected]> wrote:
> > > > > > > On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
> > > > > > > <[email protected]> wrote:
> > > > > > > > > Interesting. How do you tell if the segments have been
> > > > > > > > > fetched, etc?
> > > > > > > >
> > > > > > > > After a job the shell script waits for its completion and
> > > > > > > > return code. If it returns 0 all is fine and we move it to
> > > > > > > > another queue. If != 0 then there's an error and it reports
> > > > > > > > via mail.
> > > > > > >
> > > > > > > Ah, okay. I didn't realize it was returning an error code.
> > > > > > >
> > > > > > > > > How do you know if there are any urls that had problems?
> > > > > > > >
> > > > > > > > Hadoop reporter shows statistics. There are always many
> > > > > > > > errors for many reasons. This is normal because we crawl
> > > > > > > > everything.
> > > > > > >
> > > > > > > How are you running Hadoop reporter?
> > > > > >
> > > > > > You'll get it for free when operating a Hadoop cluster.
> > > > > >
> > > > > > > > > Or fetch jobs that errored out, etc.
> > > > > > > >
> > > > > > > > The non-zero return code.
> > > >
> > > > --
> > > > Open Source Solutions for Text Engineering
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> >
> > --
> > Best Regards :-)
> > mingzhou zhuang
> > Department of Computer Science & Technology, Tsinghua University,
> > Beijing, China
--
Markus Jelsma - CTO - Openindex
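
A note on the http.agent.name error discussed above: when Nutch runs on a
Hadoop cluster, including pseudo-distributed mode, it typically reads its
configuration from the packaged .job file rather than from the local conf
directory, so copying conf/ over is not enough on its own; rebuilding with
ant, as suggested in the thread, repackages the edited configuration. A
minimal conf/nutch-site.xml entry might look like the following - the agent
string is only a placeholder, not a value from the thread:

  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>MyTestCrawler</value>
    </property>
  </configuration>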

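The continuous-crawl driver Markus describes is essentially a loop that runs
each Nutch phase, checks its exit code, continues on 0 and reports by mail
otherwise. The sketch below is only an illustration of that pattern, not the
script from the thread; the bin/nutch sub-commands, directory layout and mail
addresses are assumptions and depend on the Nutch version and setup.

  #!/usr/bin/env python
  # Rough sketch of a return-code-driven crawl cycle (illustrative only).
  import subprocess
  import smtplib
  from email.mime.text import MIMEText

  NUTCH = "bin/nutch"              # assumed Nutch 1.x launcher script
  CRAWLDB = "crawl/crawldb"        # hypothetical directory layout
  SEGMENTS_DIR = "crawl/segments"

  def report_failure(step, code):
      # Report a failed step via mail, as described in the thread.
      msg = MIMEText("Step '%s' exited with code %d" % (step, code))
      msg["Subject"] = "Nutch crawl step failed: %s" % step
      msg["From"] = "crawler@localhost"   # placeholder addresses
      msg["To"] = "ops@localhost"
      smtplib.SMTP("localhost").sendmail(msg["From"], [msg["To"]],
                                         msg.as_string())

  def run_step(args):
      # Run one Nutch job; 0 means success, anything else is an error.
      code = subprocess.call([NUTCH] + args)
      if code != 0:
          report_failure(" ".join(args), code)
      return code

  def crawl_cycle():
      # Each phase runs only if the previous one returned 0, mirroring
      # the "move to the next queue on success, mail on error" flow.
      if run_step(["generate", CRAWLDB, SEGMENTS_DIR]) != 0:
          return
      # Generate actually creates a timestamped segment; a fixed name is
      # used here purely for illustration.
      segment = SEGMENTS_DIR + "/latest"
      for phase in (["fetch", segment],
                    ["parse", segment],
                    ["updatedb", CRAWLDB, segment]):
          if run_step(phase) != 0:
              return

  if __name__ == "__main__":
      crawl_cycle()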
