Nutch already uses it in the Fetcher. It outputs counters like those below in the Hadoop GUI, and on stdout after each job has finished.
Counter                                 Map  Reduce           Total
exception                            30,154       0          30,154
access_denied                           380       0             380
gone                                  3,159       0           3,159
moved                                18,601       0          18,601
robots_denied                         7,889       0           7,889
robots_denied_maxcrawldelay             167       0             167
hitByThrougputThreshold                   5       0               5
bytes_downloaded             24,012,066,657       0  24,012,066,657
hitByTimeLimit                        3,020       0           3,020
notmodified                          30,223       0          30,223
temp_moved                           21,653       0          21,653
success                             433,955       0         433,955
notfound                             23,384       0          23,384

On Monday 28 November 2011 15:09:49 Bai Shen wrote:
> We looked at the hadoop reporter and aren't sure how to access it with
> nutch. Is there a certain way it works? Can you give me an example?
> Thanks.
>
> On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma <[email protected]> wrote:
>>> On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma <[email protected]> wrote:
>>>>> Interesting. How do you tell if the segments have been fetched, etc?
>>>>
>>>> After a job the shell script waits for its completion and return code.
>>>> If it returns 0 all is fine and we move it to another queue. If != 0
>>>> then there's an error and it reports via mail.
>>>
>>> Ah, okay. I didn't realize it was returning an error code. How do you
>>> know if there are any urls that had problems?
>>
>> Hadoop reporter shows statistics. There are always many errors for many
>> reasons. This is normal because we crawl everything.
>>
>>> How are you running Hadoop reporter?
>>
>> You'll get it for free when operating a Hadoop cluster.
>>
>>> Or fetch jobs that errored out, etc.
>>
>> The non-zero return code.
>>
>> --
>> Markus Jelsma - CTO - Openindex
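The queue-driven workflow described in the quoted thread (run a job, wait for its exit code, advance the segment on 0, report otherwise) can be sketched roughly like this. The function name, the messages, and the queue/mail steps are illustrative assumptions, not part of Nutch; a real script would wrap something like `bin/nutch fetch <segment>` and pipe a report into `mail`:

```shell
#!/bin/sh
# Sketch of the crawl driver described above: run each job, wait for its
# exit code, and either advance the segment or flag an error.
# run_step and its messages are illustrative, not Nutch conventions.

run_step() {
  # Run the given command (e.g. bin/nutch fetch <segment>) and act on
  # its exit code: 0 means success, anything else is treated as an error.
  if "$@"; then
    echo "step ok: moving segment to the next queue"
  else
    echo "step failed (exit $?): reporting via mail" >&2
    return 1
  fi
}
```

In the real setup the success branch would `mv` the segment directory into the next queue, and the failure branch would send the report by mail, matching the "non-zero return code" check mentioned above.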

