Could be an explanation. Thanks a lot.

Tom
-----Original Message-----
From: Dennis Kubes [mailto:[email protected]]
Sent: Monday, 17 May 2010 16:17
To: [email protected]
Subject: Re: Generating Segments

It looks like two things are going on here. One, the generate limit is applied per reducer, and it looks like you have more than one segment; each segment was generated with 1000 URLs. Two, anything over 1000 is probably redirects. The fetcher itself has no knowledge of the limit, and neither do its counts; it just fetches what it is given. Redirects cause more than one entry, with different fetch statuses, to be written into the crawl_fetch folder under segments. The readseg command goes against that same folder and gives you total counts. I don't think there is currently a way to filter on successfully fetched URLs and ignore redirect statuses.

Dennis

On 05/17/2010 08:52 AM, Tom Landvoigt wrote:
> Hi,
>
> I generated segments with -topN 1000, but why does the fetcher fetch more
> than 1000 URLs?
>
> Any ideas?
>
> nu...@blub:/nutch/search> ./bin/nutch readseg -list -dir runbot/segments
>
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20100513214218  1000       2010-05-13T21:43:20  2010-05-13T23:00:13  1000     553
> 20100513230209  1000       2010-05-13T23:03:15  2010-05-14T00:28:32  1000     201
> 20100514003017  1000       2010-05-14T00:31:20  2010-05-14T02:07:23  1221     37
> 20100514020904  1000       2010-05-14T02:10:05  2010-05-14T03:37:45  1000     340
> 20100514033939  1000       2010-05-14T03:40:45  2010-05-14T05:39:52  1414     34
> 20100514054140  1000       2010-05-14T05:42:41  2010-05-14T08:23:45  1283     63
>
> Thanks a lot.
>
> ---------------------
> Tom Landvoigt
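[A rough illustration of Dennis's point, using the per-segment counts from the readseg listing above. The difference between FETCHED and GENERATED is the number of extra crawl_fetch entries per segment; attributing that excess to redirects is the assumption made in the thread, not something the counts themselves prove.]

```python
# Per-segment (generated, fetched) counts copied from the readseg listing.
segments = {
    "20100513214218": (1000, 1000),
    "20100513230209": (1000, 1000),
    "20100514003017": (1000, 1221),
    "20100514020904": (1000, 1000),
    "20100514033939": (1000, 1414),
    "20100514054140": (1000, 1283),
}

# -topN caps what each segment starts with; the fetcher then writes one
# crawl_fetch entry per fetch status, so redirected URLs can contribute
# more than one entry and push FETCHED above GENERATED.
for name, (generated, fetched) in segments.items():
    extra = fetched - generated
    print(f"{name}: {extra} entries beyond the generate limit")

total_extra = sum(f - g for g, f in segments.values())
print(f"total extra entries across all segments: {total_extra}")
```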

