It looks like two things are going on here:
One, the generate limit is applied per reducer, and it looks like you
have more than one segment; for each segment it generated 1000 URLs.
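As a rough illustration of that first point (a toy model, not actual Nutch code): if the -topN limit is enforced independently inside each reducer, the total number of generated URLs can reach numReducers x topN. The partitioning scheme and counts below are assumptions for the sketch.

```python
# Toy model (NOT Nutch code): -topN enforced separately in each reducer.
# With more than one reducer, the total generated count can exceed topN.

def generate(urls, top_n, num_reducers):
    """Partition URLs across reducers, then cap each partition at top_n."""
    partitions = [[] for _ in range(num_reducers)]
    for i, url in enumerate(urls):
        partitions[i % num_reducers].append(url)  # simplistic round-robin partitioning
    # Each reducer applies the limit locally, unaware of the others.
    return [p[:top_n] for p in partitions]

urls = [f"http://example.com/page{i}" for i in range(5000)]
segments = generate(urls, top_n=1000, num_reducers=2)
total = sum(len(s) for s in segments)
print(total)  # 2000 -- twice the requested topN
```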
Two, anything over and above 1000 is probably redirects. The fetcher
itself has no knowledge of the limit, and neither do its counts; it
just fetches what it is given. Redirects cause more than one entry,
with different fetch statuses, to be written into the crawl_fetch
folder under the segment. The readseg command reads that same folder
and reports total counts. I don't think there is currently a way to
filter on successfully fetched entries and ignore redirect statuses.
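A small sketch of that counting behavior (a toy model; the status names and entries are illustrative, not the real Nutch CrawlDatum constants or segment format): a redirected URL leaves a redirect-status entry behind, and the redirect target gets its own entry, so crawl_fetch holds more entries than URLs were generated.

```python
# Toy model of crawl_fetch entries (statuses are illustrative, not Nutch constants).
fetch_entries = [
    ("http://example.com/a",  "fetch_success"),
    ("http://example.com/b",  "fetch_redir_temp"),  # redirect entry...
    ("http://example.com/b2", "fetch_success"),     # ...plus its target's entry
    ("http://example.com/c",  "fetch_redir_perm"),
    ("http://example.com/c2", "fetch_success"),
]

generated = 3  # only a, b, c were in the generated segment
total = len(fetch_entries)  # what a readseg-style total count would report
successes = sum(1 for _, status in fetch_entries if status == "fetch_success")

print(total)      # 5 -- more entries than generated URLs
print(successes)  # 3
```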
Dennis
On 05/17/2010 08:52 AM, Tom Landvoigt wrote:
Hi,
I generated segments with -topN 1000, so why does the fetcher fetch
more than 1000 URLs?
Any ideas?
nu...@blub:/nutch/search> ./bin/nutch readseg -list -dir runbot/segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20100513214218  1000       2010-05-13T21:43:20  2010-05-13T23:00:13  1000     553
20100513230209  1000       2010-05-13T23:03:15  2010-05-14T00:28:32  1000     201
20100514003017  1000       2010-05-14T00:31:20  2010-05-14T02:07:23  1221     37
20100514020904  1000       2010-05-14T02:10:05  2010-05-14T03:37:45  1000     340
20100514033939  1000       2010-05-14T03:40:45  2010-05-14T05:39:52  1414     34
20100514054140  1000       2010-05-14T05:42:41  2010-05-14T08:23:45  1283     63
Thanks a lot.
---------------------
Tom Landvoigt