It looks like two things are going on here:

One, the -topN limit is applied when segments are generated, per segment (and per reducer). You have more than one segment, and each one was generated with 1000 URLs, so the totals across segments will exceed 1000.

Two, anything over and above 1000 is probably redirects. The fetcher itself has no knowledge of the limit, and neither do its counters; it simply fetches whatever it is given. A redirect causes more than one entry, with different fetch statuses, to be written into the crawl_fetch directory under the segment. The readseg command reads that same directory and reports total counts. I don't think there is currently a way to filter on successfully fetched pages and ignore the redirect statuses.
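If you want to see the per-status breakdown yourself, one workaround is to dump a segment and count the status lines. This is only a sketch, not something Nutch provides out of the box: the segment name is taken from the listing below, the -no* flags are the usual readseg -dump options, and the exact "Status:" line format in the dump may vary between Nutch versions.

```shell
# Dump only the crawl_fetch data of one segment (segment name taken from
# the readseg -list output below; adjust paths for your install).
./bin/nutch readseg -dump runbot/segments/20100514003017 /tmp/seg_dump \
  -nocontent -nogenerate -noparse -noparsedata -noparsetext

# Count entries per fetch status. Redirect statuses show up here in
# addition to fetch_success, which is why FETCHED can exceed topN.
grep '^Status:' /tmp/seg_dump/dump | sort | uniq -c
```

The counts next to the redirect statuses should roughly account for the difference between the FETCHED column and the 1000 generated URLs.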

Dennis

On 05/17/2010 08:52 AM, Tom Landvoigt wrote:
Hi,



I generated segments with -topN 1000, so why does the fetcher fetch more
than 1000 URLs?



Any ideas?



nu...@blub:/nutch/search>  ./bin/nutch readseg -list -dir runbot/segments

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20100513214218  1000       2010-05-13T21:43:20  2010-05-13T23:00:13  1000     553
20100513230209  1000       2010-05-13T23:03:15  2010-05-14T00:28:32  1000     201
20100514003017  1000       2010-05-14T00:31:20  2010-05-14T02:07:23  1221     37
20100514020904  1000       2010-05-14T02:10:05  2010-05-14T03:37:45  1000     340
20100514033939  1000       2010-05-14T03:40:45  2010-05-14T05:39:52  1414     34
20100514054140  1000       2010-05-14T05:42:41  2010-05-14T08:23:45  1283     63





Thanks a lot.



---------------------

Tom Landvoigt
