It looks like two things are going on here:
One, the generate limit is applied per reducer, and it looks like you
have more than one segment; for each segment it generated 1000 URLs.
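As a rough illustration of that first point (a toy model, not actual Nutch code): if the -topN limit is enforced independently inside each reducer, the total number of generated URLs can reach numReducers x topN. The partitioning scheme and counts below are assumptions for the sketch.

```python
# Toy model (NOT Nutch code): -topN enforced separately in each reducer.
# With more than one reducer, the total generated count can exceed topN.

def generate(urls, top_n, num_reducers):
    """Partition URLs across reducers, then cap each partition at top_n."""
    partitions = [[] for _ in range(num_reducers)]
    for i, url in enumerate(urls):
        partitions[i % num_reducers].append(url)  # simplistic round-robin partitioning
    # Each reducer applies the limit locally, unaware of the others.
    return [p[:top_n] for p in partitions]

urls = [f"http://example.com/page{i}" for i in range(5000)]
segments = generate(urls, top_n=1000, num_reducers=2)
total = sum(len(s) for s in segments)
print(total)  # 2000 -- twice the requested topN
```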
Two, anything over and above 1000 is probably redirects. The fetcher
itself has no knowledge of the limit, and neither do its counts; it
just fetches what it is given. Redirects cause more than one entry,
with different fetch statuses, to be written into the crawl_fetch
folder under the segment. The readseg command reads that same folder
and reports total counts. I don't think there is currently a way to
filter on successfully fetched entries and ignore redirect statuses.
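A small sketch of that counting behavior (a toy model; the status names and entries are illustrative, not the real Nutch CrawlDatum constants or segment format): a redirected URL leaves a redirect-status entry behind, and the redirect target gets its own entry, so crawl_fetch holds more entries than URLs were generated.

```python
# Toy model of crawl_fetch entries (statuses are illustrative, not Nutch constants).
fetch_entries = [
    ("http://example.com/a",  "fetch_success"),
    ("http://example.com/b",  "fetch_redir_temp"),  # redirect entry...
    ("http://example.com/b2", "fetch_success"),     # ...plus its target's entry
    ("http://example.com/c",  "fetch_redir_perm"),
    ("http://example.com/c2", "fetch_success"),
]

generated = 3  # only a, b, c were in the generated segment
total = len(fetch_entries)  # what a readseg-style total count would report
successes = sum(1 for _, status in fetch_entries if status == "fetch_success")

print(total)      # 5 -- more entries than generated URLs
print(successes)  # 3
```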
Dennis
On 05/17/2010 08:52 AM, Tom Landvoigt wrote:
Hi,
I generated segments with -topN 1000, so why does the fetcher fetch
more than 1000 URLs?
Any ideas?
nu...@blub:/nutch/search> ./bin/nutch readseg -list -dir runbot/segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20100513214218  1000       2010-05-13T21:43:20  2010-05-13T23:00:13  1000     553
20100513230209  1000       2010-05-13T23:03:15  2010-05-14T00:28:32  1000     201
20100514003017  1000       2010-05-14T00:31:20  2010-05-14T02:07:23  1221     37
20100514020904  1000       2010-05-14T02:10:05  2010-05-14T03:37:45  1000     340
20100514033939  1000       2010-05-14T03:40:45  2010-05-14T05:39:52  1414     34
20100514054140  1000       2010-05-14T05:42:41  2010-05-14T08:23:45  1283     63
Thanks a lot.
---------------------
Tom Landvoigt