Could be an explanation. Thanks a lot.

Tom

-----Original Message-----
From: Dennis Kubes [mailto:[email protected]] 
Sent: Montag, 17. Mai 2010 16:17
To: [email protected]
Subject: Re: Generating Segments

It looks like two things are going on here:

One, the generated segments are limited per reducer, and it looks like 
you have more than one segment; each segment was generated with 1000 
URLs.

Two, anything over and above 1000 is probably redirects.  The fetcher 
itself has no knowledge of the limit, and neither do its counts; it 
simply fetches whatever it is given.  Redirects cause more than one 
entry, with different fetch statuses, to be written into the 
crawl_fetch folder under segments.  The readseg command reads that same 
folder and gives you total counts.  I don't think there is currently a 
way to filter on successfully fetched URLs and ignore redirect statuses.
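Since readseg itself can't filter out redirect statuses, one workaround is to dump the segment with `readseg -dump` and tally the Status lines yourself. Here's a minimal Python sketch of that kind of post-filtering; the "Status: NN (name)" line format and the status names shown are assumptions based on CrawlDatum's text output and may differ across Nutch versions:

```python
# Hedged sketch: count fetch statuses in a `readseg -dump` text dump,
# separating successful fetches from redirect entries.
# The "Status:" line format below is an assumption, not verified output.
import re
from collections import Counter

def count_statuses(dump_text):
    """Tally each status name found in a readseg dump's Status lines."""
    counts = Counter()
    for match in re.finditer(r"Status:\s*\d+\s*\(([^)]+)\)", dump_text):
        counts[match.group(1)] += 1
    return counts

# Illustrative dump excerpt (made up, not real output from the
# segments in this thread).
sample = """\
URL: http://example.com/a
Status: 33 (fetch_success)
URL: http://example.com/b
Status: 35 (fetch_redir_temp)
URL: http://example.com/b-target
Status: 33 (fetch_success)
"""

counts = count_statuses(sample)
fetched_ok = counts.get("fetch_success", 0)
redirects = sum(v for k, v in counts.items() if "redir" in k)
print(fetched_ok, redirects)  # prints: 2 1
```

This would explain counts like 1221 or 1414 above: the total includes one entry per redirect hop on top of the successful fetches.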

Dennis

On 05/17/2010 08:52 AM, Tom Landvoigt wrote:
> Hi,
>
>
>
> I generated segments with -topN 1000, but why does the fetcher fetch
> more than 1000 URLs?
>
>
>
> Any ideas?
>
>
>
> nu...@blub:/nutch/search>  ./bin/nutch readseg -list -dir runbot/segments
>
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20100513214218  1000       2010-05-13T21:43:20  2010-05-13T23:00:13  1000     553
> 20100513230209  1000       2010-05-13T23:03:15  2010-05-14T00:28:32  1000     201
> 20100514003017  1000       2010-05-14T00:31:20  2010-05-14T02:07:23  1221     37
> 20100514020904  1000       2010-05-14T02:10:05  2010-05-14T03:37:45  1000     340
> 20100514033939  1000       2010-05-14T03:40:45  2010-05-14T05:39:52  1414     34
> 20100514054140  1000       2010-05-14T05:42:41  2010-05-14T08:23:45  1283     63
>
>
>
>
>
> Thanks a lot.
>
>
>
> ---------------------
>
> Tom Landvoigt
>
>
