I have seen this happen in Nutch 2.x.

I suggest checking your regex URL filter file to see which conditions it
applies, and using HBase to list the URLs that have unfetched status.
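
To see why a filter file might reject URLs at generate time, here is a
rough Python sketch of how the regex filter rules are evaluated: each rule
is a "+" or "-" followed by a regex, the first rule that matches decides,
and a URL that matches no rule is dropped. The sample rules, the example
URLs, and the helper names (parse_rules, accepts) are made up for
illustration, and Python's regex dialect is not identical to Java's, so
treat this as an approximation rather than what Nutch actually runs.

```python
import re

# Hypothetical sample rules in the "+regex" / "-regex" style of a Nutch
# regex URL filter file. These are NOT taken from any real configuration.
SAMPLE_RULES = r"""
# skip image suffixes
-\.(gif|jpg|png)$
# skip URLs containing query-like characters
-[?*!@=]
# accept anything under example.com
+^https?://([a-z0-9]*\.)*example\.com/
"""


def parse_rules(text):
    """Parse rule lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule.

    A URL that matches no rule at all is rejected.
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for url in (
        "http://www.example.com/page.html",
        "http://www.example.com/search?q=nutch",
        "http://other.org/",
    ):
        print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

If you paste a handful of your unfetched URLs and your own rules into
something like this, a rule that silently rejects them should show up
quickly.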

Also, check the protocol status of each unfetched URL in HBase. Most
likely it is a 404 or some other non-200 status.

Hope this helps.

On Fri, May 24, 2013 at 8:13 AM, Bai Shen <[email protected]> wrote:

> I'm running Nutch 2.1 using HBase.
>
> When I run readdb -stats I show that there are 15k unfetched urls.
>  However, when I run generate -topN 1000 I get no urls to be fetched.  Up
> until now it's been pulling a full thousand urls for each cycle.
>
> Any ideas?  I'm not sure what to check.
>
> Thanks.
>



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
