I'm trying to query HBase for URLs that have unfetched status, but my query
isn't working correctly.  No matter what I try, I don't get a match.

scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
Bytes.toBytes('1'))}
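
One thing I have not ruled out: if the status column holds a 4-byte integer
rather than text (an assumption on my part, I haven't confirmed how Nutch/Gora
serializes it), then comparing against the string '1' could never match.  A
plain Ruby sketch of the difference, using no HBase classes:

```ruby
# Sketch only: compares the bytes of the text '1' with the bytes of a
# 4-byte big-endian integer 1. The integer layout is an assumption about
# how the status value is stored, not something confirmed from Nutch.
text_bytes = '1'.bytes             # bytes of the string used in the filter
int_bytes  = [1].pack('N').bytes   # 32-bit big-endian encoding of 1

puts "text '1': #{text_bytes.inspect}"
puts "int 1:    #{int_bytes.inspect}"
puts "equal?    #{text_bytes == int_bytes}"
```

If that is what's happening, the filter value would have to be built from an
int instead of a string.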


I did manage to find one entry with an unfetched status.  It apparently has
no base URL, so I'm assuming that's why it hasn't been fetched.  I'm not sure
how that happened.  Its protocolStatus is also NOTFOUND.


On Fri, May 24, 2013 at 9:48 AM, kiran chitturi
<[email protected]> wrote:

> I have seen this happen in Nutch 2.x.
>
> I would suggest checking your regex file to see the conditions, and using
> HBase to get the URLs that have unfetched status.
>
> Also, try checking the protocol status of each unfetched URL in HBase; most
> probably it is either 404 or some status other than 200.
>
> Hope this helps.
>
> On Fri, May 24, 2013 at 8:13 AM, Bai Shen <[email protected]> wrote:
>
> > I'm running Nutch 2.1 using HBase.
> >
> > When I run readdb -stats, it shows that there are 15k unfetched URLs.
> > However, when I run generate -topN 1000, I get no URLs to be fetched.  Up
> > until now it's been pulling a full thousand URLs for each cycle.
> >
> > Any ideas?  I'm not sure what to check.
> >
> > Thanks.
> >
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
