Yes, my guess was right. The protocolStatus shows that these URLs returned
HTTP 404; Nutch just has not updated their unfetched status.

I also faced a similar problem [1]. Please open a JIRA issue and report your
findings.

[1] http://find.searchhub.org/document/6e4464919811d20f#c2a5de6e93942ada
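One possible reason the quoted scan never matches is sketched below. It assumes
that Nutch 2.x (through Gora) stores f:st as a 4-byte big-endian integer rather
than as text; I have not verified that against this table. The helper names
string_bytes and int_bytes are made up for illustration and only mimic what
HBase's Bytes.toBytes(String) and Bytes.toBytes(int) produce.

```ruby
# Mimics HBase Bytes.toBytes(String): the UTF-8 bytes of the text.
def string_bytes(text)
  text.encode("UTF-8").bytes
end

# Mimics HBase Bytes.toBytes(int) for a small non-negative value:
# 4 bytes, most significant byte first.
def int_bytes(value)
  [value].pack("N").bytes
end

# Nutch's CrawlStatus.STATUS_UNFETCHED is 1.
status_unfetched = 1

# What the filter in the quoted scan compares against.
filter_value = string_bytes("1")

# What the cell would hold, ASSUMING the status is stored as an int.
stored_value = int_bytes(status_unfetched)

puts "filter value: #{filter_value.inspect}"
puts "stored value: #{stored_value.inspect}"
puts "match: #{filter_value == stored_value}"
```

If that assumption holds, the comparison value in the filter would have to be
built from an integer rather than from the string '1' for EQUAL to match.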

On Fri, May 24, 2013 at 10:03 AM, Bai Shen <[email protected]> wrote:

> I'm trying to check hbase for urls that have unfetched status but my query
> isn't working correctly.  No matter what I don't get a match.
>
> scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
> FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
> Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
> Bytes.toBytes('1'))}
>
>
> I did manage to find one entry with an unfetched status.  It apparently has
> no base url, so I'm assuming that's why it's not fetched.  I'm not sure how
> that happened.  It also says protocolStatus is NOTFOUND.
>
>
> On Fri, May 24, 2013 at 9:48 AM, kiran chitturi
> <[email protected]> wrote:
>
> > I have seen this happen in Nutch 2.x.
> >
> > I would suggest checking your regex file to see the conditions, and using
> > HBase to get the URLs that have unfetched status.
> >
> > Also, try to check the protocol status of each unfetched URL in HBase;
> > most probably it is either 404 or some status other than 200.
> >
> > Hope this helps.
> >
> > On Fri, May 24, 2013 at 8:13 AM, Bai Shen <[email protected]> wrote:
> >
> > > I'm running Nutch 2.1 using HBase.
> > >
> > > When I run readdb -stats I show that there are 15k unfetched urls.
> > > However, when I run generate -topN 1000 I get no urls to be fetched. Up
> > > until now it's been pulling a full thousand urls for each cycle.
> > >
> > > Any ideas?  I'm not sure what to check.
> > >
> > > Thanks.
> > >
> >
> >
> >
> > --
> > Kiran Chitturi
> >
> > <http://www.linkedin.com/in/kiranchitturi>
> >
>



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
