What I find odd is the lack of the base url.  I'm wondering how that
happened.

Did you find a solution for your issue?


On Fri, May 24, 2013 at 1:18 PM, kiran chitturi
<[email protected]> wrote:

> Yes, my guess was right. The protocolStatus says that these files have an
> HTTP 404 status; it is just that the unfetched status is not updated in Nutch.
>
> I also faced a similar problem [1]. Please open a JIRA issue and report any
> findings.
>
> [1] http://find.searchhub.org/document/6e4464919811d20f#c2a5de6e93942ada
>
> On Fri, May 24, 2013 at 10:03 AM, Bai Shen <[email protected]> wrote:
>
> > I'm trying to check HBase for urls that have unfetched status, but my
> > query isn't working correctly.  No matter what I try, I don't get a match.
> >
> > scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
> > FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
> > Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
> > Bytes.toBytes('1'))}
> >
> >
> > I did manage to find one entry with an unfetched status.  It apparently
> > has no base url, so I'm assuming that's why it's not fetched.  I'm not
> > sure how that happened.  It also says protocolStatus is NOTFOUND.
> >
> >
> > On Fri, May 24, 2013 at 9:48 AM, kiran chitturi
> > <[email protected]> wrote:
> >
> > > I have seen this happen in Nutch 2.x.
> > >
> > > I would suggest checking your regex file to see the conditions, and
> > > using HBase to get the urls that have unfetched status.
> > >
> > > Also, try to check the protocol status of each unfetched url in HBase;
> > > most probably it is either 404 or some status other than 200.
> > >
> > > Hope this helps.
> > >
> > > On Fri, May 24, 2013 at 8:13 AM, Bai Shen <[email protected]> wrote:
> > >
> > > > I'm running Nutch 2.1 using HBase.
> > > >
> > > > When I run readdb -stats it shows that there are 15k unfetched urls.
> > > > However, when I run generate -topN 1000 I get no urls to be fetched.
> > > > Up until now it's been pulling a full thousand urls for each cycle.
> > > >
> > > > Any ideas?  I'm not sure what to check.
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > Kiran Chitturi
> > >
> > > <http://www.linkedin.com/in/kiranchitturi>
> > >
> >
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
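
One possible reason the scan quoted above never matches, offered as a guess and not a confirmed diagnosis: the filter compares against Bytes.toBytes('1'), which is the one-byte string "1", while the status column may hold a 4-byte integer. The sketch below assumes that Nutch 2.x writes the status through Gora as a big-endian int and that unfetched is the value 1; neither is stated in this thread. The helper names are hypothetical and only mimic the byte layouts of HBase's Bytes.toBytes(int) and Bytes.toBytes(String).

```python
import struct


def hbase_int_bytes(value: int) -> bytes:
    """Mimic Bytes.toBytes(int): 4 bytes, big-endian."""
    return struct.pack(">i", value)


def hbase_string_bytes(text: str) -> bytes:
    """Mimic Bytes.toBytes(String): the UTF-8 bytes of the text."""
    return text.encode("utf-8")


# Assumption: this is how the status cell is stored for an unfetched page.
stored = hbase_int_bytes(1)

# This is what the filter in the quoted scan compares against.
probe = hbase_string_bytes("1")

print("stored:", stored.hex())
print("probe: ", probe.hex())
print("equal: ", stored == probe)
```

If the assumption holds, an EQUAL comparison can never succeed, because the two byte arrays differ in both length and content. The comparison value would then need to be built as a 4-byte int and not as a string.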
