What I find odd is the lack of the base url. I'm wondering how that happened.
Did you find a solution for your issue?


On Fri, May 24, 2013 at 1:18 PM, kiran chitturi <[email protected]> wrote:

> Yes, my guess was right. The protocolStatus says that these files have HTTP
> 404 status, just that unfetched status is not updated in Nutch.
>
> I also faced a similar problem [1]. Please open a jira and report any
> findings.
>
> [1] http://find.searchhub.org/document/6e4464919811d20f#c2a5de6e93942ada
>
> On Fri, May 24, 2013 at 10:03 AM, Bai Shen <[email protected]> wrote:
>
> > I'm trying to check hbase for urls that have unfetched status but my query
> > isn't working correctly. No matter what I don't get a match.
> >
> > scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
> > FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
> > Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
> > Bytes.toBytes('1'))}
> >
> > I did manage to find one entry with an unfetched status. It apparently has
> > no base url, so I'm assuming that's why it's not fetched. I'm not sure how
> > that happened. It also says protocolStatus is NOTFOUND.
> >
> > On Fri, May 24, 2013 at 9:48 AM, kiran chitturi <[email protected]> wrote:
> >
> > > I have seen this happen in Nutch 2.x.
> > >
> > > I would suggest you to check your regex file to see the conditions and use
> > > hbase to get the urls that have unfetched status.
> > >
> > > Also, try to check the protocol status of each unfetched url in HBase, most
> > > probably it is either 404 or status other than 200.
> > >
> > > Hope this helps.
> > >
> > > On Fri, May 24, 2013 at 8:13 AM, Bai Shen <[email protected]> wrote:
> > >
> > > > I'm running Nutch 2.1 using HBase.
> > > >
> > > > When I run readdb -stats I show that there are 15k unfetched urls.
> > > > However, when I run generate -topN 1000 I get no urls to be fetched. Up
> > > > until now it's been pulling a full thousand urls for each cycle.
> > > >
> > > > Any ideas? I'm not sure what to check.
> > > >
> > > > Thanks.
> > >
> > > --
> > > Kiran Chitturi
> > >
> > > <http://www.linkedin.com/in/kiranchitturi>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
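
A note on the scan in the quoted thread, in case it helps anyone who finds this later. One possible reason the filter never matched is the encoding of the comparison value. Bytes.toBytes('1') encodes the one-character string "1", while the status column may hold a binary integer. This is an assumption on my part and is not confirmed anywhere in this thread: that 'f:st' is stored as a 4-byte big-endian int and that 1 means unfetched. The sketch below is plain Ruby (no HBase needed) and the helper names are made up for illustration; it only shows how the candidate encodings differ byte for byte.

```ruby
# Plain Ruby, no HBase required: compare the byte encodings a filter value could take.
# ASSUMPTION (not confirmed in this thread): the 'f:st' cell holds the status
# as a 4-byte big-endian integer, with 1 meaning "unfetched".

# Bytes of the decimal string form, as a string-based encoding would produce.
def string_encoding(value)
  value.to_s.bytes
end

# Bytes of a 32-bit big-endian integer.
def int32_encoding(value)
  [value].pack('N').bytes
end

# Bytes of a 64-bit big-endian integer.
def int64_encoding(value)
  [value].pack('Q>').bytes
end

# Hypothetical stored cell value under the assumption above.
stored = int32_encoding(1)

candidates = {
  "string '1'" => string_encoding(1),
  'int32 1'    => int32_encoding(1),
  'int64 1'    => int64_encoding(1)
}

candidates.each do |label, bytes|
  puts "#{label}: #{bytes.inspect} equals stored? #{bytes == stored}"
end
```

If that assumption holds, an EQUAL filter built from the string form can never match, because a byte-wise comparison of a 1-byte value against a 4-byte value fails. In the HBase shell the integer would also likely need to be passed as an explicit Java int, since a plain JRuby integer tends to select the long overload and yields 8 bytes. I have not verified this against a Nutch 2.1 table.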

