I dropped the f column family from HBase and re-added it. Nutch filled in the columns, and now I have sane fetch times.
However, my fetchInterval is not being populated and every time I run a crawl I get the same urls. Here is my metadata after a crawl.

status: 2 (status_fetched)
fetchTime: 1370388764041
prevFetchTime: 0
fetchInterval: 0
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: SUCCESS
parseStatus: success/ok

Any ideas why?

On Tue, Jun 4, 2013 at 8:16 AM, Bai Shen <[email protected]> wrote:

> I'm looking at my base url(the root of the internal site that I first
> injected to start the crawl). It shows a status of 2 (status_fetched).
> The fetch time shows sometime in 2032.
> The prev fetch time is May 20, 2013.
> The fetch interval is the default 30 days.
> 0 retries since fetch.
> Modified time is in 2026.
> Prev modified time is 0.
> Protocol status is SUCCESS.
> Parse status is success/ok
>
> How do I convert the metadata field to readable text?
>
>
> On Mon, Jun 3, 2013 at 11:39 AM, Tejas Patil <[email protected]> wrote:
>
>> "It's not a freshly injected url."
>> I am smelling that those urls were attempted to be fetched but that failed
>> and so their retry interval was incremented to a larger value. Can't say
>> for sure though.
>> Can you share the crawl datum ? The status and meta fields can give some
>> clue.
>>
>> On Mon, Jun 3, 2013 at 8:30 AM, Bai Shen <[email protected]> wrote:
>>
>> > The time on the machine is set correctly.
>> >
>> > It's an internal website. Is there anything in particular in the crawl
>> > datum you're looking for?
>> >
>> > It's not a freshly injected url. And it seems like all of my urls have the
>> > long fetch times. And it seems odd that the max interval wouldn't cause a
>> > fetch.
>> >
>> > I was able to do generate -adddays 6000 and run a fetch. I'm waiting for
>> > it to finish so I can check if this created long fetch times as well.
>> >
>> >
>> > On Mon, Jun 3, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
>> >
>> > > On Mon, Jun 3, 2013 at 6:53 AM, feng lu <[email protected]> wrote:
>> > >
>> > > > I see that nutch2.x will use the underlying operating system time to set
>> > > > the FetchTime. like this
>> > > >
>> > > > fit.page.setFetchTime(System.currentTimeMillis());
>> > > >
>> > > > The granularity of the value depends on the underlying operating system.
>> > > > so check your current OS time using date command.
>> > > >
>> > > >
>> > > > On Mon, Jun 3, 2013 at 8:57 PM, Bai Shen <[email protected]> wrote:
>> > > >
>> > > > > I'm using the 2.x head and even with adding 30 days I'm not getting any
>> > > > > refetches. I did a readdb on my injected url and it says that the fetch
>> > > > > time is in 2027.
>> > >
>> > > Can share the crawl datum for that url ?
>> > >
>> > > > >
>> > > > > Any idea why this would occur?
>> > >
>> > > If it was a freshly injected url, then I would go with Fengs' advice.
>> > >
>> > > > > Will db.fetch.interval.max kick in and
>> > > > > cause it to be fetched earlier?
>> > >
>> > > nope.
>> > >
>> > > > > Or will I have to manually change the
>> > > > > fetchTime using the hbase shell?
>> > >
>> > > I think so.
>> > >
>> > > > >
>> > > > > Thanks.
>> > > > >
>> > > >
>> > > > --
>> > > > Don't Grow Old, Grow Up... :-)
>> > > >
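
For reading the raw values above, here is a minimal Java sketch. It is not Nutch code: the class and method names (FetchTimeCheck, toReadable, nextDue) are made up for illustration. It assumes fetchTime is in epoch milliseconds (consistent with the System.currentTimeMillis() call quoted in the thread) and that fetchInterval is in seconds. The idea that the next fetch falls at fetchTime plus fetchInterval is also an assumption, used only to show why an interval of 0 would leave a url due again right away.

```java
import java.time.Duration;
import java.time.Instant;

class FetchTimeCheck {
    // Convert an epoch-millisecond value (such as fetchTime) to ISO-8601 UTC text.
    static String toReadable(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // Assumption for illustration: next fetch is due at fetchTime + fetchInterval,
    // with fetchInterval given in seconds.
    static Instant nextDue(long fetchTimeMillis, int fetchIntervalSeconds) {
        return Instant.ofEpochMilli(fetchTimeMillis)
                .plus(Duration.ofSeconds(fetchIntervalSeconds));
    }

    public static void main(String[] args) {
        long fetchTime = 1370388764041L;        // value from the metadata above
        int thirtyDays = 30 * 24 * 60 * 60;     // 30 days expressed in seconds

        System.out.println("fetchTime:            " + toReadable(fetchTime));
        System.out.println("due, interval 0:      " + nextDue(fetchTime, 0));
        System.out.println("due, interval 30 days: " + nextDue(fetchTime, thirtyDays));
    }
}
```

The fetchTime above converts to 2013-06-04T23:32:44.041Z. Under the stated assumption, an interval of 0 gives a due time equal to the fetch time itself, which would match seeing the same urls on every crawl.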

