I'm looking at my base URL (the root of the internal site that I first injected to start the crawl). It shows a status of 2 (status_fetched). The fetch time is sometime in 2032. The prev fetch time is May 20, 2013. The fetch interval is the default 30 days. There have been 0 retries since fetch. The modified time is in 2026. The prev modified time is 0. The protocol status is SUCCESS. The parse status is success/ok.
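
To put those timestamps in perspective, here is a minimal Java sketch (not Nutch's actual generator code) that turns an epoch-millisecond fetch time into a readable date and checks whether a URL would be due once the clock is shifted forward by N days, as generate -adddays does. It assumes a URL counts as due when its fetch time is at or before the shifted clock. The class and method names (FetchTimeCheck, toReadable, daysUntilDue, isDue, millisAtUtcMidnight) are hypothetical, and the two sample dates only stand in for "today" and "sometime in 2032".

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.concurrent.TimeUnit;

public class FetchTimeCheck {

    // Render an epoch-millisecond timestamp as an ISO-8601 UTC string.
    public static String toReadable(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // Whole days between "now" and the stored fetch time.
    public static long daysUntilDue(long fetchTimeMillis, long nowMillis) {
        return Duration.ofMillis(fetchTimeMillis - nowMillis).toDays();
    }

    // Assumption: a URL is due when fetchTime <= now + addDays.
    public static boolean isDue(long fetchTimeMillis, long nowMillis, int addDays) {
        long shiftedNow = nowMillis + TimeUnit.DAYS.toMillis(addDays);
        return fetchTimeMillis <= shiftedNow;
    }

    // Helper for building sample timestamps at 00:00 UTC on a given date.
    public static long millisAtUtcMidnight(int year, int month, int day) {
        return LocalDate.of(year, month, day)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        long now = millisAtUtcMidnight(2013, 6, 3);        // sample "today"
        long fetchTime = millisAtUtcMidnight(2032, 1, 1);  // sample "sometime in 2032"

        System.out.println("fetch time:     " + toReadable(fetchTime));
        System.out.println("days until due: " + daysUntilDue(fetchTime, now));
        System.out.println("due with +30:   " + isDue(fetchTime, now, 30));
        System.out.println("due with +6000: " + isDue(fetchTime, now, 6000));
    }
}
```

With these sample dates, a shift of 30 days is nowhere near enough to make the URL due, which would be consistent with seeing no refetches after adding 30 days.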
How do I convert the metadata field to readable text?

On Mon, Jun 3, 2013 at 11:39 AM, Tejas Patil <[email protected]> wrote:

> "It's not a freshly injected url."
> I am smelling that those urls were attempted to be fetched but that failed
> and so their retry interval was incremented to a larger value. Can't say
> for sure though.
> Can you share the crawl datum ? The status and meta fields can give some
> clue.
>
> On Mon, Jun 3, 2013 at 8:30 AM, Bai Shen <[email protected]> wrote:
>
> > The time on the machine is set correctly.
> >
> > It's an internal website. Is there anything in particular in the crawl
> > datum you're looking for?
> >
> > It's not a freshly injected url. And it seems like all of my urls have the
> > long fetch times. And it seems odd that the max interval wouldn't cause a
> > fetch.
> >
> > I was able to do generate -adddays 6000 and run a fetch. I'm waiting for
> > it to finish so I can check if this created long fetch times as well.
> >
> > On Mon, Jun 3, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
> >
> > > On Mon, Jun 3, 2013 at 6:53 AM, feng lu <[email protected]> wrote:
> > >
> > > > I see that nutch2.x will use the underlying operating system time to set
> > > > the FetchTime. like this
> > > >
> > > > fit.page.setFetchTime(System.currentTimeMillis());
> > > >
> > > > The granularity of the value depends on the underlying operating system. so
> > > > check your current OS time using date command.
> > > >
> > > > On Mon, Jun 3, 2013 at 8:57 PM, Bai Shen <[email protected]> wrote:
> > > >
> > > > > I'm using the 2.x head and even with adding 30 days I'm not getting any
> > > > > refetches. I did a readdb on my injected url and it says that the fetch
> > > > > time is in 2027.
> > >
> > > Can share the crawl datum for that url ?
> > >
> > > > > Any idea why this would occur?
> > >
> > > If it was a freshly injected url, then I would go with Fengs' advice.
> > >
> > > > > Will db.fetch.interval.max kick in and
> > > > > cause it to be fetched earlier?
> > >
> > > nope.
> > >
> > > > > Or will I have to manually change the
> > > > > fetchTime using the hbase shell?
> > >
> > > I think so.
> > >
> > > > > Thanks.
> > > >
> > > > --
> > > > Don't Grow Old, Grow Up... :-)
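
On the metadata question at the top of this message, one possible approach is sketched below. It assumes the metadata values come back as raw bytes in a ByteBuffer and that most of them hold UTF-8 text; neither assumption is confirmed anywhere in this thread. The class and method names (MetadataText, toReadable) are hypothetical. Values that are not valid UTF-8 are shown as hex instead of being mangled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MetadataText {

    // Decode a metadata value as UTF-8 text, falling back to a hex dump
    // when the bytes are not valid UTF-8 (for example, binary values).
    public static String toReadable(ByteBuffer value) {
        try {
            // duplicate() leaves the caller's buffer position untouched.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(value.duplicate())
                    .toString();
        } catch (CharacterCodingException e) {
            StringBuilder hex = new StringBuilder("0x");
            ByteBuffer raw = value.duplicate();
            while (raw.hasRemaining()) {
                hex.append(String.format("%02x", raw.get() & 0xff));
            }
            return hex.toString();
        }
    }

    public static void main(String[] args) {
        ByteBuffer text = ByteBuffer.wrap("text/html".getBytes(StandardCharsets.UTF_8));
        ByteBuffer binary = ByteBuffer.wrap(new byte[] {(byte) 0xff, (byte) 0xfe});

        System.out.println(toReadable(text));
        System.out.println(toReadable(binary));
    }
}
```

The hex fallback is a design choice for inspection only: it keeps binary values visible without guessing at their encoding.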

