Hi,

could you open a Jira issue for this? Please add as much information and as many examples as possible:
- After how many cycles did it happen?
- What were the fetch time and fetch interval after each cycle (or crawler run)?

You can dump the fetch-scheduling information via readdb/WebTableReader.

Thanks,
Sebastian

2013/8/8 Bai Shen <[email protected]>

> Yes, I started from scratch again. I deleted my HBase instance and
> reinjected my seed.
>
> It's definitely a bug. My problem is that I've scoured the scheduler code
> and can't find where it's going wrong. It works correctly for the first
> bunch of cycles where I check the fetch times. Then once I leave it to
> crawl overnight, I return to find the far future fetch times.
>
>
> On Wed, Aug 7, 2013 at 3:59 PM, Sebastian Nagel
> <[email protected]> wrote:
>
> > Hi,
> >
> > looks like a bug. There are a couple of issues open related to
> > fetch scheduling. Afaik, all have been observed with 1.x,
> > but 2.x should be also affected. However, the problems are
> > the opposite: too short re-fetch intervals.
> >
> > I'll keep this on the radar for NUTCH-1502.
> >
> > > Originally I had db.fetch.schedule.class set to
> > > org.apache.nutch.crawl.AdaptiveFetchSchedule. However, I changed it back
> > > to the default as I thought it was the problem. However, the behavior
> > > occurs with both it and the default scheduler.
> >
> > Did you then start from scratch again? Otherwise the next fetch time
> > is still far in the future and the fetch interval keeps too large.
> >
> > Sebastian
> >
> > On 08/07/2013 03:30 PM, Bai Shen wrote:
> > > Sorry for the delayed reply. I somehow missed it when it originally came
> > > in.
> > >
> > > db.fetch.schedule.class is unchanged
> > > db.fetch.interval.default is 86400
> > > db.fetch.interval.max is 604800
> > > db.fetch.schedule.adaptive.min_interval is 3600
> > > db.fetch.schedule.adaptive.max_interval is unchanged
> > > db.fetch.schedule.adaptive.sync_delta is unchanged
> > >
> > > Originally I had db.fetch.schedule.class set to
> > > org.apache.nutch.crawl.AdaptiveFetchSchedule. However, I changed it back
> > > to the default as I thought it was the problem. However, the behavior
> > > occurs with both it and the default scheduler.
> > >
> > >
> > > On Wed, Jul 17, 2013 at 2:57 PM, Sebastian Nagel
> > > <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> can you send values of the following properties (esp. if they differ from
> > >> default):
> > >> db.fetch.schedule.class
> > >> db.fetch.interval.default
> > >> db.fetch.interval.max
> > >> db.fetch.schedule.adaptive.min_interval
> > >> db.fetch.schedule.adaptive.max_interval
> > >> db.fetch.schedule.adaptive.sync_delta
> > >>
> > >> Sebastian
> > >>
> > >> On 07/17/2013 06:58 PM, Bai Shen wrote:
> > >>> I'm using Nutch 2.x HEAD with the default scheduler. I have the max fetch
> > >>> interval set to one week and the fetch interval set to one day.
> > >>>
> > >>> Everything seems to work correctly for a while. Pages show up as fetched
> > >>> with a fetch time of the next day. However, after a couple of days
> > >>> generate produces no urls to fetch. Looking at the url db stats shows that
> > >>> the fetch time is set months in the future.
> > >>>
> > >>> I've dug through the fetcher and scheduler code and can't see anything that
> > >>> would be causing this. Any suggestions as to what to look at next or
> > >>> things to try?
> > >>>
> > >>> Thanks.
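
For anyone collecting the per-cycle data requested in this thread, the check can be sketched as follows. It is only a rough illustration, not Nutch code: it assumes the expectation stated above, namely that with db.fetch.interval.max set to 604800 no page should be scheduled more than one week after its last fetch. The function name `suspicious_fetch_times`, the record layout (url, previous fetch time, next fetch time) and the sample values are all hypothetical. Getting the real values out of a readdb/WebTableReader dump is left to the reader, since the dump format is not shown in the thread.

```python
from datetime import datetime, timedelta, timezone

# Values reported in the thread, in seconds.
FETCH_INTERVAL_DEFAULT = 86400  # db.fetch.interval.default (one day)
FETCH_INTERVAL_MAX = 604800     # db.fetch.interval.max (one week)


def suspicious_fetch_times(records, max_interval=FETCH_INTERVAL_MAX):
    """Return the records scheduled further ahead than max_interval allows.

    Each record is a hypothetical (url, prev_fetch_time, next_fetch_time)
    tuple; this layout is an assumption, not the readdb dump format.
    """
    limit = timedelta(seconds=max_interval)
    return [rec for rec in records if rec[2] - rec[1] > limit]


# Made-up sample values, for illustration only.
t0 = datetime(2013, 8, 7, 12, 0, tzinfo=timezone.utc)
records = [
    # Scheduled one day ahead: matches the default interval.
    ("http://example.com/a", t0, t0 + timedelta(seconds=FETCH_INTERVAL_DEFAULT)),
    # Scheduled months ahead: the symptom described in the thread.
    ("http://example.com/b", t0, t0 + timedelta(days=90)),
]

for url, prev, nxt in suspicious_fetch_times(records):
    print(url, "next fetch in", (nxt - prev).days, "days")
```

Running such a check after every cycle would show at which cycle the far-future fetch times first appear, which is the information asked for in the Jira issue.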

