[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812289#comment-13812289 ]

Hudson commented on NUTCH-1245:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #808 (See [https://builds.apache.org/job/Nutch-nutchgora/808/])
NUTCH-1588 Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1538200)
Files:
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb
> and is generated over and over again
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1245
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1245
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.7
>
>         Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch,
>                      NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again: although its fetch status is fetch_gone,
> its status in CrawlDb stays db_unfetched. Consequently, this document will
> be generated and fetched in every cycle from now on.
>
> To reproduce:
> # create a CrawlDatum in CrawlDb whose retry interval hits
>   db.fetch.interval.max (I manipulated shouldFetch() in
>   AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to
>   db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max
>   (81 days)
> # this state does not change over successive generate-fetch-update cycles;
>   here for two segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
>   Status: 1 (db_unfetched)
>   Fetch time: Thu Jan 05 16:14:21 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776461784
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> Crawl Fetch::
>   Status: 37 (fetch_gone)
>   Fetch time: Thu Jan 05 16:14:48 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776461784
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
>
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
>   Status: 1 (db_unfetched)
>   Fetch time: Thu Jan 05 16:16:23 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776583451
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> Crawl Fetch::
>   Status: 37 (fetch_gone)
>   Fetch time: Thu Jan 05 16:20:05 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776583451
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)  // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (lower than 0.35 * maxInterval)
>     datum.fetchTime = curTime  // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched
> and the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again, because (fetch time - current time) is larger
> than db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35
> times db.fetch.interval.max, and the fetch time is always close to
> current time + 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516.
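The endless generate/fetch/update loop described by the pseudo-code above can be simulated in a few lines of Java. This is an illustrative sketch only, not the actual Nutch implementation: the class name `GoneScheduleSim` is invented, and the fields mirror the pseudo-code's `fetchInterval`, `fetchTime`, `maxInterval` (with `db.fetch.interval.max` = 90 days, so 0.9 * maxInterval = 6998400 seconds, the "81 days" seen in the segment dumps).

```java
// Simulation of the buggy scheduling logic (sketch; names are hypothetical).
public class GoneScheduleSim {
    static final long DAY = 24L * 60 * 60;       // seconds
    static final long MAX_INTERVAL = 90 * DAY;   // db.fetch.interval.max (90 days)

    long fetchInterval = (long) (0.9 * MAX_INTERVAL); // 6998400 s = 81 days
    long fetchTime = 0;                               // simulated epoch seconds

    // update phase: setPageGoneSchedule + forceRefetch, per the pseudo-code
    void setPageGoneSchedule(long curTime) {
        fetchInterval = (long) (1.5 * fetchInterval); // -> 1.35 * maxInterval
        fetchTime = curTime + fetchInterval;          // far in the future
        if (MAX_INTERVAL < fetchInterval) {           // necessarily true
            // forceRefetch: interval snaps back, status reset to db_unfetched
            fetchInterval = (long) (0.9 * MAX_INTERVAL);
        }
    }

    // generate phase: shouldFetch, per the pseudo-code
    boolean shouldFetch(long curTime) {
        if (fetchTime - curTime > MAX_INTERVAL) {
            fetchTime = curTime;                      // forces a refetch
        }
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        GoneScheduleSim datum = new GoneScheduleSim();
        long curTime = 0;
        for (int cycle = 1; cycle <= 5; cycle++) {
            boolean generated = datum.shouldFetch(curTime);
            System.out.println("cycle " + cycle + ": generated=" + generated);
            if (generated) {
                datum.setPageGoneSchedule(curTime);   // fetch returned 404
            }
            curTime += DAY;                           // crawl once a day
        }
    }
}
```

Running it prints `generated=true` for every cycle: after each update, `fetchTime` sits about 1.35 * maxInterval ahead, so `fetchTime - curTime > maxInterval` holds on the next generate and the gone page is fetched again, exactly the oscillation described above.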
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688533#comment-13688533 ]

Hudson commented on NUTCH-1245:
-------------------------------

Integrated in Nutch-trunk #2248 (See [https://builds.apache.org/job/Nutch-trunk/2248/])
NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again (Revision 1494776)
Result = SUCCESS
snagel : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1494776
Files:
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688482#comment-13688482 ]

Sebastian Nagel commented on NUTCH-1245:
----------------------------------------

Committed to trunk (r1494776). Keep open, 2.x is likely to be affected as well.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688145#comment-13688145 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

Splendid! Thanks guys!
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688111#comment-13688111 ]

Lewis John McGibbney commented on NUTCH-1245:
---------------------------------------------

Push it, get it in there. I'll cut the RC tonight my time. ;)
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688065#comment-13688065 ]

Sebastian Nagel commented on NUTCH-1245:
----------------------------------------

Definitely, [~markus17]. I'll commit this evening, but without the unit tests. These will be added together with NUTCH-1502 early in 1.8 (I hope to get them ready and complete soon).
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687867#comment-13687867 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

I think we should also include this one in the 1.7 RC; it is even more important than the issue we committed for the FreeGenerator. We've been using this in production for a long time now, and all reported issues are no longer a problem.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546925#comment-13546925 ]

Markus Jelsma commented on NUTCH-1245:
--
Yes, and it also fixes the problem.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546424#comment-13546424 ]

Lewis John McGibbney commented on NUTCH-1245:
-
So this patch is good for testing?
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539951#comment-13539951 ]

Markus Jelsma commented on NUTCH-1245:
--
Please ignore my comment, I was inadvertently looking at the wrong data!
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539943#comment-13539943 ]

Markus Jelsma commented on NUTCH-1245:
--
There's an issue with the patch after all!
{code}
URL:
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Dec 20 00:19:09 UTC 2012
Modified time: Wed May 16 12:48:30 UTC 2012
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: b1fa188be92a8dfa5db51e80c4af192a
Metadata: Content-Type: application/xhtml+xml_pst_: success(1), lastModified=0
{code}
{code}
URL: http://www.remeha.nl/intelligentenergy/index.php/remeha_evita_hre_ketel/subsidie/
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Dec 20 00:38:19 UTC 2012
Modified time: Wed May 16 12:48:30 UTC 2012
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: b1fa188be92a8dfa5db51e80c4af192a
Metadata: Content-Type: application/xhtml+xml_pst_: success(1), lastModified=0
{code}
The fetch time is not incremented at all.
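For comparison, under a simple fetch schedule the next fetch time of a not-modified page is expected to advance by the retry interval on every CrawlDb update. A minimal sketch of that expectation (hypothetical names, not Nutch's actual API):

```java
import java.time.Duration;

// Sketch of the expected schedule update for a fetched, not-modified page:
// the next fetch time advances by the retry interval. Names are illustrative.
class ScheduleCheck {
    // Expected next fetch time (ms) after an update with the given interval (s).
    static long nextFetchTime(long fetchTimeMs, long fetchIntervalSec) {
        return fetchTimeMs + fetchIntervalSec * 1000L;
    }

    public static void main(String[] args) {
        long interval = 5184000L;   // 60 days, as in the readdb dumps
        long t0 = 0L;               // first recorded fetch time
        long t1 = nextFetchTime(t0, interval);
        // The two dumps differ by roughly 19 minutes, not ~60 days, which is
        // what "fetch time is not incremented" refers to.
        System.out.println(Duration.ofMillis(t1 - t0).toDays() + " days expected");
    }
}
```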
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537860#comment-13537860 ]

Markus Jelsma commented on NUTCH-1245:
--
Keep in mind, the DummyReporter needs to implement getProgress(), at least on Hadoop 1.1.1.
{code}
public float getProgress() {
  return 1f;
}
{code}
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525804#comment-13525804 ]

Markus Jelsma commented on NUTCH-1245:
--
No objections so far. I've put this in production Monday after I had it baked for two weeks or so in a test environment. I'm still keeping an eye on it and only good results so far, but I'd like to see it running a bit longer in case there's some edge case lurking around - although I doubt it :)
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525439#comment-13525439 ]

Sebastian Nagel commented on NUTCH-1245:
--
@kiran: yes, 2.x is affected since the fetch schedulers do not differ (much) between 1.x and 2.x. However, with default settings you need a couple of months of continuous crawling to run into this problem.
@Markus: good news! Pulled the test out to NUTCH-1502 (broader coverage, needs more time). Are there objections regarding the proposed patch?
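A rough back-of-the-envelope sketch of why it takes months of continuous crawling with default settings, assuming a 30-day initial fetch interval, the 1.5x gone-page increment, and the 90-day maximum (the interval values are assumptions based on the defaults, not taken from this issue):

```java
// Sketch: estimate how many fetch_gone updates (and how much crawl time)
// are needed before a gone page's interval exceeds db.fetch.interval.max.
public class GoneIntervalGrowth {
    // Returns {updates, elapsedDays} until the interval exceeds maxDays.
    static double[] growth(double startDays, double maxDays) {
        double interval = startDays, elapsed = 0;
        int updates = 0;
        while (interval <= maxDays) {
            elapsed += interval;     // must wait out the interval before refetching
            interval *= 1.5;         // setPageGoneSchedule increment on fetch_gone
            updates++;
        }
        return new double[] { updates, elapsed };
    }

    public static void main(String[] args) {
        double[] g = growth(30, 90);
        // 30 -> 45 -> 67.5 days of waiting, i.e. about 142.5 days of crawling
        System.out.printf("%d gone updates over %.1f days%n", (int) g[0], g[1]);
    }
}
```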
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511539#comment-13511539 ] kiran commented on NUTCH-1245: -- Can the 2.x version be affected by the same issue? I am crawling a website, and after some crawls it said db_unfetched is 960. Even though I crawled 10 times, the db_unfetched count remained the same. Any inputs?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511429#comment-13511429 ] Markus Jelsma commented on NUTCH-1245: -- Sebastian, I'm seeing good results with this patch so far with a low db.fetch.interval.max (60 days).
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493098#comment-13493098 ] Markus Jelsma commented on NUTCH-1245: -- Thanks for the thorough unit tests, they clearly show there's a problem to be solved. I think I agree with the proposed fix you mention for NUTCH-1245; it makes sense. Not calling forceRefetch (it only leads to more transient errors) but setting the fetch time to the max interval, so the page is seen again later, sounds like what one would expect. On NUTCH-578 and NUTCH-1247: I think if we solve 578, overflowing may no longer be a big problem. With Nutch as it works today it takes at least 128 days for the counter to overflow; if we fix it and people use a more reasonable max interval (say 30 days or higher) it'll overflow 10 years from now, which I think is reasonable. I'm not yet sure about the fix for 578. It's complex indeed ;)
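The overflow mentioned here comes from the retry counter being kept in a signed Java byte (the subject of NUTCH-1247); one fetch_retry per day wraps it after 128 increments. A minimal demonstration of that arithmetic:

```java
// Minimal demo of the overflow discussed above: a signed Java byte
// (as used for the retry counter, see NUTCH-1247) wraps to a negative
// value after 128 increments, i.e. after 128 daily fetch_retry cycles.
public class RetryOverflowDemo {
    public static void main(String[] args) {
        byte retries = 0;
        for (int day = 1; day <= 128; day++) {
            retries++;                 // one fetch_retry per day
        }
        System.out.println(retries);   // -128: the counter wrapped
    }
}
```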
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488935#comment-13488935 ] Sebastian Nagel commented on NUTCH-1245: -- They are not duplicates, but the effects are similar:
NUTCH-1245
- caused by calling forceRefetch just after a fetch leads to a fetch_gone. If the fetchInterval is close to db.fetch.interval.max, setPageGoneSchedule calls forceRefetch. That's useless since we got a 404 just now (or within the last day(s) for large crawls).
- proposed fix: setPageGoneSchedule should not call forceRefetch but keep the fetchInterval within/below db.fetch.interval.max
NUTCH-578
- although the status of a page fetched 3 times (db.fetch.retry.max) with a transient error (fetch_retry) is set to db_gone, the fetchInterval is still only incremented by one day. So the next day this page is fetched again.
- every fetch_retry still increments the retry counter so that it may overflow (NUTCH-1247)
- fix:
-* call setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter is hit and the status is set to db_gone. All patches (by various users/committers) agree on this: it will set the fetchInterval to a value larger than one day, so that from now on the URL is not fetched again and again.
-* reset the retry counter to 0 or prohibit an overflow. I'm not sure what the best solution is; see the comments on NUTCH-578.
Markus, it would be great if you start with a look at the JUnit patch. It has two aims: catch the error and make analysis easier (it logs a lot). I would like to extend the test to other CrawlDatum state transitions: these are complex for continuous crawls in combination with retry counters, intervals, signatures, etc. An exhaustive test could ensure that we do not break other state transitions.
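The proposed fix for NUTCH-1245 (keep the fetchInterval within db.fetch.interval.max instead of calling forceRefetch) can be sketched as follows. This is an illustration of the idea, not the committed patch, and the names are stand-ins:

```java
// Illustrative sketch (not the committed patch) of the fix direction:
// on a gone page, grow the interval but cap it at maxInterval instead
// of calling forceRefetch(). All names here are stand-ins.
public class PageGoneFixSketch {
    static final long DAY = 24 * 3600;          // seconds
    static final long MAX_INTERVAL = 90 * DAY;  // db.fetch.interval.max

    long fetchInterval = (long) (0.9f * MAX_INTERVAL); // 81 days
    long fetchTime;                             // seconds since epoch

    void setPageGoneSchedule(long curTime) {
        fetchInterval = (long) (1.5f * fetchInterval);
        if (fetchInterval > MAX_INTERVAL) fetchInterval = MAX_INTERVAL; // cap
        fetchTime = curTime + fetchInterval;    // re-check after maxInterval
    }

    boolean shouldFetch(long curTime) {         // generator test, unchanged
        if (fetchTime - curTime > MAX_INTERVAL) fetchTime = curTime;
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        PageGoneFixSketch d = new PageGoneFixSketch();
        d.setPageGoneSchedule(0);               // 404 at t=0
        // the fetch time is now exactly maxInterval out, so the generator
        // no longer pulls it back to the current time:
        System.out.println(d.shouldFetch(DAY)); // false: not re-generated
    }
}
```

Because the capped fetch time never exceeds curTime + maxInterval, the shouldFetch branch that forced a refetch every cycle no longer fires, and the gone page is only re-checked once maxInterval has passed.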
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488864#comment-13488864 ] Markus Jelsma commented on NUTCH-1245: -- Sebastian, very interesting! Can you close either this issue or NUTCH-578? I hope to check your patches soon!
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456928#comment-13456928 ] Markus Jelsma commented on NUTCH-1245: -- Any ideas on this issue?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192215#comment-13192215 ] Sebastian Nagel commented on NUTCH-1245: -- There are several possibilities to get a CrawlDatum with a fetchInterval large enough that a multiplication by 1.5 will exceed maxInterval:
# with adaptive fetch scheduling
## db.fetch.schedule.adaptive.min_interval > (1.5 * db.fetch.interval.max) -- as José commented, some kind of misconfiguration
## after some time when the document didn't change and db.fetch.schedule.adaptive.max_interval > (1.5 * db.fetch.interval.max) -- but this is the default (1 year > 1.5 * 90 days)!
# db.fetch.interval.default > (1.5 * db.fetch.interval.max) -- again some kind of misconfiguration
# also, setPageGoneSchedule increases the fetchInterval every time it is called, so after a gone page has been re-fetched several times we run into the same situation
Anyway, I think the misconfigurations, too, should be made impossible.
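The misconfiguration cases listed above could be rejected at startup. A hypothetical sanity check, not part of Nutch (only the property names quoted in the messages are real Nutch keys; the method and class are invented for illustration):

```java
// Hypothetical startup check (not in Nutch) for the misconfigurations
// listed above; only the property names in the messages are real keys.
public class FetchIntervalConfigCheck {
    // intervals in seconds
    static void check(long defaultInterval, long adaptiveMin, long maxInterval) {
        if (adaptiveMin > maxInterval)
            throw new IllegalArgumentException(
                "db.fetch.schedule.adaptive.min_interval exceeds db.fetch.interval.max");
        if (defaultInterval > maxInterval)
            throw new IllegalArgumentException(
                "db.fetch.interval.default exceeds db.fetch.interval.max");
    }

    public static void main(String[] args) {
        long day = 24 * 3600;
        check(30 * day, 60 * day, 90 * day);      // sane: passes silently
        try {
            check(30 * day, 120 * day, 90 * day); // min 120d > max 90d
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```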
> To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. 
> Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)  // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (lower than 0.35 * maxInterval)
>     datum.fetchTime = curTime  // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched and
> the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 *
> db.fetch.interval.max, and the fetch time is always close to current time +
> 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
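The generate-fetch-update loop described by the quoted pseudo-code can be condensed into a small self-contained simulation. This is a sketch of the behaviour only, not Nutch code: the class, its fields, and the constant are simplified stand-ins for CrawlDatum and AbstractFetchSchedule, with db.fetch.interval.max assumed at its 90-day default.

```java
// Minimal simulation of the buggy schedule described above (not Nutch code).
public class GoneScheduleSim {
    static final double MAX_INTERVAL = 90 * 24 * 3600; // db.fetch.interval.max, seconds

    // state after forceRefetch has run once: 0.9 * maxInterval
    double fetchInterval = 0.9 * MAX_INTERVAL;
    double fetchTime = 0;

    // mirrors setPageGoneSchedule() followed by forceRefetch()
    void setPageGoneSchedule(double curTime) {
        fetchInterval = 1.5 * fetchInterval;     // now 1.35 * maxInterval
        fetchTime = curTime + fetchInterval;     // pushed far into the future
        if (fetchInterval > MAX_INTERVAL) {      // necessarily true
            fetchInterval = 0.9 * MAX_INTERVAL;  // forceRefetch resets the interval,
            // but fetchTime keeps its 1.35 * maxInterval horizon
        }
    }

    // mirrors the shouldFetch() branch that forces a refetch
    boolean generatedAgain(double curTime) {
        return (fetchTime - curTime) > MAX_INTERVAL;
    }

    public static void main(String[] args) {
        GoneScheduleSim datum = new GoneScheduleSim();
        double day = 24 * 3600;
        for (int cycle = 0; cycle < 5; cycle++) {
            datum.setPageGoneSchedule(cycle * day);
            // every daily cycle re-selects the gone page, because
            // fetchTime - curTime stays near 1.35 * maxInterval > maxInterval
            System.out.println("cycle " + cycle + ": generated again = "
                + datum.generatedAgain((cycle + 1) * day));
        }
    }
}
```

With any spacing between cycles shorter than 0.35 * db.fetch.interval.max (31.5 days), the page is selected in every cycle, matching the observed oscillation between 0.9 and 1.35 * maxInterval.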
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183292#comment-13183292 ]

José Gil commented on NUTCH-1245:
---------------------------------

FWIW - we experienced a similar problem (entries in CrawlDB marked as db_fetched in spite of having resulted in a 404 response), and we eventually traced it to a configuration problem: our db.fetch.schedule.adaptive.min_interval was larger than our db.fetch.interval.max. After reducing the first and increasing the second, the entries are marked as db_gone as expected.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180515#comment-13180515 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

Thanks! This must be the same issue as NUTCH-578, but it is marked as related for now. Can you provide a patch?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
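The misconfigurations discussed in the comments above (adaptive min/max intervals or the default interval exceeding 1.5 * db.fetch.interval.max) could be flagged by a simple startup sanity check. The sketch below is hypothetical, not existing Nutch code; the property names are the real Nutch keys, but the fall-back values are assumed to be the usual defaults, in seconds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sanity check (not existing Nutch code) for the interval
// misconfigurations described in the comments above.
public class FetchIntervalConfigCheck {

    // returns one warning per violated constraint
    static List<String> check(Map<String, Long> conf) {
        double maxInterval = conf.getOrDefault("db.fetch.interval.max", 7776000L);      // 90 days
        double defInterval = conf.getOrDefault("db.fetch.interval.default", 2592000L);  // 30 days
        double adaptiveMin = conf.getOrDefault("db.fetch.schedule.adaptive.min_interval", 60L);
        double adaptiveMax = conf.getOrDefault("db.fetch.schedule.adaptive.max_interval", 31536000L); // 1 year

        List<String> warnings = new ArrayList<>();
        if (adaptiveMin > 1.5 * maxInterval)
            warnings.add("db.fetch.schedule.adaptive.min_interval > 1.5 * db.fetch.interval.max");
        if (adaptiveMax > 1.5 * maxInterval)
            warnings.add("db.fetch.schedule.adaptive.max_interval > 1.5 * db.fetch.interval.max");
        if (defInterval > 1.5 * maxInterval)
            warnings.add("db.fetch.interval.default > 1.5 * db.fetch.interval.max");
        return warnings;
    }

    public static void main(String[] args) {
        // even the assumed defaults trip the adaptive.max_interval check:
        // 1 year > 1.5 * 90 days, as Sebastian's comment points out
        System.out.println(check(new HashMap<>()));
    }
}
```

Note that with the assumed default values the adaptive.max_interval constraint is already violated, which is exactly the "but this is the default (1 year > 1.5 * 90 days)!" case above.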