[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812289#comment-13812289 ]

Hudson commented on NUTCH-1245:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #808 (See [https://builds.apache.org/job/Nutch-nutchgora/808/])
NUTCH-1588 Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1538200)
Files:
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb
> and is generated over and over again
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1245
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1245
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.7
>
>         Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch,
>                      NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again: although its fetch status is fetch_gone,
> its status in CrawlDb stays db_unfetched. Consequently, this document will
> be generated and fetched in every cycle from now on.
>
> To reproduce:
> # create a CrawlDatum in CrawlDb whose retry interval hits
>   db.fetch.interval.max (I manipulated shouldFetch() in
>   AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to
>   db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max
>   (81 days)
> # this state does not change over successive generate-fetch-update cycles;
>   here for two segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
>   Status: 1 (db_unfetched)
>   Fetch time: Thu Jan 05 16:14:21 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776461784
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> Crawl Fetch::
>   Status: 37 (fetch_gone)
>   Fetch time: Thu Jan 05 16:14:48 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776461784
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
>
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
>   Status: 1 (db_unfetched)
>   Fetch time: Thu Jan 05 16:16:23 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776583451
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> Crawl Fetch::
>   Status: 37 (fetch_gone)
>   Fetch time: Thu Jan 05 16:20:05 CET 2012
>   Modified time: Thu Jan 01 01:00:00 CET 1970
>   Retries since fetch: 0
>   Retry interval: 6998400 seconds (81 days)
>   Metadata: _ngt_: 1325776583451
>             _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)  // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (lower than 0.35 * maxInterval)
>     datum.fetchTime = curTime  // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched
> and the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again, because (fetch time - current time) is larger
> than db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35
> times db.fetch.interval.max, and the fetch time is always close to
> current time + 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516.
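The endless generate/fetch/update loop described by the pseudo-code above can be simulated in a few lines of Java. This is an illustrative sketch only, not the actual Nutch implementation: the class name `GoneScheduleSim` is invented, and the fields mirror the pseudo-code's `fetchInterval`, `fetchTime`, `maxInterval` (with `db.fetch.interval.max` = 90 days, so 0.9 * maxInterval = 6998400 seconds, the "81 days" seen in the segment dumps).

```java
// Simulation of the buggy scheduling logic (sketch; names are hypothetical).
public class GoneScheduleSim {
    static final long DAY = 24L * 60 * 60;       // seconds
    static final long MAX_INTERVAL = 90 * DAY;   // db.fetch.interval.max (90 days)

    long fetchInterval = (long) (0.9 * MAX_INTERVAL); // 6998400 s = 81 days
    long fetchTime = 0;                               // simulated epoch seconds

    // update phase: setPageGoneSchedule + forceRefetch, per the pseudo-code
    void setPageGoneSchedule(long curTime) {
        fetchInterval = (long) (1.5 * fetchInterval); // -> 1.35 * maxInterval
        fetchTime = curTime + fetchInterval;          // far in the future
        if (MAX_INTERVAL < fetchInterval) {           // necessarily true
            // forceRefetch: interval snaps back, status reset to db_unfetched
            fetchInterval = (long) (0.9 * MAX_INTERVAL);
        }
    }

    // generate phase: shouldFetch, per the pseudo-code
    boolean shouldFetch(long curTime) {
        if (fetchTime - curTime > MAX_INTERVAL) {
            fetchTime = curTime;                      // forces a refetch
        }
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        GoneScheduleSim datum = new GoneScheduleSim();
        long curTime = 0;
        for (int cycle = 1; cycle <= 5; cycle++) {
            boolean generated = datum.shouldFetch(curTime);
            System.out.println("cycle " + cycle + ": generated=" + generated);
            if (generated) {
                datum.setPageGoneSchedule(curTime);   // fetch returned 404
            }
            curTime += DAY;                           // crawl once a day
        }
    }
}
```

Running it prints `generated=true` for every cycle: after each update, `fetchTime` sits about 1.35 * maxInterval ahead, so `fetchTime - curTime > maxInterval` holds on the next generate and the gone page is fetched again, exactly the oscillation described above.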
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688533#comment-13688533 ]

Hudson commented on NUTCH-1245:
-------------------------------

Integrated in Nutch-trunk #2248 (See [https://builds.apache.org/job/Nutch-trunk/2248/])
NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again (Revision 1494776)
Result = SUCCESS
snagel : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1494776
Files:
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688482#comment-13688482 ]

Sebastian Nagel commented on NUTCH-1245:
----------------------------------------

Committed to trunk (r1494776). Keep open, 2.x is likely to be affected as well.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688145#comment-13688145 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

Splendid! Thanks guys!
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688111#comment-13688111 ]

Lewis John McGibbney commented on NUTCH-1245:
---------------------------------------------

Push it, get it in there. I'll cut the RC tonight my time. ;)
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688065#comment-13688065 ]

Sebastian Nagel commented on NUTCH-1245:
----------------------------------------

Definitely, [~markus17]. I'll commit this evening, but without the unit tests. These will be added together with NUTCH-1502 early in 1.8 (I hope to get them ready and complete soon).
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687867#comment-13687867 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

I think we should also include this one in the 1.7 RC; it is even more important than the issue we committed for the FreeGenerator. We've been using this in production for a long time now, and all reported issues are no longer a problem.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546925#comment-13546925 ]

Markus Jelsma commented on NUTCH-1245:
--
Yes, and it also fixes the problem.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546424#comment-13546424 ]

Lewis John McGibbney commented on NUTCH-1245:
-
So this patch is good for testing?
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539951#comment-13539951 ]

Markus Jelsma commented on NUTCH-1245:
--
Please ignore my comment, I was inadvertently looking at the wrong data!
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539943#comment-13539943 ]

Markus Jelsma commented on NUTCH-1245:
--
There's an issue with the patch after all!
{code}
URL:
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Dec 20 00:19:09 UTC 2012
Modified time: Wed May 16 12:48:30 UTC 2012
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: b1fa188be92a8dfa5db51e80c4af192a
Metadata: Content-Type: application/xhtml+xml_pst_: success(1), lastModified=0
{code}
{code}
URL: http://www.remeha.nl/intelligentenergy/index.php/remeha_evita_hre_ketel/subsidie/
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Dec 20 00:38:19 UTC 2012
Modified time: Wed May 16 12:48:30 UTC 2012
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: b1fa188be92a8dfa5db51e80c4af192a
Metadata: Content-Type: application/xhtml+xml_pst_: success(1), lastModified=0
{code}
The fetch time is not incremented at all.
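For comparison, under a simple fetch schedule the next fetch time of a not-modified page is expected to advance by the retry interval on every CrawlDb update. A minimal sketch of that expectation (hypothetical names, not Nutch's actual API):

```java
import java.time.Duration;

// Sketch of the expected schedule update for a fetched, not-modified page:
// the next fetch time advances by the retry interval. Names are illustrative.
class ScheduleCheck {
    // Expected next fetch time (ms) after an update with the given interval (s).
    static long nextFetchTime(long fetchTimeMs, long fetchIntervalSec) {
        return fetchTimeMs + fetchIntervalSec * 1000L;
    }

    public static void main(String[] args) {
        long interval = 5184000L;   // 60 days, as in the readdb dumps
        long t0 = 0L;               // first recorded fetch time
        long t1 = nextFetchTime(t0, interval);
        // The two dumps differ by roughly 19 minutes, not ~60 days, which is
        // what "fetch time is not incremented" refers to.
        System.out.println(Duration.ofMillis(t1 - t0).toDays() + " days expected");
    }
}
```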
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537860#comment-13537860 ]

Markus Jelsma commented on NUTCH-1245:
--
Keep in mind, the DummyReporter needs to implement getProgress(), at least on Hadoop 1.1.1.
{code}
public float getProgress() {
  return 1f;
}
{code}
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525804#comment-13525804 ]

Markus Jelsma commented on NUTCH-1245:
--
No objections so far. I've put this in production Monday after I had it baked for two weeks or so in a test environment. I'm still keeping an eye on it and only good results so far, but I'd like to see it running a bit longer in case there's some edge case lurking around - although I doubt it :)
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525439#comment-13525439 ]

Sebastian Nagel commented on NUTCH-1245:
--
@kiran: yes, 2.x is affected since the fetch schedulers do not differ (much) between 1.x and 2.x. However, with default settings you need a couple of months of continuous crawling to run into this problem.
@Markus: good news! Pulled the test out to NUTCH-1502 (broader coverage, needs more time). Are there objections regarding the proposed patch?
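A rough back-of-the-envelope sketch of why it takes months of continuous crawling with default settings, assuming a 30-day initial fetch interval, the 1.5x gone-page increment, and the 90-day maximum (the interval values are assumptions based on the defaults, not taken from this issue):

```java
// Sketch: estimate how many fetch_gone updates (and how much crawl time)
// are needed before a gone page's interval exceeds db.fetch.interval.max.
public class GoneIntervalGrowth {
    // Returns {updates, elapsedDays} until the interval exceeds maxDays.
    static double[] growth(double startDays, double maxDays) {
        double interval = startDays, elapsed = 0;
        int updates = 0;
        while (interval <= maxDays) {
            elapsed += interval;     // must wait out the interval before refetching
            interval *= 1.5;         // setPageGoneSchedule increment on fetch_gone
            updates++;
        }
        return new double[] { updates, elapsed };
    }

    public static void main(String[] args) {
        double[] g = growth(30, 90);
        // 30 -> 45 -> 67.5 days of waiting, i.e. about 142.5 days of crawling
        System.out.printf("%d gone updates over %.1f days%n", (int) g[0], g[1]);
    }
}
```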
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511539#comment-13511539 ] kiran commented on NUTCH-1245: -- Can the 2.x version be affected by the same issue? I am crawling a website, and after some crawls it said db_unfetched is 960. Even though I crawled 10 times, the db_unfetched count remained the same. Any inputs?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511429#comment-13511429 ] Markus Jelsma commented on NUTCH-1245: -- Sebastian, I'm seeing good results with this patch so far with a low db.fetch.interval.max (60 days).
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493098#comment-13493098 ] Markus Jelsma commented on NUTCH-1245: -- Thanks for the thorough unit tests, they clearly show there's a problem to be solved. I think I agree with the proposed fix you mention for NUTCH-1245; it makes sense. Not calling forceRefetch (it only leads to more transient errors) but setting the fetch time to the max interval, so the page is seen again later, sounds like what one would expect. On NUTCH-578 and NUTCH-1247: I think if we solve 578, overflowing may no longer be a big problem. With Nutch as it works today it takes at least 128 days for the counter to overflow; if we fix it and people use a more reasonable max interval (say 30 days or higher) it'll overflow 10 years from now, which I think is reasonable. I'm not yet sure about the fix for 578. It's complex indeed ;)
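The overflow mentioned here comes from the retry counter being kept in a signed Java byte (the subject of NUTCH-1247); one fetch_retry per day wraps it after 128 increments. A minimal demonstration of that arithmetic:

```java
// Minimal demo of the overflow discussed above: a signed Java byte
// (as used for the retry counter, see NUTCH-1247) wraps to a negative
// value after 128 increments, i.e. after 128 daily fetch_retry cycles.
public class RetryOverflowDemo {
    public static void main(String[] args) {
        byte retries = 0;
        for (int day = 1; day <= 128; day++) {
            retries++;                 // one fetch_retry per day
        }
        System.out.println(retries);   // -128: the counter wrapped
    }
}
```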
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488935#comment-13488935 ] Sebastian Nagel commented on NUTCH-1245: -- They are not duplicates, but the effects are similar:
NUTCH-1245
- caused by calling forceRefetch just after a fetch leads to a fetch_gone. If the fetchInterval is close to db.fetch.interval.max, setPageGoneSchedule calls forceRefetch. That's useless since we got a 404 just now (or within the last day(s) for large crawls).
- proposed fix: setPageGoneSchedule should not call forceRefetch but keep the fetchInterval within/below db.fetch.interval.max
NUTCH-578
- although the status of a page fetched 3 times (db.fetch.retry.max) with a transient error (fetch_retry) is set to db_gone, the fetchInterval is still only incremented by one day. So the next day this page is fetched again.
- every fetch_retry still increments the retry counter so that it may overflow (NUTCH-1247)
- fix:
-* call setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter is hit and the status is set to db_gone. All patches (by various users/committers) agree on this: it will set the fetchInterval to a value larger than one day, so that from now on the URL is not fetched again and again.
-* reset the retry counter to 0 or prohibit an overflow. I'm not sure what the best solution is; see the comments on NUTCH-578.
Markus, it would be great if you start with a look at the JUnit patch. It has two aims: catch the error and make analysis easier (it logs a lot). I would like to extend the test to other CrawlDatum state transitions: these are complex for continuous crawls in combination with retry counters, intervals, signatures, etc. An exhaustive test could ensure that we do not break other state transitions.
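The proposed fix for NUTCH-1245 (keep the fetchInterval within db.fetch.interval.max instead of calling forceRefetch) can be sketched as follows. This is an illustration of the idea, not the committed patch, and the names are stand-ins:

```java
// Illustrative sketch (not the committed patch) of the fix direction:
// on a gone page, grow the interval but cap it at maxInterval instead
// of calling forceRefetch(). All names here are stand-ins.
public class PageGoneFixSketch {
    static final long DAY = 24 * 3600;          // seconds
    static final long MAX_INTERVAL = 90 * DAY;  // db.fetch.interval.max

    long fetchInterval = (long) (0.9f * MAX_INTERVAL); // 81 days
    long fetchTime;                             // seconds since epoch

    void setPageGoneSchedule(long curTime) {
        fetchInterval = (long) (1.5f * fetchInterval);
        if (fetchInterval > MAX_INTERVAL) fetchInterval = MAX_INTERVAL; // cap
        fetchTime = curTime + fetchInterval;    // re-check after maxInterval
    }

    boolean shouldFetch(long curTime) {         // generator test, unchanged
        if (fetchTime - curTime > MAX_INTERVAL) fetchTime = curTime;
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        PageGoneFixSketch d = new PageGoneFixSketch();
        d.setPageGoneSchedule(0);               // 404 at t=0
        // the fetch time is now exactly maxInterval out, so the generator
        // no longer pulls it back to the current time:
        System.out.println(d.shouldFetch(DAY)); // false: not re-generated
    }
}
```

Because the capped fetch time never exceeds curTime + maxInterval, the shouldFetch branch that forced a refetch every cycle no longer fires, and the gone page is only re-checked once maxInterval has passed.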
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488864#comment-13488864 ] Markus Jelsma commented on NUTCH-1245: -- Sebastian, very interesting! Can you close either this issue or NUTCH-578? I hope to check your patches soon!
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456928#comment-13456928 ] Markus Jelsma commented on NUTCH-1245: -- Any ideas on this issue?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192215#comment-13192215 ] Sebastian Nagel commented on NUTCH-1245: -- There are several possibilities to get a CrawlDatum with a fetchInterval large enough that a multiplication by 1.5 will exceed maxInterval:
# with adaptive fetch scheduling
## db.fetch.schedule.adaptive.min_interval > (1.5 * db.fetch.interval.max) -- as José commented, some kind of misconfiguration
## after some time when the document didn't change and db.fetch.schedule.adaptive.max_interval > (1.5 * db.fetch.interval.max) -- but this is the default (1 year > 1.5 * 90 days)!
# db.fetch.interval.default > (1.5 * db.fetch.interval.max) -- again some kind of misconfiguration
# also, setPageGoneSchedule increases the fetchInterval every time it is called, so after a gone page has been re-fetched several times we run into the same situation
Anyway, I think the misconfigurations, too, should be made impossible.
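The misconfiguration cases listed above could be rejected at startup. A hypothetical sanity check, not part of Nutch (only the property names quoted in the messages are real Nutch keys; the method and class are invented for illustration):

```java
// Hypothetical startup check (not in Nutch) for the misconfigurations
// listed above; only the property names in the messages are real keys.
public class FetchIntervalConfigCheck {
    // intervals in seconds
    static void check(long defaultInterval, long adaptiveMin, long maxInterval) {
        if (adaptiveMin > maxInterval)
            throw new IllegalArgumentException(
                "db.fetch.schedule.adaptive.min_interval exceeds db.fetch.interval.max");
        if (defaultInterval > maxInterval)
            throw new IllegalArgumentException(
                "db.fetch.interval.default exceeds db.fetch.interval.max");
    }

    public static void main(String[] args) {
        long day = 24 * 3600;
        check(30 * day, 60 * day, 90 * day);      // sane: passes silently
        try {
            check(30 * day, 120 * day, 90 * day); // min 120d > max 90d
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```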
> To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. 
> Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)  // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (lower than 0.35 * maxInterval)
>     datum.fetchTime = curTime  // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched and
> the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 *
> db.fetch.interval.max, and the fetch time is always close to current time +
> 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
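The generate-fetch-update loop described by the quoted pseudo-code can be condensed into a small self-contained simulation. This is a sketch of the behaviour only, not Nutch code: the class, its fields, and the constant are simplified stand-ins for CrawlDatum and AbstractFetchSchedule, with db.fetch.interval.max assumed at its 90-day default.

```java
// Minimal simulation of the buggy schedule described above (not Nutch code).
public class GoneScheduleSim {
    static final double MAX_INTERVAL = 90 * 24 * 3600; // db.fetch.interval.max, seconds

    // state after forceRefetch has run once: 0.9 * maxInterval
    double fetchInterval = 0.9 * MAX_INTERVAL;
    double fetchTime = 0;

    // mirrors setPageGoneSchedule() followed by forceRefetch()
    void setPageGoneSchedule(double curTime) {
        fetchInterval = 1.5 * fetchInterval;     // now 1.35 * maxInterval
        fetchTime = curTime + fetchInterval;     // pushed far into the future
        if (fetchInterval > MAX_INTERVAL) {      // necessarily true
            fetchInterval = 0.9 * MAX_INTERVAL;  // forceRefetch resets the interval,
            // but fetchTime keeps its 1.35 * maxInterval horizon
        }
    }

    // mirrors the shouldFetch() branch that forces a refetch
    boolean generatedAgain(double curTime) {
        return (fetchTime - curTime) > MAX_INTERVAL;
    }

    public static void main(String[] args) {
        GoneScheduleSim datum = new GoneScheduleSim();
        double day = 24 * 3600;
        for (int cycle = 0; cycle < 5; cycle++) {
            datum.setPageGoneSchedule(cycle * day);
            // every daily cycle re-selects the gone page, because
            // fetchTime - curTime stays near 1.35 * maxInterval > maxInterval
            System.out.println("cycle " + cycle + ": generated again = "
                + datum.generatedAgain((cycle + 1) * day));
        }
    }
}
```

With any spacing between cycles shorter than 0.35 * db.fetch.interval.max (31.5 days), the page is selected in every cycle, matching the observed oscillation between 0.9 and 1.35 * maxInterval.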
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183292#comment-13183292 ]

José Gil commented on NUTCH-1245:
---------------------------------

FWIW - we experienced a similar problem (entries in CrawlDB marked as db_fetched in spite of having resulted in a 404 response), and we eventually traced it to a configuration problem: our db.fetch.schedule.adaptive.min_interval was larger than our db.fetch.interval.max. After reducing the first and increasing the second, the entries are marked as db_gone as expected.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180515#comment-13180515 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

Thanks! This must be the same issue as NUTCH-578, but it is marked as related for now. Can you provide a patch?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
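The misconfigurations discussed in the comments above (adaptive min/max intervals or the default interval exceeding 1.5 * db.fetch.interval.max) could be flagged by a simple startup sanity check. The sketch below is hypothetical, not existing Nutch code; the property names are the real Nutch keys, but the fall-back values are assumed to be the usual defaults, in seconds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sanity check (not existing Nutch code) for the interval
// misconfigurations described in the comments above.
public class FetchIntervalConfigCheck {

    // returns one warning per violated constraint
    static List<String> check(Map<String, Long> conf) {
        double maxInterval = conf.getOrDefault("db.fetch.interval.max", 7776000L);      // 90 days
        double defInterval = conf.getOrDefault("db.fetch.interval.default", 2592000L);  // 30 days
        double adaptiveMin = conf.getOrDefault("db.fetch.schedule.adaptive.min_interval", 60L);
        double adaptiveMax = conf.getOrDefault("db.fetch.schedule.adaptive.max_interval", 31536000L); // 1 year

        List<String> warnings = new ArrayList<>();
        if (adaptiveMin > 1.5 * maxInterval)
            warnings.add("db.fetch.schedule.adaptive.min_interval > 1.5 * db.fetch.interval.max");
        if (adaptiveMax > 1.5 * maxInterval)
            warnings.add("db.fetch.schedule.adaptive.max_interval > 1.5 * db.fetch.interval.max");
        if (defInterval > 1.5 * maxInterval)
            warnings.add("db.fetch.interval.default > 1.5 * db.fetch.interval.max");
        return warnings;
    }

    public static void main(String[] args) {
        // even the assumed defaults trip the adaptive.max_interval check:
        // 1 year > 1.5 * 90 days, as Sebastian's comment points out
        System.out.println(check(new HashMap<>()));
    }
}
```

Note that with the assumed default values the adaptive.max_interval constraint is already violated, which is exactly the "but this is the default (1 year > 1.5 * 90 days)!" case above.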