[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1422:
---------------------------------
    Attachment: NUTCH-1422-trunk-v2.patch

Patch updated after NUTCH-1502: it fixes the issue and moves the related test from TODOTestCrawlDbStates.java to TestCrawlDbStates.java.

Will commit this shortly.

> reset signature for redirects
> -----------------------------
>
>              Key: NUTCH-1422
>              URL: https://issues.apache.org/jira/browse/NUTCH-1422
>          Project: Nutch
>       Issue Type: Bug
>       Components: crawldb, fetcher
> Affects Versions: 1.4
>         Reporter: Sebastian Nagel
>         Priority: Critical
>          Fix For: 1.9
>
>      Attachments: NUTCH-1422-trunk-v1.patch, NUTCH-1422-trunk-v2.patch, NUTCH-1422_redir_notmodified_log.txt
>
>
> In a long-running continuous crawl with Nutch 1.4, URLs with an HTTP redirect
> (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Short
> protocol (cf. attached dumped segment / CrawlDb data):
>  2012-02-23 : injected
>  2012-02-24 : fetched
>  2012-03-30 : re-fetched, signature changed
>  2012-04-20 : re-fetched, redirected
>  2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content!
> The signature of a previously fetched document is not reset when the URL/doc
> is changed to a redirect at a later time. CrawlDbReducer.reduce then sets the
> status to db_notmodified because the signature that comes with the fetch
> status is identical to the old one.
> Possible fixes (??):
> * reset the signature in Fetcher
> * handle this case in CrawlDbReducer.reduce

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1422:
---------------------------------
    Patch Info: Patch Available

> reset signature for redirects
> -----------------------------
>
>              Key: NUTCH-1422
>              URL: https://issues.apache.org/jira/browse/NUTCH-1422
>          Project: Nutch
>       Issue Type: Bug
>       Components: crawldb, fetcher
> Affects Versions: 1.4
>         Reporter: Sebastian Nagel
>         Priority: Critical
>          Fix For: 1.9
>
>      Attachments: NUTCH-1422-trunk-v1.patch, NUTCH-1422_redir_notmodified_log.txt
[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1422:
---------------------------------
    Fix Version/s:     (was: 1.10)
                   1.9

> reset signature for redirects
> -----------------------------
>
>              Key: NUTCH-1422
>              URL: https://issues.apache.org/jira/browse/NUTCH-1422
>          Project: Nutch
>       Issue Type: Bug
>       Components: crawldb, fetcher
> Affects Versions: 1.4
>         Reporter: Sebastian Nagel
>         Priority: Critical
>          Fix For: 1.9
>
>      Attachments: NUTCH-1422-trunk-v1.patch, NUTCH-1422_redir_notmodified_log.txt
[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1422:
-----------------------------------
    Attachment: NUTCH-1422-trunk-v1.patch

Patch for trunk: status not_modified is only determined by signature comparison for successfully fetched documents.

It does not reset the signature: the old signature is explicitly kept for states fetch_retry and fetch_gone, and it should also be kept for redirects. Keeping it does no harm as long as the not_modified detection is safe. In addition, it makes some sense for documents which disappeared temporarily to switch back to status not_modified immediately after the first successful re-fetch.

This solution is verified by NUTCH-1502.

The issue title should be changed accordingly, e.g., to "redirect results erroneously in status not_modified".

> reset signature for redirects
> -----------------------------
>
>              Key: NUTCH-1422
>              URL: https://issues.apache.org/jira/browse/NUTCH-1422
>          Project: Nutch
>       Issue Type: Bug
>       Components: crawldb, fetcher
> Affects Versions: 1.4
>         Reporter: Sebastian Nagel
>         Priority: Critical
>          Fix For: 1.10
>
>      Attachments: NUTCH-1422-trunk-v1.patch, NUTCH-1422_redir_notmodified_log.txt
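
The difference between the reported behaviour and the approach described in the patch comment can be illustrated with a small, self-contained sketch. This is not the Nutch source: the class NotModifiedSketch, both enums and the methods resolveBuggy, resolveFixed and map are hypothetical names made up for illustration, and the real CrawlDbReducer.reduce handles many more states. The sketch only assumes what the issue states: the old signature travels along with the fetch status, and a plain signature comparison then yields not_modified even for a redirect.

```java
import java.util.Arrays;

// Simplified, hypothetical model of the status decision; not Nutch code.
class NotModifiedSketch {
    enum FetchStatus { SUCCESS, REDIR_TEMP, REDIR_PERM, RETRY, GONE }
    enum DbStatus { FETCHED, NOTMODIFIED, REDIR_TEMP, REDIR_PERM, UNFETCHED, GONE }

    // Plain mapping from a fetch outcome to a CrawlDb status.
    static DbStatus map(FetchStatus fetch) {
        switch (fetch) {
            case SUCCESS:    return DbStatus.FETCHED;
            case REDIR_TEMP: return DbStatus.REDIR_TEMP;
            case REDIR_PERM: return DbStatus.REDIR_PERM;
            case RETRY:      return DbStatus.UNFETCHED;
            default:         return DbStatus.GONE;
        }
    }

    // Reported behaviour: signatures are compared for every fetch outcome,
    // so a redirect that still carries the old signature becomes NOTMODIFIED.
    static DbStatus resolveBuggy(FetchStatus fetch, byte[] oldSig, byte[] fetchSig) {
        if (oldSig != null && Arrays.equals(oldSig, fetchSig)) {
            return DbStatus.NOTMODIFIED;
        }
        return map(fetch);
    }

    // Approach from the patch comment: compare signatures only for
    // successfully fetched documents; the old signature itself is kept.
    static DbStatus resolveFixed(FetchStatus fetch, byte[] oldSig, byte[] fetchSig) {
        if (fetch == FetchStatus.SUCCESS
                && oldSig != null && Arrays.equals(oldSig, fetchSig)) {
            return DbStatus.NOTMODIFIED;
        }
        return map(fetch);
    }

    public static void main(String[] args) {
        byte[] sig = {1, 2, 3};
        // The redirect carries the unchanged old signature.
        System.out.println(resolveBuggy(FetchStatus.REDIR_TEMP, sig, sig.clone()));
        System.out.println(resolveFixed(FetchStatus.REDIR_TEMP, sig, sig.clone()));
        // An unchanged, successfully re-fetched document.
        System.out.println(resolveFixed(FetchStatus.SUCCESS, sig, sig.clone()));
    }
}
```

In this model a redirect with an unchanged signature resolves to NOTMODIFIED in resolveBuggy and to REDIR_TEMP in resolveFixed, while a successful re-fetch with an unchanged signature still resolves to NOTMODIFIED.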
[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1422:
---------------------------------
    Priority: Critical  (was: Major)

> reset signature for redirects
> -----------------------------
>
>              Key: NUTCH-1422
>              URL: https://issues.apache.org/jira/browse/NUTCH-1422
>          Project: Nutch
>       Issue Type: Bug
>       Components: crawldb, fetcher
> Affects Versions: 1.4
>         Reporter: Sebastian Nagel
>         Priority: Critical
>          Fix For: 1.9
>
>      Attachments: NUTCH-1422_redir_notmodified_log.txt
[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1422:
-----------------------------------
    Attachment: NUTCH-1422_redir_notmodified_log.txt

> reset signature for redirects
> -----------------------------
>
>              Key: NUTCH-1422
>              URL: https://issues.apache.org/jira/browse/NUTCH-1422
>          Project: Nutch
>       Issue Type: Bug
>       Components: crawldb, fetcher
> Affects Versions: 1.4
>         Reporter: Sebastian Nagel
>          Fix For: 1.6
>
>      Attachments: NUTCH-1422_redir_notmodified_log.txt

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira