[jira] [Updated] (NUTCH-1422) reset signature for redirects

2014-07-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1422:
---------------------------------

Attachment: NUTCH-1422-trunk-v2.patch

Post-NUTCH-1502 patch that fixes the issue and moves the related test from 
TODOTestCrawlDbStates.java to TestCrawlDbStates.java.

Will commit this shortly.

> reset signature for redirects
> -----------------------------
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.4
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.9
>
> Attachments: NUTCH-1422-trunk-v1.patch, NUTCH-1422-trunk-v2.patch, 
> NUTCH-1422_redir_notmodified_log.txt
>
>
> In a long-running continuous crawl with Nutch 1.4, URLs that return an HTTP 
> redirect (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Short 
> log (cf. attached dumped segment / CrawlDb data):
>  2012-02-23 :  injected
>  2012-02-24 :  fetched
>  2012-03-30 :  re-fetched, signature changed
>  2012-04-20 :  re-fetched, redirected
>  2012-04-24 :  in CrawlDb as db_notmodified, still indexed with old content!
> The signature of a previously fetched document is not reset when the URL/doc 
> later turns into a redirect. CrawlDbReducer.reduce then sets the status to 
> db_notmodified because the signature that comes with the new fetch status is 
> identical to the old one.
> Possible fixes (??):
> * reset the signature in Fetcher
> * handle this case in CrawlDbReducer.reduce
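
To make the failure mode described above concrete, here is a minimal sketch in Java. It is not Nutch code: the class, the enums and the naiveUpdate method are hypothetical names invented for this illustration, and the logic is a deliberately simplified model of a CrawlDb update that compares signatures without looking at the fetch status.

```java
import java.util.Arrays;

/** Hypothetical, simplified model of the reported behaviour; not Nutch's actual code. */
final class RedirectNotModifiedSketch {

    enum FetchStatus { SUCCESS, REDIRECT }

    enum DbStatus { FETCHED, NOTMODIFIED, REDIRECTED }

    /**
     * Naive update: the signatures are compared first, regardless of the fetch status.
     * A redirect brings no new content, so the signature that comes with it is still
     * the one computed at the last successful fetch.
     */
    static DbStatus naiveUpdate(byte[] oldSignature, FetchStatus fetch, byte[] newSignature) {
        if (oldSignature != null && Arrays.equals(oldSignature, newSignature)) {
            return DbStatus.NOTMODIFIED; // wrong for redirects: the old content stays indexed
        }
        return fetch == FetchStatus.REDIRECT ? DbStatus.REDIRECTED : DbStatus.FETCHED;
    }

    public static void main(String[] args) {
        byte[] signature = {1, 2, 3}; // stand-in for a content digest
        // Re-fetch answered with a redirect; the signature was never reset.
        System.out.println(naiveUpdate(signature, FetchStatus.REDIRECT, signature));
    }
}
```

In this model the redirected URL ends up as NOTMODIFIED instead of REDIRECTED, which mirrors the 2012-04-24 entry in the log above.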



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1422) reset signature for redirects

2014-07-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1422:
---------------------------------

Patch Info: Patch Available

> reset signature for redirects
> -----------------------------
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.4
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.9
>
> Attachments: NUTCH-1422-trunk-v1.patch, 
> NUTCH-1422_redir_notmodified_log.txt





[jira] [Updated] (NUTCH-1422) reset signature for redirects

2014-07-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1422:
---------------------------------

Fix Version/s: 1.9  (was: 1.10)

> reset signature for redirects
> -----------------------------
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.4
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.9
>
> Attachments: NUTCH-1422-trunk-v1.patch, 
> NUTCH-1422_redir_notmodified_log.txt





[jira] [Updated] (NUTCH-1422) reset signature for redirects

2014-07-10 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1422:
-----------------------------------

Attachment: NUTCH-1422-trunk-v1.patch

Patch for trunk: status not_modified is only determined by signature comparison 
for successfully fetched documents. The patch does not reset the signature: the 
old signature is explicitly kept for the states fetch_retry and fetch_gone, and 
it should also be kept for redirects. Keeping it does no harm as long as the 
not_modified detection is safe. In addition, it makes some sense for documents 
that disappeared temporarily to switch back to status not_modified immediately 
after the first successful re-fetch.
This solution is verified by NUTCH-1502. The issue title should be changed 
accordingly, e.g., to "redirect results erroneously in status not_modified".
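
The approach described in this comment could be sketched roughly as follows. This is an illustration only, not the attached patch: the class, the Entry type, the enums and the update method are hypothetical names, and the mapping of a retry to an unfetched state is a simplifying assumption. The point is that signatures are compared only on a successful fetch, while every other fetch outcome keeps the old signature and takes its status from the fetch result.

```java
import java.util.Arrays;

/** Hypothetical sketch of the idea in the comment above; not the actual Nutch patch. */
final class SignatureAwareUpdateSketch {

    enum FetchStatus { SUCCESS, RETRY, GONE, REDIRECT }

    enum DbStatus { UNFETCHED, FETCHED, NOTMODIFIED, GONE, REDIRECTED }

    /** Minimal stand-in for a CrawlDb entry: a status plus the last known signature. */
    static final class Entry {
        final DbStatus status;
        final byte[] signature;

        Entry(DbStatus status, byte[] signature) {
            this.status = status;
            this.signature = signature;
        }
    }

    static Entry update(Entry old, FetchStatus fetch, byte[] fetchedSignature) {
        switch (fetch) {
            case SUCCESS: {
                // Only a successful fetch may yield NOTMODIFIED.
                boolean same = old.signature != null
                        && Arrays.equals(old.signature, fetchedSignature);
                return new Entry(same ? DbStatus.NOTMODIFIED : DbStatus.FETCHED, fetchedSignature);
            }
            case REDIRECT:
                // Keep the old signature, but never compare it.
                return new Entry(DbStatus.REDIRECTED, old.signature);
            case GONE:
                return new Entry(DbStatus.GONE, old.signature);
            default:
                // RETRY: simplified here to "try again later".
                return new Entry(DbStatus.UNFETCHED, old.signature);
        }
    }

    public static void main(String[] args) {
        byte[] signature = {1, 2, 3}; // stand-in for a content digest
        Entry fetched = new Entry(DbStatus.FETCHED, signature);
        Entry redirected = update(fetched, FetchStatus.REDIRECT, signature);
        Entry backAgain = update(redirected, FetchStatus.SUCCESS, signature);
        System.out.println(redirected.status + " then " + backAgain.status);
    }
}
```

Because the signature survives the redirect, a page that comes back unchanged is recognised as not modified on its first successful re-fetch, which is the behaviour the comment argues for.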

> reset signature for redirects
> -----------------------------
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.4
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.10
>
> Attachments: NUTCH-1422-trunk-v1.patch, 
> NUTCH-1422_redir_notmodified_log.txt





[jira] [Updated] (NUTCH-1422) reset signature for redirects

2014-04-10 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1422:
---------------------------------

Priority: Critical  (was: Major)

> reset signature for redirects
> -----------------------------
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.4
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.9
>
> Attachments: NUTCH-1422_redir_notmodified_log.txt





[jira] [Updated] (NUTCH-1422) reset signature for redirects

2012-07-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1422:
-----------------------------------

Attachment: NUTCH-1422_redir_notmodified_log.txt

> reset signature for redirects
> -----------------------------
>
> Key: NUTCH-1422
> URL: https://issues.apache.org/jira/browse/NUTCH-1422
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher
> Affects Versions: 1.4
> Reporter: Sebastian Nagel
> Fix For: 1.6
>
> Attachments: NUTCH-1422_redir_notmodified_log.txt

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira