[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814387#comment-16814387 ]

Hudson commented on NUTCH-2683:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3619 (See [https://builds.apache.org/job/Nutch-trunk/3619/])
NUTCH-2683 DeduplicationJob: add option to prefer https:// over http:// (snagel: [https://github.com/apache/nutch/commit/3958d0c23e32855225fd52403da7c7234eef5ea2])
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java

> DeduplicationJob: add option to prefer https:// over http://
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
> The deduplication job can keep the shortest URL as the "best" URL of a
> set of duplicates, marking all longer ones as duplicates. Recently, search
> engines started to penalize non-https pages by
> [giving https pages a higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
> and
> [marking http as insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be
> able to prefer https:// over http:// URLs, although the latter are
> shorter by one character. Of course, this should be configurable and work in
> addition to the existing preferences (length, score and fetch time) to select the
> "best" URL among duplicates.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814358#comment-16814358 ]

ASF GitHub Bot commented on NUTCH-2683:
---

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785836#comment-16785836 ]

Sebastian Nagel commented on NUTCH-2683:

Any comments or objections? Thanks! Otherwise I'll commit.
[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://
[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735682#comment-16735682 ]

ASF GitHub Bot commented on NUTCH-2683:
---

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425

- add optional value "httpsOverHttp" to the -compareOrder argument: https:// is preferred over http:// if "httpsOverHttp" comes before "urlLength" and neither "score" nor "fetchTime" takes precedence
- code improvements: remove nested loop, sort imports, add `@Override` annotations where applicable

Testing with one pair of https/http duplicates:

```
% cat seeds.txt
http://nutch.apache.org/
https://nutch.apache.org/
% nutch inject crawldb seeds.txt
...
% nutch generate crawldb/ segments
...
% nutch fetch segments/*
...
% nutch parse segments/*
...
% nutch updatedb crawldb/ segments/*
...
% nutch dedup crawldb -compareOrder httpsOverHttp,score,urlLength,fetchTime
...
Deduplication: 1 documents marked as duplicates
...
% nutch readdb crawldb/ -url https://nutch.apache.org/
URL: https://nutch.apache.org/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Feb 06 11:55:33 CET 2019
Modified time: Mon Jan 07 11:55:33 CET 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.181
Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
Metadata:
...
% nutch readdb crawldb/ -url http://nutch.apache.org/
URL: http://nutch.apache.org/
Version: 7
Status: 7 (db_duplicate)
Fetch time: Wed Feb 06 11:55:39 CET 2019
Modified time: Mon Jan 07 11:55:39 CET 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.181
Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
Metadata:
...
```

The URL `https://nutch.apache.org/` is kept as expected if "httpsOverHttp" is configured.
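
For readers who want to see how an ordered list such as `httpsOverHttp,score,urlLength,fetchTime` could decide between two duplicates, here is a minimal sketch. It is not the code in DeduplicationJob.java: the class `CompareOrderSketch`, the `Entry` type and the `pickBest` helper are hypothetical names made up for illustration, and the tie-breaking directions (higher score, more recent fetch time, shorter URL) are assumptions rather than something stated in this thread. The only rule taken from the issue is that https:// is preferred when two URLs are identical except for the protocol.

```java
public class CompareOrderSketch {

  /** Hypothetical stand-in for a CrawlDb record; not a Nutch class. */
  static final class Entry {
    final String url;
    final float score;
    final long fetchTime;

    Entry(String url, float score, long fetchTime) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
    }
  }

  /** Drops the "scheme://" prefix so two URLs can be compared without their protocol. */
  static String stripScheme(String url) {
    int i = url.indexOf("://");
    return i < 0 ? url : url.substring(i + 3);
  }

  /** Negative if a is preferred, positive if b is preferred, 0 if this criterion cannot decide. */
  static int compare(String criterion, Entry a, Entry b) {
    switch (criterion) {
      case "httpsOverHttp":
        // Only decides when the URLs are identical except for the protocol.
        if (!stripScheme(a.url).equals(stripScheme(b.url))) {
          return 0;
        }
        return Boolean.compare(b.url.startsWith("https://"), a.url.startsWith("https://"));
      case "score":
        return Float.compare(b.score, a.score); // assumption: higher score preferred
      case "fetchTime":
        return Long.compare(b.fetchTime, a.fetchTime); // assumption: more recent fetch preferred
      case "urlLength":
        return Integer.compare(a.url.length(), b.url.length()); // shorter URL preferred
      default:
        throw new IllegalArgumentException("unknown criterion: " + criterion);
    }
  }

  /** Walks the comma-separated criteria in order; the first one that decides wins. */
  static Entry pickBest(Entry a, Entry b, String compareOrder) {
    for (String criterion : compareOrder.split(",")) {
      int cmp = compare(criterion.trim(), a, b);
      if (cmp != 0) {
        return cmp < 0 ? a : b;
      }
    }
    return a; // every criterion tied: keep the first entry
  }

  public static void main(String[] args) {
    // Fetch times are placeholder values, not taken from the transcript above.
    Entry http = new Entry("http://nutch.apache.org/", 1.181f, 39L);
    Entry https = new Entry("https://nutch.apache.org/", 1.181f, 33L);
    System.out.println(pickBest(http, https, "httpsOverHttp,score,urlLength,fetchTime").url);
    System.out.println(pickBest(http, https, "score,urlLength,fetchTime").url);
  }
}
```

With equal scores, the first call selects the https:// URL because "httpsOverHttp" is listed first, while the second call falls through to "urlLength" and selects the http:// URL, which is one character shorter. The sketch compares pairs rather than sorting a whole group, since the protocol rule only applies to URLs that match once the scheme is removed.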