Build failed in Jenkins: Nutch-trunk #3619

2019-04-10 Thread Apache Jenkins Server
See <https://builds.apache.org/job/Nutch-trunk/3619/>


Changes:

[snagel] NUTCH-2683 DeduplicationJob: add option to prefer https:// over http://

[snagel] NUTCH-2666 Increase default value for http.content.limit / ftp.content.limit / file.content.limit

[snagel] NUTCH-2701 Fetcher: log dates and times also in human-readable form -

--
[...truncated 38.15 KB...]
[ivy:resolve]   [SUCCESSFUL ] org.webjars#modernizr;2.6.2-1!modernizr.jar (9ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/joda-time/joda-time/2.3/joda-time-2.3.jar ...
[ivy:resolve] . (567kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] joda-time#joda-time;2.3!joda-time.jar (21ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/apache/wicket/wicket-extensions/6.13.0/wicket-extensions-6.13.0.jar ...
[ivy:resolve] .. (1343kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.apache.wicket#wicket-extensions;6.13.0!wicket-extensions.jar (69ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/de/agilecoders/maven/maven-parent-config/0.3.4/maven-parent-config-0.3.4.jar ...
[ivy:resolve] .. (5kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] de.agilecoders.maven#maven-parent-config;0.3.4!maven-parent-config.jar (13ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/reflections/reflections/0.9.8/reflections-0.9.8.jar ...
[ivy:resolve] ... (100kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.reflections#reflections;0.9.8!reflections.jar (16ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/javassist/javassist/3.12.1.GA/javassist-3.12.1.GA.jar ...
[ivy:resolve] ... (629kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] javassist#javassist;3.12.1.GA!javassist.jar (19ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/dom4j/dom4j/1.6.1/dom4j-1.6.1.jar ...
[ivy:resolve]  (306kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] dom4j#dom4j;1.6.1!dom4j.jar (12ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/webjars/jquery/2.0.3-1/jquery-2.0.3-1.jar ...
[ivy:resolve] .. (151kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.webjars#jquery;2.0.3-1!jquery.jar (11ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/com/google/javascript/closure-compiler/v20130603/closure-compiler-v20130603.jar ...
[ivy:resolve] ... (3475kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] com.google.javascript#closure-compiler;v20130603!closure-compiler.jar (71ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/webjars/jquerypp/1.0.1/jquerypp-1.0.1.jar ...
[ivy:resolve] . (653kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.webjars#jquerypp;1.0.1!jquerypp.jar (32ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/webjars/jquery-ui/1.10.2-1/jquery-ui-1.10.2-1.jar ...
[ivy:resolve] .. (604kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.webjars#jquery-ui;1.10.2-1!jquery-ui.jar (34ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/webjars/typeaheadjs/0.9.3/typeaheadjs-0.9.3.jar ...
[ivy:resolve] .. (19kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.webjars#typeaheadjs;0.9.3!typeaheadjs.jar (16ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/args4j/args4j/2.0.16/args4j-2.0.16.jar ...
[ivy:resolve] ... (54kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] args4j#args4j;2.0.16!args4j.jar (10ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/com/tdunning/t-digest/3.2/t-digest-3.2.jar ...
[ivy:resolve]  (50kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] com.tdunning#t-digest;3.2!t-digest.jar (10ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/org/apache/tika/tika-core/1.20/tika-core-1.20.jar ...
[ivy:resolve] .. (683kB)
[ivy:resolve] .. (0kB)
[ivy:resolve]   [SUCCESSFUL ] org.apache.tika#tika-core;1.20!tika-core.jar(bundle) (21ms)
[ivy:resolve] downloading http://repo1.maven.org/maven2/com/ibm/icu/icu4j/61.1/icu4j-61.1.jar ...
[ivy:resolve] 

[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-04-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814387#comment-16814387
 ] 

Hudson commented on NUTCH-2683:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3619 (See 
[https://builds.apache.org/job/Nutch-trunk/3619/])
NUTCH-2683 DeduplicationJob: add option to prefer https:// over http:// 
(snagel: 
[https://github.com/apache/nutch/commit/3958d0c23e32855225fd52403da7c7234eef5ea2])
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java


> DeduplicationJob: add option to prefer https:// over http://
> 
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The deduplication job allows keeping the shortest URL as the "best" URL of a
> set of duplicates, marking all longer ones as duplicates. Recently, search
> engines have started to penalize non-https pages by [giving https pages a
> higher rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
> and [marking http as
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should
> be able to prefer https:// over http:// URLs, although the latter are
> shorter by one character. Of course, this should be configurable and should
> complement the existing preferences (length, score, and fetch time) for
> selecting the "best" URL among duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-04-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814388#comment-16814388
 ] 

Hudson commented on NUTCH-2666:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3619 (See 
[https://builds.apache.org/job/Nutch-trunk/3619/])
NUTCH-2666 Increase default value for http.content.limit / (snagel: 
[https://github.com/apache/nutch/commit/13a9a6daf2ca2f764d052ee338b51dc9f91824d5])
* (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java


> Increase default value for http.content.limit / ftp.content.limit / 
> file.content.limit
> --
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Marco Ebbinghaus
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The default value for http.content.limit in nutch-default.xml ("The length
> limit for downloaded content using the http:// protocol, in bytes. If this
> value is nonnegative (>=0), content longer than it will be truncated;
> otherwise, no truncation at all. Do not confuse this setting with the
> file.content.limit setting.") is 64 kB. Maybe this default value should be
> increased, as many pages today are larger than 64 kB.
> This hit me when trying to crawl a single website whose pages are much
> larger than 64 kB: with every crawl cycle the count of db_unfetched URLs
> decreased until it hit zero and the crawler became inactive, because the
> first 64 kB always contained the same set of navigation links.
> The description should also be updated, as this applies not only to the
> http protocol but also to https.
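
Until the default changes, the limit can be raised per installation by overriding the property in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A sketch, with an illustrative value rather than a recommended one:

{code:xml}
<!-- Override in conf/nutch-site.xml; 1048576 (1 MB) is an example value.
     Per the property description, a negative value disables truncation. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
{code}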





[jira] [Commented] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-04-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814389#comment-16814389
 ] 

Hudson commented on NUTCH-2701:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3619 (See 
[https://builds.apache.org/job/Nutch-trunk/3619/])
NUTCH-2701 Fetcher: log dates and times also in human-readable form - (snagel: 
[https://github.com/apache/nutch/commit/0624d2588ee3b2e84be28ffb59db6d62c1456752])
* (edit) src/java/org/apache/nutch/util/TimingUtil.java
* (edit) src/java/org/apache/nutch/fetcher/Fetcher.java


> Fetcher: log dates and times also in human-readable form
> 
>
> Key: NUTCH-2701
> URL: https://issues.apache.org/jira/browse/NUTCH-2701
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.16
>
>
> The Fetcher logs dates and times as epoch milliseconds. It should *also*
> log them in a human-readable format, e.g. in the case of the timelimit:
> {noformat}
> 19/01/11 17:57:56 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1547246036104
> {noformat}
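
A minimal sketch of the intended output, assuming java.time for the conversion; this is not necessarily how the committed TimingUtil/Fetcher change formats the value:

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimelimitLogDemo {
    public static void main(String[] args) {
        // The epoch-millisecond value from the log line quoted above.
        long timelimit = 1547246036104L;
        // Render it in the JVM's default time zone; the pattern is illustrative.
        String human = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneId.systemDefault())
                .format(Instant.ofEpochMilli(timelimit));
        // Prints the raw millis followed by the human-readable form.
        System.out.println("Fetcher Timelimit set for : " + timelimit + " (" + human + ")");
    }
}
{code}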





[jira] [Resolved] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-04-10 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2701.

Resolution: Implemented

Merged/committed. Thanks, [~markus17]!



[jira] [Commented] (NUTCH-2701) Fetcher: log dates and times also in human-readable form

2019-04-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814373#comment-16814373
 ] 

ASF GitHub Bot commented on NUTCH-2701:
---

sebastian-nagel commented on pull request #447: NUTCH-2701 Fetcher: log dates and times also in human-readable form
URL: https://github.com/apache/nutch/pull/447

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-04-10 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2666.

Resolution: Implemented

Merged into master, will be available in 1.16. Thanks, [~mebbinghaus]!



[jira] [Commented] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-04-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814363#comment-16814363
 ] 

ASF GitHub Bot commented on NUTCH-2666:
---

sebastian-nagel commented on pull request #427: NUTCH-2666 Increase default value for http.content.limit / ftp.content.limit / file.content.limit
URL: https://github.com/apache/nutch/pull/427





[jira] [Assigned] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-04-10 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2666:
--

Assignee: Sebastian Nagel



[jira] [Resolved] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-04-10 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2683.

Resolution: Implemented



[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-04-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814358#comment-16814358
 ] 

ASF GitHub Bot commented on NUTCH-2683:
---

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425



> DeduplicationJob: add option to prefer https:// over http://
> 
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The deduplication job allows to keep the shortest URLs as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently search 
> engines started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol the deduplication job should be 
> able to prefer https:// over http:// URLs, although the latter ones are 
> shorter by one character. Of course, this should be configurable and in 
> addition to existing preferences (length, score and fetch time) to select the 
> "best" URL among duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-04-10 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2683:
--

Assignee: Sebastian Nagel
