[jira] [Commented] (NUTCH-2580) Improvements for Rabbitmq support

2018-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489775#comment-16489775
 ] 

ASF GitHub Bot commented on NUTCH-2580:
---

r0ann3l opened a new pull request #335: fix for NUTCH-2580 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/335
 
 
   This patch includes changes in:
   
   lib-rabbitmq (**new**)
   
   - Common functionality, such as opening a new connection to a RabbitMQ server, 
opening a channel over the connection and binding an exchange to a queue. 
All of it is wrapped in `RabbitMQClient.class`.
   - Upgrade of the RabbitMQ client library from version 3.6.5 to 5.2.0.
   - Support for the arguments that RabbitMQ accepts when an exchange, a queue 
or a binding is created.
   - Username, password, hostname, port and virtual host merged into a single 
property.
   
   indexer-rabbit (**modified**)
   
   - A single document or multiple documents in each message.
   - Optional binding.
   - Headers from NutchDocument's fields.
   - Static headers.
   
   publish-rabbitmq (**modified**)
   
   - Optional binding.
   - Static headers.
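
To illustrate the "single property" item above, here is a minimal sketch of how one URI-style value could be split back into username, password, hostname, port and virtual host. It uses only `java.net.URI`; the class name `AmqpEndpoint`, the example URI and the fallback values (guest/guest, port 5672, virtual host "/") are assumptions made for illustration and are not taken from the patch.

```java
import java.net.URI;

/** Hypothetical holder for connection settings parsed from a single URI-style property. */
final class AmqpEndpoint {
  final String username;
  final String password;
  final String host;
  final int port;
  final String virtualHost;

  private AmqpEndpoint(String username, String password, String host, int port,
      String virtualHost) {
    this.username = username;
    this.password = password;
    this.host = host;
    this.port = port;
    this.virtualHost = virtualHost;
  }

  /** Parses e.g. "amqp://user:pass@host:5672/vhost"; missing parts fall back to assumed defaults. */
  static AmqpEndpoint parse(String uri) {
    URI u = URI.create(uri);
    String user = "guest"; // assumed default
    String pass = "guest"; // assumed default
    String userInfo = u.getUserInfo(); // "user:password", "user" or null
    if (userInfo != null) {
      int colon = userInfo.indexOf(':');
      user = colon < 0 ? userInfo : userInfo.substring(0, colon);
      if (colon >= 0) {
        pass = userInfo.substring(colon + 1);
      }
    }
    int port = u.getPort() == -1 ? 5672 : u.getPort(); // -1 means "not given"
    String path = u.getPath(); // "/vhost", "/" or ""
    String vhost = (path == null || path.length() <= 1) ? "/" : path.substring(1);
    return new AmqpEndpoint(user, pass, u.getHost(), port, vhost);
  }
}
```

A value such as `amqp://crawler:secret@mq.example.org:5673/nutch` would then yield all five settings from one configuration entry.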


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improvements for Rabbitmq support
> -
>
> Key: NUTCH-2580
> URL: https://issues.apache.org/jira/browse/NUTCH-2580
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.14
>Reporter: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.15
>
>
> This one includes:
>  # Creation of lib-rabbitmq for common functionalities (publish-rabbitmq and 
> indexer-rabbit).
>  # Update of the RabbitMQ's library version.
>  # Headers selection from NutchDocument's fields (for indexer-rabbit).
>  # Optional binding.
>  # A single or multiple documents into each message.
>  # Options for the creation of exchange, queue and binding.
>  # Simplify the configuration options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2580) Improvements for Rabbitmq support

2018-05-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roannel Fernández Hernández updated NUTCH-2580:
---
Description: 
This one includes:
 # Creation of lib-rabbitmq for common functionalities (publish-rabbitmq and 
indexer-rabbit).
 # Update of the RabbitMQ's library version.
 # Headers selection from NutchDocument's fields (for indexer-rabbit).
 # Optional binding.
 # A single or multiple documents into each message.
 # Options for the creation of exchange, queue and binding.
 # Simplify the configuration options.

  was:
This one includes:
 # Creation of lib-rabbitmq for common functionalities (publish-rabbitmq and 
indexer-rabbit).
 # Support for default exchange (empty).
 # Headers selection from NutchDocument's fields (for indexer-rabbit).
 # Update of the library version.







[jira] [Commented] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489580#comment-16489580
 ] 

Sebastian Nagel commented on NUTCH-2584:


Hi [~Bl4ck1c3], I don't know what went wrong. The description is not really clear 
about what to do in which directory. I've made this clearer, see [this 
branch|https://github.com/apache/nutch/compare/master...sebastian-nagel:NUTCH-2583-upgrade-dependencies].
 It is not yet a pull request because the parse-tika unit tests fail:
{noformat}
% ant -Dplugin=parse-tika clean test-plugin
...
[junit] Test org.apache.nutch.tika.TestDOMContentUtils FAILED
{noformat}

This needs a closer look at what has changed with the upgrade. Feel free to continue 
and take from the branch whatever you need. Thanks!

> Upgrade parse-tika to use Tika 1.18
> ---
>
> Key: NUTCH-2584
> URL: https://issues.apache.org/jira/browse/NUTCH-2584
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> Tika 1.18 is released and NUTCH-2583 includes an upgrade of tika-core.
> See 
> [howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
>  





[jira] [Commented] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18

2018-05-24 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489311#comment-16489311
 ] 

Ralf commented on NUTCH-2584:
-

Hi,

I tried this, but it does not work for me. The build exits with:

[ivy:resolve]  ERRORS
[ivy:resolve] impossible to get artifacts when data has not been loaded. 
IvyNode = javax.measure#unit-api;1.0






[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489246#comment-16489246
 ] 

ASF GitHub Bot commented on NUTCH-2576:
---

sebastian-nagel commented on issue #328: NUTCH-2576 HTTP protocol 
implementation based on okhttp
URL: https://github.com/apache/nutch/pull/328#issuecomment-391711875
 
 
   Done:
   - large-scale test (distributed mode on CDH 5.14.2): 195 million pages 
fetched from 28 million hosts (90 million hosts in CrawlDb) in 4 cycles using 
48 Fetcher tasks each with 120 threads. No issues with the connection pool, or at 
least none that would be noticeable before the waits for locks described or linked in 
[NUTCH-2578](https://issues.apache.org/jira/browse/NUTCH-2578) are addressed.
   
   New TODOs:
   - setting/using Cookies
   - re-throw exceptions as HttpException so that they can be handled by Fetcher




> HTTP protocol plugin based on okhttp
> 
>
> Key: NUTCH-2576
> URL: https://issues.apache.org/jira/browse/NUTCH-2576
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.





[jira] [Commented] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489145#comment-16489145
 ] 

Sebastian Nagel commented on NUTCH-2583:


Thanks, [~Bl4ck1c3]! I can confirm that Nutch builds and all unit tests pass. 
I'll try to test in pseudo-distributed mode over the next few days. Opened 
NUTCH-2584 to sync parse-tika with the tika-core dependency.

> Upgrading Nutch's dependencies
> --
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.14
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
>
> Hi,
>  
> It would be nice to be able to upgrade all of Nutch's dependencies to the 
> latest possible available.
> I've attached an Ivy.xml with the latest possible dependencies without 
> breaking the compile. I've tested it with a few runs of the "crawl script", 
> so far it seems to work, it generates, it fetches, it parses, it indexes to 
> Solr. Increasing any of these dependencies breaks the compile.
>  
> PS: I haven't touched any of the Hadoop stuff and don't remember if I touched 
> the testing part or not.





[jira] [Created] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18

2018-05-24 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2584:
--

 Summary: Upgrade parse-tika to use Tika 1.18
 Key: NUTCH-2584
 URL: https://issues.apache.org/jira/browse/NUTCH-2584
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.14
Reporter: Sebastian Nagel
 Fix For: 1.15


Tika 1.18 is released and NUTCH-2583 includes an upgrade of tika-core.
See 
[howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
 





[jira] [Comment Edited] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480867#comment-16480867
 ] 

Sebastian Nagel edited comment on NUTCH-2578 at 5/24/18 2:30 PM:
-

Hi [~yossi], got it: definitely a good idea to keep the Tika instance in the 
object cache. Nevertheless, since we know that MimeUtil is thread-safe and 
ObjectCache getters are synchronized (since NUTCH-1606), I would also hold the 
reference in the protocol implementation. The protocol instance is already 
cached by ProtocolFactory and we avoid extra access of the object cache. Does 
this make sense?
 Briefly, the background on how this issue was detected: while testing 
whether the connection pool of okhttp (see NUTCH-2576) causes any locks, I've 
found that other locks appeared much more often in the stacks: NUTCH-2579, 
TIKA-2645, and some more I need to investigate.
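
The idea of holding the reference in the protocol implementation can be sketched as follows. This is only an illustration under stated assumptions: `MimeDetector`, `FetchedContent` and `SketchProtocol` are hypothetical stand-ins, not the Nutch classes MimeUtil, Content or any protocol plugin, and the detection logic is a placeholder. The point is that the expensive, thread-safe helper is created once per protocol instance and passed to each content object instead of being constructed per fetch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive but thread-safe helper such as MimeUtil. */
final class MimeDetector {
  /** Counts constructions so the effect of sharing is observable. */
  static final AtomicInteger CREATED = new AtomicInteger();

  MimeDetector() {
    CREATED.incrementAndGet(); // in the real case this is where config files are read
  }

  /** Placeholder detection logic, for illustration only. */
  String detect(String url) {
    return url.endsWith(".html") ? "text/html" : "application/octet-stream";
  }
}

/** Hypothetical content object that receives the shared helper instead of creating one. */
final class FetchedContent {
  final String url;
  final String mimeType;

  FetchedContent(String url, MimeDetector detector) {
    this.url = url;
    this.mimeType = detector.detect(url);
  }
}

/** Hypothetical protocol implementation holding a single helper for its whole lifetime. */
final class SketchProtocol {
  private final MimeDetector detector = new MimeDetector();

  FetchedContent fetch(String url) {
    // No new MimeDetector per fetch: all fetcher threads share the same instance.
    return new FetchedContent(url, detector);
  }
}
```

Because the protocol instance itself is already cached, every fetch through it reuses the same helper without going through a synchronized cache lookup.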


was (Author: wastl-nagel):
Hi [~yossi], got it: definitely a good idea to keep the Tika instance in the 
object cache. Nevertheless, since we know that MimeUtil is thread-safe and 
ObjectCache getters are synchronized (since NUTCH-1606), I would also hold the 
reference in the protocol implementation. The protocol instance is already 
cached by ProtocolFactory and we avoid extra access of the object cache. Does 
this make sense?
Shortly about the background how this issue has been detected: while testing 
whether the connection pool of okhttp (see NUTCH-2576) causes any locks, I've 
found that other locks appeared much more often in the stacks: NUTCH-2578, 
TIKA-2645, and some more I need to investigate.

> Avoid lock by MimeUtil in constructor of protocol.Content
> -
>
> Key: NUTCH-2578
> URL: https://issues.apache.org/jira/browse/NUTCH-2578
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new 
> MimeUtil object. That's not cheap as it always creates a new Tika object and 
> there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x7f70523c3800 nid=0x1de2 waiting for monitor entry [0x7f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>     - waiting to lock <0x0005e0285758> (a java.util.jar.JarFile)
>     at java.util.jar.JarFile.getEntry(JarFile.java:240)
>     at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>     at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>     at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>     at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>     at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>     at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>     at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>     at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>     at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>     at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>     at java.util.Collections.list(Collections.java:5239)
>     at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>     at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>     at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>     at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>     at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>     at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>     at org.apache.tika.Tika.<init>(Tika.java:116)
>     at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>     at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck: 
> running a Fetcher with 120 threads, I've found up to 50 threads waiting for 
> this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 

[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf updated NUTCH-2583:

Description: 
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

I've attached an Ivy.xml with the latest possible dependencies without breaking 
the compile. I've tested it with a few runs of the "crawl script", so far it 
seems to work, it generates, it fetches, it parses, it indexes to Solr. 
Increasing any of these dependencies breaks the compile.

 

PS: I haven't touched any of the Hadoop stuff and don't remember if I touched 
the testing part or not.

  was:
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

I've attached an Ivy.xml with the latest possible dependencies without breaking 
the compile. I've tested it with a few runs of the "crawl script", so far it 
seems to work, it generates, it fetches, it parses, it indexes to Solr. 







[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf updated NUTCH-2583:

Description: 
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

I've attached an Ivy.xml with the latest possible dependencies without breaking 
the compile. I've tested it with a few runs of the "crawl script", so far it 
seems to work, it generates, it fetches, it parses, it indexes to Solr. 

  was:
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

 


> Upgrading Nutch's dependencies
> --
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.14
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
>
> Hi,
>  
> It would be nice to be able to upgrade all of Nutch's dependencies to the 
> latest possible available.
> I've attached an Ivy.xml with the latest possible dependencies without 
> breaking the compile. I've tested it with a few runs of the "crawl script", 
> so far it seems to work, it generates, it fetches, it parses, it indexes to 
> Solr. 





[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf updated NUTCH-2583:

Attachment: ivy.xml

> Upgrading Nutch's dependencies
> --
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.14
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
>
> Hi,
>  
> It would be nice to be able to upgrade all of Nutch's dependencies to the 
> latest possible available.
>  





[jira] [Created] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)
Ralf created NUTCH-2583:
---

 Summary: Upgrading Nutch's dependencies
 Key: NUTCH-2583
 URL: https://issues.apache.org/jira/browse/NUTCH-2583
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.14
Reporter: Ralf
 Fix For: 1.15


Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

 





[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488992#comment-16488992
 ] 

Sebastian Nagel commented on NUTCH-2290:


Great, [~Bl4ck1c3]! Please open a new issue (this one should stay for the 
licenses) and a pull request with your changes. Thanks!

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance with [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g. 
> {{lib/icu4j-4_0_1.jar}}. This would require updating LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain but the HOWTO requires us to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO there is no need to repeat the Apache license again 
> and again.





[jira] [Comment Edited] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488985#comment-16488985
 ] 

Sebastian Nagel edited comment on NUTCH-2557 at 5/24/18 1:32 PM:
-

See [comments in 
NUTCH-2549|https://issues.apache.org/jira/browse/NUTCH-2549?focusedCommentId=16430591=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16430591]:
 by default, content from redirects and 404s should be ignored, but it should be 
possible to optionally fetch and store the content (e.g. by adding a property 
{{http.content.store.404}}).


was (Author: wastl-nagel):
See [comments in 
NUTCH-2549|https://issues.apache.org/jira/browse/NUTCH-2549?focusedCommentId=16430591=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16430591]:
 please try to make this the default but allow to optionally fetch and store 
the content.

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Major
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalid gzip-encoded content. Browsers follow the 
> redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.





[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries

2018-05-24 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488984#comment-16488984
 ] 

Ralf commented on NUTCH-2290:
-

I've got an Ivy.xml with updated dependencies, as far up as possible without 
breaking the compile. I don't know about the rest, but so far it seems to work on 
a few trial runs with the crawl script.






[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488985#comment-16488985
 ] 

Sebastian Nagel commented on NUTCH-2557:


See [comments in 
NUTCH-2549|https://issues.apache.org/jira/browse/NUTCH-2549?focusedCommentId=16430591=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16430591]:
 please try to make this the default but allow to optionally fetch and store 
the content.






[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488970#comment-16488970
 ] 

ASF GitHub Bot commented on NUTCH-2576:
---

sebastian-nagel commented on issue #328: NUTCH-2576 HTTP protocol 
implementation based on okhttp
URL: https://github.com/apache/nutch/pull/328#issuecomment-391711875
 
 
   Done:
   - large-scale test (distributed mode on CDH 5.14.2): 195 million pages 
fetched from 90 million hosts using 48 Fetcher tasks each with 120 threads. No 
issues with the connection pool, or at least none that would be noticeable before 
the waits for locks described or linked in 
[NUTCH-2578](https://issues.apache.org/jira/browse/NUTCH-2578) are addressed.
   
   New TODOs:
   - setting/using Cookies
   - re-throw exceptions as HttpException so that they can be handled by Fetcher









[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488947#comment-16488947
 ] 

Sebastian Nagel commented on NUTCH-2549:


Thanks, [~gbouchar]! Could you split the patch and address each sub-issue 
separately? There are also PRs open for review, see NUTCH-2562 and NUTCH-2576 
(tested but not ready yet).

> protocol-http does not behave the same as browsers
> --
>
> Key: NUTCH-2549
> URL: https://issues.apache.org/jira/browse/NUTCH-2549
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing 
> the HTTP protocol):
>  * It fails if a URL's path does not start with '/'
>  ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers 
> correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries 
> to send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
>  * It advertises its requests as being HTTP/1.0, but sends an 
> _Accept-Encoding_ request header, that is defined only in HTTP/1.1. This 
> confuses some web servers
>  ** Example: 
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
>  * If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  ** Example: [http://www.webarcelona.net/es/blog?page=2]
>  * Some servers invalidly send an HTTP body directly without a status line or 
> headers. Browsers handle that, protocol-http doesn't:
>  ** Example: [https://app.unitymedia.de/]
>  * Some servers invalidly add colons after the HTTP status code in the status 
> line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
> found_ for instance). Browsers can handle that.
>  * Some servers invalidly send headers that span over multiple lines. In that 
> case, browsers simply ignore the subsequent lines, but protocol-http throws 
> an error, thus preventing us from fetching the contents of the page.
>  * There is no limit on the size of the HTTP headers it reads. A bogus 
> server could send an infinite stream of different HTTP headers and cause the 
> fetcher to run out of memory, or send the same HTTP header repeatedly and 
> cause the fetcher to time out.
>  * The same goes for the HTTP status line: no check is made concerning its 
> size.
>  * While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
> Additionally (and this concerns protocol-httpclient as well), 
> when reading HTTP headers, for each header, the SpellCheckedMetadata class 
> computes a Levenshtein distance between it and every known header in the 
> HttpHeaders interface. Not only is that slow, non-standard, and inconsistent 
> with browsers' behavior, but it also causes bugs and prevents us from accessing 
> the real headers sent by the HTTP server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, whereas it is not.
>  
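
Two of the points listed above (a request target that must start with '/', and a status line with a stray colon after the status code) can be illustrated with a short Java sketch. This is not the attached patch and not Nutch code: the class LenientHttpSketch and its methods are hypothetical names chosen for illustration, and the regular expression is only one possible way to tolerate the colon.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class LenientHttpSketch {

  // Accepts "HTTP/1.1 404 Not found" as well as the invalid
  // "HTTP/1.1 404: Not found" (optional colon after the status code).
  private static final Pattern STATUS_LINE =
      Pattern.compile("^HTTP/\\d+\\.\\d+\\s+(\\d{3}):?(?:\\s.*)?$");

  /** Builds the request target, defaulting an empty path to "/". */
  static String requestTarget(URI uri) {
    String path = uri.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/"; // browsers rewrite http://host?x to http://host/?x
    }
    String query = uri.getRawQuery();
    return query == null ? path : path + "?" + query;
  }

  /** Returns the numeric status code, or -1 if the line is not a status line. */
  static int statusCode(String statusLine) {
    Matcher m = STATUS_LINE.matcher(statusLine.trim());
    return m.matches() ? Integer.parseInt(m.group(1)) : -1;
  }

  public static void main(String[] args) {
    URI uri = URI.create("http://news.fx678.com?171");
    System.out.println("GET " + requestTarget(uri) + " HTTP/1.0");
    System.out.println(statusCode("HTTP/1.1 404: Not found"));
  }
}
```

The sketch uses java.net.URI rather than string slicing so that the path and query are separated by the JDK's own parser; whether the real plugin should do the same is a design question for the patch.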



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2500) Add pull-reqest template to github

2018-05-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488933#comment-16488933
 ] 

Hudson commented on NUTCH-2500:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3526 (See 
[https://builds.apache.org/job/Nutch-trunk/3526/])
NUTCH-2500 Add pull-reqest template for github (snagel: 
[https://github.com/apache/nutch/commit/2cf5e1cc8cc9eef70c1af39330f4d377ae5ab156])
* (add) .github/pull_request_template.md


> Add pull-reqest template to github
> --
>
> Key: NUTCH-2500
> URL: https://issues.apache.org/jira/browse/NUTCH-2500
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
>
> GitHub allows adding [pull request 
> templates](https://help.github.com/articles/creating-a-pull-request-template-for-your-repository/).
>  For contributors already familiar with GitHub from other projects, that's 
> probably the best place to show a checklist that helps us get pull 
> requests merged more quickly. Here's a draft:
> {noformat}
> Thanks for your contribution to [Apache Nutch](http://nutch.apache.org/)! 
> Your help is appreciated!
> Before opening the pull request, please verify that
> * there is an open issue on the [Nutch issue 
> tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the 
> problem or the improvement. We cannot accept pull requests without an issue 
> because the change wouldn't be listed in the release notes.
> * the issue ID (`NUTCH-`)
>   - is referenced in the title of the pull request
>   - and placed in front of your commit messages
> * commits are squashed into a single one (or few commits for larger changes)
> * Java source code follows [Nutch Eclipse Code Formatting 
> rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
> * Nutch builds and unit tests pass by running `ant clean runtime test`
> {noformat}





[jira] [Commented] (NUTCH-2500) Add pull-reqest template to github

2018-05-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488915#comment-16488915
 ] 

Hudson commented on NUTCH-2500:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1609 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1609/])
NUTCH-2500 Add pull-reqest template for github (snagel: 
[https://github.com/apache/nutch/commit/ea62c401b4eb62beee10c3540d448c0b858dcbb6])
* (add) .github/pull_request_template.md




[jira] [Updated] (NUTCH-2500) Add pull-reqest template to github

2018-05-24 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2500:
---
Component/s: documentation



[jira] [Resolved] (NUTCH-2500) Add pull-reqest template to github

2018-05-24 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2500.

Resolution: Fixed
  Assignee: Sebastian Nagel

Thanks, [~lewismc]! The instructions are now shown in the description field 
when a pull request is opened.

> Add pull-reqest template to github
> --
>
> Key: NUTCH-2500
> URL: https://issues.apache.org/jira/browse/NUTCH-2500
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
>


[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-24 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488900#comment-16488900
 ] 

Omkar Reddy commented on NUTCH-2575:


I have taken up [NUTCH-2557|https://issues.apache.org/jira/browse/NUTCH-2557] 
and started working on it. Thanks. 

> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
> Fix For: 1.15
>
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays null, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.
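
The behaviour the description asks for can be sketched as follows. This is a minimal illustration, not the actual Nutch fix: the class ChunkedReaderSketch, its methods and the maxContent parameter are hypothetical, and only the idea of keeping a running contentBytesRead counter up to date comes from the issue text. Trailers and chunk extensions are handled only superficially.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch for illustration only; not part of Nutch. */
final class ChunkedReaderSketch {

  /** Reads one line (LF or CRLF terminated) and returns it without the line ending. */
  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n') {
      if (c != '\r') {
        sb.append((char) c);
      }
    }
    return sb.toString();
  }

  /** Decodes a chunked body, stopping once maxContent bytes have been read. */
  static byte[] readChunked(InputStream in, int maxContent) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int contentBytesRead = 0; // updated after every read, unlike in the bug report
    while (contentBytesRead < maxContent) {
      String sizeLine = readLine(in);
      int semi = sizeLine.indexOf(';'); // drop chunk extensions, if any
      if (semi >= 0) {
        sizeLine = sizeLine.substring(0, semi);
      }
      sizeLine = sizeLine.trim();
      if (sizeLine.isEmpty()) {
        throw new IOException("missing chunk length");
      }
      int remaining = Integer.parseInt(sizeLine, 16);
      if (remaining == 0) {
        break; // last chunk
      }
      while (remaining > 0 && contentBytesRead < maxContent) {
        int want = Math.min(Math.min(remaining, buf.length), maxContent - contentBytesRead);
        int n = in.read(buf, 0, want);
        if (n == -1) {
          throw new IOException("premature end of chunk");
        }
        out.write(buf, 0, n);
        remaining -= n;
        contentBytesRead += n;
      }
      if (remaining == 0) {
        readLine(in); // consume the CRLF that follows the chunk data
      }
    }
    return out.toByteArray();
  }

  /** Convenience overload for in-memory input; wraps IOException as unchecked. */
  static byte[] readChunked(byte[] raw, int maxContent) {
    try {
      return readChunked(new ByteArrayInputStream(raw), maxContent);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the counter is checked both before reading a new chunk header and while reading chunk data, the reader stops in the middle of a chunk instead of trying to parse leftover chunk data as the next chunk length.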





[jira] [Updated] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-05-24 Thread Gerard Bouchar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Bouchar updated NUTCH-2549:
--
Attachment: NUTCH-2549.patch

> protocol-http does not behave the same as browsers
> --
>
> Key: NUTCH-2549
> URL: https://issues.apache.org/jira/browse/NUTCH-2549
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerard Bouchar
>Priority: Major
> Attachments: NUTCH-2549.patch
>
>


[jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers

2018-05-24 Thread Gerard Bouchar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488894#comment-16488894
 ] 

Gerard Bouchar commented on NUTCH-2549:
---

 [^NUTCH-2549.patch] 



[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-24 Thread Gerard Bouchar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1649#comment-1649
 ] 

Gerard Bouchar commented on NUTCH-2575:
---

Thank you for the fix! Is any work being done on the other sub-issues?



[jira] [Commented] (NUTCH-2500) Add pull-reqest template to github

2018-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488890#comment-16488890
 ] 

ASF GitHub Bot commented on NUTCH-2500:
---

sebastian-nagel closed pull request #333: NUTCH-2500 Add pull-reqest template 
for github
URL: https://github.com/apache/nutch/pull/333
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
new file mode 100644
index 0..e549eb0e3
--- /dev/null
+++ b/.github/pull_request_template.md
@@ -0,0 +1,13 @@
+Thanks for your contribution to [Apache Nutch](http://nutch.apache.org/)! Your help is appreciated!
+
+Before opening the pull request, please verify that
+* there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
+* the issue ID (`NUTCH-`)
+  - is referenced in the title of the pull request
+  - and placed in front of your commit messages
+* commits are squashed into a single one (or few commits for larger changes)
+* Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
+* Nutch is successfully built and unit tests pass by running `ant clean runtime test`
+* there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
+
+We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](http://nutch.apache.org/mailing_lists.html). Thanks!


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add pull-reqest template to github
> --
>
> Key: NUTCH-2500
> URL: https://issues.apache.org/jira/browse/NUTCH-2500
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Sebastian Nagel
>Priority: Minor
>


[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488835#comment-16488835
 ] 

Sebastian Nagel commented on NUTCH-2290:


That's not a bad idea but it's a lot of work: some classes need fixes to 
compile with newer versions and we need to test everything (all tools, ideally 
also the plugins) to check whether conflicting dependencies cause issues at 
runtime. For many libs it's also safer not to use more recent versions than 
required by Hadoop. Core dependencies (without plugins) and their licenses can 
be viewed by running:
{noformat}
ant clean report
# open build/org.apache.nutch-nutch-test.html in a browser
{noformat}

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance with [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g. 
> {{lib/icu4j-4_0_1.jar}}. This would require updating LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain but the HOWTO requires us to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO there is no need to repeat the Apache license again 
> and again.





[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9

2018-05-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488708#comment-16488708
 ] 

Sebastian Nagel commented on NUTCH-2512:


Hi [~Bl4ck1c3], while testing HTTP/2 (NUTCH-2576) I was able to run the Nutch 
master compiled with Java 8 successfully using a Java 11 runtime. I've tried to 
reproduce your problem with Nutch master and 1.14 (both compiled with Java 8) 
and Java 11 as runtime in local mode: injector works fine and succeeds. Either 
the problem is specific to Java 9 and has been fixed in later versions, or it 
has another cause. The stack trace resembles the problem described in NUTCH-2533. 
Could you share more details about Java versions and mode 
(local/pseudo-distributed)? Does Injector run with the same command using a 
Java 8 runtime? Thanks!

> Nutch 1.14 does not work under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> After commenting out the "offending code", the build finishes but the 
> resulting binary fails to function (as does the Apache compiled binary 
> distribution). Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>  
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.


