[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468900#comment-16468900
 ] 

ASF GitHub Bot commented on NUTCH-2576:
---

sebastian-nagel commented on issue #328: NUTCH-2576 HTTP protocol 
implementation based on okhttp
URL: https://github.com/apache/nutch/pull/328#issuecomment-387754653
 
 
   Thanks, @jnioche - I'll plan a load test next week. Will check whether the 
connection pool causes any trouble.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> HTTP protocol plugin based on okhttp
> ------------------------------------
>
>         Key: NUTCH-2576
>         URL: https://issues.apache.org/jira/browse/NUTCH-2576
>     Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>    Reporter: Sebastian Nagel
>    Priority: Major
>     Fix For: 1.15
>
>
> [Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
> which supports HTTP/2. [~jnioche]'s implementation 
> [storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
> proves that it should be straightforward to implement a Nutch protocol plugin 
> using okhttp. A recent HTTP protocol implementation should also fix (most of) 
> the issues reported in NUTCH-2549.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-09 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2576:
---
Component/s: protocol
             plugin



[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468750#comment-16468750
 ] 

ASF GitHub Bot commented on NUTCH-2576:
---

sebastian-nagel opened a new pull request #328: NUTCH-2576 HTTP protocol 
implementation based on okhttp
URL: https://github.com/apache/nutch/pull/328
 
 
   A Nutch protocol plugin based on [okhttp](http://square.github.io/okhttp/):
   
   - derived from @jnioche's implementation for [storm-crawler#443](https://github.com/DigitalPebble/storm-crawler/issues/443)
     - uses okhttp's internal buffer for buffering content
   - adapted to be compatible with Nutch and to behave almost the same as protocol-http
     - moved shared configuration settings to HttpBase (lib-http)
   - unit tests taken from protocol-http
   
   TODOs:
   - verify that issues reported in NUTCH-2549 do not appear again
   - complete unit tests
   - benchmark and large-scale test
   
   HTTP/2 support requires a Java runtime that supports ALPN:
   
   ```
   export NUTCH_JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
   
   bin/nutch parsechecker \
     -Dplugin.includes='protocol-okhttp|parse-html' \
     -Dhttp.useHttp2=true \
     -Dstore.http.headers=true \
     https://www.google.com/
   ...
   _response.headers_=HTTP/2 200
   date: Wed, 09 May 2018 11:23:21 GMT
   expires: -1
   cache-control: private, max-age=0
   content-type: text/html; charset=ISO-8859-1
   p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
   content-encoding: gzip
   server: gws
   x-xss-protection: 1; mode=block
   ...
   ```
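
As a rough aid for the ALPN requirement above, the following sketch checks whether the running JVM exposes the standard ALPN API, `SSLEngine.getApplicationProtocol()`, which was added to the JDK with Java 9. The names `AlpnCheck`, `hasMethod` and `supportsAlpn` are hypothetical and are not part of Nutch or okhttp. The presence of the API only indicates that ALPN is available; it does not guarantee that a given server negotiates HTTP/2.

```java
import java.lang.reflect.Method;
import javax.net.ssl.SSLEngine;

// Hypothetical helper (not part of Nutch or okhttp): a heuristic check
// for the standard JDK ALPN API.
final class AlpnCheck {

    // Returns true if the given type has a public method with this name.
    static boolean hasMethod(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    // SSLEngine.getApplicationProtocol() is the JDK's ALPN accessor
    // (available since Java 9); older runtimes usually lack it.
    static boolean supportsAlpn() {
        return hasMethod(SSLEngine.class, "getApplicationProtocol");
    }

    public static void main(String[] args) {
        System.out.println("java.version=" + System.getProperty("java.version")
            + ", ALPN API present: " + supportsAlpn());
    }
}
```

Running it with the same Java installation that `NUTCH_JAVA_HOME` points to shows whether that runtime is a candidate for `-Dhttp.useHttp2=true`.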
   
   
   




[jira] [Commented] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468806#comment-16468806
 ] 

ASF GitHub Bot commented on NUTCH-2576:
---

jnioche commented on issue #328: NUTCH-2576 HTTP protocol implementation based 
on okhttp
URL: https://github.com/apache/nutch/pull/328#issuecomment-387728096
 
 
   @sebastian-nagel one thing I noticed with OkHttp is that its [ConnectionPool](https://github.com/square/okhttp/blob/master/okhttp/src/main/java/okhttp3/ConnectionPool.java) (default maxIdle 5, with eviction after 5 mins) struggles when used with many threads and different hostnames, which would typically be the case with Nutch (and StormCrawler). I have seen an average of 1.5s and up to 6s of contention on the ConnectionPool. My guess is that [the cleanup method](https://github.com/square/okhttp/blob/master/okhttp/src/main/java/okhttp3/ConnectionPool.java#L199) and its synchronized block are the main culprit: it iterates over all the connections but removes only the one which has been idle for the longest.
   
   Apart from that, OkHttp is great: pretty robust and less arcane than Apache HttpClient, IMHO.
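
The eviction pattern described in this comment can be illustrated with a small model. This is a simplified sketch based only on the description above (scan every connection under one lock, evict at most the longest-idle one per pass); it is not OkHttp's code, and `IdlePoolModel`, `Conn`, `cleanupOnce` and `cleanupPasses` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the cleanup pattern described in the comment above.
// NOT OkHttp's implementation; all names here are hypothetical.
final class IdlePoolModel {

    static final class Conn {
        final String host;
        final long idleSinceNanos;

        Conn(String host, long idleSinceNanos) {
            this.host = host;
            this.idleSinceNanos = idleSinceNanos;
        }
    }

    private final List<Conn> idle = new ArrayList<>();
    private final int maxIdle;
    private final long keepAliveNanos;

    IdlePoolModel(int maxIdle, long keepAliveNanos) {
        this.maxIdle = maxIdle;
        this.keepAliveNanos = keepAliveNanos;
    }

    synchronized void put(Conn c) {
        idle.add(c);
    }

    synchronized int size() {
        return idle.size();
    }

    // One cleanup pass: scans ALL connections while holding the lock,
    // but evicts at most ONE (the longest-idle). Returns true if it evicted.
    synchronized boolean cleanupOnce(long nowNanos) {
        Conn longestIdle = null;
        long longestIdleNanos = Long.MIN_VALUE;
        for (Conn c : idle) {
            long idleNanos = nowNanos - c.idleSinceNanos;
            if (idleNanos > longestIdleNanos) {
                longestIdleNanos = idleNanos;
                longestIdle = c;
            }
        }
        boolean expired = longestIdleNanos >= keepAliveNanos;
        boolean overLimit = idle.size() > maxIdle;
        if (longestIdle != null && (expired || overLimit)) {
            idle.remove(longestIdle);
            return true;
        }
        return false;
    }

    // Repeats passes until nothing more is evicted; returns the pass count.
    int cleanupPasses(long nowNanos) {
        int passes = 0;
        while (cleanupOnce(nowNanos)) {
            passes++;
        }
        return passes;
    }
}
```

In this model, a pool with maxIdle 5 that holds 100 idle connections to distinct hosts needs 95 evicting passes to get back under the limit, and each pass walks the whole list under the lock. With many fetcher threads calling `put` concurrently, that lock is where the waiting time would accumulate.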
   




[jira] [Created] (NUTCH-2576) HTTP protocol plugin based on okhttp

2018-05-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2576:
--

     Summary: HTTP protocol plugin based on okhttp
         Key: NUTCH-2576
         URL: https://issues.apache.org/jira/browse/NUTCH-2576
     Project: Nutch
  Issue Type: Improvement
    Reporter: Sebastian Nagel
     Fix For: 1.15


[Okhttp|http://square.github.io/okhttp/] is an Apache2-licensed http library 
which supports HTTP/2. [~jnioche]'s implementation 
[storm-crawler#443|https://github.com/DigitalPebble/storm-crawler/issues/443] 
proves that it should be straightforward to implement a Nutch protocol plugin 
using okhttp. A recent HTTP protocol implementation should also fix (most of) 
the issues reported in NUTCH-2549.


