Hi, well, first of all, why not try it :) protocol-httpclient is very old and seems to have poor support for some TLS-only sites. protocol-http is more low-level, but that also allowed us to add TLS support to it. I would think that in most cases protocol-http will do fine, at least until someone finds the time to upgrade protocol-httpclient to a modern version of httpclient.
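If you want to check what a site will negotiate before switching plugins, something like the standalone sketch below may help. It is not part of Nutch: the class name TlsProbe and its two methods are made up for illustration, and it assumes the server listens on port 443. It prints the TLS versions the JVM enables by default, and, given a host and a protocol name, tries a handshake offering only that protocol.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not a Nutch class.
public class TlsProbe {

    /** Protocol versions the JVM's default SSLContext enables for new connections. */
    static String[] enabledProtocols() {
        try {
            return SSLContext.getDefault().getDefaultSSLParameters().getProtocols();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    /**
     * Handshakes with host:443 (port is an assumption) while offering only the
     * given protocol, e.g. "TLSv1", and returns the version actually negotiated.
     * Throws if the JVM does not support the protocol or the handshake fails.
     */
    static String negotiate(String host, String protocol) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, 443)) {
            socket.setEnabledProtocols(new String[] { protocol });
            socket.startHandshake();
            return socket.getSession().getProtocol();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("JVM default protocols: " + Arrays.toString(enabledProtocols()));
        if (args.length == 2) {
            System.out.println("Negotiated: " + negotiate(args[0], args[1]));
        }
    }
}
```

Run it as "java TlsProbe <host> <protocol>". A handshake failure here only tells you about the plain JVM TLS stack with that one protocol enabled; it does not by itself show how either Nutch plugin behaves.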
I've seen these errors too when using protocol-httpclient for the https scheme, and I am considering using protocol-http for all schemes. Thinking of it, I'll probably suggest this to my colleague tomorrow and see if we get rid of similar errors. Anyway, please test and report if you have the chance :)

M.

-----Original message-----
> From:Jeffery, Scott <[email protected]>
> Sent: Wednesday 23rd March 2016 22:13
> To: [email protected]
> Subject: Re: protocol-http or protocol-httpclient?
>
> I've been unable to crawl the https://www.phoenix.gov site using
> protocol-httpclient. For some reason that site has limited TLS to the older
> TLSv1 and this causes the apache httpclient to respond with error:
>
> "fetch of https://www.phoenix.gov/ failed with: javax.net.ssl.SSLException:
> Received fatal alert: protocol_version"
>
> I've even tried many variations of -D options like
>
> "bin/nutch fetch ... -Dhttps.protocols=SSLv3,TLSv1,TLSv1.1,TLSv1.2 ..."
>
> only to receive the same error.
>
> per Markus' comment maybe I should be using protocol-http even with SSL/TLS
> sites?
>
> Scott
>
> On Tue, Mar 8, 2016 at 8:31 AM, Markus Jelsma <[email protected]>
> wrote:
>
> > Hmm, this was true before we had decent URL normalization. It should run
> > fine although you can encounter SSL issues. But those SSL issues might also
> > be in protocol-http, which now also supports SSL. You should be fine with
> > either plugin.
> > Markus
> >
> > -----Original message-----
> > > From:Joseph Naegele <[email protected]>
> > > Sent: Tuesday 8th March 2016 16:27
> > > To: [email protected]
> > > Subject: protocol-http or protocol-httpclient?
> > >
> > > I'm using Nutch 1.11. The "plugin.includes" section of nutch-default.xml
> > > still states that the protocol-httpclient plugin may present intermittent
> > > problems. Is this still the case? What are the problems?
> > >
> > > There doesn't appear to be any problem crawling HTTPS using the
> > > protocol-http plugin. Why do I need to use protocol-httpclient for crawling
> > > via HTTPS?
> > >
> > > In short, I want to use the "correct" plugin because I am extending it to
> > > perform a bit of extra work. "Correct" in this case means:
> > > - The "recommended" of the two
> > > - Whichever can crawl both HTTP and HTTPS connections
> > > - Whichever performs better
> > >
> > > Thanks,
> > > Joe

