Hello, thanks for the answers.

In the meantime I had already tried something else (sorry for the late reply): I ran the same crawl with Apache Nutch 1.9, and there I could crawl the site.

What puzzles me is this: when I change the nutch-site.xml file to include the protocol-httpclient plugin:

<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

I get the handshake error again. In version 1.8 I had to include the plugin because otherwise I got this error:

fetch of https://www.sit.de/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:83)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:675)

My question now: is it possible to simply use Apache Nutch 1.9 without the plugin? So far I have not seen any errors with this setup.

Thanks, and have a nice weekend

Martin







On Mon, 23 Feb 2015 21:15:55 +0100
 Sebastian Nagel <[email protected]> wrote:
Alternatively, have a look at this description
how to manually add the certificates:
http://stackoverflow.com/questions/6659360/how-to-solve-javax-net-ssl-sslhandshakeexception-error
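
Before importing any certificates, it can help to check whether a plain JVM can complete the TLS handshake with the host at all, independent of Nutch. The following is only a minimal sketch under stated assumptions: the class name HandshakeProbe and its helper methods are made up for illustration and are not part of Nutch, and the probe uses the JVM's default SSL socket factory and trust store, which is not necessarily how protocol-httpclient sets up its connections.

```java
import java.io.IOException;
import java.net.URI;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not part of Nutch: tries a TLS handshake with JVM defaults.
class HandshakeProbe {

    // Port to connect to: the explicit port of the URL, or 443 if none is given.
    static int portOf(URI uri) {
        return uri.getPort() != -1 ? uri.getPort() : 443;
    }

    // Opens a TLS connection and returns the negotiated protocol and cipher suite.
    static String probe(String url) throws IOException {
        URI uri = URI.create(url);
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(uri.getHost(), portOf(uri))) {
            socket.setSoTimeout(10000);  // same 10 s as http.timeout in the log
            socket.startHandshake();     // throws SSLHandshakeException on failure
            SSLSession session = socket.getSession();
            return session.getProtocol() + " / " + session.getCipherSuite();
        }
    }

    public static void main(String[] args) {
        // Default URL is the one from this thread; pass another as first argument.
        String url = args.length > 0 ? args[0] : "https://www.sit.de/";
        try {
            System.out.println("handshake ok: " + probe(url));
        } catch (IOException e) {
            System.out.println("handshake failed: " + e);
        }
    }
}
```

If the probe fails as well, running it with -Djavax.net.debug=ssl:handshake prints the individual handshake messages, which usually shows at which step the remote host closes the connection.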

On 02/23/2015 05:02 PM, Eyeris RodrIguez Rueda wrote:
Hello Martin.
I think the problem is with the httpclient protocol plugin. I have this problem too, and it also happens when I run parsechecker or indexchecker. In my case it occurs when the certificate is self-signed; see http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%[email protected]%3E. You can make the changes described there, recompile the plugin, and try again. As an alternative, I use the httpclient protocol plugin from Nutch 1.5.1, where the problem does not occur. Please write to the list again if you make any progress.




----- Original message -----
From: "Martin Krauss" <[email protected]>
To: [email protected]
Sent: Monday, 23 February 2015 8:17:49
Subject: [MASSMAIL]Error SSLHandshakeException Crawling sites with https

Hello,

When crawling some sites over https I get the error below.

Other https sites work fine.

Please help. Martin

fetching: https://www.sit.de
Fetch failed with protocol status: exception(16), lastModified=0: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake

2015-02-20 15:35:14,889 INFO parse.ParserChecker - fetching: https://www.sit.de
2015-02-20 15:35:15,520 INFO httpclient.Http - http.proxy.host = null
2015-02-20 15:35:15,521 INFO httpclient.Http - http.proxy.port = 8080
2015-02-20 15:35:15,522 INFO httpclient.Http - http.timeout = 10000
2015-02-20 15:35:15,522 INFO httpclient.Http - http.content.limit = 65536
2015-02-20 15:35:15,522 INFO httpclient.Http - http.agent = SIT_NUTCH_SPIDER/Nutch-1.8
2015-02-20 15:35:15,522 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-02-20 15:35:15,522 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-02-20 15:35:16,232 ERROR httpclient.Http - Failed to get protocol output
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:869)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1190)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:657)
    at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:108)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at org.apache.commons.httpclient.HttpConnection.flushRequestOutputStream(HttpConnection.java:828)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.flushRequestOutputStream(MultiThreadedHttpConnectionManager.java:1565)
    at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2116)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)

Kind regards

Martin Krauß

Gottlieb-Daimler-Schule 2

mit Abteilung Akademie für Datenverarbeitung

Böblinger Straße 73        71065 Sindelfingen

Phone:  +49 (0)7031 6117-135

Fax:        +49 (0)7031 6117-119

E-Mail:   [email protected]


