Hello, thanks for the answers.
In the meantime I had already tried something else (sorry for the late
reply): I ran the same crawl with Apache Nutch 1.9, and there I could
crawl the site.
What puzzles me is that when I change the nutch-site.xml file to
include the plugin:
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
I get the handshake error again. In version 1.8 I had to include it
because of this error:
fetch of https://www.sit.de/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:83)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:675)
My question now: is it possible to simply use Apache Nutch 1.9 without the
plugin? So far I have not seen any errors with this setup.
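
To see which protocol plugins the quoted plugin.includes value actually selects, here is a minimal sketch. It assumes Nutch treats plugin.includes as a regular expression that must match a plugin id in full; the class and method names (PluginIncludesCheck, isIncluded) are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

final class PluginIncludesCheck {

    // Value copied from the nutch-site.xml fragment quoted in this message.
    static final String PLUGIN_INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** Returns true if the whole plugin id matches the plugin.includes expression. */
    static boolean isIncluded(String includes, String pluginId) {
        // Assumption: the id must match the expression entirely, not just a prefix.
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"protocol-http", "protocol-httpclient", "parse-tika"}) {
            System.out.println(id + " -> " + isIncluded(PLUGIN_INCLUDES, id));
        }
    }
}
```

Under that assumption, this value selects protocol-httpclient but not protocol-http, so all fetching with this configuration would go through the httpclient plugin.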
Thanks, and have a nice weekend,
Martin
On Mon, 23 Feb 2015 21:15:55 +0100
Sebastian Nagel <[email protected]> wrote:
Alternatively, have a look at this description of
how to add the certificates manually:
http://stackoverflow.com/questions/6659360/how-to-solve-javax-net-ssl-sslhandshakeexception-error
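
Before importing certificates, it may help to check whether a plain JVM client can complete the handshake at all. The following is only a diagnostic sketch using the standard JSSE API, not Nutch code; the class name HandshakeProbe, its method names, and the 10-second timeout are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

final class HandshakeProbe {

    /** Protocol versions the JVM's default SSLContext enables for new connections. */
    static String[] enabledProtocols() {
        try {
            return SSLContext.getDefault().getDefaultSSLParameters().getProtocols();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    /** Attempts a TLS handshake and reports the negotiated parameters or the failure. */
    static String probe(String host, int port, int timeoutMs) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (Socket plain = new Socket()) {
            // Connect with a timeout first, then layer TLS on top of the connected socket.
            plain.connect(new InetSocketAddress(host, port), timeoutMs);
            plain.setSoTimeout(timeoutMs);
            try (SSLSocket tls = (SSLSocket) factory.createSocket(plain, host, port, true)) {
                tls.startHandshake();
                SSLSession session = tls.getSession();
                return "OK: " + session.getProtocol() + " / " + session.getCipherSuite();
            }
        } catch (IOException e) {
            // SSLHandshakeException is an IOException, so handshake failures land here too.
            return "FAILED: " + e;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "www.sit.de";
        System.out.println("Enabled protocols: " + String.join(", ", enabledProtocols()));
        System.out.println(probe(host, 443, 10000));
    }
}
```

If the probe fails in the same way outside Nutch, the cause more likely lies in the JVM or server TLS setup than in the plugin. If it succeeds, the plugin's own HTTP client is the more likely place to look.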
On 02/23/2015 05:02 PM, Eyeris Rodríguez Rueda wrote:
Hello Martin,
I think the problem is with the httpclient protocol plugin. I have this
problem too, and it also happens when I run parsechecker or indexchecker.
In my case it occurs when the certificate is self-signed; you can
see
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%[email protected]%3E
You can make the changes, recompile the plugin, and try once
more. As an alternative, I use the httpclient protocol plugin from Nutch
1.5.1, where the error does not occur. Please write to
the list again if you make any progress.
----- Original Message -----
From: "Martin Krauss" <[email protected]>
To: [email protected]
Sent: Monday, 23 February 2015 8:17:49
Subject: [MASSMAIL]Error SSLHandshakeException Crawling sites with https
Hello,
when crawling some sites over https I get the error below.
Other https sites work fine.
Please help.
Martin
fetching: https://www.sit.de
Fetch failed with protocol status: exception(16), lastModified=0: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
2015-02-20 15:35:14,889 INFO parse.ParserChecker - fetching: https://www.sit.de
2015-02-20 15:35:15,520 INFO httpclient.Http - http.proxy.host = null
2015-02-20 15:35:15,521 INFO httpclient.Http - http.proxy.port = 8080
2015-02-20 15:35:15,522 INFO httpclient.Http - http.timeout = 10000
2015-02-20 15:35:15,522 INFO httpclient.Http - http.content.limit = 65536
2015-02-20 15:35:15,522 INFO httpclient.Http - http.agent = SIT_NUTCH_SPIDER/Nutch-1.8
2015-02-20 15:35:15,522 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-02-20 15:35:15,522 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-02-20 15:35:16,232 ERROR httpclient.Http - Failed to get protocol output
javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:869)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1190)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:657)
    at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:108)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at org.apache.commons.httpclient.HttpConnection.flushRequestOutputStream(HttpConnection.java:828)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.flushRequestOutputStream(MultiThreadedHttpConnectionManager.java:1565)
    at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2116)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)

Kind regards
Martin Krauß
Gottlieb-Daimler-Schule 2
mit Abteilung Akademie für Datenverarbeitung
Böblinger Straße 73 71065 Sindelfingen
Phone: +49 (0)7031 6117-135
Fax: +49 (0)7031 6117-119
E-Mail: [email protected]