Hi,

I am setting up a new crawler with Nutch 1.15 and am having problems only
with sites hosted on WordPress.com.

I can crawl other https sites without problems.

WordPress sites on other hosts can be crawled, so I think the problem is
with the SSL certs at WordPress.com.

I get this error:

FetcherThread 43 fetch of https://whatdavidread.ca/ failed with: org.apache.commons.httpclient.NoHttpResponseException: The server whatdavidread.ca failed to respond
FetcherThread 43 has no more work available

There seem to be two layers of SSL certs.

First there is a Let's Encrypt cert covering many domains, including the
one above and the tls.automattic.com domain.

Then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
from Comodo:

Certificate chain
 0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
   i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Domain Validation Secure Server CA
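
For reference, here is a minimal sketch of how the two certs can be compared from the command line. It assumes the difference comes from SNI (that is only my guess, not something I have confirmed) and that OpenSSL 1.1.1 or newer is installed, since -noservername is not available in older releases. The function names s_client_args and leaf_cert are made up for this example.

```shell
#!/usr/bin/env bash
# Build the openssl s_client arguments for a host, with or without SNI.
# Usage: s_client_args HOST sni|nosni   (hypothetical helper)
s_client_args() {
  if [ "$2" = "sni" ]; then
    # -servername sends the host name in the TLS ClientHello (SNI)
    printf '%s\n' "-connect $1:443 -servername $1"
  else
    # -noservername suppresses SNI; assumes OpenSSL 1.1.1 or newer
    printf '%s\n' "-connect $1:443 -noservername"
  fi
}

# Print subject and issuer of the leaf cert the server presents.
# Usage: leaf_cert HOST sni|nosni   (hypothetical helper, needs network)
leaf_cert() {
  # Word splitting of the argument string is intentional here.
  openssl s_client $(s_client_args "$1" "$2") </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer
}

# Example (requires network access):
#   leaf_cert whatdavidread.ca sni
#   leaf_cert whatdavidread.ca nosni
```

If the two calls print different subjects, the server is choosing its cert based on SNI, which would make the client's SNI setting relevant to the fetch failure.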

I have tried setting

NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR" -Djsse.enableSNIExtension=false)

with no joy.

My nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
  <description>
  </description>
</property>


Thanks for the consideration.
-- 
Nicholas Roberts
www.niccolox.org
