Most probably the problem is that these websites allow only certain
crawlers in their robots.txt files.
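
One way to test this guess is to check the site's robots.txt against the agent name your crawler sends (in Nutch, the value of http.agent.name). Below is a minimal sketch using Python's standard urllib.robotparser. The SAMPLE robots.txt and the agent name "Nutch-test-crawler" are made up for illustration; paste in the real robots.txt body of the failing site instead.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that admits one named crawler and blocks all others.
SAMPLE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    # "Nutch-test-crawler" is an assumed agent name, not a Nutch default.
    for agent in ("Googlebot", "Nutch-test-crawler"):
        print(agent, is_allowed(SAMPLE, agent, "https://example.com/"))
```

If the real robots.txt does allow your agent, then the cause is somewhere else, for example in the TLS handshake.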

On Wed, 14 Nov 2018, 15:56 Semyon Semyonov <semyon.semyo...@mail.com> wrote:

> Hi Nicholas,
>
> I have the same problem with https://www.graydon.nl/
> And it doesn't look like a WordPress website.
>
> Semyon
>
>
> Sent: Wednesday, November 14, 2018 at 7:49 AM
> From: "Nicholas Roberts" <niccolo.robe...@gmail.com>
> To: user@nutch.apache.org
> Subject: Wordpress.com hosted sites fail
> org.apache.commons.httpclient.NoHttpResponseException
> hi
>
> I am setting up a new crawler with Nutch 1.15 and am having problems only
> with Wordpress.com hosted sites
>
> I can crawl other https sites no problems
>
> Wordpress sites can be crawled on other hosts, but I think there is a
> problem with the SSL certs at Wordpress.com
>
> I get this error
>
> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> whatdavidread.ca failed to respond
> FetcherThread 43 has no more work available
>
> there seem to be two layers of SSL certs
>
> first there is a Let's Encrypt cert, with many domains, including the one
> above, and the tls.automattic.com domain
>
> then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
> from Comodo
>
> Certificate chain
> 0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>   i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Domain Validation Secure Server CA
>
> I can crawl other https sites no problems
>
> I have tried the NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
> -Djsse.enableSNIExtension=false) and no joy
>
> my nutch-site.xml
>
> <property>
> <name>plugin.includes</name>
>
>
> <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
> <description>
> </description>
> </property>
>
>
> thanks for the consideration
> --
> Nicholas Roberts
> www.niccolox.org
>
