You can try checking robots.txt for these websites.

On Wed, 14 Nov 2018, 16:00 Yash Thenuan Thenuan <[email protected]> wrote:
> Most probably the problem is that these websites allow only some specific
> crawlers in their robots.txt file.
>
> On Wed, 14 Nov 2018, 15:56 Semyon Semyonov <[email protected]> wrote:
>
>> Hi Nicholas,
>>
>> I have the same problem with https://www.graydon.nl/
>> And it doesn't look like a WordPress website.
>>
>> Semyon
>>
>> Sent: Wednesday, November 14, 2018 at 7:49 AM
>> From: "Nicholas Roberts" <[email protected]>
>> To: [email protected]
>> Subject: Wordpress.com hosted sites fail
>> org.apache.commons.httpclient.NoHttpResponseException
>>
>> hi
>>
>> I am setting up a new crawler with Nutch 1.15 and am having problems only
>> with Wordpress.com hosted sites.
>>
>> I can crawl other https sites without problems.
>>
>> Wordpress sites can be crawled on other hosts, but I think there is a
>> problem with the SSL certs at Wordpress.com.
>>
>> I get this error:
>>
>> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
>> org.apache.commons.httpclient.NoHttpResponseException: The server
>> whatdavidread.ca failed to respond
>> FetcherThread 43 has no more work available
>>
>> There seem to be two layers of SSL certs.
>>
>> First there is a Let's Encrypt cert with many domains, including the one
>> above and the tls.automattic.com domain.
>>
>> Then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
>> from Comodo:
>>
>> Certificate chain
>>  0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>>    i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO
>>      RSA Domain Validation Secure Server CA
>>
>> I have tried NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
>> -Djsse.enableSNIExtension=false) and no joy.
>>
>> My nutch-site.xml:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>>   <description>
>>   </description>
>> </property>
>>
>> Thanks for the consideration.
>> --
>> Nicholas Roberts
>> www.niccolox.org
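To follow up on the robots.txt suggestion above, here is a minimal offline sketch using Python's stdlib robotparser. The robots.txt content below is hypothetical (a file that whitelists only one crawler); to check a real host, fetch its actual file from https://<site>/robots.txt instead.

```python
# Sketch: check whether a given user-agent is allowed by a robots.txt
# that whitelists only specific crawlers. The robots.txt content below
# is hypothetical -- substitute the real file from the target site.
from urllib import robotparser

sample_robots = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(sample_robots.splitlines())

# Googlebot is explicitly allowed (empty Disallow); every other agent
# falls through to the catch-all "Disallow: /" rule.
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
print(rp.can_fetch("nutch", "https://example.com/page"))      # False
```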

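If the sites' robots.txt files do turn out to allow only specific crawlers, the agent string Nutch announces is controlled by http.agent.name in nutch-site.xml. A minimal sketch (the value is a placeholder, not a recommendation):

```xml
<!-- Placeholder agent name: whatever is set here is what the target
     sites' robots.txt rules will be matched against. -->
<property>
  <name>http.agent.name</name>
  <value>mycrawler</value>
</property>
```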
