For https://www.graydon.nl/

User-agent: *
Crawl-delay: 10

It doesn't look like they specify any crawler-specific rules.
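
If you want to check programmatically what a given crawler is allowed to do, here is a minimal sketch using crawler-commons, the robots.txt parser Nutch itself relies on. The agent name "mynutchcrawler" is a placeholder; substitute whatever you set in http.agent.name. Since the parser takes the agent name, this also catches the case where a site only allows specific crawlers:

// Minimal sketch: parse a live robots.txt with crawler-commons.
// Assumes crawler-commons on the classpath and Java 9+ (readAllBytes).
import java.io.InputStream;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://www.graydon.nl/robots.txt";
        byte[] content;
        try (InputStream in = new URL(robotsUrl).openStream()) {
            content = in.readAllBytes();
        }
        // "mynutchcrawler" is a placeholder agent name
        BaseRobotRules rules = new SimpleRobotRulesParser()
                .parseContent(robotsUrl, content, "text/plain", "mynutchcrawler");
        System.out.println("allowed: " + rules.isAllowed("https://www.graydon.nl/"));
        System.out.println("crawl delay (ms): " + rules.getCrawlDelay());
    }
}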

Sent: Wednesday, November 14, 2018 at 11:32 AM
From: "Yash Thenuan Thenuan" <rit2014...@iiita.ac.in>
To: user@nutch.apache.org
Subject: Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException
You can try checking robots.txt for these websites

On Wed, 14 Nov 2018, 16:00 Yash Thenuan Thenuan <rit2014...@iiita.ac.in> wrote:

> Most probably the problem is that these websites allow only specific
> crawlers in their robots.txt files.
>
> On Wed, 14 Nov 2018, 15:56 Semyon Semyonov <semyon.semyo...@mail.com> wrote:
>
>> Hi Nicholas,
>>
>> I have the same problem with https://www.graydon.nl/
>> And it doesn't look like a WordPress website.
>>
>> Semyon
>>
>>
>> Sent: Wednesday, November 14, 2018 at 7:49 AM
>> From: "Nicholas Roberts" <niccolo.robe...@gmail.com>
>> To: user@nutch.apache.org
>> Subject: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException
>> hi
>>
>> I am setting up a new crawler with Nutch 1.15 and am having problems only
>> with WordPress.com-hosted sites
>>
>> I can crawl other https sites with no problems
>>
>> WordPress sites can be crawled on other hosts, but I think there is a
>> problem with the SSL certs at WordPress.com
>>
>> I get this error:
>>
>> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
>> org.apache.commons.httpclient.NoHttpResponseException: The server
>> whatdavidread.ca failed to respond
>> FetcherThread 43 has no more work available
>>
>> there seem to be two layers of SSL certs
>>
>> first there is a Let's Encrypt cert with many domains, including the one
>> above and the tls.automattic.com domain
>>
>> then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
>> from Comodo
>>
>> Certificate chain
>> 0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>> i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Domain Validation Secure Server CA
>>
>> I have tried NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
>> -Djsse.enableSNIExtension=false), but no joy
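
One note on that flag (my guess, not a confirmed diagnosis): WordPress.com appears to select the certificate based on SNI, so with -Djsse.enableSNIExtension=false the JVM would likely be handed the *.wordpress.com fallback cert, which does not match whatdavidread.ca. A minimal sketch to print the chain the JVM actually receives, with and without that flag:

import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SniProbe {
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        // SNI is sent by default when connecting by hostname; re-run with
        // -Djsse.enableSNIExtension=false to compare the chains.
        try (SSLSocket socket = (SSLSocket) factory.createSocket("whatdavidread.ca", 443)) {
            socket.startHandshake();
            for (Certificate cert : socket.getSession().getPeerCertificates()) {
                System.out.println(((X509Certificate) cert).getSubjectX500Principal());
            }
        }
    }
}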
>>
>> my nutch-site.xml
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>>   <description></description>
>> </property>
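
Also worth noting: org.apache.commons.httpclient.NoHttpResponseException comes from commons-httpclient, so it is protocol-httpclient that is fetching these URLs. As far as I know protocol-http in 1.15 handles https as well, so a test run with protocol-httpclient dropped from plugin.includes might show whether the failure is in the plugin rather than the site, e.g.:

<value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>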
>>
>>
>> thanks for the consideration
>> --
>> Nicholas Roberts
>> www.niccolox.org
>>
>
