Hi Nicholas,

It looks like the user-agent string sent in the HTTP request header is what
makes the server return no/empty content.

bin/nutch parsechecker \
  -Dhttp.agent.name="mytestbot" \
  -Dhttp.agent.version=3.0 \
  -Dhttp.agent.url=http://example.com/ https://whatdavidread.ca/

Apparently, the default agent name containing "Nutch" is blocked on this site.
But this observation may depend on other factors (IP address, etc.) as well.

These settings are just for testing. Of course, I encourage you to use
a meaningful name and a valid URL or email address where the crawler
operator can be reached with complaints. If robots.txt is respected and
the settings are polite, it's unlikely that you'll be contacted with complaints.
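
For a permanent setup, the same properties can go into conf/nutch-site.xml
instead of being passed with -D on the command line. A minimal sketch, assuming
the standard http.agent.* properties from nutch-default.xml; the name, URL and
email address below are placeholders I made up, so replace them with your own:

```xml
<!-- Sketch only: all values are placeholders, not recommended settings. -->
<property>
  <name>http.agent.name</name>
  <!-- hypothetical crawler name; pick one that identifies your project -->
  <value>mytestbot</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>3.0</value>
</property>
<property>
  <name>http.agent.url</name>
  <!-- placeholder; should point to a page describing your crawler -->
  <value>http://example.com/</value>
</property>
<property>
  <name>http.agent.email</name>
  <!-- placeholder contact address for the crawler operator -->
  <value>crawler-admin@example.com</value>
</property>
```

These go inside the <configuration> element, next to your plugin.includes
property.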

Best,
Sebastian


On 11/14/18 7:49 AM, Nicholas Roberts wrote:
> hi
> 
> I am setting up a new crawler with Nutch 1.15 and am having problems only
> with Wordpress.com hosted sites
> 
> I can crawl other https sites no problems
> 
> Wordpress sites can be crawled on other hosts, but I think there is a
> problem with the SSL certs at Wordpress.com
> 
> I get this error
> 
> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> whatdavidread.ca failed to respond
> FetcherThread 43 has no more work available
> 
> there seems to be two layers of SSL certs
> 
> first there is a Letsencrypt cert, with many domains, including the one
> above, and the tls.auttomatic.com domain
> 
> then, underlying the Lets Encrypt cert, there is a *.wordpress.com cert
> from Comodo
> 
> Certificate chain
>  0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.
> wordpress.com
>    i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO
> RSA Domain Validation Secure Server CA
> 
> I can crawl other https sites no problems
> 
> I have tried the NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
> -Djsse.enableSNIExtension=false) and no joy
> 
> my nutch-site.xml
> 
> <property>
>   <name>plugin.includes</name>
> 
> <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>   <description>
>   </description>
> </property>
> 
> 
> thanks for the consideration
> 
