Thanks for this. I was also wondering whether Wordpress.com has a whitelist or some kind of registration process for crawlers, or whether they even have business arrangements around search.
On Wed, Nov 14, 2018 at 7:26 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi Nicholas,
>
> It looks like it's the user-agent string sent in the HTTP header
> which makes the server return no/empty content:
>
>   bin/nutch parsechecker \
>     -Dhttp.agent.name="mytestbot" \
>     -Dhttp.agent.version=3.0 \
>     -Dhttp.agent.url=http://example.com/ \
>     https://whatdavidread.ca/
>
> Evidently, the default agent name containing "Nutch" is blocked on this
> site, though this observation may depend on other factors (IP address,
> etc.) as well.
>
> These settings are just for testing. Of course, I encourage you to use
> a meaningful name and a valid URL or email address so the crawler
> operator can be reached for complaints. If robots.txt is respected and
> the settings are polite, it's unlikely you'll be contacted with
> complaints.
>
> Best,
> Sebastian
>
> On 11/14/18 7:49 AM, Nicholas Roberts wrote:
> > Hi,
> >
> > I am setting up a new crawler with Nutch 1.15 and am having problems
> > only with Wordpress.com-hosted sites.
> >
> > I can crawl other https sites with no problems.
> >
> > Wordpress sites can be crawled on other hosts, but I think there is a
> > problem with the SSL certs at Wordpress.com.
> >
> > I get this error:
> >
> >   FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
> >   org.apache.commons.httpclient.NoHttpResponseException: The server
> >   whatdavidread.ca failed to respond
> >   FetcherThread 43 has no more work available
> >
> > There seem to be two layers of SSL certs.
> >
> > First there is a Let's Encrypt cert with many domains, including the
> > one above and the tls.automattic.com domain.
> >
> > Then, underlying the Let's Encrypt cert, there is a *.wordpress.com
> > cert from Comodo:
> >
> >   Certificate chain
> >    0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
> >      i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Domain Validation Secure Server CA
> >
> > I have tried
> >
> >   NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR" -Djsse.enableSNIExtension=false)
> >
> > and no joy.
> >
> > My nutch-site.xml:
> >
> >   <property>
> >     <name>plugin.includes</name>
> >     <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
> >     <description>
> >     </description>
> >   </property>
> >
> > Thanks for the consideration.
> >
> > --
> > Nicholas Roberts
> > www.niccolox.org
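
For anyone wanting to reproduce the diagnosis outside of Nutch: the sketch below (my own illustration, not Nutch code) spins up a local HTTP server that refuses requests whose User-Agent contains "Nutch", then probes it with a default-style Nutch agent string and a custom one. The agent strings, status codes, and blocking rule are assumptions chosen to mirror the behavior Sebastian described; real sites may block differently (empty body, dropped connection, etc.).

```python
# Simulate user-agent-based blocking: a local server that rejects any
# request whose User-Agent header contains "Nutch", and a client that
# probes it with two different agent strings.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentFilterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Nutch" in ua:
            # Mimic a site that blocks the default Nutch agent name.
            self.send_response(403)
            self.end_headers()
        else:
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the test run quiet

def probe(url, agent):
    """Fetch url with the given User-Agent; return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

server = HTTPServer(("127.0.0.1", 0), AgentFilterHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

blocked = probe(url, "Apache-Nutch/1.15")                    # default-style agent
allowed = probe(url, "mytestbot/3.0 (+http://example.com/)") # custom agent
server.shutdown()
print(blocked, allowed)
```

With this simulated policy the default-style agent gets 403 while the custom agent gets 200, which is the same asymmetry parsechecker exposes when overriding http.agent.name.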