thanks, adding the meta worked

On Wed, Nov 14, 2018 at 11:24 AM Nicholas Roberts <[email protected]> wrote:
> thanks for this
>
> I was also wondering whether Wordpress has a whitelist or some kind of
> registration process, or whether they even have business arrangements
> around search
>
> On Wed, Nov 14, 2018 at 7:26 AM Sebastian Nagel <[email protected]> wrote:
>
>> Hi Nicholas,
>>
>> looks like it's the user-agent string sent in the HTTP header
>> which makes the server return no/empty content.
>>
>>   bin/nutch parsechecker \
>>     -Dhttp.agent.name="mytestbot" \
>>     -Dhttp.agent.version=3.0 \
>>     -Dhttp.agent.url=http://example.com/ \
>>     https://whatdavidread.ca/
>>
>> Obviously, the default agent name containing "Nutch" is blocked on this
>> site. But this observation may depend on other factors (IP address, etc.)
>> as well.
>>
>> These settings are just for testing. Of course, I encourage you to use
>> a meaningful name and a valid URL or email address so that the crawler
>> operator can be reached with complaints. If robots.txt is respected and
>> the settings are polite, it's unlikely you will get complaints.
>>
>> Best,
>> Sebastian
>>
>> On 11/14/18 7:49 AM, Nicholas Roberts wrote:
>>> hi
>>>
>>> I am setting up a new crawler with Nutch 1.15 and am having problems
>>> only with Wordpress.com-hosted sites
>>>
>>> I can crawl other https sites with no problems
>>>
>>> Wordpress sites can be crawled on other hosts, but I think there is a
>>> problem with the SSL certs at Wordpress.com
>>>
>>> I get this error
>>>
>>>   FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
>>>   org.apache.commons.httpclient.NoHttpResponseException: The server
>>>   whatdavidread.ca failed to respond
>>>   FetcherThread 43 has no more work available
>>>
>>> there seem to be two layers of SSL certs
>>>
>>> first there is a Let's Encrypt cert covering many domains, including
>>> the one above and the tls.automattic.com domain
>>>
>>> then, underlying the Let's Encrypt cert, there is a *.wordpress.com
>>> cert from Comodo
>>>
>>>   Certificate chain
>>>    0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>>>      i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO
>>>        RSA Domain Validation Secure Server CA
>>>
>>> I have tried NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
>>> -Djsse.enableSNIExtension=false) and no joy
>>>
>>> my nutch-site.xml:
>>>
>>>   <property>
>>>     <name>plugin.includes</name>
>>>     <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>>>     <description>
>>>     </description>
>>>   </property>
>>>
>>> thanks for the consideration

-- 
Nicholas Roberts
www.niccolox.org
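[Editor's note] The one-off `-D` flags Sebastian passes to parsechecker can be made permanent in conf/nutch-site.xml. A minimal sketch using the standard Nutch agent properties; the values shown (agent name, URL, email) are placeholders, not recommendations:

```xml
<!-- Sketch: persisting the agent identity set via -D flags above.
     http.agent.name / version / url / email are standard Nutch properties;
     the values here are placeholders. -->
<property>
  <name>http.agent.name</name>
  <value>mytestbot</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>3.0</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>[email protected]</value>
</property>
```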


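[Editor's note] The "two layers of SSL certs" observation is consistent with SNI behavior: when the client sends the hostname via SNI, the server can present the per-site Let's Encrypt certificate, and when SNI is absent (as with -Djsse.enableSNIExtension=false) the server may fall back to a default certificate such as the *.wordpress.com wildcard. A diagnostic sketch with openssl s_client to compare the two cases; actual output depends on the server's current configuration, and -noservername requires OpenSSL 1.1.1 or newer:

```shell
# With SNI: the server can select the per-site (Let's Encrypt) certificate.
openssl s_client -connect whatdavidread.ca:443 \
  -servername whatdavidread.ca </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer

# Without SNI: the server may fall back to a default certificate,
# e.g. the *.wordpress.com wildcard (OpenSSL >= 1.1.1 for -noservername).
openssl s_client -connect whatdavidread.ca:443 \
  -noservername </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer
```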