thanks
adding the meta worked

On Wed, Nov 14, 2018 at 11:24 AM Nicholas Roberts <niccolo.robe...@gmail.com>
wrote:

> thanks for this
>
> I was also wondering whether Wordpress has a whitelist or some kind of
> registration process, or whether they even have business arrangements
> around search
>
> On Wed, Nov 14, 2018 at 7:26 AM Sebastian Nagel <
> wastl.na...@googlemail.com> wrote:
>
>> Hi Nicholas,
>>
>> looks like it's the user-agent string sent in the HTTP header
>> which makes the server return no/empty content.
>>
>> bin/nutch parsechecker \
>>   -Dhttp.agent.name="mytestbot" \
>>   -Dhttp.agent.version=3.0 \
>>   -Dhttp.agent.url=http://example.com/ https://whatdavidread.ca/
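>>
>> To cross-check outside Nutch, a plain curl request with a custom
>> User-Agent (the name here is just a placeholder) should show the same
>> behaviour:
>>
>>   curl -sI -A "mytestbot" https://whatdavidread.ca/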
>>
>> Obviously, the default agent name containing "Nutch" is blocked on this
>> site.
>> But this observation may depend on other factors (IP address, etc.) as
>> well.
>>
>> These settings are just for testing. Of course, I encourage you to use
>> a meaningful name and a valid URL or email address so the crawler
>> operator can be reached in case of complaints. If robots.txt is respected
>> and the settings are polite, it's unlikely you will be contacted with
>> complaints at all.
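>>
>> For a real crawl the same agent settings can go into conf/nutch-site.xml
>> instead of the command line; a minimal sketch (all values are only
>> placeholders):
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>mycrawler</value>
>> </property>
>> <property>
>>   <name>http.agent.url</name>
>>   <value>http://example.com/crawler-info</value>
>> </property>
>> <property>
>>   <name>http.agent.email</name>
>>   <value>crawler-operator@example.com</value>
>> </property>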
>>
>> Best,
>> Sebastian
>>
>>
>> On 11/14/18 7:49 AM, Nicholas Roberts wrote:
>> > hi
>> >
>> > I am setting up a new crawler with Nutch 1.15 and am having problems
>> > only with Wordpress.com-hosted sites
>> >
>> > I can crawl other https sites with no problems
>> >
>> > Wordpress sites can be crawled on other hosts, but I think there is a
>> > problem with the SSL certs at Wordpress.com
>> >
>> > I get this error
>> >
>> > FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
>> > org.apache.commons.httpclient.NoHttpResponseException: The server
>> > whatdavidread.ca failed to respond
>> > FetcherThread 43 has no more work available
>> >
>> > there seem to be two layers of SSL certs
>> >
>> > first there is a Let's Encrypt cert, with many domains, including the
>> > one above and the tls.automattic.com domain
>> >
>> > then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
>> > from Comodo
>> >
>> > Certificate chain
>> >  0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>> >    i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Domain Validation Secure Server CA
>> >
>> > I have tried setting NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
>> > -Djsse.enableSNIExtension=false) and no joy
>> >
>> > my nutch-site.xml
>> >
>> > <property>
>> >   <name>plugin.includes</name>
>> >   <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>> >   <description>
>> >   </description>
>> > </property>
>> >
>> >
>> > thanks for the consideration
>> >
>>
>>
>
> --
> Nicholas Roberts
> www.niccolox.org
>
>

-- 
Nicholas Roberts
www.niccolox.org
