You haven't modified the value for the config below by any chance?
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
The default value (true) should work fine.
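For reference, I believe the description above belongs to the http.robots.403.allow property. The name is from memory and not in the text I pasted, so please check it against your own nutch-default.xml. If it has been overridden, a minimal sketch of putting it back to the default in conf/nutch-site.xml would be:

```xml
<!-- Assumption: the property name is http.robots.403.allow; confirm it in nutch-default.xml -->
<property>
  <name>http.robots.403.allow</name>
  <!-- true = treat a 403 on /robots.txt as "no robots.txt", so the site may be crawled -->
  <value>true</value>
</property>
```

With the value set to false, a 403 on /robots.txt would make the fetcher treat every URL on that host as denied, which would match the "Denied by robots.txt" line in your debug log.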
On 17 October 2016 at 16:38, Tom Chiverton <t...@extravision.com> wrote:
> A site I am trying to index has its HTML content on one domain, and some
> linked PDFs on another (an Amazon S3 bucket).
> So I have set up my plugin.includes in site.xml :
> and made sure regexp-urlfilter.xml is OK with it all.
> But I observe some oddness during fetching, and can't locate the PDFs in
> the Solr collection.
> All the content on the PDF domain flies past with no pause:
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl
> and then it hits the primary domain and starts pausing between each :
> Turning the log level for the fetcher to debug I see
> DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
> but there is no robots.txt in the root of the Amazon S3 URL -
> https://s3-eu-west-1.amazonaws.com/robots.txt is a 403 !
> Any ideas what could be up ?
> *Tom Chiverton*