Hi Tom,

You haven't modified the value for the config below by any chance?

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

The default value (true) should work fine.

Julien

On 17 October 2016 at 16:38, Tom Chiverton <[email protected]> wrote:

> A site I am trying to index has its HTML content on one domain, and some
> linked PDFs on another (an Amazon S3 bucket).
>
> So I have set up plugin.includes in nutch-site.xml:
>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
>
> and made sure regex-urlfilter.txt is OK with it all.
>
> But I observe some oddness during fetching, and can't locate the PDFs in
> the Solr collection.
>
> All the content on the PDF domain flies past with no pause:
>
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)
>
> and then it hits the primary domain and starts pausing between each fetch.
>
> Turning the log level for the fetcher up to debug, I see:
>
> DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
>
> but there is no robots.txt in the root of the Amazon S3 URL -
> https://s3-eu-west-1.amazonaws.com/robots.txt returns a 403!
>
> Any ideas what could be up?
>
> --
> Tom Chiverton
> Lead Developer
> e: [email protected]
> p: 0161 817 2922
> t: @extravision <http://www.twitter.com/extravision>
> w: www.extravision.com

--
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
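
A minimal sketch of the kind of local override Julien is asking about, assuming the usual Nutch layout where per-site settings live in conf/nutch-site.xml (the file name and the false value here are assumptions for illustration, not taken from Tom's actual configuration). If an entry like this is present, a 403 on /robots.txt is treated as "disallow everything", which would explain the "Denied by robots.txt" lines in the fetcher log; deleting the entry or setting the value back to true restores the default behaviour described above.

  <!-- hypothetical override in conf/nutch-site.xml; if present, remove it
       or change the value back to true -->
  <property>
    <name>http.robots.403.allow</name>
    <value>false</value>
  </property>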

