That's only in nutch-default.xml, and is set to the default, which is true.

Good idea though!

Tom

On 17/10/16 17:27, Julien Nioche wrote:
Hi Tom,

You haven't modified the value for the config below by any chance?

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

The default value (true) should work fine.

Julien

On 17 October 2016 at 16:38, Tom Chiverton <[email protected]> wrote:

A site I am trying to index has its HTML content on one domain, and some linked PDFs on another (an Amazon S3 bucket). So I have set up my plugin.includes in site.xml:

<value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>

and made sure regexp-urlfilter.xml is OK with it all.

But I observe some oddness during fetching, and can't locate the PDFs in the Solr collection. All the content on the PDF domain flies past with no pause:

-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)

and then it hits the primary domain and starts pausing between each fetch. Turning the log level for the fetcher up to debug, I see:

DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

but there is no robots.txt in the root of the Amazon S3 URL - https://s3-eu-west-1.amazonaws.com/robots.txt is a 403!
Any ideas what could be up?

--
Tom Chiverton
Lead Developer
e: [email protected]
p: 0161 817 2922
t: @extravision
w: www.extravision.com

--
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble
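For anyone following along, the decision that http.robots.403.allow controls can be sketched as below. This is a minimal illustrative sketch, not Nutch's actual code; the function name and the handling of non-403 statuses are assumptions made for the example.

```python
# Sketch (NOT Nutch's implementation) of the decision described by the
# http.robots.403.allow property quoted above: when fetching /robots.txt
# fails with 403, either treat the site as crawlable or as forbidden.

def robots_allows_crawl(status_code: int, allow_on_403: bool = True) -> bool:
    """Decide whether a site may be crawled, based on the HTTP status
    returned for its /robots.txt (hypothetical helper for illustration)."""
    if status_code == 200:
        # A real robots.txt exists; per-rule parsing (omitted here)
        # would then decide URL by URL. Assume crawlable for this sketch.
        return True
    if status_code == 404:
        # No robots.txt at all: crawling is allowed by convention.
        return True
    if status_code == 403:
        # Forbidden robots.txt: behaviour depends on the flag.
        return allow_on_403
    # Other errors (5xx etc.): be conservative and do not crawl.
    return False

# With the default (true), the S3 bucket's 403 on robots.txt should not
# block fetching; with false, the whole host is treated as forbidden:
print(robots_allows_crawl(403, allow_on_403=True))   # True
print(robots_allows_crawl(403, allow_on_403=False))  # False
```

If the flag really is at its default of true, the "Denied by robots.txt" message suggests something other than the 403 handling is rejecting the host.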

