That's only in nutch-default.xml, and is set to the default which is true.
Good idea though !
On 17/10/16 17:27, Julien Nioche wrote:
You haven't modified the value for the config below by any chance?
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
The default value (true) should work fine.
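
For reference, a minimal sketch of pinning the value explicitly in conf/nutch-site.xml instead of relying on nutch-default.xml. It assumes the property in question is http.robots.403.allow, which as far as I know is the name this description carries in the stock nutch-default.xml; check your own copy before relying on it.

```xml
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumed property name; verify against your nutch-default.xml -->
    <name>http.robots.403.allow</name>
    <!-- true: a 403 on /robots.txt is treated as "no robots.txt", so crawling is allowed -->
    <value>true</value>
  </property>
</configuration>
```

Settings in nutch-site.xml take precedence over nutch-default.xml, so an explicit entry here makes the effective value unambiguous.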
On 17 October 2016 at 16:38, Tom Chiverton <t...@extravision.com
A site I am trying to index has its HTML content on one domain,
and some linked PDFs on another (an Amazon S3 bucket).
So I have set up my plugin.includes in site.xml :
and made sure regexp-urlfilter.xml is OK with it all.
But I observe some oddness during fetching, and can't locate the
PDFs in the Solr collection.
All the content on the PDF domain flies past with no pause :
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
<https://s3-eu-west-1.amazonaws.com/>.... (queue crawl delay=5000ms)
and then it hits the primary domain and starts pausing between each :
Turning the log level for the fetcher to debug I see
DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
but there is no robots.txt in the root of the Amazon S3 URL -
<https://s3-eu-west-1.amazonaws.com/robots.txt> is a 403 !
Any ideas what could be up ?