A site I am trying to index has its HTML content on one domain, and some linked PDFs on another (an Amazon S3 bucket).

So I have set up my plugin.includes in site.xml:


and made sure regexp-urlfilter.xml is OK with it all.

But I observe some oddness during fetching, and can't locate the PDFs in the Solr collection.

All the content on the PDF domain flies past with no pause:

-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)

and then it hits the primary domain and starts pausing between each fetch:

Turning the log level for the fetcher up to debug, I see:

DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

But there is no robots.txt at the root of the Amazon S3 host - https://s3-eu-west-1.amazonaws.com/robots.txt returns a 403!
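
One thing worth checking, offered as an assumption rather than a diagnosis: some robots.txt clients treat a 401 or 403 response as "everything is disallowed" rather than "there are no rules". Python's standard urllib.robotparser, for instance, treats 401 and 403 as disallow-all and other 4xx codes as allow-all. The sketch below only illustrates that convention. The function names are hypothetical and this is not Nutch's code.

```python
import urllib.error
import urllib.request


def robots_status(url, timeout=10):
    """Return the HTTP status code of a robots.txt request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code


def policy_for_status(status):
    """Map a robots.txt status code to a crawl policy.

    Hypothetical mapping that mirrors the 4xx convention used by
    Python's urllib.robotparser; other crawlers may behave differently.
    """
    if 200 <= status < 300:
        return "parse rules"   # a robots.txt exists, so read its rules
    if status in (401, 403):
        return "deny all"      # access refused is read as "keep out"
    if 400 <= status < 500:
        return "allow all"     # e.g. 404: no robots.txt, no restrictions
    return "unknown"           # anything else is left undecided here


print(policy_for_status(403))  # prints "deny all"
print(policy_for_status(404))  # prints "allow all"
```

Calling policy_for_status(robots_status("https://s3-eu-west-1.amazonaws.com/robots.txt")) would show how a client following this convention reads the S3 response. If the fetcher applies a similar rule, a "Denied by robots.txt" message could appear even though no robots.txt file exists.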

Any ideas what could be up?

*Tom Chiverton*
Lead Developer