A site I am trying to index has its HTML content on one domain, and
some linked PDFs on another (an Amazon S3 bucket).
So I have set up my plugin.includes in site.xml :
<value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
and made sure regexp-urlfilter.xml is OK with it all.
But I observe some oddness during fetching, and can't locate the PDFs in
the Solr collection.
All the content on the PDF domain flies past with no pause :
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)
and then it hits the primary domain and starts pausing between each fetch.
Turning the log level for the fetcher up to DEBUG, I see :
DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
but there is no robots.txt in the root of the Amazon S3 URL:
https://s3-eu-west-1.amazonaws.com/robots.txt returns a 403 !
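
One possible reading of that log line is that the robots.txt handling treats a 403 on /robots.txt as "deny everything" rather than "no rules". Below is a minimal Python sketch of that convention, which is the one Python's own urllib.robotparser follows. It is not Nutch's code: the function name rules_for_robots_status and the example-bucket path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def rules_for_robots_status(status: int, body: str = "") -> RobotFileParser:
    """Build robots rules from the HTTP status of a robots.txt request.

    Hypothetical helper following the same convention as
    urllib.robotparser.RobotFileParser.read():
    401/403 -> disallow everything, other 4xx -> allow everything.
    """
    rules = RobotFileParser()
    if status in (401, 403):
        # Access to robots.txt itself is forbidden: assume nothing may be fetched.
        rules.disallow_all = True
    elif 400 <= status < 500:
        # No robots.txt (e.g. 404): assume everything may be fetched.
        rules.allow_all = True
    elif status == 200:
        # A real robots.txt: parse its lines as usual.
        rules.parse(body.splitlines())
    return rules


# Hypothetical PDF URL on the S3 host; the real path is not shown in the log.
pdf_url = "https://s3-eu-west-1.amazonaws.com/example-bucket/doc.pdf"

forbidden = rules_for_robots_status(403)
missing = rules_for_robots_status(404)
print(forbidden.can_fetch("*", pdf_url))
print(missing.can_fetch("*", pdf_url))
```

Under this convention every URL on a host whose robots.txt answers 403 is rejected before any request is made, which would match URLs on that host flying past with no crawl delay. Nutch's nutch-default.xml documents an http.robots.403.allow property for this situation; whether it is being honoured in this setup is an assumption that would need checking.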
Any ideas what could be up ?
--
*Tom Chiverton*
Lead Developer