Trouble fetching PDFs to pass to Tika (I think)

Tom Chiverton Mon, 17 Oct 2016 08:40:37 -0700

A site I am trying to index has its HTML content on one domain, and some linked PDFs on another (an Amazon S3 bucket).


So I have set up my plugin.includes in site.xml:


<value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
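For context, a value like this sits inside a property element in the site configuration. A minimal sketch of the complete entry, assuming the usual Nutch/Hadoop property layout (the value is copied unchanged from above; the surrounding configuration element is an assumption about the rest of the file):

```xml
<configuration>
  <!-- Plugins Nutch loads; parse-tika is the one that would handle the PDFs -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
  </property>
</configuration>
```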


and made sure regexp-urlfilter.xml allows URLs on both domains.

But I observe some oddness during fetching, and can't locate the PDFs in the Solr collection.


All the content on the PDF domain flies past with no pause:

-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)

and then it hits the primary domain and starts pausing between each fetch.

Turning the log level for the fetcher up to debug, I see:

DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

but there is no robots.txt in the root of the Amazon S3 URL: https://s3-eu-west-1.amazonaws.com/robots.txt returns a 403!
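One setting that may be worth checking here is the Nutch property http.robots.403.allow, which controls whether a 403 response for /robots.txt is treated as "crawling allowed" or as "everything forbidden". Whether it explains the denial above is an assumption, not something the log confirms. A sketch of setting it explicitly in the site configuration:

```xml
<!-- Assumption: a 403 on robots.txt is what triggers "Denied by robots.txt" -->
<property>
  <name>http.robots.403.allow</name>
  <!-- true: treat a 403 on /robots.txt as "no restrictions" -->
  <value>true</value>
</property>
```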


Any ideas what could be up?

--
*Tom Chiverton*
