Hi Tom,

You haven't modified the value for the config below by any chance?

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

The default value (true) should work fine.

Julien

On 17 October 2016 at 16:38, Tom Chiverton <[email protected]> wrote:

> A site I am trying to index has its HTML content on one domain, and some
> linked PDFs on another (an Amazon S3 bucket).
>
> So I have set up plugin.includes in nutch-site.xml:
>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
>
> and made sure regex-urlfilter.txt is OK with it all.
>
> But I observe some oddness during fetching, and can't locate the PDFs in
> the Solr collection.
>
> All the content on the PDF domain flies past with no pause:
>
> -finishing thread FetcherThread8, activeThreads=0
> -finishing thread FetcherThread9, activeThreads=0
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
> -activeThreads=0
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold sequence: 5
> fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)
>
> and then it hits the primary domain and starts pausing between each fetch.
>
> Turning the log level for the fetcher up to debug, I see:
>
> DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.
>
> but there is no robots.txt in the root of the Amazon S3 URL -
> https://s3-eu-west-1.amazonaws.com/robots.txt returns a 403!
>
> Any ideas what could be up?
>
> --
> Tom Chiverton
> Lead Developer
> e: [email protected]
> p: 0161 817 2922
> t: @extravision <http://www.twitter.com/extravision>
> w: www.extravision.com

--
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
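
A minimal sketch of the kind of local override Julien is asking about, assuming the usual Nutch layout where per-site settings live in conf/nutch-site.xml (the file name and the false value here are assumptions for illustration, not taken from Tom's actual configuration). If an entry like this is present, a 403 on /robots.txt is treated as "disallow everything", which would explain the "Denied by robots.txt" lines in the fetcher log; deleting the entry or setting the value back to true restores the default behaviour described above.

  <!-- hypothetical override in conf/nutch-site.xml; if present, remove it
       or change the value back to true -->
  <property>
    <name>http.robots.403.allow</name>
    <value>false</value>
  </property>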

