That's only set in nutch-default.xml, and it has the default value, which is true.

Good idea though !

Tom


On 17/10/16 17:27, Julien Nioche wrote:
Hi Tom

You haven't modified the value for the config below by any chance?

        <property>
          <name>http.robots.403.allow</name>
          <value>true</value>
          <description>Some servers return HTTP status 403 (Forbidden) if
          /robots.txt doesn't exist. This should probably mean that we are
          allowed to crawl the site nonetheless. If this is set to false,
          then such sites will be treated as forbidden.</description>
        </property>


The default value (true) should work fine.
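If you did ever want to flip it, the override would go in nutch-site.xml (never edit nutch-default.xml directly) — a sketch, using the property name from the block above:

```xml
<property>
  <name>http.robots.403.allow</name>
  <!-- false = a 403 on /robots.txt means the whole site is treated as forbidden -->
  <value>false</value>
</property>
```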

Julien


On 17 October 2016 at 16:38, Tom Chiverton <[email protected]> wrote:

    A site I am trying to index has its HTML content on one domain,
    and some linked PDFs on another (an Amazon S3 bucket).


    So I have set up my plugin.includes in nutch-site.xml:


    
<value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>


    and made sure regex-urlfilter.txt is OK with it all.
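For reference, letting both hosts through urlfilter-regex would look something like this in regex-urlfilter.txt (the primary domain here is a placeholder, as it isn't named in the thread; the S3 host is from the fetcher logs):

```
# accept the primary site (example.com is a placeholder)
+^https?://(www\.)?example\.com/
# accept the S3 host serving the PDFs
+^https://s3-eu-west-1\.amazonaws\.com/
# reject everything else
-.
```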


    But I observe some oddness during fetching, and can't locate the
    PDFs in the Solr collection.

    All the content on the PDF domain flies past with no pause:

    -finishing thread FetcherThread8, activeThreads=0
    -finishing thread FetcherThread9, activeThreads=0
    0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0
    kb/s, 0 URLs in 0 queues
    -activeThreads=0
    Using queue mode : byHost
    Fetcher: threads: 10
    Fetcher: throughput threshold: -1
    Fetcher: throughput threshold sequence: 5
    fetching https://s3-eu-west-1.amazonaws.com/.... (queue crawl delay=5000ms)

    and then it hits the primary domain and starts pausing between each:

    Turning the log level for the fetcher to debug I see

    DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

    but there is no robots.txt in the root of the Amazon S3 URL -
    https://s3-eu-west-1.amazonaws.com/robots.txt is a 403!
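To make the behaviour concrete, here is a small sketch (not Nutch's actual code) of how a fetcher might map the /robots.txt HTTP status to an allow/deny decision, mirroring what the http.robots.403.allow setting controls:

```python
def robots_allows_crawl(status_code: int, robots_403_allow: bool = True) -> bool:
    """Decide whether a host may be crawled based on its /robots.txt HTTP status.

    Sketch of the 403-handling behaviour described in nutch-default.xml;
    robots_403_allow plays the role of http.robots.403.allow.
    """
    if status_code == 200:
        # robots.txt exists: defer to its rules (assume allowed here for brevity)
        return True
    if status_code == 404:
        # No robots.txt at all: crawling is allowed
        return True
    if status_code == 403:
        # Forbidden: allowed only when http.robots.403.allow is true
        return robots_403_allow
    # Other statuses (e.g. 5xx): treat conservatively as disallowed
    return False

print(robots_allows_crawl(403))                          # True with the default setting
print(robots_allows_crawl(403, robots_403_allow=False))  # False: site treated as forbidden
```

With the default setting a 403 on /robots.txt should therefore not produce "Denied by robots.txt", which is why Julien asks whether the value was changed.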

    Any ideas what could be up?

--
    *Tom Chiverton*
    Lead Developer
    e:  [email protected]
    p:  0161 817 2922
    t:  @extravision <http://www.twitter.com/extravision>
    w:  www.extravision.com

    Extravision - email worth seeing
    Registered in the UK at: 107 Timber Wharf, 33 Worsley Street,
    Manchester, M15 4LD.
    Company Reg No: 05017214 VAT: GB 824 5386 19





--
*Open Source Solutions for Text Engineering*
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>

