That property is only set in nutch-default.xml, where it has its default value of true.

Good idea though !


On 17/10/16 17:27, Julien Nioche wrote:
Hi Tom

You haven't modified the value for the config below by any chance?

        <property>
          <name>http.robots.403.allow</name>
          <value>true</value>
          <description>Some servers return HTTP status 403 (Forbidden) if
          /robots.txt doesn't exist. This should probably mean that we are
          allowed to crawl the site nonetheless. If this is set to false,
          then such sites will be treated as forbidden.</description>
        </property>


The default value (true) should work fine.
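
To make the semantics of that description concrete, here is a minimal Python sketch of the decision it implies. This is not Nutch's actual code: the function name and parameters are hypothetical, and only the 403 branch is taken from the property description. The 404 and fallback branches are assumptions added for illustration.

```python
def crawl_allowed_without_rules(status: int, allow_on_403: bool = True) -> bool:
    """Decide whether to crawl a host whose /robots.txt did not return 200.

    Hypothetical helper, not part of Nutch.
    `allow_on_403` stands in for the http.robots.403.allow setting.
    """
    if status == 403:
        # From the property description: a 403 on /robots.txt is treated
        # as "no robots.txt", unless the setting is false.
        return allow_on_403
    if status == 404:
        # Assumption: a missing robots.txt places no restrictions.
        return True
    # Assumption: be conservative for any other non-200 status.
    return False
```

With the default setting, `crawl_allowed_without_rules(403)` returns `True`, so a 403 on /robots.txt alone should not block a host.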


On 17 October 2016 at 16:38, Tom Chiverton wrote:

    A site I am trying to index has its HTML content on one domain,
    and some linked PDFs on another (an Amazon S3 bucket).

    So I have set up my plugin.includes in site.xml :


    and made sure regexp-urlfilter.xml is OK with it all.

    But I observe some oddness during fetching, and can't locate the
    PDFs in the Solr collection.

    All the content on the PDF domain flies past with no pause:

    -finishing thread FetcherThread8, activeThreads=0
    -finishing thread FetcherThread9, activeThreads=0
    0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
    Using queue mode : byHost
    Fetcher: threads: 10
    Fetcher: throughput threshold: -1
    Fetcher: throughput threshold sequence: 5
    <>.... (queue crawl delay=5000ms)

    and then it hits the primary domain and starts pausing between each fetch.

    Turning the log level for the fetcher up to debug, I see:

    DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

    but there is no robots.txt in the root of the Amazon S3 URL -
    <> is a 403 !

    Any ideas what could be up ?
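
One way to double-check what the server really returns for robots.txt, independently of the crawler, is sketched below in Python. The helper names are hypothetical and are not part of Nutch. The sketch assumes the host is reachable from the machine running it and that a plain GET is representative of what the fetcher sends.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_status(page_url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the host gives for /robots.txt."""
    try:
        with urllib.request.urlopen(robots_url(page_url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is carried on the error.
        return err.code
```

Calling `robots_status()` with one of the PDF URLs shows whether the 403 is what the server sends for robots.txt on that host.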

-- *Tom Chiverton*
