Hi Mario, according to the specification, robots.txt is only ever looked for at the site root. In your case that would be: http://aaa.bb.com/robots.txt
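Something along these lines at the root should do what you want. This is just an untested sketch using the paths from your earlier message; note that every rule needs the leading "/", and because some robots parsers apply rules in the order they appear, I've put the Allow lines before the Disallow:

User-agent: *
Allow: /ccc/folder1/doc1.pdf
Allow: /ccc/folder1/doc2.pdf
Allow: /ccc/folder1/doc3.pdf
Disallow: /

Also remember that MCF caches robots data for roughly an hour, so give the cache time to expire before concluding that a new robots.txt is not being honored.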
Karl

On Wed, Sep 17, 2014 at 3:38 AM, Bisonti Mario <[email protected]> wrote:

> Hello.
>
> Does MCF use the robots.txt at http://aaa.bb.com/ccc/robots.txt, or does
> it look for robots.txt only at the root http://aaa.bb.com/ ?
>
> I restarted today, so after many hours I suppose the cache has expired,
> but MCF still scans everything in the subfolders.
>
> I see in the MCF postgres table robotsdata:
> "<binary data>";"aaa.bb.com:80";1410939267040
>
> Details of the MCF job:
> Seeds:
> http://aaa.bb.com/ccc/
> Include in crawl: .*
> Include in index: .*
> Include only hosts matching seeds? X
>
> Thanks a lot
>
> Mario
>
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Tuesday, September 16, 2014 7:22 PM
> To: [email protected]
> Subject: Re: Web crawling, robots.txt and access credentials
>
> Hi Mario,
>
> I looked at your robots.txt. In its current form, it should disallow
> EVERYTHING from your site. The reason is that some of your paths start
> with "/", but the allow clauses do not.
>
> As for why MCF is letting files through, I suspect that this is because
> MCF caches robots data. If you changed the file and expected MCF to pick
> that up immediately, it won't. The cached copy expires after, I believe,
> 1 hour. It's kept in the database, so even if you recycle the agents
> process it won't purge the cache.
>
> Karl
>
> On Tue, Sep 16, 2014 at 11:44 AM, Karl Wright <[email protected]> wrote:
>
> Authentication never bypasses robots.
>
> You will want to turn on connector debug logging to see the decisions
> that the web connector is making with respect to which documents are
> fetched or not fetched, and why.
>
> Karl
>
> On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <[email protected]>
> wrote:
>
> Hello.
>
> I would like to crawl some documents in a subfolder of a web site:
> http://aaa.bb.com/
>
> The structure is:
> http://aaa.bb.com/ccc/folder1
> http://aaa.bb.com/ccc/folder2
> http://aaa.bb.com/ccc/folder3
>
> Folder ccc and its subfolders are protected with Basic security:
> username: joe
> Password: ppppp
>
> I want to permit crawling of only some docs in folder1,
> so I put robots.txt at
> http://aaa.bb.com/ccc/robots.txt
>
> The contents of robots.txt are:
>
> User-agent: *
> Disallow: /
> Allow: folder1/doc1.pdf
> Allow: folder1/doc2.pdf
> Allow: folder1/doc3.pdf
>
> In MCF 1.7 I set up a web repository connection with:
> “Obey robots.txt for all fetches”
> and, under Access credentials:
> http://aaa.bb.com/ccc/
> Basic authentication: joe and ppp
>
> When I create a job with:
> Include in crawl: .*
> Include in index: .*
> Include only hosts matching seeds? X
>
> and I start it, it crawls all the content of folder1, folder2, and
> folder3, instead of, as I expected, only:
>
> http://aaa.bb.com/ccc/folder1/doc1.pdf
> http://aaa.bb.com/ccc/folder1/doc2.pdf
> http://aaa.bb.com/ccc/folder1/doc3.pdf
>
> Why is this?
>
> Perhaps Basic Authentication bypasses the specific “Obey robots.txt for
> all fetches” setting?
>
> Thanks a lot for your help.
>
> Mario
