Hello.

I would like to crawl some documents in a subfolder of a web site:
http://aaa.bb.com/

The structure is:
http://aaa.bb.com/ccc/folder1
http://aaa.bb.com/ccc/folder2
http://aaa.bb.com/ccc/folder3

Folder ccc and its subfolders are protected with Basic authentication:
Username: joe
Password: ppppp

I want to permit crawling of only some documents in folder1,
so I put a robots.txt at:
http://aaa.bb.com/ccc/robots.txt

The contents of the robots.txt file are:
User-agent: *
Disallow: /
Allow: folder1/doc1.pdf
Allow: folder1/doc2.pdf
Allow: folder1/doc3.pdf
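
For what it's worth, here is a quick sanity check of those rules with Python's
standard urllib.robotparser. This is only an approximation, since MCF has its
own robots.txt handling, and the folder2 URL below is just a hypothetical
example of something I want blocked:

# Quick check of the rules above with the stdlib robots.txt parser.
# Only an approximation of what a crawler does; MCF uses its own
# robots.txt implementation, so the actual results may differ.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /
Allow: folder1/doc1.pdf
Allow: folder1/doc2.pdf
Allow: folder1/doc3.pdf
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

for url in (
    "http://aaa.bb.com/ccc/folder1/doc1.pdf",   # a URL I want allowed
    "http://aaa.bb.com/ccc/folder2/other.pdf",  # hypothetical URL I want blocked
):
    print(url, "->", rp.can_fetch("*", url))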


On MCF 1.7 I set up a Web repository connection with:
“Obey robots.txt for all fetches”
and, under Access credentials:
http://aaa.bb.com/ccc/
Basic authentication: joe and ppp
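
(Just to make clear what I mean by Basic authentication, here is a minimal
sketch of an equivalent authenticated request with Python's urllib; this only
illustrates the credentials in use, not how the MCF web connector does it
internally.)

# Minimal illustration of a Basic-authenticated request against the
# protected /ccc/ area. The Authorization header is simply
# "Basic " + base64(username:password); MCF's web connector builds
# the equivalent header itself from the Access credentials above.
import base64
import urllib.request

url = "http://aaa.bb.com/ccc/folder1/doc1.pdf"
token = base64.b64encode(b"joe:ppppp").decode("ascii")

req = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))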

When I create a job with:
Include in crawl: .*
Include in index: .*
Include only hosts matching seeds? X (checked)
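
(As a side note on the include pattern: ".*" matches every URL, so the
job-level filters do not narrow anything down here. A trivial illustration
with plain Python regex matching, which is only an approximation of how MCF
applies the patterns:)

# The ".*" include pattern matches any URL, so it does not restrict
# the crawl by itself. Plain Python re is used only to illustrate;
# MCF applies its own regex evaluation to the discovered URLs.
import re

include_in_crawl = re.compile(".*")

for url in (
    "http://aaa.bb.com/ccc/folder1/doc1.pdf",
    "http://aaa.bb.com/ccc/folder2/whatever.html",  # hypothetical URL
):
    print(url, "->", bool(include_in_crawl.match(url)))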

When I start the job, it crawls all the content of folder1, folder2,
and folder3, instead of, as I expected, only:
http://aaa.bb.com/ccc/folder1/doc1.pdf
http://aaa.bb.com/ccc/folder1/doc2.pdf
http://aaa.bb.com/ccc/folder1/doc3.pdf


Why is this happening?

Perhaps the Basic authentication bypasses the “Obey robots.txt for all
fetches” setting?

Thanks a lot for your help.
Mario
