Hello. I would like to crawl some documents in a subfolder of a web site: http://aaa.bb.com/
The structure is:

  http://aaa.bb.com/ccc/folder1
  http://aaa.bb.com/ccc/folder2
  http://aaa.bb.com/ccc/folder3

The folder ccc and its subfolders are protected with Basic authentication (username: joe, password: ppppp).

I want to permit crawling of only some documents in folder1, so I put a robots.txt at http://aaa.bb.com/ccc/robots.txt. The contents of robots.txt are:

  User-agent: *
  Disallow: /
  Allow: folder1/doc1.pdf
  Allow: folder1/doc2.pdf
  Allow: folder1/doc3.pdf

On MCF 1.7 I set up a Web repository connection with "Obey robots.txt for all fetches" and, under Access credentials, http://aaa.bb.com/ccc/ with Basic authentication: joe and ppp.

I then create a job with:

  Include in crawl: .*
  Include in index: .*
  Include only hosts matching seeds? [x]

When I start it, it crawls all the content of folder1, folder2, and folder3 instead of, as I expected, only:

  http://aaa.bb.com/ccc/folder1/doc1.pdf
  http://aaa.bb.com/ccc/folder1/doc2.pdf
  http://aaa.bb.com/ccc/folder1/doc3.pdf

Why is this? Does the Basic authentication perhaps bypass the "Obey robots.txt for all fetches" setting?

Thanks a lot for your help.
Mario
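
P.S. For reference, a minimal sketch (not part of my MCF setup) that feeds the robots.txt text exactly as quoted above to Python's standard urllib.robotparser and prints which URLs it would consider allowed. The folder2 URL is just a hypothetical example, and the MCF Web connector may of course interpret the rules differently:

  from urllib.robotparser import RobotFileParser

  # The robots.txt rules exactly as quoted above.
  ROBOTS_TXT = """\
  User-agent: *
  Disallow: /
  Allow: folder1/doc1.pdf
  Allow: folder1/doc2.pdf
  Allow: folder1/doc3.pdf
  """

  urls = [
      "http://aaa.bb.com/ccc/folder1/doc1.pdf",
      "http://aaa.bb.com/ccc/folder1/doc2.pdf",
      "http://aaa.bb.com/ccc/folder1/doc3.pdf",
      "http://aaa.bb.com/ccc/folder2/other.pdf",  # hypothetical document that should stay blocked
  ]

  parser = RobotFileParser()
  parser.parse(ROBOTS_TXT.splitlines())  # parse the quoted rules directly, no network fetch

  for url in urls:
      # can_fetch() matches the URL path against the Disallow/Allow rules for user-agent "*"
      verdict = "allowed" if parser.can_fetch("*", url) else "disallowed"
      print(url, "->", verdict)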
