Hi Mario,

I looked at your robots.txt. In its current form it should disallow EVERYTHING on your site: your Disallow path starts with "/", but your Allow paths do not, and a path that does not begin with "/" never matches any URL, so the Allow rules have no effect.
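For what it's worth, something like the following should behave the way you intend. Two caveats, though: paths in robots.txt are matched against the full request path, so the folder1 entries most likely need the /ccc/ prefix as well, and crawlers conventionally fetch robots.txt only from the root of the host (http://aaa.bb.com/robots.txt), not from a subdirectory. I have also put the Allow lines first, since some robots parsers apply the first matching rule rather than the most specific one:

User-agent: *
Allow: /ccc/folder1/doc1.pdf
Allow: /ccc/folder1/doc2.pdf
Allow: /ccc/folder1/doc3.pdf
Disallow: /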
As for why MCF is letting files through, I suspect that this is because MCF caches robots data. If you changed the file and expected MCF to pick that up immediately, it won't. The cached copy expires after, I believe, 1 hour. It is kept in the database, so even recycling the agents process won't purge the cache.

Karl

On Tue, Sep 16, 2014 at 11:44 AM, Karl Wright <[email protected]> wrote:
> Authentication does not bypass robots, ever.
>
> You will want to turn on connector debug logging to see the decisions
> that the web connector is making with respect to which documents are
> fetched or not fetched, and why.
>
> Karl
>
> On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <[email protected]>
> wrote:
>
>> Hello.
>>
>> I would like to crawl some documents in a subfolder of a web site:
>> http://aaa.bb.com/
>>
>> The structure is:
>> http://aaa.bb.com/ccc/folder1
>> http://aaa.bb.com/ccc/folder2
>> http://aaa.bb.com/ccc/folder3
>>
>> Folder ccc and its subfolders are protected with Basic security:
>> username: joe
>> password: ppppp
>>
>> I want to permit crawling of only some docs in folder1, so I put a
>> robots.txt at
>> http://aaa.bb.com/ccc/robots.txt
>>
>> The contents of the robots.txt file are:
>>
>> User-agent: *
>> Disallow: /
>> Allow: folder1/doc1.pdf
>> Allow: folder1/doc2.pdf
>> Allow: folder1/doc3.pdf
>>
>> On MCF 1.7 I set up a web repository connection with
>> "Obey robots.txt for all fetches"
>> and, under Access credentials:
>> http://aaa.bb.com/ccc/
>> Basic authentication: joe and ppppp
>>
>> When I create a job with:
>> Include in crawl: .*
>> Include in index: .*
>> Include only hosts matching seeds? X
>>
>> and start it, it crawls all the content of folder1, folder2, and
>> folder3 instead of, as I expected, only:
>>
>> http://aaa.bb.com/ccc/folder1/doc1.pdf
>> http://aaa.bb.com/ccc/folder1/doc2.pdf
>> http://aaa.bb.com/ccc/folder1/doc3.pdf
>>
>> Why is this?
>>
>> Perhaps the Basic authentication bypasses the specific "Obey robots.txt
>> for all fetches"?
>>
>> Thanks a lot for your help.
>> Mario
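P.S. If you want to sanity-check robots rules without waiting for a recrawl, Python's standard-library robotparser gives a rough approximation of these decisions. It is not necessarily identical to what the MCF web connector's parser does, but it is close enough to catch the missing-slash problem (note that it applies the first matching rule, which is why ordering the Allow lines first matters):

import urllib.robotparser

# The corrected rules from above, inlined for the test.
rules = """\
User-agent: *
Allow: /ccc/folder1/doc1.pdf
Allow: /ccc/folder1/doc2.pdf
Allow: /ccc/folder1/doc3.pdf
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# doc1.pdf should print True; the folder2 URL (a made-up example
# of a disallowed document) should print False.
for url in ("http://aaa.bb.com/ccc/folder1/doc1.pdf",
            "http://aaa.bb.com/ccc/folder2/other.pdf"):
    print(url, rp.can_fetch("*", url))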
