Hi Mario, according to the specification, robots.txt is only ever looked for at the site root. In your case that would be: http://aaa.bb.com/robots.txt
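Something along these lines at the root should do what you want. This is just an untested sketch using the paths from your earlier message; note that every rule needs the leading "/", and because some robots parsers apply rules in the order they appear, I've put the Allow lines before the Disallow:

User-agent: *
Allow: /ccc/folder1/doc1.pdf
Allow: /ccc/folder1/doc2.pdf
Allow: /ccc/folder1/doc3.pdf
Disallow: /

Also remember that MCF caches robots data for roughly an hour, so give the cache time to expire before concluding that a new robots.txt is not being honored.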
Karl

On Wed, Sep 17, 2014 at 3:38 AM, Bisonti Mario <[email protected]> wrote:

> Hello.
>
> Does MCF use the robots.txt at http://aaa.bb.com/ccc/robots.txt, or does
> it look for robots.txt only at the root http://aaa.bb.com/ ?
>
> I restarted today, so after many hours I suppose the cache has expired,
> but MCF still scans everything in the subfolders.
>
> I see in the MCF postgres table robotsdata:
> "<binary data>";"aaa.bb.com:80";1410939267040
>
> Details of the MCF job:
> Seeds:
> http://aaa.bb.com/ccc/
> Include in crawl: .*
> Include in index: .*
> Include only hosts matching seeds? X
>
> Thanks a lot
>
> Mario
>
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Tuesday, September 16, 2014 7:22 PM
> To: [email protected]
> Subject: Re: Web crawling, robots.txt and access credentials
>
> Hi Mario,
>
> I looked at your robots.txt. In its current form, it should disallow
> EVERYTHING from your site. The reason is that some of your paths start
> with "/", but the allow clauses do not.
>
> As for why MCF is letting files through, I suspect that this is because
> MCF caches robots data. If you changed the file and expected MCF to pick
> that up immediately, it won't. The cached copy expires after, I believe,
> 1 hour. It's kept in the database, so even if you recycle the agents
> process it won't purge the cache.
>
> Karl
>
> On Tue, Sep 16, 2014 at 11:44 AM, Karl Wright <[email protected]> wrote:
>
> Authentication never bypasses robots.
>
> You will want to turn on connector debug logging to see the decisions
> that the web connector is making with respect to which documents are
> fetched or not fetched, and why.
>
> Karl
>
> On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <[email protected]>
> wrote:
>
> Hello.
>
> I would like to crawl some documents in a subfolder of a web site:
> http://aaa.bb.com/
>
> The structure is:
> http://aaa.bb.com/ccc/folder1
> http://aaa.bb.com/ccc/folder2
> http://aaa.bb.com/ccc/folder3
>
> Folder ccc and its subfolders are protected with Basic security:
> username: joe
> Password: ppppp
>
> I want to permit crawling of only some docs in folder1,
> so I put robots.txt at
> http://aaa.bb.com/ccc/robots.txt
>
> The contents of robots.txt are:
>
> User-agent: *
> Disallow: /
> Allow: folder1/doc1.pdf
> Allow: folder1/doc2.pdf
> Allow: folder1/doc3.pdf
>
> In MCF 1.7 I set up a web repository connection with:
> “Obey robots.txt for all fetches”
> and, under Access credentials:
> http://aaa.bb.com/ccc/
> Basic authentication: joe and ppp
>
> When I create a job with:
> Include in crawl: .*
> Include in index: .*
> Include only hosts matching seeds? X
>
> and I start it, it crawls all the content of folder1, folder2, and
> folder3, instead of, as I expected, only:
>
> http://aaa.bb.com/ccc/folder1/doc1.pdf
> http://aaa.bb.com/ccc/folder1/doc2.pdf
> http://aaa.bb.com/ccc/folder1/doc3.pdf
>
> Why is this?
>
> Perhaps Basic Authentication bypasses the specific “Obey robots.txt for
> all fetches” setting?
>
> Thanks a lot for your help.
>
> Mario
