I have read the wget man and info pages, searched the archive of this
list and googled all over the place but I still don't have a
satisfactory answer to a simple question:
Can wget be asked not to retrieve *anything* - not even .html pages -
from a given directory and its subdirectories?
This is relevant in situations where one wants to mirror a site with
many links to a restricted part of the site which requires
authorization but is otherwise of no interest. With wget-1.9.1 my log
file contains hundreds of "Authorization failure" messages.
For example,

    wget -nv -w1 -kpE -m -X "/restricted" http://www.example.com/ &

will still attempt to download URLs like
http://www.example.com/restricted/index.html and
http://www.example.com/restricted/subdir/rubbish.html
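
To check how often this happens, the log can be searched for URLs under the
excluded directory. The following is only a minimal sketch: it assumes the
mirror was run with "-o wget.log" so that messages go to a file, and both
the file name wget.log and the helper count_excluded_hits are hypothetical
names chosen for illustration, not part of wget.

```shell
#!/bin/sh
# Assumed invocation (not run here); -o writes wget's messages to a log file:
#   wget -nv -w1 -kpE -m -X "/restricted" -o wget.log http://www.example.com/

# count_excluded_hits LOGFILE URL_PREFIX
# Prints the number of log lines that mention a URL starting with URL_PREFIX.
# Hypothetical helper: it makes no assumption about wget's log format beyond
# the URL appearing somewhere on the line.
count_excluded_hits() {
    # -c counts matching lines, -F treats the prefix as a fixed string.
    # grep exits non-zero when the count is 0, hence the "|| true".
    grep -cF -- "$2" "$1" || true
}

# Example use, if the assumed log file exists:
if [ -f wget.log ]; then
    count_excluded_hits wget.log "http://www.example.com/restricted/"
fi
```

A non-zero count means wget still requested pages below /restricted
despite the -X option.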
Looking through the source of the newly released wget-1.10, it appears
that wget retrieves .html pages even if they are in the
exclude-directories list, so presumably wget-1.10 will behave the same
way.
I am not sure if this is related, but something similar is logged in
Bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=124867
Can anyone confirm the behaviour I have seen, or suggest a workaround?
Many thanks in advance,
Johann