I am using wget-1.8.1. I was trying to crawl a site for all of its HTML pages, so I used the command-line option --accept=html,htm, which worked well. However, some sites have Apache configured to synthesize a directory listing when a directory name is requested. The links in the generated HTML page have no file extensions (just directory names), so with --accept=html,htm those links won't be followed.
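To illustrate the problem, here is a minimal Python sketch of extension-based accept matching as I understand it. It is only a simplified model of the behaviour described above, not wget's actual code; the function name `accepted` and the example.com URLs are made up for illustration.

```python
from posixpath import basename, splitext
from urllib.parse import urlsplit


def accepted(url, accept=("html", "htm")):
    """Return True if the last path component ends in an accepted extension.

    Hypothetical helper: a simplified model of an accept list,
    not how wget is implemented.
    """
    name = basename(urlsplit(url).path)          # "" for a URL ending in "/"
    ext = splitext(name)[1].lstrip(".").lower()  # "" when there is no extension
    return ext in accept


for url in (
    "http://example.com/docs/index.html",  # ordinary HTML page
    "http://example.com/docs/",            # directory link from a listing
    "http://example.com/docs/manual",      # directory name, no trailing slash
):
    print(url, accepted(url))
```

Only the first URL passes the check. The two directory-style links have no extension to match against, so under this model they are skipped and the pages beneath them are never reached.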
What would be useful is some way to tell wget to also accept URLs with no file extension, so I could say "accept .htm URLs or URLs with no extension". I guess an alternative approach is to specify which content types to keep (e.g. keep text/html pages). But that means wget would have to fetch each resource before realising it was a waste of time, so it is less efficient. For now I will work around it by using a reject list with all the unwanted file extensions I can think of.
Alan
