I am using wget-1.8.1.

I was trying to crawl a site for all of its HTML pages, so I used the
command line option --accept=html,htm, which worked well. However,
some sites have Apache configured to synthesize a directory listing
when a request is made for a directory name. The links to
subdirectories in the generated HTML page have no file extension
(they are just directory names), so with --accept=htm,html those
links won't be followed.

It would be useful to be able to tell wget to also accept URLs
with no file extension, so that I could say "accept .htm URLs or
URLs with no extension".
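
As a rough sketch of the rule I have in mind (this is a hypothetical
shell function, not an existing wget option; the name wants_url and
the example.com URLs are made up for illustration):

```shell
# Hypothetical filter, NOT a wget feature: succeed for URLs ending in
# .htm/.html, or whose last path component has no extension at all.
wants_url() {
    path=${1%%[?#]*}        # drop any query string or fragment
    path=${path#*://}       # drop the scheme
    case $path in
        */*) path=/${path#*/} ;;   # drop the host, keep the path
        *)   path=/ ;;             # bare host: treat as the root directory
    esac
    last=${path##*/}        # last path component ("" for a directory URL)
    case $last in
        *.htm|*.html) return 0 ;;  # explicit HTML extension
        *.*)          return 1 ;;  # some other extension
        *)            return 0 ;;  # no extension (directory or bare name)
    esac
}

# Example: a directory URL is accepted, an image is not.
wants_url "http://example.com/docs/" && echo "follow docs/"
wants_url "http://example.com/img/logo.gif" || echo "skip logo.gif"
```

The host name is stripped first so that the dots in it are not
mistaken for a file extension.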

I guess an alternative approach is to specify which content type to
keep (keep text/html pages). This means wget would have to request
each resource before discovering it was not wanted, so it is less
efficient.
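
To illustrate the content type idea, here is a minimal sketch of the
check itself. It assumes the response headers are already available
on standard input; is_html_response is a made-up helper name, not
something wget provides:

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed
# only if a Content-Type header announces text/html.
is_html_response() {
    # -i: header names are case-insensitive; -q: exit status only.
    grep -qi '^[[:space:]]*content-type:[[:space:]]*text/html'
}

# Example with canned headers instead of a real server response.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n' |
    is_html_response && echo "keep"
```

The headers could come from a HEAD request, for example
"curl -sI URL | is_html_response" if curl is available. That avoids
downloading the body, but it still costs one round trip per URL.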

I will work around it by using a reject list with all the unwanted
file extensions I can think of.
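
The workaround would look roughly like this. The extension list is
only an example and will certainly be incomplete, http://example.com/
is a placeholder, and crawl_cmd is a made-up wrapper that prints the
command instead of running it:

```shell
# Extensions to skip; illustrative only, extend as needed.
reject_list="gif,jpg,jpeg,png,css,js,zip,gz,tar,pdf,exe"

# Print the wget command for a given start URL so it can be checked
# before it is actually run.
crawl_cmd() {
    echo wget --recursive --no-parent --reject="$reject_list" "$1"
}

crawl_cmd http://example.com/
```

Because the list only names extensions to skip, directory URLs with
no extension are still followed.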

Alan
