wget suggestion - --accept support for no-extension (or similar)

Alan Kent Wed, 27 Mar 2002 18:49:02 -0800

I am using wget-1.8.1.

I was trying to crawl a site for all of its HTML pages, so I used the
command line option --accept=html,htm which worked well. However,
some sites have Apache configured to synthesize a directory listing
(if you do a request on the directory name). The HTML file produced
contains URLs with no file extension (just directory names),
so the --accept=htm,html option means those links won't be followed.


What could be useful is to somehow tell it to also accept URLs
with no file extension so I can tell it "accept .htm URLs or
URLs with no extension".
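
To make the rule I have in mind concrete, here is a minimal sketch in
Python. It is not wget code; the function name accepts() and its
parameters are made up for illustration only.

```python
import posixpath
from urllib.parse import urlsplit


def accepts(url, suffixes=("htm", "html"), allow_no_extension=True):
    """Hypothetical filter: keep a URL if its path ends in an accepted
    suffix, or (optionally) if its last path segment has no extension."""
    path = urlsplit(url).path                # drops any ?query and #fragment
    last_segment = posixpath.basename(path)  # "" for a path ending in "/"
    _, ext = posixpath.splitext(last_segment)
    if not ext:
        # Directory-style URL such as /docs/ or /docs
        return allow_no_extension
    return ext[1:].lower() in suffixes
```

With this rule, http://example.com/docs/ and .../index.html would be
followed, while .../logo.gif would be skipped.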

I guess an alternative approach is to say what content type to keep
(keep text/html pages). This means wget would have to request each
resource before it could realise it was a waste of time (so it is
less efficient).
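
A rough sketch of that check, again in Python and not taken from wget:
the hypothetical is_html() helper can only run once the response
headers have arrived, which is where the extra cost comes from.

```python
def is_html(content_type):
    """Return True if a Content-Type header value names text/html."""
    # Drop parameters such as "; charset=ISO-8859-1" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"
```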

I will work around it by using a reject list with all the unwanted
file extensions I can think of.

Alan
