Hi,

I'm trying to create a little program which parses e-mails from Google News
Alerts, downloads the referenced stories, and creates an index of the stories
to make a sort of "virtual scrapbook" of news stories on a particular subject.
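
In outline, the download-and-index loop is roughly this (urls.txt and
index.html are just placeholder names for what the e-mail parser produces):

  # one story URL per line, as extracted from the alert e-mails
  while read url; do
      wget --page-requisites --convert-links "$url"
      # ...then append a link to the saved top-level page to index.html
  done < urls.txt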

I'm using wget --page-requisites (and other options) to download the pages.
The problem is that wget doesn't always save the top-level HTML file under a
predictable filename when it gets a 302 response from the server (and maybe
in other situations as well; I haven't checked).

For example:

wget --page-requisites http://www.theherald.co.uk/6123.shtml

This downloads to www.theherald.co.uk/news/6123.html rather than to
www.theherald.co.uk/6123.shtml.
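
The 302 itself is easy to see without downloading anything (exact output
depends on the wget version):

  # --spider: don't save anything; --server-response: print the HTTP headers,
  # including the 302 and its Location: header
  wget --spider --server-response http://www.theherald.co.uk/6123.shtml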

This makes it hard to create the index file: because I can't predict where
the top-level HTML file has been saved, I can't automatically generate the
list of URLs I want the index to link to.

Is there a neat way to fix this?  --output-document has potential, but I only
want to force the filename of the top-level HTML file; I don't want all the
page requisites dumped into a single file.
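
As far as I can tell, something like

  wget --page-requisites --output-document=6123.html \
      http://www.theherald.co.uk/6123.shtml

just writes the page *and* all its requisites into 6123.html, one after
the other.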

Alternatively, I could download once with --page-requisites and then again
with --output-document, but then I can't use --convert-links to make the page
suitable for local viewing.
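
That is, roughly:

  # first pass: page plus requisites, top-level filename unpredictable
  wget --page-requisites --convert-links http://www.theherald.co.uk/6123.shtml
  # second pass: a copy under a known name, but its links still point at the
  # remote site, so it isn't much use for local viewing
  wget --output-document=6123.html http://www.theherald.co.uk/6123.shtml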

Any suggestions?

Thanks in advance,
Jim

-- 
Jim Farrand
