Title: --page-requisites doesn't appear to be working for amazon.com

On Red Hat Enterprise Linux 3 and Red Hat 7.2, with wget versions 10.1 and 10.2:

bradym-1950: wget -k -p http://www.amazon.com/exec/obidos/subst/home/home.html
--17:16:11--  http://www.amazon.com/exec/obidos/subst/home/home.html
           => `www.amazon.com/exec/obidos/subst/home/home.html'
Resolving www.amazon.com... 72.21.206.5
Connecting to www.amazon.com|72.21.206.5|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [   <=>                                                                                                            ] 71,375       102.55K/s

17:16:12 (102.40 KB/s) - `www.amazon.com/exec/obidos/subst/home/home.html' saved [71375]


FINISHED --17:16:12--
Downloaded: 71,375 bytes in 1 files
Converting www.amazon.com/exec/obidos/subst/home/home.html... 2-87
Converted 1 files in 0.003 seconds.
bradym-1951: ls -R
.:
www.amazon.com/

./www.amazon.com:
exec/

./www.amazon.com/exec:
obidos/

./www.amazon.com/exec/obidos:
subst/

./www.amazon.com/exec/obidos/subst:
home/

./www.amazon.com/exec/obidos/subst/home:
home.html
bradym-1952:

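The resulting home.html contains many img tags, yet none of the images they reference were downloaded: wget reports only 1 file, and home.html is the only file on disk.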
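
One way to narrow this down is to check which hosts the img tags in the saved page point at, since wget stays on the starting host unless -H (--span-hosts) is given. The Python sketch below is only an illustration: the names ImgSrcCollector, image_hosts and SAMPLE_HTML are made up here, and the sample markup uses placeholder hosts rather than anything taken from Amazon's page.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_hosts(html_text, base_url):
    """Return a Counter mapping host name -> number of <img> tags on that host."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    hosts = Counter()
    for src in collector.sources:
        # Resolve relative references against the page URL before taking the host.
        absolute = urljoin(base_url, src)
        hosts[urlparse(absolute).netloc] += 1
    return hosts


# Placeholder markup with made-up hosts, only to show the shape of the result.
SAMPLE_HTML = (
    '<img src="/logo.gif">'
    '<img src="http://images.example.net/a.gif">'
    '<IMG SRC="http://images.example.net/b.gif" />'
)

if __name__ == "__main__":
    for host, count in image_hosts(
        SAMPLE_HTML, "http://www.example.com/dir/page.html"
    ).items():
        print(host, count)
```

To run it against the real download, read www.amazon.com/exec/obidos/subst/home/home.html into a string and pass the original URL as base_url. Because -k rewrites links to files that were not downloaded as absolute URLs, the hosts should be visible in the saved copy. If most images turn out to live on a host other than www.amazon.com, that would be worth mentioning in the report.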

Amazon.com does have a robots.txt, but that doesn't look like it should affect any of those images, since none are stored in the excluded directories:

# Disallow all crawlers access to certain pages.

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
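
The claim that these rules should not block the images can be checked mechanically. The following is a minimal sketch using Python's standard urllib.robotparser, which is not the same matcher wget uses, so it is only a sanity check. The helper names make_parser and is_allowed and the image URL in the usage note are hypothetical; the rules are copied from the listing above.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in this message.
ROBOTS_TXT = """\
# Disallow all crawlers access to certain pages.

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, agent="Wget"):
    """Return True if the rules permit the given user agent to fetch url."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = make_parser(ROBOTS_TXT)
    page = "http://www.amazon.com/exec/obidos/subst/home/home.html"
    print(page, is_allowed(rules, page))
```

Calling is_allowed with each img URL from home.html (for instance a made-up one such as http://www.amazon.com/images/hypothetical.gif) shows whether any of them falls under a Disallow prefix.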

I've tried this with other sites (cnn, yahoo, google) and got similar results.

I want to use wget to archive some pages that will be disappearing shortly. What can I do?
