On Red Hat Enterprise Linux 3 and Red Hat Linux 7.2, with wget versions 1.10.1 and 1.10.2:
bradym-1950: wget -k -p http://www.amazon.com/exec/obidos/subst/home/home.html
--17:16:11-- http://www.amazon.com/exec/obidos/subst/home/home.html
=> `www.amazon.com/exec/obidos/subst/home/home.html'
Resolving www.amazon.com... 72.21.206.5
Connecting to www.amazon.com|72.21.206.5|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 71,375 102.55K/s
17:16:12 (102.40 KB/s) - `www.amazon.com/exec/obidos/subst/home/home.html' saved [71375]
FINISHED --17:16:12--
Downloaded: 71,375 bytes in 1 files
Converting www.amazon.com/exec/obidos/subst/home/home.html... 2-87
Converted 1 files in 0.003 seconds.
bradym-1951: ls -R
.:
www.amazon.com/
./www.amazon.com:
exec/
./www.amazon.com/exec:
obidos/
./www.amazon.com/exec/obidos:
subst/
./www.amazon.com/exec/obidos/subst:
home/
./www.amazon.com/exec/obidos/subst/home:
home.html
bradym-1952:
And the resulting home.html contains a bunch of img tags, but as the ls -R output shows, none of the referenced images were actually downloaded.
Amazon.com does have a robots.txt, but it doesn't look like it should affect any of those images, since none of them live under the disallowed paths:
# Disallow all crawlers access to certain pages.
User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
I've tried this with other sites (CNN, Yahoo, Google) with similar results.
I want to use wget to archive some pages that will disappear shortly. What can I do?
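For what it's worth, the GNU wget manual's own suggestion for archiving a single page with all its requisites adds a few options beyond -p: -H, because requisites hosted on other servers (as Amazon's images are) are not fetched without host spanning, and wget also honors robots.txt during retrieval, which -e robots=off disables. A sketch along those lines (the combination comes from the wget 1.10 manual's example; whether it resolves this particular report is untested):

```shell
# Hedged sketch, following the recipe in the wget manual:
#   -p            fetch page requisites (images, CSS, etc.)
#   -H            span hosts, so requisites on other servers are fetched
#   -E            save HTML with an .html extension
#   -k            convert links for local viewing
#   -K            keep pristine .orig copies of converted files
#   -e robots=off ignore robots.txt (use responsibly)
wget -e robots=off -E -H -k -K -p \
    http://www.amazon.com/exec/obidos/subst/home/home.html
```

Without -H, wget silently skips any img tag whose src points at a different hostname, which would leave exactly the symptom described above: a saved home.html full of img tags and no image files on disk.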
Title: --page-requisites doesn't appear to be working for amazon.com
From: Montz, Brady
