I ran wget on this file:
<! ------------------------------------------------------ >
<A HREF="a.html">a</a>
<! ------------------------------------------------------ >
<A HREF="b.html">b</a>
wget downloads b.html, but it does not download a.html.
However, if the file contains only the following, wget does download a.html:
<! ------------------------------------------------------ >
<A HREF="a.html">a</a>
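One possible explanation, offered only as a sketch and not confirmed by anything in this report: if the parser treats "<!" as the start of an SGML declaration in which each "--" opens or closes a comment, then a run of 54 dashes (27 pairs, an odd number) leaves a comment open at the closing ">". The open comment would then swallow everything up to the next "--", which includes the a.html link. The count of 54 dashes is an assumption, though it is consistent with the Content-Length of 166 in the debug output below. The function names and the back-out behavior for an unterminated declaration are hypothetical and are not taken from wget's source.

```python
import re


def strip_declarations(html: str) -> str:
    """Drop `<! ... >` declarations, treating each `--` as a comment toggle."""
    out = []
    i, n = 0, len(html)
    while i < n:
        if html.startswith("<!", i):
            j, in_comment, closed = i + 2, False, False
            while j < n:
                if html.startswith("--", j):
                    in_comment = not in_comment  # `--` opens or closes a comment
                    j += 2
                elif html[j] == ">" and not in_comment:
                    closed = True
                    j += 1
                    break
                else:
                    j += 1
            if closed:
                i = j  # skip the declaration and anything it swallowed
                continue
        # Ordinary text, or an unterminated declaration that we back out of
        # and treat as plain text (assumed behavior, not taken from wget).
        out.append(html[i])
        i += 1
    return "".join(out)


def hrefs(html: str) -> list[str]:
    """Return HREF targets that survive declaration stripping."""
    return re.findall(r'HREF="([^"]+)"', strip_declarations(html), flags=re.IGNORECASE)


def comment(dashes: int) -> str:
    return "<! " + "-" * dashes + " >"


def page(dashes: int, names: list[str]) -> str:
    """Build a test page: one comment line before each link."""
    return "".join(f'{comment(dashes)}\n<A HREF="{x}.html">{x}</a>\n' for x in names)


print(hrefs(page(54, ["a", "b"])))  # ['b.html']
print(hrefs(page(54, ["a"])))       # ['a.html']
print(hrefs(page(56, ["a", "b"])))  # ['a.html', 'b.html']
```

Under these assumptions the sketch reproduces both symptoms: with two comment lines only b.html is seen, and with one comment line a.html is seen. A dash count divisible by 4 (56 here) would avoid the problem, which may be worth trying as a workaround.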
I ran this command:
~/wget-1.7/src/wget --recursive --debug http://ucsee.eecs.berkeley.edu/~bjacob/ > & output
The command "find ucsee.eecs.berkeley.edu/" outputs:
ucsee.eecs.berkeley.edu/
ucsee.eecs.berkeley.edu/%7Ebjacob
ucsee.eecs.berkeley.edu/%7Ebjacob/index.html
ucsee.eecs.berkeley.edu/%7Ebjacob/b.html
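The listing above can also be checked mechanically. The following is a minimal sketch, assuming the mirror was saved under the current directory and that all links are plain relative HREF attributes; the function name missing_links is hypothetical.

```python
import re
from pathlib import Path


def missing_links(index_path: str) -> list[str]:
    """Return relative HREF targets in index_path that have no file beside it."""
    index = Path(index_path)
    html = index.read_text()
    # Naive attribute scan: it deliberately ignores comments, so it reports
    # every link present in the file, whether or not a parser would see it.
    targets = re.findall(r'HREF="([^"]+)"', html, flags=re.IGNORECASE)
    return [t for t in targets if not (index.parent / t).exists()]
```

For the run described here, calling missing_links("ucsee.eecs.berkeley.edu/%7Ebjacob/index.html") should list a.html and nothing else.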
The first version of the file (the one that triggers the bug) is at this URL:
http://ucsee.eecs.berkeley.edu/~bjacob/
I have no .wgetrc file.
The bug occurs in wget 1.7.
wget 1.6 downloads a.html in both cases.
Here is the output file from the above wget command:
DEBUG output created by Wget 1.7 on linux-gnu.
parseurl ("http://ucsee.eecs.berkeley.edu/~bjacob/") -> host ucsee.eecs.berkeley.edu -> opath ~bjacob/ -> dir ~bjacob -> file -> ndir ~bjacob
newpath: /%7Ebjacob/
Checking for ucsee.eecs.berkeley.edu in host_name_address_map.
Checking for ucsee.eecs.berkeley.edu in host_slave_master_map.
First time I hear about ucsee.eecs.berkeley.edu by that name; looking it up.
Caching ucsee.eecs.berkeley.edu <-> 128.32.138.93
Checking again for ucsee.eecs.berkeley.edu in host_slave_master_map.
--23:02:08-- http://ucsee.eecs.berkeley.edu/%7Ebjacob/
=> `ucsee.eecs.berkeley.edu/%7Ebjacob/index.html'
Connecting to ucsee.eecs.berkeley.edu:80... Found ucsee.eecs.berkeley.edu in host_name_address_map: 128.32.138.93
Created fd 3.
connected!
---request begin---
GET /%7Ebjacob/ HTTP/1.0
User-Agent: Wget/1.7
Host: ucsee.eecs.berkeley.edu
Accept: */*
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Thu, 05 Jul 2001 06:00:30 GMT
Server: Apache/1.3.2 (Unix)
Last-Modified: Thu, 05 Jul 2001 05:50:29 GMT
ETag: "297b1-a6-3b440025"
Accept-Ranges: bytes
Content-Length: 166
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html
Found ucsee.eecs.berkeley.edu in host_name_address_map: 128.32.138.93
Registered fd 3 for persistent reuse.
Length: 166 [text/html]
0K 100% @ 162.11 KB/s
23:02:09 (162.11 KB/s) - `ucsee.eecs.berkeley.edu/%7Ebjacob/index.html' saved [166/166]
parseurl ("http://ucsee.eecs.berkeley.edu/~bjacob/") -> host ucsee.eecs.berkeley.edu -> opath ~bjacob/ -> dir ~bjacob -> file -> ndir ~bjacob
newpath: /%7Ebjacob/
Loaded ucsee.eecs.berkeley.edu/%7Ebjacob/index.html (size 166).
ucsee.eecs.berkeley.edu/%7Ebjacob/index.html:
merge("http://ucsee.eecs.berkeley.edu/%7Ebjacob/", "b.html") -> http://ucsee.eecs.berkeley.edu/%7Ebjacob/b.html
no-follow in ucsee.eecs.berkeley.edu/%7Ebjacob/index.html: 0
parseurl ("http://ucsee.eecs.berkeley.edu/%7Ebjacob/b.html") -> host ucsee.eecs.berkeley.edu -> opath %7Ebjacob/b.html -> dir ~bjacob -> file b.html -> ndir ~bjacob
newpath: /%7Ebjacob/b.html
Checking for ucsee.eecs.berkeley.edu in host_name_address_map.
Found; ucsee.eecs.berkeley.edu was already used, by that name.
Comparing hosts ucsee.eecs.berkeley.edu and ucsee.eecs.berkeley.edu...
They are quite alike.
parseurl ("http://ucsee.eecs.berkeley.edu/%7Ebjacob/b.html") -> host ucsee.eecs.berkeley.edu -> opath %7Ebjacob/b.html -> dir ~bjacob -> file b.html -> ndir ~bjacob
newpath: /%7Ebjacob/b.html
Loading robots.txt; please ignore errors.
parseurl ("http://ucsee.eecs.berkeley.edu/robots.txt") -> host ucsee.eecs.berkeley.edu -> opath robots.txt -> dir -> file robots.txt -> ndir
newpath: /robots.txt
Checking for ucsee.eecs.berkeley.edu in host_name_address_map.
Found; ucsee.eecs.berkeley.edu was already used, by that name.
--23:02:09-- http://ucsee.eecs.berkeley.edu/robots.txt
=> `ucsee.eecs.berkeley.edu/robots.txt'
Found ucsee.eecs.berkeley.edu in host_name_address_map: 128.32.138.93
Reusing connection to ucsee.eecs.berkeley.edu:80.
Reusing fd 3.
---request begin---
GET /robots.txt HTTP/1.0
User-Agent: Wget/1.7
Host: ucsee.eecs.berkeley.edu
Accept: */*
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response... HTTP/1.1 404 Not Found
Date: Thu, 05 Jul 2001 06:00:30 GMT
Server: Apache/1.3.2 (Unix)
Connection: close
Content-Type: text/html
Closing fd 3
Invalidating fd 3 from further reuse.
23:02:09 ERROR 404: Not Found.
I've decided to load it -> parseurl ("http://ucsee.eecs.berkeley.edu/%7Ebjacob/b.html") -> host ucsee.eecs.berkeley.edu -> opath %7Ebjacob/b.html -> dir ~bjacob -> file b.html -> ndir ~bjacob
newpath: /%7Ebjacob/b.html
Checking for ucsee.eecs.berkeley.edu in host_name_address_map.
Found; ucsee.eecs.berkeley.edu was already used, by that name.
--23:02:09-- http://ucsee.eecs.berkeley.edu/%7Ebjacob/b.html
=> `ucsee.eecs.berkeley.edu/%7Ebjacob/b.html'
Connecting to ucsee.eecs.berkeley.edu:80... Found ucsee.eecs.berkeley.edu in host_name_address_map: 128.32.138.93
Created fd 3.
connected!
---request begin---
GET /%7Ebjacob/b.html HTTP/1.0
User-Agent: Wget/1.7
Host: ucsee.eecs.berkeley.edu
Accept: */*
Connection: Keep-Alive
Referer: http://ucsee.eecs.berkeley.edu/%7Ebjacob/
---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Thu, 05 Jul 2001 06:00:31 GMT
Server: Apache/1.3.2 (Unix)
Last-Modified: Thu, 05 Jul 2001 05:52:10 GMT
ETag: "29818-2-3b44008a"
Accept-Ranges: bytes
Content-Length: 2
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html
Found ucsee.eecs.berkeley.edu in host_name_address_map: 128.32.138.93
Registered fd 3 for persistent reuse.
Length: 2 [text/html]
0K 100% @ 1.95 KB/s
23:02:09 (1.95 KB/s) - `ucsee.eecs.berkeley.edu/%7Ebjacob/b.html' saved [2/2]
Loaded ucsee.eecs.berkeley.edu/%7Ebjacob/b.html (size 2).
no-follow in ucsee.eecs.berkeley.edu/%7Ebjacob/b.html: 0
FINISHED --23:02:09--
Downloaded: 168 bytes in 2 files
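
The sizes in the log allow a consistency check on the test file. The sketch below assumes that index.html is exactly the four lines shown at the top of this report, each ending in a single newline, and solves for the number of dashes in each comment line that gives the reported length of 166 bytes. The helper names are hypothetical.

```python
A_LINK = '<A HREF="a.html">a</a>'
B_LINK = '<A HREF="b.html">b</a>'


def index_size(dashes: int) -> int:
    """Size in bytes of the four-line test file for a given dash count."""
    comment = "<! " + "-" * dashes + " >"
    body = "\n".join([comment, A_LINK, comment, B_LINK]) + "\n"
    return len(body.encode("ascii"))


def dashes_for_size(size: int) -> list[int]:
    """Dash counts (0..199) whose test file has exactly `size` bytes."""
    return [n for n in range(200) if index_size(n) == size]


print(dashes_for_size(166))  # [54]
print(54 % 4)                # 2
print(166 + 2)               # 168, matching "Downloaded: 168 bytes in 2 files"
```

Under that assumption each comment line holds 54 dashes, which is 27 pairs of "--". That is an odd number of pairs, so a parser that toggles comment state on every "--" would still be inside a comment when it reaches the ">" on the first line.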