Hello!

Here are some ideas for wget:

1. wget to handle compressed files well. Some web sites store their HTML pages
compressed, which breaks automatic recursive downloaders like wget. Browsers
simply decompress the data and display it; wget should handle that too (via
libz.so and maybe other compression libraries).
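A rough sketch of the idea in shell (not wget internals, and the file name is
made up): if the server hands back gzip-compressed pages, a post-pass could
decompress them so the mirrored tree holds plain HTML that recursion and link
conversion can work on.

```shell
# Simulate a page wget saved in compressed form.
printf '<html>hello</html>\n' | gzip > page.html.gz

# Decompress every saved .gz page into its plain-HTML name.
for f in *.html.gz; do
    gzip -dc "$f" > "${f%.gz}"   # portable decompress; keeps the .gz around
done

cat page.html    # → <html>hello</html>
```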

2. wget to remove files that were not fully retrieved. Let's say I'm
mirroring a site and run out of disk space: the last file downloaded will be
half finished. When I free some disk space and continue the mirror
(without downloading everything again, of course), wget will think I
already downloaded that last file, which is wrong, since the file is
truncated (especially harmful if the file is binary). This could be a
command-line switch, so people who do mirroring can use it and know for
certain that every file on disk was fully retrieved.
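An offline sketch of the rule, with the transfer simulated: a file counts as
complete only if its on-disk size matches the length the server reported
(hard-coded here); otherwise it is deleted so the next mirror run fetches it
again instead of skipping it.

```shell
expected=1024                                # size the server reported
dd if=/dev/zero of=partial.bin bs=1 count=512 2>/dev/null  # interrupted transfer

actual=$(wc -c < partial.bin)
if [ "$actual" -ne "$expected" ]; then
    rm -f partial.bin    # incomplete: remove the truncated file
fi

test ! -e partial.bin && echo "incomplete file removed"
```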

3. User and password issues across a site. When a site is protected by a user
and password but its pages contain absolute links, browsers handle that well
while wget doesn't. For instance:
www.site.com/my_private_stuff/index.html is protected by a user and a
password. wget http://user:[EMAIL PROTECTED]/my_private_stuff/index.html
works fine. But if index.html contains a reference like
http://www.site.com/my_private_stuff/more_private_stuff.html, then wget will
not use the user/password to access it and won't be able to download it. A
solution could be a command-line switch telling wget to always send the
user/password to the specified site, even when a link is absolute and carries
no user/password of its own.
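For HTTP Basic authentication this switch would be cheap to implement: the
credential is just a fixed header, so wget could cache it per host and attach
it to every request to that host. The user/password below are hypothetical.

```shell
# The Basic auth header is "Authorization: Basic base64(user:password)".
cred=$(printf 'user:secret' | base64)
echo "Authorization: Basic $cred"   # → Authorization: Basic dXNlcjpzZWNyZXQ=
```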

4. wget to create stub files for dead resources on a replicated host. It is no
secret that many sites have dead links to areas of the site. When mirroring a
large site I don't want to ask the remote site again for links it has already
told me it doesn't have. Wget could leave a stub file on disk: a small file
whose only purpose is to tell wget not to attempt to fetch that resource in
the future. This behaviour, or the lack thereof, could be controlled with a
command-line switch (very useful for mirroring large sites). That way,
continued replication of hosts could be a lot faster.
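A sketch of the stub mechanism, with made-up names: when a URL comes back 404,
drop a tiny marker file where the page would have lived; a later mirror run
checks for the marker and skips the URL without asking the server again.

```shell
path="www.site.com/gone/page.html"     # hypothetical dead resource
mkdir -p "$(dirname "$path")"
touch "${path}.404-stub"               # first run: record the dead link

# later run:
if [ -e "${path}.404-stub" ]; then
    echo "skipping known-dead $path"   # no request goes out
fi
```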

Please CC me on replies since I'm not subscribed to the list.

Cheers, and thanks for a great tool.
        Mark

-- 
Name: Mark Veltzer
Title: Research and Development, Meta Ltd.
Address: Habikaa 17/3, Kiriat-Sharet, city.holon, Gush-Dan, country.israel 
58495
Phone: +972-03-5581310
Fax: +972-03-5581310
Email: mailto:[EMAIL PROTECTED]
Homepage: http://www.veltzer.org
OpenSource: CPAN, user: VELTZER, mailto:[EMAIL PROTECTED], url: 
http://search.cpan.org/author/VELTZER/
Public key: http://www.veltzer.org/ascx/public_key.asc, wwwkeys.pgp.net, 
0xC71E5D38
