Here is a bug report and wish list that I compiled today.
The experimental patch mentioned is also attached.
Thanks for maintaining wget, it is a very useful program.
Bernhard
--
Professional Service around Free Software (intevation.net)
The FreeGIS Project (freegis.org)
Association for a Free Informational Infrastructure (ffii.org)
FSF Europe (fsfeurope.org)
wget-1.7 Bugs & Wishlist 20010712 [EMAIL PROTECTED]
The bugs were discovered when I tried to retrieve parts of
http://www.eine-erde-altar.net/ for a static fall-back solution.
Try to retrieve a couple of pages to see the problems.
a) Relative links to files whose names have changed may be written incorrectly
Using the -E flag changes the names of the saved files:
".html" is appended. So when convert_links() is called the first time,
the old links in a file are converted to the new name.
The second convert_all_links() call then rereads the URLs and cannot
match them using hash_table_get (dl_url_file_map, u->url) (recur.c 892),
because only the original link was saved there. As a result, these links
are converted back to COMPLETE references.
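
The failed lookup described in a) can be illustrated with a minimal sketch.
This is not wget code: url_file_entry and lookup_local are hypothetical
stand-ins for dl_url_file_map and hash_table_get, and the example.org URLs
in the note below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for wget's dl_url_file_map: URL -> local file. */
struct url_file_entry {
    const char *url;    /* URL as it was originally requested */
    const char *local;  /* name of the file written to disk */
};

/* Return the local file recorded for url, or NULL if the URL is unknown. */
const char *
lookup_local(const struct url_file_entry *map, size_t n, const char *url)
{
    size_t i;

    assert(map != NULL && url != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(map[i].url, url) == 0)
            return map[i].local;
    return NULL;
}
```

With a single entry mapping "http://example.org/page" to "page.html", a
lookup of the original URL succeeds, but a lookup of
"http://example.org/page.html" (what the second pass would see after the
link has been rewritten) returns NULL, so the link cannot be matched.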
b) Wish: I want to save dynamically generated pages whose URLs might
have attributes. A URL with attributes describes such a page
uniquely if the dynamic page is stateless.
Question marks in the URL then have to be changed, because
Netscape, for example, does not interpret them as part of a filename
in an href.
Experimental patch: wget-1.7_slamquestionmarksinfilenames.patch
does this for html files, if -E is given.
Runs into problems with bug a).
This should be a separate option which also works on non-text/html files.
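
As a standalone sketch of what the patch does to a filename, independent of
the -E handling in http.c: the function name replace_question_marks is
hypothetical and does not exist in wget.

```c
#include <assert.h>
#include <stddef.h>

/* Replace every '?' in name with '_', in place.
   Returns the number of characters that were replaced. */
size_t
replace_question_marks(char *name)
{
    size_t replaced = 0;
    char *p;

    assert(name != NULL);
    for (p = name; *p != '\0'; p++) {
        if (*p == '?') {
            *p = '_';
            replaced++;
        }
    }
    return replaced;
}
```

Because the string is changed in place, the caller has to pass a writable
buffer, not a string literal. A separate option could call such a helper for
every local filename, whatever its content type.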
c) Wish: With dynamic pages and attributes it is not enough to
be able to exclude suffixes or directories; it should be possible
to exclude substrings and regexps.
This might be expensive in terms of performance, but it is necessary.
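
One way such an exclusion test could look, sketched with the POSIX regex
functions (regcomp, regexec, regfree). The function url_is_excluded is
hypothetical, not an existing wget interface, and the URLs in the note below
are made up.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if url matches the extended regular expression pattern,
   0 if it does not, and -1 if the pattern cannot be compiled. */
int
url_is_excluded(const char *url, const char *pattern)
{
    regex_t re;
    int rc;

    assert(url != NULL && pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, url, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, the pattern "[?&]action=edit" would exclude
"http://example.org/view.php?action=edit&id=7" but not
"http://example.org/view.php?id=7". The sketch compiles the pattern on every
call to keep it short; compiling it once and reusing it for all URLs would
reduce the performance cost.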
d) Wish: A provision to additionally load only pages which have not been loaded yet.
Do not get a file again if you already have it on disk, even
when timestamping is not possible. If parts of a
server have already been retrieved, make sure to notice them and convert the
links towards them.
The biggest problem is with a) when the filenames actually change.
Maybe an additional index file holding the contents of the hash-table
dl_url_file_map will have to be saved on disk.
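
A rough sketch of what such an index file could look like, assuming one
"URL<TAB>local file" record per line. The format and the function names
index_write_entry and index_lookup are assumptions for illustration only;
nothing like this exists in wget 1.7.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append one "URL<TAB>local file" record to the index.
   Returns 0 on success, -1 on a write error. */
int
index_write_entry(FILE *fp, const char *url, const char *local)
{
    assert(fp != NULL && url != NULL && local != NULL);
    return fprintf(fp, "%s\t%s\n", url, local) < 0 ? -1 : 0;
}

/* Scan the index from the start for url. On a hit, copy the local
   filename into out (of size outlen) and return 1; otherwise return 0.
   Records longer than the line buffer are not handled by this sketch. */
int
index_lookup(FILE *fp, const char *url, char *out, size_t outlen)
{
    char line[4096];

    assert(fp != NULL && url != NULL && out != NULL && outlen > 0);
    rewind(fp);
    while (fgets(line, sizeof line, fp) != NULL) {
        char *tab = strchr(line, '\t');
        char *local;

        if (tab == NULL)
            continue;                        /* skip malformed lines */
        *tab = '\0';
        if (strcmp(line, url) != 0)
            continue;
        local = tab + 1;
        local[strcspn(local, "\n")] = '\0';  /* strip trailing newline */
        if (strlen(local) >= outlen)
            return 0;                        /* does not fit into out */
        strcpy(out, local);
        return 1;
    }
    return 0;
}
```

On a later run, the index could be read back before retrieval starts, so that
files already on disk are recognized under their changed names and links to
them can still be converted. The linear scan is only meant to show the idea;
loading the records into the hash table once would be the obvious
alternative.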
# Experimental patch: wget-1.7_slamquestionmarksinfilenames.patch
# 12.7.2001 [EMAIL PROTECTED]
# will change ? to '_' in text/html filenames, when -E is given
--- http.c.org Wed Jul 11 21:20:34 2001
+++ http.c Wed Jul 11 21:24:19 2001
@@ -1156,6 +1156,7 @@
already the file's suffix, tack on ".html". */
{
char* last_period_in_local_filename = strrchr(u->local, '.');
+ char* p;
if (last_period_in_local_filename == NULL ||
!(strcasecmp(last_period_in_local_filename, ".htm") == EQ ||
@@ -1165,6 +1166,14 @@
u->local = xrealloc(u->local, local_filename_len + sizeof(".html"));
strcpy(u->local + local_filename_len, ".html");
+
+ /* also replace question marks with underscores */
+ if (u->local)
+ for(p=u->local;*p!='\0';p++) {
+ if (*p=='?') *p='_';
+ }
+
+
*dt |= ADDED_HTML_EXTENSION;
}