Here is a bug report and wish list that I compiled today.
The experimental patch mentioned is also attached.
Thanks for maintaining wget, it is a very useful program.

        Bernhard

-- 
Professional Service around Free Software                (intevation.net)  
The FreeGIS Project                                         (freegis.org)
Association for a Free Informational Infrastructure            (ffii.org)
FSF Europe                                                (fsfeurope.org)
wget-1.7 Bugs & Wishlist 20010712 [EMAIL PROTECTED]

The bugs were discovered when I tried to retrieve parts of
http://www.eine-erde-altar.net/ for a static fall-back solution.
Try to retrieve a couple of pages to see the problems.

a)  Relative links with changed filenames might get written wrong

Using the -E flag changes the names of the files: ".html" is appended.
So when convert_links() is called the first time, the old links in a
file are converted to the new name. The second convert_all_links() call
then rereads the URLs and cannot match them using
hash_table_get (dl_url_file_map, u->url) (recur.c 892), because only the
original link was saved there. As a result, these links are converted
back to COMPLETE references.


b) Wish: Save dynamically generated pages which might have
        attributes. A URL with attributes describes such a page
        uniquely if the single dynamic page is stateless.

        Question marks in the URL then have to be changed, because
        Netscape, for example, will not interpret them as part of a
        filename in an href.

        Experimental patch: wget-1.7_slamquestionmarksinfilenames.patch
        does this for HTML files if -E is given.
        It runs into problems with bug a).
        This should be a separate option which also works on non-text/html files.
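
As a rough illustration of what such a separate option could call for any
downloaded file, regardless of content type, here is a minimal standalone
sketch. The helper name slam_question_marks() is hypothetical and not part
of wget; it does the same replacement as the experimental patch.

```c
#include <stddef.h>

/* Replace every '?' in a local file name with '_', in place.
   Returns the number of characters changed.
   Hypothetical helper for illustration; not a wget function. */
size_t
slam_question_marks (char *name)
{
  size_t changed = 0;
  char *p;

  if (name == NULL)
    return 0;
  for (p = name; *p != '\0'; p++)
    if (*p == '?')
      {
        *p = '_';
        changed++;
      }
  return changed;
}
```

Because the name changes on disk, the renamed file would still have to be
registered for link conversion, which is where bug a) comes in.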


c) Wish: With dynamic pages and attributes it is not enough to
        be able to exclude suffixes or directories; it should also be
        possible to exclude substrings and regexps.
        This might be expensive in terms of performance, but it is necessary.
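
To make the wish more concrete, the two kinds of test could look roughly
like the sketch below. It assumes POSIX <regex.h> is available; the function
names url_has_substring() and url_matches_regexp() are made up for this
example and do not exist in wget.

```c
#include <regex.h>
#include <string.h>

/* Return 1 if url contains the given substring, 0 otherwise. */
int
url_has_substring (const char *url, const char *needle)
{
  return strstr (url, needle) != NULL;
}

/* Return 1 if url matches the POSIX extended regexp pattern,
   0 if it does not, -1 if the pattern fails to compile. */
int
url_matches_regexp (const char *url, const char *pattern)
{
  regex_t re;
  int rc;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  rc = regexec (&re, url, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

The sketch compiles the pattern on every call to stay short. To keep the
performance cost down, a real implementation would presumably compile each
exclusion pattern once at startup and reuse it for every URL.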

d) Wish: Provision to additionally load pages which are not loaded yet.

        Do not get a file again if you already have it on disk, even
        when timestamping is not possible. If parts of a server have
        already been retrieved, make sure to notice them and convert the
        links towards them.

        The biggest problem is with a) when the filenames actually change.
        Maybe an additional index file holding the contents of the hash-table
        dl_url_file_map will have to be saved on disk.
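
One possible shape for such an index file is one "URL<TAB>local-file" line
per download. The sketch below is only an illustration under that assumption.
The names index_append() and index_lookup() are hypothetical, and it assumes
that URLs and file names contain no tab or newline characters and that each
line fits in 4096 bytes.

```c
#include <stdio.h>
#include <string.h>

/* Append one "URL<TAB>local-file" record to the index file.
   Returns 0 on success, -1 on error. */
int
index_append (const char *index_path, const char *url, const char *local)
{
  FILE *fp = fopen (index_path, "a");
  int ok;

  if (fp == NULL)
    return -1;
  ok = fprintf (fp, "%s\t%s\n", url, local) > 0;
  if (fclose (fp) != 0)
    ok = 0;
  return ok ? 0 : -1;
}

/* Look up url in the index file.  On a hit, copy the local file name
   into buf (at most buflen bytes including the NUL) and return 1.
   Return 0 if the URL is not listed, -1 if the file cannot be read. */
int
index_lookup (const char *index_path, const char *url,
              char *buf, size_t buflen)
{
  char line[4096];
  size_t url_len = strlen (url);
  int found = 0;
  FILE *fp = fopen (index_path, "r");

  if (fp == NULL)
    return -1;
  while (fgets (line, sizeof line, fp) != NULL)
    {
      line[strcspn (line, "\n")] = '\0';   /* strip the newline */
      if (strncmp (line, url, url_len) == 0 && line[url_len] == '\t')
        {
          snprintf (buf, buflen, "%s", line + url_len + 1);
          found = 1;   /* keep scanning: the last record wins */
        }
    }
  fclose (fp);
  return found;
}
```

On a later run the records could be read back into the hash table before any
links are converted, so that files retrieved earlier are noticed even when
their names were changed.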

# Experimental patch: wget-1.7_slamquestionmarksinfilenames.patch
# 12.7.2001     [EMAIL PROTECTED]
# will change ? to '_' in text/html filenames, when -E is given
--- http.c.org  Wed Jul 11 21:20:34 2001
+++ http.c      Wed Jul 11 21:24:19 2001
@@ -1156,6 +1156,7 @@
        already the file's suffix, tack on ".html". */
     {
       char*  last_period_in_local_filename = strrchr(u->local, '.');
+      char* p;
 
       if (last_period_in_local_filename == NULL ||
          !(strcasecmp(last_period_in_local_filename, ".htm") == EQ ||
@@ -1165,6 +1166,14 @@
          
          u->local = xrealloc(u->local, local_filename_len + sizeof(".html"));
          strcpy(u->local + local_filename_len, ".html");
+
+         /* also replace question marks with underscores */
+         if (u->local)
+                 for(p=u->local;*p!='\0';p++) {
+                         if (*p=='?') *p='_';
+                 }
+
+
 
          *dt |= ADDED_HTML_EXTENSION;
        }
