More flexible URL file name generation

Hrvoje Niksic Sun, 14 Sep 2003 15:02:16 -0700

This patch makes URL file name generation a bit more flexible and,
hopefully, better for the end-user.  It does two things:


* Decouples file name quoting from URL quoting.  The conflation of the
  two has been an endless source of annoyance for users.  For example,
  space *has* to be quoted in URLs, but you don't really want to quote
  it in file names.

* Gives the user more control over the quoting mechanism.  There are
  now several quoting levels:

    --restrict-file-names=none  - no restriction, only quote "/" and "\0"

    --restrict-file-names=unix  - quote the above, plus chars in the
                                  0-31 and in the 128-159 range, which
                                  are not printable in the shell.

    --restrict-file-names=windows - quote the above, plus chars
                                    disallowed on Windows: \, |, <, >,
                                    ?, :, *, and ".

  The default "windows" under Windows and Cygwin and "unix" elsewhere.

This patch should supersede the various patches that have been
floating around that fix the problem in a limited fashion.  Please
test this patch and let me know if it works for you, and if something
else is needed.


2003-09-14  Hrvoje Niksic  <[EMAIL PROTECTED]>

        * url.c (append_uri_pathel): Use opt.restrict_file_names when
        calling file_unsafe_char.

        * init.c: New command restrict_file_names.

        * main.c (main): New option --restrict-file-names[=windows,unix].

        * url.c (url_file_name): Renamed from url_filename.
        (url_file_name): Add directory and hostdir prefix here, not in
        mkstruct.
        (append_dir_structure): New function, does part of the work that
        used to be in mkstruct.  Iterates over path elements in u->path,
        calling append_uri_pathel on each one to append it to the file
        name.
        (append_uri_pathel): URL-unescape a path element and reencode it
        with a different set of rules, more appropriate for handling of
        files.
        (file_unsafe_char): New function, uses a lookup table to decide
        whether a character should be escaped for use in file name.
        (append_string): New utility function.
        (append_char): Ditto.
        (file_unsafe_char): New argument restrict_for_windows, decide
        whether Windows file names should be escaped in run-time.

        * connect.c: Include <stdlib.h> to get prototype for abort().

Index: NEWS
===================================================================
RCS file: /pack/anoncvs/wget/NEWS,v
retrieving revision 1.38
diff -u -r1.38 NEWS
--- NEWS        2003/09/10 20:21:13     1.38
+++ NEWS        2003/09/14 21:45:48
@@ -7,8 +7,6 @@
 
 * Changes in Wget 1.9.
 
-** The build process now requires Autoconf 2.5x.
-
 ** It is now possible to specify that POST method be used for HTTP
 requests.  For example, `wget --post-data="id=foo&data=bar" URL' will
 send a POST request with the specified contents.
@@ -32,6 +30,15 @@
 
 ** The new option `--dns-cache=off' may be used to prevent Wget from
 caching DNS lookups.
+
+** The build process now requires Autoconf 2.5x.
+
+** Wget no longer quotes characters in local file names that would be
+considered "unsafe" as part of URL.  Quoting can still occur for
+control characters or for '/', but no longer for frequent characters
+such as space.  You can use the new option --restrict-file-names to
+enforce even stricter rules, which is useful when downloading to
+Windows partitions.
 
 * Wget 1.8.1 is a bugfix release with no user-visible changes.
 
Index: doc/wget.texi
===================================================================
RCS file: /pack/anoncvs/wget/doc/wget.texi,v
retrieving revision 1.68
diff -u -r1.68 wget.texi
--- doc/wget.texi       2003/09/10 19:41:50     1.68
+++ doc/wget.texi       2003/09/14 21:46:10
@@ -800,6 +800,39 @@
 
 If you don't understand the above description, you probably won't need
 this option.
+
[EMAIL PROTECTED] file names, restrict
[EMAIL PROTECTED] Windows file names
[EMAIL PROTECTED] --restrict-file-names=none|unix|windows
+Restrict characters that may occur in local file names created by Wget
+from remote URLs.  Characters that are considered @dfn{unsafe} under a
+set of restrictions are escaped, i.e. replaced with @samp{%XX}, where
[EMAIL PROTECTED] is the hexadecimal code of the character.
+
+The default for this option depends on the operating system: on Unix and
+Unix-like OS'es, it defaults to ``unix''.  Under Windows and Cygwin, it
+defaults to ``windows''.  Changing the default is useful when you are
+using a non-native partition, e.g. when downloading files to a Windows
+partition mounted from Linux, or when using NFS-mounted or SMB-mounted
+Windows drives.
+
+When set to ``none'', the only characters that are quoted are those that
+are impossible to get into a file name---the NUL character and @samp{/}.
+The control characters, newline, etc. are all placed into file names.
+
+When set to ``unix'', additional unsafe characters are those in the
+0--31 range and in the 128--159 range.  This is because those characters
+are typically not printable.
+
+When set to ``windows'', all of the above are quoted, along with
[EMAIL PROTECTED], @samp{|}, @samp{:}, @samp{?}, @samp{"}, @samp{*}, @samp{<},
+and @samp{>}.  Additionally, Wget in Windows mode uses @samp{+} instead
+of @samp{:} to separate host and port in local file names, and uses
[EMAIL PROTECTED]@@} instead of @samp{?} to separate the query portion of the file
+name from the rest.  Therefore, a URL that would be saved as
[EMAIL PROTECTED]:4300/search.pl?input=blah} in Unix mode would be
+saved as @samp{www.xemacs.org+4300/search.pl@@input=blah} in Windows
+mode.
 @end table
 
 @node Directory Options, HTTP Options, Download Options, Invoking
@@ -2240,6 +2273,10 @@
 @item remove_listing = on/off
 If set to on, remove @sc{ftp} listings downloaded by Wget.  Setting it
 to off is the same as @samp{-nr}.
+
[EMAIL PROTECTED] restrict_file_names = off/unix/windows
+Restrict the file names generated by Wget from URLs.  See
[EMAIL PROTECTED] for a more detailed description.
 
 @item retr_symlinks = on/off
 When set to on, retrieve symbolic links as if they were plain files; the
Index: src/connect.c
===================================================================
RCS file: /pack/anoncvs/wget/src/connect.c,v
retrieving revision 1.19
diff -u -r1.19 connect.c
--- src/connect.c       2003/09/09 19:30:45     1.19
+++ src/connect.c       2003/09/14 21:46:34
@@ -30,6 +30,7 @@
 #include <config.h>
 
 #include <stdio.h>
+#include <stdlib.h>
 #include <sys/types.h>
 #ifdef HAVE_UNISTD_H
 # include <unistd.h>
Index: src/ftp-ls.c
===================================================================
RCS file: /pack/anoncvs/wget/src/ftp-ls.c,v
retrieving revision 1.23
diff -u -r1.23 ftp-ls.c
--- src/ftp-ls.c        2002/05/18 02:16:21     1.23
+++ src/ftp-ls.c        2003/09/14 21:46:37
@@ -842,8 +842,8 @@
     {
       char *tmpu, *tmpp;        /* temporary, clean user and passwd */
 
-      tmpu = encode_string (u->user);
-      tmpp = u->passwd ? encode_string (u->passwd) : NULL;
+      tmpu = url_escape (u->user);
+      tmpp = u->passwd ? url_escape (u->passwd) : NULL;
       upwd = (char *)xmalloc (strlen (tmpu)
                             + (tmpp ? (1 + strlen (tmpp)) : 0) + 2);
       sprintf (upwd, "%s%s%s@", tmpu, tmpp ? ":" : "", tmpp ? tmpp : "");
@@ -863,7 +863,8 @@
       fprintf (fp, "  ");
       if (f->tstamp != -1)
        {
-         /* #### Should we translate the months? */
+         /* #### Should we translate the months?  Or, even better, use
+            ISO 8601 dates?  */
          static char *months[] = {
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
Index: src/ftp.c
===================================================================
RCS file: /pack/anoncvs/wget/src/ftp.c,v
retrieving revision 1.62
diff -u -r1.62 ftp.c
--- src/ftp.c   2003/09/04 21:34:58     1.62
+++ src/ftp.c   2003/09/14 21:46:42
@@ -1025,7 +1025,7 @@
   struct stat st;
 
   if (!con->target)
-    con->target = url_filename (u);
+    con->target = url_file_name (u);
 
   if (opt.noclobber && file_exists_p (con->target))
     {
@@ -1245,7 +1245,7 @@
   /* Find the listing file name.  We do it by taking the file name of
      the URL and replacing the last component with the listing file
      name.  */
-  uf = url_filename (u);
+  uf = url_file_name (u);
   lf = file_merge (uf, LIST_FILENAME);
   xfree (uf);
   DEBUGP ((_("Using `%s' as listing tmp file.\n"), lf));
@@ -1335,7 +1335,7 @@
       ofile = xstrdup (u->file);
       url_set_file (u, f->name);
 
-      con->target = url_filename (u);
+      con->target = url_file_name (u);
       err = RETROK;
 
       dlthis = 1;
@@ -1723,7 +1723,7 @@
              char *filename = (opt.output_document
                                ? xstrdup (opt.output_document)
                                : (con.target ? xstrdup (con.target)
-                                  : url_filename (u)));
+                                  : url_file_name (u)));
              res = ftp_index (filename, u, f);
              if (res == FTPOK && opt.verbose)
                {
Index: src/http.c
===================================================================
RCS file: /pack/anoncvs/wget/src/http.c,v
retrieving revision 1.95
diff -u -r1.95 http.c
--- src/http.c  2003/09/04 21:34:58     1.95
+++ src/http.c  2003/09/14 21:46:52
@@ -1614,12 +1614,12 @@
     hstat.local_file = local_file;
   else if (local_file)
     {
-      *local_file = url_filename (u);
+      *local_file = url_file_name (u);
       hstat.local_file = local_file;
     }
   else
     {
-      dummy = url_filename (u);
+      dummy = url_file_name (u);
       hstat.local_file = &dummy;
     }
 
Index: src/init.c
===================================================================
RCS file: /pack/anoncvs/wget/src/init.c,v
retrieving revision 1.57
diff -u -r1.57 init.c
--- src/init.c  2003/09/10 20:21:21     1.57
+++ src/init.c  2003/09/14 21:46:55
@@ -100,6 +100,7 @@
 CMD_DECLARE (cmd_spec_mirror);
 CMD_DECLARE (cmd_spec_progress);
 CMD_DECLARE (cmd_spec_recursive);
+CMD_DECLARE (cmd_spec_restrict_file_names);
 CMD_DECLARE (cmd_spec_useragent);
 
 /* List of recognized commands, each consisting of name, closure and function.
@@ -188,6 +189,7 @@
   { "reject",          &opt.rejects,           cmd_vector },
   { "relativeonly",    &opt.relative_only,     cmd_boolean },
   { "removelisting",   &opt.remove_listing,    cmd_boolean },
+  { "restrictfilenames", &opt.restrict_file_names, cmd_spec_restrict_file_names },
   { "retrsymlinks",    &opt.retr_symlinks,     cmd_boolean },
   { "retryconnrefused",        &opt.retry_connrefused, cmd_boolean },
   { "robots",          &opt.use_robots,        cmd_boolean },
@@ -281,6 +283,13 @@
   opt.dots_in_line = 50;
 
   opt.dns_cache = 1;
+
+  /* The default for file name restriction defaults to the OS type. */
+#if !defined(WINDOWS) && !defined(__CYGWIN__)
+  opt.restrict_file_names = restrict_shell;
+#else
+  opt.restrict_file_names = restrict_windows;
+#endif
 }
 
 /* Return the user's home directory (strdup-ed), or NULL if none is
@@ -1004,6 +1013,26 @@
     {
       if (opt.recursive && !opt.no_dirstruct)
        opt.dirstruct = 1;
+    }
+  return 1;
+}
+
+static int
+cmd_spec_restrict_file_names (const char *com, const char *val, void *closure)
+{
+  /* The currently accepted values are `none', `unix', and
+     `windows'.  */
+  if (0 == strcasecmp (val, "none"))
+    opt.restrict_file_names = restrict_none;
+  else if (0 == strcasecmp (val, "unix"))
+    opt.restrict_file_names = restrict_shell;
+  else if (0 == strcasecmp (val, "windows"))
+    opt.restrict_file_names = restrict_windows;
+  else
+    {
+      fprintf (stderr, _("%s: %s: Invalid specification `%s'.\n"),
+              exec_name, com, val);
+      return 0;
     }
   return 1;
 }
Index: src/main.c
===================================================================
RCS file: /pack/anoncvs/wget/src/main.c,v
retrieving revision 1.76
diff -u -r1.76 main.c
--- src/main.c  2003/09/10 19:41:54     1.76
+++ src/main.c  2003/09/14 21:46:58
@@ -179,10 +179,11 @@
        --bind-address=ADDRESS   bind to ADDRESS (hostname or IP) on local host.\n\
        --limit-rate=RATE        limit download rate to RATE.\n\
        --dns-cache=off          disable caching DNS lookups.\n\
+       --restrict-file-names=MODE restrict chars in file names to MODE.\n\
 \n"), stdout);
   fputs (_("\
 Directories:\n\
-  -nd  --no-directories            don\'t create directories.\n\
+  -nd, --no-directories            don\'t create directories.\n\
   -x,  --force-directories         force creation of directories.\n\
   -nH, --no-host-directories       don\'t create host directories.\n\
   -P,  --directory-prefix=PREFIX   save files to PREFIX/...\n\
@@ -344,6 +345,7 @@
     { "proxy-user", required_argument, NULL, 143 },
     { "quota", required_argument, NULL, 'Q' },
     { "reject", required_argument, NULL, 'R' },
+    { "restrict-file-names", required_argument, NULL, 176 },
     { "save-cookies", required_argument, NULL, 162 },
     { "timeout", required_argument, NULL, 'T' },
     { "tries", required_argument, NULL, 't' },
@@ -609,6 +611,9 @@
          break;
        case 175:
          setval ("dnscache", optarg);
+         break;
+       case 176:
+         setval ("restrictfilenames", optarg);
          break;
        case 'A':
          setval ("accept", optarg);
Index: src/options.h
===================================================================
RCS file: /pack/anoncvs/wget/src/options.h,v
retrieving revision 1.32
diff -u -r1.32 options.h
--- src/options.h       2003/09/10 19:41:54     1.32
+++ src/options.h       2003/09/14 21:46:59
@@ -184,6 +184,12 @@
 
   char *post_data;             /* POST query string */
   char *post_file_name;                /* File to post */
+
+  enum {
+    restrict_none,
+    restrict_shell,
+    restrict_windows
+  } restrict_file_names;       /* whether we restrict file name chars. */
 };
 
 extern struct options opt;
Index: src/url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/url.c,v
retrieving revision 1.79
diff -u -r1.79 url.c
--- src/url.c   2003/09/09 19:30:45     1.79
+++ src/url.c   2003/09/14 21:47:08
@@ -1,5 +1,6 @@
 /* URL handling.
-   Copyright (C) 1995, 1996, 1997, 2000, 2001 Free Software Foundation, Inc.
+   Copyright (C) 1995, 1996, 1997, 2000, 2001, 2003, 2003
+   Free Software Foundation, Inc.
 
 This file is part of GNU Wget.
 
@@ -95,24 +96,22 @@
    code assumes ASCII character set and 8-bit chars.  */
 
 enum {
+  /* rfc1738 reserved chars, preserved from encoding.  */
   urlchr_reserved = 1,
+
+  /* rfc1738 unsafe chars, plus some more.  */
   urlchr_unsafe   = 2
 };
 
+#define urlchr_test(c, mask) (urlchr_table[(unsigned char)(c)] & (mask))
+#define URL_RESERVED_CHAR(c) urlchr_test(c, urlchr_reserved)
+#define URL_UNSAFE_CHAR(c) urlchr_test(c, urlchr_unsafe)
+
+/* Shorthands for the table: */
 #define R  urlchr_reserved
 #define U  urlchr_unsafe
 #define RU R|U
 
-#define urlchr_test(c, mask) (urlchr_table[(unsigned char)(c)] & (mask))
-
-/* rfc1738 reserved chars, preserved from encoding.  */
-
-#define RESERVED_CHAR(c) urlchr_test(c, urlchr_reserved)
-
-/* rfc1738 unsafe chars, plus some more.  */
-
-#define UNSAFE_CHAR(c) urlchr_test(c, urlchr_unsafe)
-
 const static unsigned char urlchr_table[256] =
 {
   U,  U,  U,  U,   U,  U,  U,  U,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
@@ -142,6 +141,9 @@
   U, U, U, U,  U, U, U, U,  U, U, U, U,  U, U, U, U,
   U, U, U, U,  U, U, U, U,  U, U, U, U,  U, U, U, U,
 };
+#undef R
+#undef U
+#undef RU
 
 /* Decodes the forms %xy in a URL to the character the hexadecimal
    code of which is xy.  xy are hexadecimal digits from
@@ -150,7 +152,7 @@
    literally.  */
 
 static void
-decode_string (char *s)
+url_unescape (char *s)
 {
   char *t = s;                 /* t - tortoise */
   char *h = s;                 /* h - hare     */
@@ -175,10 +177,10 @@
   *t = '\0';
 }
 
-/* Like encode_string, but return S if there are no unsafe chars.  */
+/* Like url_escape, but return S if there are no unsafe chars.  */
 
 static char *
-encode_string_maybe (const char *s)
+url_escape_allow_passthrough (const char *s)
 {
   const char *p1;
   char *p2, *newstr;
@@ -186,7 +188,7 @@
   int addition = 0;
 
   for (p1 = s; *p1; p1++)
-    if (UNSAFE_CHAR (*p1))
+    if (URL_UNSAFE_CHAR (*p1))
       addition += 2;           /* Two more characters (hex digits) */
 
   if (!addition)
@@ -199,7 +201,7 @@
   p2 = newstr;
   while (*p1)
     {
-      if (UNSAFE_CHAR (*p1))
+      if (URL_UNSAFE_CHAR (*p1))
        {
          unsigned char c = *p1++;
          *p2++ = '%';
@@ -215,13 +217,13 @@
   return newstr;
 }
 
-/* Encode the unsafe characters (as determined by UNSAFE_CHAR) in a
+/* Encode the unsafe characters (as determined by URL_UNSAFE_CHAR) in a
    given string, returning a malloc-ed %XX encoded string.  */
   
 char *
-encode_string (const char *s)
+url_escape (const char *s)
 {
-  char *encoded = encode_string_maybe (s);
+  char *encoded = url_escape_allow_passthrough (s);
   if (encoded != s)
     return encoded;
   else
@@ -232,13 +234,13 @@
    the old value of PTR is freed and PTR is made to point to the newly
    allocated storage.  */
 
-#define ENCODE(ptr) do {                       \
-  char *e_new = encode_string_maybe (ptr);     \
-  if (e_new != ptr)                            \
-    {                                          \
-      xfree (ptr);                             \
-      ptr = e_new;                             \
-    }                                          \
+#define ENCODE(ptr) do {                               \
+  char *e_new = url_escape_allow_passthrough (ptr);    \
+  if (e_new != ptr)                                    \
+    {                                                  \
+      xfree (ptr);                                     \
+      ptr = e_new;                                     \
+    }                                                  \
 } while (0)
 
 enum copy_method { CM_DECODE, CM_ENCODE, CM_PASSTHROUGH };
@@ -258,7 +260,7 @@
          char preempt = (XCHAR_TO_XDIGIT (*(p + 1)) << 4) +
            XCHAR_TO_XDIGIT (*(p + 2));
 
-         if (UNSAFE_CHAR (preempt) || RESERVED_CHAR (preempt))
+         if (URL_UNSAFE_CHAR (preempt) || URL_RESERVED_CHAR (preempt))
            return CM_PASSTHROUGH;
          else
            return CM_DECODE;
@@ -267,20 +269,20 @@
        /* Garbled %.. sequence: encode `%'. */
        return CM_ENCODE;
     }
-  else if (UNSAFE_CHAR (*p) && !RESERVED_CHAR (*p))
+  else if (URL_UNSAFE_CHAR (*p) && !URL_RESERVED_CHAR (*p))
     return CM_ENCODE;
   else
     return CM_PASSTHROUGH;
 }
 
-/* Translate a %-quoting (but possibly non-conformant) input string S
-   into a %-quoting (and conformant) output string.  If no characters
+/* Translate a %-escaped (but possibly non-conformant) input string S
+   into a %-escaped (and conformant) output string.  If no characters
    are encoded or decoded, return the same string S; otherwise, return
    a freshly allocated string with the new contents.
 
    After a URL has been run through this function, the protocols that
    use `%' as the quote character can use the resulting string as-is,
-   while those that don't call decode_string() to get to the intended
+   while those that don't call url_unescape() to get to the intended
    data.  This function is also stable: after an input string is
    transformed the first time, all further transformations of the
    result yield the same result string.
@@ -293,20 +295,21 @@
 
        GET /abc%20def HTTP/1.0
 
-   So it appears that the unsafe chars need to be quoted, as with
-   encode_string.  But what if we're requested to download
-   `abc%20def'?  Remember that %-encoding is valid URL syntax, so what
-   the user meant was a literal space, and he was kind enough to quote
-   it.  In that case, Wget should obviously leave the `%20' as is, and
-   send the same request as above.  So in this case we may not call
-   encode_string.
-
-   But what if the requested URI is `abc%20 def'?  If we call
-   encode_string, we end up with `/abc%2520%20def', which is almost
-   certainly not intended.  If we don't call encode_string, we are
-   left with the embedded space and cannot send the request.  What the
+   It appears that the unsafe chars need to be quoted, for example
+   with url_escape.  But what if we're requested to download
+   `abc%20def'?  url_escape transforms "%" to "%25", which would leave
+   us with `abc%2520def'.  This is incorrect -- since %-escapes are
+   part of URL syntax, "%20" is the correct way to denote a literal
+   space on the Wget command line.  This leaves us in the conclusion
+   that in that case Wget should not call url_escape, but leave the
+   `%20' as is.
+
+   And what if the requested URI is `abc%20 def'?  If we call
+   url_escape, we end up with `/abc%2520%20def', which is almost
+   certainly not intended.  If we don't call url_escape, we are left
+   with the embedded space and cannot complete the request.  What the
    user meant was for Wget to request `/abc%20%20def', and this is
-   where reencode_string kicks in.
+   where reencode_escapes kicks in.
 
    Wget used to solve this by first decoding %-quotes, and then
    encoding all the "unsafe" characters found in the resulting string.
@@ -317,7 +320,7 @@
    is inevitable because by the second step we would lose information
    on whether the `+' was originally encoded or not.  Both results
    were wrong because in CGI parameters + means space, while %2B means
-   literal plus.  reencode_string correctly translates the above to
+   literal plus.  reencode_escapes correctly translates the above to
    "a%2B+b", i.e. returns the original string.
 
    This function uses an algorithm proposed by Anon Sricharoenchai:
@@ -352,7 +355,7 @@
    "foo%2b+bar"      -> "foo%2b+bar"  */
 
 static char *
-reencode_string (const char *s)
+reencode_escapes (const char *s)
 {
   const char *p1;
   char *newstr, *p2;
@@ -417,12 +420,12 @@
   return newstr;
 }
 
-/* Run PTR_VAR through reencode_string.  If a new string is consed,
+/* Run PTR_VAR through reencode_escapes.  If a new string is consed,
    free PTR_VAR and make it point to the new storage.  Obviously,
    PTR_VAR needs to be an lvalue.  */
 
 #define REENCODE(ptr_var) do {                 \
-  char *rf_new = reencode_string (ptr_var);    \
+  char *rf_new = reencode_escapes (ptr_var);   \
   if (rf_new != ptr_var)                       \
     {                                          \
       xfree (ptr_var);                         \
@@ -544,9 +547,9 @@
   (*user)[len] = '\0';
 
   if (*user)
-    decode_string (*user);
+    url_unescape (*user);
   if (*passwd)
-    decode_string (*passwd);
+    url_unescape (*passwd);
 
   return 1;
 }
@@ -611,6 +614,10 @@
 
 static void parse_path PARAMS ((const char *, char **, char **));
 
+/* Like strpbrk, with the exception that it returns the pointer to the
+   terminating zero (end-of-string aka "eos") if no matching character
+   is found.  */
+
 static char *
 strpbrk_or_eos (const char *s, const char *accept)
 {
@@ -825,7 +832,7 @@
       return NULL;
     }
 
-  url_encoded = reencode_string (url);
+  url_encoded = reencode_escapes (url);
   p = url_encoded;
 
   p += strlen (supported_schemes[scheme].leading_string);
@@ -1016,9 +1023,9 @@
   else
     {
       if (url_encoded == url)
-       u->url    = xstrdup (url);
+       u->url = xstrdup (url);
       else
-       u->url    = url_encoded;
+       u->url = url_encoded;
     }
   url_encoded = NULL;
 
@@ -1032,14 +1039,14 @@
   return parse_errors[error_code];
 }
 
+/* Parse PATH into dir and file.  PATH is extracted from the URL and
+   is URL-escaped.  The function returns unescaped DIR and FILE.  */
+
 static void
-parse_path (const char *quoted_path, char **dir, char **file)
+parse_path (const char *path, char **dir, char **file)
 {
-  char *path, *last_slash;
+  char *last_slash;
 
-  STRDUP_ALLOCA (path, quoted_path);
-  decode_string (path);
-
   last_slash = strrchr (path, '/');
   if (!last_slash)
     {
@@ -1051,6 +1058,8 @@
       *dir = strdupdelim (path, last_slash);
       *file = xstrdup (last_slash + 1);
     }
+  url_unescape (*dir);
+  url_unescape (*file);
 }
 
 /* Note: URL's "full path" is the path with the query string and
@@ -1303,8 +1312,6 @@
     {
       sprintf (from, "%s.%d", fname, i - 1);
       sprintf (to, "%s.%d", fname, i);
-      /* #### This will fail on machines without the rename() system
-         call.  */
       rename (from, to);
     }
 
@@ -1323,11 +1330,14 @@
   int res;
 
   p = path + strlen (path);
-  for (; *p != '/' && p != path; p--);
+  for (; *p != '/' && p != path; p--)
+    ;
+
   /* Don't create if it's just a file.  */
   if ((p == path) && (*p != '/'))
     return 0;
   t = strdupdelim (path, p);
+
   /* Check whether the directory exists.  */
   if ((stat (t, &st) == 0))
     {
@@ -1360,194 +1370,302 @@
   xfree (t);
   return res;
 }
+
+/* Functions for constructing the file name out of URL components.  */
 
-static int
-count_slashes (const char *s)
+/* A growable string structure, used by url_file_name and friends.
+   This should perhaps be moved to utils.c.
+
+   The idea is to have an easy way to construct a string by having
+   various functions append data to it.  Instead of passing the
+   obligatory BASEVAR, SIZEVAR and TAILPOS to all the functions in
+   questions, we pass the pointer to this struct.  */
+
+struct growable {
+  char *base;
+  int size;
+  int tail;
+};
+
+/* Ensure that the string can accept APPEND_COUNT more characters past
+   the current TAIL position.  If necessary, this will grow the string
+   and update its allocated size.  If the string is already large
+   enough to take TAIL+APPEND_COUNT characters, this does nothing.  */
+#define GROW(g, append_size) do {                                      \
+  struct growable *G_ = g;                                             \
+  DO_REALLOC (G_->base, G_->size, G_->tail + append_size, char);       \
+} while (0)
+
+/* Return the tail position of the string. */
+#define TAIL(r) ((r)->base + (r)->tail)
+
+/* Move the tail position by APPEND_COUNT characters. */
+#define TAIL_INCR(r, append_count) ((r)->tail += append_count)
+
+/* Append the string STR to DEST.  NOTICE: the string in DEST is not
+   terminated.  */
+
+static void
+append_string (const char *str, struct growable *dest)
 {
-  int i = 0;
-  while (*s)
-    if (*s++ == '/')
-      ++i;
-  return i;
+  int l = strlen (str);
+  GROW (dest, l);
+  memcpy (TAIL (dest), str, l);
+  TAIL_INCR (dest, l);
 }
 
-/* Return the path name of the URL-equivalent file name, with a
-   remote-like structure of directories.  */
-static char *
-mkstruct (const struct url *u)
+/* Append CH to DEST.  For example, append_char (0, DEST)
+   zero-terminates DEST.  */
+
+static void
+append_char (char ch, struct growable *dest)
 {
-  char *dir, *file;
-  char *res, *dirpref;
-  int l;
-
-  if (opt.cut_dirs)
-    {
-      char *ptr = u->dir + (*u->dir == '/');
-      int slash_count = 1 + count_slashes (ptr);
-      int cut = MINVAL (opt.cut_dirs, slash_count);
-      for (; cut && *ptr; ptr++)
-       if (*ptr == '/')
-         --cut;
-      STRDUP_ALLOCA (dir, ptr);
-    }
-  else
-    dir = u->dir + (*u->dir == '/');
+  GROW (dest, 1);
+  *TAIL (dest) = ch;
+  TAIL_INCR (dest, 1);
+}
 
-  /* Check for the true name (or at least a consistent name for saving
-     to directory) of HOST, reusing the hlist if possible.  */
-  if (opt.add_hostdir)
-    {
-      /* Add dir_prefix and hostname (if required) to the beginning of
-        dir.  */
-      dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
-                               + strlen (u->host)
-                               + 1 + numdigit (u->port)
-                               + 1);
-      if (!DOTP (opt.dir_prefix))
-       sprintf (dirpref, "%s/%s", opt.dir_prefix, u->host);
-      else
-       strcpy (dirpref, u->host);
+enum {
+  filechr_unsafe_always  = 1,  /* always unsafe, e.g. / or \0 */
+  filechr_unsafe_shell   = 2,  /* unsafe for shell use, e.g. control chars */
+  filechr_unsafe_windows = 2,  /* disallowed on Windows file system */
+};
 
-      if (u->port != scheme_default_port (u->scheme))
-       {
-         int len = strlen (dirpref);
-         dirpref[len] = ':';
-         number_to_string (dirpref + len + 1, u->port);
-       }
-    }
-  else                         /* not add_hostdir */
-    {
-      if (!DOTP (opt.dir_prefix))
-       dirpref = opt.dir_prefix;
-      else
-       dirpref = "";
-    }
+#define FILE_CHAR_TEST(c, mask) (filechr_table[(unsigned char)(c)] & (mask))
 
-  /* If there is a prefix, prepend it.  */
-  if (*dirpref)
-    {
-      char *newdir = (char *)alloca (strlen (dirpref) + 1 + strlen (dir) + 2);
-      sprintf (newdir, "%s%s%s", dirpref, *dir == '/' ? "" : "/", dir);
-      dir = newdir;
-    }
+/* Shorthands for the table: */
+#define A filechr_unsafe_always
+#define S filechr_unsafe_shell
+#define W filechr_unsafe_windows
+
+/* Forbidden chars:
+
+   always: \0, /
+   Unix shell: 0-31, 128-159
+   Windows:    \, |, /, <, >, ?, :
+
+   Arguably we could also claim `%' to be unsafe, since we use it as
+   the escape character.  If we ever want to be able to reliably
+   translate file name back to URL, this would become important
+   crucial.  Right now, it's better to be minimal in escaping.  */
+
+const static unsigned char filechr_table[256] =
+{
+  A,  S,  S,  S,   S,  S,  S,  S,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
+  S,  S,  S,  S,   S,  S,  S,  S,   /* BS  HT  LF  VT   FF  CR  SO  SI  */
+  S,  S,  S,  S,   S,  S,  S,  S,   /* DLE DC1 DC2 DC3  DC4 NAK SYN ETB */
+  S,  S,  S,  S,   S,  S,  S,  S,   /* CAN EM  SUB ESC  FS  GS  RS  US  */
+  0,  0,  W,  0,   0,  0,  0,  0,   /* SP  !   "   #    $   %   &   '   */
+  0,  0,  W,  0,   0,  0,  0,  A,   /* (   )   *   +    ,   -   .   /   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* 0   1   2   3    4   5   6   7   */
+  0,  0,  W,  0,   W,  0,  W,  W,   /* 8   9   :   ;    <   =   >   ?   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* @   A   B   C    D   E   F   G   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* H   I   J   K    L   M   N   O   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* P   Q   R   S    T   U   V   W   */
+  0,  0,  0,  0,   W,  0,  0,  0,   /* X   Y   Z   [    \   ]   ^   _   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* `   a   b   c    d   e   f   g   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* h   i   j   k    l   m   n   o   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   s    t   u   v   w   */
+  0,  0,  0,  0,   0,  0,  0,  0,   /* x   y   z   {    |   }   ~   DEL */
 
-  l = strlen (dir);
-  if (l && dir[l - 1] == '/')
-    dir[l - 1] = '\0';
+  S, S, S, S,  S, S, S, S,  S, S, S, S,  S, S, S, S, /* 128-143 */
+  S, S, S, S,  S, S, S, S,  S, S, S, S,  S, S, S, S, /* 144-159 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
+
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
+};
 
-  if (!*u->file)
-    file = "index.html";
-  else
-    file = u->file;
+/* Return non-zero if character CH is unsafe for use in file or
+   directory name.  Called by append_uri_pathel. */
 
-  /* Finally, construct the full name.  */
-  res = (char *)xmalloc (strlen (dir) + 1 + strlen (file)
-                        + 1);
-  sprintf (res, "%s%s%s", dir, *dir ? "/" : "", file);
+static inline int
+file_unsafe_char (char ch, int restrict)
+{
+  int mask = filechr_unsafe_always;
+  if (restrict == restrict_shell)
+    mask |= filechr_unsafe_shell;
+  else if (restrict == restrict_windows)
+    mask |= (filechr_unsafe_shell | filechr_unsafe_windows);
+  return FILE_CHAR_TEST (ch, mask);
+}
+
+/* FN_PORT_SEP is the separator between host and port in file names
+   for non-standard port numbers.  On Unix this is normally ':', as in
+   "www.xemacs.org:4001/index.html".  Under Windows, we set it to +
+   because Windows can't handle ':' in file names.  */
+#define FN_PORT_SEP  (opt.restrict_file_names != restrict_windows ? ':' : '+')
+
+/* FN_QUERY_SEP is the separator between the file name and the URL
+   query, normally '?'.  Since Windows cannot handle '?' as part of
+   file name, we use '@' instead there.  */
+#define FN_QUERY_SEP (opt.restrict_file_names != restrict_windows ? '?' : '@')
+
+/* Quote path element, characters in [b, e), as file name, and append
+   the quoted string to DEST.  Each character is quoted as per
+   file_unsafe_char and the corresponding table.  */
 
-  return res;
-}
+static void
+append_uri_pathel (const char *b, const char *e, struct growable *dest)
+{
+  char *pathel;
+  int pathlen;
 
-/* Compose a file name out of BASE, an unescaped file name, and QUERY,
-   an escaped query string.  The trick is to make sure that unsafe
-   characters in BASE are escaped, and that slashes in QUERY are also
-   escaped.  */
+  const char *p;
+  int quoted, outlen;
 
-static char *
-compose_file_name (char *base, char *query)
-{
-  char result[256];
-  char *from;
-  char *to = result;
-
-  /* Copy BASE to RESULT and encode all unsafe characters.  */
-  from = base;
-  while (*from && to - result < sizeof (result))
-    {
-      if (UNSAFE_CHAR (*from))
-       {
-         unsigned char c = *from++;
-         *to++ = '%';
-         *to++ = XDIGIT_TO_XCHAR (c >> 4);
-         *to++ = XDIGIT_TO_XCHAR (c & 0xf);
-       }
-      else
-       *to++ = *from++;
+  /* Currently restrict_for_windows is determined at compile time
+     only.  But some users download files to Windows partitions; they
+     should be able to say --windows-file-names so Wget escapes
+     characters invalid on Windows.  Similar run-time restrictions for
+     other file systems can be implemented.  */
+  const int restrict = opt.restrict_file_names;
+
+  /* Copy [b, e) to PATHEL and URL-unescape it. */
+  BOUNDED_TO_ALLOCA (b, e, pathel);
+  url_unescape (pathel);
+  pathlen = strlen (pathel);
+
+  /* Go through PATHEL and check how many characters we'll need to
+     add for file quoting. */
+  quoted = 0;
+  for (p = pathel; *p; p++)
+    if (file_unsafe_char (*p, restrict))
+      ++quoted;
+
+  /* p - pathel is the string length.  Each quoted char means two
+     additional characters in the string, hence 2*quoted.  */
+  outlen = (p - pathel) + (2 * quoted);
+  GROW (dest, outlen);
+
+  if (!quoted)
+    {
+      /* If there's nothing to quote, we don't need to go through the
+        string the second time.  */
+      memcpy (TAIL (dest), pathel, outlen);
     }
-
-  if (query && to - result < sizeof (result))
+  else
     {
-      *to++ = '?';
-
-      /* Copy QUERY to RESULT and encode all '/' characters. */
-      from = query;
-      while (*from && to - result < sizeof (result))
+      char *q = TAIL (dest);
+      for (p = pathel; *p; p++)
        {
-         if (*from == '/')
+         if (!file_unsafe_char (*p, restrict))
+           *q++ = *p;
+         else
            {
-             *to++ = '%';
-             *to++ = '2';
-             *to++ = 'F';
-             ++from;
+             unsigned char ch = *p;
+             *q++ = '%';
+             *q++ = XDIGIT_TO_XCHAR (ch >> 4);
+             *q++ = XDIGIT_TO_XCHAR (ch & 0xf);
            }
-         else
-           *to++ = *from++;
        }
+      assert (q - TAIL (dest) == outlen);
     }
+  TAIL_INCR (dest, outlen);
+}
 
-  if (to - result < sizeof (result))
-    *to = '\0';
-  else
-    /* Truncate input which is too long, presumably due to a huge
-       query string.  */
-    result[sizeof (result) - 1] = '\0';
+/* Append to DEST the directory structure that corresponds the
+   directory part of URL's path.  For example, if the URL is
+   http://server/dir1/dir2/file, this appends "/dir1/dir2".
+
+   Each path element ("dir1" and "dir2" in the above example) is
+   examined, url-unescaped, and re-escaped as file name element.
 
-  return xstrdup (result);
+   Additionally, it cuts as many directories from the path as
+   specified by opt.cut_dirs.  For example, if opt.cut_dirs is 1, it
+   will produce "bar" for the above example.  For 2 or more, it will
+   produce "".
+
+   Each component of the path is quoted for use as file name.  */
+
+static void
+append_dir_structure (const struct url *u, struct growable *dest)
+{
+  char *pathel, *next;
+  int cut = opt.cut_dirs;
+
+  /* Go through the path components, de-URL-quote them, and quote them
+     (if necessary) as file names.  */
+
+  pathel = u->path;
+  for (; (next = strchr (pathel, '/')) != NULL; pathel = next + 1)
+    {
+      if (cut-- > 0)
+       continue;
+      if (pathel == next)
+       /* Ignore empty pathels.  path_simplify should remove
+          occurrences of "//" from the path, but it has special cases
+          for starting / which generates an empty pathel here.  */
+       continue;
+
+      if (dest->tail)
+       append_char ('/', dest);
+      append_uri_pathel (pathel, next, dest);
+    }
 }
+
+/* Return a unique file name that matches the given URL as good as
+   possible.  Does not create directories on the file system.  */
 
-/* Create a unique filename, corresponding to a given URL.  Calls
-   mkstruct if necessary.  Does *not* actually create any directories.  */
 char *
-url_filename (const struct url *u)
+url_file_name (const struct url *u)
 {
-  char *file, *name;
+  struct growable fnres;
 
-  char *query = u->query && *u->query ? u->query : NULL;
+  char *u_file, *u_query;
+  char *fname, *unique;
 
+  fnres.base = NULL;
+  fnres.size = 0;
+  fnres.tail = 0;
+
+  /* Start with the directory prefix, if specified. */
+  if (!DOTP (opt.dir_prefix))
+    append_string (opt.dir_prefix, &fnres);
+
+  /* If "dirstruct" is turned on (typically the case with -r), add
+     the host and port (unless those have been turned off) and
+     directory structure.  */
   if (opt.dirstruct)
     {
-      char *base = mkstruct (u);
-      file = compose_file_name (base, query);
-      xfree (base);
+      if (opt.add_hostdir)
+       {
+         if (fnres.tail)
+           append_char ('/', &fnres);
+         append_string (u->host, &fnres);
+         if (u->port != scheme_default_port (u->scheme))
+           {
+             char portstr[24];
+             number_to_string (portstr, u->port);
+             append_char (FN_PORT_SEP, &fnres);
+             append_string (portstr, &fnres);
+           }
+       }
+
+      append_dir_structure (u, &fnres);
     }
-  else
-    {
-      char *base = *u->file ? u->file : "index.html";
-      file = compose_file_name (base, query);
 
-      /* Check whether the prefix directory is something other than "."
-        before prepending it.  */
-      if (!DOTP (opt.dir_prefix))
-       {
-         /* #### should just realloc FILE and prepend dir_prefix. */
-         char *nfile = (char *)xmalloc (strlen (opt.dir_prefix)
-                                        + 1 + strlen (file) + 1);
-         sprintf (nfile, "%s/%s", opt.dir_prefix, file);
-         xfree (file);
-         file = nfile;
-       }
+  /* Add the file name. */
+  if (fnres.tail)
+    append_char ('/', &fnres);
+  u_file = *u->file ? u->file : "index.html";
+  append_uri_pathel (u_file, u_file + strlen (u_file), &fnres);
+
+  /* Append "?query" to the file name. */
+  u_query = u->query && *u->query ? u->query : NULL;
+  if (u_query)
+    {
+      append_char (FN_QUERY_SEP, &fnres);
+      append_uri_pathel (u_query, u_query + strlen (u_query), &fnres);
     }
 
-  /* DOS-ish file systems don't like `%' signs in them; we change it
-     to `@'.  */
-#ifdef WINDOWS
-  {
-    char *p = file;
-    for (p = file; *p; p++)
-      if (*p == '%')
-       *p = '@';
-  }
-#endif /* WINDOWS */
+  /* Zero-terminate the file name. */
+  append_char ('\0', &fnres);
+
+  fname = fnres.base;
 
   /* Check the cases in which the unique extensions are not used:
      1) Clobbering is turned off (-nc).
@@ -1557,17 +1675,18 @@
 
      The exception is the case when file does exist and is a
      directory (actually support for bad httpd-s).  */
+
   if ((opt.noclobber || opt.always_rest || opt.timestamping || opt.dirstruct)
-      && !(file_exists_p (file) && !file_non_directory_p (file)))
-    return file;
+      && !(file_exists_p (fname) && !file_non_directory_p (fname)))
+    return fnres.base;
 
   /* Find a unique name.  */
-  name = unique_name (file);
-  xfree (file);
-  return name;
+  unique = unique_name (fname);
+  xfree (fname);
+  return unique;
 }
 
-/* Return the langth of URL's path.  Path is considered to be
+/* Return the length of URL's path.  Path is considered to be
    terminated by one of '?', ';', '#', or by the end of the
    string.  */
 static int
@@ -1680,8 +1799,10 @@
       else if (*p == '/')
        {
          /* Remove empty path elements.  Not mandated by rfc1808 et
-            al, but empty path elements are not all that useful, and
-            the rest of Wget might not deal with them well. */
+            al, but it seems like a good idea to get rid of them.
+            Supporting them properly is hard (in which directory do
+            you save http://x.com///y.html?) and they don't seem to
+            bring much gain.  */
          char *q = p;
          while (*q == '/')
            ++q;
@@ -1964,13 +2085,13 @@
   /* Make sure the user name and password are quoted. */
   if (url->user)
     {
-      quoted_user = encode_string_maybe (url->user);
+      quoted_user = url_escape_allow_passthrough (url->user);
       if (url->passwd)
        {
          if (hide_password)
            quoted_passwd = HIDDEN_PASSWORD;
          else
-           quoted_passwd = encode_string_maybe (url->passwd);
+           quoted_passwd = url_escape_allow_passthrough (url->passwd);
        }
     }
 
Index: src/url.h
===================================================================
RCS file: /pack/anoncvs/wget/src/url.h,v
retrieving revision 1.25
diff -u -r1.25 url.h
--- src/url.h   2002/05/18 02:16:25     1.25
+++ src/url.h   2003/09/14 21:47:09
@@ -130,7 +130,7 @@
 
 /* Function declarations */
 
-char *encode_string PARAMS ((const char *));
+char *url_escape PARAMS ((const char *));
 
 struct url *url_parse PARAMS ((const char *, int *));
 const char *url_error PARAMS ((int));
@@ -157,7 +157,7 @@
 
 void rotate_backups PARAMS ((const char *));
 int mkalldirs PARAMS ((const char *));
-char *url_filename PARAMS ((const struct url *));
+char *url_file_name PARAMS ((const struct url *));
 
 char *getproxy PARAMS ((struct url *));
 int no_proxy_match PARAMS ((const char *, const char **));

More flexible URL file name generation

Reply via email to