ftp: do not URI encode tilde as per RFC 2396

Klemens Nanni Fri, 05 Nov 2021 16:45:19 -0700

ftp(1) implements RFC 1738 from Dec. 1994 but RFC 2396 from Aug. 1998
updated this and the tl;dr is:  do not encode the tilde character.


In theory, this shouldn't make a difference as servers should decode
"%7e" to "~", BUT not all servers do so and thus some respond with 404.

I've hit this in the past already but didn't look at the code.
Now I came across this link that made me fix it:
https://review.trustedfirmware.org/changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download

curl(1) and wget(1) both fetch this file successfully, i.e. they behave
identically wrt. "~" and send it as-is:

        $ url='https://review.trustedfirmware.org/changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download'

        $ curl -I -sS -w '%{url}\n' "$url" | tail -1
        200 https://review.trustedfirmware.org/changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download

        $ wget -d "$url" 2>&1 | grep -Ew 'GET|OK'
        GET /changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download HTTP/1.1
        HTTP/1.1 200 OK
        200 OK

ftp(1) still sticks with RFC 1738 and encodes it:

        $ ftp -d "$url"
        host review.trustedfirmware.org, port https, path changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download, save as patch?download, auth none.
        Trying 64:ff9b::339f:1211...
        Requesting https://review.trustedfirmware.org/changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download
        GET /changes/TF-A%2Ftrusted-firmware-a%7e7726/revisions/9/patch?download HTTP/1.1
        Connection: close
        Host: review.trustedfirmware.org
        User-Agent: OpenBSD ftp

        received 'HTTP/1.1 404 Not Found'
        ftp: Error retrieving https://review.trustedfirmware.org/changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download: 404 Not Found

RFC 2396 2.4.2. When to Escape and Unescape says:

   In some cases, data that could be represented by an unreserved
   character may appear escaped; for example, some of the unreserved
   "mark" characters are automatically escaped by some systems.  If the
   given URI scheme defines a canonicalization algorithm, then
   unreserved characters may be unescaped according to that algorithm.
   For example, "%7e" is sometimes used instead of "~" in an http URL
   path, but the two are equivalent for an http URL.

So they're equivalent and tilde does not need encoding.  Again, servers
should always decode it anyway, but not all of them do, so better send
it as-is/unencoded.
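
To illustrate the equivalence the RFC describes, here is a minimal sketch of
what a well-behaved receiver could do: unescape only those "%xx" sequences
that stand for RFC 2396 unreserved characters (alphanumerics and the marks
-_.!~*'()), and leave every other escape alone.  It is not taken from any
server or from ftp(1); the function names are made up for this example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int
sketch_hexval(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	return tolower(c) - 'a' + 10;
}

/* RFC 2396 2.3: unreserved = alphanum | mark */
static int
sketch_is_unreserved(int c)
{
	if (c <= 0 || c >= 0x80)
		return 0;
	return isalnum(c) || strchr("-_.!~*'()", c) != NULL;
}

/*
 * Hypothetical helper: return a newly allocated copy of `in' where
 * escapes of unreserved characters are decoded ("%7e" -> "~") and all
 * other escapes, e.g. "%2F", are kept as-is.  Caller frees the result.
 */
char *
sketch_unescape_unreserved(const char *in)
{
	char *out, *p;

	/* The result is never longer than the input. */
	if ((out = p = malloc(strlen(in) + 1)) == NULL)
		return NULL;

	while (*in != '\0') {
		const unsigned char *u = (const unsigned char *)in;

		if (u[0] == '%' && isxdigit(u[1]) && isxdigit(u[2])) {
			int c = sketch_hexval(u[1]) * 16 + sketch_hexval(u[2]);

			if (sketch_is_unreserved(c)) {
				*p++ = (char)c;
				in += 3;
				continue;
			}
		}
		*p++ = *in++;
	}
	*p = '\0';
	return out;
}
```

With this, "TF-A%2Ftrusted-firmware-a%7e7726" becomes
"TF-A%2Ftrusted-firmware-a~7726": the tilde is restored while the reserved
"%2F" stays escaped.  A server that skips this step is the kind that answers
404 above.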

Diff below is effectively a one-character change that removes "~" from
`*unsafe_chars' but it also updates the comments to use RFC 2396 wording
and order of characters for easier code-to-standard comparison.
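
For readers who want to try the new behaviour outside of ftp(1), the
following standalone sketch mirrors the patched predicate and the two-pass
encoder from fetch.c.  It is an approximation, not the diff itself: the
sketch_* names are invented here, and `*c >= 0x80' stands in for
!isascii() to stay within ISO C.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Nonzero if the character at c0 must be percent-encoded. */
int
sketch_to_encode(const char *c0)
{
	/* RFC 2396 2.4.3 excluded characters; note: no "~". */
	const char *excluded_chars =
	    " "         /* space */
	    "<>#\""     /* delims (modulo "%", handled below) */
	    "{}|\\^[]`" /* unwise */
	    ;
	const unsigned char *c = (const unsigned char *)c0;

	return (iscntrl(*c) || *c >= 0x80 ||
	    strchr(excluded_chars, *c) != NULL ||
	    /* A "%" not followed by two hex digits is encoded, too. */
	    (*c == '%' && (!isxdigit(c[1]) || !isxdigit(c[2]))));
}

/* Return a newly allocated, encoded copy of path; caller frees it. */
char *
sketch_url_encode(const char *path)
{
	size_t i, length = strlen(path), new_length = length;
	char *epath, *epathp;

	/* First pass: each encoded character grows by two bytes. */
	for (i = 0; i < length; i++)
		if (sketch_to_encode(path + i))
			new_length += 2;

	if ((epath = epathp = malloc(new_length + 1)) == NULL)
		return NULL;

	/* Second pass: encode and copy. */
	for (i = 0; i < length; i++)
		if (sketch_to_encode(path + i)) {
			snprintf(epathp, 4, "%%%02x",
			    (unsigned char)path[i]);
			epathp += 3;
		} else
			*epathp++ = path[i];
	*epathp = '\0';

	return epath;
}
```

Feeding it "/changes/TF-A%2Ftrusted-firmware-a~7726/a b" yields
"/changes/TF-A%2Ftrusted-firmware-a~7726/a%20b": the existing "%2F" escape
and the tilde pass through untouched, only the space is encoded.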

This allows me to fetch this patch:

        $ ./obj/ftp -d "$url"
        host review.trustedfirmware.org, port https, path changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download, save as patch?download, auth none.
        Trying 64:ff9b::339f:1211...
        Requesting https://review.trustedfirmware.org/changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download
        GET /changes/TF-A%2Ftrusted-firmware-a~7726/revisions/9/patch?download HTTP/1.1
        Connection: close
        Host: review.trustedfirmware.org
        User-Agent: OpenBSD ftp

        received 'HTTP/1.1 200 OK'
        received 'Date: Fri, 05 Nov 2021 23:34:49 GMT'
        received 'Server: Apache'
        received 'X-Frame-Options: DENY'
        received 'Content-Disposition: attachment; filename="155911c.diff.base64"'
        received 'X-Content-Type-Options: nosniff'
        received 'Cache-Control: private'
        received 'Pragma: no-cache'
        received 'Expires: Mon, 01 Jan 1990 00:00:00 GMT'
        received 'X-FYI-Content-Encoding: base64'
        received 'X-FYI-Content-Type: application/mbox'
        received 'Content-Type: text/plain;charset=iso-8859-1'
        received 'Vary: Accept-Encoding'
        received 'Connection: close'
        received 'Transfer-Encoding: chunked'
        9544 bytes received in 0.00 seconds (12.21 MB/s)


Looking at wget's source somewhat backs up this way of handling tilde
by explicitly mentioning broken "%7e" decoding, i.e. servers expecting
an unencoded "~" as-is:

From wget-1.2.1/src/url.c 103f:

/* Table of "reserved" and "unsafe" characters.  Those terms are
   rfc1738-speak, as such largely obsoleted by rfc2396 and later
   specs, but the general idea remains.

   A reserved character is the one that you can't decode without
   changing the meaning of the URL.  For example, you can't decode
   "/foo/%2f/bar" into "/foo///bar" because the number and contents of
   path components is different.  Non-reserved characters can be
   changed, so "/foo/%78/bar" is safe to change to "/foo/x/bar".  The
   unsafe characters are loosely based on rfc1738, plus "$" and ",",
   as recommended by rfc2396, and minus "~", which is very frequently
   used (and sometimes unrecognized as %7E by broken servers).

   An unsafe character is the one that should be encoded when URLs are
   placed in foreign environments.  E.g. space and newline are unsafe
   in HTTP contexts because HTTP uses them as separator and line
   terminator, so they must be encoded to %20 and %0A respectively.
   "*" is unsafe in shell context, etc.

   We determine whether a character is unsafe through static table
   lookup.  This code assumes ASCII character set and 8-bit chars.  */

OK?

Index: fetch.c
===================================================================
RCS file: /cvs/src/usr.bin/ftp/fetch.c,v
retrieving revision 1.205
diff -u -p -r1.205 fetch.c
--- fetch.c     31 Aug 2021 09:51:25 -0000      1.205
+++ fetch.c     5 Nov 2021 22:49:35 -0000
@@ -106,14 +106,17 @@ static int        redirect_loop;
 static int     retried;
 
 /*
- * Determine whether the character needs encoding, per RFC1738:
- *     - No corresponding graphic US-ASCII.
- *     - Unsafe characters.
+ * Determine whether the character needs encoding, per RFC2396.
  */
 static int
-unsafe_char(const char *c0)
+to_encode(const char *c0)
 {
-       const char *unsafe_chars = " <>\"#{}|\\^~[]`";
+       /* 2.4.3. Excluded US-ASCII Characters */
+       const char *excluded_chars =
+           " "         /* space */
+           "<>#\""     /* delims (modulo "%", see below) */
+           "{}|\\^[]`" /* unwise */
+           ;
        const unsigned char *c = (const unsigned char *)c0;
 
        /*
@@ -123,16 +126,15 @@ unsafe_char(const char *c0)
        return (iscntrl(*c) || !isascii(*c) ||
 
            /*
-            * Unsafe characters.
-            * '%' is also unsafe, if is not followed by two
+            * '%' is also reserved, if is not followed by two
             * hexadecimal digits.
             */
-           strchr(unsafe_chars, *c) != NULL ||
+           strchr(excluded_chars, *c) != NULL ||
            (*c == '%' && (!isxdigit(c[1]) || !isxdigit(c[2]))));
 }
 
 /*
- * Encode given URL, per RFC1738.
+ * Encode given URL, per RFC2396.
  * Allocate and return string to the caller.
  */
 static char *
@@ -145,11 +147,10 @@ url_encode(const char *path)
 
        /*
         * First pass:
-        * Count unsafe characters, and determine length of the
-        * final URL.
+        * Count characters to encode and determine length of the final URL.
         */
        for (i = 0; i < length; i++)
-               if (unsafe_char(path + i))
+               if (to_encode(path + i))
                        new_length += 2;
 
        epath = epathp = malloc(new_length + 1);        /* One more for '\0'. */
@@ -161,7 +162,7 @@ url_encode(const char *path)
         * Encode, and copy final URL.
         */
        for (i = 0; i < length; i++)
-               if (unsafe_char(path + i)) {
+               if (to_encode(path + i)) {
                        snprintf(epathp, 4, "%%" "%02x",
                            (unsigned char)path[i]);
                        epathp += 3;
