Hi Ivan, Felix and other hackers,

With CHICKEN 6's UTF-8 support in the works, Felix has started porting
over some eggs in order to test for bugs in the core system.

One of those was uri-generic, and he ran into a problem with the
percent-decoding stuff.  Currently, in CHICKEN 5, we simply
percent-decode the string to a byte sequence and check whether the
char-set contains the byte value reinterpreted as a character.

This is actually pretty broken.  Kooda already reported an issue with
it in 2016, see https://bugs.call-cc.org/changeset/33264/project.
The fix here was a bit of a hack, we simply change the definitions of
the unreserved character set, which is used to decode path components.

It would still not work when calling "uri-decode-string" manually with a
custom srfi-14 character set in CHICKEN 5.

Essentially, it's abusing weird properties of strings-as-bytevectors and
the fact that SRFI-14 is using Latin1 character sets doesn't make it any
better for us.  The upshot of this is that a lot of things are somewhat
ill-defined, but we do allow and (mostly?) preserve arbitrary byte
sequences in an URI decoding/encoding roundtrip.

For CHICKEN 6, this doesn't work because strings are no longer byte
vectors.  To make this work properly, I decided to implement UTF-8
encoding and decoding in percent-encode and percent-decode.  Attached is
a patch for that.

Unfortunately, it also means we must now choose to reject certain URIs
(at least in uri-common) by raising an exception instead of allowing them
to be decoded.  These are for invalid UTF-8 encoded characters, either
because they're a truncated byte sequence or because they encode a
character in too many bytes.

The worst part of all of this is that the generic URI spec doesn't mandate
UTF-8, it just recommends that new schemes use UTF-8, and there's some
oddball recommendation that non-ASCII encodings like EBDIC should try to
convert at least the unreserved ASCII characters to ASCII, but percent-
encode the rest of the characters in that same encoding, yielding some
weird hybrid encoding.

This also means that the new version of uri-common at least would not be
able to transparently pass through URIs that *don't* use UTF-8 encoding,
or URIs with overlong UTF-8 encoding for some characters.

Anyway, this change in uri-generic behaviour probably warrants a new major
version.  I'm not sure whether we should do the same for CHICKEN 5, where
the behaviour is ill-defined but cannot currently raise an exception.
Note that the patch should work on CHICKEN 5 equally well - we'd just be
doing the encoding and decoding explicitly instead of relying on the
string-as-bytevector underlying representation.

Please have a look at the patch and review it.

PS: Probably this can be improved greatly by using low-level C functions
in the utf.c file in CHICKEN core and fixnum operations, but I decided to
keep with the style of the uri-generic egg, which is all standard Scheme.

Cheers,
Peter
Index: tests/run.scm
===================================================================
--- tests/run.scm       (revision 43373)
+++ tests/run.scm       (working copy)
@@ -1,4 +1,4 @@
-;;(load "../uri-generic.scm")
+(load "../uri-generic.scm")
 (import uri-generic (chicken format) (chicken string) srfi-1 test)
 
 (test-begin "uri-generic")
@@ -233,10 +233,13 @@
   '(("foo?bar" "foo%3Fbar")
     ("foo&bar" "foo%26bar")
     ("foo%20bar" "foo%2520bar")
-    ("foo\x00bar\n" "foo%00bar%0A")
+    ("foo\x00;bar\n" "foo%00bar%0A")
     ;; UTF-8 breakage, reported by Adrien Ramos
     ("D&D - Création persos.html"
-     "D%26D%20-%20Cr%C3%A9ation%20persos.html")))
+     "D%26D%20-%20Cr%C3%A9ation%20persos.html")
+    ;; UTF-8 encoding of various lengths
+    ("a béc♂d😎e"
+     "a%20b%C3%A9c%E2%99%82d%F0%9F%98%8Ee")))
 
 (test-group "uri-encode-string test"
   (for-each (lambda (p)
@@ -252,6 +255,31 @@
                   (test (sprintf "~S -> ~S" (second p) expected) expected 
decoded)))
             encode/decode-cases))
 
+
+(define error-decode-cases
+  '(;; Overlong UTF-8 encodings of space
+    "a%C0%A0b"
+    "a%E0%80%A0b"
+    "a%E0%80%80%A0b"
+    ;; Overlong UTF-8 encodings of é
+    "a%E0%80%A9b"
+    "a%E0%80%80%A9b"
+    ;; Overlong UTF-8 encoding of ♂
+    "a%E0%E2%99%82b"
+
+    ;; Incomplete UTF-8 encoding
+    "a%C0b"
+    "a%E2%99"
+    "a%F0%9F%98"
+    "a%F0%9F%98x"
+    "a%F0%9F%98x"))
+
+(test-group "uri-decode-string test with invalid encodings"
+  (for-each (lambda (p)
+              (test-error p (uri-decode-string p)))
+            error-decode-cases))
+
+
 (define normalize-case-cases
   '(("http://exa%2fmple/FOO%2fbar"; "http://exa%2Fmple/FOO%2Fbar";)
     ("http://EXA%2fMPLE/FOO%2fbar"; "http://exa%2Fmple/FOO%2Fbar";)
Index: uri-generic.scm
===================================================================
--- uri-generic.scm     (revision 43373)
+++ uri-generic.scm     (working copy)
@@ -3,7 +3,7 @@
 ;;
 ;; Based on the Haskell URI library by  Graham Klyne <g...@ninebynine.org>.
 ;;
-;; Copyright 2008-2022 Ivan Raikov, Peter Bex, Seth Alves.
+;; Copyright 2008-2024 Ivan Raikov, Peter Bex, Seth Alves.
 ;;
 ;;
 ;;  Redistribution and use in source and binary forms, with or without
@@ -55,7 +55,7 @@
    char-set:uri-reserved char-set:uri-unreserved)
 
 (import scheme (scheme base)
-        (chicken base) (chicken string) (chicken port)
+        (chicken base) (chicken string) (chicken port) (chicken bitwise)
         (chicken format) srfi-1 srfi-4 srfi-14 matchable)
 
 (define uri-error error)
@@ -432,16 +432,43 @@
 ;;
 ;; Returns a 'pct-encoded' sequence of octets.
 ;;
+;; This assumes UTF-8 encoding, see RFC 3629/STD 63 for the
+;; encoding and decoding algorithms.  Ideally we'd like to be able
+;; to specify the encoding here, because the URI standard is
+;; encoding-agnostic (see also section 2.3 "Identifying data").
+;; It's just a binary encoding, and it's up to the client/server
+;; endpoints to decide on the encoding.
+
 (define (pct-encode char-list char-set)
   (define (hex-digit i)
     (and (>= i 0) (< i 16)
          (char-upcase (string-ref (number->string i 16) 0))))
+  (define (enc-byte b)
+    (let ((h1 (hex-digit (quotient b 16)))
+          (h2 (hex-digit (remainder b 16))))
+      `(#\% ,h1 ,h2)))
   (reverse (fold (lambda (c cl)
                    (if (char-set-contains? char-set c)
-                       (let* ((x (char->integer c))
-                              (h1 (hex-digit (quotient x 16)))
-                              (h2 (hex-digit (remainder x 16))))
-                         (cons `(#\% ,h1 ,h2) cl))
+                       (let ((x (char->integer c)))
+                         (cond
+                          ((< x #x80) (cons (enc-byte x) cl))
+                          ;; XXX What to do if the character is a UTF-16 
surrogate pair?
+                          ;; eg in the range between #xD800 and #xDFFF.
+                          ;; Those are forbidden in UTF-8...
+                          ((< x #x800) (let ((x1 (bitwise-ior 
(arithmetic-shift x -6) #xC0))
+                                             (x2 (bitwise-ior (bitwise-and x 
#x3f) #x80)))
+                                         `(,(enc-byte x2) ,(enc-byte x1) 
,@cl)))
+                          ((< x #x10000) (let ((x1 (bitwise-ior 
(arithmetic-shift x -12) #xE0))
+                                               (x2 (bitwise-ior (bitwise-and 
(arithmetic-shift x -6) #x3F) #x80))
+                                               (x3 (bitwise-ior (bitwise-and x 
#x3f) #x80)))
+                                           `(,(enc-byte x3) ,(enc-byte x2) 
,(enc-byte x1) ,@cl)))
+                          ;; NOTE: We skip the check that the character is 
less than 0010 FFFF
+                          (else (let ((x1 (bitwise-ior (arithmetic-shift x 
-18) #xF0))
+                                      (x2 (bitwise-ior (bitwise-and 
(arithmetic-shift x -12) #x3F) #x80))
+                                      (x3 (bitwise-ior (bitwise-and 
(arithmetic-shift x -6) #x3F) #x80))
+                                      (x4 (bitwise-ior (bitwise-and x #x3f) 
#x80)))
+                                  `(,(enc-byte x4) ,(enc-byte x3) ,(enc-byte 
x2) ,(enc-byte x1) ,@cl)))))
+                       ;; No encoding needed
                        (cons c cl)))
                 (list) char-list)))
 
@@ -448,17 +475,84 @@
 ;; Inverse operation: 'pct-decode' a sequence of octets.
 
 (define (pct-decode char-list char-set)
+  (define (decoding-error)
+    (uri-error 'uri-decode-string "Invalid UTF-8 sequence in 
percent-encoding"))
   (define (octet-decode h1 h2)
     (string->number (list->string (list h1 h2)) 16))
-  (map (lambda (c)
-        (match c
-               ((#\% h1 h2)
-                (let ((dc (integer->char (octet-decode h1 h2))))
-                  (if (char-set-contains? char-set dc) dc c)))
-               (else c)))
-       char-list))
+  ;; NOTE: If eat-rest-chars fails, it will raise an error.
+  ;; We could keep the character in its percent-encoded form,
+  ;; but that could lead to confusing double-decoding issues
+  ;; which might also be abused by an attacker.
+  (define (eat-rest-chars first-byte bytes-to-eat cl)
+    (let lp ((codepoint first-byte)
+             (n bytes-to-eat)
+             (cl cl))
+      (cond
+       ((zero? n)
+        ;; Detect overlong encoding
+        (if (or (and (= bytes-to-eat 1)
+                     (<= #x80 codepoint #x7FF))
+                (and (= bytes-to-eat 2)
+                     (<= #x800 codepoint #xFFFF))
+                (and (= bytes-to-eat 3)
+                     (<= #x10000 codepoint #x10FFFF)))
+            (let ((dc (integer->char codepoint)))
+              (if (char-set-contains? char-set dc)
+                  (values dc cl)
+                  (values #f #f)))
+            (decoding-error)))
+       ((null? cl)                  ; Incompletely encoded char
+        (decoding-error))
+       (else
+        (let ((c (car cl)))
+          (match c
+            ((#\% h1 h2)
+             (let ((byte (octet-decode h1 h2)))
+               (if (= #x80 (bitwise-and byte #xC0))
+                   (let ((decoded-byte (bitwise-and byte #x7F)))
+                     (lp (bitwise-ior (arithmetic-shift codepoint 6) 
decoded-byte)
+                         (- n 1)
+                         (cdr cl)))
+                   (decoding-error)))) ; Incomplete/invalid encoding
+            ;; Incomplete/invalid encoding
+            (else (decoding-error))))))))
+  (if (null? char-list)
+      char-list
+      (let lp ((cl char-list)
+               (res '()))
+        (if (null? cl)
+            (reverse res)
+            (let ((c (car cl)))
+              (match c
+               ((#\% h1 h2)
+                (let ((b1 (octet-decode h1 h2)))
+                   (cond
+                    ((= 0 (bitwise-and b1 #x80))
+                     (let ((dc (integer->char b1)))
+                       (lp (cdr cl) (cons (if (char-set-contains? char-set dc) 
dc c) res))))
 
+                    ((= #xC0 (bitwise-and b1 #xE0))
+                     (receive (dc dcl) (eat-rest-chars (bitwise-and b1 #x1f) 1 
(cdr cl))
+                       (if dc
+                           (lp dcl (cons dc res))
+                           (lp (cdr cl) (cons c res)))))
 
+                    ((= #xE0 (bitwise-and b1 #xF0))
+                     (receive (dc dcl) (eat-rest-chars (bitwise-and b1 #x0f) 2 
(cdr cl))
+                       (if dc
+                           (lp dcl (cons dc res))
+                           (lp (cdr cl) (cons c res)))))
+
+                    ((= #xF0 (bitwise-and b1 #xF8))
+                     (receive (dc dcl) (eat-rest-chars (bitwise-and b1 #x07) 3 
(cdr cl))
+                       (if dc
+                           (lp dcl (cons dc res))
+                           (lp (cdr cl) (cons c res)))))
+
+                    (else (lp (cdr cl) (cons c res))))))
+               (else (lp (cdr cl) (cons c res)))))))))
+
+
 ;; RFC3986, section 2.2
 ;;
 ;; Reserved characters.

Reply via email to