Hi Ivan, Felix and other hackers, With CHICKEN 6's UTF-8 support in the works, Felix has started porting over some eggs in order to test for bugs in the core system.
One of those was uri-generic, and he ran into a problem with the percent-decoding stuff. Currently, in CHICKEN 5, we simply percent-decode the string to a byte sequence and check whether the char-set contains the byte value reinterpreted as a character. This is actually pretty broken. Kooda already reported an issue with it in 2016, see https://bugs.call-cc.org/changeset/33264/project. The fix here was a bit of a hack, we simply change the definitions of the unreserved character set, which is used to decode path components. It would still not work when calling "uri-decode-string" manually with a custom srfi-14 character set in CHICKEN 5. Essentially, it's abusing weird properties of strings-as-bytevectors and the fact that SRFI-14 is using Latin1 character sets doesn't make it any better for us. The upshot of this is that a lot of things are somewhat ill-defined, but we do allow and (mostly?) preserve arbitrary byte sequences in an URI decoding/encoding roundtrip. For CHICKEN 6, this doesn't work because strings are no longer byte vectors. To make this work properly, I decided to implement UTF-8 encoding and decoding in percent-encode and percent-decode. Attached is a patch for that. Unfortunately, it also means we must now choose to reject certain URIs (at least in uri-common) by raising an exception instead of allowing them to be decoded. These are for invalid UTF-8 encoded characters, either because they're a truncated byte sequence or because they encode a character in too many bytes. The worst part of all of this is that the generic URI spec doesn't mandate UTF-8, it just recommends that new schemes use UTF-8, and there's some oddball recommendation that non-ASCII encodings like EBDIC should try to convert at least the unreserved ASCII characters to ASCII, but percent- encode the rest of the characters in that same encoding, yielding some weird hybrid encoding. This also means that the new version of uri-common at least would not be able to transparently pass through URIs that *don't* use UTF-8 encoding, or URIs with overlong UTF-8 encoding for some characters. Anyway, this change in uri-generic behaviour probably warrants a new major version. I'm not sure whether we should do the same for CHICKEN 5, where the behaviour is ill-defined but cannot currently raise an exception. Note that the patch should work on CHICKEN 5 equally well - we'd just be doing the encoding and decoding explicitly instead of relying on the string-as-bytevector underlying representation. Please have a look at the patch and review it. PS: Probably this can be improved greatly by using low-level C functions in the utf.c file in CHICKEN core and fixnum operations, but I decided to keep with the style of the uri-generic egg, which is all standard Scheme. Cheers, Peter
Index: tests/run.scm =================================================================== --- tests/run.scm (revision 43373) +++ tests/run.scm (working copy) @@ -1,4 +1,4 @@ -;;(load "../uri-generic.scm") +(load "../uri-generic.scm") (import uri-generic (chicken format) (chicken string) srfi-1 test) (test-begin "uri-generic") @@ -233,10 +233,13 @@ '(("foo?bar" "foo%3Fbar") ("foo&bar" "foo%26bar") ("foo%20bar" "foo%2520bar") - ("foo\x00bar\n" "foo%00bar%0A") + ("foo\x00;bar\n" "foo%00bar%0A") ;; UTF-8 breakage, reported by Adrien Ramos ("D&D - Création persos.html" - "D%26D%20-%20Cr%C3%A9ation%20persos.html"))) + "D%26D%20-%20Cr%C3%A9ation%20persos.html") + ;; UTF-8 encoding of various lengths + ("a béc♂d😎e" + "a%20b%C3%A9c%E2%99%82d%F0%9F%98%8Ee"))) (test-group "uri-encode-string test" (for-each (lambda (p) @@ -252,6 +255,31 @@ (test (sprintf "~S -> ~S" (second p) expected) expected decoded))) encode/decode-cases)) + +(define error-decode-cases + '(;; Overlong UTF-8 encodings of space + "a%C0%A0b" + "a%E0%80%A0b" + "a%E0%80%80%A0b" + ;; Overlong UTF-8 encodings of é + "a%E0%80%A9b" + "a%E0%80%80%A9b" + ;; Overlong UTF-8 encoding of ♂ + "a%E0%E2%99%82b" + + ;; Incomplete UTF-8 encoding + "a%C0b" + "a%E2%99" + "a%F0%9F%98" + "a%F0%9F%98x" + "a%F0%9F%98x")) + +(test-group "uri-decode-string test with invalid encodings" + (for-each (lambda (p) + (test-error p (uri-decode-string p))) + error-decode-cases)) + + (define normalize-case-cases '(("http://exa%2fmple/FOO%2fbar" "http://exa%2Fmple/FOO%2Fbar") ("http://EXA%2fMPLE/FOO%2fbar" "http://exa%2Fmple/FOO%2Fbar") Index: uri-generic.scm =================================================================== --- uri-generic.scm (revision 43373) +++ uri-generic.scm (working copy) @@ -3,7 +3,7 @@ ;; ;; Based on the Haskell URI library by Graham Klyne <g...@ninebynine.org>. ;; -;; Copyright 2008-2022 Ivan Raikov, Peter Bex, Seth Alves. +;; Copyright 2008-2024 Ivan Raikov, Peter Bex, Seth Alves. ;; ;; ;; Redistribution and use in source and binary forms, with or without @@ -55,7 +55,7 @@ char-set:uri-reserved char-set:uri-unreserved) (import scheme (scheme base) - (chicken base) (chicken string) (chicken port) + (chicken base) (chicken string) (chicken port) (chicken bitwise) (chicken format) srfi-1 srfi-4 srfi-14 matchable) (define uri-error error) @@ -432,16 +432,43 @@ ;; ;; Returns a 'pct-encoded' sequence of octets. ;; +;; This assumes UTF-8 encoding, see RFC 3629/STD 63 for the +;; encoding and decoding algorithms. Ideally we'd like to be able +;; to specify the encoding here, because the URI standard is +;; encoding-agnostic (see also section 2.3 "Identifying data"). +;; It's just a binary encoding, and it's up to the client/server +;; endpoints to decide on the encoding. + (define (pct-encode char-list char-set) (define (hex-digit i) (and (>= i 0) (< i 16) (char-upcase (string-ref (number->string i 16) 0)))) + (define (enc-byte b) + (let ((h1 (hex-digit (quotient b 16))) + (h2 (hex-digit (remainder b 16)))) + `(#\% ,h1 ,h2))) (reverse (fold (lambda (c cl) (if (char-set-contains? char-set c) - (let* ((x (char->integer c)) - (h1 (hex-digit (quotient x 16))) - (h2 (hex-digit (remainder x 16)))) - (cons `(#\% ,h1 ,h2) cl)) + (let ((x (char->integer c))) + (cond + ((< x #x80) (cons (enc-byte x) cl)) + ;; XXX What to do if the character is a UTF-16 surrogate pair? + ;; eg in the range between #xD800 and #xDFFF. + ;; Those are forbidden in UTF-8... + ((< x #x800) (let ((x1 (bitwise-ior (arithmetic-shift x -6) #xC0)) + (x2 (bitwise-ior (bitwise-and x #x3f) #x80))) + `(,(enc-byte x2) ,(enc-byte x1) ,@cl))) + ((< x #x10000) (let ((x1 (bitwise-ior (arithmetic-shift x -12) #xE0)) + (x2 (bitwise-ior (bitwise-and (arithmetic-shift x -6) #x3F) #x80)) + (x3 (bitwise-ior (bitwise-and x #x3f) #x80))) + `(,(enc-byte x3) ,(enc-byte x2) ,(enc-byte x1) ,@cl))) + ;; NOTE: We skip the check that the character is less than 0010 FFFF + (else (let ((x1 (bitwise-ior (arithmetic-shift x -18) #xF0)) + (x2 (bitwise-ior (bitwise-and (arithmetic-shift x -12) #x3F) #x80)) + (x3 (bitwise-ior (bitwise-and (arithmetic-shift x -6) #x3F) #x80)) + (x4 (bitwise-ior (bitwise-and x #x3f) #x80))) + `(,(enc-byte x4) ,(enc-byte x3) ,(enc-byte x2) ,(enc-byte x1) ,@cl))))) + ;; No encoding needed (cons c cl))) (list) char-list))) @@ -448,17 +475,84 @@ ;; Inverse operation: 'pct-decode' a sequence of octets. (define (pct-decode char-list char-set) + (define (decoding-error) + (uri-error 'uri-decode-string "Invalid UTF-8 sequence in percent-encoding")) (define (octet-decode h1 h2) (string->number (list->string (list h1 h2)) 16)) - (map (lambda (c) - (match c - ((#\% h1 h2) - (let ((dc (integer->char (octet-decode h1 h2)))) - (if (char-set-contains? char-set dc) dc c))) - (else c))) - char-list)) + ;; NOTE: If eat-rest-chars fails, it will raise an error. + ;; We could keep the character in its percent-encoded form, + ;; but that could lead to confusing double-decoding issues + ;; which might also be abused by an attacker. + (define (eat-rest-chars first-byte bytes-to-eat cl) + (let lp ((codepoint first-byte) + (n bytes-to-eat) + (cl cl)) + (cond + ((zero? n) + ;; Detect overlong encoding + (if (or (and (= bytes-to-eat 1) + (<= #x80 codepoint #x7FF)) + (and (= bytes-to-eat 2) + (<= #x800 codepoint #xFFFF)) + (and (= bytes-to-eat 3) + (<= #x10000 codepoint #x10FFFF))) + (let ((dc (integer->char codepoint))) + (if (char-set-contains? char-set dc) + (values dc cl) + (values #f #f))) + (decoding-error))) + ((null? cl) ; Incompletely encoded char + (decoding-error)) + (else + (let ((c (car cl))) + (match c + ((#\% h1 h2) + (let ((byte (octet-decode h1 h2))) + (if (= #x80 (bitwise-and byte #xC0)) + (let ((decoded-byte (bitwise-and byte #x7F))) + (lp (bitwise-ior (arithmetic-shift codepoint 6) decoded-byte) + (- n 1) + (cdr cl))) + (decoding-error)))) ; Incomplete/invalid encoding + ;; Incomplete/invalid encoding + (else (decoding-error)))))))) + (if (null? char-list) + char-list + (let lp ((cl char-list) + (res '())) + (if (null? cl) + (reverse res) + (let ((c (car cl))) + (match c + ((#\% h1 h2) + (let ((b1 (octet-decode h1 h2))) + (cond + ((= 0 (bitwise-and b1 #x80)) + (let ((dc (integer->char b1))) + (lp (cdr cl) (cons (if (char-set-contains? char-set dc) dc c) res)))) + ((= #xC0 (bitwise-and b1 #xE0)) + (receive (dc dcl) (eat-rest-chars (bitwise-and b1 #x1f) 1 (cdr cl)) + (if dc + (lp dcl (cons dc res)) + (lp (cdr cl) (cons c res))))) + ((= #xE0 (bitwise-and b1 #xF0)) + (receive (dc dcl) (eat-rest-chars (bitwise-and b1 #x0f) 2 (cdr cl)) + (if dc + (lp dcl (cons dc res)) + (lp (cdr cl) (cons c res))))) + + ((= #xF0 (bitwise-and b1 #xF8)) + (receive (dc dcl) (eat-rest-chars (bitwise-and b1 #x07) 3 (cdr cl)) + (if dc + (lp dcl (cons dc res)) + (lp (cdr cl) (cons c res))))) + + (else (lp (cdr cl) (cons c res)))))) + (else (lp (cdr cl) (cons c res))))))))) + + ;; RFC3986, section 2.2 ;; ;; Reserved characters.