Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

Dennis Snell Mon, 13 Jan 2020 13:37:43 -0800

I’d like to raise this issue again and give my support for what Maks Verver 
recommended in 
https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg110107.html



Independently I came to this bug while working on an issue in Simplenote’s 
Android app where our data was being corrupted when saved to sqlite inside the 
Android SDK. We received some invalid UTF-16 sequences and instead of rejecting 
them or decoding it properly sqlite is further mangling them and introducing 
more corruption.


Example:
We have a JSON document like this which we store in a table.


    {“content”: “\ud83c\udd70\ud83c(null)\udd71”,”tags":[]}


The JSON is well-formed but the sequence of UTF-16 code points is invalid. We 
have fixed our side of the equation which prevents creating this content, but 
we still receive from time to time the invalid sequence from older client 
libraries.


When sqlite reads this data two types of further corruption occur: reading 
beyond a code unit subsequence; and conflating high and low surrogates.


Reading beyond a code unit subsequence:


When the `TERM` was introduced[1] and updated[2] it appears to have been 
designed to assume that a string ends mid-surrogate but it does not attempt to 
address unpaired surrogates in the middle of an input text. In our case the 
`READ_UTF16BE` macro accepts the second `\ud83c` code unit and then consumes 
the following `\u0028` which is the separate and well-formed “(“. In turn this 
produces the more corrupted value of `\ud83c\udc28`, code point U+1F028, plus 
“null)” without the leading “(“.


Conflating high and low surrogates:


The `READ_UTF16__` macros both attempt to start processing surrogate pairs 
based on the `0xD800 <= c <= 0xE000` value of the input code unit. Because of 
this they will pick up on unpaired low surrogates, consume the next character, 
and then create a more corrupted Unicode string.


In our case, once we reach the `\udd71` the macro consumes the following quote, 
which in the JSON document closes the string, and puts them together as 
`\udd71\u0022` producing the invalid code point U+6C422. Moreover, because it 
consumed the string-ending quote it also corrupted the entire JSON document, as 
the new output resembles the following:


    {“content”: “\ud83c\udd70\ud83c\udc28ull)\ud971\udc22,”tags”:[]}


That is, we write this invalid Unicode sequence but valid JSON document into 
sqlite and read back an invalid Unicode sequence _and_ invalid JSON (see the 
missing quote before “tags”).


Supporting Unicode spec:


The Unicode specification[3] sections 3.2 and 3.9 speak to this situation and 
provides a specific comparable example:


    When a process interprets a code unit sequence which purports to be in a 
Unicode
    character encoding form, it shall treat ill-formed code unit sequences as 
an error
    condition and shall not interpret such sequences as characters.


    Furthermore, such a process must not treat any adjacent well-formed code 
unit
    sequences as being part of those ill-formed code unit sequences.


    For example, with the input UTF-8 code unit sequence <C2 41 42>, such a 
UTF-8
    conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because
    either of those outputs would be the result of interpreting a well-formed 
subsequence
    as being part of the ill-formed subsequence. The expected return value for 
such a
    process would instead be <U+FFFD, U+0041, U+0042>.


Supporing Maks’ suggestion to use the replacement character on error section 
23.8[4] provides the guidance:


    It [U+FFFD] can be substituted for any “unknown” character in another 
encoding that
    cannot be mapped in terms of known Unicode characters. It can also be used 
as one
    means of indicating a conversion error, when encountering an ill-formed 
sequence in
    a conversion between Unicode encoding forms.


Patching:


The `READ_UTF16__` macros thus should do not only what Maks proposed, which is 
to verify that the character following a surrogate half is also a surrogate 
half, but also to verify that we don’t start interpreting a surrogate sequence 
when encountering an unpaired low surrogate. I propose this change instead:


    #define READ_UTF16LE(zIn, TERM, c){
        c = (*zIn++);
        c += ((*zIn++)<<8);
        if( c>=0xDC00 && c<=0xE000 && TERM ) {
            c = 0xFFFD
        } else if( c>=0xD800 && TERM ){
            int c2 = (zIn[0] | (zIn[1] << 8));
            if ( c2>=0xDC00 && c2<0xE000) {
                zIn += 2;
                c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);
            } else {
                c = 0xFFFD;
            }
        }
    }



This will solve both problem of reading past an ill-formed surrogate sequence 
and of interpreting an ill-formed surrogate sequence. I love sqlite and I 
really appreciated how the code is laid out which made it so easy to find this 
macro in the source and identify the problem.


Dennis Snell
Automattic, Inc.


[1]: https://sqlite.org/src/info/19064d7cea

[2]: https://sqlite.org/src/info/a96b4e8c01d167d3

[3]: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf

[4]: https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf






_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

Reply via email to