Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

2020-02-16 Thread Maks Verver
*Richard:* the issue with the JSON extension seems unrelated to the issue
that I reported originally, which concerns the SQLite C API
(specifically, the sqlite3_bind_text16() family of functions). My issue is
still not fixed.

I've expanded my original sample code to make it easier to run the test and
reproduce the problems:
https://gist.github.com/maksverver/c3d5da8a0a9f2ec1c2a225209f290e13

Let me know if you need more help understanding/reproducing the problem.

*Dennis:* thanks for raising the issue again, and for digging through the
Unicode standard to confirm the most reasonable behavior.

I think your proposed patch is not quite correct; I spot two
problems. One:

  if( c>=0xDC00 && c<=0xE000 && TERM ) {

.. here, you should drop the `&& TERM`, otherwise you'll fail to replace an
unpaired surrogate when it occurs at the end of the string. Also, you should
check c<0xE000 rather than c<=0xE000, because U+E000 is a valid character by
itself. Two:

} else if( c>=0xD800 && TERM ){

Here, you should also check c<0xDC00 (or c<0xE000), otherwise you'll
misinterpret valid characters with code points 0xE000 and above as part of
a surrogate pair.

I believe my original patch handles these and all other cases correctly.

  - Maks.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

2020-01-14 Thread Richard Hipp
On 1/14/20, Richard Hipp  wrote:
> I'm having trouble reproducing this.

I went back to version 3.30.1 and I was able to reproduce it.  So I
bisected and found the following:

https://sqlite.org/src/timeline?c=51027f08c0478f1b

-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

2020-01-14 Thread Richard Hipp
On 1/13/20, Dennis Snell  wrote:
> We have a JSON document like this which we store in a table.
>
> {"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}
>
>
> The JSON is well-formed but the sequence of UTF-16 code points is invalid.
>
> When sqlite reads this data two types of further corruption occur ...

I'm having trouble reproducing this.  The following test script (one
of many) illustrates:

CREATE TABLE t1(j TEXT);
INSERT INTO t1(j) VALUES
('{"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}');
SELECT length(json_extract(j,'$.content')) FROM t1;
WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<9)
SELECT x, printf('%x',unicode(substr(json_extract(j,'$.content'),x)))
  FROM t1, c;

The column T1.J is loaded up with your original JSON with the invalid
code points. Then I run json_extract() to pull out the invalid string,
but SQLite says that the length is 9, which I think is the correct
answer.  The second SELECT with the CTE in it loops over each
character and prints out the HEX value for that character.  Here is
what I see:

1|1f170
2|fffd
3|28
4|6e
5|75
6|6c
7|6c
8|29
9|fffd

So the initial surrogate pair was rendered correctly as 0x1f170.  The
\ud83c without a following low surrogate was converted into 0xfffd
(which is the right thing to do, is it not?).  Then the 6 ASCII
characters follow.  Finally, the last isolated low surrogate is
(correctly?) converted into 0xfffd.

What behavior were you expecting?

Is there something that I can be doing differently to make it misbehave?

-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

2020-01-14 Thread Detlef Golze
I want to second that. This leads to the situation that data is accepted by the 
database but there is no way to read that data back; more precisely, I get the 
wrong (i.e. different) data back. I didn't check the suggested patch, but I 
don't believe it will work in all cases. I'd rather prefer rejecting such 
strings or implicitly converting them to a BLOB, which at least provides a way 
to get the data back.

Thanks,
Detlef.

-----Original Message-----
From: sqlite-users  On Behalf Of Dennis Snell
Sent: Monday, January 13, 2020 21:57
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] Feature request: more robust handling of invalid UTF-16 
data


Re: [sqlite] Feature request: more robust handling of invalid UTF-16 data

2020-01-13 Thread Dennis Snell
I’d like to raise this issue again and give my support for what Maks Verver 
recommended in 
https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg110107.html


Independently I came to this bug while working on an issue in Simplenote's 
Android app, where our data was being corrupted when saved to sqlite inside the 
Android SDK. We received some invalid UTF-16 sequences, and instead of rejecting 
them or decoding them properly, sqlite further mangles them and introduces more 
corruption.


Example:
We have a JSON document like this which we store in a table.


    {"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}


The JSON is well-formed but the sequence of UTF-16 code points is invalid. We 
have fixed our side of the equation, which prevents creating this content, but 
we still receive the invalid sequence from time to time from older client 
libraries.


When sqlite reads this data two types of further corruption occur: reading 
beyond a code unit subsequence; and conflating high and low surrogates.


Reading beyond a code unit subsequence:


When the `TERM` was introduced[1] and updated[2] it appears to have been 
designed for the case where a string ends mid-surrogate, but it does not 
attempt to address unpaired surrogates in the middle of an input text. In our 
case the `READ_UTF16BE` macro accepts the second `\ud83c` code unit and then 
consumes the following `\u0028`, which is the separate and well-formed "(". In 
turn this produces the more corrupted value `\ud83c\udc28`, code point U+1F028, 
plus "null)" without the leading "(".


Conflating high and low surrogates:


The `READ_UTF16__` macros both begin processing a surrogate pair whenever the 
input code unit satisfies `0xD800 <= c < 0xE000`. Because of this they will 
pick up on unpaired low surrogates, consume the next character, and then 
create a more corrupted Unicode string.


In our case, once we reach the `\udd71` the macro consumes the following quote, 
which in the JSON document closes the string, and puts them together as 
`\udd71\u0022` producing the invalid code point U+6C422. Moreover, because it 
consumed the string-ending quote it also corrupted the entire JSON document, as 
the new output resembles the following:


    {"content": "\ud83c\udd70\ud83c\udc28null)\ud971\udc22,"tags":[]}


That is, we write this invalid Unicode sequence but valid JSON document into 
sqlite and read back an invalid Unicode sequence _and_ invalid JSON (note that 
the quote which closed the "content" string has been consumed).


Supporting Unicode spec:


The Unicode specification[3], sections 3.2 and 3.9, speaks to this situation 
and provides a specific, comparable example:


    When a process interprets a code unit sequence which purports to be in a
    Unicode character encoding form, it shall treat ill-formed code unit
    sequences as an error condition and shall not interpret such sequences as
    characters.


    Furthermore, such a process must not treat any adjacent well-formed code
    unit sequences as being part of those ill-formed code unit sequences.


    For example, with the input UTF-8 code unit sequence <41 C2 C3 B1 42>,
    such a UTF-8 conversion process must not return <U+0041, U+FFFD> or
    <U+0041, U+FFFD, U+0042>, because either of those outputs would be the
    result of interpreting a well-formed subsequence as being part of the
    ill-formed subsequence. The expected return value for such a process
    would instead be <U+0041, U+FFFD, U+00F1, U+0042>.


Supporting Maks' suggestion to use the replacement character on error, 
section 23.8[4] provides this guidance:


    It [U+FFFD] can be substituted for any "unknown" character in another
    encoding that cannot be mapped in terms of known Unicode characters. It
    can also be used as one means of indicating a conversion error, when
    encountering an ill-formed sequence in a conversion between Unicode
    encoding forms.


Patching:


The `READ_UTF16__` macros should thus not only do what Maks proposed, which is 
to verify that the code unit following a high surrogate is in fact a low 
surrogate, but also verify that we don't begin interpreting a surrogate pair 
when encountering an unpaired low surrogate. I propose this change instead:


    #define READ_UTF16LE(zIn, TERM, c){                                     \
        c = (*zIn++);                                                       \
        c += ((*zIn++)<<8);                                                 \
        if( c>=0xDC00 && c<=0xE000 && TERM ) {                              \
            c = 0xFFFD;                                                     \
        } else if( c>=0xD800 && TERM ){                                     \
            int c2 = (zIn[0] | (zIn[1] << 8));                              \
            if ( c2>=0xDC00 && c2<0xE000) {                                 \
                zIn += 2;                                                   \
                c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10); \
            } else {                                                        \
                c = 0xFFFD;                                                 \
            }                                                               \
        }                                                                   \
    }



This will solve both problems: reading past an ill-formed surrogate sequence, 
and interpreting an ill-formed surrogate sequence as a valid one. I love sqlite 
and I really appreciate how the code is laid out; it made it easy to find this 
macro in the source and identify the problem.


Dennis Snell
Automattic, Inc.


[1]: https://sqlite.org/src/info/19064d7cea


[sqlite] Feature request: more robust handling of invalid UTF-16 data

2018-05-21 Thread Maks Verver
*Background:* UTF-16 is an encoding which allows most characters to be
encoded in a single 16-bit code unit. Characters outside the basic
multilingual plane (i.e. code points between 0x10000 and 0x10FFFF) require
two code units: a high surrogate between 0xD800 and 0xDBFF, followed by a
low surrogate between 0xDC00 and 0xDFFF. Strings that contain unpaired
surrogates are invalid UTF-16.

*Problem:* sqlite silently accepts invalid UTF-16 strings as arguments to
functions like sqlite3_bind_text16(), but corrupts them when converting
them to UTF-8 and back. A common use case where this happens implicitly is
when using a database with UTF-8 as the native text encoding from a
programming language that uses UTF-16 to represent strings in memory.

Specifically, what happens depends on where the unpaired surrogate
character occurs:

   - *In the middle of the string*: the surrogate is consumed together with
   the following character, so that "fooXbar" (where X is an unpaired
   surrogate) may be transformed into "fooYar" (note the missing 'b'), where
   Y is some seemingly-random character outside the BMP. When read back,
   it's not obvious that a corruption has occurred.
   - *At the end of the string*: the surrogate is consumed and encoded in
   UTF-8, which is technically invalid (UTF-8 is not supposed to encode
   surrogate characters). How this reads back depends on whether the value is
   accessed through sqlite3_column_text() or sqlite3_column_text16(): the
   latter uses READ_UTF8() internally, which detects the invalid encoding and
   substitutes the replacement character 0xFFFD. So the storage is in a
   logically inconsistent state at this point.

I've created a small proof-of-concept to reproduce some of these issues,
here:
https://gist.github.com/maksverver/2b225637186d64878d3e635ef0a4fd18


The problem is caused by the implementation of READ_UTF16 in utf.c, which
blindly assumes that the input string is valid UTF-16 and doesn't check that
surrogates pair up as
required. Although arguably the problems originated with the caller
that passed bad string data to sqlite, it would be better if sqlite
detected and corrected invalid UTF-16 strings during conversion by
converting unpaired surrogates to the Unicode replacement character 0xFFFD.
This would be consistent with the behavior of READ_UTF8, and would make
sqlite more robust.

As a concrete suggestion, the macro READ_UTF16LE, which looks like this:

#define READ_UTF16LE(zIn, TERM, c){                                   \
  c = (*zIn++);                                                       \
  c += ((*zIn++)<<8);                                                 \
  if( c>=0xD800 && c<0xE000 && TERM ){                                \
    int c2 = (*zIn++);                                                \
    c2 += ((*zIn++)<<8);                                              \
    c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);   \
  }                                                                   \
}

Could be changed to something like this:

#define READ_UTF16LE(zIn, TERM, c){                                   \
  c = (*zIn++);                                                       \
  c += ((*zIn++)<<8);                                                 \
  if( c>=0xD800 && c<0xE000 ){                                        \
    int c2 = (c<0xDC00 && TERM) ? (zIn[0] | (zIn[1]<<8)) : 0;         \
    if( c2>=0xDC00 && c2<0xE000 ){                                    \
      zIn += 2;                                                       \
      c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10); \
    } else {                                                          \
      c = 0xFFFD;                                                     \
    }                                                                 \
  }                                                                   \
}

(And similarly for READ_UTF16BE.)

Kind regards,
Maks Verver.