JSON_EXTRACT [1] produces gibberish for encoded non-BMP characters. Example:

    sqlite> SELECT JSON_EXTRACT('"t\u00e8\u015b\ud835\udd99"', '$');
    tèś������

Expected result is "tèś𝖙".

Since character "𝖙" (U+1D599) is not in the Basic Multilingual Plane, it is encoded as the UTF-16 surrogate pair "\ud835\udd99" in JSON. [2] Seems JSON_EXTRACT decodes surrogate characters independently and do not combine then into a non-BMP character like in CESU-8 [3] or Modified UTF-8 [4].

It works correctly if the JSON contains not-escaped characters:

    sqlite> select JSON_EXTRACT('"tèś𝖙"', '$');
    tèś𝖙


The bug was originally reported on the Python bug tracker. [5]

If SQLite officially uses CESU-8 or Modified UTF-8 instead of the standard UTF-8, this should be documented. Python would use special workarounds for this. But it would be better to use the standard UTF-8.


[1] https://www.sqlite.org/json1.html#jex

[2] https://tools.ietf.org/html/rfc8259#section-7

[3] https://en.wikipedia.org/wiki/CESU-8

[4] https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

[5] https://bugs.python.org/issue38749

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to