JSON_EXTRACT [1] produces gibberish for encoded non-BMP characters. Example:
sqlite> SELECT JSON_EXTRACT('"t\u00e8\u015b\ud835\udd99"', '$');
tèś������
Expected result is "tèś𝖙".
Since character "𝖙" (U+1D599) is not in the Basic Multilingual Plane,
it is encoded as the UTF-16 surrogate pair "\ud835\udd99" in JSON. [2]
Seems JSON_EXTRACT decodes surrogate characters independently and do not
combine then into a non-BMP character like in CESU-8 [3] or Modified
UTF-8 [4].
It works correctly if the JSON contains not-escaped characters:
sqlite> select JSON_EXTRACT('"tèś𝖙"', '$');
tèś𝖙
The bug was originally reported on the Python bug tracker. [5]
If SQLite officially uses CESU-8 or Modified UTF-8 instead of the
standard UTF-8, this should be documented. Python would use special
workarounds for this. But it would be better to use the standard UTF-8.
[1] https://www.sqlite.org/json1.html#jex
[2] https://tools.ietf.org/html/rfc8259#section-7
[3] https://en.wikipedia.org/wiki/CESU-8
[4] https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
[5] https://bugs.python.org/issue38749
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users