** Description changed:
Public JSON streams often contain non-UTF-8 character escapes, which
cause problems when parsing the JSON or tidying the contained HTML
(see, for example,
http://www.charbase.com/1f44a-unicode-fisted-hand-sign ).
The following example Query
I'm not an encoding expert, so anything I say may potentially be wrong.
The string \ud83d\udc4a is an example containing a single
JavaScript-escaped special character (cf.
http://www.charbase.com/1f44a-unicode-fisted-hand-sign ). This is very
common in JSON data as JavaScript engines seem to use
I'm not sure I understand.
1. The default for JSON strings seems to be UTF-8.
2. If a JSON string uses an encoding other than UTF-8, the entire string should
be transcoded. This needs to be done when the data is retrieved, for example
by passing an encoding parameter to file:read-text.
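
To illustrate point 2, here is a rough sketch in Python (not Zorba/XQuery) of
transcoding at retrieval time. The helper name read_json_text, the sample
document and the ISO-8859-1 encoding are all made up for the illustration; the
only idea taken from the discussion is that the encoding is supplied when the
data is read, so everything downstream sees properly decoded text.

```python
import json
import os
import tempfile


def read_json_text(path, encoding="utf-8"):
    """Hypothetical helper: decode the file with the given encoding, then parse.

    The transcoding happens once, at read time, so the JSON parser only
    ever sees decoded text.
    """
    with open(path, "r", encoding=encoding) as handle:
        return json.load(handle)


# Assumed sample data: a JSON document stored as ISO-8859-1 bytes.
raw_bytes = '{"name": "caf\u00e9"}'.encode("iso-8859-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp:
    tmp.write(raw_bytes)
    sample_path = tmp.name

try:
    # The encoding is passed at retrieval time, analogous to an
    # encoding parameter on a read-text function.
    data = read_json_text(sample_path, encoding="iso-8859-1")
finally:
    os.remove(sample_path)

# Re-encoding the decoded value yields UTF-8 bytes.
name_utf8 = data["name"].encode("utf-8")
```

Reading the same bytes with the default UTF-8 decoder would fail on the lone
0xE9 byte, which is why the encoding has to be known when the data is fetched.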
--
I believe what's going on is that byte sequences like \ud83d\udc4a are
supposed to represent UTF-16 surrogate pairs. This is what Dennis
suggests since 1F44A is the Unicode code point represented.
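
For what it's worth, the surrogate-pair reading can be checked with a few
lines of Python. This is only a sketch: combine_surrogates is a name I made
up, and the arithmetic is the standard UTF-16 decoding rule, not anything
taken from Zorba's code.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid UTF-16 surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two escapes from the example string.
code_point = combine_surrogates(0xD83D, 0xDC4A)

# Python's json module performs the same combination when it sees
# two adjacent \u escapes that form a surrogate pair.
decoded = json.loads('"\\ud83d\\udc4a"')

# The single resulting character, encoded as real UTF-8 bytes.
utf8_bytes = decoded.encode("utf-8")
```

Here code_point comes out as 0x1F44A, decoded is a one-character string, and
utf8_bytes is the four-byte sequence F0 9F 91 8A.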
IMHO, this is a bizarre way to do things: use a UTF-8 byte sequence to
encode UTF-16 surrogate
If there are 3 problems, there should be 3 different bugs, not all
lumped together into a single bug. I have nothing to do with either
html:parse() or tidy.
As for the 3rd bug, you don't say what the error is that it should
report. What exactly is wrong with JSON parsing?
** Changed in: zorba