[Zorba-coders] [Bug 1024448] Re: data-converter module problems with non utf-8 characters

2012-07-16 Thread Dennis Knochenwefel
** Description changed: In public JSON streams lots of non-UTF-8 character escapes can be found causing some problems when parsing JSON or tidying the contained HTML (as for example marketed here: http://www.charbase.com/1f44a-unicode-fisted-hand-sign ). The following example Query

[Zorba-coders] [Bug 1024448] Re: data-converter module problems with non utf-8 characters

2012-07-16 Thread Dennis Knochenwefel
I'm not an encoding expert, so anything I say may potentially be wrong. The string \ud83d\udc4a is an example containing a single JavaScript-escaped special character (cf. http://www.charbase.com/1f44a-unicode-fisted-hand-sign ). This is very common in JSON data as JavaScript engines seem to use

[Zorba-coders] [Bug 1024448] Re: data-converter module problems with non utf-8 characters

2012-07-16 Thread Matthias Brantner
I'm not sure I understand. 1. The default for JSON strings seems to be UTF-8. 2. If a JSON string uses an encoding other than UTF-8, the entire string should be transcoded. This needs to be done when the data is retrieved. For example, by passing an encoding parameter to file:read-text. --
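
To illustrate the transcoding step described in this message, here is a minimal sketch in Python rather than XQuery. It does not use Zorba's file:read-text; the helper name transcode_to_utf8 and the ISO-8859-1 sample input are assumptions made up for the example.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration only; it is not part of Zorba.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Assumed sample input: "cafe" with an acute accent, stored as ISO-8859-1.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

# Transcode the whole string once, at the point where the data is retrieved.
utf8_bytes = transcode_to_utf8(latin1_bytes, "iso-8859-1")
```

The point of the sketch is only that the whole payload is converted once, up front, so that everything downstream can assume UTF-8.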

[Zorba-coders] [Bug 1024448] Re: data-converter module problems with non utf-8 characters

2012-07-16 Thread Paul J. Lucas
I believe what's going on is that byte sequences like \ud83d\udc4a are supposed to represent UTF-16 surrogate pairs. This is what Dennis suggests since 1F44A is the Unicode code point represented. IMHO, this is a bizarre way to do things: use a UTF-8 byte sequence to encode UTF-16 surrogate
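
The surrogate-pair reading above can be checked with a short sketch. It is written in Python, not against Zorba's JSON module; the helper name combine_surrogates is hypothetical, and the arithmetic is the standard UTF-16 surrogate decoding formula.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the bug report: \ud83d\udc4a
code_point = combine_surrogates(0xD83D, 0xDC4A)

# Python's json module joins the two \u escapes into a single character.
decoded = json.loads('"\\ud83d\\udc4a"')

# Encoded as UTF-8, that single character occupies four bytes.
utf8_bytes = decoded.encode("utf-8")
```

Here code_point is 0x1F44A, matching the code point named in the message, and decoded is a one-character string rather than two separate escapes.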

[Zorba-coders] [Bug 1024448] Re: data-converter module problems with non utf-8 characters

2012-07-13 Thread Paul J. Lucas
If there are 3 problems, there should be 3 different bugs, not all lumped together into a single bug. I have nothing to do with either html:parse() or tidy. As for the 3rd bug, you don't say what the error is that it should report. What exactly is wrong with JSON parsing? ** Changed in: zorba