** Description changed:

  In public Json streams lots of non-utf8 character escapes can be found
  causing some problems when parsing json or tidying the contained html
  (as for example marketed here:
  http://www.charbase.com/1f44a-unicode-fisted-hand-sign).

  The following example Query causes a whole bunch of problems:

- import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
- import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
- declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
- let $text := "<p>" || json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text() || "</p>"
- return html:parse($text)
+ import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
+ import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+ declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
+ let $text := "<p>" || json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text() || "</p>"
+ return html:parse($text)

  Problems:

  1. html:parse () has return type document-node(), but tries to return an
  empty-sequence in this example (discovered by ghislain)

- 2. in file src/com/zorba-xquery/www/modules/converters/html.xq.src/tidy_wrapper.h function createHtmlItem(...) doesn't throw a proper error message (discovered by ghislain) which makes debugging really hard. In contrast, parse-xml throws a very helpful error:
-
- dynamic error [err:FODC0006]: invalid content passed to fn:parse-xml(): loader parsing error: Char 0xD83D out of allowed range;
+ * --> moved to bug #1025194 *
+
+ 2. in file src/com/zorba-
+ xquery/www/modules/converters/html.xq.src/tidy_wrapper.h function
+ createHtmlItem(...) doesn't throw a proper error message (discovered by
+ ghislain) which makes debugging really hard. In contrast, parse-xml
+ throws a very helpful error:
+
+ dynamic error [err:FODC0006]: invalid content passed to fn:parse-
+ xml(): loader parsing error: Char 0xD83D out of allowed range;

  Could html:parse report the same error?
+
+ * --> moved to bug #1025193 *

  3. json:parse() doesn't report an error here which is good in my
  opinion. Yet, as these utf-16 (?) encoded characters are used a lot in
  json, would it be possible to transform them into valid utf-8 (e.g.
  \ud83d\udc4a -> 👊)?

  Maybe these findings are going to be a problem in Jsoniq as well?

--
You received this bug notification because you are a member of Zorba
Coders, which is the registrant for Zorba.
https://bugs.launchpad.net/bugs/1024448

Title:
  data-converter module problems with non utf-8 characters

Status in Zorba - The XQuery Processor:
  Incomplete

Bug description:
  In public Json streams lots of non-utf8 character escapes can be found
  causing some problems when parsing json or tidying the contained html
  (as for example marketed here:
  http://www.charbase.com/1f44a-unicode-fisted-hand-sign).

  The following example Query causes a whole bunch of problems:

  import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
  import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
  declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
  let $text := "<p>" || json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text() || "</p>"
  return html:parse($text)

  Problems:

  1. html:parse () has return type document-node(), but tries to return
  an empty-sequence in this example (discovered by ghislain)

  * --> moved to bug #1025194 *

  2. in file
  src/com/zorba-xquery/www/modules/converters/html.xq.src/tidy_wrapper.h
  function createHtmlItem(...) doesn't throw a proper error message
  (discovered by ghislain) which makes debugging really hard. In
  contrast, parse-xml throws a very helpful error:

  dynamic error [err:FODC0006]: invalid content passed to
  fn:parse-xml(): loader parsing error: Char 0xD83D out of allowed range;

  Could html:parse report the same error?

  * --> moved to bug #1025193 *

  3. json:parse() doesn't report an error here which is good in my
  opinion. Yet, as these utf-16 (?) encoded characters are used a lot in
  json, would it be possible to transform them into valid utf-8 (e.g.
  \ud83d\udc4a -> 👊)?

  Maybe these findings are going to be a problem in Jsoniq as well?

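Note on point 3: the escape sequence \ud83d\udc4a is a UTF-16 surrogate pair written with JSON \u escapes. The two halves (0xD83D and 0xDC4A) are not valid characters on their own, which is consistent with the "Char 0xD83D out of allowed range" message from fn:parse-xml(). Combined, they denote the single code point U+1F44A, which can then be encoded as UTF-8. The following is only a minimal Python sketch of that arithmetic, not Zorba's code; the name combine_surrogates is hypothetical and chosen for illustration.

```python
import json

# Surrogate ranges as defined for UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 surrogate pair.

    Hypothetical helper for illustration; raises ValueError when either
    half is outside its surrogate range.
    """
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not LOW_START <= low <= LOW_END:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each half contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The pair from the example query in this bug report.
code_point = combine_surrogates(0xD83D, 0xDC4A)   # 0x1F44A
utf8_bytes = chr(code_point).encode("utf-8")      # b"\xf0\x9f\x91\x8a"

# For comparison: Python's JSON parser combines the pair while parsing,
# so the resulting string ends in the single character U+1F44A.
parsed = json.loads('{"text": "Let\'s get it. \\ud83d\\udc4a"}')["text"]
```

A JSON parser that applies this combination when it meets a high-surrogate escape directly followed by a low-surrogate escape would hand html:parse a string with one valid character instead of two lone surrogates. Whether json:parse should do this, or report an error for unpaired surrogates, is the open question in point 3.
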
To manage notifications about this bug go to:
https://bugs.launchpad.net/zorba/+bug/1024448/+subscriptions

--
Mailing list: https://launchpad.net/~zorba-coders
Post to     : zorba-coders@lists.launchpad.net
Unsubscribe : https://launchpad.net/~zorba-coders
More help   : https://help.launchpad.net/ListHelp