[Zorba-coders] [Bug 1024448] [NEW] data-converter module problems with non utf-8 characters

Dennis Knochenwefel Fri, 13 Jul 2012 09:19:23 -0700

Public bug reported:

Public JSON streams often contain \uXXXX escapes for characters outside
the Basic Multilingual Plane, written as UTF-16 surrogate pairs. These
cause problems when parsing the JSON or tidying the HTML it contains
(see, for example,
http://www.charbase.com/1f44a-unicode-fisted-hand-sign ).


The following example query causes several problems:

  import module namespace json =
    "http://www.zorba-xquery.com/modules/converters/json";
  import module namespace html =
    "http://www.zorba-xquery.com/modules/converters/html";
  declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
  let $text := "&lt;p>"
    || json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text()
    || "&lt;/p>"
  return html:parse($text)

Problems:

1. html:parse() is declared with return type document-node(), but tries
to return an empty sequence in this example (discovered by Ghislain).

2. In file
src/com/zorba-xquery/www/modules/converters/html.xq.src/tidy_wrapper.h,
the function createHtmlItem(...) doesn't throw a proper error
(discovered by Ghislain), which makes debugging really hard. In
contrast, parse-xml throws a very helpful error:

  dynamic error [err:FODC0006]: invalid content passed to fn:parse-xml(): loader parsing error: Char 0xD83D out of allowed range;

Could html:parse report the same error?

3. json:parse() doesn't report an error here, which is good in my
opinion. Yet, as these escapes (UTF-16 surrogate pairs) are used a lot
in JSON, would it be possible to transform each pair into the character
it encodes, as valid UTF-8 (e.g. \ud83d\udc4a -> &#x1f44a;)?
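
To illustrate the transformation asked about in point 3, here is a
minimal Python sketch of how an escaped surrogate pair maps to a single
code point and then to UTF-8. It is not Zorba code: the function names
are hypothetical, and it assumes the escapes appear literally in the
text and ignores corner cases such as an escaped backslash in front of
"\u" or unpaired surrogates.

```python
import re

# Matches an escaped high surrogate (D800-DBFF) immediately followed by
# an escaped low surrogate (DC00-DFFF), as they appear in raw JSON text.
_SURROGATE_PAIR = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_surrogate_escapes(text: str) -> str:
    """Replace each escaped surrogate pair with the character it encodes."""
    def replace(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        return chr(combine_surrogates(high, low))

    return _SURROGATE_PAIR.sub(replace, text)


decoded = decode_surrogate_escapes(r"Let's get it. \ud83d\udc4a")
print(hex(ord(decoded[-1])))                # 0x1f44a
print(decoded.encode("utf-8")[-4:].hex())   # f09f918a
```

The resulting code point U+1F44A lies in the range XML allows, whereas
the lone surrogate 0xD83D is what fn:parse-xml() rejects in the error
quoted above.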

Maybe these findings are going to be a problem in JSONiq as well?

** Affects: zorba
     Importance: High
     Assignee: Paul J. Lucas (paul-lucas)
         Status: New


** Tags: improve-code-quality incorrect-result jsoniq usability

** Changed in: zorba
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Zorba
Coders, which is the registrant for Zorba.
https://bugs.launchpad.net/bugs/1024448

Title:
  data-converter module problems with non utf-8 characters

Status in Zorba - The XQuery Processor:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/zorba/+bug/1024448/+subscriptions

-- 
Mailing list: https://launchpad.net/~zorba-coders
Post to     : zorba-coders@lists.launchpad.net
Unsubscribe : https://launchpad.net/~zorba-coders
More help   : https://help.launchpad.net/ListHelp
