Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-25 Thread Florian Pflug
On Jul25, 2011, at 07:35 , Joey Adams wrote:
 On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 Should we mimic IEEE floats and preserve -0 versus +0 while treating
 them as equal?  Or should we treat JSON floats like numeric and
 convert -0 to 0 on input?  Or should we do something else?  I think
 converting -0 to 0 would be a bad idea, as it would violate the
 intuitive assumption that JSON can be used to marshal double-precision
 floats.
 
 On the other hand, JavaScript's own .toString and JSON.stringify turn
 -0 into 0, so JSON can't marshal -0 around, anyway (in practice).  Now
 I think turning -0 into 0 would be fine for canonicalizing numbers in
 json_in.

+1.

best regards,
Florian Pflug




Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-25 Thread Florian Pflug
On Jul25, 2011, at 02:03 , Florian Pflug wrote:
 On Jul25, 2011, at 00:48 , Joey Adams wrote:
 Should we follow the JavaScript standard for rendering numbers (which
 my suggestion approximates)?  Or should we use the shortest encoding
 as Florian suggests?
 
 In the light of the above, consider my suggestion withdrawn. I now think
 we should just follow the JavaScript standard as closely as possible.
 As you said, it's pretty much the same as your suggestion, just more precise
 in the handling of some corner-cases like infinity, nan, +/-0, some
 questions of leading and trailing zeros, ...

Just FYI, I browsed through the ECMA Standard you referenced again, and realized
that they explicitly forbid JSON numeric values to be NaN or (-)Infinity
(Page 205, Step 9 at the top of the page). RFC 4627 seems to take the same 
stand.

I fail to see the wisdom in that, but it's what the standard says...

best regards,
Florian Pflug




Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Florian Pflug
On Jul24, 2011, at 05:14 , Robert Haas wrote:
 On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 Interesting.  This leads to a couple more questions:
 
  * Should the JSON data type (eventually) have an equality operator?
 
 +1.

+1.

  * Should the JSON input function alphabetize object members by key?
 
 I think it would probably be better if it didn't.  I'm wary of
 overcanonicalization.  It can be useful to have things come back out
 in more or less the format you put them in.

The downside being that we'd then either need to canonicalize in
the equality operator, or live with either no equality operator or
a rather strange one.

Also, if we don't canonicalize now, we (or rather our users) are in
for some pain should we ever decide to store JSON values in some other
form than plain text. Because if we do that, we'd presumably want
to order the members in some predefined way (by their hash value,
for example).

So, from me +1 for alphabetizing members.

 If we canonicalize strings and numbers and alphabetize object members,
 then our equality function is just texteq.  The only stumbling block
 is canonicalizing numbers.  Fortunately, JSON's definition of a
 number is its decimal syntax, so the algorithm is child's play:
 
  * Figure out the digits and exponent.
  * If the exponent is greater than 20 or less than -6 (arbitrary), use
 exponential notation.
 
 The problem is: 2.718282e-1000 won't equal 0 as may be expected.  I
 doubt this matters much.
 
 I don't think that 2.718282e-1000 SHOULD equal 0.

I agree. As for your proposed algorithm, I suggest to instead use
exponential notation if it produces a shorter textual representation.
In other words, for values between -1 and 1, we'd switch to exponential
notation if there's more than 1 leading zero (to the right of the decimal
point, of course), and for values outside that range if there're more than
2 trailing zeros and no decimal point. All after redundant zeros and
decimal points are removed. So we'd store

0 as 0
1 as 1
0.1 as 0.1
0.01 as 0.01
0.001 as 1e-3
10 as 10
100 as 100
1000 as 1e3
1000.1 as 1000.1
1001 as 1001
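
For illustration only, here is a small Python sketch of the "use whichever
rendering is shorter, preferring plain notation on ties" idea; it reproduces
the table above, but it is not the proposed C implementation and it simplifies
sign handling:

    from decimal import Decimal

    def shortest_form(token):
        # Toy sketch of the "shorter of plain vs. exponential" rule above.
        d = Decimal(token).normalize()        # drop redundant zeros: 1000 -> 1E+3
        sign, digits, exp = d.as_tuple()
        plain = format(d, "f")                # plain notation, e.g. "1000", "0.001"
        mantissa = "".join(map(str, digits))
        if len(mantissa) > 1:
            mantissa = mantissa[0] + "." + mantissa[1:]
        sci = "%s%se%d" % ("-" if sign else "", mantissa, exp + len(digits) - 1)
        return plain if len(plain) <= len(sci) else sci

    for t in ["0", "1", "0.1", "0.01", "0.001", "10", "100", "1000", "1000.1", "1001"]:
        print(t, "->", shortest_form(t))      # matches the table above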

 If, in the future, we add the ability to manipulate large JSON trees
 efficiently (e.g. by using an auxiliary table like TOAST does), we'll
 probably want unique members, so enforcing them now may be prudent.
 
 I doubt you're going to want to reinvent TOAST, but I do think there
 are many advantages to forbidding duplicate keys.  ISTM the question
 is whether to throw an error or just silently discard one of the k/v
 pairs.  Keeping both should not be on the table, IMHO.

+1.

best regards,
Florian Pflug




Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Joey Adams
On Sat, Jul 23, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:
 I doubt you're going to want to reinvent TOAST, ...

I was thinking about making it efficient to access or update
foo.a.b.c.d[1000] in a huge JSON tree.  Simply TOASTing the varlena
text means we have to unpack the entire datum to access and update
individual members.  An alternative would be to split the JSON into
chunks (possibly by using the pg_toast_id table) and have some sort
of index that can be used to efficiently look up values by path.

This would not be trivial, and I don't plan to implement it any time soon.


On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug f...@phlo.org wrote:
 On Jul24, 2011, at 05:14 , Robert Haas wrote:
 On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 ... Fortunately, JSON's definition of a
 number is its decimal syntax, so the algorithm is child's play:

  * Figure out the digits and exponent.
  * If the exponent is greater than 20 or less than -6 (arbitrary), use
 exponential notation.



 I agree. As for your proposed algorithm, I suggest to instead use
 exponential notation if it produces a shorter textual representation.
 In other words, for values between -1 and 1, we'd switch to exponential
 notation if there's more than 1 leading zero (to the right of the decimal
 point, of course), and for values outside that range if there're more than
 2 trailing zeros and no decimal point. All after redundant zeros and
 decimal points are removed. So we'd store

 0 as 0
 1 as 1
 0.1 as 0.1
 0.01 as 0.01
 0.001 as 1e-3
 10 as 10
 100 as 100
 1000 as 1e3
 1000.1 as 1000.1
 1001 as 1001


Interesting idea.  The reason I suggested using exponential notation
only for extreme exponents (less than -6 or greater than +20) is
partly for presentation value.  Users might be annoyed to see 1000000
turned into 1e6.  Moreover, applications working solely with integers
that don't expect the floating point syntax may choke on the converted
numbers.  32-bit integers can be losslessly encoded as IEEE
double-precision floats (JavaScript's internal representation), and
JavaScript's algorithm for converting a number to a string ([1],
section 9.8.1) happens to preserve the integer syntax (I think).

Should we follow the JavaScript standard for rendering numbers (which
my suggestion approximates)?  Or should we use the shortest encoding
as Florian suggests?

- Joey

 [1]: 
http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Florian Pflug
On Jul25, 2011, at 00:48 , Joey Adams wrote:
 On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug f...@phlo.org wrote:
 On Jul24, 2011, at 05:14 , Robert Haas wrote:
 On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 ... Fortunately, JSON's definition of a
 number is its decimal syntax, so the algorithm is child's play:
 
  * Figure out the digits and exponent.
  * If the exponent is greater than 20 or less than -6 (arbitrary), use
 exponential notation.
 
 I agree. As for your proposed algorithm, I suggest to instead use
 exponential notation if it produces a shorter textual representation.
 In other words, for values between -1 and 1, we'd switch to exponential
 notation if there's more than 1 leading zero (to the right of the decimal
 point, of course), and for values outside that range if there're more than
 2 trailing zeros and no decimal point. All after redundant zeros and
 decimal points are removed. So we'd store
 
 0 as 0
 1 as 1
 0.1 as 0.1
 0.01 as 0.01
 0.001 as 1e-3
 10 as 10
 100 as 100
 1000 as 1e3
 1000.1 as 1000.1
 1001 as 1001
 
 Interesting idea.  The reason I suggested using exponential notation
 only for extreme exponents (less than -6 or greater than +20) is
 partly for presentation value.  Users might be annoyed to see 1000000
 turned into 1e6.

I'm not concerned about that, but ...

 Moreover, applications working solely with integers
 that don't expect the floating point syntax may choke on the converted
 numbers.

now that you say it, that's definitely a concern.

 32-bit integers can be losslessly encoded as IEEE
 double-precision floats (JavaScript's internal representation), and
 JavaScript's algorithm for converting a number to a string ([1],
 section 9.8.1) happens to preserve the integer syntax (I think).

Indeed. In fact, it seems to be designed to use the integer syntax
for all integral values with <= 66 binary digits. log10(2^66) ~ 19.87

 Should we follow the JavaScript standard for rendering numbers (which
 my suggestion approximates)?  Or should we use the shortest encoding
 as Florian suggests?

In the light of the above, consider my suggestion withdrawn. I now think
we should just follow the JavaScript standard as closely as possible.
As you said, it's pretty much the same as your suggestion, just more precise
in the handling of some corner-cases like infinity, nan, +/-0, some
questions of leading and trailing zeros, ...

I wouldn't have made my suggestion had I realized earlier that the limit
of 20 for the exponent was carefully chosen to ensure that the full
range of a 64-bit integer value would be represented in non-exponential
notation. I assumed the bracketed "arbitrary" in your description applied
to both the upper (20) as well as the lower (-6) bound, when it really only
applies to the lower bound. Sorry for that.

(I am now curious where the seemingly arbitrary lower bound of -6 comes
from, though. The only explanation I can come up with is that somebody
figured that 0.000001 is still easily distinguished visually from
0.00001, but not so much from 0.0000001)

best regards,
Florian Pflug




Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Joey Adams
On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug f...@phlo.org wrote:
 The downside being that we'd then either need to canonicalize in
 the equality operator, or live with either no equality operator or
 a rather strange one.

It just occurred to me that, even if we sort object members, texteq
might not be a sufficient way to determine equality.  In particular,
IEEE floats treat +0 and -0 as two different things, but they are
equal when compared.  Note that we're only dealing with a decimal
representation; we're not (currently) converting to double-precision
representation and back.
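
For concreteness, the same point can be seen from any IEEE-754 implementation;
a quick Python illustration (not tied to the patch):

    import json, math

    print(-0.0 == 0.0)               # True  -- IEEE comparison treats them as equal
    print(math.copysign(1.0, -0.0))  # -1.0  -- but the sign bit is still there
    print(json.dumps(-0.0))          # -0.0  -- Python's encoder keeps the sign;
                                     # JavaScript's JSON.stringify emits "0" instead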

Should we mimic IEEE floats and preserve -0 versus +0 while treating
them as equal?  Or should we treat JSON floats like numeric and
convert -0 to 0 on input?  Or should we do something else?  I think
converting -0 to 0 would be a bad idea, as it would violate the
intuitive assumption that JSON can be used to marshal double-precision
floats.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Joey Adams
On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams joeyadams3.14...@gmail.com wrote:
 Should we mimic IEEE floats and preserve -0 versus +0 while treating
 them as equal?  Or should we treat JSON floats like numeric and
 convert -0 to 0 on input?  Or should we do something else?  I think
 converting -0 to 0 would be a bad idea, as it would violate the
 intuitive assumption that JSON can be used to marshal double-precision
 floats.

On the other hand, JavaScript's own .toString and JSON.stringify turn
-0 into 0, so JSON can't marshal -0 around, anyway (in practice).  Now
I think turning -0 into 0 would be fine for canonicalizing numbers in
json_in.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-23 Thread Joey Adams
Also, should I forbid the escape \u0000 (in all database encodings)?

Pros:

 * If \u0000 is forbidden, and the server encoding is UTF-8, then
every JSON-wrapped string will be convertible to TEXT (see the
sketch after this list).

 * It will be consistent with the way PostgreSQL already handles text,
and with the decision to use database-encoded JSON strings.

 * Some applications choke on strings with null characters.  For
example, in some web browsers but not others, if you pass
"Hello\u0000world" to document.write() or assign it to a DOM object's
innerHTML, it will be truncated to "Hello".  By banning \u0000, users
can catch such rogue strings early.

 * It's a little easier to represent internally.

Cons:

 * Means the JSON type will accept a subset of the JSON described in
RFC 4627.  However, the RFC does say "An implementation may set limits
on the length and character contents of strings", so we can arguably
get away with banning \u0000 while being law-abiding citizens.

 * Being able to store U+0000–U+00FF means users can use JSON strings
to hold binary data by treating it as Latin-1.
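
To make the TEXT point above concrete, here is an illustration using Python's
json module (not the proposed json_in): the escape itself is legal JSON, but
the decoded string contains an embedded NUL, which a PostgreSQL text value
cannot hold.

    import json

    # Legal JSON, but the decoded value contains U+0000, which cannot be
    # stored in a PostgreSQL text value.
    s = json.loads('"Hello\\u0000world"')
    print(len(s), repr(s))   # 11 'Hello\x00world'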

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-23 Thread Robert Haas
On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams joeyadams3.14...@gmail.com wrote:
 Interesting.  This leads to a couple more questions:

  * Should the JSON data type (eventually) have an equality operator?

+1.

  * Should the JSON input function alphabetize object members by key?

I think it would probably be better if it didn't.  I'm wary of
overcanonicalization.  It can be useful to have things come back out
in more or less the format you put them in.

 If we canonicalize strings and numbers and alphabetize object members,
 then our equality function is just texteq.  The only stumbling block
 is canonicalizing numbers.  Fortunately, JSON's definition of a
 number is its decimal syntax, so the algorithm is child's play:

  * Figure out the digits and exponent.
  * If the exponent is greater than 20 or less than -6 (arbitrary), use
 exponential notation.

 The problem is: 2.718282e-1000 won't equal 0 as may be expected.  I
 doubt this matters much.

I don't think that 2.718282e-1000 SHOULD equal 0.

 If, in the future, we add the ability to manipulate large JSON trees
 efficiently (e.g. by using an auxiliary table like TOAST does), we'll
 probably want unique members, so enforcing them now may be prudent.

I doubt you're going to want to reinvent TOAST, but I do think there
are many advantages to forbidding duplicate keys.  ISTM the question
is whether to throw an error or just silently discard one of the k/v
pairs.  Keeping both should not be on the table, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Joey Adams
I think I've decided to only allow escapes of non-ASCII characters
when the database encoding is UTF8.  For example, $$\u2013$$::json
will fail if the database encoding is WIN1252, even though WIN1252 can
encode U+2013 (EN DASH).  This may be somewhat draconian, given that:

 * SQL_ASCII can otherwise handle any language according to the documentation.

 * The XML type doesn't have this restriction (it just stores the
input text verbatim, and converts it to UTF-8 before doing anything
complicated with it).

However, it's simple to implement and understand.  The JSON data type
will not perform any automatic conversion between character encodings.
 Also, if we want to handle this any better in the future, we won't
have to support legacy data containing a mixture of encodings.

In the future, we could create functions to compensate for the issues
people encounter; for example:

 * json_escape_unicode(json [, replace bool]) returns text -- convert
non-ASCII characters to escapes.  Optionally, use \uFFFD for
unconvertible characters.
 * json_unescape_unicode(text [, replace text]) returns json -- like
json_in, but convert Unicode escapes to characters when possible.
Optionally, replace unconvertible characters with a given string.
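
For a rough sense of those two directions, Python's json module does the
analogous thing (illustration only, not the proposed SQL functions):

    import json

    s = "\u266b \u2013"                        # BEAMED EIGHTH NOTES, EN DASH
    print(json.dumps(s))                       # "\u266b \u2013"  -- escaped, pure ASCII
    print(json.dumps(s, ensure_ascii=False))   # "♫ –"            -- characters, not escapes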

I've been going back and forth on how to handle encodings in the JSON
type for a while, but suggestions and objections are still welcome.
However, I plan to proceed in this direction so progress can be made.

On another matter, should the JSON type guard against duplicate member
keys?  The JSON RFC says "The names within an object SHOULD be
unique", meaning JSON with duplicate members can be considered valid.
JavaScript interpreters (the ones I tried), PHP, and Python all have
the same behavior: discard the first member in favor of the second.
That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
error if a duplicate attribute is present (e.g. '<a href="b"
href="c"/>'::xml).
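
For example, Python shows the behavior described above (the later duplicate
wins):

    import json

    print(json.loads('{"key":1,"key":2}'))   # {'key': 2}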

Thanks for the input,
- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Robert Haas
On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams joeyadams3.14...@gmail.com wrote:
 On another matter, should the JSON type guard against duplicate member
 keys?  The JSON RFC says "The names within an object SHOULD be
 unique", meaning JSON with duplicate members can be considered valid.
 JavaScript interpreters (the ones I tried), PHP, and Python all have
 the same behavior: discard the first member in favor of the second.
 That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
 error if a duplicate attribute is present (e.g. '<a href="b"
 href="c"/>'::xml).

Hmm.  That's tricky.  I lean mildly toward throwing an error as being
more consistent with the general PG philosophy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Jan Urbański
On 23/07/11 01:12, Robert Haas wrote:
 On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 On another matter, should the JSON type guard against duplicate member
 keys?  The JSON RFC says "The names within an object SHOULD be
 unique", meaning JSON with duplicate members can be considered valid.
 JavaScript interpreters (the ones I tried), PHP, and Python all have
 the same behavior: discard the first member in favor of the second.
 That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
 error if a duplicate attribute is present (e.g. '<a href="b"
 href="c"/>'::xml).
 
 Hmm.  That's tricky.  I lean mildly toward throwing an error as being
 more consistent with the general PG philosophy.

OTOH:

regression=# select 'key=>1,key=>2'::hstore;
   hstore
------------
 "key"=>"1"
(1 row)

Cheers,
Jan



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Robert Haas
On Fri, Jul 22, 2011 at 7:16 PM, Jan Urbański wulc...@wulczer.org wrote:
 On 23/07/11 01:12, Robert Haas wrote:
 On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 On another matter, should the JSON type guard against duplicate member
 keys?  The JSON RFC says "The names within an object SHOULD be
 unique", meaning JSON with duplicate members can be considered valid.
 JavaScript interpreters (the ones I tried), PHP, and Python all have
 the same behavior: discard the first member in favor of the second.
 That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
 error if a duplicate attribute is present (e.g. '<a href="b"
 href="c"/>'::xml).

 Hmm.  That's tricky.  I lean mildly toward throwing an error as being
 more consistent with the general PG philosophy.

 OTOH:

 regression=# select 'key=>1,key=>2'::hstore;
    hstore
 ------------
  "key"=>"1"
 (1 row)

Fair point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Florian Pflug
On Jul23, 2011, at 01:12 , Robert Haas wrote:
 On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 On another matter, should the JSON type guard against duplicate member
 keys?  The JSON RFC says "The names within an object SHOULD be
 unique", meaning JSON with duplicate members can be considered valid.
 JavaScript interpreters (the ones I tried), PHP, and Python all have
 the same behavior: discard the first member in favor of the second.
 That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
 error if a duplicate attribute is present (e.g. '<a href="b"
 href="c"/>'::xml).
 
 Hmm.  That's tricky.  I lean mildly toward throwing an error as being
 more consistent with the general PG philosophy.

I'm usually all for throwing an error on ambiguous input - but if Javascript,
PHP and Python all agree, it might be wise to just yield to them.

best regards,
Florian Pflug




Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Florian Pflug
On Jul23, 2011, at 00:04 , Joey Adams wrote:
 I think I've decided to only allow escapes of non-ASCII characters
 when the database encoding is UTF8.  For example, $$\u2013$$::json
 will fail if the database encoding is WIN1252, even though WIN1252 can
 encode U+2013 (EN DASH).  This may be somewhat draconian, given that:
 
 * SQL_ASCII can otherwise handle any language according to the 
 documentation.

+1. It makes the handling of \uXXXX sequences consistent with the behaviour
of CHR(), which seems like a Good Thing. Clients can also work around this
restriction by de-escaping themselves, which shouldn't be too hard.

 * The XML type doesn't have this restriction (it just stores the
 input text verbatim, and converts it to UTF-8 before doing anything
 complicated with it).

Yeah. But the price the XML type pays for that is the lack of an 
equality operator.

best regards,
Florian Pflug





Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Joey Adams
On Fri, Jul 22, 2011 at 7:12 PM, Robert Haas robertmh...@gmail.com wrote:
 Hmm.  That's tricky.  I lean mildly toward throwing an error as being
 more consistent with the general PG philosophy.

I agree.  Besides, throwing an error on duplicate keys seems like the
most logical thing to do.  The most compelling reason not to, I think,
is that it would make the input function a little slower.

On Fri, Jul 22, 2011 at 8:26 PM, Florian Pflug f...@phlo.org wrote:
 * The XML type doesn't have this restriction (it just stores the
 input text verbatim, and converts it to UTF-8 before doing anything
 complicated with it).

 Yeah. But the price the XML type pays for that is the lack of an
 equality operator.

Interesting.  This leads to a couple more questions:

 * Should the JSON data type (eventually) have an equality operator?
 * Should the JSON input function alphabetize object members by key?

If we canonicalize strings and numbers and alphabetize object members,
then our equality function is just texteq.  The only stumbling block
is canonicalizing numbers.  Fortunately, JSON's definition of a
number is its decimal syntax, so the algorithm is child's play:

 * Figure out the digits and exponent.
 * If the exponent is greater than 20 or less than -6 (arbitrary), use
exponential notation.

The problem is: 2.718282e-1000 won't equal 0 as may be expected.  I
doubt this matters much.
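
A toy sketch of that rule in Python (illustration only; it reads "exponent" as
the decimal exponent of the leading digit and ignores trailing-zero trimming
and the sign of zero):

    from decimal import Decimal

    def canonical_number(token):
        # Plain decimal notation unless the exponent is greater than 20
        # or less than -6; otherwise exponential notation.
        d = Decimal(token)
        if d == 0:
            return "0"
        if -6 <= d.adjusted() <= 20:
            return format(d, "f")              # e.g. "1001", "0.001"
        return format(d.normalize(), "e")      # e.g. "2.718282e-1000"

    print(canonical_number("2.718282e-1000"))  # not collapsed to 0
    print(canonical_number("1001"))            # "1001"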

It would be nice to canonicalize JSON on input, and that's the way I'd
like to go, but two caveats are:

 * Input (and other operations) would require more CPU time.  Instead
of being able to pass the data through a quick condense function, it'd
have to construct an AST (to sort object members) and re-encode the
JSON back into a string.
 * Users, for aesthetic reasons, might not want their JSON members rearranged.

If, in the future, we add the ability to manipulate large JSON trees
efficiently (e.g. by using an auxiliary table like TOAST does), we'll
probably want unique members, so enforcing them now may be prudent.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-20 Thread Florian Pflug
On Jul20, 2011, at 06:40 , Joey Adams wrote:
 On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas robertmh...@gmail.com wrote:
 Thanks for the input.  I'm leaning in this direction too.  However, it
 will be a tad tricky to implement the conversions efficiently, ...
 
 I'm a bit confused, because I thought what I was talking about was not
 doing any conversions in the first place.
 
 We want to be able to handle \uXXXX escapes when the database encoding
 is not UTF-8.  We could leave them in place, but sooner or later
 they'll need to be converted in order to unwrap or compare JSON
 strings.

Hm, I agree that we need to handle \uXXXX escapes in JSON input.
We won't ever produce them during output though, right?

 The approach being discussed is converting escapes to the database
 encoding.  This means escapes of characters not available in the
 database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.
 
 The PostgreSQL parser (which also supports Unicode escapes) takes a
 simpler approach: don't allow non-ASCII escapes unless the server
 encoding is UTF-8.

Maybe JSON should simply reject them too, then. At least for now.

How does that XML type handle the situation? It seems that it'd have
the same problem with unicode entity references &#nnn;.

best regards,
Florian Pflug






Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-20 Thread Joey Adams
On Wed, Jul 20, 2011 at 6:49 AM, Florian Pflug f...@phlo.org wrote:
 Hm, I agree that we need to handle \uXXXX escapes in JSON input.
 We won't ever produce them during output though, right?

We could, to prevent transcoding errors if the client encoding is
different than the server encoding (and neither is SQL_ASCII, nor is
the client encoding UTF8).  For example, if the database encoding is
UTF-8 and the client encoding is WIN1252, I'd think it would be a good
idea to escape non-ASCII characters.

 How does that XML type handle the situation? It seems that it'd have
 the same problem with unicode entity references &#nnn;.

From the looks of it, XML operations convert the text to UTF-8 before
passing it to libxml.  The XML type does not normalize the input;
SELECT '&#9835;♫'::xml; simply yields &#9835;♫.  Escapes of any
character are allowed in any encoding, from the looks of it.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Joey Adams
On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug f...@phlo.org wrote:
 On Jul19, 2011, at 00:17 , Joey Adams wrote:
 I suppose a simple solution would be to convert all escapes and
 outright ban escapes of characters not in the database encoding.

 +1. Making JSON work like TEXT when it comes to encoding issues
 makes this all much simpler conceptually. It also avoids all kinds
 of weird issues if you extract textual values from a JSON document
 server-side.

Thanks for the input.  I'm leaning in this direction too.  However, it
will be a tad tricky to implement the conversions efficiently, since
the wchar API doesn't provide a fast path for individual codepoint
conversion (that I'm aware of), and pg_do_encoding_conversion doesn't
look like a good thing to call lots of times.

My plan is to scan for escapes of non-ASCII characters, convert them
to UTF-8, and put them in a comma-delimited string like this:

a,b,c,d,

then, convert the resulting string to the server encoding (which may
fail, indicating that some codepoint(s) are not present in the
database encoding).  After that, read the string and plop the
characters where they go.

It's clever, but I can't think of a better way to do it with the existing API.
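
Roughly, the idea looks like this (conceptual Python sketch only; convert() is
a placeholder standing in for a single pg_do_encoding_conversion call, not a
real API):

    def convert(utf8_bytes, server_encoding):
        # placeholder for one bulk server-side conversion; may fail if some
        # code point has no equivalent in the target encoding
        return utf8_bytes.decode("utf-8").encode(server_encoding)

    escaped = ["\u2013", "\u00e9"]                 # code points found while scanning
    batch = "".join(c + "," for c in escaped)      # one string: "–,é,"
    out = convert(batch.encode("utf-8"), "cp1252") # one conversion, not one per escape
    pieces = out.split(b",")[:-1]                  # [b'\x96', b'\xe9'], spliced back in order

The comma is a safe delimiter here only because the converted characters are
all non-ASCII and so never contain an ASCII comma byte.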


- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Alvaro Herrera
Excerpts from Joey Adams's message of mar jul 19 21:03:15 -0400 2011:
 On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug f...@phlo.org wrote:
  On Jul19, 2011, at 00:17 , Joey Adams wrote:
  I suppose a simple solution would be to convert all escapes and
  outright ban escapes of characters not in the database encoding.
 
  +1. Making JSON work like TEXT when it comes to encoding issues
  makes this all much simpler conceptually. It also avoids all kinds
  of weird issues if you extract textual values from a JSON document
  server-side.
 
 Thanks for the input.  I'm leaning in this direction too.  However, it
 will be a tad tricky to implement the conversions efficiently, since
 the wchar API doesn't provide a fast path for individual codepoint
 conversion (that I'm aware of), and pg_do_encoding_conversion doesn't
 look like a good thing to call lots of times.
 
 My plan is to scan for escapes of non-ASCII characters, convert them
 to UTF-8, and put them in a comma-delimited string like this:
 
 a,b,c,d,
 
 then, convert the resulting string to the server encoding (which may
 fail, indicating that some codepoint(s) are not present in the
 database encoding).  After that, read the string and plop the
 characters where they go.

Ugh.

 It's clever, but I can't think of a better way to do it with the existing 
 API.

Would it work to have a separate entry point into mbutils.c that lets
you cache the conversion proc caller-side?  I think the main problem is
determining the byte length of each source character beforehand.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Joey Adams
Forwarding because the mailing list rejected the original message.

-- Forwarded message --
From: Joey Adams joeyadams3.14...@gmail.com
Date: Tue, Jul 19, 2011 at 11:23 PM
Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
Another swing at JSON
To: Alvaro Herrera alvhe...@commandprompt.com
Cc: Florian Pflug f...@phlo.org, Tom Lane t...@sss.pgh.pa.us, Robert
Haas robertmh...@gmail.com, Bernd Helmle maili...@oopsware.de,
Dimitri Fontaine dimi...@2ndquadrant.fr, David Fetter
da...@fetter.org, Josh Berkus j...@agliodbs.com, Pg Hackers
pgsql-hackers@postgresql.org


On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Would it work to have a separate entry point into mbutils.c that lets
 you cache the conversion proc caller-side?

That sounds like a really good idea.  There's still the overhead of
calling the proc, but I imagine it's a lot less than looking it up.

 I think the main problem is
 determining the byte length of each source character beforehand.

I'm not sure what you mean.  The idea is to convert the \uXXXX escape
to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
sequence is easy to compute), call the conversion proc to get the
null-terminated database-encoded character, then append the result to
whatever StringInfo the string is going into.

The only question mark is how big the destination buffer will need to
be.  The maximum number of bytes per char in any supported encoding is
4, but is it possible for one Unicode character to turn into multiple
characters in the database encoding?

While we're at it, should we provide the same capability to the SQL
parser?  Namely, the ability to use \uXXXX escapes above U+007F when
the server encoding is not UTF-8?

- Joey



Re: Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Bruce Momjian
Joey Adams wrote:
 Forwarding because the mailing list rejected the original message.

Yes, I am seeing email failures to the 'core' email list.

---


 
 -- Forwarded message --
 From: Joey Adams joeyadams3.14...@gmail.com
 Date: Tue, Jul 19, 2011 at 11:23 PM
 Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
 Another swing at JSON
 To: Alvaro Herrera alvhe...@commandprompt.com
 Cc: Florian Pflug f...@phlo.org, Tom Lane t...@sss.pgh.pa.us, Robert
 Haas robertmh...@gmail.com, Bernd Helmle maili...@oopsware.de,
 Dimitri Fontaine dimi...@2ndquadrant.fr, David Fetter
 da...@fetter.org, Josh Berkus j...@agliodbs.com, Pg Hackers
 pgsql-hackers@postgresql.org
 
 
 On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Would it work to have a separate entry point into mbutils.c that lets
  you cache the conversion proc caller-side?
 
 That sounds like a really good idea.  There's still the overhead of
 calling the proc, but I imagine it's a lot less than looking it up.
 
  I think the main problem is
  determining the byte length of each source character beforehand.
 
 I'm not sure what you mean.  The idea is to convert the \uXXXX escape
 to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
 sequence is easy to compute), call the conversion proc to get the
 null-terminated database-encoded character, then append the result to
 whatever StringInfo the string is going into.
 
 The only question mark is how big the destination buffer will need to
 be.  The maximum number of bytes per char in any supported encoding is
 4, but is it possible for one Unicode character to turn into multiple
 characters in the database encoding?
 
 While we're at it, should we provide the same capability to the SQL
 parser?  Namely, the ability to use \uXXXX escapes above U+007F when
 the server encoding is not UTF-8?
 
 - Joey
 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Bruce Momjian
Bruce Momjian wrote:
 Joey Adams wrote:
  Forwarding because the mailing list rejected the original message.
 
 Yes, I am seeing email failures to the 'core' email list.

Marc says it is now fixed.

---


 
  
  -- Forwarded message --
  From: Joey Adams joeyadams3.14...@gmail.com
  Date: Tue, Jul 19, 2011 at 11:23 PM
  Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
  Another swing at JSON
  To: Alvaro Herrera alvhe...@commandprompt.com
  Cc: Florian Pflug f...@phlo.org, Tom Lane t...@sss.pgh.pa.us, Robert
  Haas robertmh...@gmail.com, Bernd Helmle maili...@oopsware.de,
  Dimitri Fontaine dimi...@2ndquadrant.fr, David Fetter
  da...@fetter.org, Josh Berkus j...@agliodbs.com, Pg Hackers
  pgsql-hackers@postgresql.org
  
  
  On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
  alvhe...@commandprompt.com wrote:
   Would it work to have a separate entry point into mbutils.c that lets
   you cache the conversion proc caller-side?
  
  That sounds like a really good idea.  There's still the overhead of
  calling the proc, but I imagine it's a lot less than looking it up.
  
   I think the main problem is
   determining the byte length of each source character beforehand.
  
  I'm not sure what you mean.  The idea is to convert the \uXXXX escape
  to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
  sequence is easy to compute), call the conversion proc to get the
  null-terminated database-encoded character, then append the result to
  whatever StringInfo the string is going into.
  
  The only question mark is how big the destination buffer will need to
  be.  The maximum number of bytes per char in any supported encoding is
  4, but is it possible for one Unicode character to turn into multiple
  characters in the database encoding?
  
  While we're at it, should we provide the same capability to the SQL
  parser?  Namely, the ability to use \uXXXX escapes above U+007F when
  the server encoding is not UTF-8?
  
  - Joey
  
 
 -- 
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com
 
   + It's impossible for everything to be true. +
 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Robert Haas
On Tue, Jul 19, 2011 at 9:03 PM, Joey Adams joeyadams3.14...@gmail.com wrote:
 On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug f...@phlo.org wrote:
 On Jul19, 2011, at 00:17 , Joey Adams wrote:
 I suppose a simple solution would be to convert all escapes and
 outright ban escapes of characters not in the database encoding.

 +1. Making JSON work like TEXT when it comes to encoding issues
 makes this all much simpler conceptually. It also avoids all kinds
 of weird issues if you extract textual values from a JSON document
 server-side.

 Thanks for the input.  I'm leaning in this direction too.  However, it
 will be a tad tricky to implement the conversions efficiently, ...

I'm a bit confused, because I thought what I was talking about was not
doing any conversions in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Joey Adams
On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas robertmh...@gmail.com wrote:
 Thanks for the input.  I'm leaning in this direction too.  However, it
 will be a tad tricky to implement the conversions efficiently, ...

 I'm a bit confused, because I thought what I was talking about was not
 doing any conversions in the first place.

We want to be able to handle \uXXXX escapes when the database encoding
is not UTF-8.  We could leave them in place, but sooner or later
they'll need to be converted in order to unwrap or compare JSON
strings.

The approach being discussed is converting escapes to the database
encoding.  This means escapes of characters not available in the
database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.
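
Illustration only, using Python's codecs in place of the server's conversion
routines:

    try:
        "\u266b".encode("iso-8859-1")     # U+266B has no representation in ISO-8859-1
    except UnicodeEncodeError as err:
        print(err)   # 'latin-1' codec can't encode character '\u266b' ...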

The PostgreSQL parser (which also supports Unicode escapes) takes a
simpler approach: don't allow non-ASCII escapes unless the server
encoding is UTF-8.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-18 Thread Robert Haas
On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams joeyadams3.14...@gmail.com wrote:
 On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
 joeyadams3.14...@gmail.com wrote:
 I'll try to submit a revised patch within the next couple days.

 Sorry this is later than I said.

 I addressed the issues covered in the review.  I also fixed a bug
 where \u0022 would become ", which is invalid JSON, causing an
 assertion failure.

 However, I want to put this back into WIP for a number of reasons:

  * The current code accepts invalid surrogate pairs (e.g.
 \uD800\uD800).  The problem with accepting them is that it would be
 inconsistent with PostgreSQL's Unicode support, and with the Unicode
 standard itself.  In my opinion: as long as the server encoding is
 universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
 (barring data corruption and resource limitations).

  * I'd like to go ahead with the parser rewrite I mentioned earlier.
 The new parser will be able to construct a parse tree when needed, and
 it won't use those overkill parsing macros.

  * I recently learned that not all supported server encodings can be
 converted to Unicode losslessly.  The current code, on output,
 converts non-ASCII characters to Unicode escapes under some
 circumstances (see the comment above json_need_to_escape_unicode).

 I'm having a really hard time figuring out how the JSON module should
 handle non-Unicode character sets.  \uXXXX escapes in JSON literals
 can be used to encode characters not available in the server encoding.
  On the other hand, the server encoding can encode characters not
 present in Unicode (see the third bullet point above).  This means
 JSON normalization and comparison (along with member lookup) are not
 possible in general.

I previously suggested that, instead of trying to implement JSON, you
should just try to implement
JSON-without-the-restriction-that-everything-must-be-UTF8.  Most
people are going to be using UTF-8 simply because it's the default,
and if you forget about transcoding then ISTM that this all becomes a
lot simpler.  We don't, in general, have the ability to support data
in multiple encodings inside PostgreSQL, and it seems to me that by
trying to invent a mechanism for making that work as part of this
patch, you are setting the bar for yourself awfully high.

One thing to think about here is that transcoding between UTF-8 and
the server encoding seems like the wrong thing all around.  After all,
the user does not want the data in the server encoding; they want it
in their chosen client encoding.  If you are transcoding between UTF-8
and the server encoding, then that suggests that there's some
double-transcoding going on here, which creates additional
opportunities for (1) inefficiency and (2) outright failure.  I'm
guessing that's because you're dealing with an interface that expects
the internal representation of the datum on one side and the server
encoding on the other side, which gets back to the point in the
preceding paragraph.  You'd probably need to revise that interface in
order to make this really work the way it should, and that might be
more than you want to get into.  At any rate, it probably is a
separate project from making JSON work.

If in spite of the above you're bent on continuing down your present
course, then it seems to me that you'd better make the on-disk
representation UTF-8, with all \uXXXX escapes converted to the
corresponding characters.  If you hit an invalid surrogate pair, or a
character that exists in the server encoding but not UTF-8, it's not a
legal JSON object and you throw an error on input, just as you would
for mismatched braces or similar.  On output, you should probably just
use \uXXXX escapes to represent any unrepresentable characters - i.e. option 3
from your original list.  That may be slow, but I think that it's not
really worth devoting a lot of mental energy to this case.  Most
people are going to be using UTF-8 because that's the default, and
those who are not shouldn't expect a data format built around UTF-8 to
work perfectly in their environment, especially if they insist on
using characters that are representable in only some of the encodings
they are using.

But, again, why not just forget about transcoding and define it as
"JSON, if you happen to be using utf-8 as the server encoding, and
otherwise some variant of JSON that uses the server encoding as its
native format"?  It seems to me that that would be a heck of a lot
simpler and more reliable, and I'm not sure it's any less useful in
practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 I'm having a really hard time figuring out how the JSON module should
 handle non-Unicode character sets.

 But, again, why not just forget about transcoding and define it as
 "JSON, if you happen to be using utf-8 as the server encoding, and
 otherwise some variant of JSON that uses the server encoding as its
 native format"?  It seems to me that that would be a heck of a lot
 simpler and more reliable, and I'm not sure it's any less useful in
 practice.

Right offhand, the only argument I can see against this is that we might
someday want to pass JSON datums directly to third-party loadable
libraries that are picky about UTF8-ness.  But we could just document
and/or enforce that such libraries can only be used in UTF8-encoded
databases.

BTW, could the \uXXXX problem be finessed by leaving such escapes in
source form?

regards, tom lane



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-18 Thread Robert Haas
On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams joeyadams3.14...@gmail.com 
 wrote:
 I'm having a really hard time figuring out how the JSON module should
 handle non-Unicode character sets.

 But, again, why not just forget about transcoding and define it as
 "JSON, if you happen to be using utf-8 as the server encoding, and
 otherwise some variant of JSON that uses the server encoding as its
 native format"?  It seems to me that that would be a heck of a lot
 simpler and more reliable, and I'm not sure it's any less useful in
 practice.

 Right offhand, the only argument I can see against this is that we might
 someday want to pass JSON datums directly to third-party loadable
 libraries that are picky about UTF8-ness.  But we could just document
 and/or enforce that such libraries can only be used in UTF8-encoded
 databases.

Right.  Or, if someone someday implements a feature to allow multiple
server encodings within the same database, then we can tell people to
use it if they want JSON to work in a spec-canonical way.

 BTW, could the \uXXXX problem be finessed by leaving such escapes in
 source form?

Joey seems to want to canonicalize the input, which argues against
that, and to detect invalid surrogate pairs, which likewise argues
against that.  You could argue that's overkill, but it seems to have
at least some merit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-18 Thread Joey Adams
On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, could the \uXXXX problem be finessed by leaving such escapes in
 source form?

Yes, it could.  However, it doesn't solve the problem of comparison
(needed for member lookup), which requires canonicalizing the strings
to be compared.

Here's a Unicode handling policy I came up with.  It is guaranteed to
be lossless as long as the client and database encodings are the same.

---

On input (json_in), if the text is valid JSON, it is condensed:

 * Whitespace characters surrounding tokens are removed.
 * Long escapes (like \u0022) are converted to short escapes (like \")
where possible.
 * Unnecessary escapes of ASCII characters (e.g. \u0061 and even
\u007F) are converted to their respective characters.
 * Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E)
are converted to their respective characters, but only if the database
encoding is UTF-8.

On output (json_out), non-ASCII characters are converted to \uXXXX
escapes, unless one or more of these very likely circumstances hold:

 * The client encoding and database encoding are the same.  No
conversion is performed, so escaping characters will not prevent any
conversion errors.
 * The client encoding is UTF-8.  Escaping is not necessary because
the client can encode all Unicode codepoints.
 * The client encoding and/or database encoding is SQL_ASCII.
SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed.

When a JSON-encoded string is unwrapped using from_json (e.g.
from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the
characters they represent.  If any escapes cannot be represented in
the database encoding, an error will be raised.  Note that:

 * If the database encoding is UTF-8, conversion will never fail.
 * If the database encoding is SQL_ASCII, conversion will fail if any
escapes of non-ASCII characters are present.

---
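
As an aside, the input condensing described above is close to what Python's
json module produces when re-serializing, which may help picture the intended
effect (illustration only):

    import json

    raw = '{ "a" : "\\u0061\\u266b", "q": "\\u0022" }'
    print(json.dumps(json.loads(raw), ensure_ascii=False, separators=(",", ":")))
    # {"a":"a♫","q":"\""}  -- whitespace gone, \u0061 -> a, \u266b -> ♫, \u0022 -> \"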

However, I'm having a really hard time figuring out how comparison
would work in this framework.  Here are a few options:

 1. Convert the strings to UTF-8, convert the escapes to characters,
and compare the strings.
 2. Convert the escapes to the database encoding, then compare the strings.
 3. If either string contains escapes of non-ASCII characters, do 1.
Otherwise, do 2.

Number 1 seems the most sane to me, but could lead to rare errors.

Number 3 could produce confusing results.  If character set X has
three different representations of one Unicode codepoint, then we
could have scenarios like this (simplified):

 "abc♫" != "aaa♫"

but:

 "abc\u266b" == "aaa♫"

I suppose a simple solution would be to convert all escapes and
outright ban escapes of characters not in the database encoding.  This
would have the nice property that all strings can be unescaped
server-side.  Problem is, what if a browser or other program produces,
say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
where the encoding lacks this character?

On the other hand, converting all JSON to UTF-8 would be simpler to
implement.  It would probably be more intuitive, too, given that the
JSON RFC says, "JSON text SHALL be encoded in Unicode."

In any case, the documentation should say "Use UTF-8 for best
results", as there seems to be no entirely satisfactory way to handle
JSON in other database encodings.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-18 Thread Florian Pflug
On Jul19, 2011, at 00:17 , Joey Adams wrote:
 I suppose a simple solution would be to convert all escapes and
 outright ban escapes of characters not in the database encoding.

+1. Making JSON work like TEXT when it comes to encoding issues
makes this all much simpler conceptually. It also avoids all kinds
of weird issues if you extract textual values from a JSON document
server-side.

If we really need more flexibility than that, we should look at
ways to allow different columns to have different encodings. Doing
that just for JSON seems wrong - especially because it doesn't really
reduce the complexity of the problem, as your example shows. The
essential problem here is, AFAICS, that there's really no sane way to
compare strings in two different encodings, unless both encode a
subset of unicode only.

 This would have the nice property that all strings can be unescaped
 server-side.  Problem is, what if a browser or other program produces,
 say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
 where the encoding lacks this character?

They'll get an error - just as if they had tried to store that same
character in a TEXT column.

 On the other hand, converting all JSON to UTF-8 would be simpler to
 implement.  It would probably be more intuitive, too, given that the
 JSON RFC says, "JSON text SHALL be encoded in Unicode."

Yet, the only reason I'm aware of for some people to not use UTF-8
as the server encoding is that it's pretty inefficient storage-wise for
some scripts (AFAIR some Japanese scripts are an example, but I don't
remember the details). By making JSON store UTF-8 on-disk always, the
JSON type gets less appealing to those people.

best regards,
Florian Pflug





Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-15 Thread Joey Adams
On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
joeyadams3.14...@gmail.com wrote:
 I'll try to submit a revised patch within the next couple days.

Sorry this is later than I said.

I addressed the issues covered in the review.  I also fixed a bug
where \u0022 would become , which is invalid JSON, causing an
assertion failure.

However, I want to put this back into WIP for a number of reasons:

 * The current code accepts invalid surrogate pairs (e.g.
\uD800\uD800; see the sketch after this list).  The problem with
accepting them is that it would be inconsistent with PostgreSQL's
Unicode support, and with the Unicode
standard itself.  In my opinion: as long as the server encoding is
universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
(barring data corruption and resource limitations).

 * I'd like to go ahead with the parser rewrite I mentioned earlier.
The new parser will be able to construct a parse tree when needed, and
it won't use those overkill parsing macros.

 * I recently learned that not all supported server encodings can be
converted to Unicode losslessly.  The current code, on output,
converts non-ASCII characters to Unicode escapes under some
circumstances (see the comment above json_need_to_escape_unicode).
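
For the first item, the relevant rule is the UTF-16 one: a high surrogate
(U+D800-U+DBFF) must be immediately followed by a low surrogate
(U+DC00-U+DFFF), and the pair combines into a single codepoint.  A sketch
of the check (not the code I intend to commit, just the arithmetic):

    #include <stdbool.h>

    /*
     * Sketch: combine two \uXXXX values forming a surrogate pair into one
     * codepoint.  Returns false for a lone or mismatched surrogate.
     */
    static bool
    combine_surrogates(unsigned int hi, unsigned int lo, unsigned int *cp)
    {
        if (hi < 0xD800 || hi > 0xDBFF)     /* not a high surrogate */
            return false;
        if (lo < 0xDC00 || lo > 0xDFFF)     /* not a low surrogate */
            return false;

        *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        return true;
    }

With this, \uD800\uD800 is rejected because the second value is not a low
surrogate, and a lone \uD800 never has a partner to combine with.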

I'm having a really hard time figuring out how the JSON module should
handle non-Unicode character sets.  \u escapes in JSON literals
can be used to encode characters not available in the server encoding.
 On the other hand, the server encoding can encode characters not
present in Unicode (see the third bullet point above).  This means
JSON normalization and comparison (along with member lookup) are not
possible in general.

Even if I assume server -> UTF-8 -> server transcoding is lossless,
the situation is still ugly.  Normalization could be implemented in a
few ways:

 * Convert from server encoding to UTF-8, and convert \u escapes
to UTF-8 characters.  This is space-efficient, but the resulting text
would not be compatible with the server encoding (which may or may not
matter).
 * Convert from server encoding to UTF-8, and convert all non-ASCII
characters to \u escapes, resulting in pure ASCII (sketched below).
This bloats the text by a factor of three, in the worst case.
 * Convert \u escapes to characters in the server encoding, but
only where possible.  This would be extremely inefficient.
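
A rough sketch of the second option (assuming the input has already been
converted to UTF-8, and that escaping of quotes, backslashes and control
characters is handled elsewhere - this is an illustration, not the patch's
json_escape_unicode()):

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "mb/pg_wchar.h"

    /*
     * Sketch: append UTF-8 text to buf with every non-ASCII character
     * turned into \uXXXX escapes (surrogate pairs above U+FFFF), yielding
     * pure-ASCII output.
     */
    static void
    append_ascii_escaped(StringInfo buf, const char *s, int len)
    {
        const unsigned char *p = (const unsigned char *) s;
        const unsigned char *end = p + len;

        while (p < end)
        {
            if (*p < 0x80)
            {
                appendStringInfoChar(buf, (char) *p);
                p++;
            }
            else
            {
                pg_wchar    cp = utf8_to_unicode(p);

                if (cp > 0xFFFF)
                {
                    cp -= 0x10000;
                    appendStringInfo(buf, "\\u%04X\\u%04X",
                                     (unsigned int) (0xD800 + (cp >> 10)),
                                     (unsigned int) (0xDC00 + (cp & 0x3FF)));
                }
                else
                    appendStringInfo(buf, "\\u%04X", (unsigned int) cp);

                p += pg_utf_mblen(p);
            }
        }
    }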

The parse tree (for functions that need it) will need to store JSON
member names and strings either in UTF-8 or in normalized JSON (which
could be the same thing).

Help and advice would be appreciated.  Thanks!

- Joey


json-contrib-rev1-20110714.patch.gz
Description: GNU Zip compressed data



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-10 Thread Josh Berkus
On 7/4/11 7:22 PM, Joseph Adams wrote:
 I'll try to submit a revised patch within the next couple days.

So?  New patch?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-04 Thread Bernd Helmle



--On 18 June 2011 12:29:38 +0200 Bernd Helmle maili...@oopsware.de wrote:


Similar problems occur with a couple other modules I tried (hstore,
intarray).


Hmm, works for me. Seems you have messed up your installation in some way
(build against current -HEAD but running against a 9.1?).

I'm going to review in the next couple of days again.


A bit later than expected, but here is my summary of the patch:

-- Patch Status

The patch applies cleanly. Since it's primarily targeted as an addition to 
contrib/, it doesn't touch the backend source tree (besides documentation TOC 
entries), but adds a new subdirectory, contrib/json. The patch is in context-diff 
format, as requested.


-- Functionality

The patch as it stands provides a basic implementation of a JSON datatype. It 
includes normalization and validation of JSON strings, and it adds the datatype 
implementation as an extension.


The implementation focuses on the datatype functionality only; there is no 
additional support for special operators. Thus, I think the comments in the 
control file (and docs) promise more than the contrib module will actually 
deliver:


+comment = 'data type for storing and manipulating JSON content'

I'm not sure if "manipulating" is a correct description. Maybe I missed it, 
but I didn't see functions to manipulate JSON strings directly, for example, 
adding an object to a JSON string, ...


-- Coding

The JSON datatype is implemented as a variable-length character type, which 
allows very large JSON strings to be stored. It transparently decides when to 
escape Unicode code points depending on the selected client and server 
encoding.


Validation is done through its own JSON parser. The parser itself is a 
recursive-descent parser. It is noticeable that the parser defines its own 
is_space() and is_digit() functions, e.g.:


+#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ')

Wouldn't it be cleaner to use isspace() instead?

This code block in function json_escape_unicode() looks suspicious:

+   /* Convert to UTF-8, if necessary. */
+   {
+   const char *orig = json;
+   json = (const char *)
+   pg_do_encoding_conversion((unsigned char *) json, length,
+ GetDatabaseEncoding(), PG_UTF8);
+   if (json != orig)
+   length = strlen(json);
+   }

Seems it *always* wants to convert the string. Isn't there an if condition 
missing?


There's a commented-out code fragment that should probably be removed (see the 
last two lines of the snippet below):


+static unsigned int
+read_hex16(const char *in)
+{
+   unsigned int i;
+   unsigned int tmp;
+   unsigned int ret = 0;
+
+   for (i = 0; i < 4; i++)
+   {
+       char c = *in++;
+
+       Assert(is_hex_digit(c));
+
+       if (c >= '0' && c <= '9')
+           tmp = c - '0';
+       else if (c >= 'A' && c <= 'F')
+           tmp = c - 'A' + 10;
+       else /* if (c >= 'a' && c <= 'f') */
+           tmp = c - 'a' + 10;

json.c introduces new appendStringInfo* functions, e.g.
appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better
to give them different names, since the backend may someday introduce
the same notion with a different meaning. They are static, but why not
name them something like jsonAppendStringInfoUtf8() or similar?


-- Regression Tests

The patch includes a fairly complete regression test suite, which covers much
of the formatting, validation and input functionality of the JSON datatype. It 
also tests Unicode sequences and input encoding with a server encoding other 
than UTF-8.



--
Thanks

Bernd



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-04 Thread Joseph Adams
Thanks for reviewing my patch!

On Mon, Jul 4, 2011 at 7:10 AM, Bernd Helmle maili...@oopsware.de wrote:
 +comment = 'data type for storing and manipulating JSON content'

 I'm not sure, if manipulating is a correct description. Maybe i missed it,
 but i didn't see functions to manipulate JSON strings directly, for example,
 adding an object to a JSON string, ...

Indeed.  I intend to add manipulation functions in the future.  The
words "and manipulating" shouldn't be there yet.

 -- Coding
 ...
 ... It is noticable that the parser seems to define
 its own is_space() and is_digit() functions, e.g.:

 +#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ')

 Wouldn't it be more clean to use isspace() instead?

isspace() accepts '\f', '\v', and sometimes other characters as well,
depending on locale.  Likewise, isdigit() accepts some non-ASCII
characters in addition to [0-9], depending on the locale and platform.
 Thus, to avoid parsing a superset of the JSON spec, I can't use the
ctype.h functions.  I should add a comment with this explanation.
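
For the record, the JSON-exact classification stays locale-independent with
a handful of one-liners; is_space() is the one the patch already defines,
and the other two are written here the way one would expect them to look:

    /* RFC 4627 whitespace: space, tab, newline, carriage return - nothing else */
    #define is_space(c)     ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ')

    /* ASCII digits only; unlike isdigit(), never locale-dependent */
    #define is_digit(c)     ((c) >= '0' && (c) <= '9')

    /* ASCII hexadecimal digits only */
    #define is_hex_digit(c) (is_digit(c) || \
                             ((c) >= 'A' && (c) <= 'F') || \
                             ((c) >= 'a' && (c) <= 'f'))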

 This code block in function json_escape_unicode() looks suspicious:

 +   /* Convert to UTF-8, if necessary. */
 +   {
 +       const char *orig = json;
 +       json = (const char *)
 +           pg_do_encoding_conversion((unsigned char *) json, length,
 +                                     GetDatabaseEncoding(), PG_UTF8);
 +       if (json != orig)
 +           length = strlen(json);
 +   }

 Seems it *always* wants to convert the string. Isn't there a if condition
 missing?

pg_do_encoding_conversion does this check already.  Perhaps I should
rephrase the comment slightly:

+   /* Convert to UTF-8 (only if necessary). */

 There's a commented code fragment missing, which should be removed (see last
 two lines of the snippet below):

 +static unsigned int
 +read_hex16(const char *in)
 ...
 +               Assert(is_hex_digit(c));
 +
 +               if (c >= '0' && c <= '9')
 +                       tmp = c - '0';
 +               else if (c >= 'A' && c <= 'F')
 +                       tmp = c - 'A' + 10;
 +               else /* if (c >= 'a' && c <= 'f') */
 +                       tmp = c - 'a' + 10;

That comment is there for documentation purposes, to indicate the
range check that's skipped because we know c is a hex digit, and [a-f]
is the only thing it could be (and must be) at that line.  Should it
still be removed?

 json.c introduces new appendStringInfo* functions, e.g.
 appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better
 to name them to something different, since it may occur someday that the
 backend
 will introduce the same notion with a different meaning. They are static,
 but why not naming them into something like jsonAppendStringInfoUtf8() or
 similar?

I agree.

 -- Regression Tests
...
 It also tests UNICODE sequences and input encoding with other server
 encoding than UTF-8.

It tests with *client* encodings other than UTF-8, but not with server
encodings other than UTF-8.  Is it possible to write tests for
different server encodings?

-- Self-review

There's a minor bug in the normalization code:

 select $$ "\u0009" $$::json;
    json
 ----------
  "\u0009"
 (1 row)

This should produce "\t" instead.
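
In other words, normalization should prefer the short escapes JSON defines
and fall back to \u00XX only for the remaining control characters.  A
sketch of the per-character rule (an illustration of the intended
behaviour, not the patch's code):

    #include "postgres.h"
    #include "lib/stringinfo.h"

    /*
     * Sketch: emit one already-unescaped string character in canonical
     * form.  Characters with short escapes use them; other control
     * characters fall back to \u00XX; everything else passes through.
     */
    static void
    append_canonical_char(StringInfo buf, char c)
    {
        switch (c)
        {
            case '"':   appendStringInfoString(buf, "\\\"");   break;
            case '\\':  appendStringInfoString(buf, "\\\\");   break;
            case '\b':  appendStringInfoString(buf, "\\b");    break;
            case '\f':  appendStringInfoString(buf, "\\f");    break;
            case '\n':  appendStringInfoString(buf, "\\n");    break;
            case '\r':  appendStringInfoString(buf, "\\r");    break;
            case '\t':  appendStringInfoString(buf, "\\t");    break;
            default:
                if ((unsigned char) c < 0x20)
                    appendStringInfo(buf, "\\u%04X", (unsigned int) (unsigned char) c);
                else
                    appendStringInfoChar(buf, c);
                break;
        }
    }

Run over the unescaped input, this turns \u0009 into \t while leaving,
say, \u0001 as \u0001.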

I'm thinking that my expect_*/next_*/pop_char/... parsing framework is
overkill, and that, for a syntax as simple as JSON's, visibly messy
parsing code like this:

if (s < e && (*s == 'E' || *s == 'e'))
{
    s++;
    if (s < e && (*s == '+' || *s == '-'))
        s++;
    if (!(s < e && is_digit(*s)))
        return false;
    do
        s++;
    while (s < e && is_digit(*s));
}

would be easier to maintain than the deceptively elegant-looking code
currently in place:

if (optional_char_cond(s, e, *s == 'E' || *s == 'e'))
{
    optional_char_cond(s, e, *s == '+' || *s == '-');
    skip1_pred(s, e, is_digit);
}

I don't plan on gutting the current macro-driven parser just yet.
When I do, the new parser will support building a parse tree (only
when needed), and the parsing functions will have signatures like
this:

static bool parse_value(const char **sp, const char *e, JsonNode **out);
static bool parse_string(const char **sp, const char *e, StringInfo *out);
...
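
(JsonNode isn't defined anywhere yet; I picture a small tagged union along
these lines - purely hypothetical, only meant to make the signatures above
concrete:)

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "nodes/pg_list.h"

    /* Hypothetical parse-tree node; names and layout are illustrative only. */
    typedef enum JsonNodeType
    {
        JSON_NULL,
        JSON_BOOL,
        JSON_NUMBER,
        JSON_STRING,
        JSON_ARRAY,
        JSON_OBJECT
    } JsonNodeType;

    typedef struct JsonNode
    {
        JsonNodeType type;
        union
        {
            bool        boolean;    /* JSON_BOOL */
            char       *number;     /* JSON_NUMBER, kept as canonical text */
            StringInfo  string;     /* JSON_STRING, unescaped */
            List       *elements;   /* JSON_ARRAY: list of JsonNode* */
            List       *members;    /* JSON_OBJECT: alternating key/value nodes */
        }           val;
    } JsonNode;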

The current patch doesn't provide manipulation features where a parse
tree would come in handy.  I'd rather hold off on rewriting the parser
until such features are added.

I'll try to submit a revised patch within the next couple days.

Thanks!

- Joey
