Re: [HACKERS] JSON for PG 9.2

2012-01-31 Thread Joey Adams
On Tue, Jan 31, 2012 at 1:29 PM, Abhijit Menon-Sen  wrote:
> At 2012-01-31 12:04:31 -0500, robertmh...@gmail.com wrote:
>>
>> That fails to answer the question of what we ought to do if we get an
>> invalid sequence there.
>
> I think it's best to categorically reject invalid surrogates as early as
> possible, considering the number of bugs that are related to them (not
> in Postgres, just in general). I can't see anything good coming from
> letting them in and leaving them to surprise someone in future.
>
> -- ams

+1

Another sequence to beware of is \u0000.  While escaped NUL characters
are perfectly valid in JSON, NUL characters aren't allowed in TEXT
values.  This means not all JSON strings can be converted to TEXT,
even in UTF-8.  This may also complicate collation, if comparison
functions demand null-terminated strings.

I'm mostly in favor of allowing \u0000.  Banning \u0000 means users
can't use JSON strings to marshal binary blobs, e.g. by escaping
non-printable characters and only using U+0000..U+00FF.  Instead, they
have to use base64 or similar.
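
For illustration, here's a minimal sketch (a hypothetical helper, not
part of any patch) of the kind of marshaling this enables: every byte
becomes a \u00XX escape, so a blob containing zero bytes needs \u0000.

#include <stdio.h>
#include <stdlib.h>

/* Wrap arbitrary bytes in a JSON string by escaping each byte as
 * \u00XX, i.e. treating the blob as Latin-1.  Caller frees the result. */
char *
bytes_to_json_string(const unsigned char *data, size_t len)
{
    /* 6 characters per byte, plus two quotes and a terminating NUL */
    char *out = malloc(len * 6 + 3);
    char *p = out;

    if (out == NULL)
        return NULL;
    *p++ = '"';
    for (size_t i = 0; i < len; i++)
        p += sprintf(p, "\\u%04X", (unsigned int) data[i]);
    *p++ = '"';
    *p = '\0';
    return out;
}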

Banning \u0000 doesn't quite violate the RFC:

An implementation may set limits on the length and character
contents of strings.

-Joey



[HACKERS] libpq should have functions for escaping data for use in COPY FROM

2012-03-09 Thread Joey Adams
libpq has functions for escaping values in SQL commands
(PQescapeStringConn, PQescapeByteaConn, and the new PQescapeLiteral),
and it supports parameterizing queries with PQexecParams.  But it does
not (to my knowledge) have functions for escaping values for COPY
FROM.

COPY FROM is useful for inserting rows in bulk (though I wonder if
constructing massive INSERT statements and using PQexecParams is just
as efficient).  It is also useful for generating .sql files which can
be run on a database elsewhere.

I think libpq should include functions for escaping with COPY FROM.
Perhaps the API could look like this:

typedef struct EscapeCopyInfo
{
    /*
     * If this is not NULL, the PQescapeCopy functions will take advantage
     * of connection-specific information to minimize the escaped
     * representation length.
     */
    PGconn     *conn;

    char        delimiter;

    const char *null_string;
} EscapeCopyInfo;

void initEscapeCopyInfo(EscapeCopyInfo *);

/* If info is NULL, it defaults to {NULL, '\t', "\\N"} */
size_t PQescapeStringCopy(
EscapeCopyInfo *info,
char *to,
const char *from,
size_t from_length,
int *error
);

size_t PQescapeByteaCopy(
EscapeCopyInfo *info,
unsigned char *to,
const unsigned char *from,
size_t from_length,
int *error
);

Using an EscapeCopyInfo structure to specify format information cuts
down on parameter counts for the escaping functions, and makes it
possible to support more options in the future (e.g. CSV).

I'm not terribly attached to the parameter order of the functions.  I
was just following the lead of PQescapeStringConn.
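
For concreteness, here's a sketch of how the proposed functions might be
used to stream one row of COPY text.  Everything except PGconn and
PQputCopyData is the proposed (hypothetical) API from above, and buffer
sizing is hand-waved:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

void
append_copy_row(PGconn *conn, const char *name, const char *comment)
{
    EscapeCopyInfo info;
    char    buf[8192];          /* a real caller must size this for the
                                 * documented worst case of the format */
    char   *p = buf;
    int     error = 0;

    initEscapeCopyInfo(&info);
    info.conn = conn;           /* use connection-specific information */

    /* default text format: tab delimiter between columns */
    p += PQescapeStringCopy(&info, p, name, strlen(name), &error);
    *p++ = '\t';
    p += PQescapeStringCopy(&info, p, comment, strlen(comment), &error);
    *p++ = '\n';

    if (error)
        fprintf(stderr, "escaping failed\n");
    else
        PQputCopyData(conn, buf, (int) (p - buf));
}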

This API writes text into a buffer, rather than allocating a buffer
with malloc like PQescapeByteaConn and PQescapeLiteral do.  This means
rows containing multiple values can be constructed efficiently.  On
the other hand, it is inconvenient and error-prone, since the
programmer has to calculate how big the buffer needs to be (and this
varies depending on settings).  Imagine this bug:

 * A programmer allocates a buffer of size 2*length + 2 for a BYTEA,
assuming the bytestring will be escaped using the \x... format

 * A user runs the program with PostgreSQL 8.4, which predates the
\x... format, and now the bytestring's escaped representation can be
up to 5*length bytes long (but very frequently isn't!)

I wrote a basic implementation for a Haskell binding, but it only
supports default COPY options (text mode, with tab delimiter and \N
null string).


https://github.com/joeyadams/haskell-libpq/blob/copy-from/cbits/escape-copy.c

Before spending a bunch of time on this, I'd like some input.  A few questions:

 * Should we have corresponding functions for parsing COPY TO data, or
is PQexecParams sufficient?

 * Should we support CSV escaping?  Can the CSV format safely encode
all characters (in particular, newlines)?

 * Should we deal with encodings here, or just escape everything that
isn't printable ASCII like the code I wrote does?

Thanks,
-Joey



Re: [HACKERS] JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?

2012-05-01 Thread Joey Adams
On Tue, May 1, 2012 at 8:02 AM, Hannu Krosing  wrote:
> Hi hackers
>
> After playing around with array_to_json() and row_to_json() functions a
> bit, I have a question - why do we even have 2 variants *_to_json()

Here's the discussion where that decision was made:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg01339.php

To quote:

>>> why not call all these functions 'to_json' and overload them?
>>
>> I don't honestly feel that advances clarity much. And we might want to 
>> overload each at some stage with options that are specific to the datum 
>> type. We have various foo_to_xml() functions now.
>
> -1
>
> older proposal is more consistent with xml functions

The most compelling argument I see here is the one about options
specific to the datum type.

Two other reasons I can think of:

 * If someone tries to google for how to convert an array to JSON,
having a function named 'array_to_json' will make that easier.

 * If the JSON type does not yet support, say, converting from a
number, it will be apparent from the names and types of the functions,
rather than being a hidden surprise.  On the other hand, array_to_json
and composite_to_json already convert ANY values to JSON, so this
doesn't matter, anyway.

On Tue, May 1, 2012 at 11:02 AM, Hannu Krosing  wrote:
> What we currently lack is direct conversion for simple types, though
> they are easily achieved by converting to a single-element array and
> then stripping outer [] from the result

I agree that this function ought to be exposed.  Note that such a
function (perhaps called datum_to_json) is indeed the same as the
proposed to_json function, which tries to convert a value of any type
to JSON.

> It would be really nice to also have the casts from json to any type,
> including records though.

What the casts currently do (primarily) is convert between the TEXT
and JSON types.  So if you have JSON-encoded TEXT, use a cast to
convert it to the JSON type (this will perform validation, ensuring
that no invalid JSON gets in).  Any escape/unescape operations need to
be explicit.

-Joey



Re: [HACKERS] JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?

2012-05-01 Thread Joey Adams
On Tue, May 1, 2012 at 12:22 PM, Andrew Dunstan  wrote:
> Second, RFC 4627 is absolutely clear: a valid JSON value can only be an
> object or an array, so this thing about converting arbitrary datum values to
> JSON is a fantasy. If anything, we should adjust the JSON input routines to
> disallow anything else, rather than start to output what is not valid JSON.

No, the RFC says (emphasis mine):

A JSON *text* is a serialized object or array.

If we let the JSON type correspond to a *value* instead, this
restriction does not apply, and the JSON type has a useful recursive
definition.

For example, this function would not be possible if we applied the
"object or array" restriction:

unnest(json) returns setof json

Note that a similar distinction appears with the XML type: "document"
versus "content".

-Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-15 Thread Joey Adams
On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
 wrote:
> I'll try to submit a revised patch within the next couple days.

Sorry this is later than I said.

I addressed the issues covered in the review.  I also fixed a bug
where "\u0022" would become """, which is invalid JSON, causing an
assertion failure.

However, I want to put this back into WIP for a number of reasons:

 * The current code accepts invalid surrogate pairs (e.g.
"\uD800\uD800").  The problem with accepting them is that it would be
inconsistent with PostgreSQL's Unicode support, and with the Unicode
standard itself.  In my opinion: as long as the server encoding is
universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
(barring data corruption and resource limitations).

 * I'd like to go ahead with the parser rewrite I mentioned earlier.
The new parser will be able to construct a parse tree when needed, and
it won't use those overkill parsing macros.

 * I recently learned that not all supported server encodings can be
converted to Unicode losslessly.  The current code, on output,
converts non-ASCII characters to Unicode escapes under some
circumstances (see the comment above json_need_to_escape_unicode).

I'm having a really hard time figuring out how the JSON module should
handle non-Unicode character sets.  \uXXXX escapes in JSON literals
can be used to encode characters not available in the server encoding.
 On the other hand, the server encoding can encode characters not
present in Unicode (see the third bullet point above).  This means
JSON normalization and comparison (along with member lookup) are not
possible in general.

Even if I assume server -> UTF-8 -> server transcoding is lossless,
the situation is still ugly.  Normalization could be implemented a few
ways:

 * Convert from server encoding to UTF-8, and convert \uXXXX escapes
to UTF-8 characters.  This is space-efficient, but the resulting text
would not be compatible with the server encoding (which may or may not
matter).
 * Convert from server encoding to UTF-8, and convert all non-ASCII
characters to \uXXXX escapes, resulting in pure ASCII.  This bloats
the text by a factor of three, in the worst case.
 * Convert \uXXXX escapes to characters in the server encoding, but
only where possible.  This would be extremely inefficient.

The parse tree (for functions that need it) will need to store JSON
member names and strings either in UTF-8 or in normalized JSON (which
could be the same thing).

Help and advice would be appreciated.  Thanks!

- Joey


json-contrib-rev1-20110714.patch.gz
Description: GNU Zip compressed data



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-18 Thread Joey Adams
On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane  wrote:
> BTW, could the \uXXXX problem be finessed by leaving such escapes in
> source form?

Yes, it could.  However, it doesn't solve the problem of comparison
(needed for member lookup), which requires canonicalizing the strings
to be compared.

Here's a Unicode handling policy I came up with.  It is guaranteed to
be lossless as long as the client and database encodings are the same.

---

On input (json_in), if the text is valid JSON, it is condensed:

 * Whitespace characters surrounding tokens are removed.
 * Long escapes (like \u0022) are converted to short escapes (like \")
where possible.
 * Unnecessary escapes of ASCII characters (e.g. \u0061 and even
\u007F) are converted to their respective characters.
 * Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E)
are converted to their respective characters, but only if the database
encoding is UTF-8.

On output (json_out), non-ASCII characters are converted to \uXXXX
escapes, unless one or more of these very likely circumstances hold:

 * The client encoding and database encoding are the same.  No
conversion is performed, so escaping characters will not prevent any
conversion errors.
 * The client encoding is UTF-8.  Escaping is not necessary because
the client can encode all Unicode codepoints.
 * The client encoding and/or database encoding is SQL_ASCII.
SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed.

When a JSON-encoded string is unwrapped using from_json (e.g.
from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the
characters they represent.  If any escapes cannot be represented in
the database encoding, an error will be raised.  Note that:

 * If the database encoding is UTF-8, conversion will never fail.
 * If the database encoding is SQL_ASCII, conversion will fail if any
escapes of non-ASCII characters are present.

---

However, I'm having a really hard time figuring out how comparison
would work in this framework.  Here are a few options:

 1. Convert the strings to UTF-8, convert the escapes to characters,
and compare the strings.
 2. Convert the escapes to the database encoding, then compare the strings.
 3. If either string contains escapes of non-ASCII characters, do 1.
Otherwise, do 2.

Number 1 seems the most sane to me, but could lead to rare errors.

Number 3 could produce confusing results.  If character set X has
three different representations of one Unicode codepoint, then we
could have scenarios like this (simplified):

 "abc♫" != "aaa♫"

but:

 "abc\u266b" == "aaa♫"

I suppose a simple solution would be to convert all escapes and
outright ban escapes of characters not in the database encoding.  This
would have the nice property that all strings can be unescaped
server-side.  Problem is, what if a browser or other program produces,
say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
where the encoding lacks this character?

On the other hand, converting all JSON to UTF-8 would be simpler to
implement.  It would probably be more intuitive, too, given that the
JSON RFC says, "JSON text SHALL be encoded in Unicode."

In any case, the documentation should say "Use UTF-8 for best
results", as there seems to be no entirely satisfactory way to handle
JSON in other database encodings.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Joey Adams
On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug  wrote:
> On Jul19, 2011, at 00:17 , Joey Adams wrote:
>> I suppose a simple solution would be to convert all escapes and
>> outright ban escapes of characters not in the database encoding.
>
> +1. Making JSON work like TEXT when it comes to encoding issues
> makes this all much simpler conceptually. It also avoids all kinds
> of weird issues if you extract textual values from a JSON document
> server-side.

Thanks for the input.  I'm leaning in this direction too.  However, it
will be a tad tricky to implement the conversions efficiently, since
the wchar API doesn't provide a fast path for individual codepoint
conversion (that I'm aware of), and pg_do_encoding_conversion doesn't
look like a good thing to call lots of times.

My plan is to scan for escapes of non-ASCII characters, convert them
to UTF-8, and put them in a comma-delimited string like this:

a,b,c,d,

then, convert the resulting string to the server encoding (which may
fail, indicating that some codepoint(s) are not present in the
database encoding).  After that, read the string and plop the
characters where they go.

It's "clever", but I can't think of a better way to do it with the existing API.


- Joey



Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Joey Adams
Forwarding because the mailing list rejected the original message.

-- Forwarded message --
From: Joey Adams 
Date: Tue, Jul 19, 2011 at 11:23 PM
Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
Another swing at JSON
To: Alvaro Herrera 
Cc: Florian Pflug , Tom Lane , Robert
Haas , Bernd Helmle ,
Dimitri Fontaine , David Fetter
, Josh Berkus , Pg Hackers



On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
 wrote:
> Would it work to have a separate entry point into mbutils.c that lets
> you cache the conversion proc caller-side?

That sounds like a really good idea.  There's still the overhead of
calling the proc, but I imagine it's a lot less than looking it up.

> I think the main problem is
> determining the byte length of each source character beforehand.

I'm not sure what you mean.  The idea is to convert the \uXXXX escape
to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
sequence is easy to compute), call the conversion proc to get the
null-terminated database-encoded character, then append the result to
whatever StringInfo the string is going into.

The only question mark is how big the destination buffer will need to
be.  The maximum number of bytes per char in any supported encoding is
4, but is it possible for one Unicode character to turn into multiple
"character"s in the database encoding?

While we're at it, should we provide the same capability to the SQL
parser?  Namely, the ability to use \uXXXX escapes above U+007F when
the server encoding is not UTF-8?

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-19 Thread Joey Adams
On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas  wrote:
>> Thanks for the input.  I'm leaning in this direction too.  However, it
>> will be a tad tricky to implement the conversions efficiently, ...
>
> I'm a bit confused, because I thought what I was talking about was not
> doing any conversions in the first place.

We want to be able to handle \uXXXX escapes when the database encoding
is not UTF-8.  We could leave them in place, but sooner or later
they'll need to be converted in order to unwrap or compare JSON
strings.

The approach being discussed is converting escapes to the database
encoding.  This means escapes of characters not available in the
database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.

The PostgreSQL parser (which also supports Unicode escapes) takes a
simpler approach: don't allow non-ASCII escapes unless the server
encoding is UTF-8.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-20 Thread Joey Adams
On Wed, Jul 20, 2011 at 6:49 AM, Florian Pflug  wrote:
> Hm, I agree that we need to handle \uXXXX escapes in JSON input.
> We won't ever produce them during output though, right?

We could, to prevent transcoding errors if the client encoding is
different than the server encoding (and neither is SQL_ASCII, nor is
the client encoding UTF8).  For example, if the database encoding is
UTF-8 and the client encoding is WIN1252, I'd think it would be a good
idea to escape non-ASCII characters.

> How does the XML type handle the situation? It seems that it'd have
> the same problem with unicode entity references ("&#nnnn;").

From the looks of it, XML operations convert the text to UTF-8 before
passing it to libxml.  The XML type does not normalize the input;
SELECT '&#9835;♫'::xml; simply yields &#9835;♫.  Escapes of any
character are allowed in any encoding, from the looks of it.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Joey Adams
I think I've decided to only allow escapes of non-ASCII characters
when the database encoding is UTF8.  For example, $$"\u2013"$$::json
will fail if the database encoding is WIN1252, even though WIN1252 can
encode U+2013 (EN DASH).  This may be somewhat draconian, given that:

 * SQL_ASCII can otherwise handle "any" language according to the documentation.

 * The XML type doesn't have this restriction (it just stores the
input text verbatim, and converts it to UTF-8 before doing anything
complicated with it).

However, it's simple to implement and understand.  The JSON data type
will not perform any automatic conversion between character encodings.
 Also, if we want to handle this any better in the future, we won't
have to support legacy data containing a mixture of encodings.

In the future, we could create functions to compensate for the issues
people encounter; for example:

 * json_escape_unicode(json [, replace bool]) returns text -- convert
non-ASCII characters to escapes.  Optionally, use \uFFFD for
unconvertible characters.
 * json_unescape_unicode(text [, replace text]) returns json -- like
json_in, but convert Unicode escapes to characters when possible.
Optionally, replace unconvertible characters with a given string.

I've been going back and forth on how to handle encodings in the JSON
type for a while, but suggestions and objections are still welcome.
However, I plan to proceed in this direction so progress can be made.

On another matter, should the JSON type guard against duplicate member
keys?  The JSON RFC says "The names within an object SHOULD be
unique," meaning JSON with duplicate members can be considered valid.
JavaScript interpreters (the ones I tried), PHP, and Python all have
the same behavior: discard the first member in favor of the second.
That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
error if a duplicate attribute is present (e.g. '<x a="1" a="2"/>'::xml).

Thanks for the input,
- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-22 Thread Joey Adams
On Fri, Jul 22, 2011 at 7:12 PM, Robert Haas  wrote:
> Hmm.  That's tricky.  I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

I agree.  Besides, throwing an error on duplicate keys seems like the
most logical thing to do.  The most compelling reason not to, I think,
is that it would make the input function a little slower.

On Fri, Jul 22, 2011 at 8:26 PM, Florian Pflug  wrote:
>> * The XML type doesn't have this restriction (it just stores the
>> input text verbatim, and converts it to UTF-8 before doing anything
>> complicated with it).
>
> Yeah. But the price the XML type pays for that is the lack of an
> equality operator.

Interesting.  This leads to a couple more questions:

 * Should the JSON data type (eventually) have an equality operator?
 * Should the JSON input function alphabetize object members by key?

If we canonicalize strings and numbers and alphabetize object members,
then our equality function is just texteq.  The only stumbling block
is canonicalizing numbers.  Fortunately, JSON's definition of a
"number" is its decimal syntax, so the algorithm is child's play:

 * Figure out the digits and exponent.
 * If the exponent is greater than 20 or less than -6 (arbitrary), use
exponential notation.
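
For concreteness, here's a rough sketch of the formatting step.  The
digits/exponent decomposition is an assumed input and the helper is not
from any patch: "digits" holds the significant digits with no leading or
trailing zeros ("0" for zero), and "exp" is the base-10 exponent of the
leading digit, so 1000.1 is ("10001", 3) and 2.718282e-1000 is
("2718282", -1000).

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static void
canonicalize_number(char *out, size_t outlen,
                    bool neg, const char *digits, int exp)
{
    const char *sign = neg ? "-" : "";
    int         ndigits = (int) strlen(digits);

    if (exp > 20 || exp < -6)
    {
        /* exponential notation: d.dddde<exp>; no fraction for one digit */
        if (ndigits > 1)
            snprintf(out, outlen, "%s%c.%se%d",
                     sign, digits[0], digits + 1, exp);
        else
            snprintf(out, outlen, "%s%se%d", sign, digits, exp);
    }
    else if (exp < 0)
    {
        /* 0.00...digits; "%.*d" of zero prints exactly that many zeros */
        snprintf(out, outlen, "%s0.%.*d%s", sign, -exp - 1, 0, digits);
    }
    else if (exp >= ndigits - 1)
    {
        /* an integer, padded with trailing zeros as needed */
        snprintf(out, outlen, "%s%s%.*d",
                 sign, digits, exp - (ndigits - 1), 0);
    }
    else
    {
        /* the decimal point falls inside the digit string */
        snprintf(out, outlen, "%s%.*s.%s",
                 sign, exp + 1, digits, digits + exp + 1);
    }
}

With these thresholds, ("10001", 3) renders as 1000.1 and
("2718282", -1000) as 2.718282e-1000.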

The problem is: 2.718282e-1000 won't equal 0 as may be expected.  I
doubt this matters much.

It would be nice to canonicalize JSON on input, and that's the way I'd
like to go, but two caveats are:

 * Input (and other operations) would require more CPU time.  Instead
of being able to pass the data through a quick condense function, it'd
have to construct an AST (to sort object members) and re-encode the
JSON back into a string.
 * Users, for aesthetic reasons, might not want their JSON members rearranged.

If, in the future, we add the ability to manipulate large JSON trees
efficiently (e.g. by using an auxiliary table like TOAST does), we'll
probably want unique members, so enforcing them now may be prudent.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-23 Thread Joey Adams
Also, should I forbid the escape \u0000 (in all database encodings)?

Pros:

 * If \u0000 is forbidden, and the server encoding is UTF-8, then
every JSON-wrapped string will be convertible to TEXT.

 * It will be consistent with the way PostgreSQL already handles text,
and with the decision to use database-encoded JSON strings.

 * Some applications choke on strings with null characters.  For
example, in some web browsers but not others, if you pass
"Hello\uworld" to document.write() or assign it to a DOM object's
innerHTML, it will be truncated to "Hello".  By banning \u, users
can catch such rogue strings early.

 * It's a little easier to represent internally.

Cons:

 * Means the JSON type will accept a subset of the JSON described in
RFC 4627.  However, the RFC does say "An implementation may set limits
on the length and character contents of strings", so we can arguably
get away with banning \u0000 while being law-abiding citizens.

 * Being able to store U+0000–U+00FF means users can use JSON strings
to hold binary data: by treating it as Latin-1.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Joey Adams
On Sat, Jul 23, 2011 at 11:14 PM, Robert Haas  wrote:
> I doubt you're going to want to reinvent TOAST, ...

I was thinking about making it efficient to access or update
foo.a.b.c.d[1000] in a huge JSON tree.  Simply TOASTing the varlena
text means we have to unpack the entire datum to access and update
individual members.  An alternative would be to split the JSON into
chunks (possibly by using the pg_toast_<oid> table) and have some sort
of index that can be used to efficiently look up values by path.

This would not be trivial, and I don't plan to implement it any time soon.

>
On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug  wrote:
> On Jul24, 2011, at 05:14 , Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams  
>> wrote:
>>> ... Fortunately, JSON's definition of a
>>> "number" is its decimal syntax, so the algorithm is child's play:
>>>
>>>  * Figure out the digits and exponent.
>>>  * If the exponent is greater than 20 or less than -6 (arbitrary), use
>>> exponential notation.
>>>
>>
>
> I agree. As for your proposed algorithm, I suggest to instead use
> exponential notation if it produces a shorter textual representation.
> In other words, for values between -1 and 1, we'd switch to exponential
> notation if there's more than 1 leading zero (to the right of the decimal
> point, of course), and for values outside that range if there're more than
> 2 trailing zeros and no decimal point. All after redundant zeros and
> decimal points are removed. So we'd store
>
> 0 as 0
> 1 as 1
> 0.1 as 0.1
> 0.01 as 0.01
> 0.001 as 1e-3
> 10 as 10
> 100 as 100
> 1000 as 1e3
> 1000.1 as 1000.1
> 1001 as 1001
>

Interesting idea.  The reason I suggested using exponential notation
only for extreme exponents (less than -6 or greater than +20) is
partly for presentation value.  Users might be annoyed to see 1000000
turned into 1e6.  Moreover, applications working solely with integers
that don't expect the floating point syntax may choke on the converted
numbers.  32-bit integers can be losslessly encoded as IEEE
double-precision floats (JavaScript's internal representation), and
JavaScript's algorithm for converting a number to a string ([1],
section 9.8.1) happens to preserve the integer syntax (I think).

Should we follow the JavaScript standard for rendering numbers (which
my suggestion approximates)?  Or should we use the shortest encoding
as Florian suggests?

- Joey

 [1]: 
http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Joey Adams
On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug  wrote:
> The downside being that we'd then either need to canonicalize in
> the equality operator, or live with either no equality operator or
> a rather strange one.

It just occurred to me that, even if we sort object members, texteq
might not be a sufficient way to determine equality.  In particular,
IEEE floats treat +0 and -0 as two different things, but they are
equal when compared.  Note that we're only dealing with a decimal
representation; we're not (currently) converting to double-precision
representation and back.

Should we mimic IEEE floats and preserve -0 versus +0 while treating
them as equal?  Or should we treat JSON floats like numeric and
convert -0 to 0 on input?  Or should we do something else?  I think
converting -0 to 0 would be a bad idea, as it would violate the
intuitive assumption that JSON can be used to marshal double-precision
floats.

- Joey



Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON

2011-07-24 Thread Joey Adams
On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams  wrote:
> Should we mimic IEEE floats and preserve -0 versus +0 while treating
> them as equal?  Or should we treat JSON floats like numeric and
> convert -0 to 0 on input?  Or should we do something else?  I think
> converting -0 to 0 would be a bad idea, as it would violate the
> intuitive assumption that JSON can be used to marshal double-precision
> floats.

On the other hand, JavaScript's own .toString and JSON.stringify turn
-0 into 0, so JSON can't marshal -0 around, anyway (in practice).  Now
I think turning -0 into 0 would be fine for canonicalizing numbers in
json_in.

- Joey



Re: [HACKERS] JSON for PG 9.2

2011-12-13 Thread Joey Adams
On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian  wrote:
> Where are we with adding JSON for Postgres 9.2?  We got bogged down in
> the data representation last time we discussed this.

We should probably have a wiki page titled "JSON datatype status" to
help break the cycle we're in:

 * Someone asks about the status of JSON

 * Various ideas are suggested

 * Patches are posted (maybe)

 * More discussion about fundamental issues ensues

 * Nothing is accomplished (as far as adding JSON to Postgres core)

There are several JSON implementations for Postgres floating around, including:

 * http://pgxn.org/dist/pg-json/ : Mentioned in previous posts; a JSON
library based on Jansson supporting path subscript and equality
testing

 * http://git.postgresql.org/gitweb/?p=json-datatype.git;a=summary :
The JSON datatype I implemented for Google Summer of Code 2010.  It
has the most features of any implementation I'm aware of, but:

* Is in the form of a contrib module

* Preserves input text verbatim, a guarantee that will be broken
by more efficient implementations

 * 
http://git.postgresql.org/gitweb/?p=json-datatype.git;a=shortlog;h=refs/heads/json2
 : My rewrite of the JSON module that condenses input (but still
stores it as text) and addresses the issue of JSON when either the
server or client encoding is not UTF-8.  Needs more features and
documentation, but like my other implementation, may not be quite what
we want.

Issues we've encountered include:

 * Should JSON be stored as binary or as text?

 * How do we deal with Unicode escapes and characters if the server or
client encoding is not UTF-8?  Some (common!) character encodings have
code points that don't map to Unicode.  Also, the charset conversion
modules do not provide fast entry points for converting individual
characters; each conversion involves a funcapi call.

---

In an application I'm working on, I store JSON-encoded objects in a
PostgreSQL database (using TEXT).  I do so because it allows me to
store non-relational data that is easy for my JavaScript code to work
with.

However, I fail to see much benefit of a JSON type.  When I need to
work with the data in PHP, C, or Haskell, I use JSON parsing libraries
available in each programming language.  Although being able to
transform or convert JSON data within SQL might be convenient, I can't
think of any compelling reason to do it in my case.

Can someone clarify why a JSON type would be useful, beyond storage
and validation?  What is a real-world, *concrete* example of a problem
where JSON manipulation in the database would be much better than:

 * Using the application's programming language to manipulate the data
(which it does a lot already) ?

 * Using CouchDB or similar instead of PostgreSQL?

- Joey



Re: [HACKERS] JSON for PG 9.2

2011-12-16 Thread Joey Adams
On Fri, Dec 16, 2011 at 8:52 AM, Robert Haas  wrote:
> But I think the important point is that this is an obscure corner case.
> Let me say that one more time: obscure corner case!

+1

> The only reason JSON needs to care about this at all is that it allows
> \u1234 to mean Unicode code point 0x1234.  But for that detail, JSON
> would be encoding-agnostic.  So I think it's sufficient for us to
> simply decide that that particular feature may not work (or even, will
> not work) for non-ASCII characters if you use a non-UTF8 encoding.
> There's still plenty of useful things that can be done with JSON even
> if that particular feature is not available; and that way we don't
> have to completely disable the data type just because someone wants to
> use EUC-JP or something.

So, if the server encoding is not UTF-8, should we ban Unicode escapes:

"\u00FCber"

or non-ASCII characters?

"über"

Also:

 * What if the server encoding is SQL_ASCII?

 * What if the server encoding is UTF-8, but the client encoding is
something else (e.g. SQL_ASCII)?

- Joey



[HACKERS] Wishlist: parameterizable types

2011-12-21 Thread Joey Adams
This may be ambitious, but it'd be neat if PostgreSQL supported
parameterizable types.  For example, suppose a contrib module defines
a "pair" type.  It could be used as follows:

CREATE TABLE my_table (
coord pair(float, float)
);

The "pair" module could define functions like these for constructing
and examining pairs:

create function pair_create(a, b) returns pair(a, b);
create function pair_fst(pair(a, b)) returns a;
create function pair_snd(pair(a, b)) returns b;

Here, each function is polymorphic in two type variables, a and b.  As
far as I know, PostgreSQL only supports one type variable per
function, via the anyelement keyword.  Thus, unless we restrict
ourselves to only one type parameter, parameterizable types wouldn't
be very useful without support for multiple type variables.

PostgreSQL already has a parameterizable type: array.  However, it
would be nontrivial to introduce another such type.  Currently, nearly
every type in PostgreSQL has a corresponding array type.  For example,
in pg_type, there's money, and there's _money (array of money values).
Continuing with this pattern means we would need something like P*T
entries in pg_type, where P is the number of type constructors (e.g.
array), and T is the number of base types.

Moreover, the array type isn't truly nestable.  For one, PostgreSQL
considers int[] and int[][][][] as the same type:

> select '{1,2,3}' :: int[][][][];
  int4
---------
 {1,2,3}
(1 row)

Also, arrays inside of arrays aren't allowed, only multidimensional arrays:

> select '{{1,2},{3}}' :: int[][];
ERROR:  multidimensional arrays must have array expressions with
matching dimensions
LINE 1: select '{{1,2},{3}}' :: int[][];

Suppose I didn't like these restrictions on the array type, and wanted
to make a type called "vector" that addresses them.  It might be used
as follows:

> select '{{1,2},{3}}' :: vector(vector(int));

Note that I'm stacking the 'vector' type constructor.  The array type
doesn't let you do that.

We could stretch the idea even further, and allow parameters and
recursion in user-defined types:

create type object(a) as (
key text,
value either(a, object(a))
);

C++ supports parameterizable types through templates.  Java and C#
support them through generics.

What I have in mind comes from Haskell's type system (which inspired
generics in Java).  The functions defined earlier could be implemented
in Haskell as:

data Pair a b = Pair a b

pair_fst :: Pair a b -> a
pair_fst (Pair a _) = a

pair_snd :: Pair a b -> b
pair_snd (Pair _ b) = b

What I'm wondering is: how complex would it be to add such a feature
to PostgreSQL's type system?

- Joey



Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Joey Adams
I wrote an array_to_json function during GSoC 2010:


http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289

It's not exposed as a procedure called array_to_json: it's part of the
to_json function, which decides what to do based on the argument type.

- Joey



Re: [HACKERS] JSON for PG 9.2

2012-01-14 Thread Joey Adams
On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan  wrote:
> Second, what should be do when the database encoding isn't UTF8? I'm
> inclined to emit a \uXXXX escape for any non-ASCII character (assuming it
> has a unicode code point - are there any code points in the non-unicode
> encodings that don't have unicode equivalents?). The alternative would be to
> fail on non-ASCII characters, which might be ugly. Of course, anyone wanting
> to deal with JSON should be using UTF8 anyway, but we still have to deal
> with these things. What about SQL_ASCII? If there's a non-ASCII sequence
> there we really have no way of telling what it should be. There at least I
> think we should probably error out.

I don't think there is a satisfying solution to this problem.  Things
working against us:

 * Some server encodings support characters that don't map to Unicode
characters (e.g. unused slots in Windows-1252).  Thus, converting to
UTF-8 and back is lossy in general.

 * We want a normalized representation for comparison.  This will
involve a mixture of server and Unicode characters, unless the
encoding is UTF-8.

 * We can't efficiently convert individual characters to and from
Unicode with the current API.

 * What do we do about \u0000?  TEXT datums cannot contain NUL characters.

I'd say just ban Unicode escapes and non-ASCII characters unless the
server encoding is UTF-8, and ban all \u0000 escapes.  It's easy, and
whatever we support later will be a superset of this.
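
A rough sketch of that rule (not code from any existing patch; the
function name is hypothetical, while ereport, GetDatabaseEncoding, and
PG_UTF8 are real backend APIs):

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Validate one \uXXXX escape's code point during json_in. */
static void
check_json_unicode_escape(unsigned int cp)
{
    if (cp == 0)
        ereport(ERROR,
                (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
                 errmsg("\\u0000 is not allowed in JSON strings")));

    if (cp > 0x7F && GetDatabaseEncoding() != PG_UTF8)
        ereport(ERROR,
                (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
                 errmsg("Unicode escapes above U+007F require a UTF8 server encoding")));
}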

Strategies for handling this situation have been discussed in prior
emails.  This is where things got stuck last time.

- Joey
