Re: [HACKERS] JSON for PG 9.2
On Tue, Jan 31, 2012 at 1:29 PM, Abhijit Menon-Sen wrote:
> At 2012-01-31 12:04:31 -0500, robertmh...@gmail.com wrote:
>>
>> That fails to answer the question of what we ought to do if we get an
>> invalid sequence there.
>
> I think it's best to categorically reject invalid surrogates as early as
> possible, considering the number of bugs that are related to them (not
> in Postgres, just in general). I can't see anything good coming from
> letting them in and leaving them to surprise someone in future.
>
> -- ams

+1

Another sequence to beware of is \u0000. While escaped NUL characters are
perfectly valid in JSON, NUL characters aren't allowed in TEXT values.
This means not all JSON strings can be converted to TEXT, even in UTF-8.
This may also complicate collation, if comparison functions demand
null-terminated strings.

I'm mostly in favor of allowing \u0000. Banning \u0000 means users can't
use JSON strings to marshal binary blobs, e.g. by escaping non-printable
characters and only using U+0000..U+00FF. Instead, they have to use
base64 or similar.

Banning \u0000 doesn't quite violate the RFC:

    An implementation may set limits on the length and character
    contents of strings.

-Joey

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
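To make the \u0000 concern concrete, here is a minimal sketch (not from any posted patch; the function name is invented) of how an input routine might spot an escaped NUL inside an already-validated JSON string body:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the JSON string body (the text between the quotes,
 * already validated as JSON) contains an escaped NUL, i.e. \u0000.
 * In valid JSON a backslash introduces either a two-character escape
 * or \uXXXX, so we only need to look at what follows each '\\'. */
static bool json_string_has_escaped_nul(const char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        if (s[i] != '\\')
        {
            i++;
            continue;
        }
        if (i + 1 < len && s[i + 1] == 'u')
        {
            if (i + 5 < len && strncmp(s + i + 2, "0000", 4) == 0)
                return true;
            i += 6;             /* skip \uXXXX */
        }
        else
            i += 2;             /* skip \" \\ \/ \b \f \n \r \t */
    }
    return false;
}
```

An input function rejecting \u0000 would call something like this right after validation and raise an error on a true result.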
[HACKERS] libpq should have functions for escaping data for use in COPY FROM
libpq has functions for escaping values in SQL commands
(PQescapeStringConn, PQescapeByteaConn, and the new PQescapeLiteral),
and it supports parameterizing queries with PQexecParams. But it does
not (to my knowledge) have functions for escaping values for COPY FROM.

COPY FROM is useful for inserting rows in bulk (though I wonder if
constructing massive INSERT statements and using PQexecParams is just
as efficient). It is also useful for generating .sql files which can be
run on a database elsewhere.

I think libpq should include functions for escaping with COPY FROM.
Perhaps the API could look like this:

    typedef struct EscapeCopyInfo
    {
        /*
         * If this is not NULL, the PQescapeCopy functions will take
         * advantage of connection-specific information to minimize the
         * escaped representation length.
         */
        PGconn     *conn;

        char        delimiter;

        const char *null_string;
    } EscapeCopyInfo;

    void initEscapeCopyInfo(EscapeCopyInfo *);

    /* If info is NULL, it defaults to {NULL, '\t', "\\N"} */
    size_t PQescapeStringCopy(
        EscapeCopyInfo *info,
        char *to,
        const char *from, size_t from_length,
        int *error);

    size_t PQescapeByteaCopy(
        EscapeCopyInfo *info,
        unsigned char *to,
        const unsigned char *from, size_t from_length,
        int *error);

Using an EscapeCopyInfo structure to specify format information cuts
down on parameter counts for the escaping functions, and makes it
possible to support more options in the future (e.g. CSV). I'm not
terribly attached to the parameter order of the functions; I was just
following the lead of PQescapeStringConn.

This API writes text into a buffer, rather than allocating a buffer
with malloc like PQescapeByteaConn and PQescapeLiteral do. This means
rows containing multiple values can be constructed efficiently. On the
other hand, it is inconvenient and error-prone, since the programmer
has to calculate how big the buffer needs to be (and this varies
depending on settings).

Imagine this bug:

 * A programmer allocates a buffer of size 2*length + 2 for a BYTEA,
   assuming the bytestring will be escaped using the \x... format.

 * A user runs the program with PostgreSQL 8.4, which predates the
   \x... format, and now the bytestring's escaped representation can
   be up to 5*length bytes long (but very frequently isn't!).

I wrote a basic implementation for a Haskell binding, but it only
supports default COPY options (text mode, with tab delimiter and \N
null string):

https://github.com/joeyadams/haskell-libpq/blob/copy-from/cbits/escape-copy.c

Before spending a bunch of time on this, I'd like some input. A few
questions:

 * Should we have corresponding functions for parsing COPY TO data, or
   is PQexecParams sufficient?

 * Should we support CSV escaping? Can the CSV format safely encode
   all characters (in particular, newlines)?

 * Should we deal with encodings here, or just escape everything that
   isn't printable ASCII, like the code I wrote does?

Thanks,
-Joey
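For reference, the escaping such a function would perform for COPY's text format is roughly the following. This is a hypothetical sketch in the spirit of the proposed PQescapeStringCopy, not actual libpq code, and it handles only the default text format (not CSV):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical COPY text-format escaper.  Writes the escaped form of
 * (from, from_length) into 'to', which must have room for
 * 2*from_length + 1 bytes, and returns the number of bytes written,
 * not counting the terminating NUL.  Escapes the control characters
 * COPY treats specially, backslash itself, and the column delimiter. */
static size_t copy_text_escape(char *to, const char *from,
                               size_t from_length, char delimiter)
{
    char *p = to;

    for (size_t i = 0; i < from_length; i++)
    {
        char c = from[i];

        switch (c)
        {
            case '\b': *p++ = '\\'; *p++ = 'b'; break;
            case '\f': *p++ = '\\'; *p++ = 'f'; break;
            case '\n': *p++ = '\\'; *p++ = 'n'; break;
            case '\r': *p++ = '\\'; *p++ = 'r'; break;
            case '\t': *p++ = '\\'; *p++ = 't'; break;
            case '\v': *p++ = '\\'; *p++ = 'v'; break;
            case '\\': *p++ = '\\'; *p++ = '\\'; break;
            default:
                if (c == delimiter)
                    *p++ = '\\';        /* e.g. a non-default delimiter */
                *p++ = c;
                break;
        }
    }
    *p = '\0';
    return (size_t) (p - to);
}
```

The 2*from_length + 1 bound is exactly the kind of format-dependent sizing rule the EscapeCopyInfo design would have to document per option.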
Re: [HACKERS] JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?
On Tue, May 1, 2012 at 8:02 AM, Hannu Krosing wrote:
> Hi hackers
>
> After playing around with array_to_json() and row_to_json() functions a
> bit it I have a question - why do we even have 2 variants *_to_json()

Here's the discussion where that decision was made:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg01339.php

To quote:

>>> why not call all these functions 'to_json' and overload them?
>>
>> I don't honestly feel that advances clarity much. And we might want to
>> overload each at some stage with options that are specific to the datum
>> type. We have various foo_to_xml() functions now.
>
> -1
>
> older proposal is more consistent with xml functions

The most compelling argument I see here is the one about options
specific to the datum type. Two other reasons I can think of:

 * If someone tries to google for how to convert an array to JSON,
   having a function named 'array_to_json' will make that easier.

 * If the JSON type does not yet support, say, converting from a
   number, it will be apparent from the names and types of the
   functions, rather than being a hidden surprise. On the other hand,
   array_to_json and composite_to_json already convert ANY values to
   JSON, so this doesn't matter, anyway.

On Tue, May 1, 2012 at 11:02 AM, Hannu Krosing wrote:
> What we currently lack is direct conversion for simple types, though
> they are easily achieved by converting to a single-element array and
> then stripping outer [] from the result

I agree that this function ought to be exposed. Note that such a
function (perhaps called datum_to_json) is indeed the same as the
proposed to_json function, which tries to convert a value of any type
to JSON.

> It would be really nice to also have the casts from json to any type,
> including records though.

What the casts currently do (primarily) is convert between the TEXT
and JSON types.
So if you have JSON-encoded TEXT, use a cast to convert it to the JSON
type (this will perform validation, ensuring that no invalid JSON gets
in). Any escape/unescape operations need to be explicit.

-Joey
Re: [HACKERS] JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?
On Tue, May 1, 2012 at 12:22 PM, Andrew Dunstan wrote:
> Second, RFC 4627 is absolutely clear: a valid JSON value can only be an
> object or an array, so this thing about converting arbitrary datum values to
> JSON is a fantasy. If anything, we should adjust the JSON input routines to
> disallow anything else, rather than start to output what is not valid JSON.

No, the RFC says (emphasis mine):

    A JSON *text* is a serialized object or array.

If we let the JSON type correspond to a *value* instead, this
restriction does not apply, and the JSON type has a useful recursive
definition. For example, this function would not be possible if we
applied the "object or array" restriction:

    unnest(json) returns setof json

Note that a similar distinction appears with the XML type: "document"
versus "content".

-Joey
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams wrote:
> I'll try to submit a revised patch within the next couple days.

Sorry this is later than I said.

I addressed the issues covered in the review. I also fixed a bug where
"\u0022" would become """, which is invalid JSON, causing an assertion
failure. However, I want to put this back into WIP for a number of
reasons:

 * The current code accepts invalid surrogate pairs (e.g.
   "\uD800\uD800"). The problem with accepting them is that it would
   be inconsistent with PostgreSQL's Unicode support, and with the
   Unicode standard itself. In my opinion: as long as the server
   encoding is universal (i.e. UTF-8), decoding a JSON-encoded string
   should not fail (barring data corruption and resource limitations).

 * I'd like to go ahead with the parser rewrite I mentioned earlier.
   The new parser will be able to construct a parse tree when needed,
   and it won't use those overkill parsing macros.

 * I recently learned that not all supported server encodings can be
   converted to Unicode losslessly. The current code, on output,
   converts non-ASCII characters to Unicode escapes under some
   circumstances (see the comment above json_need_to_escape_unicode).

I'm having a really hard time figuring out how the JSON module should
handle non-Unicode character sets. \u escapes in JSON literals can be
used to encode characters not available in the server encoding. On the
other hand, the server encoding can encode characters not present in
Unicode (see the third bullet point above). This means JSON
normalization and comparison (along with member lookup) are not
possible in general.

Even if I assume server -> UTF-8 -> server transcoding is lossless,
the situation is still ugly. Normalization could be implemented a few
ways:

 * Convert from server encoding to UTF-8, and convert \u escapes to
   UTF-8 characters. This is space-efficient, but the resulting text
   would not be compatible with the server encoding (which may or may
   not matter).

 * Convert from server encoding to UTF-8, and convert all non-ASCII
   characters to \u escapes, resulting in pure ASCII. This bloats the
   text by a factor of three, in the worst case.

 * Convert \u escapes to characters in the server encoding, but only
   where possible. This would be extremely inefficient.

The parse tree (for functions that need it) will need to store JSON
member names and strings either in UTF-8 or in normalized JSON (which
could be the same thing).

Help and advice would be appreciated. Thanks!

- Joey

json-contrib-rev1-20110714.patch.gz
Description: GNU Zip compressed data
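As a concrete illustration of the surrogate-pair issue, a decoder along these lines (hypothetical, not the patch itself) would reject sequences like \uD800\uD800 while accepting \uD834\uDD1E:

```c
#include <stdint.h>

/* Combine the 16-bit value of a \uXXXX escape (and possibly a second
 * one) into a code point, rejecting invalid surrogates.  'hi' is the
 * first unit; 'lo' is the following unit, or -1 if there is none.
 * Returns the code point, or -1 on an invalid surrogate sequence. */
static int32_t json_combine_surrogates(int32_t hi, int32_t lo)
{
    if (hi >= 0xD800 && hi <= 0xDBFF)
    {
        /* high surrogate: must be followed by a low surrogate */
        if (lo < 0xDC00 || lo > 0xDFFF)
            return -1;
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
    if (hi >= 0xDC00 && hi <= 0xDFFF)
        return -1;              /* unpaired low surrogate */
    return hi;                  /* ordinary BMP code point */
}
```

Rejecting at this point is what makes the earlier "reject invalid surrogates as early as possible" suggestion cheap to implement.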
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane wrote:
> BTW, could the \u problem be finessed by leaving such escapes in
> source form?

Yes, it could. However, it doesn't solve the problem of comparison
(needed for member lookup), which requires canonicalizing the strings
to be compared.

Here's a Unicode handling policy I came up with. It is guaranteed to
be lossless as long as the client and database encodings are the same.

---

On input (json_in), if the text is valid JSON, it is condensed:

 * Whitespace characters surrounding tokens are removed.

 * Long escapes (like \u0022) are converted to short escapes (like \")
   where possible.

 * Unnecessary escapes of ASCII characters (e.g. \u0061 and even
   \u007F) are converted to their respective characters.

 * Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E)
   are converted to their respective characters, but only if the
   database encoding is UTF-8.

On output (json_out), non-ASCII characters are converted to \u
escapes, unless one or more of these very likely circumstances hold:

 * The client encoding and database encoding are the same. No
   conversion is performed, so escaping characters will not prevent
   any conversion errors.

 * The client encoding is UTF-8. Escaping is not necessary because the
   client can encode all Unicode codepoints.

 * The client encoding and/or database encoding is SQL_ASCII.
   SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed.

When a JSON-encoded string is unwrapped using from_json (e.g.
from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the
characters they represent. If any escapes cannot be represented in the
database encoding, an error will be raised. Note that:

 * If the database encoding is UTF-8, conversion will never fail.

 * If the database encoding is SQL_ASCII, conversion will fail if any
   escapes of non-ASCII characters are present.

---

However, I'm having a really hard time figuring out how comparison
would work in this framework. Here are a few options:

 1. Convert the strings to UTF-8, convert the escapes to characters,
    and compare the strings.

 2. Convert the escapes to the database encoding, then compare the
    strings.

 3. If either string contains escapes of non-ASCII characters, do 1.
    Otherwise, do 2.

Number 1 seems the most sane to me, but could lead to rare errors.
Number 3 could produce confusing results. If character set X has three
different representations of one Unicode codepoint, then we could have
scenarios like this (simplified):

    "aaa♫" != "aaa♫"   (the two ♫ use different representations in X)

but:

    "aaa\u266b" == "aaa♫"

I suppose a simple solution would be to convert all escapes and
outright ban escapes of characters not in the database encoding. This
would have the nice property that all strings can be unescaped
server-side. Problem is, what if a browser or other program produces,
say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
where the encoding lacks this character?

On the other hand, converting all JSON to UTF-8 would be simpler to
implement. It would probably be more intuitive, too, given that the
JSON RFC says, "JSON text SHALL be encoded in Unicode."

In any case, the documentation should say "Use UTF-8 for best
results", as there seems to be no entirely satisfactory way to handle
JSON in other database encodings.

- Joey
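The ASCII part of the condensing rules above could be sketched like this (a hypothetical helper, assuming the \uXXXX escape has already been decoded to a code point below 0x80):

```c
#include <stdio.h>

/* Condense one decoded ASCII escape (code point cp < 0x80) into its
 * shortest JSON spelling: \" and \\ stay escaped, the control
 * characters with short escapes use them, other control characters
 * keep the long \u00XX form, and everything else (even \u007F)
 * becomes a bare character.  Writes into 'out' (at least 7 bytes)
 * and returns the length written. */
static int condense_ascii_escape(unsigned int cp, char *out)
{
    switch (cp)
    {
        case '"':  return sprintf(out, "\\\"");
        case '\\': return sprintf(out, "\\\\");
        case '\b': return sprintf(out, "\\b");
        case '\f': return sprintf(out, "\\f");
        case '\n': return sprintf(out, "\\n");
        case '\r': return sprintf(out, "\\r");
        case '\t': return sprintf(out, "\\t");
        default:
            if (cp < 0x20)
                return sprintf(out, "\\u%04x", cp);
            out[0] = (char) cp;         /* printable ASCII */
            out[1] = '\0';
            return 1;
    }
}
```

Non-ASCII escapes would go through the same decoding step but be emitted as characters only when the database encoding is UTF-8, per the policy above.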
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug wrote:
> On Jul19, 2011, at 00:17 , Joey Adams wrote:
>> I suppose a simple solution would be to convert all escapes and
>> outright ban escapes of characters not in the database encoding.
>
> +1. Making JSON work like TEXT when it comes to encoding issues
> makes this all much simpler conceptually. It also avoids all kinds
> of weird issues if you extract textual values from a JSON document
> server-side.

Thanks for the input. I'm leaning in this direction too. However, it
will be a tad tricky to implement the conversions efficiently, since
the wchar API doesn't provide a fast path for individual codepoint
conversion (that I'm aware of), and pg_do_encoding_conversion doesn't
look like a good thing to call lots of times.

My plan is to scan for escapes of non-ASCII characters, convert them
to UTF-8, and put them in a comma-delimited string like this:

    a,b,c,d,

then convert the resulting string to the server encoding (which may
fail, indicating that some codepoint(s) are not present in the
database encoding). After that, read the string and plop the
characters where they go. It's "clever", but I can't think of a better
way to do it with the existing API.

- Joey
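The escape-to-UTF-8 step mentioned above looks roughly like the backend's unicode_to_utf8(); here is a self-contained sketch of that encoding, for reference:

```c
#include <stdint.h>

/* Encode a Unicode code point as UTF-8, in the style of the backend's
 * unicode_to_utf8().  Returns the sequence length (1..4), or 0 if the
 * code point is out of range.  'buf' must have room for 4 bytes. */
static int codepoint_to_utf8(uint32_t cp, unsigned char *buf)
{
    if (cp <= 0x7F)
    {
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7FF)
    {
        buf[0] = 0xC0 | (cp >> 6);
        buf[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        buf[0] = 0xE0 | (cp >> 12);
        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        buf[0] = 0xF0 | (cp >> 18);
        buf[1] = 0x80 | ((cp >> 12) & 0x3F);
        buf[2] = 0x80 | ((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}
```

The maximum of 4 output bytes per code point is why the destination-buffer sizing question below is about the server encoding's side of the conversion, not this one.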
Fwd: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Forwarding because the mailing list rejected the original message.

-- Forwarded message --
From: Joey Adams
Date: Tue, Jul 19, 2011 at 11:23 PM
Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
To: Alvaro Herrera
Cc: Florian Pflug, Tom Lane, Robert Haas, Bernd Helmle, Dimitri
Fontaine, David Fetter, Josh Berkus, Pg Hackers

On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera wrote:
> Would it work to have a separate entry point into mbutils.c that lets
> you cache the conversion proc caller-side?

That sounds like a really good idea. There's still the overhead of
calling the proc, but I imagine it's a lot less than looking it up.

> I think the main problem is
> determining the byte length of each source character beforehand.

I'm not sure what you mean. The idea is to convert the \u escape to
UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8 sequence
is easy to compute), call the conversion proc to get the
null-terminated database-encoded character, then append the result to
whatever StringInfo the string is going into.

The only question mark is how big the destination buffer will need to
be. The maximum number of bytes per char in any supported encoding is
4, but is it possible for one Unicode character to turn into multiple
"character"s in the database encoding?

While we're at it, should we provide the same capability to the SQL
parser? Namely, the ability to use \u escapes above U+007F when the
server encoding is not UTF-8?

- Joey
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas wrote:
>> Thanks for the input. I'm leaning in this direction too. However, it
>> will be a tad tricky to implement the conversions efficiently, ...
>
> I'm a bit confused, because I thought what I was talking about was not
> doing any conversions in the first place.

We want to be able to handle \u escapes when the database encoding is
not UTF-8. We could leave them in place, but sooner or later they'll
need to be converted in order to unwrap or compare JSON strings.

The approach being discussed is converting escapes to the database
encoding. This means escapes of characters not available in the
database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.

The PostgreSQL parser (which also supports Unicode escapes) takes a
simpler approach: don't allow non-ASCII escapes unless the server
encoding is UTF-8.

- Joey
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Wed, Jul 20, 2011 at 6:49 AM, Florian Pflug wrote:
> Hm, I agree that we need to handle \u escapes in JSON input.
> We won't ever produce them during output though, right?

We could, to prevent transcoding errors if the client encoding is
different than the server encoding (and neither is SQL_ASCII, nor is
the client encoding UTF8). For example, if the database encoding is
UTF-8 and the client encoding is WIN1252, I'd think it would be a good
idea to escape non-ASCII characters.

> How does that XML type handle the situation? It seems that it'd have
> the same problem with unicode entity references "&#x266b;".

From the looks of it, XML operations convert the text to UTF-8 before
passing it to libxml. The XML type does not normalize the input;

    SELECT '♫&#x266b;'::xml;

simply yields ♫&#x266b;. Escapes of any character are allowed in any
encoding, from the looks of it.

- Joey
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
I think I've decided to only allow escapes of non-ASCII characters
when the database encoding is UTF8. For example, $$"\u2013"$$::json
will fail if the database encoding is WIN1252, even though WIN1252 can
encode U+2013 (EN DASH).

This may be somewhat draconian, given that:

 * SQL_ASCII can otherwise handle "any" language according to the
   documentation.

 * The XML type doesn't have this restriction (it just stores the
   input text verbatim, and converts it to UTF-8 before doing anything
   complicated with it).

However, it's simple to implement and understand. The JSON data type
will not perform any automatic conversion between character encodings.
Also, if we want to handle this any better in the future, we won't
have to support legacy data containing a mixture of encodings.

In the future, we could create functions to compensate for the issues
people encounter; for example:

 * json_escape_unicode(json [, replace bool]) returns text
   -- convert non-ASCII characters to escapes. Optionally, use \uFFFD
   for unconvertible characters.

 * json_unescape_unicode(text [, replace text]) returns json
   -- like json_in, but convert Unicode escapes to characters when
   possible. Optionally, replace unconvertible characters with a given
   string.

I've been going back and forth on how to handle encodings in the JSON
type for a while, but suggestions and objections are still welcome.
However, I plan to proceed in this direction so progress can be made.

On another matter, should the JSON type guard against duplicate member
keys? The JSON RFC says "The names within an object SHOULD be unique,"
meaning JSON with duplicate members can be considered valid.

JavaScript interpreters (the ones I tried), PHP, and Python all have
the same behavior: discard the first member in favor of the second.
That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
error if a duplicate attribute is present (e.g. an element that gives
the same attribute twice).

Thanks for the input,
- Joey
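If we did guard against duplicate member keys, one straightforward check (a hypothetical sort-then-scan helper, not code from the patch) would be:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int cmp_keys(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sort an object's (already unescaped) member keys and look for equal
 * neighbours.  Note this reorders the caller's array, which is fine if
 * the input function is alphabetizing members anyway. */
static bool object_has_duplicate_key(const char **keys, size_t nkeys)
{
    qsort((void *) keys, nkeys, sizeof(const char *), cmp_keys);
    for (size_t i = 1; i < nkeys; i++)
        if (strcmp(keys[i - 1], keys[i]) == 0)
            return true;
    return false;
}
```

The sort is the same work an alphabetizing input function would do anyway, which is why enforcing uniqueness mostly costs something only if members are otherwise left in input order.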
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Fri, Jul 22, 2011 at 7:12 PM, Robert Haas wrote:
> Hmm. That's tricky. I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

I agree. Besides, throwing an error on duplicate keys seems like the
most logical thing to do. The most compelling reason not to, I think,
is that it would make the input function a little slower.

On Fri, Jul 22, 2011 at 8:26 PM, Florian Pflug wrote:
>> * The XML type doesn't have this restriction (it just stores the
>> input text verbatim, and converts it to UTF-8 before doing anything
>> complicated with it).
>
> Yeah. But the price the XML type pays for that is the lack of an
> equality operator.

Interesting. This leads to a couple more questions:

 * Should the JSON data type (eventually) have an equality operator?

 * Should the JSON input function alphabetize object members by key?

If we canonicalize strings and numbers and alphabetize object members,
then our equality function is just texteq. The only stumbling block is
canonicalizing numbers. Fortunately, JSON's definition of a "number"
is its decimal syntax, so the algorithm is child's play:

 * Figure out the digits and exponent.

 * If the exponent is greater than 20 or less than -6 (arbitrary), use
   exponential notation.

The problem is: 2.718282e-1000 won't equal 0 as may be expected. I
doubt this matters much.

It would be nice to canonicalize JSON on input, and that's the way I'd
like to go, but two caveats are:

 * Input (and other operations) would require more CPU time. Instead
   of being able to pass the data through a quick condense function,
   it'd have to construct an AST (to sort object members) and
   re-encode the JSON back into a string.

 * Users, for aesthetic reasons, might not want their JSON members
   rearranged.

If, in the future, we add the ability to manipulate large JSON trees
efficiently (e.g. by using an auxiliary table like TOAST does), we'll
probably want unique members, so enforcing them now may be prudent.

- Joey
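The number-canonicalization algorithm sketched in the message above could look like this in C (a hypothetical helper: it takes the significant digits plus the base-10 exponent of the leading digit, and assumes the output buffer is large enough):

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Render a JSON number from its significant digits (no leading or
 * trailing zeros) and the base-10 exponent of the leading digit,
 * e.g. ("2718282", -1) for 0.2718282.  Exponential notation is used
 * only for extreme exponents, per the thresholds discussed above. */
static void canonical_number(const char *digits, int exp, char *out)
{
    size_t n = strlen(digits);
    char *p = out;

    if (exp > 20 || exp < -6)
    {
        /* exponential notation: D.DDDDeEXP */
        p += sprintf(p, "%c", digits[0]);
        if (n > 1)
            p += sprintf(p, ".%s", digits + 1);
        sprintf(p, "e%d", exp);
    }
    else if (exp < 0)
    {
        /* 0.00...digits */
        p += sprintf(p, "0.");
        for (int i = 0; i < -exp - 1; i++)
            p += sprintf(p, "0");
        sprintf(p, "%s", digits);
    }
    else if ((size_t) exp >= n - 1)
    {
        /* plain integer, possibly with trailing zeros restored */
        p += sprintf(p, "%s", digits);
        for (size_t i = n - 1; i < (size_t) exp; i++)
            p += sprintf(p, "0");
    }
    else
    {
        /* decimal point falls inside the digit string */
        sprintf(p, "%.*s.%s", exp + 1, digits, digits + exp + 1);
    }
}
```

Under this rule 2.718282e-1000 keeps its exponential form rather than collapsing toward 0, matching the caveat in the message.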
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
Also, should I forbid the escape \u0000 (in all database encodings)?

Pros:

 * If \u0000 is forbidden, and the server encoding is UTF-8, then
   every JSON-wrapped string will be convertible to TEXT.

 * It will be consistent with the way PostgreSQL already handles text,
   and with the decision to use database-encoded JSON strings.

 * Some applications choke on strings with null characters. For
   example, in some web browsers but not others, if you pass
   "Hello\u0000world" to document.write() or assign it to a DOM
   object's innerHTML, it will be truncated to "Hello". By banning
   \u0000, users can catch such rogue strings early.

 * It's a little easier to represent internally.

Cons:

 * Means the JSON type will accept a subset of the JSON described in
   RFC 4627. However, the RFC does say "An implementation may set
   limits on the length and character contents of strings", so we can
   arguably get away with banning \u0000 while being law-abiding
   citizens.

 * Being able to store U+0000–U+00FF means users can use JSON strings
   to hold binary data: by treating it as Latin-1.

- Joey
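The Latin-1 marshaling trick in the last bullet could look like this (illustrative only; the function name is invented). Each byte is treated as the Latin-1 code point with the same value and escaped, so arbitrary bytes survive as a pure-ASCII JSON string:

```c
#include <stddef.h>
#include <stdio.h>

/* Encode a byte string as a JSON string of \u00XX escapes, one per
 * byte.  'out' needs 6*len + 3 bytes (escapes, two quotes, NUL).
 * Returns the length written, not counting the NUL.  This is exactly
 * the technique a \u0000 ban would rule out for the zero byte. */
static size_t bytes_to_json_latin1(const unsigned char *data, size_t len,
                                   char *out)
{
    char *p = out;

    *p++ = '"';
    for (size_t i = 0; i < len; i++)
        p += sprintf(p, "\\u%04x", data[i]);
    *p++ = '"';
    *p = '\0';
    return (size_t) (p - out);
}
```

A decoder would map each escape back to a byte; the round trip only works if \u0000 through \u00ff are all accepted.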
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Sat, Jul 23, 2011 at 11:14 PM, Robert Haas wrote:
> I doubt you're going to want to reinvent TOAST, ...

I was thinking about making it efficient to access or update
foo.a.b.c.d[1000] in a huge JSON tree. Simply TOASTing the varlena
text means we have to unpack the entire datum to access and update
individual members. An alternative would be to split the JSON into
chunks (possibly by using the pg_toast_ table) and have some sort of
index that can be used to efficiently look up values by path. This
would not be trivial, and I don't plan to implement it any time soon.

On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug wrote:
> On Jul24, 2011, at 05:14 , Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams wrote:
>>> ... Fortunately, JSON's definition of a
>>> "number" is its decimal syntax, so the algorithm is child's play:
>>>
>>> * Figure out the digits and exponent.
>>> * If the exponent is greater than 20 or less than -6 (arbitrary),
>>>   use exponential notation.
>
> I agree. As for your proposed algorithm, I suggest to instead use
> exponential notation if it produces a shorter textual representation.
> In other words, for values between -1 and 1, we'd switch to exponential
> notation if there's more than 1 leading zero (to the right of the decimal
> point, of course), and for values outside that range if there're more than
> 2 trailing zeros and no decimal point. All after redundant zeros and
> decimal points are removed. So we'd store
>
> 0 as 0
> 1 as 1
> 0.1 as 0.1
> 0.01 as 0.01
> 0.001 as 1e-3
> 10 as 10
> 100 as 100
> 1000 as 1e3
> 1000.1 as 1000.1
> 1001 as 1001

Interesting idea. The reason I suggested using exponential notation
only for extreme exponents (less than -6 or greater than +20) is
partly for presentation value. Users might be annoyed to see 1000000
turned into 1e6. Moreover, applications working solely with integers
that don't expect the floating point syntax may choke on the converted
numbers.

32-bit integers can be losslessly encoded as IEEE double-precision
floats (JavaScript's internal representation), and JavaScript's
algorithm for converting a number to a string ([1], section 9.8.1)
happens to preserve the integer syntax (I think).

Should we follow the JavaScript standard for rendering numbers (which
my suggestion approximates)? Or should we use the shortest encoding as
Florian suggests?

- Joey

[1]: http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf
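Florian's shorter-representation rule, restricted to non-negative integers for illustration (a hedged sketch, not proposed code), reproduces his table: 1000 becomes 1e3 while 100 and 1001 stay as they are:

```c
#include <stdio.h>
#include <string.h>

/* Switch to exponential notation only when it is strictly shorter.
 * 'out' needs room for about 24 bytes. */
static void shortest_integer_form(long long v, char *out)
{
    long long m = v;
    int zeros = 0;
    char alt[32];

    sprintf(out, "%lld", v);
    if (v == 0)
        return;                 /* "0" has no shorter form */

    /* strip trailing zeros into an exponent: 1000 -> mantissa 1, e3 */
    while (m % 10 == 0)
    {
        m /= 10;
        zeros++;
    }
    sprintf(alt, "%llde%d", m, zeros);
    if (strlen(alt) < strlen(out))
        strcpy(out, alt);       /* "1e3" beats "1000"; "1e2" ties "100" */
}
```

The fractional cases in Florian's table (0.001 as 1e-3) follow the same "strictly shorter" comparison, just counting leading rather than trailing zeros.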
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug wrote:
> The downside being that we'd then either need to canonicalize in
> the equality operator, or live with either no equality operator or
> a rather strange one.

It just occurred to me that, even if we sort object members, texteq
might not be a sufficient way to determine equality. In particular,
IEEE floats treat +0 and -0 as two different things, but they are
equal when compared. Note that we're only dealing with a decimal
representation; we're not (currently) converting to double-precision
representation and back.

Should we mimic IEEE floats and preserve -0 versus +0 while treating
them as equal? Or should we treat JSON floats like numeric and convert
-0 to 0 on input? Or should we do something else? I think converting
-0 to 0 would be a bad idea, as it would violate the intuitive
assumption that JSON can be used to marshal double-precision floats.

- Joey
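The +0/-0 wrinkle, stated in C terms: the two values compare equal, yet their sign bits (and hence their canonical texts, "0" versus "-0") differ, which is exactly what trips up a texteq-based equality:

```c
#include <math.h>
#include <stdbool.h>

/* True when a and b are numerically equal IEEE doubles that a
 * bitwise or textual comparison would nevertheless distinguish,
 * as 0.0 and -0.0 are. */
static bool doubles_equal_but_distinct(double a, double b)
{
    return a == b && signbit(a) != signbit(b);
}
```

Only the zero pair has this property among finite doubles, so the whole question reduces to what json_in does with a leading minus sign on zero.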
Re: Initial Review: JSON contrib modul was: Re: [HACKERS] Another swing at JSON
On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams wrote:
> Should we mimic IEEE floats and preserve -0 versus +0 while treating
> them as equal? Or should we treat JSON floats like numeric and
> convert -0 to 0 on input? Or should we do something else? I think
> converting -0 to 0 would be a bad idea, as it would violate the
> intuitive assumption that JSON can be used to marshal double-precision
> floats.

On the other hand, JavaScript's own .toString and JSON.stringify turn
-0 into 0, so JSON can't marshal -0 around, anyway (in practice). Now
I think turning -0 into 0 would be fine for canonicalizing numbers in
json_in.

- Joey
Re: [HACKERS] JSON for PG 9.2
On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian wrote: > Where are we with adding JSON for Postgres 9.2? We got bogged down in > the data representation last time we discussed this. We should probably have a wiki page titled "JSON datatype status" to help break the cycle we're in: * Someone asks about the status of JSON * Various ideas are suggested * Patches are posted (maybe) * More discussion about fundamental issues ensues * Nothing is accomplished (as far as adding JSON to Postgres core) There are several JSON implementations for Postgres floating around, including: * http://pgxn.org/dist/pg-json/ : Mentioned in previous posts; a JSON library based on Jansson supporting path subscript and equality testing * http://git.postgresql.org/gitweb/?p=json-datatype.git;a=summary : The JSON datatype I implemented for Google Summer of Code 2010. It has the most features of any implementation I'm aware of, but: * Is in the form of a contrib module * Preserves input text verbatim, a guarantee that will be broken by more efficient implementations * http://git.postgresql.org/gitweb/?p=json-datatype.git;a=shortlog;h=refs/heads/json2 : My rewrite of the JSON module that condenses input (but still stores it as text) and addresses the issue of JSON when either the server or client encoding is not UTF-8. Needs more features and documentation, but like my other implementation, may not be quite what we want. Issues we've encountered include: * Should JSON be stored as binary or as text? * How do we deal with Unicode escapes and characters if the server or client encoding is not UTF-8? Some (common!) character encodings have code points that don't map to Unicode. Also, the charset conversion modules do not provide fast entry points for converting individual characters; each conversion involves a funcapi call. --- In an application I'm working on, I store JSON-encoded objects in a PostgreSQL database (using TEXT). 
I do so because it allows me to store non-relational data that is easy for my JavaScript code to work with.  However, I fail to see much benefit of a JSON type.  When I need to work with the data in PHP, C, or Haskell, I use JSON parsing libraries available in each programming language.  Although being able to transform or convert JSON data within SQL might be convenient, I can't think of any compelling reason to do it in my case.

Can someone clarify why a JSON type would be useful, beyond storage and validation?  What is a real-world, *concrete* example of a problem where JSON manipulation in the database would be much better than:

 * Using the application's programming language to manipulate the data (which it does a lot already)?
 * Using CouchDB or similar instead of PostgreSQL?

- Joey
Re: [HACKERS] JSON for PG 9.2
On Fri, Dec 16, 2011 at 8:52 AM, Robert Haas wrote:
> But I think the important point is that this is an obscure corner case.  Let
> me say that one more time: obscure corner case!

+1

> The only reason JSON needs to care about this at all is that it allows
> \u1234 to mean Unicode code point 0x1234.  But for that detail, JSON
> would be encoding-agnostic.  So I think it's sufficient for us to
> simply decide that that particular feature may not work (or even, will
> not work) for non-ASCII characters if you use a non-UTF8 encoding.
> There's still plenty of useful things that can be done with JSON even
> if that particular feature is not available; and that way we don't
> have to completely disable the data type just because someone wants to
> use EUC-JP or something.

So, if the server encoding is not UTF-8, should we ban Unicode escapes:

    "\u00FCber"

or non-ASCII characters?

    "über"

Also:

 * What if the server encoding is SQL_ASCII?
 * What if the server encoding is UTF-8, but the client encoding is something else (e.g. SQL_ASCII)?

- Joey
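[Editorial note: the core problem above can be illustrated in a few lines of Python (as an analogy only — Postgres's encoding machinery works differently): the escape `\u00FCber` is pure ASCII on the wire, yet decodes to a string that an ASCII-only encoding simply cannot represent.]

```python
import json

# "\u00FCber" is 7-bit-clean JSON text, but decodes to a non-ASCII string
s = json.loads('"\\u00FCber"')
assert s == "über"

# An ASCII-only target encoding has no way to represent the result,
# which is why a non-UTF8 setting forces a ban-or-fail decision.
try:
    s.encode("ascii")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass
```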
[HACKERS] Wishlist: parameterizable types
This may be ambitious, but it'd be neat if PostgreSQL supported parameterizable types.  For example, suppose a contrib module defines a "pair" type.  It could be used as follows:

    CREATE TABLE my_table (
        coord pair(float, float)
    );

The "pair" module could define functions like these for constructing and examining pairs:

    create function pair_create(a, b) returns pair(a, b);
    create function pair_fst(pair(a, b)) returns a;
    create function pair_snd(pair(a, b)) returns b;

Here, each function is polymorphic in two type variables, a and b.  As far as I know, PostgreSQL only supports one type variable per function, via the anyelement pseudo-type.  Thus, unless we restrict ourselves to only one type parameter, parameterizable types wouldn't be very useful without support for multiple type variables.

PostgreSQL already has a parameterizable type: array.  However, it would be nontrivial to introduce another such type.  Currently, nearly every type in PostgreSQL has a corresponding array type.  For example, in pg_type, there's money, and there's _money (array of money values).  Continuing with this pattern means we would need something like P*T entries in pg_type, where P is the number of type constructors (e.g. array), and T is the number of base types.

Moreover, the array type isn't truly nestable.  For one, PostgreSQL considers int[] and int[][][][] as the same type:

    > select '{1,2,3}' :: int[][][][];
      int4
    ---------
     {1,2,3}
    (1 row)

Also, arrays inside of arrays aren't allowed, only multidimensional arrays:

    > select '{{1,2},{3}}' :: int[][];
    ERROR:  multidimensional arrays must have array expressions with matching dimensions
    LINE 1: select '{{1,2},{3}}' :: int[][];

Suppose I didn't like these restrictions on the array type, and wanted to make a type called "vector" that addresses them.  It might be used as follows:

    > select '{{1,2},{3}}' :: vector(vector(int));

Note that I'm stacking the 'vector' type constructor.  The array type doesn't let you do that.
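[Editorial note: for contrast, a truly nestable container — Python lists here, purely as an analogy — has neither restriction: inner elements may have different lengths, and nesting depth is part of the value rather than being collapsed by the type.]

```python
# Ragged nesting is fine: this is a list of lists, not a 2-D matrix,
# so there is no "matching dimensions" requirement.
v = [[1, 2], [3]]
assert len(v[0]) == 2 and len(v[1]) == 1

# Nesting depth is observable per value, not erased:
# a list of ints and a list containing one list of ints differ.
assert [1, 2, 3] != [[1, 2, 3]]
```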
We could stretch the idea even further, and allow parameters and recursion in user-defined types:

    create type object(a) as (
        key   text,
        value either(a, object(a))
    );

C++ supports parameterizable types through templates.  Java and C# support them through generics.  What I have in mind comes from Haskell's type system (which inspired generics in Java).  The functions defined earlier could be implemented in Haskell as:

    data Pair a b = Pair a b

    pair_fst :: Pair a b -> a
    pair_fst (Pair a _) = a

    pair_snd :: Pair a b -> b
    pair_snd (Pair _ b) = b

What I'm wondering is: how complex would it be to add such a feature to PostgreSQL's type system?

- Joey
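[Editorial note: the same two-type-variable pair interface can also be sketched with Python's typing module — an analogy only, saying nothing about the catalog changes Postgres would need:]

```python
from typing import Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")

class Pair(Generic[A, B]):
    """A pair polymorphic in two type variables, like Haskell's `data Pair a b`."""
    def __init__(self, fst: A, snd: B) -> None:
        self.fst = fst
        self.snd = snd

def pair_fst(p: Pair[A, B]) -> A:
    return p.fst

def pair_snd(p: Pair[A, B]) -> B:
    return p.snd

coord = Pair(1.5, 2.5)   # inferred as Pair[float, float]
assert pair_fst(coord) == 1.5
assert pair_snd(coord) == 2.5
```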
Re: [HACKERS] JSON for PG 9.2
I wrote an array_to_json function during GSoC 2010:

    http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289

It's not exposed as a procedure called array_to_json; it's part of the to_json function, which decides what to do based on the argument type.

- Joey
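[Editorial note: the dispatch-on-argument-type idea can be sketched in Python — hypothetical names, and the real to_json dispatches on the Postgres argument type at the fmgr level, not on runtime values like this:]

```python
import json

def to_json(value):
    # Decide what to do based on the argument's type: arrays (lists here)
    # are serialized by recursing over their elements; everything else is
    # delegated to the scalar serializer.
    if isinstance(value, list):
        return "[" + ",".join(to_json(v) for v in value) + "]"
    return json.dumps(value)

assert to_json([1, 2, 3]) == "[1,2,3]"
assert to_json("hi") == '"hi"'
```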
Re: [HACKERS] JSON for PG 9.2
On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan wrote:
> Second, what should we do when the database encoding isn't UTF8? I'm
> inclined to emit a \u escape for any non-ASCII character (assuming it
> has a unicode code point - are there any code points in the non-unicode
> encodings that don't have unicode equivalents?). The alternative would be to
> fail on non-ASCII characters, which might be ugly. Of course, anyone wanting
> to deal with JSON should be using UTF8 anyway, but we still have to deal
> with these things. What about SQL_ASCII? If there's a non-ASCII sequence
> there we really have no way of telling what it should be. There at least I
> think we should probably error out.

I don't think there is a satisfying solution to this problem.  Things working against us:

 * Some server encodings support characters that don't map to Unicode characters (e.g. unused slots in Windows-1252).  Thus, converting to UTF-8 and back is lossy in general.

 * We want a normalized representation for comparison.  This will involve a mixture of server and Unicode characters, unless the encoding is UTF-8.

 * We can't efficiently convert individual characters to and from Unicode with the current API.

 * What do we do about \u0000?  TEXT datums cannot contain NUL characters.

I'd say just ban Unicode escapes and non-ASCII characters unless the server encoding is UTF-8, and ban \u0000 escapes always.  It's easy, and whatever we support later will be a superset of this.

Strategies for handling this situation have been discussed in prior emails.  This is where things got stuck last time.

- Joey
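[Editorial note: Python's json module illustrates both hazards discussed in this thread — it happily decodes \u0000 (which a NUL-intolerant TEXT representation cannot store) and even lone surrogates (which cannot be encoded to UTF-8 at all). A sketch, not Postgres behavior:]

```python
import json

# \u0000 is valid JSON, but the decoded string contains a NUL character,
# which TEXT datums cannot hold.
assert json.loads('"a\\u0000b"') == "a\x00b"

# Python's parser also lets an unpaired surrogate through; the resulting
# string can no longer be encoded as UTF-8 -- an argument for rejecting
# invalid surrogates as early as possible.
lone = json.loads('"\\ud800"')
try:
    lone.encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass
```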