subject:"\\\[HACKERS\\\] patch\\\: Add JSON datatype to PostgreSQL \\\(GSoC, WIP\\\)"

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-21 Thread David E. Wheeler

On Nov 20, 2010, at 9:31 PM, Terry Laurenzo wrote:

 Assuming that the JSON datatype (at a minimum) normalizes text for storage, 
 then the text storage option accounts for about the most expensive path but 
 with none of the benefits of an internal binary form (smaller size, ability 
 to cheaply perform non-trivial manipulation within the database server).
 
 Of course, just having a JSON datatype that blindly stores text will beat 
 everything, but I'm getting closer to thinking that the binary option is 
 worth the tradeoff.
 
 Comments?

benchmarks++

Nice to have some data points for this discussion.

Best,

David, still hoping for the JSON data type in 9.1…
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-21 Thread Robert Haas

On Sun, Nov 21, 2010 at 12:31 AM, Terry Laurenzo t...@laurenzo.org wrote:
 What we see from this is that parsing JSON text and generating a binary
 representation is cheap, representing approximately 10% of the base case
 time.  Conversely, anything that involves generating JSON text is expensive,
 accounting for 30-40% of the base case time.  Some incidental profiling
 shows that while the entire operation is expensive, the process of
 generating string literals dominates this time.  There is likely room for
 optimization in this method, but it should be noted that most of these
 documents are lightly escaped (if escaped at all) which represents the happy
 path through the string literal output function.

Ouch!  That's kind of painful.  But certainly for some use cases it
will work out to a huge speedup, if you're doing subscripting or
similar.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-21 Thread Andrew Dunstan




On 11/21/2010 12:31 AM, Terry Laurenzo wrote:


I copied the 5 sample documents from json.org http://json.org's 
example section for these tests.  These are loaded into a table with a 
varchar column 1000 times each (so the test table has 5000 rows in 
it).  In all situations, the binary encoding was smaller than the 
normalized text form (between 9 and 23% smaller).  I think there are 
cases where the binary form will be larger than the corresponding text 
form, but I don't think they would be very common.




Is that a pre-toast or post-toast comparison?

Even if it's post-toast, that doesn't seem like enough of a saving to 
convince me that simply storing as text, just as we do for XML, isn't a 
sensible way to go, especially when the cost of reproducing the text for 
delivery to clients  (including, say, pg_dump) is likely to be quite high.


cheers

andrew

Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-20 Thread Terry Laurenzo

I've got a new stripped down version of the binary json plugin on github:
https://github.com/tlaurenzo/pgjson

With all due warning of contrived benchmarks, I wrote some tests to see
where things stand. The test script is here:
https://github.com/tlaurenzo/pgjson/blob/master/testdocs/runbenchmark.sh
The results from my laptop (First gen Macbook Pro 32bit with Snow Leopard
and Postgresql 9.0) are here and copied into this email at the end:

https://github.com/tlaurenzo/pgjson/blob/master/testdocs/benchmark-results-2010-11-20-mbp32.txt

And for some commentary...
I copied the 5 sample documents from json.org's example section for these
tests. These are loaded into a table with a varchar column 1000 times each
(so the test table has 5000 rows in it). In all situations, the binary
encoding was smaller than the normalized text form (between 9 and 23%
smaller). I think there are cases where the binary form will be larger than
the corresponding text form, but I don't think they would be very common.

For the timings, various operations are performed on the 5000 row test table
for 100 iterations. In all situations, the query returns the Length of the
transformed document instead of the document itself so as to factor out
variable client IO between the test runs. The Null Parse is essentially
just a select from the table and therefore represents the baseline. The
times varied a little bit between runs but did not change materially.

What we see from this is that parsing JSON text and generating a binary
representation is cheap, representing approximately 10% of the base case
time. Conversely, anything that involves generating JSON text is expensive,
accounting for 30-40% of the base case time. Some incidental profiling
shows that while the entire operation is expensive, the process of
generating string literals dominates this time. There is likely room for
optimization in this method, but it should be noted that most of these
documents are lightly escaped (if escaped at all) which represents the happy
path through the string literal output function.

While I have not profiled any advanced manipulation of the binary structure
within the server, it stands to reason that manipulating the binary
structure should be significantly faster than an approach that requires
(perhaps multiple) transcoding between text representations in order to
complete a sequence of operations.

Assuming that the JSON datatype (at a minimum) normalizes text for storage,
then the text storage option accounts for about the most expensive path but
with none of the benefits of an internal binary form (smaller size, ability
to cheaply perform non-trivial manipulation within the database server).

Of course, just having a JSON datatype that blindly stores text will beat
everything, but I'm getting closer to thinking that the binary option is
worth the tradeoff.

Comments?
Terry

Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
Document Name | Original Size | Binary Size | Normalized Size |
Percentage Savings
--+---+-+-+
jsonorg_sample1.json | 582 | 311 | 360 |
13.6
jsonorg_sample2.json | 241 | 146 | 183 |
20.2185792349727
jsonorg_sample3.json | 601 | 326 | 389 |
16.1953727506427
jsonorg_sample4.json | 3467 |2466 |2710 |
9.00369003690037
jsonorg_sample5.json | 872 | 469 | 613 |
23.4910277324633
(5 rows)

=== TEST PARSE AND SERIALIZATION ===
Null Parse:
12.12 real 2.51 user 0.13 sys
Parse to Binary:
13.38 real 2.51 user 0.14 sys
Serialize from Binary:
16.65 real 2.51 user 0.13 sys
Normalize Text:
18.99 real 2.51 user 0.13 sys
Roundtrip (parse to binary and serialize):
18.58 real 2.51 user 0.14 sys

69 matches

Mail list logo