Re: [HACKERS] jsonb format is pessimal for toast compression
Bruce Momjian <br...@momjian.us> writes:
> On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
>> BTW, it seems like there is consensus that we ought to reorder the items
>> in a jsonb object to have keys first and then values, independently of
>> the other issues under discussion.  This means we *will* be breaking
>> on-disk compatibility with 9.4beta2, which means pg_upgrade will need to
>> be taught to refuse an upgrade if the database contains any jsonb
>> columns.  Bruce, do you have time to crank out a patch for that?

> Yes, I can do that easily.  Tell me when you want it --- I just need a
> catalog version number to trigger on.

Done --- 201409291 is the cutover point.

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Bruce Momjian <br...@momjian.us> writes:
>> On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
>>> BTW, it seems like there is consensus that we ought to reorder the
>>> items in a jsonb object to have keys first and then values, [...]
>>> Bruce, do you have time to crank out a patch for that?

>> Yes, I can do that easily.  Tell me when you want it --- I just need a
>> catalog version number to trigger on.

> Done --- 201409291 is the cutover point.

Just to clarify- the commit bumped the catversion to 201409292, so
version <= 201409291 has the old format while version > 201409291 has
the new format.  There was no 201409291, so I suppose it doesn't matter
too much, but technically 'version >= 201409291' wouldn't be accurate.
I'm guessing this all makes sense for how pg_upgrade works, but I found
it a bit surprising that the version mentioned as the cutover point
wasn't the catversion committed.

	Thanks,

		Stephen
Re: [HACKERS] jsonb format is pessimal for toast compression
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost <sfr...@snowman.net> writes:
>> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>>> Done --- 201409291 is the cutover point.

>> Just to clarify- the commit bumped the catversion to 201409292, so
>> version <= 201409291 has the old format while version > 201409291 has
>> the new format.  There was no 201409291, so I suppose it doesn't matter
>> too much, but technically 'version >= 201409291' wouldn't be accurate.

> Nope.  See my response to Andrew: ...1 is the cutover commit Bruce
> should use, because that's what it is in 9.4.

Yup, makes sense.  Thanks!

	Stephen
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/29/2014 11:49 AM, Arthur Silva wrote:
> What's the call on the stride length? Are we going to keep it hardcoded?

Please, yes.  The complications caused by a variable stride length would
be horrible.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
Arthur Silva <arthur...@gmail.com> writes:
> What's the call on the stride length? Are we going to keep it hardcoded?

At the moment it's 32, but we could change it without forcing a new
initdb.  I ran a simple test that seemed to show 32 was a good choice,
but if anyone else wants to try other cases, go for it.

			regards, tom lane
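[For anyone taking Tom up on trying other cases: JB_OFFSET_STRIDE is a
compile-time #define, so each candidate stride means editing the source
and rebuilding, but the measurement itself is plain SQL.  A rough sketch,
with an invented table and key-naming scheme, not anything from the
thread:

    -- Build one wide jsonb object (here 1000 keys) as pathological test data.
    CREATE TABLE stride_test AS
    SELECT ('{' || string_agg(format('"k%s": %s', i, i), ',') || '}')::jsonb AS doc
    FROM generate_series(1, 1000) AS i;

    -- Stored (post-TOAST-compression) size; rerun after rebuilding with a
    -- different JB_OFFSET_STRIDE to see the size side of the tradeoff.
    SELECT pg_column_size(doc) FROM stride_test;

    -- Access cost for a key near the end of the object; the cross join
    -- repeats the extraction enough times to give a stable timing.
    EXPLAIN ANALYZE
    SELECT doc -> 'k999' FROM stride_test, generate_series(1, 10000);
]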
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 29, 2014 at 12:30:40PM -0400, Tom Lane wrote:
> Bruce Momjian <br...@momjian.us> writes:
>> On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
>>> BTW, it seems like there is consensus that we ought to reorder the
>>> items in a jsonb object to have keys first and then values, [...]
>>> Bruce, do you have time to crank out a patch for that?

>> Yes, I can do that easily.  Tell me when you want it --- I just need a
>> catalog version number to trigger on.

> Done --- 201409291 is the cutover point.

Attached patch applied to head, and backpatched to 9.4.  I think we need
to keep this in all future pg_upgrade versions in case someone from the
beta tries to jump versions, e.g. 9.4 beta1 to 9.5.

-- 
Bruce Momjian  <br...@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +

diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c
index 88fe12d..bbfcab7 100644
*** a/contrib/pg_upgrade/check.c
--- b/contrib/pg_upgrade/check.c
*************** static void check_is_install_user(Cluste
*** 24,29 ****
--- 24,30 ----
  static void check_for_prepared_transactions(ClusterInfo *cluster);
  static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
  static void check_for_reg_data_type_usage(ClusterInfo *cluster);
+ static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
  static void get_bin_version(ClusterInfo *cluster);
  static char *get_canonical_locale_name(int category, const char *locale);
*************** check_and_dump_old_cluster(bool live_che
*** 99,104 ****
--- 100,108 ----
  	check_for_prepared_transactions(&old_cluster);
  	check_for_reg_data_type_usage(&old_cluster);
  	check_for_isn_and_int8_passing_mismatch(&old_cluster);
+ 	if (GET_MAJOR_VERSION(old_cluster.major_version) == 904 &&
+ 		old_cluster.controldata.cat_ver < JSONB_FORMAT_CHANGE_CAT_VER)
+ 		check_for_jsonb_9_4_usage(&old_cluster);
  
  	/* Pre-PG 9.4 had a different 'line' data type internal format */
  	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 903)
*************** check_for_reg_data_type_usage(ClusterInf
*** 911,916 ****
--- 915,1010 ----
  			   "%s\n\n", output_path);
  	}
  	else
+ 		check_ok();
+ }
+ 
+ 
+ /*
+  * check_for_jsonb_9_4_usage()
+  *
+  *	JSONB changed its storage format during 9.4 beta, so check for it.
+  */
+ static void
+ check_for_jsonb_9_4_usage(ClusterInfo *cluster)
+ {
+ 	int			dbnum;
+ 	FILE	   *script = NULL;
+ 	bool		found = false;
+ 	char		output_path[MAXPGPATH];
+ 
+ 	prep_status("Checking for JSONB user data types");
+ 
+ 	snprintf(output_path, sizeof(output_path), "tables_using_jsonb.txt");
+ 
+ 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+ 	{
+ 		PGresult   *res;
+ 		bool		db_used = false;
+ 		int			ntups;
+ 		int			rowno;
+ 		int			i_nspname,
+ 					i_relname,
+ 					i_attname;
+ 		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+ 		PGconn	   *conn = connectToServer(cluster, active_db->db_name);
+ 
+ 		/*
+ 		 * While several relkinds don't store any data, e.g. views, they
+ 		 * can be used to define data types of other columns, so we check
+ 		 * all relkinds.
+ 		 */
+ 		res = executeQueryOrDie(conn,
+ 								"SELECT n.nspname, c.relname, a.attname "
+ 								"FROM	pg_catalog.pg_class c, "
+ 								"		pg_catalog.pg_namespace n, "
+ 								"		pg_catalog.pg_attribute a "
+ 								"WHERE	c.oid = a.attrelid AND "
+ 								"		NOT a.attisdropped AND "
+ 								"		a.atttypid = 'pg_catalog.jsonb'::pg_catalog.regtype AND "
+ 								"		c.relnamespace = n.oid AND "
+ 						/* exclude possible orphaned temp tables */
+ 								"		n.nspname !~ '^pg_temp_' AND "
+ 								"		n.nspname NOT IN ('pg_catalog', 'information_schema')");
+ 
+ 		ntups = PQntuples(res);
+ 		i_nspname = PQfnumber(res, "nspname");
+ 		i_relname = PQfnumber(res, "relname");
+ 		i_attname = PQfnumber(res, "attname");
+ 		for (rowno = 0; rowno < ntups; rowno++)
+ 		{
+ 			found = true;
+ 			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+ 				pg_fatal("Could not open file \"%s\": %s\n",
+ 						 output_path, getErrorText(errno));
+ 			if (!db_used)
+ 			{
+ 				fprintf(script, "Database: %s\n", active_db->db_name);
+ 				db_used = true;
+ 			}
+ 			fprintf(script, "  %s.%s.%s\n",
+ 					PQgetvalue(res, rowno, i_nspname),
+ 					PQgetvalue(res, rowno, i_relname),
+ 					PQgetvalue(res, rowno, i_attname));
+ 		}
+ 
+ 		PQclear(res);
+ 
+ 		PQfinish(conn);
+ 	}
+ 
+ 	if (script)
+ 		fclose(script);
+ 
+ 	if (found)
+ 	{
+ 		pg_log(PG_REPORT, "fatal\n");
+ 		pg_fatal("Your installation contains one of the JSONB data types in user tables.\n"
+ 				 "The internal format of JSONB changed during 9.4 beta so this cluster cannot currently\n"
+ 				 "be upgraded.  You can remove the problem tables
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/26/2014 06:20 PM, Josh Berkus wrote:
> Overall, I'm satisfied with the performance of the length-and-offset
> patch.

Oh, also ... no bugs found.

So, can we get Beta3 out now?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus <j...@agliodbs.com> writes:
> So, can we get Beta3 out now?

If nobody else steps up and says they want to do some performance
testing, I'll push the latest lengths+offsets patch tomorrow.

Are any of the other open items listed at
https://wiki.postgresql.org/wiki/PostgreSQL_9.4_Open_Items
things that we must-fix-before-beta3?

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
All,

So these results have become a bit complex.  So spreadsheet time.

https://docs.google.com/spreadsheets/d/1Mokpx3EqlbWlFDIkF9qzpM7NneN9z-QOXWSzws3E-R4

Some details:

The length-and-offset test was performed using a more recent 9.4 checkout
than the other two tests.  This was regrettable, and due to a mistake
with git, since the results tell me that there have been some other
changes.

I added two new datasets:

errlog2 is a simple, 4-column error log in JSON format, with 2 small
values and 2 large values in each datum.  It was there to check if any of
our changes affected the performance or size of such simple structures
(answer: no).

processed_b is a synthetic version of Mozilla Socorro's crash dumps,
about 900,000 of them, with nearly identical JSON on each row.  These are
large json values (around 4KB each) with a broad mix of values and 5
levels of nesting.  However, none of the levels have very many keys per
level; the max is that the top level has up to 40 keys.  Unlike the other
data sets, I can provide a copy of processed_b for the asking.

So, some observations:

* Data sizes with lengths-and-offsets are slightly (3%) larger than
all-lengths for the pathological case (jsonbish) and unaffected for
other cases.

* Even large, complex JSON (processed_b) gets better compression with the
two patches than with head, although only slightly better (16%).

* This better compression for processed_b leads to slightly slower
extraction (6-7%), and surprisingly slower extraction for
length-and-offset than for all-lengths (about 2%).

* In the pathological case, length-and-offset was notably faster on Q1
than all-lengths (24%), and somewhat slower on Q2 (8%).  I think this
shows me that I don't understand what's happening with JSON keys at the
end.

* Notably, length-and-offset when uncompressed (EXTERNAL) was faster on
Q1 than head!  This was surprising enough that I retested it.

Overall, I'm satisfied with the performance of the length-and-offset
patch.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
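[The uncompressed (EXTERNAL) configuration Josh tested is plain DDL, for
anyone wanting to reproduce that comparison; a minimal sketch with
hypothetical table and column names:

    -- SET STORAGE EXTERNAL stores the jsonb out-of-line but uncompressed,
    -- so TOAST can fetch slices without decompressing the whole datum.
    CREATE TABLE crash_dumps (id serial PRIMARY KEY, doc jsonb);
    ALTER TABLE crash_dumps ALTER COLUMN doc SET STORAGE EXTERNAL;

    -- Only rows inserted after the ALTER are stored uncompressed; compare:
    SELECT avg(pg_column_size(doc)) FROM crash_dumps;
]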
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-19 15:40:14 +0300, Heikki Linnakangas wrote:
> On 09/18/2014 09:27 PM, Heikki Linnakangas wrote:
>> I'll try to write a more polished patch tomorrow.  We'll then see what
>> it looks like, and can decide if we want it.
>
> Ok, here are two patches.  One is a refined version of my earlier patch,
> and the other implements the separate offsets array approach.  They are
> both based on Tom's jsonb-lengths-merged.patch, so they include all the
> whitespace fixes etc. he mentioned.
>
> There is no big difference in terms of code complexity between the
> patches.  IMHO the separate offsets array is easier to understand, but
> it makes for more complicated accessor macros to find the beginning of
> the variable-length data.

I personally am pretty clearly in favor of Heikki's version.  I think it
could stand to slightly expand the reasoning behind the mixed
length/offset format; it's not immediately obvious why the offsets are
problematic for compression.  Otherwise, based on a cursory look, it
looks good.

But independent of which version is chosen, we *REALLY* need to make the
decision soon.  This issue has held up the next beta (like jsonb has
blocked previous beta) for *weeks*.

Personally it doesn't make me very happy that Heikki and Tom had to be
the people stepping up to fix this.

Greetings,

Andres Freund

-- 
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
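[On the compression point Andres raises: pglz compresses by finding
repeated byte sequences, so with an all-lengths (or mostly-lengths)
format, identical sibling objects produce identical JEntry arrays that
compress well, while per-entry absolute offsets make each array unique
even when the underlying data repeats.  A hypothetical demonstration
(table name and values invented) that should show the difference between
a 9.4beta2 build and a lengths-based build:

    -- A jsonb array of 2000 identical objects: highly repetitive input.
    CREATE TABLE repeat_test AS
    SELECT ('[' || string_agg('{"a": "xxxxxxxxxx", "b": "yyyyyyyyyy"}', ',')
            || ']')::jsonb AS doc
    FROM generate_series(1, 2000);

    -- stored    = on-disk datum size after TOAST compression;
    -- text_size = in-memory size of the uncompressed text form, a baseline.
    -- On an offsets build, stored stays near the raw binary size; on a
    -- lengths-based build it should shrink dramatically.
    SELECT pg_column_size(doc) AS stored,
           pg_column_size(doc::text) AS text_size
    FROM repeat_test;
]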
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 09:01 AM, Andres Freund wrote:
> But independent of which version is chosen, we *REALLY* need to make the
> decision soon.  This issue has held up the next beta (like jsonb has
> blocked previous beta) for *weeks*.

Yes, please!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote:
> But independent of which version is chosen, we *REALLY* need to make the
> decision soon.  This issue has held up the next beta (like jsonb has
> blocked previous beta) for *weeks*.
>
> Personally it doesn't make me very happy that Heikki and Tom had to be
> the people stepping up to fix this.

I think there are a few reasons this has been delayed, aside from the
scheduling ones:

1.  compression issues were a surprise, and we are wondering if there are
    any other surprises

2.  pg_upgrade makes future data format changes problematic

3.  9.3 multi-xact bugs spooked us into being more careful

I am not sure what we can do to increase our speed based on these items.

-- 
Bruce Momjian  <br...@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 10:14 AM, Bruce Momjian wrote:
> On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote:
>> But independent of which version is chosen, we *REALLY* need to make
>> the decision soon. [...]
>
> I think there are a few reasons this has been delayed, aside from the
> scheduling ones:
>
> 1.  compression issues were a surprise, and we are wondering if there
>     are any other surprises
>
> 2.  pg_upgrade makes future data format changes problematic
>
> 3.  9.3 multi-xact bugs spooked us into being more careful
>
> I am not sure what we can do to increase our speed based on these items.

Alternately, this is delayed because:

1. We have one tested patch to fix the issue.

2. However, people are convinced that there's a better patch possible.

3. But nobody is working on this better patch except in their spare time.

Given this, I once again vote for releasing based on Tom's lengths-only
patch, which is done, tested, and ready to go.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 10:18:24 -0700, Josh Berkus wrote:
> Alternately, this is delayed because:
>
> 1. We have one tested patch to fix the issue.
>
> 2. However, people are convinced that there's a better patch possible.
>
> 3. But nobody is working on this better patch except in their spare time.
>
> Given this, I once again vote for releasing based on Tom's lengths-only
> patch, which is done, tested, and ready to go.

Heikki's patch is there and polished.

Greetings,

Andres Freund

-- 
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 10:20 AM, Andres Freund wrote:
> On 2014-09-25 10:18:24 -0700, Josh Berkus wrote:
>> [...]
>> Given this, I once again vote for releasing based on Tom's lengths-only
>> patch, which is done, tested, and ready to go.
>
> Heikki's patch is there and polished.

If Heikki says it's ready, I'll test.  So far he's said that it wasn't
done yet.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 10:26 AM, Andres Freund wrote:
> On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
>> If Heikki says it's ready, I'll test.  So far he's said that it wasn't
>> done yet.
>
> http://www.postgresql.org/message-id/541c242e.3030...@vmware.com

Yeah, and that didn't include some of Tom's bug fixes apparently, per the
succeeding message.  Which is why I asked Heikki if he was done, to which
he has not replied.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 10:29:51 -0700, Josh Berkus wrote:
> On 09/25/2014 10:26 AM, Andres Freund wrote:
>> On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
>>> If Heikki says it's ready, I'll test.  So far he's said that it wasn't
>>> done yet.
>>
>> http://www.postgresql.org/message-id/541c242e.3030...@vmware.com
>
> Yeah, and that didn't include some of Tom's bug fixes apparently, per
> the succeeding message.  Which is why I asked Heikki if he was done, to
> which he has not replied.

Well, Heikki said he doesn't see any fixes in Tom's patch.  But either
way, this isn't anything that should prevent you from testing.

Greetings,

Andres Freund

-- 
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
> If Heikki says it's ready, I'll test.  So far he's said that it wasn't
> done yet.

http://www.postgresql.org/message-id/541c242e.3030...@vmware.com

Greetings,

Andres Freund

-- 
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus <j...@agliodbs.com> writes:
> On 09/25/2014 10:26 AM, Andres Freund wrote:
>> On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
>>> If Heikki says it's ready, I'll test.  So far he's said that it wasn't
>>> done yet.
>>
>> http://www.postgresql.org/message-id/541c242e.3030...@vmware.com
>
> Yeah, and that didn't include some of Tom's bug fixes apparently, per
> the succeeding message.  Which is why I asked Heikki if he was done, to
> which he has not replied.

I took a quick look at the two patches Heikki posted.

I find the separate offsets array approach unappealing.  It takes more
space than the other approaches, and that space will be filled with data
that we already know will not be at all compressible.  Moreover, AFAICS
we'd have to engrave the stride on stone tablets, which as I already
mentioned I'd really like to not do.

The offsets-and-lengths patch seems like the approach we ought to compare
to my patch, but it looks pretty unfinished to me: AFAICS it includes
logic to understand offsets sprinkled into a mostly-lengths array, but no
logic that would actually *store* any such offsets, which means it's
going to act just like my patch for performance purposes.

In the interests of pushing this forward, I will work today on trying to
finish and review Heikki's offsets-and-lengths patch so that we have
something we can do performance testing on.  I doubt that the performance
testing will tell us anything we don't expect, but we should do it
anyway.

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 11:22 AM, Tom Lane wrote:
> In the interests of pushing this forward, I will work today on trying to
> finish and review Heikki's offsets-and-lengths patch so that we have
> something we can do performance testing on.  I doubt that the
> performance testing will tell us anything we don't expect, but we should
> do it anyway.

OK.  I'll spend some time trying to get Socorro with JSONB working so
that I'll have a second test case.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
BTW, it seems like there is consensus that we ought to reorder the items
in a jsonb object to have keys first and then values, independently of
the other issues under discussion.  This means we *will* be breaking
on-disk compatibility with 9.4beta2, which means pg_upgrade will need to
be taught to refuse an upgrade if the database contains any jsonb
columns.  Bruce, do you have time to crank out a patch for that?

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
> BTW, it seems like there is consensus that we ought to reorder the items
> in a jsonb object to have keys first and then values, independently of
> the other issues under discussion.  This means we *will* be breaking
> on-disk compatibility with 9.4beta2, which means pg_upgrade will need to
> be taught to refuse an upgrade if the database contains any jsonb
> columns.  Bruce, do you have time to crank out a patch for that?

Yes, I can do that easily.  Tell me when you want it --- I just need a
catalog version number to trigger on.

-- 
Bruce Momjian  <br...@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote:
> On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
>> BTW, it seems like there is consensus that we ought to reorder the
>> items in a jsonb object to have keys first and then values, [...]
>> Bruce, do you have time to crank out a patch for that?
>
> Yes, I can do that easily.  Tell me when you want it --- I just need a
> catalog version number to trigger on.

Do you plan to make it conditional on jsonb being used in the database?
That'd not be bad to reduce the pain for testers that haven't used jsonb.

Greetings,

Andres Freund

-- 
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 25, 2014 at 09:00:07PM +0200, Andres Freund wrote:
> On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote:
>> [...]
>> Yes, I can do that easily.  Tell me when you want it --- I just need a
>> catalog version number to trigger on.
>
> Do you plan to make it conditional on jsonb being used in the database?
> That'd not be bad to reduce the pain for testers that haven't used
> jsonb.

Yes, I already have code that scans pg_attribute looking for columns with
problematic data types, outputs them to a file, and then throws an error.

-- 
Bruce Momjian  <br...@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
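[The catalog scan Bruce describes is essentially the query that appears
in his patch elsewhere in this thread; in standalone SQL form:

    -- Find every user-table column of type jsonb; pg_upgrade writes these
    -- to tables_using_jsonb.txt and refuses the upgrade if any exist.
    SELECT n.nspname, c.relname, a.attname
    FROM pg_catalog.pg_class c,
         pg_catalog.pg_namespace n,
         pg_catalog.pg_attribute a
    WHERE c.oid = a.attrelid AND
          NOT a.attisdropped AND
          a.atttypid = 'pg_catalog.jsonb'::pg_catalog.regtype AND
          c.relnamespace = n.oid AND
          n.nspname !~ '^pg_temp_' AND   -- exclude orphaned temp tables
          n.nspname NOT IN ('pg_catalog', 'information_schema');
]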
Re: [HACKERS] jsonb format is pessimal for toast compression
Bruce Momjian wrote:
> 3.  9.3 multi-xact bugs spooked us into being more careful

Uh.  Multixact changes in 9.3 were infinitely more invasive than the
jsonb changes will ever be.

a) they touched basic visibility design and routines, which are complex,
   understood by very few people, and have remained mostly unchanged for
   ages;
b) they changed on-disk format for an underlying support structure,
   requiring pg_upgrade to handle the conversion;
c) they added new catalog infrastructure to keep track of required
   freezing;
d) they introduced new uint32 counters subject to wraparound;
e) they introduced a novel user of slru.c with 5-char long filenames;
f) they messed with tuple locking protocol and EvalPlanQual logic for
   traversing update chains.

Maybe I'm forgetting others.  JSONB has none of these properties.  As far
as I can see, the only hairy issue here (other than getting Josh Berkus
to actually test the proposed patches) is that JSONB is changing on-disk
format; but we're avoiding most issues there by dictating that people
with existing JSONB databases need to pg_dump them, i.e. there is no
conversion step being written for pg_upgrade.

It's good to be careful; it's even better to be more careful.  I too have
learned a lesson there.

Anyway I have no opinion on the JSONB stuff, other than considering that
ignoring performance for large arrays and large objects seems to run
counter to the whole point of JSONB in the first place (and of course
failing to compress is part of that, too.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb format is pessimal for toast compression
I wrote:
> The offsets-and-lengths patch seems like the approach we ought to
> compare to my patch, but it looks pretty unfinished to me: AFAICS it
> includes logic to understand offsets sprinkled into a mostly-lengths
> array, but no logic that would actually *store* any such offsets, which
> means it's going to act just like my patch for performance purposes.

> In the interests of pushing this forward, I will work today on trying to
> finish and review Heikki's offsets-and-lengths patch so that we have
> something we can do performance testing on.  I doubt that the
> performance testing will tell us anything we don't expect, but we should
> do it anyway.

I've now done that, and attached is what I think would be a committable
version.  Having done this work, I no longer think that this approach is
significantly messier code-wise than the all-lengths version, and it does
have the merit of not degrading on very large objects/arrays.  So at the
moment I'm leaning to this solution not the all-lengths one.

To get a sense of the compression effects of varying the stride distance,
I repeated the compression measurements I'd done on 14 August with
Pavel's geometry data (24077.1408052...@sss.pgh.pa.us).  The upshot of
that was

						min	max	avg

	external text representation		220	172685	880.3
	JSON representation (compressed text)	224	78565	541.3
	pg_column_size, JSONB HEAD repr.	225	82540	639.0
	pg_column_size, all-lengths repr.	225	66794	531.1

Here's what I get with this patch and different stride distances:

	JB_OFFSET_STRIDE = 8			225	68551	559.7
	JB_OFFSET_STRIDE = 16			225	67601	552.3
	JB_OFFSET_STRIDE = 32			225	67120	547.4
	JB_OFFSET_STRIDE = 64			225	66886	546.9
	JB_OFFSET_STRIDE = 128			225	66879	546.9
	JB_OFFSET_STRIDE = 256			225	66846	546.8

So at least for that test data, 32 seems like the sweet spot.  We are
giving up a couple percent of space in comparison to the all-lengths
version, but this is probably an acceptable tradeoff for not degrading
on very large arrays.

I've not done any speed testing.

			regards, tom lane

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 2fd87fc..9beebb3 100644
*** a/src/backend/utils/adt/jsonb.c
--- b/src/backend/utils/adt/jsonb.c
*************** jsonb_from_cstring(char *json, int len)
*** 196,207 ****
  static size_t
  checkStringLen(size_t len)
  {
! 	if (len > JENTRY_POSMASK)
  		ereport(ERROR,
  				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  				 errmsg("string too long to represent as jsonb string"),
  				 errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.",
! 						   JENTRY_POSMASK)));
  	return len;
  }
--- 196,207 ----
  static size_t
  checkStringLen(size_t len)
  {
! 	if (len > JENTRY_OFFLENMASK)
  		ereport(ERROR,
  				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  				 errmsg("string too long to represent as jsonb string"),
  				 errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.",
! 						   JENTRY_OFFLENMASK)));
  	return len;
  }
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..f157df3 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
***************
*** 26,40 ****
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (the total size of an array's elements is also limited by JENTRY_POSMASK,
!  * but we're not concerned about that here)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
  
! static void fillJsonbValue(JEntry *array, int index, char *base_addr,
  			   JsonbValue *result);
! static bool	equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b);
  static int	compareJsonbScalarValue(JsonbValue *a, JsonbValue *b);
  static Jsonb *convertToJsonb(JsonbValue *val);
  static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level);
--- 26,41 ----
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (The total size of an array's or object's elements is also limited by
!  * JENTRY_OFFLENMASK, but we're not concerned about that here.)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
  
! static void fillJsonbValue(JsonbContainer *container, int index,
! 			   char *base_addr, uint32 offset, JsonbValue *result);
! static bool	equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b);
  static int
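[The min/max/avg statistics Tom reports can be gathered with a single
aggregate query; a sketch, assuming the test documents sit in a table
geo(j jsonb), a name invented here:

    -- Reports stored (post-TOAST-compression) datum sizes for each build.
    SELECT min(pg_column_size(j)) AS min,
           max(pg_column_size(j)) AS max,
           round(avg(pg_column_size(j)), 1) AS avg
    FROM geo;
]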
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/25/2014 08:10 PM, Tom Lane wrote:
> I've now done that, and attached is what I think would be a committable
> version.  Having done this work, I no longer think that this approach is
> significantly messier code-wise than the all-lengths version, and it
> does have the merit of not degrading on very large objects/arrays.  So
> at the moment I'm leaning to this solution not the all-lengths one.
>
> [...]
>
> So at least for that test data, 32 seems like the sweet spot.  We are
> giving up a couple percent of space in comparison to the all-lengths
> version, but this is probably an acceptable tradeoff for not degrading
> on very large arrays.
>
> I've not done any speed testing.

I'll do some tomorrow.  I should have some different DBs to test on, too.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/24/2014 08:16 AM, Tom Lane wrote:
> Jan Wieck <j...@wi3ck.info> writes:
>> On 09/15/2014 09:46 PM, Craig Ringer wrote:
>>> Anyway - this is looking like the change will go in, and with it a
>>> catversion bump.  Introduction of a jsonb version/flags byte might be
>>> worthwhile at the same time.  It seems likely that there'll be more
>>> room for improvement in jsonb, possibly even down to using different
>>> formats for different data.  Is it worth paying a byte per value to
>>> save on possible upgrade pain?
>
>> If there indeed has to be a catversion bump in the process of this,
>> then I agree with Craig.
>
> FWIW, I don't really.  To begin with, it wouldn't be a byte per value,
> it'd be four bytes, because we need word-alignment of the jsonb contents
> so there's noplace to squeeze in an ID byte for free.  Secondly, as I
> wrote in 15378.1408548...@sss.pgh.pa.us:
>
> : There remains the
> : question of whether to take this opportunity to add a version ID to the
> : binary format.  I'm not as excited about that idea as I originally was;
> : having now studied the code more carefully, I think that any expansion
> : would likely happen by adding more type codes and/or commandeering the
> : currently-unused high-order bit of JEntrys.  We don't need a version ID
> : in the header for that.  Moreover, if we did have such an ID, it would
> : be notationally painful to get it to most of the places that might need
> : it.
>
> Heikki's patch would eat up the high-order JEntry bits, but the other
> points remain.

If we don't need to be backwards-compatible with the 9.4beta on-disk
format, we don't necessarily need to eat the high-order JEntry bit.  You
can just assume that every nth element is stored as an offset, and the
rest as lengths.  Although it would be nice to have the flag for it
explicitly.  There are also a few free bits in the JsonbContainer header
that can be used as a version ID in the future.  So I don't think we need
to change the format to add an explicit version ID field.

- Heikki
Re: [HACKERS] jsonb format is pessimal for toast compression
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
> On 09/24/2014 08:16 AM, Tom Lane wrote:
>> Heikki's patch would eat up the high-order JEntry bits, but the other
>> points remain.
>
> If we don't need to be backwards-compatible with the 9.4beta on-disk
> format, we don't necessarily need to eat the high-order JEntry bit.  You
> can just assume that every nth element is stored as an offset, and the
> rest as lengths.  Although it would be nice to have the flag for it
> explicitly.

If we go with this approach, I think that we *should* eat the high bit
for it.  The main reason I want to do that is that it avoids having to
engrave the value of N on stone tablets.  I think that we should use a
pretty large value of N --- maybe 32 or so --- and having the freedom to
change it later based on experience seems like a good thing.

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Sep 19, 2014 at 5:40 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> I think we should bite the bullet and break compatibility with 9.4beta2
> format, even if we go with my patch.  In a jsonb object, it makes sense
> to store all the keys first, like Tom did, because of cache benefits,
> and the future possibility to do smart EXTERNAL access.  Also, even if
> we can make the on-disk format compatible, it's weird that you can get
> different runtime behavior with datums created with a beta version.
> Seems more clear to just require a pg_dump + restore.

I vote for going with your patch, and breaking compatibility for the
reasons stated here (though I'm skeptical of the claims about cache
benefits, FWIW).

-- 
Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
Peter Geoghegan <p...@heroku.com> writes:
> On Fri, Sep 19, 2014 at 5:40 AM, Heikki Linnakangas
> <hlinnakan...@vmware.com> wrote:
>> I think we should bite the bullet and break compatibility with 9.4beta2
>> format, even if we go with my patch.  [...]
>
> I vote for going with your patch, and breaking compatibility for the
> reasons stated here (though I'm skeptical of the claims about cache
> benefits, FWIW).

I'm also skeptical of that, but I think the potential for smart EXTERNAL
access is a valid consideration.

I've not had time to read Heikki's updated patch yet --- has anyone else
compared the two patches for code readability?  If they're fairly close
on that score, then I'd agree his approach is the best solution.  (I will
look at his code, but I'm not sure I'm the most unbiased observer.)

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 09:46 PM, Craig Ringer wrote:
> On 09/16/2014 07:44 AM, Peter Geoghegan wrote:
>> FWIW, I am slightly concerned about weighing use cases around very
>> large JSON documents too heavily.  Having enormous jsonb documents just
>> isn't going to work out that well, but neither will equivalent designs
>> in popular document database systems for similar reasons.  For example,
>> the maximum BSON document size supported by MongoDB is 16 megabytes,
>> and that seems to be something that their users don't care too much
>> about.  Having 270 pairs in an object isn't unreasonable, but it isn't
>> going to be all that common either.
>
> Also, at a certain size the fact that Pg must rewrite the whole document
> for any change to it starts to introduce other practical changes.
>
> Anyway - this is looking like the change will go in, and with it a
> catversion bump.  Introduction of a jsonb version/flags byte might be
> worthwhile at the same time.  It seems likely that there'll be more room
> for improvement in jsonb, possibly even down to using different formats
> for different data.  Is it worth paying a byte per value to save on
> possible upgrade pain?

This comment seems to have drowned in the discussion.

If there indeed has to be a catversion bump in the process of this, then
I agree with Craig.

Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 23, 2014 at 10:02 PM, Jan Wieck <j...@wi3ck.info> wrote:
>> Is it worth paying a byte per value to save on possible upgrade pain?
>
> This comment seems to have drowned in the discussion.
>
> If there indeed has to be a catversion bump in the process of this, then
> I agree with Craig.

-1.  We already have a reserved bit.

-- 
Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
Jan Wieck <j...@wi3ck.info> writes:
> On 09/15/2014 09:46 PM, Craig Ringer wrote:
>> Anyway - this is looking like the change will go in, and with it a
>> catversion bump.  Introduction of a jsonb version/flags byte might be
>> worthwhile at the same time.  It seems likely that there'll be more
>> room for improvement in jsonb, possibly even down to using different
>> formats for different data.  Is it worth paying a byte per value to
>> save on possible upgrade pain?
>
> If there indeed has to be a catversion bump in the process of this, then
> I agree with Craig.

FWIW, I don't really.  To begin with, it wouldn't be a byte per value,
it'd be four bytes, because we need word-alignment of the jsonb contents
so there's noplace to squeeze in an ID byte for free.  Secondly, as I
wrote in 15378.1408548...@sss.pgh.pa.us:

: There remains the
: question of whether to take this opportunity to add a version ID to the
: binary format.  I'm not as excited about that idea as I originally was;
: having now studied the code more carefully, I think that any expansion
: would likely happen by adding more type codes and/or commandeering the
: currently-unused high-order bit of JEntrys.  We don't need a version ID
: in the header for that.  Moreover, if we did have such an ID, it would be
: notationally painful to get it to most of the places that might need it.

Heikki's patch would eat up the high-order JEntry bits, but the other
points remain.

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/19/2014 07:07 AM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakan...@vmware.com> writes:
>> Tom: You mentioned earlier that your patch fixes some existing bugs.
>> What were they?
>
> What I remember at the moment (sans caffeine) is that the routines for
> assembling jsonb values out of field data were lacking some necessary
> tests for overflow of the size/offset fields.  If you like I can apply
> those fixes separately, but I think they were sufficiently integrated
> with other changes in the logic that it wouldn't really help much for
> patch reviewability.

Where are we on this?  Do we have a patch ready for testing?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/18/2014 09:27 PM, Heikki Linnakangas wrote:
> On 09/18/2014 07:53 PM, Josh Berkus wrote:
>> On 09/16/2014 08:45 PM, Tom Lane wrote:
>>> We're somewhat comparing apples and oranges here, in that I pushed my
>>> approach to something that I think is of committable quality (and
>>> which, not incidentally, fixes some existing bugs that we'd need to
>>> fix in any case); while Heikki's patch was just proof-of-concept.  It
>>> would be worth pushing Heikki's patch to committable quality so that
>>> we had a more complete understanding of just what the complexity
>>> difference really is.
>>
>> Is anyone actually working on this?  If not, I'm voting for the
>> all-lengths patch so that we can get 9.4 out the door.
>
> I'll try to write a more polished patch tomorrow.  We'll then see what
> it looks like, and can decide if we want it.

Ok, here are two patches.  One is a refined version of my earlier patch,
and the other implements the separate offsets array approach.  They are
both based on Tom's jsonb-lengths-merged.patch, so they include all the
whitespace fixes etc. he mentioned.

There is no big difference in terms of code complexity between the
patches.  IMHO the separate offsets array is easier to understand, but it
makes for more complicated accessor macros to find the beginning of the
variable-length data.

Unlike Tom's patch, these patches don't cache any offsets when doing a
binary search.  Doesn't seem worth it, when the access time is O(1)
anyway.

Both of these patches have a #define JB_OFFSET_STRIDE for the stride
size.  For the separate offsets array, the offsets array has one element
for every JB_OFFSET_STRIDE children.  For the other patch, every
JB_OFFSET_STRIDE child stores the end offset, while others store the
length.  A smaller value makes random access faster, at the cost of
compressibility / on-disk size.  I haven't done any measurements to find
the optimal value, the values in the patches are arbitrary.

I think we should bite the bullet and break compatibility with 9.4beta2
format, even if we go with my patch.  In a jsonb object, it makes sense
to store all the keys first, like Tom did, because of cache benefits, and
the future possibility to do smart EXTERNAL access.  Also, even if we can
make the on-disk format compatible, it's weird that you can get different
runtime behavior with datums created with a beta version.  Seems more
clear to just require a pg_dump + restore.

> Tom: You mentioned earlier that your patch fixes some existing bugs.
> What were they?

There were a bunch of whitespace and comment fixes that we should apply
in any case, but I couldn't see any actual bugs.  I think we should apply
those fixes separately, to make sure we don't forget about them, and to
make it easier to review these patches.

- Heikki

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 2fd87fc..456011a 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -196,12 +196,12 @@ jsonb_from_cstring(char *json, int len)
 static size_t
 checkStringLen(size_t len)
 {
-	if (len > JENTRY_POSMASK)
+	if (len > JENTRY_LENMASK)
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("string too long to represent as jsonb string"),
 				 errdetail("Due to an implementation restriction, jsonb strings cannot exceed %d bytes.",
-						   JENTRY_POSMASK)));
+						   JENTRY_LENMASK)));
 	return len;
 }
 
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..7f7ed4f 100644
--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -26,15 +26,16 @@
  * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
  * reserved for that in the JsonbContainer.header field.
  *
- * (the total size of an array's elements is also limited by JENTRY_POSMASK,
- * but we're not concerned about that here)
+ * (The total size of an array's or object's elements is also limited by
+ * JENTRY_LENMASK, but we're not concerned about that here.)
  */
 #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
 #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
 
-static void fillJsonbValue(JEntry *array, int index, char *base_addr,
+static void fillJsonbValue(JEntry *children, int index,
+			   char *base_addr, uint32 offset,
 			   JsonbValue *result);
-static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b);
+static bool equalsJsonbScalarValue(JsonbValue *a, JsonbValue *b);
 static int	compareJsonbScalarValue(JsonbValue *a, JsonbValue *b);
 static Jsonb *convertToJsonb(JsonbValue *val);
 static void convertJsonbValue(StringInfo buffer, JEntry *header, JsonbValue *val, int level);
@@ -42,7 +43,7 @@ static void convertJsonbArray(StringInfo buffer, JEntry *header, JsonbValue *val
 static void convertJsonbObject(StringInfo buffer, JEntry *header, JsonbValue *val, int level);
 static void convertJsonbScalar(StringInfo buffer, JEntry *header, JsonbValue
Re: [HACKERS] jsonb format is pessimal for toast compression
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
> Tom: You mentioned earlier that your patch fixes some existing bugs.
> What were they?

What I remember at the moment (sans caffeine) is that the routines for
assembling jsonb values out of field data were lacking some necessary
tests for overflow of the size/offset fields.  If you like I can apply
those fixes separately, but I think they were sufficiently integrated
with other changes in the logic that it wouldn't really help much for
patch reviewability.

			regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 08:45 PM, Tom Lane wrote:
> We're somewhat comparing apples and oranges here, in that I pushed my
> approach to something that I think is of committable quality (and which,
> not incidentally, fixes some existing bugs that we'd need to fix in any
> case); while Heikki's patch was just proof-of-concept.  It would be
> worth pushing Heikki's patch to committable quality so that we had a
> more complete understanding of just what the complexity difference
> really is.

Is anyone actually working on this?  If not, I'm voting for the
all-lengths patch so that we can get 9.4 out the door.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/18/2014 07:53 PM, Josh Berkus wrote:
> On 09/16/2014 08:45 PM, Tom Lane wrote:
>> We're somewhat comparing apples and oranges here, in that I pushed my
>> approach to something that I think is of committable quality [...]
>
> Is anyone actually working on this?  If not, I'm voting for the
> all-lengths patch so that we can get 9.4 out the door.

I'll try to write a more polished patch tomorrow.  We'll then see what it
looks like, and can decide if we want it.

- Heikki
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <j...@agliodbs.com> wrote:
>> Actually, having the keys all at the same level *is* relevant for the
>> issue we're discussing.  If those 270 keys are organized in a tree,
>> it's not the same as having them all on one level (and not as
>> problematic).
>
> I believe Robert meant that the 270 keys are not at the top level, but
> are at some level (in other words, some object has 270 pairs).  That is
> equivalent to having them at the top level for the purposes of this
> discussion.

Yes, that's exactly what I meant.

> FWIW, I am slightly concerned about weighing use cases around very large
> JSON documents too heavily.  Having enormous jsonb documents just isn't
> going to work out that well, but neither will equivalent designs in
> popular document database systems for similar reasons.  For example, the
> maximum BSON document size supported by MongoDB is 16 megabytes, and
> that seems to be something that their users don't care too much about.
> Having 270 pairs in an object isn't unreasonable, but it isn't going to
> be all that common either.

The JSON documents in this case were not particularly large.  These
objects were < 100kB; they just had a lot of keys.

I'm a little baffled by the apparent theme that people think that (object
size) / (# of keys) will tend to be large.  Maybe there will be some
instances where that's the case, but it's not what I'd expect.  I would
expect people to use JSON to serialize structured data in situations
where normalizing would be unwieldy.  For example, pick your favorite
Facebook or smartphone game - Plants vs. Zombies, Farmville, Candy Crush
Saga, whatever.  Or even a traditional board game like chess.  Think
about what the game state looks like as an abstract object.  Almost
without exception, you've got some kind of game board with a bunch of
squares and then you have a bunch of pieces (plants, crops, candies,
pawns) that are positioned on those squares.  Now you want to store this
in a database.  You're certainly not going to have a table column per
square, and EAV would be stupid, so what's left?  You could use an array,
but an array of strings might not be descriptive enough; for a square in
Farmville, for example, you might need to know the type of crop, and
whether it was fertilized with special magic fertilizer, and when it's
going to be ready to harvest, and when it'll wither if not harvested.  So
JSON is a pretty natural structure: an array of arrays of objects.  If
you have a 30x30 farm, you'll have 900 keys.  If you have a 50x50 farm,
which probably means you're spending real money to buy imaginary plants,
you'll have 2500 keys.

(For the record, I have no actual knowledge of how any of these games are
implemented under the hood.  I'm just speculating on how I would have
done it.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
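[The board-game shape Robert describes is easy to generate for testing; a
hypothetical sketch (field names invented) that builds a 30x30 board as
an array of arrays of objects, i.e. 900 leaf objects:

    WITH board_rows AS (
        SELECT y,
               '[' || string_agg(
                   format('{"crop": "corn", "fertilized": %s, "ready_at": %s}',
                          CASE WHEN (x + y) % 2 = 0 THEN 'true' ELSE 'false' END,
                          x * 100 + y),
                   ',' ORDER BY x) || ']' AS row_json
        FROM generate_series(1, 30) AS x,
             generate_series(1, 30) AS y
        GROUP BY y
    )
    SELECT ('[' || string_agg(row_json, ',' ORDER BY y) || ']')::jsonb AS board
    FROM board_rows;
]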
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 06:31 AM, Robert Haas wrote: On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus j...@agliodbs.com wrote: Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). I believe Robert meant that the 270 keys are not at the top level, but are at some level (in other words, some object has 270 pairs). That is equivalent to having them at the top level for the purposes of this discussion. Yes, that's exactly what I meant. FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. Also, note that we currently don't know where the last value extraction becomes a performance problem at this stage, except that it's somewhere between 200 and 100,000. Also, we don't have a test which shows the hybrid approach (Heikki's patch) performing better with 1000's of keys. Basically, if someone is going to make a serious case for Heikki's hybrid approach over the simpler lengths-only approach, then please post some test data showing the benefit ASAP, since I can't demonstrate it. Otherwise, let's get beta 3 out the door so we can get the 9.4 release train moving again. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 16, 2014 at 12:47 PM, Josh Berkus j...@agliodbs.com wrote: On 09/16/2014 06:31 AM, Robert Haas wrote: On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus j...@agliodbs.com wrote: Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). I believe Robert meant that the 270 keys are not at the top level, but are at some level (in other words, some object has 270 pairs). That is equivalent to having them at the top level for the purposes of this discussion. Yes, that's exactly what I meant. FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. I already told you that I did, and that it was the one and only app I had written for JSONB. Also, note that we currently don't know where the last value extraction becomes a performance problem at this stage, except that it's somewhere between 200 and 100,000. Also, we don't have a test which shows the hybrid approach (Heikki's patch) performing better with 1000's of keys. Fair point. Basically, if someone is going to make a serious case for Heikki's hybrid approach over the simpler lengths-only approach, then please post some test data showing the benefit ASAP, since I can't demonstrate it. Otherwise, let's get beta 3 out the door so we can get the 9.4 release train moving again. I don't personally care about this enough to spend more time on it. I told you my extremely limited experience because it seems to contradict your broader experience. If you don't care, you don't care. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 09:54 AM, Robert Haas wrote: On Tue, Sep 16, 2014 at 12:47 PM, Josh Berkus j...@agliodbs.com wrote: On 09/16/2014 06:31 AM, Robert Haas wrote: On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus j...@agliodbs.com wrote: Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). I believe Robert meant that the 270 keys are not at the top level, but are at some level (in other words, some object has 270 pairs). That is equivalent to having them at the top level for the purposes of this discussion. Yes, that's exactly what I meant. FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. I already told you that I did, and that it was the one and only app I had written for JSONB. Ah, ok, I thought yours was a test case. Did you check how it performed on the two patches at all? My tests with 185 keys didn't show any difference, including for a last key case. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 07:47 PM, Josh Berkus wrote: On 09/16/2014 06:31 AM, Robert Haas wrote: On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan p...@heroku.com wrote: On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus j...@agliodbs.com wrote: Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). I believe Robert meant that the 270 keys are not at the top level, but are at some level (in other words, some object has 270 pairs). That is equivalent to having them at the top level for the purposes of this discussion. Yes, that's exactly what I meant. FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. Also, note that we currently don't know where the last value extraction becomes a performance problem at this stage, except that it's somewhere between 200 and 100,000. Also, we don't have a test which shows the hybrid approach (Heikki's patch) performing better with 1000's of keys. Basically, if someone is going to make a serious case for Heikki's hybrid approach over the simpler lengths-only approach, then please post some test data showing the benefit ASAP, since I can't demonstrate it. Otherwise, let's get beta 3 out the door so we can get the 9.4 release train moving again. Are you looking for someone with a real life scenario, or just a synthetic test case? The latter is easy to do. See attached test program. It's basically the same one I posted earlier. Here are the results from my laptop with Tom's jsonb-lengths-merged.patch:

postgres=# select * from testtimes;
 elem | duration_ms
------+-------------
   11 |    0.289508
   12 |    0.288122
   13 |    0.290558
   14 |    0.287889
   15 |    0.286303
   17 |    0.290415
   19 |    0.289829
   21 |    0.289783
   23 |    0.287104
   25 |    0.289834
   28 |    0.290735
   31 |    0.291844
   34 |    0.293454
   37 |    0.293866
   41 |    0.291217
   45 |    0.289243
   50 |    0.290385
   55 |    0.292085
   61 |    0.290892
   67 |    0.292335
   74 |    0.292561
   81 |    0.291416
   89 |    0.295714
   98 |     0.29844
  108 |    0.297421
  119 |    0.299471
  131 |    0.299877
  144 |    0.301604
  158 |    0.303365
  174 |    0.304203
  191 |    0.303596
  210 |    0.306526
  231 |    0.304189
  254 |    0.307782
  279 |    0.307372
  307 |    0.306873
  338 |    0.310471
  372 |      0.3151
  409 |    0.320354
  450 |     0.32038
  495 |    0.322127
  545 |    0.323256
  600 |    0.330419
  660 |    0.334226
  726 |    0.336951
  799 |     0.34108
  879 |    0.347746
  967 |    0.354275
 1064 |    0.356696
 1170 |    0.366906
 1287 |    0.375352
 1416 |    0.392952
 1558 |    0.392907
 1714 |    0.402157
 1885 |    0.412384
 2074 |    0.425958
 2281 |    0.435415
 2509 |     0.45301
 2760 |    0.469983
 3036 |    0.487329
 3340 |    0.505505
 3674 |    0.530412
 4041 |    0.552585
 4445 |    0.581815
 4890 |    0.610509
 5379 |    0.642885
 5917 |    0.680395
 6509 |    0.713849
 7160 |    0.757561
 7876 |    0.805225
 8664 |    0.856142
 9530 |    0.913255
(72 rows)

That's up to 9530 elements - it's pretty easy to extrapolate from there to higher counts, it's O(n).
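[The attached script isn't reproduced here; as a hedged stand-in, a probe along the same lines can be built from one jsonb value with N synthetic keys, stored uncompressed so that only lookup cost is measured:

    CREATE TABLE jsonb_probe (doc jsonb);
    ALTER TABLE jsonb_probe ALTER COLUMN doc SET STORAGE EXTERNAL;  -- rule out compression effects
    INSERT INTO jsonb_probe
    SELECT ('{' || string_agg('"k' || g || '":' || g, ',') || '}')::jsonb
    FROM generate_series(1, 9530) g;

    \timing on
    SELECT doc -> 'k1'    FROM jsonb_probe;  -- first key
    SELECT doc -> 'k9530' FROM jsonb_probe;  -- last key

The names and element count are arbitrary; judging by the results above, Heikki's script additionally varies N and works over arrays.]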
With unpatched git master, the runtime is flat, regardless of which element is queried, at about 0.29 ms. With jsonb-with-offsets-and-lengths-2.patch, there's no difference that I could measure. The difference starts to be meaningful at around 500 entries. In practice, I doubt anyone's going to notice until you start talking about tens of thousands of entries. I'll leave it up to the jury to decide if we care or not. It seems like a fairly unusual use case, where you push around large enough arrays or objects to notice. Then again, I'm sure *someone* will do it. People do strange things, and they find ways to abuse the features that the original developers didn't think of. - Heikki jsonb-lengths.sql Description: application/sql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 16, 2014 at 3:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: I'll leave it up to the jury to decide if we care or not. It seems like a fairly unusual use case, where you push around large enough arrays or objects to notice. Then again, I'm sure *someone* will do it. People do strange things, and they find ways to abuse the features that the original developers didn't think of. Again, it's not abusing the feature. It's using it. Jsonb is supposed to be fast for this. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 16, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote: Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. I already told you that I did, and that it was the one and only app I had written for JSONB. Ah, ok, I thought yours was a test case. Did you check how it performed on the two patches at all? My tests with 185 keys didn't show any difference, including for a last key case. No, I didn't test it. But I think Heikki's test results pretty much tell us everything there is to see here. This isn't really that complicated; I've read a few papers on index compression over the years and they seem to often use techniques that have the same general flavor as what Heikki did here, adding complexity in the data format to gain other advantages. So I don't think we should be put off. Basically, I think that if we make a decision to use Tom's patch rather than Heikki's patch, we're deciding that the initial decision, by the folks who wrote the original jsonb code, to make array access less than O(n) was misguided. While that could be true, I'd prefer to bet that those folks knew what they were doing. The only reason we're even considering changing it is that the array of lengths doesn't compress well, and we've got an approach that fixes that problem while preserving the advantages of fast lookup. We should have a darn fine reason to say no to that approach, and "it didn't benefit my particular use case" is not it. In practice, I'm not very surprised that the impact doesn't seem too bad when you're running SQL queries from the client. There's so much other overhead, for de-TOASTing and client communication and even just planner and executor costs, that this gets lost in the noise. But think about a PL/pgSQL procedure, say, where somebody might loop over all of the elements in an array. If those operations go from O(1) to O(n), then the loop goes from O(n) to O(n^2). I will bet you a beverage of your choice that somebody will find that behavior within a year of release and be dismayed by it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
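[The access pattern Robert warns about, sketched as a hedged example (not from the thread; jsonb_array_length and the integer -> operator both exist in 9.4):

    DO $$
    DECLARE
        arr jsonb := (SELECT json_agg(g) FROM generate_series(1, 10000) g)::jsonb;
        i   int;
    BEGIN
        FOR i IN 0 .. jsonb_array_length(arr) - 1 LOOP
            PERFORM arr -> i;  -- O(1) per fetch with offsets; O(i) with lengths only
        END LOOP;
    END;
    $$;

With O(1) element access the loop is O(n); if each fetch degrades to O(n), the loop as a whole becomes O(n^2), which is the regression at issue.]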
Re: [HACKERS] jsonb format is pessimal for toast compression
Heikki, Robert: On 09/16/2014 11:12 AM, Heikki Linnakangas wrote: Are you looking for someone with a real life scenario, or just a synthetic test case? The latter is easy to do. See attached test program. It's basically the same one I posted earlier. Here are the results from my laptop with Tom's jsonb-lengths-merged.patch: Thanks for that!

postgres=# select * from testtimes;
 elem | duration_ms
------+-------------
 3674 |    0.530412
 4041 |    0.552585
 4445 |    0.581815

This looks like the level at which the difference gets to be really noticeable. Note that this is completely swamped by the difference between compressed vs. uncompressed though. With unpatched git master, the runtime is flat, regardless of which element is queried, at about 0.29 ms. With jsonb-with-offsets-and-lengths-2.patch, there's no difference that I could measure. OK, thanks. The difference starts to be meaningful at around 500 entries. In practice, I doubt anyone's going to notice until you start talking about tens of thousands of entries. I'll leave it up to the jury to decide if we care or not. It seems like a fairly unusual use case, where you push around large enough arrays or objects to notice. Then again, I'm sure *someone* will do it. People do strange things, and they find ways to abuse the features that the original developers didn't think of. Right, but the question is whether it's worth having more complex code and data structures in order to support what certainly *seems* to be a fairly obscure use-case, that is, more than 4000 keys at the same level. And it's not like it stops working or becomes completely unresponsive at that level; it's just double the response time. On 09/16/2014 12:20 PM, Robert Haas wrote: Basically, I think that if we make a decision to use Tom's patch rather than Heikki's patch, we're deciding that the initial decision, by the folks who wrote the original jsonb code, to make array access less than O(n) was misguided. While that could be true, I'd prefer to bet that those folks knew what they were doing. The only reason we're even considering changing it is that the array of lengths doesn't compress well, and we've got an approach that fixes that problem while preserving the advantages of fast lookup. We should have a darn fine reason to say no to that approach, and "it didn't benefit my particular use case" is not it. Do you feel that way *as a code maintainer*? That is, if you ended up maintaining the JSONB code, would you still feel that it's worth the extra complexity? Because that will be the main cost here. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 16, 2014 at 3:24 PM, Josh Berkus j...@agliodbs.com wrote: Do you feel that way *as a code maintainer*? That is, if you ended up maintaining the JSONB code, would you still feel that it's worth the extra complexity? Because that will be the main cost here. I feel that Heikki doesn't have a reputation for writing or committing unmaintainable code. I haven't reviewed the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 16/09/14 21:20, Robert Haas wrote: In practice, I'm not very surprised that the impact doesn't seem too bad when you're running SQL queries from the client. There's so much other overhead, for de-TOASTing and client communication and even just planner and executor costs, that this gets lost in the noise. But think about a PL/pgSQL procedure, say, where somebody might loop over all of the elements in an array. If those operations go from O(1) to O(n), then the loop goes from O(n) to O(n^2). I will bet you a beverage of your choice that somebody will find that behavior within a year of release and be dismayed by it. As somebody who did see a server melt (quite literally, that time, unfortunately) thanks to the CPU overhead of operations on varlena arrays: +1 (in fact +many). Especially if we are trying to promote the json improvements in 9.4 as a best-of-both-worlds kind of thing. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 10:37 PM, Robert Haas wrote: On Tue, Sep 16, 2014 at 3:24 PM, Josh Berkus j...@agliodbs.com wrote: Do you feel that way *as a code maintainer*? That is, if you ended up maintaining the JSONB code, would you still feel that it's worth the extra complexity? Because that will be the main cost here. I feel that Heikki doesn't have a reputation for writing or committing unmaintainable code. I haven't reviewed the patch. The patch I posted was not pretty, but I'm sure it could be refined to something sensible. There are many possible variations of the basic scheme of storing mostly lengths, but an offset for every N elements. I replaced the length with an offset on some elements and used a flag bit to indicate which it is. Perhaps a simpler approach would be to store lengths, but also store a separate smaller array of offsets, after the lengths array. I can write a patch if we want to go that way. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
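[A toy model of the scheme Heikki sketches here, at the SQL level rather than the on-disk format - a hedged illustration only, with plain arrays standing in for the JEntry data and a stride of 32 chosen to match the value discussed elsewhere in the thread:

    CREATE FUNCTION entry_offset(lengths int[], anchors int[], i int)
    RETURNS int LANGUAGE plpgsql IMMUTABLE AS $$
    DECLARE
        stride CONSTANT int := 32;
        -- anchors[k] holds the absolute offset of entry (k-1)*stride
        off int := anchors[i / stride + 1];
        j   int;
    BEGIN
        -- add the lengths of the entries between the anchor and entry i
        FOR j IN (i / stride) * stride + 1 .. i LOOP
            off := off + lengths[j];
        END LOOP;
        RETURN off;  -- byte offset of 0-based entry i
    END;
    $$;

Locating any entry then sums at most stride-1 lengths instead of i of them, while the runs of small lengths stay compressible.]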
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Sep 16, 2014 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 16, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote: Well, I can only judge from the use cases I personally have, none of which involve more than 100 keys at any level for most rows. So far I've seen some people argue hypothetical use cases involving hundreds of keys per level, but nobody who *actually* has such a use case. I already told you that I did, and that it was the one and only app I had written for JSONB. Ah, ok, I thought yours was a test case. Did you check how it performed on the two patches at all? My tests with 185 keys didn't show any difference, including for a last key case. No, I didn't test it. But I think Heikki's test results pretty much tell us everything there is to see here. This isn't really that complicated; I've read a few papers on index compression over the years and they seem to often use techniques that have the same general flavor as what Heikki did here, adding complexity in the data format to gain other advantages. So I don't think we should be put off. I second this reasoning. Even though I ran a couple of very realistic test cases that support all-lengths, I do feel that the hybrid approach would be better, as it covers all bases. To put things in perspective, Tom's latest patch isn't much simpler either. Since it would still be a breaking change, we should consider changing the layout to key-key-key-value-value-value, as it seems to pay off. Basically, I think that if we make a decision to use Tom's patch rather than Heikki's patch, we're deciding that the initial decision, by the folks who wrote the original jsonb code, to make array access less than O(n) was misguided. While that could be true, I'd prefer to bet that those folks knew what they were doing. The only reason we're even considering changing it is that the array of lengths doesn't compress well, and we've got an approach that fixes that problem while preserving the advantages of fast lookup. We should have a darn fine reason to say no to that approach, and "it didn't benefit my particular use case" is not it. In practice, I'm not very surprised that the impact doesn't seem too bad when you're running SQL queries from the client. There's so much other overhead, for de-TOASTing and client communication and even just planner and executor costs, that this gets lost in the noise. But think about a PL/pgSQL procedure, say, where somebody might loop over all of the elements in an array. If those operations go from O(1) to O(n), then the loop goes from O(n) to O(n^2). I will bet you a beverage of your choice that somebody will find that behavior within a year of release and be dismayed by it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
- Quote from Robert Haas (robertmh...@gmail.com), on 16.09.2014 at 22:20 - In practice, I'm not very surprised that the impact doesn't seem too bad when you're running SQL queries from the client. There's so much other overhead, for de-TOASTing and client communication and even just planner and executor costs, that this gets lost in the noise. But think about a PL/pgSQL procedure, say, where somebody might loop over all of the elements in an array. If those operations go from O(1) to O(n), then the loop goes from O(n) to O(n^2). I will bet you a beverage of your choice that somebody will find that behavior within a year of release and be dismayed by it. Hi, I can imagine a situation exactly like that. We could use a jsonb object to represent sparse vectors in the database, where the key is the dimension and the value is the value. So they could easily grow to thousands of dimensions. Once you have that in the database it is easy to go and write some simple numeric computations on these vectors, let's say you want a dot product of 2 sparse vectors. If random access inside one vector goes to O(n^2), then the dot product computation goes to O(n^2*m^2), so not pretty. I am not saying that the DB is the right place to do this type of computation, but it is sometimes convenient to have it also in the DB. Regards, luben -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
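[For what it's worth, the dot product itself need not rely on per-key random access. A set-based sketch using 9.4's jsonb_each_text (the vector shape and the function name "dot" are illustrative) streams each vector once:

    CREATE FUNCTION dot(a jsonb, b jsonb) RETURNS numeric
    LANGUAGE sql IMMUTABLE AS $$
        SELECT coalesce(sum(av.value::numeric * bv.value::numeric), 0)
        FROM jsonb_each_text(a) av
        JOIN jsonb_each_text(b) bv USING (key);
    $$;

    SELECT dot('{"1": 2, "7": 0.5}', '{"7": 4, "9": 1}');  -- 2.0

That keeps the computation near O(n + m) per pair of vectors regardless of how element lookup is implemented, though it doesn't help code that genuinely needs indexed access.]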
Re: [HACKERS] jsonb format is pessimal for toast compression
Heikki Linnakangas hlinnakan...@vmware.com writes: On 09/16/2014 10:37 PM, Robert Haas wrote: On Tue, Sep 16, 2014 at 3:24 PM, Josh Berkus j...@agliodbs.com wrote: Do you feel that way *as a code maintainer*? That is, if you ended up maintaining the JSONB code, would you still feel that it's worth the extra complexity? Because that will be the main cost here. I feel that Heikki doesn't have a reputation for writing or committing unmaintainable code. I haven't reviewed the patch. The patch I posted was not pretty, but I'm sure it could be refined to something sensible. We're somewhat comparing apples and oranges here, in that I pushed my approach to something that I think is of committable quality (and which, not incidentally, fixes some existing bugs that we'd need to fix in any case); while Heikki's patch was just proof-of-concept. It would be worth pushing Heikki's patch to committable quality so that we had a more complete understanding of just what the complexity difference really is. There are many possible variations of the basic scheme of storing mostly lengths, but an offset for every N elements. I replaced the length with offset on some element and used a flag bit to indicate which it is. Aside from the complexity issue, a demerit of Heikki's solution is that it eats up a flag bit that we may well wish we had back later. On the other hand, there's definitely something to be said for not breaking pg_upgrade-ability of 9.4beta databases. Perhaps a simpler approach would be to store lengths, but also store a separate smaller array of offsets, after the lengths array. That way would also give up on-disk compatibility, and I'm not sure it's any simpler in practice than your existing solution. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/12/2014 01:30 PM, Heikki Linnakangas wrote: Performance was one argument for sure. It's not hard to come up with a case where the all-lengths approach is much slower: take a huge array with, say, a million elements, and fetch the last element in a tight loop. And do that in a PL/pgSQL function without storing the datum to disk, so that it doesn't get toasted. Not a very common thing to do in real life, although something like that might come up if you do a lot of json processing in PL/pgSQL, but storing offsets makes that faster. While I didn't post the results (because they were uninteresting), I did specifically test the last element in a set of 200 elements for all-lengths vs. original offsets for JSONB, and the results were not statistically different. I did not test against your patch; is there some reason why your patch would be faster for the last element case than the original offsets version? If not, I think the corner case is so obscure as to be not worth optimizing for. I can't imagine that more than a tiny minority of our users are going to have thousands of keys per datum. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 2:12 PM, Josh Berkus j...@agliodbs.com wrote: If not, I think the corner case is so obscure as to be not worth optimizing for. I can't imagine that more than a tiny minority of our users are going to have thousands of keys per datum. Worst case is cost scaling linearly with the number of keys, so how expensive it is depends on the number of keys. It would have an effect only on uncompressed jsonb, since compressed jsonb already pays a linear cost for decompression. I'd suggest testing performance of a large number of small keys in uncompressed form. It's bound to have a noticeable regression there. Now, large small keys could be 200 or 2000, or even 20k. I'd guess several should be tested to find the shape of the curve. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
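[One way to sweep that range - a sketch with synthetic keys and arbitrary counts, not a benchmark from the thread - is to build one object per key count and compare sizes and last-key extraction:

    SELECT n,
           pg_column_size(doc) AS bytes,
           doc -> ('k' || n)   AS last_key
    FROM (
        SELECT n,
               ('{' || (SELECT string_agg('"k' || g || '":' || g, ',')
                        FROM generate_series(1, n) g) || '}')::jsonb AS doc
        FROM unnest(ARRAY[200, 2000, 20000, 100000]) AS t(n)
    ) probes;

Timing each n separately (e.g. under psql's \timing) would trace the curve Claudio is asking for.]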
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 10:23 AM, Claudio Freire wrote: Now, large small keys could be 200 or 2000, or even 20k. I'd guess several should be tested to find the shape of the curve. Well, we know that it's not noticeable with 200, and that it is noticeable with 100K. It's only worth testing further if we think that having more than 200 top-level keys in one JSONB value is going to be a use case for more than 0.1% of our users. I personally do not. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 4:09 PM, Josh Berkus j...@agliodbs.com wrote: On 09/15/2014 10:23 AM, Claudio Freire wrote: Now, large small keys could be 200 or 2000, or even 20k. I'd guess several should be tested to find the shape of the curve. Well, we know that it's not noticeable with 200, and that it is noticeable with 100K. It's only worth testing further if we think that having more than 200 top-level keys in one JSONB value is going to be a use case for more than 0.1% of our users. I personally do not. Yes, but bear in mind that the worst case is exactly at the use case jsonb was designed to speed up: element access within relatively big json documents. Having them uncompressed is to be expected, because people using jsonb will often favor speed over compactness if it's a tradeoff (otherwise they'd use plain json). So while you're right that it's perhaps above what would be a common use case, the range somewhere between 200 and 100K for the tipping point seems overly imprecise to me. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 12:15 PM, Claudio Freire wrote: So while you're right that it's perhaps above what would be a common use case, the range somewhere between 200 and 100K for the tipping point seems overly imprecise to me. Well, then, you know how to solve that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus j...@agliodbs.com wrote: On 09/15/2014 12:15 PM, Claudio Freire wrote: So while you're right that it's perhaps above what would be a common use case, the range somewhere between 200 and 100K for the tipping point seems overly imprecise to me. Well, then, you know how to solve that. I was hoping testing with other numbers was as simple as hitting a key for someone else. But sure. I'll set something up. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 12:25 PM, Claudio Freire wrote: On Mon, Sep 15, 2014 at 4:17 PM, Josh Berkus j...@agliodbs.com wrote: On 09/15/2014 12:15 PM, Claudio Freire wrote: So while you're right that it's perhaps above what would be a common use case, the range somewhere between 200 and 100K for the tipping point seems overly imprecise to me. Well, then, you know how to solve that. I was hoping testing with other numbers was as simple as hitting a key for someone else. Nope. My test case has a fixed size. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus j...@agliodbs.com wrote: On 09/15/2014 10:23 AM, Claudio Freire wrote: Now, large small keys could be 200 or 2000, or even 20k. I'd guess several should be tested to find the shape of the curve. Well, we know that it's not noticeable with 200, and that it is noticeable with 100K. It's only worth testing further if we think that having more than 200 top-level keys in one JSONB value is going to be a use case for more than 0.1% of our users. I personally do not. FWIW, I have written one (1) application that uses JSONB and it has one sub-object (not the top-level object) that in the most typical configuration contains precisely 270 keys. Now, granted, that is not the top-level object, if that distinction is actually relevant here, but color me just a bit skeptical of this claim anyway. This was just a casual thing I did for my own use, not anything industrial strength, so it's hard to believe I'm stressing the system more than 99.9% of users will. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 02:16 PM, Robert Haas wrote: On Mon, Sep 15, 2014 at 3:09 PM, Josh Berkus j...@agliodbs.com wrote: On 09/15/2014 10:23 AM, Claudio Freire wrote: Now, large small keys could be 200 or 2000, or even 20k. I'd guess several should be tested to find the shape of the curve. Well, we know that it's not noticeable with 200, and that it is noticeable with 100K. It's only worth testing further if we think that having more than 200 top-level keys in one JSONB value is going to be a use case for more than 0.1% of our users. I personally do not. FWIW, I have written one (1) application that uses JSONB and it has one sub-object (not the top-level object) that in the most typical configuration contains precisely 270 keys. Now, granted, that is not the top-level object, if that distinction is actually relevant here, but color me just a bit skeptical of this claim anyway. This was just a casual thing I did for my own use, not anything industrial strength, so it's hard to believe I'm stressing the system more than 99.9% of users will. Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus j...@agliodbs.com wrote: Actually, having the keys all at the same level *is* relevant for the issue we're discussing. If those 270 keys are organized in a tree, it's not the same as having them all on one level (and not as problematic). I believe Robert meant that the 270 keys are not at the top level, but are at some level (in other words, some object has 270 pairs). That is equivalent to having them at the top level for the purposes of this discussion. FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
I couldn't get my hands on the twitter data but I'm generating my own. The json template is http://paste2.org/wJ1dfcjw and data was generated with http://www.json-generator.com/. It has 35 top level keys, just in case someone is wondering. I generated 1 random objects and I'm inserting them repeatedly until I got 320k rows.

Test query: SELECT data->'name', data->'email' FROM t_json
Test storage: EXTERNAL
Test jsonb lengths quartiles: {1278,1587,1731,1871,2231}
Tom's lengths+cache aware: 455ms
HEAD: 440ms

This is a realistic-ish workload in my opinion and Tom's patch performs within 4% of HEAD. Due to the overall lengths I couldn't really test compressibility, so I re-ran the test. This time I inserted an array of 2 objects in each row, as in [obj, obj]. The objects were taken in sequence from the 1 pool so contents match in both tests.

Test query: SELECT data #> '{0, name}', data #> '{0, email}', data #> '{1, name}', data #> '{1, email}' FROM t_json
Test storage: EXTENDED

HEAD: 17mb table + 878mb toast
HEAD size quartiles: {2015,2500,2591,2711,3483}
HEAD query runtime: 15s

Tom's: 220mb table + 580mb toast
Tom's size quartiles: {1665,1984,2061,2142.25,2384}
Tom's query runtime: 13s

This is an intriguing edge case where Tom's patch actually outperforms the base implementation for 3~4kB jsons.
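[Incidentally, quartile lists like the ones above can be produced with the ordered-set aggregates new in 9.4, assuming pg_column_size is the measure used here:

    SELECT percentile_cont(ARRAY[0, 0.25, 0.5, 0.75, 1])
           WITHIN GROUP (ORDER BY pg_column_size(data))
    FROM t_json;

The five numbers are min, first quartile, median, third quartile, and max.]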
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/16/2014 07:44 AM, Peter Geoghegan wrote: FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Also, at a certain size the fact that Pg must rewrite the whole document for any change to it starts to introduce other practical challenges. Anyway - this is looking like the change will go in, and with it a catversion bump. Introduction of a jsonb version/flags byte might be worthwhile at the same time. It seems likely that there'll be more room for improvement in jsonb, possibly even down to using different formats for different data. Is it worth paying a byte per value to save on possible upgrade pain? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/11/2014 06:56 PM, Arthur Silva wrote: In my tests with the github archive data the savings-vs-performance-penalty tradeoff was fine, but I'm not confident in those results since there were only 8 top level keys. Well, we did want to see that the patch doesn't create a regression with data which doesn't fall into the problem case area, and your test did that nicely. For comparison, some twitter api objects[1] have 30+ top level keys. If I have time in the next couple of days I'll conduct some testing with the public twitter fire-hose data. Yah, if we have enough time for me to get the Mozilla Socorro test environment working, I can also test with Mozilla crash data. That has some deep nesting and very large values. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote: So, I finally got time to test Tom's latest patch on this. TLDR: we want to go with Tom's latest patch and release beta3. Figures: So I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably. Did you -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote: So, I finally got time to test Tom's latest patch on this. TLDR: we want to go with Tom's latest patch and release beta3. Figures: So I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably. Did you Blah. Did you test Heikki's patch from here? http://www.postgresql.org/message-id/53ec8194.4020...@vmware.com Tom didn't like it, but I thought it was rather clever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/12/2014 10:00 AM, Robert Haas wrote: On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote: So, I finally got time to test Tom's latest patch on this. TLDR: we want to go with Tom's latest patch and release beta3. Figures: So I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably. Did you Blah. Did you test Heikki's patch from here? http://www.postgresql.org/message-id/53ec8194.4020...@vmware.com Tom didn't like it, but I thought it was rather clever. Yes, I posted the results for that a couple weeks ago; Tom had posted a cleaned-up version of that patch, but materially it made no difference in sizes or extraction times compared with Tom's lengths-only patch. Same for Arthur's tests. It's certainly possible that there is a test case for which Heikki's approach is superior, but if so we haven't seen it. And since its approach is also more complicated, sticking with the simpler lengths-only approach seems like the way to go. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote: On 09/12/2014 10:00 AM, Robert Haas wrote: On Fri, Sep 12, 2014 at 1:00 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Sep 11, 2014 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote: So, I finally got time to test Tom's latest patch on this. TLDR: we want to go with Tom's latest patch and release beta3. Figures: So I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably. Did you Blah. Did you test Heikki's patch from here? http://www.postgresql.org/message-id/53ec8194.4020...@vmware.com Tom didn't like it, but I thought it was rather clever. Yes, I posted the results for that a couple weeks ago; Tom had posted a cleaned-up version of that patch, but materially it made no difference in sizes or extraction times compared with Tom's lengths-only patch. Same for Arthur's tests. It's certainly possible that there is a test case for which Heikki's approach is superior, but if so we haven't seen it. And since its approach is also more complicated, sticking with the simpler lengths-only approach seems like the way to go. Huh, OK. I'm slightly surprised, but that's why we benchmark these things. Thanks for following up on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Robert Haas robertmh...@gmail.com writes: On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote: It's certainly possible that there is a test case for which Heikki's approach is superior, but if so we haven't seen it. And since its approach is also more complicated, sticking with the simpler lengths-only approach seems like the way to go. Huh, OK. I'm slightly surprised, but that's why we benchmark these things. The argument for Heikki's patch was never that it would offer better performance; it's obvious (at least to me) that it won't. The argument was that it'd be upward-compatible with what we're doing now, so that we'd not have to force an on-disk compatibility break with 9.4beta2. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/12/2014 08:52 PM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Sep 12, 2014 at 1:11 PM, Josh Berkus j...@agliodbs.com wrote: It's certainly possible that there is a test case for which Heikki's approach is superior, but if so we haven't seen it. And since its approach is also more complicated, sticking with the simpler lengths-only approach seems like the way to go. Huh, OK. I'm slightly surprised, but that's why we benchmark these things. The argument for Heikki's patch was never that it would offer better performance; it's obvious (at least to me) that it won't. Performance was one argument for sure. It's not hard to come up with a case where the all-lengths approach is much slower: take a huge array with, say, a million elements, and fetch the last element in a tight loop. And do that in a PL/pgSQL function without storing the datum to disk, so that it doesn't get toasted. Not a very common thing to do in real life, although something like that might come up if you do a lot of json processing in PL/pgSQL, but storing offsets makes that faster. IOW, something like this:

do $$
declare
  ja jsonb;
  i int4;
begin
  select json_agg(g) into ja from generate_series(1, 10) g;
  for i in 1..10 loop
    perform ja -> 9;
  end loop;
end;
$$;

should perform much better with current git master or my patch, than with the all-lengths patch. I'm OK with going for the all-lengths approach anyway; it's simpler, and working with huge arrays is hopefully not that common. But it's not a completely open-and-shut case. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
So, I finally got time to test Tom's latest patch on this. TLDR: we want to go with Tom's latest patch and release beta3. Figures: So I tested HEAD against the latest lengths patch. Per Arthur Silva, I checked uncompressed times for JSONB against compressed times. This changed the picture considerably.

TABLE SIZES
-----------

HEAD

      ?column?       | pg_size_pretty
---------------------+----------------
 json text format    | 393 MB
 jsonb: compressed   | 1147 MB
 jsonb: uncompressed | 1221 MB

PATCHED

      ?column?       | pg_size_pretty
---------------------+----------------
 json text format    | 394 MB
 jsonb: compressed   | 525 MB
 jsonb: uncompressed | 1200 MB

EXTRACTION TIMES
----------------

HEAD

Q1 (search via GIN index followed by extracting 100,000 values from rows):
jsonb compressed: 4000
jsonb uncompressed: 3250

Q2 (seq scan and extract 200,000 values from rows):
json: 11700
jsonb compressed: 3150
jsonb uncompressed: 2700

PATCHED

Q1:
jsonb compressed: 6750
jsonb uncompressed: 3350

Q2:
json: 11796
jsonb compressed: 4700
jsonb uncompressed: 2650

Conclusion: with Tom's patch, JSONB is 55% smaller when compressed (EXTENDED). Extraction times are 50% to 70% slower, but this appears to be almost entirely due to decompression overhead. When not compressing (EXTERNAL), extraction times for the patched version are statistically the same as HEAD, and file sizes are similar to HEAD.

USER REACTION
-------------

I polled at both PDXpgDay and at FOSS4G, asking some ~80 Postgres users how they would feel about a compression vs. extraction time tradeoff. The audience was evenly split. However, with the current patch, the user can choose. Users who know enough for performance tuning can set JSONB columns to EXTERNAL, and get the same performance as the unpatched version. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
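[The knob Josh refers to is a per-column setting; EXTENDED (the default for jsonb) allows compression, while EXTERNAL stores values out of line but uncompressed. Reusing the t_json table from Arthur's test upthread:

    ALTER TABLE t_json ALTER COLUMN data SET STORAGE EXTERNAL;

Note this only affects newly stored rows; existing rows keep whatever storage form they already have until rewritten.]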
Re: [HACKERS] jsonb format is pessimal for toast compression
* Josh Berkus (j...@agliodbs.com) wrote: TLDR: we want to go with Tom's latest patch and release beta3. Having not even read the rest- yes please. We really need to get beta3 out and figure out when we're going to actually release 9.4... Admittedly, the last month has been good and we've been fixing issues, but it'd really be good to wrap 9.4 up. Conclusion: with Tom's patch, JSONB is 55% smaller when compressed (EXTENDED). Extraction times are 50% to 70% slower, but this appears to be almost entirely due to decompression overhead. When not compressing (EXTERNAL), extraction times for the patched version are statistically the same as HEAD, and file sizes are similar to HEAD. Not really a surprise. I polled at both PDXpgDay and at FOSS4G, asking some ~80 Postgres users how they would feel about a compression vs. extraction time tradeoff. The audience was evenly split. Also not a surprise. However, with the current patch, the user can choose. Users who know enough for performance tuning can set JSONB columns to EXTERNAL, and get the same performance as the unpatched version. Agreed. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Sep 11, 2014 at 10:01 PM, Josh Berkus j...@agliodbs.com wrote: [...] The compression ratio difference is exaggerated in this case but it does support that Tom's patch alleviates the extraction penalty. In my tests with the github archive data the savings-vs-performance-penalty tradeoff was fine, but I'm not confident in those results since there were only 8 top level keys. For comparison, some twitter api objects[1] have 30+ top level keys. If I have time in the next couple of days I'll conduct some testing with the public twitter fire-hose data. [1] https://dev.twitter.com/rest/reference/get/statuses/home_timeline
Re: [HACKERS] jsonb format is pessimal for toast compression
On Sep 4, 2014, at 7:26 PM, Jan Wieck j...@wi3ck.info wrote: This is only because the input data was exact copies of the same strings over and over again. PGLZ can very well compress slightly less identical strings of varying lengths too. Not as well, but well enough. But I suspect such input data would make it fail again, even with lengths. We had a bit of discussion about JSONB compression at PDXPUG Day this morning. Josh polled the room, and about half thought we should apply the patch for better compression, while the other half seemed to want faster access operations. (Some folks no doubt voted for both.) But in the ensuing discussion, I started to think that maybe we should leave it as it is, for two reasons:

1. There has been a fair amount of discussion about ways to better deal with this in future releases, such as hints to TOAST about how to compress, or the application of different compression algorithms (or pluggable compression). I’m assuming that leaving it as-is does not remove those possibilities.

2. The major advantage of JSONB is fast access operations. If those are not as important for a given use case as storage space, there’s still the JSON type, which *does* compress reasonably well. IOW, we already have a JSON alternative that compresses well. So why make the same (or similar) trade-offs with JSONB?

Just my $0.02. I would like to see some consensus on this, soon, though, as I am eager to get 9.4 and JSONB, regardless of the outcome! Best, David signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/12/2014 10:58 AM, Robert Haas wrote: What would really be ideal here is if the JSON code could inform the toast compression code "this many initial bytes are likely incompressible, just pass them through without trying, and then start compressing at byte N", where N is the byte following the TOC. But I don't know that there's a reasonable way to implement that. Sorry for being late to the party. Anyhow, this strikes me as a good basic direction of thought. But I think we should put the burden on the data type, not on toast. To do that, data types could have an optional toast_hint_values() function, which the toast code can call with the actual datum at hand and its default parameter array. The hint values function then can modify that parameter array, telling toast how much to skip, how hard to try (or not at all) and so on. A data type specific function should know much better how to figure out how compressible a particular datum may be. Certainly nothing for 9.4, but it might require changing the toast API in a different way than just handing it an oid and hard-coding the JSONBOID case into toast for 9.4. If we are going to change the API, we might as well do it right. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 10:21 AM, Andrew Dunstan wrote: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. This interacts poorly with the code in pglz_compress() that gives up if it's found nothing compressible in the first first_success_by bytes of a value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) [snip] There is plenty of compressible data once we get into the repetitive strings in the payload part --- but that starts at offset 944, and up to that point there is nothing that pg_lzcompress can get a handle on. There are, by definition, no sequences of 4 or more repeated bytes in that area. I think in principle pg_lzcompress could decide to compress the 3-byte sequences consisting of the high-order 24 bits of each offset; but it doesn't choose to do so, probably because of the way its lookup hash table works:

 * pglz_hist_idx -
 *
 *      Computes the history table slot for the lookup by the next 4
 *      characters in the input.
 *
 * NB: because we use the next 4 characters, we are not guaranteed to
 * find 3-character matches; they very possibly will be in the wrong
 * hash list.  This seems an acceptable tradeoff for spreading out the
 * hash keys more.

For jsonb header data, the next 4 characters are *always* different, so only a chance hash collision can result in a match. There is therefore a pretty good chance that no compression will occur before it gives up because of first_success_by. I'm not sure if there is any easy fix for this. We could possibly change the default first_success_by value, but I think that'd just be postponing the problem to larger jsonb objects/arrays, and it would hurt performance for genuinely incompressible data. A somewhat painful, but not yet out-of-the-question, alternative is to change the jsonb on-disk representation. Perhaps the JEntry array could be defined as containing element lengths instead of element ending offsets. Not sure though if that would break binary searching for JSON object keys. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. Maybe we should have pglz_compress() look at the *last* 1024 bytes if it can't find anything worth compressing in the first, for values larger than a certain size. It's worth noting that this is a fairly pathological case. AIUI the example you constructed has an array with 100k string elements. I don't think that's typical. So I suspect that unless I've misunderstood the statement of the problem we're going to find that almost all the jsonb we will be storing is still compressible. I also think that a substantial part of the difficulty in coming up with a representative data sample is that the size of the incompressible data at the beginning is somewhat tied to the overall size of the datum itself. This may or may not be true in any particular use case, but as a general rule of thumb I would assume that the larger the JSONB document, the larger the offset array at the beginning.
Would changing 1024 to a fraction of the datum length for the time being give us enough room to come up with a proper solution for 9.5? Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
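The effect Tom describes is easy to reproduce outside of PostgreSQL. The standalone C demo below is a simplified model, not the real jsonb or pglz code: it fills one array with strictly increasing end offsets and another with the constant lengths of identical elements, then counts 4-byte windows that exactly repeat the preceding window, a crude stand-in for what pglz's history lookup can match. The offsets array yields no repeats at all; the lengths array is repetitive almost everywhere.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    enum { N = 256, ELEM_LEN = 10 };
    uint32_t offsets[N], lengths[N], off = 0;

    for (int i = 0; i < N; i++)
    {
        lengths[i] = ELEM_LEN;      /* identical elements: constant lengths */
        off += ELEM_LEN;
        offsets[i] = off;           /* strictly increasing end offsets */
    }

    const uint8_t *a = (const uint8_t *) offsets;
    const uint8_t *b = (const uint8_t *) lengths;
    int rep_off = 0, rep_len = 0;

    /* count 4-byte windows that repeat the window 4 bytes earlier */
    for (size_t i = 4; i + 4 <= sizeof offsets; i++)
    {
        rep_off += memcmp(a + i, a + i - 4, 4) == 0;
        rep_len += memcmp(b + i, b + i - 4, 4) == 0;
    }
    printf("repeating 4-byte runs: offsets=%d, lengths=%d\n", rep_off, rep_len);
    return 0;
}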
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 11:18 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. This is only because the input data was exact copies of the same strings over and over again. PGLZ can very well compress slightly less identical strings of varying lengths too. Not as well, but well enough. But I suspect such input data would make it fail again, even with lengths. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Wed, Aug 27, 2014 at 1:09 AM, Arthur Silva arthur...@gmail.com wrote: It won't be faster by any means, but it should definitely be incorporated if any format changes are made (like Tom already suggested). I think it's important we gather at least 2 more things before making any calls:

* Josh's tests w/ the cache-aware patch, which should confirm cache-aware is indeed preferred
* Tests with toast hacked to use lz4 instead, which might ease any decisions

-- Arthur Silva On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote: On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com wrote: The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation. I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the cache awareness of the new revision is necessarily a distinct advantage. -- Peter Geoghegan I'm attaching a quick-n-dirty patch that uses lz4 compression instead of pglz in case someone wants to experiment with it. Seems to work in my test env; I'll run more tests when I get home. PS: gotta love gmail's fixed defaults of top-posting... lz4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
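For anyone who wants to try the same experiment, swapping pglz for lz4 boils down to calls like the following. This standalone sketch uses lz4's current one-shot API (LZ4_compress_default and friends; the attached patch may well use an older entry point) and a dummy buffer in place of a real varlena datum. It requires liblz4; build with -llz4.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

int main(void)
{
    char src[4096];
    for (int i = 0; i < (int) sizeof src; i++)
        src[i] = "abcdefgh"[i % 8];              /* repetitive test payload */

    int bound = LZ4_compressBound(sizeof src);   /* worst-case output size */
    char *dst = malloc(bound);
    int clen = LZ4_compress_default(src, dst, sizeof src, bound);
    if (clen <= 0)
    {
        fprintf(stderr, "compression failed\n");
        return 1;
    }
    printf("%zu bytes -> %d bytes\n", sizeof src, clen);

    char back[sizeof src];
    int dlen = LZ4_decompress_safe(dst, back, clen, sizeof back);
    printf("round-trip ok: %d\n", dlen == (int) sizeof back &&
           memcmp(src, back, sizeof back) == 0);
    free(dst);
    return 0;
}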
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/16/2014 02:19 AM, Tom Lane wrote: I think the realistic alternatives at this point are either to switch to all-lengths as in my test patch, or to use the hybrid approach of Heikki's test patch. IMO the major attraction of Heikki's patch is that it'd be upward compatible with existing beta installations, ie no initdb required (but thus, no opportunity to squeeze in a version identifier either). It's not showing up terribly well in the performance tests I've been doing --- it's about halfway between HEAD and my patch on that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my patch, there are things that could be done to micro-optimize it by touching a bit more code. I did some quick stats comparing compressed sizes for the delicio.us data, printing quartiles as per Josh's lead:

all-lengths     {440, 569, 609, 655, 1257}
Heikki's patch  {456, 582, 624, 671, 1274}
HEAD            {493, 636, 684, 744, 1485}

(As before, this is pg_column_size of the jsonb within a table whose rows are wide enough to force tuptoaster.c to try to compress the jsonb; otherwise many of these values wouldn't get compressed.) These documents don't have enough keys to trigger the first_success_by issue, so that HEAD doesn't look too awful, but still there's about an 11% gain from switching from offsets to lengths. Heikki's method captures much of that but not all. Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree. It's not too pretty, no. But it would be nice to not have to make a tradeoff between lookup speed and compressibility. Yet another idea is to store all lengths, but add an additional array of offsets to JsonbContainer. The array would contain the offset of, say, every 16th element. It would be very small compared to the lengths array, but would greatly speed up random access on a large array/object. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
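That every-Nth-offset idea is straightforward to sketch. The simplified C model below is not the real JsonbContainer layout; STRIDE, the arrays, and the helper are all illustrative. It stores a length for every element plus the absolute offset of every STRIDE'th element, so random access walks at most STRIDE - 1 lengths from the nearest stored offset instead of summing from the start.

#include <stdint.h>
#include <stdio.h>

#define STRIDE 16

static uint32_t
element_offset(const uint32_t *lengths, const uint32_t *sparse_offsets,
               uint32_t k)
{
    uint32_t anchor = k / STRIDE;        /* nearest stored offset at or below k */
    uint32_t off = sparse_offsets[anchor];

    for (uint32_t i = anchor * STRIDE; i < k; i++)
        off += lengths[i];               /* O(STRIDE) work instead of O(k) */
    return off;
}

int main(void)
{
    enum { N = 100 };
    uint32_t lengths[N], sparse[(N + STRIDE - 1) / STRIDE], off = 0;

    for (uint32_t i = 0; i < N; i++)
    {
        if (i % STRIDE == 0)
            sparse[i / STRIDE] = off;    /* record every STRIDE'th offset */
        lengths[i] = 7 + (i % 5);        /* arbitrary varying element lengths */
        off += lengths[i];
    }
    printf("offset of element 42: %u\n", element_offset(lengths, sparse, 42));
    return 0;
}

The lengths themselves stay repetitive and hence compressible; only the small sparse offset array reintroduces incompressible bytes, which seems to be exactly the tradeoff Heikki is pointing at.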
Re: [HACKERS] jsonb format is pessimal for toast compression
Heikki Linnakangas hlinnakan...@vmware.com writes: On 08/16/2014 02:19 AM, Tom Lane wrote: I think the realistic alternatives at this point are either to switch to all-lengths as in my test patch, or to use the hybrid approach of Heikki's test patch. ... Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree. It's not too pretty, no. But it would be nice to not have to make a tradeoff between lookup speed and compressibility. Yet another idea is to store all lengths, but add an additional array of offsets to JsonbContainer. The array would contain the offset of, say, every 16th element. It would be very small compared to the lengths array, but would greatly speed up random access on a large array/object. That does nothing to address my basic concern about the patch, which is that it's too complicated and therefore bug-prone. Moreover, it'd lose on-disk compatibility which is really the sole saving grace of the proposal. My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/26/2014 07:51 AM, Tom Lane wrote: My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent. Well, I have shown one test case where lengths are a net penalty. However, for that to be the case, you have to have the following conditions *all* be true:

* lots of top-level keys
* short values
* rows which are on the borderline for TOAST
* table which fits in RAM

... so that's a special case, and if it's sub-optimal, no biggie. Also, it's not like it's an order-of-magnitude slower. Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus j...@agliodbs.com writes: Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/26/2014 11:40 AM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( regards, tom lane yah, me too. :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus j...@agliodbs.com writes: On 08/26/2014 11:40 AM, Tom Lane wrote: I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. Yeah, exactly. Given current hardware trends, data compression is becoming more of a win not less as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object. ... and while I was typing that sentence, lightning struck. The existing arrangement of object subfields with keys and values interleaved is just plain dumb. We should rearrange that as all the keys in order, then all the values in the same order. Then the keys are naturally adjacent in memory and object-key searches become much more cache-friendly: you probably touch most of the key portion of the object, but none of the values portion, until you know exactly what part of the latter to fetch. This approach might complicate the lookup logic marginally, but I bet not very much; and it will be a huge help if we ever want to do smart access to EXTERNAL (non-compressed) JSON values. I will go prototype that just to see how much code rearrangement is required. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
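As a toy model of that keys-first layout, here is a standalone C sketch. These are not the real jsonb structs; ObjectModel and object_lookup are invented for illustration. All keys sit adjacent and sorted, with values in matching order afterward, so the binary search touches only the key region until the final value fetch.

#include <stdio.h>
#include <string.h>

typedef struct
{
    int          nkeys;
    const char **keys;    /* sorted, contiguous: the cache-friendly part */
    const char **values;  /* values[i] belongs to keys[i] */
} ObjectModel;

static const char *
object_lookup(const ObjectModel *obj, const char *key)
{
    int lo = 0, hi = obj->nkeys;

    while (lo < hi)                    /* binary search over keys only */
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(obj->keys[mid], key);

        if (cmp == 0)
            return obj->values[mid];   /* first touch of the values region */
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

int main(void)
{
    const char *keys[] = {"actor", "type", "url"};
    const char *vals[] = {"igrigorik", "PushEvent", "https://github.com/..."};
    ObjectModel obj = {3, keys, vals};

    printf("%s\n", object_lookup(&obj, "type"));
    return 0;
}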
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-08-26 15:01:27 -0400, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: On 08/26/2014 11:40 AM, Tom Lane wrote: I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. Yeah, exactly. Given current hardware trends, data compression is becoming more of a win not less as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. On the other hand, the majority of databases these days fit into main memory due to its increasing size, and postgres is more often CPU- than IO-bound. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 26 August 2014 11:34, Josh Berkus j...@agliodbs.com wrote: On 08/26/2014 07:51 AM, Tom Lane wrote: My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent. Well, I have shown one test case where lengths are a net penalty. However, for that to be the case, you have to have the following conditions *all* be true: * lots of top-level keys * short values * rows which are on the borderline for TOAST * table which fits in RAM ... so that's a special case, and if it's sub-optimal, no biggie. Also, it's not like it's an order-of-magnitude slower. Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html It would be really interesting to see your results with column STORAGE EXTERNAL for that benchmark. I think it is important to separate out the slowdown due to decompression now being needed vs. that inherent in the new format; we can always switch off compression on a per-column basis using STORAGE EXTERNAL. My JSON data has smallish objects with a small number of keys; it barely compresses at all with the patch and shows similar results to Arthur's data. Across ~500K rows I get:

encoded=# select count(properties->'submitted_by') from compressed;
 count
--------
 431948
(1 row)

Time: 250.512 ms

encoded=# select count(properties->'submitted_by') from uncompressed;
 count
--------
 431948
(1 row)

Time: 218.552 ms

Laurence
Re: [HACKERS] jsonb format is pessimal for toast compression
Andres Freund and...@2ndquadrant.com writes: On 2014-08-26 15:01:27 -0400, Tom Lane wrote: Yeah, exactly. Given current hardware trends, data compression is becoming more of a win not less as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. On the other hand the majority of databases these day fit into main memory due to its increasing sizes and postgres is more often CPU than IO bound. Well, better data compression helps make that true ;-). And don't forget cache effects; actual main memory is considered slow these days. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Aug 26, 2014 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Josh Berkus j...@agliodbs.com writes: On 08/26/2014 11:40 AM, Tom Lane wrote: I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( Unfortunately even the outside commentors don't seem to understand that storage size *is* related to speed, it's exchanging I/O speed for CPU speed. Yeah, exactly. Given current hardware trends, data compression is becoming more of a win not less as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object. What about my earlier proposal? An in-memory compressed representation would greatly help cache locality, more so if you pack keys as you mentioned. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-08-26 15:17:13 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-26 15:01:27 -0400, Tom Lane wrote: Yeah, exactly. Given current hardware trends, data compression is becoming more of a win not less as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. On the other hand, the majority of databases these days fit into main memory due to its increasing size, and postgres is more often CPU- than IO-bound. Well, better data compression helps make that true ;-). People disable toast compression, though, because it results in better performance :(. Part of that could be fixed by a faster compression method, part of it by decompressing less often. But still. And don't forget cache effects; actual main memory is considered slow these days. Right. But that plays the other way round too: compressed datums need to be copied to be accessed uncompressed, whereas for inline uncompressed datums that's not necessary. Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Aug 26, 2014 at 12:27 PM, Andres Freund and...@2ndquadrant.com wrote: Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. +1 -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/26/2014 12:27 PM, Andres Freund wrote: Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. But in this case the gains aren't small; we're talking up to 60% smaller storage. Testing STORAGE EXTENDED soon. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
I wrote: I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object. ... and while I was typing that sentence, lightning struck. The existing arrangement of object subfields with keys and values interleaved is just plain dumb. We should rearrange that as all the keys in order, then all the values in the same order. Then the keys are naturally adjacent in memory and object-key searches become much more cache-friendly: you probably touch most of the key portion of the object, but none of the values portion, until you know exactly what part of the latter to fetch. This approach might complicate the lookup logic marginally, but I bet not very much; and it will be a huge help if we ever want to do smart access to EXTERNAL (non-compressed) JSON values. I will go prototype that just to see how much code rearrangement is required. This looks pretty good from a coding point of view. I have not had time yet to see if it affects the speed of the benchmark cases we've been trying. I suspect that it won't make much difference in them. I think if we do decide to make an on-disk format change, we should seriously consider including this change. The same concept could be applied to offset-based storage of course, although I rather doubt that we'd make that combination of choices, since it would be giving up on-disk compatibility for benefits that are mostly in the future. Attached are two patches: one is a delta against the last jsonb-lengths patch I posted, and the other is a merged patch showing the total change from HEAD, for ease of application. regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index e47eaea..4e7fe67 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
***************
*** 26,33 ****
  * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
  * reserved for that in the JsonbContainer.header field.
  *
! * (the total size of an array's elements is also limited by JENTRY_LENMASK,
! * but we're not concerned about that here)
  */
 #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
 #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
--- 26,33 ----
  * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
  * reserved for that in the JsonbContainer.header field.
  *
! * (The total size of an array's or object's elements is also limited by
! * JENTRY_LENMASK, but we're not concerned about that here.)
  */
 #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
 #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
*************** findJsonbValueFromContainer(JsonbContain
*** 294,303 ****
 {
 	JEntry	   *children = container->children;
 	int			count = (container->header & JB_CMASK);
! 	JsonbValue *result = palloc(sizeof(JsonbValue));

 	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);

 	if (flags & JB_FARRAY & container->header)
 	{
 		char	   *base_addr = (char *) (children + count);
--- 294,309 ----
! 	JsonbValue *result;

 	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);

+ 	/* Quick out without a palloc cycle if object/array is empty */
+ 	if (count <= 0)
+ 		return NULL;
+
+ 	result = palloc(sizeof(JsonbValue));
+
 	if (flags & JB_FARRAY & container->header)
 	{
 		char	   *base_addr = (char *) (children + count);
*************** findJsonbValueFromContainer(JsonbContain
*** 323,329 ****
 		char	   *base_addr = (char *) (children + count * 2);
 		uint32	   *offsets;
 		uint32		lastoff;
! 		int			lastoffpos;
 		uint32		stopLow = 0,
 					stopHigh = count;
--- 329,335 ----
 		char	   *base_addr = (char *) (children + count * 2);
 		uint32	   *offsets;
 		uint32		lastoff;
! 		int			i;
 		uint32		stopLow = 0,
 					stopHigh = count;
*************** findJsonbValueFromContainer(JsonbContain
*** 332,379 ****
 		/*
 		 * We use a cache to avoid redundant getJsonbOffset() computations
! 		 * inside the search loop.  Note that count may well be zero at this
! 		 * point; to avoid an ugly special case for initializing lastoff and
! 		 * lastoffpos, we allocate one extra array element.
 		 */
! 		offsets = (uint32 *) palloc((count * 2 + 1) * sizeof(uint32));
! 		offsets[0] = lastoff = 0;
! 		lastoffpos = 0;

 		/* Binary search on object/pair keys *only* */
 		while (stopLow < stopHigh)
 		{
 			uint32		stopMiddle;
- 			int			index;
 			int			difference;
 			JsonbValue	candidate;

 			stopMiddle = stopLow + (stopHigh - stopLow) / 2;

- 			/*
- 			 * Compensate for the fact that we're searching through pairs (not
- 			 * entries).
- 			 */
- 			index = stopMiddle * 2;
-
- 			/* Update the offsets cache through at least
Re: [HACKERS] jsonb format is pessimal for toast compression
Tom, here are the results with the github data (8 top-level keys) only. Here's a sample object: https://gist.github.com/igrigorik/2017462

All-Lengths + Cache-Aware EXTERNAL
Query 1: 516ms
Query 2: 350ms

The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation. Compare to (copied from my previous email):

HEAD (aka, all offsets) EXTERNAL
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTERNAL
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms

-- Arthur Silva On Tue, Aug 26, 2014 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object. ... and while I was typing that sentence, lightning struck. The existing arrangement of object subfields with keys and values interleaved is just plain dumb. We should rearrange that as all the keys in order, then all the values in the same order. Then the keys are naturally adjacent in memory and object-key searches become much more cache-friendly: you probably touch most of the key portion of the object, but none of the values portion, until you know exactly what part of the latter to fetch. This approach might complicate the lookup logic marginally, but I bet not very much; and it will be a huge help if we ever want to do smart access to EXTERNAL (non-compressed) JSON values. I will go prototype that just to see how much code rearrangement is required. This looks pretty good from a coding point of view. I have not had time yet to see if it affects the speed of the benchmark cases we've been trying. I suspect that it won't make much difference in them. I think if we do decide to make an on-disk format change, we should seriously consider including this change. The same concept could be applied to offset-based storage of course, although I rather doubt that we'd make that combination of choices, since it would be giving up on-disk compatibility for benefits that are mostly in the future. Attached are two patches: one is a delta against the last jsonb-lengths patch I posted, and the other is a merged patch showing the total change from HEAD, for ease of application. regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com wrote: The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation. I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the cache awareness of the new revision is necessarily a distinct advantage. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
Peter Geoghegan p...@heroku.com writes: I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the cache awareness of the new revision is necessarily a distinct advantage. I doubt it's a significant advantage in the current state of the code; I'm happy if it's not a loss. I was looking ahead to someday fetching key values efficiently from large EXTERNAL (ie out-of-line-but-not-compressed) JSON values, analogously to the existing optimization for fetching text substrings from EXTERNAL text values. As mentioned upthread, the current JSONB representation would be seriously unfriendly to such a thing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
It won't be faster by any means, but it should definitely be incorporated if any format changes are made (like Tom already suggested). I think it's important we gather at least 2 more things before making any calls:

* Josh's tests w/ the cache-aware patch, which should confirm cache-aware is indeed preferred
* Tests with toast hacked to use lz4 instead, which might ease any decisions

-- Arthur Silva On Wed, Aug 27, 2014 at 12:53 AM, Peter Geoghegan p...@heroku.com wrote: On Tue, Aug 26, 2014 at 8:41 PM, Arthur Silva arthur...@gmail.com wrote: The difference is small but it's definitely faster, which makes sense since cache line misses are probably slightly reduced. As in the previous runs, I ran the query a dozen times and took the average after excluding runs with a high deviation. I'm not surprised that it hasn't beaten HEAD. I haven't studied the problem in detail, but I don't think that the cache awareness of the new revision is necessarily a distinct advantage. -- Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
On Thu, Aug 21, 2014 at 6:20 PM, Josh Berkus j...@agliodbs.com wrote: On 08/20/2014 03:42 PM, Arthur Silva wrote: What data are you using right now Josh? The same data as upthread. Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload? I'm concerned that my workload is unusual and don't want us to make this decision based entirely on it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com Here are my test results so far with the github archive data. It's important to keep in mind that the PushEvent objects I use in the queries only contain a small number of keys (8 to be precise), so these tests don't really stress the changed code. Anyway, in this dataset (with the small objects) the all-lengths patch provides small compression savings, but the overhead is minimal.

Test data: 610MB of JSON -- 341969 items
Index size (jsonb_ops): 331MB

Test query 1: SELECT data->'url', data->'actor' FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 1 items: 169732

Test query 2: SELECT data FROM t_json WHERE data @> '{"type": "PushEvent"}'
Test query 2 items:

HEAD (aka, all offsets) EXTENDED
Size: 374MB
Toast Size: 145MB
Test query 1 runtime: 680ms
Test query 2 runtime: 405ms

HEAD (aka, all offsets) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 505ms
Test query 2 runtime: 350ms

All Lengths (Tom Lane patch) EXTENDED
Size: 379MB
Toast Size: 108MB
Test query 1 runtime: 720ms
Test query 2 runtime: 420ms

All Lengths (Tom Lane patch) EXTERNAL
Size: 366MB
Toast Size: 333MB
Test query 1 runtime: 525ms
Test query 2 runtime: 355ms

-- Arthur Silva
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/20/2014 03:42 PM, Arthur Silva wrote: What data are you using right now Josh? The same data as upthread. Can you test the three patches (9.4 head, 9.4 with Tom's cleanup of Heikki's patch, and 9.4 with Tom's latest lengths-only) on your workload? I'm concerned that my workload is unusual and don't want us to make this decision based entirely on it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers