Re: [HACKERS] jsonb format is pessimal for toast compression
I was not complaining; I think JSONB is awesome. But I am one of those people who would like to put 100's of GB (or more) JSON files into Postgres and I am concerned about file size and possible future changes to the format. On Fri, Aug 8, 2014 at 7:10 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus j...@agliodbs.com wrote: One we ship 9.4, many users are going to load 100's of GB into JSONB fields. Even if we fix the compressability issue in 9.5, those users won't be able to fix the compression without rewriting all their data, which could be prohibitive. And we'll be in a position where we have to support the 9.4 JSONB format/compression technique for years so that users aren't blocked from upgrading. FWIW, if we take the delicious JSON data as representative, a table storing that data as jsonb is 1374 MB in size. Whereas an equivalent table with the data typed using the original json datatype (but with white space differences more or less ignored, because it was created using a jsonb - json cast), the same data is 1352 MB. Larry's complaint is valid; this is a real problem, and I'd like to fix it before 9.4 is out. However, let us not lose sight of the fact that JSON data is usually a poor target for TOAST compression. With idiomatic usage, redundancy is very much more likely to appear across rows, and not within individual Datums. Frankly, we aren't doing a very good job there, and doing better requires an alternative strategy. -- Peter Geoghegan
Re: [HACKERS] jsonb format is pessimal for toast compression
Apologies if this is a ridiculous suggestion, but I believe that swapping out the compression algorithm (for Snappy, for example) has been discussed in the past. I wonder if that algorithm is sufficiently different that it would produce a better result, and if that might not be preferable to some of the other options. On Thu, Aug 7, 2014 at 11:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. This interacts poorly with the code in pglz_compress() that gives up if it's found nothing compressible in the first first_success_by bytes of a value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) As an example, here's gdb's report of the bitwise representation of the example JSON value in the bug thread: 0x2ab85ac: 0x2005 0x0004 0x50003098 0x309f 0x2ab85bc: 0x30ae 0x30b8 0x30cf 0x30da 0x2ab85cc: 0x30df 0x30ee 0x3105 0x6b6e756a 0x2ab85dc: 0x40de 0x0034 0x0068 0x009c 0x2ab85ec: 0x00d0 0x0104 0x0138 0x016c 0x2ab85fc: 0x01a0 0x01d4 0x0208 0x023c 0x2ab860c: 0x0270 0x02a4 0x02d8 0x030c 0x2ab861c: 0x0340 0x0374 0x03a8 0x03dc 0x2ab862c: 0x0410 0x0444 0x0478 0x04ac 0x2ab863c: 0x04e0 0x0514 0x0548 0x057c 0x2ab864c: 0x05b0 0x05e4 0x0618 0x064c 0x2ab865c: 0x0680 0x06b4 0x06e8 0x071c 0x2ab866c: 0x0750 0x0784 0x07b8 0x07ec 0x2ab867c: 0x0820 0x0854 0x0888 0x08bc 0x2ab868c: 0x08f0 0x0924 0x0958 0x098c 0x2ab869c: 0x09c0 0x09f4 0x0a28 0x0a5c 0x2ab86ac: 0x0a90 0x0ac4 0x0af8 0x0b2c 0x2ab86bc: 0x0b60 0x0b94 0x0bc8 0x0bfc 0x2ab86cc: 0x0c30 0x0c64 0x0c98 0x0ccc 0x2ab86dc: 0x0d00 0x0d34 0x0d68 0x0d9c 0x2ab86ec: 0x0dd0 0x0e04 0x0e38 0x0e6c 0x2ab86fc: 0x0ea0 0x0ed4 0x0f08 0x0f3c 0x2ab870c: 0x0f70 0x0fa4 0x0fd8 0x100c 0x2ab871c: 0x1040 0x1074 0x10a8 0x10dc 0x2ab872c: 0x1110 0x1144 0x1178 0x11ac 0x2ab873c: 0x11e0 0x1214 0x1248 0x127c 0x2ab874c: 0x12b0 0x12e4 0x1318 0x134c 0x2ab875c: 0x1380 0x13b4 0x13e8 0x141c 0x2ab876c: 0x1450 0x1484 0x14b8 0x14ec 0x2ab877c: 0x1520 0x1554 0x1588 0x15bc 0x2ab878c: 0x15f0 0x1624 0x1658 0x168c 0x2ab879c: 0x16c0 0x16f4 0x1728 0x175c 0x2ab87ac: 0x1790 0x17c4 0x17f8 0x182c 0x2ab87bc: 0x1860 0x1894 0x18c8 0x18fc 0x2ab87cc: 0x1930 0x1964 0x1998 0x19cc 0x2ab87dc: 0x1a00 0x1a34 0x1a68 0x1a9c 0x2ab87ec: 0x1ad0 0x1b04 0x1b38 0x1b6c 0x2ab87fc: 0x1ba0 0x1bd4 0x1c08 0x1c3c 0x2ab880c: 0x1c70 0x1ca4 0x1cd8 0x1d0c 0x2ab881c: 0x1d40 0x1d74 0x1da8 0x1ddc 0x2ab882c: 0x1e10 0x1e44 0x1e78 0x1eac 0x2ab883c: 0x1ee0 0x1f14 0x1f48 0x1f7c 0x2ab884c: 0x1fb0 0x1fe4 0x2018 0x204c 0x2ab885c: 0x2080 0x20b4 0x20e8 0x211c 0x2ab886c: 0x2150 0x2184 0x21b8 0x21ec 0x2ab887c: 0x2220 0x2254 0x2288 0x22bc 0x2ab888c: 0x22f0 0x2324 0x2358 0x238c 0x2ab889c: 0x23c0 0x23f4 0x2428 0x245c 0x2ab88ac: 0x2490 0x24c4 0x24f8 0x252c 0x2ab88bc: 0x2560 0x2594 0x25c8 0x25fc 0x2ab88cc: 0x2630 0x2664 0x2698 0x26cc 0x2ab88dc: 0x2700 0x2734 0x2768