Re: [HACKERS] Proposal: custom compression methods

2015-12-16 Thread Michael Paquier
On Wed, Dec 16, 2015 at 10:17 PM, Andres Freund wrote:
> Again: unless I'm missing something, that was solved in
> www.postgresql.org/message-id/flat/20130615102028.gk19...@alap2.anarazel.de
>
> Personally I think we should add lz4 and maybe one strongly compressing
> algorithm and be done with it. Runtime extensibility will make this much
> more complicated, and I doubt it's worth the complexity.

+1. In the end we are going to reach the same conclusion we did for FPW
compression when we discussed that: let's just switch to something that
is less CPU-consuming than pglz, and lz4 is a good candidate for that.
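
For the record, liblz4's entry points map almost one-to-one onto the
pglz-style signatures discussed here. A minimal sketch of such wrappers
(the wrapper names are made up; only the LZ4_* calls are real liblz4 API):

  #include <lz4.h>

  static int32
  lz4_compress(const char *source, int32 slen, char *dest, int32 dlen)
  {
      /* returns bytes written to dest, or 0 if dest is too small */
      return LZ4_compress_default(source, dest, slen, dlen);
  }

  static int32
  lz4_decompress(const char *source, int32 slen, char *dest, int32 rawsize)
  {
      /* returns the decompressed size, or a negative value on corrupt input */
      return LZ4_decompress_safe(source, dest, slen, rawsize);
  }
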
-- 
Michael




Re: [HACKERS] Proposal: custom compression methods

2015-12-16 Thread Jim Nasby

On 12/16/15 7:03 AM, Tomas Vondra wrote:


While versioning and increasing the 1GB limit are interesting, they have
nothing to do with this particular patch. (BTW I don't see how the
versioning would work at the varlena level - that's something that needs
to be handled at the data type level.)


Right, but that's often going to be very hard to do while still supporting 
pg_upgrade. It's not like most types have spare bits lying around. 
Granted, this still means non-varlena types are screwed.



I think the only question we need to ask here is whether we want to
allow mixed compression for a column. If not, we're OK with the current
header. This is what the patch does, as it requires a rewrite after
changing the compression method.


I think that is related to the other items though: none of those other 
items (except maybe the 1GB limit) seems to warrant dorking with varlena, 
but if there were 3 separate features that could make use of 8-byte 
varlena support then perhaps it's time to invest in that effort. 
Especially since IIRC we're currently out of bits, so if we wanted to 
change this we'd need to do it at least a release in advance, so that 
there was a version that would expand a 4-byte varlena to 8 bytes as needed.



And we're not painting ourselves into a corner - if we decide to
increase the varlena header size in the future, this patch does not make
it any more complicated.


True.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Proposal: custom compression methods

2015-12-16 Thread Tomas Vondra

Hi,

On 12/14/2015 12:51 PM, Simon Riggs wrote:

On 13 December 2015 at 17:28, Alexander Korotkov wrote:

it would be nice to make compression methods pluggable.


Agreed.

My thinking is that this should be combined with work to make use of
the compressed data, which is why Alvaro, Tomas, and David have been
working on the Col Store API for about 18 months; work on that continues,
with more submissions due for 9.6.


I'm not sure it makes sense to combine those two uses of compression, 
because there are various differences - some subtle, some less subtle. 
It's a bit difficult to discuss this without any column store 
background, but I'll try anyway.


The compression methods discussed in this thread, used to compress a 
single varlena value, are "general-purpose" in the sense that they 
operate on an opaque stream of bytes, without any additional context 
(e.g. about the structure of the data being compressed). So essentially 
the methods have an API like this:


  /* presumably both return the number of bytes written to dst, or -1 on failure */
  int   compress(const char *src, int srclen, char *dst, int dstlen);
  int decompress(const char *src, int srclen, char *dst, int dstlen);

And possibly some auxiliary methods like "estimate compressed length" 
and such.


OTOH the compression methods we're messing with while working on the 
column store are quite different - they operate on columns (i.e. "arrays 
of Datums"). Also, column stores prefer "light-weight" compression 
methods like RLE or DICT (dictionary compression), because those methods 
allow execution on compressed data when done properly - which, for 
example, requires additional info about the data type in the column, so 
that the RLE groups match the data type length.


So the API of those methods looks quite different, compared to the 
general-purpose methods. Not only will the compression/decompression 
methods have additional parameters with info about the data type, but 
there'll also be methods for iterating over values in the compressed 
data, etc.
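
Just to illustrate the difference, such an API might look something like 
this (a made-up sketch, not an actual proposal - the point is the type 
info and the iterator, which the general-purpose API has no use for):

  typedef struct ColCompressor ColCompressor;

  /* type info is needed e.g. so RLE groups match the type length */
  ColCompressor *colcompress_begin(Oid typid, int16 typlen, bool typbyval);
  void colcompress_add(ColCompressor *cc, Datum value, bool isnull);
  int  colcompress_finish(ColCompressor *cc, char *dst, int dstlen);

  /* iterate over compressed data without decompressing it */
  typedef struct ColIterator ColIterator;

  ColIterator *coliter_init(const char *src, int srclen, Oid typid);
  bool coliter_next(ColIterator *it, Datum *value, bool *isnull);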


Of course, it'd be nice to have the ability to add/remove even those 
light-weight methods, but I'm not sure it makes sense to squash them 
into the same catalog. I can imagine a catalog suitable for both APIs 
(essentially having two groups of columns, one for each type of 
compression algorithm), but I can't really imagine a compression method 
providing both interfaces at the same time.


In any case, I don't think this is the main challenge the patch needs to 
solve at this point.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: custom compression methods

2015-12-16 Thread Andres Freund
On 2015-12-16 14:14:36 +0100, Tomas Vondra wrote:
> I think the main obstacle to making this possible is the lack of free space
> in the varlena header, i.e. the need to add the ID of the compression method
> to the value.
> 
> FWIW I'd like to allow this (mixing compression types), but I don't think
> it's worth the complexity at this point. We can add that later, if it turns
> out to be a problem in practice (which it probably won't).

Again: unless I'm missing something, that was solved in
www.postgresql.org/message-id/flat/20130615102028.gk19...@alap2.anarazel.de

Personally I think we should add lz4 and maybe one strongly compressing
algorithm and be done with it. Runtime extensibility will make this much
more complicated, and I doubt it's worth the complexity.




Re: [HACKERS] Proposal: custom compression methods

2015-12-16 Thread Tomas Vondra



On 12/14/2015 07:28 PM, Jim Nasby wrote:

On 12/14/15 12:50 AM, Craig Ringer wrote:

The issue with per-Datum is that TOAST claims two bits of a varlena
header, which already limits us to 1 GiB varlena values, something
people are starting to find to be a problem. There's no wiggle room to
steal more bits. If you want pluggable compression you need a way to
store knowledge of how a given datum is compressed with the datum or
have a fast, efficient way to check.


... unless we allowed for 8-byte varlena headers. Compression changes
themselves certainly don't warrant that, but if people are already
unhappy with the 1GB TOAST limit then maybe that's enough.

The other thing this might buy us is a few bits that could be used to
support Datum versioning for other purposes, such as when the binary
format of something changes. I would think that at some point we'll need
that for pg_upgrade.


While versioning and increasing the 1GB limit are interesting, they have 
nothing to do with this particular patch. (BTW I don't see how the 
versioning would work at the varlena level - that's something that needs 
to be handled at the data type level.)


I think the only question we need to ask here is whether we want to 
allow mixed compression for a column. If not, we're OK with the current 
header. This is what the patch does, as it requires a rewrite after 
changing the compression method.


And we're not painting ourselves into a corner - if we decide to 
increase the varlena header size in the future, this patch does not make 
it any more complicated.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: custom compression methods

2015-12-16 Thread Tomas Vondra

Hi,

On 12/13/2015 06:28 PM, Alexander Korotkov wrote:
>

The compression method of a column would be stored in the pg_attribute
table. Dependencies between columns and compression methods would be
tracked in pg_depend, preventing a compression method which is currently
in use from being dropped. The compression method of the attribute could
be altered by the ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
compname;


Do you plan to make this available in CREATE TABLE? For example, 
Greenplum allows specifying COMPRESSTYPE/COMPRESSLEVEL per column.
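
IIRC the Greenplum syntax looks roughly like this (for append-only, 
column-oriented tables; I'm quoting from memory, so details may be off):

  CREATE TABLE t (
      a int,
      b text ENCODING (compresstype=zlib, compresslevel=5)
  ) WITH (appendonly=true, orientation=column);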


What about compression levels? Do you plan to allow tweaking them? 
Tracking them would require another column in pg_attribute, probably.



Since mixing different compression methods in the same attribute
would be hard to manage (especially for dependency tracking), altering
an attribute's compression method would require a table rewrite.


I don't think the dependency tracking would be a big issue. The easiest 
thing we could do is simply track which columns used the compression type 
in the past, and scan them when removing it (the compression type).
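
For example, assuming a hypothetical attcompression column in 
pg_attribute pointing at the proposed pg_compress catalog, that scan 
could be as simple as:

  -- purely illustrative; the catalog shape is hypothetical
  SELECT c.relname, a.attname
    FROM pg_attribute a
    JOIN pg_class c ON c.oid = a.attrelid
   WHERE a.attcompression = (SELECT oid FROM pg_compress
                              WHERE compname = 'mymethod');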


I think the main obstacle to making this possible is the lack of free 
space in the varlena header, i.e. the need to add the ID of the 
compression method to the value.


FWIW I'd like to allow this (mixing compression types), but I don't 
think it's worth the complexity at this point. We can add that later, if 
it turns out to be a problem in practice (which it probably won't).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: custom compression methods

2015-12-14 Thread Simon Riggs
On 13 December 2015 at 17:28, Alexander Korotkov wrote:


> it would be nice to make compression methods pluggable.
>

Agreed.

My thinking is that this should be combined with work to make use of the
compressed data, which is why Alvaro, Tomas, and David have been working
on the Col Store API for about 18 months; work on that continues, with
more submissions due for 9.6.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: custom compression methods

2015-12-14 Thread Bill Moran
On Mon, 14 Dec 2015 14:50:57 +0800
Craig Ringer wrote:

> On 14 December 2015 at 01:28, Alexander Korotkov wrote:
> 
> > Hackers,
> >
> > I'd like to propose a new feature: "Custom compression methods".

I missed the initial post on this thread ...

Have you started or do you plan to actually start work on this? I've
already started some preliminary coding with this as one of its
goals, but it's been stalled for about a month due to life
intervening. I plan to come back to it soon, so if you decide to
start actually writing code, please contact me so we don't double
up on the work.

-- 
Bill Moran




Re: [HACKERS] Proposal: custom compression methods

2015-12-14 Thread Andres Freund
On 2015-12-14 14:50:57 +0800, Craig Ringer wrote:
> http://www.postgresql.org/message-id/flat/20130615102028.gk19...@alap2.anarazel.de#20130615102028.gk19...@alap2.anarazel.de

> The issue with per-Datum is that TOAST claims two bits of a varlena header,
> which already limits us to 1 GiB varlena values, something people are
> starting to find to be a problem. There's no wiggle room to steal more
> bits. If you want pluggable compression you need a way to store knowledge
> of how a given datum is compressed with the datum or have a fast, efficient
> way to check.
>
> pg_upgrade means you can't just redefine the current toast bits so the
> compressed bit means "data is compressed, check first byte of varlena data
> for algorithm" because existing data won't have that, the first byte will
> be the start of the compressed data stream.

I don't think there's an actual problem here. My old patch that you
referenced solves this.

Andres




Re: [HACKERS] Proposal: custom compression methods

2015-12-14 Thread Jim Nasby

On 12/14/15 12:50 AM, Craig Ringer wrote:

The issue with per-Datum is that TOAST claims two bits of a varlena
header, which already limits us to 1 GiB varlena values, something
people are starting to find to be a problem. There's no wiggle room to
steal more bits. If you want pluggable compression you need a way to
store knowledge of how a given datum is compressed with the datum or
have a fast, efficient way to check.


... unless we allowed for 8-byte varlena headers. Compression changes 
themselves certainly don't warrant that, but if people are already 
unhappy with the 1GB TOAST limit then maybe that's enough.


The other thing this might buy us is a few bits that could be used to 
support Datum versioning for other purposes, such as when the binary 
format of something changes. I would think that at some point we'll need 
that for pg_upgrade.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




[HACKERS] Proposal: custom compression methods

2015-12-13 Thread Alexander Korotkov
Hackers,

I'd like to propose a new feature: "Custom compression methods".

*Motivation*

Currently, when a datum doesn't fit the page, PostgreSQL tries to compress
it using the PGLZ algorithm. Compression of particular attributes can be
turned on/off by tuning the storage parameter of the column. Also, there is
a heuristic that assumes a datum is not compressible when its first
kilobyte is not compressible. I can see the following reasons for improving
this situation.

 * The heuristic used to detect compressible data may not be optimal.
We have already run into this with jsonb.
 * For some data distributions there could be more effective compression
methods than PGLZ. For example:
   * For natural languages we could use predefined dictionaries which
would allow us to compress even relatively short strings (which are not
long enough for PGLZ to train its dictionary).
   * For jsonb/hstore we could implement compression methods which have a
dictionary of keys. This could be either a static predefined dictionary or
a dynamically extended dictionary with some storage.
   * For jsonb and other container types we could implement compression
methods which would allow extraction of particular fields without
decompressing the full value.

Therefore, it would be nice to make compression methods pluggable.

*Design*

Compression methods would be stored in the pg_compress system catalog table
with the following structure:

compname        name
comptype        oid
compcompfunc    regproc
compdecompfunc  regproc

Compression methods could be created with the "CREATE COMPRESSION METHOD"
command and deleted with the "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
WITH COMPRESS FUNCTION compcompfunc_name
 DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

The signatures of compcompfunc and compdecompfunc would be similar to
pglz_compress and pglz_decompress, except for the compression strategy.
There is only one compression strategy in use for pglz
(PGLZ_strategy_default), so I'm not sure it would be useful to provide
multiple strategies for compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest,
int32 rawsize);

A compression method could be type-agnostic (comptype = 0) or type-specific
(comptype != 0). The default compression method is PGLZ.

The compression method of a column would be stored in the pg_attribute
table. Dependencies between columns and compression methods would be tracked
in pg_depend, preventing a compression method which is currently in use from
being dropped. The compression method of the attribute could be altered by
the ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
compname;

Since mixing different compression methods in the same attribute would be
hard to manage (especially for dependency tracking), altering an attribute's
compression method would require a table rewrite.
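
For example, usage could look like this (the lz4 functions here are just
placeholders):

CREATE COMPRESSION METHOD lz4
WITH COMPRESS FUNCTION lz4_compress
 DECOMPRESS FUNCTION lz4_decompress;

ALTER TABLE documents ALTER COLUMN body SET COMPRESSION METHOD lz4;
-- rewrites the table, recompressing existing values of "body"

DROP COMPRESSION METHOD lz4;
-- would fail while documents.body still depends on it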

*Implementation details*

Catalog changes, new commands, dependency tracking etc. are mostly
mechanical stuff with no fundamental problems. The hardest part seems to be
providing seamless integration of custom compression methods into existing
code.

It doesn't seem hard to add an extra parameter with the compression method
to toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom
decompress function knowing only the datum itself. That means we should
somehow embed knowledge of the compression method into the datum. The
solution could be putting the compression method oid right after the
varlena header. Putting this on disk would cause storage overhead and break
backward compatibility; thus, we can add this oid right after reading the
datum from the page. This could be the weakest point in the whole proposal
and I'll be very glad for better ideas.
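
To illustrate, the in-memory form could look something like this (names are
illustrative only; nothing here is an on-disk format):

typedef struct varatt_custom
{
    int32   vl_len_;     /* varlena header (total size) */
    int32   va_rawsize;  /* original, uncompressed size */
    Oid     va_cmoid;    /* pg_compress row for this value, stamped
                          * right after reading the datum from the page */
    char    va_data[FLEXIBLE_ARRAY_MEMBER];  /* compressed payload */
} varatt_custom;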

P.S. I'd like to thank Petr Korobeinikov, who started work on this patch
and sent me a draft of the proposal in Russian.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal: custom compression methods

2015-12-13 Thread Craig Ringer
On 14 December 2015 at 15:27, Chapman Flack wrote:

> On 12/14/15 01:50, Craig Ringer wrote:
>
> > pg_upgrade means you can't just redefine the current toast bits so the
> > compressed bit means "data is compressed, check first byte of varlena
> data
> > for algorithm" because existing data won't have that, the first byte will
> > be the start of the compressed data stream.
>
> Is there any small sequence of initial bytes you wouldn't ever see in PGLZ
> output?  Either something invalid, or something obviously nonoptimal
> like run(n,'A')||run(n,'A') where PGLZ would have just output run(2n,'A')?
>
>
I don't think we need to worry, since doing it per-column makes this issue
go away. Per-Datum compression would make it easier to switch methods
(requiring no table rewrite) at the cost of more storage for each varlena,
which probably isn't worth it anyway.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Proposal: custom compression methods

2015-12-13 Thread Craig Ringer
On 14 December 2015 at 01:28, Alexander Korotkov wrote:

> Hackers,
>
> I'd like to propose a new feature: "Custom compression methods".
>

Are you aware of the past work in this area? There's quite a bit of history
and I strongly advise you to read the relevant threads to make sure you
don't run into the same problems.

See:

http://www.postgresql.org/message-id/flat/20130615102028.gk19...@alap2.anarazel.de#20130615102028.gk19...@alap2.anarazel.de

for at least one of the prior attempts.


> *Motivation*
>
> Currently, when a datum doesn't fit the page, PostgreSQL tries to compress
> it using the PGLZ algorithm. Compression of particular attributes can be
> turned on/off by tuning the storage parameter of the column. Also, there is
> a heuristic that assumes a datum is not compressible when its first
> kilobyte is not compressible. I can see the following reasons for improving
> this situation.
>

Yeah, recent discussion has made it clear that there's room for improving
how and when TOAST compresses things. Per-attribute compression thresholds
made a lot of sense.

Therefore, it would be nice to make compression methods pluggable.
>

Very important issues to consider here are on-disk format stability, space
overhead, and pg_upgrade-ability. It looks like you have addressed all of
these below by making compression methods per-column rather than per-Datum,
and forcing a full table rewrite to change them.

The issue with per-Datum is that TOAST claims two bits of a varlena header,
which already limits us to 1 GiB varlena values, something people are
starting to find to be a problem. There's no wiggle room to steal more
bits. If you want pluggable compression you need a way to store knowledge
of how a given datum is compressed with the datum or have a fast, efficient
way to check.
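
For context, a simplified picture of the 4-byte varlena header (the
little-endian case, per postgres.h):

  xxxxxxxx xxxxxxxx xxxxxxxx xxxxxx00   uncompressed, 30-bit length
  xxxxxxxx xxxxxxxx xxxxxxxx xxxxxx10   compressed, 30-bit length

Those 30 bits of length are where the 1 GiB cap comes from, and both flag
bits are already spoken for, so there is no spare bit left for a
compression-method ID.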

pg_upgrade means you can't just redefine the current toast bits so the
compressed bit means "data is compressed, check first byte of varlena data
for algorithm" because existing data won't have that, the first byte will
be the start of the compressed data stream.

There's also the issue of what you do when the algorithm used for a datum
is no longer loaded. I don't care so much about that one, I'm happy to say
"you ERROR and tell the user to fix the situation". But I think some people
were concerned about that too, or being stuck with algorithms forever once
they're added.

Looks like you've dealt with all those concerns.


> DROP COMPRESSION METHOD compname;
>
>
When you drop a compression method, what happens to data compressed with
that method?

If you re-create it, can the data be associated with the re-created method?


> Compression method of column would be stored in pg_attribute table.
>

So you can't change it without a full table rewrite, but you also don't
have to poach any TOAST header bits to determine which algorithm is
used. And you can use pg_depend to prevent dropping a compression method
still in use by a table. Makes sense.

Looks promising, but I haven't re-read the old thread in detail to see if
this approach was already considered and rejected.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Proposal: custom compression methods

2015-12-13 Thread Chapman Flack
On 12/14/15 01:50, Craig Ringer wrote:

> pg_upgrade means you can't just redefine the current toast bits so the
> compressed bit means "data is compressed, check first byte of varlena data
> for algorithm" because existing data won't have that, the first byte will
> be the start of the compressed data stream.

Is there any small sequence of initial bytes you wouldn't ever see in PGLZ
output?  Either something invalid, or something obviously nonoptimal
like run(n,'A')||run(n,'A') where PGLZ would have just output run(2n,'A')?

-Chap

