Re: [HACKERS] alternative compression algorithms?

2015-04-30 Thread Robert Haas
On Wed, Apr 29, 2015 at 9:12 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 The whole script (doing a lot of estimates) takes 1:50 with pglz and only
 1:25 with lz4. That's ~25-30% improvement.

Still pretty good

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] alternative compression algorithms?

2015-04-29 Thread Robert Haas
On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 Sure, it's not an ultimate solution, but it might help a bit. I do have
 other ideas about how to optimize this, but in the planner every millisecond
 counts. Looking at 'perf top' and seeing pglz_decompress() in the top 3.

I suggested years ago that we should not compress data in
pg_statistic.  Tom shot that down, but I don't understand why.  It
seems to me that when we know data is extremely frequently accessed,
storing it uncompressed makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] alternative compression algorithms?

2015-04-29 Thread Petr Jelinek

On 30/04/15 00:44, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:

Sure, it's not an ultimate solution, but it might help a bit. I do have
other ideas about how to optimize this, but in the planner every millisecond
counts. Looking at 'perf top' and seeing pglz_decompress() in the top 3.



I suggested years ago that we should not compress data in
pg_statistic.  Tom shot that down, but I don't understand why.  It
seems to me that when we know data is extremely frequently accessed,
storing it uncompressed makes sense.


I've not been following this thread, but I do not think your argument here
holds any water.  pg_statistic entries are generally fetched via the
syscaches, and we fixed things years ago so that toasted tuple entries
are detoasted before insertion in syscache.  So I don't believe that
preventing on-disk compression would make for any significant improvement,
at least not after the first reference within a session.


AFAICS the syscache tuples are detoasted but not decompressed before 
insertion into the syscache (catcache calls toast_flatten_tuple, which doesn't 
do decompression), so pglz_decompress() is still called every time.
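
To make the distinction concrete, here's a minimal sketch (not the actual 
catcache code; the helper names are invented purely for illustration) of the 
difference between merely fetching an external value and fully untoasting it:

/*
 * Illustrative sketch only -- not the real catcache path.  Flattening
 * replaces external TOAST pointers with inline data but leaves any pglz
 * compression in place, so consumers still pay pglz_decompress() on
 * every access; fully untoasting would pay that cost once, up front.
 */
#include "postgres.h"
#include "access/tuptoaster.h"

static struct varlena *
flatten_only(struct varlena *value)
{
    if (VARATT_IS_EXTERNAL(value))
        return heap_tuple_fetch_attr(value);    /* fetched, still compressed */
    return value;
}

static struct varlena *
flatten_and_decompress(struct varlena *value)
{
    /* fetches external data AND decompresses inline-compressed data */
    return heap_tuple_untoast_attr(value);
}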


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] alternative compression algorithms?

2015-04-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 Sure, it's not an ultimate solution, but it might help a bit. I do have
 other ideas about how to optimize this, but in the planner every millisecond
 counts. Looking at 'perf top' and seeing pglz_decompress() in the top 3.

 I suggested years ago that we should not compress data in
 pg_statistic.  Tom shot that down, but I don't understand why.  It
 seems to me that when we know data is extremely frequently accessed,
 storing it uncompressed makes sense.

I've not been following this thread, but I do not think your argument here
holds any water.  pg_statistic entries are generally fetched via the
syscaches, and we fixed things years ago so that toasted tuple entries
are detoasted before insertion in syscache.  So I don't believe that
preventing on-disk compression would make for any significant improvement,
at least not after the first reference within a session.

Also, it's a very long way from "some pg_statistic entries are frequently
accessed" to "all pg_statistic entries are frequently accessed".

regards, tom lane




Re: [HACKERS] alternative compression algorithms?

2015-04-29 Thread Tomas Vondra

Hi,

On 04/29/15 23:54, Robert Haas wrote:

On Mon, Apr 20, 2015 at 9:03 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:

Sure, it's not an ultimate solution, but it might help a bit. I do have
other ideas about how to optimize this, but in the planner every millisecond
counts. Looking at 'perf top' and seeing pglz_decompress() in the top 3.


I suggested years ago that we should not compress data in
pg_statistic.  Tom shot that down, but I don't understand why.  It
seems to me that when we know data is extremely frequently accessed,
storing it uncompressed makes sense.


I'm not convinced that not compressing the data is a good idea - I suspect 
it would only move the time into TOAST and increase memory pressure (both in 
general and in shared buffers). But I think that using a more efficient 
compression algorithm would help a lot.


For example, when profiling the multivariate stats patch (with multiple 
quite large histograms), pglz_decompress() is #1 in the profile, 
occupying more than 30% of the time. After replacing it with lz4, the 
data are a bit larger, but it drops to ~0.25% in the profile and 
planning time drops proportionally.


It's not a silver bullet, but it would help a lot in those cases.
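
For the record, the swap on the decompression side is tiny. Here's a minimal 
sketch, assuming liblz4 is linked in and that the uncompressed size is stored 
next to the compressed bytes (as the pglz format already does); the function 
name is made up and this is not the actual patch:

/*
 * Minimal sketch of calling liblz4 instead of pglz for decompression.
 * Assumes rawsize (the uncompressed length) is known from the stored
 * header, as it is for pglz today.
 */
#include "postgres.h"
#include <lz4.h>

static void
decompress_lz4(const char *src, int32 srclen, char *dst, int32 rawsize)
{
    int     n = LZ4_decompress_safe(src, dst, srclen, rawsize);

    if (n < 0 || n != rawsize)
        elog(ERROR, "corrupt lz4 compressed data");
}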


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] alternative compression algorithms?

2015-04-29 Thread Robert Haas
On Wed, Apr 29, 2015 at 6:55 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 I'm not convinced that not compressing the data is a good idea - I suspect it
 would only move the time into TOAST and increase memory pressure (both in
 general and in shared buffers). But I think that using a more efficient
 compression algorithm would help a lot.

 For example, when profiling the multivariate stats patch (with multiple
 quite large histograms), pglz_decompress() is #1 in the profile, occupying
 more than 30% of the time. After replacing it with lz4, the data are a bit
 larger, but it drops to ~0.25% in the profile and planning time drops
 proportionally.

That seems to imply a 100x improvement in decompression speed.  Really???

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] alternative compression algorithms?

2015-04-29 Thread Tomas Vondra



On 04/30/15 02:42, Robert Haas wrote:

On Wed, Apr 29, 2015 at 6:55 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:

I'm not convinced that not compressing the data is a good idea - I suspect it
would only move the time into TOAST and increase memory pressure (both in
general and in shared buffers). But I think that using a more efficient
compression algorithm would help a lot.

For example, when profiling the multivariate stats patch (with multiple
quite large histograms), pglz_decompress() is #1 in the profile, occupying
more than 30% of the time. After replacing it with lz4, the data are a bit
larger, but it drops to ~0.25% in the profile and planning time drops
proportionally.


That seems to imply a 100x improvement in decompression speed.  Really???


Sorry, that was a bit of a misleading over-statement. The profiles (same 
dataset, same workload) look like this:



pglz_decompress
---
  44.51%  postgres  [.] pglz_decompress
  13.60%  postgres  [.] update_match_bitmap_histogram
   8.40%  postgres  [.] float8_cmp_internal
   7.43%  postgres  [.] float8lt
   6.49%  postgres  [.] deserialize_mv_histogram
   6.23%  postgres  [.] FunctionCall2Coll
   4.06%  postgres  [.] DatumGetFloat8
   3.48%  libc-2.18.so  [.] __isnan
   1.26%  postgres  [.] clauselist_mv_selectivity
   1.09%  libc-2.18.so  [.] __memcpy_sse2_unaligned

lz4
---
  18.05%  postgres  [.] update_match_bitmap_histogram
  11.67%  postgres  [.] float8_cmp_internal
  10.53%  postgres  [.] float8lt
   8.67%  postgres  [.] FunctionCall2Coll
   8.52%  postgres  [.] deserialize_mv_histogram
   5.52%  postgres  [.] DatumGetFloat8
   4.90%  libc-2.18.so  [.] __isnan
   3.92%  liblz4.so.1.6.0   [.] 0x2603
   2.08%  liblz4.so.1.6.0   [.] 0x2847
   1.81%  postgres  [.] clauselist_mv_selectivity
   1.47%  libc-2.18.so  [.] __memcpy_sse2_unaligned
   1.33%  liblz4.so.1.6.0   [.] 0x260f
   1.16%  liblz4.so.1.6.0   [.] 0x25e3
   (and then a long tail of other lz4 calls)

The difference used to be more significant, but I've made a lot of 
improvements to the update_match_bitmap method since then (so the lz4 calls 
now stand out relatively more in the profile).


The whole script (doing a lot of estimates) takes 1:50 with pglz and 
only 1:25 with lz4. That's ~25-30% improvement.


The results are slightly unreliable because they were collected in a Xen VM, 
and the virtualization overhead is non-negligible (but it is the same in both 
cases). I wouldn't be surprised if the difference were more significant 
without the VM.


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] alternative compression algorithms?

2015-04-20 Thread Tomas Vondra



On 04/20/15 05:07, Andres Freund wrote:

Hi,

On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote:

The reason why I'm asking about this is the multivariate statistics patch -
while optimizing the planning overhead, I realized that a considerable amount
of time is spent decompressing the statistics (serialized as bytea), and using
an algorithm with better decompression performance (lz4 comes to mind) would
help a lot. The statistics may be a few tens or hundreds of kB, and in the
planner every millisecond counts.


I've a hard time believing that a different compression algorithm is
 going to be the solution for that. Decompressing, quite possibly
repeatedly, several hundreds of kb during planning isn't going to be
 fun. Even with a good, fast, compression algorithm.


Sure, it's not an ultimate solution, but it might help a bit. I do have 
other ideas about how to optimize this, but in the planner every millisecond 
counts. Looking at 'perf top' and seeing pglz_decompress() in the top 3.


regards
Tomas




Re: [HACKERS] alternative compression algorithms?

2015-04-20 Thread Heikki Linnakangas

On 04/19/2015 11:51 PM, Tomas Vondra wrote:

Hi there,

in the past we've repeatedly discussed the option of using a different
compression algorithm (e.g. lz4), but every time the discussion died off
because of fear of possible patent issues (see [1], [2], and many other
threads). Have we decided it's not worth the risks, making patches in
this area futile?
...
I'm a bit confused though, because I've noticed various other FOSS
projects adopting lz4 over the past few years and I've yet to find a
project voicing the same concerns about patents. So either they're
reckless or we're excessively paranoid.


IMHO we should switch to a different algorithm. Wrt patents, if we 
choose an algorithm that's already used widely in several other open 
source projects, that's safe enough. It's as safe as we're going to get.


There is always the chance that you infringe on a patent when you write 
any code. Compression is a particularly heavily-patented field, but it's 
nevertheless a bit strange to be super-paranoid there, and not worry 
about patents elsewhere. (The safest option would be to pick an 
algorithm that was patented, but the patent has expired. I don't think 
any of the compression algorithms discussed recently are old enough for 
that, though.)


- Heikki





Re: [HACKERS] alternative compression algorithms?

2015-04-19 Thread Michael Paquier
On Mon, Apr 20, 2015 at 5:51 AM, Tomas Vondra wrote:
 I'm a bit confused though, because I've noticed various other FOSS projects
 adopting lz4 over the past few years and I've yet to find a project voicing
 the same concerns about patents. So either they're reckless or we're
 excessively paranoid.

Both are true, but being very careful regarding software patents
usually pays off. Strange things happen in the legal world.

 Also, lz4 is not the only compression algorithm available - I've done a
 bunch of tests with lz4, lz4hc, lzo and snappy, and lzo actually performed
 better than lz4 (not claiming that's a universal truth). But I suppose that
 the patent concerns are not specific to lz4 but apply to compression in
 general.

Some proprietary algorithms may perform even better and faster; who
knows for sure? And even if lzo or lz4 are better for now, we may later
find something better still. In short, I still tend to think that the
right answer would be to have a dedicated set of hooks and let people
play with the compression algorithms they want (that statement stands
for FPW compression as well, but that's separate from the multivariate
statistics patch).
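
Just to illustrate what I mean by a dedicated set of hooks, something along 
these lines (entirely hypothetical; none of the names below exist in core):

/*
 * Hypothetical sketch of a pluggable compression API -- all names are
 * invented for illustration.
 */
#include "postgres.h"

typedef struct CompressionRoutine
{
    const char *name;           /* e.g. "pglz", "lz4" */

    /* returns compressed size, or -1 if the data is incompressible */
    int32       (*compress) (const char *src, int32 srclen,
                             char *dst, int32 dstcap);

    /* returns decompressed size, errors out on corrupt input */
    int32       (*decompress) (const char *src, int32 srclen,
                               char *dst, int32 rawsize);
} CompressionRoutine;

/* an extension would register its implementation from _PG_init() */
extern void RegisterCompressionRoutine(const CompressionRoutine *routine);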
Regards,
-- 
Michael




Re: [HACKERS] alternative compression algorithms?

2015-04-19 Thread Andres Freund
Hi,

On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote:
 The reason why I'm asking about this is the multivariate statistics patch -
 while optimizing the planning overhead, I realized that a considerable amount
 of time is spent decompressing the statistics (serialized as bytea), and using
 an algorithm with better decompression performance (lz4 comes to mind) would
 help a lot. The statistics may be a few tens or hundreds of kB, and in the
 planner every millisecond counts.

I've a hard time believing that a different compression algorithm is
going to be the solution for that. Decompressing, quite possibly
repeatedly, several hundreds of kb during planning isn't going to be
fun. Even with a good, fast, compression algorithm.

But independent of this, it'd be good to make progress on it.

 Would a differentiated approach work? That is, either adding an initdb
 option allowing the user to choose an alternative compression algorithm (and
 thus letting him consider the possible patent issues), or using different
 algorithms for different pieces of data (e.g. keep pglz for the user data,
 and lz4 for statistics).
 
 The first option is quite trivial to implement - I already have an
 experimental patch implementing that (attached, but a bit dirty). The second
 option is probably more difficult (we'd have to teach the tuple toaster about
 multiple compression algorithms and pass that information along somehow).
 Also, I'm not sure it'd make the patent concerns go away ...

Note that I had implemented the second option somewhere down the line
in the thread [2] you refer to.

 [2] 
 http://www.postgresql.org/message-id/20130614230142.gc19...@awork2.anarazel.de

Greetings,

Andres Freund




Re: [HACKERS] alternative compression algorithms?

2015-04-19 Thread Stephen Frost
* Michael Paquier (michael.paqu...@gmail.com) wrote:
 On Mon, Apr 20, 2015 at 5:51 AM, Tomas Vondra wrote:
  I'm a bit confused though, because I've noticed various other FOSS projects
  adopting lz4 over the past few years and I've yet to find a project voicing
  the same concerns about patents. So either they're reckless or we're
  excessively paranoid.
 
 Both are true, but being very careful regarding software patents
 usually pays off. Strange things happen in the legal world.
 
  Also, lz4 is not the only compression algorithm available - I've done a
  bunch of tests with lz4, lz4hc, lzo and snappy, and lzo actually performed
  better than lz4 (not claiming that's a universal truth). But I suppose that
  the patent concerns are not specific to lz4 but apply to compression in
  general.
 
 Some proprietary algorithms may perform even better and faster; who
 knows for sure? And even if lzo or lz4 are better for now, we may later
 find something better still. In short, I still tend to think that the
 right answer would be to have a dedicated set of hooks and let people
 play with the compression algorithms they want (that statement stands
 for FPW compression as well, but that's separate from the multivariate
 statistics patch).

I'm no lawyer, but I understand that one useful defense against patent
infringement lawsuits is to stop infringing on the patent (e.g. stop doing
whatever it is the patent covers).  Therefore, the best approach would be
exactly this: provide a way to swap out the compression algorithm.  One
issue we'll need to consider is how to do that for stored data, as you'd
need to recompress everything.  What that basically means is we'll need to
burn bits to indicate which compression algorithm is used.
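
As a strawman of what burning those bits might look like, one could steal the 
two high bits of the stored raw-size word to tag the method.  All names and 
the bit layout below are invented for illustration, not a proposal for the 
actual on-disk format:

/*
 * Strawman only: tag the compression method in the two high bits of the
 * stored raw-size word, leaving 30 bits for the raw size itself.
 * Assumes PostgreSQL's uint32 typedef (from c.h via postgres.h).
 */
#include "postgres.h"

#define CM_METHOD_PGLZ      0x0
#define CM_METHOD_LZ4       0x1     /* room for two more methods */

#define CM_METHOD_BITS      2
#define CM_RAWSIZE_MASK     ((1U << (32 - CM_METHOD_BITS)) - 1)

#define CM_SET(rawsize, method) \
    (((uint32) (method) << (32 - CM_METHOD_BITS)) | ((uint32) (rawsize)))
#define CM_GET_METHOD(word)     ((uint32) (word) >> (32 - CM_METHOD_BITS))
#define CM_GET_RAWSIZE(word)    ((uint32) (word) & CM_RAWSIZE_MASK)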

For my 2c, I suspect the gains will be worth those extra bits in the
general case, but I'm sure there will be cases where it's a loss.

Thanks,

Stephen

