Re: [HACKERS] Compression and on-disk sorting

2006-06-07 Thread Jim C. Nasby
On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:
 On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:
 
  But the meat is:
                                      -- work_mem --
                              Scale   2000    20000
   not compressed             150     805.7   797.7
   not compressed             3000    17820   17436
   compressed                 150     371.4   400.1
   compressed                 3000    8152    8537
   compressed, no headers     3000    7325    7876
 
 Since Tom has committed the header-removing patch, we need to test
 
   not compressed, no headers v compressed, no headers

                                   -- work_mem --
                           Scale   2000    20000
not compressed             150     805.7   797.7
not compressed             3000    17820   17436
not compressed, no hdr     3000    14470   14507
compressed                 150     371.4   400.1
compressed                 3000    8152    8537
compressed, no headers     3000    7325    7876

 There is a noticeable rise in sort time with increasing work_mem, but
 that needs to be offset against the benefit that in general comes from
 using a large heap for the sort. With the data you're using that always
 looks like a loss, but that isn't true with all input data orderings.

I thought that a change had been made to the on-disk sort specifically to
eliminate the problem of more work_mem making the sort take longer. I also
thought that there was something about that fix that was tunable.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-06-07 Thread Simon Riggs
On Wed, 2006-06-07 at 01:33 -0500, Jim C. Nasby wrote:
 On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:
  On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:
  
   But the meat is:
                                       -- work_mem --
                               Scale   2000    20000
    not compressed             150     805.7   797.7
    not compressed             3000    17820   17436
    compressed                 150     371.4   400.1
    compressed                 3000    8152    8537
    compressed, no headers     3000    7325    7876
  
  Since Tom has committed the header-removing patch, we need to test
  
  not compressed, no headers v compressed, no headers
 
                                    -- work_mem --
                            Scale   2000    20000
 not compressed             150     805.7   797.7
 not compressed             3000    17820   17436
 not compressed, no hdr     3000    14470   14507
 compressed                 150     371.4   400.1
 compressed                 3000    8152    8537
 compressed, no headers     3000    7325    7876

That looks fairly conclusive. Can we try tests with data in reverse
order, so we use more tapes? We're still using a single tape, so the
additional overhead of compression doesn't cause any pain.

  There is a noticeable rise in sort time with increasing work_mem, but
  that needs to be offset against the benefit that in general comes from
  using a large heap for the sort. With the data you're using that always
  looks like a loss, but that isn't true with all input data orderings.
 
 I thought that a change had been made to the on-disk sort specifically to
 eliminate the problem of more work_mem making the sort take longer. 

There was a severe non-optimal piece of code...but the general effect
still exists. As does the effect that having higher work_mem produces
fewer runs which speeds up the final stages of the sort.
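
As a rough illustration of the fewer-runs effect (a sketch, not from the thread; the ~2 GB input size and the classic "initial runs average about twice work_mem" rule of thumb for heap-based run building are assumptions):

def estimated_runs(input_kb, work_mem_kb):
    # assumes the classic ~2 * work_mem average initial run length
    # produced by heap-based (replacement selection) run building
    return max(1, -(-input_kb // (2 * work_mem_kb)))  # ceiling division

for wm in (2000, 20000):
    print("work_mem=%d KB -> about %d initial runs" % (wm, estimated_runs(2000000, wm)))
# for ~2 GB of sort data: work_mem=2000 KB -> ~500 runs, 20000 KB -> ~50 runs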

 I also
 thought that there was something about that fix that was tunable.

Increasing work_mem makes *this* test case take longer. 

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Compression and on-disk sorting

2006-06-07 Thread Jim C. Nasby
On Wed, Jun 07, 2006 at 11:59:50AM +0100, Simon Riggs wrote:
 On Wed, 2006-06-07 at 01:33 -0500, Jim C. Nasby wrote:
  On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:
   On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:
   
But the meat is:
                                    -- work_mem --
                            Scale   2000    20000
 not compressed             150     805.7   797.7
 not compressed             3000    17820   17436
 compressed                 150     371.4   400.1
 compressed                 3000    8152    8537
 compressed, no headers     3000    7325    7876
   
   Since Tom has committed the header-removing patch, we need to test
   
 not compressed, no headers v compressed, no headers
  
                                     -- work_mem --
                             Scale   2000    20000
  not compressed             150     805.7   797.7
  not compressed             3000    17820   17436
  not compressed, no hdr     3000    14470   14507
  compressed                 150     371.4   400.1
  compressed                 3000    8152    8537
  compressed, no headers     3000    7325    7876
 
 That looks fairly conclusive. Can we try tests with data in reverse
 order, so we use more tapes? We're still using a single tape, so the
 additional overhead of compression doesn't cause any pain.

Would simply changing the ORDER BY to DESC suffice for this? FWIW:

bench=# select correlation from pg_stats where tablename='accounts' and 
attname='bid';
 correlation 
-------------
           1
(1 row)

   There is a noticeable rise in sort time with increasing work_mem, but
   that needs to be offset against the benefit that in general comes from
   using a large heap for the sort. With the data you're using that always
   looks like a loss, but that isn't true with all input data orderings.
  
  I thought that a change had been made to the on-disk sort specifically to
  eliminate the problem of more work_mem making the sort take longer. 
 
 There was a severe non-optimal piece of code...but the general effect
 still exists. As does the effect that having higher work_mem produces
 fewer runs which speeds up the final stages of the sort.
 
  I also
  thought that there was something about that fix that was tunable.
 
 Increasing work_mem makes *this* test case take longer. 
 
 -- 
   Simon Riggs 
   EnterpriseDB   http://www.enterprisedb.com
 
 
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-06-07 Thread Simon Riggs
On Wed, 2006-06-07 at 09:35 -0500, Jim C. Nasby wrote:

 Would simply changing the ORDER BY to DESC suffice for this? FWIW:

Try sorting on aid also, both ascending and descending.

We need to try lots of tests, not just one that's chosen to show the
patch in the best light. I want this, but we need to check.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Compression and on-disk sorting

2006-06-07 Thread Jim C. Nasby
On Wed, Jun 07, 2006 at 04:11:57PM +0100, Simon Riggs wrote:
 On Wed, 2006-06-07 at 09:35 -0500, Jim C. Nasby wrote:
 
  Would simply changing the ORDER BY to DESC suffice for this? FWIW:
 
 Try sorting on aid also, both ascending and descending.
 
 We need to try lots of tests, not just one that's chosen to show the
 patch in the best light. I want this, but we need to check.

Well, correlation on everything in that table is 1. At this point maybe
it makes more sense to just come up with a different test case, possibly
generate_series and random. Better yet would be if someone came up with
a patch that actually populated the filler field in pgbench. Better
still would be allowing the user to define how large they wanted the
filler field to be...
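
For what it's worth, here is a minimal sketch of the kind of thing being asked for, done outside pgbench (a hypothetical helper, not an actual pgbench patch; the dictionary path and the four-words-per-row choice are assumptions). It writes COPY-format rows for an accounts-style table with the filler column actually populated:

import random

WORDS = [w.strip() for w in open('/usr/share/dict/words')]  # assumed word list

def write_accounts(path, scale):
    # pgbench convention: 100000 accounts per unit of scale, one branch per 100000 accounts
    with open(path, 'w') as f:
        for aid in range(1, scale * 100000 + 1):
            bid = (aid - 1) // 100000 + 1
            filler = ' '.join(random.choice(WORDS) for _ in range(4))[:84]
            f.write('%d\t%d\t0\t%s\n' % (aid, bid, filler))

write_accounts('accounts.copy', scale=1)
# then load it with something like:
#   psql bench -c "TRUNCATE accounts" -c "\\copy accounts from accounts.copy"
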
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-26 Thread Jim C. Nasby
I've done some more testing with Tom's recently committed changes to
tuplesort.c, which remove the tupleheaders from the sort data. It does
about 10% better than compression alone does. What's interesting is that
the gains are about 10% regardless of compression, which means
compression isn't helping very much on all the redundant header data,
which I find very odd. And the header data is very redundant:

bench=# select xmin,xmax,cmin,cmax,aid from accounts order by aid limit 1;
  xmin  | xmax | cmin | cmax | aid
--------+------+------+------+-----
 280779 |    0 |    0 |    0 |   1
(1 row)

bench=# select xmin,xmax,cmin,cmax,aid from accounts order by aid desc limit 1;
  xmin  | xmax | cmin | cmax |    aid
--------+------+------+------+-----------
 310778 |    0 |    0 |    0 | 300000000
(1 row)

Makes sense, since pgbench loads the database via a string of COPY commands,
each of which loads 10000 rows.
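
A quick cross-check of that batch size from the numbers above (a sketch; it assumes the scale-3000 database, i.e. 300 million accounts rows, and that each COPY ran in its own transaction with little else consuming xids):

# rows-per-COPY implied by the xmin values in the two queries above
first_xmin, last_xmin = 280779, 310778
rows = 3000 * 100000                    # pgbench scale 3000
copies = last_xmin - first_xmin + 1     # ~30000 transactions
print(rows / copies)                    # ~10000 rows per COPY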

Something else worth mentioning is that sort performance is worse with
larger work_mem for all cases except the old HEAD, prior to the
tuplesort.c changes. It looks like whatever was done to fix that will
need to be adjusted/rethought pending the outcome of using compression.

In any case, compression certainly seems to be a clear win, at least in
this case. If there's interest, I can test this on some larger hardware,
or if someone wants to produce a patch for pgbench that will load some
kind of real data into accounts.filler, I can test that as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-26 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 Something else worth mentioning is that sort performance is worse with
 larger work_mem for all cases except the old HEAD, prior to the
 tuplesort.c changes. It looks like whatever was done to fix that will
 need to be adjusted/rethought pending the outcome of using compression.

Please clarify.  What are you comparing here exactly, and what cases did
you test?

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-26 Thread Jim C. Nasby
On Fri, May 26, 2006 at 12:35:36PM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  Something else worth mentioning is that sort performance is worse with
  larger work_mem for all cases except the old HEAD, prior to the
  tuplesort.c changes. It looks like whatever was done to fix that will
  need to be adjusted/rethought pending the outcome of using compression.
 
 Please clarify.  What are you comparing here exactly, and what cases did
 you test?

Sorry, forgot to put the url in:
http://jim.nasby.net/misc/pgsqlcompression/compress_sort.txt

But the meat is:
                                   -- work_mem --
                           Scale   2000    20000
not compressed             150     805.7   797.7
not compressed             3000    17820   17436
compressed                 150     371.4   400.1
compressed                 3000    8152    8537
compressed, no headers     3000    7325    7876

Performance degrades with more work_mem any time compression is used. I
thought I had data on just your tuplesort.c change without compression,
but I guess I don't. :( I can run that tonight if desired.

As for the code, the 3 things I've tested are HEAD as of 5/17/06 with no
patches (labeled 'not compressed'); that code with the compression patch
(compressed), and that code with both the compression patch and your change to
tuplesort.c that removes tuple headers from the sorted data (compressed, no
headers).
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-26 Thread Simon Riggs
On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:

 But the meat is:
                                    -- work_mem --
                            Scale   2000    20000
 not compressed             150     805.7   797.7
 not compressed             3000    17820   17436
 compressed                 150     371.4   400.1
 compressed                 3000    8152    8537
 compressed, no headers     3000    7325    7876

Since Tom has committed the header-removing patch, we need to test

not compressed, no headers v compressed, no headers

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset against the benefit that in general comes from
using a large heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Compression and on-disk sorting

2006-05-26 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 There is a noticeable rise in sort time with increasing work_mem, but
 that needs to be offset against the benefit that in general comes from
 using a large heap for the sort. With the data you're using that always
 looks like a loss, but that isn't true with all input data orderings.

Yeah, these are all the exact same test data, right?  We need a bit more
variety in the test cases before drawing any sweeping conclusions.

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-26 Thread Jim C. Nasby
On Fri, May 26, 2006 at 04:41:51PM -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  There is a noticeable rise in sort time with increasing work_mem, but
  that needs to be offset against the benefit that in general comes from
  using a large heap for the sort. With the data you're using that always
  looks like a loss, but that isn't true with all input data orderings.
 
 Yeah, these are all the exact same test data, right?  We need a bit more
 variety in the test cases before drawing any sweeping conclusions.

All testing is select count(*) from (select * from accounts order by
bid) a; hitting a pgbench database, since that's something anyone can
(presumably) reproduce. Suggestions for other datasets welcome.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-24 Thread Joshua D. Drake

Jim C. Nasby wrote:

Finally completed testing of a dataset that doesn't fit in memory with
compression enabled. Results are at
http://jim.nasby.net/misc/pgsqlcompression .

Summary:
                work_mem   compressed   not compressed   gain
in-memory       20000      400.1        797.7            49.8%
in-memory       2000       371.4        805.7            53.9%
not in-memory   20000      8537         17436            51.0%
not in-memory   2000       8152         17820            54.3%

I find it very interesting that the gains are identical even when the
tapes should fit in memory. My guess is that for some reason the OS is
flushing those to disk anyway. In fact, watching gstat during a run, I
do see write activity hitting the drives. So if there was some way to
tune that behavior, the in-memory case would probably be much, much
faster. Anyone know FreeBSD well enough to suggest how to change this?
Anyone want to test on linux and see if the results are the same? This
could indicate that it might be advantageous to attempt an in-memory
sort with compressed data before spilling that compressed data to
disk...



I can test it on linux just let me know what you need.

J



As for CPU utilization, it was ~33% with compression and ~13% without.
That tells me that CPU could become a factor if everything was truly in
memory (including the table we were reading from), but if that's the
case there's a good chance that we wouldn't even be switching to an
on-disk sort. If everything isn't in memory then you're likely to be I/O
bound anyway...



--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/





Re: [HACKERS] Compression and on-disk sorting

2006-05-24 Thread Jim C. Nasby
On Wed, May 24, 2006 at 02:20:43PM -0700, Joshua D. Drake wrote:
 Jim C. Nasby wrote:
 Finally completed testing of a dataset that doesn't fit in memory with
 compression enabled. Results are at
 http://jim.nasby.net/misc/pgsqlcompression .
 
 Summary:
                 work_mem   compressed   not compressed   gain
 in-memory       20000      400.1        797.7            49.8%
 in-memory       2000       371.4        805.7            53.9%
 not in-memory   20000      8537         17436            51.0%
 not in-memory   2000       8152         17820            54.3%
 
 I find it very interesting that the gains are identical even when the
 tapes should fit in memory. My guess is that for some reason the OS is
 flushing those to disk anyway. In fact, watching gstat during a run, I
 do see write activity hitting the drives. So if there was some way to
 tune that behavior, the in-memory case would probably be much, much
 faster. Anyone know FreeBSD well enough to suggest how to change this?
 Anyone want to test on linux and see if the results are the same? This
 could indicate that it might be advantageous to attempt an in-memory
 sort with compressed data before spilling that compressed data to
 disk...
 
 
 I can test it on linux just let me know what you need.

Actually, after talking to Larry he mentioned that it'd be worth
checking to see if we're doing something like opening the files in
O_DIRECT, which I haven't had a chance to do. Might be worth looking at
that before running more tests.

Anyway, I've posted the patch now as well, and compress_sort.txt has the
commands I was running. Those are just against a plain pgbench database
that's been freshly initialized (ie: no dead tuples). I just created two
install directories from a checkout of HEAD via --prefix=, one with the
patch and one without. Both hit the same $PGDATA. I've posted the
postgresql.conf as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-21 Thread Martijn van Oosterhout
On Fri, May 19, 2006 at 01:39:45PM -0500, Jim C. Nasby wrote:
  Do you have any stats on CPU usage? Memory usage?
 
 I've only been taking a look at vmstat from time-to-time, and I have yet
 to see the machine get CPU-bound. Haven't really paid much attention to
 memory. Is there anything in particular you're looking for? I can log
 vmstat for the next set of runs (with a scaling factor of 1). I plan
 on doing those runs tonight...

I've got some more info on zlib's memory usage:

Compression: 5816 bytes + 256KB buffer = approx 261KB
Decompression: 9512 bytes + 32KB buffer = approx 42KB

As Tom said, you only run one compression at a time, but logtape doesn't
know that. It can only free the compression structures on Rewind or
Freeze, neither of which are run until the merge pass. I don't
understand the algorithm well enough to know if it's safe to rewind the
old tape in selectnewtape. That would seem to defeat the "freeze if only
one tape" optimisation.

One final thing, with trace_sort=on on my machine I get this with
compression:

LOG:  performsort done (except 28-way final merge): CPU 1.48s/7.49u sec elapsed 10.24 sec
LOG:  external sort ended, 163 disk blocks used: CPU 1.48s/7.49u sec elapsed 10.30 sec

and without compression:

LOG:  performsort done (except 28-way final merge): CPU 2.85s/1.90u sec elapsed 14.76 sec
LOG:  external sort ended, 18786 disk blocks used: CPU 2.88s/1.90u sec elapsed 15.70 sec

This indicates an awful lot of I/O waiting, some 60% of the time
without compression. The compression has cut the I/O wait from 10sec to
1.5sec at the expense of 5.5sec of compression time. If you had a
faster compression algorithm (zlib is not that fast) the results would
be even better...
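
The arithmetic behind that estimate, as a small sketch: time not accounted for by CPU is taken as time spent waiting on I/O.

# rough I/O wait = elapsed - (system CPU + user CPU), from the trace_sort lines above
def io_wait(sys_cpu, user_cpu, elapsed):
    return elapsed - (sys_cpu + user_cpu)

print(io_wait(1.48, 7.49, 10.30))   # compressed:    ~1.3 sec waiting on I/O
print(io_wait(2.88, 1.90, 15.70))   # uncompressed: ~10.9 sec waiting on I/O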

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Luke Lonergan
Jim, 

 http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
 I've run into a slight problem in that even at a compression 
 level of -3, zlib is cutting the on-disk size of sorts by 
 25x. So my pgbench sort test with scale=150 that was 
 producing a 2G on-disk sort is now producing a 80M sort, 
 which obviously fits in memory. And cuts sort times by more than half.

When you're ready, we can test this on some other interesting cases and
on fast hardware.

BTW - external sorting is *still* 4x slower than popular commercial DBMS
(PCDB) on real workload when full rows are used in queries.  The final
results we had after the last bit of sort improvements were limited to
cases where only the sort column was used in the query, and for that
case the improved external sort code was as fast as PCDB, provided lots
of work_mem is used; but when the whole contents of the row are
consumed (as with TPC-H and in many real world cases) the performance is
still far slower.

So, compression of the tuples may be just what we're looking for.

- Luke




Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
 unbelievable. What's in the table?

Yeah, I'd tend to question the test data being used.  gzip does not do
that well on typical text (especially not at the lower settings we'd
likely want to use).

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Martijn van Oosterhout
On Fri, May 19, 2006 at 09:03:31AM -0400, Tom Lane wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
  I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
  unbelievable. What's in the table?
 
 Yeah, I'd tend to question the test data being used.  gzip does not do
 that well on typical text (especially not at the lower settings we'd
 likely want to use).

However, postgres tables are very highly compressible, 10-to-1 is not
that uncommon. pg_proc and pg_index compress by that for example.
Indexes compress even more (a few on my system compress 25-to-1 but
that could just be slack space, the record being 37-to-1
(pg_constraint_conname_nsp_index)).

The only table on my test system over 32KB that doesn't reach 2-to-1
compression with gzip -3 is one of the toast tables.

So getting 25-to-1 is a lot, but possibly not that extreme.
pg_statistic, which is about as close to random data as you're going to
get on a postgres system, compresses 5-to-1.
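
A minimal sketch of this kind of measurement (it assumes you point it at a relation file under $PGDATA, and that zlib level 3 is a reasonable stand-in for gzip -3):

import sys, zlib

def ratio(path, level=3):
    # compression ratio of a whole file at roughly gzip -3 strength
    data = open(path, 'rb').read()
    return len(data) / float(len(zlib.compress(data, level)))

if __name__ == '__main__':
    print('%.1f-to-1' % ratio(sys.argv[1]))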

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 However, postgres tables are very highly compressible, 10-to-1 is not
 that uncommon. pg_proc and pg_index compress by that for example.
 Indexes compress even more (a few on my system compress 25-to-1 but
 that could just be slack space, the record being 37-to-1
 (pg_constraint_conname_nsp_index)).

Anything containing a column of type "name" will compress amazingly well
because of all the padding spaces.  I don't think that's representative
of user data though ... except maybe for the occasional novice using
char(255) ...

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Jim C. Nasby
On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
 On Thu, May 18, 2006 at 10:02:44PM -0500, Jim C. Nasby wrote:
  http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
  I've run into a slight problem in that even at a compression level of
  -3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
  test with scale=150 that was producing a 2G on-disk sort is now
  producing a 80M sort, which obviously fits in memory. And cuts sort
  times by more than half.
 
 I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
 unbelievable. What's in the table? It would seem to imply that our
 tuple format is far more compressible than we expected.

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

If needs be, I could try the patch against http://stats.distributed.net,
assuming that it would apply to REL_8_1.

 Do you have any stats on CPU usage? Memory usage?

I've only been taking a look at vmstat from time-to-time, and I have yet
to see the machine get CPU-bound. Haven't really paid much attention to
memory. Is there anything in particular you're looking for? I can log
vmstat for the next set of runs (with a scaling factor of 1). I plan
on doing those runs tonight...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
 I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
 unbelievable. What's in the table? It would seem to imply that our
 tuple format is far more compressible than we expected.

 It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
 If the tape routines were actually storing visibility information, I'd
 expect that to be pretty compressible in this case since all the tuples
 were presumably created in a single transaction by pgbench.

It's worse than that: IIRC what passes through a heaptuple sort are
tuples manufactured by heap_form_tuple, which will have consistently
zeroed header fields.  However, the above isn't very helpful since the
rest of us have no idea what that accounts table contains.  How wide
is the tuple data, and what's in it?

(This suggests that we might try harder to strip unnecessary header info
from tuples being written to tape inside tuplesort.c.  I think most of
the required fields could be reconstructed given the TupleDesc.)

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Hannu Krosing
On one fine day, Fri 2006-05-19 at 14:53, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
  I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
   unbelievable. What's in the table? It would seem to imply that our
   tuple format is far more compressible than we expected.
 
  It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
  If the tape routines were actually storing visibility information, I'd
  expect that to be pretty compressible in this case since all the tuples
  were presumably created in a single transaction by pgbench.
 
 It's worse than that: IIRC what passes through a heaptuple sort are
 tuples manufactured by heap_form_tuple, which will have consistently
 zeroed header fields.  However, the above isn't very helpful since the
 rest of us have no idea what that accounts table contains.  How wide
 is the tuple data, and what's in it?

Was he not using pg_bench data ?

 (This suggests that we might try harder to strip unnecessary header info
 from tuples being written to tape inside tuplesort.c.  I think most of
 the required fields could be reconstructed given the TupleDesc.)

I guess that tapefiles compress better than the average table because they
are sorted, and thus at least a little more repetitive than the rest.
If there are varlen types, then they usually also have an abundance of
small 4-byte integers, which should also compress at least better than
4/1, maybe a lot better.
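
A toy illustration of the sorted-input effect (a sketch only, not a claim about the on-disk tuple format): the same small integers compress noticeably better once sorted, because sorting creates runs of identical and near-identical values.

import random, struct, zlib

values = [random.randint(0, 10000) for _ in range(100000)]

def packed(vs):
    # pack the integers as 4-byte little-endian values
    return b''.join(struct.pack('<i', v) for v in vs)

unsorted_bytes = packed(values)
sorted_bytes = packed(sorted(values))
print(len(unsorted_bytes) / float(len(zlib.compress(unsorted_bytes, 3))))  # e.g. around 2x
print(len(sorted_bytes) / float(len(zlib.compress(sorted_bytes, 3))))      # noticeably higher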


-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Martijn van Oosterhout
On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
   It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
   If the tape routines were actually storing visibility information, I'd
   expect that to be pretty compressible in this case since all the tuples
   were presumably created in a single transaction by pgbench.
 
 Was he not using pg_bench data ?

Hmm, so there were only 3 integer fields and one varlena structure which
was always empty. Prepend a tuple header with mostly blank or at least
repeated fields, and yes, I can see how we might get a 25-to-1
compression.

Maybe we need to change pgbench so that it puts random text in the
filler field, that would at least put some strain on the compression
algorithm...

 I guess that tapefiles compress better than the average table because they
 are sorted, and thus at least a little more repetitive than the rest.
 If there are varlen types, then they usually also have an abundance of
 small 4-byte integers, which should also compress at least better than
 4/1, maybe a lot better.

Hmm, that makes sense. That also explains the 37-to-1 compression I was
seeing on indexes :).

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Jim C. Nasby
On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
 On one fine day, Fri 2006-05-19 at 14:53, Tom Lane wrote:
  Jim C. Nasby [EMAIL PROTECTED] writes:
   On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
   I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
    unbelievable. What's in the table? It would seem to imply that our
    tuple format is far more compressible than we expected.
  
   It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
   If the tape routines were actually storing visibility information, I'd
   expect that to be pretty compressible in this case since all the tuples
   were presumably created in a single transaction by pgbench.
  
  It's worse than that: IIRC what passes through a heaptuple sort are
  tuples manufactured by heap_form_tuple, which will have consistently
  zeroed header fields.  However, the above isn't very helpful since the
  rest of us have no idea what that accounts table contains.  How wide
  is the tuple data, and what's in it?
 
 Was he not using pg_bench data ?

I am. For reference:

bench=# \d accounts 
      Table "public.accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       | 
 abalance | integer       | 
 filler   | character(84) | 


  (This suggests that we might try harder to strip unnecessary header info
  from tuples being written to tape inside tuplesort.c.  I think most of
  the required fields could be reconstructed given the TupleDesc.)
 
  I guess that tapefiles compress better than the average table because they
  are sorted, and thus at least a little more repetitive than the rest.
  If there are varlen types, then they usually also have an abundance of
  small 4-byte integers, which should also compress at least better than
  4/1, maybe a lot better.

If someone wants to provide a patch that strips out the headers I can test that
as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Jim C. Nasby
On Fri, May 19, 2006 at 09:29:44PM +0200, Martijn van Oosterhout wrote:
 On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.
  
  Was he not using pg_bench data ?
 
 Hmm, so there were only 3 integer fields and one varlena structure which
 was always empty. Prepend a tuple header with mostly blank or at least
 repeated fields, and yes, I can see how we might get a 25-to-1
 compression.
 
 Maybe we need to change pgbench so that it puts random text in the
 filler field, that would at least put some strain on the compression
 algorithm...

Wow, I thought there was actually something in there...

True random data wouldn't be such a great test either; what would
probably be best is a set of random words, since in real life you're
unlikely to have truly random data.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 True random data wouldn't be such a great test either; what would
 probably be best is a set of random words, since in real life you're
 unlikely to have truly random data.

True random data would provide worst-case compression behavior, so
we'd want to try that to find out what the downside is; but we shouldn't
consider it to be the design center.

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-19 Thread Hannu Krosing
On one fine day, Fri 2006-05-19 at 14:57, Jim C. Nasby wrote:
 On Fri, May 19, 2006 at 09:29:44PM +0200, Martijn van Oosterhout wrote:
  On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
 It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) 
 a;
 If the tape routines were actually storing visibility information, I'd
 expect that to be pretty compressible in this case since all the 
 tuples
 were presumably created in a single transaction by pgbench.
   
   Was he not using pg_bench data ?
  
   Hmm, so there were only 3 integer fields and one varlena structure which
   was always empty. Prepend a tuple header with mostly blank or at least
   repeated fields, and yes, I can see how we might get a 25-to-1
   compression.
  
  Maybe we need to change pgbench so that it puts random text in the
  filler field, that would at least put some strain on the compression
  algorithm...
 
 Wow, I thought there was actually something in there...
 
 True random data wouldn't be such a great test either; what would
 probably be best is a set of random words, since in real life you're
  unlikely to have truly random data.

I usually use something like the following for my random name tests:

#!/usr/bin/python

import random

words = [line.strip() for line in open('/usr/share/dict/words')]

def make_random_name(min_items, max_items):
    l = []
    for w in range(random.randint(min_items, max_items)):
        l.append(random.choice(words))
    return ' '.join(l)

it gives out somewhat justifiable but still quite amusing results:

>>> make_random_name(2,4)
'encroaches Twedy'
>>> make_random_name(2,4)
'annuloida Maiah commends imputatively'
>>> make_random_name(2,4)
'terebral wine-driven pacota'
>>> make_random_name(2,4)
'ballads disenfranchise cabriolets spiny-fruited'


-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Zeugswetter Andreas DCP SD

 1) Use n sort areas for n tapes making everything purely sequential
access.

Some time ago testing I did has shown, that iff the IO block size is
large enough
(256k) it does not really matter that much if the blocks are at random
locations.
I think that is still true for current model disks.

So unless we parallelize, it is imho sufficient to see to it that we
write
(and read) large enough blocks with single calls. This also has no
problem in 
highly concurrent scenarios, where you do not have enough spindles.
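
A back-of-the-envelope model of that claim (a sketch with assumed drive characteristics rather than measurements; the 8 ms seek and 60 MB/s sequential transfer figures are hypothetical):

# effective throughput for randomly placed block reads of a given size
def effective_mb_per_s(block_kb, seek_ms=8.0, seq_mb_s=60.0):
    transfer_ms = block_kb / 1024.0 / seq_mb_s * 1000.0
    return (block_kb / 1024.0) / ((seek_ms + transfer_ms) / 1000.0)

for kb in (8, 64, 256, 1024):
    print(kb, round(effective_mb_per_s(kb), 1))
# roughly: 8 KB -> ~1 MB/s, 64 KB -> ~7 MB/s, 256 KB -> ~21 MB/s, 1 MB -> ~41 MB/s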

Andreas



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Simon Riggs
On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:
 On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
  In any case, my curiosity is aroused, so I'm currently benchmarking
  pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
  some benchmarks with pg_tmp compressed.
  
 Results: http://jim.nasby.net/bench.log
 
 As expected, compressing $PGDATA/base was a loss. But compressing
 pgsql_tmp and then doing some disk-based sorts did show an improvement,
 from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
 a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
 a worst-case scenario. On the other hand, XP's compression algorithm
 appears to be pretty aggressive, as it cut the size of the on-disk sort
 file from almost 700MB to 82MB. There's probably gains to be had from a
 different compression algorithm.

Can you re-run these tests using SELECT aid FROM accounts ...
SELECT 1 ...  is of course highly compressible.

I also note that the compressed file fits within memory and may not even
have been written out at all. That's good, but this sounds like the very
best case of what we can hope to achieve. We need to test a whole range
of cases to see if it is generally applicable, or only in certain cases
- and if so which ones.

Would you be able to write up some extensive testing of Martijn's patch?
He's followed through on your idea and written it (on -patches now...)

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Bruce Momjian

Uh, TODO already has:

o %Add a GUC variable to control the tablespace for temporary objects
  and sort files

  It could start with a random tablespace from a supplied list and
  cycle through the list.

Do we need to add to this?

---

Greg Stark wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
 
  Which means we need all the interface bits to be able to tell PostgreSQL
  where every single temp storage area is. Presumably much of the
  tablespace mechanism could be used for this, but it's still a bunch of
  work. And you can't just say I have 8 spindles, you have to tell
  PostgreSQL exactly where to put each temporary area (unless you just
  have it put one on every tablespace you have defined).
 
 Yes, if you have more than one temporary area you definitely need a way to
 tell Postgres where to put them since obviously they won't be in the postgres
 base directory. But I think that's all Postgres really needs to know.
 
 One could imagine a more complex version where Postgres has meta information
 about the bandwidth and seek penalty for each sort area separately. Presumably
 also for each table space. But that's a whole lot more complexity than
 Postgres's current cost model.
 
 
   that it should strive to maximize sequential reads within one temp area 
   and
   expect switching between temp areas (which represent multiple spindles) 
   to be
   better than multiplexing multiple tapes within a single temp area (which
   represents a single spindle).
  
  Which adds yet more complexity to all the code that uses the temp area.
  And as others have brought up, you still have to allow for the case when
  splitting all of this out into multiple files means you end up using
  substantially more disk space. That further drives up the complexity.
 
 You also have to consider that it won't always be a benefit to spread the sort
 over multiple sort areas. If there's only one sort going on and you can reach
 a 1:1 ratio between tapes and spindles then I think it would be a huge
 benefit. Effectively boosting the sort speed by random_page_cost.
 
 But if you don't have as many spindles as your algorithm needs tapes
 then it's unclear which to multiplex down and whether you gain any benefit
 once you're multiplexing over simply using a single sort area.
 
 And worse, if there are multiple sorts going on in the system then you're not
 going to get sequential access even if you have multiple sort areas available.
 If you have N sort areas and N sorts are going on then you're probably better
 off multiplexing each one down to a single sort area and letting them each
 proceed without interfering with each other rather than having each one hog
 all the sort areas and forcing the OS to do the multiplexing blindly.
 
  My point is that unless someone shows that there's a non-trivial
  performance gain here, it's not going to happen.
 
 I think two extreme cases are well worth pursuing: 
 
 1) Use n sort areas for n tapes making everything purely sequential access.
That would be useful for large DSS systems where large sorts are running
and i/o bandwidth is high for sequential access. That gives effectively a
random_page_cost speed boost.
 
 2) Use the current algorithm unchanged but have each sort use a different sort
area in some sort of round-robin fashion. That helps the OLTP type
environment (ignoring for the moment that OLTP environments really ought
not be doing disk sorts) where people complain about unreliable execution
times more than slow execution times. If you can provide enough sort areas
 then it would remove one big reason other queries concurrently impact the
execution time of queries.
 
 -- 
 greg
 
 
 

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Jim C. Nasby
On Thu, May 18, 2006 at 11:34:51AM +0100, Simon Riggs wrote:
 On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:
  On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
    In any case, my curiosity is aroused, so I'm currently benchmarking
   pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
   some benchmarks with pg_tmp compressed.
   
  Results: http://jim.nasby.net/bench.log
  
  As expected, compressing $PGDATA/base was a loss. But compressing
  pgsql_tmp and then doing some disk-based sorts did show an improvement,
  from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
  a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
  a worst-case scenario. On the other hand, XP's compression algorithm
  appears to be pretty aggressive, as it cut the size of the on-disk sort
  file from almost 700MB to 82MB. There's probably gains to be had from a
  different compression algorithm.
 
 Can you re-run these tests using SELECT aid FROM accounts ...
 SELECT 1 ...  is of course highly compressible.
 
 I also note that the compressed file fits within memory and may not even
 have been written out at all. That's good, but this sounds like the very
 best case of what we can hope to achieve. We need to test a whole range
 of cases to see if it is generally applicable, or only in certain cases
 - and if so which ones.
 
 Would you be able to write up some extensive testing of Martijn's patch?
 He's followed through on your idea and written it (on -patches now...)

Yes, I'm working on that. I'd rather test his stuff than XP's
compression anyway.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Jim C. Nasby
On Thu, May 18, 2006 at 10:57:16AM +0200, Zeugswetter Andreas DCP SD wrote:
 
  1) Use n sort areas for n tapes making everything purely sequential
 access.
 
 Some time ago testing I did has shown, that iff the IO block size is
 large enough
 (256k) it does not really matter that much if the blocks are at random
 locations.
 I think that is still true for current model disks.
 
 So unless we parallelize, it is imho sufficient to see to it that we
 write
 (and read) large enough blocks with single calls. This also has no
 problem in 
 highly concurrent scenarios, where you do not have enough spindles.

AFAIK logtape currently reads in much less than 256k blocks. Of course
if you get lucky you'll read from one tape for some time before
switching to another, which should have a sort-of similar effect if the
drives aren't very busy with other things.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Uh, TODO already has:

 o %Add a GUC variable to control the tablespace for temporary objects
   and sort files

   It could start with a random tablespace from a supplied list and
   cycle through the list.

 Do we need to add to this?

The list bit strikes me as lily-gilding, or at least designing features
well beyond proven need.

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  Uh, TODO already has:
 
  o %Add a GUC variable to control the tablespace for temporary 
  objects
and sort files
 
It could start with a random tablespace from a supplied list and
cycle through the list.
 
  Do we need to add to this?
 
 The list bit strikes me as lily-gilding, or at least designing features
 well beyond proven need.

I have seen other databases do the round-robin idea, so it is worth
testing.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Martijn van Oosterhout
On Thu, May 18, 2006 at 11:22:46AM -0500, Jim C. Nasby wrote:
 AFAIK logtape currently reads in much less than 256k blocks. Of course
 if you get lucky you'll read from one tape for some time before
 switching to another, which should have a sort-of similar effect if the
 drives aren't very busy with other things.

Logtape works with BLCKSZ blocks. Whether it's advisable or not, I
don't know. One thing I'm noticing with this compression-in-logtape is
that the memory cost per tape increases considerably. Currently we rely
heavily on the OS to cache for us.

Lets say we buffer 256KB per tape, and we assume a large sort with many
tapes, you're going to blow all your work_mem on buffers. If using
compression uses more memory so that you can't have as many tapes and
thus need multiple passes, well, we need to test if this is still a
win.
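
To put rough numbers on that, a sketch reusing the zlib figures quoted earlier in this thread (~42 KB per decompressor and ~261 KB for the one active compressor); the tape counts and 256 KB buffer size are just examples:

def merge_overhead_kb(tapes, buffer_kb=256, decomp_kb=42, comp_kb=261):
    # per-tape read buffer + per-tape decompression state + one active compressor
    return tapes * (buffer_kb + decomp_kb) + comp_kb

for tapes in (7, 28, 100):
    print(tapes, merge_overhead_kb(tapes), 'KB')
# a 28-way merge already needs roughly 8.6 MB of buffers before any sort memory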

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Jim C. Nasby
On Thu, May 18, 2006 at 08:32:10PM +0200, Martijn van Oosterhout wrote:
 On Thu, May 18, 2006 at 11:22:46AM -0500, Jim C. Nasby wrote:
  AFAIK logtape currently reads in much less than 256k blocks. Of course
  if you get lucky you'll read from one tape for some time before
  switching to another, which should have a sort-of similar effect if the
  drives aren't very busy with other things.
 
 Logtape works with BLCKSZ blocks. Whether it's advisable or not, I
 don't know. One thing I'm noticing with this compression-in-logtape is
 that the memory cost per tape increases considerably. Currently we rely
 heavily on the OS to cache for us.
 
 Lets say we buffer 256KB per tape, and we assume a large sort with many
 tapes, you're going to blow all your work_mem on buffers. If using
 compression uses more memory so that you can't have as many tapes and
 thus need multiple passes, well, we need to test if this is still a
 win.

So you're sticking with 8K blocks on disk, after compression? And then
decompressing blocks as they come in?

Actually, I guess the amount of memory used for zlib's lookback buffer
(or whatever they call it) could be pretty substantial, and I'm not sure
if there would be a way to combine that across all tapes.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 Actually, I guess the amount of memory used for zlib's lookback buffer
 (or whatever they call it) could be pretty substantial, and I'm not sure
 if there would be a way to combine that across all tapes.

But there's only one active write tape at a time.  My recollection of
zlib is that compression is memory-hungry but decompression not so much,
so it seems like this shouldn't be a huge deal.

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-18 Thread Jim C. Nasby
On Thu, May 18, 2006 at 04:55:17PM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  Actually, I guess the amount of memory used for zlib's lookback buffer
  (or whatever they call it) could be pretty substantial, and I'm not sure
  if there would be a way to combine that across all tapes.
 
 But there's only one active write tape at a time.  My recollection of
 zlib is that compression is memory-hungry but decompression not so much,
 so it seems like this shouldn't be a huge deal.

It seems more appropriate to discuss results here, rather than on
-patches...

http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
I've run into a slight problem in that even at a compression level of
-3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
test with scale=150 that was producing a 2G on-disk sort is now
producing a 80M sort, which obviously fits in memory. And cuts sort
times by more than half.

So, if nothing else, it looks like compression is definitely a win if it
means you can now fit the sort within the disk cache. While that doesn't
sound like something very worthwhile, it essentially extends work_mem
from a fraction of memory to up to ~25x memory.

I'm currently loading up a pgbench database with a scaling factor of
15000; hopefully I'll have results from that testing in the morning.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Albe Laurenz
Andrew Piskorski wrote:
 Rod Taylor wrote:
Disk storage is cheap. Disk bandwidth or throughput is very
expensive.
 
 Oracle has included table compression since 9iR2.  They report table
 size reductions of 2x to 4x as typical, with proportional reductions
 in I/O, and supposedly, usually low to negligible overhead for writes:

[...]

 The main tricks seem to be:  One, EXTREMELY lightweight compression
 schemes - basically table lookups designed to be as CPU friendly as
 possible.  Two, keep the data compressed in RAM as well so that you can
 also cache more of the data, and indeed keep it compressed until
 as late in the CPU processing pipeline as possible.

Oracle's compression seems to work as follows:
- At the beginning of each data block, there is a 'lookup table'
  containing frequently used values in table entries (of that block).
- This lookup table is referenced from within the block.
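
An illustrative toy version of that idea (a sketch only, not Oracle's actual on-disk format): build a per-block lookup table of values that repeat within the block and store small references into it. Column values are assumed hashable.

from collections import Counter

def compress_block(rows):
    # collect the values that repeat within this block
    counts = Counter(v for row in rows for v in row)
    lookup = [v for v, n in counts.most_common() if n > 1]
    index = {v: i for i, v in enumerate(lookup)}
    # each cell becomes either a dictionary reference ('d', i) or a literal ('v', value)
    encoded = [[('d', index[v]) if v in index else ('v', v) for v in row]
               for row in rows]
    return lookup, encoded

def decompress_block(lookup, encoded):
    return [[lookup[x] if tag == 'd' else x for tag, x in row] for row in encoded]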

There is a White Paper that describes the algorithm and contains
praise for the effects:
http://www.oracle.com/technology/products/bi/pdf/o9ir2_compression_performance_twp.pdf

Oracle does not compress tables by default.
This is what they have to say about it:

Table compression should be used with highly redundant data, such as
tables with many foreign keys. You should avoid compressing tables with
much update or other DML activity. Although compressed tables or
partitions are updatable, there is some overhead in updating these
tables, and high update activity may work against compression by
causing some space to be wasted.

Yours,
Laurenz Albe



Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Martijn van Oosterhout
On Wed, May 17, 2006 at 09:45:35AM +0200, Albe Laurenz wrote:
 Oracle's compression seems to work as follows:
 - At the beginning of each data block, there is a 'lookup table'
   containing frequently used values in table entries (of that block).
 - This lookup table is referenced from within the block.

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.

The biggest issue is really that once postgres reads a block from disk
and uncompresses it, this block will be much larger than 8K. Somehow
you have to arrange storage for this.

I have some ideas though, but as Tom says, should go for the quick and
dirty numbers first, to determine if it's even worth doing.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Martijn van Oosterhout
On Wed, May 17, 2006 at 12:03:15AM -0400, Tom Lane wrote:
 AFAICS the only sane choice here is to use
 src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
 already in the backend, and (2) data compression in general is such a
 minefield of patents that we'd be foolish to expose ourselves in more
 than one direction.

Unfortunately, the interface provided by pg_lzcompress.c is probably
insufficient for this purpose. You want to be able to compress tuples
as they get inserted and start a new block once the output reaches a
certain size. pg_lzcompress.c only has the options compress-whole-block
and decompress-whole-block.

zlib allows you to compress as the data comes along, keeping an eye on
the output buffer while you do it. For an initial test, using zlib
directly would probably be easier. If it works out we can look into
alternatives.
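
The core of that streaming style would look roughly like the sketch below.
This is an illustration only, not the actual patch: error checking and the
final Z_FINISH flush at end of run are omitted, and write_block() is just a
stand-in for whatever hands finished 8K blocks to the tape code.

#include <string.h>
#include <zlib.h>

#define BLCKSZ 8192

static z_stream zs;
static char     outbuf[BLCKSZ];

/* Set up a deflate stream writing into an 8K output buffer. */
static void
compress_start(void)
{
    memset(&zs, 0, sizeof(zs));         /* zalloc/zfree/opaque = Z_NULL */
    deflateInit(&zs, 3);                /* cheap compression level */
    zs.next_out = (Bytef *) outbuf;
    zs.avail_out = BLCKSZ;
}

/*
 * Feed one tuple's bytes to the stream.  Whenever the compressed output
 * fills an 8K block, hand it to the caller's block writer and start the
 * next block -- this is the "keep an eye on the output buffer" part.
 */
static void
compress_write(const char *data, unsigned len,
               void (*write_block) (const char *buf, unsigned nbytes))
{
    zs.next_in = (Bytef *) data;
    zs.avail_in = len;

    while (zs.avail_in > 0)
    {
        deflate(&zs, Z_NO_FLUSH);
        if (zs.avail_out == 0)
        {
            write_block(outbuf, BLCKSZ);
            zs.next_out = (Bytef *) outbuf;
            zs.avail_out = BLCKSZ;
        }
    }
}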

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Andrew Piskorski
On Tue, May 16, 2006 at 11:48:21PM -0400, Greg Stark wrote:

 There are some very fast decompression algorithms:
 
 http://www.oberhumer.com/opensource/lzo/

Sure, and for some tasks in PostgreSQL perhaps it would be useful.
But at least as of July 2005, a Sandor Heman, one of the MonetDB guys,
had looked at zlib, bzip2, lzrw, and lzo, and claimed that:

  ... in general, it is very unlikely that we could achieve any
  bandwidth gains with these algorithms. LZRW and LZO might increase
  bandwidth on relatively slow disk systems, with bandwidths up to
  100MB/s, but this would induce high processing overheads, which
  interferes with query execution. On a fast disk system, such as our
  350MB/s 12 disk RAID, all the generic algorithms will fail to achieve
  any speedup.

  http://www.google.com/search?q=MonetDB+LZO+Heman&btnG=Search
  http://homepages.cwi.nl/~heman/downloads/msthesis.pdf

 I think most of the mileage from lookup tables would be better implemented
 at a higher level by giving tools to data modellers that let them achieve
 denser data representations. Things like convenient enum data types, 1-bit
 boolean data types, short integer data types, etc.

Things like enums and 1 bit booleans certainly could be useful, but
they cannot take advantage of duplicate values across multiple rows at
all, even if 1000 rows have the exact same value in their date
column and are all in the same disk block, right?

Thus I suspect that the exact opposite is true, a good table
compression scheme would render special denser data types largely
redundant and obsolete.

Good table compression might be a lot harder to do, of course.
Certainly Oracle's implementation of it had some bugs which made it
difficult to use reliably in practice (in certain circumstances
updates could fail, or if not fail perhaps have pathological
performance), bugs which are supposed to be fixed in 10.2.0.2, which
was only released within the last few months.

-- 
Andrew Piskorski [EMAIL PROTECTED]
http://www.piskorski.com/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Zeugswetter Andreas DCP SD

 Certainly, if you can't prototype a convincing performance win using
 that algorithm, it's unlikely to be worth anyone's time to 
 look harder.

That should be easily possible with LZO. It would need to be the lib that
we can optionally link to (--with-lzo), since the lib is GPL.

LZO even allows for in-place decompression and overlapping compression.

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Hannu Krosing
On one fine day, Wed, 2006-05-17 at 12:20, Zeugswetter Andreas DCP SD
wrote:
  Certainly, if you can't prototype a convincing performance win using
  that algorithm, it's unlikely to be worth anyone's time to 
  look harder.
 
 That should be easily possible with LZO. It would need to be the lib that
 we can optionally link to (--with-lzo), since the lib is GPL.
 
 LZO even allows for in-place decompression and overlapping compression.

Does being GPL also automatically imply that it is patent-free, so that
we could re-implement it under BSD license if it gives good results?


Hannu



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Zeugswetter Andreas DCP SD

 Unfortunately, the interface provided by pg_lzcompress.c is probably
 insufficient for this purpose. You want to be able to compress tuples
 as they get inserted and start a new block once the output reaches a

I don't think anything that compresses single tuples without context is
going to be a win under realistic circumstances.

I would at least compress whole pages. Allow a max ratio of 1:n,
have the pg buffercache be uncompressed, and only compress on write
(filesystem cache then holds compressed pages).

The tricky part is predicting whether a tuple still fits in an
n*8k-uncompressed / 8k-compressed page, but since LZO is fast you might
even test it in corner cases (this logic probably also needs to be in the
available page freespace calculation).
Choosing a good n is also tricky; probably 2 (or 3?) is good.

You probably also want to always keep the header part of the page
uncompressed.
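
A crude version of that fit test could look like the sketch below.
Recompressing the whole page image for every tuple is obviously too
expensive for real use, zlib stands in for LZO only because of the license
issue, and the header size and ratio are made-up numbers:

#include <string.h>
#include <stdbool.h>
#include <zlib.h>

#define BLCKSZ      8192
#define MAX_RATIO   2                   /* the "n" above */
#define PAGE_HDR_SZ 24                  /* kept uncompressed; size is illustrative */

/*
 * Would the page still compress to under 8K (leaving the header
 * uncompressed) if we added this tuple?
 */
static bool
tuple_still_fits(const char *page_image, uLong used,
                 const char *tuple, uLong tuple_len)
{
    static char scratch[MAX_RATIO * BLCKSZ];
    static char compressed[3 * BLCKSZ]; /* > compressBound(MAX_RATIO * BLCKSZ) */
    uLongf      clen = sizeof(compressed);

    if (used + tuple_len > MAX_RATIO * BLCKSZ)
        return false;                   /* uncompressed n*8k limit exceeded */

    memcpy(scratch, page_image, used);
    memcpy(scratch + used, tuple, tuple_len);

    if (compress2((Bytef *) compressed, &clen,
                  (const Bytef *) scratch, used + tuple_len, 1) != Z_OK)
        return false;

    return PAGE_HDR_SZ + clen <= BLCKSZ;
}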

Andreas

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jonah H. Harris

On 5/17/06, Martijn van Oosterhout kleptog@svana.org wrote:

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.


Oracle's patent application 20040054858 covers the method itself
including the process for storing and retrieving compressed data.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 2nd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Clever idea, pity we can't use it (what's the bet it's patented?). I'd
 wager anything beyond simple compression is patented by someone.

You're in for a rude awakening: even simple compression is anything
but simple.  As I said, it's a minefield of patents.  I recall reading a
very long statement by one of the zlib developers (Jean-loup Gailly, I
think) explaining exactly how they had threaded their way through that
minefield, and why they were different enough from half-a-dozen
similar-looking patented methods to not infringe any of them.

I feel fairly confident that zlib is patent-free, first because they did
their homework and second because they've now been out there and highly
visible for a good long time without getting sued.  I've got no such
confidence in any other random algorithm you might choose --- in fact,
I'm not at all sure that pg_lzcompress.c is safe.  If we were
aggressively trying to avoid patent risks we might well consider
dropping pg_lzcompress.c and using zlib exclusively.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Tue, May 16, 2006 at 06:48:25PM -0400, Greg Stark wrote:
 Martijn van Oosterhout kleptog@svana.org writes:
 
   It might be easier to switch to giving each tape its own file...
  
  I don't think it would make much difference. OTOH, if this turns out to
  be a win, the tuplestore could have the same optimisation.
 
 Would giving each tape its own file make it easier to allow multiple temporary
sort areas and allow optimizing to avoid seeking when multiple spindles are
 available?

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem. And
even after all that work, there's not much that says it would perform
better than a simple RAID10.

What *might* make sense would be to provide two locations for pgsql_tmp,
because a lot of operations in there involve reading and writing at the
same time:

Read from heap while writing tapes to pgsql_tmp
read from tapes while writing final version to pgsql_tmp

There's probably some gain to be had by writing the final version to a
tablespace other than the default one (which is where pgsql_tmp would
be, I think). But recent changes in -HEAD mean that the second step is
now only performed in certain cases.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
If we're going to consider table-level compression, ISTM the logical
first step is to provide greater control over TOASTing; namely
thresholds for when to compress and/or go to external storage that can
be set on a per-field or at least per-table basis.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 10:06:04AM +0200, Martijn van Oosterhout wrote:
 On Wed, May 17, 2006 at 09:45:35AM +0200, Albe Laurenz wrote:
  Oracle's compression seems to work as follows:
  - At the beginning of each data block, there is a 'lookup table'
containing frequently used values in table entries (of that block).
  - This lookup table is referenced from within the block.
 
 Clever idea, pity we can't use it (what's the bet it's patented?). I'd
 wager anything beyond simple compression is patented by someone.
 
 The biggest issue is really that once postgres reads a block from disk
 and uncompresses it, this block will be much larger than 8K. Somehow
 you have to arrange storage for this.

It's entirely possible that the best performance would be found by not
uncompressing blocks when putting them into shared_buffers, though.
That would mean you'd only have to deal with compression when pulling
individual tuples. Simple, right? :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 What *might* make sense would be to provide two locations for pgsql_tmp,
 because a lot of operations in there involve reading and writing at the
 same time:

 Read from heap while writing tapes to pgsql_tmp
 read from tapes while writing final version to pgsql_tmp

Note that a large part of the reason for the current logtape.c design
is to avoid requiring 2X or more disk space to sort X amount of data.
AFAICS, any design that does the above will put us right back in the 2X
regime.  That's a direct, measurable penalty; it'll take more than
handwaving arguments to convince me we should change it in pursuit of
unquantified speed benefits.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  What *might* make sense would be to provide two locations for pgsql_tmp,
  because a lot of operations in there involve reading and writing at the
  same time:
 
  Read from heap while writing tapes to pgsql_tmp
  read from tapes while writing final version to pgsql_tmp
 
 Note that a large part of the reason for the current logtape.c design
 is to avoid requiring 2X or more disk space to sort X amount of data.
 AFAICS, any design that does the above will put us right back in the 2X
 regime.  That's a direct, measurable penalty; it'll take more than
 handwaving arguments to convince me we should change it in pursuit of
 unquantified speed benefits.

I certainly agree that there's no reason to make this change without
testing, but if you've got enough spindles laying around to actually
make use of this I suspect that requiring 2X the disk space to sort X
won't bother you.

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :) But we'd want
to protect for the oddball cases... yech.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:
 Note that a large part of the reason for the current logtape.c design
 is to avoid requiring 2X or more disk space to sort X amount of data.

 Actually, I suspect in most cases it won't matter; I don't think people
 make a habit of trying to sort their entire database. :)

Well, some years ago we used something like 4X space to sort X amount of
data (this is the native behavior of 7-way polyphase merge, it turns out)
and we got yelled at.  Which is what prompted the writing of logtape.c.
Maybe disk space has gotten so cheap since then that it no longer
matters ... but I suspect the nature of DB applications is that people
are always pushing the envelope of what their hardware can do.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Rod Taylor
 Actually, I suspect in most cases it won't matter; I don't think people
 make a habit of trying to sort their entire database. :) But we'd want
 to protect for the oddball cases... yech.

I can make query result sets that are far larger than the database
itself.

create table fat_table_with_few_tuples(fat_status_id serial primary key,
fat_1 text, fat_2 text);

create table thin_table_with_many_tuples(fat_status_id integer
references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);

SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;


I would be asking the folks trying to use PostgreSQL for data
warehousing what their opinion is. A few fact tables in an audit query
could easily result in a very large amount of temporary diskspace being
required.

-- 


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Martijn van Oosterhout
For all those people not subscribed to -patches (should appear in
archive soon), I just posted a patch there implementing zlib compression
for logtape.c. If people have test machines for speed-testing this sort
of stuff, please have at it.

You can also download it here:
http://svana.org/kleptog/temp/compress-sort.patch 

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Greg Stark

Jim C. Nasby [EMAIL PROTECTED] writes:

 Only if those spindles weren't all in a single RAID array and if we went
 through the trouble of creating all the machinery so you could tell
 PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).


-- 
greg


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Greg Stark

Andrew Piskorski [EMAIL PROTECTED] writes:

 Things like enums and 1 bit booleans certainly could be useful, but
 they cannot take advantage of duplicate values across multiple rows at
 all, even if 1000 rows have the exact same value in their date
 column and are all in the same disk block, right?

That's an interesting direction to go in. Generic algorithms would still help
in that case: since the identical value would occur more frequently than other
values, it would be encoded as a smaller symbol. But there's going to be a
limit to how much they can compress the data.

The ideal way to handle the situation you're describing would be to interleave
the tuples so that you have all 1000 values of the first column, followed by
all 1000 values of the second column and so on. Then you run a generic
algorithm on this and it achieves very high compression rates since there are
a lot of repeating patterns.

I don't see how you build a working database with data in this form however.
For example, a single insert would require updating small pieces of data
across the entire table. Perhaps there's some middle ground with interleaving
the tuples within a single compressed page, or something like that?
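
Concretely, within one page that might mean something like the sketch below
before handing the buffer to a generic compressor. Fixed-width columns only,
and the row layout is made up here just to illustrate the idea:

#include <string.h>
#include <stdint.h>

/* A toy fixed-width row; the column names are invented for illustration. */
typedef struct Row
{
    int32_t status_id;                  /* highly repetitive across rows */
    int32_t thin_1;
    int32_t thin_2;
} Row;

/*
 * Rewrite nrows rows into column-major order: all status_ids, then all
 * thin_1s, then all thin_2s.  The result is what would be fed to a generic
 * compressor for the page, so repeated values in one column sit next to
 * each other.  'out' must hold at least nrows * sizeof(Row) bytes;
 * returns the number of bytes written.
 */
static size_t
interleave_columns(const Row *rows, int nrows, char *out)
{
    char   *p = out;
    int     i;

    for (i = 0; i < nrows; i++, p += sizeof(int32_t))
        memcpy(p, &rows[i].status_id, sizeof(int32_t));
    for (i = 0; i < nrows; i++, p += sizeof(int32_t))
        memcpy(p, &rows[i].thin_1, sizeof(int32_t));
    for (i = 0; i < nrows; i++, p += sizeof(int32_t))
        memcpy(p, &rows[i].thin_2, sizeof(int32_t));

    return (size_t) (p - out);
}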

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 The ideal way to handle the situation you're describing would be to interleave
 the tuples so that you have all 1000 values of the first column, followed by
 all 1000 values of the second column and so on. Then you run a generic
 algorithm on this and it achieves very high compression rates since there are
 a lot of repeating patterns.

It's not obvious to me that that yields a form more compressible than
what we have now.  As long as the previous value is within the lookback
window, an LZ-style compressor will still be able to use it.  More
importantly, the layout you describe would be unable to take advantage
of any cross-column correlation, which in real data is likely to be a
useful property for compression.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
 
 Jim C. Nasby [EMAIL PROTECTED] writes:
 
  Only if those spindles weren't all in a single RAID array and if we went
  through the trouble of creating all the machinery so you could tell
  PostgreSQL where all those spindles were mounted in the filesystem.
 
 I think the way you do this is simply by documenting that the admin should
 create precisely one temp area per physical spindle (or raid array).

And you still need some way to tell PostgreSQL about all of that.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 12:16:13PM -0400, Rod Taylor wrote:
  Actually, I suspect in most cases it won't matter; I don't think people
  make a habit of trying to sort their entire database. :) But we'd want
  to protect for the oddball cases... yech.
 
 I can make query result sets that are far larger than the database
 itself.
 
 create table fat_table_with_few_tuples(fat_status_id serial primary key,
 fat_1 text, fat_2 text);
 
 create table thin_table_with_many_tuples(fat_status_id integer
 references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);
 
 SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
 fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;
 
 
 I would be asking the folks trying to use PostgreSQL for data
 warehousing what their opinion is. A few fact tables in an audit query
 could easily result in a very large amount of temporary diskspace being
 required.

Note my last sentence: we'd need to provide for cases where this was a
problem. How much that would complicate the code, I don't know...

This is another case where someone (with more skills than me) would need
to hack the backend enough to be able to test it and see how big a
performance gain there was.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Hannu Krosing
On one fine day, Wed, 2006-05-17 at 10:01, Jim C. Nasby wrote:
 If we're going to consider table-level compression, ISTM the logical
 first step is to provide greater control over TOASTing; namely
 thresholds for when to compress and/or go to external storage that can
 be set on a per-field or at least per-table basis.

Also, we would get a lot of compression if we could get rid of the index
on the toast table and use the ctid directly.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 10:55:19PM +0300, Hannu Krosing wrote:
 On one fine day, Wed, 2006-05-17 at 10:01, Jim C. Nasby wrote:
  If we're going to consider table-level compression, ISTM the logical
  first step is to provide greater control over TOASTing; namely
  thresholds for when to compress and/or go to external storage that can
  be set on a per-field or at least per-table basis.
 
 also, we would get a lot of compression, if we could get rid of index
 on toast table, and use the ctid directly.

It'd also be damn handy to be able to ditch the 2-pass vacuum
requirement. I've often wondered if treating the toast table as a real
table was overkill; for example, it should be possible to include WAL
information for it in with the parent table. That plus using ctid as a
reference would hopefully allow for removing a lot of overhead from it.
Presumably all the MVCC info could go, since that would be taken care of
by the parent table.

Of course the downside is that it'd mean a different set of code for
handling toast storage...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Greg Stark
Jim C. Nasby [EMAIL PROTECTED] writes:

 On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
  
  Jim C. Nasby [EMAIL PROTECTED] writes:
  
   Only if those spindles weren't all in a single RAID array and if we went
   through the trouble of creating all the machinery so you could tell
   PostgreSQL where all those spindles were mounted in the filesystem.
  
  I think the way you do this is simply by documenting that the admin should
  create precisely one temp area per physical spindle (or raid array).
 
 And you still need some way to tell PostgreSQL about all of that.

No, my point was that you tell Postgres how many spindles you have and where to
find them by creating precisely one temp area on each spindle. It then knows
that it should strive to maximize sequential reads within one temp area and
expect switching between temp areas (which represent multiple spindles) to be
better than multiplexing multiple tapes within a single temp area (which
represents a single spindle).

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Jim C. Nasby
On Wed, May 17, 2006 at 05:44:22PM -0400, Greg Stark wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
 
  On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
   
   Jim C. Nasby [EMAIL PROTECTED] writes:
   
Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem.
   
   I think the way you do this is simply by documenting that the admin should
   create precisely one temp area per physical spindle (or raid array).
  
  And you still need some way to tell PostgreSQL about all of that.
 
 No, my point was that you tell Postgres how many spindles you have and where to
 find them by creating precisely one temp area on each spindle. It then knows

Which means we need all the interface bits to be able to tell PostgreSQL
where every single temp storage area is. Presumably much of the
tablespace mechanism could be used for this, but it's still a bunch of
work. And you can't just say I have 8 spindles, you have to tell
PostgreSQL exactly where to put each temporary area (unless you just
have it put one on every tablespace you have defined).

 that it should strive to maximize sequential reads within one temp area and
 expect switching between temp areas (which represent multiple spindles) to be
 better than multiplexing multiple tapes within a single temp area (which
 represents a single spindle).

Which adds yet more complexity to all the code that uses the temp area.
And as others have brought up, you still have to allow for the case when
splitting all of this out into multiple files means you end up using
substantially more disk space. That further drives up the complexity.

My point is that unless someone shows that there's a non-trivial
performance gain here, it's not going to happen.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Compression and on-disk sorting

2006-05-17 Thread Greg Stark
Jim C. Nasby [EMAIL PROTECTED] writes:

 Which means we need all the interface bits to be able to tell PostgreSQL
 where every single temp storage area is. Presumably much of the
 tablespace mechanism could be used for this, but it's still a bunch of
 work. And you can't just say I have 8 spindles, you have to tell
 PostgreSQL exactly where to put each temporary area (unless you just
 have it put one on every tablespace you have defined).

Yes, if you have more than one temporary area you definitely need a way to
tell Postgres where to put them since obviously they won't be in the postgres
base directory. But I think that's all Postgres really needs to know.

One could imagine a more complex version where Postgres has meta information
about the bandwidth and seek penalty for each sort area separately. Presumably
also for each table space. But that's a whole lot more complexity than
Postgres's current cost model.


  that it should strive to maximize sequential reads within one temp area and
  expect switching between temp areas (which represent multiple spindles) to 
  be
  better than multiplexing multiple tapes within a single temp area (which
  represents a single spindle).
 
 Which adds yet more complexity to all the code that uses the temp area.
 And as others have brought up, you still have to allow for the case when
 splitting all of this out into multiple files means you end up using
 substantially more disk space. That further drives up the complexity.

You also have to consider that it won't always be a benefit to spread the sort
over multiple sort areas. If there's only one sort going on and you can reach
a 1:1 ratio between tapes and spindles then I think it would be a huge
benefit. Effectively boosting the sort speed by random_page_cost.

But if you don't have as many spindles as your algorithm needs tapes
then it's unclear which to multiplex down and whether you gain any benefit
once you're multiplexing over simply using a single sort area.

And worse, if there are multiple sorts going on in the system then you're not
going to get sequential access even if you have multiple sort areas available.
If you have N sort areas and N sorts are going on then you're probably better
off multiplexing each one down to a single sort area and letting them each
proceed without interfering with each other rather than having each one hog
all the sort areas and forcing the OS to do the multiplexing blindly.

 My point is that unless someone shows that there's a non-trivial
 performance gain here, it's not going to happen.

I think two extreme cases are well worth pursuing: 

1) Use n sort areas for n tapes making everything purely sequential access.
   That would be useful for large DSS systems where large sorts are running
   and i/o bandwidth is high for sequential access. That gives effectively a
   random_page_cost speed boost.

2) Use the current algorithm unchanged but have each sort use a different sort
   area in some sort of round-robin fashion. That helps the OLTP type
   environment (ignoring for the moment that OLTP environments really ought
   not be doing disk sorts) where people complain about unreliable execution
   times more than slow execution times. If you can provide enough sort areas
   then it would remove one big reason concurrent queries impact the
   execution time of other queries.

-- 
greg


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Zeugswetter Andreas DCP SD

  Given that any time that happens we end up caring much less about CPU
  usage and much more about disk IO, for any of these cases that use
  non-random access, compressing the data before sending it to disk would
  potentially be a sizeable win.
 
 Note however that what the code thinks is a spill to disk and what
 actually involves disk I/O are two different things.  If you think
 of it as a spill to kernel disk cache then the attraction is a lot
 weaker...

Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.

It would probably need some sort of plugin architecture, since the
fastest compression algorithms (LZO) that also reach good ratios are GPL.
LZO is proven to increase physical IO write speed with low CPU overhead.

Andreas

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Zeugswetter Andreas DCP SD

  Personally, I believe it would be worth it - but only to a few. And
  most of these few are likely using Oracle. So, no gain unless
  you can convince them to switch back... :-)
 
 We do know that for commercial databases that support both raw and
 file system storage, raw storage is only a few percentage
 points faster.

IMHO it is really not comparable, because they all use direct or async IO
that bypasses the OS buffer cache even when using filesystem files for
storage.
A substantial speed difference is allocation of space for restore
(no formatting of the filesystem and no file allocation needed).

I am not saying this to advocate moving in that direction however.
I do however think that there is substantial headroom in reducing the
number of IO calls and reducing on-disk storage requirements,
especially in concurrent load scenarios.

Andreas

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Bort, Paul
 
 Compressed-filesystem extensions (like e2compr, and I think either
 FAT or NTFS) can do that.
 

Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression. 

I don't know enough Linux/Unix to know if it has similar capabilities.


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Andrew Dunstan

Bort, Paul wrote:

Compressed-filesystem extensions (like e2compr, and I think either
FAT or NTFS) can do that.




Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression. 


I don't know enough Linux/Unix to know if it has similar capabilities.


  


Or would want to ...

I habitually turn off all compression on my Windows boxes, because it's 
a performance hit in my experience. Disk is cheap ...


cheers

andrew

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Andrew Dunstan

Rod Taylor wrote:
I habitually turn off all compression on my Windows boxes, because it's 
a performance hit in my experience. Disk is cheap ...



Disk storage is cheap. Disk bandwidth or throughput is very expensive.
  


Sure, but in my experience using Windows File System compression is not 
a win here. Presumably if it were an unqualified win they would have it 
turned on everywhere. The fact that there's an option is a good 
indication that it isn't in many cases. It is most commonly used for 
files like executables that are in effect read-only - but that doesn't 
help us.


cheers

andrew


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Jim C. Nasby
On Tue, May 16, 2006 at 09:24:38AM +0200, Zeugswetter Andreas DCP SD wrote:
 
   Given that any time that happens we end up caring much less about CPU
   usage and much more about disk IO, for any of these cases that use
   non-random access, compressing the data before sending it to disk would
   potentially be a sizeable win.
  
  Note however that what the code thinks is a spill to disk and what
  actually involves disk I/O are two different things.  If you think
  of it as a spill to kernel disk cache then the attraction is a lot
  weaker...
 
 Yes, that is very true. However it would also increase the probability
 that spill to disk is not needed, since more data fits in RAM.

That's a pretty thin margin though, depending on how good the
compression is. This also assumes that you have a compression algorithm
that supports random access.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Jim C. Nasby
On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
 Rod Taylor wrote:
 I habitually turn off all compression on my Windows boxes, because it's 
 a performance hit in my experience. Disk is cheap ...
 
 
 Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Hey, that's my line! :P

 Sure, but in my experience using Windows File System compression is not 
 a win here. Presumably if it were an unqualified win they would have it 
 turned on everywhere. The fact that there's an option is a good 
 indication that it isn't in many cases. It is most commonly used for 
 files like executables that are in effect read-only - but that doesn't 
 help us.

The issue with filesystem level compression is that it has to support
things like random access, which isn't needed for on-disk sorting (not
sure about other things like hashing, etc).

In any case, my curiosity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pg_tmp compressed.

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Jim C. Nasby
On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
 In any case, my curiosity is aroused, so I'm currently benchmarking
 pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
 some benchmarks with pg_tmp compressed.
 
Results: http://jim.nasby.net/bench.log

As expected, compressing $PGDATA/base was a loss. But compressing
pgsql_tmp and then doing some disk-based sorts did show an improvement,
from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
a worst-case scenario. On the other hand, XP's compression algorithm
appears to be pretty aggressive, as it cut the size of the on-disk sort
file from almost 700MB to 82MB. There are probably gains to be had from a
different compression algorithm.

 Does anyone have time to hack some kind of compression into the on-disk
 sort code just to get some benchmark numbers? Unfortunately, doing so is
 beyond my meager C ability...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Martijn van Oosterhout
On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
 Does anyone have time to hack some kind of compression into the on-disk
 sort code just to get some benchmark numbers? Unfortunately, doing so is
 beyond my meager C ability...

I had a look at this. At first glance it doesn't seem too hard, except
the whole logtape process kinda gets in the way. If it weren't for the
mark/restore it'd be trivial. Might take a stab at it some time, if I
can think of a way to handle the seeking...

-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Jim C. Nasby
On Tue, May 16, 2006 at 11:46:15PM +0200, Martijn van Oosterhout wrote:
 On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
  Does anyone have time to hack some kind of compression into the on-disk
  sort code just to get some benchmark numbers? Unfortunately, doing so is
  beyond my meager C ability...
 
 I had a look at this. At first glance it doesn't seem too hard, except
 the whole logtape process kinda gets in the way. If it weren't for the
 mark/restore it'd be trivial. Might take a stab at it some time, if I
 can think of a way to handle the seeking...

Oh, do we need to randomly seek? Is that how we switch from one tape to
another?

It might be easier to switch to giving each tape its own file...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Martijn van Oosterhout
On Tue, May 16, 2006 at 04:50:22PM -0500, Jim C. Nasby wrote:
  I had a look at this. At first glance it doesn't seem too hard, except
  the whole logtape process kinda gets in the way. If it weren't for the
  mark/restore it'd be trivial. Might take a stab at it some time, if I
  can think of a way to handle the seeking...
 
 Oh, do we need to randomly seek? Is that how we switch from one tape to
 another?

Not seek, mark/restore. As the code describes, sometimes you go back a
tuple. The primary reason, I think, is for the final pass: a merge sort
might read the tuples multiple times, so it needs to support it there.

 It might be easier to switch to giving each tape its own file...

I don't think it would make much difference. OTOH, if this turns out to
be a win, the tuplestore could have the same optimisation.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Not seek, mark/restore. As the code describes, sometimes you go back a
 tuple. The primary reason I think is for the final pass, a merge sort
 might read the tuples multiple times, so it needs to support it there.

However it'd be possible to tell logtape in advance whether a particular
tape needs to support that, and only apply compression when not; it
would work all the time for intermediate merge passes, and with the
recent executor changes to pass down you-need-to-support-mark flags,
it'd work for the output pass in a lot of cases too.

If you're just trying to get some quick and dirty numbers: do
compression, replace Seek/Tell with PANICs, and only test on plain
sorts, no joins ;-)

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Greg Stark
Martijn van Oosterhout kleptog@svana.org writes:

  It might be easier to switch to giving each tape its own file...
 
 I don't think it would make much difference. OTOH, if this turns out to
 be a win, the tuplestore could have the same optimisation.

Would giving each tape its own file make it easier to allow multiple temporary
sort areas and allow optimizing to avoid seeking when multiple spindles are
available?


-- 
greg


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Andrew Piskorski
On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
 On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
  Rod Taylor wrote:
  I habitually turn off all compression on my Windows boxes, because it's 
  a performance hit in my experience. Disk is cheap ...
  
  Disk storage is cheap. Disk bandwidth or throughput is very expensive.

  Sure, but in my experience using Windows File System compression is not 
  a win here. Presumably if it were an unqualified win they would have it 

 Does anyone have time to hack some kind of compression into the on-disk
 sort code just to get some benchmark numbers? Unfortunately, doing so is
 beyond my meager C ability...

Folks, first of all, I'm in no way an expert on data compression in
RDBMSs, but other databases DO include data compression features and
claim it as a SIGNIFICANT win in I/O reduction.

Looking at performance of the Windows File System compression, etc.,
doesn't make too much sense when there are actual RDBMS compression
implementations to compare to, on the commerical market, in open
source code, and in the academic literature.

Oracle has included table compression since 9iR2.  They report table
size reductions of 2x to 4x as typical, with proportional reductions
in I/O, and supposedly, usually low to negligible overhead for writes:

  
http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14211/build_db.htm#sthref289

  Decision Speed: Table Compression In Action by Meikel Poess and Hermann Baer (2003):
  http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/poess_tablecomp.html

  Compressing Data for Space and Speed by Sanjay Mishra (2004):
  http://www.oracle.com/technology/oramag/oracle/04-mar/o24tech_data.html

  Order For Maximum Compression:
  http://oramossoracle.blogspot.com/2005/11/table-compression-order-for-maximum.html

I don't remember whether the current (Open Source) MonetDB includes
table compression or not, but they've published papers with lots of
interesting detail on the compression and other high performance OLAP
features in their latest (not released) X100 MonetDB research
codebase:

  http://monetdb.cwi.nl/
  http://homepages.cwi.nl/~mk/MonetDB/
  http://sourceforge.net/projects/monetdb/
  ftp://ftp.research.microsoft.com/pub/debull/A05june/issue1.htm

Now, the docs and papers above are all focused on query performance,
they say nothing directly about using using compression for on-disk
sorts.  But, I'd bet that similar rules of thumb will apply in both
cases.

The main tricks seem to be:  One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as cpu friendly as
possible.  Two, keep the data compressed in RAM as well so that you can
also cache more of the data, and indeed keep it compressed until
as late in the CPU processing pipeline as possible.

A corollary of that is forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the cpu to be
particularly useful in improving overall throughput rates.

Note, I have not really tested ANY of the above myself, your mileage
may well vary from what I recall from those various articles...

-- 
Andrew Piskorski [EMAIL PROTECTED]
http://www.piskorski.com/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Greg Stark
Andrew Piskorski [EMAIL PROTECTED] writes:

 The main tricks seem to be:  One, EXTREMELY lightweight compression
 schemes - basically table lookups designed to be as cpu friendly as
 possible.  Two, keep the data compressed in RAM as well so that you can
 also cache more of the data, and indeed keep it compressed until
 as late in the CPU processing pipeline as possible.
 
 A corollary of that is forget compression schemes like gzip - it
 reduces data size nicely but is far too slow on the cpu to be
 particularly useful in improving overall throughput rates.

There are some very fast decompression algorithms:

http://www.oberhumer.com/opensource/lzo/


I think most of the mileage from lookup tables would be better implemented
at a higher level by giving tools to data modellers that let them achieve
denser data representations. Things like convenient enum data types, 1-bit
boolean data types, short integer data types, etc.

-- 
greg


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Compression and on-disk sorting

2006-05-16 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Andrew Piskorski [EMAIL PROTECTED] writes:
 A corollary of that is forget compression schemes like gzip - it
 reduces data size nicely but is far too slow on the cpu to be
 particularly useful in improving overall throughput rates.

 There are some very fast decompression algorithms:

AFAICS the only sane choice here is to use
src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
already in the backend, and (2) data compression in general is such a
minefield of patents that we'd be foolish to expose ourselves in more
than one direction.

Certainly, if you can't prototype a convincing performance win using
that algorithm, it's unlikely to be worth anyone's time to look harder.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 A recent post Tom made in -bugs about how bad performance would be if we
 spilled after-commit triggers to disk got me thinking... There are
 several operations the database performs that potentially spill to disk.
 Given that any time that happens we end up caring much less about CPU
 usage and much more about disk IO, for any of these cases that use
 non-random access, compressing the data before sending it to disk would
 potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things.  If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Jim C. Nasby
On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  A recent post Tom made in -bugs about how bad performance would be if we
  spilled after-commit triggers to disk got me thinking... There are
  several operations the database performs that potentially spill to disk.
  Given that any time that happens we end up caring much less about CPU
  usage and much more about disk IO, for any of these cases that use
  non-random access, compressing the data before sending it to disk would
  potentially be a sizeable win.
 
 Note however that what the code thinks is a spill to disk and what
 actually involves disk I/O are two different things.  If you think
 of it as a spill to kernel disk cache then the attraction is a lot
 weaker...

I'm really starting to see why other databases want the OS out of their
way...

I guess at this point the best we could do would be to have a
configurable limit for when compression started. The first X number of
bytes go out uncompressed, everything after that is compressed. I don't
know of any technical reason why you couldn't switch in the middle of a
file, so long as you knew exactly where you switched.
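
A rough sketch of that switch-at-a-threshold idea, just to show what would
need to be tracked: the switch happens on a write boundary for simplicity,
there is no error handling or final Z_FINISH flush, and nothing here is
meant as the real tuplestore/logtape interface.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

typedef struct SpillFile
{
    FILE       *fp;
    long        threshold;      /* configurable: raw bytes before compressing */
    long        written;        /* raw bytes written so far */
    long        switch_offset;  /* where compressed data begins; caller sets -1 */
    z_stream    zs;
    char        outbuf[8192];
} SpillFile;

static void
spill_write(SpillFile *sf, const char *data, unsigned len)
{
    if (sf->switch_offset < 0 && sf->written + (long) len > sf->threshold)
    {
        /* crossing the limit: note exactly where, and set up the compressor */
        sf->switch_offset = ftell(sf->fp);
        memset(&sf->zs, 0, sizeof(sf->zs));
        deflateInit(&sf->zs, 3);
    }

    if (sf->switch_offset < 0)
    {
        fwrite(data, 1, len, sf->fp);   /* still below the limit: write raw */
        sf->written += len;
        return;
    }

    sf->zs.next_in = (Bytef *) data;    /* past the limit: compress from here on */
    sf->zs.avail_in = len;
    while (sf->zs.avail_in > 0)
    {
        sf->zs.next_out = (Bytef *) sf->outbuf;
        sf->zs.avail_out = sizeof(sf->outbuf);
        deflate(&sf->zs, Z_NO_FLUSH);
        fwrite(sf->outbuf, 1, sizeof(sf->outbuf) - sf->zs.avail_out, sf->fp);
    }
}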
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Andrew Dunstan

Jim C. Nasby wrote:

On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
  

Jim C. Nasby [EMAIL PROTECTED] writes:


A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.
  

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things.  If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...



I'm really starting to see why other databases want the OS out of their
way...
  


Some of it is pure NIH syndrome. I recently heard of some tests done by 
a major DB team that showed their finely crafted raw file system stuff 
performing at best a few percent better than a standard file system, and 
sometimes worse. I have often heard of the supposed benefits of our 
being able to go behind the OS, but I am very dubious about it. What 
makes people think that we could do any better than the OS guys?



cheers

andrew

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Jim C. Nasby
On Mon, May 15, 2006 at 03:44:50PM -0400, Andrew Dunstan wrote:
 Jim C. Nasby wrote:
 On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
   
 Jim C. Nasby [EMAIL PROTECTED] writes:
 
 A recent post Tom made in -bugs about how bad performance would be if we
 spilled after-commit triggers to disk got me thinking... There are
 several operations the database performs that potentially spill to disk.
 Given that any time that happens we end up caring much less about CPU
 usage and much more about disk IO, for any of these cases that use
 non-random access, compressing the data before sending it to disk would
 potentially be a sizeable win.
   
 Note however that what the code thinks is a spill to disk and what
 actually involves disk I/O are two different things.  If you think
 of it as a spill to kernel disk cache then the attraction is a lot
 weaker...
 
 
 I'm really starting to see why other databases want the OS out of their
 way...
   
 
 Some of it is pure NIH syndrome. I recently heard of some tests done by 
 a major DB team that showed their finely crafted raw file system stuff 
 performing at best a few percent better than a standard file system, and 
 sometimes worse. I have often heard of the supposed benefits of our 
 being able to go behind the OS, but I am very dubious about it. What 
 makes people think that we could do any better than the OS guys?

The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on its own before checkpoint. Now there's talk of a
background reader, because backends keep stalling while waiting on disk IO.

In this case the problem is that we want to tell the OS Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either. But
AFAIK there's no option like that in fopen... :)

I agree, when it comes to base-level stuff like how to actually put the
data on the physical media, there's not much to be gained in this day
and age by using RAW storage, and in fact Oracle hasn't favored RAW for
quite some time. Every DBA I've ever talked to says that the only reason
to use RAW is if you're trying to eke every last ounce of performance
out of the hardware that you can, which for 99.99% of installs makes
absolutely no sense.

But, there's a big range between writing your own filesystem and
assuming that the OS should just handle everything for you. I think a
lot of commercial databases lean too far towards not trusting the OS
(which is understandable to a degree, given how much OSes have
improved), while in some areas I think we still rely too much on the OS
(like read-ahead).
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Martijn van Oosterhout
On Mon, May 15, 2006 at 03:02:07PM -0500, Jim C. Nasby wrote:
 The problem is that it seems like there's never enough ability to clue
 the OS in on what the application is trying to accomplish. For a long
 time we didn't have a background writer, because the OS should be able
 to flush things out on its own before checkpoint. Now there's talk of a
 background reader, because backends keep stalling while waiting on disk IO.

Hmm, I thought the background writer was created to reduce the cost of
checkpoints, which have to write out the modified pages in the buffers.
The background writer tries to keep the number of dirty pages down.

I don't know about Oracle, but with or without an OS, backends are
going to block on I/O and you'd still need a background reader in both
cases for precisely the same reason. We might be able to do without a
background reader if we did asynchronous I/O, but that can't be done
portably.
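
Just to show the shape of the non-portable option, here is a minimal sketch
using the POSIX flavor, aio_read() from <aio.h> (on Linux, link with -lrt);
availability and quality vary a lot by platform, which is exactly the
portability problem:

/* Sketch: issue one asynchronous read and wait for it to finish.
 * A real background-reader replacement would keep many requests in
 * flight and do useful work instead of the busy-wait below. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char            buf[8192];
    struct aiocb    cb;
    int             fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0)
    {
        perror("aio_read");
        return 1;
    }

    /* The caller could go do other work here instead of spinning. */
    while (aio_error(&cb) == EINPROGRESS)
        ;

    printf("read %ld bytes asynchronously\n", (long) aio_return(&cb));
    close(fd);
    return 0;
}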

 In this case the problem is that we want to tell the OS "Hey, if this
 stuff is actually going to go out to the spindles then compress it. And
 by the way, we won't be doing any random access on it, either." But
 AFAIK there's no option like that in fopen... :)

posix_fadvise(). We don't use it and many OSes don't support it, but it
is there.
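
For reference, roughly what using it looks like: POSIX_FADV_SEQUENTIAL and
POSIX_FADV_DONTNEED are the closest existing hints to "we'll stream this
once and won't come back to it", and the kernel is free to ignore them.
A sketch, not a patch:

/* Sketch: hint the kernel about the access pattern on a temp sort file.
 * posix_fadvise() is purely advisory. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int     fd;
    int     rc;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* We will read this file strictly front to back ... */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    /* ... read and process the file here ... */

    /* ... and we won't need the cached pages afterwards. */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return 0;
}

There is, notably, no advice value that asks for compression.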

The OS adds some overhead, but the benefits outweigh the costs IMHO.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Jim C. Nasby
On Mon, May 15, 2006 at 10:09:47PM +0200, Martijn van Oosterhout wrote:
  In this case the problem is that we want to tell the OS "Hey, if this
  stuff is actually going to go out to the spindles then compress it. And
  by the way, we won't be doing any random access on it, either." But
  AFAIK there's no option like that in fopen... :)
 
 posix_fadvise(). We don't use it and many OSes don't support it, but it
 is there.

There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Ron Mayer

Jim C. Nasby wrote:


There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?


Compressed-filesystem extensions (like e2compr, and I think either
FAT or NTFS) can do that.

I think the reasons against adding this feature to postgresql are
largely the same as the reasons why compressed filesystems aren't
very popular.

Has anyone tried running postgresql on a compressing file-system?
I'd expect the penalties to outweigh the benefits (or such filesystems
would be more common); but if it gives impressive results, it might add
weight to this feature idea.


  Ron M

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.



Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Tom Lane
Ron Mayer [EMAIL PROTECTED] writes:
 I think the real reason Oracle and others practically re-wrote
 their own VM-system and filesystems is that at the time it was
 important for them to run under Windows98; where it was rather
 easy to write better filesystems than your customer's OS was
 bundled with.

Windows98?  No, those decisions predate any thought of running Oracle
on Windows, probably by decades.  But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.

regards, tom lane



Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Joshua D. Drake

Tom Lane wrote:

Ron Mayer [EMAIL PROTECTED] writes:

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.


Windows98?  No, those decisions predate any thought of running Oracle
on Windows, probably by decades.  But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.


Windows XP?

runs







--

   === The PostgreSQL Company: Command Prompt, Inc. ===
 Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
 Providing the most comprehensive  PostgreSQL solutions since 1997
http://www.commandprompt.com/





Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread mark
On Mon, May 15, 2006 at 05:42:53PM -0700, Joshua D. Drake wrote:
  Windows98?  No, those decisions predate any thought of running Oracle
  on Windows, probably by decades.  But I think the thought process was
  about as above whenever they did make it; they were running on some
  pretty stupid OSes way back when.
 Windows XP?
 runs

You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding. I'm picking on you Joshua, only because your
message is the one I saw last. Sorry...

Writing your own block caching layer can make a lot of sense. Why would it
be assumed that a file system designed for use from a desktop would be
optimal at all for database-style loads?

Why would it be assumed that a file system meant to hold many different
smaller files would be optimal at all for database-style loads?

It's always going to be true that the more specific the requirements,
the more highly optimized a system one can design. The Linux block
caching layer or file system layout can be beaten *for sure* for database
loads.

The real question - and I believe Tom and others have correctly harped
on it in the past - is: is it worth it? Until somebody actually rolls
up their sleeves, invests a month or more of their life in it, and
does it, we really won't know. And even then, the cost of maintenance
would have to be considered. Who is going to keep up-to-date on
theoretical storage models? What happens when generic file systems
again surpass the first attempt?

Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/




Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Gregory Maxwell

Oh come on. Sorry to troll, but this is too easy.

On 5/15/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding.

[snip]

 Why would it
 be assumed that a file system designed for use from a desktop would be
 optimal at all for database-style loads?


It wouldn't.
Why would someone use a desktop OS for a database?
Why would you call the result of answering the previous question
"zealous and blinding"?

PG's use of the OS's block cache is a good move because it makes PG
tend to 'just work' where the alternatives require non-trivial tuning
(sizing their caches not to push the OS into swap).  The advantages of
this are great enough that if additional smarts are needed in the OS
cache, it might well be worth the work to add them there and to ask for
new fadvise flags to get the desired behavior.

That's something that would be easy enough for a dedicated hacker to
do, or easy enough to collaborate with the OS developers if the need
could be demonstrated clearly enough.

What reasonable OS couldn't you do that with?

:)



Re: [HACKERS] Compression and on-disk sorting

2006-05-15 Thread Bruce Momjian
[EMAIL PROTECTED] wrote:
 The real question - and I believe Tom and others have correctly harped
 on it in the past - is: is it worth it? Until somebody actually rolls
 up their sleeves, invests a month or more of their life in it, and
 does it, we really won't know. And even then, the cost of maintenance
 would have to be considered. Who is going to keep up-to-date on
 theoretical storage models? What happens when generic file systems
 again surpass the first attempt?

 Personally, I believe it would be worth it - but only to a few. And
 most of these few are likely using Oracle. So, no gain unless
 you can convince them to switch back... :-)

We do know that, for the commercial databases that support both raw and
file system storage, raw storage is only a few percentage points
faster.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
