Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2012-10-27 Thread Noah Misch
On Sat, Oct 27, 2012 at 04:57:46PM +0530, Amit Kapila wrote:
 On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:
  Could you elaborate on your reason for continuing to treat TOAST as a
  special
  case?  As best I recall, the only reason to do so before was the fact
  that
  TOAST can change the physical representation of a column even when
  executor
  did not change its logical content.  Since you're no longer relying on
  the
  executor's opinion of what changed, a TOASTed value is not special.
 
 I thought for initial version of patch, without this change, patch will have
 less impact and less test.

Not that I'm aware.  If you still think so, please explain.

 For this patch I am interested to go with delta encoding approach based on
 column boundaries.

Fair enough.

  If you conclude that finding sub-column similarity is not worthwhile, at
  least
  teach your algorithm to aggregate runs of changing or unchanging columns
  into
  fewer delta instructions.  If a table contains twenty unchanging bool
  columns,
  you currently use at least 80 bytes to encode that fact.  By treating
  the run
  of columns as a unit for delta encoding purposes, you could encode it in
  23
  bytes.  
 
 Do you mean to say handle for non-continuous unchanged columns?

My statement above was a mess.

 I believe for continuous unchanged columns its already handled until there
 are any alignment changes. Example
 
 create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7
 bool, 
  f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool, 
  f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
 
  f20 bool, f21 bool); 
 
 insert into tbl values(10,
 't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
 't'); 
 
 update tbl set f1 = 20; 
 
 The delta algorithm for the above operation reduced the size of the tuple
 from 24 bytes to 12 bytes. 
 
 4 bytes - IGN command and LEN 
 4 bytes - ADD command and LEN 
 4 bytes - Data block 

I now see that this case is already handled.  Sorry for the noise.
Incidentally, I tried this variant:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
 f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
 f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
 f20 bool, f21 bool, f22 int, f23 int);
insert into tbl values(1,
't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
't', 2, 3);
update tbl set f1 = 2, f22 = 4, f23 = 6;

It yielded an erroneous delta: IGN 4, ADD 4, COPY 24, IGN 4, ADD 4, COPY 28,
IGN 4, ADD 4.  (The delta happens to be longer than the data and goes unused).

nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical to physical page mapping

2012-10-27 Thread Heikki Linnakangas

On 27.10.2012 16:43, Tom Lane wrote:

Jan Wieckjanwi...@yahoo.com  writes:

The reason why we need full_page_writes is that we need to guard against
torn pages or partial writes. So what if smgr would manage a mapping
between logical page numbers and their physical location in the relation?



At the moment where we today require a full page write into WAL, we
would mark the buffer as needs relocation. The smgr would then write
this page into another physical location whenever it is time to write it
(via the background writer, hopefully). After that page is flushed, it
would update the page location pointer, or whatever we want to call it.
A thus free'd physical page location can be reused, once the location
pointer has been flushed to disk. This is a critical ordering of writes.
First the page at the new location, second the pointer to the current
location. Doing so would make write(2) appear atomic to us, which is
exactly what we need for crash recovery.


Hmm, aka copy-on-write.


I think you're just moving the atomic-write problem from the data pages
to wherever you keep these pointers.


If the pointers are stored as simple 4-byte integers, you probably could 
assume that they're atomic, and won't be torn.


There's a lot of practical problems in adding another level of 
indirection to every page access, though. It'll surely add some overhead 
to every access, even if the data never changes. And it's not at all 
clear to me that it would perform better than full_page_writes. You're 
writing and flushing out roughly the same amount of data AFAICS.


What exactly is the problem with full_page_writes that we're trying to 
solve?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2012-10-27 Thread Heikki Linnakangas

On 27.10.2012 14:27, Amit Kapila wrote:

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:

In my previous review, I said:

Given [not relying on the executor to know which columns changed],
why not
treat the tuple as an opaque series of bytes and not worry about
datum
boundaries?  When several narrow columns change together, say a
sequence
of sixteen smallint columns, you will use fewer binary delta
commands by
representing the change with a single 32-byte substitution.  If an
UPDATE
changes just part of a long datum, the delta encoding algorithm
will still
be able to save considerable space.  That case arises in many
forms:
changing one word in a long string, changing one element in a long
array,
changing one field of a composite-typed column.  Granted, this
makes the
choice of delta encoding algorithm more important.

We may be leaving considerable savings on the table by assuming that
column
boundaries are the only modified-range boundaries worth recognizing.
What is
your willingness to explore general algorithms for choosing such
boundaries?
Such an investigation may, of course, be a dead end.


For this patch I am interested to go with delta encoding approach based on
column boundaries.

However I shall try to do it separately and if it gives positive results
then I will share with hackers.
I will try with VCDiff once or let me know if you have any other algorithm
in mind.


One idea is to use the LZ format in the WAL record, but use your 
memcmp() code to construct it. I believe the slow part in LZ compression 
is in trying to locate matches in the history, so if you just replace 
that with your code that's aware of the column boundaries and uses 
simple memcmp() to detect what parts changed, you could create LZ 
compressed output just as quickly as the custom encoded format. It would 
leave the door open for making the encoding smarter or to do actual 
compression in the future, without changing the format and the code to 
decode it.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical to physical page mapping

2012-10-27 Thread Claudio Freire
On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 I think you're just moving the atomic-write problem from the data pages
 to wherever you keep these pointers.


 If the pointers are stored as simple 4-byte integers, you probably could
 assume that they're atomic, and won't be torn.

That could backfire.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical to physical page mapping

2012-10-27 Thread Tom Lane
Claudio Freire klaussfre...@gmail.com writes:
 On Sat, Oct 27, 2012 at 3:41 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 I think you're just moving the atomic-write problem from the data pages
 to wherever you keep these pointers.

 If the pointers are stored as simple 4-byte integers, you probably could
 assume that they're atomic, and won't be torn.

 That could backfire.

Yeah, the potential loss in resiliency is scary.  Assume for the sake of
argument that we're storing these indirection pointers in 8K pages, 2000
or so to the page.  If you get a read failure on a regular heap page,
you've lost 8K worth of data.  If you get a read failure on an
indirection page, you've lost 16MB worth.  (Though perhaps the pointers
could be reconstructed, given additional overhead data on each regular
heap page; but that wouldn't be very cheap.)  Also, the write traffic
on a pointer page is 2000X as much as it is on an average heap page,
offering 2000X the opportunities for the disk hardware to drop a bit,
or for SSD storage to wear down, etc.  (Perhaps it's not that bad for
typical access patterns, but then again perhaps it's worse, given the
already noted strict requirements for write ordering.)  So it seems like
there'd be a rather nasty magnification of consequences for hardware
errors.

You've also got something like a 2000X concentration of access
contention for pointer pages compared to regular pages.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical to physical page mapping

2012-10-27 Thread Gavin Flower

On 28/10/12 07:41, Heikki Linnakangas wrote:

On 27.10.2012 16:43, Tom Lane wrote:

Jan Wieckjanwi...@yahoo.com writes:
The reason why we need full_page_writes is that we need to guard 
against

torn pages or partial writes. So what if smgr would manage a mapping
between logical page numbers and their physical location in the 
relation?



At the moment where we today require a full page write into WAL, we
would mark the buffer as needs relocation. The smgr would then write
this page into another physical location whenever it is time to 
write it

(via the background writer, hopefully). After that page is flushed, it
would update the page location pointer, or whatever we want to call it.
A thus free'd physical page location can be reused, once the location
pointer has been flushed to disk. This is a critical ordering of 
writes.

First the page at the new location, second the pointer to the current
location. Doing so would make write(2) appear atomic to us, which is
exactly what we need for crash recovery.


Hmm, aka copy-on-write.


I think you're just moving the atomic-write problem from the data pages
to wherever you keep these pointers.


If the pointers are stored as simple 4-byte integers, you probably 
could assume that they're atomic, and won't be torn.


There's a lot of practical problems in adding another level of 
indirection to every page access, though. It'll surely add some 
overhead to every access, even if the data never changes. And it's not 
at all clear to me that it would perform better than full_page_writes. 
You're writing and flushing out roughly the same amount of data AFAICS.


What exactly is the problem with full_page_writes that we're trying to 
solve?


- Heikki


Would a 4 byte pointer be adequate for a 64 bit machine with well over 
4GB used by Postgres?



Cheers,
Gavin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] My first patch! (to \df output)

2012-10-27 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


 This was actually kind of anti-climactic, since it only 
 took about 5 minutes to make the change and get it 
 working. Didn't really feel the way I expected it to ;)

Well, we can reject your patch and start bike-shedding 
it for the next four months, if that makes you feel better! :)

Congrats!

- -- 
Greg Sabino Mullane g...@turnstep.com
PGP Key: 0x14964AC8 201210271914
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAlCMau4ACgkQvJuQZxSWSshdoQCg6eJ14LLcJrn04rN2/efO14iz
swgAoPbBSv8PAre6qtVrRH3LL/iNQqeD
=m/ns
-END PGP SIGNATURE-




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers