Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-09-27 Thread Amit Kapila
 On Thursday, September 27, 2012 9:12 AM Noah Misch wrote:
 On Mon, Sep 24, 2012 at 10:57:02AM +, Amit kapila wrote:
  Rebased version of patch based on latest code.
 
 I like the direction you're taking with this patch; the gains are
 striking, especially considering the isolation of the changes.

  Thank you for a detailed review of the patch.

 You cannot assume executor-unmodified columns are also unmodified from
 heap_update()'s perspective.  Expansion in one column may instigate TOAST
 compression of a logically-unmodified column, and that counts as a change
 for xlog delta purposes.  You do currently skip the optimization for
 relations having a TOAST table, but TOAST compression can still apply.
 Observe this with text columns of storage mode PLAIN.  I see two ways out:
 skip the new behavior when need_toast=true, or compare all inline column
 data, not just what the executor modified.  One can probably construct a
 benchmark favoring either choice.  I'd lean toward the latter; wide tuples
 are the kind this change can most help.  If the marginal advantage of
 ignoring known-unmodified columns proves important, we can always bring it
 back after designing a way to track which columns changed in the toaster.

You are right that it can give benefit both ways, but we should also see
which approach gives better results for most scenarios.
In most cases of Update that I have observed, the change in values does not
increase the length of the value by much.
OTOH I am not sure; maybe there are many more scenarios which change the
length of the updated value and can lead to the scenario explained by you above.

 
 Given that, why not treat the tuple as an opaque series of bytes and not
 worry about datum boundaries?  When several narrow columns change together,
 say a sequence of sixteen smallint columns, you will use fewer binary delta
 commands by representing the change with a single 32-byte substitution.  If
 an UPDATE changes just part of a long datum, the delta encoding algorithm
 will still be able to save considerable space.  That case arises in many
 forms: changing one word in a long string, changing one element in a long
 array, changing one field of a composite-typed column.  Granted, this makes
 the choice of delta encoding algorithm more important.
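
 [A minimal C sketch of this opaque-bytes idea, with all names invented for
 illustration: find the longest common prefix and suffix of the old and new
 tuple images and emit whatever lies between as a single substitution, so
 sixteen adjacent changed smallints collapse into one 32-byte command:]

#include <stddef.h>
#include <stdint.h>

typedef struct DeltaSub
{
    uint16_t    offset;         /* where the substitution starts */
    uint16_t    old_len;        /* bytes replaced in the old image */
    uint16_t    new_len;        /* bytes taken from new_data */
    const uint8_t *new_data;    /* points into the new tuple image */
} DeltaSub;

/* Compute a single-substitution delta; returns 0 if the images match. */
static int
delta_single_sub(const uint8_t *oldp, size_t oldlen,
                 const uint8_t *newp, size_t newlen,
                 DeltaSub *sub)
{
    size_t      prefix = 0;
    size_t      suffix = 0;
    size_t      maxcommon = (oldlen < newlen) ? oldlen : newlen;

    while (prefix < maxcommon && oldp[prefix] == newp[prefix])
        prefix++;
    while (suffix < maxcommon - prefix &&
           oldp[oldlen - 1 - suffix] == newp[newlen - 1 - suffix])
        suffix++;

    if (oldlen == newlen && prefix == oldlen)
        return 0;               /* identical images, nothing to log */

    sub->offset = (uint16_t) prefix;
    sub->old_len = (uint16_t) (oldlen - prefix - suffix);
    sub->new_len = (uint16_t) (newlen - prefix - suffix);
    sub->new_data = newp + prefix;
    return 1;
}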
 
 Like Heikki, I'm left wondering why your custom delta encoding is preferable
 to an encoding from the literature.  Your encoding has much in common with
 VCDIFF, even sharing two exact command names.  If a custom encoding is the
 right thing, code comments or a README section should at least discuss the
 advantages over an established alternative.  Idle thought: it might pay off
 to use 1-byte sizes and offsets most of the time.  Tuples shorter than 256
 bytes are common; for longer tuples, we can afford wider offsets.
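
 [A sketch of that idle thought, again with invented names: pick 1-byte or
 2-byte offset/length fields per tuple, recording the choice in a flag bit
 of the command byte so the decoder knows which width to read back:]

#include <stdint.h>

#define DELTA_CMD_WIDE 0x80         /* hypothetical flag: 2-byte fields */

static uint8_t *
emit_field(uint8_t *dst, uint16_t value, int wide)
{
    if (wide)
    {
        *dst++ = (uint8_t) (value >> 8);
        *dst++ = (uint8_t) value;
    }
    else
        *dst++ = (uint8_t) value;   /* caller guarantees value <= 255 */
    return dst;
}

static uint8_t *
emit_sub_command(uint8_t *dst, uint16_t offset, uint16_t length,
                 uint16_t tuple_len)
{
    int         wide = (tuple_len > 255);

    *dst++ = (uint8_t) (wide ? DELTA_CMD_WIDE : 0);     /* command byte */
    dst = emit_field(dst, offset, wide);
    dst = emit_field(dst, length, wide);
    return dst;                     /* 'length' data bytes follow */
}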

My apprehension was that it could affect performance if we do more work
while holding the lock. 
If we use any standard technique like LZ or VCDIFF, it has the overhead of
comparison and other things pertaining to its algorithm. 
However, using the updated patch by Heikki, I can run the various
performance tests both for the update operation and for recovery.

With Regards,
Amit Kapila.





Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-30 Thread Robert Haas
On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila amit.kap...@huawei.com wrote:
 I think the property that recovery only needs to worry about each
 block individually is one that we want to preserve.  Supporting this
 optimization only when full_page_writes=off seems ugly,

 I think recovery needs to worry about multiple blocks as well in some cases.
 Please see the below case and correct me if I am wrong.
 I think currently also there can be problems in case of full_page_writes=off
 for crash recovery.
 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
 page 2.
 2. Page 1 is partially written to disk.
 3. During recovery, it can appear that there is no need to update XMAX
 and other related things in the old tuple, as the page LSN is greater than
 the WAL record's LSN.
 4. Now also there can be other problems related to tuple visibility.

Well, you're only supposed to turn full_page_writes=off if partial
page writes are impossible on your system.  If you turn off
full_page_writes on a system where partial page writes are impossible,
then you've intentionally broken crash recovery, and you get to keep
both pieces.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-30 Thread Amit Kapila
On Thursday, August 30, 2012 11:23 PM Robert Haas
[mailto:robertmh...@gmail.com] wrote:
On Fri, Aug 10, 2012 at 1:25 AM, Amit Kapila amit.kap...@huawei.com wrote:
 I think the property that recovery only needs to worry about each
 block individually is one that we want to preserve.  Supporting this
 optimization only when full_page_writes=off seems ugly,

 I think recovery needs to worry about multiple blocks as well in some cases.
 Please see the below case and correct me if I am wrong.
 I think currently also there can be problems in case of full_page_writes=off
 for crash recovery.
 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
 page 2.
 2. Page 1 is partially written to disk.
 3. During recovery, it can appear that there is no need to update XMAX
 and other related things in the old tuple, as the page LSN is greater than
 the WAL record's LSN.
 4. Now also there can be other problems related to tuple visibility.

 Well, you're only supposed to turn full_page_writes=off if partial
 page writes are impossible on your system.  If you turn off
 full_page_writes on a system where partial page writes are impossible,

  I think you mean to say full_page_writes on a system where partial page
writes are possible, because if partial page writes are impossible then the
user should keep full_page_writes = off.

 then you've intentionally broken crash recovery, and you get to keep
 both pieces.

  Robert, broadly I got your and Simon's idea that we should do the WAL
optimization (reduction) only when the update happens on the same page. I
have implemented the final patch, which does the WAL optimization only when
the updated tuple is on the same page. Also, we have observed that with
fillfactor 80 the performance improvement is good.

With Regards,
Amit Kapila.






Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-28 Thread Amit kapila

On August 27, 2012 7:00 PM Amit Kapila wrote:
On August 27, 2012 5:58 PM Heikki Linnakangas wrote:
On 27.08.2012 15:18, Amit kapila wrote:
 I have implemented the WAL Reduction Patch for the case of HOT Update as


 Let's do it for HOT updates only. Simon & Robert made good arguments on
 why this is a bad idea for non-HOT updates.

 Okay, I shall do it that way.
 So now I shall send information about all the testing I have done for this
 patch and then upload it in the CF.

Test scenarios are below, and test cases for the same are attached with this mail.

Scenario 1: 
Recover the data where the field data is updated with a value different from 
the existing data of an integer field. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update the integer field with a value other than the existing data. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 2: 
Recover the data where the field data is updated with values different from 
the existing data of char and varchar fields. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update both char and varchar fields with values other than the existing data. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 3: 
Recover the data where the field data is updated with a NULL value in place 
of the existing data of a field. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update a field with a NULL value. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 4: 
Recover the data where the field data is updated with a proper value in place 
of the existing data of a field, where the row contains NULL data. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update a field with a value different from the existing data. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 5: 
Recover the data where all fields' data is updated with NULL values in place 
of the existing data of the fields. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update all fields with NULL values. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 6: 
Recover the data of an updated field of a table where the table contains a 
toast table. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update a field with a value different from the existing data. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 7: 
Recover the data of an updated field of a table where the row length is less 
than 128 bytes. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Update a field with a value different from the existing data. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 8: 
Recover the data of an updated field of a table where a before trigger 
modifies the tuple before the tuple updates. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. Create a before trigger which modifies the same record. 
3. Update a field with a value different from the existing data. 
4. Shut down the server immediately. 
5. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The updated data should be present in the table after database recovery. 

Scenario 9: 
Recover the data where the update operation fails because the trigger returns 
NULL. 
Steps: 
1. Start the server, create a table, insert one record into the table. 
2. The update of a field fails as the before trigger returns NULL. 
3. Shut down the server immediately. 
4. Start the server, connect the client, and check the data in the table. 
Expected behavior: 
The update command should not be effective after recovery either. 




With Regards,
Amit Kapila.

-- Test case 1
drop table if exists tbl;
create table tbl(f1 int, f2 varchar(100), f3 float8, f4 char(200));
insert into tbl values(1,'hari',2.1,'test');
checkpoint;

-- first update is as it creates a 

Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-27 Thread Heikki Linnakangas

On 27.08.2012 15:18, Amit kapila wrote:

I have implemented the WAL reduction patch for the case of HOT update, as 
pointed out by Simon and Robert. This patch only uses the optimized WAL in 
the case of a HOT update, with the other restrictions the same as in the 
previous patch.

The performance numbers for this patch are attached in this mail. It has 
improved by 90% if the page has fillfactor 80.

Now going forward I have the following options:
a. Upload the patch in the Open CF for WAL reduction which contains the 
reduction for HOT and non-HOT updates.
b. Upload the patch in the Open CF for WAL reduction which contains the 
reduction for HOT updates.
c. Upload both the patches as different versions.


Let's do it for HOT updates only. Simon & Robert made good arguments on 
why this is a bad idea for non-HOT updates.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-27 Thread Amit Kapila
From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] 
Sent: Monday, August 27, 2012 5:58 PM
To: Amit kapila
On 27.08.2012 15:18, Amit kapila wrote:
 I have implemented the WAL reduction patch for the case of HOT update, as
pointed out by Simon and Robert. This patch only uses the optimized WAL in
the case of a HOT update, with the other restrictions the same as in the
previous patch.

 The performance numbers for this patch are attached in this mail. It has
improved by 90% if the page has fillfactor 80.

 Now going forward I have the following options:
 a. Upload the patch in the Open CF for WAL reduction which contains the
reduction for HOT and non-HOT updates.
 b. Upload the patch in the Open CF for WAL reduction which contains the
reduction for HOT updates.
 c. Upload both the patches as different versions.

 Let's do it for HOT updates only. Simon & Robert made good arguments on 
 why this is a bad idea for non-HOT updates.

Okay, I shall do it that way. 
So now I shall send information about all the testing I have done for this
patch and then upload it in the CF.

With Regards,
Amit Kapila.





Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-23 Thread Bruce Momjian
On Wed, Aug 22, 2012 at 07:38:33PM +0530, Amit Kapila wrote:
 I had made sure no full-page write happens by making the checkpoint
 interval and checkpoint segments longer.

 Original code - 1.8G    Modified code - 1.1G    Diff - 63% reduction, in
 case of fill factor 100.
 Original code - 1.6G    Modified code - 1.1G    Diff - 45% reduction, in
 case of fill factor 80.

 I am still in the process of collecting data for synchronous commit mode.

Wow, that sounds promising.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-23 Thread Amit Kapila
From: Bruce Momjian [mailto:br...@momjian.us] 
Sent: Friday, August 24, 2012 2:12 AM
On Wed, Aug 22, 2012 at 07:38:33PM +0530, Amit Kapila wrote:
 I had made sure no full-page write happens by making the checkpoint
 interval and checkpoint segments longer.

 Original code - 1.8G    Modified code - 1.1G    Diff - 63% reduction, in
 case of fill factor 100.
 Original code - 1.6G    Modified code - 1.1G    Diff - 45% reduction, in
 case of fill factor 80.

 I am still in the process of collecting data for synchronous commit mode.

 Wow, that sounds promising.
  Thank you.

Right now I am collecting the data for synchronous_commit = on mode. My
initial observation is that in case fsync is off, the results are good
(around 50% performance improvement). However, if fsync is on, the
performance results fall to 3~5%. I am not sure why, even though the data
for I/O is reduced, there is still no big performance gain as in the case of
synchronous_commit = off or when fsync is off.

I am trying different settings of the wal_sync_method parameter and setting
some value of commit_delay, as suggested by Peter Geoghegan in one of his
mails.

Please suggest if anyone has any thoughts on what kinds of parameters are
best for such a use case, or let me know if I am missing anything and
whether this kind of performance improvement can only improve performance
for the fsync = off case.

With Regards,
Amit Kapila.





Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-22 Thread Amit Kapila
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
Sent: Wednesday, August 22, 2012 8:34 AM
From: Jesper Krogh [mailto:jes...@krogh.cc] 
Sent: Wednesday, August 22, 2012 1:13 AM
On 21/08/12 16:57, Amit kapila wrote: 

Test results: 

1. The pgbench test was run for 10 min. 
2. The test result is for a modified pgbench (such that the total row size
is 1800 and the updated columns are of length 300) tpc-b testcase. 
   The result and the modified pgbench code are attached with this mail. 
3. The performance improvement shown on the m/c I have tested is quite good
(more than 100% for sync commit = off).

 I cannot comment on completeness or correctness of the code, but I do
 think a relevant test would be to turn synchronous_commit on as default. 

 Even though you aim at improved performance, it would be nice to see the
 reduction in WAL-size as an effect of this patch. 

 Yes, I shall take care of doing both the above tests and send the report.

The data for WAL reduction is as below:

The number of transactions processed is 16000, doing updates only of size
250 bytes with a record size of 1800. 

I had made sure no full-page write happens by making the checkpoint interval
and checkpoint segments longer.

Original code - 1.8G    Modified code - 1.1G    Diff - 63% reduction, in
case of fill factor 100. 
Original code - 1.6G    Modified code - 1.1G    Diff - 45% reduction, in
case of fill factor 80. 

I am still in the process of collecting data for synchronous commit mode.

Please let me know what additional data would be helpful to indicate the
benefits of this implementation.



With Regards,

Amit Kapila.

 



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-21 Thread Amit Kapila
From: Jesper Krogh [mailto:jes...@krogh.cc] 
Sent: Wednesday, August 22, 2012 1:13 AM
On 21/08/12 16:57, Amit kapila wrote: 

Test results: 

1. The pgbench test was run for 10 min. 
2. The test result is for a modified pgbench (such that the total row size
is 1800 and the updated columns are of length 300) tpc-b testcase. 
   The result and the modified pgbench code are attached with this mail. 
3. The performance improvement shown on the m/c I have tested is quite good
(more than 100% for sync commit = off).

 I cannot comment on completeness or correctness of the code, but I do
 think a relevant test would be to turn synchronous_commit on as default. 

 Even though you aim at improved performance, it would be nice to see the
 reduction in WAL-size as an effect of this patch. 

Yes, I shall take care of doing both the above tests and send the report.

 

With Regards,

Amit Kapila.



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Simon Riggs
On 3 August 2012 12:46, Amit kapila amit.kap...@huawei.com wrote:

 Frame the new tuple from old tuple and WAL record:

Sounds good.

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

You don't mention whether or not the old and the new tuple are on the
same data block.

Personally, I think it will be important to ensure the above,
otherwise recovery will require much additional code for that case.
And that code will be prone to race conditions and performance issues.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns. I think I'd want
them to be there for debugging purposes so we can prove this code is
correct in production, since otherwise this could be a source of data
loss bugs.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Amit Kapila
From: Simon Riggs [mailto:si...@2ndquadrant.com] 
Sent: Thursday, August 09, 2012 12:36 PM
On 3 August 2012 12:46, Amit kapila amit.kap...@huawei.com wrote:

 Frame the new tuple from old tuple and WAL record:

 Sounds good.
  Thanks.

 I'd suggest we do this only when the saving is large enough for
 benefit, rather than do this every time.  
  Do you mean to say when the length of the updated values of the tuple is
less than some threshold (1/3 or 2/3, etc.) of the total length?


 You don't mention whether or not the old and the new tuple are on the
 same data block.

  WAL reduction is done even for the case when the old and new tuples are on
different data blocks.

 Personally, I think it will be important to ensure the above,
 otherwise recovery will require much additional code for that case.


  Currently also, recovery handles the case when the old and new tuples are
on different pages, such that it has to read the old page to get the old
tuple.

  The modifications need to ensure handling of the following cases:

  a. When there is a backup block, and the old and new tuples are on
different pages.
     Currently it doesn't read the old page;
     however, the new implementation needs to read the old page for this
case also.

  b. When the changes are already applied on the page [line: if
(XLByteLE(lsn, PageGetLSN(page))); function: heap_xlog_update].
     Currently it doesn't read the old page;
     however, the new implementation needs to read the old page for this
case also.
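
  [For context, a self-contained sketch of the "already applied" LSN rule
referenced in case b, with simplified stand-in types, not PostgreSQL code.
The point is that, with delta encoding, skipping redo for the new page no
longer removes the need to read the old page:]

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* simplified stand-in type */

typedef struct PageStub
{
    XLogRecPtr  lsn;                /* LSN of the last record applied */
} PageStub;

/*
 * A record whose LSN is <= the page LSN has already been applied to that
 * page and can normally be skipped during redo.  With a delta-encoded
 * update, the old page must still be read to reconstruct the new tuple.
 */
static bool
record_already_applied(const PageStub *page, XLogRecPtr record_lsn)
{
    return record_lsn <= page->lsn;
}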

 And that code will be prone to race conditions and performance issues.

  Are you referring to performance issues because now we may need to read
the old page in some cases where earlier it was not read?
  If yes, then as I have mentioned above, in my view the above two cases are
not very usual.
  However, the benefit for the update operation on a running server is good
enough, as it reduces the WAL volume.
  If you mean something other than the above, then please suggest.


 Please also bear in mind that Andres will be looking to include the PK
 columns in every WAL record for BDR. That could be an option, but I
 doubt there is much value in excluding PK columns. 

  Agreed. However, once the implementation by Andres is done I can merge
both changes and take the performance data again, based on which we can take
a decision.


With Regards,
Amit Kapila.




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Simon Riggs
On 9 August 2012 09:49, Amit Kapila amit.kap...@huawei.com wrote:

 I'd suggest we do this only when the saving is large enough for
 benefit, rather than do this every time.
   Do you mean to say that when length of updated values of tuple is less
 than some threshold(1/3 or 2/3, etc..) value of
   total length?

Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
removal of rows in all cases would not be worth it, so we need a fast
path way of saying let's just take all of the columns.
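
[One way such a heuristic might look — a sketch with a made-up cutoff, in
the spirit of TOAST's threshold: attempt the delta only when it saves a
meaningful fraction of the tuple, else fall back to logging the whole new
tuple:]

#include <stdbool.h>
#include <stddef.h>

#define WAL_DELTA_MIN_SAVING_PCT 25     /* hypothetical cutoff */

/*
 * Fast-path check: encode a delta only if the changed bytes amount to at
 * most (100 - WAL_DELTA_MIN_SAVING_PCT)% of the new tuple; otherwise just
 * take all of the columns, as suggested above.
 */
static bool
wal_delta_worthwhile(size_t new_tuple_len, size_t changed_len)
{
    return changed_len * 100 <=
           new_tuple_len * (100 - WAL_DELTA_MIN_SAVING_PCT);
}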

 You don't mention whether or not the old and the new tuple are on the
 same data block.

   WAL reduction is done for the case even when old and new are on different
 data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

 Please also bear in mind that Andres will be looking to include the PK
 columns in every WAL record for BDR. That could be an option, but I
 doubt there is much value in excluding PK columns.

   Agreed. However once the implementation by Andres is done I can merge both
 codes and
   take the performance data again, based on which we can take decision.

It won't happen like that because there won't be a single point where
Andres is done. If you agree, then its worth doing it that way to
begin with, rather than requiring us to revisit the same section of
code twice.

One huge point that needs to be thought through is how we prove this
code actually works on WAL/recovery side. A normal regression test
won't prove that and we don't have a framework in place for that.

If you think about what you'll need to do to prove you haven't made
some fatal corruption of WAL, its going to look a lot like logical
replication tests. Worst case here is that mistakes on this patch will
show up as Andres' mistakes. So there is a stronger connection to
Andres' work than it first appears.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Heikki Linnakangas

On 09.08.2012 12:18, Simon Riggs wrote:

On 9 August 2012 09:49, Amit Kapila amit.kap...@huawei.com wrote:


   WAL reduction is done for the case even when old and new are on different
data blocks as well.


That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.


That was my first instinctive reaction too. But if the mechanism works 
just as well for cross-page updates, it seems a bit strange not to use it.


One argument would be that if for some reason the old block is corrupt 
or lost, you would not be able to recover the new version of the tuple 
from the WAL alone. At the moment, it's nice that the WAL record 
contains all the information required to reconstruct the new tuple, 
regardless of the old data block contents. But then again, full-page 
writes cover that too. There will be a full-page image of the old block 
in the WAL anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Simon Riggs
On 9 August 2012 11:30, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 09.08.2012 12:18, Simon Riggs wrote:

 On 9 August 2012 09:49, Amit Kapila amit.kap...@huawei.com wrote:

WAL reduction is done for the case even when old and new are on
 different
 data blocks as well.


 That makes me feel nervous. I doubt the marginal gain is worth it.
 Most updates don't cross blocks.


 That was my first instinctive reaction too. But if the mechanism works just
 as well for cross-page updates, seems a bit strange to not use it.

 One argument would be that if for some reason the old block is corrupt or
 lost, you would not be able to recover the new version of the tuple from the
 WAL alone. At the moment, it's nice that the WAL record contains all the
 information required to reconstruct the new tuple, regardless of the old
 data block contents.

Exactly. If we lose the first block in a checkpoint, we could lose all
updates to rows in that page and all other pages linked to it over a
whole checkpoint duration. Basically, page corruption will propagate
from block to block if we do this.

Given the marginal gain because of a low percentage of cross-block
updates, I'm not keen. Low percentage because HOT tries hard to keep
things on same block - even for non-HOT updates (which is the case,
even though it sounds weird).

 But then again, full-page writes cover that too. There
 will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Amit Kapila
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Simon Riggs
Sent: Thursday, August 09, 2012 2:49 PM
On 9 August 2012 09:49, Amit Kapila amit.kap...@huawei.com wrote:

 I'd suggest we do this only when the saving is large enough for
 benefit, rather than do this every time.
   Do you mean to say that when length of updated values of tuple is less
 than some threshold(1/3 or 2/3, etc..) value of
   total length?

 Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
 removal of rows in all cases would not be worth it, so we need a fast
 path way of saying let's just take all of the columns.

  Yes, it has to be done. Currently I have 2 ideas to take care of this:
  a. Based on the number of updated columns
  b. Based on the length of the updated values
  If you have any other idea, or you favor one of the above, let me know
your opinion.

 You don't mention whether or not the old and the new tuple are on the
 same data block.

   WAL reduction is done for the case even when old and new are on
different
 data blocks as well.

 That makes me feel nervous. I doubt the marginal gain is worth it.
 Most updates don't cross blocks.

How can it be proved whether the gain is marginal or substantial enough to
handle the case?

One way is to test after modification:
I have updated the pgbench tpc_b case:
1. The schema is such that it contains 1800-length rows
2. tpc_b only has updates
3. The length of the updated column values is 300
4. All tables have 100% fill factor
5. Vacuum is OFF

So in such a run, I think many updates should be across blocks. But I am not
sure; I have not verified it in any way.
The above run has given a good performance improvement.



 Please also bear in mind that Andres will be looking to include the PK
 columns in every WAL record for BDR. That could be an option, but I
 doubt there is much value in excluding PK columns.

   Agreed. However once the implementation by Andres is done I can merge
both
 codes and
   take the performance data again, based on which we can take decision.

 It won't happen like that because there won't be a single point where
 Andres is done. If you agree, then its worth doing it that way to
 begin with, rather than requiring us to revisit the same section of
 code twice.

This optimization is to reduce the amount of WAL, and definitely adding
anything extra will have some impact. 
However, if there is no better way other than including the PK in WAL, then
I don't have any problem.

 One huge point that needs to be thought through is how we prove this
 code actually works on WAL/recovery side. A normal regression test
 won't prove that and we don't have a framework in place for that.

My initial idea to validate recovery:
1. Manual test:
   a. Generate enough scenarios for the update operation.
   b. For each scenario, make sure replay happens properly.
2. Community review.



With Regards,
Amit Kapila.




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Heikki Linnakangas

On 09.08.2012 14:11, Simon Riggs wrote:

Given the marginal gain because of a low percentage of cross-block
updates, I'm not keen. Low percentage because HOT tries hard to keep
things on same block - even for non-HOT updates (which is the case,
even though it sounds weird).


That depends entirely on the workload. If you do a bulk update that 
updates every row on the table, most are going to be cross-block 
updates, and the WAL size does matter.



But then again, full-page writes cover that too. There
will be a full-page image of the old block in the WAL anyway.


Right, but we're planning to remove that, so its not a safe assumption
to use when building new code.


I don't think we're going to get rid of full-page images any time soon. 
I guess you could easily check if full-page writes are enabled, though, 
and only do it for cross-page updates if it is.
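
[That check could be as simple as this sketch, with hypothetical names:
gate the cross-page case on full-page writes, since redo can then always
restore the old block:]

#include <stdbool.h>

/*
 * Same-page updates can always use the delta: redo has that block in hand.
 * Cross-page updates use it only while full-page writes guarantee the old
 * block is recoverable after a torn write.
 */
static bool
can_use_wal_delta(bool same_page, bool full_page_writes)
{
    return same_page || full_page_writes;
}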


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Simon Riggs
On 9 August 2012 12:17, Amit Kapila amit.kap...@huawei.com wrote:

 This optimization is to reduce the amount of WAL and definitely adding
 anything extra will have some impact.

Of course. The question is "how much impact?". Each tweak has
progressively less and less gain. This isn't a binary choice.

Squeezing the last ounce of performance at the expense of all other
concerns is not a sensible goal, IMHO, nor do we attempt that
elsewhere.

Given we're making no attempt to remove full page writes, which is
clearly the biggest source of WAL volume currently, micro optimisation
of other factors seems unwarranted at this stage.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Amit Kapila
From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] 
Sent: Thursday, August 09, 2012 4:59 PM
On 09.08.2012 14:11, Simon Riggs wrote:

 But then again, full-page writes cover that too. There
 will be a full-page image of the old block in the WAL anyway.

 Right, but we're planning to remove that, so it's not a safe assumption
 to use when building new code.

 I don't think we're going to get rid of full-page images any time soon. 
 I guess you could easily check if full-page writes are enabled, though, 
 and only do it for cross-page updates if it is.

According to my understanding, you are talking about corruption due to
partial page writes, which can be handled by the full-page image in WAL.
Correct me if I misunderstood.
Based on that, even if the full-page image is removed, the case of
corrupt-page handling will be covered by double buffer writes [an
alternative solution to full-page writes for some of the paths].

With Regards,
Amit Kapila.





Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Heikki Linnakangas

On 09.08.2012 15:56, Amit Kapila wrote:

From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com]
Sent: Thursday, August 09, 2012 4:59 PM
On 09.08.2012 14:11, Simon Riggs wrote:


But then again, full-page writes cover that too. There
will be a full-page image of the old block in the WAL anyway.



Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.



I don't think we're going to get rid of full-page images any time soon.
I guess you could easily check if full-page writes are enabled, though,
and only do it for cross-page updates if it is.


According to my understanding, you are talking about corruption due to
partial page writes, which can be handled by the full-page image in WAL.
Correct me if I misunderstood.


I meant corruption caused by anything, like disk failure, bugs, cosmic 
rays, etc. The point is that currently the WAL record contains all the 
information required to reconstruct the old tuple. With a diff method, 
that's no longer the case, so if the old tuple gets corrupt for whatever 
reason, that error will be propagated to the new tuple.


It's not an issue as long as everything works correctly, but some 
redundancy is nice when you're trying to resurrect a corrupt database. 
That's what we're talking about here. That said, I don't think it's a 
big deal for this patch, at least not as long as full-page writes are 
enabled.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Amit Kapila
From: Simon Riggs [mailto:si...@2ndquadrant.com] 
Sent: Thursday, August 09, 2012 5:29 PM
On 9 August 2012 12:17, Amit Kapila amit.kap...@huawei.com wrote:

 This optimization is to reduce the amount of WAL and definitely adding
 anything extra will have some impact.

 Of course. The question is How much impact?. Each tweak has
 progressively less and less gain. This isn't a binary choice.

 Squeezing the last ounce of performance at the expense of all other
 concerns is not a sensible goal, IMHO, nor do we attempt that
 elsewhere.

 Given we're making no attempt to remove full page writes, which is
 clearly the biggest source of WAL volume currently, micro optimisation
 of other factors seems unwarranted at this stage.

What I am pointing at with WAL reduction is update operation performance,
and full-page writes don't have a direct correlation with the update
operation, except for the case of the first update of a page after a
checkpoint.

With Regards,
Amit Kapila.





Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Robert Haas
On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
 etc. The point is that currently the WAL record contains all the information
 required to reconstruct the old tuple. With a diff method, that's no longer
 the case, so if the old tuple gets corrupt for whatever reason, that error
 will be propagated to the new tuple.

 It's not an issue as long as everything works correctly, but some redundancy
 is nice when you're trying to resurrect a corrupt database. That's what
 we're talking about here. That said, I don't think it's a big deal for this
 patch, at least not as long as full-page writes are enabled.

So suppose that the following sequence of events occurs:

1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on page 2.
2. The table is vacuumed, removing tuple A.
3. Page 1 is written durably to disk.
4. Crash.

If reconstructing tuple B requires possession of tuple A, it seems
that we are now screwed.

No?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Heikki Linnakangas

On 09.08.2012 19:39, Robert Haas wrote:

On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
etc. The point is that currently the WAL record contains all the information
required to reconstruct the old tuple. With a diff method, that's no longer
the case, so if the old tuple gets corrupt for whatever reason, that error
will be propagated to the new tuple.

It's not an issue as long as everything works correctly, but some redundancy
is nice when you're trying to resurrect a corrupt database. That's what
we're talking about here. That said, I don't think it's a big deal for this
patch, at least not as long as full-page writes are enabled.


So suppose that the following sequence of events occurs:

1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on page 2.
2. The table is vacuumed, removing tuple A.
3. Page 1 is written durably to disk.
4. Crash.

If reconstructing tuple B requires possession of tuple A, it seems
that we are now screwed.


Not with full_page_writes=on, as crash recovery will restore the old 
page contents. But you're right, with full_page_writes=off you are screwed.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Robert Haas
On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 So suppose that the following sequence of events occurs:

 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
 page 2.
 2. The table is vacuumed, removing tuple A.
 3. Page 1 is written durably to disk.
 4. Crash.

 If reconstructing tuple B requires possession of tuple A, it seems
 that we are now screwed.

 Not with full_page_writes=on, as crash recovery will restore the old page
 contents. But you're right, with full_page_writes=off you are screwed.

I think the property that recovery only needs to worry about each
block individually is one that we want to preserve.  Supporting this
optimization only when full_page_writes=off seems ugly, and I also
agree with Simon's objection upthread: the current design minimizes
the chances of corruption propagating from block to block.  Even if
the proposed design is bullet-proof as of this moment (at least with
full_page_writes=on) it seems very possible that it could get
accidentally broken by future code changes, leading to hard-to-find
data corruption bugs.  It might also complicate other things that we
will want to do down the line, like parallelizing recovery.

In the pgbench testing I've done, almost all of the updates are HOT,
provided you run the test long enough to reach steady state, so
restricting this optimization to HOT updates shouldn't hurt that case
(or similar real-world cases) very much.  Of course there are probably
also real-world cases where HOT applies only seldom, and those cases
won't get the benefit of this, but you can't win them all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-09 Thread Amit Kapila
From: Robert Haas [mailto:robertmh...@gmail.com] 
Sent: Thursday, August 09, 2012 11:18 PM
On Thu, Aug 9, 2012 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 So suppose that the following sequence of events occurs:

 1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
 page 2.
 2. The table is vacuumed, removing tuple A.
 3. Page 1 is written durably to disk.
 4. Crash.

 If reconstructing tuple B requires possession of tuple A, it seems
 that we are now screwed.

 Not with full_page_writes=on, as crash recovery will restore the old page
 contents. But you're right, with full_page_writes=off you are screwed.

 I think the property that recovery only needs to worry about each
 block individually is one that we want to preserve.  Supporting this
 optimization only when full_page_writes=off seems ugly, 

I think recovery needs to worry about multiple blocks as well in some cases.
Please see the below case and correct me if I am wrong.
I think currently also there can be problems in case of full_page_writes=off
for crash recovery.
1. Tuple A on page 1 is updated.  The new version, tuple B, is placed on
page 2.
2. Page 1 is partially written to disk.
3. During recovery, it can appear that there is no need to update XMAX
and other related things in the old tuple, 
   as the page LSN is greater than the WAL record's LSN.
4. Now also there can be other problems related to tuple visibility.



 and I also
 agree with Simon's objection upthread: the current design minimizes
 the chances of corruption propagating from block to block.  Even if
 the proposed design is bullet-proof as of this moment (at least with
 full_page_writes=on) it seems very possible that it could get
 accidentally broken by future code changes, leading to hard-to-find
 data corruption bugs. It might also complicate other things that we
 will want to do down the line, like parallelizing recovery.

I can see the problem in case we remove the full-page-writes concept and
replace it with some other equivalent concept which doesn't have the current
flexibility.


With Regards,
Amit Kapila.




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-06 Thread Heikki Linnakangas

On 06.08.2012 06:10, Amit Kapila wrote:

Currently the solution for fixed length columns cannot handle the case of
variable length columns and NULLs. The reason is that for fixed length
columns there is no need of a diff mechanism between the old and new tuple;
however, for the other cases it will be required.
For fixed length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, it will be sufficient to do the
replay of WAL. However, to handle the other cases we need to use a diff
mechanism.

Can we do something like: if the changed columns are fixed length and don't
contain NULLs, then store the [OFFSET, LENGTH, VALUE] format in WAL, and for
the other cases store the diff format?

This has the advantage that updates containing only fixed length columns
don't have to pay the penalty of doing a diff between the new and old tuple.
Also, we can do the whole work in 2 parts: one for fixed length columns and
a second to handle the other cases.


Let's keep it simple and use the same diff format for all tuples, at 
least for now. If it turns out that you can indeed get even more gain 
for fixed length tuples by something like that, then let's do that later 
as a separate patch.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-06 Thread Amit Kapila
From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] 
Sent: Monday, August 06, 2012 2:32 PM
To: Amit Kapila
Cc: 'Bruce Momjian'; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for
Update Operation
On 06.08.2012 06:10, Amit Kapila wrote:
 Currently the solution for fixed length columns cannot handle the case of
 variable length columns and NULLS. The reason is for fixed length columns
 there is no need of diff technology between old and new tuple, however
for
 other cases it will be required.
 For fixed length columns, if we just note the OFFSET, LENGTH, VALUE of
 changed columns of new tuple in WAL, it will be sufficient to do the
replay
 of WAL. However to handle other cases we need to use diff mechanism.

 Can we do something like if the changed columns are fixed length and
doesn't
 contain NULL's, then store [OFFSET, LENGTH, VALUE] format in WAL and for
 other cases store diff format.

 This has advantage that for Updates containing only fixed length columns
 don't have to pay penality of doing diff between new and old tuple. Also
we
 can do the whole work in 2 parts, one for fixed length columns and second
to
 handle other cases.

 Let's keep it simple and use the same diff format for all tuples, at 
 least for now. If it turns out that you can indeed get even more gain 
 for fixed length tuples by something like that, then let's do that later 
 as a separate patch.

Okay, I shall first try to design and implement the same format for all
tuples and discuss the results with the community.

With Regards,
Amit Kapila.




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-05 Thread Amit Kapila
From: Bruce Momjian [mailto:br...@momjian.us] 
Sent: Saturday, August 04, 2012 8:06 PM
On Sat, Aug  4, 2012 at 05:21:06PM +0300, Heikki Linnakangas wrote:
 On 04.08.2012 11:01, Amit Kapila wrote:
   One point I missed which needs to be handled is pg_upgrade
 
 I don't think there's anything to do for pg_upgrade. This doesn't
 change the on-disk data format, just the WAL format, and pg_upgrade
 isn't sensitive to WAL format changes.

Correct.

Thanks Bruce and Heikki for this information. 

I need your feedback on the below design point, as it will make my further
work on this performance issue clearer.
Also let me know if the explanation below is not clear; I shall try to use
some examples to explain my point.

Currently the solution for fixed length columns cannot handle the case of
variable length columns and NULLs. The reason is that for fixed length
columns there is no need of a diff mechanism between the old and new tuple;
however, for the other cases it will be required.
For fixed length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, it will be sufficient to do the
replay of WAL. However, to handle the other cases we need to use a diff
mechanism.

Can we do something like: if the changed columns are fixed length and don't
contain NULLs, then store the [OFFSET, LENGTH, VALUE] format in WAL, and for
the other cases store the diff format?

This has the advantage that updates containing only fixed length columns
don't have to pay the penalty of doing a diff between the new and old tuple.
Also, we can do the whole work in 2 parts: one for fixed length columns and
a second to handle the other cases.
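
[A sketch of the replay side of that [OFFSET, LENGTH, VALUE] idea, with an
invented record layout: since every column is fixed width and non-NULL, the
old and new tuple bodies have the same length and the offsets line up
exactly:]

#include <stdint.h>
#include <string.h>

/*
 * Apply a sequence of [offset(2) | length(2) | value(length)] entries to a
 * copy of the old tuple body, yielding the new tuple body.  Assumes the
 * caller has validated all offsets and lengths against the tuple size.
 */
static void
apply_fixed_width_delta(uint8_t *tuple_body,            /* old tuple copy */
                        const uint8_t *wal_data, size_t wal_len)
{
    size_t      pos = 0;

    while (pos + 4 <= wal_len)
    {
        uint16_t    offset;
        uint16_t    length;

        memcpy(&offset, wal_data + pos, sizeof(uint16_t));
        memcpy(&length, wal_data + pos + 2, sizeof(uint16_t));
        pos += 4;
        memcpy(tuple_body + offset, wal_data + pos, length);    /* VALUE */
        pos += length;
    }
}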


With Regards,
Amit Kapila.




Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-04 Thread Amit Kapila
From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] 
Sent: Saturday, August 04, 2012 1:33 AM
On 03.08.2012 14:46, Amit kapila wrote:
 Currently the change is done only for fixed length columns for simple
tables, and the tuple should not contain NULLs.

 This is a proof of concept; the design and implementation need to be
changed based on the final design required for handling the other scenarios.

 Update operation:
 -
 1. Check for a simple table or not (no TOAST, no before-update triggers).
 2. Works only for not-null tuples.
 3. Identify the modified columns from the target entry.
 4. Based on the modified column list, check whether any variable length
columns are modified; if so, this optimization is not applied.
 5. Identify the offset and length for the modified columns and store them
as an optimized WAL tuple in the following format.
 Note: The WAL update header is modified to denote whether the WAL update
optimization is done or not.
  WAL update header + Tuple header (no change from previous format) +
  [offset(2 bytes)] [length(2 bytes)] [changed data value]
  [offset(2 bytes)] [length(2 bytes)] [changed data value]
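
 [The quoted layout, written out as a hypothetical C sketch — field and
 type names are invented for illustration; the real patch would extend the
 existing update record instead:]

#include <stdint.h>

/* One change entry; 'length' bytes of changed data value follow it. */
typedef struct WalUpdateChange
{
    uint16_t    offset;     /* byte offset of the changed column */
    uint16_t    length;     /* length of the new value */
} WalUpdateChange;

/*
 * On-disk shape of the optimized record:
 *   WAL update header (a flag marks the record as delta-optimized)
 *   + tuple header (unchanged from the previous format)
 *   + repeated { WalUpdateChange, value bytes } entries
 */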



 The performance will need to be re-verified after you fix these 
 limitations. Those limitations need to be fixed before this can be applied.

Yes, I agree that the solution should fix these limitations and that the
performance numbers need to be re-verified. 
Currently in my mind the work to be done is as follows:

1. A solution which can handle variable length columns and NULLs.
2. Handling of before triggers.
3. Whether the solution for fixed length columns can be the same as for
variable length columns and NULLs.
4. Make the final code patch which addresses all the above.

Please suggest if there are more things that need to be handled.

For the 3rd point, currently the solution for fixed length columns cannot
handle the case of variable length columns and NULLs. The reason is that for
fixed length columns there is no need of a diff mechanism between the old
and new tuple; however, for the other cases it will be required.
For fixed length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, it will be sufficient to do the
replay of WAL. However, to handle the other cases we need to use a diff
mechanism.

Can we do something like: if the changed columns are fixed length and don't
contain NULLs, then store the [OFFSET, LENGTH, VALUE] format in WAL, and for
the other cases store the diff format?

This has the advantage that updates containing only fixed length columns
don't have to pay the penalty of doing a diff between the new and old tuple.
Also, we can do the whole work in 2 parts: one for fixed length columns and
a second to handle the other cases. 


 It would be nice to use some well-known binary delta algorithm for this, 
 rather than invent our own. OTOH, we have more knowledge of the 
 attribute boundaries, so a custom algorithm might work better. 

I shall work on this and post after initial work.

 In any case, I'd like to see the code that does the delta encoding/decoding 
 put into separate functions, outside of heapam.c. It would be good for 
 readability, and we might want to reuse this in other places too.

Agreed. I shall take care of doing it in suggested way.


With Regards,
Amit Kapila.





[HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-04 Thread Amit Kapila
One point I missed which needs to be handled is pg_upgrade

-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
Sent: Saturday, August 04, 2012 12:12 PM
To: 'Heikki Linnakangas'
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for
Update Operation

From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] 
Sent: Saturday, August 04, 2012 1:33 AM
On 03.08.2012 14:46, Amit kapila wrote:
 Currently the change is done only for fixed length columns for simple
tables and the tuple should not contain NULLS.

 This is a Proof of concept, the design and implementation needs to be
changed based on final design required for handling other scenario's

 Update operation:
 -
 1. Check for the simple table or not.(No toast, No before update
triggers)
 2. Works only for not null tuples.
 3. Identify the modified columns from the target entry.
 4. Based on the modified column list, check for any variable length
columns are modified, if so this optimization is not applied.
 5. Identify the offset and length for the modified columns and store it
as an optimized WAL tuple in the following format.
 Note: Wal update header is modified to denote whether wal update
optimization is done or not.
  WAL update header + Tuple header(no change from previous format)
+
  [offset(2bytes)] [length(2 bytes)] [changed data value]
  [offset(2bytes)] [length(2 bytes)] [changed data value]



 The performance will need to be re-verified after you fix these 
 limitations. Those limitations need to be fixed before this can be
applied.

Yes, I agree that solution should fix these limitations and performance
numbers needs to be re-verified. 
Currently in my mind the work to be done is as follows:

1. Solution which can handle Variable length columns and NULLs
2. Handling of Before Triggers
3. Can the solution for fixed length columns be same as Variable length
columns and NULLS.
4. Make the final code patch which addresses all the above.

Please suggest if there are more things that needs to be handled?

For the 3rd point, currently the solution for fixed length columns cannot
handle the case of variable length columns and NULLS. The reason is for
fixed length columns there is no need of diff technology between old and new
tuple, however for other cases it will be required.
For fixed length columns, if we just note the OFFSET, LENGTH, VALUE of
changed columns of new tuple in WAL, it will be sufficient to do the replay
of WAL. However to handle other cases we need to use diff mechanism.

Can we do something like if the changed columns are fixed length and doesn't
contain NULL's, then store [OFFSET, LENGTH, VALUE] format in WAL and for
other cases store diff format.

This has advantage that for Updates containing only fixed length columns
don't have to pay penality of doing diff between new and old tuple. Also we
can do the whole work in 2 parts, one for fixed length columns and second to
handle other cases. 


 It would be nice to use some well-known binary delta algorithm for this, 
 rather than invent our own. OTOH, we have more knowledge of the 
 attribute boundaries, so a custom algorithm might work better. 

I shall work on this and post after initial work.

 In any case, I'd like to see the code to do the delta encoding/decoding to
be 
 put into separate functions, outside of heapam.c. It would be good for 
 readability, and we might want to reuse this in other places too.

Agreed. I shall take care of doing it in suggested way.


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-04 Thread Heikki Linnakangas

On 04.08.2012 11:01, Amit Kapila wrote:

Missed one point which needs to be handled: pg_upgrade


I don't think there's anything to do for pg_upgrade. This doesn't change 
the on-disk data format, just the WAL format, and pg_upgrade isn't 
sensitive to WAL format changes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-04 Thread Bruce Momjian
On Sat, Aug  4, 2012 at 05:21:06PM +0300, Heikki Linnakangas wrote:
 On 04.08.2012 11:01, Amit Kapila wrote:
 Missed one point which needs to be handled: pg_upgrade
 
 I don't think there's anything to do for pg_upgrade. This doesn't
 change the on-disk data format, just the WAL format, and pg_upgrade
 isn't sensitive to WAL format changes.

Correct.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



FW: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-03 Thread Amit Kapila
One thing I forgot to mention: for the tests, I have used the pg_prewarm
utility to load all the data into shared buffers before the start of the test.

 

With Regards,

Amit Kapila.

 

From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit kapila
Sent: Friday, August 03, 2012 5:17 PM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update
Operation

 

Problem statement: 

---

Reducing WAL size for an update operation for performance improvement.

Advantages: 

-
1. Observed increase in performance with pgbench when the server is
running with synchronous_commit off:
   a. with pgbench (tpc_b) - 13%
   b. with modified pgbench (such that the size of the modified columns
      is less than the whole row) - 83%

2. WAL size is reduced.

Design/Implementation: 

--

Currently the change is done only for fixed-length columns in simple tables,
and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation need to be changed
based on the final design required for handling the other scenarios.

Update operation: 
-
1. Check whether the table is a simple table (no TOAST, no before-update
triggers).
2. Works only for tuples that contain no NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length
columns are modified; if so, this optimization is not applied.
5. Identify the offset and length of each modified column and store them as
an optimized WAL tuple in the following format (see the sketch after this
list).
   Note: The WAL update header is modified to denote whether the WAL update
optimization is done or not.
WAL update header + Tuple header (no change from previous format) +
[offset (2 bytes)] [length (2 bytes)] [changed data value]
[offset (2 bytes)] [length (2 bytes)] [changed data value]
...
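For illustration, a single changed-column entry in that format could be
described by a struct like the one below. This is a sketch only: the actual
WAL record is a packed byte stream, and the struct and field names are not
from the patch.

#include <stdint.h>

/*
 * Illustrative layout of one changed-column entry in the optimized
 * WAL update record described above (sketch only; not patch code).
 */
typedef struct WalChangedColumn
{
    uint16_t    offset;     /* byte offset of the column in the tuple data */
    uint16_t    length;     /* length in bytes of the changed value */
    char        value[];    /* 'length' bytes of the new column value */
} WalChangedColumn;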

Recovery: 


The following steps apply only in case the tuple is optimized.

6. For forming the new tuple, the old tuple is required (even if the old
tuple does not require any modifications).
7. Form the new tuple based on the format specified in point 5.
8. Once the new tuple is framed, follow the existing behavior.

Frame the new tuple from the old tuple and the WAL record (a code sketch
follows the steps):

1. The length of the data which needs to be copied from the old tuple is
calculated as the difference between the offset present in the WAL record
and the current old-tuple offset.
   (For the first entry, the old-tuple offset is zero.)
2. Once the old tuple data is copied, increase the old-tuple offset by the
copied length.
3. Get the length and value of the modified column from the WAL record and
copy the value into the new tuple.
4. Increase the old-tuple offset by the modified column's length.
5. Repeat this procedure until the end of the WAL record is reached.
6. Copy any remaining left-over old tuple data.
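A minimal sketch of this replay loop in C is below, assuming the
fixed-length case (old and new value lengths are equal); the function and
variable names are illustrative only, and bounds checking against the
record length is omitted:

#include <stdint.h>
#include <string.h>

/*
 * Sketch of rebuilding the new tuple data from the old tuple data and
 * the [offset][length][value] entries of the WAL record.  Names are
 * illustrative; validation is omitted for brevity.
 */
static void
rebuild_new_tuple(const char *old_data, uint32_t old_len,
                  const char *entries, uint32_t entries_len,
                  char *new_data)
{
    uint32_t    old_off = 0;    /* bytes of the old tuple consumed so far */
    uint32_t    in = 0;         /* read position in the WAL entries */
    uint32_t    out = 0;        /* write position in the new tuple */

    while (in < entries_len)
    {
        uint16_t    col_off;
        uint16_t    col_len;

        /* read one [offset][length] header (unaligned-safe via memcpy) */
        memcpy(&col_off, entries + in, sizeof(col_off));
        memcpy(&col_len, entries + in + sizeof(col_off), sizeof(col_len));
        in += sizeof(col_off) + sizeof(col_len);

        /* steps 1-2: copy unchanged old data up to the changed column */
        memcpy(new_data + out, old_data + old_off, col_off - old_off);
        out += col_off - old_off;
        old_off = col_off;

        /* step 3: copy the changed value from the WAL record */
        memcpy(new_data + out, entries + in, col_len);
        in += col_len;
        out += col_len;

        /* step 4: skip the old value (fixed-length, so the same size) */
        old_off += col_len;
    }

    /* step 6: copy whatever remains of the old tuple */
    memcpy(new_data + out, old_data + old_off, old_len - old_off);
}

Reading the two-byte header fields via memcpy keeps the sketch safe on
platforms that disallow unaligned loads.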


Test results: 

--
1. Each pgbench test ran for 10 minutes.

2. The pgbench result for tpc_b is attached with this mail as pgbench_org.

3. The modified pgbench result (such that the size of the modified columns
is less than the whole row) for tpc_b is attached with this mail as
pgbench_1800_300.

Modified pgbench code: 

---
1. The schema of the tables is modified by adding some extra fields to
increase the record size to 1800 bytes.
2. The tpc_b benchmark suite is changed to do only update operations.
3. The update operation is changed to update 3 columns with 300 bytes out
of the total size of 1800 bytes.
4. During initialization of the tables, the NULL value insertions are
removed.


I am working on a solution to handle the other scenarios: variable-length
columns, tuples containing NULLs, and handling of before triggers.

 

Please provide suggestions/objections.

 

With Regards,

Amit Kapila.



Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for Update Operation

2012-08-03 Thread Heikki Linnakangas

On 03.08.2012 14:46, Amit kapila wrote:

Currently the change is done only for fixed-length columns in simple tables,
and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation need to be changed
based on the final design required for handling the other scenarios.



The performance will need to be re-verified after you fix these 
limitations. Those limitations need to be fixed before this can be applied.


It would be nice to use some well-known binary delta algorithm for this, 
rather than invent our own. OTOH, we have more knowledge of the 
attribute boundaries, so a custom algorithm might work better. In any 
case, I'd like to see the code to do the delta encoding/decoding to be 
put into separate functions, outside of heapam.c. It would be good for 
readability, and we might want to reuse this in other places too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
