Re: [HACKERS] Enabling Checksums
On 3/18/13 10:52 AM, Bruce Momjian wrote:
> With a potential 10-20% overhead, I am unclear who would enable this at
> initdb time.

If you survey people who are running PostgreSQL on "cloud" hardware, be it
Amazon's EC2 or similar options from other vendors, you will find a high
percentage of them would pay quite a bit of performance to make their
storage more reliable.  To pick one common measurement for popularity, a
Google search on "ebs corruption" returns 17 million hits.  To quote one of
those, Baron Schwartz of Percona talking about MySQL on EC2:

"BTW, I have seen data corruption on EBS volumes. It’s not clear whether it
was InnoDB’s fault (extremely unlikely IMO), the operating system’s fault,
EBS’s fault, or something else."

http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

*That* uncertainty is where a lot of the demand for this feature is coming
from.  People deploy into the cloud, their data gets corrupted, and no one
can tell them what/why/how it happened.  And that means they don't even know
what to change to make it better.  The only people I see really doing
something about this problem all seem years off, and I'm not sure they are
going to help--especially since some of them are targeting "enterprise"
storage rather than the cloud-style installations.

> I assume a user would wait until they suspected corruption to turn it on,
> and because it is only initdb-enabled, they would have to dump/reload
> their cluster.  The open question is whether this is a usable feature as
> written, or whether we should wait until 9.4.

The reliability issues of both physical and virtual hardware are so widely
known that many people will deploy with this on as their default
configuration.

If you don't trust your existing data, you can't retroactively check it.  A
checksum of an already corrupt block is useless.  Accordingly, there is no
use case for converting an installation with real or even suspected
problems to a checksummed one.
If you wait until you suspect corruption to care about checksums, it's
really too late.  There is only one available next step: you must do a dump
to figure out what's readable.  That is the spot that all of the incoming
data recovery customers we see at 2ndQuadrant are already in when we're
called.  The cluster is suspicious, sometimes they can get data out of it
with a dump, and if we hack up their install we can usually recover a bit
more than they could.

After the data from a partially corrupted database is dumped, someone who
has just been through that pain might decide they should turn checksums on
when they restore the dump.  When it's on, they can assess future damage
easily at the block level when it happens, and possibly repair it without
doing a full dump/reload.  What's implemented in the feature we're talking
about has a good enough UI to handle this entire cycle I see damaged
installations go through.

In fact, this feature is going to need pg_upgrade changes to detect from
pg_controldata that the old/new clusters have the same checksum setting.  I
think that's done already, but it's certainly something to test out too.

Good questions, Bruce; I don't think the reasons behind this feature's
demand have been highlighted very well before.  I try not to spook the
world by talking regularly about how many corrupt PostgreSQL databases I've
seen, but they do happen.  Most of my regular ranting on crappy SSDs that
lie about writes comes from a TB-scale PostgreSQL install that got
corrupted due to the write-cache flaws of the early Intel SSDs--twice.
They would have happily lost even the worst-case 20% of regular performance
to avoid going down for two days each time they saw corruption, where we
had to dump/reload to get them going again.  If the install had checksums,
I could have figured out which blocks were damaged and manually fixed them.
Without checksums, there's no way to even tell for sure what is broken.
-- 
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Simon Riggs writes:
> On 24 April 2013 21:06, Jeff Davis wrote:
>> What goal are you trying to accomplish with this patch?
>
> That we might need to patch the checksum version on a production release.

I don't actually buy that argument, certainly not as something that could
happen in 9.3.  I'm inclined to think we should forget about this until we
have a concrete use-case for it.  As Jeff says, there is no need for
pg_control contents to be compatible across major releases, so there's no
harm in waiting if we have any doubts about how it ought to work.

			regards, tom lane
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-24 at 21:09 +0100, Simon Riggs wrote:
> On 24 April 2013 21:06, Jeff Davis wrote:
>> What goal are you trying to accomplish with this patch?
>
> That we might need to patch the checksum version on a production release.

Oh, I see.

I don't think we need two output fields from pg_controldata though.  It's a
little redundant, and confused me when I was looking at the impact on
pg_upgrade.  And it means nothing to the user until we actually have
multiple algorithms available, at which time we are better off with a text
representation.

Other than that, I think your patch is fine to accomplish the
aforementioned goal.  Essentially, it just changes the bool to a uint32,
which I favor.

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On 24 April 2013 21:06, Jeff Davis wrote:
> What goal are you trying to accomplish with this patch?

That we might need to patch the checksum version on a production release.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-24 at 08:20 +0100, Simon Riggs wrote:
> On 24 April 2013 01:10, Jeff Davis wrote:
>> I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
>> checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
>> a new algorithm without forcing existing users to change.
>
> That's exactly what the patch does.

The word "version" indicates an order to it though, like N+1 is always
preferable to N.  This is user-facing (through pg_controldata output),
otherwise I wouldn't mind.

>> initdb would have to take the code as an option, probably in string
>> form.
>
> When/if we have multiple options we can add that. The main thing was
> to make sure the control file recorded things in a common way.

The main strange thing to me is that we're still using enabled/disabled for
the output of pg_controldata as well as the "version".  When we do have
multiple options, it seems like we'd just have one field output:

Data page checksums: none|crc32c|pg-fnv

What goal are you trying to accomplish with this patch?  pg_control doesn't
need to be compatible between releases, so can't we just add this later
when we really do have multiple options?

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On 24 April 2013 01:10, Jeff Davis wrote:
> On Tue, 2013-04-23 at 16:28 +0100, Simon Riggs wrote:
>> * make the pg_control.data_checksums field into a version number, for
>> future flexibility... patch attached
>
> Commenting on this separately because it's a separate issue.
>
> I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
> checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
> a new algorithm without forcing existing users to change.

That's exactly what the patch does.

> initdb would have to take the code as an option, probably in string
> form.

When/if we have multiple options we can add that.  The main thing was to
make sure the control file recorded things in a common way.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Tue, 2013-04-23 at 16:28 +0100, Simon Riggs wrote:
> * make the pg_control.data_checksums field into a version number, for
> future flexibility... patch attached

Commenting on this separately because it's a separate issue.

I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
checksum, 1 for FNV-1a-SR3, etc.  That would allow us to release 9.4 with a
new algorithm without forcing existing users to change.

initdb would have to take the code as an option, probably in string form.

What do you think?

Regards,
	Jeff Davis
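To make the proposal concrete, here is a minimal sketch of what such an
ID-code scheme plus a text representation could look like.  The constant
names, the string labels, and the function are hypothetical illustrations,
not actual PostgreSQL identifiers:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: the boolean data_checksums field in pg_control
 * becomes a small integer code, 0 meaning "no checksums" and each
 * nonzero value naming one specific algorithm.
 */
#define CHECKSUM_ALG_NONE       0
#define CHECKSUM_ALG_FNV1A_SR3  1   /* e.g. the FNV-1a + shift variant */

/* what a pg_controldata-style text output could print for each code */
const char *
checksum_code_name(uint32_t code)
{
    switch (code)
    {
        case CHECKSUM_ALG_NONE:      return "none";
        case CHECKSUM_ALG_FNV1A_SR3: return "fnv-1a-sr3";
        default:                     return "unknown";
    }
}
```

With a mapping like this, initdb could accept the string form and translate
it back to a code, so adding an algorithm later is just a new constant plus
a new name.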
Re: [HACKERS] Enabling Checksums
On 23 April 2013 02:35, Jeff Davis wrote:
> On Tue, 2013-04-23 at 01:08 +0300, Ants Aasma wrote:
>> A slight delay, but here it is. I didn't lift the checksum part into a
>> separate file as I didn't have a great idea what I would call it. The
>> code is reasonably compact so I don't see a great need for this right
>> now. It would be more worth the effort when/if we add non-generic
>> variants. I'm not particularly attached to the method I used to mask
>> out pd_checksum field, this could be improved if someone has a better
>> idea how to structure the code.
>
> Thank you. A few initial comments:
>
> I have attached (for illustration purposes only) a patch on top of yours
> that divides the responsibilities a little more cleanly.
>
> * easier to move into a separate file, and use your recommended compiler
>   flags without affecting other routines in bufpage.c
> * makes the checksum algorithm itself simpler
> * leaves the data-page-specific aspects (mixing in the page number,
>   ignoring pd_checksum, reducing to 16 bits) to PageCalcChecksum16
> * overall easier to review and understand
>
> I'm not sure what we should call the separate file or where we should
> put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
> there a clean way to override the compiler flags for a single file so we
> don't need to put it in its own directory?

OK, I like that a lot better and it seems something I could commit.

I suggest the following additional changes...

* put the README stuff directly in the checksum.c file

* I think we need some external links that describe this algorithm, and we
  need comments that explain what we know about this in terms of detection
  capability and why it was chosen against others

* we need some comments/analysis about whether the coding causes a problem
  if vectorization is *not* available

* make the pg_control.data_checksums field into a version number, for
  future flexibility... patch attached

* rename the routine away from checksum_fnv so it's simply a generic
  checksum call - more modular.  That way all knowledge of the algorithm is
  in one file only.  If we do need to change the algorithm in the future we
  can more easily support multiple versions.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

checksums_version.v1.patch
Description: Binary data
Re: [HACKERS] Enabling Checksums
Jeff Davis wrote:
> I'm not sure what we should call the separate file or where we should
> put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
> there a clean way to override the compiler flags for a single file so we
> don't need to put it in its own directory?

Sure, see src/backend/parser/Makefile about gram.o.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
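For readers who don't want to dig through the parser Makefile, the general
shape of a per-file flag override in GNU make is a target-specific variable.
This is an illustrative sketch only (the file name is hypothetical, and the
actual backend Makefiles may structure it differently):

```make
# Hypothetical sketch: append vectorization flags for one object file
# only, in the spirit of the special handling gram.o receives.
checksum_fnv.o: CFLAGS += -msse4.1 -funroll-loops -ftree-vectorize
```

Because the variable assignment is scoped to the one target, the rest of
the directory builds with the ordinary flags.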
Re: [HACKERS] Enabling Checksums
On Tue, 2013-04-23 at 01:08 +0300, Ants Aasma wrote:
> A slight delay, but here it is. I didn't lift the checksum part into a
> separate file as I didn't have a great idea what I would call it. The
> code is reasonably compact so I don't see a great need for this right
> now. It would be more worth the effort when/if we add non-generic
> variants. I'm not particularly attached to the method I used to mask
> out pd_checksum field, this could be improved if someone has a better
> idea how to structure the code.

Thank you.  A few initial comments:

I have attached (for illustration purposes only) a patch on top of yours
that divides the responsibilities a little more cleanly.

* easier to move into a separate file, and use your recommended compiler
  flags without affecting other routines in bufpage.c
* makes the checksum algorithm itself simpler
* leaves the data-page-specific aspects (mixing in the page number,
  ignoring pd_checksum, reducing to 16 bits) to PageCalcChecksum16
* overall easier to review and understand

I'm not sure what we should call the separate file or where we should put
it, though.  How about src/backend/utils/checksum/checksum_fnv.c?  Is there
a clean way to override the compiler flags for a single file so we don't
need to put it in its own directory?

Regards,
	Jeff Davis

*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 23,28 ****
  static char pageCopyData[BLCKSZ];	/* for checksum calculation */
--- 23,29 ----
  static Page pageCopy = pageCopyData;
  
  static uint16 PageCalcChecksum16(Page page, BlockNumber blkno);
+ static uint32 checksum_fnv(char *data, uint32 size);
  
  /*
   * Page support functions
***************
*** 986,1017 ****
  static uint16
  PageCalcChecksum16(Page page, BlockNumber blkno)
  {
! 	uint32		sums[N_SUMS];
! 	uint32		(*pageArr)[N_SUMS] = (uint32 (*)[N_SUMS]) page;
! 	uint32		result = blkno;
! 	int			i, j;
! 	int			pd_checksum_word = offsetof(PageHeaderData, pd_checksum)/sizeof(uint32);
! 	int			pd_checksum_half = (offsetof(PageHeaderData, pd_checksum) % sizeof(uint32)) / sizeof(uint16);
  
  	/* only calculate the checksum for properly-initialized pages */
  	Assert(!PageIsNew(page));
  
  	/* initialize partial checksums to their corresponding offsets */
  	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
  
! 	/* first iteration needs to mask out pd_checksum field itself with zero */
! 	for (j = 0; j < N_SUMS; j++)
! 	{
! 		uint32 value = pageArr[0][j];
! 		if (j == pd_checksum_word)
! 			((uint16*) &value)[pd_checksum_half] = 0;
! 		CHECKSUM_COMP(sums[j], value);
! 	}
! 
! 	/* now add in the rest of the page */
! 	for (i = 1; i < BLCKSZ/sizeof(uint32)/N_SUMS; i++)
  		for (j = 0; j < N_SUMS; j++)
! 			CHECKSUM_COMP(sums[j], pageArr[i][j]);
  
  	/* finally add in one round of zeroes for one more layer of mixing */
  	for (j = 0; j < N_SUMS; j++)
--- 987,1035 ----
  static uint16
  PageCalcChecksum16(Page page, BlockNumber blkno)
  {
! 	PageHeader	phdr = (PageHeader) page;
! 	uint16		save_checksum;
! 	uint32		fnv_checksum;
  
  	/* only calculate the checksum for properly-initialized pages */
  	Assert(!PageIsNew(page));
  
+ 	/*
+ 	 * Save pd_checksum and set it to zero, so that the checksum calculation
+ 	 * isn't affected by the checksum stored on the page.
+ 	 */
+ 	save_checksum = phdr->pd_checksum;
+ 	phdr->pd_checksum = 0;
+ 	fnv_checksum = checksum_fnv(page, BLCKSZ);
+ 	phdr->pd_checksum = save_checksum;
+ 
+ 	/* mix in the block number to detect transposed pages */
+ 	fnv_checksum ^= blkno;
+ 
+ 	/*
+ 	 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
+ 	 * one. That avoids checksums of zero, which seems like a good idea.
+ 	 */
+ 	return (fnv_checksum % 65535) + 1;
+ }
+ 
+ static uint32
+ checksum_fnv(char *data, uint32 size)
+ {
+ 	uint32		sums[N_SUMS];
+ 	uint32		(*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;
+ 	uint32		result;
+ 	int			i, j;
+ 
+ 	Assert((size % (sizeof(uint32)*N_SUMS)) == 0);
+ 
  	/* initialize partial checksums to their corresponding offsets */
  	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
  
! 	/* main checksum calculation */
! 	for (i = 0; i < size/sizeof(uint32)/N_SUMS; i++)
  		for (j = 0; j < N_SUMS; j++)
! 			CHECKSUM_COMP(sums[j], dataArr[i][j]);
  
  	/* finally add in one round of zeroes for one more layer of mixing */
  	for (j = 0; j < N_SUMS; j++)
***************
*** 1021,1026 ****
  	for (i = 0; i < N_SUMS; i++)
  		result ^= sums[i];
  
! 	/* use mod mapping to map the value into 16 bits and offset from zero */
! 	return (result % 65535) + 1;
  }
--- 1039,1043 ----
  	for (i = 0; i < N_SUMS; i++)
  		result ^= sums[i];
  
! 	return result;
  }
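One detail of the refactored patch worth calling out is the final reduction
step in PageCalcChecksum16.  Pulled out as a standalone sketch (mirroring
the patch's `(fnv_checksum % 65535) + 1`, with an illustrative function
name):

```c
#include <stdint.h>

/*
 * Mirror of the patch's reduction step: map a 32-bit checksum into the
 * range 1..65535 so it fits the 16-bit pd_checksum field.  Because the
 * result can never be zero, a stored checksum of zero remains
 * distinguishable from any real checksum.
 */
uint16_t
reduce_checksum_to_16bit(uint32_t fnv_checksum)
{
    return (uint16_t) ((fnv_checksum % 65535) + 1);
}
```

Note that the modulus is 65535 (not 65536), so all 32-bit inputs land on
1..65535 and the "+ 1" cannot overflow the field.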
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 10:57 PM, Ants Aasma wrote:
> On Mon, Apr 22, 2013 at 10:54 PM, Simon Riggs wrote:
>> On 22 April 2013 20:32, Florian Pflug wrote:
>>> Assuming that we only ship a plain C implementation with 9.3, what
>>> are we missing on that front? The C implementation of FNV1+SHIFT is
>>> only a few dozen lines or so.
>>
>> Forgive me, I can't seem to locate the patch for this? Re-post please,
>> just for clarity.
>
> Not posted yet. I'm writing it as we speak. Will post within half an
> hour or so.

A slight delay, but here it is.  I didn't lift the checksum part into a
separate file as I didn't have a great idea what I would call it.  The code
is reasonably compact so I don't see a great need for this right now.  It
would be more worth the effort when/if we add non-generic variants.  I'm
not particularly attached to the method I used to mask out the pd_checksum
field; this could be improved if someone has a better idea how to structure
the code.

I confirmed with objdump that compiling on GCC 4.7 with -msse4.1
-funroll-loops -ftree-vectorize does in fact vectorize that loop.  A simple
way to verify:

    objdump -d src/backend/storage/page/bufpage.o | grep pmulld | wc -l

should output 16.

Unfortunately I can't work on this patch for about a week.  PostgreSQL 9.3
will have to wait for me as I need to tend to the release of Ants v2.0.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

parallel-fnv-checksum.patch
Description: Binary data
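For readers following along without the attachment, here is a hedged sketch
of the general shape of a parallel FNV-1a + shift checksum as discussed in
this thread.  The seed values, the shift amount, and the exact mixing step
are illustrative stand-ins, not necessarily what the attached patch uses:

```c
#include <stdint.h>
#include <stddef.h>

#define N_SUMS     32
#define FNV_PRIME  16777619u

/* one FNV-1a round plus a shift-xor to push changes into the low bits */
#define CHECKSUM_COMP(sum, value) \
    do { \
        uint32_t t_ = (sum) ^ (value); \
        (sum) = t_ * FNV_PRIME ^ (t_ >> 17); \
    } while (0)

/* nwords must be a multiple of N_SUMS */
uint32_t
parallel_fnv(const uint32_t *data, size_t nwords)
{
    uint32_t sums[N_SUMS];
    uint32_t result = 0;
    int      j;

    /* arbitrary distinct per-lane seeds (illustrative only) */
    for (j = 0; j < N_SUMS; j++)
        sums[j] = 0x9E3779B9u + j;

    /*
     * Each of the N_SUMS lanes checksums every N_SUMS-th word.  The lanes
     * carry no dependencies on each other, which is what lets a compiler
     * vectorize the inner loop.
     */
    for (size_t i = 0; i < nwords; i += N_SUMS)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], data[i + j]);

    /* one extra round of zeroes for a final layer of mixing */
    for (j = 0; j < N_SUMS; j++)
        CHECKSUM_COMP(sums[j], 0);

    for (j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}
```

The structure matches the description in the thread: independent lanes, a
final zero round, then an XOR fold of the lane sums.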
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 10:54 PM, Simon Riggs wrote:
> On 22 April 2013 20:32, Florian Pflug wrote:
>> Assuming that we only ship a plain C implementation with 9.3, what
>> are we missing on that front? The C implementation of FNV1+SHIFT is
>> only a few dozen lines or so.
>
> Forgive me, I can't seem to locate the patch for this? Re-post please,
> just for clarity.

Not posted yet.  I'm writing it as we speak.  Will post within half an hour
or so.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Enabling Checksums
On 22 April 2013 20:32, Florian Pflug wrote:
> Assuming that we only ship a plain C implementation with 9.3, what
> are we missing on that front? The C implementation of FNV1+SHIFT is
> only a few dozen lines or so.

Forgive me, I can't seem to locate the patch for this?  Re-post please,
just for clarity.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Apr 22, 2013, at 21:14, Jeff Davis wrote:
> On Mon, 2013-04-22 at 20:04 +0200, Florian Pflug wrote:
>> The one downside of the fnv1+shift approach is that it's built around
>> the assumption that processing 64 bytes at once is the sweet spot. That
>> might be true for x86 and x86_64 today, but it won't stay that way for
>> long, and quite surely isn't true for other architectures. That doesn't
>> necessarily rule it out, but it certainly weakens the argument that
>> slipping it into 9.3 avoids having to change the algorithm later...
>
> I think you are setting the bar way too high. Right now, we have a slow
> algorithm. According to Ants's tests, FNV-1a is much, much faster. Do
> you think that it will still be such a bottleneck that we will want to
> change it again later for purely performance reasons?

To clarify, it wasn't my intent to argue against shipping FNV1+SHIFT in 9.3
- in fact I'd like to see us do exactly that.  I was merely trying to be
unbiased, and hence stated not only the arguments in favour of FNV1+SHIFT
(the ones about CRC's theoretical advantages in error detection not really
being relevant to us), but also the one downside of FNV1+SHIFT.  Seems like
I could have done a better job expressing myself, though.

> The biggest problem now is getting one of these faster algorithms (FNV
> or even a faster CRC) into shape that is acceptable to
> reviewers/committers. If we don't do that, we will be missing out on a
> lot of potential checksum users for whom the existing CRC algorithm is
> just too slow.

Assuming that we only ship a plain C implementation with 9.3, what are we
missing on that front?  The C implementation of FNV1+SHIFT is only a few
dozen lines or so.

best regards,
Florian Pflug
Re: [HACKERS] Enabling Checksums
On Mon, 2013-04-22 at 19:25 +0300, Ants Aasma wrote:
> I was just now writing up a generic C based patch based on the
> parallel FNV-1a + shift that we discussed with Florian with an added
> round of mixing. Testing the performance in isolation indicates that:
> 1) it is about an order of magnitude faster than the Sarwate CRC
>    method used in PostgreSQL.
> 2) it is about 2x faster than the fastest software based CRC method.
> 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation
>    options the performance improves 5x (within 20% of handcoded ASM).

That's great news!  This means that we can have a simple C implementation
in a separate file, and pass a few build flags when compiling just that
file (so it doesn't affect other code).  That should make
reviewers/committers happy (including me).

FWIW, that was my last real concern about FNV (reviewability).  I'm not
worried about the performance based on your analysis; nor am I worried
about the error detection rate.

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 3:14 PM, Jeff Davis wrote:
> The biggest problem now is getting one of these faster algorithms (FNV
> or even a faster CRC) into shape that is acceptable to
> reviewers/committers. If we don't do that, we will be missing out on a
> lot of potential checksum users for whom the existing CRC algorithm is
> just too slow.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On Mon, 2013-04-22 at 20:04 +0200, Florian Pflug wrote:
> The one downside of the fnv1+shift approach is that it's built around
> the assumption that processing 64 bytes at once is the sweet spot. That
> might be true for x86 and x86_64 today, but it won't stay that way for
> long, and quite surely isn't true for other architectures. That doesn't
> necessarily rule it out, but it certainly weakens the argument that
> slipping it into 9.3 avoids having to change the algorithm later...

I think you are setting the bar way too high.  Right now, we have a slow
algorithm.  According to Ants's tests, FNV-1a is much, much faster.  Do you
think that it will still be such a bottleneck that we will want to change
it again later for purely performance reasons?

The only time this is likely to matter is in the situation Greg Smith
describes, where shared buffers is much smaller than memory, and the
working set of buffers is near the size of memory (in other words, a lot of
buffers moving to and from shared memory, but not much to or from disk).
And it's already significantly faster than the algorithm in the original
tests (Fletcher), so it's not clear that it's still even a serious problem.
(Also remember that checksum users already accept a WAL penalty.)

The biggest problem now is getting one of these faster algorithms (FNV or
even a faster CRC) into shape that is acceptable to reviewers/committers.
If we don't do that, we will be missing out on a lot of potential checksum
users for whom the existing CRC algorithm is just too slow.

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 9:04 PM, Florian Pflug wrote: > The one downside of the fnv1+shift approach is that it's built around > the assumption that processing 64-bytes at once is the sweet spot. That > might be true for x86 and x86_64 today, but it won't stay that way for > long, and quite surely isn't true for other architectures. That doesn't > necessarily rule it out, but it certainly weakens the argument that > slipping it into 9.3 avoids having the change the algorithm later... It's actually 128 bytes as it was tested. The ideal shape depends on multiplication latency, multiplication throughput and amount of registers available. Specifically BLCKSZ/mul_throughput_in_bytes needs to be larger than BLCKSZ/(N_SUMS*sizeof(uint32))*(mul latency + 2*xor latency). For latest Intel the values are 8192/16 = 512 and 8192/(32*4)*(5 + 2*1) = 448. 128 bytes is also 8 registers which is the highest power of two fitting into architectural registers (16). This means that the value chosen is indeed the sweet spot for x86 today. For future processors we can expect the multiplication width to increase and possibly the latency too shifting the sweet spot into higher widths. In fact, Haswell coming out later this year should have AVX2 instructions that introduce integer ops on 256bit registers, making the current choice already suboptimal. All that said, having a lower width won't make the algorithm slower on future processors, it will just leave some parallelism on the table that could be used to make it even faster. The line in the sand needed to be drawn somewhere, I chose the maximum comfortable width today fearing that even that would be shot down based on code size. Coincidentally 32 elements is also the internal parallelism that GPUs have settled on. We could bump the width up by one notch to buy some future safety, but after that I'm skeptical we will see any conventional processors that would benefit from a higher width. 
I just tested that the auto-vectorized version runs at basically identical speed, as GCC's inability to do good register allocation means that it juggles values between registers and the stack one way or the other. So to recap, I don't know of any CPUs where a lower value would be better. Raising the width by one notch would mean better performance on future processors, but raising it further would just bloat the size of the inner loop without much benefit in sight. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
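The lane structure Ants describes (N_SUMS parallel 32-bit FNV-1a accumulators with a shift-based mixing step, consuming N_SUMS words per iteration) can be sketched in plain C as below. This is an illustrative sketch only: the prime, the lane count, the shift amount and the final folding here are assumptions for demonstration, not the exact constants of the proposed patch.

```c
#include <stdint.h>
#include <stddef.h>

#define N_SUMS    32            /* 32 parallel uint32 lanes = 128 bytes per step */
#define FNV_PRIME 16777619u     /* the 32-bit FNV prime */

/* Parallel FNV-1a with a shift-based mix, in the shape discussed in this
 * thread.  Each lane consumes every N_SUMS-th word, so a vectorizing
 * compiler can map the inner loop onto wide SIMD multiplies and xors. */
uint32_t
parallel_fnv1a_shift(const uint32_t *data, size_t nwords)
{
    uint32_t sums[N_SUMS];
    uint32_t result = 0;
    size_t   i, j;

    for (j = 0; j < N_SUMS; j++)
        sums[j] = 0;

    for (i = 0; i + N_SUMS <= nwords; i += N_SUMS)
        for (j = 0; j < N_SUMS; j++)
        {
            uint32_t tmp = sums[j] ^ data[i + j];
            /* FNV-1a multiply plus a srl-1 style mixing xor */
            sums[j] = tmp * FNV_PRIME ^ (tmp >> 1);
        }

    /* Fold the lanes into one 32-bit aggregate; a real page checksum
     * would add extra mixing rounds and truncate to 16 bits. */
    for (j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}
```

Compiled with options like -O3 -ftree-vectorize, the inner loop is the part that maps onto SIMD multiply and xor instructions, which is where the width trade-off discussed above comes from.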
Re: [HACKERS] Enabling Checksums
On Apr22, 2013, at 18:25 , Ants Aasma wrote: > I was just now writing up a generic C based patch based on the > parallel FNV-1a + shift that we discussed with Florian with an added > round of mixing. Testing the performance in isolation indicates that: > 1) it is about an order of magnitude faster than the Sarwate CRC > method used in Postgresql. > 2) it is about 2x faster than fastest software based CRC method. > 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation > options the performance improves 5x. (within 20% of handcoded ASM) > > This leaves lingering doubts about the quality of the checksum. It's > hard if not impossible to prove absence of interesting patterns that > would trigger collisions. I do know the checksum quality is miles > ahead of the Fletcher sum originally proposed and during the last week > I haven't been able to think of a way to make the collision rate > significantly differ from CRC. Note though that CRCs may very well have similar "interesting" corruption patterns which don't cause the checksum to change. The only guarantee they really give is that those patterns will involve more than N-1 flipped bits, where N is the Hamming distance of the CRC. For 16-bit checksums, N can at most be 16 (since XOR-ing the data with a shifted version of the CRC polynomial will not cause the checksum to change). Thus, once more than two bytes on a page get corrupted, CRCs may not have any advantage over fnv1+shift or similar approaches. They may even work worse, since detecting some forms of corruption with 100% certainty means missing others with a probability of more than 2^-16. Some CRC polynomials for example detect all corruptions which affect an odd number of bits, but in turn have a probability of 2^-15 of missing ones which affect an even number of bits.
Since we're mostly attempting to protect against disk, not memory corruption here, I'm not convinced at all that errors in only a few bits are all that common, and certainly not that they are more likely than other forms of corruption. I'd expect, for example, that blocks of 512 bytes (i.e. one sector) suddenly reading 0 is at least as likely as a single flipped bit. The one downside of the fnv1+shift approach is that it's built around the assumption that processing 64-bytes at once is the sweet spot. That might be true for x86 and x86_64 today, but it won't stay that way for long, and quite surely isn't true for other architectures. That doesn't necessarily rule it out, but it certainly weakens the argument that slipping it into 9.3 avoids having to change the algorithm later... best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 04/22/2013 09:25 AM, Ants Aasma wrote: > This leaves lingering doubts about the quality of the checksum. It's > hard if not impossible to prove absence of interesting patterns that > would trigger collisions. I do know the checksum quality is miles > ahead of the Fletcher sum originally proposed and during the last week > I haven't been able to think of a way to make the collision rate > significantly differ from CRC. When we originally discussed this feature, we were potentially discussing a checksum algo which produced collisions for 1 out of 256 pages. That approach was considered acceptable, since it would be very unlikely for such a collision to occur across multiple corrupted pages, and fairly rare to have only one corrupted page. So my perspective is, if we're doing better than 1 in 256, it's good enough. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 6:33 PM, Andres Freund wrote: > I don't see us changing away from CRCs anymore either by now. But I > think at least changing the polynomial to something that > a) has higher error detection properties > b) can be noticeably sped up on a good part of the hardware pg is run on +1 for changing the polynomial if we stick with CRC, but I think the differences in error detection capability are mostly academic for the PostgreSQL usecase. Or does anyone have experience with seeing multiple random bit errors per page? > If we are feeling really adventurous we can switch to a faster CRC > implementation, there are enough ones around and I know that at least my > proposed patch from some years ago (which is by far not the fastest that > is doable) is in production usage some places. The faster CRC implementations just use parallel lookup tables to process more bytes at a time. Performance results from [1] show that doing 4 bytes in parallel will yield a 2.8x speedup, and 8 bytes in parallel yields another 1.7x on top of that at the cost of using an 8kB lookup table. And the end result is still over 3x slower than the code in the original patch, where Greg's performance results prompted me to look at what would have a lower overhead. [1] http://create.stephan-brumme.com/crc32/ Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
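For context, the byte-at-a-time (Sarwate) method being compared against looks roughly like the sketch below, here with the Castagnoli polynomial (CRC-32C, reflected form 0x82F63B78). The slice-by-4/8 variants mentioned in [1] extend the same idea with 4 or 8 lookup tables so that 4 or 8 input bytes are consumed per iteration. This is a generic illustration, not PostgreSQL's implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* One 256-entry table for the byte-at-a-time (Sarwate) method; the
 * slice-by-N variants precompute N such tables. */
static uint32_t crc32c_table[256];

static void
crc32c_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t crc = i;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (crc & 1 ? 0x82F63B78u : 0);
        crc32c_table[i] = crc;
    }
}

/* Conventional CRC-32C: 0xFFFFFFFF initial value and final inversion,
 * consuming one input byte per table lookup. */
static uint32_t
crc32c(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc32c_table[(crc ^ buf[i]) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this parameterization is crc32c("123456789") == 0xE3069283, which is a handy sanity check for any reimplementation.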
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 6:27 PM, Robert Haas wrote: > On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith wrote: >>> The more I read of this thread, the more unhappy I get. It appears that >>> the entire design process is being driven by micro-optimization for CPUs >>> being built by Intel in 2013. >> >> And that's not going to get anyone past review, since all the tests I've >> been doing the last two weeks are on how fast an AMD Opteron 6234 with OS >> cache >> shared_buffers can run this. The main thing I'm still worried >> about is what happens when you have a fast machine that can move memory >> around very quickly and an in-memory workload, but it's hamstrung by the >> checksum computation--and it's not a 2013 Intel machine. > > This is a good point. However, I don't completely agree with the > conclusion that we shouldn't be worrying about any of this right now. > While I agree with Tom that it's far too late to think about any > CPU-specific optimizations for 9.3, I have a lot of concern, based on > Ants's numbers, that we've picked a checksum algorithm which is hard > to optimize for performance. If we don't get that fixed for 9.3, > we're potentially looking at inflicting many years of serious > suffering on our user base. If we at least get the *algorithm* right > now, we can worry about optimizing it later. If we get it wrong, > we'll be living with the consequence of that for a really long time. I was just now writing up a generic C based patch based on the parallel FNV-1a + shift that we discussed with Florian with an added round of mixing. Testing the performance in isolation indicates that: 1) it is about an order of magnitude faster than the Sarwate CRC method used in Postgresql. 2) it is about 2x faster than the fastest software based CRC method. 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation options the performance improves 5x. (within 20% of handcoded ASM) This leaves lingering doubts about the quality of the checksum.
It's hard if not impossible to prove the absence of interesting patterns that would trigger collisions. I do know the checksum quality is miles ahead of the Fletcher sum originally proposed and during the last week I haven't been able to think of a way to make the collision rate significantly differ from CRC. > I wish that we had not scheduled beta quite so soon, as I am sure > there will be even more resistance to changing this after beta. But > I'm having a hard time escaping the conclusion that we're on the edge > of shipping something we will later regret quite deeply. Maybe I'm > wrong? It's unfortunate that this got delayed for so long. The performance side of the argument was clear a month ago. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 2013-04-22 11:27:25 -0400, Robert Haas wrote: > On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith wrote: > >> The more I read of this thread, the more unhappy I get. It appears that > >> the entire design process is being driven by micro-optimization for CPUs > >> being built by Intel in 2013. > > > > And that's not going to get anyone past review, since all the tests I've > > been doing the last two weeks are on how fast an AMD Opteron 6234 with OS > > cache >> shared_buffers can run this. The main thing I'm still worried > > about is what happens when you have a fast machine that can move memory > > around very quickly and an in-memory workload, but it's hamstrung by the > > checksum computation--and it's not a 2013 Intel machine. > > This is a good point. However, I don't completely agree with the > conclusion that we shouldn't be worrying about any of this right now. > While I agree with Tom that it's far too late to think about any > CPU-specific optimizations for 9.3, I have a lot of concern, based on > Ants's numbers, that we've picked a checksum algorithm which is hard > to optimize for performance. If we don't get that fixed for 9.3, > we're potentially looking at inflicting many years of serious > suffering on our user base. If we at least get the *algorithm* right > now, we can worry about optimizing it later. If we get it wrong, > we'll be living with the consequence of that for a really long time. > > I wish that we had not scheduled beta quite so soon, as I am sure > there will be even more resistance to changing this after beta. But > I'm having a hard time escaping the conclusion that we're on the edge > of shipping something we will later regret quite deeply. Maybe I'm > wrong? I don't see us changing away from CRCs anymore either by now. 
But I think at least changing the polynomial to something that a) has higher error detection properties and b) can be noticeably sped up on a good part of the hardware pg is run on would be worthwhile. If we are feeling really adventurous we can switch to a faster CRC implementation; there are enough ones around, and I know that at least my proposed patch from some years ago (which is by far not the fastest that is doable) is in production usage some places. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith wrote: >> The more I read of this thread, the more unhappy I get. It appears that >> the entire design process is being driven by micro-optimization for CPUs >> being built by Intel in 2013. > > And that's not going to get anyone past review, since all the tests I've > been doing the last two weeks are on how fast an AMD Opteron 6234 with OS > cache >> shared_buffers can run this. The main thing I'm still worried > about is what happens when you have a fast machine that can move memory > around very quickly and an in-memory workload, but it's hamstrung by the > checksum computation--and it's not a 2013 Intel machine. This is a good point. However, I don't completely agree with the conclusion that we shouldn't be worrying about any of this right now. While I agree with Tom that it's far too late to think about any CPU-specific optimizations for 9.3, I have a lot of concern, based on Ants's numbers, that we've picked a checksum algorithm which is hard to optimize for performance. If we don't get that fixed for 9.3, we're potentially looking at inflicting many years of serious suffering on our user base. If we at least get the *algorithm* right now, we can worry about optimizing it later. If we get it wrong, we'll be living with the consequence of that for a really long time. I wish that we had not scheduled beta quite so soon, as I am sure there will be even more resistance to changing this after beta. But I'm having a hard time escaping the conclusion that we're on the edge of shipping something we will later regret quite deeply. Maybe I'm wrong? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 6:04 PM, Florian Weimer wrote: > The TCP checksum is too weak to be practical. Every now an then, I > see data transfers where the checksum is valid, but the content > contains bit flips. Well of course, it's only a 16-bit checksum. 64k packets isn't very many so if you're not counting checksum failures it won't take very long before one gets through. The purpose of the checksum is to notify you that you have a problem, not to block bad packets from getting through. > Anything that flips bits randomly at intervals > which are multiples of 16 bits is quite likely to pass through > checksum detection. I'm not sure about this -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
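The weakness Florian Weimer describes follows directly from the checksum's structure: TCP uses the ones'-complement sum of 16-bit words (RFC 1071), so any corruption that preserves that sum, such as reordering aligned 16-bit words, passes undetected. A minimal sketch of that structure (not the kernel's optimized implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 style ones'-complement Internet checksum over 16-bit words,
 * as used by TCP/IP.  Because it is a plain commutative sum, any
 * corruption that preserves the sum -- e.g. swapping two aligned 16-bit
 * words -- leaves the checksum unchanged. */
static uint16_t
inet_checksum(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += words[i];
    while (sum >> 16)                  /* fold end-around carries */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t) ~sum;
}
```

Swapping any two 16-bit words of a payload yields the same checksum, which is one concrete instance of corruption "at intervals which are multiples of 16 bits" slipping through.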
Re: [HACKERS] Enabling Checksums
On 18.04.2013, at 20:02, Ants Aasma wrote: > On Thu, Apr 18, 2013 at 8:24 PM, Ants Aasma wrote: >> On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug wrote: >>> So either the CRC32-C polynomial isn't irreducible, or there's something >>> fishy going on. Could there be a bug in your CRC implementation? Maybe >>> a mixup between big and little endian, or something like that? >> >> I'm suspecting an implementation bug myself. I already checked the >> test harness and that was all sane, compiler hadn't taken any >> unforgivable liberties there. I will crosscheck the output with other >> implementations to verify that the checksum is implemented correctly. > > Looks like the implementation is correct. I cross-referenced it > against a bitwise algorithm for crc32 with the castagnoli polynomial. > This also rules out any endianness issues as the bitwise variant > consumes input one byte at a time. > > Whatever it is, it is something specific to the PostgreSQL page layout. > If I use /dev/urandom as the source the issue disappears. So much for > CRC32 being proven good. Weird. Is the code of your test harness available publicly, or could you post it? I'd like to look into this... best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 8:24 PM, Ants Aasma wrote: > On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug wrote: >> So either the CRC32-C polynomial isn't irreducible, or there's something >> fishy going on. Could there be a bug in your CRC implementation? Maybe >> a mixup between big and little endian, or something like that? > > I'm suspecting an implementation bug myself. I already checked the > test harness and that was all sane, compiler hadn't taken any > unforgivable liberties there. I will crosscheck the output with other > implementations to verify that the checksum is implemented correctly. Looks like the implementation is correct. I cross-referenced it against a bitwise algorithm for crc32 with the castagnoli polynomial. This also rules out any endianness issues as the bitwise variant consumes input one byte at a time. Whatever it is, it is something specific to the PostgreSQL page layout. If I use /dev/urandom as the source the issue disappears. So much for CRC32 being proven good. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, 2013-04-18 at 19:05 +0200, Florian Pflug wrote: > On Apr18, 2013, at 19:04 , Jeff Davis wrote: > > On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: > >> -Original checksum feature used Fletcher checksums. Its main problems, > >> to quote wikipedia, include that it "cannot distinguish between blocks > >> of all 0 bits and blocks of all 1 bits". > > > > That is fairly easy to fix by using a different modulus: 251 vs 255. > > At the expense of a drastic performance hit though, no? Modulus operations > aren't exactly cheap. Modulo is only necessary when there's a possibility of overflow, or at the very end of the calculation. If we accumulate 32-bit integers into 64-bit sums, then it turns out that it can't overflow given the largest input we support (32K page). 32K page = 8192 32-bit integers 1*(2^32-1) + 2*(2^32-1) + 3*(2^32-1) ... 8192*(2^32-1) = (2^32-1) * (8192^2 + 8192)/2 = 144132780228341760 ( < 2^64-1 ) So, we only need to do the modulo at the end. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
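Jeff's bound can be checked mechanically. The sketch below accumulates the worst-case weighted sum for 8192 words of 0xFFFFFFFF in a 128-bit intermediate (unsigned __int128 is a GCC/Clang extension, used here only so the overflow check itself cannot overflow) and confirms the total fits in 64 bits, so a single modulo at the end is safe.

```c
#include <stdint.h>

/* Worst case for a Fletcher-style weighted second sum over a 32K page:
 * 8192 32-bit words, each at the maximum value 2^32-1, with weights
 * 1..8192.  Returned as a 128-bit value so the check itself is exact. */
static unsigned __int128
fletcher_worst_case(void)
{
    unsigned __int128 total = 0;
    for (uint64_t i = 1; i <= 8192; i++)
        total += (unsigned __int128) i * 0xFFFFFFFFu;
    return total;
}
```

The closed form is (2^32-1) * (8192*8193/2), and the loop lets us confirm both that identity and the fact that the result stays below 2^64-1.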
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug wrote: > So either the CRC32-C polynomial isn't irreducible, or there's something > fishy going on. Could there be a bug in your CRC implementation? Maybe > a mixup between big and little endian, or something like that? I'm suspecting an implementation bug myself. I already checked the test harness and that was all sane, compiler hadn't taken any unforgivable liberties there. I will crosscheck the output with other implementations to verify that the checksum is implemented correctly. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Apr18, 2013, at 18:48 , Ants Aasma wrote: > On Thu, Apr 18, 2013 at 5:57 PM, Ants Aasma wrote: >> I'll generate an avalanche diagram for CRC32C too, but it will take a >> while even if I use a smaller dataset. > > Well that was useless... In CRC flipping each bit in the input flips > preset pattern of bits in the output regardless of the actual data on > the page. Some stats for CRC32C - input bits affect 28344 different > bit combinations. Count of bits by number of duplicated bitpatterns: Yup, CRC is linear too. CRC is essentially long division for polynomials, i.e. you interpret the N input bits as the coefficients of a (large) polynomial of degree (N-1), and divide by the CRC polynomial. The remainder is the checksum, and consists of B bits where B is the degree of the CRC polynomial. (Polynomial here means polynomial over GF(2), i.e. over a field with only two values 0 and 1) I'm currently trying to see if one can easily explain the partial-write behaviour from that. Having lots of zeros at the end corresponds to an input polynomial of the form p(x) * x^l where l is the number of zero bits. The CRC (q(x) is the CRC polynomial) is p(x) * x^l mod q(x) = (p(x) mod q(x)) * (x^l mod q(x)) mod q(x) That still doesn't explain it, though - the result *should* simply be the checksum of p(x), scrambled a bit by the multiplication with (x^l mod q(x)). But if q(x) is irreducible, that scrambling is invertible (as multiplication modulo some irreducible element always is), and thus shouldn't matter much. So either the CRC32-C polynomial isn't irreducible, or there's something fishy going on. Could there be a bug in your CRC implementation? Maybe a mixup between big and little endian, or something like that?
The third possibility is that I'm overlooking something, of course ;-) Will think more about this tomorrow if time permits. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
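The linearity at the heart of this analysis is easy to observe directly. With a zero initial value and no final inversion, a CRC is a linear map over GF(2): crc(a xor b) = crc(a) xor crc(b) for equal-length inputs, which is exactly why each flipped input bit toggles a fixed pattern of output bits no matter what the rest of the page contains. A bitwise sketch with the (reflected) Castagnoli polynomial:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C in "pure" long-division form: zero initial value and
 * no final inversion.  In this form the CRC is linear over GF(2). */
static uint32_t
crc32c_raw(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (crc & 1 ? 0x82F63B78u : 0);
    }
    return crc;
}
```

The standard CRC-32C adds a 0xFFFFFFFF initial value and final inversion, which makes it affine rather than linear, but the fixed per-bit toggle patterns Ants measured remain the same.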
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 8:05 PM, Florian Pflug wrote: > On Apr18, 2013, at 19:04 , Jeff Davis wrote: >> On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: >>> -Original checksum feature used Fletcher checksums. Its main problems, >>> to quote wikipedia, include that it "cannot distinguish between blocks >>> of all 0 bits and blocks of all 1 bits". >> >> That is fairly easy to fix by using a different modulus: 251 vs 255. > > At the expense of a drastic performance hit though, no? Modulus operations > aren't exactly cheap. The modulus can be done at the end. By using a modulus of 65521, the resulting checksum is called Adler-32. [1] However, the quality of Fletcher-32/Adler-32 is strictly worse than even the first iteration of the multiply-add based checksums proposed. [1] http://en.wikipedia.org/wiki/Adler-32 Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
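Deferring the modulus is exactly what Adler-32 does in practice. A sketch of that structure is below; the block size follows zlib's analysis, where 5552 (NMAX) is the largest byte count for which the two 32-bit sums provably cannot overflow before the next reduction, so the inner loop runs modulo-free. This is a generic sketch, not zlib's actual (further unrolled) code.

```c
#include <stdint.h>
#include <stddef.h>

#define ADLER_MOD  65521u   /* largest prime below 2^16 */
#define ADLER_NMAX 5552     /* zlib's safe modulo-free block size */

/* Adler-32: a Fletcher-style pair of running sums, with the expensive
 * modulo deferred to once per NMAX-byte block. */
static uint32_t
adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    while (len > 0)
    {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n-- > 0)
        {
            a += *buf++;    /* no modulo in the inner loop */
            b += a;
        }
        a %= ADLER_MOD;     /* one reduction per block suffices */
        b %= ADLER_MOD;
    }
    return (b << 16) | a;
}
```

As a quick sanity check, adler32() of the ASCII string "Wikipedia" is 0x11E60398.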
Re: [HACKERS] Enabling Checksums
On Apr18, 2013, at 19:04 , Jeff Davis wrote: > On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: >> -Original checksum feature used Fletcher checksums. Its main problems, >> to quote wikipedia, include that it "cannot distinguish between blocks >> of all 0 bits and blocks of all 1 bits". > > That is fairly easy to fix by using a different modulus: 251 vs 255. At the expense of a drastic performance hit though, no? Modulus operations aren't exactly cheap. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
* Greg Smith: > The TCP/IP checksum spec is at https://tools.ietf.org/html/rfc793 ; > its error detection limitations are described at > http://www.noahdavids.org/self_published/CRC_and_checksum.html ; and a > good article about optimizing its code is at > http://www.locklessinc.com/articles/tcp_checksum/ I'll take a longer > look at whether it's an improvement on the Fletcher-16 used by the > current patch. The TCP checksum is too weak to be practical. Every now and then, I see data transfers where the checksum is valid, but the content contains bit flips. Anything that flips bits randomly at intervals which are multiples of 16 bits is quite likely to pass through checksum detection. In practice, TCP relies on checksumming on the sub-IP layers. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: > -Original checksum feature used Fletcher checksums. Its main problems, > to quote wikipedia, include that it "cannot distinguish between blocks > of all 0 bits and blocks of all 1 bits". That is fairly easy to fix by using a different modulus: 251 vs 255. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 5:57 PM, Ants Aasma wrote: > I'll generate an avalanche diagram for CRC32C too, but it will take a > while even if I use a smaller dataset. Well that was useless... In a CRC, flipping each bit in the input flips a preset pattern of bits in the output regardless of the actual data on the page. Some stats for CRC32C - input bits affect 28344 different bit combinations. Count of bits by number of duplicated bitpatterns: [(1, 8868), (2, 17722), (3, 17775), (4, 12048), (5, 5725), (6, 2268), (7, 875), (8, 184), (9, 45), (10, 10), (16, 16)] Count of bit positions by number of bit-positions affected: [(0, 16), (1, 25), (3, 1185), (5, 8487), (7, 22970), (9, 22913), (11, 8790), (13, 1119), (15, 31)] A map of the number of bit positions affected is attached, with 8 rendered as black and 0 or 16 as red. I'm not sure if the issues with partial writes are somehow related to this. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 09:17:39AM +0100, Simon Riggs wrote: > On 17 April 2013 22:36, Bruce Momjian wrote: > > >> > I would like to know the answer of how an upgrade from checksum to > >> > no-checksum would behave so I can modify pg_upgrade to allow it. > >> > >> Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if > >> someone has checksums enabled and wants to disable it, why is pg_upgrade > >> the right time to do that? Wouldn't it make more sense to allow them to > >> do that at any time? > > > > Well, right now, pg_upgrade is the only way you could potentially turn > > off checksums. You are right that we might eventually want a command, > > but my point is that we currently have a limitation in pg_upgrade that > > might not be necessary. > > We don't currently have checksums, so pg_upgrade doesn't need to cope > with turning them off in 9.3 True, 9.2 doesn't have checksums, while 9.3 will. One point is that pg_upgrade could actually be used to turn off checksums for 9.3 to 9.3 upgrades if no tablespaces are used. > For 9.4, it might, but likely we'll have a tool to turn them off > before then anyway. True. Would we want pg_upgrade to still enforce matching checksum modes for old and new servers at that point? Eventually we will have to decide that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 5:08 AM, Greg Smith wrote: > On 4/17/13 8:56 PM, Ants Aasma wrote: >> >> Nothing from the two points, but the CRC calculation algorithm can be >> switched out for slice-by-4 or slice-by-8 variant. Speed up was around >> factor of 4 if I remember correctly...I can provide you >> with a patch of the generic version of any of the discussed algorithms >> within an hour, leaving plenty of time in beta or in 9.4 to >> accommodate the optimized versions. > > Can you nail down a solid, potential for commit slice-by-4 or slice-by-8 > patch then? You dropped into things like per-byte overhead to reach this > conclusion, which was fine to let the methods battle each other. Maybe I > missed it, but I didn't remember seeing an obvious full patch for this > implementation then come back up from that. With the schedule pressure this > needs to return to more database-level tests. Your concerns about the > committed feature being much slower than the original Fletcher one are > troubling, and we might as well do that showdown again now with the best of > the CRC implementations you've found. I meant that any of the fast ones is easy to nail down. The sped-up slice-by-8 is slightly trickier to clean up. Especially if anyone expects it to accelerate WAL calculation, then it brings up a whole bunch of design questions on how to handle alignment issues. For performance testing what is attached should work fine; it would still need some cleanup. > It's fair that you're very concerned about (1), but I wouldn't give it 100% > odds of happening either. The user demand that's motivated me to work on > this will be happy with any of (1) through (3), and in two of them > optimizing the 16 bit checksums now turns out to be premature. Fair enough, although I'd like to point out the optimization is premature in the sense that the effort might go to waste.
The checksum function is a self-contained, easy to test and very low-maintenance piece of code - not the usual premature optimization risk. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de crc32c-sb8-checksum.v0.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 11:08 PM, Andres Freund wrote: > On 2013-04-17 18:16:36 -0700, Daniel Farina wrote: >> The original paper is often shorthanded "Castagnoli 93", but it exists >> in the IEEE's sphere of influence and is hard to find a copy of. >> Luckily, a pretty interesting survey paper discussing some of the >> issues was written by Koopman in 2002 and is available: >> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a >> pedagogical note, it's a pretty interesting and accessible piece of >> writing (for me, as someone who knows little of error >> detection/correction) and explains some of the engineering reasons >> that provoke such exercises. > > http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=231911&userType=inst > > There's also a Koopman paper from 2004 that's interesting. Having read the 2002 paper more, it seems that the current CRC32 doesn't have a whole lot going for it: CRC32C pretty much cleans its clock across the board (I don't understand detected Hamming Distances that seem greater than the information content of the message, e.g. HD 14 with 8 bit messages as seen in CRC32C: that's where CRC32 can "win"). CRC32C looks, all in all, the most flexible, because detection of Hamming Distance 4 spans from 5244-131072 bits (the upper range of which is a full 16KiB!) and there is superior Hamming Distance detection on shorter messages up until the point where it seems like the Hamming Distance able to be detected is larger than the message size itself (e.g. HD 13 on an 8 bit message). I'm not sure if this is an error in my understanding, or what. Also, larger runs (16KB) are better served by CRC32C: even the probably-best contender I can see (0xD419CC15) drops to Hamming Distance 2-detection right after 65505 bits. CRC32C has the biggest range at HD4, although Koopman 0xBA0DC66B comes close, gaining superior Hamming distance detection for 178-16360 bits (the upper end of this range is short of 2KiB by 3 bytes).
All in all, there is no reason I can see to keep CRC32 at all vs CRC32C on the basis of error detection alone, so putting aside all the business about instruction set architecture, I think a software CRC32C in a vacuum can be seen as a robustness improvement. There may be polynomials that are not CRC32 or CRC32C that one might view as having slightly better tradeoffs as seen in Table 1 of Koopman 2002, but it's kind of a stretch: being able to handle 8KB and 16KB at HD4 as seen in CRC32C is awfully compelling to me. Koopman 0xBA0DC66B can admirably reach HD6 on a much larger range, up to 16360 bits, which is ever so shy of 2KiB. Castagnoli 0xD419CC15 can detect HD 5 up to 31 bits short of 8KB. Corrections welcome on my interpretations of Tbl 1. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 17 April 2013 22:36, Bruce Momjian wrote: >> > I would like to know the answer of how an upgrade from checksum to >> > no-checksum would behave so I can modify pg_upgrade to allow it. >> >> Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if >> someone has checksums enabled and wants to disable it, why is pg_upgrade >> the right time to do that? Wouldn't it make more sense to allow them to >> do that at any time? > > Well, right now, pg_upgrade is the only way you could potentially turn > off checksums. You are right that we might eventually want a command, > but my point is that we currently have a limitation in pg_upgrade that > might not be necessary. We don't currently have checksums, so pg_upgrade doesn't need to cope with turning them off in 9.3. For 9.4 it might, but likely we'll have a tool to turn them off before then anyway. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 2013-04-18 00:44:02 +0300, Ants Aasma wrote:

> I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a
> + srl1-xor variants and ran performance tests and detection rate tests
> on both.
>
> Performance results:
> Mul-add checksums: 12.9 bytes/cycle
> FNV-1a checksums: 13.5 bytes/cycle
> FNV-1a + srl-1: 7.4 bytes/cycle
>
> Detection rates:
> False positive rates:
>                  Add-mul      FNV-1a    FNV-1a + srl-1
> Single bit flip: 1:inf        1:129590  1:64795
> Double bit flip: 1:148        1:511     1:53083
> Triple bit flip: 1:673        1:5060    1:61511
> Quad bit flip:   1:1872       1:19349   1:68320
> Write 0x00 byte: 1:774538137  1:118776  1:68952
> Write 0xFF byte: 1:165399500  1:137489  1:68958
> Partial write:   1:59949      1:71939   1:89923
> Write garbage:   1:64866      1:64980   1:67732
> Write run of 00: 1:57077      1:61140   1:59723
> Write run of FF: 1:63085      1:59609   1:62977
>
> Test descriptions:
> N bit flip: picks N random non-overlapping bits and flips their value.
> Write X byte: overwrites a single byte with X.
> Partial write: picks a random cut point, overwrites everything from
> there to end with 0x00.
> Write garbage/run of X: picks two random cut points and fills
> everything in between with random values/X bytes.

I don't think this table is complete without competing numbers for truncated CRC-32. Any chance to get that? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 2013-04-17 18:16:36 -0700, Daniel Farina wrote: > The original paper is often shorthanded "Castagnoli 93", but it exists > in the IEEE's sphere of influence and is hard to find a copy of. > Luckily, a pretty interesting survey paper discussing some of the > issues was written by Koopman in 2002 and is available: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a > pedagogical note, it's a pretty interesting and accessible piece of > writing (for me, as someone who knows little of error > detection/correction) and explains some of the engineering reasons > that provoke such exercises. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=231911&userType=inst There's also a Koopman paper from 2004 that's interesting. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 4/17/13 8:56 PM, Ants Aasma wrote: > Nothing from the two points, but the CRC calculation algorithm can be > switched out for slice-by-4 or slice-by-8 variant. Speed up was around > factor of 4 if I remember correctly... I can provide you with a patch > of the generic version of any of the discussed algorithms within an > hour, leaving plenty of time in beta or in 9.4 to accommodate the > optimized versions. Can you nail down a solid, potential-for-commit slice-by-4 or slice-by-8 patch then? You dropped into things like per-byte overhead to reach this conclusion, which was fine to let the methods battle each other. Maybe I missed it, but I didn't remember seeing an obvious full patch for this implementation then come back up from that. With the schedule pressure this needs to return to more database-level tests. Your concerns about the committed feature being much slower than the original Fletcher one are troubling, and we might as well do that showdown again now with the best of the CRC implementations you've found. > Actually the state is that with the [CRC] polynomial used there is > currently close to zero hope of CPUs optimizing for us. Ah, I didn't catch that before. It sounds like the alternate slicing implementation should also use a different polynomial then, which sounds reasonable. This doesn't even have to be exactly the same CRC function that the WAL uses. A CRC that's modified for performance or having a better future potential is fine; there's just a lot of resistance to using something other than a CRC right now. > I'm not sure about the 9.4 part: if we ship with the builtin CRC as > committed, there is a 100% chance that we will want to switch out the > algorithm in 9.4, and there will be quite a large subset of users that > will find the performance unusable. Now I have to switch out my reviewer hat for my 3 bit fortune telling one. 
(It uses a Magic 8 Ball) This entire approach is squeezing what people would prefer to be a 32 bit CRC into a spare 16 bits, as a useful step advancing toward a long term goal. I have four major branches of possible futures here I've thought about:

1) Database checksums with 16 bits are good enough, but they have to be much faster to satisfy users. It may take a different checksum implementation altogether to make that possible, and distinguishing between the two of them requires borrowing even more metadata bits from somewhere. (This seems the future you're worried about)

2) Database checksums work out well, but they have to be 32 bits to satisfy users and/or error detection needs. Work on pg_upgrade and expanding the page headers will be needed. Optimization of the CRC now has a full 32 bit target.

3) The demand for database checksums is made obsolete by either mainstream filesystem checksumming, performance issues, or just general market whim. The 16 bit checksum PostgreSQL implements becomes a vestigial feature, and whenever it gets in the way of making changes someone proposes eliminating them. (I call this one the "rules" future)

4) 16 bit checksums turn out to be such a problem in the field that everyone regrets the whole thing, and discussions turn immediately toward how to eliminate that risk.

It's fair that you're very concerned about (1), but I wouldn't give it 100% odds of happening either. The user demand that's motivated me to work on this will be happy with any of (1) through (3), and in two of them optimizing the 16 bit checksums now turns out to be premature. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 5:21 PM, Greg Smith wrote: > Let me see if I can summarize where the messages flying by are at since > you'd like to close this topic for now: > > -Original checksum feature used Fletcher checksums. Its main problems, to > quote wikipedia, include that it "cannot distinguish between blocks of all 0 > bits and blocks of all 1 bits". > > -Committed checksum feature uses truncated CRC-32. This has known good > error detection properties, but is expensive to compute. There's reason to > believe that particular computation will become cheaper on future platforms > though. But taking full advantage of that will require adding CPU-specific > code to the database. > > -The latest idea is using the Fowler–Noll–Vo hash function: > https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of > research around when that is good or bad. The exact properties depend on > magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that > can vary based on both your target block size and how many bytes you'll > process at a time. For PostgreSQL checksums, one of the common > problems--getting an even distribution of the hashed values--isn't important > the way it is for other types of hashes. Ants and Florian have now dug into > how exactly that and specific CPU optimization concerns impact the best > approach for 8K database pages. This is very clearly a 9.4 project that is > just getting started. I was curious about the activity in this thread and wanted to understand the tradeoffs, and came to the same understanding as you when poking around. It seems the tough aspect of the equation is that the most well studied thing (CRC-32C) is slow unless you have special ISA support. Trying to find as much information and conclusive research on FNV was a lot more challenging. Fletcher is similar in that regard. 
Given my hasty attempt to understand each of the alternatives, my qualitative judgement is that, strangely enough, the most conservative choice of the three (in terms of being understood and treated in the literature more than ten times over) is CRC-32C, but it's also the one being cast as only suitable inside micro-optimization. To add another, theoretically-oriented dimension to the discussion, I'd like to suggest it's also the most thoroughly studied of all the alternatives. I really had a hard time finding follow-up papers about the two alternatives, but to be fair, I didn't try very hard... then again, I didn't try very hard for any of the three, it's just that CRC32C was by far the easiest to find materials on. The original paper is often shorthanded "Castagnoli 93", but it exists in the IEEE's sphere of influence and is hard to find a copy of. Luckily, a pretty interesting survey paper discussing some of the issues was written by Koopman in 2002 and is available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a pedagogical note, it's a pretty interesting and accessible piece of writing (for me, as someone who knows little of error detection/correction) and explains some of the engineering reasons that provoke such exercises. Basically... if it comes down to understanding what the heck is going on and what the trade-offs are, it was a lot easier to brush up on CRC32-C in my meandering around the Internet. One might think this level of scrutiny would constitute a viable explanation of why CRC32C found its way into several standards and then finally into silicon. All in all, if the real world costs of CRC32C on not-SSE4.2 hardware are allowable, I think it's the most researched and conservative option, although perhaps some of the other polynomials seen in Koopman could also be desirable. It seems there's a tradeoff in CRC polynomials between long-message and short-message error detection, and the paper above may allow for a more informed selection. 
CRC32C is considered a good trade-off for both, but I haven't assessed the paper in enough detail to suggest whether there are specialized long-run polynomials that may be better still (although, then, there is also the microoptimization question, which postdates the literature I was looking at by a lot). 
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 3:21 AM, Greg Smith wrote: > On 4/17/13 6:32 PM, Tom Lane wrote: >> >> The more I read of this thread, the more unhappy I get. It appears that >> the entire design process is being driven by micro-optimization for CPUs >> being built by Intel in 2013. > > > And that's not going to get anyone past review, since all the tests I've > been doing the last two weeks are on how fast an AMD Opteron 6234 with OS > cache >> shared_buffers can run this. The main thing I'm still worried > about is what happens when you have a fast machine that can move memory > around very quickly and an in-memory workload, but it's hamstrung by the > checksum computation--and it's not a 2013 Intel machine. > > The question I started with here was answered to some depth and then skipped > past. I'd like to jerk attention back to that, since I thought some good > answers from Ants went by. Is there a simple way to optimize the committed > CRC computation (or a similar one with the same error detection properties) > based on either: > > a) Knowing that the input will be an 8K page, rather than the existing use > case with an arbitrary sized WAL section. > > b) Straightforward code rearrangement or optimization flags. > > That was all I thought was still feasible to consider changing for 9.3 a few > weeks ago. And the possible scope has only been shrinking since then. Nothing from the two points, but the CRC calculation algorithm can be switched out for slice-by-4 or slice-by-8 variant. Speed up was around factor of 4 if I remember correctly. >> And I reiterate that there is theory out there about the error detection >> capabilities of CRCs. I'm not seeing any theory here, which leaves me >> with very little confidence that we know what we're doing. > > > Let me see if I can summarize where the messages flying by are at since > you'd like to close this topic for now: > > -Original checksum feature used Fletcher checksums. 
Its main problems, to > quote wikipedia, include that it "cannot distinguish between blocks of all 0 > bits and blocks of all 1 bits". That was only the most glaring problem. > -Committed checksum feature uses truncated CRC-32. This has known good > error detection properties, but is expensive to compute. There's reason to > believe that particular computation will become cheaper on future platforms > though. But taking full advantage of that will require adding CPU-specific > code to the database. Actually the state is that with the polynomial used there is currently close to zero hope of CPUs optimizing for us. By switching the polynomial we can have hardware acceleration on Intel CPUs, little hope of others supporting given that AMD hasn't by now and Intel touts around patents in this area. However the calculation can be made about a factor of 4 faster by restructuring the calculation. This optimization is plain C and not CPU specific. The committed checksum is an order of magnitude slower than the Fletcher one that was performance tested with the patch. > -The latest idea is using the Fowler–Noll–Vo hash function: > https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of > research around when that is good or bad. The exact properties depend on > magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that > can vary based on both your target block size and how many bytes you'll > process at a time. For PostgreSQL checksums, one of the common > problems--getting an even distribution of the hashed values--isn't important > the way it is for other types of hashes. Ants and Florian have now dug into > how exactly that and specific CPU optimization concerns impact the best > approach for 8K database pages. This is very clearly a 9.4 project that is > just getting started. 
I'm not sure about the 9.4 part: if we ship with the builtin CRC as committed, there is a 100% chance that we will want to switch out the algorithm in 9.4, and there will be quite a large subset of users that will find the performance unusable. If we change it to whatever we come up with here, there is a small chance that the algorithm will give worse than expected error detection rate in some circumstances and we will want to offer a better algorithm. More probably it will be good enough and the low performance hit will allow more users to turn it on. This is a 16bit checksum that we're talking about, not SHA-1, it is expected to occasionally fail to detect errors. I can provide you with a patch of the generic version of any of the discussed algorithms within an hour, leaving plenty of time in beta or in 9.4 to accommodate the optimized versions. It's literally a dozen self contained lines of code. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
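The slice-by-4 restructuring Ants mentions can be sketched as follows (Python here for brevity; an actual patch would be C). The table construction is the textbook one, and the Castagnoli polynomial below is only an assumption to give the sketch a concrete check value — the committed code uses a different polynomial:

```python
POLY = 0x82F63B78  # reflected CRC-32C polynomial, used here for illustration

# table[0] is the classic byte-at-a-time table; table[k][i] advances byte i
# by k additional zero bytes, so four input bytes can fold in one step.
table = [[0] * 256 for _ in range(4)]
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ (POLY & -(crc & 1))
    table[0][i] = crc
for i in range(256):
    for k in range(1, 4):
        table[k][i] = (table[k - 1][i] >> 8) ^ table[0][table[k - 1][i] & 0xFF]

def crc32c_slice4(data: bytes) -> int:
    crc = 0xFFFFFFFF
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        # Fold four bytes at once; the four table loads are independent,
        # which is where the speedup over byte-at-a-time comes from.
        crc ^= data[i] | data[i+1] << 8 | data[i+2] << 16 | data[i+3] << 24
        crc = (table[3][crc & 0xFF] ^ table[2][(crc >> 8) & 0xFF]
               ^ table[1][(crc >> 16) & 0xFF] ^ table[0][(crc >> 24) & 0xFF])
    for b in data[n:]:  # leftover tail bytes, one at a time
        crc = (crc >> 8) ^ table[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

assert crc32c_slice4(b"123456789") == 0xE3069283
```

Slice-by-8 is the same idea with eight tables and an 8-byte stride.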
Re: [HACKERS] Enabling Checksums
On 4/17/13 6:32 PM, Tom Lane wrote: The more I read of this thread, the more unhappy I get. It appears that the entire design process is being driven by micro-optimization for CPUs being built by Intel in 2013. And that's not going to get anyone past review, since all the tests I've been doing the last two weeks are on how fast an AMD Opteron 6234 with OS cache >> shared_buffers can run this. The main thing I'm still worried about is what happens when you have a fast machine that can move memory around very quickly and an in-memory workload, but it's hamstrung by the checksum computation--and it's not a 2013 Intel machine. The question I started with here was answered to some depth and then skipped past. I'd like to jerk attention back to that, since I thought some good answers from Ants went by. Is there a simple way to optimize the committed CRC computation (or a similar one with the same error detection properties) based on either: a) Knowing that the input will be an 8K page, rather than the existing use case with an arbitrary sized WAL section. b) Straightforward code rearrangement or optimization flags. That was all I thought was still feasible to consider changing for 9.3 a few weeks ago. And the possible scope has only been shrinking since then. And I reiterate that there is theory out there about the error detection capabilities of CRCs. I'm not seeing any theory here, which leaves me with very little confidence that we know what we're doing. Let me see if I can summarize where the messages flying by are at since you'd like to close this topic for now: -Original checksum feature used Fletcher checksums. Its main problems, to quote wikipedia, include that it "cannot distinguish between blocks of all 0 bits and blocks of all 1 bits". -Committed checksum feature uses truncated CRC-32. This has known good error detection properties, but is expensive to compute. There's reason to believe that particular computation will become cheaper on future platforms though. 
But taking full advantage of that will require adding CPU-specific code to the database. -The latest idea is using the Fowler–Noll–Vo hash function: https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of research around when that is good or bad. The exact properties depend on magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that can vary based on both your target block size and how many bytes you'll process at a time. For PostgreSQL checksums, one of the common problems--getting an even distribution of the hashed values--isn't important the way it is for other types of hashes. Ants and Florian have now dug into how exactly that and specific CPU optimization concerns impact the best approach for 8K database pages. This is very clearly a 9.4 project that is just getting started. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
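The all-zeros/all-ones weakness quoted from Wikipedia is easy to demonstrate. Below is a minimal one's-complement Fletcher-16, used purely as a stand-in for the original patch's Fletcher variant (which differed in detail):

```python
def fletcher16(data: bytes) -> int:
    # One's-complement Fletcher: both running sums are reduced modulo 255,
    # so a 0xFF byte (255 = 0 mod 255) contributes exactly like a 0x00 byte.
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

# A page of zero bytes and a page of 0xFF bytes checksum identically.
assert fletcher16(b"\x00" * 8192) == fletcher16(b"\xff" * 8192) == 0
```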
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 2:25 AM, Florian Pflug wrote:

> On Apr17, 2013, at 23:44 , Ants Aasma wrote:
>> Performance results:
>> Mul-add checksums: 12.9 bytes/cycle
>> FNV-1a checksums: 13.5 bytes/cycle
>> FNV-1a + srl-1: 7.4 bytes/cycle
>>
>> Detection rates:
>> False positive rates:
>>                  Add-mul      FNV-1a    FNV-1a + srl-1
>> Single bit flip: 1:inf        1:129590  1:64795
>> Double bit flip: 1:148        1:511     1:53083
>> Triple bit flip: 1:673        1:5060    1:61511
>> Quad bit flip:   1:1872       1:19349   1:68320
>> Write 0x00 byte: 1:774538137  1:118776  1:68952
>> Write 0xFF byte: 1:165399500  1:137489  1:68958
>> Partial write:   1:59949      1:71939   1:89923
>> Write garbage:   1:64866      1:64980   1:67732
>> Write run of 00: 1:57077      1:61140   1:59723
>> Write run of FF: 1:63085      1:59609   1:62977
>>
>> Test descriptions:
>> N bit flip: picks N random non-overlapping bits and flips their value.
>> Write X byte: overwrites a single byte with X.
>> Partial write: picks a random cut point, overwrites everything from
>> there to end with 0x00.
>> Write garbage/run of X: picks two random cut points and fills
>> everything in between with random values/X bytes.
>
> Cool, thanks for testing that! The results for FNV-1a + srl-1 look
> promising, I think. Its failure rate is consistently about 1:2^16,
> which is the value you'd expect. That gives me some confidence that
> the additional shift is working as expected.
>
> BTW, which prime are you using for FNV-1a and FNV-1a+srl1?

The official 32bit FNV one, 16777619. Offsets were just random numbers. Seems good enough given the following from the FNV page: "These non-zero integers are the FNV-0 hashes of the following 32 octets: chongo /\../\" >> The effect on false positive rates >> for double bit errors is particularly impressive. I'm now running a >> testrun that shifts right by 13 to see how that works out, intuitively >> it should help dispersing the bits a lot faster. 
Empirical results are slightly better with shift of 13:

Single bit flip: 1:61615
Double bit flip: 1:58078
Triple bit flip: 1:66329
Quad bit flip: 1:62141
Write 0x00 byte: 1:66327
Write 0xFF byte: 1:65274
Partial write: 1:71939
Write garbage: 1:65095
Write run of 0: 1:62845
Write run of FF: 1:64638

> Maybe, but it also makes *only* bits 14 and 15 actually affect bits > below them, because all others are shifted out. If you choose the > right prime it may still work, you'd have to pick one with > enough lower bits set so that every bit affects bit 14 or 15 at some > point… > > All in all a small shift seems better to me - if 1 for some reason > isn't a good choice, I'd expect 3 or so to be a suitable > replacement, but nothing much larger… I don't think the big shift is a problem, the other bits were taken into account by the multiply, and with the larger shift the next multiplication will disperse the changes once again. Nevertheless, I'm running the tests with shift of 3 now. > I should have some time tomorrow to spend on this, and will try > to validate our FNV-1a modification, and see if I find a way to judge > whether 1 is a good shift. Great. I will spend some brain cycles on it too. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 1:32 AM, Tom Lane wrote: > Ants Aasma writes: >> I was thinking about something similar too. The big issue here is that >> the parallel checksums already hide each other's latencies effectively >> executing one each of movdqu/pmullw/paddw each cycle, that's why the >> N_SUMS adds up to 128 bytes not 16 bytes. > > The more I read of this thread, the more unhappy I get. It appears that > the entire design process is being driven by micro-optimization for CPUs > being built by Intel in 2013. That ought to be, at best, a fifth-order > consideration, with full recognition that it'll be obsolete in two years, > and is already irrelevant to anyone not running one of those CPUs. The large scale structure takes into account the trends in computer architecture. A lot more so than using anything straight out of the literature. Specifically, computer architectures have hit a wall in terms of sequential throughput, so the linear dependency chain in the checksum algorithm will be the bottleneck soon if it isn't already. From that it follows that a fast and future proof algorithm should not calculate the checksum in a single long chain. The proposed algorithms divide the input into 64x64 and 32x64 chunks. It's easy to show that both convert the dependency chain from O(n) to O(sqrt(n)). Secondly, unless we pick something really popular, CPUs are unlikely to provide specifically for us, so the algorithm should be built from general purpose computational pieces. Vector integer multiply and xor are pretty much guaranteed to be there and fast on future CPUs. In my view it's much more probable to be available and fast on future CPUs than something like the Intel CRC32 acceleration. > I would like to ban all discussion of assembly-language optimizations > until after 9.3 is out, so that we can concentrate on what actually > matters. Which IMO is mostly the error detection rate and the probable > nature of false successes. 
I'm glad to see that you're paying at least > some attention to that, but the priorities in this discussion are > completely backwards. I approached it from the angle of what needs to be done to make a fundamentally fast approach have a good enough error detection rate, without a likely error mode that generates false positives. The algorithms are simple enough and well studied enough that the rewards from tweaking them are negligible. I think the resulting performance speaks for itself. Now the question is what is a good enough algorithm. In my view, the checksum is more like a canary in the coal mine, not something that can be relied upon, and so ultimate efficiency is not that important if there are no obvious horrible cases. I can see that there are other views and so am exploring different tradeoffs between performance and quality. > And I reiterate that there is theory out there about the error detection > capabilities of CRCs. I'm not seeing any theory here, which leaves me > with very little confidence that we know what we're doing. I haven't found much literature that is of use here. There is theory underlying here coming from basic number theory and distilled into rules for hash functions. For the FNV hash the prime supposedly is carefully chosen, although all literature so far is saying "it is a good choice, but here is not the place to explain why". Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
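The dependency-chain restructuring Ants describes can be sketched in scalar form. The lane count, the nonzero offset bases, and the final XOR-plus-truncate fold below are illustrative assumptions on my part; the real proposal deals words across SIMD registers and may aggregate differently:

```python
import struct

N_LANES = 32          # assumed lane count; the SIMD variants use 32 or 64
FNV_PRIME = 16777619
MASK32 = 0xFFFFFFFF

def parallel_checksum(page: bytes) -> int:
    # Deal consecutive 32-bit words round-robin across independent lanes,
    # so each lane's serial chain is words/N_LANES steps long, not words.
    lanes = [i + 1 for i in range(N_LANES)]  # arbitrary nonzero bases
    n_words = len(page) // 4
    words = struct.unpack("<%dI" % n_words, page[: n_words * 4])
    for i, v in enumerate(words):
        j = i % N_LANES
        h = lanes[j] ^ v
        lanes[j] = (h * FNV_PRIME) & MASK32   # per-lane FNV-1a style step
    out = 0
    for h in lanes:                           # aggregate: XOR-fold the lanes
        out ^= h
    return (out ^ (out >> 16)) & 0xFFFF       # truncate to the spare 16 bits

checksum = parallel_checksum(bytes(range(256)) * 32)  # an 8KiB sample page
```

Because the lanes never read each other's state inside the loop, a superscalar or SIMD CPU can run all of them concurrently, which is the O(n) to O(sqrt(n)) conversion being described.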
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 23:44 , Ants Aasma wrote:

> Performance results:
> Mul-add checksums: 12.9 bytes/cycle
> FNV-1a checksums: 13.5 bytes/cycle
> FNV-1a + srl-1: 7.4 bytes/cycle
>
> Detection rates:
> False positive rates:
>                  Add-mul      FNV-1a    FNV-1a + srl-1
> Single bit flip: 1:inf        1:129590  1:64795
> Double bit flip: 1:148        1:511     1:53083
> Triple bit flip: 1:673        1:5060    1:61511
> Quad bit flip:   1:1872       1:19349   1:68320
> Write 0x00 byte: 1:774538137  1:118776  1:68952
> Write 0xFF byte: 1:165399500  1:137489  1:68958
> Partial write:   1:59949      1:71939   1:89923
> Write garbage:   1:64866      1:64980   1:67732
> Write run of 00: 1:57077      1:61140   1:59723
> Write run of FF: 1:63085      1:59609   1:62977
>
> Test descriptions:
> N bit flip: picks N random non-overlapping bits and flips their value.
> Write X byte: overwrites a single byte with X.
> Partial write: picks a random cut point, overwrites everything from
> there to end with 0x00.
> Write garbage/run of X: picks two random cut points and fills
> everything in between with random values/X bytes.

Cool, thanks for testing that! The results for FNV-1a + srl-1 look promising, I think. Its failure rate is consistently about 1:2^16, which is the value you'd expect. That gives me some confidence that the additional shift is working as expected. BTW, which prime are you using for FNV-1a and FNV-1a+srl1? > So adding in the shifted value nearly cuts the performance in half. I > think that by playing with the instruction order I might coax the CPU > scheduler to schedule the instructions better, but even in best case > it will be somewhat slower. The point to keep in mind that even this > slower speed is still faster than hardware accelerated CRC32, so all > in all the hit might not be so bad. Yeah. ~7 bytes/cycle still translates to over 10GB/s on a typical CPU, so that's still plenty fast I'd say... > The effect on false positive rates > for double bit errors is particularly impressive. 
I'm now running a > testrun that shifts right by 13 to see how that works out, intuitively > it should help dispersing the bits a lot faster. Maybe, but it also makes *only* bits 14 and 15 actually affect bits below them, because all others are shifted out. If you choose the right prime it may still work, you'd have to pick one with enough lower bits set so that every bit affects bit 14 or 15 at some point… All in all a small shift seems better to me - if 1 for some reason isn't a good choice, I'd expect 3 or so to be a suitable replacement, but nothing much larger… I should have some time tomorrow to spend on this, and will try to validate our FNV-1a modification, and see if I find a way to judge whether 1 is a good shift. >>> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with >>> different offset-basis values, would it be enough to just XOR fold the >>> resulting values together. The algorithm looking like this: >> >> Hm, this will make the algorithm less resilient to some particular >> input permutations (e.g. those which swap the 64*i-th and the (64*i+1)-th >> words), but those seem very unlikely to occur randomly. But if we're >> worried about that, we could use your linear combination method for >> the aggregation phase. > > I don't think it significantly reduces resilience to permutations > thanks to using different basis offsets and multiply not distributing > over xor. Oh, yeah, I thought you were still using 0 as base offset. If you don't, the objection is moot. best regards, Florian Pflug
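Why the shift matters can be shown concretely: in the plain h = (h ^ v) * p step, multiplication modulo 2^32 only carries information upward, so flipping the top bit of an input word flips exactly the top bit of the final hash and nothing else — a 16-bit truncation would never see it. A small demonstration (the prime is the standard 32-bit FNV prime; the basis and sample words are arbitrary choices of mine):

```python
FNV_PRIME = 16777619
MASK32 = 0xFFFFFFFF

def fnv1a(words, basis=2166136261):
    """Plain FNV-1a style step: h = (h ^ v) * p, modulo 2^32."""
    h = basis
    for v in words:
        h = ((h ^ v) * FNV_PRIME) & MASK32
    return h

def fnv1a_srl(words, basis=2166136261, shift=1):
    """The discussed variant: XOR in a right-shifted copy each round."""
    h = basis
    for v in words:
        h = (h ^ v) & MASK32
        h = ((h * FNV_PRIME) & MASK32) ^ (h >> shift)
    return h

words = [0x12345678, 0x9ABCDEF0, 0x0F0F0F0F]
flipped = [words[0] ^ 0x80000000] + words[1:]

# Plain multiply: the difference stays pinned to bit 31 forever.
assert fnv1a(words) ^ fnv1a(flipped) == 0x80000000
# With the shift, the high-bit flip bleeds into the lower bits as well.
assert fnv1a_srl(words) != fnv1a_srl(flipped)
```

The first assertion holds because an odd prime times 2^31 is congruent to 2^31 modulo 2^32, so the bit-31 difference survives every round unchanged.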
Re: [HACKERS] Enabling Checksums
On Apr18, 2013, at 00:32 , Tom Lane wrote: > Ants Aasma writes: >> I was thinking about something similar too. The big issue here is that >> the parallel checksums already hide each other's latencies effectively >> executing one each of movdqu/pmullw/paddw each cycle, that's why the >> N_SUMS adds up to 128 bytes not 16 bytes. > > The more I read of this thread, the more unhappy I get. It appears that > the entire design process is being driven by micro-optimization for CPUs > being built by Intel in 2013. That ought to be, at best, a fifth-order > consideration, with full recognition that it'll be obsolete in two years, > and is already irrelevant to anyone not running one of those CPUs. Micro-optimization for particular CPUs yes, but general performance considerations, no. For example, 2^n is probably one of the worst moduli you can pick for a hash function - any prime would work much better. But doing the computations modulo 2^16 or 2^32 carries zero performance overhead, whereas picking another modulus requires some renormalization after every operation. That, however, is *not* a given - it stems from the fact that nearly all CPUs in existence operate on binary integers. This fact must thus enter into the design phase very early, and makes 2^16 or 2^32 a sensible choice for a modulus *despite* its shortcomings, simply because it allows for fast implementations. > I would like to ban all discussion of assembly-language optimizations > until after 9.3 is out, so that we can concentrate on what actually > matters. Which IMO is mostly the error detection rate and the probable > nature of false successes. I'm glad to see that you're paying at least > some attention to that, but the priorities in this discussion are > completely backwards. I'd say lots of attention is paid to that, but there's *also* attention paid to speed. Which is good, because ideally we want to end up with a checksum that both has good error-detection properties *and* good performance. 
If performance is of no concern to us, then there's little reason not to use CRC… > And I reiterate that there is theory out there about the error detection > capabilities of CRCs. I'm not seeing any theory here, which leaves me > with very little confidence that we know what we're doing. If you've got any pointers to literature on error-detection capabilities of CPU-friendly checksum functions, please share. I am aware of the vast literature on CRC, and also on some other algebraic approaches, but none of those even come close to the speed of FNV+shift (unless there's a special CRC instruction, that is). And there's also a ton of stuff on cryptographic hashing, but those are optimized for a completely different use-case... best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
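Florian's point about moduli can be made concrete with a toy sketch (illustrative Python, not PostgreSQL code; the multiplier and names are invented): arithmetic modulo 2^16 is free because unsigned 16-bit arithmetic wraps around by itself, whereas a prime modulus needs an explicit reduction after every operation.

```python
# Toy sketch: the mask models what uint16 hardware wraparound does for free,
# while the prime modulus costs an extra division-based reduction per step.
MASK16 = 0xFFFF       # modulo 2^16: a single AND (free on real 16-bit registers)
PRIME_MOD = 65521     # largest prime below 2^16, a hypothetical alternative

def step_mod_pow2(state, value, mult=31):
    # one multiply-add step modulo 2^16
    return (state * mult + value) & MASK16

def step_mod_prime(state, value, mult=31):
    # same step modulo a prime: an extra explicit reduction every operation
    return (state * mult + value) % PRIME_MOD
```

The per-step cost difference is exactly the renormalization Florian describes; the detection-quality tradeoff is a separate question.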
Re: [HACKERS] Enabling Checksums
Ants Aasma writes: > I was thinking about something similar too. The big issue here is that > the parallel checksums already hide each other latencies effectively > executing one each of movdqu/pmullw/paddw each cycle, that's why the > N_SUMS adds up to 128 bytes not 16 bytes. The more I read of this thread, the more unhappy I get. It appears that the entire design process is being driven by micro-optimization for CPUs being built by Intel in 2013. That ought to be, at best, a fifth-order consideration, with full recognition that it'll be obsolete in two years, and is already irrelevant to anyone not running one of those CPUs. I would like to ban all discussion of assembly-language optimizations until after 9.3 is out, so that we can concentrate on what actually matters. Which IMO is mostly the error detection rate and the probable nature of false successes. I'm glad to see that you're paying at least some attention to that, but the priorities in this discussion are completely backwards. And I reiterate that there is theory out there about the error detection capabilities of CRCs. I'm not seeing any theory here, which leaves me with very little confidence that we know what we're doing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 6:54 PM, Florian Pflug wrote: > On Apr17, 2013, at 16:47 , Ants Aasma wrote: >> This made me remember, the issue I had was with high order bits, not >> with low order ones, somehow I got them confused. The exact issue is >> that the high order bits don't affect any bit lower than them. It's >> easy to see that if you remember the shift and add nature of multiply. >> Unfortunately XOR will not fix that. Neither will adding an offset >> basis. This is the fundamental thing that is behind the not-so-great >> uncorrelated bit error detection rate. > > Right. We could maybe fix that by extending the update step to > > t = s[j] ^ d[i,j] > s[j] = (t * PRIME) ^ (t >> 1) > > or something like that. Shifting t instead of (t * PRIME) should > help to reduce the performance impact, since a reordering CPU should > be able to parallelize the multiply and the shift. Note though that > I haven't really thought that through extensively - the general idea > should be sound, but whether 1 is a good shifting amount I do not > know. I was thinking about something similar too. The big issue here is that the parallel checksums already hide each other's latencies, effectively executing one each of movdqu/pmullw/paddw each cycle; that's why the N_SUMS adds up to 128 bytes not 16 bytes. I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a + srl1-xor variants and ran performance tests and detection rate tests on both.
Performance results:
  Mul-add checksums:  12.9 bytes/cycle
  FNV-1a checksums:   13.5 bytes/cycle
  FNV-1a + srl-1:      7.4 bytes/cycle

Detection rates (false positive rates):
                    Add-mul      FNV-1a    FNV-1a + srl-1
  Single bit flip:  1:inf        1:129590  1:64795
  Double bit flip:  1:148        1:511     1:53083
  Triple bit flip:  1:673        1:5060    1:61511
  Quad bit flip:    1:1872       1:19349   1:68320
  Write 0x00 byte:  1:774538137  1:118776  1:68952
  Write 0xFF byte:  1:165399500  1:137489  1:68958
  Partial write:    1:59949      1:71939   1:89923
  Write garbage:    1:64866      1:64980   1:67732
  Write run of 00:  1:57077      1:61140   1:59723
  Write run of FF:  1:63085      1:59609   1:62977

Test descriptions: N bit flip: picks N random non-overlapping bits and flips their value. Write X byte: overwrites a single byte with X. Partial write: picks a random cut point, overwrites everything from there to the end with 0x00. Write garbage/run of X: picks two random cut points and fills everything in between with random values/X bytes. So adding in the shifted value nearly cuts the performance in half. I think that by playing with the instruction order I might coax the CPU scheduler to schedule the instructions better, but even in the best case it will be somewhat slower. The point to keep in mind is that even this slower speed is still faster than hardware accelerated CRC32, so all in all the hit might not be so bad. The effect on false positive rates for double bit errors is particularly impressive. I'm now running a test run that shifts right by 13 to see how that works out; intuitively it should help disperse the bits a lot faster. >> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with >> different offset-basis values, would it be enough to just XOR fold the >> resulting values together. The algorithm looking like this: > > Hm, this will make the algorithm less resilient to some particular > input permutations (e.g. those which swap the 64*i-th and the (64*i+1)-th > words), but those seem very unlikely to occur randomly.
But if we're > worried about that, we could use your linear combination method for > the aggregation phase. I don't think it significantly reduces resilience to permutations thanks to using different basis offsets and multiply not distributing over xor. >> Speaking against this option is the fact that we will need to do CPU >> detection at startup to make it fast on the x86 that support SSE4.1, >> and the fact that AMD CPUs before 2011 will run it an order of >> magnitude slower (but still faster than the best CRC). > > Hm, CPU detection isn't that hard, and given the speed at which Intel > currently invents new instructions we'll end up going that route sooner > or later anyway, I think. Sure it's not that hard but it does have an order of magnitude more design decisions than #if defined(__x86_64__). Maybe a first stab could avoid a generic infrastructure and just have the checksum function as a function pointer, with the default "trampoline" implementation running a cpuid and overwriting the function pointer with either the optimized or generic versions and then calling it. >> Any opinions if it would be a reasonable tradeoff to have a better >> checksum with great performance on latest x86 CPUs and good >> performance on other architectures at the expense of having only ok >> performance on older AMD CPUs? > > The loss on AMD is offset by the increased performance on machines > where we can't vectorize, I'd say. +1 Old AMD machines won'
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 01:59:12PM -0700, Jeff Davis wrote: > On Wed, 2013-04-17 at 12:42 -0400, Bruce Momjian wrote: > > > AFAIK, there's currently no per-page checksum flag. Still, being only > > > able to go from checksummed to not-checksummed probably is for all > > > practical purposes the same as not being able to pg_upgrade at all. > > > Otherwise, why would people have enabled checksums in the first place? > > > > Good point, but it is _an_ option, at least. > > > > I would like to know the answer of how an upgrade from checksum to > > no-checksum would behave so I can modify pg_upgrade to allow it. > > Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if > someone has checksums enabled and wants to disable it, why is pg_upgrade > the right time to do that? Wouldn't it make more sense to allow them to > do that at any time? Well, right now, pg_upgrade is the only way you could potentially turn off checksums. You are right that we might eventually want a command, but my point is that we currently have a limitation in pg_upgrade that might not be necessary. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-17 at 16:58 +0100, Greg Stark wrote: > On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: > > Is there any way we can change the checksum algorithm in 9.4 > > *without* breaking pg_upgrade? > > Personally I think we're going to need a solution for page format > changes someday eventually > > What advantages are we postponing now to avoid it? > > * 32-bit checksums? > * Being able to enable/disable checksums? > > Anything else? I'm not sure that changing the page format is the most difficult part of enabling/disabling checksums. It's easy enough to have page header bits if the current information is not enough (and those bits were there, but Heikki requested their removal and I couldn't think of a concrete reason to keep them). Eventually, it would be nice to be able to break the page format and have more space for things like checksums (and probably a few other things, maybe some visibility-related optimizations). But that's a few years off and we don't have any real plan for that. What I wanted to accomplish with this patch is the simplest checksum mechanism that we could get that would be fast enough that many people would be able to use it. I expect it to be useful until we do decide to break the page format. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-17 at 12:42 -0400, Bruce Momjian wrote: > > AFAIK, there's currently no per-page checksum flag. Still, being only > > able to go from checksummed to not-checksummed probably is for all > > practical purposes the same as not being able to pg_upgrade at all. > > Otherwise, why would people have enabled checksums in the first place? > > Good point, but it is _an_ option, at least. > > I would like to know the answer of how an upgrade from checksum to > no-checksum would behave so I can modify pg_upgrade to allow it. Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if someone has checksums enabled and wants to disable it, why is pg_upgrade the right time to do that? Wouldn't it make more sense to allow them to do that at any time? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Bruce Momjian writes: > On Wed, Apr 17, 2013 at 01:29:18PM -0400, Tom Lane wrote: >> But having said that, I'm not sure why this would be pg_upgrade's >> problem. By definition, we do not want pg_upgrade running around >> looking at individual data pages. Therefore, whatever we might do >> about checksum algorithm changes would have to be something that can be >> managed on-the-fly by the newer server. > Well, my idea was that pg_upgrade would allow upgrades from old clusters > with the same checksum algorithm version, but not non-matching ones. > This would allow the checksum algorithm to be changed and force > pg_upgrade to fail. It's rather premature to be defining pg_upgrade's behavior for a situation that doesn't exist yet, and may very well never exist in that form. It seems more likely to me that we'd want to allow incremental algorithm changes, in which case pg_upgrade ought not do anything about this case anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 01:29:18PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > Uh, not sure how pg_upgrade would detect that as the version number is > > not stored in pg_controldata, e.g.: > > > Data page checksums: enabled/disabled > > That seems pretty shortsighted. The field probably ought to be defined > as containing a checksum algorithm ID number, not a boolean. > > But having said that, I'm not sure why this would be pg_upgrade's > problem. By definition, we do not want pg_upgrade running around > looking at individual data pages. Therefore, whatever we might do > about checksum algorithm changes would have to be something that can be > managed on-the-fly by the newer server. Well, my idea was that pg_upgrade would allow upgrades from old clusters with the same checksum algorithm version, but not non-matching ones. This would allow the checksum algorithm to be changed and force pg_upgrade to fail. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Bruce Momjian writes: > Uh, not sure how pg_upgrade would detect that as the version number is > not stored in pg_controldata, e.g.: > Data page checksums: enabled/disabled That seems pretty shortsighted. The field probably ought to be defined as containing a checksum algorithm ID number, not a boolean. But having said that, I'm not sure why this would be pg_upgrade's problem. By definition, we do not want pg_upgrade running around looking at individual data pages. Therefore, whatever we might do about checksum algorithm changes would have to be something that can be managed on-the-fly by the newer server. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 01:22:01PM -0400, Tom Lane wrote: > Greg Stark writes: > > On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: > >> Is there any way we can change the checksum algorithm in 9.4 > >> *without* breaking pg_upgrade? > > > Personally I think we're going to need a solution for page format > > changes someday eventually > > > What advantages are we postponing now to avoid it? > > Um, other than the ability to make a release? > > We aren't going to hold up 9.3 until that particular bit of pie in the > sky lands. Indeed I don't expect to see it available in the next couple > years either. When we were looking at that seriously, two or three > years ago, arbitrary page format changes looked *hard*. > > The idea of bumping the page format version number to signal a checksum > algorithm change might work though. Uh, not sure how pg_upgrade would detect that as the version number is not stored in pg_controldata, e.g.: Data page checksums: enabled/disabled Do we need to address this for 9.3? (Yuck) -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Greg Stark writes: > On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: >> Is there any way we can change the checksum algorithm in 9.4 >> *without* breaking pg_upgrade? > Personally I think we're going to need a solution for page format > changes someday eventually > What advantages are we postponing now to avoid it? Um, other than the ability to make a release? We aren't going to hold up 9.3 until that particular bit of pie in the sky lands. Indeed I don't expect to see it available in the next couple years either. When we were looking at that seriously, two or three years ago, arbitrary page format changes looked *hard*. The idea of bumping the page format version number to signal a checksum algorithm change might work though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 06:33:58PM +0200, Florian Pflug wrote: > > I was going to ask about the flexibility of pg_upgrade and checksums. > > Right now you have to match the old and new cluster checksum modes, but > > it seems it would be possible to allow pg_upgrade to go from checksum to > > no-checksum servers. Does the backend look at the pg_controldata setting, > > or at the page checksum flag? If the former, it seems pg_upgrade could > > run a no-checksum server just fine that had checksum information on > > its pages. This might give us more flexibility in changing the checksum > > algorithm in the future, i.e. you only lose checksum ability. > > AFAIK, there's currently no per-page checksum flag. Still, being only > able to go from checksummed to not-checksummed probably is for all > practical purposes the same as not being able to pg_upgrade at all. > Otherwise, why would people have enabled checksums in the first place? Good point, but it is _an_ option, at least. I would like to know the answer of how an upgrade from checksum to no-checksum would behave so I can modify pg_upgrade to allow it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 18:15 , Bruce Momjian wrote: > On Wed, Apr 17, 2013 at 05:28:06PM +0200, Florian Pflug wrote: >> However, you're right that time's running out. It'd be a shame though >> if we'd lock ourselves into CRC as the only available algorithm essentially >> forever. Is there any way we can change the checksum algorithm in 9.4 >> *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used >> for that - we could make version 5 mean "just like version 4, but with >> a different checksum algorithm". Since the layout wouldn't actually >> change, that'd be far easier to pull off than actually supporting multiple >> page layouts. If that works, then shipping 9.3 with CRC is probably >> the best solution. If not, we should see to it that something like Ants' >> parallel version of FNV or a smaller variant gets into 9.3 if at all possible, >> IMHO. > > I was going to ask about the flexibility of pg_upgrade and checksums. > Right now you have to match the old and new cluster checksum modes, but > it seems it would be possible to allow pg_upgrade to go from checksum to > no-checksum servers. Does the backend look at the pg_controldata setting, > or at the page checksum flag? If the former, it seems pg_upgrade could > run a no-checksum server just fine that had checksum information on > its pages. This might give us more flexibility in changing the checksum > algorithm in the future, i.e. you only lose checksum ability. AFAIK, there's currently no per-page checksum flag. Still, being only able to go from checksummed to not-checksummed probably is for all practical purposes the same as not being able to pg_upgrade at all. Otherwise, why would people have enabled checksums in the first place? best regards, Florian Pflug
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 05:28:06PM +0200, Florian Pflug wrote: > However, you're right that time's running out. It'd be a shame though > if we'd lock ourselves into CRC as the only available algorithm essentially > forever. Is there any way we can change the checksum algorithm in 9.4 > *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used > for that - we could make version 5 mean "just like version 4, but with > a different checksum algorithm". Since the layout wouldn't actually > change, that'd be far easier to pull off than actually supporting multiple > page layouts. If that works, then shipping 9.3 with CRC is probably > the best solution. If not, we should see to it that something like Ants' > parallel version of FNV or a smaller variant gets into 9.3 if at all possible, > IMHO. I was going to ask about the flexibility of pg_upgrade and checksums. Right now you have to match the old and new cluster checksum modes, but it seems it would be possible to allow pg_upgrade to go from checksum to no-checksum servers. Does the backend look at the pg_controldata setting, or at the page checksum flag? If the former, it seems pg_upgrade could run a no-checksum server just fine that had checksum information on its pages. This might give us more flexibility in changing the checksum algorithm in the future, i.e. you only lose checksum ability. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: > Is there any way we can change the checksum algorithm in 9.4 > *without* breaking pg_upgrade? Personally I think we're going to need a solution for page format changes someday eventually What advantages are we postponing now to avoid it? * 32-bit checksums? * Being able to enable/disable checksums? Anything else? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 16:47 , Ants Aasma wrote: > This made me remember, the issue I had was with high order bits, not > with low order ones, somehow I got them confused. The exact issue is > that the high order bits don't affect any bit lower than them. It's > easy to see that if you remember the shift and add nature of multiply. > Unfortunately XOR will not fix that. Neither will adding an offset > basis. This is the fundamental thing that is behind the not-so-great > uncorrelated bit error detection rate. Right. We could maybe fix that by extending the update step to t = s[j] ^ d[i,j] s[j] = (t * PRIME) ^ (t >> 1) or something like that. Shifting t instead of (t * PRIME) should help to reduce the performance impact, since a reordering CPU should be able to parallelize the multiply and the shift. Note though that I haven't really thought that through extensively - the general idea should be sound, but whether 1 is a good shifting amount I do not know. > While I understand that linearity is not a desirable property, I > couldn't think of a realistic case where it would hurt. I can see how > it can hurt checksums of variable length values, but for our fixed > buffer case it's definitely not so clear cut. On the pro side the > distributive property that is behind linearity allowed me to do final > aggregation in a tree form, performing the multiplies in parallel > instead of linearly. This adds up to the difference between 250 cycles > (64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5 > cycle addw). Given that the main loop is about 576 cycles, this is a > significant difference. > I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with > different offset-basis values, would it be enough to just XOR fold the > resulting values together. The algorithm looking like this: Hm, this will make the algorithm less resilient to some particular input permutations (e.g.
those which swap the 64*i-th and the (64*i+1)-th words), but those seem very unlikely to occur randomly. But if we're worried about that, we could use your linear combination method for the aggregation phase. > Speaking against this option is the fact that we will need to do CPU > detection at startup to make it fast on the x86 that support SSE4.1, > and the fact that AMD CPUs before 2011 will run it an order of > magnitude slower (but still faster than the best CRC). Hm, CPU detection isn't that hard, and given the speed at which Intel currently invents new instructions we'll end up going that route sooner or later anyway, I think. > Any opinions if it would be a reasonable tradeoff to have a better > checksum with great performance on latest x86 CPUs and good > performance on other architectures at the expense of having only ok > performance on older AMD CPUs? The loss on AMD is offset by the increased performance on machines where we can't vectorize, I'd say. best regards, Florian Pflug
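Why the extra `(t >> 1)` term helps can be seen with 16-bit toy values (illustrative Python; the multiplier below is an arbitrary odd constant, not the patch's actual PRIME): with the plain multiply-only step, a difference in the top input bit stays confined to the top bit forever, while the shifted term moves it downward one position per round.

```python
# Toy 16-bit demo of the "(t * PRIME) ^ (t >> 1)" mixing step.
# PRIME is an arbitrary odd 16-bit constant chosen for illustration.
PRIME = 0x01A7
M = 0xFFFF

def step_plain(s, d):
    # multiply-only step: input bit k can only affect output bits >= k
    return ((s ^ d) * PRIME) & M

def step_srl1(s, d):
    # Florian's proposal: the shifted term leaks high bits downward
    t = s ^ d
    return ((t * PRIME) ^ (t >> 1)) & M

def diff_after_rounds(step, rounds):
    # feed two streams that differ only in bit 15 of the first word,
    # identical data afterwards; return the resulting state difference
    a = b = 0x1234
    a, b = step(a, 0x0000), step(b, 0x8000)
    for _ in range(rounds - 1):
        a, b = step(a, 0x5555), step(b, 0x5555)
    return a ^ b
```

Because the multiplier is odd, the plain step maps a bit-15 difference to exactly a bit-15 difference every round, so it never detects which of the two streams it saw beyond that single bit; the srl-1 variant immediately spreads the difference into bit 14 and below on subsequent rounds.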
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 17:09 , Bruce Momjian wrote: > As much as I love the idea of improving the algorithm, it is disturbing > we are discussing this so close to beta, with an algorithm that is under > analysis, with no (runtime) CPU detection, and in something that is > going to be embedded into our data page format. I can't even think of > another case where we do run-time CPU detection. We could still ship the new checksum algorithm with 9.3, but omit the SSE-optimized version, i.e. include only the plain C implementation. I think Ants mentioned somewhere that gcc does a pretty good job of vectorizing that, so people who really care (and who use GCC) could compile with -msse4.1 -funroll-loops -ftree-vectorize, and get performance close to that of a hand-coded SSE version. The important decision we're facing is which algorithm to use. I personally believe Ants is on the right track there - FNV or a variant thereof looks like a good choice to me, but the details have yet to be nailed down, I think. However, you're right that time's running out. It'd be a shame though if we'd lock ourselves into CRC as the only available algorithm essentially forever. Is there any way we can change the checksum algorithm in 9.4 *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used for that - we could make version 5 mean "just like version 4, but with a different checksum algorithm". Since the layout wouldn't actually change, that'd be far easier to pull off than actually supporting multiple page layouts. If that works, then shipping 9.3 with CRC is probably the best solution. If not, we should see to it that something like Ants' parallel version of FNV or a smaller variant gets into 9.3 if at all possible, IMHO. best regards, Florian Pflug
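Florian's pd_pagesize_version idea amounts to a dispatch table keyed on the page version. A hypothetical sketch (illustrative Python, not backend code; the stand-in checksum functions and constants are invented for illustration, not taken from the patch):

```python
# Hypothetical sketch: version 4 keeps the original algorithm, version 5
# means "same page layout, different checksum algorithm". Both stand-in
# functions below are invented; only the dispatch shape matters.

def checksum_v4(data: bytes, blkno: int) -> int:
    return (sum(data) + blkno) & 0xFFFF            # stand-in for the 9.3 algorithm

def checksum_v5(data: bytes, blkno: int) -> int:
    h = 0x811C                                     # stand-in for an FNV-style variant
    for byte in data:
        h = ((h ^ byte) * 0x01A7) & 0xFFFF
    return h ^ blkno

CHECKSUM_BY_PAGE_VERSION = {4: checksum_v4, 5: checksum_v5}

def verify_page(page_version: int, data: bytes, blkno: int, stored: int) -> bool:
    # the layout is identical either way; only the algorithm differs
    return CHECKSUM_BY_PAGE_VERSION[page_version](data, blkno) == stored
```

Since the version already lives in each page header, no new on-disk state is needed, which is what makes this cheaper than supporting genuinely different page layouts.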
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 05:47:55PM +0300, Ants Aasma wrote: > The SSE4.1 implementation of this would be as fast as the last pat, > generic version will be faster and we avoid the linearity issue. By > using different offsets for each of the partial hashes we don't > directly suffer from commutativity of the final xor folding. By using > the xor-then-multiply variant the last values hashed have their bits > mixed before folding together. > > Speaking against this option is the fact that we will need to do CPU > detection at startup to make it fast on the x86 that support SSE4.1, > and the fact that AMD CPUs before 2011 will run it an order of > magnitude slower (but still faster than the best CRC). > > Any opinions if it would be a reasonable tradeoff to have a better > checksum with great performance on latest x86 CPUs and good > performance on other architectures at the expense of having only ok > performance on older AMD CPUs? > > Also, any good suggestions where should we do CPU detection when we go > this route? As much as I love the idea of improving the algorithm, it is disturbing we are discussing this so close to beta, with an algorithm that is under analysis, with no (runtime) CPU detection, and in something that is going to be embedded into our data page format. I can't even think of another case where we do run-time CPU detection. I am wondering if we need to tell users that pg_upgrade will not be possible if you enable page-level checksums, so we are not trapped with something we want to improve in 9.4. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 2:26 AM, Florian Pflug wrote: >>> This raises two questions. First, why are there two primes? You could >>> just as well use a single prime q and set p=q^64 mod 2^16. You then >>> get >>> S = sum V[i,j] * q^(64*(64-i) + (64-j)) >>> = sum V[i,j] * q^(4096 - 64*(i-1) - j) >>> You get higher prime powers that way, but you can easily choose a prime >>> that yields distinct values mod 2^16 for exponents up to 16383. Your >>> PRIME2, for example, does. (It wraps around for 16384, i.e. >>> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since >>> 16384 is the Carmichael function's value at 2^16) >> >> The experimental detection rate is about the same if we use a single >> prime. But I think you have the analytical form wrong here. It should >> be given q = p: >> >> S = sum V[i,j] * p^(64-i) * p^(64-j) >> = sum V[i,j] * p^(64 - i + 64 - j) >> = sum V[i,j] * p^(128 - i - j) > Yeah, if you set q = p that's true. My suggestion was p=q^64 though... So it was, I guess it was too late here and I missed it... All things considered that is a good suggestion, if for nothing else, the generic implementation can be smaller this way. >>> Second, why does it use addition instead of XOR? It seems that FNV >>> usually XORs the terms together instead of adding them? >> >> Testing showed slightly better detection rate for adds. Intuitively I >> think it's because the carry introduces some additional mixing. > > Hm, but OTOH it makes S linear in V, i.e. if you have two inputs > V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then > S' = S*m. The second property is quite undesirable, I think. Assume > all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit > positions 0..(k-1). Then, due to linearity, S is also divisible by > 2^k, i.e. also has no ones before the k-th bit. This means, for example > that if you hash values which all have their lowest bit cleared, > you get only 2^15 distinct hash values.
If they all have the two > lowest bits cleared, you get only 2^14 distinct values, and so on… > > Generally, linearity doesn't seem to be a property that one wants > in a hash I think, so my suggestion is to stick to XOR. This made me remember, the issue I had was with high order bits, not with low order ones, somehow I got them confused. The exact issue is that the high order bits don't affect any bit lower than them. It's easy to see that if you remember the shift and add nature of multiply. Unfortunately XOR will not fix that. Neither will adding an offset basis. This is the fundamental thing that is behind the not-so-great uncorrelated bit error detection rate. While I understand that linearity is not a desirable property, I couldn't think of a realistic case where it would hurt. I can see how it can hurt checksums of variable length values, but for our fixed buffer case it's definitely not so clear cut. On the pro side the distributive property that is behind linearity allowed me to do final aggregation in a tree form, performing the multiplies in parallel instead of linearly. This adds up to the difference between 250 cycles (64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5 cycle addw). Given that the main loop is about 576 cycles, this is a significant difference. >>> Here, btw, is a page on FNV hashing. It mentions a few rules for >>> picking suitable primes >>> >>> http://www.isthe.com/chongo/tech/comp/fnv >> >> Unfortunately the rules don't apply here because of the hash size. > > Yeah :-(. > > I noticed that their 32-bit prime only has a single one outside > the first 16 bits. Maybe we can take advantage of that and use a > 32-bit state while still providing decent performance on machines > without a 32-bit x 32-bit -> 32-bit multiply instruction? 
Looking at the Power instruction set, a 32bit mul by the FNV prime would
look like this:

    vmulouh   tmp1, hash, prime
    vmladduhm tmp1, hash, prime<<16
    vslw      tmp2, hash, 24
    vadduwm   hash, tmp1, tmp2

That is 4 instructions to multiply 4 values. Depending on the specific
execution ports on the processor it might be faster or slower than the
scalar version, but not by a whole lot. The main benefit would be that
the intermediate state could be held in registers.

> If we lived in an Intel-only world, I'd suggest going with a
> 32-bit state, since SSE4.1 support is *very* wide-spread already -
> the last CPUs without it came out over 5 years ago, I think.
> (Core2 and later support SSE4.1, and some later Core1 do too)
>
> But unfortunately things look bleak even for other x86
> implementations - AMD supports SSE4.1 only starting with
> Bulldozer, which came out 2011 or so I believe. Leaving the x86
> realm, it seems that only ARM's NEON provides the instructions
> we'd need - AltiVec seems to support only 16-bit multiplies,
> and from what some quick googling brought up, MIPS and SPARC
> SIMD instructions look no better..
>
> OTOH, chances are that nobody will ever d
Re: [HACKERS] Enabling Checksums
On Apr16, 2013, at 23:41 , Ants Aasma wrote:
> On Tue, Apr 16, 2013 at 11:20 PM, Florian Pflug wrote:
>> On Apr13, 2013, at 17:14 , Ants Aasma wrote:
>>> Based on current analysis, it is particularly good at detecting single
>>> bit errors, as good at detecting burst errors as can be expected from
>>> 16 bits and not horrible at detecting burst writes of zeroes. It is
>>> quite bad at detecting multiple uncorrelated single bit errors and
>>> extremely bad at detecting repeating patterns of errors in low order
>>> bits.
>>
>> I've read the patch and tried to understand why it's that bad at
>> detecting repeating patterns of errors in low order bits, and to see
>> if there might be a way to fix that without too much of a performance
>> impact.
>>
>> Here's what I gather the algorithm does:
>>
>> It treats the input data, a page of L bytes, as a Nx64 matrix V
>> of 16-bit quantities (N = L/64, obviously).
>> It then first computes (using two primes p (PRIME1) and q (PRIME2))
>>
>>   S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
>>     + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
>>     + …
>>     + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
>>     (mod 2^16)
>>   = sum V[i,j]*p^(64-i)*q^(64-j)
>>
>> Note that it does that by first computing the row-wise sums without
>> the q^i coefficient, and then (in what the code calls the aggregation
>> phase) combines those row-wise sums into a total, adding the q^i-
>> coefficients along the way.
>>
>> The final hash value is then
>>
>>   H = S * p + B * q mod 2^16
>>
>> where B is a salt value intended to detect swapped pages (currently
>> B is simply the page index)
>
> Great job analyzing the analytic form of the algorithm, and sorry you
> had to do it instead of finding it in the documentation.

No problem, glad if I can help!

>> This raises two questions. First, why are there two primes? You could
>> just as well use a single prime q and set p=q^64 mod 2^16. You then get
>>
>>   S = sum V[i,j] * q^(64*(64-i) + (64-j))
>>     = sum V[i,j] * q^(4096 - 64*(i-1) - j)
>>
>> You get higher prime powers that way, but you can easily choose a prime
>> that yields distinct values mod 2^16 for exponents up to 16383. Your
>> PRIME2, for example, does. (It wraps around for 16384, i.e.
>> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
>> 16384 is the Carmichael function's value at 2^16)
>
> The experimental detection rate is about the same if we use a single
> prime. But I think you have the analytical form wrong here. It should
> be, given q = p:
>
>   S = sum V[i,j] * p^(64-i) * p^(64-j)
>     = sum V[i,j] * p^(64 - i + 64 - j)
>     = sum V[i,j] * p^(128 - i - j)

Yeah, if you set q = p that's true. My suggestion was p=q^64 though...

>> Second, why does it use addition instead of XOR? It seems that FNV
>> usually XORs the terms together instead of adding them?
>
> Testing showed a slightly better detection rate for adds. Intuitively I
> think it's because the carry introduces some additional mixing.

Hm, but OTOH it makes S linear in V, i.e. if you have two inputs
V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then
S' = S*m. The second property is quite undesirable, I think. Assume
all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit
positions 0..(k-1). Then, due to linearity, S is also divisible by
2^k, i.e. also has no ones before the k-th bit. This means, for example,
that if you hash values which all have their lowest bit cleared,
you get only 2^15 distinct hash values. If they all have the two
lowest bits cleared, you get only 2^14 distinct values, and so on…

Generally, linearity doesn't seem to be a property that one wants
in a hash I think, so my suggestion is to stick to XOR.

>> Here, btw, is a page on FNV hashing. It mentions a few rules for
>> picking suitable primes
>>
>> http://www.isthe.com/chongo/tech/comp/fnv
>
> Unfortunately the rules don't apply here because of the hash size.

Yeah :-(.

I noticed that their 32-bit prime only has a single one outside
the first 16 bits. Maybe we can take advantage of that and use a
32-bit state while still providing decent performance on machines
without a 32-bit x 32-bit -> 32-bit multiply instruction?

If we lived in an Intel-only world, I'd suggest going with a
32-bit state, since SSE4.1 support is *very* wide-spread already -
the last CPUs without it came out over 5 years ago, I think.
(Core2 and later support SSE4.1, and some later Core1 do too)

But unfortunately things look bleak even for other x86
implementations - AMD supports SSE4.1 only starting with
Bulldozer, which came out 2011 or so I believe. Leaving the x86
realm, it seems that only ARM's NEON provides the instructions
we'd need - AltiVec seems to support only 16-bit multiplies,
and from what some quick googling brought up, MIPS and SPARC
SIMD instructions look no better..

OTOH, chances are that nobody will ever
Re: [HACKERS] Enabling Checksums
On Tue, Apr 16, 2013 at 11:20 PM, Florian Pflug wrote:
> On Apr13, 2013, at 17:14 , Ants Aasma wrote:
>> Based on current analysis, it is particularly good at detecting single
>> bit errors, as good at detecting burst errors as can be expected from
>> 16 bits and not horrible at detecting burst writes of zeroes. It is
>> quite bad at detecting multiple uncorrelated single bit errors and
>> extremely bad at detecting repeating patterns of errors in low order
>> bits.
>
> I've read the patch and tried to understand why it's that bad at
> detecting repeating patterns of errors in low order bits, and to see
> if there might be a way to fix that without too much of a performance
> impact.
>
> Here's what I gather the algorithm does:
>
> It treats the input data, a page of L bytes, as a Nx64 matrix V
> of 16-bit quantities (N = L/64, obviously).
> It then first computes (using two primes p (PRIME1) and q (PRIME2))
>
>   S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
>     + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
>     + …
>     + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
>     (mod 2^16)
>   = sum V[i,j]*p^(64-i)*q^(64-j)
>
> Note that it does that by first computing the row-wise sums without
> the q^i coefficient, and then (in what the code calls the aggregation
> phase) combines those row-wise sums into a total, adding the q^i-
> coefficients along the way.
>
> The final hash value is then
>
>   H = S * p + B * q mod 2^16
>
> where B is a salt value intended to detect swapped pages (currently
> B is simply the page index)

Great job analyzing the analytic form of the algorithm, and sorry you
had to do it instead of finding it in the documentation.

> This raises two questions. First, why are there two primes? You could
> just as well use a single prime q and set p=q^64 mod 2^16. You then get
>
>   S = sum V[i,j] * q^(64*(64-i) + (64-j))
>     = sum V[i,j] * q^(4096 - 64*(i-1) - j)
>
> You get higher prime powers that way, but you can easily choose a prime
> that yields distinct values mod 2^16 for exponents up to 16383. Your
> PRIME2, for example, does. (It wraps around for 16384, i.e.
> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
> 16384 is the Carmichael function's value at 2^16)

The experimental detection rate is about the same if we use a single
prime. But I think you have the analytical form wrong here. It should
be, given q = p:

  S = sum V[i,j] * p^(64-i) * p^(64-j)
    = sum V[i,j] * p^(64 - i + 64 - j)
    = sum V[i,j] * p^(128 - i - j)

This makes the whole matrix symmetric. While I can't think of any real
world errors that would exhibit symmetry in this 64x64 matrix, it seemed
better to me to avoid the issue altogether and use different primes. IIRC
it helped a lot for the case of page[i] = i & 0xFF.

> Second, why does it use addition instead of XOR? It seems that FNV
> usually XORs the terms together instead of adding them?

Testing showed a slightly better detection rate for adds. Intuitively I
think it's because the carry introduces some additional mixing.

> Regarding the bad behaviour for multiple low-bit errors - can you
> explain why it behaves badly in that case? I currently fail to see
> why that would be. I *can* see that the lowest bit of the hash depends
> only on the lowest bit of the input words, but as long as the lowest
> bits of the input words also affect other bits of the hash, that shouldn't
> matter. Which I think they do, but I might be missing something...

Looks like you're right. I was somehow concentrating only on how the
lowest bits depend on the input.

> Here, btw, is a page on FNV hashing. It mentions a few rules for
> picking suitable primes
>
> http://www.isthe.com/chongo/tech/comp/fnv

Unfortunately the rules don't apply here because of the hash size.

Thanks for your analysis.
I will do my best to get this all into a document and will do some more analysis to see if I can come up with some kind of proof for the error cases. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Ants Aasma writes: > On Tue, Apr 16, 2013 at 5:05 PM, Simon Riggs wrote: >> My only review comments are to ask for some explanation of the magic >> numbers... > The specific values used are mostly magic to me too. As mentioned in a > short sentence in the patch, the values are experimentally chosen, > guided by some intuition about what good values should look like. There actually is quite a lot of theory out there about this sort of thing. If we are inventing our own checksum function then We're Doing It Wrong. We should be adopting an existing, proven function. "Experimentally derived" is about the worst recommendation I can think of in this area. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Tue, Apr 16, 2013 at 5:05 PM, Simon Riggs wrote:
> On 9 April 2013 03:35, Ants Aasma wrote:
>> On Fri, Apr 5, 2013 at 9:39 PM, Ants Aasma wrote:
>>> Unless somebody tells me not to waste my time I'll go ahead and come
>>> up with a workable patch by Monday.
>>
>> And here you go. I decided to be verbose with the comments as it's
>> easier to delete a comment than to write one. I also left in a huge jumble
>> of macros to calculate the contents of a helper var during compile
>> time. This can easily be replaced with the calculated values once we
>> settle on specific parameters.
>>
>> Currently only x86-64 is implemented. 32bit x86 would be mostly a
>> copy-and-paste job, replacing 64bit pointer registers with 32bit ones.
>> For other platforms the simplest way would be to use a vectorizing
>> compiler on the generic variant. -funroll-loops -ftree-vectorize is
>> enough on gcc.
>>
>> Quick bench results on the worst case workload:
>> master no checksums: tps = 15.561848
>> master with checksums: tps = 1.695450
>> simd checksums: tps = 14.602698
>
> Numbers look very good on this. Well done.
>
> I support the direction of this, but I don't think I'm sufficiently
> well qualified to verify that the code does what it should and/or fix
> it if it breaks. If others want to see this happen you'll need to
> pitch in.
>
> My only review comments are to ask for some explanation of the magic
> numbers...
>
> #define CSUM_PRIME1 0x49
> #define CSUM_PRIME2 0x986b
> #define CSUM_TRUNC 65521
>
> Where magic means a level of technology far above my own
> understanding, and yet no (or not enough) code comments to assist me.

The specific values used are mostly magic to me too. As mentioned in a
short sentence in the patch, the values are experimentally chosen,
guided by some intuition about what good values should look like.
Basically the methodology for the choice was that I took all the pages from a 2.8GB test database, and then for each page introduced a bunch of common errors and observed how many errors were undetected. The main observations were: 1) the exact value of the primes doesn't really matter for detection efficiency. 2) values with a non-uniform distribution of zeroes and ones seem to work slightly better. I'll write up a readme of why the values are how they are and with some more explanation of the algorithm. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Apr13, 2013, at 17:14 , Ants Aasma wrote:
> Based on current analysis, it is particularly good at detecting single
> bit errors, as good at detecting burst errors as can be expected from
> 16 bits and not horrible at detecting burst writes of zeroes. It is
> quite bad at detecting multiple uncorrelated single bit errors and
> extremely bad at detecting repeating patterns of errors in low order
> bits.

I've read the patch and tried to understand why it's that bad at
detecting repeating patterns of errors in low order bits, and to see
if there might be a way to fix that without too much of a performance
impact.

Here's what I gather the algorithm does:

It treats the input data, a page of L bytes, as a Nx64 matrix V
of 16-bit quantities (N = L/64, obviously).
It then first computes (using two primes p (PRIME1) and q (PRIME2))

  S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
    + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
    + …
    + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
    (mod 2^16)
  = sum V[i,j]*p^(64-i)*q^(64-j)

Note that it does that by first computing the row-wise sums without
the q^i coefficient, and then (in what the code calls the aggregation
phase) combines those row-wise sums into a total, adding the q^i-
coefficients along the way.

The final hash value is then

  H = S * p + B * q mod 2^16

where B is a salt value intended to detect swapped pages (currently
B is simply the page index)

This raises two questions. First, why are there two primes? You could
just as well use a single prime q and set p=q^64 mod 2^16. You then get

  S = sum V[i,j] * q^(64*(64-i) + (64-j))
    = sum V[i,j] * q^(4096 - 64*(i-1) - j)

You get higher prime powers that way, but you can easily choose a prime
that yields distinct values mod 2^16 for exponents up to 16383. Your
PRIME2, for example, does. (It wraps around for 16384, i.e.
PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
16384 is the Carmichael function's value at 2^16)

Second, why does it use addition instead of XOR? It seems that FNV
usually XORs the terms together instead of adding them?

Regarding the bad behaviour for multiple low-bit errors - can you
explain why it behaves badly in that case? I currently fail to see
why that would be. I *can* see that the lowest bit of the hash depends
only on the lowest bit of the input words, but as long as the lowest
bits of the input words also affect other bits of the hash, that shouldn't
matter. Which I think they do, but I might be missing something...

Here, btw, is a page on FNV hashing. It mentions a few rules for
picking suitable primes

http://www.isthe.com/chongo/tech/comp/fnv

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 16 April 2013 20:27, Tom Lane wrote: > Greg Stark writes: >> On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs wrote: >>> * WAL checksum is not used as the sole basis for end-of-WAL discovery. >>> We reuse the WAL files, so the prev field in each WAL record shows >>> what the previous end of WAL was. Hence if the WAL checksums give a >>> false positive we still have a double check that the data really is >>> wrong. It's unbelievable that you'd get a false positive and then have >>> the prev field match as well, even though it was the genuine >>> end-of-WAL. > >> This is kind of true and kind of not true. If a system loses power >> while writing lots of data to WAL then the blocks at the end of the >> WAL might not be written out in order. Everything since the last log >> sync might be partly written out and partly not written out. That's >> the case where the checksum is critical. The beginning of a record >> could easily be written out including xl_prev and the end of the >> record not written. 1/64,000 power losses would then end up with an >> assertion failure or corrupt database. > > I have a hard time believing that it's a good idea to add checksums to > data pages and at the same time weaken our ability to detect WAL > corruption. So this seems to me to be going in the wrong direction. > What's it buying for us anyway? A few CPU cycles saved during WAL > generation? That's probably swamped by the other costs of writing WAL, > especially if you're using replication. This part of the thread is dead now; I said "let's drop this idea" on 13 April. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Greg Stark writes: > On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs wrote: >> * WAL checksum is not used as the sole basis for end-of-WAL discovery. >> We reuse the WAL files, so the prev field in each WAL record shows >> what the previous end of WAL was. Hence if the WAL checksums give a >> false positive we still have a double check that the data really is >> wrong. It's unbelievable that you'd get a false positive and then have >> the prev field match as well, even though it was the genuine >> end-of-WAL. > This is kind of true and kind of not true. If a system loses power > while writing lots of data to WAL then the blocks at the end of the > WAL might not be written out in order. Everything since the last log > sync might be partly written out and partly not written out. That's > the case where the checksum is critical. The beginning of a record > could easily be written out including xl_prev and the end of the > record not written. 1/64,000 power losses would then end up with an > assertion failure or corrupt database. I have a hard time believing that it's a good idea to add checksums to data pages and at the same time weaken our ability to detect WAL corruption. So this seems to me to be going in the wrong direction. What's it buying for us anyway? A few CPU cycles saved during WAL generation? That's probably swamped by the other costs of writing WAL, especially if you're using replication. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs wrote: > * WAL checksum is not used as the sole basis for end-of-WAL discovery. > We reuse the WAL files, so the prev field in each WAL record shows > what the previous end of WAL was. Hence if the WAL checksums give a > false positive we still have a double check that the data really is > wrong. It's unbelievable that you'd get a false positive and then have > the prev field match as well, even though it was the genuine > end-of-WAL. This is kind of true and kind of not true. If a system loses power while writing lots of data to WAL then the blocks at the end of the WAL might not be written out in order. Everything since the last log sync might be partly written out and partly not written out. That's the case where the checksum is critical. The beginning of a record could easily be written out including xl_prev and the end of the record not written. 1/64,000 power losses would then end up with an assertion failure or corrupt database. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 9 April 2013 03:35, Ants Aasma wrote:
> On Fri, Apr 5, 2013 at 9:39 PM, Ants Aasma wrote:
>> Unless somebody tells me not to waste my time I'll go ahead and come
>> up with a workable patch by Monday.
>
> And here you go. I decided to be verbose with the comments as it's
> easier to delete a comment than to write one. I also left in a huge jumble
> of macros to calculate the contents of a helper var during compile
> time. This can easily be replaced with the calculated values once we
> settle on specific parameters.
>
> Currently only x86-64 is implemented. 32bit x86 would be mostly a
> copy-and-paste job, replacing 64bit pointer registers with 32bit ones.
> For other platforms the simplest way would be to use a vectorizing
> compiler on the generic variant. -funroll-loops -ftree-vectorize is
> enough on gcc.
>
> Quick bench results on the worst case workload:
> master no checksums: tps = 15.561848
> master with checksums: tps = 1.695450
> simd checksums: tps = 14.602698

Numbers look very good on this. Well done.

I support the direction of this, but I don't think I'm sufficiently
well qualified to verify that the code does what it should and/or fix
it if it breaks. If others want to see this happen you'll need to
pitch in.

My only review comments are to ask for some explanation of the magic
numbers...

#define CSUM_PRIME1 0x49
#define CSUM_PRIME2 0x986b
#define CSUM_TRUNC 65521

Where magic means a level of technology far above my own
understanding, and yet no (or not enough) code comments to assist me.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 06:14:28PM +0300, Ants Aasma wrote: > > CRCs are known to be good for that sort of thing; it's what they were > > designed for. I'd like to see some evidence that any substitute > > algorithm has similar properties. Without that, I'm going to vote > > against this idea. > > Sorry for creating confusion here by playing fast and loose with the > terminology. We are not talking about that hash function at all. What > we are talking about here is Fowler-Noll-Vo-ish > (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) > hash function that is restructured to be parallelisable with SIMD > instructions with the explicit goal of being as fast as possible. The > resulting hash function is roughly two orders of magnitude faster than > 1-byte-at-a-time CRC32 currently in use. Performance is about > comparable with optimized fixed size memcpy running in cache. > > Based on current analysis, it is particularly good at detecting single > bit errors, as good at detecting burst errors as can be expected from > 16 bits and not horrible at detecting burst writes of zeroes. It is > quite bad at detecting multiple uncorrelated single bit errors and > extremely bad at detecting repeating patterns of errors in low order > bits. > > All in all I would say that the performance is worth the loss in > detection capability as we are not talking about using the checksum to > prove correctness. Agreed. It would be good to get these details into the patch so others are not confused in the future. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 6:26 PM, Andres Freund wrote: >> All in all I would say that the performance is worth the loss in >> detection capability as we are not talking about using the checksum to >> prove correctness. > > Is it actually a loss compared to our 16bit flavor of crc32 we now use? > I didn't think so far from the properties you described? I would have to run the testsuite I made to see how much, but I would presume so. The algorithm relies on multiplication for bit diffusion and multiply has lousy diffusion on low order bits, exactly no diffusion for the lowest bit. And for 16bit values the low order bits are quite a large fraction of the total hash. If we allow for operations that are not in SSE2 then there are a few things that we could do to make the hash quality better without affecting performance. The pmulld instruction (SSE4.1) would allow for 32bit values in the intermediate state. And pshufb (SSSE3) would allow us to swap high and low bytes, introducing additional mixing. On Intel Sandy Bridge, if I understand the microarchitecture correctly, either change would be basically free, but not both, because pshufb and paddw use execution ports 0 and 5, while pmulld needs port 0 and pmullw needs port 1. Currently the main loop takes 1 cycle per 16byte chunk, so any changes introducing conflicts there would cut the performance in half. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 2013-04-13 18:14:28 +0300, Ants Aasma wrote: > On Sat, Apr 13, 2013 at 5:58 PM, Tom Lane wrote: > > Andres Freund writes: > >> On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: > >>> As I understand it, SIMD is just a CPU-optimized method for producing a > >>> CRC checksum. Is that right? Does it produce the same result as a > >>> non-CPU-optimized CRC calculation? > > > >> No we are talking about a different algorithm that results in different > >> results, thats why its important to choose now since we can't change it > >> later without breaking pg_upgrade in further releases. > >> http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 > > > > [ squint... ] We're talking about a *cryptographic* hash function? > > Why in the world was this considered a good idea for page checksums? > > > > In the first place, it's probably not very fast compared to some > > alternatives, and in the second place, the criteria by which people > > would consider it a good crypto hash function have approximately nothing > > to do with what we need for a checksum function. What we want for a > > checksum function is high probability of detection of common hardware > > failure modes, such as burst errors and all-zeroes. This is > > particularly critical when we're going with only a 16-bit checksum --- > > the probabilities need to be skewed in the right direction, or it's not > > going to be all that terribly useful. > > > > CRCs are known to be good for that sort of thing; it's what they were > > designed for. I'd like to see some evidence that any substitute > > algorithm has similar properties. Without that, I'm going to vote > > against this idea. > > Sorry for creating confusion here by playing fast and loose with the > terminology. We are not talking about that hash function at all. 
> What we are talking about here is a Fowler-Noll-Vo-ish > (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) > hash function that is restructured to be parallelisable with SIMD > instructions with the explicit goal of being as fast as possible. The > resulting hash function is roughly two orders of magnitude faster than > 1-byte-at-a-time CRC32 currently in use. Performance is about > comparable with optimized fixed size memcpy running in cache. Gah, one shouldn't look too quickly for a reference, sorry. > Based on current analysis, it is particularly good at detecting single > bit errors, as good at detecting burst errors as can be expected from > 16 bits and not horrible at detecting burst writes of zeroes. It is > quite bad at detecting multiple uncorrelated single bit errors and > extremely bad at detecting repeating patterns of errors in low order > bits. > All in all I would say that the performance is worth the loss in > detection capability as we are not talking about using the checksum to > prove correctness. Is it actually a loss compared to our 16bit flavor of crc32 we now use? I didn't think so far from the properties you described? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 5:58 PM, Tom Lane wrote: > Andres Freund writes: >> On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: >>> As I understand it, SIMD is just a CPU-optimized method for producing a >>> CRC checksum. Is that right? Does it produce the same result as a >>> non-CPU-optimized CRC calculation? > >> No we are talking about a different algorithm that results in different >> results, thats why its important to choose now since we can't change it >> later without breaking pg_upgrade in further releases. >> http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 > > [ squint... ] We're talking about a *cryptographic* hash function? > Why in the world was this considered a good idea for page checksums? > > In the first place, it's probably not very fast compared to some > alternatives, and in the second place, the criteria by which people > would consider it a good crypto hash function have approximately nothing > to do with what we need for a checksum function. What we want for a > checksum function is high probability of detection of common hardware > failure modes, such as burst errors and all-zeroes. This is > particularly critical when we're going with only a 16-bit checksum --- > the probabilities need to be skewed in the right direction, or it's not > going to be all that terribly useful. > > CRCs are known to be good for that sort of thing; it's what they were > designed for. I'd like to see some evidence that any substitute > algorithm has similar properties. Without that, I'm going to vote > against this idea. Sorry for creating confusion here by playing fast and loose with the terminology. We are not talking about that hash function at all. What we are talking about here is Fowler-Noll-Vo-ish (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) hash function that is restructured to be parallelisable with SIMD instructions with the explicit goal of being as fast as possible. 
The resulting hash function is roughly two orders of magnitude faster than 1-byte-at-a-time CRC32 currently in use. Performance is about comparable with optimized fixed size memcpy running in cache. Based on current analysis, it is particularly good at detecting single bit errors, as good at detecting burst errors as can be expected from 16 bits and not horrible at detecting burst writes of zeroes. It is quite bad at detecting multiple uncorrelated single bit errors and extremely bad at detecting repeating patterns of errors in low order bits. All in all I would say that the performance is worth the loss in detection capability as we are not talking about using the checksum to prove correctness. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
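For readers wanting a concrete picture of the approach Ants describes: the sketch below is not the actual patch, just a minimal illustration of the general idea, i.e. an FNV-1a-style multiply/xor hash split into independent lanes so the compiler can vectorize the inner loop, with a final fold to 16 bits. All constants and names here are illustrative assumptions, not PostgreSQL code.

```c
#include <stdint.h>
#include <stddef.h>

#define N_LANES   32          /* independent accumulators -> SIMD-friendly */
#define FNV_PRIME 16777619u   /* 32-bit FNV prime, an illustrative choice */

/*
 * Illustration only, not PostgreSQL's page-checksum code: hash the input
 * as N_LANES interleaved FNV-1a-style streams, then mix the lanes into
 * one 32-bit value.  nwords must be a multiple of N_LANES.
 */
static uint32_t
parallel_fnv_hash32(const uint32_t *data, size_t nwords)
{
    uint32_t lanes[N_LANES];
    uint32_t result = 0;
    size_t   i, j;

    for (j = 0; j < N_LANES; j++)
        lanes[j] = 2166136261u + (uint32_t) j;  /* per-lane offset basis */

    /* The lanes do not depend on each other, so this loop vectorizes. */
    for (i = 0; i < nwords; i += N_LANES)
        for (j = 0; j < N_LANES; j++)
        {
            lanes[j] ^= data[i + j];
            lanes[j] *= FNV_PRIME;
        }

    for (j = 0; j < N_LANES; j++)
        result = result * FNV_PRIME + lanes[j];
    return result;
}

/* Fold to the 16 bits that fit in the page header's checksum field. */
static uint16_t
fold_checksum16(uint32_t hash)
{
    return (uint16_t) (hash ^ (hash >> 16));
}
```

Because the per-lane multiplier is odd (hence invertible mod 2^32), changing any single input word provably changes the 32-bit result; only the final fold to 16 bits can lose that distinction, which is consistent with the detection trade-offs Ants describes above.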
Re: [HACKERS] Enabling Checksums
On 2013-04-13 10:58:53 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: > >> As I understand it, SIMD is just a CPU-optimized method for producing a > >> CRC checksum. Is that right? Does it produce the same result as a > >> non-CPU-optimized CRC calculation? > > > No we are talking about a different algorithm that results in different > > results, thats why its important to choose now since we can't change it > > later without breaking pg_upgrade in further releases. > > http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 > > [ squint... ] We're talking about a *cryptographic* hash function? > Why in the world was this considered a good idea for page checksums? In Ants' implementation it's a heck of a lot faster than any CRC implementation we have seen so far on relatively large blocks (like pages). pgbench results: ca+csw_uxo-frkuzl0yzs0wsdl8dipzv-ugmvyn-yv45sgub...@mail.gmail.com byte/cycle comparison: CA+CSw_su1fopLNBz1NAfkSNw4_=gv+5pf0kdlqmpvukw1q4...@mail.gmail.com > In the first place, it's probably not very fast compared to some > alternatives, and in the second place, the criteria by which people > would consider it a good crypto hash function have approximately nothing > to do with what we need for a checksum function. What we want for a > checksum function is high probability of detection of common hardware > failure modes, such as burst errors and all-zeroes. This is > particularly critical when we're going with only a 16-bit checksum --- > the probabilities need to be skewed in the right direction, or it's not > going to be all that terribly useful. > > CRCs are known to be good for that sort of thing; it's what they were > designed for. I'd like to see some evidence that any substitute > algorithm has similar properties. Without that, I'm going to vote > against this idea. Ants has done some analysis on this, like CA+CSw_tMoA85e=1vs4omjzjg2mr_hulikovpd80dp5rurds...@mail.gmail.com.
That doesn't look bad to me, and unless I am missing something it's better than our CRC truncated to 16 bits. So while I would say it's not 100% researched, there has been a rather detailed investigation by Ants - I am rather impressed. My biggest doubt so far is the reliance on inline assembly for top performance on x86-64, with a generic implementation otherwise that is only really fast with appropriate compiler flags. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
Andres Freund writes: > On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: >> As I understand it, SIMD is just a CPU-optimized method for producing a >> CRC checksum. Is that right? Does it produce the same result as a >> non-CPU-optimized CRC calculation? > No we are talking about a different algorithm that results in different > results, thats why its important to choose now since we can't change it > later without breaking pg_upgrade in further releases. > http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 [ squint... ] We're talking about a *cryptographic* hash function? Why in the world was this considered a good idea for page checksums? In the first place, it's probably not very fast compared to some alternatives, and in the second place, the criteria by which people would consider it a good crypto hash function have approximately nothing to do with what we need for a checksum function. What we want for a checksum function is high probability of detection of common hardware failure modes, such as burst errors and all-zeroes. This is particularly critical when we're going with only a 16-bit checksum --- the probabilities need to be skewed in the right direction, or it's not going to be all that terribly useful. CRCs are known to be good for that sort of thing; it's what they were designed for. I'd like to see some evidence that any substitute algorithm has similar properties. Without that, I'm going to vote against this idea. regards, tom lane
Re: [HACKERS] Enabling Checksums
On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: > On Fri, Apr 12, 2013 at 02:38:27PM -0700, Jeff Davis wrote: > > In general, we have more flexibility with WAL because there is no > > upgrade issue. It would be nice to share code with the data page > > checksum algorithm; but really we should just use whatever offers the > > best trade-off in terms of complexity, performance, and error detection > > rate. > > > > I don't think we need to decide all of this right now. Personally, I'm > > satisfied having SIMD checksums on data pages now and leaving WAL > > optimization until later. > As I understand it, SIMD is just a CPU-optimized method for producing a > CRC checksum. Is that right? Does it produce the same result as a > non-CPU-optimized CRC calculation? No, we are talking about a different algorithm that produces different results; that's why it's important to choose now, since we can't change it later without breaking pg_upgrade in future releases. http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 02:38:27PM -0700, Jeff Davis wrote: > In general, we have more flexibility with WAL because there is no > upgrade issue. It would be nice to share code with the data page > checksum algorithm; but really we should just use whatever offers the > best trade-off in terms of complexity, performance, and error detection > rate. > > I don't think we need to decide all of this right now. Personally, I'm > satisfied having SIMD checksums on data pages now and leaving WAL > optimization until later. As I understand it, SIMD is just a CPU-optimized method for producing a CRC checksum. Is that right? Does it produce the same result as a non-CPU-optimized CRC calculation? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 12 April 2013 23:21, Ants Aasma wrote: >> In general, we have more flexibility with WAL because there is no >> upgrade issue. It would be nice to share code with the data page >> checksum algorithm; but really we should just use whatever offers the >> best trade-off in terms of complexity, performance, and error detection >> rate. >> >> I don't think we need to decide all of this right now. Personally, I'm >> satisfied having SIMD checksums on data pages now and leaving WAL >> optimization until later. > > +1 OK, let's drop that idea then. SIMD checksums for 16-bit page checksums only in this release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 12:38 AM, Jeff Davis wrote: > On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote: >> I think this is a bad idea. It complicates the WAL format significantly. >> Simon's patch didn't include the changes to recovery to validate the >> checksum, but I suspect it would be complicated. And it reduces the >> error-detection capability of WAL recovery. Keep in mind that unlike >> page checksums, which are never expected to fail, so even if we miss a >> few errors it's still better than nothing, the WAL checkum is used to >> detect end-of-WAL. There is expected to be a failure every time we do >> crash recovery. This far, we've considered the probability of one in >> 1^32 small enough for that purpose, but IMHO one in 1^16 is much too weak. > > One thing that just occurred to me is that we could make the SIMD > checksum a 32-bit checksum, and reduce it down to 16 bits for the data > pages. That might give us more flexibility to later use it for WAL > without compromising on the error detection nearly as much (though > obviously that wouldn't work with Simon's current proposal which uses > the same data page checksum in a WAL backup block). The simple 32-bit version of the algorithm would need CPU capability checks for the fast version and would work only on CPUs produced in the last few years. Not a show stopper, but more complex code and less applicability for sure. An alternative would be to calculate two 16-bit checksums, concatenate them for the 32-bit checksum and add them for the 16-bit one. In this case we wouldn't need to change the current algorithm. A future change could just factor out everything up to the last add as the common function. But keep in mind that we are talking about sharing about 400 bytes of machine code here. > In general, we have more flexibility with WAL because there is no > upgrade issue.
It would be nice to share code with the data page > checksum algorithm; but really we should just use whatever offers the > best trade-off in terms of complexity, performance, and error detection > rate. > > I don't think we need to decide all of this right now. Personally, I'm > satisfied having SIMD checksums on data pages now and leaving WAL > optimization until later. +1 I feel quite uneasy about reducing the effectiveness of WAL end detection. There are many ways to improve WAL performance and I have no idea what would be the best one. At the very least some performance tests are in order. As this is not an essential part of having usable checksums, but a general performance optimization I feel that it is not fair to others to postpone the release to resolve this now. I'd be more than happy to research this for 9.4. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
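Ants' "two 16-bit checksums" alternative above can be made concrete with a tiny sketch. The helper names and the way the two parts would be produced are hypothetical, not code from the patch; the point is only that a future 32-bit user and today's 16-bit page checksum could share everything up to the final combine step:

```c
#include <stdint.h>

/*
 * Hypothetical: checksum_a and checksum_b stand in for two independent
 * 16-bit checksums computed by the shared code (e.g. over different
 * lanes of the page).  Only the final combination differs per width.
 */
static uint32_t
checksum32_from_parts(uint16_t checksum_a, uint16_t checksum_b)
{
    /* 32-bit flavor: concatenate the two halves */
    return ((uint32_t) checksum_a << 16) | checksum_b;
}

static uint16_t
checksum16_from_parts(uint16_t checksum_a, uint16_t checksum_b)
{
    /* 16-bit flavor: add them (mod 2^16) */
    return (uint16_t) (checksum_a + checksum_b);
}
```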
Re: [HACKERS] Enabling Checksums
On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote: > I think this is a bad idea. It complicates the WAL format significantly. > Simon's patch didn't include the changes to recovery to validate the > checksum, but I suspect it would be complicated. And it reduces the > error-detection capability of WAL recovery. Keep in mind that unlike > page checksums, which are never expected to fail, so even if we miss a > few errors it's still better than nothing, the WAL checkum is used to > detect end-of-WAL. There is expected to be a failure every time we do > crash recovery. This far, we've considered the probability of one in > 1^32 small enough for that purpose, but IMHO one in 1^16 is much too weak. One thing that just occurred to me is that we could make the SIMD checksum a 32-bit checksum, and reduce it down to 16 bits for the data pages. That might give us more flexibility to later use it for WAL without compromising on the error detection nearly as much (though obviously that wouldn't work with Simon's current proposal which uses the same data page checksum in a WAL backup block). In general, we have more flexibility with WAL because there is no upgrade issue. It would be nice to share code with the data page checksum algorithm; but really we should just use whatever offers the best trade-off in terms of complexity, performance, and error detection rate. I don't think we need to decide all of this right now. Personally, I'm satisfied having SIMD checksums on data pages now and leaving WAL optimization until later. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Fri, 2013-04-12 at 21:28 +0200, Andres Freund wrote: > That means we will have to do the verification for this in > ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we > won't always recognize the end of WAL correctly. > And I am a bit wary of reducing the likelihood of noticing the proper > end-of-recovery by reducing the crc width. Good point. > Why again are we doing this now? Just to reduce the overhead of CRC > computation for full page writes? Or are we forseeing issues with the > page checksums being wrong because of non-zero data in the hole being > zero after the restore from bkp blocks? That shouldn't be a problem, because the block is not expected to have a proper checksum in WAL, and it will be recalculated before being written. So I see these changes as mostly independent. The reason we're discussing right now is because, when choosing the checksum algorithm, I was hoping that it might be usable in the future for WAL backup blocks. I'm convinced that they can be; and the primary question now seems to be "should they be", which does not need to be settled right now in my opinion. Anyway, I would be perfectly happy if we just got the SIMD algorithm in for data pages. The support for changing the WAL checksums seems lukewarm, and there might be quite a few alternatives (e.g. optimizing the CRC for backup blocks as Heikki suggested) to achieve that performance goal. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Fri, 2013-04-12 at 15:21 -0400, Bruce Momjian wrote: > > * When we do PageSetChecksumInplace(), we need to be 100% sure that the > > hole is empty; otherwise the checksum will fail when we re-expand it. It > > might be worth a memset beforehand just to be sure. > > Do we write the page holes to the WAL for full-page writes? I hope we > don't. No, but the page hole is included in the checksum. Let's say that the page hole contains some non-zero value, and we calculate a checksum. When we eliminate the page hole, and then reconstitute the page using zeros for the page hole later, then the page will not match the checksum any more. So, we need to be sure the original page hole is all-zero when we calculate the checksum. Regards, Jeff Davis
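The invariant Jeff describes can be shown with a minimal sketch. The struct and offsets below are deliberately simplified and illustrative; PostgreSQL's real PageHeaderData differs, but the idea is the same: the "hole" between pd_lower and pd_upper must be all zeroes before the checksum is computed, so that a page reconstituted later with a zeroed hole still matches it.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

/*
 * Simplified, illustrative page layout: pd_lower is the byte offset where
 * the line-pointer array ends, pd_upper the offset where tuple data
 * starts; the bytes in between are the unused "hole".
 */
typedef struct
{
    uint16_t pd_checksum;
    uint16_t pd_lower;                /* offset of end of used header area */
    uint16_t pd_upper;                /* offset of start of tuple data */
    char     data[PAGE_SIZE - 6];
} FakePage;

/*
 * Force the hole to zeroes before checksumming, so the checksum does not
 * depend on whatever stale bytes happened to sit in the unused space.
 */
static void
zero_page_hole(FakePage *page)
{
    if (page->pd_upper > page->pd_lower)
        memset((char *) page + page->pd_lower, 0,
               page->pd_upper - page->pd_lower);
}
```

This is essentially the "memset beforehand just to be sure" that Jeff suggests in his first message above.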
Re: [HACKERS] Enabling Checksums
On 12 April 2013 21:03, Heikki Linnakangas wrote: > No, the patch has to compute the 16-bit checksum for the page when the > full-page image is added to the WAL record. There would otherwise be no need > to calculate the page checksum at that point, but only later when the page > is written out from shared buffer cache. > > I think this is a bad idea. It complicates the WAL format significantly. > Simon's patch didn't include the changes to recovery to validate the > checksum, but I suspect it would be complicated. And it reduces the > error-detection capability of WAL recovery. Keep in mind that unlike page > checksums, which are never expected to fail, so even if we miss a few errors > it's still better than nothing, the WAL checkum is used to detect > end-of-WAL. There is expected to be a failure every time we do crash > recovery. This far, we've considered the probability of one in 1^32 small > enough for that purpose, but IMHO one in 1^16 is much too weak. > > If you want to speed up the CRC calculation of full-page images, you could > have an optimized version of the WAL CRC algorithm, using e.g. SIMD > instructions. Because typical WAL records are small, max 100-200 bytes, and > it consists of several even smaller chunks, the normal WAL CRC calculation > is quite resistant to common optimization techniques. But it might work for > the full-page images. Let's not conflate it with the page checksums, though. I accept the general tone of that as a reasonable perspective and in many ways am on the fence myself. This is sensitive stuff. A few points * The code to validate the checksum isn't complex, though it is more than the current one line. Let's say about 10 lines of clear code. I'll work on that to show it's true. I don't see that as a point of objection. * WAL checksum is not used as the sole basis for end-of-WAL discovery. We reuse the WAL files, so the prev field in each WAL record shows what the previous end of WAL was.
Hence, if the WAL checksum gives a false positive, we still have a double check that the data really is wrong. It is vanishingly unlikely that you'd get a false positive and have the prev field match as well when it was actually the genuine end of WAL. Yes, we could also have a second SIMD calculation optimised for WAL CRC32 on an 8192-byte block, rather than just one set of SIMD code for both. We could also have a single set of SIMD code producing a 32-bit checksum, then take the low 16 bits as we do currently. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 12.04.2013 22:31, Bruce Momjian wrote: On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote: Only point worth discussing is that this change would make backup blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record header is covered by a CRC32 but the backup blocks only by 16-bit. That means we will have to do the verification for this in ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we won't always recognize the end of WAL correctly. And I am a bit wary of reducing the likelihood of noticing the proper end-of-recovery by reducing the crc width. Why again are we doing this now? Just to reduce the overhead of CRC computation for full page writes? Or are we forseeing issues with the page checksums being wrong because of non-zero data in the hole being zero after the restore from bkp blocks? I thought the idea is that we were going to re-use the already-computed CRC checksum on the page, and we only have 16-bits of storage for that. No, the patch has to compute the 16-bit checksum for the page when the full-page image is added to the WAL record. There would otherwise be no need to calculate the page checksum at that point, but only later when the page is written out from shared buffer cache. I think this is a bad idea. It complicates the WAL format significantly. Simon's patch didn't include the changes to recovery to validate the checksum, but I suspect it would be complicated. And it reduces the error-detection capability of WAL recovery. Keep in mind that unlike page checksums (which are never expected to fail, so even if we miss a few errors it's still better than nothing), the WAL checksum is used to detect end-of-WAL. There is expected to be a failure every time we do crash recovery. Thus far, we've considered a false-match probability of one in 2^32 small enough for that purpose, but IMHO one in 2^16 is much too weak.
If you want to speed up the CRC calculation of full-page images, you could have an optimized version of the WAL CRC algorithm, using e.g. SIMD instructions. Because typical WAL records are small, max 100-200 bytes, and it consists of several even smaller chunks, the normal WAL CRC calculation is quite resistant to common optimization techniques. But it might work for the full-page images. Let's not conflate it with the page checksums, though. - Heikki
Re: [HACKERS] Enabling Checksums
On 2013-04-12 15:31:36 -0400, Bruce Momjian wrote: > On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote: > > > Only point worth discussing is that this change would make backup blocks > > > be > > > covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record > > > header is covered by a CRC32 but the backup blocks only by 16-bit. > > > > That means we will have to do the verification for this in > > ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we > > won't always recognize the end of WAL correctly. > > And I am a bit wary of reducing the likelihood of noticing the proper > > end-of-recovery by reducing the crc width. > > > > Why again are we doing this now? Just to reduce the overhead of CRC > > computation for full page writes? Or are we forseeing issues with the > > page checksums being wrong because of non-zero data in the hole being > > zero after the restore from bkp blocks? > > I thought the idea is that we were going to re-use the already-computed > CRC checksum on the page, and we only have 16-bits of storage for that. Well, but the proposal seems to be to do this also for non-checksum enabled datadirs, so ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote: > > Only point worth discussing is that this change would make backup blocks be > > covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record > > header is covered by a CRC32 but the backup blocks only by 16-bit. > > That means we will have to do the verification for this in > ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we > won't always recognize the end of WAL correctly. > And I am a bit wary of reducing the likelihood of noticing the proper > end-of-recovery by reducing the crc width. > > Why again are we doing this now? Just to reduce the overhead of CRC > computation for full page writes? Or are we forseeing issues with the > page checksums being wrong because of non-zero data in the hole being > zero after the restore from bkp blocks? I thought the idea is that we were going to re-use the already-computed CRC checksum on the page, and we only have 16-bits of storage for that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 2013-04-11 20:12:59 +0100, Simon Riggs wrote: > On 11 April 2013 04:27, Jeff Davis wrote: > > > On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote: > > > > > OK, so we have a single combined "calculate a checksum for a block" > > > function. That uses Jeff's zeroing trick and Ants' bulk-oriented > > > performance optimization. > > > > > > > > > For buffer checksums we simply calculate for the block. > > > > Sounds good. > > > > > For WAL full page writes, we first set the checksums for all defined > > > buffers, then calculate the checksum of remaining data plus the > > > pd_checksum field from each block using the normal WAL CRC32. > > > > > > Seems good to me. One set of fast code. And it avoids the weirdness > > > that the checksum stored on the full page is actually wrong. > > > > Oh, that's a nice benefit. > > > So, if we apply a patch like the one attached, we then end up with the WAL > checksum using the page checksum as an integral part of its calculation. > (There is no increase in code inside WALInsertLock, nothing at all touched > in that area). > > Then all we need to do is make PageSetChecksumInplace() use Ants' algo and > we're done. > > Only point worth discussing is that this change would make backup blocks be > covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record > header is covered by a CRC32 but the backup blocks only by 16-bit. That means we will have to do the verification for this in ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we won't always recognize the end of WAL correctly. And I am a bit wary of reducing the likelihood of noticing the proper end-of-recovery by reducing the crc width. Why again are we doing this now? Just to reduce the overhead of CRC computation for full page writes? Or are we foreseeing issues with the page checksums being wrong because of non-zero data in the hole being zero after the restore from bkp blocks?
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 12:07:36PM -0700, Jeff Davis wrote: > > (Attached patch is discussion only. Checking checksum in recovery > > isn't coded at all.) > > I like it. > > A few points: > > * Given that setting the checksum is unconditional in a backup block, do > we want to zero the checksum field when the backup block is restored if > checksums are disabled? Otherwise we would have a strange situation > where some blocks have a checksum on disk even when checksums are > disabled. > > * When we do PageSetChecksumInplace(), we need to be 100% sure that the > hole is empty; otherwise the checksum will fail when we re-expand it. It > might be worth a memset beforehand just to be sure. Do we write the page holes to the WAL for full-page writes? I hope we don't. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On Thu, 2013-04-11 at 20:12 +0100, Simon Riggs wrote: > So, if we apply a patch like the one attached, we then end up with the > WAL checksum using the page checksum as an integral part of its > calculation. (There is no increase in code inside WALInsertLock, > nothing at all touched in that area). > > > Then all we need to do is make PageSetChecksumInplace() use Ants' algo > and we're done. > > > Only point worth discussing is that this change would make backup > blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e. > the record header is covered by a CRC32 but the backup blocks only by > 16-bit. FWIW, that's fine with me. > (Attached patch is discussion only. Checking checksum in recovery > isn't coded at all.) I like it. A few points: * Given that setting the checksum is unconditional in a backup block, do we want to zero the checksum field when the backup block is restored if checksums are disabled? Otherwise we would have a strange situation where some blocks have a checksum on disk even when checksums are disabled. * When we do PageSetChecksumInplace(), we need to be 100% sure that the hole is empty; otherwise the checksum will fail when we re-expand it. It might be worth a memset beforehand just to be sure. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Wed, Apr 10, 2013 at 11:19:56AM -0700, Jeff Davis wrote: > On Wed, 2013-04-10 at 11:01 +0300, Ants Aasma wrote: > > I think we should first deal with using it for page checksums and if > > future versions want to reuse some of the code for WAL checksums then > > we can rearrange the code. > > Sounds good to me, although I expect we at least want any assembly to be > in a separate file (if the specialization makes it in 9.3). Sounds good. Simon has done a good job shepherding this to completion. My only question is whether the 16-bit page checksums stored in WAL reduce our ability to detect failed/corrupt writes to WAL? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 11 April 2013 04:27, Jeff Davis wrote: > On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote: > > > OK, so we have a single combined "calculate a checksum for a block" > > function. That uses Jeff's zeroing trick and Ants' bulk-oriented > > performance optimization. > > > > > > For buffer checksums we simply calculate for the block. > > Sounds good. > > > For WAL full page writes, we first set the checksums for all defined > > buffers, then calculate the checksum of remaining data plus the > > pd_checksum field from each block using the normal WAL CRC32. > > > > Seems good to me. One set of fast code. And it avoids the weirdness > > that the checksum stored on the full page is actually wrong. > > Oh, that's a nice benefit. So, if we apply a patch like the one attached, we then end up with the WAL checksum using the page checksum as an integral part of its calculation. (There is no increase in code inside WALInsertLock, nothing at all touched in that area). Then all we need to do is make PageSetChecksumInplace() use Ants' algo and we're done. Only point worth discussing is that this change would make backup blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record header is covered by a CRC32 but the backup blocks only by 16-bit. (Attached patch is discussion only. Checking checksum in recovery isn't coded at all.) Thoughts? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services [Attachment: make_wal_records_use_page_checksums.v0.patch]
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote: > OK, so we have a single combined "calculate a checksum for a block" > function. That uses Jeff's zeroing trick and Ants' bulk-oriented > performance optimization. > > > For buffer checksums we simply calculate for the block. Sounds good. > For WAL full page writes, we first set the checksums for all defined > buffers, then calculate the checksum of remaining data plus the > pd_checksum field from each block using the normal WAL CRC32. > > Seems good to me. One set of fast code. And it avoids the weirdness > that the checksum stored on the full page is actually wrong. Oh, that's a nice benefit. > It also means that the WAL checksum calculation includes the hole, yet > we do not include the data for the hole. So we have to do an extra > copy when restoring the backup block. I like this, but it sounds like there is some room for discussion on some of these points. I assume changes to the WAL checksums are 9.4 material? I'm satisfied with SIMD data checksums in 9.3 and that we have a plan for using SIMD for WAL checksums later. Regards, Jeff Davis
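The scheme Simon describes above (set each backup block's checksum in place, then run the normal WAL CRC over the record data plus each block's pd_checksum field instead of the full page image) can be sketched as follows. The CRC here is a plain bitwise CRC-32 stand-in for PostgreSQL's WAL CRC, and the record layout is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Bitwise, byte-at-a-time CRC-32 step (reflected polynomial 0xEDB88320),
 * standing in for PostgreSQL's WAL CRC.  Start the running value at
 * 0xFFFFFFFF and invert the final result.
 */
static uint32_t
crc32_step(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i;
    int    bit;

    for (i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Hypothetical record layout: the WAL CRC covers the main record data
 * and, for each backup block, only its already-set 16-bit page checksum
 * rather than the whole full-page image.
 */
static uint32_t
wal_record_crc(const void *rdata, size_t rdata_len,
               const uint16_t *bkp_checksums, int nblocks)
{
    uint32_t crc = 0xFFFFFFFFu;
    int      i;

    crc = crc32_step(crc, rdata, rdata_len);
    for (i = 0; i < nblocks; i++)
        crc = crc32_step(crc, &bkp_checksums[i], sizeof(uint16_t));
    return crc ^ 0xFFFFFFFFu;
}
```

Under this layout, corruption inside a backup block has to be caught by the 16-bit page checksum alone; the CRC only guarantees that the stored page checksum itself arrived intact, which is exactly the trade-off debated in this thread.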