Re: [HACKERS] Enabling Checksums

2013-09-12 Thread Greg Smith

On 3/18/13 10:52 AM, Bruce Momjian wrote:

With a potential 10-20% overhead, I am unclear who would enable this at
initdb time.


If you survey people who are running PostgreSQL on cloud hardware, be 
it Amazon's EC2 or similar options from other vendors, you will find a 
high percentage of them would pay quite a bit of performance to make 
their storage more reliable.  To pick one common measurement for 
popularity, a Google search on ebs corruption returns 17 million hits. 
 To quote one of those, Baron Schwartz of Percona talking about MySQL 
on EC2:


BTW, I have seen data corruption on EBS volumes. It’s not clear whether 
it was InnoDB’s fault (extremely unlikely IMO), the operating system’s 
fault, EBS’s fault, or something else.


http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

*That* uncertainty is where a lot of the demand for this feature is 
coming from.  People deploy into the cloud, their data gets corrupted, 
and no one can tell them what/why/how it happened.  And that means they 
don't even know what to change to make it better.  The only people I see 
really doing something about this problem all seem years off, and I'm 
not sure they are going to help--especially since some of them are 
targeting enterprise storage rather than the cloud-style installations.



I assume a user would wait until they suspected corruption to turn it
on, and because it is only initdb-enabled, they would have to
dump/reload their cluster.  The open question is whether this is a
usable feature as written, or whether we should wait until 9.4.


The reliability issues of both physical and virtual hardware are so 
widely known that many people will deploy with this on as their default 
configuration.


If you don't trust your existing data, you can't retroactively check it. 
 A checksum of an already corrupt block is useless.  Accordingly, there 
is no use case for converting an installation with real or even 
suspected problems to a checksummed one.  If you wait until you suspect 
corruption to care about checksums, it's really too late.  There is only 
one available next step:  you must do a dump to figure out what's 
readable.  That is the spot that all of the incoming data recovery 
customers we see at 2ndQuadrant are already in when we're called.  The 
cluster is suspicious, sometimes they can get data out of it with a 
dump, and if we hack up their install we can usually recover a bit more 
than they could.


After the data from a partially corrupted database is dumped, someone 
who has just been through that pain might decide they should turn 
checksums on when they restore the dump.  When it's on, they can assess 
future damage easily at the block level when it happens, and possibly 
repair it without doing a full dump/reload.  What's implemented in the 
feature we're talking about has a good enough UI to handle this entire 
cycle I see damaged installations go through.



In fact, this feature is going to need
pg_upgrade changes to detect from pg_controldata that the old/new
clusters have the same checksum setting.


I think that's done already, but it's certainly something to test out too.

Good questions, Bruce, I don't think the reasons behind this feature's 
demand have been highlighted very well before.  I try not to spook the 
world by talking regularly about how many corrupt PostgreSQL databases 
I've seen, but they do happen.  Most of my regular ranting on crappy 
SSDs that lie about writes comes from a TB scale PostgreSQL install that 
got corrupted due to the write-cache flaws of the early Intel 
SSDs--twice.  They would have happily lost even the worst-case 20% of 
regular performance to avoid going down for two days each time they saw 
corruption, where we had to dump/reload to get them going again.  If the 
install had checksums, I could have figured out which blocks were 
damaged and manually fixed them.  Without checksums, there's no way to 
even tell for sure what is broken.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-25 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 24 April 2013 21:06, Jeff Davis pg...@j-davis.com wrote:
 What goal are you trying to accomplish with this patch?

 That we might need to patch the checksum version on a production release.

I don't actually buy that argument, certainly not as something that
could happen in 9.3.

I'm inclined to think we should forget about this until we have a
concrete use-case for it.  As Jeff says, there is no need for pg_control
contents to be compatible across major releases, so there's no harm in
waiting if we have any doubts about how it ought to work.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-24 Thread Simon Riggs
On 24 April 2013 01:10, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-04-23 at 16:28 +0100, Simon Riggs wrote:
 * make the pg_control.data_checksums field into a version number, for
 future flexibility...
 patch attached

 Commenting on this separately because it's a separate issue.

 I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
 checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
 a new algorithm without forcing existing users to change.

That's exactly what the patch does.

 initdb would have to take the code as an option, probably in string
 form.

When/if we have multiple options we can add that. The main thing was
to make sure the control file recorded things in a common way.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-24 Thread Jeff Davis
On Wed, 2013-04-24 at 08:20 +0100, Simon Riggs wrote:
 On 24 April 2013 01:10, Jeff Davis pg...@j-davis.com wrote:
  I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
  checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
  a new algorithm without forcing existing users to change.
 
 That's exactly what the patch does.

The word version indicates an order to it though, like N+1 is always
preferable to N. This is user-facing (through pg_controldata output),
otherwise I wouldn't mind.

  initdb would have to take the code as an option, probably in string
  form.
 
 When/if we have multiple options we can add that. The main thing was
 to make sure the control file recorded things in a common way.

The main strange thing to me is that we're still using the
enabled/disabled for the output of pg_controldata as well as the
version.

When we do have multiple options, it seems like we'd just have one field
output:

  Data page checksums: none|crc32c|pg-fnv

What goal are you trying to accomplish with this patch? pg_control
doesn't need to be compatible between releases, so can't we just add
this later when we really do have multiple options?

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-24 Thread Simon Riggs
On 24 April 2013 21:06, Jeff Davis pg...@j-davis.com wrote:

 What goal are you trying to accomplish with this patch?

That we might need to patch the checksum version on a production release.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-24 Thread Jeff Davis
On Wed, 2013-04-24 at 21:09 +0100, Simon Riggs wrote:
 On 24 April 2013 21:06, Jeff Davis pg...@j-davis.com wrote:
 
  What goal are you trying to accomplish with this patch?
 
 That we might need to patch the checksum version on a production release.

Oh, I see.

I don't think we need two output fields from pg_controldata though. It's
a little redundant, and confused me when I was looking at the impact on
pg_upgrade. And it means nothing to the user until we actually have
multiple algorithms available, at which time we are better off with a
text representation.

Other than that, I think your patch is fine to accomplish the
aforementioned goal. Essentially, it just changes the bool to a uint32,
which I favor.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-23 Thread Simon Riggs
On 23 April 2013 02:35, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-04-23 at 01:08 +0300, Ants Aasma wrote:
 A slight delay, but here it is. I didn't lift the checksum part into a
 separate file as I didn't have a great idea what I would call it. The
 code is reasonably compact so I don't see a great need for this right
 now. It would be more worth the effort when/if we add non-generic
 variants. I'm not particularly attached to the method I used to mask
 out pd_checksum field, this could be improved if someone has a better
 idea how to structure the code.

 Thank you. A few initial comments:

 I have attached (for illustration purposes only) a patch on top of yours
 that divides the responsibilities a little more cleanly.

 * easier to move into a separate file, and use your recommended compiler
 flags without affecting other routines in bufpage.c
 * makes the checksum algorithm itself simpler
 * leaves the data-page-specific aspects (mixing in the page number,
 ignoring pd_checksum, reducing to 16 bits) to PageCalcChecksum16
 * overall easier to review and understand

 I'm not sure what we should call the separate file or where we should
 put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
 there a clean way to override the compiler flags for a single file so we
 don't need to put it in its own directory?


OK, I like that a lot better and it seems something I could commit.

I suggest the following additional changes...

* put the README stuff directly in the checksum.c file
  * I think we need some external links that describe this algorithm
and we need comments that explain what we know about this in terms of
detection capability and why it was chosen against others
  * We need some comments/analysis about whether the coding causes a
problem if vectorization is *not* available

* make the pg_control.data_checksums field into a version number, for
future flexibility...
patch attached

* rename the routine away from checksum_fnv so it's simply a generic
checksum call - more modular.

That way all knowledge of the algorithm is in one file only. If we do
need to change the algorithm in the future we can more easily support
multiple versions.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


checksums_version.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-23 Thread Jeff Davis
On Tue, 2013-04-23 at 16:28 +0100, Simon Riggs wrote:
 * make the pg_control.data_checksums field into a version number, for
 future flexibility...
 patch attached

Commenting on this separately because it's a separate issue.

I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
a new algorithm without forcing existing users to change.

initdb would have to take the code as an option, probably in string
form.

What do you think?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Robert Haas
On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith g...@2ndquadrant.com wrote:
 The more I read of this thread, the more unhappy I get.  It appears that
 the entire design process is being driven by micro-optimization for CPUs
 being built by Intel in 2013.

 And that's not going to get anyone past review, since all the tests I've
 been doing the last two weeks are on how fast an AMD Opteron 6234 with OS
cache > shared_buffers can run this.  The main thing I'm still worried 
 about is what happens when you have a fast machine that can move memory
 around very quickly and an in-memory workload, but it's hamstrung by the
 checksum computation--and it's not a 2013 Intel machine.

This is a good point.  However, I don't completely agree with the
conclusion that we shouldn't be worrying about any of this right now.
While I agree with Tom that it's far too late to think about any
CPU-specific optimizations for 9.3, I have a lot of concern, based on
Ants's numbers, that we've picked a checksum algorithm which is hard
to optimize for performance.  If we don't get that fixed for 9.3,
we're potentially looking at inflicting many years of serious
suffering on our user base.  If we at least get the *algorithm* right
now, we can worry about optimizing it later.  If we get it wrong,
we'll be living with the consequence of that for a really long time.

I wish that we had not scheduled beta quite so soon, as I am sure
there will be even more resistance to changing this after beta.  But
I'm having a hard time escaping the conclusion that we're on the edge
of shipping something we will later regret quite deeply.  Maybe I'm
wrong?

...Robert


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Andres Freund
On 2013-04-22 11:27:25 -0400, Robert Haas wrote:
 On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith g...@2ndquadrant.com wrote:
  The more I read of this thread, the more unhappy I get.  It appears that
  the entire design process is being driven by micro-optimization for CPUs
  being built by Intel in 2013.
 
  And that's not going to get anyone past review, since all the tests I've
  been doing the last two weeks are on how fast an AMD Opteron 6234 with OS
 cache > shared_buffers can run this.  The main thing I'm still worried
  about is what happens when you have a fast machine that can move memory
  around very quickly and an in-memory workload, but it's hamstrung by the
  checksum computation--and it's not a 2013 Intel machine.
 
 This is a good point.  However, I don't completely agree with the
 conclusion that we shouldn't be worrying about any of this right now.
 While I agree with Tom that it's far too late to think about any
 CPU-specific optimizations for 9.3, I have a lot of concern, based on
 Ants's numbers, that we've picked a checksum algorithm which is hard
 to optimize for performance.  If we don't get that fixed for 9.3,
 we're potentially looking at inflicting many years of serious
 suffering on our user base.  If we at least get the *algorithm* right
 now, we can worry about optimizing it later.  If we get it wrong,
 we'll be living with the consequence of that for a really long time.
 
 I wish that we had not scheduled beta quite so soon, as I am sure
 there will be even more resistance to changing this after beta.  But
 I'm having a hard time escaping the conclusion that we're on the edge
 of shipping something we will later regret quite deeply.  Maybe I'm
 wrong?

I don't see us changing away from CRCs anymore either by now. But I
think at least changing the polynomial to something that
a) has higher error detection properties
b) can be noticeably sped up on a good part of the hardware pg is run on
seems worthwhile.

If we are feeling really adventurous we can switch to a faster CRC
implementation; there are enough of them around, and I know that at least my
proposed patch from some years ago (which is by far not the fastest that
is doable) is in production usage in some places.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Ants Aasma
On Mon, Apr 22, 2013 at 6:27 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith g...@2ndquadrant.com wrote:
 The more I read of this thread, the more unhappy I get.  It appears that
 the entire design process is being driven by micro-optimization for CPUs
 being built by Intel in 2013.

 And that's not going to get anyone past review, since all the tests I've
 been doing the last two weeks are on how fast an AMD Opteron 6234 with OS
 cache > shared_buffers can run this.  The main thing I'm still worried
 about is what happens when you have a fast machine that can move memory
 around very quickly and an in-memory workload, but it's hamstrung by the
 checksum computation--and it's not a 2013 Intel machine.

 This is a good point.  However, I don't completely agree with the
 conclusion that we shouldn't be worrying about any of this right now.
 While I agree with Tom that it's far too late to think about any
 CPU-specific optimizations for 9.3, I have a lot of concern, based on
 Ants's numbers, that we've picked a checksum algorithm which is hard
 to optimize for performance.  If we don't get that fixed for 9.3,
 we're potentially looking at inflicting many years of serious
 suffering on our user base.  If we at least get the *algorithm* right
 now, we can worry about optimizing it later.  If we get it wrong,
 we'll be living with the consequence of that for a really long time.

I was just now writing up a generic C based patch based on the
parallel FNV-1a + shift that we discussed with Florian with an added
round of mixing. Testing the performance in isolation indicates that:
1) it is about an order of magnitude faster than the Sarwate CRC
method used in Postgresql.
2) it is about 2x faster than fastest software based CRC method.
3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation
options the performance improves 5x. (within 20% of handcoded ASM)

This leaves lingering doubts about the quality of the checksum. It's
hard if not impossible to prove absence of interesting patterns that
would trigger collisions. I do know the checksum quality is miles
ahead of the Fletcher sum originally proposed and during the last week
I haven't been able to think of a way to make the collision rate
significantly differ from CRC.

 I wish that we had not scheduled beta quite so soon, as I am sure
 there will be even more resistance to changing this after beta.  But
 I'm having a hard time escaping the conclusion that we're on the edge
 of shipping something we will later regret quite deeply.  Maybe I'm
 wrong?

It's unfortunate that this got delayed by so long. The performance side
of the argument was clear a month ago.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Ants Aasma
On Mon, Apr 22, 2013 at 6:33 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't see us changing away from CRCs anymore either by now. But I
 think at least changing the polynom to something that
 a) has higher error detection properties
 b) can noticeably sped up on a a good part of the hardware pg is run on

+1 for changing the polynomial if we stick with CRC, but I think the
differences in error detection capability are mostly academic for the
PostgreSQL use case. Or does anyone have experience with seeing
multiple random bit errors per page?

 If we are feeling really adventurous we can switch to a faster CRC
 implementation, there are enough ones around and I know that at least my
 proposed patch from some years ago (which is by far not the fastest that
 is doable) is in production usage some places.

The faster CRC implementations just use lookup tables covering more
bytes in parallel. Performance results from [1] show that doing 4
bytes in parallel will yield a 2.8x speedup, and 8 bytes in parallel
yields another 1.7x on top of that at the cost of using an 8kB lookup
table. And the end result is still over 3x slower than the code in the
original patch; Greg's performance results are what prompted me to look
at what would have a lower overhead.

[1] http://create.stephan-brumme.com/crc32/
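
For reference, a minimal byte-at-a-time (Sarwate) CRC-32C sketch, i.e. the
baseline that the slice-by-N methods speed up by looking up several bytes per
iteration from larger tables. This is an illustrative sketch, not code from
any patch in this thread:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static uint32_t crc32c_table[256];

/* Build the byte-at-a-time table for the reflected CRC-32C polynomial. */
static void
crc32c_init(void)
{
	for (uint32_t i = 0; i < 256; i++)
	{
		uint32_t crc = i;
		for (int j = 0; j < 8; j++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
		crc32c_table[i] = crc;
	}
}

/* Classic Sarwate loop: one table lookup per input byte. */
static uint32_t
crc32c(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFF;
	for (size_t i = 0; i < len; i++)
		crc = (crc >> 8) ^ crc32c_table[(crc ^ buf[i]) & 0xFF];
	return crc ^ 0xFFFFFFFF;
}

int
main(void)
{
	crc32c_init();
	/* Standard check value: CRC-32C("123456789") == E3069283 */
	printf("%08X\n", (unsigned) crc32c((const unsigned char *) "123456789", 9));
	return 0;
}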

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Josh Berkus
On 04/22/2013 09:25 AM, Ants Aasma wrote:
 This leaves lingering doubts about the quality of the checksum. It's
 hard if not impossible to prove absence of interesting patterns that
 would trigger collisions. I do know the checksum quality is miles
 ahead of the Fletcher sum originally proposed and during the last week
 I haven't been able to think of a way to make the collision rate
 significantly differ from CRC.

When we originally discussed this feature, we were potentially
discussing a checksum algo which produced collisions for 1 out of 256
pages.  That approach was considered acceptable, since it would be very
unlikely for such a collision to occur across multiple corrupted pages,
and fairly rare to have only one corrupted page.

So my perspective is, if we're doing better than 1 in 256, it's good enough.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Florian Pflug
On Apr22, 2013, at 18:25 , Ants Aasma a...@cybertec.at wrote:
 I was just now writing up a generic C based patch based on the
 parallel FNV-1a + shift that we discussed with Florian with an added
 round of mixing. Testing the performance in isolation indicates that:
 1) it is about an order of magnitude faster than the Sarwate CRC
 method used in Postgresql.
 2) it is about 2x faster than fastest software based CRC method.
 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation
 options the performance improves 5x. (within 20% of handcoded ASM)
 
 This leaves lingering doubts about the quality of the checksum. It's
 hard if not impossible to prove absence of interesting patterns that
 would trigger collisions. I do know the checksum quality is miles
 ahead of the Fletcher sum originally proposed and during the last week
 I haven't been able to think of a way to make the collision rate
 significantly differ from CRC.

Note though that CRCs may very well have similar interesting
corruption patterns which don't cause the checksum to change.
The only guarantee they really give is that those patterns will involve
more than N-1 flipped bits, where N is the Hamming distance of the
CRC. For 16-bit checksums, N can be at most 16 (since XOR-ing the data
with a shifted version of the CRC polynomial will not cause the checksum
to change).

Thus, once more than two bytes on a page get corrupted, CRCs may not
have any advantage over fnv1+shift or similar approaches. They may even
work worse, since detecting some forms of corruption with 100% certainty
means missing others with a probability of more than 2^-16. Some CRC
polynomials for example detect all corruptions which affect an odd number
of bits, but in turn have a probability of 2^-15 of missing ones which
affect an even number of bits.

Since we're mostly attempting to protect against disk, not memory
corruption here, I'm not convinced at all that errors in only a few
bits are all that common, and certainly not that they are more likely
than other forms of corruption. I'd expect, for example, that blocks of
512 bytes (i.e. one sector) suddenly reading 0 is at least as likely
as a single flipped bit.

The one downside of the fnv1+shift approach is that it's built around
the assumption that processing 64-bytes at once is the sweet spot. That
might be true for x86 and x86_64 today, but it won't stay that way for
long, and quite surely isn't true for other architectures. That doesn't
necessarily rule it out, but it certainly weakens the argument that
slipping it into 9.3 avoids having the change the algorithm later...

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Ants Aasma
On Mon, Apr 22, 2013 at 9:04 PM, Florian Pflug f...@phlo.org wrote:
 The one downside of the fnv1+shift approach is that it's built around
 the assumption that processing 64-bytes at once is the sweet spot. That
 might be true for x86 and x86_64 today, but it won't stay that way for
 long, and quite surely isn't true for other architectures. That doesn't
 necessarily rule it out, but it certainly weakens the argument that
 slipping it into 9.3 avoids having the change the algorithm later...

It's actually 128 bytes as it was tested. The ideal shape depends on
multiplication latency, multiplication throughput and amount of
registers available. Specifically BLCKSZ/mul_throughput_in_bytes needs
to be larger than BLCKSZ/(N_SUMS*sizeof(uint32))*(mul latency + 2*xor
latency). For latest Intel the values are 8192/16 = 512 and
8192/(32*4)*(5 + 2*1) = 448. 128 bytes is also 8 registers which is
the highest power of two fitting into architectural registers (16).
This means that the value chosen is indeed the sweet spot for x86
today. For future processors we can expect the multiplication width to
increase, and possibly the latency too, shifting the sweet spot into
higher widths. In fact, Haswell coming out later this year should have
AVX2 instructions that introduce integer ops on 256bit registers,
making the current choice already suboptimal.
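
For reference, the arithmetic above spelled out as a small standalone
program. The latency and throughput constants are the assumptions stated in
this message for a recent Intel core, not measurements:

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ  8192
#define N_SUMS  32				/* 32 parallel uint32 sums = 128 bytes per loop */

int
main(void)
{
	int		mul_throughput_bytes = 16;	/* assumed bytes multiplied per cycle */
	int		mul_latency = 5;			/* assumed pmulld latency */
	int		xor_latency = 1;			/* assumed xor latency */

	/* cycles needed if limited purely by multiplier throughput */
	int		throughput_bound = BLCKSZ / mul_throughput_bytes;

	/* cycles needed to walk the dependency chain: one mul + two xors per round */
	int		latency_bound = BLCKSZ / (N_SUMS * (int) sizeof(uint32_t)) *
							(mul_latency + 2 * xor_latency);

	printf("throughput bound: %d cycles, latency bound: %d cycles\n",
		   throughput_bound, latency_bound);	/* 512 vs 448 */
	return 0;
}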

All that said, having a lower width won't make the algorithm slower on
future processors, it will just leave some parallelism on the table
that could be used to make it even faster. The line in the sand needed
to be drawn somewhere, I chose the maximum comfortable width today
fearing that even that would be shot down based on code size.
Coincidentally 32 elements is also the internal parallelism that GPUs
have settled on. We could bump the width up by one notch to buy some
future safety, but after that I'm skeptical we will see any
conventional processors that would benefit from a higher width. I just
tested that the auto-vectorized version runs at basically identical
speed, as GCC's inability to do good register allocation means that it
juggles values between registers and the stack one way or the other.

So to recap, I don't know of any CPUs where a lower value would be
better. Raising the width by one notch would mean better performance
on future processors, but raising it further would just bloat the size
of the inner loop without much benefit in sight.

Regards,
Ants Aasma

-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Jeff Davis
On Mon, 2013-04-22 at 20:04 +0200, Florian Pflug wrote:
 The one downside of the fnv1+shift approach is that it's built around
 the assumption that processing 64-bytes at once is the sweet spot. That
 might be true for x86 and x86_64 today, but it won't stay that way for
 long, and quite surely isn't true for other architectures. That doesn't
 necessarily rule it out, but it certainly weakens the argument that
 slipping it into 9.3 avoids having the change the algorithm later...

I think you are setting the bar way too high. Right now, we have a slow
algorithm. According to Ants's tests, FNV-1a is much, much faster. Do
you think that it will still be such a bottleneck that we will want to
change it again later for purely performance reasons?

The only time this is likely to matter is in the situation Greg Smith
describes, where shared buffers is much smaller than memory, and the
working set of buffers is near the size of memory (in other words, a lot
of buffers moving to and from shared memory, but not much to or from
disk). And it's already significantly faster than the algorithm in the
original tests (Fletcher), so it's not clear that it's still even a
serious problem.

(Also remember that checksum users already accept a WAL penalty.)

The biggest problem now is getting one of these faster algorithms (FNV
or even a faster CRC) into shape that is acceptable to
reviewers/committers. If we don't do that, we will be missing out on a
lot of potential checksum users for whom the existing CRC algorithm is
just too slow.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Robert Haas
On Mon, Apr 22, 2013 at 3:14 PM, Jeff Davis pg...@j-davis.com wrote:
 The biggest problem now is getting one of these faster algorithms (FNV
 or even a faster CRC) into shape that is acceptable to
 reviewers/committers. If we don't do that, we will be missing out on a
 lot of potential checksum users for whom the existing CRC algorithm is
 just too slow.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Jeff Davis
On Mon, 2013-04-22 at 19:25 +0300, Ants Aasma wrote:
 I was just now writing up a generic C based patch based on the
 parallel FNV-1a + shift that we discussed with Florian with an added
 round of mixing. Testing the performance in isolation indicates that:
 1) it is about an order of magnitude faster than the Sarwate CRC
 method used in Postgresql.
 2) it is about 2x faster than fastest software based CRC method.
 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation
 options the performance improves 5x. (within 20% of handcoded ASM)

That's great news!

This means that we can have a simple C implementation in a separate
file, and pass a few build flags when compiling just that file (so it
doesn't affect other code). That should make reviewers/committers happy
(including me).

FWIW, that was my last real concern about FNV (reviewability). I'm not
worried about the performance based on your analysis; nor am I worried
about the error detection rate.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Florian Pflug
On Apr22, 2013, at 21:14 , Jeff Davis pg...@j-davis.com wrote:
 On Mon, 2013-04-22 at 20:04 +0200, Florian Pflug wrote:
 The one downside of the fnv1+shift approach is that it's built around
 the assumption that processing 64-bytes at once is the sweet spot. That
 might be true for x86 and x86_64 today, but it won't stay that way for
 long, and quite surely isn't true for other architectures. That doesn't
 necessarily rule it out, but it certainly weakens the argument that
 slipping it into 9.3 avoids having the change the algorithm later...
 
 I think you are setting the bar way too high. Right now, we have a slow
 algorithm. According to Ants's tests, FNV-1a is much, much faster. Do
 you think that it will still be such a bottleneck that we will want to
 change it again later for purely performance reasons?

To clarify, it wasn't my intent to argue against shipping FNV1+SHIFT
in 9.3 - in fact I'd like to see us do exactly that. I was merely trying
to be unbiased, and hence stated not only the arguments in favour of FNV1+SHIFT
(the ones about CRCs' theoretical advantages in error detection being 
not really relevant to us), but also the one downside of FNV1+SHIFT.

Seems like I could have done a better job expressing myself though.

 The biggest problem now is getting one of these faster algorithms (FNV
 or even a faster CRC) into shape that is acceptable to
 reviewers/committers. If we don't do that, we will be missing out on a
 lot of potential checksum users for whom the existing CRC algorithm is
 just too slow.

Assuming that we only ship a plain C implementation with 9.3, what
are we missing on that front? The C implementation of FNV1+SHIFT is
only a few dozen lines or so.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Simon Riggs
On 22 April 2013 20:32, Florian Pflug f...@phlo.org wrote:

 Assuming that we only ship a plain C implementation with 9.3, what
 are we missing on that front? The C implementation of FNV1+SHIFT is
 only a few dozen lines or so.

Forgive me, I can't seem to locate the patch for this? Re-post please,
just for clarity.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Ants Aasma
On Mon, Apr 22, 2013 at 10:54 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 22 April 2013 20:32, Florian Pflug f...@phlo.org wrote:

 Assuming that we only ship a plain C implementation with 9.3, what
 are we missing on that front? The C implementation of FNV1+SHIFT is
 only a few dozen lines or so.

 Forgive me, I can't seem to locate the patch for this? Re-post please,
 just for clarity.

Not posted yet. I'm writing it as we speak. Will post within half an hour or so.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Ants Aasma
On Mon, Apr 22, 2013 at 10:57 PM, Ants Aasma a...@cybertec.at wrote:
 On Mon, Apr 22, 2013 at 10:54 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 22 April 2013 20:32, Florian Pflug f...@phlo.org wrote:

 Assuming that we only ship a plain C implementation with 9.3, what
 are we missing on that front? The C implementation of FNV1+SHIFT is
 only a few dozen lines or so.

 Forgive me, I can't seem to locate the patch for this? Re-post please,
 just for clarity.

 Not posted yet. I'm writing it as we speak. Will post within half an hour or 
 so.

A slight delay, but here it is. I didn't lift the checksum part into a
separate file as I didn't have a great idea what I would call it. The
code is reasonably compact so I don't see a great need for this right
now. It would be more worth the effort when/if we add non-generic
variants. I'm not particularly attached to the method I used to mask
out pd_checksum field, this could be improved if someone has a better
idea how to structure the code.

I confirmed with objdump that compiling on GCC 4.7 with -msse4.1
-funroll-loops -ftree-vectorize does in fact vectorize that loop.
Simple way to verify: objdump -d src/backend/storage/page/bufpage.o |
grep pmulld | wc -l should output 16.

Unfortunately I can't work on this patch for about a week. Postgresql
9.3 will have to wait for me as I need to tend to the release of Ants
v2.0.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


parallel-fnv-checksum.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Jeff Davis
On Tue, 2013-04-23 at 01:08 +0300, Ants Aasma wrote:
 A slight delay, but here it is. I didn't lift the checksum part into a
 separate file as I didn't have a great idea what I would call it. The
 code is reasonably compact so I don't see a great need for this right
 now. It would be more worth the effort when/if we add non-generic
 variants. I'm not particularly attached to the method I used to mask
 out pd_checksum field, this could be improved if someone has a better
 idea how to structure the code.

Thank you. A few initial comments:

I have attached (for illustration purposes only) a patch on top of yours
that divides the responsibilities a little more cleanly.

* easier to move into a separate file, and use your recommended compiler
flags without affecting other routines in bufpage.c
* makes the checksum algorithm itself simpler
* leaves the data-page-specific aspects (mixing in the page number,
ignoring pd_checksum, reducing to 16 bits) to PageCalcChecksum16
* overall easier to review and understand

I'm not sure what we should call the separate file or where we should
put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
there a clean way to override the compiler flags for a single file so we
don't need to put it in its own directory?

Regards,
Jeff Davis

*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***
*** 23,28  static char pageCopyData[BLCKSZ];	/* for checksum calculation */
--- 23,29 
  static Page pageCopy = pageCopyData;
  
  static uint16 PageCalcChecksum16(Page page, BlockNumber blkno);
+ static uint32 checksum_fnv(char *data, uint32 size);
  
  /* 
   *		Page support functions
***
*** 986,1017  static const uint32 checksumBaseOffsets[N_SUMS] = {
  static uint16
  PageCalcChecksum16(Page page, BlockNumber blkno)
  {
! 	uint32 sums[N_SUMS];
! 	uint32 (*pageArr)[N_SUMS] = (uint32 (*)[N_SUMS]) page;
! 	uint32 result = blkno;
! 	int i, j;
! 	int pd_checksum_word = offsetof(PageHeaderData, pd_checksum)/sizeof(uint32);
! 	int pd_checksum_half = (offsetof(PageHeaderData, pd_checksum) % sizeof(uint32)) / sizeof(uint16);
  
  	/* only calculate the checksum for properly-initialized pages */
  	Assert(!PageIsNew(page));
  
  	/* initialize partial checksums to their corresponding offsets */
  	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
  
! 	/* first iteration needs to mask out pd_checksum field itself with zero */
! 	for (j = 0; j < N_SUMS; j++)
! 	{
! 		uint32 value = pageArr[0][j];
! 		if (j == pd_checksum_word)
! 			((uint16*) &value)[pd_checksum_half] = 0;
! 		CHECKSUM_COMP(sums[j], value);
! 	}
! 
! 	/* now add in the rest of the page */
! 	for (i = 1; i < BLCKSZ/sizeof(uint32)/N_SUMS; i++)
  		for (j = 0; j < N_SUMS; j++)
! 			CHECKSUM_COMP(sums[j], pageArr[i][j]);
  
  	/* finally add in one round of zeroes for one more layer of mixing */
  	for (j = 0; j  N_SUMS; j++)
--- 987,1035 
  static uint16
  PageCalcChecksum16(Page page, BlockNumber blkno)
  {
! 	PageHeader	phdr   = (PageHeader) page;
! 	uint16		save_checksum;
! 	uint32		fnv_checksum;
  
  	/* only calculate the checksum for properly-initialized pages */
  	Assert(!PageIsNew(page));
  
+ 	/*
+ 	 * Save pd_checksum and set it to zero, so that the checksum calculation
+ 	 * isn't affected by the checksum stored on the page.
+ 	 */
+ 	save_checksum = phdr->pd_checksum;
+ 	phdr->pd_checksum = 0;
+ 	fnv_checksum = checksum_fnv(page, BLCKSZ);
+ 	phdr->pd_checksum = save_checksum;
+ 
+ 	/* mix in the block number to detect transposed pages */
+ 	fnv_checksum ^= blkno;
+ 
+ 	/*
+ 	 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
+ 	 * one. That avoids checksums of zero, which seems like a good idea.
+ 	 */
+ 	return (fnv_checksum % 65535) + 1;
+ }
+ 
+ static uint32
+ checksum_fnv(char *data, uint32 size)
+ {
+ 	uint32 sums[N_SUMS];
+ 	uint32 (*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;
+ 	uint32 result;
+ 	int i, j;
+ 
+ 	Assert((size % (sizeof(uint32)*N_SUMS)) == 0);
+ 
  	/* initialize partial checksums to their corresponding offsets */
  	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
  
! 	/* main checksum calculation */
! 	/* main checksum calculation */
! 	for (i = 0; i < size/sizeof(uint32)/N_SUMS; i++)
  		for (j = 0; j < N_SUMS; j++)
! 			CHECKSUM_COMP(sums[j], dataArr[i][j]);
  
  	/* finally add in one round of zeroes for one more layer of mixing */
  	for (j = 0; j < N_SUMS; j++)
***
*** 1021,1026  PageCalcChecksum16(Page page, BlockNumber blkno)
  	for (i = 0; i < N_SUMS; i++)
  		result ^= sums[i];
  
! 	/* use mod mapping to map the value into 16 bits and offset from zero */
! 	return (result % 65535) + 1;
  }
--- 1039,1043 
  	for (i = 0; i < N_SUMS; i++)
  		result ^= sums[i];
  
! 	return result;
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] Enabling Checksums

2013-04-22 Thread Alvaro Herrera
Jeff Davis wrote:

 I'm not sure what we should call the separate file or where we should
 put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
 there a clean way to override the compiler flags for a single file so we
 don't need to put it in its own directory?

Sure, see src/backend/parser/Makefile about gram.o.
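
For reference, the technique being pointed at is a GNU make target-specific
variable, which lets a single object file pick up extra flags. A hedged
sketch, using an illustrative object name and the compiler flags mentioned
earlier in the thread (consult the parser Makefile for the exact idiom used
there):

# In the Makefile of the directory holding the checksum code, something like:
ifeq ($(GCC),yes)
checksum_fnv.o: CFLAGS += -msse4.1 -funroll-loops -ftree-vectorize
endif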

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Andres Freund
On 2013-04-17 18:16:36 -0700, Daniel Farina wrote:
 The original paper is often shorthanded Castagnoli 93, but it exists
 in the IEEE's sphere of influence and is hard to find a copy of.
 Luckily, a pretty interesting survey paper discussing some of the
 issues was written by Koopman in 2002 and is available:
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a
 pedagolgical note, it's pretty interesting and accessible piece of
 writing (for me, as someone who knows little of error
 detection/correction) and explains some of the engineering reasons
 that provoke such exercises.

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=231911&userType=inst

There's also a koopman paper from 2004 thats interesting.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Andres Freund
On 2013-04-18 00:44:02 +0300, Ants Aasma wrote:
 I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a
 + srl1-xor variants and ran performance tests and detection rate tests
 on both.
 
 Performance results:
 Mul-add checksums: 12.9 bytes/s
 FNV-1a checksums: 13.5 bytes/s
 FNV-1a + srl-1: 7.4 bytes/s
 
 Detection rates:
 False positive rates:
                   Add-mul       FNV-1a     FNV-1a + srl-1
  Single bit flip: 1:inf         1:129590   1:64795
  Double bit flip: 1:148         1:511      1:53083
  Triple bit flip: 1:673         1:5060     1:61511
    Quad bit flip: 1:1872        1:19349    1:68320
  Write 0x00 byte: 1:774538137   1:118776   1:68952
  Write 0xFF byte: 1:165399500   1:137489   1:68958
    Partial write: 1:59949       1:71939    1:89923
    Write garbage: 1:64866       1:64980    1:67732
  Write run of 00: 1:57077       1:61140    1:59723
  Write run of FF: 1:63085       1:59609    1:62977
 
 Test descriptions:
 N bit flip: picks N random non-overlapping bits and flips their value.
 Write X byte: overwrites a single byte with X.
 Partial write: picks a random cut point, overwrites everything from
 there to end with 0x00.
 Write garbage/run of X: picks two random cut points and fills
 everything in between with random values/X bytes.

I don't think this table is complete without competing numbers for
truncated crc-32. Any chance to get that?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Simon Riggs
On 17 April 2013 22:36, Bruce Momjian br...@momjian.us wrote:

  I would like to know the answer of how an upgrade from checksum to
  no-checksum would behave so I can modify pg_upgrade to allow it.

 Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if
 someone has checksums enabled and wants to disable it, why is pg_upgrade
 the right time to do that? Wouldn't it make more sense to allow them to
 do that at any time?

 Well, right now, pg_upgrade is the only way you could potentially turn
 off checksums.  You are right that we might eventually want a command,
 but my point is that we currently have a limitation in pg_upgrade that
 might not be necessary.

We don't currently have checksums, so pg_upgrade doesn't need to cope
with turning them off in 9.3

For 9.4, it might, but likely we'll have a tool to turn them off
before then anyway.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Daniel Farina
On Wed, Apr 17, 2013 at 11:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-17 18:16:36 -0700, Daniel Farina wrote:
 The original paper is often shorthanded Castagnoli 93, but it exists
 in the IEEE's sphere of influence and is hard to find a copy of.
 Luckily, a pretty interesting survey paper discussing some of the
 issues was written by Koopman in 2002 and is available:
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a
 pedagolgical note, it's pretty interesting and accessible piece of
 writing (for me, as someone who knows little of error
 detection/correction) and explains some of the engineering reasons
 that provoke such exercises.

 http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=arnumber=231911userType=inst

 There's also a koopman paper from 2004 thats interesting.

Having read the 2002 paper more, it seems that the current CRC32
doesn't have a whole lot going for it: CRC32C pretty much cleans its
clock across the board (I don't understand detected Hamming Distance
that seem greater than the information content of the message, e.g. HD
14 with 8 bit messages as seen in CRC32C: that's where CRC32 can win).

CRC32C looks, all in all, the most flexible, because detection of
Hamming Distance 4 spans from 5244-131072 bits (the upper range of
which is a full 16KiB!) and there is superior Hamming Distance
detection on shorter messages up until the point where it seems like
the Hamming Distance able to be detected is larger than the message
size itself (e.g. HM 13 on an 8 bit message).  I'm not sure if this is
an error in my understanding, or what.

Also, larger runs (16KB) are better served by CRC32C: even the
probably-best contender I can see (0xD419CC15) drops to Hamming
Distance 2-detection right after 65505 bits.  CRC32C has the biggest
range at HD4, although Koopman 0xBA0DC66B comes close, gaining superior
Hamming distance detection for 178-16360 bits (the upper end of this
range is short of 2KiB by 3 bytes).

All in all, there is no reason I can see to keep CRC32 at all, vs
CRC32C on the basis of error detection alone, so putting aside all the
business about instruction set architecture, I think a software CRC32C
in a vacuum can be seen as a robustness improvement.

There may be polynomials that are not CRC32 or CRC32C that one might
view as having slightly better tradeoffs as seen in Table 1 of Koopman
2002, but it's kind of a stretch: being able to handle 8KB and 16KB at
HD4, as seen in CRC32C, is awfully compelling to me.
Koopman 0xBA0DC66B can admirably reach HD6 on a much larger range, up
to 16360 bits, which is ever so shy of 2KiB.  Castagnoli 0xD419CC15
can, up to 31 bits short of 8KB, detect HD 5.

Corrections welcome on my interpretations of Tbl 1.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Ants Aasma
On Thu, Apr 18, 2013 at 5:08 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 4/17/13 8:56 PM, Ants Aasma wrote:

 Nothing from the two points, but the CRC calculation algorithm can be
 switched out for slice-by-4 or slice-by-8 variant. Speed up was around
 factor of 4 if I remember correctly...I can provide you

 with a patch of the generic version of any of the discussed algorithms
 within an hour, leaving plenty of time in beta or in 9.4 to
 accommodate the optimized versions.

 Can you nail down a solid, potential for commit slice-by-4 or slice-by-8
 patch then?  You dropped into things like per-byte overhead to reach this
 conclusion, which was fine to let the methods battle each other. Maybe I
 missed it, but I didn't remember seeing an obvious full patch for this
 implementation then come back up from that.  With the schedule pressure this
 needs to return to more database-level tests.  Your concerns about the
 committed feature being much slower than the original Fletcher one are
 troubling, and we might as well do that showdown again now with the best of
 the CRC implementations you've found.

I meant that any of the fast ones is easy to nail down. The sped-up
slice-by-8 is slightly trickier to clean up. Especially if anyone
expects it to accelerate WAL calculation, then it brings up a whole
bunch of design questions on how to handle alignment issues. For
performance testing what is attached should work fine, it would still
need some cleanup.

 It's fair that you're very concerned about (1), but I wouldn't give it 100%
 odds of happening either.  The user demand that's motivated me to work on
 this will be happy with any of (1) through (3), and in two of them
 optimizing the 16 bit checksums now turns out to be premature.

Fair enough, although I'd like to point out the optimization is
premature in the sense that the effort might go to waste. The checksum
function is a self contained, easy to test and very low maintenance
piece of code - not the usual premature optimization risk.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


crc32c-sb8-checksum.v0.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Bruce Momjian
On Thu, Apr 18, 2013 at 09:17:39AM +0100, Simon Riggs wrote:
 On 17 April 2013 22:36, Bruce Momjian br...@momjian.us wrote:
 
   I would like to know the answer of how an upgrade from checksum to
   no-checksum would behave so I can modify pg_upgrade to allow it.
 
  Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if
  someone has checksums enabled and wants to disable it, why is pg_upgrade
  the right time to do that? Wouldn't it make more sense to allow them to
  do that at any time?
 
  Well, right now, pg_upgrade is the only way you could potentially turn
  off checksums.  You are right that we might eventually want a command,
  but my point is that we currently have a limitation in pg_upgrade that
  might not be necessary.
 
 We don't currently have checksums, so pg_upgrade doesn't need to cope
 with turning them off in 9.3

True, 9.2 doesn't have checksums, while 9.3 will.  One point is that
pg_upgrade could actually be used to turn off checksums for 9.3 to 9.3
upgrades if no tablespaces are used.

 For 9.4, it might, but likely we'll have a tool to turn them off
 before then anyway.

True.   Would we want pg_upgrade to still enforce matching checksum
modes for old and new servers at that point?  Eventually we will have to
decide that.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Ants Aasma
On Thu, Apr 18, 2013 at 5:57 PM, Ants Aasma a...@cybertec.at wrote:
 I'll generate an avalanche diagram for CRC32C too, but it will take a
 while even if I use a smaller dataset.

Well that was useless... In CRC, flipping each bit in the input flips a
preset pattern of bits in the output regardless of the actual data on
the page. Some stats for CRC32C - input bits affect 28344 different
bit combinations. Count of bits by number of duplicated bitpatterns:
[(1, 8868),
 (2, 17722),
 (3, 17775),
 (4, 12048),
 (5, 5725),
 (6, 2268),
 (7, 875),
 (8, 184),
 (9, 45),
 (10, 10),
 (16, 16)]

Count of bit positions by number of bit-positions affected:
[(0, 16),
 (1, 25),
 (3, 1185),
 (5, 8487),
 (7, 22970),
 (9, 22913),
 (11, 8790),
 (13, 1119),
 (15, 31)]

A map of the number of bit positions affected is attached, with 8 shown
as black and 0 or 16 as red.

I'm not sure if the issues with partial writes are somehow related to this.

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
attachment: effect-random-crc.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Jeff Davis
On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote:
 -Original checksum feature used Fletcher checksums.  Its main problems, 
 to quote wikipedia, include that it cannot distinguish between blocks 
 of all 0 bits and blocks of all 1 bits.

That is fairly easy to fix by using a different modulus: 251 vs 255.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Florian Weimer
* Greg Smith:

 The TCP/IP checksum spec is at https://tools.ietf.org/html/rfc793 ;
 its error detection limitations are described at
 http://www.noahdavids.org/self_published/CRC_and_checksum.html ; and a
 good article about optimizing its code is at
 http://www.locklessinc.com/articles/tcp_checksum/  I'll take a longer
 look at whether it's an improvement on the Fletcher-16 used by the
 current patch.

The TCP checksum is too weak to be practical.  Every now and then, I
see data transfers where the checksum is valid, but the content
contains bit flips.  Anything that flips bits randomly at intervals
which are multiples of 16 bits is quite likely to pass through
checksum detection.

In practice, TCP relies on checksumming on the sub-IP layers.
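
To make the failure mode concrete: the RFC 1071 checksum is a 16-bit
one's-complement sum, so flipping a given bit from 0 to 1 in one word and
the same bit position from 1 to 0 in another word leaves the sum unchanged.
A small self-contained demonstration (the buffer contents are arbitrary):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* RFC 1071 Internet checksum: one's-complement sum of 16-bit words. */
static uint16_t
inet_checksum(const uint16_t *words, size_t nwords)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += words[i];
	while (sum >> 16)				/* fold carries back in */
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t) ~sum;
}

int
main(void)
{
	uint16_t buf[32];

	for (int i = 0; i < 32; i++)
		buf[i] = (uint16_t) (i * 31153 + 7);

	/* Known starting state: bit 3 clear in word 5, set in word 20. */
	buf[5] &= ~(1u << 3);
	buf[20] |= 1u << 3;
	uint16_t before = inet_checksum(buf, 32);

	/* Flip both: word 5 gains 8, word 20 loses 8, the sum is unchanged. */
	buf[5] |= 1u << 3;
	buf[20] &= ~(1u << 3);
	uint16_t after = inet_checksum(buf, 32);

	printf("before=%04x after=%04x -> %s\n", (unsigned) before,
		   (unsigned) after,
		   before == after ? "corruption undetected" : "detected");
	return 0;
}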


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Florian Pflug
On Apr18, 2013, at 19:04 , Jeff Davis pg...@j-davis.com wrote:
 On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote:
 -Original checksum feature used Fletcher checksums.  Its main problems, 
 to quote wikipedia, include that it cannot distinguish between blocks 
 of all 0 bits and blocks of all 1 bits.
 
 That is fairly easy to fix by using a different modulus: 251 vs 255.

At the expense of a drastic performance hit though, no? Modulus operations
aren't exactly cheap.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Ants Aasma
On Thu, Apr 18, 2013 at 8:05 PM, Florian Pflug f...@phlo.org wrote:
 On Apr18, 2013, at 19:04 , Jeff Davis pg...@j-davis.com wrote:
 On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote:
 -Original checksum feature used Fletcher checksums.  Its main problems,
 to quote wikipedia, include that it cannot distinguish between blocks
 of all 0 bits and blocks of all 1 bits.

 That is fairly easy to fix by using a different modulus: 251 vs 255.

 At the expense of a drastic performance hit though, no? Modulus operations
 aren't exactly cheap.

The modulus can be applied at the end. By using a modulus of 65521 the
resulting checksum is called Adler-32. [1] However the quality of
Fletcher-32/Adler-32 is strictly worse than even the first iteration
of multiply-add based checksums proposed.

[1] http://en.wikipedia.org/wiki/Adler-32
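
For reference, the textbook Adler-32 loop looks like this (a sketch; real
implementations defer the modulo over long runs of bytes instead of
applying it per byte as done here for clarity):

#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u			/* largest prime below 2^16 */

static uint32_t
adler32(const unsigned char *buf, size_t len)
{
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < len; i++)
	{
		a = (a + buf[i]) % ADLER_MOD;	/* plain byte sum */
		b = (b + a) % ADLER_MOD;		/* position-weighted sum */
	}
	return (b << 16) | a;
}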

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Florian Pflug
On Apr18, 2013, at 18:48 , Ants Aasma a...@cybertec.at wrote:
 On Thu, Apr 18, 2013 at 5:57 PM, Ants Aasma a...@cybertec.at wrote:
 I'll generate an avalanche diagram for CRC32C too, but it will take a
 while even if I use a smaller dataset.
 
 Well that was useless... In CRC flipping each bit in the input flips
 preset pattern of bits in the output regardless of the actual data on
 the page. Some stats for CRC32C - input bits affect 28344 different
 bit combinations. Count of bits by number of duplicated bitpatterns:

Yup, CRC is linear too. CRC is essentially long division for polynomials,
i.e. you interpret the N input bits as the coefficients of a (large)
polynomial of degree (N-1), and divide by the CRC polynomial. The remainder
is the checksum, and consists of B bits where B is the degree of the
CRC polynomial. (Polynomial here means polynomial over GF(2), i.e. over
a field with only two values 0 and 1)
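
That linearity is easy to check directly. Because the nonzero initial
value only contributes a constant, CRC is affine, and for equal-length
inputs crc(a xor b) = crc(a) xor crc(b) xor crc(zeros). A small
self-contained check using a plain bitwise CRC32C:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Plain bitwise CRC32C (reflected Castagnoli polynomial). */
static uint32_t
crc32c(const unsigned char *p, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++)
	{
		crc ^= p[i];
		for (int k = 0; k < 8; k++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
	}
	return ~crc;
}

int
main(void)
{
	unsigned char a[64], b[64], ab[64], zero[64] = {0};

	for (int i = 0; i < 64; i++)
	{
		a[i] = (unsigned char) (i * 37 + 11);
		b[i] = (unsigned char) (i * 101 + 3);
		ab[i] = a[i] ^ b[i];
	}

	uint32_t lhs = crc32c(ab, 64);
	uint32_t rhs = crc32c(a, 64) ^ crc32c(b, 64) ^ crc32c(zero, 64);

	printf("crc(a^b)=%08x  crc(a)^crc(b)^crc(0)=%08x  %s\n",
		   (unsigned) lhs, (unsigned) rhs, lhs == rhs ? "equal" : "different");
	return 0;
}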

I'm currently trying to see if one can easily explain the partial-write
behaviour from that. Having lots of zeros at the end corresponds
to an input polynomial of the form

  p(x) * x^l

where l is the number of zero bits. The CRC (q(x) is the CRC polynomial) is

  p(x) * x^l mod q(x) = (p(x) mod q(x)) * (x^l mod q(x)) mod q(x)

That still doesn't explain it, though - the result *should* simply
be the checksum of p(x), scrambled a bit by the multiplication with
(x^l mod q(x)). But if q(x) is irreducible, that scrambling is invertible
(as multiplication modulo some irreducible element always is), and thus
shouldn't matter much.

So either the CRC32-C polynomial isn't irreducible, or there is something
fishy going on. Could there be a bug in your CRC implementation? Maybe
a mixup between big and little endian, or something like that?

The third possibility is that I've overlooking something, of course ;-)
Will think more about this tomorrow if time permits

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Ants Aasma
On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug f...@phlo.org wrote:
 So either the CRC32-C polynomial isn't irreducible, or there something
 fishy going on. Could there be a bug in your CRC implementation? Maybe
 a mixup between big and little endian, or something like that?

I'm suspecting an implementation bug myself. I already checked the
test harness and that was all sane, compiler hadn't taken any
unforgivable liberties there. I will crosscheck the output with other
implementations to verify that the checksum is implemented correctly.

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Jeff Davis
On Thu, 2013-04-18 at 19:05 +0200, Florian Pflug wrote:
 On Apr18, 2013, at 19:04 , Jeff Davis pg...@j-davis.com wrote:
  On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote:
  -Original checksum feature used Fletcher checksums.  Its main problems, 
  to quote wikipedia, include that it cannot distinguish between blocks 
  of all 0 bits and blocks of all 1 bits.
  
  That is fairly easy to fix by using a different modulus: 251 vs 255.
 
 At the expense of a drastic performance hit though, no? Modulus operations
 aren't exactly cheap.

Modulo is only necessary when there's a possibility of overflow, or at
the very end of the calculation. If we accumulate 32-bit integers into
64-bit sums, then it turns out that it can't overflow given the largest
input we support (32K page).

32K page = 8192 32-bit integers

1*(2^32-1) + 2*(2^32-1) + 3*(2^32-1) ... 8192*(2^32-1)
= (2^32-1) * (8192^2 + 8192)/2
= 144132780228341760 ( < 2^64-1 )

So, we only need to do the modulo at the end.
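
A minimal sketch of that accumulate-now, reduce-at-the-end structure; the
32-bit word size, the 251 modulus and the 16-bit folding are illustrative
values from this discussion, not the committed algorithm:

#include <stddef.h>
#include <stdint.h>

/*
 * Fletcher-style sums over a page of 32-bit words, accumulated into
 * 64-bit integers.  Up to a 32K page (8192 words) the weighted sum stays
 * below 2^64, so the modulus only has to be applied once at the end.
 */
static uint16_t
fletcher_page(const uint32_t *words, size_t nwords)
{
	uint64_t sum1 = 0;
	uint64_t sum2 = 0;

	for (size_t i = 0; i < nwords; i++)
	{
		sum1 += words[i];
		sum2 += sum1;			/* weights each word by distance from the end */
	}
	/* A prime modulus keeps all-zero and all-one inputs distinguishable. */
	return (uint16_t) (((sum2 % 251) << 8) | (sum1 % 251));
}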

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Ants Aasma
On Thu, Apr 18, 2013 at 8:24 PM, Ants Aasma a...@cybertec.at wrote:
 On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug f...@phlo.org wrote:
 So either the CRC32-C polynomial isn't irreducible, or there something
 fishy going on. Could there be a bug in your CRC implementation? Maybe
 a mixup between big and little endian, or something like that?

 I'm suspecting an implementation bug myself. I already checked the
 test harness and that was all sane, compiler hadn't taken any
 unforgivable liberties there. I will crosscheck the output with other
 implementations to verify that the checksum is implemented correctly.

Looks like the implementation is correct. I cross-referenced it
against a bitwise algorithm for crc32 with the castagnoli polynomial.
This also rules out any endianness issues as the bitwise variant
consumes input byte at a time.

Whatever it is, it is something specific to the PostgreSQL page layout.
If I use /dev/urandom as the source the issue disappears. So much for
CRC32 being proven good.

Regards,
Ants Aasma
--
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Florian Pflug
On 18.04.2013, at 20:02, Ants Aasma a...@cybertec.at wrote:
 On Thu, Apr 18, 2013 at 8:24 PM, Ants Aasma a...@cybertec.at wrote:
 On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug f...@phlo.org wrote:
 So either the CRC32-C polynomial isn't irreducible, or there something
 fishy going on. Could there be a bug in your CRC implementation? Maybe
 a mixup between big and little endian, or something like that?
 
 I'm suspecting an implementation bug myself. I already checked the
 test harness and that was all sane, compiler hadn't taken any
 unforgivable liberties there. I will crosscheck the output with other
 implementations to verify that the checksum is implemented correctly.
 
 Looks like the implementation is correct. I cross-referenced it
 against a bitwise algorithm for crc32 with the castagnoli polynomial.
 This also rules out any endianness issues as the bitwise variant
 consumes input byte at a time.
 
 What ever it is, it is something specific to PostgreSQL page layout.
 If I use /dev/urandom as the source the issue disappears. So much for
 CRC32 being proven good.

Weird. Is the code of your test harness available publicly, or could you post 
it? I'd like to look into this...

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-18 Thread Greg Stark
On Thu, Apr 18, 2013 at 6:04 PM, Florian Weimer f...@deneb.enyo.de wrote:
 The TCP checksum is too weak to be practical.  Every now an then, I
 see data transfers where the checksum is valid, but the content
 contains bit flips.

Well of course, it's only a 16-bit checksum. 64k packets isn't very
many so if you're not counting checksum failures it won't take very
long before one gets through. The purpose of the checksum is to notify
you that you have a problem, not to block bad packets from getting
through.

 Anything that flips bits randomly at intervals
 which are multiples of 16 bits is quite likely to pass through
 checksum detection.

I'm not sure about this


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Ants Aasma
On Wed, Apr 17, 2013 at 2:26 AM, Florian Pflug f...@phlo.org wrote:
 This raises two question. First, why are there two primes? You could
 just as well using a single prime q and set p=q^64 mod 2^16. You then
 get
  S = sum V[i,j] * q^(64*(64-i) + (64-j))
    = sum V[i,j] * q^(4096 - 64*(i-1) - j)
 You get higher prime powers that way, but you can easily choose a prime
 that yields distinct values mod 2^16 for exponents up to 16383. Your
 PRIME2, for example, does. (It wraps around for 16384, i.e.
 PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
 16384 is the Carmichael function's value at 2^16)

 The experimental detection rate is about the same if we use a single
 prime. But I think you have the analytical form wrong here. It should
 be given q = p:

S = sum V[i,j] * p^(64-i) * p^(64-j)
  = sum V[i,j] * p^(64 - i + 64 - j)
  = sum V[i,j] * p^(128 - i -j)

 Yeah, if you set q = p that's true. My suggestion was p=q^64 though...

So it was, I guess it was too late here and I missed it... All things
considered that is a good suggestion; if for nothing else, the generic
implementation can be smaller this way.

 Second, why does it use addition instead of XOR? It seems that FNV
 usually XORs the terms together instead of adding them?

 Testing showed slightly better detection rate for adds. Intuitively I
 think it's because the carry introduces some additional mixing.

 Hm, but OTOH it makes S linear in V, i.e. if you have two inputs
 V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then
 S' = S*m. The second property is quite undesirable, I think. Assume
 all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit
 positions 0..(k-1). Then, due to linearity, S is also divisible by
 2^k, i.e. also has no ones before the k-th bit. This means, for example
 that if you hash values which all have their lowest bit cleared,
 you get only 2^15 distinct hash values. If they all have the two
 lowest bits cleared, you get only 2^14 distinct values, and so on…

 Generally, linearity doesn't seem to be a property that one wants
 in a hash I think, so my suggestion is to stick to XOR.

This made me remember, the issue I had was with high order bits, not
with low order ones, somehow I got them confused. The exact issue is
that the high order bits don't affect any bit lower than them. It's
easy to see that if you remember the shift and add nature of multiply.
Unfortunately XOR will not fix that. Neither will adding an offset
basis. This is the fundamental thing that is behind the not-so-great
uncorrelated bit error detection rate.

While I understand that linearity is not a desirable property, I
couldn't think of a realistic case where it would hurt. I can see how
it can hurt checksums of variable length values, but for our fixed
buffer case it's definitely not so clear cut. On the pro side the
distributive property that is behind linearity allowed me to do final
aggregation in a tree form, performing the multiplies in parallel
instead of linearly. This adds up to the difference between 250 cycles
(64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5
cycle addw). Given that the main loop is about 576 cycles, this is a
significant difference.

 Here, btw, is a page on FNV hashing. It mentions a few rules for
 picking suitable primes

 http://www.isthe.com/chongo/tech/comp/fnv

 Unfortunately the rules don't apply here because of the hash size.

 Yeah :-(.

 I noticed that their 32-bit prime only has a single one outside
 the first 16 bits. Maybe we can take advantage of that and use a
 32-bit state while still providing decent performance on machines
 without a 32-bit x 32-bit -> 32-bit multiply instruction?

Looking at the Power instruction set, a 32bit mul by the FNV prime
would look like this:

vmulouh tmp1, hash, prime
vmladduhm tmp1, hash, prime16
vslw tmp2, hash, 24
vadduwm hash, tmp1, tmp2

That is 4 instructions to multiply 4 values. Depending on the specific
execution ports on the processor it might faster or slower than the
scalar version but not by a whole lot. Main benefit would be that the
intermediate state could be held in registers.

 If we lived in an Intel-only world, I'd suggest going with a
 32-bit state, since SSE4.1 support is *very* wide-spread already -
 the last CPUs without it came out over 5 years ago, I think.
 (Core2 and later support SSE4.1, and some later Core1 do too)

 But unfortunately things look bleak even for other x86
 implementations - AMD support SSE4.1 only starting with
 Bulldozer, which came out 2011 or so I believe. Leaving the x86
 realm, it seems that only ARM's NEON provides the instructions
 we'd need - AltiVec seems to support only 16-bit multiplies,
 and from what some quick googling brought up, MIPS and SPARC
 SIMD instructions look no better..

 OTOH, chances are that nobody will ever do SIMD implementations
 for those machines. In that case, working in 32-bit chunks instead
 of 

Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2013 at 05:47:55PM +0300, Ants Aasma wrote:
 The SSE4.1 implementation of this would be as fast as the last pat,
 generic version will be faster and we avoid the linearity issue. By
 using different offsets for each of the partial hashes we don't
 directly suffer from commutativity of the final xor folding. By using
 the xor-then-multiply variant the last values hashed have their bits
 mixed before folding together.
 
 Speaking against this option is the fact that we will need to do CPU
 detection at startup to make it fast on the x86 that support SSE4.1,
 and the fact that AMD CPUs before 2011 will run it an order of
 magnitude slower (but still faster than the best CRC).
 
 Any opinions if it would be a reasonable tradeoff to have a better
 checksum with great performance on latest x86 CPUs and good
 performance on other architectures at the expense of having only ok
 performance on older AMD CPUs?
 
 Also, any good suggestions where should we do CPU detection when we go
 this route?

As much as I love the idea of improving the algorithm, it is disturbing
we are discussing this so close to beta, with an algorithm that is under
analysis, with no (runtime) CPU detection, and in something that is
going to be embedded into our data page format.  I can't even think of
another case where we do run-time CPU detection.

I am wondering if we need to tell users that pg_upgrade will not be
possible if you enable page-level checksums, so we are not trapped with
something we want to improve in 9.4.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Florian Pflug
On Apr17, 2013, at 17:09 , Bruce Momjian br...@momjian.us wrote:
 As much as I love the idea of improving the algorithm, it is disturbing
 we are discussing this so close to beta, with an algorithm that is under
 analysis, with no (runtime) CPU detection, and in something that is
 going to be embedded into our data page format.  I can't even think of
 another case where we do run-time CPU detection.

We could still ship the new checksum algorithm with 9.3, but omit the
SSE-optimized version, i.e. include only the plain C implementation.
I think Ants mentioned somewhere that gcc does a pretty good job of
vectorizing that, so people who really care (and who use GCC)
could compile with -msse4.1 -funroll-loops -ftree-vectorize, and get
performance close to that of a hand-coded SSE version.

The important decision we're facing is which algorithm to use. I personally
believe Ants is on the right track there - FNV or a variant thereof
looks like a good choice to me, but the details have yet to be nailed
I think.

However, you're right that time's running out. It'd be a shame though
if we'd lock ourselves into CRC as the only available algorithm essentially
forever.  Is there any way we can change the checksum algorithm in 9.4
*without* breaking pg_upgrade? Maybe pd_pagesize_version could be used
for that - we could make version 5 mean just like version 4, but with
a different checksum algorithm. Since the layout wouldn't actually
change, that'd be far easier to pull off than actually supporting multiple
page layouts. If that works, then shipping 9.3 with CRC is probably
the best solution. If not, we should see to it that something like Ants'
parallel version of FNV (or a smaller variant thereof) gets into 9.3 if
at all possible, IMHO.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Florian Pflug
On Apr17, 2013, at 16:47 , Ants Aasma a...@cybertec.at wrote:
 This made me remember, the issue I had was with high order bits, not
 with low order ones, somehow I got them confused. The exact issue is
 that the high order bits don't affect any bit lower than them. It's
 easy to see that if you remember the shift and add nature of multiply.
 Unfortunately XOR will not fix that. Neither will adding an offset
 basis. This is the fundamental thing that is behind the not-so-great
 uncorrelated bit error detection rate.

Right. We could maybe fix that by extending the update step to

  t = s[j] ^ d[i,j]
  s[j] = (t * PRIME) ^ (t >> 1)

or something like that. Shifting t instead of (t * PRIME) should
help to reduce the performance impact, since a reordering CPU should
be able to parallelize the multiply and the shift. Note though that
I haven't really thought that through extensively - the general idea
should be sound, but whether 1 is a good shifting amount I do not
know.
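
Spelled out as a plain-C loop over one page, the idea looks roughly like
this. This is a sketch only: PRIME, N_SUMS, the per-lane offset basis and
the final folding are placeholders taken from this thread, not the
committed checksum, and nwords is assumed to be a multiple of N_SUMS:

#include <stddef.h>
#include <stdint.h>

#define N_SUMS	32				/* number of independent partial sums */
#define PRIME	16777619u		/* 32-bit FNV prime, as used upthread */

/* FNV-1a-style mix with the extra shift feedback discussed above. */
static inline uint32_t
mix(uint32_t s, uint32_t value)
{
	uint32_t t = s ^ value;

	return (t * PRIME) ^ (t >> 1);
}

static uint32_t
page_checksum(const uint32_t *page, size_t nwords, const uint32_t *basis)
{
	uint32_t sums[N_SUMS];
	uint32_t result = 0;

	for (int j = 0; j < N_SUMS; j++)
		sums[j] = basis[j];		/* a distinct offset basis per lane */

	/*
	 * Consume the page N_SUMS words at a time.  The lanes are independent,
	 * which is what lets a vectorized version hide the multiply latency.
	 */
	for (size_t i = 0; i < nwords; i += N_SUMS)
		for (int j = 0; j < N_SUMS; j++)
			sums[j] = mix(sums[j], page[i + j]);

	/* One extra round of mixing, then fold the lanes together. */
	for (int j = 0; j < N_SUMS; j++)
		result ^= mix(sums[j], 0);

	return result;
}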

 While I understand that linearity is not a desirable property, I
 couldn't think of a realistic case where it would hurt. I can see how
 it can hurt checksums of variable length values, but for our fixed
 buffer case it's definitely not so clear cut. On the pro side the
 distributive property that is behind linearity allowed me to do final
 aggregation in a tree form, performing the multiplies in parallel
 instead of linearly. This adds up to the difference between 250 cycles
 (64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5
 cycle addw). Given that the main loop is about 576 cycles, this is a
 significant difference.

 I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with
 different offset-basis values, would it be enough to just XOR fold the
 resulting values together. The algorithm looking like this:

Hm, this will make the algorithm less resilient to some particular
input permutations (e.g. those which swap the 64*i-th and the (64+1)-ith
words), but those seem very unlikely to occur randomly. But if we're
worried about that, we could use your linear combination method for
the aggregation phase.

 Speaking against this option is the fact that we will need to do CPU
 detection at startup to make it fast on the x86 that support SSE4.1,
 and the fact that AMD CPUs before 2011 will run it an order of
 magnitude slower (but still faster than the best CRC).

Hm, CPU detection isn't that hard, and given the speed at which Intel
currently invents new instructions we'll end up going that route sooner
or later anyway, I think. 

 Any opinions if it would be a reasonable tradeoff to have a better
 checksum with great performance on latest x86 CPUs and good
 performance on other architectures at the expense of having only ok
 performance on older AMD CPUs?

The loss on AMD is offset by the increased performance on machines
where we can't vectorize, I'd say.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Greg Stark
On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug f...@phlo.org wrote:
 Is there any way we can change the checksum algorithm in 9.4
 *without* breaking pg_upgrade?

Personally I think we're going to need a solution for page format
changes someday eventually

What advantages are we postponing now to avoid it?

* 32-bit checksums?
* Being able to enable/disable checksums?

Anything else?


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2013 at 05:28:06PM +0200, Florian Pflug wrote:
 However, you're right that time's running out. It'd be a shame though
 if we'd lock ourselves into CRC as the only available algorithm essentially
 forever.  Is there any way we can change the checksum algorithm in 9.4
 *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used
 for that - we could make version 5 mean just like version 4, but with
 a different checksum algorithm. Since the layout wouldn't actually
 chance, that'd be far easier to pull off than actually supporting multiple
 page layouts. If that works, then shipping 9.3 with CRC is probably
 the best solution. If not, we should see to it that something like Ants
 parallel version of FNV or a smallget into 9.3 if at all possible,
 IMHO.

I was going to ask about the flexibility of pg_upgrade and checksums. 
Right now you have to match the old and new cluster checksum modes, but
it seems it would be possible to allow pg_upgrade to go from checksum to
no-checksum servers.  Does the backend look at the pg_controldata setting,
or at the page checksum flag?  If the former, it seems pg_upgrade could
run a no-checksum server just fine that had checksum information on
its pages.  This might give us more flexibility in changing the checksum
algorithm in the future, i.e. you only lose checksum ability.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Florian Pflug
On Apr17, 2013, at 18:15 , Bruce Momjian br...@momjian.us wrote:
 On Wed, Apr 17, 2013 at 05:28:06PM +0200, Florian Pflug wrote:
 However, you're right that time's running out. It'd be a shame though
 if we'd lock ourselves into CRC as the only available algorithm essentially
 forever.  Is there any way we can change the checksum algorithm in 9.4
 *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used
 for that - we could make version 5 mean just like version 4, but with
 a different checksum algorithm. Since the layout wouldn't actually
 chance, that'd be far easier to pull off than actually supporting multiple
 page layouts. If that works, then shipping 9.3 with CRC is probably
 the best solution. If not, we should see to it that something like Ants
 parallel version of FNV or a smallget into 9.3 if at all possible,
 IMHO.
 
 I was going to ask about the flexibility of pg_upgrade and checksums. 
 Right now you have to match the old and new cluster checksum modes, but
 it seems it would be possible to allow pg_upgrade use from checksum to
 no-checksum servers.  Does the backend look at the pg_controldata setting,
 or at the page checksum flag?  If the former, it seems pg_upgrade could
 run a a no-checksum server just fine that had checksum information on
 its pages.  This might give us more flexibility in changing the checksum
 algorithm in the future, i.e. you only lose checksum ability.

AFAIK, there's currently no per-page checksum flag. Still, being only
able to go from checksummed to not-checksummed probably is for all
practical purposes the same as not being able to pg_upgrade at all.
Otherwise, why would people have enabled checksums in the first place?

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2013 at 06:33:58PM +0200, Florian Pflug wrote:
  I was going to ask about the flexibility of pg_upgrade and checksums. 
  Right now you have to match the old and new cluster checksum modes, but
  it seems it would be possible to allow pg_upgrade use from checksum to
  no-checksum servers.  Does the backend look at the pg_controldata setting,
  or at the page checksum flag?  If the former, it seems pg_upgrade could
  run a a no-checksum server just fine that had checksum information on
  its pages.  This might give us more flexibility in changing the checksum
  algorithm in the future, i.e. you only lose checksum ability.
 
 AFAIK, there's currently no per-page checksum flag. Still, being only
 able to go from checksummed to not-checksummed probably is for all
 practical purposes the same as not being able to pg_upgrade at all.
 Otherwise, why would people have enabled checksums in the first place?

Good point, but it is _an_ option, at least.

I would like to know the answer of how an upgrade from checksum to
no-checksum would behave so I can modify pg_upgrade to allow it.


-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug f...@phlo.org wrote:
 Is there any way we can change the checksum algorithm in 9.4
 *without* breaking pg_upgrade?

 Personally I think we're going to need a solution for page format
 changes someday eventually

 What advantages are we postponing now to avoid it?

Um, other than the ability to make a release?

We aren't going to hold up 9.3 until that particular bit of pie in the
sky lands.  Indeed I don't expect to see it available in the next couple
years either.  When we were looking at that seriously, two or three
years ago, arbitrary page format changes looked *hard*.

The idea of bumping the page format version number to signal a checksum
algorithm change might work though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2013 at 01:22:01PM -0400, Tom Lane wrote:
 Greg Stark st...@mit.edu writes:
  On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug f...@phlo.org wrote:
  Is there any way we can change the checksum algorithm in 9.4
  *without* breaking pg_upgrade?
 
  Personally I think we're going to need a solution for page format
  changes someday eventually
 
  What advantages are we postponing now to avoid it?
 
 Um, other than the ability to make a release?
 
 We aren't going to hold up 9.3 until that particular bit of pie in the
 sky lands.  Indeed I don't expect to see it available in the next couple
 years either.  When we were looking at that seriously, two or three
 years ago, arbitrary page format changes looked *hard*.
 
 The idea of bumping the page format version number to signal a checksum
 algorithm change might work though.

Uh, not sure how pg_upgrade would detect that as the version number is
not stored in pg_controldata, e.g.:

Data page checksums:  enabled/disabled

Do we need to address this for 9.3?  (Yuck)

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Uh, not sure how pg_upgrade would detect that as the version number is
 not stored in pg_controldata, e.g.:

   Data page checksums:  enabled/disabled

That seems pretty shortsighted.  The field probably ought to be defined
as containing a checksum algorithm ID number, not a boolean.

But having said that, I'm not sure why this would be pg_upgrade's
problem.  By definition, we do not want pg_upgrade running around
looking at individual data pages.  Therefore, whatever we might do
about checksum algorithm changes would have to be something that can be
managed on-the-fly by the newer server.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2013 at 01:29:18PM -0400, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Uh, not sure how pg_upgrade would detect that as the version number is
  not stored in pg_controldata, e.g.:
 
  Data page checksums:  enabled/disabled
 
 That seems pretty shortsighted.  The field probably ought to be defined
 as containing a checksum algorithm ID number, not a boolean.
 
 But having said that, I'm not sure why this would be pg_upgrade's
 problem.  By definition, we do not want pg_upgrade running around
 looking at individual data pages.  Therefore, whatever we might do
 about checksum algorithm changes would have to be something that can be
 managed on-the-fly by the newer server.

Well, my idea was that pg_upgrade would allow upgrades from old clusters
with the same checksum algorithm version, but not non-matching ones. 
This would allow the checksum algorithm to be changed and force
pg_upgrade to fail.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 On Wed, Apr 17, 2013 at 01:29:18PM -0400, Tom Lane wrote:
 But having said that, I'm not sure why this would be pg_upgrade's
 problem.  By definition, we do not want pg_upgrade running around
 looking at individual data pages.  Therefore, whatever we might do
 about checksum algorithm changes would have to be something that can be
 managed on-the-fly by the newer server.

 Well, my idea was that pg_upgrade would allow upgrades from old clusters
 with the same checksum algorithm version, but not non-matching ones. 
 This would allow the checksum algorithm to be changed and force
 pg_upgrade to fail.

It's rather premature to be defining pg_upgrade's behavior for a
situation that doesn't exist yet, and may very well never exist
in that form.  It seems more likely to me that we'd want to allow
incremental algorithm changes, in which case pg_upgrade ought not do
anything about this case anyway.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Jeff Davis
On Wed, 2013-04-17 at 12:42 -0400, Bruce Momjian wrote:
  AFAIK, there's currently no per-page checksum flag. Still, being only
  able to go from checksummed to not-checksummed probably is for all
  practical purposes the same as not being able to pg_upgrade at all.
  Otherwise, why would people have enabled checksums in the first place?
 
 Good point, but it is _an_ option, at least.
 
 I would like to know the answer of how an upgrade from checksum to
 no-checksum would behave so I can modify pg_upgrade to allow it.

Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if
someone has checksums enabled and wants to disable it, why is pg_upgrade
the right time to do that? Wouldn't it make more sense to allow them to
do that at any time?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Jeff Davis
On Wed, 2013-04-17 at 16:58 +0100, Greg Stark wrote:
 On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug f...@phlo.org wrote:
  Is there any way we can change the checksum algorithm in 9.4
  *without* breaking pg_upgrade?
 
 Personally I think we're going to need a solution for page format
 changes someday eventually
 
 What advantages are we postponing now to avoid it?
 
 * 32-bit checksums?
 * Being able to enable/disable checksums?
 
 Anything else?

I'm not sure that changing the page format is the most difficult part of
enabling/disabling checksums. It's easy enough to have page header bits
if the current information is not enough (and those bits were there, but
Heikki requested their removal and I couldn't think of a concrete reason
to keep them).

Eventually, it would be nice to be able to break the page format and
have more space for things like checksums (and probably a few other
things, maybe some visibility-related optimizations). But that's a few
years off and we don't have any real plan for that.

What I wanted to accomplish with this patch is the simplest checksum
mechanism that we could get that would be fast enough that many people
would be able to use it. I expect it to be useful until we do decide to
break the page format.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2013 at 01:59:12PM -0700, Jeff Davis wrote:
 On Wed, 2013-04-17 at 12:42 -0400, Bruce Momjian wrote:
   AFAIK, there's currently no per-page checksum flag. Still, being only
   able to go from checksummed to not-checksummed probably is for all
   practical purposes the same as not being able to pg_upgrade at all.
   Otherwise, why would people have enabled checksums in the first place?
  
  Good point, but it is _an_ option, at least.
  
  I would like to know the answer of how an upgrade from checksum to
  no-checksum would behave so I can modify pg_upgrade to allow it.
 
 Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if
 someone has checksums enabled and wants to disable it, why is pg_upgrade
 the right time to do that? Wouldn't it make more sense to allow them to
 do that at any time?

Well, right now, pg_upgrade is the only way you could potentially turn
off checksums.  You are right that we might eventually want a command,
but my point is that we currently have a limitation in pg_upgrade that
might not be necessary.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Ants Aasma
On Wed, Apr 17, 2013 at 6:54 PM, Florian Pflug f...@phlo.org wrote:
 On Apr17, 2013, at 16:47 , Ants Aasma a...@cybertec.at wrote:
 This made me remember, the issue I had was with high order bits, not
 with low order ones, somehow I got them confused. The exact issue is
 that the high order bits don't affect any bit lower than them. It's
 easy to see that if you remember the shift and add nature of multiply.
 Unfortunately XOR will not fix that. Neither will adding an offset
 basis. This is the fundamental thing that is behind the not-so-great
 uncorrelated bit error detection rate.

 Right. We could maybe fix that by extending the update step to

    t = s[j] ^ d[i,j]
    s[j] = (t * PRIME) ^ (t >> 1)

 or something like that. Shifting t instead of (t * PRIME) should
 help to reduce the performance impact, since a reordering CPU should
 be able to parallelize the multiple and the shift. Note though that
 I haven't really though that through extensively - the general idea
 should be sound, but whether 1 is a good shifting amount I do not
 know.

I was thinking about something similar too. The big issue here is that
the parallel checksums already hide each other's latencies, effectively
executing one each of movdqu/pmullw/paddw per cycle; that's why N_SUMS
adds up to 128 bytes, not 16 bytes.

I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a
+ srl1-xor variants and ran performance tests and detection rate tests
on both.

Performance results:
Mul-add checksums: 12.9 bytes/cycle
FNV-1a checksums: 13.5 bytes/cycle
FNV-1a + srl-1: 7.4 bytes/cycle

Detection rates:
False positive rates:
                  Add-mul        FNV-1a     FNV-1a + srl-1
Single bit flip:  1:inf          1:129590   1:64795
Double bit flip:  1:148          1:511      1:53083
Triple bit flip:  1:673          1:5060     1:61511
  Quad bit flip:  1:1872         1:19349    1:68320
Write 0x00 byte:  1:774538137    1:118776   1:68952
Write 0xFF byte:  1:165399500    1:137489   1:68958
  Partial write:  1:59949        1:71939    1:89923
  Write garbage:  1:64866        1:64980    1:67732
Write run of 00:  1:57077        1:61140    1:59723
Write run of FF:  1:63085        1:59609    1:62977

Test descriptions:
N bit flip: picks N random non-overlapping bits and flips their value.
Write X byte: overwrites a single byte with X.
Partial write: picks a random cut point, overwrites everything from
there to end with 0x00.
Write garbage/run of X: picks two random cut points and fills
everything in between with random values/X bytes.
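
For context, a rough sketch of the kind of loop behind numbers like these,
shown only for the overwritten-byte case; the checksum here is a stand-in,
and far more trials would be needed for stable ratios:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 8192

/* Stand-in 16-bit checksum; swap in the algorithm under test. */
static uint16_t
checksum_page(const unsigned char *page)
{
	uint32_t a = 0, b = 0;

	for (int i = 0; i < PAGE_SIZE; i++)
	{
		a = (a + page[i]) % 251;
		b = (b + a) % 251;
	}
	return (uint16_t) ((b << 8) | a);
}

int
main(void)
{
	static unsigned char page[PAGE_SIZE];
	long trials = 100000, counted = 0, undetected = 0;

	for (long t = 0; t < trials; t++)
	{
		for (int i = 0; i < PAGE_SIZE; i++)
			page[i] = (unsigned char) rand();

		uint16_t before = checksum_page(page);
		int pos = rand() % PAGE_SIZE;

		if (page[pos] == 0x00)
			continue;			/* overwriting with 0x00 would change nothing */
		page[pos] = 0x00;		/* the "Write 0x00 byte" corruption */
		counted++;

		if (checksum_page(page) == before)
			undetected++;		/* corruption passed the checksum */
	}
	printf("undetected: %ld of %ld corrupted pages\n", undetected, counted);
	return 0;
}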

So adding in the shifted value nearly cuts the performance in half. I
think that by playing with the instruction order I might coax the CPU
scheduler to schedule the instructions better, but even in best case
it will be somewhat slower. The point to keep in mind is that even this
slower speed is still faster than hardware accelerated CRC32, so all
in all the hit might not be so bad. The effect on false positive rates
for double bit errors is particularly impressive. I'm now running a
testrun that shifts right by 13 to see how that works out; intuitively
it should help disperse the bits a lot faster.

 I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with
 different offset-basis values, would it be enough to just XOR fold the
 resulting values together. The algorithm looking like this:

 Hm, this will make the algorithm less resilient to some particular
 input permutations (e.g. those which swap the 64*i-th and the (64+1)-ith
 words), but those seem very unlikely to occur randomly. But if we're
 worried about that, we could use your linear combination method for
 the aggregation phase.

I don't think it significantly reduces resilience to permutations
thanks to using different basis offsets and multiply not distributing
over xor.

 Speaking against this option is the fact that we will need to do CPU
 detection at startup to make it fast on the x86 that support SSE4.1,
 and the fact that AMD CPUs before 2011 will run it an order of
 magnitude slower (but still faster than the best CRC).

 Hm, CPU detection isn't that hard, and given the speed at which Intel
 currently invents new instructions we'll end up going that route sooner
 or later anyway, I think.

Sure it's not that hard but it does have an order of magnitude more
design decisions than #if defined(__x86_64__). Maybe a first stab
could avoid a generic infrastructure and just have the checksum
function as a function pointer, with the default trampoline
implementation running a cpuid and overwriting the function pointer
with either the optimized or generic versions and then calling it.
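
A minimal sketch of that trampoline idea, assuming GCC's
__builtin_cpu_supports() and a placeholder SSE4.1 variant:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn) (const void *data, size_t len);

/* Portable fallback; stands in for the plain-C checksum. */
static uint32_t
checksum_generic(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t sum = 0;

	while (len--)
		sum = sum * 16777619u + *p++;
	return sum;
}

/* Placeholder: a real build would provide the vectorized kernel here. */
static uint32_t
checksum_sse41(const void *data, size_t len)
{
	return checksum_generic(data, len);
}

static uint32_t checksum_trampoline(const void *data, size_t len);

/* Every caller goes through this pointer; it starts at the trampoline. */
static checksum_fn checksum_impl = checksum_trampoline;

/*
 * Runs once: probe the CPU, overwrite the function pointer with the best
 * implementation, and forward the first call to it.
 */
static uint32_t
checksum_trampoline(const void *data, size_t len)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	checksum_impl = __builtin_cpu_supports("sse4.1") ?
		checksum_sse41 : checksum_generic;
#else
	checksum_impl = checksum_generic;
#endif
	return checksum_impl(data, len);
}

uint32_t
page_checksum(const void *data, size_t len)
{
	return checksum_impl(data, len);
}

After the first call the indirect branch always goes straight to the chosen
implementation, so the detection cost is paid only once per process.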

 Any opinions if it would be a reasonable tradeoff to have a better
 checksum with great performance on latest x86 CPUs and good
 performance on other architectures at the expense of having only ok
 performance on older AMD CPUs?

 The loss on AMD is offset by the increased performance on machines
 where we can't vectorize, I'd say.

+1 Old AMD machines won't soon be used by anyone caring 

Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 I was thinking about something similar too. The big issue here is that
 the parallel checksums already hide each other latencies effectively
 executing one each of movdqu/pmullw/paddw each cycle, that's why the
 N_SUMS adds up to 128 bytes not 16 bytes.

The more I read of this thread, the more unhappy I get.  It appears that
the entire design process is being driven by micro-optimization for CPUs
being built by Intel in 2013.  That ought to be, at best, a fifth-order
consideration, with full recognition that it'll be obsolete in two years,
and is already irrelevant to anyone not running one of those CPUs.

I would like to ban all discussion of assembly-language optimizations
until after 9.3 is out, so that we can concentrate on what actually
matters.  Which IMO is mostly the error detection rate and the probable
nature of false successes.  I'm glad to see that you're paying at least
some attention to that, but the priorities in this discussion are
completely backwards.

And I reiterate that there is theory out there about the error detection
capabilities of CRCs.  I'm not seeing any theory here, which leaves me
with very little confidence that we know what we're doing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Florian Pflug
On Apr18, 2013, at 00:32 , Tom Lane t...@sss.pgh.pa.us wrote:
 Ants Aasma a...@cybertec.at writes:
 I was thinking about something similar too. The big issue here is that
 the parallel checksums already hide each other latencies effectively
 executing one each of movdqu/pmullw/paddw each cycle, that's why the
 N_SUMS adds up to 128 bytes not 16 bytes.
 
 The more I read of this thread, the more unhappy I get.  It appears that
 the entire design process is being driven by micro-optimization for CPUs
 being built by Intel in 2013.  That ought to be, at best, a fifth-order
 consideration, with full recognition that it'll be obsolete in two years,
 and is already irrelevant to anyone not running one of those CPUs.

Micro-optimization for particular CPUs yes, but general performance
considerations, no. For example, 2^n is probably one of the worst moduli
you can pick for a hash function - any prime would work much better.
But doing the computations modulo 2^16 or 2^32 carries zero performance
overhead, whereas picking another modulus requires some renormalization
after every operation. That, however, is *not* a given - it stems from
the fact that nearly all CPUs in existence operate on binary integers. This
fact must thus enter into the design phase very early, and makes
2^16 or 2^32 a sensible choice for a modulus *despite* it's shortcomings,
simply because it allows for fast implementations.

 I would like to ban all discussion of assembly-language optimizations
 until after 9.3 is out, so that we can concentrate on what actually
 matters.  Which IMO is mostly the error detection rate and the probable
 nature of false successes.  I'm glad to see that you're paying at least
 some attention to that, but the priorities in this discussion are
 completely backwards.

I'd say lots of attention is paid to that, but there's *also* attention
paid to speed. Which is good, because ideally we want to end up with
a checksum that has both good error-detection properties *and* good
performance. If performance is of no concern to us, then there's little
reason not to use CRC…

 And I reiterate that there is theory out there about the error detection
 capabilities of CRCs.  I'm not seeing any theory here, which leaves me
 with very little confidence that we know what we're doing.

If you've got any pointers to literature on error-detection capabilities
of CPU-friendly checksum functions, please share. I am aware of the vast
literature on CRC, and also on some other algebraic approaches, but
none of those even come close to the speed of FNV+shift (unless there's
a special CRC instruction, that is). And there's also a ton of stuff on
cryptographic hashing, but those are optimized for a completely different
use-case...

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Florian Pflug
On Apr17, 2013, at 23:44 , Ants Aasma a...@cybertec.at wrote:
 Performance results:
 Mul-add checksums: 12.9 bytes/s
 FNV-1a checksums: 13.5 bytes/s
 FNV-1a + srl-1: 7.4 bytes/s
 
 Detection rates:
 False positive rates:
                   Add-mul        FNV-1a     FNV-1a + srl-1
 Single bit flip:  1:inf          1:129590   1:64795
 Double bit flip:  1:148          1:511      1:53083
 Triple bit flip:  1:673          1:5060     1:61511
   Quad bit flip:  1:1872         1:19349    1:68320
 Write 0x00 byte:  1:774538137    1:118776   1:68952
 Write 0xFF byte:  1:165399500    1:137489   1:68958
   Partial write:  1:59949        1:71939    1:89923
   Write garbage:  1:64866        1:64980    1:67732
 Write run of 00:  1:57077        1:61140    1:59723
 Write run of FF:  1:63085        1:59609    1:62977
 
 Test descriptions:
 N bit flip: picks N random non-overlapping bits and flips their value.
 Write X byte: overwrites a single byte with X.
 Partial write: picks a random cut point, overwrites everything from
 there to end with 0x00.
 Write garbage/run of X: picks two random cut points and fills
 everything in between with random values/X bytes.

Cool, thanks for testing that! The results for FNV-1a + srl-1 look
promising, I think. Its failure rate is consistently about 1:2^16,
which is the value you'd expect. That gives me some confidence that
the additional shift is working as expected.

BTW, which prime are you using for FNV-1a and FNV-1a+srl1?

 So adding in the shifted value nearly cuts the performance in half. I
 think that by playing with the instruction order I might coax the CPU
 scheduler to schedule the instructions better, but even in best case
 it will be somewhat slower. The point to keep in mind that even this
 slower speed is still faster than hardware accelerated CRC32, so all
 in all the hit might not be so bad.

Yeah. ~7 bytes/cycle still translates to over 10GB/s on a typical CPU,
so that's still plenty fast I'd say...

 The effect on false positive rates
 for double bit errors is particularly impressive. I'm now running a
 testrun that shift right by 13 to see how that works out, intuitively
 it should help dispersing the bits a lot faster.

Maybe, but it also means that *only* bits 14 and 15 actually affect bits
below them, because all others are shifted out. If you choose the
right prime it may still work; you'd have to pick one with
enough lower bits set so that every bit affects bit 14 or 15 at some
point…

All in all a small shift seems better to me - if 1 for some reason
isn't a good choice, I'd expect 3 or so to be a suitable
replacement, but nothing much larger…

I should have some time tomorrow to spend on this, and will try
to validate our FNV-1a modification, and see if I find a way to judge
whether 1 is a good shift.

 I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with
 different offset-basis values, would it be enough to just XOR fold the
 resulting values together. The algorithm looking like this:
 
 Hm, this will make the algorithm less resilient to some particular
 input permutations (e.g. those which swap the 64*i-th and the (64+1)-ith
 words), but those seem very unlikely to occur randomly. But if we're
 worried about that, we could use your linear combination method for
 the aggregation phase.
 
 I don't think it significantly reduces resilience to permutations
 thanks to using different basis offsets and multiply not distributing
 over xor.

Oh, yeah, I though you were still using 0 as base offset. If you don't,
the objection is moot.

best regards,
Florian Pflug




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Ants Aasma
On Thu, Apr 18, 2013 at 1:32 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ants Aasma a...@cybertec.at writes:
 I was thinking about something similar too. The big issue here is that
 the parallel checksums already hide each other latencies effectively
 executing one each of movdqu/pmullw/paddw each cycle, that's why the
 N_SUMS adds up to 128 bytes not 16 bytes.

 The more I read of this thread, the more unhappy I get.  It appears that
 the entire design process is being driven by micro-optimization for CPUs
 being built by Intel in 2013.  That ought to be, at best, a fifth-order
 consideration, with full recognition that it'll be obsolete in two years,
 and is already irrelevant to anyone not running one of those CPUs.

The large scale structure takes into account the trends in computer
architecture. A lot more so than using anything straight out of the
literature. Specifically, computer architectures have hit a wall in
terms of sequential throughput, so the linear dependency chain in the
checksum algorithm will be the bottleneck soon if it isn't already.
From that it follows that a fast and future-proof algorithm should not
calculate the checksum in a single long chain. The proposed algorithms
divide the input into 64x64 and 32x64 chunks. It's easy to show that
both convert the dependency chain from O(n) to O(sqrt(n)). Secondly,
unless we pick something really popular, CPUs are unlikely to provide
instructions specifically for us, so the algorithm should be built from general
purpose computational pieces. Vector integer multiply and xor are
pretty much guaranteed to be there and fast on future CPUs. In my view
it's much more probable to be available and fast on future CPU's than
something like the Intel CRC32 acceleration.

 I would like to ban all discussion of assembly-language optimizations
 until after 9.3 is out, so that we can concentrate on what actually
 matters.  Which IMO is mostly the error detection rate and the probable
 nature of false successes.  I'm glad to see that you're paying at least
 some attention to that, but the priorities in this discussion are
 completely backwards.

I approached it from the angle of what needs to be done for a
fundamentally fast approach to have a good enough error detection rate,
without any common corruption pattern that would likely escape
detection. The algorithms are simple enough and well studied enough
that the rewards from tweaking them are negligible. I think the
resulting performance speaks for itself. Now the question is what is a
good enough algorithm. In my view, the checksum is more like a canary
in the coal mine, not something that can be relied upon, and so
ultimate efficiency is not that important if there are no obvious
horrible cases. I can see that there are other views and so am
exploring different tradeoffs between performance and quality.

 And I reiterate that there is theory out there about the error detection
 capabilities of CRCs.  I'm not seeing any theory here, which leaves me
 with very little confidence that we know what we're doing.

I haven't found much literature that is of use here. There is theory
underlying this, coming from basic number theory and distilled into
rules for hash functions. For the FNV hash the prime supposedly is
carefully chosen, although all the literature I have found so far just
says that it is a good choice without explaining why.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Ants Aasma
On Thu, Apr 18, 2013 at 2:25 AM, Florian Pflug f...@phlo.org wrote:
 On Apr17, 2013, at 23:44 , Ants Aasma a...@cybertec.at wrote:
 Performance results:
 Mul-add checksums: 12.9 bytes/s
 FNV-1a checksums: 13.5 bytes/s
 FNV-1a + srl-1: 7.4 bytes/s

 Detection rates:
 False positive rates:
                   Add-mul        FNV-1a         FNV-1a + srl-1
 Single bit flip:  1:inf          1:129590       1:64795
 Double bit flip:  1:148          1:511          1:53083
 Triple bit flip:  1:673          1:5060         1:61511
   Quad bit flip:  1:1872         1:19349        1:68320
 Write 0x00 byte:  1:774538137    1:118776       1:68952
 Write 0xFF byte:  1:165399500    1:137489       1:68958
   Partial write:  1:59949        1:71939        1:89923
   Write garbage:  1:64866        1:64980        1:67732
 Write run of 00:  1:57077        1:61140        1:59723
 Write run of FF:  1:63085        1:59609        1:62977

 Test descriptions:
 N bit flip: picks N random non-overlapping bits and flips their value.
 Write X byte: overwrites a single byte with X.
 Partial write: picks a random cut point, overwrites everything from
 there to end with 0x00.
 Write garbage/run of X: picks two random cut points and fills
 everything in between with random values/X bytes.

 Cool, thanks for testing that! The results for FNV-1a + srl-1 look
 promising, I think. Its failure rate is consistently about 1:2^16,
 which is the value you'd expect. That gives me some confidence that
 the additional shift as working as expected.

 BTW, which prime are you using for FNV-1a and FNV-1a+srl1?

The official 32bit FNV one, 16777619.

Offsets were just random numbers. Seems good enough given the
following from the FNV page:

These non-zero integers are the FNV-0 hashes of the following 32 octets:

chongo Landon Curt Noll /\../\
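
For anyone following along, here is a minimal scalar sketch of the lane
structure being tested: several independent 32-bit FNV-1a lanes
(h = (h^v)*p), each seeded with its own offset basis, XOR-folded down to
16 bits at the end. The function name, the lane count, the strided input
layout and the folding step are illustrative assumptions on my part; the
real patch runs far more lanes to suit SIMD and adds the extra shift step
we are discussing, which is left out here.

#include <stdint.h>
#include <stddef.h>

#define FNV_PRIME 16777619u
#define N_LANES   4        /* illustrative only; the patch uses many more lanes */

static uint16_t
fnv1a_lanes_fold(const uint32_t *words, size_t nwords,
                 const uint32_t offset_basis[N_LANES])
{
    uint32_t h[N_LANES];
    uint32_t folded = 0;

    /* each lane starts from its own offset basis */
    for (int l = 0; l < N_LANES; l++)
        h[l] = offset_basis[l];

    /* lane l consumes every N_LANES-th 32-bit word of the page */
    for (size_t i = 0; i < nwords; i++)
        h[i % N_LANES] = (h[i % N_LANES] ^ words[i]) * FNV_PRIME;

    /* combine the lanes and fold 32 bits down to the 16-bit checksum */
    for (int l = 0; l < N_LANES; l++)
        folded ^= h[l];
    return (uint16_t) (folded ^ (folded >> 16));
}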

 The effect on false positive rates
 for double bit errors is particularly impressive. I'm now running a
 testrun that shift right by 13 to see how that works out, intuitively
 it should help dispersing the bits a lot faster.

Empirical results are slightly better with shift of 13:

Single bit flip: 1:61615
Double bit flip: 1:58078
Triple bit flip: 1:66329
  Quad bit flip: 1:62141
Write 0x00 byte: 1:66327
Write 0xFF byte: 1:65274
  Partial write: 1:71939
  Write garbage: 1:65095
 Write run of 0: 1:62845
Write run of FF: 1:64638

 Maybe, but it also makes *only* bits 14 and 15 actually affects bits
 below them, because all other's are shifted out. If you choose the
 right prime it may still work, you'd have to pick one which with
 enough lower bits set so that every bits affects bit 14 or 15 at some
 point…

 All in all a small shift seems better to me - if 1 for some reason
 isn't a good choice, I'd expect 3 or so to be a suitable
 replacement, but nothing much larger…

I don't think the big shift is a problem, the other bits were taken
into account by the multiply, and with the larger shift the next
multiplication will disperse the changes once again. Nevertheless, I'm
running the tests with shift of 3 now.

 I should have some time tomorrow to spent on this, and will try
 to validate our FNV-1a modification, and see if I find a way to judge
 whether 1 is a good shift.

Great. I will spend some brain cycles on it too.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Greg Smith

On 4/17/13 6:32 PM, Tom Lane wrote:

The more I read of this thread, the more unhappy I get.  It appears that
the entire design process is being driven by micro-optimization for CPUs
being built by Intel in 2013.


And that's not going to get anyone past review, since all the tests I've 
been doing the last two weeks are on how fast an AMD Opteron 6234 with 
OS cache > shared_buffers can run this.  The main thing I'm still 
worried about is what happens when you have a fast machine that can move 
memory around very quickly and an in-memory workload, but it's hamstrung 
by the checksum computation--and it's not a 2013 Intel machine.


The question I started with here was answered to some depth and then 
skipped past.  I'd like to jerk attention back to that, since I thought 
some good answers from Ants went by.  Is there a simple way to optimize 
the committed CRC computation (or a similar one with the same error 
detection properties) based on either:


a) Knowing that the input will be an 8K page, rather than the existing 
use case with an arbitrarily sized WAL section.


b) Straightforward code rearrangement or optimization flags.

That was all I thought was still feasible to consider changing for 9.3 a 
few weeks ago.  And the possible scope has only been shrinking since then.



And I reiterate that there is theory out there about the error detection
capabilities of CRCs.  I'm not seeing any theory here, which leaves me
with very little confidence that we know what we're doing.


Let me see if I can summarize where the messages flying by are at since 
you'd like to close this topic for now:


-Original checksum feature used Fletcher checksums.  Its main problems, 
to quote wikipedia, include that it cannot distinguish between blocks 
of all 0 bits and blocks of all 1 bits.


-Committed checksum feature uses truncated CRC-32.  This has known good 
error detection properties, but is expensive to compute.  There's reason 
to believe that particular computation will become cheaper on future 
platforms though.  But taking full advantage of that will require adding 
CPU-specific code to the database.


-The latest idea is using the Fowler–Noll–Vo hash function: 
https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash  There's 20 years of 
research around when that is good or bad.  The exact properties depend 
on magic FNV primes:  http://isthe.com/chongo/tech/comp/fnv/#fnv-prime 
that can vary based on both your target block size and how many bytes 
you'll process at a time.  For PostgreSQL checksums, one of the common 
problems--getting an even distribution of the hashed values--isn't 
important the way it is for other types of hashes.  Ants and Florian 
have now dug into how exactly that and specific CPU optimization 
concerns impact the best approach for 8K database pages.  This is very 
clearly a 9.4 project that is just getting started.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Ants Aasma
On Thu, Apr 18, 2013 at 3:21 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 4/17/13 6:32 PM, Tom Lane wrote:

 The more I read of this thread, the more unhappy I get.  It appears that
 the entire design process is being driven by micro-optimization for CPUs
 being built by Intel in 2013.


 And that's not going to get anyone past review, since all the tests I've
 been doing the last two weeks are on how fast an AMD Opteron 6234 with OS
 cache > shared_buffers can run this.  The main thing I'm still worried
 about is what happens when you have a fast machine that can move memory
 around very quickly and an in-memory workload, but it's hamstrung by the
 checksum computation--and it's not a 2013 Intel machine.

 The question I started with here was answered to some depth and then skipped
 past.  I'd like to jerk attention back to that, since I thought some good
 answers from Ants went by.  Is there a simple way to optimize the committed
 CRC computation (or a similar one with the same error detection properties)
 based on either:

 a) Knowing that the input will be a 8K page, rather than the existing use
 case with an arbitrary sized WAL section.

 b) Straightforward code rearrangement or optimization flags.

 That was all I thought was still feasible to consider changing for 9.3 a few
 weeks ago.  And the possible scope has only been shrinking since then.

Nothing from the two points, but the CRC calculation algorithm can be
switched out for a slice-by-4 or slice-by-8 variant. The speedup was
around a factor of 4 if I remember correctly.
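
For reference, slicing-by-4 is the standard restructuring of the
byte-at-a-time CRC loop to consume four bytes per step through four
precomputed tables. A generic sketch follows; the function names are
mine, and it uses the standard reflected CRC-32 polynomial and
init/final values, which may differ from the exact variant PostgreSQL
ships, so treat it as an illustration of the technique only.

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[4][256];

/* build the four lookup tables from the usual byte-at-a-time table;
 * must be called once before crc32_sliceby4() */
static void
crc32_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;

        for (int bit = 0; bit < 8; bit++)
            c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
        crc_table[0][i] = c;
    }
    for (uint32_t i = 0; i < 256; i++)
        for (int k = 1; k < 4; k++)
            crc_table[k][i] = (crc_table[k - 1][i] >> 8) ^
                crc_table[0][crc_table[k - 1][i] & 0xFF];
}

/* process four bytes per iteration instead of one */
static uint32_t
crc32_sliceby4(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (len >= 4)
    {
        crc ^= (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
            ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
        crc = crc_table[3][crc & 0xFF] ^
            crc_table[2][(crc >> 8) & 0xFF] ^
            crc_table[1][(crc >> 16) & 0xFF] ^
            crc_table[0][(crc >> 24) & 0xFF];
        p += 4;
        len -= 4;
    }
    while (len--)               /* leftover bytes, one at a time */
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];

    return crc ^ 0xFFFFFFFFu;
}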

 And I reiterate that there is theory out there about the error detection
 capabilities of CRCs.  I'm not seeing any theory here, which leaves me
 with very little confidence that we know what we're doing.


 Let me see if I can summarize where the messages flying by are at since
 you'd like to close this topic for now:

 -Original checksum feature used Fletcher checksums.  Its main problems, to
 quote wikipedia, include that it cannot distinguish between blocks of all 0
 bits and blocks of all 1 bits.

That was only the most glaring problem.

 -Committed checksum feature uses truncated CRC-32.  This has known good
 error detection properties, but is expensive to compute.  There's reason to
 believe that particular computation will become cheaper on future platforms
 though.  But taking full advantage of that will require adding CPU-specific
 code to the database.

Actually the state is that with the polynomial used there is currently
close to zero hope of CPUs optimizing for us. By switching the
polynomial we can have hardware acceleration on Intel CPUs, with little
hope of others adding support, given that AMD hasn't by now and Intel
touts patents in this area. However the calculation can be made about a
factor of 4 faster by restructuring it. This optimization is plain C
and not CPU specific.

The committed checksum is an order of magnitude slower than the
Fletcher one that was performance tested with the patch.

 -The latest idea is using the Fowler–Noll–Vo hash function:
 https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash  There's 20 years of
 research around when that is good or bad.  The exactly properties depend on
 magic FNV primes:  http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that
 can vary based on both your target block size and how many bytes you'll
 process at a time.  For PostgreSQL checksums, one of the common
 problems--getting an even distribution of the hashed values--isn't important
 the way it is for other types of hashes.  Ants and Florian have now dug into
 how exactly that and specific CPU optimization concerns impact the best
 approach for 8K database pages.  This is very clearly a 9.4 project that is
 just getting started.

I'm not sure about the 9.4 part: if we ship with the builtin CRC as
committed, there is a 100% chance that we will want to switch out the
algorithm in 9.4, and there will be quite a large subset of users that
will find the performance unusable. If we change it to whatever we
come up with here, there is a small chance that the algorithm will
give worse than expected error detection rate in some circumstances
and we will want offer a better algorithm. More probably it will be
good enough and the low performance hit will allow more users to turn
it on. This is a 16bit checksum that we talking about, not SHA-1, it
is expected to occasionally fail to detect errors. I can provide you
with a patch of the generic version of any of the discussed algorithms
within an hour, leaving plenty of time in beta or in 9.4 to
accommodate the optimized versions. It's literally a dozen self
contained lines of code.

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Daniel Farina
On Wed, Apr 17, 2013 at 5:21 PM, Greg Smith g...@2ndquadrant.com wrote:
 Let me see if I can summarize where the messages flying by are at since
 you'd like to close this topic for now:

 -Original checksum feature used Fletcher checksums.  Its main problems, to
 quote wikipedia, include that it cannot distinguish between blocks of all 0
 bits and blocks of all 1 bits.

 -Committed checksum feature uses truncated CRC-32.  This has known good
 error detection properties, but is expensive to compute.  There's reason to
 believe that particular computation will become cheaper on future platforms
 though.  But taking full advantage of that will require adding CPU-specific
 code to the database.

 -The latest idea is using the Fowler–Noll–Vo hash function:
 https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash  There's 20 years of
 research around when that is good or bad.  The exactly properties depend on
 magic FNV primes:  http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that
 can vary based on both your target block size and how many bytes you'll
 process at a time.  For PostgreSQL checksums, one of the common
 problems--getting an even distribution of the hashed values--isn't important
 the way it is for other types of hashes.  Ants and Florian have now dug into
 how exactly that and specific CPU optimization concerns impact the best
 approach for 8K database pages.  This is very clearly a 9.4 project that is
 just getting started.

I was curious about the activity in this thread and wanted to understand
the tradeoffs, and came to the same understanding as you when poking
around.  It seems the tough aspect of the equation is that the most
well studied thing is slow (CRC-32C) unless you have special ISA
support.  Trying to find as much information and conclusive research on
FNV was a lot more challenging.  Fletcher is similar in that regard.

Given my hasty attempt to understand each of the alternatives, my
qualitative judgement is that, strangely enough, the most conservative
choice of the three (in terms of being understood and treated in the
literature more than ten times over) is CRC-32C, but it's also the one
being cast as only suitable inside micro-optimization.  To add
another, theoretically-oriented dimension to the discussion, I'd like
to suggest it's also the most thoroughly studied of all the alternatives.
 I really had a hard time finding follow-up papers about the two
alternatives, but to be fair, I didn't try very hard...then again, I
didn't try very hard for any of the three, it's just that CRC32C was
by far the easiest to find materials on.

The original paper is often shorthanded Castagnoli 93, but it exists
in the IEEE's sphere of influence and is hard to find a copy of.
Luckily, a pretty interesting survey paper discussing some of the
issues was written by Koopman in 2002 and is available:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a
pedagogical note, it's a pretty interesting and accessible piece of
writing (for me, as someone who knows little of error
detection/correction) and explains some of the engineering reasons
that provoke such exercises.

Basically...if it comes down to understand what the heck is going on
and what the trade-offs are, it was a lot easier to brush up on
CRC32-C in my meandering around the Internet.

One might think this level of scrutiny would constitute a viable
explanation of why CRC32C found its way into several standards and
then finally into silicon.

All in all, if the real world costs of CRC32C on non-SSE4.2 hardware are
allowable, I think it's the most researched and conservative
option, although perhaps some of the other polynomials seen in Koopman
could also be desirable.  It seems there's a tradeoff in CRC
polynomials between long-message and short-message error detection,
and the paper above may allow for a more informed selection.  CRC32C
is considered a good trade-off for both, but I haven't assessed the
paper in enough detail to suggest whether there are specialized
long-run polynomials that may be better still (although, then, there
is also the microoptimization question, which postdates the literature
I was looking at by a lot).




Re: [HACKERS] Enabling Checksums

2013-04-17 Thread Greg Smith

On 4/17/13 8:56 PM, Ants Aasma wrote:

Nothing from the two points, but the CRC calculation algorithm can be
switched out for slice-by-4 or slice-by-8 variant. Speed up was around
factor of 4 if I remember correctly...I can provide you

 with a patch of the generic version of any of the discussed algorithms
 within an hour, leaving plenty of time in beta or in 9.4 to
 accommodate the optimized versions.

Can you nail down a solid, potentially committable slice-by-4 or slice-by-8 
patch then?  You dropped into things like per-byte overhead to reach 
this conclusion, which was fine to let the methods battle each other. 
Maybe I missed it, but I didn't remember seeing an obvious full patch 
for this implementation then come back up from that.  With the schedule 
pressure this needs to return to more database-level tests.  Your 
concerns about the committed feature being much slower than the original 
Fletcher one are troubling, and we might as well do that showdown again 
now with the best of the CRC implementations you've found.



Actually the state is that with the [CRC] polynomial used there is
currently close to zero hope of CPUs optimizing for us.


Ah, I didn't catch that before.  It sounds like the alternate slicing 
implementation should also use a different polynomial then, which sounds 
reasonable.  This doesn't even have to be exactly the same CRC function 
that the WAL uses.  A CRC that's modified for performance or having a 
better future potential is fine; there's just a lot of resistance to 
using something other than a CRC right now.



I'm not sure about the 9.4 part: if we ship with the builtin CRC as
committed, there is a 100% chance that we will want to switch out the
algorithm in 9.4, and there will be quite a large subset of users that
will find the performance unusable.


Now I have to switch out my reviewer hat for my 3 bit fortune telling 
one.  (It uses a Magic 8 Ball)  This entire approach is squeezing what 
people would prefer to be a 32 bit CRC into a spare 16 bits, as a useful 
step advancing toward a long term goal.  I have four major branches of 
possible futures here I've thought about:


1) Database checksums with 16 bits are good enough, but they have to be 
much faster to satisfy users.  It may take a different checksum 
implementation altogether to make that possible, and distinguishing 
between the two of them requires borrowing even more metadata bits from 
somewhere.  (This seems the future you're worried about)


2) Database checksums work out well, but they have to be 32 bits to 
satisfy users and/or error detection needs.  Work on pg_upgrade and 
expanding the page headers will be needed.  Optimization of the CRC now 
has a full 32 bit target.


3) The demand for database checksums is made obsolete by either 
mainstream filesystem checksumming, performance issues, or just general 
market whim.  The 16 bit checksum PostgreSQL implements becomes a 
vestigial feature, and whenever it gets in the way of making changes 
someone proposes eliminating them.  (I call this one the rules future)


4) 16 bit checksums turn out to be such a problem in the field that 
everyone regrets the whole thing, and discussions turn immediately 
toward how to eliminate that risk.


It's fair that you're very concerned about (1), but I wouldn't give it 
100% odds of happening either.  The user demand that's motivated me to 
work on this will be happy with any of (1) through (3), and in two of 
them optimizing the 16 bit checksums now turns out to be premature.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Simon Riggs
On 9 April 2013 03:35, Ants Aasma a...@cybertec.at wrote:
 On Fri, Apr 5, 2013 at 9:39 PM, Ants Aasma a...@cybertec.at wrote:
 Unless somebody tells me not to waste my time I'll go ahead and come
 up with a workable patch by Monday.

 And here you go. I decided to be verbose with the comments as it's
 easier to delete a comment to write one. I also left in a huge jumble
 of macros to calculate the contents of a helper var during compile
 time. This can easily be replaced with the calculated values once we
 settle on specific parameters.

 Currently only x86-64 is implemented. 32bit x86 would be mostly a
 copy-and-paste job, replacing 64bit pointer registers with 32bit ones.
 For other platforms the simplest way would be to use a vectorizing
 compiler on the generic variant. -funroll-loops -ftree-vectorize is
 enough on gcc.

 Quick bench results on the worst case workload:
 master no checksums: tps = 15.561848
 master with checksums: tps = 1.695450
 simd checksums: tps = 14.602698

Numbers look very good on this. Well done.

I support the direction of this, but I don't think I'm sufficiently
well qualified to verify that the code does what it should and/or fix
it if it breaks. If others want to see this happen you'll need to
pitch in.

My only review comments are to ask for some explanation of the magic numbers...
#define CSUM_PRIME1 0x49
#define CSUM_PRIME2 0x986b
#define CSUM_TRUNC 65521

Where magic means a level of technology far above my own
understanding, and yet no (or not enough) code comments to assist me.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Greg Stark
On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs si...@2ndquadrant.com wrote:
 * WAL checksum is not used as the sole basis for end-of-WAL discovery.
 We reuse the WAL files, so the prev field in each WAL record shows
 what the previous end of WAL was. Hence if the WAL checksums give a
 false positive we still have a double check that the data really is
 wrong. It's unbelievable that you'd get a false positive and then have
 the prev field match as well, even though it was the genuine
 end-of-WAL.

This is kind of true and kind of not true. If a system loses power
while writing lots of data to WAL then the blocks at the end of the
WAL might not be written out in order. Everything since the last log
sync might be partly written out and partly not written out. That's
the case where the checksum is critical. The beginning of a record
could easily be written out including xl_prev and the end of the
record not written. 1/64,000 power losses would then end up with an
assertion failure or corrupt database.



-- 
greg




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs si...@2ndquadrant.com wrote:
 * WAL checksum is not used as the sole basis for end-of-WAL discovery.
 We reuse the WAL files, so the prev field in each WAL record shows
 what the previous end of WAL was. Hence if the WAL checksums give a
 false positive we still have a double check that the data really is
 wrong. It's unbelievable that you'd get a false positive and then have
 the prev field match as well, even though it was the genuine
 end-of-WAL.

 This is kind of true and kind of not true. If a system loses power
 while writing lots of data to WAL then the blocks at the end of the
 WAL might not be written out in order. Everything since the last log
 sync might be partly written out and partly not written out. That's
 the case where the checksum is critical. The beginning of a record
 could easily be written out including xl_prev and the end of the
 record not written. 1/64,000 power losses would then end up with an
 assertion failure or corrupt database.

I have a hard time believing that it's a good idea to add checksums to
data pages and at the same time weaken our ability to detect WAL
corruption.  So this seems to me to be going in the wrong direction.
What's it buying for us anyway?  A few CPU cycles saved during WAL
generation?  That's probably swamped by the other costs of writing WAL,
especially if you're using replication.

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Simon Riggs
On 16 April 2013 20:27, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs si...@2ndquadrant.com wrote:
 * WAL checksum is not used as the sole basis for end-of-WAL discovery.
 We reuse the WAL files, so the prev field in each WAL record shows
 what the previous end of WAL was. Hence if the WAL checksums give a
 false positive we still have a double check that the data really is
 wrong. It's unbelievable that you'd get a false positive and then have
 the prev field match as well, even though it was the genuine
 end-of-WAL.

 This is kind of true and kind of not true. If a system loses power
 while writing lots of data to WAL then the blocks at the end of the
 WAL might not be written out in order. Everything since the last log
 sync might be partly written out and partly not written out. That's
 the case where the checksum is critical. The beginning of a record
 could easily be written out including xl_prev and the end of the
 record not written. 1/64,000 power losses would then end up with an
 assertion failure or corrupt database.

 I have a hard time believing that it's a good idea to add checksums to
 data pages and at the same time weaken our ability to detect WAL
 corruption.  So this seems to me to be going in the wrong direction.
 What's it buying for us anyway?  A few CPU cycles saved during WAL
 generation?  That's probably swamped by the other costs of writing WAL,
 especially if you're using replication.

This part of the thread is dead now; I said let's drop this idea
on 13 April.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Florian Pflug
On Apr13, 2013, at 17:14 , Ants Aasma a...@cybertec.at wrote:
 Based on current analysis, it is particularly good at detecting single
 bit errors, as good at detecting burst errors as can be expected from
 16 bits and not horrible at detecting burst writes of zeroes. It is
 quite bad at detecting multiple uncorrelated single bit errors and
 extremely bad at detecting repeating patterns of errors in low order
 bits.

I've read the patch and tried to understand why it's that bad at
detecting repeating patterns of errors in low order bits, and to see
if there might be a way to fix that without too much of a performance
impact.

Here's what I gather the algorithm does:

  It treats the input data, a page of L bytes, as an Nx64 matrix V
  of 16-bit quantities (N = L/128, i.e. 64 for an 8K page).
  It then first computes (using two primes p (PRIME1) and q (PRIME2))

S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
  + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
  + …
  + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
  (mod 2^16)
  = sum V[i,j]*p^(64-i)*q^(64-j)

   Note that it does that by first computing the row-wise sums without
   the q^i coefficient, and then (in what the code calls the aggregation
   phase) combines those row-wise sums into a total, adding the q^i-
   coefficients along the way.

   The final hash value is then

 H = S * p + B * q mod 2^16

   where B is a salt value intended to detect swapped pages (currently
   B is simply the page index)
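
For concreteness, here is a scalar rendering of that formula. It is not
the SIMD patch itself - the evaluation order differs and the patch keeps
many partial sums in flight - and the function name is mine; PRIME1 and
PRIME2 are the values quoted elsewhere in the thread, the block number
stands in for B, and plain mod-2^16 truncation is assumed throughout.

#include <stdint.h>

#define PRIME1 0x49u     /* p */
#define PRIME2 0x986bu   /* q */

static uint16_t
page_checksum_sketch(const uint16_t v[64][64], uint16_t blkno)
{
    uint16_t s = 0;

    for (int i = 0; i < 64; i++)
    {
        uint16_t rowsum = 0;

        /* Horner pass over one row: sum_j V[i,j] * q^(64-j) */
        for (int j = 0; j < 64; j++)
            rowsum = (uint16_t) (rowsum * PRIME2 + v[i][j]);

        /* aggregation pass: attaches the p^(64-i) factor */
        s = (uint16_t) (s * PRIME1 + rowsum);
    }

    /* H = S * p + B * q mod 2^16 */
    return (uint16_t) (s * PRIME1 + blkno * PRIME2);
}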

This raises two questions. First, why are there two primes? You could
just as well use a single prime q and set p=q^64 mod 2^16. You then
get
  S = sum V[i,j] * q^(64*(64-i) + (64-j))
= sum V[i,j] * q^(4096 - 64*(i-1) - j)
You get higher prime powers that way, but you can easily choose a prime
that yields distinct values mod 2^16 for exponents up to 16383. Your
PRIME2, for example, does. (It wraps around for 16384, i.e.
PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
16384 is the Carmichael function's value at 2^16)

Second, why does it use addition instead of XOR? It seems that FNV
usually XORs the terms together instead of adding them?

Regarding the bad behaviour for multiple low-bit errors - can you
explain why it behaves badly in that case? I currently fail to see
why that would be. I *can* see that the lowest bit of the hash depends
only on the lowest bit of the input words, but as long as the lowest
bits of the input words also affect other bits of the hash, that shouldn't
matter. Which I think they do, but I might be missing something...

Here, btw, is a page on FNV hashing. It mentions a few rules for
picking suitable primes

http://www.isthe.com/chongo/tech/comp/fnv

best regards,
Florian Pflug





Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Ants Aasma
On Tue, Apr 16, 2013 at 5:05 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 9 April 2013 03:35, Ants Aasma a...@cybertec.at wrote:
 On Fri, Apr 5, 2013 at 9:39 PM, Ants Aasma a...@cybertec.at wrote:
 Unless somebody tells me not to waste my time I'll go ahead and come
 up with a workable patch by Monday.

 And here you go. I decided to be verbose with the comments as it's
 easier to delete a comment to write one. I also left in a huge jumble
 of macros to calculate the contents of a helper var during compile
 time. This can easily be replaced with the calculated values once we
 settle on specific parameters.

 Currently only x86-64 is implemented. 32bit x86 would be mostly a
 copy-and-paste job, replacing 64bit pointer registers with 32bit ones.
 For other platforms the simplest way would be to use a vectorizing
 compiler on the generic variant. -funroll-loops -ftree-vectorize is
 enough on gcc.

 Quick bench results on the worst case workload:
 master no checksums: tps = 15.561848
 master with checksums: tps = 1.695450
 simd checksums: tps = 14.602698

 Numbers look very good on this. Well done.

 I support the direction of this, but I don't think I'm sufficiently
 well qualified to verify that the code does what it should and/or fix
 it if it breaks. If others want to see this happen you'll need to
 pitch in.

 My only review comments are to ask for some explanation of the magic 
 numbers...
 #define CSUM_PRIME1 0x49
 #define CSUM_PRIME2 0x986b
 #define CSUM_TRUNC 65521

 Where magic means a level of technology far above my own
 understanding, and yet no (or not enough) code comments to assist me.

The specific values used are mostly magic to me too. As mentioned in a
short sentence in the patch, the values are experimentally chosen,
guided by some intuition about what good values should look like.

Basically the methodology for the choice was that I took all the pages
from a 2.8GB test database, and then for each page introduced a bunch
of common errors and observed how many errors were undetected. The
main observations were:
1) the exact value of the primes doesn't really matter for detection
   efficiency.
2) values with a non-uniform distribution of zeroes and ones seem to
   work slightly better.
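
A stripped-down sketch of that kind of experiment, for one of the error
models: the checksum under test and the source page are left as
parameters, and the names here are mine, not the harness's. The real run
covered every page of the database and all of the error models listed in
the results upthread.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 8192

typedef uint16_t (*checksum_fn) (const unsigned char *page);

/* "Write 0x00 byte" model: clobber one random byte with zero and count how
 * often the checksum fails to change.  (A byte that was already zero counts
 * as a miss here; a real harness might exclude that case.) */
static long
count_undetected_zero_byte(checksum_fn f, const unsigned char *orig, long trials)
{
    uint16_t good = f(orig);
    unsigned char page[PAGE_SIZE];
    long misses = 0;

    for (long t = 0; t < trials; t++)
    {
        memcpy(page, orig, PAGE_SIZE);
        page[rand() % PAGE_SIZE] = 0x00;
        if (f(page) == good)
            misses++;           /* corruption went undetected */
    }
    return misses;
}

/* report in the same 1:N form used in the result tables */
static void
report(const char *label, long trials, long misses)
{
    if (misses > 0)
        printf("%s: 1:%ld\n", label, trials / misses);
    else
        printf("%s: 1:inf\n", label);
}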

I'll write up a readme of why the values are how they are and with
some more explanation of the algorithm.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 On Tue, Apr 16, 2013 at 5:05 PM, Simon Riggs si...@2ndquadrant.com wrote:
 My only review comments are to ask for some explanation of the magic 
 numbers...

 The specific values used are mostly magic to me too. As mentioned in a
 short sentence in the patch, the values are experimentally chosen,
 guided by some intuition about what good values should look like.

There actually is quite a lot of theory out there about this sort of
thing.  If we are inventing our own checksum function then We're Doing
It Wrong.  We should be adopting an existing, proven function.
Experimentally derived is about the worst recommendation I can think
of in this area.

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Ants Aasma
On Tue, Apr 16, 2013 at 11:20 PM, Florian Pflug f...@phlo.org wrote:
 On Apr13, 2013, at 17:14 , Ants Aasma a...@cybertec.at wrote:
 Based on current analysis, it is particularly good at detecting single
 bit errors, as good at detecting burst errors as can be expected from
 16 bits and not horrible at detecting burst writes of zeroes. It is
 quite bad at detecting multiple uncorrelated single bit errors and
 extremely bad at detecting repeating patterns of errors in low order
 bits.

 I've read the patch and tried to understand why it's that bad at
 detecting repeating patterns of errors in low order bits, and to see
 if there might be a way to fix that without too much of a performance
 impact.

 Here's what I gather the algorithm does:

   It treats the input data, a page of L bytes, as a Nx64 matrix V
   of 16-bit quantities (N = L/128, i.e. 64 for an 8K page).
   It then first computes (using two primes p (PRIME1) and q (PRIME2))

 S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
   + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
   + …
   + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
   (mod 2^16)
   = sum V[i,j]*p^(64-i)*q^(64-j)

Note that it does that by first computing the row-wise sums without
the q^i coefficient, and then (in what the code calls the aggregation
phase) combines those row-wise sums into a total, adding the q^i-
coefficients along the way.

The final hash value is then

  H = S * p + B * q mod 2^16

where B is a salt value intended to detect swapped pages (currently
B is simply the page index)

Great job analyzing the analytic form of the algorithm, and sorry you
had to do it instead of finding it in the documentation.

 This raises two question. First, why are there two primes? You could
 just as well using a single prime q and set p=q^64 mod 2^16. You then
 get
   S = sum V[i,j] * q^(64*(64-i) + (64-j)
 = sum V[i,j] * q^(4096 - 64*(i-1) - j)
 You get higher prime powers that way, but you can easily choose a prime
 that yields distinct values mod 2^16 for exponents up to 16383. Your
 PRIME2, for example, does. (It wraps around for 16384, i.e.
 PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
 16384 is the Carmichael function's value at 2^16)

The experimental detection rate is about the same if we use a single
prime. But I think you have the analytical form wrong here. It should
be, given q = p:

S = sum V[i,j] * p^(64-i) * p^(64-j)
  = sum V[i,j] * p^(64 - i + 64 - j)
  = sum V[i,j] * p^(128 - i -j)

This makes the whole matrix symmetric. While I can't think of any real
world errors that would exhibit symmetry in this 64x64 matrix, it
seemed better to me to avoid the issue altogether and use different
primes. IIRC it helped a lot for the case of page[i] = i & 0xFF.
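
Spelling out the symmetry (my notation, following your analysis): with
q = p the coefficient of V[i,j] is

    p^{64-i}\, p^{64-j} \;=\; p^{128-i-j} \;=\; p^{64-j}\, p^{64-i},

so S(V) = S(V^T), meaning a transposed pair of words - or a transposed
page, viewed as this matrix - would go undetected with a single prime.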

 Second, why does it use addition instead of XOR? It seems that FNV
 usually XORs the terms together instead of adding them?

Testing showed slightly better detection rate for adds. Intuitively I
think it's because the carry introduces some additional mixing.

 Regarding the bad behaviour for multiple low-bit errors - can you
 explain why it behaves badly in that case? I currently fail to see
 why that would be. I *can* see that the lowest bit of the hash depends
 only on the lowest bit of the input words, but as long as the lowest
 bits of the input words also affect other bits of the hash, that shouldn't
 matter. Which I think they do, but I might be missing something...

Looks like you're right. I was somehow concentrating only on how the
lowest bits depend on the input.

 Here, btw, is a page on FNV hashing. It mentions a few rules for
 picking suitable primes

 http://www.isthe.com/chongo/tech/comp/fnv

Unfortunately the rules don't apply here because of the hash size.

Thanks for your analysis. I will do my best to get this all into a
document and will do some more analysis to see if I can come up with
some kind of proof for the error cases.

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-16 Thread Florian Pflug
On Apr16, 2013, at 23:41 , Ants Aasma a...@cybertec.at wrote:
 On Tue, Apr 16, 2013 at 11:20 PM, Florian Pflug f...@phlo.org wrote:
 On Apr13, 2013, at 17:14 , Ants Aasma a...@cybertec.at wrote:
 Based on current analysis, it is particularly good at detecting single
 bit errors, as good at detecting burst errors as can be expected from
 16 bits and not horrible at detecting burst writes of zeroes. It is
 quite bad at detecting multiple uncorrelated single bit errors and
 extremely bad at detecting repeating patterns of errors in low order
 bits.
 
 I've read the patch and tried to understand why it's that bad at
 detecting repeating patterns of errors in low order bits, and to see
 if there might be a way to fix that without too much of a performance
 impact.
 
 Here's what I gather the algorithm does:
 
  It treats the input data, a page of L bytes, as a Nx64 matrix V
  of 16-bit quantities (N = L/128, i.e. 64 for an 8K page).
  It then first computes (using two primes p (PRIME1) and q (PRIME2))
 
S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
  + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
  + …
  + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
  (mod 2^16)
  = sum V[i,j]*p^(64-i)*q^(64-j)
 
   Note that it does that by first computing the row-wise sums without
   the q^i coefficient, and then (in what the code calls the aggregation
   phase) combines those row-wise sums into a total, adding the q^i-
   coefficients along the way.
 
   The final hash value is then
 
 H = S * p + B * q mod 2^16
 
   where B is a salt value intended to detect swapped pages (currently
   B is simply the page index)
 
 Great job analyzing the analytic form of the algorithm and sorry I you
 had to do it instead finding it in the documentation.

No problem, glad if I can help!

 This raises two question. First, why are there two primes? You could
 just as well using a single prime q and set p=q^64 mod 2^16. You then
 get
  S = sum V[i,j] * q^(64*(64-i) + (64-j)
= sum V[i,j] * q^(4096 - 64*(i-1) - j)
 You get higher prime powers that way, but you can easily choose a prime
 that yields distinct values mod 2^16 for exponents up to 16383. Your
 PRIME2, for example, does. (It wraps around for 16384, i.e.
 PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
 16384 is the Carmichael function's value at 2^16)
 
 The experimental detection rate is about the same if we use a single
 prime. But I think you have the analytical form wrong here. It should
 be given q = p:
 
S = sum V[i,j] * p^(64-i) * p^(64-j)
  = sum V[i,j] * p^(64 - i + 64 - j)
  = sum V[i,j] * p^(128 - i -j)

Yeah, if you set q = p that's true. My suggestion was p=q^64 though...

 Second, why does it use addition instead of XOR? It seems that FNV
 usually XORs the terms together instead of adding them?
 
 Testing showed slightly better detection rate for adds. Intuitively I
 think it's because the carry introduces some additional mixing.

Hm, but OTOH it makes S linear in V, i.e. if you have two inputs
V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then
S' = S*m. The second property is quite undesirable, I think. Assume
all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit
positions 0..(k-1). Then, due to linearity, S is also divisible by
2^k, i.e. also has no ones before the k-th bit. This means, for example
that if you hash values values which all have their lowest bit cleared,
you get only 2^15 distinct hash values. If they all have the two
lowest bits cleared, you get only 2^14 distinct values, and so on…

Generally, linearity doesn't seem to be a property that one wants
in a hash I think, so my suggestion is to stick to XOR.
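
To make that explicit (S as in the analysis upthread, everything mod 2^16;
W and m are just my placeholders for a second input and a scalar):

    S(V) \;=\; \sum_{i,j} V_{i,j}\, p^{64-i} q^{64-j}
    \;\Longrightarrow\;
    S(V + W) = S(V) + S(W), \qquad S(mV) = m\, S(V),

so if every V[i,j] is divisible by 2^k then S is too, and only 2^{16-k}
of the possible checksum values remain reachable.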

 Here, btw, is a page on FNV hashing. It mentions a few rules for
 picking suitable primes
 
 http://www.isthe.com/chongo/tech/comp/fnv
 
 Unfortunately the rules don't apply here because of the hash size.

Yeah :-(.

I noticed that their 32-bit prime only has a single one bit outside
the first 16 bits. Maybe we can take advantage of that and use a
32-bit state while still providing decent performance on machines
without a 32-bit x 32-bit -> 32-bit multiply instruction?

If we lived in an Intel-only world, I'd suggest going with a
32-bit state, since SSE4.1 support is *very* wide-spread already -
the last CPUs without it came out over 5 years ago, I think.
(Core2 and later support SSE4.1, and some later Core1 do too)

But unfortunately things look bleak even for other x86
implementations - AMD supports SSE4.1 only starting with
Bulldozer, which came out 2011 or so I believe. Leaving the x86
realm, it seems that only ARM's NEON provides the instructions
we'd need - AltiVec seems to support only 16-bit multiplies,
and from what some quick googling brought up, MIPS and SPARC
SIMD instructions look no better...

OTOH, chances are that nobody will ever do SIMD implementations
for those machines. In that case, working in 32-bit chunks 

Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Simon Riggs
On 12 April 2013 23:21, Ants Aasma a...@cybertec.at wrote:

 In general, we have more flexibility with WAL because there is no
 upgrade issue. It would be nice to share code with the data page
 checksum algorithm; but really we should just use whatever offers the
 best trade-off in terms of complexity, performance, and error detection
 rate.

 I don't think we need to decide all of this right now. Personally, I'm
 satisfied having SIMD checksums on data pages now and leaving WAL
 optimization until later.

 +1

OK, let's drop that idea then. SIMD checksums for 16-bit page checksums
only in this release.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 02:38:27PM -0700, Jeff Davis wrote:
 In general, we have more flexibility with WAL because there is no
 upgrade issue. It would be nice to share code with the data page
 checksum algorithm; but really we should just use whatever offers the
 best trade-off in terms of complexity, performance, and error detection
 rate.
 
 I don't think we need to decide all of this right now. Personally, I'm
 satisfied having SIMD checksums on data pages now and leaving WAL
 optimization until later.

As I understand it, SIMD is just a CPU-optimized method for producing a
CRC checksum.  Is that right?  Does it produce the same result as a
non-CPU-optimized CRC calculation?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Andres Freund
On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote:
 On Fri, Apr 12, 2013 at 02:38:27PM -0700, Jeff Davis wrote:
  In general, we have more flexibility with WAL because there is no
  upgrade issue. It would be nice to share code with the data page
  checksum algorithm; but really we should just use whatever offers the
  best trade-off in terms of complexity, performance, and error detection
  rate.
  
  I don't think we need to decide all of this right now. Personally, I'm
  satisfied having SIMD checksums on data pages now and leaving WAL
  optimization until later.
 
 As I understand it, SIMD is just a CPU-optimized method for producing a
 CRC checksum.  Is that right?  Does it produce the same result as a
 non-CPU-optimized CRC calculation?

No, we are talking about a different algorithm that produces different
results; that's why it's important to choose now, since we can't change
it later without breaking pg_upgrade in future releases.

http://en.wikipedia.org/wiki/SIMD_%28hash_function%29

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote:
 As I understand it, SIMD is just a CPU-optimized method for producing a
 CRC checksum.  Is that right?  Does it produce the same result as a
 non-CPU-optimized CRC calculation?

 No we are talking about a different algorithm that results in different
 results, thats why its important to choose now since we can't change it
 later without breaking pg_upgrade in further releases.
 http://en.wikipedia.org/wiki/SIMD_%28hash_function%29

[ squint... ]  We're talking about a *cryptographic* hash function?
Why in the world was this considered a good idea for page checksums?

In the first place, it's probably not very fast compared to some
alternatives, and in the second place, the criteria by which people
would consider it a good crypto hash function have approximately nothing
to do with what we need for a checksum function.  What we want for a
checksum function is high probability of detection of common hardware
failure modes, such as burst errors and all-zeroes.  This is
particularly critical when we're going with only a 16-bit checksum ---
the probabilities need to be skewed in the right direction, or it's not
going to be all that terribly useful.

CRCs are known to be good for that sort of thing; it's what they were
designed for.  I'd like to see some evidence that any substitute
algorithm has similar properties.  Without that, I'm going to vote
against this idea.

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Andres Freund
On 2013-04-13 10:58:53 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote:
  As I understand it, SIMD is just a CPU-optimized method for producing a
  CRC checksum.  Is that right?  Does it produce the same result as a
  non-CPU-optimized CRC calculation?
 
  No we are talking about a different algorithm that results in different
  results, thats why its important to choose now since we can't change it
  later without breaking pg_upgrade in further releases.
  http://en.wikipedia.org/wiki/SIMD_%28hash_function%29
 
 [ squint... ]  We're talking about a *cryptographic* hash function?
 Why in the world was this considered a good idea for page checksums?

In Ants' implementation it's a heck of a lot faster than any CRC
implementation we have seen so far on relatively large blocks (like pages).

pgbench results:
ca+csw_uxo-frkuzl0yzs0wsdl8dipzv-ugmvyn-yv45sgub...@mail.gmail.com

byte/cycle comparison:
CA+CSw_su1fopLNBz1NAfkSNw4_=gv+5pf0kdlqmpvukw1q4...@mail.gmail.com

 In the first place, it's probably not very fast compared to some
 alternatives, and in the second place, the criteria by which people
 would consider it a good crypto hash function have approximately nothing
 to do with what we need for a checksum function.  What we want for a
 checksum function is high probability of detection of common hardware
 failure modes, such as burst errors and all-zeroes.  This is
 particularly critical when we're going with only a 16-bit checksum ---
 the probabilities need to be skewed in the right direction, or it's not
 going to be all that terribly useful.
 
 CRCs are known to be good for that sort of thing; it's what they were
 designed for.  I'd like to see some evidence that any substitute
 algorithm has similar properties.  Without that, I'm going to vote
 against this idea.

Ants has done some analysis on this, like
CA+CSw_tMoA85e=1vs4omjzjg2mr_hulikovpd80dp5rurds...@mail.gmail.com .
That doesn't look bad to me and unless I am missing something it's better
than our CRC with 16bit.

So while I would say it's not 100% researched, there has been a rather
detailed investigation by Ants - I am rather impressed.

My biggest doubt so far is the reliance on inline assembly for the top
performance on x86-64 and a generic implementation otherwise that is
only really fast with appropriate compiler flags...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Ants Aasma
On Sat, Apr 13, 2013 at 5:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote:
 As I understand it, SIMD is just a CPU-optimized method for producing a
 CRC checksum.  Is that right?  Does it produce the same result as a
 non-CPU-optimized CRC calculation?

 No we are talking about a different algorithm that results in different
 results, thats why its important to choose now since we can't change it
 later without breaking pg_upgrade in further releases.
 http://en.wikipedia.org/wiki/SIMD_%28hash_function%29

 [ squint... ]  We're talking about a *cryptographic* hash function?
 Why in the world was this considered a good idea for page checksums?

 In the first place, it's probably not very fast compared to some
 alternatives, and in the second place, the criteria by which people
 would consider it a good crypto hash function have approximately nothing
 to do with what we need for a checksum function.  What we want for a
 checksum function is high probability of detection of common hardware
 failure modes, such as burst errors and all-zeroes.  This is
 particularly critical when we're going with only a 16-bit checksum ---
 the probabilities need to be skewed in the right direction, or it's not
 going to be all that terribly useful.

 CRCs are known to be good for that sort of thing; it's what they were
 designed for.  I'd like to see some evidence that any substitute
 algorithm has similar properties.  Without that, I'm going to vote
 against this idea.

Sorry for creating confusion here by playing fast and loose with the
terminology. We are not talking about that hash function at all. What
we are talking about here is a Fowler-Noll-Vo-ish
(http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function)
hash function that is restructured to be parallelisable with SIMD
instructions with the explicit goal of being as fast as possible. The
resulting hash function is roughly two orders of magnitude faster than
1-byte-at-a-time CRC32 currently in use. Performance is about
comparable with optimized fixed size memcpy running in cache.

Based on current analysis, it is particularly good at detecting single
bit errors, as good at detecting burst errors as can be expected from
16 bits and not horrible at detecting burst writes of zeroes. It is
quite bad at detecting multiple uncorrelated single bit errors and
extremely bad at detecting repeating patterns of errors in low order
bits.

All in all I would say that the performance is worth the loss in
detection capability as we are not talking about using the checksum to
prove correctness.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Andres Freund
On 2013-04-13 18:14:28 +0300, Ants Aasma wrote:
 On Sat, Apr 13, 2013 at 5:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Andres Freund and...@2ndquadrant.com writes:
  On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote:
  As I understand it, SIMD is just a CPU-optimized method for producing a
  CRC checksum.  Is that right?  Does it produce the same result as a
  non-CPU-optimized CRC calculation?
 
  No we are talking about a different algorithm that results in different
  results, thats why its important to choose now since we can't change it
  later without breaking pg_upgrade in further releases.
  http://en.wikipedia.org/wiki/SIMD_%28hash_function%29
 
  [ squint... ]  We're talking about a *cryptographic* hash function?
  Why in the world was this considered a good idea for page checksums?
 
  In the first place, it's probably not very fast compared to some
  alternatives, and in the second place, the criteria by which people
  would consider it a good crypto hash function have approximately nothing
  to do with what we need for a checksum function.  What we want for a
  checksum function is high probability of detection of common hardware
  failure modes, such as burst errors and all-zeroes.  This is
  particularly critical when we're going with only a 16-bit checksum ---
  the probabilities need to be skewed in the right direction, or it's not
  going to be all that terribly useful.
 
  CRCs are known to be good for that sort of thing; it's what they were
  designed for.  I'd like to see some evidence that any substitute
  algorithm has similar properties.  Without that, I'm going to vote
  against this idea.
 
 Sorry for creating confusion here by playing fast and loose with the
 terminology. We are not talking about that hash function at all. What
 we are talking about here is Fowler-Noll-Vo-ish
 (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function)
 hash function that is restructured to be parallelisable with SIMD
 instructions with the explicit goal of being as fast as possible. The
 resulting hash function is roughly two orders of magnitude faster than
 1-byte-at-a-time CRC32 currently in use. Performance is about
 comparable with optimized fixed size memcpy running in cache.

Gah, one shouldn't look too quickly for a reference, sorry.
 
 Based on current analysis, it is particularly good at detecting single
 bit errors, as good at detecting burst errors as can be expected from
 16 bits and not horrible at detecting burst writes of zeroes. It is
 quite bad at detecting multiple uncorrelated single bit errors and
 extremely bad at detecting repeating patterns of errors in low order
 bits.

 All in all I would say that the performance is worth the loss in
 detection capability as we are not talking about using the checksum to
 prove correctness.

Is it actually a loss compared to our 16bit flavor of crc32 we now use?
I didn't think so, from the properties you described?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Ants Aasma
On Sat, Apr 13, 2013 at 6:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 All in all I would say that the performance is worth the loss in
 detection capability as we are not talking about using the checksum to
 prove correctness.

 Is it actually a loss compared to our 16bit flavor of crc32 we now use?
 I didn't think so, from the properties you described?

I would have to run the test suite I made to see how much, but I would
presume so. The algorithm relies on multiplication for bit diffusion,
and multiply has lousy diffusion on low-order bits, with exactly no
diffusion for the lowest bit. And for 16-bit values the low-order bits
are quite a large fraction of the total hash.
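
To see the low-order-bit problem in a tiny, self-contained example
(constants arbitrary): the low k bits of a product depend only on the
low k bits of its operands, so a change confined to the high bits of a
data word can never reach the low bits of that lane's sum.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint16_t prime = 0x7a3d;
    uint16_t a = 0x1234;
    uint16_t b = 0xf234;    /* same word with its top four bits flipped */

    /* Both products have identical low bytes: high-bit errors never
     * propagate into the low-order bits of an xor-and-multiply hash. */
    printf("%02x %02x\n",
           (unsigned) ((uint16_t) (a * prime) & 0xff),
           (unsigned) ((uint16_t) (b * prime) & 0xff));
    return 0;
}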

If we allow for operations that are not in SSE2 then there are a few
things that we could do to make the hash quality better without
affecting performance. The pmulld instruction (SSE4.1) would allow for
32bit values in the intermediate state. And pshufb (SSSE3) would allow
us to swap high and low bytes, introducing additional mixing. On Intel
Sandy Bridge, if I understand the microarchitecture correctly, either
change would be basically free, but not both, because pshufb and paddw
use execution ports 0 and 5, while pmulld needs port 0 and pmullw
needs port 1. Currently the main loop takes 1 cycle per 16-byte chunk;
any changes introducing conflicts there would cut the performance in
half.
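
For reference, the SSE2-only main loop looks roughly like the sketch
below when written with compiler intrinsics (accumulator count and
multiplier are placeholders, and the final aggregation of the
accumulators is omitted):

#include <emmintrin.h>      /* SSE2: movdqu, pxor, pmullw */
#include <stddef.h>

/* Illustrative SSE2 main loop: four accumulators of eight 16-bit lanes
 * each, i.e. 64 bytes per iteration, doing the same xor-then-multiply
 * step per lane as the scalar version. */
static void
toy_simd_loop(const char *data, size_t len, __m128i sums[4])
{
    const __m128i prime = _mm_set1_epi16(0x7a3d);
    size_t  i;
    int     j;

    for (i = 0; i + 64 <= len; i += 64)
    {
        for (j = 0; j < 4; j++)
        {
            __m128i chunk =
                _mm_loadu_si128((const __m128i *) (data + i + j * 16));

            sums[j] = _mm_xor_si128(sums[j], chunk);       /* pxor   */
            sums[j] = _mm_mullo_epi16(sums[j], prime);     /* pmullw */
        }
    }
}

If we were willing to require the newer instruction sets, pmulld maps
to _mm_mullo_epi32 and pshufb to _mm_shuffle_epi8, and either would
slot into the same loop.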

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-13 Thread Bruce Momjian
On Sat, Apr 13, 2013 at 06:14:28PM +0300, Ants Aasma wrote:
  CRCs are known to be good for that sort of thing; it's what they were
  designed for.  I'd like to see some evidence that any substitute
  algorithm has similar properties.  Without that, I'm going to vote
  against this idea.
 
 Sorry for creating confusion here by playing fast and loose with the
 terminology. We are not talking about that hash function at all. What
 we are talking about here is Fowler-Noll-Vo-ish
 (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function)
 hash function that is restructured to be parallelisable with SIMD
 instructions with the explicit goal of being as fast as possible. The
 resulting hash function is roughly two orders of magnitude faster than
 1-byte-at-a-time CRC32 currently in use. Performance is about
 comparable with optimized fixed size memcpy running in cache.
 
 Based on current analysis, it is particularly good at detecting single
 bit errors, as good at detecting burst errors as can be expected from
 16 bits and not horrible at detecting burst writes of zeroes. It is
 quite bad at detecting multiple uncorrelated single bit errors and
 extremely bad at detecting repeating patterns of errors in low order
 bits.
 
 All in all I would say that the performance is worth the loss in
 detection capability as we are not talking about using the checksum to
 prove correctness.

Agreed. It would be good to get these details into the patch so others
are not confused in the future.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Bruce Momjian
On Wed, Apr 10, 2013 at 11:19:56AM -0700, Jeff Davis wrote:
 On Wed, 2013-04-10 at 11:01 +0300, Ants Aasma wrote:
  I think we should first deal with using it for page checksums and if
  future versions want to reuse some of the code for WAL checksums then
  we can rearrange the code.
 
 Sounds good to me, although I expect we at least want any assembly to be
 in a separate file (if the specialization makes it in 9.3).

Sounds good.  Simon has done a good job shepherding this to completion. 

My only question is whether the 16-bit page checksums stored in WAL
reduce our ability to detect failed/corrupt writes to WAL?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Thu, 2013-04-11 at 20:12 +0100, Simon Riggs wrote:

 So, if we apply a patch like the one attached, we then end up with the
 WAL checksum using the page checksum as an integral part of its
 calculation. (There is no increase in code inside WALInsertLock,
 nothing at all touched in that area).
 
 
 Then all we need to do is make PageSetChecksumInplace() use Ants' algo
 and we're done.
 
 
 Only point worth discussing is that this change would make backup
 blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e.
 the record header is covered by a CRC32 but the backup blocks only by
 16-bit. 

FWIW, that's fine with me. 

 (Attached patch is discussion only. Checking checksum in recovery
 isn't coded at all.)

I like it.

A few points:

* Given that setting the checksum is unconditional in a backup block, do
we want to zero the checksum field when the backup block is restored if
checksums are disabled? Otherwise we would have a strange situation
where some blocks have a checksum on disk even when checksums are
disabled.

* When we do PageSetChecksumInplace(), we need to be 100% sure that the
hole is empty; otherwise the checksum will fail when we re-expand it. It
might be worth a memset beforehand just to be sure.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 12:07:36PM -0700, Jeff Davis wrote:
  (Attached patch is discussion only. Checking checksum in recovery
  isn't coded at all.)
 
 I like it.
 
 A few points:
 
 * Given that setting the checksum is unconditional in a backup block, do
 we want to zero the checksum field when the backup block is restored if
 checksums are disabled? Otherwise we would have a strange situation
 where some blocks have a checksum on disk even when checksums are
 disabled.
 
 * When we do PageSetChecksumInplace(), we need to be 100% sure that the
 hole is empty; otherwise the checksum will fail when we re-expand it. It
 might be worth a memset beforehand just to be sure.

Do we write the page holes to the WAL for full-page writes?  I hope we
don't.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Andres Freund
On 2013-04-11 20:12:59 +0100, Simon Riggs wrote:
 On 11 April 2013 04:27, Jeff Davis pg...@j-davis.com wrote:
 
  On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote:
 
   OK, so we have a single combined calculate a checksum for a block
   function. That uses Jeff's zeroing trick and Ants' bulk-oriented
   performance optimization.
  
  
   For buffer checksums we simply calculate for the block.
 
  Sounds good.
 
   For WAL full page writes, we first set the checksums for all defined
   buffers, then calculate the checksum of remaining data plus the
   pd_checksum field from each block using the normal WAL CRC32.
  
   Seems good to me. One set of fast code. And it avoids the weirdness
   that the checksum stored on the full page is actually wrong.
 
  Oh, that's a nice benefit.
 
 
 So, if we apply a patch like the one attached, we then end up with the WAL
 checksum using the page checksum as an integral part of its calculation.
 (There is no increase in code inside WALInsertLock, nothing at all touched
 in that area).
 
 Then all we need to do is make PageSetChecksumInplace() use Ants' algo and
 we're done.
 
 Only point worth discussing is that this change would make backup blocks be
 covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
 header is covered by a CRC32 but the backup blocks only by 16-bit.

That means we will have to do the verification for this in
ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
won't always recognize the end of WAL correctly.
And I am a bit wary of reducing the likelihood of noticing the proper
end-of-recovery by reducing the crc width.

Why again are we doing this now? Just to reduce the overhead of CRC
computation for full page writes? Or are we foreseeing issues with the
page checksums being wrong because of non-zero data in the hole being
zero after the restore from bkp blocks?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote:
  Only point worth discussing is that this change would make backup blocks be
  covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
  header is covered by a CRC32 but the backup blocks only by 16-bit.
 
 That means we will have to do the verification for this in
 ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
 won't always recognize the end of WAL correctly.
 And I am a bit wary of reducing the likelihood of noticing the proper
 end-of-recovery by reducing the crc width.
 
 Why again are we doing this now? Just to reduce the overhead of CRC
 computation for full page writes? Or are we forseeing issues with the
 page checksums being wrong because of non-zero data in the hole being
 zero after the restore from bkp blocks?

I thought the idea is that we were going to re-use the already-computed
CRC checksum on the page, and we only have 16-bits of storage for that.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Andres Freund
On 2013-04-12 15:31:36 -0400, Bruce Momjian wrote:
 On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote:
   Only point worth discussing is that this change would make backup blocks 
   be
   covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
   header is covered by a CRC32 but the backup blocks only by 16-bit.
  
  That means we will have to do the verification for this in
  ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
  won't always recognize the end of WAL correctly.
  And I am a bit wary of reducing the likelihood of noticing the proper
  end-of-recovery by reducing the crc width.
  
  Why again are we doing this now? Just to reduce the overhead of CRC
  computation for full page writes? Or are we forseeing issues with the
  page checksums being wrong because of non-zero data in the hole being
  zero after the restore from bkp blocks?
 
 I thought the idea is that we were going to re-use the already-computed
 CRC checksum on the page, and we only have 16-bits of storage for that.

Well, but the proposal seems to be to do this also for non-checksum
enabled datadirs, so ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Heikki Linnakangas

On 12.04.2013 22:31, Bruce Momjian wrote:

On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote:

Only point worth discussing is that this change would make backup blocks be
covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
header is covered by a CRC32 but the backup blocks only by 16-bit.


That means we will have to do the verification for this in
ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
won't always recognize the end of WAL correctly.
And I am a bit wary of reducing the likelihood of noticing the proper
end-of-recovery by reducing the crc width.

Why again are we doing this now? Just to reduce the overhead of CRC
computation for full page writes? Or are we forseeing issues with the
page checksums being wrong because of non-zero data in the hole being
zero after the restore from bkp blocks?


I thought the idea is that we were going to re-use the already-computed
CRC checksum on the page, and we only have 16-bits of storage for that.


No, the patch has to compute the 16-bit checksum for the page when the 
full-page image is added to the WAL record. There would otherwise be no 
need to calculate the page checksum at that point, but only later when 
the page is written out from shared buffer cache.


I think this is a bad idea. It complicates the WAL format significantly. 
Simon's patch didn't include the changes to recovery to validate the 
checksum, but I suspect it would be complicated. And it reduces the 
error-detection capability of WAL recovery. Keep in mind that unlike 
page checksums, which are never expected to fail (so even if we miss a 
few errors it's still better than nothing), the WAL checksum is used to 
detect end-of-WAL. There, a failure is expected every time we do 
crash recovery. Thus far, we've considered the probability of one in 
2^32 small enough for that purpose, but IMHO one in 2^16 is much too weak.


If you want to speed up the CRC calculation of full-page images, you 
could have an optimized version of the WAL CRC algorithm, using e.g. 
SIMD instructions. Because typical WAL records are small, max 100-200 
bytes, and each consists of several even smaller chunks, the normal WAL 
CRC calculation is quite resistant to common optimization techniques. 
But it might work for the full-page images. Let's not conflate it with 
the page checksums, though.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Simon Riggs
On 12 April 2013 21:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 No, the patch has to compute the 16-bit checksum for the page when the
 full-page image is added to the WAL record. There would otherwise be no need
 to calculate the page checksum at that point, but only later when the page
 is written out from shared buffer cache.

 I think this is a bad idea. It complicates the WAL format significantly.
 Simon's patch didn't include the changes to recovery to validate the
 checksum, but I suspect it would be complicated. And it reduces the
 error-detection capability of WAL recovery. Keep in mind that unlike page
 checksums, which are never expected to fail, so even if we miss a few errors
 it's still better than nothing, the WAL checkum is used to detect
 end-of-WAL. There is expected to be a failure every time we do crash
 recovery. This far, we've considered the probability of one in 1^32 small
 enough for that purpose, but IMHO one in 1^16 is much too weak.

 If you want to speed up the CRC calculation of full-page images, you could
 have an optimized version of the WAL CRC algorithm, using e.g. SIMD
 instructions. Because typical WAL records are small, max 100-200 bytes, and
 it consists of several even smaller chunks, the normal WAL CRC calculation
 is quite resistant to common optimization techniques. But it might work for
 the full-page images. Let's not conflate it with the page checksums, though.

I accept the general tone of that as a reasonable perspective and in
many ways am on the fence myself. This is sensitive stuff.

A few points
* The code to validate the checksum isn't complex, though it is more
than the current one line. Let's say about 10 lines of clear code. I'll
work on that to show it's true. I don't see that as a point of
objection.

* WAL checksum is not used as the sole basis for end-of-WAL discovery.
We reuse the WAL files, so the prev field in each WAL record shows
what the previous end of WAL was. Hence if the WAL checksums give a
false positive, we still have a double check that the data really is
wrong. It's extremely unlikely that you'd get a false positive and then
have the prev field match as well when that point really was the
genuine end-of-WAL.
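
To make that double check concrete, a simplified sketch of the idea
(toy types and names, not the actual xlog.c code):

#include <stdbool.h>
#include <stdint.h>

/* Toy mirror of the two XLogRecord fields that matter here. */
typedef struct ToyXLogRecord
{
    uint64_t    xl_prev;    /* location of the previous record */
    uint32_t    xl_crc;     /* checksum of this record */
} ToyXLogRecord;

/* A record read during recovery is only believed if both checks pass;
 * stale data in a recycled segment has to beat both to fool us. */
static bool
toy_record_looks_valid(const ToyXLogRecord *rec,
                       uint64_t expected_prev, uint32_t computed_crc)
{
    if (rec->xl_crc != computed_crc)
        return false;       /* checksum mismatch: treat as end of WAL */
    if (rec->xl_prev != expected_prev)
        return false;       /* back-pointer disagrees: stale, recycled data */
    return true;
}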

Yes, we could also have a second SIMD calculation optimised for WAL
CRC32 on an 8192 byte block, rather than just one set of SIMD code for
both. We could also have a single set of SIMD code producing a 32-bit
checksum, then take the low 16 bits as we do currently.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Fri, 2013-04-12 at 15:21 -0400, Bruce Momjian wrote:
  * When we do PageSetChecksumInplace(), we need to be 100% sure that the
  hole is empty; otherwise the checksum will fail when we re-expand it. It
  might be worth a memset beforehand just to be sure.
 
 Do we write the page holes to the WAL for full-page writes?  I hope we
 don't.

No, but the page hole is included in the checksum.

Let's say that the page hole contains some non-zero value, and we
calculate a checksum. When we eliminate the page hole, and then
reconstitute the page using zeros for the page hole later, then the page
will not match the checksum any more.

So, we need to be sure the original page hole is all-zero when we
calculate the checksum.
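
A sketch of that discipline (the struct layout, names and hash call are
illustrative, not the patch's actual API; the real code also mixes the
block number into the result):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 8192

/* Minimal stand-ins for the header fields involved; this is not
 * PostgreSQL's actual PageHeaderData layout. */
typedef struct ToyPageHeader
{
    uint16_t    pd_checksum;
    uint16_t    pd_lower;       /* start of the hole */
    uint16_t    pd_upper;       /* end of the hole */
} ToyPageHeader;

uint16_t toy_page_hash(const uint16_t *words, size_t nwords); /* any block hash */

static uint16_t
toy_compute_page_checksum(char *page)
{
    ToyPageHeader *hdr = (ToyPageHeader *) page;

    /* 1. Zero the hole, so a page reconstituted later with a zeroed
     *    hole produces the same checksum. */
    memset(page + hdr->pd_lower, 0, hdr->pd_upper - hdr->pd_lower);

    /* 2. Keep the stored checksum itself out of the calculation. */
    hdr->pd_checksum = 0;

    /* 3. Hash the whole block. */
    return toy_page_hash((const uint16_t *) page,
                         TOY_BLCKSZ / sizeof(uint16_t));
}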

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Fri, 2013-04-12 at 21:28 +0200, Andres Freund wrote:
 That means we will have to do the verification for this in
 ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
 won't always recognize the end of WAL correctly.
 And I am a bit wary of reducing the likelihood of noticing the proper
 end-of-recovery by reducing the crc width.

Good point.

 Why again are we doing this now? Just to reduce the overhead of CRC
 computation for full page writes? Or are we forseeing issues with the
 page checksums being wrong because of non-zero data in the hole being
 zero after the restore from bkp blocks?

That shouldn't be a problem, because the block is not expected to have a
proper checksum in WAL, and it will be recalculated before being
written. So I see these changes as mostly independent.

The reason we're discussing this right now is that, when choosing the
checksum algorithm, I was hoping that it might be usable in the future
for WAL backup blocks. I'm convinced that they can be; the primary
question now seems to be whether they should be, which does not need to
be settled right now in my opinion.

Anyway, I would be perfectly happy if we just got the SIMD algorithm in
for data pages. The support for changing the WAL checksums seems
lukewarm, and there might be quite a few alternatives (e.g. optimizing
the CRC for backup blocks as Heikki suggested) to achieve that
performance goal.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote:
 I think this is a bad idea. It complicates the WAL format significantly. 
 Simon's patch didn't include the changes to recovery to validate the 
 checksum, but I suspect it would be complicated. And it reduces the 
 error-detection capability of WAL recovery. Keep in mind that unlike 
 page checksums, which are never expected to fail, so even if we miss a 
 few errors it's still better than nothing, the WAL checkum is used to 
 detect end-of-WAL. There is expected to be a failure every time we do 
 crash recovery. This far, we've considered the probability of one in 
 1^32 small enough for that purpose, but IMHO one in 1^16 is much too weak.

One thing that just occurred to me is that we could make the SIMD
checksum a 32-bit checksum, and reduce it down to 16 bits for the data
pages. That might give us more flexibility to later use it for WAL
without compromising on the error detection nearly as much (though
obviously that wouldn't work with Simon's current proposal which uses
the same data page checksum in a WAL backup block).

In general, we have more flexibility with WAL because there is no
upgrade issue. It would be nice to share code with the data page
checksum algorithm; but really we should just use whatever offers the
best trade-off in terms of complexity, performance, and error detection
rate.

I don't think we need to decide all of this right now. Personally, I'm
satisfied having SIMD checksums on data pages now and leaving WAL
optimization until later.

Regards,
Jeff Davis







Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Ants Aasma
On Sat, Apr 13, 2013 at 12:38 AM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote:
 I think this is a bad idea. It complicates the WAL format significantly.
 Simon's patch didn't include the changes to recovery to validate the
 checksum, but I suspect it would be complicated. And it reduces the
 error-detection capability of WAL recovery. Keep in mind that unlike
 page checksums, which are never expected to fail, so even if we miss a
 few errors it's still better than nothing, the WAL checkum is used to
 detect end-of-WAL. There is expected to be a failure every time we do
 crash recovery. This far, we've considered the probability of one in
 1^32 small enough for that purpose, but IMHO one in 1^16 is much too weak.

 One thing that just occurred to me is that we could make the SIMD
 checksum a 32-bit checksum, and reduce it down to 16 bits for the data
 pages. That might give us more flexibility to later use it for WAL
 without compromising on the error detection nearly as much (though
 obviously that wouldn't work with Simon's current proposal which uses
 the same data page checksum in a WAL backup block).

The simple 32bit version of the algorithm would need CPU capability
checks for the fast version and would work only on CPUs produced in
the last few years. Not a show stopper, but more complex code and
less applicability for sure.

An alternative would be to calculate two 16-bit checksums, concatenate
them for the 32-bit checksum and add them for the 16-bit one. In this
case we wouldn't need to change the current algorithm. A future change
could just factor everything up to the last add out into a common
function. But keep in mind that we are talking about sharing about 400
bytes of machine code here.
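
In code the combination is trivial; in the sketch below checksum_a and
checksum_b stand for two runs of the existing 16-bit algorithm (e.g.
with different seeds):

#include <stdint.h>

/* Concatenate the two 16-bit sums for a 32-bit (e.g. WAL) checksum... */
static uint32_t
toy_combine32(uint16_t checksum_a, uint16_t checksum_b)
{
    return ((uint32_t) checksum_a << 16) | checksum_b;
}

/* ...and add them for the 16-bit page checksum, as described above. */
static uint16_t
toy_combine16(uint16_t checksum_a, uint16_t checksum_b)
{
    return (uint16_t) (checksum_a + checksum_b);
}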

 In general, we have more flexibility with WAL because there is no
 upgrade issue. It would be nice to share code with the data page
 checksum algorithm; but really we should just use whatever offers the
 best trade-off in terms of complexity, performance, and error detection
 rate.

 I don't think we need to decide all of this right now. Personally, I'm
 satisfied having SIMD checksums on data pages now and leaving WAL
 optimization until later.

+1

I feel quite uneasy about reducing the effectiveness of WAL end
detection. There are many ways to improve WAL performance and I have
no idea what would be the best one. At the very least some performance
tests are in order. As this is not an essential part of having usable
checksums, but a general performance optimization, I feel that it is
not fair to others to postpone the release to resolve this now. I'd be
more than happy to research this for 9.4.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-11 Thread Simon Riggs
On 11 April 2013 04:27, Jeff Davis pg...@j-davis.com wrote:

 On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote:

  OK, so we have a single combined calculate a checksum for a block
  function. That uses Jeff's zeroing trick and Ants' bulk-oriented
  performance optimization.
 
 
  For buffer checksums we simply calculate for the block.

 Sounds good.

  For WAL full page writes, we first set the checksums for all defined
  buffers, then calculate the checksum of remaining data plus the
  pd_checksum field from each block using the normal WAL CRC32.
 
  Seems good to me. One set of fast code. And it avoids the weirdness
  that the checksum stored on the full page is actually wrong.

 Oh, that's a nice benefit.


So, if we apply a patch like the one attached, we then end up with the WAL
checksum using the page checksum as an integral part of its calculation.
(There is no increase in code inside WALInsertLock, nothing at all touched
in that area).

Then all we need to do is make PageSetChecksumInplace() use Ants' algo and
we're done.

Only point worth discussing is that this change would make backup blocks be
covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
header is covered by a CRC32 but the backup blocks only by 16-bit.

(Attached patch is discussion only. Checking checksum in recovery isn't
coded at all.)

Thoughts?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


make_wal_records_use_page_checksums.v0.patch
Description: Binary data



Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Ants Aasma
On Wed, Apr 10, 2013 at 4:36 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-04-09 at 05:35 +0300, Ants Aasma wrote:
 And here you go. I decided to be verbose with the comments as it's
 easier to delete a comment than to write one. I also left in a huge jumble
 of macros to calculate the contents of a helper var during compile
 time. This can easily be replaced with the calculated values once we
 settle on specific parameters.

 Great, thank you.

 Is it possible to put an interface over it that somewhat resembles the
 CRC checksum (INIT/COMP/FIN)? It looks a little challenging because of
 the nature of the algorithm, but it would make it easier to extend to
 other places (e.g. WAL). It doesn't have to match the INIT/COMP/FIN
 pattern exactly.

The algorithm has 128 bytes of state. Storing it on every step would
negate any performance gains and C doesn't have a way to keep it in
registers. If we can trust that the compiler doesn't clobber xmm
registers then it could be split up into the following pieces:
1. init
2. process 128 bytes
3. aggregate state
4. mix in block number

Even if we don't split it up, factoring out steps 1..3 would make
sense, as there is no point in making step 4 platform specific; right
now it is just duplicated.
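
As a sketch, a split interface along those lines might look like this
(names invented for illustration; the cost is that the 128 bytes of
state get spilled to memory between calls, which is the concern above):

#include <stdint.h>
#include <stddef.h>

/* 128 bytes of running state, e.g. 64 16-bit lanes. */
typedef struct ToyChecksumState
{
    uint16_t    lane[64];
} ToyChecksumState;

/* Step 1: initialise the lanes. */
extern void toy_checksum_init(ToyChecksumState *state);

/* Step 2: consume len bytes; len is a multiple of 128 here. */
extern void toy_checksum_block(ToyChecksumState *state,
                               const void *data, size_t len);

/* Steps 3 and 4: aggregate the lanes and mix in the block number. */
extern uint16_t toy_checksum_finish(ToyChecksumState *state, uint32_t blkno);

Only step 2 would need platform-specific (SSE2) variants; the rest can
stay portable, which is the factoring suggested above.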

 Regardless, we should have some kind of fairly generic interface and
 move the code to its own file (e.g. checksum.c).

 To make the interface more generic, would it make sense to require the
 caller to save the page's stored checksum and zero it before
 calculating? That would avoid the awkwardness of avoiding the
 pd_checksum field. For example (code for illustration only):

Yes, that would help make it reusable.

 That would make it possible to use a different word size -- is uint16
 optimal or would a larger word be more efficient?

Larger words would have better mixing as multiplies mix 4 bytes at a
time instead of 2. Performance of the vectorized version will be the
same, as it is tied to the vector length, but the unvectorized version
will get a speedup. The reason I picked 16 bits is not actually related
to the checksum hole but because the pmullw instruction is guaranteed
to be available on all 64-bit CPUs whereas pmulld is only available on
the latest CPUs.

 It looks like the block size needs to be an even multiple of
 sizeof(uint16)*NSUMS. And it also look like it's hard to combine
 different regions of memory into the same calculation (unless we want to
 just calculate them separately and XOR them or something). Does that
 mean that this is not suitable for WAL at all?

I think it would be possible to define a padding scheme for
irregularly sized memory segments where we would only need a lead-out
command for blocks that are not a multiple of 128 bytes. The
performance of it would need to be measured. All-in-all, it's not
really a great match for WAL. While all of the fast checksums process
many bytes in a single iteration, they still process an order of
magnitude fewer bytes and so have an easier time with irregularly
shaped blocks.
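
A possible lead-out along those lines, building on the split interface
sketched earlier in this mail (purely illustrative):

#include <string.h>

/* Pad a final partial chunk with zeroes so the 128-byte main loop can
 * also absorb irregularly sized segments (tail_len < 128 assumed). */
static void
toy_checksum_tail(ToyChecksumState *state, const char *tail, size_t tail_len)
{
    char    padded[128];

    memset(padded, 0, sizeof(padded));
    memcpy(padded, tail, tail_len);
    toy_checksum_block(state, padded, sizeof(padded));
}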

 Using SIMD for WAL is not a requirement at all; I just thought it might
 be a nice benefit for non-checksum-enabled users in some later release.

I think we should first deal with using it for page checksums and if
future versions want to reuse some of the code for WAL checksums then
we can rearrange the code.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



