Re: [HACKERS] Enabling Checksums
On 3/18/13 10:52 AM, Bruce Momjian wrote:
> With a potential 10-20% overhead, I am unclear who would enable this at
> initdb time.

If you survey people who are running PostgreSQL on "cloud" hardware, be it
Amazon's EC2 or similar options from other vendors, you will find a high
percentage of them would pay quite a bit of performance to make their
storage more reliable.  To pick one common measurement for popularity, a
Google search on "ebs corruption" returns 17 million hits.  To quote one of
those, Baron Schwartz of Percona talking about MySQL on EC2:

"BTW, I have seen data corruption on EBS volumes. It’s not clear whether it
was InnoDB’s fault (extremely unlikely IMO), the operating system’s fault,
EBS’s fault, or something else."

http://www.mysqlperformanceblog.com/2011/08/04/mysql-performance-on-ec2ebs-versus-rds/

*That* uncertainty is where a lot of the demand for this feature is coming
from.  People deploy into the cloud, their data gets corrupted, and no one
can tell them what/why/how it happened.  And that means they don't even know
what to change to make it better.  The only people I see really doing
something about this problem all seem years off, and I'm not sure they are
going to help--especially since some of them are targeting "enterprise"
storage rather than the cloud-style installations.

> I assume a user would wait until they suspected corruption to turn it on,
> and because it is only initdb-enabled, they would have to dump/reload
> their cluster.  The open question is whether this is a usable feature as
> written, or whether we should wait until 9.4.

The reliability issues of both physical and virtual hardware are so widely
known that many people will deploy with this on as their default
configuration.

If you don't trust your existing data, you can't retroactively check it.  A
checksum of an already corrupt block is useless.  Accordingly, there is no
use case for converting an installation with real or even suspected
problems to a checksummed one.
If you wait until you suspect corruption to care about checksums, it's
really too late.  There is only one available next step: you must do a dump
to figure out what's readable.  That is the spot that all of the incoming
data recovery customers we see at 2ndQuadrant are already in when we're
called.  The cluster is suspicious, sometimes they can get data out of it
with a dump, and if we hack up their install we can usually recover a bit
more than they could.

After the data from a partially corrupted database is dumped, someone who
has just been through that pain might decide they should turn checksums on
when they restore the dump.  When it's on, they can assess future damage
easily at the block level when it happens, and possibly repair it without
doing a full dump/reload.  What's implemented in the feature we're talking
about has a good enough UI to handle this entire cycle I see damaged
installations go through.

In fact, this feature is going to need pg_upgrade changes to detect from
pg_controldata that the old/new clusters have the same checksum setting.  I
think that's done already, but it's certainly something to test out too.

Good questions, Bruce; I don't think the reasons behind this feature's
demand have been highlighted very well before.  I try not to spook the
world by talking regularly about how many corrupt PostgreSQL databases I've
seen, but they do happen.  Most of my regular ranting on crappy SSDs that
lie about writes comes from a TB-scale PostgreSQL install that got
corrupted due to the write-cache flaws of the early Intel SSDs--twice.
They would have happily lost even the worst-case 20% of regular performance
to avoid going down for two days each time they saw corruption, where we
had to dump/reload to get them going again.  If the install had checksums,
I could have figured out which blocks were damaged and manually fixed them.
Without checksums, there's no way to even tell for sure what is broken.
-- 
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Simon Riggs writes:
> On 24 April 2013 21:06, Jeff Davis wrote:
>> What goal are you trying to accomplish with this patch?
>
> That we might need to patch the checksum version on a production release.

I don't actually buy that argument, certainly not as something that could
happen in 9.3.  I'm inclined to think we should forget about this until we
have a concrete use-case for it.  As Jeff says, there is no need for
pg_control contents to be compatible across major releases, so there's no
harm in waiting if we have any doubts about how it ought to work.

			regards, tom lane
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-24 at 21:09 +0100, Simon Riggs wrote:
> On 24 April 2013 21:06, Jeff Davis wrote:
>> What goal are you trying to accomplish with this patch?
>
> That we might need to patch the checksum version on a production release.

Oh, I see.

I don't think we need two output fields from pg_controldata though.  It's a
little redundant, and confused me when I was looking at the impact on
pg_upgrade.  And it means nothing to the user until we actually have
multiple algorithms available, at which time we are better off with a text
representation.

Other than that, I think your patch is fine to accomplish the
aforementioned goal.  Essentially, it just changes the bool to a uint32,
which I favor.

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On 24 April 2013 21:06, Jeff Davis wrote:
> What goal are you trying to accomplish with this patch?

That we might need to patch the checksum version on a production release.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-24 at 08:20 +0100, Simon Riggs wrote:
> On 24 April 2013 01:10, Jeff Davis wrote:
>> I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
>> checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
>> a new algorithm without forcing existing users to change.
>
> That's exactly what the patch does.

The word "version" indicates an order to it though, like N+1 is always
preferable to N.  This is user-facing (through pg_controldata output),
otherwise I wouldn't mind.

>> initdb would have to take the code as an option, probably in string
>> form.
>
> When/if we have multiple options we can add that. The main thing was
> to make sure the control file recorded things in a common way.

The main strange thing to me is that we're still using enabled/disabled for
the output of pg_controldata as well as the "version".  When we do have
multiple options, it seems like we'd just have one field output:

Data page checksums: none|crc32c|pg-fnv

What goal are you trying to accomplish with this patch?  pg_control doesn't
need to be compatible between releases, so can't we just add this later
when we really do have multiple options?

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On 24 April 2013 01:10, Jeff Davis wrote:
> On Tue, 2013-04-23 at 16:28 +0100, Simon Riggs wrote:
>> * make the pg_control.data_checksums field into a version number, for
>> future flexibility... patch attached
>
> Commenting on this separately because it's a separate issue.
>
> I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
> checksum, 1 for FNV-1a-SR3, etc. That would allow us to release 9.4 with
> a new algorithm without forcing existing users to change.

That's exactly what the patch does.

> initdb would have to take the code as an option, probably in string
> form.

When/if we have multiple options we can add that.  The main thing was to
make sure the control file recorded things in a common way.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Tue, 2013-04-23 at 16:28 +0100, Simon Riggs wrote:
> * make the pg_control.data_checksums field into a version number, for
> future flexibility... patch attached

Commenting on this separately because it's a separate issue.

I'd prefer that it was some kind of a checksum ID code -- e.g. 0 for no
checksum, 1 for FNV-1a-SR3, etc.  That would allow us to release 9.4 with a
new algorithm without forcing existing users to change.

initdb would have to take the code as an option, probably in string form.

What do you think?

Regards,
	Jeff Davis
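To make the proposal concrete, here is a minimal sketch of what such an
ID-code scheme plus a text representation could look like.  The constant
names, the string labels, and the function are hypothetical illustrations,
not actual PostgreSQL identifiers:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: the boolean data_checksums field in pg_control
 * becomes a small integer code, 0 meaning "no checksums" and each
 * nonzero value naming one specific algorithm.
 */
#define CHECKSUM_ALG_NONE       0
#define CHECKSUM_ALG_FNV1A_SR3  1   /* e.g. the FNV-1a + shift variant */

/* what a pg_controldata-style text output could print for each code */
const char *
checksum_code_name(uint32_t code)
{
    switch (code)
    {
        case CHECKSUM_ALG_NONE:      return "none";
        case CHECKSUM_ALG_FNV1A_SR3: return "fnv-1a-sr3";
        default:                     return "unknown";
    }
}
```

With a mapping like this, initdb could accept the string form and translate
it back to a code, so adding an algorithm later is just a new constant plus
a new name.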
Re: [HACKERS] Enabling Checksums
On 23 April 2013 02:35, Jeff Davis wrote:
> On Tue, 2013-04-23 at 01:08 +0300, Ants Aasma wrote:
>> A slight delay, but here it is. I didn't lift the checksum part into a
>> separate file as I didn't have a great idea what I would call it. The
>> code is reasonably compact so I don't see a great need for this right
>> now. It would be more worth the effort when/if we add non-generic
>> variants. I'm not particularly attached to the method I used to mask
>> out pd_checksum field, this could be improved if someone has a better
>> idea how to structure the code.
>
> Thank you. A few initial comments:
>
> I have attached (for illustration purposes only) a patch on top of yours
> that divides the responsibilities a little more cleanly.
>
> * easier to move into a separate file, and use your recommended compiler
>   flags without affecting other routines in bufpage.c
> * makes the checksum algorithm itself simpler
> * leaves the data-page-specific aspects (mixing in the page number,
>   ignoring pd_checksum, reducing to 16 bits) to PageCalcChecksum16
> * overall easier to review and understand
>
> I'm not sure what we should call the separate file or where we should
> put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
> there a clean way to override the compiler flags for a single file so we
> don't need to put it in its own directory?

OK, I like that a lot better and it seems something I could commit.

I suggest the following additional changes...

* put the README stuff directly in the checksum.c file

* I think we need some external links that describe this algorithm, and we
  need comments that explain what we know about this in terms of detection
  capability and why it was chosen against others

* we need some comments/analysis about whether the coding causes a problem
  if vectorization is *not* available

* make the pg_control.data_checksums field into a version number, for
  future flexibility... patch attached

* rename the routine away from checksum_fnv so it's simply a generic
  checksum call - more modular.  That way all knowledge of the algorithm is
  in one file only.  If we do need to change the algorithm in the future we
  can more easily support multiple versions.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

checksums_version.v1.patch
Description: Binary data
Re: [HACKERS] Enabling Checksums
Jeff Davis wrote:
> I'm not sure what we should call the separate file or where we should
> put it, though. How about src/backend/utils/checksum/checksum_fnv.c? Is
> there a clean way to override the compiler flags for a single file so we
> don't need to put it in its own directory?

Sure, see src/backend/parser/Makefile about gram.o.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
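For readers who don't want to dig through the parser Makefile, the general
shape of a per-file flag override in GNU make is a target-specific variable.
This is an illustrative sketch only (the file name is hypothetical, and the
actual backend Makefiles may structure it differently):

```make
# Hypothetical sketch: append vectorization flags for one object file
# only, in the spirit of the special handling gram.o receives.
checksum_fnv.o: CFLAGS += -msse4.1 -funroll-loops -ftree-vectorize
```

Because the variable assignment is scoped to the one target, the rest of
the directory builds with the ordinary flags.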
Re: [HACKERS] Enabling Checksums
On Tue, 2013-04-23 at 01:08 +0300, Ants Aasma wrote:
> A slight delay, but here it is. I didn't lift the checksum part into a
> separate file as I didn't have a great idea what I would call it. The
> code is reasonably compact so I don't see a great need for this right
> now. It would be more worth the effort when/if we add non-generic
> variants. I'm not particularly attached to the method I used to mask
> out pd_checksum field, this could be improved if someone has a better
> idea how to structure the code.

Thank you.  A few initial comments:

I have attached (for illustration purposes only) a patch on top of yours
that divides the responsibilities a little more cleanly.

* easier to move into a separate file, and use your recommended compiler
  flags without affecting other routines in bufpage.c
* makes the checksum algorithm itself simpler
* leaves the data-page-specific aspects (mixing in the page number,
  ignoring pd_checksum, reducing to 16 bits) to PageCalcChecksum16
* overall easier to review and understand

I'm not sure what we should call the separate file or where we should put
it, though.  How about src/backend/utils/checksum/checksum_fnv.c?  Is there
a clean way to override the compiler flags for a single file so we don't
need to put it in its own directory?

Regards,
	Jeff Davis

*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 23,28 ****
  static char pageCopyData[BLCKSZ];	/* for checksum calculation */
--- 23,29 ----
  static Page pageCopy = pageCopyData;
  
  static uint16 PageCalcChecksum16(Page page, BlockNumber blkno);
+ static uint32 checksum_fnv(char *data, uint32 size);
  
  /*
   * Page support functions
***************
*** 986,1017 ****
  static uint16
  PageCalcChecksum16(Page page, BlockNumber blkno)
  {
! 	uint32		sums[N_SUMS];
! 	uint32		(*pageArr)[N_SUMS] = (uint32 (*)[N_SUMS]) page;
! 	uint32		result = blkno;
! 	int			i, j;
! 	int			pd_checksum_word = offsetof(PageHeaderData, pd_checksum)/sizeof(uint32);
! 	int			pd_checksum_half = (offsetof(PageHeaderData, pd_checksum) % sizeof(uint32)) / sizeof(uint16);
  
  	/* only calculate the checksum for properly-initialized pages */
  	Assert(!PageIsNew(page));
  
  	/* initialize partial checksums to their corresponding offsets */
  	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
  
! 	/* first iteration needs to mask out pd_checksum field itself with zero */
! 	for (j = 0; j < N_SUMS; j++)
! 	{
! 		uint32 value = pageArr[0][j];
! 		if (j == pd_checksum_word)
! 			((uint16*) &value)[pd_checksum_half] = 0;
! 		CHECKSUM_COMP(sums[j], value);
! 	}
! 
! 	/* now add in the rest of the page */
! 	for (i = 1; i < BLCKSZ/sizeof(uint32)/N_SUMS; i++)
  		for (j = 0; j < N_SUMS; j++)
! 			CHECKSUM_COMP(sums[j], pageArr[i][j]);
  
  	/* finally add in one round of zeroes for one more layer of mixing */
  	for (j = 0; j < N_SUMS; j++)
--- 987,1035 ----
  static uint16
  PageCalcChecksum16(Page page, BlockNumber blkno)
  {
! 	PageHeader	phdr = (PageHeader) page;
! 	uint16		save_checksum;
! 	uint32		fnv_checksum;
  
  	/* only calculate the checksum for properly-initialized pages */
  	Assert(!PageIsNew(page));
  
+ 	/*
+ 	 * Save pd_checksum and set it to zero, so that the checksum calculation
+ 	 * isn't affected by the checksum stored on the page.
+ 	 */
+ 	save_checksum = phdr->pd_checksum;
+ 	phdr->pd_checksum = 0;
+ 	fnv_checksum = checksum_fnv(page, BLCKSZ);
+ 	phdr->pd_checksum = save_checksum;
+ 
+ 	/* mix in the block number to detect transposed pages */
+ 	fnv_checksum ^= blkno;
+ 
+ 	/*
+ 	 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
+ 	 * one. That avoids checksums of zero, which seems like a good idea.
+ 	 */
+ 	return (fnv_checksum % 65535) + 1;
+ }
+ 
+ static uint32
+ checksum_fnv(char *data, uint32 size)
+ {
+ 	uint32		sums[N_SUMS];
+ 	uint32		(*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;
+ 	uint32		result;
+ 	int			i, j;
+ 
+ 	Assert((size % (sizeof(uint32)*N_SUMS)) == 0);
+ 
  	/* initialize partial checksums to their corresponding offsets */
  	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
  
! 	/* main checksum calculation */
! 	for (i = 0; i < size/sizeof(uint32)/N_SUMS; i++)
  		for (j = 0; j < N_SUMS; j++)
! 			CHECKSUM_COMP(sums[j], dataArr[i][j]);
  
  	/* finally add in one round of zeroes for one more layer of mixing */
  	for (j = 0; j < N_SUMS; j++)
***************
*** 1021,1026 ****
  	for (i = 0; i < N_SUMS; i++)
  		result ^= sums[i];
  
! 	/* use mod mapping to map the value into 16 bits and offset from zero */
! 	return (result % 65535) + 1;
  }
--- 1039,1043 ----
  	for (i = 0; i < N_SUMS; i++)
  		result ^= sums[i];
  
! 	return result;
  }
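One detail of the refactored patch worth calling out is the final reduction
step in PageCalcChecksum16.  Pulled out as a standalone sketch (mirroring
the patch's `(fnv_checksum % 65535) + 1`, with an illustrative function
name):

```c
#include <stdint.h>

/*
 * Mirror of the patch's reduction step: map a 32-bit checksum into the
 * range 1..65535 so it fits the 16-bit pd_checksum field.  Because the
 * result can never be zero, a stored checksum of zero remains
 * distinguishable from any real checksum.
 */
uint16_t
reduce_checksum_to_16bit(uint32_t fnv_checksum)
{
    return (uint16_t) ((fnv_checksum % 65535) + 1);
}
```

Note that the modulus is 65535 (not 65536), so all 32-bit inputs land on
1..65535 and the "+ 1" cannot overflow the field.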
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 10:57 PM, Ants Aasma wrote:
> On Mon, Apr 22, 2013 at 10:54 PM, Simon Riggs wrote:
>> On 22 April 2013 20:32, Florian Pflug wrote:
>>> Assuming that we only ship a plain C implementation with 9.3, what
>>> are we missing on that front? The C implementation of FNV1+SHIFT is
>>> only a few dozen lines or so.
>>
>> Forgive me, I can't seem to locate the patch for this? Re-post please,
>> just for clarity.
>
> Not posted yet. I'm writing it as we speak. Will post within half an
> hour or so.

A slight delay, but here it is.  I didn't lift the checksum part into a
separate file as I didn't have a great idea what I would call it.  The code
is reasonably compact so I don't see a great need for this right now.  It
would be more worth the effort when/if we add non-generic variants.  I'm
not particularly attached to the method I used to mask out the pd_checksum
field; this could be improved if someone has a better idea how to structure
the code.

I confirmed with objdump that compiling on GCC 4.7 with -msse4.1
-funroll-loops -ftree-vectorize does in fact vectorize that loop.  A simple
way to verify:

    objdump -d src/backend/storage/page/bufpage.o | grep pmulld | wc -l

should output 16.

Unfortunately I can't work on this patch for about a week.  PostgreSQL 9.3
will have to wait for me as I need to tend to the release of Ants v2.0.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

parallel-fnv-checksum.patch
Description: Binary data
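For readers following along without the attachment, here is a hedged sketch
of the general shape of a parallel FNV-1a + shift checksum as discussed in
this thread.  The seed values, the shift amount, and the exact mixing step
are illustrative stand-ins, not necessarily what the attached patch uses:

```c
#include <stdint.h>
#include <stddef.h>

#define N_SUMS     32
#define FNV_PRIME  16777619u

/* one FNV-1a round plus a shift-xor to push changes into the low bits */
#define CHECKSUM_COMP(sum, value) \
    do { \
        uint32_t t_ = (sum) ^ (value); \
        (sum) = t_ * FNV_PRIME ^ (t_ >> 17); \
    } while (0)

/* nwords must be a multiple of N_SUMS */
uint32_t
parallel_fnv(const uint32_t *data, size_t nwords)
{
    uint32_t sums[N_SUMS];
    uint32_t result = 0;
    int      j;

    /* arbitrary distinct per-lane seeds (illustrative only) */
    for (j = 0; j < N_SUMS; j++)
        sums[j] = 0x9E3779B9u + j;

    /*
     * Each of the N_SUMS lanes checksums every N_SUMS-th word.  The lanes
     * carry no dependencies on each other, which is what lets a compiler
     * vectorize the inner loop.
     */
    for (size_t i = 0; i < nwords; i += N_SUMS)
        for (j = 0; j < N_SUMS; j++)
            CHECKSUM_COMP(sums[j], data[i + j]);

    /* one extra round of zeroes for a final layer of mixing */
    for (j = 0; j < N_SUMS; j++)
        CHECKSUM_COMP(sums[j], 0);

    for (j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}
```

The structure matches the description in the thread: independent lanes, a
final zero round, then an XOR fold of the lane sums.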
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 10:54 PM, Simon Riggs wrote:
> On 22 April 2013 20:32, Florian Pflug wrote:
>> Assuming that we only ship a plain C implementation with 9.3, what
>> are we missing on that front? The C implementation of FNV1+SHIFT is
>> only a few dozen lines or so.
>
> Forgive me, I can't seem to locate the patch for this? Re-post please,
> just for clarity.

Not posted yet.  I'm writing it as we speak.  Will post within half an hour
or so.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Enabling Checksums
On 22 April 2013 20:32, Florian Pflug wrote:
> Assuming that we only ship a plain C implementation with 9.3, what
> are we missing on that front? The C implementation of FNV1+SHIFT is
> only a few dozen lines or so.

Forgive me, I can't seem to locate the patch for this?  Re-post please,
just for clarity.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Apr 22, 2013, at 21:14, Jeff Davis wrote:
> On Mon, 2013-04-22 at 20:04 +0200, Florian Pflug wrote:
>> The one downside of the fnv1+shift approach is that it's built around
>> the assumption that processing 64 bytes at once is the sweet spot. That
>> might be true for x86 and x86_64 today, but it won't stay that way for
>> long, and quite surely isn't true for other architectures. That doesn't
>> necessarily rule it out, but it certainly weakens the argument that
>> slipping it into 9.3 avoids having to change the algorithm later...
>
> I think you are setting the bar way too high. Right now, we have a slow
> algorithm. According to Ants's tests, FNV-1a is much, much faster. Do
> you think that it will still be such a bottleneck that we will want to
> change it again later for purely performance reasons?

To clarify, it wasn't my intent to argue against shipping FNV1+SHIFT in 9.3
- in fact I'd like to see us do exactly that.  I was merely trying to be
unbiased, and hence stated not only the arguments in favour of FNV1+SHIFT
(the ones about CRC's theoretical advantages in error detection not really
being relevant to us), but also the one downside of FNV1+SHIFT.  Seems like
I could have done a better job expressing myself, though.

> The biggest problem now is getting one of these faster algorithms (FNV
> or even a faster CRC) into shape that is acceptable to
> reviewers/committers. If we don't do that, we will be missing out on a
> lot of potential checksum users for whom the existing CRC algorithm is
> just too slow.

Assuming that we only ship a plain C implementation with 9.3, what are we
missing on that front?  The C implementation of FNV1+SHIFT is only a few
dozen lines or so.

best regards,
Florian Pflug
Re: [HACKERS] Enabling Checksums
On Mon, 2013-04-22 at 19:25 +0300, Ants Aasma wrote:
> I was just now writing up a generic C based patch based on the
> parallel FNV-1a + shift that we discussed with Florian with an added
> round of mixing. Testing the performance in isolation indicates that:
> 1) it is about an order of magnitude faster than the Sarwate CRC
>    method used in PostgreSQL.
> 2) it is about 2x faster than the fastest software based CRC method.
> 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation
>    options the performance improves 5x (within 20% of handcoded ASM).

That's great news!  This means that we can have a simple C implementation
in a separate file, and pass a few build flags when compiling just that
file (so it doesn't affect other code).  That should make
reviewers/committers happy (including me).

FWIW, that was my last real concern about FNV (reviewability).  I'm not
worried about the performance based on your analysis; nor am I worried
about the error detection rate.

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 3:14 PM, Jeff Davis wrote:
> The biggest problem now is getting one of these faster algorithms (FNV
> or even a faster CRC) into shape that is acceptable to
> reviewers/committers. If we don't do that, we will be missing out on a
> lot of potential checksum users for whom the existing CRC algorithm is
> just too slow.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On Mon, 2013-04-22 at 20:04 +0200, Florian Pflug wrote:
> The one downside of the fnv1+shift approach is that it's built around
> the assumption that processing 64 bytes at once is the sweet spot. That
> might be true for x86 and x86_64 today, but it won't stay that way for
> long, and quite surely isn't true for other architectures. That doesn't
> necessarily rule it out, but it certainly weakens the argument that
> slipping it into 9.3 avoids having to change the algorithm later...

I think you are setting the bar way too high.  Right now, we have a slow
algorithm.  According to Ants's tests, FNV-1a is much, much faster.  Do you
think that it will still be such a bottleneck that we will want to change
it again later for purely performance reasons?

The only time this is likely to matter is in the situation Greg Smith
describes, where shared buffers is much smaller than memory, and the
working set of buffers is near the size of memory (in other words, a lot of
buffers moving to and from shared memory, but not much to or from disk).
And it's already significantly faster than the algorithm in the original
tests (Fletcher), so it's not clear that it's still even a serious problem.
(Also remember that checksum users already accept a WAL penalty.)

The biggest problem now is getting one of these faster algorithms (FNV or
even a faster CRC) into shape that is acceptable to reviewers/committers.
If we don't do that, we will be missing out on a lot of potential checksum
users for whom the existing CRC algorithm is just too slow.

Regards,
	Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 9:04 PM, Florian Pflug wrote: > The one downside of the fnv1+shift approach is that it's built around > the assumption that processing 64-bytes at once is the sweet spot. That > might be true for x86 and x86_64 today, but it won't stay that way for > long, and quite surely isn't true for other architectures. That doesn't > necessarily rule it out, but it certainly weakens the argument that > slipping it into 9.3 avoids having the change the algorithm later... It's actually 128 bytes as it was tested. The ideal shape depends on multiplication latency, multiplication throughput and amount of registers available. Specifically BLCKSZ/mul_throughput_in_bytes needs to be larger than BLCKSZ/(N_SUMS*sizeof(uint32))*(mul latency + 2*xor latency). For latest Intel the values are 8192/16 = 512 and 8192/(32*4)*(5 + 2*1) = 448. 128 bytes is also 8 registers which is the highest power of two fitting into architectural registers (16). This means that the value chosen is indeed the sweet spot for x86 today. For future processors we can expect the multiplication width to increase and possibly the latency too shifting the sweet spot into higher widths. In fact, Haswell coming out later this year should have AVX2 instructions that introduce integer ops on 256bit registers, making the current choice already suboptimal. All that said, having a lower width won't make the algorithm slower on future processors, it will just leave some parallelism on the table that could be used to make it even faster. The line in the sand needed to be drawn somewhere, I chose the maximum comfortable width today fearing that even that would be shot down based on code size. Coincidentally 32 elements is also the internal parallelism that GPUs have settled on. We could bump the width up by one notch to buy some future safety, but after that I'm skeptical we will see any conventional processors that would benefit from a higher width. 
I just tested that the auto-vectorized version runs at basically identical speed, as GCC's inability to do good register allocation means that it juggles values between registers and the stack one way or the other. So to recap, I don't know of any CPUs where a lower value would be better. Raising the width by one notch would mean better performance on future processors, but raising it further would just bloat the size of the inner loop without much benefit in sight. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
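The lane structure Ants describes (N_SUMS parallel 32-bit FNV-1a accumulators with a shift-based mixing step, consuming N_SUMS words per iteration) can be sketched in plain C as below. This is an illustrative sketch only: the prime, the lane count, the shift amount and the final folding here are assumptions for demonstration, not the exact constants of the proposed patch.

```c
#include <stdint.h>
#include <stddef.h>

#define N_SUMS    32            /* 32 parallel uint32 lanes = 128 bytes per step */
#define FNV_PRIME 16777619u     /* the 32-bit FNV prime */

/* Parallel FNV-1a with a shift-based mix, in the shape discussed in this
 * thread.  Each lane consumes every N_SUMS-th word, so a vectorizing
 * compiler can map the inner loop onto wide SIMD multiplies and xors. */
uint32_t
parallel_fnv1a_shift(const uint32_t *data, size_t nwords)
{
    uint32_t sums[N_SUMS];
    uint32_t result = 0;
    size_t   i, j;

    for (j = 0; j < N_SUMS; j++)
        sums[j] = 0;

    for (i = 0; i + N_SUMS <= nwords; i += N_SUMS)
        for (j = 0; j < N_SUMS; j++)
        {
            uint32_t tmp = sums[j] ^ data[i + j];
            /* FNV-1a multiply plus a srl-1 style mixing xor */
            sums[j] = tmp * FNV_PRIME ^ (tmp >> 1);
        }

    /* Fold the lanes into one 32-bit aggregate; a real page checksum
     * would add extra mixing rounds and truncate to 16 bits. */
    for (j = 0; j < N_SUMS; j++)
        result ^= sums[j];
    return result;
}
```

Compiled with options like -O3 -ftree-vectorize, the inner loop is the part that maps onto SIMD multiply and xor instructions, which is where the width trade-off discussed above comes from.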
Re: [HACKERS] Enabling Checksums
On Apr22, 2013, at 18:25 , Ants Aasma wrote: > I was just now writing up a generic C based patch based on the > parallel FNV-1a + shift that we discussed with Florian with an added > round of mixing. Testing the performance in isolation indicates that: > 1) it is about an order of magnitude faster than the Sarwate CRC > method used in Postgresql. > 2) it is about 2x faster than fastest software based CRC method. > 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation > options the performance improves 5x. (within 20% of handcoded ASM) > > This leaves lingering doubts about the quality of the checksum. It's > hard if not impossible to prove absence of interesting patterns that > would trigger collisions. I do know the checksum quality is miles > ahead of the Fletcher sum originally proposed and during the last week > I haven't been able to think of a way to make the collision rate > significantly differ from CRC. Note though that CRCs may very well have similar "interesting" corruption patterns which don't cause the checksum to change. The only guarantee they really give is that those patterns will involve more than N-1 flipped bits, where N is the Hamming distance of the CRC. For 16-bit checksums, N can at most be 16 (since XOR-ing the data with a shifted version of the CRC polynomial will not cause the checksum to change). Thus, once more than two bytes on a page get corrupted, CRCs may not have any advantage over fnv1+shift or similar approaches. They may even work worse, since detecting some forms of corruption with 100% certainty means missing others with a probability of more than 2^-16. Some CRC polynomials for example detect all corruptions which affect an odd number of bits, but in turn have a probability of 2^-15 of missing ones which affect an even number of bits.
Since we're mostly attempting to protect against disk, not memory corruption here, I'm not convinced at all that errors in only a few bits are all that common, and certainly not that they are more likely than other forms of corruption. I'd expect, for example, that blocks of 512 bytes (i.e. one sector) suddenly reading 0 is at least as likely as a single flipped bit. The one downside of the fnv1+shift approach is that it's built around the assumption that processing 64-bytes at once is the sweet spot. That might be true for x86 and x86_64 today, but it won't stay that way for long, and quite surely isn't true for other architectures. That doesn't necessarily rule it out, but it certainly weakens the argument that slipping it into 9.3 avoids having to change the algorithm later... best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 04/22/2013 09:25 AM, Ants Aasma wrote: > This leaves lingering doubts about the quality of the checksum. It's > hard if not impossible to prove absence of interesting patterns that > would trigger collisions. I do know the checksum quality is miles > ahead of the Fletcher sum originally proposed and during the last week > I haven't been able to think of a way to make the collision rate > significantly differ from CRC. When we originally discussed this feature, we were potentially discussing a checksum algo which produced collisions for 1 out of 256 pages. That approach was considered acceptable, since it would be very unlikely for such a collision to occur across multiple corrupted pages, and fairly rare to have only one corrupted page. So my perspective is, if we're doing better than 1 in 256, it's good enough. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 6:33 PM, Andres Freund wrote: > I don't see us changing away from CRCs anymore either by now. But I > think at least changing the polynomial to something that > a) has higher error detection properties > b) can be noticeably sped up on a good part of the hardware pg is run on +1 for changing the polynomial if we stick with CRC, but I think the differences in error detection capability are mostly academic for the PostgreSQL usecase. Or does anyone have experience with seeing multiple random bit errors per page? > If we are feeling really adventurous we can switch to a faster CRC > implementation, there are enough ones around and I know that at least my > proposed patch from some years ago (which is by far not the fastest that > is doable) is in production usage some places. The faster CRC implementations just use parallel lookup tables to process more bytes at a time. Performance results from [1] show that doing 4 bytes in parallel will yield a 2.8x speedup, and 8 bytes in parallel yields another 1.7x on top of that at the cost of using an 8kB lookup table. And the end result is still over 3x slower than the code in the original patch, where Greg's performance results prompted me to look at what would have a lower overhead. [1] http://create.stephan-brumme.com/crc32/ Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
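For context, the byte-at-a-time (Sarwate) method being compared against looks roughly like the sketch below, here with the Castagnoli polynomial (CRC-32C, reflected form 0x82F63B78). The slice-by-4/8 variants mentioned in [1] extend the same idea with 4 or 8 lookup tables so that 4 or 8 input bytes are consumed per iteration. This is a generic illustration, not PostgreSQL's implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* One 256-entry table for the byte-at-a-time (Sarwate) method; the
 * slice-by-N variants precompute N such tables. */
static uint32_t crc32c_table[256];

static void
crc32c_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t crc = i;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (crc & 1 ? 0x82F63B78u : 0);
        crc32c_table[i] = crc;
    }
}

/* Conventional CRC-32C: 0xFFFFFFFF initial value and final inversion,
 * consuming one input byte per table lookup. */
static uint32_t
crc32c(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc32c_table[(crc ^ buf[i]) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this parameterization is crc32c("123456789") == 0xE3069283, which is a handy sanity check for any reimplementation.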
Re: [HACKERS] Enabling Checksums
On Mon, Apr 22, 2013 at 6:27 PM, Robert Haas wrote: > On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith wrote: >>> The more I read of this thread, the more unhappy I get. It appears that >>> the entire design process is being driven by micro-optimization for CPUs >>> being built by Intel in 2013. >> >> And that's not going to get anyone past review, since all the tests I've >> been doing the last two weeks are on how fast an AMD Opteron 6234 with OS >> cache >> shared_buffers can run this. The main thing I'm still worried >> about is what happens when you have a fast machine that can move memory >> around very quickly and an in-memory workload, but it's hamstrung by the >> checksum computation--and it's not a 2013 Intel machine. > > This is a good point. However, I don't completely agree with the > conclusion that we shouldn't be worrying about any of this right now. > While I agree with Tom that it's far too late to think about any > CPU-specific optimizations for 9.3, I have a lot of concern, based on > Ants's numbers, that we've picked a checksum algorithm which is hard > to optimize for performance. If we don't get that fixed for 9.3, > we're potentially looking at inflicting many years of serious > suffering on our user base. If we at least get the *algorithm* right > now, we can worry about optimizing it later. If we get it wrong, > we'll be living with the consequence of that for a really long time. I was just now writing up a generic C based patch based on the parallel FNV-1a + shift that we discussed with Florian with an added round of mixing. Testing the performance in isolation indicates that: 1) it is about an order of magnitude faster than the Sarwate CRC method used in Postgresql. 2) it is about 2x faster than the fastest software based CRC method. 3) by using -msse4.1 -funroll-loops -ftree-vectorize compilation options the performance improves 5x. (within 20% of handcoded ASM) This leaves lingering doubts about the quality of the checksum.
It's hard if not impossible to prove the absence of interesting patterns that would trigger collisions. I do know the checksum quality is miles ahead of the Fletcher sum originally proposed and during the last week I haven't been able to think of a way to make the collision rate significantly differ from CRC. > I wish that we had not scheduled beta quite so soon, as I am sure > there will be even more resistance to changing this after beta. But > I'm having a hard time escaping the conclusion that we're on the edge > of shipping something we will later regret quite deeply. Maybe I'm > wrong? It's unfortunate that this got delayed for so long. The performance side of the argument was clear a month ago. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 2013-04-22 11:27:25 -0400, Robert Haas wrote: > On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith wrote: > >> The more I read of this thread, the more unhappy I get. It appears that > >> the entire design process is being driven by micro-optimization for CPUs > >> being built by Intel in 2013. > > > > And that's not going to get anyone past review, since all the tests I've > > been doing the last two weeks are on how fast an AMD Opteron 6234 with OS > > cache >> shared_buffers can run this. The main thing I'm still worried > > about is what happens when you have a fast machine that can move memory > > around very quickly and an in-memory workload, but it's hamstrung by the > > checksum computation--and it's not a 2013 Intel machine. > > This is a good point. However, I don't completely agree with the > conclusion that we shouldn't be worrying about any of this right now. > While I agree with Tom that it's far too late to think about any > CPU-specific optimizations for 9.3, I have a lot of concern, based on > Ants's numbers, that we've picked a checksum algorithm which is hard > to optimize for performance. If we don't get that fixed for 9.3, > we're potentially looking at inflicting many years of serious > suffering on our user base. If we at least get the *algorithm* right > now, we can worry about optimizing it later. If we get it wrong, > we'll be living with the consequence of that for a really long time. > > I wish that we had not scheduled beta quite so soon, as I am sure > there will be even more resistance to changing this after beta. But > I'm having a hard time escaping the conclusion that we're on the edge > of shipping something we will later regret quite deeply. Maybe I'm > wrong? I don't see us changing away from CRCs anymore either by now. 
But I think at least changing the polynomial to something that a) has higher error detection properties and b) can be noticeably sped up on a good part of the hardware pg is run on would be worthwhile. If we are feeling really adventurous we can switch to a faster CRC implementation; there are enough ones around, and I know that at least my proposed patch from some years ago (which is by far not the fastest that is doable) is in production usage some places. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 8:21 PM, Greg Smith wrote: >> The more I read of this thread, the more unhappy I get. It appears that >> the entire design process is being driven by micro-optimization for CPUs >> being built by Intel in 2013. > > And that's not going to get anyone past review, since all the tests I've > been doing the last two weeks are on how fast an AMD Opteron 6234 with OS > cache >> shared_buffers can run this. The main thing I'm still worried > about is what happens when you have a fast machine that can move memory > around very quickly and an in-memory workload, but it's hamstrung by the > checksum computation--and it's not a 2013 Intel machine. This is a good point. However, I don't completely agree with the conclusion that we shouldn't be worrying about any of this right now. While I agree with Tom that it's far too late to think about any CPU-specific optimizations for 9.3, I have a lot of concern, based on Ants's numbers, that we've picked a checksum algorithm which is hard to optimize for performance. If we don't get that fixed for 9.3, we're potentially looking at inflicting many years of serious suffering on our user base. If we at least get the *algorithm* right now, we can worry about optimizing it later. If we get it wrong, we'll be living with the consequence of that for a really long time. I wish that we had not scheduled beta quite so soon, as I am sure there will be even more resistance to changing this after beta. But I'm having a hard time escaping the conclusion that we're on the edge of shipping something we will later regret quite deeply. Maybe I'm wrong? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 6:04 PM, Florian Weimer wrote: > The TCP checksum is too weak to be practical. Every now an then, I > see data transfers where the checksum is valid, but the content > contains bit flips. Well of course, it's only a 16-bit checksum. 64k packets isn't very many so if you're not counting checksum failures it won't take very long before one gets through. The purpose of the checksum is to notify you that you have a problem, not to block bad packets from getting through. > Anything that flips bits randomly at intervals > which are multiples of 16 bits is quite likely to pass through > checksum detection. I'm not sure about this -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
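The weakness Florian Weimer describes follows directly from the checksum's structure: TCP uses the ones'-complement sum of 16-bit words (RFC 1071), so any corruption that preserves that sum, such as reordering aligned 16-bit words, passes undetected. A minimal sketch of that structure (not the kernel's optimized implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 style ones'-complement Internet checksum over 16-bit words,
 * as used by TCP/IP.  Because it is a plain commutative sum, any
 * corruption that preserves the sum -- e.g. swapping two aligned 16-bit
 * words -- leaves the checksum unchanged. */
static uint16_t
inet_checksum(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += words[i];
    while (sum >> 16)                  /* fold end-around carries */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t) ~sum;
}
```

Swapping any two 16-bit words of a payload yields the same checksum, which is one concrete instance of corruption "at intervals which are multiples of 16 bits" slipping through.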
Re: [HACKERS] Enabling Checksums
On 18.04.2013, at 20:02, Ants Aasma wrote: > On Thu, Apr 18, 2013 at 8:24 PM, Ants Aasma wrote: >> On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug wrote: >>> So either the CRC32-C polynomial isn't irreducible, or there's something >>> fishy going on. Could there be a bug in your CRC implementation? Maybe >>> a mixup between big and little endian, or something like that? >> >> I'm suspecting an implementation bug myself. I already checked the >> test harness and that was all sane, compiler hadn't taken any >> unforgivable liberties there. I will crosscheck the output with other >> implementations to verify that the checksum is implemented correctly. > > Looks like the implementation is correct. I cross-referenced it > against a bitwise algorithm for crc32 with the castagnoli polynomial. > This also rules out any endianness issues as the bitwise variant > consumes input one byte at a time. > > Whatever it is, it is something specific to the PostgreSQL page layout. > If I use /dev/urandom as the source the issue disappears. So much for > CRC32 being proven good. Weird. Is the code of your test harness available publicly, or could you post it? I'd like to look into this... best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 8:24 PM, Ants Aasma wrote: > On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug wrote: >> So either the CRC32-C polynomial isn't irreducible, or there's something >> fishy going on. Could there be a bug in your CRC implementation? Maybe >> a mixup between big and little endian, or something like that? > > I'm suspecting an implementation bug myself. I already checked the > test harness and that was all sane, compiler hadn't taken any > unforgivable liberties there. I will crosscheck the output with other > implementations to verify that the checksum is implemented correctly. Looks like the implementation is correct. I cross-referenced it against a bitwise algorithm for crc32 with the castagnoli polynomial. This also rules out any endianness issues as the bitwise variant consumes input one byte at a time. Whatever it is, it is something specific to the PostgreSQL page layout. If I use /dev/urandom as the source the issue disappears. So much for CRC32 being proven good. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, 2013-04-18 at 19:05 +0200, Florian Pflug wrote: > On Apr18, 2013, at 19:04 , Jeff Davis wrote: > > On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: > >> -Original checksum feature used Fletcher checksums. Its main problems, > >> to quote wikipedia, include that it "cannot distinguish between blocks > >> of all 0 bits and blocks of all 1 bits". > > > > That is fairly easy to fix by using a different modulus: 251 vs 255. > > At the expense of a drastic performance hit though, no? Modulus operations > aren't exactly cheap. Modulo is only necessary when there's a possibility of overflow, or at the very end of the calculation. If we accumulate 32-bit integers into 64-bit sums, then it turns out that it can't overflow given the largest input we support (32K page). 32K page = 8192 32-bit integers 1*(2^32-1) + 2*(2^32-1) + 3*(2^32-1) ... 8192*(2^32-1) = (2^32-1) * (8192^2 + 8192)/2 = 144132780228341760 ( < 2^64-1 ) So, we only need to do the modulo at the end. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
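Jeff's bound can be checked mechanically. The sketch below accumulates the worst-case weighted sum for 8192 words of 0xFFFFFFFF in a 128-bit intermediate (unsigned __int128 is a GCC/Clang extension, used here only so the overflow check itself cannot overflow) and confirms the total fits in 64 bits, so a single modulo at the end is safe.

```c
#include <stdint.h>

/* Worst case for a Fletcher-style weighted second sum over a 32K page:
 * 8192 32-bit words, each at the maximum value 2^32-1, with weights
 * 1..8192.  Returned as a 128-bit value so the check itself is exact. */
static unsigned __int128
fletcher_worst_case(void)
{
    unsigned __int128 total = 0;
    for (uint64_t i = 1; i <= 8192; i++)
        total += (unsigned __int128) i * 0xFFFFFFFFu;
    return total;
}
```

The closed form is (2^32-1) * (8192*8193/2), and the loop lets us confirm both that identity and the fact that the result stays below 2^64-1.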
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 8:15 PM, Florian Pflug wrote: > So either the CRC32-C polynomial isn't irreducible, or there's something > fishy going on. Could there be a bug in your CRC implementation? Maybe > a mixup between big and little endian, or something like that? I'm suspecting an implementation bug myself. I already checked the test harness and that was all sane, compiler hadn't taken any unforgivable liberties there. I will crosscheck the output with other implementations to verify that the checksum is implemented correctly. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Apr18, 2013, at 18:48 , Ants Aasma wrote: > On Thu, Apr 18, 2013 at 5:57 PM, Ants Aasma wrote: >> I'll generate an avalanche diagram for CRC32C too, but it will take a >> while even if I use a smaller dataset. > > Well that was useless... In CRC flipping each bit in the input flips > preset pattern of bits in the output regardless of the actual data on > the page. Some stats for CRC32C - input bits affect 28344 different > bit combinations. Count of bits by number of duplicated bitpatterns: Yup, CRC is linear too. CRC is essentially long division for polynomials, i.e. you interpret the N input bits as the coefficients of a (large) polynomial of degree (N-1), and divide by the CRC polynomial. The remainder is the checksum, and consists of B bits where B is the degree of the CRC polynomial. (Polynomial here means polynomial over GF(2), i.e. over a field with only two values 0 and 1) I'm currently trying to see if one can easily explain the partial-write behaviour from that. Having lots of zeros at the end corresponds to an input polynomial of the form p(x) * x^l where l is the number of zero bits. The CRC (q(x) is the CRC polynomial) is p(x) * x^l mod q(x) = (p(x) mod q(x)) * (x^l mod q(x)) mod q(x) That still doesn't explain it, though - the result *should* simply be the checksum of p(x), scrambled a bit by the multiplication with (x^l mod q(x)). But if q(x) is irreducible, that scrambling is invertible (as multiplication modulo some irreducible element always is), and thus shouldn't matter much. So either the CRC32-C polynomial isn't irreducible, or there's something fishy going on. Could there be a bug in your CRC implementation? Maybe a mixup between big and little endian, or something like that?
The third possibility is that I'm overlooking something, of course ;-) Will think more about this tomorrow if time permits. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
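The linearity at the heart of this analysis is easy to observe directly. With a zero initial value and no final inversion, a CRC is a linear map over GF(2): crc(a xor b) = crc(a) xor crc(b) for equal-length inputs, which is exactly why each flipped input bit toggles a fixed pattern of output bits no matter what the rest of the page contains. A bitwise sketch with the (reflected) Castagnoli polynomial:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C in "pure" long-division form: zero initial value and
 * no final inversion.  In this form the CRC is linear over GF(2). */
static uint32_t
crc32c_raw(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (crc & 1 ? 0x82F63B78u : 0);
    }
    return crc;
}
```

The standard CRC-32C adds a 0xFFFFFFFF initial value and final inversion, which makes it affine rather than linear, but the fixed per-bit toggle patterns Ants measured remain the same.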
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 8:05 PM, Florian Pflug wrote: > On Apr18, 2013, at 19:04 , Jeff Davis wrote: >> On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: >>> -Original checksum feature used Fletcher checksums. Its main problems, >>> to quote wikipedia, include that it "cannot distinguish between blocks >>> of all 0 bits and blocks of all 1 bits". >> >> That is fairly easy to fix by using a different modulus: 251 vs 255. > > At the expense of a drastic performance hit though, no? Modulus operations > aren't exactly cheap. The modulus can be done at the end. By using a modulus of 65521, the resulting checksum is called Adler-32. [1] However, the quality of Fletcher-32/Adler-32 is strictly worse than even the first iteration of the multiply-add based checksums proposed. [1] http://en.wikipedia.org/wiki/Adler-32 Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
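Deferring the modulus is exactly what Adler-32 does in practice. A sketch of that structure is below; the block size follows zlib's analysis, where 5552 (NMAX) is the largest byte count for which the two 32-bit sums provably cannot overflow before the next reduction, so the inner loop runs modulo-free. This is a generic sketch, not zlib's actual (further unrolled) code.

```c
#include <stdint.h>
#include <stddef.h>

#define ADLER_MOD  65521u   /* largest prime below 2^16 */
#define ADLER_NMAX 5552     /* zlib's safe modulo-free block size */

/* Adler-32: a Fletcher-style pair of running sums, with the expensive
 * modulo deferred to once per NMAX-byte block. */
static uint32_t
adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    while (len > 0)
    {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n-- > 0)
        {
            a += *buf++;    /* no modulo in the inner loop */
            b += a;
        }
        a %= ADLER_MOD;     /* one reduction per block suffices */
        b %= ADLER_MOD;
    }
    return (b << 16) | a;
}
```

As a quick sanity check, adler32() of the ASCII string "Wikipedia" is 0x11E60398.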
Re: [HACKERS] Enabling Checksums
On Apr18, 2013, at 19:04 , Jeff Davis wrote: > On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: >> -Original checksum feature used Fletcher checksums. Its main problems, >> to quote wikipedia, include that it "cannot distinguish between blocks >> of all 0 bits and blocks of all 1 bits". > > That is fairly easy to fix by using a different modulus: 251 vs 255. At the expense of a drastic performance hit though, no? Modulus operations aren't exactly cheap. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
* Greg Smith: > The TCP/IP checksum spec is at https://tools.ietf.org/html/rfc793 ; > its error detection limitations are described at > http://www.noahdavids.org/self_published/CRC_and_checksum.html ; and a > good article about optimizing its code is at > http://www.locklessinc.com/articles/tcp_checksum/ I'll take a longer > look at whether it's an improvement on the Fletcher-16 used by the > current patch. The TCP checksum is too weak to be practical. Every now and then, I see data transfers where the checksum is valid, but the content contains bit flips. Anything that flips bits randomly at intervals which are multiples of 16 bits is quite likely to pass through checksum detection. In practice, TCP relies on checksumming on the sub-IP layers. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-17 at 20:21 -0400, Greg Smith wrote: > -Original checksum feature used Fletcher checksums. Its main problems, > to quote wikipedia, include that it "cannot distinguish between blocks > of all 0 bits and blocks of all 1 bits". That is fairly easy to fix by using a different modulus: 251 vs 255. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 5:57 PM, Ants Aasma wrote: > I'll generate an avalanche diagram for CRC32C too, but it will take a > while even if I use a smaller dataset. Well that was useless... In a CRC, flipping each bit in the input flips a preset pattern of bits in the output regardless of the actual data on the page. Some stats for CRC32C - input bits affect 28344 different bit combinations. Count of bits by number of duplicated bitpatterns: [(1, 8868), (2, 17722), (3, 17775), (4, 12048), (5, 5725), (6, 2268), (7, 875), (8, 184), (9, 45), (10, 10), (16, 16)] Count of bit positions by number of bit-positions affected: [(0, 16), (1, 25), (3, 1185), (5, 8487), (7, 22970), (9, 22913), (11, 8790), (13, 1119), (15, 31)] A map of the number of bit positions affected is attached, with 8 rendered as black and 0 or 16 as red. I'm not sure if the issues with partial writes are somehow related to this. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 09:17:39AM +0100, Simon Riggs wrote: > On 17 April 2013 22:36, Bruce Momjian wrote: > > >> > I would like to know the answer of how an upgrade from checksum to > >> > no-checksum would behave so I can modify pg_upgrade to allow it. > >> > >> Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if > >> someone has checksums enabled and wants to disable it, why is pg_upgrade > >> the right time to do that? Wouldn't it make more sense to allow them to > >> do that at any time? > > > > Well, right now, pg_upgrade is the only way you could potentially turn > > off checksums. You are right that we might eventually want a command, > > but my point is that we currently have a limitation in pg_upgrade that > > might not be necessary. > > We don't currently have checksums, so pg_upgrade doesn't need to cope > with turning them off in 9.3 True, 9.2 doesn't have checksums, while 9.3 will. One point is that pg_upgrade could actually be used to turn off checksums for 9.3 to 9.3 upgrades if no tablespaces are used. > For 9.4, it might, but likely we'll have a tool to turn them off > before then anyway. True. Would we want pg_upgrade to still enforce matching checksum modes for old and new servers at that point? Eventually we will have to decide that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 5:08 AM, Greg Smith wrote: > On 4/17/13 8:56 PM, Ants Aasma wrote: >> >> Nothing from the two points, but the CRC calculation algorithm can be >> switched out for slice-by-4 or slice-by-8 variant. Speed up was around >> factor of 4 if I remember correctly...I can provide you >> with a patch of the generic version of any of the discussed algorithms >> within an hour, leaving plenty of time in beta or in 9.4 to >> accommodate the optimized versions. > > Can you nail down a solid, potential for commit slice-by-4 or slice-by-8 > patch then? You dropped into things like per-byte overhead to reach this > conclusion, which was fine to let the methods battle each other. Maybe I > missed it, but I didn't remember seeing an obvious full patch for this > implementation then come back up from that. With the schedule pressure this > needs to return to more database-level tests. Your concerns about the > committed feature being much slower than the original Fletcher one are > troubling, and we might as well do that showdown again now with the best of > the CRC implementations you've found. I meant that any of the fast ones is easy to nail down. The sped-up slice-by-8 is slightly trickier to clean up. Especially if anyone expects it to accelerate WAL calculation, then it brings up a whole bunch of design questions on how to handle alignment issues. For performance testing what is attached should work fine; it would still need some cleanup. > It's fair that you're very concerned about (1), but I wouldn't give it 100% > odds of happening either. The user demand that's motivated me to work on > this will be happy with any of (1) through (3), and in two of them > optimizing the 16 bit checksums now turns out to be premature. Fair enough, although I'd like to point out the optimization is premature in the sense that the effort might go to waste.
The checksum function is a self-contained, easy to test and very low-maintenance piece of code - not the usual premature optimization risk. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de crc32c-sb8-checksum.v0.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 11:08 PM, Andres Freund wrote: > On 2013-04-17 18:16:36 -0700, Daniel Farina wrote: >> The original paper is often shorthanded "Castagnoli 93", but it exists >> in the IEEE's sphere of influence and is hard to find a copy of. >> Luckily, a pretty interesting survey paper discussing some of the >> issues was written by Koopman in 2002 and is available: >> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a >> pedagogical note, it's a pretty interesting and accessible piece of >> writing (for me, as someone who knows little of error >> detection/correction) and explains some of the engineering reasons >> that provoke such exercises. > > http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=231911&userType=inst > > There's also a Koopman paper from 2004 that's interesting. Having read the 2002 paper more, it seems that the current CRC32 doesn't have a whole lot going for it: CRC32C pretty much cleans its clock across the board (I don't understand detected Hamming Distances that seem greater than the information content of the message, e.g. HD 14 with 8 bit messages as seen in CRC32C: that's where CRC32 can "win"). CRC32C looks, all in all, the most flexible, because detection of Hamming Distance 4 spans from 5244-131072 bits (the upper range of which is a full 16KiB!) and there is superior Hamming Distance detection on shorter messages up until the point where it seems like the Hamming Distance able to be detected is larger than the message size itself (e.g. HD 13 on an 8 bit message). I'm not sure if this is an error in my understanding, or what. Also, larger runs (16KB) are better served by CRC32C: even the probably-best contender I can see (0xD419CC15) drops to Hamming Distance 2-detection right after 65505 bits. CRC32C has the biggest range at HD4, although Koopman 0xBA0DC66B comes close, gaining superior Hamming distance detection for 178-16360 bits (the upper end of this range is short of 2KiB by 3 bytes).
All in all, there is no reason I can see to keep CRC32 at all vs CRC32C on the basis of error detection alone, so putting aside all the business about instruction set architecture, I think a software CRC32C in a vacuum can be seen as a robustness improvement. There may be polynomials that are not CRC32 or CRC32C that one might view as having slightly better tradeoffs as seen in Table 1 of Koopman 2002, but it's kind of a stretch: being able to handle 8KB and 16KB at HD4 as seen in CRC32C is awfully compelling to me. Koopman 0xBA0DC66B can admirably reach HD6 on a much larger range, up to 16360 bits, which is ever so shy of 2KiB. Castagnoli 0xD419CC15 can detect HD 5 up to 31 bits short of 8KB. Corrections welcome on my interpretations of Tbl 1. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 17 April 2013 22:36, Bruce Momjian wrote: >> > I would like to know the answer of how an upgrade from checksum to >> > no-checksum would behave so I can modify pg_upgrade to allow it. >> >> Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if >> someone has checksums enabled and wants to disable it, why is pg_upgrade >> the right time to do that? Wouldn't it make more sense to allow them to >> do that at any time? > > Well, right now, pg_upgrade is the only way you could potentially turn > off checksums. You are right that we might eventually want a command, > but my point is that we currently have a limitation in pg_upgrade that > might not be necessary. We don't currently have checksums, so pg_upgrade doesn't need to cope with turning them off in 9.3. For 9.4 it might, but likely we'll have a tool to turn them off before then anyway. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 2013-04-18 00:44:02 +0300, Ants Aasma wrote:

> I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a
> + srl1-xor variants and ran performance tests and detection rate tests
> on both.
>
> Performance results:
> Mul-add checksums: 12.9 bytes/cycle
> FNV-1a checksums: 13.5 bytes/cycle
> FNV-1a + srl-1: 7.4 bytes/cycle
>
> Detection rates:
> False positive rates:
>                  Add-mul      FNV-1a    FNV-1a + srl-1
> Single bit flip: 1:inf        1:129590  1:64795
> Double bit flip: 1:148        1:511     1:53083
> Triple bit flip: 1:673        1:5060    1:61511
> Quad bit flip:   1:1872       1:19349   1:68320
> Write 0x00 byte: 1:774538137  1:118776  1:68952
> Write 0xFF byte: 1:165399500  1:137489  1:68958
> Partial write:   1:59949      1:71939   1:89923
> Write garbage:   1:64866      1:64980   1:67732
> Write run of 00: 1:57077      1:61140   1:59723
> Write run of FF: 1:63085      1:59609   1:62977
>
> Test descriptions:
> N bit flip: picks N random non-overlapping bits and flips their value.
> Write X byte: overwrites a single byte with X.
> Partial write: picks a random cut point, overwrites everything from
> there to end with 0x00.
> Write garbage/run of X: picks two random cut points and fills
> everything in between with random values/X bytes.

I don't think this table is complete without competing numbers for truncated CRC-32. Any chance to get that? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 2013-04-17 18:16:36 -0700, Daniel Farina wrote: > The original paper is often shorthanded "Castagnoli 93", but it exists > in the IEEE's sphere of influence and is hard to find a copy of. > Luckily, a pretty interesting survey paper discussing some of the > issues was written by Koopman in 2002 and is available: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a > pedagogical note, it's a pretty interesting and accessible piece of > writing (for me, as someone who knows little of error > detection/correction) and explains some of the engineering reasons > that provoke such exercises. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=231911&userType=inst There's also a Koopman paper from 2004 that's interesting. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 4/17/13 8:56 PM, Ants Aasma wrote: > Nothing from the two points, but the CRC calculation algorithm can be > switched out for slice-by-4 or slice-by-8 variant. Speed up was around > factor of 4 if I remember correctly... I can provide you with a patch > of the generic version of any of the discussed algorithms within an > hour, leaving plenty of time in beta or in 9.4 to accommodate the > optimized versions. Can you nail down a solid, potential-for-commit slice-by-4 or slice-by-8 patch then? You dropped into things like per-byte overhead to reach this conclusion, which was fine to let the methods battle each other. Maybe I missed it, but I didn't remember seeing an obvious full patch for this implementation then come back up from that. With the schedule pressure this needs to return to more database-level tests. Your concerns about the committed feature being much slower than the original Fletcher one are troubling, and we might as well do that showdown again now with the best of the CRC implementations you've found. > Actually the state is that with the [CRC] polynomial used there is > currently close to zero hope of CPUs optimizing for us. Ah, I didn't catch that before. It sounds like the alternate slicing implementation should also use a different polynomial then, which sounds reasonable. This doesn't even have to be exactly the same CRC function that the WAL uses. A CRC that's modified for performance or having a better future potential is fine; there's just a lot of resistance to using something other than a CRC right now. > I'm not sure about the 9.4 part: if we ship with the builtin CRC as > committed, there is a 100% chance that we will want to switch out the > algorithm in 9.4, and there will be quite a large subset of users that > will find the performance unusable. Now I have to switch out my reviewer hat for my 3 bit fortune telling one. 
(It uses a Magic 8 Ball) This entire approach is squeezing what people would prefer to be a 32 bit CRC into a spare 16 bits, as a useful step advancing toward a long term goal. I have four major branches of possible futures here I've thought about:

1) Database checksums with 16 bits are good enough, but they have to be much faster to satisfy users. It may take a different checksum implementation altogether to make that possible, and distinguishing between the two of them requires borrowing even more metadata bits from somewhere. (This seems the future you're worried about)

2) Database checksums work out well, but they have to be 32 bits to satisfy users and/or error detection needs. Work on pg_upgrade and expanding the page headers will be needed. Optimization of the CRC now has a full 32 bit target.

3) The demand for database checksums is made obsolete by either mainstream filesystem checksumming, performance issues, or just general market whim. The 16 bit checksum PostgreSQL implements becomes a vestigial feature, and whenever it gets in the way of making changes someone proposes eliminating them. (I call this one the "rules" future)

4) 16 bit checksums turn out to be such a problem in the field that everyone regrets the whole thing, and discussions turn immediately toward how to eliminate that risk.

It's fair that you're very concerned about (1), but I wouldn't give it 100% odds of happening either. The user demand that's motivated me to work on this will be happy with any of (1) through (3), and in two of them optimizing the 16 bit checksums now turns out to be premature. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 5:21 PM, Greg Smith wrote: > Let me see if I can summarize where the messages flying by are at since > you'd like to close this topic for now: > > -Original checksum feature used Fletcher checksums. Its main problems, to > quote wikipedia, include that it "cannot distinguish between blocks of all 0 > bits and blocks of all 1 bits". > > -Committed checksum feature uses truncated CRC-32. This has known good > error detection properties, but is expensive to compute. There's reason to > believe that particular computation will become cheaper on future platforms > though. But taking full advantage of that will require adding CPU-specific > code to the database. > > -The latest idea is using the Fowler–Noll–Vo hash function: > https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of > research around when that is good or bad. The exact properties depend on > magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that > can vary based on both your target block size and how many bytes you'll > process at a time. For PostgreSQL checksums, one of the common > problems--getting an even distribution of the hashed values--isn't important > the way it is for other types of hashes. Ants and Florian have now dug into > how exactly that and specific CPU optimization concerns impact the best > approach for 8K database pages. This is very clearly a 9.4 project that is > just getting started. I was curious about the activity in this thread and wanted to understand the tradeoffs, and came to the same understanding as you when poking around. It seems the tough aspect of the equation is that the most well studied thing (CRC-32C) is slow unless you have special ISA support. Trying to find as much information and conclusive research on FNV was a lot more challenging. Fletcher is similar in that regard. 
Given my hasty attempt to understand each of the alternatives, my qualitative judgement is that, strangely enough, the most conservative choice of the three (in terms of being understood and treated in the literature more than ten times over) is CRC-32C, but it's also the one being cast as only suitable inside micro-optimization. To add another, theoretically-oriented dimension to the discussion, I'd like to suggest it's also the most thoroughly studied of all the alternatives. I really had a hard time finding follow-up papers about the two alternatives, but to be fair, I didn't try very hard... then again, I didn't try very hard for any of the three, it's just that CRC32C was by far the easiest to find materials on. The original paper is often shorthanded "Castagnoli 93", but it exists in the IEEE's sphere of influence and is hard to find a copy of. Luckily, a pretty interesting survey paper discussing some of the issues was written by Koopman in 2002 and is available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a pedagogical note, it's a pretty interesting and accessible piece of writing (for me, as someone who knows little of error detection/correction) and explains some of the engineering reasons that provoke such exercises. Basically... if it comes down to understanding what the heck is going on and what the trade-offs are, it was a lot easier to brush up on CRC32-C in my meandering around the Internet. One might think this level of scrutiny would constitute a viable explanation of why CRC32C found its way into several standards and then finally into silicon. All in all, if the real world costs of CRC32C on not-SSE4.2 hardware are allowable, I think it's the most researched and conservative option, although perhaps some of the other polynomials seen in Koopman could also be desirable. It seems there's a tradeoff in CRC polynomials between long-message and short-message error detection, and the paper above may allow for a more informed selection. 
CRC32C is considered a good trade-off for both, but I haven't assessed the paper in enough detail to suggest whether there are specialized long-run polynomials that may be better still (although, then, there is also the microoptimization question, which postdates the literature I was looking at by a lot). 
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 3:21 AM, Greg Smith wrote: > On 4/17/13 6:32 PM, Tom Lane wrote: >> >> The more I read of this thread, the more unhappy I get. It appears that >> the entire design process is being driven by micro-optimization for CPUs >> being built by Intel in 2013. > > > And that's not going to get anyone past review, since all the tests I've > been doing the last two weeks are on how fast an AMD Opteron 6234 with OS > cache >> shared_buffers can run this. The main thing I'm still worried > about is what happens when you have a fast machine that can move memory > around very quickly and an in-memory workload, but it's hamstrung by the > checksum computation--and it's not a 2013 Intel machine. > > The question I started with here was answered to some depth and then skipped > past. I'd like to jerk attention back to that, since I thought some good > answers from Ants went by. Is there a simple way to optimize the committed > CRC computation (or a similar one with the same error detection properties) > based on either: > > a) Knowing that the input will be an 8K page, rather than the existing use > case with an arbitrary sized WAL section. > > b) Straightforward code rearrangement or optimization flags. > > That was all I thought was still feasible to consider changing for 9.3 a few > weeks ago. And the possible scope has only been shrinking since then. Nothing from the two points, but the CRC calculation algorithm can be switched out for slice-by-4 or slice-by-8 variant. Speed up was around factor of 4 if I remember correctly. >> And I reiterate that there is theory out there about the error detection >> capabilities of CRCs. I'm not seeing any theory here, which leaves me >> with very little confidence that we know what we're doing. > > > Let me see if I can summarize where the messages flying by are at since > you'd like to close this topic for now: > > -Original checksum feature used Fletcher checksums. 
Its main problems, to > quote wikipedia, include that it "cannot distinguish between blocks of all 0 > bits and blocks of all 1 bits". That was only the most glaring problem. > -Committed checksum feature uses truncated CRC-32. This has known good > error detection properties, but is expensive to compute. There's reason to > believe that particular computation will become cheaper on future platforms > though. But taking full advantage of that will require adding CPU-specific > code to the database. Actually the state is that with the polynomial used there is currently close to zero hope of CPUs optimizing for us. By switching the polynomial we can have hardware acceleration on Intel CPUs, little hope of others supporting given that AMD hasn't by now and Intel touts around patents in this area. However the calculation can be made about a factor of 4 faster by restructuring the calculation. This optimization is plain C and not CPU specific. The committed checksum is an order of magnitude slower than the Fletcher one that was performance tested with the patch. > -The latest idea is using the Fowler–Noll–Vo hash function: > https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of > research around when that is good or bad. The exact properties depend on > magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that > can vary based on both your target block size and how many bytes you'll > process at a time. For PostgreSQL checksums, one of the common > problems--getting an even distribution of the hashed values--isn't important > the way it is for other types of hashes. Ants and Florian have now dug into > how exactly that and specific CPU optimization concerns impact the best > approach for 8K database pages. This is very clearly a 9.4 project that is > just getting started. 
I'm not sure about the 9.4 part: if we ship with the builtin CRC as committed, there is a 100% chance that we will want to switch out the algorithm in 9.4, and there will be quite a large subset of users that will find the performance unusable. If we change it to whatever we come up with here, there is a small chance that the algorithm will give worse than expected error detection rate in some circumstances and we will want to offer a better algorithm. More probably it will be good enough and the low performance hit will allow more users to turn it on. This is a 16bit checksum that we're talking about, not SHA-1, it is expected to occasionally fail to detect errors. I can provide you with a patch of the generic version of any of the discussed algorithms within an hour, leaving plenty of time in beta or in 9.4 to accommodate the optimized versions. It's literally a dozen self contained lines of code. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
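The slice-by-4 restructuring Ants mentions can be sketched as follows (Python here for brevity; an actual patch would be C). The table construction is the textbook one, and the Castagnoli polynomial below is only an assumption to give the sketch a concrete check value — the committed code uses a different polynomial:

```python
POLY = 0x82F63B78  # reflected CRC-32C polynomial, used here for illustration

# table[0] is the classic byte-at-a-time table; table[k][i] advances byte i
# by k additional zero bytes, so four input bytes can fold in one step.
table = [[0] * 256 for _ in range(4)]
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ (POLY & -(crc & 1))
    table[0][i] = crc
for i in range(256):
    for k in range(1, 4):
        table[k][i] = (table[k - 1][i] >> 8) ^ table[0][table[k - 1][i] & 0xFF]

def crc32c_slice4(data: bytes) -> int:
    crc = 0xFFFFFFFF
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        # Fold four bytes at once; the four table loads are independent,
        # which is where the speedup over byte-at-a-time comes from.
        crc ^= data[i] | data[i+1] << 8 | data[i+2] << 16 | data[i+3] << 24
        crc = (table[3][crc & 0xFF] ^ table[2][(crc >> 8) & 0xFF]
               ^ table[1][(crc >> 16) & 0xFF] ^ table[0][(crc >> 24) & 0xFF])
    for b in data[n:]:  # leftover tail bytes, one at a time
        crc = (crc >> 8) ^ table[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

assert crc32c_slice4(b"123456789") == 0xE3069283
```

Slice-by-8 is the same idea with eight tables and an 8-byte stride.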
Re: [HACKERS] Enabling Checksums
On 4/17/13 6:32 PM, Tom Lane wrote: The more I read of this thread, the more unhappy I get. It appears that the entire design process is being driven by micro-optimization for CPUs being built by Intel in 2013. And that's not going to get anyone past review, since all the tests I've been doing the last two weeks are on how fast an AMD Opteron 6234 with OS cache >> shared_buffers can run this. The main thing I'm still worried about is what happens when you have a fast machine that can move memory around very quickly and an in-memory workload, but it's hamstrung by the checksum computation--and it's not a 2013 Intel machine. The question I started with here was answered to some depth and then skipped past. I'd like to jerk attention back to that, since I thought some good answers from Ants went by. Is there a simple way to optimize the committed CRC computation (or a similar one with the same error detection properties) based on either: a) Knowing that the input will be an 8K page, rather than the existing use case with an arbitrary sized WAL section. b) Straightforward code rearrangement or optimization flags. That was all I thought was still feasible to consider changing for 9.3 a few weeks ago. And the possible scope has only been shrinking since then. And I reiterate that there is theory out there about the error detection capabilities of CRCs. I'm not seeing any theory here, which leaves me with very little confidence that we know what we're doing. Let me see if I can summarize where the messages flying by are at since you'd like to close this topic for now: -Original checksum feature used Fletcher checksums. Its main problems, to quote wikipedia, include that it "cannot distinguish between blocks of all 0 bits and blocks of all 1 bits". -Committed checksum feature uses truncated CRC-32. This has known good error detection properties, but is expensive to compute. There's reason to believe that particular computation will become cheaper on future platforms though. 
But taking full advantage of that will require adding CPU-specific code to the database. -The latest idea is using the Fowler–Noll–Vo hash function: https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of research around when that is good or bad. The exact properties depend on magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that can vary based on both your target block size and how many bytes you'll process at a time. For PostgreSQL checksums, one of the common problems--getting an even distribution of the hashed values--isn't important the way it is for other types of hashes. Ants and Florian have now dug into how exactly that and specific CPU optimization concerns impact the best approach for 8K database pages. This is very clearly a 9.4 project that is just getting started. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
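The all-zeros/all-ones weakness quoted from Wikipedia is easy to demonstrate. Below is a minimal one's-complement Fletcher-16, used purely as a stand-in for the original patch's Fletcher variant (which differed in detail):

```python
def fletcher16(data: bytes) -> int:
    # One's-complement Fletcher: both running sums are reduced modulo 255,
    # so a 0xFF byte (255 = 0 mod 255) contributes exactly like a 0x00 byte.
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

# A page of zero bytes and a page of 0xFF bytes checksum identically.
assert fletcher16(b"\x00" * 8192) == fletcher16(b"\xff" * 8192) == 0
```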
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 2:25 AM, Florian Pflug wrote:

> On Apr17, 2013, at 23:44 , Ants Aasma wrote:
>> Performance results:
>> Mul-add checksums: 12.9 bytes/cycle
>> FNV-1a checksums: 13.5 bytes/cycle
>> FNV-1a + srl-1: 7.4 bytes/cycle
>>
>> Detection rates:
>> False positive rates:
>>                  Add-mul      FNV-1a    FNV-1a + srl-1
>> Single bit flip: 1:inf        1:129590  1:64795
>> Double bit flip: 1:148        1:511     1:53083
>> Triple bit flip: 1:673        1:5060    1:61511
>> Quad bit flip:   1:1872       1:19349   1:68320
>> Write 0x00 byte: 1:774538137  1:118776  1:68952
>> Write 0xFF byte: 1:165399500  1:137489  1:68958
>> Partial write:   1:59949      1:71939   1:89923
>> Write garbage:   1:64866      1:64980   1:67732
>> Write run of 00: 1:57077      1:61140   1:59723
>> Write run of FF: 1:63085      1:59609   1:62977
>>
>> Test descriptions:
>> N bit flip: picks N random non-overlapping bits and flips their value.
>> Write X byte: overwrites a single byte with X.
>> Partial write: picks a random cut point, overwrites everything from
>> there to end with 0x00.
>> Write garbage/run of X: picks two random cut points and fills
>> everything in between with random values/X bytes.
>
> Cool, thanks for testing that! The results for FNV-1a + srl-1 look
> promising, I think. Its failure rate is consistently about 1:2^16,
> which is the value you'd expect. That gives me some confidence that
> the additional shift is working as expected.
>
> BTW, which prime are you using for FNV-1a and FNV-1a+srl1?

The official 32bit FNV one, 16777619. Offsets were just random numbers. Seems good enough given the following from the FNV page: "These non-zero integers are the FNV-0 hashes of the following 32 octets: chongo /\../\" >> The effect on false positive rates >> for double bit errors is particularly impressive. I'm now running a >> testrun that shifts right by 13 to see how that works out, intuitively >> it should help dispersing the bits a lot faster. 
Empirical results are slightly better with shift of 13:

Single bit flip: 1:61615
Double bit flip: 1:58078
Triple bit flip: 1:66329
Quad bit flip: 1:62141
Write 0x00 byte: 1:66327
Write 0xFF byte: 1:65274
Partial write: 1:71939
Write garbage: 1:65095
Write run of 0: 1:62845
Write run of FF: 1:64638

> Maybe, but it also makes *only* bits 14 and 15 actually affect bits > below them, because all others are shifted out. If you choose the > right prime it may still work, you'd have to pick one with > enough lower bits set so that every bit affects bit 14 or 15 at some > point… > > All in all a small shift seems better to me - if 1 for some reason > isn't a good choice, I'd expect 3 or so to be a suitable > replacement, but nothing much larger… I don't think the big shift is a problem, the other bits were taken into account by the multiply, and with the larger shift the next multiplication will disperse the changes once again. Nevertheless, I'm running the tests with shift of 3 now. > I should have some time tomorrow to spend on this, and will try > to validate our FNV-1a modification, and see if I find a way to judge > whether 1 is a good shift. Great. I will spend some brain cycles on it too. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Enabling Checksums
On Thu, Apr 18, 2013 at 1:32 AM, Tom Lane wrote: > Ants Aasma writes: >> I was thinking about something similar too. The big issue here is that >> the parallel checksums already hide each other's latencies effectively >> executing one each of movdqu/pmullw/paddw each cycle, that's why the >> N_SUMS adds up to 128 bytes not 16 bytes. > > The more I read of this thread, the more unhappy I get. It appears that > the entire design process is being driven by micro-optimization for CPUs > being built by Intel in 2013. That ought to be, at best, a fifth-order > consideration, with full recognition that it'll be obsolete in two years, > and is already irrelevant to anyone not running one of those CPUs. The large scale structure takes into account the trends in computer architecture. A lot more so than using anything straight out of the literature. Specifically, computer architectures have hit a wall in terms of sequential throughput, so the linear dependency chain in the checksum algorithm will be the bottleneck soon if it isn't already. From that it follows that a fast and future proof algorithm should not calculate the checksum in a single long chain. The proposed algorithms divide the input into 64x64 and 32x64 chunks. It's easy to show that both convert the dependency chain from O(n) to O(sqrt(n)). Secondly, unless we pick something really popular, CPUs are unlikely to provide specifically for us, so the algorithm should be built from general purpose computational pieces. Vector integer multiply and xor are pretty much guaranteed to be there and fast on future CPUs. In my view it's much more probable to be available and fast on future CPUs than something like the Intel CRC32 acceleration. > I would like to ban all discussion of assembly-language optimizations > until after 9.3 is out, so that we can concentrate on what actually > matters. Which IMO is mostly the error detection rate and the probable > nature of false successes. 
I'm glad to see that you're paying at least > some attention to that, but the priorities in this discussion are > completely backwards. I approached it from the angle of what needs to be done to make a fundamentally fast approach have a good enough error detection rate, without a likely error mode that generates false positives. The algorithms are simple enough and well studied enough that the rewards from tweaking them are negligible. I think the resulting performance speaks for itself. Now the question is what is a good enough algorithm. In my view, the checksum is more like a canary in the coal mine, not something that can be relied upon, and so ultimate efficiency is not that important if there are no obvious horrible cases. I can see that there are other views and so am exploring different tradeoffs between performance and quality. > And I reiterate that there is theory out there about the error detection > capabilities of CRCs. I'm not seeing any theory here, which leaves me > with very little confidence that we know what we're doing. I haven't found much literature that is of use here. There is theory underlying here coming from basic number theory and distilled into rules for hash functions. For the FNV hash the prime supposedly is carefully chosen, although all literature so far is saying "it is a good choice, but here is not the place to explain why". Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
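The dependency-chain restructuring Ants describes can be sketched in scalar form. The lane count, the nonzero offset bases, and the final XOR-plus-truncate fold below are illustrative assumptions on my part; the real proposal deals words across SIMD registers and may aggregate differently:

```python
import struct

N_LANES = 32          # assumed lane count; the SIMD variants use 32 or 64
FNV_PRIME = 16777619
MASK32 = 0xFFFFFFFF

def parallel_checksum(page: bytes) -> int:
    # Deal consecutive 32-bit words round-robin across independent lanes,
    # so each lane's serial chain is words/N_LANES steps long, not words.
    lanes = [i + 1 for i in range(N_LANES)]  # arbitrary nonzero bases
    n_words = len(page) // 4
    words = struct.unpack("<%dI" % n_words, page[: n_words * 4])
    for i, v in enumerate(words):
        j = i % N_LANES
        h = lanes[j] ^ v
        lanes[j] = (h * FNV_PRIME) & MASK32   # per-lane FNV-1a style step
    out = 0
    for h in lanes:                           # aggregate: XOR-fold the lanes
        out ^= h
    return (out ^ (out >> 16)) & 0xFFFF       # truncate to the spare 16 bits

checksum = parallel_checksum(bytes(range(256)) * 32)  # an 8KiB sample page
```

Because the lanes never read each other's state inside the loop, a superscalar or SIMD CPU can run all of them concurrently, which is the O(n) to O(sqrt(n)) conversion being described.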
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 23:44 , Ants Aasma wrote:

> Performance results:
> Mul-add checksums: 12.9 bytes/cycle
> FNV-1a checksums: 13.5 bytes/cycle
> FNV-1a + srl-1: 7.4 bytes/cycle
>
> Detection rates:
> False positive rates:
>                  Add-mul      FNV-1a    FNV-1a + srl-1
> Single bit flip: 1:inf        1:129590  1:64795
> Double bit flip: 1:148        1:511     1:53083
> Triple bit flip: 1:673        1:5060    1:61511
> Quad bit flip:   1:1872       1:19349   1:68320
> Write 0x00 byte: 1:774538137  1:118776  1:68952
> Write 0xFF byte: 1:165399500  1:137489  1:68958
> Partial write:   1:59949      1:71939   1:89923
> Write garbage:   1:64866      1:64980   1:67732
> Write run of 00: 1:57077      1:61140   1:59723
> Write run of FF: 1:63085      1:59609   1:62977
>
> Test descriptions:
> N bit flip: picks N random non-overlapping bits and flips their value.
> Write X byte: overwrites a single byte with X.
> Partial write: picks a random cut point, overwrites everything from
> there to end with 0x00.
> Write garbage/run of X: picks two random cut points and fills
> everything in between with random values/X bytes.

Cool, thanks for testing that! The results for FNV-1a + srl-1 look promising, I think. Its failure rate is consistently about 1:2^16, which is the value you'd expect. That gives me some confidence that the additional shift is working as expected. BTW, which prime are you using for FNV-1a and FNV-1a+srl1? > So adding in the shifted value nearly cuts the performance in half. I > think that by playing with the instruction order I might coax the CPU > scheduler to schedule the instructions better, but even in best case > it will be somewhat slower. The point to keep in mind that even this > slower speed is still faster than hardware accelerated CRC32, so all > in all the hit might not be so bad. Yeah. ~7 bytes/cycle still translates to over 10GB/s on a typical CPU, so that's still plenty fast I'd say... > The effect on false positive rates > for double bit errors is particularly impressive. 
I'm now running a > testrun that shifts right by 13 to see how that works out, intuitively > it should help dispersing the bits a lot faster. Maybe, but it also makes *only* bits 14 and 15 actually affect bits below them, because all others are shifted out. If you choose the right prime it may still work, you'd have to pick one with enough lower bits set so that every bit affects bit 14 or 15 at some point… All in all a small shift seems better to me - if 1 for some reason isn't a good choice, I'd expect 3 or so to be a suitable replacement, but nothing much larger… I should have some time tomorrow to spend on this, and will try to validate our FNV-1a modification, and see if I find a way to judge whether 1 is a good shift. >>> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with >>> different offset-basis values, would it be enough to just XOR fold the >>> resulting values together. The algorithm looking like this: >> >> Hm, this will make the algorithm less resilient to some particular >> input permutations (e.g. those which swap the 64*i-th and the (64*i+1)-th >> words), but those seem very unlikely to occur randomly. But if we're >> worried about that, we could use your linear combination method for >> the aggregation phase. > > I don't think it significantly reduces resilience to permutations > thanks to using different basis offsets and multiply not distributing > over xor. Oh, yeah, I thought you were still using 0 as base offset. If you don't, the objection is moot. best regards, Florian Pflug
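Why the shift matters can be shown concretely: in the plain h = (h ^ v) * p step, multiplication modulo 2^32 only carries information upward, so flipping the top bit of an input word flips exactly the top bit of the final hash and nothing else — a 16-bit truncation would never see it. A small demonstration (the prime is the standard 32-bit FNV prime; the basis and sample words are arbitrary choices of mine):

```python
FNV_PRIME = 16777619
MASK32 = 0xFFFFFFFF

def fnv1a(words, basis=2166136261):
    """Plain FNV-1a style step: h = (h ^ v) * p, modulo 2^32."""
    h = basis
    for v in words:
        h = ((h ^ v) * FNV_PRIME) & MASK32
    return h

def fnv1a_srl(words, basis=2166136261, shift=1):
    """The discussed variant: XOR in a right-shifted copy each round."""
    h = basis
    for v in words:
        h = (h ^ v) & MASK32
        h = ((h * FNV_PRIME) & MASK32) ^ (h >> shift)
    return h

words = [0x12345678, 0x9ABCDEF0, 0x0F0F0F0F]
flipped = [words[0] ^ 0x80000000] + words[1:]

# Plain multiply: the difference stays pinned to bit 31 forever.
assert fnv1a(words) ^ fnv1a(flipped) == 0x80000000
# With the shift, the high-bit flip bleeds into the lower bits as well.
assert fnv1a_srl(words) != fnv1a_srl(flipped)
```

The first assertion holds because an odd prime times 2^31 is congruent to 2^31 modulo 2^32, so the bit-31 difference survives every round unchanged.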
Re: [HACKERS] Enabling Checksums
On Apr18, 2013, at 00:32 , Tom Lane wrote: > Ants Aasma writes: >> I was thinking about something similar too. The big issue here is that >> the parallel checksums already hide each other's latencies effectively >> executing one each of movdqu/pmullw/paddw each cycle, that's why the >> N_SUMS adds up to 128 bytes not 16 bytes. > > The more I read of this thread, the more unhappy I get. It appears that > the entire design process is being driven by micro-optimization for CPUs > being built by Intel in 2013. That ought to be, at best, a fifth-order > consideration, with full recognition that it'll be obsolete in two years, > and is already irrelevant to anyone not running one of those CPUs. Micro-optimization for particular CPUs yes, but general performance considerations, no. For example, 2^n is probably one of the worst moduli you can pick for a hash function - any prime would work much better. But doing the computations modulo 2^16 or 2^32 carries zero performance overhead, whereas picking another modulus requires some renormalization after every operation. That, however, is *not* a given - it stems from the fact that nearly all CPUs in existence operate on binary integers. This fact must thus enter into the design phase very early, and makes 2^16 or 2^32 a sensible choice for a modulus *despite* its shortcomings, simply because it allows for fast implementations. > I would like to ban all discussion of assembly-language optimizations > until after 9.3 is out, so that we can concentrate on what actually > matters. Which IMO is mostly the error detection rate and the probable > nature of false successes. I'm glad to see that you're paying at least > some attention to that, but the priorities in this discussion are > completely backwards. I'd say lots of attention is paid to that, but there's *also* attention paid to speed. Which is good, because ideally we want to end up with a checksum that both has good error-detection properties *and* good performance. 
If performance is of no concern to us, then there's little reason not to use CRC… > And I reiterate that there is theory out there about the error detection > capabilities of CRCs. I'm not seeing any theory here, which leaves me > with very little confidence that we know what we're doing. If you've got any pointers to literature on error-detection capabilities of CPU-friendly checksum functions, please share. I am aware of the vast literature on CRC, and also on some other algebraic approaches, but none of those even come close to the speed of FNV+shift (unless there's a special CRC instruction, that is). And there's also a ton of stuff on cryptographic hashing, but those are optimized for a completely different use-case... best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
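Florian's point about moduli can be made concrete with a toy sketch (illustrative Python, not PostgreSQL code; the multiplier and names are invented): arithmetic modulo 2^16 is free because unsigned 16-bit arithmetic wraps around by itself, whereas a prime modulus needs an explicit reduction after every operation.

```python
# Toy sketch: the mask models what uint16 hardware wraparound does for free,
# while the prime modulus costs an extra division-based reduction per step.
MASK16 = 0xFFFF       # modulo 2^16: a single AND (free on real 16-bit registers)
PRIME_MOD = 65521     # largest prime below 2^16, a hypothetical alternative

def step_mod_pow2(state, value, mult=31):
    # one multiply-add step modulo 2^16
    return (state * mult + value) & MASK16

def step_mod_prime(state, value, mult=31):
    # same step modulo a prime: an extra explicit reduction every operation
    return (state * mult + value) % PRIME_MOD
```

The per-step cost difference is exactly the renormalization Florian describes; the detection-quality tradeoff is a separate question.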
Re: [HACKERS] Enabling Checksums
Ants Aasma writes: > I was thinking about something similar too. The big issue here is that > the parallel checksums already hide each other latencies effectively > executing one each of movdqu/pmullw/paddw each cycle, that's why the > N_SUMS adds up to 128 bytes not 16 bytes. The more I read of this thread, the more unhappy I get. It appears that the entire design process is being driven by micro-optimization for CPUs being built by Intel in 2013. That ought to be, at best, a fifth-order consideration, with full recognition that it'll be obsolete in two years, and is already irrelevant to anyone not running one of those CPUs. I would like to ban all discussion of assembly-language optimizations until after 9.3 is out, so that we can concentrate on what actually matters. Which IMO is mostly the error detection rate and the probable nature of false successes. I'm glad to see that you're paying at least some attention to that, but the priorities in this discussion are completely backwards. And I reiterate that there is theory out there about the error detection capabilities of CRCs. I'm not seeing any theory here, which leaves me with very little confidence that we know what we're doing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 6:54 PM, Florian Pflug wrote: > On Apr17, 2013, at 16:47 , Ants Aasma wrote: >> This made me remember, the issue I had was with high order bits, not >> with low order ones, somehow I got them confused. The exact issue is >> that the high order bits don't affect any bit lower than them. It's >> easy to see that if you remember the shift and add nature of multiply. >> Unfortunately XOR will not fix that. Neither will adding an offset >> basis. This is the fundamental thing that is behind the not-so-great >> uncorrelated bit error detection rate. > > Right. We could maybe fix that by extending the update step to > > t = s[j] ^ d[i,j] > s[j] = (t * PRIME) ^ (t >> 1) > > or something like that. Shifting t instead of (t * PRIME) should > help to reduce the performance impact, since a reordering CPU should > be able to parallelize the multiply and the shift. Note though that > I haven't really thought that through extensively - the general idea > should be sound, but whether 1 is a good shifting amount I do not > know. I was thinking about something similar too. The big issue here is that the parallel checksums already hide each other's latencies, effectively executing one each of movdqu/pmullw/paddw each cycle; that's why the N_SUMS adds up to 128 bytes not 16 bytes. I went ahead and coded up both the parallel FNV-1a and parallel FNV-1a + srl1-xor variants and ran performance tests and detection rate tests on both.
Performance results:
  Mul-add checksums:  12.9 bytes/cycle
  FNV-1a checksums:   13.5 bytes/cycle
  FNV-1a + srl-1:      7.4 bytes/cycle

Detection rates (false positive rates):
                    Add-mul      FNV-1a    FNV-1a + srl-1
  Single bit flip:  1:inf        1:129590  1:64795
  Double bit flip:  1:148        1:511     1:53083
  Triple bit flip:  1:673        1:5060    1:61511
  Quad bit flip:    1:1872       1:19349   1:68320
  Write 0x00 byte:  1:774538137  1:118776  1:68952
  Write 0xFF byte:  1:165399500  1:137489  1:68958
  Partial write:    1:59949      1:71939   1:89923
  Write garbage:    1:64866      1:64980   1:67732
  Write run of 00:  1:57077      1:61140   1:59723
  Write run of FF:  1:63085      1:59609   1:62977

Test descriptions: N bit flip: picks N random non-overlapping bits and flips their value. Write X byte: overwrites a single byte with X. Partial write: picks a random cut point, overwrites everything from there to the end with 0x00. Write garbage/run of X: picks two random cut points and fills everything in between with random values/X bytes. So adding in the shifted value nearly cuts the performance in half. I think that by playing with the instruction order I might coax the CPU scheduler to schedule the instructions better, but even in the best case it will be somewhat slower. The point to keep in mind is that even this slower speed is still faster than hardware accelerated CRC32, so all in all the hit might not be so bad. The effect on false positive rates for double bit errors is particularly impressive. I'm now running a test run that shifts right by 13 to see how that works out; intuitively it should help disperse the bits a lot faster. >> I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with >> different offset-basis values, would it be enough to just XOR fold the >> resulting values together. The algorithm looking like this: > > Hm, this will make the algorithm less resilient to some particular > input permutations (e.g. those which swap the 64*i-th and the (64*i+1)-th > words), but those seem very unlikely to occur randomly.
But if we're > worried about that, we could use your linear combination method for > the aggregation phase. I don't think it significantly reduces resilience to permutations thanks to using different basis offsets and multiply not distributing over xor. >> Speaking against this option is the fact that we will need to do CPU >> detection at startup to make it fast on the x86 that support SSE4.1, >> and the fact that AMD CPUs before 2011 will run it an order of >> magnitude slower (but still faster than the best CRC). > > Hm, CPU detection isn't that hard, and given the speed at which Intel > currently invents new instructions we'll end up going that route sooner > or later anyway, I think. Sure it's not that hard but it does have an order of magnitude more design decisions than #if defined(__x86_64__). Maybe a first stab could avoid a generic infrastructure and just have the checksum function as a function pointer, with the default "trampoline" implementation running a cpuid and overwriting the function pointer with either the optimized or generic versions and then calling it. >> Any opinions if it would be a reasonable tradeoff to have a better >> checksum with great performance on latest x86 CPUs and good >> performance on other architectures at the expense of having only ok >> performance on older AMD CPUs? > > The loss on AMD is offset by the increased performance on machines > where we can't vectorize, I'd say. +1 Old AMD machines won'
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 01:59:12PM -0700, Jeff Davis wrote: > On Wed, 2013-04-17 at 12:42 -0400, Bruce Momjian wrote: > > > AFAIK, there's currently no per-page checksum flag. Still, being only > > > able to go from checksummed to not-checksummed probably is for all > > > practical purposes the same as not being able to pg_upgrade at all. > > > Otherwise, why would people have enabled checksums in the first place? > > > > Good point, but it is _an_ option, at least. > > > > I would like to know the answer of how an upgrade from checksum to > > no-checksum would behave so I can modify pg_upgrade to allow it. > > Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if > someone has checksums enabled and wants to disable it, why is pg_upgrade > the right time to do that? Wouldn't it make more sense to allow them to > do that at any time? Well, right now, pg_upgrade is the only way you could potentially turn off checksums. You are right that we might eventually want a command, but my point is that we currently have a limitation in pg_upgrade that might not be necessary. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-17 at 16:58 +0100, Greg Stark wrote: > On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: > > Is there any way we can change the checksum algorithm in 9.4 > > *without* breaking pg_upgrade? > > Personally I think we're going to need a solution for page format > changes someday eventually > > What advantages are we postponing now to avoid it? > > * 32-bit checksums? > * Being able to enable/disable checksums? > > Anything else? I'm not sure that changing the page format is the most difficult part of enabling/disabling checksums. It's easy enough to have page header bits if the current information is not enough (and those bits were there, but Heikki requested their removal and I couldn't think of a concrete reason to keep them). Eventually, it would be nice to be able to break the page format and have more space for things like checksums (and probably a few other things, maybe some visibility-related optimizations). But that's a few years off and we don't have any real plan for that. What I wanted to accomplish with this patch is the simplest checksum mechanism that we could get that would be fast enough that many people would be able to use it. I expect it to be useful until we do decide to break the page format. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-17 at 12:42 -0400, Bruce Momjian wrote: > > AFAIK, there's currently no per-page checksum flag. Still, being only > > able to go from checksummed to not-checksummed probably is for all > > practical purposes the same as not being able to pg_upgrade at all. > > Otherwise, why would people have enabled checksums in the first place? > > Good point, but it is _an_ option, at least. > > I would like to know the answer of how an upgrade from checksum to > no-checksum would behave so I can modify pg_upgrade to allow it. Why? 9.3 pg_upgrade certainly doesn't need it. When we get to 9.4, if someone has checksums enabled and wants to disable it, why is pg_upgrade the right time to do that? Wouldn't it make more sense to allow them to do that at any time? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Bruce Momjian writes: > On Wed, Apr 17, 2013 at 01:29:18PM -0400, Tom Lane wrote: >> But having said that, I'm not sure why this would be pg_upgrade's >> problem. By definition, we do not want pg_upgrade running around >> looking at individual data pages. Therefore, whatever we might do >> about checksum algorithm changes would have to be something that can be >> managed on-the-fly by the newer server. > Well, my idea was that pg_upgrade would allow upgrades from old clusters > with the same checksum algorithm version, but not non-matching ones. > This would allow the checksum algorithm to be changed and force > pg_upgrade to fail. It's rather premature to be defining pg_upgrade's behavior for a situation that doesn't exist yet, and may very well never exist in that form. It seems more likely to me that we'd want to allow incremental algorithm changes, in which case pg_upgrade ought not do anything about this case anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 01:29:18PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > Uh, not sure how pg_upgrade would detect that as the version number is > > not stored in pg_controldata, e.g.: > > > Data page checksums: enabled/disabled > > That seems pretty shortsighted. The field probably ought to be defined > as containing a checksum algorithm ID number, not a boolean. > > But having said that, I'm not sure why this would be pg_upgrade's > problem. By definition, we do not want pg_upgrade running around > looking at individual data pages. Therefore, whatever we might do > about checksum algorithm changes would have to be something that can be > managed on-the-fly by the newer server. Well, my idea was that pg_upgrade would allow upgrades from old clusters with the same checksum algorithm version, but not non-matching ones. This would allow the checksum algorithm to be changed and force pg_upgrade to fail. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Bruce Momjian writes: > Uh, not sure how pg_upgrade would detect that as the version number is > not stored in pg_controldata, e.g.: > Data page checksums: enabled/disabled That seems pretty shortsighted. The field probably ought to be defined as containing a checksum algorithm ID number, not a boolean. But having said that, I'm not sure why this would be pg_upgrade's problem. By definition, we do not want pg_upgrade running around looking at individual data pages. Therefore, whatever we might do about checksum algorithm changes would have to be something that can be managed on-the-fly by the newer server. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 01:22:01PM -0400, Tom Lane wrote: > Greg Stark writes: > > On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: > >> Is there any way we can change the checksum algorithm in 9.4 > >> *without* breaking pg_upgrade? > > > Personally I think we're going to need a solution for page format > > changes someday eventually > > > What advantages are we postponing now to avoid it? > > Um, other than the ability to make a release? > > We aren't going to hold up 9.3 until that particular bit of pie in the > sky lands. Indeed I don't expect to see it available in the next couple > years either. When we were looking at that seriously, two or three > years ago, arbitrary page format changes looked *hard*. > > The idea of bumping the page format version number to signal a checksum > algorithm change might work though. Uh, not sure how pg_upgrade would detect that as the version number is not stored in pg_controldata, e.g.: Data page checksums: enabled/disabled Do we need to address this for 9.3? (Yuck) -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Greg Stark writes: > On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: >> Is there any way we can change the checksum algorithm in 9.4 >> *without* breaking pg_upgrade? > Personally I think we're going to need a solution for page format > changes someday eventually > What advantages are we postponing now to avoid it? Um, other than the ability to make a release? We aren't going to hold up 9.3 until that particular bit of pie in the sky lands. Indeed I don't expect to see it available in the next couple years either. When we were looking at that seriously, two or three years ago, arbitrary page format changes looked *hard*. The idea of bumping the page format version number to signal a checksum algorithm change might work though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 06:33:58PM +0200, Florian Pflug wrote: > > I was going to ask about the flexibility of pg_upgrade and checksums. > > Right now you have to match the old and new cluster checksum modes, but > > it seems it would be possible to allow pg_upgrade to go from checksum to > > no-checksum servers. Does the backend look at the pg_controldata setting, > > or at the page checksum flag? If the former, it seems pg_upgrade could > > run a no-checksum server just fine that had checksum information on > > its pages. This might give us more flexibility in changing the checksum > > algorithm in the future, i.e. you only lose checksum ability. > > AFAIK, there's currently no per-page checksum flag. Still, being only > able to go from checksummed to not-checksummed probably is for all > practical purposes the same as not being able to pg_upgrade at all. > Otherwise, why would people have enabled checksums in the first place? Good point, but it is _an_ option, at least. I would like to know the answer of how an upgrade from checksum to no-checksum would behave so I can modify pg_upgrade to allow it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 18:15 , Bruce Momjian wrote: > On Wed, Apr 17, 2013 at 05:28:06PM +0200, Florian Pflug wrote: >> However, you're right that time's running out. It'd be a shame though >> if we'd lock ourselves into CRC as the only available algorithm essentially >> forever. Is there any way we can change the checksum algorithm in 9.4 >> *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used >> for that - we could make version 5 mean "just like version 4, but with >> a different checksum algorithm". Since the layout wouldn't actually >> change, that'd be far easier to pull off than actually supporting multiple >> page layouts. If that works, then shipping 9.3 with CRC is probably >> the best solution. If not, we should see to it that something like Ants' >> parallel version of FNV or a smaller variant gets into 9.3 if at all possible, >> IMHO. > > I was going to ask about the flexibility of pg_upgrade and checksums. > Right now you have to match the old and new cluster checksum modes, but > it seems it would be possible to allow pg_upgrade to go from checksum to > no-checksum servers. Does the backend look at the pg_controldata setting, > or at the page checksum flag? If the former, it seems pg_upgrade could > run a no-checksum server just fine that had checksum information on > its pages. This might give us more flexibility in changing the checksum > algorithm in the future, i.e. you only lose checksum ability. AFAIK, there's currently no per-page checksum flag. Still, being only able to go from checksummed to not-checksummed probably is for all practical purposes the same as not being able to pg_upgrade at all. Otherwise, why would people have enabled checksums in the first place? best regards, Florian Pflug
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 05:28:06PM +0200, Florian Pflug wrote: > However, you're right that time's running out. It'd be a shame though > if we'd lock ourselves into CRC as the only available algorithm essentially > forever. Is there any way we can change the checksum algorithm in 9.4 > *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used > for that - we could make version 5 mean "just like version 4, but with > a different checksum algorithm". Since the layout wouldn't actually > change, that'd be far easier to pull off than actually supporting multiple > page layouts. If that works, then shipping 9.3 with CRC is probably > the best solution. If not, we should see to it that something like Ants' > parallel version of FNV or a smaller variant gets into 9.3 if at all possible, > IMHO. I was going to ask about the flexibility of pg_upgrade and checksums. Right now you have to match the old and new cluster checksum modes, but it seems it would be possible to allow pg_upgrade to go from checksum to no-checksum servers. Does the backend look at the pg_controldata setting, or at the page checksum flag? If the former, it seems pg_upgrade could run a no-checksum server just fine that had checksum information on its pages. This might give us more flexibility in changing the checksum algorithm in the future, i.e. you only lose checksum ability. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 4:28 PM, Florian Pflug wrote: > Is there any way we can change the checksum algorithm in 9.4 > *without* breaking pg_upgrade? Personally I think we're going to need a solution for page format changes someday eventually What advantages are we postponing now to avoid it? * 32-bit checksums? * Being able to enable/disable checksums? Anything else? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 16:47 , Ants Aasma wrote: > This made me remember, the issue I had was with high order bits, not > with low order ones, somehow I got them confused. The exact issue is > that the high order bits don't affect any bit lower than them. It's > easy to see that if you remember the shift and add nature of multiply. > Unfortunately XOR will not fix that. Neither will adding an offset > basis. This is the fundamental thing that is behind the not-so-great > uncorrelated bit error detection rate. Right. We could maybe fix that by extending the update step to t = s[j] ^ d[i,j] s[j] = (t * PRIME) ^ (t >> 1) or something like that. Shifting t instead of (t * PRIME) should help to reduce the performance impact, since a reordering CPU should be able to parallelize the multiply and the shift. Note though that I haven't really thought that through extensively - the general idea should be sound, but whether 1 is a good shifting amount I do not know. > While I understand that linearity is not a desirable property, I > couldn't think of a realistic case where it would hurt. I can see how > it can hurt checksums of variable length values, but for our fixed > buffer case it's definitely not so clear cut. On the pro side the > distributive property that is behind linearity allowed me to do final > aggregation in a tree form, performing the multiplies in parallel > instead of linearly. This adds up to the difference between 250 cycles > (64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5 > cycle addw). Given that the main loop is about 576 cycles, this is a > significant difference. > I wonder if we use 32bit FNV-1a's (the h = (h^v)*p variant) with > different offset-basis values, would it be enough to just XOR fold the > resulting values together. The algorithm looking like this: Hm, this will make the algorithm less resilient to some particular input permutations (e.g.
those which swap the 64*i-th and the (64*i+1)-th words), but those seem very unlikely to occur randomly. But if we're worried about that, we could use your linear combination method for the aggregation phase. > Speaking against this option is the fact that we will need to do CPU > detection at startup to make it fast on the x86 that support SSE4.1, > and the fact that AMD CPUs before 2011 will run it an order of > magnitude slower (but still faster than the best CRC). Hm, CPU detection isn't that hard, and given the speed at which Intel currently invents new instructions we'll end up going that route sooner or later anyway, I think. > Any opinions if it would be a reasonable tradeoff to have a better > checksum with great performance on latest x86 CPUs and good > performance on other architectures at the expense of having only ok > performance on older AMD CPUs? The loss on AMD is offset by the increased performance on machines where we can't vectorize, I'd say. best regards, Florian Pflug
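Why the extra `(t >> 1)` term helps can be seen with 16-bit toy values (illustrative Python; the multiplier below is an arbitrary odd constant, not the patch's actual PRIME): with the plain multiply-only step, a difference in the top input bit stays confined to the top bit forever, while the shifted term moves it downward one position per round.

```python
# Toy 16-bit demo of the "(t * PRIME) ^ (t >> 1)" mixing step.
# PRIME is an arbitrary odd 16-bit constant chosen for illustration.
PRIME = 0x01A7
M = 0xFFFF

def step_plain(s, d):
    # multiply-only step: input bit k can only affect output bits >= k
    return ((s ^ d) * PRIME) & M

def step_srl1(s, d):
    # Florian's proposal: the shifted term leaks high bits downward
    t = s ^ d
    return ((t * PRIME) ^ (t >> 1)) & M

def diff_after_rounds(step, rounds):
    # feed two streams that differ only in bit 15 of the first word,
    # identical data afterwards; return the resulting state difference
    a = b = 0x1234
    a, b = step(a, 0x0000), step(b, 0x8000)
    for _ in range(rounds - 1):
        a, b = step(a, 0x5555), step(b, 0x5555)
    return a ^ b
```

Because the multiplier is odd, the plain step maps a bit-15 difference to exactly a bit-15 difference every round, so it never detects which of the two streams it saw beyond that single bit; the srl-1 variant immediately spreads the difference into bit 14 and below on subsequent rounds.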
Re: [HACKERS] Enabling Checksums
On Apr17, 2013, at 17:09 , Bruce Momjian wrote: > As much as I love the idea of improving the algorithm, it is disturbing > we are discussing this so close to beta, with an algorithm that is under > analysis, with no (runtime) CPU detection, and in something that is > going to be embedded into our data page format. I can't even think of > another case where we do run-time CPU detection. We could still ship the new checksum algorithm with 9.3, but omit the SSE-optimized version, i.e. include only the plain C implementation. I think Ants mentioned somewhere that gcc does a pretty good job of vectorizing that, so people who really care (and who use GCC) could compile with -msse4.1 -funroll-loops -ftree-vectorize, and get performance close to that of a hand-coded SSE version. The important decision we're facing is which algorithm to use. I personally believe Ants is on the right track there - FNV or a variant thereof looks like a good choice to me, but the details have yet to be nailed down, I think. However, you're right that time's running out. It'd be a shame though if we'd lock ourselves into CRC as the only available algorithm essentially forever. Is there any way we can change the checksum algorithm in 9.4 *without* breaking pg_upgrade? Maybe pd_pagesize_version could be used for that - we could make version 5 mean "just like version 4, but with a different checksum algorithm". Since the layout wouldn't actually change, that'd be far easier to pull off than actually supporting multiple page layouts. If that works, then shipping 9.3 with CRC is probably the best solution. If not, we should see to it that something like Ants' parallel version of FNV or a smaller variant gets into 9.3 if at all possible, IMHO. best regards, Florian Pflug
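Florian's pd_pagesize_version idea amounts to a dispatch table keyed on the page version. A hypothetical sketch (illustrative Python, not backend code; the stand-in checksum functions and constants are invented for illustration, not taken from the patch):

```python
# Hypothetical sketch: version 4 keeps the original algorithm, version 5
# means "same page layout, different checksum algorithm". Both stand-in
# functions below are invented; only the dispatch shape matters.

def checksum_v4(data: bytes, blkno: int) -> int:
    return (sum(data) + blkno) & 0xFFFF            # stand-in for the 9.3 algorithm

def checksum_v5(data: bytes, blkno: int) -> int:
    h = 0x811C                                     # stand-in for an FNV-style variant
    for byte in data:
        h = ((h ^ byte) * 0x01A7) & 0xFFFF
    return h ^ blkno

CHECKSUM_BY_PAGE_VERSION = {4: checksum_v4, 5: checksum_v5}

def verify_page(page_version: int, data: bytes, blkno: int, stored: int) -> bool:
    # the layout is identical either way; only the algorithm differs
    return CHECKSUM_BY_PAGE_VERSION[page_version](data, blkno) == stored
```

Since the version already lives in each page header, no new on-disk state is needed, which is what makes this cheaper than supporting genuinely different page layouts.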
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 05:47:55PM +0300, Ants Aasma wrote: > The SSE4.1 implementation of this would be as fast as the last pat, > generic version will be faster and we avoid the linearity issue. By > using different offsets for each of the partial hashes we don't > directly suffer from commutativity of the final xor folding. By using > the xor-then-multiply variant the last values hashed have their bits > mixed before folding together. > > Speaking against this option is the fact that we will need to do CPU > detection at startup to make it fast on the x86 that support SSE4.1, > and the fact that AMD CPUs before 2011 will run it an order of > magnitude slower (but still faster than the best CRC). > > Any opinions if it would be a reasonable tradeoff to have a better > checksum with great performance on latest x86 CPUs and good > performance on other architectures at the expense of having only ok > performance on older AMD CPUs? > > Also, any good suggestions where should we do CPU detection when we go > this route? As much as I love the idea of improving the algorithm, it is disturbing we are discussing this so close to beta, with an algorithm that is under analysis, with no (runtime) CPU detection, and in something that is going to be embedded into our data page format. I can't even think of another case where we do run-time CPU detection. I am wondering if we need to tell users that pg_upgrade will not be possible if you enable page-level checksums, so we are not trapped with something we want to improve in 9.4. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Apr 17, 2013 at 2:26 AM, Florian Pflug wrote: >>> This raises two questions. First, why are there two primes? You could >>> just as well use a single prime q and set p=q^64 mod 2^16. You then >>> get >>> S = sum V[i,j] * q^(64*(64-i) + (64-j)) >>> = sum V[i,j] * q^(4096 - 64*(i-1) - j) >>> You get higher prime powers that way, but you can easily choose a prime >>> that yields distinct values mod 2^16 for exponents up to 16383. Your >>> PRIME2, for example, does. (It wraps around for 16384, i.e. >>> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since >>> 16384 is the Carmichael function's value at 2^16) >> >> The experimental detection rate is about the same if we use a single >> prime. But I think you have the analytical form wrong here. It should >> be given q = p: >> >> S = sum V[i,j] * p^(64-i) * p^(64-j) >> = sum V[i,j] * p^(64 - i + 64 - j) >> = sum V[i,j] * p^(128 - i - j) > Yeah, if you set q = p that's true. My suggestion was p=q^64 though... So it was, I guess it was too late here and I missed it... All things considered that is a good suggestion, if for nothing else, the generic implementation can be smaller this way. >>> Second, why does it use addition instead of XOR? It seems that FNV >>> usually XORs the terms together instead of adding them? >> >> Testing showed slightly better detection rate for adds. Intuitively I >> think it's because the carry introduces some additional mixing. > > Hm, but OTOH it makes S linear in V, i.e. if you have two inputs > V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then > S' = S*m. The second property is quite undesirable, I think. Assume > all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit > positions 0..(k-1). Then, due to linearity, S is also divisible by > 2^k, i.e. also has no ones before the k-th bit. This means, for example > that if you hash values which all have their lowest bit cleared, > you get only 2^15 distinct hash values.
If they all have the two > lowest bits cleared, you get only 2^14 distinct values, and so on… > > Generally, linearity doesn't seem to be a property that one wants > in a hash I think, so my suggestion is to stick to XOR. This made me remember, the issue I had was with high order bits, not with low order ones, somehow I got them confused. The exact issue is that the high order bits don't affect any bit lower than them. It's easy to see that if you remember the shift and add nature of multiply. Unfortunately XOR will not fix that. Neither will adding an offset basis. This is the fundamental thing that is behind the not-so-great uncorrelated bit error detection rate. While I understand that linearity is not a desirable property, I couldn't think of a realistic case where it would hurt. I can see how it can hurt checksums of variable length values, but for our fixed buffer case it's definitely not so clear cut. On the pro side the distributive property that is behind linearity allowed me to do final aggregation in a tree form, performing the multiplies in parallel instead of linearly. This adds up to the difference between 250 cycles (64*(3 cycle IMUL + 1 cycle XOR)) and 25 cycles (4*5 cycle pmullw + 5 cycle addw). Given that the main loop is about 576 cycles, this is a significant difference. >>> Here, btw, is a page on FNV hashing. It mentions a few rules for >>> picking suitable primes >>> >>> http://www.isthe.com/chongo/tech/comp/fnv >> >> Unfortunately the rules don't apply here because of the hash size. > > Yeah :-(. > > I noticed that their 32-bit prime only has a single one outside > the first 16 bits. Maybe we can take advantage of that and use a > 32-bit state while still providing decent performance on machines > without a 32-bit x 32-bit -> 32-bit multiply instruction? 
Looking at the Power instruction set, a 32bit mul by the FNV prime would
look like this:

    vmulouh   tmp1, hash, prime
    vmladduhm tmp1, hash, prime<<16
    vslw      tmp2, hash, 24
    vadduwm   hash, tmp1, tmp2

That is 4 instructions to multiply 4 values. Depending on the specific
execution ports on the processor it might be faster or slower than the
scalar version, but not by a whole lot. The main benefit would be that
the intermediate state could be held in registers.

> If we lived in an Intel-only world, I'd suggest going with a
> 32-bit state, since SSE4.1 support is *very* wide-spread already -
> the last CPUs without it came out over 5 years ago, I think.
> (Core2 and later support SSE4.1, and some later Core1 do too)
>
> But unfortunately things look bleak even for other x86
> implementations - AMD supports SSE4.1 only starting with
> Bulldozer, which came out 2011 or so I believe. Leaving the x86
> realm, it seems that only ARM's NEON provides the instructions
> we'd need - AltiVec seems to support only 16-bit multiplies,
> and from what some quick googling brought up, MIPS and SPARC
> SIMD instructions look no better..
>
> OTOH, chances are that nobody will ever d
Re: [HACKERS] Enabling Checksums
On Apr16, 2013, at 23:41 , Ants Aasma wrote:
> On Tue, Apr 16, 2013 at 11:20 PM, Florian Pflug wrote:
>> On Apr13, 2013, at 17:14 , Ants Aasma wrote:
>>> Based on current analysis, it is particularly good at detecting single
>>> bit errors, as good at detecting burst errors as can be expected from
>>> 16 bits and not horrible at detecting burst writes of zeroes. It is
>>> quite bad at detecting multiple uncorrelated single bit errors and
>>> extremely bad at detecting repeating patterns of errors in low order
>>> bits.
>>
>> I've read the patch and tried to understand why it's that bad at
>> detecting repeating patterns of errors in low order bits, and to see
>> if there might be a way to fix that without too much of a performance
>> impact.
>>
>> Here's what I gather the algorithm does:
>>
>> It treats the input data, a page of L bytes, as a Nx64 matrix V
>> of 16-bit quantities (N = L/64, obviously).
>> It then first computes (using two primes p (PRIME1) and q (PRIME2))
>>
>>   S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
>>     + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
>>     + …
>>     + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
>>     (mod 2^16)
>>   = sum V[i,j]*p^(64-i)*q^(64-j)
>>
>> Note that it does that by first computing the row-wise sums without
>> the q^i coefficient, and then (in what the code calls the aggregation
>> phase) combines those row-wise sums into a total, adding the q^i-
>> coefficients along the way.
>>
>> The final hash value is then
>>
>>   H = S * p + B * q mod 2^16
>>
>> where B is a salt value intended to detect swapped pages (currently
>> B is simply the page index)
>
> Great job analyzing the analytic form of the algorithm, and sorry you
> had to do it instead of finding it in the documentation.

No problem, glad if I can help!

>> This raises two questions. First, why are there two primes? You could
>> just as well use a single prime q and set p=q^64 mod 2^16. You then get
>>
>>   S = sum V[i,j] * q^(64*(64-i) + (64-j))
>>     = sum V[i,j] * q^(4096 - 64*(i-1) - j)
>>
>> You get higher prime powers that way, but you can easily choose a prime
>> that yields distinct values mod 2^16 for exponents up to 16383. Your
>> PRIME2, for example, does. (It wraps around for 16384, i.e.
>> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
>> 16384 is the Carmichael function's value at 2^16)
>
> The experimental detection rate is about the same if we use a single
> prime. But I think you have the analytical form wrong here. It should
> be, given q = p:
>
>   S = sum V[i,j] * p^(64-i) * p^(64-j)
>     = sum V[i,j] * p^(64 - i + 64 - j)
>     = sum V[i,j] * p^(128 - i - j)

Yeah, if you set q = p that's true. My suggestion was p=q^64 though...

>> Second, why does it use addition instead of XOR? It seems that FNV
>> usually XORs the terms together instead of adding them?
>
> Testing showed a slightly better detection rate for adds. Intuitively I
> think it's because the carry introduces some additional mixing.

Hm, but OTOH it makes S linear in V, i.e. if you have two inputs
V1,V2 and V = V1 + V2, then S = S1 + S2. Also, if V' = V*m, then
S' = S*m. The second property is quite undesirable, I think. Assume
all the V[i,j] are divisible by 2^k, i.e. have zeros at all bit
positions 0..(k-1). Then, due to linearity, S is also divisible by
2^k, i.e. also has no ones before the k-th bit. This means, for example,
that if you hash values which all have their lowest bit cleared,
you get only 2^15 distinct hash values. If they all have the two
lowest bits cleared, you get only 2^14 distinct values, and so on…

Generally, linearity doesn't seem to be a property that one wants
in a hash I think, so my suggestion is to stick to XOR.

>> Here, btw, is a page on FNV hashing. It mentions a few rules for
>> picking suitable primes
>>
>> http://www.isthe.com/chongo/tech/comp/fnv
>
> Unfortunately the rules don't apply here because of the hash size.

Yeah :-(.

I noticed that their 32-bit prime only has a single one outside
the first 16 bits. Maybe we can take advantage of that and use a
32-bit state while still providing decent performance on machines
without a 32-bit x 32-bit -> 32-bit multiply instruction?

If we lived in an Intel-only world, I'd suggest going with a
32-bit state, since SSE4.1 support is *very* wide-spread already -
the last CPUs without it came out over 5 years ago, I think.
(Core2 and later support SSE4.1, and some later Core1 do too)

But unfortunately things look bleak even for other x86
implementations - AMD supports SSE4.1 only starting with
Bulldozer, which came out 2011 or so I believe. Leaving the x86
realm, it seems that only ARM's NEON provides the instructions
we'd need - AltiVec seems to support only 16-bit multiplies,
and from what some quick googling brought up, MIPS and SPARC
SIMD instructions look no better..

OTOH, chances are that nobody will ever
Re: [HACKERS] Enabling Checksums
On Tue, Apr 16, 2013 at 11:20 PM, Florian Pflug wrote:
> On Apr13, 2013, at 17:14 , Ants Aasma wrote:
>> Based on current analysis, it is particularly good at detecting single
>> bit errors, as good at detecting burst errors as can be expected from
>> 16 bits and not horrible at detecting burst writes of zeroes. It is
>> quite bad at detecting multiple uncorrelated single bit errors and
>> extremely bad at detecting repeating patterns of errors in low order
>> bits.
>
> I've read the patch and tried to understand why it's that bad at
> detecting repeating patterns of errors in low order bits, and to see
> if there might be a way to fix that without too much of a performance
> impact.
>
> Here's what I gather the algorithm does:
>
> It treats the input data, a page of L bytes, as a Nx64 matrix V
> of 16-bit quantities (N = L/64, obviously).
> It then first computes (using two primes p (PRIME1) and q (PRIME2))
>
>   S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
>     + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
>     + …
>     + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
>     (mod 2^16)
>   = sum V[i,j]*p^(64-i)*q^(64-j)
>
> Note that it does that by first computing the row-wise sums without
> the q^i coefficient, and then (in what the code calls the aggregation
> phase) combines those row-wise sums into a total, adding the q^i-
> coefficients along the way.
>
> The final hash value is then
>
>   H = S * p + B * q mod 2^16
>
> where B is a salt value intended to detect swapped pages (currently
> B is simply the page index)

Great job analyzing the analytic form of the algorithm, and sorry you
had to do it instead of finding it in the documentation.

> This raises two questions. First, why are there two primes? You could
> just as well use a single prime q and set p=q^64 mod 2^16. You then get
>
>   S = sum V[i,j] * q^(64*(64-i) + (64-j))
>     = sum V[i,j] * q^(4096 - 64*(i-1) - j)
>
> You get higher prime powers that way, but you can easily choose a prime
> that yields distinct values mod 2^16 for exponents up to 16383. Your
> PRIME2, for example, does. (It wraps around for 16384, i.e.
> PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
> 16384 is the Carmichael function's value at 2^16)

The experimental detection rate is about the same if we use a single
prime. But I think you have the analytical form wrong here. It should
be, given q = p:

  S = sum V[i,j] * p^(64-i) * p^(64-j)
    = sum V[i,j] * p^(64 - i + 64 - j)
    = sum V[i,j] * p^(128 - i - j)

This makes the whole matrix symmetric. While I can't think of any real
world errors that would exhibit symmetry in this 64x64 matrix, it seemed
better to me to avoid the issue altogether and use different primes. IIRC
it helped a lot for the case of page[i] = i & 0xFF.

> Second, why does it use addition instead of XOR? It seems that FNV
> usually XORs the terms together instead of adding them?

Testing showed a slightly better detection rate for adds. Intuitively I
think it's because the carry introduces some additional mixing.

> Regarding the bad behaviour for multiple low-bit errors - can you
> explain why it behaves badly in that case? I currently fail to see
> why that would be. I *can* see that the lowest bit of the hash depends
> only on the lowest bit of the input words, but as long as the lowest
> bits of the input words also affect other bits of the hash, that shouldn't
> matter. Which I think they do, but I might be missing something...

Looks like you're right. I was somehow concentrating only on how the
lowest bits depend on the input.

> Here, btw, is a page on FNV hashing. It mentions a few rules for
> picking suitable primes
>
> http://www.isthe.com/chongo/tech/comp/fnv

Unfortunately the rules don't apply here because of the hash size.

Thanks for your analysis.
I will do my best to get this all into a document and will do some more analysis to see if I can come up with some kind of proof for the error cases. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Ants Aasma writes: > On Tue, Apr 16, 2013 at 5:05 PM, Simon Riggs wrote: >> My only review comments are to ask for some explanation of the magic >> numbers... > The specific values used are mostly magic to me too. As mentioned in a > short sentence in the patch, the values are experimentally chosen, > guided by some intuition about what good values should look like. There actually is quite a lot of theory out there about this sort of thing. If we are inventing our own checksum function then We're Doing It Wrong. We should be adopting an existing, proven function. "Experimentally derived" is about the worst recommendation I can think of in this area. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Tue, Apr 16, 2013 at 5:05 PM, Simon Riggs wrote:
> On 9 April 2013 03:35, Ants Aasma wrote:
>> On Fri, Apr 5, 2013 at 9:39 PM, Ants Aasma wrote:
>>> Unless somebody tells me not to waste my time I'll go ahead and come
>>> up with a workable patch by Monday.
>>
>> And here you go. I decided to be verbose with the comments as it's
>> easier to delete a comment than to write one. I also left in a huge jumble
>> of macros to calculate the contents of a helper var during compile
>> time. This can easily be replaced with the calculated values once we
>> settle on specific parameters.
>>
>> Currently only x86-64 is implemented. 32bit x86 would be mostly a
>> copy-and-paste job, replacing 64bit pointer registers with 32bit ones.
>> For other platforms the simplest way would be to use a vectorizing
>> compiler on the generic variant. -funroll-loops -ftree-vectorize is
>> enough on gcc.
>>
>> Quick bench results on the worst case workload:
>> master no checksums: tps = 15.561848
>> master with checksums: tps = 1.695450
>> simd checksums: tps = 14.602698
>
> Numbers look very good on this. Well done.
>
> I support the direction of this, but I don't think I'm sufficiently
> well qualified to verify that the code does what it should and/or fix
> it if it breaks. If others want to see this happen you'll need to
> pitch in.
>
> My only review comments are to ask for some explanation of the magic
> numbers...
>
> #define CSUM_PRIME1 0x49
> #define CSUM_PRIME2 0x986b
> #define CSUM_TRUNC 65521
>
> Where magic means a level of technology far above my own
> understanding, and yet no (or not enough) code comments to assist me.

The specific values used are mostly magic to me too. As mentioned in a
short sentence in the patch, the values are experimentally chosen,
guided by some intuition about what good values should look like.
Basically the methodology for the choice was that I took all the pages from a 2.8GB test database, and then for each page introduced a bunch of common errors and observed how many errors were undetected. The main observations were: 1) the exact value of the primes doesn't really matter for detection efficiency. 2) values with a non-uniform distribution of zeroes and ones seem to work slightly better. I'll write up a readme of why the values are how they are and with some more explanation of the algorithm. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Apr13, 2013, at 17:14 , Ants Aasma wrote:
> Based on current analysis, it is particularly good at detecting single
> bit errors, as good at detecting burst errors as can be expected from
> 16 bits and not horrible at detecting burst writes of zeroes. It is
> quite bad at detecting multiple uncorrelated single bit errors and
> extremely bad at detecting repeating patterns of errors in low order
> bits.

I've read the patch and tried to understand why it's that bad at
detecting repeating patterns of errors in low order bits, and to see
if there might be a way to fix that without too much of a performance
impact.

Here's what I gather the algorithm does:

It treats the input data, a page of L bytes, as a Nx64 matrix V
of 16-bit quantities (N = L/64, obviously).
It then first computes (using two primes p (PRIME1) and q (PRIME2))

  S = V[1,1]*p^63*q^63 + V[1,2]*p^63*q^62 + … + V[1,64]*p^63*q^0
    + V[2,1]*p^62*q^63 + V[2,2]*p^62*q^62 + … + V[2,64]*p^62*q^0
    + …
    + V[N,1]*p^0 *q^63 + V[N,2]*p^0 *q^62 + … + V[N,64]*p^0 *q^0
    (mod 2^16)
  = sum V[i,j]*p^(64-i)*q^(64-j)

Note that it does that by first computing the row-wise sums without
the q^i coefficient, and then (in what the code calls the aggregation
phase) combines those row-wise sums into a total, adding the q^i-
coefficients along the way.

The final hash value is then

  H = S * p + B * q mod 2^16

where B is a salt value intended to detect swapped pages (currently
B is simply the page index)

This raises two questions. First, why are there two primes? You could
just as well use a single prime q and set p=q^64 mod 2^16. You then get

  S = sum V[i,j] * q^(64*(64-i) + (64-j))
    = sum V[i,j] * q^(4096 - 64*(i-1) - j)

You get higher prime powers that way, but you can easily choose a prime
that yields distinct values mod 2^16 for exponents up to 16383. Your
PRIME2, for example, does. (It wraps around for 16384, i.e.
PRIME2^16384 = 1 mod 2^16, but that's true for every possible prime since
16384 is the Carmichael function's value at 2^16)

Second, why does it use addition instead of XOR? It seems that FNV
usually XORs the terms together instead of adding them?

Regarding the bad behaviour for multiple low-bit errors - can you
explain why it behaves badly in that case? I currently fail to see
why that would be. I *can* see that the lowest bit of the hash depends
only on the lowest bit of the input words, but as long as the lowest
bits of the input words also affect other bits of the hash, that shouldn't
matter. Which I think they do, but I might be missing something...

Here, btw, is a page on FNV hashing. It mentions a few rules for
picking suitable primes

http://www.isthe.com/chongo/tech/comp/fnv

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 16 April 2013 20:27, Tom Lane wrote: > Greg Stark writes: >> On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs wrote: >>> * WAL checksum is not used as the sole basis for end-of-WAL discovery. >>> We reuse the WAL files, so the prev field in each WAL record shows >>> what the previous end of WAL was. Hence if the WAL checksums give a >>> false positive we still have a double check that the data really is >>> wrong. It's unbelievable that you'd get a false positive and then have >>> the prev field match as well, even though it was the genuine >>> end-of-WAL. > >> This is kind of true and kind of not true. If a system loses power >> while writing lots of data to WAL then the blocks at the end of the >> WAL might not be written out in order. Everything since the last log >> sync might be partly written out and partly not written out. That's >> the case where the checksum is critical. The beginning of a record >> could easily be written out including xl_prev and the end of the >> record not written. 1/64,000 power losses would then end up with an >> assertion failure or corrupt database. > > I have a hard time believing that it's a good idea to add checksums to > data pages and at the same time weaken our ability to detect WAL > corruption. So this seems to me to be going in the wrong direction. > What's it buying for us anyway? A few CPU cycles saved during WAL > generation? That's probably swamped by the other costs of writing WAL, > especially if you're using replication. This part of the thread is dead now; I said "let's drop this idea" on 13 April. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Greg Stark writes: > On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs wrote: >> * WAL checksum is not used as the sole basis for end-of-WAL discovery. >> We reuse the WAL files, so the prev field in each WAL record shows >> what the previous end of WAL was. Hence if the WAL checksums give a >> false positive we still have a double check that the data really is >> wrong. It's unbelievable that you'd get a false positive and then have >> the prev field match as well, even though it was the genuine >> end-of-WAL. > This is kind of true and kind of not true. If a system loses power > while writing lots of data to WAL then the blocks at the end of the > WAL might not be written out in order. Everything since the last log > sync might be partly written out and partly not written out. That's > the case where the checksum is critical. The beginning of a record > could easily be written out including xl_prev and the end of the > record not written. 1/64,000 power losses would then end up with an > assertion failure or corrupt database. I have a hard time believing that it's a good idea to add checksums to data pages and at the same time weaken our ability to detect WAL corruption. So this seems to me to be going in the wrong direction. What's it buying for us anyway? A few CPU cycles saved during WAL generation? That's probably swamped by the other costs of writing WAL, especially if you're using replication. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 9:42 PM, Simon Riggs wrote: > * WAL checksum is not used as the sole basis for end-of-WAL discovery. > We reuse the WAL files, so the prev field in each WAL record shows > what the previous end of WAL was. Hence if the WAL checksums give a > false positive we still have a double check that the data really is > wrong. It's unbelievable that you'd get a false positive and then have > the prev field match as well, even though it was the genuine > end-of-WAL. This is kind of true and kind of not true. If a system loses power while writing lots of data to WAL then the blocks at the end of the WAL might not be written out in order. Everything since the last log sync might be partly written out and partly not written out. That's the case where the checksum is critical. The beginning of a record could easily be written out including xl_prev and the end of the record not written. 1/64,000 power losses would then end up with an assertion failure or corrupt database. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 9 April 2013 03:35, Ants Aasma wrote:
> On Fri, Apr 5, 2013 at 9:39 PM, Ants Aasma wrote:
>> Unless somebody tells me not to waste my time I'll go ahead and come
>> up with a workable patch by Monday.
>
> And here you go. I decided to be verbose with the comments as it's
> easier to delete a comment than to write one. I also left in a huge jumble
> of macros to calculate the contents of a helper var during compile
> time. This can easily be replaced with the calculated values once we
> settle on specific parameters.
>
> Currently only x86-64 is implemented. 32bit x86 would be mostly a
> copy-and-paste job, replacing 64bit pointer registers with 32bit ones.
> For other platforms the simplest way would be to use a vectorizing
> compiler on the generic variant. -funroll-loops -ftree-vectorize is
> enough on gcc.
>
> Quick bench results on the worst case workload:
> master no checksums: tps = 15.561848
> master with checksums: tps = 1.695450
> simd checksums: tps = 14.602698

Numbers look very good on this. Well done.

I support the direction of this, but I don't think I'm sufficiently
well qualified to verify that the code does what it should and/or fix
it if it breaks. If others want to see this happen you'll need to
pitch in.

My only review comments are to ask for some explanation of the magic
numbers...

#define CSUM_PRIME1 0x49
#define CSUM_PRIME2 0x986b
#define CSUM_TRUNC 65521

Where magic means a level of technology far above my own
understanding, and yet no (or not enough) code comments to assist me.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 06:14:28PM +0300, Ants Aasma wrote: > > CRCs are known to be good for that sort of thing; it's what they were > > designed for. I'd like to see some evidence that any substitute > > algorithm has similar properties. Without that, I'm going to vote > > against this idea. > > Sorry for creating confusion here by playing fast and loose with the > terminology. We are not talking about that hash function at all. What > we are talking about here is Fowler-Noll-Vo-ish > (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) > hash function that is restructured to be parallelisable with SIMD > instructions with the explicit goal of being as fast as possible. The > resulting hash function is roughly two orders of magnitude faster than > 1-byte-at-a-time CRC32 currently in use. Performance is about > comparable with optimized fixed size memcpy running in cache. > > Based on current analysis, it is particularly good at detecting single > bit errors, as good at detecting burst errors as can be expected from > 16 bits and not horrible at detecting burst writes of zeroes. It is > quite bad at detecting multiple uncorrelated single bit errors and > extremely bad at detecting repeating patterns of errors in low order > bits. > > All in all I would say that the performance is worth the loss in > detection capability as we are not talking about using the checksum to > prove correctness. Agreed. It would be good to get these details into the patch so others are not confused in the future. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 6:26 PM, Andres Freund wrote: >> All in all I would say that the performance is worth the loss in >> detection capability as we are not talking about using the checksum to >> prove correctness. > > Is it actually a loss compared to our 16bit flavor of crc32 we now use? > I didn't think so far from the properties you described? I would have to run the testsuite I made to see how much, but I would presume so. The algorithm relies on multiplication for bit diffusion and multiply has lousy diffusion on low order bits, exactly no diffusion for the lowest bit. And for 16bit values the low order bits are quite a large fraction of the total hash. If we allow for operations that are not in SSE2 then there are a few things that we could do to make the hash quality better without affecting performance. The pmulld instruction (SSE4.1) would allow for 32bit values in the intermediate state. And pshufb (SSSE3) would allow us to swap high and low bytes, introducing additional mixing. On Intel Sandy Bridge, if I understand the microarchitecture correctly, either change would be basically free, but not both, because pshufb and paddw use execution ports 0 and 5, while pmulld needs port 0 and pmullw needs port 1. Currently the main loop takes 1 cycle per 16byte chunk, so any changes introducing conflicts there would cut the performance in half. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 2013-04-13 18:14:28 +0300, Ants Aasma wrote: > On Sat, Apr 13, 2013 at 5:58 PM, Tom Lane wrote: > > Andres Freund writes: > >> On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: > >>> As I understand it, SIMD is just a CPU-optimized method for producing a > >>> CRC checksum. Is that right? Does it produce the same result as a > >>> non-CPU-optimized CRC calculation? > > > >> No we are talking about a different algorithm that results in different > >> results, thats why its important to choose now since we can't change it > >> later without breaking pg_upgrade in further releases. > >> http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 > > > > [ squint... ] We're talking about a *cryptographic* hash function? > > Why in the world was this considered a good idea for page checksums? > > > > In the first place, it's probably not very fast compared to some > > alternatives, and in the second place, the criteria by which people > > would consider it a good crypto hash function have approximately nothing > > to do with what we need for a checksum function. What we want for a > > checksum function is high probability of detection of common hardware > > failure modes, such as burst errors and all-zeroes. This is > > particularly critical when we're going with only a 16-bit checksum --- > > the probabilities need to be skewed in the right direction, or it's not > > going to be all that terribly useful. > > > > CRCs are known to be good for that sort of thing; it's what they were > > designed for. I'd like to see some evidence that any substitute > > algorithm has similar properties. Without that, I'm going to vote > > against this idea. > > Sorry for creating confusion here by playing fast and loose with the > terminology. We are not talking about that hash function at all. 
> What we are talking about here is a Fowler-Noll-Vo-ish > (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) > hash function that is restructured to be parallelisable with SIMD > instructions with the explicit goal of being as fast as possible. The > resulting hash function is roughly two orders of magnitude faster than > 1-byte-at-a-time CRC32 currently in use. Performance is about > comparable with optimized fixed size memcpy running in cache. Gah, one shouldn't look too quickly for a reference, sorry. > Based on current analysis, it is particularly good at detecting single > bit errors, as good at detecting burst errors as can be expected from > 16 bits and not horrible at detecting burst writes of zeroes. It is > quite bad at detecting multiple uncorrelated single bit errors and > extremely bad at detecting repeating patterns of errors in low order > bits. > All in all I would say that the performance is worth the loss in > detection capability as we are not talking about using the checksum to > prove correctness. Is it actually a loss compared to our 16bit flavor of crc32 we now use? I didn't think so far from the properties you described? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 5:58 PM, Tom Lane wrote: > Andres Freund writes: >> On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: >>> As I understand it, SIMD is just a CPU-optimized method for producing a >>> CRC checksum. Is that right? Does it produce the same result as a >>> non-CPU-optimized CRC calculation? > >> No we are talking about a different algorithm that results in different >> results, thats why its important to choose now since we can't change it >> later without breaking pg_upgrade in further releases. >> http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 > > [ squint... ] We're talking about a *cryptographic* hash function? > Why in the world was this considered a good idea for page checksums? > > In the first place, it's probably not very fast compared to some > alternatives, and in the second place, the criteria by which people > would consider it a good crypto hash function have approximately nothing > to do with what we need for a checksum function. What we want for a > checksum function is high probability of detection of common hardware > failure modes, such as burst errors and all-zeroes. This is > particularly critical when we're going with only a 16-bit checksum --- > the probabilities need to be skewed in the right direction, or it's not > going to be all that terribly useful. > > CRCs are known to be good for that sort of thing; it's what they were > designed for. I'd like to see some evidence that any substitute > algorithm has similar properties. Without that, I'm going to vote > against this idea. Sorry for creating confusion here by playing fast and loose with the terminology. We are not talking about that hash function at all. What we are talking about here is Fowler-Noll-Vo-ish (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) hash function that is restructured to be parallelisable with SIMD instructions with the explicit goal of being as fast as possible. 
The resulting hash function is roughly two orders of magnitude faster than 1-byte-at-a-time CRC32 currently in use. Performance is about comparable with optimized fixed size memcpy running in cache. Based on current analysis, it is particularly good at detecting single bit errors, as good at detecting burst errors as can be expected from 16 bits and not horrible at detecting burst writes of zeroes. It is quite bad at detecting multiple uncorrelated single bit errors and extremely bad at detecting repeating patterns of errors in low order bits. All in all I would say that the performance is worth the loss in detection capability as we are not talking about using the checksum to prove correctness. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
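For readers wanting a concrete picture of the approach Ants describes: the sketch below is not the actual patch, just a minimal illustration of the general idea, i.e. an FNV-1a-style multiply/xor hash split into independent lanes so the compiler can vectorize the inner loop, with a final fold to 16 bits. All constants and names here are illustrative assumptions, not PostgreSQL code.

```c
#include <stdint.h>
#include <stddef.h>

#define N_LANES   32          /* independent accumulators -> SIMD-friendly */
#define FNV_PRIME 16777619u   /* 32-bit FNV prime, an illustrative choice */

/*
 * Illustration only, not PostgreSQL's page-checksum code: hash the input
 * as N_LANES interleaved FNV-1a-style streams, then mix the lanes into
 * one 32-bit value.  nwords must be a multiple of N_LANES.
 */
static uint32_t
parallel_fnv_hash32(const uint32_t *data, size_t nwords)
{
    uint32_t lanes[N_LANES];
    uint32_t result = 0;
    size_t   i, j;

    for (j = 0; j < N_LANES; j++)
        lanes[j] = 2166136261u + (uint32_t) j;  /* per-lane offset basis */

    /* The lanes do not depend on each other, so this loop vectorizes. */
    for (i = 0; i < nwords; i += N_LANES)
        for (j = 0; j < N_LANES; j++)
        {
            lanes[j] ^= data[i + j];
            lanes[j] *= FNV_PRIME;
        }

    for (j = 0; j < N_LANES; j++)
        result = result * FNV_PRIME + lanes[j];
    return result;
}

/* Fold to the 16 bits that fit in the page header's checksum field. */
static uint16_t
fold_checksum16(uint32_t hash)
{
    return (uint16_t) (hash ^ (hash >> 16));
}
```

Because the per-lane multiplier is odd (hence invertible mod 2^32), changing any single input word provably changes the 32-bit result; only the final fold to 16 bits can lose that distinction, which is consistent with the detection trade-offs Ants describes above.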
Re: [HACKERS] Enabling Checksums
On 2013-04-13 10:58:53 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: > >> As I understand it, SIMD is just a CPU-optimized method for producing a > >> CRC checksum. Is that right? Does it produce the same result as a > >> non-CPU-optimized CRC calculation? > > > No we are talking about a different algorithm that results in different > > results, thats why its important to choose now since we can't change it > > later without breaking pg_upgrade in further releases. > > http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 > > [ squint... ] We're talking about a *cryptographic* hash function? > Why in the world was this considered a good idea for page checksums? In Ants' implementation it's a heck of a lot faster than any CRC implementation we have seen so far on relatively large blocks (like pages). pgbench results: ca+csw_uxo-frkuzl0yzs0wsdl8dipzv-ugmvyn-yv45sgub...@mail.gmail.com byte/cycle comparison: CA+CSw_su1fopLNBz1NAfkSNw4_=gv+5pf0kdlqmpvukw1q4...@mail.gmail.com > In the first place, it's probably not very fast compared to some > alternatives, and in the second place, the criteria by which people > would consider it a good crypto hash function have approximately nothing > to do with what we need for a checksum function. What we want for a > checksum function is high probability of detection of common hardware > failure modes, such as burst errors and all-zeroes. This is > particularly critical when we're going with only a 16-bit checksum --- > the probabilities need to be skewed in the right direction, or it's not > going to be all that terribly useful. > > CRCs are known to be good for that sort of thing; it's what they were > designed for. I'd like to see some evidence that any substitute > algorithm has similar properties. Without that, I'm going to vote > against this idea. Ants has done some analysis on this, like CA+CSw_tMoA85e=1vs4omjzjg2mr_hulikovpd80dp5rurds...@mail.gmail.com.
That doesn't look bad to me, and unless I am missing something it's better than our CRC truncated to 16 bits. So while I would say it's not 100% researched, there has been a rather detailed investigation by Ants - I am rather impressed. My biggest doubt so far is the reliance on inline assembly for top performance on x86-64, with a generic implementation otherwise that is only really fast with appropriate compiler flags. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
Andres Freund writes: > On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: >> As I understand it, SIMD is just a CPU-optimized method for producing a >> CRC checksum. Is that right? Does it produce the same result as a >> non-CPU-optimized CRC calculation? > No we are talking about a different algorithm that results in different > results, thats why its important to choose now since we can't change it > later without breaking pg_upgrade in further releases. > http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 [ squint... ] We're talking about a *cryptographic* hash function? Why in the world was this considered a good idea for page checksums? In the first place, it's probably not very fast compared to some alternatives, and in the second place, the criteria by which people would consider it a good crypto hash function have approximately nothing to do with what we need for a checksum function. What we want for a checksum function is high probability of detection of common hardware failure modes, such as burst errors and all-zeroes. This is particularly critical when we're going with only a 16-bit checksum --- the probabilities need to be skewed in the right direction, or it's not going to be all that terribly useful. CRCs are known to be good for that sort of thing; it's what they were designed for. I'd like to see some evidence that any substitute algorithm has similar properties. Without that, I'm going to vote against this idea. regards, tom lane
Re: [HACKERS] Enabling Checksums
On 2013-04-13 09:14:26 -0400, Bruce Momjian wrote: > On Fri, Apr 12, 2013 at 02:38:27PM -0700, Jeff Davis wrote: > > In general, we have more flexibility with WAL because there is no > > upgrade issue. It would be nice to share code with the data page > > checksum algorithm; but really we should just use whatever offers the > > best trade-off in terms of complexity, performance, and error detection > > rate. > > > > I don't think we need to decide all of this right now. Personally, I'm > > satisfied having SIMD checksums on data pages now and leaving WAL > > optimization until later. > As I understand it, SIMD is just a CPU-optimized method for producing a > CRC checksum. Is that right? Does it produce the same result as a > non-CPU-optimized CRC calculation? No, we are talking about a different algorithm that produces different results; that's why it's important to choose now, since we can't change it later without breaking pg_upgrade in future releases. http://en.wikipedia.org/wiki/SIMD_%28hash_function%29 Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 02:38:27PM -0700, Jeff Davis wrote: > In general, we have more flexibility with WAL because there is no > upgrade issue. It would be nice to share code with the data page > checksum algorithm; but really we should just use whatever offers the > best trade-off in terms of complexity, performance, and error detection > rate. > > I don't think we need to decide all of this right now. Personally, I'm > satisfied having SIMD checksums on data pages now and leaving WAL > optimization until later. As I understand it, SIMD is just a CPU-optimized method for producing a CRC checksum. Is that right? Does it produce the same result as a non-CPU-optimized CRC calculation? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 12 April 2013 23:21, Ants Aasma wrote: >> In general, we have more flexibility with WAL because there is no >> upgrade issue. It would be nice to share code with the data page >> checksum algorithm; but really we should just use whatever offers the >> best trade-off in terms of complexity, performance, and error detection >> rate. >> >> I don't think we need to decide all of this right now. Personally, I'm >> satisfied having SIMD checksums on data pages now and leaving WAL >> optimization until later. > > +1 OK, let's drop that idea then. SIMD checksums for 16-bit page checksums only in this release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Sat, Apr 13, 2013 at 12:38 AM, Jeff Davis wrote: > On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote: >> I think this is a bad idea. It complicates the WAL format significantly. >> Simon's patch didn't include the changes to recovery to validate the >> checksum, but I suspect it would be complicated. And it reduces the >> error-detection capability of WAL recovery. Keep in mind that unlike >> page checksums, which are never expected to fail, so even if we miss a >> few errors it's still better than nothing, the WAL checkum is used to >> detect end-of-WAL. There is expected to be a failure every time we do >> crash recovery. This far, we've considered the probability of one in >> 1^32 small enough for that purpose, but IMHO one in 1^16 is much too weak. > > One thing that just occurred to me is that we could make the SIMD > checksum a 32-bit checksum, and reduce it down to 16 bits for the data > pages. That might give us more flexibility to later use it for WAL > without compromising on the error detection nearly as much (though > obviously that wouldn't work with Simon's current proposal which uses > the same data page checksum in a WAL backup block). The simple 32-bit version of the algorithm would need CPU capability checks for the fast version and would work only on CPUs produced in the last few years. Not a show stopper, but more complex code and less applicability for sure. An alternative would be to calculate two 16-bit checksums, concatenate them for the 32-bit checksum and add them for the 16-bit one. In this case we wouldn't need to change the current algorithm. A future change could just factor out everything up to the last add as the common function. But keep in mind that we are talking about sharing about 400 bytes of machine code here. > In general, we have more flexibility with WAL because there is no > upgrade issue.
It would be nice to share code with the data page > checksum algorithm; but really we should just use whatever offers the > best trade-off in terms of complexity, performance, and error detection > rate. > > I don't think we need to decide all of this right now. Personally, I'm > satisfied having SIMD checksums on data pages now and leaving WAL > optimization until later. +1 I feel quite uneasy about reducing the effectiveness of WAL end detection. There are many ways to improve WAL performance and I have no idea what would be the best one. At the very least some performance tests are in order. As this is not an essential part of having usable checksums, but a general performance optimization I feel that it is not fair to others to postpone the release to resolve this now. I'd be more than happy to research this for 9.4. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
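Ants' "two 16-bit checksums" alternative above can be made concrete with a tiny sketch. The helper names and the way the two parts would be produced are hypothetical, not code from the patch; the point is only that a future 32-bit user and today's 16-bit page checksum could share everything up to the final combine step:

```c
#include <stdint.h>

/*
 * Hypothetical: checksum_a and checksum_b stand in for two independent
 * 16-bit checksums computed by the shared code (e.g. over different
 * lanes of the page).  Only the final combination differs per width.
 */
static uint32_t
checksum32_from_parts(uint16_t checksum_a, uint16_t checksum_b)
{
    /* 32-bit flavor: concatenate the two halves */
    return ((uint32_t) checksum_a << 16) | checksum_b;
}

static uint16_t
checksum16_from_parts(uint16_t checksum_a, uint16_t checksum_b)
{
    /* 16-bit flavor: add them (mod 2^16) */
    return (uint16_t) (checksum_a + checksum_b);
}
```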
Re: [HACKERS] Enabling Checksums
On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote: > I think this is a bad idea. It complicates the WAL format significantly. > Simon's patch didn't include the changes to recovery to validate the > checksum, but I suspect it would be complicated. And it reduces the > error-detection capability of WAL recovery. Keep in mind that unlike > page checksums, which are never expected to fail, so even if we miss a > few errors it's still better than nothing, the WAL checkum is used to > detect end-of-WAL. There is expected to be a failure every time we do > crash recovery. This far, we've considered the probability of one in > 1^32 small enough for that purpose, but IMHO one in 1^16 is much too weak. One thing that just occurred to me is that we could make the SIMD checksum a 32-bit checksum, and reduce it down to 16 bits for the data pages. That might give us more flexibility to later use it for WAL without compromising on the error detection nearly as much (though obviously that wouldn't work with Simon's current proposal which uses the same data page checksum in a WAL backup block). In general, we have more flexibility with WAL because there is no upgrade issue. It would be nice to share code with the data page checksum algorithm; but really we should just use whatever offers the best trade-off in terms of complexity, performance, and error detection rate. I don't think we need to decide all of this right now. Personally, I'm satisfied having SIMD checksums on data pages now and leaving WAL optimization until later. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Fri, 2013-04-12 at 21:28 +0200, Andres Freund wrote: > That means we will have to do the verification for this in > ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we > won't always recognize the end of WAL correctly. > And I am a bit wary of reducing the likelihood of noticing the proper > end-of-recovery by reducing the crc width. Good point. > Why again are we doing this now? Just to reduce the overhead of CRC > computation for full page writes? Or are we forseeing issues with the > page checksums being wrong because of non-zero data in the hole being > zero after the restore from bkp blocks? That shouldn't be a problem, because the block is not expected to have a proper checksum in WAL, and it will be recalculated before being written. So I see these changes as mostly independent. The reason we're discussing right now is because, when choosing the checksum algorithm, I was hoping that it might be usable in the future for WAL backup blocks. I'm convinced that they can be; and the primary question now seems to be "should they be", which does not need to be settled right now in my opinion. Anyway, I would be perfectly happy if we just got the SIMD algorithm in for data pages. The support for changing the WAL checksums seems lukewarm, and there might be quite a few alternatives (e.g. optimizing the CRC for backup blocks as Heikki suggested) to achieve that performance goal. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Fri, 2013-04-12 at 15:21 -0400, Bruce Momjian wrote: > > * When we do PageSetChecksumInplace(), we need to be 100% sure that the > > hole is empty; otherwise the checksum will fail when we re-expand it. It > > might be worth a memset beforehand just to be sure. > > Do we write the page holes to the WAL for full-page writes? I hope we > don't. No, but the page hole is included in the checksum. Let's say that the page hole contains some non-zero value, and we calculate a checksum. When we eliminate the page hole, and then reconstitute the page using zeros for the page hole later, then the page will not match the checksum any more. So, we need to be sure the original page hole is all-zero when we calculate the checksum. Regards, Jeff Davis
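The invariant Jeff describes can be shown with a minimal sketch. The struct and offsets below are deliberately simplified and illustrative; PostgreSQL's real PageHeaderData differs, but the idea is the same: the "hole" between pd_lower and pd_upper must be all zeroes before the checksum is computed, so that a page reconstituted later with a zeroed hole still matches it.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

/*
 * Simplified, illustrative page layout: pd_lower is the byte offset where
 * the line-pointer array ends, pd_upper the offset where tuple data
 * starts; the bytes in between are the unused "hole".
 */
typedef struct
{
    uint16_t pd_checksum;
    uint16_t pd_lower;                /* offset of end of used header area */
    uint16_t pd_upper;                /* offset of start of tuple data */
    char     data[PAGE_SIZE - 6];
} FakePage;

/*
 * Force the hole to zeroes before checksumming, so the checksum does not
 * depend on whatever stale bytes happened to sit in the unused space.
 */
static void
zero_page_hole(FakePage *page)
{
    if (page->pd_upper > page->pd_lower)
        memset((char *) page + page->pd_lower, 0,
               page->pd_upper - page->pd_lower);
}
```

This is essentially the "memset beforehand just to be sure" that Jeff suggests in his first message above.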
Re: [HACKERS] Enabling Checksums
On 12 April 2013 21:03, Heikki Linnakangas wrote: > No, the patch has to compute the 16-bit checksum for the page when the > full-page image is added to the WAL record. There would otherwise be no need > to calculate the page checksum at that point, but only later when the page > is written out from shared buffer cache. > > I think this is a bad idea. It complicates the WAL format significantly. > Simon's patch didn't include the changes to recovery to validate the > checksum, but I suspect it would be complicated. And it reduces the > error-detection capability of WAL recovery. Keep in mind that unlike page > checksums, which are never expected to fail, so even if we miss a few errors > it's still better than nothing, the WAL checkum is used to detect > end-of-WAL. There is expected to be a failure every time we do crash > recovery. This far, we've considered the probability of one in 1^32 small > enough for that purpose, but IMHO one in 1^16 is much too weak. > > If you want to speed up the CRC calculation of full-page images, you could > have an optimized version of the WAL CRC algorithm, using e.g. SIMD > instructions. Because typical WAL records are small, max 100-200 bytes, and > it consists of several even smaller chunks, the normal WAL CRC calculation > is quite resistant to common optimization techniques. But it might work for > the full-page images. Let's not conflate it with the page checksums, though. I accept the general tone of that as a reasonable perspective and in many ways am on the fence myself. This is sensitive stuff. A few points * The code to validate the checksum isn't complex, though it is more than the current one line. Let's say about 10 lines of clear code. I'll work on that to show it's true. I don't see that as a point of objection. * WAL checksum is not used as the sole basis for end-of-WAL discovery. We reuse the WAL files, so the prev field in each WAL record shows what the previous end of WAL was.
Hence, if the WAL checksum gives a false positive, we still have a double check that the data really is wrong. It is vanishingly unlikely that you'd get a false positive and have the prev field match as well when it was actually the genuine end of WAL. Yes, we could also have a second SIMD calculation optimised for WAL CRC32 on an 8192-byte block, rather than just one set of SIMD code for both. We could also have a single set of SIMD code producing a 32-bit checksum, then take the low 16 bits as we do currently. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On 12.04.2013 22:31, Bruce Momjian wrote: On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote: Only point worth discussing is that this change would make backup blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record header is covered by a CRC32 but the backup blocks only by 16-bit. That means we will have to do the verification for this in ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we won't always recognize the end of WAL correctly. And I am a bit wary of reducing the likelihood of noticing the proper end-of-recovery by reducing the crc width. Why again are we doing this now? Just to reduce the overhead of CRC computation for full page writes? Or are we forseeing issues with the page checksums being wrong because of non-zero data in the hole being zero after the restore from bkp blocks? I thought the idea is that we were going to re-use the already-computed CRC checksum on the page, and we only have 16-bits of storage for that. No, the patch has to compute the 16-bit checksum for the page when the full-page image is added to the WAL record. There would otherwise be no need to calculate the page checksum at that point, but only later when the page is written out from shared buffer cache. I think this is a bad idea. It complicates the WAL format significantly. Simon's patch didn't include the changes to recovery to validate the checksum, but I suspect it would be complicated. And it reduces the error-detection capability of WAL recovery. Keep in mind that unlike page checksums (which are never expected to fail, so even if we miss a few errors it's still better than nothing), the WAL checksum is used to detect end-of-WAL. There is expected to be a failure every time we do crash recovery. Thus far, we've considered a false-match probability of one in 2^32 small enough for that purpose, but IMHO one in 2^16 is much too weak.
If you want to speed up the CRC calculation of full-page images, you could have an optimized version of the WAL CRC algorithm, using e.g. SIMD instructions. Because typical WAL records are small, max 100-200 bytes, and it consists of several even smaller chunks, the normal WAL CRC calculation is quite resistant to common optimization techniques. But it might work for the full-page images. Let's not conflate it with the page checksums, though. - Heikki
Re: [HACKERS] Enabling Checksums
On 2013-04-12 15:31:36 -0400, Bruce Momjian wrote: > On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote: > > > Only point worth discussing is that this change would make backup blocks > > > be > > > covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record > > > header is covered by a CRC32 but the backup blocks only by 16-bit. > > > > That means we will have to do the verification for this in > > ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we > > won't always recognize the end of WAL correctly. > > And I am a bit wary of reducing the likelihood of noticing the proper > > end-of-recovery by reducing the crc width. > > > > Why again are we doing this now? Just to reduce the overhead of CRC > > computation for full page writes? Or are we forseeing issues with the > > page checksums being wrong because of non-zero data in the hole being > > zero after the restore from bkp blocks? > > I thought the idea is that we were going to re-use the already-computed > CRC checksum on the page, and we only have 16-bits of storage for that. Well, but the proposal seems to be to do this also for non-checksum enabled datadirs, so ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote: > > Only point worth discussing is that this change would make backup blocks be > > covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record > > header is covered by a CRC32 but the backup blocks only by 16-bit. > > That means we will have to do the verification for this in > ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we > won't always recognize the end of WAL correctly. > And I am a bit wary of reducing the likelihood of noticing the proper > end-of-recovery by reducing the crc width. > > Why again are we doing this now? Just to reduce the overhead of CRC > computation for full page writes? Or are we forseeing issues with the > page checksums being wrong because of non-zero data in the hole being > zero after the restore from bkp blocks? I thought the idea is that we were going to re-use the already-computed CRC checksum on the page, and we only have 16-bits of storage for that. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 2013-04-11 20:12:59 +0100, Simon Riggs wrote: > On 11 April 2013 04:27, Jeff Davis wrote: > > > On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote: > > > > > OK, so we have a single combined "calculate a checksum for a block" > > > function. That uses Jeff's zeroing trick and Ants' bulk-oriented > > > performance optimization. > > > > > > > > > For buffer checksums we simply calculate for the block. > > > > Sounds good. > > > > > For WAL full page writes, we first set the checksums for all defined > > > buffers, then calculate the checksum of remaining data plus the > > > pd_checksum field from each block using the normal WAL CRC32. > > > > > > Seems good to me. One set of fast code. And it avoids the weirdness > > > that the checksum stored on the full page is actually wrong. > > > > Oh, that's a nice benefit. > > > So, if we apply a patch like the one attached, we then end up with the WAL > checksum using the page checksum as an integral part of its calculation. > (There is no increase in code inside WALInsertLock, nothing at all touched > in that area). > > Then all we need to do is make PageSetChecksumInplace() use Ants' algo and > we're done. > > Only point worth discussing is that this change would make backup blocks be > covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record > header is covered by a CRC32 but the backup blocks only by 16-bit. That means we will have to do the verification for this in ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we won't always recognize the end of WAL correctly. And I am a bit wary of reducing the likelihood of noticing the proper end-of-recovery by reducing the crc width. Why again are we doing this now? Just to reduce the overhead of CRC computation for full page writes? Or are we foreseeing issues with the page checksums being wrong because of non-zero data in the hole being zero after the restore from bkp blocks?
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Enabling Checksums
On Fri, Apr 12, 2013 at 12:07:36PM -0700, Jeff Davis wrote: > > (Attached patch is discussion only. Checking checksum in recovery > > isn't coded at all.) > > I like it. > > A few points: > > * Given that setting the checksum is unconditional in a backup block, do > we want to zero the checksum field when the backup block is restored if > checksums are disabled? Otherwise we would have a strange situation > where some blocks have a checksum on disk even when checksums are > disabled. > > * When we do PageSetChecksumInplace(), we need to be 100% sure that the > hole is empty; otherwise the checksum will fail when we re-expand it. It > might be worth a memset beforehand just to be sure. Do we write the page holes to the WAL for full-page writes? I hope we don't. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On Thu, 2013-04-11 at 20:12 +0100, Simon Riggs wrote: > So, if we apply a patch like the one attached, we then end up with the > WAL checksum using the page checksum as an integral part of its > calculation. (There is no increase in code inside WALInsertLock, > nothing at all touched in that area). > > > Then all we need to do is make PageSetChecksumInplace() use Ants' algo > and we're done. > > > Only point worth discussing is that this change would make backup > blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e. > the record header is covered by a CRC32 but the backup blocks only by > 16-bit. FWIW, that's fine with me. > (Attached patch is discussion only. Checking checksum in recovery > isn't coded at all.) I like it. A few points: * Given that setting the checksum is unconditional in a backup block, do we want to zero the checksum field when the backup block is restored if checksums are disabled? Otherwise we would have a strange situation where some blocks have a checksum on disk even when checksums are disabled. * When we do PageSetChecksumInplace(), we need to be 100% sure that the hole is empty; otherwise the checksum will fail when we re-expand it. It might be worth a memset beforehand just to be sure. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Wed, Apr 10, 2013 at 11:19:56AM -0700, Jeff Davis wrote: > On Wed, 2013-04-10 at 11:01 +0300, Ants Aasma wrote: > > I think we should first deal with using it for page checksums and if > > future versions want to reuse some of the code for WAL checksums then > > we can rearrange the code. > > Sounds good to me, although I expect we at least want any assembly to be > in a separate file (if the specialization makes it in 9.3). Sounds good. Simon has done a good job shepherding this to completion. My only question is whether the 16-bit page checksums stored in WAL reduce our ability to detect failed/corrupt writes to WAL? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Enabling Checksums
On 11 April 2013 04:27, Jeff Davis wrote: > On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote: > > > OK, so we have a single combined "calculate a checksum for a block" > > function. That uses Jeff's zeroing trick and Ants' bulk-oriented > > performance optimization. > > > > > > For buffer checksums we simply calculate for the block. > > Sounds good. > > > For WAL full page writes, we first set the checksums for all defined > > buffers, then calculate the checksum of remaining data plus the > > pd_checksum field from each block using the normal WAL CRC32. > > > > Seems good to me. One set of fast code. And it avoids the weirdness > > that the checksum stored on the full page is actually wrong. > > Oh, that's a nice benefit. So, if we apply a patch like the one attached, we then end up with the WAL checksum using the page checksum as an integral part of its calculation. (There is no increase in code inside WALInsertLock, nothing at all touched in that area). Then all we need to do is make PageSetChecksumInplace() use Ants' algo and we're done. Only point worth discussing is that this change would make backup blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record header is covered by a CRC32 but the backup blocks only by 16-bit. (Attached patch is discussion only. Checking checksum in recovery isn't coded at all.) Thoughts? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services [Attachment: make_wal_records_use_page_checksums.v0.patch]
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote: > OK, so we have a single combined "calculate a checksum for a block" > function. That uses Jeff's zeroing trick and Ants' bulk-oriented > performance optimization. > > > For buffer checksums we simply calculate for the block. Sounds good. > For WAL full page writes, we first set the checksums for all defined > buffers, then calculate the checksum of remaining data plus the > pd_checksum field from each block using the normal WAL CRC32. > > Seems good to me. One set of fast code. And it avoids the weirdness > that the checksum stored on the full page is actually wrong. Oh, that's a nice benefit. > It also means that the WAL checksum calculation includes the hole, yet > we do not include the data for the hole. So we have to do an extra > copy when restoring the backup block. I like this, but it sounds like there is some room for discussion on some of these points. I assume changes to the WAL checksums are 9.4 material? I'm satisfied with SIMD data checksums in 9.3 and that we have a plan for using SIMD for WAL checksums later. Regards, Jeff Davis
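The scheme Simon describes above (set each backup block's checksum in place, then run the normal WAL CRC over the record data plus each block's pd_checksum field instead of the full page image) can be sketched as follows. The CRC here is a plain bitwise CRC-32 stand-in for PostgreSQL's WAL CRC, and the record layout is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Bitwise, byte-at-a-time CRC-32 step (reflected polynomial 0xEDB88320),
 * standing in for PostgreSQL's WAL CRC.  Start the running value at
 * 0xFFFFFFFF and invert the final result.
 */
static uint32_t
crc32_step(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i;
    int    bit;

    for (i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Hypothetical record layout: the WAL CRC covers the main record data
 * and, for each backup block, only its already-set 16-bit page checksum
 * rather than the whole full-page image.
 */
static uint32_t
wal_record_crc(const void *rdata, size_t rdata_len,
               const uint16_t *bkp_checksums, int nblocks)
{
    uint32_t crc = 0xFFFFFFFFu;
    int      i;

    crc = crc32_step(crc, rdata, rdata_len);
    for (i = 0; i < nblocks; i++)
        crc = crc32_step(crc, &bkp_checksums[i], sizeof(uint16_t));
    return crc ^ 0xFFFFFFFFu;
}
```

Under this layout, corruption inside a backup block has to be caught by the 16-bit page checksum alone; the CRC only guarantees that the stored page checksum itself arrived intact, which is exactly the trade-off debated in this thread.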