Re: Add timeline to partial WAL segments
On Sat, Jan 05, 2019 at 08:32:55AM +0200, David Steele wrote: > Excellent. This now requires work on my part so I have marked the patch > "Waiting on Author". As nothing has happened here, I have marked the patch as RwF. -- Michael signature.asc Description: PGP signature
Re: Add timeline to partial WAL segments
On 1/5/19 3:31 AM, Michael Paquier wrote: On Mon, Dec 31, 2018 at 01:07:34PM +0200, David Steele wrote: In short, the *initial* name of the WAL file is set to what it should be if it doesn't complete so we don't need to run around and try to rename files on failure. Only on success do we need to rename. Sound plausible? It does. Excellent. This now requires work on my part so I have marked the patch "Waiting on Author". Thanks, -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Mon, Dec 31, 2018 at 01:07:34PM +0200, David Steele wrote: > In short, the *initial* name of the WAL file is set to what it should be if > it doesn't complete so we don't need to run around and try to rename files > on failure. Only on success do we need to rename. > > Sound plausible? It does. -- Michael signature.asc Description: PGP signature
Re: Add timeline to partial WAL segments
On 12/21/18 2:10 AM, Michael Paquier wrote: On Thu, Dec 20, 2018 at 02:13:01PM +0200, David Steele wrote: Or perhaps just always add the timeline to the .partial? That way it doesn't need to be renamed later. Also, there would be a consistent name, rather than sometimes .partial, sometimes ..partial. Hm. A renaming still needs to happen afterwards anyway, no? When a segment is created with a given name pg_receivewal cannot know if the segment it is working on will be the last partial segment of a given timeline. And what would be changed in the segment name is the addition of the new TLI, not the previous one. I was thinking the file would only be renamed on successful completion. Suppose we are on timeline 1 pg_receivewal would be writing to: 000100010001.0001.partial If the WAL segment is never completed (server crashes, etc.) it would keep that name. If a cluster is promoted it would archive: 000100010001.0002.partial And then pg_receivewal would start writing on: 000200010001.0002.partial When that segment successfully completes it would be renamed by pg_receivewal to 000200010001. In short, the *initial* name of the WAL file is set to what it should be if it doesn't complete so we don't need to run around and try to rename files on failure. Only on success do we need to rename. Sound plausible? -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On 12/20/18 10:56 PM, Robert Haas wrote: On Fri, Dec 14, 2018 at 6:05 PM David Steele wrote: The question in my mind: is it safe to back-patch? I cannot imagine it being a good idea to stick a behavioral change like this into a minor release. Yeah, it lets people get out from under this problem a lot sooner, but it potentially breaks backup scripts even for people who were not suffering in the first place. I don't think solving this problem for the people who have it is worth inflicting that kind of breakage on everybody. Fair enough -- figured I would make an argument and see what people thought. It's easy enough to solve on my end, but I'll wait and see what naming scheme we come up with. -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Fri, Dec 14, 2018 at 6:05 PM David Steele wrote: > The question in my mind: is it safe to back-patch? I cannot imagine it being a good idea to stick a behavioral change like this into a minor release. Yeah, it lets people get out from under this problem a lot sooner, but it potentially breaks backup scripts even for people who were not suffering in the first place. I don't think solving this problem for the people who have it is worth inflicting that kind of breakage on everybody. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add timeline to partial WAL segments
On 12/15/18 1:56 AM, Michael Paquier wrote: On Fri, Dec 14, 2018 at 06:05:18PM -0500, David Steele wrote: On 12/14/18 3:26 PM, Robert Haas wrote: The new TLI is the only thing that is guaranteed to be unique with each new promotion, and I would guess that it is therefore the right thing to use to disambiguate them. This is the conclusion we came to after a few months of diagnosing and working on this problem. Hm. Okay. From that point of view I think that we should also make pg_receivewal able to generate the same kind of segment format when it detects a timeline switch for the last partial segment of the previous timeline. Switching to a new timeline makes the streaming finish, so we could use that to close properly the WAL segment with the new name. At the same time it would be nice to have a macro able to generate a partial segment name and also check after it. Or perhaps just always add the timeline to the .partial? That way it doesn't need to be renamed later. Also, there would be a consistent name, rather than sometimes .partial, sometimes ..partial. The question in my mind: is it safe to back-patch? That's unfortunately a visibility change, meaning that any existing backup solution able to handle .partial segments would get broken in a minor release. From that point of view this should go only on HEAD. That means every archive command needs to deal with this issue on its own. It's definitely not a trivial thing to do. It's obviously strong to call this a bug, but there's very clearly an issue here. I wonder if there is anything else we can do that would work? The other patch to reorder the priority of the .ready files could go down without any problem though. Glad to hear it. -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Fri, Dec 14, 2018 at 06:05:18PM -0500, David Steele wrote: > On 12/14/18 3:26 PM, Robert Haas wrote: >> The new TLI is the only thing that is guaranteed to be unique with >> each new promotion, and I would guess that it is therefore the right >> thing to use to disambiguate them. > > This is the conclusion we came to after a few months of diagnosing and > working on this problem. Hm. Okay. From that point of view I think that we should also make pg_receivewal able to generate the same kind of segment format when it detects a timeline switch for the last partial segment of the previous timeline. Switching to a new timeline makes the streaming finish, so we could use that to close properly the WAL segment with the new name. At the same time it would be nice to have a macro able to generate a partial segment name and also check after it. > The question in my mind: is it safe to back-patch? That's unfortunately a visibility change, meaning that any existing backup solution able to handle .partial segments would get broken in a minor release. From that point of view this should go only on HEAD. The other patch to reorder the priority of the .ready files could go down without any problem though. -- Michael signature.asc Description: PGP signature
Re: Add timeline to partial WAL segments
On 12/14/18 3:26 PM, Robert Haas wrote: > On Thu, Dec 13, 2018 at 12:17 AM Michael Paquier wrote: >> On Wed, Dec 12, 2018 at 07:54:05AM -0500, David Steele wrote: >>> The LSN switch point is often the same even when servers are going to >>> different timelines. If the LSN is different enough then the problem >>> solves itself since the .partial will be on an entirely different >>> segment. >> >> That would mean that WAL forked exactly at the same record. You have >> likely seen more cases where than can happen in real life than I do. > > Suppose that the original master fails during an idle period, and we > promote a slave. But we accidentally promote a slave that can't serve > as the new master, like because it's in a datacenter with an > unreliable network connection or one which is about to be engulfed in > lava. Much more common than people think. > So, we go to promote a different slave, and because we never > got around to reconfiguring the standbys to follow the previous > promotion, kaboom. Exactly. > Or, suppose we do PITR to recover from some user error, but then > somebody screws up the contents of the recovered cluster and we have > to do it over again. Naturally we'll recover to the same point. > > The new TLI is the only thing that is guaranteed to be unique with > each new promotion, and I would guess that it is therefore the right > thing to use to disambiguate them. This is the conclusion we came to after a few months of diagnosing and working on this problem. The question in my mind: is it safe to back-patch? -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Thu, Dec 13, 2018 at 12:17 AM Michael Paquier wrote: > On Wed, Dec 12, 2018 at 07:54:05AM -0500, David Steele wrote: > > The LSN switch point is often the same even when servers are going to > > different timelines. If the LSN is different enough then the problem > > solves itself since the .partial will be on an entirely different > > segment. > > That would mean that WAL forked exactly at the same record. You have > likely seen more cases where than can happen in real life than I do. Suppose that the original master fails during an idle period, and we promote a slave. But we accidentally promote a slave that can't serve as the new master, like because it's in a datacenter with an unreliable network connection or one which is about to be engulfed in lava. So, we go to promote a different slave, and because we never got around to reconfiguring the standbys to follow the previous promotion, kaboom. Or, suppose we do PITR to recover from some user error, but then somebody screws up the contents of the recovered cluster and we have to do it over again. Naturally we'll recover to the same point. The new TLI is the only thing that is guaranteed to be unique with each new promotion, and I would guess that it is therefore the right thing to use to disambiguate them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add timeline to partial WAL segments
On 12/13/18 8:35 AM, David Steele wrote: > On 12/12/18 7:17 PM, Michael Paquier wrote: >> On Wed, Dec 12, 2018 at 07:54:05AM -0500, David Steele wrote: > >>> But, we could at least use the . notation and end up with something like >>> 000100010001.0002.partial or perhaps >>> 000100010001.T0002.partial? Maybe >>> 000100010001.0002.tpartial? >>> I updated the patch to use 000100010001.0002.partial since this is more consistent with what we do in other cases. -- -David da...@pgmasters.net From 7eda6932eab47cd25a4fa25a8a5575b41b51c9ff Mon Sep 17 00:00:00 2001 From: David Steele Date: Thu, 13 Dec 2018 12:20:43 -0500 Subject: [PATCH] Add timeline to partial WAL segments. If two clusters are promoted from the same timeline on the same WAL segment, they will both archive a .partial file with the same name even if they select different timelines to promote to. If one partially promotes and later fails we want to be able to promote another cluster. This duplicate .partial file causes problems for archive commands that won't overwrite an existing file. Even if the archive command checks the contents they might not be binary identical due to differing unused space at the end of the file or because one cluster accepted more transactions. Add the timeline being promoted *to* into the .partial filename to prevent the conflict, e.g. 000100010001.0002.partial. If more than one cluster tries to promote to the same timeline there will still be a conflict but this is desirable. Once a cluster has claimed a timeline no other cluster should be archiving on it. --- src/backend/access/transam/xlog.c | 12 src/include/postmaster/pgarch.h | 2 +- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c80b14ed97..be4da2c3d0 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7624,8 +7624,10 @@ StartupXLOG(void) * the completed version of the same segment later, it will fail. (We * used to do that in 9.4 and below, and it caused such problems). * -* As a compromise, we rename the last segment with the .partial -* suffix, and archive it. Archive recovery will never try to read +* As a compromise, we rename the last segment with the new timeline and +* .partial suffix, and archive it. The timeline is added in case there +* are multiple promotions from the same timeline before a stable +* primary emerges. Archive recovery will never try to read * .partial segments, so they will normally go unused. But in the odd * PITR case, the administrator can copy them manually to the pg_wal * directory (removing the suffix). They can be useful in debugging, @@ -7653,8 +7655,10 @@ StartupXLOG(void) charpartialpath[MAXPGPATH]; XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size); - snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname); - snprintf(partialpath, MAXPGPATH, "%s.partial", origpath); + snprintf(partialfname, MAXFNAMELEN, "%s.%08X.partial", +origfname, ThisTimeLineID); + snprintf(partialpath, MAXPGPATH, "%s.%08X.partial", origpath, +ThisTimeLineID); /* * Make sure there's no .done or .ready file for the .partial diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h index 292e63a26a..504ab7b9dc 100644 --- a/src/include/postmaster/pgarch.h +++ b/src/include/postmaster/pgarch.h @@ -23,7 +23,7 @@ * -- */ #define MIN_XFN_CHARS 16 -#define MAX_XFN_CHARS 40 +#define MAX_XFN_CHARS 41 #define VALID_XFN_CHARS "0123456789ABCDEF.history.backup.partial" /* -- -- 2.17.2 (Apple Git-113) signature.asc Description: OpenPGP digital signature
Re: Add timeline to partial WAL segments
On 12/12/18 7:17 PM, Michael Paquier wrote: > On Wed, Dec 12, 2018 at 07:54:05AM -0500, David Steele wrote: >> But, we could at least use the . notation and end up with something like >> 000100010001.0002.partial or perhaps >> 000100010001.T0002.partial? Maybe >> 000100010001.0002.tpartial? >> >> I can't decide whether the .partial files generated by pg_receivewal are >> a problem or not. It seems to me that only the original cluster is >> going to be able to stream this file -- once the segment is renamed to >> .partial it won't be visible to pg_receivexlog. So, .partial without a >> timeline extension just means partial from the original timeline. > > Still this does not actually solve the problem when two servers are > trying to use the same timeline? You would get conflicts with the > history file as well in this case but the partial segment gets archived > first.. It seems to me that it is kind of difficult to come with a > totally bullet-proof solution. What we would like to do here is reduce the window in which the server will make a bad choice by pushing timeline files first, then allow the server to archive by using a more unique name for partial WAL. It is understood that servers will need to be rebuilt in HA environments, but we would like to reduce the frequency. More importantly we want to preserve the integrity of the WAL archive since it serves as the ground truth for all the clusters. Reducing duplicates spares users decisions about what to delete to get their cluster archiving again. >> There's another difference. The .partial generated by pg_receivewal is >> an actively-worked-on file whereas the .partial generated by a promotion >> is probably headed for oblivion. I haven't see a single case where one >> was used in an actual recovery (which doesn't mean it hasn't happened, >> of course). > > There are many people implementing their own backup solutions, it is > hard to say that none of those solutions are actually able to copy the > last partial file to replay up to the end of a wanted timeline for a > PITR. Indeed, we've thought about adding it to pgBackRest. But we haven't, and to my knowledge nobody else has either. It will come though, and a better way of naming these files will help make them useful. Regards, -- -David da...@pgmasters.net signature.asc Description: OpenPGP digital signature
Re: Add timeline to partial WAL segments
On Wed, Dec 12, 2018 at 07:54:05AM -0500, David Steele wrote: > The LSN switch point is often the same even when servers are going to > different timelines. If the LSN is different enough then the problem > solves itself since the .partial will be on an entirely different > segment. That would mean that WAL forked exactly at the same record. You have likely seen more cases where than can happen in real life than I do. > But, we could at least use the . notation and end up with something like > 000100010001.0002.partial or perhaps > 000100010001.T0002.partial? Maybe > 000100010001.0002.tpartial? > > I can't decide whether the .partial files generated by pg_receivewal are > a problem or not. It seems to me that only the original cluster is > going to be able to stream this file -- once the segment is renamed to > .partial it won't be visible to pg_receivexlog. So, .partial without a > timeline extension just means partial from the original timeline. Still this does not actually solve the problem when two servers are trying to use the same timeline? You would get conflicts with the history file as well in this case but the partial segment gets archived first.. It seems to me that it is kind of difficult to come with a totally bullet-proof solution. Adding the timeline is appealing to use if the history files can be added on time, the switchpoint LSN is also appealing as per the likelihood of WAL forking at a different point on a record basis. Perhaps another solution could be to add both, but that looks like an overkill. > There's another difference. The .partial generated by pg_receivewal is > an actively-worked-on file whereas the .partial generated by a promotion > is probably headed for oblivion. I haven't see a single case where one > was used in an actual recovery (which doesn't mean it hasn't happened, > of course). There are many people implementing their own backup solutions, it is hard to say that none of those solutions are actually able to copy the last partial file to replay up to the end of a wanted timeline for a PITR. >> (I am completely sold to the idea of prioritizing file types in the >> archiver.) > > OK, I'll work up a patch for that, then. It doesn't solve the .partial > problem, but it does reduce the likelihood that two servers will end up > on the same timeline. Thanks a lot, David! -- Michael signature.asc Description: PGP signature
Re: Add timeline to partial WAL segments
On 12/12/18 12:14 AM, Michael Paquier wrote: > On Tue, Dec 11, 2018 at 11:27:30PM -0500, David Steele wrote: >> And yeah, I'm not sure that I'm totally sold on this idea either, but I >> wanted to get a conversation going. > > Another idea which I think we could live with is to embed within the > segment file name the LSN switch point, in a format consistent with > backup history files. And as a bonus it could be used for sanity > checks. The LSN switch point is often the same even when servers are going to different timelines. If the LSN is different enough then the problem solves itself since the .partial will be on an entirely different segment. But, we could at least use the . notation and end up with something like 000100010001.0002.partial or perhaps 000100010001.T0002.partial? Maybe 000100010001.0002.tpartial? I can't decide whether the .partial files generated by pg_receivewal are a problem or not. It seems to me that only the original cluster is going to be able to stream this file -- once the segment is renamed to .partial it won't be visible to pg_receivexlog. So, .partial without a timeline extension just means partial from the original timeline. There's another difference. The .partial generated by pg_receivewal is an actively-worked-on file whereas the .partial generated by a promotion is probably headed for oblivion. I haven't see a single case where one was used in an actual recovery (which doesn't mean it hasn't happened, of course). > (I am completely sold to the idea of prioritizing file types in the > archiver.) OK, I'll work up a patch for that, then. It doesn't solve the .partial problem, but it does reduce the likelihood that two servers will end up on the same timeline. -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Tue, Dec 11, 2018 at 11:27:30PM -0500, David Steele wrote: > And yeah, I'm not sure that I'm totally sold on this idea either, but I > wanted to get a conversation going. Another idea which I think we could live with is to embed within the segment file name the LSN switch point, in a format consistent with backup history files. And as a bonus it could be used for sanity checks. (I am completely sold to the idea of prioritizing file types in the archiver.) -- Michael signature.asc Description: PGP signature
Re: Add timeline to partial WAL segments
On 12/11/18 11:14 PM, Robert Haas wrote: > On Wed, Dec 12, 2018 at 1:13 PM David Steele wrote: >> On 12/11/18 11:06 PM, Robert Haas wrote: >>> On Wed, Dec 12, 2018 at 11:29 AM Michael Paquier >>> wrote: I really don't think that it is a good idea to link a future timeline within a segment which includes in its name a direct reference to its current timeline. Also I don't buy much the argument that those segments are a nuisance as well all the time. They may be for some tools, however not for others depending on the archiving strategy (distributed locations for example), and if they are a problem for your deployments, why not just discarding them within the archive command and be done with them? >>> >>> -1. Writing an archive_command already requires a PhD in >>> PostgreSQL-ology. The very last thing we should do is invent even >>> more ways for an archive command to be subtly wrong. >> >> The point here is to make archive commands simpler. As it is, the >> various backup tools are going to need to find ways to deal with the >> problem, and each solution will be different. >> >> The goal is to come up with a solution that works and that all archive >> commands can use, rather than each one rolling their own solution. > > I understand. I'm more or less agreeing with you. Actually, I'm not > really sure whether your particular proposal is the best way of > handling this, but I disagree with Michael's suggestion that we should > just throw responsibility back on archive_command. I apologize, I misread that. Your comments make a lot more sense now that I read them in the correct context! And yeah, I'm not sure that I'm totally sold on this idea either, but I wanted to get a conversation going. -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Wed, Dec 12, 2018 at 1:13 PM David Steele wrote: > On 12/11/18 11:06 PM, Robert Haas wrote: > > On Wed, Dec 12, 2018 at 11:29 AM Michael Paquier > > wrote: > >> I really don't think that it is a good idea to link a future timeline > >> within a segment which includes in its name a direct reference to its > >> current timeline. Also I don't buy much the argument that those > >> segments are a nuisance as well all the time. They may be for some > >> tools, however not for others depending on the archiving strategy > >> (distributed locations for example), and if they are a problem for your > >> deployments, why not just discarding them within the archive command and > >> be done with them? > > > > -1. Writing an archive_command already requires a PhD in > > PostgreSQL-ology. The very last thing we should do is invent even > > more ways for an archive command to be subtly wrong. > > The point here is to make archive commands simpler. As it is, the > various backup tools are going to need to find ways to deal with the > problem, and each solution will be different. > > The goal is to come up with a solution that works and that all archive > commands can use, rather than each one rolling their own solution. I understand. I'm more or less agreeing with you. Actually, I'm not really sure whether your particular proposal is the best way of handling this, but I disagree with Michael's suggestion that we should just throw responsibility back on archive_command. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add timeline to partial WAL segments
On 12/11/18 11:06 PM, Robert Haas wrote: > On Wed, Dec 12, 2018 at 11:29 AM Michael Paquier wrote: >> I really don't think that it is a good idea to link a future timeline >> within a segment which includes in its name a direct reference to its >> current timeline. Also I don't buy much the argument that those >> segments are a nuisance as well all the time. They may be for some >> tools, however not for others depending on the archiving strategy >> (distributed locations for example), and if they are a problem for your >> deployments, why not just discarding them within the archive command and >> be done with them? > > -1. Writing an archive_command already requires a PhD in > PostgreSQL-ology. The very last thing we should do is invent even > more ways for an archive command to be subtly wrong. The point here is to make archive commands simpler. As it is, the various backup tools are going to need to find ways to deal with the problem, and each solution will be different. The goal is to come up with a solution that works and that all archive commands can use, rather than each one rolling their own solution. -- -David da...@pgmasters.net
Re: Add timeline to partial WAL segments
On Wed, Dec 12, 2018 at 11:29 AM Michael Paquier wrote: > I really don't think that it is a good idea to link a future timeline > within a segment which includes in its name a direct reference to its > current timeline. Also I don't buy much the argument that those > segments are a nuisance as well all the time. They may be for some > tools, however not for others depending on the archiving strategy > (distributed locations for example), and if they are a problem for your > deployments, why not just discarding them within the archive command and > be done with them? -1. Writing an archive_command already requires a PhD in PostgreSQL-ology. The very last thing we should do is invent even more ways for an archive command to be subtly wrong. I have a feeling the already-known ways to do it wrong are not too well documented, which is another problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add timeline to partial WAL segments
On Tue, Dec 11, 2018 at 09:15:16AM -0500, David Steele wrote: > It looks to me like the history file is written first, and then the > .partial. But because we process WAL alphabetically the .partial ends > up being pushed first. Yes, that's right. > The idea is to stake a claim to the timeline as quickly as possible so > nobody else claims it. The pgarch_readyXlog() reads through all the > files, so it would be easy to return the history files first, then the > .partial, then the new timeline files, regardless of what order the > .ready files were written in. > > The history file is also small, so it will be faster to copy and less > subject to latency concerns. We want to reduce the window in which > another potential primary can end up with a duplicate timeline. Your arguments here are sensible, so +1 for changing pgarch_readyXlog so as it gives priority to history files for archiving. It seems to me that we should also be careful of ordering partial files and normal segments, and perhaps consider backup history files as having the lower level of priority. >>> Another option would be to immediately archive the first WAL segment on >>> timeline 2 and forgo the .partial file entirely. In this case the >>> archiver will archive the 0002.history file before >>> 000200010001 and we avoid the race condition above. That >>> also means we could recover A and promote without a conflict on the >>> .partial. Or we could recover A along timeline 2. >> >> This breaks the definition of IsPartialXLogFileName() in >> xlog_internal.h, and the current naming convention of using only dots as >> field separators. > > Good point. Interesting that missing this didn't break any tests. This means also that we may need more tests for that. For pg_receivewal this could just be a check at the end of the last test for example. >> Another more tricky problem is that this is >> inconsistent with the way pg_receivewal.c behaves for non-completed >> segments, which is a reason behind using .partial for the last partial >> segment on the backend side as well. So this proposal makes things more >> inconsistent. > > I think that we might need to apply the same logic to pg_receivewal. I really don't think that it is a good idea to link a future timeline within a segment which includes in its name a direct reference to its current timeline. Also I don't buy much the argument that those segments are a nuisance as well all the time. They may be for some tools, however not for others depending on the archiving strategy (distributed locations for example), and if they are a problem for your deployments, why not just discarding them within the archive command and be done with them? -- Michael signature.asc Description: PGP signature
Re: Add timeline to partial WAL segments
Hi Michael, On 12/10/18 8:43 PM, Michael Paquier wrote: > On Mon, Dec 10, 2018 at 10:21:23AM -0500, David Steele wrote: >> We recommend that archive commands not overwrite an existing segment. >> Some backup tools will compare the contents and succeed if they are >> equal, but in this case that will still often fail because recycled WAL >> segments will have different bytes at the end on the primary and >> standby. The files may not even be logically the same because B may not >> have received all WAL from A. > > This is not a new problem, the last, partial segment generated > post-promotion of a timeline needs to be archived. Since the > introduction of .partial within the segment name in 9.5, we also assume > that the OP would be smart enough to rename the segment to replay up to > the end of the past timeline if need be for PITR. It's not a new problem, but I do think it was only partially solved. There are easy-to-reproduce cases where more than one .partial is generated and I think we should handle that gracefully. >> However, there is still a race condition here. Since the >> 000100010001.partial is archived first the 0002.history >> file might not make it to the archive before B crashes. In that case A >> will pick timeline 2 and still be stuck. However, I'm thinking it would >> be easy to teach pgarch_readyXlog() to return any .history files it >> finds first (in order, of course). > > Still the .ready file of the partial segment would be generated before > the history file, right? In what does that help? It looks to me like the history file is written first, and then the .partial. But because we process WAL alphabetically the .partial ends up being pushed first. The idea is to stake a claim to the timeline as quickly as possible so nobody else claims it. The pgarch_readyXlog() reads through all the files, so it would be easy to return the history files first, then the .partial, then the new timeline files, regardless of what order the .ready files were written in. The history file is also small, so it will be faster to copy and less subject to latency concerns. We want to reduce the window in which another potential primary can end up with a duplicate timeline. >> Another option would be to immediately archive the first WAL segment on >> timeline 2 and forgo the .partial file entirely. In this case the >> archiver will archive the 0002.history file before >> 000200010001 and we avoid the race condition above. That >> also means we could recover A and promote without a conflict on the >> .partial. Or we could recover A along timeline 2. > > This breaks the definition of IsPartialXLogFileName() in > xlog_internal.h, and the current naming convention of using only dots as > field separators. Good point. Interesting that missing this didn't break any tests. > Another more tricky problem is that this is > inconsistent with the way pg_receivewal.c behaves for non-completed > segments, which is a reason behind using .partial for the last partial > segment on the backend side as well. So this proposal makes things more > inconsistent. I think that we might need to apply the same logic to pg_receivewal. >> I have attached a patch that adds the timeline to the .partial file. >> This passes check-world. >> >> I think we should consider back-patching some set of these changes since >> this causes real pain in current production HA configurations. >> >> Thoughts? > > So you basically append the new timeline ID to the segment name which > still uses the old timeline ID in the first 8 characters of its name. > Logically I find this proposal weird as the segment refers contents > which are part of the past, and the backend is not going to use the > contents of this segment when jumping to the a new timeline, but the > contents of the segment which has the same contents up to the point WAL > forked, with the name of the new timeline. .partial files are rarely used in general, but we decided not to throw them away because they *might* contain valuable information. However, in their current form they are more of a nuisance than a help. What we're really looking for here is a way to sensibly version .partial files in a way so that they a) don't conflict in the repo and b) their name indicates where they came from. If anyone can think of a naming scheme that makes more sense, I'm all ears. > It seems to me that this is quite a change for a low-probability > problem, as this assumes that the promotion of two different servers > happen on exactly the same segment and that both would finish by > archiving the same last partial segment. It's actually a common (i.e. daily) problem at scale, especially when archiving to high-latency storage like S3. Patroni is especially likely to show the issue as it favors uptime over preserving data in its default configuration. Thanks, -- -David da...@pgmasters.net signature.asc Description: OpenPGP digital
Re: Add timeline to partial WAL segments
On Mon, Dec 10, 2018 at 10:21:23AM -0500, David Steele wrote: > We recommend that archive commands not overwrite an existing segment. > Some backup tools will compare the contents and succeed if they are > equal, but in this case that will still often fail because recycled WAL > segments will have different bytes at the end on the primary and > standby. The files may not even be logically the same because B may not > have received all WAL from A. This is not a new problem, the last, partial segment generated post-promotion of a timeline needs to be archived. Since the introduction of .partial within the segment name in 9.5, we also assume that the OP would be smart enough to rename the segment to replay up to the end of the past timeline if need be for PITR. > However, there is still a race condition here. Since the > 000100010001.partial is archived first the 0002.history > file might not make it to the archive before B crashes. In that case A > will pick timeline 2 and still be stuck. However, I'm thinking it would > be easy to teach pgarch_readyXlog() to return any .history files it > finds first (in order, of course). Still the .ready file of the partial segment would be generated before the history file, right? In what does that help? > Another option would be to immediately archive the first WAL segment on > timeline 2 and forgo the .partial file entirely. In this case the > archiver will archive the 0002.history file before > 000200010001 and we avoid the race condition above. That > also means we could recover A and promote without a conflict on the > .partial. Or we could recover A along timeline 2. This breaks the definition of IsPartialXLogFileName() in xlog_internal.h, and the current naming convention of using only dots as field separators. Another more tricky problem is that this is inconsistent with the way pg_receivewal.c behaves for non-completed segments, which is a reason behind using .partial for the last partial segment on the backend side as well. So this proposal makes things more inconsistent. > I have attached a patch that adds the timeline to the .partial file. > This passes check-world. > > I think we should consider back-patching some set of these changes since > this causes real pain in current production HA configurations. > > Thoughts? So you basically append the new timeline ID to the segment name which still uses the old timeline ID in the first 8 characters of its name. Logically I find this proposal weird as the segment refers contents which are part of the past, and the backend is not going to use the contents of this segment when jumping to the a new timeline, but the contents of the segment which has the same contents up to the point WAL forked, with the name of the new timeline. It seems to me that this is quite a change for a low-probability problem, as this assumes that the promotion of two different servers happen on exactly the same segment and that both would finish by archiving the same last partial segment. Putting aside this proposal, it would be actually nice to put in xlog_internal.h a macro which is able to write a partial file name, close to the same place where we check if the segment name refers to a partial segment. My 2c. -- Michael signature.asc Description: PGP signature
Add timeline to partial WAL segments
Hackers, The .partial mechanism was added in de768844 to help avoid conflicts between a newly promoted primary and an old primary that might produce the same WAL segment. This works for a single promotion but can become problematic in HA configurations where there may be several promotions before a stable primary emerges. Consider the following scenario: 1) A is the primary 2) B follows A as a standby 3) A is shutdown immediate 4) B is promoted and selects timeline 2 5) B archives 000100010001.partial 6) B archives 0002.history 7) B goes away before archiving 000200010001 8) A is put into recovery 9) A is promoted and selects timeline 3 10) A can't archive 000100010001.partial because it already exists We recommend that archive commands not overwrite an existing segment. Some backup tools will compare the contents and succeed if they are equal, but in this case that will still often fail because recycled WAL segments will have different bytes at the end on the primary and standby. The files may not even be logically the same because B may not have received all WAL from A. After some discussion with the Patroni folks, Stephen and I came up with the idea of adding the timeline that the cluster is *promoting to* into the .partial name to avoid these sorts of conflicts. However, there is still a race condition here. Since the 000100010001.partial is archived first the 0002.history file might not make it to the archive before B crashes. In that case A will pick timeline 2 and still be stuck. However, I'm thinking it would be easy to teach pgarch_readyXlog() to return any .history files it finds first (in order, of course). Another option would be to immediately archive the first WAL segment on timeline 2 and forgo the .partial file entirely. In this case the archiver will archive the 0002.history file before 000200010001 and we avoid the race condition above. That also means we could recover A and promote without a conflict on the .partial. Or we could recover A along timeline 2. Or we could do some combination of the above. I have attached a patch that adds the timeline to the .partial file. This passes check-world. I think we should consider back-patching some set of these changes since this causes real pain in current production HA configurations. Thoughts? -- -David da...@pgmasters.net diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c80b14ed97..b2e0c1abc1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7624,8 +7624,10 @@ StartupXLOG(void) * the completed version of the same segment later, it will fail. (We * used to do that in 9.4 and below, and it caused such problems). * -* As a compromise, we rename the last segment with the .partial -* suffix, and archive it. Archive recovery will never try to read +* As a compromise, we rename the last segment with the new timeline and +* .partial suffix, and archive it. The timeline is added in case there +* are multiple promotions from the same timeline before a stable +* primary emerges. Archive recovery will never try to read * .partial segments, so they will normally go unused. But in the odd * PITR case, the administrator can copy them manually to the pg_wal * directory (removing the suffix). They can be useful in debugging, @@ -7653,8 +7655,10 @@ StartupXLOG(void) charpartialpath[MAXPGPATH]; XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size); - snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname); - snprintf(partialpath, MAXPGPATH, "%s.partial", origpath); + snprintf(partialfname, MAXFNAMELEN, "%s-%08X.partial", +origfname, ThisTimeLineID); + snprintf(partialpath, MAXPGPATH, "%s-%08X.partial", origpath, +ThisTimeLineID); /* * Make sure there's no .done or .ready file for the .partial