Re: [PATCHES] Fix database is ready race condition
Tom Lane wrote:
> I've applied a modified form of this patch. The postmaster now says
> "database system is ready to accept connections" after it's finished
> reacting to the completion of the startup process.

Thank you, that's just perfect for me.

Markus

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
[PATCHES] Fix database is ready race condition
Hi,

is there a good reason to print the "database system is ready" message in StartupXLOG() in xlog.c? It has a) nothing to do with xlog and b) opens a small race condition: the message gets printed while it still takes some CPU cycles until the postmaster really gets the SIGCHLD signal and sets StartupPID = 0. If you (or rather: an automated test program) try to connect within this timespan, you get a "database system is starting up" error, which clearly contradicts the "is ready" message.

I admit this is not a real issue in the common case and only matters in automated testing or some such. But in case this does not break anything... (ereport is already used in the reaper, so I guess it's fine to use it in signal handlers).

I'm not sure if the message is needed at all in BS_XLOG_BOOTSTRAP mode. Probably it should better say something different. Patch attached.

Regards

Markus

*** src/backend/access/transam/xlog.c	2191ee8ca338d74f666b4d3509cc4361c44b4353
--- src/backend/access/transam/xlog.c	e77a26a26ec46d4479563ed7ff5885ea9c21135a
*** StartupXLOG(void)
*** 5168,5176 ****
  	/* Reload shared-memory state for prepared transactions */
  	RecoverPreparedTransactions();

- 	ereport(LOG,
- 			(errmsg("database system is ready")));
-
  	/* Shut down readFile facility, free space */
  	if (readFile >= 0)
  	{
--- 5168,5173 ----

*** src/backend/bootstrap/bootstrap.c	55fd17241f51b6f23131a0d36d5ce583aa7a3488
--- src/backend/bootstrap/bootstrap.c	8a54e88b06acad46c83320ca8fe13caa75ad77b9
*** BootstrapMain(int argc, char *argv[])
*** 418,423 ****
--- 418,425 ----
  			bootstrap_signals();
  			BootStrapXLOG();
  			StartupXLOG();
+ 			ereport(LOG,
+ 					(errmsg("database system is ready")));
  			break;

  		case BS_XLOG_STARTUP:

*** src/backend/postmaster/postmaster.c	561d13618e62e95a32b42b2e9305a638edacf24f
--- src/backend/postmaster/postmaster.c	5a567893b0ed78d312a19e7054127dc5b6b69df3
*** reaper(SIGNAL_ARGS)
*** 2040,2045 ****
--- 2040,2048 ----
  		if (StartupPID != 0 && pid == StartupPID)
  		{
  			StartupPID = 0;
+ 			ereport(LOG,
+ 					(errmsg("database system is ready")));
+
  			/* Note: FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
  			{
Re: [pgsql-patches] Autovacuum launcher patch
Alvaro Herrera wrote:
> Hmm, I remember eyeballing that code. Would you mind sending me an URL
> to that file, or something? Or maybe send me the files themselves?

Sure, here's a patch against current CVS. Please remove all the functions referencing buffer and buffer.h to compile. Remember that it's a work-in-progress thing. It has flaws.

One issue that currently bugs me is that processes can deadlock if they keep trying to create a message (IMessageCreate), but fail because the queue is full of messages for themselves. A process should thus always try to fetch messages (via IMessageCheck) and remove pending ones before retrying to send one. That's not always practical.

One design limitation is that you have to know how large your message is as soon as you reserve (shared) memory for it, but that's intended.

At least I've stress tested the wrap-around code and it seems to work. No guarantees, though ;-)

Regards

Markus

#
# old_revision [9a68fa59cb0ca3246f03880664062abb98f1a61a]
#
# add_file "src/backend/storage/ipc/imsg.c"
#  content [3e84c6372a47612a2fe233fee6b122808135580e]
#
# add_file "src/include/storage/imsg.h"
#  content [3cf37b12a00b90f65b8393fc5e27c98d772dc22b]
#
# patch "src/backend/storage/ipc/Makefile"
#  from [71276ab6483aebbb27f87c988d77ab876611f190]
#    to [9a99101d3e8bbfe52c97763db536804e94371828]
#
# patch "src/backend/storage/ipc/ipci.c"
#  from [177f266b4668190a6ab1f2902305f7b7e577ef8d]
#    to [1971e2122ba4455c8b9784e70059d917fdf4f4c8]
#
--- src/backend/storage/ipc/imsg.c	3e84c6372a47612a2fe233fee6b122808135580e
+++ src/backend/storage/ipc/imsg.c	3e84c6372a47612a2fe233fee6b122808135580e
@@ -0,0 +1,375 @@
+/*-------------------------------------------------------------------------
+ *
+ * imsg.c
+ *	  internal messages from process to process sent via shared memory.
+ *
+ * Copyright (c) 2006, Markus Schiltknecht [EMAIL PROTECTED]
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <unistd.h>
+#include <signal.h>
+#include <string.h>
+
+#ifdef HAVE_SYS_FILIO_H
+#include <sys/filio.h>
+#endif
+
+#include <sys/ioctl.h>
+
+#include "postgres.h"
+#include "miscadmin.h"
+#include "storage/proc.h"
+#include "storage/imsg.h"
+#include "storage/ipc.h"
+#include "storage/buffer.h"
+#include "storage/spin.h"
+#include "utils/elog.h"
+
+/* global variable pointing to the shmem area */
+IMessageCtlData *IMessageCtl = NULL;
+
+/*
+ * Initialization of shared memory for internal messages.
+ */
+int
+IMessageShmemSize(void)
+{
+	return MAXALIGN(IMessageBufferSize);
+}
+
+void
+IMessageShmemInit(void)
+{
+	bool		foundIMessageCtl;
+
+#ifdef IMSG_DEBUG
+	elog(DEBUG3, "IMessageShmemInit(): initializing shared memory");
+#endif
+
+	IMessageCtl = (IMessageCtlData *)
+		ShmemInitStruct("IMsgCtl",
+						MAXALIGN(IMessageBufferSize),
+						&foundIMessageCtl);
+
+	if (foundIMessageCtl)
+		return;
+
+	/* empty the control structure and all message descriptors */
+	memset(IMessageCtl, 0, MAXALIGN(IMessageBufferSize));
+
+	/* initialize start and end pointers */
+	IMessageCtl->queue_start = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+	IMessageCtl->queue_end = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+
+	SpinLockInit(&IMessageCtl->msgs_lck);
+}
+
+/*
+ * IMessageCreate
+ *
+ * creates a new but deactivated message within the queue, returning the
+ * message header of the newly created message.
+ */
+IMessage*
+IMessageCreate(int recipient, int msg_size)
+{
+	IMessage   *msg;
+	int			remaining_space;
+
+#ifdef IMSG_DEBUG
+	elog(DEBUG3, "IMessageCreate(): recipient: %d, size: %d",
+		 recipient, msg_size);
+#endif
+
+	/* assert a reasonable maximum message size */
+	Assert(msg_size < (MAXALIGN(IMessageBufferSize) / 4));
+
+	START_CRIT_SECTION();
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile IMessageCtlData *imsgctl = IMessageCtl;
+
+		SpinLockAcquire(&imsgctl->msgs_lck);
+
+		/*
+		 * Check if there is enough space for the message plus the
+		 * terminating header
+		 */
+		if (imsgctl->queue_end < imsgctl->queue_start)
+			remaining_space = (int) imsgctl->queue_start -
+				(int) imsgctl->queue_end;
+		else
+			remaining_space = (int) IMSG_BUFFER_END(imsgctl) -
+				(int) imsgctl->queue_end;
+
+		if (remaining_space < (MAXALIGN(IMessageBufferSize) / 8))
+		{
+#ifdef IMSG_DEBUG
+			elog(DEBUG3, "IMessageCreate(): cleanup starting");
+#endif
+
+			/* Clean up messages that have been removed. */
+			while (imsgctl->queue_start->recipient == 0)
+			{
+				if (imsgctl->queue_start > imsgctl->queue_end)
+				{
+					if ((imsgctl->queue_start->sender == 0) &&
+						(imsgctl->queue_start->recipient == 0))
+					{
+#ifdef IMSG_DEBUG
+						elog(DEBUG3, "IMessageCreate(): cleanup wrapped");
+#endif
+						imsgctl->queue_start = (IMessage*) IMSG_BUFFER_START(imsgctl);
+						continue;
+					}
+				}
+				else if (imsgctl->queue_start == imsgctl->queue_end)
+					break;
+
+				imsgctl->queue_start = (IMessage
Re: [PATCHES] replication docs: split single vs. multi-master
Hello Bruce,

Bruce Momjian wrote:
> Actually the patch moves down data partitioning. I am confused.

Uh.. yeah, sorry, that's what I meant.

> I thought a long time about this. I have always liked splitting the
> solutions up into single and multi-master, but in doing this
> documentation section, I realized that the split isn't all that
> helpful, and can be confusing.

Not mentioning that categorization doesn't help in clearing the confusion. Just look around, most people use these terms. They're used by MySQL and Oracle. Even Microsoft's Active Directory seems to have a multi-master operation mode.

> For example, Slony is clearly single-master,

Agreed.

> but what about data partitioning? That is multi-master, in that there
> is more than one master, but only one master per data set.

Data Partitioning is a way to work around the trouble of database replication in the application layer. Instead of trying to categorize it like a replication algorithm, we should explain that working around the trouble may be worthwhile in many cases.

> And for multi-master, Oracle RAC is clearly multi-master,

Yes.

> and I can see pgpool as multi-master, or as several single-master
> systems, in that they operate independently.

Several single-master systems? C'mon! Pgpool simply implements the most simplistic form of multi-master replication. Just because you can access the single databases inside the cluster doesn't make it less multi-master, does it?

> After much thought, it seems that putting things into single/multi-master
> categories just adds more confusion, because several solutions just
> aren't clear

Agreed, I'm not saying you must categorize all solutions you describe. But please do categorize the ones which can be (and have so often been) categorized.

> or fall into neither, e.g. Shared Disk Failover.

Oh, yes, this reminds me of Brad Nicholson's suggestion in [1] to add a warning about the risk of having two postmasters come up. What about other means of sharing disks or filesystems? NBDs or even worse: NFS?

> Another issue is that you mentioned heavy locking for multi-master,
> when in fact pgpool doesn't do any special inter-server locking, so it
> just doesn't apply.

Sure it does apply, in the sense that *every* single lock is granted and released on *every* node. The total amount of locks scales linearly with the number of nodes in the cluster.

> In summary, it just seemed clearer to talk about each item and how it
> works, rather than try to categorize them. The categorization just
> seems to do more harm than good. Of course, I might be totally wrong,
> and am still looking for feedback, but these are my current thoughts.

Feedback? AFAICT, the categorization into single- and multi-master replication is very common. I think that's partly because it's focused on the solution. One can ask: do I want to write on all nodes, or is a failover solution sufficient? Or can I perhaps get away with a read-only slave? It's a categorization the user does, often before having a glimpse of how complicated database replication really is. Thus, IMO, it would make sense to help the user and allow him to quickly find answers. (And we can still tell them that it's not easy or even possible to categorize all the solutions.)

> I didn't mention distributed shared memory as a separate item because I
> felt it was an implementation detail of clustering, rather than
> something separate. I kept two-phase in the cluster item for the same
> reason.

Why is pgpool not an implementation detail of clustering, then?

> Current version at: http://momjian.us/main/writings/pgsql/sgml/failover.html

That somehow doesn't work for me:

--- momjian.us ping statistics ---
15 packets transmitted, 0 received, 100% packet loss, time 14011ms

Just my 2 cents, in the hope of being of help.

Regards

Markus

[1]: Brad Nicholson's suggestion:
http://archives.postgresql.org/pgsql-admin/2006-11/msg00154.php
Re: [PATCHES] replication docs: split single vs. multi-master
Bruce Momjian wrote:
> I am now attaching the additional text I added based on your comments.
> I have also changed the markup so all the solutions appear on the same
> web page. I think seeing it all together might give us new ideas for
> improvement.

Good, it's definitely better to have it all on one page.

I just thought about the words 'master' and 'slave', which are admittedly quite unfortunate. I remember reading about efforts to remove them from geek speech. They proposed to introduce better names. At least with the old IDE drives, master and slave drives seem to be disappearing now...

Regards

Markus
[PATCHES] replication docs: split single vs. multi-master
Hi,

as promised on -docs, here comes my proposal on how to improve the replication documentation. The patches are split as follows and have to be applied in order:

replication_doku_1.diff: Smallest possible one-word change to warm up...

replication_doku_2.diff: Moves down "Clustering For Parallel Query Execution", because it's not a replication type, but a feature, see explanation below.

replication_doku_3.diff: This is the most important part, splitting all replication types into single- and multi-master replication. I'm new to SGML, so please bear with me if this is not the right way to do it...

Shared-Disk Failover does IMO not fall into a replication category. Should we mention there that 'sharing' a disk using NFS or some such is not recommended? (And more importantly, does not work as a multi-master replication solution.)

I've added a general paragraph describing Single-Master Replication. I'm stating that 'Single-Master Replication is always asynchronous'. Can anybody think of a counter-example? Or a use case for synchronous Single-Master Replication? The argument to put down is: if you go sync, why don't you do Multi-Master right away?

Most of the "Clustering for Load Balancing" text applies to all synchronous Multi-Master Replication algorithms, even to Query Broadcasting. Thus it became the general description of Multi-Master Replication. The section "Clustering for Load Balancing" has been removed.

replication_doku_4.diff: These are the text modifications I did to adjust to the new structure. I've adjusted the Multi-Master Replication text to really be appropriate for all existing solutions. Query Broadcasting has some corrections, mainly to stick to describing that algorithm there and none of the general properties of Multi-Master Replication. I've added two sections to describe the 2PC and Distributed SHMEM algorithms, which belong into that category and cover all of the previous text. Except that I've removed the mention of Oracle RAC in favor of Pgpool-II.

IMO this makes it clearer what replication types exist and how to categorize them.

I'm tempted to mention the Postgres-R algorithm as a fourth sub-section of Multi-Master Replication, as it's quite different from all the others in many aspects. But I urgently need to go to work now... besides, I'm heavily biased regarding Postgres-R, so probably someone else should write that paragraph. :-)

The only downside of the structure I'm proposing here is: the non-replication algorithms fall off somewhat. Namely: Shared-Disk Failover, Data Partitioning, Parallel Query Execution and Commercial Solutions. For me, Data Partitioning as well as Parallel Query Execution are possible optimizations which can be run on top of replicated data. They don't replicate data and are thus not replication solutions. But grouping those two together would make sense.

So. I really have to go to work now!

Regards

Markus

*** doc/src/sgml/failover.sgml	2006-11-15 08:52:25.0 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 08:58:34.0 +0100
***************
*** 126,132 ****
   </para>

   <para>
!   Such partitioning implements both failover and load balancing.  Failover
    is achieved because the data resides on both servers, and this is an
    ideal way to enable failover if the servers share a slow communication
    channel.  Load balancing is possible because read requests can go to any
--- 126,132 ----
   </para>

   <para>
!   Such partitioning provides both failover and load balancing.  Failover
    is achieved because the data resides on both servers, and this is an
    ideal way to enable failover if the servers share a slow communication
    channel.  Load balancing is possible because read requests can go to any

*** doc/src/sgml/failover.sgml	2006-11-15 08:58:34.0 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 09:10:53.0 +0100
***************
*** 114,150 ****
   </para>
  </sect1>

- <sect1 id="data-partitioning">
-  <title>Data Partitioning</title>
-
-  <para>
-   Data partitioning splits tables into data sets.  Each set can only be
-   modified by one server.  For example, data can be partitioned by
-   offices, e.g. London and Paris.  While London and Paris servers have all
-   data records, only London can modify London records, and Paris can only
-   modify Paris records.
-  </para>
-
-  <para>
-   Such partitioning provides both failover and load balancing.  Failover
-   is achieved because the data resides on both servers, and this is an
-   ideal way to enable failover if the servers share a slow communication
-   channel.  Load balancing is possible because read requests can go to any
-   of the servers, and write requests are split among the servers.  Of
-   course, the communication to keep all the servers up-to-date adds
-   overhead, so ideally the write load should be low, or localized as in
-   the London/Paris
Re: [PATCHES] Replication Documentation
Hello,

Peter Eisentraut wrote:
> 1. post information on pgsql-general
> 1.a. solicit comments
> 2. put information page on web site
> 3. link from documentation to web site

I don't remember such a clear agreement either. I'm glad Chris has written something. And posting it to -docs seems a much better fit, IMHO.

Also, I think we didn't really agree on where exactly to put what information. See my previous mail on -hackers for my opinion on that.

> I don't think this sort of material belongs directly into the
> PostgreSQL documentation.

I agree with that.

Regards

Markus
Re: [PATCHES] Support Parallel Query Execution in Executor
Hello,

On Sat, 2006-04-08 at 13:45 -0400, Jonah H. Harris wrote:
> > > I'd like to help teaching PostgreSQL the art of parallel query
> > > execution.
> >
> > Cool, we're not at the communication-level yet, but your help would
> > be appreciated. In case you're interested I'll compile a patch for
> > review.
>
> Surely!

[ This patch is not meant to be included immediately. But it might be one step towards supporting parallel query execution, IMHO. ]

This is my internal message passing implementation. I'd probably also need to document its use... It's not tested except for 'it works for me'. And it still misses the 'wrap around' functionality to recycle the buffer. Also, I should probably rename buffer.h and buffer.c, since there are a lot of other buffers in PostgreSQL.

Markus

*** /dev/null	2006-04-03 19:40:38.0 +0200
--- trunk/src/backend/storage/ipc/imsg.c	2006-04-08 20:39:32.0 +0200
***************
*** 0 ****
--- 1,277 ----
+ /*-------------------------------------------------------------------------
+  *
+  * imsg.c
+  *	  internal messages from process to process sent via shared memory.
+  *
+  * Copyright (c) 2006, Markus Schiltknecht [EMAIL PROTECTED]
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ #include <unistd.h>
+ #include <signal.h>
+
+ #ifdef HAVE_SYS_FILIO_H
+ #include <sys/filio.h>
+ #endif
+
+ #include <sys/ioctl.h>
+
+ #include "postgres.h"
+ #include "miscadmin.h"
+ #include "storage/proc.h"
+ #include "storage/imsg.h"
+ #include "storage/ipc.h"
+ #include "storage/buffer.h"
+ #include "storage/spin.h"
+ #include "utils/elog.h"
+
+ /* global variable pointing to the shmem area */
+ IMessageCtlData *IMessageCtl = NULL;
+
+ #define IMSG_DEBUG
+
+ /*
+  * Initialization of shared memory for internal messages.
+  */
+ int
+ IMessageShmemSize(void)
+ {
+ 	return MAXALIGN(IMessageBufferSize);
+ }
+
+ void
+ IMessageShmemInit(void)
+ {
+ 	bool		foundIMessageCtl;
+
+ #ifdef IMSG_DEBUG
+ 	elog(DEBUG3, "IMessageShmemInit(): initializing shared memory");
+ #endif
+
+ 	IMessageCtl = (IMessageCtlData *)
+ 		ShmemInitStruct("IMsgCtl",
+ 						MAXALIGN(IMessageBufferSize),
+ 						&foundIMessageCtl);
+
+ 	if (foundIMessageCtl)
+ 		return;
+
+ 	/* empty the control structure and all message descriptors */
+ 	memset(IMessageCtl, 0, MAXALIGN(IMessageBufferSize));
+
+ 	/* initialize start and end pointers */
+ 	IMessageCtl->queue_start = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+ 	IMessageCtl->queue_end = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+
+ 	SpinLockInit(&IMessageCtl->msgs_lck);
+ }
+
+ /*
+  * IMessageCreate
+  *
+  * creates a new but deactivated message within the queue, returning the
+  * message header of the newly created message.
+  */
+ IMessage*
+ IMessageCreate(int recipient, int msg_size)
+ {
+ 	IMessage   *msg;
+
+ #ifdef IMSG_DEBUG
+ 	elog(DEBUG3, "IMessageCreate(): recipient: %d, size: %d",
+ 		 recipient, msg_size);
+ #endif
+
+ 	/* assert a reasonable maximum message size */
+ 	Assert(msg_size < MAXALIGN(IMessageShmemSize) / 4);
+
+ 	START_CRIT_SECTION();
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile IMessageCtlData *imsgctl = IMessageCtl;
+
+ 		SpinLockAcquire(&imsgctl->msgs_lck);
+
+ 		/* check if there is enough space for the message plus the
+ 		 * terminating header */
+ 		if ((int) imsgctl->queue_end + msg_size + sizeof(IMessage) <
+ 			(int) IMSG_BUFFER_END(imsgctl))
+ 		{
+ 			msg = (IMessage*) imsgctl->queue_end;
+ 			imsgctl->queue_end += IMSG_ALIGN(msg_size);
+
+ 			imsgctl->queue_end->sender = 0;
+ 			imsgctl->queue_end->recipient = 0;
+ 		}
+ 		else
+ 		{
+ 			/* most probably we can wrap around and recycle the space
+ 			 * at the start of the queue. */
+
+ 			/* TODO: implement wrap-around functionality */
+ 			Assert(0);
+ 			msg = NULL;			/* to avoid compiler warning */
+ 		}
+
+ 		/* initialize the message as inactive */
+ 		msg->sender = 0;
+ 		msg->recipient = recipient;
+ 		msg->size = msg_size;
+
+ 		/* queue editing finished */
+ 		SpinLockRelease(&imsgctl->msgs_lck);
+
+ #ifdef IMSG_DEBUG
+ 		elog(DEBUG3, "IMessageCreate(): created at %08X size: %d", (int) msg,
+ 			 msg->size);
+ #endif
+ 	}
+ 	END_CRIT_SECTION();
+
+ 	return msg;
+ }
+
+ void
+ IMessageActivate(IMessage *msg)
+ {
+ 	msg->sender = MyProcPid;
+
+ 	/* send a signal to the recipient */
+ 	kill(msg->recipient, SIGUSR1);
+ }
+
+ /*
+  * IMessageRemove
+  *
+  * Marks a message as removable by setting the recipient to null.  The
+  * message will eventually be removed during creation of new messages, see
+  * IMessageCreate().
+  */
+ void
+ IMessageRemove(IMessage *msg)
+ {
+ 	msg->recipient = 0;
+ }
+
+ /*
+  * IMessageCheck
+  *
+  * Checks if there is a message in the queue for this process.  Returns null
+  * if there is no message for this process, the message header otherwise.
+  * The message remains in the queue and should be removed by
+  * IMessageRemove().
+  */
+ IMessage*
+ IMessageCheck(void)
+ {
+ 	IMessage   *msg