Re: [PATCHES] Fix database is ready race condition

2007-02-07 Thread Markus Schiltknecht

Tom Lane wrote:

I've applied a modified form of this patch.  The postmaster now says
database system is ready to accept connections after it's finished
reacting to the completion of the startup process.


Thank you, that's just perfect for me.

Markus

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[PATCHES] Fix database is ready race condition

2007-02-03 Thread Markus Schiltknecht

Hi,

is there a good reason to print the database system is ready message 
in StartupXLOG() in xact.c? It has a) nothing to do with xlog and b) 
opens a small race condition: the message gets printed, while it still 
take some CPU cycles until the postmaster really gets the SIGCHLD signal 
and sets StartupPID = 0. If you (or rather: an automated test program) 
try to connect within this timespan, you get a database is starting up 
error, which clearly contradicts the is ready message.


I admit this is not a real issue in the common case and only matters in 
automated testing or some such. But in case this does not break 
anything...  (ereport is used in the reaper, so I guess it's fine to use 
that in signal handlers). I'm not sure if the message is needed at all 
in BS_XLOG_BOOTSTRAP mode. Probably it should better say something 
different.


Patch attached.

Regards

Markus

*** src/backend/access/transam/xlog.c	2191ee8ca338d74f666b4d3509cc4361c44b4353
--- src/backend/access/transam/xlog.c	e77a26a26ec46d4479563ed7ff5885ea9c21135a
*** StartupXLOG(void)
*** 5168,5176 
  	/* Reload shared-memory state for prepared transactions */
  	RecoverPreparedTransactions();
  
- 	ereport(LOG,
- 			(errmsg(database system is ready)));
- 
  	/* Shut down readFile facility, free space */
  	if (readFile = 0)
  	{
--- 5168,5173 

*** src/backend/bootstrap/bootstrap.c	55fd17241f51b6f23131a0d36d5ce583aa7a3488
--- src/backend/bootstrap/bootstrap.c	8a54e88b06acad46c83320ca8fe13caa75ad77b9
*** BootstrapMain(int argc, char *argv[])
*** 418,423 
--- 418,425 
  			bootstrap_signals();
  			BootStrapXLOG();
  			StartupXLOG();
+ 			ereport(LOG,
+ 	(errmsg(database system is ready)));
  			break;
  
  		case BS_XLOG_STARTUP:

*** src/backend/postmaster/postmaster.c	561d13618e62e95a32b42b2e9305a638edacf24f
--- src/backend/postmaster/postmaster.c	5a567893b0ed78d312a19e7054127dc5b6b69df3
*** reaper(SIGNAL_ARGS)
*** 2040,2045 
--- 2040,2048 
  		if (StartupPID != 0  pid == StartupPID)
  		{
  			StartupPID = 0;
+ 			ereport(LOG,
+ 	(errmsg(database system is ready)));
+ 
  			/* Note: FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
  			{

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [pgsql-patches] Autovacuum launcher patch

2007-01-29 Thread Markus Schiltknecht

Alvaro Herrera wrote:

Hmm, I remember eyeballing that code.  Would you mind sending me an URL
to that file, or something?  Or maybe send me the files themselves?


Sure, here's a patch against current CVS. Please remove all the 
functions referencing to buffer and buffer.h to compile.


Remember that it's a work in progress thing. It has flaws. One issue 
that currently bugs me is, that processes can deadlock if they keep 
trying to create a message (IMessagesCreate), but fail because the queue 
is full of messages for themselves. A process should thus always try to 
fetch messages (via IMessagesCheck) and remove pending ones before 
retrying to send one. That's not always practical.


One design limitation is, that you have to know how large your message 
is as soon as you reserve (shared) memory for it, but that's intended.


At least I've stress tested the wrap-around code and it seems to work. 
No guarantees, though ;-)


Regards

Markus

# 
# old_revision [9a68fa59cb0ca3246f03880664062abb98f1a61a]
# 
# add_file src/backend/storage/ipc/imsg.c
#  content [3e84c6372a47612a2fe233fee6b122808135580e]
# 
# add_file src/include/storage/imsg.h
#  content [3cf37b12a00b90f65b8393fc5e27c98d772dc22b]
# 
# patch src/backend/storage/ipc/Makefile
#  from [71276ab6483aebbb27f87c988d77ab876611f190]
#to [9a99101d3e8bbfe52c97763db536804e94371828]
# 
# patch src/backend/storage/ipc/ipci.c
#  from [177f266b4668190a6ab1f2902305f7b7e577ef8d]
#to [1971e2122ba4455c8b9784e70059d917fdf4f4c8]
# 

--- src/backend/storage/ipc/imsg.c	3e84c6372a47612a2fe233fee6b122808135580e
+++ src/backend/storage/ipc/imsg.c	3e84c6372a47612a2fe233fee6b122808135580e
@@ -0,0 +1,375 @@
+/*-
+ *
+ * imsg.c
+ *internal messages from process to process sent via shared memory.
+ *
+ *
+ * Copyright (c) 2006, Markus Schiltknecht [EMAIL PROTECTED]
+ *
+ *-
+ */
+
+#include unistd.h
+#include signal.h
+#include string.h
+
+#ifdef HAVE_SYS_FILIO_H
+#include sys/filio.h
+#endif
+
+#include sys/ioctl.h
+
+#include postgres.h
+#include miscadmin.h
+#include storage/proc.h
+#include storage/imsg.h
+#include storage/ipc.h
+#include storage/buffer.h
+#include storage/spin.h
+#include utils/elog.h
+
+/* global variable pointing to the shmem area */
+IMessageCtlData *IMessageCtl = NULL;
+
+/*
+ * Initialization of shared memory for internal messages.
+ */
+int
+IMessageShmemSize(void)
+{
+	return MAXALIGN(IMessageBufferSize);
+}
+
+void
+IMessageShmemInit(void)
+{
+	bool		foundIMessageCtl;
+
+#ifdef IMSG_DEBUG
+	elog(DEBUG3, IMessageShmemInit(): initializing shared memory);
+#endif
+
+	IMessageCtl = (IMessageCtlData *)
+		ShmemInitStruct(IMsgCtl,
+		MAXALIGN(IMessageBufferSize),
+		foundIMessageCtl);
+
+	if (foundIMessageCtl)
+		return;
+
+	/* empty the control structure and all message descriptors */
+	memset(IMessageCtl, 0, MAXALIGN(IMessageBufferSize));
+
+	/* initialize start and end pointers */
+	IMessageCtl-queue_start = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+	IMessageCtl-queue_end = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+
+	SpinLockInit(IMessageCtl-msgs_lck);
+}
+
+/*
+ *   IMessageCreate
+ *
+ * creates a new but deactivated message within the queue, returning the
+ * message header of the newly created message.
+ */
+IMessage*
+IMessageCreate(int recipient, int msg_size)
+{
+	IMessage	   *msg;
+	intremaining_space;
+
+#ifdef IMSG_DEBUG
+	elog(DEBUG3, IMessageCreate(): recipient: %d, size: %d,
+		recipient, msg_size);
+#endif
+
+	/* assert a reasonable maximum message size */
+	Assert(msg_size  (MAXALIGN(IMessageBufferSize) / 4));
+
+	START_CRIT_SECTION();
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile IMessageCtlData *imsgctl = IMessageCtl;
+
+		SpinLockAcquire(imsgctl-msgs_lck);
+
+		/*
+		 * Check if there is enough space for the message plus the
+		 * terminating header
+		 */
+		if (imsgctl-queue_end  imsgctl-queue_start)
+			remaining_space = (int) imsgctl-queue_start -
+			  (int) imsgctl-queue_end;
+		else
+			remaining_space = (int) IMSG_BUFFER_END(imsgctl) -
+			  (int) imsgctl-queue_end;
+
+		if (remaining_space  (MAXALIGN(IMessageBufferSize) / 8))
+		{
+#ifdef IMSG_DEBUG
+			elog(DEBUG3, IMessageCreate(): cleanup starting);
+#endif
+
+			/* Clean up messages that have been removed. */
+			while (imsgctl-queue_start-recipient == 0)
+			{
+if (imsgctl-queue_start  imsgctl-queue_end)
+{
+	if ((imsgctl-queue_start-sender == 0) 
+		(imsgctl-queue_start-recipient == 0))
+	{
+#ifdef IMSG_DEBUG
+			elog(DEBUG3, IMessageCreate(): cleanup wrapped);
+#endif
+		imsgctl-queue_start = (IMessage*) IMSG_BUFFER_START(imsgctl);
+		continue;
+	}
+}
+else if (imsgctl-queue_start = imsgctl-queue_end)
+	break;
+
+imsgctl-queue_start = (IMessage

Re: [PATCHES] replication docs: split single vs. multi-master

2006-11-16 Thread Markus Schiltknecht

Hello Bruce,

Bruce Momjian wrote:

Actually the patch moves down data paritioning.  I am confused.


Uh.. yeah, sorry, that's what I meant.


I thought a long time about this.  I have always liked splitting the
solutions up into single and multi-master, but in doing this
documentation section, I realized that the split isn't all that helpful,
and can be confusing.  


Not mentioning that categorization doesn't help in clearing the 
confusion. Just look around, most people use these terms. They're used 
by MySQL and Oracle. Even Microsofts ActiveDirectory seems to have a 
multi-master operation mode.


For example, Slony is clearly single-master, 


Agreed.


but
what about data partitioning?  That is multi-master, in that there is
more than one master, but only one master per data set.  


Data Partitioning is a way to work around the trouble of database 
replication in the application layer. Instead of trying to categorize it 
like a replication algorithm, we should explain that working around the 
trouble may be worthwhile in many cases.



And for
multi-master, Oracle RAC is clearly multi master,


Yes.


 and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently.  


Several single-master systems? C'mon! Pgpool simply implements the most 
simplistic form of multi-master replication. Just because you can access 
the single databases inside the cluster doesn't make it less 
Multi-Master, does it?



After much thought, it seems that putting things
into single/multi-master categories just adds more confusion, because
several solutions just aren't clear


Agreed, I'm not saying you must categorize all solutions you describe. 
But please do categorize the ones which can be (and have so often been) 
categorized.



or fall into neither, e.g. Shared Disk Failover.


Oh, yes, this reminds me of Brad Nicholson's suggestion in [1] to add a 
warning about the risk of having two postmaster come up


What about other means of sharing disks or filesystems? NBDs or even 
worse: NFS?



Another issue is that you mentioned heavly locking for
multi-master, when in fact pgpool doesn't do any special inter-server
locking, so it just doesn't apply.


Sure it does apply, in the sense that *every* single lock is granted and 
released on *every* node. The total amount of locks scales linearly with 
the amount of nodes in the cluster.



In summary, it just seemed clearer to talk about each item and how it
works, rather than try to categorize them.  The categorization just
seems to do more harm than good.

Of course, I might be totally wrong, and am still looking for feedback,
but these are my current thoughts.  Feedback?


AFAICT, the categorization in Single- and Multi-Master replication is 
very common. I think that's partly because it's focused on the solution. 
One can ask: do I want to write on all nodes or is a failover solution 
sufficient? Or can I probably get away with a read-only Slave?


It's a categorization the user does, often before having a glimpse about 
how complicated database replication really is. Thus, IMO, it would make 
sense to help the user and allow him to quickly find answers. (And we 
can still tell them that it's not easy or even possible to categorize 
all the solutions.)



I didn't mention distributed shared memory as a separate item because I
felt it was an implementation detail of clustering, rather than
something separate.  I kept two-phase in the cluster item for the same
reason.


Why is pgpool not an implementation detail of clustering, then?


Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html


That somehow doesn't work for me:

--- momjian.us ping statistics ---
15 packets transmitted, 0 received, 100% packet loss, time 14011ms


Just my 2 cents, in the hope to be of help.

Regards

Markus


[1]: Brad Nicholson's suggestion:
http://archives.postgresql.org/pgsql-admin/2006-11/msg00154.php

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] replication docs: split single vs. multi-master

2006-11-16 Thread Markus Schiltknecht

Bruce Momjian wrote:
I am now attaching the additional text I added based on your comments. 
I have also changed the markup so all the solutions appear on the same

web page.  I think seeing it all together might give us new ideas for
improvement.


Good, it's definitely better to have it all on one page.

I just thought about the words 'master' and 'slave', which are 
admittedly quite unfortunate. I remember reading about efforts to remove 
them from geek-speech. They proposed to introduce better names. At least 
with the old IDE drives, master- and slave-drives seem to disappear now...


Regards

Markus


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[PATCHES] replication docs: split single vs. multi-master

2006-11-15 Thread Markus Schiltknecht

Hi,

as promised on -docs, here comes my proposal on how to improve the 
replication documentation. The patches are split as follows and have to 
be applied in order:


replication_doku_1.diff:

  Smallest possible one-word change to warm-up...


replication_doku_2.diff:

  Moves down Clustering For Parallel Query Execution, because
  it's not a replication type, but a feature, see explanation below.


replication_doku_3.diff:

  This is the most important part, splitting all replication types
  into single- and multi-master replication.  I'm new to SGML, so
  please bear with me if this is not the right way to do it...

  Shared-Disk-Failover does IMO not fall into a replication category.
  Should we mention there, that 'sharing' a disk using NFS or some
  such is not recommended? (And more importantly, does not work as
  a multi-master replication solution)

  I've added a general paragraph describing Single-Master Replication.
  I'm stating that 'Single-Master Replication is always asynchronous'.
  Can anybody think of a counter example? Or a use case for sync
  Single-Master Replication? The argument to put down is: if you go
  sync, why don't you do Multi-Master right away?

  Most of the Clustering for Load Balancing text applies to all
  synchronous, Multi-Master Replication algorithms, even to
  Query Broadcasting. Thus it became the general description
  of Multi-Master Replication. The section Clustering for
  Load Balancing has been removed.


replication_doku_4.diff:

  These are the text modifications I did to adjust to the new structure.
  I've adjusted the Multi-Master Replication text to really be
  appropriate for all existing solutions.

  Query Broadcasting has some corrections, mainly to stick to describe
  that algorithm there and none of the general properties of
  Multi-Master Replication.

  I've added two sections to describe 2PC and Distributed SHMEM
  algorithms which belong into that category and cover all of the
  previous text. Except that I've removed the mentioning of Oracle RAC
  in favor of Pgpool-II.


IMO this makes it clearer, what replication types exist and how to 
categorize them. I'm tempted to mention the Postgres-R algorithm as 
fourth sub-section of Multi-Master Replication, as it's quite different 
from all the others in many aspects. But I urgently need to do go to 
work now... besides, I'm heavily biased regarding Postgres-R, so 
probably someone else should write that paragraph. :-)


The only downside of the structure I'm proposing here is: the 
non-replication-algorithms fall of somewhat. Namely: Shared-Disk 
Failover, Data Partitioning, Parallel Query Execution and 
Commercial Solutions.


For me, Data Partitioning as well as Parallel Query Execution are 
possible optimizations which can be run on top of replicated data. They 
don't replicate data and are thus not replication solutions. But 
grouping those two together would make sense.


So. I really have to go to work now!

Regards

Markus

*** doc/src/sgml/failover.sgml	2006-11-15 08:52:25.0 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 08:58:34.0 +0100
***
*** 126,132 
/para
  
para
!Such partitioning implements both failover and load balancing.  Failover
 is achieved because the data resides on both servers, and this is an
 ideal way to enable failover if the servers share a slow communication
 channel. Load balancing is possible because read requests can go to any
--- 126,132 
/para
  
para
!Such partitioning provides both failover and load balancing.  Failover
 is achieved because the data resides on both servers, and this is an
 ideal way to enable failover if the servers share a slow communication
 channel. Load balancing is possible because read requests can go to any
*** doc/src/sgml/failover.sgml	2006-11-15 08:58:34.0 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 09:10:53.0 +0100
***
*** 114,150 
/para
   /sect1
  
-  sect1 id=data-partitioning
-   titleData Partitioning/title
- 
-   para
-Data partitioning splits tables into data sets.  Each set can only be
-modified by one server.  For example, data can be partitioned by
-offices, e.g. London and Paris.  While London and Paris servers have all
-data records, only London can modify London records, and Paris can only
-modify Paris records.
-   /para
- 
-   para
-Such partitioning provides both failover and load balancing.  Failover
-is achieved because the data resides on both servers, and this is an
-ideal way to enable failover if the servers share a slow communication
-channel. Load balancing is possible because read requests can go to any
-of the servers, and write requests are split among the servers.  Of
-course, the communication to keep all the servers up-to-date adds
-overhead, so ideally the write load should be low, or localized as in
-the London/Paris 

Re: [PATCHES] Replication Documentation

2006-08-02 Thread Markus Schiltknecht

Hello,

Peter Eisentraut wrote:

1. post information on pgsql-general
1.a. solicit comments
2. put information page on web site
3. link from documentation to web site


I don't remember such a clear agreement either. I'm glad Chris has 
written something. And posting it to -docs seems a much better fit, IMHO.


Also, I think we didn't really agree on where exactly to put what 
information. See my previous mail on -hackers for my opinion on that.


I don't think this sort of material belongs directly into the PostgreSQL 
documentation.


I agree with that.

Regards

Markus

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Support Parallel Query Execution in Executor

2006-04-08 Thread Markus Schiltknecht
Hello,

On Sat, 2006-04-08 at 13:45 -0400, Jonah H. Harris wrote:
  I'd like to help teaching PostgreSQL the art of parallel query
  execution.
 
 Cool, we're not at the communication-level yet, but your help would be
 appreciated.
 
  In case you're interested I'll compile a patch for review.
 
 Surely!

[ This patch is not meant to be included immediately. But it might be
one step towards supporting parallel query execution, IMHO. ]

This is my internal message passing implementation. I'd probably also
need to document its use...

It's not tested except for 'it works for me'. And it still misses the
'wrap around' functionality to recycle the buffer. Also, I should
probably rename buffer.h and buffer.c since there are a lot of other
buffers in PostgreSQL.

Markus

*** /dev/null	2006-04-03 19:40:38.0 +0200
--- trunk/src/backend/storage/ipc/imsg.c	2006-04-08 20:39:32.0 +0200
***
*** 0 
--- 1,277 
+ /*-
+  *
+  * imsg.c
+  *internal messages from process to process sent via shared memory.
+  *
+  *
+  * Copyright (c) 2006, Markus Schiltknecht [EMAIL PROTECTED]
+  *
+  *-
+  */
+ 
+ #include unistd.h
+ #include signal.h
+ 
+ #ifdef HAVE_SYS_FILIO_H
+ #include sys/filio.h
+ #endif
+ 
+ #include sys/ioctl.h
+ 
+ #include postgres.h
+ #include miscadmin.h
+ #include storage/proc.h
+ #include storage/imsg.h
+ #include storage/ipc.h
+ #include storage/buffer.h
+ #include storage/spin.h
+ #include utils/elog.h
+ 
+ /* global variable pointing to the shmem area */
+ IMessageCtlData *IMessageCtl = NULL;
+ 
+ #define IMSG_DEBUG
+ 
+ /*
+  * Initialization of shared memory for internal messages.
+  */
+ int
+ IMessageShmemSize(void)
+ {
+ 	return MAXALIGN(IMessageBufferSize);
+ }
+ 
+ void
+ IMessageShmemInit(void)
+ {
+ 	bool		foundIMessageCtl;
+ 
+ #ifdef IMSG_DEBUG
+ 	elog(DEBUG3, IMessageShmemInit(): initializing shared memory);
+ #endif
+ 
+ 	IMessageCtl = (IMessageCtlData *)
+ 		ShmemInitStruct(IMsgCtl,
+ 		MAXALIGN(IMessageBufferSize),
+ 		foundIMessageCtl);
+ 
+ 	if (foundIMessageCtl)
+ 		return;
+ 
+ 	/* empty the control structure and all message descriptors */
+ 	memset(IMessageCtl, 0, MAXALIGN(IMessageBufferSize));
+ 
+ 	/* initialize start and end pointers */
+ 	IMessageCtl-queue_start = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+ 	IMessageCtl-queue_end = (IMessage*) IMSG_BUFFER_START(IMessageCtl);
+ 
+ 	SpinLockInit(IMessageCtl-msgs_lck);
+ }
+ 
+ /*
+  *   IMessageCreate
+  *
+  * creates a new but deactivated message within the queue, returning the
+  * message header of the newly created message.
+  */
+ IMessage*
+ IMessageCreate(int recipient, int msg_size)
+ {
+ 	IMessage	   *msg;
+ 
+ #ifdef IMSG_DEBUG
+ 	elog(DEBUG3, IMessageCreate(): recipient: %d, size: %d,
+ 		recipient, msg_size);
+ #endif
+ 
+ 	/* assert a reasonable maximum message size */
+ 	Assert(msg_size  MAXALIGN(IMessageShmemSize) / 4);
+ 
+ 	START_CRIT_SECTION();
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile IMessageCtlData *imsgctl = IMessageCtl;
+ 
+ 		SpinLockAcquire(imsgctl-msgs_lck);
+ 
+ 		/* check if there is enough space for the message plus the
+ 		 * terminating header */
+ 		if ((int) imsgctl-queue_end + msg_size + sizeof(IMessage) 
+ 			 IMSG_BUFFER_END(imsgctl))
+ 		{
+ 			msg = (IMessage*) imsgctl-queue_end;
+ 			imsgctl-queue_end += IMSG_ALIGN(msg_size);
+ 
+ 			imsgctl-queue_end-sender = 0;
+ 			imsgctl-queue_end-recipient = 0;
+ 		}
+ 		else
+ 		{
+ 			/* most probably we can wrap around and recycle the space
+ 			 * at the start of the queue. */
+ 
+ 			/* TODO: implement wrap-around functionality */
+ 			Assert(0);
+ 			msg = NULL; /* to avoid compiler warning */
+ 		}
+ 
+ 		/* initialize the message as inactive */
+ 		msg-sender = 0;
+ 		msg-recipient = recipient;
+ 		msg-size = msg_size;
+ 
+ 		/* queue editing finished */
+ 		SpinLockRelease(imsgctl-msgs_lck);
+ 
+ #ifdef IMSG_DEBUG
+ 	elog(DEBUG3, IMessageCreate(): created at %08X size: %d, (int) msg,
+ 		msg-size);
+ #endif
+ 	}
+ 	END_CRIT_SECTION();
+ 
+ 	return msg;
+ }
+ 
+ void
+ IMessageActivate(IMessage *msg)
+ {
+ 	msg-sender = MyProcPid;
+ 
+ 	/* send a signal to the recipient */
+ 	kill(msg-recipient, SIGUSR1);
+ }
+ 
+ /*
+  *   IMessageRemove
+  *
+  * Marks a message as removable by setting the recipient to null. The message
+  * will eventually be removed during creation of new messages, see
+  * IMessageCreate().
+  */
+ void
+ IMessageRemove(IMessage *msg)
+ {
+ 	msg-recipient = 0;
+ }
+ 
+ /*
+  *   IMessageCheck
+  *
+  * Checks if there is a message in the queue for this process. Returns null
+  * if there is no message for this process, the message header otherwise. The
+  * message remains in the queue and should be removed by IMessageRemove().
+  */
+ IMessage*
+ IMessageCheck(void)
+ {
+ 	IMessage	   *msg