Hello,

Foreword:
Unfortunately, I have no time to read the mailing lists or attend PostgreSQL and NoSQL events. Some of the ideas came from MongoDB and Cassandra. The inspiration was pg_rewind. There is little new here; it’s a wish list of what could be possible in the foreseeable future. It’s likely that people have already worked on a similar or better concept. But let me try.

Reasons:
Downtime is bad.
PostgreSQL failover requires manual intervention (client configuration, or host or DNS editing).
Third-party tools (in my experience) don’t offer the same stability and quality as PostgreSQL itself.
Also, this concept wouldn’t work without pg_rewind.
Less software means fewer bugs.

Goals:
Providing close to 100% HA with minimal manual intervention.
Minimizing possible human errors during failover.
Letting startup founders sleep well at night.
Automatic client configuration.
Avoiding split brains.

Extras:
Automatic streaming chain configuration.

Non-goals:
Multi-master replication.
Sharding.
Proxying.
Load balancing.

Why these:
It’s better to have a working technology now than a futuristic solution in the future. For many applications, stability and HA are more important than sharding or multi-master.

The concept:
You can set up a single-master PostgreSQL cluster with two or more nodes that can fail over several times without manual reconfiguration. Restarting the client isn’t needed if it’s smart enough to reconnect. Third-party software isn’t needed. Proxying isn’t needed.

Cases:

Running the cluster:
The cluster is running. There is one master. Every other node is a hot-standby slave.

The client driver accepts several hostname(:port) values in the connection parameters. They must belong to the same cluster. (The cluster’s name might be provided too.) The rest of the options (username, database name) are the same for every host and are needed only once. It’s not necessary to list every host. (Even listing one host is enough, but that is not recommended.)
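To illustrate the connection parameters, here is a minimal Python sketch of how a driver might split such a host list. The function name parse_hosts and the comma-separated format are assumptions made for illustration only, not an existing driver API. The fallback port 5432 is the standard PostgreSQL port. IPv6 literals are not handled.

```python
DEFAULT_PORT = 5432  # standard PostgreSQL port, used when no port is given


def parse_hosts(spec):
    """Split a list like 'db1:5433,db2' into [(host, port), ...].

    Hypothetical helper: the comma-separated format is an assumption,
    the proposal only says that several hostname(:port) values are accepted.
    """
    nodes = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        host, sep, port = item.partition(":")
        nodes.append((host, int(port) if sep else DEFAULT_PORT))
    return nodes


# The shared options (user, database) are given only once for all hosts.
params = {
    "hosts": parse_hosts("db1:5433, db2"),
    "user": "app",
    "dbname": "appdb",
}
```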
The client connects to one of the given hosts. If the node is running and it’s a slave, it tells the client which host is the master. The client connects to the master, even if the master was not listed in the connection parameters. It should be possible for the client to stay connected to the slave for read-only queries if the application wants that. If the node the client tried to connect to isn’t working, the client tries another node, and so on.

Manually promoting a new master:
The administrator promotes any of the slaves. The slave tells the master to stop gracefully. The master stops executing queries. It waits until the slave (the new master) has received the entire replication log. The new master is promoted. The old master becomes a slave. (It might use pg_rewind.) The old master asks the connected clients to reconnect to the new master. Then it drops the existing connections. It still accepts new connections, though, and tells them who the master is.

Manual step-down of the master:
The administrator kindly asks the master to stop being the master. The cluster elects a new master. Then it’s the same as promoting a new master.

Manual shutdown of the master:
It’s the same as step-down, but the master won’t run as a slave until it’s started up again.

Automatic failover:
The master stops responding for a given period. The majority of the cluster elects a new master. Then the process is the same as manual promotion. When the old master starts up, the cluster tells it that it is not the master anymore. It runs pg_rewind and acts as a slave. Automatic failover can happen again without human intervention. The clients are reconnected to the new master each time.

Automatic failover without majority:
It’s possible to specify in the config which server may act as master when there is no majority to vote.

Replication chain:
There are two cases.
1: All the slaves connect to the master.
2: One slave connects to the master and the rest of the nodes replicate from this slave.
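The majority rule from the two automatic failover cases above could be sketched roughly like this in Python. All names here (has_majority, may_promote, allow_without_majority) are hypothetical and exist only for illustration. The sketch only decides whether a promotion is allowed in the part of the cluster a node can see; it does not choose which node gets elected.

```python
def has_majority(votes, cluster_size):
    """True if `votes` is strictly more than half of the cluster."""
    return votes > cluster_size // 2


def may_promote(node, reachable, cluster_size, allow_without_majority=frozenset()):
    """Decide whether `node` may be promoted after the master went silent.

    reachable: set of node names that `node` can currently see, itself included.
    allow_without_majority: nodes that the config permits to act as master
    even when no majority is available (hypothetical representation).
    """
    if has_majority(len(reachable), cluster_size):
        return True
    return node in allow_without_majority


# Three-node cluster, old master "a" is cut off from "b" and "c":
# the side with two nodes may elect a master, the isolated node may not.
print(may_promote("b", {"b", "c"}, 3))
print(may_promote("a", {"a"}, 3))
```

In a two-node cluster a single surviving node never has a majority, so under these assumptions only a node listed in allow_without_majority could take over.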
Configuration:
Every node should have a “recovery.conf” that is not renamed on promotion.

cluster_name: an identifier for the cluster. Why not.

hosts: list of the hosts. It is recommended, but not required, to include every host in every file. A node could work like the driver does, discovering the rest of the cluster.

master_priority: integer. How likely this node is to become the new master on failover (except on manual promotion). A working cluster should not elect a new master just because a node has higher priority than the current master. Election happens only for the reasons described above.

slave_priority: integer. If any running node has this value larger than 0, a replication node is also elected, and the rest of the slaves replicate from the elected slave. Otherwise, they replicate from the master.

primary_master: boolean. The node may run as master without being elected by the majority. (This is not needed on manual promotion or shutdown. See bookkeeping.)

safe: boolean. If this is set to true and any kind of graceful failover happens, the promotion has to wait until this node has also received the whole replication stream, even if it’s not the new master, unless this node is not running. Every node can have this set to true for maximum safety.

Bookkeeping:
It would be good to know whether a node crashed or was shut down properly. This would make a difference in master election, streaming_slave election and the “safe” option. A two-node cluster would depend heavily on the bookkeeping. Bookkeeping would also help when a crashed or disconnected master that has primary_master=true comes back but doesn’t see the rest of the cluster.

Questions:
Is there any chance that something like this gets implemented?

Thank you for reading.
M.