Greetings;
Slony 1.1.5 on Pg 8.1.4 Solaris/SPARC.
Recently we discovered a problem on the disk array of the machine
serving as master (Node1) for a large 120GB production DB that runs up
to 650 TPS during peak loads. There were 3 nodes in the cluster,
configured as follows:
Node1
 |-- Node2
 |-- Node3
There are 2 replication sets. We shut down the applications and moved
both sets to Node2, making it the master. To avoid a problem we had
experienced earlier when running Slony functions by hand, a proper
Slonik preamble file was created and Slonik was used for all command
submission. Moving the sets took place as follows, and no errors were
seen:
lock set 1
wait for event ALL
move set 1
wait...
Ditto for set 2
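A hedged sketch of what that Slonik script may have looked like; the cluster name, conninfo strings, and node/set ids below are assumptions for illustration, not our actual values:

```
# Sketch only -- cluster name, conninfo strings, and ids are hypothetical
cluster name = mycluster;
node 1 admin conninfo = 'dbname=proddb host=node1 user=slony';
node 2 admin conninfo = 'dbname=proddb host=node2 user=slony';
node 3 admin conninfo = 'dbname=proddb host=node3 user=slony';

lock set (id = 1, origin = 1);
wait for event (origin = 1, confirmed = all, wait on = 1);
move set (id = 1, old origin = 1, new origin = 2);
wait for event (origin = 1, confirmed = all, wait on = 1);

# ...then the same lock / wait / move / wait sequence for set 2
```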
The app servers were brought up, and apparently everything was fine.
Oops! We might need to pull Node1, and I had forgotten to change
Node3's subscription to make the new master, Node2, its provider. So
for the moment we had this:
Node2
 |-- Node1
      |-- Node3
But everything was up and running at this point. I figured all that
was needed was to change the subscription info for Node3, so the
SUBSCRIBE SET Slonik command was issued on the new master node.
The app servers were back live again and the system was busy.
subscribe set1 provider = new master
wait for confirmed ALL wait on new master
Ditto for set2
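Again a hedged sketch, using the same hypothetical preamble as above; set ids 1 and 2, provider node 2 (the new master), and receiver node 3 are assumed:

```
# Sketch only -- ids are assumed for illustration
subscribe set (id = 1, provider = 2, receiver = 3, forward = yes);
wait for event (origin = 2, confirmed = all, wait on = 2);
subscribe set (id = 2, provider = 2, receiver = 3, forward = yes);
wait for event (origin = 2, confirmed = all, wait on = 2);
```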
The script runs and there are no errors.
What we found, however, was that the only node whose sl_listen
entries were adjusted was the new master, and the node we were trying
to point at the new provider began to fall behind, even though
querying the sl_status view reported that node up to date with respect
to all other nodes in the system.
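For reference, the lag check was along these lines; the "_mycluster" schema name is a stand-in for our actual cluster name:

```
-- run on the origin node; "_mycluster" is a stand-in
select st_origin, st_received, st_lag_num_events, st_lag_time
  from "_mycluster".sl_status;
```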
Eventually (hours later), the rebuildlistenentries() function was run
on all nodes, which resulted in a correct-looking sl_listen table on
all nodes.
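That repair amounted to roughly the following, run via psql against each node's database ("_mycluster" again standing in for the real cluster schema):

```
-- run in each node's database; "_mycluster" is a stand-in
select "_mycluster".rebuildlistenentries();
-- the slon daemons were then restarted so they would re-read sl_listen
```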
The daemons were restarted a few times during this process, and in
fact additional non-SYNC events sent down from the master, including a
complete subscription of a new node, are being replicated to the Node3
machine.
The tables suggest a configuration like this now;
Node2
 |-- Node1
 |-- Node3
Node3, however, is no longer receiving any updates to the DB tables
and may have to be dropped. Given the DB size and OLTP workload,
subscribing a new slave is a 36-hour job and one we can only do over a
weekend. Losing slaves is therefore painful, and we wish to
understand and avoid this scenario if at all possible.
Please explain what may have gone wrong here.
Many thanks
--
-------------------------------------------------------------------------------
Jerry Sievers   732 365-2844 (work)    Production Database Administrator
                305 321-1144 (mobile)  WWW E-Commerce Consultant
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general