Hi,
we have a four-node replication system that has been running for a
year now, and we recently upgraded to Slony 1.2.6.

We run two or three different sets in the replication: one set
is the current data, and the other sets are separate schemas (one per
year) that we archive data to from the main schema. That detail is not
important in itself, but it means that when we need to change master, we
have to do it one set at a time with slonik_move_set, like so:
slonik_move_set 3 1 3 (moving from node 1 to node 3)
slonik_move_set 2 1 3
slonik_move_set 1 1 3
Now, to get sl_subscribe to look right, so that nodes 2 and 4 don't get
updates via the old master (node 1), we have to do:
slonik_subscribe_set 3 2
slonik_subscribe_set 3 4
slonik_subscribe_set 2 2
slonik_subscribe_set 2 4

The problem is that, when doing the first subscribe above, Slony 1.2's
RebuildListenEntries() kicks in and changes sl_listen in a way that
prevents events from the new master from propagating to nodes 2 and 4,
like this:

db=# select * from sl_listen where li_origin = 3 or li_provider = 3;
 li_origin | li_provider | li_receiver
-----------+-------------+-------------
          3 |           3 |           1
          2 |           3 |           1
          4 |           3 |           1
          1 |           3 |           2
          4 |           3 |           2
          1 |           3 |           4
          2 |           3 |           4
(7 rows)

If I interpret this correctly, there is no way for events from node 3 to
reach nodes 2 and 4, and this causes replication to stop. Inserting the
rows (3, 3, 4) and (3, 3, 2) into sl_listen and then restarting the slons
resolves the problem, but it breaks again at the next subscribe. After
repeating the inserts four times (once per subscribe) we get to the
correct sl_listen.
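
The gap can be checked mechanically. What follows is only a minimal Python
sketch, not anything from Slony itself: the rows are copied from the query
output above, and the helper name receivers_missing_origin is made up for
illustration. Since the query selected every row with li_origin = 3, the
list is complete as far as events originating on node 3 are concerned.

```python
# Rows from the sl_listen query above, as (li_origin, li_provider, li_receiver).
SL_LISTEN = [
    (3, 3, 1),
    (2, 3, 1),
    (4, 3, 1),
    (1, 3, 2),
    (4, 3, 2),
    (1, 3, 4),
    (2, 3, 4),
]

NODES = {1, 2, 3, 4}


def receivers_missing_origin(listen_rows, origin, nodes):
    """Return the nodes (other than origin) with no listen row for that origin.

    Hypothetical helper for illustration; it is not a Slony function.
    """
    covered = {
        receiver
        for (li_origin, _provider, receiver) in listen_rows
        if li_origin == origin
    }
    return sorted(nodes - {origin} - covered)


if __name__ == "__main__":
    # Nodes that never hear about events generated on node 3.
    print(receivers_missing_origin(SL_LISTEN, origin=3, nodes=NODES))
```

With the two manually inserted rows appended to the list, the same call
returns an empty list.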

Examining this further, it seems that the following sl_subscribe:
db=# select * from sl_subscribe order by 1 desc;
 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
        3 |            3 |            1 | t           | t
        3 |            3 |            4 | t           | t
        3 |            3 |            2 | t           | t
        2 |            3 |            1 | t           | t
        2 |            1 |            2 | t           | t
        2 |            1 |            4 | t           | t
        1 |            3 |            1 | t           | t
        1 |            1 |            2 | f           | t
        1 |            1 |            4 | f           | t
(9 rows)

...that we are trying to correct with the subscribes, causes
RebuildListenEntries to make the wrong decision. Looking at the code
of the inner loop, which loops over all sl_node and sl_path entries, it
does this:

-- If v_receiver.no_id subscribes a set from v_provider.no_id, events
-- have to travel the same path as the data. Ignore possible sl_listen
-- that would break that rule.

perform 1 from sl_subscribe
         join sl_set on sl_set.set_id = sl_subscribe.sub_set
         where
                 sub_receiver = v_receiver.no_id and
                 sub_provider != v_provider.no_id and
                 set_origin = v_origin.no_id ;
if not found then
         insert into sl_listen (li_receiver, li_provider, li_origin)
                 values (v_receiver.no_id, v_provider.no_id, v_origin.no_id) ;
end if;

I do not fully understand the logic here, but for instance, when the
provider is 3 and the receiver is 4, "perform 1 ..." sees that there is a
set *not* being provided by 3 (sets 1 and 2 still come from node 1), and
therefore ignores the fact that set 3 is provided by 3 and doesn't
generate a way for node 4 to listen for events from node 3.
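
To make that reading concrete, here is a rough Python sketch of the quoted
check applied to the sl_subscribe rows above. It rests on assumptions that
are mine and not taken from the Slony source: that node 3 is the origin of
all three sets after the moves, that an sl_path entry exists between every
pair of nodes, and that the surrounding loops simply try every
(origin, provider, receiver) triple. The names conflicting_subscription and
rebuild_listen are made up for illustration.

```python
from itertools import product

# sl_subscribe as shown above: (sub_set, sub_provider, sub_receiver).
SL_SUBSCRIBE = [
    (3, 3, 1), (3, 3, 4), (3, 3, 2),
    (2, 3, 1), (2, 1, 2), (2, 1, 4),
    (1, 3, 1), (1, 1, 2), (1, 1, 4),
]

# Assumption: after the three move-set calls, node 3 is the origin of every set.
SET_ORIGIN = {1: 3, 2: 3, 3: 3}

NODES = [1, 2, 3, 4]


def conflicting_subscription(origin, provider, receiver):
    """Mirror of the quoted "perform 1 from sl_subscribe ..." check.

    True when the receiver gets some set of this origin from a node
    other than the candidate provider.
    """
    return any(
        sub_receiver == receiver
        and sub_provider != provider
        and SET_ORIGIN[sub_set] == origin
        for (sub_set, sub_provider, sub_receiver) in SL_SUBSCRIBE
    )


def rebuild_listen():
    """Simplified stand-in for the outer loops (an assumption, see above):
    try every triple and keep those the quoted check does not reject."""
    rows = []
    for origin, provider, receiver in product(NODES, repeat=3):
        if receiver in (origin, provider):
            continue  # a node does not listen to itself
        if not conflicting_subscription(origin, provider, receiver):
            rows.append((origin, provider, receiver))
    return rows


if __name__ == "__main__":
    # Same filter as the sl_listen query: li_origin = 3 or li_provider = 3.
    for row in rebuild_listen():
        if row[0] == 3 or row[1] == 3:
            print(row)
```

Under those assumptions the sketch rejects (3, 3, 2) and (3, 3, 4), because
nodes 2 and 4 still receive sets 1 and 2 from node 1, and the rows it keeps
for origin 3 or provider 3 are the same seven as in the sl_listen output.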

... or have I missed something?

//T-Å