Quick update: a couple hours after deleting entries from both sl_log
tables with txids > 630M, it appears the cleanup thread has taken care of
business. sl_log_1 is down from 54GB to 24KB.
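
For the record, the deletes were roughly of this form on each subscriber (a
sketch rather than the exact statements; the 630M cutoff comes from the
log_txid range shown further down in the thread):

    -- cutoff per the "txids > 630M" observation; the exact value may have differed
    DELETE FROM _ams_cluster.sl_log_1 WHERE log_txid > 630000000;
    DELETE FROM _ams_cluster.sl_log_2 WHERE log_txid > 630000000;
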
Tom :-)
On 2/1/16, 9:06 AM, "Tignor, Tom" <[email protected]> wrote:
>
> Jan,
> Thanks much for all the help. I’ve been looking over logswitch_finish()
>and the event and changelog tables. Selecting logswitch_finish() on one of
>my replicas simply returned "Slony-I: log switch to sl_log_2 still in
>progress - sl_log_1 not truncated". All three seem to be in that state with
>either sl_log_1 or sl_log_2. Looking closer at sl_event and sl_log_2 I’m
>seeing some strangeness.
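>
>For reference, the call I ran on each node was simply:
>
>    SELECT _ams_cluster.logswitch_finish();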
>
>ams=# select min(pg_catalog.txid_snapshot_xmin(ev_snapshot)),
>max(pg_catalog.txid_snapshot_xmin(ev_snapshot)) from _ams_cluster.sl_event
>where ev_origin = 1;
> min | max
>-----------+-----------
> 139136948 | 139204299
>(1 row)
>
>
>ams=# select min(log_txid), max(log_txid) from _ams_cluster.sl_log_2 where
>log_origin = 1;
> min | max
>-----------+-----------
> 631532717 | 631661386
>(1 row)
>
>
>ams=#
>
>
> So as I understand it, all the txids referenced in sl_event are in the
>139M range while all those in sl_log_2 are in the 631M range. Normally the
>sl_log txids should be older, shouldn’t they? Do you think I’ve hit a
>txid-wraparound problem?
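>
>One quick sanity check is to see where the txid counter is right now and
>compare it with both ranges, e.g.:
>
>    SELECT txid_current(), txid_current_snapshot();  -- current txid and snapshot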
>
> Tom :-)
>
>
>
>On 1/28/16, 1:09 PM, "Jan Wieck" <[email protected]> wrote:
>
>>On 01/28/2016 11:11 AM, Tignor, Tom wrote:
>>>
>>> Output below. They seem to be replicating normally, except for the
>>>sl_log growth.
>>
>>Indeed. Is there anything in the slon logs for those nodes that says why
>>it doesn't finish the log switch?
>>
>>Connect to the database as the slony user.
>>
>>To check if a log switch is indeed in progress, do
>>
>> SELECT last_value FROM _ams_cluster.sl_log_status;
>>
>>It should be either 2 or 3. If it is 0 or 1, no log switch is in
>>progress and you can start one with
>>
>>   SELECT _ams_cluster.logswitch_start();
>>
>>If it is 2 or 3, then you can do
>>
>> SELECT _ams_cluster.logswitch_finish();
>>
>>All these operations are harmless and will only do what is safely
>>possible. Look at the code of logswitch_finish() to find out how it
>>determines whether the current log switch can be finished. In short, the
>>cleanup thread removes events from sl_event that have been confirmed
>>by all nodes in the cluster. The function logswitch_finish() checks
>>whether anything belonging to the old log is left in sl_event; if so, it
>>will not finish. By running those queries manually you can find out which
>>event is preventing the switch from finishing.
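>>
>>For example, a query along these lines (just a sketch, not the exact check
>>the function performs) shows whether the old log still has rows and what
>>the oldest remaining events per origin are:
>>
>>    -- is anything left in the old log table (sl_log_1 in this case)?
>>    SELECT EXISTS (SELECT 1 FROM _ams_cluster.sl_log_1 LIMIT 1);
>>
>>    -- oldest event still sitting in sl_event, per origin
>>    SELECT ev_origin, min(ev_seqno)
>>      FROM _ams_cluster.sl_event
>>     GROUP BY ev_origin;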
>>
>>
>>
>>>
>>>
>>> [email protected]:~$ /a/third-party/postgresql/bin/psql -U
>>> ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>> st_origin | st_received | st_last_event |       st_last_event_ts        | st_last_received |      st_last_received_ts      |   st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>>>-----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>>         2 |           1 |    5000611610 | 2016-01-28 16:06:37.343826+00 |       5000611610 | 2016-01-28 16:06:38.843562+00 | 2016-01-28 16:06:37.343826+00 |                 0 | 00:00:09.201996
>>>         2 |           3 |    5000611610 | 2016-01-28 16:06:37.343826+00 |       5000611609 | 2016-01-28 16:06:29.851545+00 | 2016-01-28 16:06:27.341894+00 |                 1 | 00:00:19.203928
>>>         2 |           4 |    5000611610 | 2016-01-28 16:06:37.343826+00 |       5000611610 | 2016-01-28 16:06:38.710974+00 | 2016-01-28 16:06:37.343826+00 |                 0 | 00:00:09.201996
>>> (3 rows)
>>>
>>>
>>> [email protected]:~$ /a/third-party/postgresql/bin/psql -U
>>> ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>> st_origin | st_received | st_last_event |       st_last_event_ts        | st_last_received |      st_last_received_ts      |   st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>>>-----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>>         3 |           4 |    5000654642 | 2016-01-28 16:07:05.493455+00 |       5000654642 | 2016-01-28 16:07:06.486539+00 | 2016-01-28 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>         3 |           1 |    5000654642 | 2016-01-28 16:07:05.493455+00 |       5000654642 | 2016-01-28 16:07:08.040292+00 | 2016-01-28 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>         3 |           2 |    5000654642 | 2016-01-28 16:07:05.493455+00 |       5000654642 | 2016-01-28 16:07:08.472049+00 | 2016-01-28 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>> (3 rows)
>>>
>>>
>>> [email protected]:~$ /a/third-party/postgresql/bin/psql -U
>>> ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>> st_origin | st_received | st_last_event |       st_last_event_ts        | st_last_received |      st_last_received_ts      |   st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>>>-----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>>         4 |           3 |    5000637483 | 2016-01-28 16:07:32.698809+00 |       5000637482 | 2016-01-28 16:07:28.731404+00 | 2016-01-28 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>         4 |           1 |    5000637483 | 2016-01-28 16:07:32.698809+00 |       5000637482 | 2016-01-28 16:07:24.839978+00 | 2016-01-28 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>         4 |           2 |    5000637483 | 2016-01-28 16:07:32.698809+00 |       5000637482 | 2016-01-28 16:07:22.926411+00 | 2016-01-28 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>> (3 rows)
>>>
>>>
>>>
>>> Tom :-)
>>>
>>>
>>>
>>> On 1/28/16, 10:38 AM, "Jan Wieck" <[email protected]> wrote:
>>>
>>>>On 01/28/2016 08:30 AM, Tignor, Tom wrote:
>>>>>
>>>>> Hello slony folks,
>>>>> From my reading I’m guessing (hoping) this isn’t a new problem. I have
>>>>>a simple cluster with one provider replicating to three subscribers. The
>>>>>provider’s changelog tables (sl_log_[1|2]) are fine, but the subscribers
>>>>>(with forwarding enabled) are all showing runaway growth. Looked through
>>>>>the FAQ and I don’t see the node I dropped or any idle transactions as
>>>>>viable culprits. Are there other thoughts on the cause? Can I safely
>>>>>manually delete/truncate some/all of the changelog tables? These
>>>>>replicas are all leaf nodes. I only have forwarding turned on to allow
>>>>>for failover, and my replication rate is the 2 sec default.
>>>>> Thanks in advance for any insights.
>>>>
>>>>What is the output of the sl_status view "on those leaf nodes"?
>>>>
>>>>
>>>>>
>>>>> ams=# select pg_size_pretty(pg_total_relation_size('_ams_cluster.sl_log_1'));
>>>>>  pg_size_pretty
>>>>> ----------------
>>>>>  75 MB
>>>>> (1 row)
>>>>>
>>>>> ams=# select pg_size_pretty(pg_total_relation_size('_ams_cluster.sl_log_2'));
>>>>>  pg_size_pretty
>>>>> ----------------
>>>>>  34 GB
>>>>> (1 row)
>>>>>
>>>>> ams=# select * from _ams_cluster.sl_confirm where con_origin not in
>>>>> (select no_id from _ams_cluster.sl_node) or con_received not in
>>>>> (select no_id from _ams_cluster.sl_node);
>>>>>  con_origin | con_received | con_seqno | con_timestamp
>>>>> ------------+--------------+-----------+---------------
>>>>> (0 rows)
>>>>>
>>>>>
>>>>> ams=# select * from pg_stat_activity where current_query like '%IDLE%';
>>>>>  datid | datname | procpid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | waiting | current_query
>>>>> -------+---------+---------+----------+---------+------------------+-------------+-----------------+-------------+---------------+------------+-------------+---------+---------------
>>>>>  16393 | ams | 2611 | 212995 | ams_viewer | | 88.221.209.10 | | 43328 | 2016-01-28 12:24:49.706389+00 | | 2016-01-28 13:18:02.427848+00 | f | <IDLE>
>>>>>  16393 | ams | 12894 | 212995 | ams_viewer | | 88.221.209.10 | | 60112 | 2016-01-28 12:47:26.230681+00 | | 2016-01-28 13:15:27.744242+00 | f | <IDLE>
>>>>>  16393 | ams | 12884 | 212995 | ams_viewer | | 88.221.209.10 | | 44302 | 2016-01-28 12:47:25.100006+00 | | 2016-01-28 13:15:27.936059+00 | f | <IDLE>
>>>>>  16393 | ams | 23466 | 213867 | ams_slony | psql | | | -1 | 2016-01-28 13:11:32.030343+00 | 2016-01-28 13:18:37.283992+00 | 2016-01-28 13:18:37.283992+00 | f | select * from pg_stat_activity where current_query like '%IDLE%';
>>>>>  16393 | ams | 6719 | 213867 | ams_slony | slon.origin_2_provider_2 | 60.254.150.133 | | 61806 | 2016-01-22 01:59:14.800129+00 | | 2016-01-28 13:18:25.935111+00 | f | <IDLE>
>>>>>  16393 | ams | 6718 | 213867 | ams_slony | slon.origin_3_provider_2 | 60.254.150.133 | | 61805 | 2016-01-22 01:59:14.797655+00 | | 2016-01-28 13:18:34.304475+00 | f | <IDLE>
>>>>>  16393 | ams | 5505 | 213867 | ams_slony | slon.origin_4_provider_2 | 80.67.75.105 | | 36477 | 2016-01-22 01:56:25.637046+00 | | 2016-01-28 13:18:36.1348+00 | f | <IDLE>
>>>>>  16393 | ams | 5504 | 213867 | ams_slony | slon.origin_3_provider_2 | 72.246.50.22 | | 51813 | 2016-01-22 01:56:25.240798+00 | | 2016-01-28 13:18:28.961629+00 | f | <IDLE>
>>>>>  16393 | ams | 5487 | 213867 | ams_slony | slon.origin_4_provider_2 | 72.246.50.22 | | 51803 | 2016-01-22 01:56:22.896388+00 | | 2016-01-28 13:18:35.858913+00 | f | <IDLE>
>>>>>  16393 | ams | 5047 | 213867 | ams_slony | slon.origin_2_provider_2 | 72.246.50.22 | | 51564 | 2016-01-22 01:55:23.600296+00 | | 2016-01-28 13:18:34.487192+00 | f | <IDLE>
>>>>>  16393 | ams | 5041 | 213867 | ams_slony | slon.origin_2_provider_2 | 80.67.75.105 | | 36402 | 2016-01-22 01:55:22.964462+00 | | 2016-01-28 13:18:34.519066+00 | f | <IDLE>
>>>>>  16393 | ams | 6694 | 213867 | ams_slony | slon.node_2_listen | 60.254.150.133 | | 61795 | 2016-01-22 01:59:12.095052+00 | | 2016-01-28 13:18:27.928384+00 | f | <IDLE>
>>>>>  16393 | ams | 4456 | 213867 | ams_slony | slon.node_2_listen | 72.246.50.22 | | 51238 | 2016-01-22 01:54:21.481355+00 | | 2016-01-28 13:18:36.766973+00 | f | <IDLE>
>>>>>  16393 | ams | 4457 | 213867 | ams_slony | slon.node_2_listen | 80.67.75.105 | | 36333 | 2016-01-22 01:54:21.500456+00 | | 2016-01-28 13:18:36.204482+00 | f | <IDLE>
>>>>>  16393 | ams | 4428 | 213867 | ams_slony | slon.local_monitor | | | -1 | 2016-01-22 01:54:18.977015+00 | | 2016-01-28 13:18:36.652567+00 | f | <IDLE>
>>>>>  16393 | ams | 4427 | 213867 | ams_slony | slon.local_sync | | | -1 | 2016-01-22 01:54:18.976932+00 | | 2016-01-28 13:18:36.151998+00 | f | <IDLE>
>>>>>  16393 | ams | 4426 | 213867 | ams_slony | slon.local_cleanup | | | -1 | 2016-01-22 01:54:18.976842+00 | | 2016-01-28 13:12:12.582921+00 | f | <IDLE>
>>>>>  16393 | ams | 4425 | 213867 | ams_slony | slon.remoteWorkerThread_4 | | | -1 | 2016-01-22 01:54:18.976783+00 | | 2016-01-28 13:18:33.99715+00 | f | <IDLE>
>>>>>  16393 | ams | 4420 | 213867 | ams_slony | slon.remoteWorkerThread_1 | | | -1 | 2016-01-22 01:54:18.976548+00 | | 2016-01-28 13:18:33.561531+00 | f | <IDLE>
>>>>>  16393 | ams | 4419 | 213867 | ams_slony | slon.remoteWorkerThread_3 | | | -1 | 2016-01-22 01:54:18.97647+00 | | 2016-01-28 13:18:34.808907+00 | f | <IDLE>
>>>>>  16393 | ams | 4413 | 213867 | ams_slony | slon.local_listen | | | -1 | 2016-01-22 01:54:18.965568+00 | | 2016-01-28 13:18:37.096159+00 | f | <IDLE>
>>>>> (21 rows)
>>>>>
>>>>>
>>>>> ams=#
>>>>>
>>>>>
>>>>>
>>>>> Tom :-)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>--
>>>>Jan Wieck
>>>>Senior Software Engineer
>>>>http://slony.info
>>>
>>
>>
>>--
>>Jan Wieck
>>Senior Software Engineer
>>http://slony.info
>
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general