Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-25 Thread Danka Ivanović
Hi,
Here are the logs from when pacemaker fails to start the postgres service on
the master. It only manages to start the postgres slave.
I tried different configurations with the pgsqlms and pgsql resource agents.
The errors below occur when using the pgsqlms agent, whose configuration I sent
in the first mail:

Apr 25 16:40:23 [4213] master   lrmd: info: log_execute: executing - rsc:PGSQL action:start call_id:51
launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata /var/lib/postgresql/9.5/main -w --timeout 120 start -o -c config_file=/etc/postgresql/9.5/main/postgresql.conf"
Apr 25 16:40:24 [4211] master   cib: info: cib_perform_op: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='PGSQL']/lrm_rsc_op[@id='PGSQL_last_0']: @operation_key=PGSQL_start_0, @operation=start, @transition-key=12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf, @transition-magic=0:0;12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf, @call-id=176, @rc-code=0, @exec-time=1146, @queue-time=0
Apr 25 16:40:53 [4216] master   crmd: debug: crm_timer_start: Started Shutdown Escalation (I_STOP:120ms), src=53
Apr 25 16:41:23 [4213] master   lrmd: warning: child_timeout_callback: PGSQL_start_0 process (PID 5986) timed out

Part of the log is attached.
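
Since the failure above is the start operation overrunning its timeout, one
knob to try (at the cost of a higher RTO, as Jehan-Guillaume notes below) is
the start operation timeout on the resource. A minimal sketch with pcs,
assuming the resource is named PGSQL as in the log and the value is purely
illustrative:

  # give pg_ctl more time to bring up the instance before lrmd gives up
  pcs resource update PGSQL op start timeout=300s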

On Tue, 23 Apr 2019 at 17:28, Danka Ivanović wrote:

> Hi,
> It seems that an LDAP timeout caused the cluster failure. The cluster checks
> status every 15s on the master and 16s on the slave. The cluster needs the
> postgres user for authentication, but the user is first looked up on the LDAP
> server and only then locally on the host. When the connection to the LDAP
> server was interrupted, the cluster couldn't find the postgres user and so
> couldn't authenticate against the database to check its state. The problem
> was solved by reconfiguring /etc/ldap.conf and /etc/nslcd.conf: the variable
> nss_initgroups_ignoreusers was added, listing the local users that should be
> ignored when querying the LDAP server. Thanks for your help. :)
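> For reference, a minimal sketch of the resulting line (the exact user list
> is site-specific; the names below are assumptions):
>
>   # /etc/nslcd.conf: skip LDAP group lookups for these local accounts
>   nss_initgroups_ignoreusers postgres,hacluster,root
>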
> Another problem is that I cannot start the postgres master with pacemaker.
> When I start postgres manually (with systemd) and then start pacemaker on the
> slave, pacemaker recognizes the master, starts the slave, and failover works.
> That is the problem I didn't manage to solve. Should I send a new mail for
> that issue, or can we continue in this thread?
>
> On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais wrote:
>
>> On Fri, 19 Apr 2019 17:26:14 +0200
>> Danka Ivanović  wrote:
>> ...
>> > Should I change any of those timeout parameters in order to avoid
>> timeout?
>>
>> You can try raising the timeouts, indeed. But as long as we don't know
>> **why**
>> your VMs froze for some time, it is difficult to guess how high these
>> timeouts should be.
>>
>> Not to mention that raising them will raise your RTO.
>>
>
>
> --
> Regards
> Danka Ivanovic
>


-- 
Regards
Danka Ivanovic
Apr 25 16:39:50 [4211] master   cib: debug: crm_client_new: Connecting 0x55d8444e8e80 for uid=0 gid=0 pid=5791 id=c93d535d-77d8-4556-9a63-d9a1c2b45de9
Apr 25 16:39:50 [4211] master   cib: debug: handle_new_connection: IPC credentials authenticated (4211-5791-13)
Apr 25 16:39:50 [4211] master   cib: debug: qb_ipcs_shm_connect: connecting to client [5791]
Apr 25 16:39:50 [4211] master   cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master   cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master   cib: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] master   cib: debug: cib_acl_enabled: CIB ACL is disabled
Apr 25 16:39:50 [4211] master   cib: debug: qb_ipcs_dispatch_connection_request: HUP conn (4211-5791-13)
Apr 25 16:39:50 [4211] master   cib: debug: qb_ipcs_disconnect: qb_ipcs_disconnect(4211-5791-13) state:2
Apr 25 16:39:50 [4211] master   cib: debug: crm_client_destroy: Destroying 0 events
Apr 25 16:39:50 [4211] master   cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-4211-5791-13-header
Apr 25 16:39:50 [4211] master   cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-4211-5791-13-header
Apr 25 16:39:50 [4211] master   cib: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-4211-5791-13-header
Apr 25 16:39:50 [15544] master corosync debug   [QB] IPC credentials authenticated (15544-5837-24)
Apr 25 16:39:50 [15544] master corosync debug   [QB] connecting to client [5837]
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589; real_size:1052672; rb->word_size:263168
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589; real_size:1052672; rb->word_size:263168
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589; real_size:1052672; 

Re: [ClusterLabs] Pacemaker detail log directory permissions

2019-04-25 Thread Jan Pokorný
On 24/04/19 09:32 -0500, Ken Gaillot wrote:
> On Wed, 2019-04-24 at 16:08 +0200, wf...@niif.hu wrote:
>> Make install creates /var/log/pacemaker with mode 0770, owned by
>> hacluster:haclient.  However, if I create the directory as root:root
>> instead, pacemaker.log appears as hacluster:haclient all the
>> same.  What breaks in this setup besides log rotation (which can be
>> fixed by removing the su directive)?  Why is it a good idea to let
>> the haclient group write the logs?
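>> (The su directive here is logrotate's; an illustrative stanza, with the
>> rotation policy made up:
>>
>>   /var/log/pacemaker/pacemaker.log {
>>       su hacluster haclient
>>       weekly
>>       rotate 4
>>       compress
>>       missingok
>>   }
>> )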
> 
> Cluster administrators are added to the haclient group. It's a minor
> use case, but the group write permission allows such users to run
> commands that log to the detail log. An example would be running
> "crm_resource --force-start" for a resource agent that writes debug
> information to the log.
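
(Concretely, that is an invocation along the lines of the sketch below, run
by an unprivileged user in the haclient group; the resource name is made up:

  crm_resource --resource test-rsc --force-start

The agent's debug output then lands in the detail log.)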

I think the first and foremost use case is that half of the actual
pacemaker daemons run as hacluster:haclient themselves, and it's
preferable that they not be completely mute about what they do,
correct? :-)

Indeed, users can configure whatever log routing they desire.
(I was actually toying with the idea of making it a lot more flexible:
a log per type of daemon, perhaps even distinguished by PID, and
configurable log formats, since it's arguably heavy overkill to keep
restating the hostname over and over without ever bothering to recheck
it from time to time, etc.)

Also note that relying on almighty root privileges (as the pristine
deployment does) is a silent assumption that cannot be fully taken for
granted, so again arguably, even the root daemons should put the
haclient group's coat on over their own, just in case [*].

> If ACLs are not in use, such users already have full read/write
> access to the CIB, so being able to read and write the log is not an
> additional concern.
> 
> With ACLs, I could see wanting to change the permissions, and that idea
> has come up already. One approach might be to add a PCMK_log_mode
> option that would default to 0660, and users could make it more strict
> if desired.

It looks reasonable to prevent read-backs by anyone but root; that
could be applied without any further toggles, assuming the pacemaker
code won't automatically and unconditionally flip the purposely
allowed group read bits back on.
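
For concreteness, the current default and Ken's proposed default boil down
to something like this sketch (modes from the thread, commands illustrative):

  # what "make install" sets up today:
  install -d -m 0770 -o hacluster -g haclient /var/log/pacemaker
  # the proposed PCMK_log_mode default of 0660, applied to the detail log:
  chmod 0660 /var/log/pacemaker/pacemaker.log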


[*] for instance when SELinux hits hard (which is currently not the
case for Fedora/EL family), even though the executor(s) would need
to be exempted if process inheritance taints the tree once forever:
https://danwalsh.livejournal.com/69478.html

-- 
Jan (Poki)



Re: [ClusterLabs] Warning (SLES 12 SP4): ocf:heartbeat:CTDB does not work any more

2019-04-25 Thread Alex Crow

On 25/04/2019 14:35, Ulrich Windl wrote:

Hi!

I managed to get my cluster up again after upgrading from SLES11 SP4 to SLES12 
SP4, but my CTDB Samba won't start any more. The problem is:
CTDB(prm_s02_ctdb)[30904]: ERROR: Failed to execute /usr/sbin/ctdbd.
lrmd[27341]:   notice: prm_s02_ctdb_start_0:30857:stderr [ Invalid option 
--logfile=/var/log/ctdb/log.ctdb: unknown option ]

That option comes from /usr/lib/ocf/resource.d/heartbeat/CTDB:
: ${OCF_RESKEY_ctdb_logfile:=/var/log/ctdb/log.ctdb}

log_option="--logging=file:$OCF_RESKEY_ctdb_logfile"

So the RA and the binary don't match! The binary seems to lack a --logging 
option.
ctdb-4.6.16+git.133.479a9537a28-3.35.4.x86_64
resource-agents-4.1.9+git24.9b664917-3.3.3.x86_64

Regards,
Ulrich
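
(A quick way to confirm the mismatch is to ask the installed daemon which
logging option it actually understands, e.g.:

  /usr/sbin/ctdbd --help 2>&1 | grep -E -- '--log(ging|file)'
)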





Ulrich,

There have been some, IMHO, silly changes in CTDB lately. My least 
favourite is changing the output of "ctdb scriptstatus" to fixed-width 
columns instead of machine-readable colon-separated data, while leaving 
a completely non-functional option to supply a user-provided delimiter, 
thus completely breaking the nagios monitoring plugin. I complained, and 
was simply told it was "as designed". I felt like replying, "so, stupidly 
designed then".


Cheers

Alex


