Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2016-04-15 Thread Adi Kriegisch
Hi!

> I'm not able to reproduce the bug under current sid.
It is even fixed with the recent update in Jessie! :-) YEAH! Thanks for your support!
 
> As ctdb in jessie was in a different repository than samba, I suspect an
> API incompatibility.
Actually, I am not quite sure that this really is an API incompatibility; from
what I found out, the issue would have been fixed by the update to ctdb 2.5.6,
which includes a lot of fixes in general.

For the record: when trying to run ctdb under gdb, the issue did not occur
-- but ctdb was painfully slow. Next I tried to read the messages on the
socket, like this:
  | mv /var/run/ctdb/ctdbd.socket /var/run/ctdb/ctdbd.socket-orig
  | socat -t100 -x -v \
  |   UNIX-LISTEN:/var/run/ctdb/ctdbd.socket,mode=777,reuseaddr,fork \
  |   UNIX-CONNECT:/var/run/ctdb/ctdbd.socket-orig
  | mv /var/run/ctdb/ctdbd.socket-orig /var/run/ctdb/ctdbd.socket
That just slowed down ctdb a little, but everything worked like a charm. So
I suspect some kind of race condition is the root cause of the issue.
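The same interposition can also be done with a small script, which makes it easy to add logging or artificial delays on individual messages when hunting a suspected race. Here is a rough Python sketch of such a pass-through Unix-socket proxy (not tested against a live cluster; the paths in the comment are the ones from the socat example):

```python
import socket
import threading

def pump(src, dst):
    # Copy bytes in one direction until EOF, then propagate the EOF.
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass

def proxy(listen_path, upstream_path):
    # Accept clients on listen_path and relay each one to upstream_path,
    # one pump thread per direction.
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(listen_path)
    listener.listen(16)
    while True:
        client, _ = listener.accept()
        upstream = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        upstream.connect(upstream_path)
        threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pump, args=(upstream, client), daemon=True).start()

# Interposing on the ctdb socket (paths as in the socat example above):
# proxy("/var/run/ctdb/ctdbd.socket", "/var/run/ctdb/ctdbd.socket-orig")
```

Adding a time.sleep() or a hexdump inside pump() is where one would probe the timing behaviour.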
 
> I'm tempted to mark this as fixed under sid, but can you set up a sid
> box and test yourself with a similar config?
You may even mark this as fixed in jessie with version 4.2.10+...

Thank you very much for your help!

-- Adi


signature.asc
Description: Digital signature


Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2016-04-02 Thread Mathieu Parent
Hello Adi,

I'm not able to reproduce the bug under current sid.

As ctdb in jessie was in another repository than samba, I suspect an
API incompatibility.

I'm tempted to mark this as fixed under sid, but can you set up a sid
box and test yourself with a similar config?

Regards

Mathieu Parent



Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2015-11-04 Thread Adi Kriegisch
Hi!

Thanks for getting back to me! :)

> > I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
> > and glusterfs from backports) to Jessie. The cluster itself is way older
> > and basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
> > (almost always) just hangs the whole cluster; I need to interrupt the call
> > with Ctrl+C (or run it with 'timeout 2') to avoid a complete cluster lockup
> > that leads to the other cluster nodes being banned and to ctdbd on the node
> > I run smbstatus on spinning at 100% load, unable to recover.
> 
> How do you recover then? KILL-ing ctdbd?
Killing ctdbd on the loaded node is the easiest; manually unbanning the
other nodes is still required afterwards. Combinations of enabling and
disabling nodes may fix the situation too.

> > Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.
> 
> Have you checked which of --processes or --notify hangs? Does it hang
> with "-b --fast"?
Ah, I missed that: '--brief --fast' works just fine. So obviously the
validation does not work...

> > 'strace'ing ctdbd leads to a massive amount of these messages:
> >   | write(58, "\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1184) = -1 EAGAIN (Resource temporarily unavailable)
> 
> fd 58 is probably the ctdb socket. Can you confirm?
Right.

> To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
> and send the stack trace of ctdbd at the write?
Ok, I will report back the stack traces in a few days (I'm afraid I can
only do these during the weekend).

All the best,
Adi




Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2015-11-01 Thread Mathieu Parent
2015-10-13 15:44 GMT+02:00 Adi Kriegisch:
> Package: ctdb
> Version: 2.5.4+debian0-4
>
> Dear maintainers,

Hello Adi,

Sorry for my late reply.

> I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
> and glusterfs from backports) to Jessie. The cluster itself is way older
> and basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
> (almost always) just hangs the whole cluster; I need to interrupt the call
> with Ctrl+C (or run it with 'timeout 2') to avoid a complete cluster lockup
> that leads to the other cluster nodes being banned and to ctdbd on the node
> I run smbstatus on spinning at 100% load, unable to recover.

How do you recover then? KILL-ing ctdbd?

> The cluster itself consists of three nodes sharing three cluster ips. The
> only service ctdb manages is Samba. The lock file is located on a mirrored
> glusterfs volume.
>
> Running and interrupting the hanging smbstatus leads to the following log
> messages in /var/log/ctdb/log.ctdb:
>   | 2015/10/13 15:09:24.923002 [19378]: Starting traverse on DB
>   |  smbXsrv_session_global.tdb (id 2592646)
>   | 2015/10/13 15:09:25.505302 [19378]: server/ctdb_traverse.c:644 Traverse
>   |  cancelled by client disconnect for database:0x6b06a26d
>   | 2015/10/13 15:09:25.505492 [19378]: Could not find idr:2592646
>   | [...]
>   | 2015/10/13 15:09:25.507553 [19378]: Could not find idr:2592646
>
> 'ctdb getdbmap' lists that database, but also lists a second entry for
> smbXsrv_session_global.tdb:
>   | dbid:0x521b7544 name:smbXsrv_version_global.tdb path:/var/lib/ctdb/smbXsrv_version_global.tdb.0
>   | dbid:0x6b06a26d name:smbXsrv_session_global.tdb path:/var/lib/ctdb/smbXsrv_session_global.tdb.0
> (I have no idea if that has always been the case or if that happened after
> the upgrade).
>
> Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.

Have you checked which of --processes or --notify hangs? Does it hang
with "-b --fast"?


> 'strace'ing ctdbd leads to a massive amount of these messages:
>   | write(58, "\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1184) = -1 EAGAIN (Resource temporarily unavailable)

fd 58 is probably the ctdb socket. Can you confirm?

To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
and send the stack trace of ctdbd at the write?

> Running 'ctdb_diagnostics' is only possible shortly after the cluster is
> started (i.e. while 'smbstatus -b' still works) and yields the following messages:
>   | ERROR[1]: /etc/krb5.conf is missing on node 0
>   | ERROR[2]: File /etc/hosts is different on node 1
>   | ERROR[3]: File /etc/hosts is different on node 2
>   | ERROR[4]: File /etc/samba/smb.conf is different on node 1
>   | ERROR[5]: File /etc/samba/smb.conf is different on node 2
>   | ERROR[6]: File /etc/fstab is different on node 1
>   | ERROR[7]: File /etc/fstab is different on node 2
>   | ERROR[8]: /etc/multipath.conf is missing on node 0
>   | ERROR[9]: /etc/pam.d/system-auth is missing on node 0
>   | ERROR[10]: /etc/default/nfs is missing on node 0
>   | ERROR[11]: /etc/exports is missing on node 0
>   | ERROR[12]: /etc/vsftpd/vsftpd.conf is missing on node 0
>   | ERROR[13]: Optional file /etc/ctdb/static-routes is not present on node 0
> '/etc/hosts' differs in some newlines and comments while 'smb.conf' only
> has some different log levels on the nodes. The rest of the messages does
> not affect ctdb as it only manages samba.

Yes. Nothing relevant here.

> Feel free to ask if you need any more information.

Regards


-- 
Mathieu



Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2015-10-13 Thread Adi Kriegisch
Package: ctdb
Version: 2.5.4+debian0-4

Dear maintainers,

I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
and glusterfs from backports) to Jessie. The cluster itself is way older
and basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
(almost always) just hangs the whole cluster; I need to interrupt the call
with Ctrl+C (or run it with 'timeout 2') to avoid a complete cluster lockup
that leads to the other cluster nodes being banned and to ctdbd on the node
I run smbstatus on spinning at 100% load, unable to recover.

The cluster itself consists of three nodes sharing three cluster ips. The
only service ctdb manages is Samba. The lock file is located on a mirrored
glusterfs volume.

Running and interrupting the hanging smbstatus leads to the following log
messages in /var/log/ctdb/log.ctdb:
  | 2015/10/13 15:09:24.923002 [19378]: Starting traverse on DB
  |  smbXsrv_session_global.tdb (id 2592646)
  | 2015/10/13 15:09:25.505302 [19378]: server/ctdb_traverse.c:644 Traverse
  |  cancelled by client disconnect for database:0x6b06a26d
  | 2015/10/13 15:09:25.505492 [19378]: Could not find idr:2592646
  | [...]
  | 2015/10/13 15:09:25.507553 [19378]: Could not find idr:2592646

'ctdb getdbmap' lists that database, but also lists a second entry for
smbXsrv_session_global.tdb:
  | dbid:0x521b7544 name:smbXsrv_version_global.tdb path:/var/lib/ctdb/smbXsrv_version_global.tdb.0
  | dbid:0x6b06a26d name:smbXsrv_session_global.tdb path:/var/lib/ctdb/smbXsrv_session_global.tdb.0
(I have no idea if that has always been the case or if that happened after
the upgrade).

Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.
'strace'ing ctdbd leads to a massive amount of these messages:
  | write(58, "\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1184) = -1 EAGAIN (Resource temporarily unavailable)
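For context, EAGAIN is simply what write(2) returns on a non-blocking socket whose kernel buffers are full because the peer is not reading. A minimal, self-contained Python illustration of that condition (just an analogy, not ctdb's actual code):

```python
import errno
import socket

# A connected pair of Unix sockets; make one end non-blocking and never
# read from the peer, so the kernel send/receive buffers eventually fill.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.setblocking(False)

sent = 0
err = None
try:
    while True:
        sent += a.send(b"\0" * 4096)
except BlockingIOError as e:  # Python's wrapper for EAGAIN/EWOULDBLOCK
    err = e.errno

print("send() failed with EAGAIN after buffering", sent, "bytes")
```

In ctdbd's case the fd is its client socket, i.e. the smbstatus side stopped draining the connection.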

Running 'ctdb_diagnostics' is only possible shortly after the cluster is
started (i.e. while 'smbstatus -b' still works) and yields the following messages:
  | ERROR[1]: /etc/krb5.conf is missing on node 0
  | ERROR[2]: File /etc/hosts is different on node 1
  | ERROR[3]: File /etc/hosts is different on node 2
  | ERROR[4]: File /etc/samba/smb.conf is different on node 1
  | ERROR[5]: File /etc/samba/smb.conf is different on node 2
  | ERROR[6]: File /etc/fstab is different on node 1
  | ERROR[7]: File /etc/fstab is different on node 2
  | ERROR[8]: /etc/multipath.conf is missing on node 0
  | ERROR[9]: /etc/pam.d/system-auth is missing on node 0
  | ERROR[10]: /etc/default/nfs is missing on node 0
  | ERROR[11]: /etc/exports is missing on node 0
  | ERROR[12]: /etc/vsftpd/vsftpd.conf is missing on node 0
  | ERROR[13]: Optional file /etc/ctdb/static-routes is not present on node 0
'/etc/hosts' differs in some newlines and comments while 'smb.conf' only
has some different log levels on the nodes. The rest of the messages does
not affect ctdb as it only manages samba.

Feel free to ask if you need any more information.

-- Adi

