Launchpad has imported 19 comments from the remote bug at
https://bugs.openldap.org/show_bug.cgi?id=8650.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2017-05-06T22:32:26+00:00 Ryan Tandy wrote:

Full_Name: Ryan Tandy
Version: RE24
OS: Debian
URL: 
Submission from: (NULL) (24.68.41.160)
Submitted by: ryan


https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861838

That bug's submitter seems to have unintentionally configured their slapd with
the entire list of system CAs. They're fixing it, but we have a bug here too.

When the ServerHello is larger than 16kb, gnutls_handshake can return
GNUTLS_E_AGAIN. In theory this was always possible, but I'm only seeing it
happen with gnutls 3.x and haven't found the exact change responsible.

We need to loop gnutls_handshake until it completes, like we do already in the
re-handshake case.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/0

------------------------------------------------------------------------
On 2017-05-06T22:52:25+00:00 Ryan Tandy wrote:

changed notes
changed state Open to Test
moved from Incoming to Software Bugs

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/1

------------------------------------------------------------------------
On 2017-05-06T22:58:54+00:00 Ryan Tandy wrote:

Committed the fix, and pinged the submitter to test it.


Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/2

------------------------------------------------------------------------
On 2018-02-09T17:22:50+00:00 Quanah-x wrote:

changed notes
changed state Test to Release

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/3

------------------------------------------------------------------------
On 2018-03-22T19:25:42+00:00 Quanah-x wrote:

changed notes
changed state Release to Closed

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/4

------------------------------------------------------------------------
On 2018-08-03T15:19:06+00:00 Kartik Subbarao wrote:

Hi Ryan,

I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04, 
which seems to be a side effect of the ITS#8650 patch:

https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48

slapd will run fine for a while, but during some periods of 
high-traffic, it'll hang. It'll peg the CPU at 100% and won't respond to 
any new LDAP connections. After some time, it'll resume working again, 
but overall it's fairly unreliable.

strace on slapd during the hang shows that it's constantly making read() 
calls that return EAGAIN. After doing a gdb stack trace on slapd, I 
realized that these read() calls are happening as part of the busywait 
for loop in tlsg_session_accept() that repeatedly calls 
gnutls_handshake() when it gets EAGAIN. When slapd recovers from this 
hang state, the first message it prints is a TLS negotiation failure 
error on the culprit file descriptor.

If I back out the ITS#8650 patch, the problem goes away. If I insert 
sleep(1) in the for loop, slapd no longer pegs the CPU at 100%, but it 
still becomes unresponsive during these high-traffic periods.

I don't know what's happening during these high-traffic periods that 
causes the TLS negotiation to go astray. Unfortunately it's not easy to 
reproduce this problem outside of this production environment, given the 
diversity of clients running different OSes with various versions of SSL 
libraries.

I'm wondering if there is a better way to handle EAGAIN returned from 
gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my 
simplistic attempt at inserting a sleep() call which doesn't really seem 
to help. I'm wondering how the GnuTLS developers intend for people to 
use gnutls_handshake() properly, so as to gracefully handle sessions 
that involve long packets on the one hand, without opening up a 
vulnerability to chew up lots of system resources on the other hand.

Regards,

     -Kartik



Reply at: 
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/5

------------------------------------------------------------------------
On 2018-08-03T16:09:47+00:00 Ryan Tandy wrote:

Hi Kartik,

On Fri, Aug 03, 2018 at 11:19:06AM -0400, Kartik Subbarao wrote:
>I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04, 
>which seems to be a side effect of the ITS#8650 patch:
>
>https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590a0fa6e48
>
>slapd will run fine for a while, but during some periods of 
>high-traffic, it'll hang. It'll peg the CPU at 100% and won't respond 
>to any new LDAP connections. After some time, it'll resume working 
>again, but overall it's fairly unreliable.

Thanks for letting me know about this. This patch is running on quite a 
few systems by now, I'm sorry the problem wasn't caught sooner. :/

>I'm wondering if there is a better way to handle EAGAIN returned from 
>gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my 
>simplistic attempt at inserting a sleep() call which doesn't really 
>seem to help. I'm wondering how the GnuTLS developers intend for 
>people to use gnutls_handshake() properly, so as to gracefully handle 
>sessions that involve long packets on the one hand, without opening up 
>a vulnerability to chew up lots of system resources on the other hand.

Right. I mean, this is how GnuTLS' own example shows to do it:

https://gitlab.com/gnutls/gnutls/blob/master/doc/examples/ex-client-dtls.c#L73-77

We could place a limit on the number of iterations, though any such 
limit would have to be arbitrary.
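
A minimal sketch of such an iteration cap. It assumes a hypothetical
try_handshake callback standing in for gnutls_handshake(), and the HS_*
result codes are illustrative names, not GnuTLS or OpenLDAP identifiers.
The cap passed by the caller is arbitrary, which is the concern raised
above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, not GnuTLS or OpenLDAP identifiers. */
enum { HS_OK = 0, HS_AGAIN = -1, HS_FATAL = -2, HS_GAVE_UP = -3 };

/* Hypothetical callback standing in for one gnutls_handshake() call. */
typedef int (*handshake_fn)(void *ctx);

/* Retry while the callback reports HS_AGAIN, at most max_tries times. */
static int handshake_bounded(handshake_fn try_handshake, void *ctx,
                             int max_tries)
{
    int i;

    assert(try_handshake != NULL && max_tries > 0);
    for (i = 0; i < max_tries; i++) {
        int rc = try_handshake(ctx);
        if (rc != HS_AGAIN)
            return rc;      /* success or fatal error: stop retrying */
    }
    return HS_GAVE_UP;      /* still incomplete after max_tries calls */
}

/* Example callback: reports HS_AGAIN until the counter reaches zero. */
static int countdown_handshake(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return HS_AGAIN;
    }
    return HS_OK;
}
```

A cap like this bounds the damage but still spins on the CPU while it
retries, so it does not remove the busy-wait itself.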

There might be an asynchronous GnuTLS API that could be used to avoid 
tying up slapd while this is going on.
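
One way to avoid spinning, sketched here under assumptions: sleep in
poll() until the socket is ready in the direction the handshake is
waiting on, then call the handshake again. GnuTLS documents
gnutls_record_get_direction() as reporting 0 when the interrupted call
was reading and 1 when it was writing; the want_write parameter below is
assumed to carry that value so the sketch has no GnuTLS dependency. The
helper names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: block in poll() until fd is ready in the wanted
 * direction or timeout_ms elapses.
 * Returns 1 if ready (including error/hang-up conditions, which the next
 * I/O call will report), 0 on timeout, -1 if poll() itself failed. */
static int wait_for_handshake_io(int fd, int want_write, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = want_write ? POLLOUT : POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return rc == 0 ? 0 : 1;
}

/* Usage sketch with a pipe standing in for the TLS socket.
 * Returns 0 if every step behaves as the comments describe. */
static int demo_wait_on_pipe(void)
{
    int p[2];
    int ok;

    if (pipe(p) == -1)
        return -1;

    ok = wait_for_handshake_io(p[0], 0, 0) == 0     /* nothing to read yet */
      && wait_for_handshake_io(p[1], 1, 0) == 1     /* room to write */
      && write(p[1], "x", 1) == 1
      && wait_for_handshake_io(p[0], 0, 100) == 1;  /* data has arrived */

    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

In a server with its own event loop, the equivalent step would be to
return to that loop and retry the handshake when the descriptor next
becomes ready, rather than blocking a thread in poll().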

I will look at how some other GnuTLS servers deal with this...


Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/6

------------------------------------------------------------------------
On 2018-08-03T18:35:02+00:00 Kartik Subbarao wrote:

On 08/03/2018 12:09 PM, Ryan Tandy wrote:
> Thanks for letting me know about this. This patch is running on quite 
> a few systems by now, I'm sorry the problem wasn't caught sooner. :/

No worries, thanks for responding so quickly on this!

>> I'm wondering if there is a better way to handle EAGAIN returned from 
>> gnutls_handshake(), instead of doing a busywait as in ITS#8650, or my 
>> simplistic attempt at inserting a sleep() call which doesn't really 
>> seem to help. I'm wondering how the GnuTLS developers intend for 
>> people to use gnutls_handshake() properly, so as to gracefully handle 
>> sessions that involve long packets on the one hand, without opening 
>> up a vulnerability to chew up lots of system resources on the other 
>> hand.
>
> Right. I mean, this is how GnuTLS' own example shows to do it:
>
> https://gitlab.com/gnutls/gnutls/blob/master/doc/examples/ex-client-dtls.c#L73-77
>  
>

Hmm, that's a head-scratcher. It doesn't seem very effective to have a 
non-blocking I/O interface and then recommend wrapping it in a busywait 
loop :-)

> We could place a limit on the number of iterations, though any such 
> limit would have to be arbitrary.
>
> There might be an asynchronous GnuTLS API that could be used to avoid 
> tying up slapd while this is going on.
>
> I will look at how some other GnuTLS servers deal with this...

Cool, thanks Ryan.

Regards,

     -Kartik



Reply at: 
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/7

------------------------------------------------------------------------
On 2018-08-26T19:07:20+00:00 Ryan Tandy wrote:

Hi Kartik,

Sorry about the lack of updates on this one.

It looks clear that my patch for this ITS was wrong and needs to be 
reverted.

Looking again at the original issue, after reverting the patch, I've 
found that the behaviour varies with GnuTLS version. I need to figure 
out why this is, which probably means spending some time bisecting 
GnuTLS changes.

If I create a patch to log some additional debug info about the GnuTLS 
setup, would you be willing to run it in your environment? For example 
I'm curious whether EAGAIN is being returned from the read side or write 
side (guessing read, but would be nice to confirm).

thanks
Ryan


Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/8

------------------------------------------------------------------------
On 2018-08-27T17:22:12+00:00 Kartik Subbarao wrote:

On 08/26/2018 03:07 PM, Ryan Tandy wrote:
> If I create a patch to log some additional debug info about the GnuTLS 
> setup, would you be willing to run it in your environment? For example 
> I'm curious whether EAGAIN is being returned from the read side or 
> write side (guessing read, but would be nice to confirm).

Sure, send me the patch. If the patch just does some additional GnuTLS 
logging, then I don't see a problem with running it in the production 
environment and sending you the results.

Regards,

     -Kartik


Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/9

------------------------------------------------------------------------
On 2018-09-06T15:49:51+00:00 O-bugs wrote:

fixed in master
fixed in RE24 (2.4.46)
Needs to be fixed further

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/10

------------------------------------------------------------------------
On 2018-09-06T15:49:51+00:00 Quanah-x wrote:

changed notes
changed state Closed to Open

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/11

------------------------------------------------------------------------
On 2018-09-19T05:55:50+00:00 Ryan Tandy wrote:

Made some good progress on this one this evening.

The original issue this ITS is about is that gnutls_handshake() can, in 
some versions of GnuTLS, return GNUTLS_E_AGAIN even when the socket is 
blocking. Specifically, this happens in the case I described with a 
large CA list sent by the server.

For slapd, the patch I committed is unfortunately completely wrong. It 
has been using non-blocking sockets forever, EAGAIN is expected and 
handled robustly -- or it was, until I introduced the busy-loop.

For clients I'm still working on figuring out the right path forward. 
There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS 
which itself is behind LDAP_DEVEL. However this code is meant for 
non-blocking sockets, and in my case it ends up stuck in poll() waiting 
for a notification that never arrives. In 2.4, ret == 1 simply falls 
into the success case and proceeds to send data without completing the 
handshake first.

It's possible that what I actually want here is a (ret > 0) case in 
ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and 
ldap_int_tls_connect returns 1. (I'd also need to adapt the non-blocking 
path to be able to handle a blocking socket as well.)

But it's also possible that gnutls_handshake() returning GNUTLS_E_AGAIN 
with a blocking socket is simply a GnuTLS bug that was introduced at 
some point. I still need to determine exactly when and why its behaviour 
changed. (It is still happening with 3.5.19.)

In any case, my patch has to be reverted, as its impact (making slapd 
busy-loop) is obviously worse than the status quo (misbehaving clients 
in a specific case). I have pushed that revert now and will continue 
digging as time permits.


Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/12

------------------------------------------------------------------------
On 2018-10-05T05:33:21+00:00 Ryan Tandy wrote:

On Tue, Sep 18, 2018 at 10:55:50PM -0700, Ryan Tandy wrote:
>There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS 
>which itself is behind LDAP_DEVEL. However this code is meant for 
>non-blocking sockets, and in my case it ends up stuck in poll() 
>waiting for a notification that never arrives.

This turned out to be because the fd and timeout fields are only 
initialized when timeout is configured and the non-blocking behaviour 
was triggered because of it. Otherwise the code simply doesn't 
anticipate EAGAIN could be returned and the behaviour is more or less 
undefined; it ends up calling poll() with fd = -1 and a garbage timeout.
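
A small sketch of why that call never sees a notification: POSIX says
poll() ignores any entry whose fd is negative, so the call can only end
by timing out, and a negative timeout means waiting indefinitely. Both
function names are hypothetical, and the guard is only one possible way
to catch the uninitialized case.

```c
#include <poll.h>

/* Hypothetical guard: refuse to poll when the descriptor was never set. */
static int poll_read_checked(int fd, int timeout_ms)
{
    struct pollfd pfd;

    if (fd < 0)
        return -1;          /* caller never initialized fd */

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms);
}

/* The unguarded case: the entry with fd = -1 is ignored, so poll() has
 * nothing to watch and returns 0 once timeout_ms has elapsed. */
static int poll_negative_fd(int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = -1;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms);
}
```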

>It's possible that what I actually want here is a (ret > 0) case in 
>ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and 
>ldap_int_tls_connect returns 1. (I'd also need to adapt the 
>non-blocking path to be able to handle a blocking socket as well.)

More precisely, what I actually want is a (ret > 0) case that is used 
unless both USE_NON_BLOCKING_TLS is true and a timeout is configured.

If we added to ber_sockbuf_ctrl() the ability to query whether the 
socket is non-blocking, for GnuTLS at least we could bypass poll() and 
go straight back into ldap_int_tls_connect(). However I don't see a lot 
of benefit to this as long as calling poll() on a blocking socket has no 
downside.
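
For reference, the underlying check is a single fcntl() call on POSIX
systems. The sketch below is a standalone helper with a hypothetical
name; it does not show how such a query would be wired into
ber_sockbuf_ctrl().

```c
#include <fcntl.h>

/* Hypothetical helper: 1 if fd is non-blocking, 0 if it is blocking,
 * -1 if the flags could not be read (for example, an invalid fd). */
static int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```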


Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/13

------------------------------------------------------------------------
On 2020-04-12T18:58:42+00:00 Ryan Tandy wrote:

*** Issue 9210 has been marked as a duplicate of this issue. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/14

------------------------------------------------------------------------
On 2020-04-12T19:04:47+00:00 Ryan Tandy wrote:

The other way we can get a non-blocking socket is if the client set one
up itself and gave it to us via ldap_init_fd(). sssd does this, or used
to: bug 9210.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/15

------------------------------------------------------------------------
On 2020-04-12T20:46:44+00:00 Ryan Tandy wrote:

*** Issue 9210 has been marked as a duplicate of this issue. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/16

------------------------------------------------------------------------
On 2020-04-12T21:02:40+00:00 Ryan Tandy wrote:

Created attachment 708
test program with non-blocking socket

Here's a test program that exercises the scenario with a non-blocking
socket, similar to the case described in bug 9210. Currently it fails on
2.4 with LDAP_SERVER_DOWN and on 2.5 with LDAP_TIMEOUT, but succeeds if
you comment out the fcntl(). Any patch needs to correct that as well as
the scenario described here with a blocking socket.
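
The attachment is not reproduced here. As a rough illustration of what
the fcntl() call changes, the sketch below puts one end of a local
socket pair into non-blocking mode and shows that a read with no data
pending fails with EAGAIN or EWOULDBLOCK instead of waiting. It uses no
libldap calls, and both function names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Switch fd to non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Returns 1 if a read on an idle non-blocking socket fails with
 * EAGAIN/EWOULDBLOCK, 0 if it does anything else, -1 on setup failure. */
static int idle_read_would_block(void)
{
    int sv[2];
    char byte;
    ssize_t n;
    int would_block;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (set_nonblocking(sv[0]) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    n = read(sv[0], &byte, 1);      /* the peer has sent nothing yet */
    would_block = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(sv[0]);
    close(sv[1]);
    return would_block;
}
```

A TLS handshake running over such a socket sees the same condition
whenever the peer's next flight has not arrived yet, which is why the
caller has to wait and retry rather than treat it as a failure.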

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/17

------------------------------------------------------------------------
On 2020-04-12T23:50:10+00:00 Howard Chu wrote:

Commits: 
  • 735e1ab 
by Howard Chu at 2020-04-12T22:18:51+00:00 
ITS#8650 loop on incomplete TLS handshake

Reply at:
https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/1921562/comments/18


** Changed in: openldap
       Status: Unknown => Fix Released

** Changed in: openldap
   Importance: Unknown => Medium

** Bug watch added: Debian Bug tracker #861838
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861838

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to openldap in Ubuntu.
https://bugs.launchpad.net/bugs/1921562

Title:
  Intermittent hangs during ldap_search_ext when TLS enabled

Status in openldap:
  Fix Released
Status in openldap package in Ubuntu:
  Confirmed

Bug description:
  When connecting to an LDAP server with TLS, ldap_search_ext can hang
  if during the initial TLS handshake a signal is received by the
  process. The cause of this bug is the same as
  https://bugs.openldap.org/show_bug.cgi?id=8650 which was fixed in
  https://git.openldap.org/openldap/openldap/-/commit/735e1ab and was
  released as part of version 2.4.50. This bug affects Ubuntu 20.04 LTS
  and potentially earlier Ubuntu releases. Later Ubuntu releases use an
  openldap version that is at least 2.4.50 and are therefore not
  affected.

  In our case this bug causes failures in the SSSD LDAP backend at least
  once per day, resulting in authentication errors followed by a sssd_be
  restart after a timeout has been hit:

  Mar 19 19:05:31 mail auth[867454]: pam_sss(dovecot:auth): received for user redacted: 4 (System error)
  Mar 19 19:05:32 mail sssd_be[867455]: Starting up

  A reduced version of the patch linked above can be found attached to
  this bug report. This patch has been applied to version
  2.4.49+dfsg-2ubuntu1.7 and has been running in production for
  approximately a week, and the issue has not recurred. No other issues
  have appeared during this period.

  As this bug affects all systems using LDAP with TLS, I suggest that
  the fix for this bug is ported to Ubuntu 20.04 LTS and potentially
  earlier versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/openldap/+bug/1921562/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
