** Description changed:
[ Impact ]
* This is a common problem with security scanners: malformed requests
kill all worker threads and the server becomes unresponsive.
- - The ceph dashboard on reef is susceptible to this issue.
+ - The ceph dashboard on quincy and squid is susceptible to this issue.
* An attacker could use the same technique to DoS the server.
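The failure mode can be illustrated with a minimal sketch (my own toy
analogy, NOT cheroot's actual code): a fixed pool of worker threads in
which an uncaught exception in the request handler permanently kills a
worker, so a handful of malformed requests exhausts the pool.

```python
import queue
import threading
import time

# A toy fixed-size worker pool (an analogy, not cheroot's code).
tasks = queue.Queue()

def handle(conn):
    # Stand-in for TLS processing: a malformed request raises an
    # exception that nothing above the handler catches.
    if conn == "malformed-tls":
        raise RuntimeError("handshake failed")  # stand-in for ssl.SSLError

def worker():
    while True:
        conn = tasks.get()
        handle(conn)  # an uncaught exception here ends this thread

workers = [threading.Thread(target=worker, daemon=True) for _ in range(2)]
for w in workers:
    w.start()

# Two malformed requests are enough to kill the whole pool...
tasks.put("malformed-tls")
tasks.put("malformed-tls")
time.sleep(0.5)
print(sum(w.is_alive() for w in workers))  # 0 -- no workers left
# ...after which legitimate requests queue up forever, unserved.
```

This is why the server stays up (the main thread is alive) but stops
answering requests.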
[ Test Plan ]
I reproduced the issue against both a minimal cheroot server and the
ceph dashboard.
In both cases, I used tlsfuzzer [1] to reproduce the bug, by running
`scripts/test-tls13-ccs.py -h <IP> -p <PORT>`.
For cheroot
===========
I did the following in an lxd container.
1. Create a minimal cheroot server file
server.py
---------
from cheroot.wsgi import Server as WSGIServer
from cheroot.ssl.builtin import BuiltinSSLAdapter


def app(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return [b"Ok."]


server = WSGIServer(('0.0.0.0', 8443), app)
server.ssl_adapter = BuiltinSSLAdapter(
    certificate='cert.pem',
    private_key='key.pem'
)

if __name__ == '__main__':
    try:
        server.start()
    except KeyboardInterrupt:
        server.stop()
-----------
2. Create a self-signed certificate and key
`openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days
365 -nodes`
3. Start the server: `sudo python3 server.py`
4. Verify that the server responds correctly
`curl -k -I https://<IP>:8443`
Output
---------------
HTTP/1.1 200 OK
Content-type: text/plain
Connection: close
Date: Tue, 10 Mar 2026 15:44:09 GMT
Server: Cheroot/8.5.2
----------------
and check the number of worker threads
`grep -i threads /proc/$(pgrep -f server.py)/status`
Expected Output
---------------
Threads: 11
---------------
5. Run the tls13-ccs script of tlsfuzzer repeatedly until it times out.
`scripts/test-tls13-ccs.py -h <IP> -p 8443`
After several runs you will see:
> AssertionError: Timeout when waiting for peer message
6. Observe that connections to the server now time out (or hang if no
timeout is specified)
`curl -k -I https://<IP>:8443 --max-time 5`
Expected Output
---------------
HTTP/1.1 200 OK
Content-type: text/plain
Connection: close
Date: Tue, 10 Mar 2026 15:44:09 GMT
Server: Cheroot/8.5.2
----------------
Actual Output
------------
curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
-------------
and check the number of threads for server process
`grep -i threads /proc/$(pgrep -f server.py)/status`
Expected Output
---------------
Threads: 11
---------------
Actual Output
---------------
Threads: 1
---------------
Note that all of the worker threads have died.
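The `grep` check above can also be scripted. Here is a small helper (the
function name is my own) that parses the `Threads:` field out of
`/proc/<pid>/status` content:

```python
def thread_count(status_text):
    """Return the value of the Threads field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads field found")

# Against a live process it would be used roughly as:
# with open(f"/proc/{pid}/status") as f:
#     print(thread_count(f.read()))

sample = "Name:\tserver.py\nState:\tS (sleeping)\nThreads:\t11\n"
print(thread_count(sample))  # 11
```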
For ceph-dashboard
==================
1. Deploy a minimal ceph lab on lxd [2]
2. Add ceph-dashboard to the model [3]
3. Note down the <IP> of one of the ceph-mon nodes which also hosts the
dashboard.
4. Verify that the ceph dashboard is reachable at https://<IP>:8443
either in the browser or with curl
`curl -k -I https://<IP>:8443 --max-time 5`
Output
---------------
HTTP/1.1 200 OK
Content-Type: text/html;charset=utf-8
Server: Ceph-Dashboard
Date: Tue, 10 Mar 2026 15:55:47 GMT
Content-Security-Policy: frame-ancestors 'self';
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Language: en-US
Vary: Accept-Language, Accept-Encoding
Cache-Control: no-cache
Last-Modified: Fri, 12 Jul 2024 14:10:44 GMT
Accept-Ranges: bytes
Content-Length: 6466
---------------
5. Run the tls13-ccs script of tlsfuzzer repeatedly until it times out.
`scripts/test-tls13-ccs.py -h <IP> -p 8443`
After several runs you will see:
> AssertionError: Timeout when waiting for peer message
6. The ceph-dashboard is now unreachable from both the browser and curl
`curl -k -I https://<IP>:8443 --max-time 5`
Expected Output
---------------
HTTP/1.1 200 OK
Content-Type: text/html;charset=utf-8
Server: Ceph-Dashboard
Date: Tue, 10 Mar 2026 15:55:47 GMT
Content-Security-Policy: frame-ancestors 'self';
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Language: en-US
Vary: Accept-Language, Accept-Encoding
Cache-Control: no-cache
Last-Modified: Fri, 12 Jul 2024 14:10:44 GMT
Accept-Ranges: bytes
Content-Length: 6466
---------------
Actual Output
-------------
curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
-------------
7. Read the syslog on the mon unit and observe uncaught exceptions in
the cheroot server threads, e.g., `sudo grep "Thread" -A10 /var/log/syslog`
Expected Output: <No Thread Errors>
Actual Output
--------
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]: Exception in thread ('CP Server Thread-11',):
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]: Traceback (most recent call last):
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:   File "/lib/python3/dist-packages/cheroot/server.py", line 1277, in communicate
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:     req.parse_request()
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:   File "/lib/python3/dist-packages/cheroot/server.py", line 706, in parse_request
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:     success = self.read_request_line()
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:   File "/lib/python3/dist-packages/cheroot/server.py", line 747, in read_request_line
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:     request_line = self.rfile.readline()
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:   File "/lib/python3/dist-packages/cheroot/server.py", line 304, in readline
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:     data = self.rfile.readline(256)
Mar 10 15:57:31 juju-73b987-2 ceph-mgr[65287]:   File "/lib/python3.10/_pyio.py", line 582, in readline
---------
Fix Verification
================
As extra verification of the fix in this SRU, we can run the whole
tlsfuzzer suite against the patched server and verify that the number of
worker threads remains the same. This is a good indicator that there are
no obvious additional vulnerabilities that could crash the worker
threads.
1. Perform steps 1-4 of the "For cheroot" test plan above.
2. From the tlsfuzzer directory, run all scripts targeting the server (some
of these will error out because they are missing required arguments, but this
does no harm)
`for s in scripts/test-*.py; do $s -h <IP> -p 8443; done`
NOTE: the server will log many errors, but none should crash the worker
threads
3. Verify that the number of threads is the same as before running the
suite
`grep -i threads /proc/$(pgrep -f server.py)/status`
Expected Output
---------------
Threads: 11
---------------
[ Where problems could occur ]
* Because we are now swallowing errors, there is the potential for
threads to be left in a bad state when they would have previously
crashed.
* The smaller patch does not catch all exception types, so it would
leave some errors of this kind unhandled, although it does fix the
specific incarnation reported here.
[ Other Info ]
* This issue was fixed in upstream version 10.0.1, which is in questing and
above.
* Upstream ceph has fixed this issue in reef and later [4] by bumping the
cheroot dependency to 10.0.1.
* There are two variants of an upstream patch
- One simply catches and logs SSL errors in the threadpool [5]. This
patch was proposed but not merged.
- The other is a more holistic revisiting of the error handling in the
threadpool [6], and is the patch that landed in 10.0.1.
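To illustrate the difference, the minimal variant [5] is roughly shaped
like the following sketch (hypothetical names, a paraphrase rather than
the actual diff): the per-connection work inside a worker's run loop is
wrapped so that an SSL failure is logged instead of propagating and
killing the thread.

```python
import logging
import ssl

log = logging.getLogger("threadpool-sketch")

def worker_loop(get_connection, serve):
    """Hypothetical sketch of a threadpool worker's run loop."""
    while True:
        conn = get_connection()
        if conn is None:  # shutdown sentinel
            return
        try:
            serve(conn)
        except ssl.SSLError:
            # Minimal-patch idea: log the error and keep the worker alive.
            log.exception("SSL error while serving connection")

# One connection whose handshake fails, then shutdown; the worker survives.
conns = iter(["bad-conn", None])

def failing_serve(conn):
    raise ssl.SSLError("handshake failed")

worker_loop(lambda: next(conns), failing_serve)
print("worker survived")
```

The holistic patch [6] that landed in 10.0.1 broadens this handling
beyond SSL errors and restructures error handling across the threadpool.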
* This leaves a few options for the SRU
1. Apply the minimal patch to jammy and noble
2. Apply upstream patch to jammy and noble
3. Apply the minimal patch to jammy and the upstream patch to noble
4. Apply the minimal patch to jammy and bump the noble package to 10.0.1
* Of these, option 3 seems likely to carry the smallest risk of regression
and so is what I have proposed, but I welcome input on the other options.
[1]: https://github.com/tlsfuzzer/tlsfuzzer
[2]: https://ubuntu.com/ceph/docs/tutorial
[3]: https://ubuntu.com/ceph/docs/install-dashboard
[4]: https://github.com/ceph/ceph/pull/57001
[5]: https://github.com/cherrypy/cheroot/pull/365
[6]: https://github.com/cherrypy/cheroot/pull/649
Related Upstream Issues
-----------------------
https://github.com/cherrypy/cherrypy/issues/1989
https://github.com/cherrypy/cheroot/issues/358
https://bugs.launchpad.net/bugs/2143920
Title:
[SRU] Uncaught SSL errors can crash worker threads