Hi,

Our uWSGI server hangs (stops serving any requests until it's restarted) about once a week, generally after a harakiri event. Can anyone help troubleshoot this?

Also, how can I debug harakiri events in general? Most of them don't cause the server to hang, but I don't understand what's causing them. The requests printed when the worker dies are all normal parts of our app that are accessed hundreds of times per day without incident.
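One thing that may help when reading the logs below: each `since` value looks like a Unix timestamp for when the request started, so the age of every request at the moment of the harakiri can be computed from the log line itself. A minimal sketch, assuming the log timestamps are in UTC (the server's timezone isn't stated here) and that the lines have the same shape as ours; `CORE_LINE` and `request_age` are names made up for this example, not part of uWSGI:

```python
import calendar
import re
import time

# Matches one per-core line of harakiri-verbose output, e.g.
# "Thu Nov 20 15:09:29 2014 - HARAKIRI [core 16] 127.0.0.1 - POST /pullf/ since 1416495851"
CORE_LINE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) - "
    r"HARAKIRI \[core (?P<core>\d+)\] \S+ - "
    r"(?P<method>\S+) (?P<uri>\S+) since (?P<since>\d+)$"
)


def request_age(line):
    """Return (core, method, uri, age_in_seconds) or None if the line doesn't match.

    Assumption: the leading timestamp is UTC. If the server logs in local
    time, every age is off by the same UTC offset.
    """
    m = CORE_LINE.match(line.strip())
    if m is None:
        return None
    logged_at = calendar.timegm(
        time.strptime(m.group("stamp"), "%a %b %d %H:%M:%S %Y")
    )
    age = logged_at - int(m.group("since"))
    return int(m.group("core")), m.group("method"), m.group("uri"), age


if __name__ == "__main__":
    import sys

    # Usage: python harakiri_age.py < /var/log/uwsgi/streambed
    for raw in sys.stdin:
        parsed = request_age(raw)
        if parsed is not None:
            print("core %2d  %-4s %-40s %5d s" % parsed)
```

Comparing those ages against the `harakiri = 300` setting shows whether only one request overran or every thread in the worker had been stuck for about the same time.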
uWSGI version is 2.0.8. OS is Ubuntu 14.04 LTS. CPU is x86_64 - Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz on Amazon EC2. Webserver is nginx, load balancer is haproxy. Config is below.

Logs from a harakiri that caused the server to hang:

Thu Nov 20 15:09:29 2014 - *** HARAKIRI ON WORKER 8 (pid: 5046, try: 1) ***
HARAKIRI: -- syscall> 7 0x7fffafe0e9c0 0x1 0xffffffff 0x8 0x1040bc8 0x1 0x7fffafe0e9a0 0x7f3afea6cfbd
HARAKIRI: -- wchan> poll_schedule_timeout
Thu Nov 20 15:09:29 2014 - HARAKIRI !!! worker 8 status !!!
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 0] 127.0.0.1 - GET /acct_quota since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 1] 127.0.0.1 - POST /pullf/ since 1416495854
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 2] 127.0.0.1 - GET / since 1416495861
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 3] 127.0.0.1 - GET / since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 4] 127.0.0.1 - POST /signin/ since 1416495865
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 5] 127.0.0.1 - POST /clientresp since 1416495856
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 6] 127.0.0.1 - POST /pullf/ since 1416495858
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 7] 127.0.0.1 - GET /~Dreamshot/495/percentage-of-bachelors-degrees-conferred-to-women-in-the-usa-by-major-1970-2012/ since 1416495852
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 8] 127.0.0.1 - POST /stylethemes/ since 1416495858
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 9] 127.0.0.1 - POST /clientresp since 1416495854
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 10] 127.0.0.1 - POST /clientresp since 1416495856
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 11] 127.0.0.1 - GET /acct_quota since 1416495860
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 12] 127.0.0.1 - POST /signin/ since 1416495866
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 13] 127.0.0.1 - POST /clientresp since 1416495865
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 14] 127.0.0.1 - POST /pullf/ since 1416495853
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 15] 127.0.0.1 - GET /%7Ehianalytics/189/ since 1416495852
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 16] 127.0.0.1 - POST /pullf/ since 1416495851
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 17] 127.0.0.1 - POST /signin/ since 1416495868
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 18] 127.0.0.1 - POST /clientresp since 1416495866
Thu Nov 20 15:09:29 2014 - HARAKIRI [core 19] 127.0.0.1 - GET /getsources?fid=&extrarefs=Doktorigi%3A8 since 1416495868
Thu Nov 20 15:09:29 2014 - HARAKIRI !!! end of worker 8 status !!!
DAMN ! worker 8 (pid: 5046) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 8 (new pid: 10985)
monitor (pid=10985): Starting stack trace monitor.
WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0xa3dd80 pid: 10985 (default app)

When the server is able to successfully restart the worker, the message looks similar. Here's our latest:

Fri Nov 21 18:36:12 2014 - *** HARAKIRI ON WORKER 5 (pid: 23549, try: 1) ***
HARAKIRI: -- wchan> futex_wait_queue_me
Fri Nov 21 18:36:12 2014 - HARAKIRI !!! worker 5 status !!!
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 0] 127.0.0.1 - GET /plot since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 1] 127.0.0.1 - POST /getuser/ since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 2] 127.0.0.1 - POST /user_account_actions since 1416594370
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 3] 127.0.0.1 - GET /plot since 1416594366
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 4] 127.0.0.1 - POST /pullf/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 5] 127.0.0.1 - POST /clientresp since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 6] 127.0.0.1 - GET /python/3d-plots-tutorial/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 7] 127.0.0.1 - POST /getuser/ since 1416594370
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 8] 127.0.0.1 - POST /getuser/ since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 9] 127.0.0.1 - POST /getuser/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 10] 127.0.0.1 - POST /getuser/ since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 11] 127.0.0.1 - POST /svgtopdf/ since 1416594371
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 12] 127.0.0.1 - POST /clientresp since 1416594366
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 13] 127.0.0.1 - GET /quandl?code=WORLDBANK/UZB_SP_RUR_TOTL_ZS since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 14] 127.0.0.1 - GET /~martin.2098/20/-line0-css-penthouse-line0-line0 since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 15] 127.0.0.1 - POST /user_account_actions since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 16] 127.0.0.1 - GET /plot since 1416594368
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 17] 127.0.0.1 - GET /plot since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 18] 127.0.0.1 - POST /clientresp since 1416594371
Fri Nov 21 18:36:12 2014 - HARAKIRI [core 19] 127.0.0.1 - POST /getnotifs/ since 1416594367
Fri Nov 21 18:36:12 2014 - HARAKIRI !!! end of worker 5 status !!!
DAMN ! worker 5 (pid: 23549) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 24129)
monitor (pid=24129): Starting stack trace monitor.
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0xae8aa0 pid: 24129 (default app)

Configuration from --show-config:

;uWSGI instance configuration
[uwsgi]
show-config = true
emperor = /etc/streambed_uwsgi.ini
;end of configuration

Contents of /etc/streambed_uwsgi.ini:

[uwsgi]
uid = www-data
gid = www-data
chdir = /var/www/streambed/shelly
module = apache.wsgi
socket = /var/run/streambed.sock
chown-socket = www-data
logto = /var/log/uwsgi/streambed
pidfile = /var/run/streambed.pid
master = true
# Conventional SIGTERM behaviour - needed for runit:
die-on-term = true
# Clean up on exit:
vacuum = true
# 10 processes, 20 threads each:
processes = 10
threads = 20
buffer-size = 32768
# Load the app in each worker process, rather than in the master process:
lazy = true
# Maximum time to service a request (seconds):
harakiri = 300
harakiri-verbose = true
# Reload each process after this number of requests:
max-requests = 10000
# Save HTTP bodies larger than this to disk (bytes):
post-buffering = 1000000
# Stats socket
stats = /var/run/uwsgi/streambed.stats

Thanks for any hints or suggestions on either of these issues!

Jody McIntyre
Plotly Engineering
_______________________________________________
uWSGI mailing list
http://lists.unbit.it/cgi-bin/mailman/listinfo/uwsgi
