[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410113#comment-16410113 ]

ASF subversion and git services commented on AIRFLOW-1235:
----------------------------------------------------------

Commit 7e762d42df50d84e4740e15c24594c50aaab53a2 in incubator-airflow's branch refs/heads/master from [~sekikn]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=7e762d4 ]

[AIRFLOW-1235] Fix webserver's odd behaviour

In some cases, the gunicorn master shuts down but the webserver monitor process doesn't. This PR adds timeout functionality to shut down all related processes in such cases.

Dear Airflow maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
  - https://issues.apache.org/jira/browse/AIRFLOW-1235

### Description
- [x] Here are some details about my PR, including screenshots of any UI changes: In some cases, the gunicorn master shuts down but the webserver monitor process doesn't. This PR adds timeout functionality to shut down all related processes in such cases.

### Tests
- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: tests.core:CliTests.test_cli_webserver_shutdown_when_gunicorn_master_is_killed

### Commits
- [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  2. Subject is limited to 50 characters
  3. Subject does not end with a period
  4. Subject uses the imperative mood ("add", not "adding")
  5. Body wraps at 72 characters
  6.
Body explains "what" and "why", not "how"

Closes #2330 from sekikn/AIRFLOW-1235

> Odd behaviour when all gunicorn workers die
> -------------------------------------------
>
> Key: AIRFLOW-1235
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1235
> Project: Apache Airflow
> Issue Type: Bug
> Components: webserver
> Affects Versions: 1.8.0
> Reporter: Erik Forsberg
> Assignee: Kengo Seki
> Priority: Major
> Fix For: 2.0.0
>
> The webserver has sometimes stopped responding to port 443, and today I found
> the issue - I had a misconfigured resolv.conf that made it unable to talk to
> my postgresql. This was the root cause, but the way airflow webserver behaved
> was a bit odd.
> It seems that when all gunicorn workers failed to start, the gunicorn master
> shut down. However, the main process (the one that starts gunicorn master)
> did not shut down, so there was no way of detecting the failed status of
> webserver from e.g. systemd or init script.
> Full traceback leading to stale webserver process:
> {noformat}
> May 21 09:51:57 airmaster01 airflow[26451]: [2017-05-21 09:51:57 +] [23794] [ERROR] Exception in worker process:
> May 21 09:51:57 airmaster01 airflow[26451]: Traceback (most recent call last):
> May 21 09:51:57 airmaster01 airflow[26451]:   File "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/pool.py", line 1122, in _do_get
> May 21 09:51:57 airmaster01 airflow[26451]:     return self._pool.get(wait, self._timeout)
> May 21 09:51:57 airmaster01 airflow[26451]:   File "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/util/queue.py", line 145, in get
> May 21 09:51:57 airmaster01 airflow[26451]:     raise Empty
> May 21 09:51:57 airmaster01 airflow[26451]: sqlalchemy.util.queue.Empty
> May 21 09:51:57 airmaster01 airflow[26451]: During handling of the above exception, another exception occurred:
> May 21 09:51:57 airmaster01 airflow[26451]: Traceback (most recent call last):
> May 21 09:51:57 airmaster01 airflow[26451]:   File "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
> May 21 09:51:57 airmaster01 airflow[26451]:     return fn()
> May 21 09:51:57 airmaster01 airflow[26451]:   File "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/pool.py", line 387, in connect
> May 21 09:51:57 airmaster01 airflow[26451]:     return _ConnectionFairy._checkout(self)
> May 21 09:51:57 airmaster01 airflow[26451]:   File "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/pool.py", line 766, in _checkout
> May 21 09:51:57 airmaster01 airflow[26451]:     fairy = _ConnectionRecord.checkout(pool)
> May 21 09:51:57 airmaster01 airflow[26451]:   File
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405547#comment-16405547 ]

John Arnold commented on AIRFLOW-1235:
--------------------------------------

For reference, here's what it looks like in my logs too...

```
[2018-03-19 03:21:31,442] {configuration.py:211} WARNING - section/key [celery/ssl_active] not found in config
[2018-03-19 03:21:31,442] {default_celery.py:44} WARNING - Celery Executor will run without SSL
[2018-03-19 03:21:33,299] {config.py:26} INFO - Missing required environment vars. Attempting to load from /etc/celery/conf.d/celery
[2018-03-19 03:21:33,352] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-03-19 03:21:33,724] {models.py:196} INFO - Filling up the DagBag from /var/lib/airflow/venv/lib/python3.6/site-packages/net_task_contrib/airflow/dags
[2018-03-19 03:21:36 +] [125724] [INFO] Handling signal: ttou
[2018-03-19 03:21:36 +] [55375] [INFO] Worker exiting (pid: 55375)
[2018-03-19 03:22:07 +] [125724] [INFO] Handling signal: ttin
[2018-03-19 03:22:07 +] [59230] [INFO] Booting worker with pid: 59230
[2018-03-19 03:22:07,621] {configuration.py:211} WARNING - section/key [celery/ssl_active] not found in config
[2018-03-19 03:22:07,621] {default_celery.py:44} WARNING - Celery Executor will run without SSL
[2018-03-19 03:22:08,604] {config.py:26} INFO - Missing required environment vars. Attempting to load from /etc/celery/conf.d/celery
[2018-03-19 03:22:08,648] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-03-19 03:22:08,979] {models.py:196} INFO - Filling up the DagBag from /var/lib/airflow/venv/lib/python3.6/site-packages/net_task_contrib/airflow/dags
[2018-03-19 03:22:09 +] [59230] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1124, in _do_get
    return self._pool.get(wait, self._timeout)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/util/queue.py", line 145, in get
    raise Empty
sqlalchemy.util.queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 768, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1140, in _do_get
    self._dec_overflow()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
    raise value
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1137, in _do_get
    return self._create_connection()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return dialect.connect(*cargs, **cparams)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 393, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 578, in spawn_worker
    worker.init_process()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/workers/base.py", line 126, in init_process
    self.load_wsgi()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/workers/base.py", line 135, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File
```
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405394#comment-16405394 ]

John Arnold commented on AIRFLOW-1235:
--------------------------------------

An external check-and-restart is not a great user experience – how do we fix the code problem so that it doesn't crash in the first place, or recovers if it does crash?
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398316#comment-16398316 ]

yuanyifang commented on AIRFLOW-1235:
-------------------------------------

I send a GET request to the webserver to check whether it is still working. If I get a bad response, I restart the webserver.
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397679#comment-16397679 ]

John Arnold commented on AIRFLOW-1235:
--------------------------------------

Does anyone know how to fix this so that either gunicorn gets restarted, or the main process dies?
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393418#comment-16393418 ] James Davidheiser commented on AIRFLOW-1235:

I saw the same thing with a MySQL connection error. I am running Airflow in Kubernetes and plan to set up a liveness probe so the container is killed if the server stops working. Some tasks also failed to connect to the database, but those were marked as failed, as I would expect.

> Odd behaviour when all gunicorn workers die
> -------------------------------------------
>
> Key: AIRFLOW-1235
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1235
> Project: Apache Airflow
> Issue Type: Bug
> Components: webserver
> Affects Versions: 1.8.0
> Reporter: Erik Forsberg
> Assignee: Kengo Seki
> Priority: Major
>
> The webserver has sometimes stopped responding to port 443, and today I found the issue - I had a misconfigured resolv.conf that made it unable to talk to my postgresql. This was the root cause, but the way airflow webserver behaved was a bit odd.
> It seems that when all gunicorn workers failed to start, the gunicorn master shut down. However, the main process (the one that starts the gunicorn master) did not shut down, so there was no way of detecting the failed status of the webserver from e.g. systemd or an init script.
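A liveness probe of the kind James describes usually just runs a command that exits non-zero when the server stops answering. A minimal sketch of such a check follows; the URL and port are assumptions about the deployment, not something specified in this issue:

```python
import sys
import urllib.request


def check_alive(url, timeout=5):
    """Return 0 if the webserver answers with a non-5xx response, 1 otherwise.

    Exit-code semantics match what probe runners (Kubernetes, systemd
    watchdog wrappers, etc.) expect from a health-check command.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status < 500 else 1
    except Exception:
        # Connection refused, DNS failure, timeout: all count as "dead".
        return 1


if __name__ == "__main__":
    # Hypothetical endpoint; point this at your webserver's bind address.
    sys.exit(check_alive("http://localhost:8080/"))
```

Because the check exercises the full request path, it catches exactly the failure mode in this issue: a parent process that is alive but serving nothing.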
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392021#comment-16392021 ] John Arnold commented on AIRFLOW-1235:

I have the same problem: the workers fail due to a Postgres issue and the gunicorn master hangs, but the airflow webserver doesn't detect the hang and die. It just zombies. If it would just die, my container could exit and be restarted.
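The behaviour John is asking for - a wrapper that dies when its child dies, so a container runtime or systemd can restart it - can be sketched as follows. This is an illustrative supervisor pattern, not the actual Airflow monitor code:

```python
import signal
import subprocess
import sys
import time


def supervise(cmd, poll_interval=1.0):
    """Run cmd as a child process and mirror its exit status.

    If the child (standing in for the gunicorn master) exits for any
    reason, the wrapper exits with the same code instead of lingering
    as a zombie parent, so an external supervisor can observe the
    failure and restart the service.
    """
    child = subprocess.Popen(cmd)
    try:
        while child.poll() is None:   # child is still running
            time.sleep(poll_interval)
        sys.exit(child.returncode)    # propagate the child's exit status
    except KeyboardInterrupt:
        # Forward the interrupt so the child shuts down cleanly too.
        child.send_signal(signal.SIGTERM)
        child.wait()
        raise
```

With this shape, "gunicorn master died" becomes "wrapper process exited non-zero", which is exactly the signal init systems and container runtimes know how to act on.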
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298626#comment-16298626 ] Adam Hodges commented on AIRFLOW-1235:

I am also seeing this issue when we have connectivity issues to our Postgres database. All the workers die and then the gunicorn master dies, but the airflow process does not die. If the airflow process died, I could at least detect it and restart the airflow webserver.

My logs:

{noformat}
[2017-12-20 00:56:26 +] [844] [ERROR] Exception in worker process:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 507, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/base.py", line 118, in init_process
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 65, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 366, in import_app
    app = eval(obj, mod.__dict__)
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/app.py", line 161, in cached_app
    app = create_app(config)
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/app.py", line 60, in create_app
    from airflow.www import views
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/views.py", line 1977, in <module>
    class ChartModelView(wwwutils.DataProfilingMixin, AirflowModelView):
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/views.py", line 2059, in ChartModelView
    .group_by(models.Connection.conn_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
    **kw
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 998, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 1003, in _connection_for_bind
    engine, execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 403, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect
    e, dialect, self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1465, in _handle_dbapi_exception_noconnection
    exc_info
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 766, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 1138, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 1135, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return
[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die
[ https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022892#comment-16022892 ] Kengo Seki commented on AIRFLOW-1235:

I could reproduce this issue as follows:

1. Run the webserver in the foreground
2. Kill the gunicorn master

{code}
$ ps aux | grep airflow | grep -v grep
sekikn    2929  4.5  9.3 241204 71172 pts/1  S+   09:12  0:02 /home/sekikn/.virtualenvs/a/bin/python2 /home/sekikn/.virtualenvs/a/bin/airflow webserver
sekikn    2935  1.9  6.1 115868 46676 pts/1  S+   09:12  0:00 gunicorn: master [airflow-webserver]
sekikn    2943  1.3  8.8 242144 67300 pts/1  Sl+  09:12  0:00 [ready] gunicorn: worker [airflow-webserver]
sekikn    2944  1.3  8.8 242040 67312 pts/1  Sl+  09:12  0:00 [ready] gunicorn: worker [airflow-webserver]
sekikn    2945  1.3  8.8 242052 67320 pts/1  Sl+  09:12  0:00 [ready] gunicorn: worker [airflow-webserver]
sekikn    2952  6.0  8.8 242056 67196 pts/1  Sl+  09:13  0:00 [ready] gunicorn: worker [airflow-webserver]
$ kill 2935
{code}

3. The gunicorn master then remains as a zombie and the webserver gets stuck

{code}
$ ps aux | grep airflow | grep -v grep
sekikn    2929 10.5  9.3 241204 71220 pts/1  S+   09:12  0:22 /home/sekikn/.virtualenvs/a/bin/python2 /home/sekikn/.virtualenvs/a/bin/airflow webserver
$ ps 2935
  PID TTY      STAT   TIME COMMAND
 2935 pts/1    Z+     0:00 [gunicorn: maste]
{code}

At step 3, the following message is output to the log:

{code}
[2017-05-24 09:13:52,092] [2929] {cli.py:671} DEBUG - [4 / 4] doing a refresh of 1 workers
{code}

So I think the webserver waits indefinitely at line 679 for workers that will never start.
{code}
668     def start_refresh(gunicorn_master_proc):
669         batch_size = conf.getint('webserver', 'worker_refresh_batch_size')
670         logging.debug('%s doing a refresh of %s workers',
671                       state, batch_size)
672         sys.stdout.flush()
673         sys.stderr.flush()
674
675         excess = 0
676         for _ in range(batch_size):
677             gunicorn_master_proc.send_signal(signal.SIGTTIN)
678             excess += 1
679             wait_until_true(lambda: num_workers_expected + excess ==
680                             get_num_workers_running(gunicorn_master_proc))
{code}
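The hang happens because `wait_until_true` at line 679 polls without any bound. A minimal sketch of a bounded variant - illustrative only, not the actual Airflow implementation, and the parameter names are assumptions - shows the shape of the fix:

```python
import time


def wait_until_true(fn, timeout=120, poll_interval=0.1):
    """Poll fn() until it returns a truthy value.

    Unlike an unbounded loop, this raises TimeoutError once `timeout`
    seconds have elapsed, so the monitor process cannot block forever
    waiting for workers that will never start.
    """
    deadline = time.monotonic() + timeout
    while not fn():
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)
```

The caller can then catch the timeout, kill the gunicorn master and its children, and exit non-zero, which lets systemd or an init script observe the failure.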