[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410113#comment-16410113
 ] 

ASF subversion and git services commented on AIRFLOW-1235:
--

Commit 7e762d42df50d84e4740e15c24594c50aaab53a2 in incubator-airflow's branch 
refs/heads/master from [~sekikn]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=7e762d4 ]

[AIRFLOW-1235] Fix webserver's odd behaviour

In some cases, the gunicorn master shuts down
but the webserver monitor process doesn't.
This PR adds timeout functionality to shut down
all related processes in such cases.

Dear Airflow maintainers,

Please accept this PR. I understand that it will
not be reviewed until I have checked off all the
steps below!

### JIRA
- [x] My PR addresses the following [Airflow JIRA]
(https://issues.apache.org/jira/browse/AIRFLOW/)
issues and references them in the PR title. For
example, "[AIRFLOW-XXX] My Airflow PR"
- https://issues.apache.org/jira/browse/AIRFLOW-1235

### Description
- [x] Here are some details about my PR, including
screenshots of any UI changes:

In some cases, the gunicorn master shuts down
but the webserver monitor process doesn't.
This PR adds timeout functionality to shut down
all related processes in such cases.

### Tests
- [x] My PR adds the following unit tests __OR__
does not need testing for this extremely good
reason:

tests.core:CliTests.test_cli_webserver_shutdown_when_gunicorn_master_is_killed

### Commits
- [x] My commits all reference JIRA issues in
their subject lines, and I have squashed multiple
commits if they address the same issue. In
addition, my commits follow the guidelines from
"[How to write a good git commit
message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not
"adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Closes #2330 from sekikn/AIRFLOW-1235
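
For readers skimming the thread: the change above makes the monitoring process give up and exit instead of waiting forever on a dead gunicorn master. A minimal sketch of that behaviour (illustrative only, not the actual patch; the gunicorn command line and names below are placeholders):

{code}
import subprocess
import sys
import time

def monitor(master, poll_interval=1.0):
    """Wait for the gunicorn master; once it exits, return its exit code
    instead of looping forever, so systemd/init scripts can see the failure."""
    while master.poll() is None:   # None means the master is still running
        time.sleep(poll_interval)
    return master.returncode

if __name__ == '__main__':
    master = subprocess.Popen(['gunicorn', 'app:application'])  # placeholder app
    sys.exit(monitor(master))      # propagate gunicorn's exit status
{code}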


> Odd behaviour when all gunicorn workers die
> ---
>
> Key: AIRFLOW-1235
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1235
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: webserver
>Affects Versions: 1.8.0
>Reporter: Erik Forsberg
>Assignee: Kengo Seki
>Priority: Major
> Fix For: 2.0.0
>
>
> The webserver has sometimes stopped responding to port 443, and today I found 
> the issue - I had a misconfigured resolv.conf that made it unable to talk to 
> my postgresql. This was the root cause, but the way airflow webserver behaved 
> was a bit odd.
> It seems that when all gunicorn workers failed to start, the gunicorn master 
> shut down. However, the main process (the one that starts gunicorn master) 
> did not shut down, so there was no way of detecting the failed status of 
> webserver from e.g. systemd or init script.
> Full traceback leading to stale webserver process:
> {noformat}
> May 21 09:51:57 airmaster01 airflow[26451]: [2017-05-21 09:51:57 +] 
> [23794] [ERROR] Exception in worker process:
> May 21 09:51:57 airmaster01 airflow[26451]: Traceback (most recent call last):
> May 21 09:51:57 airmaster01 airflow[26451]: File 
> "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/pool.py", 
> line 1122, in _do_get
> May 21 09:51:57 airmaster01 airflow[26451]: return self._pool.get(wait, 
> self._timeout)
> May 21 09:51:57 airmaster01 airflow[26451]: File 
> "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/util/queue.py",
>  line 145, in get
> May 21 09:51:57 airmaster01 airflow[26451]: raise Empty
> May 21 09:51:57 airmaster01 airflow[26451]: sqlalchemy.util.queue.Empty
> May 21 09:51:57 airmaster01 airflow[26451]: During handling of the above 
> exception, another exception occurred:
> May 21 09:51:57 airmaster01 airflow[26451]: Traceback (most recent call last):
> May 21 09:51:57 airmaster01 airflow[26451]: File 
> "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/engine/base.py",
>  line 2147, in _wrap_pool_connect
> May 21 09:51:57 airmaster01 airflow[26451]: return fn()
> May 21 09:51:57 airmaster01 airflow[26451]: File 
> "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/pool.py", 
> line 387, in connect
> May 21 09:51:57 airmaster01 airflow[26451]: return 
> _ConnectionFairy._checkout(self)
> May 21 09:51:57 airmaster01 airflow[26451]: File 
> "/opt/airflow/production/lib/python3.4/site-packages/sqlalchemy/pool.py", 
> line 766, in _checkout
> May 21 09:51:57 airmaster01 airflow[26451]: fairy = 
> _ConnectionRecord.checkout(pool)
> May 21 09:51:57 airmaster01 airflow[26451]: File 
> 

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-19 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405547#comment-16405547
 ] 

John Arnold commented on AIRFLOW-1235:
--

For reference, here's what it looks like in my logs too...

```

[2018-03-19 03:21:31,442] {configuration.py:211} WARNING - section/key [celery/ssl_active] not found in config
[2018-03-19 03:21:31,442] {default_celery.py:44} WARNING - Celery Executor will run without SSL
[2018-03-19 03:21:33,299] {config.py:26} INFO - Missing required environment vars. Attempting to load from /etc/celery/conf.d/celery
[2018-03-19 03:21:33,352] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-03-19 03:21:33,724] {models.py:196} INFO - Filling up the DagBag from /var/lib/airflow/venv/lib/python3.6/site-packages/net_task_contrib/airflow/dags
[2018-03-19 03:21:36 +] [125724] [INFO] Handling signal: ttou
[2018-03-19 03:21:36 +] [55375] [INFO] Worker exiting (pid: 55375)
[2018-03-19 03:22:07 +] [125724] [INFO] Handling signal: ttin
[2018-03-19 03:22:07 +] [59230] [INFO] Booting worker with pid: 59230
[2018-03-19 03:22:07,621] {configuration.py:211} WARNING - section/key [celery/ssl_active] not found in config
[2018-03-19 03:22:07,621] {default_celery.py:44} WARNING - Celery Executor will run without SSL
[2018-03-19 03:22:08,604] {config.py:26} INFO - Missing required environment vars. Attempting to load from /etc/celery/conf.d/celery
[2018-03-19 03:22:08,648] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-03-19 03:22:08,979] {models.py:196} INFO - Filling up the DagBag from /var/lib/airflow/venv/lib/python3.6/site-packages/net_task_contrib/airflow/dags
[2018-03-19 03:22:09 +] [59230] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1124, in _do_get
    return self._pool.get(wait, self._timeout)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/util/queue.py", line 145, in get
    raise Empty
sqlalchemy.util.queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 768, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1140, in _do_get
    self._dec_overflow()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
    raise value
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1137, in _do_get
    return self._create_connection()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return dialect.connect(*cargs, **cparams)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 393, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: SSL SYSCALL error: EOF detected


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 578, in spawn_worker
    worker.init_process()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/workers/base.py", line 126, in init_process
    self.load_wsgi()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/workers/base.py", line 135, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/var/lib/airflow/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File 
```

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-19 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405394#comment-16405394
 ] 

John Arnold commented on AIRFLOW-1235:
--

External check-and-restart is not a great user experience. How do we fix the 
code so that it doesn't crash in the first place, or recovers if it does crash?

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-14 Thread yuanyifang (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398316#comment-16398316
 ] 

yuanyifang commented on AIRFLOW-1235:
-

I send a GET request to the webserver to check whether it is still working. 
If I get a bad response, I restart the webserver.
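
A hypothetical sketch of that workaround, runnable from cron or a watchdog (the URL, port, and service name are assumptions; any page served by a worker would do):

{code}
import subprocess
import urllib.error
import urllib.request

WEBSERVER_URL = 'http://localhost:8080/'  # assumption: adjust to your deployment

def webserver_alive(timeout=10.0):
    """Return True only if the webserver answers the GET with HTTP 200."""
    try:
        with urllib.request.urlopen(WEBSERVER_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, reset, ...

if __name__ == '__main__':
    if not webserver_alive():
        # Unit name is an assumption; use whatever runs your webserver.
        subprocess.run(['systemctl', 'restart', 'airflow-webserver'], check=False)
{code}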

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-13 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397679#comment-16397679
 ] 

John Arnold commented on AIRFLOW-1235:
--

Does anyone know how to fix this so that either gunicorn gets restarted, or the 
main process dies?

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-09 Thread James Davidheiser (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393418#comment-16393418
 ] 

James Davidheiser commented on AIRFLOW-1235:


I saw the same thing with a MySQL connection error. I am running Airflow in 
Kubernetes, and plan to set up a liveness probe to make sure it kills the 
container if the webserver stops working.

Some tasks failed on connecting to the database too, but those were marked as 
failed as I would expect.

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2018-03-08 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392021#comment-16392021
 ] 

John Arnold commented on AIRFLOW-1235:
--

I have this same problem: workers fail due to a Postgres issue, the gunicorn 
master hangs, but the airflow webserver doesn't detect the hang and die. It 
just zombies. If it would just die, then my container could exit and be restarted.

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2017-12-20 Thread Adam Hodges (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298626#comment-16298626
 ] 

Adam Hodges commented on AIRFLOW-1235:
--

I am also seeing this issue when we have connectivity issues to our Postgres 
database. All the workers die and then the gunicorn master dies, but the 
airflow process does not die. If the airflow process died, I could at least 
detect it and restart the airflow webserver.

My logs:

{noformat}
[2017-12-20 00:56:26 +] [844] [ERROR] Exception in worker process:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 507, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/workers/base.py", line 118, in init_process
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 65, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python2.7/dist-packages/gunicorn/util.py", line 366, in import_app
    app = eval(obj, mod.__dict__)
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/app.py", line 161, in cached_app
    app = create_app(config)
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/app.py", line 60, in create_app
    from airflow.www import views
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/views.py", line 1977, in <module>
    class ChartModelView(wwwutils.DataProfilingMixin, AirflowModelView):
  File "/usr/local/lib/python2.7/dist-packages/airflow/www/views.py", line 2059, in ChartModelView
    .group_by(models.Connection.conn_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
    **kw
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 998, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 1003, in _connection_for_bind
    engine, execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 403, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2151, in _wrap_pool_connect
    e, dialect, self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1465, in _handle_dbapi_exception_noconnection
    exc_info
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 387, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 766, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 516, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 1138, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 1135, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 333, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 461, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 651, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/strategies.py", line 105, in connect
    return 
{noformat}

[jira] [Commented] (AIRFLOW-1235) Odd behaviour when all gunicorn workers die

2017-05-24 Thread Kengo Seki (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022892#comment-16022892
 ] 

Kengo Seki commented on AIRFLOW-1235:
-

I could reproduce this issue as follows:

1. Run webserver in foreground
2. Kill gunicorn master

{code}
$ ps aux | grep airflow | grep -v grep
sekikn    2929  4.5  9.3 241204 71172 pts/1    S+   09:12   0:02 /home/sekikn/.virtualenvs/a/bin/python2 /home/sekikn/.virtualenvs/a/bin/airflow webserver
sekikn    2935  1.9  6.1 115868 46676 pts/1    S+   09:12   0:00 gunicorn: master [airflow-webserver]
sekikn    2943  1.3  8.8 242144 67300 pts/1    Sl+  09:12   0:00 [ready] gunicorn: worker [airflow-webserver]
sekikn    2944  1.3  8.8 242040 67312 pts/1    Sl+  09:12   0:00 [ready] gunicorn: worker [airflow-webserver]
sekikn    2945  1.3  8.8 242052 67320 pts/1    Sl+  09:12   0:00 [ready] gunicorn: worker [airflow-webserver]
sekikn    2952  6.0  8.8 242056 67196 pts/1    Sl+  09:13   0:00 [ready] gunicorn: worker [airflow-webserver]
$ kill 2935
{code}

3. Then gunicorn master remains as a zombie and webserver gets stuck

{code}
$ ps aux | grep airflow | grep -v grep
sekikn    2929 10.5  9.3 241204 71220 pts/1    S+   09:12   0:22 /home/sekikn/.virtualenvs/a/bin/python2 /home/sekikn/.virtualenvs/a/bin/airflow webserver
$ ps 2935
  PID TTY      STAT   TIME COMMAND
 2935 pts/1    Z+     0:00 [gunicorn: maste] <defunct>
{code}

At step 3, the following message is output to the log:

{code}
[2017-05-24 09:13:52,092] [2929] {cli.py:671} DEBUG - [4 / 4] doing a refresh 
of 1 workers
{code}

So I think the webserver waits indefinitely at line 679 for workers that will 
never start.

{code}
 668 def start_refresh(gunicorn_master_proc):
 669 batch_size = conf.getint('webserver', 'worker_refresh_batch_size')
 670 logging.debug('%s doing a refresh of %s workers',
 671   state, batch_size)
 672 sys.stdout.flush()
 673 sys.stderr.flush()
 674 
 675 excess = 0
 676 for _ in range(batch_size):
 677 gunicorn_master_proc.send_signal(signal.SIGTTIN)
 678 excess += 1
 679 wait_until_true(lambda: num_workers_expected + excess ==
 680 get_num_workers_running(gunicorn_master_proc))
{code}
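
That is essentially what the fix does: give this wait a timeout so the monitor fails loudly instead of spinning forever. A rough sketch of the idea (illustrative; the exception name and polling details here are assumptions, see commit 7e762d4 for the real change):

{code}
import time

class AirflowWebServerTimeout(Exception):
    """Raised when gunicorn does not reach the expected state in time."""

def wait_until_true(fn, timeout=0):
    """Poll fn() until it returns True; give up after `timeout` seconds
    (0 means wait forever, preserving the old behaviour)."""
    start_time = time.time()
    while not fn():
        if 0 < timeout <= time.time() - start_time:
            raise AirflowWebServerTimeout(
                'No response from gunicorn master within %s seconds' % timeout)
        time.sleep(0.1)
{code}

With a timeout in place, the monitor can catch the exception, kill the remaining processes, and exit non-zero, which is what finally lets systemd or an init script see the failure.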
