[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-09-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503533#comment-15503533
 ] 

ASF subversion and git services commented on AIRFLOW-191:
-

Commit 4905a5563d47b45e38b91661ee5aa7f3765a129b in incubator-airflow's branch 
refs/heads/master from [~kxepal]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=4905a55 ]

[AIRFLOW-191] Fix connection leak with PostgreSQL backend

This issue happens because job falls asleep during
heartbeat without
closing a session, which holds a connection. This
turns database
connection into IDLE state, but doesn't releases
it for other clients,
so when connection poll get exhausted, they get
blocked for ~heartbeat
timeframe causing global performance degradation.

Closes #1790 from kxepal/AIRFLOW-191-postgresql-
connection-leak


> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-09-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503534#comment-15503534
 ] 

ASF subversion and git services commented on AIRFLOW-191:
-

Commit 4905a5563d47b45e38b91661ee5aa7f3765a129b in incubator-airflow's branch 
refs/heads/master from [~kxepal]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=4905a55 ]

[AIRFLOW-191] Fix connection leak with PostgreSQL backend

This issue happens because job falls asleep during
heartbeat without
closing a session, which holds a connection. This
turns database
connection into IDLE state, but doesn't releases
it for other clients,
so when connection poll get exhausted, they get
blocked for ~heartbeat
timeframe causing global performance degradation.

Closes #1790 from kxepal/AIRFLOW-191-postgresql-
connection-leak


> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-06-15 Thread Alexander Shorin (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331682#comment-15331682
 ] 

Alexander Shorin commented on AIRFLOW-191:
--

Another source of "idle in transaction" issue is {{Variable.get}}. It less 
popular than heartbeats to cause big problems, but still consumes the 
connections for nothing.

> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-06-14 Thread Alexander Shorin (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331223#comment-15331223
 ] 

Alexander Shorin commented on AIRFLOW-191:
--

We fixed this with the following patch:
{code}
diff --git a/airflow/jobs.py b/airflow/jobs.py
index 9a0d455..af9474b 100644
--- a/airflow/jobs.py
+++ b/airflow/jobs.py
@@ -130,6 +130,9 @@ class BaseJob(Base, LoggingMixin):
 '''
 session = settings.Session()
 job = session.query(BaseJob).filter(BaseJob.id == self.id).first()
+make_transient(job)
+session.commit()
+session.close()
 
 if job.state == State.SHUTDOWN:
 self.kill()
@@ -142,6 +145,7 @@ class BaseJob(Base, LoggingMixin):
 
 job.latest_heartbeat = datetime.now()
 
+session = settings.Session()
 session.merge(job)
 session.commit()
 session.close()

{code}

TL;DR don't do sleep while your transaction is open.

> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-06-02 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312911#comment-15312911
 ] 

Chris Riccomini commented on AIRFLOW-191:
-

{quote}
Maybe they could all send their heartbeats to a service that runs on the 
manager node, and the service can manage a modestly sized connection pool?"
{quote}

Long term, the REST service should handle the heartbeats, so everything can 
heartbeat to the service, and it can talk to the DB from one (or a few) places.

> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-06-01 Thread Siddharth Anand (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311379#comment-15311379
 ] 

Siddharth Anand commented on AIRFLOW-191:
-

Sergei,
Regarding "It seems too expensive that job heartbeats should consume as many db 
connections as there are cores in the cluster. Maybe they could all send their 
heartbeats to a service that runs on the manager node, and the service can 
manage a modestly sized connection pool?", I am all for it as well. Please feel 
free to submit a PR for this.

cc
[~bolke][~jlowin] [~criccomini][~maxime.beauche...@apache.org]

> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-06-01 Thread Siddharth Anand (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311344#comment-15311344
 ] 

Siddharth Anand commented on AIRFLOW-191:
-

Hi Sergei, [~criccomini]

I am running the Local Executor and I see the "idle" running COMMIT (and a few 
ROLLBACK) and I see the "idle in transaction" running the heartbeat query you 
are seeing.

{code}
Current query for procpid 121145:

SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date AS 
job_end_date, job.latest_heartbeat AS job_latest_heartbeat, job.executor_class 
AS job_executor_class, job.hostname AS job_hostname, job.unixname AS 
job_unixname 
FROM job 
WHERE job.id = 166658 
 LIMIT 1
{code}

{code}
postgres=# select state, count(*)
from pg_stat_activity
where datname = 'airflow'
group by 1
;
state| count 
-+---
 idle in transaction |18
 idle|22
(2 rows)
{code}


> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-05-31 Thread Sergei Iakhnin (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308343#comment-15308343
 ] 

Sergei Iakhnin commented on AIRFLOW-191:


You are right about celery. 

I think the root cause is that db connections are not always appropriately 
disposed of. I.e. I've made a patch where I close the session and dispose of 
the SQLAlchemy engine every time, and this brings the issue under control 
somewhat, although this can't be the long term solution.

It seems too expensive that job heartbeats should consume as many db 
connections as there are cores in the cluster. Maybe they could all send their 
heartbeats to a service that runs on the manager node, and the service can 
manage a modestly sized connection pool?



> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1
>Reporter: Sergei Iakhnin
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend

2016-05-31 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308079#comment-15308079
 ] 

Chris Riccomini commented on AIRFLOW-191:
-

[~llevar], I'm assuming by "workers" that you're running Celery?

cc [~bolke] [~sanand], I wonder if this is Postgres-specific? Seems possible 
that it's more about what SQLAlchemy is doing.

> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1
>Reporter: Sergei Iakhnin
>
> I raised this issue on github several months ago and there was even a PR but 
> it never maid it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count state client_hostname
> 1 active  (null)
> 1 idlelocalhost
> 451   idle in transaction (null)
> 446   idle(null)
> 1 active  localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)