[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503533#comment-15503533 ] ASF subversion and git services commented on AIRFLOW-191: - Commit 4905a5563d47b45e38b91661ee5aa7f3765a129b in incubator-airflow's branch refs/heads/master from [~kxepal] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=4905a55 ] [AIRFLOW-191] Fix connection leak with PostgreSQL backend This issue happens because job falls asleep during heartbeat without closing a session, which holds a connection. This turns database connection into IDLE state, but doesn't releases it for other clients, so when connection poll get exhausted, they get blocked for ~heartbeat timeframe causing global performance degradation. Closes #1790 from kxepal/AIRFLOW-191-postgresql- connection-leak > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503534#comment-15503534 ] ASF subversion and git services commented on AIRFLOW-191: - Commit 4905a5563d47b45e38b91661ee5aa7f3765a129b in incubator-airflow's branch refs/heads/master from [~kxepal] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=4905a55 ] [AIRFLOW-191] Fix connection leak with PostgreSQL backend This issue happens because job falls asleep during heartbeat without closing a session, which holds a connection. This turns database connection into IDLE state, but doesn't releases it for other clients, so when connection poll get exhausted, they get blocked for ~heartbeat timeframe causing global performance degradation. Closes #1790 from kxepal/AIRFLOW-191-postgresql- connection-leak > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331682#comment-15331682 ] Alexander Shorin commented on AIRFLOW-191: -- Another source of "idle in transaction" issue is {{Variable.get}}. It less popular than heartbeats to cause big problems, but still consumes the connections for nothing. > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331223#comment-15331223 ] Alexander Shorin commented on AIRFLOW-191: -- We fixed this with the following patch: {code} diff --git a/airflow/jobs.py b/airflow/jobs.py index 9a0d455..af9474b 100644 --- a/airflow/jobs.py +++ b/airflow/jobs.py @@ -130,6 +130,9 @@ class BaseJob(Base, LoggingMixin): ''' session = settings.Session() job = session.query(BaseJob).filter(BaseJob.id == self.id).first() +make_transient(job) +session.commit() +session.close() if job.state == State.SHUTDOWN: self.kill() @@ -142,6 +145,7 @@ class BaseJob(Base, LoggingMixin): job.latest_heartbeat = datetime.now() +session = settings.Session() session.merge(job) session.commit() session.close() {code} TL;DR don't do sleep while your transaction is open. > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312911#comment-15312911 ] Chris Riccomini commented on AIRFLOW-191: - {quote} Maybe they could all send their heartbeats to a service that runs on the manager node, and the service can manage a modestly sized connection pool?" {quote} Long term, the REST service should handle the heartbeats, so everything can heartbeat to the service, and it can talk to the DB from one (or a few) places. > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311379#comment-15311379 ] Siddharth Anand commented on AIRFLOW-191: - Sergei, Regarding "It seems too expensive that job heartbeats should consume as many db connections as there are cores in the cluster. Maybe they could all send their heartbeats to a service that runs on the manager node, and the service can manage a modestly sized connection pool?", I am all for it as well. Please feel free to submit a PR for this. cc [~bolke][~jlowin] [~criccomini][~maxime.beauche...@apache.org] > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311344#comment-15311344 ] Siddharth Anand commented on AIRFLOW-191: - Hi Sergei, [~criccomini] I am running the Local Executor and I see the "idle" running COMMIT (and a few ROLLBACK) and I see the "idle in transaction" running the heartbeat query you are seeing. {code} Current query for procpid 121145: SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, job.executor_class AS job_executor_class, job.hostname AS job_hostname, job.unixname AS job_unixname FROM job WHERE job.id = 166658 LIMIT 1 {code} {code} postgres=# select state, count(*) from pg_stat_activity where datname = 'airflow' group by 1 ; state| count -+--- idle in transaction |18 idle|22 (2 rows) {code} > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308343#comment-15308343 ] Sergei Iakhnin commented on AIRFLOW-191: You are right about celery. I think the root cause is that db connections are not always appropriately disposed of. I.e. I've made a patch where I close the session and dispose of the SQLAlchemy engine every time, and this brings the issue under control somewhat, although this can't be the long term solution. It seems too expensive that job heartbeats should consume as many db connections as there are cores in the cluster. Maybe they could all send their heartbeats to a service that runs on the manager node, and the service can manage a modestly sized connection pool? > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1 >Reporter: Sergei Iakhnin > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308079#comment-15308079 ] Chris Riccomini commented on AIRFLOW-191: - [~llevar], I'm assuming by "workers" that you're running Celery? cc [~bolke] [~sanand], I wonder if this is Postgres-specific? Seems possible that it's more about what SQLAlchemy is doing. > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1 >Reporter: Sergei Iakhnin > > I raised this issue on github several months ago and there was even a PR but > it never maid it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeat. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. > If I run the following query > select count(*),state, client_hostname from pg_stat_activity group by state, > client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idlelocalhost > 451 idle in transaction (null) > 446 idle(null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v6.3.4#6332)