how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Hi all, We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe something as simple as the underlying infrastructure going down). Currently, we run everything on Kubernetes (including Airflow), so Airflow pod crashes will generally be detected, and then the pods will restart.
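As a minimal sketch of the setup being described (not code from the thread), an Airflow 1.x-era DAG can at least declare retries so that tasks interrupted by a pod crash are retried once Airflow comes back; the DAG id, schedule, and command below are illustrative placeholders.

```python
# Hedged sketch: Airflow 1.x-style DAG with retries, so tasks that died
# mid-run during a crash are retried after recovery. Names are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2017, 12, 1),
    'retries': 3,                          # retry tasks that were killed mid-run
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('example_recovery_dag', default_args=default_args,
          schedule_interval='@daily')

task = BashOperator(
    task_id='do_work',
    bash_command='echo "doing work"',
    dag=dag,
)
```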

Re: [VOTE] Airflow 1.9.0rc8

2017-12-17 Thread Bolke de Bruin
This is just a matter of setting the tag in the repo, right? We should remove that check or make it not fail at least. It is ridiculous. B. Sent from my iPad > On 17 Dec 2017 at 07:32, Joy Gao wrote the following: > > Ahh, tested the build on a fresh

Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Hmm, perhaps we've just had a couple of bad/unlucky runs, but in general the underlying task-kill process doesn't really seem to work, from what we've seen. I would guess this is related to https://issues.apache.org/jira/browse/AIRFLOW-1623. On Sun, Dec 17, 2017 at 12:22 PM, Bolke de Bruin

Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
Upon further internal discussion, we might be seeing the task cloning because the postgres DB is getting into a corrupted state... but it's unclear. If the consensus is we *shouldn't* be seeing this behavior, even as-is, we'll push more on that angle. On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman

Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Bolke de Bruin
Quite important to know is that Airflow’s executors do not keep state after a restart. This particularly affects distributed executors (Celery, Dask), as the workers are independent from the scheduler. Thus, at restart we reset all the tasks in the queued state that the executor does not know
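A rough illustration of what resetting queued tasks the executor no longer knows about amounts to, expressed against the Airflow 1.x ORM; this is an assumption-laden sketch, not the scheduler's actual code.

```python
# Illustrative sketch only (not Airflow's real reset logic): after a restart
# the executor's in-memory queue is empty, so queued TaskInstances it no
# longer knows about get their state cleared and are handed back to the
# scheduler for re-queuing.
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

session = settings.Session()
queued = session.query(TaskInstance).filter(
    TaskInstance.state == State.QUEUED).all()

known_keys = set()  # the executor remembers nothing after a restart

for ti in queued:
    if ti.key not in known_keys:
        ti.state = State.NONE   # let the scheduler pick it up again
session.commit()
session.close()
```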

Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Bolke de Bruin
With shorter heartbeats you might still have some tasks being scheduled due to the time window. However, if a task detects it is running somewhere else, it should also terminate itself. [scheduler] # Task instances listen for external kill signal (when you clear tasks # from the CLI
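The quoted comment comes from the [scheduler] section of airflow.cfg. A small sketch (assuming Airflow 1.x) of reading the heartbeat settings it refers to; tightening job_heartbeat_sec shortens the window during which an externally-killed task keeps running before it notices.

```python
# Hedged sketch: inspect the scheduler heartbeat settings via Airflow 1.x's
# configuration module. The keys are the stock [scheduler] options.
from airflow import configuration as conf

job_heartbeat_sec = conf.getint('scheduler', 'job_heartbeat_sec')
scheduler_heartbeat_sec = conf.getint('scheduler', 'scheduler_heartbeat_sec')

print('task kill-signal poll interval: %ss' % job_heartbeat_sec)
print('scheduler heartbeat interval:   %ss' % scheduler_heartbeat_sec)
```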

Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?

2017-12-17 Thread Christopher Bockman
> P.S. I am assuming that you are talking about your scheduler going down, not workers. Correct (and, in some unfortunate scenarios, everything else...) > Normally a task will detect (on the heartbeat interval) whether its state was changed externally and will terminate itself. Hmm, that would
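A hedged sketch of the self-termination behavior described above: on each heartbeat the running task re-reads its own state from the metadata database and kills its local process if something external (e.g. an `airflow clear`) changed it. This mirrors the behavior, not Airflow's actual LocalTaskJob code.

```python
# Assumption-labelled sketch of the heartbeat self-check, not real Airflow code.
import os
import signal

from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State


def heartbeat_check(ti, pid):
    """Terminate the local worker process if the DB says we should no longer run."""
    session = settings.Session()
    ti.refresh_from_db(session=session)   # pick up external state changes
    session.close()
    if ti.state != State.RUNNING:
        os.kill(pid, signal.SIGTERM)
```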