Re: Scheduler silently dies

2017-03-27 Thread Bolke de Bruin
Cgroups solve this fine to protect the scheduler against being killed due to its children eating too much memory. Reaping means ensuring the parent is aware (join()) that its children are dead/done, not necessarily restarting them. Lowering parallelism in the context of a localexecutor solves the

Re: Scheduler silently dies

2017-03-27 Thread Gerard Toonstra
Reaping it basically means calling "is_alive()" followed by a "wait()", then restarting it. In this case, it's going to aggravate the situation even more, because the OOM condition will last much longer, potentially not even allowing people to log in on a shell. It's a possibility,

Re: Scheduler silently dies

2017-03-27 Thread Bolke de Bruin
Resource issues (like OOM) make sense as they are really hard to recover from. In this case I assume you are probably running heavy lifting (memory intensive) jobs on your machine. Reducing the parallelism parameter (if I remember correctly) will probably help you or increasing the memory availa
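As a rough illustration of why lowering parallelism can help with OOM, the arithmetic can be sketched as below. The function name max_safe_parallelism and all memory figures are made up for illustration; real per-task memory use has to be measured on the machine in question.

```python
def max_safe_parallelism(available_mb, per_task_peak_mb, reserve_mb=0):
    """Largest number of concurrent tasks whose combined peak memory fits."""
    if per_task_peak_mb <= 0:
        raise ValueError("per_task_peak_mb must be positive")
    usable_mb = available_mb - reserve_mb
    return max(usable_mb // per_task_peak_mb, 0)


if __name__ == "__main__":
    # Illustrative numbers only: 8 GiB machine, 1 GiB kept back for the OS and
    # the scheduler itself, 512 MiB peak per task.
    print(max_safe_parallelism(8192, 512, reserve_mb=1024))
```

With these made-up numbers the result is 14 concurrent tasks, well below a parallelism setting of 32.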

Re: Scheduler silently dies

2017-03-27 Thread Nicholas Hodgkinson
Actually, something pretty interesting; it seems I'm hitting OOM: [2017-03-25 02:31:40,546] {local_executor.py:31} INFO - LocalWorker running airflow run AUTOMATOR-sensor-v2 SENSOR--jira_case_close_times 2017-03-25T02:25:00 --local --pool sensor-pool -sd DAGS_FOLDER/automator-sensor.py Process Loc

Re: Scheduler silently dies

2017-03-27 Thread Bolke de Bruin
Defunct children means we are not reaping them. So the "recovering" thing might be partially right, we probably need to build in some monitoring mechanism in the local executor. B. Sent from my iPhone > On 27 Mar 2017, at 12:40, Gerard Toonstra wrote: > > Any more info from grepping that l

Re: Scheduler silently dies

2017-03-27 Thread Gerard Toonstra
Any more info from grepping that log file? G> On Mon, Mar 27, 2017 at 9:26 PM, Nicholas Hodgkinson < nik.hodgkin...@collectivehealth.com> wrote: > from airflow.cfg: > > [core] > ... > executor = LocalExecutor > parallelism = 32 > dag_concurrency = 16 > dags_are_paused_at_creation = True > non_po

Re: Scheduler silently dies

2017-03-27 Thread Nicholas Hodgkinson
from airflow.cfg:
[core]
...
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 16
...
Pretty much the defaults; I've never tweaked these values. -N nik.hodgkin...@collectivehealth.com On

Re: Scheduler silently dies

2017-03-27 Thread Gerard Toonstra
So looks like the localworkers are dying. Airflow does not recover from that. In SchedulerJob (jobs.py), you can see the "_execute_helper" function. This calls "executor.start()", which is implemented in local_executor.py in your case. The LocalExecutor is thus an object owned by the SchedulerJ

Re: Scheduler silently dies

2017-03-27 Thread harish singh
1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue (although I'm not sure why task progress has become slow, but that's not the issue we are discussing here, so I am ignoring it) 1.7: our prod is running 1.7 and we haven't seen the "defunct process" issue for more than a week n

Re: Scheduler silently dies

2017-03-27 Thread Nicholas Hodgkinson
1.7.1.3, however it seems this is still an issue in 1.8 according to other posters. I'll upgrade today. Yes, localexecutor. Will remove -n 10 -N nik.hodgkin...@collectivehealth.com On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin wrote: > Is this: > > 1. On 1.8.0? 1.7.1 is not supported anymor

Re: Scheduler silently dies

2017-03-27 Thread Bolke de Bruin
Is this: 1. On 1.8.0? 1.7.1 is not supported anymore. 2. localexecutor? You are running with N=10, can you try running without it? B. Sent from my iPhone > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson > wrote: > > Ok, I'm not sure how helpful this is and I'm working on getting some more

Re: Scheduler silently dies

2017-03-27 Thread Nicholas Hodgkinson
Ok, I'm not sure how helpful this is and I'm working on getting some more information, but here's some preliminary data: Process tree (`ps axjf`): 1 2391 2391 2391 ? -1 Ssl999 0:13 /usr/bin/python usr/local/bin/airflow scheduler -n 10 2391 2435 2391 2391 ? -1 Z

Re: Scheduler silently dies

2017-03-26 Thread Gerard Toonstra
> > > By the way, I remember that the scheduler would only spawn one or three > processes, but I may be wrong. > Right now when I start, it spawns 7 separate processes for the scheduler > (8 total) with some additional > ones spawned when the dag file processor starts. > > These other processes wer

Re: Scheduler silently dies

2017-03-26 Thread Gerard Toonstra
What may be helpful to dive into this a bit more is "pyrasite". You need gdb installed on the machine, but afterwards you can attach to a running process and then use python "payloads" to investigate what's going on, for example dump the stack trace per thread: http://pyrasite.readthedocs.io/en/
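For illustration, a per-thread stack dump can be written with the standard library alone. This is a sketch of the kind of code such a payload might run inside the target process, not pyrasite's own payload; the function name dump_thread_stacks is hypothetical.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a formatted stack trace for every thread in this process."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() maps each thread id to its current stack frame.
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (names.get(ident, "unknown"), ident))
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


if __name__ == "__main__":
    print(dump_thread_stacks())
```

Run from inside a hung process, the output shows where each thread is currently blocked.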

Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
In case you *think* you have encountered a scheduler *hang*, please provide a strace on the parent process, provide process list output that shows defunct scheduler processes, and provide *all* logging (main logs, scheduler processing log, task logs), preferably in debug mode (settings.py). Also s

Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
Please specify what “stop doing its job” means. It doesn’t log anything anymore? If it does, the scheduler hasn’t died and hasn’t stopped. B. > On 24 Mar 2017, at 18:20, Gael Magnan wrote: > > We encountered the same kind of problem with the scheduler that stopped > doing its job even after r

Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
Hi Harish, The below does *not* indicate a scheduler hang, it is a valid exception as mentioned earlier. Bolke. > On 24 Mar 2017, at 19:07, harish singh wrote: > > We have been using (1.7) over a year and never faced this issue. > The moment we switched to 1.8, I think we have hit this issue.

Re: Scheduler silently dies

2017-03-24 Thread Gael Magnan
We encountered the same kind of problem with the scheduler that stopped doing its job even after rebooting. I thought changing the start date or the state of a task instance might be to blame but I've never been able to pinpoint the problem either. We are using celery and docker if it helps. Le s

Re: Scheduler silently dies

2017-03-24 Thread Bolke de Bruin
For 1.8 and the issue you are seeing you might want to try increasing: DAGBAG_IMPORT_TIMEOUT under core which defaults to 30. This reminds me that doing timeouts this way cannot be done in child processes and might explain the defunct processes, so please test if that works. Bolke Sent from
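A sketch of a signal-based timeout, under the assumption that "timeouts this way" refers to a SIGALRM-style timeout (the message does not say so explicitly). The names time_limit and ImportTimeout are hypothetical. It is Unix-only, and Python only allows signal handlers to be installed from the main thread, which is one reason this pattern is fragile outside the main process flow.

```python
import signal
import time
from contextlib import contextmanager


class ImportTimeout(Exception):
    """Raised when the guarded block runs longer than allowed."""


@contextmanager
def time_limit(seconds):
    def _on_alarm(signum, frame):
        raise ImportTimeout("timed out after %s seconds" % seconds)

    # Install the handler, remembering the previous one so it can be restored.
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


if __name__ == "__main__":
    try:
        with time_limit(0.5):
            time.sleep(5)  # stand-in for a slow DAG file import
    except ImportTimeout as exc:
        print("aborted:", exc)
```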

Re: Scheduler silently dies

2017-03-24 Thread harish singh
We have been using (1.7) over a year and never faced this issue. The moment we switched to 1.8, I think we have hit this issue. The reason why I say "I think" is because I am not sure if it is the same issue. But whenever I restart, my pipeline proceeds. *Airflow 1.7* Having said that, in 1.7, I d

Re: Scheduler silently dies

2017-03-24 Thread Bolke de Bruin
We have been running *without* num runs for over a year (and have never used it). It is a very elusive issue which has not been reproducible. I would like more info on this, but it needs to be very elaborate, even to the point of access to the system exposing the behavior. Bolke Sent from my iPhone > On 24 Mar 2

Re: Scheduler silently dies

2017-03-24 Thread Vijay Ramesh
We literally have a cron job that restarts the scheduler every 30 min. Num runs didn't work consistently in rc4, sometimes it would restart itself and sometimes we'd end up with a few zombie scheduler processes and things would get stuck. Also running locally, without celery. On Mar 24, 2017 16:02

Re: Scheduler silently dies

2017-03-24 Thread lrohde
We have max runs set and still hit this. Our solution is dumber: monitor log output and kill the scheduler if it stops emitting. Works like a charm. > On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu wrote: > > Some solutions to this problem is restarting the scheduler frequently or > some sort
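The log-watching approach could be sketched roughly as follows, assuming "stops emitting" can be approximated by the log file's modification time. The function names, the threshold, and the choice of SIGTERM are all hypothetical; the message does not describe the actual implementation.

```python
import os
import signal
import time


def log_is_stale(log_path, max_silence_seconds, now=None):
    """True if the log file has not been modified within the allowed window."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        return True  # a missing or unreadable log is treated as "not emitting"
    return (now - last_write) > max_silence_seconds


def kill_if_stale(pid, log_path, max_silence_seconds):
    """Send SIGTERM to pid if its log has gone quiet; return True if signalled."""
    if log_is_stale(log_path, max_silence_seconds):
        os.kill(pid, signal.SIGTERM)
        return True
    return False
```

Something external (a supervisor or init system) would still be needed to start the scheduler again after it is killed.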

Re: Scheduler silently dies

2017-03-24 Thread F. Hakan Koklu
Some solutions to this problem are restarting the scheduler frequently or some sort of monitoring on the scheduler. We have set up a dag that pings cronitor (a dead man's snitch type of service) every 10 minutes and the snitch pages you when the scheduler dies and does not sen
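The heartbeat half of such a dead man's switch might look like the sketch below. The URL is a placeholder, not a real cronitor endpoint, and send_heartbeat is a hypothetical helper; the opener parameter exists only so the function can be exercised without network access.

```python
import urllib.request

# Hypothetical placeholder; use the check-in URL your monitoring service provides.
HEARTBEAT_URL = "https://monitoring.example.invalid/heartbeat"


def send_heartbeat(url=HEARTBEAT_URL, timeout=10, opener=urllib.request.urlopen):
    """Ping the monitoring endpoint; return True on HTTP 2xx, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are socket errors.
        return False
```

If a task calling this runs every 10 minutes, the monitoring service can alert once pings stop arriving, which is what happens when the scheduler no longer schedules the task.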

Re: Scheduler silently dies

2017-03-24 Thread Andrew Phillips
We use celery and run into it from time to time. Bang goes my theory ;-) At least, assuming it's the same underlying cause... Regards ap

Re: Scheduler silently dies

2017-03-24 Thread lrohde
We use celery and run into it from time to time. On Mar 24, 2017, at 4:16 PM, Andrew Phillips wrote: >> Does anyone have any idea why this happens? It seems like a bug that should >> be fixed, but we're all just living with it instead of trying to fix it. > > From the little I understand, one

Re: Scheduler silently dies

2017-03-24 Thread Andrew Phillips
Does anyone have any idea why this happens? It seems like a bug that should be fixed, but we're all just living with it instead of trying to fix it. From the little I understand, one of the main problems here is that it seems very difficult to reliably reproduce the issue. There are a bunch o

Re: Scheduler silently dies

2017-03-24 Thread Nicholas Hodgkinson
Does anyone have any idea why this happens? It seems like a bug that should be fixed, but we're all just living with it instead of trying to fix it. Just my two cents. -N nik.hodgkin...@collectivehealth.com On Fri, Mar 24, 2017 at 12:22 PM, harish singh wrote: > happens on our set up, on 1.8 a

Re: Scheduler silently dies

2017-03-24 Thread harish singh
happens on our set up, on 1.8 as well. we have kept this number to be 10 which seems to work well for us. On Fri, Mar 24, 2017 at 12:16 PM, Nicholas Hodgkinson < nik.hodgkin...@collectivehealth.com> wrote: > So I'm experiencing a problem that I can't figure out; namely my scheduler > just stops s

Scheduler silently dies

2017-03-24 Thread Nicholas Hodgkinson
So I'm experiencing a problem that I can't figure out; namely my scheduler just stops scheduling tasks for seemingly no reason. I've found this: https://bug623317.bugzilla.mozilla.org/show_bug.cgi?id=1286825 which seems to indicate that I should be restarting my scheduler frequently (I currently ha