Re: Scheduler silently dies

2017-03-27 Thread Bolke de Bruin
Resource issues (like OOM) make sense as they are really hard to recover from. In this case I assume you are probably running heavy lifting (memory intensive) jobs on your machine. Reducing the parallelism parameter (if I remember correctly) will probably help you or increasing the memory

Re: Scheduler silently dies

2017-03-27 Thread Nicholas Hodgkinson
Actually, something pretty interesting; it seems I'm hitting OOM:
[2017-03-25 02:31:40,546] {local_executor.py:31} INFO - LocalWorker running airflow run AUTOMATOR-sensor-v2 SENSOR--jira_case_close_times 2017-03-25T02:25:00 --local --pool sensor-pool -sd DAGS_FOLDER/automator-sensor.py
Process

Re: Scheduler silently dies

2017-03-27 Thread Bolke de Bruin
Defunct children mean we are not reaping them. So the "recovering" thing might be partially right; we probably need to build some monitoring mechanism into the local executor. B.
> On 27 Mar 2017, at 12:40, Gerard Toonstra wrote:
>
> Any more
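To make the "reaping" point concrete, here is a minimal sketch of what a monitoring pass over worker processes could look like. It is not the LocalExecutor code: the function name reap_finished is hypothetical, and plain subprocess.Popen objects stand in for the executor's workers as an assumption.

```python
import subprocess
import sys


def reap_finished(workers):
    """Split (name, Popen) pairs into still-running and finished workers.

    Hypothetical helper for illustration only. Popen.poll() is non-blocking
    and collects the exit status of a child that has terminated, so the
    child does not linger as a defunct (zombie) process.
    """
    running, finished = [], []
    for name, proc in workers:
        returncode = proc.poll()
        if returncode is None:
            running.append((name, proc))
        else:
            finished.append((name, returncode))
    return running, finished


if __name__ == "__main__":
    # One worker that exits abnormally right away, as a stand-in for a
    # worker that was killed (for example by the OOM killer).
    crashed = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    crashed.wait()
    print(reap_finished([("crashed", crashed)]))
```

A monitor like this would be called periodically by the parent; any worker that shows up in the finished list with a non-zero exit code could then be logged or replaced.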

Re: Scheduler silently dies

2017-03-27 Thread Gerard Toonstra
Any more info from grepping that log file?
G>
On Mon, Mar 27, 2017 at 9:26 PM, Nicholas Hodgkinson <nik.hodgkin...@collectivehealth.com> wrote:
> from airflow.cfg:
>
> [core]
> ...
> executor = LocalExecutor
> parallelism = 32
> dag_concurrency = 16
> dags_are_paused_at_creation = True
>

Re: Scheduler silently dies

2017-03-27 Thread Gerard Toonstra
So it looks like the LocalWorkers are dying. Airflow does not recover from that. In SchedulerJob (jobs.py), you can see the "_execute_helper" function. This calls "executor.start()", which is implemented in local_executor.py in your case. The LocalExecutor is thus an object owned by the

Re: Scheduler silently dies

2017-03-27 Thread harish singh
1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue (although I'm not sure why task progress has become slow, that's not the issue we are discussing here, so I am ignoring it).
1.7: our prod is running 1.7 and we haven't seen the "defunct process" issue for more than a week

Re: Scheduler silently dies

2017-03-26 Thread Gerard Toonstra
> By the way, I remember that the scheduler would only spawn one or three processes, but I may be wrong.
> Right now when I start, it spawns 7 separate processes for the scheduler (8 total) with some additional ones spawned when the dag file processor starts.
>
> These other processes

Re: Scheduler silently dies

2017-03-26 Thread Gerard Toonstra
What may be helpful to dive into this a bit more is "pyrasite". You need gdb installed on the machine, but afterwards you can attach to a running process and then use Python "payloads" to investigate what's going on, for example dump the stack trace per thread:
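As an illustration of the kind of per-thread stack dump meant here, the following is a minimal standard-library sketch. It is not the payload from the original message (which is cut off above); the function name dump_thread_stacks is made up for this example, and how the code gets injected into the running process is left out.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a text report with the current stack of every thread.

    Hypothetical helper for illustration. sys._current_frames() maps each
    thread id to the frame that thread is currently executing.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %s (%s)" % (names.get(thread_id, "unknown"), thread_id))
        for entry in traceback.format_stack(frame):
            lines.append(entry.rstrip("\n"))
    return "\n".join(lines)


if __name__ == "__main__":
    print(dump_thread_stacks())
```

Run inside a stuck process, a report like this shows where each thread is blocked, which is usually the first thing to check for a suspected hang.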

Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
In case you *think* you have encountered a scheduler *hang*, please provide a strace on the parent process, provide process list output that shows defunct scheduler processes, and provide *all* logging (main logs, scheduler processing log, task logs), preferably in debug mode (settings.py). Also

Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
Please specify what “stop doing its job” means. It doesn’t log anything anymore? If it does, the scheduler hasn’t died and hasn’t stopped. B. > On 24 Mar 2017, at 18:20, Gael Magnan wrote: > > We encountered the same kind of problem with the scheduler that stopped >

Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
Hi Harish, The below does *not* indicate a scheduler hang, it is a valid exception as mentioned earlier. Bolke. > On 24 Mar 2017, at 19:07, harish singh wrote: > > We have been using (1.7) over a year and never faced this issue. > The moment we switched to 1.8, I

Re: Scheduler silently dies

2017-03-24 Thread Bolke de Bruin
For 1.8 and the issue you are seeing you might want to try increasing: DAGBAG_IMPORT_TIMEOUT under core which defaults to 30. This reminds me that doing timeouts this way cannot be done in child processes and might explain the defunct processes, so please test if that works. Bolke Sent
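For readers unfamiliar with this style of timeout, below is a minimal sketch of an alarm-signal based timeout in Python. It is an assumption that this resembles the mechanism referred to above; the names ImportTimeout and alarm_timeout are invented for the example and are not Airflow identifiers. It only works on Unix and only when entered from the main thread of a process, because Python refuses to install signal handlers anywhere else.

```python
import signal
import time


class ImportTimeout(Exception):
    """Raised when the guarded block runs longer than allowed."""


class alarm_timeout:
    """Context manager that interrupts its body after `seconds` (Unix only)."""

    def __init__(self, seconds):
        self.seconds = seconds
        self._previous = None

    def _handle(self, signum, frame):
        raise ImportTimeout("timed out after %s seconds" % self.seconds)

    def __enter__(self):
        # signal.signal() raises ValueError outside the main thread.
        self._previous = signal.signal(signal.SIGALRM, self._handle)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc, tb):
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, self._previous)
        return False


if __name__ == "__main__":
    try:
        with alarm_timeout(0.1):
            time.sleep(2)  # stands in for a slow DAG file import
    except ImportTimeout as err:
        print(err)
```

Because the alarm and its handler are process-wide state, a timeout built this way is fragile once work is spread over several processes or threads, which is one reason to test whether raising the limit changes the behaviour.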

Re: Scheduler silently dies

2017-03-24 Thread harish singh
We have been using 1.7 for over a year and never faced this issue. The moment we switched to 1.8, I think we hit it. The reason I say "I think" is that I am not sure it is the same issue. But whenever I restart, my pipeline proceeds. *Airflow 1.7* Having said that, in 1.7, I

Re: Scheduler silently dies

2017-03-24 Thread Bolke de Bruin
We have been running *without* num runs for over a year (and have never used it). It is a very elusive issue which has not been reproducible. I would like more info on this, but it needs to be very elaborate, even to the point of access to the system exposing the behavior. Bolke
> On 24 Mar

Re: Scheduler silently dies

2017-03-24 Thread Vijay Ramesh
We literally have a cron job that restarts the scheduler every 30 min. Num runs didn't work consistently in rc4, sometimes it would restart itself and sometimes we'd end up with a few zombie scheduler processes and things would get stuck. Also running locally, without celery. On Mar 24, 2017

Re: Scheduler silently dies

2017-03-24 Thread lrohde
We have max runs set and still hit this. Our solution is dumber: monitor the log output and kill the scheduler if it stops emitting. Works like a charm. > On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu wrote: > > Some solutions to this problem is restarting the scheduler
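A rough sketch of such a log watchdog, for anyone who wants to try the same approach. Everything here is an assumption for illustration: the helper names, the idea of using the log file's modification time as the "still emitting" signal, and SIGTERM as the kill signal. Restarting the scheduler afterwards is left to whatever supervises the process.

```python
import os
import signal
import time


def log_is_stale(log_path, max_silence_seconds, now=None):
    """Return True if the log file has not been written to recently."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # A missing or unreadable log is treated as "not emitting".
        return True
    return (now - last_write) > max_silence_seconds


def kill_if_silent(pid, log_path, max_silence_seconds):
    """Send SIGTERM to `pid` if its log has gone quiet; return True if sent."""
    if log_is_stale(log_path, max_silence_seconds):
        os.kill(pid, signal.SIGTERM)
        return True
    return False
```

Something like kill_if_silent(scheduler_pid, scheduler_log_path, 300) could be run from cron every few minutes, where the pid, the log path and the 300 second threshold are placeholders to be filled in per installation.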

Re: Scheduler silently dies

2017-03-24 Thread F. Hakan Koklu
Some solutions to this problem are restarting the scheduler frequently or some sort of monitoring on the scheduler. We have set up a dag that pings cronitor (a dead man's snitch type of service) every 10 minutes and the snitch pages you when the scheduler dies and does not

Re: Scheduler silently dies

2017-03-24 Thread Andrew Phillips
We use celery and run into it from time to time. Bang goes my theory ;-) At least, assuming it's the same underlying cause... Regards ap

Re: Scheduler silently dies

2017-03-24 Thread lrohde
We use celery and run into it from time to time. On Mar 24, 2017, at 4:16 PM, Andrew Phillips wrote: >> Does anyone have any idea why this happens? It seems like a bug that should >> be fixed, but we're all just living with it instead of trying to fix it. > > From the

Re: Scheduler silently dies

2017-03-24 Thread Andrew Phillips
Does anyone have any idea why this happens? It seems like a bug that should be fixed, but we're all just living with it instead of trying to fix it. From the little I understand, one of the main problems here is that it seems very difficult to reliably reproduce the issue. There are a bunch

Re: Scheduler silently dies

2017-03-24 Thread Nicholas Hodgkinson
Does anyone have any idea why this happens? It seems like a bug that should be fixed, but we're all just living with it instead of trying to fix it. Just my two cents. -N nik.hodgkin...@collectivehealth.com On Fri, Mar 24, 2017 at 12:22 PM, harish singh wrote: >

Re: Scheduler silently dies

2017-03-24 Thread harish singh
Happens on our setup, on 1.8 as well. We have kept this number at 10, which seems to work well for us. On Fri, Mar 24, 2017 at 12:16 PM, Nicholas Hodgkinson < nik.hodgkin...@collectivehealth.com> wrote: > So I'm experiencing a problem that I can't figure out; namely my scheduler > just stops

Scheduler silently dies

2017-03-24 Thread Nicholas Hodgkinson
So I'm experiencing a problem that I can't figure out; namely my scheduler just stops scheduling tasks for seemingly no reason. I've found this: https://bug623317.bugzilla.mozilla.org/show_bug.cgi?id=1286825 which seems to indicate that I should be restarting my scheduler frequently (I currently