How many workers do you have, and what heartbeat are you using?
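For reference, the heartbeat is set where the scheduler is instantiated, and
the worker count is simply how many -K processes you start. A minimal sketch,
assuming the usual models/scheduler.py setup:

    # models/scheduler.py -- heartbeat is in seconds (3 is the web2py default)
    from gluon.scheduler import Scheduler
    scheduler = Scheduler(db, heartbeat=3)

    # Workers are separate OS processes; repeating the app name starts more:
    #   python web2py.py -K myapp,myapp,myapp    # three workers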
It's no news that the scheduler_worker table is the most "congested" one,
because it's where most of the IPC happens. That's why I should release a
version that does IPC on Redis, but alas, I haven't found the time yet to
polish the code.
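To give an idea of why Redis would help: heartbeats become volatile keys
instead of contended rows, so dead-worker cleanup needs no DELETE at all. A
hypothetical sketch (this is NOT shipped web2py code; it assumes the redis-py
package and an illustrative key scheme):

    import time
    import redis  # assumption: the redis-py client

    r = redis.Redis()
    WORKER_NAME = 'UKVMAAS#6316'   # borrowed from the logs below, for flavour
    HEARTBEAT = 3                  # seconds, mirroring the scheduler default

    while True:
        # SET with an expiry: if the worker stops beating, its key simply
        # vanishes, replacing the DELETE-dead-workers transaction entirely.
        r.set('scheduler:worker:%s' % WORKER_NAME, 'ACTIVE', ex=HEARTBEAT * 3)
        time.sleep(HEARTBEAT)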
BTW: as it is, the code can't be made "less pushy": every transaction you
see is necessary to avoid bugs.
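To make that concrete: the split Andy asks about below would look roughly
like this -- a schematic sketch only, emphatically not the actual
gluon/scheduler.py code -- and the comments mark the window each extra
commit would open:

    # Schematic only: `sw` stands for db.scheduler_worker.
    def send_heartbeat_split(db, sw, my_name, now, dead_cutoff):
        # Step 1: touch only our own row, then release its locks at once.
        db(sw.worker_name == my_name).update(last_heartbeat=now,
                                             status='ACTIVE')
        db.commit()

        # Step 2: dead-worker cleanup in its own transaction. Between the
        # two commits another worker sees a half-updated table and can,
        # e.g., elect a second ticker -- which is why the real code keeps
        # the steps inside a single transaction.
        db(sw.last_heartbeat < dead_cutoff).delete()
        db.commit()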
On Tuesday, November 17, 2015 at 1:47:00 AM UTC+1, Andy Southgate wrote:
>
> Hi all,
>
> First of all, many thanks to all involved in web2py, it's been working out
> very well for me :)
>
> I've been using the scheduler a lot, and I think I've found a source of
> database deadlocks. As far as I can tell, it happens when the system is
> deleting what it thinks are dead workers (because their heartbeats have
> timed out) and electing a new ticker, and two or more worker processes try
> to do this at the same time. They each update their own heartbeat and then
> run several update/delete operations on a number of the scheduler_worker
> rows, without intervening db.commits, and the many-row operations collide
> with each other. These can range from simple:
>
> 2015-11-16 11:02:33 GMT PID=5244 trans=72287062 ERROR: 40P01: deadlock
> detected
> 2015-11-16 11:02:33 GMT PID=5244 trans=72287062 DETAIL: Process 5244
> waits for ShareLock on transaction 72287313; blocked by process 6236.
> Process 6236 waits for ShareLock on transaction 72287062; blocked by
> process 5244.
> Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE
> (scheduler_worker.worker_name <> 'UKVMAAS#6316');
> Process 6236: DELETE FROM scheduler_worker WHERE
> (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND
> (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat
> < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
>
> to spectacular:
>
> 2015-11-16 11:02:16 GMT PID=6772 trans=72287377 ERROR: 40P01: deadlock
> detected
> 2015-11-16 11:02:16 GMT PID=6772 trans=72287377 DETAIL: Process 6772
> waits for ExclusiveLock on tuple (311,9) of relation 16681 of database
> 16384; blocked by process 564.
> Process 564 waits for ShareLock on transaction 72287313; blocked by
> process 6236.
> Process 6236 waits for AccessExclusiveLock on tuple (0,19) of relation
> 16908 of database 16384; blocked by process 6804.
> Process 6804 waits for ShareLock on transaction 72287062; blocked by
> process 5244.
> Process 5244 waits for ShareLock on transaction 72287388; blocked by
> process 728.
> Process 728 waits for ExclusiveLock on tuple (311,9) of relation 16681 of
> database 16384; blocked by process 6772.
> Process 6772: UPDATE scheduler_task SET
> status='QUEUED',assigned_worker_name='' WHERE
> ((scheduler_task.assigned_worker_name IN (SELECT
> scheduler_worker.worker_name FROM scheduler_worker WHERE
> (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:01') AND
> (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat
> < '2015-11-16 10:47:01') AND (scheduler_worker.status <> 'ACTIVE'))))) AND
> (scheduler_task.status = 'RUNNING'));
> Process 564: UPDATE scheduler_task SET
> status='QUEUED',assigned_worker_name='' WHERE
> ((scheduler_task.assigned_worker_name IN (SELECT
> scheduler_worker.worker_name FROM scheduler_worker WHERE
> (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:45') AND
> (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat
> < '2015-11-16 10:46:45') AND (scheduler_worker.status <> 'ACTIVE'))))) AND
> (scheduler_task.status = 'RUNNING'));
> Process 6236: DELETE FROM scheduler_worker WHERE
> (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND
> (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat
> < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
> Process 6804: UPDATE scheduler_worker SET
> status='ACTIVE',last_heartbeat='2015-11-16
> 11:00:36',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0,
> "queue": 0, "empty_runs": 11683, "sleep": 1.0, "distribution": null,
> "total": 0}' WHERE (scheduler_worker.worker_name = 'UKVMAAS#4280');
> Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE
> (scheduler_worker.worker_name <> 'UKVMAAS#6316');
> Process 728: UPDATE scheduler_task SET
> status='QUEUED',assigned_worker_name='' WHERE
> ((scheduler_task.assigned_worker_name IN (SELECT
> scheduler_worker.worker_name FROM scheduler_worker WHERE
> (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:03') AND
> (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat
> < '2015-11-16 10:47:03') AND (scheduler_worker.status <> 'ACTIVE'))))) AND
> (scheduler_task.status = 'RUNNING'));
> (from PostgreSQL 9.4.4 logs set up for debugging; web2py 2.12.3, Windows 10)
>
> This seems to happen more often than you'd hope, because the earlier
> database operations tend to synchronise multiple workers in time if they're
> already waiting on a lock. The worst case I've found is setting the
> deadlock timeout in PostgreSQL longer than the heartbeat timeout, so that a
> number of workers are released at once when the DB times out the deadlocked
> transaction. This can get stuck in a loop that immediately recreates the
> same problem.
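As an aside on the timeouts: you can read PostgreSQL's setting straight from
the DAL connection and compare it with your heartbeat window -- a quick
sketch, assuming `db` is the connection the scheduler uses:

    # deadlock_timeout is how long PostgreSQL waits before running deadlock
    # detection; the PostgreSQL default is '1s'.
    rows = db.executesql("SHOW deadlock_timeout;")
    print('deadlock_timeout:', rows[0][0])
    # Per the observation above, keep this well below the heartbeat timeout,
    # so deadlocked workers are broken up one at a time rather than released
    # in a synchronised burst.

    # It can also be changed in postgresql.conf, or per session by a
    # superuser, e.g.:
    #   db.executesql("SET deadlock_timeout = '1s';")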
>
> If this makes sense, is it possible to split up send_heartbeat into more
> transactions without introducing other problems?
>
> Many thanks,
>
> Andy S.
>
> PS: My heartbeats time out because this is a VM that occasionally gets
> starved of resources, so it's not web2py's fault :)