Yurik added a comment.

I also disagree :) The real monitoring should not look at the process running at all. It should only look at the last timestamp - see how far behind WDQS is. If it gets behind further than X, send the alert - and that would be a very stable indicator that something is wrong - no matter if its the process that hung, or crashed, or simply cannot cope with the amount of data. On the other hand, the updater service itself should be resiliant to any kinds of problems - if there is an intermittent problem like a temporary DNS is down (like I had), the service will continue trying, and will self-recover the moment network is back up. This is the same logic as in any router or replication service - they always keeps trying until succeeding.


TASK DETAIL
https://phabricator.wikimedia.org/T176927

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Yurik
Cc: Smalyshev, Aklapper, Yurik, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, Avner, debt, Gehel, Jonas, FloNight, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to