Yurik added a comment.
I also disagree :) Real monitoring should not look at whether the process is running at all. It should only look at the last update timestamp to see how far behind WDQS is. If it falls further behind than X, send the alert. That would be a very stable indicator that something is wrong, no matter whether the process hung, crashed, or simply cannot cope with the amount of data. On the other hand, the updater service itself should be resilient to any kind of problem: if there is an intermittent problem such as a temporary DNS outage (like I had), the service should keep trying and self-recover the moment the network is back up. This is the same logic as in any router or replication service: they always keep trying until they succeed.
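To make the two halves of that concrete, here is a rough Python sketch. It is not the actual WDQS updater or alerting code. The names `lag_exceeded` and `retry_until_success`, the threshold, and the doubling back-off are all my own assumptions for illustration; the only ideas taken from above are "alert on lag beyond X" and "keep retrying until it works".

```python
import time
from datetime import datetime, timedelta, timezone


def lag_exceeded(last_update, max_lag, now=None):
    """Monitoring side: True when the last applied update is older than max_lag.

    Hypothetical helper. It never checks whether a process is alive,
    only how stale the data is.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_update > max_lag


def retry_until_success(operation, delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Updater side: call operation() until it stops raising.

    The doubling back-off is an assumption; a fixed delay would fit the
    comment above equally well. A real service would catch narrower
    exception types than Exception.
    """
    while True:
        try:
            return operation()
        except Exception as exc:
            print(f"update failed ({exc!r}), retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With something like this, the alert fires from `lag_exceeded(...)` alone, whether the updater hung, crashed, or is just too slow, while a DNS blip only shows up as a few retry lines in the log.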
Cc: Smalyshev, Aklapper, Yurik, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, Avner, debt, Gehel, Jonas, FloNight, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
