Hello all,

Recently, as part of our work documenting Administration Procedures for the Distributed Guice James product, we have been reflecting on how to conduct monitoring, which led to some good discussions.

Currently, monitoring of `mailbox event processing` and `mail processing` can be achieved via logs (i.e. reviewing ERROR logs, etc.). However, relying on logs requires a correct Kibana configuration, which in turn requires good information. Moreover:

- It makes retries and final attempts non-trivial to distinguish.
- Admins generally monitor logs using a time window. Events older than this time window are ignored.

We can think of several mechanisms to improve this situation:

- Having, for instance, a health check such as MailboxEventProcessingHealthCheck, ensuring that the dead-letter is empty, or returning "degraded" otherwise.
- Having a metric displayed on a dashboard. For the dead-letter example, a boolean text field can be enough.

While interesting, the health check option has received the following criticisms so far:

- A perfectly behaving James server might report some failed processing entries (for example on some borderline EML parsing), leading to a degraded status for an otherwise perfectly working James server (for both the mail processing and mailbox processing cases).
- Through Grafana, the admin will have the information directly available. Currently, health checks require her to execute the health check via webadmin. Requiring more actions is generally the best way to have none of them taken.

We would be very interested in feedback on this topic, in order to provide a friendlier admin experience.

Best regards,

Benoit
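
P.S. To make the health check idea above a bit more concrete, here is a minimal, standalone Java sketch of the "dead-letter empty, otherwise degraded" rule. It does not use James's actual health check API: the `DeadLetterStore` interface, the `Status` enum and the class name are hypothetical names introduced only for illustration.

```java
// Standalone sketch only: none of these types come from the James code base.
final class DeadLetterHealthSketch {

    // Hypothetical status values, mirroring the "degraded" wording above.
    enum Status { HEALTHY, DEGRADED }

    // Hypothetical view of the dead-letter: we only need its current size.
    interface DeadLetterStore {
        long size();
    }

    // Healthy when the dead-letter is empty, degraded otherwise.
    static Status check(DeadLetterStore deadLetter) {
        return deadLetter.size() == 0 ? Status.HEALTHY : Status.DEGRADED;
    }
}
```

With this rule, a single entry left in the dead-letter is enough to report a degraded status, which is exactly the behaviour questioned in the first criticism listed above.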