> Also, we would still need a way to alert admin in case of faulty task
execution (back to the healthcheck?)
I think log messages are still the most frequent method used to report
internal errors. They are easy to integrate with external
observability systems which will handle the alerting for James and all
the other systems.
If you send a message to a command channel you can also catch
unacknowledged commands in a deadletter and monitor that.
Cheers
Jean
On Wed, 28 Sept 2022 at 04:25, Benoit TELLIER <btell...@apache.org>
wrote:
Hello Jean,
Thanks for the reply!
I'll inline my answers.
On 28/09/2022 03:27, Jean Helou wrote:
Hi Benoît
This has been at the back of my mind for a while. Tonight I
finally have time to formulate my thoughts.
I do like the idea of modular health checks. In a system such as
James, where instances can have wildly
varying modules enabled, having a modular health check system
definitely sounds great.
And... Modularity can be great to solve conflicts / lukewarmness
issues :-)
I will add it to our backlog.
I am lukewarm on the motivation for introducing said modular
system: monitoring externally
triggered task executions.
I'll first reformulate what you wrote:
* Various implementations of features in james can produce
inconsistent or garbage data over time.
* The accumulation of such data can be detrimental to James
operational performance if left unchecked.
If it was only performance... It's detrimental to consistency too...
* James provides batch processing tasks to fix these issues to be
called from an external scheduler
You propose to add configurable metadata relative to these tasks,
in particular expected execution frequency and last execution date.
This "metadata" would be compared with actual execution to
contribute to the healthcheck of the system.
If this is indeed what you meant, I would like to offer a
different point of view.
Overall, I think that monitoring the execution of these tasks
falls outside the scope of james since they are externally triggered.
I feel the task examples you provided can be grouped in two
categories for which a different approach could be needed:
- Tasks to clean internally generated inconsistencies/garbage
data which may eventually degrade the operation of the system:
Since we have tasks that can fix the inconsistencies or garbage
data, it means we are able to detect such inconsistent
or garbage data.
In such a case we could possibly contribute that to the health
check instead of reporting on some task execution. We don't
really care if the task has been executed or not as long as the
data is consistent and clean. We do care about an ever-increasing
volume of inconsistent/garbage data that threatens
system properties.
True.
We could have corrective tasks maintain garbage / non-garbage
and consistent / inconsistent ratios. These would themselves back
a healthcheck.
However, no running tasks implies no detection which makes the
entire pyramid unstable.
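A minimal sketch of the ratio idea, in Java. Everything here is hypothetical (the class name, the threshold, the counts passed in); none of it is an existing James API. It assumes a corrective task reports how many entries it scanned and how many it found to be garbage, and it returns UNKNOWN when no scan result is available, which is the "no running tasks implies no detection" case.

```java
// Hypothetical sketch: names and thresholds are illustrative, not James APIs.
final class GarbageRatioCheck {
    enum Status { HEALTHY, DEGRADED, UNKNOWN }

    // Highest garbage / total ratio still considered healthy (assumed configurable).
    private final double maxGarbageRatio;

    GarbageRatioCheck(double maxGarbageRatio) {
        this.maxGarbageRatio = maxGarbageRatio;
    }

    // Counts are assumed to come from the last run of a corrective task.
    Status check(long garbageCount, long totalCount) {
        if (totalCount <= 0) {
            // No data point: no task ran (or nothing was scanned), so nothing was detected.
            return Status.UNKNOWN;
        }
        double ratio = (double) garbageCount / totalCount;
        return ratio > maxGarbageRatio ? Status.DEGRADED : Status.HEALTHY;
    }
}
```

With a threshold of 0.1, a scan reporting 5 garbage entries out of 100 would be healthy and 20 out of 100 would be degraded.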
My concern is to deliver to customers / operational teams a
solution that requires minimum knowledge / set up to be operated
correctly. The furthest I can go is to check their configuration.
How do I ensure correctness of their set up? How can I take
responsibility of what they did and support it?
If our own code/implementation generates these, maybe we could
come up with a way to fix the situation automatically
and internally by default so that the system eventually converges
to a clean state ?
If corrective tasks are run periodically, alerts would be raised,
and system would converge.
I think JMAP uploads, C* consistency check/fixes and the
opensearch tasks belong to this category, probably the blob
garbage collection too
Even purging deleted data falls in this category: since the
recommendation is to run a task from an external cron to clean it
(thus having no system or context) we might as well run it
automatically and let the system converge.
Thus bringing in a distributed scheduler in James.
At the very least, it would require "distributed election"
semantics, which our current middleware stack does not support. I'm
not keen on adding a consul/etcd/zookeeper into the mix for that
sole purpose. I know about libraries like atomix, but configuring
it, especially in a cloudy environment could be trickier than it
sounds...
Sure, it would limit the amount of work needed for our admins.
Also, we would still need a way to alert admin in case of faulty
task execution (back to the healthcheck?)
I think both approaches might not be mutually exclusive, but
rather complementary, by the way.
- Tasks that are related to operation hygiene more than system
health
I don't think spam reports generation should contribute to the
health of the system. The system will continue to operate just
fine even if
no spam reports are generated. I do agree that it's good
operational hygiene though.
True, but if you make it optional it is down to the guy
configuring the server to choose (or to the consultant reviewing
the configuration).
Best regards,
Benoit
regards,
jean
On Wed, 21 Sept 2022 at 11:28, Benoit TELLIER
<btell...@apache.org> wrote:
Hello all,
Today James relies on externally scheduled tasks for its
proper behaviour.
Non-exhaustive examples of tasks that *could* be critical to
the proper
behaviour of James and *may/should* be monitored:
- Blob deduplication Garbage collection
- JMAP uploads clean-up
- Cassandra consistency checks/fix
- Spam reports
- Auditing / fixing OpenSearch indexing
- Purging data within the deleted message vault
- ...
What can go wrong:
- The admin did not configure / set up the CRON
- The CRON executes badly
- There is an error running the tasks
- The task is never scheduled because task execution
throughput is too
low, etc...
What I would love:
- A green button if required tasks are well executed
- An orange button if investigation is required because the
task is
never scheduled
The overall supervision for James today revolves around the
concept of
healthcheck:
- Periodically run
- Results exposed through the logs
- Callable via HTTP to interoperate with alerting stacks
(prometheus /
load-balancer / Zabbix / ...)
- Hopefully one day will be on the first page of James
administration
site....
So, the proposal is to implement a healthcheck for
supervising task
execution. One would configure
the tasks he expects to run successfully, and a time period
in which he
wishes the task to be well executed.
We would add configuration properties within
healthcheck.properties for this. Specifying no tasks makes
the check a noop, effectively disabling it (default behaviour).
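The proposal above could be sketched roughly as follows, in Java. This is only an illustration under assumptions: the class name, the status values and the task name used in the usage note are hypothetical, and how the expected periods would actually be read from healthcheck.properties is left out. The check compares the last successful run of each configured task with its expected period, and an empty configuration makes it a noop.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the James codebase.
final class TaskExecutionCheck {
    enum Status { HEALTHY, DEGRADED }

    // Task name -> maximum allowed time since the last successful run.
    private final Map<String, Duration> expectedPeriods;

    TaskExecutionCheck(Map<String, Duration> expectedPeriods) {
        this.expectedPeriods = new LinkedHashMap<>(expectedPeriods);
    }

    // lastSuccess: task name -> completion time of the last successful run.
    Status check(Map<String, Instant> lastSuccess, Instant now) {
        for (Map.Entry<String, Duration> expected : expectedPeriods.entrySet()) {
            Instant last = lastSuccess.get(expected.getKey());
            boolean neverRan = last == null;
            if (neverRan || Duration.between(last, now).compareTo(expected.getValue()) > 0) {
                // "Orange button": the task never ran, or ran too long ago.
                return Status.DEGRADED;
            }
        }
        // "Green button"; also reached when no task is configured (noop).
        return Status.HEALTHY;
    }
}
```

For instance, configuring a made-up task name such as "blob-gc" with a period of one day would report DEGRADED until a successful run is recorded, and again once the last success is more than a day old.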
I wish to contribute such a feature to the James project.
Alternatives I have:
- Do this as part of Linagora products (TMail) if this is
non-consensual within the community
- Propose modularisation for healthchecks, allowing custom
health
checks and treating this as an opt-in extension
(extensions-jars loading
mechanism).
Thoughts?
Best regards
Benoit