Hello Jean,
Thanks for the reply!
I'll inline my answers.
On 28/09/2022 03:27, Jean Helou wrote:
Hi Benoît
This has been at the back of my mind for a while. Tonight I finally
have time to formulate my thoughts.
I do like the idea of modular health checks. In a system such as james
where instances can have wildly
varying modules enabled, having a modular health check system
definitely sounds great.
And... Modularity can be great to solve conflicts / lukewarmness issues :-)
I will add it to our backlog.
I am lukewarm on the motivation for introducing said modular system :
monitoring externally
triggered task executions.
I'll first reformulate what you wrote:
* Various implementations of features in james can produce
inconsistent or garbage data over time.
* The accumulation of such data can be detrimental to James
operational performance if left unchecked.
If it was only performance... It's detrimental to consistency too...
* James provides batch processing tasks to fix these issues to be
called from an external scheduler
You propose to add configurable metadata relative to these tasks, in
particular expected execution frequency and last execution date.
This "metadata" would be compared with actual execution to contribute
to the healthcheck of the system.
If this is indeed what you meant, I would like to offer a different
point of view.
Overall, I think that monitoring the execution of these tasks falls
outside the scope of james since they are externally triggered.
I feel the task examples you provided can be grouped in two categories
for which a different approach could be needed :
- Tasks to clean Internally generated inconsistencies/garbage data
which may eventually degrade the operation of the system :
Since we have tasks that can fix the inconsistencies or garbage data,
it means we are able to detect such inconsistent
or garbage data.
In such a case we could possibly contribute that to the health check
instead of reporting on some task execution. We don't
really care if the task has been executed or not as long as the data
is consistent and clean. We do care about an ever
increasing volume of inconsistent/garbage data that threatens system
properties.
True.
We could have corrective tasks maintain garbage / non-garbage and
consistent / inconsistent ratios. These would themselves back a healthcheck.
However, no running tasks implies no detection which makes the entire
pyramid unstable.
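To make the ratio idea concrete, here is a minimal sketch in plain Java. It is not James code: the class name, the threshold and the "observation age" parameter are all assumptions made for illustration. It reports a degraded state when the garbage ratio measured by the last corrective run is too high, and also when that measurement is missing or too old, since no running task means no detection.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch, not part of the James API. */
class GarbageRatioHealthCheck {
    enum Status { HEALTHY, DEGRADED }

    private final double maxGarbageRatio;      // assumed threshold, e.g. 0.2 for 20%
    private final Duration maxObservationAge;  // assumed maximum age of the last measurement

    GarbageRatioHealthCheck(double maxGarbageRatio, Duration maxObservationAge) {
        this.maxGarbageRatio = maxGarbageRatio;
        this.maxObservationAge = maxObservationAge;
    }

    /**
     * @param garbageCount entries flagged as garbage by the last corrective run
     * @param totalCount   entries inspected by the last corrective run
     * @param observedAt   completion time of that run, or null if it never ran
     */
    Status check(long garbageCount, long totalCount, Instant observedAt, Instant now) {
        // No measurement, or a stale one: we cannot vouch for the data.
        if (observedAt == null
                || Duration.between(observedAt, now).compareTo(maxObservationAge) > 0) {
            return Status.DEGRADED;
        }
        // Nothing inspected means nothing to complain about.
        if (totalCount == 0) {
            return Status.HEALTHY;
        }
        double ratio = (double) garbageCount / totalCount;
        return ratio > maxGarbageRatio ? Status.DEGRADED : Status.HEALTHY;
    }
}
```

With a 20% threshold and a one-day maximum age, 10 garbage entries out of 100 measured an hour ago would be healthy, while the same numbers measured three days ago would be degraded.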
My concern is to deliver to customers / operational team a solution that
requires minimum knowledge / set up to be operated correctly. The furthest
I can go is to check their conf.
How do I ensure correctness of their set up? How can I take
responsibility of what they did and support it?
If our own code/implementation generates these, maybe we could come up
with a way to fix the situation automatically
and internally by default so that the system eventually converges to a
clean state ?
If corrective tasks are run periodically, alerts would be raised, and
the system would converge.
I think JMAP uploads, C* consistency check/fixes and the opensearch
tasks belong to this category, probably the blob garbage collection too
even purging deleted data falls in this category: since the
recommendation is to run a task from an external cron to clean it (thus
having no system or context ) we might as well run it automatically
and let the system converge.
Thus bringing in a distributed scheduler in James.
At the very least, it would require "distributed election" semantic,
which our current middleware stack does not support. I'm not keen on adding
a consul/etcd/zookeeper into the mix for that sole purpose. I know about
libraries like atomix, but configuring it, especially in a cloudy
environment could be trickier than it sounds...
Sure, it would limit the amount of work needed for our admins.
Also, we would still need a way to alert admin in case of faulty task
execution (back to the healthcheck?)
By the way, I think both approaches are not mutually exclusive, but
rather complementary.
- Tasks that are related to operation hygiene more than system health
I don't think spam reports generation should contribute to the health
of the system. The system will continue to operate just fine even if
no spam reports are generated ? I do agree that it's good operational
hygiene though.
True, but if you make it optional it is down to the guy configuring the
server to choose (or to the consultant reviewing the configuration).
Best regards,
Benoit
regards,
jean
Le mer. 21 sept. 2022 à 11:28, Benoit TELLIER <btell...@apache.org> a
écrit :
Hello all,
Today James relies on externally scheduled tasks for its well
behaving.
Non exhaustive example of tasks that *could* be critical to the well
behaving of James and *may/should* be monitored:
- Blob deduplication Garbage collection
- JMAP uploads clean-up
- Cassandra consistency checks/fix
- Spam reports
- Auditing / fixing OpenSearch indexing
- Purging data within the deleted message vault
- ...
What can go wrong:
- The admin did not configure / set up the CRON
- The CRON executes badly
- There is an error running the tasks
- The task is never scheduled because task execution throughput
is too low, etc...
What I would love:
- A green button if required tasks are well executed
- An orange button if investigation is required because the task is
never scheduled
The overall supervision for James today revolves around the
concept of healthcheck:
- Periodically run
- Results exposed through the logs
- Callable via HTTP to interoperate with alerting stacks
(prometheus /
load-balancer / Zabbix / ...)
- Hopefully one day will be on the first page of James
administration
site....
So, the proposal is to implement a healthcheck for supervising task
execution. One would configure
the tasks he expects to run successfully, and a time period in
which he
wishes the task to be well executed.
We would add a configuration property within
healthcheck.properties for this. Specifying no tasks makes
the check a no-op, effectively disabling it (default behaviour).
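A rough sketch of the logic I have in mind, in plain Java. Nothing here is existing James API: the class, the task names and the way last executions are passed in are assumptions for illustration only. Each configured task type is associated with a period, and the check is degraded as soon as one of them has no successful run within its period.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not part of the James API. */
class TaskExecutionHealthCheck {
    enum Status { HEALTHY, DEGRADED }

    // Task type -> maximum accepted delay since its last successful run.
    private final Map<String, Duration> expectedPeriods;

    TaskExecutionHealthCheck(Map<String, Duration> expectedPeriods) {
        this.expectedPeriods = new LinkedHashMap<>(expectedPeriods);
    }

    /**
     * @param lastSuccess task type -> completion time of its last successful run;
     *                    task types that never succeeded are simply absent
     */
    Status check(Map<String, Instant> lastSuccess, Instant now) {
        for (Map.Entry<String, Duration> expected : expectedPeriods.entrySet()) {
            Instant last = lastSuccess.get(expected.getKey());
            boolean neverRan = last == null;
            boolean tooOld = !neverRan
                && Duration.between(last, now).compareTo(expected.getValue()) > 0;
            if (neverRan || tooOld) {
                return Status.DEGRADED;
            }
        }
        // Also reached when no task is configured: the check is then a no-op.
        return Status.HEALTHY;
    }
}
```

For instance, configuring a hypothetical "blob-gc" task with a one-day period would report degraded if the last successful run completed three days ago, or never completed at all. How the configuration is parsed and where the last executions are read from are left open here.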
I wish to contribute such a feature to the James project.
Alternatives I have:
- Do this as part of Linagora products (TMail) if this is
non-consensual within the community
- Propose modularisation for healthchecks, allowing custom health
checks and treating this as an opt-in extension (extensions-jars
loading
mechanism).
Thoughts?
Best regards
Benoit
---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org