Hello Jean,

Thanks for the reply!

I'll inline my answers.

On 28/09/2022 03:27, Jean Helou wrote:
Hi Benoît

This has been at the back of my mind for a while. Tonight I finally have time to formulate my thoughts.

I do like the idea of modular health checks. In a system such as James, where instances can have wildly varying modules enabled, having a modular health check system definitely sounds great.
And... Modularity can be great to solve conflicts / lukewarmness issues :-)

I will add it to our backlog.

I am lukewarm on the motivation for introducing said modular system: monitoring externally triggered task executions.

I'll first reformulate what you wrote:
* Various implementations of features in James can produce inconsistent or garbage data over time.
* The accumulation of such data can be detrimental to James operational performance if left unchecked.
If it was only performance... It's detrimental to consistency too...
* James provides batch processing tasks to fix these issues to be called from an external scheduler

You propose to add configurable metadata relative to these tasks, in particular expected execution frequency and last execution date. This "metadata" would be compared with actual execution to contribute to the healthcheck of the system.

If this is indeed what you meant, I would like to offer a different point of view. Overall, I think that monitoring the execution of these tasks falls outside the scope of James since they are externally triggered.

I feel the task examples you provided can be grouped in two categories for which a different approach could be needed:

- Tasks to clean internally generated inconsistencies/garbage data which may eventually degrade the operation of the system: since we have tasks that can fix the inconsistencies or garbage data, it means we are able to detect such inconsistent or garbage data.
In such a case we could possibly contribute that to the health check instead of reporting on some task execution. We don't really care if the task has been executed or not as long as the data is consistent and clean. We do care about an ever-increasing volume of inconsistent/garbage data that threatens system properties.
True.

We could have corrective tasks maintain garbage / non-garbage and consistent / inconsistent ratios, which would themselves back a healthcheck.

However, no running tasks implies no detection which makes the entire pyramid unstable.
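
To make the ratio idea and its caveat concrete, here is a minimal sketch in Java. It assumes a corrective task reports a garbage count, a total count and the time of the measurement. The class name, the thresholds and the maximum report age are all hypothetical and are not part of the James HealthCheck API. A report that is too old is treated as degraded, since a ratio nobody refreshed proves nothing.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: names and thresholds are illustrative, not James APIs.
final class GarbageRatioCheck {
    enum Status { HEALTHY, DEGRADED, UNHEALTHY }

    private final double degradedThreshold;   // assumed: ratio at which we warn
    private final double unhealthyThreshold;  // assumed: ratio at which we fail
    private final Duration maxReportAge;      // assumed: how old a measurement may be

    GarbageRatioCheck(double degradedThreshold, double unhealthyThreshold, Duration maxReportAge) {
        this.degradedThreshold = degradedThreshold;
        this.unhealthyThreshold = unhealthyThreshold;
        this.maxReportAge = maxReportAge;
    }

    // garbageCount / totalCount come from the last run of a corrective task,
    // measuredAt is when that run produced them.
    Status check(long garbageCount, long totalCount, Instant measuredAt, Instant now) {
        // No recent task run means no detection: do not trust a stale ratio.
        if (Duration.between(measuredAt, now).compareTo(maxReportAge) > 0) {
            return Status.DEGRADED;
        }
        if (totalCount == 0) {
            return Status.HEALTHY; // nothing stored, nothing to clean
        }
        double ratio = (double) garbageCount / totalCount;
        if (ratio >= unhealthyThreshold) {
            return Status.UNHEALTHY;
        }
        if (ratio >= degradedThreshold) {
            return Status.DEGRADED;
        }
        return Status.HEALTHY;
    }
}
```

With thresholds of 0.05 and 0.20, for instance, 10 garbage entries out of 100 would be reported as degraded, and the same figures measured a month ago would be reported as degraded regardless of the ratio if the maximum age is a week.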

My concern is to deliver to customers / operational teams a solution that requires minimal knowledge / setup to be operated correctly. The furthest I can go is to check their conf.

How do I ensure the correctness of their setup? How can I take responsibility for what they did and support it?

If our own code/implementation generates these, maybe we could come up with a way to fix the situation automatically and internally by default so that the system eventually converges to a clean state?
If corrective tasks are run periodically, alerts would be raised, and the system would converge.
I think JMAP uploads, C* consistency checks/fixes and the OpenSearch tasks belong to this category, and probably the blob garbage collection too. Even purging deleted data falls in this category: since the recommendation is to run a task from an external cron to clean it (thus having no system or context), we might as well run it automatically and let the system converge.
Thus bringing in a distributed scheduler in James.

At the very least, it would require "distributed election" semantics, which our current middleware stack does not support. I'm not keen on adding a consul/etcd/zookeeper into the mix for that sole purpose. I know about libraries like atomix, but configuring it, especially in a cloudy environment, could be trickier than it sounds...

Sure, it would limit the amount of work needed for our admins.

Also, we would still need a way to alert the admin in case of faulty task execution (back to the healthcheck?)

By the way, I think the two approaches are not mutually exclusive but rather complementary.


- Tasks that are related to operation hygiene more than system health
I don't think spam report generation should contribute to the health of the system. The system will continue to operate just fine even if no spam reports are generated. I do agree that it's good operational hygiene though.
True, but if you make it optional it is down to the guy configuring the server to choose (or to the consultant reviewing the configuration).

Best regards,

Benoit

regards,
jean


Le mer. 21 sept. 2022 à 11:28, Benoit TELLIER <btell...@apache.org> a écrit :

    Hello all,

    Today James relies on externally scheduled tasks to behave well.

    Non-exhaustive examples of tasks that *could* be critical to James
    behaving well and *may/should* be monitored:
      - Blob deduplication Garbage collection
      - JMAP uploads clean-up
      - Cassandra consistency checks/fix
      - Spam reports
      - Auditing / fixing OpenSearch indexing
      - Purging data within the deleted message vault
      - ...

    What can go wrong:
      - The admin did not configure / set up the CRON
      - The CRON executes badly
      - There is an error running the tasks
      - The task is never scheduled because task execution throughput
        is too low, etc...

    What I would love:
      - A green button if required tasks are well executed
      - An orange button if investigation is required because the task is
    never scheduled

    The overall supervision for James today revolves around the
    concept of healthcheck:
      - Periodically run
      - Results exposed through the logs
      - Callable via HTTP to interoperate with alerting stacks
        (Prometheus / load-balancer / Zabbix / ...)
      - Hopefully one day will be on the first page of the James
        administration site....

    So, the proposal is to implement a healthcheck for supervising task
    execution. One would configure the tasks they expect to run
    successfully, and a time period in which they wish each task to be
    well executed.
    We would add configuration properties within healthcheck.properties
    for this. Specifying no tasks makes the check a noop, effectively
    disabling it (default behaviour).
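
The proposed check could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the James HealthCheck interface: the class name, the two-value status and the idea of passing the last successful completion time per task type are all hypothetical. An empty configuration yields a healthy result, which matches the noop default described above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch: not the James HealthCheck API.
final class TaskFreshnessCheck {
    enum Status { HEALTHY, DEGRADED }

    // Longest acceptable delay since the last successful run, keyed by task type.
    private final Map<String, Duration> expectedPeriods;

    TaskFreshnessCheck(Map<String, Duration> expectedPeriods) {
        this.expectedPeriods = Map.copyOf(expectedPeriods);
    }

    // lastSuccess holds the last successful completion per task type;
    // a missing key means the task never completed successfully.
    Status check(Map<String, Instant> lastSuccess, Instant now) {
        for (Map.Entry<String, Duration> expected : expectedPeriods.entrySet()) {
            Instant last = lastSuccess.get(expected.getKey());
            boolean neverRan = last == null;
            if (neverRan || Duration.between(last, now).compareTo(expected.getValue()) > 0) {
                return Status.DEGRADED; // the "orange button" case
            }
        }
        return Status.HEALTHY; // also reached when no task is configured
    }
}
```

In this sketch the mapping from task type to expected period would be read from healthcheck.properties. The property names are left out on purpose, since none have been agreed on.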

    I wish to contribute such a feature to the James project.

    Alternatives I have:
      - Do this as part of Linagora products (TMail) if this is
    non-consensual within the community
      - Propose modularisation for healthchecks, allowing custom health
        checks and treating this as an opt-in extension (extensions-jars
        loading mechanism).

    Thoughts?

    Best regards

    Benoit




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
    For additional commands, e-mail: server-dev-h...@james.apache.org
