Re: Being ops-friendly: a healthcheck to track task execution?

Benoit TELLIER Tue, 27 Sep 2022 19:25:35 -0700

Hello Jean,

Thanks for the reply!


I'll inline my answers.

On 28/09/2022 03:27, Jean Helou wrote:

Hi Benoît
This has been at the back of my mind for a while. Tonight I finally have time to formulate my thoughts.
I do like the idea of modular health checks. In a system such as James, where instances can have wildly varying modules enabled, having a modular health check system definitely sounds great.

And... Modularity can be great to solve conflicts / lukewarmness issues :-)

I will add it to our backlog.

I am lukewarm on the motivation for introducing said modular system: monitoring externally triggered task executions.

I'll first reformulate what you wrote:
* Various implementations of features in James can produce inconsistent or garbage data over time.
* The accumulation of such data can be detrimental to James operational performance if left unchecked.

If it was only performance... It's detrimental to consistency too...

* James provides batch processing tasks to fix these issues, to be called from an external scheduler.
You propose to add configurable metadata relative to these tasks, in particular expected execution frequency and last execution date. This "metadata" would be compared with actual execution to contribute to the healthcheck of the system.
If this is indeed what you meant, I would like to offer a different point of view. Overall, I think that monitoring the execution of these tasks falls outside the scope of James since they are externally triggered.
I feel the task examples you provided can be grouped into two categories, for which a different approach could be needed:
- Tasks to clean internally generated inconsistencies/garbage data which may eventually degrade the operation of the system: since we have tasks that can fix the inconsistencies or garbage data, it means we are able to detect such inconsistent or garbage data.
In such a case we could possibly contribute that to the health check instead of reporting on some task execution. We don't really care whether the task has been executed or not as long as the data is consistent and clean. We do care about an ever-increasing volume of inconsistent/garbage data that threatens system properties.

True.

We could have corrective tasks maintain garbage / non-garbage and consistent / inconsistent ratios. Those ratios would themselves back a healthcheck.

However, no running tasks implies no detection, which makes the entire pyramid unstable.
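
A minimal sketch of such a ratio-backed check, in illustrative Python rather than James code. The `ScanReport` shape, the status names and both thresholds are assumptions made up for the example. It also shows the instability mentioned above: when no corrective task has reported, nothing has been measured, so the check cannot answer "healthy".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class ScanReport:
    """Counters a corrective task could publish after a run (hypothetical shape)."""
    scanned: int
    inconsistent: int


def ratio_status(report: Optional[ScanReport],
                 degraded_above: float = 0.01,
                 unhealthy_above: float = 0.05) -> Status:
    """Map an inconsistent/scanned ratio to a status (thresholds are arbitrary)."""
    if report is None or report.scanned == 0:
        # No task ran, or it scanned nothing: there is no measurement to trust.
        return Status.DEGRADED
    ratio = report.inconsistent / report.scanned
    if ratio > unhealthy_above:
        return Status.UNHEALTHY
    if ratio > degraded_above:
        return Status.DEGRADED
    return Status.HEALTHY
```

With these assumed thresholds, 5 inconsistent entries out of 1000 scanned stays healthy, 20 out of 1000 is degraded, and a missing report is degraded as well.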

My concern is to deliver to customers / operational teams a solution that requires minimum knowledge / set up to be operated correctly. The furthest I can go is to check their configuration.

How do I ensure the correctness of their set up? How can I take responsibility for what they did and support it?

If our own code/implementation generates these, maybe we could come up with a way to fix the situation automatically and internally by default so that the system eventually converges to a clean state?

If corrective tasks are run periodically, alerts would be raised, and the system would converge.

I think JMAP uploads, C* consistency check/fixes and the OpenSearch tasks belong to this category, probably the blob garbage collection too. Even purging deleted data falls in this category: since the recommendation is to run a task from an external cron to clean it (thus having no system or context) we might as well run it automatically and let the system converge.

Thus bringing in a distributed scheduler in James.

At the very least, it would require "distributed election" semantics, which our current middleware stack does not support. I'm not keen on adding a consul/etcd/zookeeper into the mix for that sole purpose. I know about libraries like atomix, but configuring it, especially in a cloud environment, could be trickier than it sounds...


Sure, it would limit the amount of work needed for our admins.

Also, we would still need a way to alert admins in case of faulty task execution (back to the healthcheck?)

By the way, I think the two approaches are not mutually exclusive, but rather complementary.

- Tasks that are related to operation hygiene more than system health
I don't think spam reports generation should contribute to the health of the system. The system will continue to operate just fine even if no spam reports are generated? I do agree that it's good operational hygiene though.

True, but if you make it optional, it is down to the person configuring the server to choose (or to the consultant reviewing the configuration).


Best regards,

Benoit


regards,
jean

On Wed, 21 Sep 2022 at 11:28, Benoit TELLIER <btell...@apache.org> wrote:


    Hello all,

    Today James relies on externally scheduled tasks for its proper
    operation.

    Non-exhaustive examples of tasks that *could* be critical to the
    proper operation of James and *may/should* be monitored:
      - Blob deduplication Garbage collection
      - JMAP uploads clean-up
      - Cassandra consistency checks/fix
      - Spam reports
      - Auditing / fixing OpenSearch indexing
      - Purging data within the deleted message vault
      - ...

    What can go wrong:
      - The admin did not configure / set up the CRON
      - The CRON executes badly
      - There is an error running the tasks
      - The task is never scheduled because task execution throughput
        is too low, etc...

    What I would love:
      - A green button if required tasks are well executed
      - An orange button if investigation is required because the task is
    never scheduled

    The overall supervision for James today revolves around the
    concept of healthcheck:
      - Periodically run
      - Results exposed through the logs
      - Callable via HTTP to interoperate with alerting stacks
        (prometheus / load-balancer / Zabbix / ...)
      - Hopefully one day will be on the first page of the James
        administration site....

    So, the proposal is to implement a healthcheck for supervising task
    execution. One would configure the tasks one expects to run
    successfully, and a time period within which each task should have
    been executed successfully.
    We would add configuration properties within
    healthcheck.properties for this. Specifying no tasks makes
    the check a noop, effectively disabling it (default behaviour).
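
A minimal sketch of the check described above, in illustrative Python rather than James code. The function name, the task names used in the usage note and the two dictionaries are assumptions for the example: `expected` stands in for whatever would be read from healthcheck.properties, and `last_success` for wherever the last successful executions are recorded.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def check_task_execution(expected: Dict[str, timedelta],
                         last_success: Dict[str, datetime],
                         now: datetime) -> Tuple[str, List[str]]:
    """Return ("green", []) when every expected task succeeded within its
    period, else ("orange", overdue_task_names).

    expected: task name -> maximum allowed time since the last success.
    last_success: task name -> time of the last successful execution.
    """
    overdue = []
    for task, period in expected.items():
        last = last_success.get(task)
        # Never executed successfully, or not recently enough.
        if last is None or now - last > period:
            overdue.append(task)
    overdue.sort()
    # An empty `expected` leaves `overdue` empty: the check is a noop.
    return ("orange", overdue) if overdue else ("green", [])
```

For instance, with a hypothetical `{"blob-gc": timedelta(days=7), "jmap-uploads-cleanup": timedelta(days=1)}` and only `blob-gc` recorded as successful two days ago, the result is orange and names `jmap-uploads-cleanup`. The sketch only distinguishes green from orange, as in the wish list above.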

    I wish to contribute such a feature to the James project.

    Alternatives I have:
      - Do this as part of Linagora products (TMail) if this is
        non-consensual within the community
      - Propose modularisation for healthchecks, allowing custom health
        checks and treating this as an opt-in extension (extensions-jars
        loading mechanism).

    Thoughts?

    Best regards

    Benoit




