Re: Being ops-friendly: a healthcheck to track task execution?

Benoit TELLIER Sat, 01 Oct 2022 22:51:06 -0700

Hello Jean,

On 02/10/2022 12:14, Jean Helou wrote:

> Thus bringing in a distributed scheduler in James.
>
> At the very least, it would require "distributed election" semantics, which our current middleware stack does not support
Hmm, all distributed instances of James use an external queue system. By sending a trigger message to a dedicated queue, whichever instance consumes the message will run the task this time. Even if there is a repeat delivery, it will just run a self-cleaning job a second time.

This sounds more like a hack than a realistic and safe approach to me.

https://aphyr.com/posts/315-jepsen-rabbitmq presents a similar approach (lock service).

Also, a chicken-and-egg problem: how do you send a single message in the first place?

> Also, we would still need a way to alert admin in case of faulty task execution (back to the healthcheck?)

I think log messages are still the most frequent method used to report internal errors. They are easy to integrate with external observability systems, which will handle the alerting for James and all the other systems. If you send a message to a command channel, you can also catch unacknowledged commands in a dead-letter queue and monitor that.


Cheers
Jean

On Wed, 28 Sept 2022 at 04:25, Benoit TELLIER <btell...@apache.org> wrote:


    Hello Jean,

    Thanks for the reply!

    I'll inline my answers.

    On 28/09/2022 03:27, Jean Helou wrote:

    Hi Benoît

    This has been at the back of my mind for a while. Tonight I
    finally have time to formulate my thoughts.

    I do like the idea of modular health checks. In a system such as
    James, where instances can have wildly varying modules enabled,
    having a modular health check system definitely sounds great.

    And... Modularity can be great to solve conflicts / lukewarmness
    issues :-)

    I will add it to our backlog.


    I am lukewarm on the motivation for introducing said modular
    system : monitoring externally
    triggered task executions.

    I'll first reformulate what you wrote:
    * Various implementations of features in james can produce
    inconsistent or garbage data over time.
    * The accumulation of such data can be detrimental to James
    operational performance if left unchecked.

    If it was only performance... It's detrimental to consistency too...

    * James provides batch processing tasks to fix these issues to be
    called from an external scheduler

    You propose to add configurable metadata relative to these tasks,
    in particular expected execution frequency and last execution date.
    This "metadata"  would be compared with actual execution to
    contribute to the healthcheck of the system.

    If this is indeed what you meant, I would like to offer a
    different point of view.
    Overall,  I think that monitoring the execution of these tasks
    falls outside the scope of james since they are externally triggered.

    I feel the task examples you provided can be grouped into two
    categories for which a different approach could be needed:

    - Tasks to clean internally generated inconsistencies/garbage
    data which may eventually degrade the operation of the system:
    Since we have tasks that can fix the inconsistencies or garbage
    data, it means we are able to detect such inconsistent
    or garbage data.
    In such a case we could possibly contribute that to the health
    check instead of reporting on some task execution. We don't
    really care whether the task has been executed or not as long as
    the data is consistent and clean. We do care about an
    ever-increasing volume of inconsistent/garbage data that threatens
    system properties.

    True.

    We could have corrective tasks maintain garbage / non-garbage
    and consistent / inconsistent ratios. These would themselves back
    a healthcheck.
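
The ratio idea above could look roughly like the following minimal Java sketch. Everything in it is an assumption made for illustration: the class name, the `report` method, and the threshold are hypothetical and are not part of the James API. It also shows the weakness raised in this thread: until a corrective task has reported at least once, the check cannot say anything reliable, so the sketch reports a degraded state.

```java
// Hypothetical sketch: names and threshold are illustrative, not James API.
final class GarbageRatioHealthCheck {
    enum Status { HEALTHY, DEGRADED }

    private final double maxGarbageRatio;
    private long garbageItems;
    private long totalItems;
    private boolean measured;

    GarbageRatioHealthCheck(double maxGarbageRatio) {
        this.maxGarbageRatio = maxGarbageRatio;
    }

    // Called by a corrective task at the end of a run with what it observed.
    synchronized void report(long garbageItems, long totalItems) {
        this.garbageItems = garbageItems;
        this.totalItems = totalItems;
        this.measured = true;
    }

    // Share of garbage among all observed items (0.0 when nothing was observed).
    synchronized double ratio() {
        return totalItems == 0 ? 0.0 : (double) garbageItems / totalItems;
    }

    synchronized Status check() {
        // No corrective task has ever reported: no detection, so not healthy.
        if (!measured) {
            return Status.DEGRADED;
        }
        return ratio() > maxGarbageRatio ? Status.DEGRADED : Status.HEALTHY;
    }

    public static void main(String[] args) {
        GarbageRatioHealthCheck check = new GarbageRatioHealthCheck(0.10);
        System.out.println(check.check()); // nothing reported yet
        check.report(5, 100);              // 5% garbage, below the threshold
        System.out.println(check.check());
        check.report(30, 100);             // 30% garbage, above the threshold
        System.out.println(check.check());
    }
}
```

A single threshold is used here for brevity; separate thresholds per ratio (garbage, inconsistency) would follow the same shape.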

    However, no running tasks implies no detection which makes the
    entire pyramid unstable.

    My concern is to deliver to customers / operational teams a
    solution that requires minimum knowledge / set-up to be operated
    correctly. The furthest I can go is to check their configuration.

    How do I ensure correctness of their set up? How can I take
    responsibility of what they did and support it?

    If our own code/implementation generates these, maybe we could
    come up with a way to fix the situation automatically
    and internally by default so that the system eventually converges
    to a clean state ?

    If corrective tasks are run periodically, alerts would be raised,
    and the system would converge.

    I think JMAP uploads, C* consistency checks/fixes and the
    OpenSearch tasks belong to this category, and probably the blob
    garbage collection too.
    Even purging deleted data falls in this category: since the
    recommendation is to run a task from an external cron to clean it
    (thus having no system or context), we might as well run it
    automatically and let the system converge.

    Thus bringing in a distributed scheduler in James.

    At the very least, it would require "distributed election"
    semantics, which our current middleware stack does not support. I'm
    not keen on adding Consul/etcd/ZooKeeper into the mix for that
    sole purpose. I know about libraries like Atomix, but configuring
    it, especially in a cloud environment, could be trickier than it
    sounds...

    Sure, it would limit the amount of work needed for our admins.

    Also, we would still need a way to alert admin in case of faulty
    task execution (back to the healthcheck?)


    By the way, I think both approaches might not be mutually
    exclusive, but rather complementary.


    - Tasks that are related to operational hygiene more than system
    health:
    I don't think spam report generation should contribute to the
    health of the system. The system will continue to operate just
    fine even if no spam reports are generated. I do agree that it's
    good operational hygiene though.

    True, but if you make it optional it is down to the guy
    configuring the server to choose (or to the consultant reviewing
    the configuration).

    Best regards,

    Benoit


    regards,
    jean


    On Wed, 21 Sept 2022 at 11:28, Benoit TELLIER
    <btell...@apache.org> wrote:

        Hello all,

        Today James relies on externally scheduled tasks to behave
        well.

        Non-exhaustive examples of tasks that *could* be critical to
        the proper behaviour of James and *may/should* be monitored:
          - Blob deduplication Garbage collection
          - JMAP uploads clean-up
          - Cassandra consistency checks/fix
          - Spam reports
          - Auditing / fixing OpenSearch indexing
          - Purging data within the deleted message vault
          - ...

        What can go wrong:
          - The admin did not configure / set up the CRON
          - The CRON executes badly
          - There is an error running the tasks
          - The task is never scheduled because task execution
        throughput is too low, etc...

        What I would love:
          - A green button if required tasks are well executed
          - An orange button if investigation is required because the
        task is
        never scheduled

        The overall supervision for James today revolves around the
        concept of healthcheck:
          - Periodically run
          - Results exposed through the logs
          - Callable via HTTP to interoperate with alerting stacks
        (prometheus /
        load-balancer / Zabbix / ...)
          - Hopefully one day will be on the first page of James
        administration
        site....

        So, the proposal is to implement a healthcheck for
        supervising task execution. One would configure the tasks
        expected to run successfully, and a time period within which
        each task should have been executed successfully.
        We would add a configuration property within
        healthcheck.properties for this. Specifying no tasks makes
        the check a noop, effectively disabling it (default behaviour).
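
As a rough illustration of the proposal above, here is a minimal Java sketch of such a check. All names are hypothetical (`TaskExecutionHealthCheck`, `recordSuccess`, the task name in `main`) and none of them come from the James API; the map of expected periods stands in for whatever would be read from healthcheck.properties, whose exact keys are not specified in this thread. A task that never ran, or whose last successful run is older than its configured period, degrades the check; an empty configuration makes it a noop.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are part of the James API.
final class TaskExecutionHealthCheck {
    enum Status { HEALTHY, DEGRADED }

    // Task name -> maximum accepted delay since the last successful run.
    private final Map<String, Duration> expectedPeriods;
    // Task name -> instant of the last successful run (absent if never run).
    private final Map<String, Instant> lastSuccess = new LinkedHashMap<>();

    TaskExecutionHealthCheck(Map<String, Duration> expectedPeriods) {
        this.expectedPeriods = new LinkedHashMap<>(expectedPeriods);
    }

    // To be called whenever a task completes successfully.
    void recordSuccess(String taskName, Instant when) {
        lastSuccess.put(taskName, when);
    }

    // With no configured task the loop body never runs: the check is a noop.
    Status check(Instant now) {
        for (Map.Entry<String, Duration> entry : expectedPeriods.entrySet()) {
            Instant last = lastSuccess.get(entry.getKey());
            boolean neverRan = last == null;
            if (neverRan || Duration.between(last, now).compareTo(entry.getValue()) > 0) {
                return Status.DEGRADED;
            }
        }
        return Status.HEALTHY;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2022-09-21T12:00:00Z");
        TaskExecutionHealthCheck check = new TaskExecutionHealthCheck(
            Map.of("BlobGarbageCollection", Duration.ofDays(7)));
        System.out.println(check.check(now)); // task never ran
        check.recordSuccess("BlobGarbageCollection", now.minus(Duration.ofDays(1)));
        System.out.println(check.check(now)); // ran one day ago, within 7 days
    }
}
```

In this sketch HEALTHY would map to the green button and DEGRADED to the orange one. In a distributed deployment the last-success map would need to live in shared storage rather than in memory, which the sketch deliberately leaves out.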

        I wish to contribute such a feature to the James project.

        Alternatives I have:
          - Do this as part of Linagora products (TMail) if this is
        non-consensual within the community
          - Propose modularisation for healthchecks, allowing custom
        health
        checks and treating this as an opt-in extension
        (extensions-jars loading
        mechanism).

        Thoughts?

        Best regards

        Benoit




        ---------------------------------------------------------------------
        To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
        For additional commands, e-mail: server-dev-h...@james.apache.org
