[jira] [Closed] (JAMES-3784) Ease mail repository / event dead letter operation

Benoit Tellier (Jira) Fri, 19 Aug 2022 23:16:07 -0700


     [ 
https://issues.apache.org/jira/browse/JAMES-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Benoit Tellier closed JAMES-3784.
---------------------------------
    Resolution: Fixed

Done

> Ease mail repository / event dead letter operation
> --------------------------------------------------
>
>                 Key: JAMES-3784
>                 URL: https://issues.apache.org/jira/browse/JAMES-3784
>             Project: James Server
>          Issue Type: Improvement
>          Components: mailbox, MailStore & MailRepository
>    Affects Versions: master
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> h3. Mailing list thread
> https://www.mail-archive.com/[email protected]/msg72012.html
> h3. context
> James does mostly 2 kinds of processing:
>  - Mail processing: when receiving a mail in SMTP, the mail is enqueued and 
> then the mailet processing is executed. Mailets/matchers are called against 
> it and a series of decisions can be made: store this mail in a user mailbox, 
> forward it, bounce it, ignore it, etc.
>  - Event processing: once actions are taken in a user mailbox, an event is 
> emitted to the event bus. Listeners are then called to "decorate" features of 
> the mailbox manager in a non-invasive way: ElasticSearch indexing, quota 
> management, JMAP projections and state, IMAP/JMAP notifications...
> Of course each of these kinds of processing can fail, and error management 
> is applied.
> Regarding Mail processing, a failed mail is stored in /var/mail/error .
> To detect incidents:
>   - ERROR logs during processing
>   - Webadmin calls show a non-zero /var/mail/error repository size
> To fix this incident:
>   - Explicit admin action is required, and if needed a reprocessing can be 
> attempted (webadmin)
> Regarding Event processing, listener execution is retried several times with 
> a delay. If it keeps failing, the event is eventually stored in dead letter.
> To detect incidents:
>  - ERROR logs during processing
>  - WebAdmin reports a non-zero size for deadletter
>  - Health check, which eventually emits a recurring WARNING log that cannot 
> be missed.
> To fix this incident:
>   - Explicit admin action is required, and if needed a redelivery can be 
> attempted (webadmin) 
> h3. Problem statement
> Most users miss this critical part of error management in James.
> Actions are never taken, and problems pile up.
> While major incidents with thousands of problems would understandably 
> benefit from an admin intervention, I would like small incidents to 
> self-recover without human intervention.
> In practice, none of my clients (me included) managed to set up a reliable 
> action plan regarding processing failures. Problems could be tackled months 
> after they arise, thus escalating into major issues needlessly. 
> h3. Proposed solution
>  - Implement a healthcheck that verifies var/mail/error is empty
>  - An upper bound on redelivery/reprocessing exposed by webadmin
> The goal of this limit is to prevent unbounded processing that could consume 
> unbounded resources. Auto-healing could be budgeted for (eg: 10 mails/min).
> A human intervention is still needed in some cases:
>  - Massive outages that require a full redelivery/reprocessing
>  - Bugs that cause recurring failures.
> The goal is to have auto-healing in place, given those tasks are called with 
> CRONs.
> CRONs remove the need for extra James-based developments that add complexity.
> h3. Proposed changes
> Add a `limit` parameter to reprocessing/redelivery.
> If specified, it limits the count of elements reprocessed/redelivered. If 
> unspecified, the count of processed elements is unbounded (like today).
> Endpoints to modify:
>  - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_reprocessing_mails_from_a_mail_repository
>  - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_all_events
>  - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_group_events
> We also need:
>  - to update webadmin documentation accordingly.
>  - to recommend a CRON of eg 10 redelivery/reprocessing per minute in our 
> operation guides. 
> (https://james.apache.org/server/manage-guice-distributed-james.html + 
> https://james.staged.apache.org/james-distributed-app/3.7.0/operate/guide.html)
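
As an illustration of the proposed `limit` parameter, here is a minimal Python sketch of how a scheduled job might build bounded reprocessing/redelivery calls. It is not taken from the issue: the base URL, the helper names, the exact endpoint paths and HTTP methods, and the spelling of the `limit` query parameter are all assumptions to be checked against the WebAdmin documentation of the James version in use.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: WebAdmin base URL; adjust host and port to your deployment.
WEBADMIN_BASE = "http://localhost:8000"


def reprocess_url(repository_path: str, limit: Optional[int] = None) -> str:
    """Build a (hypothetical) URL reprocessing mails of a mail repository.

    Without a limit the count of reprocessed mails is unbounded.
    """
    params = {"action": "reprocess"}
    if limit is not None:
        params["limit"] = str(limit)
    # The repository path is a single URL segment, so "/" must be encoded.
    encoded = quote(repository_path, safe="")
    return f"{WEBADMIN_BASE}/mailRepositories/{encoded}/mails?{urlencode(params)}"


def redeliver_url(group: Optional[str] = None, limit: Optional[int] = None) -> str:
    """Build a (hypothetical) URL redelivering dead-letter events.

    With no group, all events are targeted; otherwise only that group's.
    """
    params = {"action": "reDeliver"}
    if limit is not None:
        params["limit"] = str(limit)
    path = "/events/deadLetter"
    if group is not None:
        path += "/groups/" + quote(group, safe="")
    return f"{WEBADMIN_BASE}{path}?{urlencode(params)}"


def trigger(url: str, method: str) -> int:
    """Send the request and return the HTTP status code."""
    with urlopen(Request(url, method=method)) as response:
        return response.status


if __name__ == "__main__":
    # Only print the URLs; call trigger(...) to actually submit the tasks.
    print(reprocess_url("var/mail/error", limit=10))
    print(redeliver_url(limit=10))
```

Run once per minute from a CRON with `limit=10`, such a script would match the "10 mails/min" auto-healing budget mentioned above, while massive outages would still call for an unbounded, admin-triggered run.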



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
