[ https://issues.apache.org/jira/browse/JAMES-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier closed JAMES-3784.
---------------------------------
    Resolution: Fixed

Done

> Ease mail repository / event dead letter operation
> --------------------------------------------------
>
>                 Key: JAMES-3784
>                 URL: https://issues.apache.org/jira/browse/JAMES-3784
>             Project: James Server
>          Issue Type: Improvement
>          Components: mailbox, MailStore & MailRepository
>    Affects Versions: master
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> h3. Mailing list thread
> https://www.mail-archive.com/server-dev@james.apache.org/msg72012.html
> h3. context
> James does mostly 2 kinds of processing:
>  - Mail processing: when receiving a mail over SMTP, the mail is enqueued and
> then the mailet processing is executed. Mailets/matchers are called against it
> and a series of decisions can be made: store this mail in a user mailbox,
> forward it, bounce it, ignore it, etc.
>  - Event processing: once actions are taken in a user mailbox, an event is
> emitted to the event bus. Listeners are then called to "decorate" features of
> the mailbox manager in a non-invasive way: ElasticSearch indexing, quota
> management, JMAP projections and state, IMAP/JMAP notifications...
> Of course each kind of processing can fail, and error management is applied.
> Regarding Mail processing, the failed mail is stored in /var/mail/error .
> To detect incidents:
>  - ERROR logs during processing
>  - WebAdmin calls show a non-zero /var/mail/error repository size
> To fix this incident:
>  - Explicit admin action is required, and if needed a reprocessing can be
> attempted (webadmin)
> Regarding Event processing, listener execution is retried several times with a
> delay. If it keeps failing, the event is eventually stored in dead letter.
> To detect incidents:
>  - ERROR logs during processing
>  - WebAdmin reports a non-zero size for deadletter
>  - Health check, which eventually emits a recurring WARNING log that cannot be
> missed.
> To fix this incident:
>  - Explicit admin action is required, and if needed a redelivery can be
> attempted (webadmin)
> h3. Problem statement
> Most users miss this critical part of error management in James.
> Actions are never taken, problems pile up.
> While major incidents with thousands of problems would understandably
> benefit from an admin intervention, I would like small incidents to
> self-recover without human intervention.
> In practice, none of my clients (me included) managed to set up a reliable
> action plan regarding processing failures. Problems could be tackled months
> after they arise, thus escalating into major issues needlessly.
> h3. Proposed solution
>  - Implement a healthcheck that verifies var/mail/error is empty
>  - An upper bound on redelivery/reprocessing exposed by webadmin
> The goal of this limit is to prevent unbounded processing that could consume
> unbounded resources. Auto-healing could be budgeted for (e.g. 10 mails/min).
> Human intervention is still needed in some cases:
>  - Massive outages that require a full redelivery/reprocessing
>  - Bugs that cause recurring failures.
> The goal is to have auto-healing in place, given those tasks are called with
> CRONs.
> CRONs remove the need for extra James-based developments that add complexity.
> h3. Proposed changes
> Add a `limit` parameter to reprocessing/redelivery.
> If specified, it limits the number of elements reprocessed/redelivered.
> If unspecified, the number of processed elements is unbounded (like today).
> Endpoints to modify:
>  -
> https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_reprocessing_mails_from_a_mail_repository
>  -
> https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_all_events
>  -
> https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_group_events
> We also need:
>  - to update webadmin documentation accordingly.
>  - to recommend a CRON of e.g. 10 redeliveries/reprocessings per minute in our
> operation guides.
> (https://james.apache.org/server/manage-guice-distributed-james.html +
> https://james.staged.apache.org/james-distributed-app/3.7.0/operate/guide.html)
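
The bounded calls a CRON job would make can be sketched as follows. This is only an illustration under stated assumptions: the `limit` query parameter is the one proposed in the issue, while the endpoint paths and `action` values are assumed from the WebAdmin documentation linked above and should be checked against your James version. The `WEBADMIN` address and both helper names are hypothetical. The sketch only builds the URLs; sending them (PATCH for reprocessing, POST for redelivery) is left to the caller.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: address of the WebAdmin server; adjust to your deployment.
WEBADMIN = "http://localhost:8000"


def reprocess_url(repository_path: str, limit: Optional[int] = None) -> str:
    """URL for reprocessing mails of a mail repository (to be sent with PATCH)."""
    params = {"action": "reprocess"}
    if limit is not None:
        # Proposed parameter: when omitted, reprocessing stays unbounded.
        params["limit"] = str(limit)
    # The repository path is URL-encoded, so "/" becomes "%2F".
    encoded = quote(repository_path, safe="")
    return f"{WEBADMIN}/mailRepositories/{encoded}/mails?{urlencode(params)}"


def redeliver_url(group: Optional[str] = None, limit: Optional[int] = None) -> str:
    """URL for redelivering dead-letter events (to be sent with POST)."""
    path = "/events/deadLetter"
    if group is not None:
        # Restrict redelivery to a single listener group.
        path += f"/groups/{quote(group, safe='')}"
    params = {"action": "reDeliver"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{WEBADMIN}{path}?{urlencode(params)}"


if __name__ == "__main__":
    # A job scheduled every minute with limit=10 matches the 10/min budget.
    print("PATCH", reprocess_url("var/mail/error", limit=10))
    print("POST", redeliver_url(limit=10))
```

Scheduling such a script every minute keeps auto-healing within the budget suggested in the issue, while a call without `limit` remains available for a full manual redelivery/reprocessing.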