[ https://issues.apache.org/jira/browse/JAMES-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483695#comment-17483695 ]
Ilja Weis commented on JAMES-3710: ---------------------------------- Hello, thank you for the explanation. I've looked at the queues in RabbitMQ to see what kind of message activity is happening during the deletes. We do have mailboxEvent-workQueue-org.apache.james.mailbox.cassandra.DeleteMessageListener$DeleteMessageListenerGroup and mailboxEvent-workQueue-org.apache.james.pop3server.mailbox.PopulateMetadataStoreListener$PopulateMetadataStoreListenerGroup, so I had hoped that since the POP3 Metadata events were persisted, so they would just be caught up upon server restart. However, there is zero activity in those two queues, but a lot of messages in mailboxEvent-workQueue-org.apache.james.events.GroupRegistrationHandler$GroupRegistrationHandlerGroup Is it supposed to behave like that, or are we maybe losing some events somewhere along they way? > Restarting James while deleting using POP3 causes inconsistency > --------------------------------------------------------------- > > Key: JAMES-3710 > URL: https://issues.apache.org/jira/browse/JAMES-3710 > Project: James Server > Issue Type: Bug > Affects Versions: master > Reporter: Ilja Weis > Priority: Major > Attachments: logs.zip > > > Running James (distributed-pop3-app) and restarting it while > it is processing a lot of POP3 sessions that are deleting messages > causes the tables messagev3, pop3metadata and imapuidtable to be > inconsistent, leading to the problems discussed in JAMES-3709. > The effect after applying the changes from > https://github.com/apache/james-project/pull/861 is that POP3 sessions TOPing > messages that no longer exist > but still are in pop3metadata now cause "-ERR Message (12) does not exist." > as expected but no longer destroy the whole session. > The interesting part is why the data becomes inconsistent. This is probably > an edge case because we're probably going to restart James all the time > but perhaps there's a problem somewhere that's just more likely to happen > when restarting? > My setup: 4 James instances in Kubernetes, S3 storage, Cassandra > cluster, 2 datacenters with 3 nodes each, replication DC1:3,DC2:3. Relevant > cassandra properties: > cassandra.consistency_level.regular=QUORUM > cassandra.consistency_level.lightweight_transaction=SERIAL > message.read.strong.consistency=false > message.write.strong.consistency.unsafe=false > mailbox.read.strong.consistency=false > mailbox.read.repair.chance=0.00 > mailbox.counters.read.repair.chance.max=0.000 > mailbox.counters.read.repair.chance.one.hundred=0.000 > What I did: > - Send a number of mails to 200 mailboxes. After this: > > select count(*) from messagev3; > count > ------- > 16837 > > select count(*) from imapuidtable; > count > ------- > 16837 > > select count(*) from pop3metadata ; > count > ------- > 16837 > Then, start deleting all those messages with 40 parallel sessions. > There are no concurrent sessions to the same account. > While the deletes are running, restart all the james instances. > After a moment, we have: > > select count(*) from messagev3; > count > ------- > 14669 > > select count(*) from imapuidtable; > count > ------- > 14194 > > select count(*) from pop3metadata ; > count > ------- > 14669 > Not sure if messagev3 is relevant here, just adding it for completeness. > Now if I'm accessing the mailboxes, this will touch some of the > messages that are no longer in imapuidtable which of course leads to > -ERR Message (12) does not exist. > and > 16:32:48.175 [WARN ] o.a.j.p.m.DistributedMailboxAdapter - Removing > cc9b9f70-7f8d-11ec-b659-0f643f8b11e7 from > 8870f760-7f87-11ec-9109-5fc36fc8b18d POP3 projection for user > james-032@james.testing at it is not backed by a MailboxMessage > Running the fixPop3Inconsistencies task in such a situation would then > clean up all the pop3metadata messages as "stale". > I have attached the James log while deleting/restarting and after > restarting for all 4 instances. > What do we make of this? Is this something relevant or more like something > that just can happen if we restart James at the wrong time? -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org For additional commands, e-mail: server-dev-h...@james.apache.org