Ilja Weis created JAMES-3710:
--------------------------------
Summary: Restarting James while deleting using POP3 causes
inconsistency
Key: JAMES-3710
URL: https://issues.apache.org/jira/browse/JAMES-3710
Project: James Server
Issue Type: Bug
Affects Versions: master
Reporter: Ilja Weis
Attachments: logs.zip
Running James (distributed-pop3-app) and restarting it while
it is processing a lot of POP3 sessions that are deleting messages
causes the tables messagev3, pop3metadata and imapuidtable to be
inconsistent, leading to the problems discussed in JAMES-3709.
The effect after applying the changes from
https://github.com/apache/james-project/pull/861 is that POP3 sessions TOPing
messages that no longer exist
but still are in pop3metadata now cause "-ERR Message (12) does not exist."
as expected but no longer destroy the whole session.
The interesting part is why the data becomes inconsistent. This is probably
an edge case because we're probably going to restart James all the time
but perhaps there's a problem somewhere that's just more likely to happen
when restarting?
My setup: 4 James instances in Kubernetes, S3 storage, Cassandra
cluster, 2 datacenters with 3 nodes each, replication DC1:3,DC2:3. Relevant
cassandra properties:
cassandra.consistency_level.regular=QUORUM
cassandra.consistency_level.lightweight_transaction=SERIAL
message.read.strong.consistency=false
message.write.strong.consistency.unsafe=false
mailbox.read.strong.consistency=false
mailbox.read.repair.chance=0.00
mailbox.counters.read.repair.chance.max=0.000
mailbox.counters.read.repair.chance.one.hundred=0.000
What I did:
- Send a number of mails to 200 mailboxes. After this:
> select count(*) from messagev3;
count
-------
16837
> select count(*) from imapuidtable;
count
-------
16837
> select count(*) from pop3metadata ;
count
-------
16837
Then, start deleting all those messages with 40 parallel sessions.
There are no concurrent sessions to the same account.
While the deletes are running, restart all the james instances.
After a moment, we have:
> select count(*) from messagev3;
count
-------
14669
> select count(*) from imapuidtable;
count
-------
14194
> select count(*) from pop3metadata ;
count
-------
14669
Not sure if messagev3 is relevant here, just adding it for completeness.
Now if I'm accessing the mailboxes, this will touch some of the
messages that are no longer in imapuidtable which of course leads to
-ERR Message (12) does not exist.
and
16:32:48.175 [WARN ] o.a.j.p.m.DistributedMailboxAdapter - Removing
cc9b9f70-7f8d-11ec-b659-0f643f8b11e7 from 8870f760-7f87-11ec-9109-5fc36fc8b18d
POP3 projection for user [email protected] at it is not backed by a
MailboxMessage
Running the fixPop3Inconsistencies task in such a situation would then
clean up all the pop3metadata messages as "stale".
I have attached the James log while deleting/restarting and after
restarting for all 4 instances.
What do we make of this? Is this something relevant or more like something
that just can happen if we restart James at the wrong time?
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]