[jira] [Commented] (JAMES-3740) IMAP UID <-> MSN mapping occupies too much memory

Matthieu Baechler (Jira) Mon, 04 Apr 2022 12:15:16 -0700


    [ 
https://issues.apache.org/jira/browse/JAMES-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517056#comment-17517056
 ]


Matthieu Baechler commented on JAMES-3740:
------------------------------------------

What about building a sorted list containing ranges of UID? Most mailboxes 
probably have contiguous UIDs and thus it could potentially reduce a lot the 
size of the datastructure.

For example (numbers are UIDs): 0-15 / 20-30 / 32-45 / 50-100 / 102-178 / 
180-222 / 230-260

If I need to find the MSN for UID 35, I iterate over ranges, and add them until 
I reach the UID. Here : 16 + 11 + 3.
If I need to find the UID for MSN 110, I iterate over ranges and add there 
sizes until I reach the number I want. Here 16 + 11 + 14 + 51 (=92) then I take 
the 18th element of the 102-178 range (because it's 110 - 92), so UID is 120.

It means the lookup is 0(n) with n being the number of ranges.



> IMAP UID <-> MSN mapping occupies too much memory
> -------------------------------------------------
>
>                 Key: JAMES-3740
>                 URL: https://issues.apache.org/jira/browse/JAMES-3740
>             Project: James Server
>          Issue Type: Improvement
>          Components: IMAPServer
>    Affects Versions: 3.7.0
>            Reporter: Benoit Tellier
>            Priority: Major
>         Attachments: Screenshot from 2022-03-30 17-39-37.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. What is UID <-> MSN mapping ?
> In IMAP RFC-3501 there is two ways one addresses a message:
>  - By its UID (Unique ID) that is unique (until UID_VALIDITY changes...)
>  - By its MSN (Message Sequence  Number) which is the (mutable) position of a 
> message in the mailbox.
> We then need:
>  - Given a UID return its MSN which is for instance compulsory upon EXPUNGED 
> notifications when QRESYNCH is not enabled.
>  - Given a MSN based request we need to convert it back to a UID (rare).
> We do store the list of UIDs, sorted, in RAM and perform binarysearches to 
> resolve those.
> h3. What is the impact on heap?
> Each uid is wrapped in a MessageUID object. This object wrapping comes with 
> an overhead of at least 12 bytes in addition to the 8 bytes payload (long). 
> Quick benchmarks shows it's actually worse: 10 million uids did take up to 
> 275 MB.
> {code:java}
>     @Test
>     void measureHeapUsage() throws InterruptedException {
>         int count =10000000;
>         testee.addAll(IntStream.range(0, count)
>             .mapToObj(i -> MessageUid.of(i + 1))
>             .collect(Collectors.toList()));
>         Thread.sleep(1000);
>         System.out.println("GCing");
>         System.gc();
>         Thread.sleep(1000);
>         
> System.out.println(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
>     }
> {code}
> Now, from let's take a classical production deployment I get:
>  - Some users have up to 2.5 million messages in their INBOX
>  - I can get an average of 100.000 messages for each user
> So for a small scale deployment, we are already "consuming" ~300 MB of memory 
> just for the UID <-> mapping.
> Scaling to 1.000 users on a single James instance we clearly see that HEAP 
> consumption will start being a problem (~3GB) without even speaking of target 
> of 10.000 users per James I do have in mind.
> It's worth mentioning that IMAP being statefull, and UID <-> MSN mapping 
> attached to a selected mailbox, such a mapping is long lived:
>  - Multiple small objects would need to be copied individually by the GC, 
> putting pressure during long gen
>  - Those long lived object will eventually be promoted to old gen, thus the 
> more there is the longer the resulting stop-the-world GC pauses will be.
> h3. Temporary fix ?
> We can get rid of the object boxing in UidMsnConverter by using primitive 
> type collections for instance provided by fastutils project.
> The same bench was down to 84MB.
> Also, we could get things more compact by using an INT representation of 
> UIDs. (Those are most of the case below 2 billions, to be above this there 
> need to be more than 2 billion emails transiting through one's mailbox which 
> is highly unlikely). A fallback to "long" storage can be setted up if a UID 
> above 2 billion is observed.
> This such a compact int storage we are down to 46MB.
> So taking the prior mentioned numbers we could expect a 1.000 people 
> deployment to require ~400 MB and a larger scale 10.000 people deployment on 
> a single James to consume up to 4GB. Not that enjoyable but definitly more 
> manageable.
> Please note that primitive collections are more GC friendly as their elements 
> are manages together, as a single object (backing array).
> h3. What other mail servers do
> I found references to Dovecote, which does a similar algorithm compared to 
> us: binary search on a list of uids. The noticeable difference is that this 
> list of UIDs is held on disk and not in memory as we do.
> References: 
> https://doc.dovecot.org/developer_manual/design/indexes/mail_index_api/?highlight=time
> Of course, such a solution would be attractive... We could imagine keeping 
> the last 1.000 uids in memory, which would most of the time be the ones used 
> for MSN resolution and locate the rest on-disk, use them only when needed and 
> thus dramatically reduce heap pressure.
> Making UidMsnConverter an interface with a backing factory would enable 
> different implementation to co-exist and allow some experimentation ;-)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org

[jira] [Commented] (JAMES-3740) IMAP UID <-> MSN mapping occupies too much memory

Reply via email to