[ https://issues.apache.org/jira/browse/JAMES-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517056#comment-17517056 ]
Matthieu Baechler commented on JAMES-3740: ------------------------------------------ What about building a sorted list containing ranges of UID? Most mailboxes probably have contiguous UIDs and thus it could potentially reduce a lot the size of the datastructure. For example (numbers are UIDs): 0-15 / 20-30 / 32-45 / 50-100 / 102-178 / 180-222 / 230-260 If I need to find the MSN for UID 35, I iterate over ranges, and add them until I reach the UID. Here : 16 + 11 + 3. If I need to find the UID for MSN 110, I iterate over ranges and add there sizes until I reach the number I want. Here 16 + 11 + 14 + 51 (=92) then I take the 18th element of the 102-178 range (because it's 110 - 92), so UID is 120. It means the lookup is 0(n) with n being the number of ranges. > IMAP UID <-> MSN mapping occupies too much memory > ------------------------------------------------- > > Key: JAMES-3740 > URL: https://issues.apache.org/jira/browse/JAMES-3740 > Project: James Server > Issue Type: Improvement > Components: IMAPServer > Affects Versions: 3.7.0 > Reporter: Benoit Tellier > Priority: Major > Attachments: Screenshot from 2022-03-30 17-39-37.png > > Time Spent: 20m > Remaining Estimate: 0h > > h3. What is UID <-> MSN mapping ? > In IMAP RFC-3501 there is two ways one addresses a message: > - By its UID (Unique ID) that is unique (until UID_VALIDITY changes...) > - By its MSN (Message Sequence Number) which is the (mutable) position of a > message in the mailbox. > We then need: > - Given a UID return its MSN which is for instance compulsory upon EXPUNGED > notifications when QRESYNCH is not enabled. > - Given a MSN based request we need to convert it back to a UID (rare). > We do store the list of UIDs, sorted, in RAM and perform binarysearches to > resolve those. > h3. What is the impact on heap? > Each uid is wrapped in a MessageUID object. This object wrapping comes with > an overhead of at least 12 bytes in addition to the 8 bytes payload (long). > Quick benchmarks shows it's actually worse: 10 million uids did take up to > 275 MB. > {code:java} > @Test > void measureHeapUsage() throws InterruptedException { > int count =10000000; > testee.addAll(IntStream.range(0, count) > .mapToObj(i -> MessageUid.of(i + 1)) > .collect(Collectors.toList())); > Thread.sleep(1000); > System.out.println("GCing"); > System.gc(); > Thread.sleep(1000); > > System.out.println(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed()); > } > {code} > Now, from let's take a classical production deployment I get: > - Some users have up to 2.5 million messages in their INBOX > - I can get an average of 100.000 messages for each user > So for a small scale deployment, we are already "consuming" ~300 MB of memory > just for the UID <-> mapping. > Scaling to 1.000 users on a single James instance we clearly see that HEAP > consumption will start being a problem (~3GB) without even speaking of target > of 10.000 users per James I do have in mind. > It's worth mentioning that IMAP being statefull, and UID <-> MSN mapping > attached to a selected mailbox, such a mapping is long lived: > - Multiple small objects would need to be copied individually by the GC, > putting pressure during long gen > - Those long lived object will eventually be promoted to old gen, thus the > more there is the longer the resulting stop-the-world GC pauses will be. > h3. Temporary fix ? > We can get rid of the object boxing in UidMsnConverter by using primitive > type collections for instance provided by fastutils project. > The same bench was down to 84MB. > Also, we could get things more compact by using an INT representation of > UIDs. (Those are most of the case below 2 billions, to be above this there > need to be more than 2 billion emails transiting through one's mailbox which > is highly unlikely). A fallback to "long" storage can be setted up if a UID > above 2 billion is observed. > This such a compact int storage we are down to 46MB. > So taking the prior mentioned numbers we could expect a 1.000 people > deployment to require ~400 MB and a larger scale 10.000 people deployment on > a single James to consume up to 4GB. Not that enjoyable but definitly more > manageable. > Please note that primitive collections are more GC friendly as their elements > are manages together, as a single object (backing array). > h3. What other mail servers do > I found references to Dovecote, which does a similar algorithm compared to > us: binary search on a list of uids. The noticeable difference is that this > list of UIDs is held on disk and not in memory as we do. > References: > https://doc.dovecot.org/developer_manual/design/indexes/mail_index_api/?highlight=time > Of course, such a solution would be attractive... We could imagine keeping > the last 1.000 uids in memory, which would most of the time be the ones used > for MSN resolution and locate the rest on-disk, use them only when needed and > thus dramatically reduce heap pressure. > Making UidMsnConverter an interface with a backing factory would enable > different implementation to co-exist and allow some experimentation ;-) -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org For additional commands, e-mail: server-dev-h...@james.apache.org