Hi,

On Thu, 2019-11-28 at 15:51 +0700, Tellier Benoit wrote:
> Hello all,
>
> ## Context
>
> We are working on JMAP, and EMail::hasAttachments metadata is listed as
> a fast property.
>
> However to retrieve it today, we need to do a full message read in order
> to load attachments (as JMAP hasAttachment does not take inlined
> attachments into account and the mailbox property does).
>
> Also, while inspecting the code, MessageResult::getLoadedAttachments is
> never used with attachment bytes. This means that given an email with a
> 10 MB attachment, upon GetMessages call with full profile, we are going
> to read the full eml (10 MB) then load attachment bytes (10 MB) while
> the attachment could have not been loaded in the first place. In our
> little example we read 20 MB while only 10 MB could have been
> necessary.
>
> This attachment over-reading results in both performance and cost issues
> on the object storage, which is the topic me, René and Duc are currently
> working on.
>
> ## Involved POJOs
>
> Attachment (mailbox-api)
>  - id
>  - type
>  - bytes
>
> MessageAttachment (mailbox-api)
>  - attachment (of type Attachment)
>  - name
>  - cid
>  - isInline
>
> Attachment (jmap)
>  - blobId (derived from attachmentId)
>  - type
>  - name
>  - size
>  - cid
>  - isInline
>
> DAOAttachment (mailbox-cassandra)
>  - id
>  - blobId
>  - type
>  - size
>
>  - Message (mailbox-store) & MessageResult (mailbox-api) allows listing
> attachments. Content usage includes:
>  - Scanning search

I'm glad you triggered this discussion because it's a very important one if we want to have a fast server in the future.

I will share my analysis regarding James' usage of byte arrays.

There are various places in James where we load data as byte arrays. Why do we do that? This design decision is the outcome of the following reasoning:

 1. We want the raw data at hand because sometimes we have no other choice than to parse the MIME to do something meaningful with it.
 2. Abstracting the raw data is not easy: an InputStream is not always replayable and it brings resource management problems (who is in charge of closing it, etc.), mmap'd memory is not easily usable in Java, and we probably don't want to read the data again and again from the source.
 3. Then, given that most mail servers are OK to pretend emails should not be too big (let's say ~20MiB), James takes the easiest solution: load mails entirely in memory.

But what does it mean for James? If 100 users are reading or sending 5MiB emails every second and the lifetime of a mail in memory is about 5 seconds, you need at the very least 100 x 5MiB x 5 = 2500MiB (about 2.5GiB) just to keep these emails in memory (and I'm not even talking about needless copies or generated garbage that needs to be collected).

We probably won't ever gain much traction with such low performance.

What can we do to overcome this problem?

We should never load any email fully in memory. We should also leverage non-heap memory (like filesystem caches) to avoid putting too much pressure on the JVM. It means using mmap (a MappedByteBuffer, i.e. a direct buffer), using temp files to replay reads (with as few flushes as possible) and/or opening the InputStream at the very last moment (when we actually need to parse it).

And that leads me to the actual topic of this email: of course we need to precompute as many things as possible to avoid loading the raw data.
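
To make the temp-file and mmap idea above a bit more concrete, here is a minimal sketch using only the JDK. `SpooledContent` is a hypothetical helper name of mine, not an existing James class, and it assumes a local temp directory is available and fast enough. It only illustrates that a one-shot InputStream can be made replayable without keeping a byte[] on the heap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper (not part of James): spools a one-shot stream to a temp
// file so the content can be re-read without holding the whole mail on the heap.
final class SpooledContent implements AutoCloseable {
    private final Path file;
    private final long size;

    private SpooledContent(Path file, long size) {
        this.file = file;
        this.size = size;
    }

    // Files.copy transfers the stream through a small internal buffer,
    // so the full content is never materialized as a single byte[].
    static SpooledContent spool(InputStream in) throws IOException {
        Path file = Files.createTempFile("mail-", ".eml");
        try {
            long size = Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
            return new SpooledContent(file, size);
        } catch (IOException e) {
            Files.deleteIfExists(file);
            throw e;
        }
    }

    long size() {
        return size;
    }

    // Every call opens a fresh stream on the file, which makes reads replayable.
    InputStream openStream() throws IOException {
        return Files.newInputStream(file);
    }

    // Read-only memory mapping: the pages are backed by the OS page cache,
    // not by the Java heap. The mapping stays valid after the channel is closed.
    ByteBuffer map() throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
        }
    }

    @Override
    public void close() throws IOException {
        Files.deleteIfExists(file);
    }
}
```

A caller would spool the inbound stream once, then call openStream() each time a parser needs the content, and close() when the mail is done. Note that on some platforms (Windows in particular) a file that is still memory-mapped cannot be deleted, so the map() variant would need more care there.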

> ## Proposal
>
> Introduce a new POJO: MessageAttachmentMetadata (mailbox-api)
>  - id
>  - name
>  - cid
>  - isInline
>  - size
>  - type
>
>  - Message (mailbox-store) & MessageResult (mailbox-api) SHOULD return
> MessageAttachmentMetadata NOT MessageAttachment. Thus these metadata
> will be added at the FetchGroup.profile.MINIMAL.

Do we always need attachment information?

> We need to port "scanning search" to do on the fly message parsing.

I'm not sure I understand this sentence.

> This is OK as:
>  - memory-guice is not intended for production usage, no need to be
> performant
>  - Usage of scanning search is not exposed as a product
>  - jpa and maildir do not store attachments so recomputation is the
> current behaviour.
>
> ## Consequences
>
> JMAP Email::hasAttachment property would rely on
> FetchGroup.Profile.MINIMAL, allowing the implementation of
> https://github.com/apache/james-project/blob/master/src/adr/0013-precompute-jmap-preview.md
>
> JMAP GetMessages with full profile will read half as much data, allowing
> both a cost and performance improvement.
>
> Note that all callers reading full messages will also benefit from these
> changes (IMAP fetch, mailbox backup, review recomputation)
>
> ## Alternative
>
> We could merge MessageAttachment & Attachment together however this
> would lead to significant datastructure re-arranging for no behavioural
> gains and just a slightly more lean API.
>
> Thus I propose not to tackle this now.

I'm not sure I understand this alternative.

What about this one: remove bytes from Attachment and always call BlobStore when you need to read the raw data (or put it in a temp file for inbound emails)?

Looking at the code, it should not be that hard to implement, and we could monitor BlobStore usage by implementing something in glowroot.

That would be a first step toward a no-byte-array strategy.

[...]
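
As a rough illustration of the "remove bytes from Attachment" alternative, something along these lines is what I have in mind. All names here (`BlobSource`, `LazyAttachment`, `CountingBlobSource`) are made up for the example: `BlobSource` is a simplified stand-in and NOT the real James BlobStore API, and the read counter only hints at what monitoring could look like.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for a blob store (not the James BlobStore API).
interface BlobSource {
    InputStream read(String blobId);
}

// Attachment without a bytes field: only metadata is kept in memory,
// the raw data is fetched from the blob store when explicitly requested.
final class LazyAttachment {
    private final String blobId;
    private final String type;
    private final long size;

    LazyAttachment(String blobId, String type, long size) {
        this.blobId = blobId;
        this.type = type;
        this.size = size;
    }

    String blobId() { return blobId; }
    String type() { return type; }
    long size() { return size; }

    // The only code path that touches the raw data.
    InputStream content(BlobSource blobs) {
        return blobs.read(blobId);
    }
}

// In-memory BlobSource that counts reads, to show how usage could be observed.
final class CountingBlobSource implements BlobSource {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final AtomicInteger reads = new AtomicInteger();

    void put(String blobId, byte[] data) {
        blobs.put(blobId, data);
    }

    @Override
    public InputStream read(String blobId) {
        byte[] data = blobs.get(blobId);
        if (data == null) {
            throw new IllegalArgumentException("Unknown blob: " + blobId);
        }
        reads.incrementAndGet();
        return new ByteArrayInputStream(data);
    }

    int readCount() {
        return reads.get();
    }
}
```

With such a shape, listing attachments or answering hasAttachment only needs the metadata, and every actual read of the raw data goes through a single, observable call.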

Cheers,

--
Matthieu Baechler