RE: Memory, Large messages and SharedInputStreams (Was: Apache James / JAMES-134)

Ofir Gross Tue, 24 Jan 2006 01:27:17 -0800

Ofir Gross wrote:
Hi,
I share my thoughts:
I appreciate this. Sorry If it takes so much to reply but I've been busywith the spool synchronization problems and with my day-job.
I'd like to keep the conversation on the dev-list because it is archivedfor future reference and somebody else could eventually partecipate to thebrainstorming.
 There is a problem with the constructor:

   public MimeMessage(MimeMessage source) throws MessagingException {
        [...]
        source.writeTo(bos);       <<----------Problem is here
[...]
Because it __always__ saves into a ByteArrayInputStream which is inmemory. The only way to avoid it is either __not__ to call thatconstructor, or to subclass it like I did, in order to overide thismethod. When I have overiden this constructor, I saved to a temporary fileinstead of a byte array.
Ok, that's clear, we must remove calls to that constructor from our code,and we should use new MimeMessage(Session s, InputStream is).
Furthermore we should pass a SharedInputStream to that constructor.

How is it done technically? How to do the conversion fromMimeMessage(MimeMessage source) to MimeMessage(Session s, InputStream is) ?

You suggested to use the existing repository instead, which is a goodidea. But two questions:1. how do I locate the message in the repository form the MimeMessage"source"?
I think this is not possible. You can get the source starting from the Mailobject in the repository but not viceversa.
Why do you need that?

I wanted to use it in a subclass_of_MimeMessage that would override thatconstructor and there, in that overriding constructor, it would use themessage in the repository in order to construct the newsubclass_of_MimeMessage. This is instead of using a temporary file like Idid (or a byte array like in the original MimeMessage constructor). If itisn't possible now, it could be made possible in a subclass by putting a keyfield in that subclass, that would contain data that would serve as areference to that message in the repository. Then in places where Jamesconstructs a new MimeMessage from the repository, it would instead constructthe subclass, and then add to it the key that references that message in therepository. This way when that object would be passed in the future to sucha constructor, it would contain a reference to its place in the repository,and so it could be constructed from the repository.

2. Does every MimeMessage exist in the repository?
Not every mimemessage, because when we create a new mimemessage fromscratch (like the bounce mailet) we create the MimeMessage and after thatwe send it to James that store it in the spool.BTW most time when we deal with big messages we probably received them fromthe SMTPServer and we already have them in streamrepositories ormailrepositories.

So I understand that some times it may happen that there would be a bigmessage not in the repository. If that message would have aSharedInputStream contentStream, then no problem. If not, then it may be aproblem. One way to deal with it is: If the message is immediately stored atthe spool, then its repository place reference in the subclass can beupdated, and so it will have its place in the repository. If the spoolstoring doesn't happen immediately then a temporary file should be used. Iguess the user of that class would need to tell it, if it should use atemporary file. Or have the spool write the repository reference in, when itsaves the message in the repository, and the user will tell the subclass ifhe/she is done with it. Then the subclass would check itself, and if it istoo big, and don't have a SharedInputStream, and is not in the repository,it will make a temp file. If this scheme would be used then in most casesthere would be no temp file. But it still leaves room for improvement sinceI want to eliminate the temp file. An other way to deal with it, is that theusers will make sure, that if the message is big, it will have aSharedInputstream contentStream.

When the SMTPServer receive a new message it currently create a newMailImpl using the "public MailImpl(String name, MailAddress sender,Collection recipients, InputStream messageIn)" costructor.That constructor create a new MimeMessageSource using the InputStream fromthe socket (new MimeMessageInputStreamSource(name, messageIn);)MimeMessageInputStreamSource currently store the message in a .m64 file andeventually provide the stream to the following users:
public synchronized InputStream getInputStream() throws IOException {
   return new BufferedInputStream(new FileInputStream(file));
}

This is the first step.
Maybe we should start changing the MimeMessageInputStreamSource to providea Shared input stream?!?

Yes, it looks like a good idea. There is already in javaMail 1.3.2 acom.sun.mail.util.SharedFileInputStream class. Version 1.4 has also ajavax.mail.util.SharedFileInputStream.



The preceding method could be re-written as:

public synchronized InputStream getInputStream() throws IOException
{
  return new  SharedFileInputStream(file);
}

Very easy.

Then, when a message is stored in a dbfile repository then the header isstored in the body field of the repository db table while the body of themessage is stored in a streamrepository.
MimeMessageJDBCSource is the object that handle this behaviour. So everytime we read a message from the db repository we use this object.
We call this method of that object to get the InputStream:
---
public synchronized InputStream getInputStream() throws IOException
---
and it create a new SequenceInputStream using a ByteArrayInputStream of theheader and a the input stream provided by the streamrepository.get().
---
InputStream in = new ByteArrayInputStream(headers);
if (sr != null) {
   in = new SequenceInputStream(in, sr.get(key));
}
---
The implementation of the streamrepository.get used is inFile_Persistent_Stream_Repository and you can see it uses a
---
final ResettableFileInputStream stream = new ResettableFileInputStream(getFile( key ) );
---
Maybe we should work on the above files to always be able to provide sharedinputstreams.


Yes, it looks possible to me.

The other constructor: "public MimeMessage(Session session, InputStreamis)" will be OK if the InputStream provided to it will be aSharedInputStream, and that can be done by tracking down calls to thisconstractor (by grep "new MimeMessage"), and wrapping up the providedinput stream with a SharedInputStream implementing wrapper. I did thatgrep and looked at the results, and this looks possible.
The main one is on the MimeMessageWrapper.loadMessage()
----
in = source.getInputStream();
headers = loadHeaders(in);
ByteArrayInputStream headersIn = newByteArrayInputStream(headers.toByteArray());
in = new SequenceInputStream(headersIn, in);

message = new MimeMessage(session, in);
----
source is the MimeMessageJDBCSource we already seen before and we couldchange it to provide a SharedInputStream (to avoid a copy to a new file)You see a similar pattern as before: we create a new SequenceInputStreamfrom a ByteArrayInputStream and the stream provided by the source.We should find a way to have a SharedInputStream at the end of all thissteps, without the need to copy the stream on a new file.
We could introduce your proposed wrapped only in a few mailet but we shouldavoid using it in our core because we already have 2 wrappers over themimemessage and we can get better/cleaner result following the path Idescribe in this mail.

If there are too many wrappers already, maybe one of them(MimeMessageWrapper?) could be modified so it would overrideMimeMessage(MimeMessage source) constructor, and save a key reference to themessage in the repository, and so no new class would be introduced.

Maybe if all InputStreams provided to that constructor will beSharedInputStream, then all MimeMessage will have a SharedInputStreamimplementing contentStream as well? If it is true, then the overidingconstructor of "MimeMessage(MimeMessage source)" could use that stream,and it will not need to locate the message in the repository. But I am notsure wether it is true, because the constructor "MimeMessage(Sessionsession)" don't touch content, or contentStream, and so if it isconstructed that way, it will not have either, and so there will be noInputStream for it, in case it is sent this way to theMimeMessage(MimeMessage source) constructor.
When the message is constructed from someone else using the newMimeMessage(session) we have no power on how it is handled.IMHO we should start optimizing our own operations then we could write afew docs on how to write optimized mailets.
A further step would be to use streaming operations also for "db" onlyrepositories. We currently use blob operations that write and read fullbytearrays but most new dbs/jdbc drivers correctly supports the use ofstreams for write and read operations of large contents. Here there wouldbe one more issue because they don't provide (obviously) SharedInputStreamsbut their own InputStreams and I don't know how we could implement it. Weprobably should use your proposed wrapper if we need that, or otherwisecreate a wrapper over the previous InputStream implementing theSharedInputStream and simply retreaving a new InputStream at every call tonewStream (returning, in turn, a new SharedInputStream).

I have seen that usage of byte arrays in the retrieval from the DB, andwondered why getInputStream was not used. Now I know. Well, if todayDBs/drivers are ok, it is better to use it, since it will save RAM. As ofthe SharedInputStream version for it, a wrapper class can be made that wouldsave the stuff that is needed for the stream retrieval from the DB, i.e. itwill do the same thing that is done by the original method when it retrievesthe message from the DB. Thus a new stream would be created, and the wrappercan save it in a member object. The start/end pair of newStream(start,end)can also be saved and be considered when the wrapper class forwards reads tothe underlying InputStream member object. (So it will not allow reading outof the start/end boundaries)

The worst part of this work is that it is very difficult to unit-testmemory issues: any idea on how to test it?

I don't know exactly what you mean, so what I write is probably not theright answer to it, but anyway here goes: if we have two versions, an oldone that works and a new one that we test, we can make both versions producean output file, and then compare the files (diff -s file1 file2). Thisoutput file may be constructed by adding calls to various methods of thetested class, and writing the results to the file. If the only thingdifferent is one class, the difference would suggest an error in the newclass. Since I don't know what you meant, maybe you could refer me to someexisting unit-tests that you have, so I will have a better idea of what youmeant.


Stefano


_________________________________________________________________

FREE pop-up blocking with the new MSN Toolbar get it now!http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Memory, Large messages and SharedInputStreams (Was: Apache James / JAMES-134)

Reply via email to