Jesse Guardiani <[EMAIL PROTECTED]> writes:

> On Sat, 09 Aug 2003 13:42:06 -0400, Jesse Guardiani
> <[EMAIL PROTECTED]> wrote:
>
> >On 09 Aug 2003 01:33:52 -0500, Tim Legant <[EMAIL PROTECTED]> wrote:
> >
> >>Jesse Guardiani <[EMAIL PROTECTED]> writes:
> >>
> >>> I do this:
> >>>
> >>>     # Print headers
> >>>     print msg_as_string(msgin)
> >>>     # Print body
> >>>     print sys.stdin.read()
> >>
> >>I don't think this does what you think it does.
> >
> >Are you 100% sure about that?
>
> Well, even if you're not - I am, now.
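(A minimal demonstration of the issue in the quoted snippet, written in
modern Python with io.StringIO standing in for sys.stdin: read() with no
size argument returns the entire remaining input as one string, so the
whole message body is held in memory at once.)

```python
import io

# Stand-in for sys.stdin; io.StringIO behaves like a text file object.
stream = io.StringIO('Subject: test\n\nHello\nWorld\n')

# read() with no size argument returns the ENTIRE remaining input as a
# single string object -- the whole message ends up in memory at once.
body = stream.read()
print(type(body).__name__)   # str
print(stream.read() == '')   # True: the stream is now exhausted
```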
Good. I am, too. Here's a pointer to the Python docs for the read()
method of file objects:

    http://www.python.org/doc/2.2.2/lib/bltin-file-objects.html

Note that it says "The bytes are returned as a string object."

> My test message was running through my secondary MX for some
> reason, which caused it to avoid being processed by my script.
>
> Oops. :-)

<grin> It happens.

> I did some checking, and in a more fool (that's me) proof test, this
> appears to do the job:
>
>     while 1:
>         data = sys.stdin.read(256)
>         if data != '':
>             sys.stdout.write(data)
>         else:
>             sys.stdout.flush()
>             break
>
> I bet it's a bit slower than a straight copy, but it fits the bill for
> me.

This is the correct way to reduce memory use. You could even use a
bigger buffer, say 8K or so.

The outstanding problem with doing this is the filter. There are at
least three rules I can think of off the top of my head that require
the entire message body: 'body', 'body-file' and 'pipe'. The 'pipe'
rule could easily be re-implemented to page the message through to the
filter program, as in your code above. It's not so easy for the
'body*' rules. The problem is that a regular expression might match a
string composed of, say, 10 characters at the end of one buffer read
and 12 characters at the beginning of the next. The simple
implementation would be to search each buffer separately, but then the
text that should match in my example never would. You would need a
more complex algorithm. I know how to do it, but it's a much bigger
change than I want to make before 1.0.

Finally, if a filter uses more than one of those three rules, or any
one of them more than once, you'll be paging the message in multiple
times, which will undoubtedly be a speed hit. You could avoid this by
caching the entire message in memory if a particular filter required
it and then re-using the cached version in any other rules that
require it.
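(One way to sketch the "more complex algorithm" for matching across
buffer boundaries, in modern Python -- this is an illustration, not TMDA
code: keep the tail of the previous chunk and prepend it to the next one,
so a match straddling a boundary is still seen. It assumes no match is
longer than the overlap window.)

```python
import io
import re

def search_stream(stream, pattern, bufsize=8192, overlap=256):
    """Return True if `pattern` matches anywhere in `stream`,
    without reading the whole stream into memory at once.

    The last `overlap` characters of each chunk are carried over and
    prepended to the next chunk, so a match that straddles a buffer
    boundary is still found. Only valid if no match exceeds `overlap`.
    """
    regex = re.compile(pattern)
    tail = ''
    while 1:
        data = stream.read(bufsize)
        if not data:
            return False
        window = tail + data
        if regex.search(window):
            return True
        tail = window[-overlap:]

# Example: "boundary" is split across two 8-character reads
# ('xxxxboun' then 'daryxxxx'), but the overlap still catches it.
msg = io.StringIO('xxxxboundaryxxxx')
print(search_stream(msg, 'boundary', bufsize=8))  # True
```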
Caching the whole message gets us right back to where we are today,
with the entire body in RAM. It is also more complex than I want to
tackle before release 1.0.

> I'll do some more testing then post the results. Perhaps this *is*
> a "feature" that is better implemented as an option. The speed
> tests should let me know...
>
> Sorry I spoke so soon, and thanks for pointing that out to me Tim!

No problem. Just didn't want you thinking you'd found the magic bullet
and then ending up not seeing any real improvement.

Tim

_________________________________________________
tmda-workers mailing list ([EMAIL PROTECTED])
http://tmda.net/lists/listinfo/tmda-workers
