Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Stefano Bagnara Mon, 21 Jul 2008 05:13:06 -0700

Robert Burrell Donkin wrote:

On Mon, Jul 21, 2008 at 10:11 AM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Robert Burrell Donkin wrote:

On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Robert Burrell Donkin wrote:

On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Oleg Kalnichevski wrote:

Robert Burrell Donkin wrote:

On 7/18/08, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Robert Burrell Donkin wrote:

On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Robert Burrell Donkin wrote:

<snip>

2) ((TextBody) b).getReader(). This gives me a reader, so this supports
the "line" concept: I expect this one to treat "non canonical" newlines
the way the header/structure parser does: if headers are allowed to
terminate with an isolated LF then lines in text content should do the
same (because probably the whole MIME message has LF instead of CRLF).
[The RFC seems to suggest that the MIME message is encoded using LF
instead of CRLF and that this specific encoding breaks binary parts, but
we want to be smarter wrt this issue].
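
As a point of comparison for the reader behaviour asked for above: the
standard java.io.BufferedReader.readLine() already treats CRLF, a lone LF
and a lone CR as line terminators. The sketch below only illustrates that
JDK behaviour; the class name LenientLines is hypothetical and nothing
here is taken from the mime4j sources.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of mime4j.
class LenientLines {
    // Splits text into lines, treating CRLF, a lone LF and a lone CR alike.
    static List<String> split(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // readLine() strips the terminator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mixed terminators: CRLF, LF, CR
        System.out.println(split("one\r\ntwo\nthree\rfour"));
    }
}
```

A Reader built this way cannot tell the caller which terminator was
present, which is acceptable for text but is the reason binary bodies need
a different path.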

TextBody is part of the DOM. This can and should be addressed there
(rather than in the parser). I think that doing this should satisfy
both needs without compromising the performance of the parser.

If this is indeed something we can all agree on, I can try to solve the
first problem (strict/lenient line delimiter handling) using a pluggable
strategy of some kind.

Oleg

My limited knowledge of mime4j details doesn't let me reply "+1". So I'll
simply say what I expect from mime4j as a user:

it's important to understand that mime4j targets different kinds of
user. the pull parser is a low level, application agnostic interface
aimed at experts who need performance. the DOM and SAX components are
higher level interfaces for less experienced users who are willing to
compromise flexibility and performance. each user will have different
expectations.

Lenient line delimiter parsing:
- consider isolated LF and CR in the mime stream as newlines as long as a
newline concept exists in that specific place (everywhere but binary body
parts having ContentTransferEncoding = "binary").

the low level interface should allow the user to determine whether
they want to canonicalise. the higher level interface should probably
canonicalise.

I have an alternative proposal, see the bottom of this message.

- This means that a CR in a base64 stream is a newline, a CR in a
text/plain is a newline, a "CR<boundary> CR" sequence is a valid multipart
boundary, "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF"
sequences are valid separators between header and body because they are
considered as equivalent to "CRLFCRLF".
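
The lenient separator rule above can be sketched as follows: a line
terminator is CRLF, a lone CR or a lone LF, and two terminators in a row
end the header block. This is only an illustration of the rule as stated
in this message; LenientSeparator and its methods are hypothetical names
and do not describe how mime4j actually scans for the body.

```java
// Hypothetical helper, not part of mime4j.
class LenientSeparator {
    // Returns the offset of the first body byte, or -1 if no pair of
    // consecutive line terminators is found.
    static int bodyStart(byte[] message) {
        int i = 0;
        boolean previousWasTerminator = false;
        while (i < message.length) {
            int len = terminatorLength(message, i);
            if (len == 0) {
                previousWasTerminator = false;
                i++;
            } else {
                i += len;
                if (previousWasTerminator) {
                    return i; // second terminator in a row: body starts here
                }
                previousWasTerminator = true;
            }
        }
        return -1;
    }

    // Length of the terminator at pos: 2 for CRLF, 1 for lone CR or LF, else 0.
    static int terminatorLength(byte[] b, int pos) {
        if (b[pos] == '\r') {
            return (pos + 1 < b.length && b[pos + 1] == '\n') ? 2 : 1;
        }
        return b[pos] == '\n' ? 1 : 0;
    }
}
```

CRLF is matched greedily, so "CRLF" on its own counts as one terminator
while "LFCR" counts as two. A message that begins with a blank line (no
headers at all) is not handled by this sketch.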

i'm not sure i agree (i need to think about this a little more)

Ok, let me know your doubts as you get them.

- This also means that writing this stuff to output will result in a mime
stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
body).

i'm happy for the high level DOM API to perform conversions on the
streams.

Strict line delimiter parsing (I don't care if we have this now, I just
think we should have this in mind while factoring mime4j because it should
be possible to implement this with no major changes).

this is a non-goal as far as i'm concerned. performant validating
parsers tend to be more difficult to write. if a validating engine is
needed then i'd prefer to approach the design without preconditions. i
need a fast robust parser that is able to cope with practical MIME
documents whether they are valid or not.

Ok, no one seems to care about strict parsing, so let's forget about this
for now, but please help me understand this:
I see mime4j already has a strict parsing concept (throwing exceptions vs
monitor calls when it encounters malformed/unexpected content): what is
the rationale for needing the current strict parsing while not needing
strict parsing of the CRLF delimiter?

- LFs and CRs are not newlines: they are not considered newlines and
result in errors raised by the parser (invalid header, invalid content,
and so on) that will cause a parsing failure or (if the raised errors are
ignored) an invalid DOM (I'm not sure how we currently handle this case
for unexpected 8bit content in a header, but it should be the same).
- writing this content to output should result in well-formed content, so:
 - if an LF in the header is somehow "encodable" as a valid sequence it
should be parsed as LF and then encoded while outputting. If instead an LF
in the header is not encodable then we should fail parsing or remove it
(or convert it to "?" or anything similar) if we want to be lenient.

i'm happy for this to be added to the high level DOM

I'm not saying that I want mime4j to support all of this before a release,
I just want to understand if this is what you also expect and if this can
be considered a common goal.

i'm happy to address your concerns by adding conversion code into the
higher level API layers but if mime4j seriously needs to compromise
the low level API then i'm not sure i can use this library for my mail
work either. in this case, i'd be happy to introduce a proposal for a
performant low level pull parser for MIME to the commons instead.

I'm working on a solution having readLine methods not returning the
newline chars so that the user of readLine does not need to care about the
line delimiter.
This way we can tune the line delimiter inside the
BufferedLineReaderInputStream and not everywhere else.

users of the low level API may well care about preservation of line
endings

What exactly is part of the low level API?


the pull parser

I'm not sure I understand how I preserve line endings in headers in the
current implementation.


the role of the parser is just to detect and parse the headers


is AbstractEntity part of the pull parser (low level api or not?)

My proposal does not change what you see from outside that class, it only
changes the contract between that class and the LineReaderInputStream (the
line ending stripping has been moved to the stream readLine method while
previously it was in the AbstractEntity method just after readLine was
called).
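
The contract change described here (readLine hands back the line content
with the delimiter already removed, so callers never strip it themselves)
can be sketched roughly as below. This is not the proposed patch and not
the mime4j LineReaderInputStream API: StrippingLineReader and its methods
are hypothetical, and the sketch assumes only CRLF and LF count as
delimiters, with a lone CR left in the line as data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader, not the mime4j class.
class StrippingLineReader {
    private final InputStream in;

    StrippingLineReader(InputStream in) {
        this.in = in;
    }

    // Returns the next line without its CRLF or LF, or null at end of stream.
    byte[] readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean pendingCr = false; // a CR seen but not yet classified
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                return line.toByteArray(); // a pending CR was part of CRLF: dropped
            }
            if (pendingCr) {
                line.write('\r'); // lone CR stays in the line as data
            }
            pendingCr = (b == '\r');
            if (!pendingCr) {
                line.write(b);
            }
        }
        if (pendingCr) {
            line.write('\r');
        }
        return sawAny ? line.toByteArray() : null;
    }

    // Convenience wrapper for trying the reader on an in-memory string.
    static List<String> lines(String text) {
        StrippingLineReader reader = new StrippingLineReader(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1)));
        List<String> result = new ArrayList<>();
        try {
            byte[] line;
            while ((line = reader.readLine()) != null) {
                result.add(new String(line, StandardCharsets.ISO_8859_1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With the stripping kept in one place, the set of accepted delimiters can
be changed inside the reader without touching its callers; the cost is
that a caller can no longer see which delimiter was used.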

AFAIK the DOM based API has never correctly preserved line endings in
headers. i've thought about this on occasion and the conclusion i've
always reached is that this would be challenging to implement in a
performant fashion. i think i'd approach the design from a different
direction: insist that the mail is stored in a file, then use a memory
mapped file and nio to avoid double buffering (rather than use the
pull parser)
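
A rough sketch of the memory mapped approach mentioned above, using only
standard java.nio calls: the stored message is mapped read-only and a part
of it is exposed as a view over the mapping rather than as a copy. The
class MappedMessage, the method view and the offsets passed to it are
hypothetical; how a DOM would locate those offsets is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of mime4j.
class MappedMessage {
    // Returns a read-only view of bytes [from, to) of the file without copying them.
    static ByteBuffer view(Path file, int from, int to) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer all =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            ByteBuffer window = all.duplicate();
            window.position(from);
            window.limit(to);
            return window.slice(); // shares the mapped memory, original bytes intact
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the view exposes the stored bytes untouched, whatever line endings
the file contains are preserved.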

For DOM access it sounds like a good plan. Of course the low level or
SAX APIs are much better for filtering streams.

"Client code" for LineReaderInputStream should use readLine ONLY when
line recognition is needed (as already happens).

I have already coded a solution doing this (and using only CRLF and LF as
line delimiters, like the current behaviour).

I'm running a few tests, I'll probably create a JIRA and a proposal
tomorrow.

this proposal seems likely to reduce the correctness and usefulness of
the low level parser in order to address an issue in the high level
API

I'm not sure how we can deal with CR-LF in a consistent way if we don't do
this at a low level, but maybe I'm missing something.


line endings can be handled consistently by canonicalising in the
higher level APIs. the DOM API is not performant and IMO it would be
perfectly acceptable to canonicalise line endings at this level.
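
One way such higher level canonicalisation could look is a filter applied
when a text body is written out, turning every lone CR or lone LF into
CRLF and leaving existing CRLF pairs alone. This is a sketch under that
assumption only: CrlfOutputStream and canonicalise are hypothetical names,
not mime4j classes, and the filter must not be applied to bodies with a
"binary" transfer encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical filter, not part of mime4j.
class CrlfOutputStream extends FilterOutputStream {
    private boolean lastWasCr = false;

    CrlfOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        if (b == '\r') {
            out.write('\r');
            out.write('\n'); // emit CRLF as soon as a CR is seen
        } else if (b == '\n') {
            if (!lastWasCr) { // an LF right after a CR was already emitted
                out.write('\r');
                out.write('\n');
            }
        } else {
            out.write(b);
        }
        lastWasCr = (b == '\r');
    }

    // Convenience wrapper for trying the filter on an in-memory string.
    static String canonicalise(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (CrlfOutputStream out = new CrlfOutputStream(sink)) {
            // FilterOutputStream routes array writes through write(int).
            out.write(text.getBytes(StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

Keeping the conversion in a wrapper like this leaves the bytes seen by the
low level parser untouched, which is the separation argued for above.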

I made a list of the classes that have to deal with line endings in the
previous message. I identified 6 of them: can you help me classify the 6
classes by the level they belong to (DOM / SAX / low level pull parser)?

if canonicalisation is forced into the low level API then this would
make mime4j unsuitable for more general use including some of the mail
usages i'm interested in. if this is the consensus then i'm very happy
to simply create a new library that is suitable for more general usage
(including advanced mail) either here or somewhere else.

My change does not canonicalize anything: I simply changed the contract
for the readLine method. I'm not able to identify a use of the mime4j
library that changes its behaviour after applying my proposed patch.

This does not mean that the patch itself is good, but I don't understand
your argument: my code does not canonicalize anything, it does exactly
the same as what happens now. In fact the test results (excluding the
tests specific to the readLine method call) didn't change their expected
values.


Stefano

