Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Robert Burrell Donkin Sun, 20 Jul 2008 12:22:36 -0700

On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
> Oleg Kalnichevski wrote:
>>
>> Robert Burrell Donkin wrote:
>>>
>>> On 7/18/08, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Robert Burrell Donkin wrote:
>>>>>
>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>>
>>>>>> Robert Burrell Donkin wrote:


<snip>

>>>> 2) ((TextBody) b).getReader(). This gives me a reader, so this supports
>>>> the "line" concept: I do expect this one to treat "non canonical"
>>>> newlines like the header/structure parser: if headers are allowed to
>>>> terminate with an isolated LF then also lines in text content should do
>>>> the same (because probably the whole mime message has LF instead of
>>>> CRLF). [RFC seems to suggest that the fact is that the MIME message is
>>>> encoded using LF instead of CRLF and that this specific encoding breaks
>>>> binary parts, but we want to be smarter wrt this issue].
>>>
>>> TextBody is part of the DOM. This can and should be addressed there
>>> (rather than in the parser). I think that doing this should satisfy
>>> both needs without compromising the performance of the parser.
>>>
>>
>> If this is indeed something we can all agree on, I can try to solve the
>> first problem (strict/lenient line delimiter handling) using a pluggable
>> strategy of some kind.
>>
>> Oleg
>
> My limited knowledge of mime4j details doesn't let me reply "+1". So I'll
> simply say what I expect from mime4j as a user:

it's important to understand that mime4j targets different kinds of
user. the pull parser is a low level application agnostic interface
aimed at experts who need performance. the DOM and SAX components are
higher level interfaces for less experienced users who are willing to
compromise flexibility and performance. each user will have different
expectations.

> Lenient line delimiter parsing:
> - consider isolated LF and CR in the mime stream as newlines as long as a
> newline concept exists in that specific place (everywhere but binary body
> parts having ContentTransferEncoding = "binary").

the low level interface should allow the user to determine whether
they want to canonicalise. the higher level interface should probably
canonicalise.
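
to make the idea concrete, here is a minimal sketch of how a caller could
choose the line delimiter handling. this is not mime4j code: the class,
the LineDelimiterStrategy interface and the STRICT/LENIENT names are all
hypothetical, and it works on a byte array rather than a stream for
brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of mime4j: the caller picks how line
// terminators are recognised.
final class LineDelimiterSketch {

    /** Hypothetical strategy interface (assumption, not a mime4j type). */
    interface LineDelimiterStrategy {
        /** Length of the terminator starting at index i, or 0 if there is none. */
        int terminatorLength(byte[] buf, int i);
    }

    /** Only CRLF ends a line. */
    static final LineDelimiterStrategy STRICT = (buf, i) ->
            (buf[i] == '\r' && i + 1 < buf.length && buf[i + 1] == '\n') ? 2 : 0;

    /** CRLF, isolated CR and isolated LF all end a line. */
    static final LineDelimiterStrategy LENIENT = (buf, i) -> {
        if (buf[i] == '\r') {
            return (i + 1 < buf.length && buf[i + 1] == '\n') ? 2 : 1;
        }
        return buf[i] == '\n' ? 1 : 0;
    };

    /** Splits buf into lines (terminators removed) using the given strategy. */
    static List<String> splitLines(byte[] buf, LineDelimiterStrategy strategy) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < buf.length) {
            int n = strategy.terminatorLength(buf, i);
            if (n > 0) {
                lines.add(new String(buf, start, i - start, StandardCharsets.US_ASCII));
                i += n;
                start = i;
            } else {
                i++;
            }
        }
        if (start < buf.length) {
            // trailing text without a terminator
            lines.add(new String(buf, start, buf.length - start, StandardCharsets.US_ASCII));
        }
        return lines;
    }
}
```

with the input "a LF b CRLF c CR d", LENIENT yields four lines while STRICT
yields two, so the choice stays with the caller and the strict path costs
nothing extra.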

> - This means that a CR in a base64 stream is a newline, a CR in a text/plain
> is a newline, a "CR<boundary> CR" sequence is a valid multipart boundary,
> "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF" sequences
> are valid separators between header and body because they are considered as
> equivalent to "CRLFCRLF".

i'm not sure i agree (i need to think about this a little more)
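
for reference, one possible reading of the proposal above is "a CRLF pair
is matched first, otherwise a lone CR or LF counts as one terminator, and
two terminators in a row separate header from body". the sketch below
encodes that reading only. it is an assumption about what was meant, the
names are hypothetical, and none of it is mime4j code.

```java
// Hypothetical sketch of the lenient header/body separator rule.
final class HeaderBodySeparatorSketch {

    /** Length of the lenient line terminator at index i, or 0 if there is none. */
    static int terminatorLength(CharSequence s, int i) {
        if (i >= s.length()) {
            return 0;
        }
        char c = s.charAt(i);
        if (c == '\r') {
            // prefer CRLF over an isolated CR
            return (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
        }
        return c == '\n' ? 1 : 0;
    }

    /** True if s is exactly two consecutive terminators, i.e. an empty line. */
    static boolean isHeaderBodySeparator(CharSequence s) {
        int first = terminatorLength(s, 0);
        if (first == 0) {
            return false;
        }
        int second = terminatorLength(s, first);
        return second > 0 && first + second == s.length();
    }
}
```

under this reading every sequence listed above is accepted, but a single
CRLF is not, because the CR and LF are consumed together as one
terminator. that greedy matching is the part that needs care.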

> - This also means that writing this stuff to output will result in a mime
> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
> body).

i'm happy for the high level DOM API to perform conversions on the streams.
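
a conversion of that kind could look roughly like the sketch below. it is
a hypothetical helper, not an existing mime4j class, and it works on a
String for brevity where real code would wrap a stream. it must not be
applied to bodies with a "binary" transfer encoding.

```java
// Hypothetical sketch: rewrite every line terminator as CRLF.
final class CrlfCanonicaliserSketch {

    /** Returns text with CRLF, isolated CR and isolated LF all written as CRLF. */
    static String canonicalise(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // an existing CRLF is consumed as one terminator
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append("\r\n");
            } else if (c == '\n') {
                out.append("\r\n");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

already canonical input passes through unchanged, so the conversion is
safe to apply more than once.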

> Strict line delimiter parsing (I don't care if we have this now, I just
> think we should have this in mind while factoring mime4j because it should
> be possible to implement this with no major changes).

this is a non-goal as far as i'm concerned. performant validating
parsers tend to be more difficult to write. if a validating engine is
needed then i'd prefer to approach the design without preconditions. i
need a fast robust parser that is able to cope with practical MIME
documents whether they are valid or not.

> - Isolated LFs and CRs are not considered newlines and result
> in errors raised by the parser (invalid header, invalid content, and so on)
> that will result in a parsing failure or (if the raised errors are ignored)
> in an invalid DOM (I'm not sure how we currently handle this case for
> unexpected 8bit content in a header, but it should be the same).
> - writing this content to output should result in well-formed content, so:
>  - if an LF in the header is somehow "encodable" as a valid sequence it
> should be parsed as LF and then encoded while outputting. If instead an LF
> in the header is not encodable then we should fail parsing or remove it (or
> convert it to "?" or anything similar) if we want to be lenient.

i'm happy for this to be added to the high level DOM

> I'm not saying that I want mime4j to support all of this before a release, I
> just want to understand if this is what you also expect and if this can be
> considered a common goal.

i'm happy to address your concerns by adding conversion code into the
higher level API layers but if mime4j seriously needs to compromise
the low level API then i'm not sure i can use this library for my mail
work either. in this case, i'd be happy to introduce a proposal for a
performant low level pull parser for MIME to the commons instead.

- robert

