Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Robert Burrell Donkin Mon, 21 Jul 2008 01:24:54 -0700

On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
> Robert Burrell Donkin wrote:
>>
>> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>>>
>>> Oleg Kalnichevski wrote:
>>>>
>>>> Robert Burrell Donkin wrote:
>>>>>
>>>>> On 7/18/08, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Robert Burrell Donkin wrote:
>>>>>>>
>>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>> Robert Burrell Donkin wrote:
>>
>> <snip>
>>
>>>>>> 2) ((TextBody) b).getReader(). This gives me a reader, so it supports
>>>>>> the "line" concept: I expect this one to treat "non-canonical"
>>>>>> newlines the way the header/structure parser does: if headers are
>>>>>> allowed to terminate with an isolated LF, then lines in text content
>>>>>> should be allowed to do the same (because the whole MIME message
>>>>>> probably uses LF instead of CRLF). [The RFC seems to suggest that such
>>>>>> a MIME message is simply encoded using LF instead of CRLF, and that
>>>>>> this specific encoding breaks binary parts, but we want to be smarter
>>>>>> about this issue.]
>>>>>
>>>>> TextBody is part of the DOM. This can and should be addressed there
>>>>> (rather than in the parser). I think that doing this should satisfy
>>>>> both needs without compromising the performance of the parser.
>>>>>
>>>> If this is indeed something we can all agree on, I can try to solve the
>>>> first problem (strict/lenient line delimiter handling) using a pluggable
>>>> strategy of some kind.
>>>>
>>>> Oleg
>>>
>>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>>> will simply say what I expect from mime4j as a user:
>>
>> it's important to understand that mime4j targets different kinds of
>> user. the pull parser is a low level application agnostic interface
>> aimed at experts who need performance. the DOM and SAX components are
>> higher level interfaces for less experienced users who are willing to
>> compromise flexibility and performance. each user will have different
>> expectations.
>>
>>> Lenient line delimiter parsing:
>>> - consider isolated LF and CR in the mime stream as newlines as long as a
>>> newline concept exists in that specific place (everywhere but binary body
>>> parts having ContentTransferEncoding = "binary").
>>
>> the low level interface should allow the user to determine whether
>> they want to canonicalise. the higher level interface should probably
>> canonicalise.
>
> I have an alternative proposal, see the bottom of this message.
>
>>> - This means that a CR in a base64 stream is a newline, a CR in a
>>> text/plain body is a newline, a "CR<boundary> CR" sequence is a valid
>>> multipart boundary, and "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or
>>> "LFLF" sequences are valid separators between header and body because
>>> they are considered equivalent to "CRLFCRLF".
>>
>> i'm not sure i agree (i need to think about this a little more)
>
> Ok, let me know your doubts as you get them.
>
>>> - This also means that writing this stuff to output will result in a MIME
>>> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
>>> body).
>>
>> i'm happy for the high level DOM API to perform conversions on the
>> streams.
>>
>>> Strict line delimiter parsing (I don't care if we have this now, I just
>>> think we should have this in mind while factoring mime4j because it
>>> should be possible to implement this with no major changes).
>>
>> this is a non-goal as far as i'm concerned. performant validating
>> parsers tend to be more difficult to write. if a validating engine is
>> needed then i'd prefer to approach the design without preconditions. i
>> need a fast robust parser that is able to cope with practical MIME
>> documents whether they are valid or not.
>
> Ok, no one seems to care about strict parsing, so let's forget about this
> for now, but please help me understand this:
> I see mime4j already has a strict parsing concept: throwing exceptions
> vs. monitor calls when it encounters malformed/unexpected content. What is
> the rationale for needing the current strict parsing while not needing
> strict CRLF delimiter parsing?
>
>>> - LFs and CRs are not newlines: they are not considered newlines and
>>> result in errors raised by the parser (invalid header, invalid content,
>>> and so on) that will result in a parsing failure or (if the raised
>>> errors are ignored) in an invalid DOM (I'm not sure how we currently
>>> handle this case for unexpected 8bit content in a header, but it should
>>> be the same).
>>> - writing this content to output should result in well-formed content,
>>> so:
>>>  - if an LF in the header is somehow "encodable" as a valid sequence, it
>>> should be parsed as LF and then encoded on output. If instead an LF in
>>> the header is not encodable, then we should fail parsing, or remove it
>>> (or convert it to "?" or anything similar) if we want to be lenient.
>>
>> i'm happy for this to be added to the high level DOM
>>
>>> I'm not saying that I want mime4j to support all of this before a
>>> release, I just want to understand if this is what you also expect and
>>> if this can be considered a common goal.
>>
>> i'm happy to address your concerns by adding conversion code into the
>> higher level API layers but if mime4j seriously needs to compromise
>> the low level API then i'm not sure i can use this library for my mail
>> work either. in this case, i'd be happy to introduce a proposal for a
>> performant low level pull parser for MIME to the commons instead.
>
> I'm working on a solution in which the readLine methods do not return the
> newline chars, so that the user of readLine does not need to care about the
> line delimiter.
> This way we can tune the line delimiter handling inside the
> BufferedLineReaderInputStream rather than everywhere else.


users of the low level API may well care about preservation of line endings
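
as a rough illustration of the concern (a sketch only: LineEndingSketch and
both of its methods are hypothetical names invented for this example, not
mime4j API), once the delimiter is dropped a CRLF message and an LF message
become indistinguishable, and the original bytes can no longer be rebuilt:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of mime4j: contrasts two ways of reading lines.
class LineEndingSketch {

    /** Splits after every LF; each line keeps its trailing CRLF or LF. */
    static List<String> linesWithDelimiters(String raw) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < raw.length(); i++) {
            if (raw.charAt(i) == '\n') {
                lines.add(raw.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < raw.length()) {
            lines.add(raw.substring(start)); // final line with no delimiter
        }
        return lines;
    }

    /** Same split, but the trailing CRLF or LF is removed from each line. */
    static List<String> linesWithoutDelimiters(String raw) {
        List<String> lines = new ArrayList<>();
        for (String line : linesWithDelimiters(raw)) {
            int end = line.length();
            if (end > 0 && line.charAt(end - 1) == '\n') {
                end--;
                // Only drop a CR that is part of a CRLF pair.
                if (end > 0 && line.charAt(end - 1) == '\r') {
                    end--;
                }
            }
            lines.add(line.substring(0, end));
        }
        return lines;
    }

    public static void main(String[] args) {
        String crlf = "Subject: a\r\n\r\nbody\r\n";
        String lf = "Subject: a\n\nbody\n";

        // With delimiters dropped, the two inputs look identical.
        System.out.println(
            linesWithoutDelimiters(crlf).equals(linesWithoutDelimiters(lf)));

        // With delimiters kept, the input can be reassembled exactly.
        System.out.println(String.join("", linesWithDelimiters(crlf)).equals(crlf));
    }
}
```

a caller that needs to re-emit a part unchanged (for example, to check a
signature over it) can only do so with the first variant.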

> "Client code" for LineReaderInputStream should use readLine ONLY when line
> recognition is needed (as already happens).
>
> I have already coded a solution doing this (and using only CRLF and LF as
> line delimiters, like the current behaviour).
>
> I'm running a few tests, I'll probably create a JIRA and a proposal
> tomorrow.

this proposal seems likely to reduce the correctness and usefulness of
the low level parser in order to address an issue in the high level
API
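
for comparison, here is a minimal sketch of how the conversion could sit
above the parser instead. CrlfCanonicaliser is a hypothetical class written
for this example, not existing mime4j code. it rewrites isolated CR and LF
as CRLF on bytes the low level parser has already delivered untouched, and
it would have to be skipped for bodies with a "binary" transfer encoding:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical high-level helper, not part of mime4j.
class CrlfCanonicaliser {

    /** Returns a copy of raw in which CRLF, bare CR and bare LF all become CRLF. */
    static byte[] canonicalise(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length + 16);
        for (int i = 0; i < raw.length; i++) {
            byte b = raw[i];
            if (b == '\r') {
                // Consume the LF of an existing CRLF pair so it is not doubled.
                if (i + 1 < raw.length && raw[i + 1] == '\n') {
                    i++;
                }
                out.write('\r');
                out.write('\n');
            } else if (b == '\n') {
                out.write('\r');
                out.write('\n');
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] mixed = "a\rb\nc\r\nd".getBytes(StandardCharsets.US_ASCII);
        String result = new String(canonicalise(mixed), StandardCharsets.US_ASCII);
        System.out.println(result.equals("a\r\nb\r\nc\r\nd"));
    }
}
```

with this split the pull parser keeps handing out exactly what was on the
wire, and only callers that ask for canonical text pay for the conversion.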

- robert
