Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Stefano Bagnara Thu, 17 Jul 2008 11:22:46 -0700

Oleg Kalnichevski wrote:

Stefano Bagnara wrote:
Stefano Bagnara wrote:
I noticed that at some point in the past the EOLConvertingInputStream was removed from the chain.
I think this creates issues when we parse an input file that has only \n and write it to the output.
- It seems that most of our parsing code only checks for \n (what happens when there are only \r instead? what should we do?)
As far as I know a single CR is not used as a valid line delimiter anywhere. Please correct me if I am wrong.


AFAIK old Mac OS (before X) used CR as its line delimiter.
This is analogous to Unix systems using LF.

- If the message has only newlines (\n) it seems mime4j ends up outputting headers with CRLF and body with LF.
Why is it a problem? Headers serve a specific role. They convey metadata about a content body. The transport aspects of metadata are irrelevant, whereas one _usually_ does not want a content body to go through a process of unnecessary transformation.

I don't understand how that "specific role" relates to the RFC: I'm talking about RFC compliance and real-world cases as two different things. First we have to understand what it means to be RFC compliant, what valid MIME content is, and what valid "permissive parsing" is from the RFC's point of view (as an example, if we had not read the RFC we would now have a non-compliant MIME parser, because outer boundaries would not take precedence over nested boundaries).

You may know that there are specific MIME contents (e.g. delivery notifications) that have "header-style" lines in the content, so why are headers different from the body? Why should we convert headers to CRLF? Either we care about compliant output or I don't understand why we should put CRLF in headers.

- If the input message has CR-terminated lines, mime4j does not recognize them as line endings.
IMHO either we accept LF, CR, and CRLF as CRLF or we only accept CRLF.
I respectfully disagree.

That's good: disagreement allows discussion and helps us understand why something is good or bad. The important thing is that the mime4j community shares a goal; otherwise each of us will commit code that diverges from the goals of the others. We cannot simply change the behaviour of mime4j because one user needs it, without discussion or analysis. The RFC is our first resource, then we have real-world use cases to deal with, and user requirements are on a third layer and have to comply with the previous requirements.

Maybe the right solution is to make the behaviour configurable; I don't know. But I think it's clear we need to discuss the issue, because otherwise we simply move away from the RFCs.

If we do that we have to take care of encoded nested messages: they could again have LF, CR and CRLF, like the top-level stream.
What is the right approach? Should we add an EOLConvertingInputStream(CONVERT_BOTH) at every level of parsing, or should we fail to parse messages with bad newlines?
I don't like the current behaviour, where we accept some malformed data (a lone LF is treated as CRLF by our parser), we change some of it (the line endings between headers are converted to CRLF) and we still output malformed data.
Opinions?
I tried this patch and it seems to work fine (even though it breaks one of our core tests, which does not expect a CR in a header to be considered a newline):
Not only does this change completely revert the performance gains and make the whole refactoring exercise pointless due to an utterly inefficient implementation of EOLConvertingInputStream, it is also conceptually wrong (in my humble opinion), as it causes mime4j to corrupt 8bit encoded 'application/octet-stream' content. This basically renders mime4j incompatible with common browsers and HttpClient.

The performance of the EOLConvertingInputStream is not important at all if removing it leaves us with an unusable library. So let's talk about what we expect from the library, then we'll discuss how to make it performant. I believe we have the technical skills to write a performant EOL-converting stream.
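
To make the discussion concrete, here is a rough sketch of what "accept LF, CR and CRLF as CRLF" could look like as a stream filter. It is not the existing EOLConvertingInputStream and not a proposed patch: the class name CrlfNormalizingInputStream and the helper normalize() are made up for illustration, and the bulk read simply loops over the single-byte read on top of a BufferedInputStream, so it says nothing about how fast a tuned implementation could be. Note that it rewrites every bare CR or LF it sees, so running it over binary content would change the bytes, which is exactly the concern raised above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch: maps bare LF, bare CR and CRLF in the source to CRLF.
final class CrlfNormalizingInputStream extends FilterInputStream {
    private boolean pendingLf;  // we returned a CR and still owe the caller an LF
    private boolean lastWasCr;  // the previous source byte was a CR

    CrlfNormalizingInputStream(InputStream in) {
        super(new BufferedInputStream(in));
    }

    @Override
    public int read() throws IOException {
        if (pendingLf) {
            pendingLf = false;
            return '\n';
        }
        int b = in.read();
        if (lastWasCr && b == '\n') {
            // Second half of a CRLF pair whose CRLF was already emitted: skip it.
            b = in.read();
        }
        lastWasCr = (b == '\r');
        if (b == '\r' || b == '\n') {
            pendingLf = true;
            return '\r';
        }
        return b; // ordinary byte, or -1 at end of stream
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int b = read();
            if (b < 0) {
                break;
            }
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset is not tracked by this sketch
    }

    // Convenience helper for experiments: normalizes a whole byte array.
    static byte[] normalize(byte[] input) {
        try (InputStream s = new CrlfNormalizingInputStream(new ByteArrayInputStream(input))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = s.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether such a filter should sit at every level of parsing, only around headers, or behind a configuration switch is the open question of this thread; the sketch only shows that the conversion itself needs no more than two flags of state.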

About the 8bit encoded 'application/octet-stream': I think we just need to find the right RFC telling us what we have to do. The RFCs I have read about MIME and its applications always say that CR and LF must not appear alone and that an appropriate transfer encoding has to be used in order to avoid isolated LF and CR. It is not a matter of personal preference, it is a matter of RFC compliance. Let's find the docs first.


The definition of "8bit" that I can find (RFC-2045 Section 2.8) is:
-------------------
"8bit data" refers to data that is all represented as relatively
short lines with 998 octets or less between CRLF line separation
sequences [RFC-821]), but octets with decimal values greater than 127
may be used.  As with "7bit data" CR and LF octets only occur as part
of CRLF line separation sequences and no NULs are allowed.
-------------------

So this says that 8bit encoded 'application/octet-stream' content still has lines of at most 998 octets and does not include isolated CR or LF.
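
As a way to test real-world samples against that definition, something like the following check could be used. It is only a sketch of my reading of the quoted paragraph (no NUL, CR and LF only as a CRLF pair, at most 998 octets between CRLF sequences); the class and method names are invented for this mail and are not part of mime4j.

```java
// Hypothetical helper: checks a byte array against the quoted "8bit data" definition.
final class EightBitCheck {
    private EightBitCheck() {
    }

    /**
     * Returns null when the data satisfies the definition, otherwise a short
     * description of the first violation found.
     */
    static String firstViolation(byte[] data) {
        int lineLen = 0; // octets seen since the last CRLF
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b == 0) {
                return "NUL at offset " + i;
            }
            if (b == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    i++;          // consume the LF of the CRLF pair
                    lineLen = 0;  // a new line starts after the separator
                    continue;
                }
                return "bare CR at offset " + i;
            }
            if (b == '\n') {
                return "bare LF at offset " + i;
            }
            if (++lineLen > 998) {
                return "line longer than 998 octets at offset " + i;
            }
        }
        return null;
    }
}
```

Octets above 127 pass, as the definition allows. Running a check like this over content that browsers or HttpClient label as 8bit would tell us whether they really emit isolated CR or LF.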

We have to understand whether the real world has abused the 8bit specification or whether there is some MIME extension we are not considering. This is important; otherwise we will be the next abuser of the RFC. The Apache JAMES PMC has agreed (in the past, multiple times) that we have to be strict about the MIME written by mime4j and permissive with input.

If you commit this change, could you please provide an option to exclude the EOLConvertingInputStream filter?

I'm not going to commit anything without agreement on what we want to do. If *I* am the only one who cares about the RFC we can even ignore this thread altogether, but my duty as a PMC member is to raise issues like this and let the community decide.

Thank you


Thank you, too!

Oleg


Stefano

