On Sun, Jul 20, 2008 at 9:08 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
> Robert Burrell Donkin wrote:
>>
>> On Sat, Jul 19, 2008 at 5:29 PM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>>>
>>> Oleg Kalnichevski wrote:
>>>>
>>>> Robert Burrell Donkin wrote:
>>>>>
>>>>> On 7/18/08, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Robert Burrell Donkin wrote:
>>>>>>>
>>>>>>> On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <[EMAIL PROTECTED]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Robert Burrell Donkin wrote:
>>
>> <snip>
>>
>>>>>> 2) ((TextBody) b).getReader(). This gives me a reader, so it supports
>>>>>> the "line" concept: I expect this one to treat "non canonical"
>>>>>> newlines like the header/structure parser does: if headers are allowed
>>>>>> to terminate with an isolated LF, then lines in text content should do
>>>>>> the same (because the whole MIME message probably has LF instead of
>>>>>> CRLF). [The RFC seems to suggest that the MIME message is encoded
>>>>>> using LF instead of CRLF and that this specific encoding breaks
>>>>>> binary parts, but we want to be smarter wrt this issue].
>>>>>
>>>>> TextBody is part of the DOM. This can and should be addressed there
>>>>> (rather than in the parser). I think that doing this should satisfy
>>>>> both needs without compromising the performance of the parser.
>>>>>
>>>> If this is indeed something we can all agree on, I can try to solve the
>>>> first problem (strict/lenient line delimiter handling) using a pluggable
>>>> strategy of some kind.
>>>>
>>>> Oleg
>>>
>>> My limited knowledge of mime4j details doesn't let me reply "+1". So I
>>> will simply say what I expect from mime4j as a user:
>>
>> it's important to understand that mime4j targets different kinds of
>> user. the pull parser is a low level application agnostic interface
>> aimed at experts who need performance.
>> the DOM and SAX components are higher level interfaces for less
>> experienced users who are willing to compromise flexibility and
>> performance. each user will have different expectations.
>>
>>> Lenient line delimiter parsing:
>>> - consider isolated LF and CR in the mime stream as newlines as long as a
>>> newline concept exists in that specific place (everywhere but binary body
>>> parts having ContentTransferEncoding = "binary").
>>
>> the low level interface should allow the user to determine whether
>> they want to canonicalise. the higher level interface should probably
>> canonicalise.
>
> I have an alternative proposal, see the bottom of this message.
>
>>> - This means that a CR in a base64 stream is a newline, a CR in a
>>> text/plain is a newline, a "CR<boundary> CR" sequence is a valid
>>> multipart boundary, and "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR",
>>> "CRCR" or "LFLF" sequences are valid separators between header and body
>>> because they are considered equivalent to "CRLFCRLF".
>>
>> i'm not sure i agree (i need to think about this a little more)
>
> Ok, let me know your doubts as you get them.
>
>>> - This also means that writing this stuff to output will result in a mime
>>> stream with NO isolated CRs or LFs (unless they are in a "binary" encoded
>>> body).
>>
>> i'm happy for the high level DOM API to perform conversions on the
>> streams.
>>
>>> Strict line delimiter parsing (I don't care if we have this now, I just
>>> think we should have this in mind while factoring mime4j because it
>>> should be possible to implement this with no major changes).
>>
>> this is a non-goal as far as i'm concerned. performant validating
>> parsers tend to be more difficult to write. if a validating engine is
>> needed then i'd prefer to approach the design without preconditions. i
>> need a fast robust parser that is able to cope with practical MIME
>> documents whether they are valid or not.
>
> Ok, no one seems to care about strict parsing, so let's forget about this
> for now, but please help me understand this:
> I see mime4j already has a strict parsing concept (throwing exceptions vs
> monitor calls when it encounters malformed/unexpected content): what is the
> rationale for needing the current strict parsing while not needing strict
> parsing of the CRLF delimiter?
>
>>> - LFs and CRs are not newlines, they are not considered newlines and
>>> result in errors raised by the parser (invalid header, invalid content,
>>> and so on) that will result in a parsing failure or (if the raised errors
>>> are ignored) in an invalid DOM (I'm not sure how we currently handle this
>>> case for unexpected 8bit content in a header, but it should be the same).
>>> - writing this content to output should result in well-formed content,
>>> so:
>>> - if an LF in the header is somehow "encodable" as a valid sequence it
>>> should be parsed as LF and then encoded while outputting. If instead an
>>> LF in the header is not encodable then we should fail parsing or remove
>>> it (or convert it to "?" or anything similar) if we want to be lenient.
>>
>> i'm happy for this to be added to the high level DOM
>>
>>> I'm not saying that I want mime4j to support all of this before a
>>> release, I just want to understand if this is what you also expect and
>>> if this can be considered a common goal.
>>
>> i'm happy to address your concerns by adding conversion code into the
>> higher level API layers but if mime4j seriously needs to compromise
>> the low level API then i'm not sure i can use this library for my mail
>> work either. in this case, i'd be happy to introduce a proposal for a
>> performant low level pull parser for MIME to the commons instead.
>
> I'm working on a solution having readLine methods not return the newline
> chars, so that the user of readLine does not need to care about the line
> delimiter.
> This way we can tune the line delimiter inside the
> BufferedLineReaderInputStream and not everywhere else.
users of the low level API may well care about preservation of line endings

> "Client code" for LineReaderInputStream should use readLine ONLY when line
> recognition is needed (as already happens).
>
> I have already coded a solution doing this (and using only CRLF and LF as
> line delimiters, like the current behaviour).
>
> I'm running a few tests, I'll probably create a JIRA and a proposal
> tomorrow.

this proposal seems likely to reduce the correctness and usefulness of
the low level parser in order to address an issue in the high level API

- robert
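
For illustration only, the readLine idea quoted above (return each line
without its terminator, accepting both CRLF and a bare LF) might look
roughly like the sketch below. The class and method names are hypothetical
and this is not the actual mime4j BufferedLineReaderInputStream code; it
only assumes a plain java.io.InputStream as input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of mime4j.
final class LineSplitSketch {

    /**
     * Reads one line, treating CRLF and bare LF as terminators.
     * The terminator is not included in the result.
     * Returns null once the stream is exhausted.
     */
    static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null; // nothing left to read
        }
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            line.write(b);
            b = in.read();
        }
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        // Drop a CR only when it directly precedes the LF (i.e. CRLF).
        if (b == '\n' && len > 0 && bytes[len - 1] == '\r') {
            len--;
        }
        return new String(bytes, 0, len, StandardCharsets.US_ASCII);
    }

    /** Convenience wrapper: splits an in-memory message into lines. */
    static List<String> splitLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            for (String l = readLine(in); l != null; l = readLine(in)) {
                lines.add(l);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Note that a caller of this sketch cannot tell whether a given line ended
in CRLF or in a bare LF, which is the loss of information raised in the
reply above.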