I understand this is a long message and I asked many questions, so: if
this does not help your research, just ignore it and go ahead with your
ideas. I'll review the code ;-)
--
Robert Burrell Donkin wrote:
> On 7/25/07, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
>> what about adding a Cursor.isInMimePart() or something similar?
>
> not sure it would be so simple as that. the cursor would probably need
> to become a first pass parser.
>
> the cursor would need to perform basic parsing of the email to find
> the appropriate mime headers and so the appropriate boundary. it would
> be possible to model the API so that the cursor performed basic
> non-recursive pull parsing (header lines, parts but not part headers).
You lost me here. I don't have enough understanding of what we are
talking about to offer more useful hints. I'll wait for the code to
review ;-)
>> Not sure I understand the problem. Can't we ignore the encoding issue,
>> at all? The important thing is that the API uses a string and a string
>> always can contain a 7bit sequence in a lossless way. If you write such
>> string to bytes using the US-ASCII charset the result will be unchanged,
>> right?
>
> if the string contains only US-ASCII then yes, the transformation will
> be lossless
Well, the String object is only a "container" large enough for our
purpose. In OOP we often use an Integer to pass data that should be a
subset of an integer. The important fact is whether the meaning of the
data we want to transfer is preserved.
That's why we can use the String and simply do a parameter check to see
that it is really a US-ASCII-only sequence, or we can use anything else.
IMHO the choice does not depend on the charset support of the String
object, but on ease of use. You are developing the API, so you are more
entitled to decide whether a byte[] is better than a String.
> my point is that by including a string in the API the caller is forced
> to decode the natural representation (bytes) to a string which will
> then be encoded to bytes by the cursor implementation. this approach
> seems wrong to me.
Well, bytes are the natural representation for every piece of
information we manage in IT ;-)
My point is that Strings have very convenient methods and are really
well optimized in the JVM, so String handling is sometimes not much
worse than manual byte handling, and Strings are more usable than byte
arrays.
FWIW you can also introduce a "Boundary" object so that the
implementation can be optimized without altering the API.
>> (if you had non-US-ASCII chars they would instead be converted to "?").
>
> that depends on the way the encoding is done
>
> String.getBytes() is JVM and charset dependent
Shouldn't getBytes("US-ASCII") always work fine for a String containing
only 7-bit chars, using "?" for chars outside the 7-bit range?
> using the more flexible nio encoders, then bad characters can be
> reported, ignored or replaced
Not sure I understand this point: do we need to recognize/ignore/replace
bad chars in the boundary with respect to that API call?
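A small sketch of the two behaviors being discussed (the input string is mine, chosen only to contain one non-ASCII char): String.getBytes with the US-ASCII charset silently replaces an unmappable char with '?', while an nio CharsetEncoder configured with CodingErrorAction.REPORT turns the same char into an exception the caller can handle.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AsciiEncodingDemo {
    public static void main(String[] args) {
        // 1) String.getBytes: the unmappable 'è' becomes '?' (0x3F).
        byte[] replaced = "caffè".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(replaced, StandardCharsets.US_ASCII)); // caff?

        // 2) nio encoder with REPORT: the bad char raises an exception
        //    instead of being replaced, so the caller can reject it.
        CharsetEncoder strict = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.encode(CharBuffer.wrap("caffè"));
            System.out.println("encoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```

So "recognize/ignore/replace" maps directly onto CodingErrorAction.REPORT, IGNORE, and REPLACE on the encoder.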
>> The only problems arise when we try to use non-US-ASCII chars as a
>> boundary, but this should not be allowed as it is an illegal argument:
>> maybe we may want to check this in
>> public void boundary(String boundary) throws IOException. Maybe
>> throwing a new IllegalArgumentException on a boundary including non
>> US-ASCII chars is enough (maybe a check for "?" presence is enough).
>
> throwing an exception does seem reasonable
>
> i prefer to offer subclasses for cases such as this so that they can
> be caught and (perhaps) dealt with
>
> i generally prefer checked to runtime exceptions but perhaps an
> IOException may be wrong here
IMHO the specific check is an argument validity check, and an
IllegalArgumentException fits better. I see IOException as related to
I/O problems, not to content/argument problems.
Btw I'm also fine with IOException, and as you are the one with your
hands dirty now, you should decide, IMHO ;-)
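The validity check being proposed could look like this (the method name checkBoundary and the class are mine, only the idea of rejecting non-US-ASCII chars with an IllegalArgumentException comes from the discussion):

```java
public class BoundaryCheck {
    // Sketch of the argument-validity check discussed above: reject
    // any char outside the 7-bit range before the boundary is ever
    // encoded, so no "?" replacement can silently corrupt it.
    static void checkBoundary(String boundary) {
        for (int i = 0; i < boundary.length(); i++) {
            char c = boundary.charAt(i);
            if (c > 127) {
                throw new IllegalArgumentException(
                        "non-US-ASCII char at index " + i + ": '" + c + "'");
            }
        }
    }

    public static void main(String[] args) {
        checkBoundary("=_Part_0_1234"); // fine, all 7-bit
        try {
            checkBoundary("grénze");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Since the check inspects the String before any encoding happens, it reports the exact offending character, which an IOException from a failed write could not do.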
>> Passing byte
>> sequences IMHO would not solve the issue as you would have to check the
>> 8th bit anyway.
>
> true but the check is much quicker and the failure more precise
I agree. It is a tradeoff of ease of use vs. speed/precision. In my
understanding we didn't need *that* much speed and precision for the
boundary, but I don't know exactly what code you're talking about, so
I'm fine with the low-level operations too.
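For comparison, the byte-level check Robert mentions really is tiny (names are mine; this is just a sketch of the "check the 8th bit" idea): a byte is 7-bit clean iff its high bit is zero.

```java
public class HighBitCheck {
    // Byte-level equivalent of the US-ASCII check: no decoding at
    // all, just a mask on the high bit of each byte.
    static boolean is7bit(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(is7bit(new byte[] {0x62, 0x6F, 0x75})); // true
        System.out.println(is7bit(new byte[] {0x62, (byte) 0xE8})); // false
    }
}
```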
> there are various way that an encoding might fail and there would be
> effort involved in determining the exact cause
Well, as far as I can tell from the CharToByteASCII.convert sources,
there are no failures involved (as long as you don't pass wrong buffer
sizes, which should not happen when using String methods).
>> The details depends mainly on the usage of the boundary by the
>> underlying system: if the system works with bytes then maybe it is ok to
>> use bytes also for the boundary method, otherwise IMHO it's safe to keep
>> using the String (and maybe add the argument check).
>
> MIME works with 8-bit bytes not 16-bit UNICODE so bytes are the
> natural way of representing boundaries in java
>
> - robert
I don't agree with your interpretation of MIME working with 8-bit
bytes: to be more precise, I agree that a byte contains 8 bits ;-) .
UNICODE and bytes are not things we can compare and present as
alternatives. UNICODE is a way to represent chars using bits. We can
compare single-byte chars with 2-byte chars, but not UNICODE vs. bytes.
MIME does not work with 8-bit bytes any more than any other PC-related
specification does.
Maybe I'm not understanding your point at all; that's why I keep trying
to give you details on my mis/understanding.
As I said previously, everything in IT is mapped to bits and bytes
(because of the available hardware), but MIME is just another thing we
represent with bytes: MIME defines what a line is, what CRLF is, what
7-bit data is, what 8-bit data is, what binary data is. Working
directly with bytes IMHO does not mean you use the RIGHT way to
represent MIME data; it simply means working on a low-level raw byte
representation without any further abstraction.
Are you going to write your own CharSequence implementation/wrapper for
everything so as to avoid the memory abuse of Java's UNICODE-based
Strings? Or maybe a "CompactString" that is able to wrap a byte buffer
or a CharSequence and can convert to/from them (but, unlike Java, does
not convert them to UNICODE by default) would simply do the trick?
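To show what I mean by "CompactString", here is a rough sketch (the class name ByteCharSequence and all details are mine; the idea is only the one from this paragraph): a CharSequence view over raw bytes, valid for 7-bit data, that maps each byte to its char code point on access instead of decoding the whole buffer into a UNICODE String up front.

```java
// Hypothetical sketch of the "CompactString" idea: a read-only
// CharSequence view over a byte[] that never copies or decodes the
// buffer; valid only for 7-bit (US-ASCII) data.
public final class ByteCharSequence implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    public ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    public int length() {
        return length;
    }

    public char charAt(int index) {
        // Each 7-bit byte maps directly to the same char code point.
        return (char) (data[offset + index] & 0x7F);
    }

    public CharSequence subSequence(int start, int end) {
        // Sub-views share the same backing array: no copying.
        return new ByteCharSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Conversion to a UNICODE String only happens on demand.
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

Something like this would let the parser search a byte buffer with String-style convenience without ever materializing a full String.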
To be sure you're not misunderstanding me, let me repeat that I'm not
against your approach (I don't even understand it, so I cannot be
against it). I just want to understand what problem you're trying to
solve and how you propose to solve it.
Are we still discussing the "boundary", or do these concerns belong
also/only to MimePart contents?
Maybe a corner case I see discussed very often on MIME-related mailing
lists adds something to this discussion: some non-RFC-compliant clients
simply put 8-bit chars in header values and encode that data using the
same encoding as the MIME body. This is not compliant behavior, but
some MIME applications try to understand/recover this case, and no
server I'm aware of simply rejects the message as non-compliant (even
if this seems the only compliant option from a reading of the RFCs):
what is our position wrt this issue? Can our position influence the
decisions about the way we parse and move this data around?
Stefano
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]