Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5

2013-05-28 Thread Peter Occil


I see you've updated the MIME sniffing algorithm in response to my feedback.
Here I'll go over the remaining differences, and I'd like your comments on
them.

1. I assume the term whitespace character means the same as a whitespace
byte under the MIME Sniffing spec.  If so, the use of that term is inadequate
for the following reasons.


  * A whitespace character includes 0x0C, form feed (FF), which is not
 considered whitespace in either HTTP or the Internet Message Format (IMF,
 RFC5322).

 For example, the following would not be well-formed under HTTP or IMF:

 text/plain{FF}; charset=utf-8

 But the current algorithm would consider that string well-formed anyway.


  * All steps in the document that are the same as step 7 skip all whitespace
 characters, even if the whitespace isn't well-formed under HTTP or IMF.  For
 example, a bare carriage return (CR) or line feed (LF) is not allowed, and a
 CR-LF pair not followed by either SPACE or TAB is also not allowed. IMF also
 allows comments within whitespace.


 For example, the following would not be well-formed under HTTP or IMF:

 text/plain;{CR} charset=utf-8
 text/plain;{LF} charset=utf-8
 text/plain;{CR}{LF}charset=utf-8

 (Note the lack of space in the last example. Note also that folding
 whitespace is deprecated under the current HTTP draft.)

 And the following examples would be allowed under IMF, but not HTTP:

 (comment) text/plain; charset=utf-8
 text/plain; (comment) charset=utf-8
 text/plain; (comment (nested)) charset=utf-8
 text/plain; charset=utf-8 (comment)
 text/plain; {CR}{LF} (comment) charset=utf-8
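
To illustrate the form feed case, here is a small Python sketch (not spec text). It assumes the MIME Sniffing spec's whitespace bytes are TAB, LF, FF, CR, and SPACE; the helper name skip_bytes is made up for this example.

```python
# Whitespace bytes as assumed for the MIME Sniffing spec: TAB, LF, FF, CR, SPACE.
SNIFF_WHITESPACE = frozenset(b"\t\n\x0c\r ")
# What HTTP allows between tokens once folding has been dealt with: SPACE, TAB.
HTTP_WHITESPACE = frozenset(b" \t")


def skip_bytes(data: bytes, pointer: int, skippable: frozenset) -> int:
    """Return the index of the first byte at or after pointer not in skippable."""
    while pointer < len(data) and data[pointer] in skippable:
        pointer += 1
    return pointer


value = b"text/plain\x0c; charset=utf-8"
after_subtype = value.index(b"\x0c")

# Skipping every whitespace byte steps over the form feed and lands on ";".
lenient = skip_bytes(value, after_subtype, SNIFF_WHITESPACE)
# Skipping only SPACE and TAB stops at the form feed, so a parser can reject it.
strict = skip_bytes(value, after_subtype, HTTP_WHITESPACE)
```

With the lenient set the parser never sees the form feed; with the strict set the next byte it examines is 0x0C rather than ";", which is the point at which the value can be rejected.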

2. While the type, subtype, and parameter name are checked for their length,
 the other rules for well-formedness are not checked in your version, namely,
 that they must not be empty, must not contain a byte that isn't a MIME type
 byte (see my original message), and must not begin with a byte that isn't an
 ASCII alphanumeric.

 For example, the following would not be well-formed under RFC6838:

 te*xt/plain;charset=utf-8
 text/pl*ain;charset=utf-8
 text/plain;ch*arset=utf-8
 text/plain;=utf-8
 text/;charset=utf-8
 /plain;charset=utf-8

 The first three examples are ill-formed because * isn't a MIME type byte.
 The last three have an empty parameter name, subtype, and type, respectively.
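
As a rough Python sketch of that check (an illustration only): the byte set below follows restricted-name-chars in RFC6838 section 4.2, and the function name is_valid_name is hypothetical.

```python
import string

# ASCII alphanumerics: 0x41-0x5A, 0x61-0x7A, 0x30-0x39.
ALNUM = frozenset((string.ascii_letters + string.digits).encode("ascii"))
# MIME type bytes: alphanumerics plus the RFC6838 restricted-name punctuation.
MIME_TYPE_BYTES = ALNUM | frozenset(b"!#$&^_.+-")


def is_valid_name(name: bytes) -> bool:
    """Check a type, subtype, or parameter name for well-formedness."""
    return (
        0 < len(name) <= 127                          # not empty, not too long
        and name[0] in ALNUM                          # begins with an alphanumeric
        and all(b in MIME_TYPE_BYTES for b in name)   # only MIME type bytes
    )
```

Applied to the examples above, is_valid_name(b"te*xt") and is_valid_name(b"") are both false, while is_valid_name(b"plain") is true.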


3. Unquoted parameter values are not checked to ensure that they are not
 empty and do not contain a byte that isn't a parameter value byte (see my
 original message).


 For example, the following would not be well-formed under HTTP or MIME:

 text/plain;charset=ut?f-8
 text/plain;charset=utf=8
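
A minimal Python sketch of the unquoted-value check, for illustration. It assumes a parameter value byte is any HTTP token byte (the MIME type bytes plus % ' * ` | ~); the name is_valid_unquoted_value is invented here.

```python
# Parameter value bytes: assumed to match the HTTP token byte set.
PARAMETER_VALUE_BYTES = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    b"!#$&^_.+-"   # remaining MIME type bytes
    b"%'*`|~"      # bytes allowed only in parameter values
)


def is_valid_unquoted_value(value: bytes) -> bool:
    """True if value is non-empty and made up only of parameter value bytes."""
    return len(value) > 0 and all(b in PARAMETER_VALUE_BYTES for b in value)
```

Both examples above fail this check, because neither ? nor = is a parameter value byte.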

4. Quoted parameter values are not checked to ensure that they do not
 contain a 0x7F byte or a byte other than TAB (0x09) that is less than 0x20.

 For example, the following would not be well-formed under HTTP or MIME:

 text/plain;charset="utf{LF}-8"
 text/plain;charset="utf{0x7F}-8"
 text/plain;charset="utf\{LF}-8"
 text/plain;charset="utf\{0x7F}-8"
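
For illustration, the byte test described in point 4 could be sketched in Python as follows. The function names are hypothetical, and the sketch looks only at the bytes between the quotation marks, without handling escapes.

```python
from typing import Optional


def is_allowed_in_quoted_value(byte: int) -> bool:
    """True unless byte is 0x7F or a byte below 0x20 other than TAB (0x09)."""
    return byte == 0x09 or (byte >= 0x20 and byte != 0x7F)


def first_disallowed(value: bytes) -> Optional[int]:
    """Return the index of the first disallowed byte in value, or None."""
    for index, byte in enumerate(value):
        if not is_allowed_in_quoted_value(byte):
            return index
    return None
```

For the first two examples above, first_disallowed reports index 3, the position of the LF or 0x7F byte.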

Please give your comments.

--Peter

-Original Message- 
From: Gordon P. Hemsley

Sent: Saturday, May 25, 2013 1:26 PM
To: Peter Occil
Cc: WHATWG
Subject: Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for 
section 5


On Sat, May 25, 2013 at 12:46 PM, Peter Occil pocci...@gmail.com wrote:

My algorithm skips only SPACE and TAB instead of all whitespace characters
because it assumes that the field value was already extracted from
Content-Type according to the HTTP/HTTPbis spec (0x0C, form feed, is never
considered whitespace in HTTP headers). In particular, it assumes that
folding whitespace (obs-fold) was replaced with spaces (or the message 
with

obs-fold rejected) before the Content-Type value was interpreted.


Thanks for your detailed explanation.

It'll take me a little while to evaluate what you've proposed here,
but in the meantime: Keep in mind that the Content-Type header is not
the only source for a MIME type. This algorithm needs to consider MIME
types from all possible sources.

--
Gordon P. Hemsley
m...@gphemsley.org
http://gphemsley.org/ • http://gphemsley.org/blog/ 



[whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5

2013-05-25 Thread Peter Occil
I present this draft of the complete algorithm for parsing a MIME type.  I 
would appreciate comments.

--Peter



An ASCII alphanumeric is a byte or character in the ranges 0x41-0x5A, 
0x61-0x7A, and 0x30-0x39.
A MIME type byte is an ASCII alphanumeric or one of the following bytes: ! # $ 
& ^ _ . + -
A parameter value byte is a MIME type byte or one of the following bytes: % ' * 
` | ~

To parse a MIME type, run the following steps:

1. Let length be the length of the byte sequence of the MIME type.
2. If length is less than 1, return undefined.
3. Let pointer be 0.  Pointer is a zero-based index to the current byte in the 
byte sequence.
4. Advance pointer to the next byte other than 0x20 (SPACE) or 0x09 (TAB).
5. Let type be the byte string from the current byte up to but not including 
the next / byte. Advance pointer to the next / byte.
6. If the current byte isn't /, return undefined.
7. Increment pointer by 1.
8. Let subtype be the byte string from the current byte up to but not including 
the next 0x20 (SPACE), 0x09 (TAB), or ; byte.  Advance pointer to the next 
0x20 (SPACE), 0x09 (TAB), or ; byte.
9. If type is empty, contains a byte that isn't a MIME type byte, or doesn't
begin with an ASCII alphanumeric, or is longer than 127 bytes, return undefined.
10. If subtype is empty, contains a byte that isn't a MIME type byte, or 
doesn't begin with an ASCII alphanumeric, or is longer than 127 bytes, return 
undefined.
11. Convert type and subtype to ASCII lowercase.
12. Let parameters be an empty dictionary.
13. Run the following substeps in a loop.
 1. Advance pointer to the next byte other than 0x20 (SPACE) or 0x09 (TAB).
 2. If pointer is equal to length, return type, subtype, and parameters.
 3. If the current byte isn't ;, return undefined.
 4. Increment pointer by 1.
 5. If pointer is equal to length, return type, subtype, and parameters.
 6. Let parameter be the byte string from the current byte up to but not 
including the next = byte. Advance pointer to the next = byte.
 7. If parameter is empty, contains a byte that isn't a MIME type byte, or 
doesn't begin with an ASCII alphanumeric, or is longer than 127 bytes, return 
undefined.
 8. If parameters contains a mapping for parameter, return undefined.
 9. Convert parameter to ASCII lowercase.
 10. If the current byte isn't =, return undefined.
 11. Increment pointer by 1.
 12. If the current byte equals 0x22 (quotation mark), run the following 
substeps:
  1. Let value be an empty byte string.
  2. Increment pointer by 1.
  3. Run these substeps in a loop.
   1. If pointer is equal to length, return type, subtype, 
and parameters.
   2. If the current byte equals 0x7F or is less than 0x20, 
and the current byte isn't TAB (0x09), return type, subtype, and parameters.
   3. If the current byte equals 0x22 (quotation mark), 
increment pointer by 1 and terminate this loop.
   4. Otherwise, if the current byte is \, increment 
pointer by 1. Then, if there is a current byte, append that byte to value.
   5. Otherwise, append the current byte to value.
   6. Increment pointer by 1.
  4. Add the mapping of parameter to value to the parameters 
dictionary.
 13. Otherwise, run these substeps:
  1. Let value be the byte string from the current byte up to but 
not including the next 0x20 (SPACE), 0x09 (TAB), or ; byte.  Advance pointer 
to the next 0x20 (SPACE), 0x09 (TAB), or ; byte.
  2. If value is empty or contains a byte that isn't a parameter 
value byte, return undefined.
  3. Add the mapping of parameter to value to the parameters 
dictionary.

---
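
For readers who prefer code, the steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a reference implementation: it folds in the corrections posted later in this thread (lowercasing before the duplicate check, the stricter escaped-byte check, and skipping SPACE and TAB after ;), it takes the MIME type bytes from RFC6838 section 4.2, and it uses None for "undefined". The name parse_mime_type and its helpers are made up.

```python
import string
from typing import Dict, Optional, Tuple

ALNUM = frozenset((string.ascii_letters + string.digits).encode("ascii"))
MIME_TYPE_BYTES = ALNUM | frozenset(b"!#$&^_.+-")
PARAM_VALUE_BYTES = MIME_TYPE_BYTES | frozenset(b"%'*`|~")
SP_TAB = frozenset(b" \t")
SP_TAB_SEMI = SP_TAB | frozenset(b";")


def _valid_name(name: bytes) -> bool:
    # Not empty, at most 127 bytes, starts alphanumeric, only MIME type bytes.
    return (0 < len(name) <= 127 and name[0] in ALNUM
            and all(b in MIME_TYPE_BYTES for b in name))


def _bad_quoted(byte: int) -> bool:
    # 0x7F, or a byte below 0x20 other than TAB.
    return byte == 0x7F or (byte < 0x20 and byte != 0x09)


def parse_mime_type(data: bytes) -> Optional[Tuple[bytes, bytes, Dict[bytes, bytes]]]:
    length = len(data)
    if length < 1:
        return None

    def skip(i: int) -> int:  # advance past SPACE and TAB
        while i < length and data[i] in SP_TAB:
            i += 1
        return i

    def until(i: int, stops: frozenset) -> int:  # advance to next stop byte
        while i < length and data[i] not in stops:
            i += 1
        return i

    p = skip(0)
    end = until(p, frozenset(b"/"))
    mtype, p = data[p:end], end
    if p == length:  # no "/" found
        return None
    p += 1
    end = until(p, SP_TAB_SEMI)
    subtype, p = data[p:end], end
    if not _valid_name(mtype) or not _valid_name(subtype):
        return None
    mtype, subtype = mtype.lower(), subtype.lower()
    params: Dict[bytes, bytes] = {}
    while True:
        p = skip(p)
        if p == length:
            return mtype, subtype, params
        if data[p] != 0x3B:  # ";"
            return None
        p += 1
        if p == length:
            return mtype, subtype, params
        p = skip(p)  # step 5a from the later correction
        end = until(p, frozenset(b"="))
        name, p = data[p:end], end
        if not _valid_name(name):
            return None
        name = name.lower()  # lowercase first, then check for duplicates
        if name in params or p == length:  # duplicate name, or no "="
            return None
        p += 1
        if p < length and data[p] == 0x22:  # quoted value
            value = bytearray()
            p += 1
            while True:
                if p == length or _bad_quoted(data[p]):
                    return mtype, subtype, params
                if data[p] == 0x22:
                    p += 1
                    break
                if data[p] == 0x5C:  # backslash escapes the next byte
                    p += 1
                    if p == length or _bad_quoted(data[p]):
                        return mtype, subtype, params
                value.append(data[p])
                p += 1
            params[name] = bytes(value)
        else:  # unquoted value
            end = until(p, SP_TAB_SEMI)
            raw, p = data[p:end], end
            if not raw or any(b not in PARAM_VALUE_BYTES for b in raw):
                return None
            params[name] = raw
```

For example, parse_mime_type(b"text/plain; charset=utf-8") yields the type text, the subtype plain, and one parameter mapping charset to utf-8, while parse_mime_type(b"te*xt/plain") yields None. As in the steps, an unterminated quoted value ends parsing early and returns whatever was parsed up to that point.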




Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5

2013-05-25 Thread Gordon P. Hemsley
Peter,

The burden is on you to describe your proposals and what their purpose
and benefit would be.

How does this proposed algorithm differ from what is already in the
spec? How is it better?

Regards,
Gordon

-- 
Gordon P. Hemsley
m...@gphemsley.org
http://gphemsley.org/ • http://gphemsley.org/blog/


Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5

2013-05-25 Thread Peter Occil

Sorry for not including proper discussion.

These are the differences in my algorithm from the existing one.

One weakness of the existing algorithm is that its terminology can be rather 
technical (sequence[s], while [this happens], execute the following 
steps: [one step only], Loop M).  On the other hand, my algorithm is 
better seen as a logical set of steps that are intended to be easy to 
follow; actual implementations may differ as long as they produce the same 
results. (This is the same approach used in the Unicode Standard.) 
Accordingly, there are fewer loops and fewer if-structures, making the 
algorithm easier to understand and follow.


My algorithm is also stricter in many aspects than the existing one, as 
explained further below.


My algorithm skips only SPACE and TAB instead of all whitespace characters 
because it assumes that the field value was already extracted from 
Content-Type according to the HTTP/HTTPbis spec (0x0C, form feed, is never 
considered whitespace in HTTP headers). In particular, it assumes that 
folding whitespace (obs-fold) was replaced with spaces (or the message with 
obs-fold rejected) before the Content-Type value was interpreted.
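
As an illustration of that preprocessing assumption, a Python sketch might look like this. It assumes obs-fold means CRLF followed by at least one SPACE or TAB, and that each fold is replaced by a single SPACE; the name unfold is hypothetical.

```python
import re

# obs-fold, assumed here to be CRLF followed by one or more SPACE or TAB bytes.
OBS_FOLD = re.compile(rb"\r\n[ \t]+")


def unfold(field_value: bytes) -> bytes:
    """Replace each obs-fold with one SPACE before the value is parsed."""
    return OBS_FOLD.sub(b" ", field_value)
```

After unfold, a value such as text/plain;{CR}{LF} charset=utf-8 contains only SPACE between the ; and the parameter name, so the parsing algorithm itself never has to deal with CR or LF. A CR-LF pair not followed by SPACE or TAB is left in place and is rejected later.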


Type, subtype, and parameter names are converted to lowercase.

Type, subtype, and parameter names are checked according to the rules found 
in RFC6838 section 4.2, rather than RFC2045 section 1; the former is what I 
believe is the latest syntax of those names, while the latter is an older 
syntax.


Parameter values are checked according to the rules found in HTTPbis part 1, 
section 3.2.6, in the latest version [1].  In particular, it rejects 
parameters with unclosed or otherwise invalid quoted strings, and checks the 
characters in unquoted parameter values.


My algorithm treats Content-Type values with duplicate parameter names as an 
error (see RFC6838 section 4.3).


--

Also, there is a mistake: Two steps were reversed (substeps 8 and 9 of step 
13). They should say the following instead:


8. Convert parameter to ASCII lowercase.
9. If parameters contains a mapping for parameter, return undefined.
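
A short Python sketch of why the order matters (illustration only; add_parameter and its lowercase_first flag are invented for this example):

```python
def add_parameter(parameters: dict, name: bytes, value: bytes,
                  lowercase_first: bool) -> bool:
    """Store name -> value; return False if name is rejected as a duplicate.

    lowercase_first=False models the order as originally posted (duplicate
    check, then lowercase); lowercase_first=True models the corrected order.
    """
    if lowercase_first:
        name = name.lower()
    if name in parameters:
        return False
    parameters[name.lower()] = value
    return True


original_order = {}
add_parameter(original_order, b"charset", b"utf-8", lowercase_first=False)
# "CHARSET" is not yet a key, so the duplicate slips through and overwrites.
slipped = add_parameter(original_order, b"CHARSET", b"x", lowercase_first=False)

corrected_order = {}
add_parameter(corrected_order, b"charset", b"utf-8", lowercase_first=True)
# Lowercased first, so the second "charset" is detected as a duplicate.
caught = add_parameter(corrected_order, b"CHARSET", b"x", lowercase_first=True)
```

With the steps in the posted order, charset and CHARSET are not seen as the same name; with the corrected order they are.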

--Peter

[1]: 
https://svn.tools.ietf.org/svn/wg/httpbis/draft-ietf-httpbis/latest/p1-messaging.html



Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5

2013-05-25 Thread Gordon P. Hemsley
On Sat, May 25, 2013 at 12:46 PM, Peter Occil pocci...@gmail.com wrote:
 My algorithm skips only SPACE and TAB instead of all whitespace characters
 because it assumes that the field value was already extracted from
 Content-Type according to the HTTP/HTTPbis spec (0x0C, form feed, is never
 considered whitespace in HTTP headers). In particular, it assumes that
 folding whitespace (obs-fold) was replaced with spaces (or the message with
 obs-fold rejected) before the Content-Type value was interpreted.

Thanks for your detailed explanation.

It'll take me a little while to evaluate what you've proposed here,
but in the meantime: Keep in mind that the Content-Type header is not
the only source for a MIME type. This algorithm needs to consider MIME
types from all possible sources.

--
Gordon P. Hemsley
m...@gphemsley.org
http://gphemsley.org/ • http://gphemsley.org/blog/


Re: [whatwg] [mimesniff] Complete MIME type parsing algorithm for section 5

2013-05-25 Thread Peter Occil

I noticed two more mistakes.

Change the following:

4. Otherwise, if the current byte is \, increment pointer by 1. Then, if 
there is a current byte, append that byte to value.

5. Otherwise, append the current byte to value.

to:

4. If the current byte is \, increment pointer by 1. Then, if there isn't 
a current byte, if the current byte equals 0x7F, or if the current byte is 
less than 0x20 and isn't TAB (0x09), return type, subtype, and parameters.

5. Append the current byte to value.
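
In Python, the quoted-value loop with this correction applied could be sketched as below. This is an illustration; read_quoted_value is a made-up name, and it returns None at the points where the algorithm would stop and return type, subtype, and parameters.

```python
from typing import Optional, Tuple


def _disallowed(byte: int) -> bool:
    # 0x7F, or a byte below 0x20 other than TAB (0x09).
    return byte == 0x7F or (byte < 0x20 and byte != 0x09)


def read_quoted_value(data: bytes, pointer: int) -> Optional[Tuple[bytes, int]]:
    """Read a quoted value; pointer indexes the byte after the opening quote.

    Returns (value, index just past the closing quote), or None if the
    quoted string is unterminated or contains a disallowed byte.
    """
    value = bytearray()
    while True:
        if pointer == len(data) or _disallowed(data[pointer]):
            return None
        if data[pointer] == 0x22:      # closing quotation mark
            return bytes(value), pointer + 1
        if data[pointer] == 0x5C:      # backslash: look at the escaped byte
            pointer += 1
            if pointer == len(data) or _disallowed(data[pointer]):
                return None
        value.append(data[pointer])
        pointer += 1
```

The difference from the earlier wording is that an escaped LF or 0x7F byte is now rejected instead of being appended to value.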

After:

5. If pointer is equal to length, return type, subtype, and parameters.

add the following step:

5a. Advance pointer to the next byte other than 0x20 (SPACE) or 0x09 (TAB).

Take your time.

--Peter