Re: [whatwg] [mimesniff] Review request: Parsing a MIME type

2013-06-01 Thread Gordon P. Hemsley
On Fri, May 31, 2013 at 11:50 PM, Peter Occil pocci...@gmail.com wrote:

 * Another important point to notice is the fact that this algorithm
 allows parameter names to appear without values. This is useful in
 situations such as the base64 option in data: URLs that use the mere
 presence or absence of a parameter to set its boolean value.


 Since you mention data URLs I should note that data URLs can be percent
 encoded, which HTTP
 and MIME headers can't be. This raises additional considerations when
 parsing a data URL's MIME type correctly;
 see reference [1] for test cases.  In particular:

 [1]: http://greenbytes.de/tech/tc/datauri/

This is a very useful resource; thank you for pointing it out to me.

I realize now that that's the only thing that matters: What do the browsers do?

(And percent encoding doesn't matter, as that gets handled before the
parsing begins.)

 * A data URL that begins with data:, or data:;base64, (with no MIME
 type) is assumed to have the MIME type
  text/plain;charset=us-ascii under RFC2397.
 * A data URL that begins with  data:; (with no type or subtype, but with
 parameters) is assumed to have the MIME type
  text/plain under RFC2397.

An empty or invalid MIME type will get treated as unknown and will
eventually be sniffed (if it isn't already). I'll have to consider
what to do with the base64 and other parameter parts, though.
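For comparison, here is a minimal sketch of the RFC 2397 defaults Peter lists above, with percent-decoding applied before any MIME parsing. The function name data_url_media_type is hypothetical, and the sketch assumes no comma appears inside a quoted parameter value; it illustrates the RFC reading, not what the mimesniff algorithm does or will do.

```python
from urllib.parse import unquote

# Defaults from RFC 2397: a missing media type means text/plain;charset=US-ASCII,
# and "text/plain" may be omitted while parameters are still supplied.
DEFAULT_TYPE = "text/plain"
DEFAULT_FULL = "text/plain;charset=US-ASCII"


def data_url_media_type(url: str) -> str:
    """Return the media type of a data: URL with RFC 2397 defaults applied.

    Hypothetical helper for illustration only.
    """
    if not url.lower().startswith("data:"):
        raise ValueError("not a data: URL")
    # Everything before the first comma is the header; the rest is the payload.
    header, sep, _payload = url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data: URL has no comma")
    # Percent-decoding is handled before MIME type parsing begins.
    header = unquote(header)
    # A trailing ";base64" is an encoding flag, not part of the media type.
    if header.lower().endswith(";base64"):
        header = header[: -len(";base64")]
    if header == "":
        return DEFAULT_FULL
    if header.startswith(";"):
        # Parameters only: the type/subtype defaults to text/plain.
        return DEFAULT_TYPE + header
    return header
```

With this sketch, a URL such as data:;charset=utf-8,hi is reported as text/plain;charset=utf-8, which is exactly the "parameters only" case still left open above.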

 * The word base64 can only appear at the end of the MIME type, so that a
 data URL like
   data:application/example;base64;foo=bar,AA== will not be encoded in
 base64, strictly speaking. A parameter name (base64 or otherwise)
   cannot otherwise appear without a parameter value.

As I mentioned, "strictly speaking" doesn't matter, as all browsers do
the same thing, according to the resource you linked: base64
parameters with values are fine; base64 boolean parameters in other
than last place are warnings. (Not sure what the reasoning behind that
distinction is, but that's what reality is.)

So it seems the only issue I have to worry about is what to do with
MIME types which only have parameters.

Regards,
Gordon

--
Gordon P. Hemsley
m...@gphemsley.org
http://gphemsley.org/ • http://gphemsley.org/blog/


Re: [whatwg] [mimesniff] Review request: Parsing a MIME type

2013-06-01 Thread Gordon P. Hemsley
On Sat, Jun 1, 2013 at 11:41 AM, Gordon P. Hemsley gphems...@gmail.com wrote:
 On Fri, May 31, 2013 at 11:50 PM, Peter Occil pocci...@gmail.com wrote:
 * The word base64 can only appear at the end of the MIME type, so that a
 data URL like
   data:application/example;base64;foo=bar,AA== will not be encoded in
 base64, strictly speaking. A parameter name (base64 or otherwise)
   cannot otherwise appear without a parameter value.

 As I mentioned, "strictly speaking" doesn't matter, as all browsers do
 the same thing, according to the resource you linked: base64
 parameters with values are fine; base64 boolean parameters in other
 than last place are warnings. (Not sure what the reasoning behind that
 distinction is, but that's what reality is.)

It seems I read the purpose of the test wrong for base64 parameters
with values: They're fine insofar as they're allowed, but they don't
trigger base64 decoding (except in Safari?), unlike if the boolean
base64 parameter is in a non-last position.
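The two readings discussed here can be sketched side by side. Both function names are hypothetical, and the "lenient" rule is only a model of the behavior described in this thread (a valueless base64 parameter counts in any position, one with a value does not); it has not been checked against any particular browser.

```python
def is_base64_strict(header: str) -> bool:
    """RFC 2397 reading: only a final ';base64' token marks base64 data."""
    return header.lower().endswith(";base64")


def is_base64_lenient(header: str) -> bool:
    """Lenient reading: a valueless 'base64' parameter in any position counts.

    A parameter written as 'base64=...' has a value and is not treated as the flag.
    """
    params = header.split(";")[1:]  # skip the type/subtype segment
    return any(p.strip().lower() == "base64" for p in params)


# Header taken from the example URL data:application/example;base64;foo=bar,AA==
header = "application/example;base64;foo=bar"
strict_result = is_base64_strict(header)
lenient_result = is_base64_lenient(header)
```

For that header the strict check rejects the flag and the lenient check accepts it, which is the gap between the RFC and the reported browser behavior.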



Re: [whatwg] [mimesniff] Review request: Parsing a MIME type

2013-06-01 Thread Peter Occil

Thanks for adding it for me; I forget to use Reply All.

What I intended is to use the result of canPlayType to determine
how a browser parses a certain MIME type, especially if there are
duplicate codecs parameters. For example, if a browser
returns "probably" from the canPlayType method with the following MIME type:

 video/mp4; codecs=avc1.42E01E

will it also return "probably" from this MIME type?

 video/mp4; codecs=foobar; codecs=avc1.42E01E

Or from this MIME type?

 video/mp4; codecs=avc1.42E01E; codecs=foobar

As you mentioned before, with respect to which value is used if a
parameter appears more than once, the answer might be different
depending on the browser.
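To make the two possible outcomes concrete, here is a small sketch of "first wins" versus "last wins" applied to the examples above. The function parse_params and its prefer argument are hypothetical, and the sketch says nothing about what canPlayType itself returns; it only shows which codecs value each policy would hand to the browser.

```python
def parse_params(mime: str, prefer: str = "last") -> dict:
    """Collect parameters from a MIME type string.

    prefer="last": a later duplicate replaces an earlier one.
    prefer="first": the first occurrence is kept and later ones are ignored.
    Quoted values and valueless parameters are not handled in this sketch.
    """
    result = {}
    for item in mime.split(";")[1:]:  # skip the type/subtype segment
        name, _, value = item.strip().partition("=")
        name = name.lower()
        if not name:
            continue
        if prefer == "first" and name in result:
            continue
        result[name] = value
    return result


example = "video/mp4; codecs=foobar; codecs=avc1.42E01E"
last_wins = parse_params(example, prefer="last")
first_wins = parse_params(example, prefer="first")
```

Under "last wins" the codecs value is avc1.42E01E; under "first wins" it is foobar, so the two policies could plausibly give different canPlayType answers.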

-----Original Message-----
From: Gordon P. Hemsley
Sent: Saturday, June 01, 2013 8:25 PM
To: Peter Occil ; whatwg List
Subject: Re: [whatwg] [mimesniff] Review request: Parsing a MIME type

(Re-added the list; I hope that's OK.)

The canPlayType method (and similar mechanisms) is only an
approximation of what the browser can support. The codecs parameter is
generally not strictly necessary when the UA goes to actually play the
file; if the codecs parameter is missing, it can generally be
recovered by parsing/processing the file. Thus, it is not an
especially reliable testing method.


[...] 



[whatwg] [mimesniff] Review request: Parsing a MIME type

2013-05-31 Thread Gordon P. Hemsley
Hello all,

This is a request seeking feedback and review on the MIME Sniffing
algorithm to parse a MIME type:

http://mimesniff.spec.whatwg.org/#parse-a-mime-type

After numerous iterations, I think it is in a state that accurately
reflects the best current practices for interoperability.

As is common with such things, there are numerous points in this
algorithm where implementations do not agree. In general, Firefox and
Chrome tend to pattern together, as do IE and Opera. Safari often
patterns on its own, in favor of a more literal interpretation of the
various RFCs on the matter.

At times, I have had to make a decision as to which was the best
approach. This usually results in half of the implementations being in
violation of the spec; I hope, in those instances, the implementations
in question can be updated to become interoperable with the rest.

With that being said, there are two specific points I want to raise:

(1) The more recent RFCs on the matter restrict type, subtype, and
parameter names to 127 characters. No implementation actually enforces
this limit, but I have included it in the algorithm (relevant points
appear in red) because I think it would be better and safer for both
the user and the user agent to do so.

(2) Based on my analysis of existing implementations, anything that
occurs between the semicolon (and any first whitespace) and the equals
sign is treated as the parameter name, including any whitespace before
the equals sign. However, in order to test parameters, I have been
using 'charset' (because that's the only one I'm aware of that has a
Web-visible effect), and certain implementations may be sniffing
specifically for the string 'charset=', which would cloud the results
of my testing. Any enlightenment into this issue would be much
appreciated.
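A rough sketch of the two points above, under stated assumptions: the helper names split_parameter and within_limit are hypothetical, only spaces and tabs are treated as whitespace, and quoted values are ignored. It is meant to show the shape of the rules, not the spec text.

```python
from typing import Optional, Tuple

MAX_NAME_LENGTH = 127  # limit on type, subtype, and parameter names from point (1)


def split_parameter(segment: str) -> Tuple[str, Optional[str]]:
    """Split one ';'-delimited segment into (name, value).

    Whitespace directly after the semicolon is skipped, but whitespace
    before the equals sign stays part of the name, as in point (2).
    A segment without '=' yields a value of None.
    """
    segment = segment.lstrip(" \t")
    name, sep, value = segment.partition("=")
    return name, (value if sep else None)


def within_limit(token: str) -> bool:
    """True if a type, subtype, or parameter name is non-empty and short enough."""
    return 0 < len(token) <= MAX_NAME_LENGTH
```

Note that split_parameter(" charset =utf-8") keeps the trailing space in the name, so an implementation that looks for the exact string 'charset=' would not match it.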

I also have a few general points:

* You may notice in the algorithm that I am using hybrid terminology,
sometimes talking about bytes and sometimes talking about characters.
This is mostly because I haven't decided/determined whether to treat a
MIME type as ASCII or as UTF-8. I think there are arguments on both
sides of the issue, but I'm eager to hear your opinions and advice
(especially about how I might phrase the algorithm if it were written
in terms of characters instead of bytes).

* One of the most controversial parts of this algorithm might be the
issue of what to do when a parameter appears more than once. (The RFCs
suggest that the MIME type should be treated as invalid in such a
case, but no implementation actually treats it that way.) I have opted
to make a later appearance of a parameter override and replace an
earlier appearance of a parameter. Modulo caveat (2) above, this is
only done in half the implementations; in particular, IE and Opera
appear to use the first instance of the parameter as the canonical
value.

* Another important point to notice is the fact that this algorithm
allows parameter names to appear without values. This is useful in
situations such as the base64 option in data: URLs that use the mere
presence or absence of a parameter to set its boolean value. Note,
however, that a parameter that has been given an explicit value (even
if that value is the empty string) does not get overridden by the
later appearance of a boolean parameter of the same name.
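The override rules in the last two points can be sketched as follows. The function name parse_parameters is hypothetical, True is used as a stand-in for "present without a value", and lowercasing names is an assumption based on MIME parameter names being case-insensitive; quoted values are not handled.

```python
def parse_parameters(mime_type: str) -> dict:
    """Collect parameters using the override rules described above.

    - A later explicit value replaces any earlier appearance of the name.
    - A valueless (boolean) parameter is recorded as True.
    - A boolean appearance never replaces an existing entry, so an
      explicit value (even the empty string) survives a later boolean.
    """
    params = {}
    for segment in mime_type.split(";")[1:]:  # skip the type/subtype segment
        # Whitespace before '=' is left in the name, matching point (2).
        name, sep, value = segment.strip().partition("=")
        name = name.lower()
        if not name:
            continue
        if sep:
            params[name] = value
        elif name not in params:
            params[name] = True
    return params
```

For example, "text/plain;charset=;charset" keeps the empty-string value for charset, while "text/plain;base64;base64=x" ends up with the explicit value x.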

I think those are the important points of background information you
need to know in order to evaluate this algorithm.

I look forward to your response.

Regards,
Gordon



Re: [whatwg] [mimesniff] Review request: Parsing a MIME type

2013-05-31 Thread Peter Occil



* Another important point to notice is the fact that this algorithm
allows parameter names to appear without values. This is useful in
situations such as the base64 option in data: URLs that use the mere
presence or absence of a parameter to set its boolean value.


Since you mention data URLs I should note that data URLs can be percent
encoded, which HTTP and MIME headers can't be. This raises additional
considerations when parsing a data URL's MIME type correctly; see
reference [1] for test cases.  In particular:

* A data URL that begins with data:, or data:;base64, (with no MIME
type) is assumed to have the MIME type text/plain;charset=us-ascii
under RFC2397.
* A data URL that begins with data:; (with no type or subtype, but with
parameters) is assumed to have the MIME type text/plain under RFC2397.
* The word base64 can only appear at the end of the MIME type, so that a
data URL like data:application/example;base64;foo=bar,AA== will not be
encoded in base64, strictly speaking. A parameter name (base64 or
otherwise) cannot otherwise appear without a parameter value.

[1]: http://greenbytes.de/tech/tc/datauri/