Re: [websec] #22: content-type sniffing should include charset sniffing

Tobias Gondrom Mon, 24 Oct 2011 00:08:12 -0700

<hat="individual">

Well, I don't feel that looking at the octets, aka magic numbers, is a "superficial work-around"; rather, I find it a reasonable argument. Personally, I don't think we need to go into charset-sniffing in the draft, but if you can make it an uncomplicated sub-set for some of the content-types, it probably won't harm.
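
For readers unfamiliar with the "magic numbers" approach mentioned above, here is a minimal sketch of media type sniffing that looks only at leading octets. The table and the function name `sniff_by_magic` are illustrative assumptions, not taken from the draft; a real sniffing table is longer and also takes the declared Content-Type into account.

```python
# Illustrative signature table (hypothetical, not the draft's table).
# Each entry maps a well-known leading byte sequence to a media type.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff_by_magic(data: bytes, default: str = "application/octet-stream") -> str:
    """Guess a media type from leading octets only; no charset is needed."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    return default
```

No decoding step appears anywhere in this sketch, which is why signature matching on octets can be kept apart from charset determination for these types.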

However, I have some doubts that we can limit the extent of charset-sniffing to a small area in the draft as discussed without ending up with a whole other set of additional problems. In that case I would rather see the content-type sniffed here and talk about charset-sniffing in another draft, if we want to go there...


just my 5 cents, Tobias



On 24/10/11 07:47, Larry Masinter wrote:

The charset sniffing documentation in the HTML5 document isn't all that 
complicated, anyway.

And it has to be somewhere.

What's the point of standardizing sniffing of the internet media type without 
also standardizing the sniffing of all of the relevant parameters? The goal 
is to sniff the content-type; the media type by itself isn't what's used.
It's just for text and XML types; the 'charset' parameter is already there.

Also, the algorithm in the document currently is incomplete and inappropriate if you're 
going to sniff XML-based media types, so the fact that the current algorithm can get away 
with hiding "charset guessing" as if it were just on octets and not the 
characters -- well, that's just a superficial work-around.

Larry


-----Original Message-----
From: "Martin J. Dürst" [mailto:[email protected]]
Sent: Sunday, October 23, 2011 11:37 PM
To: Larry Masinter
Cc: Adam Barth; [email protected]
Subject: Re: [websec] #22: content-type sniffing should include charset sniffing

I agree with Adam and Tobias that we should not pull all of charset sniffing 
into this document. Many charset details depend on the mime type in the first 
place, and are carefully described in the respective specs. For some transfer 
protocols, the question of charset may be irrelevant (e.g. for text over 
Websocket, which prescribes and checks for UTF-8).
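
As a small illustration of the WebSocket case above: when the protocol fixes the encoding, the receiver validates rather than sniffs. This is a minimal sketch; the function name `is_valid_utf8` is a hypothetical helper, not part of any WebSocket library.

```python
def is_valid_utf8(payload: bytes) -> bool:
    """Return True if payload is well-formed UTF-8, False otherwise.

    There is no guessing step: the encoding is prescribed, so the only
    question is whether the bytes conform to it.
    """
    try:
        payload.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```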

Larry is right that in some cases, some preliminary charset sniffing is 
necessary to get at some information at the start of the document, but I think 
we should strictly limit this draft to these cases.

Regards,    Martin.

On 2011/10/24 13:14, Larry Masinter wrote:

I was talking about the necessary dependency of the specifications -- that you 
couldn't specify media type sniffing completely without making at least a 
normative reference to charset sniffing.

The fact that the code works that way is evidence, of course, but
we're not talking about possibility of implementation (where a single
implementation is evidence) but rather orthogonality of interfaces
(where the question is whether ALL implementations must follow this
pattern.)

Larry




-----Original Message-----
From: Adam Barth [mailto:[email protected]]
Sent: Sunday, October 23, 2011 8:37 PM
To: Larry Masinter
Cc: Tobias Gondrom; [email protected]
Subject: Re: [websec] #22: content-type sniffing should include
charset sniffing

I mean, that's how the code works, so it must be possible.  :)

Adam


On Sun, Oct 23, 2011 at 8:32 PM, Larry Masinter <[email protected]> wrote:

I know it's complicated, but scanning text is necessarily part of determining which 
application/something+xml  you have.  I think (but should really check before saying 
this) that XML media type registrations describe what the DOCTYPE or XML namespace or 
root element are, and that, to properly "sniff" them, you'd have to scan text. 
But before you scan text, you have to determine charset.

So if we're going to support sniffing of media types in general, I don't see 
how we can do that without also specifying charset determination.
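
The dependency described above can be sketched as follows: before the root element of an XML document can be read, the bytes have to be decoded, and that requires a guess at the encoding. This is a simplified illustration loosely following the byte order mark and XML declaration heuristics; the names `guess_xml_encoding` and `root_element` are hypothetical, UTF-32 is ignored, and a DOCTYPE with an internal subset is not handled.

```python
import codecs
import re
from typing import Optional, Tuple

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an ASCII-compatible XML declaration.
DECL_RE = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)
# Matches the first start tag name, with an optional namespace prefix.
ROOT_RE = re.compile(r"<([A-Za-z_][\w.\-]*(?::[\w.\-]+)?)")


def guess_xml_encoding(data: bytes) -> Tuple[str, int]:
    """Return (encoding, number of BOM bytes to skip)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: "<?" in UTF-16 shows up with interleaved NUL bytes.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le", 0
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be", 0
    m = DECL_RE.match(data)
    if m:
        return m.group(1).decode("ascii").lower(), 0
    return "utf-8", 0  # assumed default when nothing is declared


def root_element(data: bytes) -> Optional[str]:
    """Decode with the guessed encoding, then return the root element name."""
    encoding, skip = guess_xml_encoding(data)
    try:
        text = data[skip:].decode(encoding, errors="replace")
    except LookupError:  # unknown encoding label
        return None
    # Drop the XML declaration, processing instructions, comments, DOCTYPE.
    text = re.sub(r"<\?.*?\?>|<!--.*?-->|<!DOCTYPE[^>]*>", "", text, flags=re.S)
    m = ROOT_RE.search(text)
    return m.group(1) if m else None
```

A UTF-16 document shows why the order matters: the byte string `b"<svg"` never occurs in the raw data, so a scan over octets alone would miss a root element that `root_element` finds after decoding.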



Larry

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Adam Barth
Sent: Sunday, October 23, 2011 8:28 PM
To: Tobias Gondrom
Cc: [email protected]
Subject: Re: [websec] #22: content-type sniffing should include
charset sniffing

The charset sniffing is also complicated by the fact that sometimes user agents need 
to parse some of the HTML to find a <meta> element.
In some situations, user agents need to restart the parsing algorithm, which is 
quite delicate and better to describe in the same document as HTML parsing (at 
least for use by HTML processing engines).
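
To make the restart problem above concrete, here is a deliberately simplified sketch. It is not the HTML5 algorithm: the regular expression, the 1024-byte prescan window, the windows-1252 fallback, and the names `find_meta_charset` and `decode_html` are all assumptions chosen for illustration.

```python
import codecs
import re
from typing import Optional, Tuple

# Rough pattern for a charset label inside a <meta> tag (illustrative only).
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def find_meta_charset(data: bytes) -> Optional[str]:
    """Return a charset label declared in a <meta> tag, if Python knows it."""
    m = META_CHARSET_RE.search(data)
    if not m:
        return None
    label = m.group(1).decode("ascii").lower()
    try:
        codecs.lookup(label)
    except LookupError:
        return None
    return label


def decode_html(
    data: bytes, tentative: str = "windows-1252", prescan_limit: int = 1024
) -> Tuple[str, str, bool]:
    """Return (text, encoding, restarted)."""
    # Cheap case: the declaration sits inside the prescan window.
    early = find_meta_charset(data[:prescan_limit])
    if early:
        return data.decode(early, errors="replace"), early, False
    # Expensive case: a declaration found later invalidates everything
    # decoded so far, so decoding starts over with the new encoding.
    late = find_meta_charset(data)
    if late and late != tentative:
        return data.decode(late, errors="replace"), late, True
    return data.decode(tentative, errors="replace"), tentative, False
```

In a real user agent the second case means discarding parser state that was built from wrongly decoded text, which is the delicate part that ties charset determination for HTML to the HTML parser itself.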

Adam


On Sun, Oct 23, 2011 at 8:24 PM, Tobias Gondrom <[email protected]> wrote:

<hat="individual">
I tend not to agree with that.

The fact that charset sniffing might happen at the same time as
mime-sniffing does not seem like a strong argument to include this
in the draft.

Furthermore I would rather have these issues separate:
First you determine the content-type and then after that you may
want to determine the charset used within that content-type (if you
really have to sniff the charset). I can also imagine that the charset
sniffing algorithm might depend on the application identified
by the sniffed mime-type, which again would speak against throwing it in
together with mime-sniffing....

Kind regards, Tobias



On 24/10/11 00:55, websec issue tracker wrote:

#22: content-type sniffing should include charset sniffing

   the HTML5 spec contains some algorithms for sniffing charset,
   overriding labeled charset, etc.

   MIME parameters like charset are as much a part of the
   content-type as the base internet media type, and any sniffing of
   parameters and other metadata (overriding content-type or guessing
   where it is not supplied or wrong) should be included in this
   document, since the sniffing will happen at the same time.

_______________________________________________
websec mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/websec
