Re: [whatwg] Encoding sniffing algorithm
Ian Hickson ian at hixie.ch on Thu Sep 6 12:55:03 PDT 2012:
> On Fri, 27 Jul 2012, Leif Halvard Silli wrote:
>> Revised encoding sniffing algorithm proposal:
>>
>> NEW! 0. document is XML format - opt out of the algorithm.
>>      [This step is already implicit in the spec, but it would
>>      make sense to explicitly include it to make sure that
>>      one could e.g. write test cases to see that this step
>>      is implemented. Currently Safari, Chrome and Opera do
>>      not 100% implement this step.]
>
> I don't understand the relevance of the algorithm to XML. Why would anyone
> even look at this algorithm if they were parsing XML?

In principle it should not be needed. Agree. But many of those who are
parsing XML are also parsing HTML - for that reason it should be natural
for them to compare specs and requirements. Currently, in particular
Webkit and Chromium seem to be colored by their HTML parsing when they
parse XML. (See the table in my blog post.) Also, the spec does, a few
times, include phrases similar to "if it is XML, then abort these steps"
(for example in '3.4.1 Opening the input stream'),[*] so there is some
precedent, I think.

[*] http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#opening-the-input-stream

>> NEW! #. Alternative: The BOM signature could go here instead of
>>      in step 5. There is a bug to move the BOM hereto and make
>>      it override anything else. What speaks against this are:
>>      a) that Firefox, IE10 and Opera do not currently have
>>         this behavior.
>>      b) this revision of the sniffing algorithm, especially
>>         the revision in step 6 (required UTF-8 detection),
>>         might make the BOM-trumps-everything-else override
>>         less necessary
>>      What speaks for this override:
>>      a) Safari, Chrome and legacy IE implement it.
>>      b) some legacy content may depend on it
>
> Not sure what this means.

You will be dealing with it when you take care of Anne's bug: "Bug 15359
Make BOM trump HTTP". [*] Thus, you can just ignore it.

[*] https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

>> 1. user override.
>>    (PS: The spec should clarify whether user override is
>>    cacheable.)
>
> This seems to be entirely a user interface issue.

But then, why do you go on to describe it in the new note? (See below.)

>> NEW! 2. iframe inherits user override from parent browsing context
>>      [Currently not mentioned in the spec, despite that "all"
>>      UAs do have this step for HTML docs.]
>
> That's a UI issue much like whether it's remembered or not. But I've added
> a non-normative note.

Your new note:

"""1. Typically, user agents remember such user requests across
sessions, and in some cases apply them to documents in iframes as
well."""

My comments:

1: How does that differ from the "info on the likely encoding" step?

2: Could you define 'sessions' somewhere? It sounds to me that the
   'sessions' behavior that you describe resembles the Opera behavior.
   Which is bad when the Opera behavior is the least typical one. (And
   most annoying from a page developer's point of view.) The typical
   thing - which Opera breaks! - is to, in some way or another, limit
   the encoding override to the current *tab* only. Thus, if you insist
   on describing what UAs "typically" do, then you should, instead of
   describing the exception (Opera), say that browsers *differ*, but
   that the typical thing is to limit the encoding override, some way
   or another, to the current tab.

3: Browsers differ enough for you to evaluate how they behave and pick
   the best behavior. However, I'd say Firefox is best as it offers a
   compromise between IE and Webkit. (See below.)

Comments in more detail:

FIRSTLY: Regarding "across sessions": my assumption would be that a
"single session" is equal to the lifespan of a single tab (or a single
window, if there is no tab in the window). If so, then that is how
Safari/Chrome behave: Override lasts as long as one stays in the current
frame.

SECONDLY: Does 'sessions' relate to a particular document - as in
"document during several sessions"? Or to a particular tab/window - as
in "session = tab"?

* Under FIRSTLY, I described how Safari/Chrome behave: They do not give
  heed to the document. They *only* give heed to the current tab/window:
  If you override a document to use the KOI8-R encoding then the next
  document you load in the same tab will use the KOI8-R encoding too.

* Internet Explorer (version 8, at least) will, by contrast, give heed
  to that particular document, it seems. Thus, it seems to not reuse the
  overridden encoding in case it meets a new document, in the same tab,
  whose encoding is not declared. *However*, just as Safari/Chrome, once
  you open the same document (whose encoding was overridden) in a new
Re: [whatwg] Encoding sniffing algorithm
On Fri, 27 Jul 2012, Leif Halvard Silli wrote:
>
> I have just written a document on how implementations prioritize
> encoding info for HTML documents.[1] (As that document shows, I have not
> tested Safari 6.) Based on my findings there, I would like to suggest
> that the spec's encoding sniffing algorithm should be updated to look as
> follows:
>
> Revised encoding sniffing algorithm proposal:
>
> NEW! 0. document is XML format - opt out of the algorithm.
>      [This step is already implicit in the spec, but it would
>      make sense to explicitly include it to make sure that
>      one could e.g. write test cases to see that this step
>      is implemented. Currently Safari, Chrome and Opera do
>      not 100% implement this step.]

I don't understand the relevance of the algorithm to XML. Why would anyone
even look at this algorithm if they were parsing XML?

> NEW! #. Alternative: The BOM signature could go here instead of
>      in step 5. There is a bug to move the BOM hereto and make
>      it override anything else. What speaks against this are:
>      a) that Firefox, IE10 and Opera do not currently have
>         this behavior.
>      b) this revision of the sniffing algorithm, especially
>         the revision in step 6 (required UTF-8 detection),
>         might make the BOM-trumps-everything-else override
>         less necessary
>      What speaks for this override:
>      a) Safari, Chrome and legacy IE implement it.
>      b) some legacy content may depend on it

Not sure what this means.

> 1. user override.
>    (PS: The spec should clarify whether user override is
>    cacheable.)

This seems to be entirely a user interface issue.

> NEW! 2. iframe inherits user override from parent browsing context
>      [Currently not mentioned in the spec, despite that "all"
>      UAs do have this step for HTML docs.]

That's a UI issue much like whether it's remembered or not. But I've added
a non-normative note.

> NEW! 6. UTF-8 detection.
>      I think we should separate UTF-8 detection from other
>      detection in order to make this step obligatory.
>      The newness here is only the limitation to UTF-8
>      detection plus that it should be obligatory.
>      (Thus: If it is not detected as UTF-8, then
>      the parser proceeds to next step in the algorithm.)
>      This step would make browsers lean more strongly
>      towards UTF-8.

Without a specific algorithm to detect UTF-8, this is meaningless.

> NEW! 7. parent browsing context default.
>      The current spec does not mention this step at all,
>      despite that both Opera, IE, Safari, Chrome, Firefox
>      do implement it.

Added. (Some comprehensive testing of this would be good, e.g. comparing
it to each of the earlier and later steps, considering it with different
ways of giving the encoding, different locales, etc.)

>      Regarding 6. and 7., then the order is important. Chrome
>      does for instance perform UTF-8 detection, but it does it
>      only /after/ the parent browsing context. Whereas everyone
>      else (Opera 12 by default, Firefox for some locales - don't
>      know if there are others) let it happen before the 'parent
>      browsing context default'.

Can you elaborate on this?

> NEW! 8. info on “the likely encoding”
>      The main newness is that this step is placed _after_
>      the (revised) UTF-8 detection and after the (new) parent
>      browsing context default.
>      The name 'the likely encoding' is from the current spec
>      text. I am a bit uncertain about what it means in the
>      current spec, though. So I move here what I think makes
>      sense. The steps under this point should perhaps be
>      optional:
>
>      a. detection of other charsets than UTF-8
>         (e.g. the optional Cyrillic detection in
>         Firefox or legacy Asian encoding detection.
>         The actual detection might happen in step 6,
>         but it should only be made to count here.)

I don't understand your reasoning on the desired ordering here.

>      b. markup label of the sister language
>
>         (Opera/Webkit/Chrome currently have this directly
>         after the native encoding label step - step 5.)

No idea what this means.

>      c. Other things? What does "likely encoding" currently
>         refer to, exactly?

The spec gives an example.

-- 
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
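
On the objection above that step 6 needs a specific UTF-8 detection algorithm: the thread does not define one, so the following is only one possible minimal sketch, not something taken from the spec or from any browser. It assumes detection runs on a byte prefix of the document, treats pure ASCII as "no evidence", and accepts a multi-byte sequence that is cut off at the end of the prefix. The function name `looks_like_utf8` is hypothetical.

```python
import codecs


def looks_like_utf8(prefix: bytes) -> bool:
    """Heuristic sketch: True if the prefix contains non-ASCII bytes
    and every byte sequence seen so far is well-formed UTF-8."""
    if prefix.isascii():
        # Pure ASCII is valid in most legacy encodings too, so it
        # says nothing about UTF-8 specifically.
        return False
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    try:
        # final=False: a sequence truncated at the end of the prefix
        # is buffered instead of being reported as an error.
        decoder.decode(prefix, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

With such a check, a result of `False` simply means "not detected", and the parser would proceed to the next step of whichever ordering is chosen.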
[whatwg] Encoding sniffing algorithm - update proposal
I have just written a document on how implementations prioritize
encoding info for HTML documents.[1] (As that document shows, I have not
tested Safari 6.) Based on my findings there, I would like to suggest
that the spec's encoding sniffing algorithm should be updated to look as
follows:

Revised encoding sniffing algorithm proposal:

NEW! 0. document is XML format - opt out of the algorithm.
     [This step is already implicit in the spec, but it would
     make sense to explicitly include it to make sure that
     one could e.g. write test cases to see that this step
     is implemented. Currently Safari, Chrome and Opera do
     not 100% implement this step.]

NEW! #. Alternative: The BOM signature could go here instead of
     in step 5. There is a bug to move the BOM hereto and make
     it override anything else. What speaks against this are:
     a) that Firefox, IE10 and Opera do not currently have
        this behavior.
     b) this revision of the sniffing algorithm, especially
        the revision in step 6 (required UTF-8 detection),
        might make the BOM-trumps-everything-else override
        less necessary
     What speaks for this override:
     a) Safari, Chrome and legacy IE implement it.
     b) some legacy content may depend on it

1. user override.
   (PS: The spec should clarify whether user override is
   cacheable.)

NEW! 2. iframe inherits user override from parent browsing context
     [Currently not mentioned in the spec, despite that "all"
     UAs do have this step for HTML docs.]

3. explicit charset attribute in Content-Type header.

4. BOM signature [or as the second step, see above]

5. native markup label

NEW! 6. UTF-8 detection.
     I think we should separate UTF-8 detection from other
     detection in order to make this step obligatory.
     The newness here is only the limitation to UTF-8
     detection plus that it should be obligatory.
     (Thus: If it is not detected as UTF-8, then
     the parser proceeds to next step in the algorithm.)
     This step would make browsers lean more strongly
     towards UTF-8.

NEW! 7. parent browsing context default.
     The current spec does not mention this step at all,
     despite that both Opera, IE, Safari, Chrome, Firefox
     do implement it.

     Regarding 6. and 7., then the order is important. Chrome
     does for instance perform UTF-8 detection, but it does it
     only /after/ the parent browsing context. Whereas everyone
     else (Opera 12 by default, Firefox for some locales - don't
     know if there are others) let it happen before the 'parent
     browsing context default'.

NEW! 8. info on “the likely encoding”
     The main newness is that this step is placed _after_
     the (revised) UTF-8 detection and after the (new) parent
     browsing context default.
     The name 'the likely encoding' is from the current spec
     text. I am a bit uncertain about what it means in the
     current spec, though. So I move here what I think makes
     sense. The steps under this point should perhaps be
     optional:

     a. detection of other charsets than UTF-8
        (e.g. the optional Cyrillic detection in
        Firefox or legacy Asian encoding detection.
        The actual detection might happen in step 6,
        but it should only be made to count here.)

     b. markup label of the sister language
        (Opera/Webkit/Chrome currently have this directly
        after the native encoding label step - step 5.)

     c. Other things? What does "likely encoding" currently
        refer to, exactly?

9. locale default

[1] http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm

[2] To the question of whether the BOM should trump everything else,
    then I think it would be more important to get the other parts of
    this algorithm right. If we do get the rest of it right, then the
    'BOM should trump' argument becomes less important.
-- 
Leif Halvard Silli
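
To make the proposed ordering easier to compare against implementations, the numbered steps above can be read as a "first source that yields an encoding wins" chain. The sketch below is only an illustration of that reading, not spec text or browser code. Everything in it is an assumption made for the example: the `Inputs` container and its field names, the `windows-1252` locale default, the restriction of step 4 to the UTF-8 and UTF-16 BOMs, and the choice to model the XML opt-out as returning `None`. Step 6 is taken as a precomputed flag because the proposal does not say how UTF-8 is detected.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Inputs:
    """Hypothetical bundle of what each proposed step would consult."""
    is_xml: bool = False                         # step 0
    user_override: Optional[str] = None          # step 1
    parent_user_override: Optional[str] = None   # step 2 (iframe case)
    http_charset: Optional[str] = None           # step 3
    head: bytes = b""                            # first bytes, for step 4
    markup_label: Optional[str] = None           # step 5
    utf8_detected: bool = False                  # step 6 (detector not shown)
    parent_default: Optional[str] = None         # step 7
    likely_encoding: Optional[str] = None        # step 8
    locale_default: str = "windows-1252"         # step 9 (assumed value)


def bom_encoding(head: bytes) -> Optional[str]:
    """Step 4: map a leading byte order mark to an encoding name."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None


def sniff(inp: Inputs) -> Optional[Tuple[int, str]]:
    """Return (step number, encoding) for the first step that decides."""
    if inp.is_xml:
        return None  # step 0: XML opts out of the algorithm entirely
    candidates = [
        (1, inp.user_override),
        (2, inp.parent_user_override),
        (3, inp.http_charset),
        (4, bom_encoding(inp.head)),
        (5, inp.markup_label),
        (6, "UTF-8" if inp.utf8_detected else None),
        (7, inp.parent_default),
        (8, inp.likely_encoding),
        (9, inp.locale_default),
    ]
    for step, encoding in candidates:
        if encoding:
            return step, encoding
    return None
```

Under this reading, `sniff(Inputs(utf8_detected=True, parent_default="KOI8-R"))` is decided at step 6, while swapping entries 6 and 7 in `candidates` would give the ordering described above for Chrome. Moving entry 4 ahead of entry 1 would model the "BOM trumps everything" alternative.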