[whatwg] Handling of invalid UTF-8
The spec preview had a section about UTF-8 decoding and the handling of invalid byte sequences: http://dev.w3.org/html5/spec-preview/infrastructure.html#utf-8 . I have noticed that this section has been removed from the current version. So what algorithm is used for handling invalid UTF-8 byte sequences? Or is this no longer part of the HTML5 specification? My testing on Firefox and Chrome seems to indicate that they replace the first byte of an invalid sequence with the replacement character � (U+FFFD, see http://en.wikipedia.org/wiki/Replacement_character) and then continue parsing from the next byte.
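To make the observed behavior concrete, here is a minimal Python sketch of the strategy described above: when the bytes at the current position do not form a well-formed UTF-8 sequence, emit U+FFFD for the first byte only and resume at the next byte. The function names are hypothetical, and this only illustrates the behavior as described in this message; it is not taken from any specification or browser source.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if the
    byte can never start a well-formed UTF-8 sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation bytes, 0xC0, 0xC1, 0xF5..0xFF


def decode_replacing_first_byte(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, replacing only the first byte
    of each invalid sequence with U+FFFD, then retrying at the next byte."""
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("invalid lead byte or truncated sequence")
            # Strict decoding also rejects overlong forms and surrogates.
            out.append(chunk.decode("utf-8"))
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


print(decode_replacing_first_byte(b"a\xffb"))
print(decode_replacing_first_byte(b"\xe2\x82A"))
```

Under this strategy the truncated sequence `E2 82` followed by `A` yields two replacement characters, one per rejected byte. A decoder that instead replaces the longest invalid prefix as a unit would emit only one, so the two approaches can differ in how many U+FFFD characters appear in the output.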
Re: [whatwg] Handling of invalid UTF-8
On Thu, Aug 29, 2013 at 5:29 PM, Cameron Zemek grom...@gmail.com wrote:

In the spec preview it had a section about UTF-8 decoding and the handling of invalid byte sequences, http://dev.w3.org/html5/spec-preview/infrastructure.html#utf-8 . But I have noticed this section has been removed from the current version. So what algorithm is used for handling of invalid UTF-8 byte sequences? Or is this no longer part of the HTML 5 specification?

http://www.whatwg.org/specs/web-apps/current-work/#dependencies has a reference to the Encoding spec, which is where the UTF-8 decoding logic lives now: http://encoding.spec.whatwg.org/#utf-8

-- Glenn Maynard
Re: [whatwg] Handling of invalid UTF-8
On Fri, 30 Aug 2013, Cameron Zemek wrote:

In the spec preview it had a section about UTF-8 decoding and the handling of invalid byte sequences, http://dev.w3.org/html5/spec-preview/infrastructure.html#utf-8

You really don't want to be using that as a reference. It's a very out-of-date copy of a fork of the spec.

On Thu, 29 Aug 2013, Glenn Maynard wrote:

http://www.whatwg.org/specs/web-apps/current-work/#dependencies has a reference to the Encoding spec, which is where the UTF-8 decoding logic lives now: http://encoding.spec.whatwg.org/#utf-8

Right, the HTML standard (http://whatwg.org/html) now uses the Encoding standard (http://encoding.spec.whatwg.org/) to define UTF-8 processing. Let us know if you see anything wrong with either of these specs!

Cheers,
-- Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.