Re: [whatwg] Byte-wise tokenization algorithm

Geoffrey Sneddon Sun, 21 Dec 2008 09:19:54 -0800


On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

I suppose the big pivot point is "as if". A byte-wise implementation
would replace character globally with byte, and any U+xxxx designation
with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given
algorithm.
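
As a rough illustration of what "equivalent" means for one small step, here is a minimal Python sketch (not PHP, and not taken from the spec). It assumes the input is already valid UTF-8, and the helper names are hypothetical. Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, scanning for the byte 0x3C finds exactly the same positions as scanning for the character U+003C.

```python
# Hypothetical helper names; this is a sketch, not the HTML 5 tokenizer.

def find_tag_opens_chars(text):
    """Character-wise scan: indexes of U+003C in the decoded string."""
    return [i for i, ch in enumerate(text) if ch == "\u003c"]


def find_tag_opens_bytes(data):
    """Byte-wise scan: offsets of 0x3C, the UTF-8 encoding of U+003C."""
    return [i for i, b in enumerate(data) if b == 0x3C]


def byte_offset_to_char_index(data, offset):
    """Map a byte offset to a character index by decoding the prefix.

    Safe here only because the offset points at an ASCII byte, which
    always sits on a character boundary in valid UTF-8.
    """
    return len(data[:offset].decode("utf-8"))


text = "caf\u00e9 <b>na\u00efve</b>"
data = text.encode("utf-8")

char_hits = find_tag_opens_chars(text)
byte_hits = [byte_offset_to_char_index(data, o)
             for o in find_tag_opens_bytes(data)]
```

The byte offsets differ from the character indexes, but they identify the same characters, which is the sense in which a byte-wise scan can stand in for the character-wise one.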

But an HTML5 implementation,
according to the spec, must at a minimum support the UTF-8 and
Windows-1252 encodings, so the overall implementation might not, depending
on exactly how this is done.

The plan is to convert Windows-1252 into UTF-8 before processing; with a
reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.

I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system-dependent), as well as with PHP6's Unicode support.
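
To make the required behaviour concrete, here is a minimal sketch in Python rather than PHP, so it shows the idea and not the mbstring API. It assumes Python's built-in "replace" error handler is close enough to what HTML 5 asks for; how many U+FFFD characters are emitted for a malformed multi-byte sequence can vary between decoders and has not been checked against the spec. The function name to_utf8 is hypothetical.

```python
def to_utf8(raw, encoding="utf-8"):
    """Decode raw bytes, substituting U+FFFD for undecodable input,
    then re-encode the result as UTF-8 for the tokenizer."""
    text = raw.decode(encoding, errors="replace")
    return text.encode("utf-8")


# 0xFF can never appear in valid UTF-8, so it is replaced with U+FFFD,
# which is the three bytes EF BF BD in UTF-8.
fixed = to_utf8(b"ab\xffcd")

# Windows-1252 input: 0x93 and 0x94 are the curly double quotes
# U+201C and U+201D. "cp1252" is Python's name for that encoding.
legacy = to_utf8(b"\x93hi\x94", "cp1252")
```

After this step the rest of the pipeline only ever sees valid UTF-8, whatever the source encoding was.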



--
Geoffrey Sneddon
<http://gsnedders.com/>
