On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

I suppose the big pivot point is "as if". A byte-wise implementation
would replace character globally with byte, and any U+xxxx designation
with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
the actual algorithm implementation, no?
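
A rough illustration of the byte-wise idea (a Python sketch only, not taken from any actual implementation; the helper names `utf8_bytes` and `find_char_bytewise` are hypothetical): each U+xxxx designation maps to a fixed UTF-8 byte sequence, so a test against a character can be restated as a test against those bytes.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 byte sequence for a single Unicode code point."""
    return chr(code_point).encode("utf-8")


def find_char_bytewise(data: bytes, code_point: int) -> int:
    """Locate a character in UTF-8 data by searching for its byte sequence.

    Returns a byte offset (not a character offset), or -1 if absent.
    """
    return data.find(utf8_bytes(code_point))


# U+0026 AMPERSAND is a single byte; U+FFFD REPLACEMENT CHARACTER is three.
AMPERSAND = utf8_bytes(0x0026)
REPLACEMENT = utf8_bytes(0xFFFD)
```

Offsets in this sketch are counted in bytes, so anything that reports positions would have to account for multi-byte characters.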

It states that what is done must be wholly equivalent to the given algorithm.

But an HTML5 implementation,
according to the spec, must at a minimum support the UTF-8 and
Windows-1252 encodings, so the overall implementation might not, depending
on exactly how this is done.

The plan is to convert Windows-1252 into UTF-8 before processing; with a
reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.

I've never seen any way of getting iconv (at least via PHP) to do what HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is, however, possible using mbstring (which also has the advantage of not being system-dependent), as well as with PHP6's Unicode support.
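
To make the required behaviour concrete, here is a minimal sketch in Python rather than PHP (the function name `to_utf8` is hypothetical, and this is not the mbstring code itself): decode the input in its declared encoding, substitute U+FFFD for anything that cannot be decoded, and hand UTF-8 to the rest of the pipeline. Python's substitution policy may not match what HTML 5 specifies in every case (for example, how many U+FFFD characters a truncated multi-byte sequence produces, or how the bytes left undefined in Windows-1252 are treated), so this only approximates the idea.

```python
def to_utf8(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Transcode raw input to UTF-8, replacing undecodable bytes with U+FFFD."""
    # errors="replace" substitutes U+FFFD for malformed input instead of raising.
    text = raw.decode(encoding, errors="replace")
    return text.encode("utf-8")


# An invalid byte in UTF-8 input becomes U+FFFD (EF BF BD in UTF-8).
cleaned = to_utf8(b"a\xffb")

# Windows-1252 input is converted to UTF-8 before further processing.
quoted = to_utf8(b"\x93hi\x94", "cp1252")
```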


--
Geoffrey Sneddon
<http://gsnedders.com/>
