On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:
I suppose the big pivot point is "as if". A byte-wise implementation
would replace character globally with byte, and any U+xxxx designation
with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
the actual algorithm implementation, no?
It states that what is done must be wholly equivalent to the given
algorithm.
But an HTML5 implementation, according to the spec, must at a minimum
support the UTF-8 and Windows-1252 encodings, so the overall
implementation might not, depending on exactly how this is done.
The plan is to convert Windows-1252 into UTF-8 before processing; with
a reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.
I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system-dependent), as well as with PHP 6's Unicode support.
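
To make the required behaviour concrete, here is a rough sketch in Python
rather than PHP, since its built-in decoder makes the idea easy to show. The
function name to_utf8 is made up for illustration, and Python's codecs are
not guaranteed to match the HTML 5 decoding rules byte for byte (its cp1252
table and the number of U+FFFD emitted per bad sequence may differ), so treat
it as an approximation of "normalise to UTF-8 first, replacing invalid bytes
with U+FFFD":

```python
def to_utf8(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: normalise input to UTF-8 before tokenising.

    Bytes that cannot be decoded in the source encoding become U+FFFD
    instead of raising an error or being silently dropped.
    """
    # errors="replace" substitutes U+FFFD for each undecodable sequence.
    text = raw.decode(encoding, errors="replace")
    return text.encode("utf-8")


# An invalid byte (0xFF) in UTF-8 input is replaced, not dropped.
print(to_utf8(b"abc\xffdef"))

# Windows-1252 curly quotes (0x93, 0x94) are transcoded to UTF-8.
print(to_utf8(b"\x93hi\x94", "cp1252"))
```

Doing the substitution in a single decoding step means everything after it
only ever sees well-formed UTF-8.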
--
Geoffrey Sneddon
<http://gsnedders.com/>