On Wed, 28 Dec 2011 12:30:49 +0100, Leif Halvard Silli
<xn--mlform-iua@målform.no> wrote:
I spotted a shortcoming in your testing:
I ran some utf-16 tests using 007A as input data, optionally preceded by
FFFE or FEFF, and with utf-16, utf-16le, and utf-16be declared in the
Content-Type header. For WebKit I tested both Safari 5.1.2 and Chrome
17.0.963.12. Trident is Internet Explorer 9 on Windows 7. Presto is Opera
11.60. Gecko is Nightly 12.0a1 (2011-12-26).
HTTP      BOM  Trident  WebKit  Gecko  Presto
utf-16    -    7A00     7A00    007A   007A
utf-16le  -    7A00     7A00    7A00   7A00
utf-16be  -    007A     007A    007A   007A
The above test row is not complete. You should also run a BOM-less test
using the UTF-16 label but where the 007A is represented in the
big-endian way - a bit like I did here:
<http://malform.no/testing/utf/#html-table-7>. Then you get the result
that Opera and Firefox do not take it as a given that files sent as
'utf-16' are big-endian:
utf-16    -    gibb*    gibb*   007A   007A
*gibb = gibberish/mojibake.
I get U+7A00 as I indicated above. I would not qualify that as gibberish
personally. (My table is somewhat confusing as input 007A was meant to
describe octets, but the table describes code points.)
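To make the octets-versus-code-points distinction concrete, here is a
minimal Python sketch. It uses Python's own codecs, not any browser's
decoder, and the helper name code_points is only for illustration. It
decodes the octets 00 7A under each byte order and prints the resulting
code point:

```python
def code_points(octets, encoding):
    """Decode octets and return the code points as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in octets.decode(encoding)]

# The test input, read as two octets: 0x00 followed by 0x7A.
octets = bytes([0x00, 0x7A])

# Little-endian reading: low byte first, so the code unit is 0x7A00.
print(code_points(octets, "utf-16-le"))  # ['U+7A00']

# Big-endian reading: high byte first, so the code unit is 0x007A.
print(code_points(octets, "utf-16-be"))  # ['U+007A']
```

U+7A00 is an assigned CJK ideograph, which is why the little-endian
reading produces a real character rather than a decoding error.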
Anyway, per
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-July/021102.html
Presto and Gecko do have some magic, but it seems it would be better if
they behaved the same as Trident (and WebKit).
That the BOM is removed from the output for utf-16be labelled files
means that the 'utf-16be' labelled file is nevertheless treated as
UTF-16 (per UTF-16's specification). (Otherwise, if it had not been
removed, the BOM character should have caused quirks mode.)
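As a rough illustration of the two BOM treatments being contrasted
here, the following Python sketch shows how Python's codecs (not the
browsers under test) handle the octets FE FF 00 7A. The helper name
code_points is hypothetical. Python's "utf-16" codec consumes the BOM
and uses it to pick the byte order, whereas its "utf-16-be" codec
leaves U+FEFF in the decoded output:

```python
def code_points(octets, encoding):
    """Decode octets and return the code points as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in octets.decode(encoding)]

# Big-endian BOM (FE FF) followed by the octets 00 7A.
with_bom = bytes([0xFE, 0xFF, 0x00, 0x7A])

# Generic UTF-16: the BOM selects big-endian and is removed from the output.
print(code_points(with_bom, "utf-16"))     # ['U+007A']

# Explicit UTF-16BE in Python: the BOM is kept as a U+FEFF character.
print(code_points(with_bom, "utf-16-be"))  # ['U+FEFF', 'U+007A']
```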
Taking what you did not test for into account, it would make sense if
'utf-16' continues to be treated as a label under which both big-endian
and little-endian can be expected. And thus, that WebKit and IE start to
detect when UTF-16 is big-endian but has no BOM.
I am not sure what you are trying to say here.
--
Anne van Kesteren
http://annevankesteren.nl/