"Ayers, Mike" wrote:

>         Am I reading this wrong?  Here's what I get:
> 
>         I hand you a UTF-16 document.  This document is:
> 
> FE FF 00 48 00 65 00 6C 00 6C 00 6F
> 
>         ..so it says "Hello".  Then I say, "Oh, by the way, that's
> big-endian."  *POOF*  The content of the document has changed, and there is
> now a 'ZERO WIDTH NO BREAK SPACE' at the beginning.  Smells pretty skunky...

No, what you have said is that this document is in "UTF-16BE" encoding.
That's a name for an encoding that is known a priori to be BE, and does
not permit a BOM.  It is not the name for an encoding that has a BOM but
just happens to be BE.

Since you have changed the encoding, the content has naturally
changed too, just as if you had declared an 8859-1 document
to be 8859-2.
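
To make that concrete, here is a minimal sketch in Python, assuming its
standard "utf-16", "utf-16-be", "iso-8859-1" and "iso-8859-2" codecs behave
as the encodings of the same names.  The variable names are mine, chosen
only for illustration.

```python
# The twelve bytes from the example document.
data = bytes.fromhex("FE FF 00 48 00 65 00 6C 00 6C 00 6F")

# "utf-16": the byte order is read from the BOM, which is then discarded.
as_utf16 = data.decode("utf-16")

# "utf-16-be": the byte order is known in advance, so FE FF is content
# and decodes to U+FEFF at the start of the string.
as_utf16be = data.decode("utf-16-be")

print(len(as_utf16), len(as_utf16be))   # the second string is one longer
print(as_utf16be[0] == "\ufeff")        # the extra character is U+FEFF

# The same effect with single-byte encodings: one byte, two meanings.
byte = b"\xa3"
latin1 = byte.decode("iso-8859-1")      # U+00A3 POUND SIGN
latin2 = byte.decode("iso-8859-2")      # U+0141 LATIN CAPITAL LETTER L WITH STROKE
print(latin1 != latin2)
```

The bytes never change; only the label does, and the label decides what
the bytes mean.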

>         BTW, what is a ZWNBSP anyway?  From here it seems like a
> non-character.  Is there an actual use for it? 

Yes.  ZWNBSP (U+FEFF) indicates that a line break may not be introduced
at this point.  It is similar to the NO-BREAK SPACE (U+00A0) which you may
be familiar with under its HTML name of &nbsp;, except that it doesn't
produce any actual whitespace.  ZWNBSP is useful in languages that don't
use whitespace, and in strings like "M.T.A." where a line breaker might be
tempted to break after a period.

Its opposite number is ZWSP (U+200B), which likewise doesn't generate any
actual whitespace, but indicates that line breaking *is* permitted here.
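
As a rough illustration, the three characters can be inspected with
Python's standard unicodedata module.  This is only a sketch: the general
category shows which of them is a real space (Zs) and which are invisible
format characters (Cf), and the "joined" string is a made-up example of
protecting "M.T.A." from being broken.  Note that later versions of
Unicode recommend U+2060 WORD JOINER for this job instead of U+FEFF.

```python
import unicodedata

chars = ["\u00a0", "\ufeff", "\u200b"]

# Official character names and general categories.
names = [unicodedata.name(ch) for ch in chars]
categories = [unicodedata.category(ch) for ch in chars]

for ch, name, cat in zip(chars, names, categories):
    print(f"U+{ord(ch):04X}  {cat}  {name}")

# Hypothetical use: forbid a line break after each period in "M.T.A."
# by inserting ZWNBSP between the pieces.  Nothing visible is added,
# but the string grows by two code points.
joined = "\ufeff".join(["M.", "T.", "A."])
print(len("M.T.A."), len(joined))
```

Whether a given line breaker actually honours these characters depends on
the software doing the breaking.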

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)
