Thanks!

Could you then explain the relationship between UTF-8 and XML? Does it make
sense to have unescaped UTF-8 encoded text in XML, or should UTF-8 always be
escaped when used in XML?

/Jacob

-----Original Message-----
From: Michael Smith [mailto:[EMAIL PROTECTED] 
Sent: 2 February 2004 01:04
To: Slide Users Mailing List
Subject: Re: TXFileStore and local filesystem

Jacob Lund wrote:
> Ok! Let me see if I can explain myself - I am not an expert on this so
> please correct me if I am wrong!
> 
> A UTF-8 representation of one character consists of a combination of
> characters. Now Java is a Unicode language, and this means that one
> character can represent "any" type of character in the world!

This isn't quite right. Your basic reasoning is more or less correct, but
your terminology is off in ways that will tend to confuse your thinking
(and that of others). You're confusing "character" and "byte" - a better
way to phrase this is:
        "A UTF-8 representation of one character consists of one or more bytes"
(note the distinction: a character is an abstract entity; any concrete
representation of that character is a series of bytes).
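
To make the character/byte distinction concrete, here is a minimal Java
sketch. The class and method names are made up for illustration, and it
assumes a reasonably current JDK (it uses StandardCharsets). It prints how
many bytes UTF-8 needs for a few single characters:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8ByteCounts {
    // Number of bytes the UTF-8 encoding of the given text occupies.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        // The samples are: A, e-acute, the euro sign, and a musical G clef.
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD834\uDD1E" };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s) in UTF-8%n",
                    s.codePointAt(0), utf8Length(s));
        }
    }
}
```

Each sample is one character, but the byte counts come out as 1, 2, 3
and 4 respectively.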

> 
> Basically UTF-8 only makes sense when working on an "old" 7 bit ASCII
> system and you need to use characters not available in the given codepage.

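No. UTF-8 a) makes sense in many places, and b) doesn't specifically
help in this case: it encodes every non-ASCII character using bytes with
the high bit set, so it still needs an 8-bit-clean system. There's a UTF-7
that was designed for 7-bit channels, but hardly anybody uses it, and I
really don't recommend even bothering to look up the details of it.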

> 
> Both UTF-8 and UTF-16 use a varying number of bytes to represent one
> character, where Unicode always uses 32 bit characters (maybe it is 24
> bit).

This gets somewhat complex.
Unicode itself does not use any number of bits for a character. Unicode
assigns each character a "codepoint": an abstract integer, with no
explicit representation.

THEN, you have an 'encoding' of this integer to give an explicit 
representation of that abstract codepoint.
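
As a rough illustration of "one codepoint, several encodings", the
following Java sketch (class and method names are my own invention, and it
assumes a current JDK) takes a codepoint as a plain integer and shows the
bytes that different encodings produce for it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CodePointEncodings {
    // Renders bytes as space-separated uppercase hex, e.g. "E2 82 AC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes a single codepoint (an abstract integer) with the given charset.
    static String encode(int codePoint, Charset charset) {
        String text = new String(Character.toChars(codePoint));
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        int euro = 0x20AC; // the codepoint of the euro sign
        System.out.println("UTF-8:    " + encode(euro, StandardCharsets.UTF_8));    // E2 82 AC
        System.out.println("UTF-16BE: " + encode(euro, StandardCharsets.UTF_16BE)); // 20 AC
        System.out.println("UTF-16LE: " + encode(euro, StandardCharsets.UTF_16LE)); // AC 20
    }
}
```

The integer 0x20AC is the same in every case; only the byte-level
representation changes.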

UTF-8 uses a variable number of bytes to represent each codepoint: from
1 to 4. (The original design of the encoding allowed for up to 6 bytes,
but unicode doesn't use any codepoints that need more than 4.) UTF-8 is
very widely used - for example, the overwhelming majority of XML content
uses UTF-8, and usage on the internet is generally (though definitely not
exclusively) migrating towards UTF-8 for most text content.

UTF-16 _generally_ uses a fixed 2 bytes per character. However, this is
complicated by "surrogate pairs": pairs of 16-bit values from a reserved
range, used to reach codepoints outside the BMP (Basic Multilingual
Plane), so those characters take 4 bytes. It's worth noting here that
Java's 'char' type (and hence Strings, etc.) is a 16-bit UTF-16 unit, and
the basic String operations treat each half of a surrogate pair as a
separate char - this is mostly ok, but makes it fairly painful to do
really complex multilingual stuff.
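
Here is a small sketch of what that looks like in practice. It assumes a
JDK that has the codepoint-aware String methods (Java 5 and later); the
class and method names are hypothetical:

```java
// Hypothetical helper class, for illustration only.
final class SurrogatePairs {
    // Number of 16-bit 'char' units in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of actual unicode codepoints in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the BMP, so UTF-16
        // stores it as a surrogate pair of two 16-bit units.
        String clef = "\uD834\uDD1E";
        System.out.println("chars:      " + charCount(clef));      // 2
        System.out.println("codepoints: " + codePointCount(clef)); // 1
        System.out.println("high surrogate first? "
                + Character.isHighSurrogate(clef.charAt(0)));      // true
    }
}
```

Code that indexes by char (charAt, substring and so on) can split such a
pair in the middle, which is where the pain comes from.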

There are two different byte orders for UTF-16: UTF-16LE and UTF-16BE
(little endian and big endian). When the byte order of a file isn't
known in advance, they are generally distinguished by an explicit BOM
(Byte Order Mark, another 'special' unicode character, U+FEFF) as the
first character of the file. When being used in memory (as Java does) in
an application, the character is generally stored in native endianness
for whatever platform is being used.
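
The byte-order difference is easy to see from Java. This is only a
sketch with made-up class and method names, relying on the documented
behaviour of Java's standard "UTF-16" charset (big endian with a BOM when
encoding, BOM-sensitive when decoding):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf16ByteOrder {
    // Renders bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes bytes with the generic UTF-16 charset, which honours a leading BOM.
    static String decodeWithBom(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Explicit byte orders: no BOM is written.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Generic UTF-16: Java writes a big-endian BOM (FE FF) first.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // A little-endian BOM (FF FE) tells the decoder to swap the byte order.
        byte[] littleEndian = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        System.out.println(decodeWithBom(littleEndian));                  // A
    }
}
```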

Some things (notably a lot of microsoft documentation - I haven't seen
this usage widely outside of MS software) use "unicode" to mean "the
UTF-16LE encoding of unicode". This is very confusing. So, for example,
when things say that NTFS stores filenames in unicode, it actually means
that they are stored in UTF-16LE. However, frequently this distinction
does not matter - to many applications, the only important point is that
unicode is being used, so the full character repertoire of unicode is
available (sometimes restricted only to the BMP).


There's also UTF-32, which always uses 32 bits per character. It's not 
widely used - mostly because for almost all applications, it's simply 
wasteful of memory.
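
To put a rough number on "wasteful", this sketch compares the encoded
size of the same text. The names are hypothetical, and the UTF-32 figure
is computed as 4 bytes per codepoint rather than by calling a UTF-32
charset, since that charset is not among the ones every JDK is required
to provide:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingSizes {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF-16BE is used so that no BOM is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Assumption: UTF-32 size computed as a fixed 4 bytes per codepoint, no BOM.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        System.out.println("UTF-8:  " + utf8Bytes(text));  // 12
        System.out.println("UTF-16: " + utf16Bytes(text)); // 24
        System.out.println("UTF-32: " + utf32Bytes(text)); // 48
    }
}
```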

> 
> This was my understanding of the UTF standards and unicode - am I wrong
> here?

I hope I've cleared some things up, here.

Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

