As defined in the standard, UTF-8 and UTF-16 are the default character encodings that every spec-compliant parser must recognize. This means that if you have something like this as a header

<?xml version="1.0"?>

you are free to use UTF-8 encoding in this XML document. You can use ASCII as well, since UTF-8 is backward compatible with ASCII...

Oliver

Jacob Lund wrote:

Thanks!

Could you then explain the relationship between UTF-8 and XML? Does it make
sense to have unescaped UTF-8 encoding in XML, or should UTF-8 always be
escaped when used in XML?

/Jacob

-----Original Message-----
From: Michael Smith [mailto:[EMAIL PROTECTED]]
Sent: 2 February 2004 01:04
To: Slide Users Mailing List
Subject: Re: TXFileStore and local filesystem


Jacob Lund wrote:

Ok! Let me see if I can explain myself - I am not an expert on this so
please correct me if I am wrong!

A UTF-8 representation of one character consists of a combination of
characters. Now Java is a Unicode language, and this means that one
character can represent "any" type of character in the world!


This is incorrect. Your basic reasoning is more or less right, but your terminology is incorrect in ways that will tend to confuse your thinking (and that of others). You're confusing "character" and "byte" - a better way to phrase this is:
"A UTF-8 representation of one character consists of one or more
bytes" (note the distinction: a character is an abstract entity, any representation of that character is as a series of bytes).



Basically UTF-8 only makes sense when working on an "old" 7-bit ASCII
system and you need to use characters not available in the given codepage.


No. UTF-8 a) makes sense in many places, and b) doesn't specifically help in this case. There's a UTF-7 that you could use for this, but nobody uses UTF-7, and I really don't recommend even bothering to look up the details of it.


Both UTF-8 and UTF-16 use a varying number of bytes to represent one
character, whereas Unicode always uses 32-bit characters (maybe it is 24-bit).


This gets somewhat complex.
Unicode does not use any particular number of bits for a character. Unicode specifies each character as an abstract integer (a "codepoint"), with no explicit representation.


THEN, you have an 'encoding' of this integer to give an explicit representation of that abstract codepoint.

UTF-8 uses a variable number of bytes to represent it (from 1-4, I think? I think the encoding allows for up to 6 bytes, but Unicode doesn't actually use more than 4). UTF-8 is very widely used - for example, the overwhelming majority of XML content uses UTF-8, and widespread usage on the internet is generally (though definitely not exclusively) migrating towards UTF-8 for most text content.
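
As a rough illustration of the 1-4 byte range, here is another small Java sketch I'm adding (the sample characters are arbitrary):

public class Utf8Lengths {
    public static void main(String[] args) throws Exception {
        // Number of bytes UTF-8 needs for a few sample codepoints:
        System.out.println("A".getBytes("UTF-8").length);            // 1 (U+0041)
        System.out.println("\u00e9".getBytes("UTF-8").length);       // 2 (U+00E9, 'é')
        System.out.println("\u20ac".getBytes("UTF-8").length);       // 3 (U+20AC, the euro sign)
        System.out.println("\uD801\uDC00".getBytes("UTF-8").length); // 4 (U+10400, outside the BMP)
    }
}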

UTF-16 _generally_ uses a fixed 2 bytes per character. However, this is complicated by "surrogate pairs", which are a special sort of escape sequence used by Unicode to allow access to codepoints outside the BMP (Basic Multilingual Plane). It's worth noting here that Java's 'char' type (and hence Strings, etc.) uses UTF-16 but ignores things like surrogates - this is mostly OK, but it makes it fairly painful to do really complex multilingual stuff.
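
You can see the surrogate behaviour directly from Java with a small sketch of my own (using an arbitrary codepoint outside the BMP):

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD801\uDC00";  // a single codepoint, U+10400, written as a surrogate pair
        System.out.println(s.length());                        // 2 -- Java counts UTF-16 code units
        System.out.println(Integer.toHexString(s.charAt(0)));  // d801, the high surrogate
        System.out.println(Integer.toHexString(s.charAt(1)));  // dc00, the low surrogate
    }
}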

There are two different versions of UTF-16, UTF-16LE and UTF-16BE (little-endian and big-endian). They are generally distinguished by the use of an explicit BOM (Byte Order Mark, another 'special' Unicode character) as the first character of a file. When used in memory in an application (as Java does), characters are generally stored in the native endianness of whatever platform is being used.
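
Java's built-in charsets illustrate the difference. Another small sketch I'm adding; the byte values in the comments are what Sun's JDK produces, where the plain "UTF-16" encoder writes a big-endian BOM:

public class Utf16Endianness {
    public static void main(String[] args) throws Exception {
        byte[] withBom = "A".getBytes("UTF-16");    // FE FF 00 41 -- BOM, then big-endian 'A'
        byte[] be      = "A".getBytes("UTF-16BE");  // 00 41       -- big-endian, no BOM
        byte[] le      = "A".getBytes("UTF-16LE");  // 41 00       -- little-endian, no BOM
        System.out.println(withBom.length + " " + be.length + " " + le.length); // prints: 4 2 2
    }
}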

Some things (notably a lot of Microsoft documentation - I haven't seen this usage widely outside of MS software) use "unicode" to mean "the UTF-16LE encoding of Unicode". This is very confusing. So, for example, when things say that NTFS stores filenames in unicode, it actually means that they are stored in UTF-16LE. However, frequently this distinction does not matter - to many applications, the only important point is that Unicode is being used, so the full character repertoire of Unicode is available (sometimes restricted only to the BMP).


There's also UTF-32, which always uses 32 bits per character. It's not widely used - mostly because for almost all applications, it's simply wasteful of memory.
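
For completeness, one last sketch of my own (this assumes the Java runtime in use provides a "UTF-32" charset - not all do; if it doesn't, getBytes will throw an UnsupportedEncodingException):

public class Utf32Demo {
    public static void main(String[] args) throws Exception {
        System.out.println("A".getBytes("UTF-32").length);            // 4 -- even plain ASCII takes 4 bytes
        System.out.println("\uD801\uDC00".getBytes("UTF-32").length); // 4 -- one codepoint, still 4 bytes
    }
}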



This was my understanding of the UTF standards and unicode - am I wrong
here?


I hope I've cleared some things up, here.

Mike

