Jacob Lund wrote:
Ok! Let me see if I can explain myself - I am not an expert on this so
please correct me if I am wrong!

A UTF-8 representation of one character consists of a combination of
characters. Now Java is a Unicode language and this means that one character
can represent "any" type of character in the world!

This is incorrect. Your basic reasoning is more or less right, but your terminology is off in ways that will tend to confuse your thinking (and that of others). You're confusing "character" and "byte" - a better way to phrase this is:
"A UTF-8 representation of one character consists of one or more bytes" (note the distinction: a character is an abstract entity; any representation of that character is a series of bytes).



Basically UTF-8 only makes sense when working on an "old" 7-bit ASCII system and you need to use characters not available in the given codepage.

No. UTF-8 a) makes sense in many places, and b) doesn't specifically help in this case. There's a UTF-7 that you could use for this, but nobody uses UTF-7, and I really don't recommend even bothering to look up the details of it.



Both UTF-8 and UTF-16 use a varying number of bytes to represent one character, where Unicode always uses 32-bit characters (maybe it is 24-bit).

This gets somewhat complex.
Unicode itself does not fix any number of bits per character. Unicode specifies characters (as "codepoints") as abstract integers, with no explicit representation.


THEN, you have an 'encoding' of this integer to give an explicit representation of that abstract codepoint.

UTF-8 uses a variable number of bytes to represent it - from 1 to 4 for Unicode codepoints (the encoding scheme originally allowed for up to 6 bytes, but Unicode stops at U+10FFFF, which never needs more than 4). UTF-8 is very widely used - for example, the overwhelming majority of XML content uses UTF-8, and usage on the internet is generally (though definitely not exclusively) migrating towards UTF-8 for most text content.
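
Here's a short sketch of that variable-length behaviour (it assumes the Java 5+ codepoint APIs such as Character.toChars and String.codePointAt; the sample characters are just ones I picked to hit each length):

    // Each sample codepoint needs a different number of UTF-8 bytes.
    public class Utf8Lengths {
        public static void main(String[] args) throws Exception {
            String[] samples = {
                "A",                                     // U+0041  (ASCII)          -> 1 byte
                "\u00E9",                                // U+00E9  (e with acute)   -> 2 bytes
                "\u20AC",                                // U+20AC  (euro sign)      -> 3 bytes
                new String(Character.toChars(0x1D11E))   // U+1D11E (musical G clef) -> 4 bytes
            };
            for (String s : samples) {
                System.out.printf("U+%04X -> %d UTF-8 byte(s)%n",
                        s.codePointAt(0), s.getBytes("UTF-8").length);
            }
        }
    }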

UTF-16 _generally_ uses a fixed 2 bytes per character. However, this is complicated by "surrogate pairs", which are a special sort of escape sequence used by Unicode to allow access to codepoints outside the BMP (Basic Multilingual Plane). It's worth noting here that Java's 'char' type (and hence Strings, etc.) uses UTF-16, but the basic APIs largely ignore things like surrogates - this is mostly OK, but makes it fairly painful to do really complex multilingual stuff.
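
For example (again leaning on the Java 5+ surrogate/codepoint methods), a character outside the BMP occupies two chars in a Java String, and length() counts UTF-16 code units rather than characters:

    // A supplementary character is stored as a surrogate pair of two chars.
    public class SurrogateDemo {
        public static void main(String[] args) {
            String clef = new String(Character.toChars(0x1D11E));       // U+1D11E, outside the BMP
            System.out.println(clef.length());                          // 2 - UTF-16 code units
            System.out.println(clef.codePointCount(0, clef.length()));  // 1 - actual characters
            System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
            System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
        }
    }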

There are two byte orders for UTF-16, UTF-16-LE and UTF-16-BE (little-endian and big-endian). They are generally distinguished by the use of an explicit BOM (Byte Order Mark, another 'special' Unicode character) as the first character of a file. When used in memory in an application (as Java does), characters are generally stored in the native endianness of whatever platform is being used.
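
You can see the BOM and the byte order directly from Java's charset names (as far as I know, the plain "UTF-16" charset writes a big-endian BOM when encoding, while the explicitly-endian variants write none):

    // Encode a single 'A' (U+0041) with the three UTF-16 charset names.
    public class BomDemo {
        public static void main(String[] args) throws Exception {
            printHex("UTF-16  ", "A".getBytes("UTF-16"));   // FE FF 00 41 (BOM + big-endian)
            printHex("UTF-16BE", "A".getBytes("UTF-16BE")); // 00 41       (no BOM)
            printHex("UTF-16LE", "A".getBytes("UTF-16LE")); // 41 00       (no BOM)
        }

        static void printHex(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label + ": ");
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(sb);
        }
    }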

Some things (notably a lot of Microsoft documentation - I haven't seen this usage widely outside of MS software) use "Unicode" to mean "the UTF-16-LE encoding of Unicode". This is very confusing. So, for example, when things say that NTFS stores filenames in Unicode, it actually means that they are stored in UTF-16-LE. However, frequently this distinction does not matter - to many applications, the only important point is that Unicode is being used, so the full character repertoire of Unicode is available (sometimes restricted to just the BMP).


There's also UTF-32, which always uses 32 bits per character. It's not widely used - mostly because for almost all applications, it's simply wasteful of memory.
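
A rough comparison on mostly-ASCII text shows why (this assumes your JRE's charset provider includes "UTF-32", which most modern ones do):

    // Byte cost of the same short string in three encodings.
    public class SizeComparison {
        public static void main(String[] args) throws Exception {
            String text = "Hello, world";
            System.out.println("characters: " + text.length());              // 12
            System.out.println("UTF-8 : " + text.getBytes("UTF-8").length);  // 12 bytes
            System.out.println("UTF-16: " + text.getBytes("UTF-16").length); // 26 bytes (2-byte BOM + 2 bytes/char)
            System.out.println("UTF-32: " + text.getBytes("UTF-32").length); // 4 bytes/char (48 here, give or take a BOM)
        }
    }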



This was my understanding of the UTF standards and unicode - am I wrong here?

I hope I've cleared some things up here.


Mike




