Theodore Smith said:

> I find the unicode website very confusing.

Not a very useful observation. If you have a specific suggestion
for improvement, make it to the webmaster at:

http://www.unicode.org/unicode/reporting.html

>
> Is it that to get any useful non confusing information,
> we have to buy your huge book?

I guess you may not have spent enough time on the site to notice
that the huge book is online:

http://www.unicode.org/unicode/uni2book/u2.html

Look in Chapter 3 for the UTF-8 definition.

> What with all the addendums, addendums
> to addendums and addendums to addendums to addendums, crossings out,
> etc etc. It becomes impossible to work out what you are really saying.
>
> Why not just make one technical standard, like w3.org do for XML?

Because we cannot publish a new version of the entire book every
year. The editors are well aware of the fact that amended text
to the standard in the online publications for Unicode 3.1 and
Unicode 3.2 makes some sections difficult to follow for the most
recent edition. That is why we *do* republish the entire text of
the standard at intervals, for the major editions, when we can.

>
> My problem is, I'm trying to work out how UTF8, and UTF16 are
> encoded.

This is all available in the online version.

> I heard that UTF32 can have surrogate pairs!

You have heard incorrectly. See:

http://www.unicode.org/unicode/reports/tr19/

UAX #19 "UTF-32":

"An irregular UTF-32 code unit sequence is an eight-byte sequence
where the first four bytes correspond to a high surrogate, and the
next four bytes correspond to a low surrogate. As a consequence of
C12, these irregular UTF-32 sequences shall not be generated by a
conformant process."

> This is pretty
> crazy I think because UTF can only encode 10FFFF (a nice number, comes
> to 1114111 a nicer number) values. While 4 bytes can hold over 4
> billion values. So whats the use of surrogates with UTF32?

None.

>
> I can't find this information.

Now you have it.

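
For anyone who wants to check the arithmetic, here is a minimal sketch
in Python 3. The names (MAX_CODE_POINT, utf16_units, utf32_unit) are
made up for illustration and are not taken from the standard. It assumes
the usual UTF-16 surrogate construction and shows that UTF-32 simply
stores the code point itself, so a surrogate value has no job to do there.

```python
MAX_CODE_POINT = 0x10FFFF  # 1114111 in decimal, the last Unicode code point


def is_surrogate(cp: int) -> bool:
    """True for values in the range reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


def check_scalar(cp: int) -> None:
    """Reject anything that is not an encodable Unicode scalar value."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"out of range: {cp:#x}")
    if is_surrogate(cp):
        raise ValueError(f"surrogate code point: {cp:#x}")


def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (one or two 16-bit values) for cp."""
    check_scalar(cp)
    if cp < 0x10000:
        return [cp]                      # fits in a single 16-bit unit
    v = cp - 0x10000                     # 20 bits left to distribute
    high = 0xD800 | (v >> 10)            # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits -> low surrogate
    return [high, low]


def utf32_unit(cp: int) -> int:
    """Return the single 32-bit UTF-32 code unit for cp: the value itself."""
    check_scalar(cp)
    return cp


if __name__ == "__main__":
    cp = 0x1D11E  # a code point above U+FFFF
    print(MAX_CODE_POINT)
    print([hex(u) for u in utf16_units(cp)])
    print(hex(utf32_unit(cp)))
```

In this sketch utf32_unit raises an error for any value from D800 to DFFF,
which is one way a process could avoid ever generating the irregular
sequences described in UAX #19.
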
> I have found addendums to addendums
> that might or might not be the final answer, or the complete answer,
> but I can't tell because its not all compiled into one standard
> definition.

Unicode 4.0 will all be compiled into one huge book again. But
then I wonder if you actually want to "buy [our] huge book" in
any case. ;-)

--Ken

BTW, it is "Unicode" -- not "UniCode".

>
>
> --
> Theodore H. Smith - Macintosh Consultant / Contractor.
> My website: <www.elfdata.com/>

