Your "Magic encoders" do not really help. They are sort of magic, yes, but it is probably even more difficult to see how to use them (using quaternary numbers to compute octets) than to just understand the algorithm as described (using binary numbers to compute octets).

The way I usually think about the conversion is in binary numbers. Conversion from hex to binary is trivial, as is shifting binary digits or changing their grouping to set the octet values, and converting them back to hex. The only non-trivial conversion is between binary (or hex) and decimal: you need a mental conversion table of octet values, which is easy to remember only up to 4 bits, or for a few specific octets with only one bit set to 1 (like 0x10=16, 0x20=32, 0x40=64) and these values minus 1 (like 0xFF=255). Beyond that, you can memorize the full range of octet values if you want to reduce the number of mental operations and the risk of errors.

But a computer is simple to program without any such conversion table: it converts numbers between binary, hex and decimal for you. In fact, no numeric base conversion is needed at all to convert between code points and octet encodings: the program works directly with binary numbers and only has to care about how to group subsequences of bits (such as octets or a full code point) into code units for storage (e.g. bytes), and how to pad those bits within the code units.

But there's still a bug (or request for enhancement) for your Pocket converters:

- For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates) from the set of convertible code points.
- But you don't exclude this range in your UTF-8 and UTF-32 "magic encoders", so a user could forget this case. Of course your encoder would create distinct sequences for these code points, but they are not valid UTF-8 or valid UTF-32 encodings.
- One row in the UTF-8 magic encoder covers the whole range U+0800..U+FFFF. This row should be split into two disjoint parts, U+0800..U+D7FF and U+E000..U+FFFF.
- The same remark applies to your 1-row magic encoder for UTF-32 (two rows should be used).
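
To illustrate the point about working directly on bits, here is a minimal sketch in Python of a UTF-8 encoder that only shifts and masks, and that rejects the surrogate range. The function name `utf8_octets` is my own choice for this example; it is not taken from the pocket encoders or from any library.

```python
def utf8_octets(cp: int) -> bytes:
    """Encode one code point as UTF-8 by regrouping its bits.

    Illustrative helper only: no base conversion, just shifts and masks.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are not encodable in UTF-8 (nor in UTF-32).
        raise ValueError(f"surrogate code point U+{cp:04X} cannot be encoded")
    if cp < 0x80:
        # 7 bits fit in a single octet: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For the code point in the original question, `utf8_octets(0x4E8C).hex(" ")` gives `e4 ba 8c`: the middle octet BA is the prefix bits 10 followed by the six middle bits 111010 of 0x4E8C. A UTF-32 encoder would need the same surrogate check before storing the code point in a 32-bit code unit.
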

2012/12/12 Otto Stolz <[email protected]>

> Hello,
>
> On 2012-12-11 20:16, James Lin wrote:
>
>> If i have a code point: U+4E8C or "二"
>> In UTF-8, it's "E4 BA 8C" while in UTF-16, it's "4E8C".
>> Where is this "BA" comes from?
>
> Cf. <http://skew.org/cumped/>.
>
> Enclosed are the (almost original) version of “Cima’s Magic
> UTF-8 Pocket encoder” (2004), and its two followers for
> more UTFs. Display or print with a fixed-pitch font,
> such as Lucida Console or Courier New. Enjoy!
>
> Cheers,
> Otto Stolz

