Without going into details, I will note that I would say a few things here slightly differently.
That said, overall, you seem to be in the "right" ballpark.

Thanks,

--
Raul

On Sat, Jul 9, 2016 at 2:11 PM, Don Guinn <dongu...@gmail.com> wrote:
> Note 'Observations on Unicode'
>
> There seems to be a lot of confusion in J around Unicode and UTF-8.
> People seem to have a lot of trouble dealing with UTF-8 in J,
> particularly in arrays. Most of you are probably familiar with this
> issue and everything said here, but it bears repeating. I am posting
> it to source rather than programming since it also includes a
> proposed change to J, and recent Unicode posts are from people who
> are also on source.
>
> Unicode defines a representation of characters as numbers, or "code
> points". Some confusion results because the glyph (what a code point
> looks like) of one code point may look like the glyph of another.
> That is not really an issue as far as J is concerned. In Unicode
> charts a code point is normally written as U+HHHH, where HHHH is a
> number in hexadecimal. Four digits show up in most tables.
>
> Unicode is divided into planes. Plane 0, the Basic Multilingual
> Plane (BMP), is the range U+0 through U+FFFF and is the part of
> Unicode supported by the Windows and Unix double-byte character set
> (DBCS). Other planes of Unicode are not supported by Windows or Unix
> at this time.
>
> Code points are represented in several ways. Windows and Unix use
> 16-bit unsigned integers: the double-byte character set (DBCS), or
> wide characters in C. The corresponding J type is 131072 for 3!:0,
> named "unicode" (note the lower case "u"). This is not UTF-8.
> Windows and Unix also represent Unicode with UTF-8. UTF-8 is the
> primary representation for Unicode in J, an ingenious way to
> represent Unicode compatibly with ASCII. There are also UTF-16 and
> other encodings of the code point bits, and the web has still other
> ways to handle Unicode.
>
> The UTF-8 standard uses one to four bytes to contain the code point
> bits. The one-byte codes correspond to the ASCII characters U+0
> through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
> beyond what Windows and Unix now support. A three-byte UTF-8
> sequence covers 16 code point bits, which fits wide characters
> nicely.
>
> UTF-8 contains start bytes and continuation bytes. Start bytes with
> a high-order zero bit look exactly like standard ASCII. Start bytes
> of the form "11xxxxxx" mark the starts of multi-byte codes. Bytes of
> the form "10xxxxxx" are continuation bytes and must follow a start
> byte. The number of continuation bytes is given in the start byte.
> )
>
> Note 'Invalid byte sequences'
>
> It is an error if something happens to separate a start byte from
> its continuation bytes. The invalid cases are:
>
> An unexpected continuation byte.
> Continuation bytes must only follow start bytes.
>
> A start byte not followed by enough continuation bytes.
> The start byte defines how many continuation bytes follow.
>
> A sequence that decodes to a value greater than U+10FFFF.
> RFC 3629 limits Unicode to this maximum for compatibility with
> UTF-16.
>
> An overlong encoding.
> Each continuation byte contains 6 code point bits. A code point
> must be encoded in the fewest bytes that will hold it; a longer
> encoding of the same value is invalid.
>
> How UTF-8 characters failing the above tests display varies from
> system to system. The official position now is to display the
> replacement character � (U+FFFD). J often displays other characters.
> )
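> The start/continuation structure is easy to inspect from inside J
> using only primitives. For example, here is one way to classify each
> byte of a literal as ASCII (0), continuation (1) or start byte (2);
> the verb cls is just a sketch, not a library verb, and the sample
> characters are arbitrary:
>
>    cls =: 3 : '(128 <: b) + 192 <: b =. a. i. y'
>    cls 'Δaα'       NB. start, continuation, ASCII, start, continuation
> 2 1 0 2 1
>    #: a. i. 'Δ'    NB. in binary: 110xxxxx start, 10xxxxxx continuation
> 1 1 0 0 1 1 1 0
> 1 0 0 1 0 1 0 0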
> Note 'The Internet'
>
> The internet's older text protocols, mail in particular, only
> support the transmission of text as ASCII characters, characters in
> the range 0 through 7F hex. And many special characters are not
> allowed in text. Those characters not allowed are sent in a few
> ways:
>
> 1. A byte is represented as two hexadecimal digits following an
>    equal sign (=), the quoted-printable encoding. For example:
>    blank is sent as =20 instead of as 32{a. .
>
>    The Unicode symbol α (U+3B1) is sent as "=ce=b1"; the bytes are
>    UTF-8, a start byte followed by a continuation byte, in
>    hexadecimal.
>
> 2. &#nnn; - where nnn is the decimal code point number. For
>    example: &#916; will display as Δ.
>
> 3. &#xhhh; - same as 2. except the number is in hexadecimal.
>
> Unicode and UTF-8 cannot be sent directly. They must be converted
> to ASCII as described above. A raw text file with a lot of
> characters beyond ASCII can get very hard to view.
> )
>
> Note 'Unicode and J'
>
> J primitives treat UTF-8 bytes as plain literals. They do not
> recognize start and continuation bytes as parts of multi-byte
> characters. It is up to the programmer to handle the multi-byte
> characters. Vectors of UTF-8 characters display fine and are easy
> to work with. But higher dimensions must be carefully managed. Take
> the problem covered recently.
> )
>
>    ]s=: 8 6 $ 'ఝ' ,'a','ఝ'
> à° aà°
> à° aà
> ° à° a
> ఝఝ
> aà° à°
> aà° à
> ° aà°
> à° aà°
>
> Note ''
>
> When start bytes are separated from their continuation bytes, error
> characters are displayed. J displays a line at a time. Continuation
> bytes moved to the next line are not recognized as a continuation
> of a multi-byte UTF-8 character.
>
> This example has all kinds of problems. First, the glyphs are wider
> than other characters, so there is no way they can align with other
> fixed-width characters. Second, their UTF-8 codes are 3 bytes, not
> supported by the new boxing algorithm in J.
>
> The following definition of s gives a better-supported example for
> examining the handling of UTF-8 in J.
> )
>
>    ]s=:'Δ' , 'a' ,'Δ'
> ΔaΔ
>    $s
> 5
>    a.i.s
> 206 148 97 206 148
>    <s              NB. Displays boxed nicely.
> ┌───┐
> │ΔaΔ│
> └───┘
>    8 6 $ s         NB. But how about reshaping?
> ΔaΔÎ
> ”aΔΔ
> aΔΔa
> ΔΔaÎ
> ”ΔaΔ
> ΔaΔÎ
> ”aΔΔ
> aΔΔa
>
> NB. We still have the problem of splitting start and continuation
> NB. bytes.
>
> NB. Say we want to display the 3 characters in a column.
>
>    ]s=: 'Δ' , 'a' ,: 'Δ'
> Δ
> aa
> Δ
>
> Note ''
>
> Where did the second "a" come from? To J the "Δ" is 2 characters,
> so the scalar "a" is replicated to match lengths. To get it to look
> right the "a" needs to be padded instead.
> )
>
>    ]s=: 'Δ' , 'a ' ,: 'Δ'
> Δ
> a
> Δ
>
>    <s              NB. But now the boxed display isn't what we want.
> ┌──┐
> │Δ │
> │a │
> │Δ │
> └──┘
>
> NB. Perhaps treating it as a vector works better.
>
>    ]s=: 'Δ' , LF, 'a' , LF, 'Δ'
> Δ
> a
> Δ
>    <s              NB. But the line feeds are treated as blanks.
> ┌─────┐
> │Δ a Δ│
> └─────┘
>
>    ]s=: 'Δ' ; 'a' ; 'Δ'   NB. Try boxing each character.
> ┌─┬─┬─┐
> │Δ│a│Δ│
> └─┴─┴─┘
>    >s
> Δ
> a
> Δ
>    <>s             NB. Box looks OK but extra space.
> ┌──┐
> │Δ │
> │a │
> │Δ │
> └──┘
>    ,>s             NB. Still got the extra space.
> Δa Δ
>    ,s,&><LF        NB. Still not quite right.
> Δ
> a
> Δ
>
>    ,>s,<'α'
> Δa Δα
>
> Note ''
>
> This requires a lot of special effort to handle UTF-8 characters.
> And any definitions written for handling ASCII text will probably
> need modifying to get them to work with UTF-8.
>
> It is my opinion that converting UTF-8 literals to unicode makes
> manipulating arrays of characters much easier. So I defined 4 verbs
> to assist.
> )
>
> U  =: u:        NB. Convert code points to unicode.
> UCP=: 3&u:      NB. Convert UTF-8 (char) or unicode to code points.
> UC =: 7&u:      NB. Convert UTF-8 to unicode if necessary.
> UD =: 8&u:"1    NB. Convert unicode to UTF-8 or char.
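> As a quick check of these covers (the example values here are
> arbitrary), UCP matches a.&i. on a literal but gives true code
> points once UC has widened the text, and UD converts back to UTF-8:
>
>    UCP 'aΔ'        NB. literal: the raw UTF-8 bytes
> 97 206 148
>    UCP UC 'aΔ'     NB. unicode: the code points
> 97 916
>    UD U 945 946    NB. code points -> unicode -> UTF-8 literal
> αβ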
> Note ''
>
> Use U instead of a.&{ to convert numbers to text.
>
> Use UCP instead of a.&i. . It gives the same result as a.&i. for
> literals and gives code points for unicode.
>
> UD converts unicode to literal, turning any characters outside of
> ASCII into UTF-8. It can be used instead of a.&{ to convert. It is
> necessary to set its rank to 1, as 8&u: only works on vectors.
>
> So let's try s as unicode instead of as UTF-8.
> )
>
>    ]s=: UC 'ΔaΔ'
> ΔaΔ
>    $s
> 3
>    UCP s
> 916 97 916
>    <s
> ┌───┐
> │ΔaΔ│
> └───┘
>    ,.s
> Δ
> a
> Δ
>    <,.s
> ┌─┐
> │Δ│
> │a│
> │Δ│
> └─┘
>    <"0 s
> ┌─┬─┬─┐
> │Δ│a│Δ│
> └─┴─┴─┘
>    s,'Δ'
> ΔaΔÎ”
>
> Note ''
>
> Oops! We have a problem.
>
> When J mixes nouns of different internal types it must convert them
> to the same type before processing, like comparing 1 to 1.5-0.5 .
> Here J converts char to wide by putting a zero byte in front of
> each char byte. This works fine for ASCII, but not if the noun
> includes any UTF-8 multi-byte codes.
>
> Here J treated the 2-byte UTF-8 Δ as two separate characters,
> neither of which is the intended code point.
>
> One must make sure that unicode is never mixed with literal that
> may contain UTF-8 multi-byte characters.
> )
>
>    s,UC 'Δ'
> ΔaΔΔ
>
> Note 'Proposal'
>
> When J primitives need to convert literal (char) to unicode (wide),
> they should convert the char using the UTF-8 conversion algorithm
> (7&u:) instead of adding a zero byte.
>
> This would give almost complete transparency when mixing UTF-8 and
> unicode. I can't think of any case where one would not want UTF-8
> converted to unicode when literal is mixed with unicode; however,
> if the plain byte-wise conversion is required, 2&u: could be used.
>
> This should not cause any backward compatibility problems, as it
> only changes how char is converted to wide by default, something I
> suspect no one currently relies on. No other operation changes.
> )
>
> Note 'Implementation'
>
> I realize that it is very late in this development cycle of J, so
> this would probably not be done in it; however, I feel that this
> change to make UTF-8 and unicode more compatible would make it
> easier for people to use unicode, avoiding all the confusion of
> trying to do everything in UTF-8.
>
> This would affect the concatenation verbs (dyadic , ,. ,: and
> monadic ;), the comparison verbs (= and -:) and probably amend (}).
>
> Hopefully the conversion is done in a single macro or subroutine
> used by these verbs. If that is the case, the change should not be
> too difficult.
>
> Perhaps later.
> )
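> Until something like this is adopted, a small cover verb gives the
> safe behavior today. This is only a sketch: the name jcat is
> arbitrary, and it assumes the UC cover defined above.
>
>    jcat =: UC@[ , UC@]    NB. widen both arguments before joining
>    (UC 'ΔaΔ') jcat 'Δ'
> ΔaΔΔ
>    'abc' jcat 'Δ'         NB. all-ASCII literal widens harmlessly
> abcΔ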