I consider myself a frequent user of wide and UTF-8 unicode, and yet I prefer doing the proper conversion myself rather than relying on automatic conversion by J.
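A minimal sketch of what explicit conversion can look like, reusing the Δ example from Don's quoted post below. It assumes only the 7 u: and 8 u: conversions that the quoted UC and UD verbs wrap; the names t and b are made up for illustration.

```j
NB. Sketch only: convert at the boundary by hand instead of relying on
NB. J to promote literal to wide. t and b are illustrative names.
s =: 7 u: 'ΔaΔ'        NB. UTF-8 literal -> wide (unicode)
$ s                     NB. 3, one item per character
3 u: s                  NB. code points 916 97 916
t =: s , 7 u: 'Δ'       NB. convert the literal first, then catenate with wide
$ t                     NB. 4
b =: 8 u: t             NB. back to UTF-8 literal, e.g. before writing a file
```

The shapes and code points in the comments are the ones reported in the quoted session; the point is only that every literal is passed through 7 u: before it meets a wide noun.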
UTF-8 can be confusing, but the best remedy I can suggest is to gain more experience with it, not to rely on J to convert literal to wide automatically. J already converts literal to wide for external interfaces such as the 1!:x family and 15!:0 under Windows. The recent beta for unicode display made an assumption similar to yours: it converts all one-byte literals to wide regardless of whether they are UTF-8 or not. This changed the behavior in the way Bob demonstrated, with an illegal symbol becoming another character. I would like to have this fixed if possible.

On Jul 11, 2016 6:14 AM, "Don Guinn" <dongu...@gmail.com> wrote:

> Thanks for your response Bill. I certainly have no problem any more
> controlling the way literal converts to unicode; however, I would like you
> to consider the following.
>
> Literal is a problem as it has many inconsistent uses. The ASCII part is
> not a problem as it is the same whether literal (char) is interpreted as
> extended ASCII or UTF-8. Char is also used as numeric in many places, like
> image files.
>
> I was not thinking that the errors were in J, but in my failure to ensure
> that UTF-8 codes have been converted to unicode (wide) before catenating or
> whatever with wide. When support was added to J for Unicode the
> decision to maintain compatibility with extended ASCII was correct. But
> things have changed. Now I doubt that anyone uses extended ASCII. Any
> programs using extended ASCII are probably obsolete or converted to use
> UTF-8.
>
> Like dropping the dot after x and y, it makes more sense now to assume that
> char may contain UTF-8 when treated as text.
>
> I am not suggesting any change in the way char is handled except when
> combining with wide. So programs not using wide would not be affected. Wide
> is different from char as it is only Unicode. It has no other use. So any
> time wide and char are mixed the char bytes must be Unicode code points. So
> I looked at what U+80 through U+FF are.
> Some control codes, which I don't understand, and the Latin-1 Supplement.
> There are many useful symbols in this range. But how would they be entered?
>
> The only way to enter them as extended ASCII is by indexing into a. . Say I
> want to assign some name a value with the £ symbol.
>
>    (163{a.),":1234
> �1234
>    (2 u: 163{a.),":1234
> £1234
>    (4 u: 163),":1234
> £1234
>    ]Value=:'£',":1234
> £1234
>    a.i.Value
> 194 163 49 50 51 52
>    3 u: 7 u: Value
> 163 49 50 51 52
>
> OK. I cheated. I can't directly enter £ on my keyboard, but the
> British can on theirs. But notice that J did not treat the 163{a. as
> £. It entered the UTF-8 representation. And the same thing would
> happen for any of the Latin-1 Supplement characters. They would be in UTF-8
> multi-byte.
>
> Anything retrieved from the web assumes that characters coming in can
> include UTF-8 or be wide. J assumes that char is UTF-8 for both entry and
> display. It actually treats extended ASCII as invalid UTF-8.
>
> Things get messy when trying to manipulate UTF-8 in J, particularly in
> arrays. People seem to avoid using wide because of the care that must be
> taken when using it, then end up struggling to make UTF-8 cooperate.
> All the more reason to make wide easy to use.
>
> Question: Does anybody have a need for or have any program that has a need
> for char bytes >127 to be expanded to wide by adding a zero byte?
>
> On Sat, Jul 9, 2016 at 6:15 PM, Don Guinn <dongu...@gmail.com> wrote:
>
> > Thanks. I'll look them up.
> >
> > I think most of the too wide characters take 3 UTF-8 bytes so aren't
> > supported any way.
> > On Jul 9, 2016 5:21 PM, "robert therriault" <bobtherria...@mac.com> wrote:
> >
> > Thanks Don, both for this encapsulation and the writing that you have done in
> > the past on unicode in J. I am learning a lot.
> >
> > Two things which are worth mentioning.
> >
> >    utf8
> > 8&u:
> >    ucp
> > 7&u:
> >    uucp
> > u:@(7&u:)
> >
> > are already defined in the standard library and may lead to some confusion
> > with your definitions
> >    UCP
> > 3&u:
> >    UD  NB. I like your use of rank as that was an issue in my early explorations of unicode
> > 8&u:"1
> >    UC
> > 7&u:
> >
> > and second, the issue remains with uneven boxing with some of the unicode
> > characters (which I think is one of the prime motivators in this adventure.)
> >
> >    ]s=: ucp 'ΔaΔ'
> > ΔaΔ
> >    <s
> > ┌───┐
> > │ΔaΔ│
> > └───┘
> >    ]t=: ucp 'Δaఝ' NB. last character wider
> > Δaఝ
> >    <t
> > ┌───┐
> > │Δaఝ│
> > └───┘
> >    uwid=: ({.@:glqextent_jgl2_ @: u: @: ":) "0 NB. glqextent_jgl2_ available in jqt ide
> >    uwid s
> > 9 7 9
> >    uwid t
> > 9 7 16
> >    JVERSION
> > Engine: j805/j64/darwin
> > Beta-9: commercial/2016-07-05T17:11:06
> > Library: 8.04.15
> > Qt IDE: 1.4.9/5.4.2
> > Platform: Darwin 64
> > Installer: J804 install
> > InstallPath: /users/bobtherriault/j64-804
> > Contact: www.jsoftware.com
> >
> > Perhaps your solution of dealing primarily with unicode wide characters
> > along with finding some way of having box sizing respond to the different
> > character widths could be worth exploring?
> >
> > Cheers, bob
> >
> >
> >
> > > On Jul 9, 2016, at 3:12 PM, Don Guinn <dongu...@gmail.com> wrote:
> > >
> > > I would appreciate any suggestions. I tried working with UTF-8 but it
> > > wasn't easy to work with arrays of UTF-8. Then tried using unicode. Much
> > > easier. Except had to be really careful when mixing with UTF-8. Hence the
> > > proposal.
> > > On Jul 9, 2016 4:05 PM, "Raul Miller" <rauldmil...@gmail.com> wrote:
> > >
> > >> Without going into details, I will note that I would say a few things
> > >> here slightly differently.
> > >>
> > >> That said, overall, you seem to be in the "right" ballpark.
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >> Raul
> > >>
> > >>
> > >> On Sat, Jul 9, 2016 at 2:11 PM, Don Guinn <dongu...@gmail.com> wrote:
> > >>> Note 'Observations on Unicode'
> > >>>
> > >>> There seems to be a lot of confusion in J around Unicode and UTF-8.
> > >>> It seems that people are having a lot of trouble dealing with UTF-8
> > >>> in J, particularly in arrays. Most of you are probably familiar with
> > >>> this issue and all said here, but it bears repeating. I am posting
> > >>> it to source rather than programming since it also includes a proposed
> > >>> change to J and recent Unicode posts are from people also on source.
> > >>>
> > >>> Unicode defines a representation of characters as numbers or "code
> > >>> points". Some confusion results in that the glyph (what a code point
> > >>> looks like) may look like the glyph of another code point. Not really
> > >>> an issue as far as J is concerned. In Unicode charts the Unicode
> > >>> number is normally written as U+HHHH, where HHHH is a number in
> > >>> hexadecimal. Four digits show up in most tables.
> > >>>
> > >>> Unicode is divided into planes. Plane 0, Basic Multilingual Plane
> > >>> (BMP), is the range U+0 through U+FFFF and is the part of Unicode
> > >>> supported by Windows and Unix double-byte character set (DBCS). Other
> > >>> planes of Unicode are not supported by Windows or Unix at this time.
> > >>>
> > >>> The code points are represented several ways. Windows and Unix use
> > >>> 16 bit unsigned integers, double-byte-character-set (DBCS) or wide
> > >>> characters in C. J type is 131072 for 3!:0 and names this “unicode”
> > >>> (note the lower case "u"). This is not UTF-8. Windows and Unix also
> > >>> represent Unicode with UTF-8. UTF-8 is the primary representation
> > >>> for Unicode in J, an ingenious way to represent Unicode compatible
> > >>> with ASCII. Also there is UTF-16 and others to represent code point
> > >>> bits for Unicode.
> > >>> The Web has other ways to handle Unicode.
> > >>>
> > >>> The standard for UTF-8 defines one to four bytes to contain the code
> > >>> point bits. The one byte codes correspond to the ASCII characters U+0
> > >>> through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
> > >>> beyond what Windows and Unix now support. Three byte UTF-8 covers
> > >>> 16 code point bits, which fits nicely for wide characters.
> > >>>
> > >>> UTF-8 contains start bytes and continuation bytes. Start bytes with a
> > >>> high order zero bit look exactly like standard ASCII. Start bytes of
> > >>> “11xxxxxx” mark the starts of multi-byte codes. Bytes of “10xxxxxx”
> > >>> are continuation bytes and must follow a start byte. The number of
> > >>> continuation bytes is given in the start byte.
> > >>> )
> > >>>
> > >>> Note 'Invalid byte sequences'
> > >>>
> > >>> It is an error if something happens to separate the start byte from
> > >>> its continuation bytes.
> > >>>
> > >>> An unexpected continuation byte.
> > >>> Continuation bytes must only follow start bytes.
> > >>>
> > >>> A start byte not followed by enough continuation bytes.
> > >>> The start byte defines how many continuation bytes follow.
> > >>>
> > >>> A sequence that decodes to a value greater than U+10FFFF.
> > >>> RFC 3629 limits Unicode to this maximum for compatibility with UTF-16.
> > >>>
> > >>> An overlong encoding.
> > >>> Each continuation byte contains 6 code point bits. If the first
> > >>> continuation byte contains all zero code point bits it should be
> > >>> shortened.
> > >>>
> > >>> The display of any UTF-8 characters failing the above tests varies
> > >>> from system to system. The official position now is to display
> > >>> � (U+FFFD). J often displays other characters.
> > >>> )
> > >>>
> > >>> Note 'The Internet'
> > >>>
> > >>> The internet only supports the transmission of text as ASCII
> > >>> characters, characters in the range 0 through 7f hex.
> > >>> And many special characters are not allowed in text. Those characters
> > >>> not allowed are sent in a few ways.
> > >>>
> > >>> 1. A byte is represented as two hexadecimal digits following an
> > >>>    equal sign (=). For example: Blank is sent as =20 instead of
> > >>>    as 32{a. .
> > >>>
> > >>>    The Unicode symbol α (U+3B1) is sent as "=ce=b1", the bytes
> > >>>    are UTF-8, a start byte followed by a continuation byte in
> > >>>    hexadecimal.
> > >>>
> > >>> 2. &#nnn; - where nnn is the decimal code point number. For
> > >>>    example: &#916; will display as Δ.
> > >>>
> > >>> 3. &#xhhhh; - same as 2. except the number is in hexadecimal.
> > >>>
> > >>> Unicode and UTF-8 cannot be sent directly. They must be converted to
> > >>> ASCII as described above. A raw text file with a lot of characters
> > >>> beyond ASCII can get very hard to view.
> > >>> )
> > >>>
> > >>> Note 'Unicode and J'
> > >>>
> > >>> J primitives treat UTF-8 bytes as literal. They do not recognize
> > >>> start and continuation bytes as UTF-8 making up characters. It is
> > >>> up to the programmer to handle the multi-byte characters. Vectors
> > >>> of UTF-8 characters display fine and are easy to work with. But
> > >>> higher dimensions must be carefully managed. Take the problem
> > >>> covered recently.
> > >>> )
> > >>>
> > >>>    ]s=: 8 6 $ 'ఝ' ,'a','ఝ'
> > >>> à° aà°
> > >>> à° aà
> > >>> ° à° a
> > >>> ఝఝ
> > >>> aà° à°
> > >>> aà° à
> > >>> ° aà°
> > >>> à° aà°
> > >>>
> > >>> Note ''
> > >>>
> > >>> When start bytes are separated from continuation bytes error
> > >>> characters are displayed. J displays a line at a time. Continuation
> > >>> bytes moved to the next line are not recognized as a continuation
> > >>> of a multi-byte UTF-8 character.
> > >>>
> > >>> This example has all kinds of problems. First, the glyphs are wider
> > >>> than other characters, so there is no way they can align with other
> > >>> fixed width characters.
> > >>> Second, their UTF-8 codes are 3 bytes, not
> > >>> supported by the new boxing algorithm in J.
> > >>>
> > >>> The following definition of s gives a more supported example to
> > >>> examine handling UTF-8 in J.
> > >>> )
> > >>>
> > >>>    ]s=:'Δ' , 'a' ,'Δ'
> > >>> ΔaΔ
> > >>>    $s
> > >>> 5
> > >>>    a.i.s
> > >>> 206 148 97 206 148
> > >>>    <s NB. Displays boxed nicely.
> > >>> ┌───┐
> > >>> │ΔaΔ│
> > >>> └───┘
> > >>>    8 6 $ s NB. But how about reshaping?
> > >>> ΔaΔÎ
> > >>> ”aΔΔ
> > >>> aΔΔa
> > >>> ΔΔaÎ
> > >>> ”ΔaΔ
> > >>> ΔaΔÎ
> > >>> ”aΔΔ
> > >>> aΔΔa
> > >>>
> > >>>    NB. Still have the problem of splitting start and continuation bytes.
> > >>>
> > >>>    NB. Say we want to display the 3 characters in a column.
> > >>>
> > >>>    ]s=: 'Δ' , 'a' ,: 'Δ'
> > >>> Δ
> > >>> aa
> > >>> Δ
> > >>>
> > >>> Note ''
> > >>>
> > >>> Where did the second "a" come from? To J the "Δ" is 2 characters.
> > >>> So the "a" must be expanded to match lengths. To get it to look
> > >>> right the "a" needs to be padded.
> > >>> )
> > >>>
> > >>>    ]s=: 'Δ' , 'a ' ,: 'Δ'
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>
> > >>>    <s NB. But now the boxed display isn't what we want.
> > >>> ┌──┐
> > >>> │Δ │
> > >>> │a │
> > >>> │Δ │
> > >>> └──┘
> > >>>
> > >>>    NB. Perhaps treating it as a vector works better.
> > >>>
> > >>>    ]s=: 'Δ' , LF, 'a' , LF, 'Δ'
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>    <s NB. But the line feeds are treated as blanks.
> > >>> ┌─────┐
> > >>> │Δ a Δ│
> > >>> └─────┘
> > >>>
> > >>>    ]s=: 'Δ' ; 'a' ; 'Δ' NB. Try boxing each character.
> > >>> ┌─┬─┬─┐
> > >>> │Δ│a│Δ│
> > >>> └─┴─┴─┘
> > >>>    > s
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>    <>s NB. Box looks OK but extra space.
> > >>> ┌──┐
> > >>> │Δ │
> > >>> │a │
> > >>> │Δ │
> > >>> └──┘
> > >>>    ,>s NB. Still got the extra space.
> > >>> Δa Δ
> > >>>    ,s,&><LF NB. Still not quite right.
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>
> > >>>    ,>s,<'α'
> > >>> Δa Δα
> > >>>
> > >>> Note ''
> > >>>
> > >>> But this requires a lot of special effort to handle UTF-8
> > >>> characters. And any definitions for handling ASCII text will
> > >>> probably need modifying to get them to work with UTF-8.
> > >>>
> > >>> It is my opinion that converting UTF-8 literal to unicode makes
> > >>> manipulating arrays of characters much easier. So I defined 4
> > >>> verbs to assist.
> > >>> )
> > >>>
> > >>> U  =:u:     NB. Convert code points to unicode.
> > >>> UCP=:3&u:   NB. Convert UTF-8 (char) or unicode to code points.
> > >>> UC =:7&u:   NB. Convert UTF-8 to unicode if necessary.
> > >>> UD =:8&u:"1 NB. Convert unicode to UTF-8 or char.
> > >>>
> > >>> Note ''
> > >>>
> > >>> Use U instead of a.&{ to convert numbers to text.
> > >>>
> > >>> Use UCP instead of a.&i. . It gives the same result as a.&i. for
> > >>> literals and gives code points for unicode.
> > >>>
> > >>> UD converts unicode to literal and any characters outside of ASCII
> > >>> to UTF-8. It can be used instead of a.&{ to convert. It is
> > >>> necessary to set it to rank 1 as 8&u: only works on vectors.
> > >>>
> > >>> So let's try s as unicode instead of as UTF-8.
> > >>> )
> > >>>
> > >>>    ]s=: UC 'ΔaΔ'
> > >>> ΔaΔ
> > >>>    $s
> > >>> 3
> > >>>    UCP s
> > >>> 916 97 916
> > >>>    <s
> > >>> ┌───┐
> > >>> │ΔaΔ│
> > >>> └───┘
> > >>>    ,.s
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>    <,.s
> > >>> ┌─┐
> > >>> │Δ│
> > >>> │a│
> > >>> │Δ│
> > >>> └─┘
> > >>>    <"0 s
> > >>> ┌─┬─┬─┐
> > >>> │Δ│a│Δ│
> > >>> └─┴─┴─┘
> > >>>    s,'Δ'
> > >>> ΔaΔΔ
> > >>>
> > >>> Note ''
> > >>>
> > >>> Oops! Got a problem.
> > >>>
> > >>> When J mixes nouns of different internal types it must convert
> > >>> them to the same type before processing. Like comparing 1 to
> > >>> 1.5-0.5 . Here J converts char to wide by putting a zero byte
> > >>> value in front of the char byte. This works fine for ASCII.
> > >>> But not if the noun includes any UTF-8 multi-byte codes.
> > >>>
> > >>> Here J treated the 2 byte Δ UTF-8 as two characters. Both being
> > >>> invalid UTF-8.
> > >>>
> > >>> One must make sure that unicode is never mixed with literal that
> > >>> may contain UTF-8 multi-byte characters.
> > >>> )
> > >>>
> > >>>    s,UC 'Δ'
> > >>> ΔaΔΔ
> > >>>
> > >>> Note 'Proposal'
> > >>>
> > >>> When a J primitive needs to convert literal (char) to unicode
> > >>> (wide), it should convert the char to unicode using the UTF-8
> > >>> conversion algorithm (7&u:) instead of adding a zero byte.
> > >>>
> > >>> This would give almost complete transparency mixing UTF-8 and
> > >>> unicode. I can't think of any case where one would not want to
> > >>> have UTF-8 converted to unicode when literal is mixed with
> > >>> unicode; however, if it is required 2&u: could be used.
> > >>>
> > >>> This should not cause any backward compatibility problems as it
> > >>> only changes how char is converted to wide by default, something
> > >>> I suspect no one currently uses. Not any other operation.
> > >>> )
> > >>>
> > >>> Note 'Implementation'
> > >>>
> > >>> I realize that it is very late in this development cycle of J.
> > >>> So this would probably not be done in it; however, I feel that
> > >>> this change to make UTF-8 and unicode more compatible would make
> > >>> it easier for people to use unicode, avoiding all the confusion
> > >>> of trying to do everything in UTF-8.
> > >>>
> > >>> This would affect the concatenation verbs (dyadic , ,. ,: and
> > >>> monadic ;), the comparison verbs (= and -:) and probably amend
> > >>> (}).
> > >>>
> > >>> Hopefully this is done in a single macro or subroutine used by
> > >>> these verbs. If this is the case, the change should not be too
> > >>> difficult.
> > >>>
> > >>> Perhaps later.
> > >>> )
> > >>> ----------------------------------------------------------------------
> > >>> For information about J forums see http://www.jsoftware.com/forums.htm