Re: [Jsource] Problems dealing with UTF-8

Don Guinn Sat, 09 Jul 2016 15:13:00 -0700

I would appreciate any suggestions. I tried working with UTF-8 but it
wasn't easy to work with arrays of UTF-8. Then tried using unicode. Much
easier. Except had to be really careful when mixing with UTF-8. Hence the
proposal.
On Jul 9, 2016 4:05 PM, "Raul Miller" <[email protected]> wrote:


> Without going into details, I will note that I would say a few things
> here slightly differently.
>
> That said, overall, you seem to be in the "right" ballpark.
>
> Thanks,
>
> --
> Raul
>
>
> On Sat, Jul 9, 2016 at 2:11 PM, Don Guinn <[email protected]> wrote:
> >    Note 'Observations on Unicode'
> >
> > There seems to be a lot of confusion in J around Unicode and UTF-8.
> > It seems that people are having a lot of trouble dealing with UTF-8
> > in J, particularly in arrays. Most of you are probably familiar with
> > this issue and all said here, but it bears repeating. I am posting
> > it to source rather than programming since it also includes a proposed
> > change to J and recent Unicode posts are from people also on source.
> >
> > Unicode defines a representation of characters as numbers or "code
> > points". Some confusion results in that the glyph (what a code point
> > looks like) may look like the glyph of another code point. Not really
> > an issue as far as J is concerned. In Unicode charts the code point
> > is normally written as U+HHHH, where HHHH is a number in
> > hexadecimal. Four digits show up in most tables.
> >
> > Unicode is divided into planes. Plane 0, Basic Multilingual Plane
> > (BMP), is the range U+0 through U+FFFF and is the part of Unicode
> > supported by Windows and Unix double-byte character set (DBCS). Other
> > planes of Unicode are not supported by Windows or Unix at this time.
> >
> > The code points are represented several ways. Windows and Unix use
> > 16 bit unsigned integers, double-byte-character-set (DBCS) or wide
> > characters in C. J type is 131072 for 3!:0 and names this “unicode”
> > (note the lower case "u"). This is not UTF-8. Windows and Unix also
> > represent Unicode with UTF-8. UTF-8 is the primary representation
> > for Unicode in J, an ingenious way to represent Unicode compatible
> > with ASCII. Also there is UTF-16 and others to represent code point
> > bits for Unicode. The Web has other ways to handle Unicode.
> >
> > The standard for UTF-8 defines one to four bytes to contain the code
> > point bits. The one byte codes correspond to the ASCII characters U+0
> > through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
> > beyond what Windows and Unix now support. A three-byte UTF-8 sequence
> > carries 16 code point bits, which fits nicely in a wide character.
> >
> > UTF-8 contains start bytes and continuation bytes. Start bytes with a
> > high order zero bit look exactly like standard ASCII. Start bytes of
> > “11xxxxxx” mark the starts of multi-byte codes. Bytes of “10xxxxxx”
> > are continuation bytes and must follow a start byte. The number of
> > continuation bytes is given in the start byte.
> > )
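
As a rough illustration of the byte layout described above, here is a minimal sketch in Python (used instead of J only so the bit operations are explicit). The names utf8_encode and byte_kind are made up for this example, and the sketch does not reject surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand to show the start/continuation layout."""
    if cp < 0x80:                       # 0xxxxxxx: identical to ASCII
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point above U+10FFFF")


def byte_kind(b: int) -> str:
    """Classify a byte by its high-order bits only (no validity check)."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    return "start"              # 11xxxxxx


# Delta is U+0394, so it needs a start byte plus one continuation byte.
print(list(utf8_encode(0x394)))                        # [206, 148]
print([byte_kind(b) for b in "ΔaΔ".encode("utf-8")])
```

The two values printed for Δ are the same 206 148 that a.i. reports for it in a J literal.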
> >
> >    Note 'Invalid byte sequences'
> >
> > It is an error if something happens to separate the start byte from
> > its continuation bytes.
> >
> >   An unexpected continuation byte.
> >   Continuation bytes must only follow start bytes.
> >
> >   A start byte not followed by enough continuation bytes.
> >   The start byte defines how many continuation bytes follow.
> >
> >   A sequence that decodes to a value greater than U+10FFFF.
> >   RFC 3629 limits UTF-8 to this maximum for compatibility with UTF-16.
> >
> >   An overlong encoding.
> >   Each continuation byte contains 6 code point bits. If the leading
> >   code point bits are all zero, so that the value would fit in fewer
> >   bytes, the shorter encoding must be used.
> >
> > The display of any UTF-8 characters failing the above tests varies
> > from system to system. The official position now is to display
> > � (U+FFFD). J often displays other characters.
> > )
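
The four error cases listed above can be tried out with a short Python sketch. The sample byte strings and the helper name is_valid_utf8 are chosen here for illustration; Python's built-in decoder is used as the reference, and other decoders may emit a different number of replacement characters for the same input.

```python
samples = {
    "unexpected continuation byte": b"\x94",
    "start byte with no continuation": b"\xce",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
    "overlong encoding of '/'": b"\xc0\xaf",
}


def is_valid_utf8(data: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for name, raw in samples.items():
    # errors="replace" substitutes U+FFFD for the bytes that cannot be decoded.
    shown = raw.decode("utf-8", errors="replace")
    print(f"{name}: valid={is_valid_utf8(raw)} shown={shown!r}")
```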
> >
> >    Note 'The Internet'
> >
> > The internet only supports the transmission of text as ASCII
> > characters, characters in the range 0 through 7f hex. And many
> > special characters are not allowed in text. Those characters not
> > allowed are sent in a few ways.
> >
> >   1. A byte is represented as two hexadecimal digits following an
> >      equal sign (=). For example: Blank is sent as =20 instead of
> >      as 32{a. .
> >
> >      The Unicode symbol α (U+3B1) is sent as "=ce=b1", the bytes
> >      are UTF-8, a start byte followed by a continuation byte in
> >      hexadecimal.
> >
> >   2. &#nnn; – where nnn is the decimal code point number. For
> >      example: &#916; will display as Δ.
> >
> >   3. &#xhhh; - same as 2. except the number is in hexadecimal.
> >
> > Unicode and UTF-8 cannot be sent directly. They must be converted to
> > ASCII as described above. A raw text file with a lot of characters
> > beyond ASCII can get very hard to view.
> > )
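
For anyone who wants to check these encodings by hand, here is a small sketch using Python's standard quopri and html modules. It assumes the "=XX" form is MIME quoted-printable and that the numeric forms are HTML character references, written in standard HTML syntax as &#916; and &#x394;.

```python
import html
import quopri

# Quoted-printable: each non-ASCII byte becomes "=" plus two hex digits.
alpha_bytes = "α".encode("utf-8")           # the two UTF-8 bytes of U+03B1
alpha_qp = quopri.encodestring(alpha_bytes)
print(alpha_qp)

# Decoding goes back to the UTF-8 bytes, which then decode to the character.
alpha = quopri.decodestring(b"=ce=b1").decode("utf-8")
print(alpha)

# HTML numeric character references, decimal and hexadecimal.
delta_dec = html.unescape("&#916;")
delta_hex = html.unescape("&#x394;")
print(delta_dec, delta_hex)
```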
> >
> >    Note 'Unicode and J'
> >
> > J primitives treat UTF-8 bytes as literal. They do not recognize
> > start and continuation bytes as UTF-8 making up characters. It is
> > up to the programmer to handle the multi-byte characters. Vectors
> > of UTF-8 characters display fine and are easy to work with. But
> > higher dimensions must be carefully managed. Take the problem
> > covered recently.
> > )
> >
> >    ]s=:  8 6 $ 'ఝ' ,'a','ఝ'
> > à° aà°
> > à° aà
> > ° à° a
> > ఝఝ
> > aà° à°
> > aà° à
> > ° aà°
> > à° aà°
> >
> >    Note ''
> >
> > When start bytes are separated from continuation bytes error
> > characters are displayed. J displays a line at a time. Continuation
> > bytes moved to the next line are not recognized as a continuation
> > of a multi-byte UTF-8 character.
> >
> > This example has all kinds of problems. First, the glyphs are wider
> > than other characters, so there is no way they can align with other
> > fixed width characters. Second, their UTF-8 codes are 3 bytes, not
> > supported by the new boxing algorithm in J.
> >
> > The following definition of s gives a more supported example to
> > examine handling UTF-8 in J.
> > )
> >
> >    ]s=:'Δ' , 'a' ,'Δ'
> > ΔaΔ
> >    $s
> > 5
> >    a.i.s
> > 206 148 97 206 148
> >    <s                NB. Displays boxed nicely.
> > ┌───┐
> > │ΔaΔ│
> > └───┘
> >    8 6 $ s           NB. But how about reshaping?
> > Î”aÎ”Î
> > ”aÎ”Î”
> > aΔΔa
> > Î”Î”aÎ
> > ”Î”aÎ”
> > Î”aÎ”Î
> > ”aÎ”Î”
> > aΔΔa
> >
> >    NB. Still have the problem of splitting start and continuation bytes.
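
The same splitting can be reproduced outside J. This Python sketch assumes a hypothetical helper, reshape, that fills a table by cycling through its items the way dyadic $ does. It then checks which 6-byte rows are still valid UTF-8, and repeats the reshape on code points instead of bytes.

```python
from itertools import cycle, islice


def reshape(items, rows, cols):
    """Fill a rows x cols table by cycling through items (like J's dyadic $)."""
    flat = list(islice(cycle(items), rows * cols))
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "ΔaΔ"

# Reshaping the 5 UTF-8 bytes: most rows cut a character in half.
byte_rows = [bytes(row) for row in reshape(text.encode("utf-8"), 8, 6)]
valid_rows = [i for i, row in enumerate(byte_rows) if is_valid_utf8(row)]
print(valid_rows)

# Reshaping the 3 code points: every row is intact.
char_rows = ["".join(row) for row in reshape(text, 8, 6)]
print(char_rows[0])
```

Only the rows that happen to start and end on a character boundary survive, which matches the two readable "aΔΔa" rows in the J display.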
> >
> >    NB. Say we want to display the 3 characters in a column.
> >
> >    ]s=: 'Δ' , 'a' ,: 'Δ'
> > Δ
> > aa
> > Δ
> >
> >    Note ''
> >
> > Where did the second "a" come from? To J the "Δ" is 2 characters.
> > So the "a" must be expanded to match lengths. To get it to look
> > right the "a" needs to be padded.
> > )
> >
> >    ]s=: 'Δ' , 'a ' ,: 'Δ'
> > Δ
> > a
> > Δ
> >
> >    <s           NB. But now the boxed display isn't what we want.
> > ┌──┐
> > │Δ │
> > │a │
> > │Δ │
> > └──┘
> >
> >    NB. Perhaps treating it as a vector works better.
> >
> >    ]s=: 'Δ' , LF,  'a' , LF, 'Δ'
> > Δ
> > a
> > Δ
> >    <s           NB. But the line feeds are treated as blanks.
> > ┌─────┐
> > │Δ a Δ│
> > └─────┘
> >
> >    ]s=: 'Δ' ; 'a' ; 'Δ' NB. Try boxing each character.
> > ┌─┬─┬─┐
> > │Δ│a│Δ│
> > └─┴─┴─┘
> >    >s
> > Δ
> > a
> > Δ
> >    <>s          NB. Box looks OK but extra space.
> > ┌──┐
> > │Δ │
> > │a │
> > │Δ │
> > └──┘
> >    ,>s          NB. Still got the extra space.
> > Δa Δ
> >    ,s,&><LF     NB. Still not quite right.
> > Δ
> > a
> >  Δ
> >
> >    ,>s,<'α'
> > Δa Δα
> >
> >    Note ''
> >
> > But this requires a lot of special effort to handle UTF-8
> > characters. And any definitions for handling ASCII text will
> > probably need modifying to get them to work with UTF-8.
> >
> > It is my opinion that converting UTF-8 literal to unicode makes
> > manipulating arrays of characters much easier. So I defined 4
> > verbs to assist.
> > )
> >
> >    U  =:u:       NB. Convert code points to unicode.
> >    UCP=:3&u:     NB. Convert UTF-8 (char) or unicode to code points.
> >    UC =:7&u:     NB. Convert UTF-8 to unicode if necessary.
> >    UD =:8&u:"1   NB. Convert unicode to UTF-8 or char.
> >
> >    Note ''
> >
> > Use U instead of {&a. to convert numbers to text.
> >
> > Use UCP instead of a.&i. . It gives the same result as a.&i. for
> > literals and gives code points for unicode.
> >
> > UD converts unicode to literal and any characters outside of ASCII
> > to UTF-8. It can be used instead of {&a. to convert. It is
> > necessary to set it to rank 1 as 8&u: only works on vectors.
> >
> > So let's try s as unicode instead of as UTF-8.
> > )
> >
> >    ]s=: UC 'ΔaΔ'
> > ΔaΔ
> >    $s
> > 3
> >    UCP s
> > 916 97 916
> >    <s
> > ┌───┐
> > │ΔaΔ│
> > └───┘
> >    ,.s
> > Δ
> > a
> > Δ
> >    <,.s
> > ┌─┐
> > │Δ│
> > │a│
> > │Δ│
> > └─┘
> >    <"0 s
> > ┌─┬─┬─┐
> > │Δ│a│Δ│
> > └─┴─┴─┘
> >    s,'Δ'
> > ΔaΔÎ”
> >
> >    Note ''
> >
> > Oops! Got a problem.
> >
> > When J mixes nouns of different internal types it must convert
> > them to the same type before processing. Like comparing 1 to
> > 1.5-0.5 . Here J converts char to wide by putting a zero byte
> > value in front of the char byte. This works fine for ASCII. But
> > not if the noun includes any UTF-8 multi-byte codes.
> >
> > Here J treated the 2 byte Δ UTF-8 as two characters. Both being
> > invalid UTF-8.
> >
> > One must make sure that unicode is never mixed with literal that
> > may contain UTF-8 multi-byte characters.
> > )
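
The difference between the two conversions can be seen in a few lines of Python. This is only an analogy and not how J is implemented: decoding bytes as Latin-1 stands in for putting a zero byte in front of each char, and decoding as UTF-8 stands in for 7&u: .

```python
wide = "ΔaΔ"                        # already code points, like UC 'ΔaΔ'
literal = "Δ".encode("utf-8")       # two bytes, like the J literal 'Δ'

# Zero-extending each byte gives one "character" per byte.
zero_extended = literal.decode("latin-1")
mixed_wrong = wide + zero_extended
print([ord(c) for c in mixed_wrong])    # [916, 97, 916, 206, 148]

# Decoding the bytes as UTF-8 first gives the intended single code point.
mixed_right = wide + literal.decode("utf-8")
print([ord(c) for c in mixed_right])    # [916, 97, 916, 916]
```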
> >
> >    s,UC 'Δ'
> > ΔaΔΔ
> >
> >    Note 'Proposal'
> >
> > When a J primitive needs to convert literal (char) to unicode
> > (wide), it should convert the char to unicode using the UTF-8
> > conversion algorithm (7&u:) instead of adding a zero byte.
> >
> > This would give almost complete transparency mixing UTF-8 and
> > unicode. I can't think of any case where one would not want to
> > have UTF-8 converted to unicode when literal is mixed with
> > unicode; however, if it is required 2&u: could be used.
> >
> > This should not cause any backward compatibility problems as it
> > only changes how char is converted to wide by default, something
> > I suspect no one currently relies on. No other operation changes.
> > )
> >
> >    Note 'Implementation'
> >
> > I realize that it is very late in this development cycle of J.
> > So this would probably not be done in it; however, I feel that
> > this change to make UTF-8 and unicode more compatible would make
> > it easier for people to use unicode avoiding all the confusion
> > trying to do everything in UTF-8.
> >
> > This would affect the concatenation verbs (dyadic , ,. ,: and
> > monadic ;), the comparison verbs (= and -:) and probably amend
> > (}).
> >
> > Hopefully this is done in a single macro or subroutine used by
> > these verbs. If this is the case, the change should be not too
> > difficult.
> >
> > Perhaps later.
> > )
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm