Without going into details, I will note that I would say a few things here slightly differently.
That said, overall, you seem to be in the "right" ballpark.

Thanks,

--
Raul

On Sat, Jul 9, 2016 at 2:11 PM, Don Guinn <dongu...@gmail.com> wrote:
> Note 'Observations on Unicode'
>
> There seems to be a lot of confusion in J around Unicode and UTF-8.
> People seem to have a lot of trouble dealing with UTF-8 in J,
> particularly in arrays. Most of you are probably familiar with this
> issue and everything said here, but it bears repeating. I am posting
> it to source rather than programming since it also includes a
> proposed change to J, and recent Unicode posts are from people who
> are also on source.
>
> Unicode defines a representation of characters as numbers, or "code
> points". Some confusion results because the glyph (what a code point
> looks like) of one code point may look like the glyph of another.
> That is not really an issue as far as J is concerned. In Unicode
> charts a code point is normally written as U+HHHH, where HHHH is a
> number in hexadecimal. Four digits show up in most tables.
>
> Unicode is divided into planes. Plane 0, the Basic Multilingual
> Plane (BMP), is the range U+0 through U+FFFF and is the part of
> Unicode supported by the Windows and Unix double-byte character set
> (DBCS). Other planes of Unicode are not supported by Windows or Unix
> at this time.
>
> Code points are represented in several ways. Windows and Unix use
> 16-bit unsigned integers: the double-byte character set (DBCS), or
> wide characters in C. The corresponding J type is 131072 for 3!:0,
> named "unicode" (note the lower case "u"). This is not UTF-8.
> Windows and Unix also represent Unicode with UTF-8. UTF-8 is the
> primary representation for Unicode in J, an ingenious way to
> represent Unicode compatibly with ASCII. There are also UTF-16 and
> other encodings of the code point bits, and the web has still other
> ways to handle Unicode.
>
> The UTF-8 standard uses one to four bytes to contain the code point
> bits. The one-byte codes correspond to the ASCII characters U+0
> through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
> beyond what Windows and Unix now support. A three-byte UTF-8
> sequence covers 16 code point bits, which fits wide characters
> nicely.
>
> UTF-8 contains start bytes and continuation bytes. Start bytes with
> a high-order zero bit look exactly like standard ASCII. Start bytes
> of the form "11xxxxxx" mark the starts of multi-byte codes. Bytes of
> the form "10xxxxxx" are continuation bytes and must follow a start
> byte. The number of continuation bytes is given in the start byte.
> )
>
> Note 'Invalid byte sequences'
>
> It is an error if something happens to separate a start byte from
> its continuation bytes. The invalid cases are:
>
> An unexpected continuation byte.
> Continuation bytes must only follow start bytes.
>
> A start byte not followed by enough continuation bytes.
> The start byte defines how many continuation bytes follow.
>
> A sequence that decodes to a value greater than U+10FFFF.
> RFC 3629 limits Unicode to this maximum for compatibility with
> UTF-16.
>
> An overlong encoding.
> Each continuation byte contains 6 code point bits. A code point
> must be encoded in the fewest bytes that will hold it; a longer
> encoding of the same value is invalid.
>
> How UTF-8 characters failing the above tests display varies from
> system to system. The official position now is to display the
> replacement character � (U+FFFD). J often displays other characters.
> )
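> The start/continuation structure is easy to inspect from inside J
> using only primitives. For example, here is one way to classify each
> byte of a literal as ASCII (0), continuation (1) or start byte (2);
> the verb cls is just a sketch, not a library verb, and the sample
> characters are arbitrary:
>
>    cls =: 3 : '(128 <: b) + 192 <: b =. a. i. y'
>    cls 'Δaα'       NB. start, continuation, ASCII, start, continuation
> 2 1 0 2 1
>    #: a. i. 'Δ'    NB. in binary: 110xxxxx start, 10xxxxxx continuation
> 1 1 0 0 1 1 1 0
> 1 0 0 1 0 1 0 0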
> Note 'The Internet'
>
> The internet's older text protocols, mail in particular, only
> support the transmission of text as ASCII characters, characters in
> the range 0 through 7F hex. And many special characters are not
> allowed in text. Those characters not allowed are sent in a few
> ways:
>
> 1. A byte is represented as two hexadecimal digits following an
>    equal sign (=), the quoted-printable encoding. For example:
>    blank is sent as =20 instead of as 32{a. .
>
>    The Unicode symbol α (U+3B1) is sent as "=ce=b1"; the bytes are
>    UTF-8, a start byte followed by a continuation byte, in
>    hexadecimal.
>
> 2. &#nnn; - where nnn is the decimal code point number. For
>    example: &#916; will display as Δ.
>
> 3. &#xhhh; - same as 2. except the number is in hexadecimal.
>
> Unicode and UTF-8 cannot be sent directly. They must be converted
> to ASCII as described above. A raw text file with a lot of
> characters beyond ASCII can get very hard to view.
> )
>
> Note 'Unicode and J'
>
> J primitives treat UTF-8 bytes as plain literals. They do not
> recognize start and continuation bytes as parts of multi-byte
> characters. It is up to the programmer to handle the multi-byte
> characters. Vectors of UTF-8 characters display fine and are easy
> to work with. But higher dimensions must be carefully managed. Take
> the problem covered recently.
> )
>
>    ]s=: 8 6 $ 'ఝ' ,'a','ఝ'
> à° aà°
> à° aà
> ° à° a
> ఝఝ
> aà° à°
> aà° à
> ° aà°
> à° aà°
>
> Note ''
>
> When start bytes are separated from their continuation bytes, error
> characters are displayed. J displays a line at a time. Continuation
> bytes moved to the next line are not recognized as a continuation
> of a multi-byte UTF-8 character.
>
> This example has all kinds of problems. First, the glyphs are wider
> than other characters, so there is no way they can align with other
> fixed-width characters. Second, their UTF-8 codes are 3 bytes, not
> supported by the new boxing algorithm in J.
>
> The following definition of s gives a better-supported example for
> examining the handling of UTF-8 in J.
> )
>
>    ]s=:'Δ' , 'a' ,'Δ'
> ΔaΔ
>    $s
> 5
>    a.i.s
> 206 148 97 206 148
>    <s              NB. Displays boxed nicely.
> ┌───┐
> │ΔaΔ│
> └───┘
>    8 6 $ s         NB. But how about reshaping?
> ΔaΔÎ
> ”aΔΔ
> aΔΔa
> ΔΔaÎ
> ”ΔaΔ
> ΔaΔÎ
> ”aΔΔ
> aΔΔa
>
> NB. We still have the problem of splitting start and continuation
> NB. bytes.
>
> NB. Say we want to display the 3 characters in a column.
>
>    ]s=: 'Δ' , 'a' ,: 'Δ'
> Δ
> aa
> Δ
>
> Note ''
>
> Where did the second "a" come from? To J the "Δ" is 2 characters,
> so the scalar "a" is replicated to match lengths. To get it to look
> right the "a" needs to be padded instead.
> )
>
>    ]s=: 'Δ' , 'a ' ,: 'Δ'
> Δ
> a
> Δ
>
>    <s              NB. But now the boxed display isn't what we want.
> ┌──┐
> │Δ │
> │a │
> │Δ │
> └──┘
>
> NB. Perhaps treating it as a vector works better.
>
>    ]s=: 'Δ' , LF, 'a' , LF, 'Δ'
> Δ
> a
> Δ
>    <s              NB. But the line feeds are treated as blanks.
> ┌─────┐
> │Δ a Δ│
> └─────┘
>
>    ]s=: 'Δ' ; 'a' ; 'Δ'   NB. Try boxing each character.
> ┌─┬─┬─┐
> │Δ│a│Δ│
> └─┴─┴─┘
>    >s
> Δ
> a
> Δ
>    <>s             NB. Box looks OK but extra space.
> ┌──┐
> │Δ │
> │a │
> │Δ │
> └──┘
>    ,>s             NB. Still got the extra space.
> Δa Δ
>    ,s,&><LF        NB. Still not quite right.
> Δ
> a
> Δ
>
>    ,>s,<'α'
> Δa Δα
>
> Note ''
>
> This requires a lot of special effort to handle UTF-8 characters.
> And any definitions written for handling ASCII text will probably
> need modifying to get them to work with UTF-8.
>
> It is my opinion that converting UTF-8 literals to unicode makes
> manipulating arrays of characters much easier. So I defined 4 verbs
> to assist.
> )
>
> U  =: u:        NB. Convert code points to unicode.
> UCP=: 3&u:      NB. Convert UTF-8 (char) or unicode to code points.
> UC =: 7&u:      NB. Convert UTF-8 to unicode if necessary.
> UD =: 8&u:"1    NB. Convert unicode to UTF-8 or char.
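> As a quick check of these covers (the example values here are
> arbitrary), UCP matches a.&i. on a literal but gives true code
> points once UC has widened the text, and UD converts back to UTF-8:
>
>    UCP 'aΔ'        NB. literal: the raw UTF-8 bytes
> 97 206 148
>    UCP UC 'aΔ'     NB. unicode: the code points
> 97 916
>    UD U 945 946    NB. code points -> unicode -> UTF-8 literal
> αβ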
> Note ''
>
> Use U instead of a.&{ to convert numbers to text.
>
> Use UCP instead of a.&i. . It gives the same result as a.&i. for
> literals and gives code points for unicode.
>
> UD converts unicode to literal, turning any characters outside of
> ASCII into UTF-8. It can be used instead of a.&{ to convert. It is
> necessary to set its rank to 1, as 8&u: only works on vectors.
>
> So let's try s as unicode instead of as UTF-8.
> )
>
>    ]s=: UC 'ΔaΔ'
> ΔaΔ
>    $s
> 3
>    UCP s
> 916 97 916
>    <s
> ┌───┐
> │ΔaΔ│
> └───┘
>    ,.s
> Δ
> a
> Δ
>    <,.s
> ┌─┐
> │Δ│
> │a│
> │Δ│
> └─┘
>    <"0 s
> ┌─┬─┬─┐
> │Δ│a│Δ│
> └─┴─┴─┘
>    s,'Δ'
> ΔaΔÎ”
>
> Note ''
>
> Oops! We have a problem.
>
> When J mixes nouns of different internal types it must convert them
> to the same type before processing, like comparing 1 to 1.5-0.5 .
> Here J converts char to wide by putting a zero byte in front of
> each char byte. This works fine for ASCII, but not if the noun
> includes any UTF-8 multi-byte codes.
>
> Here J treated the 2-byte UTF-8 Δ as two separate characters,
> neither of which is the intended code point.
>
> One must make sure that unicode is never mixed with literal that
> may contain UTF-8 multi-byte characters.
> )
>
>    s,UC 'Δ'
> ΔaΔΔ
>
> Note 'Proposal'
>
> When J primitives need to convert literal (char) to unicode (wide),
> they should convert the char using the UTF-8 conversion algorithm
> (7&u:) instead of adding a zero byte.
>
> This would give almost complete transparency when mixing UTF-8 and
> unicode. I can't think of any case where one would not want UTF-8
> converted to unicode when literal is mixed with unicode; however,
> if the plain byte-wise conversion is required, 2&u: could be used.
>
> This should not cause any backward compatibility problems, as it
> only changes how char is converted to wide by default, something I
> suspect no one currently relies on. No other operation changes.
> )
>
> Note 'Implementation'
>
> I realize that it is very late in this development cycle of J, so
> this would probably not be done in it; however, I feel that this
> change to make UTF-8 and unicode more compatible would make it
> easier for people to use unicode, avoiding all the confusion of
> trying to do everything in UTF-8.
>
> This would affect the concatenation verbs (dyadic , ,. ,: and
> monadic ;), the comparison verbs (= and -:) and probably amend (}).
>
> Hopefully the conversion is done in a single macro or subroutine
> used by these verbs. If that is the case, the change should not be
> too difficult.
>
> Perhaps later.
> )
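> Until something like this is adopted, a small cover verb gives the
> safe behavior today. This is only a sketch: the name jcat is
> arbitrary, and it assumes the UC cover defined above.
>
>    jcat =: UC@[ , UC@]    NB. widen both arguments before joining
>    (UC 'ΔaΔ') jcat 'Δ'
> ΔaΔΔ
>    'abc' jcat 'Δ'         NB. all-ASCII literal widens harmlessly
> abcΔ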