Note 'Observations on Unicode'
There seems to be a lot of confusion around Unicode and UTF-8 in
J; people are having a lot of trouble dealing with UTF-8,
particularly in arrays. Most of you are probably familiar with
this issue and all that is said here, but it bears repeating. I am
posting it to source rather than programming since it also includes
a proposed change to J, and recent Unicode posts are from people
also on source.
Unicode defines a representation of characters as numbers or "code
points". Some confusion results because the glyph of one code point
(what it looks like) may look like the glyph of another code point;
not really an issue as far as J is concerned. In Unicode charts the
code point is normally written as U+HHHH, where HHHH is a number in
hexadecimal. Four digits suffice for most tables.
Unicode is divided into planes. Plane 0, the Basic Multilingual
Plane (BMP), is the range U+0 through U+FFFF and is the part of
Unicode supported by the Windows and Unix double-byte character set
(DBCS). Other planes of Unicode are not supported by Windows or
Unix at this time.
Code points are represented in several ways. Windows and Unix use
16 bit unsigned integers, the double-byte character set (DBCS), or
wide characters in C. In J this is the type 131072 for 3!:0, named
"unicode" (note the lower case "u"). This is not UTF-8. Windows and
Unix also represent Unicode with UTF-8, an ingenious way to
represent Unicode that is compatible with ASCII; UTF-8 is the
primary representation for Unicode in J. There are also UTF-16 and
other encodings of the code point bits. The web has still other
ways to handle Unicode.
The standard for UTF-8 defines one to four bytes to contain the code
point bits. The one byte codes correspond to the ASCII characters U+0
through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
beyond what Windows and Unix now support. Three-byte UTF-8 covers
16 code point bits, which fits wide characters nicely.
UTF-8 contains start bytes and continuation bytes. Start bytes with a
high order zero bit look exactly like standard ASCII. Start bytes of
“11xxxxxx” mark the starts of multi-byte codes. Bytes of “10xxxxxx”
are continuation bytes and must follow a start byte. The number of
continuation bytes is given in the start byte.
)
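NB. A minimal illustration, assuming this file is UTF-8 encoded:
NB. the 2-byte encoding of Δ (U+394) is a start byte followed by
NB. one continuation byte (session output shown below each line).
a. i. 'Δ'
206 148
#: 206 148 NB. In binary: 110xxxxx start, 10xxxxxx continuation.
1 1 0 0 1 1 1 0
1 0 0 1 0 1 0 0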
Note 'Invalid byte sequences'
It is an error if something happens to separate a start byte from
its continuation bytes. The invalid byte sequences are:
1. An unexpected continuation byte. Continuation bytes must only
   follow start bytes.
2. A start byte not followed by enough continuation bytes. The
   start byte defines how many continuation bytes must follow.
3. A sequence that decodes to a value greater than U+10FFFF.
   RFC 3629 limits Unicode to this maximum for compatibility
   with UTF-16.
4. An overlong encoding. Each continuation byte contains 6 code
   point bits, and a code point must be encoded in the fewest
   bytes possible. For example, encoding U+2F as the two bytes
   C0 AF instead of the single byte 2F is overlong.
The display of a UTF-8 sequence failing the above tests varies
from system to system. The official position now is to display
� (U+FFFD). J often displays other characters. A small
demonstration follows this note.
)
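NB. A small demonstration: reshaping splits the start byte of Δ
NB. from its continuation byte, producing cases 2 and 1 above.
NB. How the two rows display varies by system.
a. i. 2 1 $ 'Δ'
206
148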
Note 'The Internet'
The internet traditionally supports the transmission of text only
as ASCII characters, characters in the range 0 through 7F hex, and
many special characters are not allowed in text. The characters
not allowed are sent in a few ways.
1. A byte is represented as two hexadecimal digits following an
   equal sign (=), the quoted-printable encoding. For example:
   blank is sent as =20 instead of as 32{a. .
   The Unicode symbol α (U+3B1) is sent as "=ce=b1"; the bytes
   are UTF-8, a start byte followed by a continuation byte, in
   hexadecimal.
2. &#nnn; - where nnn is the decimal code point number. For
   example: &#916; will display as Δ.
3. &#xhhh; - same as 2. except the number hhh is in hexadecimal:
   &#x394; also displays as Δ.
Unicode and UTF-8 cannot be sent directly; they must be converted
to ASCII as described above. A raw text file with a lot of
characters beyond ASCII can get very hard to view. A sketch of
building form 2 in J follows this note.
)
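NB. A sketch of building form 2 in J: convert Δ to unicode (7&u:),
NB. take its code point (3&u:), format it (":), and wrap it.
'&#' , (": 3&u: 7&u: 'Δ') , ';'
&#916;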
Note 'Unicode and J'
J primitives treat UTF-8 bytes as literal; they do not recognize
start and continuation bytes as parts of multi-byte UTF-8
characters. It is up to the programmer to handle the multi-byte
characters. Vectors of UTF-8 characters display fine and are easy
to work with, but higher dimensions must be carefully managed.
Take the problem covered recently on the forum.
)
]s=: 8 6 $ 'ఝ' ,'a','ఝ'
à° aà°
à° aà
° à° a
ఝఝ
aà° à°
aà° à
° aà°
à° aà°
Note ''
When start bytes are separated from their continuation bytes,
error characters are displayed. J displays a line at a time;
continuation bytes moved to the next line are not recognized as
a continuation of a multi-byte UTF-8 character.
This example has all kinds of problems. First, the glyphs are wider
than other characters, so there is no way they can align with other
fixed width characters. Second, their UTF-8 codes are 3 bytes,
which is not supported by the new boxing algorithm in J.
The following definition of s gives a better supported example for
examining the handling of UTF-8 in J.
)
]s=:'Δ' , 'a' ,'Δ'
ΔaΔ
$s
5
a.i.s
206 148 97 206 148
<s NB. Displays boxed nicely.
┌───┐
│ΔaΔ│
└───┘
8 6 $ s NB. But how about reshaping?
ΔaΔÎ
”aΔΔ
aΔΔa
ΔΔaÎ
”ΔaΔ
ΔaΔÎ
”aΔΔ
aΔΔa
NB. Still have the problem of splitting start and continuation bytes.
NB. Say we want to display the 3 characters in a column.
]s=: 'Δ' , 'a' ,: 'Δ'
Δ
aa
Δ
Note ''
Where did the second "a" come from? To J the "Δ" is 2 characters,
so the atom "a" is replicated to match that length. To get it to
look right the "a" needs to be padded with a blank instead.
)
]s=: 'Δ' , 'a ' ,: 'Δ'
Δ
a
Δ
<s NB. But now the boxed display isn't what we want.
┌──┐
│Δ │
│a │
│Δ │
└──┘
NB. Perhaps treating it as a vector works better.
]s=: 'Δ' , LF, 'a' , LF, 'Δ'
Δ
a
Δ
<s NB. But the line feeds are treated as blanks.
┌─────┐
│Δ a Δ│
└─────┘
]s=: 'Δ' ; 'a' ; 'Δ' NB. Try boxing each character.
┌─┬─┬─┐
│Δ│a│Δ│
└─┴─┴─┘
>s
Δ
a
Δ
<>s NB. Box looks OK but extra space.
┌──┐
│Δ │
│a │
│Δ │
└──┘
,>s NB. Still got the extra space.
Δa Δ
,s,&><LF NB. Still not quite right.
Δ
a
Δ
,>s,<'α'
Δa Δα
Note ''
But this requires a lot of special effort to handle UTF-8
characters, and any definitions for handling ASCII text will
probably need modification to work with UTF-8.
It is my opinion that converting UTF-8 literal to unicode makes
manipulating arrays of characters much easier, so I defined 4
verbs to assist.
)
U =:u: NB. Convert code points to unicode.
UCP=:3&u: NB. Convert UTF-8 (char) or unicode to code points.
UC =:7&u: NB. Convert UTF-8 to unicode if necessary.
UD =:8&u:"1 NB. Convert unicode to UTF-8 or char.
Note ''
Use U instead of a.&{ to convert numbers to text.
Use UCP instead of a.&i. ; it gives the same result as a.&i. for
literals and gives code points for unicode.
UD converts unicode to literal, turning any characters outside of
ASCII into UTF-8; it can be used instead of a.&{ . It is necessary
to set its rank to 1 as 8&u: only works on vectors.
So let's try s as unicode instead of as UTF-8.
)
]s=: UC 'ΔaΔ'
ΔaΔ
$s
3
UCP s
916 97 916
<s
┌───┐
│ΔaΔ│
└───┘
,.s
Δ
a
Δ
<,.s
┌─┐
│Δ│
│a│
│Δ│
└─┘
<"0 s
┌─┬─┬─┐
│Δ│a│Δ│
└─┴─┴─┘
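UD s NB. For completeness: UD round-trips back to UTF-8 literal.
ΔaΔ
a. i. UD s NB. The original 5 UTF-8 bytes.
206 148 97 206 148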
s,'Δ'
ΔaΔÎ”
Note ''
Oops! Got a problem.
When J mixes nouns of different internal types it must convert
them to the same type before processing, like comparing 1 to
1.5-0.5 . Here J converts char to wide by putting a zero byte
in front of each char byte. This works fine for ASCII, but not
if the noun includes any UTF-8 multi-byte codes.
Here J treated the 2-byte UTF-8 Δ as two separate characters,
both invalid.
One must make sure that unicode is never mixed with literal that
may contain UTF-8 multi-byte characters.
)
s,UC 'Δ'
ΔaΔΔ
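NB. The same pitfall applies to comparison: the literal Δ widens
NB. to two characters, so it does not match its unicode form.
'Δ' -: UC 'Δ'
0
(UC 'Δ') -: UC 'Δ'
1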
Note 'Proposal'
When J primitives need to convert literal (char) to unicode
(wide), they should convert the char using the UTF-8 conversion
algorithm (7&u:) instead of adding a zero byte.
This would give almost complete transparency when mixing UTF-8
and unicode. I can't think of any case where one would not want
UTF-8 converted to unicode when literal is mixed with unicode;
however, if it is required, 2&u: could be used.
This should not cause any backward compatibility problems as it
only changes how char is converted to wide by default, something
I suspect no one currently relies on; no other operation changes.
A user-level sketch of this behavior follows this note.
)
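NB. A user-level sketch of the proposed behavior, usable today
NB. (the name uappend is mine, not part of J): promote both
NB. arguments with 7&u: before appending.
uappend=: 7&u:@[ , 7&u:@]
'ΔaΔ' uappend 'Δ' NB. Mixed or literal arguments now append safely.
ΔaΔΔ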
Note 'Implementation'
I realize that it is very late in this development cycle of J,
so this would probably not be done in it; however, I feel that
this change to make UTF-8 and unicode more compatible would make
it easier for people to use unicode, avoiding all the confusion
of trying to do everything in UTF-8.
This would affect the concatenation verbs (dyadic , ,. ,: and
monadic ;), the comparison verbs (= and -:) and probably amend
(}).
Hopefully this is done in a single macro or subroutine used by
these verbs. If that is the case, the change should not be too
difficult.
Perhaps later.
)