Re: [Jsource] Problems dealing with UTF-8

robert therriault Sat, 09 Jul 2016 16:21:49 -0700

Thanks Don, both for this encapsulation and the writing that you have done in 
the past on unicode in J. I am learning a lot.


Two things which are worth mentioning.

   utf8
8&u:
   ucp
7&u:
   uucp
u:@(7&u:)

are already defined in the standard library and may lead to some confusion with 
your definitions:

   UCP
3&u:
   UD   NB. I like your use of rank as that was an issue in my early explorations of unicode
8&u:"1
   UC
7&u:

and second, the issue remains of uneven boxing with some of the unicode 
characters (which I think is one of the prime motivators in this adventure).

    ]s=: ucp 'ΔaΔ'
ΔaΔ
   <s
┌───┐
│ΔaΔ│
└───┘
   ]t=: ucp 'Δaఝ' NB. last character wider
Δaఝ
   <t
┌───┐
│Δaఝ│
└───┘
   uwid=: ({.@:glqextent_jgl2_ @: u: @:  ":) "0   NB. glqextent_jgl2_ is available in the jqt ide
   uwid s
9 7 9
   uwid t
9 7 16
    JVERSION
Engine: j805/j64/darwin
Beta-9: commercial/2016-07-05T17:11:06
Library: 8.04.15
Qt IDE: 1.4.9/5.4.2
Platform: Darwin 64
Installer: J804 install
InstallPath: /users/bobtherriault/j64-804
Contact: www.jsoftware.com

Perhaps your solution of dealing primarily with unicode wide characters along 
with finding some way of having box sizing respond to the different character 
widths could be worth exploring?

Cheers, bob



> On Jul 9, 2016, at 3:12 PM, Don Guinn <dongu...@gmail.com> wrote:
> 
> I would appreciate any suggestions. I tried working with UTF-8 but it
> wasn't easy to work with arrays of UTF-8. Then tried using unicode. Much
> easier. Except had to be really careful when mixing with UTF-8. Hence the
> proposal.
> On Jul 9, 2016 4:05 PM, "Raul Miller" <rauldmil...@gmail.com> wrote:
> 
>> Without going into details, I will note that I would say a few things
>> here slightly differently.
>> 
>> That said, overall, you seem to be in the "right" ballpark.
>> 
>> Thanks,
>> 
>> --
>> Raul
>> 
>> 
>> On Sat, Jul 9, 2016 at 2:11 PM, Don Guinn <dongu...@gmail.com> wrote:
>>>   Note 'Observations on Unicode'
>>> 
>>> There seems to be a lot of confusion in J around Unicode and UTF-8.
>>> It seems that people are having a lot of trouble dealing with UTF-8
>>> in J, particularly in arrays. Most of you are probably familiar with
>>> this issue and all said here, but it bears repeating. I am posting
>>> it to source rather than programming since it also includes a proposed
>>> change to J and recent Unicode posts are from people also on source.
>>> 
>>> Unicode defines a representation of characters as numbers or "code
>>> points". Some confusion results in that the glyph (what a code point
>>> looks like) may look like the glyph of another code point. That is not
>>> really an issue as far as J is concerned. In Unicode charts the code
>>> point is normally written as U+HHHH, where HHHH is a number in
>>> hexadecimal. Four digits show up in most tables.
>>> 
>>> Unicode is divided into planes. Plane 0, Basic Multilingual Plane
>>> (BMP), is the range U+0 through U+FFFF and is the part of Unicode
>>> supported by Windows and Unix double-byte character set (DBCS). Other
>>> planes of Unicode are not supported by Windows or Unix at this time.
>>> 
>>> The code points are represented several ways. Windows and Unix use
>>> 16 bit unsigned integers, double-byte-character-set (DBCS) or wide
>>> characters in C. J reports this type as 131072 from 3!:0 and names it
>>> “unicode” (note the lower case "u"). This is not UTF-8. Windows and Unix
>>> also represent Unicode with UTF-8. UTF-8 is the primary representation
>>> for Unicode in J, an ingenious way to represent Unicode compatibly
>>> with ASCII. There are also UTF-16 and other encodings of the code point
>>> bits. The web has other ways to handle Unicode.
>>> 
>>> The standard for UTF-8 defines one to four bytes to contain the code
>>> point bits. The one byte codes correspond to the ASCII characters U+0
>>> through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
>>> beyond what Windows and Unix now support. Three-byte UTF-8 covers
>>> 16 code point bits, which fits nicely in wide characters.
>>> 
>>> UTF-8 contains start bytes and continuation bytes. Start bytes with a
>>> high order zero bit look exactly like standard ASCII. Start bytes of
>>> “11xxxxxx” mark the starts of multi-byte codes. Bytes of “10xxxxxx”
>>> are continuation bytes and must follow a start byte. The number of
>>> continuation bytes is given in the start byte.
>>> )
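
A minimal sketch of the start and continuation byte rule described in the note above, written as a short J script. The names b, top2, iscont, isstart, lead and nchars are made up for this illustration and are not from the original post. The byte values in the comments are the ones a. i. reports for 'ΔaΔ' later in this thread.

```j
NB. Sketch only: classify the bytes of a UTF-8 literal by their leading bits.
NB. All names here are hypothetical, chosen for the illustration.
b       =: a. i. 'ΔaΔ'                 NB. byte values 206 148 97 206 148
top2    =: <. b % 64                   NB. top two bits of each byte: 3 2 1 3 2
iscont  =: top2 = 2                    NB. 10xxxxxx, a continuation byte
isstart =: top2 = 3                    NB. 11xxxxxx, the start of a multi-byte code
lead    =: +/"1 *./\"1 (8#2) #: b      NB. count of leading 1 bits: 2 1 0 2 1
nchars  =: +/ -. iscont                NB. bytes that are not continuations: 3
```

A start byte with n leading 1 bits opens an n-byte sequence, so the 2 in lead says that each Δ takes one continuation byte. Counting the bytes that are not continuation bytes gives the number of characters, assuming the input is valid UTF-8.
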
>>> 
>>>   Note 'Invalid byte sequences'
>>> 
>>> It is an error if something happens to separate the start byte from
>>> its continuation bytes.
>>> 
>>>  An unexpected continuation byte.
>>>  Continuation bytes must only follow start bytes.
>>> 
>>>  A start byte not followed by enough continuation bytes.
>>>  The start byte defines how many continuation bytes follow.
>>> 
>>>  A sequence that decodes to a value greater than U+10FFFF.
>>>  RFC 3629 limits Unicode to this maximum for compatibility with UTF-16.
>>> 
>>>  An overlong encoding.
>>>  Each continuation byte contains 6 code point bits. If the leading
>>>  code point bits of a sequence are all zero, so that a shorter
>>>  sequence could hold the value, it should be shortened.
>>> 
>>> The display of any UTF-8 characters failing the above tests varies
>>> from system to system. The official position now is to display
>>> � (U+FFFD). J often displays other characters.
>>> )
>>> 
>>>   Note 'The Internet'
>>> 
>>> The internet only supports the transmission of text as ASCII
>>> characters, characters in the range 0 through 7f hex. And many
>>> special characters are not allowed in text. Those characters not
>>> allowed are sent in a few ways.
>>> 
>>>  1. A byte is represented as two hexadecimal digits following an
>>>     equal sign (=). For example: Blank is sent as =20 instead of
>>>     as 32{a. .
>>> 
>>>     The Unicode symbol α (U+3B1) is sent as "=ce=b1", the bytes
>>>     are UTF-8, a start byte followed by a continuation byte in
>>>     hexadecimal.
>>> 
>>>  2. &#nnn; – where nnn is the decimal code point number. For
>>>     example: &#916; will display as Δ.
>>> 
>>>  3. &#xhhh; - same as 2. except the number is in hexadecimal.
>>> 
>>> Unicode and UTF-8 cannot be sent directly. They must be converted to
>>> ASCII as described above. A raw text file with a lot of characters
>>> beyond ASCII can get very hard to view.
>>> )
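
A rough J sketch of the first two escape forms listed in the note above. The verbs hex2 and qp are hypothetical helpers written for this illustration, not standard library verbs. Unlike a real mail encoder, qp escapes every byte, including plain ASCII. The decimal form shown is the HTML numeric character reference, and the expression handles a single character only.

```j
NB. Sketch only: hex2 and qp are made-up names, not library verbs.
hex2 =: '0123456789abcdef' {~ 16 16 #: ]    NB. two hex digits per byte value
qp   =: 3 : ', ''='' ,. hex2 a. i. , y'     NB. =hh for every byte of a literal

qp 'α'                             NB. bytes 206 177 give =ce=b1
qp ' '                             NB. =20

'&#' , (": 3 u: 7 u: 'Δ') , ';'    NB. decimal code point form: &#916;
```

The first form works on UTF-8 bytes, while the second works on code points, which is why the literal is converted with 7 u: and 3 u: before formatting.
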
>>> 
>>>   Note 'Unicode and J'
>>> 
>>> J primitives treat UTF-8 bytes as literal. They do not recognize
>>> start and continuation bytes as UTF-8 making up characters. It is
>>> up to the programmer to handle the multi-byte characters. Vectors
>>> of UTF-8 characters display fine and are easy to work with. But
>>> higher dimensions must be carefully managed. Take the problem
>>> covered recently.
>>> )
>>> 
>>>   ]s=:  8 6 $ 'ఝ' ,'a','ఝ'
>>> à° aà°
>>> à° aà
>>> ° à° a
>>> ఝఝ
>>> aà° à°
>>> aà° à
>>> ° aà°
>>> à° aà°
>>> 
>>>   Note ''
>>> 
>>> When start bytes are separated from continuation bytes error
>>> characters are displayed. J displays a line at a time. Continuation
>>> bytes moved to the next line are not recognized as a continuation
>>> of a multi-byte UTF-8 character.
>>> 
>>> This example has all kinds of problems. First, the glyphs are wider
>>> than other characters, so there is no way they can align with other
>>> fixed width characters. Second, their UTF-8 codes are 3 bytes, not
>>> supported by the new boxing algorithm in J.
>>> 
>>> The following definition of s gives a better-supported example for
>>> examining how J handles UTF-8.
>>> )
>>> 
>>>   ]s=:'Δ' , 'a' ,'Δ'
>>> ΔaΔ
>>>   $s
>>> 5
>>>   a.i.s
>>> 206 148 97 206 148
>>>   <s                NB. Displays boxed nicely.
>>> ┌───┐
>>> │ΔaΔ│
>>> └───┘
>>>   8 6 $ s           NB. But how about reshaping?
>>> Î”aÎ”Î
>>> ”aÎ”Î”
>>> aΔΔa
>>> Î”Î”aÎ
>>> ”Î”aÎ”
>>> Î”aÎ”Î
>>> ”aÎ”Î”
>>> aΔΔa
>>> 
>>>   NB. Still have the problem of splitting start and continuation bytes.
>>> 
>>>   NB. Say we want to display the 3 characters in a column.
>>> 
>>>   ]s=: 'Δ' , 'a' ,: 'Δ'
>>> Δ
>>> aa
>>> Δ
>>> 
>>>   Note ''
>>> 
>>> Where did the second "a" come from? To J the "Δ" is 2 characters.
>>> So the "a" must be expanded to match lengths. To get it to look
>>> right the "a" needs to be padded.
>>> )
>>> 
>>>   ]s=: 'Δ' , 'a ' ,: 'Δ'
>>> Δ
>>> a
>>> Δ
>>> 
>>>   <s           NB. But now the boxed display isn't what we want.
>>> ┌──┐
>>> │Δ │
>>> │a │
>>> │Δ │
>>> └──┘
>>> 
>>>   NB. Perhaps treating it as a vector works better.
>>> 
>>>   ]s=: 'Δ' , LF,  'a' , LF, 'Δ'
>>> Δ
>>> a
>>> Δ
>>>   <s           NB. But the line feeds are treated as blanks.
>>> ┌─────┐
>>> │Δ a Δ│
>>> └─────┘
>>> 
>>>   ]s=: 'Δ' ; 'a' ; 'Δ' NB. Try boxing each character.
>>> ┌─┬─┬─┐
>>> │Δ│a│Δ│
>>> └─┴─┴─┘
>>>   >s
>>> Δ
>>> a
>>> Δ
>>>   <>s          NB. Box looks OK but extra space.
>>> ┌──┐
>>> │Δ │
>>> │a │
>>> │Δ │
>>> └──┘
>>>   ,>s          NB. Still got the extra space.
>>> Δa Δ
>>>   ,s,&><LF     NB. Still not quite right.
>>> Δ
>>> a
>>> Δ
>>> 
>>>   ,>s,<'α'
>>> Δa Δα
>>> 
>>>   Note ''
>>> 
>>> But this requires a lot of special effort to handle UTF-8
>>> characters. And any definitions for handling ASCII text will
>>> probably need modifying to get them to work with UTF-8.
>>> 
>>> It is my opinion that converting UTF-8 literal to unicode makes
>>> manipulating arrays of characters much easier. So I defined 4
>>> verbs to assist.
>>> )
>>> 
>>>   U  =:u:       NB. Convert code points to unicode.
>>>   UCP=:3&u:     NB. Convert UTF-8 (char) or unicode to code points.
>>>   UC =:7&u:     NB. Convert UTF-8 to unicode if necessary.
>>>   UD =:8&u:"1   NB. Convert unicode to UTF-8 or char.
>>> 
>>>   Note ''
>>> 
>>> Use U instead of a.&{ to convert numbers to text.
>>> 
>>> Use UCP instead of a.&i. . It gives the same result as a.&i. for
>>> literals and gives code points for unicode.
>>> 
>>> UD converts unicode to literal and any characters outside of ASCII
>>> to UTF-8. It can be used instead of a.&{ to convert. It is
>>> necessary to set it to rank 1 as 8&u: only works on vectors.
>>> 
>>> So let's try s as unicode instead of as UTF-8.
>>> )
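
A small round-trip sketch of the conversions described in the note above, written with the underlying u: primitives rather than the U, UCP, UC and UD names. The nouns lit, wide and cp are made up for this illustration, and the values in the comments are those already reported elsewhere in this thread.

```j
NB. Sketch only: lit, wide and cp are hypothetical names.
lit  =: 'ΔaΔ'            NB. UTF-8 literal, 5 bytes
wide =: 7 u: lit         NB. unicode (wide), 3 characters
3!:0 lit                 NB. 2, the literal type
3!:0 wide                NB. 131072, the unicode type
cp   =: 3 u: wide        NB. code points 916 97 916
lit -: 8 u: u: cp        NB. 1: code points to unicode to UTF-8 restores the bytes
```

The last line is the useful check: going from code points back through unicode to UTF-8 should reproduce the original literal exactly.
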
>>> 
>>>   ]s=: UC 'ΔaΔ'
>>> ΔaΔ
>>>   $s
>>> 3
>>>   UCP s
>>> 916 97 916
>>>   <s
>>> ┌───┐
>>> │ΔaΔ│
>>> └───┘
>>>   ,.s
>>> Δ
>>> a
>>> Δ
>>>   <,.s
>>> ┌─┐
>>> │Δ│
>>> │a│
>>> │Δ│
>>> └─┘
>>>   <"0 s
>>> ┌─┬─┬─┐
>>> │Δ│a│Δ│
>>> └─┴─┴─┘
>>>   s,'Δ'
>>> ΔaΔÎ”
>>> 
>>>   Note ''
>>> 
>>> Oops! Got a problem.
>>> 
>>> When J mixes nouns of different internal types it must convert
>>> them to the same type before processing. Like comparing 1 to
>>> 1.5-0.5 . Here J converts char to wide by putting a zero byte
>>> value in front of the char byte. This works fine for ASCII. But
>>> not if the noun includes any UTF-8 multi-byte codes.
>>> 
>>> Here J treated the 2 byte Δ UTF-8 as two characters. Both being
>>> invalid UTF-8.
>>> 
>>> One must make sure that unicode is never mixed with literal that
>>> may contain UTF-8 multi-byte characters.
>>> )
>>> 
>>>   s,UC 'Δ'
>>> ΔaΔΔ
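
Until something like the proposal below exists, one way to guard against the mixing problem is a small wrapper, sketched here. The name uappend is hypothetical and not a standard library verb. It assumes, as the UC definition above does, that 7&u: leaves unicode arguments alone and converts UTF-8 literal only when necessary.

```j
NB. Sketch only: uappend is a made-up name, not a library verb.
NB. It applies 7&u: to both arguments before appending, so a UTF-8
NB. literal is decoded instead of being widened byte by byte.
uappend =: ,&(7&u:)

s =: 7 u: 'ΔaΔ'
t =: s uappend 'Δ'
# t                      NB. 4 characters
3 u: t                   NB. code points 916 97 916 916
```

The same pattern could wrap the other verbs listed under 'Implementation', but each call site still has to remember to use the wrapper, which is the burden the proposal would remove.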
>>> 
>>>   Note 'Proposal'
>>> 
>>> When a J primitive needs to convert literal (char) to unicode
>>> (wide), it should convert the char to unicode using the UTF-8
>>> conversion algorithm (7&u:) instead of adding a zero byte.
>>> 
>>> This would give almost complete transparency mixing UTF-8 and
>>> unicode. I can't think of any case where one would not want to
>>> have UTF-8 converted to unicode when literal is mixed with
>>> unicode; however, if it is required 2&u: could be used.
>>> 
>>> This should not cause any backward compatibility problems as it
>>> only changes how char is converted to wide by default, something
>>> I suspect no one currently relies on. No other operation changes.
>>> )
>>> 
>>>   Note 'Implementation'
>>> 
>>> I realize that it is very late in this development cycle of J.
>>> So this would probably not be done in it; however, I feel that
>>> this change to make UTF-8 and unicode more compatible would make
>>> it easier for people to use unicode avoiding all the confusion
>>> trying to do everything in UTF-8.
>>> 
>>> This would affect the concatenation verbs (dyadic , ,. ,: and
>>> monadic ;), the comparison verbs (= and -:) and probably amend
>>> (}).
>>> 
>>> Hopefully this is done in a single macro or subroutine used by
>>> these verbs. If this is the case, the change should be not too
>>> difficult.
>>> 
>>> Perhaps later.
>>> )
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm