Re: [Jsource] Problems dealing with UTF-8

bill lam Sun, 10 Jul 2016 15:44:53 -0700

I consider myself a frequent user of wide/UTF-8 unicode, and yet I
prefer doing the proper conversion myself instead of relying on automatic
conversion by J.


Using UTF-8 can be confusing, but the best remedy I can suggest is to
gain more experience with it, not to rely on J to convert literal to wide
automatically.

J converts literal to wide for external interfaces such as the 1!:x family
and 15!:0 under Windows.

A recent beta for unicode display made an assumption similar to yours: it
converts all one-byte literals to wide regardless of whether they are UTF-8
or not. This leads to the change in behavior that Bob demonstrated, where an
illegal symbol becomes another character. I would like to have this fixed
if possible.
On Jul 11, 2016 6:14 AM, "Don Guinn" <dongu...@gmail.com> wrote:

> Thanks for your response Bill. I certainly have no problem any more
> controlling the way literal converts to unicode; however, I would like you
> to consider the following.
>
> literal is a problem as it has many inconsistent uses. The ASCII part is
> not a problem as it is the same whether literal (char) is interpreted as
> extended ASCII or UTF-8. Char is also used as numeric in many places, like
> image files.
>
> I was not thinking that the errors were in J, but in my failure to ensure
> that UTF-8 codes had been converted to unicode (wide) before catenating
> them, or otherwise combining them, with wide. When support for Unicode was
> added to J, the decision to maintain compatibility with extended ASCII was
> correct. But things have changed. Now I doubt that anyone uses extended
> ASCII. Any programs using extended ASCII are probably obsolete or have been
> converted to use UTF-8.
>
> Like dropping the dot after x and y, it makes more sense now to assume that
> char may contain UTF-8 when treated as text.
>
> I am not suggesting any change in the way char is handled except when
> combining with wide. So programs not using wide would not be affected. Wide
> is different from char as it is only Unicode. It has no other use. So any
> time wide and char are mixed, the char bytes must be Unicode code points. So
> I looked at what U+80 through U+FF are: some control codes, which I don't
> understand, and the Latin-1 Supplement. There are many useful symbols in
> this range. But how would they be entered?
>
> The only way to enter them as extended ASCII is by indexing into a. . Say I
> want to assign some name a value with the £ symbol.
>
>    (163{a.),":1234
> �1234
>    (2 u: 163{a.),":1234
> £1234
>    (4 u: 163),":1234
> £1234
>    ]Value=:'£',":1234
> £1234
>    a.i.Value
> 194 163 49 50 51 52
>    3 u: 7 u: Value
> 163 49 50 51 52
>
> OK. I cheated. I can't directly enter £ on my keyboard, but the
> British can on theirs. But notice that J did not treat the 163{a. as
> £. When '£' was entered directly, J stored its UTF-8 representation
> (194 163). The same thing would happen for any of the Latin-1 Supplement
> characters. They would be in UTF-8 multi-byte.
>
> Anything retrieved from the web assumes that characters coming in can
> include UTF-8 or be wide. J assumes that char is UTF-8 for both entry and
> display. It actually treats extended ASCII as invalid UTF-8.
>
> Things get messy when trying to manipulate UTF-8 in J, particularly in
> arrays. People seem to avoid wide because of the care that must be
> taken when using it, then end up struggling to make UTF-8 cooperate.
> All the more reason to make wide easy to use.
>
> Question: Does anybody have a need for or have any program that has a need
> for char bytes >127 to be expanded to wide by adding a zero byte?
>
> On Sat, Jul 9, 2016 at 6:15 PM, Don Guinn <dongu...@gmail.com> wrote:
>
> > Thanks. I'll look them up.
> >
> > I think most of the too wide characters take 3 UTF-8 bytes so aren't
> > supported any way.
> > On Jul 9, 2016 5:21 PM, "robert therriault" <bobtherria...@mac.com> wrote:
> >
> > Thanks Don, both for this encapsulation and the writing that you have done
> > in the past on unicode in J. I am learning a lot.
> >
> > Two things which are worth mentioning.
> >
> >     utf8
> > 8&u:
> >    ucp
> > 7&u:
> >    uucp
> > u:@(7&u:)
> >
> > are already defined in the standard library and may lead to some
> > confusion with your definitions
> >     UCP
> > 3&u:
> >    UD NB. I like your use of rank, as that was an issue in my early explorations of unicode
> > 8&u:"1
> >    UC
> > 7&u:
> >
> > and second, the issue remains with uneven boxing with some of the unicode
> > characters (which I think is one of the prime motivators in this
> > adventure.)
> >
> >     ]s=: ucp 'ΔaΔ'
> > ΔaΔ
> >    <s
> > ┌───┐
> > │ΔaΔ│
> > └───┘
> >    ]t=: ucp 'Δaఝ' NB. last character wider
> > Δaఝ
> >    <t
> > ┌───┐
> > │Δaఝ│
> > └───┘
> >    uwid=: ({.@:glqextent_jgl2_ @: u: @:  ":) "0 NB. glqextent_jgl2_ is available in the jqt IDE
> >    uwid s
> > 9 7 9
> >    uwid t
> > 9 7 16
> >     JVERSION
> > Engine: j805/j64/darwin
> > Beta-9: commercial/2016-07-05T17:11:06
> > Library: 8.04.15
> > Qt IDE: 1.4.9/5.4.2
> > Platform: Darwin 64
> > Installer: J804 install
> > InstallPath: /users/bobtherriault/j64-804
> > Contact: www.jsoftware.com
> >
> > Perhaps your solution of dealing primarily with unicode wide characters
> > along with finding some way of having box sizing respond to the different
> > character widths could be worth exploring?
> >
> > Cheers, bob
> >
> >
> >
> > > On Jul 9, 2016, at 3:12 PM, Don Guinn <dongu...@gmail.com> wrote:
> > >
> > > I would appreciate any suggestions. I tried working with UTF-8, but it
> > > wasn't easy to work with arrays of UTF-8. Then I tried using unicode.
> > > Much easier, except that I had to be really careful when mixing it with
> > > UTF-8. Hence the proposal.
> > > On Jul 9, 2016 4:05 PM, "Raul Miller" <rauldmil...@gmail.com> wrote:
> > >
> > >> Without going into details, I will note that I would say a few things
> > >> here slightly differently.
> > >>
> > >> That said, overall, you seem to be in the "right" ballpark.
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >> Raul
> > >>
> > >>
> > >> On Sat, Jul 9, 2016 at 2:11 PM, Don Guinn <dongu...@gmail.com> wrote:
> > >>>   Note 'Observations on Unicode'
> > >>>
> > >>> There seems to be a lot of confusion in J around Unicode and UTF-8.
> > >>> It seems that people are having a lot of trouble dealing with UTF-8
> > >>> in J, particularly in arrays. Most of you are probably familiar with
> > >>> this issue and all said here, but it bears repeating. I am posting
> > >>> it to source rather than programming since it also includes a
> > >>> proposed change to J and recent Unicode posts are from people also
> > >>> on source.
> > >>>
> > >>> Unicode defines a representation of characters as numbers or "code
> > >>> points". Some confusion arises because the glyph (what a code point
> > >>> looks like) may look like the glyph of another code point. This is
> > >>> not really an issue as far as J is concerned. In Unicode charts the
> > >>> code point is normally written as U+HHHH, where HHHH is a number in
> > >>> hexadecimal. Four digits show up in most tables.
> > >>>
> > >>> Unicode is divided into planes. Plane 0, Basic Multilingual Plane
> > >>> (BMP), is the range U+0 through U+FFFF and is the part of Unicode
> > >>> supported by Windows and Unix double-byte character set (DBCS). Other
> > >>> planes of Unicode are not supported by Windows or Unix at this time.
> > >>>
> > >>> The code points are represented several ways. Windows and Unix use
> > >>> 16 bit unsigned integers, double-byte-character-set (DBCS) or wide
> > >>> characters in C. J type is 131072 for 3!:0 and names this “unicode”
> > >>> (note the lower case "u"). This is not UTF-8. Windows and Unix also
> > >>> represent Unicode with UTF-8. UTF-8 is the primary representation
> > >>> for Unicode in J, an ingenious way to represent Unicode compatible
> > >>> with ASCII. Also there is UTF-16 and others to represent code point
> > >>> bits for Unicode. The WEB has other ways to handle Unicode.
> > >>>
> > >>> The standard for UTF-8 defines one to four bytes to contain the code
> > >>> point bits. The one byte codes correspond to the ASCII characters U+0
> > >>> through U+7F. The UTF-8 standard allows a maximum of U+10FFFF, well
> > >>> beyond what Windows and Unix now support. Three byte UTF-8 covers
> > >>> 16 code point bits, which fits nicely in wide characters.
> > >>>
> > >>> UTF-8 contains start bytes and continuation bytes. Start bytes with a
> > >>> high order zero bit look exactly like standard ASCII. Start bytes of
> > >>> “11xxxxxx” mark the starts of multi-byte codes. Bytes of “10xxxxxx”
> > >>> are continuation bytes and must follow a start byte. The number of
> > >>> continuation bytes is given in the start byte.
> > >>> )
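> > >>>
> > >>>   Note ''
> > >>>
> > >>> A small sketch of the byte layout described above (an added
> > >>> illustration, not part of the original note). Showing each byte of
> > >>> 'ΔaΔ' in binary makes the start byte (110xxxxx), the continuation
> > >>> byte (10xxxxxx) and the ASCII byte (0xxxxxxx) visible. The last
> > >>> line rebuilds the code point of Δ by hand from its 5 + 6 code
> > >>> point bits.
> > >>> )
> > >>>
> > >>>   (8#2) #: a.i. 'ΔaΔ'
> > >>> 1 1 0 0 1 1 1 0
> > >>> 1 0 0 1 0 1 0 0
> > >>> 0 1 1 0 0 0 0 1
> > >>> 1 1 0 0 1 1 1 0
> > >>> 1 0 0 1 0 1 0 0
> > >>>   (64 * 32 | 206) + 64 | 148   NB. low 5 bits of 206, low 6 bits of 148
> > >>> 916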
> > >>>
> > >>>   Note 'Invalid byte sequences'
> > >>>
> > >>> It is an error if something happens to separate the start byte from
> > >>> its continuation bytes.
> > >>>
> > >>>  An unexpected continuation byte.
> > >>>  Continuation bytes must only follow start bytes.
> > >>>
> > >>>  A start byte not followed by enough continuation bytes.
> > >>>  The start byte defines how many continuation bytes follow.
> > >>>
> > >>>  A sequence that decodes to a value greater than U+10FFFF.
> > >>>  RFC 3629 limits UTF-8 to this maximum for compatibility with UTF-16.
> > >>>
> > >>>  An overlong encoding.
> > >>>  Each continuation byte contains 6 code point bits. If the first
> > >>>  continuation byte contains all zero code point bits it should be
> > >>>  shortened.
> > >>>
> > >>> The display of any UTF-8 characters failing the above tests varies
> > >>> from system to system. The official position now is to display
> > >>> � (U+FFFD). J often displays other characters.
> > >>> )
> > >>>
> > >>>   Note 'The Internet'
> > >>>
> > >>> The internet only supports the transmission of text as ASCII
> > >>> characters, characters in the range 0 through 7f hex. And many
> > >>> special characters are not allowed in text. Those characters not
> > >>> allowed are sent in a few ways.
> > >>>
> > >>>  1. A byte is represented as two hexadecimal digits following an
> > >>>     equal sign (=). For example: Blank is sent as =20 instead of
> > >>>     as 32{a. .
> > >>>
> > >>>     The Unicode symbol α (U+3B1) is sent as "=ce=b1", the bytes
> > >>>     are UTF-8, a start byte followed by a continuation byte in
> > >>>     hexadecimal.
> > >>>
> > >>>  2. &#nnn; - where nnn is the decimal code point number. For
> > >>>     example: &#916; will display as Δ.
> > >>>
> > >>>  3. &#xhhhh; - same as 2. except the number is in hexadecimal.
> > >>>     For example: &#x394; will also display as Δ.
> > >>>
> > >>> Unicode and UTF-8 cannot be sent directly. They must be converted to
> > >>> ASCII as described above. A raw text file with a lot of characters
> > >>> beyond ASCII can get very hard to view.
> > >>> )
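> > >>>
> > >>>   Note ''
> > >>>
> > >>> A sketch of the first two forms (an added illustration, not a
> > >>> complete encoder). The hex digits come from applying 16 16 #: to
> > >>> the UTF-8 bytes. The numeric character reference is written as
> > >>> ampersand, number sign, decimal code point, semicolon, with the
> > >>> code point taken from 3 u: 7 u: .
> > >>> )
> > >>>
> > >>>   , '=' ,. '0123456789abcdef' {~ 16 16 #: a.i. 'α'
> > >>> =ce=b1
> > >>>   '&#' , (": 3 u: 7 u: 'Δ') , ';'
> > >>> &#916;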
> > >>>
> > >>>   Note 'Unicode and J'
> > >>>
> > >>> J primitives treat UTF-8 bytes as literal. They do not recognize
> > >>> start and continuation bytes as UTF-8 making up characters. It is
> > >>> up to the programmer to handle the multi-byte characters. Vectors
> > >>> of UTF-8 characters display fine and are easy to work with. But
> > >>> higher dimensions must be carefully managed. Take the problem
> > >>> covered recently.
> > >>> )
> > >>>
> > >>>   ]s=:  8 6 $ 'ఝ' ,'a','ఝ'
> > >>> à° aà°
> > >>> à° aà
> > >>> ° à° a
> > >>> ఝఝ
> > >>> aà° à°
> > >>> aà° à
> > >>> ° aà°
> > >>> à° aà°
> > >>>
> > >>>   Note ''
> > >>>
> > >>> When start bytes are separated from continuation bytes error
> > >>> characters are displayed. J displays a line at a time. Continuation
> > >>> bytes moved to the next line are not recognized as a continuation
> > >>> of a multi-byte UTF-8 character.
> > >>>
> > >>> This example has all kinds of problems. First, the glyphs are wider
> > >>> than other characters, so there is no way they can align with other
> > >>> fixed width characters. Second, their UTF-8 codes are 3 bytes, not
> > >>> supported by the new boxing algorithm in J.
> > >>>
> > >>> The following definition of s gives a more supported example to
> > >>> examine handling UTF-8 in J.
> > >>> )
> > >>>
> > >>>   ]s=:'Δ' , 'a' ,'Δ'
> > >>> ΔaΔ
> > >>>   $s
> > >>> 5
> > >>>   a.i.s
> > >>> 206 148 97 206 148
> > >>>   <s                NB. Displays boxed nicely.
> > >>> ┌───┐
> > >>> │ΔaΔ│
> > >>> └───┘
> > >>>   8 6 $ s           NB. But how about reshaping?
> > >>> Î”aÎ”Î
> > >>> ”aÎ”Î”
> > >>> aΔΔa
> > >>> Î”Î”aÎ
> > >>> ”Î”aÎ”
> > >>> Î”aÎ”Î
> > >>> ”aÎ”Î”
> > >>> aΔΔa
> > >>>
> > >>>   NB. Still have the problem of splitting start and continuation bytes.
> > >>>
> > >>>   NB. Say we want to display the 3 characters in a column.
> > >>>
> > >>>   ]s=: 'Δ' , 'a' ,: 'Δ'
> > >>> Δ
> > >>> aa
> > >>> Δ
> > >>>
> > >>>   Note ''
> > >>>
> > >>> Where did the second "a" come from? To J the "Δ" is 2 characters.
> > >>> So the "a" must be expanded to match lengths. To get it to look
> > >>> right the "a" needs to be padded.
> > >>> )
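> > >>>
> > >>>   Note ''
> > >>>
> > >>> A quick check of that point (an added sketch, not from the
> > >>> original session): # counts bytes, not characters. The number of
> > >>> characters in valid UTF-8 is the number of bytes that are not
> > >>> continuation bytes (10xxxxxx, that is 128 through 191).
> > >>> )
> > >>>
> > >>>   # 'Δ'
> > >>> 2
> > >>>   $ 'Δ' , 'a' ,: 'Δ'
> > >>> 3 2
> > >>>   +/ 2 ~: <. (a.i. 'ΔaΔ') % 64   NB. count bytes outside 128..191
> > >>> 3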
> > >>>
> > >>>   ]s=: 'Δ' , 'a ' ,: 'Δ'
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>
> > >>>   <s           NB. But now the boxed display isn't what we want.
> > >>> ┌──┐
> > >>> │Δ │
> > >>> │a │
> > >>> │Δ │
> > >>> └──┘
> > >>>
> > >>>   NB. Perhaps treating it as a vector works better.
> > >>>
> > >>>   ]s=: 'Δ' , LF,  'a' , LF, 'Δ'
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>   <s           NB. But the line feeds are treated as blanks.
> > >>> ┌─────┐
> > >>> │Δ a Δ│
> > >>> └─────┘
> > >>>
> > >>>   ]s=: 'Δ' ; 'a' ; 'Δ' NB. Try boxing each character.
> > >>> ┌─┬─┬─┐
> > >>> │Δ│a│Δ│
> > >>> └─┴─┴─┘
> > >>>   > s
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>   <>s          NB. Box looks OK but extra space.
> > >>> ┌──┐
> > >>> │Δ │
> > >>> │a │
> > >>> │Δ │
> > >>> └──┘
> > >>>   ,>s          NB. Still got the extra space.
> > >>> Δa Δ
> > >>>   ,s,&><LF     NB. Still not quite right.
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>
> > >>>   ,>s,<'α'
> > >>> Δa Δα
> > >>>
> > >>>   Note ''
> > >>>
> > >>> But this requires a lot of special effort to handle UTF-8
> > >>> characters. And any definitions for handling ASCII text will
> > >>> probably need modifying to get them to work with UTF-8.
> > >>>
> > >>> It is my opinion that converting UTF-8 literal to unicode makes
> > >>> manipulating arrays of characters much easier. So I defined 4
> > >>> verbs to assist.
> > >>> )
> > >>>
> > >>>   U  =:u:       NB. Convert code points to unicode.
> > >>>   UCP=:3&u:     NB. Convert UTF-8 (char) or unicode to code points.
> > >>>   UC =:7&u:     NB. Convert UTF-8 to unicode if necessary.
> > >>>   UD =:8&u:"1   NB. Convert unicode to UTF-8 or char.
> > >>>
> > >>>   Note ''
> > >>>
> > >>> Use U instead of a.&{ to convert numbers to text.
> > >>>
> > >>> Use UCP instead of a.&i. . It gives the same result as a.&i. for
> > >>> literals and gives code points for unicode.
> > >>>
> > >>> UD converts unicode to literal and any characters outside of ASCII
> > >>> to UTF-8. It can be used instead of a.&{ to convert. It is
> > >>> necessary to set it to rank 1 as 8&u: only works on vectors.
> > >>>
> > >>> So let's try s as unicode instead of as UTF-8.
> > >>> )
> > >>>
> > >>>   ]s=: UC 'ΔaΔ'
> > >>> ΔaΔ
> > >>>   $s
> > >>> 3
> > >>>   UCP s
> > >>> 916 97 916
> > >>>   <s
> > >>> ┌───┐
> > >>> │ΔaΔ│
> > >>> └───┘
> > >>>   ,.s
> > >>> Δ
> > >>> a
> > >>> Δ
> > >>>   <,.s
> > >>> ┌─┐
> > >>> │Δ│
> > >>> │a│
> > >>> │Δ│
> > >>> └─┘
> > >>>   <"0 s
> > >>> ┌─┬─┬─┐
> > >>> │Δ│a│Δ│
> > >>> └─┴─┴─┘
> > >>>   s,'Δ'
> > >>> ΔaΔÎ”
> > >>>
> > >>>   Note ''
> > >>>
> > >>> Oops! Got a problem.
> > >>>
> > >>> When J mixes nouns of different internal types it must convert
> > >>> them to the same type before processing. Like comparing 1 to
> > >>> 1.5-0.5 . Here J converts char to wide by putting a zero byte
> > >>> value in front of the char byte. This works fine for ASCII. But
> > >>> not if the noun includes any UTF-8 multi-byte codes.
> > >>>
> > >>> Here J treated the 2 byte Δ UTF-8 as two characters. Both being
> > >>> invalid UTF-8.
> > >>>
> > >>> One must make sure that unicode is never mixed with literal that
> > >>> may contain UTF-8 multi-byte characters.
> > >>> )
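> > >>>
> > >>>   Note ''
> > >>>
> > >>> The zero byte conversion and the UTF-8 conversion can be compared
> > >>> directly through their code points (an added sketch): 2&u: widens
> > >>> each byte separately, while 7&u: decodes the UTF-8 sequence first.
> > >>> )
> > >>>
> > >>>   3 u: 2 u: 'Δ'    NB. each byte becomes its own wide character
> > >>> 206 148
> > >>>   3 u: 7 u: 'Δ'    NB. the two bytes decode to one code point
> > >>> 916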
> > >>>
> > >>>   s,UC 'Δ'
> > >>> ΔaΔΔ
> > >>>
> > >>>   Note 'Proposal'
> > >>>
> > >>> When J primitives need to convert literal (char) to unicode
> > >>> (wide), they should convert the char to unicode using the UTF-8
> > >>> conversion algorithm (7&u:) instead of adding a zero byte.
> > >>>
> > >>> This would give almost complete transparency mixing UTF-8 and
> > >>> unicode. I can't think of any case where one would not want to
> > >>> have UTF-8 converted to unicode when literal is mixed with
> > >>> unicode; however, if it is required 2&u: could be used.
> > >>>
> > >>> This should not cause any backward compatibility problems as it
> > >>> only changes how char is converted to wide by default, something
> > >>> I suspect no one currently uses. Not any other operation.
> > >>> )
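> > >>>
> > >>>   Note ''
> > >>>
> > >>> In the meantime the proposed behavior can be approximated in user
> > >>> code. This is only a sketch: ucat is a hypothetical name, not a
> > >>> library verb, and it assumes vector arguments and that 7&u:
> > >>> leaves an argument that is already wide unchanged. It applies
> > >>> 7&u: to both sides before catenating, so literal holding UTF-8 is
> > >>> decoded rather than zero-extended.
> > >>> )
> > >>>
> > >>>   ucat=: ,&(7&u:)       NB. hypothetical helper: decode both sides, then catenate
> > >>>   3 u: (7 u: 'ΔaΔ') ucat 'Δ'   NB. code points of the result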
> > >>>
> > >>>   Note 'Implementation'
> > >>>
> > >>> I realize that it is very late in this development cycle of J.
> > >>> So this would probably not be done in it; however, I feel that
> > >>> this change to make UTF-8 and unicode more compatible would make
> > >>> it easier for people to use unicode avoiding all the confusion
> > >>> trying to do everything in UTF-8.
> > >>>
> > >>> This would affect the concatenation verbs (dyadic , ,. ,: and
> > >>> monadic ;), the comparison verbs (= and -:) and probably amend
> > >>> (}).
> > >>>
> > >>> Hopefully this is done in a single macro or subroutine used by
> > >>> these verbs. If this is the case, the change should be not too
> > >>> difficult.
> > >>>
> > >>> Perhaps later.
> > >>> )
> > >>>
> > >>> ----------------------------------------------------------------------
> > >>> For information about J forums see http://www.jsoftware.com/forums.htm
