I agree with this, especially the analysis of why it's reasonable to support 16-bit characters as the extended character size in J.

I would quibble with just one word: when you say

UTF-8 is a unicode encoding

I would change 'unicode' to 'character'.

Unicode assigns numbers (code points) to characters; UTF-8 represents those numbers as sequences of bytes. In theory, you could use UTF-8 with a different assignment of numbers to characters. Of course, that is even less likely than immediate uptake of 21-bit Unicode.
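To make the distinction concrete, here is a minimal sketch in Python (used only for illustration, not J). The helper name utf8_bytes is hypothetical; it applies the UTF-8 bit layout to a plain integer without consulting any character table, which is the sense in which UTF-8 only "represents numbers".

```python
def utf8_bytes(n: int) -> bytes:
    """Encode a non-negative integer below 2**21 using the UTF-8 bit layout.

    Hypothetical helper for illustration: it does no Unicode validity checks
    (for example, it does not reject surrogates or values above 0x10FFFF).
    """
    if n < 0:
        raise ValueError("negative numbers cannot be encoded")
    if n < 0x80:                      # 7 bits: one byte
        return bytes([n])
    if n < 0x800:                     # 11 bits: two bytes
        return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
    if n < 0x10000:                   # 16 bits: three bytes
        return bytes([0xE0 | (n >> 12),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    if n < 0x200000:                  # 21 bits: four bytes
        return bytes([0xF0 | (n >> 18),
                      0x80 | ((n >> 12) & 0x3F),
                      0x80 | ((n >> 6) & 0x3F),
                      0x80 | (n & 0x3F)])
    raise ValueError("more than 21 bits")

# The numbers 600 and 30000 are only numbers here; Unicode is what says
# which characters they stand for.
print(utf8_bytes(600).hex())    # c998
print(utf8_bytes(30000).hex())  # e794b0
```

For numbers that are valid Unicode code points, the result agrees with Python's built-in encoder, e.g. chr(600).encode("utf-8").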

Henry Rich

On 7/5/2016 12:17 PM, Raul Miller wrote:
On Tue, Jul 5, 2016 at 10:56 AM, robert therriault
<bobtherria...@mac.com> wrote:
  If you haven't already, take a look at the video and see what you think.

https://youtu.be/eN9H-rMk1No

So, ok...

In the video, you are working with your display verb (v) and the
result of this expression:

    'a',(u:600+i.5),u:30000+/i.5

But you said several things in the video that I disagree with, and I
think a bit more background on these issues might help:

First, using the J Dictionary terminology, both of these sentences
have literal results:

    'a',(u:600+i.5),u:30000+/i.5

    ": 'a',(u:600+i.5),u:30000+/i.5

I expect the mail system to mutilate those results, so I have removed
them from my message here.

Second, using the Unicode Consortium's terminology, both of those
results are Unicode. More specifically, UTF-8 is a unicode encoding.

If we look at the J type numbers for these two sentences, we get:

    3!:0 'a',(u:600+i.5),u:30000+/i.5
131072
    3!:0 ": 'a',(u:600+i.5),u:30000+/i.5
2

The first sentence has type number 131072, which roughly corresponds
to UTF-16 (16-bit characters). The second sentence has type number 2,
which roughly corresponds to UTF-8 (8-bit characters). J should perhaps
also have a third kind of character literal, but that has not yet been
implemented.
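As a rough cross-check outside J, the same eleven code points can be built in Python (used only for illustration) and encoded both ways. The correspondence between J's type numbers and these encodings is loose, as noted above; the sketch only shows how many 8-bit and 16-bit units the same text needs.

```python
# Same code points as the J expression 'a',(u:600+i.5),u:30000+/i.5
code_points = [ord("a")] + list(range(600, 605)) + list(range(30000, 30005))
text = "".join(chr(cp) for cp in code_points)

utf8 = text.encode("utf-8")        # sequence of 8-bit code units
utf16 = text.encode("utf-16-le")   # sequence of 16-bit code units, no byte order mark

print(len(text))        # 11 characters
print(len(utf8))        # 26 eight-bit units: 1 + 5*2 + 5*3
print(len(utf16) // 2)  # 11 sixteen-bit units: one per character
```

Every character in this example fits in a single 16-bit unit, which is why the 16-bit form has exactly as many elements as there are characters.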

So, taking a few steps back...

Unicode currently has enough characters defined that they need 21 bits
to enumerate them all. But they also have licenses on the standard way
of representing the characters, which has been important for some
people and which has slowed the adoption of that aspect of the
standard. This translates to limited font support (but there's also
limited keyboard support as well as dubious recommendations within the
Unicode standards themselves).

In other words, if we were to represent the full set of Unicode
characters using type 131072, some of them would still need more than
one 16-bit element each (a surrogate pair). But there is little or no
font support for most of them.
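For reference, this is how UTF-16 splits a code point that does not fit in 16 bits into two 16-bit units. It is a minimal Python sketch (Python used only for illustration); the helper name utf16_units is hypothetical, and it skips validity checks such as rejecting code points in the surrogate range.

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]                  # fits in a single 16-bit unit
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_units(30000)])    # ['0x7530']
print([hex(u) for u in utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The second result matches Python's own encoder: chr(0x1F600).encode("utf-16-be") yields the bytes d8 3d de 00.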

Also, 21 bits is about 2 million characters. So that's megabytes of
information that would be needed mostly to support characters which
the Unicode people have defined but which almost nobody uses (in part
because of licensing concerns).
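The sizes involved are easy to check. A small Python calculation (Python used only for illustration):

```python
sixteen_bit_values = 2 ** 16       # values available to a 16-bit character
twenty_one_bit_values = 2 ** 21    # values available in 21 bits
unicode_code_space = 0x10FFFF + 1  # code points U+0000 through U+10FFFF

print(sixteen_bit_values)     # 65536
print(twenty_one_bit_values)  # 2097152
print(unicode_code_space)     # 1114112
```

So "about 2 million" is the capacity of 21 bits; the Unicode code space itself stops at U+10FFFF, a little over 1.1 million code points, not all of which are assigned.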

Hopefully that is enough background...

To round out this discussion, though, let us imagine that J had been
extended to fully support the full 21 bit space of the Unicode
character set.

First off, as near as I can tell, there is no font support in any
operating system for characters which are "unicode code points which
cannot be represented as a single code unit using utf-16". In other
words, there's nothing for J to tie into to represent most of those
characters. So I guess they would mostly wind up looking like any
other unrecognized character. So if I supported them, this would be a
"temporary hack" (which would likely survive for decades, and then be
busy work for someone trying to keep up with some hypothetical
changing and conflicting implementations).

Second, we would need to use one of the very few available type
numbers to represent these characters which nobody supports. (No font
support, no keyboard support - just compliance with some obscure
standards which have not yet been proven useful.) And, J has only two
unused type numbers which fit in a 32 bit implementation:

https://github.com/jsoftware/jsource/blob/master/jsrc/jtype.h#L160

    2^30x
1073741824
    2^31x
2147483648

Though, actually, since the type is stored in a *A as an I (which is
long on 32 bit J and long long on 64 bit J), which is a signed type on
every major platform, there is really only one remaining unambiguous
type number available for 32 bit J. (2147483648 would really be
-2147483648 for current 32 bit J implementations.)
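The wraparound can be checked by reinterpreting the bit patterns as two's-complement signed 32-bit integers, which gives -2147483648 for 2^31. A minimal Python sketch (Python used only for illustration; the helper name as_signed_32 is hypothetical and is not taken from the J sources):

```python
def as_signed_32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a two's-complement signed integer."""
    raw = (n & 0xFFFFFFFF).to_bytes(4, "little")
    return int.from_bytes(raw, "little", signed=True)

print(as_signed_32(2 ** 30))  # 1073741824
print(as_signed_32(2 ** 31))  # -2147483648
```

2^30 keeps its value, while 2^31 sets only the sign bit and so reads back as the most negative 32-bit integer.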

Anyways, we probably do not want to use our one remaining type number
which can be used in 32 bit J for a character literal type which will
probably never be supported on a 32 bit operating system.

So, let's pretend that we want to use type number 8589934592 to
represent UTF-32 character literals, and that all of the characters
not supported by the OS get displayed using the question mark in a
diamond (the replacement character, U+FFFD).

That would, at least, give us a representation in which every
character defined by the Unicode Consortium occupies exactly one array
element, so no character is split when the array is reshaped.
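The reshaping point can be illustrated by cutting the same text into fixed-size rows in two encodings. This is a minimal Python sketch (Python used only for illustration); the helper names chunks and decodes are hypothetical, and the byte rows merely stand in for the rows of a reshaped J array.

```python
text = "a" + chr(600) + chr(30000)

utf8 = text.encode("utf-8")        # 1 + 2 + 3 = 6 bytes
utf32 = text.encode("utf-32-le")   # 4 bytes per character = 12 bytes

def chunks(data: bytes, width: int) -> list:
    """Split data into consecutive rows of `width` bytes."""
    return [data[i:i + width] for i in range(0, len(data), width)]

def decodes(data: bytes, encoding: str) -> bool:
    """Report whether data is valid on its own in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

utf8_rows = chunks(utf8, 2)     # rows cut through the multi-byte characters
utf32_rows = chunks(utf32, 4)   # one character per row

print([decodes(row, "utf-8") for row in utf8_rows])       # [False, False, False]
print([decodes(row, "utf-32-le") for row in utf32_rows])  # [True, True, True]
```

With one fixed-size element per character, any row boundary is also a character boundary; with a variable-length encoding it generally is not.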

But that would not solve the boxed display issue. To solve the boxed
display issue, we need to know the character width for each character.
J assumes a "fixed width" font, but Unicode has defined some
characters to have double width and we don't have any fonts which make
them the same width as anything else. The Unicode Consortium has also
defined some characters which are zero width and some which do other
strange things which conflict with the idea of fixed width characters.
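To see how far display width can drift from character count, here is a rough Python sketch (Python used only for illustration) based on the standard unicodedata module. The helper name display_width and its rules are assumptions: wide and fullwidth characters count as two columns, combining marks and format characters as zero, everything else as one. Real terminals and fonts may disagree.

```python
import unicodedata

def display_width(text: str) -> int:
    """Rough column estimate: W/F = 2, combining and format characters = 0, else 1."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue                                  # treated as zero width
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2                                # treated as double width
        else:
            total += 1
    return total

# Same code points as the J expression 'a',(u:600+i.5),u:30000+/i.5
sample = ("a"
          + "".join(chr(600 + i) for i in range(5))
          + "".join(chr(30000 + i) for i in range(5)))

print(len(sample), display_width(sample))  # 11 16
print(display_width("e\u0301"))            # 1 (letter plus combining accent)
```

Under these assumptions the eleven-character example occupies sixteen columns, so a box drawn from the character count alone would come out too narrow.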

So basically, the Unicode standard itself is the source of confusion.
And it's approximately a bottomless pit of confusion (if you try to
read the full set of standards available at unicode.org). But all of
it is for good reasons, of course.

...

Which brings me back to questions like "What do you want the user to
understand?"

Thanks,

