Re: [Jsource] Non displayable characters display changed in penultimate beta

Raul Miller Tue, 05 Jul 2016 10:47:33 -0700

Why would you make this change?

I think both statements are true:


(1) "UTF-8 is a unicode encoding" - it is an encoding defined by the
Unicode Consortium as a part of the Unicode standards, and

(2) "UTF-8 is a character encoding" - this encoding represents
[unicode] characters.

But I am not sure why you would want to replace the one phrasing with
the other, in this context.

Thanks,

-- 
Raul

On Tue, Jul 5, 2016 at 12:45 PM, Henry Rich <henryhr...@gmail.com> wrote:
> I agree with this, especially the analysis of why it's reasonable to support
> 16-bit characters as the extended character size in J.
>
> I would quibble with just one word: when you say
>
> UTF-8 is a unicode encoding
>
> I would change 'unicode' to 'character'.
>
> Unicode assigns numbers to glyphs; UTF-8 represents numbers as a sequence of
> bytes.  In theory, you could use UTF-8 with a different character encoding.
> Of course, that is even less likely than immediate uptake of 21-bit Unicode.
>
> Henry Rich
>
> On 7/5/2016 12:17 PM, Raul Miller wrote:
>>
>> On Tue, Jul 5, 2016 at 10:56 AM, robert therriault
>> <bobtherria...@mac.com> wrote:
>>>
>>>   If you haven't already, take a look at the video and see what you
>>> think.
>>>
>>> https://youtu.be/eN9H-rMk1No
>>
>> So, ok...
>>
>> In the video, you are working with your display verb (v) and the
>> result of this expression:
>>
>>     'a',(u:600+i.5),u:30000+/i.5
>>
>> But there were several things you said in the video that I
>> disagree with. I think a bit more background on these issues
>> might help:
>>
>> First, using the J Dictionary terminology, both of these sentences
>> have literal results:
>>
>>     'a',(u:600+i.5),u:30000+/i.5
>>
>>     ": 'a',(u:600+i.5),u:30000+/i.5
>>
>> I expect the mail system to mutilate those results, so I have removed
>> them from my message here.
>>
>> Second, using the Unicode Consortium's terminology, both of those
>> results are Unicode. More specifically, UTF-8 is a unicode encoding.
>>
>> If we look at the J type numbers for these two sentences, we get:
>>
>>     3!:0 'a',(u:600+i.5),u:30000+/i.5
>> 131072
>>     3!:0 ": 'a',(u:600+i.5),u:30000+/i.5
>> 2
>>
>> The first sentence has type number 131072, which roughly corresponds
>> to UTF-16. The second sentence has type number 2, which roughly
>> corresponds to UTF-8. J should perhaps have a third kind of character
>> literal, also, which has not yet been implemented.
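
As a rough illustration of the correspondence described above, here is a minimal Python sketch (Python rather than J, purely as an assumed stand-in for showing the encodings). It builds the same code points as the J expression and counts 16-bit and 8-bit code units. The variable names are hypothetical.

```python
# Code points from the J expression: 'a', then 600..604, then 30000..30004.
code_points = [ord("a")] + [600 + i for i in range(5)] + [30000 + i for i in range(5)]
text = "".join(chr(cp) for cp in code_points)

# UTF-16 uses 16-bit code units; the "-le" codec avoids prepending a BOM.
utf16_units = len(text.encode("utf-16-le")) // 2

# UTF-8 uses 8-bit code units (bytes).
utf8_bytes = len(text.encode("utf-8"))

print(len(code_points), utf16_units, utf8_bytes)
```

Under these assumptions the 11 characters occupy 11 UTF-16 code units but 26 UTF-8 bytes, so in the byte form a single character can span several array elements.
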
>>
>> So, taking a few steps back...
>>
>> Unicode currently has enough characters defined that they need 21 bits
>> to enumerate them all. But they also have licenses on the standard way
>> of representing the characters, which has been important for some
>> people and which has slowed the adoption of that aspect of the
>> standard. This translates to limited font support (but there's also
>> limited keyboard support as well as dubious recommendations within the
>> Unicode standards themselves).
>>
>> In other words, if we were to represent the full set of Unicode
>> characters using type 131072, some of them would still be multi-unit
>> characters (each needing a pair of 16-bit values). But there is
>> little or no font support for most of them.
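
To make the point about 16-bit storage concrete, the following Python sketch splits a code point above U+FFFF into its UTF-16 surrogate pair. The function name is hypothetical and the example code point (U+1F600) is chosen only for illustration; it does not come from the thread.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in one 16-bit unit or is out of range")
    offset = code_point - 0x10000          # 20-bit value to spread over two units
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

pair = surrogate_pair(0x1F600)

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-le")
units = [int.from_bytes(encoded[i:i + 2], "little") for i in range(0, len(encoded), 2)]

print([hex(u) for u in pair], [hex(u) for u in units])
```

A 16-bit character array holding such a pair has two elements for one character, so reshaping or indexing can separate the halves.
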
>>
>> Also, 21 bits is about 2 million characters. So that's megabytes of
>> information that would be needed mostly to support characters which
>> the Unicode people have defined but which almost nobody uses (in part
>> because of licensing concerns).
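
The arithmetic behind the "21 bits" figure can be checked with a short Python sketch. It assumes only that the Unicode code space ends at U+10FFFF, which Python exposes as `sys.maxunicode`; the variable names are hypothetical.

```python
import sys

highest = sys.maxunicode               # 0x10FFFF, the last Unicode code point
bits_needed = highest.bit_length()     # bits required to number every code point
slots = 2 ** bits_needed               # values a field of that width can hold
code_point_count = highest + 1         # code points actually in the code space

print(bits_needed, slots, code_point_count)
```

So a 21-bit field has room for about 2 million values, of which roughly 1.1 million are valid code points.
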
>>
>> Hopefully that is enough background...
>>
>> To round out this discussion, though, let us imagine that J had been
>> extended to fully support the full 21 bit space of the Unicode
>> character set.
>>
>> First off, as near as I can tell, there is no font support in any
>> operating system for characters which are "unicode code points which
>> cannot be represented as a single code unit using utf-16". In other
>> words, there's nothing for J to tie into to represent most of those
>> characters. So I guess they would mostly wind up looking like any
>> other unrecognized character. So if I supported them, this would be a
>> "temporary hack" (which would likely survive for decades, and then be
>> busy work for someone trying to keep up with some hypothetical
>> changing and conflicting implementations).
>>
>> Second, we would need to use one of the very few available type
>> numbers to represent these characters which nobody supports. (No font
>> support, no keyboard support - just compliance with some obscure
>> standards which have not yet been proven useful.) And, J has only two
>> unused type numbers which fit in a 32 bit implementation:
>>
>> https://github.com/jsoftware/jsource/blob/master/jsrc/jtype.h#L160
>>
>>     2^30x
>> 1073741824
>>     2^31x
>> 2147483648
>>
>> Though, actually, since the type is stored in a *A as an I (which is
>> long on 32 bit J and long long on 64 bit J) which is a signed type on
>> every major platform, there is really only one remaining unambiguous
>> type number available for 32 bit J. (2147483648 would really be
>> -2147483648 for current 32 bit J implementations.)
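
The signed-integer point can be sketched in Python by reinterpreting the low 32 bits of each candidate type number as a two's-complement value. This assumes a 32-bit two's-complement representation for the type field; the helper name is hypothetical.

```python
def as_signed_32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a two's-complement signed integer."""
    raw = (n & 0xFFFFFFFF).to_bytes(4, "little")
    return int.from_bytes(raw, "little", signed=True)

bit30 = as_signed_32(2 ** 30)   # still positive: fits below the sign bit
bit31 = as_signed_32(2 ** 31)   # lands on the sign bit and turns negative

print(bit30, bit31)
```

Under that assumption 2^30 stays 1073741824, while 2^31 is read back as -2147483648.
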
>>
>> Anyways, we probably do not want to use our one remaining type number
>> which can be used in 32 bit J for a character literal type which will
>> probably never be supported on a 32 bit operating system.
>>
>> So, let's pretend that we want to use type number 8589934592 to
>> represent UTF-32 character literals, and that all of the characters
>> not supported by the OS get displayed using the question mark in a
>> diamond representation.
>>
>> That would, at least, give us a way of representing all Unicode
>> characters such that every character defined by the Unicode
>> Consortium retains its integrity when the array is reshaped.
>>
>> But that would not solve the boxed display issue. To solve the boxed
>> display issue, we need to know the character width for each character.
>> J assumes a "fixed width" font, but Unicode has defined some
>> characters to have double width and we don't have any fonts which make
>> them the same width as anything else. The Unicode Consortium has also
>> defined some characters which are zero width and some which do other
>> strange things which conflict with the idea of fixed width characters.
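
A minimal Python sketch of the width problem, using the standard `unicodedata` module. The `display_width` function is hypothetical and its rules are a deliberate simplification (wide and fullwidth characters count as 2 columns; combining marks and format characters count as 0); real terminals and fonts may behave differently.

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate how many fixed-width columns text occupies (simplified rules)."""
    width = 0
    for ch in text:
        # Combining marks (Mn, Me) and format characters (Cf) take no column.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide (W) and Fullwidth (F) characters take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width

narrow = display_width("abc")          # three ordinary characters
wide = display_width(chr(30000))       # one CJK ideograph from the example
combined = display_width("e\u0301")    # 'e' plus a combining acute accent

print(narrow, wide, combined)
```

A box-drawing routine that assumes one column per character would misalign the second and third cases.
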
>>
>> So basically, the Unicode standard itself is the source of confusion.
>> And it's approximately a bottomless pit of confusion (if you try to
>> read the full set of standards available at unicode.org). But all of
>> it is for good reasons, of course.
>>
>> ...
>>
>> Which brings me back to questions like "What do you want the user to
>> understand?"
>>
>> Thanks,
>>
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm