Why would you make this change? I think both statements are true:
(1) "UTF-8 is a unicode encoding" - it is an encoding defined by the unicode consortium as a part of the unicode standards, and
(2) "UTF-8 is a character encoding" - this encoding represents [unicode] characters.

But I am not sure why you would want to replace the one phrasing with the other, in this context.

Thanks,

--
Raul

On Tue, Jul 5, 2016 at 12:45 PM, Henry Rich <henryhr...@gmail.com> wrote:
> I agree with this, especially the analysis of why it's reasonable to support
> 16-bit characters as the extended character size in J.
>
> I would quibble with just one word: when you say
>
>    UTF-8 is a unicode encoding
>
> I would change 'unicode' to 'character'.
>
> Unicode assigns numbers to glyphs; UTF-8 represents numbers as a sequence of
> bytes. In theory, you could use UTF-8 with a different character encoding.
> Of course, that is even less likely than immediate uptake of 21-bit Unicode.
>
> Henry Rich
>
> On 7/5/2016 12:17 PM, Raul Miller wrote:
>>
>> On Tue, Jul 5, 2016 at 10:56 AM, robert therriault
>> <bobtherria...@mac.com> wrote:
>>>
>>> If you haven't already, take a look at the video and see what you
>>> think.
>>>
>>> https://youtu.be/eN9H-rMk1No
>>
>> So, ok...
>>
>> In the video, you are working with your display verb (v) and the
>> result of this expression:
>>
>>    'a',(u:600+i.5),u:30000+/i.5
>>
>> But there were several things you said in the video which I should
>> disagree with. I think you might be able to use a bit more background
>> on these issues:
>>
>> First, using the J Dictionary terminology, both of these sentences
>> have literal results:
>>
>>    'a',(u:600+i.5),u:30000+/i.5
>>
>>    ": 'a',(u:600+i.5),u:30000+/i.5
>>
>> I expect the mail system to mutilate those results, so I have removed
>> them from my message here.
>>
>> Second, using the Unicode Consortium's terminology, both of those
>> results are Unicode. More specifically, UTF-8 is a unicode encoding.
>>
>> If we look at the J type numbers for these two sentences, we get:
>>
>>    3!:0 'a',(u:600+i.5),u:30000+/i.5
>> 131072
>>    3!:0 ": 'a',(u:600+i.5),u:30000+/i.5
>> 2
>>
>> The first sentence has type number 131072, which roughly corresponds
>> to UTF-16. The second sentence has type number 2, which roughly
>> corresponds to UTF-8. J should perhaps have a third kind of character
>> literal, also, which has not yet been implemented.
>>
>> So, taking a few steps back...
>>
>> Unicode currently has enough characters defined that they need 21 bits
>> to enumerate them all. But they also have licenses on the standard way
>> of representing the characters, which has been important for some
>> people and which has slowed the adoption of that aspect of the
>> standard. This translates to limited font support (but there's also
>> limited keyboard support as well as dubious recommendations within the
>> Unicode standards themselves).
>>
>> In other words, if we were to represent the full set of Unicode
>> characters using type 131072, some of them would still be multi-byte
>> characters. But there is little or no font support for most of them.
>>
>> Also, 21 bits is about 2 million characters. So that's megabytes of
>> information that would be needed mostly to support characters which
>> the Unicode people have defined but which almost nobody uses (in part
>> because of licensing concerns).
>>
>> Hopefully that is enough background...
>>
>> To round out this discussion, though, let us imagine that J had been
>> extended to fully support the full 21 bit space of the Unicode
>> character set.
>>
>> First off, as near as I can tell, there is no font support in any
>> operating system for characters which are "unicode code points which
>> cannot be represented as a single code unit using utf-16". In other
>> words, there's nothing for J to tie into to represent most of those
>> characters. So I guess they would mostly wind up looking like any
>> other unrecognized character.
>> So if I supported them, this would be a
>> "temporary hack" (which would likely survive for decades, and then be
>> busy work for someone trying to keep up with some hypothetical
>> changing and conflicting implementations).
>>
>> Second, we would need to use one of the very few available type
>> numbers to represent these characters which nobody supports. (No font
>> support, no keyboard support - just compliance with some obscure
>> standards which have not yet been proven useful.) And, J has only two
>> unused type numbers which fit in a 32 bit implementation:
>>
>> https://github.com/jsoftware/jsource/blob/master/jsrc/jtype.h#L160
>>
>>    2^30x
>> 1073741824
>>    2^31x
>> 2147483648
>>
>> Though, actually, since the type is stored in a *A as an I (which is
>> long on 32 bit J and long long on 64 bit J) which is a signed type on
>> every major platform, there is really only one remaining unambiguous
>> type number available for 32 bit J. (2147483648 would really be
>> -2147483648 for current 32 bit J implementations.)
>>
>> Anyways, we probably do not want to use our one remaining type number
>> which can be used in 32 bit J for a character literal type which will
>> probably never be supported on a 32 bit operating system.
>>
>> So, let's pretend that we want to use type number 8589934592 to
>> represent UTF-32 character literals, and that all of the characters
>> not supported by the OS get displayed using the question mark in a
>> diamond representation.
>>
>> That would, at least, give us a way of representing all unicode
>> characters where all unicode consortium defined characters retain
>> their integrity when the array is reshaped.
>>
>> But that would not solve the boxed display issue. To solve the boxed
>> display issue, we need to know the character width for each character.
>> J assumes a "fixed width" font, but Unicode has defined some
>> characters to have double width and we don't have any fonts which make
>> them the same width as anything else.
>> The Unicode Consortium has also
>> defined some characters which are zero width and some which do other
>> strange things which conflict with the idea of fixed width characters.
>>
>> So basically, the Unicode standard itself is the source of confusion.
>> And it's approximately a bottomless pit of confusion (if you try to
>> read the full set of standards available at unicode.org). But all of
>> it is for good reasons, of course.
>>
>> ...
>>
>> Which brings me back to questions like "What do you want the user to
>> understand?"
>>
>> Thanks,
>>
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
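
For readers who want to check the size arithmetic discussed in this thread, here is a minimal sketch in Python (not J, so it runs without a J install). It counts UTF-8 bytes and UTF-16 code units for the code points in the example expression, and reinterprets 2^30 and 2^31 as signed 32-bit integers. The helper names and the extra code point U+1F600 are assumptions chosen for illustration; nothing here reflects how J itself stores literals.

```python
# Code points from the thread's expression 'a',(u:600+i.5),u:30000+/i.5
code_points = [ord("a")] + [600 + i for i in range(5)] + [30000 + i for i in range(5)]

# Hypothetical example of a code point above U+FFFF (not from the thread).
astral = 0x1F600


def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 uses to encode one code point."""
    return len(chr(cp).encode("utf-8"))


def utf16_units(cp: int) -> int:
    """Number of 16-bit code units UTF-16 uses to encode one code point."""
    return len(chr(cp).encode("utf-16-le")) // 2


def as_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed two's-complement value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


for cp in code_points + [astral]:
    print(f"U+{cp:04X}: {utf8_len(cp)} UTF-8 byte(s), {utf16_units(cp)} UTF-16 unit(s)")

# The largest Unicode code point, U+10FFFF, needs 21 bits.
print((0x10FFFF).bit_length())

# 2^30 survives as a signed 32-bit value; 2^31 wraps to a negative number.
print(as_int32(2**30), as_int32(2**31))
```

Every code point in the example expression fits in a single UTF-16 code unit, while the one above U+FFFF takes two, which is the case the thread describes as "cannot be represented as a single code unit using utf-16".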