On 5/31/24 12:53, enh wrote:
>> Let's see... Ah:
>>
>> https://www.unicode.org/L2/L1999/UnicodeData.html
>>
>> That's a bit long. My suggestion had 9 decimal numbers, this has
>> "IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY" as one of fifteen fields,
>> with "<compat> 0031 6708" being another single field. How nice. (And
>> still extensive warnings that this doesn't cover everything. I think
>> "too much is never enough" was an MTV slogan back in the 1980s? Ah,
>> it's from "The Marriage of Figaro" in 1784.)
>
> citation needed? (or if you want me to keep trying to think of where
> that or something similar occurs in the libretto, at least tell me
> whether it's an aria or recitative :-) )
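(As an aside on the file format grumbled about above: pulling those fifteen semicolon-separated fields apart is at least mechanically trivial. A minimal sketch, where the sample line in the test is typed in the shape UnicodeData.txt uses rather than copied verbatim from the file, so treat its field values as illustrative:)

```c
#include <string.h>

// Split one UnicodeData.txt line into its semicolon-delimited fields,
// in place. Returns the number of fields found (a well-formed line has
// 15; empty fields are legal and common). "fields" must have max slots.
int split_ucd_line(char *line, char **fields, int max)
{
  int n = 0;
  char *s = line, *nl;

  while (n < max) {
    fields[n++] = s;
    s = strchr(s, ';');
    if (!s) break;
    *s++ = 0;
  }
  // Strip a trailing newline from the last field, if any.
  nl = strchr(fields[n-1], '\n');
  if (nl) *nl = 0;

  return n;
}
```

Field 5 is the decomposition, which is where the "<compat> 0031 6708" lives; most of the other fields are empty most of the time, which is part of why the format feels so baroque.)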
Sorry, not the Mozart one. And not the Italian one Mozart based his
version on, but the original French play the Italian one was based on:

https://en.wikipedia.org/wiki/The_Marriage_of_Figaro_(play)

The quote gets translated a few ways out of the 300-year-old French:

https://www.oxfordreference.com/display/10.1093/acref/9780191826719.001.0001/q-oro-ed4-00000807

And to clarify again, I mean Wolfgang, not his equally (if not more)
talented sister Maria, who toured together with her sibling as child
prodigies but was sidelined as soon as she reached "marriageable age"
and had to teach piano for a living:

https://en.wikipedia.org/wiki/Maria_Anna_Mozart

Some letters from Wolfgang praising her compositions have survived, but
her parents destroyed all her actual sheet music because it had cooties.
Next time people talk about the "great men of history"... Don't get me
started about Einstein's first wife.

>> In ascii, wcwidth() is basically isprint() plus "tab is weird".
>>
>> For unicode, wcwidth() comes into play. The unicode bureaucracy
>> committee being too microsofted to competently provide one is
>> irrelevant to wcwidth() not being needed for ascii.
>>
>> (I also note the assumption of monospaced fonts in all this. Java's
>> fontmetrics() was about measuring pixel counts in non-monospaced
>> fonts, which this doesn't even contemplate.)
>
> this is why i keep telling you that wcwidth() only really makes sense
> for tty-based stuff. and even there ...

I need to figure out where to wrap lines in command line editing and
text editors and so on. (I have been relieved of duty on vi, but I still
need to make shell command line editing work. Plus fold and so on. And
screen, and watch. Might do a nano-alike at some point. This is already
sort of in top...)

> i'm curious whether the different terminal emulators actually behave
> the same in any of the interesting cases.
> (_especially_ when you get to the "that can't happen in well-formed
> text in the language that uses that script" cases.)

I have an ANSI probe sequence to ask where the cursor is, but even if I
wanted to be that chatty (and didn't mind that the time it takes to get
a response is arbitrary and variable, with no response actually
guaranteed to come anyway, and other input surrounding the response),
it's already bad if the output has wrapped and scrolled the screen since
the last time I asked. And if I _disable_ screen wrap then A) I dunno if
it's truncated the output, B) lots of other stuff breaks (it's like
leaving the screen in raw mode, only SUBTLY wrong, and yes QEMU does
this from time to time and drives bash line editing NUTS, which is why
run-qemu.sh echoes the relevant "stop doing that" sequence AND mkroot's
init also outputs it)...

Which means I need a wcwidth() to know how many columns the next
character will advance the cursor in the terminal before outputting it.

>> Not that I particularly want to ship a large ascii table either. When
>> I dug into musl's take on this, I was mostly reverse engineering
>> their compression format and then going "huh, yeah you probably do
>> want to compress this".
>>
>> I could generate the table I listed with a C program that runs
>> ispunct() and similar on every unicode code point and outputs the
>> result. I could then compare what musl, glibc, and bionic produce for
>> their output. The problem is it's not authoritative, it's downwind of
>> the "macos is still using 2002 data" issue that keeps provoking
>> this. :(
>
> i'm really confused that you keep mentioning ascii. if you really mean
> ispunct() here, say, and not iswpunct(),

The difference between them is that ispunct() has always taken an int,
but the C committee was cowardly and refused to make it actually respond
to the whole range, so they created a new function to do the same thing.
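(For what it's worth, the "know how many columns before outputting it" loop above is sketchable with just mbrtowc() and wcwidth(). This is an illustrative sketch, not toybox's actual lib/utf8.c code, and note that which multibyte sequences decode at all depends on the locale:)

```c
#define _XOPEN_SOURCE 700 // wcwidth() needs this on glibc
#include <string.h>
#include <wchar.h>

// Measure how many terminal columns a string will advance the cursor,
// one codepoint at a time, so line editing knows where to wrap BEFORE
// writing anything. Returns -1 if any character is unmeasurable
// (bad sequence, or wcwidth() < 0), meaning the caller should escape
// it rather than emit it raw and lose track of the cursor.
int display_width(const char *s)
{
  mbstate_t state;
  wchar_t wc;
  size_t len, left = strlen(s);
  int cols = 0, w;

  memset(&state, 0, sizeof(state));
  while (*s) {
    len = mbrtowc(&wc, s, left, &state);
    if (len == (size_t)-1 || len == (size_t)-2) return -1; // bad/partial
    if (!len) break; // decoded NUL: can't happen here, we stop at NUL
    if ((w = wcwidth(wc)) < 0) return -1; // control or unmeasurable
    cols += w;
    s += len;
    left -= len;
  }

  return cols;
}
```

(In the plain "C" locale only single-byte characters decode; with a UTF-8 locale selected via setlocale(LC_CTYPE, ...) the same loop measures multibyte input, modulo all the table-staleness complaints above.)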
At least fseeko() can blame LP64's "long and pointer are the same size"
splash damage. (Moore's Law didn't advance the components in a
coordinated manner: we hit the need for >2 gigabyte files ten years
before we hit the need for >4 gigabytes of system RAM and thus 64 bit
registers...)

(I suppose the C committee was fighting IBM and Microsoft for 10 years
before utf8 happened, and then the unicode committee had Microsoft on it
and thus combining characters were placed AFTER the characters they
combine with, so you never know when you're done rendering a character
until you've started reading the one AFTER it, which is just insane...)

> then that's a completely solved problem --- ispunct() only covers
> ascii, and there's no implementation we've seen that differs from any
> of the others there.

Because the problem in that part of the data set is well defined and
everybody agrees on what success looks like.

>     The c argument is an int, the value of which the application
>     shall ensure is a character representable as an unsigned char or
>     equal to the value of the macro EOF. If the argument has any
>     other value, the behavior is undefined.

Is this integer punctuation? Yes/no.

>> None of this seems likely to handle my earlier "widest unicode
>> characters" thread with the REAL oddball encodings, but none of the
>> current ones do either and that's ok. Just acknowledging that there
>> needs to BE a special case exception list is the first step to having
>> a GOOD special case exception list that can include that sort of
>> thing. (And have all the arguments about excluding stuff to keep it
>> down to a dull roar...)
>>
>> I.E. if the table of standard data can't cover everything it
>> shouldn't try to, so what's the sane subset we CAN cleanly automate?
>
> well, the most likely exception you'll encounter isn't about the
> _characters_ it's about the _locale_ you're asking the question for.
> one problem with unification (not just "han") is that you have
> multiple "characters" (in terms of "what do they mean"/"how do they
> behave") mapped to the same codepoint.

_I_ don't, no. I'm using the "C" locale with UTF-8 support.

> (specifically here i'm thinking of turkish/azeri i.)

Needing to know the locale to render UNICODE CODE POINTS defeats the
purpose of unicode: what values should I get in the "C" locale with
UTF-8 support? (Congratulations to microsoft for reintroducing the
concept of CODE PAGES to UNICODE, but I'm not humoring them. Too broken
for words.)

Maybe the table annotates these as "weird" and our stub exception
handler returns 0 for all their attributes. I'm ok with that. I'm not
trying to get everything right, I'm trying to 80/20 this. If somebody
who isn't me wants to write a big exception handler that cares about
locale for broken characters the standards committee seemingly accepted
bribes to include, fine.

When characters exist that the table cannot, by itself, provide answers
for, emitting them unescaped into the shell's command line editing, or
into "watch" output, or into fields that "ps" or "top" are trying to
align, means stuff may leak out of its box and scroll the screen
inappropriately, and I am FINE with that. But I'm also fine escaping
them: lib/utf8.c already has crunch_escape() doing the "standard
escapes" that vi was doing when I first fed it a bunch of weird values
to see how it would cope years ago. I may not get to use them in vi
itself because I'm not writing that, but I can still have line editing
and friends print a variety of escapes for codepoints I can't reliably
measure. It's not pretty, but it means I retain control of where the
cursor is, and the data can even be represented unambiguously with a bit
of work.

Being unable to tell ascii from kanji when statically linked is a bigger
issue from where I'm standing.

(P.S.
I still need an ANSI escape sequence parser to do all this right, but I
wrote my first one of those in DOS as a teenager. Probably won't do the
full "man 4 console_codes" collection, but I can handle a lot and then
^[ the ESC for sequences I don't recognize in "watch" and "less" and so
on...)

>> Rob

Still Rob
_______________________________________________
Toybox mailing list
[email protected]
http://lists.landley.net/listinfo.cgi/toybox-landley.net
