On 5/30/24 16:12, enh wrote:
>> > hmm... looking at Apple's online FreeBSD code, it looks like they have
>> > very different (presumably older) FreeBSD code
>> > [https://opensource.apple.com/source/Libc/Libc-320.1.3/locale/FreeBSD/tolower.c.auto.html],
>> > and the footer of the file that reads implies that they're using data
>> > from Unicode 3.2 (released in 2002, which would make sense given the
>> > 2002 BSD copyright date in the tolower.c source):
>>
>> Sigh, can't they just ship machine consumable bitmaps or something?
>
> because everyone wants different formats. even the same library has
> changed over time. (and not just because characters went from 16 bits
> to 21 bits!)

Conversion from a simple format seems straightforward to me.

Part of my frame of reference here is Tim Berners Lee inventing the 404 error.
That was Tim's big advance that made HTML work where Ted Nelson's overdesigned
hyper-cyber-iText didn't. Tim 80/20'd the problem by just handling the easy
cases (we have the data) and punting the hard cases (updating links when they
moved) to humans.

Ted published his hyper-hype paper in 1965 and then failed to interest anyone
in it for a quarter century before Tim made something actually useful (beating
Gopher by about 6 months). Crediting Ted as the inventor of html is like
crediting Jules Verne as the inventor of the submarine, or H.G. Wells as the
(eventual) inventor of the time machine. (Lazerpig had a rant about this in his
video on stealth planes: the inventor is the person who made it WORK, not who
came up with the idea of humans flying or a knob on the wall that controls the
air temperature.)

So to me, the question is "how much can we put in a simple format", and then
have a list of broken characters you need an exception handler function for.
How do we 80/20 this?

>> I can have
>> my test plumbing pull "standards" files, ala:
>>
>> https://github.com/landley/toybox/blob/master/mkroot/packages/tests
>>
>> But an organization shipping a PDF or 9 interlocking JSON files with a turing
>> complete stylesheet doesn't help much.
>
> (not really the point, but the one you want for the stuff you're
> talking about here is actually just a text file.

Let's see... Ah:

https://www.unicode.org/L2/L1999/UnicodeData.html

That's a bit long. My suggestion had 9 decimal numbers, this has "IDEOGRAPHIC
TELEGRAPH SYMBOL FOR JANUARY" as one of fifteen fields, with "<compat> 0031
6708" being another single field. How nice.

(And still extensive warnings that this doesn't cover everything. I think "too
much is never enough" was an MTV slogan back in the 1980s? Ah, it's from "The
Marriage of Figaro" in 1784.)

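For scale, pulling the few fields a ctype-style table cares about out of one UnicodeData.txt record is not much code. What follows is only a rough sketch, assuming the semicolon-separated, fifteen-field layout that page describes (general category in field 2, simple upper/lower mappings in fields 12 and 13). The names ud_record and ud_parse() are made up for illustration, and the "<..., First>"/"<..., Last>" range marker entries are passed through without being expanded:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define UD_FIELDS 15

// Hypothetical record type: just the fields a ctype-style table would need.
struct ud_record {
  unsigned code;       // field 0: code point, in hex
  char category[4];    // field 2: general category, e.g. "Lu"
  unsigned upper;      // field 12: simple uppercase mapping (code if empty)
  unsigned lower;      // field 13: simple lowercase mapping (code if empty)
};

// Split one line in place on ';'. strtok() would collapse the empty
// fields, so walk the string by hand instead.
// Returns 0 on success, -1 if the line doesn't have exactly 15 fields.
int ud_parse(char *line, struct ud_record *out)
{
  char *field[UD_FIELDS];
  int n = 0;

  line[strcspn(line, "\r\n")] = 0;   // trim the line ending
  field[n++] = line;
  for (char *s = line; *s; s++) {
    if (*s != ';') continue;
    *s = 0;
    if (n == UD_FIELDS) return -1;   // too many fields
    field[n++] = s + 1;
  }
  if (n != UD_FIELDS) return -1;     // too few fields

  out->code = strtoul(field[0], 0, 16);
  snprintf(out->category, sizeof(out->category), "%s", field[2]);
  out->upper = *field[12] ? strtoul(field[12], 0, 16) : out->code;
  out->lower = *field[13] ? strtoul(field[13], 0, 16) : out->code;

  return 0;
}
```

Feeding it each line from fgets() and printing code, category, upper and lower gives a diffable plain-text reduction of the file. Anything beyond those fields (decompositions, titlecase, multi-character case mappings) is deliberately left for an exception list.
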
aosp/external/icu/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt
aosp/external/icu/android_icu4j/src/main/tests/android/icu/dev/data/unicode/UnicodeData.txt
aosp/external/icu/icu4c/source/data/unidata/UnicodeData.txt
aosp/external/pcre/maint/Unicode.tables/UnicodeData.txt
aosp/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt
aosp/out/soong/workspace/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt

Android seems to have checked in multiple copies of this file.

$ for i in $THAT; do [ -n "$OLD" ] && diff -u $OLD $i; OLD=$i; done | grep +++
+++ aosp/external/pcre/maint/Unicode.tables/UnicodeData.txt 2023-08-18 15:16:31.239657629 -0500
+++ aosp/external/cronet/third_party/icu/source/data/unidata/UnicodeData.txt 2023-08-18 15:14:44.351661450 -0500

And I need to re-pull my tree for them to match.

> i've repeatedly been
> tempted to teach unicode(1) to read it, since it's always installed on
> macOS and debian anyway [for values of "always" that include "all my
> machines, anyway"], to be able to show far more information about any
> given character.)

I've thrown a note on the todo heap...

>> Which is _sad_ because there's only a dozen ispunct() variants that read a
>> bit out of a bitmap (and haven't significantly changed since K&R: neither
>> isblank() nor isascii() is worth the wrapper), plus a toupper/tolower pair
>> that map integers with "no change" being the common case.
>
> (one of the things you'll learn from parsing the file is that that's
> not how toupper()/tolower() works for all characters. plus there's
> titlecase. plus case folding.)

"For all characters". I'm just looking for low hanging fruit and a list of
exceptions to punt to a function.

>> Plus unicode has wcwidth().
>
> no, it doesn't. (i wouldn't be maintaining my own if it did!)

In ascii, wcwidth() is basically isprint() plus "tab is weird". For unicode,
wcwidth() comes into play.
The unicode bureaucracy committee being too microsofted to competently provide
one is irrelevant to wcwidth() not being needed for ascii.

(I also note the assumption of monospaced fonts in all this. Java's
fontmetrics() was about measuring pixel counts in non-monospaced fonts, which
this doesn't even contemplate.)

>> So code, alpha, cntrl, digit, punct, space, width, upper, lower. Something
>> like:
>>
>> 0,0,0,0,0,0,0,0,0
>> 13,0,1,0,0,1,0,0,0
>> 32,0,0,0,0,1,1,0,0
>> 57,0,0,1,0,0,1,0,0
>> 58,0,0,0,1,0,1,0,0
>> 65,1,0,0,0,0,1,0,97
>>
>> No, that doesn't cover weird stuff like the right-to-left gearshift or the
>> excluded mapping ranges or even the low ascii characters having special
>> effects like newline and tab, but those aren't really "characters" are they?
>
> those are exactly the weeds where all the dragons lurk. even the
> EastAsianWidth property, which is as close as unicode comes to having
> "wcwidth()" has "ambiguous" _and_ "neutral" --- two distinct special
> cases :-)

I'm trying for html, not hypertext. I expect 404 errors something/someone else
will have to handle. A function returning "dunno" is acceptable in this
context. Somebody else writing a wrapper function to intercept "dunno" and
handle 37 weird bits is "an exercise left for the reader".

>> Special case the special cases, don't try to represent them in a table like
>> that beyond what ispunct() and toupper() and friends should return. (Maybe
>> have a -1 width for "weird".)
>>
>> But again, that's my dunning-kruger talking. I don't see WHY it's so
>> complicated. Arguing about efficient representation isn't the same as arguing
>> about "this is the data, it should be easy to diff new releases against the
>> previous release to see what changed, so why don't they publish this?"
>
> i suspect they'd ask "what do you need the diff for? surely you're not
> _manually_ translating this into some other form?" :-)

A) Nuts to their white mice.
2) I want to see what changed so I can confirm I can ignore it (or add
"dunno").
III) The python approach of enforcing version number without caring what's IN
the version excludes the possibility of other implementations and extensions.
If a Korean standards body wanted to take its country range and define its own
local properties for code points within there, that's irrelevant to unicode
committee draft document release versioning procedure appendix formatting
clarification updates (volume III).

The data should not be precious. It's just data. NOT being able to diff it is
suspicious.

>> Heck, if your width options are 0, 1, 2, and 3 (with 3 being "exception,
>> look it up in another table"), all the data except case mapping is one byte
>> per character...
>
> fwiw, because it's written in terms of icu4c,

An external black box library dependency I don't want to import, and which you
didn't want to include in static binaries. (And the above file list had 3 "icu"
implementations next to each other.)

> which is in turn mostly just exposing the unicode data,

Can we go from "mostly" to "all"? :)

Not that I particularly want to ship a large ascii table either. When I dug
into musl's take on this, I was mostly reverse engineering their compression
format and then going "huh, yeah you probably do want to compress this".

I could generate the table I listed with a C program that runs ispunct() and
similar on every unicode code point and outputs the result. I could then
compare what musl, glibc, and bionic produce for their output. The problem is
it's not authoritative, it's downwind of the "macos is still using 2002 data"
issue that keeps provoking this. :(

> the bionic implementation of wcwidth()
> gives a decent "pseudocode" view of how you'd implement it in terms of
> the unicode data directly:
> https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wcwidth.cpp
>
> (at least "to the best of my knowledge".
> since there is no standard,
> and this function most recently changed _yesterday_, i can give no
> guarantee :-) )

That looks like the exception handler wrapper function I was referring to
earlier. :)

None of this seems likely to handle my earlier "widest unicode characters"
thread with the REAL oddball encodings, but none of the current ones do either
and that's ok. Just acknowledging that there needs to BE a special case
exception list is the first step to having a GOOD special case exception list
that can include that sort of thing. (And have all the arguments about
excluding stuff to keep it down to a dull roar...)

I.E. if the table of standard data can't cover everything it shouldn't try to,
so what's the sane subset we CAN cleanly automate?

Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net
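
The "run ispunct() and similar on every code point and output the result" idea from this message can be sketched roughly as below. This is a guess at the shape, not a tested tool: struct row, classify() and dump_table() are made-up names, the wide-character iswalpha()/towlower()/wcwidth() family stands in for the narrow ctype calls, a 32-bit wchar_t is assumed, and "0" in the upper/lower columns is taken to mean "no change" as in the sample rows.

```c
#define _XOPEN_SOURCE 700  // asks glibc to declare wcwidth()
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

// One row of the proposed table (hypothetical layout): code, alpha,
// cntrl, digit, punct, space, width, upper, lower.
struct row {
  unsigned code;
  int alpha, cntrl, digit, punct, space, width;
  unsigned upper, lower;   // 0 means "no change"
};

// Ask the C library what it thinks about one code point.
struct row classify(unsigned code)
{
  wint_t c = code, up = towupper(c), lo = towlower(c);
  struct row r;

  r.code = code;
  r.alpha = !!iswalpha(c);
  r.cntrl = !!iswcntrl(c);
  r.digit = !!iswdigit(c);
  r.punct = !!iswpunct(c);
  r.space = !!iswspace(c);
  r.width = wcwidth(c);    // libc's answer; can be -1 for nonprintables
  r.upper = (up == c) ? 0 : (unsigned)up;
  r.lower = (lo == c) ? 0 : (unsigned)lo;

  return r;
}

// Print rows for first..last inclusive as comma-separated decimal.
void dump_table(FILE *out, unsigned first, unsigned last)
{
  for (unsigned code = first; code <= last; code++) {
    struct row r = classify(code);

    fprintf(out, "%u,%d,%d,%d,%d,%d,%d,%u,%u\n", r.code, r.alpha, r.cntrl,
      r.digit, r.punct, r.space, r.width, r.upper, r.lower);
  }
}
```

A main() would presumably call setlocale(LC_ALL, "") under a UTF-8 locale and then dump_table(stdout, 0, 0x10FFFF), since the answers for non-ASCII code points depend on the locale on at least glibc. Running the same binary against musl, glibc, and bionic and diffing the output shows where they disagree, but as noted above the result is only as authoritative as each libc's data.
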