I’m not responding without thinking, as I was blamed for doing, but it is painful for me to dig into what Ken explained about how we should be consuming UCD data. I’ll now try to bring some more clarity to the topic.

> On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> […]
> >
> > Third, please remember that folks who come here complaining about the
> > complications of parsing the UCD are a very small percentage of a very
> > small percentage of a very small percentage of interested parties.

OK, among the roughly 700 list subscribers on average, relatively few ever complain about anything, let alone about this particular topic. But we should always keep in mind that many folks out there complaining about Unicode don’t come here to do so.

> > Nearly everybody who needs UCD data should be consuming it as a
> > secondary source (e.g. for reference via codepoints.net), or as a
> > tertiary source (behind specialized API's, regex, etc.),

As already suggested, “as” should probably read “via” in that part.

> > or as an end
> > user (just getting behavior they expect for characters in applications).

That is more than a simple statement about who is consuming UCD data in which way, as you say “should.” There seem to be assumptions that diving into the raw data is discouraged; that folks reading file headers are not doing well; that the data should be assembled only in certain ways; and that ignorant people shouldn’t open the UCD cupboard to pick a file they deem useful.

If so, then it might be surprising to learn that when I submitted the document “Bidi-mirroring mathematical symbols issues feedback”
http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html
I had started as a quasi-end-user not getting the behavior I expected for characters in browsers. I was looking for characters that are bidi-mirrored by glyph exchange, as implemented in web browsers, because I wanted end-users to be able to experience bidi-mirroring as it works. Unexpectedly, a number of math symbols did not mirror, even though many of them are scalar neighbors.

> > Programmers who actually *need* to consume the raw UCD data files and
> > write parsers for them directly should actually be able to deal with the
> > format complexity -- and, if anything, slowing them down to make them
> > think about the reasons for the format complexity might be a good thing,

I can see one main reason for the format complexity: data from the various properties don’t necessarily telescope the same way to make for small files. The complexity of the UCD would then mainly be self-induced, by packing data into one small file per property rather than adding the value to each relevant code point in one large list, as UnicodeData.txt does.

While I’m now taking the time to write this up because I’m committed to processing that information, we can think of many people who don’t like to be slowed down trying to find out why Unicode changed the UCD design, when following the original idea of a large CSV list would have been straightforward, possibly by setting up a new list if the first one got stuck. What I can figure out is that whenever a new property was added, that particular property was always thought of as being the last one. (At some point the many files were then dumped into the known XML files.)

If the UCD is to be made of small files, it is necessarily complex, and the conclusion is that there should be another large CSV grid to make things simple again, and as lightweight as they can be.

> > as it tends to put the lie to the easy initial assumption that the UCD
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence with “simple” taken out? It then appears to me not to be a lie. One attribute comes to mind that is so complex that its design even changed over time, despite Unicode’s commitment to stability.

The Bidi_Mirroring_Glyph property was originally designed to include “best-fit” pairs for least-bad display in applications not supporting RTL glyphs (i.e. without OpenType support), with the legibility of math formulae in mind. Later (probably due to a poorly written OpenType spec), no more best-fit pairs were added to BidiMirroring.txt, as if OpenType implementers would not remove the best-fit pairs anyway before using the file (while the spec says to use it as-is). That then led to the display problem pointed out above. I’ll spare you the particular problem related to three pairs of symbols with tilde, as well as the missing Bidi_Mirroring_Type property, given that UTC was not interested.

So you can understand that I’m not unaware of the complexity of the UCD. Still, I don’t think that this can be an argument for not publishing a medium-size CSV file with scalar values listed as in UnicodeData.txt.

> > […]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

I.e., not every spreadsheet application seems to do the job as I need it.

Regards,

Marcel
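
P.S. For illustration only, here is a minimal Python sketch of how range-packed, per-property data could be flattened into one row per code point, in the spirit of the large CSV grid discussed above. The sample lines, the helper names `expand` and `build_grid`, and the column layout are assumptions made up for this example; they are not actual UCD file contents, nor an existing tool or a proposed format.

```python
import csv
import io

# Hypothetical sample in the "range ; value # comment" style used by many
# UCD property files. Illustrative lines only, not copied from the UCD.
SCRIPT_SAMPLE = """\
0041..005A ; Latin # uppercase letters
0061..007A ; Latin # lowercase letters
0391 ; Greek # single code point, no range
"""

# Hypothetical sample of "code point; mirrored code point" pairs.
MIRROR_SAMPLE = """\
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
"""


def expand(text):
    """Map each code point to its value, unpacking 'first..last' ranges."""
    values = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        span, value = (field.strip() for field in data.split(";", 1))
        first, _, last = span.partition("..")
        # A single code point is treated as a range of length one.
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            values[cp] = value
    return values


def build_grid(columns):
    """Join per-property dicts into CSV text, one row per code point."""
    code_points = sorted(set().union(*columns.values()))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["cp", *columns])  # header: cp plus property names
    for cp in code_points:
        # Empty field where a property has no value for this code point.
        writer.writerow([f"{cp:04X}", *(col.get(cp, "") for col in columns.values())])
    return out.getvalue()


grid = build_grid({
    "Script": expand(SCRIPT_SAMPLE),
    "Bidi_Mirroring_Glyph": expand(MIRROR_SAMPLE),
})
print(grid)
```

The trade-off shows up immediately: the three range lines of the first sample become 53 rows, which is why ranges keep the per-property files small, while the flat grid opens directly in a spreadsheet without any parsing step.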