A good compromise between human readability, machine processability and file size would be YAML. Unlike JSON, YAML supports comments, anchors and references, multiple documents in one file, and several other features.
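For instance, a minimal sketch with PyYAML (the property names below are invented for illustration, they are not actual UCD properties) exercises all three features at once:

import yaml  # pip install pyyaml

SAMPLE = """
# comments are allowed, unlike in JSON
defaults: &common   # anchor: a reusable node
  script: Zyyy
  whitespace: false
space:
  <<: *common       # alias/merge key: inherit the anchored defaults
  whitespace: true  # ...and override one field
---
# a second document in the same stream
version: "11.0"
"""

for doc in yaml.safe_load_all(SAMPLE):
    print(doc)
# prints two documents; "space" inherits script: Zyyy from the anchor
# and overrides whitespace to True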
Regards,

Marius Spix


On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode wrote:

> On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> >
> > Well, an alternative to XML is JSON, which is more compact and
> > faster/simpler to process;
>
> Thanks for pointing out the problem and the solution alike. Indeed
> the main drawback of the XML format of the UCD is that it results in
> an “insane” file size. “Insane” was applied to the number of
> semicolons in UnicodeData.txt, but that is irrelevant. What is really
> insane is the file size of the XML versions of the UCD. Even without
> Unihan, it may take up to a minute or so to load in a text editor.
>
> > However, JSON has no explicit schema, unless the schema is made
> > part of the data itself, complicating its structure (with many
> > levels of arrays of arrays, in which case it becomes less easy to
> > read by humans, but more adapted to automated processes for fast
> > processing).
> >
> > I'd say that the XML alone is enough to generate any JSON-derived
> > dataset that will conform to the schema an application expects to
> > process fast (and with just the data it can process, excluding
> > various extensions still not implemented). But the fastest
> > implementations are also based on data tables encoded in code
> > (such as DLLs or Java classes), or custom database formats (such
> > as Berkeley DB), also generated automatically from the XML,
> > without the processing cost of decompression schemes and parsers.
> >
> > Still today, even if XML is not the usual format used by
> > applications, it is still the most interoperable format that
> > allows building all sorts of applications in all sorts of
> > languages: the cost of parsing is left to an application
> > builder/compiler.
>
> I’ve tried an online tool to convert ucd.nounihan.flat.xml to CSV.
> The tool is great and offers a lot of options, but given the
> “insane” file size, my browser was in trouble for over two hours,
> until I shut down the computer manually. From what I could see in
> the result field, there are many bogus values, meaning that their
> presence is useless in the tags of most characters. And while many
> attributes have cryptic names in order to keep the file size
> minimal, some attributes have overlong values, i.e. the design is
> inconsistent. E.g. in every character we read: jg="No_Joining_Group".
> That is bogus. One would need to take such attributes off the tags
> of most characters, and even in the characters where they are
> relevant, the value would simply be "No". What’s the use of
> abbreviating "Joining Group" to "jg" in the attribute name if in the
> value it is written out? And I’m quoting from U+0000. Further, many
> values are set to a crosshatch instead of simply being removed from
> the characters where they are empty. Then there are the many
> instances of “undetermined script”, resulting in *two* attributes
> with the value "Zyyy". Then in almost each character we’re told that
> it is not a whitespace, not a dash, not a hyphen, and not a
> quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N". One
> couldn’t tell that the UCD actually benefits from the flexibility of
> XML, given that many attributes are systematically present even
> where they are useless. Perhaps the ucd-*.xml files would be two
> thirds, half, or one third of their actual size if they were
> properly designed.
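> As an illustration, a streaming pass could strip such default-valued
> attributes instead of materializing the whole tree. A rough sketch
> on my side (Python; the table of defaults below is partial and only
> illustrative):
>
>     import xml.etree.ElementTree as ET
>
>     # Namespace used by the UCD in XML (UAX #42); DEFAULTS lists a
>     # few attribute values that merely repeat the defaults.
>     NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"
>     DEFAULTS = {"jg": "No_Joining_Group", "Dash": "N",
>                 "WSpace": "N", "Hyphen": "N", "QMark": "N"}
>
>     def trimmed_chars(path):
>         # iterparse streams the file instead of loading it whole
>         for _, elem in ET.iterparse(path, events=("end",)):
>             if elem.tag == NS + "char":
>                 yield {k: v for k, v in elem.attrib.items()
>                        if DEFAULTS.get(k) != v}
>                 elem.clear()  # drop parsed data to keep memory flat
>
>     for attrs in trimmed_chars("ucd.nounihan.flat.xml"):
>         pass  # e.g. emit a leaner record per code point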
> > Some apps embed the compilers themselves and use a stored cache
> > for faster processing: this approach allows easy updates by
> > detecting changes in the XML source, and then downloading them.
> >
> > But in CLDR such updates are generally not automated: the general
> > scheme evolves over time and there are complex dependencies to
> > check so that some data becomes usable
>
> Should probably read *un*usable.
>
> > (frequently you need to implement some new algorithms to follow
> > the processing rules documented in CLDR, or to use data not
> > completely validated, or to allow applications to provide their
> > own overrides for insufficiently complete datasets in CLDR, even
> > if CLDR provides a root locale and applications are supposed to
> > follow the BCP 47 fallback resolution rules; applications also
> > have their own needs regarding which language codes they use or
> > need, and CLDR provides many locales that many applications are
> > still not prepared to render correctly, and many application
> > users complain if an application is partly translated and
> > contains too many fallbacks to another language, or worse, to
> > another script).
>
> So the case is even worse than what I could see when looking into
> CLDR. Many countries, including France, don’t care about the data
> of their own locale in CLDR, but I’m not going to vent about that
> on Unicode Public, because that involves language offices and
> authorities, and would have political entanglements.
>
> Staying technical, I can tell so far about the file header of
> UnicodeData.txt that I can see zero technical reasons not to add
> it. Processes using the file to generate an overview of Unicode
> also use other files and are thus able to process comments
> correctly, whereas those processes using UnicodeData.txt to look up
> character properties provided in the file would start searching at
> the code point. (Perhaps there are compilers building DLLs from the
> file.)
>
> On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote:
> >
> > UnicodeData.txt was devised long before any of the other UCD data
> > files. Though it might seem like a simple enhancement to us,
> > adding a header block, or even a single line, would break a lot
> > of existing processes that were built long ago to parse this
> > file.
> >
> > So Unicode can't add a header to this file, and that is the
> > reason the format can never be changed (e.g. with more columns).
> > That is why new files keep getting created instead.
> >
> > The XML format could indeed be expanded with more attributes and
> > more subsections. Any process that can parse XML can handle
> > unknown stuff like this without misinterpreting the stuff it does
> > know.
> >
> > That's why the only two reasonable options for getting UCD data
> > are to read all the tab- and semicolon-delimited files, and be
> > ready for new files, or just read the XML. Asking for changes to
> > existing UCD file formats is kind of a non-starter, given these
> > two alternatives.
> >
> > --
> > Doug Ewell | Thornton, CO, US | ewellic.org
> >
> > -------- Original message --------
> > Message: 3
> > Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
> > From: Marcel Schneider via Unicode
> >
> > Curiously, UnicodeData.txt is lacking the header line. That makes
> > it inflexible. I never wondered why the header line is missing,
> > probably because, compared to the other UCD files, the file looks
> > really odd without a file header showing at least the version
> > number and datestamp.
> > It’s like the file was made up for dumb parsers unable to handle
> > comment delimiters, and never to be upgraded to do so.
> >
> > But I like the format, and that’s why at some point I submitted
> > feedback asking for an extension. [...]
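> To make the point concrete: tolerating a header line would cost a
> parser a single extra check. A rough sketch (Python, untested on my
> side, and assuming "#" as the comment delimiter, as in the other
> UCD files):
>
>     def parse_unicode_data(path):
>         with open(path, encoding="utf-8") as f:
>             for line in f:
>                 line = line.strip()
>                 # the one extra check: skip blanks and comments,
>                 # so a (hypothetical) header would be ignored
>                 if not line or line.startswith("#"):
>                     continue
>                 fields = line.split(";")
>                 # field 0 = code point, 1 = Name,
>                 # 2 = General_Category
>                 yield fields[0], fields[1], fields[2]
>
>     for cp, name, gc in parse_unicode_data("UnicodeData.txt"):
>         print(cp, name, gc)   # first record: 0000 <control> Cc
>         break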