Is the term "exponentially" really appropriate? The NamesList file is not so large, and the growth would remain linear.
Anyway, this file (in its current CSV format or an XML format) does not need to be part of the core UCD files; it can be offered as a separate download for people needing it. One benefit I would see is that converting it to XML with an automated tool could ensure that it is properly formatted. But I believe that Unibook already parses it to produce consistent code charts, so its format is already checked and this advantage is not really effective. The main benefit would be that the file could then be edited and updated using standard tools.

XML is not the only choice available. JSON today is simpler to parse and easier to read (and even edit) by humans, and it can embed indentation whitespace (outside quoted strings) that is not considered part of the data (unlike XML, where such whitespace "pollutes" the DOM with extra text nodes). In fact I believe that the old CSV formats used in the original UCD could be deprecated in favor of JSON (the old formats could still be generated automatically for applications that want them). This would unify all formats behind a single parser in all tools. Files in the older CSV or tabulated formats would move to a separate derived collection, and users could then choose whichever format they prefer (legacy, now derived; JSON; or XML, if people really want it).

The advantage of XML, however, is its stability across later updates that may need to insert additional data or annotations: with JSON or CSV/tabulated formats, the number of columns is fixed, and every column must be fed at least an empty value even when it is not significant. Note that the legacy formats also carry comments after hash signs, but many comments found at the end of data lines have some parsable meaning of their own, so they are structured, and may be followed by an extra hash sign for a real comment.

The advantage of the existing CSV/tabulated formats is that they are extremely easy to import into a spreadsheet for easier use by a human (I won't request the UTC to provide these files in XLS/XLSX or ODS format...). But JSON and XML could be imported just as well, provided that each data file remains structured as a 2D grid without substructures within cells (otherwise you need to provide an explicit schema).

Note, however, that some columns are frequently structured: the column containing the code point key frequently specifies a code range using an additional separator, and so do the columns whose value is an ordered list of code points, using a space separator and possibly a leading subtag (such as the decomposition data). In XML you would translate these into separate subelements or additional attributes; in JSON you would need to represent these structured cells as subarrays. So the data is *already* not strictly 2D (converting it to a pure 2D format, for relational use, would require adding extra key or referencing "ID" columns, and the converted files would be much harder for humans to read and edit in *any* format: CSV/tabular, JSON or XML). See the sketch below for what this looks like in practice.

Other candidate formats include Turtle (generally derived from OWL, but replacing the XML envelope with a tabulated "2.5D" format that is much easier than XML to read and edit, much more compact than XML-based formats, and easier to parse)...
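To make the "not strictly 2D" point concrete, here is a minimal Python sketch (my own illustration only; the JSON shape is not an official or proposed UCD schema, and the helper names are mine). It splits a PropList.txt-style line into its data fields and its trailing hash comment, expands a range key of the form "XXXX..YYYY", and shows how a UnicodeData.txt decomposition cell with a leading subtag would become a substructure rather than a flat string:

    # Illustrative sketch only: the JSON shape below is not an official
    # or proposed UCD schema, just one way the nesting could look.
    import json

    def parse_ucd_line(line):
        """Split a semicolon-delimited UCD line into its data fields
        and its trailing hash comment, expanding the code point key."""
        data, _, comment = line.partition('#')
        fields = [f.strip() for f in data.split(';')]
        if not fields[0]:
            return None  # blank or comment-only line
        # The code point key may be a single point or a ".." range.
        first, _, last = fields[0].partition('..')
        key = {'first': first, 'last': last} if last else {'point': first}
        return {'codepoints': key,
                'fields': fields[1:],
                'comment': comment.strip() or None}

    def parse_decomposition(cell):
        """Split a UnicodeData.txt decomposition cell such as
        '<compat> 0020 0308' into its optional subtag and mapping."""
        parts = cell.split()
        if parts and parts[0].startswith('<'):
            return {'tag': parts[0].strip('<>'), 'mapping': parts[1:]}
        return {'mapping': parts}

    line = "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>"
    print(json.dumps(parse_ucd_line(line), indent=2))
    print(json.dumps(parse_decomposition("<compat> 0020 0308"), indent=2))

Even this trivial converter has to emit nested objects and arrays for the range key, the mapping list and the comment, which is exactly why a pure 2D export cannot be faithful without extra ID columns.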
2016-03-14 3:14 GMT+01:00 Marcel Schneider <[email protected]>:

> On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote:
>
> > My point is that of J.S. Choi and Janusz Bień: the problem with
> > declaring NamesList off-limits is that it does contain information that
> > is either:
> >
> > • not available in any other UCD file, or
> > • available, but only in comments (like the MAS mappings), which aren't
> > supposed to be parsed either.
> >
> > Ken wrote:
> >
> > > [ .. ] NamesList.txt is itself the result of a complicated merge
> > > of code point, name, and decomposition mapping information from
> > > UnicodeData.txt, of listings of standardized variation sequences from
> > > StandardizedVariants.txt, and then a very long list of annotational
> > > material, including names list subhead material, etc., maintained in
> > > other sources.
> >
> > But sometimes an implementer really does need a piece of information
> > that exists only in those "other sources." When that happens, sometimes
> > the only choices are to resort to NamesList or to create one's own data
> > file, as Ken did by parsing the comment lines from the math file. Both
> > of these are equally distasteful when trying to be conformant.
>
> If so, then extending the XML UCD with all the information that is
> actually missing in it while available in the Code Charts and
> NamesList.txt, ends up being a good idea. But it still remains that such a
> step would exponentially increase the amount of data, because items that
> were not meant to be systematically provided, must be.
>
> Further I see that once this is completed, other requirements could need
> to tackle the same job on the core specs.
>
> The point would be to know whether in Unicode implementation and i18n,
> those needs are frequent. E.g. the last Apostrophe thread showed that full
> automatization is sometimes impossible anyway.
>
> Marcel

