Re: annotations (was: NamesList.txt as data source)
Is the term "exponentially" really appropriate? The NamesList file is not so large, and the growth would remain linear. Anyway, this file (current CSV format or XML format) does not need to be part of the core UCD files; it can be in a separate download for people needing it.

One benefit I would see is that conversion to XML using an automated tool could ensure that it is properly formatted. But I believe that Unibook is already parsing it to produce consistent code charts, so its format is already checked, and this advantage is not really effective. The main benefit would be that the file could be edited and updated using standard tools.

XML is not the only choice available. JSON today is simpler to parse and easier to read (and even edit) by humans; it can embed indentation whitespace (outside quoted strings) that won't be considered part of the data (unlike XML, where it "pollutes" the DOM with extra text elements). In fact I believe that the old CSV formats used in the original UCD may be deprecated in favor of JSON (the old format could be automatically generated for applications that want it). It could unify all formats with a single parser in all tools. Files in older CSV or tabulated formats would be in a separate derived collection. Then users would choose which format they prefer (legacy now derived, JSON, or XML if people really want it).

The advantage of XML, however, is its stability for later updates that may need to insert additional data or annotations (with JSON or CSV/tabulated formats, the number of columns is fixed, and all columns must be fed at least with empty data, even when it is not significant).
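As a minimal sketch of the conversion suggested above, the following hypothetical Python snippet maps one legacy semicolon-delimited record to a JSON object. The field names are an invented subset for illustration (the real UnicodeData.txt has fifteen fields per record); this is not official UCD tooling:

```python
import json

# Hypothetical subset of field names for illustration only; the real
# UnicodeData.txt record has fifteen semicolon-separated fields.
FIELDS = ["cp", "name", "gc", "ccc", "bc", "decomposition"]

def ucd_line_to_json(line: str) -> str:
    """Convert one legacy semicolon-delimited record into a JSON object.

    Every column is emitted, even when empty, mirroring the fixed-column
    nature of the legacy tabulated format discussed above.
    """
    values = line.rstrip("\n").split(";")
    return json.dumps(dict(zip(FIELDS, values)), ensure_ascii=False)

print(ucd_line_to_json("00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A"))
```

A single such converter could emit the legacy line back out again from the JSON object, which is the "derived legacy format" idea in a nutshell.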
Note that legacy formats also have comments after hash signs, but many comments found at the end of data lines also have some parsable meaning, so they are structured, and may be followed by an extra hash sign for a real comment.

The advantage of the existing CSV/tabulated formats is that they are extremely easy to import into a spreadsheet for easier use by a human (I won't request the UTC to provide these files in XLS/XLSX or ODS format...). But JSON and XML could be imported as well, provided that each data file remains structured as a 2D grid without substructures within cells (otherwise you need to provide an explicit schema).

Note, however, that some columns are frequently structured: the column containing the code point key frequently specifies a code range using an additional separator, as do those whose value is an ordered list of code points, using a space separator and possibly a leading subtag (such as decomposition data). In XML you would translate these into separate subelements or into additional attributes, and in JSON you would need to represent these structured cells using subarrays. So the data is *already* not strictly 2D (converting it to a pure 2D format, for relational use, would require adding additional key or referencing "ID" columns, and those converted files would be much harder to read/edit by humans in *any* format: CSV/tabular, JSON or XML).

Other candidate formats also include Turtle (generally derived from OWL, but replacing the XML envelope format with a tabulated "2.5D" format which is much easier than XML to read/edit, much more compact than XML-based formats, and easier to parse)...

2016-03-14 3:14 GMT+01:00 Marcel Schneider:
> On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote:
> > My point is that of J.S. Choi and Janusz Bień: the problem with
> > declaring NamesList off-limits is that it does contain information that
> > is either:
> >
> > • not available in any other UCD file, or
> > • available, but only in comments (like the MAS mappings), which aren't
> > supposed to be parsed either.
> >
> > Ken wrote:
> >
> > > [ .. ] NamesList.txt is itself the result of a complicated merge
> > > of code point, name, and decomposition mapping information from
> > > UnicodeData.txt, of listings of standardized variation sequences from
> > > StandardizedVariants.txt, and then a very long list of annotational
> > > material, including names list subhead material, etc., maintained in
> > > other sources.
> >
> > But sometimes an implementer really does need a piece of information
> > that exists only in those "other sources." When that happens, sometimes
> > the only choices are to resort to NamesList or to create one's own data
> > file, as Ken did by parsing the comment lines from the math file. Both
> > of these are equally distasteful when trying to be conformant.
>
> If so, then extending the XML UCD with all the information that is
> actually missing in it while available in the Code Charts and
> NamesList.txt ends up being a good idea. But it still remains that such a
> step would exponentially increase the amount of data, because items that
> were not meant to be systematically provided must be.
>
> Further I see that once this is completed, other requirements could need
> to tackle the same job on the core specs.
Re: annotations (was: NamesList.txt as data source)
On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote:
> My point is that of J.S. Choi and Janusz Bień: the problem with
> declaring NamesList off-limits is that it does contain information that
> is either:
>
> • not available in any other UCD file, or
> • available, but only in comments (like the MAS mappings), which aren't
> supposed to be parsed either.
>
> Ken wrote:
>
> > [ .. ] NamesList.txt is itself the result of a complicated merge
> > of code point, name, and decomposition mapping information from
> > UnicodeData.txt, of listings of standardized variation sequences from
> > StandardizedVariants.txt, and then a very long list of annotational
> > material, including names list subhead material, etc., maintained in
> > other sources.
>
> But sometimes an implementer really does need a piece of information
> that exists only in those "other sources." When that happens, sometimes
> the only choices are to resort to NamesList or to create one's own data
> file, as Ken did by parsing the comment lines from the math file. Both
> of these are equally distasteful when trying to be conformant.

If so, then extending the XML UCD with all the information that is actually missing in it while available in the Code Charts and NamesList.txt ends up being a good idea. But it still remains that such a step would exponentially increase the amount of data, because items that were not meant to be systematically provided must be.

Further I see that once this is completed, other requirements could need to tackle the same job on the core specs. The point would be to know whether, in Unicode implementation and i18n, those needs are frequent. E.g. the last Apostrophe thread showed that full automation is sometimes impossible anyway.

Marcel
Re: annotations (was: NamesList.txt as data source)
My point is that of J.S. Choi and Janusz Bień: the problem with declaring NamesList off-limits is that it does contain information that is either:

• not available in any other UCD file, or
• available, but only in comments (like the MAS mappings), which aren't supposed to be parsed either.

Ken wrote:

> [ .. ] NamesList.txt is itself the result of a complicated merge
> of code point, name, and decomposition mapping information from
> UnicodeData.txt, of listings of standardized variation sequences from
> StandardizedVariants.txt, and then a very long list of annotational
> material, including names list subhead material, etc., maintained in
> other sources.

But sometimes an implementer really does need a piece of information that exists only in those "other sources." When that happens, sometimes the only choices are to resort to NamesList or to create one's own data file, as Ken did by parsing the comment lines from the math file. Both of these are equally distasteful when trying to be conformant.

--
Doug Ewell | http://ewellic.org | Thornton, CO
Re: annotations (was: NamesList.txt as data source)
On Sun, 13 Mar 2016 07:55:24 +0100, Janusz S. Bień wrote:
> For this purpose he wrote also a converter from NamesList format to XML

That goes straight in the direction I suggested last year as a beta feedback item [1], but I never thought that it could be so simple.

> I understand there is no intention to make an official XML version of
> the file as it would require changes in Unibook?

The difference, however, between homemade databases and official ones is that the latter raise much higher expectations. Asmus Freytag outlined in this thread―as well as in his comments on my feedback―that *no* “complete” UCD version, regardless of how complete it effectively might be, can ever meet the assumptions people would inevitably make about it.

Further, experience shows that the information actually provided is already more than most people are able to mentally process. E.g. most online character information providers do not display the formal aliases, so that in the best case some aware users add that information using the comment facility. I donʼt cite any: these are free tools and platforms that must not be criticized.

When we imagine a hypothetical UCD containing detailed information about the usage of any existing language―not only Polish but also Czech, Romanian, Portuguese, Vietnamese, Devanagari, Tirhuta, to cite just a few―the result would be a mass of data of which I’m not sure that it would pay back the cost induced by its collection, nor that it would really be useful.

For the NamesList, the TXT format is superior to XML at least in that it prevents one from forgetting that NamesList.txt is the source of the Code Charts. Not less, not more.

Marcel

[1] http://www.unicode.org/review/pri297/feedback.html
Date/Time: Sat May 2 07:10:09 CDT 2015
Opt Subject: PRI #297: UnicodeXData.txt
Date/Time: Wed May 6 08:03:04 CDT 2015
Opt Subject: PRI #297: feedback on XML files
annotations (was: NamesList.txt as data source)
On Thu, Mar 10 2016 at 22:40 CET, kenwhist...@att.net writes:
> The *reason* that NamesList.txt exists at all is to drive the tool,
> unibook, that formats the full Unicode code charts for posting. [...]

On Fri, Mar 11 2016 at 3:13 CET, asm...@ix.netcom.com writes:
> On 3/10/2016 5:49 PM, "J. S. Choi" wrote:
>> One thing about NamesList.txt is that, as far as I have been able to
>> tell, it’s the only machine-readable, parseable source of those
>> annotations and cross-references. [...]
>
> This is a different issue. The nameslist.txt is a reasonable source
> for driving other formatting programs than just Unibook.

Exactly. A student of mine wrote a font sampling program producing output in a Unibook-like form. For this purpose he also wrote a converter from the NamesList format to XML:

https://github.com/ppablo28/fntsample_ucd_comments
https://github.com/ppablo28/ucd_xml_parser

I use the XML version of NamesList to provide my own comments on characters (work in progress):

https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf

Other examples of NamesList.txt use are:

http://www.fileformat.info/info/unicode/
https://codepoints.net/

Although not exactly formatting programs, in my opinion they also constitute valid uses.

> In fact, the possibility of reuse in this context is probably among the
> unstated rationales for making the information and syntax available in
> the first place.

I understand there is no intention to make an official XML version of the file as it would require changes in Unibook? [...]

>> What are these other primary sources that maintain these other
>> annotation data; are they publicly available? If the name list is the
>> only place where these sources’ data have been published, then, for
>> better or for worse, the name list is all that is available for much
>> information on many code points’ usage.
>
> See my first through third paragraph.

You wrote:
[...]
> There are explanations about character use that are only maintained in
> the PDF of the core specification, where this information is packaged
> in a way that can be understood by a human reader, but is not amenable
> to be extracted by machine.
>
> While the annotations, comments, cross references etc. in Namelist.txt
> appear, formally, to be machine extractable, the way they are created
> and managed make them just as much "human-accessible" only as the core
> specification.

I'm afraid this is not clear to me. Let's take an example. Some time ago I inquired about a controversial alias for U+018D:

http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html

Can I really find anything about "reversed Polish-hook o" in the core specification which is not a literal copy of the information from NamesList.txt?

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
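[Editor's note: to make the thread's "machine extractable" question concrete, here is a simplified Python sketch of pulling annotations out of NamesList.txt. It assumes only three of the tab-prefixed record kinds (`=` alias, `*` note, `x` cross reference) and ignores subheaders, variation sequences, and everything else a real converter, such as the ones referenced above, must handle.]

```python
# Simplified, hypothetical sketch; not official tooling. Sample input:
#   0027<TAB>APOSTROPHE
#   <TAB>= apostrophe-quote (1.0)
#   <TAB>x (modifier letter apostrophe - 02BC)
KIND = {"=": "alias", "*": "note", "x": "cross-ref"}

def parse_nameslist(text: str) -> list[dict]:
    entries = []
    for line in text.splitlines():
        if line.startswith("\t"):
            # Annotation line: kind marker, then the annotation body.
            kind = KIND.get(line[1:2])
            if entries and kind:
                entries[-1]["annotations"].append((kind, line[2:].strip()))
        elif "\t" in line and line[:1] in "0123456789ABCDEF":
            # Character line: hex code point, TAB, character name.
            cp, _, name = line.partition("\t")
            entries.append({"cp": cp, "name": name, "annotations": []})
    return entries

sample = "0027\tAPOSTROPHE\n\t= apostrophe-quote (1.0)\n\tx (modifier letter apostrophe - 02BC)\n"
print(parse_nameslist(sample))
```

Whether the *content* of those extracted annotations is reliable outside the code charts is, of course, exactly what this thread disputes.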