Re: annotations (was: NamesList.txt as data source)

2016-03-14 Thread Philippe Verdy
Is the term "exponentially" really appropriate? The NamesList file is not
so large, and the growth would remain linear.

Anyway, this file (in the current CSV format or in an XML format) does not
need to be part of the core UCD files; it can be offered as a separate
download for people who need it.

One benefit I would see is that converting it to XML with an automated
tool could ensure that it is properly formatted. But I believe that Unibook
already parses it to produce consistent code charts, so its format is
already checked, and that advantage is not really significant.

But the main benefit would be that the file could be edited and updated
using standard tools. XML is not the only choice available: JSON today is
simpler to parse and easier for humans to read (and even edit), and it can
embed indentation whitespace (outside quoted strings) that is not considered
part of the data (unlike XML, where such whitespace "pollutes" the DOM with
extra text nodes).
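
A minimal sketch of that difference, using only the Python standard library (the record shape here is invented for illustration, not an actual UCD schema):

```python
import json
import xml.dom.minidom as minidom

# The same hypothetical character record, indented, in both formats.
json_text = """
{
  "cp": "0041",
  "name": "LATIN CAPITAL LETTER A"
}
"""
xml_text = """
<char cp="0041">
  <name>LATIN CAPITAL LETTER A</name>
</char>
"""

# JSON: whitespace outside quoted strings is never part of the data.
data = json.loads(json_text)
print(data["name"])

# XML: the indentation survives in the DOM as extra text nodes
# sitting between the element nodes.
doc = minidom.parseString(xml_text.strip())
kids = doc.documentElement.childNodes
print([n.nodeType for n in kids])  # TEXT_NODE (3) entries appear around <name>
```

Whether that DOM "pollution" matters in practice depends on the parser; many XML toolchains simply ignore whitespace-only text nodes.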

In fact I believe that the old CSV formats used in the original UCD could be
deprecated in favor of JSON (the old formats could still be automatically
generated for applications that want them). This would unify all formats
behind a single parser in all tools. Files in the older CSV or tabulated
formats would go into a separate derived collection. Users would then choose
which format they prefer (the legacy formats, now derived; JSON; or XML if
people really want it).

The advantage of XML, however, is its stability across later updates that
may need to insert additional data or annotations (with JSON or
CSV/tabulated formats, the number of columns is fixed, and every column must
be fed at least an empty value even when it is not significant). Note that
the legacy formats also carry comments after hash signs, but many comments
found at the end of data lines have some parsable meaning, so they are
structured, and may themselves be followed by an extra hash sign introducing
a real free-text comment.
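
As an illustration of that "structured comment tail" pattern: the line shape below follows emoji-test.txt, while the second hash sign is purely hypothetical, added to show the split described above:

```python
def split_ucd_line(line):
    """Split a UCD-style line into semicolon-separated data fields, a
    structured comment tail after the first '#', and a (hypothetical)
    free-text comment after a second '#'."""
    data, _, tail = line.partition("#")
    structured, _, free = tail.partition("#")
    fields = [f.strip() for f in data.split(";")]
    return fields, structured.strip(), free.strip()

fields, structured, free = split_ucd_line(
    "1F600 ; fully-qualified # grinning face # just an example")
print(fields)      # ['1F600', 'fully-qualified']
print(structured)  # 'grinning face'
print(free)        # 'just an example'
```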

The advantage of the existing CSV/tabulated formats is that they are
extremely easy to import into a spreadsheet for easier use by a human (I
won't request that the UTC provide these files in XLS/XLSX or ODS
format...). But JSON and XML could be imported just as well, provided that
each data file remains structured as a 2D grid without substructures within
cells (otherwise you need to provide an explicit schema).

But note that some columns are frequently structured: the column containing
the code point key often specifies a code range using an additional
separator, and other columns hold an ordered list of code points,
space-separated and possibly preceded by a subtag (such as the decomposition
data). In XML you would translate these into separate subelements or into
additional attributes, and in JSON you would need to represent these
structured cells as subarrays. So the data is *already* not strictly 2D
(converting it to a pure 2D format, for relational use, would require adding
extra key or referencing "ID" columns, and those converted files would be
much harder for humans to read and edit, in *any* format: CSV/tabular, JSON,
or XML).
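
A sketch of the two kinds of structured cells mentioned above, as they appear in UCD-style files (a `..` code-point range key, and a decomposition field with an optional leading tag); the helper names are mine:

```python
def parse_range(field):
    """Code-point key that may name a range, e.g. '0030..0039'."""
    if ".." in field:
        lo, hi = field.split("..")
        return int(lo, 16), int(hi, 16)
    cp = int(field, 16)
    return cp, cp

def parse_decomposition(field):
    """Decomposition mapping: optional <tag> followed by hex code points,
    e.g. '<compat> 0066 0066 0069' (U+FB03) or '0041 030A' (U+00C5)."""
    tag = None
    parts = field.split()
    if parts and parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]

print(parse_range("0030..0039"))
print(parse_decomposition("<compat> 0066 0066 0069"))
```

Flattening either of these into a pure 2D relational layout would indeed require auxiliary tables keyed by an added ID column, which is the readability cost described above.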

Other candidate formats include Turtle (an RDF serialization, often used
with OWL, which replaces the XML envelope with a tabulated "2.5D" format
that is much easier to read and edit than XML, much more compact than
XML-based formats, and easier to parse)...

2016-03-14 3:14 GMT+01:00 Marcel Schneider :

> On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell  wrote:
>
> > My point is that of J.S. Choi and Janusz Bień: the problem with
> > declaring NamesList off-limits is that it does contain information that
> > is either:
> >
> > • not available in any other UCD file, or
> > • available, but only in comments (like the MAS mappings), which aren't
> > supposed to be parsed either.
> >
> > Ken wrote:
> >
> > > [ .. ] NamesList.txt is itself the result of a complicated merge
> > > of code point, name, and decomposition mapping information from
> > > UnicodeData.txt, of listings of standardized variation sequences from
> > > StandardizedVariants.txt, and then a very long list of annotational
> > > material, including names list subhead material, etc., maintained in
> > > other sources.
> >
> > But sometimes an implementer really does need a piece of information
> > that exists only in those "other sources." When that happens, sometimes
> > the only choices are to resort to NamesList or to create one's own data
> > file, as Ken did by parsing the comment lines from the math file. Both
> > of these are equally distasteful when trying to be conformant.
>
>
> If so, then extending the XML UCD with all the information that is
> actually missing in it while available in the Code Charts and
> NamesList.txt, ends up being a good idea. But it still remains that such a
> step would exponentially increase the amount of data, because items that
> were not meant to be systematically provided, must be.
>
> Further I see that once this is completed, other requirements could need
> to tackle the same job on the core specs.
>
> 

Re: annotations (was: NamesList.txt as data source)

2016-03-13 Thread Marcel Schneider
On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell  wrote:

> My point is that of J.S. Choi and Janusz Bień: the problem with
> declaring NamesList off-limits is that it does contain information that
> is either:
> 
> • not available in any other UCD file, or
> • available, but only in comments (like the MAS mappings), which aren't
> supposed to be parsed either.
> 
> Ken wrote:
> 
> > [ .. ] NamesList.txt is itself the result of a complicated merge
> > of code point, name, and decomposition mapping information from
> > UnicodeData.txt, of listings of standardized variation sequences from
> > StandardizedVariants.txt, and then a very long list of annotational
> > material, including names list subhead material, etc., maintained in
> > other sources.
> 
> But sometimes an implementer really does need a piece of information
> that exists only in those "other sources." When that happens, sometimes
> the only choices are to resort to NamesList or to create one's own data
> file, as Ken did by parsing the comment lines from the math file. Both
> of these are equally distasteful when trying to be conformant.


If so, then extending the XML UCD with all the information that is actually 
missing from it while available in the Code Charts and NamesList.txt ends up 
being a good idea. But it still remains that such a step would exponentially 
increase the amount of data, because items that were not meant to be 
systematically provided would have to be.

Further, I see that once this is completed, other requirements could call 
for tackling the same job on the core specs.

The point would be to know whether, in Unicode implementation and i18n, such 
needs are frequent. E.g. the recent Apostrophe thread showed that full 
automation is sometimes impossible anyway.

Marcel



Re: annotations (was: NamesList.txt as data source)

2016-03-13 Thread Doug Ewell
My point is that of J.S. Choi and Janusz Bień: the problem with 
declaring NamesList off-limits is that it does contain information that 
is either:


• not available in any other UCD file, or
• available, but only in comments (like the MAS mappings), which aren't
 supposed to be parsed either.

Ken wrote:


[ .. ] NamesList.txt is itself the result of a complicated merge
of code point, name, and decomposition mapping information from
UnicodeData.txt, of listings of standardized variation sequences from
StandardizedVariants.txt, and then a very long list of annotational
material, including names list subhead material, etc., maintained in
other sources.


But sometimes an implementer really does need a piece of information 
that exists only in those "other sources." When that happens, sometimes 
the only choices are to resort to NamesList or to create one's own data 
file, as Ken did by parsing the comment lines from the math file. Both 
of these are equally distasteful when trying to be conformant.


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: annotations (was: NamesList.txt as data source)

2016-03-13 Thread Marcel Schneider
On Sun, 13 Mar 2016 07:55:24 +0100, Janusz S. Bień  wrote:

> For this purpose he wrote also a converter from NamesList format to XML

That goes straight in the direction I suggested last year as a beta feedback 
item[1], but I never thought that it could be so simple.

> I understand there is no intention to make an official XML version of
> the file as it would require changes in Unibook?

The difference, however, between homemade databases and official ones is that 
the latter raise much higher expectations. Asmus Freytag outlined in this 
thread―as well as in his comments on my feedback―that *no* “complete” UCD 
version, regardless of how complete it effectively might be, can ever meet 
the assumptions people would inevitably make about it.

Further, experience shows that the information actually provided is already 
far more than most people are able to mentally process. E.g. most online 
character information providers do not display the formal aliases, so that at 
best some aware users add that information using the comment facility. I donʼt 
name any: these are free tools and platforms that must not be criticized.

When we imagine a hypothetical UCD containing detailed information about the 
usage of every existing language and script―not only Polish but also Czech, 
Romanian, Portuguese, Vietnamese, Devanagari, Tirhuta, to cite just a few―the 
result would be a mass of data that I’m not sure would pay back the cost of 
collecting it, nor that it would really be useful.

For the NamesList, the TXT format is superior to XML at least in that it 
prevents us from forgetting that NamesList.txt is the source of the Code 
Charts. No less, no more.

Marcel

[1] http://www.unicode.org/review/pri297/feedback.html
Date/Time: Sat May 2 07:10:09 CDT 2015
   Opt Subject: PRI #297: UnicodeXData.txt
Date/Time: Wed May 6 08:03:04 CDT 2015
   Opt Subject: PRI #297: feedback on XML files



annotations (was: NamesList.txt as data source)

2016-03-12 Thread Janusz S. Bień
On Thu, Mar 10 2016 at 22:40 CET, kenwhist...@att.net writes:

> The *reason* that NamesList.txt exists at all is to drive the tool,
> unibook, that formats the full Unicode code charts for posting. 

[...]

On Fri, Mar 11 2016 at  3:13 CET, asm...@ix.netcom.com writes:
> On 3/10/2016 5:49 PM, "J. S. Choi" wrote:

>> One thing about NamesList.txt is that, as far as I have been able to
>> tell, it’s the only machine-readable, parseable source of those
>> annotations and cross-references.

[...]

> This is a different issue. The nameslist.txt is a reasonable source
> for driving other formatting programs than just Unibook.

Exactly.

A student of mine wrote a font sampling program producing output in a
Unibook-like form. For this purpose he also wrote a converter from
NamesList format to XML:

  https://github.com/ppablo28/fntsample_ucd_comments

  https://github.com/ppablo28/ucd_xml_parser
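
For illustration only―this is not the linked converter, and the output element names are invented―a minimal sketch of such a conversion, handling just a few of the line types of the NamesList.txt format (name lines, "=" aliases, "*" annotations, "x" cross references):

```python
import xml.etree.ElementTree as ET

# Invented two-character sample in NamesList.txt line syntax.
SAMPLE = (
    "0041\tLATIN CAPITAL LETTER A\n"
    "\t* the most common letter in English\n"
    "0042\tLATIN CAPITAL LETTER B\n"
    "\tx (latin small letter b - 0062)\n"
)

# Hypothetical mapping from NamesList line markers to XML tag names.
KINDS = {"=": "alias", "*": "note", "x": "cross-ref"}

def nameslist_to_xml(text):
    root = ET.Element("nameslist")
    char = None
    for line in text.splitlines():
        if line.startswith("\t") and char is not None:
            # Tab-indented line: marker character, space, body.
            marker, _, body = line.lstrip("\t").partition(" ")
            sub = ET.SubElement(char, KINDS.get(marker, "unknown"))
            sub.text = body
        elif "\t" in line:
            # Name line: code point, tab, character name.
            cp, _, name = line.partition("\t")
            char = ET.SubElement(root, "char", cp=cp)
            ET.SubElement(char, "name").text = name
    return root

root = nameslist_to_xml(SAMPLE)
print(ET.tostring(root, encoding="unicode"))
```

A real converter would of course have to cover the remaining line types (subheads, comment lines, file headers) documented for the NamesList format.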

I use the XML version of NamesList to provide my own comments on
characters (work in progress):

 
https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf

Other examples of NamesList.txt use are

  http://www.fileformat.info/info/unicode/
  https://codepoints.net/

Although not exactly formatting programs, in my opinion they also
constitute a valid use.

> In fact, the possibility of reuse in this context is probably among the
> unstated rationales for making the information and syntax available in
> the first place.

I understand there is no intention to make an official XML version of
the file as it would require changes in Unibook?

[...]


>> What are these other primary sources that maintain these other
>> annotation data; are they publicly available? If the name list is the
>> only place where these sources’ data have been published, then, for
>> better or for worse, the name list is all that is available for much
>> information on many code points’ usage.

> See my first through third paragraph.

You wrote:

[...]

> There are explanations about character use that are only maintained in
> the PDF of the core specification, where this information is packaged
> in a way that can be understood by a human reader, but is not amenable
> to be extracted by machine.
>
> While the annotations, comments, cross references etc. in Namelist.txt
> appear, formally, to be machine extractable, the way they are created
> and managed make them just as much "human-accessible" only as the core
> specification.

I'm afraid it's not clear to me. Let's take an example. Some time ago I
inquired about a controversial alias for U+018D:

http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html

Can I really find anything about "reversed Polish-hook o" in the core
specification that is not a literal copy of the information from
NamesList.txt?

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/