[issue16684] Unicode property value abbreviated names and long names

2019-09-20 Thread Greg Price


Greg Price added the comment:

I've gone and implemented a version of this that's integrated into 
Tools/unicode/makeunicodedata.py , and into the unicodedata module.  Patch 
attached.  Demo:

>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
# ...
   'WS': ['White_Space']},
 'category': {'C': ['Other'],
# ...
 'east_asian_width': {'A': ['Ambiguous'],
# ...
  'W': ['Wide']}}


Note that the values are lists.  That's because a value can have multiple 
aliases in addition to its "short name":

>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']


This implementation also provides the reverse mapping, from an alias to the 
"short name":

>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
# ...
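
For illustration only, here is how the reverse mapping relates to the forward 
one, sketched in pure Python. This is not the C code from the patch; 
invert_aliases is a hypothetical helper, and the sample dict just copies a few 
entries in the shape shown in the demo.

```python
def invert_aliases(aliases_by_short):
    """Turn {short: [alias, ...]} into {alias: short}."""
    by_alias = {}
    for short, aliases in aliases_by_short.items():
        for alias in aliases:
            by_alias[alias] = short
    return by_alias


# A few entries in the shape shown in the demo (not the full table).
sample = {
    "category": {"Nd": ["Decimal_Number", "digit"], "C": ["Other"]},
    "bidirectional": {"AL": ["Arabic_Letter"], "WS": ["White_Space"]},
}

by_alias = {prop: invert_aliases(values) for prop, values in sample.items()}
print(by_alias["category"]["digit"])  # Nd
```

Every alias of a value maps back to the same short name, which is why the 
forward values are lists while the reverse values are plain strings.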


This draft doesn't have tests or docs, but it's otherwise complete. I've posted 
it at this stage for feedback on a few open questions:

* This version is in C; at import time some C code builds up the dicts, from 
static tables in the header generated by makeunicodedata.py .  It's not *that* 
much code... but it sure would be more convenient to do in Python instead.

  Should the unicodedata module perhaps have a Python part?  I'd be happy to go 
about that -- rename the existing C module to _unicodedata and add a small 
unicodedata.py wrapper -- if there's a feeling that it'd be a good idea.  Then 
this could go there instead of using the C code I've just written.
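
  A rough sketch of what such a wrapper might look like, under some loud 
assumptions: _unicodedata does not exist today, so the existing module stands 
in for it below, and the _ALIAS_TABLE layout and _build_aliases helper are 
made up for the example rather than taken from the patch.

```python
# Sketch of a hypothetical unicodedata.py wrapper. The real proposal would
# import from a renamed C module (_unicodedata); the existing module stands
# in for it here so that the snippet runs as-is.
import unicodedata as _c_impl

# Hypothetical static table, in a form the generator script might emit:
# (property, short name, aliases...). Only a few rows are shown.
_ALIAS_TABLE = (
    ("category", "Nd", "Decimal_Number", "digit"),
    ("category", "Lu", "Uppercase_Letter"),
    ("east_asian_width", "W", "Wide"),
)


def _build_aliases(table):
    """Build {property: {short: [aliases]}} once, at import time."""
    result = {}
    for prop, short, *aliases in table:
        result.setdefault(prop, {})[short] = list(aliases)
    return result


property_value_aliases = _build_aliases(_ALIAS_TABLE)

# Re-export a C-implemented function unchanged.
category = _c_impl.category

print(property_value_aliases["category"][category("4")])
# ['Decimal_Number', 'digit']
```

  The point is only that the dict-building step becomes a few lines of Python 
while the character lookups stay in C.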


* Is this API the right one?
  * This version has e.g. 
unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd' .

  * Perhaps make category/bidirectional/east_asian_width into attributes rather 
than keys? So e.g. 
unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd' .

  * Or: the standard says "loose matching" should be applied to these names, so 
e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. 
To accomplish that, perhaps make it not dicts at all but functions?

So e.g. unicodedata.property_value_by_alias('decimal number') == 
unicodedata.property_value_by_alias('Decimal_Number') == 'Nd' .

  * There's also room for bikeshedding on the names.
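
  To make the function-based option concrete, here is a minimal sketch of loose 
matching, assuming the rule meant is UAX #44 LM3 (ignore case, whitespace, 
underscores, hyphens, and an initial "is" prefix). The names loose_key and 
category_by_alias and the tiny sample table are hypothetical, not part of the 
patch.

```python
import re


def loose_key(name):
    """Normalize a name roughly as UAX #44 rule LM3 describes: ignore case,
    whitespace, underscores, hyphens, and an initial "is" prefix."""
    key = re.sub(r"[\s_-]", "", name).lower()
    if key.startswith("is"):
        key = key[2:]
    return key


# Hypothetical sample data; a real table would come from the generator.
_CATEGORY_BY_ALIAS = {"Decimal_Number": "Nd", "digit": "Nd", "Nd": "Nd"}

# Index the known names by their normalized form, once.
_LOOSE_INDEX = {loose_key(alias): short
                for alias, short in _CATEGORY_BY_ALIAS.items()}


def category_by_alias(name):
    """Return the short name for any loosely matching alias."""
    return _LOOSE_INDEX[loose_key(name)]


print(category_by_alias("decimal number"))     # Nd
print(category_by_alias("is-decimal-number"))  # Nd
```

  Both the stored names and the query go through the same normalization, so a 
plain dict lookup is enough. A full implementation would also need to check 
that no two real names collapse to the same key.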


* How shall we handle ucd_3_2_0 for this feature?

  This implementation doesn't attempt to record the older version of the data.  
My reasoning is that because the applications of the old data are quite 
specific and they haven't needed this information yet, it seems unlikely anyone 
will ever really want to know from this module just which aliases existed 
already in 3.2.0 and which didn't yet.

  OTOH, as a convenience I've caused e.g. 
unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the 
same object as unicodedata.property_value_by_alias . This allows 
unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata 
module itself, while minimizing the complexity it adds to the implementation.

  Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's 
still easy to get at them -- just get them from the module itself -- and it 
makes it explicit that you're getting current rather than old data.

--
keywords: +patch
nosy: +Greg Price
versions: +Python 3.9 -Python 3.8
Added file: https://bugs.python.org/file48616/prop-val-aliases.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16684] Unicode property value abbreviated names and long names

2018-06-20 Thread Ned Deily


Change by Ned Deily:


--
nosy: +benjamin.peterson
versions: +Python 3.8 -Python 3.7




[issue16684] Unicode property value abbreviated names and long names

2018-06-20 Thread Pander


Pander added the comment:

Unicode 11.0 was released in June 2018. Perhaps that could help move this 
forward.




[issue16684] Unicode property value abbreviated names and long names

2017-01-11 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


--
versions: +Python 3.7 -Python 3.4




[issue16684] Unicode property value abbreviated names and long names

2017-01-11 Thread Pander

Pander added the comment:

Are there any updates or ideas on how to move this forward? In the meantime, 
should the issue be retargeted to version 3.6? Thanks. Please see also 
https://bugs.python.org/issue6331




[issue16684] Unicode property value abbreviated names and long names

2012-12-25 Thread Ezio Melotti

Ezio Melotti added the comment:

The script should probably be integrated in Tools/unicode/makeunicodedata.py.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16684
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16684] Unicode property value abbreviated names and long names

2012-12-23 Thread Pander

Pander added the comment:

Attached is the requested proof-of-concept script.

--
Added file: 
http://bugs.python.org/file28405/create-unicodedata-dicts-prop-value-alias-20121223.py




[issue16684] Unicode property value abbreviated names and long names

2012-12-23 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I verified that the prototype file works in 2.7.3. I rewrote it for 3.3 using a 
refactored approach (and discovered that the site sometimes times out).

--
Added file: http://bugs.python.org/file28411/bc_ea_gc.py




[issue16684] Unicode property value abbreviated names and long names

2012-12-21 Thread Terry J. Reedy

Terry J. Reedy added the comment:

This seems like a plausible request to me. The three dicts comprise 70 
code-alias pairs. If unicodedata had a Python version (should it?), the 
simplest thing would be to add bididict, eawdict, and gcdict to that version 
(and not to the C version). I don't know how well putting dicts in C code 
works. A unicodealias module could be added, but I do not really like that 
idea. I would prefer adding data attributes and corresponding docs to the 
current module.

Pander: submitting a proof-of-concept script that accesses and parses that url 
and produces ready-to-go python code like below might encourage adoption of 
your proposal. In any case, it would be here for others to use.

bididict = {
'AL': 'Arabic_Letter',
...
'WS': 'White_Space',
}

eawdict = ...
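
The code-generation half of such a script could look roughly like this. It is 
only a sketch: render_dict_source is a hypothetical helper, and the two-entry 
sample stands in for data that a real script would read from the file.

```python
import pprint


def render_dict_source(var_name, mapping):
    """Return Python source text that assigns `mapping` to `var_name`."""
    return f"{var_name} = {pprint.pformat(mapping)}\n"


# Hypothetical sample; a real script would fill this from
# PropertyValueAliases.txt.
bidi_sample = {"AL": "Arabic_Letter", "WS": "White_Space"}

source = render_dict_source("bididict", bidi_sample)
print(source)

# Round-trip check: executing the generated source recreates the dict.
namespace = {}
exec(source, namespace)
```

pprint.pformat sorts the keys, so regenerating after a Unicode update should 
give a small, readable diff.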

--
nosy: +loewis, terry.reedy




[issue16684] Unicode property value abbreviated names and long names

2012-12-14 Thread Pander

New submission from Pander:

The unicodedata module
  http://docs.python.org/3/library/unicodedata.html
can look up the general category, bidirectional class and East Asian width of 
a Unicode character:
  unicodedata.category(chr)
  unicodedata.bidirectional(chr)
  unicodedata.east_asian_width(chr)

Each function returns the abbreviated name of the property value. However, for 
certain applications it is important to be able to get from the abbreviated 
name to the long name and vice versa.
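
As a small illustration of what is available today, the loop below prints the 
three abbreviated values for a few characters (the choice of characters is 
arbitrary). Nothing in the module turns 'Lu' or 'Na' into a long name.

```python
import unicodedata

# Latin capital A, digit four, Hebrew alef, CJK ideograph U+4E2D.
for char in ("A", "4", "\u05d0", "\u4e2d"):
    print(repr(char),
          unicodedata.category(char),
          unicodedata.bidirectional(char),
          unicodedata.east_asian_width(char))
```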

The data needed to do this can be found at
  http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
under sections
  # General_Category (gc)
  # Bidi_Class (bc)
  # East_Asian_Width (ea)
Only the second (abbreviated name) and third (long name) fields are needed; 
other fields and any comments can be ignored.
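
A minimal sketch of that parsing step, run on a few sample lines in the file's 
format rather than on the downloaded file. The function name parse_aliases and 
its wanted parameter are my own choices for the example.

```python
def parse_aliases(lines, wanted=("gc", "bc", "ea")):
    """Yield (property, short_name, long_name) from alias-file lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Field 0 is the property, 1 the abbreviated name, 2 the long name.
        if len(fields) >= 3 and fields[0] in wanted:
            yield fields[0], fields[1], fields[2]


# A few lines in the file's format; the real file is much longer.
sample = """\
# General_Category (gc)
gc ; Nd ; Decimal_Number ; digit
bc ; AL ; Arabic_Letter
ea ; W  ; Wide
sc ; Latn ; Latin
""".splitlines()

for row in parse_aliases(sample):
    print(row)
```

Extra aliases such as 'digit' and properties outside the three of interest are 
dropped here.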

For general category, please also support translating the one-letter 
abbreviations back and forth. Each of these names a group of the two-letter 
general category abbreviations that share its initial letter.
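
For example, the group of a character can already be derived from the first 
letter of its category; what is missing is the long name. In this sketch the 
GROUP_NAMES dict is typed in by hand from the General_Category section of the 
file, and category_group is a hypothetical helper.

```python
import unicodedata

# Long names of the one-letter groups, as listed under General_Category
# in PropertyValueAliases.txt.
GROUP_NAMES = {
    "C": "Other", "L": "Letter", "M": "Mark", "N": "Number",
    "P": "Punctuation", "S": "Symbol", "Z": "Separator",
}


def category_group(char):
    """Return (one-letter group, long name) for a character."""
    group = unicodedata.category(char)[0]
    return group, GROUP_NAMES[group]


print(category_group("4"))  # ('N', 'Number')
print(category_group("a"))  # ('L', 'Letter')
```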

Please extend this module with a way of translating back and forth between 
abbreviated name and long name for the property values defined in Unicode for 
general category, bidirectional class and East Asian width. This functionality 
should be independent of the existing functions that return the abbreviated 
name for a Unicode character, and should be accessible via separate functions 
or dictionaries in which developers can perform lookups themselves.

Implementing the functionality requested in this issue allows Python developers 
to get from an abbreviated property value to a meaningful property value name 
and vice versa without having to retrieve this information from the Unicode 
Consortium and/or shipping this information with their code with the risk of 
using outdated information.

--
components: Unicode
messages: 177476
nosy: PanderMusubi, ezio.melotti
priority: normal
severity: normal
status: open
title: Unicode property value abbreviated names and long names
type: enhancement
versions: Python 3.5




[issue16684] Unicode property value abbreviated names and long names

2012-12-14 Thread Ezio Melotti

Ezio Melotti added the comment:

> for certain applications it is important to be able to get from the 
> abbreviated name to the long name and vice versa.

What kind of application?  I have a module where I defined my own dict that 
maps categories to their full names, but I'm not sure this feature is common 
enough that it should be included and maintained in the stdlib.

If it's added, a dict is probably enough, but a script to parse the file you 
mentioned and update this dict should also be included.

--
stage:  -> needs patch
versions: +Python 3.4 -Python 3.5




[issue16684] Unicode property value abbreviated names and long names

2012-12-14 Thread Pander

Pander added the comment:

I myself have a lot of Python applications that process font files and interact 
with fonttools and FontForge, which are both written in Python too. Since you 
have your own dict for this purpose, and other people probably do too, adding 
these three small dicts to the standard library seems justified, especially 
since this module in the standard library follows the definitions from the 
Unicode Consortium.

If this is shipped in one module, developers will always have an in-sync 
translation from abbreviated names to long names and vice versa. Over the last 
few years I have regularly had to adjust my dicts for definitions added by the 
Unicode Consortium that the stdlib already supports.

At the moment, translation from Unicode code points such as U+1234 to 
human-readable Unicode names and vice versa is already offered. Providing 
human-readable names for the property values is a service of the same level and 
would cater to approximately the same user group.
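
For comparison, this is the existing round trip for character names, using 
only functions the module already has:

```python
import unicodedata

# Character to name, and name back to character.
print(unicodedata.name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE") == "\u00e9")
```

The requested dicts would give property values the same kind of two-way 
lookup.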

If you agree that these dicts can be added I am willing to provide a script 
that will parse the aforementioned file.
