[issue10459] missing character names in unicodedata (CJK...)

2010-11-22 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

For 3.2, this is now fixed in r86681.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10459
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10459] missing character names in unicodedata (CJK...)

2010-11-22 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

The patch for 3.1 is r86685. The patch for 2.7 is r86686.

--
resolution:  -> fixed
status: open -> closed




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Martin v. Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Marc-Andre: Many of the characters you refer to actually do have names
assigned, even if the names don't appear in the Unicode character database.
Instead, they are specified in section 4.8 of the Unicode standard, and
unicodedata.c already implements that (it just wasn't updated when the ranges
changed; I will look into this).
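
A minimal sketch of the rule-derived naming described above, assuming the unified ideograph ranges as they stood in Unicode 6.0. The names CJK_RANGES_6_0 and derived_cjk_name are hypothetical and chosen only for illustration; this is not the code in unicodedata.c, and later Unicode versions extend the ranges.

```python
# Unified ideograph ranges as of Unicode 6.0 (an assumption of this sketch;
# later versions of the standard extend these ranges and add new ones).
CJK_RANGES_6_0 = [
    (0x3400, 0x4DB5),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FCB),    # CJK Unified Ideographs
    (0x20000, 0x2A6D6),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B734),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81D),  # CJK Unified Ideographs Extension D
]


def derived_cjk_name(codepoint):
    """Return the rule-derived name, or None outside the listed ranges."""
    for first, last in CJK_RANGES_6_0:
        if first <= codepoint <= last:
            # The name is a fixed prefix plus the code point in uppercase hex.
            return "CJK UNIFIED IDEOGRAPH-%04X" % codepoint
    return None
```

With these ranges, derived_cjk_name(0x2A700) gives "CJK UNIFIED IDEOGRAPH-2A700", which is the form of name the original report expected.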

--
nosy: +loewis




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
>
> Martin v. Löwis mar...@v.loewis.de added the comment:
>
> Marc-Andre: Many of the characters you refer to actually do have names
> assigned, even if the names don't appear in the Unicode character database.
> Instead, they are specified in section 4.8 of the Unicode standard, and
> unicodedata.c already implements that (it just wasn't updated when the
> ranges changed; I will look into this).

Thanks for pointing this out. I wasn't aware of there being a standard
for constructing names for CJK ideograph ranges.

--




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Vlastimil Brom

New submission from Vlastimil Brom vlastimil.b...@gmail.com:

I just noticed an omission of some character names in the unicodedata module.
These are some CJK ideographs:

龼 (0x9fbc) - 鿋 (0x9fcb)
 (CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
 (CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
 (CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

The names should probably be generated, e.g. CJK UNIFIED IDEOGRAPH-2A700 ...
etc.

(Tested with a recompiled unicodedata using Unicode 6.0. With the module built
into Python 2.7 (unidata_version: '5.2.0'), only the first two ranges are
relevant, as CJK Unified Ideographs Extension D is an addition of Unicode 6.)

(Also there are the unprintable ASCII controls, surrogates and private use 
areas, where the missing names are probably ok.)


I tested with the following rather clumsy code:

# # # # # # # # # # # # # # #
import unicodedata

# wide_unichr = custom unichr emulating unicode ranges beyond 0xFFFF on a
# narrow python build (its definition is not shown here)
codepoints_missing_char_names = [[-2, -2]]  # dummy entry, skipped below
for i in xrange(0x10FFFF + 1):
    # ignore category C* (controls, surrogates, private use, unassigned)
    if (unicodedata.category(wide_unichr(i))[:1] != 'C' and
            unicodedata.name(wide_unichr(i), u"??noname??") == u"??noname??"):
        if codepoints_missing_char_names[-1][1] == i - 1:
            # extend the current run of unnamed code points
            codepoints_missing_char_names[-1][1] = i
        else:
            # start a new run
            codepoints_missing_char_names.append([i, i])

for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first),
                                  wide_unichr(last), hex(last))
# # # # # # # # # # # # # # # # # # # # # # # # # #
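
For comparison, a rough Python 3 sketch of the same scan, where chr() covers the full code point range and no wide_unichr helper is needed. The function names unnamed_codepoints, collapse_ranges and format_range are hypothetical, not part of unicodedata; which ranges are reported depends on the Unicode version the interpreter was built with.

```python
import unicodedata


def unnamed_codepoints(start=0, stop=0x110000):
    """Yield code points that are not in a C* category yet have no name."""
    for cp in range(start, stop):
        ch = chr(cp)
        # Category C* covers controls, surrogates, private use and
        # unassigned code points, where a missing name is expected.
        if unicodedata.category(ch)[0] != "C" and unicodedata.name(ch, None) is None:
            yield cp


def collapse_ranges(codepoints):
    """Collapse an ascending sequence of ints into (first, last) runs."""
    ranges = []
    for cp in codepoints:
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]


def format_range(first, last):
    """Format one run in the same style as the report above."""
    return "%s (%#x) - %s (%#x)" % (chr(first), first, chr(last), last)
```

Printing format_range(first, last) for each pair returned by collapse_ranges(unnamed_codepoints()) produces a listing in the same shape as the one in this report.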

Unfortunately, I can't provide a fix, as unicodedata involves C code, where my 
knowledge is near zero.

vbr

--
messages: 121521
nosy: vbr
priority: normal
severity: normal
status: open
title: missing character names in unicodedata (CJK...)




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Vlastimil Brom

Changes by Vlastimil Brom vlastimil.b...@gmail.com:


--
components: +Library (Lib), Unicode
type:  -> behavior




[issue10459] missing character names in unicodedata (CJK...)

2010-11-19 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Vlastimil Brom wrote:
>
> New submission from Vlastimil Brom vlastimil.b...@gmail.com:
>
> I just noticed an omission of some character names in the unicodedata module.
> These are some CJK ideographs:
>
> 龼 (0x9fbc) - 鿋 (0x9fcb)
>  (CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])
>
> 𪜀 (0x2a700) - 𫜴 (0x2b734)
>  (CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])
>
> 𫝀 (0x2b740) - 𫠝 (0x2b81d)
>  (CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])
>
> The names should probably be generated, e.g. CJK UNIFIED IDEOGRAPH-2A700 ...
> etc.

I don't think we should fill those rather big ranges with generated
names, unless there's a standard for this. There are quite a
few ranges in the Unicode database that are assigned, but don't
have a literal name associated with them.

--
nosy: +lemburg
