[issue6331] Add unicode script info to the unicode database

2019-08-27 Thread Greg Price


Change by Greg Price:


--
nosy: +Greg Price

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6331] Add unicode script info to the unicode database

2018-06-20 Thread STINNER Victor


STINNER Victor added the comment:

> Since June 2018, Unicode version 11.0 is out. Perhaps that could help move 
> this forward.

Python 3.7 has been upgraded to Unicode 11.

--




[issue6331] Add unicode script info to the unicode database

2018-06-20 Thread Pander


Pander added the comment:

Since June 2018, Unicode version 11.0 is out. Perhaps that could help move this 
forward.

--




[issue6331] Add unicode script info to the unicode database

2017-01-11 Thread Antoine Pitrou

Changes by Antoine Pitrou:


--
nosy:  -pitrou




[issue6331] Add unicode script info to the unicode database

2017-01-11 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


--
versions: +Python 3.7 -Python 3.6




[issue6331] Add unicode script info to the unicode database

2017-01-11 Thread Pander

Pander added the comment:

Any updates or ideas on how to move this forward? See also 
https://bugs.python.org/issue16684 Thanks.

--




[issue6331] Add unicode script info to the unicode database

2015-10-17 Thread Denis Jacquerye

Changes by Denis Jacquerye:


--
nosy: +Denis Jacquerye




[issue6331] Add unicode script info to the unicode database

2015-09-21 Thread Mark Lawrence

Changes by Mark Lawrence:


--
versions: +Python 3.6 -Python 3.4




[issue6331] Add unicode script info to the unicode database

2015-09-21 Thread Berker Peksag

Changes by Berker Peksag:


--
nosy: +berker.peksag




[issue6331] Add unicode script info to the unicode database

2015-09-21 Thread Cosimo Lupo

Cosimo Lupo added the comment:

I would very much like a `script()` function to be added to the built-in 
unicodedata module.
What's the current status of this issue?
Thanks.

Cosimo

--
nosy: +Cosimo Lupo




[issue6331] Add unicode script info to the unicode database

2014-09-02 Thread Elizabeth Myers

Elizabeth Myers added the comment:

> I think this needs to be fixed, then - we need to study why there are
> so many new records (e.g. what script contributes most new records),
> and then look for alternatives.

The Common script appears to be very fragmented and may be the cause of the 
issues.

> One alternative could be to create a separate Trie for scripts.

Not having seen the one in C yet, I have one written in Python, custom-made for 
storing the script database, based on the general idea of a range tree. It 
stores ranges individually straight out of Scripts.txt. The general idea is you 
take the average of the lower and upper bounds of a given range (they can be 
equal). When searching, you compare the codepoint value to the average in the 
present node, and use that to find which direction to search the tree in.

Without coalescing neighbouring ranges that are the same script, I have 1,606 
nodes in the tree (for Unicode 7.0, which added a lot of scripts). After 
coalescing, there appear to be 806 nodes.
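
The range-tree idea described above might look roughly like the sketch below.
This is not the code offered in this comment (which was never posted here);
the names Node, coalesce, build and lookup are hypothetical, and RANGES is a
small hand-picked sample in the style of Scripts.txt, not the full file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lo: int
    hi: int
    script: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def coalesce(ranges):
    """Merge neighbouring ranges with the same script (input must be sorted)."""
    merged = []
    for lo, hi, script in ranges:
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == lo:
            merged[-1] = (merged[-1][0], hi, script)
        else:
            merged.append((lo, hi, script))
    return merged


def build(ranges):
    """Build a balanced tree from sorted, non-overlapping ranges."""
    if not ranges:
        return None
    mid = len(ranges) // 2
    lo, hi, script = ranges[mid]
    return Node(lo, hi, script, build(ranges[:mid]), build(ranges[mid + 1:]))


def lookup(node, cp):
    """Return the script for code point cp, or 'Unknown' if no range matches."""
    while node is not None:
        if node.lo <= cp <= node.hi:
            return node.script
        # Compare against the average of the node's bounds to pick a direction.
        node = node.left if cp < (node.lo + node.hi) / 2 else node.right
    return "Unknown"


# Hand-picked sample ranges (assumption: illustrative subset only).
RANGES = [
    (0x0000, 0x001F, "Common"),
    (0x0020, 0x0040, "Common"),
    (0x0041, 0x005A, "Latin"),
    (0x005B, 0x0060, "Common"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x0373, "Greek"),
]

tree = build(coalesce(RANGES))
```

With this sample, coalescing merges the first two Common ranges, and
lookup(tree, ord("A")) walks at most a few nodes before returning a script.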

If anyone cares, I'll be more than happy to post code for inspection.

> I don't know what this will be used for, but one application is
> certainly regular expressions. So we need an efficient test whether
> the character is in the expected script or not. It would be bad if
> such a test would have to do a .lower() on each lookup.

This is actually required for restriction-level detection as described in 
Unicode TR39, for all levels of restriction above ASCII-only 
(http://www.unicode.org/reports/tr39/#Restriction_Level_Detection).

--
nosy: +Elizacat

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue6331
___



[issue6331] Add unicode script info to the unicode database

2014-03-23 Thread Pander

Pander added the comment:

I see the patch supports Unicode scripts
(https://en.wikipedia.org/wiki/Script_%28Unicode%29), but I am also interested
in support for Unicode blocks (https://en.wikipedia.org/wiki/Unicode_block).

Code supporting the latter is at https://github.com/nagisa/unicodeblocks

I could not quite make out whether the patch also supports Unicode blocks. If
not, should that be requested in a separate issue?

Furthermore, support for Unicode scripts and blocks should be updated each time
a new version of the Unicode standard is published. Someone should check
whether the latest patch needs to be updated to the latest version of Unicode,
not only for this issue but for each release of Python.

--




[issue6331] Add unicode script info to the unicode database

2014-03-23 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Adding support for blocks should indeed go into a separate issue. Your code for
that is not suitable, as it should integrate with the existing
makeunicodedata.py script, which your code does not.

And yes, of course, we update (nearly) all Unicode data in Python automatically
from the files provided by the Unicode Consortium.

--




[issue6331] Add unicode script info to the unicode database

2014-03-20 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Pander: In what way would this extend or improve the current patch?

--




[issue6331] Add unicode script info to the unicode database

2013-02-10 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +benjamin.peterson, haypo, lemburg, pitrou




[issue6331] Add unicode script info to the unicode database

2012-12-14 Thread Pander

Pander added the comment:

Please, also consider reviewing functionality offered by:
  http://pypi.python.org/pypi/unicodescript/
and
  http://pypi.python.org/pypi/unicodeblocks/
which could be used to improve and extend the proposed patch.

--
nosy: +PanderMusubi




[issue6331] Add unicode script info to the unicode database

2012-12-14 Thread Pander

Pander added the comment:

The latest version of the respective sources can be found here:
  https://github.com/ConradIrwin/unicodescript
and here:
  https://github.com/simukis/unicodeblocks

--




[issue6331] Add unicode script info to the unicode database

2012-09-26 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
versions: +Python 3.4 -Python 3.2




[issue6331] Add unicode script info to the unicode database

2010-07-21 Thread Mark Lawrence

Mark Lawrence breamore...@yahoo.co.uk added the comment:

Could someone with unicode knowledge take this review on, given that comments 
have already been made and responded to?

--
nosy: +BreamoreBoy
versions: +Python 3.2 -Python 2.7




[issue6331] Add unicode script info to the unicode database

2009-07-24 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
keywords: +needs review
stage:  -> patch review




[issue6331] Add unicode script info to the unicode database

2009-07-01 Thread Walter Dörwald

Walter Dörwald wal...@livinglogic.de added the comment:

Here is a new version that includes a new function scriptl() that
returns the script name in lowercase.

--
Added file: http://bugs.python.org/file14418/unicode-script-3.diff




[issue6331] Add unicode script info to the unicode database

2009-06-25 Thread Walter Dörwald

Walter Dörwald wal...@livinglogic.de added the comment:

I was comparing apples and oranges: the 229 entries for the trunk were
for a UCS2 build (the patched version was UCS4); with UCS4 there are
317 entries for the trunk.

size unicodedata.o gives:

__TEXT   __DATA   __OBJC   others   dec      hex
13622    587057   0        23811    624490   9876a

for trunk

and

__TEXT   __DATA   __OBJC   others   dec      hex
17769    588817   0        24454    631040   9a100

for the patched version.
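
To put the two reports side by side, the per-segment growth can be worked out
with a few lines of Python. This is only arithmetic on the figures quoted
above; the variable names are made up for the illustration.

```python
# Figures copied from the two `size unicodedata.o` reports above.
trunk = {"__TEXT": 13622, "__DATA": 587057, "__OBJC": 0, "others": 23811}
patched = {"__TEXT": 17769, "__DATA": 588817, "__OBJC": 0, "others": 24454}

# Growth of each segment, and of the object file as a whole.
growth = {name: patched[name] - trunk[name] for name in trunk}
total_trunk = sum(trunk.values())
total_patched = sum(patched.values())
total_growth = total_patched - total_trunk

for name, delta in growth.items():
    print(f"{name:8} +{delta}")
print(f"total    +{total_growth} ({100 * total_growth / total_trunk:.2f}%)")
```

The totals match the "dec" column of both reports, and the overall increase
comes to 6550 bytes, roughly 1% of the trunk size.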

--




[issue6331] Add unicode script info to the unicode database

2009-06-24 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
priority:  -> normal




[issue6331] Add unicode script info to the unicode database

2009-06-24 Thread Walter Dörwald

Walter Dörwald wal...@livinglogic.de added the comment:

Martin v. Löwis wrote:
> Martin v. Löwis mar...@v.loewis.de added the comment:
>
> I think the patch is incorrect: the default value for the script
> property ought to be Unknown, not Common (despite UCD.html saying the
> contrary; see UTR#24 and Scripts.txt).

Fixed.

> I'm puzzled why you use a hard-coded list of script names. The set of
> scripts will certainly change across Unicode versions, and I think it
> would be better to learn the script names from Scripts.txt.

I hardcoded the list, because I saw no easy way to get the indexes
consistent across both versions of the database.

> Out of curiosity: how does the addition of the script property affect
> the number of distinct database records, and the total size of the database?

I'm not exactly sure how to measure this, but the length of
_PyUnicode_Database_Records goes from 229 entries to 690 entries.

If it's any help I can post the output of makeunicodedata.py.

> I think a common application would be lower-cased script names, for more
> efficient comparison; UCD has also changed the spelling of the script
> names over time (from being all-capital before). So I propose that
> a) two functions are provided: one with the original script names, and
> one with the lower-case script names

Is this really necessary, if we only have one version of the database?

> b) keep cached versions of interned script name strings in separate
> arrays, to avoid PyString_FromString every time.

Implemented.

> I'm doubtful that script names need to be provided for old database
> versions, so I would be happy to not record the script for old versions,
> and raise an exception if somebody tries to get the script for an old
> database version - surely applications of the old database records won't
> be accessing the script property, anyway.

OK, I've removed the script_changes info for the old database. (And with
this change the list of script names is no longer hardcoded).

Here's a new version of the patch (unicode-script-2.diff).

--




[issue6331] Add unicode script info to the unicode database

2009-06-24 Thread Walter Dörwald

Changes by Walter Dörwald wal...@livinglogic.de:


Added file: http://bugs.python.org/file14356/unicode-script-2.diff




[issue6331] Add unicode script info to the unicode database

2009-06-24 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

>> I'm puzzled why you use a hard-coded list of script names. The set of
>> scripts will certainly change across Unicode versions, and I think it
>> would be better to learn the script names from Scripts.txt.
>
> I hardcoded the list, because I saw no easy way to get the indexes
> consistent across both versions of the database.

Couldn't you have a global cache, something like

scripts = ['Unknown']

def findscript(script):
    # Return the index of a script name, registering it on first use.
    try:
        return scripts.index(script)
    except ValueError:
        scripts.append(script)
        return len(scripts) - 1
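
As a rough illustration of how such a cache could be fed from Scripts.txt,
here is a self-contained sketch. It is not taken from the patch: parse_scripts
and LINE are hypothetical names, a dict replaces list.index() for the lookup,
and SAMPLE is a few hand-abbreviated lines in the Scripts.txt layout rather
than the real file.

```python
import re

# Matches data lines such as "0041..005A    ; Latin # L&  [26] ..."
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(lines):
    """Return (names, ranges); names are collected in order of first use."""
    names = ["Unknown"]        # index 0 is the default script
    index = {"Unknown": 0}     # name -> position in names
    ranges = []
    for line in lines:
        m = LINE.match(line)
        if m is None:          # comment or blank line
            continue
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        script = m.group(3)
        if script not in index:
            index[script] = len(names)
            names.append(script)
        ranges.append((first, last, index[script]))
    return names, ranges


# Abbreviated sample in the Scripts.txt layout (assumption: not the full file).
SAMPLE = """\
# Scripts-style sample
0000..001F    ; Common # Cc  [32] <control>..<control>
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
""".splitlines()

names, ranges = parse_scripts(SAMPLE)
```

Because names are appended in the order they are first seen, the indexes stay
stable for as long as the same input file is processed in the same order.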

>> Out of curiosity: how does the addition of the script property affect
>> the number of distinct database records, and the total size of the database?
>
> I'm not exactly sure how to measure this, but the length of
> _PyUnicode_Database_Records goes from 229 entries to 690 entries.

I think this needs to be fixed, then - we need to study why there are
so many new records (e.g. what script contributes most new records),
and then look for alternatives.

One alternative could be to create a separate Trie for scripts.

I'd also be curious if we can increase the homogeneity of scripts
(i.e. produce longer runs of equal scripts) if we declare that
unassigned code points have the script that corresponds to the block
(i.e. the script that surrounding characters have), and then only
change it to Unknown at lookup time if it's unassigned.
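
A small sketch of that idea, under simplifying assumptions: a gap is only
swallowed when the ranges on both sides have the same script (rather than
consulting block data), the lookup is a linear scan, and "unassigned" is
taken to mean general category Cn in the running Python's unicodedata. The
names fill_gaps and script_of are hypothetical.

```python
import unicodedata


def fill_gaps(ranges):
    """Merge consecutive ranges of one script, swallowing the gaps between them."""
    filled = []
    for lo, hi, script in ranges:
        if filled and filled[-1][2] == script:
            filled[-1] = (filled[-1][0], hi, script)
        else:
            filled.append((lo, hi, script))
    return filled


def script_of(cp, ranges):
    """Look up a script, reporting 'Unknown' for unassigned code points."""
    if unicodedata.category(chr(cp)) == "Cn":   # unassigned
        return "Unknown"
    for lo, hi, script in ranges:               # linear scan keeps the sketch short
        if lo <= cp <= hi:
            return script
    return "Unknown"


# Greek ranges separated by the unassigned U+038B, U+038D and U+03A2.
GREEK = [
    (0x0388, 0x038A, "Greek"),
    (0x038C, 0x038C, "Greek"),
    (0x038E, 0x03A1, "Greek"),
    (0x03A3, 0x03E1, "Greek"),
]

filled = fill_gaps(GREEK)
```

Here four stored ranges collapse into one run, while script_of(0x038B, filled)
still answers "Unknown" because the check for unassigned code points happens
at lookup time.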

> If it's any help I can post the output of makeunicodedata.py.

I'd be interested in the output of size unicodedata.so, and how it changes.
Perhaps the actual size increase isn't that bad.

>> a) two functions are provided: one with the original script names, and
>> one with the lower-case script names
>
> Is this really necessary, if we only have one version of the database?

I don't know what this will be used for, but one application is
certainly regular expressions. So we need an efficient test whether
the character is in the expected script or not. It would be bad if
such a test would have to do a .lower() on each lookup.
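
In Python terms, the point could be sketched as follows. This is only an
illustration of precomputing both spellings once; the real patch works in C on
characters, whereas the hypothetical script() and scriptl() below take a
script index, and SCRIPT_NAMES is a small sample subset.

```python
import sys

SCRIPT_NAMES = ("Unknown", "Common", "Latin", "Greek")   # sample subset

# Build both spellings once, so no lookup ever has to call .lower().
NAMES = tuple(sys.intern(name) for name in SCRIPT_NAMES)
NAMES_LOWER = tuple(sys.intern(name.lower()) for name in SCRIPT_NAMES)


def script(index):
    """Return the script name in its original spelling."""
    return NAMES[index]


def scriptl(index):
    """Return the cached lower-case script name."""
    return NAMES_LOWER[index]
```

Every call hands back the same cached string object, so a caller such as a
regular expression engine would not pay for a new string per lookup.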

--




[issue6331] Add unicode script info to the unicode database

2009-06-23 Thread Walter Dörwald

New submission from Walter Dörwald wal...@livinglogic.de:

This patch adds a function unicodedata.script() that returns information
about the script of the Unicode character.

--
components: Unicode
files: unicode-script.diff
keywords: patch
messages: 89642
nosy: doerwalter
severity: normal
status: open
title: Add unicode script info to the unicode database
type: feature request
versions: Python 2.7
Added file: http://bugs.python.org/file14348/unicode-script.diff




[issue6331] Add unicode script info to the unicode database

2009-06-23 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

I think the patch is incorrect: the default value for the script
property ought to be Unknown, not Common (despite UCD.html saying the
contrary; see UTR#24 and Scripts.txt).

I'm puzzled why you use a hard-coded list of script names. The set of
scripts will certainly change across Unicode versions, and I think it
would be better to learn the script names from Scripts.txt.

Out of curiosity: how does the addition of the script property affect
the number of distinct database records, and the total size of the database?

I think a common application would be lower-cased script names, for more
efficient comparison; UCD has also changed the spelling of the script
names over time (from being all-capital before). So I propose that
a) two functions are provided: one with the original script names, and
one with the lower-case script names
b) keep cached versions of interned script name strings in separate
arrays, to avoid PyString_FromString every time.

I'm doubtful that script names need to be provided for old database
versions, so I would be happy to not record the script for old versions,
and raise an exception if somebody tries to get the script for an old
database version - surely applications of the old database records won't
be accessing the script property, anyway.

--
nosy: +loewis




[issue6331] Add unicode script info to the unicode database

2009-06-23 Thread Akira Kitada

Changes by Akira Kitada akit...@gmail.com:


--
nosy: +akitada
