[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Greg Price


Greg Price  added the comment:

> About the RSS memory, I'm not sure how Linux accounts for the Unicode 
> databases before they are accessed. Is it like read-only memory loaded on 
> demand when accessed?

It stands for "resident set size", as in "resident in memory"; and it only 
counts pages of real physical memory. The intention is to count up pages that 
the process is somehow using.

Where the definition potentially gets fuzzy is when this process and another 
are sharing some memory.  I don't know much about how that kind of edge case is 
handled.  But one thing it's pretty consistently good at is not counting pages 
that you've nominally mapped from a file but haven't actually forced to be 
loaded physically into memory by looking at them.

That is: say you ask for a file (or some range of it) to be mapped into memory 
for you.  This means it's now there in the address space, and if the process 
does a load instruction from any of those addresses, the kernel will ensure the 
load instruction works seamlessly.  But: most of it won't be eagerly read from 
disk or loaded physically into RAM.  Rather, the kernel's counting on that load 
instruction causing a page fault; and its page-fault handler will take care of 
reading from the disk and sticking the data physically into RAM.  So until you 
actually execute some loads from those addresses, the data in that mapping 
doesn't contribute to the genuine demand for scarce physical RAM on the 
machine; and it also isn't counted in the RSS number.
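(Throughout the demo below the RSS figure is read with `grep ^VmRSS` on /proc; 
the same read can be wrapped in a small Python helper. A Linux-only sketch, 
assuming the usual /proc/<pid>/status format:)

```python
import os

def rss_kib(pid=None):
    """Resident set size in kiB, parsed from /proc/<pid>/status (Linux-only).

    Reads the same VmRSS field the `grep ^VmRSS` commands below do.
    """
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("no VmRSS line found; not Linux?")
```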


Here's a demo!  This 262392 kiB (269 MB) Git packfile is the biggest file lying 
around in my CPython directory:

$ du -k .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack
262392  .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack


Open it for read -- adds 100 kiB, not sure why:

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, mmap
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9968 kB
>>> fd = os.open('.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack', os.O_RDONLY)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10068 kB


Map it into our address space -- RSS doesn't budge:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> m
<mmap.mmap object at 0x...>
>>> len(m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10068 kB


Cause the process to actually look at all the data (this takes about 10 
seconds)...

>>> sum(len(l) for l in m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  271576 kB

RSS goes way up, by 261508 kiB!  Oddly, that's slightly less (by ~1 MB) than 
the file's size.


But wait, there's more. Drop that mapping, and RSS goes right back down (OK, 
keeps 8 kiB extra):

>>> del m
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10076 kB

... and then map the exact same file again, and it's *still* down:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10076 kB

This last step is interesting because it's a certainty that the data is still 
physically in memory -- this is my desktop, with plenty of free RAM.  And it's 
even in our address space.  But because we haven't actually loaded from those 
addresses, it's still in memory only at the kernel's caching whim, and so 
apparently our process doesn't get "charged" or "blamed" for its presence there.
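(One way to watch this "charging" happen per mapping, rather than process-wide, 
is /proc/self/smaps, which breaks Rss down by mapping. A rough Linux-only 
sketch -- the parsing assumes the usual smaps layout, and the function name is 
just illustrative:)

```python
def mapping_rss_kib(path):
    """Sum the resident kiB of every mapping of `path` in this process.

    Each mapping in /proc/self/smaps starts with a header line
    ("addr-addr perms offset dev inode  path") followed by counter lines
    such as "Rss:  4 kB"; we total the Rss of the mappings of `path`.
    Linux-only.
    """
    total = 0
    in_target = False
    with open("/proc/self/smaps") as f:
        for line in f:
            fields = line.split()
            first = fields[0] if fields else ""
            if "-" in first and not first.endswith(":"):
                # Header line of a mapping; counters follow until the next one.
                in_target = line.rstrip().endswith(path)
            elif in_target and line.startswith("Rss:"):
                total += int(fields[1])
    return total
```

With this, the packfile mapping in the demo above can be seen starting near 
0 kiB resident, jumping after the sum(), and dropping back to near 0 on the 
fresh mapping.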


In the case of running an executable with a bunch of data in it, I expect that 
the bulk of the data (and of the code for that matter) winds up treated very 
much like the file contents we mmap'd in.  It's mapped but not eagerly 
physically loaded; so it doesn't contribute to the RSS number, nor to the 
genuine demand for scarce physical RAM on the machine.


That's a bit long :-), but hopefully informative.  In short, I think for us RSS 
should work well as a pretty faithful measure of the real memory consumption 
that we want to be frugal with.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Benjamin Peterson


Benjamin Peterson  added the comment:

It's also possible we're missing some logical compression opportunities by 
artificially partitioning the Unicode databases. Encoded optimally, the 
combined databases could very well take up less space than their raw sum 
suggests.
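(For a flavor of what "encoded optimally" can mean here: CPython's table 
generator already compresses each database with two-stage tables that store 
identical chunks once, and merged databases would give the chunk-sharing more 
to work with. A toy sketch of the technique -- names here are illustrative, not 
CPython's:)

```python
def split_table(values, shard=128):
    """Compress a flat lookup table into an (index, shards) pair.

    The table is cut into fixed-size shards; identical shards are stored
    once, and the index maps each position to its shard.  Long runs of
    equal values -- very common in Unicode property tables -- dedupe well.
    """
    shards, index, seen = [], [], {}
    for i in range(0, len(values), shard):
        chunk = tuple(values[i:i + shard])
        if chunk not in seen:
            seen[chunk] = len(shards)
            shards.append(chunk)
        index.append(seen[chunk])
    return index, shards

def lookup(index, shards, i, shard=128):
    """O(1) two-stage lookup equivalent to values[i]."""
    return shards[index[i // shard]][i % shard]
```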

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread STINNER Victor


STINNER Victor  added the comment:

Note: On Debian and Ubuntu, unicodedata is a built-in module; it's not built as 
a dynamic library. About the RSS memory, I'm not sure how Linux accounts for 
the Unicode databases before they are accessed. Is it like read-only memory 
loaded on demand when accessed?

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-14 Thread Greg Price


Greg Price  added the comment:

OK, I forked off the discussion of case-mapping as #37848. I think it's 
probably good to first sort out what we want before returning to how to 
implement it (if it's agreed that changes are desired).

Are there other areas of functionality that would be good to add to the core 
and that require data currently found only in unicodedata?

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Benjamin Peterson

Benjamin Peterson  added the comment:

The goal is to implement the locale-specific case mappings of 
https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of 
the Unicode 12 standard in str.lower/upper/casefold. To do this, you need 
access to certain character properties that are available in unicodedata but 
not in the built-in database.
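(To make the gap concrete: today's str already applies SpecialCasing.txt's 
unconditional rules and the context-dependent Final_Sigma rule, but none of the 
locale-tailored ones. A quick check -- behavior as observed on current CPython:)

```python
# Unconditional special casings from SpecialCasing.txt already work:
assert "ß".upper() == "SS"            # one-to-many mapping
assert "\u0130".lower() == "i\u0307"  # İ → i + COMBINING DOT ABOVE

# So does the context-dependent Final_Sigma rule:
assert "ΑΣ".lower() == "α\u03c2"      # word-final Σ → ς
assert "Σ".lower() == "\u03c3"        # non-final Σ → σ

# But the locale-tailored mappings (Turkish/Azeri, Lithuanian) are not
# applied -- str.lower() takes no locale at all:
assert "I".lower() == "i"             # Turkish would want "ı" (dotless i)
```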

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price


Greg Price  added the comment:

Speaking of improving functionality:

> Having unicodedata readily accessible to the str type would also permit a 
> higher-fidelity Unicode implementation. For example, implementing 
> language-tailored str.lower() requires having the canonical combining class 
> of a character available. This data lives only in unicodedata currently.

Benjamin, can you say more about the behavior you have in mind here? I don't 
entirely follow. (Is or should there be an issue for it?)
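(For reference while that gets sorted out: the combining-class data in question 
is reachable today only through the module, e.g.:)

```python
import unicodedata

# Canonical combining class (ccc): 0 marks a starter; nonzero values
# order combining marks during normalization.  The str type's built-in
# tables don't carry this property.
assert unicodedata.combining("a") == 0         # starter
assert unicodedata.combining("\u0327") == 202  # COMBINING CEDILLA
assert unicodedata.combining("\u0307") == 230  # COMBINING DOT ABOVE
```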

--
versions: +Python 3.9 -Python 3.8




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price

Greg Price  added the comment:

> Loading it dynamically reduces the memory footprint.

Ah, this is a good question to ask!

First, FWIW on my Debian buster desktop I get a smaller figure for `import 
unicodedata`: only 64 kiB.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9888 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9952 kB

But whether it's 64 kiB or 160 kiB, that's much smaller than the 1.1 MiB of the 
whole module.  Which makes sense -- there's no need to bring the whole thing 
into memory when we only import it, or generally to bring into memory the parts 
we aren't using.  I wouldn't expect that to change materially if the tables and 
algorithms were built in.

Here's another experiment: suppose we load everything that ast.c needs in order 
to handle non-ASCII identifiers.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9800 kB

>>> là = 3
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:  9864 kB

So that also comes to 64 kiB.

We wouldn't want to add 64 kiB to our memory use for no reason; but I think 64 
or 160 kiB is well within the range that's an acceptable cost if it gets us a 
significant simplification or improvement to core functionality, like Unicode.

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread STINNER Victor


STINNER Victor  added the comment:

Hum, I forgot to mention that the module is compiled as a dynamic library, at 
least on Fedora:

$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata
<module 'unicodedata' from '.../lib-dynload/unicodedata.cpython-37m-x86_64-linux-gnu.so'>

It's a big file: 1.1 MiB.

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread STINNER Victor


STINNER Victor  added the comment:

> This will remove awkward maneuvers like ast.c importing unicodedata in order 
> to perform normalization.

unicodedata is not needed by default: ast.c only imports unicodedata at the 
first non-ASCII identifier. If your application (and all its dependencies) use 
only ASCII identifiers, unicodedata is never loaded. Loading it dynamically 
reduces the memory footprint.
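(The deferred load described here -- ast.c does it in C via 
PyImport_ImportModule -- looks roughly like this in Python. A sketch; 
`normalize_identifier` is an illustrative name, not CPython's:)

```python
_unicodedata = None  # module stays unloaded until actually needed

def normalize_identifier(name):
    """NFKC-normalize an identifier, importing unicodedata lazily."""
    global _unicodedata
    if name.isascii():          # fast path: pure-ASCII names need no tables
        return name
    if _unicodedata is None:    # first non-ASCII identifier pays the cost
        import unicodedata
        _unicodedata = unicodedata
    return _unicodedata.normalize("NFKC", name)
```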

Raw measure on my Fedora 30 laptop:

$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10236 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS: 10396 kB

It uses 160 kiB of memory.

--




[issue32771] merge the underlying data stores of unicodedata and the str type

2019-08-13 Thread Greg Price


Change by Greg Price :


--
nosy: +Greg Price




[issue32771] merge the underlying data stores of unicodedata and the str type

2018-04-13 Thread Matej Cepl

Change by Matej Cepl :


--
nosy: +mcepl




[issue32771] merge the underlying data stores of unicodedata and the str type

2018-02-05 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

+1. And perhaps a new C API for direct access to the Unicode DB should be 
provided.

--
components: +Interpreter Core
nosy: +serhiy.storchaka




[issue32771] merge the underlying data stores of unicodedata and the str type

2018-02-04 Thread Benjamin Peterson

New submission from Benjamin Peterson :

Both Objects/unicodeobject.c and Modules/unicodedata.c rely on large generated 
databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, 
Modules/unicodedata_db.h). This separation made sense in Python 2, where 
Unicode was a less important part of the language than it is in Python 3 
(recall that Python 2's configure script had --without-unicode!). In Python 3, 
however, Unicode is a core language concept, literally baked into the syntax of 
the language. I therefore propose moving all of unicodedata's tables and 
algorithms into the interpreter core proper and converting 
Modules/unicodedata.c into a facade. This will remove awkward maneuvers like 
ast.c importing unicodedata in order to perform normalization. Having 
unicodedata readily accessible to the str type would also permit a 
higher-fidelity Unicode implementation. For example, implementing 
language-tailored str.lower() requires having the canonical combining class of 
a character available. This data currently lives only in unicodedata.
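(The normalization in question is NFKC: the parser folds equivalent identifier 
spellings together, which is exactly what drags unicodedata into ast.c. A quick 
illustration:)

```python
import unicodedata

# Identifiers are NFKC-normalized at parse time, so these two spellings
# name the same variable:
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"  # "ﬁle" → "file"

ﬁle = 1          # U+FB01 fi-ligature; normalizes to the identifier `file`
assert file == 1
```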

--
components: Unicode
messages: 311634
nosy: benjamin.peterson, ezio.melotti, vstinner
priority: normal
severity: normal
stage: needs patch
status: open
title: merge the underlying data stores of unicodedata and the str type
type: enhancement
versions: Python 3.8
