Re: [Python-ideas] Move optional data out of pyc files

2018-04-16 Thread Brett Cannon
On Sat, 14 Apr 2018 at 17:01 Neil Schemenauer 
wrote:

> On 2018-04-12, M.-A. Lemburg wrote:
> > This leaves the proposal to restructure pyc files into a sectioned
> > file and possibly indexed file to make access to (lazily) loaded
> > parts faster.
>
> I would like to see a format that can hold one or more modules in a
> single file.  Something like the zip format but optimized for fast
> interpreter startup time.  It should support lazy loading of module
> parts (e.g. maybe my lazy bytecode execution idea[1]).  Obviously a
> lot of details to work out.
>

Eric Snow, Barry Warsaw, and I chatted about a custom file format for
holding Python source (and data files). My notes on the chat can be found
at
https://notebooks.azure.com/Brett/libraries/design-ideas/html/Python%20source%20archive%20file%20format.ipynb
. (And since we aren't trying to rewrite bytecode, we figured it wouldn't
break your proposal, Neil ;) )

-Brett


>
> The design should also take into account the widespread use of
> virtual environments.  So, it should be easy and space efficient to
> build virtual environments using this format (e.g. maybe allow
> overlays so that stdlib package is not copied into virtual
> environment, virtual packages would be overlaid on stdlib file).
> Also, should be easy to bundle all modules into a "uber" package and
> append it to the Python executable.  CPython should provide
> out-of-box support for single-file executables.
>
>
> 1. https://github.com/python/cpython/pull/6194
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Move optional data out of pyc files

2018-04-14 Thread Neil Schemenauer
On 2018-04-12, M.-A. Lemburg wrote:
> This leaves the proposal to restructure pyc files into a sectioned
> file and possibly indexed file to make access to (lazily) loaded
> parts faster.

I would like to see a format that can hold one or more modules in a
single file.  Something like the zip format but optimized for fast
interpreter startup time.  It should support lazy loading of module
parts (e.g. maybe my lazy bytecode execution idea[1]).  Obviously a
lot of details to work out.

The design should also take into account the widespread use of
virtual environments.  So, it should be easy and space efficient to
build virtual environments using this format (e.g. maybe allow
overlays so that stdlib package is not copied into virtual
environment, virtual packages would be overlaid on stdlib file).
Also, should be easy to bundle all modules into a "uber" package and
append it to the Python executable.  CPython should provide
out-of-box support for single-file executables.


1. https://github.com/python/cpython/pull/6194


Re: [Python-ideas] Move optional data out of pyc files

2018-04-13 Thread George Fischhof
2018-04-11 2:03 GMT+02:00 Steven D'Aprano :
[snip]


> I shouldn't think that the number of files on disk is very important,
> now that they're hidden away in the __pycache__ directory where they can
> be ignored by humans. Even venerable old FAT32 has a limit of 65,534
> files in a single folder, and 268,435,437 on the entire volume. So
> unless the std lib expands to 16000+ modules, the number of files in the
> __pycache__ directory ought to be well below that limit.
>
[snip]

Hi all,

Just for information for everyone:
I was a VMS system manager more than a decade ago, and I know that Win NT
(at least the core) was developed by a former VMS engineer. NTFS was created
on the basis of the Files-11 (Files-11B) file system. In both file systems
the directory is a tree (in Files-11 it is a B-tree; in NTFS it may be a
different kind of tree, but still a tree) that holds the files ordered
alphabetically. If there are too many files, accessing them becomes slower
(check, for example, the windows\system32 folder).

Of course it does not matter if there are a few hundred or 1-2 thousand files.
But beyond that, the number of files does matter.

I did a little measurement (I intentionally did not use functions, so as not
to distort the result):



import os
import time

# Create the scratch directory; ignore the error if it already exists.
try:
    os.mkdir('tmp_thousands_of_files')
except FileExistsError:
    pass

# 1. Time the creation of a single file in the (almost) empty directory.
name1 = 10001

start = time.time()
file_name = 'tmp_thousands_of_files/' + str(name1)
f = open(file_name, 'w')
f.write('aaa')
f.close()

stop = time.time()

file_time = stop-start

print(f'one file time {file_time} \n {start} \n {stop}')


# 2. Fill the directory with 10k more files (names 10002 .. 20001).
for i in range(10002, 20002):
    file_name = 'tmp_thousands_of_files/' + str(i)
    f = open(file_name, 'w')
    f.write('aaa')
    f.close()


# 3. Time a file whose name sorts before all existing names.
name2 = 1

start = time.time()
file_name = 'tmp_thousands_of_files/' + str(name2)
f = open(file_name, 'w')
f.write('aaa')
f.close()

stop = time.time()

file_time = stop-start
print(f'after 10k, name before {file_time} \n {start} \n {stop}')


# 4. Time a file whose name sorts after all existing names.
name3 = 20010

start = time.time()
file_name = 'tmp_thousands_of_files/' + str(name3)
f = open(file_name, 'w')
f.write('aaa')
f.close()

stop = time.time()

file_time = stop-start
print(f'after 10k, name after {file_time} \n {start} \n {stop}')

"""
result

c:\>python several_files_in_one_folder.py
one file time 0.0
 1523476699.5144918
 1523476699.5144918
after 10k, name before 0.015625953674316406
 1523476714.622918
 1523476714.6385438
after 10k, name after 0.0
 1523476714.6385438
 1523476714.6385438
"""


used: Python 3.6.1, windows 8.1, SSD drive

As you can see, when there is an insertion at the beginning of the tree, it
is much slower than adding at the end. (Yes, I know list insertion is
slow as well, but I once saw a VMS directory with 50k files where the dir
command printed 5-10 files, then waited some seconds before the next 5-10
files ... ;-) )
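
A caveat on reading the numbers above: each case is a single sample taken
with time.time(), and the 0.0 and 0.0156 values may simply reflect a coarse
clock tick rather than a real difference between the two cases. Below is a
rough sketch of the same experiment with a high-resolution clock and several
trials per case. The helper names (time_create, measure) and the file naming
scheme are made up for this illustration; it has not been run on the
Windows/NTFS setup described above.

```python
import os
import statistics
import tempfile
import time


def time_create(directory, name):
    """Return the seconds taken to create one small file."""
    path = os.path.join(directory, str(name))
    start = time.perf_counter()  # high-resolution, monotonic clock
    with open(path, "w") as f:
        f.write("aaa")
    return time.perf_counter() - start


def measure(n_existing=10_000, trials=20):
    """Median creation time for names sorting before/after existing files."""
    with tempfile.TemporaryDirectory() as directory:
        # Pre-populate the directory; all names start with "m".
        for i in range(n_existing):
            time_create(directory, f"m{i:06d}")
        # Names starting with "a" sort before the existing files,
        # names starting with "z" sort after them.
        before = [time_create(directory, f"a{i:03d}") for i in range(trials)]
        after = [time_create(directory, f"z{i:03d}") for i in range(trials)]
    return {
        "before": statistics.median(before),
        "after": statistics.median(after),
    }
```

Calling measure() returns a dict with the two medians in seconds; comparing
medians over several trials should be less sensitive to clock granularity
and caching effects than a single sample.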


BR,
George


Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread Giampaolo Rodola'
On Fri, 13 Apr 2018 at 03:47, M.-A. Lemburg  wrote:

> I think moving data out of pyc files is going in a wrong direction:
> more stat calls means slower import and slower startup time.
>
> Trying to make pycs smaller also isn't really worth it (they
> compress quite well).
>
> Saving memory could be done by disabling reading objects lazily
> from the file - without removing anything from the pyc file.
> Whether the few 100kB RAM this saves is worth the effort depends
> on the application space.
>
> This leaves the proposal to restructure pyc files into a sectioned
> file and possibly indexed file to make access to (lazily) loaded
> parts faster.


+1. With this in place -O and -OO cmdline options would become even less
useful (which is good).

-- 
Giampaolo - http://grodola.blogspot.com


Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread M.-A. Lemburg
I think moving data out of pyc files is going in a wrong direction:
more stat calls means slower import and slower startup time.

Trying to make pycs smaller also isn't really worth it (they
compress quite well).

Saving memory could be done by disabling reading objects lazily
from the file - without removing anything from the pyc file.
Whether the few 100kB RAM this saves is worth the effort depends
on the application space.

This leaves the proposal to restructure pyc files into a sectioned
file and possibly indexed file to make access to (lazily) loaded
parts faster. More structure would add ways to more easily
update the content going forward (similar to how PE executable files
are structured) and allow us to get rid of extra pyc file
variants (e.g. for special optimized versions). So that's an
interesting approach :-)
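
To make the "sectioned and indexed" idea concrete, here is a minimal sketch
of such a container. The layout (a section count, an index of
name/offset/length entries, then the payloads) and the function names are
assumptions invented for this example; this is not the current pyc format
nor a concrete proposal from this thread.

```python
import io
import struct

# Hypothetical layout:
#   <u32 count>, then per section <u16 name_len><name><u32 offset><u32 length>,
#   followed by the raw section payloads.


def write_sections(sections):
    """Serialize a dict of name -> bytes into one indexed blob."""
    names = [name.encode("utf-8") for name in sections]
    index_size = 4 + sum(2 + len(raw) + 8 for raw in names)
    index = [struct.pack("<I", len(sections))]
    body = []
    offset = index_size  # payload offsets are absolute file positions
    for raw_name, payload in zip(names, sections.values()):
        index.append(
            struct.pack("<H", len(raw_name))
            + raw_name
            + struct.pack("<II", offset, len(payload))
        )
        body.append(payload)
        offset += len(payload)
    return b"".join(index + body)


def read_index(f):
    """Read only the index; the payloads are not touched."""
    (count,) = struct.unpack("<I", f.read(4))
    index = {}
    for _ in range(count):
        (name_len,) = struct.unpack("<H", f.read(2))
        name = f.read(name_len).decode("utf-8")
        index[name] = struct.unpack("<II", f.read(8))
    return index


def read_section(f, index, name):
    """Seek straight to one section and read just its bytes."""
    offset, length = index[name]
    f.seek(offset)
    return f.read(length)


blob = write_sections({"code": b"\x01\x02", "docstrings": b"hello", "lnotab": b"\x00"})
f = io.BytesIO(blob)
index = read_index(f)
docstrings = read_section(f, index, "docstrings")
```

With an index like this, a loader could read the code section at import time
and fetch the other sections only when something asks for them.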

BTW: In all this, please remember that quite a few applications
do use doc strings as part of the code, not only for documentation.
Most prominent are probably parsers which keep the parsing
definitions in doc strings.

On 12.04.2018 20:32, Daniel Moisset wrote:
> I've been playing a bit with this trying to collect some data and
> measure how useful this would be. You can take a look at the script I'm
> using at: https://github.com/dmoisset/pycstats 
> 
> What I'm measuring is:
> 1. Number of objects in the pyc, and how many of those are:
>    * docstrings (I'm using a heuristic here which I'm not 100% sure it
> is correct)
>    * lnotabs
>    * Duplicate objects; these have not been discussed in this thread
> before but are another source of optimization I noticed while writing
> this. Essentially I'm referring to immutable constants that are instanced
> more than once and could be shared. You can also measure the effect of
> this optimization across modules and within a single module[1]
> 2. Bytes used in memory by the categories above (sum of sys.getsizeof()
> for each category).
> 
> I'm not measuring anything related to annotations because, as I
> mentioned before, they are generated piecemeal by executable bytecode so
> they are hard to separate
> 
> Running this on my python 3.6 pyc cache I get:
> 
> $ find /usr/lib/python3.6 -name '*.pyc' |xargs python3.6 pycstats.py 
> 8645 docstrings, 1705441B
> 19060 lineno tables, 941702B
> 59382/202898 duplicate objects for 3101287/18582807 memory size
> 
> So this means around ~10% of the memory used after loading is used for
> docstrings, ~5% for lnotabs, and ~15% for objects that could be shared.
> The sharing assumes we can share between modules, but even doing it
> within modules, you can get to ~7%. 
> 
> In short, this could mean a 25%-35% reduction in memory use for code
> objects if the stdlib is a good benchmark.
> 
> Best,
> D.
> 
> [1] Regarding duplicates, I've found some unexpected things within
> loaded code objects, for example instances of the small integer "1" with
> different id() than the singleton that cpython normally uses for "1",
> although most duplicates are some small strings, tuples with argument
> names, or . Something that could be interesting to write is a "pyc
> optimizer" that removes duplicates, this should be a gain at a minimal
> preprocessing cost.
> 
> 
> On 12 April 2018 at 15:16, Daniel Moisset  > wrote:
> 
> One implementation difficulty specifically related to annotations,
> is that they are quite hard to find/extract from the code objects.
> Both docstrings and lnotab are within specific fields of the code
> object for their function/class/module; annotations are spread as
> individual constants (assuming PEP 563), which are loaded in
> bytecode through separate LOAD_CONST statements before creating the
> function object, and that can happen in the middle of bytecode for
> the higher level object (the module or class containing a function
> definition). So the change for achieving that will be more
> significant than just "add a couple of descriptors to function
> objects and change the module marshalling code".
> 
> Probably making annotations fit a single structure that can live in
> co_consts could make this change easier, and also make startup of
> annotated modules faster (because you just load a single constant
> instead of one per argument), this might be a valuable change by itself.
> 
> 
> 
> On 12 April 2018 at 11:48, INADA Naoki  > wrote:
> 
> > Finally, loading docstrings and other optional components can be 
> made lazy.
> > This was not in my original idea, and this will significantly 
> complicate the
> > implementation, but in principle it is possible. This will require 
> larger
> > changes in the marshal format and bytecode.
> 
> I'm +1 on this idea.
> 
> * New pyc format has code section (same to current) and text
> section.
> text 

Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread Daniel Moisset
I've been playing a bit with this trying to collect some data and measure
how useful this would be. You can take a look at the script I'm using at:
https://github.com/dmoisset/pycstats

What I'm measuring is:
1. Number of objects in the pyc, and how many of those are:
   * docstrings (I'm using a heuristic here which I'm not 100% sure is
correct)
   * lnotabs
   * Duplicate objects; these have not been discussed in this thread before
but are another source of optimization I noticed while writing this.
Essentially I'm referring to immutable constants that are instantiated more
than once and could be shared. You can also measure the effect of this
optimization across modules and within a single module[1]
2. Bytes used in memory by the categories above (sum of sys.getsizeof() for
each category).

I'm not measuring anything related to annotations because, as I mentioned
before, they are generated piecemeal by executable bytecode so they are
hard to separate.

Running this on my python 3.6 pyc cache I get:

$ find /usr/lib/python3.6 -name '*.pyc' |xargs python3.6 pycstats.py
8645 docstrings, 1705441B
19060 lineno tables, 941702B
59382/202898 duplicate objects for 3101287/18582807 memory size

So this means around ~10% of the memory used after loading is used for
docstrings, ~5% for lnotabs, and ~15% for objects that could be shared. The
sharing assumes we can share between modules, but even doing it within
modules, you can get to ~7%.

In short, this could mean a 25%-35% reduction in memory use for code
objects if the stdlib is a good benchmark.
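
For reference, the percentages can be recomputed from the raw byte counts in
the pycstats output above. This is only the arithmetic, using the figures as
posted; the variable names are made up for the example. The shares come out
at roughly 9%, 5% and 17%, or about 31% combined when sharing across modules
is assumed.

```python
# Byte counts copied from the pycstats output quoted above.
total_bytes = 18582807
docstring_bytes = 1705441
lnotab_bytes = 941702
duplicate_bytes = 3101287


def share(part, whole=total_bytes):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


docstring_pct = share(docstring_bytes)
lnotab_pct = share(lnotab_bytes)
duplicate_pct = share(duplicate_bytes)
# Upper bound: all three categories removed, duplicates shared across modules.
combined_pct = share(docstring_bytes + lnotab_bytes + duplicate_bytes)

print(docstring_pct, lnotab_pct, duplicate_pct, combined_pct)
```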

Best,
D.

[1] Regarding duplicates, I've found some unexpected things within loaded
code objects, for example instances of the small integer "1" with different
id() than the singleton that cpython normally uses for "1", although most
duplicates are some small strings, tuples with argument names, or .
Something that could be interesting to write is a "pyc optimizer" that
removes duplicates, this should be a gain at a minimal preprocessing cost.
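
As an illustration of what counting duplicates can mean, here is a small
sketch. It is not the pycstats code; iter_constants and count_duplicates are
names invented for this example, and keying on (type, value) is a simple
heuristic that may differ from what the script actually does.

```python
import types


def iter_constants(code):
    """Yield every non-code constant reachable from a code object."""
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_constants(const)  # descend into nested functions/classes
        else:
            yield const


def count_duplicates(objects):
    """Count objects equal to an earlier one (same type) but a distinct instance."""
    first_seen = {}
    duplicates = 0
    for obj in objects:
        try:
            original = first_seen.setdefault((type(obj), obj), obj)
        except TypeError:
            continue  # unhashable object: ignore it in this sketch
        if original is not obj:
            duplicates += 1
    return duplicates


# Two equal tuples built at runtime are separate objects in memory.
a = tuple([1000, 2000])
b = tuple([1000, 2000])
n_duplicates = count_duplicates([a, b, a])

code = compile("def f():\n    return 'spam and eggs'\n", "<demo>", "exec")
constants = list(iter_constants(code))
```

Feeding iter_constants() output from many unmarshalled code objects into
count_duplicates() would give a rough cross-module figure of the kind
reported above.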


On 12 April 2018 at 15:16, Daniel Moisset  wrote:

> One implementation difficulty specifically related to annotations, is that
> they are quite hard to find/extract from the code objects. Both docstrings
> and lnotab are within specific fields of the code object for their
> function/class/module; annotations are spread as individual constants
> (assuming PEP 563), which are loaded in bytecode through separate
> LOAD_CONST statements before creating the function object, and that can
> happen in the middle of bytecode for the higher level object (the module or
> class containing a function definition). So the change for achieving that
> will be more significant than just "add a couple of descriptors to function
> objects and change the module marshalling code".
>
> Probably making annotations fit a single structure that can live in
> co_consts could make this change easier, and also make startup of annotated
> modules faster (because you just load a single constant instead of one per
> argument), this might be a valuable change by itself.
>
>
>
> On 12 April 2018 at 11:48, INADA Naoki  wrote:
>
>> > Finally, loading docstrings and other optional components can be made
>> lazy.
>> > This was not in my original idea, and this will significantly
>> complicate the
>> > implementation, but in principle it is possible. This will require
>> larger
>> > changes in the marshal format and bytecode.
>>
>> I'm +1 on this idea.
>>
>> * New pyc format has code section (same to current) and text section.
>> text section stores UTF-8 strings and not loaded at import time.
>> * Function annotation (only when PEP 563 is used) and docstring are
>> stored as integer, point to offset in the text section.
>> * When type.__doc__, PyFunction.__doc__, PyFunction.__annotation__ are
>> integer, text is loaded from the text section lazily.
>>
>> PEP 563 will reduce some startup time, but __annotation__ is still
>> dict.  Memory overhead is negligible.
>>
>> In [1]: def foo(a: int, b: int) -> int:
>>    ...:     return a + b
>>    ...:
>>    ...:
>>
>> In [2]: import sys
>> In [3]: sys.getsizeof(foo)
>> Out[3]: 136
>>
>> In [4]: sys.getsizeof(foo.__annotations__)
>> Out[4]: 240
>>
>> When PEP 563 is used, there are no side effect while building the
>> annotation.
>> So the annotation can be serialized in text, like
>> {"a":"int","b":"int","return":"int"}.
>>
>> This change will require new pyc format, and descriptor for
>> PyFunction.__doc__, PyFunction.__annotation__
>> and type.__doc__.
>>
>> Regards,
>>
>> --
>> INADA Naoki  
>>
>
>
>
> --
> Daniel F. Moisset - UK Country Manager - Machinalis Limited
> www.machinalis.co.uk 
> Skype: 

Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread Daniel Moisset
One implementation difficulty specifically related to annotations, is that
they are quite hard to find/extract from the code objects. Both docstrings
and lnotab are within specific fields of the code object for their
function/class/module; annotations are spread as individual constants
(assuming PEP 563), which are loaded in bytecode through separate
LOAD_CONST statements before creating the function object, and that can
happen in the middle of bytecode for the higher level object (the module or
class containing a function definition). So the change for achieving that
will be more significant than just "add a couple of descriptors to function
objects and change the module marshalling code".

Probably making annotations fit a single structure that can live in
co_consts could make this change easier, and also make startup of annotated
modules faster (because you just load a single constant instead of one per
argument), this might be a valuable change by itself.



On 12 April 2018 at 11:48, INADA Naoki  wrote:

> > Finally, loading docstrings and other optional components can be made
> lazy.
> > This was not in my original idea, and this will significantly complicate
> the
> > implementation, but in principle it is possible. This will require larger
> > changes in the marshal format and bytecode.
>
> I'm +1 on this idea.
>
> * New pyc format has code section (same to current) and text section.
> text section stores UTF-8 strings and not loaded at import time.
> * Function annotation (only when PEP 563 is used) and docstring are
> stored as integer, point to offset in the text section.
> * When type.__doc__, PyFunction.__doc__, PyFunction.__annotation__ are
> integer, text is loaded from the text section lazily.
>
> PEP 563 will reduce some startup time, but __annotation__ is still
> dict.  Memory overhead is negligible.
>
> In [1]: def foo(a: int, b: int) -> int:
>    ...:     return a + b
>    ...:
>    ...:
>
> In [2]: import sys
> In [3]: sys.getsizeof(foo)
> Out[3]: 136
>
> In [4]: sys.getsizeof(foo.__annotations__)
> Out[4]: 240
>
> When PEP 563 is used, there are no side effect while building the
> annotation.
> So the annotation can be serialized in text, like
> {"a":"int","b":"int","return":"int"}.
>
> This change will require new pyc format, and descriptor for
> PyFunction.__doc__, PyFunction.__annotation__
> and type.__doc__.
>
> Regards,
>
> --
> INADA Naoki  
>



-- 
Daniel F. Moisset - UK Country Manager - Machinalis Limited
www.machinalis.co.uk 
Skype: @dmoisset T: + 44 7398 827139

1 Fore St, London, EC2Y 9DT

Machinalis Limited is a company registered in England and Wales. Registered
number: 10574987.


Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread INADA Naoki
> Finally, loading docstrings and other optional components can be made lazy.
> This was not in my original idea, and this will significantly complicate the
> implementation, but in principle it is possible. This will require larger
> changes in the marshal format and bytecode.

I'm +1 on this idea.

* New pyc format has a code section (same as the current one) and a text
section. The text section stores UTF-8 strings and is not loaded at import time.
* Function annotations (only when PEP 563 is used) and docstrings are
stored as integers pointing to offsets in the text section.
* When type.__doc__, PyFunction.__doc__, PyFunction.__annotations__ are
integers, the text is loaded from the text section lazily.
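
The three points above can be sketched in pure Python. This is only an
illustration of the idea: the length-prefixed layout of the text section and
the names build_text_section, read_text and LazyDoc are assumptions made for
the example, and the real change would live in the marshal format and in C
descriptors.

```python
import struct


def build_text_section(strings):
    """Pack strings as <u32 length><UTF-8 bytes>; return (section, offsets)."""
    section = bytearray()
    offsets = []
    for text in strings:
        data = text.encode("utf-8")
        offsets.append(len(section))
        section += struct.pack("<I", len(data)) + data
    return bytes(section), offsets


def read_text(section, offset):
    """Decode the string stored at the given offset of the text section."""
    (length,) = struct.unpack_from("<I", section, offset)
    start = offset + 4
    return section[start:start + length].decode("utf-8")


class LazyDoc:
    """Descriptor holding an integer offset until the text is first requested."""

    def __init__(self, section, offset):
        self._section = section
        self._value = offset

    def __get__(self, obj, objtype=None):
        if isinstance(self._value, int):
            # Still an offset: load the text now and cache it.
            self._value = read_text(self._section, self._value)
        return self._value


section, offsets = build_text_section(["Add two numbers.", "héllo"])


class Holder:
    doc = LazyDoc(section, offsets[0])


first_doc = Holder.doc  # the text is decoded here, on first access
```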

PEP 563 will reduce some startup time, but __annotations__ is still a
dict.  Memory overhead is negligible.

In [1]: def foo(a: int, b: int) -> int:
   ...:     return a + b
   ...:
   ...:

In [2]: import sys
In [3]: sys.getsizeof(foo)
Out[3]: 136

In [4]: sys.getsizeof(foo.__annotations__)
Out[4]: 240

When PEP 563 is used, there are no side effects while building the annotations.
So the annotations can be serialized as text, like
{"a":"int","b":"int","return":"int"}.
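
A quick way to see this from Python itself. The sketch compiles a small
source string with the PEP 563 future import (the string and the "<demo>"
filename are just for the example) and dumps the resulting annotations as
compact JSON.

```python
import json

# Compile a tiny module that opts in to PEP 563 (postponed annotations).
source = (
    "from __future__ import annotations\n"
    "def foo(a: int, b: int) -> int:\n"
    "    return a + b\n"
)
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
foo = namespace["foo"]

# The annotations are plain strings, so they serialize directly as text.
serialized = json.dumps(foo.__annotations__, separators=(",", ":"))
restored = json.loads(serialized)
print(serialized)
```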

This change will require a new pyc format, and descriptors for
PyFunction.__doc__, PyFunction.__annotations__
and type.__doc__.

Regards,

-- 
INADA Naoki  


Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread Serhiy Storchaka

10.04.18 19:24, Antoine Pitrou wrote:
>> 2. Line numbers (lnotab). They are helpful for formatting tracebacks,
>> for tracing, and debugging with the debugger. Sources are helpful in
>> such cases too. If the program doesn't contain errors ;-) and is shipped
>> without sources, they could be removed.
>
> What is the weight of lnotab arrays?  While docstrings can be large,
> I'm somehow skeptical that removing lnotab arrays would bring a
> significant improvement.  It would be nice to have more data about this.

Maybe it is low. I just mentioned three kinds of data in pyc files that
can be optional. If we move out docstrings and annotations, why not move
lnotabs? It would be easy if we had already implemented the infrastructure
for the other two.

>> 3. Annotations. They are used mainly by third party tools that
>> statically analyze sources. They are rarely used at runtime.
>
> Even less used than docstrings probably.

And since there is a way of providing annotations in a human-readable
format separately from the source code, it looks natural to provide a way
of compiling them into separate files.




Re: [Python-ideas] Move optional data out of pyc files

2018-04-12 Thread Serhiy Storchaka

10.04.18 20:38, Chris Angelico wrote:
> On Wed, Apr 11, 2018 at 2:14 AM, Serhiy Storchaka wrote:
>
> A deployed Python distribution generally has .pyc files for all of the
> standard library. I don't think people want to lose the ability to
> call help(), and unless I'm misunderstanding, that requires
> docstrings. So this will mean twice as many files and twice as many
> file-open calls to import from the standard library. What will be the
> impact on startup time?


Yes, this will mean more syscalls when importing with docstrings. But
startup time doesn't matter for an interactive shell in which you call
help(). It is expected that programs which need to benefit from
separating optional components will run without loading them (like
with the -OO option).


The overhead can be reduced by packing multiple files in a single archive.
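
As a point of comparison, Python can already import from a single zip archive
placed on sys.path. The sketch below only shows that existing mechanism, not
the proposed split pyc format; it bundles plain .py sources for brevity, and
the archive and module names are made up for the example.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_archive(directory, modules):
    """Write {module_name: source} into one zip archive and return its path."""
    archive = os.path.join(directory, "bundle.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        for name, source in modules.items():
            zf.writestr(name + ".py", source)
    return archive


# The temporary directory is left on disk for simplicity.
workdir = tempfile.mkdtemp()
archive = build_archive(workdir, {
    "bundled_alpha": "VALUE = 1\n",
    "bundled_beta": "VALUE = 2\n",
})

# Archives on sys.path are handled by the standard zipimport machinery,
# so both modules come out of the same file.
sys.path.insert(0, archive)
importlib.invalidate_caches()
alpha = importlib.import_module("bundled_alpha")
beta = importlib.import_module("bundled_beta")
```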

Finally, loading docstrings and other optional components can be made
lazy. This was not in my original idea, and it will significantly
complicate the implementation, but in principle it is possible. It
will require larger changes in the marshal format and bytecode. This can
open the door to further enhancements: loading the code and building
classes and other complex data (especially heavy namedtuples, enums and
dataclasses) on demand. Often you need to use just a single attribute or
function from a large module. But this is a different change, out of the
scope of this topic.




Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Chris Angelico
On Thu, Apr 12, 2018 at 11:59 AM, Steven D'Aprano  wrote:
> On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote:
>
> [...]
>> >> Consider a very common use-case: an OS-provided
>> >> Python interpreter whose files are all owned by 'root'. Those will be
>> >> distributed with .pyc files for performance, but you don't want to
>> >> deprive the users of help() and anything else that needs docstrings
>> >> etc. So... are the docstrings lazily loaded or eagerly loaded?
>> >
>> > What relevance is that they're owned by root?
>>
>> You have to predict in advance what you'll want to have in your pyc
>> files. Can't create them on the fly.
>
> How is that different from the situation right now?

If the files aren't owned by root (more specifically, if they're owned
by you, and you can write to the pycache directory), you can do
everything at runtime. Otherwise, you have to do everything at
installation time.

>> > What semantic change do you expect?
>> >
>> > There's an implementation change, of course, but that's Serhiy's problem
>> > to deal with and I'm sure that he has considered that. There should be
>> > no semantic change. When you access obj.__doc__, then and only then are
>> > the compiled docstrings for that module read from the disk.
>>
>> In other words, attempting to access obj.__doc__ can actually go and
>> open a file. Does it need to check if the file exists as part of the
>> import, or does it go back to sys.path?
>
> That's implementation, so I don't know, but I imagine that the module
> object will have a link pointing directly to the expected file on disk.
> No need to search the path, you just go directly to the expected file.
> Apart from handling the case when it doesn't exist, in which case the
> docstring or annotations get set to None, it should be relatively
> straight-forward.
>
> That link could be an explicit pathname:
>
> /path/to/__pycache__/foo.cpython-33-doc.pyc
>
> or it could be implicitly built when required from the "master" .pyc
> file's path, since the differences are likely to be deterministic.

Referencing a path name requires that each directory in it be opened.
Checking to see if the file exists requires, at absolute best, one
more stat call, and that's assuming you have an open handle to the
directory.

>> If the former, you're right
>> back with the eager loading problem of needing to do 2-4 times as many
>> stat calls;
>
> Except that's not eager loading. When you open the file on demand, it
> might never be opened at all. If it is opened, it is likely to be a long
> time after interpreter startup.

I have no idea what you mean here. Eager loading != opening the file
on demand. Eager statting != opening on demand. If you're not going to
hold open handles to heaps of directories, you have to reference
everything by path name.

>> > As for the in-memory data structures of objects themselves, I imagine
>> > something like the __doc__ and __annotation__ slots pointing to a table
>> > of strings, which is not initialised until you attempt to read from the
>> > table. Or something -- don't pay too much attention to my wild guesses.
>> >
>> > The bottom line is, is there some reason *aside from performance* to
>> > avoid this? Because if the performance is worse, I'm sure Serhiy will be
>> > the first to dump this idea.
>>
>> Obviously it could be turned into just a performance question, but in
>> that case everything has to be preloaded
>
> You don't need to preload things to get a performance benefit.
> Preloading things that you don't need immediately and may never need at
> all, like docstrings, annotations and line numbers, is inefficient.

Right, and if you DON'T preload everything, you have a potential
semantic difference. Which is exactly what you were asking me, and I
was answering.

> So let's look at a few common scenarios:
>
>
> 1. You run a script. Let's say that the script ends up loading, directly
> or indirectly, 200 modules, none of which need docstrings or annotations
> during runtime, and the script runs to completion without needing to
> display a traceback. You save loading 200 sets of docstrings,
> annotations and line numbers ("metadata" for brevity) so overall the
> interpreter starts up quicker and the script runs faster.
>
>
> 2. You run the same script, but this time it raises an exception and
> displays a traceback. So now you have to load, let's say, 20 sets of
> line numbers, which is a bit slower, but that doesn't happen until the
> exception is raised and the traceback printed, which is already a slow
> and exceptional case so who cares if it takes an extra few milliseconds?
> It is still an overall win because of the 180 sets of metadata you
> didn't need to load.

Does this loading happen when the exception is constructed or when
it's printed? How much can you do with an exception without triggering
the loading of metadata? Is it now possible for the mere formatting of
a traceback to fail because of 

Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Steven D'Aprano
On Thu, Apr 12, 2018 at 12:09:38AM +1000, Chris Angelico wrote:

[...]
> >> Consider a very common use-case: an OS-provided
> >> Python interpreter whose files are all owned by 'root'. Those will be
> >> distributed with .pyc files for performance, but you don't want to
> >> deprive the users of help() and anything else that needs docstrings
> >> etc. So... are the docstrings lazily loaded or eagerly loaded?
> >
> > What relevance is that they're owned by root?
> 
> You have to predict in advance what you'll want to have in your pyc
> files. Can't create them on the fly.

How is that different from the situation right now?


> > What semantic change do you expect?
> >
> > There's an implementation change, of course, but that's Serhiy's problem
> > to deal with and I'm sure that he has considered that. There should be
> > no semantic change. When you access obj.__doc__, then and only then are
> > the compiled docstrings for that module read from the disk.
> 
> In other words, attempting to access obj.__doc__ can actually go and
> open a file. Does it need to check if the file exists as part of the
> import, or does it go back to sys.path? 

That's implementation, so I don't know, but I imagine that the module 
object will have a link pointing directly to the expected file on disk. 
No need to search the path, you just go directly to the expected file. 
Apart from handling the case when it doesn't exist, in which case the 
docstring or annotations get set to None, it should be relatively 
straight-forward.

That link could be an explicit pathname:

/path/to/__pycache__/foo.cpython-33-doc.pyc

or it could be implicitly built when required from the "master" .pyc 
file's path, since the differences are likely to be deterministic.
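
A tiny sketch of the "implicitly built" variant, assuming the companion file
differs from the master .pyc only by a suffix inserted before the extension.
The function name companion_path and the "-doc" naming are illustrative
guesses based on the example path above, not a settled scheme.

```python
from pathlib import PurePosixPath


def companion_path(master_pyc, kind):
    """Derive e.g. foo.cpython-33-doc.pyc from foo.cpython-33.pyc."""
    master = PurePosixPath(master_pyc)
    # stem is "foo.cpython-33", suffix is ".pyc"
    return master.with_name(f"{master.stem}-{kind}{master.suffix}")


doc_path = companion_path("/path/to/__pycache__/foo.cpython-33.pyc", "doc")
print(doc_path)
```

Because the name is computed from the master path, nothing needs to be
searched on sys.path; the only remaining check is whether the file exists.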


> If the former, you're right
> back with the eager loading problem of needing to do 2-4 times as many
> stat calls;

Except that's not eager loading. When you open the file on demand, it 
might never be opened at all. If it is opened, it is likely to be a long 
time after interpreter startup.


> > As for the in-memory data structures of objects themselves, I imagine
> > something like the __doc__ and __annotation__ slots pointing to a table
> > of strings, which is not initialised until you attempt to read from the
> > table. Or something -- don't pay too much attention to my wild guesses.
> >
> > The bottom line is, is there some reason *aside from performance* to
> > avoid this? Because if the performance is worse, I'm sure Serhiy will be
> > the first to dump this idea.
> 
> Obviously it could be turned into just a performance question, but in
> that case everything has to be preloaded

You don't need to preload things to get a performance benefit. 
Preloading things that you don't need immediately and may never need at 
all, like docstrings, annotations and line numbers, is inefficient.

I fear that you have completely failed to understand the (potential) 
performance benefit here.

The point, or at least *a* point, of the exercise is to speed up 
interpreter startup by deferring some of the work until it is needed. 
When you defer work, the pluses are that it reduces startup time, and 
sometimes you can avoid doing it at all; the minus is that if you do end 
up needing to do it, you have to do a little bit extra.

So let's look at a few common scenarios:


1. You run a script. Let's say that the script ends up loading, directly 
or indirectly, 200 modules, none of which need docstrings or annotations 
during runtime, and the script runs to completion without needing to 
display a traceback. You save loading 200 sets of docstrings, 
annotations and line numbers ("metadata" for brevity) so overall the 
interpreter starts up quicker and the script runs faster.


2. You run the same script, but this time it raises an exception and 
displays a traceback. So now you have to load, let's say, 20 sets of 
line numbers, which is a bit slower, but that doesn't happen until the 
exception is raised and the traceback printed, which is already a slow 
and exceptional case so who cares if it takes an extra few milliseconds? 
It is still an overall win because of the 180 sets of metadata you 
didn't need to load.


3. You have a long-running server application which runs for days or 
weeks between restarts. Let's say it loads 1000 modules, so you get 
significant savings during start up (let's say, hypothetically shaving 
off 2 seconds from a 30 second start up time), but over the course of 
the week it ends up eventually loading all 1000 sets of metadata. Since 
that is deferred until needed, it doesn't happen all at once, but spread 
out a little bit at a time.

Overall, you end up doing four times as many file system operations, but 
since they're amortized over the entire week, not startup, it is still a 
win.

(And remember that this extra cost only applies the first time a 
module's metadata is needed. It isn't a cost you keep paying over and 
over again.)
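
The "pay once, on first use" behaviour can be sketched in a few lines. This is only an illustration of deferred loading with caching; the class, the loader function and the counter are all made up for the example and say nothing about how an actual implementation would hook into module objects.

```python
class LazyMetadata:
    """Load a module's metadata on first access, then cache it."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable that reads the file
        self._loaded = False
        self._value = None
        self.load_count = 0     # only here to make the cost visible

    def get(self):
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
            self.load_count += 1
        return self._value


def read_docstrings():
    # Stand-in for opening and unmarshalling a docstrings file.
    return {"spam": "Return a can of spam."}


docs = LazyMetadata(read_docstrings)
assert docs.load_count == 0   # nothing is read at "startup"
docs.get()
docs.get()
assert docs.load_count == 1   # the file is read once, however often it is used
```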

We're (hopefully!) not 

Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Terry Reedy

On 4/11/2018 4:26 AM, Petr Viktorin wrote:

> Currently in Fedora, we ship *both* optimized and non-optimized pycs to
> make sure both -O and non--O will work nicely without root privileges.
> So splitting the docstrings into a separate file would be, for us, a
> benefit in terms of file size.


Currently, the Windows installer has an option to pre-compile stdlib 
modules.  (At least it does if one does an all-users installation.)  If 
one selects this, it creates normal, -O, and -OO versions of each. 
Since, like most people, I never run with -O or  -OO, replacing this 
redundancy with 1 segmented file or 2 non-redundant files might be a win 
for most people.


--
Terry Jan Reedy

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Nick Coghlan
On 11 April 2018 at 02:14, Serhiy Storchaka  wrote:
> Currently pyc files contain data that is useful mostly for developing and is
> not needed in most normal cases in a stable program. There is even an option
> that allows excluding a part of this information from pyc files. It is
> expected that this saves memory, startup time, and disk space (or the time
> of loading from network). I propose to move this data from pyc files into
> a separate file or files. pyc files should contain only references to
> external files. If the corresponding external file is absent or a specific
> option suppresses them, references are replaced with None or NULL at import
> time, otherwise they are loaded from external files.
>
> 1. Docstrings. They are needed mainly for developing.
>
> 2. Line numbers (lnotab). They are helpful for formatting tracebacks, for
> tracing, and debugging with the debugger. Sources are helpful in such cases
> too. If the program doesn't contain errors ;-) and is shipped without
> sources, they could be removed.
>
> 3. Annotations. They are used mainly by third party tools that statically
> analyze sources. They are rarely used at runtime.

While I don't think the default inline pyc format should change, in my
ideal world I'd like to see the optimized format change to a
side-loading model where these things are still emitted, but they're
placed in a separate metadata file that isn't loaded by default.

The metadata file would then be lazily loaded at runtime, such that
`-O` gave you the memory benefits of `-OO`, but
docstrings/annotations/source line references/etc could still be
loaded on demand if something actually needed them. This approach
would also mitigate the valid points Chris Angelico raises around hot
reloading support - we could just declare that it requires even more
care than usual to use hot reloading in combination with `-O`.

Bonus points if the sideloaded metadata file could be designed in such
a way that an extension module compiler like Cython or an alternate
pyc compiler frontend like Hylang could use it to provide relevant
references back to the original source code (JavaScript's source maps
may provide inspiration on that front).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Chris Angelico
On Wed, Apr 11, 2018 at 4:06 PM, Steven D'Aprano  wrote:
> On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:
>
> [...]
>> > Yes, it will double the number of files. Actually quadruple it, if the
>> > annotations and line numbers are in separate files too. But if most of
>> > those extra files never need to be opened, then there's no cost to them.
>> > And whatever extra cost there is, is amortized over the lifetime of the
>> > interpreter.
>>
>> Yes, if they are actually not needed. My question was about whether
>> that is truly valid.
>
> We're never really going to know the effect on performance without
> implementing and benchmarking the code. It might turn out that, to our
> surprise, three quarters of the std lib relies on loading docstrings
> during startup. But I doubt it.
>
>
>> Consider a very common use-case: an OS-provided
>> Python interpreter whose files are all owned by 'root'. Those will be
>> distributed with .pyc files for performance, but you don't want to
>> deprive the users of help() and anything else that needs docstrings
>> etc. So... are the docstrings lazily loaded or eagerly loaded?
>
> What relevance is that they're owned by root?

You have to predict in advance what you'll want to have in your pyc
files. Can't create them on the fly.

>> If eagerly, you've doubled the number of file-open calls to initialize
>> the interpreter.
>
> I do not understand why you think this is even an option. Has Serhiy
> said something that I missed that makes this seem to be on the table?
> That's not a rhetorical question -- I may have missed something. But I'm
> sure he understands that doubling or quadrupling the number of file
> operations during startup is not an optimization.
>
>
>> (Or quadrupled, if you need annotations and line
>> numbers and they're all separate.) If lazily, things are a lot more
>> complicated than the original description suggested, and there'd need
>> to be some semantic changes here.
>
> What semantic change do you expect?
>
> There's an implementation change, of course, but that's Serhiy's problem
> to deal with and I'm sure that he has considered that. There should be
> no semantic change. When you access obj.__doc__, then and only then are
> the compiled docstrings for that module read from the disk.

In other words, attempting to access obj.__doc__ can actually go and
open a file. Does it need to check if the file exists as part of the
import, or does it go back to sys.path? If the former, you're right
back with the eager loading problem of needing to do 2-4 times as many
stat calls; if the latter, it's semantically different in that a
change to sys.path can influence something that normally is preloaded.

> As for the in-memory data structures of objects themselves, I imagine
> something like the __doc__ and __annotations__ slots pointing to a table
> of strings, which is not initialised until you attempt to read from the
> table. Or something -- don't pay too much attention to my wild guesses.
>
> The bottom line is, is there some reason *aside from performance* to
> avoid this? Because if the performance is worse, I'm sure Serhiy will be
> the first to dump this idea.

Obviously it could be turned into just a performance question, but in
that case everything has to be preloaded, and I doubt there's going to
be any advantage. To be absolutely certain of retaining the existing
semantics, there'd need to be some sort of anchoring to ensure that
*this* .pyc file goes with *that* .pyc_docstrings file. Looking them
up anew will mean that there's every possibility that you get the
wrong file back.

As a simple example, upgrading your Python installation while you have
a Python script running can give you this effect already. Just import
a few modules, then change everything on disk. If you now import a
module that was already imported, you get it from cache (and the
unmodified version); import something that wasn't imported already,
and it goes to the disk. At the granularity of modules, this is seldom
a problem (I can imagine some package modules getting confused by
this, but otherwise not usually), but if docstrings are looked up
separately - and especially if lnotab is too - you could happily
import and use something (say, in a web server), then run updates, and
then an exception requires you to look up a line number. Oops, a few
lines got inserted into that file, and now all the line numbers are
straight-up wrong. That's a definite behavioural change. Maybe it's
one that's considered acceptable, but it definitely is a change. And
if mutations to sys.path can do this, it's definitely a semantic
change in Python.
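
One way the "anchoring" could work, purely as a sketch: the main file records a digest of the side file at compile time, and the lazy loader refuses a side file that no longer matches. The function names and the dict layout are invented for illustration, and hashing is just one possible check (it costs a full read of the side file, which the loader needs anyway).

```python
import hashlib
import marshal


def build_pair(code_payload, docstrings):
    """Return (main_blob, doc_blob); main_blob records the doc blob's digest."""
    doc_blob = marshal.dumps(docstrings)
    digest = hashlib.sha256(doc_blob).digest()
    main_blob = marshal.dumps({"code": code_payload, "doc_digest": digest})
    return main_blob, doc_blob


def load_docstrings(main_blob, doc_blob):
    """Return the docstrings, or None if the side file no longer matches."""
    expected = marshal.loads(main_blob)["doc_digest"]
    if hashlib.sha256(doc_blob).digest() != expected:
        return None  # stale or foreign file: behave as if it were absent
    return marshal.loads(doc_blob)


main, doc = build_pair("<bytecode placeholder>", {"spam": "Eggs."})
upgraded_doc = marshal.dumps({"spam": "Changed on disk."})
```

With this check, `load_docstrings(main, upgraded_doc)` gives None instead of silently returning metadata from a different build, so an upgrade underneath a running process degrades to "no docstring" rather than a wrong one.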

ChrisA


Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Erik Bray
On Tue, Apr 10, 2018 at 9:50 PM, Eric V. Smith  wrote:
>
>>> 3. Annotations. They are used mainly by third party tools that
>>> statically analyze sources. They are rarely used at runtime.
>>
>> Even less used than docstrings probably.
>
> typing.NamedTuple and dataclasses use annotations at runtime.

Astropy uses annotations at runtime for optional unit checking on
arguments that take dimensionful quantities:
http://docs.astropy.org/en/stable/api/astropy.units.quantity_input.html#astropy.units.quantity_input
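
The dataclasses case mentioned above is easy to check from the interpreter. A minimal illustration (the Point class is made up for the example):

```python
from dataclasses import dataclass, fields


@dataclass
class Point:
    x: int
    y: float = 0.0


# The decorator reads the class annotations to decide which attributes
# are fields, so they must be available when the module is imported.
print([f.name for f in fields(Point)])
print(Point.__annotations__)
```

Any module defining a dataclass would therefore need its annotations at import time, not lazily.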


Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Petr Viktorin



On 04/11/18 08:06, Steven D'Aprano wrote:
> On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:
>
> [...]
>>> Yes, it will double the number of files. Actually quadruple it, if the
>>> annotations and line numbers are in separate files too. But if most of
>>> those extra files never need to be opened, then there's no cost to them.
>>> And whatever extra cost there is, is amortized over the lifetime of the
>>> interpreter.
>>
>> Yes, if they are actually not needed. My question was about whether
>> that is truly valid.
>
> We're never really going to know the effect on performance without
> implementing and benchmarking the code. It might turn out that, to our
> surprise, three quarters of the std lib relies on loading docstrings
> during startup. But I doubt it.
>
>> Consider a very common use-case: an OS-provided
>> Python interpreter whose files are all owned by 'root'. Those will be
>> distributed with .pyc files for performance, but you don't want to
>> deprive the users of help() and anything else that needs docstrings
>> etc. So... are the docstrings lazily loaded or eagerly loaded?
>
> What relevance is that they're owned by root?
>
>> If eagerly, you've doubled the number of file-open calls to initialize
>> the interpreter.
>
> I do not understand why you think this is even an option. Has Serhiy
> said something that I missed that makes this seem to be on the table?
> That's not a rhetorical question -- I may have missed something. But I'm
> sure he understands that doubling or quadrupling the number of file
> operations during startup is not an optimization.
>
>> (Or quadrupled, if you need annotations and line
>> numbers and they're all separate.) If lazily, things are a lot more
>> complicated than the original description suggested, and there'd need
>> to be some semantic changes here.
>
> What semantic change do you expect?
>
> There's an implementation change, of course, but that's Serhiy's problem
> to deal with and I'm sure that he has considered that. There should be
> no semantic change. When you access obj.__doc__, then and only then are
> the compiled docstrings for that module read from the disk.
>
> I don't know the current implementation of .pyc files, but I like
> Antoine's suggestion of laying it out in four separate areas (plus
> header), each one marshalled:
>
>  code
>  docstrings
>  annotations
>  line numbers
>
> Aside from code, which is mandatory, the three other sections could be
> None to represent "not available", as is the case when you pass -OO to
> the interpreter, or they could be some other sentinel that means "load
> lazily from the appropriate file", or they could be the marshalled data
> directly in place to support byte-code only libraries.
>
> As for the in-memory data structures of objects themselves, I imagine
> something like the __doc__ and __annotations__ slots pointing to a table
> of strings, which is not initialised until you attempt to read from the
> table. Or something -- don't pay too much attention to my wild guesses.

A __doc__ sentinel could even say something like "bytes 350--420 in the 
original .py file, as UTF-8".

> The bottom line is, is there some reason *aside from performance* to
> avoid this? Because if the performance is worse, I'm sure Serhiy will be
> the first to dump this idea.





Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Petr Viktorin

On 04/11/18 06:21, Chris Angelico wrote:
> On Wed, Apr 11, 2018 at 1:02 PM, Steven D'Aprano  wrote:
>> On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:
>>
>>> File system limits aren't usually an issue; as you say, even FAT32 can
>>> store a metric ton of files in a single directory. I'm more interested
>>> in how long it takes to open a file, and whether doubling that time
>>> will have a measurable impact on Python startup time. Part of that
>>> cost can be reduced by using openat(), on platforms that support it,
>>> but even with a directory handle, there's still a definite non-zero
>>> cost to opening and reading an additional file.
>>
>> Yes, it will double the number of files. Actually quadruple it, if the
>> annotations and line numbers are in separate files too. But if most of
>> those extra files never need to be opened, then there's no cost to them.
>> And whatever extra cost there is, is amortized over the lifetime of the
>> interpreter.
>
> Yes, if they are actually not needed. My question was about whether
> that is truly valid. Consider a very common use-case: an OS-provided
> Python interpreter whose files are all owned by 'root'. Those will be
> distributed with .pyc files for performance, but you don't want to
> deprive the users of help() and anything else that needs docstrings
> etc.

Currently in Fedora, we ship *both* optimized and non-optimized pycs to 
make sure both -O and non--O will work nicely without root privileges. 
So splitting the docstrings into a separate file would be, for us, a 
benefit in terms of file size.

> So... are the docstrings lazily loaded or eagerly loaded? If
> eagerly, you've doubled the number of file-open calls to initialize
> the interpreter. (Or quadrupled, if you need annotations and line
> numbers and they're all separate.) If lazily, things are a lot more
> complicated than the original description suggested, and there'd need
> to be some semantic changes here.
>
>> Serhiy is experienced enough that I think we should assume he's not
>> going to push this optimization into production unless it actually does
>> reduce startup time. He has proven himself enough that we should assume
>> competence rather than incompetence :-)
>
> Oh, I'm definitely assuming that he knows what he's doing :-) Doesn't
> mean I can't ask the question though.
>
> ChrisA


Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Steve Barnes


On 10/04/2018 18:54, Zachary Ware wrote:
> On Tue, Apr 10, 2018 at 12:38 PM, Chris Angelico  wrote:
>> A deployed Python distribution generally has .pyc files for all of the
>> standard library. I don't think people want to lose the ability to
>> call help(), and unless I'm misunderstanding, that requires
>> docstrings. So this will mean twice as many files and twice as many
>> file-open calls to import from the standard library. What will be the
>> impact on startup time?
> 
> What about instead of separate files turning the single file into a
> pseudo-zip file containing all of the proposed files, and provide a
> simple tool for removing whatever parts you don't want?
> 

Personally I quite like the idea of having the doc strings, and possibly 
other optional components, in a zipped section after a marker for the 
end of the operational code. Possibly the loader could stop reading at 
that point, (reducing load time and memory impact), and only load and 
unzip on demand.

Zipping the doc strings should give a significant reduction in file 
sizes but it is worth remembering a couple of things:

  - Python is already one of the most compact languages for what it can 
do - I have had experts demanding to know where the rest of the program 
is hidden and how it is being downloaded when they noticed the size of 
the installed code versus the functionality provided.
  - File size <> disk space consumed - on most file systems each file 
typically occupies 1 + (file_size // allocation_size) clusters of the 
drive, and with increasing disk sizes the allocation_size is generally 
increasing too. Both of my NTFS drives currently have 4096 byte 
allocation sizes, but I am offered up to 2 MB allocation sizes. Splitting 
a 10,052 byte .pyc file (picking a random example from my drive) into 
5,052 and 5,000 byte files will change the disk space occupied from 
3*4,096 to 4*4,096 bytes, plus the extra directory entry.
  - Where absolute file size is critical (such as on embedded 
systems), you can always use the -O & -OO flags.
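
The cluster arithmetic above can be checked with a few lines. This is only a sketch of the rounding, assuming a 4096 byte allocation size; it ignores directory entries and any file-system specific handling of very small files.

```python
def clusters_used(file_size, allocation_size=4096):
    """Number of allocation units a file of `file_size` bytes occupies."""
    # Ceiling division; unlike 1 + file_size // allocation_size this does
    # not over-count a file whose size is an exact multiple of the cluster.
    return -(-file_size // allocation_size)


single = clusters_used(10_052)
split = clusters_used(5_052) + clusters_used(5_000)
print(single, split)  # 3 4
```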
-- 
Steve (Gadget) Barnes
Any opinions in this message are my personal opinions and do not reflect 
those of my employer.



Re: [Python-ideas] Move optional data out of pyc files

2018-04-11 Thread Steven D'Aprano
On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:

[...]
> > Yes, it will double the number of files. Actually quadruple it, if the
> > annotations and line numbers are in separate files too. But if most of
> > those extra files never need to be opened, then there's no cost to them.
> > And whatever extra cost there is, is amortized over the lifetime of the
> > interpreter.
> 
> Yes, if they are actually not needed. My question was about whether
> that is truly valid.

We're never really going to know the effect on performance without 
implementing and benchmarking the code. It might turn out that, to our 
surprise, three quarters of the std lib relies on loading docstrings 
during startup. But I doubt it.


> Consider a very common use-case: an OS-provided
> Python interpreter whose files are all owned by 'root'. Those will be
> distributed with .pyc files for performance, but you don't want to
> deprive the users of help() and anything else that needs docstrings
> etc. So... are the docstrings lazily loaded or eagerly loaded?

What relevance is that they're owned by root?


> If eagerly, you've doubled the number of file-open calls to initialize
> the interpreter.

I do not understand why you think this is even an option. Has Serhiy 
said something that I missed that makes this seem to be on the table? 
That's not a rhetorical question -- I may have missed something. But I'm 
sure he understands that doubling or quadrupling the number of file 
operations during startup is not an optimization.


> (Or quadrupled, if you need annotations and line
> numbers and they're all separate.) If lazily, things are a lot more
> complicated than the original description suggested, and there'd need
> to be some semantic changes here.

What semantic change do you expect?

There's an implementation change, of course, but that's Serhiy's problem 
to deal with and I'm sure that he has considered that. There should be 
no semantic change. When you access obj.__doc__, then and only then are 
the compiled docstrings for that module read from the disk.

I don't know the current implementation of .pyc files, but I like 
Antoine's suggestion of laying it out in four separate areas (plus 
header), each one marshalled:

code
docstrings
annotations
line numbers

Aside from code, which is mandatory, the three other sections could be 
None to represent "not available", as is the case when you pass -OO to 
the interpreter, or they could be some other sentinel that means "load 
lazily from the appropriate file", or they could be the marshalled data 
directly in place to support byte-code only libraries.
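
A toy version of such a sectioned layout, to show that one section can be read without unmarshalling the others. The header format (four little-endian lengths) is invented for this sketch and is not the real .pyc header; a missing section is stored as a marshalled None.

```python
import marshal
import struct

SECTIONS = ("code", "docstrings", "annotations", "line numbers")
HEADER = struct.Struct("<4I")  # one length per section; layout is made up


def pack(sections):
    """Serialise the four sections as a length table followed by the blobs."""
    blobs = [marshal.dumps(sections.get(name)) for name in SECTIONS]
    return HEADER.pack(*(len(b) for b in blobs)) + b"".join(blobs)


def load_section(data, name):
    """Unmarshal a single section without touching the others."""
    lengths = HEADER.unpack_from(data, 0)
    index = SECTIONS.index(name)
    start = HEADER.size + sum(lengths[:index])
    return marshal.loads(data[start:start + lengths[index]])


blob = pack({"code": "<bytecode placeholder>", "docstrings": {"spam": "Eggs."}})
print(load_section(blob, "docstrings"))
print(load_section(blob, "annotations"))
```

A "load lazily from the appropriate file" sentinel would just be another marshallable value stored in the same slot.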

As for the in-memory data structures of objects themselves, I imagine 
something like the __doc__ and __annotations__ slots pointing to a table 
of strings, which is not initialised until you attempt to read from the 
table. Or something -- don't pay too much attention to my wild guesses.

The bottom line is, is there some reason *aside from performance* to 
avoid this? Because if the performance is worse, I'm sure Serhiy will be 
the first to dump this idea.


-- 
Steve


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Chris Angelico
On Wed, Apr 11, 2018 at 1:02 PM, Steven D'Aprano  wrote:
> On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:
>
>> File system limits aren't usually an issue; as you say, even FAT32 can
>> store a metric ton of files in a single directory. I'm more interested
>> in how long it takes to open a file, and whether doubling that time
>> will have a measurable impact on Python startup time. Part of that
>> cost can be reduced by using openat(), on platforms that support it,
>> but even with a directory handle, there's still a definite non-zero
>> cost to opening and reading an additional file.
>
> Yes, it will double the number of files. Actually quadruple it, if the
> annotations and line numbers are in separate files too. But if most of
> those extra files never need to be opened, then there's no cost to them.
> And whatever extra cost there is, is amortized over the lifetime of the
> interpreter.

Yes, if they are actually not needed. My question was about whether
that is truly valid. Consider a very common use-case: an OS-provided
Python interpreter whose files are all owned by 'root'. Those will be
distributed with .pyc files for performance, but you don't want to
deprive the users of help() and anything else that needs docstrings
etc. So... are the docstrings lazily loaded or eagerly loaded? If
eagerly, you've doubled the number of file-open calls to initialize
the interpreter. (Or quadrupled, if you need annotations and line
numbers and they're all separate.) If lazily, things are a lot more
complicated than the original description suggested, and there'd need
to be some semantic changes here.

> Serhiy is experienced enough that I think we should assume he's not
> going to push this optimization into production unless it actually does
> reduce startup time. He has proven himself enough that we should assume
> competence rather than incompetence :-)

Oh, I'm definitely assuming that he knows what he's doing :-) Doesn't
mean I can't ask the question though.

ChrisA


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Steven D'Aprano
On Wed, Apr 11, 2018 at 10:08:58AM +1000, Chris Angelico wrote:

> File system limits aren't usually an issue; as you say, even FAT32 can
> store a metric ton of files in a single directory. I'm more interested
> in how long it takes to open a file, and whether doubling that time
> will have a measurable impact on Python startup time. Part of that
> cost can be reduced by using openat(), on platforms that support it,
> but even with a directory handle, there's still a definite non-zero
> cost to opening and reading an additional file.

Yes, it will double the number of files. Actually quadruple it, if the 
annotations and line numbers are in separate files too. But if most of 
those extra files never need to be opened, then there's no cost to them. 
And whatever extra cost there is, is amortized over the lifetime of the 
interpreter.

The expectation here is that this could reduce startup time, since the 
files which are read are smaller, so less data needs to be read and sent 
over the network up front; the rest can be deferred until it is 
actually needed.

Serhiy is experienced enough that I think we should assume he's not 
going to push this optimization into production unless it actually does 
reduce startup time. He has proven himself enough that we should assume 
competence rather than incompetence :-)

Here is the proposal as I understand it:

- by default, change .pyc files to store annotations, docstrings
  and line numbers as references to external files which will be
  lazily loaded on-need;

- single-file .pyc files must still be supported, but this won't
  be the default and could rely on an external "merge" tool;

- objects that rely on docstrings or annotations, such as dataclass,
  may experience a (hopefully very small) increase of import time,
  since they may not be able to defer loading the extra files;

- but in general, most modules should (we expect) see a decrease
  in the load time;

- which will (we hope) reduce startup time;

- libraries which make eager use of docstrings and annotations might
  even ship with the single-file .pyc instead (the library installer
  can look after that aspect), and so avoid any extra cost.

Naturally pushing this into production will require benchmarks that 
prove this actually does improve startup time. I believe that Serhiy's 
reason for asking is to determine whether it is worth his while to 
experiment on this. There's no point in implementing these changes and 
benchmarking them, if there's no chance of it being accepted.

So on the assumptions that:

- benchmarking does demonstrate a non-trivial speedup of
  interpreter startup;

- single-file .pyc files are still supported, for the use
  of byte-code only libraries;

- and modules which are particularly badly impacted by this
  change are able to opt-out and use a single .pyc file;

I see no reason not to support this idea if Serhiy (or someone else) is 
willing to put in the work.


-- 
Steve


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Eric Fahlgren
On Tue, Apr 10, 2018 at 5:03 PM, Steven D'Aprano 
wrote:

>
> __pycache__/spam.cpython-38.pyc
> __pycache__/spam.cpython-38-doc.pyc
> __pycache__/spam.cpython-38-lno.pyc
> __pycache__/spam.cpython-38-ann.pyc
>

Our product uses the doc strings for auto-generated help, so we need to
keep those.  We also allow users to write plugins and scripts, so getting
valid feedback in tracebacks is essential for our support people, so we'll
keep the lno files, too.  Annotations can probably go.

Looking at one of our little pyc files, I see:

-rwx--+ 1 efahlgren admins  9252 Apr 10 17:25 ./lm/lib/config.pyc*

Since disk blocks are typically 4096 bytes, that's really a 12k file.
Let's say it's 8k of byte code, 1k of doc, a bit of lno.  So the proposed
layout would give:

config.pyc -> 8k
config-doc.pyc -> 4k
config-lno.pyc -> 4k

So now I've increased disk usage by a third, from 12k to 16k (yeah yeah, I
know, I picked that small file on purpose to illustrate the point, but it's
not unusual).

These files are often opened over a network, at least for user plugins.
This can take a really, really long time on some of our poorly connected
machines, like 1-2 seconds per file (no kidding, it's horrible).  Now
instead of opening just one file in 1-2 seconds, we have increased the time
by 300%, just to do the stat+open, probably another stat to make sure
there's no "ann" file lying about.  Ouch.

-1 from me.


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Gregory P. Smith
On Tue, Apr 10, 2018 at 12:51 PM Eric V. Smith  wrote:

>
> >> 3. Annotations. They are used mainly by third party tools that
> >> statically analyze sources. They are rarely used at runtime.
> >
> > Even less used than docstrings probably.
>
> typing.NamedTuple and dataclasses use annotations at runtime.
>
> Eric
>

Yep. Everything accessible in any way at runtime is used by something at
runtime. It's a public API, we can't just get rid of it.

Several libraries rely on docstrings being available (additional case in
point beyond the already linked-to CLI tool: ply).

Most of the world never appears to use -O and -OO.  If they do, they don't
use these libraries or jump through special hoops to prevent pyo
compilation of any sources that need them.  (unlikely)

-gps


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Chris Angelico
On Wed, Apr 11, 2018 at 10:03 AM, Steven D'Aprano  wrote:
> On Wed, Apr 11, 2018 at 03:38:08AM +1000, Chris Angelico wrote:
>> A deployed Python distribution generally has .pyc files for all of the
>> standard library. I don't think people want to lose the ability to
>> call help(), and unless I'm misunderstanding, that requires
>> docstrings. So this will mean twice as many files and twice as many
>> file-open calls to import from the standard library. What will be the
>> impact on startup time?
>
> I shouldn't think that the number of files on disk is very important,
> now that they're hidden away in the __pycache__ directory where they can
> be ignored by humans. Even venerable old FAT32 has a limit of 65,534
> files in a single folder, and 268,435,437 on the entire volume. So
> unless the std lib expands to 16000+ modules, the number of files in the
> __pycache__ directory ought to be well below that limit.
>
> I think even MicroPython ought to be okay with that. (But it would be
> nice to find out for sure: does it support file systems with *really*
> tiny limits?)

File system limits aren't usually an issue; as you say, even FAT32 can
store a metric ton of files in a single directory. I'm more interested
in how long it takes to open a file, and whether doubling that time
will have a measurable impact on Python startup time. Part of that
cost can be reduced by using openat(), on platforms that support it,
but even with a directory handle, there's still a definite non-zero
cost to opening and reading an additional file.
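
A minimal sketch of the directory-handle idea, using the dir_fd parameter of
os.open() (which maps to openat() where the platform provides it). The file
names are hypothetical stand-ins for split cache files, and the fallback
branch is only there so the sketch also runs on platforms without openat().

```python
import os
import tempfile


def read_relative(dir_fd: int, name: str) -> bytes:
    """Read a file by name relative to an already-open directory."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:
        return f.read()


def read_path(path: str) -> bytes:
    """Plain path-based read, used when dir_fd is unsupported."""
    with open(path, "rb") as f:
        return f.read()


def demo() -> list:
    names = ["spam.pyc", "spam-doc.pyc"]  # hypothetical file names
    with tempfile.TemporaryDirectory() as cache_dir:
        for name in names:
            with open(os.path.join(cache_dir, name), "wb") as f:
                f.write(name.encode())
        if os.open not in os.supports_dir_fd:
            return [read_path(os.path.join(cache_dir, n)) for n in names]
        # Open the directory once, then open each file relative to it.
        dir_fd = os.open(cache_dir, os.O_RDONLY)
        try:
            return [read_relative(dir_fd, n) for n in names]
        finally:
            os.close(dir_fd)
```

The directory is resolved once; each further open still costs a system call
and a read, which is the non-zero cost mentioned above.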

ChrisA


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Steven D'Aprano
On Wed, Apr 11, 2018 at 03:38:08AM +1000, Chris Angelico wrote:
> On Wed, Apr 11, 2018 at 2:14 AM, Serhiy Storchaka  wrote:
> > Currently pyc files contain data that is useful mostly for developing and is
> > not needed in most normal cases in stable program. There is even an option
> > that allows to exclude a part of this information from pyc files. It is
> > expected that this saves memory, startup time, and disk space (or the time
> > of loading from network). I propose to move this data from pyc files into
> > separate file or files. pyc files should contain only external references to
> > external files. If the corresponding external file is absent or specific
> > option suppresses them, references are replaced with None or NULL at import
> > time, otherwise they are loaded from external files.
> >
> > 1. Docstrings. They are needed mainly for developing.
> >
> > 2. Line numbers (lnotab). They are helpful for formatting tracebacks, for
> > tracing, and debugging with the debugger. Sources are helpful in such cases
> > too. If the program doesn't contain errors ;-) and is shipped without
> > sources, they could be removed.
> >
> > 3. Annotations. They are used mainly by third party tools that statically
> > analyze sources. They are rarely used at runtime.
> >
> > Docstrings will be read from the corresponding docstring file unless -OO is
> > supplied. This will allow also to localize docstrings. Depending on locale
> > or other settings different docstring file can be used.
> >
> > For suppressing line numbers and annotations new options can be added.
> 
> A deployed Python distribution generally has .pyc files for all of the
> standard library. I don't think people want to lose the ability to
> call help(), and unless I'm misunderstanding, that requires
> docstrings. So this will mean twice as many files and twice as many
> file-open calls to import from the standard library. What will be the
> impact on startup time?

I shouldn't think that the number of files on disk is very important, 
now that they're hidden away in the __pycache__ directory where they can 
be ignored by humans. Even venerable old FAT32 has a limit of 65,534 
files in a single folder, and 268,435,437 on the entire volume. So 
unless the std lib expands to 16000+ modules, the number of files in the 
__pycache__ directory ought to be well below that limit.

I think even MicroPython ought to be okay with that. (But it would be 
nice to find out for sure: does it support file systems with *really* 
tiny limits?)

The entire __pycache__ directory is supposed to be a black box except 
under unusual circumstances, so it doesn't matter (at least not to me)
if we have:

__pycache__/spam.cpython-38.pyc

alone or:

__pycache__/spam.cpython-38.pyc
__pycache__/spam.cpython-38-doc.pyc
__pycache__/spam.cpython-38-lno.pyc
__pycache__/spam.cpython-38-ann.pyc

(say). And if the external references are loaded lazily, on need, rather 
than eagerly, this could save startup time, which I think is the 
intention. The doc strings would be still available, just not loaded 
until the first time you try to use them.
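
A rough sketch of what "loaded lazily, on need" could look like. Everything
here is hypothetical: the LazySection class, the JSON side file and its name
are stand-ins for illustration, not the proposed pyc format.

```python
import json
import os
import tempfile


class LazySection:
    """Load a side file only on first access, then cache the result."""

    def __init__(self, path):
        self.path = path
        self._data = None
        self.loads = 0  # counts how often the file was actually read

    def get(self, key):
        if self._data is None:
            self.loads += 1
            try:
                with open(self.path, encoding="utf-8") as f:
                    self._data = json.load(f)
            except FileNotFoundError:
                # Absent side file: every reference resolves to None.
                self._data = {}
        return self._data.get(key)


def demo():
    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "spam-doc.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"spam.eggs": "Return eggs."}, f)
        docs = LazySection(path)
        before = docs.loads            # nothing has been read yet
        first = docs.get("spam.eggs")  # triggers the single read
        second = docs.get("spam.eggs") # served from the cache
        missing = LazySection(os.path.join(cache_dir, "absent.json"))
        return before, first, second, docs.loads, missing.get("spam.eggs")
```

Importing the module would cost nothing for the side file; the first help()
call would pay for one extra open and read.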

However, Python supports byte-code only distribution, using .pyc files 
external to the __pycache__. In that case, it would be annoying and 
inconvenient to distribute four top-level files, so I think that the use 
of external references has to be optional, and there has to be a way to 
either compile to a single .pyc file containing all four parts, or an 
external tool that can take the existing four files and merge them.


-- 
Steve


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Eric V. Smith

>> 3. Annotations. They are used mainly by third party tools that 
>> statically analyze sources. They are rarely used at runtime.
> 
> Even less used than docstrings probably.

typing.NamedTuple and dataclasses use annotations at runtime. 
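
A small illustration of that point. The class names are made up; the
behaviour shown is how dataclasses and typing.NamedTuple derive their fields
from annotations when the class is created.

```python
import dataclasses
import typing


@dataclasses.dataclass
class Point:
    x: int
    y: int = 0


class Pair(typing.NamedTuple):
    left: str
    right: str


@dataclasses.dataclass
class Unannotated:
    x = 0  # no annotation, so the decorator does not treat it as a field


# Field lists are built from the annotations at class-creation time.
point_fields = [f.name for f in dataclasses.fields(Point)]
pair_fields = Pair._fields
unannotated_fields = dataclasses.fields(Unannotated)
```

If annotations were replaced with None before the class body ran, Point would
end up like Unannotated, with no fields at all.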

Eric


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Antoine Pitrou
On Tue, 10 Apr 2018 19:14:58 +0300
Serhiy Storchaka wrote:
> Currently pyc files contain data that is useful mostly for developing 
> and is not needed in most normal cases in stable program. There is even 
> an option that allows to exclude a part of this information from pyc 
> files. It is expected that this saves memory, startup time, and disk 
> space (or the time of loading from network). I propose to move this data 
> from pyc files into separate file or files. pyc files should contain 
> only external references to external files. If the corresponding 
> external file is absent or specific option suppresses them, references 
> are replaced with None or NULL at import time, otherwise they are loaded 
> from external files.
> 
> 1. Docstrings. They are needed mainly for developing.
> 
> 2. Line numbers (lnotab). They are helpful for formatting tracebacks, 
> for tracing, and debugging with the debugger. Sources are helpful in 
> such cases too. If the program doesn't contain errors ;-) and is shipped
> without sources, they could be removed.
> 
> 3. Annotations. They are used mainly by third party tools that 
> statically analyze sources. They are rarely used at runtime.
> 
> Docstrings will be read from the corresponding docstring file unless -OO 
> is supplied. This will allow also to localize docstrings. Depending on 
> locale or other settings different docstring file can be used.

An alternate proposal would be to have separate sections in a
single marshal file.  The main section (containing the loadable
module) would have references to the other sections. This way it's easy
for the loader to say "all references to the docstring section and/or
to the annotation section are replaced with None", depending on how
Python is started.  It would also be possible to do it on disk with a
strip-like utility.
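
A toy sketch of the sectioned idea, assuming a made-up layout: one
marshalled dict whose "main" section points into optional sections through
("ref", section, key) tuples. The function names and the layout are
hypothetical and only illustrate the loader and the strip-like tool.

```python
import marshal


def build(main, **sections):
    """Serialize the main section plus any optional sections."""
    return marshal.dumps({"main": main, **sections})


def resolve(value, sections, drop):
    """Replace references with their target, or None if dropped/absent."""
    if isinstance(value, tuple) and len(value) == 3 and value[0] == "ref":
        _, section, key = value
        if section in drop or section not in sections:
            return None
        return sections[section].get(key)
    if isinstance(value, dict):
        return {k: resolve(v, sections, drop) for k, v in value.items()}
    return value


def load(blob, drop=()):
    """Loader side: resolve references, ignoring the sections in drop."""
    data = marshal.loads(blob)
    main = data.pop("main")
    return resolve(main, data, set(drop))


def strip(blob, drop):
    """Strip-like tool: rewrite the file without the named sections."""
    data = marshal.loads(blob)
    return marshal.dumps({k: v for k, v in data.items() if k not in drop})


blob = build(
    {"name": "spam", "doc": ("ref", "docstrings", "spam")},
    docstrings={"spam": "Spam module."},
)
full = load(blob)
no_docs = load(blob, drop=["docstrings"])
stripped = load(strip(blob, {"docstrings"}))
```

Dropping a section at load time and stripping it on disk give the same
result here: the reference resolves to None.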

I'm not volunteering to do all this, so just my 2 cents ;-)

Regards

Antoine.




Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Antoine Pitrou
On Tue, 10 Apr 2018 11:13:01 -0700
Ethan Furman  wrote:
> On 04/10/2018 10:54 AM, Zachary Ware wrote:
> > On Tue, Apr 10, 2018 at 12:38 PM, Chris Angelico  wrote:  
> >> A deployed Python distribution generally has .pyc files for all of the
> >> standard library. I don't think people want to lose the ability to
> >> call help(), and unless I'm misunderstanding, that requires
> >> docstrings. So this will mean twice as many files and twice as many
> >> file-open calls to import from the standard library. What will be the
> >> impact on startup time?  
> >
> > What about instead of separate files turning the single file into a
> > pseudo-zip file containing all of the proposed files, and provide a
> > simple tool for removing whatever parts you don't want?  
> 
> -O and -OO already do some trimming; perhaps going that route instead of 
> having multiple files would be better.

"python -O" and "python -OO" *do* generate different pyc files.
If you want to trim docstrings with those options, you need to
regenerate pyc files for all your dependencies (including third-party
libraries and standard library modules).

Serhiy's proposal allows "-O" and "-OO" to work without needing a
custom bytecode generation step.
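
To see the separate files, importlib.util.cache_from_source() reports the
cache path used for each optimization level (Python 3.5 and later). "spam.py"
is a placeholder name; the call needs an implementation that defines a cache
tag, as CPython does.

```python
import importlib.util

# Cache file names for the same source at each optimization level.
plain = importlib.util.cache_from_source("spam.py", optimization="")
opt1 = importlib.util.cache_from_source("spam.py", optimization=1)
opt2 = importlib.util.cache_from_source("spam.py", optimization=2)

for path in (plain, opt1, opt2):
    print(path)
```

The three paths differ, so a tree compiled without optimization has no
ready-made files for "python -O" or "python -OO" to pick up.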

Regards

Antoine.




Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Daniel Moisset
I'm not sure I understand the benefit of this; perhaps you can clarify.
What I see is two scenarios:

Scenario A) External files are present

In this case, the data is loaded from the pyc and then from the external
file, so there are no savings in memory, startup time, disk space, or network
load time; it's just the same disk information and runtime structure with a
different layout.

Scenario B) External files are not present

In this case, you get runtime improvements exactly identical to not having
the data in the pyc, which is roughly what you get with -OO.

The only new capability I see this adds is the localization benefit. Is
that what this proposal is about?



On 10 April 2018 at 17:14, Serhiy Storchaka  wrote:

> Currently pyc files contain data that is useful mostly for developing and
> is not needed in most normal cases in stable program. There is even an
> option that allows to exclude a part of this information from pyc files. It
> is expected that this saves memory, startup time, and disk space (or the
> time of loading from network). I propose to move this data from pyc files
> into separate file or files. pyc files should contain only external
> references to external files. If the corresponding external file is absent
> or specific option suppresses them, references are replaced with None or
> NULL at import time, otherwise they are loaded from external files.
>
> 1. Docstrings. They are needed mainly for developing.
>
> 2. Line numbers (lnotab). They are helpful for formatting tracebacks, for
> tracing, and debugging with the debugger. Sources are helpful in such cases
> too. If the program doesn't contain errors ;-) and is shipped without
> sources, they could be removed.
>
> 3. Annotations. They are used mainly by third party tools that statically
> analyze sources. They are rarely used at runtime.
>
> Docstrings will be read from the corresponding docstring file unless -OO
> is supplied. This will allow also to localize docstrings. Depending on
> locale or other settings different docstring file can be used.
>
> For suppressing line numbers and annotations new options can be added.
>



-- 
Daniel F. Moisset - UK Country Manager - Machinalis Limited
www.machinalis.co.uk 
Skype: @dmoisset T: + 44 7398 827139

1 Fore St, London, EC2Y 9DT

Machinalis Limited is a company registered in England and Wales. Registered
number: 10574987.


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Stephan Houben
There are libraries out there like this:

https://docopt.readthedocs.io/en/0.2.0/

which use docstrings for runtime info.

Today we already have -OO which allows us to create docstring-less bytecode
files in case we have, after careful consideration, established that it is
safe to do so.

I think the current way (-OO) to avoid docstring loading is the correct one.
It pushes the responsibility on whoever did the packaging to decide if -OO
is appropriate.
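
For illustration, the effect of -OO on code that reads its own docstring can
be reproduced with the optimize argument of the built-in compile(). The
usage() function is a made-up stand-in for a docopt-style tool.

```python
SOURCE = '''
def usage():
    """Usage: prog [options]"""
    return usage.__doc__
'''


def run(optimize):
    """Compile SOURCE at the given level and call usage()."""
    namespace = {}
    code = compile(SOURCE, "<example>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["usage"]()


with_docs = run(0)     # like a normal run: docstring kept
without_docs = run(2)  # like -OO: docstring removed at compile time
```

At level 2 the function returns None, which is what a docstring-driven
library would see under -OO.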

The ability to remove the docstrings after bytecode generation would be
kinda nice (similar to the Unix "strip" command), but given how fast bytecode
compilation is, frankly I don't think it is very important.

Stephan

2018-04-10 19:54 GMT+02:00 Zachary Ware :

> On Tue, Apr 10, 2018 at 12:38 PM, Chris Angelico  wrote:
> > A deployed Python distribution generally has .pyc files for all of the
> > standard library. I don't think people want to lose the ability to
> > call help(), and unless I'm misunderstanding, that requires
> > docstrings. So this will mean twice as many files and twice as many
> > file-open calls to import from the standard library. What will be the
> > impact on startup time?
>
> What about instead of separate files turning the single file into a
> pseudo-zip file containing all of the proposed files, and provide a
> simple tool for removing whatever parts you don't want?
>
> --
> Zach


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Ethan Furman

On 04/10/2018 10:54 AM, Zachary Ware wrote:
> On Tue, Apr 10, 2018 at 12:38 PM, Chris Angelico  wrote:
>> A deployed Python distribution generally has .pyc files for all of the
>> standard library. I don't think people want to lose the ability to
>> call help(), and unless I'm misunderstanding, that requires
>> docstrings. So this will mean twice as many files and twice as many
>> file-open calls to import from the standard library. What will be the
>> impact on startup time?
>
> What about instead of separate files turning the single file into a
> pseudo-zip file containing all of the proposed files, and provide a
> simple tool for removing whatever parts you don't want?

-O and -OO already do some trimming; perhaps going that route instead of having
multiple files would be better.

--
~Ethan~



Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Zachary Ware
On Tue, Apr 10, 2018 at 12:38 PM, Chris Angelico  wrote:
> A deployed Python distribution generally has .pyc files for all of the
> standard library. I don't think people want to lose the ability to
> call help(), and unless I'm misunderstanding, that requires
> docstrings. So this will mean twice as many files and twice as many
> file-open calls to import from the standard library. What will be the
> impact on startup time?

What about instead of separate files turning the single file into a
pseudo-zip file containing all of the proposed files, and provide a
simple tool for removing whatever parts you don't want?
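
A minimal sketch of that idea using the stdlib zipfile module as a stand-in
for the "pseudo-zip" container. The member names ("code", "doc", "lno") and
the helper functions are hypothetical; zipfile has no in-place delete, so the
removal tool rewrites the archive without the unwanted members.

```python
import io
import zipfile


def pack(sections):
    """Store each section as one member of an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in sections.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def strip(blob, unwanted):
    """Rewrite the archive, copying every member except the unwanted ones."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(blob)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename not in unwanted:
                dst.writestr(info, src.read(info))
    return out.getvalue()


def names(blob):
    """List the member names of an archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()


archive = pack({"code": b"\x00", "doc": b"docs", "lno": b"lines"})
trimmed = strip(archive, {"doc"})
```

One file on disk means one stat and one open per module, whichever parts
are kept.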

-- 
Zach


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Chris Angelico
On Wed, Apr 11, 2018 at 2:14 AM, Serhiy Storchaka  wrote:
> Currently pyc files contain data that is useful mostly for developing and is
> not needed in most normal cases in stable program. There is even an option
> that allows to exclude a part of this information from pyc files. It is
> expected that this saves memory, startup time, and disk space (or the time
> of loading from network). I propose to move this data from pyc files into
> separate file or files. pyc files should contain only external references to
> external files. If the corresponding external file is absent or specific
> option suppresses them, references are replaced with None or NULL at import
> time, otherwise they are loaded from external files.
>
> 1. Docstrings. They are needed mainly for developing.
>
> 2. Line numbers (lnotab). They are helpful for formatting tracebacks, for
> tracing, and debugging with the debugger. Sources are helpful in such cases
> too. If the program doesn't contain errors ;-) and is shipped without
> sources, they could be removed.
>
> 3. Annotations. They are used mainly by third party tools that statically
> analyze sources. They are rarely used at runtime.
>
> Docstrings will be read from the corresponding docstring file unless -OO is
> supplied. This will allow also to localize docstrings. Depending on locale
> or other settings different docstring file can be used.
>
> For suppressing line numbers and annotations new options can be added.

A deployed Python distribution generally has .pyc files for all of the
standard library. I don't think people want to lose the ability to
call help(), and unless I'm misunderstanding, that requires
docstrings. So this will mean twice as many files and twice as many
file-open calls to import from the standard library. What will be the
impact on startup time?

ChrisA


Re: [Python-ideas] Move optional data out of pyc files

2018-04-10 Thread Antoine Pitrou
On Tue, 10 Apr 2018 19:14:58 +0300
Serhiy Storchaka wrote:
> Currently pyc files contain data that is useful mostly for developing 
> and is not needed in most normal cases in stable program. There is even 
> an option that allows to exclude a part of this information from pyc 
> files. It is expected that this saves memory, startup time, and disk 
> space (or the time of loading from network). I propose to move this data 
> from pyc files into separate file or files. pyc files should contain 
> only external references to external files. If the corresponding 
> external file is absent or specific option suppresses them, references 
> are replaced with None or NULL at import time, otherwise they are loaded 
> from external files.
> 
> 1. Docstrings. They are needed mainly for developing.

Indeed, it may be nice to find a solution to ship them separately.

> 2. Line numbers (lnotab). They are helpful for formatting tracebacks, 
> for tracing, and debugging with the debugger. Sources are helpful in 
> such cases too. If the program doesn't contain errors ;-) and is shipped
> without sources, they could be removed.

What is the weight of lnotab arrays?  While docstrings can be large,
I'm somehow skeptical that removing lnotab arrays would bring a
significant improvement.  It would be nice to have more data about this.
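
One way to gather such data is to walk the code objects of a compiled module
and add up the sizes. This is only a sketch: the helper names are made up,
and on Python 3.10 and later the table is co_linetable (which from 3.11 also
carries column information), so numbers are not comparable across versions.

```python
import types


def iter_code(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code(const)


def line_table_bytes(code):
    """Size of the line-number table of a single code object."""
    # co_linetable exists on Python 3.10+; older versions expose co_lnotab.
    table = getattr(code, "co_linetable", None)
    if table is None:
        table = code.co_lnotab
    return len(table)


def weigh(source, filename="<example>"):
    """Return (bytecode bytes, line-table bytes) for compiled source."""
    top = compile(source, filename, "exec")
    codes = list(iter_code(top))
    return (sum(len(c.co_code) for c in codes),
            sum(line_table_bytes(c) for c in codes))


bytecode_size, table_size = weigh("def f(x):\n    return x + 1\n")
print(bytecode_size, table_size)
```

Running weigh() over the source of real standard library modules would give
the ratio asked about here.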

> 3. Annotations. They are used mainly by third party tools that 
> statically analyze sources. They are rarely used at runtime.

Even less used than docstrings probably.

Regards

Antoine.

