[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-15 Thread Shantanu


Change by Shantanu :


--
nosy: +hauntsaninja
nosy_count: 6.0 -> 7.0
pull_requests: +19428
pull_request: https://github.com/python/cpython/pull/19806

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-14 Thread STINNER Victor

STINNER Victor  added the comment:

Thanks Lumír and Miro! I am closing the issue.

--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-14 Thread STINNER Victor

STINNER Victor  added the comment:


New changeset e77d428856fbd339faee44ff47214eda5fb51d57 by Lumír 'Frenzy' Balhar 
in branch 'master':
bpo-40495: compileall option to hardlink duplicate pyc files (GH-19901)
https://github.com/python/cpython/commit/e77d428856fbd339faee44ff47214eda5fb51d57


--




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-11 Thread STINNER Victor


STINNER Victor  added the comment:

> Currently, it's possible to implement this optimization using the Unix 
> command "hardlink".

PR 19901 avoids the dependency on the external "hardlink" command.

In practice, PR 19901 only affects newly written PYC files, whereas running 
the "hardlink" command manually cannot track which files are new and which 
are not. The "hardlink" command is therefore less practical; PR 19901 avoids 
modifying PYC files that we don't "own".

--




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-11 Thread STINNER Victor


STINNER Victor  added the comment:

Currently, it's possible to implement this optimization using the Unix command 
"hardlink". Example:

hardlink -c -v /usr/lib64/python3.8/__pycache__/*.pyc

On my Fedora 32, this command says:

Directories:     1
Objects:         520
Regular files:   519
Comparisons:     133
Linked:          133
Saved:           2220032

For example, string.cpython-38.pyc and string.cpython-38.opt-1.pyc become hard 
links.
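
Whether two cache files ended up as hard links can be checked from Python by comparing their device and inode numbers. The following is only a minimal sketch: the helper name are_hardlinked() and the demo file contents are hypothetical, and the file names merely mimic the string.cpython-38*.pyc pair mentioned above.

```python
import os
import tempfile


def are_hardlinked(path_a, path_b):
    """Return True if both paths refer to the same inode on the same device."""
    stat_a = os.stat(path_a)
    stat_b = os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def demo():
    # Hypothetical stand-ins for a pair of identical bytecode cache files.
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "string.cpython-38.pyc")
        opt1 = os.path.join(tmp, "string.cpython-38.opt-1.pyc")
        with open(plain, "wb") as f:
            f.write(b"fake bytecode")
        # Create the second name as a hard link to the first file.
        os.link(plain, opt1)
        # st_nlink counts how many directory entries point at the inode.
        return are_hardlinked(plain, opt1), os.stat(plain).st_nlink


if __name__ == "__main__":
    print(demo())
```

The standard library function os.path.samefile() performs the same device and inode comparison.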

--




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-11 Thread Miro Hrončok

Miro Hrončok  added the comment:

> Is it possible that the content of the optimization level 0 PYC file is 
> modified if the PY file content changes, which would make the PYC files of 
> optimization levels 1 and 2 inconsistent? ...

Note that there is a test exactly for this, in case the implementation is 
changed in the future.

--




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-11 Thread STINNER Victor


STINNER Victor  added the comment:

While reviewing PR 19901, I was confused by the py_compile and compileall 
documentation, which is outdated: it doesn't mention that the optimize argument 
can be a list of integers.

https://docs.python.org/dev/library/py_compile.html#py_compile.compile
"optimize controls the optimization level and is passed to the built-in 
compile() function. The default of -1 selects the optimization level of the 
current interpreter."

https://docs.python.org/dev/library/compileall.html#compileall.compile_dir
"optimize specifies the optimization level for the compiler. It is passed to 
the built-in compile() function."

--




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-11 Thread STINNER Victor

STINNER Victor  added the comment:

Is it possible that the content of the optimization level 0 PYC file is 
modified if the PY file content changes, which would make the PYC files of 
optimization levels 1 and 2 inconsistent?

Christian Heimes:
> Python's import system is fully compatible with this approach. importlib 
> never directly writes to a .pyc file. Instead it always creates a new 
> temporary file next to the .pyc file and then overrides the .pyc file with an 
> atomic file system operation. See _write_atomic() in 
> Lib/importlib/_bootstrap_external.py.

It seems like importlib doesn't have this issue because it doesn't open the PYC 
file to write its content in place: _write_atomic() creates a *new* file and 
then calls os.replace() to rename the temporary file to the final PYC name.

Alright, I think that I understood :-)
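
To make this concrete, here is a minimal sketch of why the write-then-rename pattern is safe for hard-linked files. The write_atomic() helper below is a simplified, hypothetical stand-in and not the actual _write_atomic() from importlib; the file names are invented for the demo.

```python
import os
import tempfile


def write_atomic(path, data):
    """Write data to a temporary file, then rename it over path."""
    tmp_path = f"{path}.{os.getpid()}.tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    # os.replace() swaps the directory entry; path now names a new inode,
    # and the old inode is left untouched.
    os.replace(tmp_path, path)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        opt0 = os.path.join(tmp, "mod.pyc")
        opt1 = os.path.join(tmp, "mod.opt-1.pyc")
        with open(opt0, "wb") as f:
            f.write(b"old")
        # Both names share one inode, as after hardlink deduplication.
        os.link(opt0, opt1)
        # Regenerate only the level 0 file.
        write_atomic(opt0, b"new")
        with open(opt0, "rb") as f0, open(opt1, "rb") as f1:
            return f0.read(), f1.read(), os.path.samefile(opt0, opt1)


if __name__ == "__main__":
    print(demo())
```

After the rewrite, the level 0 name holds the new content while the other name still holds the old content, and the two names no longer share an inode. Opening the existing file for writing instead would have changed both names at once.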

--

PYC files became more complicated with PEP 552. Here are my own notes to try to 
understand how they are supposed to be used.


Python 3.9 now has an _imp.check_hash_based_pycs string which can be overridden 
by the --check-hash-based-pycs command line option. It can have 3 values:
* "always"
* "never"
* "default"

These values are defined by PEP 552:

* "never" causes the interpreter to always assume hash-based pycs are valid
* "default" means the check_source flag in hash-based pycs determines 
invalidation
* "always" causes the interpreter to hash the source file for invalidation 
regardless of value of check_source bit

When a PYC file is created, it has a "check_source" bit:

* Bit set: If the check_source flag is set, Python will determine the validity 
of the pyc by hashing the source file and comparing the hash with the expected 
hash in the pyc. If the pyc needs to be regenerated, it will be regenerated as 
a hash-based pyc again with the check_source flag set.
* Bit unset: Python will simply load the pyc without checking the hash of the 
source file. The expectation in this case is that some external system (e.g., 
the local Linux distribution’s package manager) is responsible for keeping pycs 
up to date, so Python itself doesn’t have to check.

I mostly copied/pasted the PEP 552 :-)

py_compile and compileall have a new invalidation_mode which can have 3 values:

class PycInvalidationMode(Enum):
    TIMESTAMP = 1
    CHECKED_HASH = 2
    UNCHECKED_HASH = 3

The default is computed in py_compile by:

def _get_default_invalidation_mode():
    if os.environ.get('SOURCE_DATE_EPOCH'):
        return PycInvalidationMode.CHECKED_HASH
    else:
        return PycInvalidationMode.TIMESTAMP

importlib: SourceLoader.get_code(filename) uses:

flags = _classify_pyc(data, fullname, exc_details)
bytes_data = memoryview(data)[16:]
hash_based = flags & 0b1 != 0
if hash_based:
    check_source = flags & 0b10 != 0
    if (_imp.check_hash_based_pycs != 'never' and
        (check_source or
         _imp.check_hash_based_pycs == 'always')):
        source_bytes = self.get_data(source_path)
        source_hash = _imp.source_hash(
            _RAW_MAGIC_NUMBER,
            source_bytes,
        )
        _validate_hash_pyc(data, source_hash, fullname,
                           exc_details)
else:
    _validate_timestamp_pyc(
        data,
        source_mtime,
        st['size'],
        fullname,
        exc_details,
    )
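
The two flag bits tested above can also be read straight from a PYC file. The sketch below assumes the PEP 552 header layout (a 4-byte magic number followed by a 4-byte little-endian flags field); the helper names pyc_flags() and demo() and the example file names are hypothetical.

```python
import os
import py_compile
import struct
import tempfile


def pyc_flags(pyc_path):
    """Return (hash_based, check_source) decoded from a PYC header."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    # Bytes 0-3: magic number; bytes 4-7: little-endian flags (PEP 552).
    (flags,) = struct.unpack("<I", header[4:8])
    hash_based = bool(flags & 0b1)
    check_source = bool(flags & 0b10)
    return hash_based, check_source


def demo():
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "example.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        # Compile the same source once per invalidation mode.
        for mode in py_compile.PycInvalidationMode:
            pyc = os.path.join(tmp, f"example.{mode.name}.pyc")
            py_compile.compile(src, cfile=pyc, invalidation_mode=mode,
                               doraise=True)
            results[mode.name] = pyc_flags(pyc)
    return results


if __name__ == "__main__":
    for name, bits in demo().items():
        print(name, bits)
```

A timestamp-based PYC has neither bit set, CHECKED_HASH sets both bits, and UNCHECKED_HASH sets only the hash_based bit.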

--
nosy: +vstinner




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Lumír Balhar

Change by Lumír Balhar :


--
keywords: +patch
pull_requests: +19214
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/19901




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Lumír Balhar

Lumír Balhar  added the comment:

I forgot to mention that I am working on a PR, which should be ready soon 
because the implementation is already done and tested in compileall2.

--




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Christian Heimes


Christian Heimes  added the comment:

Brett, FYI

--
nosy: +brett.cannon




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Christian Heimes


Christian Heimes  added the comment:

Python's import system is fully compatible with this approach.

importlib never directly writes to a .pyc file. Instead it always creates a new 
temporary file next to the .pyc file and then overrides the .pyc file with an 
atomic file system operation. See _write_atomic() in 
Lib/importlib/_bootstrap_external.py.

compileall and py_compile also use _write_atomic().

--
nosy: +christian.heimes




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Filipe Laíns

Change by Filipe Laíns :


--
nosy: +FFY00




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Miro Hrončok

Change by Miro Hrončok :


--
nosy: +hroncok




[issue40495] compileall: option to hardlink duplicate optimization levels bytecode cache files

2020-05-04 Thread Lumír Balhar

New submission from Lumír Balhar :

We would like to add optional hardlink deduplication of identical pyc files 
to the compileall module in Python 3.9. We've discussed the change [0] and 
tested it in the Fedora RPM build system via an implementation in the 
compileall2 module [1].

The discussion [0] contains a lot of details so I mention here only the key 
features:
* the deduplication can be enabled only if multiple optimization levels are 
processed at once
* it generates a pyc file (optimization level 0) as usual, but if it finds that 
the optimized files (optimization levels 1 and 2) have the same content, it 
uses hardlinks (os.link) to prevent duplicates
* the deduplication is disabled by default

We believe that this might be handy for more Pythonistas. In our case, this 
functionality lowers the installation size of Python 3.9 from 125 MiB to 103 
MiB.
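
The key features listed above can be sketched roughly as follows. This is only an illustration built from py_compile, filecmp and os.link, not the code from the PR or from compileall2; the function name compile_with_hardlinks() and the demo helper are hypothetical.

```python
import filecmp
import importlib.util
import os
import py_compile
import tempfile


def compile_with_hardlinks(source, optimize_levels=(0, 1, 2)):
    """Compile source at each level; hardlink a level to the previous
    one when both files have identical content."""
    pyc_paths = []
    for level in optimize_levels:
        # optimization="" selects the plain .pyc name used for level 0.
        cfile = importlib.util.cache_from_source(
            source, optimization="" if level == 0 else level)
        py_compile.compile(source, cfile=cfile, optimize=level, doraise=True)
        if pyc_paths and filecmp.cmp(pyc_paths[-1], cfile, shallow=False):
            # Same bytes as the previous level: replace the copy by a link.
            os.unlink(cfile)
            os.link(pyc_paths[-1], cfile)
        pyc_paths.append(cfile)
    return pyc_paths


def demo(source_text):
    """Return the number of distinct inodes used by the three pyc files."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "example.py")
        with open(source, "w") as f:
            f.write(source_text)
        pycs = compile_with_hardlinks(source)
        return len({os.stat(p).st_ino for p in pycs})


if __name__ == "__main__":
    print(demo("x = 1\n"))
    print(demo('"""Docstring."""\nx = 1\n'))
```

A module without asserts or docstrings compiles to the same bytes at all three levels, so all three names share one inode. A module with a docstring differs at level 2, so only levels 0 and 1 are linked.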

[0] 
https://discuss.python.org/t/compileall-option-to-hardlink-duplicate-optimization-levels-bytecode-cache-files/3014
[1] https://github.com/fedora-python/compileall2

--
components: Library (Lib)
messages: 368022
nosy: frenzy
priority: normal
severity: normal
status: open
title: compileall: option to hardlink duplicate optimization levels bytecode 
cache files
type: enhancement
versions: Python 3.9
