[issue46990] Surprising list overallocation from .split()

2022-03-11 Thread Tim Peters


Tim Peters  added the comment:

Well, that's annoying ;-) In context, the OP was saving a list of 10 million 
splits. So each overallocation by a single element burned 80 million bytes of 
RAM. Overallocating by 7 burned 560 million bytes.

Which is unusual. Usually a split result is short-lived, consumed once then 
thrown away.

OTOH, the overwhelming motivation for overallocating at all is to acheive O(1) 
amortized time after a long _sequence_ of appends, and split results typically 
aren't appended to at all. split() appears to be using it as a timing 
micro-optimization for tiny lists instead.

So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't 
overallocate at all (because, as before, split results are rarely appended to). 
A compromise could be to save pointers to the first N (12, whatever) instances 
of the splitting string in a stack ("auto") vector, before any list object (or 
result string object) is created. If it's out of stuff to do before reaching N, 
fine, build a result out of exactly what was found. If there's more to do, 
build a result from the first N, and go on as currently (letting PyList_Append 
deal with it - overallocation is huge in percentage terms when the list is 
short, but not so much as the list gets longer).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46990] Surprising list overallocation from .split()

2022-03-11 Thread Jelle Zijlstra


Jelle Zijlstra  added the comment:

The value 12 is hardcoded here: 
https://github.com/python/cpython/blob/a89c29fbcc7e7e85848499443d819c3fab68c78a/Objects/stringlib/split.h#L14

The comment there says that this is because most .split() calls are on lines of 
human-readable text, which has about 11 words per line. I don't know if I 
believe that.

--
nosy: +JelleZijlstra

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46990] Surprising list overallocation from .split()

2022-03-11 Thread Tim Peters


Change by Tim Peters :


--
type: behavior -> resource usage

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue46990] Surprising list overallocation from .split()

2022-03-11 Thread Tim Peters


New submission from Tim Peters :

When looking into a StackOverflow question about surprisingly high memory use, 
I stumbled into this (under 3.10.1, Win64):

>>> import sys
>>> s = "1 2 3 4 5".split()
>>> s
['1', '2', '3', '4', '5']
>>> sys.getsizeof(s)
152
>>> _ - sys.getsizeof([])
96
>>> 96 / 8
12.0

That is, we allocated enough space in the list to store 12(!) elements, despite 
that only 5 are used. Other ways of building a 5-element list I've tried 
overallocate by at most 3 slots:

>>> sys.getsizeof([ch for ch in "12345"])
120
>>> sys.getsizeof([1, 2, 3, 4, 5])
120

(and 120 - 56 = 64, room for 8 pointers)

Then there's this curiosity, which allocates space for exactly the 5 needed:

>>> sys.getsizeof(list(tuple("1 2 3 4 5".split(
96

(and 96 - 56 = 40, room for the 5 pointers needed)

I don't expect this to be consistent, but allocating space for 12 when only 5 
are needed is unreasonable. Even allocating space for 8 is pushing it ;-)

--
components: Interpreter Core
messages: 414942
nosy: tim.peters
priority: normal
severity: normal
status: open
title: Surprising list overallocation from .split()
type: behavior
versions: Python 3.11

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com