[issue45001] Date parsing helpers in email module incorrectly raise IndexError for some malformed inputs

2021-08-25 Thread wouter bolsterlee


wouter bolsterlee added the comment:

Pull request with fix at https://github.com/python/cpython/pull/27946

--
keywords: +patch
message_count: 1.0 -> 2.0
pull_requests: +26392
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/27946

[issue45001] Date parsing helpers in email module incorrectly raise IndexError for some malformed inputs

2021-08-25 Thread wouter bolsterlee

New submission from wouter bolsterlee:

Various date parsing utilities in the email module, such as 
email.utils.parsedate(), are supposed to gracefully handle invalid input, 
typically by raising an appropriate exception or by returning None.

The internal email._parseaddr._parsedate_tz() helper used by some of these date 
parsing routines tries to be robust against malformed input, but unfortunately 
it can still crash ungracefully when a non-empty but whitespace-only input is 
passed. This manifests as an unexpected IndexError.

In practice, this can happen when parsing an email with only a newline inside a 
‘Date:’ header, which unfortunately happens occasionally in the real world.

Here's a minimal example:

$ python
Python 3.9.6 (default, Jun 30 2021, 10:22:16) 
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.utils
>>> email.utils.parsedate('foo')
>>> email.utils.parsedate(' ')
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.9/email/_parseaddr.py", line 176, in parsedate
t = parsedate_tz(data)
  File "/usr/lib/python3.9/email/_parseaddr.py", line 50, in parsedate_tz
res = _parsedate_tz(data)
  File "/usr/lib/python3.9/email/_parseaddr.py", line 72, in _parsedate_tz
if data[0].endswith(',') or data[0].lower() in _daynames:
IndexError: list index out of range


The fix is rather straightforward; I will open a pull request shortly.
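
For reference, the guard could look something like this (a sketch; the actual
pull request may differ in detail):

    def _parsedate_tz(data):
        if not data:
            return None
        data = data.split()
        # A whitespace-only string splits into an empty list, so guard
        # against that before indexing data[0] below.
        if not data:
            return None
        ...  # existing parsing logic continues unchanged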

--
components: email
messages: 400261
nosy: barry, r.david.murray, wbolster
priority: normal
severity: normal
status: open
title: Date parsing helpers in email module incorrectly raise IndexError for 
some malformed inputs
type: crash
versions: Python 3.10, Python 3.11, Python 3.9

[issue35701] [uuid] 3.8 breaks weak references for UUIDs

2019-01-14 Thread wouter bolsterlee


wouter bolsterlee added the comment:

The test could be something like:

x = uuid.uuid4()
y = weakref.ref(x)
assert x is y()
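
For reference: a slotted class only supports weak references if '__weakref__'
is listed in __slots__, so the fix is presumably along these lines (a sketch,
not the actual uuid.py source):

    class UUID:
        # '__weakref__' restores the weak-reference support that
        # non-slotted classes get for free via their __dict__.
        __slots__ = ('int', 'is_safe', '__weakref__')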

--

[issue30907] speed up comparisons to self for built-in containers

2017-07-21 Thread wouter bolsterlee

Changes by wouter bolsterlee <u...@xs4all.nl>:


--
resolution:  -> rejected
stage:  -> resolved
status: open -> closed

[issue30977] reduce uuid.UUID() memory footprint

2017-07-21 Thread wouter bolsterlee

wouter bolsterlee added the comment:

I consider UUIDs low-level data types, not fancy containers, similar to how I 
view datetime objects. Given the native support in e.g. PostgreSQL and many 
other systems, it's common to deal with UUIDs.

Of course you can convert to/from strings or numbers, but that is cumbersome in 
many cases. For comparison, one would typically not convert unicode text 
from/into UTF-8 encoded byte strings either, even though the latter would save 
memory in many cases.

From experience: converting can lead to nasty bugs, e.g. because you forgot 
about a conversion, and then a UUID string does not compare equal to a 
uuid.UUID instance, leaving you puzzled.

--

[issue30977] reduce uuid.UUID() memory footprint

2017-07-20 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

As a follow-up note, I also experimented with keeping the actual value as a 
bytes object instead of an integer, but that does not lead to less memory being 
used: a 128-bit integer uses less memory than a 16-byte bytes object 
(presumably because PyBytesObject has a cached hash() field and a trailing null 
byte).
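
This is easy to verify with sys.getsizeof (a quick sketch; exact numbers vary 
by platform and Python version):

    import sys

    value = (1 << 128) - 1  # a 128-bit integer, like uuid.UUID.int
    print(sys.getsizeof(value))                      # e.g. 44 on 64-bit CPython
    print(sys.getsizeof(value.to_bytes(16, 'big')))  # e.g. 49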

--

[issue30977] reduce uuid.UUID() memory footprint

2017-07-20 Thread Wouter Bolsterlee

New submission from Wouter Bolsterlee:

Memory usage for uuid.UUID seems larger than it has to be. It seems that using 
__slots__ will save around ~100 bytes per instance, which is very significant, 
e.g. when dealing with large sets of UUIDs (which are often used as "primary 
keys" into external data stores).

uuid.UUID has a __setattr__ that prevents any extra attributes from being set:

    def __setattr__(self, name, value):
        raise TypeError('UUID objects are immutable')

...so it seems to me that not having a __dict__ should not cause any problems?
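
A minimal sketch of the comparison using hypothetical stand-in classes (not 
the actual uuid.py source):

    import sys

    class WithDict:
        def __init__(self, value):
            self.int = value

    class WithSlots:
        __slots__ = ('int',)
        def __init__(self, value):
            self.int = value

    a = WithDict(1 << 127)
    b = WithSlots(1 << 127)
    # sys.getsizeof() does not follow references, so count the
    # per-instance __dict__ separately for the non-slotted class.
    print(sys.getsizeof(a) + sys.getsizeof(a.__dict__))  # larger
    print(sys.getsizeof(b))                              # smaller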


Before (RES column):

>>> import uuid
>>> y = {uuid.uuid4() for _ in range(1000000)}

  PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
23020 wbolster   20   0  315M  253M  7256 S  0.0  1.6  0:04.98 python

With __slots__:

>>> import uuid
>>> y = {uuid.uuid4() for _ in range(1000000)}

  PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
21722 wbolster   20   0  206M  145M  7348 S  0.0  0.9  0:05.03 python

I will open a PR on GitHub shortly.

--
messages: 298738
nosy: wbolster
priority: normal
severity: normal
status: open
title: reduce uuid.UUID() memory footprint

[issue30977] reduce uuid.UUID() memory footprint

2017-07-20 Thread Wouter Bolsterlee

Changes by Wouter Bolsterlee <u...@xs4all.nl>:


--
pull_requests: +2835

[issue30907] speed up comparisons to self for built-in containers

2017-07-13 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

fwiw, "if (PyDict_CheckExact(a) && a == b)" amounts to two pointer comparisons, 
the overhead of which is too small to measure. (i tried.)

--

[issue30907] speed up comparisons to self for built-in containers

2017-07-12 Thread Wouter Bolsterlee

Changes by Wouter Bolsterlee <u...@xs4all.nl>:


--
pull_requests: +2745

[issue30907] speed up comparisons to self for built-in containers

2017-07-12 Thread Wouter Bolsterlee

New submission from Wouter Bolsterlee:

When comparing instances of the built-in container types (dict, list, and 
others), Python assumes that "identity implies equality". This is documented 
(and assumed) behaviour:


"In enforcing reflexivity of elements, the comparison of collections assumes 
that for a collection element x, x == x is always true. Based on that 
assumption, element identity is compared first, and element comparison is 
performed only for distinct elements."

https://docs.python.org/3/reference/expressions.html#value-comparisons

Because of this, various operations such as rich comparison of lists can be 
sped up by checking for identity first, and only falling back to (possibly) 
much slower rich comparisons on container elements when two elements are not 
identical (i.e. id(a) != id(b)).

Because identity implies equality, comparing a container instance to itself is 
guaranteed to compare equal.
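
The classic illustration of the element-level shortcut (a quick sketch):

    import math

    x = [math.nan]
    # The element comparison is skipped because x[0] is x[0], so the
    # list compares equal to itself...
    print(x == x)        # True
    # ...even though the element itself does not compare equal.
    print(x[0] == x[0])  # False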

Currently, comparing a list to itself takes O(n) time, which can be measured 
easily:

>>> x = list(range(1000))
>>> %timeit x == x
2.95 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> y = list(range(100000))
>>> %timeit y == y
293 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The second example is 100 times as slow.

I will shortly submit a few pull requests to make "compare to self" run in O(1) 
time instead for a few of the built-in containers.

--
components: Interpreter Core
messages: 298213
nosy: wbolster
priority: normal
severity: normal
status: open
title: speed up comparisons to self for built-in containers
versions: Python 3.7

[issue23493] optimize sort_keys in json module by using operator.itemgetter()

2015-02-23 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

Using IPython and CPython 3.4:

>>> d = dict.fromkeys(map(str, range(1000)))


Current implementation:

>>> %timeit sorted(d.items(), key=lambda kv: kv[0])
1000 loops, best of 3: 605 µs per loop
>>> %timeit sorted(d.items(), key=lambda kv: kv[0])
1000 loops, best of 3: 614 µs per loop

Proposed change:

>>> %timeit sorted(d.items(), key=operator.itemgetter(0))
1000 loops, best of 3: 527 µs per loop
>>> %timeit sorted(d.items(), key=operator.itemgetter(0))
1000 loops, best of 3: 523 µs per loop

Alternative without a 'key' arg (since all keys in a JSON object must be 
strings, the tuples from .items() can be compared directly):

>>> %timeit sorted(d.items())
1000 loops, best of 3: 755 µs per loop
>>> %timeit sorted(d.items())
1000 loops, best of 3: 756 µs per loop

As you can see, the operator approach seems the fastest on CPython 3.4, even 
faster than not having a 'key' arg at all (possibly because it avoids doing a 
tuple compare, which descends into a string compare for the first item).

--

[issue23493] optimize sort_keys in json module by using operator.itemgetter()

2015-02-20 Thread Wouter Bolsterlee

New submission from Wouter Bolsterlee:

The JSON encoder uses a lambda function as the key= argument of the sorted() 
invocation used to sort the keys of a JSON object when sort_keys=True is passed:

https://hg.python.org/cpython/file/46bfddb14cbe/Lib/json/encoder.py#l352

Instead of having a lambda, operator.itemgetter(0) can be used here, which is a 
minor performance gain. It should be noted that many other internals of the 
JSON encoder have also been fine-tuned (manually) for performance reasons.
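
A sketch of the proposed change (dct here is a hypothetical stand-in for the 
dict being encoded):

    import operator

    dct = {'b': 1, 'a': 2}

    # Current implementation: a Python-level lambda called per key.
    items = sorted(dct.items(), key=lambda kv: kv[0])

    # Proposed: operator.itemgetter(0) computes the same key without
    # the per-call overhead of a Python-level function.
    items = sorted(dct.items(), key=operator.itemgetter(0))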

--
components: Library (Lib)
messages: 236302
nosy: wbolster
priority: normal
severity: normal
status: open
title: optimize sort_keys in json module by using operator.itemgetter()
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

[issue18962] Add special case for single iterator in heapq.merge function

2013-09-11 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

Thanks for the quick response. 

Btw, do I understand correctly that code cleanups are not welcome, even when 
touching the code anyway?

--

[issue18962] Add special case for single iterator in heapq.merge function

2013-09-11 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

(In case you missed it: my latest comment included a cleaned up version of an 
earlier patch.)

--

[issue18962] Add special case for single iterator in heapq.merge function

2013-09-10 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

Thanks Raymond, that is exactly what I had in mind (see my previous comment).
Here's a slightly cleaned up version of the patch (stylistic/PEP8 cleanups),
with some benchmarks included below.

In case the two longest iterators have about the same size, no performance
difference can be measured:

$ ./python -m timeit -s 'from heapq import merge' 'for x in merge([], [], [1], 
range(100), range(100)): pass'

Without patch:

10000 loops, best of 3: 71.2 usec per loop
10000 loops, best of 3: 71.9 usec per loop
10000 loops, best of 3: 71.7 usec per loop

With patch:

10000 loops, best of 3: 71.4 usec per loop
10000 loops, best of 3: 76.7 usec per loop
10000 loops, best of 3: 72.1 usec per loop

As expected, the performance gain is very significant in case one of the
iterators is much longer than the others:

$ python -m timeit -n 100 -s 'from heapq import merge' 'for x in merge([], [], 
[1], range(100), range(100000)): pass'

Without patch:

100 loops, best of 3: 27.8 msec per loop
100 loops, best of 3: 26.9 msec per loop
100 loops, best of 3: 27.7 msec per loop

With patch:

100 loops, best of 3: 6.26 msec per loop
100 loops, best of 3: 6.28 msec per loop
100 loops, best of 3: 6.03 msec per loop

--
Added file: http://bugs.python.org/file31726/merge3.diff

[issue18962] Add special case for single iterator in heapq.merge function

2013-09-08 Thread Wouter Bolsterlee

Wouter Bolsterlee added the comment:

An additional speedup would be to add an "if len(h) == 1" check inside the 
while loop, and just yield from the remaining iterator if a single iterable 
remains. This would also speed up merges with multiple inputs, as it doesn't do 
the whole heapreplace() loop for the last remaining iterable. Example: 
merge([], [], [], range(10)). This would involve some more refactoring inside 
the function though, since the current implementation only stores the 
__next__() method for each iterable.
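
A standalone sketch of the idea (not the stdlib implementation):

    import heapq

    def merge_sketch(*iterables):
        # Prime the heap with (value, index, iterator) entries; the
        # index breaks ties so iterators themselves are never compared.
        h = []
        for index, it in enumerate(map(iter, iterables)):
            try:
                h.append((next(it), index, it))
            except StopIteration:
                pass
        heapq.heapify(h)
        while len(h) > 1:
            value, index, it = h[0]
            yield value
            try:
                heapq.heapreplace(h, (next(it), index, it))
            except StopIteration:
                heapq.heappop(h)
        # Once a single iterator remains, skip the heap bookkeeping
        # entirely and just drain it.
        if h:
            value, _, it = h[0]
            yield value
            yield from it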

--

[issue18962] Add special case for single iterator in heapq.merge function

2013-09-07 Thread Wouter Bolsterlee

New submission from Wouter Bolsterlee:

The heapq.merge() function merges multiple sorted iterables into a single 
sorted output.  The function uses a heap queue that is repeatedly looped over 
until it has generated all output.

If only a single iterable is passed to heapq.merge(), the heap will have len(h) 
== 1, which means the looping and heap manipulation just degrades to a 
convoluted yield from iterables[0]. This negatively impacts performance. The 
attached patch adds a short-circuit for this single input case.
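
The short-circuit amounts to something like this (a sketch, not the attached 
patch verbatim):

    import heapq

    def merge_shortcircuit(*iterables):
        if len(iterables) == 1:
            # Single input: plain iteration, no heap bookkeeping needed.
            yield from iterables[0]
            return
        yield from heapq.merge(*iterables)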

The behaviour of the function remains unchanged:

>>> list(merge_before([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
[0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]
>>> list(merge_after([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
[0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]

For multiple inputs, there is no measurable performance difference (measured
using IPython's %timeit). Without patch:

>>> %timeit list(merge_before([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.7 µs per loop
>>> %timeit list(merge_before([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.4 µs per loop
>>> %timeit list(merge_before([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.5 µs per loop

With patch:

>>> %timeit list(merge_after([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.7 µs per loop
>>> %timeit list(merge_after([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.6 µs per loop
>>> %timeit list(merge_after([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.6 µs per loop
>>> %timeit list(merge_after([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
100000 loops, best of 3: 13.8 µs per loop

The real performance gain is of course when only a single iterable is passed. 
Without patch:

>>> %timeit for x in merge_before(range(10000)): pass
100 loops, best of 3: 2.65 ms per loop
>>> %timeit for x in merge_before(range(10000)): pass
100 loops, best of 3: 2.52 ms per loop
>>> %timeit for x in merge_before(range(10000)): pass
100 loops, best of 3: 2.51 ms per loop

With patch:

>>> %timeit for x in merge_after(range(10000)): pass
1000 loops, best of 3: 604 µs per loop
>>> %timeit for x in merge_after(range(10000)): pass
1000 loops, best of 3: 609 µs per loop
>>> %timeit for x in merge_after(range(10000)): pass
1000 loops, best of 3: 603 µs per loop

This is a ~4x speedup for an input consisting of 10000 items. Compared to plain 
iteration:

>>> %timeit for x in range(10000): pass
1000 loops, best of 3: 263 µs per loop
>>> %timeit for x in range(10000): pass
1000 loops, best of 3: 268 µs per loop
>>> %timeit for x in range(10000): pass
1000 loops, best of 3: 273 µs per loop

Without the patch, heapq.merge() is ~10x as slow as plain iteration. With the 
patch, this is reduced to ~2.5x.

(Note: all testing was done using Python 3.3.2 on a 64-bit Linux machine.)

--
components: Library (Lib)
files: heapq-merge-single-input-optimization.patch
keywords: patch
messages: 197174
nosy: wbolster
priority: normal
severity: normal
status: open
title: Add special case for single iterator in heapq.merge function
type: performance
versions: Python 3.3
Added file: 
http://bugs.python.org/file31649/heapq-merge-single-input-optimization.patch
