Re: [Python-Dev] Usage of += on strings in loops in stdlib
I added a _PyUnicodeWriter internal API to optimize str % args and str.format(args). It uses an overallocated buffer, so it is basically like the CPython str += str optimization. I still don't know how efficient it is on Windows, since realloc() is slow on Windows (at least on old Windows versions).

We should add an official, public API to concatenate strings. I know that PyPy already has its own API. Example:

    writer = UnicodeWriter()
    for item in data:
        writer += item   # I guess that it's faster than writer.append(item)
    return str(writer)   # or writer.getvalue()?

I don't care about the exact implementation of UnicodeWriter; it just has to be as fast as or faster than ''.join(data).

I don't remember whether _PyUnicodeWriter is faster or slower than StringIO. I created an issue for that: http://bugs.python.org/issue15612

Victor

2013/2/12 Maciej Fijalkowski fij...@gmail.com:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before.
>
> Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too.
>
> How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?
>
> What about other performance improvements in stdlib that are problematic for pypy or others?
>
> Personally I would like cleaner code in stdlib vs speeding up CPython. Typically that also helps pypy so I'm not unbiased.
>
> Cheers,
> fijal
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com
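For illustration, here is a minimal pure-Python sketch of the interface proposed above. UnicodeWriter is the hypothetical name from the message, not an existing class, and concat() is a made-up helper. The sketch simply wraps io.StringIO, so it shows the shape of the API (+= and getvalue()) and says nothing about how _PyUnicodeWriter performs.

```python
import io


class UnicodeWriter:
    """Hypothetical append-only string builder (name taken from the proposal)."""

    def __init__(self):
        self._buffer = io.StringIO()

    def append(self, text):
        self._buffer.write(text)

    def __iadd__(self, text):
        # "writer += item" appends in place and keeps the name bound to the writer.
        self.append(text)
        return self

    def getvalue(self):
        return self._buffer.getvalue()

    def __str__(self):
        return self.getvalue()


def concat(data):
    """Made-up helper mirroring the loop in the message."""
    writer = UnicodeWriter()
    for item in data:
        writer += item
    return str(writer)
```

Whether such a wrapper could ever match ''.join(data) is exactly the open question of the message; a C implementation would presumably be needed for that.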
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 10:02 AM, Victor Stinner victor.stin...@gmail.com wrote:
> I added a _PyUnicodeWriter internal API to optimize str % args and str.format(args). It uses an overallocated buffer, so it is basically like the CPython str += str optimization. I still don't know how efficient it is on Windows, since realloc() is slow on Windows (at least on old Windows versions).
>
> We should add an official, public API to concatenate strings. I know that PyPy already has its own API. Example:
>
>     writer = UnicodeWriter()
>     for item in data:
>         writer += item   # I guess that it's faster than writer.append(item)
>     return str(writer)   # or writer.getvalue()?
>
> I don't care about the exact implementation of UnicodeWriter; it just has to be as fast as or faster than ''.join(data).
>
> I don't remember whether _PyUnicodeWriter is faster or slower than StringIO. I created an issue for that: http://bugs.python.org/issue15612
>
> Victor

It's in __pypy__.builders (StringBuilder and UnicodeBuilder). The API does not really matter, as long as there is a way to preallocate a certain size (which I don't think there is in StringIO, for example). bytearray comes close but has a relatively inconvenient API, and any pure-Python bytearray wrapper will not be fast on CPython.
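A rough sketch of how the PyPy builder mentioned above might be used with a portable fallback. The import only succeeds on PyPy. The fallback class and the join_with_builder() helper are made up for this example, and the size argument is treated purely as a preallocation hint (the fallback ignores it).

```python
try:
    # Only available on PyPy.
    from __pypy__.builders import StringBuilder
except ImportError:
    class StringBuilder:
        """Made-up fallback with the same small interface, backed by a list."""

        def __init__(self, size=-1):
            # The size hint is ignored here; a list cannot preallocate characters.
            self._parts = []

        def append(self, text):
            self._parts.append(text)

        def build(self):
            return "".join(self._parts)


def join_with_builder(items, size_hint=100):
    """Hypothetical helper: concatenate items through a builder."""
    builder = StringBuilder(size_hint)
    for item in items:
        builder.append(item)
    return builder.build()
```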
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, Feb 12, 2013 at 10:03 PM, Maciej Fijalkowski fij...@gmail.com wrote:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before.

Can someone show the actual diff of this?

I'm preparing a talk about outdated patterns in Python for DjangoCon EU, prompted by this question and by the obsessive avoidance of string concatenation. But all the tests I've done show that ''.join() is still faster or as fast, except when you are joining very few strings, for example two, in which case concatenation is faster or as fast. This holds under both PyPy and CPython. So I'd like to know in which case ''.join() is faster on PyPy and += is faster on CPython.

Code with times:

    x = 10
    s1 = 'X' * x
    s2 = 'X' * x
    for i in xrange(500):
        s1 += s2

Python 3.3: 0.049 seconds
PyPy 1.9: 24.217 seconds

PyPy indeed is much, much slower than CPython here. But let's look at the join case:

    x = 10
    s1 = 'X' * x
    s2 = 'X' * x
    for i in xrange(500):
        s1 = ''.join((s1, s2))

Python 3.3: 18.969 seconds
PyPy 1.9: 62.539 seconds

Here PyPy needs twice the time, and CPython needs 387 times as long. Both are slower.

The best case is of course to make a long list of strings and join them:

    x = 10
    s1 = 'X' * x
    s2 = 'X' * x
    l = [s1]
    for i in xrange(500):
        l.append(s2)
    s1 = ''.join(l)

Python 3.3: 0.052 seconds
PyPy 1.9: 0.117 seconds

That's not always feasible though.

//Lennart
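The three patterns timed above can be put side by side with timeit. This is only a sketch for Python 3: the function names, the piece size and the repetition counts are assumptions chosen for the example, not the values behind the timings quoted in the message.

```python
import timeit


def concat_inplace(parts):
    """Repeated += on one string."""
    result = ''
    for part in parts:
        result += part
    return result


def concat_join_each_step(parts):
    """Calls ''.join() on a two-item tuple at every step."""
    result = ''
    for part in parts:
        result = ''.join((result, part))
    return result


def concat_join_once(parts):
    """Collects everything first and joins a single time."""
    return ''.join(parts)


if __name__ == '__main__':
    # Illustrative sizes only.
    parts = ['X' * 10] * 5000
    for func in (concat_inplace, concat_join_each_step, concat_join_once):
        seconds = timeit.timeit(lambda: func(parts), number=10)
        print('%-22s %.4f seconds' % (func.__name__, seconds))
```

Running the same script under different interpreters is the quickest way to see whether the ranking changes between them.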
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 02/12/2013 05:25 PM, Christian Tismer wrote:
> Ropes were implemented by Carl-Friedrich Bolz in 2007, as I remember. No idea what the impact was, if any at all. Would ropes be an answer (and a simple way to cope with string mutation patterns) as an alternative implementation, and therefore still justify the usage of that pattern?

I've always hated the "".join(array) idiom for fast string concatenation: it's ugly and it flies in the face of TOOWTDI. I think everyone should use x = a + b + c + d for string concatenation, and we should just make that fast.

In 2006 I proposed lazy string concatenation, a sort of rope that hid the details inside the string object. If a and b are strings, a+b returned a string object that internally held lazy references to a and b, and only computed its value if you asked for it. Here's the Unicode version: http://bugs.python.org/issue1629305

Why didn't it get accepted? I lumped in lazy slicing, a bad move as it was more controversial. That, and the possibility that macros like PyUnicode_AS_UNICODE could now fail, which would have meant checking 400+ call sites to ensure they handle the possibility of failure. This latter work has already happened with the new efficient Unicode representation patch.

I keep thinking it's time to revive the lazy string concatenation patch.

/larry/
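The lazy concatenation idea can be modelled in a few lines of Python. This is a toy sketch only: LazyConcat is a made-up class, and the real patch worked inside the C string object rather than as a wrapper.

```python
class LazyConcat:
    """Toy model: a + b stores references and renders the text on demand."""

    def __init__(self, *pieces):
        self._pieces = list(pieces)  # plain str objects or LazyConcat nodes
        self._value = None           # cached flat string once rendered

    def __add__(self, other):
        # No copying happens here; only two references are stored.
        return LazyConcat(self, other)

    def __str__(self):
        if self._value is None:
            out = []
            stack = [self]
            # Iterative walk, so long left-leaning chains do not hit the
            # recursion limit.
            while stack:
                node = stack.pop()
                if isinstance(node, LazyConcat):
                    if node._value is not None:
                        out.append(node._value)
                    else:
                        stack.extend(reversed(node._pieces))
                else:
                    out.append(node)
            self._value = ''.join(out)
            self._pieces = [self._value]  # drop the tree once flattened
        return self._value
```

Each + is O(1), and the full copy happens once, when the value is first requested.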
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 12/02/2013 21:03, Maciej Fijalkowski wrote:
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before.

That's... interesting. I fixed a performance bug in httplib some years ago by doing the exact opposite: += -> ''.join(). In that case, it changed downloading a file from 20 minutes to 3 seconds. That was likely on Python 2.5.

> How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?

+1 from me.

Chris

-- 
Simplistix - Content Management, Batch Processing Python Consulting - http://www.simplistix.co.uk
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, 13 Feb 2013 09:02:07 +0100, Victor Stinner victor.stin...@gmail.com wrote:
> I added a _PyUnicodeWriter internal API to optimize str % args and str.format(args). It uses an overallocated buffer, so it is basically like the CPython str += str optimization. I still don't know how efficient it is on Windows, since realloc() is slow on Windows (at least on old Windows versions).
>
> We should add an official, public API to concatenate strings.

There's io.StringIO already.

Regards

Antoine.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 12.02.13 23:03, Maciej Fijalkowski wrote:
> How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?

Sometimes the use of += for strings or bytes is appropriate. For example, I deliberately used += for bytes instead of b''.join() (note that there is not even such a hack for bytes) in the zipfile module, where in most cases one of the components is empty and the concatenation of nonempty components only happens once. b''.join() was noticeably slower here.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13/02/13 19:52, Larry Hastings wrote:
> I've always hated the "".join(array) idiom for fast string concatenation: it's ugly and it flies in the face of TOOWTDI. I think everyone should use x = a + b + c + d for string concatenation, and we should just make that fast.

"".join(array) is much nicer looking than:

    # ridiculous and impractical for more than a few items
    array[0] + array[1] + array[2] + ... + array[N]

or:

    # not an expression
    result = ""
    for s in array:
        result += s

or even:

    # currently prohibited, and not obvious
    sum(array, "")

although I will admit to a certain fondness towards

    # even less obvious than sum
    map(operator.add, array)

and join has been the obvious way to do repeated concatenation of many substrings since at least Python 1.5, when it was spelled string.join(array [, sep=" "]).

-- 
Steven
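A small runnable check of the first alternatives listed above, with sample data invented for the example. It shows that the explicit loop agrees with join and that sum() refuses a string start value.

```python
array = ['spam', 'ham', 'eggs']

# The join idiom.
joined = ''.join(array)

# The explicit loop: a statement sequence, not an expression.
result = ''
for s in array:
    result += s

# sum() with a string start value is rejected by the interpreter.
try:
    sum(array, '')
except TypeError as exc:
    sum_error = str(exc)
```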
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 09:52, Nick Coghlan wrote:
> On Wed, Feb 13, 2013 at 5:42 PM, Alexandre Vassalotti alexan...@peadrop.com wrote:
>> I don't think so. Ropes are really useful when you work with gigabytes of data, but unfortunately they don't make good general-purpose strings. Monolithic arrays are much more efficient and simple for the typical use-cases we have in Python.
>
> If I recall correctly, io.StringIO and io.BytesIO have been updated to use ropes internally in 3.3.

io.BytesIO has not been updated yet, but it will be in 3.4 (issue #15381).

On the other hand, there is a plan to rewrite StringIO to use a more efficient contiguous buffer implementation (issue #15612).
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2013-02-13, at 12:37, Steven D'Aprano wrote:
>     # even less obvious than sum
>     map(operator.add, array)

That one does not work: it will try to call the binary `add` with each single item of the array when the map iterator is consumed, and error out.

functools.reduce(operator.add, array, '') would work though; it's another way to spell `sum` without the string prohibition.
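The difference between the two spellings is easy to verify. The sample list below is invented for the example.

```python
import functools
import operator

array = ['spam', 'ham', 'eggs']

# map() hands operator.add one argument per item, so consuming the
# iterator raises TypeError (add needs two operands).
try:
    list(map(operator.add, array))
except TypeError:
    map_failed = True
else:
    map_failed = False

# reduce() folds pairwise, starting from the empty string.
reduced = functools.reduce(operator.add, array, '')
```

Note that reduce() builds every intermediate string, so it has the same copying cost as a += loop without any in-place optimization.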
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13/02/13 20:09, Chris Withers wrote:
> On 12/02/2013 21:03, Maciej Fijalkowski wrote:
>> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before.
>
> That's... interesting. I fixed a performance bug in httplib some years ago by doing the exact opposite: += -> ''.join(). In that case, it changed downloading a file from 20 minutes to 3 seconds. That was likely on Python 2.5.

I remember it well.

http://mail.python.org/pipermail/python-dev/2009-August/091125.html

I frequently link to this thread as an example of just how bad repeated string concatenation can be, how painful it can be to debug, and how even when the optimization is fast on one system, it may fail and be slow on another system.

-- 
Steven
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13/02/13 22:46, Xavier Morel wrote:
> On 2013-02-13, at 12:37, Steven D'Aprano wrote:
>>     # even less obvious than sum
>>     map(operator.add, array)
>
> That one does not work: it will try to call the binary `add` with each single item of the array when the map iterator is consumed, and error out.
>
> functools.reduce(operator.add, array, '') would work though; it's another way to spell `sum` without the string prohibition.

Oops, you are right of course, I was thinking reduce but it came out map. Thanks for the correction.

-- 
Steven
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 10:52, Larry Hastings wrote:
> I've always hated the "".join(array) idiom for fast string concatenation: it's ugly and it flies in the face of TOOWTDI. I think everyone should use x = a + b + c + d for string concatenation, and we should just make that fast.

I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 7:10 AM, Serhiy Storchaka storch...@gmail.com wrote:
> On 13.02.13 10:52, Larry Hastings wrote:
>> I've always hated the "".join(array) idiom for fast string concatenation: it's ugly and it flies in the face of TOOWTDI. I think everyone should use x = a + b + c + d for string concatenation, and we should just make that fast.
>
> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.

Fixed: x = ('%s' * len(abcd)) % abcd
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.

This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)

//Lennart
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 14:17, Daniel Holth wrote:
> On Wed, Feb 13, 2013 at 7:10 AM, Serhiy Storchaka storch...@gmail.com wrote:
>> On 13.02.13 10:52, Larry Hastings wrote:
>>> I've always hated the "".join(array) idiom for fast string concatenation: it's ugly and it flies in the face of TOOWTDI. I think everyone should use x = a + b + c + d for string concatenation, and we should just make that fast.
>>
>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>
> Fixed: x = ('%s' * len(abcd)) % abcd

Which in the new formatting style becomes

    x = ('{}' * len(abcd)).format(*abcd)

hmm, hmm, not so nice

-- 
Christian Tismer :^) mailto:tis...@stackless.com
Software Consulting : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 : *Starship* http://starship.python.net/
14482 Potsdam : PGP key - http://pgp.uni-mainz.de
phone +49 173 24 18 776 fax +49 (30) 700143-0023
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
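Both generated-format spellings can be checked against a plain join. The tuple abcd below holds sample values assumed for the example.

```python
abcd = ('a', 'b', 'c', 'd')

# Build a format string with one placeholder per item, then fill it in.
old_style = ('%s' * len(abcd)) % abcd
new_style = ('{}' * len(abcd)).format(*abcd)

# The same result without constructing a format string first.
joined = ''.join(abcd)
```

The % version needs abcd to be a tuple; with a list, the whole list would be treated as a single argument and the call would fail.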
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13/02/2013 11:53, Steven D'Aprano wrote:
>> I fixed a performance bug in httplib some years ago by doing the exact opposite: += -> ''.join(). In that case, it changed downloading a file from 20 minutes to 3 seconds. That was likely on Python 2.5.
>
> I remember it well.
>
> http://mail.python.org/pipermail/python-dev/2009-August/091125.html
>
> I frequently link to this thread as an example of just how bad repeated string concatenation can be, how painful it can be to debug, and how even when the optimization is fast on one system, it may fail and be slow on another system.

Amusingly, http://mail.python.org/pipermail/python-dev/2009-August/thread.html#91125 doesn't even list the email where I found the problem...

Chris

-- 
Simplistix - Content Management, Batch Processing Python Consulting - http://www.simplistix.co.uk
Re: [Python-Dev] Usage of += on strings in loops in stdlib
2013/2/13 Lennart Regebro rege...@gmail.com
> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>
> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)

Did you really try it? PyPy is really fast with str.__mod__ when the format string is a constant. Yes, it's jitted.

-- 
Amaury Forgeot d'Arc
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 15:27, Amaury Forgeot d'Arc wrote:
> 2013/2/13 Lennart Regebro rege...@gmail.com
>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>>
>> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)
>
> Did you really try it? PyPy is really fast with str.__mod__ when the format string is a constant. Yes, it's jitted.

How about the .format() style: is that jitted as well? In order to get people to prefer .format over __mod__, it would be nice if PyPy made this actually _faster_ :-)

-- 
Christian Tismer
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 15:23, Lennart Regebro wrote:
> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>
> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)

Only slightly.

    $ ./python -m timeit -s "spam = 'spam'; ham = 'ham'" "spam + ' = ' + ham + '\n'"
    100 loops, best of 3: 0.501 usec per loop
    $ ./python -m timeit -s "spam = 'spam'; ham = 'ham'" "''.join([spam, ' = ', ham, '\n'])"
    100 loops, best of 3: 0.504 usec per loop
    $ ./python -m timeit -s "spam = 'spam'; ham = 'ham'" "'%s = %s\n' % (spam, ham)"
    100 loops, best of 3: 0.524 usec per loop

But the last variant looks better to me.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
2013/2/13 Christian Tismer tis...@stackless.com
> On 13.02.13 15:27, Amaury Forgeot d'Arc wrote:
>> 2013/2/13 Lennart Regebro rege...@gmail.com
>>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>>>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>>>
>>> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)
>>
>> Did you really try it? PyPy is really fast with str.__mod__ when the format string is a constant. Yes, it's jitted.
>
> How about the .format() style: is that jitted as well? In order to get people to prefer .format over __mod__, it would be nice if PyPy made this actually _faster_ :-)

.format() is jitted as well. But it's still slower than str.__mod__ (by about 25%). I suppose it can be further optimized.

-- 
Amaury Forgeot d'Arc
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc amaur...@gmail.com wrote:
> 2013/2/13 Lennart Regebro rege...@gmail.com
>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>>
>> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)
>
> Did you really try it?

Yes.

> PyPy is really fast with str.__mod__ when the format string is a constant. Yes, it's jitted.

Simple concatenation: s1 = s1 + s2
PyPy-1.9 time for 100 concats of 1 length strings = 7.133
CPython time for 100 concats of 1 length strings = 0.005

Making a list of strings and joining after the loop: s1 = ''.join(l)
PyPy-1.9 time for 100 concats of 1 length strings = 0.005
CPython time for 100 concats of 1 length strings = 0.003

Old formatting: s1 = '%s%s' % (s1, s2)
PyPy-1.9 time for 100 concats of 1 length strings = 20.924
CPython time for 100 concats of 1 length strings = 3.787

New formatting: s1 = '{0}{1}'.format(s1, s2)
PyPy-1.9 time for 100 concats of 1 length strings = 13.249
CPython time for 100 concats of 1 length strings = 3.751

I have, by the way, yet to find a use case where the fastest method in CPython is not also the fastest in PyPy.

//Lennart
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc amaur...@gmail.com wrote:
> Yes, it's jitted.

Admittedly, I have no idea in which cases the JIT kicks in, or what I should do to make that happen, so that I have the best possible real-life test cases.

//Lennart
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 15:17, Daniel Holth wrote:
> On Wed, Feb 13, 2013 at 7:10 AM, Serhiy Storchaka storch...@gmail.com wrote:
>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>
> Fixed: x = ('%s' * len(abcd)) % abcd

No, you don't need this for a constant number of strings. Because almost certainly some of the strings will be literals, you can write this in a nicer way. Compare:

    'config[' + key + '] = ' + value + '\n'
    ''.join(['config[', key, '] = ', value, '\n'])
    'config[%s] = %s\n' % (key, value)
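The three spellings compared above produce the same string, so the choice between them is about readability and speed only. The values of key and value below are invented for the example.

```python
key, value = 'debug', 'on'

by_plus = 'config[' + key + '] = ' + value + '\n'
by_join = ''.join(['config[', key, '] = ', value, '\n'])
by_percent = 'config[%s] = %s\n' % (key, value)
```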
Re: [Python-Dev] Usage of += on strings in loops in stdlib
2013/2/13 Lennart Regebro rege...@gmail.com
> On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc amaur...@gmail.com wrote:
>> Yes, it's jitted.
>
> Admittedly, I have no idea in which cases the JIT kicks in, and what I should do to make that happen to make sure I have the best possible real-life test cases.

The PyPy JIT kicks in only after 1000 iterations. I usually use timeit. It's funny to see how the 1000 loops line is 5 times faster than the 100 loops:

    $ ./pypy-c -m timeit -v -s "a,b,c,d='1234'" "'{}{}{}{}'.format(a,b,c,d)"
    10 loops -> 2.19e-05 secs
    100 loops -> 0.000122 secs
    1000 loops -> 0.00601 secs
    1 loops -> 0.000363 secs
    10 loops -> 0.00528 secs
    100 loops -> 0.0533 secs
    1000 loops -> 0.528 secs
    raw times: 0.521 0.52 0.51
    1000 loops, best of 3: 0.051 usec per loop

-- 
Amaury Forgeot d'Arc
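To look at warm-up effects from inside a script, one option is to time the same statement with increasing loop counts and compare the cost per loop. This is a sketch only: the loop counts are assumptions chosen for the example, and on an interpreter without a JIT the per-loop figures should stay roughly flat.

```python
import timeit

timer = timeit.Timer(
    "'{}{}{}{}'.format(a, b, c, d)",
    setup="a, b, c, d = '1234'",
)

# Seconds per loop for each loop count; a JIT should show a drop once
# the statement has run often enough to be compiled.
per_loop = {}
for loops in (10, 100, 1000, 10000):
    total = timer.timeit(number=loops)
    per_loop[loops] = total / loops

for loops in sorted(per_loop):
    print('%6d loops: %.3e seconds per loop' % (loops, per_loop[loops]))
```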
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2013-02-13 13:23, Lennart Regebro wrote:
> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>
> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)

How about adding a class method for catenation:

    str.cat(a, b, c, d)
    str.cat([a, b, c, d])  # Equivalent to "".join([a, b, c, d])

Each argument could be a string or a list of strings.
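The proposed semantics can be tried out as a plain function. str.cat does not exist; cat() below is a hypothetical stand-in written for this example, assuming each argument is either a string or a list of strings as described.

```python
def cat(*args):
    """Hypothetical stand-in for the proposed str.cat class method."""
    pieces = []
    for arg in args:
        if isinstance(arg, str):
            pieces.append(arg)
        else:
            # Assumed to be a list (or other iterable) of strings.
            pieces.extend(arg)
    return ''.join(pieces)
```

The isinstance check is needed because a string is itself iterable; without it the two argument forms could not be told apart reliably.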
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 7:33 PM, MRAB pyt...@mrabarnett.plus.com wrote:
> On 2013-02-13 13:23, Lennart Regebro wrote:
>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>>> I prefer x = '%s%s%s%s' % (a, b, c, d) when there are more than 3 strings and some of them are literal strings.
>>
>> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)
>
> How about adding a class method for catenation:
>
>     str.cat(a, b, c, d)
>     str.cat([a, b, c, d])  # Equivalent to "".join([a, b, c, d])
>
> Each argument could be a string or a list of strings.

I actually wonder. There seems to be a consensus to avoid += (to some extent). Can someone commit the change to urllib then? I'm talking about reverting http://bugs.python.org/issue1285086 specifically.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 1:06 PM, Maciej Fijalkowski fij...@gmail.com wrote:
> I actually wonder. There seems to be a consensus to avoid += (to some extent). Can someone commit the change to urllib then? I'm talking about reverting http://bugs.python.org/issue1285086 specifically.

Please re-open the bug with a comment as to why and I'm sure someone will get to it.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 8:24 PM, Brett Cannon br...@python.org wrote:
> On Wed, Feb 13, 2013 at 1:06 PM, Maciej Fijalkowski fij...@gmail.com wrote:
>> I actually wonder. There seems to be a consensus to avoid += (to some extent). Can someone commit the change to urllib then? I'm talking about reverting http://bugs.python.org/issue1285086 specifically.
>
> Please re-open the bug with a comment as to why and I'm sure someone will get to it.

I can't re-open the bug, my account is kind of lame (and seriously, why do you guys *do* have multiple layers of bug tracker accounts?)

Cheers,
fijal
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 1:27 PM, Maciej Fijalkowski fij...@gmail.com wrote:
> On Wed, Feb 13, 2013 at 8:24 PM, Brett Cannon br...@python.org wrote:
>> Please re-open the bug with a comment as to why and I'm sure someone will get to it.
>
> I can't re-open the bug, my account is kind of lame

Then leave a comment and I will re-open it.

> (and seriously, why do you guys *do* have multiple layers of bug tracker accounts?)

You obviously have not had users argue with your decision by constantly flipping a bug back open. =)
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 19:06, Maciej Fijalkowski wrote:
> On Wed, Feb 13, 2013 at 7:33 PM, MRAB pyt...@mrabarnett.plus.com wrote:
>> On 2013-02-13 13:23, Lennart Regebro wrote:
>>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka storch...@gmail.com wrote:
>>>> I prefer x = '%s%s%s%s' % (a, b, c, d) when the number of strings is more than 3 and some of them are literal strings.
>>> This has the benefit of being slow both on CPython and PyPy. Although using .format() is even slower. :-)
>> How about adding a class method for catenation:
>>     str.cat(a, b, c, d)
>>     str.cat([a, b, c, d])  # Equivalent to .join([a, b, c, d])
>> Each argument could be a string or a list of strings.
> I actually wonder. There seems to be the consensus to avoid += (to some extent). Can someone commit the change to urllib then? I'm talking about reverting http://bugs.python.org/issue1285086 specifically

So _is_ += faster in certain library funcs than ''.join()?

If that's the case, the behavior of string concat could be something that might be added to some implementation info, if speed really matters. The library function then could take this info and use the appropriate code path to always be fast, during module initialisation. This is also quite explicit, since it tells the reader not to use in-place add when it is not optimized.

If += is anyway a bit slower than other ways, forget it. I would then maybe add a comment somewhere that says avoiding '+=' because it is not reliable or something.

cheers - chris

--
Christian Tismer             :^)   mailto:tis...@stackless.com
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key - http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13.02.13 20:40, Christian Tismer wrote:
> If += is anyway a bit slower than other ways, forget it. I would then maybe add a comment somewhere that says avoiding '+=' because it is not reliable or something.

+= is the fastest way (in any implementation) if you concatenate only two strings.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 7:06 PM, Maciej Fijalkowski fij...@gmail.com wrote:
> I actually wonder. There seems to be the consensus to avoid += (to some extent). Can someone commit the change to urllib then? I'm talking about reverting http://bugs.python.org/issue1285086 specifically

That's unquoting of URLs, strings that aren't particularly long, normally. And it's not in any tight loops. I'm astonished that any change makes any noticeable speed difference here at all.

//Lennart
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 4:02 PM, Amaury Forgeot d'Arc amaur...@gmail.com wrote:
> 2013/2/13 Lennart Regebro rege...@gmail.com
>> On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc amaur...@gmail.com wrote:
>> Yes, it's jitted. Admittedly, I have no idea in which cases the JIT kicks in, and what I should do to make that happen to make sure I have the best possible real-life test cases.
> PyPy JIT kicks in only after 1000 iterations.

Actually, my test code mixed iterations and string length up when printing the results, so the tests I showed were not 100 iterations with 10,000-character strings, but 10,000 iterations with 100-character strings.

No matter what the iteration count or string length is, .format() is the slowest or second slowest of all string concatenation methods I've tried, and '%s%s' % is just marginally faster. This holds both on PyPy and CPython and irrespective of string length.

I'll stick my neck out and say that using formatting for concatenation is probably an anti-pattern.

//Lennart
Re: [Python-Dev] Usage of += on strings in loops in stdlib
Hi,

I wrote a quick hack to expose _PyUnicodeWriter as _string.UnicodeWriter:
http://www.haypocalc.com/tmp/string_unicode_writer.patch

And I wrote a (micro-)benchmark:
http://www.haypocalc.com/tmp/bench_join.py
(The benchmark uses only ASCII strings; it would be interesting to test latin1, BMP and non-BMP characters too.)

UnicodeWriter (using the "writer += str" API) is the fastest method in most cases, except for data = ['a'*10**4] * 10**2 (in this case, it's 8x slower!). I guess that the overhead comes from the overallocation, which then requires shrinking the buffer (shrinking may copy the whole string). The overallocation factor may be adapted depending on the size.

If computing the final length is cheap (e.g. if it's always the same), it's always faster to use UnicodeWriter with a preallocated buffer. The "UnicodeWriter +=; preallocate" test uses a precomputed length (ok, it's cheating!).

I also implemented a UnicodeWriter.append method to measure the overhead of a method lookup: it's expensive :-)

--
Platform: Linux-3.6.10-2.fc16.x86_64-x86_64-with-fedora-16-Verne
Python unicode implementation: PEP 393
Date: 2013-02-14 01:00:06
CFLAGS: -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
SCM: hg revision=659ef9d360ae+ tag=tip branch=default date=2013-02-13 15:25 +
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python version: 3.4.0a0 (default:659ef9d360ae+, Feb 14 2013, 00:35:19) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)]
Bits: int=32, long=64, long long=64, pointer=64

[ data = ['a'] * 10**2 ]
4.21 us: UnicodeWriter +=; preallocate
4.86 us (+15%): UnicodeWriter append; lookup attr once
4.99 us (+18%): UnicodeWriter +=
6.35 us (+51%): str += str
6.45 us (+53%): io.StringIO; lookup attr once
7.02 us (+67%): .join(list)
7.46 us (+77%): UnicodeWriter append
8.77 us (+108%): io.StringIO

[ data = ['abc'] * 10**4 ]
356 us: UnicodeWriter append; lookup attr once
375 us (+5%): UnicodeWriter +=; preallocate
376 us (+6%): UnicodeWriter +=
495 us (+39%): io.StringIO; lookup attr once
614 us (+73%): .join(list)
629 us (+77%): UnicodeWriter append
716 us (+101%): str += str
737 us (+107%): io.StringIO

[ data = ['a'*10**4] * 10**1 ]
3.67 us: str += str
3.76 us: UnicodeWriter +=; preallocate
3.95 us (+8%): UnicodeWriter +=
4.01 us (+9%): UnicodeWriter append; lookup attr once
4.06 us (+11%): .join(list)
4.24 us (+15%): UnicodeWriter append
4.59 us (+25%): io.StringIO; lookup attr once
4.77 us (+30%): io.StringIO

[ data = ['a'*10**4] * 10**2 ]
41.2 us: UnicodeWriter +=; preallocate
43.8 us (+6%): str += str
45.4 us (+10%): .join(list)
45.9 us (+11%): io.StringIO; lookup attr once
48.3 us (+17%): io.StringIO
370 us (+797%): UnicodeWriter +=
370 us (+798%): UnicodeWriter append; lookup attr once
377 us (+816%): UnicodeWriter append

[ data = ['a'*10**4] * 10**4 ]
38.9 ms: UnicodeWriter +=; preallocate
39 ms: .join(list)
39.1 ms: io.StringIO; lookup attr once
39.4 ms: UnicodeWriter append; lookup attr once
39.5 ms: io.StringIO
39.6 ms: UnicodeWriter +=
40.1 ms: str += str
40.1 ms: UnicodeWriter append

Victor

2013/2/13 Antoine Pitrou solip...@pitrou.net:
> On Wed, 13 Feb 2013 09:02:07 +0100, Victor Stinner victor.stin...@gmail.com wrote:
>> I added a _PyUnicodeWriter internal API to optimize str%args and str.format(args). It uses a buffer which is overallocated, so it's basically like CPython str += str optimization. I still don't know how efficient it is on Windows, since realloc() is slow on Windows (at least on old Windows versions). We should add an official and public API to concatenate strings.
>
> There's io.StringIO already.
>
> Regards
>
> Antoine.
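For readers who want to try a comparison of this kind themselves, here is a minimal sketch using only the stdlib. It is not the bench_join.py script linked above and does not cover UnicodeWriter (which needs the patch); the function names and the default of 200 runs are arbitrary choices for illustration.

```python
import io
import timeit


def concat_plus(data):
    s = ''
    for item in data:
        s += item
    return s


def concat_join(data):
    return ''.join(data)


def concat_stringio(data):
    buf = io.StringIO()
    write = buf.write  # look the attribute up once
    for item in data:
        write(item)
    return buf.getvalue()


def bench(data, number=200):
    """Return {function name: seconds per call} for each method."""
    methods = (concat_plus, concat_join, concat_stringio)
    return {
        func.__name__: timeit.timeit(lambda: func(data), number=number) / number
        for func in methods
    }


if __name__ == '__main__':
    results = bench(['abc'] * 10**4)
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print('%8.1f us: %s' % (seconds * 1e6, name))
```

Timings from such a loop depend heavily on the interpreter, its version and the platform, which is the point of the thread, so treat any single run as indicative only.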
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 14/02/13 01:18, Chris Withers wrote:
> On 13/02/2013 11:53, Steven D'Aprano wrote:
>>> I fixed a performance bug in httplib some years ago by doing the exact opposite; += -> ''.join(). In that case, it changed downloading a file from 20 minutes to 3 seconds. That was likely on Python 2.5.
>>
>> I remember it well.
>>
>> http://mail.python.org/pipermail/python-dev/2009-August/091125.html
>>
>> I frequently link to this thread as an example of just how bad repeated string concatenation can be, how painful it can be to debug, and how even when the optimization is fast on one system, it may fail and be slow on another system.
>
> Amusing is that http://mail.python.org/pipermail/python-dev/2009-August/thread.html#91125 doesn't even list the email where I found the problem...

That's because it wasn't solved until the following month.

http://mail.python.org/pipermail/python-dev/2009-September/thread.html#91581

--
Steven
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Thu, 14 Feb 2013 01:21:40 +0100 Victor Stinner victor.stin...@gmail.com wrote:
> UnicodeWriter (using the "writer += str" API) is the fastest method in most cases, except for data = ['a'*10**4] * 10**2 (in this case, it's 8x slower!). I guess that the overhead comes from the overallocation, which then requires shrinking the buffer (shrinking may copy the whole string). The overallocation factor may be adapted depending on the size.

How about testing on Windows?

> If computing the final length is cheap (e.g. if it's always the same), it's always faster to use UnicodeWriter with a preallocated buffer.

That's not a particularly surprising discovery, is it? ;-)

Regards

Antoine.
[Python-Dev] Usage of += on strings in loops in stdlib
Hi

We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before.

Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too.

How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?

What about other performance improvements in stdlib that are problematic for pypy or others?

Personally I would like cleaner code in stdlib vs speeding up CPython. Typically that also helps pypy so I'm not unbiased.

Cheers,
fijal
Re: [Python-Dev] Usage of += on strings in loops in stdlib
Hi !

On Tue, 12 Feb 2013 23:03:04 +0200 Maciej Fijalkowski fij...@gmail.com wrote:
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?

I agree that += should not be used as an optimization (on strings) in the stdlib code. The optimization is there so that uncareful code does not degenerate, but deliberately relying on it is a bit devilish. (optimisare diabolicum :-))

Regards

Antoine.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, Feb 12, 2013 at 4:06 PM, Antoine Pitrou solip...@pitrou.net wrote:
> Hi !
>
> On Tue, 12 Feb 2013 23:03:04 +0200 Maciej Fijalkowski fij...@gmail.com wrote:
>> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?
>
> I agree that += should not be used as an optimization (on strings) in the stdlib code. The optimization is there so that uncareful code does not degenerate, but deliberately relying on it is a bit devilish. (optimisare diabolicum :-))

Ditto from me. If you're going so far as to want to optimize Python code then you probably are going to care enough to accelerate it in C, in which case you can leave the Python code idiomatic.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, Feb 12, 2013 at 11:16 PM, Brett Cannon br...@python.org wrote:
> On Tue, Feb 12, 2013 at 4:06 PM, Antoine Pitrou solip...@pitrou.net wrote:
>> Hi !
>>
>> On Tue, 12 Feb 2013 23:03:04 +0200 Maciej Fijalkowski fij...@gmail.com wrote:
>>> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?
>>
>> I agree that += should not be used as an optimization (on strings) in the stdlib code. The optimization is there so that uncareful code does not degenerate, but deliberately relying on it is a bit devilish. (optimisare diabolicum :-))
>
> Ditto from me. If you're going so far as to want to optimize Python code then you probably are going to care enough to accelerate it in C, in which case you can leave the Python code idiomatic.

I should actually reference the original CPython issue http://bugs.python.org/issue1285086
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, Feb 12, 2013 at 1:03 PM, Maciej Fijalkowski fij...@gmail.com wrote:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too.

Just to confirm: Jython does not have optimizations for += on String and will do much better with the idiomatic .join().

-Frank
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, 12 Feb 2013 13:32:50 -0800 fwierzbi...@gmail.com fwierzbi...@gmail.com wrote:
> On Tue, Feb 12, 2013 at 1:03 PM, Maciej Fijalkowski fij...@gmail.com wrote:
>> Hi
>>
>> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too.
>
> Just to confirm: Jython does not have optimizations for += on String and will do much better with the idiomatic .join().

For the record, io.StringIO should be quite fast in 3.3. (except for the method call overhead that Guido is complaining about :-))

Regards

Antoine.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2/12/2013 4:16 PM, Brett Cannon wrote:
> On Tue, Feb 12, 2013 at 4:06 PM, Antoine Pitrou solip...@pitrou.net wrote:
>> Hi !
>>
>> On Tue, 12 Feb 2013 23:03:04 +0200 Maciej Fijalkowski fij...@gmail.com wrote:
>>> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?
>>
>> I agree that += should not be used as an optimization (on strings) in the stdlib code. The optimization is there so that uncareful code does not degenerate, but deliberately relying on it is a bit devilish. (optimisare diabolicum :-))
>
> Ditto from me. If you're going so far as to want to optimize Python code then you probably are going to care enough to accelerate it in C, in which case you can leave the Python code idiomatic.

But the only reason .join() is a Python idiom in the first place is because it was the fast way to do what everyone initially coded as s +=. Just because we all learned a long time ago that joining was the fast way to build a string doesn't mean that .join() is the clean idiomatic way to do it.

--Ned.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, 12 Feb 2013 16:40:38 -0500, Ned Batchelder n...@nedbatchelder.com wrote:
> On 2/12/2013 4:16 PM, Brett Cannon wrote:
>> On Tue, Feb 12, 2013 at 4:06 PM, Antoine Pitrou solip...@pitrou.net wrote:
>>> On Tue, 12 Feb 2013 23:03:04 +0200 Maciej Fijalkowski fij...@gmail.com wrote:
>>>> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)?
>>>
>>> I agree that += should not be used as an optimization (on strings) in the stdlib code. The optimization is there so that uncareful code does not degenerate, but deliberately relying on it is a bit devilish. (optimisare diabolicum :-))
>>
>> Ditto from me. If you're going so far as to want to optimize Python code then you probably are going to care enough to accelerate it in C, in which case you can leave the Python code idiomatic.
>
> But the only reason .join() is a Python idiom in the first place is because it was the fast way to do what everyone initially coded as s +=. Just because we all learned a long time ago that joining was the fast way to build a string doesn't mean that .join() is the clean idiomatic way to do it.

If 'idiomatic' (a terrible term) means the standard way in this language, which is how it is employed in the programming community, then yes, .join() is the idiomatic way to write that *in Python*, and thus is cleaner code *in Python*.

--David
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, 12 Feb 2013 16:40:38 -0500 Ned Batchelder n...@nedbatchelder.com wrote:
> But the only reason .join() is a Python idiom in the first place is because it was the fast way to do what everyone initially coded as s +=. Just because we all learned a long time ago that joining was the fast way to build a string doesn't mean that .join() is the clean idiomatic way to do it.

It's idiomatic because strings are immutable (by design, not because of an optimization detail) and therefore concatenation *has* to imply building a new string from scratch.

Regards

Antoine.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2013-02-12, at 22:40, Ned Batchelder wrote:
> But the only reason .join() is a Python idiom in the first place is because it was the fast way to do what everyone initially coded as s +=. Just because we all learned a long time ago that joining was the fast way to build a string doesn't mean that .join() is the clean idiomatic way to do it.

Well no, str.join is the idiomatic way to do it because it is:

    idiomatic |ˌidēəˈmatik|
    adjective
    1 using, containing, or denoting expressions that are natural to a native speaker

or would you argue that the natural way for weathered python developers to concatenate strings is to *not* use str.join?

Of course usually idioms have original reasons for being, reasons which are sometimes long gone (not unlike religious mandates or prohibitions).

For Python, ignoring the refcounting hack (which is not only cpython specific but *current* cpython specific *and* doesn't apply to all cases), that reason still exists: python's strings are formally immutable bytestrings, and repeated concatenation of immutable bytestrings is quadratic.

Thus str.join is idiomatic, and although it's possible (if difficult) to change the idiom, straight string concatenation would make a terrible new idiom as it will behave either unreliably (current CPython) or simply terribly (every other Python implementation).

No?
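To make the "quadratic" claim concrete, the sketch below just counts characters copied under a simple cost model, assuming every s = s + piece builds a fresh string (that is, no refcount trick). Both function names are made up for the illustration; nothing here measures a real interpreter.

```python
def chars_copied_by_repeated_concat(pieces):
    """Characters written if every concatenation builds a new string."""
    copied = 0
    length = 0
    for piece in pieces:
        length += len(piece)
        copied += length  # the whole new string is written out again
    return copied


def chars_copied_by_join(pieces):
    """A join sizes the result once and copies each piece once."""
    return sum(len(piece) for piece in pieces)
```

For 1000 pieces of 10 characters the model gives 5,005,000 characters copied for repeated concatenation against 10,000 for a join, and the gap grows with the square of the number of pieces.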
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2013-02-12 21:44, Antoine Pitrou wrote:
> On Tue, 12 Feb 2013 16:40:38 -0500 Ned Batchelder n...@nedbatchelder.com wrote:
>> But the only reason .join() is a Python idiom in the first place is because it was the fast way to do what everyone initially coded as s +=. Just because we all learned a long time ago that joining was the fast way to build a string doesn't mean that .join() is the clean idiomatic way to do it.
>
> It's idiomatic because strings are immutable (by design, not because of an optimization detail) and therefore concatenation *has* to imply building a new string from scratch.

Tuples are much like immutable lists; sets were added, and then frozensets; should we be adding mutable strings too (a bit like C#'s StringBuilder)? (Just wondering...)
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 12.02.2013 22:32, Antoine Pitrou wrote:
> For the record, io.StringIO should be quite fast in 3.3. (except for the method call overhead that Guido is complaining about :-))

AFAIK it's not the actual *call* of the method that is slow, but rather attribute lookup and creation of bound method objects. If speed is of the essence, code can cache the method object locally:

    strio = io.StringIO()
    write = strio.write
    for element in elements:
        write(element)
    result = strio.getvalue()

Christian
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 1:28 AM, Christian Heimes christ...@python.org wrote:
> On 12.02.2013 22:32, Antoine Pitrou wrote:
>> For the record, io.StringIO should be quite fast in 3.3. (except for the method call overhead that Guido is complaining about :-))
>
> AFAIK it's not the actual *call* of the method that is slow, but rather attribute lookup and creation of bound method objects. If speed is of the essence, code can cache the method object locally:
>
>     strio = io.StringIO()
>     write = strio.write
>     for element in elements:
>         write(element)
>     result = strio.getvalue()

And this is a great example of muddying code in stdlib for the sake of speeding up CPython.

Cheers,
fijal
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 1:20 AM, MRAB pyt...@mrabarnett.plus.com wrote:
> On 2013-02-12 21:44, Antoine Pitrou wrote:
>> On Tue, 12 Feb 2013 16:40:38 -0500 Ned Batchelder n...@nedbatchelder.com wrote:
>>> But the only reason .join() is a Python idiom in the first place is because it was the fast way to do what everyone initially coded as s +=. Just because we all learned a long time ago that joining was the fast way to build a string doesn't mean that .join() is the clean idiomatic way to do it.
>>
>> It's idiomatic because strings are immutable (by design, not because of an optimization detail) and therefore concatenation *has* to imply building a new string from scratch.
>
> Tuples are much like immutable lists; sets were added, and then frozensets; should we be adding mutable strings too (a bit like C#'s StringBuilder)? (Just wondering...)

Isn't bytearray what you're looking for?
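A small sketch of what the bytearray suggestion looks like in practice. Note that bytearray holds bytes, not text, so on Python 3 the pieces have to be encoded first and the result decoded afterwards; the helper name build_bytes and the ASCII sample data are illustrative only.

```python
def build_bytes(chunks):
    buf = bytearray()
    for chunk in chunks:
        buf += chunk  # extends the same object in place
    return bytes(buf)


# Text has to round-trip through an encoding.
text = build_bytes([b'spam', b' ', b'eggs']).decode('ascii')
```

Because bytearray is mutable by design, the in-place += here is part of the type's contract rather than an implementation detail, unlike += on str.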
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 13 Feb 2013 07:08, Maciej Fijalkowski fij...@gmail.com wrote:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)? What about other performance improvements in stdlib that are problematic for pypy or others? Personally I would like cleaner code in stdlib vs speeding up CPython.

For the specific case of "Don't rely on the fragile refcounting hack in CPython's string concatenation" I strongly agree. However, as a general principle, I can't agree until speed.python.org is a going concern and we can get a reasonable overview of any resulting performance implications.

Regards,
Nick.

> Typically that also helps pypy so I'm not unbiased.
>
> Cheers,
> fijal
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, Feb 12, 2013 at 1:44 PM, Antoine Pitrou solip...@pitrou.net wrote:
> It's idiomatic because strings are immutable (by design, not because of an optimization detail) and therefore concatenation *has* to imply building a new string from scratch.

Not necessarily. It is totally possible to implement strings such that they are immutable and concatenation takes O(1): ropes are the canonical example of this.
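For readers unfamiliar with ropes, here is a deliberately tiny sketch of the idea: concatenation allocates one node and copies no characters, and the flat string is only built when asked for. The class is hypothetical and unbalanced, and it is not how any Python implementation stores str.

```python
class Rope:
    """Rope node: a leaf string, or the concatenation of two ropes.

    Nodes are never modified after construction, so sharing is safe.
    """
    __slots__ = ('leaf', 'left', 'right', 'length')

    def __init__(self, leaf='', left=None, right=None):
        self.leaf = leaf
        self.left = left
        self.right = right
        if left is None:
            self.length = len(leaf)
        else:
            self.length = left.length + right.length

    def __add__(self, other):
        # O(1): one new node, no characters copied.
        return Rope(left=self, right=other)

    def __len__(self):
        return self.length

    def __str__(self):
        # Flatten once, iteratively, so deep ropes do not hit the
        # recursion limit.
        parts = []
        stack = [self]
        while stack:
            node = stack.pop()
            if node.left is None:
                parts.append(node.leaf)
            else:
                stack.append(node.right)
                stack.append(node.left)
        return ''.join(parts)
```

The trade-off is that indexing and slicing stop being simple array operations, which is one reason a rope is not an obvious drop-in replacement for a flat string.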
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 12.02.13 22:03, Maciej Fijalkowski wrote:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)? What about other performance improvements in stdlib that are problematic for pypy or others? Personally I would like cleaner code in stdlib vs speeding up CPython. Typically that also helps pypy so I'm not unbiased.
>
> Cheers,
> fijal

Howdy.

Funny coincidence that this issue came up an hour after I asked about string_concat optimization absence on the pypy channel. I did not read email while writing the efficient string concatenation re-iteration.

Maybe we should use the time machine, go backwards and undo the patch, although it still makes a lot of sense and is fastest, opcode-wise, at least on CPython. Which will not matter so much for PyPy of course because _that_ goes away.

Alas, the damage to the mindsets already has happened, and the cure will probably be as hard as the eviction of the print statement, after all.

But since I'm a complete Python 3.3 convert (with consequent changes to my projects, which were not so trivial), I think I will also start publicly saying that s += t is a pattern that should not be used in the Gigabyte domain, from 2013.

Actually a tad, because it contradicted normal programming patterns in an appealing way. Way too sexy...

But let's toss it. Keep the past eight years in good memories as an exceptional period of liberal abuse.

Maybe we should add it as an addition to the Zen of Python: There are obviously good things, but obvious is the finest liar.

--
Christian Tismer             :^)   mailto:tis...@stackless.com
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key - http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2/12/2013 6:20 PM, MRAB wrote:

Tuples are much like immutable lists; sets were added, and then frozensets; should we be adding mutable strings too (a bit like C#'s StringBuilder)? (Just wondering...)

StringIO is effectively a mutable string with a file interface. sio.write('abc') is the equivalent of lis.extend(['a', 'b', 'c']).

-- 
Terry Jan Reedy
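A small sketch of that comparison, assuming Python 3 and the standard-library io.StringIO. The names sio and lis follow the message above; the second write and the final comparison are illustrative additions.

```python
import io

# Accumulate text in a StringIO buffer through its file-like write() method.
sio = io.StringIO()
sio.write('abc')
sio.write('def')

# The list-based counterpart: extend() appends each character separately.
lis = []
lis.extend(['a', 'b', 'c'])
lis.extend(['d', 'e', 'f'])

# Both containers hold the same text once materialized as a str.
from_sio = sio.getvalue()
from_lis = ''.join(lis)
print(from_sio, from_lis)
```

The difference is in the interface: the buffer is written to like a file, while the list has to be joined explicitly at the end.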
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On 2/12/2013 4:03 PM, Maciej Fijalkowski wrote:

Hi

We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)? What about other performance improvements in stdlib that are problematic for pypy or others? Personally I would like cleaner code in stdlib vs speeding up CPython. Typically that also helps pypy so I'm not unbiased.

I agree. sum() refuses to sum strings specifically to encourage .join().

>>> sum(('x', 'b'), '')
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    sum(('x', 'b'), '')
TypeError: sum() can't sum strings [use ''.join(seq) instead]

The doc entry for sum says the same thing.

-- 
Terry Jan Reedy
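The same behaviour can be checked from a script instead of the interactive shell. This is only a sketch: the variable names are made up for the example, and the exact wording of the error message is left to the interpreter.

```python
parts = ['x', 'b']

# sum() with a str start value is rejected by the builtin itself.
try:
    sum(parts, '')
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

# The spelling the error message points to.
joined = ''.join(parts)

print(error_message)
print(joined)
```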
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, 13 Feb 2013 00:28:15 +0100 Christian Heimes christ...@python.org wrote:

On 12.02.2013 22:32, Antoine Pitrou wrote:

For the record, io.StringIO should be quite fast in 3.3. (except for the method call overhead that Guido is complaining about :-))

AFAIK it's not the actual *call* of the method that is slow, but rather attribute lookup and creation of bound method objects.

Take a look at http://bugs.python.org/issue17170
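To make the attribute-lookup point concrete, here is a hedged sketch of the usual workaround: fetch the bound method once, outside the loop. The function names are hypothetical, and how much this saves depends on the interpreter and its version, so treat it as something to benchmark, not as a guaranteed win.

```python
import io


def write_all_lookup(items):
    """Resolve sio.write again on every iteration of the loop."""
    sio = io.StringIO()
    for item in items:
        sio.write(item)
    return sio.getvalue()


def write_all_cached(items):
    """Resolve the bound method once and reuse it inside the loop."""
    sio = io.StringIO()
    write = sio.write  # one attribute lookup, one bound method object
    for item in items:
        write(item)
    return sio.getvalue()


data = ['spam', 'eggs', 'ham']
print(write_all_lookup(data) == write_all_cached(data))
```

Both functions return the same string; only the number of attribute lookups performed in the loop differs.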
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, 13 Feb 2013 09:39:23 +1000 Nick Coghlan ncogh...@gmail.com wrote:

On 13 Feb 2013 07:08, Maciej Fijalkowski fij...@gmail.com wrote:

Hi

We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)? What about other performance improvements in stdlib that are problematic for pypy or others? Personally I would like cleaner code in stdlib vs speeding up CPython.

For the specific case of "Don't rely on the fragile refcounting hack in CPython's string concatenation" I strongly agree. However, as a general principle, I can't agree until speed.python.org is a going concern and we can get a reasonable overview of any resulting performance implications.

Anybody can run the benchmark suite for himself, speed.p.o is (fortunately) not a roadblock: http://bugs.python.org/issue17170

Regards

Antoine.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, 13 Feb 2013 08:16:21 +0100 Antoine Pitrou solip...@pitrou.net wrote:

On Wed, 13 Feb 2013 09:39:23 +1000 Nick Coghlan ncogh...@gmail.com wrote:

On 13 Feb 2013 07:08, Maciej Fijalkowski fij...@gmail.com wrote:

Hi

We recently encountered a performance issue in stdlib for pypy. It turned out that someone committed a performance fix that uses += for strings instead of the .join() that was there before. Now this hurts pypy (we can mitigate it to some degree though) and possibly Jython and IronPython too. How do people feel about generally not having += on long strings in stdlib (since the refcount = 1 thing is a hack)? What about other performance improvements in stdlib that are problematic for pypy or others? Personally I would like cleaner code in stdlib vs speeding up CPython.

For the specific case of "Don't rely on the fragile refcounting hack in CPython's string concatenation" I strongly agree. However, as a general principle, I can't agree until speed.python.org is a going concern and we can get a reasonable overview of any resulting performance implications.

Anybody can run the benchmark suite for himself, speed.p.o is (fortunately) not a roadblock: http://bugs.python.org/issue17170

And I meant to paste the repo URL actually: http://hg.python.org/benchmarks/

Regards

Antoine.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Tue, Feb 12, 2013 at 5:25 PM, Christian Tismer tis...@stackless.com wrote:

Would ropes be an answer (and a simple way to cope with string mutation patterns) as an alternative implementation, and therefore still justify the usage of that pattern?

I don't think so. Ropes are really useful when you work with gigabytes of data, but unfortunately they don't make good general-purpose strings. Monolithic arrays are much more efficient and simple for the typical use-cases we have in Python.
Re: [Python-Dev] Usage of += on strings in loops in stdlib
On Wed, Feb 13, 2013 at 5:42 PM, Alexandre Vassalotti alexan...@peadrop.com wrote:

On Tue, Feb 12, 2013 at 5:25 PM, Christian Tismer tis...@stackless.com wrote:

Would ropes be an answer (and a simple way to cope with string mutation patterns) as an alternative implementation, and therefore still justify the usage of that pattern?

I don't think so. Ropes are really useful when you work with gigabytes of data, but unfortunately they don't make good general-purpose strings. Monolithic arrays are much more efficient and simple for the typical use-cases we have in Python.

If I recall correctly, io.StringIO and io.BytesIO have been updated to use ropes internally in 3.3. Writing to one of those and then calling getvalue() at the end is the main alternative to the list+join trick (when concatenating many small strings, the memory saving relative to a list can be notable since strings have a fairly large per-instance overhead).

Cheers, Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
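The two alternatives named above, side by side, as a minimal sketch. The helper names are hypothetical; only io.StringIO, its write() and getvalue() methods, and str.join() come from the standard library. No claim is made here about which is faster or smaller in memory on a given interpreter.

```python
import io


def concat_with_join(pieces):
    """The list+join trick: collect the pieces, then join once at the end."""
    chunks = []
    for piece in pieces:
        chunks.append(piece)
    return ''.join(chunks)


def concat_with_stringio(pieces):
    """Write each piece to an in-memory text buffer, then read it back once."""
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()


pieces = [str(n) for n in range(5)]
print(concat_with_join(pieces), concat_with_stringio(pieces))
```

For bytes, io.BytesIO and b''.join() play the same roles.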