Re: [Python-Dev] memcmp performance
Richard Saunders, 25.10.2011 01:17:
> -On [20111024 09:22], Stefan Behnel wrote:
>> I agree. Given that the analysis shows that the libc memcmp() is
>> particularly fast on many Linux systems, it should be up to the Python
>> package maintainers for these systems to set that option externally
>> through the optimisation CFLAGS.
>
> Indeed, this is how I constructed my Python 3.3 and Python 2.7:
>
>     setenv CFLAGS '-fno-builtin-memcmp'
>
> just before I configured.
>
> I would like to revisit changing unicode_compare: adding a special arm
> for using memcmp when the unicode kinds are the same will only work in
> two specific instances:
>
> (1) the strings are the same kind, the char size is 1
>     * We could add THIS to unicode_compare, but it seems extremely
>       specialized by itself

But also extremely likely to happen. This means that the strings are pure ASCII, which is highly likely, and one of the main reasons why the unicode string layout was rewritten for CPython 3.3. It allows CPython to save a lot of memory (thus clearly proving how likely this case is!), and it would also allow it to do faster comparisons for these strings.

Stefan

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] memcmp performance
On Tuesday 25 October 2011 10:44:16, Stefan Behnel wrote:
> Richard Saunders, 25.10.2011 01:17:
>> -On [20111024 09:22], Stefan Behnel wrote:
>>> I agree. Given that the analysis shows that the libc memcmp() is
>>> particularly fast on many Linux systems, it should be up to the Python
>>> package maintainers for these systems to set that option externally
>>> through the optimisation CFLAGS.
>>
>> Indeed, this is how I constructed my Python 3.3 and Python 2.7:
>>
>>     setenv CFLAGS '-fno-builtin-memcmp'
>>
>> just before I configured.
>>
>> I would like to revisit changing unicode_compare: adding a special arm
>> for using memcmp when the unicode kinds are the same will only work in
>> two specific instances:
>>
>> (1) the strings are the same kind, the char size is 1
>>     * We could add THIS to unicode_compare, but it seems extremely
>>       specialized by itself
>
> But also extremely likely to happen. This means that the strings are
> pure ASCII, which is highly likely, and one of the main reasons why the
> unicode string layout was rewritten for CPython 3.3. It allows CPython
> to save a lot of memory (thus clearly proving how likely this case is!),
> and it would also allow it to do faster comparisons for these strings.

Python 3.3 already has some optimizations for latin1: the CPU and the C language are more efficient at processing char* strings than Py_UCS2 and Py_UCS4 strings. For example, we use memchr() to search for a single character in a latin1 string.

Victor
Re: [Python-Dev] memcmp performance
-On [20111024 09:22], Stefan Behnel (stefan...@behnel.de) wrote:
> I agree. Given that the analysis shows that the libc memcmp() is
> particularly fast on many Linux systems, it should be up to the Python
> package maintainers for these systems to set that option externally
> through the optimisation CFLAGS.

Indeed, this is how I constructed my Python 3.3 and Python 2.7:

    setenv CFLAGS '-fno-builtin-memcmp'

just before I configured.

I would like to revisit changing unicode_compare: adding a special arm for using memcmp when the "unicode kinds" are the same will only work in two specific instances:

(1) the strings are the same kind, the char size is 1
    * We could add THIS to unicode_compare, but it seems extremely
      specialized by itself
(2) the strings are the same kind, and we are checking for equality
    * Since unicode_compare can't detect equality checking, we can't
      really add this to unicode_compare at all

The problem is, of course, that memcmp won't compare for less-than or greater-than correctly (unless on a BIG ENDIAN machine) for char sizes of 2 or 4.

If we wanted to put memcmp in unicodeobject.c, it would probably need to go into PyUnicode_RichCompare (so we would have some more semantic information). I may try to put together a patch for that, if people think that's a good idea? It would be JUST adding a call to memcmp for the two instances specified above.

From: Jeroen Ruigrok van der Werven asmo...@in-nomine.org
> In the same stretch, stuff like this needs to be documented. Package
> maintainers cannot be expected to follow each and every mailing list's
> posts for nuggets of information like this. Been there, done that, it's
> impossible to keep track.

I would like to second that: the whole point of a Makefile/configuration file is to capture knowledge like this so it doesn't get lost. I would prefer the option to be part of a standard build Python distributes, but as long as the information gets captured SOMEWHERE, so that (say) Fedora Core 17 has Python 2.7 built with -fno-builtin-memcmp, I would be happy.

Gooday,
  Richie
Re: [Python-Dev] memcmp performance
Martin v. Löwis, 23.10.2011 23:44:
>> I am still rooting for -fno-builtin-memcmp in both Python 2.7 and
>> 3.3 ... (after we put memcmp in unicode_compare)
>
> -1. We shouldn't do anything about this. Python has the tradition of
> not working around platform bugs, except if the work-arounds are
> necessary to make something work at all - i.e. in particular not for
> performance issues. If this is a serious problem, then platform vendors
> need to look into it (CPU vendor, compiler vendor, OS vendor). If they
> don't act, it's probably not a serious problem.
>
> In the specific case, I don't think it's a problem at all. It's not
> that memcmp is slow with the builtin version - it's just not as fast as
> it could be. Adding a compiler option would put a maintenance burden on
> Python - we already have way too many compiler options in configure.in,
> and there is no good procedure to ever take them out should they not be
> needed anymore.

I agree. Given that the analysis shows that the libc memcmp() is particularly fast on many Linux systems, it should be up to the Python package maintainers for these systems to set that option externally through the optimisation CFLAGS.

Stefan
Re: [Python-Dev] memcmp performance
-On [20111024 09:22], Stefan Behnel (stefan...@behnel.de) wrote:
> I agree. Given that the analysis shows that the libc memcmp() is
> particularly fast on many Linux systems, it should be up to the Python
> package maintainers for these systems to set that option externally
> through the optimisation CFLAGS.

In the same stretch, stuff like this needs to be documented. Package maintainers cannot be expected to follow each and every mailing list's posts for nuggets of information like this. Been there, done that, it's impossible to keep track.

-- 
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | GPG: 2EAC625B
Only in sleep can one find salvation that resembles Death...
Re: [Python-Dev] memcmp performance
> I am still rooting for -fno-builtin-memcmp in both Python 2.7 and
> 3.3 ... (after we put memcmp in unicode_compare)

-1. We shouldn't do anything about this. Python has the tradition of not working around platform bugs, except if the work-arounds are necessary to make something work at all - i.e. in particular not for performance issues. If this is a serious problem, then platform vendors need to look into it (CPU vendor, compiler vendor, OS vendor). If they don't act, it's probably not a serious problem.

In the specific case, I don't think it's a problem at all. It's not that memcmp is slow with the builtin version - it's just not as fast as it could be. Adding a compiler option would put a maintenance burden on Python - we already have way too many compiler options in configure.in, and there is no good procedure to ever take them out should they not be needed anymore.

Regards,
Martin
Re: [Python-Dev] memcmp performance
Antoine Pitrou, 20.10.2011 23:08:
>> I have been doing some performance experiments with memcmp, and I was
>> surprised that memcmp wasn't faster than it was in Python. I did a
>> whole, long analysis and came up with some very simple results.
>
> Thanks for the analysis. Non-bugfix work now happens on Python 3, where
> the str type is Python 2's unicode type. Your recommendations would
> have to be revisited under that light.

Well, Py3 is quite a bit different now that PEP 393 is in. It appears to use memcmp() or strcmp() a lot less than before, but I think unicode_compare() should actually receive an optimisation to use a fast memcmp() if both string kinds are equal, at least when their character unit size is less than 4 (i.e. especially for ASCII strings). Funny enough, tailmatch() has such an optimisation.

Stefan
Re: [Python-Dev] memcmp performance
On Fri, 21 Oct 2011 08:24:44 +0200, Stefan Behnel stefan...@behnel.de wrote:
> Antoine Pitrou, 20.10.2011 23:08:
>>> I have been doing some performance experiments with memcmp, and I was
>>> surprised that memcmp wasn't faster than it was in Python. I did a
>>> whole, long analysis and came up with some very simple results.
>>
>> Thanks for the analysis. Non-bugfix work now happens on Python 3,
>> where the str type is Python 2's unicode type. Your recommendations
>> would have to be revisited under that light.
>
> Well, Py3 is quite a bit different now that PEP 393 is in. It appears
> to use memcmp() or strcmp() a lot less than before, but I think
> unicode_compare() should actually receive an optimisation to use a fast
> memcmp() if both string kinds are equal, at least when their character
> unit size is less than 4 (i.e. especially for ASCII strings). Funny
> enough, tailmatch() has such an optimisation.

Yes, unicode_compare() probably deserves optimizing. Patches welcome, by the way :)

Regards
Antoine.
Re: [Python-Dev] memcmp performance
Richard Saunders:
> I have been doing some performance experiments with memcmp, and I was
> surprised that memcmp wasn't faster than it was in Python. I did a
> whole, long analysis and came up with some very simple results.

Antoine Pitrou, 20.10.2011 23:08:
> Thanks for the analysis. Non-bugfix work now happens on Python 3, where
> the str type is Python 2's unicode type. Your recommendations would
> have to be revisited under that light.

Stefan Behnel stefan...@behnel.de:
> Well, Py3 is quite a bit different now that PEP 393 is in. It appears
> to use memcmp() or strcmp() a lot less than before, but I think
> unicode_compare() should actually receive an optimisation to use a fast
> memcmp() if both string kinds are equal, at least when their character
> unit size is less than 4 (i.e. especially for ASCII strings). Funny
> enough, tailmatch() has such an optimisation.

I started looking at the most recent 3.x baseline: in a lot of places, the memcmp analysis appears relevant (zlib, arraymodule, datetime, xmlparse): all still use memcmp in about the same way. But I agree that there are some major differences in the unicode portion.

As long as the two strings are the same unicode "kind", you can use a memcmp to compare. In that case, I would almost argue some memcmp optimization is even more important: unicode strings are potentially 2 to 4 times larger, so the amount of time spent in memcmp may be more (i.e., I am still rooting for -fno-builtin-memcmp on the compile lines).

I went ahead and wrote a quick string_test3.py for comparing strings (similar to what I did in Python 2.7):

    # Simple python string comparison test for Python 3.3
    a = []; b = []; c = []; d = []
    for x in range(0, 1000):
        a.append("the quick brown fox" + str(x))
        b.append("the wuick brown fox" + str(x))
        c.append("the quick brown fox" + str(x))
        d.append("the wuick brown fox" + str(x))

    count = 0
    for x in range(0, 20):
        if a == c: count += 1
        if a == c: count += 2
        if a == d: count += 3
        if b == c: count += 5
        if b == d: count += 7
        if c == d: count += 11
    print(count)

Timings on my FC14 machine (Intel Xeon W3520 @ 2.67GHz):

    29.18 seconds: Vanilla build of Python 3.3
    29.17 seconds: Python 3.3 compiled with -fno-builtin-memcmp

No change: a little investigation shows unicode_compare is where all the work is. Here's currently the main loop inside unicode_compare:

    for (i = 0; i < len1 && i < len2; ++i) {
        Py_UCS4 c1, c2;
        c1 = PyUnicode_READ(kind1, data1, i);
        c2 = PyUnicode_READ(kind2, data2, i);

        if (c1 != c2)
            return (c1 < c2) ? -1 : 1;
    }

    return (len1 < len2) ? -1 : (len1 != len2);

If both strings are the same unicode kind, we can add memcmp to unicode_compare for an optimization:

    Py_ssize_t len = (len1 < len2) ? len1 : len2;

    /* use memcmp if both the same kind */
    if (kind1 == kind2) {
        int result = memcmp(data1, data2, ((int)kind1) * len);
        if (result != 0)
            return result < 0 ? -1 : +1;
    }

Rerunning the test with this small change to unicode_compare:

    17.84 seconds: -fno-builtin-memcmp
    36.25 seconds: STANDARD memcmp

The standard memcmp is WORSE than the original unicode_compare code, but if we compile using memcmp with -fno-builtin-memcmp, we get that wonderful 2x performance increase again.

I am still rooting for -fno-builtin-memcmp in both Python 2.7 and 3.3 ... (after we put memcmp in unicode_compare)

Gooday,
  Richie
Re: [Python-Dev] memcmp performance
Richard Saunders, 21.10.2011 20:23:
> As long as the two strings are the same unicode kind, you can use a
> memcmp to compare. In that case, I would almost argue some memcmp
> optimization is even more important: unicode strings are potentially 2
> to 4 times larger, so the amount of time spent in memcmp may be more
> (i.e., I am still rooting for -fno-builtin-memcmp on the compile
> lines).

I would argue that the pure ASCII (1 byte per character) case is even more important than the other cases, and it suffers from the 1-byte-per-comparison problem you noted. That's why you got the 2x speed-up for your quick test.

Stefan
Re: [Python-Dev] memcmp performance
On Fri, 21 Oct 2011 18:23:24 +0000 (GMT), Richard Saunders richismyn...@me.com wrote:
> If both strings are the same unicode kind, we can add memcmp
> to unicode_compare for an optimization:
>
>     Py_ssize_t len = (len1 < len2) ? len1 : len2;
>
>     /* use memcmp if both the same kind */
>     if (kind1 == kind2) {
>         int result = memcmp(data1, data2, ((int)kind1) * len);
>         if (result != 0)
>             return result < 0 ? -1 : +1;
>     }

Hmm, you have to be a bit subtler than that: on a little-endian machine, you can't compare two characters by comparing their byte representation in memory order. So memcmp() can only be used for the one-byte representation. (Actually, it can also be used for equality comparisons on any representation.)

> Rerunning the test with this small change to unicode_compare:
>
>     17.84 seconds: -fno-builtin-memcmp
>     36.25 seconds: STANDARD memcmp
>
> The standard memcmp is WORSE than the original unicode_compare code,
> but if we compile using memcmp with -fno-builtin-memcmp, we get that
> wonderful 2x performance increase again.

The standard memcmp being worse is a bit puzzling. Intuitively, it should have roughly the same performance as the original function. I also wonder whether the slowdown could materialize on non-glibc systems.

> I am still rooting for -fno-builtin-memcmp in both Python 2.7 and
> 3.3 ... (after we put memcmp in unicode_compare)

A patch for unicode_compare would be a good start. Its performance can then be checked on other systems (such as Windows).

Regards
Antoine.
[Python-Dev] memcmp performance
Hi,

This is my first time on Python-dev, so I apologize for my newbie-ness.

I have been doing some performance experiments with memcmp, and I was surprised that memcmp wasn't faster than it was in Python. I did a whole, long analysis and came up with some very simple results.

Before I put in a tracker bug report, I wanted to present my findings and make sure they were repeatable by others (isn't that the nature of science? ;) as well as offer discussion.

The analysis is a pdf and is here:
http://www.picklingtools.com/study.pdf

The testcases are a tarball here:
http://www.picklingtools.com/PickTest5.tar.gz

I have three basic recommendations in the study: I am curious what other people think.

Gooday,
  Richie
Re: [Python-Dev] memcmp performance
Hello,

> I have been doing some performance experiments with memcmp, and I was
> surprised that memcmp wasn't faster than it was in Python. I did a
> whole, long analysis and came up with some very simple results. Before
> I put in a tracker bug report, I wanted to present my findings and make
> sure they were repeatable by others (isn't that the nature of science?
> ;) as well as offer discussion.

Thanks for the analysis. Non-bugfix work now happens on Python 3, where the str type is Python 2's unicode type. Your recommendations would have to be revisited under that light.

Have you reported gcc's outdated optimization issue to them? Or is it already solved in newer gcc versions? Under glibc-based systems, it seems we can't go wrong with the system memcmp function. If gcc doesn't get in the way, that is.

Regards
Antoine.
Re: [Python-Dev] memcmp performance
On 10/20/2011 5:08 PM, Antoine Pitrou wrote:
> Have you reported gcc's outdated optimization issue to them? Or is it
> already solved in newer gcc versions?

I checked this on gcc 4.6, and it still optimizes memcmp/strcmp into a repz cmpsb instruction on x86. This has been known to be a problem since at least 2002 [1][2]. There are also some alternative implementations available on their mailing list.

It seems the main objection to removing the optimization was that gcc isn't always compiling against an optimized libc, so they didn't want to drop the optimization. Beyond that, I think nobody was willing to put in the effort to change the optimization itself.

[1] http://gcc.gnu.org/ml/gcc/2002-10/msg01616.html
[2] http://gcc.gnu.org/ml/gcc/2003-04/msg00166.html

-- 
Scott Dial
sc...@scottdial.com
Re: [Python-Dev] memcmp performance
Hey,

> I have been doing some performance experiments with memcmp, and I was
> surprised that memcmp wasn't faster than it was in Python. I did a
> whole, long analysis and came up with some very simple results.

Paul Svensson suggested I post as much as I can as text, as people would be more likely to read it. So, here are the basic ideas:

(1) memcmp is surprisingly slow on some Intel gcc platforms (Linux). On several Linux, Intel platforms, memcmp was 2-3x slower than a simple, portable C function (with some optimizations).

(2) The problem: if you compile C programs with gcc with any optimization on, it will replace all memcmp calls with an assembly language stub, rep cmpsb, instead of the memcmp call.

(3) rep cmpsb seems like it would be faster, but it really isn't: this completely bypasses the memcmp.S, memcmp_sse3.S and memcmp_sse4.S in glibc, which are typically faster.

(4) The basic conclusion is that the Python baseline on Intel gcc platforms should probably be compiled with -fno-builtin-memcmp so we "avoid" gcc's memcmp optimization.

The numbers are all in the paper: I will endeavor to generate a text form of all the tables so it's easier to read. This is my first time in the Python dev arena, so I went a little overboard with my paper below. ;)

Gooday,
  Richie

> Before I put in a tracker bug report, I wanted to present my findings
> and make sure they were repeatable by others (isn't that the nature of
> science? ;) as well as offer discussion.
>
> The analysis is a pdf and is here:
> http://www.picklingtools.com/study.pdf
>
> The testcases are a tarball here:
> http://www.picklingtools.com/PickTest5.tar.gz
>
> I have three basic recommendations in the study: I am curious what
> other people think.
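[Editor's illustration: recommendation (4) amounts to a build fragment like the following. The source directory and install prefix are illustrative, not prescriptive; only the CFLAGS setting is the substance of the recommendation.]

```shell
# Build CPython with gcc's builtin memcmp expansion disabled, so string
# comparisons fall through to glibc's optimized memcmp implementations
# (memcmp.S, memcmp_sse3.S, memcmp_sse4.S) instead of "rep cmpsb".
# Adjust the source directory and prefix for your setup.
cd Python-3.3                       # illustrative source directory
CFLAGS='-fno-builtin-memcmp' ./configure --prefix=/opt/python33
make && make install
```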