Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Keir Mierle wrote:

> Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg Wilson this summer. We're having a very strange problem with Python's unicode parsing of source files. Basically, our CGI script was running extremely slowly on our production box (a pokey dual-Xeon 3GHz w/ 4GB RAM and 15K SCSI drives). Slow to the tune of 6-10 seconds per request. I eventually tracked this down to imports of our source tree; the actual request was completing in 300ms, the rest of the time was spent in __import__.

This is caused by the changes to the codecs in 2.4. Basically the codecs no longer rely on C's readline() to do line splitting (which can't work for UTF-16), but do it themselves (via unicode.splitlines()).

> After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was getting called 51 million times. Our code is 1.2 million characters, so I hardly think it makes sense to call IsLinebreak 50 times for each character; and we're not even importing our entire source tree on every invocation.

But if you're using CGI, you're importing your source on every invocation. Switching to a different server side technology might help. Nevertheless, 50 million calls seems to be a bit much.

> Our code is a fork of Trac, and originally had these lines at the top:
>
>     # -*- coding: iso8859-1 -*-
>
> This made me suspicious, so I removed all of them. The CGI execution time immediately dropped to ~1 second. gprof revealed that _PyUnicodeUCS2_IsLinebreak is not called at all anymore. Now that our code works fast enough, I don't really care about this, but I thought python-dev might want to know something weird is going on with unicode splitlines.

I wonder if we should switch back to a simple readline() implementation for those codecs that don't require the current implementation (basically every charmap codec).

AFAIK source files are opened in universal newline mode, so at least we'd get proper treatment of \n, \r and \r\n line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", u"\x85", u"\u2028" and u"\u2029" (which are line terminators according to unicode.splitlines()).

> I documented my investigation of this problem; if anyone wants further details, just email me. (I'm not on python-dev)
> http://www.third-bit.com/trac/argon/ticket/525

Bye,
Walter Dörwald

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
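For reference, the larger Unicode terminator set Walter lists is exactly what a plain byte-oriented split on "\n" would miss. A quick Python 3 sketch (str.splitlines() still recognizes all of these today):

```python
# str.splitlines() knows about the Unicode line terminators,
# not just the universal-newline set \n, \r, \r\n
s = "one\u2028two\x85three\nfour"
assert s.splitlines() == ["one", "two", "three", "four"]

# a readline()-style split on \n alone sees only one boundary
assert s.split("\n") == ["one\u2028two\x85three", "four"]
```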
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Walter Dörwald wrote:

> This is caused by the changes to the codecs in 2.4. Basically the codecs no longer rely on C's readline() to do line splitting (which can't work for UTF-16), but do it themselves (via unicode.splitlines()).

That explains why you get any calls to IsLineBreak; it doesn't explain why you get so many of them. I investigated this a bit, and one issue seems to be that StreamReader.readline performs splitlines on the entire input, only to fetch the first line. It then joins the rest for later processing. In addition, it also performs splitlines on a single line, just to strip any trailing line breaks.

The net effect is that, for a file with N lines, IsLineBreak is invoked up to N*N/2 times per character (at least for the last character). So I think it would be best if Unicode characters exposed a .islinebreak method (or, failing that, codecs just knew what the line break characters are in Unicode 3.2), and then codecs would split off the first line of input itself.

>> After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was getting called 51 million times. Our code is 1.2 million characters, so I hardly think it makes sense to call IsLinebreak 50 times for each character; and we're not even importing our entire source tree on every invocation.
>
> But if you're using CGI, you're importing your source on every invocation.

Well, no. Only the CGI script needs to be parsed every time; all modules could load off bytecode files. Which suggests that Keir Mierle doesn't use bytecode files; I think he should.

Regards,
Martin
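The pattern Martin describes can be modeled in a few lines (a simplified sketch, not the actual codecs.py code):

```python
def naive_readline(buf):
    # mirror the 2.4 codecs.py pattern: split the WHOLE buffer,
    # keep the first line, join everything else back into one string
    lines = buf.splitlines(True)
    if not lines:
        return "", ""
    return lines[0], "".join(lines[1:])

# draining an N-line buffer this way rescans and recopies the
# ever-shrinking tail on every call, for O(N^2) total work
buf = "line\n" * 100
out = []
while buf:
    line, buf = naive_readline(buf)
    out.append(line)
assert out == ["line\n"] * 100
```

Each call pays for the full remaining buffer, which is why a 1.2-million-character source tree can rack up tens of millions of IsLinebreak calls.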
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Walter Dörwald wrote:

> I wonder if we should switch back to a simple readline() implementation for those codecs that don't require the current implementation (basically every charmap codec).

That would be my preference as well. The 2.4 .readline() approach is really only needed for codecs that have to deal with encodings that:

a) use multi-byte formats, or
b) support more line-end formats than just CR, CRLF, LF, or
c) are stateful.

This can easily be had by using a mix-in class for codecs which do need the buffered .readline() approach.

> AFAIK source files are opened in universal newline mode, so at least we'd get proper treatment of \n, \r and \r\n line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", u"\x85", u"\u2028" and u"\u2029" (which are line terminators according to unicode.splitlines()).

While the Unicode standard defines these characters as line end code points, I think their definition does not necessarily apply to data that is converted from a certain encoding to Unicode, so that's not a big loss. E.g. in ASCII or Latin-1, the FILE, GROUP and RECORD SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85) are not interpreted as line end characters.

Furthermore, we had no reports of anyone complaining in Python 1.6, 2.0 - 2.3 that line endings were not detected properly. All these Python versions relied on the stream's .readline() method to get the next line. The only bug reports we had were for UTF-16, which falls into the above category a) and did not support .readline() until Python 2.4.

A note on the performance of _PyUnicode_IsLinebreak(): in Python 2.0 Fredrik changed this to use the two-step lookup (reducing the size of the lookup tables considerably). I think it's worthwhile reconsidering this approach for character type queries that do not involve a huge number of code points.
In Python 1.6 the function looked like this (and was inlined by the compiler using its own fast lookup table):

    int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
    {
        switch (ch) {
        case 0x000A: /* LINE FEED */
        case 0x000D: /* CARRIAGE RETURN */
        case 0x001C: /* FILE SEPARATOR */
        case 0x001D: /* GROUP SEPARATOR */
        case 0x001E: /* RECORD SEPARATOR */
        case 0x0085: /* NEXT LINE */
        case 0x2028: /* LINE SEPARATOR */
        case 0x2029: /* PARAGRAPH SEPARATOR */
            return 1;
        default:
            return 0;
        }
    }

Another candidate to convert back is:

    int _PyUnicode_IsWhitespace(register const Py_UNICODE ch)
    {
        switch (ch) {
        case 0x0009: /* HORIZONTAL TABULATION */
        case 0x000A: /* LINE FEED */
        case 0x000B: /* VERTICAL TABULATION */
        case 0x000C: /* FORM FEED */
        case 0x000D: /* CARRIAGE RETURN */
        case 0x001C: /* FILE SEPARATOR */
        case 0x001D: /* GROUP SEPARATOR */
        case 0x001E: /* RECORD SEPARATOR */
        case 0x001F: /* UNIT SEPARATOR */
        case 0x0020: /* SPACE */
        case 0x0085: /* NEXT LINE */
        case 0x00A0: /* NO-BREAK SPACE */
        case 0x1680: /* OGHAM SPACE MARK */
        case 0x2000: /* EN QUAD */
        case 0x2001: /* EM QUAD */
        case 0x2002: /* EN SPACE */
        case 0x2003: /* EM SPACE */
        case 0x2004: /* THREE-PER-EM SPACE */
        case 0x2005: /* FOUR-PER-EM SPACE */
        case 0x2006: /* SIX-PER-EM SPACE */
        case 0x2007: /* FIGURE SPACE */
        case 0x2008: /* PUNCTUATION SPACE */
        case 0x2009: /* THIN SPACE */
        case 0x200A: /* HAIR SPACE */
        case 0x200B: /* ZERO WIDTH SPACE */
        case 0x2028: /* LINE SEPARATOR */
        case 0x2029: /* PARAGRAPH SEPARATOR */
        case 0x202F: /* NARROW NO-BREAK SPACE */
        case 0x3000: /* IDEOGRAPHIC SPACE */
            return 1;
        default:
            return 0;
        }
    }

-- 
Marc-Andre Lemburg
eGenix.com -- Professional Python Services directly from the Source (#1, Aug 23 2005)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
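As a sanity check, the eight code points in the 1.6 switch statement are still line terminators for Python 3's str.splitlines() (which additionally splits on \v and \f):

```python
# the eight code points from the 1.6 _PyUnicode_IsLinebreak switch
LINEBREAKS = ["\n", "\r", "\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029"]
for ch in LINEBREAKS:
    # each one terminates a line on its own
    assert ("a" + ch + "b").splitlines() == ["a", "b"]

# \r\n is treated as a single terminator, not two
assert "a\r\nb".splitlines() == ["a", "b"]
```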
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
M.-A. Lemburg wrote:

> I think it's worthwhile reconsidering this approach for character type queries that do not involve a huge number of code points.

I would advise against that. I measured both versions (your version called PyUnicode_IsLinebreak2) with the following code:

    volatile int result;

    void unibench()
    {
    #define REPS 10000000000LL
        long long i;
        clock_t s1, s2, s3, s4, s5;
        s1 = clock();
        for (i = 0; i < REPS; i++)
            result = _PyUnicode_IsLinebreak('(');
        s2 = clock();
        for (i = 0; i < REPS; i++)
            result = PyUnicode_IsLinebreak2('(');
        s3 = clock();
        for (i = 0; i < REPS; i++)
            result = _PyUnicode_IsLinebreak('\n');
        s4 = clock();
        for (i = 0; i < REPS; i++)
            result = PyUnicode_IsLinebreak2('\n');
        s5 = clock();
        printf("f1, (: %d\nf2, (: %d\nf1, CR: %d\nf2, CR: %d\n",
               (int)(s2-s1), (int)(s3-s2), (int)(s4-s3), (int)(s5-s4));
    }

and got these numbers:

    f1, (: 1321
    f2, (: 1330
    f1, CR: 1322
    f2, CR: 1325

What can be seen is that the performance of the two versions is nearly identical, with the code currently used being slightly better. What can also be seen is that, on my machine, 1e10 calls to IsLinebreak take 13.2 seconds, so 51 million calls take about 70ms. The reported performance problem is more likely in the allocation of all these splitlines results, and the copying of the same strings over and over again.

Regards,
Martin

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Martin v. Löwis wrote:

> Walter Dörwald wrote:
>> This is caused by the changes to the codecs in 2.4. Basically the codecs no longer rely on C's readline() to do line splitting (which can't work for UTF-16), but do it themselves (via unicode.splitlines()).
>
> That explains why you get any calls to IsLineBreak; it doesn't explain why you get so many of them. I investigated this a bit, and one issue seems to be that StreamReader.readline performs splitlines on the entire input, only to fetch the first line. It then joins the rest for later processing. In addition, it also performs splitlines on a single line, just to strip any trailing line breaks.

This is because unicode.splitlines() is the only API available to Python that knows about Unicode line feeds.

> The net effect is that, for a file with N lines, IsLineBreak is invoked up to N*N/2 times per character (at least for the last character). So I think it would be best if Unicode characters exposed a .islinebreak method (or, failing that, codecs just knew what the line break characters are in Unicode 3.2), and then codecs would split off the first line of input itself.

I think a maxsplit argument (just as for unicode.split()) would help too.

[...]

Bye,
Walter Dörwald
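Python never did grow a maxsplit parameter on splitlines(), but Walter's idea can be sketched with a regex over the same terminator set (`splitlines_max` is a hypothetical helper, and \v/\f are omitted for brevity):

```python
import re

# the eight terminators discussed in this thread
_LB = re.compile("[\n\r\x1c\x1d\x1e\x85\u2028\u2029]")

def splitlines_max(s, maxsplit):
    """Hypothetical splitlines() with a maxsplit parameter: after
    maxsplit splits, the unscanned remainder is returned as one
    piece, so a readline() only pays for the first line."""
    parts = []
    pos = 0
    while maxsplit > 0:
        m = _LB.search(s, pos)
        if m is None:
            break
        end = m.end()
        if s[m.start()] == "\r" and end < len(s) and s[end] == "\n":
            end += 1  # \r\n counts as one terminator
        parts.append(s[pos:m.start()])
        pos = end
        maxsplit -= 1
    parts.append(s[pos:])
    return parts

assert splitlines_max("a\nb\r\nc\nd", 2) == ["a", "b", "c\nd"]
```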
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Martin v. Löwis wrote:

> M.-A. Lemburg wrote:
>> I think it's worthwhile reconsidering this approach for character type queries that do not involve a huge number of code points.
>
> I would advise against that. I measured both versions (your version called PyUnicode_IsLinebreak2) with the following code:
>
> [benchmark code quoted above snipped]
>
> and got these numbers:
>
>     f1, (: 1321
>     f2, (: 1330
>     f1, CR: 1322
>     f2, CR: 1325
>
> What can be seen is that the performance of the two versions is nearly identical, with the code currently used being slightly better. What can also be seen is that, on my machine, 1e10 calls to IsLinebreak take 13.2 seconds, so 51 million calls take about 70ms.

Your test is somewhat biased: the current solution works using type records, so it has to swap in a new record for each character you test. In your benchmark, the same character is tested over and over again, and the type record is likely already stored in the CPU cache. The .splitlines() routine itself calls the above function for each and every character in the string, so quite a few of these type records have to be looked up.
Here's a version that uses os.py as basis:

    #include <stdlib.h>
    #include <time.h>
    #include "Python.h"

    int _PyUnicode_IsLinebreak16(register const Py_UNICODE ch)
    {
        switch (ch) {
        case 0x000A: /* LINE FEED */
        case 0x000D: /* CARRIAGE RETURN */
        case 0x001C: /* FILE SEPARATOR */
        case 0x001D: /* GROUP SEPARATOR */
        case 0x001E: /* RECORD SEPARATOR */
        case 0x0085: /* NEXT LINE */
        case 0x2028: /* LINE SEPARATOR */
        case 0x2029: /* PARAGRAPH SEPARATOR */
            return 1;
        default:
            return 0;
        }
    }

    #define REPS 100000
    #define BUFFERSIZE 30000

    int main(void)
    {
        long i, j;
        clock_t s1, s2, s3;
        char *buffer;
        FILE *datafile;
        long filelen;
        int result;

        datafile = fopen("os.py", "rb");
        if (datafile == NULL) {
            printf("could not find os.py\n");
            return -1;
        }
        buffer = (char *)malloc(BUFFERSIZE);
        filelen = fread(buffer, 1, BUFFERSIZE, datafile);
        printf("filelen=%li bytes\n", filelen);

        s1 = clock();
        /* Python 2.4 */
        for (i = 0; i < REPS; i++)
            for (j = 0; j < filelen; j++)
                result = _PyUnicode_IsLinebreak((Py_UNICODE)buffer[j]);
        s2 = clock();
        /* Python 1.6 */
        for (i = 0; i < REPS; i++)
            for (j = 0; j < filelen; j++)
                result = _PyUnicode_IsLinebreak16((Py_UNICODE)buffer[j]);
        s3 = clock();
        printf("2.4: %d\n1.6: %d\n", (int)(s2-s1), (int)(s3-s2));
        return 0;
    }

Output, compiled with -O3:

    filelen=23147 bytes
    2.4: 257
    1.6: 123

That's a factor of 2.

> The reported performance problem is more likely in the allocation of all these splitlines results, and the copying of the same strings over and over again.

True.

-- 
Marc-Andre Lemburg
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Walter Dörwald wrote:

> I think a maxsplit argument (just as for unicode.split()) would help too.

Correct - that would allow us to get rid of the quadratic part.

We should also strive to avoid the second copy of the line, if the user requested keepends.

I wonder whether it would be worthwhile to cache the .splitlines result. An application that has just invoked .readline() will likely invoke .readline() again. If there is more than one line left, we could return the first line right away (potentially trimming the line ending if necessary). Only when a single line is left would we attempt to read more data. In a plain .read(), we would first join the lines back.

Regards,
Martin
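Martin's caching idea can be sketched in modern Python 3 (the class name and buffering details are invented for illustration; this is not the codecs.py implementation):

```python
import io

class CachedLineReader:
    """Sketch: split a chunk once with splitlines(True) and serve
    subsequent readline() calls from the cached list, instead of
    re-splitting the shrinking remainder every time."""

    def __init__(self, stream, chunk=8192):
        self.stream = stream
        self.chunk = chunk
        self.cache = []  # pending lines, endings kept

    def readline(self):
        # refill until the cache holds a complete line plus a remainder;
        # the last entry is never returned while more data may follow,
        # so a line straddling a chunk boundary is reassembled correctly
        while len(self.cache) < 2:
            data = self.stream.read(self.chunk)
            if not data:
                break
            tail = self.cache.pop() if self.cache else ""
            self.cache.extend((tail + data).splitlines(True))
        return self.cache.pop(0) if self.cache else ""

    def read(self):
        # a plain read() joins the cached lines back, as suggested
        rest = "".join(self.cache) + self.stream.read()
        self.cache = []
        return rest

r = CachedLineReader(io.StringIO("alpha\nbeta\u2028gamma"))
assert r.readline() == "alpha\n"
assert r.readline() == "beta\u2028"
assert r.readline() == "gamma"
assert r.readline() == ""
```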
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Martin v. Löwis wrote:

> Walter Dörwald wrote:
>> I think a maxsplit argument (just as for unicode.split()) would help too.
>
> Correct - that would allow us to get rid of the quadratic part.

OK, such a patch should be rather simple. I'll give it a try.

> We should also strive to avoid the second copy of the line, if the user requested keepends.

Your suggested unicode method islinebreak() would help with that. Then we could add the following to the string module:

    unicodelinebreaks = u"".join(unichr(c) for c in xrange(0, sys.maxunicode) if unichr(c).islinebreak())

Then

    if line and not keepends:
        line = line.splitlines(False)[0]

could be

    if line and not keepends:
        line = line.rstrip(string.unicodelinebreaks)

> I wonder whether it would be worthwhile to cache the .splitlines result. An application that has just invoked .readline() will likely invoke .readline() again. If there is more than one line left, we could return the first line right away (potentially trimming the line ending if necessary). Only when a single line is left would we attempt to read more data. In a plain .read(), we would first join the lines back.

OK, this would mean we'd have to distinguish between a direct call to read() and one done by readline() (which we do anyway through the firstline argument).

Bye,
Walter Dörwald
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Martin v. Löwis wrote:

> Walter Dörwald wrote:
>> Martin v. Löwis wrote:
>>> Walter Dörwald wrote:
>>>> I think a maxsplit argument (just as for unicode.split()) would help too.
>>>
>>> Correct - that would allow us to get rid of the quadratic part.
>>
>> OK, such a patch should be rather simple. I'll give it a try.
>
> Actually, on second thought - it would not remove the quadratic aspect.

At least it would remove the quadratic number of calls to _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only once.

> You would still copy the rest string completely on each split. So on the first split, you copy N lines (one result line, and N-1 lines into the rest string), on the second split, N-2 lines, and so on, totalling N*N/2 line copies again.

OK, that's true. We could prevent string copying if we kept the unsplit string and the position of the current line terminator, but this would require a "first position after a line terminator" method.

> The only thing you save is the join (as the rest is already joined), and the IsLineBreak calls (which are necessary only for the first line).
>
> Please see http://python.org/sf/1268314; it solves the problem by keeping the splitlines result.

The last part of the patch seems to be more related to bug #1235646.

With the patch, test_pep263 and test_codecs fail (and test_parser, but this might be unrelated). python Lib/test/test_pep263.py gives the following output:

    File "Lib/test/test_pep263.py", line 22
    SyntaxError: list index out of range

test_codecs.py has the following two complaints:

    File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", line 366, in readline
      self.charbuffer = lines[1] + self.charbuffer
    IndexError: list index out of range

and

    File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", line 336, in readline
      line = result.splitlines(False)[0]
    NameError: global name 'result' is not defined
> It only invokes IsLineBreak once per character, and also copies each character only once, and allocates each line only once, totalling in O(N) for these operations. It still does contain a quadratic operation: the lines are stored in a list, and the result line is removed from the list with del lines[0]. This copies N-1 pointers, resulting in N*N/2 pointer copies. That should still be much faster than the current code.

Using collections.deque() should get rid of this problem.

>> unicodelinebreaks = u"".join(unichr(c) for c in xrange(0, sys.maxunicode) if unichr(c).islinebreak())
>
> That is very inefficient. I would rather add a static list to the string module, and have a test that says
>
>     assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or unicodedata.category(ch)=='Zl')

You mean, in the test suite?

> unicodelinebreaks could then be defined as
>
>     # u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
>     unicodelinebreaks = '\n\r\x1c\x1d\x1e\xc2\x85\xe2\x80\xa8\xe2\x80\xa9'.decode("utf-8")

That might be better, as this definition won't change very often. BTW, why the decode() call? For a Python without unicode?

>> OK, this would mean we'd have to distinguish between a direct call to read() and one done by readline() (which we do anyway through the firstline argument).
>
> See my patch. If we have cached lines, we don't need to call .read at all.

I wonder what happens if calls to read() and readline() are mixed (e.g. if I'm reading Fortran source or anything with a fixed line header): read() would be used to read the first n characters (which joins the line buffer) and readline() reads the rest (which would split it again), etc. (Of course this could be done via a single readline() call.)

But I think a maxsplit argument for splitlines() would make sense independent of this problem.
Bye,
Walter Dörwald
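Martin's proposed consistency check still passes against today's unicodedata (Python 3 spelling, so chr/range instead of unichr/xrange): bidirectional class 'B' plus category 'Zl' yields exactly the eight splitlines() terminators discussed in this thread.

```python
import sys
import unicodedata

# derive the line-break set from character properties, as Martin suggests
unicodelinebreaks = "".join(
    chr(c) for c in range(sys.maxunicode + 1)
    if unicodedata.bidirectional(chr(c)) == "B"
       or unicodedata.category(chr(c)) == "Zl"
)
assert set(unicodelinebreaks) == set("\n\r\x1c\x1d\x1e\x85\u2028\u2029")
```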
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
On Wed, 2005-08-24 at 07:33, Martin v. Löwis wrote:

> [...]
>
> Actually, on second thought - it would not remove the quadratic aspect. You would still copy the rest string completely on each split. So on the first split, you copy N lines (one result line, and N-1 lines into the rest string), on the second split, N-2 lines, and so on, totalling N*N/2 line copies again. The only thing you save is the join (as the rest is already joined), and the IsLineBreak calls (which are necessary only for the first line).
>
> [...]

In the past, I've avoided the string copy overhead inherent in split() by using buffers... I've always wondered why Python didn't use buffer-type tricks internally for split-type operations. I haven't looked at Python's string implementation, but the fact that strings are immutable surely means that you can safely and efficiently reference an implementation-level data object for all strings... i.e. all strings are buffers.

The only problem I can see with this is that huge data objects might hang around just because some small fragment of them is still referenced by a string. Surely a simple heuristic or two like "if len(string) < len(data)/8: copy data; else: reference data" would go a long way towards avoiding that.

In my limited playing around with manipulating strings and benchmarking stuff, the biggest overhead is nearly always the copies.

--
Donovan Baarda [EMAIL PROTECTED]
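Donovan's buffer trick survives in Python 3 as memoryview: slicing a view references the original bytes object instead of copying, which also demonstrates the retention problem he mentions (the small slice keeps the whole object alive):

```python
data = bytes(range(256)) * 1024      # 256 KiB backing object
view = memoryview(data)[100:200]     # no byte copying happens here

# the slice pins the entire backing object, not just 100 bytes
assert view.obj is data
# the bytes seen through the view match a copying slice
assert bytes(view) == data[100:200]
```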
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
M.-A. Lemburg wrote:

> Walter Dörwald wrote:
>> I wonder if we should switch back to a simple readline() implementation for those codecs that don't require the current implementation (basically every charmap codec).
>
> That would be my preference as well. The 2.4 .readline() approach is really only needed for codecs that have to deal with encodings that:
>
> a) use multi-byte formats, or
> b) support more line-end formats than just CR, CRLF, LF, or
> c) are stateful.
>
> This can easily be had by using a mix-in class for codecs which do need the buffered .readline() approach.

Should this be a mix-in, or should we simply have two base classes? Which of those bases/mix-ins should be the default?

> [discussion of the Unicode line-end code points and the lack of bug reports in Python 1.6 - 2.3 snipped]

True.

Bye,
Walter Dörwald
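Leaving the mix-in vs. two-base-classes question aside, the simple variant can be sketched in modern Python 3 (class names and attributes are invented for illustration; this is not the codecs.py API): codecs outside MAL's three categories just delegate line splitting to the underlying byte stream.

```python
import io

class SimpleReadlineMixin:
    """Delegate line splitting to the byte stream, then decode.
    Safe only for stateless single-byte codecs whose line ends
    are the plain CR/LF the stream already understands."""
    def readline(self):
        return self.raw.readline().decode(self.encoding)

class Latin1Reader(SimpleReadlineMixin):
    # a charmap codec: one byte per character, no state
    encoding = "latin-1"
    def __init__(self, raw):
        self.raw = raw

r = Latin1Reader(io.BytesIO(b"caf\xe9\nsecond line"))
assert r.readline() == "caf\xe9\n"
```

A UTF-16 reader could not use this mix-in, since a byte-stream readline() would split inside two-byte code units; that is exactly category a).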
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Walter Dörwald wrote:

> At least it would remove the quadratic number of calls to _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only once.

Correct. However, I very much doubt that this is the cause of the slowdown.

> The last part of the patch seems to be more related to bug #1235646.

You mean the last chunk (linebuffer=None)? This is just the extension to reset.

> With the patch, test_pep263 and test_codecs fail (and test_parser, but this might be unrelated):

Oops, I thought I ran the test suite, but apparently with the patch removed. New version uploaded.

> Using collections.deque() should get rid of this problem.

Alright. There are so many types in Python I've never heard of :-)

> You mean, in the test suite?

Right.

> BTW, why the decode() call? For a Python without unicode?

Right. I'm not sure whether people think this should still be supported, but I keep supporting it whenever I think of it.

> I wonder what happens if calls to read() and readline() are mixed (e.g. if I'm reading Fortran source or anything with a fixed line header): read() would be used to read the first n characters (which joins the line buffer) and readline() reads the rest (which would split it again), etc. (Of course this could be done via a single readline() call.)

Then performance would drop again - it should still be correct, though. If this becomes a frequent problem, we could satisfy read requests from the split lines as well (i.e. join as many lines as you need). However, I would rather expect that callers of read() typically want the entire file, or want to read in large chunks (with no line orientation at all).

> But I think a maxsplit argument for splitlines() would make sense independent of this problem.

I'm not so sure anymore. It is good for consistency, but I doubt there are actual use cases: how often do you want only the first n lines of some string? Reading the first n lines of a file might be an application, but then you would rather use .readline() directly.

For readline, I don't think there is a clear case for splitting off only the first line (unless you want to return an index instead of the rest string): if the application eventually wants all of the data, we'd better split it right away into individual strings, instead of dealing with a gradually decreasing trailer.

Anyway, I don't think we should go back to C's readline/fgets. This is just too messy wrt. buffering and text vs. binary mode. I wish Python would stop using stdio entirely.

Regards,
Martin
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Martin v. Löwis wrote:

> Walter Dörwald wrote:
>> At least it would remove the quadratic number of calls to _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only once.
>
> Correct. However, I very much doubt that this is the cause of the slowdown.

Probably. We'd need a test with the original Argon source to really know.

>> The last part of the patch seems to be more related to bug #1235646.
>
> You mean the last chunk (linebuffer=None)? This is just the extension to reset.

Ouch, you're right: that part of the cvs diff was part of my checkout, not your patch. I have so many Python checkouts that I sometimes forget which is which! ;)

>> With the patch, test_pep263 and test_codecs fail (and test_parser, but this might be unrelated):
>
> Oops, I thought I ran the test suite, but apparently with the patch removed. New version uploaded.

Looks much better now.

>> Using collections.deque() should get rid of this problem.
>
> Alright. There are so many types in Python I've never heard of :-)

The problem is that unicode.splitlines() returns a list, so the push/pop performance advantage of collections.deque might be eaten by having to create a collections.deque object in the first place.

>> BTW, why the decode() call? For a Python without unicode?
>
> Right. I'm not sure whether people think this should still be supported, but I keep supporting it whenever I think of it.

OK, so should we add this for 2.4.2 or only for 2.5? Should this really be put into string.py, or should it be a class attribute of unicode? (At least that's what was proposed for the other strings in string.py (string.whitespace etc.) too.)

> Then performance would drop again - it should still be correct, though. If this becomes a frequent problem, we could satisfy read requests from the split lines as well (i.e. join as many lines as you need). However, I would rather expect that callers of read() typically want the entire file, or want to read in large chunks (with no line orientation at all).

Agreed! Don't fix a bug that hasn't been reported! ;)

>> But I think a maxsplit argument for splitlines() would make sense independent of this problem.
>
> I'm not so sure anymore. It is good for consistency, but I doubt there are actual use cases: how often do you want only the first n lines of some string? Reading the first n lines of a file might be an application, but then you would rather use .readline() directly.

Not every unicode string is read from a StreamReader.

> For readline, I don't think there is a clear case for splitting off only the first line (unless you want to return an index instead of the rest string): if the application eventually wants all of the data, we'd better split it right away into individual strings, instead of dealing with a gradually decreasing trailer.

True, this would be best for a readline loop. Another solution would be to have a unicode.itersplitlines() and store the iterator. Then we wouldn't need a maxsplit, because you can simply stop iterating once you have what you want.

> Anyway, I don't think we should go back to C's readline/fgets. This is just too messy wrt. buffering and text vs. binary mode.

I don't know about C's readline, but StreamReader.read() and StreamReader.readline() are messy enough. But at least it's something we can fix ourselves.

> I wish Python would stop using stdio entirely.

So reverting to the 2.3 behaviour for simple codecs is out?

Bye,
Walter Dörwald
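Walter's itersplitlines() never made it into the language either, but the idea is easy to sketch as a generator (`itersplitlines` is hypothetical; \v/\f are again omitted for brevity):

```python
import re

# the eight terminators from this thread
_LB = re.compile("[\n\r\x1c\x1d\x1e\x85\u2028\u2029]")

def itersplitlines(s):
    """Hypothetical lazy version of splitlines(): yields lines one at
    a time, so a caller can stop after the first line without the rest
    of the string ever being scanned or copied."""
    pos = 0
    while pos < len(s):
        m = _LB.search(s, pos)
        if m is None:
            yield s[pos:]
            return
        end = m.end()
        # treat \r\n as a single terminator
        if s[m.start()] == "\r" and end < len(s) and s[end] == "\n":
            end += 1
        yield s[pos:m.start()]
        pos = end

assert list(itersplitlines("a\nb\r\nc")) == ["a", "b", "c"]
# stopping early: only the first line is materialized
assert next(itersplitlines("first\nrest of the data")) == "first"
```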
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
Walter Dörwald wrote:
> Right. Not sure what people think about whether this should still be
> supported, but I keep supporting it whenever I think of it. OK, so
> should we add this for 2.4.2 or only for 2.5?

You mean, string.unicodelinebreaks? I think something needs to be done
to fix the performance problem. In doing so, API changes might occur. We
should not add API changes in 2.4.2 unless they contribute to the bug
fix, and even then, the release manager probably needs to approve them
(in any case, they certainly need to be backwards compatible).

> Should this really be put into string.py, or should it be a class
> attribute of unicode? (At least that's what was proposed for the other
> strings in string.py (string.whitespace etc.) too.)

If the 2.4.2 fix is based on this kind of data, I think it should go
into a private attribute of codecs.py. For 2.5, I would put it into
string.py for tradition. There is no point in having some of these
constants in string.py and others as class attributes (unless we also
add them as class attributes in 2.5, in which case adding
unicodelinebreaks into string.py would be pointless). So I think in 2.5,
I would like to see

    # string.py
    ascii_letters = str.ascii_letters

in which case unicode.linebreaks would be the right spelling.

>> I'm not so sure anymore. It is good for consistency, but I doubt
>> there are actual use cases: how often do you want only the first n
>> lines of some string? Reading the first n lines of a file might be an
>> application, but then you would rather use .readline() directly.
> Not every unicode string is read from a StreamReader.

Sure: but how often do you want to fetch the first line of a Unicode
string you happen to have in memory, without iterating over all lines
eventually?

> Another solution would be to have a unicode.itersplitlines() and store
> the iterator. Then we wouldn't need a maxsplit, because you can simply
> stop iterating once you have what you want.

That might work.
I would then ask for itersplitlines to return pairs of (line,
truncated), so you can easily know whether you merely ran into the end
of the string or whether you got a complete line (although it might be a
bit too specific for the readlines() case).

> So reverting to the 2.3 behaviour for simple codecs is out?

I'm -1, at least. It would also fix the problem at hand, for the
reported case. However, it does leave some codecs in the cold, most
notably UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8
is built-in in the parser). I think the UTF-8 stream reader should
support all Unicode line breaks, so it should continue to use the Python
approach. However, UTF-8 is fairly common, so reading a UTF-8-encoded
file line-by-line shouldn't suck.

Regards,
Martin
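[Editorial note: the string.unicodelinebreaks constant debated above was never adopted under that name; a hypothetical sketch of what it might have looked like, limited to the terminators named in this thread (Python's splitlines() also treats \x0b and \x0c as line breaks), could be:]

```python
# Hypothetical module-level constant, in the style of string.whitespace:
# \n, \r, plus the extra terminators unicode.splitlines() recognizes
# that were listed in this thread (FS, GS, RS, NEL, LS, PS).
unicodelinebreaks = '\n\r\x1c\x1d\x1e\x85\u2028\u2029'

def has_terminator(line):
    # True if the line ends in one of the recognized terminators, i.e.
    # it is a complete (non-truncated) line in the sense discussed above.
    return bool(line) and line[-1] in unicodelinebreaks
```

Spelled as a class attribute instead (unicode.linebreaks, per Martin's preference), the same data would be reachable without importing string.py.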
Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)
On 24.08.2005, at 21:15, Martin v. Löwis wrote:

>> Right. Not sure what people think about whether this should still be
>> supported, but I keep supporting it whenever I think of it. OK, so
>> should we add this for 2.4.2 or only for 2.5?
> You mean, string.unicodelinebreaks?

Yes.

> I think something needs to be done to fix the performance problem. In
> doing so, API changes might occur. We should not add API changes in
> 2.4.2 unless they contribute to the bug fix, and even then, the release
> manager probably needs to approve them (in any case, they certainly
> need to be backwards compatible).

OK. Your version of the patch (without replacing

    line = line.splitlines(False)[0]

with something better) might be enough for 2.4.2.

>> Should this really be put into string.py, or should it be a class
>> attribute of unicode? (At least that's what was proposed for the
>> other strings in string.py (string.whitespace etc.) too.)
> If the 2.4.2 fix is based on this kind of data, I think it should go
> into a private attribute of codecs.py.

I think codecs.unicodelinebreaks has one big problem: it will not work
for codecs that do str-to-str decoding.

> For 2.5, I would put it into string.py for tradition. There is no
> point in having some of these constants in string.py and others as
> class attributes (unless we also add them as class attributes in 2.5,
> in which case adding unicodelinebreaks into string.py would be
> pointless). So I think in 2.5, I would like to see
>
>     # string.py
>     ascii_letters = str.ascii_letters
>
> in which case unicode.linebreaks would be the right spelling.

And it would have the advantage that it could work both with str and
unicode, if we had both str.linebreaks and unicode.linebreaks.

>>> I'm not so sure anymore. It is good for consistency, but I doubt
>>> there are actual use cases: how often do you want only the first n
>>> lines of some string? Reading the first n lines of a file might be
>>> an application, but then you would rather use .readline() directly.
>> Not every unicode string is read from a StreamReader.
> Sure: but how often do you want to fetch the first line of a Unicode
> string you happen to have in memory, without iterating over all lines
> eventually?

I don't know. The only obvious spot in the standard library (apart from
codecs.py) seems to be

    def shortdescription(self):
        return self.description().splitlines()[0]

in Lib/plat-mac/pimp.py.

>> Another solution would be to have a unicode.itersplitlines() and
>> store the iterator. Then we wouldn't need a maxsplit, because you can
>> simply stop iterating once you have what you want.
> That might work. I would then ask for itersplitlines to return pairs
> of (line, truncated), so you can easily know whether you merely ran
> into the end of the string or whether you got a complete line
> (although it might be a bit too specific for the readlines() case).

Or maybe (line, terminatorlength), which gives you the same info
(terminatorlength == 0 means truncated) and makes it easy to strip the
terminator.

>> So reverting to the 2.3 behaviour for simple codecs is out?
> I'm -1, at least. It would also fix the problem at hand, for the
> reported case. However, it does leave some codecs in the cold, most
> notably UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8
> is built-in in the parser).

You meant PEP 263, right?

> I think the UTF-8 stream reader should support all Unicode line
> breaks, so it should continue to use the Python approach.

OK.

> However, UTF-8 is fairly common, so reading a UTF-8-encoded file
> line-by-line shouldn't suck.

OK, so what's missing is a solution for str-to-str codecs (or we keep

    line = line.splitlines(False)[0]

and test whether this is fast enough).

Bye,
Walter Dörwald
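[Editorial note: Walter's (line, terminatorlength) variant of the proposed itersplitlines() can be sketched as a generator. This is a hypothetical illustration; the terminator set is limited to those named in this thread, while Python's real splitlines() recognizes a couple more (\x0b and \x0c).]

```python
# Terminators discussed in this thread: \n, \r (and \r\n), FS, GS, RS,
# NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
LINEBREAKS = '\n\r\x1c\x1d\x1e\x85\u2028\u2029'

def itersplitlines(s):
    """Yield (line, terminatorlength) pairs. terminatorlength == 0 marks
    a truncated final line; otherwise line[:-terminatorlength] strips
    the terminator without re-scanning the string."""
    start = i = 0
    n = len(s)
    while i < n:
        if s[i] in LINEBREAKS:
            end = i + 1
            if s[i] == '\r' and end < n and s[end] == '\n':
                end += 1  # \r\n counts as one two-character terminator
            yield s[start:end], end - i
            start = i = end
        else:
            i += 1
    if start < n:
        yield s[start:], 0  # string ended without a line break
```

A caller that wants only the first line can stop after one iteration (no maxsplit needed), and `line[:-tlen or None]` strips the terminator whether or not one is present.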