Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
On 24.08.2005 at 21:15, Martin v. Löwis wrote: > Walter Dörwald wrote: > > >>> Right. Not sure what people think whether this should still be >>> supported, but I keep supporting it whenever I think of it. >>> >> >> OK, so should we add this for 2.4.2 or only for 2.5? >> > > You mean, string.unic

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote: >> Right. Not sure what people think whether this should still be >> supported, but I keep supporting it whenever I think of it. > > > OK, so should we add this for 2.4.2 or only for 2.5? You mean, string.unicodelinebreaks? I think something needs to be done to fix the perf

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: > >>At least it would remove the quadratic number of calls to >>_PyUnicodeUCS2_IsLinebreak(). For each character it would be called only >>once. > > Correct. However, I very much doubt that this is the cause of the > slowdown. Probably. We'd need a

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote: > At least it would remove the quadratic number of calls to > _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only > once. Correct. However, I very much doubt that this is the cause of the slowdown. > The last part of the patch seems to be more related to

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
M.-A. Lemburg wrote: > Walter Dörwald wrote: > >>I wonder if we should switch back to a simple readline() implementation >>for those codecs that don't require the current implementation >>(basically every charmap codec). > > That would be my preference as well. The 2.4 .readline() approach >

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Donovan Baarda
On Wed, 2005-08-24 at 07:33, "Martin v. Löwis" wrote: > Walter Dörwald wrote: > > Martin v. Löwis wrote: > > > >> Walter Dörwald wrote: [...] > Actually, on a second thought - it would not remove the quadratic > aspect. You would still copy the rest string completely on each > split. So on the fir

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: > >>Martin v. Löwis wrote: >> >>>Walter Dörwald wrote: >>> I think a maxsplit argument (just as for unicode.split()) would help too. >>> >>>Correct - that would allow to get rid of the quadratic part. >> >>OK, such a patch should be rather si

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote: > Martin v. Löwis wrote: > >> Walter Dörwald wrote: >> >>> I think a maxsplit argument (just as for unicode.split()) would help >>> too. >> >> >> Correct - that would allow to get rid of the quadratic part. > > > OK, such a patch should be rather simple. I'll give it a try.

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: > >>I think a maxsplit argument (just as for unicode.split()) would help too. > > Correct - that would allow to get rid of the quadratic part. OK, such a patch should be rather simple. I'll give it a try. > We should also strive for avoiding the s

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote: > I think a maxsplit argument (just as for unicode.split()) would help too. Correct - that would allow to get rid of the quadratic part. We should also strive for avoiding the second copy of the line, if the user requested keepends. I wonder whether it would be worthwhile to
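The maxsplit idea discussed above can be illustrated with a minimal sketch. It is written against Python 3's str type rather than the 2.4 unicode type, and it is not the patch from this thread. The names readline_split_all, readline_first_only and read_all are hypothetical, and readline_first_only only stands in for what a splitlines() call limited to one split might do. LINEBREAKS lists the line boundaries that str.splitlines() recognises in current Python 3.

```python
# Line boundaries recognised by str.splitlines() in current Python 3.
LINEBREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def readline_split_all(buf):
    """Split the whole buffer, keep the first line, rejoin the rest.

    Every call rescans the complete remaining buffer, so reading all
    lines one at a time costs a quadratic number of line break checks.
    """
    lines = buf.splitlines(True)
    if not lines:
        return "", ""
    return lines[0], "".join(lines[1:])


def readline_first_only(buf):
    """Hypothetical stand-in for a splitlines() limited to one split.

    Scanning stops at the first line break, so each character is
    checked only once over the whole read.
    """
    for i, ch in enumerate(buf):
        if ch in LINEBREAKS:
            end = i + 1
            # Treat "\r\n" as a single line ending.
            if ch == "\r" and buf[end:end + 1] == "\n":
                end += 1
            return buf[:end], buf[end:]
    return buf, ""


def read_all(buf, readline):
    """Drain the buffer one line at a time using the given strategy."""
    out = []
    while buf:
        line, buf = readline(buf)
        out.append(line)
    return out
```

Both strategies return the same lines. The second one removes the repeated scanning, but it still copies the remainder of the buffer on every call, which is the objection raised elsewhere in this thread.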

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread M.-A. Lemburg
Martin v. Löwis wrote: > M.-A. Lemburg wrote: > >>I think it's worthwhile reconsidering this approach for >>character type queries that do not involve a huge number >>of code points. > > > I would advise against that. I measure both versions > (your version called PyUnicode_IsLinebreak2) with the

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: > >>This is caused by the changes to the codecs in 2.4. Basically the codecs >>no longer rely on C's readline() to do line splitting (which can't work >>for UTF-16), but do it themselves (via unicode.splitlines()). > > That explains why you get an

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
M.-A. Lemburg wrote: > I think it's worthwhile reconsidering this approach for > character type queries that do not involve a huge number > of code points. I would advise against that. I measure both versions (your version called PyUnicode_IsLinebreak2) with the following code volatile int result;

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread M.-A. Lemburg
Walter Dörwald wrote: > I wonder if we should switch back to a simple readline() implementation > for those codecs that don't require the current implementation > (basically every charmap codec). That would be my preference as well. The 2.4 .readline() approach is really only needed for codecs

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote: > This is caused by the changes to the codecs in 2.4. Basically the codecs > no longer rely on C's readline() to do line splitting (which can't work > for UTF-16), but do it themselves (via unicode.splitlines()). That explains why you get any calls to IsLineBreak; it doesn'
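The remark that byte-level line splitting can't work for UTF-16 can be shown with a small Python 3 sketch. It is an illustration only and is not taken from the codecs module. The helper names naive_byte_lines and decoded_lines are hypothetical, and the sample text is made up. It relies on the fact that the byte 0x0A also occurs inside the UTF-16 encoding of characters other than the newline, for example U+010A.

```python
def naive_byte_lines(raw):
    """Split encoded bytes on the newline byte, as a byte-oriented
    readline() would do."""
    return raw.split(b"\n")


def decoded_lines(raw, encoding):
    """Decode first, then split on line boundaries of the decoded text."""
    return raw.decode(encoding).splitlines(True)


# U+010A encodes to the bytes 0x0A 0x01 in UTF-16-LE, so its first
# byte looks like a newline to byte-oriented code.
text = "caf\u010a line one\nline two\n"
raw = text.encode("utf-16-le")

byte_chunks = naive_byte_lines(raw)      # cut in the middle of U+010A
text_lines = decoded_lines(raw, "utf-16-le")
```

Here the byte-level split cuts the data inside a character and leaves the high byte of each newline at the start of the following chunk, while splitting after decoding gives the two expected lines.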

Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Keir Mierle wrote: > Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg > Wilson this summer > > We're having a very strange problem with Python's unicode parsing of source > files. Basically, our CGI script was running extremely slowly on our > production > box (a pokey du

[Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-23 Thread Keir Mierle
Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg Wilson this summer We're having a very strange problem with Python's unicode parsing of source files. Basically, our CGI script was running extremely slowly on our production box (a pokey dual-Xeon 3GHz w/ 4GB RAM and 15K SCS