On 24.08.2005 at 21:15, Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>
>>> Right. Not sure what people think whether this should still be
>>> supported, but I keep supporting it whenever I think of it.
>>>
>>
>> OK, so should we add this for 2.4.2 or only for 2.5?
>>
>
> You mean, string.unic
Walter Dörwald wrote:
>> Right. Not sure what people think whether this should still be
>> supported, but I keep supporting it whenever I think of it.
>
>
> OK, so should we add this for 2.4.2 or only for 2.5?
You mean, string.unicodelinebreaks? I think something needs to be
done to fix the perf
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>>At least it would remove the quadratic number of calls to
>>_PyUnicodeUCS2_IsLinebreak(). For each character it would be called only
>>once.
>
> Correct. However, I very much doubt that this is the cause of the
> slowdown.
Probably. We'd need a
Walter Dörwald wrote:
> At least it would remove the quadratic number of calls to
> _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only
> once.
Correct. However, I very much doubt that this is the cause of the
slowdown.
> The last part of the patch seems to be more related to
M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>
>>I wonder if we should switch back to a simple readline() implementation
>>for those codecs that don't require the current implementation
>>(basically every charmap codec).
>
> That would be my preference as well. The 2.4 .readline() approach
>
On Wed, 2005-08-24 at 07:33, "Martin v. Löwis" wrote:
> Walter Dörwald wrote:
> > Martin v. Löwis wrote:
> >
> >> Walter Dörwald wrote:
[...]
> Actually, on a second thought - it would not remove the quadratic
> aspect. You would still copy the rest string completely on each
> split. So on the fir
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>>Martin v. Löwis wrote:
>>
>>>Walter Dörwald wrote:
>>>
>>>>I think a maxsplit argument (just as for unicode.split()) would help
>>>>too.
>>>
>>>Correct - that would allow us to get rid of the quadratic part.
>>
>>OK, such a patch should be rather si
Walter Dörwald wrote:
> Martin v. Löwis wrote:
>
>> Walter Dörwald wrote:
>>
>>> I think a maxsplit argument (just as for unicode.split()) would help
>>> too.
>>
>>
>> Correct - that would allow us to get rid of the quadratic part.
>
>
> OK, such a patch should be rather simple. I'll give it a try.
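
A minimal sketch of the idea under discussion, written in modern Python 3 rather than the 2.4 code base, and not taken from any actual patch. The function names are hypothetical, and only "\n", "\r" and "\r\n" are treated as line breaks (the real unicode.splitlines() recognises more). The first function splits the whole remaining buffer just to obtain one line; the second takes one line per step and advances an offset, so each character is tested for a line break once and the rest of the buffer is never copied.

```python
def read_lines_resplit(buf):
    """Illustrative quadratic pattern: split everything to get one line."""
    lines = []
    while buf:
        parts = buf.splitlines(True)   # splits the *whole* remaining buffer
        lines.append(parts[0])         # only the first line is used
        buf = "".join(parts[1:])       # the rest is copied again
    return lines


def read_lines_first_only(buf):
    """Take one line per step, moving an offset instead of copying the rest."""
    lines = []
    pos = 0
    n = len(buf)
    while pos < n:
        end = pos
        # Scan forward to the next line break (simplified set: \r and \n).
        while end < n and buf[end] not in "\r\n":
            end += 1
        if end < n:
            # Consume the line break, treating "\r\n" as a single break.
            if buf[end] == "\r" and end + 1 < n and buf[end + 1] == "\n":
                end += 2
            else:
                end += 1
        lines.append(buf[pos:end])     # copy only the line itself
        pos = end
    return lines
```

Both functions return the same list for input that uses only these three line endings; they differ in how much work is repeated per line.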
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>>I think a maxsplit argument (just as for unicode.split()) would help too.
>
> Correct - that would allow us to get rid of the quadratic part.
OK, such a patch should be rather simple. I'll give it a try.
> We should also strive for avoiding the s
Walter Dörwald wrote:
> I think a maxsplit argument (just as for unicode.split()) would help too.
Correct - that would allow us to get rid of the quadratic part.
We should also strive for avoiding the second copy of the line,
if the user requested keepends.
I wonder whether it would be worthwhile to
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>I think it's worthwhile reconsidering this approach for
>>character type queries that do not involve a huge number
>>of code points.
>
>
> I would advise against that. I measure both versions
> (your version called PyUnicode_IsLinebreak2) with the
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>>This is caused by the changes to the codecs in 2.4. Basically the codecs
>>no longer rely on C's readline() to do line splitting (which can't work
>>for UTF-16), but do it themselves (via unicode.splitlines()).
>
> That explains why you get an
M.-A. Lemburg wrote:
> I think it's worthwhile reconsidering this approach for
> character type queries that do not involve a huge number
> of code points.
I would advise against that. I measure both versions
(your version called PyUnicode_IsLinebreak2) with the
following code
volatile int result;
Walter Dörwald wrote:
> I wonder if we should switch back to a simple readline() implementation
> for those codecs that don't require the current implementation
> (basically every charmap codec).
That would be my preference as well. The 2.4 .readline() approach
is really only needed for codecs
Walter Dörwald wrote:
> This is caused by the changes to the codecs in 2.4. Basically the codecs
> no longer rely on C's readline() to do line splitting (which can't work
> for UTF-16), but do it themselves (via unicode.splitlines()).
That explains why you get any calls to IsLineBreak; it doesn'
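
A small illustration, in modern Python 3 and with hypothetical function names, of why byte-level line splitting cannot work for UTF-16 as stated in the quoted text. In UTF-16 every code unit is two bytes, so cutting at the single byte 0x0A leaves the pieces misaligned, and 0x0A can also occur as half of an unrelated character such as U+010A.

```python
def split_bytes_on_newline(text, encoding="utf-16-le"):
    """Split the *encoded* bytes at 0x0A, as a byte-level readline would."""
    return text.encode(encoding).split(b"\n")


def split_after_decoding(text, encoding="utf-16-le"):
    """Decode first, then split on line breaks at the character level."""
    return text.encode(encoding).decode(encoding).splitlines(True)


# "a\nb" is b'a\x00\n\x00b\x00' in UTF-16-LE; cutting at 0x0A strands
# the high byte of the newline at the start of the second piece.
pieces = split_bytes_on_newline("a\nb")
print(pieces)                       # [b'a\x00', b'\x00b\x00']

# U+010A encodes as b'\n\x01', so it is cut in half although the
# text contains no line break at all.
print(split_bytes_on_newline("\u010a"))   # [b'', b'\x01']

print(split_after_decoding("a\nb"))       # ['a\n', 'b']
```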
Keir Mierle wrote:
> Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg
> Wilson this summer
>
> We're having a very strange problem with Python's unicode parsing of source
> files. Basically, our CGI script was running extremely slowly on our
> production
> box (a pokey du
Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg
Wilson this summer
We're having a very strange problem with Python's unicode parsing of source
files. Basically, our CGI script was running extremely slowly on our production
box (a pokey dual-Xeon 3GHz w/ 4GB RAM and 15K SCS
17 matches