Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Keir Mierle wrote:

 Hi, I'm working on Argon (http://www.third-bit.com/trac/argon) with Greg
 Wilson this summer.
 
 We're having a very strange problem with Python's unicode parsing of source
 files. Basically, our CGI script was running extremely slowly on our 
 production
 box (a pokey dual-Xeon 3GHz w/ 4GB RAM and 15K SCSI drives). Slow to the tune
 of 6-10 seconds per request. I eventually tracked this down to imports of our
 source tree; the actual request was completing in 300ms, the rest of the time
 was spent in __import__.

This is caused by the changes to the codecs in 2.4. Basically the codecs 
no longer rely on C's readline() to do line splitting (which can't work 
for UTF-16), but do it themselves (via unicode.splitlines()).

 After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was
 getting called 51 million times. Our code is 1.2 million characters, so I
 hardly think it makes sense to call IsLinebreak 50 times for each character;
 and we're not even importing our entire source tree on every invocation.

But if you're using CGI, you're importing your source on every 
invocation. Switching to a different server-side technology might help. 
Nevertheless, 50 million calls seems to be a bit much.

 Our code is a fork of Trac, and originally had these lines at the top:
 
 # -*- coding: iso8859-1 -*-  
 
 This made me suspicious, so I removed all of them. The CGI execution time
 immediately dropped to ~1 second. gprof revealed that
 _PyUnicodeUCS2_IsLinebreak is not called at all anymore.
 
 Now that our code works fast enough, I don't really care about this, but I
 thought python-dev might want to know something weird is going on with unicode
 splitlines.

I wonder if we should switch back to a simple readline() implementation 
for those codecs that don't require the current implementation 
(basically every charmap codec). AFAIK source files are opened in 
universal newline mode, so at least we'd get proper treatment of \n, 
\r and \r\n line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", 
u"\x85", u"\u2028" and u"\u2029" (which are line terminators according 
to unicode.splitlines()).
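
For illustration (standard 2.4 behaviour): unicode.splitlines() 
recognizes all of these terminators, while str.splitlines() only knows 
\r and \n:

>>> u"one\x1ctwo\x85three\u2028four".splitlines()
[u'one', u'two', u'three', u'four']
>>> "one\x1ctwo".splitlines()
['one\x1ctwo']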

 I documented my investigation of this problem; if anyone wants further 
 details,
 just email me. (I'm not on python-dev)
 http://www.third-bit.com/trac/argon/ticket/525

Bye,
Walter Dörwald


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote:
 This is caused by the changes to the codecs in 2.4. Basically the codecs 
 no longer rely on C's readline() to do line splitting (which can't work 
 for UTF-16), but do it themselves (via unicode.splitlines()).

That explains why you get any calls to IsLineBreak; it doesn't explain
why you get so many of them.

I investigated this a bit, and one issue seems to be that
StreamReader.readline performs splitline on the entire input, only to
fetch the first line. It then joins the rest for later processing.
In addition, it also performs splitlines on a single line, just to
strip any trailing line breaks.

The net effect is that, for a file with N lines, IsLineBreak is invoked
up to N*N/2 times per character (at least for the last character).
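
In rough pseudo-Python, the 2.4 pattern is (a simplified sketch, not the
verbatim codecs.py code):

def readline(self, keepends=True):
    data = self.charbuffer + self.read()
    lines = data.splitlines(True)           # scans *all* of data for breaks
    line = lines[0]                         # only this line is actually needed
    self.charbuffer = u"".join(lines[1:])   # rest is joined back for next time
    if line and not keepends:
        line = line.splitlines(False)[0]    # splitlines again, just to strip
    return line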

So I think it would be best if Unicode characters exposed a .islinebreak
method (or, failing that, codecs just knew what the line break
characters are in Unicode 3.2), and then codecs would split off
the first line of input itself.

After doing some gprof profiling, I discovered _PyUnicodeUCS2_IsLinebreak was
getting called 51 million times. Our code is 1.2 million characters, so I
hardly think it makes sense to call IsLinebreak 50 times for each character;
and we're not even importing our entire source tree on every invocation.
 
 
 But if you're using CGI, you're importing your source on every 
 invocation.

Well, no. Only the CGI script needs to be parsed every time; all modules
could load off bytecode files.

Which suggests that Keir Mierle doesn't use bytecode files; I think he
should.

Regards,
Martin


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread M.-A. Lemburg
Walter Dörwald wrote:
 I wonder if we should switch back to a simple readline() implementation 
 for those codecs that don't require the current implementation 
 (basically every charmap codec). 

That would be my preference as well. The 2.4 .readline() approach
is really only needed for codecs that have to deal with encodings
that:

a) use multi-byte formats, or
b) support more line-end formats than just CR, CRLF, LF, or
c) are stateful.

This can easily be had by using a mix-in class for
codecs which do need the buffered .readline() approach.
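
A rough sketch of how such a split could look (class and method names
invented for illustration, not an actual patch):

class SimpleReadlineMixin:
    """For stateless single-byte codecs whose only line ends are
    CR, CRLF and LF: let the byte stream split the lines."""
    def readline(self, size=None, keepends=True):
        if size is None:
            data = self.stream.readline()       # C level readline()
        else:
            data = self.stream.readline(size)
        line = self.decode(data, self.errors)[0]
        if not keepends:
            line = line.splitlines(False)[0]
        return line

class BufferedReadlineMixin:
    """For multi-byte, multi-line-end or stateful codecs: decode into
    a buffer and split with unicode.splitlines() (the 2.4 approach)."""
    # readline() as currently implemented in codecs.StreamReader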

 AFAIK source files are opened in 
 universal newline mode, so at least we'd get proper treatment of \n, 
 \r and \r\n line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", 
 u"\x85", u"\u2028" and u"\u2029" (which are line terminators according 
 to unicode.splitlines()).

While the Unicode standard defines these characters as line
end code points, I think their definition does not necessarily
apply to data that is converted from a certain encoding to
Unicode, so that's not a big loss.

E.g. in ASCII or Latin-1, the FILE, GROUP and RECORD
SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85)
are not interpreted as line end characters.

Furthermore, we had no reports of anyone complaining in
Python 1.6, 2.0 - 2.3 that line endings were not detected
properly. All these Python versions relied on the stream's
.readline() method to get the next line. The only bug reports
we had were for UTF-16 which falls into the above
category a) and did not support .readline() until Python 2.4.

A note on the performance of _PyUnicode_IsLinebreak():
in Python 2.0 Fredrik changed this to use the two-step
lookup (reducing the size of the lookup tables considerably).

I think it's worthwhile reconsidering this approach for
character type queries that do not involve a huge number
of code points.

In Python 1.6 the function looked like this (and was
inlined by the compiler using its own fast lookup
table):

int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x000A: /* LINE FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
        return 1;
    default:
        return 0;
    }
}

Another candidate to convert back is:

int _PyUnicode_IsWhitespace(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x0009: /* HORIZONTAL TABULATION */
    case 0x000A: /* LINE FEED */
    case 0x000B: /* VERTICAL TABULATION */
    case 0x000C: /* FORM FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x001F: /* UNIT SEPARATOR */
    case 0x0020: /* SPACE */
    case 0x0085: /* NEXT LINE */
    case 0x00A0: /* NO-BREAK SPACE */
    case 0x1680: /* OGHAM SPACE MARK */
    case 0x2000: /* EN QUAD */
    case 0x2001: /* EM QUAD */
    case 0x2002: /* EN SPACE */
    case 0x2003: /* EM SPACE */
    case 0x2004: /* THREE-PER-EM SPACE */
    case 0x2005: /* FOUR-PER-EM SPACE */
    case 0x2006: /* SIX-PER-EM SPACE */
    case 0x2007: /* FIGURE SPACE */
    case 0x2008: /* PUNCTUATION SPACE */
    case 0x2009: /* THIN SPACE */
    case 0x200A: /* HAIR SPACE */
    case 0x200B: /* ZERO WIDTH SPACE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
    case 0x202F: /* NARROW NO-BREAK SPACE */
    case 0x3000: /* IDEOGRAPHIC SPACE */
        return 1;
    default:
        return 0;
    }
}

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
M.-A. Lemburg wrote:
 I think it's worthwhile reconsidering this approach for
 character type queries that do no involve a huge number
 of code points.

I would advise against that. I measured both versions
(your version called PyUnicode_IsLinebreak2) with the
following code:

volatile int result;
void unibench()
{
#define REPS 10000000000LL   /* 1e10; cf. the timings below */
    long long i;
    clock_t s1,s2,s3,s4,s5;
    s1 = clock();
    for (i = 0; i < REPS; i++)
        result = _PyUnicode_IsLinebreak('(');
    s2 = clock();
    for (i = 0; i < REPS; i++)
        result = PyUnicode_IsLinebreak2('(');
    s3 = clock();
    for (i = 0; i < REPS; i++)
        result = _PyUnicode_IsLinebreak('\n');
    s4 = clock();
    for (i = 0; i < REPS; i++)
        result = PyUnicode_IsLinebreak2('\n');
    s5 = clock();
    printf("f1, (: %d\nf2, (: %d\nf1, CR: %d\n, f2, CR: %d\n",
           (int)(s2-s1),(int)(s3-s2),(int)(s4-s3),(int)(s5-s4));
}

and got these numbers:

f1, (: 1321
f2, (: 1330
f1, CR: 1322
, f2, CR: 1325

What can be seen is that the performance of the two versions is nearly
identical, with the code currently used being slightly better.
What can also be seen is that, on my machine, 1e10 calls to
IsLinebreak take 13.2 seconds. So 51 million calls take about 70ms.

The reported performance problem is more likely in the allocation
of all these splitlines results, and the copying of the same
strings over and over again.

Regards,
Martin


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote:

 Walter Dörwald wrote:
 
This is caused by the changes to the codecs in 2.4. Basically the codecs 
no longer rely on C's readline() to do line splitting (which can't work 
for UTF-16), but do it themselves (via unicode.splitlines()).
 
 That explains why you get any calls to IsLineBreak; it doesn't explain
 why you get so many of them.
 
 I investigated this a bit, and one issue seems to be that
 StreamReader.readline performs splitline on the entire input, only to
 fetch the first line. It then joins the rest for later processing.
 In addition, it also performs splitlines on a single line, just to
 strip any trailing line breaks.

This is because unicode.splitlines() is the only API available to Python 
that knows about unicode line feeds.

 The net effect is that, for a file with N lines, IsLineBreak is invoked
 up to N*N/2 times per character (at least for the last character).
 
 So I think it would be best if Unicode characters exposed a .islinebreak
 method (or, failing that, codecs just knew what the line break
 characters are in Unicode 3.2), and then codecs would split off
 the first line of input itself.

I think a maxsplit argument (just as for unicode.split()) would help too.
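
I.e. something like this (sketch only; the second argument to
splitlines() is the hypothetical maxsplit):

def readline(self, keepends=True):
    data = self.charbuffer + self.read()
    parts = data.splitlines(True, 1)    # stop scanning after the first break
    line = parts[0]
    if len(parts) > 1:
        self.charbuffer = parts[1]      # rest stays one unsplit string, no join
    else:
        self.charbuffer = u""
    if line and not keepends:
        line = line.splitlines(False)[0]
    return line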

 [...]

Bye,
Walter Dörwald


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 M.-A. Lemburg wrote:
 
I think it's worthwhile reconsidering this approach for
character type queries that do not involve a huge number
of code points.
 
 
 I would advise against that. I measured both versions
 (your version called PyUnicode_IsLinebreak2) with the
 following code:
 
 volatile int result;
 void unibench()
 {
 #define REPS 10000000000LL   /* 1e10; cf. the timings below */
     long long i;
     clock_t s1,s2,s3,s4,s5;
     s1 = clock();
     for (i = 0; i < REPS; i++)
         result = _PyUnicode_IsLinebreak('(');
     s2 = clock();
     for (i = 0; i < REPS; i++)
         result = PyUnicode_IsLinebreak2('(');
     s3 = clock();
     for (i = 0; i < REPS; i++)
         result = _PyUnicode_IsLinebreak('\n');
     s4 = clock();
     for (i = 0; i < REPS; i++)
         result = PyUnicode_IsLinebreak2('\n');
     s5 = clock();
     printf("f1, (: %d\nf2, (: %d\nf1, CR: %d\n, f2, CR: %d\n",
            (int)(s2-s1),(int)(s3-s2),(int)(s4-s3),(int)(s5-s4));
 }
 
 and got these numbers:
 
 f1, (: 1321
 f2, (: 1330
 f1, CR: 1322
 , f2, CR: 1325
 
 What can be seen is that the performance of the two versions is nearly
 identical, with the code currently used being slightly better.
 What can also be seen is that, on my machine, 1e10 calls to
 IsLinebreak take 13.2 seconds. So 51 million calls take about 70ms.

Your test is somewhat biased: the current solution
works using type records, so it has to swap in a new
record for each character you test. In your benchmark,
the same character is tested over and over again,
so the type record is likely already stored in the
CPU cache.

The .splitlines() routine itself calls the above
function for each and every character in the string,
so quite a few of these type records have to be
looked up.

Here's a version that uses os.py as a basis:

#include <stdlib.h>
#include <time.h>
#include "Python.h"

int _PyUnicode_IsLinebreak16(register const Py_UNICODE ch)
{
    switch (ch) {
    case 0x000A: /* LINE FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x001C: /* FILE SEPARATOR */
    case 0x001D: /* GROUP SEPARATOR */
    case 0x001E: /* RECORD SEPARATOR */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
        return 1;
    default:
        return 0;
    }
}

#define REPS 10000           /* assumed repeat count */
#define BUFFERSIZE 30000     /* must be large enough to hold all of os.py */

int main(void)
{
    long i, j;
    clock_t s1,s2,s3;
    char *buffer;
    FILE *datafile;
    long filelen;
    int result;

    datafile = fopen("os.py", "rb");
    if (datafile == NULL) {
        printf("could not find os.py\n");
        return -1;
    }
    buffer = (char *)malloc(BUFFERSIZE);
    filelen = fread(buffer, 1, BUFFERSIZE, datafile);
    printf("filelen=%li bytes\n", filelen);

    s1 = clock();

    /* Python 2.4 */
    for (i = 0; i < REPS; i++)
        for (j = 0; j < filelen; j++)
            result = _PyUnicode_IsLinebreak((Py_UNICODE)buffer[j]);
    s2 = clock();

    /* Python 1.6 */
    for (i = 0; i < REPS; i++)
        for (j = 0; j < filelen; j++)
            result = _PyUnicode_IsLinebreak16((Py_UNICODE)buffer[j]);
    s3 = clock();

    printf("2.4: %d\n"
           "1.6: %d\n",
           (int)(s2-s1),
           (int)(s3-s2));
    return 0;
}

Output, compiled with -O3:

filelen=23147 bytes
2.4: 257
1.6: 123

That's a factor of 2.

 The reported performance problem is more likely in the allocation
 of all these splitlines results, and the copying of the same
 strings over and over again.

True.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote:
 I think a maxsplit argument (just as for unicode.split()) would help too.

Correct - that would allow us to get rid of the quadratic part.
We should also strive to avoid the second copy of the line
if the user requested keepends.

I wonder whether it would be worthwhile to cache the .splitlines result.
An application that has just invoked .readline() will likely invoke
.readline() again. If there is more than one line left, we could return
the first line right away (potentially trimming the line ending if
necessary). Only when a single line is left, we would attempt to
read more data. In a plain .read(), we would first join the lines
back.
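
A sketch of that caching idea (the linebuffer attribute is hypothetical,
assumed to start out as an empty list; a plain .read() would first join
it back):

def readline(self, keepends=True):
    if len(self.linebuffer) < 2:          # 0 or 1 lines left: read more data
        data = u"".join(self.linebuffer) + self.read()
        self.linebuffer = data.splitlines(True)   # split once, serve many times
    if not self.linebuffer:
        return u""                        # EOF
    line = self.linebuffer.pop(0)         # usually no I/O and no re-splitting
    if line and not keepends:
        line = line.splitlines(False)[0]  # trim the line ending
    return line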

Regards,
Martin


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote:

 Walter Dörwald wrote:
 
I think a maxsplit argument (just as for unicode.split()) would help too.
 
 Correct - that would allow us to get rid of the quadratic part.

OK, such a patch should be rather simple. I'll give it a try.

 We should also strive for avoiding the second copy of the line,
 if the user requested keepends.

Your suggested unicode method islinebreak() would help with that. Then 
we could add the following to the string module:

unicodelinebreaks = u"".join(unichr(c) for c in xrange(0, 
sys.maxunicode) if unichr(c).islinebreak())

Then

    if line and not keepends:
        line = line.splitlines(False)[0]

could be

    if line and not keepends:
        line = line.rstrip(string.unicodelinebreaks)

 I wonder whether it would be worthwhile to cache the .splitlines result.
 An application that has just invoked .readline() will likely invoke
 .readline() again. If there is more than one line left, we could return
 the first line right away (potentially trimming the line ending if
 necessary). Only when a single line is left, we would attempt to
 read more data. In a plain .read(), we would first join the lines
 back.

OK, this would mean we'd have to distinguish between a direct call to 
read() and one done by readline() (which we do anyway through the 
firstline argument).

Bye,
Walter Dörwald


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote:

 Walter Dörwald wrote:
 
Martin v. Löwis wrote:

Walter Dörwald wrote:

I think a maxsplit argument (just as for unicode.split()) would help
too.

Correct - that would allow to get rid of the quadratic part.

OK, such a patch should be rather simple. I'll give it a try.
 
  Actually, on second thought - it would not remove the quadratic
  aspect.

At least it would remove the quadratic number of calls to 
_PyUnicodeUCS2_IsLinebreak(). For each character it would be called only 
once.

 You would still copy the rest string completely on each
 split. So on the first split, you copy N lines (one result line,
 and N-1 lines into the rest string), on the second split, N-2
 lines, and so on, totalling N*N/2 line copies again.

OK, that's true.

We could prevent string copying if we kept the unsplit string and the 
position of the current line terminator, but this would require a "first 
position after a line terminator" method.

 The only
 thing you save is the join (as the rest is already joined), and
 the IsLineBreak calls (which are necessary only for the first
 line).
 
 Please see python.org/sf/1268314;

The last part of the patch seems to be more related to bug #1235646.

With the patch test_pep263 and test_codecs fail (and test_parser, but 
this might be unrelated):

python Lib/test/test_pep263.py gives the following output:

File "Lib/test/test_pep263.py", line 22
SyntaxError: list index out of range

test_codecs.py has the following two complaints:

File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", 
line 366, in readline
    self.charbuffer = lines[1] + self.charbuffer
IndexError: list index out of range

and

File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", 
line 336, in readline
    line = result.splitlines(False)[0]
NameError: global name 'result' is not defined

 it solves the problem by
 keeping the splitlines result. It only invokes IsLineBreak
 once per character, and also copies each character only once,
 and allocates each line only once, totalling O(N) for
 these operations. It still does contain a quadratic operation:
 the lines are stored in a list, and the first line is
 removed from the list with "del lines[0]". This copies N-1
 pointers, resulting in N*N/2 pointer copies. That should still
 be much faster than the current code.

Using collections.deque() should get rid of this problem.
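
collections.deque (new in 2.4) pops from the left in constant time,
e.g.:

from collections import deque

lines = deque(u"a\nb\nc\n".splitlines(True))
while lines:
    line = lines.popleft()   # O(1); "del lines[0]" shifts all remaining pointers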

unicodelinebreaks = u"".join(unichr(c) for c in xrange(0,
sys.maxunicode) if unichr(c).islinebreak())
 
 That is very inefficient. I would rather add a static list
 to the string module, and have a test that says
 
  assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c
  in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or
  unicodedata.category(ch)=='Zl')

You mean, in the test suite?

 unicodelinebreaks could then be defined as
 
  # u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
  '\n\r\x1c\x1d\x1e\xc2\x85\xe2\x80\xa8\xe2\x80\xa9'.decode("utf-8")

That might be better, as this definition won't change very often.

BTW, why the decode() call? For a Python without unicode?

OK, this would mean we'd have to distinguish between a direct call to
read() and one done by readline() (which we do anyway through the
firstline argument).
 
 See my patch. If we have cached lines, we don't need to call .read
 at all.

I wonder what happens if calls to read() and readline() are mixed (e.g. 
if I'm reading Fortran source or anything with a fixed line header): 
read() would be used to read the first n characters (which joins the line 
buffer) and readline() reads the rest (which would split it again), etc.
(Of course this could be done via a single readline() call.)

But, I think a maxsplit argument for splitlines() would make sense 
independent of this problem.

Bye,
Walter Dörwald


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Donovan Baarda
On Wed, 2005-08-24 at 07:33, Martin v. Löwis wrote:
 Walter Dörwald wrote:
  Martin v. Löwis wrote:
  
  Walter Dörwald wrote:
[...]
 Actually, on a second thought - it would not remove the quadratic
 aspect. You would still copy the rest string completely on each
 split. So on the first split, you copy N lines (one result line,
 and N-1 lines into the rest string), on the second split, N-2
 lines, and so on, totalling N*N/2 line copies again. The only
 thing you save is the join (as the rest is already joined), and
 the IsLineBreak calls (which are necessary only for the first
 line).
[...]

In the past, I've avoided the string copy overhead inherent in split()
by using buffers...

I've always wondered why Python didn't use buffer-type tricks internally
for split-type operations. I haven't looked at Python's string
implementation, but the fact that strings are immutable surely means
that you can safely and efficiently reference an implementation-level
data object for all strings... ie all strings are buffers.

The only problem I can see with this is that huge data objects might hang
around just because some small fragment of it is still referenced by a
string. Surely a simple heuristic or two like "if len(string) <
len(data)/8: copy data; else: reference data" would go a long way
towards avoiding that.
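
Python already exposes something like this at the Python level via the
buffer() builtin (an illustration of the idea, not of what the string
implementation does internally):

data = "x" * 1000000
frag = buffer(data, 10, 20)   # references bytes 10..29 of data, no copy
assert len(frag) == 20
s = str(frag)                 # a copy is made only on explicit conversion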

In my limited playing around with manipulating strings and
benchmarking stuff, the biggest overhead is nearly always the copies.

-- 
Donovan Baarda <[EMAIL PROTECTED]>



Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
M.-A. Lemburg wrote:

 Walter Dörwald wrote:
 
I wonder if we should switch back to a simple readline() implementation 
for those codecs that don't require the current implementation 
(basically every charmap codec). 
 
 That would be my preference as well. The 2.4 .readline() approach
 is really only needed for codecs that have to deal with encodings
 that:
 
 a) use multi-byte formats, or
 b) support more line-end formats than just CR, CRLF, LF, or
 c) are stateful.
 
 This can easily be had by using a mix-in class for
 codecs which do need the buffered .readline() approach.

Should this be a mix-in or should we simply have two base classes? Which 
of those bases/mix-ins should be the default?

AFAIK source files are opened in 
universal newline mode, so at least we'd get proper treatment of \n, 
\r and \r\n line ends, but we'd lose u"\x1c", u"\x1d", u"\x1e", 
u"\x85", u"\u2028" and u"\u2029" (which are line terminators according 
to unicode.splitlines()).
 
 While the Unicode standard defines these characters as line
 end code points, I think their definition does not necessarily
 apply to data that is converted from a certain encoding to
 Unicode, so that's not a big loss.
 
 E.g. in ASCII or Latin-1, FILE, GROUP and RECORD
 SEPARATOR and NEXT LINE characters (0x1c, 0x1d, 0x1e, 0x85)
 are not interpreted as line end characters.
 
 Furthermore, we had no reports of anyone complaining in
 Python 1.6, 2.0 - 2.3 that line endings were not detected
 properly.  All these Python versions relied on the stream's
 .readline() method to get the next line. The only bug reports
 we had were for UTF-16 which falls into the above
 category a) and did not support .readline() until Python 2.4.

True.

Bye,
Walter Dörwald


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote:
 At least it would remove the quadratic number of calls to
 _PyUnicodeUCS2_IsLinebreak(). For each character it would be called only
 once.

Correct. However, I very much doubt that this is the cause of the
slowdown.

 The last part of the patch seems to be more related to bug #1235646.

You mean the last chunk (linebuffer=None)? This is just the extension
to reset().

 With the patch test_pep263 and test_codecs fail (and test_parser, but
 this might be unrelated):

Oops, I thought I ran the test suite, but apparently with the patch
removed. New version uploaded.

 Using collections.deque() should get rid of this problem.

Alright. There are so many types in Python I've never heard of :-)

 You mean, in the test suite?

Right.

 BTW, why the decode() call? For a Python without unicode?

Right. Not sure what people think whether this should still be
supported, but I keep supporting it whenever I think of it.

 I wonder what happens, if calls to read() and readline() are mixed (e.g.
 if I'm reading Fortran source or anything with a fixed line header):
 read() would be used to read the first n characters (which joins the line
 buffer) and readline() reads the rest (which would split it again) etc.
 (Of course this could be done via a single readline() call).

Then performance would drop again - it should still be correct, though.

If this becomes a frequent problem, we could satisfy read requests
from the split lines as well (i.e. join as many lines as you need).
However, I would rather expect that callers of read() typically want
the entire file, or want to read in large chunks (with no line
orientation at all).

 But, I think a maxsplit argument for splitlines() would make sense
 independent of this problem.

I'm not so sure anymore. It is good for consistency, but I doubt there
are actual use cases: how often do you want only the first n lines
of some string? Reading the first n lines of a file might be an
application, but then, you would rather use .readline() directly.

For readline, I don't think there is a clear case for splitting off
only the first line (unless you want to return an index instead of
the rest string): if the application eventually wants all of the
data, we better split it right away into individual strings, instead
of dealing with a gradually decreasing trailer.

Anyway, I don't think we should go back to C's readline/fgets. This
is just too messy wrt. buffering and text vs. binary mode. I wish
Python would stop using stdio entirely.

Regards,
Martin



Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
Martin v. Löwis wrote:

 Walter Dörwald wrote:
 
At least it would remove the quadratic number of calls to
_PyUnicodeUCS2_IsLinebreak(). For each character it would be called only
once.
 
 Correct. However, I very much doubt that this is the cause of the
 slowdown.

Probably. We'd need a test with the original Argon source to really know.

The last part of the patch seems to be more related to bug #1235646.
 
 You mean the last chunk (linebuffer=None)? This is just the extension
 to reset.

Ouch, you're right: that part of the cvs diff came from my checkout, not 
your patch. I have so many Python checkouts that I sometimes forget 
which is which! ;)

With the patch test_pep263 and test_codecs fail (and test_parser, but
this might be unrelated):
 
 Oops, I thought I ran the test suite, but apparently with the patch
 removed. New version uploaded.

Looks much better now.

Using collections.deque() should get rid of this problem.
 
 Alright. There are so many types in Python I've never heard of :-)

The problem is that unicode.splitlines() returns a list, so the push/pop 
performance advantage of collections.deque might be eaten by having to 
create a collections.deque object in the first place.

You mean, in the test suite?
 
 Right.
 
BTW, why the decode() call? For a Python without unicode?
 
 Right. Not sure what people think whether this should still be
 supported, but I keep supporting it whenever I think of it.

OK, so should we add this for 2.4.2 or only for 2.5?

Should this really be put into string.py, or should it be a class 
attribute of unicode? (At least that's what was proposed for the other 
strings in string.py (string.whitespace etc.) too.)

I wonder what happens, if calls to read() and readline() are mixed (e.g.
if I'm reading Fortran source or anything with a fixed line header):
read() would be used to read the first n characters (which joins the line
buffer) and readline() reads the rest (which would split it again) etc.
(Of course this could be done via a single readline() call).
 
 Then performance would drop again - it should still be correct, though.
 
 If this becomes a frequent problem, we could satisfy read requests
 from the split lines as well (i.e. join as many lines as you need).
 However, I would rather expect that callers of read() typically want
 the entire file, or want to read in large chunks (with no line
 orientation at all).

Agreed! Don't fix a bug that hasn't been reported! ;)

But, I think a maxsplit argument for splitlines() would make sense
independent of this problem.
 
 I'm not so sure anymore. It is good for consistency, but I doubt there
 are actual use cases: how often do you want only the first n lines
 of some string? Reading the first n lines of a file might be an
 application, but then, you would rather use .readline() directly.

Not every unicode string is read from a StreamReader.

 For readline, I don't think there is a clear case for splitting off
 only the first line (unless you want to return an index instead of
 the rest string): if the application eventually wants all of the
 data, we better split it right away into individual strings, instead
 of dealing with a gradually decreasing trailer.

True, this would be best for a readline loop.

Another solution would be to have a unicode.itersplitlines() and store 
the iterator. Then we wouldn't need a maxsplit because you can simply 
stop iterating once you have what you want.
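
A pure-Python sketch of such a generator (a real version would live in
unicodeobject.c; the terminator set is the one unicode.splitlines() uses):

def itersplitlines(s, keepends=True):
    terminators = u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
    pos = i = 0
    n = len(s)
    while i < n:
        if s[i] in terminators:
            end = i + 1
            if s[i] == u"\r" and end < n and s[end] == u"\n":
                end += 1                  # CRLF counts as one terminator
            if keepends:
                yield s[pos:end]
            else:
                yield s[pos:i]
            pos = i = end
        else:
            i += 1
    if pos < n:
        yield s[pos:]                     # trailing data without a terminator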

 Anyway, I don't think we should go back to C's readline/fgets. This
 is just too messy wrt. buffering and text vs. binary mode.

I don't know about C's readline, but StreamReader.read() and 
StreamReader.readline() are messy enough. But at least it's something we 
can fix ourselves.

 I wish
 Python would stop using stdio entirely.

So reverting to the 2.3 behaviour for simple codecs is out?

Bye,
Walter Dörwald


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Martin v. Löwis
Walter Dörwald wrote:
 Right. Not sure what people think whether this should still be
 supported, but I keep supporting it whenever I think of it.
 
 
 OK, so should we add this for 2.4.2 or only for 2.5?

You mean, string.unicodelinebreaks? I think something needs to be
done to fix the performance problem. In doing so, API changes
might occur. We should not add API changes in 2.4.2 unless they
contribute to the bug fix, and even then, the release manager
probably needs to approve them (in any case, they certainly
need to be backwards compatible).

 Should this really be put into string.py, or should it be a class
 attribute of unicode? (At least that's what was proposed for the other
 strings in string.py (string.whitespace etc.) too.

If the 2.4.2 fix is based on this kind of data, I think it should go
into a private attribute of codecs.py. For 2.5, I would put it
into string.py for tradition. There is no point in having some of these
constants in string.py and others as class attributes (unless we also
add them as class attributes in 2.5, in which case adding
unicodelinebreaks into string.py would be pointless).

So I think in 2.5, I would like to see

# string.py
ascii_letters = str.ascii_letters

in which case unicode.linebreaks would be the right spelling.

 I'm not so sure anymore. It is good for consistency, but I doubt there
 are actual use cases: how often do you want only the first n lines
 of some string? Reading the first n lines of a file might be an
 application, but then, you would rather use .readline() directly.
 
 
 Not every unicode string is read from a StreamReader.

Sure: but how often do you want to fetch the first line of a Unicode
string you happen to have in memory, without iterating over all lines
eventually?

 Another solution would be to have a unicode.itersplitlines() and store
 the iterator. Then we wouldn't need a maxsplit because you simply can
 stop iterating once you have what you want.

That might work. I would then ask for itersplitlines to return pairs
of (line, truncated) so you can easily know whether you merely ran
into the end of the string, or whether you got a complete line
(although it might be a bit too specific for the readlines() case)

 So reverting to the 2.3 behaviour for simple codecs is out?

I'm -1, at least. It would also fix the problem at hand, for the reported
case. However, it does leave some codecs in the cold, most notably
UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8 is
built-in in the parser). I think the UTF-8 stream reader should support
all Unicode line breaks, so it should continue to use the Python
approach. However, UTF-8 is fairly common, so reading a
UTF-8-encoded file line-by-line shouldn't suck.

Regards,
Martin


Re: [Python-Dev] 51 Million calls to _PyUnicodeUCS2_IsLinebreak() (???)

2005-08-24 Thread Walter Dörwald
On 24.08.2005, at 21:15, Martin v. Löwis wrote:

 Walter Dörwald wrote:


 Right. Not sure what people think whether this should still be
 supported, but I keep supporting it whenever I think of it.


 OK, so should we add this for 2.4.2 or only for 2.5?


 You mean, string.unicodelinebreaks?


Yes.

 I think something needs to be
 done to fix the performance problem. In doing so, API changes
 might occur. We should not add API changes in 2.4.2 unless they
 contribute to the bug fix, and even then, the release manager
 probably needs to approve them (in any case, they certainly
 need to be backwards compatible)


OK. Your version of the patch (without replacing
line = line.splitlines(False)[0] with something better) might be
enough for 2.4.2.

 Should this really be put into string.py, or should it be a class
 attribute of unicode? (At least that's what was proposed for the  
 other
 strings in string.py (string.whitespace etc.) too.


 If the 2.4.2 fix is based on this kind of data, I think it should go
 into a private attribute of codecs.py.


I think codecs.unicodelinebreaks has one big problem: it will not
work for codecs that do str->str decoding.

 For 2.5, I would put it
  into string.py for tradition. There is no point in having some of these
  constants in string.py and others as class attributes (unless we also
  add them as class attributes in 2.5, in which case adding
  unicodelinebreaks into string.py would be pointless).

 So I think in 2.5, I would like to see

 # string.py
 ascii_letters = str.ascii_letters

 in which case unicode.linebreaks would be the right spelling.


And it would have the advantage that it could work both with str and
unicode, if we had both str.linebreaks and unicode.linebreaks.

 I'm not so sure anymore. It is good for consistency, but I doubt  
 there
 are actual use cases: how often do you want only the first n lines
 of some string? Reading the first n lines of a file might be an
 application, but then, you would rather use .readline() directly.


 Not every unicode string is read from a StreamReader.


 Sure: but how often do you want to fetch the first line of a Unicode
 string you happen to have in memory, without iterating over all lines
 eventually?


I don't know. The only obvious spot in the standard library (apart
from codecs.py) seems to be

    def shortdescription(self): return self.description().splitlines()[0]

in Lib/plat-mac/pimp.py.

 Another solution would be to have a unicode.itersplitlines() and  
 store
 the iterator. Then we wouldn't need a maxsplit because you simply can
 stop iterating once you have what you want.


 That might work. I would then ask for itersplitlines to return pairs
 of (line, truncated) so you can easily know whether you merely ran
 into the end of the string, or whether you got a complete line
 (although it might be a bit too specific for the readlines() case)


Or maybe (line, terminatorlength), which gives you the same info
(terminatorlength == 0 means truncated) and makes it easy to strip
the terminator.
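
In pure Python the pairs are easy to derive (iterlines is a made-up name,
and this version still builds the full list first):

def iterlines(s):
    # yields (line, terminatorlength); terminatorlength == 0 means the
    # string ended without a terminator, i.e. the line was truncated
    for line in s.splitlines(True):
        body = line.splitlines(False)[0]      # the line minus its terminator
        yield line, len(line) - len(body)

# list(iterlines(u"a\r\nb")) == [(u'a\r\n', 2), (u'b', 0)]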

 So reverting to the 2.3 behaviour for simple codecs is out?


  I'm -1, at least. It would also fix the problem at hand, for the reported
 case. However, it does leave some codecs in the cold, most notably
 UTF-8 (which, in turn, isn't an issue for PEP 262, since UTF-8 is
 built-in in the parser).


You meant PEP 263, right?

 I think the UTF-8 stream reader should support
 all Unicode line breaks, so it should continue to use the Python
 approach.


OK.

  However, UTF-8 is fairly common, so reading a
 UTF-8-encoded file line-by-line shouldn't suck.


OK, so what's missing is a solution for str->str codecs (or we keep
line = line.splitlines(False)[0] and test whether this is fast enough).

Bye,
Walter Dörwald

