[Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Steve Holden
I hope this is an appropriate dev topic.

It seems to me that the unicode discussions of recent days are well
highlighted by difficulties I am having using the mailbox module (hardly
surprising given the difficulties of handling email generally) even
though it passes its tests.

I can't find anything related in the issue tracker (symptoms: one
program that works fine under Python 2 in under twenty seconds takes
forever (over ten minutes) to fail while creating the (start, stop)
index to the mailbox). My code reads Thunderbird mailboxen from file
store on my Windows Vista system under 3.1.

The failures I am experiencing could easily be encoding issues so I
won't post any detail yet, but I am concerned about the timing - even
when the code is "fixed", if it needs to be, the performance may still
make the module of dubious value.

Can someone who is set up to do so easily just do a timing of test_mailbox
under 2.6 and 3.2, to verify they see the same disparity as me? The test
takes about twice as long under 3.1 here (and I am concerned that
unexercised aspects of the code may extend real-world problem run times
by an order of magnitude or more).

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
See Python Video!   http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
 Ian Dury, 1942-2000

___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Miki Tebeka
Hello Steve,

> Can someone who is set up to do so easily just do a timing of test_mailbox
> under 2.6 and 3.2, to verify they see the same disparity as me? The test
> takes about twice as long under 3.1 here
On Ubuntu the timings were:

Python 2.6.5:  23.8sec
Python 2.7rc2: 32.7sec
Python 3.1.2:  32.3sec

All the best,
--
Miki


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Senthil Kumaran
On Tue, Jun 29, 2010 at 09:56:11AM -0400, Steve Holden wrote:
> Can someone who is set up to do so easily just do a timing of test_mailbox
> under 2.6 and 3.2, to verify they see the same disparity as me? The test

Actually, No.

Python 2.7b2+ (trunk:81685M, Jun  4 2010, 21:52:06) 
Ran 274 tests in 27.231s

OK

real    0m27.769s
user    0m1.110s
sys     0m0.440s

Python 3.2a0 (py3k:82364M, Jun 29 2010, 19:37:27)

Ran 268 tests in 24.444s

OK

real    0m25.126s
user    0m2.810s
sys     0m0.270s
07:39 PM:senthil@:~/python/py3k

This is under Ubuntu 64 Bit.
Perhaps, the problem you are observing is Windows Only?

-- 
Senthil

Banectomy, n.:
The removal of bruises on a banana.
-- Rich Hall, "Sniglets"


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Nick Coghlan
Command line: ./python -m test.regrtest -v test_mailbox

trunk: Ran 274 tests in 25.239s
py3k: Ran 268 tests in 26.263s

So I don't see any substantial difference on a Kubuntu 10.04 box (both
builds are recent'ish, but not completely up to date).

However, the underlying IO access is significantly different between
POSIX and Windows, so there could still be something pathological
happening at the filesystem manipulation layer. My comparisons are
also 2.7 vs 3.2 rather than 2.6 vs 3.1.

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Steve Holden
Nick Coghlan wrote:
> Command line: ./python -m test.regrtest -v test_mailbox
> 
> trunk: Ran 274 tests in 25.239s
> py3k: Ran 268 tests in 26.263s
> 
> So I don't see any substantial difference on a Kubuntu 10.04 box (both
> builds are recent'ish, but not completely up to date).
> 
> However, the underlying IO access is significantly different between
> POSIX and Windows, so there could still be something pathological
> happening at the filesystem manipulation layer. My comparisons are
> also 2.7 vs 3.2 rather than 2.6 vs 3.1.
> 
> Cheers,
> Nick.
> 
Thanks for all the timings! If a Windows user could do the same thing
that would help ...

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
See Python Video!   http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
 Ian Dury, 1942-2000


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Steve Holden
Steve Holden wrote:
> Nick Coghlan wrote:
>> Command line: ./python -m test.regrtest -v test_mailbox
>>
>> trunk: Ran 274 tests in 25.239s
>> py3k: Ran 268 tests in 26.263s
>>
>> So I don't see any substantial difference on a Kubuntu 10.04 box (both
>> builds are recent'ish, but not completely up to date).
>>
>> However, the underlying IO access is significantly different between
>> POSIX and Windows, so there could still be something pathological
>> happening at the filesystem manipulation layer. My comparisons are
>> also 2.7 vs 3.2 rather than 2.6 vs 3.1.
>>
>> Cheers,
>> Nick.
>>
> Thanks for all the timings! If a Windows user could do the same thing
> that would help ...
> 
And there is *definitely* a performance issue. I created a Thunderbird
folder of 26 Google alerts and just parsed them all after reading them
in from the mailbox.

2.5 (!):  0.78 sec
3.1: 42.80 sec

Rather than debate the code here perhaps I should just open an issue for
this? I can then provide both a program and some data, which can be
added to the tests if appropriate. The issue can clearly stand some
investigation.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
See Python Video!   http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
 Ian Dury, 1942-2000



Re: [Python-Dev] what environment variable should contain compiler warning suppression flags?

2010-06-29 Thread Barry Warsaw
On Jun 28, 2010, at 05:28 PM, M.-A. Lemburg wrote:

>How many Python users will compile Python in debug mode ?

How many Python users compile Python at all? :)

>The point is that the default build of Python should use
>the correct production settings for the C compiler out of
>the box and that's what AC_PROG_CC is all about.

Sure.

>I'm pretty sure that Python developers who want to use a
>debug build have enough code fu to get the -O2 turned into a -O0
>either by adjusting OPT and/or by providing their own CFLAGS env var.

Yes, but it's a PITA for several reasons, IMO:

* It's pretty underdocumented
* It's obscure
* It's hard to remember the exact fu needed because you do it infrequently
* I usually only remember my mistake when gdb acts funny

I strongly suggest that --with-pydebug should be all you need to ensure the
best debugging environment, which means turning off compiler optimization.
Last time I tried, the -O0 was added and it worked well.  (I know this has
been in flux though.)

>Also note that in some cases you may actually want to have
>a debug build with optimizations turned on, e.g. to track down
>a compiler optimization bug.

Yes, but that's *much* more rare than wanting to step through some bit of C
code without going crazy.

-Barry




Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Tim Golden

On 29/06/2010 15:26, Steve Holden wrote:
> Nick Coghlan wrote:
>> Command line: ./python -m test.regrtest -v test_mailbox
>>
>> trunk: Ran 274 tests in 25.239s
>> py3k: Ran 268 tests in 26.263s
>>
>> So I don't see any substantial difference on a Kubuntu 10.04 box (both
>> builds are recent'ish, but not completely up to date).
>>
>> However, the underlying IO access is significantly different between
>> POSIX and Windows, so there could still be something pathological
>> happening at the filesystem manipulation layer. My comparisons are
>> also 2.7 vs 3.2 rather than 2.6 vs 3.1.
>>
>> Cheers,
>> Nick.
>>
> Thanks for all the timings! If a Windows user could do the same thing
> that would help ...

WinXP SP3

2.6 Ran 272 tests in 13.172s
3.1 Ran 267 tests in 15.735s
py3k A *lot* of ERROR and FAIL tests


TJG


Re: [Python-Dev] what environment variable should contain compiler warning suppression flags?

2010-06-29 Thread Barry Warsaw
On Jun 28, 2010, at 06:03 PM, M.-A. Lemburg wrote:

>OPT already uses -O0 if --with-pydebug is used and the
>compiler supports -g. Since OPT gets added after CFLAGS, the override
>already happens...

So nobody's proposing to drop that?  Good!  Ignore my last message then. :)
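
For anyone who wants to check what a given interpreter was actually built with, the flags recorded at configure time can be read back from Python. A minimal sketch, assuming a POSIX build made with configure (on Windows these variables are typically None); `build_flags` is a hypothetical helper name:

```python
import sys
import sysconfig


def build_flags():
    """Report whether this is a debug build and which flags were recorded."""
    return {
        # sys.gettotalrefcount only exists in --with-pydebug builds.
        "debug_build": hasattr(sys, "gettotalrefcount"),
        # Values captured from the Makefile at build time; may be None.
        "OPT": sysconfig.get_config_var("OPT"),
        "CFLAGS": sysconfig.get_config_var("CFLAGS"),
    }


if __name__ == "__main__":
    for name, value in build_flags().items():
        print("%s: %s" % (name, value))
```

On a --with-pydebug build one would expect to see -O0 in OPT, but that is worth verifying rather than assuming.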

-Barry




Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Guido van Rossum
On Tue, Jun 29, 2010 at 7:49 AM, Steve Holden wrote:
> Steve Holden wrote:
>> Nick Coghlan wrote:
>>> Command line: ./python -m test.regrtest -v test_mailbox
>>>
>>> trunk: Ran 274 tests in 25.239s
>>> py3k: Ran 268 tests in 26.263s
>>>
>>> So I don't see any substantial difference on a Kubuntu 10.04 box (both
>>> builds are recent'ish, but not completely up to date).
>>>
>>> However, the underlying IO access is significantly different between
>>> POSIX and Windows, so there could still be something pathological
>>> happening at the filesystem manipulation layer. My comparisons are
>>> also 2.7 vs 3.2 rather than 2.6 vs 3.1.
>>>
>>> Cheers,
>>> Nick.
>>>
>> Thanks for all the timings! If a Windows user could do the same thing
>> that would help ...
>>
> And there is *definitely* a performance issue. I created a Thunderbird
> folder of 26 Google alerts and just parsed them all after reading them
> in from the mailbox.
>
> 2.5 (!):  0.78 sec
> 3.1    : 42.80 sec
>
> Rather than debate the code here perhaps I should just open an issue for
> this? I can then provide both a program and some data, which can be
> added to the tests if appropriate. The issue can clearly stand some
> investigation.

Since you have such a great reproducible test case, could you point
the profiler at it? (Perhaps on a reduced dataset... The profiler
multiplies your run time by some number between 2 and 10 IIRC.)
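
A minimal sketch of what pointing the profiler at such a test case could look like. It builds a tiny throwaway mbox rather than using Steve's data, and `build_sample_mbox`, `read_all` and `profile_read` are hypothetical helper names, not part of the mailbox module:

```python
import cProfile
import io
import mailbox
import os
import pstats
import tempfile


def build_sample_mbox(path, count=5):
    """Write a tiny mbox file as a stand-in for real mail."""
    box = mailbox.mbox(path)
    try:
        for i in range(count):
            box.add("From: a@example.com\nSubject: alert %d\n\nbody %d\n" % (i, i))
        box.flush()
    finally:
        box.close()


def read_all(path):
    """The workload under test: index the mailbox, then fetch every message."""
    box = mailbox.mbox(path)
    try:
        return [box[key]["Subject"] for key in box.iterkeys()]
    finally:
        box.close()


def profile_read(path, top=10):
    """Run read_all() under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(read_all, path)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        sample = os.path.join(tmpdir, "sample.mbox")
        build_sample_mbox(sample)
        subjects, report = profile_read(sample)
        print(report)
```

Sorting by cumulative time tends to surface the expensive call chain first, which is usually enough to see whether decoding or file positioning dominates.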

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Tim Golden

On 29/06/2010 15:51, Tim Golden wrote:
> On 29/06/2010 15:26, Steve Holden wrote:
>> Nick Coghlan wrote:
>>> Command line: ./python -m test.regrtest -v test_mailbox
>>>
>>> trunk: Ran 274 tests in 25.239s
>>> py3k: Ran 268 tests in 26.263s
>>>
>>> So I don't see any substantial difference on a Kubuntu 10.04 box (both
>>> builds are recent'ish, but not completely up to date).
>>>
>>> However, the underlying IO access is significantly different between
>>> POSIX and Windows, so there could still be something pathological
>>> happening at the filesystem manipulation layer. My comparisons are
>>> also 2.7 vs 3.2 rather than 2.6 vs 3.1.
>>>
>>> Cheers,
>>> Nick.
>>>
>> Thanks for all the timings! If a Windows user could do the same thing
>> that would help ...
>
> WinXP SP3
>
> 2.6 Ran 272 tests in 13.172s
> 3.1 Ran 267 tests in 15.735s
> py3k A *lot* of ERROR and FAIL tests


py3k HEAD on Win7 Ran 268 tests in 34.055s

TJG


Re: [Python-Dev] Pickle security and remote logging

2010-06-29 Thread Vinay Sajip
anatoly techtonik writes:

> insecure. SocketHandler and DatagramHandler docs should at least
> contain a warning about danger of exposing unpickling interfaces to
> insecure networks.

I've updated the documentation of SocketHandler.makePickle to mention security
concerns, and that the method can be overridden to use a more secure
implementation (e.g. HMAC-signed pickles).
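
A rough sketch of what an HMAC-signed override might look like. This is an illustration under stated assumptions, not the change made to the logging docs: `SignedSocketHandler`, `verify_and_load` and `SECRET_KEY` are hypothetical names, and the code assumes the stock makePickle() output is a 4-byte big-endian length prefix followed by the pickle.

```python
import hashlib
import hmac
import logging
import logging.handlers
import pickle
import struct

SECRET_KEY = b"replace-with-a-real-shared-secret"  # hypothetical shared key
DIGEST_SIZE = hashlib.sha256().digest_size         # 32 bytes


class SignedSocketHandler(logging.handlers.SocketHandler):
    """SocketHandler whose frames carry an HMAC ahead of the pickle."""

    def makePickle(self, record):
        framed = super().makePickle(record)
        payload = framed[4:]  # drop the stock length prefix
        digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
        body = digest + payload
        return struct.pack(">L", len(body)) + body


def verify_and_load(frame, key=SECRET_KEY):
    """Receiver side: check the signature before touching pickle.loads()."""
    (length,) = struct.unpack(">L", frame[:4])
    body = frame[4:4 + length]
    digest, payload = body[:DIGEST_SIZE], body[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("bad signature; refusing to unpickle")
    return pickle.loads(payload)
```

The signature only proves the sender knew the key; anything signed with a valid key is still unpickled in full.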

Regards,

Vinay Sajip



Re: [Python-Dev] what environment variable should contain compiler warning suppression flags?

2010-06-29 Thread Steve Holden
Barry Warsaw wrote:
> On Jun 28, 2010, at 05:28 PM, M.-A. Lemburg wrote:
> 
>> How many Python users will compile Python in debug mode ?
> 
> How many Python users compile Python at all? :)
> 
>> The point is that the default build of Python should use
>> the correct production settings for the C compiler out of
>> the box and that's what AC_PROG_CC is all about.
> 
> Sure.
> 
>> I'm pretty sure that Python developers who want to use a
>> debug build have enough code fu to get the -O2 turned into a -O0
>> either by adjusting OPT and/or by providing their own CFLAGS env var.
> 
> Yes, but it's a PITA for several reasons, IMO:
> 
> * It's pretty underdocumented
> * It's obscure
> * It's hard to remember the exact fu needed because you do it infrequently
> * I usually only remember my mistake when gdb acts funny
> 
> I strongly suggest that --with-pydebug should be all you need to ensure the
> best debugging environment, which means turning off compiler optimization.
> Last time I tried, the -O0 was added and it worked well.  (I know this has
> been in flux though.)
> 
>> Also note that in some cases you may actually want to have
>> a debug build with optimizations turned on, e.g. to track down
>> a compiler optimization bug.
> 
> Yes, but that's *much* more rare than wanting to step through some bit of C
> code without going crazy.

I agree - trying to step through -O2 optimized code isn't going to help
debug your code, it's going to help you debug the optimizer. That's a
very rare use case.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
See Python Video!   http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
 Ian Dury, 1942-2000



Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Antoine Pitrou
On Tue, 29 Jun 2010 11:40:50 -0400
Steve Holden wrote:
> Sure. I attach the outputs of both files, as well as the program and the
> data. With profiling (python -m cProfile test3.py) the run took less
> than a third of a second under 2.5, and 168 seconds under 3.1. I'd say
> that was problematical :)
> 
> I will leave the profiler output to speak for itself, since I can find
> nothing much to say about it except that there's a hell of a lot of
> decoding going on inside mailbox.iterkeys().

Ok, a lot of time is spent in cp1252 decoding. Somewhat less time, but
still too much of it, is spent in TextIOWrapper.tell(). This seems to
imply that mailbox files are opened in text mode, which sounds wrong to
me. Perhaps Andrew can shed more light on this?
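
One way to check this hypothesis in isolation is to time the tell()-before-every-readline() pattern in text mode against binary mode. A hedged sketch: the file contents and helper names are made up, cp1252 is passed explicitly to mimic the Windows default, and the size of the gap will depend on Python version and platform.

```python
import os
import tempfile
import time


def build_file(path, lines=2000):
    """Write a plain ASCII file with CRLF line endings."""
    with open(path, "wb") as f:
        for i in range(lines):
            f.write(b"line %d of a pretend mailbox\r\n" % i)


def scan_offsets(path, mode):
    """Call tell() before every readline(), as the TOC scan does."""
    kwargs = {} if "b" in mode else {"encoding": "cp1252", "newline": ""}
    offsets = []
    with open(path, mode, **kwargs) as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets


def timed_scan(path, mode):
    """Return (number of lines seen, elapsed seconds) for one scan."""
    start = time.perf_counter()
    count = len(scan_offsets(path, mode))
    return count, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        sample = os.path.join(tmpdir, "sample.txt")
        build_file(sample)
        for mode in ("rb", "r"):
            count, seconds = timed_scan(sample, mode)
            print("%-2s %d lines in %.4fs" % (mode, count, seconds))
```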





Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread A.M. Kuchling
On Tue, Jun 29, 2010 at 07:56:22AM -0700, Guido van Rossum wrote:
> Since you have such a great reproducible test case, could you point
> the profiler at it? (Perhaps on a reduced dataset... The profiler
> multiplies your run time by some number between 2 and 10 IIRC.)

Let me underline Guido's suggestion.  Steve, I've done a lot of
mailbox.py stuff and can look at your problem, but off the top of my
head, my suspicion would be that I/O is the culprit, and a profile
could confirm that.  My thought is that mailbox.py is opening the file
in some reading mode that ends up doing a lot more processing on
Windows than on Unix because of universal newlines or something like
that.

--amk


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread A.M. Kuchling
On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
> I will leave the profiler output to speak for itself, since I can find
> nothing much to say about it except that there's a hell of a lot of
> decoding going on inside mailbox.iterkeys().

The problem is actually in _generate_toc(), which is reading through
the entire file to figure out where all the 'From' lines that start
messages are located.  TextIOWrapper()'s tell() method seems to be
very slow, so one help is to only call tell() when necessary; patch:

-> svn diff Lib/
Index: Lib/mailbox.py
===
--- Lib/mailbox.py  (revision 82346)
+++ Lib/mailbox.py  (working copy)
@@ -775,13 +775,14 @@
         starts, stops = [], []
         self._file.seek(0)
         while True:
-            line_pos = self._file.tell()
             line = self._file.readline()
             if line.startswith('From '):
+                line_pos = self._file.tell()
                 if len(stops) < len(starts):
                     stops.append(line_pos - len(os.linesep))
                 starts.append(line_pos)
             elif not line:
+                line_pos = self._file.tell()
                 stops.append(line_pos)
                 break
         self._toc = dict(enumerate(zip(starts, stops)))

But should mailboxes really be opened in a UTF-8 encoding, or should
they be treated as 7-bit text?  I'll have to think about this.

--amk


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread R. David Murray
On Tue, 29 Jun 2010 18:34:22 +0200, Antoine Pitrou wrote:
> On Tue, 29 Jun 2010 11:40:50 -0400
> Steve Holden wrote:
> > Sure. I attach the outputs of both files, as well as the program and the
> > data. With profiling (python -m cProfile test3.py) the run took less
> > than a third of a second under 2.5, and 168 seconds under 3.1. I'd say
> > that was problematical :)
> > 
> > I will leave the profiler output to speak for itself, since I can find
> > nothing much to say about it except that there's a hell of a lot of
> > decoding going on inside mailbox.iterkeys().
> 
> Ok, a lot of time is spent in cp1252 decoding. Somewhat less time, but
> still too much of it, is spent in TextIOWrapper.tell(). This seems to
> imply that mailbox files are opened in text mode, which sounds wrong to
> me. Perhaps Andrew can shed more light on this?

Given the current state of the email package for python3, it makes
sense that it would open them in text mode.  email can't currently
process bytes, only text.

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Antoine Pitrou
On Tue, 29 Jun 2010 12:52:28 -0400
"A.M. Kuchling" wrote:
> 
> But should mailboxes really be opened in a UTF-8 encoding, or should
> they be treated as 7-bit text?  I'll have to think about this.

I don't see how you can assume UTF-8 for mailbox files, given that each
message will have its particular encoding.
Besides, Steve's profile results show that you are not using UTF-8, but
rather the local encoding, which is cp1252 under his Windows setup.
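
A small sketch of the point about the local encoding: text-mode open() with no encoding argument uses a platform-dependent default, and the same byte decodes differently (or not at all) depending on which codec is picked. The helper names are illustrative only.

```python
import locale
import os
import tempfile


def default_text_encoding(path):
    """Return the encoding that text-mode open() picks when none is given."""
    with open(path, "r") as f:
        return f.encoding


def decode_attempts(raw, encodings=("cp1252", "utf-8", "ascii")):
    """Try each codec on the same bytes; record the text, or None on failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        print("open() default:  ", default_text_encoding(path))
        print("locale preferred:", locale.getpreferredencoding(False))
        # 0xE9 is a valid cp1252 character but not valid UTF-8 or ASCII here.
        print(decode_attempts(b"caf\xe9"))
    finally:
        os.remove(path)
```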

Regards

Antoine.




Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Steve Holden
A.M. Kuchling wrote:
> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>> I will leave the profiler output to speak for itself, since I can find
>> nothing much to say about it except that there's a hell of a lot of
>> decoding going on inside mailbox.iterkeys().
> 
> The problem is actually in _generate_toc(), which is reading through
> the entire file to figure out where all the 'From' lines that start
> messages are located.  TextIOWrapper()'s tell() method seems to be
> very slow, so one help is to only call tell() when necessary; patch:
> 
> -> svn diff Lib/
> Index: Lib/mailbox.py
> ===
> --- Lib/mailbox.py  (revision 82346)
> +++ Lib/mailbox.py  (working copy)
> @@ -775,13 +775,14 @@
>          starts, stops = [], []
>          self._file.seek(0)
>          while True:
> -            line_pos = self._file.tell()
>              line = self._file.readline()
>              if line.startswith('From '):
> +                line_pos = self._file.tell()
>                  if len(stops) < len(starts):
>                      stops.append(line_pos - len(os.linesep))
>                  starts.append(line_pos)
>              elif not line:
> +                line_pos = self._file.tell()
>                  stops.append(line_pos)
>                  break
>          self._toc = dict(enumerate(zip(starts, stops)))
> 
> But should mailboxes really be opened in a UTF-8 encoding, or should
> they be treated as 7-bit text?  I'll have to think about this.

Neither! You can't open them as 7-bit text, because real-world email
does contain bytes whose ordinal value exceeds 127. You can't open them
using a text encoding because theoretically there might be ASCII headers
that indicate that parts of the content are in specific character sets
or encodings.

If only we had a data structure that easily allowed us to manipulate
8-bit characters ...
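
To make the point concrete, here is a hedged sketch in which one mailbox holds two messages in different charsets. No single text encoding fits the whole file, but keeping it as bytes and decoding each body according to its own headers works. The sample data and helper names are invented, and email.message_from_bytes() postdates this thread (it arrived in Python 3.2).

```python
import email

RAW_MBOX = (
    b"From alice@example.com Tue Jun 29 10:00:00 2010\n"
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
    b"\n"
    b"From bob@example.com Tue Jun 29 10:05:00 2010\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xc3\xa9\n"
)


def split_messages(raw):
    """Split raw mbox bytes on 'From ' separator lines, without decoding."""
    messages, current = [], None
    for line in raw.splitlines(keepends=True):
        if line.startswith(b"From "):
            if current is not None:
                messages.append(b"".join(current))
            current = []  # the separator line itself is not kept
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(b"".join(current))
    return messages


def decoded_bodies(raw):
    """Decode each body with the charset its own headers declare."""
    bodies = []
    for chunk in split_messages(raw):
        msg = email.message_from_bytes(chunk)
        charset = msg.get_content_charset() or "ascii"
        bodies.append(msg.get_payload(decode=True).decode(charset).strip())
    return bodies


if __name__ == "__main__":
    print(decoded_bodies(RAW_MBOX))
```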

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
See Python Video!   http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
 Ian Dury, 1942-2000


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Guido van Rossum
It should probably be opened in binary mode. Binary files do have a
.readline() method (returning a bytes object), and bytes objects have
a .startswith() method. The tell positions computed this way are even
compatible with those used by the text file. So you could do it this
way:

- open binary stream
- compute TOC by reading through it using .readline() and .tell()
- rewind (don't close)
- wrap the binary stream in a text stream
- use that for the rest of the code
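
The steps above can be sketched roughly as follows. This is an illustration, not the mailbox module's code: `generate_toc` and `open_indexed` are hypothetical names, latin-1 is an assumed encoding chosen only because it maps every byte to a character, and the blank-line adjustment from the original `_generate_toc()` is left out. tell() stays ahead of readline(), so each recorded start is the offset of the 'From ' line itself.

```python
import io
import os
import tempfile


def generate_toc(binary_file):
    """Map message number to (start, stop) byte offsets in a binary stream."""
    starts, stops = [], []
    binary_file.seek(0)
    while True:
        line_pos = binary_file.tell()
        line = binary_file.readline()
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                stops.append(line_pos)  # previous message ends here
            starts.append(line_pos)
        elif not line:
            stops.append(line_pos)      # end of file closes the last message
            break
    return dict(enumerate(zip(starts, stops)))


def open_indexed(path, encoding="latin-1"):
    """Open binary, build the TOC, rewind, then wrap in a text layer."""
    raw = open(path, "rb")              # open binary stream
    toc = generate_toc(raw)             # compute TOC with readline()/tell()
    raw.seek(0)                         # rewind (don't close)
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    return toc, text                    # closing text also closes raw


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        sample = os.path.join(tmpdir, "sample.mbox")
        with open(sample, "wb") as f:
            f.write(b"From a@example.com Tue Jun 29 2010\nSubject: one\n\nhi\n\n"
                    b"From b@example.com Tue Jun 29 2010\nSubject: two\n\nyo\n")
        toc, text = open_indexed(sample)
        try:
            text.seek(toc[1][0])
            print(toc, repr(text.readline()))
        finally:
            text.close()
```

The wrapper here is a layer over the still-open binary stream rather than a one-off conversion, so the offsets found in binary mode remain usable for seeking.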

--Guido

On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden wrote:
> A.M. Kuchling wrote:
>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>> I will leave the profiler output to speak for itself, since I can find
>>> nothing much to say about it except that there's a hell of a lot of
>>> decoding going on inside mailbox.iterkeys().
>>
>> The problem is actually in _generate_toc(), which is reading through
>> the entire file to figure out where all the 'From' lines that start
>> messages are located.  TextIOWrapper()'s tell() method seems to be
>> very slow, so one help is to only call tell() when necessary; patch:
>>
>> -> svn diff Lib/
>> Index: Lib/mailbox.py
>> ===
>> --- Lib/mailbox.py    (revision 82346)
>> +++ Lib/mailbox.py    (working copy)
>> @@ -775,13 +775,14 @@
>>          starts, stops = [], []
>>          self._file.seek(0)
>>          while True:
>> -            line_pos = self._file.tell()
>>              line = self._file.readline()
>>              if line.startswith('From '):
>> +                line_pos = self._file.tell()
>>                  if len(stops) < len(starts):
>>                      stops.append(line_pos - len(os.linesep))
>>                  starts.append(line_pos)
>>              elif not line:
>> +                line_pos = self._file.tell()
>>                  stops.append(line_pos)
>>                  break
>>          self._toc = dict(enumerate(zip(starts, stops)))
>>
>> But should mailboxes really be opened in a UTF-8 encoding, or should
>> they be treated as 7-bit text?  I'll have to think about this.
>
> Neither! You can't open them as 7-bit text, because real-world email
> does contain bytes whose ordinal value exceeds 127. You can't open them
> using a text encoding because theoretically there might be ASCII headers
> that indicate that parts of the content are in specific character sets
> or encodings.
>
> If only we had a data structure that easily allowed us to manipulate
> 8-bit characters ...
>
> regards
>  Steve
> --
> Steve Holden           +1 571 484 6266   +1 800 494 3119
> See Python Video!       http://python.mirocommunity.org/
> Holden Web LLC                 http://www.holdenweb.com/
> UPCOMING EVENTS:        http://holdenweb.eventbrite.com/
> "All I want for my birthday is another birthday" -
>                                     Ian Dury, 1942-2000



-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Steve Holden
Guido van Rossum wrote:
> It should probably be opened in binary mode. Binary files do have a
> .readline() method (returning a bytes object), and bytes objects have
> a .startswith() method. The tell positions computed this way are even
> compatible with those used by the text file. So you could do it this
> way:
> 
> - open binary stream
> - compute TOC by reading through it using .readline() and .tell()
> - rewind (don't close)

Because closing is inefficient, or because it breaks the algorithm?

> - wrap the binary stream in a text stream

"wrap" how? The ultimate destiny of the text is twofold:

1) To be stored as some kind of LOB in a database, and
2) Therefrom to be reconstituted and parsed into email.Message objects.

Is the wrapping a one-off operation or a software layer? Sorry, being a
bit dense here, I know.

regards
 Steve

> - use that for the rest of the code
> 
> --Guido
> 
> On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden wrote:
>> A.M. Kuchling wrote:
>>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>>> I will leave the profiler output to speak for itself, since I can find
>>>> nothing much to say about it except that there's a hell of a lot of
>>>> decoding going on inside mailbox.iterkeys().
>>> The problem is actually in _generate_toc(), which is reading through
>>> the entire file to figure out where all the 'From' lines that start
>>> messages are located.  TextIOWrapper()'s tell() method seems to be
>>> very slow, so one help is to only call tell() when necessary; patch:
>>>
>>> -> svn diff Lib/
>>> Index: Lib/mailbox.py
>>> ===
>>> --- Lib/mailbox.py  (revision 82346)
>>> +++ Lib/mailbox.py  (working copy)
>>> @@ -775,13 +775,14 @@
>>>          starts, stops = [], []
>>>          self._file.seek(0)
>>>          while True:
>>> -            line_pos = self._file.tell()
>>>              line = self._file.readline()
>>>              if line.startswith('From '):
>>> +                line_pos = self._file.tell()
>>>                  if len(stops) < len(starts):
>>>                      stops.append(line_pos - len(os.linesep))
>>>                  starts.append(line_pos)
>>>              elif not line:
>>> +                line_pos = self._file.tell()
>>>                  stops.append(line_pos)
>>>                  break
>>>          self._toc = dict(enumerate(zip(starts, stops)))
>>>
>>> But should mailboxes really be opened in a UTF-8 encoding, or should
>>> they be treated as 7-bit text?  I'll have to think about this.
>> Neither! You can't open them as 7-bit text, because real-world email
>> does contain bytes whose ordinal value exceeds 127. You can't open them
>> using a text encoding because theoretically there might be ASCII headers
>> that indicate that parts of the content are in specific character sets
>> or encodings.
>>
>> If only we had a data structure that easily allowed us to manipulate
>> 8-bit characters ...
>>
>> regards
>>  Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
See Python Video!   http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS:http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
 Ian Dury, 1942-2000



Re: [Python-Dev] Pickle security and remote logging

2010-06-29 Thread anatoly techtonik
On Tue, Jun 29, 2010 at 6:15 PM, Vinay Sajip wrote:
>
> I've updated the documentation of SocketHandler.makePickle to mention security
> concerns, and that the method can be overridden to use a more secure
> implementation (e.g. HMAC-signed pickles).

Thanks. But I doubt the HMAC complication helps to protect the logging
server. If the shared key is compromised, the server becomes vulnerable.
I would prefer an approach where no code execution is possible: some
alternative serialization for transmitting log data structures over the
network. Protocol buffers first come to mind, but they seem to be
overkill, and the stdlib doesn't include any implementation.

-- 
anatoly t.


Re: [Python-Dev] Pickle security and remote logging

2010-06-29 Thread Guido van Rossum
On Tue, Jun 29, 2010 at 4:22 PM, anatoly techtonik wrote:
> On Tue, Jun 29, 2010 at 6:15 PM, Vinay Sajip wrote:
>>
>> I've updated the documentation of SocketHandler.makePickle to mention
>> security concerns, and that the method can be overridden to use a more
>> secure implementation (e.g. HMAC-signed pickles).
>
> Thanks. But I doubt the HMAC complication helps to protect the logging
> server. If the shared key is compromised, the server becomes vulnerable.
> I would prefer an approach where no code execution is possible: some
> alternative serialization for transmitting log data structures over the
> network. Protocol buffers first come to mind, but they seem to be
> overkill, and the stdlib doesn't include any implementation.

You could use marshal by default. It does not execute code when
unmarshalling. A limitation is that it only supports built-in types
like list, dict, string etc. but that might be just fine for logging
data. Another option would be JSON. (Or XML, if you want bulky. :-)
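
A minimal sketch of the JSON option, assuming it is done by overriding
SocketHandler.makePickle as the updated documentation suggests. The
class name JSONSocketHandler and the receiver-side helper decode_record
are made up for illustration. The payload keeps the 4-byte length prefix
that SocketHandler uses for its pickles, and default=str is a blunt way
to cope with extra attributes that JSON cannot represent.

```python
import json
import logging
import logging.handlers
import struct


class JSONSocketHandler(logging.handlers.SocketHandler):
    """Hypothetical handler that sends length-prefixed JSON, not a pickle."""

    def makePickle(self, record):
        d = dict(record.__dict__)
        d["msg"] = record.getMessage()  # merge args into the message text
        d["args"] = None
        d["exc_info"] = None            # traceback objects cannot be sent
        payload = json.dumps(d, default=str).encode("utf-8")
        return struct.pack(">L", len(payload)) + payload


def decode_record(data):
    """Receiver side: rebuild a LogRecord without executing any code."""
    (length,) = struct.unpack(">L", data[:4])
    d = json.loads(data[4:4 + length].decode("utf-8"))
    return logging.makeLogRecord(d)
```

The handler does not connect until a record is emitted, so makePickle
can be called directly to inspect the bytes it would send.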

As for protocol buffers, assuming its absence (so far :-) from the
stdlib is the only objection, how hard would it be to make the logging
package "prepared" so that if one *did* have protocol buffers
installed, it would be a one-line config setting to use them?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread R. David Murray
On Tue, 29 Jun 2010 13:54:09 -0400, Steve Holden  wrote:
> A.M. Kuchling wrote:
> > But should mailboxes really be opened in a UTF-8 encoding, or should
> > they be treated as 7-bit text?  I'll have to think about this.
> 
> Neither! You can't open them as 7-bit text, because real-world email
> does contain bytes whose ordinal value exceeds 127. You can't open them
> using a text encoding because theoretically there might be ASCII headers
> that indicate that parts of the content are in specific character sets
> or encodings.
> 
> If only we had a data structure that easily allowed us to manipulate
> 8-bit characters ...

email6 *will* handle this use case.  When it exists :)  But note that it
is *not* just a matter of easily handling 8 bit characters.  There are
a whole bunch of algorithms needed for interpreting that 7 and 8 bit data.
All the info is there in the email headers, but being able to do string
operations on 8 bit byte strings doesn't get you the answers you need
by itself.

It really is the case that the Python3 bytes/unicode split forces us
to redo most of the algorithms so that they handle bytes and text
*correctly*.  This isn't a trivial undertaking, but the end result
will be well worth it.
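
As a rough illustration of why byte-string operations alone are not
enough: the charset needed to interpret an 8-bit body is named in the
headers, so the headers have to be parsed before the body can become
text. The sketch below uses email.message_from_bytes, which was added to
the email package in Python 3.2, after this thread; the sample message
and the helper name decode_body are made up.

```python
import email

# Made-up message: ASCII headers describing an 8-bit Latin-1 body.
RAW = (b"Subject: test\n"
       b"Content-Type: text/plain; charset=iso-8859-1\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"caf\xe9\n")


def decode_body(raw):
    """Parse from bytes, then decode the body using the declared charset."""
    msg = email.message_from_bytes(raw)
    body = msg.get_payload(decode=True)          # body as bytes
    charset = msg.get_content_charset() or "ascii"
    return body.decode(charset, errors="replace")
```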

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread R. David Murray
On Tue, 29 Jun 2010 17:02:14 -0400, Steve Holden  wrote:
> Guido van Rossum wrote:
> 
> > - wrap the binary stream in a text stream
> 
> "wrap" how? The ultimate destiny of the text is twofold:

I would imagine Guido is talking about an io.TextIOWrapper...in other
words, take the binary file you've just finished grabbing info
from, and reread it as a text file in order to grab the actual
message content.

If you have messages in your files that are using an 8bit content
transfer encoding, then you (currently) will have some problems
unless the charset happens to be the one you use when you wrap
the binary stream as a text stream.
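
A small sketch of that wrapping, with io.BytesIO standing in for a
mailbox file (the sample bytes and the helper name reread_as_text are
made up). It reads from the binary stream first, then rereads the same
stream as text. With an 8-bit body the result depends entirely on the
encoding chosen for the wrapper.

```python
import io

# Made-up data: ASCII header, then a body containing one Latin-1 byte.
SAMPLE = b"Subject: test\n\ncaf\xe9\n"


def reread_as_text(data, encoding):
    """Read one line as bytes, then reread the whole stream as text."""
    binary = io.BytesIO(data)
    first_line = binary.readline()   # bytes, e.g. for building an index
    binary.seek(0)
    with io.TextIOWrapper(binary, encoding=encoding) as text:
        return first_line, text.read()
```

With this sample, encoding="latin-1" decodes the body, while
encoding="utf-8" raises UnicodeDecodeError because the byte 0xE9 is not
valid UTF-8 in that position.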

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] Mailbox module - timings and functionality changes

2010-06-29 Thread Steve Holden
R. David Murray wrote:
> On Tue, 29 Jun 2010 13:54:09 -0400, Steve Holden  wrote:
>> A.M. Kuchling wrote:
>>> But should mailboxes really be opened in a UTF-8 encoding, or should
>>> they be treated as 7-bit text?  I'll have to think about this.
>> Neither! You can't open them as 7-bit text, because real-world email
>> does contain bytes whose ordinal value exceeds 127. You can't open them
>> using a text encoding because theoretically there might be ASCII headers
>> that indicate that parts of the content are in specific character sets
>> or encodings.
>>
>> If only we had a data structure that easily allowed us to manipulate
>> 8-bit characters ...
> 
> email6 *will* handle this use case.  When it exists :)  But note that it
> is *not* just a matter of easily handling 8 bit characters.  There are
> a whole bunch of algorithms needed for interpreting that 7 and 8 bit data.
> All the info is there in the email headers, but being able to do string
> operations on 8 bit byte strings doesn't get you the answers you need
> by itself.
> 
> It really is the case that the Python3 bytes/unicode split forces us
> to redo most of the algorithms so that they handle bytes and text
> *correctly*.  This isn't a trivial undertaking, but the end result
> will be well worth it.
> 
I completely agree. The unusual thing here is that I of all people
should find himself running into these issues, since my use of Python is
normally pretty conservative. Since the course I am currently writing is
already overdue I have to find answers now to problems that were present
in the initial 3.0 release and have not received much attention since.

You know that I support your work to revise the email package. I hope
that we can eventually have it incorporate mailbox readers as well.

regards
 Steve


[Python-Dev] OS X buildbots: why am I skipping these tests?

2010-06-29 Thread Bill Janssen
My Leopard and Tiger PPC buildbots are momentarily green!  But I'm
looking into why I'm skipping some tests.  My buildbots are up-to-date
OS-wise and very vanilla, with the latest applicable Xcode.

4 skips unexpected on darwin:
test_gdb test_ioctl test_readline test_ttk_guionly

Three of these (gdb, readline, ttk_guionly) are just bad predictions of
which tests should skip on Darwin, I think -- gdb is only version 6, so
that test won't run, readline doesn't get built, ttk doesn't work
without Tcl/Tk 8.5.  But the skip of test_ioctl baffles me.

"test_ioctl skipped -- Unable to open /dev/tty"

But when I log in via ssh and try it with the system python:

~ wjanssen$ python
python
Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/dev/tty")
open("/dev/tty")

>>> 

Seems to work fine.  So this I don't understand.  Any ideas, anyone?

Bill


Re: [Python-Dev] what environment variable should contain compiler warning suppression flags?

2010-06-29 Thread Stephen J. Turnbull
Steve Holden writes:

 > I agree - trying to step through -O2 optimized code isn't going to
 > help debug your code, it's going to help you debug the
 > optimizer. That's a very rare use case.

Not really.  I don't have a lot of practice in debugging at that
level, so take it with a grain of salt, but what I've found with
XEmacs code is that debugging at -O0 is less often helpful than
debugging at -O2.  Quite often a naive compilation strategy is used
which basically turns those C statements into macros for the
underlying assembler, and the code works the way the author thinks it
should.  But his assumptions are invalid, and when optimized it fails.

So I guess you can call that "debugging the optimizer" if you like


Re: [Python-Dev] OS X buildbots: why am I skipping these tests?

2010-06-29 Thread Guido van Rossum
On Tue, Jun 29, 2010 at 7:55 PM, Bill Janssen  wrote:
> My Leopard and Tiger PPC buildbots are momentarily green!  But I'm
> looking into why I'm skipping some tests.  My buildbots are up-to-date
> OS-wise and very vanilla, with the latest applicable Xcode.
>
> 4 skips unexpected on darwin:
>    test_gdb test_ioctl test_readline test_ttk_guionly
>
> Three of these (gdb, readline, ttk_guionly) are just bad predictions of
> which tests should skip on Darwin, I think -- gdb is only version 6, so
> that test won't run, readline doesn't get built, ttk doesn't work
> without Tcl/Tk 8.5.

So it looks like you could get readline and ttk to run and pass by
separately downloading and installing readline (I've done this many
times before) and Tcl/Tk (no idea but I suppose it should work).

> But the skip of test_ioctl baffles me.
>
> "test_ioctl skipped -- Unable to open /dev/tty"
>
> But when I log in via ssh and try it with the system python:
>
> ~ wjanssen$ python
> python
> Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> open("/dev/tty")
> open("/dev/tty")
>
> >>>
>
> Seems to work fine.  So this I don't understand.  Any ideas, anyone?

Maybe the buildbot runs the tests as a tty-less daemon process. If you
ask me it's pretty crazy to have a test that requires a tty. But there
you have it -- and it's the same in Python 3. (But then again, who
knows, I might have written that test. ;-)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] OS X buildbots: why am I skipping these tests?

2010-06-29 Thread Martin v. Löwis
> Seems to work fine.  So this I don't understand.  Any ideas, anyone?

Didn't we discuss this before? The buildbot slave has no controlling
terminal anymore, hence it cannot open /dev/tty. If you are curious,
just patch your checkout to output the exact errno (e.g. to stdout),
and trigger a build through the web.
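
A sketch of the kind of diagnostic meant here, assuming it is run from
inside the buildbot environment. probe_open is a hypothetical helper,
not code taken from test_ioctl; it reports the errno instead of
swallowing the failure.

```python
import errno
import os


def probe_open(path="/dev/tty"):
    """Try to open path; return (errno, symbolic name) or (None, None)."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        return exc.errno, errno.errorcode.get(exc.errno, "unknown")
    os.close(fd)
    return None, None


if __name__ == "__main__":
    # Printed to stdout so the value shows up in the build log.
    print("open /dev/tty ->", probe_open())
```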

Regards,
Martin


[Python-Dev] Taking over the Mercurial Migration

2010-06-29 Thread Martin v. Löwis
It seems that both Dirkjan and Brett are very caught up
with real life for the coming months. So I suggest that
some other committer who favors the Mercurial transition
steps forward and takes over this project.

If nobody volunteers, I propose that we release 3.2
from Subversion, and reconsider Mercurial migration
next year.

Regards,
Martin


[Python-Dev] Taking over the Mercurial Migration

2010-06-29 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > It seems that both Dirkjan and Brett are very caught up
 > with real life for the coming months. So I suggest that
 > some other committer who favors the Mercurial transition
 > steps forward and takes over this project.

I am not a committer, and am not intimately familiar with PEP 385, so I
don't think it would be appropriate for me to become the proponent.
However, I am one of the PEP 374 co-authors, and have experience with a
previous transition to Mercurial of similar scale (XEmacs).  I can
promise to devote time to the transition in July and August, in support
of whoever might step forward.  I hope someone does.

 > If nobody volunteers, I propose that we release 3.2
 > from Subversion, and reconsider Mercurial migration
 > next year.

In the absence of a volunteer, I think that's probably necessary.