Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-13 Thread Barry Warsaw

On Apr 10, 2009, at 3:04 PM, Stephen J. Turnbull wrote:


Shouldn't this thread move lock stock and .signature to email-sig?


Yep.  I'll try to be more conscientious about removing python-dev from  
the CC.



Idempotency?  I'm not sure what that means in the context of the
email package ... multiplication by zero? <wink>  Do you mean that
.parse().to_wire() should be idempotent?  Yes, I think that's a good
idea, and it shouldn't be too hard to implement by (optionally?)
caching the whole original message or individual components (headers
with all whitespace including folding cached verbatim, etc).  I think
caching has to be done, since stuff like did the original fold with a
leading tab or a leading space, and at what column and so on seems
kind of pointless to encode as attributes on Header objects.


I tend to agree.  I'm also happy if there's a way to tell, say, the
parser that an application doesn't care about that.  All that extra
caching will have a memory overhead that you should only pay for if
you care.
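
The optional caching discussed above might look like the following minimal sketch. The names CachedHeader, parse_header, and keep_raw are hypothetical and not part of the email package, and the whitespace collapsing is a simplification of real header unfolding.

```python
class CachedHeader:
    """Hypothetical header that can remember its original wire bytes."""

    def __init__(self, name, value, raw=None):
        self.name = name
        self.value = value
        self._raw = raw  # verbatim source bytes, or None when caching is off

    def to_wire(self):
        # Prefer the cached original so folding and whitespace survive.
        if self._raw is not None:
            return self._raw
        # Otherwise regenerate a normalized, unfolded header line.
        return f"{self.name}: {self.value}\r\n".encode("ascii")


def parse_header(raw, keep_raw=True):
    """Parse one (possibly folded) ASCII header; keep_raw is an assumed flag."""
    text = raw.decode("ascii")
    name, _, rest = text.partition(":")
    value = " ".join(rest.split())  # simplified unfolding
    return CachedHeader(name, value, raw if keep_raw else None)


raw = b"Subject: a folded\r\n\tsubject line\r\n"
round_tripped = parse_header(raw).to_wire()
normalized = parse_header(raw, keep_raw=False).to_wire()
```

With caching on, the round trip reproduces the input byte for byte, including the leading tab on the continuation line. With caching off, the application saves the memory and gets a regenerated header instead.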


-Barry



PGP.sig
Description: This is a digitally signed message part
___
Email-SIG mailing list
Email-SIG@python.org
Your options: 
http://mail.python.org/mailman/options/email-sig/archive%40mail-archive.com

Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-13 Thread Tony Nelson
At 10:11 -0400 04/13/2009, Barry Warsaw wrote:

On Apr 10, 2009, at 11:08 AM, James Y Knight wrote:

 Until you write a parser for every header, you simply cannot decode
 to unicode. The only sane choices are:
 1) raw bytes
 2) parsed structured data

The email package does not need a parser for every header, but it
should provide a framework that applications (or third party
libraries) can use to extend the built-in header parsers.  A bare
minimum for functionality requires a Content-Type parser.  I think the
email package should also include an address header (Originator,
Destination) parser, and a Message-ID header parser.  Possibly
others.  The default would probably be some unstructured parser for
headers like Subject.

I think the email package should have a parser for every header.  All the
headers defined in normal mail RFCs should have their own parser, and there
would be a default parser for unhandled headers, probably the Unstructured
parser.  Users could add their own, probably by importing some module
that knew how to add its parsing to the email package parsers.
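
A registry along these lines is one way such a framework could be shaped. This is only a sketch: register_parser, parse_header_value, and parse_unstructured are hypothetical names, and only email.utils.getaddresses is an existing standard-library call.

```python
from email.utils import getaddresses

_PARSERS = {}  # lower-cased header name -> parser callable


def register_parser(*names):
    """Decorator that registers a parser for one or more header names."""
    def decorator(func):
        for name in names:
            _PARSERS[name.lower()] = func
        return func
    return decorator


def parse_unstructured(value):
    """Default parser: collapse runs of whitespace, return plain text."""
    return " ".join(value.split())


@register_parser("From", "To", "Cc")
def parse_address_list(value):
    """Address headers become a list of (display name, mailbox) pairs."""
    return getaddresses([value])


def parse_header_value(name, value):
    """Dispatch to a registered parser, falling back to unstructured."""
    parser = _PARSERS.get(name.lower(), parse_unstructured)
    return parser(value)


to_value = parse_header_value("To", "Ann Example <ann@example.com>, bob@example.com")
subject_value = parse_header_value("Subject", "  hello   world ")
```

A third-party module would add its own parsers simply by being imported and calling register_parser at import time.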
-- 

TonyN.:'   mailto:tonynel...@georgeanelson.com
  '  http://www.georgeanelson.com/


Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-10 Thread Barry Warsaw

On Apr 9, 2009, at 11:59 PM, Tony Nelson wrote:

Thinking about this stuff makes me nostalgic for the sloppy happy days
of Python 2.x


You now have the opportunity to finally unsnarl that mess.  It is not an
insurmountable opportunity.


No, it's just a full time job <wink>.  Now where did I put that
hack-drink-coffee-twitter clone?


-Barry




Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-10 Thread Glenn Linderman
On approximately 4/10/2009 9:56 AM, came the following characters from 
the keyboard of Barry Warsaw:

On Apr 10, 2009, at 1:19 AM, gl...@divmod.com wrote:

On 02:38 am, ba...@python.org wrote:
So, what I'm really asking is this.  Let's say you agree that there 
are use cases for accessing a header value as either the raw encoded 
bytes or the decoded unicode.  What should this return:


 message['Subject']

The raw bytes or the decoded unicode?


My personal preference would be to just deprecate this API and 
get rid of it, replacing it with a slightly more explicit one.


  message.headers['Subject']
  message.bytes_headers['Subject']


This is pretty darn clever Glyph.  Stop that! :)

I'm not 100% sure I like the name .bytes_headers or that .headers 
should be the decoded header (rather than have .headers return the 
bytes thingie and say .decoded_headers return the decoded thingies), 
but I do like the general approach.


If one name has to be longer than the other, it should be the bytes 
version.  Real user code is more likely to want to use the text version, 
and hopefully there will be more of that type of code than 
implementations using bytes.


Of course, one could use message.header and message.bythdr and they'd be 
the same length.
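
For concreteness, the two-attribute spelling could be sketched as below. The Message class and the _DecodedView helper are hypothetical; only decode_header and make_header from email.header are existing calls, and the sketch assumes values are stored as RFC 2047 encoded ASCII bytes.

```python
from email.header import decode_header, make_header


class _DecodedView:
    """Read-only view that decodes raw header bytes to text on access."""

    def __init__(self, raw_headers):
        self._raw = raw_headers

    def __getitem__(self, name):
        encoded = self._raw[name].decode("ascii")
        # decode_header splits RFC 2047 encoded words; make_header rejoins them.
        return str(make_header(decode_header(encoded)))


class Message:
    """Hypothetical message exposing both spellings from the thread."""

    def __init__(self):
        self.bytes_headers = {}  # name -> raw encoded bytes

    @property
    def headers(self):
        return _DecodedView(self.bytes_headers)


msg = Message()
msg.bytes_headers["Subject"] = b"=?utf-8?q?caf=C3=A9?="
decoded_subject = msg.headers["Subject"]
```

Here the text view is derived from the bytes on demand, so neither name is privileged in storage; which one gets the shorter spelling remains the open question.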



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking



Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-10 Thread Barry Warsaw

On Apr 10, 2009, at 2:00 PM, Glenn Linderman wrote:

If one name has to be longer than the other, it should be the bytes  
version.  Real user code is more likely to want to use the text  
version, and hopefully there will be more of that type of code than  
implementations using bytes.


I'm not sure we know that yet, actually.  Nothing written for Python 2  
counts, and email is too broken in 3 for any sane person to be writing  
such code for Python 3.


Of course, one could use message.header and message.bythdr and  
they'd be the same length.


I was trying to figure out what a 'thdr' was that we'd want to index
'by' it. :)


-Barry




Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-10 Thread Barry Warsaw

On Apr 10, 2009, at 2:06 PM, Michael Foord wrote:


Shouldn't headers always be text?


/me weeps




Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-10 Thread Stephen J. Turnbull
Shouldn't this thread move lock stock and .signature to email-sig?

Barry Warsaw writes:

   It does seem to make sense to think about headers as text header
   names and text header values.
  
   I disagree.  IMHO, structured header types should have object values,
   and something like
  
  While I agree, there's still a need for a higher level API that makes  
  it easy to do the simple things.

Sure.  I'm suggesting that the way to determine whether something is
simple or not is by whether it falls out naturally from correct
structure.  Ie, no operations that only a Cirque du Soleil juggler can
perform are allowed.

  I agree that the Message class needs to be strict.  A parser needs to  
  be lenient;

Not always.  The Postel Principle only applies to stuph coming in off
the wire.  But we're *also* going to be parsing pseudo-email
components that are being handed to us by applications (eg, the
perennial control-character-in-the-unremovable-address Mailman bug).
Our parser should Just Say No to that crap.

  see the .defects attribute introduced in the current email  
  package.  Oh, and this reminds me that we still haven't talked about  
  idempotency.  That's an important principle in the current email  
  package, but do we need to give up on that?

Idempotency?  I'm not sure what that means in the context of the
email package ... multiplication by zero? <wink>  Do you mean that
.parse().to_wire() should be idempotent?  Yes, I think that's a good
idea, and it shouldn't be too hard to implement by (optionally?)
caching the whole original message or individual components (headers
with all whitespace including folding cached verbatim, etc).  I think
caching has to be done, since stuff like did the original fold with a
leading tab or a leading space, and at what column and so on seems
kind of pointless to encode as attributes on Header objects.

[Description of MessageTextView and MessageWireView elided.]

  This seems similar to Glyph's basic idea, but with a different spelling.

Yes.  I don't much care which way it's done, and Glyph's style of
spelling is more explicit.  But I was thinking in terms of the number
of people who are surely going to sing "Mama don' 'low no Unicodes
roun' here" and squeal "codec WTF?! outta mah face, man!"


Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:


On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw ba...@python.org wrote:
Anyway, aside from that decision, I haven't come up with an elegant  
way to allow /output/ in both bytes and strings (input is I think  
theoretically easier by sniffing the arguments).


Won't this work? (assuming dumps() always returns a string)

def dumpb(obj, encoding='utf-8', *args, **kw):
    s = dumps(obj, *args, **kw)
    return s.encode(encoding)


So, what I'm really asking is this.  Let's say you agree that there  
are use cases for accessing a header value as either the raw encoded  
bytes or the decoded unicode.  What should this return:


 message['Subject']

The raw bytes or the decoded unicode?

Okay, so you've picked one.  Now how do you spell the other way?

The Message class probably has these explicit methods:

 Message.get_header_bytes('Subject')
 Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;).  One of those maps to  
message['Subject'] but which is the more obvious choice?


Now, setting headers.  Sometimes you have some unicode thing and  
sometimes you have some bytes.  You need to end up with bytes in the  
ASCII range and you'd like to leave the header value unencoded if so.   
But in both cases, you might have bytes or characters outside that  
range, so you need an explicit encoding, defaulting to utf-8 probably.


 Message.set_header('Subject', 'Some text', encoding='utf-8')
 Message.set_header('Subject', b'Some bytes')

One of those maps to

 message['Subject'] = ???

I'm open to any suggestions here!
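
As one possible reading of the methods named above, here is a minimal sketch. The Message class is hypothetical and is not the email package's API; it assumes values are stored in wire form, left unencoded when pure ASCII and RFC 2047 encoded otherwise. Header, decode_header, and make_header are existing email.header calls.

```python
from email.header import Header, decode_header, make_header


class Message:
    """Hypothetical message storing each header value as ASCII wire bytes."""

    def __init__(self):
        self._headers = {}

    def set_header(self, name, value, encoding="utf-8"):
        # Bytes input is interpreted using the explicit encoding.
        if isinstance(value, bytes):
            value = value.decode(encoding)
        try:
            # Leave the value unencoded when it is already in the ASCII range.
            wire = value.encode("ascii")
        except UnicodeEncodeError:
            # Otherwise fall back to an RFC 2047 encoded word.
            wire = Header(value, encoding).encode().encode("ascii")
        self._headers[name] = wire

    def get_header_bytes(self, name):
        return self._headers[name]

    def get_header_string(self, name):
        encoded = self._headers[name].decode("ascii")
        return str(make_header(decode_header(encoded)))


m = Message()
m.set_header("Subject", "Some text", encoding="utf-8")
plain = m.get_header_bytes("Subject")
m.set_header("Subject", "caf\u00e9")
encoded_wire = m.get_header_bytes("Subject")
decoded_text = m.get_header_string("Subject")
```

The sketch deliberately leaves message['Subject'] undefined, since which of the two getters it should map to is exactly the question being asked.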
-Barry




Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:


Barry Warsaw wrote:

I don't know whether the parameter thing will work or not, but you're
probably right that we need to get the bytes-everywhere API first.


Given that json is a wire protocol, that sounds like the right approach
for json as well. Once bytes-everywhere works, then a text API can be
built on top of it, but it is difficult to build a bytes API on top of a
text one.


Agreed!


So I guess the IO library *is* the right model: bytes at the bottom of
the stack, with text as a wrapper around it (mediated by codecs).


Yes, that's a very interesting (and proven?) model.  I don't quite see  
how we could apply that to email and json, but it seems like there's a  
good idea there. ;)
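
The layering being referred to can be seen directly in the io module, and a function-level analogue is easy to sketch. In the block below, io.BytesIO and io.TextIOWrapper are real; wire_dump and text_dump are hypothetical names used only to illustrate a text API built on a bytes API.

```python
import io

# The io stack: a binary stream at the bottom, a codec-mediated text wrapper on top.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
text.write("Subject: caf\u00e9\n")
text.flush()  # push the encoded bytes down to the binary layer
stored = raw.getvalue()


def wire_dump(fields):
    """Bottom layer (hypothetical): bytes in, bytes out."""
    return b"".join(k + b": " + v + b"\r\n" for k, v in fields)


def text_dump(fields, encoding="utf-8"):
    """Text wrapper (hypothetical): encode going in, decode coming out."""
    encoded = [(k.encode(encoding), v.encode(encoding)) for k, v in fields]
    return wire_dump(encoded).decode(encoding)


as_bytes = wire_dump([(b"Subject", b"hi")])
as_text = text_dump([("Subject", "hi")])
```

The text function needs nothing beyond a codec and the bytes function, which is the asymmetry Nick describes: going the other way would require guessing an encoding.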


-Barry




Re: [Email-SIG] [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Stephen J. Turnbull
Barry Warsaw writes:

  There are really two ways to look at an email message.  It's either an  
  unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

  Those objects have headers and payload.  The payload can be of any  
  type, though I think it generally breaks down into strings for  
  text/* types and bytes for anything else (not counting multiparts).

*sigh*  Why are you back-tracking?

The payload should be of an appropriate *object* type.  Atomic object
types will have their content stored as string or bytes [nb I use
Python 3 terminology throughout].  Composite types (multipart/*) won't
need string or bytes attributes AFAICS.

Start by implementing the application/octet-stream and
text/plain;charset=utf-8 object types, of course.

  It does seem to make sense to think about headers as text header names  
  and text header values.

I disagree.  IMHO, structured header types should have object values,
and something like

message['to'] = "Barry 'da FLUFL' Warsaw <ba...@python.org>"

should be smart enough to detect that it's a string and attempt to
(flexibly) parse it into a fullname and a mailbox adding escapes, etc.
Whether these should be structured objects or they can be strings or
bytes, I'm not sure (probably bytes, not strings, though -- see next
example).  OTOH

message['to'] = b'''Barry 'da.FLUFL' Warsaw ba...@python.org'''

should assume that the client knows what they are doing, and should
parse it strictly (and I mean be a real bastard, eg, raise an
exception on any non-ASCII octet), merely dividing it into fullname
and mailbox, and caching the bytes for later insertion in a
wire-format message.
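
The flexible-versus-strict distinction could be sketched as a single function that branches on the argument type. parse_mailbox is a hypothetical name; email.utils.parseaddr is the only existing call used, and the leniency shown for strings (whitespace cleanup) is just one assumed example of flexible handling.

```python
from email.utils import parseaddr


def parse_mailbox(value):
    """Return (fullname, mailbox, cached_wire_bytes_or_None)."""
    if isinstance(value, bytes):
        # Strict path: the caller claims this is already wire format.
        try:
            text = value.decode("ascii")
        except UnicodeDecodeError as exc:
            raise ValueError("non-ASCII octet in wire-format address") from exc
        if any(ord(ch) < 32 or ord(ch) == 127 for ch in text):
            raise ValueError("control character in wire-format address")
        fullname, mailbox = parseaddr(text)
        return fullname, mailbox, value  # cache the bytes verbatim
    # Lenient path: tidy stray whitespace before parsing; nothing is cached.
    fullname, mailbox = parseaddr(" ".join(value.split()))
    return fullname, mailbox, None


from_text = parse_mailbox("  Ann   Example  <ann@example.com> ")
from_wire = parse_mailbox(b"Ann Example <ann@example.com>")
```

Passing bytes containing an octet above 0x7F, or an embedded control character, raises ValueError rather than being repaired.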

  In that case, I think you want the values as unicodes, and probably  
  the headers as unicodes containing only ASCII.  So your table would be  
  strings in both cases.  OTOH, maybe your application cares about the  
  raw underlying encoded data, in which case the header names are  
  probably still strings of ASCII-ish unicodes and the values are  
  bytes.  It's this distinction (and I think the competing use cases)  
  that make a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with
message['to'] always returning a structured_header object (or maybe
even more specifically an address_header object), and methods like

message['to'].build_header_as_text()

which returns

"To: Barry 'da.FLUFL' Warsaw <ba...@python.org>"

and

message['to'].build_header_in_wire_format()

which returns

b"To: Barry 'da.FLUFL' Warsaw <ba...@python.org>"

Then have email.textview.Message and email.wireview.Message which
provide a simple interface where message['to'] would invoke
.build_header_as_text() and .build_header_in_wire_format()
respectively.
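
A minimal sketch of that arrangement follows. AddressHeader, TextView, and WireView are hypothetical stand-ins for the address_header object and the email.textview / email.wireview Message classes named above; the wire builder assumes pure-ASCII content and skips the RFC 2047 encoding a real one would need.

```python
class AddressHeader:
    """Hypothetical structured header holding a display name and a mailbox."""

    def __init__(self, name, fullname, mailbox):
        self.name = name
        self.fullname = fullname
        self.mailbox = mailbox

    def build_header_as_text(self):
        return f"{self.name}: {self.fullname} <{self.mailbox}>"

    def build_header_in_wire_format(self):
        # Assumes ASCII-only content; non-ASCII names would need encoding.
        return self.build_header_as_text().encode("ascii")


class TextView:
    """Indexing returns the text rendering of the structured header."""

    def __init__(self, headers):
        self._headers = headers

    def __getitem__(self, name):
        return self._headers[name.lower()].build_header_as_text()


class WireView:
    """Indexing returns the wire-format bytes of the structured header."""

    def __init__(self, headers):
        self._headers = headers

    def __getitem__(self, name):
        return self._headers[name.lower()].build_header_in_wire_format()


headers = {"to": AddressHeader("To", "Ann Example", "ann@example.com")}
as_text = TextView(headers)["To"]
as_wire = WireView(headers)["To"]
```

Both views share the same structured objects, so the choice between text and bytes is made once, when the application picks a view, rather than at every header access.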

  Thinking about this stuff makes me nostalgic for the sloppy happy days  
  of Python 2.x

Er, yeah.

Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,