Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread rdmurray
John Machin sjmac...@lexicon.net wrote:
 On Feb 25, 11:07 am, Roy H. Han starsareblueandfara...@gmail.com
 wrote:
  Dear python-list,
 
  I'm having some trouble decoding an email header using the standard
  imaplib.IMAP4 class and email.message_from_string method.
 
  In particular, email.message_from_string() does not seem to properly
  decode unicode characters in the subject.
 
  How do I decode unicode characters in the subject?
 
 You don't. You can't. You decode str objects into unicode objects. You
 encode unicode objects into str objects. If your input is not a str
 object, you have a problem.

I can't speak for the OP, but I had a similar (and possibly
identical-in-intent) question.  Suppose you have a Subject line that
looks like this:

Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   
=?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

How do you get the email module to decode that into unicode?  The same
question applies to the other header lines, and the answer is it isn't
easy, and I had to read and reread the docs and experiment for a while
to figure it out.  I understand there's going to be a sprint on the
email module at pycon, maybe some of this will get improved then.

Here's the final version of my test program.  The third to last line is
one I thought ought to work given that Header has a __unicode__ method.
The final line is the one that did work (note the kludge to turn None
into 'ascii'...IMO 'ascii' is what decode_header _should_ be returning,
and this code shows why!)

---
from email import message_from_string
from email.header import Header, decode_header

# Build a message whose Subject mixes plain text with two RFC 2047
# encoded words, folded onto a continuation line.
x = message_from_string("""\
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
 =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.
""")

print x
print
for key, header in x.items():
    print key, 'type', type(header)
    print key+":", unicode(Header(header)).decode('utf-8')
    print key+":", decode_header(header)
    print key+":", ''.join([s.decode(t or 'ascii') for (s, t) in decode_header(header)]).encode('utf-8')
---


From nobody Wed Feb 25 08:35:29 2009
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
=?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.


To type <type 'str'>
To: test
To: [('test', None)]
To: test
Subject type <type 'str'>
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
 =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
Subject: [("'u' Obselete type", None), ("-- it is identical to 'd'. (7)", 'iso-8859-1')]
Subject: 'u' Obselete type-- it is identical to 'd'. (7)
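
For readers on Python 3, where str is already Unicode text, the same
join-the-decoded-parts idea can be sketched as below.  This is only an
illustration and not part of the test program above: header_to_text and
its fallback parameter are hypothetical names, and the sketch assumes
that decode_header yields bytes chunks for headers containing encoded
words and a plain str otherwise.

```python
from email.header import decode_header


def header_to_text(raw_value, fallback="ascii"):
    """Join the chunks from decode_header() into one text string.

    Hypothetical helper; `fallback` stands in for a charset of None,
    mirroring the `t or 'ascii'` kludge in the Python 2 program.
    """
    pieces = []
    for chunk, charset in decode_header(raw_value):
        if isinstance(chunk, bytes):
            # Headers that contain encoded words come back as bytes chunks.
            pieces.append(chunk.decode(charset or fallback, errors="replace"))
        else:
            # A header with no encoded words comes back as a plain str.
            pieces.append(chunk)
    return "".join(pieces)


raw_subject = (
    "'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=\n"
    " =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?="
)
subject = header_to_text(raw_subject)
print(subject)
```

The helper does not guard against charset names Python does not know;
a real program would probably want to catch LookupError as well.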


--RDM

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Roy H. Han
Thanks for writing back, RDM and John Machin.  Tomorrow I'll try the
code you suggested, RDM.  It looks quite helpful and I'll report the
results.

In the meantime, John asked for more data.  The sender's email client
is Microsoft Outlook 11.  The recipient email client is Lotus Notes.



Actual Subject
=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=

Expected Subject
Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records

X-Mailer
Microsoft Office Outlook 11

X-MimeOLE
Produced By Microsoft MimeOLE V6.00.2900.5579



RHH





Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Steve Holden
Roy H. Han wrote:
 Thanks for writing back, RDM and John Machin.  Tomorrow I'll try the
 code you suggested, RDM.  It looks quite helpful and I'll report the
 results.
 
 In the meantime, John asked for more data.  The sender's email client
 is Microsoft Outlook 11.  The recipient email client is Lotus Notes.
 
 
 
 Actual Subject
 =?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=
 
 Expected Subject
 Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records
 
 X-Mailer
 Microsoft Office Outlook 11
 
 X-MimeOLE
 Produced By Microsoft MimeOLE V6.00.2900.5579
 
>>> from email.header import decode_header
>>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
[('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
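
On Python 3 the same call returns bytes chunks, so one common way to
get text back is to feed the result to email.header.make_header and
convert that to str.  A minimal sketch, assuming Python 3 and reusing
the subject quoted above (the variable names are illustrative only):

```python
from email.header import decode_header, make_header

raw_subject = (
    "=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?="
    "\r\n\t=?us-ascii?Q?m_C/SR_Records?="
)

# decode_header() splits the value into (chunk, charset) pairs;
# make_header() rebuilds a Header object whose str() is the decoded text.
decoded_subject = str(make_header(decode_header(raw_subject)))
print(decoded_subject)
```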


regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread rdmurray
Steve Holden st...@holdenweb.com wrote:
 >>> from email.header import decode_header
 >>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
 [('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
 

It is interesting that decode_header does what I would consider to be
the right thing (from a pragmatic standpoint) with that particular bit
of Microsoft not-quite-standards-compliant brain-damage; but, removing
the tab is not in fact standards compliant if I'm reading the RFC
correctly.

--RDM



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Roy H. Han
Cool, it works!

Thanks, RDM, for stating the right approach.
Thanks, Steve, for teaching by example.

I wonder why the email.message_from_string() method doesn't call
email.header.decode_header() automatically.
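
In later Python 3 releases this wish is largely granted: when a message
is parsed with email.policy.default, header access returns text that is
already decoded.  A minimal sketch, assuming Python 3.6 or newer and a
made-up message built around the subject above:

```python
from email import message_from_string, policy

raw_message = (
    "To: test\n"
    "Subject: =?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\n"
    "\t=?us-ascii?Q?m_C/SR_Records?=\n"
    "\n"
    "this is a test.\n"
)

# With the legacy default policy (compat32) the encoded words come back as-is.
legacy_msg = message_from_string(raw_message)
print(repr(legacy_msg["Subject"]))

# With policy.default the header object decodes RFC 2047 encoded words itself.
modern_msg = message_from_string(raw_message, policy=policy.default)
modern_subject = str(modern_msg["Subject"])
print(modern_subject)
```

The legacy behaviour remains the default for message_from_string, so
the policy argument has to be passed explicitly.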




Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Steve Holden
rdmur...@bitdance.com wrote:
 Steve Holden st...@holdenweb.com wrote:
 >>> from email.header import decode_header
 >>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
 [('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
 
 It is interesting that decode_header does what I would consider to be
 the right thing (from a pragmatic standpoint) with that particular bit
 of Microsoft not-quite-standards-compliant brain-damage; but, removing
 the tab is not in fact standards compliant if I'm reading the RFC
 correctly.
 
You'd need to quote me chapter and verse on that. I understood that the
tab simply indicated continuation, but it's a *long* time since I read
the RFCs.

regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Thorsten Kampe
* Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)
 Thanks, RDM, for stating the right approach.
 Thanks, Steve, for teaching by example.
 
 I wonder why the email.message_from_string() method doesn't call
 email.header.decode_header() automatically.

And I wonder why you would think the header contains Unicode characters 
when it says us-ascii (=?us-ascii?Q?). I think there is a tendency 
to label everything Unicode someone does not understand.

Thorsten


Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Gabriel Genellina
En Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe  
thors...@thorstenkampe.de escribió:



* Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)

Thanks, RDM, for stating the right approach.
Thanks, Steve, for teaching by example.

I wonder why the email.message_from_string() method doesn't call
email.header.decode_header() automatically.


And I wonder why you would think the header contains Unicode characters
when it says us-ascii (=?us-ascii?Q?). I think there is a tendency
to label everything Unicode someone does not understand.


And I wonder why you would think the header does *not* contain Unicode  
characters when it says us-ascii?. I think there is a tendency here  
too...


--
Gabriel Genellina



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Thorsten Kampe
* Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
 En Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe  
 thors...@thorstenkampe.de escribió:
  * Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)
  Thanks, RDM, for stating the right approach.
  Thanks, Steve, for teaching by example.
 
  I wonder why the email.message_from_string() method doesn't call
  email.header.decode_header() automatically.
 
  And I wonder why you would think the header contains Unicode characters
  when it says us-ascii (=?us-ascii?Q?). I think there is a tendency
  to label everything Unicode someone does not understand.
 
 And I wonder why you would think the header does *not* contain Unicode  
 characters when it says us-ascii?.

Basically because it didn't contain any Unicode characters (anything 
outside the ASCII range).

Thorsten


Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Tim Golden

Thorsten Kampe wrote:

* Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
En Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe  
thors...@thorstenkampe.de escribió:

* Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)

Thanks, RDM, for stating the right approach.
Thanks, Steve, for teaching by example.

I wonder why the email.message_from_string() method doesn't call
email.header.decode_header() automatically.

And I wonder why you would think the header contains Unicode characters
when it says us-ascii (=?us-ascii?Q?). I think there is a tendency
to label everything Unicode someone does not understand.
And I wonder why you would think the header does *not* contain Unicode  
characters when it says us-ascii?.


Basically because it didn't contain any Unicode characters (anything 
outside the ASCII range).


And I imagine that Gabriel's point was -- and my point certainly
is -- that Unicode includes all the characters *inside* the
ASCII range.
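
This point can be checked mechanically: every ASCII byte maps to the
Unicode code point with the same number, so ASCII-only data decodes
identically as ASCII or as UTF-8.  A small sketch, assuming Python 3
(the sample bytes are taken from the subject discussed earlier):

```python
ascii_bytes = b"Inteum C/SR User Tip"

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

# Same text either way, and each character's code point equals its byte value.
same_text = as_ascii == as_utf8
same_numbers = [ord(ch) for ch in as_ascii] == list(ascii_bytes)
print(same_text, same_numbers)
```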


TJG


Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Gabriel Genellina
En Wed, 25 Feb 2009 15:01:08 -0200, Thorsten Kampe  
thors...@thorstenkampe.de escribió:

* Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)

En Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe
thors...@thorstenkampe.de escribió:
 * Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)
 Thanks, RDM, for stating the right approach.
 Thanks, Steve, for teaching by example.

 I wonder why the email.message_from_string() method doesn't call
 email.header.decode_header() automatically.

 And I wonder why you would think the header contains Unicode  
characters

 when it says us-ascii (=?us-ascii?Q?). I think there is a tendency
 to label everything Unicode someone does not understand.

And I wonder why you would think the header does *not* contain Unicode
characters when it says us-ascii?.


Basically because it didn't contain any Unicode characters (anything
outside the ASCII range).


I think you have to revise your definition of Unicode.

--
Gabriel Genellina



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread rdmurray
Steve Holden st...@holdenweb.com wrote:
 rdmur...@bitdance.com wrote:
  Steve Holden st...@holdenweb.com wrote:
  >>> from email.header import decode_header
  >>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
  [('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
  
  It is interesting that decode_header does what I would consider to be
  the right thing (from a pragmatic standpoint) with that particular bit
  of Microsoft not-quite-standards-compliant brain-damage; but, removing
  the tab is not in fact standards compliant if I'm reading the RFC
  correctly.
  
 You'd need to quote me chapter and verse on that. I understood that the
 tab simply indicated continuation, but it's a *long* time since I read
 the RFCs.

Tab is not mentioned in RFC 2822 except to say that it is a valid
whitespace character.  Header folding (insertion of crlf) can
occur most places whitespace appears, and is defined in section
2.2.3 thusly:

   Each header field is logically a single line of characters comprising
   the field name, the colon, and the field body.  For convenience
   however, and to deal with the 998/78 character limitations per line,
   the field body portion of a header field can be split into a multiple
   line representation; this is called "folding".  The general rule is
   that wherever this standard allows for folding white space (not
   simply WSP characters), a CRLF may be inserted before any WSP.  For
   example, the header field:

   Subject: This is a test

   can be represented as:

   Subject: This
    is a test

   [irrelevant note elided]

   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding".  Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP.  Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.

So, the whitespace characters are supposed to be left unchanged
after unfolding.
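
The unfolding rule quoted above is small enough to sketch directly.
The following is an illustration only (Python 3; unfold is a
hypothetical helper, not a function from the email package, and the
second input is a shortened stand-in for the Outlook subject): it
removes each line break that is immediately followed by a space or tab
and leaves that whitespace character in place.

```python
import re

# CRLF followed by WSP; a bare LF is also tolerated here as an assumption.
_FOLD = re.compile(r"\r?\n(?=[ \t])")


def unfold(header_text):
    """Remove line breaks that precede WSP, keeping the WSP itself."""
    return _FOLD.sub("", header_text)


rfc_example = unfold("Subject: This\r\n is a test")
outlook_example = unfold("=?us-ascii?Q?Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
print(repr(rfc_example))
print(repr(outlook_example))
```

After unfolding, the tab is still present between the two encoded
words.  Dropping it is a separate question: RFC 2047 section 6.2 says
whitespace between adjacent encoded words is ignored on display, which
may be what decode_header is applying.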

--David



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Steve Holden
rdmur...@bitdance.com wrote:
[...]
 
The process of moving from this folded multiple-line representation
of a header field to its single line representation is called
"unfolding".  Unfolding is accomplished by simply removing any CRLF
that is immediately followed by WSP.  Each header field should be
treated in its unfolded form for further syntactic and semantic
evaluation.
 
 So, the whitespace characters are supposed to be left unchanged
 after unfolding.
 
That would certainly appear to be the case. Thanks.

regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Thorsten Kampe
* Tim Golden (Wed, 25 Feb 2009 17:27:07 +)
 Thorsten Kampe wrote:
  * Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
  En Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe  
[...]
  And I wonder why you would think the header contains Unicode characters
  when it says us-ascii (=?us-ascii?Q?). I think there is a tendency
  to label everything Unicode someone does not understand.
  And I wonder why you would think the header does *not* contain Unicode  
  characters when it says us-ascii?.
  
  Basically because it didn't contain any Unicode characters (anything 
  outside the ASCII range).
 
 And I imagine that Gabriel's point was -- and my point certainly
 is -- that Unicode includes all the characters *inside* the
 ASCII range.

I know that this was Gabriel's point. And my point was that Gabriel's 
point was pointless. If you call any text (or character) Unicode then 
the word Unicode is generalized to an extent where it doesn't mean 
anything at all anymore and becomes a buzz word.

By the same reasoning you could call ASCII a Unicode encoding (which it 
isn't) because all ASCII characters are Unicode characters (code 
points). Only encodings that cover the full Unicode range can reasonably 
be called Unicode encodings.

The OP just saw some weird characters in the email subject and thought 
"I know. It looks weird. Must be Unicode." But it wasn't. It was good 
ole ASCII - only Quoted Printable encoded.


Thorsten


Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Gabriel Genellina

En Wed, 25 Feb 2009 15:44:18 -0200, rdmur...@bitdance.com escribió:


Tab is not mentioned in RFC 2822 except to say that it is a valid
whitespace character.  Header folding (insertion of crlf) can
occur most places whitespace appears, and is defined in section
2.2.3 thusly: [...]
So, the whitespace characters are supposed to be left unchanged
after unfolding.


Yep, there is an old bug report sleeping in the tracker about this...

--
Gabriel Genellina



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-25 Thread Gabriel Genellina
En Wed, 25 Feb 2009 16:19:35 -0200, Thorsten Kampe  
thors...@thorstenkampe.de escribió:

* Tim Golden (Wed, 25 Feb 2009 17:27:07 +)

Thorsten Kampe wrote:
 * Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
 En Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe

[...]
 And I wonder why you would think the header contains Unicode  
characters
 when it says us-ascii (=?us-ascii?Q?). I think there is a  
tendency

 to label everything Unicode someone does not understand.
 And I wonder why you would think the header does *not* contain  
Unicode

 characters when it says us-ascii?.

 Basically because it didn't contain any Unicode characters (anything
 outside the ASCII range).

And I imagine that Gabriel's point was -- and my point certainly
is -- that Unicode includes all the characters *inside* the
ASCII range.


I know that this was Gabriel's point. And my point was that Gabriel's
point was pointless. If you call any text (or character) Unicode then
the word Unicode is generalized to an extent where it doesn't mean
anything at all anymore and becomes a buzz word.


If it's text, it should use Unicode. Maybe not now, but in a few years, it  
will be totally unacceptable not to properly use Unicode to process  
textual data.



By the same reasoning you could call ASCII a Unicode encoding (which it
isn't) because all ASCII characters are Unicode characters (code
points). Only encodings that cover the full Unicode range can reasonably
be called Unicode encodings.


Not at all. ASCII is as valid a character encoding ("coded character set",
as the Unicode guys like to say) as ISO 10646 (which covers the whole
range).



The OP just saw some weird characters in the email subject and thought
"I know. It looks weird. Must be Unicode." But it wasn't. It was good
ole ASCII - only Quoted Printable encoded.


Good f*cked ASCII is Unicode too.

--
Gabriel Genellina



Re: How do I decode unicode characters in the subject using email.message_from_string()?

2009-02-24 Thread John Machin
On Feb 25, 11:07 am, Roy H. Han starsareblueandfara...@gmail.com
wrote:
 Dear python-list,

 I'm having some trouble decoding an email header using the standard
 imaplib.IMAP4 class and email.message_from_string method.

 In particular, email.message_from_string() does not seem to properly
 decode unicode characters in the subject.

 How do I decode unicode characters in the subject?

You don't. You can't. You decode str objects into unicode objects. You
encode unicode objects into str objects. If your input is not a str
object, you have a problem.
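
The rule above uses Python 2 type names.  In Python 3 the same two
directions exist under different names: bytes objects are decoded into
str, and str objects are encoded into bytes.  A tiny illustration,
assuming Python 3 (the sample bytes are invented):

```python
raw_bytes = b"caf\xe9"                  # ISO-8859-1 bytes, e.g. as read off the wire
text = raw_bytes.decode("iso-8859-1")   # bytes -> str ("decode")
utf8_bytes = text.encode("utf-8")       # str -> bytes ("encode")

print(ascii(text))
print(utf8_bytes)
```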

I'm no expert on the email package, but experts don't have crystal
balls, so let's gather some data for them while we're waiting for
their timezones to align:

Presumably your code is doing something like:
   msg = email.message_from_string(a_string)

Please report the results of
   print repr(a_string)
and
   print type(msg)
   print msg.items()
and tell us what you expected.
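
For anyone repeating this diagnosis on Python 3, a rough equivalent of
the requested report might look like the following.  The message text
is a made-up stand-in for whatever imaplib returned, which the thread
does not show.

```python
import email

# Invented sample input; in the original question this came from imaplib.
a_string = "To: test\nSubject: =?us-ascii?Q?Hello_world?=\n\nthis is a test.\n"

msg = email.message_from_string(a_string)
header_items = msg.items()

print(repr(a_string))   # the exact characters handed to the parser
print(type(msg))        # the message class that was built
print(header_items)     # header names with their raw, still-encoded values
```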

Cheers,
John