[issue28945] get_boundary invokes unquote twice

2016-12-22 Thread Eric Lafontaine

Eric Lafontaine added the comment:

Hi,

I would like to thank you for keeping up with me.  I may not be easy at times, 
but please make me understand if it doesn't make sense :).

Thanks bpoaugust and David,
Eric Lafontaine

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-22 Thread Eric Lafontaine

Eric Lafontaine added the comment:

Hi,

I would like to have a bigger set of user testing this patch before it gets 
approve... I really don't know all the extent to which it's a good/bad idea.  

The proposition was to not do the unquote in the rfc2231 collapse method.  It 
doesn't break anything on my machine on all the email tests... but I can't feel 
safe.

I also added the 2 test case for future support (multiline boundary and single 
line boundary).  It's basically the code I pasted on my answer on the 
2016-12-20.

Let me know what you think.  Is it there we should stop doing it?

Regards,
Eric Lafontaine

--
keywords: +patch
Added file: http://bugs.python.org/file46003/issue_28945.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-20 Thread R. David Murray

R. David Murray added the comment:

The RFC compliance battle in email was lost long ago.  We have to do our best 
to process what we get, and to produce only RFC correct output (unless 
retransmitting without change).  That is Postel's law, and you can't 
successfully process Internet email without following the first part.

I haven't looked at the proposed solution yet.  Not sure when I'll get time, 
since I should try to think and research the questions (why was it done the way 
it is currently done?), which will take quite a bit of time.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-20 Thread Eric Lafontaine

Eric Lafontaine added the comment:

N.B., I've tried to do the keyword parameter on the get_param inside the 
get_boundary as well as I though it was a good Idea as well, but it was 
breaking some test cases (multiple ones).  So I didn't pursue.

N.B.  I'm guessing (guessing) unquote must've been used for email adresses as 
well i.e. "Eric Lafontaine "

I also didn't had the time to read all answer since yesterday, sorry for not 
adding it to my earlier answer.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-20 Thread Eric Lafontaine

Eric Lafontaine added the comment:

Hi all,

I hate this proposition.  I feel it's a "victory" for the people who don't want 
to follow RFC standard and allow "triple"-quoting on things that aren't 
supposed to...

Now that my opinion is said, I made 2 test case that should be added to the 
test_email file of the test library to support the change :

_
def test_rfc2231_multiple_quote_boundary(self):
m = '''\
Content-Type: multipart/alternative;
\tboundary*0*="<')
def test_rfc2231_multiple_quote_boundary2(self):
m = '''\
Content-Type: multipart/alternative;
\tboundary="<>";

'''
msg = email.message_from_string(m)
self.assertEqual(msg.get_boundary(),
 '<>')
___

The problem however does lie within collapse function as you've mentionned 
(because you've taken the code from there haven't you? :P) :
def collapse_rfc2231_value(value, errors='replace',
   fallback_charset='us-ascii'):
if not isinstance(value, tuple) or len(value) != 3:
return value <===removed 
unquote on value

 This doesn't seem to have broken any of the 1580 tests case of test_email 
(Python3.7), but I can't help but to feel weird about it.  Could someone 
explain why we were unquoting there? and why use those weird condition (not 
isintance(value,tuple) or len(value) !=3)?
 
 Regards,
 Eric Lafontaine

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-20 Thread R. David Murray

R. David Murray added the comment:

I'm pretty sure you are correct in your guess that the utility is used by other 
code that depends on the <> unquoting.

The new email policies (that is, the new header parsing code) probably do a 
better job of this, but get_param and friends haven't been wired up to use them 
when available yet.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-20 Thread bpoaugust

bpoaugust added the comment:

According to RFC822, a quoted-string should only be wrapped in double-quotes. 

So I'm not sure why unquote treats <> as quotes. If it did not, then again this 
issue would not arise.

However maybe utils.unquote is needed by other code that uses <>, so it cannot 
just be updated without further analysis.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-20 Thread bpoaugust

bpoaugust added the comment:

It looks like a simpler alternative is to just change

boundary = self.get_param('boundary', missing)
to
boundary = self.get_param('boundary', missing, unquote=False)

and let collapse_rfc2231_value do the unquoting

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-19 Thread bpoaugust

bpoaugust added the comment:

I agree that strictly speaking the boundary is invalid.
However:
'Be strict in what you generate, be liberal in what you accept'
The mail package should never create such boundaries.
However it should process them if possible.

If the boundary definition is mangled by stripping out all the invalid 
characters, then it won't match the markers. So it won't solve the issue.

Whereas ensuring that only a single level of quotes is removed does fix the 
issue.

This is what works for me:

def get_boundary(self, failobj=None):
missing = object()
# get_param removes the outer quotes
boundary = self.get_param('boundary', missing)
if boundary is missing:
return failobj
# Don't unquote again in collapse_rfc2231_value
if not isinstance(boundary, tuple) or len(boundary) != 3:
return boundary
# RFC 2046 says that boundaries may begin but not end in w/s
return utils.collapse_rfc2231_value(boundary).rstrip()

I think the bug is actually in collapse_rfc2231_value - that should not do any 
unquoting, as the value will be passed to it already unquoted (at least in this 
case). However there might perhaps be some cases where collapse_rfc2231_value 
is called with a quoted value, so to fix the immediate problem it's safer to 
fix get_boundary. (I could have re-quoted the value instead, and let 
collapse_rfc2231_value do its thing.)

unquote is correct as it stands - it should only remove the outer quotes. There 
may be a need to quote strings that just happen to be enclosed in quote chars.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-19 Thread Eric Lafontaine

Eric Lafontaine added the comment:

Hi all,

I believe this is the right behavior and what ever generated the boundary 
"<<>>" is the problem ; 


RFC  2046 page 22:
_
The only mandatory global parameter for the "multipart" media type is the 
boundary parameter, which consists of 1 to 70 characters from a set of 
characters known to be very robust through mail gateways, and NOT ending with 
white space. (If a boundary delimiter line appears to end with white space, the 
white space must be presumed to have been added by a gateway, and must be 
deleted.)  It is formally specified by the following BNF:

 boundary := 0*69 bcharsnospace

 bchars := bcharsnospace / " "

 bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
  "+" / "_" / "," / "-" / "." /
  "/" / ":" / "=" / "?"
_
In other words, the only valid boundaries characters are :
01234567890 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'()+_,-./:=?

Any other character should be removed to get the boundary right.  I believe the 
issue is that it wasn't removed in the first place.  It is a bug in my opinion, 
but the other way around :).

Funny thing is that the unquote function only remove the first character 
it sees... either '<' and the '"'...

def unquote(str):
"""Remove quotes from a string."""
if len(str) > 1:
if str.startswith('"') and str.endswith('"'):
return str[1:-1].replace('', '\\').replace('\\"', '"')
if str.startswith('<') and str.endswith('>'):
return str[1:-1]
return str

Now, if I modify unquote to only keep the list of character above, would I 
break something? Probably. 
I haven't found any other defining RFC about boundaries that tells me what was 
the format supported.  Can someone help me on that?

This is what the function should look like :

import string
def get_boundary(str):
""" return the valid boundary parameter as per RFC 2046 page 22. """
if len(str) > 1:
import re
return re.sub('[^'+
  string.ascii_letters +
  string.digits +
  r""" '()+_,-./:=?]|="""
  ,'',str
  ).rstrip(' ')
return str

import unittest

class boundary_tester(unittest.TestCase):
def test_get_boundary(self):
boundary1 = """ abc def gh< 123 >!@ %!%' """
ref_boundary1 = """ abc def gh 123  '""" # this is the valid Boundary
ret_value = get_boundary(boundary1)
self.assertEqual(ret_value,ref_boundary1)

def test_get_boundary2(self):
boundary1 = ''.join((' ',string.printable))
ref_boundary1 = ' 
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'()+,-./:?_' # 
this is the valid Boundary
ret_value = get_boundary(boundary1)
self.assertEqual(ret_value,ref_boundary1)


I believe this should be added to the email.message.Message get_boundary 
function.  

Regards,
Eric Lafontaine

--
nosy: +Eric Lafontaine

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28945] get_boundary invokes unquote twice

2016-12-12 Thread bpoaugust

New submission from bpoaugust:

get_boundary calls get_param('boundary') which unquotes the value.
It then calls utils.collapse_rfc2231_value which also calls unquote.

This causes problems for boundaries that have two sets of quotes.
For example, I have seen the following in the wild:

Content-Type: multipart/mixed; boundary="<<001-3e1dcd5a-119e>>"

Both "" and <> are treated as quotes by unquote.

The result is that both "< and >" are stripped off.
This means that the boundaries no longer match.

--
components: email
messages: 282991
nosy: barry, bpoaugust, r.david.murray
priority: normal
severity: normal
status: open
title: get_boundary invokes unquote twice
type: behavior
versions: Python 3.4, Python 3.5

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com