[issue20559] urllib/http fail to sanitize a non-ascii url

2021-12-10 Thread STINNER Victor


Change by STINNER Victor:


--
nosy:  -vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20559] urllib/http fail to sanitize a non-ascii url

2021-12-10 Thread Irit Katriel


Irit Katriel added the comment:

Reproduced on 3.11.

--
nosy: +iritkatriel
versions: +Python 3.10, Python 3.11, Python 3.9 -Python 3.3




[issue20559] urllib/http fail to sanitize a non-ascii url

2017-01-18 Thread Martin Panter

Martin Panter added the comment:

See also Issue 3991 with proposals for handling non-ASCII as new features.

--
nosy: +martin.panter




[issue20559] urllib/http fail to sanitize a non-ascii url

2014-02-14 Thread Éric Araujo

Éric Araujo added the comment:

Even though Python 3’s text model is based on Unicode, some data formats have 
their own rules.  There’s a long-running debate about whether URIs should be 
bytes or text.  Unlike web browsers, urllib/httplib don’t seem to try to be 
smart with the URIs they are given; they just require them to be properly 
formatted, i.e. free of spaces and of any other characters that must be 
%-encoded.

Is the documentation clear about this behaviour?  If not, it would probably be 
simpler to improve the documentation rather than change the behaviour.
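
As a minimal illustration of the "properly formatted" requirement, the path 
from the original report can be %-encoded with urllib.parse.quote before it is 
handed to urllib.  This is only a sketch; the variable names are made up for 
the example.

```python
from urllib.parse import quote

# Path taken from the original report; "\xe1" is the character á.
raw_path = "/wiki/Antonio Vallejo-N\xe1jera"

# quote() percent-encodes the UTF-8 bytes of every character outside its
# safe set; "/" is safe by default, so the path separators survive.
encoded_path = quote(raw_path)

print(encoded_path)  # /wiki/Antonio%20Vallejo-N%C3%A1jera

# The result is pure ASCII, so encoding it no longer raises.
encoded_path.encode("ascii")
```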

--
nosy: +eric.araujo




[issue20559] urllib/http fail to sanitize a non-ascii url

2014-02-07 Thread Bill Winslow

Bill Winslow added the comment:

Follow up -- I need to use urllib.parse.quote to safely encode a url -- though 
if I may be so bold, I submit that since much of the goal of Python 3 was to 
make unicode "just work", I the (stupid) user shouldn't have to remember to 
safely encode unicode urls...

A reasonable way to do it would be to insert the following in place of 
urllib/request.py line 469 (which is OpenerDirector.open()):

response = self._open(req, data)

would become

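try:
    response = self._open(req, data)
except UnicodeEncodeError as e:
    # The traceback in the original report shows UnicodeEncodeError.
    # ':' is kept safe so the scheme separator is not escaped.
    req.full_url = quote(req.full_url, safe=':/%')
    response = self._open(req, data)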

That's untested of course, but hopefully it'll encourage discussion.
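
One possible way to do the quoting on the caller's side, sketched here under 
the assumption that only the path, query and fragment need escaping: split the 
URL first, so that the scheme and host are left untouched.  The helper name 
quote_url and its safe sets are hypothetical and not part of urllib; non-ASCII 
host names are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path, query and fragment of url (hypothetical helper)."""
    parts = urlsplit(url)
    # "%" is kept safe so that an already-encoded URL is not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    # Scheme and netloc are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


url = "http://en.wikipedia.org/wiki/Antonio Vallejo-N\xe1jera"  # \xe1 is á
print(quote_url(url))
# http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera
```

The result could then be passed to urllib.request.Request as usual.  Calling 
quote_url on its own output returns the same string, because "%" is in every 
safe set.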




[issue20559] urllib/http fail to sanitize a non-ascii url

2014-02-07 Thread Bill Winslow

New submission from Bill Winslow:

The following code will produce a UnicodeEncodeError about a character being 
non-ascii:

from urllib import request, parse, error
url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
req = request.Request(url)
response = request.urlopen(req)

This fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

I examined the library code in question: line 958 in http/client.py, the line 
before the one that barfs, contains the following comment: 

# Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

print(request)
self._output(request.encode('ascii'))

This prints the following: 

>>> response = request.urlopen(req)
GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
Traceback (most recent call last): ...

I confirmed that the character at position 27 mentioned in the traceback is in 
fact the á in the last name. Clearly either urllib or http is not properly 
sanitizing the url -- unfortunately, I don't know enough to determine where the 
actual error is; hopefully this report contains enough detail to make that 
easy.
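
The failing step can be reproduced without any network access by encoding the 
printed request line directly.  This is a minimal sketch of that one call, not 
the library code itself; the variable names are made up for the example.

```python
# The request line printed above; "\xe1" is the character á.
request_line = "GET /wiki/Antonio Vallejo-N\xe1jera HTTP/1.1"

failure = None
try:
    # Same operation as the failing line in http/client.py.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    failure = exc
    # exc.start is the index of the first character that cannot be encoded.
    print(exc.start, repr(request_line[exc.start]))
```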

--
components: Library (Lib), Unicode
messages: 210587
nosy: Dubslow, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: urllib/http fail to sanitize a non-ascii url
type: behavior
versions: Python 3.3
