#29323: HTTPRequest QueryDict wrongly decodes binary encoded values
---------------------------------+--------------------------------------
     Reporter:  Thomas Riccardi  |                    Owner:  nobody
         Type:  Bug              |                   Status:  new
    Component:  HTTP handling    |                  Version:  1.11
     Severity:  Normal           |               Resolution:
     Keywords:                   |             Triage Stage:  Unreviewed
    Has patch:  0                |      Needs documentation:  0
  Needs tests:  0                |  Patch needs improvement:  0
Easy pickings:  0                |                    UI/UX:  0
---------------------------------+--------------------------------------
Changes (by Thomas Riccardi):

 * status:  closed => new
 * resolution:  wontfix =>


Old description:

> `application/x-www-form-urlencoded` bodies allow binary values: just
> percent-escape (most of) the bytes.
>
> It doesn't work with python2 and django.
>
> In python2, QueryDict returns an `unicode` string for values, which makes
> it impossible to represent binary data.
> Worse: it returns an `unicode` string whose unicode code points are the
> raw expected bytes (caused by lax behavior of `str.decode('iso-8859-1')`)
> so it silently fails.
>
> Example:
> {{{#!python
> q = QueryDict('foo=%00%7f%80%ff')
> q['foo']
> # actual:
> # u'\x00\x7f\x80\xff'
> # expected:
> # '\x00\x7f\x80\xff'
> }}}
>
> Relevant code:
> https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401
>
> This was introduced by
> https://github.com/django/django/commit/fa02120d360387bebbbe735e86686bb4c7c43db2
> while trying to accept broken user agents that send non-ascii as
> urlencoded raw values:
> - the python 3 version seems fine: it tries to decode '''before'''
> calling the query string parser (from `urllib`)
> - the python 2 version made a mistake: it decodes '''after''' calling the
> query string parser (and only on the value, not the key strangely)
>

> I'm not sure how to fix this, as there are many locations where the key
> and value are re-encoded (`appendlist` does it too..)
>

> I have not (yet) tested with python3.

New description:

 `application/x-www-form-urlencoded` bodies allow binary values: just
 percent-escape (most of) the bytes.

 It doesn't work with python2 and django.

 In python2, QueryDict returns an `unicode` string for values, which makes
 it impossible to represent binary data.
 Worse: it returns an `unicode` string whose unicode code points are the
 raw expected bytes (caused by lax behavior of `str.decode('iso-8859-1')`)
 so it silently fails.

 Example:
 {{{#!python
 q = QueryDict('foo=%00%7f%80%ff')
 q['foo']
 # actual:
 # u'\x00\x7f\x80\xff'
 # expected:
 # '\x00\x7f\x80\xff'
 }}}

 Relevant code:
 
https://github.com/django/django/blob/f89b11b879b83aa505dc8231da5f06ca4b1b062e/django/http/request.py#L397-L401

 This was introduced by
 
https://github.com/django/django/commit/fa02120d360387bebbbe735e86686bb4c7c43db2
 while trying to accept broken user agents that send non-ascii as
 urlencoded raw values:
 - the python 3 version seems fine: it tries to decode '''before''' calling
 the query string parser (from `urllib`)
 - the python 2 version made a mistake: it decodes '''after''' calling the
 query string parser (and only on the value, not the key strangely)


 I'm not sure how to fix this, as there are many locations where the key
 and value are re-encoded (`appendlist` does it too..)


 In python 3 it fails too, but probably only because of the forced post-
 query-string-parser decoding into unicode string: to support binary values
 we don't want decoding at this point.

 A test for python3:
 {{{#!python
     def test_binary_input(self):
         """
         QueryDicts must be able to handle binary.
         """
         q = QueryDict(b'foo=%00%7f%80%ff')
         self.assertListEqual(list(map(ord, q['foo'])), [0x00, 0x7F, 0x80,
 0xFF])
 }}}
 {{{
 AssertionError: Lists differ: [0, 127, 65533, 65533] != [0, 127, 128, 255]

 First differing element 2:
 65533
 128

 - [0, 127, 65533, 65533]
 + [0, 127, 128, 255]
 }}}

--

Comment:

 OK.

 I tested with python3 and it breaks too, but only because the code
 explicitly tries to generates an unicode string (`bytes_to_text` in
 `appendlist`).

 I updated the description to also add the python3 test.

 To support binary we would need to add an explicit option to get the
 `bytes` (maybe an encoding that asks to keep `bytes`?); not sure how to
 expose it in `HTTPRequest`.

-- 
Ticket URL: <https://code.djangoproject.com/ticket/29323#comment:3>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-updates+unsubscr...@googlegroups.com.
To post to this group, send email to django-updates@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/073.a4f7adeeed8ec4978bac73899e3eb7e9%40djangoproject.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to