[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-07-17 Thread Ram Rachum


Change by Ram Rachum :


--
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-02-03 Thread Mark Dickinson


Change by Mark Dickinson :


--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-02-03 Thread STINNER Victor


STINNER Victor  added the comment:

> Hi guys,

sidenote: please avoid "hi guys" which is not inclusive, see 
https://heyguys.cc/ Thanks ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-31 Thread Ram Rachum


Ram Rachum  added the comment:

Hi guys,

Can we please have a decision on whether we want to move forward with this 
issue or not? I have a patch in the PR, and I'll be happy to continue working 
on it, or close this issue if that's the decision.


Thanks,
Ram.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-10 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

> If all we're going to do is 
> switch [0-9] to \d (which won't work for the places where it's 
> actually [1-9], mind you)

Ah, that's a good point.

[...]
> we're better off resolving the 
> inconsistency by requiring ASCII digits and considering non-ASCII 
> support to be a separate feature request.

This seems reasonable.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-10 Thread Paul Ganssle

Paul Ganssle  added the comment:

> Yes, but not within the same format. If someone were to choose the format 
> '2014-04-10T24:00:00', they would have a reasonable expectation that there is 
> only one unique string that corresponds with that datetime

That's a particularly bad example, because it's exactly the same as another 
string with the exact same format:

  2014-04-11T00:00:00

Since ISO 8601 allows you to specify midnight (and only midnight) using 
previous day + 24:00. Admittedly, that is the only ambiguity I know of offhand 
(though it's a huge spec) *for a given format*, but also ISO 8601 does not 
really have a concept of format specifiers, so it's not like there's a way to 
unambiguously specify the format you are intending to use.

Either way, I think we can explicitly dispense with "there will be an exact 
mapping between a given (format_str, datetime_str) pair and the datetime it 
produces" as a goal here. I can't think of any good reason you'd want that 
property, nor have we made any indication that I can see that we provide it 
(probably the opposite, since there are some formats that explicitly ignore 
whitespace).

> Okay, since it seems like I'm the only one who wants this change, I'll let it 
> go. Thanks for your input.

I wouldn't go that far. I think I am +0 or +1 on this change, I just wanted to 
be absolutely clear *why* we're doing this. I don't want someone pointing at 
this thread in the future and saying, "Core dev says that it's a bug in their 
code if they don't follow X standard / if more than one string produces the 
same datetime / etc".

I think the strongest argument for making this or a similar change is that I'm 
fairly certain that we don't have the bandwidth to handle internationalized 
dates and I don't think we have much to gain by doing a sort of half-assed 
version of that by accepting unicode transliterations of numerals and calling 
it a day. I think there are tons of edge cases here that could bite people, and 
if we don't support this *now* I'd rather give people an error message early in 
the process and try to point people at a library that is designed to handle 
datetime localization issues. If all we're going to do is switch [0-9] to \d 
(which won't work for the places where it's actually [1-9], mind you), I think 
people will get a better version of that with something like:

  def normalize_dt_str(dt_str):
  return "".join(str(int(x)) if x.isdigit() else x
 for x in dt_str)

There are probably more robust and/or faster versions of this, but it's 
probably roughly equivalent to what we'd be doing here *anyway*, and at least 
people would have to opt-in to this.

I am definitely open to us supporting non-ASCII digits in strptime if it would 
be useful at the level of support we could provide, but given that it's 
currently broken for any reasonable use case and as far as I know no one has 
complained, we're better off resolving the inconsistency by requiring ASCII 
digits and considering non-ASCII support to be a separate feature request.

CC-ing Inada on this as unicode guru and because he might have some intuition 
about how useful non-ASCII support might be. The only place I've seen non-ASCII 
dates is in Japanese graveyards, and those tend to use Chinese numerals (which 
don't match \d anyway), though Japanese and Korean also tends to make heavier 
use of "full-width numerals" block, so maybe parsing something like 
"2020-02-02" is an actual pain point that would be improved by this change 
(though, again, I suspect that this is just the beginning of the required 
changes and we may never get a decent implementation that supports unicode 
numerals).

--
nosy: +inada.naoki

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-10 Thread Ram Rachum


Ram Rachum  added the comment:

Okay, since it seems like I'm the only one who wants this change, I'll let it 
go. Thanks for your input.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-10 Thread STINNER Victor


Change by STINNER Victor :


--
nosy: +vstinner

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-09 Thread Steven D'Aprano

Steven D'Aprano  added the comment:

> > ISO 8601 specifies dozens of ways to represent the same datetime
> 
> Yes, but not within the same format. If someone were to choose the 
> format '2014-04-10T24:00:00', they would have a reasonable expectation 
> that there is only one unique string that corresponds with that 
> datetime.

There are no format code in '2014-04-10T24:00:00', it cannot be a 
format. Without knowing the format, there is no way of telling what that 
string represents or whether it is a unique representation.

- Is the leading 0 in '04' mandatory or optional?
- Could the '00's be replaced by '0' or even left out altogether?
- Is 24:00:00 the exact same time as 00:00:00?
- If the final '00' represents seconds, does it accept microseconds '.000'?

> > As far as the inconsistency, I think that's an argument for being 
> > less strict, not more, and allowing non-ASCII digits in more places 
> > not just the first. Why shouldn't (let's say) a Bengali user specify 
> > the day of the month using Bengali digits?
> 
> That's an interesting direction. It might be a good thing to allow 
> users in different locales to specify datetime with their native 
> digits. But, if we were to go in that direction, we would have to:
> 
> 1. Define that use case more rigorously, consider whether it has any 
> effects we need to consider, like clashes with how the local culture 
> expresses dates.

I don't think so. Supporting different calendars is a seperate, and far 
more complex issue, than just supporting dumb transliteration of 
Hindu-Arabic digits 0...9 into other character sets.

This makes datetime behave more like int(), float(), etc, all of which 
accept non-ASCII digits including (e.g.) East-Asian double-width digits.

py> int('5432')
5432

Supporting non-Western calendars is a worthy thing, but probably best 
left for a third-party library.

> 2. Enable it in other fields except year.

That part is easy: replace `[0-9]` in regexes with `\d`.

> 3. Add tests for it. 

Aye, that would need to be done. Adding the tests would be the hard 
part.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-09 Thread Ram Rachum


Ram Rachum  added the comment:

> To me, this seems like a pretty thin justification for calling this a 
> security vulnerability.

I know that lots of exploits were made because of bugs in Python's URL parsing, 
and security releases of Python were made because of that. It's possible that 
similar exploits could be done for bugs in parsing datetimes. It's different, 
and it's a thin justification, I agree, so if it's treated as a non-security 
bug, I'm okay with that. 


> ISO 8601 specifies dozens of ways to represent the same datetime

Yes, but not within the same format. If someone were to choose the format 
'2014-04-10T24:00:00', they would have a reasonable expectation that there is 
only one unique string that corresponds with that datetime.

> As far as the inconsistency, I think that's an argument for being less 
> strict, not more, and allowing non-ASCII digits in more places not just the 
> first. Why shouldn't (let's say) a Bengali user specify the day of the month 
> using Bengali digits?

That's an interesting direction. It might be a good thing to allow users in 
different locales to specify datetime with their native digits. But, if we were 
to go in that direction, we would have to: 

1. Define that use case more rigorously, consider whether it has any effects we 
need to consider, like clashes with how the local culture expresses dates. 
Maybe we accept the year 2020 in Elbonian digits, but the Elbonian people count 
their years from the birth of Saint Elbon, and for them the year 2020 in 
Elbonian digits means -1553 BC for us?
2. Enable it in other fields except year.
3. Add tests for it. 

Since we're not going to do that ever, I don't think there's a point in leaving 
a 10% implementation of non-Ascii datetime parsing.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-09 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

> If user code were to check for uniqueness of a datetime by comparing it as a 
> string, this is where an attacker could fool this logic, by using a non-Ascii 
> digit.

To me, this seems like a pretty thin justification for calling this a security 
vulnerability.

Using the exact same reasoning, one could argue that "If user code were to 
check for uniqueness of a float by comparing it as a string, this is where an 
attacker could fool this logic, by using leading or trailing spaces, extra 
non-significant digits, upper- or lowercase 'E', etc."

py> float(" +00012.14509e00 ") == float("12.145")
True

Referring specifically to strptime(), there are many format codes which break 
uniqueness by allowing optional leading zeroes, and month names are case 
insensitive e.g. %b accepts 'jAn' as well as 'Jan'.

https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

As far as the inconsistency, I think that's an argument for being less strict, 
not more, and allowing non-ASCII digits in more places not just the first. Why 
shouldn't (let's say) a Bengali user specify the day of the month using Bengali 
digits?

--
nosy: +steven.daprano

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-09 Thread Paul Ganssle


Paul Ganssle  added the comment:

I don't love the inconsistency, but can you elaborate on the actual *danger* 
posed by this? What security vulnerabilities involve parsing a datetime using a 
non-ascii digit?

The reason that `fromisoformat` doesn't accept non-ASCII digits is actually 
because it's the inverse of `datetime.isoformat`, which never *emits* non-ASCII 
digits. For `strptime`, we're really going more for a general specifier for 
parsing datetime strings in a given format. I'll note that we do accept any 
valid unicode character for the date/time separator.

>From my perspective, there are a few goals, some of which may be in conflict 
>with the others:

1. Mitigating security vulnerabilities, if they exist.
2. Supporting international locales if possible.
3. Improving consistency in the API.

If no one ever actually specifies datetimes in non-ascii locales (and this 
gravestone that includes the date in both Latin and Chinese/Japanese characters 
seems to suggest otherwise: 
https://jbnewall.com/wp-content/uploads/2017/02/LEE-MONUMENT1-370x270.jpg ), 
then I don't see a problem dropping our patchy support, but I think we need to 
carefully consider the backwards compatibility implications if we go through 
with that.

One bit of evidence in favor of "no one uses this anyway" is that no one has 
yet complained that apparently this doesn't work for "%d" even if it works for 
"%y", so presumably it's not heavily used. If our support for this sort of 
thing is so broken that no one could possibly be using it, I suppose we may as 
well break it all the way, but it would be nice to try and identify some 
resources that the documentation can point to for how to handle international 
date parsing.


> Note the "unique and unambiguous". By accepting non-Ascii digits, we're 
> breaking the uniqueness requirement of ISO 8601.

I think in this case "but the standard says X" is probably not a very strong 
argument. 
 Even if we were coding to the ISO 8601 standard (I don't think we claim to be, 
we're just using that convention), I don't really know how to interpret the 
"unique" portion of that claim, considering that ISO 8601 specifies dozens of 
ways to represent the same datetime. Here's an example from [my 
`dateutil.parse.isoparse` test 
suite](https://github.com/dateutil/dateutil/blob/110a09b4ad46fb87ae858a14bfb5a6b92557b01d/dateutil/test/test_isoparser.py#L150):

```
'2014-04-11T00',
'2014-04-10T24',
'2014-04-11T00:00',
'2014-04-10T24:00',
'2014-04-11T00:00:00',
'2014-04-10T24:00:00',
'2014-04-11T00:00:00.000',
'2014-04-10T24:00:00.000',
'2014-04-11T00:00:00.00',
'2014-04-10T24:00:00.00'
```

All of these represent the exact same moment in time, and this doesn't even get 
into using the week-number/day-number configurations or anything with time 
zones. They also allow for the use of `,` as the subsecond-component separator 
(so add 4 more variants for that) and they allow you to leave out the dashes 
between the date components and the colons between time components, so you can 
multiply the possible variants by 4.

Just a random aside - I think there may be strong arguments for doing this even 
if we don't care about coding to a specific standard.

--
nosy: +p-ganssle

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-09 Thread Ram Rachum


Change by Ram Rachum :


--
keywords: +patch
pull_requests: +17337
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/17931

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39280] Don't allow datetime parsing to accept non-Ascii digits

2020-01-09 Thread Ram Rachum

New submission from Ram Rachum :

I've been doing some research into the use of `\d` in regular expressions in 
CPython, and any security vulnerabilities that might happen as a result of the 
fact that it accepts non-Ascii digits like ٢ and 5.

In most places in the CPython codebase, the `re.ASCII` flag is used for such 
cases, thus ensuring the `re` module prohibits these non-Ascii digits. 
Personally, my preference is to never use `\d` and always use `[0-9]`. I think 
that it's rule that's more easy to enforce and less likely to result in a 
slipup, but that's a matter of personal taste.

I found a few places where we don't use the `re.ASCII` flag and we do accept 
non-Ascii digits.

The first and less interesting place is platform.py, where we define patterns 
used for detecting versions of PyPy and IronPython. I don't know how anyone 
would exploit that, but personally I'd change that to a [0-9] just to be safe. 
I've opened bpo-39279 for that. 

The more sensitive place is the `datetime` module. 

Happily, the `datetime.datetime.fromisoformat` function rejects non-Ascii 
digits. But the `datetime.datetime.strptime` function does not: 

from datetime import datetime

time_format = '%Y-%m-%d'
parse = lambda s: datetime.strptime(s, time_format)
   
x = '٢019-12-22'
y = '2019-12-22'
assert x != y
assert parse(x) == parse(y)
print(parse(x))
# Output: 2019-12-22 00:00:00

If user code were to check for uniqueness of a datetime by comparing it as a 
string, this is where an attacker could fool this logic, by using a non-Ascii 
digit.

Two more interesting points about this: 

1. If you'd try the same trick, but you'd insert ٢ in the day section instead 
of the year section, Python would reject that. So we definitely have 
inconsistent behavior.
2. In the documentation for `strptime`, we're referencing the 1989 C standard. 
Since the first version of Unicode was published in 1991, it's reasonable not 
to expect the standard to support digits that were introduced in Unicode.

If you'd scroll down in that documentation, you'll see that we also implement 
the less-known ISO 8601 standard, where `%G-%V-%u` represents a year, week 
number, and day of week. The `%G` is vulnerable:

from datetime import datetime

time_format = '%G-%V-%u'
parse = lambda s: datetime.strptime(s, time_format)
   
x = '٢019-53-4'
y = '2019-53-4'
assert x != y
assert parse(x) == parse(y)
print(parse(x))
# Output: 2020-01-02 00:00:00

I looked at the ISO 8601:2004 document, and under the "Fundamental principles" 
chapter, it says:

This International Standard gives a set of rules for the representation of
time points
time intervals
recurring time intervals.
Both accurate and approximate representations can be identified by 
means of unique and unambiguous expressions specifying the relevant dates, 
times of day and durations.  

Note the "unique and unambiguous". By accepting non-Ascii digits, we're 
breaking the uniqueness requirement of ISO 8601.

--
components: Library (Lib)
messages: 359695
nosy: cool-RR
priority: normal
severity: normal
status: open
title: Don't allow datetime parsing to accept non-Ascii digits
type: security
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com