[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Serhiy Storchaka
03.11.21 15:14, Stephen J. Turnbull wrote:
> So the only
> time that wouldn't be true is if escape sequences are allowed to
> represent characters.  I believe unicode_escape is the only codec
> that does.

Also raw_unicode_escape and utf_7. And maybe punycode or idna, I am not
sure.
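
A minimal sketch of the point, assuming the standard CPython codecs: each of the three codecs named above produces an apostrophe from input that contains no byte 39. The byte strings are illustrative inputs chosen for this example, not taken from the thread; punycode and idna are left out because their behaviour here was stated as uncertain.

```python
# Inputs that contain no apostrophe byte (value 39) but decode to one.
samples = {
    "unicode_escape": b"\\u0027",      # backslash, 'u', '0', '0', '2', '7'
    "raw_unicode_escape": b"\\u0027",  # same escape syntax
    "utf_7": b"+ACc-",                 # base64 of UTF-16BE 0x0027
}

decoded = {}
for codec, raw in samples.items():
    assert 39 not in raw               # no literal apostrophe in the bytes
    decoded[codec] = raw.decode(codec)
    print(codec, repr(decoded[codec]))
```

So a byte-level search for 39 would miss the string delimiter in any of these encodings.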

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/XRORKXTTV55YOSMP7Z7MAL4AG2UQRXHK/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Ah, okay, so much for that, then. What about the weaker sense:
 > Characters below 128 are always and only represented by those byte
 > values? So if you find byte value 39, it might not actually be an
 > apostrophe, but if you're looking for an apostrophe, you know for sure
 > that it'll be represented by byte value 39?

1.  The apostrophe that Python considers a string delimiter is always
represented by byte value 39 in the compiler input.  So the only
time that wouldn't be true is if escape sequences are allowed to
represent characters.  I believe unicode_escape is the only codec
that does.

2.  There's always eval which will accept a string containing escape
sequences.

 > Yes. I'm sure someone will come along and say "but I have to have an
 > all-ASCII source file, directly runnable, with non-ASCII variable
 > names", because XKCD 1172, but I don't have enough sympathy for that
 > obscure situation to want the mess that unicode_escape can give.

It's not an obscure situation to me.  As I wrote earlier, been there,
done that, made my own T-shirt.  I don't *think* it matters today, but
the number of DOS machines and Windows 98 machines left in Japan is
not zero.  Probably they can't run Python 3, but that's not something
I can testify to.

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/RNCM3QNGBRRM5GW6SL3Q6FP6R55F5CHU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Petr Viktorin



On 03. 11. 21 12:33, Serhiy Storchaka wrote:
> 03.11.21 12:36, Petr Viktorin wrote:
>> On 03. 11. 21 2:58, Kyle Stanley wrote:
>>> I'd suggest both: briefer, easier to read write up for average user in
>>> docs, more details/semantics in informational PEP. Thanks for working
>>> on this, Petr!
>>
>> Well, this is the brief write-up :)
>> Maybe it would work better if the info was integrated into the relevant
>> parts of the docs, rather than be a separate HOWTO.
>>
>> I went with an informational PEP because it's quicker to publish.
>
> What is the supposed target audience of this document?

Good question! At this point it looks like it's linter authors.

> If it is core
> Python developers only, then PEP is the right place to publish it. But I
> think that it rather describes potential issues in arbitrary Python
> project, and as such, it will be more accessible as a part of the Python
> documentation (as a HOW-TO article perhaps). AFAIK all other
> informational PEPs are about developing Python, not developing in Python
> (even if they are (mis)used (e.g. PEP 8) outside their scope).

There's a bunch of packaging PEPs, or a PEP on what the
/usr/bin/python command should be. I think PEP 672 is in good company
for now.

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/UTNIZZVWL56G7KSYSS67PYYZ2YPE7NX3/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Petr Viktorin

On 03. 11. 21 12:37, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano wrote:
>>
>> On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
>>
>>> TBH, I'm not entirely sure how valid it is to talk about *security*
>>> considerations when we're dealing with Python source code and variable
>>> confusions, but that's a term that is well understood.
>>
>> It's not like Unicode is the only way to write obfuscated code,
>> malicious or otherwise.
>>
>>> But to the extent that it is a security concern, it's not one that
>>> linters can really cope with. I'm not sure how a linter would stop
>>> someone from publishing code on PyPI that causes confusion by its
>>> character encoding, for instance.
>>
>> Do we require that PyPI prevents people from publishing code that causes
>> confusion by its poorly written code and obfuscated and confusing
>> identifiers?
>>
>> The linter is to *flag the issue* during, say, code review or before
>> running the code, like other code quality issues.
>>
>> If you're just running random code you downloaded from the internet
>> using pip, then Unicode confusables are the least of your worries.
>>
>> I'm not really sure why people get so uptight about Unicode confusables,
>> while being blasé about the opportunities to smuggle malicious code into
>> pure ASCII code.
>
> Right, which is why I was NOT talking about confusables. I don't
> consider them to be a particularly Unicode-related threat, although
> the larger range of available characters does make it more plausible
> than in ASCII.
>
> But I do see a problem with code where most editors misrepresent the
> code, where abuse of a purely ASCII character encoding for purely
> ASCII code can cause all kinds of tooling issues. THAT is a more
> viable attack vector, since code reviewers will be likely to assume
> that their syntax highlighting is correct.
>
> And yes, I'm aware that Python can't be expected to cope with poor
> tools, but when *many* well-known tools have the same problem, one
> must wonder who should be solving the issue.

This is a very good point. Let's not point fingers, but figure out how
to make users' lives easier together :)



This was the first time I was "in" on an embargoed "issue", and let me 
tell you, I was surprised by the amount of time spent on polishing the 
messaging. Now, you can't reasonably twist all this into a "Python is 
insecure" or "Company X products are insecure" headline, which is good, 
but with that out of the way we can focus on *what* could be improved 
over *where* the improvement could be and who should do it.

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/FNUZCNDF7K2LLHRYRDYY3ZZYISRCI4XJ/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Chris Angelico
On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano  wrote:
>
> On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
>
> > TBH, I'm not entirely sure how valid it is to talk about *security*
> > considerations when we're dealing with Python source code and variable
> > confusions, but that's a term that is well understood.
>
> It's not like Unicode is the only way to write obfuscated code,
> malicious or otherwise.
>
>
> > But to the extent that it is a security concern, it's not one that
> > linters can really cope with. I'm not sure how a linter would stop
> > someone from publishing code on PyPI that causes confusion by its
> > character encoding, for instance.
>
> Do we require that PyPI prevents people from publishing code that causes
> confusion by its poorly written code and obfuscated and confusing
> identifiers?
>
> The linter is to *flag the issue* during, say, code review or before
> running the code, like other code quality issues.
>
> If you're just running random code you downloaded from the internet
> using pip, then Unicode confusables are the least of your worries.
>
> I'm not really sure why people get so uptight about Unicode confusables,
> while being blasé about the opportunities to smuggle malicious code into
> pure ASCII code.
>

Right, which is why I was NOT talking about confusables. I don't
consider them to be a particularly Unicode-related threat, although
the larger range of available characters does make it more plausible
than in ASCII.

But I do see a problem with code where most editors misrepresent the
code, where abuse of a purely ASCII character encoding for purely
ASCII code can cause all kinds of tooling issues. THAT is a more
viable attack vector, since code reviewers will be likely to assume
that their syntax highlighting is correct.

And yes, I'm aware that Python can't be expected to cope with poor
tools, but when *many* well-known tools have the same problem, one
must wonder who should be solving the issue.

ChrisA
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/WMSJLYG5YQ7SMNHXKSXNEMM7UKKIARCN/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Serhiy Storchaka
03.11.21 12:36, Petr Viktorin wrote:
> On 03. 11. 21 2:58, Kyle Stanley wrote:
>> I'd suggest both: briefer, easier to read write up for average user in
>> docs, more details/semantics in informational PEP. Thanks for working
>> on this, Petr!
> 
> Well, this is the brief write-up :)
> Maybe it would work better if the  info was integrated into the relevant
> parts of the docs, rather than be a separate HOWTO.
> 
> I went with an informational PEP because it's quicker to publish.

What is the supposed target audience of this document? If it is core
Python developers only, then PEP is the right place to publish it. But I
think that it rather describes potential issues in arbitrary Python
project, and as such, it will be more accessible as a part of the Python
documentation (as a HOW-TO article perhaps). AFAIK all other
informational PEPs are about developing Python, not developing in Python
(even if they are (mis)used (e.g. PEP 8) outside their scope).

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/TM7EU4QHHXTJMXGQT2EJRKZYZ764HNAD/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Steven D'Aprano
On Wed, Nov 03, 2021 at 11:11:00AM +0100, Marc-Andre Lemburg wrote:

> Coming back to the thread topic, many of the Unicode security
> considerations don't apply to non-Unicode encodings, since those
> usually don't support e.g. changing the bidi direction within a
> stream of text or other interesting features you have in Unicode
> such as combining code points, invisible (space) code points, font
> rendering hint code points, etc.
> 
> So in a sense, those non-Unicode encodings are safer than
> using UTF-8 :-)

Thank you MAL for that timely reminder that most encodings are not 
Unicode. I have to admit that I often forget that there is a whole 
universe of non-Unicode, non-ASCII encodings.


> Please also note that most character lookalikes are not encoding
> issues, but instead font issues, which then result in the characters
> looking similar.

+1


-- 
Steve
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/NJFO5C7367F4NLLQTJRNNNUCRRLA6BES/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Steven D'Aprano
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:

> TBH, I'm not entirely sure how valid it is to talk about *security*
> considerations when we're dealing with Python source code and variable
> confusions, but that's a term that is well understood.

It's not like Unicode is the only way to write obfuscated code, 
malicious or otherwise.


> But to the extent that it is a security concern, it's not one that
> linters can really cope with. I'm not sure how a linter would stop
> someone from publishing code on PyPI that causes confusion by its
> character encoding, for instance.

Do we require that PyPI prevents people from publishing code that causes 
confusion by its poorly written code and obfuscated and confusing 
identifiers?

The linter is to *flag the issue* during, say, code review or before 
running the code, like other code quality issues.

If you're just running random code you downloaded from the internet 
using pip, then Unicode confusables are the least of your worries.

I'm not really sure why people get so uptight about Unicode confusables, 
while being blasé about the opportunities to smuggle malicious code into 
pure ASCII code.

https://en.wikipedia.org/wiki/Underhanded_C_Contest

Is it unfamiliarity? Worse? "Real programmers write identifiers in 
English." And the ironic thing is, while it is very difficult indeed for 
automated checkers to detect underhanded code in ASCII, it is trivially 
easier for editors, linters and other tools to spot the sort of Unicode 
confusables we're talking about here. But we spend all our energy 
worrying about the minor issue, and almost none on the broader problem 
of malicious code in general.

I'm pretty sure I could upload a library to PyPI that included

os.system('rm -rf .')

and nobody would blink an eye, but if I write:

A = 1
А = 2
Α = 3
print(A, А, Α)

everyone goes insane. Let's keep the threat in perspective. Writing an 
informational PEP for the education of people is a great idea. Rushing 
into making wholesale changes to the interpreter, not so much.
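
To illustrate how easily a tool can tell such identifiers apart, here is a minimal sketch using only the standard `unicodedata` module. It assumes the second and third identifiers above are CYRILLIC CAPITAL LETTER A (U+0410) and GREEK CAPITAL LETTER ALPHA (U+0391); the helper names `describe` and `non_ascii_identifiers` are made up for this example.

```python
import unicodedata

# The three look-alike identifiers, written with escapes so that the
# difference is visible in any editor (assumed code points, see above).
identifiers = ["A", "\u0410", "\u0391"]


def describe(identifier):
    """Return the Unicode name of each character in an identifier."""
    return [unicodedata.name(ch) for ch in identifier]


def non_ascii_identifiers(names):
    """Return the identifiers that contain any non-ASCII character."""
    return [name for name in names if not name.isascii()]


for ident in identifiers:
    print(ascii(ident), describe(ident))
```

A linter could report the result of `non_ascii_identifiers` for review rather than reject it outright, since non-ASCII identifiers are legitimate in many code bases.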


-- 
Steve
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/YGPSWZL4Z7LKTUHC25JVMHA5LUSLLQEL/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Petr Viktorin

On 03. 11. 21 2:58, Kyle Stanley wrote:
> I'd suggest both: briefer, easier to read write up for average user in
> docs, more details/semantics in informational PEP. Thanks for working on
> this, Petr!

Well, this is the brief write-up :)
Maybe it would work better if the info was integrated into the relevant
parts of the docs, rather than be a separate HOWTO.

I went with an informational PEP because it's quicker to publish.

> On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. wrote:
>> This is an amazing document, Petr. Really great work!
>>
>> I think I agree with Marc-André that putting it in the actual Python
>> documentation would give it more visibility than in a PEP.
>>
>> On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <m...@egenix.com> wrote:
>>> On 01.11.2021 13:17, Petr Viktorin wrote:
>>>> PEP:
>>>> Title: Unicode Security Considerations for Python
>>>> Author: Petr Viktorin <encu...@gmail.com>
>>>> Status: Active
>>>> Type: Informational
>>>> Content-Type: text/x-rst
>>>> Created: 01-Nov-2021
>>>> Post-History:
>>>
>>> Thanks for writing this up. I'm not sure whether a PEP is the right place
>>> for such documentation, though. Wouldn't it be more visible in the standard
>>> Python documentation?
>>>
>>> --
>>> Marc-Andre Lemburg


Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/ZNANXZ7VP6CVDAGWEFXHKYFO6AR3MZXQ/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Paul Moore
On Wed, 3 Nov 2021 at 10:11, Marc-Andre Lemburg  wrote:
> I don't think limiting the source code encoding is the right approach
> to making code more secure. Instead, tooling has to be used to detect
> potentially malicious code points in code.

+1

Discussing "making code more secure" without being clear on what the
threat model is, is always going to be inconclusive. In this case, I
believe the threat model is "an untrusted 3rd party submitting a PR
which potentially contains malicious code to a Python project". For
that threat, I think the correct approach is for core Python to
promote awareness (via this PEP and maybe something in the docs
themselves) and for projects to implement appropriate code checks that
are run against all PRs to flag this sort of issue.

What threat can't be addressed at a per-project level, but *can* be
addressed in core Python (without triggering so many false positives
that people are trained to ignore the warnings or work around the
prohibitions, defeating the purpose of the change)?

Paul
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/FQ42C66BVCE6AQFSP4J6V6ERS4VV44MK/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Marc-Andre Lemburg
On 03.11.2021 01:21, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano  wrote:
>>
>> On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
>>> On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
>>>> Let me know if it's clear in the newest version, with this note:
>>>>
>>>>> Here, ``encoding: unicode_escape`` in the initial comment is an encoding
>>>>> declaration. The ``unicode_escape`` encoding instructs Python to treat
>>>>> ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
>>>>> a comma (punctuator), etc.
>>>>
>>>
>>> Huh. Is that level of generality actually still needed? Can Python
>>> deprecate all but a small handful of encodings?
>>
>> To be clear, are you proposing to deprecate the encodings *completely*
>> or just as the source code encoding?
> 
> Only source code encodings. Obviously we still need to be able to cope
> with all manner of *data*, but Python source code shouldn't need to be
> in bizarre, weird encodings.
> 
> (Honestly, I'd love to just require that Python source code be UTF-8,
> but that would probably cause problems, so mandating that it be one of
> a small set of encodings would be a safer option.)

Most Python code will be written in UTF-8 going forward, but there's
still a lot of code out there in other encodings. Limiting this
to some reduced set doesn't really make sense, since it's not
clear where to draw the line.

Coming back to the thread topic, many of the Unicode security
considerations don't apply to non-Unicode encodings, since those
usually don't support e.g. changing the bidi direction within a
stream of text or other interesting features you have in Unicode
such as combining code points, invisible (space) code points, font
rendering hint code points, etc.

So in a sense, those non-Unicode encodings are safer than
using UTF-8 :-)

Please also note that most character lookalikes are not encoding
issues, but instead font issues, which then result in the characters
looking similar.

There are fonts which are designed to avoid this
and it's no surprise that source code fonts typically do make
e.g. 0 and O, as well as 1 and l look sufficiently different to be
able to notice the difference.

Things get a lot harder when dealing with combining characters, since
it's not always easy to spot the added diacritics, e.g. try
this:

>>> print ('a\u0348bc') # strong articulation
a͈bc
>>> print ('a\u034Fbc') # combining grapheme joiner
a͏bc

The latter is only "visible" in the unicode_escape encoding:

>>> print ('a\u034Fbc'.encode('unicode_escape'))
b'a\\u034fbc'

Projects wanting to limit code encoding settings, disallow using
bidi markers and other special code points in source code, can easily
do this via e.g. pre-commit hooks, special editor settings, code
linters or security scanners.

I don't think limiting the source code encoding is the right approach
to making code more secure. Instead, tooling has to be used to detect
potentially malicious code points in code.
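
As a rough sketch of what such tooling could look like (for instance as the core of a pre-commit hook), the function below flags explicit bidi controls, other invisible format characters, and combining marks. The function name, the set of checks, and the report format are assumptions made for illustration; a real project would tune them to keep false positives manageable.

```python
import unicodedata

# Bidi classes of the explicit directional embedding/override/isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def suspicious_code_points(text):
    """Yield (line, column, description) for code points worth a human look."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROLS:
                kind = "bidi control"
            elif unicodedata.category(ch) == "Cf":
                kind = "invisible format character"
            elif unicodedata.category(ch) in ("Mn", "Me"):
                kind = "combining mark"
            else:
                continue
            yield lineno, col, f"U+{ord(ch):04X} {kind}"


for hit in suspicious_code_points("a\u034fbc\nx = 'a\u202eb'"):
    print(hit)
```

Combining marks are common in ordinary text in many languages, so a project with non-English strings or comments would probably restrict that check to identifiers or drop it.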

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 03 2021)
>>> Python Projects, Coaching and Support ...https://www.egenix.com/
>>> Python Product Development ...https://consulting.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   https://www.egenix.com/company/contact/
 https://www.malemburg.com/

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/MBWBY47ILPL3E6733W4XAZXF2M6RKFH6/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Chris Angelico
On Wed, Nov 3, 2021 at 8:01 PM Stephen J. Turnbull wrote:
>
> Chris Angelico writes:
>
>  > But I was surprised to find that Python would let you use
>  > unicode_escape for source code.
>
> I'm not surprised.  Today it's probably not necessary, but I've
> exchanged a lot of code (not Python, though) with folks whose editors
> were limited to 8 bit codes or even just ASCII.  It wasn't frequent
> that I needed to discuss non-ASCII code with them (that they needed to
> run) but it would have been painful to do without some form of codec
> that encoded Japanese using only ASCII bytes.

Bearing in mind that string literals can always have their own
escapes, this feature is really only important to the source code
tokens themselves.

>  > Maybe the phrase "a small handful" was a bit too hopeful, but would it
>  > be possible to mandate (after, obviously, a deprecation period) that
>  > source encodings be ASCII-compatible?
>
> Not sure what you mean there.  In the usual sense of ASCII-compatible
> (the ASCII bytes always mean the corresponding character in the ASCII
> encoding), I think there are at least two ASCII-incompatible encodings
> that would cause a lot of pain if they were prohibited, specifically
> Shift JIS and Big5.  (In certain contexts in those encodings an ASCII
> byte frequently is a trailing byte in a multibyte character.)

Ah, okay, so much for that, then. What about the weaker sense:
Characters below 128 are always and only represented by those byte
values? So if you find byte value 39, it might not actually be an
apostrophe, but if you're looking for an apostrophe, you know for sure
that it'll be represented by byte value 39?

> It might make sense to prohibit unicode_escape nowadays -- I think
> almost all systems now can handle Unicode properly, but I don't think
> we can go farther than that.
>

Yes. I'm sure someone will come along and say "but I have to have an
all-ASCII source file, directly runnable, with non-ASCII variable
names", because XKCD 1172, but I don't have enough sympathy for that
obscure situation to want the mess that unicode_escape can give.

ChrisA
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/4JUWQJRMPCSPY3CCJCXLJKBVZ2UFW56F/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Serhiy Storchaka
03.11.21 11:01, Stephen J. Turnbull wrote:
>  And of
> course UTF-16 is incompatible in that sense, although I don't know if
> anybody actually saves Python code in UTF-16.

CPython does not currently support UTF-16 for source files.

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/KN4MPLKSRKQOJM2DUFQNO4UGGOJN5YNU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Stephen J. Turnbull
Chris Angelico writes:

 > But I was surprised to find that Python would let you use
 > unicode_escape for source code.

I'm not surprised.  Today it's probably not necessary, but I've
exchanged a lot of code (not Python, though) with folks whose editors
were limited to 8 bit codes or even just ASCII.  It wasn't frequent
that I needed to discuss non-ASCII code with them (that they needed to
run) but it would have been painful to do without some form of codec
that encoded Japanese using only ASCII bytes.

 > Maybe the phrase "a small handful" was a bit too hopeful, but would it
 > be possible to mandate (after, obviously, a deprecation period) that
 > source encodings be ASCII-compatible?

Not sure what you mean there.  In the usual sense of ASCII-compatible
(the ASCII bytes always mean the corresponding character in the ASCII
encoding), I think there are at least two ASCII-incompatible encodings
that would cause a lot of pain if they were prohibited, specifically
Shift JIS and Big5.  (In certain contexts in those encodings an ASCII
byte frequently is a trailing byte in a multibyte character.)  I'm sure
there is a ton of legacy Python code in those encodings in East Asia,
some of which is still maintained in the original encoding.  And of
course UTF-16 is incompatible in that sense, although I don't know if
anybody actually saves Python code in UTF-16.
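
A small illustration of the trailing-byte point, using a well-known Shift JIS case: KATAKANA LETTER SO (U+30BD) encodes to two bytes, the second of which is 0x5C, the ASCII code for a backslash. The variable names are just for this sketch.

```python
# KATAKANA LETTER SO: its second Shift JIS byte is 0x5C, the ASCII backslash.
text = "\u30bd"
encoded = text.encode("shift_jis")
print(encoded.hex())

# A byte-level search "finds" a backslash that is not in the text at all.
found_in_bytes = b"\\" in encoded
found_in_text = "\\" in text
print(found_in_bytes, found_in_text)
```

This is why "an ASCII byte always means the ASCII character" cannot be assumed for such encodings without decoding first.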

It might make sense to prohibit unicode_escape nowadays -- I think
almost all systems now can handle Unicode properly, but I don't think
we can go farther than that.

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/ESIU62AXASWUDX7MSPMTFIDONIAI/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Chris Angelico
On Wed, Nov 3, 2021 at 5:12 PM Stephen J. Turnbull wrote:
>
> Chris Angelico writes:
>
>  > Huh. Is that level of generality actually still needed? Can Python
>  > deprecate all but a small handful of encodings?
>
> I think that's pointless.  With few exceptions (GB18030, Big5 has a
> couple of code point pairs that encode the same very rare characters,
> ISO 2022 extensions) you're not going to run into the confusables
> problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
> 8859 encodings of Hebrew and Arabic do not have direction markers).
>
> What exactly are you thinking?

You'll never eliminate confusables (even ASCII has some, depending on
font). But I was surprised to find that Python would let you use
unicode_escape for source code.



# coding: unicode_escape

x = '''

Code example:

\u0027\u0027\u0027 # format in monospaced on the web site

print("Did you think this would be executed?")

\u0027\u0027\u0027 # end monospaced

Surprise!
'''

print("There are %d lines in x." % len(x.split(chr(10))))



With some carefully-crafted comments, a lot of human readers will
ignore the magic tokens. It's not uncommon to put example code into
triple-quoted strings, and it's also not all that surprising when
simplified examples do things that you wouldn't normally want done
(like monkeypatching other modules), since it's just an example, after
all.

I don't have access to very many editors, but SciTE, VS Code, nano,
and the GitHub gist display all syntax-highlighted this as if it were
a single large string. Only Idle showed it as code in between, and
that's because it actually decoded it using the declared character
coding, so the magic lines showed up with actual apostrophes.
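
What Idle does can be approximated in a review tool: read the declared encoding and decode the bytes before displaying or highlighting them. The sketch below uses `tokenize.detect_encoding` from the standard library on a shortened, made-up variant of the example above; it only decodes the source and does not execute it.

```python
import io
import tokenize

# A shortened variant of the example; \\u0027 is a literal backslash escape
# in the file's bytes, not an apostrophe.
source = (
    b"# coding: unicode_escape\n"
    b"x = '''\n"
    b"\\u0027\\u0027\\u0027 # looks like part of the string\n"
    b"print('hidden')\n"
    b"\\u0027\\u0027\\u0027\n"
    b"'''\n"
)

# detect_encoding reads the encoding declaration from the first two lines.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
decoded = source.decode(encoding)

raw_delims = source.count(b"'''")       # what a byte-oriented highlighter sees
decoded_delims = decoded.count("'''")   # what appears after decoding
print(encoding, raw_delims, decoded_delims)
```

The mismatch between the two counts is the signal: after decoding, the "string" is split into two strings with code in between.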

Maybe the phrase "a small handful" was a bit too hopeful, but would it
be possible to mandate (after, obviously, a deprecation period) that
source encodings be ASCII-compatible?

ChrisA
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/QQM7HLRMVKBELRRYBJYGR356QVSOLKKZ/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-03 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Huh. Is that level of generality actually still needed? Can Python
 > deprecate all but a small handful of encodings?

I think that's pointless.  With few exceptions (GB18030, Big5 has a
couple of code point pairs that encode the same very rare characters,
ISO 2022 extensions) you're not going to run into the confusables
problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
8859 encodings of Hebrew and Arabic do not have direction markers).

What exactly are you thinking?

The only thing I'd like to see is to rearrange the codec aliases so
that the "common names" would denote the maximal repertoires in each
family (gb denotes gb18030, sjis denotes shift_jisx0213, etc) as in
the WhatWG recommendations for web browsers.  But that's probably too
backward incompatible to fly.
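
For reference, a small sketch of how to see what a codec alias currently resolves to, using the standard `codecs.lookup` call. The choice of names to resolve is mine, for illustration only.

```python
import codecs

# Resolve a few names to the codec Python actually uses for them.
names = ("sjis", "shift_jis", "shift_jisx0213", "gb2312", "gb18030")
resolved = {name: codecs.lookup(name).name for name in names}

for name, codec_name in resolved.items():
    print(name, "->", codec_name)
```

As far as I can tell, the common name `sjis` resolves to plain `shift_jis` rather than to the larger `shift_jisx0213` repertoire, which is the situation the paragraph above would like to change.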
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/W4RJJVAJN7FB24R52YSCU2Y3QZQE3BEL/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Stephen J. Turnbull
Serhiy Storchaka writes:
 > This is excellent!
 > 
 > 01.11.21 14:17, Petr Viktorin пише:
 > >> CPython treats the control character NUL (``\0``) as end of input,
 > >> but many editors simply skip it, possibly showing code that Python
 > >> will not
 > >> run as a regular part of a file.
 > 
 > It is an implementation detail and we will get rid of it.

You can't, probably not for a decade, because people will be running
versions of Python released before you change it.  I hope this PEP
will address Python as it is now as well as Python as it will be.

 > It only happens when you read the Python script from a file.

Which is one of the likely vectors for malware.  It might be worth
teaching virus checkers about this, for example.
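
A minimal sketch of the kind of check a scanner could run, assuming all it needs is the position of the first NUL byte in a file's raw contents. The helper name and the sample bytes are hypothetical.

```python
def first_nul_offset(data: bytes):
    """Return the byte offset of the first NUL in *data*, or None."""
    offset = data.find(b"\x00")
    return None if offset == -1 else offset


# Hypothetical file contents: everything after the NUL may be displayed
# by an editor that skips NUL, yet never reach the interpreter.
sample = b"print('shown and run')\n\x00print('shown by some editors')\n"
offset = first_nul_offset(sample)
print("first NUL at byte", offset)
```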

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/OUFJ47LYOHQ245BIKWVPCH4OCDB4CM7N/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Jim J. Jewett
Chris Angelico wrote:
> I'm not sure how a linter would stop
> someone from publishing code on PyPI that causes confusion by its
> character encoding, for instance.

If it becomes important, the cheeseshop backend can run various validations 
(including a linter) on submissions, and include those results in the display 
template.
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/NO6XRUPLOEAO2ZMUJEXXRNQMVFWZUGLT/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Kyle Stanley
I'd suggest both: a briefer, easier-to-read write-up for the average user in
the docs, and more details/semantics in an informational PEP. Thanks for
working on this, Petr!

On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. 
wrote:

> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python
> documentation would give it more visibility than in a PEP.
>
> On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg  wrote:
>
>> On 01.11.2021 13:17, Petr Viktorin wrote:
>> >> PEP: 
>> >> Title: Unicode Security Considerations for Python
>> >> Author: Petr Viktorin 
>> >> Status: Active
>> >> Type: Informational
>> >> Content-Type: text/x-rst
>> >> Created: 01-Nov-2021
>> >> Post-History:
>>
>> Thanks for writing this up. I'm not sure whether a PEP is the right place
>> for such documentation, though. Wouldn't it be more visible in the
>> standard
>> Python documentation ?
>>
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/6OET4CKEZIA34PAXIJR7BUDKT2DPX2DG/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano  wrote:
>
> On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> > On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
> > > Let me know if it's clear in the newest version, with this note:
> > >
> > > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > > a comma (punctuator), etc.
> > >
> >
> > Huh. Is that level of generality actually still needed? Can Python
> > deprecate all but a small handful of encodings?
>
> To be clear, are you proposing to deprecate the encodings *completely*
> or just as the source code encoding?

Only source code encodings. Obviously we still need to be able to cope
with all manner of *data*, but Python source code shouldn't need to be
in bizarre, weird encodings.

(Honestly, I'd love to just require that Python source code be UTF-8,
but that would probably cause problems, so mandating that it be one of
a small set of encodings would be a safer option.)

> Personally, I think that using obscure encodings as the source encoding
> is one of those "linters and code reviews should check it" issues.
>
> Besides, now that I've learned about this unicode_escape encoding, I
> think that's going to be *awesome* for winning obfuscated Python
> competitions! *wink*

TBH, I'm not entirely sure how valid it is to talk about *security*
considerations when we're dealing with Python source code and variable
confusions, but that's a term that is well understood.

But to the extent that it is a security concern, it's not one that
linters can really cope with. I'm not sure how a linter would stop
someone from publishing code on PyPI that causes confusion by its
character encoding, for instance.

ChrisA
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/HJ452KNBAFXI6WBQ6OUMHHZRRETPC7QL/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Steven D'Aprano
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
> > Let me know if it's clear in the newest version, with this note:
> >
> > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > a comma (punctuator), etc.
> >
> 
> Huh. Is that level of generality actually still needed? Can Python
> deprecate all but a small handful of encodings?

To be clear, are you proposing to deprecate the encodings *completely* 
or just as the source code encoding?

Personally, I think that using obscure encodings as the source encoding 
is one of those "linters and code reviews should check it" issues. 
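
A rough sketch of what such a linter check could look like, using the standard `tokenize.detect_encoding` function. The allowlist and the helper names are assumptions made up for this example; a real project would choose its own policy.

```python
import io
import tokenize

# Assumed allowlist for this sketch only.
ALLOWED = {"utf-8", "utf-8-sig", "iso-8859-1"}


def declared_encoding(source: bytes) -> str:
    """Return the source encoding that tokenize detects for *source*."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


def check_source_encoding(source: bytes) -> bool:
    """True if the detected encoding is on the allowlist."""
    return declared_encoding(source) in ALLOWED


plain = b"x = 1\n"
sneaky = b"# -*- coding: unicode_escape -*-\nx = 1\n"
print(declared_encoding(plain), check_source_encoding(plain))
print(declared_encoding(sneaky), check_source_encoding(sneaky))
```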

Besides, now that I've learned about this unicode_escape encoding, I 
think that's going to be *awesome* for winning obfuscated Python 
competitions! *wink*


-- 
Steve
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/27IDDKAADVBAZSRZ2I5EO5SLXZIY6ANW/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Terry Reedy

On 11/2/2021 1:02 PM, Marc-Andre Lemburg wrote:
> On 01.11.2021 13:17, Petr Viktorin wrote:
>> PEP:
>> Title: Unicode Security Considerations for Python
>> Author: Petr Viktorin
>> Status: Active
>> Type: Informational
>> Content-Type: text/x-rst
>> Created: 01-Nov-2021
>> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the right place
> for such documentation, though. Wouldn't it be more visible in the standard
> Python documentation ?


There is already a "Unicode HOWTO".  We could add "Unicode problems and
pitfalls".



--
Terry Jan Reedy

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/5KDNR5RIITKMIKGSZK2WCPEQDA6AJGQE/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 5:07 AM David Mertz, Ph.D.  wrote:
>
> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python 
> documentation would give it more visibility than in a PEP.
>

There are quite a few other PEPs that have similar sorts of advice,
like PEP 257 on docstrings, and several of the type hinting PEPs. IMO
it's fine.

ChrisA
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/NICZBYG332C4WBFZVCHCTDTEP3NGEF7B/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread David Mertz, Ph.D.
This is an amazing document, Petr. Really great work!

I think I agree with Marc-André that putting it in the actual Python
documentation would give it more visibility than in a PEP.

On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg  wrote:

> On 01.11.2021 13:17, Petr Viktorin wrote:
> >> PEP: 
> >> Title: Unicode Security Considerations for Python
> >> Author: Petr Viktorin 
> >> Status: Active
> >> Type: Informational
> >> Content-Type: text/x-rst
> >> Created: 01-Nov-2021
> >> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the right place
> for such documentation, though. Wouldn't it be more visible in the standard
> Python documentation ?
>
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/6PHPDZRCYNA44NHSHXPBL7QMWXMHXWGU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Marc-Andre Lemburg
On 01.11.2021 13:17, Petr Viktorin wrote:
>> PEP: 
>> Title: Unicode Security Considerations for Python
>> Author: Petr Viktorin 
>> Status: Active
>> Type: Informational
>> Content-Type: text/x-rst
>> Created: 01-Nov-2021
>> Post-History:

Thanks for writing this up. I'm not sure whether a PEP is the right place
for such documentation, though. Wouldn't it be more visible in the standard
Python documentation ?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 02 2021)
>>> Python Projects, Coaching and Support ...https://www.egenix.com/
>>> Python Product Development ...https://consulting.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   https://www.egenix.com/company/contact/
 https://www.malemburg.com/

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/FSFG2B3LCWU5PQWX3WRIOJGNV2JFW4AU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
> Let me know if it's clear in the newest version, with this note:
>
> > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > a comma (punctuator), etc.
>

Huh. Is that level of generality actually still needed? Can Python
deprecate all but a small handful of encodings?

ChrisA
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/WA7P7YLY7N6CGF7N5G6DVG3PIA24BPS7/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Petr Viktorin

On 01. 11. 21 13:17, Petr Viktorin wrote:

> Hello,
> Today, an attack called "Trojan source" was revealed, where a malicious
> contributor can use Unicode features (bidirectional text and homoglyphs)
> to write code that, when shown in an editor, will look different from how a
> computer language parser will process it.
>
> See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
>
> This is not a bug in Python.
> As far as I know, the Python Security Response team reviewed the report
> and decided that it should be handled in code editors, diff viewers,
> repository frontends and similar software, rather than in the language.


> I agree: in my opinion, the attack is similar to abusing any other
> "gotcha" where Python doesn't parse text as a non-expert human would.
> For example: `if a or b == 'yes'`, mutable default arguments, or a
> misleading typo.
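
The first of those gotchas, spelled out as a small sketch. The variable values are chosen by me for illustration.

```python
a = "no"
b = "no"

# Parsed as: a or (b == "yes").  A non-empty string is truthy, so the
# expression evaluates to "no" and an `if` on it would be taken.
naive = a or b == "yes"

# What a non-expert reader probably expects the line to mean.
intended = a == "yes" or b == "yes"

print(repr(naive), bool(naive))
print(intended)
```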


> Nevertheless, I did do a bit of research about similar gotchas in
> Python, and I'd like to publish a summary as an informational PEP,
> pasted below.



Thanks for the comments, everyone! I've updated the document and sent it 
to https://github.com/python/peps/pull/2129
A rendered version is at 
https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst




Toshio Kuratomi wrote:

>   `Unicode`_ is a system for handling all kinds of written language.
> It aims to allow any character from any human natural language (as
> well as a few characters which are not from natural languages) to be
> used. Python code may consist of almost all valid Unicode characters.


Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and 
encodings -- that's much too technical for this document. Using the 
specific technical terms unfortunately doesn't help understanding, so I 
use the vague ones like "character" and "letter".)



Jim J. Jewett wrote:

> "The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a
> complete Python statement."
>
> Normally, an identifier must begin with a letter, and numbers can only be used
> in the second and subsequent positions.  (XID_CONTINUE instead of XID_START)
> The fact that some characters with numeric values are considered letters (in
> this case, category Lo, Other Letters) is a different problem than just looking
> visually confusable with "+", and it should probably be listed on its own.


I'm not a native speaker, but as I understand it, "十" is closer to a 
single-letter word than a single-digit number. It translates better as 
"ten" than "10". (And it appears in "十四", "fourteen", just like "four" 
appears in "fourteen".)



Patrick Schultz wrote:

> - The Unicode consortium has a list of confusables, in case useful


Yup, and it's linked from the documents that describe how to use it. I 
link to those rather than just the list.

But thank you!


Terry Reedy wrote:

>> Bidirectional Text
>> ------------------
>>
>> Some scripts, such as Hebrew or Arabic, are written right-to-left.
>
> [Suggested addition, subject to further revision.]
>
> There are at least three levels of handling r2l chars: none, local (contiguous
> sequences are properly reversed), and extended (see below).  The handling
> depends on the display software and may depend on the quoting.  Tk and hence
> tkinter (and IDLE) text widgets do local handling.  Windows Notepad++ does local
> handling of unquoted code but extended handling of quoted text.  Windows
> Notepad currently does extended handling even without quotes.


I'd like to leave these details out of the document. The examples should 
render convincingly in browsers. The text should now describe the 
behavior even if you open it in an editor that does things differently, 
and acknowledge that such editors exist. (The behavior of specific 
editors/toolkits might well change in the future.)



>> For example, with ``encoding: unicode_escape``, characters like
>> quotes or braces can be hidden in an (f-)string, with many tools (syntax
>> highlighters, linters, etc.) considering them part of the string.
>> For example::
>
> I don't see the connection between the text above and the example that follows.
>
>> # For writing Japanese, you don't need an editor that supports
>> # UTF-8 source encoding: unicode_escape sequences work just as well.
>
> [etc]


Let me know if it's clear in the newest version, with this note:


> Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> declaration. The ``unicode_escape`` encoding instructs Python to treat
> ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> a comma (punctuator), etc.



Steven D'Aprano wrote:

>> Before the age of computers, most mechanical typewriters lacked the keys
>> for the digits ``0`` and ``1``
>
> I'm not sure that "most" is justified here. One of the most popular
> typewriters in history, the Underwood #5 (from 1900 to 1920), lacked
> the 1 key but had a 0 distinct from O.



[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Steven D'Aprano
On Mon, Nov 01, 2021 at 11:41:06AM -0700, Toshio Kuratomi wrote:

> Unicode specifies the mapping of glyphs to code points.  Then a second
> mapping from code points to sequences of bytes is what is actually
> recorded by the computer.  The second mapping is what programmers
> using Python will commonly think of as the encoding while the majority
> of what you're writing about has more to do with the first mapping.

I don't think that is correct.

According to the Unicode consortium -- and I hope that they would know 
*wink* -- Unicode is the universal character encoding. In other words:

"Unicode provides a unique number for every character"

https://www.unicode.org/standard/WhatIsUnicode.html

Not glyphs.

("Character" in natural language is a bit of a fuzzy concept, so I think 
that Unicode here is referring to what their glossary calls an abstract 
character.)

The usual meaning of glyph is for the graphical images used 
by fonts (typefaces) for display. Sense 2 in the Unicode glossary here:

https://www.unicode.org/glossary/#glyph

I'm not really sure what they mean by sense 1, unless they mean a 
representative glyph, which is intended to stand in as an example of the 
entire range of glyphs.

Unicode does not specify what the glyphs for code points are, although 
it does provide representative samples. See, for example, their comment 
on emoji:

"The Unicode Consortium provides character code charts that show a 
representative glyph"

http://www.unicode.org/faq/emoji_dingbats.html

Their code point charts likewise show representative glyphs for other 
letters and symbols, not authoritative. And of course, many abstract 
characters do not have glyphs at all, e.g. invisible joiners, control 
characters, variation selectors, noncharacters, etc.

The mapping from bytes to code points and abstract characters is also 
part of Unicode. The UTF encodings are part of Unicode:

https://www.unicode.org/faq/utf_bom.html#gen2

The "U" in UTF literally stands for Unicode :-)
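
To make the two layers concrete, here is a small sketch using only the standard library: one abstract character, its code point, and the different byte sequences the UTF encodings give it, plus one character that has no visible glyph. The choice of example characters is mine.

```python
import unicodedata

ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# One code point, several byte-level encodings.
encoded = {codec: ch.encode(codec) for codec in ("utf-8", "utf-16-le", "utf-32-le")}
for codec, data in encoded.items():
    print(codec, data.hex(" "))

# An abstract character with a code point but no glyph of its own.
zwj = "\u200d"
print(unicodedata.name(zwj), unicodedata.category(zwj))
```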


-- 
Steve
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/I7ZRNIHSQ7UL4NSKOXFRYBYHQEXGNBPA/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Terry Reedy

On 11/1/2021 8:17 AM, Petr Viktorin wrote:

> Nevertheless, I did do a bit of research about similar gotchas in
> Python, and I'd like to publish a summary as an informational PEP,
> pasted below.


Very helpful.


> Bidirectional Text
> ------------------
>
> Some scripts, such as Hebrew or Arabic, are written right-to-left.


[Suggested addition, subject to further revision.]

There are at least three levels of handling r2l chars: none, local
(contiguous sequences are properly reversed), and extended (see below).
The handling depends on the display software and may depend on the
quoting.  Tk and hence tkinter (and IDLE) text widgets do local handling.
Windows Notepad++ does local handling of unquoted code but extended
handling of quoted text.  Windows Notepad currently does extended
handling even without quotes.


In extended handling, phrases ...


> Phrases in such scripts interact with nearby text in ways that can be
> surprising to people who aren't familiar with these writing systems and their
> computer representation.
> The exact process is complicated, and explained in Unicode® Standard Annex #9,
> "Unicode Bidirectional Algorithm".
>
> Some surprising examples include:
>
> * In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.


In local handling, one sees the name followed by `= 23`.  In extended handling,
one sees `23 =` followed by the name.  (Notepad++ sees backticks as quotes.)
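
Whatever a given display does, the stored (logical) order of the statement is the same. A small sketch that lists the bidirectional class of each character using the standard `unicodedata.bidirectional` call; escapes are used so the listing itself is not reordered by the viewer.

```python
import unicodedata

# The example statement in logical order: three Hebrew letters, " = ", "23".
statement = "\u05e2\u05e8\u05da = 23"

classes = [(f"U+{ord(c):04X}", unicodedata.bidirectional(c)) for c in statement]
for code_point, bidi_class in classes:
    print(code_point, bidi_class)
```

The letters are class R (right-to-left), the digits EN (European number), and the space and `=` are neutral, which is why their placement depends on how much of the bidirectional algorithm the display implements.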



> Source Encoding
> ---------------
>
> The encoding of Python source files is given by a specific regex on the first
> two lines of a file, as per `Encoding declarations`_.
> This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
>
> This can be misused in combination with Python-specific special-purpose
> encodings (see `Text Encodings`_).
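
A sketch of how liberal the declaration matching is, using `tokenize.detect_encoding` from the standard library, which as far as I know follows the same rules as the compiler. The sample header lines are invented for illustration.

```python
import io
import tokenize


def detect(source: bytes) -> str:
    """Return the encoding name that tokenize detects for *source*."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


samples = [
    b"# -*- coding: utf-8 -*-\n",
    b"#!/usr/bin/env python\n# vim: set fileencoding=latin-1 :\n",
    b"# any text at all, then coding=unicode_escape\n",
]
detected = [detect(s) for s in samples]
print(detected)
```

The third sample shows that the declaration does not have to look like a conventional coding line to take effect.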



Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to 
something?




> For example, with ``encoding: unicode_escape``, characters like
> quotes or braces can be hidden in an (f-)string, with many tools (syntax
> highlighters, linters, etc.) considering them part of the string.
> For example::


I don't see the connection between the text above and the example that 
follows.



>     # For writing Japanese, you don't need an editor that supports
>     # UTF-8 source encoding: unicode_escape sequences work just as well.

[etc]


--
Terry Jan Reedy
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/34JROXNUHEUDC4TOWUAM74KIGIRRHHG4/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Jim J. Jewett
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a 
complete Python statement."

Normally, an identifier must begin with a letter, and numbers can only be used 
in the second and subsequent positions.  (XID_CONTINUE instead of XID_START)  
The fact that some characters with numeric values are considered letters (in 
this case, category Lo, Other Letters) is a different problem than just looking 
visually confusable with "+", and it should probably be listed on its own.
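
The properties mentioned above can be checked with the standard `unicodedata` module. A minimal sketch; the dictionary layout is just for display.

```python
import unicodedata

ten = "\u5341"  # the character from the quoted example

info = {
    "category": unicodedata.category(ten),   # general category
    "numeric": unicodedata.numeric(ten),     # numeric value, if any
    "isidentifier": ten.isidentifier(),      # usable as a name on its own?
    "isdigit": ten.isdigit(),                # treated as a digit by str?
}
print(info)

# The quoted statement really is an assignment to that name.
namespace = {}
exec(ten + "= 10", namespace)
print(namespace[ten])
```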

-jJ
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/RV7RU7DGWFIBEGFKNYDP63ZRJNP5Y4YU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Toshio Kuratomi
This is an excellent enumeration of some of the concerns!

One minor comment about the introductory material:

On Mon, Nov 1, 2021 at 5:21 AM Petr Viktorin  wrote:

> >
> > Introduction
> > 
> >
> > Python code is written in `Unicode`_ – a system for encoding and
> > handling all kinds of written language.

Unicode specifies the mapping of glyphs to code points.  Then a second
mapping from code points to sequences of bytes is what is actually
recorded by the computer.  The second mapping is what programmers
using Python will commonly think of as the encoding while the majority
of what you're writing about has more to do with the first mapping.
I'd try to word this in a way that doesn't lead a reader to conflate
those two mappings.

Maybe something like this?

  `Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.

> > While this allows programmers from all around the world to express 
> > themselves,
> > it also allows writing code that is potentially confusing to readers.
> >

-Toshio
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/Q2T3GKC6R6UH5O7RZJJNREG3XQDDZ6N4/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Serhiy Storchaka
This is excellent!

01.11.21 14:17, Petr Viktorin пише:
>> CPython treats the control character NUL (``\0``) as end of input,
>> but many editors simply skip it, possibly showing code that Python
>> will not
>> run as a regular part of a file.

It is an implementation detail and we will get rid of it. It only
happens when you read the Python script from a file. If you import it as
a module or run with runpy, the NUL character is an error.

>> Some characters can be used to hide/overwrite other characters when
>> source is
>> listed in common terminals:
>>
>> * BS (``\b``, Backspace) moves the cursor back, so the character after it
>>   will overwrite the character before.
>> * CR (``\r``, carriage return) moves the cursor to the start of line,
>>   subsequent characters overwrite the start of the line.
>> * DEL (``\x7F``) commonly initiates escape codes which allow arbitrary
>>   control of the terminal.

ESC (``\x1B``) starts many control sequences.

``\x1A`` means the end of the text file on Windows. Some programs (for
example "type") ignore the rest of the file.
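
A rough sketch of a scan for the control characters discussed in this message. The set of bytes flagged and the helper name are my own choices for illustration, not a vetted list.

```python
# Control bytes named in this thread; the selection is an assumption.
SUSPICIOUS = {
    0x00: "NUL",
    0x08: "BS",
    0x1A: "SUB (Ctrl-Z)",
    0x1B: "ESC",
    0x7F: "DEL",
}


def find_control_bytes(data: bytes):
    """Yield (offset, name) for each suspicious control byte in *data*."""
    for offset, byte in enumerate(data):
        if byte in SUSPICIOUS:
            yield offset, SUSPICIOUS[byte]
        elif byte == 0x0D and data[offset + 1:offset + 2] != b"\n":
            # A CR that is not part of a CRLF line ending.
            yield offset, "bare CR"


# Hypothetical file contents with a bare CR, a backspace and an ESC sequence.
sample = b"x = 1\rprint('hi')\x08\x1b[2K\r\n"
hits = list(find_control_bytes(sample))
print(hits)
```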

Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/CBI7ME3YUAVVH5B6LSC745GJSVUIZJHO/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Steven D'Aprano
Thanks for writing this Petr!

A few comments below.

On Mon, Nov 01, 2021 at 01:17:02PM +0100, Petr Viktorin wrote:

> >ASCII-only Considerations
> >-
> >
> >ASCII is a subset of Unicode
> >
> >While issues with the ASCII character set are generally well understood,
> >they're presented here to help better understanding of the non-ASCII cases.

You should mention that some very common typefaces (fonts) are more 
confusable than others. For instance, Arial (a common font on Windows 
systems) makes the two letter combination 'rn' virtually 
indistinguishable from the single letter 'm'.


> >Before the age of computers, most mechanical typewriters lacked the keys 
> >for the digits ``0`` and ``1``

I'm not sure that "most" is justified here. One of the most popular 
typewriters in history, the Underwood #5 (from 1900 to 1920), lacked 
the 1 key but had a 0 distinct from O.

https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg

The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford 
Typewriter. As did possibly the best selling typewriter in history, the 
IBM Selectric (introduced in 1961).

http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery

Perhaps you should say "many older mechanical typewriters"?


> >Bidirectional Text
> >--

The section on bidirectional text is interesting, because reading it in 
my email client mutt, all the examples are left to right.

You might like to note that not all applications support bidirectional 
text.


> >Unicode includes alorithms to *normalize* variants like these to a 
> >single form, and Python identifiers are normalized.

Typo: "algorithms".
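
A short sketch of what identifier normalization means in practice, assuming the NFKC form that Python applies to identifiers while parsing. The ligature example is my choice.

```python
import unicodedata

ligature_name = "\ufb01n"  # starts with LATIN SMALL LIGATURE FI
normalized = unicodedata.normalize("NFKC", ligature_name)
print(repr(ligature_name), "->", repr(normalized))

# Assigning through the ligature spelling creates the plain-ASCII name.
namespace = {}
exec(ligature_name + " = 1", namespace)
print("fin" in namespace, ligature_name in namespace)
```

Two spellings that look different in source can therefore refer to the same variable.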



This is a good and useful document, thank you again.


-- 
Steve
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/CHGK6LLBMVRQ6GGEMRWYJNRLUL7KUMVS/