[issue45105] Incorrect handling of unicode character \U00010900

2021-09-12 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
We recently discussed the RTLO attack on Python sources (sorry, I don't remember where) and decided that we should do something about it. I think this is a related issue.

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-12 Thread Ronald Oussoren
Ronald Oussoren added the comment:
@Steven: the difference between indexing and the repr of list() is also covered by Eryk's explanation.
s = ... # (value from msg401078)
for x in repr(list(s)): print(x)
The output shows the characters in the expected order.
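Ronald's loop can be tried as a minimal sketch. The value of `s` below is an assumption (the thread elides the string from msg401078); it is written with an escape so the source contains no right-to-left character, and `ascii()` is applied only to keep the printed output unambiguous in any terminal.

```python
# Assumed sample value; the actual string from msg401078 is elided in the thread.
s = '123\U00010900456'

# repr(list(s)) is an ordinary str, so iterating over it yields one code
# point at a time in logical (storage) order.
chars = list(repr(list(s)))

for x in chars:
    # One character per line leaves the terminal nothing to reorder.
    print(ascii(x))
```

Filtering `chars` down to its alphanumeric entries gives back `list(s)`, which suggests the stored order is intact and only the rendering of a whole line is affected.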

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-10 Thread Terry J. Reedy
Change by Terry J. Reedy: nosy: +serhiy.storchaka

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-06 Thread STINNER Victor
Change by STINNER Victor: nosy: -vstinner

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Max Bachmann
Max Bachmann added the comment:
As far as I understood, this is caused by the same reason:
```
>>> s = '123\U00010900456'
>>> s
'123𐤀456'
>>> list(s)
['1', '2', '3', '𐤀', '4', '5', '6']
# note that everything including the commas is mirrored until ] is reached
>>> s[3]
'𐤀'
>>> list(s)[3]
'𐤀'
```

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Steven D'Aprano
Steven D'Aprano added the comment:
Hmmm, digging deeper, I saved the page source code and opened it with hexdump. The relevant lines are:
7780  60 60 0d 0a 26 67 74 3b  26 67 74 3b 26 67 74 3b  |``..&gt;&gt;&gt;|
7790  20 73 20 3d 20 27 30 f0  90 a4 80 30 30 27 0d 0a  | s = '0....00'..|
which
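The bytes between the quotes in the second hexdump line can be decoded directly. This is a small sketch, not code from the thread; the variable names are illustrative.

```python
# Bytes copied from the hexdump line at offset 7790 (between the quotes).
raw = bytes.fromhex('30 f0 90 a4 80 30 30')

# The page source is assumed to be UTF-8, as the f0 90 a4 80 sequence suggests.
decoded = raw.decode('utf-8')

# List each character as a code point, in storage order.
code_points = [f'U+{ord(c):04X}' for c in decoded]
print(code_points)  # ['U+0030', 'U+10900', 'U+0030', 'U+0030']
```

The four-byte sequence `f0 90 a4 80` is the UTF-8 encoding of U+10900, and it sits at index 1 of the decoded string.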

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Steven D'Aprano
Steven D'Aprano added the comment:
> what's really there when I copy it from Firefox is '0\U000109',
> which matches the result Max gets for individual index operations such as
> s[1].
But *not* the result that Max got from calling list(). Can you reproduce that difference between

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Max Bachmann
Max Bachmann added the comment:
> That is using Python 3.9 in the xfce4-terminal. Which xterm are you using?
This was in the default GNOME terminal that is pre-installed on Fedora 34, and on Windows I opened the Python terminal directly. I just installed xfce4-terminal on my Fedora 34 machine

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Serhiy Storchaka
Change by Serhiy Storchaka: status: open -> pending

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Eryk Sun
Eryk Sun added the comment:
> I think you may be mistaken. In Max's original post, he has
> s = '000X'
It displays that way for me under Firefox in Linux, but what's really there when I copy it from Firefox is '0\U000109', which matches the result Max gets for individual index

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Steven D'Aprano
Steven D'Aprano added the comment:
Eryk Sun said:
> The original string has the Phoenician right-to-left character at index 1,
> not at index 3.
I think you may be mistaken. In Max's original post, he has
s = '000X'
where the X is actually the Phoenician ALF character. At least that

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Steven D'Aprano
Steven D'Aprano added the comment:
I'm afraid I cannot reproduce the problem.
>>> s = '000𐤀'  # \U00010900
>>> s
'000𐤀'
>>> s[0]
'0'
>>> s[1]
'0'
>>> s[2]
'0'
>>> s[3]
'𐤀'
>>> list(s)
['0', '0', '0', '𐤀']
That is using Python 3.9 in the xfce4-terminal. Which xterm are you using? I am very

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Eryk Sun
Eryk Sun added the comment:
AFAICT, there is no bug here. It's just confusing how Unicode right-to-left characters in the repr() can change how it is displayed in the console/terminal. Use the ascii() representation to avoid the problem.
> The same behavior does not occur when directly using
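The ascii() suggestion can be sketched as follows. This is an illustration, not code from the thread; the sample string is an assumed reconstruction of the reported value, written with an escape so the source itself contains no right-to-left character.

```python
import unicodedata

# Assumed sample value: '0', U+10900, '0', '0'.
s = '0\U00010900' + '00'

# ascii() escapes every non-ASCII character, so the terminal never receives
# a right-to-left code point and has nothing to reorder.
escaped = ascii(s)
print(escaped)

# The Unicode database reports the character's name and bidirectional class;
# class 'R' means strongly right-to-left.
char_name = unicodedata.name(s[1])
bidi_class = unicodedata.bidirectional(s[1])
print(char_name, bidi_class)
```

Checking `unicodedata.bidirectional()` is one way to tell in advance whether a string is likely to be reordered when a terminal renders its repr().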

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Max Bachmann
Max Bachmann added the comment:
This is the result of copy-pasting the example posted above on Windows using
```
Python 3.7.8 (tags/v3.7.8:4b47a5b6ba, Jun 28 2020, 08:53:46) [MSC v.1916 64 bit (AMD64)] on win32
```
which appears to run into similar problems:
```
>>> s = '0��00'
```

[issue45105] Incorrect handling of unicode character \U00010900

2021-09-05 Thread Max Bachmann
New submission from Max Bachmann:
I noticed unexpected behavior when the Unicode character \U00010900 is inserted into a string as a literal character. Here is the result on the Python console for both 3.6 and 3.9:
```
>>> s = '0𐤀00'
>>> s
'0𐤀00'
>>> ls = list(s)
>>> ls
['0', '𐤀', '0', '0']
>>> s[0]
'0'
```