Re: python3 raw strings and \u escapes
This is a related question.

I perform an octal dump on a file:

    $ od -cx file
    000   h   e   l   l   o       w   o   r   l   d  \n
           6568    6c6c    206f    6f77    6c72    0a64

I want to output the names of those characters:

    $ python3
    Python 3.2.3 (default, May 19 2012, 17:01:30)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import unicodedata
    >>> unicodedata.name("\u0068")
    'LATIN SMALL LETTER H'
    >>> unicodedata.name("\u0065")
    'LATIN SMALL LETTER E'

But how do I do this programmatically?

    >>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>> first_two_letters
    '6568'
    >>> first_letter = "00" + first_two_letters[2:]
    >>> first_letter
    '0068'

Now what?
Re: python3 raw strings and \u escapes
On 16/06/2012 00:42, Jason Friedman wrote:
> But, how to do this programmatically:
>
>     >>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
>     >>> first_two_letters
>     '6568'
>     >>> first_letter = "00" + first_two_letters[2:]
>     >>> first_letter
>     '0068'
>
> Now what?

    >>> hex_code = "65"
    >>> unicodedata.name(chr(int(hex_code, 16)))
    'LATIN SMALL LETTER E'
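For anyone following along, here is a minimal sketch of how the chr(int(hex_code, 16)) idea might be applied to the whole dump. The function name names_from_od_words is made up for illustration. It assumes the dump came from a little-endian machine (od -x prints 16-bit words in host byte order, so '6568' stands for the bytes 0x68 then 0x65) and that every byte is a single character.

```python
import unicodedata


def names_from_od_words(hex_words):
    """Yield (character, Unicode name) pairs for `od -x` style 16-bit words.

    Assumption: little-endian host, so the low byte of each word comes
    first in the file; each byte is treated as one code point.
    """
    for word in hex_words.split():
        # Put the two bytes of the word back into file order.
        for hex_code in (word[2:], word[:2]):
            char = chr(int(hex_code, 16))
            # Control characters such as '\n' have no name, so supply a default
            # instead of letting unicodedata.name() raise ValueError.
            yield char, unicodedata.name(char, "<unnamed>")


for char, name in names_from_od_words("6568 6c6c 206f 6f77 6c72 0a64"):
    print(repr(char), name)
```

On a big-endian machine the two slices would need to be swapped, which is why the byte order is spelled out as an assumption rather than detected.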
Re: python3 raw strings and \u escapes
>     >>> hex_code = "65"
>     >>> unicodedata.name(chr(int(hex_code, 16)))
>     'LATIN SMALL LETTER E'

Very helpful, thank you MRAB. The finished product: http://pastebin.com/4egQcke2.
Re: python3 raw strings and \u escapes
On 05/30/2012 09:07 AM, ru...@yahoo.com wrote:
> On 05/30/2012 05:54 AM, Thomas Rachel wrote:
>> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read, but should do the trick...
>
> I guess the +s could be left out allowing something like,
>
>     '[ \u3000]' r'\w+ \d{3}'
>
> but I'll have to try it a little; maybe just doubling backslashes won't be much worse. I did that for years in Perl and lived through it.

Just for some closure, there are many places in my code that I had/have to track down and change. But the biggest problem so far is a lexer module that is structured as many dozens of little functions, each with a docstring that is a regex string.

The only way I found to change these and maintain sanity was to go through them and remove the r prefix from any strings that contain \u literals, and then double any other backslashes in the string.

Since these are docstrings, creating them with executable code was awkward, and using adjacent string concatenation led to a very confusing mix of string styles. Strings that used concatenation often had a single logical regex structure (e.g. a character set [...]) split between two strings. The extra quote characters were as visually confusing as doubled backslashes in many cases.

Strings with doubled backslashes, although harder to read, were much easier to edit reliably and, in their way, more regular. It does make this module look very Perlish though... :-)
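To make the docstring trade-off concrete, here is a small sketch of the doubled-backslash style next to the concatenation style. The rule name t_WORDS and the pattern are invented for illustration; the function only imitates the shape of a Ply token rule, and Ply itself is not imported.

```python
import re


def t_WORDS(t):
    '[ \u3000]\\w+ \\d{3}'
    # Hypothetical Ply-style rule: the docstring above is the regex.
    # Without the r prefix, Python decodes \u3000 itself, and each doubled
    # backslash reaches the re module as a single one (\w, \d).
    return t


# The same pattern written with adjacent string literals, for comparison.
concatenated = '[ \u3000]' r'\w+ \d{3}'

print(t_WORDS.__doc__ == concatenated)
print(bool(re.fullmatch(t_WORDS.__doc__, '\u3000abc 123')))
```

Both spellings produce the same pattern string, so the choice between them is about readability and ease of editing, as described above.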
Re: python3 raw strings and \u escapes
On Fri, Jun 1, 2012 at 6:28 AM, ru...@yahoo.com wrote:
> ... a lexer module that is structured as many dozens of little functions, each with a docstring that is a regex string.

This may be a good opportunity to take a step back and ask yourself: Why so many functions, each with a regular expression in its docstring?

Chris Angelico
Re: python3 raw strings and \u escapes
On 05/31/2012 03:10 PM, Chris Angelico wrote:
> On Fri, Jun 1, 2012 at 6:28 AM, ru...@yahoo.com wrote:
>> ... a lexer module that is structured as many dozens of little functions, each with a docstring that is a regex string.
>
> This may be a good opportunity to take a step back and ask yourself: Why so many functions, each with a regular expression in its docstring?

Because that's the way David Beazley designed Ply? http://dabeaz.com/ply/

Personally, I think it's an abuse of docstrings but he never asked me for my opinion...
Re: python3 raw strings and \u escapes
On 5/30/2012 1:52 AM, ru...@yahoo.com wrote:
> Was there a reason for dropping the lexical processing of \u escapes in strings in python3 (other than to add another annoyance in a long list of python3 annoyances?)

To me, this would be a Python 2 annoyance since I would expect r'\u3000' to be literally the six characters '\u3000' since the entire point of raw strings is to treat everything literally. Why should anything at all be processed when constructing a raw string?

--
CPython 3.3.0a3 | Windows NT 6.1.7601.17790
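A quick way to see the Python 3 behaviour being discussed, as a sketch (the variable names raw and cooked are just labels chosen for this example):

```python
import unicodedata

raw = r'\u3000'     # raw literal: backslash, 'u', '3', '0', '0', '0'
cooked = '\u3000'   # ordinary literal: the lexer decodes the escape

print(len(raw), len(cooked))
print(raw == '\\u3000')
print(unicodedata.name(cooked))
```

In Python 3 the raw literal keeps all six characters, while the ordinary literal is the single IDEOGRAPHIC SPACE code point.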
Re: python3 raw strings and \u escapes
On 30.05.2012 08:52, ru...@yahoo.com wrote:
> This breaks a lot of my code because in python 2
>
>     re.split (ur'[\u3000]', u'A\u3000A') == [u'A', u'A']
>
> but in python 3 (the result of running 2to3),
>
>     re.split (r'[\u3000]', 'A\u3000A' ) == ['A\u3000A']
>
> I can remove the r prefix from the regex string but then if I have other regex backslash symbols in it, I have to double all the other backslashes -- the very thing that the r-prefix was invented to avoid.
>
> Or I can leave the r prefix and replace something like r'[ \u3000]' with r'[ 　]'. But that is confusing because one can't distinguish between the space character and the ideographic space character. It is also a problem if a reader of the code doesn't have a font that can display the character.
>
> Was there a reason for dropping the lexical processing of \u escapes in strings in python3 (other than to add another annoyance in a long list of python3 annoyances?)

Probably because it is more consistent. Alas, it makes the whole thing incompatible with Py2.

But if you think about it: why allow for \u if \r, \n etc. are disallowed as well?

> And is there no choice for me but to choose between the two poor choices I mention above to deal with this problem?

There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read, but it should do the trick...

Thomas
Re: python3 raw strings and \u escapes
On Wed, May 30, 2012 at 2:52 AM, ru...@yahoo.com wrote:
> Was there a reason for dropping the lexical processing of \u escapes in strings in python3 (other than to add another annoyance in a long list of python3 annoyances?)
>
> And is there no choice for me but to choose between the two poor choices I mention above to deal with this problem?

The solution of r'[' + '\u3000' + r']...' was pretty good.

Real reason I posted: Maybe the re module should handle \u escapes, in addition to the other backslash escapes it processes? This would be backwards incompatible, though, so maybe it's too late.

-- Devin
Re: python3 raw strings and \u escapes
On 30 May 2012 12:54, Thomas Rachel nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa...@spamschutz.glglgl.de wrote:
> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read, but should do the trick...

You could even take advantage of string literal concatenation :)

    r'[' '\u3000' r']'

-- Arnaud
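As a sketch of how the two spellings compare in practice (the pattern and sample text are invented for illustration), the explicit + form and the adjacent-literal form build the same pattern string, and the raw pieces keep regex escapes such as \d undoubled:

```python
import re

# Raw pieces carry the regex escapes; the non-raw piece carries \u3000.
explicit = r'[ ' + '\u3000' + r']\d'   # joined with +
implicit = r'[ ' '\u3000' r']\d'       # adjacent literals, joined by the compiler

text = 'A' + '\u3000' + '1B 2C'

print(explicit == implicit)
print(re.split(implicit, text))
```

Here the split removes either kind of space together with the digit that follows it, leaving the three letters.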
Re: python3 raw strings and \u escapes
On 05/30/2012 05:54 AM, Thomas Rachel wrote:
> On 30.05.2012 08:52, ru...@yahoo.com wrote:
>> Was there a reason for dropping the lexical processing of \u escapes in strings in python3 (other than to add another annoyance in a long list of python3 annoyances?)
>
> Probably because it is more consistent. Alas, it makes the whole thing incompatible with Py2.
>
> But if you think about it: why allow for \u if \r, \n etc. are disallowed as well?

Maybe the blame is elsewhere then... If the re module interprets (in a regex string) the 2-character string consisting of a backslash followed by 'n' as a single newline character, then why wasn't re changed for Python 3 to interpret the 6-character string r'\u3000' as a single unicode character, to correspond with Python's lexer no longer doing that (as it did in Python 2)?

>> And is there no choice for me but to choose between the two poor choices I mention above to deal with this problem?
>
> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read, but it should do the trick...

I guess the +s could be left out, allowing something like

    '[ \u3000]' r'\w+ \d{3}'

but I'll have to try it a little; maybe just doubling backslashes won't be much worse. I did that for years in Perl and lived through it.
Re: python3 raw strings and \u escapes
On 5/30/2012 2:52 AM, ru...@yahoo.com wrote:
> In python2, \u escapes are processed in raw unicode strings. That is, ur'\u3000' is a string of length 1 consisting of the IDEOGRAPHIC SPACE unicode character.

That surprised me until I rechecked the fine manual and found:

"When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string."

"When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \u and \U escape sequences are processed while all other backslashes are left in the string."

When 'u' was removed in Python 3, a choice had to be made and the first must have seemed to be the obvious one, or perhaps the automatic one.

In 3.3, 'u' is being restored. I have inquired on pydev list whether the difference above should also be restored, and mentioned this thread.

-- Terry Jan Reedy
Re: python3 raw strings and \u escapes
On 30.05.12 14:54, Thomas Rachel wrote:
> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read, but should do the trick...

Or r'[ %s]' % ('\u3000',).
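A minimal sketch of the %-formatting variant with a named constant, which at least tells the reader which character is being inserted. IDEOGRAPHIC_SPACE is a name invented for this example, and the pattern is illustrative only.

```python
import re

IDEOGRAPHIC_SPACE = '\u3000'  # hypothetical constant name, for readability

# The raw template keeps \d undoubled; %s is filled in with the real character.
pattern = r'[ %s]\d+' % (IDEOGRAPHIC_SPACE,)

print(re.split(pattern, 'A' + IDEOGRAPHIC_SPACE + '12B'))
```

One caveat with this approach: any literal % in the regex template has to be written as %% so that the formatting operator leaves it alone.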
Re: python3 raw strings and \u escapes
On 05/30/2012 10:46 AM, Terry Reedy wrote:
> On 5/30/2012 2:52 AM, ru...@yahoo.com wrote:
>> In python2, \u escapes are processed in raw unicode strings. That is, ur'\u3000' is a string of length 1 consisting of the IDEOGRAPHIC SPACE unicode character.
>
> That surprised me until I rechecked the fine manual and found:
>
> "When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string."
>
> "When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \u and \U escape sequences are processed while all other backslashes are left in the string."
>
> When 'u' was removed in Python 3, a choice had to be made and the first must have seemed to be the obvious one, or perhaps the automatic one.
>
> In 3.3, 'u' is being restored. I have inquired on pydev list whether the difference above should also be restored, and mentioned this thread.

As mentioned in a different message, another option might be to leave raw strings as they are (more consistent, since all backslashes are treated the same) and have the re module un-escape \u (and similar) literals in regex strings (also more consistent, since that's what it does with '\\n', '\\t', etc.)

I do realize though that this may have backward-compatibility problems that make it impossible to do.
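For readers arriving later: the re documentation notes that the \u and \U escapes were added to the pattern syntax in Python 3.3 (for str patterns), which is essentially the option described here. A short sketch, assuming Python 3.3 or later; the sample strings are made up:

```python
import re
import sys

# Assumption: Python 3.3+, where re itself interprets \uXXXX in str patterns.
assert sys.version_info >= (3, 3)

# The raw string still contains six literal characters; re decodes them.
print(re.split(r'[\u3000]', 'A\u3000A'))

# Other regex escapes can stay undoubled in the same raw string.
print(re.split(r'[ \u3000]\d{3}', 'A' + '\u3000' + '123B 456C'))
```

With this, the original r'[\u3000]' pattern from the 2to3 output splits as it did in Python 2, without touching the r prefix.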
Re: python3 raw strings and \u escapes
On 30 May, 13:54, Thomas Rachel nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa...@spamschutz.glglgl.de wrote:
> On 30.05.2012 08:52, ru...@yahoo.com wrote:
>> And is there no choice for me but to choose between the two poor choices I mention above to deal with this problem?
>
> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read, but it should do the trick...
>
> Thomas

I suggest approaching the problem differently. Python 3 succeeded in bringing order to the mismatch of character encodings that Python 2 offered.

In your case, the

    >>> import unicodedata as ud
    >>> ud.name('\u3000')
    'IDEOGRAPHIC SPACE'

character (in fact a unicode code point) is just a character, like

    >>> ud.name('a')
    'LATIN SMALL LETTER A'

The code point / unicode logic that Python 3 proposes and follows then becomes straightforward.

    >>> s = 'a\u3000é\u3000€'
    >>> s.split('\u3000')
    ['a', 'é', '€']
    >>> import re
    >>> re.split('\u3000', s)
    ['a', 'é', '€']

The backslash, used as a real backslash, remains what it was in Python 2. Note the absence of r'...'.

    >>> s = 'a\\b\\c'
    >>> print(s)
    a\b\c
    >>> s.split('\\')
    ['a', 'b', 'c']
    >>> re.split('\\\\', s)
    ['a', 'b', 'c']
    >>> hex(ord('\\'))
    '0x5c'
    >>> re.split('\u005c\u005c', s)
    ['a', 'b', 'c']

jmf
Re: python3 raw strings and \u escapes
On 30 May, 08:52, ru...@yahoo.com wrote:
> In python2, \u escapes are processed in raw unicode strings. That is, ur'\u3000' is a string of length 1 consisting of the IDEOGRAPHIC SPACE unicode character.
>
> In python3, \u escapes are not processed in raw strings. r'\u3000' is a string of length 6 consisting of a backslash, 'u', '3' and three '0' characters.
>
> This breaks a lot of my code because in python 2
>
>     re.split (ur'[\u3000]', u'A\u3000A') == [u'A', u'A']
>
> but in python 3 (the result of running 2to3),
>
>     re.split (r'[\u3000]', 'A\u3000A' ) == ['A\u3000A']
>
> I can remove the r prefix from the regex string but then if I have other regex backslash symbols in it, I have to double all the other backslashes -- the very thing that the r-prefix was invented to avoid.
>
> Or I can leave the r prefix and replace something like r'[ \u3000]' with r'[ 　]'. But that is confusing because one can't distinguish between the space character and the ideographic space character. It is also a problem if a reader of the code doesn't have a font that can display the character.
>
> Was there a reason for dropping the lexical processing of \u escapes in strings in python3 (other than to add another annoyance in a long list of python3 annoyances?)
>
> And is there no choice for me but to choose between the two poor choices I mention above to deal with this problem?

I suggest approaching the problem differently. Python 3 succeeded in bringing order to the mismatch of character encodings that Python 2 offered.

The 'IDEOGRAPHIC SPACE' and 'REVERSE SOLIDUS' (backslash) characters (in fact unicode code points) are just (normal) characters. The backslash, used as an escaping command, keeps its function. Note the absence of r'...'.

    >>> s = 'a\u3000é\u3000€'
    >>> s.split('\u3000')
    ['a', 'é', '€']
    >>> import re
    >>> re.split('\u3000', s)
    ['a', 'é', '€']

    >>> s = 'a\\b\\c'
    >>> print(s)
    a\b\c
    >>> s.split('\\')
    ['a', 'b', 'c']
    >>> re.split('\\\\', s)
    ['a', 'b', 'c']
    >>> hex(ord('\\'))
    '0x5c'
    >>> re.split('\u005c\u005c', s)
    ['a', 'b', 'c']

jmf