Re: python3 raw strings and \u escapes

2012-06-15 Thread Jason Friedman

This is a related question.

I perform an octal dump on a file:
$ od -cx file
000   h   e   l   l   o       w   o   r   l   d  \n
           6568    6c6c    206f    6f77    6c72    0a64

I want to output the names of those characters:
$ python3
Python 3.2.3 (default, May 19 2012, 17:01:30)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name('\u0068')
'LATIN SMALL LETTER H'
>>> unicodedata.name('\u0065')
'LATIN SMALL LETTER E'

But how do I do this programmatically:
>>> first_two_letters = "6568    6c6c    206f    6f77    6c72    0a64".split()[0]
>>> first_two_letters
'6568'
>>> first_letter = "00" + first_two_letters[2:]
>>> first_letter
'0068'

Now what?
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: python3 raw strings and \u escapes

2012-06-15 Thread MRAB


On 16/06/2012 00:42, Jason Friedman wrote:
> This is a related question.
>
> I perform an octal dump on a file:
> $ od -cx file
> 000   h   e   l   l   o       w   o   r   l   d  \n
>            6568    6c6c    206f    6f77    6c72    0a64
>
> I want to output the names of those characters:
> $ python3
> Python 3.2.3 (default, May 19 2012, 17:01:30)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import unicodedata
> >>> unicodedata.name('\u0068')
> 'LATIN SMALL LETTER H'
> >>> unicodedata.name('\u0065')
> 'LATIN SMALL LETTER E'
>
> But how do I do this programmatically:
> >>> first_two_letters = "6568    6c6c    206f    6f77    6c72    0a64".split()[0]
> >>> first_two_letters
> '6568'
> >>> first_letter = "00" + first_two_letters[2:]
> >>> first_letter
> '0068'
>
> Now what?

>>> hex_code = "65"
>>> unicodedata.name(chr(int(hex_code, 16)))
'LATIN SMALL LETTER E'
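
The one-liner above handles a single hex pair. A minimal sketch of applying it to a whole line of `od -x` output follows. It assumes the dump was made on a little-endian machine, so the two bytes of each 16-bit word appear swapped (consistent with 'h', 0x68, being the second pair in `6568`). The function name `names_from_od_words` and the fallback label for unnamed characters are illustrative choices, not something from the thread.

```python
import unicodedata

def names_from_od_words(hex_line):
    """Return Unicode names for the bytes behind one line of `od -x` output.

    Assumes 16-bit words printed on a little-endian machine, so the
    second hex pair of each word is the first byte in the file.
    """
    names = []
    for word in hex_line.split():
        # Put the two hex pairs back into file order.
        for hex_code in (word[2:], word[:2]):
            char = chr(int(hex_code, 16))
            # Control characters such as '\n' have no name; use a fallback.
            fallback = "<unnamed U+%04X>" % ord(char)
            names.append(unicodedata.name(char, fallback))
    return names

for name in names_from_od_words("6568 6c6c 206f 6f77 6c72 0a64"):
    print(name)
```

Passing a default to `unicodedata.name` avoids the `ValueError` that would otherwise be raised for the trailing newline.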

Re: python3 raw strings and \u escapes

2012-06-15 Thread Jason Friedman

>> This is a related question.
>>
>> I perform an octal dump on a file:
>> $ od -cx file
>> 000   h   e   l   l   o       w   o   r   l   d  \n
>>            6568    6c6c    206f    6f77    6c72    0a64
>>
>> I want to output the names of those characters:
>> $ python3
>> Python 3.2.3 (default, May 19 2012, 17:01:30)
>> [GCC 4.6.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import unicodedata
>> >>> unicodedata.name('\u0068')
>> 'LATIN SMALL LETTER H'
>> >>> unicodedata.name('\u0065')
>> 'LATIN SMALL LETTER E'
>>
>> But how do I do this programmatically:
>> >>> first_two_letters = "6568    6c6c    206f    6f77    6c72    0a64".split()[0]
>> >>> first_two_letters
>> '6568'
>> >>> first_letter = "00" + first_two_letters[2:]
>> >>> first_letter
>> '0068'
>>
>> Now what?
>
> >>> hex_code = "65"
> >>> unicodedata.name(chr(int(hex_code, 16)))
> 'LATIN SMALL LETTER E'

Very helpful, thank you MRAB.

The finished product:  http://pastebin.com/4egQcke2.

Re: python3 raw strings and \u escapes

2012-05-31 Thread ru...@yahoo.com

On 05/30/2012 09:07 AM, ru...@yahoo.com wrote:
> On 05/30/2012 05:54 AM, Thomas Rachel wrote:
>> On 30.05.2012 08:52, ru...@yahoo.com wrote:
>>>
>>> This breaks a lot of my code because in python 2
>>>    re.split (ur'[\u3000]', u'A\u3000A') ==  [u'A', u'A']
>>> but in python 3 (the result of running 2to3),
>>>    re.split (r'[\u3000]', 'A\u3000A' ) ==  ['A\u3000A']
>>>
>>> I can remove the r prefix from the regex string but then
>>> if I have other regex backslash symbols in it, I have to
>>> double all the other backslashes -- the very thing that
>>> the r-prefix was invented to avoid.
>>>
>>> Or I can leave the r prefix and replace something like
>>> r'[ \u3000]' with r'[ 　]'.  But that is confusing because
>>> one can't distinguish between the space character and
>>> the ideographic space character.  It is also a problem if a
>>> reader of the code doesn't have a font that can display
>>> the character.
>>>
>>> Was there a reason for dropping the lexical processing of
>>> \u escapes in strings in python3 (other than to add another
>>> annoyance in a long list of python3 annoyances?)
>>
>> It is probably more consistent. Alas, it makes the whole thing
>> incompatible with Py2.
>>
>> But if you think about it: why allow for \u if \r, \n etc. are
>> disallowed as well?
>
> Maybe the blame is elsewhere then...  If the re module
> interprets (in a regex string) the 2-character string
> consisting of r'\' followed by 'n' as a single newline
> character, then why wasn't re changed for Python 3 to
> interpret the 6-character string, r'\u3000' as a single
> unicode character to correspond with Python's lexer no
> longer doing that (as it did in Python 2)?
>
>>> And is there no choice for me but to choose between the two
>>> poor choices I mention above to deal with this problem?
>>
>> There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
>> but should do the trick...
>
> I guess the +s could be left out allowing something
> like,
>
>    '[ \u3000]' r'\w+ \d{3}'
>
> but I'll have to try it a little; maybe just doubling
> backslashes won't be much worse.  I did that for years
> in Perl and lived through it.

Just for some closure, there are many places in my code
that I had/have to track down and change.  But the biggest
problem so far is a lexer module that is structured as many
dozens of little functions, each with a docstring that is
a regex string.

The only way I found to change these and maintain sanity was
to go through them and remove the r prefix from any strings
that contain \u literals, and then double any other
backslashes in the string.

Since these are docstrings, creating them with executable
code was awkward, and using adjacent string concatenation
led to a very confusing mix of string styles.  Strings that
used concatenation often had a single logical regex structure
(eg a character set [...]) split between two strings.
The extra quote characters were as visually confusing as
doubled backslashes in many cases.

Strings with doubled backslashes, although harder to read,
were much easier to edit reliably and, in their way,
more regular.  It does make this module look very Perlish
though... :-)
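
To illustrate the approach described above, here is a sketch of a PLY-style token function whose docstring is a regex written without the r prefix and with the other backslashes doubled. The function name `t_SPACED_WORD` is hypothetical, and the pattern is compiled directly with `re` rather than through PLY, so this only shows how Python 3 reads the docstring text.

```python
import re

def t_SPACED_WORD(t):
    '[ \u3000]\\w+ \\d{3}'
    # Non-raw docstring: \u3000 becomes one character, while the regex
    # escapes \w and \d need doubled backslashes to survive.
    return t

# PLY would read the pattern from __doc__; here it is compiled directly.
pattern = re.compile(t_SPACED_WORD.__doc__)

# Same text as the concatenated form discussed earlier in the thread.
print(t_SPACED_WORD.__doc__ == '[ \u3000]' r'\w+ \d{3}')
print(pattern.match('\u3000abc 123') is not None)
```

Note that docstrings are stripped when Python runs with `-OO`, which is one practical hazard of keeping patterns there.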

Re: python3 raw strings and \u escapes

2012-05-31 Thread Chris Angelico

On Fri, Jun 1, 2012 at 6:28 AM, ru...@yahoo.com <ru...@yahoo.com> wrote:
> ... a lexer module that is structured as many
> dozens of little functions, each with a docstring that is
> a regex string.

This may be a good opportunity to take a step back and ask yourself:
Why so many functions, each with a regular expression in its
docstring?

Chris Angelico

Re: python3 raw strings and \u escapes

2012-05-31 Thread ru...@yahoo.com

On 05/31/2012 03:10 PM, Chris Angelico wrote:
> On Fri, Jun 1, 2012 at 6:28 AM, ru...@yahoo.com <ru...@yahoo.com> wrote:
>> ... a lexer module that is structured as many
>> dozens of little functions, each with a docstring that is
>> a regex string.
>
> This may be a good opportunity to take a step back and ask yourself:
> Why so many functions, each with a regular expression in its
> docstring?

Because that's the way David Beazley designed Ply?
 http://dabeaz.com/ply/

Personally, I think it's an abuse of docstrings but
he never asked me for my opinion...

Re: python3 raw strings and \u escapes

2012-05-30 Thread Andrew Berg

On 5/30/2012 1:52 AM, ru...@yahoo.com wrote:
> Was there a reason for dropping the lexical processing of
> \u escapes in strings in python3 (other than to add another
> annoyance in a long list of python3 annoyances?)

To me, this would be a Python 2 annoyance since I would expect r'\u3000'
to be literally the six characters '\u3000' since the entire point of
raw strings is to treat everything literally. Why should anything at all
be processed when constructing a raw string?
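
The Python 3 behaviour described here is easy to check. A small sketch (the variable names `raw` and `cooked` are just for illustration):

```python
raw = r'\u3000'      # Python 3: backslash, 'u', '3', '0', '0', '0'
cooked = '\u3000'    # one character, IDEOGRAPHIC SPACE

print(len(raw))            # 6
print(len(cooked))         # 1
print(raw == '\\u3000')    # True
```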

-- 
CPython 3.3.0a3 | Windows NT 6.1.7601.17790

Re: python3 raw strings and \u escapes

2012-05-30 Thread Thomas Rachel


On 30.05.2012 08:52, ru...@yahoo.com wrote:

> This breaks a lot of my code because in python 2
>    re.split (ur'[\u3000]', u'A\u3000A') ==  [u'A', u'A']
> but in python 3 (the result of running 2to3),
>    re.split (r'[\u3000]', 'A\u3000A' ) ==  ['A\u3000A']
>
> I can remove the r prefix from the regex string but then
> if I have other regex backslash symbols in it, I have to
> double all the other backslashes -- the very thing that
> the r-prefix was invented to avoid.
>
> Or I can leave the r prefix and replace something like
> r'[ \u3000]' with r'[ 　]'.  But that is confusing because
> one can't distinguish between the space character and
> the ideographic space character.  It is also a problem if a
> reader of the code doesn't have a font that can display
> the character.
>
> Was there a reason for dropping the lexical processing of
> \u escapes in strings in python3 (other than to add another
> annoyance in a long list of python3 annoyances?)

It is probably more consistent. Alas, it makes the whole thing
incompatible with Py2.

But if you think about it: why allow for \u if \r, \n etc. are
disallowed as well?

> And is there no choice for me but to choose between the two
> poor choices I mention above to deal with this problem?

There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
but should do the trick...



Thomas
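
A minimal sketch of that third option in use. The test string is made up for illustration; only the middle literal is non-raw, so only it gets \u processing.

```python
import re

# Raw pieces keep their backslashes; the plain piece is decoded by the lexer.
pattern = r'[ ' + '\u3000' + ']'

print(len(pattern))                      # 4: '[', ' ', U+3000, ']'
print(re.split(pattern, 'A\u3000A B'))   # ['A', 'A', 'B']
```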

Re: python3 raw strings and \u escapes

2012-05-30 Thread Devin Jeanpierre

On Wed, May 30, 2012 at 2:52 AM, ru...@yahoo.com <ru...@yahoo.com> wrote:
> Was there a reason for dropping the lexical processing of
> \u escapes in strings in python3 (other than to add another
> annoyance in a long list of python3 annoyances?)
>
> And is there no choice for me but to choose between the two
> poor choices I mention above to deal with this problem?

The solution of r'[' + '\u3000' + r']...' was pretty good.

Real reason I posted: Maybe the re module should handle \u escapes, in
addition to the other backslash escapes it processes?

This would be backwards incompatible, though, so maybe it's too late.

-- Devin
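
For readers on newer interpreters: from Python 3.3 onward the re module does recognise \uXXXX and \UXXXXXXXX escapes in str patterns, so the raw pattern from the original complaint splits as expected again. A small check, assuming Python 3.3 or later (on 3.2, the version discussed in this thread, the split does not happen):

```python
import re

# The raw string still holds six characters; re itself decodes the escape.
pattern = r'[\u3000]'

print(len(pattern))                     # 8: '[', six escape characters, ']'
print(re.split(pattern, 'A\u3000A'))    # ['A', 'A'] on Python 3.3+
```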

Re: python3 raw strings and \u escapes

2012-05-30 Thread Arnaud Delobelle

On 30 May 2012 12:54, Thomas Rachel
<nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa...@spamschutz.glglgl.de>
wrote:
> There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read, but
> should do the trick...

You could even take advantage of string literal concatenation:)

r'[' '\u3000' r']'

-- 
Arnaud
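
Adjacent string literals are joined at compile time, and each keeps its own prefix, so raw and non-raw pieces can be mixed. A sketch that spreads a longer pattern over several lines; the pattern and the test string are illustrative only.

```python
import re

pattern = (
    r'[ ' '\u3000' r']'   # character class: space or IDEOGRAPHIC SPACE
    r'\w+ \d{3}'          # raw part keeps its single backslashes
)

match = re.match(pattern, '\u3000abc 123')
print(pattern == '[ \u3000]\\w+ \\d{3}')   # True
print(match is not None)                   # True
```

Wrapping the literals in parentheses leaves room for a comment beside each piece, which offsets some of the visual noise from the extra quotes.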

Re: python3 raw strings and \u escapes

2012-05-30 Thread ru...@yahoo.com

On 05/30/2012 05:54 AM, Thomas Rachel wrote:
> On 30.05.2012 08:52, ru...@yahoo.com wrote:
>
>> This breaks a lot of my code because in python 2
>>    re.split (ur'[\u3000]', u'A\u3000A') ==  [u'A', u'A']
>> but in python 3 (the result of running 2to3),
>>    re.split (r'[\u3000]', 'A\u3000A' ) ==  ['A\u3000A']
>>
>> I can remove the r prefix from the regex string but then
>> if I have other regex backslash symbols in it, I have to
>> double all the other backslashes -- the very thing that
>> the r-prefix was invented to avoid.
>>
>> Or I can leave the r prefix and replace something like
>> r'[ \u3000]' with r'[ 　]'.  But that is confusing because
>> one can't distinguish between the space character and
>> the ideographic space character.  It is also a problem if a
>> reader of the code doesn't have a font that can display
>> the character.
>>
>> Was there a reason for dropping the lexical processing of
>> \u escapes in strings in python3 (other than to add another
>> annoyance in a long list of python3 annoyances?)
>
> It is probably more consistent. Alas, it makes the whole thing
> incompatible with Py2.
>
> But if you think about it: why allow for \u if \r, \n etc. are
> disallowed as well?

Maybe the blame is elsewhere then...  If the re module
interprets (in a regex string) the 2-character string
consisting of r'\' followed by 'n' as a single newline
character, then why wasn't re changed for Python 3 to
interpret the 6-character string, r'\u3000' as a single
unicode character to correspond with Python's lexer no
longer doing that (as it did in Python 2)?

>> And is there no choice for me but to choose between the two
>> poor choices I mention above to deal with this problem?
>
> There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
> but should do the trick...

I guess the +s could be left out allowing something
like,

  '[ \u3000]' r'\w+ \d{3}'

but I'll have to try it a little; maybe just doubling
backslashes won't be much worse.  I did that for years
in Perl and lived through it.


Re: python3 raw strings and \u escapes

2012-05-30 Thread Terry Reedy


On 5/30/2012 2:52 AM, ru...@yahoo.com wrote:

> In python2, \u escapes are processed in raw unicode
> strings.  That is, ur'\u3000' is a string of length 1
> consisting of the IDEOGRAPHIC SPACE unicode character.

That surprised me until I rechecked the fine manual and found:

When an 'r' or 'R' prefix is present, a character following a backslash
is included in the string without change, and all backslashes are left
in the string.

When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U'
prefix, then the \u and \U escape sequences are processed
while all other backslashes are left in the string.

When 'u' was removed in Python 3, a choice had to be made and the first
must have seemed to be the obvious one, or perhaps the automatic one.

In 3.3, 'u' is being restored. I have inquired on the pydev list whether
the difference above should also be restored, and mentioned this thread.


--
Terry Jan Reedy


Re: python3 raw strings and \u escapes

2012-05-30 Thread Serhiy Storchaka


On 30.05.12 14:54, Thomas Rachel wrote:

> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
> but should do the trick...


Or r'[ %s]' % ('\u3000',).
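
A sketch of the formatting variant. Spelling the character as `'\N{IDEOGRAPHIC SPACE}'` and binding it to the name `IDEOGRAPHIC_SPACE` are additions made here for readability, not part of the suggestion above; the test string is made up.

```python
import re

IDEOGRAPHIC_SPACE = '\N{IDEOGRAPHIC SPACE}'  # same character as '\u3000'

# The raw template keeps \d intact; %s is filled with the real character.
pattern = r'[ %s]\d+' % (IDEOGRAPHIC_SPACE,)

print(pattern == '[ \u3000]\\d+')            # True
print(re.split(pattern, 'A\u300012B 3C'))    # ['A', 'B', 'C']
```

One caveat with this style: a literal percent sign in the regex has to be written as `%%` in the template.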


Re: python3 raw strings and \u escapes

2012-05-30 Thread ru...@yahoo.com


On 05/30/2012 10:46 AM, Terry Reedy wrote:
> On 5/30/2012 2:52 AM, ru...@yahoo.com wrote:
>> In python2, \u escapes are processed in raw unicode
>> strings.  That is, ur'\u3000' is a string of length 1
>> consisting of the IDEOGRAPHIC SPACE unicode character.
>
> That surprised me until I rechecked the fine manual and found:
>
> When an 'r' or 'R' prefix is present, a character following a backslash
> is included in the string without change, and all backslashes are left
> in the string.
>
> When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U'
> prefix, then the \u and \U escape sequences are processed
> while all other backslashes are left in the string.
>
> When 'u' was removed in Python 3, a choice had to be made and the first
> must have seemed to be the obvious one, or perhaps the automatic one.
>
> In 3.3, 'u' is being restored. I have inquired on pydev list whether the
> difference above should also be restored, and mentioned this thread.

As mentioned in a different message, another option might
be to leave raw strings as is (more consistent since all
backslashes are treated the same) and have the re module
un-escape \u (and similar) literals in regex strings
(also more consistent since that's what it does with '\\n',
'\\t', etc.)

I do realize though that this may have backward-compatibility
problems that make it impossible to do.




Re: python3 raw strings and \u escapes

2012-05-30 Thread jmfauth

On 30 May, 13:54, Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa...@spamschutz.glglgl.de> wrote:
> On 30.05.2012 08:52, ru...@yahoo.com wrote:
>
>> This breaks a lot of my code because in python 2
>>    re.split (ur'[\u3000]', u'A\u3000A') ==  [u'A', u'A']
>> but in python 3 (the result of running 2to3),
>>    re.split (r'[\u3000]', 'A\u3000A' ) ==  ['A\u3000A']
>>
>> I can remove the r prefix from the regex string but then
>> if I have other regex backslash symbols in it, I have to
>> double all the other backslashes -- the very thing that
>> the r-prefix was invented to avoid.
>>
>> Or I can leave the r prefix and replace something like
>> r'[ \u3000]' with r'[ 　]'.  But that is confusing because
>> one can't distinguish between the space character and
>> the ideographic space character.  It is also a problem if a
>> reader of the code doesn't have a font that can display
>> the character.
>>
>> Was there a reason for dropping the lexical processing of
>> \u escapes in strings in python3 (other than to add another
>> annoyance in a long list of python3 annoyances?)
>
> It is probably more consistent. Alas, it makes the whole thing
> incompatible with Py2.
>
> But if you think about it: why allow for \u if \r, \n etc. are
> disallowed as well?
>
>> And is there no choice for me but to choose between the two
>> poor choices I mention above to deal with this problem?
>
> There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
> but should do the trick...
>
> Thomas

I suggest looking at the problem differently. Python 3
succeeded in bringing order to the mismatched character
encodings that Python 2 offered.

In your case, the

>>> import unicodedata as ud
>>> ud.name('\u3000')
'IDEOGRAPHIC SPACE'

character (in fact a Unicode code point) is just
a character, like

>>> ud.name('a')
'LATIN SMALL LETTER A'

The code point / Unicode logic that Python 3 proposes and
follows becomes straightforward.

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']

>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']


The backslash, used as a real backslash, remains what it
was in Python 2. Note the absence of r'...'.

>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']

>>> hex(ord('\\'))
'0x5c'
>>> re.split('\u005c\u005c', s)
['a', 'b', 'c']

jmf


Re: python3 raw strings and \u escapes

2012-05-30 Thread jmfauth

On 30 May, 08:52, ru...@yahoo.com <ru...@yahoo.com> wrote:
> In python2, \u escapes are processed in raw unicode
> strings.  That is, ur'\u3000' is a string of length 1
> consisting of the IDEOGRAPHIC SPACE unicode character.
>
> In python3, \u escapes are not processed in raw strings.
> r'\u3000' is a string of length 6 consisting of a backslash,
> 'u', '3' and three '0' characters.
>
> This breaks a lot of my code because in python 2
>    re.split (ur'[\u3000]', u'A\u3000A') == [u'A', u'A']
> but in python 3 (the result of running 2to3),
>    re.split (r'[\u3000]', 'A\u3000A' ) == ['A\u3000A']
>
> I can remove the r prefix from the regex string but then
> if I have other regex backslash symbols in it, I have to
> double all the other backslashes -- the very thing that
> the r-prefix was invented to avoid.
>
> Or I can leave the r prefix and replace something like
> r'[ \u3000]' with r'[ 　]'.  But that is confusing because
> one can't distinguish between the space character and
> the ideographic space character.  It is also a problem if a
> reader of the code doesn't have a font that can display
> the character.
>
> Was there a reason for dropping the lexical processing of
> \u escapes in strings in python3 (other than to add another
> annoyance in a long list of python3 annoyances?)
>
> And is there no choice for me but to choose between the two
> poor choices I mention above to deal with this problem?


I suggest looking at the problem differently. Python 3
succeeded in bringing order to the mismatched character
encodings that Python 2 offered.

The 'IDEOGRAPHIC SPACE' and 'REVERSE SOLIDUS' (backslash)
characters (in fact Unicode code points) are just (normal)
characters. The backslash, used as an escaping command,
keeps its function.

Note the absence of r'...'

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']

>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']


>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']

>>> hex(ord('\\'))
'0x5c'
>>> re.split('\u005c\u005c', s)
['a', 'b', 'c']

jmf
