Re: re Questions
On Sun, Jan 26, 2014 at 9:59 AM, Blake Adams blakesad...@gmail.com wrote:

> I'm pretty new to Python and understand most of the basics of Python re, but I'm stumped by some unexpected matching behavior. If I want to set up a match replicating the '\w' pattern, I would assume that would be done with '[A-z0-9_]'. However, when I run the following:
>
> re.findall('[A-z0-9_]','^;z %C\@0~_')
>
> it matches ['^', 'z', 'C', '\\', '0', '_']. I would expect the match to be ['z', 'C', '0', '_']. Why does this happen?

Because the characters [ \ ] ^ _ and ` are between Z and a in the ASCII character set, so the range A-z includes them. You need to do this:

re.findall('[A-Za-z0-9_]','^;z %C\@0~_')

-- https://mail.python.org/mailman/listinfo/python-list
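For anyone following along, here is a minimal sketch of the point above. It assumes Python 3 and writes the sample string as a raw string so the backslash is kept literally; the variable names are only illustrative.

```python
import re

# Characters whose code points fall strictly between 'Z' (90) and 'a' (97).
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

# The sample string from the question, as a raw string.
sample = r"^;z %C\@0~_"

# A-z spans the six punctuation characters above as well as the letters.
loose = re.findall(r"[A-z0-9_]", sample)
# Two separate ranges cover only the letters.
strict = re.findall(r"[A-Za-z0-9_]", sample)

print(loose)   # ['^', 'z', 'C', '\\', '0', '_']
print(strict)  # ['z', 'C', '0', '_']
```

The underscore shows up in both results because it is listed explicitly in the character class, not because of the range.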
Re: re Questions
On Mon, Jan 27, 2014 at 3:59 AM, Blake Adams blakesad...@gmail.com wrote:

> If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'. However, when I run the following:
>
> re.findall('[A-z0-9_]','^;z %C\@0~_')
>
> it matches ['^', 'z', 'C', '\\', '0', '_']. I would expect the match to be ['z', 'C', '0', '_']. Why does this happen?

Because \w is not the same as [A-z0-9_]. Quoting from the docs:

  \w
  For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
  For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

If you're working with a byte string, then you're close, but A-z is quite different from A-Za-z. The set [A-z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\]^_`abcdefghijklmnopqrstuvwxyz] (that's a literal backslash and an escaped closing bracket in there, btw), so it'll also catch several non-alphabetic characters. With a Unicode string, \w is quite distinctly different, since it also matches non-ASCII letters and digits.

Either way, \w means word characters, though, so just go ahead and use it whenever you want word characters :)

ChrisA
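A rough illustration of the three \w behaviours described in the docs excerpt, assuming Python 3 (where str patterns are Unicode by default). The sample text is made up, and its non-ASCII characters are written as escapes to keep the snippet encoding-independent.

```python
import re

# Made-up sample: "naive" with i-diaeresis (U+00EF), then "_42", a space,
# and a Greek capital omega (U+03A9).
text = "na\u00efve_42 \u03a9"

# str pattern: \w matches Unicode word characters.
unicode_words = re.findall(r"\w", text)

# str pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w", text, re.ASCII)

# bytes pattern: \w matches ASCII alphanumerics and the underscore only,
# so the UTF-8 bytes of the two non-ASCII characters are skipped.
bytes_words = re.findall(rb"\w", text.encode("utf-8"))

print(unicode_words)
print(ascii_words)
print(bytes_words)
```

Only the first call picks up the two non-ASCII letters; the other two return the same seven ASCII characters, as str and as bytes respectively.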
Re: re Questions
In article mailman.5996.1390756093.18130.python-l...@python.org, Chris Angelico ros...@gmail.com wrote:

> The set [A-z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\]^_`abcdefghijklmnopqrstuvwxyz]

I'm inclined to suggest the regex compiler should issue a warning for this. I've never seen a character range other than A-Z, a-z, or 0-9. Well, I suppose A-F or a-f if you're trying to match hex digits (and some variations on that for octal). But I can't imagine any example where somebody wrote A-z and it wasn't an error.
Re: re Questions
On Sunday, January 26, 2014 12:06:59 PM UTC-5, larry@gmail.com wrote:

> You need to do this:
>
> re.findall('[A-Za-z0-9_]','^;z %C\@0~_')

Got it, that makes sense. Thanks for the quick reply, Larry.
Re: re Questions
On Sunday, January 26, 2014 12:08:01 PM UTC-5, Chris Angelico wrote:

> Either way, \w means word characters, though, so just go ahead and use it whenever you want word characters :)

Thanks Chris
Re: re Questions
On Mon, Jan 27, 2014 at 4:15 AM, Roy Smith r...@panix.com wrote:

>> The set [A-z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\]^_`abcdefghijklmnopqrstuvwxyz]
>
> I'm inclined to suggest the regex compiler should issue a warning for this. I've never seen a character range other than A-Z, a-z, or 0-9. Well, I suppose A-F or a-f if you're trying to match hex digits (and some variations on that for octal). But I can't imagine any example where somebody wrote A-z and it wasn't an error.

I've used a variety of character ranges, certainly more than the 4-5 you listed, but I agree that A-z is extremely likely to be an error. However, I've sometimes used a regex (bytes mode) to find, say, all the ASCII printable characters - [ -~] - and I wouldn't want that precluded. It's a bit tricky trying to figure out which are likely to be errors and which are not, so I'd be inclined to keep things as they are. No warnings.

ChrisA
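As a small sketch of the [ -~] range mentioned above: space is 0x20 and tilde is 0x7E, so in bytes mode the range covers exactly the printable ASCII characters. The sample data and variable names are invented for illustration, and Python 3 is assumed.

```python
import re

# Invented sample: printable text mixed with control and high bytes.
data = b"ok\x00\x07 hi~\x7f\xff!"

# [ -~] spans 0x20 (space) through 0x7E (tilde); + collects whole runs.
printable_runs = re.findall(rb"[ -~]+", data)

print(printable_runs)  # [b'ok', b' hi~', b'!']
```

A blanket warning on unusual ranges would flag this pattern too, which is the concern raised in the message.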
Re: re Questions
On 26/01/2014 17:15, Blake Adams wrote:

> Thanks Chris

I'm pleased to see that your question has been answered. Now would you please read and action this https://wiki.python.org/moin/GoogleGroupsPython to prevent us seeing the double line spacing in the quoted text above, thanks.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
Re: re Questions
On 26/01/2014 17:25, Chris Angelico wrote:

> It's a bit tricky trying to figure out which are likely to be errors and which are not, so I'd be inclined to keep things as they are. No warnings.

I suggest a single warning is always given: "Regular expressions can be fickle. Have you considered using string methods?" My apologies to regex fans if they're currently choking over their tea, coffee, cocoa, beer, scotch, sake, ouzo or whatever :)

Mark Lawrence
Re: re Questions
On 2014-01-26 12:15, Roy Smith wrote:

>> The set [A-z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\]^_`abcdefghijklmnopqrstuvwxyz]
>
> I'm inclined to suggest the regex compiler should issue a warning for this. I've never seen a character range other than A-Z, a-z, or 0-9. Well, I suppose A-F or a-f if you're trying to match hex digits (and some variations on that for octal). But I can't imagine any example where somebody wrote A-z and it wasn't an error.

I'd not object to warnings on that one literal A-z set, but I've done some work with VINs¹ where the allowable character set is A-Z and digits, minus letters that can be hard to distinguish visually (I/O/Q), so I've used

^[A-HJ-NPR-Z0-9]{17}$

as a first-pass filter for VINs that were entered (often scanned, but occasionally hand-keyed). In some environments, I've been able to intercept I/O/Q and remap them to 1/0/0 to do the disambiguation for the user. So I'd not want to see other character classes touched, as they can be perfectly legit.

-tkc

¹ http://en.wikipedia.org/wiki/Vehicle_Identification_Number
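A minimal sketch of how the first-pass filter and the I/O/Q remapping described above might be combined. The helper names (normalize_vin, looks_like_vin) and the sample values are hypothetical, not taken from the original code, and the check covers only length and character set, not the VIN check digit.

```python
import re

# The first-pass pattern from the message: 17 characters, A-Z minus I/O/Q, plus digits.
VIN_RE = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")

# Hypothetical remap table: I -> 1, O -> 0, Q -> 0.
REMAP = str.maketrans("IOQ", "100")


def normalize_vin(raw):
    """Trim whitespace, upper-case, and remap the look-alike letters."""
    return raw.strip().upper().translate(REMAP)


def looks_like_vin(raw):
    """Return True if the normalized input passes the first-pass filter."""
    return VIN_RE.match(normalize_vin(raw)) is not None


# Made-up 17-character values, not real VINs.
print(looks_like_vin("1ABCD23EFGH456789"))
print(looks_like_vin(" 1abcd23efgh45678o\n"))
print(looks_like_vin("ABC"))
```

Remapping before matching means a hand-keyed O or Q is accepted as a zero instead of being rejected; whether that is appropriate depends on the environment, as the message notes.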