Re: [PHP-DEV] preg_replace does not replace all occurrences
On 14 March 2011 20:36, Hannes Landeholm landeh...@gmail.com wrote:
> What is more likely to be wrong? Your understanding of a specific regex
> pattern (which happens to be full of escapes, making it incredibly hard
> to read) or the implementation of preg_replace?
>
> ~Hannes
>
> On 14 March 2011 16:18, Martin Scotta martinsco...@gmail.com wrote:
>> I chose the simplest example to show the preg_replace behavior; there
>> are better (and safer) ways to escape slash characters.
>>
>> Anyway, *is this the expected preg_replace behavior?*
>>
>> Martin
>>
>>> <?php
>>> function test($str) {
>>>     static $re = '/(^|[^\\\\])\'/';
>>>     static $change = '$1\\\'';
>>>     echo $str, PHP_EOL,
>>>         preg_replace($re, $change, $str), PHP_EOL, PHP_EOL;
>>> }
>>>
>>> test("str '' str");     // bug?
>>> test("str \\'\\' str"); // ok
>>> test("'str'");          // ok
>>> test("\'str\'");        // ok
>>>
>>> Expected:
>>> str '' str
>>> str \'\' str
>>> str \'\' str
>>> str \'\' str
>>> 'str'
>>> \'str\'
>>> \'str\'
>>> \'str\'
>>>
>>> Result:
>>> str '' str
>>> str \'' str
>>> str \'\' str
>>> str \'\' str
>>> 'str'
>>> \'str\'
>>> \'str\'
>>> \'str\'
>>>
>>> Martin Scotta

Did no one see why the regex was wrong?

RegexBuddy (a Windows app) explains regexes VERY VERY well.

--
Richard Quadling
Twitter : EE : Zend
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] preg_replace does not replace all occurrences
On 15 March 2011 10:32, Richard Quadling rquadl...@gmail.com wrote:
> On 14 March 2011 20:36, Hannes Landeholm landeh...@gmail.com wrote:
>> [snip]
>
> Did no one see why the regex was wrong?
>
> RegexBuddy (a Windows app) explains regexes VERY VERY well.

The important bit (where the problem lies with regard to the regex) is ...

    Match a single character NOT present in the list below «[^\\\\]»
        A \ character «\\»
        A \ character «\\»

The issue is the word _single_.

--
Richard Quadling
Re: [PHP-DEV] preg_replace does not replace all occurrences
>> static $re = '/(^|[^\\\\])\'/';
>
> Did no one see why the regex was wrong?

I saw what the regex was. Unlike you, I didn't think it was 'wrong'. Once you unescape the characters in the PHP single-quoted string above (where two backslashes count as one, and backslash-quote counts as a quote), the actual pattern that reaches the preg_replace function is:

    /(^|[^\\])'/

> RegexBuddy (a Windows app) explains regexes VERY VERY well.

What kind of patterns? Does it support PCRE ones?

> The important bit (where the problem lies with regard to the regex) is ...
>
>     Match a single character NOT present in the list below «[^\\\\]»
>         A \ character «\\»
>         A \ character «\\»

This is not the case.

1. As above, the pattern reaching preg_replace is /(^|[^\\])'/

2. PCRE, unlike many other regular expression implementations, allows backslash-escaping inside character classes (square brackets). So the doubled backslash actually counts as only a single backslash character to be excluded from the set of characters the atom will match.

There is no error here. (And even if there were two backslashes being excluded, of course, it wouldn't hurt anything or change the meaning of the pattern.)

> The issue is the word _single_.

I don't think anybody thought otherwise.

The problem was that, to a casual observer, the pattern seems to mean "a quote which doesn't already have a backslash before it". I believe this was its intent. (And the replacement added the 'missing' backslash.)

But the pattern doesn't mean that. It actually means "a character which isn't a backslash, followed by a quote". This is subtly different, and it's most noticeable when two quotes follow each other in the subject string. In

    str''str

the pattern first matches r' (non-backslash followed by quote), and then it keeps searching from that point, i.e. it searches 'str. Since this isn't the beginning of the string, and there is no quote following a non-backslash character, there are no further matches.

Now, here is a pattern which actually means "a quote which doesn't already have a backslash before it". It achieves this by means of a lookbehind assertion, which, even when searching the string after the first match, 'str, still 'looks back' on the earlier part of the string to recognise that the second quote is not preceded by a backslash, and so matches a second time:

    /(^|(?<!\\))'/

As a PHP single-quoted string this is:

    '/(^|(?<!\\\\))\'/'

Hope this helps,

Ben.
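The difference between the two patterns can be made visible by listing where each one matches. The following is a minimal sketch, not code from the thread: match_offsets() is a hypothetical helper, and the patterns are written as nowdoc strings (PHP 5.3+) so that they appear exactly as PCRE receives them.

```php
<?php
// Hypothetical helper (not from the thread): returns "offset:text" for
// every match of $pattern in $subject.
function match_offsets($pattern, $subject) {
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    $out = array();
    foreach ($m[0] as $hit) {
        // $hit[0] is the matched text, $hit[1] its byte offset.
        $out[] = $hit[1] . ':' . $hit[0];
    }
    return $out;
}

// The original pattern: the character before the quote is consumed.
$consuming = <<<'RE'
/(^|[^\\])'/
RE;

// The lookbehind pattern: the preceding character is only inspected.
$lookbehind = <<<'RE'
/(^|(?<!\\))'/
RE;

$subject = "str''str";
print_r(match_offsets($consuming, $subject));  // one match:   2:r'
print_r(match_offsets($lookbehind, $subject)); // two matches: 3:' and 4:'
```

With the consuming pattern, the first quote is used up as part of the match r', so the second quote has nothing left in front of it to pair with. With the lookbehind, each quote is matched on its own and the preceding character remains available for inspection.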
Re: [PHP-DEV] preg_replace does not replace all occurrences
> Now, here is a pattern which actually means "a quote which doesn't
> already have a backslash before it", which is achieved by means of a
> lookbehind assertion [...]:
>
>     /(^|(?<!\\))'/
>
> As a PHP single-quoted string this is:
>
>     '/(^|(?<!\\\\))\'/'

And I should mention, as Martin did, that this actually isn't a good idea. There are better/safer ways to escape quotes. In particular, consider how the quote in this subject string

    str\\'; delete from users;

will not be escaped, because it is preceded by *two* backslashes. To match more carefully, you have to 'eat backslashes in pairs'.

Someone gave a pattern that attempted to do something like that in an earlier post, too.

Ben.
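One way to 'eat backslashes in pairs' is sketched below. This is an illustration only, not the pattern from the earlier post, and escape_unescaped_quotes() is a hypothetical name. It assumes that a quote counts as already escaped exactly when an odd number of backslashes precedes it. As noted above, purpose-built escaping functions remain the safer choice for real input.

```php
<?php
// Hypothetical helper (not from the thread): adds a backslash before
// every quote that is preceded by an even number of backslashes.
function escape_unescaped_quotes($str) {
    // The pattern as PCRE receives it (nowdoc, so no PHP-level escaping):
    //   (?<!\\)      we are at the start of a run of backslashes
    //   ((?:\\\\)*)  zero or more complete pairs of backslashes, captured
    //   '            the quote itself
    $pattern = <<<'RE'
/(?<!\\)((?:\\\\)*)'/
RE;
    // The replacement text is $1\' : the pairs are put back unchanged,
    // followed by a backslash and the quote.
    return preg_replace($pattern, '$1\\\'', $str);
}

echo escape_unescaped_quotes("str '' str"), PHP_EOL; // str \'\' str
echo escape_unescaped_quotes("str\\'"), PHP_EOL;     // str\'   (unchanged)
echo escape_unescaped_quotes("str\\\\'"), PHP_EOL;   // str\\\' (quote now escaped)
```

The lookbehind anchors the match at the beginning of a backslash run, so a run of odd length can never be split into "pairs plus a quote" and is left alone.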
Re: [PHP-DEV] preg_replace does not replace all occurrences
On 03/15/11 12:41, Ben Schmidt wrote:
> [snip]
>
> Hope this helps,
>
> Ben.

As an outsider in this discussion, I'd just like to applaud you for one of the best, most in-depth, most patient and most thorough explanations I have ever seen on a mailing list.

Dave
Re: [PHP-DEV] preg_replace does not replace all occurrences
On 15 March 2011 12:41, Ben Schmidt mail_ben_schm...@yahoo.com.au wrote:
>>> static $re = '/(^|[^\\\\])\'/';
>>
>> Did no one see why the regex was wrong?
>
> I saw what the regex was. I didn't think like you that it was 'wrong'.
> Once you unescape the characters in the PHP single-quoted string above
> (where two backslashes count as one, and backslash-quote counts as a
> quote), the actual pattern that reaches the preg_replace function is:
>
>     /(^|[^\\])'/
>
>> RegexBuddy (a Windows app) explains regexes VERY VERY well.
>
> What kind of patterns? Does it support PCRE ones?

Yep, and MANY other flavours (C#, C++, Delphi, Groovy, Java, Javascript, MySQL, ...)

> [snip]
>
> Hope this helps,
>
> Ben.

If I say ...

    <?php echo '/(^|[^\\\\])\'/'; ?>

I get ...

    /(^|[^\\])'/

which is explained as ...

    (^|[^\\])'

    Options: case insensitive; ^ and $ match at line breaks

    Match the regular expression below and capture its match into backreference number 1 «(^|[^\\])»
        Match either the regular expression below (attempting the next alternative only if this one fails) «^»
            Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
        Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^\\]»
            Match any character that is NOT a \ character «[^\\]»
    Match the character “'” literally «'»

And that certainly makes a LOT more sense.

Decoding regexes and handling the escaping needed for the language is a real headache sometimes. Just imagine creating regex code for use by client-side Javascript using PHP. 8 \ in a row for a single \ wouldn't be impossible.

Sorry for the confusion.

--
Richard Quadling
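The escaping layers described above can be checked directly. The sketch below (variable names are illustrative) writes the same pattern once as a single-quoted string and once as a nowdoc, which PHP 5.3+ passes through without any escape processing, and confirms that both yield the same text for PCRE.

```php
<?php
// Layer 1 is PHP's single-quoted string syntax: \\ becomes \ and
// \' becomes ', so four backslashes in the source become two.
$from_single_quotes = '/(^|[^\\\\])\'/';

// A nowdoc skips that layer entirely: its body is taken literally.
$from_nowdoc = <<<'RE'
/(^|[^\\])'/
RE;

// Layer 2 is PCRE itself, which reads \\ inside the character class
// as one literal backslash.
echo $from_single_quotes, PHP_EOL;               // prints /(^|[^\\])'/
var_dump($from_single_quotes === $from_nowdoc);  // bool(true)
```

Echoing a pattern before handing it to preg_replace, as done above, is a quick way to see what the regex engine will actually receive.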
Re: [PHP-DEV] preg_replace does not replace all occurrences
On 14 March 2011 15:18, Martin Scotta martinsco...@gmail.com wrote:
> <?php
> function test($str) {
>     static $re = '/(^|[^\\\\])\'/';
>     static $change = '$1\\\'';
>     echo $str, PHP_EOL,
>         preg_replace($re, $change, $str), PHP_EOL, PHP_EOL;
> }
>
> test("str '' str");     // bug?
> test("str \\'\\' str"); // ok
> test("'str'");          // ok
> test("\'str\'");        // ok

Your regex is ...

    (^|[^\\\\])'

    Options: case insensitive; ^ and $ match at line breaks

    Match the regular expression below and capture its match into backreference number 1 «(^|[^\\\\])»
        Match either the regular expression below (attempting the next alternative only if this one fails) «^»
            Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
        Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^\\\\]»
            Match a single character NOT present in the list below «[^\\\\]»
                A \ character «\\»
                A \ character «\\»
    Match the character “'” literally «'»

I think [^\\\\] is wrong and you want it to be ...

    (?<!\\\\) or (?<!\\{2})

With that, the output is ...

    str '' str
    str \'\' str
    str \'\' str
    str \\'\\' str
    'str'
    \'str\'
    \'str\'
    \\'str\\'

--
Richard Quadling
Re: [PHP-DEV] preg_replace does not replace all occurrences
On 15/03/11 2:18 AM, Martin Scotta wrote:
> I chose the simplest example to show the preg_replace behavior,

You've GOT to be kidding. The SIMPLEST?!

How about an example that doesn't require escaping ALL the interesting characters involved? Here's a modified version that I think is quite a bit simpler:

    <?php
    function test($str) {
        static $re = '/(^|[^a])b/';
        static $change = '$1ab';
        echo $str, PHP_EOL; // input
        echo preg_replace($re, $change, $str), PHP_EOL, PHP_EOL; // output
    }

    test("str bb str");   // bug?
    test("str abab str"); // ok
    test("b str b");      // ok
    test("ab str ab");    // ok
    ?>

The way I interpret it, it should put an 'a' before every 'b' that is not already preceded by an 'a'. But the buggy case gives 'str abb str' rather than the expected 'str abab str'.

It does look like a bug to me.

Ben.

> there are better (and safer) ways to escape slash characters.
>
> Anyway, *is this the expected preg_replace behavior?*
>
> Martin
>
>> [snip]
Re: [PHP-DEV] preg_replace does not replace all occurrences
On 15/03/11 5:38 AM, Ben Schmidt wrote:
> On 15/03/11 2:18 AM, Martin Scotta wrote:
>> I chose the simplest example to show the preg_replace behavior,
>
> You've GOT to be kidding. The SIMPLEST?!
>
> [snip]
>
> The way I interpret it, it should put an 'a' before every 'b' that is
> not already preceded by an 'a'. But the buggy case gives 'str abb str'
> rather than the expected 'str abab str'.
>
> It does look like a bug to me.

Actually, no it doesn't. The behaviour is correct.

Matches cannot overlap. Since the character preceding 'b' is part of the match, there is only one match in the string 'str bb str'. The match is ' b'. After that match, the

You actually want an assertion. I think this:

    static $re = '/(^|(?<!a))b/';

Ben.
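The simplified a/b case can be run side by side with both patterns. This is a small sketch rather than code from the thread: add_a() is a hypothetical wrapper, and the two patterns are the ones quoted in the messages above.

```php
<?php
// Hypothetical wrapper (not from the thread): inserts an 'a' before
// each 'b' matched by $pattern, keeping whatever group 1 captured.
function add_a($pattern, $str) {
    return preg_replace($pattern, '$1ab', $str);
}

// Consumes the character before 'b', so adjacent matches would overlap.
$consuming = '/(^|[^a])b/';

// Only asserts that no 'a' precedes 'b'; nothing before 'b' is consumed.
$asserting = '/(^|(?<!a))b/';

echo add_a($consuming, 'str bb str'), PHP_EOL; // str abb str
echo add_a($asserting, 'str bb str'), PHP_EOL; // str abab str
echo add_a($asserting, 'ab str ab'), PHP_EOL;  // ab str ab (unchanged)
```

In the first call the match ' b' uses up the space and the first 'b', so the second 'b' has no unconsumed character in front of it and is skipped. The lookbehind version leaves the preceding character alone, so both occurrences match.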
Re: [PHP-DEV] preg_replace does not replace all occurrences
What is more likely to be wrong? Your understanding of a specific regex pattern (which happens to be full of escapes, making it incredibly hard to read) or the implementation of preg_replace?

~Hannes

On 14 March 2011 16:18, Martin Scotta martinsco...@gmail.com wrote:
> I chose the simplest example to show the preg_replace behavior; there
> are better (and safer) ways to escape slash characters.
>
> Anyway, *is this the expected preg_replace behavior?*
>
> Martin
>
>> [snip]