Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
On 15.12.21 14:15, Gilles Darold wrote: Le 15/12/2021 à 13:41, Peter Eisentraut a écrit : On 03.08.21 19:10, Tom Lane wrote: Gilles Darold writes: Sorry I have missed that, but I'm fine with this implemenation so let's keep the v6 version of the patch and drop this one. Pushed, then. There's still lots of time to tweak the behavior of course. I have a documentation follow-up to this. It seems that these new functions are almost a de facto standard, whereas the SQL-standard functions are not implemented anywhere. I propose the attached patch to update the subsection in the pattern-matching section to give more detail on this and suggest equivalent functions among these newly added ones. What do you think? I'm in favor to apply your changes to documentation. It is a good thing to precise the relation between this implementation of the regex_* functions and the SQL stardard. ok, done
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 15/12/2021 à 13:41, Peter Eisentraut a écrit : On 03.08.21 19:10, Tom Lane wrote: Gilles Darold writes: Sorry I have missed that, but I'm fine with this implemenation so let's keep the v6 version of the patch and drop this one. Pushed, then. There's still lots of time to tweak the behavior of course. I have a documentation follow-up to this. It seems that these new functions are almost a de facto standard, whereas the SQL-standard functions are not implemented anywhere. I propose the attached patch to update the subsection in the pattern-matching section to give more detail on this and suggest equivalent functions among these newly added ones. What do you think? I'm in favor to apply your changes to documentation. It is a good thing to precise the relation between this implementation of the regex_* functions and the SQL stardard. -- Gilles Darold
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
On 03.08.21 19:10, Tom Lane wrote: Gilles Darold writes: Sorry I have missed that, but I'm fine with this implemenation so let's keep the v6 version of the patch and drop this one. Pushed, then. There's still lots of time to tweak the behavior of course. I have a documentation follow-up to this. It seems that these new functions are almost a de facto standard, whereas the SQL-standard functions are not implemented anywhere. I propose the attached patch to update the subsection in the pattern-matching section to give more detail on this and suggest equivalent functions among these newly added ones. What do you think?From a2dbd0e24a30b945a5d641ed773dc44f5e6b50c1 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 15 Dec 2021 11:02:59 +0100 Subject: [PATCH] doc: More documentation on regular expressions and SQL standard --- doc/src/sgml/func.sgml | 91 +--- src/backend/catalog/sql_features.txt | 10 +-- 2 files changed, 88 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5801299b27..e58efce586 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -7353,10 +7353,26 @@ Basic Regular Expressions - Differences from XQuery (LIKE_REGEX) + Differences from SQL Standard and XQuery -LIKE_REGEX +LIKE_REGEX + + + +OCCURRENCES_REGEX + + + +POSITION_REGEX + + + +SUBSTRING_REGEX + + + +TRANSLATE_REGEX @@ -7364,16 +7380,75 @@ Differences from XQuery (LIKE_REGEX) - Since SQL:2008, the SQL standard includes - a LIKE_REGEX operator that performs pattern + Since SQL:2008, the SQL standard includes regular expression operators + and functions that performs pattern matching according to the XQuery regular expression - standard. PostgreSQL does not yet - implement this operator, but you can get very similar behavior using - the regexp_match() function, since XQuery - regular expressions are quite close to the ARE syntax described above. + standard: + + LIKE_REGEX + OCCURRENCES_REGEX + POSITION_REGEX + SUBSTRING_REGEX + TRANSLATE_REGEX + + PostgreSQL does not currently implement these + operators and functions. You can get approximately equivalent + functionality in each case as shown in . (Various optional clauses on + both sides have been omitted in this table.) + + + + Regular Expression Functions Equivalencies + + + + +SQL standard +PostgreSQL + + + + + +string LIKE_REGEX pattern +regexp_like(string, pattern) or string ~ pattern + + + +OCCURRENCES_REGEX(pattern IN string +regexp_count(string, pattern) + + + +POSITION_REGEX(pattern IN string +regexp_instr(string, pattern) + + + +SUBSTRING_REGEX(pattern IN string +regexp_substr(string, pattern) + + + +TRANSLATE_REGEX(pattern IN string WITH replacement +regexp_replace(string, pattern, replacement) + + + + + + + Regular expression functions similar to those provided by PostgreSQL are + also available in a number of other SQL implementations, whereas the + SQL-standard functions are not as widely implemented. Some of the + details of the regular expression syntax will likely differ in each + implementation. + The SQL-standard operators and functions use XQuery regular expressions, + which are quite close to the ARE syntax described above. Notable differences between the existing POSIX-based regular-expression feature and XQuery regular expressions include: diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index 9f424216e2..b8a78f4d41 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -323,11 +323,11 @@ F821 Local table references NO F831 Full cursor update NO F831 Full cursor update 01 Updatable scrollable cursorsNO F831 Full cursor update 02 Updatable ordered cursors NO -F841 LIKE_REGEX predicateNO -F842 OCCURRENCES_REGEX function NO -F843 POSITION_REGEX function NO -F844 SUBSTRING_REGEX functionNO -F845 TRANSLATE_REGEX functionNO +F841 LIKE_REGEX predicateNO consider regexp_like() +F842 OCCURRENCES_REGEX function NO consider regexp_matches() +F843 POSITION_REGEX function NO consider regexp_instr() +F844 SUBSTRING_REGEX functionNO
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Gilles Darold writes: > Sorry I have missed that, but I'm fine with this implemenation so let's > keep the v6 version of the patch and drop this one. Pushed, then. There's still lots of time to tweak the behavior of course. regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 03/08/2021 à 15:39, Tom Lane a écrit : Erik Rijkers writes: On 8/3/21 1:26 PM, Gilles Darold wrote: Le 03/08/2021 à 11:45, Gilles Darold a écrit : Actually I just found that the regexp_like() function doesn't support the start parameter which is something we should support. I saw that Oracle do not support it but DB2 does and I think we should also support it. I will post a new version of the patch once it is done. +1 I for one am in favor of this 'start'-argument addition. Slightly harder usage, but more precise manipulation. As I said upthread, I am *not* in favor of making those DB2 additions. We do not need to create ambiguities around those functions like the one we have for regexp_replace. If Oracle doesn't have those options, why do we need them? Sorry I have missed that, but I'm fine with this implemenation so let's keep the v6 version of the patch and drop this one. -- Gilles Darold
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Erik Rijkers writes: > On 8/3/21 1:26 PM, Gilles Darold wrote: >> Le 03/08/2021 à 11:45, Gilles Darold a écrit : >>> Actually I just found that the regexp_like() function doesn't support >>> the start parameter which is something we should support. I saw that >>> Oracle do not support it but DB2 does and I think we should also >>> support it. I will post a new version of the patch once it is done. > +1 > I for one am in favor of this 'start'-argument addition. Slightly > harder usage, but more precise manipulation. As I said upthread, I am *not* in favor of making those DB2 additions. We do not need to create ambiguities around those functions like the one we have for regexp_replace. If Oracle doesn't have those options, why do we need them? regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
On 8/3/21 1:26 PM, Gilles Darold wrote: Le 03/08/2021 à 11:45, Gilles Darold a écrit : Actually I just found that the regexp_like() function doesn't support the start parameter which is something we should support. I saw that Oracle do not support it but DB2 does and I think we should also support it. I will post a new version of the patch once it is done. +1 I for one am in favor of this 'start'-argument addition. Slightly harder usage, but more precise manipulation. Erik Rijkers Here is a new version of the patch that adds the start parameter to regexp_like() function but while I'm adding support to this parameter it become less obvious for me that we should implement it. However feel free to not use this version if you think that adding the start parameter has no real interest. Best regards,
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 03/08/2021 à 11:45, Gilles Darold a écrit : Actually I just found that the regexp_like() function doesn't support the start parameter which is something we should support. I saw that Oracle do not support it but DB2 does and I think we should also support it. I will post a new version of the patch once it is done. Here is a new version of the patch that adds the start parameter to regexp_like() function but while I'm adding support to this parameter it become less obvious for me that we should implement it. However feel free to not use this version if you think that adding the start parameter has no real interest. Best regards, -- Gilles Darold diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a5b6adc4bb..2bc9060e47 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3108,6 +3108,80 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text + [, start integer + [, flags text ] ] ) +integer + + +Returns the number of times the POSIX regular +expression pattern matches in +the string; see +. + + +regexp_count('123456789012', '\d\d\d', 2) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text + [, start integer + [, N integer + [, endoption integer + [, flags text + [, subexpr integer ] ] ] ] ] ) +integer + + +Returns the position within string where +the N'th match of the POSIX regular +expression pattern occurs, or zero if there is +no such match; see . + + +regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i') +3 + + +regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i', 2) +5 + + + + + + + regexp_like + +regexp_like ( string text, pattern text + [, start integer + [, flags text ] ] ) +boolean + + +Checks whether a match of the POSIX regular +expression pattern occurs +within string starting at +the start'th character; see +. + + +regexp_like('Hello World', 'world$', 1, 'i') +t + + + @@ -3117,8 +3191,9 @@ repeat('Pg', 4) PgPgPgPg text[] -Returns captured substrings resulting from the first match of a POSIX -regular expression to the string; see +Returns captured substrings resulting from the first match of the +POSIX regular expression pattern to +the string; see . @@ -3136,10 +3211,11 @@ repeat('Pg', 4) PgPgPgPg setof text[] -Returns captured substrings resulting from the first match of a -POSIX regular expression to the string, -or multiple matches if the g flag is used; -see . +Returns captured substrings resulting from the first match of the +POSIX regular expression pattern to +the string, or all matches if +the g flag is used; see +. regexp_matches('foobarbequebaz', 'ba.', 'g') @@ -3156,14 +3232,16 @@ repeat('Pg', 4) PgPgPgPg regexp_replace -regexp_replace ( string text, pattern text, replacement text [, flags text ] ) +regexp_replace ( string text, pattern text, replacement text + [, start integer ] + [, flags text ] ) text -Replaces substrings resulting from the first match of a -POSIX regular expression, or multiple substring matches -if the g flag is used; see . +Replaces the substring that is the first match to the POSIX +regular expression pattern, or all matches +if the g flag is used; see +. regexp_replace('Thomas', '.[mN]a.', 'M') @@ -3171,6 +3249,26 @@ repeat('Pg', 4) PgPgPgPg + + +regexp_replace ( string text, pattern text, replacement text, + start integer, + N integer + [, flags text ] ) +text + + +Replaces the substring that is the N'th +match to the POSIX regular expression pattern, +or all matches if N is zero; see +. + + +regexp_replace('Thomas', '.', 'X', 3, 2) +ThoXas + + + @@ -3213,6 +3311,35 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_substr + +regexp_substr ( string text, pattern text + [, start integer + [, N integer + [, flags text + [, subexpr
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 02/08/2021 à 23:22, Gilles Darold a écrit : Le 02/08/2021 à 01:21, Tom Lane a écrit : Gilles Darold writes: [ v5-0001-regexp-foo-functions.patch ] I've gone through this whole patch now, and found quite a lot that I did not like. In no particular order: * Wrapping parentheses around the user's regexp doesn't work. It can turn an invalid regexp into a valid one: for example 'a)(b' should draw a syntax error. With this patch, no error would be thrown, but the "outer" parens wouldn't do what you expected. Worse, it can turn a valid regexp into an invalid one: the metasyntax options described in 9.7.3.4 only work at the start of the regexp. So we have to handle whole-regexp cases honestly rather than trying to turn them into an instance of the parenthesized-subexpression case. * You did a lot of things quite inefficiently, apparently to avoid touching any existing code. I think it's better to extend setup_regexp_matches() and replace_text_regexp() a little bit so that they can support the behaviors these new functions need. In both of them, it's absolutely trivial to allow a search start position to be passed in; and it doesn't take much to teach replace_text_regexp() to replace only the N'th match. * Speaking of N'th, there is not much of anything that I like about Oracle's terminology for the function arguments, and I don't think we ought to adopt it. If we're documenting the functions as processing the "N'th match", it seems to me to be natural to call the parameter "N" not "occurrence". Speaking of the "occurrence'th occurrence" is just silly, not to mention long and easy to misspell. Likewise, "position" is a horribly vague term for the search start position; it could be interpreted to mean several other things. "start" seems much better. "return_opt" is likewise awfully unclear. I went with "endoption" below, though I could be talked into something else. The only one of Oracle's choices that I like is "subexpr" for subexpression number ... but you went with DB2's rather vague "group" instead. I don't want to use their "capture group" terminology, because that appears nowhere else in our documentation. Our existing terminology is "parenthesized subexpression", which seems fine to me (and also agrees with Oracle's docs). * I spent a lot of time on the docs too. A lot of the syntax specs were wrong (where you put the brackets matters), many of the examples seemed confusingly overcomplicated, and the text explanations needed copy-editing. * Also, the regression tests seemed misguided. This patch is not responsible for testing the regexp engine as such; we have tests elsewhere that do that. So I don't think we need complex regexps here. We just need to verify that the parameters of these functions act properly, and check their error cases. That can be done much more quickly and straightforwardly than what you had. So here's a revised version that I like better. I think this is pretty nearly committable, aside from the question of whether a too-large subexpression number should be an error or not. Thanks a lot for the patch improvement and the guidance. I have read the patch and I agree with your choices I think I was too much trying to mimic the oraclisms. I don't think we should take care of the too-large subexpression number, the regexp writer should always test its regular expression and also this will not prevent him to chose the wrong capture group number but just a non existing one. Actually I just found that the regexp_like() function doesn't support the start parameter which is something we should support. I saw that Oracle do not support it but DB2 does and I think we should also support it. I will post a new version of the patch once it is done. Best regards, -- Gilles Darold
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 02/08/2021 à 01:21, Tom Lane a écrit : > Gilles Darold writes: >> [ v5-0001-regexp-foo-functions.patch ] > I've gone through this whole patch now, and found quite a lot that I did > not like. In no particular order: > > * Wrapping parentheses around the user's regexp doesn't work. It can > turn an invalid regexp into a valid one: for example 'a)(b' should draw > a syntax error. With this patch, no error would be thrown, but the > "outer" parens wouldn't do what you expected. Worse, it can turn a > valid regexp into an invalid one: the metasyntax options described in > 9.7.3.4 only work at the start of the regexp. So we have to handle > whole-regexp cases honestly rather than trying to turn them into an > instance of the parenthesized-subexpression case. > > * You did a lot of things quite inefficiently, apparently to avoid > touching any existing code. I think it's better to extend > setup_regexp_matches() and replace_text_regexp() a little bit so that > they can support the behaviors these new functions need. In both of > them, it's absolutely trivial to allow a search start position to be > passed in; and it doesn't take much to teach replace_text_regexp() > to replace only the N'th match. > > * Speaking of N'th, there is not much of anything that I like > about Oracle's terminology for the function arguments, and I don't > think we ought to adopt it. If we're documenting the functions as > processing the "N'th match", it seems to me to be natural to call > the parameter "N" not "occurrence". Speaking of the "occurrence'th > occurrence" is just silly, not to mention long and easy to misspell. > Likewise, "position" is a horribly vague term for the search start > position; it could be interpreted to mean several other things. > "start" seems much better. "return_opt" is likewise awfully unclear. > I went with "endoption" below, though I could be talked into something > else. The only one of Oracle's choices that I like is "subexpr" for > subexpression number ... but you went with DB2's rather vague "group" > instead. I don't want to use their "capture group" terminology, > because that appears nowhere else in our documentation. Our existing > terminology is "parenthesized subexpression", which seems fine to me > (and also agrees with Oracle's docs). > > * I spent a lot of time on the docs too. A lot of the syntax specs > were wrong (where you put the brackets matters), many of the examples > seemed confusingly overcomplicated, and the text explanations needed > copy-editing. > > * Also, the regression tests seemed misguided. This patch is not > responsible for testing the regexp engine as such; we have tests > elsewhere that do that. So I don't think we need complex regexps > here. We just need to verify that the parameters of these functions > act properly, and check their error cases. That can be done much > more quickly and straightforwardly than what you had. > > > So here's a revised version that I like better. I think this > is pretty nearly committable, aside from the question of whether > a too-large subexpression number should be an error or not. Thanks a lot for the patch improvement and the guidance. I have read the patch and I agree with your choices I think I was too much trying to mimic the oraclisms. I don't think we should take care of the too-large subexpression number, the regexp writer should always test its regular expression and also this will not prevent him to chose the wrong capture group number but just a non existing one. Best regards, -- Gilles Darold
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
I wrote: > ... aside from the question of whether > a too-large subexpression number should be an error or not. Oh ... poking around some more, I noticed a very nearby precedent. regexp_replace's replacement string can include \1 to \9 to insert the substring matching the N'th parenthesized subexpression. But if there is no such subexpression, you don't get an error, just an empty insertion. So that seems like an argument for not throwing an error for an out-of-range subexpr parameter. regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Gilles Darold writes: > [ v5-0001-regexp-foo-functions.patch ] I've gone through this whole patch now, and found quite a lot that I did not like. In no particular order: * Wrapping parentheses around the user's regexp doesn't work. It can turn an invalid regexp into a valid one: for example 'a)(b' should draw a syntax error. With this patch, no error would be thrown, but the "outer" parens wouldn't do what you expected. Worse, it can turn a valid regexp into an invalid one: the metasyntax options described in 9.7.3.4 only work at the start of the regexp. So we have to handle whole-regexp cases honestly rather than trying to turn them into an instance of the parenthesized-subexpression case. * You did a lot of things quite inefficiently, apparently to avoid touching any existing code. I think it's better to extend setup_regexp_matches() and replace_text_regexp() a little bit so that they can support the behaviors these new functions need. In both of them, it's absolutely trivial to allow a search start position to be passed in; and it doesn't take much to teach replace_text_regexp() to replace only the N'th match. * Speaking of N'th, there is not much of anything that I like about Oracle's terminology for the function arguments, and I don't think we ought to adopt it. If we're documenting the functions as processing the "N'th match", it seems to me to be natural to call the parameter "N" not "occurrence". Speaking of the "occurrence'th occurrence" is just silly, not to mention long and easy to misspell. Likewise, "position" is a horribly vague term for the search start position; it could be interpreted to mean several other things. "start" seems much better. "return_opt" is likewise awfully unclear. I went with "endoption" below, though I could be talked into something else. The only one of Oracle's choices that I like is "subexpr" for subexpression number ... but you went with DB2's rather vague "group" instead. I don't want to use their "capture group" terminology, because that appears nowhere else in our documentation. Our existing terminology is "parenthesized subexpression", which seems fine to me (and also agrees with Oracle's docs). * I spent a lot of time on the docs too. A lot of the syntax specs were wrong (where you put the brackets matters), many of the examples seemed confusingly overcomplicated, and the text explanations needed copy-editing. * Also, the regression tests seemed misguided. This patch is not responsible for testing the regexp engine as such; we have tests elsewhere that do that. So I don't think we need complex regexps here. We just need to verify that the parameters of these functions act properly, and check their error cases. That can be done much more quickly and straightforwardly than what you had. So here's a revised version that I like better. I think this is pretty nearly committable, aside from the question of whether a too-large subexpression number should be an error or not. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a5b6adc4bb..80aac4965e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3108,6 +3108,78 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text + [, start integer + [, flags text ] ] ) +integer + + +Returns the number of times the POSIX regular +expression pattern matches in +the string; see +. + + +regexp_count('123456789012', '\d\d\d', 2) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text + [, start integer + [, N integer + [, endoption integer + [, flags text + [, subexpr integer ] ] ] ] ] ) +integer + + +Returns the position within string where +the N'th match of the POSIX regular +expression pattern occurs, or zero if there is +no such match; see . + + +regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i') +3 + + +regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i', 2) +5 + + + + + + + regexp_like + +regexp_like ( string text, pattern text + [, flags text ] ) +boolean + + +Checks whether a match of the POSIX regular +expression pattern occurs +within string; see +. + + +regexp_like('Hello World', 'world$', 'i') +t + + + @@ -3117,8 +3189,9 @@ repeat('Pg', 4) PgPgPgPg text[] -Returns captured substrings resulting from the first match of a POSIX -regular expression to the string; see +
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 01/08/2021 à 19:23, Tom Lane a écrit : > I've been working through this patch, and trying to verify > compatibility against Oracle and DB2, and I see some points that need > discussion or at least recording for the archives. > > * In Oracle, while the documentation for regexp_instr says that > return_option should only be 0 or 1, experimentation with sqlfiddle > shows that any nonzero value is silently treated as 1. The patch > raises an error for other values, which I think is a good idea. > (IBM's docs say that DB2 raises an error too, though I can't test > that.) We don't need to be bug-compatible to that extent. > > * What should happen when the subexpression/capture group number of > regexp_instr or regexp_substr exceeds the number of parenthesized > subexpressions of the regexp? Oracle silently returns a no-match > result (0 or NULL), as does this patch. However, IBM's docs say > that DB2 raises an error. I'm inclined to think that this is > likewise taking bug-compatibility too far, and that we should > raise an error like DB2. There are clearly cases where throwing > an error would help debug a faulty call, while I'm less clear on > a use-case where not throwing an error would be useful. > > * IBM's docs say that both regexp_count and regexp_like have > arguments "string, pattern [, start] [, flags]" --- that is, > each of start and flags can be independently specified or omitted. > The patch follows Oracle, which has no start option for > regexp_like, and where you can't write flags for regexp_count > without writing start. This is fine by me, because doing these > like DB2 would introduce the same which-argument-is-this issues > as we're being forced to cope with for regexp_replace. I don't > think we need to accept ambiguity in these cases too. But it's > worth memorializing this decision in the thread. > > * The patch has most of these functions silently ignoring the 'g' > flag, but I think they should raise errors instead. Oracle doesn't > accept a 'g' flag for these, so why should we? The only case where > that logic doesn't hold is regexp_replace, because depending on which > syntax you use the 'g' flag might or might not be meaningful. So > for regexp_replace, I'd vote for silently ignoring 'g' if the > occurrence-number parameter is given, while honoring it if not. > > I've already made changes in my local copy per the last item, > but I've not done anything about throwing errors for out-of-range > subexpression numbers. Anybody have an opinion about that one? I thought about this while I was implementing the functions and chose to not throw an error because of the Oracle behavior and also with others regular expression implementation. For example in Perl there is no error: $ perl -e '$str="hello world"; $str =~ s/(l)/$20/; print "$str\n";' helo world Usually a regular expression is always tested by its creator to be sure that this the right one and that it does what is expected. But I agree that it could help the writer to debug its RE. Also if I recall well Oracle and DB2 limit the number of capture groups back references from \1 to \9 for Oracle and \0 to \9 for DB2. I have chosen to not apply this limit, I don't see the interest of such a limitation. -- Gilles Darold http://www.darold.net/
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 30/07/2021 à 23:38, Tom Lane a écrit : > Gilles Darold writes: >> Le 26/07/2021 à 21:56, Tom Lane a écrit : >>> I'm inclined to just drop the regexp_replace additions. I don't think >>> that the extra parameters Oracle provides here are especially useful. >>> They're definitely not useful enough to justify creating compatibility >>> hazards for. >> I would not say that being able to replace the Nth occurrence of a >> pattern matching is not useful but i agree that this is not a common >> case with replacement. Both Oracle [1] and IBM DB2 [2] propose this form >> and I have though that we can not have compatibility issues because of >> the different data type at the 4th parameter. > Well, here's an example of the potential issues: > > [...] Thanks for pointing me this case, I did not think that the prepared statement could lead to this confusion. >> Anyway, maybe we can just >> rename the function even if I would prefer that regexp_replace() be >> extended. For example: >> regexp_replace(source, pattern, replacement [, flags ]); >> regexp_substitute(source, pattern, replacement [, position ] [, >> occurrence ] [, flags ]); > Hmm. Of course the entire selling point of this patch seems to be > bug-compatibility with Oracle, so using different names is largely > defeating the point :-( > > Maybe we should just hold our noses and do it. The point that > you'd get a recognizable failure if the wrong function were chosen > reassures me a little bit. We've seen a lot of cases where this > sort of ambiguity results in the system just silently doing something > different from what you expected, and I was afraid that that could > happen here. I join a new version of the patch that include a check of the option parameter in the basic form of regexp_replace() and return an error in ambiguous cases. PREPARE rr AS SELECT regexp_replace('healthy, wealthy, and wise','(\w+)thy', '\1ish', $1); EXECUTE rr(1); ERROR: ambiguous use of the option parameter in regex_replace(), value: 1 HINT: you might set the occurrence parameter to force the use of the extended form of regex_replace() This is done by checking if the option parameter value is an integer and throw the error in this case. I don't think of anything better. Best regards, -- Gilles Darold diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a5b6adc4bb..02d1f72e1e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3108,6 +3108,66 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text [, position integer ] [, flags text ] ) +integer + + +Returns the number of times a pattern occurs for a match of a POSIX +regular expression to the string; see +. + + +regexp_count('123456789012', '\d{3}', 3) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text [, position integer ] [, occurence integer ] [, returnopt integer ] [, flags text ] [, group integer ] ) +integer + + + Returns the position within string where the + match of a POSIX regular expression occurs. It returns an integer + indicating the beginning or ending position of the matched substring, + depending on the value of the returnopt argument + (default beginning). If no match is found the function returns 0; + see . + + +regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4) +7 + + + + + + + regexp_like + +regexp_like ( string text, pattern text [, flags text ] ) +boolean + + +Evaluates the existence of a match to a POSIX regular expression +in string; see . + + +regexp_like('Hello'||chr(10)||'world', '^world$', 'm') +t + + + + @@ -3156,7 +3216,7 @@ repeat('Pg', 4) PgPgPgPg regexp_replace -regexp_replace ( string text, pattern text, replacement text [, flags text ] ) +regexp_replace ( string text, pattern text, replacement text [, position integer ] [, occurence integer ] [, flags text ] ) text @@ -3171,6 +3231,24 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_substr + +regexp_substr ( string text, pattern text [, position integer ] [, occurence integer ] [, flags text ] [, group integer ] ) +text + + + Return the substring within string corresponding to the + match of a POSIX regular expression; see . + + +regexp_substr('1234567890 1234557890', '(123)(4(5[56])(78))', 1, 2, 'i', 3) +55 + + +
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
I've been working through this patch, and trying to verify compatibility against Oracle and DB2, and I see some points that need discussion or at least recording for the archives. * In Oracle, while the documentation for regexp_instr says that return_option should only be 0 or 1, experimentation with sqlfiddle shows that any nonzero value is silently treated as 1. The patch raises an error for other values, which I think is a good idea. (IBM's docs say that DB2 raises an error too, though I can't test that.) We don't need to be bug-compatible to that extent. * What should happen when the subexpression/capture group number of regexp_instr or regexp_substr exceeds the number of parenthesized subexpressions of the regexp? Oracle silently returns a no-match result (0 or NULL), as does this patch. However, IBM's docs say that DB2 raises an error. I'm inclined to think that this is likewise taking bug-compatibility too far, and that we should raise an error like DB2. There are clearly cases where throwing an error would help debug a faulty call, while I'm less clear on a use-case where not throwing an error would be useful. * IBM's docs say that both regexp_count and regexp_like have arguments "string, pattern [, start] [, flags]" --- that is, each of start and flags can be independently specified or omitted. The patch follows Oracle, which has no start option for regexp_like, and where you can't write flags for regexp_count without writing start. This is fine by me, because doing these like DB2 would introduce the same which-argument-is-this issues as we're being forced to cope with for regexp_replace. I don't think we need to accept ambiguity in these cases too. But it's worth memorializing this decision in the thread. * The patch has most of these functions silently ignoring the 'g' flag, but I think they should raise errors instead. Oracle doesn't accept a 'g' flag for these, so why should we? The only case where that logic doesn't hold is regexp_replace, because depending on which syntax you use the 'g' flag might or might not be meaningful. So for regexp_replace, I'd vote for silently ignoring 'g' if the occurrence-number parameter is given, while honoring it if not. I've already made changes in my local copy per the last item, but I've not done anything about throwing errors for out-of-range subexpression numbers. Anybody have an opinion about that one? regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Gilles Darold writes: > Le 26/07/2021 à 21:56, Tom Lane a écrit : >> I'm inclined to just drop the regexp_replace additions. I don't think >> that the extra parameters Oracle provides here are especially useful. >> They're definitely not useful enough to justify creating compatibility >> hazards for. > I would not say that being able to replace the Nth occurrence of a > pattern matching is not useful but i agree that this is not a common > case with replacement. Both Oracle [1] and IBM DB2 [2] propose this form > and I have though that we can not have compatibility issues because of > the different data type at the 4th parameter. Well, here's an example of the potential issues: regression=# create function rr(text,text,text,text) returns text regression-# language sql as $$select 'text'$$; CREATE FUNCTION regression=# create function rr(text,text,text,int4) returns text language sql as $$select 'int4'$$; CREATE FUNCTION regression=# select rr('a','b','c','d'); rr -- text (1 row) regression=# select rr('a','b','c',42); rr -- int4 (1 row) So far so good, but: regression=# prepare rr as select rr('a','b','c',$1); PREPARE regression=# execute rr(12); rr -- text (1 row) So somebody trying to use the 4-parameter Oracle form from, say, JDBC would get bit if they were sloppy about specifying parameter types. The one saving grace is that digits aren't valid regexp flags, so the outcome would be something like regression=# select regexp_replace('a','b','c','12'); ERROR: invalid regular expression option: "1" which'd be less difficult to debug than silent misbehavior. Conversely, if you thought you were passing flags but it somehow got interpreted as a start position, that would fail too: regression=# prepare rri as select rr('a','b','c', $1::int); PREPARE regression=# execute rri('gi'); ERROR: invalid input syntax for type integer: "gi" LINE 1: execute rri('gi'); ^ Still, I bet a lot that we'd see periodic bug reports complaining that it doesn't work. > Anyway, maybe we can just > rename the function even if I would prefer that regexp_replace() be > extended. For example: > regexp_replace(source, pattern, replacement [, flags ]); > regexp_substitute(source, pattern, replacement [, position ] [, > occurrence ] [, flags ]); Hmm. Of course the entire selling point of this patch seems to be bug-compatibility with Oracle, so using different names is largely defeating the point :-( Maybe we should just hold our noses and do it. The point that you'd get a recognizable failure if the wrong function were chosen reassures me a little bit. We've seen a lot of cases where this sort of ambiguity results in the system just silently doing something different from what you expected, and I was afraid that that could happen here. regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 26/07/2021 à 21:56, Tom Lane a écrit : > Gilles Darold writes: >> [ v4-0001-regexp-foo-functions.patch ] > I started to work through this and was distressed to realize that > it's trying to redefine regexp_replace() in an incompatible way. > We already have > > regression=# \df regexp_replace >List of functions >Schema | Name | Result data type | Argument data types | > Type > ++--++-- > pg_catalog | regexp_replace | text | text, text, text | > func > pg_catalog | regexp_replace | text | text, text, text, text | > func > (2 rows) > > The patch proposes to add (among other alternatives) > > +{ oid => '9608', descr => 'replace text using regexp', > + proname => 'regexp_replace', prorettype => 'text', > + proargtypes => 'text text text int4', prosrc => > 'textregexreplace_extended_no_occurrence' }, > > which is going to be impossibly confusing for both humans and machines. > I don't think we should go there. Even if you managed to construct > examples that didn't result in "ambiguous function" failures, that > doesn't mean that ordinary mortals won't get bit that way. > > I'm inclined to just drop the regexp_replace additions. I don't think > that the extra parameters Oracle provides here are especially useful. > They're definitely not useful enough to justify creating compatibility > hazards for. I would not say that being able to replace the Nth occurrence of a pattern matching is not useful but i agree that this is not a common case with replacement. Both Oracle [1] and IBM DB2 [2] propose this form and I have though that we can not have compatibility issues because of the different data type at the 4th parameter. Anyway, maybe we can just rename the function even if I would prefer that regexp_replace() be extended. For example: regexp_replace(source, pattern, replacement [, flags ]); regexp_substitute(source, pattern, replacement [, position ] [, occurrence ] [, flags ]); of course with only 3 parameters the two functions are the same. What do you think about the renaming proposal instead of simply drop the extended form of the function? Best regards, [1] https://docs.oracle.com/database/121/SQLRF/functions163.htm#SQLRF06302 [2] https://www.ibm.com/docs/en/db2oc?topic=functions-regexp-replace -- Gilles Darold http://www.darold.net/
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Gilles Darold writes: > [ v4-0001-regexp-foo-functions.patch ] I started to work through this and was distressed to realize that it's trying to redefine regexp_replace() in an incompatible way. We already have regression=# \df regexp_replace List of functions Schema | Name | Result data type | Argument data types | Type ++--++-- pg_catalog | regexp_replace | text | text, text, text | func pg_catalog | regexp_replace | text | text, text, text, text | func (2 rows) The patch proposes to add (among other alternatives) +{ oid => '9608', descr => 'replace text using regexp', + proname => 'regexp_replace', prorettype => 'text', + proargtypes => 'text text text int4', prosrc => 'textregexreplace_extended_no_occurrence' }, which is going to be impossibly confusing for both humans and machines. I don't think we should go there. Even if you managed to construct examples that didn't result in "ambiguous function" failures, that doesn't mean that ordinary mortals won't get bit that way. I'm inclined to just drop the regexp_replace additions. I don't think that the extra parameters Oracle provides here are especially useful. They're definitely not useful enough to justify creating compatibility hazards for. regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 21/03/2021 à 15:53, Tom Lane a écrit : > Chapman Flack writes: >> If this turns out to be a case of "attached the wrong patch, here's >> the one that does implement foo_regex functions!" then I reserve an >> objection to that. :) >> And the patch renamed. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index fee0561961..36e446cb7b 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3097,6 +3097,66 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text [, position integer ] [, flags text ] ) +integer + + +Returns the number of times a pattern occurs for a match of a POSIX +regular expression to the string; see +. + + +regexp_count('123456789012', '\d{3}', 3) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text [, position integer ] [, occurence integer ] [, returnopt integer ] [, flags text ] [, group integer ] ) +integer + + + Returns the position within string where the + match of a POSIX regular expression occurs. It returns an integer + indicating the beginning or ending position of the matched substring, + depending on the value of the returnopt argument + (default beginning). If no match is found the function returns 0; + see . + + +regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4) +7 + + + + + + + regexp_like + +regexp_like ( string text, pattern text [, flags text ] ) +boolean + + +Evaluates the existence of a match to a POSIX regular expression +in string; see . + + +regexp_like('Hello'||chr(10)||'world', '^world$', 'm') +t + + + + @@ -3144,7 +3204,7 @@ repeat('Pg', 4) PgPgPgPg regexp_replace -regexp_replace ( string text, pattern text, replacement text [, flags text ] ) +regexp_replace ( string text, pattern text, replacement text [, position integer ] [, occurence integer ] [, flags text ] ) text @@ -3157,6 +3217,24 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_substr + +regexp_substr ( string text, pattern text [, position integer ] [, occurence integer ] [, flags text ] [, group integer ] ) +text + + + Return the substring within string corresponding to the + match of a POSIX regular expression; see . + + +regexp_substr('1234567890 1234557890', '(123)(4(5[56])(78))', 1, 2, 'i', 3) +55 + + + @@ -5295,6 +5373,18 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL regexp_split_to_array + +regexp_like + + +regexp_count + + +regexp_instr + + +regexp_substr + lists the available @@ -5451,6 +5541,8 @@ substring('foobar' from 'o(.)b') o It has the syntax regexp_replace(source, pattern, replacement + , position + , occurrence , flags ). The source string is returned unchanged if there is no match to the pattern. If there is a @@ -5464,12 +5556,19 @@ substring('foobar' from 'o(.)b') o substring matching the entire pattern should be inserted. Write \\ if you need to put a literal backslash in the replacement text. - The flags parameter is an optional text - string containing zero or more single-letter flags that change the + pattern is searched in string starting + from an optional position or from the beginning + of source by default. Optional occurrence + parameter indicates which occurrence of pattern in + source should be replaced. Default value for occurrence + is 1, replace only the first occurrence. The flags parameter + is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag i specifies case-insensitive matching, while flag g specifies replacement of each matching - substring rather than only the first one. Supported flags (though - not g) are + substring rather than only the first one. When occurrence + is set flag g has no effect. If occurrence + is set to zero, all occurrences are replaced which is similar to flag g. + Supported flags (though not g) are described in . @@ -5482,6 +5581,10 @@ regexp_replace('foobarbaz', 'b..', 'X', 'g') fooXX regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g') fooXarYXazY +regexp_replace('A PostgreSQL function',
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 21/03/2021 à 15:53, Tom Lane a écrit : > Chapman Flack writes: >> If this turns out to be a case of "attached the wrong patch, here's >> the one that does implement foo_regex functions!" then I reserve an >> objection to that. :) > +1 to that. Just to add a note, I do have some ideas about extending > our regex parser so that it could duplicate the XQuery syntax --- none > of the points we mention in 9.7.3.8 seem insurmountable. I'm not > planning to work on that in the near future, mind you, but I definitely > think that we don't want to paint ourselves into a corner where we've > already implemented the XQuery regex functions with the wrong behavior. > > regards, tom lane I apologize for confusing with the words and phrases I have used. This patch implements the regexp_foo () functions which are available in most RDBMS with the behavior described in the documentation. I have modified the title of the patch in the commitfest to removed wrong use of XQUERY. I don't know too if the other RDBMS respect the XQUERY behavior but for what I've seen for Oracle they are using limited regexp modifiers with sometime not the same letter than PostgreSQL for the same behavior. I have implemented these functions with the Oracle behavior in Orafce [1] with a function that checks the modifiers used. This patch doesn't mimic the Oracle behavior, it use the PostgreSQL behavior with regexp, the one used by regex_replace() and regex_matches(). All regexp modifiers can be used. [1] https://github.com/orafce/orafce/blob/master/orafce--3.14--3.15.sql -- Gilles Darold http://www.darold.net/
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Chapman Flack writes: > If this turns out to be a case of "attached the wrong patch, here's > the one that does implement foo_regex functions!" then I reserve an > objection to that. :) +1 to that. Just to add a note, I do have some ideas about extending our regex parser so that it could duplicate the XQuery syntax --- none of the points we mention in 9.7.3.8 seem insurmountable. I'm not planning to work on that in the near future, mind you, but I definitely think that we don't want to paint ourselves into a corner where we've already implemented the XQuery regex functions with the wrong behavior. regards, tom lane
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
On 03/21/21 09:19, Gilles Darold wrote: >>> On 2021.03.20. 19:48 Gilles Darold wrote: >>> >>> This is a new version of the patch that now implements all the XQUERY >>> regexp functions as described in the standard, minus the differences of >>> PostgerSQL regular expression explain in [1]. >>> >>> The standard SQL describe functions like_regex(), occurrences_regex(), >>> position_regex(), substring_regex() and translate_regex() which >>> correspond to the commonly named functions regexp_like(), >>> regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as >>> reported by Chapman Flack in [2]. All these function are implemented in >>> [v2-0001-xquery-regexp-functions.patch] I quickly looked over this patch preparing to object if it actually purported to implement the ISO foo_regex() named functions without the ISO semantics, but a quick grep reassured me that it doesn't implement any of those functions. It only supplies functions in the alternative, apparently common de facto naming scheme regexp_foo(). To be clear, I think that's the right call. I do not think it would be a good idea to supply functions that have the ISO names but not the specified regex dialect. A set of functions analogous to the ISO ones but differently named and with a different regex dialect seems fine to me, especially if these different names are de facto common, and as far as I can tell, that is what this patch provides. So I have no objection to that. :) It might then be fair to say that the /description/ of the patch as implementing the XQuery-based foo_regex functions isn't quite right, or at least carries a risk of jarring some readers into hasty double-takes on Sunday mornings before coffee. It might be clearer to just mention the close correspondence between the functions in this differently-named set and the corresponding ISO ones. If this turns out to be a case of "attached the wrong patch, here's the one that does implement foo_regex functions!" then I reserve an objection to that. :) Regards, -Chap
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 21/03/2021 à 12:07, e...@xs4all.nl a écrit : >> On 2021.03.20. 19:48 Gilles Darold wrote: >> >> This is a new version of the patch that now implements all the XQUERY >> regexp functions as described in the standard, minus the differences of >> PostgerSQL regular expression explain in [1]. >> >> The standard SQL describe functions like_regex(), occurrences_regex(), >> position_regex(), substring_regex() and translate_regex() which >> correspond to the commonly named functions regexp_like(), >> regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as >> reported by Chapman Flack in [2]. All these function are implemented in >> [v2-0001-xquery-regexp-functions.patch] > Hi, > > Apply, compile and (world)check are fine. I haven't found errors in > functionality. > > I went through the docs, and came up with these changes in func.sgml, and > pg_proc.dat. > > Useful functions - thanks! > > Erik Rijkers Thanks a lot Erik, here is a version of the patch with your corrections. -- Gilles Darold LzLabs GmbH http://www.lzlabs.com/ diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index fee0561961..36e446cb7b 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3097,6 +3097,66 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text [, position integer ] [, flags text ] ) +integer + + +Returns the number of times a pattern occurs for a match of a POSIX +regular expression to the string; see +. + + +regexp_count('123456789012', '\d{3}', 3) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text [, position integer ] [, occurence integer ] [, returnopt integer ] [, flags text ] [, group integer ] ) +integer + + + Returns the position within string where the + match of a POSIX regular expression occurs. It returns an integer + indicating the beginning or ending position of the matched substring, + depending on the value of the returnopt argument + (default beginning). If no match is found the function returns 0; + see . + + +regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4) +7 + + + + + + + regexp_like + +regexp_like ( string text, pattern text [, flags text ] ) +boolean + + +Evaluates the existence of a match to a POSIX regular expression +in string; see . + + +regexp_like('Hello'||chr(10)||'world', '^world$', 'm') +t + + + + @@ -3144,7 +3204,7 @@ repeat('Pg', 4) PgPgPgPg regexp_replace -regexp_replace ( string text, pattern text, replacement text [, flags text ] ) +regexp_replace ( string text, pattern text, replacement text [, position integer ] [, occurence integer ] [, flags text ] ) text @@ -3157,6 +3217,24 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_substr + +regexp_substr ( string text, pattern text [, position integer ] [, occurence integer ] [, flags text ] [, group integer ] ) +text + + + Return the substring within string corresponding to the + match of a POSIX regular expression; see . + + +regexp_substr('1234567890 1234557890', '(123)(4(5[56])(78))', 1, 2, 'i', 3) +55 + + + @@ -5295,6 +5373,18 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL regexp_split_to_array + +regexp_like + + +regexp_count + + +regexp_instr + + +regexp_substr + lists the available @@ -5451,6 +5541,8 @@ substring('foobar' from 'o(.)b') o It has the syntax regexp_replace(source, pattern, replacement + , position + , occurrence , flags ). The source string is returned unchanged if there is no match to the pattern. If there is a @@ -5464,12 +5556,19 @@ substring('foobar' from 'o(.)b') o substring matching the entire pattern should be inserted. Write \\ if you need to put a literal backslash in the replacement text. - The flags parameter is an optional text - string containing zero or more single-letter flags that change the + pattern is searched in string starting + from an optional position or from the beginning + of source by default. Optional occurrence + parameter indicates which occurrence of pattern in + source should be replaced. Default value for occurrence + is 1, replace only the first occurrence. The flags parameter + is an optional text string
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
> On 2021.03.20. 19:48 Gilles Darold wrote: > > This is a new version of the patch that now implements all the XQUERY > regexp functions as described in the standard, minus the differences of > PostgerSQL regular expression explain in [1]. > > The standard SQL describe functions like_regex(), occurrences_regex(), > position_regex(), substring_regex() and translate_regex() which > correspond to the commonly named functions regexp_like(), > regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as > reported by Chapman Flack in [2]. All these function are implemented in > [v2-0001-xquery-regexp-functions.patch] Hi, Apply, compile and (world)check are fine. I haven't found errors in functionality. I went through the docs, and came up with these changes in func.sgml, and pg_proc.dat. Useful functions - thanks! Erik Rijkers--- doc/src/sgml/func.sgml.orig 2021-03-21 03:59:37.884365465 +0100 +++ doc/src/sgml/func.sgml 2021-03-21 11:37:46.880644051 +0100 @@ -3106,7 +3106,7 @@ integer -Return the number of times a pattern occurs for a match of a POSIX +Returns the number of times a pattern occurs for a match of a POSIX regular expression to the string; see . @@ -3125,11 +3125,11 @@ integer - Return the position within string where the + Returns the position within string where the match of a POSIX regular expression occurs. It returns an integer indicating the beginning or ending position of the matched substring, depending on the value of the returnopt argument - (default beginning). If no match is found, then the function returns 0; + (default beginning). If no match is found the function returns 0; see . @@ -3147,12 +3147,12 @@ boolean -Evaluate the existence of a match to a POSIX regular expression +Evaluates the existence of a match to a POSIX regular expression in string; see . regexp_like('Hello'||chr(10)||'world', '^world$', 'm') -3 +t @@ -5773,7 +5773,7 @@ - The regexp_like function evaluate the existence of a match + The regexp_like function evaluates the existence of a match to a POSIX regular expression in string; returns a boolean resulting from matching a POSIX regular expression pattern to a string. It has the syntax regexp_like(string, @@ -5782,7 +5782,7 @@ from the beginning of string. The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. - regexp_count accepts all the flags + regexp_like accepts all the flags shown in . This function is similar to regexp operator ~ when used without flags and similar to operator ~* when @@ -5792,9 +5792,9 @@ Some examples: -SELECT 'found' FROM t1 WHERE regexp_like('Hello'||chr(10)||'world', '^world$', 'm'); - regexp_like --- +SELECT 'found' FROM (values('Hello'||chr(10)||'world') as f(col) WHERE regexp_like(col, '^world$', 'm'); + ?column? +-- found (1 row) @@ -5814,7 +5814,7 @@ regexp_count accepts all the flags shown in . The g flag is forced internally to count all matches. - This function returns 0 if there is no match or the number of match as + This function returns 0 if there is no match or the number of matches as an integer. @@ -5853,17 +5853,17 @@ the position of the character after the occurrence. The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. - regexp_count accepts all the flags + regexp_instr accepts all the flags shown in . The g flag is forced internally to track all matches. For a pattern with capture groups, group is an integer indicating - which capture in pattern is the target of the function. A capture group is a part of the pattern - enclosed in parentheses. Capture groups can be nested. They are numbered in order in which their - left parentheses appear in pattern. If group is zero, then the position + which capture in pattern is the target of the function. A capture group is a part of the pattern + enclosed in parentheses. Capture groups can be nested. They are numbered in the order in which their + left parentheses appear in pattern. If group is zero, then the position of the entire substring that matches the pattern is returned. If pattern - does not have at least group capture group, the function returns zero. + does not have at least group capture groups, the function returns zero. This function returns 0 if there is no match or the starting or ending position - of match as an integer. + of a match as an integer.
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Le 20/03/2021 à 19:48, Gilles Darold a écrit : > > Hi, > > > This is a new version of the patch that now implements all the XQUERY > regexp functions as described in the standard, minus the differences > of PostgerSQL regular expression explain in [1]. > > > The standard SQL describe functions like_regex(), occurrences_regex(), > position_regex(), substring_regex() and translate_regex() which > correspond to the commonly named functions regexp_like(), > regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() > as reported by Chapman Flack in [2]. All these function are > implemented in the patch. Syntax of the functions are: > > > - regexp_like(string, pattern [, flags ]) > > - regexp_count( string, pattern [, position ] [, flags ]) > > - regexp_instr( string, pattern [, position ] [, occurrence ] [, > returnopt ] [, flags ] [, group ]) > > - regexp_substr( string, pattern [, position ] [, occurrence ] [, > flags ] [, group ]) > > - regexp_replace(source, pattern, replacement [, position ] [, > occurrence ] [, flags ]) > > > In addition to previous patch version I have added the regexp()_like > function and extended the existsing regex_replace() function. The > patch documents these functions and adds regression tests for all > functions. I will add it to the commitfest. > > > An other regexp functions regexp_positions() that returns all > occurrences that matched a POSIX regular expression is also developped > by Joel Jacobson, see [2]. This function expands the list of regexp > functions described in XQUERY. > > > [1] > https://www.postgresql.org/docs/13/functions-matching.html#FUNCTIONS-POSIX-REGEXP > > [2] > https://www.postgresql.org/message-id/flat/bfd5-909d-408b-8531-95b32f18d4ab%40www.fastmail.com#3ec8ba658eeabcae2ac6ccca33bd1aed > > I would like to see these functions in PG 14 but it is a bit too late, added to commitfest 2021-07. -- Gilles Darold LzLabs GmbH http://www.lzlabs.com/
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Hi, This is a new version of the patch that now implements all the XQUERY regexp functions as described in the standard, minus the differences of PostgerSQL regular expression explain in [1]. The standard SQL describe functions like_regex(), occurrences_regex(), position_regex(), substring_regex() and translate_regex() which correspond to the commonly named functions regexp_like(), regexp_count(), regexp_instr(), regexp_substr() and regexp_replace() as reported by Chapman Flack in [2]. All these function are implemented in the patch. Syntax of the functions are: - regexp_like(string, pattern [, flags ]) - regexp_count( string, pattern [, position ] [, flags ]) - regexp_instr( string, pattern [, position ] [, occurrence ] [, returnopt ] [, flags ] [, group ]) - regexp_substr( string, pattern [, position ] [, occurrence ] [, flags ] [, group ]) - regexp_replace(source, pattern, replacement [, position ] [, occurrence ] [, flags ]) In addition to previous patch version I have added the regexp()_like function and extended the existsing regex_replace() function. The patch documents these functions and adds regression tests for all functions. I will add it to the commitfest. An other regexp functions regexp_positions() that returns all occurrences that matched a POSIX regular expression is also developped by Joel Jacobson, see [2]. This function expands the list of regexp functions described in XQUERY. [1] https://www.postgresql.org/docs/13/functions-matching.html#FUNCTIONS-POSIX-REGEXP [2] https://www.postgresql.org/message-id/flat/bfd5-909d-408b-8531-95b32f18d4ab%40www.fastmail.com#3ec8ba658eeabcae2ac6ccca33bd1aed -- Gilles Darold LzLabs GmbH http://www.lzlabs.com/ diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index fee0561961..f5f08f1509 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3097,6 +3097,66 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text [, position integer ] [, flags text ] ) +integer + + +Return the number of times a pattern occurs for a match of a POSIX +regular expression to the string; see +. + + +regexp_count('123456789012', '\d{3}', 3) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text [, position integer ] [, occurence integer ] [, returnopt integer ] [, flags text ] [, group integer ] ) +integer + + + Return the position within string where the + match of a POSIX regular expression occurs. It returns an integer + indicating the beginning or ending position of the matched substring, + depending on the value of the returnopt argument + (default beginning). If no match is found, then the function returns 0; + see . + + +regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4) +7 + + + + + + + regexp_like + +regexp_like ( string text, pattern text [, flags text ] ) +boolean + + +Evaluate the existence of a match to a POSIX regular expression +in string; see . + + +regexp_like('Hello'||chr(10)||'world', '^world$', 'm') +3 + + + + @@ -3144,7 +3204,7 @@ repeat('Pg', 4) PgPgPgPg regexp_replace -regexp_replace ( string text, pattern text, replacement text [, flags text ] ) +regexp_replace ( string text, pattern text, replacement text [, position integer ] [, occurence integer ] [, flags text ] ) text @@ -3157,6 +3217,24 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_substr + +regexp_substr ( string text, pattern text [, position integer ] [, occurence integer ] [, flags text ] [, group integer ] ) +text + + + Return the substring within string corresponding to the + match of a POSIX regular expression; see . + + +regexp_substr('1234567890 1234557890', '(123)(4(5[56])(78))', 1, 2, 'i', 3) +55 + + + @@ -5295,6 +5373,18 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL regexp_split_to_array + +regexp_like + + +regexp_count + + +regexp_instr + + +regexp_substr + lists the available @@ -5451,6 +5541,8 @@ substring('foobar' from 'o(.)b') o It has the syntax regexp_replace(source, pattern, replacement + , position + , occurrence , flags ). The source string is returned unchanged if there is no match to the pattern. If there is a @@ -5464,12 +5556,19 @@ substring('foobar'
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
My apologies for the links in the head, the email formatting and the missing patch, I accidently send the email too early. -- Gilles diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index bf99f82149..88e08b40d2 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -3097,6 +3097,47 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_count + +regexp_count ( string text, pattern text [, position integer ] [, flags text ] ) +integer + + +Return the number of times a pattern occurs for a match of a POSIX +regular expression to the string; see +. + + +regexp_count('123456789012', '\d{3}', 3) +3 + + + + + + + regexp_instr + +regexp_instr ( string text, pattern text [, position integer ] [, occurence integer ] [, returnopt integer ] [, flags text ] [, group integer ] ) +integer + + + Return the position within string where the + match of a POSIX regular expression occurs. It returns an integer + indicating the beginning or ending position of the matched substring, + depending on the value of the returnopt argument + (default beginning). If no match is found, then the function returns 0; + see . + + +regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4) +7 + + + @@ -3157,6 +3198,24 @@ repeat('Pg', 4) PgPgPgPg + + + + regexp_substr + +regexp_substr ( string text, pattern text [, position integer ] [, occurence integer ] [, flags text ] [, group integer ] ) +text + + + Return the substring within string corresponding to the + match of a POSIX regular expression; see . + + +regexp_substr('1234567890 1234557890', '(123)(4(5[56])(78))', 1, 2, 'i', 3) +55 + + + @@ -5295,6 +5354,15 @@ substring('foobar' similar '#"o_b#"%' escape '#')NULL regexp_split_to_array + +regexp_count + + +regexp_instr + + +regexp_substr + lists the available @@ -5669,6 +5737,132 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo; in practice. Other software systems such as Perl use similar definitions. + + The regexp_count function returns the number of + captured substring(s) resulting from matching a POSIX regular + expression pattern to a string. It has the syntax regexp_replace( + string, pattern , + position , flags ). + pattern is searched in string starting + from an optional position or from the beginning of string + by default. The flags parameter is an optional text string + containing zero or more single-letter flags that change the function's behavior. + regexp_count accepts all the flags + shown in . + The g flag is forced internally to count all matches. + This function returns 0 if there is no match or the number of match as + an integer. + + + +Some examples: + +SELECT regexp_count('123123123123', '\d{3}', 1); + regexp_count +-- +4 +(1 row) + +SELECT regexp_count('Hello'||CHR(10)||'world!', '^world!$', 1, 'm'); + regexp_count +-- +1 +(1 row) + + + + + The regexp_instr function returns the beginning or ending + position of the matched substring resulting from matching a POSIX regular + expression pattern to a string. It has the syntax regexp_instr( + string, pattern , + position , occurrence + , returnopt + , flags + , group ). + pattern is searched in string starting + from an optional position or from the beginning + of string by default. occurrence + indicates which occurrence of pattern in string + should be searched. When returnopt is set to 0 (default) the function + returns the position of the first character of the occurrence. When set to 1 returns the position + of the character after the occurrence. + The flags parameter is an optional text string + containing zero or more single-letter flags that change the function's behavior. + regexp_count accepts all the flags + shown in . + The g flag is forced internally to track all matches. + For a pattern with capture groups, group is an integer indicating + which capture in pattern is the target of the function. A capture group is a part of the pattern + enclosed in parentheses. Capture groups can be nested. They are numbered in order in which their + left parentheses appear in pattern. If group is zero, then the position + of the entire substring that matches the pattern is returned. If pattern + does not have at least group
[PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/18/adfns/regexp.html#GUID-F14733F3-B943-4BAD-8489-F9704986386B IBM: https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061494.html?pos=2 Z/OS: https://www.ibm.com/support/knowledgecenter/SSEPEK_12.0.0/sqlref/src/tpc/db2z_bif_regexplike.html EDB: https://www.enterprisedb.com/edb-docs/d/edb-postgres-advanced-server/reference/database-compatibility-for-oracle-developers-reference-guide/9.6/Database_Compatibility_for_Oracle_Developers_Reference_Guide.1.098.html Hi, I would like to suggest adding the $subject functions to PostgreSQL. We can do lot of things using regexp_matches() and regexp_replace() but some time it consist on building complex queries that these functions can greatly simplify. Look like all RDBMS that embedded a regexp engine implement these functions (Oracle, DB2, MySQL, etc) but I don't know if they are part of the SQL standard. Probably using regexp_matches() can be enough even if it generates more complex statements but having these functions in PostgreSQL could be useful for users and code coming from theses RDBMS. - REGEXP_COUNT( string text, pattern text, [, position int] [, flags text ] ) -> integer Return the number of times a pattern occurs in a source string after a certain position, default from beginning. It can be implemented in PostgreSQL as a subquery using: SELECT count(*) FROM regexp_matches('A1B2C3', '[A-Z][0-9]', 'g'); -> 3 To support positioning we have to use substr(), for example starting at position 2: SELECT count(*) FROM regexp_matches(substr('A1B2C3', 2), '[A-Z][0-9]'); -> 2 With regexp_count() we can simply use it like this: SELECT regexp_count('A1B2C3', '[A-Z][0-9]'); -> 3 SELECT regexp_count('A1B2C3', '[A-Z][0-9]', 2); -> 2 - REGEXP_INSTR( string text, pattern text, [, position int] [, occurrence int] [, return_opt int ] [, flags text ] [, group int] ) -> integer Return the position in a string for a regular expression pattern. It returns an integer indicating the beginning or ending position of the matched substring, depending on the value of the return_opt argument (default beginning). If no match is found, then the function returns 0. * position: indicates the character where the search should begin. * occurrence: indicates which occurrence of pattern found in string should be search. * return_opt: 0 mean returns the position of the first character of the occurrence, 1 mean returns the position of the character following the occurrence. * flags: regular expression modifiers. * group: indicates which subexpression in pattern is the target of the function. Example: SELECT regexp_instr('1234567890', '(123)(4(56)(78))', 1, 1, 0, 'i', 4); -> 7 to obtain a PostgreSQL equivalent: SELECT position((SELECT (regexp_matches('1234567890', '(123)(4(56)(78))', 'ig'))[4] offset 0 limit 1) IN '1234567890'); - REGEXP_SUBSTR( string text, pattern text, [, position int] [, occurrence int] [, flags text ] [, group int] ) -> text It is similar to regexp_instr(), but instead of returning the position of the substring, it returns the substring itself. Example: SELECT regexp_substr('500 gilles''s street, 38000 Grenoble, FR', ',[^,]+,'); -> , 38000 Grenoble, or with a more complex extraction: SELECT regexp_substr('1234567890', '(123)(4(56)(78))', 1, 1, 'i', 4); -> 78 SELECT regexp_substr('1234567890 1234557890', '(123)(4(5[56])(78))', 1, 2, 'i', 3); -> 55 To obtain the same result for the last example we have to use: SELECT (SELECT * FROM regexp_matches('1234567890 1234557890', '(123)(4(5[56])(78))', 'g') offset 1 limit 2)[3]; I have not implemented the regexp_like() function, it is quite similar than the ~ and ~* operators except that it can also support other modifiers than 'i'. I can implement it easily and add it to the patch if we want to supports all those common functions. - REGEXP_LIKE( string text, pattern text, [, flags text ] ) -> boolean Similar to the LIKE condition, except that it performs regular expression matching instead of the simple pattern matching performed by LIKE. Example: SELECT * FROM t1 WHERE regexp_like(col1, '^d$', 'm'); to obtain a PostgreSQL equivalent: SELECT * FROM t1 WHERE regexp_match (col1, '^d$', 'm' ) IS NOT NULL; There is also a possible extension to regexp_replace() that I have not implemented yet because it need more work than the previous functions. - REGEXP_REPLACE( string text, pattern text, replace_string text, [, position int] [, occurrence int] [, flags text ] ) Extend PostgreSQL regexp_replace() by adding position and occurrence capabilities. The