Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: On Thu, 22 Mar 2007, Tom Lane wrote: I'd vote for making this new code look like the rest of it, to wit hardwire the values. Attached please find a patch which does this. I just realized that the last patch removed all usage of fcinfo in the setup_regexp_matches function, so this version of the patch also removes it as a parameter to that function. -- Think of it! With VLSI we can pack 100 ENIACs in 1 sq. cm.!Index: src/backend/utils/adt/regexp.c === RCS file: /home/jeremyd/local/postgres/cvsuproot/pgsql/src/backend/utils/adt/regexp.c,v retrieving revision 1.70 diff -c -r1.70 regexp.c *** src/backend/utils/adt/regexp.c 20 Mar 2007 05:44:59 - 1.70 --- src/backend/utils/adt/regexp.c 28 Mar 2007 18:57:28 - *** *** 30,35 --- 30,36 #include postgres.h #include access/heapam.h + #include catalog/pg_type.h #include funcapi.h #include regex/regex.h #include utils/builtins.h *** *** 95,106 size_toffset; re_comp_flags flags; - - /* text type info */ - Oid param_type; - int16 typlen; - bool typbyval; - char typalign; } regexp_matches_ctx; typedef struct regexp_split_ctx --- 96,101 *** *** 119,126 static intnum_res = 0;/* # of cached re's */ static cached_re_str re_array[MAX_CACHED_RES];/* cached re's */ ! static regexp_matches_ctx *setup_regexp_matches(FunctionCallInfo fcinfo, ! text *orig_str, text *pattern, text *flags); static ArrayType *perform_regexp_matches(regexp_matches_ctx *matchctx); --- 114,120 static intnum_res = 0;/* # of cached re's */ static cached_re_str re_array[MAX_CACHED_RES];/* cached re's */ ! static regexp_matches_ctx *setup_regexp_matches(text *orig_str, text *pattern, text *flags); static ArrayType *perform_regexp_matches(regexp_matches_ctx *matchctx); *** *** 760,767 oldcontext = MemoryContextSwitchTo(funcctx-multi_call_memory_ctx); /* be sure to copy the input string into the multi-call ctx */ ! matchctx = setup_regexp_matches(fcinfo, PG_GETARG_TEXT_P_COPY(0), ! pattern, flags); MemoryContextSwitchTo(oldcontext); funcctx-user_fctx = (void *) matchctx; --- 754,761 oldcontext = MemoryContextSwitchTo(funcctx-multi_call_memory_ctx); /* be sure to copy the input string into the multi-call ctx */ ! matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern, ! flags); MemoryContextSwitchTo(oldcontext); funcctx-user_fctx = (void *) matchctx; *** *** 822,828 } static regexp_matches_ctx * ! setup_regexp_matches(FunctionCallInfo fcinfo, text *orig_str, text *pattern, text *flags) { regexp_matches_ctx *matchctx = palloc(sizeof(regexp_matches_ctx)); --- 816,822 } static regexp_matches_ctx * ! setup_regexp_matches(text *orig_str, text *pattern, text *flags) { regexp_matches_ctx *matchctx = palloc(sizeof(regexp_matches_ctx)); *** *** 835,845 matchctx-pmatch = palloc(sizeof(regmatch_t) * (matchctx-cpattern-re_nsub + 1)); matchctx-offset = 0; - /* get text type oid, too lazy to do it some other way */ - matchctx-param_type = get_fn_expr_argtype(fcinfo-flinfo, 0); - get_typlenbyvalalign(matchctx-param_type, matchctx-typlen, -matchctx-typbyval, matchctx-typalign); - matchctx-wide_str = palloc(sizeof(pg_wchar) * (matchctx-orig_len + 1)); matchctx-wide_len = pg_mb2wchar_with_len(VARDATA(matchctx-orig_str), matchctx-wide_str, matchctx-orig_len); --- 829,834 *** *** 915,923 dims[0] = 1; } return construct_md_array(elems, nulls, ndims, dims, lbs, ! matchctx-param_type, matchctx-typlen, ! matchctx-typbyval, matchctx-typalign); } Datum --- 904,912 dims[0] = 1; } + /* XXX: this hardcodes
Re: [PATCHES] patch adding new regexp functions
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --- Jeremy Drake wrote: Jeremy Drake wrote: On Thu, 22 Mar 2007, Tom Lane wrote: I'd vote for making this new code look like the rest of it, to wit hardwire the values. Attached please find a patch which does this. I just realized that the last patch removed all usage of fcinfo in the setup_regexp_matches function, so this version of the patch also removes it as a parameter to that function. -- Think of it! With VLSI we can pack 100 ENIACs in 1 sq. cm.! Content-Description: [ Attachment, skipping... ] -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: I just realized that the last patch removed all usage of fcinfo in the setup_regexp_matches function, so this version of the patch also removes it as a parameter to that function. Applied, thanks. -Neil ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: On Mon, 26 Mar 2007, Bruce Momjian wrote: Tom Lane wrote: Or were you speaking to the question of whether to adjust the regexp patch to conform more nearly to the coding practices found elsewhere? I agree with that, but I thought there was already a submitted patch for it. Yes, regex patch adjustment, and I have not seen a patch which makes such adjustments. http://archives.postgresql.org/pgsql-patches/2007-03/msg00285.php Ah, yes, I still have that in my mailbox, but didn't think it was related because there was discussion _after_ the patch was posted. Will add the patch to the queue now. Thanks. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --- Jeremy Drake wrote: On Thu, 22 Mar 2007, Tom Lane wrote: I'd vote for making this new code look like the rest of it, to wit hardwire the values. Attached please find a patch which does this. -- Although written many years ago, Lady Chatterley's Lover has just been reissued by the Grove Press, and this pictorial account of the day-to-day life of an English gamekeeper is full of considerable interest to outdoor minded readers, as it contains many passages on pheasant-raising, the apprehending of poachers, ways to control vermin, and other chores and duties of the professional gamekeeper. Unfortunately, one is obliged to wade through many pages of extraneous material in order to discover and savour those sidelights on the management of a midland shooting estate, and in this reviewer's opinion the book cannot take the place of J. R. Miller's Practical Gamekeeping. -- Ed Zern, Field and Stream (Nov. 1959) Content-Description: [ Attachment, skipping... ] ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: On Thu, 22 Mar 2007, Tom Lane wrote: AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other datatypes) is lack of obvious usefulness. A function dealing with a text * doesn't normally have reason to convert that to a Datum until it returns --- and at that point PG_RETURN_TEXT_P is the thing to use. Do you have a counterexample, or does this just suggest that the regexp function patch needs some refactoring? If you are asking why I have reason to convert text * to a Datum in cases other than PG_RETURN_TEXT_P, it is used for calling text_substr functions using DirectFunctionCallN. BTW, this usage of text_substr using PointerGetDatum was copied from the pre-existing textregexsubstr function. Is there a follup patch based on this discussion? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
Bruce Momjian [EMAIL PROTECTED] writes: Jeremy Drake wrote: On Thu, 22 Mar 2007, Tom Lane wrote: AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other datatypes) is lack of obvious usefulness. If you are asking why I have reason to convert text * to a Datum in cases other than PG_RETURN_TEXT_P, it is used for calling text_substr functions using DirectFunctionCallN. BTW, this usage of text_substr using PointerGetDatum was copied from the pre-existing textregexsubstr function. Is there a follup patch based on this discussion? Not at the moment. I suppose someone could run around and replace PointerGetDatum by (exactly-equivalent) TextPGetDatum etc, but it seems like mostly make-work. I definitely don't want to spend time on such a project for 8.3. Or were you speaking to the question of whether to adjust the regexp patch to conform more nearly to the coding practices found elsewhere? I agree with that, but I thought there was already a submitted patch for it. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] patch adding new regexp functions
Tom Lane wrote: Bruce Momjian [EMAIL PROTECTED] writes: Jeremy Drake wrote: On Thu, 22 Mar 2007, Tom Lane wrote: AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other datatypes) is lack of obvious usefulness. If you are asking why I have reason to convert text * to a Datum in cases other than PG_RETURN_TEXT_P, it is used for calling text_substr functions using DirectFunctionCallN. BTW, this usage of text_substr using PointerGetDatum was copied from the pre-existing textregexsubstr function. Is there a follup patch based on this discussion? Not at the moment. I suppose someone could run around and replace PointerGetDatum by (exactly-equivalent) TextPGetDatum etc, but it seems like mostly make-work. I definitely don't want to spend time on such a project for 8.3. Or were you speaking to the question of whether to adjust the regexp patch to conform more nearly to the coding practices found elsewhere? I agree with that, but I thought there was already a submitted patch for it. Yes, regex patch adjustment, and I have not seen a patch which makes such adjustments. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
On Mon, 26 Mar 2007, Bruce Momjian wrote: Tom Lane wrote: Or were you speaking to the question of whether to adjust the regexp patch to conform more nearly to the coding practices found elsewhere? I agree with that, but I thought there was already a submitted patch for it. Yes, regex patch adjustment, and I have not seen a patch which makes such adjustments. http://archives.postgresql.org/pgsql-patches/2007-03/msg00285.php -- Pope Goestheveezl was the shortest reigning pope in the history of the Church, reigning for two hours and six minutes on 1 April 1866. The white smoke had hardly faded into the blue of the Vatican skies before it dawned on the assembled multitudes in St. Peter's Square that his name had hilarious possibilities. The crowds fell about, helpless with laughter, singing Half a pound of tuppenny rice Half a pound of treacle That's the way the chimney smokes Pope Goestheveezl The square was finally cleared by armed carabineri with tears of laughter streaming down their faces. The event set a record for hilarious civic functions, smashing the previous record set when Baron Hans Neizant Bompzidaize was elected Landburgher of Koln in 1653. -- Mike Harding, The Armchair Anarchist's Almanac ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake [EMAIL PROTECTED] writes: BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros for those also? Hardcoding -1 for typlen of varlenas is one of the few (the only?) magic constants used throughout the source code. I'm surprised there isn't a macro for it though. Do you need the alignment? If so I want to check the code against the packed varlena patch. Just in case. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] patch adding new regexp functions
Gregory Stark [EMAIL PROTECTED] writes: Jeremy Drake [EMAIL PROTECTED] writes: BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros for those also? Hardcoding -1 for typlen of varlenas is one of the few (the only?) magic constants used throughout the source code. I'm surprised there isn't a macro for it though. Do you need the alignment? If so I want to check the code against the packed varlena patch. Just in case. Ah, it's just to construct an array, that's not a concern at all. And you're detoasting the text data types before using or storing them so that's fine. The only thing I would say is that this should maybe be a TextGetDatum() just for code hygiene. It should be exactly equivalent though: + PointerGetDatum(matchctx-orig_str), -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake [EMAIL PROTECTED] writes: BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros for those also? By and large we tend to hard-wire those properties too, eg there are plenty of places that do things like this: /* XXX: this hardcodes assumptions about the regtype type */ result = construct_array(tmp_ary, num_params, REGTYPEOID, 4, true, 'i'); Some are better commented than others ... regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
On Wed, 21 Mar 2007, Tom Lane wrote: Jeremy Drake [EMAIL PROTECTED] writes: BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros for those also? By and large we tend to hard-wire those properties too, eg there are plenty of places that do things like this: /* XXX: this hardcodes assumptions about the regtype type */ result = construct_array(tmp_ary, num_params, REGTYPEOID, 4, true, 'i'); Some are better commented than others ... So, what action (if any) should be taken? Should all (or some) of these values be hardcoded, or should the current calls to determine them be left in place? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org -- Fortune's Real-Life Courtroom Quote #19: Q: Doctor, how many autopsies have you performed on dead people? A: All my autopsies have been performed on dead people. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake [EMAIL PROTECTED] writes: On Wed, 21 Mar 2007, Tom Lane wrote: By and large we tend to hard-wire those properties too, eg there are plenty of places that do things like this: So, what action (if any) should be taken? Should all (or some) of these values be hardcoded, or should the current calls to determine them be left in place? I'd vote for making this new code look like the rest of it, to wit hardwire the values. As-is, you're expending code space and cycles on flexibility that's entirely illusory --- if we ever decided to change these properties of TEXT, we'd have way more work to do than just fixing the new regexp functions. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
On Wed, 21 Mar 2007, Gregory Stark wrote: The only thing I would say is that this should maybe be a TextGetDatum() just for code hygiene. It should be exactly equivalent though: I could not find such a thing using ctags, nor TextPGetDatum(). I looked at PG_RETURN_TEXT_P and it just uses PG_RETURN_POINTER, which in turn uses PointerGetDatum. If there is such a thing, it is well camoflaged... -- If you think the problem is bad now, just wait until we've solved it. -- Arthur Kasspe ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] patch adding new regexp functions
On Thu, 22 Mar 2007, Tom Lane wrote: I'd vote for making this new code look like the rest of it, to wit hardwire the values. Attached please find a patch which does this. -- Although written many years ago, Lady Chatterley's Lover has just been reissued by the Grove Press, and this pictorial account of the day-to-day life of an English gamekeeper is full of considerable interest to outdoor minded readers, as it contains many passages on pheasant-raising, the apprehending of poachers, ways to control vermin, and other chores and duties of the professional gamekeeper. Unfortunately, one is obliged to wade through many pages of extraneous material in order to discover and savour those sidelights on the management of a midland shooting estate, and in this reviewer's opinion the book cannot take the place of J. R. Miller's Practical Gamekeeping. -- Ed Zern, Field and Stream (Nov. 1959)Index: src/backend/utils/adt/regexp.c === RCS file: /home/jeremyd/local/postgres/cvsuproot/pgsql/src/backend/utils/adt/regexp.c,v retrieving revision 1.70 diff -c -r1.70 regexp.c *** src/backend/utils/adt/regexp.c 20 Mar 2007 05:44:59 - 1.70 --- src/backend/utils/adt/regexp.c 22 Mar 2007 05:17:15 - *** *** 30,35 --- 30,36 #include postgres.h #include access/heapam.h + #include catalog/pg_type.h #include funcapi.h #include regex/regex.h #include utils/builtins.h *** *** 95,106 size_toffset; re_comp_flags flags; - - /* text type info */ - Oid param_type; - int16 typlen; - bool typbyval; - char typalign; } regexp_matches_ctx; typedef struct regexp_split_ctx --- 96,101 *** *** 835,845 matchctx-pmatch = palloc(sizeof(regmatch_t) * (matchctx-cpattern-re_nsub + 1)); matchctx-offset = 0; - /* get text type oid, too lazy to do it some other way */ - matchctx-param_type = get_fn_expr_argtype(fcinfo-flinfo, 0); - get_typlenbyvalalign(matchctx-param_type, matchctx-typlen, -matchctx-typbyval, matchctx-typalign); - matchctx-wide_str = palloc(sizeof(pg_wchar) * (matchctx-orig_len + 1)); matchctx-wide_len = pg_mb2wchar_with_len(VARDATA(matchctx-orig_str), matchctx-wide_str, matchctx-orig_len); --- 830,835 *** *** 915,923 dims[0] = 1; } return construct_md_array(elems, nulls, ndims, dims, lbs, ! matchctx-param_type, matchctx-typlen, ! matchctx-typbyval, matchctx-typalign); } Datum --- 905,913 dims[0] = 1; } + /* XXX: this hardcodes assumptions about the text type */ return construct_md_array(elems, nulls, ndims, dims, lbs, ! TEXTOID, -1, false, 'i'); } Datum *** *** 976,991 { ArrayBuildState *astate = NULL; regexp_split_ctx*splitctx; - Oid param_type; int nitems; splitctx = setup_regexp_split(PG_GETARG_TEXT_P(0), PG_GETARG_TEXT_P(1), PG_GETARG_TEXT_P_IF_EXISTS(2)); - /* get text type oid, too lazy to do it some other way */ - param_type = get_fn_expr_argtype(fcinfo-flinfo, 0); - for (nitems = 0; splitctx-offset splitctx-wide_len; nitems++) { if (nitems splitctx-wide_len) --- 966,977 *** *** 995,1001 astate = accumArrayResult(astate, get_next_split(splitctx), false, ! param_type, CurrentMemoryContext); } --- 981,987 astate = accumArrayResult(astate, get_next_split(splitctx), false, ! TEXTOID, CurrentMemoryContext); } ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake [EMAIL PROTECTED] writes: I could not find such a thing using ctags, nor TextPGetDatum(). I looked at PG_RETURN_TEXT_P and it just uses PG_RETURN_POINTER, which in turn uses PointerGetDatum. If there is such a thing, it is well camoflaged... AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other datatypes) is lack of obvious usefulness. A function dealing with a text * doesn't normally have reason to convert that to a Datum until it returns --- and at that point PG_RETURN_TEXT_P is the thing to use. Do you have a counterexample, or does this just suggest that the regexp function patch needs some refactoring? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
On Thu, 22 Mar 2007, Tom Lane wrote: AFAIR, the reason there's no TextPGetDatum (and ditto for lots of other datatypes) is lack of obvious usefulness. A function dealing with a text * doesn't normally have reason to convert that to a Datum until it returns --- and at that point PG_RETURN_TEXT_P is the thing to use. Do you have a counterexample, or does this just suggest that the regexp function patch needs some refactoring? If you are asking why I have reason to convert text * to a Datum in cases other than PG_RETURN_TEXT_P, it is used for calling text_substr functions using DirectFunctionCallN. BTW, this usage of text_substr using PointerGetDatum was copied from the pre-existing textregexsubstr function. -- Malek's Law: Any simple idea will be worded in the most complicated way. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: The patch has been sitting in the unapplied patches queue for a while, and the inevitable bitrot finally caught up with it. This version of the patch is exactly the same as the one in the patches queue, except for using new, non-conflicting oids for the functions. Applied -- thanks for the patch. BTW, a few cosmetic comments: * #includes should be sorted alphabetically (unless another #include sort order rules take precedence, like including postgres.h first). * patch included some trailing CR line endings * IMHO using ++var in the increment component of a for loop is bad style in C (if only because it is needlessly inconsistent; good C++ style is not necessarily good C style) * Comments like /* get text type oid, too lazy to do it some other way */ do not exactly inspire confidence :) -Neil ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Neil Conway [EMAIL PROTECTED] writes: * Comments like /* get text type oid, too lazy to do it some other way */ do not exactly inspire confidence :) Surely the code was just using the TEXTOID macro? If so, it does not require a comment. If not, it should be fixed ... regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
On Wed, 21 Mar 2007, Tom Lane wrote: Neil Conway [EMAIL PROTECTED] writes: * Comments like /* get text type oid, too lazy to do it some other way */ do not exactly inspire confidence :) Surely the code was just using the TEXTOID macro? If so, it does not require a comment. If not, it should be fixed ... Nope, it does get_fn_expr_argtype(fcinfo-flinfo, 0); The too lazy remark was that I thought there may be a better way (like the macro you mentioned) but did not go looking for it because I already had something that worked that I found in the manual. If you like, I can put together another patch that removes the param_type members from the structs and hardcodes TEXTOID in the proper places. BTW, should I be calling get_typlenbyvalalign on TEXTOID or are there macros for those also? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org -- We're only in it for the volume. -- Black Sabbath ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: The patch has been sitting in the unapplied patches queue for a while Barring any objections, I'll apply this tomorrow. -Neil ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] patch adding new regexp functions
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --- Jeremy Drake wrote: On Sun, 18 Feb 2007, Tom Lane wrote: Jeremy Drake [EMAIL PROTECTED] writes: I will rename the functions regexp_split_to_(table|array) and I will add an optional limit parameter to the regexp_split_to_table function, for consistency and to avoid ordering concerns with LIMIT. I'd go the other way: get rid of the limit option all 'round. It seems like fairly useless complexity. OK, here is what is hopefully the final version of this patch. I have renamed the functions as above, and removed the optional limit parameter. If there are no further complaints, this should be ready to apply. -- It is the business of little minds to shrink. -- Carl Sandburg Content-Description: [ Attachment, skipping... ] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: As for the argument about array vs setof, I could see doing both to end the argument of which one is really superior for any particular problem. regexp_split(string text, pattern text[, flags text]) returns setof text regexp_split_array(string text, pattern text[. flags text[, limit int]]) returns text[] Since you are not splitting an array but returning an array, I would think that regexp_split_to_array would be better, and the other should then be regexp_split_to_table. But why does the second one have a limit and the first one doesn't? Is this because you rely on the LIMIT clause to do the same? Is there a guarantee that LIMIT on a table function makes a consistent order? -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
On Sun, 18 Feb 2007, Jeremy Drake wrote: On Sun, 18 Feb 2007, Peter Eisentraut wrote: regexp_split(string text, pattern text[, flags text]) returns setof text regexp_split_array(string text, pattern text[. flags text[, limit int]]) returns text[] Since you are not splitting an array but returning an array, I would think that regexp_split_to_array would be better, and the other should then be regexp_split_to_table. OK But why does the second one have a limit and the first one doesn't? I will rename the functions regexp_split_to_(table|array) and I will add an optional limit parameter to the regexp_split_to_table function, for consistency and to avoid ordering concerns with LIMIT. -- Sometimes I worry about being a success in a mediocre world. -- Lily Tomlin ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake [EMAIL PROTECTED] writes: I will rename the functions regexp_split_to_(table|array) and I will add an optional limit parameter to the regexp_split_to_table function, for consistency and to avoid ordering concerns with LIMIT. I'd go the other way: get rid of the limit option all 'round. It seems like fairly useless complexity. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: In case you haven't noticed, I am rather averse to making this return text[] because it is much easier in my experience to use the results when returned in SETOF rather than text[], The primary use case I know for string splitting is parsing comma/pipe/whatever separated fields into a row structure, and the way I see it your API proposal makes that exceptionally difficult. I don't know what your use case is, though. All of this is missing actual use cases. While, if you really really wanted a text[], you could use the (fully documented) ARRAY(select resultstr from regexp_split(...) order by startpos) construct. I think, however, that we should be providing simple primitives that can be combined into complex expressions rather than complex primitives that have to be dissected apart to get simple results. As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? It was requested by David Fetter: http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php It was not horribly difficult to provide, and it seemed reasonable to me. I have no need for them personally. David Fetter has also repeated failed to offer a use case for this, so I hesitate to accept this. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
On Sat, Feb 17, 2007 at 09:02:24AM +0100, Peter Eisentraut wrote: Jeremy Drake wrote: In case you haven't noticed, I am rather averse to making this return text[] because it is much easier in my experience to use the results when returned in SETOF rather than text[], The primary use case I know for string splitting is parsing comma/pipe/whatever separated fields into a row structure, and the way I see it your API proposal makes that exceptionally difficult. I don't know what your use case is, though. All of this is missing actual use cases. While, if you really really wanted a text[], you could use the (fully documented) ARRAY(select resultstr from regexp_split(...) order by startpos) construct. I think, however, that we should be providing simple primitives that can be combined into complex expressions rather than complex primitives that have to be dissected apart to get simple results. As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? It was requested by David Fetter: http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php It was not horribly difficult to provide, and it seemed reasonable to me. I have no need for them personally. David Fetter has also repeated failed to offer a use case for this, so I hesitate to accept this. What is it about having the whole match, pre-match and post-match available that you're objecting to? Are you saying there aren't common uses for any or all of these? Regular expression users use them all over the place, and adding this capability to SQL seems like a reasonable next step :) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] patch adding new regexp functions
On Sat, 17 Feb 2007, Peter Eisentraut wrote: Jeremy Drake wrote: In case you haven't noticed, I am rather averse to making this return text[] because it is much easier in my experience to use the results when returned in SETOF rather than text[], The primary use case I know for string splitting is parsing comma/pipe/whatever separated fields into a row structure, and the way I see it your API proposal makes that exceptionally difficult. For this case see string_to_array: http://developer.postgresql.org/pgdocs/postgres/functions-array.html select string_to_array('a|b|c', '|'); string_to_array - {a,b,c} (1 row) I don't know what your use case is, though. All of this is missing actual use cases. The particular use case I had for this function was at a previous employer, and I am not sure exactly how much detail is appropriate to divulge. Basically, the project was doing some text processing inside of postgres, and getting all of the words from a string into a table with some processing (excluding stopwords and so forth) as efficiently as possible was a big concern. The regexp_split function code was based on some code that a friend of mine wrote which used PCRE rather than postgres' internal regexp support. I don't know exactly what his use-case was, but he probably had one because he wrote the function and had it returning SETOF text ;) Perhaps he can share a general idea of what it was (nudge nudge)? While, if you really really wanted a text[], you could use the (fully documented) ARRAY(select resultstr from regexp_split(...) order by startpos) construct. I think, however, that we should be providing simple primitives that can be combined into complex expressions rather than complex primitives that have to be dissected apart to get simple results. The most simple primitive is string_to_array(text, text) returns text[], but it was not sufficient for our needs. As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? It was requested by David Fetter: http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php It was not horribly difficult to provide, and it seemed reasonable to me. I have no need for them personally. David Fetter has also repeated failed to offer a use case for this, so I hesitate to accept this. I have no strong opinion either way, so I will let those who do argue it out and wait for the dust to settle ;) -- The Law, in its majestic equality, forbids the rich, as well as the poor, to sleep under the bridges, to beg in the streets, and to steal bread. -- Anatole France ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
David Fetter wrote: What is it about having the whole match, pre-match and post-match available that you're objecting to? Are you saying there aren't common uses for any or all of these? Regular expression users use them all over the place, You keep saying that, and I keep saying please show a use case. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Peter Eisentraut [EMAIL PROTECTED] writes: David Fetter wrote: What is it about having the whole match, pre-match and post-match available that you're objecting to? Are you saying there aren't common uses for any or all of these? Regular expression users use them all over the place, You keep saying that, and I keep saying please show a use case. Maybe I'm missing something, but ISTM that given the ability to return multiple match substrings there is no need for such features. Given an arbitrary regex 'xyz', write '^(.*)(xyz)(.*)$' and you'll get back the pre-match, whole match, and post-match in addition to any parenthesized substrings that 'xyz' contains. If you only need some of them, you can leave out the corresponding parts of this pattern. So I'd vote against complicating the API in order to make special provision for these results. I claim that all we need is a function taking (string text, pattern text, flags text) and returning either array of text or setof text containing the matched substrings in whatever order is standard (order by left-parenthesis position, I think). In the degenerate case where there are no parenthesized subpatterns, just return the whole match as a single element. As for the argument about array vs setof, I could see doing both to end the argument of which one is really superior for any particular problem. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: For this case see string_to_array: That only works with fixed delimiters, and I was hoping that regexp_split would be an alternative with regexp delimiters. It's certainly another argument that all the split functions in PostgreSQL should offer analogous interfaces. There is no reason being offered why splitting on a fixed string should result in an array and splitting on a regexp should result in a table. The particular use case I had for this function was at a previous employer, and I am not sure exactly how much detail is appropriate to divulge. Basically, the project was doing some text processing inside of postgres, and getting all of the words from a string into a table with some processing (excluding stopwords and so forth) as efficiently as possible was a big concern. We already have tsearch for that. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
Finally the voice of reason :) On Sat, 17 Feb 2007, Tom Lane wrote: So I'd vote against complicating the API in order to make special provision for these results. I claim that all we need is a function taking (string text, pattern text, flags text) and returning either array of text or setof text For this function, it would be setof array of text, as the capture groups would definitely go in an array, but if you asked for global in the flags, there could be more than one match in the string. containing the matched substrings in whatever order is standard (order by left-parenthesis position, I think). In the degenerate case where there are no parenthesized subpatterns, just return the whole match as a single element. Good idea, that did not occur to me. I was planning to throw an error in that case. As for the argument about array vs setof, I could see doing both to end the argument of which one is really superior for any particular problem. The array vs setof argument was on the split function. I will work on doing both. Any idea how the name and/or arguments should differ? I think that an optional limit argument for the array version like perl has would be reasonable, but a name difference in the functions is probably necessary to avoid confusion. -- Speaking of Godzilla and other things that convey horror: With a purposeful grimace and a Mongo-like flair He throws the spinning disk drives in the air! And he picks up a Vax and he throws it back down As he wades through the lab making terrible sounds! Helpless users with projects due Scream My God! as he stomps on the tape drives, too! Oh, no! He says Unix runs too slow! Go, go, DECzilla! Oh, yes! He's gonna bring up VMS! Go, go, DECzilla! * VMS is a trademark of Digital Equipment Corporation * DECzilla is a trademark of Hollow Chocolate Bunnies of Death, Inc. -- Curtis Jackson ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: The regexp_split function code was based on some code that a friend of mine wrote which used PCRE rather than postgres' internal regexp support. I don't know exactly what his use-case was, but he probably had one because he wrote the function and had it returning SETOF text ;) Perhaps he can share a general idea of what it was (nudge nudge)? db=# CREATE OR REPLACE FUNCTION split(p TEXT, t TEXT) RETURNS SETOF TEXT AS $$ db$# my ($p, $t) = @_; db$# return [ split(/$p/,$t) ]; db$# $$ LANGUAGE plperl; CREATE FUNCTION Time: 1.254 ms db=# select distinct word from (select * from split('\\W+','mary had a little lamb, whose fleece was black as soot') as word) as ss; word a as black fleece had lamb little mary soot was whose (11 rows) Time: 30.517 ms As you can see, this can easily be done with a plperl function. Some people may not want to install plperl, or may not want to allow arbitrary patterns to be handed to perl in this fashion. That was not my concern. I was simply trying to see if I could make it faster in a C-language coded function. In the end I dropped the project because the plperl function works fast enough for me and I don't have any objection to plperl from a security standpoint, etc. mark ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake [EMAIL PROTECTED] writes: On Sat, 17 Feb 2007, Tom Lane wrote: So I'd vote against complicating the API in order to make special provision for these results. I claim that all we need is a function taking (string text, pattern text, flags text) and returning either array of text or setof text For this function, it would be setof array of text, as the capture groups would definitely go in an array, but if you asked for global in the flags, there could be more than one match in the string. Oh, right. And you could do a 2-D array if you wanted it all in one blob (or a guarantee of order). So no need for record-returning functions? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] patch adding new regexp functions
On Sat, 17 Feb 2007, Tom Lane wrote: Jeremy Drake [EMAIL PROTECTED] writes: On Sat, 17 Feb 2007, Tom Lane wrote: So I'd vote against complicating the API in order to make special provision for these results. I claim that all we need is a function taking (string text, pattern text, flags text) and returning either array of text or setof text For this function, it would be setof array of text, as the capture groups would definitely go in an array, but if you asked for global in the flags, there could be more than one match in the string. Oh, right. And you could do a 2-D array if you wanted it all in one blob (or a guarantee of order). I don't think that there is much of an argument for having this one return a 2d array, in this particular case, even perl requires you to build a loop, IIRC. my $str = 'foobarbecuebazilbarf'; while($str=~/(b[^b]+)(b[^b]+)/g) { print $1, \t, length($1), \n; print $2, \t, length($2), \n; print ---\n; } bar 3 becue 5 --- bazil 5 barf4 --- So no need for record-returning functions? No. -- Go placidly amid the noise and waste, and remember what value there may be in owning a piece thereof. -- National Lampoon, Deteriorata ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake: On Thu, 15 Feb 2007, Peter Eisentraut wrote: I have no strong opinion about how matches are returned. Seeing the definitional difficulties that you point out, it may be fine to return them unordered. But then all matches functions should do that. For the split functions, however, providing the order is clearly important. Does this version sufficiently address your concerns? I don't think anyone asked for the start position and length in the result of regexp_split(). The result should be an array of text. That is what Perl et al. do. As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
On Fri, Feb 16, 2007 at 05:54:47PM +0100, Peter Eisentraut wrote: Am Freitag, 16. Februar 2007 17:11 schrieb David Fetter: As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? If not in one call, how would you get it? Perl, for example, makes these available to any regex match in the form of variables it sets. The question is, what is the use case? If there is one in Perl, can this proposed function API support it? Perl makes the following variables available in any regex match, although it optimizes some cases for when they're not there: $1, ... $n (captured matches in parentheses) $` (pre-match) $' (post-match) $ (whole match) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] patch adding new regexp functions
David Fetter wrote: The question is, what is the use case? If there is one in Perl, can this proposed function API support it? Perl makes the following variables available in any regex match, although it optimizes some cases for when they're not there: $1, ... $n (captured matches in parentheses) $` (pre-match) $' (post-match) $ (whole match) Use of any of these is notoriously costly, BTW, especially pre-match and post-match. See perlre man page. cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] patch adding new regexp functions
David Fetter wrote: The question is, what is the use case? If there is one in Perl, can this proposed function API support it? Perl makes the following variables available in any regex match, That is not a use case. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] patch adding new regexp functions
On Fri, 16 Feb 2007, Peter Eisentraut wrote: Am Freitag, 16. Februar 2007 08:02 schrieb Jeremy Drake: Does this version sufficiently address your concerns? I don't think anyone asked for the start position and length in the result of regexp_split(). The result should be an array of text. That is what Perl et al. do. The length is not returned, I simply call length() on the string result to make sure that no whitespace gets in. The start position was suggested in these two messages from Alvaro Herrera: http://archives.postgresql.org/pgsql-patches/2007-02/msg00276.php http://archives.postgresql.org/pgsql-patches/2007-02/msg00281.php This was meant to address your concern about the order not necessarily being preserved of the split results when being returned as SETOF. This gives the benefits of returning SETOF while still allowing the order to be preserved if desired (simply add ORDER BY startpos to guarantee the correct order). In case you haven't noticed, I am rather averse to making this return text[] because it is much easier in my experience to use the results when returned in SETOF rather than text[], and in all of the code that I have experience with where this would be useful I would end up using information_schema._pg_expandarray (a function that, AFAIK, is documented nowhere) to convert it into SETOF text. While, if you really really wanted a text[], you could use the (fully documented) ARRAY(select resultstr from regexp_split(...) order by startpos) construct. As for the regexp_matches() function, it seems to me that it returns too much information at once. What is the use case for getting all of prematch, fullmatch, matches, and postmatch in one call? It was requested by David Fetter: http://archives.postgresql.org/pgsql-hackers/2007-02/msg00056.php It was not horribly difficult to provide, and it seemed reasonable to me. I have no need for them personally. -- Some people in this department wouldn't recognize subtlety if it hit them on the head. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] patch adding new regexp functions
Neil Conway wrote: On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote: What was the status of this? Was there anything else I needed to do with this patch, or is it ready to be applied? Let me know if there is anything else I need to do on this... Will do -- I'm planning to apply this as soon as I have the free cycles to do so, likely tomorrow or Friday. I don't know which patch is actually being proposed now. It would be good to make this more explicit and maybe include a synopsis of the functions in the email, so we know what's going on. What confuses me about some of the functions I've seen in earlier patches in this thread is that they return setof something. But in my mind, regular expression matches or string splits are inherently ordered, so an array would be the correct return type. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
On Thu, 15 Feb 2007, Peter Eisentraut wrote: Neil Conway wrote: On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote: What was the status of this? Was there anything else I needed to do with this patch, or is it ready to be applied? Let me know if there is anything else I need to do on this... Will do -- I'm planning to apply this as soon as I have the free cycles to do so, likely tomorrow or Friday. I don't know which patch is actually being proposed now. It would be good to make this more explicit and maybe include a synopsis of the functions in the email, so we know what's going on. Sorry, my intent was just to check to see if I had gotten the patch sufficiently fixed for Neil to apply and he just hadn't gotten to it yet (which seems to be the case), or if there was something else he still expected me to fix that I had missed in the prior discussions. I suppose I should have emailed him privately. The patch in question can be seen in the archives here: http://archives.postgresql.org/pgsql-patches/2007-02/msg00214.php The functions added are: * regexp_split(str text, pattern text) RETURNS SETOF text regexp_split(str text, pattern text, flags text) RETURNS SETOF text returns each section of the string delimited by the pattern. * regexp_matches(str text, pattern text) RETURNS text[] returns all capture groups when matching pattern against string in an array * regexp_matches(str text, pattern text, flags text) RETURNS SETOF (prematch text, fullmatch text, matches text[], postmatch text) returns all capture groups when matching pattern against string in an array. also returns the entire match in fullmatch. if the 'g' option is given, returns all matches in the string. if the 'r' option is given, also return the text before and after the match in prematch and postmatch respectively. What confuses me about some of the functions I've seen in earlier patches in this thread is that they return setof something. But in my mind, regular expression matches or string splits are inherently ordered, so an array would be the correct return type. They do return SETOF. Addressing them separately: regexp_matches uses a text[] for the match groups. If you specify the global flag, it could return multiple matches. Couple this with the requested feature of pre- and postmatch returns (with its own flag) and the return would turn into some sort of nasty array of tuples, or multiple arrays. It seems much cleaner to me to return a set of the matches found, and I find which order the matches occur in to be much less important than the fact that they did occur and their contents. regexp_split returns setof text. This has, in my opinion, a much greater case to return an array. However, there are several issues with this approach: # My experience with the array code leads me to believe that building up an array is an expensive proposition. I know I could code it smarter so that the array is only constructed in the end. # With a set-returning function, it is possible to add a LIMIT clause, to prevent splitting up more of the string than is necessary. It is also immediately possible to insert them into a table, or do grouping on them, or call a function on each value. Most of the time when I do a split, I intend to do something like this with each split value. # You can still get an array if you really want it: #* SELECT ARRAY(SELECT * FROM regexp_split('first, second, third', E',\\s*')) -- No problem is so formidable that you can't just walk away from it. -- C. Schulz ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: regexp_matches uses a text[] for the match groups. If you specify the global flag, it could return multiple matches. Couple this with the requested feature of pre- and postmatch returns (with its own flag) and the return would turn into some sort of nasty array of tuples, or multiple arrays. It seems much cleaner to me to return a set of the matches found, and I find which order the matches occur in to be much less important than the fact that they did occur and their contents. Then the fact that the flag-less matches function returns an array would be a mistake. They have to return the same category of object. regexp_split returns setof text. This has, in my opinion, a much greater case to return an array. However, there are several issues with this approach: Any programming language I have ever seen returns the result of a regular expression split as a structure with order. That in turn implies that there are use cases for having the order, which your proposed function could not address. # My experience with the array code leads me to believe that building up an array is an expensive proposition. I know I could code it smarter so that the array is only constructed in the end. You can make any code arbitrarily fast if it doesn't have to give the right answer. # With a set-returning function, it is possible to add a LIMIT clause, to prevent splitting up more of the string than is necessary. You can also add such functionality to a function in form of a parameter. In fact, relying on a LIMIT clause to do this seems pretty fragile. We argue elsewhere that LIMIT without ORDER BY is not well-defined, and while it's hard to imagine in the current implementation why the result of a set returning function would come back in arbitrary order, it is certainly possible in theory, so you still need to order the result set if you want reliable limits, but that is not possible of the order is lost in the result. It is also immediately possible to insert them into a table, or do grouping on them, or call a function on each value. Most of the time when I do a split, I intend to do something like this with each split value. These sort of arguments remind me of the contrib/xml2 module, which also has a very, uh, pragmatic API. Sure, these operations may seem useful to you. But when we offer a function as part of the core API, it is also important that we offer a clean design that allows other users to combine reasonably orthogonal functionality into tools that are useful to them. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote: The functions added are: * regexp_split(str text, pattern text) RETURNS SETOF text regexp_split(str text, pattern text, flags text) RETURNS SETOF text returns each section of the string delimited by the pattern. * regexp_matches(str text, pattern text) RETURNS text[] returns all capture groups when matching pattern against string in an array * regexp_matches(str text, pattern text, flags text) RETURNS SETOF (prematch text, fullmatch text, matches text[], postmatch text) returns all capture groups when matching pattern against string in an array. also returns the entire match in fullmatch. if the 'g' option is given, returns all matches in the string. if the 'r' option is given, also return the text before and after the match in prematch and postmatch respectively. I think the position the match is in could be important. I'm wondering if you could define them like create type re_match(match text, position int) regexp_split(str text, pattern text) returns setof re_match or maybe regexp_split(str text, pattern text, OUT match text, OUT position int); (not sure of the exact syntax for this one) so that you would have the position for each match, automatically. Is this information available in the regex code? -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
Peter Eisentraut [EMAIL PROTECTED] writes: Jeremy Drake wrote: # My experience with the array code leads me to believe that building up an array is an expensive proposition. I know I could code it smarter so that the array is only constructed in the end. You can make any code arbitrarily fast if it doesn't have to give the right answer. Even more to the point, it's folly to suppose that the overhead of processing a SETOF result is less than that of array construction. I tend to agree with Peter's concern that returning a set is going to make it hard to track which result is which. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] patch adding new regexp functions
Alvaro Herrera [EMAIL PROTECTED] writes: so that you would have the position for each match, automatically. Is this information available in the regex code? Certainly, that's where we got the text snippets from to begin with. However, I'm not sure that this is important enough to justify a special type --- for one thing, since we don't have arrays of composites, that would foreclose responding to Peter's concern that SETOF is the wrong thing. If you look at the Perl and Tcl APIs for regexes, they return just the strings, not the numerical positions; and I've not heard anyone complaining about that. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] patch adding new regexp functions
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: so that you would have the position for each match, automatically. Is this information available in the regex code? Certainly, that's where we got the text snippets from to begin with. However, I'm not sure that this is important enough to justify a special type --- for one thing, since we don't have arrays of composites, that would foreclose responding to Peter's concern that SETOF is the wrong thing. My point is that if you want to have the order in which the matches were found, you can do that easily by looking at the positions; no need to create an ordered array. Which does respond to Peter's concern, since the point was to keep the ordering of matches, which an array does; but if we provide the positions, the SETOF way does as well. On the other hand, I don't think it's impossible to have matches that start earlier than others in the string, but are actually found later (say, because they are a parentized expression that ends later). So giving the starting positions allows one to know where are they located, rather than where were they reported. (I don't really know if the matches are sorted before reporting though.) If you look at the Perl and Tcl APIs for regexes, they return just the strings, not the numerical positions; and I've not heard anyone complaining about that. I know, but that may be just because it would be too much extra complexity for them (in terms of user API) to be returning the positions along the text. I know I'd be fairly annoyed if =~ in Perl returned an array of hashes { text = 'foo', position = 42} instead of array of text. We don't have that problem. In fact, I would claim that's much easier to deal with a SETOF function than is to deal with text[]. Regarding the nobody complains argument, I don't find that particularly compelling; witness how people gets used to working around limitations in MySQL ... ;-) -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] patch adding new regexp functions
Alvaro Herrera wrote: On the other hand, I don't think it's impossible to have matches that start earlier than others in the string, but are actually found later (say, because they are a parentized expression that ends later). So giving the starting positions allows one to know where are they located, rather than where were they reported. (I don't really know if the matches are sorted before reporting though.) I have no strong opinion about how matches are returned. Seeing the definitional difficulties that you point out, it may be fine to return them unordered. But then all matches functions should do that. For the split functions, however, providing the order is clearly important. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] patch adding new regexp functions
On Thu, Feb 15, 2007 at 10:37:26AM -0500, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: so that you would have the position for each match, automatically. Is this information available in the regex code? Certainly, that's where we got the text snippets from to begin with. However, I'm not sure that this is important enough to justify a special type --- for one thing, since we don't have arrays of composites, This is a TODO :) I've obviously misunderstood the scope of the TODO because it appears that an INSERT into pg_type at creation time for compound types that looks something like the below would do it. What have I missed? INSERT INTO pg_type VALUES ( '_foo', /* Generated by makeArrayTypeName */ 16744, /* OID of schema */ 10, /* OID of owner of the base type */ -1, /* typlen indicates varlena */ 'f',/* not passed by value */ 'c',/* typtype is composite */ 't',/* type is already defined */ ',',/* typdelim */ 0, /* should this actually refer to the type? */ 'foo'::regtype, /* typelem */ 'array_in', /* typinput */ 'array_out',/* typoutput */ 'array_recv', /* typreceive */ 'array_send', /* typsend */ 0, /* typanalyze */ 'i',/* typalign. Should this be 'd'? */ 'x',/* typstorage */ 'f',/* not a DOMAIN, but while we're at it, why not arrays of DOMAIN? */ 0, /* base type. should this be different? */ -1, /* no typmod */ 0 /* dims not specified */ ); that would foreclose responding to Peter's concern that SETOF is the wrong thing. If you look at the Perl and Tcl APIs for regexes, they return just the strings, not the numerical positions; and I've not heard anyone complaining about that. They do return them in the order in which they appear, though, which, as far as I can tell, Jeremy's functions also do. Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
On Thu, 15 Feb 2007, Alvaro Herrera wrote: Jeremy Drake wrote: The functions added are: * regexp_split(str text, pattern text) RETURNS SETOF text regexp_split(str text, pattern text, flags text) RETURNS SETOF text returns each section of the string delimited by the pattern. * regexp_matches(str text, pattern text) RETURNS text[] returns all capture groups when matching pattern against string in an array * regexp_matches(str text, pattern text, flags text) RETURNS SETOF (prematch text, fullmatch text, matches text[], postmatch text) returns all capture groups when matching pattern against string in an array. also returns the entire match in fullmatch. if the 'g' option is given, returns all matches in the string. if the 'r' option is given, also return the text before and after the match in prematch and postmatch respectively. I think the position the match is in could be important. I'm wondering if you could define them like create type re_match(match text, position int) regexp_split(str text, pattern text) returns setof re_match So it looks like the issues are: 1. regexp_matches without flags has a different return type than regexp_matches with flags. I can make them return the same OUT parameters, but should I declare it as returning SETOF when I know for a fact that the no-flags version will never return more than one row? 2. regexp_split does not represent the order of the results. I can do something like: regexp_split(str text, pattern text[, flags text], OUT result text, OUT startpos int) RETURNS SETOF record; It could also have the int being a simple counter to represent the relative order, rather than the position. Thoughts? Do these changes address the issues recently expressed? -- I have yet to see any problem, however complicated, which, when looked at in the right way, did not become still more complicated. -- Poul Anderson ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] patch adding new regexp functions
David Fetter [EMAIL PROTECTED] writes: I've obviously misunderstood the scope of the TODO because it appears that an INSERT into pg_type at creation time for compound types that looks something like the below would do it. What have I missed? There are a couple of issues. One is that we probably don't want two pg_type entries for every single table. Will you be satisfied if only CREATE TYPE AS ... makes an array type? The other is that, at least at the time they were written, the array support routines couldn't handle composite array values. Things might or might not be easier today; I don't think we had record_in and record_out in their current form then. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] patch adding new regexp functions
On Thu, Feb 15, 2007 at 07:35:46PM -0500, Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: I've obviously misunderstood the scope of the TODO because it appears that an INSERT into pg_type at creation time for compound types that looks something like the below would do it. What have I missed? There are a couple of issues. One is that we probably don't want two pg_type entries for every single table. Now that you mention it, I would want that if that's what it takes to get arrays for them. The long-term goal here is to make all of PostgreSQL's types play nicely together. I'm guessing that SETOF will eventually be a way to describe a collection because MULTISET is in SQL:2003. Will you be satisfied if only CREATE TYPE AS ... makes an array type? The other is that, at least at the time they were written, the array support routines couldn't handle composite array values. Things might or might not be easier today; I don't think we had record_in and record_out in their current form then. OK. What about pg_depend? Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Tom Lane wrote: David Fetter [EMAIL PROTECTED] writes: I've obviously misunderstood the scope of the TODO because it appears that an INSERT into pg_type at creation time for compound types that looks something like the below would do it. What have I missed? There are a couple of issues. One is that we probably don't want two pg_type entries for every single table. Will you be satisfied if only CREATE TYPE AS ... makes an array type? There should be some better way to create the array type for tables than directly mangling pg_type, though. Maybe a builtin function that took an oid? cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] patch adding new regexp functions
What was the status of this? Was there anything else I needed to do with this patch, or is it ready to be applied? Let me know if there is anything else I need to do on this... -- What this country needs is a good five cent nickel. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] patch adding new regexp functions
On Wed, 2007-02-14 at 16:49 -0800, Jeremy Drake wrote: What was the status of this? Was there anything else I needed to do with this patch, or is it ready to be applied? Let me know if there is anything else I need to do on this... Will do -- I'm planning to apply this as soon as I have the free cycles to do so, likely tomorrow or Friday. -Neil ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
On Fri, 2007-02-09 at 01:08 -0800, Jeremy Drake wrote: Yeah, I try to do that, but this one just barely passed my personal compression threshold. Guess I should raise my threshold :) No, don't pay any attention to me, I'm just lazy :) Here is a new version of the patch which fixes up the documentation a little (should have read it over again before posting). The doing_srf business is rather tortured, and seems an invitation for bugs. ISTM there should be a cleaner way to implement this. For example, would it be possible to put all the common logic into one or more additional functions, and then have SRF vs. non-SRF cases that call into those functions after doing the appropriate initialization? -Neil ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] patch adding new regexp functions
On Fri, 2007-02-09 at 16:33 -0800, Jeremy Drake wrote: Here is a new version of the patch which eliminates the doing_srf stuff. * C89 require constant-sized stack allocated arrays, so the coding in perform_regexp_matches() is non-portable. * I'm not clear about the control flow in regexp_matches() and regexp_split(). Presumably it's not possible for the call_cntr to actually exceed max_calls, so the error message in these cases should be elog(ERROR), not ereport (the former is for shouldn't happen bug scenarios, the latter is for user-facing errors). Can you describe the logic here (e.g. via comments) a bit more? * The logic in regexp_split (incremented_offset, first_match, etc.) is pretty ugly and hard to follow. The if condition on line 1037 is particularly objectionable. Again, ISTM there should be a cleaner way to do this. * Try to keep lines to 80 characters or fewer. If the code is wandering off the right side of the screen all the time, that might be a hint that it needs simplification. Attached is a cleaned up version of your patch -- various improvements throughout, but mostly cosmetic stuff. Do you want to take a look at the above? -Neil Index: doc/src/sgml/func.sgml === RCS file: /home/neilc/postgres/cvs_root/pgsql/doc/src/sgml/func.sgml,v retrieving revision 1.357 diff -c -p -r1.357 func.sgml *** doc/src/sgml/func.sgml 1 Feb 2007 00:28:16 - 1.357 --- doc/src/sgml/func.sgml 10 Feb 2007 03:20:14 - *** *** 1446,1451 --- 1446,1464 /row row +entryliteralfunctionregexp_matches/function(parameterstring/parameter typetext/type, parameterpattern/parameter typetext/type [,parameterflags/parameter typetext/type])/literal/entry +entrytypetext[]/type or typesetof record/type (if flags are given)/entry +entry + Return all capture groups resulting from matching POSIX regular + expression against the parameterstring/parameter. See + xref linkend=functions-matching for more information on pattern + matching. +/entry +entryliteralregexp_matches('foobarbequebaz', '(bar)(beque)')/literal/entry +entryliteral{bar,beque}/literal/entry + /row + + row entryliteralfunctionregexp_replace/function(parameterstring/parameter typetext/type, parameterpattern/parameter typetext/type, parameterreplacement/parameter typetext/type [,parameterflags/parameter typetext/type])/literal/entry entrytypetext/type/entry entry *** *** 1458,1463 --- 1471,1488 /row row +entryliteralfunctionregexp_split/function(parameterstring/parameter typetext/type, parameterpattern/parameter typetext/type [,parameterflags/parameter typetext/type])/literal/entry +entrytypesetof text/type/entry +entry + Splits parameterstring/parameter using POSIX regular expression as + the delimiter. See xref linkend=functions-matching for more + information on pattern matching. +/entry +entryliteralregexp_split('hello world', E'\\s+')/literal/entry +entryliteralhello/literalparaliteralworld/literal/para (2 rows)/entry + /row + + row entryliteralfunctionrepeat/function(parameterstring/parameter typetext/type, parameternumber/parameter typeint/type)/literal/entry entrytypetext/type/entry entryRepeat parameterstring/parameter the specified *** cast(-44 as bit(12)) lineanno *** 2864,2869 --- 2889,2900 indexterm primaryregexp_replace/primary /indexterm +indexterm + primaryregexp_matches/primary +/indexterm +indexterm + primaryregexp_split/primary +/indexterm synopsis replaceablestring/replaceable SIMILAR TO replaceablepattern/replaceable optionalESCAPE replaceableescape-character/replaceable/optional *** substring('foobar' from 'o(.)b') line *** 3112,3118 string containing zero or more single-letter flags that change the function's behavior. Flag literali/ specifies case-insensitive matching, while flag literalg/ specifies replacement of each matching ! substring rather than only the first one. /para para --- 3143,3152 string containing zero or more single-letter flags that change the function's behavior. Flag literali/ specifies case-insensitive matching, while flag literalg/ specifies replacement of each matching ! substring rather than only the first one. Other supported flags are ! literalm/, literaln/, literalp/, literalw/ and ! literalx/, whose meanings correspond to those shown in ! xref linkend=posix-embedded-options-table. /para para *** regexp_replace('foobarbaz', 'b(..)', E'X *** 3127,3132 --- 3161,3341 /programlisting /para + para + The
Re: [PATCHES] patch adding new regexp functions
On Thu, 2007-02-08 at 19:22 -0800, Jeremy Drake wrote: I have added documentation for the functions in this patch. At this point I am ready to submit this patch for the review and application process. Please let me know if you find any issues with it. Barring any objections, I'll review and apply this patch in the next few days. BTW, unless the patch is truly large, leaving the patch uncompressed makes it easier to eyeball it without leaving my mail client. That's just personal preference, though... -Neil ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] patch adding new regexp functions
Neil Conway wrote: BTW, unless the patch is truly large, leaving the patch uncompressed makes it easier to eyeball it without leaving my mail client. That's just personal preference, though... FWIW, I just do a | zcat | less, which shows it uncompressed. No big deal. I don't know if Evolution can do that ... (it surprised me that Thunderbird doesn't seem to be capable of searching a message based on Message-Id). -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] patch adding new regexp functions
Alvaro Herrera wrote: Neil Conway wrote: BTW, unless the patch is truly large, leaving the patch uncompressed makes it easier to eyeball it without leaving my mail client. That's just personal preference, though... FWIW, I just do a | zcat | less, which shows it uncompressed. No big deal. I don't know if Evolution can do that ... (it surprised me that Thunderbird doesn't seem to be capable of searching a message based on Message-Id). My .mailcap calls a script runs 'file' and tries to display it properly automatically, and it worked fine for his attachment. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly