Re: [PATCH] Introduce array_shuffle() and array_sample()
Hi all, reading this blog post https://www.depesz.com/2023/04/18/waiting-for-postgresql-16-add-array_sample-and-array_shuffle-functions/ I became aware of the new feature and had a look at it and the commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=888f2ea0a81ff171087bdd1c5c1eeda3b78d73d4 To me the description /* * Shuffle array using Fisher-Yates algorithm. Scan the array and swap * current item (nelm datums starting at ielms) with a randomly chosen * later item (nelm datums starting at jelms) in each iteration. We can * stop once we've done n iterations; then first n items are the result. */ seems wrong. For n = 1 the returned item could never be the 1st item of the array (see "randomly chosen later item"). If this really is the case then the result is not really random. But to me it seems j later can be 0 (making it not really "later"), so this might only be a documentation issue. Best regards Salek Talangi Am Mi., 19. Apr. 2023 um 13:48 Uhr schrieb Daniel Gustafsson < dan...@yesql.se>: > > On 7 Apr 2023, at 17:47, Tom Lane wrote: > > > > Daniel Gustafsson writes: > >> Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch > like > >> this tomorrow. > > > > Since we're running out of time, I took the liberty of fixing and > > pushing this. > > Great, thanks! > > -- > Daniel Gustafsson
Re: [PATCH] Introduce array_shuffle() and array_sample()
> On 7 Apr 2023, at 17:47, Tom Lane wrote: > > Daniel Gustafsson writes: >> Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like >> this tomorrow. > > Since we're running out of time, I took the liberty of fixing and > pushing this. Great, thanks! -- Daniel Gustafsson
Re: [PATCH] Introduce array_shuffle() and array_sample()
Daniel Gustafsson writes: > Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like > this tomorrow. Since we're running out of time, I took the liberty of fixing and pushing this. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
> On 3 Apr 2023, at 23:46, Tom Lane wrote: > > Daniel Gustafsson writes: >> On 29 Sep 2022, at 21:33, Tom Lane wrote: >>> I find this behavior a bit surprising: >>> >>> +SELECT >>> array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], >>> 3)); >>> + array_dims >>> +- >>> + [-1:1][2:3] >>> +(1 row) >>> >>> I can buy preserving the lower bound in array_shuffle(), but >>> array_sample() is not preserving the first-dimension indexes of >>> the array, so ISTM it ought to reset the first lower bound to 1. > >> I might be daft but I'm not sure I follow why not preserving here, can you >> explain? > > Because array_sample selects only some of the (first level) array > elements, those elements are typically not going to have the same > indexes in the output as they did in the input. So I don't see why > it would be useful to preserve the same lower-bound index. It does > make sense to preserve the lower-order index bounds ([2:3] in this > example) because we are including or not including those array > slices as a whole. Ah, ok, now I see what you mean, thanks! I'll try to fix up the patch like this tomorrow. -- Daniel Gustafsson
Re: [PATCH] Introduce array_shuffle() and array_sample()
Daniel Gustafsson writes: > On 29 Sep 2022, at 21:33, Tom Lane wrote: >> I find this behavior a bit surprising: >> >> +SELECT >> array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], >> 3)); >> + array_dims >> +- >> + [-1:1][2:3] >> +(1 row) >> >> I can buy preserving the lower bound in array_shuffle(), but >> array_sample() is not preserving the first-dimension indexes of >> the array, so ISTM it ought to reset the first lower bound to 1. > I might be daft but I'm not sure I follow why not preserving here, can you > explain? Because array_sample selects only some of the (first level) array elements, those elements are typically not going to have the same indexes in the output as they did in the input. So I don't see why it would be useful to preserve the same lower-bound index. It does make sense to preserve the lower-order index bounds ([2:3] in this example) because we are including or not including those array slices as a whole. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
> On 29 Sep 2022, at 21:33, Tom Lane wrote: > > Martin Kalcher writes: >> New patch: array_shuffle() and array_sample() use pg_global_prng_state now. > > I took a closer look at the patch today. Since this seems pretty close to going in, and seems like quite useful functions, I took a look to see if I could get it across the line (although I noticed that CFM beat me to the clock in sending this =)). > I find this behavior a bit surprising: > > +SELECT > array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], > 3)); > + array_dims > +- > + [-1:1][2:3] > +(1 row) > > I can buy preserving the lower bound in array_shuffle(), but > array_sample() is not preserving the first-dimension indexes of > the array, so ISTM it ought to reset the first lower bound to 1. I might be daft but I'm not sure I follow why not preserving here, can you explain? The rest of your comments have been addressed in the attached v6 I think (although I'm pretty sure the docs part is just as bad now, explaining these in concise words is hard, will take another look with fresh eyes tomorrow). -- Daniel Gustafsson v6-0001-Introduce-array_shuffle-and-array_sample.patch Description: Binary data
Re: [PATCH] Introduce array_shuffle() and array_sample()
Given that there's been no updates since September 22 I'm going to make this patch Returned with Feedback. The patch can be resurrected when there's more work done -- don't forget to move it to the new CF when you do that. -- Gregory Stark As Commitfest Manager
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Thu, 29 Sept 2022 at 15:34, Tom Lane wrote: > > Martin Kalcher writes: > > New patch: array_shuffle() and array_sample() use pg_global_prng_state now. > > I took a closer look at the patch today. I find this behavior a bit > surprising: > It looks like this patch received useful feedback and it wouldn't take much to push it over the line. But it's been Waiting On Author since last September. Martin, any chance of getting these last bits of feedback resolved so it can be Ready for Commit? -- Gregory Stark As Commitfest Manager
Re: [PATCH] Introduce array_shuffle() and array_sample()
Martin Kalcher writes: > New patch: array_shuffle() and array_sample() use pg_global_prng_state now. I took a closer look at the patch today. I find this behavior a bit surprising: +SELECT array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3)); + array_dims +- + [-1:1][2:3] +(1 row) I can buy preserving the lower bound in array_shuffle(), but array_sample() is not preserving the first-dimension indexes of the array, so ISTM it ought to reset the first lower bound to 1. Some other comments: +Returns n randomly chosen elements from array in selection order. What's "selection order"? And this probably shouldn't just rely on the example to describe what happens with multi-D arrays. Writing "elements" seems somewhere between confusing and wrong. * Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n, and let it pull out what it needs. Less duplicate code that way. * I find array_shuffle_n drastically undercommented, and what comments it has are pretty misleading, eg + /* Swap all elements in item (i) with elements in item (j). */ j is *not* the index of the second item to be swapped. You could make it so, and that might be more readable: j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem - 1); jelms = elms + j * nelm; jnuls = nuls + j * nelm; But if you want the code to stay as it is, this comment needs work. * I think good style for SQL-callable C functions is to make the arguments clear at the top: +array_sample(PG_FUNCTION_ARGS) +{ +ArrayType *array = PG_GETARG_ARRAYTYPE_P(0); +int n = PG_GETARG_INT32(1); +ArrayType *result; +... other declarations as needed ... We can't quite make normal C declaration style work, but that's a poor excuse for not making the argument list visible as directly as possible. * I wouldn't bother with the PG_FREE_IF_COPY calls. Those are generally only used in btree comparison functions, in which there's a need to not leak memory. * I wonder if we really need these to be proparallel = 'r'. If we let a worker process execute them, they will take numbers from the worker's pg_global_prng_state seeding not the parent process's seeding, but why is that a problem? In the prior version where you were tying them to the parent's drandom() sequence, proparallel = 'r' made sense; but now I think it's unnecessary. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 28.09.22 um 16:18 schrieb Tom Lane: It is seeded at process start, yes. If you don't feel a need for user control over the sequence used by these functions, then using pg_global_prng_state is fine. (Basically the point to be made here is that we need to keep a tight rein on what can be affected by setseed().) regards, tom lane New patch: array_shuffle() and array_sample() use pg_global_prng_state now. MartinFrom b9433564f925521f5f6bcebd7cd74a3e12f4f354 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH v5] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. --- doc/src/sgml/func.sgml | 34 + src/backend/utils/adt/array_userfuncs.c | 176 src/include/catalog/pg_proc.dat | 6 + src/test/regress/expected/arrays.out| 54 src/test/regress/sql/arrays.sql | 14 ++ 5 files changed, 284 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e1fe4fec57..6b2a3d16f6 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18468,6 +18468,40 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array in selection order. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{1,2}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{1,2},{3,4}} + + + diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index ca70590d7d..4bf28da5e4 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -14,6 +14,7 @@ #include "catalog/pg_type.h" #include "common/int.h" +#include "common/pg_prng.h" #include "utils/array.h" #include "utils/builtins.h" #include "utils/lsyscache.h" @@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS) PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext)); } + +/* + * Produce array with n randomly chosen items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: number of items (must not be greater than the size of the arrays first dimension) + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtyp. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +i, +j, +k; + Datum elm, + *elms, + *ielms, + *jelms; + bool nul, + *nuls, + *inuls, + *jnuls; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + if (ndim < 1 || dims[0] < 1 || n < 1) + return construct_empty_array(elmtyp); + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &elms, &nuls, &nelm); + + nitem = dims[0]; /* total number of items */ + nelm /= nitem;/* number of elements per item */ + + ielms = elms; + inuls = nuls; + + /* + * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head + * (ielms) with an randomly chosen item (jelms) at each iteration. + */ + for (i = 0; i < n; i++) + { + j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, nitem - i - 1) * nelm; + jelms = ielms + j; + jnuls = inuls + j; + + /* Swap all elements in item (i) with elements in item (j). */ + for (k = 0; k < nelm; k++) + { + elm = *ielms; + nul = *inuls; + *ielms++ = *jelms; + *inuls++ = *jnuls; + *jelms++ = elm; + *jnuls++ = nul; + } + } + + memcpy(rdims, dims, ndim * sizeof(int)); + rdims[0] = n; + + result = construct_md_array(elms, nuls, ndim, rdims, lbs, +elmtyp, elmlen, elmbyval, elmalign); + + pfree(elms); + pfree(nuls); + + return result; +} + +/* + * Shuffle the elements of an array. + */ +Datum +array_shuffle(PG_FUNCTION_ARGS) +{ + ArrayType *array, + *result; + int16 elmlen; + bool elmbyval; + char elmalign; + Oid elmtyp; + TypeCacheEntry *typentry; + int n; + + array = PG_GETARG_ARRAYTYPE_P(0); + + /* Return empty array immediately. */ + if (ARR_NDIM(array) < 1) + PG_R
Re: [PATCH] Introduce array_shuffle() and array_sample()
Fabien COELHO writes: >> Thanks for your thoughts, Tom. I have a couple of questions. Should we >> introduce a new seed function for the new PRNG state, used by >> array_shuffle() >> and array_sample()? What would be a good name? Or should those functions use >> pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is >> seeded? > I'd suggest to use the existing global state. The global state should be > seeded at the process start, AFAIKR. It is seeded at process start, yes. If you don't feel a need for user control over the sequence used by these functions, then using pg_global_prng_state is fine. (Basically the point to be made here is that we need to keep a tight rein on what can be affected by setseed().) regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
With our current PRNG infrastructure it doesn't cost much to have a separate PRNG for every purpose. I don't object to having array_shuffle() and array_sample() share one PRNG, but I don't think it should go much further than that. Thanks for your thoughts, Tom. I have a couple of questions. Should we introduce a new seed function for the new PRNG state, used by array_shuffle() and array_sample()? What would be a good name? Or should those functions use pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is seeded? I'd suggest to use the existing global state. The global state should be seeded at the process start, AFAIKR. -- Fabien.
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 26.09.22 um 22:16 schrieb Tom Lane: With our current PRNG infrastructure it doesn't cost much to have a separate PRNG for every purpose. I don't object to having array_shuffle() and array_sample() share one PRNG, but I don't think it should go much further than that. Thanks for your thoughts, Tom. I have a couple of questions. Should we introduce a new seed function for the new PRNG state, used by array_shuffle() and array_sample()? What would be a good name? Or should those functions use pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is seeded? Martin
Re: [PATCH] Introduce array_shuffle() and array_sample()
Martin Kalcher writes: > [ v4-0001-Introduce-array_shuffle-and-array_sample.patch ] I think this idea of exporting drandom()'s PRNG for all and sundry to use is completely misguided. If we go down that path we'll be right back in the swamp that we were in when we used random(3), namely that (a) it's not clear what affects what, and (b) to the extent that there are such interferences, it could be bad, maybe even a security problem. We very intentionally decoupled drandom() from the rest of the world at commit 6645ad6bd, and I'm not ready to unlearn that lesson. With our current PRNG infrastructure it doesn't cost much to have a separate PRNG for every purpose. I don't object to having array_shuffle() and array_sample() share one PRNG, but I don't think it should go much further than that. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 22.09.22 um 17:23 schrieb Andres Freund: Hi, On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote: Patch update without merge conflicts. Due to the merge of the meson based build, this patch needs to be adjusted. See https://cirrus-ci.com/build/6580671765282816 Looks like it'd just be adding user_prng.c to src/backend/utils/adt/meson.build's list. Greetings, Andres Freund Hi Andres, thanks for the heads up. Adjusted patch is attached. - MartinFrom 3c4a939abf8d29bcf666e49ecb042ade42b36c2c Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH v4] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. The new functions share its prng state with random() and thus interoperate with setseed(). --- doc/src/sgml/func.sgml | 40 +- src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/array_userfuncs.c | 176 src/backend/utils/adt/float.c | 37 + src/backend/utils/adt/meson.build | 1 + src/backend/utils/adt/user_prng.c | 87 src/include/catalog/pg_proc.dat | 6 + src/include/utils/user_prng.h | 19 +++ src/test/regress/expected/arrays.out| 66 + src/test/regress/sql/arrays.sql | 16 +++ 10 files changed, 414 insertions(+), 35 deletions(-) create mode 100644 src/backend/utils/adt/user_prng.c create mode 100644 src/include/utils/user_prng.h diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e1fe4fec57..2a96fc323a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1820,7 +1820,8 @@ repeat('Pg', 4) PgPgPgPg void -Sets the seed for subsequent random() calls; +Sets the seed for subsequent calls to random functions, like +random() or array_shuffle(); argument must be between -1.0 and 1.0, inclusive @@ -1838,7 +1839,8 @@ repeat('Pg', 4) PgPgPgPg applications; see the module for a more secure alternative. If setseed() is called, the series of results of - subsequent random() calls in the current session + subsequent calls to random functions, like random() or + array_shuffle(), in the current session can be repeated by re-issuing setseed() with the same argument. Without any prior setseed() call in the same @@ -18468,6 +18470,40 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array in selection order. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{1,2}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{1,2},{3,4}} + + + diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 0de0bbb1b8..17b57a5569 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -110,6 +110,7 @@ OBJS = \ tsvector.o \ tsvector_op.o \ tsvector_parser.o \ + user_prng.o \ uuid.o \ varbit.o \ varchar.o \ diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index ca70590d7d..c4a2117df7 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -17,6 +17,7 @@ #include "utils/array.h" #include "utils/builtins.h" #include "utils/lsyscache.h" +#include "utils/user_prng.h" #include "utils/typcache.h" @@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS) PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext)); } + +/* + * Produce array with n randomly chosen items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: number of items (must not be greater than the size of the arrays first dimension) + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtyp. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +i, +j, +k; + Datum elm, + *elms, + *ielms, + *jelms; + bool nul, + *
Re: [PATCH] Introduce array_shuffle() and array_sample()
Hi, On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote: > Patch update without merge conflicts. Due to the merge of the meson based build, this patch needs to be adjusted. See https://cirrus-ci.com/build/6580671765282816 Looks like it'd just be adding user_prng.c to src/backend/utils/adt/meson.build's list. Greetings, Andres Freund
Re: [PATCH] Introduce array_shuffle() and array_sample()
Patch update without merge conflicts. MartinFrom 0ecffcf3ed2eb59d045941b69bb86a34b93f3391 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH v3] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. The new functions share its prng state with random() and thus interoperate with setseed(). --- doc/src/sgml/func.sgml | 40 +- src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/array_userfuncs.c | 176 src/backend/utils/adt/float.c | 37 + src/backend/utils/adt/user_prng.c | 87 src/include/catalog/pg_proc.dat | 6 + src/include/utils/user_prng.h | 19 +++ src/test/regress/expected/arrays.out| 66 + src/test/regress/sql/arrays.sql | 16 +++ 9 files changed, 413 insertions(+), 35 deletions(-) create mode 100644 src/backend/utils/adt/user_prng.c create mode 100644 src/include/utils/user_prng.h diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 053d4dc650..ef7c001fe7 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1820,7 +1820,8 @@ repeat('Pg', 4) PgPgPgPg void -Sets the seed for subsequent random() calls; +Sets the seed for subsequent calls to random functions, like +random() or array_shuffle(); argument must be between -1.0 and 1.0, inclusive @@ -1838,7 +1839,8 @@ repeat('Pg', 4) PgPgPgPg applications; see the module for a more secure alternative. If setseed() is called, the series of results of - subsequent random() calls in the current session + subsequent calls to random functions, like random() or + array_shuffle(), in the current session can be repeated by re-issuing setseed() with the same argument. Without any prior setseed() call in the same @@ -19398,6 +19400,40 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array in selection order. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{1,2}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{1,2},{3,4}} + + + diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 7c722ea2ce..42b65e58bb 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -109,6 +109,7 @@ OBJS = \ tsvector.o \ tsvector_op.o \ tsvector_parser.o \ + user_prng.o \ uuid.o \ varbit.o \ varchar.o \ diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index ca70590d7d..c4a2117df7 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -17,6 +17,7 @@ #include "utils/array.h" #include "utils/builtins.h" #include "utils/lsyscache.h" +#include "utils/user_prng.h" #include "utils/typcache.h" @@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS) PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext)); } + +/* + * Produce array with n randomly chosen items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: number of items (must not be greater than the size of the arrays first dimension) + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtyp. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +i, +j, +k; + Datum elm, + *elms, + *ielms, + *jelms; + bool nul, + *nuls, + *inuls, + *jnuls; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + if (ndim < 1 || dims[0] < 1 || n < 1) + return construct_empty_array(elmtyp); + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &elms, &nuls, &nelm); + + nitem = dims[0]; /* total number of items */ + nelm /= nitem;/* number of elements per item */ + + ielms = elms; + inuls = nuls; + + /* + * Shuffle array usin
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 24.07.22 um 21:42 schrieb Fabien COELHO: Duno. I'm still wondering what it should do. I'm pretty sure that the documentation should be clear about a shared seed, if any. I do not think that departing from the standard is a good thing, either. Are sure it violates the standard? I could not find anything about it. The private prng state for random() was introduced in 2018 [0]. Neither commit nor discussion mentions any standard compliance. [0] https://www.postgresql.org/message-id/E1gdNAo-00036g-TB%40gemulon.postgresql.org I updated the documentation for setseed(). If someone wants a limit, they can easily "LEAST(#1 dim size, other limit)" to get it, it is easy enough with a strict function. Convinced. It errors out now if n is out of bounds. MartinFrom afb7c022abd26b82a4fd3611313a83f144909554 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH v2] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. The new functions share its prng state with random() and thus interoperate with setseed(). --- doc/src/sgml/func.sgml | 40 +- src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/array_userfuncs.c | 180 src/backend/utils/adt/float.c | 37 + src/backend/utils/adt/user_prng.c | 87 src/include/catalog/pg_proc.dat | 6 + src/include/utils/user_prng.h | 19 +++ src/test/regress/expected/arrays.out| 66 + src/test/regress/sql/arrays.sql | 16 +++ 9 files changed, 417 insertions(+), 35 deletions(-) create mode 100644 src/backend/utils/adt/user_prng.c create mode 100644 src/include/utils/user_prng.h diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..8c51a17eda 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1820,7 +1820,8 @@ repeat('Pg', 4) PgPgPgPg void -Sets the seed for subsequent random() calls; +Sets the seed for subsequent calls to random functions, like +random() or array_shuffle(); argument must be between -1.0 and 1.0, inclusive @@ -1838,7 +1839,8 @@ repeat('Pg', 4) PgPgPgPg applications; see the module for a more secure alternative. If setseed() is called, the series of results of - subsequent random() calls in the current session + subsequent calls to random functions, like random() or + array_shuffle(), in the current session can be repeated by re-issuing setseed() with the same argument. @@ -19395,6 +19397,40 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array in selection order. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{1,2}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{1,2},{3,4}} + + + diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 7c722ea2ce..42b65e58bb 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -109,6 +109,7 @@ OBJS = \ tsvector.o \ tsvector_op.o \ tsvector_parser.o \ + user_prng.o \ uuid.o \ varbit.o \ varchar.o \ diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index ca70590d7d..d62f1aa4ef 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -17,6 +17,7 @@ #include "utils/array.h" #include "utils/builtins.h" #include "utils/lsyscache.h" +#include "utils/user_prng.h" #include "utils/typcache.h" @@ -902,3 +903,182 @@ array_positions(PG_FUNCTION_ARGS) PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext)); } + +/* + * Produce array with n randomly chosen items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: number of items (must not be greater than the size of the arrays first dimension) + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtyp. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, cha
Re: [PATCH] Introduce array_shuffle() and array_sample()
Hello, Thank you for your feedback. I attached a patch, that addresses most of your points. I'll look into it. It would help if the patch could include a version number at the end. Should the exchange be skipped when i == k? The additional branch is actually slower (on my machine, tested with an 10M element int array) for 1D-arrays, which i assume is the most common case. Even with a 2D-array with a subarray size of 1M there is no difference in execution time. i == k should be relatively rare for large arrays, given a good prng. Ok, slower is bad. The random and array shuffling functions share a common state. I'm wondering whether it should ot should not be so. It seems ok, but then ISTM that the documentation suggests implicitely that setseed applies to random() only, which is not the case anymore after the patch. I really like the idea of a prng state owned by the user, that is used by all user facing random functions, but see that their might be concerns about this change. I would update the setseed() documentation, if this proposal is accepted. Do you think my patch should already include the documentation update? Duno. I'm still wondering what it should do. I'm pretty sure that the documentation should be clear about a shared seed, if any. I do not think that departing from the standard is a good thing, either. If more samples are required than the number of elements, it does not error out. I'm wondering whether it should. The second argument to array_sample() is treated like a LIMIT, rather than the actual number of elements. I updated the documentation. My use-case is: take max random items from an array of unknown size. Hmmm. ISTM that the use case of wanting "this many" stuff would make more sense because it is strictier so less likely to result in unforseen consequences. On principle I do not like not knowing the output size. If someone wants a limit, they can easily "LEAST(#1 dim size, other limit)" to get it, it is easy enough with a strict function. Also, the sampling should not return its result in order when the number of elements required is the full array, ISTM that it should be shuffled there as well. You are the second person asking for this. It's done. I am thinking about ditching array_sample() and replace it with a second signature for array_shuffle(): array_shuffle(array anyarray) -> anyarray array_shuffle(array anyarray, limit int) -> anyarray What do you think? I think that shuffle means shuffle, not partial shuffle, so a different name seems better to me. -- Fabien.
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 24.07.22 um 10:15 schrieb Fabien COELHO: My 0,02€ about this patch: Thank you for your feedback. I attached a patch, that addresses most of your points. Use (void) when declaring no parameters in headers or in functions. Done. Should the exchange be skipped when i == k? The additional branch is actually slower (on my machine, tested with an 10M element int array) for 1D-arrays, which i assume is the most common case. Even with a 2D-array with a subarray size of 1M there is no difference in execution time. i == k should be relatively rare for large arrays, given a good prng. I do not see the point of having *only* inline functions in a c file. The external functions should not be inlined? Done. The random and array shuffling functions share a common state. I'm wondering whether it should ot should not be so. It seems ok, but then ISTM that the documentation suggests implicitely that setseed applies to random() only, which is not the case anymore after the patch. I really like the idea of a prng state owned by the user, that is used by all user facing random functions, but see that their might be concerns about this change. I would update the setseed() documentation, if this proposal is accepted. Do you think my patch should already include the documentation update? If more samples are required than the number of elements, it does not error out. I'm wondering whether it should. The second argument to array_sample() is treated like a LIMIT, rather than the actual number of elements. I updated the documentation. My use-case is: take max random items from an array of unknown size. Also, the sampling should not return its result in order when the number of elements required is the full array, ISTM that it should be shuffled there as well. You are the second person asking for this. It's done. I am thinking about ditching array_sample() and replace it with a second signature for array_shuffle(): array_shuffle(array anyarray) -> anyarray array_shuffle(array anyarray, limit int) -> anyarray What do you think? I must say that without significant knowledge of the array internal implementation, the swap code looks pretty strange. ISTM that the code would be clearer if pointers and array references style were not intermixed. Done. Went with pointers. Maybe you could add a test with a 3D array? Some sample with NULLs? Done. Unrelated: I notice again that (postgre)SQL does not provide a way to generate random integers. I do not see why not. Should we provide one? Maybe. I think it is outside the scope of this patch. Thank you, MartinFrom d4f3df99cbc4451f7907a7c759a7590d63d8388c Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses max n elements from an array by random. The new functions share its prng state with random() and thus interoperate with setseed(). --- doc/src/sgml/func.sgml | 34 + src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/array_userfuncs.c | 181 src/backend/utils/adt/float.c | 37 + src/backend/utils/adt/user_prng.c | 87 src/include/catalog/pg_proc.dat | 6 + src/include/utils/user_prng.h | 19 +++ src/test/regress/expected/arrays.out| 70 + src/test/regress/sql/arrays.sql | 16 +++ 9 files changed, 418 insertions(+), 33 deletions(-) create mode 100644 src/backend/utils/adt/user_prng.c create mode 100644 src/include/utils/user_prng.h diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..c44db6c652 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19395,6 +19395,40 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, limit integer ) +anyarray + + +Returns max limit randomly chosen elements from array in selection order. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{1,2}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{1,2},{3,4}} + + + diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 7c722ea2ce..42b65e58bb 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -109,6 +109,7 @@ OBJS = \ tsvector.o \ tsvector_op.o \ tsvector_parser.o \ + user_prng.o \ uuid.o \ varbit.o \ varchar.o \ diff --git a/src/backend/utils/adt/
Re: [PATCH] Introduce array_shuffle() and array_sample()
i came to the same conclusions and went with Option 1 (see patch). Mainly because most code in utils/adt is organized by type and this way it is clear, that this is a thin wrapper around pg_prng. Small patch update. I realized the new functions should live array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and added some comments to the code. My 0,02€ about this patch: Use (void) when declaring no parameters in headers or in functions. Should the exchange be skipped when i == k? I do not see the point of having *only* inline functions in a c file. The external functions should not be inlined? The random and array shuffling functions share a common state. I'm wondering whether it should ot should not be so. It seems ok, but then ISTM that the documentation suggests implicitely that setseed applies to random() only, which is not the case anymore after the patch. If more samples are required than the number of elements, it does not error out. I'm wondering whether it should. Also, the sampling should not return its result in order when the number of elements required is the full array, ISTM that it should be shuffled there as well. I must say that without significant knowledge of the array internal implementation, the swap code looks pretty strange. ISTM that the code would be clearer if pointers and array references style were not intermixed. Maybe you could add a test with a 3D array? Some sample with NULLs? Unrelated: I notice again that (postgre)SQL does not provide a way to generate random integers. I do not see why not. Should we provide one? -- Fabien.
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 22.07.22 um 11:31 schrieb Martin Kalcher: i came to the same conclusions and went with Option 1 (see patch). Mainly because most code in utils/adt is organized by type and this way it is clear, that this is a thin wrapper around pg_prng. Small patch update. I realized the new functions should live array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and added some comments to the code.From 138777531961c31250074d1c0d87417e31d63656 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. The new functions share its prng state with random() and thus interoperate with setseed(). --- doc/src/sgml/func.sgml | 35 + src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/array_userfuncs.c | 182 src/backend/utils/adt/float.c | 37 + src/backend/utils/adt/user_prng.c | 87 +++ src/include/catalog/pg_proc.dat | 6 + src/include/utils/user_prng.h | 19 +++ src/test/regress/expected/arrays.out| 58 src/test/regress/sql/arrays.sql | 14 ++ 9 files changed, 406 insertions(+), 33 deletions(-) create mode 100644 src/backend/utils/adt/user_prng.c create mode 100644 src/include/utils/user_prng.h diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..87df8a9c81 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array. +The order of the elements in resulting array is unspecified. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{1,2}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{1,2},{3,4}} + + + diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 7c722ea2ce..42b65e58bb 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -109,6 +109,7 @@ OBJS = \ tsvector.o \ tsvector_op.o \ tsvector_parser.o \ + user_prng.o \ uuid.o \ varbit.o \ varchar.o \ diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c index ca70590d7d..e3ba14a17c 100644 --- a/src/backend/utils/adt/array_userfuncs.c +++ b/src/backend/utils/adt/array_userfuncs.c @@ -17,6 +17,7 @@ #include "utils/array.h" #include "utils/builtins.h" #include "utils/lsyscache.h" +#include "utils/user_prng.h" #include "utils/typcache.h" @@ -902,3 +903,184 @@ array_positions(PG_FUNCTION_ARGS) PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext)); } + +/* + * Produce array with max n randomly chosen items from given array in + * random order. + * + * array: array object to examine (must not be NULL) + * n: max number of items + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtyp. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +rnitem, +i, +j, +k; + Datum elm, + *elms, + *relms; + bool nul, + *nuls, + *rnuls; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + if (ndim < 1 || dims[0] < 1 || n < 1) + return construct_empty_array(elmtyp); + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &relms, &rnuls, &nelm); + + nitem = dims[0]; /* total number of items */ + rnitem = Min(n, nitem); /* target number of items */ + nelm /= nitem;/* number of elements per item */ + + elms = relms; + nuls = rnuls; + + /* + * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly + * chosen item and increment head. + */ + for (i = 0; i < rnitem; i++) + { + k = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm; + + /* Swap all elements in head with elements in item k. */ + for (j = 0;
Re: [PATCH] Introduce array_shuffle() and array_sample()
On 7/19/22 10:20, Tom Lane wrote: Everything else either explicitly rejects more-than-one-D arrays or does something that is compatible with thinking of them as arrays-of-arrays. I think I am responsible for at least some of those, and I agree that thinking of MD arrays as arrays-of-arrays is preferable even though they are not actually that. Long ago[1] Peter E asked me to fix that as I recall but it was one of those round tuits that I never found. So I withdraw my original position. These functions should just shuffle or select in the array's first dimension, preserving subarrays. Or else be lazy and reject more-than-one-D arrays; but it's probably not that hard to handle them. +1 Joe [1] https://www.postgresql.org/message-id/flat/Pine.LNX.4.44.0306281418020.2178-10%40peter.localdomain#a064d6dd8593993d799db453a3ee04d1 -- Joe Conway RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Fri, 22 Jul 2022 at 10:31, Martin Kalcher wrote: > > i came to the same conclusions and went with Option 1 (see patch). > Mainly because most code in utils/adt is organized by type and this way > it is clear, that this is a thin wrapper around pg_prng. > > What do you think? Looks fairly neat, on a quick read-through. There's certainly something to be said for preserving the organisation by type. Regards, Dean
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 22.07.22 um 09:59 schrieb Dean Rasheed:> Option 1: Keep random.c small - Responsible for initialisation of the user prng on demand - Expose the user prng state to other code like float.c and arrayfuncs.c Option 2: Move all random functions wanting to use the user prng to random.c - Starting with drandom() and setseed() - Then, array_shuffle() and array_sample() - Later, any new SQL-callable random functions we might add - Keep the user prng state local to random.c Hey Dean, i came to the same conclusions and went with Option 1 (see patch). Mainly because most code in utils/adt is organized by type and this way it is clear, that this is a thin wrapper around pg_prng. What do you think?From ceda50f1f7f7e0c123de9b2ce2cc7b5d2b2b7db6 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. The new functions share its prng state with random() and thus interoperate with setseed(). --- doc/src/sgml/func.sgml | 35 + src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/arrayfuncs.c | 189 ++- src/backend/utils/adt/float.c| 37 +- src/backend/utils/adt/user_prng.c| 88 + src/include/catalog/pg_proc.dat | 6 + src/include/utils/user_prng.h| 21 +++ src/test/regress/expected/arrays.out | 58 src/test/regress/sql/arrays.sql | 14 ++ 9 files changed, 415 insertions(+), 34 deletions(-) create mode 100644 src/backend/utils/adt/user_prng.c create mode 100644 src/include/utils/user_prng.h diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..6b96897244 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array. +The order of the elements in resulting array is unspecified. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{3,4}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{3,4},{1,2}} + + + diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile index 7c722ea2ce..42b65e58bb 100644 --- a/src/backend/utils/adt/Makefile +++ b/src/backend/utils/adt/Makefile @@ -109,6 +109,7 @@ OBJS = \ tsvector.o \ tsvector_op.o \ tsvector_parser.o \ + user_prng.o \ uuid.o \ varbit.o \ varchar.o \ diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index fb167f226a..3a40ccc362 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -32,10 +32,10 @@ #include "utils/fmgroids.h" #include "utils/lsyscache.h" #include "utils/memutils.h" +#include "utils/user_prng.h" #include "utils/selfuncs.h" #include "utils/typcache.h" - /* * GUC parameter */ @@ -6872,3 +6872,190 @@ trim_array(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +/* + * Produce array with max n random items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: max number of items + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtype. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +i, +j, +k; + Datum elm, + *elms, + *relms; + bool nul, + *nuls, + *rnuls; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + if (ndim < 1 || dims[0] < 1 || n < 1) + return construct_empty_array(elmtyp); + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &relms, &rnuls, &nelm); + + /* Calculate number of elements per item. */ + nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1; + elms = relms; + nuls = rnuls; + nitem = dims[0]; + n = Min(n, nitem); + + /* + * Shuffle array using Fisher-Yates
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Thu, 21 Jul 2022 at 16:43, Martin Kalcher wrote: > > Am 21.07.22 um 14:25 schrieb Dean Rasheed: > > > > I'm inclined to say that we want a new pg_global_prng_user_state that > > is updated by setseed(), and used by random(), array_shuffle(), > > array_sample(), and any other user-facing random functions we add > > later. > > > > I like the idea. How would you organize the code? I imagine some sort of > user prng that is encapsulated in something like utils/adt/random.c and > is guaranteed to always be seeded. > Something like that, perhaps. I can see 2 ways it could go: Option 1: Keep random.c small - Responsible for initialisation of the user prng on demand - Expose the user prng state to other code like float.c and arrayfuncs.c Option 2: Move all random functions wanting to use the user prng to random.c - Starting with drandom() and setseed() - Then, array_shuffle() and array_sample() - Later, any new SQL-callable random functions we might add - Keep the user prng state local to random.c Option 2 seems quite appealing, because it keeps all SQL-callable random functions together in one place, and keeps the state local, making it easier to keep track of which functions are using it. Code like the Fisher-Yates algorithm is more to do with random than it is to do with arrays, so putting it in random.c seems to make more sense. It's possible that some hypothetical random function might need access to type-specific internals. For example, if we wanted to add a function to return a random numeric, it would probably have to go in numeric.c, but that could be solved by having random.c call numeric.c, passing it the prng state. Regards, Dean
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 21.07.22 um 10:41 schrieb Dean Rasheed: It's important to mark these new functions as VOLATILE, not IMMUTABLE, otherwise they won't work as expected in queries. See https://www.postgresql.org/docs/current/xfunc-volatility.html It would be better to use pg_prng_uint64_range() rather than rand() to pick elements. Partly, that's because it uses a higher quality PRNG, with a larger internal state, and it ensures that the results are unbiased across the range. But more importantly, it interoperates with setseed(), allowing predictable sequences of "random" numbers to be generated -- something that's useful in writing repeatable regression tests. Assuming these new functions are made to interoperate with setseed(), which I think they should be, then they also need to be marked as PARALLEL RESTRICTED, rather than PARALLEL SAFE. See https://www.postgresql.org/docs/current/parallel-safety.html, which explains why setseed() and random() are parallel restricted. Here is an updated patch that marks the functions VOLATILE PARALLEL RESTRICTED and uses pg_prng_uint64_range() rather than rand().From 26676802f05d00c31e0b2d5997f61734aa421fca Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. Shuffling of arrays can already be accomplished with SQL using unnest() and array_agg(order by random()). But that is very slow compared to the new functions. In addition the new functions are dimension aware. --- doc/src/sgml/func.sgml | 35 + src/backend/utils/adt/arrayfuncs.c | 189 ++- src/include/catalog/pg_proc.dat | 6 + src/test/regress/expected/arrays.out | 60 + src/test/regress/sql/arrays.sql | 17 +++ 5 files changed, 306 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..6b96897244 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array. +The order of the elements in resulting array is unspecified. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{3,4}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{3,4},{1,2}} + + + diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index fb167f226a..64da980348 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -34,7 +34,7 @@ #include "utils/memutils.h" #include "utils/selfuncs.h" #include "utils/typcache.h" - +#include "common/pg_prng.h" /* * GUC parameter @@ -6872,3 +6872,190 @@ trim_array(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +/* + * Produce array with max n random items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: max number of items + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtype. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +i, +j, +k; + Datum elm, + *elms, + *relms; + bool nul, + *nuls, + *rnuls; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + if (ndim < 1 || dims[0] < 1 || n < 1) + return construct_empty_array(elmtyp); + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &relms, &rnuls, &nelm); + + /* Calculate number of elements per item. */ + nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1; + elms = relms; + nuls = rnuls; + nitem = dims[0]; + n = Min(n, nitem); + + /* + * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly + * chosen item and increment head. + */ + for (i = 0; i < n; i++) + { + k = (int) pg_prng_uint64_range(&pg_global_prng_state, 0,
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 21.07.22 um 14:25 schrieb Dean Rasheed: I'm inclined to say that we want a new pg_global_prng_user_state that is updated by setseed(), and used by random(), array_shuffle(), array_sample(), and any other user-facing random functions we add later. I like the idea. How would you organize the code? I imagine some sort of user prng that is encapsulated in something like utils/adt/random.c and is guaranteed to always be seeded. Martin
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Thu, 21 Jul 2022 at 12:15, Martin Kalcher wrote: > > I agree that we should use pg_prng_uint64_range(). However, in order to > achieve interoperability with setseed() we would have to use > drandom_seed (rather than pg_global_prng_state) as rng state, which is > declared statically in float.c and exclusively used by random(). Do we > want to expose drandom_seed to other functions? > Ah, I didn't realise that setseed() and random() were bound up so tightly. It does feel as though, if we're adding more user-facing functions that return random sequences, there ought to be a way to seed them, and I wouldn't want to have separate setseed functions for each one. I'm inclined to say that we want a new pg_global_prng_user_state that is updated by setseed(), and used by random(), array_shuffle(), array_sample(), and any other user-facing random functions we add later. I can also see that others might not like expanding the scope of setseed() in this way. Regards, Dean
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 21.07.22 um 10:41 schrieb Dean Rasheed: A couple of quick comments on the current patch: Thank you for your feedback! It's important to mark these new functions as VOLATILE, not IMMUTABLE, otherwise they won't work as expected in queries. See https://www.postgresql.org/docs/current/xfunc-volatility.html CREATE FUNCTION marks functions as VOLATILE by default if not explicitly marked otherwise. I assumed function definitions in pg_proc.dat have the same behavior. I will fix that. https://www.postgresql.org/docs/current/sql-createfunction.html It would be better to use pg_prng_uint64_range() rather than rand() to pick elements. Partly, that's because it uses a higher quality PRNG, with a larger internal state, and it ensures that the results are unbiased across the range. But more importantly, it interoperates with setseed(), allowing predictable sequences of "random" numbers to be generated -- something that's useful in writing repeatable regression tests. I agree that we should use pg_prng_uint64_range(). However, in order to achieve interoperability with setseed() we would have to use drandom_seed (rather than pg_global_prng_state) as rng state, which is declared statically in float.c and exclusively used by random(). Do we want to expose drandom_seed to other functions? Assuming these new functions are made to interoperate with setseed(), which I think they should be, then they also need to be marked as PARALLEL RESTRICTED, rather than PARALLEL SAFE. See https://www.postgresql.org/docs/current/parallel-safety.html, which explains why setseed() and random() are parallel restricted. As mentioned above, i assumed the default here is PARALLEL UNSAFE. I'll fix that. In my experience, the requirement for sampling with replacement is about as common as the requirement for sampling without replacement, so it seems odd to provide one but not the other. Of course, we can always add a with-replacement function later, and give it a different name. But maybe array_sample() could be used for both, with a "with_replacement" boolean parameter? We can also add a "with_replacement" boolean parameter which is false by default to array_sample() later. I do not have a strong opinion about that and will implement it, if desired. Any opinions about default/no-default? Martin
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Tue, 19 Jul 2022 at 21:21, Martin Kalcher wrote: > > Here is a patch with dimension aware array_shuffle() and array_sample(). > +1 for this feature, and this way of handling multi-dimensional arrays. > If you think array_flatten() is desirable, i can take a look at it. That's not something I've ever wanted -- personally, I rarely use multi-dimensional arrays in Postgres. A couple of quick comments on the current patch: It's important to mark these new functions as VOLATILE, not IMMUTABLE, otherwise they won't work as expected in queries. See https://www.postgresql.org/docs/current/xfunc-volatility.html It would be better to use pg_prng_uint64_range() rather than rand() to pick elements. Partly, that's because it uses a higher quality PRNG, with a larger internal state, and it ensures that the results are unbiased across the range. But more importantly, it interoperates with setseed(), allowing predictable sequences of "random" numbers to be generated -- something that's useful in writing repeatable regression tests. Assuming these new functions are made to interoperate with setseed(), which I think they should be, then they also need to be marked as PARALLEL RESTRICTED, rather than PARALLEL SAFE. See https://www.postgresql.org/docs/current/parallel-safety.html, which explains why setseed() and random() are parallel restricted. In my experience, the requirement for sampling with replacement is about as common as the requirement for sampling without replacement, so it seems odd to provide one but not the other. Of course, we can always add a with-replacement function later, and give it a different name. But maybe array_sample() could be used for both, with a "with_replacement" boolean parameter? Regards, Dean
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 19.07.22 um 16:20 schrieb Tom Lane: So I withdraw my original position. These functions should just shuffle or select in the array's first dimension, preserving subarrays. Or else be lazy and reject more-than-one-D arrays; but it's probably not that hard to handle them. Here is a patch with dimension aware array_shuffle() and array_sample(). If you think array_flatten() is desirable, i can take a look at it. Maybe a second parameter would be nice to specify the target dimension: select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 1); --- {1,2,3,4,5,6,7,8} select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 2); --- {{1,2,3,4},{5,6,7,8}} MartinFrom 2aa6d72ff0f4d8835ee2f09f8cdf16b7e8005e56 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles the elements of an array. * array_sample() chooses n elements from an array by random. Shuffling of arrays can already be accomplished with SQL using unnest() and array_agg(order by random()). But that is very slow compared to the new functions. In addition the new functions are dimension aware. --- doc/src/sgml/func.sgml | 35 + src/backend/utils/adt/arrayfuncs.c | 189 +++ src/include/catalog/pg_proc.dat | 6 + src/test/regress/expected/arrays.out | 60 + src/test/regress/sql/arrays.sql | 17 +++ 5 files changed, 307 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..6b96897244 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array. +The order of the elements in resulting array is unspecified. + + +array_sample(ARRAY[[1,2],[3,4],[5,6]], 2) +{{5,6},{3,4}} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the first dimension of the array. + + +array_shuffle(ARRAY[[1,2],[3,4],[5,6]]) +{{5,6},{3,4},{1,2}} + + + diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index fb167f226a..3de1b0db30 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -6872,3 +6872,192 @@ trim_array(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +/* + * Produce array with max n random items from given array in random order. + * + * array: array object to examine (must not be NULL) + * n: max number of items + * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items + * + * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info + * from the system catalogs, given the elmtype. However, the caller is + * in a better position to cache this info across multiple uses, or even + * to hard-wire values if the element type is hard-wired. + */ +static ArrayType * +array_shuffle_n(ArrayType *array, int n, +Oid elmtyp, int elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + int ndim, + *dims, + *lbs, +rdims[MAXDIM], +nelm, +nitem, +i, +j, +k; + Datum elm, + *elms, + *relms; + bool nul, + *nuls, + *rnuls; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + lbs = ARR_LBOUND(array); + + if (ndim < 1 || dims[0] < 1 || n < 1) + return construct_empty_array(elmtyp); + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &relms, &rnuls, &nelm); + + /* Calculate number of elements per item. */ + nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1; + elms = relms; + nuls = rnuls; + nitem = dims[0]; + n = Min(n, nitem); + + /* + * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly + * chosen item and increment head. + */ + for (i = 0; i < n; i++) + { + k = (rand() % (nitem - i)) * nelm; + + /* Swap all elements in head with elements in item k. */ + for (j = 0; j < nelm; j++) + { + elm = *elms; + nul = *nuls; + *elms = elms[k]; + *nuls = nuls[k]; + elms[k] = elm; + nuls[k] = nul; + elms++; + nuls++; + } + } + + memcpy(rdims, dims, ndim * sizeof(int)); + rdims[0] = n; + + result = construct_md_array(relms, rnuls, ndim, rdims, lbs, +elmtyp, elmlen, elmbyval, elmalign); + + pfree(relms); + pfree(rnuls); + + return result; +} + +/* + * Shuffle the elements of an array. + */ +Datum +array_shuffle(PG_FUNCTION_ARGS) +{ + ArrayType *array, + *result; + int16 elmlen; + bool elmbyval; + char el
Re: [PATCH] Introduce array_shuffle() and array_sample()
Robert Haas writes: > On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan wrote: >> Why not have an optional second parameter for array_shuffle that >> indicates whether or not to flatten? e.g. array_shuffle(my_array, >> flatten => true) > IMHO, if we think that's something many people are going to want, it > would be better to add an array_flatten() function, because that could > be used for a variety of purposes, whereas an option to this function > can only be used for this function. Agreed. Whether it's really needed, I dunno --- I don't recall the issue having come up before. After taking a second look through https://www.postgresql.org/docs/current/functions-array.html it seems like the things that treat arrays as flat even when they are multi-D are just * unnest(), which is more or less forced into that position by our type system: it has to be anyarray returning anyelement, not anyarray returning anyelement-or-anyarray-depending. * array_to_string(), which maybe could have done it differently but can reasonably be considered a variant of unnest(). * The containment/overlap operators, which are kind of their own thing anyway. Arguably they got this wrong, though I suppose it's too late to rethink that. Everything else either explicitly rejects more-than-one-D arrays or does something that is compatible with thinking of them as arrays-of-arrays. So I withdraw my original position. These functions should just shuffle or select in the array's first dimension, preserving subarrays. Or else be lazy and reject more-than-one-D arrays; but it's probably not that hard to handle them. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan wrote: > > Having thought about it, i would go with (2). It gives the user the > > ability to decide wether or not array-of-arrays behavior is desired. > > If he wants the behavior of (1) he can flatten the array before > > applying array_shuffle(). Unfortunately there is no array_flatten() > > function (at the moment) and the user would have to work around it > > with unnest() and array_agg(). > > Why not have an optional second parameter for array_shuffle that > indicates whether or not to flatten? e.g. array_shuffle(my_array, > flatten => true) IMHO, if we think that's something many people are going to want, it would be better to add an array_flatten() function, because that could be used for a variety of purposes, whereas an option to this function can only be used for this function. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Introduce array_shuffle() and array_sample()
On 2022-07-19 Tu 07:15, Martin Kalcher wrote: > Am 18.07.22 um 23:48 schrieb Martin Kalcher: >> >> If we go with (1) array_shuffle() and array_sample() should shuffle >> each element individually and always return a one-dimensional array. >> >> select array_shuffle('{{1,2},{3,4},{5,6}}'); >> --- >> {1,4,3,5,6,2} >> >> select array_sample('{{1,2},{3,4},{5,6}}', 3); >> -- >> {1,4,3} >> >> If we go with (2) both functions should only operate on the first >> dimension and shuffle whole subarrays and keep the dimensions intact. >> >> select array_shuffle('{{1,2},{3,4},{5,6}}'); >> - >> {{3,4},{1,2},{5,6}} >> >> select array_sample('{{1,2},{3,4},{5,6}}', 2); >> --- >> {{3,4},{1,2}} >> > > Having thought about it, i would go with (2). It gives the user the > ability to decide wether or not array-of-arrays behavior is desired. > If he wants the behavior of (1) he can flatten the array before > applying array_shuffle(). Unfortunately there is no array_flatten() > function (at the moment) and the user would have to work around it > with unnest() and array_agg(). > > Why not have an optional second parameter for array_shuffle that indicates whether or not to flatten? e.g. array_shuffle(my_array, flatten => true) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Mon, Jul 18, 2022 at 6:43 PM Tom Lane wrote: > Um ... why is "the order in which the elements were chosen" a concept > we want to expose? ISTM sample() is a black box in which notionally > the decisions could all be made at once. I agree with that. But I also think it's fine for the elements to be returned in a shuffled order rather than the original order. > > I really think this function needs to grow an algorithm argument that can > > be used to specify stuff like ordering, replacement/without-replacement, > > etc...just some enums separated by commas that can be added to the call. > > I think you might run out of gold paint somewhere around here. I'm > still not totally convinced we should bother with the sample() function > at all, let alone that it needs algorithm variants. At some point we > say to the user "here's a PL, write what you want for yourself". I don't know what gold paint has to do with anything here, but I agree that David's proposal seems to be moving the goalposts a very long way. The thing is, as Martin points out, these functions already exist in a bunch of other systems. For one example I've used myself, see https://underscorejs.org/ I probably wouldn't have called a function to put a list into a random order "shuffle" in a vacuum, but it seems to be common nomenclature these days. I believe that if you don't make reference to Fisher-Yates in the documentation, they kick you out of the cool programmers club. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 18.07.22 um 23:48 schrieb Martin Kalcher: If we go with (1) array_shuffle() and array_sample() should shuffle each element individually and always return a one-dimensional array. select array_shuffle('{{1,2},{3,4},{5,6}}'); --- {1,4,3,5,6,2} select array_sample('{{1,2},{3,4},{5,6}}', 3); -- {1,4,3} If we go with (2) both functions should only operate on the first dimension and shuffle whole subarrays and keep the dimensions intact. select array_shuffle('{{1,2},{3,4},{5,6}}'); - {{3,4},{1,2},{5,6}} select array_sample('{{1,2},{3,4},{5,6}}', 2); --- {{3,4},{1,2}} Having thought about it, i would go with (2). It gives the user the ability to decide wether or not array-of-arrays behavior is desired. If he wants the behavior of (1) he can flatten the array before applying array_shuffle(). Unfortunately there is no array_flatten() function (at the moment) and the user would have to work around it with unnest() and array_agg().
Re: [PATCH] Introduce array_shuffle() and array_sample()
Hi Martin, I didn't look at the code yet but I very much like the idea. Many thanks for working on this! It's a pity your patch was too late for the July commitfest. In September, please keep an eye on cfbot [1] to make sure your patch applies properly. > As Tom's investigation showed, there is no consensus in the code if > multi-dimensional arrays are treated as arrays-of-arrays or not. We need > to decide what should be the correct treatment. Here are my two cents. >From the user perspective I would expect that by default a multidimensional array should be treated as an array of arrays. So for instance: ``` array_shuffle([ [1,2], [3,4], [5,6] ]) ``` ... should return something like: ``` [ [3,4], [1,2], [5,6] ] ``` Note that the order of the elements in the internal arrays is preserved. However, I believe there should be an optional argument that overrides this behavior. For instance: ``` array_shuffle([ [1,2], [3,4], [5,6] ], depth => 2) ``` BTW, while on it, shouldn't we add similar functions for JSON and/or JSONB? Or is this going to be too much for a single discussion? [1]: http://cfbot.cputube.org/ -- Best regards, Aleksander Alekseev
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 19.07.22 um 00:52 schrieb Martin Kalcher: On the contrary! I am pretty sure there are people out there wanting sampling-without-shuffling. I will think about that. I gave it some thought. Even though there might be use cases, where a stable order is desired, i would consider them edge cases, not worth the additional complexity. I personally would not expect array_sample() to return elements in any specific order. I looked up some sample() implementations. None of them makes guarantees about the order of the resulting array or explicitly states that the resulting array is in random or selection order. - Python random.sample [0] - Ruby Array#sample [1] - Rust rand::seq::SliceRandom::choose_multiple [2] - Julia StatsBase.sample [3] stable order needs explicit request [0] https://docs.python.org/3/library/random.html#random.sample [1] https://ruby-doc.org/core-3.0.0/Array.html#method-i-sample [2] https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple [3] https://juliastats.org/StatsBase.jl/stable/sampling/#StatsBase.sample
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Tue, Jul 19, 2022 at 8:15 AM Martin Kalcher wrote: > Am 18.07.22 um 21:29 schrieb Tom Lane: > > The preferred thing to do is to add it to our "commitfest" queue, > > which will ensure that it gets looked at eventually. The currently > > open cycle is 2022-09 [1] (see the "New Patch" button there). > > Thanks Tom, did that. FYI that brought your patch to the attention of this CI robot, which is showing a couple of warnings. See the FAQ link there for an explanation of that infrastructure. http://cfbot.cputube.org/martin-kalcher.html
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 19.07.22 um 00:18 schrieb Tom Lane: Independently of the dimensionality question --- I'd imagined that array_sample would select a random subset of the array elements but keep their order intact. If you want the behavior shown above, you can do array_shuffle(array_sample(...)). But if we randomize it, and that's not what the user wanted, she has no recourse. Now, if you're convinced that the set of people wanting sampling-without-shuffling is the empty set, then making everybody else call two functions is a loser. But I'm not convinced. At the least, I'd like to see the argument made why nobody would want that. On the contrary! I am pretty sure there are people out there wanting sampling-without-shuffling. I will think about that.
Re: [PATCH] Introduce array_shuffle() and array_sample()
"David G. Johnston" writes: > On Mon, Jul 18, 2022 at 3:18 PM Tom Lane wrote: >> Independently of the dimensionality question --- I'd imagined that >> array_sample would select a random subset of the array elements >> but keep their order intact. If you want the behavior shown >> above, you can do array_shuffle(array_sample(...)). But if we >> randomize it, and that's not what the user wanted, she has no >> recourse. > And for those that want to know in what order those elements were chosen > they have no recourse in the other setup. Um ... why is "the order in which the elements were chosen" a concept we want to expose? ISTM sample() is a black box in which notionally the decisions could all be made at once. > I really think this function needs to grow an algorithm argument that can > be used to specify stuff like ordering, replacement/without-replacement, > etc...just some enums separated by commas that can be added to the call. I think you might run out of gold paint somewhere around here. I'm still not totally convinced we should bother with the sample() function at all, let alone that it needs algorithm variants. At some point we say to the user "here's a PL, write what you want for yourself". regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Mon, Jul 18, 2022 at 3:18 PM Tom Lane wrote: > > Independently of the dimensionality question --- I'd imagined that > array_sample would select a random subset of the array elements > but keep their order intact. If you want the behavior shown > above, you can do array_shuffle(array_sample(...)). But if we > randomize it, and that's not what the user wanted, she has no > recourse. > > And for those that want to know in what order those elements were chosen they have no recourse in the other setup. I really think this function needs to grow an algorithm argument that can be used to specify stuff like ordering, replacement/without-replacement, etc...just some enums separated by commas that can be added to the call. David J.
Re: [PATCH] Introduce array_shuffle() and array_sample()
Martin Kalcher writes: > If we go with (1) array_shuffle() and array_sample() should shuffle each > element individually and always return a one-dimensional array. >select array_shuffle('{{1,2},{3,4},{5,6}}'); >--- > {1,4,3,5,6,2} >select array_sample('{{1,2},{3,4},{5,6}}', 3); >-- > {1,4,3} Independently of the dimensionality question --- I'd imagined that array_sample would select a random subset of the array elements but keep their order intact. If you want the behavior shown above, you can do array_shuffle(array_sample(...)). But if we randomize it, and that's not what the user wanted, she has no recourse. Now, if you're convinced that the set of people wanting sampling-without-shuffling is the empty set, then making everybody else call two functions is a loser. But I'm not convinced. At the least, I'd like to see the argument made why nobody would want that. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 18.07.22 um 23:03 schrieb Tom Lane: I wrote: Martin had originally proposed (2), which I rejected on the grounds that we don't treat multi-dimensional arrays as arrays-of-arrays for any other purpose. Actually, after poking at it for awhile, that's an overstatement. It's true that the type system doesn't think N-D arrays are arrays-of-arrays, but there are individual functions/operators that do. Thanks Robert for pointing out the inconsistent behavior of array_sample(). That needs to be fixed. As Tom's investigation showed, there is no consensus in the code if multi-dimensional arrays are treated as arrays-of-arrays or not. We need to decide what should be the correct treatment. If we go with (1) array_shuffle() and array_sample() should shuffle each element individually and always return a one-dimensional array. select array_shuffle('{{1,2},{3,4},{5,6}}'); --- {1,4,3,5,6,2} select array_sample('{{1,2},{3,4},{5,6}}', 3); -- {1,4,3} If we go with (2) both functions should only operate on the first dimension and shuffle whole subarrays and keep the dimensions intact. select array_shuffle('{{1,2},{3,4},{5,6}}'); - {{3,4},{1,2},{5,6}} select array_sample('{{1,2},{3,4},{5,6}}', 2); --- {{3,4},{1,2}} I do not feel qualified to make that decision. (2) complicates the code a bit, but that should not be the main argument here. Martin
Re: [PATCH] Introduce array_shuffle() and array_sample()
I wrote: > Martin had originally proposed (2), which I rejected on the grounds > that we don't treat multi-dimensional arrays as arrays-of-arrays for > any other purpose. Actually, after poking at it for awhile, that's an overstatement. It's true that the type system doesn't think N-D arrays are arrays-of-arrays, but there are individual functions/operators that do. For instance, @> is in the its-a-flat-list-of-elements camp: regression=# select array[[1,2],[3,4]] @> array[1,3]; ?column? -- t (1 row) but || wants to preserve dimensionality: regression=# select array[[1,2],[3,4]] || array[1]; ERROR: cannot concatenate incompatible arrays DETAIL: Arrays with differing dimensions are not compatible for concatenation. Various other functions dodge the issue by refusing to work on arrays of more than one dimension. There seem to be more functions that think arrays are flat than otherwise, but it's not as black-and-white as I was thinking. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
Robert Haas writes: > On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher > wrote: >> array_shuffle(anyarray) -> anyarray >> array_sample(anyarray, integer) -> anyarray > I think it's questionable whether the behavior of array_shuffle() is > correct for a multi-dimensional array. The implemented behavior is to > keep the dimensions as they were, but permute the elements across all > levels at random. But there are at least two other behaviors that seem > potentially defensible: (1) always return a 1-dimensional array, (2) > shuffle the sub-arrays at the top-level without the possibility of > moving elements within or between sub-arrays. What behavior we decide > is best here should be documented. Martin had originally proposed (2), which I rejected on the grounds that we don't treat multi-dimensional arrays as arrays-of-arrays for any other purpose. Maybe we should have, but that ship sailed decades ago, and I doubt that shuffle() is the place to start changing it. I could get behind your option (1) though, to make it clearer that the input array's dimensionality is not of interest. Especially since, as you say, (1) is pretty much the only sensible choice for array_sample. > I also think you should add test cases involving multi-dimensional > arrays, as well as arrays with non-default bounds. e.g. trying > shuffling or sampling some values like > '[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[] This'd only matter if we decide not to ignore the input's dimensionality. regards, tom lane
Re: [PATCH] Introduce array_shuffle() and array_sample()
On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher wrote: > Thanks for all your feedback and help. I got a patch that i consider > ready for review. It introduces two new functions: > >array_shuffle(anyarray) -> anyarray >array_sample(anyarray, integer) -> anyarray > > array_shuffle() shuffles an array (obviously). array_sample() picks n > random elements from an array. I like this idea. I think it's questionable whether the behavior of array_shuffle() is correct for a multi-dimensional array. The implemented behavior is to keep the dimensions as they were, but permute the elements across all levels at random. But there are at least two other behaviors that seem potentially defensible: (1) always return a 1-dimensional array, (2) shuffle the sub-arrays at the top-level without the possibility of moving elements within or between sub-arrays. What behavior we decide is best here should be documented. array_sample() will return elements in random order when sample_size < array_size, but in the original order when sample_size >= array_size. Similarly, it will always return a 1-dimensional array in the former case, but will keep the original dimensions in the latter case. That seems pretty hard to defend. I think it should always return a 1-dimensional array with elements in random order, and I think this should be documented. I also think you should add test cases involving multi-dimensional arrays, as well as arrays with non-default bounds. e.g. trying shuffling or sampling some values like '[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[] -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Introduce array_shuffle() and array_sample()
Am 18.07.22 um 21:29 schrieb Tom Lane: The preferred thing to do is to add it to our "commitfest" queue, which will ensure that it gets looked at eventually. The currently open cycle is 2022-09 [1] (see the "New Patch" button there). Thanks Tom, did that. I am not sure if "SQL Commands" is the correct topic for this patch, but i guess its not that important. I am impressed by all the infrastructure build around this project. Martin
Re: [PATCH] Introduce array_shuffle() and array_sample()
Martin Kalcher writes: > Is someone interested in looking at it? What are the next steps? The preferred thing to do is to add it to our "commitfest" queue, which will ensure that it gets looked at eventually. The currently open cycle is 2022-09 [1] (see the "New Patch" button there). I believe you have to have signed up as a community member[2] in order to put yourself in as the patch author. I think "New Patch" will work anyway, but it's better to have an author listed. regards, tom lane [1] https://commitfest.postgresql.org/39/ [2] https://www.postgresql.org/account/login/
[PATCH] Introduce array_shuffle() and array_sample()
Thanks for all your feedback and help. I got a patch that i consider ready for review. It introduces two new functions: array_shuffle(anyarray) -> anyarray array_sample(anyarray, integer) -> anyarray array_shuffle() shuffles an array (obviously). array_sample() picks n random elements from an array. Is someone interested in looking at it? What are the next steps? MartinFrom 5498bb2d9f1fab4cad56cd0d3a6eeafa24a26c8e Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] Introduce array_shuffle() and array_sample() * array_shuffle() shuffles to elements of an array. * array_sample() chooses n elements from an array by random. Shuffling of arrays can already be accomplished with SQL using unnest() and array_agg(order by random()). But that is very slow compared to the new functions. --- doc/src/sgml/func.sgml | 34 ++ src/backend/utils/adt/arrayfuncs.c | 163 +++ src/include/catalog/pg_proc.dat | 6 + src/test/regress/expected/arrays.out | 30 + src/test/regress/sql/arrays.sql | 11 ++ 5 files changed, 244 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b6783b7ad0..c676031b4a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19395,6 +19395,40 @@ SELECT NULLIF(value, '(none)') ... + + + + array_sample + +array_sample ( array anyarray, n integer ) +anyarray + + +Returns n randomly chosen elements from array. + + +array_sample(ARRAY[1,2,3,4,5,6], 3) +{4,5,1} + + + + + + + array_shuffle + +array_shuffle ( anyarray ) +anyarray + + +Shuffles the elements of the array. + + +array_shuffle(ARRAY[1,2,3,4,5,6]) +{4,5,1,3,2,6} + + + diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index fb167f226a..a6769a8ebf 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -6872,3 +6872,166 @@ trim_array(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +/* + * Shuffle the elements of an array. + */ +Datum +array_shuffle(PG_FUNCTION_ARGS) +{ + ArrayType *array, + *result; + int16 elmlen; + bool elmbyval; + char elmalign; + Oid elmtyp; + TypeCacheEntry *typentry; + Datum *elms, +elm; + bool *nuls, +nul; + int nelms, +i, +j; + + array = PG_GETARG_ARRAYTYPE_P(0); + nelms = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array)); + + /* There is no point in shuffling arrays with less then two elements. */ + if (nelms < 2) + PG_RETURN_ARRAYTYPE_P(array); + + elmtyp = ARR_ELEMTYPE(array); + typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra; + + if (typentry == NULL || typentry->type_id != elmtyp) + { + typentry = lookup_type_cache(elmtyp, 0); + fcinfo->flinfo->fn_extra = (void *) typentry; + } + elmlen = typentry->typlen; + elmbyval = typentry->typbyval; + elmalign = typentry->typalign; + + deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign, + &elms, &nuls, &nelms); + + /* Shuffle elements and nulls using Fisher-Yates shuffle algorithm. */ + for (i = nelms - 1; i > 0; i--) + { + j = random() % (i + 1); + elm = elms[i]; + nul = nuls[i]; + elms[i] = elms[j]; + nuls[i] = nuls[j]; + elms[j] = elm; + nuls[j] = nul; + } + + result = construct_md_array(elms, nuls, +ARR_NDIM(array), ARR_DIMS(array), ARR_LBOUND(array), +elmtyp, elmlen, elmbyval, elmalign); + + pfree(elms); + pfree(nuls); + PG_FREE_IF_COPY(array, 0); + + PG_RETURN_ARRAYTYPE_P(result); +} + +/* + * Choose N random elements from an array. + */ +Datum +array_sample(PG_FUNCTION_ARGS) +{ + ArrayType *array, + *result; + int16 elmlen; + bool elmbyval; + char elmalign; + Oid elmtyp; + TypeCacheEntry *typentry; + Datum *elms, +elm; + bool *nuls, +nul; + int nelms, +rnelms, +rdims[1], +rlbs[1], +i, +j; + + array = PG_GETARG_ARRAYTYPE_P(0); + nelms = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array)); + elmtyp = ARR_ELEMTYPE(array); + rnelms = PG_GETARG_INT32(1); + + if (rnelms < 0) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("second parameter must not be negative"))); + + /* Return an empty array if the requested sample size is zero. */ + if (rnelms == 0) + { + PG_FREE_IF_COPY(array, 0); + PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtyp)); + } + + /* + * Return the original array if the requested sample size is greater than + * or equal to its own size. + */ + if (rnelms >= nelms) + PG_RETURN_ARRAYTYPE_P(array); + + typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra; + + if (typentry == NULL || typentry->type_id !=