subject:"\[PATCH\] Introduce array_shuffle\(\) and array

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-19 Thread Salek Talangi

 Hi all,

reading this blog post
https://www.depesz.com/2023/04/18/waiting-for-postgresql-16-add-array_sample-and-array_shuffle-functions/
I became aware of the new feature and had a look at it and the commit
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=888f2ea0a81ff171087bdd1c5c1eeda3b78d73d4

To me the description
   /*
* Shuffle array using Fisher-Yates algorithm.  Scan the array and swap
* current item (nelm datums starting at ielms) with a randomly chosen
* later item (nelm datums starting at jelms) in each iteration.  We can
* stop once we've done n iterations; then first n items are the result.
*/

seems wrong. For n = 1 the returned item could never be the 1st item of the
array (see "randomly chosen later item").
If this really is the case then the result is not really random. But to me
it seems j later can be 0 (making it not really "later"), so this might
only be a documentation issue.

Best regards
Salek Talangi

Am Mi., 19. Apr. 2023 um 13:48 Uhr schrieb Daniel Gustafsson <
dan...@yesql.se>:

> > On 7 Apr 2023, at 17:47, Tom Lane  wrote:
> >
> > Daniel Gustafsson  writes:
> >> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch
> like
> >> this tomorrow.
> >
> > Since we're running out of time, I took the liberty of fixing and
> > pushing this.
>
> Great, thanks!
>
> --
> Daniel Gustafsson

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-07 Thread Daniel Gustafsson

> On 7 Apr 2023, at 17:47, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
>> this tomorrow.
> 
> Since we're running out of time, I took the liberty of fixing and
> pushing this.

Great, thanks!

--
Daniel Gustafsson

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-07 Thread Tom Lane

Daniel Gustafsson  writes:
> Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
> this tomorrow.

Since we're running out of time, I took the liberty of fixing and
pushing this.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-03 Thread Daniel Gustafsson

> On 3 Apr 2023, at 23:46, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> On 29 Sep 2022, at 21:33, Tom Lane  wrote:
>>> I find this behavior a bit surprising:
>>> 
>>> +SELECT 
>>> array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 
>>> 3));
>>> + array_dims  
>>> +-
>>> + [-1:1][2:3]
>>> +(1 row)
>>> 
>>> I can buy preserving the lower bound in array_shuffle(), but
>>> array_sample() is not preserving the first-dimension indexes of
>>> the array, so ISTM it ought to reset the first lower bound to 1.
> 
>> I might be daft but I'm not sure I follow why not preserving here, can you
>> explain?
> 
> Because array_sample selects only some of the (first level) array
> elements, those elements are typically not going to have the same
> indexes in the output as they did in the input.  So I don't see why
> it would be useful to preserve the same lower-bound index.  It does
> make sense to preserve the lower-order index bounds ([2:3] in this
> example) because we are including or not including those array
> slices as a whole.

Ah, ok, now I see what you mean, thanks!  I'll try to fix up the patch like
this tomorrow.

--
Daniel Gustafsson

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-03 Thread Tom Lane

Daniel Gustafsson  writes:
> On 29 Sep 2022, at 21:33, Tom Lane  wrote:
>> I find this behavior a bit surprising:
>> 
>> +SELECT 
>> array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 
>> 3));
>> + array_dims  
>> +-
>> + [-1:1][2:3]
>> +(1 row)
>> 
>> I can buy preserving the lower bound in array_shuffle(), but
>> array_sample() is not preserving the first-dimension indexes of
>> the array, so ISTM it ought to reset the first lower bound to 1.

> I might be daft but I'm not sure I follow why not preserving here, can you
> explain?

Because array_sample selects only some of the (first level) array
elements, those elements are typically not going to have the same
indexes in the output as they did in the input.  So I don't see why
it would be useful to preserve the same lower-bound index.  It does
make sense to preserve the lower-order index bounds ([2:3] in this
example) because we are including or not including those array
slices as a whole.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-03 Thread Daniel Gustafsson

> On 29 Sep 2022, at 21:33, Tom Lane  wrote:
> 
> Martin Kalcher  writes:
>> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.
> 
> I took a closer look at the patch today.

Since this seems pretty close to going in, and seems like quite useful
functions, I took a look to see if I could get it across the line (although I
noticed that CFM beat me to the clock in sending this =)).

>  I find this behavior a bit surprising:
> 
> +SELECT 
> array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 
> 3));
> + array_dims  
> +-
> + [-1:1][2:3]
> +(1 row)
> 
> I can buy preserving the lower bound in array_shuffle(), but
> array_sample() is not preserving the first-dimension indexes of
> the array, so ISTM it ought to reset the first lower bound to 1.

I might be daft but I'm not sure I follow why not preserving here, can you
explain?

The rest of your comments have been addressed in the attached v6 I think
(although I'm pretty sure the docs part is just as bad now, explaining these in
concise words is hard, will take another look with fresh eyes tomorrow).

--
Daniel Gustafsson

v6-0001-Introduce-array_shuffle-and-array_sample.patch
Description: Binary data

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-04-03 Thread Gregory Stark (as CFM)

Given that there's been no updates since September 22 I'm going to
make this patch Returned with Feedback. The patch can be resurrected
when there's more work done -- don't forget to move it to the new CF
when you do that.


-- 
Gregory Stark
As Commitfest Manager

Re: [PATCH] Introduce array_shuffle() and array_sample()

2023-03-20 Thread Gregory Stark (as CFM)

On Thu, 29 Sept 2022 at 15:34, Tom Lane  wrote:
>
> Martin Kalcher  writes:
> > New patch: array_shuffle() and array_sample() use pg_global_prng_state now.
>
> I took a closer look at the patch today.  I find this behavior a bit
> surprising:
>

It looks like this patch received useful feedback and it wouldn't take
much to push it over the line. But it's been Waiting On Author since
last September.

Martin, any chance of getting these last bits of feedback resolved so
it can be Ready for Commit?

-- 
Gregory Stark
As Commitfest Manager

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-29 Thread Tom Lane

Martin Kalcher  writes:
> New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

I took a closer look at the patch today.  I find this behavior a bit
surprising:

+SELECT 
array_dims(array_sample('[-1:2][2:3]={{1,2},{3,NULL},{5,6},{7,8}}'::int[], 3));
+ array_dims  
+-
+ [-1:1][2:3]
+(1 row)

I can buy preserving the lower bound in array_shuffle(), but
array_sample() is not preserving the first-dimension indexes of
the array, so ISTM it ought to reset the first lower bound to 1.

Some other comments:

+Returns n randomly chosen elements from 
array in selection order.

What's "selection order"?  And this probably shouldn't just rely
on the example to describe what happens with multi-D arrays.
Writing "elements" seems somewhere between confusing and wrong.

* Personally I think I'd pass the TypeCacheEntry pointer to array_shuffle_n,
and let it pull out what it needs.  Less duplicate code that way.

* I find array_shuffle_n drastically undercommented, and what comments
it has are pretty misleading, eg

+   /* Swap all elements in item (i) with elements in item (j). */

j is *not* the index of the second item to be swapped.  You could make
it so, and that might be more readable:

j = (int) pg_prng_uint64_range(&pg_global_prng_state, i, nitem 
- 1);
jelms = elms + j * nelm;
jnuls = nuls + j * nelm;

But if you want the code to stay as it is, this comment needs work.

* I think good style for SQL-callable C functions is to make the arguments
clear at the top:

+array_sample(PG_FUNCTION_ARGS)
+{
+ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
+int n = PG_GETARG_INT32(1);
+ArrayType  *result;
+... other declarations as needed ...

We can't quite make normal C declaration style work, but that's a poor
excuse for not making the argument list visible as directly as possible.

* I wouldn't bother with the PG_FREE_IF_COPY calls.  Those are generally
only used in btree comparison functions, in which there's a need to not
leak memory.

* I wonder if we really need these to be proparallel = 'r'.  If we let
a worker process execute them, they will take numbers from the worker's
pg_global_prng_state seeding not the parent process's seeding, but why
is that a problem?  In the prior version where you were tying them
to the parent's drandom() sequence, proparallel = 'r' made sense;
but now I think it's unnecessary.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-29 Thread Martin Kalcher


Am 28.09.22 um 16:18 schrieb Tom Lane:

It is seeded at process start, yes.  If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine.  (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

regards, tom lane


New patch: array_shuffle() and array_sample() use pg_global_prng_state now.

MartinFrom b9433564f925521f5f6bcebd7cd74a3e12f4f354 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v5] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.
---
 doc/src/sgml/func.sgml  |  34 +
 src/backend/utils/adt/array_userfuncs.c | 176 
 src/include/catalog/pg_proc.dat |   6 +
 src/test/regress/expected/arrays.out|  54 
 src/test/regress/sql/arrays.sql |  14 ++
 5 files changed, 284 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e1fe4fec57..6b2a3d16f6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18468,6 +18468,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array in selection order.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..4bf28da5e4 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -14,6 +14,7 @@
 
 #include "catalog/pg_type.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Iterate array and swap head
+	 * (ielms) with an randomly chosen item (jelms) at each iteration.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, nitem - i - 1) * nelm;
+		jelms = ielms + j;
+		jnuls = inuls + j;
+
+		/* Swap all elements in item (i) with elements in item (j). */
+		for (k = 0; k < nelm; k++)
+		{
+			elm = *ielms;
+			nul = *inuls;
+			*ielms++ = *jelms;
+			*inuls++ = *jnuls;
+			*jelms++ = elm;
+			*jnuls++ = nul;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(elms, nuls, ndim, rdims, lbs,
+elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	int			n;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	/* Return empty array immediately. */
+	if (ARR_NDIM(array) < 1)
+		PG_R

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-28 Thread Tom Lane

Fabien COELHO  writes:
>> Thanks for your thoughts, Tom. I have a couple of questions. Should we 
>> introduce a new seed function for the new PRNG state, used by 
>> array_shuffle() 
>> and array_sample()? What would be a good name? Or should those functions use 
>> pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is 
>> seeded?

> I'd suggest to use the existing global state. The global state should be 
> seeded at the process start, AFAIKR.

It is seeded at process start, yes.  If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine.  (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-28 Thread Fabien COELHO





With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose.  I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.


Thanks for your thoughts, Tom. I have a couple of questions. Should we 
introduce a new seed function for the new PRNG state, used by array_shuffle() 
and array_sample()? What would be a good name? Or should those functions use 
pg_global_prng_state? Is it safe to assume, that pg_global_prng_state is 
seeded?


I'd suggest to use the existing global state. The global state should be 
seeded at the process start, AFAIKR.


--
Fabien.

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-28 Thread Martin Kalcher


Am 26.09.22 um 22:16 schrieb Tom Lane:


With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose.  I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.



Thanks for your thoughts, Tom. I have a couple of questions. Should we 
introduce a new seed function for the new PRNG state, used by 
array_shuffle() and array_sample()? What would be a good name? Or should 
those functions use pg_global_prng_state? Is it safe to assume, that 
pg_global_prng_state is seeded?


Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-26 Thread Tom Lane

Martin Kalcher  writes:
> [ v4-0001-Introduce-array_shuffle-and-array_sample.patch ]

I think this idea of exporting drandom()'s PRNG for all and sundry
to use is completely misguided.  If we go down that path we'll
be right back in the swamp that we were in when we used random(3),
namely that (a) it's not clear what affects what, and (b) to the
extent that there are such interferences, it could be bad, maybe
even a security problem.  We very intentionally decoupled drandom()
from the rest of the world at commit 6645ad6bd, and I'm not ready
to unlearn that lesson.

With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose.  I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-22 Thread Martin Kalcher


Am 22.09.22 um 17:23 schrieb Andres Freund:

Hi,

On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote:

Patch update without merge conflicts.


Due to the merge of the meson based build, this patch needs to be adjusted. See
https://cirrus-ci.com/build/6580671765282816
Looks like it'd just be adding user_prng.c to
src/backend/utils/adt/meson.build's list.

Greetings,

Andres Freund


Hi Andres,

thanks for the heads up. Adjusted patch is attached.

- MartinFrom 3c4a939abf8d29bcf666e49ecb042ade42b36c2c Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v4] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml  |  40 +-
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/array_userfuncs.c | 176 
 src/backend/utils/adt/float.c   |  37 +
 src/backend/utils/adt/meson.build   |   1 +
 src/backend/utils/adt/user_prng.c   |  87 
 src/include/catalog/pg_proc.dat |   6 +
 src/include/utils/user_prng.h   |  19 +++
 src/test/regress/expected/arrays.out|  66 +
 src/test/regress/sql/arrays.sql |  16 +++
 10 files changed, 414 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e1fe4fec57..2a96fc323a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1820,7 +1820,8 @@ repeat('Pg', 4) PgPgPgPg
 void


-Sets the seed for subsequent random() calls;
+Sets the seed for subsequent calls to random functions, like
+random() or array_shuffle();
 argument must be between -1.0 and 1.0, inclusive


@@ -1838,7 +1839,8 @@ repeat('Pg', 4) PgPgPgPg
applications; see the  module for a more
secure alternative.
If setseed() is called, the series of results of
-   subsequent random() calls in the current session
+   subsequent calls to random functions, like random() or
+   array_shuffle(), in the current session
can be repeated by re-issuing setseed() with the same
argument.
Without any prior setseed() call in the same
@@ -18468,6 +18470,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array in selection order.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 0de0bbb1b8..17b57a5569 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -110,6 +110,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..c4a2117df7 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-22 Thread Andres Freund

Hi,

On 2022-08-04 07:46:10 +0200, Martin Kalcher wrote:
> Patch update without merge conflicts.

Due to the merge of the meson based build, this patch needs to be adjusted. See
https://cirrus-ci.com/build/6580671765282816
Looks like it'd just be adding user_prng.c to
src/backend/utils/adt/meson.build's list.

Greetings,

Andres Freund

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-08-03 Thread Martin Kalcher


Patch update without merge conflicts.

MartinFrom 0ecffcf3ed2eb59d045941b69bb86a34b93f3391 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v3] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml  |  40 +-
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/array_userfuncs.c | 176 
 src/backend/utils/adt/float.c   |  37 +
 src/backend/utils/adt/user_prng.c   |  87 
 src/include/catalog/pg_proc.dat |   6 +
 src/include/utils/user_prng.h   |  19 +++
 src/test/regress/expected/arrays.out|  66 +
 src/test/regress/sql/arrays.sql |  16 +++
 9 files changed, 413 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 053d4dc650..ef7c001fe7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1820,7 +1820,8 @@ repeat('Pg', 4) PgPgPgPg
 void


-Sets the seed for subsequent random() calls;
+Sets the seed for subsequent calls to random functions, like
+random() or array_shuffle();
 argument must be between -1.0 and 1.0, inclusive


@@ -1838,7 +1839,8 @@ repeat('Pg', 4) PgPgPgPg
applications; see the  module for a more
secure alternative.
If setseed() is called, the series of results of
-   subsequent random() calls in the current session
+   subsequent calls to random functions, like random() or
+   array_shuffle(), in the current session
can be repeated by re-issuing setseed() with the same
argument.
Without any prior setseed() call in the same
@@ -19398,6 +19400,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array in selection order.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..c4a2117df7 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,178 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *ielms,
+			   *jelms;
+	bool		nul,
+			   *nuls,
+			   *inuls,
+			   *jnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &elms, &nuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	nelm /= nitem;/* number of elements per item */
+
+	ielms = elms;
+	inuls = nuls;
+
+	/*
+	 * Shuffle array usin

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-25 Thread Martin Kalcher


Am 24.07.22 um 21:42 schrieb Fabien COELHO:


Duno. I'm still wondering what it should do. I'm pretty sure that the 
documentation should be clear about a shared seed, if any. I do not 
think that departing from the standard is a good thing, either.


Are sure it violates the standard? I could not find anything about it. 
The private prng state for random() was introduced in 2018 [0]. Neither 
commit nor discussion mentions any standard compliance.


[0] 
https://www.postgresql.org/message-id/E1gdNAo-00036g-TB%40gemulon.postgresql.org


I updated the documentation for setseed().

If someone wants a limit, they can easily "LEAST(#1 dim size, other 
limit)" to get it, it is easy enough with a strict function.


Convinced. It errors out now if n is out of bounds.

MartinFrom afb7c022abd26b82a4fd3611313a83f144909554 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH v2] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml  |  40 +-
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/array_userfuncs.c | 180 
 src/backend/utils/adt/float.c   |  37 +
 src/backend/utils/adt/user_prng.c   |  87 
 src/include/catalog/pg_proc.dat |   6 +
 src/include/utils/user_prng.h   |  19 +++
 src/test/regress/expected/arrays.out|  66 +
 src/test/regress/sql/arrays.sql |  16 +++
 9 files changed, 417 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..8c51a17eda 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1820,7 +1820,8 @@ repeat('Pg', 4) PgPgPgPg
 void


-Sets the seed for subsequent random() calls;
+Sets the seed for subsequent calls to random functions, like
+random() or array_shuffle();
 argument must be between -1.0 and 1.0, inclusive


@@ -1838,7 +1839,8 @@ repeat('Pg', 4) PgPgPgPg
applications; see the  module for a more
secure alternative.
If setseed() is called, the series of results of
-   subsequent random() calls in the current session
+   subsequent calls to random functions, like random() or
+   array_shuffle(), in the current session
can be repeated by re-issuing setseed() with the same
argument.
   
@@ -19395,6 +19397,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array in selection order.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..d62f1aa4ef 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,182 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with n randomly chosen items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: number of items (must not be greater than the size of the arrays first dimension)
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, cha

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-24 Thread Fabien COELHO




Hello,

Thank you for your feedback. I attached a patch, that addresses most of your 
points.


I'll look into it. It would help if the patch could include a version 
number at the end.



Should the exchange be skipped when i == k?


The additional branch is actually slower (on my machine, tested with an 10M 
element int array) for 1D-arrays, which i assume is the most common case. 
Even with a 2D-array with a subarray size of 1M there is no difference in 
execution time. i == k should be relatively rare for large arrays, given a 
good prng.


Ok, slower is bad.

The random and array shuffling functions share a common state. I'm 
wondering whether it should ot should not be so. It seems ok, but then ISTM 
that the documentation suggests implicitely that setseed applies to 
random() only, which is not the case anymore after the patch.


I really like the idea of a prng state owned by the user, that is used by all 
user facing random functions, but see that their might be concerns about this 
change. I would update the setseed() documentation, if this proposal is 
accepted. Do you think my patch should already include the documentation 
update?


Duno. I'm still wondering what it should do. I'm pretty sure that the 
documentation should be clear about a shared seed, if any. I do not think 
that departing from the standard is a good thing, either.


If more samples are required than the number of elements, it does not error 
out. I'm wondering whether it should.


The second argument to array_sample() is treated like a LIMIT, rather than 
the actual number of elements. I updated the documentation. My use-case is: 
take max random items from an array of unknown size.


Hmmm. ISTM that the use case of wanting "this many" stuff would make more 
sense because it is strictier so less likely to result in unforseen 
consequences. On principle I do not like not knowing the output size.


If someone wants a limit, they can easily "LEAST(#1 dim size, other 
limit)" to get it, it is easy enough with a strict function.


Also, the sampling should not return its result in order when the number of 
elements required is the full array, ISTM that it should be shuffled there 
as well.


You are the second person asking for this. It's done. I am thinking about 
ditching array_sample() and replace it with a second signature for 
array_shuffle():


 array_shuffle(array anyarray) -> anyarray
 array_shuffle(array anyarray, limit int) -> anyarray

What do you think?


I think that shuffle means shuffle, not partial shuffle, so a different 
name seems better to me.


--
Fabien.

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-24 Thread Martin Kalcher


Am 24.07.22 um 10:15 schrieb Fabien COELHO:


My 0,02€ about this patch:


Thank you for your feedback. I attached a patch, that addresses most of 
your points.



Use (void) when declaring no parameters in headers or in functions.


Done.


Should the exchange be skipped when i == k?


The additional branch is actually slower (on my machine, tested with an 
10M element int array) for 1D-arrays, which i assume is the most common 
case. Even with a 2D-array with a subarray size of 1M there is no 
difference in execution time. i == k should be relatively rare for large 
arrays, given a good prng.


I do not see the point of having *only* inline functions in a c file. 
The external functions should not be inlined?


Done.

The random and array shuffling functions share a common state. I'm 
wondering whether it should ot should not be so. It seems ok, but then 
ISTM that the documentation suggests implicitely that setseed applies to 
random() only, which is not the case anymore after the patch.


I really like the idea of a prng state owned by the user, that is used 
by all user facing random functions, but see that their might be 
concerns about this change. I would update the setseed() documentation, 
if this proposal is accepted. Do you think my patch should already 
include the documentation update?


If more samples are required than the number of elements, it does not 
error out. I'm wondering whether it should.


The second argument to array_sample() is treated like a LIMIT, rather 
than the actual number of elements. I updated the documentation. My 
use-case is: take max random items from an array of unknown size.


Also, the sampling should not return its result in order when the number 
of elements required is the full array, ISTM that it should be shuffled 
there as well.


You are the second person asking for this. It's done. I am thinking 
about ditching array_sample() and replace it with a second signature for 
array_shuffle():


  array_shuffle(array anyarray) -> anyarray
  array_shuffle(array anyarray, limit int) -> anyarray

What do you think?

I must say that without significant knowledge of the array internal 
implementation, the swap code looks pretty strange. ISTM that the code 
would be clearer if pointers and array references style were not 
intermixed.


Done. Went with pointers.


Maybe you could add a test with a 3D array? Some sample with NULLs?


Done.

Unrelated: I notice again that (postgre)SQL does not provide a way to 
generate random integers. I do not see why not. Should we provide one?


Maybe. I think it is outside the scope of this patch.

Thank you, MartinFrom d4f3df99cbc4451f7907a7c759a7590d63d8388c Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses max n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml  |  34 +
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/array_userfuncs.c | 181 
 src/backend/utils/adt/float.c   |  37 +
 src/backend/utils/adt/user_prng.c   |  87 
 src/include/catalog/pg_proc.dat |   6 +
 src/include/utils/user_prng.h   |  19 +++
 src/test/regress/expected/arrays.out|  70 +
 src/test/regress/sql/arrays.sql |  16 +++
 9 files changed, 418 insertions(+), 33 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..c44db6c652 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, limit integer )
+anyarray
+   
+   
+Returns max limit randomly chosen elements from array in selection order.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-24 Thread Fabien COELHO



i came to the same conclusions and went with Option 1 (see patch). Mainly 
because most code in utils/adt is organized by type and this way it is 
clear, that this is a thin wrapper around pg_prng.




Small patch update. I realized the new functions should live 
array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers and 
added some comments to the code.


My 0,02€ about this patch:

Use (void) when declaring no parameters in headers or in functions.

Should the exchange be skipped when i == k?

I do not see the point of having *only* inline functions in a c file. The 
external functions should not be inlined?


The random and array shuffling functions share a common state. I'm 
wondering whether it should ot should not be so. It seems ok, but then 
ISTM that the documentation suggests implicitely that setseed applies to 
random() only, which is not the case anymore after the patch.


If more samples are required than the number of elements, it does not 
error out. I'm wondering whether it should.


Also, the sampling should not return its result in order when the number 
of elements required is the full array, ISTM that it should be shuffled 
there as well.


I must say that without significant knowledge of the array internal 
implementation, the swap code looks pretty strange. ISTM that the code 
would be clearer if pointers and array references style were not 
intermixed.


Maybe you could add a test with a 3D array? Some sample with NULLs?

Unrelated: I notice again that (postgre)SQL does not provide a way to 
generate random integers. I do not see why not. Should we provide one?


--
Fabien.

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-23 Thread Martin Kalcher


Am 22.07.22 um 11:31 schrieb Martin Kalcher:


i came to the same conclusions and went with Option 1 (see patch). 
Mainly because most code in utils/adt is organized by type and this way 
it is clear, that this is a thin wrapper around pg_prng.




Small patch update. I realized the new functions should live 
array_userfuncs.c (rather than arrayfuncs.c), fixed some file headers 
and added some comments to the code.From 138777531961c31250074d1c0d87417e31d63656 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml  |  35 +
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/array_userfuncs.c | 182 
 src/backend/utils/adt/float.c   |  37 +
 src/backend/utils/adt/user_prng.c   |  87 +++
 src/include/catalog/pg_proc.dat |   6 +
 src/include/utils/user_prng.h   |  19 +++
 src/test/regress/expected/arrays.out|  58 
 src/test/regress/sql/arrays.sql |  14 ++
 9 files changed, 406 insertions(+), 33 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..87df8a9c81 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array.
+The order of the elements in resulting array is unspecified.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{1,2}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{1,2},{3,4}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index ca70590d7d..e3ba14a17c 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -17,6 +17,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/user_prng.h"
 #include "utils/typcache.h"
 
 
@@ -902,3 +903,184 @@ array_positions(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+/*
+ * Produce array with max n randomly chosen items from given array in
+ * random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtyp. However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+rnitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &relms, &rnuls, &nelm);
+
+	nitem = dims[0];			/* total number of items */
+	rnitem = Min(n, nitem);		/* target number of items */
+	nelm /= nitem;/* number of elements per item */
+
+	elms = relms;
+	nuls = rnuls;
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < rnitem; i++)
+	{
+		k = (int) user_prng_uint64_range(0, nitem - i - 1) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0;

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-22 Thread Joe Conway


On 7/19/22 10:20, Tom Lane wrote:

Everything else either explicitly rejects more-than-one-D arrays
or does something that is compatible with thinking of them as
arrays-of-arrays.


I think I am responsible for at least some of those, and I agree that 
thinking of MD arrays as arrays-of-arrays is preferable even though they 
are not actually that. Long ago[1] Peter E asked me to fix that as I 
recall but it was one of those round tuits that I never found.



So I withdraw my original position.  These functions should just
shuffle or select in the array's first dimension, preserving
subarrays.  Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.


+1

Joe

[1] 
https://www.postgresql.org/message-id/flat/Pine.LNX.4.44.0306281418020.2178-10%40peter.localdomain#a064d6dd8593993d799db453a3ee04d1


--
Joe Conway
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-22 Thread Dean Rasheed

On Fri, 22 Jul 2022 at 10:31, Martin Kalcher
 wrote:
>
> i came to the same conclusions and went with Option 1 (see patch).
> Mainly because most code in utils/adt is organized by type and this way
> it is clear, that this is a thin wrapper around pg_prng.
>
> What do you think?

Looks fairly neat, on a quick read-through. There's certainly
something to be said for preserving the organisation by type.

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-22 Thread Martin Kalcher


Am 22.07.22 um 09:59 schrieb Dean Rasheed:>

Option 1:
  Keep random.c small
  - Responsible for initialisation of the user prng on demand
  - Expose the user prng state to other code like float.c and arrayfuncs.c

Option 2:
  Move all random functions wanting to use the user prng to random.c
  - Starting with drandom() and setseed()
  - Then, array_shuffle() and array_sample()
  - Later, any new SQL-callable random functions we might add
  - Keep the user prng state local to random.c



Hey Dean,

i came to the same conclusions and went with Option 1 (see patch). 
Mainly because most code in utils/adt is organized by type and this way 
it is clear, that this is a thin wrapper around pg_prng.


What do you think?From ceda50f1f7f7e0c123de9b2ce2cc7b5d2b2b7db6 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

The new functions share its prng state with random() and thus interoperate
with setseed().
---
 doc/src/sgml/func.sgml   |  35 +
 src/backend/utils/adt/Makefile   |   1 +
 src/backend/utils/adt/arrayfuncs.c   | 189 ++-
 src/backend/utils/adt/float.c|  37 +-
 src/backend/utils/adt/user_prng.c|  88 +
 src/include/catalog/pg_proc.dat  |   6 +
 src/include/utils/user_prng.h|  21 +++
 src/test/regress/expected/arrays.out |  58 
 src/test/regress/sql/arrays.sql  |  14 ++
 9 files changed, 415 insertions(+), 34 deletions(-)
 create mode 100644 src/backend/utils/adt/user_prng.c
 create mode 100644 src/include/utils/user_prng.h

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..6b96897244 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array.
+The order of the elements in resulting array is unspecified.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{3,4}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{3,4},{1,2}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7c722ea2ce..42b65e58bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -109,6 +109,7 @@ OBJS = \
 	tsvector.o \
 	tsvector_op.o \
 	tsvector_parser.o \
+	user_prng.o \
 	uuid.o \
 	varbit.o \
 	varchar.o \
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..3a40ccc362 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -32,10 +32,10 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/user_prng.h"
 #include "utils/selfuncs.h"
 #include "utils/typcache.h"
 
-
 /*
  * GUC parameter
  */
@@ -6872,3 +6872,190 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Produce array with max n random items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtype.  However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &relms, &rnuls, &nelm);
+
+	/* Calculate number of elements per item. */
+	nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1;
+	elms = relms;
+	nuls = rnuls;
+	nitem = dims[0];
+	n = Min(n, nitem);
+
+	/*
+	 * Shuffle array using Fisher-Yates

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-22 Thread Dean Rasheed

On Thu, 21 Jul 2022 at 16:43, Martin Kalcher
 wrote:
>
> Am 21.07.22 um 14:25 schrieb Dean Rasheed:
> >
> > I'm inclined to say that we want a new pg_global_prng_user_state that
> > is updated by setseed(), and used by random(), array_shuffle(),
> > array_sample(), and any other user-facing random functions we add
> > later.
> >
>
> I like the idea. How would you organize the code? I imagine some sort of
> user prng that is encapsulated in something like utils/adt/random.c and
> is guaranteed to always be seeded.
>

Something like that, perhaps. I can see 2 ways it could go:

Option 1:
 Keep random.c small
 - Responsible for initialisation of the user prng on demand
 - Expose the user prng state to other code like float.c and arrayfuncs.c

Option 2:
 Move all random functions wanting to use the user prng to random.c
 - Starting with drandom() and setseed()
 - Then, array_shuffle() and array_sample()
 - Later, any new SQL-callable random functions we might add
 - Keep the user prng state local to random.c

Option 2 seems quite appealing, because it keeps all SQL-callable
random functions together in one place, and keeps the state local,
making it easier to keep track of which functions are using it.

Code like the Fisher-Yates algorithm is more to do with random than it
is to do with arrays, so putting it in random.c seems to make more
sense.

It's possible that some hypothetical random function might need access
to type-specific internals. For example, if we wanted to add a
function to return a random numeric, it would probably have to go in
numeric.c, but that could be solved by having random.c call numeric.c,
passing it the prng state.

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-21 Thread Martin Kalcher


Am 21.07.22 um 10:41 schrieb Dean Rasheed:


It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.



Here is an updated patch that marks the functions VOLATILE PARALLEL 
RESTRICTED and uses pg_prng_uint64_range() rather than rand().From 26676802f05d00c31e0b2d5997f61734aa421fca Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

Shuffling of arrays can already be accomplished with SQL
using unnest() and array_agg(order by random()). But that is
very slow compared to the new functions. In addition the new functions
are dimension aware.
---
 doc/src/sgml/func.sgml   |  35 +
 src/backend/utils/adt/arrayfuncs.c   | 189 ++-
 src/include/catalog/pg_proc.dat  |   6 +
 src/test/regress/expected/arrays.out |  60 +
 src/test/regress/sql/arrays.sql  |  17 +++
 5 files changed, 306 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..6b96897244 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array.
+The order of the elements in resulting array is unspecified.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{3,4}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{3,4},{1,2}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..64da980348 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -34,7 +34,7 @@
 #include "utils/memutils.h"
 #include "utils/selfuncs.h"
 #include "utils/typcache.h"
-
+#include "common/pg_prng.h"
 
 /*
  * GUC parameter
@@ -6872,3 +6872,190 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Produce array with max n random items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtype.  However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &relms, &rnuls, &nelm);
+
+	/* Calculate number of elements per item. */
+	nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1;
+	elms = relms;
+	nuls = rnuls;
+	nitem = dims[0];
+	n = Min(n, nitem);
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		k = (int) pg_prng_uint64_range(&pg_global_prng_state, 0,

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-21 Thread Martin Kalcher


Am 21.07.22 um 14:25 schrieb Dean Rasheed:


I'm inclined to say that we want a new pg_global_prng_user_state that
is updated by setseed(), and used by random(), array_shuffle(),
array_sample(), and any other user-facing random functions we add
later.



I like the idea. How would you organize the code? I imagine some sort of 
user prng that is encapsulated in something like utils/adt/random.c and 
is guaranteed to always be seeded.


Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-21 Thread Dean Rasheed

On Thu, 21 Jul 2022 at 12:15, Martin Kalcher
 wrote:
>
> I agree that we should use pg_prng_uint64_range(). However, in order to
> achieve interoperability with setseed() we would have to use
> drandom_seed (rather than pg_global_prng_state) as rng state, which is
> declared statically in float.c and exclusively used by random(). Do we
> want to expose drandom_seed to other functions?
>

Ah, I didn't realise that setseed() and random() were bound up so
tightly. It does feel as though, if we're adding more user-facing
functions that return random sequences, there ought to be a way to
seed them, and I wouldn't want to have separate setseed functions for
each one.

I'm inclined to say that we want a new pg_global_prng_user_state that
is updated by setseed(), and used by random(), array_shuffle(),
array_sample(), and any other user-facing random functions we add
later.

I can also see that others might not like expanding the scope of
setseed() in this way.

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-21 Thread Martin Kalcher


Am 21.07.22 um 10:41 schrieb Dean Rasheed:


A couple of quick comments on the current patch:


Thank you for your feedback!


It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html


CREATE FUNCTION marks functions as VOLATILE by default if not explicitly 
marked otherwise. I assumed function definitions in pg_proc.dat have the 
same behavior. I will fix that.

https://www.postgresql.org/docs/current/sql-createfunction.html


It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.


I agree that we should use pg_prng_uint64_range(). However, in order to 
achieve interoperability with setseed() we would have to use 
drandom_seed (rather than pg_global_prng_state) as rng state, which is 
declared statically in float.c and exclusively used by random(). Do we 
want to expose drandom_seed to other functions?



Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.


As mentioned above, i assumed the default here is PARALLEL UNSAFE. I'll 
fix that.



In my experience, the requirement for sampling with replacement is
about as common as the requirement for sampling without replacement,
so it seems odd to provide one but not the other. Of course, we can
always add a with-replacement function later, and give it a different
name. But maybe array_sample() could be used for both, with a
"with_replacement" boolean parameter?


We can also add a "with_replacement" boolean parameter which is false by 
default to array_sample() later. I do not have a strong opinion about 
that and will implement it, if desired. Any opinions about 
default/no-default?


Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-21 Thread Dean Rasheed

On Tue, 19 Jul 2022 at 21:21, Martin Kalcher
 wrote:
>
> Here is a patch with dimension aware array_shuffle() and array_sample().
>

+1 for this feature, and this way of handling multi-dimensional arrays.

> If you think array_flatten() is desirable, i can take a look at it.

That's not something I've ever wanted -- personally, I rarely use
multi-dimensional arrays in Postgres.

A couple of quick comments on the current patch:

It's important to mark these new functions as VOLATILE, not IMMUTABLE,
otherwise they won't work as expected in queries. See
https://www.postgresql.org/docs/current/xfunc-volatility.html

It would be better to use pg_prng_uint64_range() rather than rand() to
pick elements. Partly, that's because it uses a higher quality PRNG,
with a larger internal state, and it ensures that the results are
unbiased across the range. But more importantly, it interoperates with
setseed(), allowing predictable sequences of "random" numbers to be
generated -- something that's useful in writing repeatable regression
tests.

Assuming these new functions are made to interoperate with setseed(),
which I think they should be, then they also need to be marked as
PARALLEL RESTRICTED, rather than PARALLEL SAFE. See
https://www.postgresql.org/docs/current/parallel-safety.html, which
explains why setseed() and random() are parallel restricted.

In my experience, the requirement for sampling with replacement is
about as common as the requirement for sampling without replacement,
so it seems odd to provide one but not the other. Of course, we can
always add a with-replacement function later, and give it a different
name. But maybe array_sample() could be used for both, with a
"with_replacement" boolean parameter?

Regards,
Dean

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Martin Kalcher


Am 19.07.22 um 16:20 schrieb Tom Lane:


So I withdraw my original position.  These functions should just
shuffle or select in the array's first dimension, preserving
subarrays.  Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.



Here is a patch with dimension aware array_shuffle() and array_sample().

If you think array_flatten() is desirable, i can take a look at it. 
Maybe a second parameter would be nice to specify the target dimension:


  select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 1);
  ---
   {1,2,3,4,5,6,7,8}

  select array_flatten(array[[[1,2],[3,4]],[[5,6],[7,8]]], 2);
  ---
   {{1,2,3,4},{5,6,7,8}}

MartinFrom 2aa6d72ff0f4d8835ee2f09f8cdf16b7e8005e56 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles the elements of an array.
* array_sample() chooses n elements from an array by random.

Shuffling of arrays can already be accomplished with SQL
using unnest() and array_agg(order by random()). But that is
very slow compared to the new functions. In addition the new functions
are dimension aware.
---
 doc/src/sgml/func.sgml   |  35 +
 src/backend/utils/adt/arrayfuncs.c   | 189 +++
 src/include/catalog/pg_proc.dat  |   6 +
 src/test/regress/expected/arrays.out |  60 +
 src/test/regress/sql/arrays.sql  |  17 +++
 5 files changed, 307 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..6b96897244 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,41 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array.
+The order of the elements in resulting array is unspecified.
+   
+   
+array_sample(ARRAY[[1,2],[3,4],[5,6]], 2)
+{{5,6},{3,4}}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the first dimension of the array.
+   
+   
+array_shuffle(ARRAY[[1,2],[3,4],[5,6]])
+{{5,6},{3,4},{1,2}}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..3de1b0db30 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,192 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Produce array with max n random items from given array in random order.
+ *
+ * array: array object to examine (must not be NULL)
+ * n: max number of items
+ * elmtyp, elmlen, elmbyval, elmalign: info for the datatype of the items
+ *
+ * NOTE: it would be cleaner to look up the elmlen/elmbval/elmalign info
+ * from the system catalogs, given the elmtype.  However, the caller is
+ * in a better position to cache this info across multiple uses, or even
+ * to hard-wire values if the element type is hard-wired.
+ */
+static ArrayType *
+array_shuffle_n(ArrayType *array, int n,
+Oid elmtyp, int elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType  *result;
+	int			ndim,
+			   *dims,
+			   *lbs,
+rdims[MAXDIM],
+nelm,
+nitem,
+i,
+j,
+k;
+	Datum		elm,
+			   *elms,
+			   *relms;
+	bool		nul,
+			   *nuls,
+			   *rnuls;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+	lbs = ARR_LBOUND(array);
+
+	if (ndim < 1 || dims[0] < 1 || n < 1)
+		return construct_empty_array(elmtyp);
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &relms, &rnuls, &nelm);
+
+	/* Calculate number of elements per item. */
+	nelm = (ndim > 1) ? ArrayGetNItems(ndim - 1, dims + 1) : 1;
+	elms = relms;
+	nuls = rnuls;
+	nitem = dims[0];
+	n = Min(n, nitem);
+
+	/*
+	 * Shuffle array using Fisher-Yates algorithm. Swap head with an randomly
+	 * chosen item and increment head.
+	 */
+	for (i = 0; i < n; i++)
+	{
+		k = (rand() % (nitem - i)) * nelm;
+
+		/* Swap all elements in head with elements in item k. */
+		for (j = 0; j < nelm; j++)
+		{
+			elm = *elms;
+			nul = *nuls;
+			*elms = elms[k];
+			*nuls = nuls[k];
+			elms[k] = elm;
+			nuls[k] = nul;
+			elms++;
+			nuls++;
+		}
+	}
+
+	memcpy(rdims, dims, ndim * sizeof(int));
+	rdims[0] = n;
+
+	result = construct_md_array(relms, rnuls, ndim, rdims, lbs,
+elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(relms);
+	pfree(rnuls);
+
+	return result;
+}
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		el

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Tom Lane

Robert Haas  writes:
> On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan  wrote:
>> Why not have an optional second parameter for array_shuffle that
>> indicates whether or not to flatten? e.g. array_shuffle(my_array,
>> flatten => true)

> IMHO, if we think that's something many people are going to want, it
> would be better to add an array_flatten() function, because that could
> be used for a variety of purposes, whereas an option to this function
> can only be used for this function.

Agreed.  Whether it's really needed, I dunno --- I don't recall the
issue having come up before.

After taking a second look through
https://www.postgresql.org/docs/current/functions-array.html
it seems like the things that treat arrays as flat even when they
are multi-D are just

* unnest(), which is more or less forced into that position by our
type system: it has to be anyarray returning anyelement, not
anyarray returning anyelement-or-anyarray-depending.

* array_to_string(), which maybe could have done it differently but
can reasonably be considered a variant of unnest().

* The containment/overlap operators, which are kind of their own
thing anyway.  Arguably they got this wrong, though I suppose it's
too late to rethink that.

Everything else either explicitly rejects more-than-one-D arrays
or does something that is compatible with thinking of them as
arrays-of-arrays.

So I withdraw my original position.  These functions should just
shuffle or select in the array's first dimension, preserving
subarrays.  Or else be lazy and reject more-than-one-D arrays;
but it's probably not that hard to handle them.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Robert Haas

On Tue, Jul 19, 2022 at 9:53 AM Andrew Dunstan  wrote:
> > Having thought about it, i would go with (2). It gives the user the
> > ability to decide wether or not array-of-arrays behavior is desired.
> > If he wants the behavior of (1) he can flatten the array before
> > applying array_shuffle(). Unfortunately there is no array_flatten()
> > function (at the moment) and the user would have to work around it
> > with unnest() and array_agg().
>
> Why not have an optional second parameter for array_shuffle that
> indicates whether or not to flatten? e.g. array_shuffle(my_array,
> flatten => true)

IMHO, if we think that's something many people are going to want, it
would be better to add an array_flatten() function, because that could
be used for a variety of purposes, whereas an option to this function
can only be used for this function.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Andrew Dunstan



On 2022-07-19 Tu 07:15, Martin Kalcher wrote:
> Am 18.07.22 um 23:48 schrieb Martin Kalcher:
>>
>> If we go with (1) array_shuffle() and array_sample() should shuffle
>> each element individually and always return a one-dimensional array.
>>
>>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>>    ---
>>     {1,4,3,5,6,2}
>>
>>    select array_sample('{{1,2},{3,4},{5,6}}', 3);
>>    --
>>     {1,4,3}
>>
>> If we go with (2) both functions should only operate on the first
>> dimension and shuffle whole subarrays and keep the dimensions intact.
>>
>>    select array_shuffle('{{1,2},{3,4},{5,6}}');
>>    -
>>     {{3,4},{1,2},{5,6}}
>>
>>    select array_sample('{{1,2},{3,4},{5,6}}', 2);
>>    ---
>>     {{3,4},{1,2}}
>>
>
> Having thought about it, i would go with (2). It gives the user the
> ability to decide wether or not array-of-arrays behavior is desired.
> If he wants the behavior of (1) he can flatten the array before
> applying array_shuffle(). Unfortunately there is no array_flatten()
> function (at the moment) and the user would have to work around it
> with unnest() and array_agg().
>
>


Why not have an optional second parameter for array_shuffle that
indicates whether or not to flatten? e.g. array_shuffle(my_array,
flatten => true)


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Robert Haas

On Mon, Jul 18, 2022 at 6:43 PM Tom Lane  wrote:
> Um ... why is "the order in which the elements were chosen" a concept
> we want to expose?  ISTM sample() is a black box in which notionally
> the decisions could all be made at once.

I agree with that. But I also think it's fine for the elements to be
returned in a shuffled order rather than the original order.

> > I really think this function needs to grow an algorithm argument that can
> > be used to specify stuff like ordering, replacement/without-replacement,
> > etc...just some enums separated by commas that can be added to the call.
>
> I think you might run out of gold paint somewhere around here.  I'm
> still not totally convinced we should bother with the sample() function
> at all, let alone that it needs algorithm variants.  At some point we
> say to the user "here's a PL, write what you want for yourself".

I don't know what gold paint has to do with anything here, but I agree
that David's proposal seems to be moving the goalposts a very long
way.

The thing is, as Martin points out, these functions already exist in a
bunch of other systems. For one example I've used myself, see
https://underscorejs.org/

I probably wouldn't have called a function to put a list into a random
order "shuffle" in a vacuum, but it seems to be common nomenclature
these days. I believe that if you don't make reference to Fisher-Yates
in the documentation, they kick you out of the cool programmers club.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Martin Kalcher


Am 18.07.22 um 23:48 schrieb Martin Kalcher:


If we go with (1) array_shuffle() and array_sample() should shuffle each 
element individually and always return a one-dimensional array.


   select array_shuffle('{{1,2},{3,4},{5,6}}');
   ---
    {1,4,3,5,6,2}

   select array_sample('{{1,2},{3,4},{5,6}}', 3);
   --
    {1,4,3}

If we go with (2) both functions should only operate on the first 
dimension and shuffle whole subarrays and keep the dimensions intact.


   select array_shuffle('{{1,2},{3,4},{5,6}}');
   -
    {{3,4},{1,2},{5,6}}

   select array_sample('{{1,2},{3,4},{5,6}}', 2);
   ---
    {{3,4},{1,2}}



Having thought about it, i would go with (2). It gives the user the 
ability to decide wether or not array-of-arrays behavior is desired. If 
he wants the behavior of (1) he can flatten the array before applying 
array_shuffle(). Unfortunately there is no array_flatten() function (at 
the moment) and the user would have to work around it with unnest() and 
array_agg().

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Aleksander Alekseev

Hi Martin,

I didn't look at the code yet but I very much like the idea. Many
thanks for working on this!

It's a pity your patch was too late for the July commitfest. In
September, please keep an eye on cfbot [1] to make sure your patch
applies properly.

> As Tom's investigation showed, there is no consensus in the code if
> multi-dimensional arrays are treated as arrays-of-arrays or not. We need
> to decide what should be the correct treatment.

Here are my two cents.

>From the user perspective I would expect that by default a
multidimensional array should be treated as an array of arrays. So for
instance:

```
array_shuffle([ [1,2], [3,4], [5,6] ])
```

... should return something like:

```
[ [3,4], [1,2], [5,6] ]
```

Note that the order of the elements in the internal arrays is preserved.

However, I believe there should be an optional argument that overrides
this behavior. For instance:

```
array_shuffle([ [1,2], [3,4], [5,6] ], depth => 2)
```

BTW, while on it, shouldn't we add similar functions for JSON and/or
JSONB? Or is this going to be too much for a single discussion?

[1]: http://cfbot.cputube.org/

-- 
Best regards,
Aleksander Alekseev

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-19 Thread Martin Kalcher


Am 19.07.22 um 00:52 schrieb Martin Kalcher:


On the contrary! I am pretty sure there are people out there wanting 
sampling-without-shuffling. I will think about that.


I gave it some thought. Even though there might be use cases, where a 
stable order is desired, i would consider them edge cases, not worth the 
additional complexity. I personally would not expect array_sample() to 
return elements in any specific order. I looked up some sample() 
implementations. None of them makes guarantees about the order of the 
resulting array or explicitly states that the resulting array is in 
random or selection order.


- Python random.sample [0]
- Ruby Array#sample [1]
- Rust rand::seq::SliceRandom::choose_multiple [2]
- Julia StatsBase.sample [3] stable order needs explicit request

[0] https://docs.python.org/3/library/random.html#random.sample
[1] https://ruby-doc.org/core-3.0.0/Array.html#method-i-sample
[2] 
https://docs.rs/rand/0.6.5/rand/seq/trait.SliceRandom.html#tymethod.choose_multiple

[3] https://juliastats.org/StatsBase.jl/stable/sampling/#StatsBase.sample

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Thomas Munro

On Tue, Jul 19, 2022 at 8:15 AM Martin Kalcher
 wrote:
> Am 18.07.22 um 21:29 schrieb Tom Lane:
> > The preferred thing to do is to add it to our "commitfest" queue,
> > which will ensure that it gets looked at eventually.  The currently
> > open cycle is 2022-09 [1] (see the "New Patch" button there).
>
> Thanks Tom, did that.

FYI that brought your patch to the attention of this CI robot, which
is showing a couple of warnings.  See the FAQ link there for an
explanation of that infrastructure.

http://cfbot.cputube.org/martin-kalcher.html

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Martin Kalcher


Am 19.07.22 um 00:18 schrieb Tom Lane:


Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact.  If you want the behavior shown
above, you can do array_shuffle(array_sample(...)).  But if we
randomize it, and that's not what the user wanted, she has no
recourse.

Now, if you're convinced that the set of people wanting
sampling-without-shuffling is the empty set, then making everybody
else call two functions is a loser.  But I'm not convinced.
At the least, I'd like to see the argument made why nobody
would want that.



On the contrary! I am pretty sure there are people out there wanting 
sampling-without-shuffling. I will think about that.

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Tom Lane

"David G. Johnston"  writes:
> On Mon, Jul 18, 2022 at 3:18 PM Tom Lane  wrote:
>> Independently of the dimensionality question --- I'd imagined that
>> array_sample would select a random subset of the array elements
>> but keep their order intact.  If you want the behavior shown
>> above, you can do array_shuffle(array_sample(...)).  But if we
>> randomize it, and that's not what the user wanted, she has no
>> recourse.

> And for those that want to know in what order those elements were chosen
> they have no recourse in the other setup.

Um ... why is "the order in which the elements were chosen" a concept
we want to expose?  ISTM sample() is a black box in which notionally
the decisions could all be made at once.

> I really think this function needs to grow an algorithm argument that can
> be used to specify stuff like ordering, replacement/without-replacement,
> etc...just some enums separated by commas that can be added to the call.

I think you might run out of gold paint somewhere around here.  I'm
still not totally convinced we should bother with the sample() function
at all, let alone that it needs algorithm variants.  At some point we
say to the user "here's a PL, write what you want for yourself".

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread David G. Johnston

On Mon, Jul 18, 2022 at 3:18 PM Tom Lane  wrote:

>
> Independently of the dimensionality question --- I'd imagined that
> array_sample would select a random subset of the array elements
> but keep their order intact.  If you want the behavior shown
> above, you can do array_shuffle(array_sample(...)).  But if we
> randomize it, and that's not what the user wanted, she has no
> recourse.
>
>
And for those that want to know in what order those elements were chosen
they have no recourse in the other setup.

I really think this function needs to grow an algorithm argument that can
be used to specify stuff like ordering, replacement/without-replacement,
etc...just some enums separated by commas that can be added to the call.

David J.

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Tom Lane

Martin Kalcher  writes:
> If we go with (1) array_shuffle() and array_sample() should shuffle each 
> element individually and always return a one-dimensional array.

>select array_shuffle('{{1,2},{3,4},{5,6}}');
>---
> {1,4,3,5,6,2}

>select array_sample('{{1,2},{3,4},{5,6}}', 3);
>--
> {1,4,3}

Independently of the dimensionality question --- I'd imagined that
array_sample would select a random subset of the array elements
but keep their order intact.  If you want the behavior shown
above, you can do array_shuffle(array_sample(...)).  But if we
randomize it, and that's not what the user wanted, she has no
recourse.

Now, if you're convinced that the set of people wanting
sampling-without-shuffling is the empty set, then making everybody
else call two functions is a loser.  But I'm not convinced.
At the least, I'd like to see the argument made why nobody
would want that.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Martin Kalcher


Am 18.07.22 um 23:03 schrieb Tom Lane:

I wrote:

Martin had originally proposed (2), which I rejected on the grounds
that we don't treat multi-dimensional arrays as arrays-of-arrays for
any other purpose.


Actually, after poking at it for awhile, that's an overstatement.
It's true that the type system doesn't think N-D arrays are
arrays-of-arrays, but there are individual functions/operators that do.
Thanks Robert for pointing out the inconsistent behavior of 

array_sample(). That needs to be fixed.

As Tom's investigation showed, there is no consensus in the code if 
multi-dimensional arrays are treated as arrays-of-arrays or not. We need 
to decide what should be the correct treatment.


If we go with (1) array_shuffle() and array_sample() should shuffle each 
element individually and always return a one-dimensional array.


  select array_shuffle('{{1,2},{3,4},{5,6}}');
  ---
   {1,4,3,5,6,2}

  select array_sample('{{1,2},{3,4},{5,6}}', 3);
  --
   {1,4,3}

If we go with (2) both functions should only operate on the first 
dimension and shuffle whole subarrays and keep the dimensions intact.


  select array_shuffle('{{1,2},{3,4},{5,6}}');
  -
   {{3,4},{1,2},{5,6}}

  select array_sample('{{1,2},{3,4},{5,6}}', 2);
  ---
   {{3,4},{1,2}}

I do not feel qualified to make that decision. (2) complicates the code 
a bit, but that should not be the main argument here.


Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Tom Lane

I wrote:
> Martin had originally proposed (2), which I rejected on the grounds
> that we don't treat multi-dimensional arrays as arrays-of-arrays for
> any other purpose.

Actually, after poking at it for awhile, that's an overstatement.
It's true that the type system doesn't think N-D arrays are
arrays-of-arrays, but there are individual functions/operators that do.

For instance, @> is in the its-a-flat-list-of-elements camp:

regression=# select array[[1,2],[3,4]] @> array[1,3];
 ?column? 
--
 t
(1 row)

but || wants to preserve dimensionality:

regression=# select array[[1,2],[3,4]] || array[1];
ERROR:  cannot concatenate incompatible arrays
DETAIL:  Arrays with differing dimensions are not compatible for concatenation.

Various other functions dodge the issue by refusing to work on arrays
of more than one dimension.

There seem to be more functions that think arrays are flat than
otherwise, but it's not as black-and-white as I was thinking.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Tom Lane

Robert Haas  writes:
> On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher
>  wrote:
>> array_shuffle(anyarray) -> anyarray
>> array_sample(anyarray, integer) -> anyarray

> I think it's questionable whether the behavior of array_shuffle() is
> correct for a multi-dimensional array. The implemented behavior is to
> keep the dimensions as they were, but permute the elements across all
> levels at random. But there are at least two other behaviors that seem
> potentially defensible: (1) always return a 1-dimensional array, (2)
> shuffle the sub-arrays at the top-level without the possibility of
> moving elements within or between sub-arrays. What behavior we decide
> is best here should be documented.

Martin had originally proposed (2), which I rejected on the grounds
that we don't treat multi-dimensional arrays as arrays-of-arrays for
any other purpose.  Maybe we should have, but that ship sailed decades
ago, and I doubt that shuffle() is the place to start changing it.

I could get behind your option (1) though, to make it clearer that
the input array's dimensionality is not of interest.  Especially since,
as you say, (1) is pretty much the only sensible choice for array_sample.

> I also think you should add test cases involving multi-dimensional
> arrays, as well as arrays with non-default bounds. e.g. trying
> shuffling or sampling some values like
> '[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[]

This'd only matter if we decide not to ignore the input's dimensionality.

regards, tom lane

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Robert Haas

On Mon, Jul 18, 2022 at 3:03 PM Martin Kalcher
 wrote:
> Thanks for all your feedback and help. I got a patch that i consider
> ready for review. It introduces two new functions:
>
>array_shuffle(anyarray) -> anyarray
>array_sample(anyarray, integer) -> anyarray
>
> array_shuffle() shuffles an array (obviously). array_sample() picks n
> random elements from an array.

I like this idea.

I think it's questionable whether the behavior of array_shuffle() is
correct for a multi-dimensional array. The implemented behavior is to
keep the dimensions as they were, but permute the elements across all
levels at random. But there are at least two other behaviors that seem
potentially defensible: (1) always return a 1-dimensional array, (2)
shuffle the sub-arrays at the top-level without the possibility of
moving elements within or between sub-arrays. What behavior we decide
is best here should be documented.

array_sample() will return elements in random order when sample_size <
array_size, but in the original order when sample_size >= array_size.
Similarly, it will always return a 1-dimensional array in the former
case, but will keep the original dimensions in the latter case. That
seems pretty hard to defend. I think it should always return a
1-dimensional array with elements in random order, and I think this
should be documented.

I also think you should add test cases involving multi-dimensional
arrays, as well as arrays with non-default bounds. e.g. trying
shuffling or sampling some values like
'[8:10][-6:-5]={{1,2},{3,4},{5,6}}'::int[]

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Martin Kalcher


Am 18.07.22 um 21:29 schrieb Tom Lane:

The preferred thing to do is to add it to our "commitfest" queue,
which will ensure that it gets looked at eventually.  The currently
open cycle is 2022-09 [1] (see the "New Patch" button there).


Thanks Tom, did that. I am not sure if "SQL Commands" is the correct 
topic for this patch, but i guess its not that important. I am impressed 
by all the infrastructure build around this project.


Martin

Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Tom Lane

Martin Kalcher  writes:
> Is someone interested in looking at it? What are the next steps?

The preferred thing to do is to add it to our "commitfest" queue,
which will ensure that it gets looked at eventually.  The currently
open cycle is 2022-09 [1] (see the "New Patch" button there).

I believe you have to have signed up as a community member[2]
in order to put yourself in as the patch author.  I think "New Patch"
will work anyway, but it's better to have an author listed.

regards, tom lane

[1] https://commitfest.postgresql.org/39/
[2] https://www.postgresql.org/account/login/

[PATCH] Introduce array_shuffle() and array_sample()

2022-07-18 Thread Martin Kalcher

Thanks for all your feedback and help. I got a patch that i consider 
ready for review. It introduces two new functions:


  array_shuffle(anyarray) -> anyarray
  array_sample(anyarray, integer) -> anyarray

array_shuffle() shuffles an array (obviously). array_sample() picks n 
random elements from an array.


Is someone interested in looking at it? What are the next steps?

MartinFrom 5498bb2d9f1fab4cad56cd0d3a6eeafa24a26c8e Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] Introduce array_shuffle() and array_sample()

* array_shuffle() shuffles to elements of an array.
* array_sample() chooses n elements from an array by random.

Shuffling of arrays can already be accomplished with SQL
using unnest() and array_agg(order by random()). But that is
very slow compared to the new functions.
---
 doc/src/sgml/func.sgml   |  34 ++
 src/backend/utils/adt/arrayfuncs.c   | 163 +++
 src/include/catalog/pg_proc.dat  |   6 +
 src/test/regress/expected/arrays.out |  30 +
 src/test/regress/sql/arrays.sql  |  11 ++
 5 files changed, 244 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b6783b7ad0..c676031b4a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19395,6 +19395,40 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ array_sample
+
+array_sample ( array anyarray, n integer )
+anyarray
+   
+   
+Returns n randomly chosen elements from array.
+   
+   
+array_sample(ARRAY[1,2,3,4,5,6], 3)
+{4,5,1}
+   
+  
+
+  
+   
+
+ array_shuffle
+
+array_shuffle ( anyarray )
+anyarray
+   
+   
+Shuffles the elements of the array.
+   
+   
+array_shuffle(ARRAY[1,2,3,4,5,6])
+{4,5,1,3,2,6}
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..a6769a8ebf 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,166 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+/*
+ * Shuffle the elements of an array.
+ */
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	Datum	   *elms,
+elm;
+	bool	   *nuls,
+nul;
+	int			nelms,
+i,
+j;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	nelms = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
+
+	/* There is no point in shuffling arrays with less then two elements. */
+	if (nelms < 2)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	elmtyp = ARR_ELEMTYPE(array);
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id != elmtyp)
+	{
+		typentry = lookup_type_cache(elmtyp, 0);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	elmlen = typentry->typlen;
+	elmbyval = typentry->typbyval;
+	elmalign = typentry->typalign;
+
+	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
+	  &elms, &nuls, &nelms);
+
+	/* Shuffle elements and nulls using Fisher-Yates shuffle algorithm. */
+	for (i = nelms - 1; i > 0; i--)
+	{
+		j = random() % (i + 1);
+		elm = elms[i];
+		nul = nuls[i];
+		elms[i] = elms[j];
+		nuls[i] = nuls[j];
+		elms[j] = elm;
+		nuls[j] = nul;
+	}
+
+	result = construct_md_array(elms, nuls,
+ARR_NDIM(array), ARR_DIMS(array), ARR_LBOUND(array),
+elmtyp, elmlen, elmbyval, elmalign);
+
+	pfree(elms);
+	pfree(nuls);
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
+
+/*
+ * Choose N random elements from an array.
+ */
+Datum
+array_sample(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	int16		elmlen;
+	bool		elmbyval;
+	char		elmalign;
+	Oid			elmtyp;
+	TypeCacheEntry *typentry;
+	Datum	   *elms,
+elm;
+	bool	   *nuls,
+nul;
+	int			nelms,
+rnelms,
+rdims[1],
+rlbs[1],
+i,
+j;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+	nelms = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
+	elmtyp = ARR_ELEMTYPE(array);
+	rnelms = PG_GETARG_INT32(1);
+
+	if (rnelms < 0)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("second parameter must not be negative")));
+
+	/* Return an empty array if the requested sample size is zero. */
+	if (rnelms == 0)
+	{
+		PG_FREE_IF_COPY(array, 0);
+		PG_RETURN_ARRAYTYPE_P(construct_empty_array(elmtyp));
+	}
+
+	/*
+	 * Return the original array if the requested sample size is greater than
+	 * or equal to its own size.
+	 */
+	if (rnelms >= nelms)
+		PG_RETURN_ARRAYTYPE_P(array);
+
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+
+	if (typentry == NULL || typentry->type_id !=

51 matches

Mail list logo