Re: Do we want a hashset type?

2023-08-14 Thread jian he
https://github.com/tvondra/hashset

On Mon, Aug 14, 2023 at 11:23 PM Florents Tselai wrote:
>
> Has anyone put this in a git repo / extension package or similar ?
>
> I’d like to try it out outside the core pg tree.
>
> > On 1 Jul 2023, at 12:04 PM, Joel Jacobson  wrote:
> >
> > On Fri, Jun 30, 2023, at 06:50, jian he wrote:
> >> more like a C question:
> >> in this context,
> >> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
> >> CEIL_DIV((set)->capacity, 8)))
> >> is defined first, and struct int4hashset_t is defined later. Is this
> >> normally ok?
> >
> > Yes, it's fine. Macros are just text substitutions done pre-compilation.
> >
> >> Also, would removing (int32 *) from
> >> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
> >> CEIL_DIV((set)->capacity, 8)))
> >>
> >> make it generic? Then when you use it, you could
> >> cast to whatever type you like?
> >
> > Maybe, but it might be less error-prone and more descriptive with
> > different macros for each type, e.g. INT4HASHSET_GET_VALUES,
> > similar to the existing PG_GETARG_INT4HASHSET.
> >
> > Curious to hear what everybody thinks about the interface, documentation,
> > semantics and implementation?
> >
> > Is there anything missing or something that you think should be 
> > changed/improved?
> >
> > /Joel
> >
> >
>


-- 
 I recommend David Deutsch's <>

  Jian




Re: Do we want a hashset type?

2023-08-14 Thread Florents Tselai
Has anyone put this in a git repo / extension package or similar ? 

I’d like to try it out outside the core pg tree. 

> On 1 Jul 2023, at 12:04 PM, Joel Jacobson  wrote:
> 
> On Fri, Jun 30, 2023, at 06:50, jian he wrote:
>> more like a C question:
>> in this context,
>> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
>> CEIL_DIV((set)->capacity, 8)))
>> is defined first, and struct int4hashset_t is defined later. Is this
>> normally ok?
> 
> Yes, it's fine. Macros are just text substitutions done pre-compilation.
> 
>> Also, would removing (int32 *) from
>> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
>> CEIL_DIV((set)->capacity, 8)))
>> 
>> make it generic? Then when you use it, you could
>> cast to whatever type you like?
> 
> Maybe, but it might be less error-prone and more descriptive with
> different macros for each type, e.g. INT4HASHSET_GET_VALUES,
> similar to the existing PG_GETARG_INT4HASHSET.
> 
> Curious to hear what everybody thinks about the interface, documentation,
> semantics and implementation?
> 
> Is there anything missing or something that you think should be 
> changed/improved?
> 
> /Joel
> 
> 





Re: Do we want a hashset type?

2023-07-01 Thread Joel Jacobson
On Fri, Jun 30, 2023, at 06:50, jian he wrote:
> more like a C question:
> in this context,
> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
> CEIL_DIV((set)->capacity, 8)))
> is defined first, and struct int4hashset_t is defined later. Is this
> normally ok?

Yes, it's fine. Macros are just text substitutions done pre-compilation.
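
For illustration, here's a minimal standalone sketch (the demo_t type is
hypothetical, not from the patch) showing why the ordering is fine - the
macro body is only expanded where the macro is used, at which point the
struct is in scope:

#include <stdio.h>

/* Macro defined before the struct it refers to: harmless, since the
 * preprocessor only substitutes this text at the use site below. */
#define FIRST_VALUE(p) ((p)->values[0])

typedef struct demo_t
{
    int values[4];
} demo_t;

int
main(void)
{
    demo_t d = {{42, 0, 0, 0}};

    /* expands to ((&d)->values[0]), after the struct is fully declared */
    printf("%d\n", FIRST_VALUE(&d));
    return 0;
}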

> Also, would removing (int32 *) from
> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
> CEIL_DIV((set)->capacity, 8)))
>
> make it generic? Then when you use it, you could
> cast to whatever type you like?

Maybe, but it might be less error-prone and more descriptive with different
macros for each type, e.g. INT4HASHSET_GET_VALUES,
similar to the existing PG_GETARG_INT4HASHSET.
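
That direction might look like this minimal sketch (the int8 variant is
purely hypothetical; only int4hashset exists in the patch):

#define INT4HASHSET_GET_VALUES(set) \
    ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))

/* hypothetical, if an int8hashset_t were ever added */
#define INT8HASHSET_GET_VALUES(set) \
    ((int64 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))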

Curious to hear what everybody thinks about the interface, documentation,
semantics and implementation?

Is there anything missing or something that you think should be 
changed/improved?

/Joel




Re: Do we want a hashset type?

2023-06-29 Thread jian he
On Thu, Jun 29, 2023 at 4:43 PM Joel Jacobson  wrote:
>
> On Thu, Jun 29, 2023, at 08:54, jian he wrote:
> > Anyway, this time, I added another macro, which seems to simplify the code.
> >
> > #define SET_DATA_PTR(a) \
> > (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
> >
> > it passed all the tests on my local machine.
>
> Hmm, this is interesting. There is a bug in your second patch,
> that the tests catch, so it's really surprising if they pass on your machine.
>
> Can you try to run `make clean && make && make install && make installcheck`?
>
> I would guess you forgot to recompile or reinstall.
>
> This is the bug in 
> 0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patch:
>
> @@ -411,7 +411,7 @@ int4hashset_union(PG_FUNCTION_ARGS)
> int4hashset_t  *seta = PG_GETARG_INT4HASHSET_COPY(0);
> int4hashset_t  *setb = PG_GETARG_INT4HASHSET(1);
> char   *bitmap = setb->data;
> -   int32  *values = (int32 *) (bitmap + 
> CEIL_DIV(setb->capacity, 8));
> +   int32  *values = (int32 *) SET_DATA_PTR(seta);
>
> You accidentally replaced `setb` with `seta`.
>
> I renamed the macro to HASHSET_GET_VALUES and changed it slightly,
> also added a HASHSET_GET_BITMAP for completeness:
>
> #define HASHSET_GET_BITMAP(set) ((set)->data)
> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + 
> CEIL_DIV((set)->capacity, 8)))
>
> Instead of your version:
>
> #define SET_DATA_PTR(a) \
> (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
>
> Changes:
> * Parenthesize macro parameters (see the sketch just below).
> * Prefix the macro names with "HASHSET_" to avoid potential conflicts.
> * "GET_VALUES" more clearly communicates that it's the values we're 
> extracting.
>
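To illustrate the parenthesization point with a hypothetical call site:
with unparenthesized parameters, an argument containing a lower-precedence
operator expands incorrectly:

#define SET_DATA_PTR(a) \
    (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))

/* SET_DATA_PTR(use_a ? seta : setb) expands a->data to
 *     use_a ? seta : setb->data
 * which parses as use_a ? seta : (setb->data) -- not what was meant.
 * Writing (a)->data instead makes the expansion
 *     (use_a ? seta : setb)->data
 * which groups correctly. */
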
> New patch attached.
>
> Other changes in same commit:
>
> * Add original friends-of-friends graph query to new benchmark/ directory
> * Add table of content to README
> * Update docs: Explain null semantics and add function examples
> * Simplify empty hashset handling, remove unused includes
>
> /Joel

more like a C question:
in this context,
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))
is defined first, and struct int4hashset_t is defined later. Is this
normally ok?

Also, would removing (int32 *) from
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data +
CEIL_DIV((set)->capacity, 8)))

make it generic? Then when you use it, you could
cast to whatever type you like?
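
A minimal sketch of what that generic variant could look like (the
HASHSET_GET_DATA name is hypothetical), with the cast moved to each call
site:

/* hypothetical: yields a char * just past the bitmap; callers cast it */
#define HASHSET_GET_DATA(set) \
    ((set)->data + CEIL_DIV((set)->capacity, 8))

/* at an int4 call site:  int32 *values = (int32 *) HASHSET_GET_DATA(set);
 * at an int8 call site:  int64 *values = (int64 *) HASHSET_GET_DATA(set); */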




Re: Do we want a hashset type?

2023-06-29 Thread jian he
On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson  wrote:
>
> On Wed, Jun 28, 2023, at 08:26, jian he wrote:
>
> > Hi there.
> > I changed the function hashset_contains to strict.
>
> Changing hashset_contains to STRICT would cause it to return NULL
> if any of the operands are NULL, which I don't believe is correct, since:
>
> SELECT NULL = ANY('{}'::int4[]);
>  ?column?
> --
>  f
> (1 row)
>
> Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
> to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in 
> SQL:2023.
>
> Did you try running `make installcheck` after your change?
> You would then have seen one of the tests failing:
>
> test array-and-multiset-semantics ... FAILED   21 ms
>
> Check the content of `regression.diffs` to see why:
>
> % cat regression.diffs
> diff -U3 
> /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 
> /Users/joel/src/hashset/results/array-and-multiset-semantics.out
> --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out
>   2023-06-27 10:07:38
> +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out
> 2023-06-28 10:13:27
> @@ -158,7 +158,7 @@
>  |  | {NULL}  | {NULL}   |  |
>  |1 | {1} | {1}  |  |
>  |4 | {4} | {4}  |  |
> - {} |  | {NULL}  | {NULL}   | f| f
> + {} |  | {NULL}  | {NULL}   |  | f
>   {} |1 | {1} | {1}  | f| f
>   {} |4 | {4} | {4}  | f| f
>   {NULL} |  | {NULL}  | {NULL,NULL}  |  |
> @@ -284,7 +284,8 @@
>  "= ANY(...)";
>   arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
>  --+--+-+--+--+
> -(0 rows)
> + {}   |  | {NULL}  | {NULL}   |  | f
> +(1 row)
>
>
> > also changed the way to return an empty array.
>
> Nice.
> I agree the `Datum d` variable was unnecessary.
> I also removed the unused includes.
>
> > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> > int4hashset can speed up distinct aggregates and distinct counts,
> > like the following?
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3
>
> The 100k tables seem to be too small to give any meaningful results
> when trying to measure individual queries:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
>  Execution Time: 26.790 ms
>  Execution Time: 30.616 ms
>  Execution Time: 33.253 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k;
>  Execution Time: 32.797 ms
>  Execution Time: 27.605 ms
>  Execution Time: 26.228 ms
>
> If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
> is actually faster for the `i` column where all input integers are unique:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
>  Execution Time: 799.017 ms
>  Execution Time: 796.008 ms
>  Execution Time: 799.121 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_10M;
>  Execution Time: 1204.873 ms
>  Execution Time: 1221.822 ms
>  Execution Time: 1216.340 ms
>
> For random integers, hashset is a win though:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
>  Execution Time: 1874.722 ms
>  Execution Time: 1878.760 ms
>  Execution Time: 1861.640 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(rnd) FROM benchmark_input_10M;
>  Execution Time: 1253.709 ms
>  Execution Time: 1222.651 ms
>  Execution Time: 1237.849 ms
>
> > explain(costs off,timing off, analyze,buffers)
> > select count(distinct rnd) from benchmark_input_100k \watch c=3
> >
> > explain(costs off,timing off, analyze,buffers)
> > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> > benchmark_input_100k) sub(x) \watch c=3
>
> I tried these with 10M:
>
> EXPLAIN(costs off,timing off, analyze,buffers)
> SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
>  Execution Time: 1733.320 ms
>  Execution Time: 1725.214 ms
>  Execution Time: 1716.636 ms
>
> EXPLAIN(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM 
> benchmark_input_10M) sub(x);
>  Execution Time: 1249.612 ms
>  Execution Time: 1240.558 ms
>  Execution Time: 1252.103 ms
>
> Not sure what I think of the current benchmark suite.
>
> I think it would be better to only include some realistic examples from
> real-life, such as the graph query which was the reason I personally
> started working on this.

Re: Do we want a hashset type?

2023-06-28 Thread Joel Jacobson
On Wed, Jun 28, 2023, at 08:26, jian he wrote:

> Hi there.
> I changed the function hashset_contains to strict.

Changing hashset_contains to STRICT would cause it to return NULL
if any of the operands are NULL, which I don't believe is correct, since:

SELECT NULL = ANY('{}'::int4[]);
 ?column?
--
 f
(1 row)

Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.

Did you try running `make installcheck` after your change?
You would then have seen one of the tests failing:

test array-and-multiset-semantics ... FAILED   21 ms

Check the content of `regression.diffs` to see why:

% cat regression.diffs
diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 
/Users/joel/src/hashset/results/array-and-multiset-semantics.out
--- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out  
2023-06-27 10:07:38
+++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out
2023-06-28 10:13:27
@@ -158,7 +158,7 @@
 |  | {NULL}  | {NULL}   |  |
 |1 | {1} | {1}  |  |
 |4 | {4} | {4}  |  |
- {} |  | {NULL}  | {NULL}   | f| f
+ {} |  | {NULL}  | {NULL}   |  | f
  {} |1 | {1} | {1}  | f| f
  {} |4 | {4} | {4}  | f| f
  {NULL} |  | {NULL}  | {NULL,NULL}  |  |
@@ -284,7 +284,8 @@
 "= ANY(...)";
  arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
 --+--+-+--+--+
-(0 rows)
+ {}   |  | {NULL}  | {NULL}   |  | f
+(1 row)


> also changed the way to return an empty array.

Nice.
I agree the `Datum d` variable was unnecessary.
I also removed the unused includes.

> in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> int4hashset can speed up distinct aggregates and distinct counts,
> like the following?
>
> explain(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
>
> explain(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3

The 100k tables seem to be too small to give any meaningful results
when trying to measure individual queries:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
 Execution Time: 26.790 ms
 Execution Time: 30.616 ms
 Execution Time: 33.253 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k;
 Execution Time: 32.797 ms
 Execution Time: 27.605 ms
 Execution Time: 26.228 ms

If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
is actually faster for the `i` column where all input integers are unique:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
 Execution Time: 799.017 ms
 Execution Time: 796.008 ms
 Execution Time: 799.121 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_10M;
 Execution Time: 1204.873 ms
 Execution Time: 1221.822 ms
 Execution Time: 1216.340 ms

For random integers, hashset is a win though:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
 Execution Time: 1874.722 ms
 Execution Time: 1878.760 ms
 Execution Time: 1861.640 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(rnd) FROM benchmark_input_10M;
 Execution Time: 1253.709 ms
 Execution Time: 1222.651 ms
 Execution Time: 1237.849 ms

> explain(costs off,timing off, analyze,buffers)
> select count(distinct rnd) from benchmark_input_100k \watch c=3
>
> explain(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> benchmark_input_100k) sub(x) \watch c=3

I tried these with 10M:

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
 Execution Time: 1733.320 ms
 Execution Time: 1725.214 ms
 Execution Time: 1716.636 ms

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM 
benchmark_input_10M) sub(x);
 Execution Time: 1249.612 ms
 Execution Time: 1240.558 ms
 Execution Time: 1252.103 ms

Not sure what I think of the current benchmark suite.

I think it would be better to only include some realistic examples from
real-life, such as the graph query which was the reason I personally started
working on this. Otherwise there is a risk we optimise for some hypothetical
scenario that is not relevant in practice.

Would be good with more examples of typical workloads for which the hashset
type would be useful.

Re: Do we want a hashset type?

2023-06-28 Thread jian he
On Tue, Jun 27, 2023 at 4:27 PM Joel Jacobson  wrote:
>
> On Tue, Jun 27, 2023, at 04:35, jian he wrote:
> > in SQLMultiSets.pdf (mentioned earlier in the thread) I found a related
> > explanation on pages 45 and 46.
> > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> > T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
> > (OP2) AS T2 (V) ) END)
> >
> > CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> > T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
> > (OP2) AS T2 (V) ) END
> >
> > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> > T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
> > (OP2) AS T2 (V) ) END)
>
> Thanks! This was exactly what I was looking for, I knew I've seen it but 
> failed to find it.
>
> Attached is a new incremental patch as well as a full patch, since this is a 
> substantial change:
>
> Align null semantics with SQL:2023 array and multiset standards
>
> * Introduced a new boolean field, null_element, in the int4hashset_t type.
>
> * Rename hashset_count() to hashset_cardinality().
>
> * Rename hashset_merge() to hashset_union().
>
> * Rename hashset_equals() to hashset_eq().
>
> * Rename hashset_neq() to hashset_ne().
>
> * Add hashset_to_sorted_array().
>
> * Handle null semantics to work as in arrays and multisets.
>
> * Update int4hashset_add() to allow creating a new set if none exists.
>
> * Use more portable int32 typedef instead of int32_t.
>
> This also adds a thorough test suite in array-and-multiset-semantics.sql,
> which aims to test all relevant combinations of operations and values.
>
>  Makefile                                       |   2 +-
>  README.md                                      |   6 ++--
>  hashset--0.0.1.sql                             |  37 +++-
>  hashset-api.c                                  | 208 ++--
>  hashset.c                                      |  12 ++-
>  hashset.h                                      |  11 +++---
>  test/expected/array-and-multiset-semantics.out | 365 ++
>  test/expected/basic.out                        |  12 +++
>  test/expected/reported_bugs.out                |   6 ++--
>  test/expected/strict.out                       | 114 
>  test/expected/table.out                        |   8 ++---
>  test/sql/array-and-multiset-semantics.sql      | 232 +
>  test/sql/basic.sql                             |   4 +--
>  test/sql/benchmark.sql                         |  14 
>  test/sql/reported_bugs.sql                     |   6 ++--
>  test/sql/strict.sql                            |  32 -
>  test/sql/table.sql                             |   2 +-
>  17 files changed, 823 insertions(+), 248 deletions(-)
>
> /Joel

Hi there.
I changed the function hashset_contains to strict,
and also changed the way to return an empty array.

in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
int4hashset can speed up distinct aggregates and distinct counts,
like the following?

explain(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3

explain(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3

explain(costs off,timing off, analyze,buffers)
select count(distinct rnd) from benchmark_input_100k \watch c=3

explain(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
benchmark_input_100k) sub(x) \watch c=3
From 9030adbf9e46f66812fb11849c367bbcf5b3a427 Mon Sep 17 00:00:00 2001
From: pgaddict 
Date: Wed, 28 Jun 2023 14:09:58 +0800
Subject: [PATCH] make int4hashset_contains strict and header file changes.

---
 hashset--0.0.1.sql |  2 +-
 hashset-api.c  | 20 
 2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/hashset--0.0.1.sql b/hashset--0.0.1.sql
index d0478ce9..d448ee69 100644
--- a/hashset--0.0.1.sql
+++ b/hashset--0.0.1.sql
@@ -55,7 +55,7 @@ LANGUAGE C IMMUTABLE;
 CREATE OR REPLACE FUNCTION hashset_contains(int4hashset, int)
 RETURNS boolean
 AS 'hashset', 'int4hashset_contains'
-LANGUAGE C IMMUTABLE;
+LANGUAGE C IMMUTABLE STRICT;
 
 CREATE OR REPLACE FUNCTION hashset_union(int4hashset, int4hashset)
 RETURNS int4hashset
diff --git a/hashset-api.c b/hashset-api.c
index a4beef4e..ff948b55 

Re: Do we want a hashset type?

2023-06-27 Thread Joel Jacobson
On Tue, Jun 27, 2023, at 10:26, Joel Jacobson wrote:
> Attachments:
> * hashset-0.0.1-b7e5614-full.patch
> * hashset-0.0.1-b7e5614-incremental.patch

To help verify the semantics, I thought it might be helpful to provide a
comprehensive set of examples that try to cover all the different ways of
varying the arguments to the functions.

Please let me know if you find any possible errors or if you think it looks
good.

SELECT NULL::int4hashset;
 int4hashset
-

(1 row)

SELECT '{}'::int4hashset;
 int4hashset
-
 {}
(1 row)

SELECT int4hashset();
 int4hashset
-
 {}
(1 row)

SELECT '{NULL}'::int4hashset;
 int4hashset
-
 {NULL}
(1 row)

SELECT '{NULL,NULL}'::int4hashset;
 int4hashset
-
 {NULL}
(1 row)

SELECT '{1,3,2,NULL,2,NULL,3,1}'::int4hashset;
 int4hashset
--
 {2,1,3,NULL}
(1 row)

SELECT hashset_add(NULL, NULL);
 hashset_add
-
 {NULL}
(1 row)

SELECT hashset_add(NULL, 1);
 hashset_add
-
 {1}
(1 row)

SELECT hashset_add('{}', 1);
 hashset_add
-
 {1}
(1 row)

SELECT hashset_add('{NULL}', 1);
 hashset_add
-
 {1,NULL}
(1 row)

SELECT hashset_add('{1}', 1);
 hashset_add
-
 {1}
(1 row)

SELECT hashset_add('{1}', 2);
 hashset_add
-
 {1,2}
(1 row)

SELECT hashset_add('{1}', NULL);
 hashset_add
-
 {1,NULL}
(1 row)

SELECT hashset_contains(NULL, NULL);
 hashset_contains
--

(1 row)

SELECT hashset_contains('{}', NULL);
 hashset_contains
--
 f
(1 row)

SELECT hashset_contains('{NULL}', NULL);
 hashset_contains
--

(1 row)

SELECT hashset_contains('{1}', 1);
 hashset_contains
--
 t
(1 row)

SELECT hashset_contains('{1,NULL}', 1);
 hashset_contains
--
 t
(1 row)

SELECT hashset_contains('{1}', 2);
 hashset_contains
--
 f
(1 row)

SELECT hashset_contains('{1,NULL}', 2);
 hashset_contains
--

(1 row)

SELECT hashset_to_array(NULL);
 hashset_to_array
--

(1 row)

SELECT hashset_to_array('{}');
 hashset_to_array
--
 {}
(1 row)

SELECT hashset_to_array('{NULL}');
 hashset_to_array
--
 {NULL}
(1 row)

SELECT hashset_to_array('{3,1,NULL,2}');
 hashset_to_array
--
 {1,3,2,NULL}
(1 row)

SELECT hashset_to_sorted_array(NULL);
 hashset_to_sorted_array
-

(1 row)

SELECT hashset_to_sorted_array('{}');
 hashset_to_sorted_array
-
 {}
(1 row)

SELECT hashset_to_sorted_array('{NULL}');
 hashset_to_sorted_array
-
 {NULL}
(1 row)

SELECT hashset_to_sorted_array('{3,1,NULL,2}');
 hashset_to_sorted_array
-
 {1,2,3,NULL}
(1 row)

SELECT hashset_cardinality(NULL);
 hashset_cardinality
-

(1 row)

SELECT hashset_cardinality('{}');
 hashset_cardinality
-
   0
(1 row)

SELECT hashset_cardinality('{NULL}');
 hashset_cardinality
-
   1
(1 row)

SELECT hashset_cardinality('{NULL,NULL}');
 hashset_cardinality
-
   1
(1 row)

SELECT hashset_cardinality('{1}');
 hashset_cardinality
-
   1
(1 row)

SELECT hashset_cardinality('{1,1}');
 hashset_cardinality
-
   1
(1 row)

SELECT hashset_cardinality('{1,2}');
 hashset_cardinality
-
   2
(1 row)

SELECT hashset_cardinality('{1,2,NULL}');
 hashset_cardinality
-
   3
(1 row)

SELECT hashset_union(NULL, NULL);
 hashset_union
---

(1 row)

SELECT hashset_union(NULL, '{}');
 hashset_union
---

(1 row)

SELECT hashset_union('{}', NULL);
 hashset_union
---

(1 row)

SELECT hashset_union('{}', '{}');
 hashset_union
---
 {}
(1 row)

SELECT hashset_union('{}', '{NULL}');
 hashset_union
---
 {NULL}
(1 row)

SELECT hashset_union('{NULL}', '{}');
 hashset_union
---
 {NULL}
(1 row)

SELECT hashset_union('{NULL}', '{NULL}');
 hashset_union
---
 {NULL}
(1 row)

SELECT hashset_union('{}', '{1}');
 hashset_union
---
 {1}
(1 row)

SELECT hashset_union('{1}', '{}');
 hashset_union
---
 {1}
(1 row)

SELECT hashset_union('{1}', '{1}');
 hashset_union
---
 {1}
(1 row)

SELECT hashset_union('{1}', NULL);
 hashset_union
---

(1 row)

SELECT hashset_union(NULL, '{1}');
 hashset_union
---

(1 row)

SELECT hashset_union('{1}', '{NULL}');
 hashset_union
---
 {1,NULL}
(1 row)

SELECT hashset_union('{NULL}', '{1}');
 hashset_union
---
 {1,NULL}
(1 row)

SELECT hashset_union('{1}', '{2}');
 hashset_union
---
 {1,2}
(1 row)

SELECT hashset_union('{1,2}', '{2,3}');
 hashset_union
---
 {3,1,2}
(1 row)

SELECT hashset_intersection(NULL, NULL);
 

Re: Do we want a hashset type?

2023-06-26 Thread jian he
On Tue, Jun 27, 2023 at 4:55 AM Joel Jacobson  wrote:
>
> On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> > Can you try to glue the attached to the hashset data type input
> > function?
> > The attached will parse a cstring with or without double quotes, so
> > '{1,2,3}' == '{"1","2","3"}'; obviously quoting will preserve the inner
> > string as is.
> > Currently int4hashset input is delimited by commas; if you want to deal
> > with ranges then you need to escape the comma.
>
> Not sure what you're trying to do here; what's the problem with
> the current int4hashset_in()?
>
> I think it might be best to focus on null semantics / three-valued logic
> before moving on and trying to implement support for more types,
> otherwise we would need to rewrite more code if we find general
> thinkos that are problems in all types.
>
> Help wanted to reason about what the following queries should return:
>
> SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
>
> Should they return NULL, the empty set or something else?
>
> I've renamed hashset_merge() -> hashset_union() to better match
> SQL's MULTISET feature which has a MULTISET UNION.
>
> /Joel

in SQLMultiSets.pdf (mentioned earlier in the thread) I found a related
explanation on pages 45 and 46.

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)

CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
(OP2) AS T2 (V) ) END)

On page 11:
>
> Unlike the corresponding table operators UNION, INTERSECT and EXCEPT, we have 
> chosen ALL as the default, since this is the most natural interpretation of 
> MULTISET UNION, etc

Also on page 11 there is the aggregate name FUSION. (I like the name.)




Re: Do we want a hashset type?

2023-06-26 Thread Kirk Wolak
On Mon, Jun 26, 2023 at 4:55 PM Joel Jacobson  wrote:

> On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> > Can you try to glue the attached to the hashset data type input
> > function?
> > The attached will parse a cstring with or without double quotes, so
> > '{1,2,3}' == '{"1","2","3"}'; obviously quoting will preserve the inner
> > string as is.
> > Currently int4hashset input is delimited by commas; if you want to deal
> > with ranges then you need to escape the comma.
>
> Not sure what you're trying to do here; what's the problem with
> the current int4hashset_in()?
>
> I think it might be best to focus on null semantics / three-valued logic
> before moving on and trying to implement support for more types,
> otherwise we would need to rewrite more code if we find general
> thinkos that are problems in all types.
>
> Help wanted to reason about what the following queries should return:
>
> SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
>
> Should they return NULL, the empty set or something else?
>
> I've renamed hashset_merge() -> hashset_union() to better match
> SQL's MULTISET feature which has a MULTISET UNION.
>

Shouldn't they return the same thing that left(NULL::text,1) returns?
(NULL)...
Typically any operation on NULL is NULL.

Kirk...


Re: Do we want a hashset type?

2023-06-26 Thread Joel Jacobson
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> Can you try to glue the attached to the hashset data type input
> function?
> The attached will parse a cstring with or without double quotes, so
> '{1,2,3}' == '{"1","2","3"}'; obviously quoting will preserve the inner
> string as is.
> Currently int4hashset input is delimited by commas; if you want to deal
> with ranges then you need to escape the comma.

Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?

I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.

Help wanted to reason about what the following queries should return:

SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);

Should they return NULL, the empty set or something else?

I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION.

/Joel




Re: Do we want a hashset type?

2023-06-26 Thread jian he
On Mon, Jun 26, 2023 at 4:36 AM Joel Jacobson  wrote:
>
> On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
> > SELECT hashset_contains('{}'::int4hashset, NULL::int);
> >
> > would be False, according to the General Rules.
> >
> ...
> > Applying the same rules, we'd have to return Unknown (which we
> > represent as null) for:
> >
> > SELECT hashset_contains('{null}'::int4hashset, NULL::int);
> >
>
> Aha! I just discovered to my surprise that the corresponding array
> queries give the same result:
>
> SELECT NULL = ANY(ARRAY[]::int[]);
>  ?column?
> --
>  f
> (1 row)
>
> SELECT NULL = ANY(ARRAY[NULL]::int[]);
>  ?column?
> --
>
> (1 row)
>
> I have no more objections; let's stick to the same null semantics as
> arrays and multisets.
>
> /Joel

Can you try to glue the attached to the hashset data type input function?
The attached will parse a cstring with or without double quotes, so
'{1,2,3}' == '{"1","2","3"}'; obviously quoting will preserve the inner
string as is. Currently int4hashset input is delimited by commas; if you
want to deal with ranges then you need to escape the comma.
/*

gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c /home/jian/Desktop/regress_pgsql/input_validate.c
gcc -shared  -o /home/jian/Desktop/regress_pgsql/input_validate.so /home/jian/Desktop/regress_pgsql/input_validate.o

CREATE OR REPLACE FUNCTION str_delim_count_validate(cstring) RETURNS BOOL SET search_path from current
AS '/home/jian/Desktop/regress_pgsql/input_validate', 'str_delim_count_validate'
LANGUAGE C IMMUTABLE;

select str_delim_count_validate('{"23890","2","3",  "a",1,2,3,4,NULL,2022-01-01,"[1,2]"}');
select str_delim_count_validate('{"3 ", }'); --fail
select str_delim_count_validate('{"3 " }'); --ok
select str_delim_count_validate('{"""23890"}'); --fail.
select str_delim_count_validate('{}'); --ok
select str_delim_count_validate('}');   --fail.
select str_delim_count_validate('{');  --fail.
select str_delim_count_validate('{{}}');  --fail.
select str_delim_count_validate('{"22022-01-01,[1,2]}'); --fail.
select str_delim_count_validate('{" 2022-01-01 "}'); --ok
select str_delim_count_validate('{ 2022-01-01}'); --ok
select str_delim_count_validate('{  2022-01-01  ,"[1,2]"}  '); --ok
select str_delim_count_validate('{ 2023-06-26 16:45:02.454293+08   ,"2","3"}'); --ok.
select str_delim_count_validate('{"\\t"}');--ok
*/
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"
#include "utils/numeric.h"
#include "funcapi.h"
#include "utils/lsyscache.h"
#include "utils/fmgrprotos.h"
#include "common/hashfn.h"
PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(str_delim_count_validate);
static	int SetCount(const char *str, char typdelim, Node *escontext);
static bool ReadSetStr(char *arrayStr,const char *origStr, char typdelim, Node *escontext); 
static bool set_isspace(char ch);

Datum
str_delim_count_validate(PG_FUNCTION_ARGS)
{
    char       *string = PG_GETARG_CSTRING(0);
    char       *string_save;
    char       *p;
    char        typdelim = ',';
    int         nitems;

    /* Make a modifiable copy of the input */
    string_save = pstrdup(string);
    p = string_save;

    nitems = SetCount(p, typdelim, fcinfo->context);

    if (!ReadSetStr(p, string, typdelim, fcinfo->context))
        elog(INFO, "delimiting the string failed");

    elog(INFO, "line %d nitems=%d", __LINE__, nitems);
    PG_RETURN_BOOL(true);
}


/*
 * set_isspace() --- a non-locale-dependent isspace()
 *
 * We used to use isspace() for parsing array values, but that has
 * undesirable results: an array value might be silently interpreted
 * differently depending on the locale setting.  Now we just hard-wire
 * the traditional ASCII definition of isspace().
 */
static bool
set_isspace(char ch)
{
	if (ch == ' ' ||
		ch == '\t' ||
		ch == '\n' ||
		ch == '\r' ||
		ch == '\v' ||
		ch == '\f')
		return true;
	return false;
}

static bool
ReadSetStr(char *arrayStr,const char *origStr, char typdelim, Node *escontext)
{
    int         i;
    char       *srcptr;
    bool        in_quotes = false;
    bool        eoArray = false;
    bool        hasnull;
    int32       totbytes;
    int         indx = 0;

	/*
	 * We have to remove " and \ characters to create a clean item value to
	 * pass to the datatype input routine.  We overwrite each item value
	 * in-place within arrayStr to do this.  srcptr is the current scan point,
	 * and dstptr is where we are copying to.
	 *
	 * We also want to suppress leading and trailing unquoted whitespace. We
	 * use the leadingspace flag to suppress leading space.  Trailing space is
	 * tracked by using dstendptr to point to the last significant output
	 * character.
	 *
	 * The error checking in this routine is mostly pro-forma, since we expect
	 * that SetCount() already validated the string.  So we don't bother
	 * with errdetail messages.
	 */
srcptr  = arrayStr;

Re: Do we want a hashset type?

2023-06-25 Thread jian he
On Mon, Jun 26, 2023 at 2:56 AM Tomas Vondra wrote:
>
>
>
> On 6/25/23 15:32, jian he wrote:
> >> Or maybe I just don't understand the proposal. Perhaps it'd be best if
> >> jian wrote a patch illustrating the idea, and showing how it performs
> >> compared to the current approach.
> >
> > Currently joel's idea is an int4hashset, based on the code Tomas first
> > wrote. It looks like a non-nested collection of unique int4 values.
> > The external text format looks like {int4, int4, int4}, and the
> > structure looks like (header + capacity slots * int4).
> > Within the capacity slots, some slots are empty and some hold unique
> > values.
> >
> > The textual int4hashset looks like a one-dimensional array,
> > so I copied/imitated the src/backend/utils/adt/arrayfuncs.c code and
> > rewrote slightly generic hashset input and output functions.
> >
> > see the attached c file.
> > It works fine for non-null input/output for {int4hashset, int8hashset,
> > timestamphashset, intervalhashset, uuidhashset}.
>
> So how do you define a table with a "set" column? I mean, with the
> original patch we could have done
>
>CREATE TABLE (a int4hashset);
>
> and then store / query this. How do you do that with this approach?
>
> I've looked at the patch only very briefly - it's really difficult to
> grok such patches - large, with half the comments possibly obsolete etc.
> So what does reusing the array code give us, really?
>
> I'm not against reusing some of the array code, but arrays seem to be
> much more elaborate (multiple dimensions, ...) so the code needs to do
> significantly more stuff in various cases.
>
> When I previously suggested that maybe we should get "inspiration" from
> the array code, I was mostly talking about (a) type polymorphism, i.e.
> doing sets for arbitrary types, and (b) integrating this into grammar
> (instead of using functions).
>
> I don't see how copying arrayfuncs.c like this achieves either of these
> things. It still hardcodes just a handful of selected data types, and
> the array polymorphism relies on automatic creation of array type for
> every scalar type.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

You are right.
I misread the part of sql-createtype.html about the type input_function
taking 3 arguments (cstring, oid, integer).
I thought that while creating data types, I could pass different params to
the input_function.


Re: Do we want a hashset type?

2023-06-25 Thread Joel Jacobson
On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
> SELECT hashset_contains('{}'::int4hashset, NULL::int);
>
> would be False, according to the General Rules.
>
...
> Applying the same rules, we'd have to return Unknown (which we represent as
> null) for:
>
> SELECT hashset_contains('{null}'::int4hashset, NULL::int);
>

Aha! I just discovered to my surprise that the corresponding array
queries give the same result:

SELECT NULL = ANY(ARRAY[]::int[]);
 ?column?
--
 f
(1 row)

SELECT NULL = ANY(ARRAY[NULL]::int[]);
 ?column?
--

(1 row)

I have no more objections; let's stick to the same null semantics as arrays and 
multisets.

/Joel




Re: Do we want a hashset type?

2023-06-25 Thread Tomas Vondra



On 6/25/23 15:32, jian he wrote:
>> Or maybe I just don't understand the proposal. Perhaps it'd be best if
>> jian wrote a patch illustrating the idea, and showing how it performs
>> compared to the current approach.
> 
> Currently joel's idea is an int4hashset, based on the code Tomas first
> wrote. It looks like a non-nested collection of unique int4 values.
> The external text format looks like {int4, int4, int4}, and the
> structure looks like (header + capacity slots * int4).
> Within the capacity slots, some slots are empty and some hold unique
> values.
> 
> The textual int4hashset looks like a one-dimensional array,
> so I copied/imitated the src/backend/utils/adt/arrayfuncs.c code and
> rewrote slightly generic hashset input and output functions.
> 
> see the attached c file.
> It works fine for non-null input/output for {int4hashset, int8hashset,
> timestamphashset, intervalhashset, uuidhashset}.

So how do you define a table with a "set" column? I mean, with the
original patch we could have done

   CREATE TABLE (a int4hashset);

and then store / query this. How do you do that with this approach?

I've looked at the patch only very briefly - it's really difficult to
grok such patches - large, with half the comments possibly obsolete etc.
So what does reusing the array code give us, really?

I'm not against reusing some of the array code, but arrays seem to be
much more elaborate (multiple dimensions, ...) so the code needs to do
significantly more stuff in various cases.

When I previously suggested that maybe we should get "inspiration" from
the array code, I was mostly talking about (a) type polymorphism, i.e.
doing sets for arbitrary types, and (b) integrating this into grammar
(instead of using functions).

I don't see how copying arrayfuncs.c like this achieves either of these
things. It still hardcodes just a handful of selected data types, and
the array polymorphism relies on automatic creation of array type for
every scalar type.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-25 Thread jian he
> Or maybe I just don't understand the proposal. Perhaps it'd be best if
> jian wrote a patch illustrating the idea, and showing how it performs
> compared to the current approach.

Currently joel's idea is an int4hashset, based on the code Tomas first wrote.
It looks like a non-nested collection of unique int4 values. The external
text format looks like {int4, int4, int4}, and the structure looks like
(header + capacity slots * int4). Within the capacity slots, some slots are
empty and some hold unique values.

The textual int4hashset looks like a one-dimensional array,
so I copied/imitated the src/backend/utils/adt/arrayfuncs.c code and rewrote
slightly generic hashset input and output functions.

See the attached C file.
It works fine for non-null input/output for {int4hashset, int8hashset,
timestamphashset, intervalhashset, uuidhashset}.
/*
gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c /home/jian/Desktop/regress_pgsql/set.c
gcc -shared  -o /home/jian/Desktop/regress_pgsql/set.so /home/jian/Desktop/regress_pgsql/set.o

CREATE OR REPLACE FUNCTION set_in_out_test(cstring,int, int) RETURNS BOOL SET search_path from current
AS '/home/jian/Desktop/regress_pgsql/set', 'set_in_out_test'
LANGUAGE C IMMUTABLE;

select  set_in_out_test('{111,-0,+0,-2147483648,2147483647}',23,-1);
select  set_in_out_test('{}',23,-1);
select  set_in_out_test('{-111,-9223372036854775808,9223372036854775807,-0,+0}',20,-1);
select  set_in_out_test('{2022-01-01T11:21:21.741077 +05:30, 2022-01-01T11:21:21.741076 +05:30,-infinity,+infinity}',1114,-1);
select  set_in_out_test('{"1hour", " 0:00","-0:00","2000:10:01.23456789"}',1186,-1); 
select  set_in_out_test('{03aadb61-3e30-4112-a46a-8cd72f29876b,2c1e4b2c-b8f4-470a-843f-dd094e4743a6
			,ef1a3202-a84f-4321-ab3c-45e04bf4c42d,2c1e4b2c-b8f4-470a-843f-dd094e4743a6}',2950,-1);

---parse fail cases.
select  set_in_out_test('{{-111,-9223372036854775808,9223372036854775807,-0,+0}}',20,-1);
select  set_in_out_test('{-111,-9223372036854775808,9223372036854775807,-0,+0.1}',20,-1);
select  set_in_out_test('{-111,-9223372036854775808,9223372036854775808,-0,+0.1}',20,-1);
select  set_in_out_test('{2022-01-01T11:21:21.7410as +05:30}',1114,-1);
select  set_in_out_test('{2022-01-01T11:21:21.7410 +05:3a}',1114,-1);
select  set_in_out_test('{2022-01-01T11:21:21.7410 +0x:31}',1114,-1);
select  set_in_out_test('{{2022-01-01T11:21:21.7410 +01:31}}',1114,-1);

select  set_in_out_test('{NULL,1,2,NULL,NULL}',23,-1);

*/
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"
// #include "utils/array.h"
#include "utils/numeric.h"
#include "utils/timestamp.h"
#include "funcapi.h"
#include "utils/lsyscache.h"
#include "utils/fmgrprotos.h"
#include "common/hashfn.h"
#include "utils/uuid.h"
PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(set_in_out_test);
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
#define HASHSET_STEP 13
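/* Worked example: CEIL_DIV rounds an integer division upward; it sizes
 * the per-slot bitmaps, e.g. a 10-slot set needs CEIL_DIV(10, 8) =
 * (10 + 7) / 8 = 2 bitmap bytes. */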

#define PG_RETURN_SETTYPE_P(x)	  PG_RETURN_POINTER(x)

/*
 * Arrays are varlena objects, so must meet the varlena convention that
 * the first int32 of the object contains the total object size in bytes.
 * Be sure to use VARSIZE() and SET_VARSIZE() to access it, though!
 *
 * CAUTION: if you change the header for ordinary arrays you will also
 * need to change the headers for oidvector and int2vector!
 */
typedef struct SetType
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	int32		capacity;		/* total number of slots */
	int32		dataoffset;		/* offset to data, or 0 if no bitmap */
	int32		nelements;		/* number of items added to the hashset */
	Oid			elemtype;		/* element type OID */
} SetType;

#define SET_SIZE(a)         VARSIZE(a)
#define SET_ITEM(a)         ((a)->nelements)
#define SET_CAPACITY(a)     ((a)->capacity)
#define SET_HASNULL(a)      ((a)->dataoffset != 0)
#define SET_ELEMTYPE(a)     ((a)->elemtype)

#define SET_OVERHEAD_NONULLS(capacity) \
    MAXALIGN(sizeof(SetType) + CEIL_DIV(capacity, 8))
#define SET_OVERHEAD_WITHNULLS(capacity, nelements) \
    MAXALIGN(sizeof(SetType) + CEIL_DIV(capacity, 8) + \
             ((nelements) + 7) / 8)
#define SET_BITMAP(a) \
    (((char *) (a)) + sizeof(SetType))
#define SET_DATA_OFFSET(a) \
    (SET_HASNULL(a) ? (a)->dataoffset : SET_OVERHEAD_NONULLS(SET_CAPACITY(a)))
#define SET_NULLBITMAP(a) \
    (SET_HASNULL(a) ? \
     (bits8 *) (((char *) (a)) + sizeof(SetType) + \
                ((a)->nelements + 7) / 8) \
     : (bits8 *) NULL)

/*
 * Returns a pointer to the actual array data.
 */
#define SET_DATA_PTR(a) \
		(((char *) (a)) + SET_DATA_OFFSET(a))

typedef struct SetMetaState
{
	Oid			element_type;
	int16		typlen;
	bool		typbyval;
	char		typalign;
	char		typdelim;
	Oid			typioparam;
	Oid			typiofunc;
	FmgrInfo	proc;
} SetMetaState;


static int  SetCount(const char *str, int *dim, char typdelim, Node *escontext); 
static bool set_get_isnull(const bits8 *nullbitmap, int offset);
SetType *set_add_int32(SetType *set, Datum dvalue);
SetType *set_add_int64(SetType *set, Datum dvalue);

Re: Do we want a hashset type?

2023-06-25 Thread Joel Jacobson
On Sat, Jun 24, 2023, at 21:16, Joel Jacobson wrote:
> New version of int4hashset_contains() that should follow the same
> General Rules as MULTISET's MEMBER OF (8.16 <member predicate>).
...
> SELECT hashset_contains('{}'::int4hashset, NULL::int); -- false
...
> SELECT hashset_contains('{null}'::int4hashset, NULL::int); -- null

When it comes to SQL, the general rule of thumb is that expressions and
functions handling null usually return the null value. This is why it might
feel a bit out of the ordinary to return False when checking if an empty set
contains NULL.

However, that's my understanding of the General Rules on page 553 of 
ISO/IEC 9075-2:2023(E). Rule 3 Case a) specifically states:

"If N is 0 (zero), then the  is False.",

where N is the cardinality, and for an empty set, that's 0 (zero).

Rule 3 Case b) goes on to say:

"If at least one of XV and MV is the null value, then the 
 is Unknown."

But since b) follows a), and the condition for a) already matches, b) is out of 
the running. This leads me to believe that the result of:

SELECT hashset_contains('{}'::int4hashset, NULL::int);

would be False, according to the General Rules.

Now, this is based on the assumption that the Case conditions are evaluated in 
sequence, stopping at the first match. Does that assumption hold water?

Applying the same rules, we'd have to return Unknown (which we represent as
null) for:

SELECT hashset_contains('{null}'::int4hashset, NULL::int);

Here, since the cardinality N is 1, Case a) doesn't apply, but Case b) does 
since XV is null.

Looking ahead, we're entertaining the possibility of a future SET SQL-syntax
feature and wondering how our hashset type could be adapted to be compatible and
reusable for such a development. It's a common prediction that any future SET
syntax feature would probably operate on Three-Valued Logic. Therefore, it's key
for our hashset to handle null values, whether storing, identifying, or adding
them.

But here's my two cents, and remember it's just a personal viewpoint. I'm not
so sure that the hashset type functions need to mirror the corresponding
MULTISET language constructs exactly. In my book, our hashset catalog
functions could take a more clear-cut route with null handling, as long as
our data structure is prepared to handle null values.

Think about this possibility:

hashset_contains_null(int4hashset) -> boolean
hashset_add_null(int4hashset) -> int4hashset
hashset_contains(..., NULL) -> ERROR
hashset_add(..., NULL) -> ERROR

In my mind, this explicit null handling could simplify things, clear up any
potential confusion, and at the same time pave the way for compatibility with
any future SET SQL-syntax feature.
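
As a rough sketch, the first two could look something like this (assuming
the null_element field from the current patch; untested):

Datum
int4hashset_contains_null(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set = PG_GETARG_INT4HASHSET(0);

    /* true iff the set stores the null element */
    PG_RETURN_BOOL(set->null_element);
}

Datum
int4hashset_add_null(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set = PG_GETARG_INT4HASHSET_COPY(0);

    /* mark the null element present and return the modified copy */
    set->null_element = true;
    PG_RETURN_POINTER(set);
}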

Thoughts?

/Joel




Re: Do we want a hashset type?

2023-06-24 Thread Joel Jacobson
New version of int4hashset_contains() that should follow the same
General Rules as MULTISET's MEMBER OF (8.16 <member predicate>).

The first rule is to return False if the cardinality is 0 (zero).
However, we must first check if the first argument is null,
in which case the cardinality cannot be 0 (zero),
so if the first argument is null then we return Unknown
(represented as null).

We then proceed and check if the set is empty,
which is defined as nelements being 0 (zero)
as well as the new null_element field being false.
If the set is empty, then we always return False,
regardless of the second argument, that is,
even if it would be null we would still return False,
since the set is empty and can therefore not contain
any element.

The second rule is to return Unknown (represented as null)
if any of the arguments are null. We've already checked that
the first argument is not null, so now we check the second
argument, and return Unknown (represented as null) if it is null.

The third rule is to check for the element, and return True if
the set contains the element. Otherwise, if the set contains
the null element, we don't know if the element we're checking
for is in the set, so we then return Unknown (represented as null).
Finally, if the set doesn't contain the null element and nor the
element we're checking for, then we return False.

Datum
int4hashset_contains(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set;
    int32           value;
    bool            result;

    /* a null set: the cardinality is unknown, so return Unknown (null) */
    if (PG_ARGISNULL(0))
        PG_RETURN_NULL();

    set = PG_GETARG_INT4HASHSET(0);

    /* Rule 3 a): an empty set contains nothing, even a null argument */
    if (set->nelements == 0 && !set->null_element)
        PG_RETURN_BOOL(false);

    /* Rule 3 b): a null element argument yields Unknown (null) */
    if (PG_ARGISNULL(1))
        PG_RETURN_NULL();

    value = PG_GETARG_INT32(1);
    result = int4hashset_contains_element(set, value);

    /* not found, but a stored null element makes the answer Unknown */
    if (!result && set->null_element)
        PG_RETURN_NULL();

    PG_RETURN_BOOL(result);
}

Example queries and expected results:

SELECT hashset_contains(NULL::int4hashset, NULL::int); -- null
SELECT hashset_contains(NULL::int4hashset, 1::int); -- null
SELECT hashset_contains('{}'::int4hashset, NULL::int); -- false
SELECT hashset_contains('{}'::int4hashset, 1::int); -- false
SELECT hashset_contains('{null}'::int4hashset, NULL::int); -- null
SELECT hashset_contains('{null}'::int4hashset, 1::int); -- null
SELECT hashset_contains('{1}'::int4hashset, NULL::int); -- null
SELECT hashset_contains('{1}'::int4hashset, 1::int); -- true
SELECT hashset_contains('{1}'::int4hashset, 2::int); -- false

Looks good?

/Joel




Re: Do we want a hashset type?

2023-06-24 Thread Joel Jacobson
On Thu, Jun 22, 2023, at 07:51, Joel Jacobson wrote:
> For instance, how should hashset_count() work?
>
> Given the query,
>
> SELECT hashset_count('{1,2,3,null}'::int4hashset);
>
> Should we,
>
> a) threat NULL as a distinct value and return 4?
>
> b) ignore NULL and return 3?
>
> c) return NULL? (since the presence of NULL can be thought to render 
> the entire count indeterminate)
>
> I think my personal preference is (b) since it is then consistent with 
> how COUNT() works.

Having thought a bit more on this matter, I think it's better to remove
hashset_count(), since the semantics are not obvious, and instead provide a
hashset_cardinality() function that would obviously include a possible null
value in the number of elements:

SELECT hashset_cardinality('{1,2,3,null}'::int4hashset);
4

SELECT hashset_cardinality('{null}'::int4hashset);
1

SELECT hashset_cardinality('{null,null}'::int4hashset);
1

SELECT hashset_cardinality('{}'::int4hashset);
0

SELECT hashset_cardinality(NULL::int4hashset);
NULL
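
A sketch of how that could look at the C level, assuming the nelements and
null_element fields from the patch (the int8 result type is a guess):

Datum
int4hashset_cardinality(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set;

    if (PG_ARGISNULL(0))
        PG_RETURN_NULL();

    set = PG_GETARG_INT4HASHSET(0);

    /* a stored null element counts as one element */
    PG_RETURN_INT64(set->nelements + (set->null_element ? 1 : 0));
}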

Sounds good?

/Joel




Re: Do we want a hashset type?

2023-06-23 Thread Tomas Vondra



On 6/23/23 13:47, Andrew Dunstan wrote:
> 
> On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
>> On Fri, Jun 23, 2023, at 08:40, jian he wrote:
>>> I played around with array_func.c;
>>> much of the code can be used for a multiset data type.
>>> For now I imagine a multiset as something like a one-dimensional array
>>> (nesting is somehow beyond the imagination...).
>> Are you suggesting it might be a better idea to start over completely
>> and work on a new code base that is based on arrayfuncs.c,
>> and aim for MULTISET/SET or anyhashset from start, that would not
>> only support int4/int8/uuid but any type?
>>
> 
> Before we run too far down this rabbit hole, let's discuss the storage
> implications of using multisets. ISTM that for small base datums like
> integers it will be a substantial increase in size, since you'll need an
> addition int for the item count, unless some very clever tricks are played.
> 

I honestly don't quite understand what exactly is meant by the proposal
to "reuse array_func.c for multisets". We're implementing sets, not
multisets (those were mentioned only to illustrate behavior). And the
whole point is that sets are not arrays - no duplicates, ordering does
not matter (so no index).

I mentioned that maybe we can model sets based on arrays (say, gram.y
would do similar stuff for SET[] and ARRAY[], polymorphism), not that we
should store sets as arrays. Would it be possible - maybe, if we extend
arrays to also maintain some hash hash table. But I'd bet that'll just
make arrays more complex, and will make sets slower.

Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.

As for the storage size, I don't think an extra "count" field would make
any measurable difference. If we're storing a hash table, we're bound to
have a couple percent of wasted space due to load factor (likely between
0.75 and 0.9).

> As for this older discussion referred to upthread, if the SQL Standards
> Committee hasn't acted on it by now, it seems reasonable to think they are
> unlikely to.
> 

AFAIK multisets are included in SQL 2023, pretty much matching the draft
we discussed earlier. Yeah, it's unlikely to change in the future.

> Just for reference, here's some description of Oracle's support for
> Multisets from
> :
> 

good to know


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-23 Thread Andrew Dunstan


On 2023-06-23 Fr 04:23, Joel Jacobson wrote:

On Fri, Jun 23, 2023, at 08:40, jian he wrote:

I played around with array_func.c;
much of the code can be used for a multiset data type.
For now I imagine a multiset as something like a one-dimensional array
(nesting is somehow beyond the imagination...).

Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?



Before we run too far down this rabbit hole, let's discuss the storage 
implications of using multisets. ISTM that for small base datums like 
integers it will be a substantial increase in size, since you'll need an 
additional int for the item count, unless some very clever tricks are played.


As for this older discussion referred to upthread, if the SQL Standards 
Committee hasn't acted on it by now, it seems reasonable to think they are 
unlikely to.


Just for reference, here's some description of Oracle's support for 
Multisets from 
:


Multisets in the standard are supported as nested table types in 
Oracle. The Oracle nested table data type based on a scalar type ST is 
equivalent, in standard terminology, to a multiset of rows having a 
single field of type ST and named column_value. The Oracle nested 
table type based on an object type is equivalent to a multiset of 
structured type in the standard.


Oracle supports the following elements of this feature on nested 
tables using the same syntax as the standard has for multisets:


    The CARDINALITY function

    The SET function

    The MEMBER predicate

    The IS A SET predicate

    The COLLECT aggregate

All other aspects of this feature are supported with non-standard 
syntax, as follows:


    To create an empty multiset, denoted MULTISET[] in the standard, 
use an empty constructor of the nested table type.


    To obtain the sole element of a multiset with one element, denoted 
ELEMENT (<multiset value expression>) in the standard, use a scalar 
subquery to select the single element from the nested table.


    To construct a multiset by enumeration, use the constructor of the 
nested table type.


    To construct a multiset by query, use CAST with a multiset 
argument, casting to the nested table type.


    To unnest a multiset, use the TABLE operator in the FROM clause.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-23 Thread jian he
On Fri, Jun 23, 2023 at 4:23 PM Joel Jacobson  wrote:

> On Fri, Jun 23, 2023, at 08:40, jian he wrote:
> > I played around with array_func.c;
> > much of the code can be used for a multiset data type.
> > For now I imagine a multiset as something like a one-dimensional array
> > (nesting is somehow beyond the imagination...).
>
> Are you suggesting it might be a better idea to start over completely
> and work on a new code base that is based on arrayfuncs.c,
> and aim for MULTISET/SET or anyhashset from start, that would not
> only support int4/int8/uuid but any type?
>
> /Joel
>

select prosrc from pg_proc where proname ~*
'(hash.*extended)|(extended.*hash)';
returns around 30 rows,
so it's fairly generic?

I tend to think of a set/multiset as a one-dimensional array, so the textual
input should look like a one-dimensional array.
Use the array_func.c functions to parse and validate the input.
So different types, but one input validation function (see the sketch below).

Does this make sense?
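
A sketch of that idea (the set_parse_element helper is hypothetical; the
lsyscache/fmgr calls are the standard ones array_in itself relies on): one
routine can convert an already-isolated item string for any element type by
dispatching through that type's own input function:

static Datum
set_parse_element(char *item, Oid elemtype)
{
    Oid         typinput;
    Oid         typioparam;
    FmgrInfo    proc;

    /* look up and call the element type's own input function */
    getTypeInputInfo(elemtype, &typinput, &typioparam);
    fmgr_info(typinput, &proc);

    return InputFunctionCall(&proc, item, typioparam, -1);
}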


Re: Do we want a hashset type?

2023-06-23 Thread Joel Jacobson
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
> I played around with array_func.c;
> much of the code can be used for a multiset data type.
> For now I imagine a multiset as something like a one-dimensional array
> (nesting is somehow beyond the imagination...).

Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from start, that would not
only support int4/int8/uuid but any type?

/Joel




Re: Do we want a hashset type?

2023-06-23 Thread jian he
I played around with array_func.c;
much of the code can be used for a multiset data type.
For now I imagine a multiset as something like a one-dimensional array
(nesting is somehow beyond the imagination...).

* A standard varlena array has the following internal structure:
 *   <vl_len_>     - standard varlena header word
 *   <ndim>        - number of dimensions of the array
 *   <dataoffset>  - offset to stored data, or 0 if no nulls bitmap
 *   <elemtype>    - element type OID
 *   <dims>        - length of each array axis (C array of int)
 *   <lbound>      - lower boundary of each dimension (C array of int)
 *   <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
 *   <actual data> - whatever is the stored data
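
For reference, the fixed-size part of that layout is the ArrayType struct
in src/include/utils/array.h; the dimension arrays, null bitmap and data
follow it in memory:

typedef struct ArrayType
{
    int32   vl_len_;     /* varlena header (do not touch directly!) */
    int     ndim;        /* # of dimensions */
    int32   dataoffset;  /* offset to stored data, or 0 if no nulls bitmap */
    Oid     elemtype;    /* element type OID */
} ArrayType;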

In set/multiset we don't need {ndim, lbound}, since we only have one
dimension; we also don't need subscripting.
So for a set we can have the following:
*  int32 vl_len_;    /* varlena header (do not touch directly!) */
*  int32 capacity;   /* # of allocated slots */
*  int32 dataoffset; /* offset to stored data, or 0 if no nulls bitmap */
*  int32 nelements;  /* number of items added to the hashset */
*  Oid   elemtype;   /* element type OID */
 *   - bitmap showing locations of nulls (OPTIONAL)
 *   - bitmap showing whether each slot is used (I am not sure about this part)
 *   - whatever is the stored data
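
Rendered as C, the proposed header could look something like this (a sketch
only; the struct name is made up, and Oid/int32 come from the server headers):

typedef struct SetType
{
    int32  vl_len_;     /* varlena header (do not touch directly!) */
    int32  capacity;    /* # of allocated slots */
    int32  dataoffset;  /* offset to stored data, or 0 if no nulls bitmap */
    int32  nelements;   /* number of items added to the hashset */
    Oid    elemtype;    /* element type OID */
    /* followed by: nulls bitmap (optional), used-slot bitmap, stored data */
} SetType;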

Much of the code in array_func.c can be reused:
array_isspace   ==> set_isspace
ArrayMetaState  ==> SetMetastate
ArrayCount      ==> SetCount (similar to ArrayCount; returns the dimension of
the set, which should be zero (empty set) or one)
ArrayParseState ==> SetParseState
ReadArrayStr    ==> ReadSetStr

Attached is a demo that uses array_func.c to parse a cstring, with a similar
effect to array_in.
For multiset_in, the multiset input function: if no uniqueness is required,
then multiset_in would work just like array_in, so more code can be copied
from array_func.c.
But if uniqueness is required, then we need to first palloc0(capacity * datum
size per type) and then put each valid value into a specific slot?
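
For the unique case, the insert loop could be something along these lines
(a sketch using linear probing; the names are hypothetical, and it assumes
the caller grows the table so capacity always exceeds the element count):

#include <stdbool.h>
#include <stdint.h>

/* Insert value into a fixed-capacity open-addressing table, skipping
 * duplicates.  Returns true if the value was newly added. */
static bool
set_insert_unique(int32_t *slots, bool *used, int capacity, int32_t value)
{
    int pos = (uint32_t) value % capacity;

    while (used[pos])
    {
        if (slots[pos] == value)
            return false;               /* already present */
        pos = (pos + 1) % capacity;     /* linear probing */
    }
    used[pos] = true;
    slots[pos] = value;
    return true;
}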




On Fri, Jun 23, 2023 at 6:27 AM Tomas Vondra 
wrote:

> On 6/22/23 19:52, Joel Jacobson wrote:
> > On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> >> This is also what the SQL standard does for multisets - there's SQL:20nn
> >> draft at http://www.wiscorp.com/SQLStandards.html, and the <member predicate> section (p. 475) explains how this should work with NULL.
> >
> > I've looked again at the paper you mentioned and found something
> intriguing
> > in section 2.6 (b). I'm a bit puzzled about this: why would we want to
> return
> > null when we're certain it's not null but just doesn't have any elements?
> >
> > In the same vein, it says, "If it has more than one element, an
> exception is
> > raised." Makes sense to me, but what about when there are no elements at
> all?
> > Why not raise an exception in that case too?
> >
> > The ELEMENT function is designed to do one simple thing: return the
> element of
> > a multiset if the multiset has only 1 element. This seems very similar
> to how
> > our INTO STRICT operates, right?
> >
>
> I agree this looks a bit weird, but that's what I mentioned - this is an
> initial proposal, outlining the idea. Inevitably some of the stuff
> will get reworked or just left out of the final version. It's useful
> mostly to explain the motivation / goal.
>
> I believe that's the case here - I don't think the ELEMENT got into the
> standard at all, and the NULL rules for the MEMBER OF clause seem not to
> have these strange bits.
>
> > The SQL:20nn seems to still be in draft form, and I can't help but
> wonder if we
> > should propose a bit of an improvement here:
> >
> > "If it doesn't have exactly one element, an exception is raised."
> >
> > Meaning, it would raise an exception both if there are more elements,
> > or zero elements (no elements).
> >
> > I think this would make the semantics more intuitive and less surprising.
> >
>
> Well, the simple truth is the draft is freely available, but you'd need
> to buy the final version. It doesn't mean it's still being worked on or
> that no SQL standard was released since then. In fact, SQL 2023 was
> released a couple weeks ago [1].
>
> It'd be interesting to know the version that actually got into the SQL
> standard (if at all), but I don't have access to the standard yet.
>
> regards
>
>
> [1] https://www.iso.org/standard/76584.html
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
 I recommend David Deutsch's <>

  Jian

/*
gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c /home/jian/Desktop/regress_pgsql/set.c
gcc -shared  -o /home/jian/Desktop/regress_pgsql/set.so /home/jian/Desktop/regress_pgsql/set.o

CREATE OR REPLACE FUNCTION set_in_test(cstring,int, int) RETURNS BOOL SET search_path from current
AS '/home/jian/Desktop/regress_pgsql/set', 'set_in_test'
LANGUAGE C IMMUTABLE;
select  set_in_test('{1,2,3,NULL,NULL, NULL}',23,-1);

*/
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_type.h"
#include "utils/builtins.h"
// #include "utils/array.h"

Re: Do we want a hashset type?

2023-06-22 Thread Tomas Vondra
On 6/22/23 19:52, Joel Jacobson wrote:
> On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
>> This is also what the SQL standard does for multisets - there's SQL:20nn
> draft at http://www.wiscorp.com/SQLStandards.html, and the <member predicate> section (p. 475) explains how this should work with NULL.
> 
> I've looked again at the paper you mentioned and found something intriguing
> in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
> null when we're certain it's not null but just doesn't have any elements?
> 
> In the same vein, it says, "If it has more than one element, an exception is
> raised." Makes sense to me, but what about when there are no elements at all?
> Why not raise an exception in that case too?
> 
> The ELEMENT function is designed to do one simple thing: return the element of
> a multiset if the multiset has only 1 element. This seems very similar to how
> our INTO STRICT operates, right?
> 

I agree this looks a bit weird, but that's what I mentioned - this is an
initial proposal, outlining the idea. Inevitably some of the stuff
will get reworked or just left out of the final version. It's useful
mostly to explain the motivation / goal.

I believe that's the case here - I don't think the ELEMENT got into the
standard at all, and the NULL rules for the MEMBER OF clause seem not to
have these strange bits.

> The SQL:20nn seems to still be in draft form, and I can't help but wonder if 
> we
> should propose a bit of an improvement here:
> 
> "If it doesn't have exactly one element, an exception is raised."
> 
> Meaning, it would raise an exception both if there are more elements,
> or zero elements (no elements).
> 
> I think this would make the semantics more intuitive and less surprising.
> 

Well, the simple truth is the draft is freely available, but you'd need
to buy the final version. It doesn't mean it's still being worked on or
that no SQL standard was released since then. In fact, SQL 2023 was
released a couple weeks ago [1].

It'd be interesting to know the version that actually got into the SQL
standard (if at all), but I don't have access to the standard yet.

regards


[1] https://www.iso.org/standard/76584.html

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-22 Thread Joel Jacobson
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> This is also what the SQL standard does for multisets - there's SQL:20nn
> draft at http://www.wiscorp.com/SQLStandards.html, and the <member predicate> section (p. 475) explains how this should work with NULL.

I've looked again at the paper you mentioned and found something intriguing
in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
null when we're certain it's not null but just doesn't have any elements?

In the same vein, it says, "If it has more than one element, an exception is
raised." Makes sense to me, but what about when there are no elements at all?
Why not raise an exception in that case too?

The ELEMENT function is designed to do one simple thing: return the element of
a multiset if the multiset has only 1 element. This seems very similar to how
our INTO STRICT operates, right?

The SQL:20nn seems to still be in draft form, and I can't help but wonder if we
should propose a bit of an improvement here:

"If it doesn't have exactly one element, an exception is raised."

Meaning, it would raise an exception both if there are more elements,
or zero elements (no elements).

I think this would make the semantics more intuitive and less surprising.

/Joel




Re: Do we want a hashset type?

2023-06-21 Thread Joel Jacobson
On Tue, Jun 20, 2023, at 18:25, Tomas Vondra wrote:
> On 6/20/23 16:56, Joel Jacobson wrote:
>> The reference to consistency with what we do elsewhere might not be entirely
>> applicable in this context, since the set feature we're designing is a new 
>> beast
>> in the SQL landscape.
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.

What I meant was that the SET feature is new, since it doesn't exist in 
PostgreSQL or the SQL standard.

>> I think adhering to the theoretical purity of sets by excluding NULLs aligns 
>> us
>> with set theory, simplifies our code, and parallels set implementations in 
>> other
>> languages.
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.

Hmm, I think it's pure in different ways;
set theory is well established and is based on two-valued logic,
but at the same time SQL's three-valued logic is also well established.

>> I think we have an opportunity here to innovate and potentially influence a
>> future set concept in the SQL standard.
>
> I doubt this going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.

OK. 0% is 1% too low for me to work with. :)

>> However, I see how one could argue against this reasoning, on the basis that
>> PostgreSQL users might be more familiar with and expect NULLs can exist
>> everywhere in all data structures.
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.

Good point.

>> I've been trying hard, but I can't find compelling use-cases where a NULL 
>> element
>> in a set would offer a more natural SQL query than handling NULLs within SQL 
>> and
>> keeping the set NULL-free.
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.

As long as the data model and queries ensure there can never be
any NULLs, fine, then there's no issue.

> And in the graph case, I don't see why you'd have any NULLs, considering
> we're dealing with adjacent nodes, and if there's adjacent node, it's ID
> is not NULL.

Me neither, can't see the need for any NULLs there.

>> Does anyone else have a strong realistic example where including NULLs in the
>> set would simplify the SQL query?
>
> I'm sure there are cases where you have NULLs in the data and need to
> filter them out, but that's just natural consequence of having NULLs. If
> you have them you better know what NULLs do ...

What I tried to find was an example of a case where you wouldn't want to
filter out the NULLs, when you would want to include the NULL
in the set.

If we could just find one such realistic use-case, that would be very
helpful, since it would then completely kill my argument that we couldn't
do without storing a NULL in the set.

> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.

I think it's a trade-off, and I don't have any strong preference for the 
simplicity
of a classical two-valued set-theoretic system vs a three-valued
multiset-based one. I was 51/49 but given your feedback I'm now 49/51.

I think the next step is to think about how the hashset type should work
with three-valued logic, and then implement it to get a feeling for it.

For instance, how should hashset_count() work?

Given the query,

SELECT hashset_count('{1,2,3,null}'::int4hashset);

Should we,

a) treat NULL as a distinct value and return 4?

b) ignore NULL and return 3?

c) return NULL? (since the presence of NULL can be thought to render the entire 
count indeterminate)

I think my personal preference is (b) since it is then consistent with how 
COUNT() works.

/Joel




Re: Do we want a hashset type?

2023-06-20 Thread Tomas Vondra
On 6/20/23 20:08, jian he wrote:
> On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
> ...
>>  http://www.wiscorp.com/sqlmultisets.zip
> 
>> Conceptually, a multiset is an unordered collection of elements, all of the
>> same type, with duplicates permitted. Unlike arrays, a multiset is an
>> unbounded collection, with no declared maximum cardinality. This does not
>> mean that the user can insert elements in a multiset without limit, just
>> that the standard does not mandate that there should be a limit. This is
>> analogous to tables, which have no declared maximum number of rows.
> 
> Postgres arrays don't have size limits.

Right. You can say int[5] but we don't enforce that limit (I haven't
checked why, but presumably because we had arrays before the standard
existed, and it was more like a list in LISP or something.)

> unordered means no need to use subscript?

Yeah - there's no obvious way to subscript the items when there's no
implicit ordering.

> So multiset is a more limited array type?
> 

Yes and no - both are collection types, so there are similarities and
differences. Multiset does not need to keep the ordering, so in this
sense it's a relaxed version of array.


> null is fine. but personally I feel like so far the hashset main
> feature is the quickly aggregate unique value using hashset.
> I found using hashset count distinct (non null values) is quite faster.

True. That's related to fast membership checks.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-20 Thread jian he
On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
 wrote:
>
>
>
> On 6/20/23 16:56, Joel Jacobson wrote:
> > On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> >> On 6/20/23 12:59, Joel Jacobson wrote:
> >>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> >>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> >>>> should return null?
> >>>
> >>> I agree, it should.
> >>>
> >>> I've now changed all functions except int4hashset() (the init function)
> >>> and the aggregate functions to be STRICT.
> >>
> >> I don't think this is correct / consistent with what we do elsewhere.
> >> IMHO it's perfectly fine to have a hashset containing a NULL value,
> >
> > The reference to consistency with what we do elsewhere might not be entirely
> > applicable in this context, since the set feature we're designing is a new 
> > beast
> > in the SQL landscape.
> >
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.
>
> > I think adhering to the theoretical purity of sets by excluding NULLs 
> > aligns us
> > with set theory, simplifies our code, and parallels set implementations in 
> > other
> > languages.
> >
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.
>
> > I think we have an opportunity here to innovate and potentially influence a
> > future set concept in the SQL standard.
> >
>
> I doubt this going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.
>
> > However, I see how one could argue against this reasoning, on the basis that
> > PostgreSQL users might be more familiar with and expect NULLs can exist
> > everywhere in all data structures.
> >
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.
>
> > A different perspective is to look at what use-cases we can foresee.
> >
> > I've been trying hard, but I can't find compelling use-cases where a NULL 
> > element
> > in a set would offer a more natural SQL query than handling NULLs within 
> > SQL and
> > keeping the set NULL-free.
> >
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.
>
> And in the graph case, I don't see why you'd have any NULLs, considering
> we're dealing with adjacent nodes, and if there's adjacent node, it's ID
> is not NULL.
>
> > Does anyone else have a strong realistic example where including NULLs in 
> > the
> > set would simplify the SQL query?
> >
>
> I'm sure there are cases where you have NULLs in the dat aand need to
> filter them out, but that's just natural consequence of having NULLs. If
> you have them you better know what NULLs do ...
>
> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


>  http://www.wiscorp.com/sqlmultisets.zip

> Conceptually, a multiset is an unordered collection of elements, all of the
> same type, with duplicates permitted. Unlike arrays, a multiset is an
> unbounded collection, with no declared maximum cardinality. This does not
> mean that the user can insert elements in a multiset without limit, just
> that the standard does not mandate that there should be a limit. This is
> analogous to tables, which have no declared maximum number of rows.

Postgres arrays don't have size limits.
unordered means no need to use subscript?
So multiset is a more limited array type?

null is fine. But personally I feel like so far the hashset's main
feature is quickly aggregating unique values using a hashset.
I found that using hashset to count distinct (non-null) values is quite a bit faster.




Re: Do we want a hashset type?

2023-06-20 Thread Tomas Vondra



On 6/20/23 16:56, Joel Jacobson wrote:
> On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
>> On 6/20/23 12:59, Joel Jacobson wrote:
>>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
>>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
>>>> should return null?
>>>
>>> I agree, it should.
>>>
>>> I've now changed all functions except int4hashset() (the init function)
>>> and the aggregate functions to be STRICT.
>>
>> I don't think this is correct / consistent with what we do elsewhere.
>> IMHO it's perfectly fine to have a hashset containing a NULL value,
> 
> The reference to consistency with what we do elsewhere might not be entirely
> applicable in this context, since the set feature we're designing is a new 
> beast
> in the SQL landscape.
> 

I don't see how it's new, considering relational algebra is pretty much
based on (multi)sets, and the three-valued logic with NULL values is a
pretty well established part of that.

> I think adhering to the theoretical purity of sets by excluding NULLs aligns 
> us
> with set theory, simplifies our code, and parallels set implementations in 
> other
> languages.
> 

I don't see how that would be more theoretically pure, really. The
three-valued logic is a well established part of relational algebra, so
not respecting that is more a violation of the purity.

> I think we have an opportunity here to innovate and potentially influence a
> future set concept in the SQL standard.
> 

I doubt this is going to influence what the SQL standard says, especially
because it already defined the behavior for MULTISETS (of which the sets
are a special case, pretty much). So this has 0% chance of success.

> However, I see how one could argue against this reasoning, on the basis that
> PostgreSQL users might be more familiar with and expect NULLs can exist
> everywhere in all data structures.
> 

Right, it's what we already do for similar cases, and if you have NULLS
in the data, you better be aware of the behavior. Granted, some people
are surprised by three-valued logic, but using a different behavior for
some new features would just increase the confusion.

> A different perspective is to look at what use-cases we can foresee.
> 
> I've been trying hard, but I can't find compelling use-cases where a NULL 
> element
> in a set would offer a more natural SQL query than handling NULLs within SQL 
> and
> keeping the set NULL-free.
> 

IMO if you have NULL values in the data, you better be aware of it and
handle the case accordingly (e.g. by filtering them out when building
the set). If you don't have NULLs in the data, there's no issue.

And in the graph case, I don't see why you'd have any NULLs, considering
we're dealing with adjacent nodes, and if there's an adjacent node, its ID
is not NULL.

> Does anyone else have a strong realistic example where including NULLs in the
> set would simplify the SQL query?
> 

I'm sure there are cases where you have NULLs in the data and need to
filter them out, but that's just natural consequence of having NULLs. If
you have them you better know what NULLs do ...

It's too early to make any strong statements, but it's going to be hard
to convince me we should handle NULLs differently from what we already
do elsewhere.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-20 Thread Joel Jacobson
On Tue, Jun 20, 2023, at 16:56, Joel Jacobson wrote:
> I think we have an opportunity here to innovate and potentially influence a
> future set concept in the SQL standard.

Adding to my previous note - If there's a worry about future SQL standards
introducing SETs with NULLs, causing compatibility issues, we could address it
proactively. We could set up set functions to throw errors when passed NULL
inputs, rather than being STRICT. This keeps our theoretical alignment now, and
offers a smooth transition if standards evolve.

Considering we have a flag field in the struct, we could use it to indicate
whether a value stored on disk was written with NULL support or not.
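
A sketch of what that could look like, assuming the existing flags field in
int4hashset_t (the flag name and helper function are hypothetical):

/* Hypothetical flag bit recording that this on-disk value was written by a
 * version that can store NULL. */
#define INT4HASHSET_FLAG_NULL_AWARE 0x0001

static bool
int4hashset_written_with_null_support(const int4hashset_t *set)
{
    return (set->flags & INT4HASHSET_FLAG_NULL_AWARE) != 0;
}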

/Joel




Re: Do we want a hashset type?

2023-06-20 Thread Joel Jacobson
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> On 6/20/23 12:59, Joel Jacobson wrote:
>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
>>> should return null?
>> 
>> I agree, it should.
>> 
>> I've now changed all functions except int4hashset() (the init function)
>> and the aggregate functions to be STRICT.
>
> I don't think this is correct / consistent with what we do elsewhere.
> IMHO it's perfectly fine to have a hashset containing a NULL value,

The reference to consistency with what we do elsewhere might not be entirely
applicable in this context, since the set feature we're designing is a new beast
in the SQL landscape.

I think adhering to the theoretical purity of sets by excluding NULLs aligns us
with set theory, simplifies our code, and parallels set implementations in other
languages.

I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.

However, I see how one could argue against this reasoning, on the basis that
PostgreSQL users might be more familiar with and expect NULLs can exist
everywhere in all data structures.

A different perspective is to look at what use-cases we can foresee.

I've been trying hard, but I can't find compelling use-cases where a NULL 
element
in a set would offer a more natural SQL query than handling NULLs within SQL and
keeping the set NULL-free.

Does anyone else have a strong realistic example where including NULLs in the
set would simplify the SQL query?

/Joel




Re: Do we want a hashset type?

2023-06-20 Thread Tomas Vondra



On 6/20/23 14:10, Tomas Vondra wrote:
> ...
>
> This is also what the SQL standard does for multisets - there's SQL:20nn
> draft at http://www.wiscorp.com/SQLStandards.html, and the  predicate> section (p. 475) explains how this should work with NULL.
> 

BTW I just noticed there's also a multiset proposal at the wiscorp page:

   http://www.wiscorp.com/sqlmultisets.zip

It's just the initial proposal and I'm not sure how much it changed over
time, but it likely provides way more context for the choices than the
(rather dry) SQL standard.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-20 Thread Tomas Vondra



On 6/20/23 12:59, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
>> should return null?
> 
> I agree, it should.
> 
> I've now changed all functions except int4hashset() (the init function)
> and the aggregate functions to be STRICT.

I don't think this is correct / consistent with what we do elsewhere.
IMHO it's perfectly fine to have a hashset containing a NULL value,
because then it can affect results of membership checks.

Consider these IN / ANY queries:

test=# select 4 in (1,2,3);
 ?column?
--
 f
(1 row)

test=# select 4 = ANY(ARRAY[1,2,3]);
 ?column?
--
 f
(1 row)

now add a NULL:

test=# select 4 in (1,2,3,null);
 ?column?
--

(1 row)

test=# select 4 = ANY(ARRAY[1,2,3,NULL]);
 ?column?
--

(1 row)

I don't see why a (hash)set should behave any differently. It's true
arrays don't behave like this:

test=# select array[1,2,3,4,NULL] @> ARRAY[5];
 ?column?
--
 f
(1 row)

but I'd say that's more an anomaly than something we should replicate.

This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member predicate> section (p. 475) explains how this should work with NULL.

So if we see a set as a special case of multiset (with no duplicates),
then we have to handle NULLs this way too. It'd be weird to have this
behavior inconsistent.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-20 Thread Joel Jacobson
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> should return null?

I agree, it should.

I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.
I think this patch is OK to send as an incremental one, since it's an isolated 
change:

Apply STRICT to hashset functions; clean up null handling in hashset-api.c

Set hashset functions to be STRICT, thereby letting the system reject null
inputs automatically. This change reflects the nature of hashset as an
implementation of a set-theoretic system, where null values are conceptually
unusual.

Alongside, the hashset-api.c code has been refactored for clarity, consolidating
null checks and assignments into single lines.

A 'strict' test case has been added to account for these changes.

/Joel

hashset-0.0.1-1ee0df0.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-19 Thread Joel Jacobson
On Tue, Jun 20, 2023, at 02:04, Tomas Vondra wrote:
> For UPDATE, it'd be pretty clear too, I think. It's possible to do
>
>UPDATE table SET col = SET[1,2,3]
>
> and it's clear the first is the command SET, while the second is a set
> constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
> possible to do
>
>ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];
>
> Seems clear to me too, I think.
...
> It's a matter of personal taste, I guess. I'm fine with calling function
> API and what not, but a sensible SQL syntax seems nicer.

Now when I see it written out, I actually agree it looks nice.

>> I think it's still meaningful to continue hacking on the int4-type
>> hashset extension, to see if we can agree on the semantics,
>> especially around null handling and sorting.
>> 
>
> Definitely. It certainly was not my intention to derail the work by
> proposing more and more stuff. So feel free to pursue what makes sense
> to you / helps the use case.

OK, cool, and didn't mean at all that you did. I appreciate the long-term
perspective, otherwise our short-term work might go wasted.

> TBH I don't particularly see why we'd want to sort sets.

Me neither; sorting sets in the conventional, visually coherent sense
(i.e., lexicographically) doesn't seem necessary. However, for ORDER BY hashset
functionality, we need a stable and deterministic method.

This can be achieved performance-efficiently by computing a commutative hash of
the hashset, XORing each new value's hash with set->hash:

set->hash ^= hash;

...and then sort primarily by set->hash.

Though resulting in an apparently random order, this approach, already employed
in int4hashset_add_element() and int4hashset_cmp(), ensures a deterministic and
stable sorting order.

I think this is an acceptable trade-off, better than not supporting ORDER BY.
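
In C, the idea boils down to something like this (a sketch assuming the
extension's int4hashset_t with a uint32 hash field; the per-element hash
comes from wherever the set already computes it):

/* XOR is commutative and associative, so the accumulated hash is
 * independent of insertion order. */
static void
set_hash_accumulate(int4hashset_t *set, uint32 element_hash)
{
    set->hash ^= element_hash;
}

/* Compare primarily by the accumulated hash; ties fall through to a full
 * element-wise comparison done by the caller. */
static int
set_cmp_by_hash(const int4hashset_t *a, const int4hashset_t *b)
{
    if (a->hash != b->hash)
        return (a->hash < b->hash) ? -1 : 1;
    return 0;
}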

Jian He had some comments on hashset_cmp() which I will look at.

/Joel




Re: Do we want a hashset type?

2023-06-19 Thread Tomas Vondra



On 6/20/23 00:50, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
>> What unexpected issues you mean? Sure, if someone uses multisets as if
>> they were sets (so ignoring the handling of duplicates), things will go
>> booom! quickly.
> 
> The unexpected issues I had in mind are subtle bugs due to treating multisets
> as sets, which could go undetected due to having no duplicates initially.
> Multisets might initially therefore seem equal, but later diverge due to
> different element counts, leading to hard-to-detect issues.
> 

Understood.

>> I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
>> operators) that'd allow perhaps help with this.
> 
> Might help. But still think providing both structures would be a more 
> foolproof
> solution, offering users the choice to select what's best for their use-case.
> 

Yeah. Not confusing people is better.

>>> Despite SQL's multiset possibility, a distinct hashset type is my 
>>> preference,
>>> helping appropriate data structure choice and reducing misuse.
>>>
>>> The necessity of multisets is vague beyond standards compliance.
>>
>> True - we haven't had any requests/proposal to implement MULTISETs.
>>
>> I've looked at the SQL standard primarily to check if maybe there's some
>> precedent that'd give us guidance on the SQL syntax etc. And I think
>> multisets are that - even if we end up not implementing them, it'd be
>> sad to have unnecessarily inconsistent syntax (in case someone decides
>> to add multisets in the future).
>>
>> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
>> we'd have ARRAY / MULTISET / SET, and the difference between the last
>> two would be just handling of duplicates.
> 
> Is the idea to use the "SET" keyword for the syntax?
> Isn't it a risk that will be confusing, since "SET" is currently
> only used for configuration and update operations?
> 

I haven't tried doing that, so not sure if there would be any conflicts
in the grammar. But I can't think of a case that'd be confusing for
users - when setting internal GUC variables it's a completely different
context, there's no use for SQL-level collections (arrays, sets, ...).

For UPDATE, it'd be pretty clear too, I think. It's possible to do

   UPDATE table SET col = SET[1,2,3]

and it's clear the first is the command SET, while the second is a set
constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
possible to do

   ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];

Seems clear to me too, I think.


>> The other way to look at sets is that they are pretty similar to arrays,
>> except that there are no duplicates and order does not matter. Sure, the
>> on-disk format and code is different, but from the SQL perspective it'd
>> be nice to allow using sets in most places where arrays are allowed
>> (which is what the standard does for MULTISETS, more or less).
>>
>> That'd mean we could probably search through gram.y for places working
>> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
>> make them work with sets too, say by having SET_SUBLINK instead of
>> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>>
>> This might be also "consistent" with defining hashset type using CREATE
>> TYPE with ELEMENT, because we consider the type to be "array". So that
>> would be polymorphic type, but we don't have pre-defined array for every
>> type (and I'm not sure we want to).
>>
>> Of course, maybe there's some fatal flaw in these idea, I don't know.
>> And I don't want to move the goalposts too far - but it seems like this
>> might make some stuff actually simpler to implement (by piggy-backing on
>> the existing array infrastructure).
> 
> I think it's very interesting thoughts and ambitions.
> 
> I wonder though, from a user-perspective, if a new hashset type still
> wouldn't just be considered simpler, than introducing new SQL syntax?
> 

It's a matter of personal taste, I guess. I'm fine with calling function
API and what not, but a sensible SQL syntax seems nicer.

> However, it would be interesting to see how the piggy-backing on the
> existing array infrastructure would look in practise code-wise though.
> 
> I think it's still meaningful to continue hacking on the int4-type
> hashset extension, to see if we can agree on the semantics,
> especially around null handling and sorting.
> 

Definitely. It certainly was not my intention to derail the work by
proposing more and more stuff. So feel free to pursue what makes sense
to you / helps the use case.


TBH I don't particularly see why we'd want to sort sets.

I wonder if the SQL standard says something about these things (for
MULTISETs), especially for the NULL handling. If it does, I'd try to
stick with those rules.

>> A mostly unrelated thought - I wonder if this might be somehow related
>> to the foreign key array patch ([1] might be the most recent attempt in
>> this direction). Not to hashset itself, but I recalled these patches
>> because it'd mean we don't need the separate "edges" link table (so the
>> hashset column would be the thing backing the FK).

Re: Do we want a hashset type?

2023-06-19 Thread Joel Jacobson
On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
> What unexpected issues you mean? Sure, if someone uses multisets as if
> they were sets (so ignoring the handling of duplicates), things will go
> booom! quickly.

The unexpected issues I had in mind are subtle bugs due to treating multisets
as sets, which could go undetected due to having no duplicates initially.
Multisets might therefore initially seem equal, but later diverge due to
different element counts, leading to hard-to-detect issues.

> I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
> operators) that'd allow perhaps help with this.

Might help. But I still think providing both structures would be a more foolproof
solution, offering users the choice to select what's best for their use-case.

>> Despite SQL's multiset possibility, a distinct hashset type is my preference,
>> helping appropriate data structure choice and reducing misuse.
>> 
>> The necessity of multisets is vague beyond standards compliance.
>
> True - we haven't had any requests/proposal to implement MULTISETs.
>
> I've looked at the SQL standard primarily to check if maybe there's some
> precedent that'd give us guidance on the SQL syntax etc. And I think
> multisets are that - even if we end up not implementing them, it'd be
> sad to have unnecessarily inconsistent syntax (in case someone decides
> to add multisets in the future).
>
> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
> we'd have ARRAY / MULTISET / SET, and the difference between the last
> two would be just handling of duplicates.

Is the idea to use the "SET" keyword for the syntax?
Isn't there a risk that it will be confusing, since "SET" is currently
only used for configuration and update operations?

> The other way to look at sets is that they are pretty similar to arrays,
> except that there are no duplicates and order does not matter. Sure, the
> on-disk format and code is different, but from the SQL perspective it'd
> be nice to allow using sets in most places where arrays are allowed
> (which is what the standard does for MULTISETS, more or less).
>
> That'd mean we could probably search through gram.y for places working
> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
> make them work with sets too, say by having SET_SUBLINK instead of
> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>
> This might be also "consistent" with defining hashset type using CREATE
> TYPE with ELEMENT, because we consider the type to be "array". So that
> would be polymorphic type, but we don't have pre-defined array for every
> type (and I'm not sure we want to).
>
> Of course, maybe there's some fatal flaw in these idea, I don't know.
> And I don't want to move the goalposts too far - but it seems like this
> might make some stuff actually simpler to implement (by piggy-backing on
> the existing array infrastructure).

I think these are very interesting thoughts and ambitions.

I wonder, though, from a user perspective, if a new hashset type still
wouldn't just be considered simpler than introducing new SQL syntax?

However, it would be interesting to see how piggy-backing on the
existing array infrastructure would look in practice, code-wise.

I think it's still meaningful to continue hacking on the int4-type
hashset extension, to see if we can agree on the semantics,
especially around null handling and sorting.

> A mostly unrelated thought - I wonder if this might be somehow related
> to the foreign key array patch ([1] might be the most recent attempt in
> this direction). Not to hashset itself, but I recalled these patches
> because it'd mean we don't need the separate "edges" link table (so the
> hashset column would be the think backing the FK).
>
> [1]
> https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com

I remember that one! We tried to revive that one, but didn't manage to keep it 
alive.
It's a really good idea though. It would be good to see if there might be
synergies between arrays and hashsets in this area, since if we envision that
the elements in a hashset will mostly be PKs, it would be nice to enforce
referential integrity.

/Joel





Re: Do we want a hashset type?

2023-06-19 Thread Tom Lane
Andrew Dunstan  writes:
> Yes, Multisets (a.k.a. bags and a large number of other names) would be 
> interesting. But I wouldn't like to abandon pure sets either. Maybe a 
> typmod indicating the allowed multiplicity of the type?

I don't think trying to use typmod to carry fundamental semantic
information will work, because we drop it in too many places
(e.g. there's no way to pass it through a function).  If you want
both sets and multisets, they'll need to be two different container
types, even if code is shared under the hood.

regards, tom lane




Re: Do we want a hashset type?

2023-06-19 Thread Tomas Vondra



On 6/19/23 13:33, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
>> AFAICS the standard only defines arrays and multisets. Arrays are pretty
>> much the thing we have, including the ARRAY[] constructor etc. Multisets
>> are similar to hashset discussed here, except that it tracks the number
>> of elements for each value (which would be trivial in hashset).
>>
>> So if we want to make this a built-in feature, maybe we should aim to do
>> the multiset thing, with the standard SQL syntax? Extending the grammar
>> should not be hard, I think. I'm not sure of the underlying code
>> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
>> lot of separate code doing that.
> 
> Multisets handle duplicates uniquely, this may bring unexpected issues. Sets
> and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
> more fundamental and prevalent in std libs than multisets.
> 

What unexpected issues do you mean? Sure, if someone uses multisets as if
they were sets (so ignoring the handling of duplicates), things will go
booom! quickly.

I imagined (if we ended up doing MULTISET) we'd provide an interface (e.g.
operators) that'd perhaps help with this.

> Despite SQL's multiset possibility, a distinct hashset type is my preference,
> helping appropriate data structure choice and reducing misuse.
> 
> The necessity of multisets is vague beyond standards compliance.
> 

True - we haven't had any requests/proposal to implement MULTISETs.

I've looked at the SQL standard primarily to check if maybe there's some
precedent that'd give us guidance on the SQL syntax etc. And I think
multisets are that - even if we end up not implementing them, it'd be
sad to have unnecessarily inconsistent syntax (in case someone decides
to add multisets in the future).

We could invent "SET" data type, so while standard has ARRAY / MULTISET,
we'd have ARRAY / MULTISET / SET, and the difference between the last
two would be just handling of duplicates.

The other way to look at sets is that they are pretty similar to arrays,
except that there are no duplicates and order does not matter. Sure, the
on-disk format and code is different, but from the SQL perspective it'd
be nice to allow using sets in most places where arrays are allowed
(which is what the standard does for MULTISETS, more or less).

That'd mean we could probably search through gram.y for places working
with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
make them work with sets too, say by having SET_SUBLINK instead of
ARRAY_SUBLINK, set_expression instead of array_expression, etc.

This might be also "consistent" with defining hashset type using CREATE
TYPE with ELEMENT, because we consider the type to be "array". So that
would be polymorphic type, but we don't have pre-defined array for every
type (and I'm not sure we want to).

Of course, maybe there's some fatal flaw in these ideas, I don't know.
And I don't want to move the goalposts too far - but it seems like this
might make some stuff actually simpler to implement (by piggy-backing on
the existing array infrastructure).


A mostly unrelated thought - I wonder if this might be somehow related
to the foreign key array patch ([1] might be the most recent attempt in
this direction). Not to hashset itself, but I recalled these patches
because it'd mean we don't need the separate "edges" link table (so the
hashset column would be the thing backing the FK).

[1]
https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-19 Thread Tomas Vondra
On 6/19/23 13:50, Andrew Dunstan wrote:
> 
> On 2023-06-19 Mo 05:21, Tomas Vondra wrote:
>> On 6/18/23 18:45, Andrew Dunstan wrote:
>>> On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
 New patch is attached, which will henceforth always be a complete patch,
 to avoid the hassle of having to assemble incremental patches.
>>> Cool, thanks.
>>>
>> It might still be convenient to keep it split into smaller, easier to
>> review, parts. A patch that introduces basic functionality and then
>> patches adding various "advanced" features.
>>
>>> A couple of random thoughts:
>>>
>>>
>>> . It might be worth sending a version number with the send function
> (c.f. jsonb_send / jsonb_recv). That way we would not be tied forever
>>> to some wire representation.
>>>
>>> . I think there are some important set operations missing: most notably
>>> intersection, slightly less importantly asymmetric and symmetric
>>> difference. I have no idea how easy these would be to add, but even for
>>> your stated use I should have thought set intersection would be useful
>>> ("Who is a member of both this set of friends and that set of friends?").
>>>
>>> . While supporting int4 only is OK for now, I think we would at least
>>> want to support int8, and probably UUID since a number of systems I know
>>> of use that as an object identifier.
>>>
>> I agree we should aim to support a wider range of data types. Could we
>> have a polymorphic type, similar to what we do for arrays and ranges? In
>> fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
>> to implement this as a special variant of an array? Would be better than
>> having a set of functions for every supported data type.
>>
>> (Note: It might still be possible to have a special implementation for
>> selected fixed-length data types, as it allows optimization at compile
>> time. But that could be done later.)
> 
> 
> Interesting idea. There's also the keyword SETOF that we could possibly
> make use of.
> 
> 
>> The other thing I've been thinking about is the SQL syntax and what does
>> the SQL standard says about this.
>>
>> AFAICS the standard only defines arrays and multisets. Arrays are pretty
>> much the thing we have, including the ARRAY[] constructor etc. Multisets
>> are similar to hashset discussed here, except that it tracks the number
>> of elements for each value (which would be trivial in hashset).
>>
>> So if we want to make this a built-in feature, maybe we should aim to do
>> the multiset thing, with the standard SQL syntax? Extending the grammar
>> should not be hard, I think. I'm not sure of the underlying code
>> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
>> lot of separate code doing that.
>>
>>
> 
> Yes, Multisets (a.k.a. bags and a large number of other names) would be
> interesting. But I wouldn't like to abandon pure sets either. Maybe a
> typmod indicating the allowed multiplicity of the type?
> 

Maybe, although I'm not sure if that can be specified with a multiset
constructor, i.e. when using MULTISET[...] in places where we now use
ARRAY[...] to specify arrays.

I was thinking more about having one set of operators, one considering
the multiplicity (and thus doing what the SQL standard says) and one ignoring
it (thus treating MULTISETS as plain sets).

Anyway, I'm just thinking aloud. I'm not sure if this is the way to go,
but it'd be silly to end up implementing stuff unnecessarily and/or
inventing something that contradicts the SQL standard (or is somehow
inconsistent with similar stuff).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-19 Thread Joel Jacobson
On Mon, Jun 19, 2023, at 11:49, jian he wrote:
> hashset_to_array function should be strict?
>
> I noticed hashset_symmetric_difference and  hashset_difference handle 
> null in a different way, seems they should handle null in a consistent 
> way?

Yes, I agree, they should be consistent.

I've thought a bit more about this, and came to the conclusion that it
would be easiest, safest and least confusing to just mark all functions STRICT.

That way, it's the user's responsibility to ensure null operands are not passed
to the functions, which is simply a WHERE ... or FILTER (WHERE ...). And if one
makes a mistake and passes a NULL anyway, it's better to make the entire result
blow up by letting the result be NULL than to silently ignore the operand or
return some questionable true/false value.

SQL has a quite unique NULL handling compared to other languages, so I think
it's better to let the user use the full arsenal of SQL to deal with nulls,
rather than trying to shoehorn some null semantics into a set-theoretic system.

/Joel




Re: Do we want a hashset type?

2023-06-19 Thread Andrew Dunstan


On 2023-06-19 Mo 05:21, Tomas Vondra wrote:


On 6/18/23 18:45, Andrew Dunstan wrote:

On 2023-06-16 Fr 20:38, Joel Jacobson wrote:

New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.


Cool, thanks.


It might still be convenient to keep it split into smaller, easier to
review, parts. A patch that introduces basic functionality and then
patches adding various "advanced" features.


A couple of random thoughts:


. It might be worth sending a version number with the send function
(c.f. jsonb_send / jsonb_recv). That way we would not be tied forever
to some wire representation (see the sketch after this list).

. I think there are some important set operations missing: most notably
intersection, slightly less importantly asymmetric and symmetric
difference. I have no idea how easy these would be to add, but even for
your stated use I should have thought set intersection would be useful
("Who is a member of both this set of friends and that set of friends?").

. While supporting int4 only is OK for now, I think we would at least
want to support int8, and probably UUID since a number of systems I know
of use that as an object identifier.
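
As a rough illustration of the versioning idea (not from the patch; the
PG_GETARG_INT4HASHSET macro and nelements field are assumed from the
extension, the rest follows the jsonb_send pattern):

#include "postgres.h"
#include "libpq/pqformat.h"

Datum
int4hashset_send(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set = PG_GETARG_INT4HASHSET(0);
    StringInfoData  buf;

    pq_begintypsend(&buf);
    pq_sendint8(&buf, 1);               /* wire-format version byte */
    pq_sendint32(&buf, set->nelements); /* element data would follow ... */
    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}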


I agree we should aim to support a wider range of data types. Could we
have a polymorphic type, similar to what we do for arrays and ranges? In
fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
to implement this as a special variant of an array? Would be better than
having a set of functions for every supported data type.

(Note: It might still be possible to have a special implementation for
selected fixed-length data types, as it allows optimization at compile
time. But that could be done later.)



Interesting idea. There's also the keyword SETOF that we could possibly 
make use of.






The other thing I've been thinking about is the SQL syntax and what does
the SQL standard says about this.

AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).

So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.




Yes, Multisets (a.k.a. bags and a large number of other names) would be 
interesting. But I wouldn't like to abandon pure sets either. Maybe a 
typmod indicating the allowed multiplicity of the type?




cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-19 Thread Joel Jacobson
On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
> AFAICS the standard only defines arrays and multisets. Arrays are pretty
> much the thing we have, including the ARRAY[] constructor etc. Multisets
> are similar to hashset discussed here, except that it tracks the number
> of elements for each value (which would be trivial in hashset).
>
> So if we want to make this a built-in feature, maybe we should aim to do
> the multiset thing, with the standard SQL syntax? Extending the grammar
> should not be hard, I think. I'm not sure of the underlying code
> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
> lot of separate code doing that.

Multisets handle duplicates uniquely; this may bring unexpected issues. Sets
and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
more fundamental and prevalent in std libs than multisets.

Despite SQL's multiset possibility, a distinct hashset type is my preference,
helping appropriate data structure choice and reducing misuse.

The necessity of multisets is vague beyond standards compliance.

/Joel




Re: Do we want a hashset type?

2023-06-19 Thread jian he
On Mon, Jun 19, 2023 at 2:51 PM Joel Jacobson  wrote:
>
> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> > select hashset_contains('{1,2}'::int4hashset,NULL::int);
> > should return null?
>
> Hmm, that's a good philosophical question.
>
> I notice Tomas Vondra in the initial commit opted for allowing NULL
inputs,
> treating them as empty sets, e.g. in int4hashset_add() we create a
> new hashset if the first argument is NULL.
>
> I guess the easiest perhaps most consistent NULL-handling strategy
> would be to just mark all relevant functions STRICT except for the agg
ones
> since we probably want to allow skipping over rows with NULL values
> without the entire result becoming NULL.
>
> But if we're not just going the STRICT route, then I think it's a bit
more tricky,
> since you could argue the hashset_contains() example should return FALSE
> since the set doesn't contain the NULL value, but OTOH, since we don't
> store NULL values, we don't know if has ever been added, hence a NULL
> result would perhaps make more sense.
>
> I think I lean on thinking that if we want to be "NULL-friendly", like we
> currently are in hashset_add(), it would probably be most user-friendly
> to be consistent and let all functions return non-null return values in
> all cases where it is not unreasonable.
>
> Since we're essentially designing a set-theoretic system, I think we
should
> aim for the logical "soundness" property of it and think about how we can
> verify that it is.
>
> Thoughts?
>
> /Joel

hashset_to_array function should be strict?

I noticed hashset_symmetric_difference and hashset_difference handle null
in different ways; it seems they should handle null in a consistent way?

select '{1,2,NULL}'::int[] operator (pg_catalog.@>) '{NULL}'::int[]; --false
select '{1,2,NULL}'::int[] operator (pg_catalog.&&) '{NULL}'::int[]; --false
So similarly I guess hashset_contains should be false.
select hashset_contains('{1,2}'::int4hashset,NULL::int);


Re: Do we want a hashset type?

2023-06-19 Thread Tomas Vondra



On 6/18/23 18:45, Andrew Dunstan wrote:
> 
> On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
>>
>> New patch is attached, which will henceforth always be a complete patch,
>> to avoid the hassle of having to assemble incremental patches.
> 
> 
> Cool, thanks.
> 

It might still be convenient to keep it split into smaller, easier to
review, parts. A patch that introduces basic functionality and then
patches adding various "advanced" features.

> 
> A couple of random thoughts:
> 
> 
> . It might be worth sending a version number with the send function
> (c.f. jsonb_send / jsonb_recv). That way we would not be tied forever
> to some wire representation.
> 
> . I think there are some important set operations missing: most notably
> intersection, slightly less importantly asymmetric and symmetric
> difference. I have no idea how easy these would be to add, but even for
> your stated use I should have thought set intersection would be useful
> ("Who is a member of both this set of friends and that set of friends?").
> 
> . While supporting int4 only is OK for now, I think we would at least
> want to support int8, and probably UUID since a number of systems I know
> of use that as an object identifier.
> 

I agree we should aim to support a wider range of data types. Could we
have a polymorphic type, similar to what we do for arrays and ranges? In
fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
to implement this as a special variant of an array? Would be better than
having a set of functions for every supported data type.

(Note: It might still be possible to have a special implementation for
selected fixed-length data types, as it allows optimization at compile
time. But that could be done later.)


The other thing I've been thinking about is the SQL syntax and what does
the SQL standard says about this.

AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).

So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure how much of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse, or if we'd need a
lot of separate code doing that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-19 Thread Joel Jacobson
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> should return null?

Hmm, that's a good philosophical question.

I notice Tomas Vondra in the initial commit opted for allowing NULL inputs,
treating them as empty sets, e.g. in int4hashset_add() we create a
new hashset if the first argument is NULL.

I guess the easiest and perhaps most consistent NULL-handling strategy
would be to just mark all relevant functions STRICT except for the agg ones
since we probably want to allow skipping over rows with NULL values
without the entire result becoming NULL.

But if we're not just going the STRICT route, then I think it's a bit more 
tricky,
since you could argue the hashset_contains() example should return FALSE
since the set doesn't contain the NULL value, but OTOH, since we don't
store NULL values, we don't know if it has ever been added, hence a NULL
result would perhaps make more sense.

I think I lean towards thinking that if we want to be "NULL-friendly", like we
currently are in hashset_add(), it would probably be most user-friendly
to be consistent and let all functions return non-null return values in
all cases where it is not unreasonable.
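
To make that concrete, here is a minimal sketch (an assumption, not the
patch's actual code) of what a consistent, non-STRICT hashset_contains()
could look like; int4hashset_element_exists() is a hypothetical helper name:

Datum
hashset_contains(PG_FUNCTION_ARGS)
{
	int4hashset_t  *set;

	/* a NULL element is never stored, so it is never "contained" */
	if (PG_ARGISNULL(1))
		PG_RETURN_BOOL(false);

	/* treat a NULL set as the empty set, mirroring hashset_add() */
	if (PG_ARGISNULL(0))
		PG_RETURN_BOOL(false);

	set = PG_GETARG_INT4HASHSET(0);

	PG_RETURN_BOOL(int4hashset_element_exists(set, PG_GETARG_INT32(1)));
}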

Since we're essentially designing a set-theoretic system, I think we should
aim for the logical "soundness" property of it and think about how we can
verify that it holds.

Thoughts?

/Joel




Re: Do we want a hashset type?

2023-06-18 Thread jian he
On Sat, Jun 17, 2023 at 8:38 AM Joel Jacobson  wrote:
>
> On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> > I realise int4hashset_hash() is broken,
> > since two int4hashset's that are considered equal,
> > can by coincidence get different hashes:
> ...
> > Do we have any ideas on how to fix this without sacrificing performance?
>
> The problem was due to hashset_hash() function accumulating the hashes
> of individual elements in a non-commutative manner. As a consequence, the
> final hash value was sensitive to the order in which elements were inserted
> into the hashset. This behavior led to inconsistencies, as logically
> equivalent sets (i.e., sets with the same elements but in different orders)
> produced different hash values.
>
> Solved by modifying the hashset_hash() function to use a commutative operation
> when combining the hashes of individual elements. This change ensures that the
> final hash value is independent of the element insertion order, and logically
> equivalent sets produce the same hash.
>
> A somewhat unfortunate side-effect of this fix is that we can no longer
> visually sort the hashset output format, since it's not lexicographically 
> sorted.
> I think this is an acceptable trade-off for a hashset type,
> since the only alternative I see would be to sort the elements,
> but then it wouldn't be a hashset, but a treeset, which has different
> Big-O complexity.
>
> New patch is attached, which will henceforth always be a complete patch,
> to avoid the hassle of having to assemble incremental patches.
>
> /Joel


select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
-
SELECT attname
      ,pc.relname
      ,CASE attstorage
           WHEN 'p' THEN 'plain'
           WHEN 'e' THEN 'external'
           WHEN 'm' THEN 'main'
           WHEN 'x' THEN 'extended'
       END AS storage
FROM pg_attribute pa
JOIN pg_class pc ON pc.oid = pa.attrelid
WHERE attnum > 0 AND pa.attstorage = 'e';

In my system catalog, it seems only the hashset type has storage =
'external'; most types are 'extended'. I am not sure of the consequences
of switching from external to extended.

select hashset_hash('{-1,1}')   as a1
,hashset_hash('{1,-2}') as a2
,hashset_hash('{-3,1}') as a3
,hashset_hash('{4,1}')  as a4;
returns:
      a1     |    a2     |     a3     |     a4
-------------+-----------+------------+------------
 -1735582196 | 998516167 | 1337000903 | 1305426029
(1 row)

The values {a1,a2,a3,a4} should be monotonically increasing, based on the
function int4hashset_cmp, but they are not,
so the following queries fail.

--should return only one row.
select hashset_cmp('{2,1}','{3,1}')
union
select hashset_cmp('{3,1}','{4,1}')
union
select hashset_cmp('{1,3}','{4,1}');

select hashset_cmp('{9,10,11}','{10,9,-11}') =
hashset_cmp('{9,10,11}','{10,9,-1}'); --should be true
select '{2,1}'::int4hashset > '{7}'::int4hashset; --should be false.
based on int array comparison.
-
I commented out the following lines in hashset-api.c, somewhere between lines 810 and 829:

// if (a->hash < b->hash)
// PG_RETURN_INT32(-1);
// else if (a->hash > b->hash)
// PG_RETURN_INT32(1);

// if (a->nelements < b->nelements)
// PG_RETURN_INT32(-1);
// else if (a->nelements > b->nelements)
// PG_RETURN_INT32(1);

// Assert(a->nelements == b->nelements);

With that change, hashset_cmp directly compares the int arrays, and the above queries work.
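
For illustration, an order-consistent comparison could look like this sketch
(an assumption, not the patch's code): extract the stored elements, sort both
sides, compare lexicographically, and break ties on length.

static int
int4_cmp(const void *a, const void *b)
{
	int32	x = *(const int32 *) a;
	int32	y = *(const int32 *) b;

	return (x > y) - (x < y);
}

static int
int4hashset_cmp_as_sorted_arrays(int32 *a, int na, int32 *b, int nb)
{
	int		n = (na < nb) ? na : nb;

	qsort(a, na, sizeof(int32), int4_cmp);
	qsort(b, nb, sizeof(int32), int4_cmp);

	for (int i = 0; i < n; i++)
	{
		if (a[i] != b[i])
			return (a[i] > b[i]) ? 1 : -1;
	}
	return (na > nb) - (na < nb);
}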

int4hashset_equals and int4hashset_neq are two special cases of hashset_cmp;
maybe we can just wrap it, like int4hashset_le?

Currently, storing a 10-element int4hashset needs 99 bytes; a similar
one-dimensional bigint array of length 10 occupies 101 bytes.

In int4hashset_send, do the newly added struct members {load_factor,
growth_factor, ncollisions, hash} also need to be sent to the buffer?




Re: Do we want a hashset type?

2023-06-18 Thread Joel Jacobson
On Sun, Jun 18, 2023, at 18:45, Andrew Dunstan wrote:
> . It might be worth sending a version number with the send function 
> (c.f. jsonb_send / jsonb_recv). That way we would not be tied forever 
> to some wire representation.

Great idea; implemented.

> . I think there are some important set operations missing: most notably 
> intersection, slightly less importantly asymmetric and symmetric 
> difference. I have no idea how easy these would be to add, but even for 
> your stated use I should have thought set intersection would be useful 
> ("Who is a member of both this set of friends and that set of friends?").

Another great idea; implemented.

> . While supporting int4 only is OK for now, I think we would at least 
> want to support int8, and probably UUID since a number of systems I know 
> of use that as an object identifier.

I agree that's probably the most logical thing to focus on next. I'm on it.

New patch attached.

/Joel

hashset-0.0.1-75bf3ab.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-18 Thread Andrew Dunstan



On 2023-06-16 Fr 20:38, Joel Jacobson wrote:


New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.



Cool, thanks.


A couple of random thoughts:


. It might be worth sending a version number with the send function 
(c.f. jsonb_send / jsonb_recv). That way we would not be tied forever 
to some wire representation.


. I think there are some important set operations missing: most notably 
intersection, slightly less importantly asymmetric and symmetric 
difference. I have no idea how easy these would be to add, but even for 
your stated use I should have thought set intersection would be useful 
("Who is a member of both this set of friends and that set of friends?").


. While supporting int4 only is OK for now, I think we would at least 
want to support int8, and probably UUID since a number of systems I know 
of use that as an object identifier.



cheers


andrew


--

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Do we want a hashset type?

2023-06-16 Thread Joel Jacobson
On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> I realise int4hashset_hash() is broken,
> since two int4hashset's that are considered equal,
> can by coincidence get different hashes:
...
> Do we have any ideas on how to fix this without sacrificing performance?

The problem was due to hashset_hash() function accumulating the hashes
of individual elements in a non-commutative manner. As a consequence, the
final hash value was sensitive to the order in which elements were inserted
into the hashset. This behavior led to inconsistencies, as logically
equivalent sets (i.e., sets with the same elements but in different orders)
produced different hash values.

Solved by modifying the hashset_hash() function to use a commutative operation
when combining the hashes of individual elements. This change ensures that the
final hash value is independent of the element insertion order, and logically
equivalent sets produce the same hash.
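
To illustrate the fix, here is a sketch of order-independent accumulation
(an assumption; the patch's actual code may differ). Addition is commutative,
so the result no longer depends on insertion order, whereas the broken
variant mixed the running hash into every step:

static uint32
int4hashset_hash_elements(const int32 *values, int nvalues)
{
	uint32	hash = 0;

	for (int i = 0; i < nvalues; i++)
		hash += hash_bytes_uint32((uint32) values[i]);	/* commutative */

	return hash;
}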

A somewhat unfortunate side-effect of this fix is that we can no longer
visually sort the hashset output format, since it's not lexicographically 
sorted.
I think this is an acceptable trade-off for a hashset type,
since the only alternative I see would be to sort the elements,
but then it wouldn't be a hashset, but a treeset, which has different
Big-O complexity.

New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.

/Joel

hashset-0.0.1-8da4aa8.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-16 Thread Joel Jacobson
On Fri, Jun 16, 2023, at 13:57, jian he wrote:
> similar to (int[] || int4) and (int4 || int[])
> should we expect ('{1,2}'::int4hashset || 3) == (3 ||
> '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3))?

Good idea, makes sense to support it.
Implemented in attached patch.

> CREATE OPERATOR || (
> leftarg = int4,
> rightarg = int4hashset,
> function = int4_add_int4hashset,
> commutator = ||
> );
> While creating the operator, I am not sure how to specify the
> NEGATOR, RESTRICT, and JOIN clauses.

I don't think we need those for this operator, might be wrong though.

> -
> Also, I think the following query should return only one row, but
> currently it doesn't:
> select hashset_cmp('{1,2}','{2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2}');

Good point.

I realise int4hashset_hash() is broken,
since two int4hashset's that are considered equal,
can by coincidence get different hashes:

SELECT '{1,2}'::int4hashset = '{2,1}'::int4hashset;
 ?column?
----------
 t
(1 row)

SELECT hashset_hash('{1,2}'::int4hashset);
 hashset_hash
--------------
990882385
(1 row)

SELECT hashset_hash('{2,1}'::int4hashset);
 hashset_hash
--------------
996377797
(1 row)

Do we have any ideas on how to fix this without sacrificing performance?

We of course want to avoid having to sort the hashsets,
which is the naive solution.

To understand why this is happening, consider this example:

SELECT '{1,2}'::int4hashset;
 int4hashset
-------------
 {1,2}
(1 row)

SELECT '{2,1}'::int4hashset;
 int4hashset
-------------
 {2,1}
(1 row)

If the hashes of `1` and `2`, modulo the capacity, result in the same value,
an attempt will be made to insert both at the same position,
and since the input text is parsed left-to-right, in the first case `1` will win
the first position, and `2` will get a collision and try the next position.

In the second case, the opposite happens.

Since we do modulus capacity, the position depends on the capacity,
which is why the output string can be different for the same input.
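
The following simplified sketch (hypothetical, not the actual patch code)
shows the mechanism: the home slot is hash % capacity, and a colliding value
gets bumped to another slot, so both the capacity and the insertion order
decide the final layout. The capacity sweep below illustrates the effect.

#define HASHSET_STEP 13		/* the probe step used by the patch */

static void
naive_insert(int32 *values, bool *used, int capacity, int32 value, uint32 hash)
{
	int		pos = hash % capacity;	/* home slot depends on capacity */

	while (used[pos])
		pos = (pos + HASHSET_STEP) % capacity;	/* collision: probe onward */

	values[pos] = value;
	used[pos] = true;
}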

SELECT int4hashset() || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=1) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=2) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=3) || 1 || 2 || 3;
 {3,2,1}

SELECT int4hashset(capacity:=4) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=5) || 1 || 2 || 3;
 {1,2,3}

SELECT int4hashset(capacity:=6) || 1 || 2 || 3;
 {1,3,2}


> --
> similar to elem_contained_by_range, range_contains_elem: should we
> also consider the operators <@ and @>?

That could perhaps be nice.
Apart from possible syntax convenience,
are there any other benefits compared to just using the function
hashset_contains(int4hashset, integer) directly?

/Joel

hashset-0.0.1-d83bdef.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-16 Thread jian he
similar to (int[] || int4) and (int4 || int[])
should we expect ('{1,2}'::int4hashset || 3) == (3 ||
 '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3))?

The following is the general idea on how to make it work by looking at
similar code
CREATE OPERATOR || (
leftarg = int4hashset,
rightarg = int4,
function = int4hashset_add,
commutator = ||
);

CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
RETURNS int4hashset
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN $2 || $1;

CREATE OPERATOR || (
leftarg = int4,
rightarg = int4hashset,
function = int4_add_int4hashset,
commutator = ||
);
While creating the operator, I am not sure how to specify the
NEGATOR, RESTRICT, and JOIN clauses.
-
Also, I think the following query should return only one row, but currently it
doesn't:
select hashset_cmp('{1,2}','{2,1}')
union
select hashset_cmp('{1,2}','{1,2,1}')
union
select hashset_cmp('{1,2}','{1,2}');
--
similar to elem_contained_by_range, range_contains_elem: should we also
consider the operators <@ and @>?

CREATE OR REPLACE FUNCTION elem_contained_by_hashset(int4, int4hashset)
RETURNS bool
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN hashset_contains ($2,$1);

integer <@ int4hashset → boolean
Is the integer contained in the int4hashset?
1 <@ int4hashset'{1,7}' → t

CREATE OPERATOR <@ (
leftarg = integer,
rightarg = int4hashset,
function = elem_contained_by_hashset
);

int4hashset @> integer → boolean
Does the int4hashset contain the element?
int4hashset'{1,7}' @> 1 → t

CREATE OPERATOR @> (
leftarg = int4hashset,
rightarg = integer,
function = hashset_contains
);
---


Re: Do we want a hashset type?

2023-06-16 Thread Joel Jacobson
New patch attached:

Add customizable params to int4hashset() and collision count function

This commit enhances int4hashset() by introducing adjustable capacity,
load, and growth factors, providing flexibility for performance optimization.

Also added is a new function, hashset_collisions(), to report collision
counts, aiding in performance tuning.

Aggregate functions are renamed to hashset_agg() for consistency with
array_agg() and range_agg().

A new test file, test/sql/benchmark.sql, is added for evaluating the
performance of hash functions. It's not run automatically by
make installcheck.

The adjustable parameters and the naive hash function are useful for testing
and performance comparison. However, to keep things simple and streamlined
for users, these features are likely to be removed in the final release,
emphasizing the use of well-optimized default settings.

SQL-function indentation is also adjusted to align with the PostgreSQL
source repo, improving readability.

In the benchmark results below, it was a bit surprising the naive hash
function had no collisions, but that only held true when the input
elements were sequential integers. When tested with random integers,
all three hash functions caused collisions.

Timing results are not statistically significant; the purpose is just to
give an idea of the execution times.

*** Elements in sequence 1..100000
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_collisions: 31195
DO
Time: 1342.564 ms (00:01.343)
- Testing Murmurhash32
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_collisions: 30879
DO
Time: 1297.823 ms (00:01.298)
- Testing naive hash function
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_collisions: 0
DO
Time: 1400.936 ms (00:01.401)
*** Testing 100000 random ints
 setseed 
---------
 
(1 row)

Time: 3.591 ms
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_collisions: 30919
DO
Time: 1415.497 ms (00:01.415)
 setseed 
---------

(1 row)

Time: 1.282 ms
- Testing Murmurhash32
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_collisions: 30812
DO
Time: 2079.202 ms (00:02.079)
 setseed 
---------

(1 row)

Time: 0.122 ms
- Testing naive hash function
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_collisions: 30822
DO
Time: 1613.965 ms (00:01.614)

/Joel

hashset-0.0.1-184a18a.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-15 Thread Joel Jacobson
On Thu, Jun 15, 2023, at 11:44, jian he wrote:
> In hashset/test/sql/order.sql, can we add the following to test whether 
> the optimizer will use our index.
>
> CREATE INDEX  ON test_int4hashset_order (int4hashset_col  
> int4hashset_btree_ops);
>
> -- to make sure that this work with just two rows
> SET enable_seqscan TO off;  
>
> explain(costs off) SELECT * FROM test_int4hashset_order WHERE 
> int4hashset_col = '{1,2}'::int4hashset;
> reset enable_seqscan;

Not sure I can see the value of that test,
since we've already tested the comparison functions,
which are used by the int4hashset_btree_ops operator class.

I think a test that verifies the btree index is actually used
would be more a test of the query planner than of hashset.

I might be missing something here, please tell me if so.

> Since most contrib modules have only one test file per module, maybe we
> need to consolidate all the test SQL files into one file
> (int4hashset.sql)?

I've also made the same observation; I wonder if it's by design
or by coincidence? I think multiple test files improve modularity,
isolation and overall organisation of the testing.

As long as we are developing in the pre-release phase,
I think rigorous testing is beneficial and affordable.

However, if hashset were ever to be considered
for core inclusion, then we should consolidate all tests into
one file and retain only essential tests, thereby minimizing
impact on PostgreSQL's overall test suite runtime
where every millisecond matters.

> Attached is a patch that slightly modifies README.md. Feel free to change it,
> since I am not a native English speaker...
>
> Attachments:
> * 0001-add-instruction-using-PG_CONFIG-to-install-extension.patch

Thanks, improvements incorporated with some minor changes.

/Joel

hashset-0.0.1-61d572a.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-15 Thread Joel Jacobson
On Thu, Jun 15, 2023, at 06:29, jian he wrote:
> I am not sure the following results are correct.
> with cte as (
> select hashset(x) as x
> ,hashset_capacity(hashset(x))
> ,hashset_count(hashset(x))
> from generate_series(1,10) g(x))
> select *
> ,'|' as delim
> , hashset_add(x,1::int)
> ,hashset_capacity(hashset_add(x,1::int))
> ,hashset_count(hashset_add(x,1::int))
> from cte \gx
>
>
> results:  
> -[ RECORD 1 ]----+-------------------------
> x                | {8,1,10,3,9,4,6,2,1,5,7}
> hashset_capacity | 64
> hashset_count    | 10
> delim            | |
> hashset_add      | {8,1,10,3,9,4,6,2,1,5,7}
> hashset_capacity | 64
> hashset_count    | 11

Nice catch, you found a bug!

Fixed in attached patch:

---
Ensure hashset_add and hashset_merge operate on copied data

Previously, the hashset_add() and hashset_merge() functions were
modifying the original hashset in-place. This was leading to unexpected
results because the original data in the hashset was being altered.

This commit introduces the macro PG_GETARG_INT4HASHSET_COPY(), ensuring
a copy of the hashset is created and modified, leaving the original
hashset untouched.

This adjustment ensures hashset_add() and hashset_merge() operate
correctly on the copied hashset and prevent modification of the
original data.

A new regression test file `reported_bugs.sql` has been added to
validate the proper functionality of these changes. Future reported
bugs and their corresponding tests will also be added to this file.
---

I wonder if this function:

static int4hashset_t *
int4hashset_copy(int4hashset_t *src)
{
return src;
}

...that was previously named hashset_copy(),
should be implemented to actually copy the struct,
instead of just returning the input?

It is being used by int4hashset_agg_combine() like this:

/* copy the hashset into the right long-lived memory context */
oldcontext = MemoryContextSwitchTo(aggcontext);
src = int4hashset_copy(src);
MemoryContextSwitchTo(oldcontext);
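
If it should be a true copy, a deep copy would only need to duplicate the
varlena (a sketch, assuming int4hashset_t begins with a standard varlena
header):

static int4hashset_t *
int4hashset_copy(int4hashset_t *src)
{
	Size			len = VARSIZE(src);
	int4hashset_t  *dst = (int4hashset_t *) palloc(len);

	memcpy(dst, src, len);
	return dst;
}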

/Joel

hashset-0.0.1-da84659.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-15 Thread Joel Jacobson
On Thu, Jun 15, 2023, at 11:44, jian he wrote:
> I didn't install the extension directly. I copied the 
> hashset--0.0.1.sql to another place, using gcc to compile these 
> functions. 
..
> Because even make 
> PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config still 
> has an error.
>  fatal error: libpq-fe.h: No such file or directory
3 | #include <libpq-fe.h>

What platform are you on?
You seem to be missing the postgresql dev package.
For instance, here is how to compile and install the extension on Ubuntu 
22.04.1 LTS:

sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
git clone https://github.com/tvondra/hashset.git
cd hashset
make
sudo make install
make installcheck

> Is there any way to put test_send_recv.c into an SQL test file?

Unfortunately, there doesn't seem to be a way to test *_recv() functions from 
SQL,
since they take `internal` as input. The only way I could figure out to test 
them
was to write a C-program using libpq's binary mode.
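
For illustration, the core of such a test could look like this hedged sketch
(not the thread's actual test_send_recv.c): fetch a value with a binary
result format to exercise hashset_send(), then pass the same bytes back as a
binary parameter to exercise hashset_recv().

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");	/* connection info from PG* env vars */
	PGresult   *out;
	PGresult   *in;
	const char *paramValues[1];
	int			paramLengths[1];
	int			paramFormats[1] = {1};	/* parameter sent in binary */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
		return EXIT_FAILURE;
	}

	/* binary result format: exercises the type's send function */
	out = PQexecParams(conn, "SELECT '{1,2,3}'::int4hashset",
					   0, NULL, NULL, NULL, NULL, 1);
	if (PQresultStatus(out) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "send test failed: %s", PQerrorMessage(conn));
		return EXIT_FAILURE;
	}

	/* feed the same bytes back as a binary parameter: exercises recv */
	paramValues[0] = PQgetvalue(out, 0, 0);
	paramLengths[0] = PQgetlength(out, 0, 0);

	in = PQexecParams(conn, "SELECT $1::int4hashset",
					  1, NULL, paramValues, paramLengths, paramFormats, 0);
	if (PQresultStatus(in) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "recv test failed: %s", PQerrorMessage(conn));
		return EXIT_FAILURE;
	}

	printf("roundtrip: %s\n", PQgetvalue(in, 0, 0));

	PQclear(out);
	PQclear(in);
	PQfinish(conn);
	return EXIT_SUCCESS;
}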

I also note that the test_send_recv test was broken; I had forgotten to change
the type from "hashset" to "int4hashset". Fixed in attached commit.

On Ubuntu, you can now run the test by specifying to connect via the UNIX 
socket:

PGHOST=/var/run/postgresql make run_c_tests
cd test/c_tests && ./test_send_recv.sh
test test_send_recv   ... ok

> Attached is a patch that slightly modifies README.md. Feel free to change it,
> since I am not a native English speaker...
> Attachments:
> * 0001-add-instruction-using-PG_CONFIG-to-install-extension.patch

Thanks, will have a look later.

/Joel

hashset-0.0.1-bb0ece8.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-15 Thread jian he
In hashset/test/sql/order.sql, can we add the following to test
whether the optimizer
will use our index.

CREATE INDEX  ON test_int4hashset_order (int4hashset_col
 int4hashset_btree_ops);

-- to make sure that this work with just two rows
SET enable_seqscan TO off;

explain(costs off) SELECT * FROM test_int4hashset_order WHERE
int4hashset_col = '{1,2}'::int4hashset;
reset enable_seqscan;

Since most contrib modules have only one test file per module, maybe we need
to consolidate all the test SQL files into one file (int4hashset.sql)?
--
I didn't install the extension directly. I copied the hashset--0.0.1.sql to
another place, using gcc to compile these functions.
gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c
/home/jian/hashset/hashset.c
gcc -shared  -o /home/jian/hashset/hashset.so /home/jian/hashset/hashset.o
then modified hashset--0.0.1.sql, and in psql ran \i fullsqlfilename to create
these functions and types.

Because even make
PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config still has
an error.
 fatal error: libpq-fe.h: No such file or directory
3 | #include <libpq-fe.h>

Is there any way to put test_send_recv.c into an SQL test file?
Attached is a patch that slightly modifies README.md. Feel free to change it,
since I am not a native English speaker...
From 2bb310affed4a06c6fa38fb0e1b1ff39f330d88d Mon Sep 17 00:00:00 2001
From: pgaddict 
Date: Thu, 15 Jun 2023 15:50:00 +0800
Subject: [PATCH] add instruction using PG_CONFIG to install extension, also
 how to run installcheck

---
 README.md | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3ff57576..30c7 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # hashset
 
 This PostgreSQL extension implements hashset, a data structure (type)
-providing a collection of integer items with fast lookup.
+providing a collection of unique, not null integer items with fast lookup.
 
 
 ## Version
@@ -103,11 +103,19 @@ a variable-length type.
 
 ## Installation
 
-To install the extension, run `make install` in the project root. Then, in your
+To install the extension, run `make install` in the project root. To use a different PostgreSQL installation, point configure to a different `pg_config`, using following command: 
+
+make PG_CONFIG=/else/where/pg_config
+make install PG_CONFIG=/else/where/pg_config
+
+Then, in your
 PostgreSQL connection, execute `CREATE EXTENSION hashset;`.
 
 This extension requires PostgreSQL version ?.? or later.
 
+## Test
+To run regression test, execute
+`make installcheck`
 
 ## License
 
-- 
2.34.1



Re: Do we want a hashset type?

2023-06-15 Thread Joel Jacobson
On Thu, Jun 15, 2023, at 04:22, jian he wrote:
> Attachments:
> * temp.patch

Thanks for good suggestions.
New patch attached:

Enhance parsing and reorder headers in hashset module

Allow whitespaces in hashset input and reorder the inclusion of
header files, placing PostgreSQL headers first. Additionally, update
deprecated pq_sendint calls to pq_sendint32. Add tests for improved
parsing functionality.

/Joel

hashset-0.0.1-1fd790f
Description: Binary data


Re: Do we want a hashset type?

2023-06-14 Thread jian he
On Thu, Jun 15, 2023 at 5:04 AM Joel Jacobson  wrote:

> On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
> > On 6/14/23 14:57, Joel Jacobson wrote:
> >> Would it be feasible to teach the planner to utilize the internal hash
> table of
> >> hashset directly? In the case of arrays, the hash table construction is
> an
> ...
> > It's definitely something I'd leave out of v0, personally.
>
> OK, thanks for guidance, I'll stay away from it.
>
> I've been doing some preparatory work on this todo item:
>
> > 3) support for other types (now it only works with int32)
>
> I've renamed the type from "hashset" to "int4hashset",
> and the SQL-functions are now prefixed with "int4"
> when necessary. The overloaded functions with
> int4hashset as input parameters don't need to be prefixed,
> e.g. hashset_add(int4hashset, int).
>
> Other changes since last update (4e60615):
>
> * Support creation of empty hashset using '{}'::hashset
> * Introduced a new function hashset_capacity() to return the current
> capacity
>   of a hashset.
> * Refactored hashset initialization:
>   - Replaced hashset_init(int) with int4hashset() to initialize an empty
> hashset
> with zero capacity.
>   - Added int4hashset_with_capacity(int) to initialize a hashset with
> a specified capacity.
> * Improved README.md and testing
>
> As a next step, I'm planning on adding int8 support.
>
> Looks and sounds good?
>
> /Joel

I am not sure the following results are correct.
with cte as (
select hashset(x) as x
,hashset_capacity(hashset(x))
,hashset_count(hashset(x))
from generate_series(1,10) g(x))
select *
,'|' as delim
, hashset_add(x,1::int)
,hashset_capacity(hashset_add(x,1::int))
,hashset_count(hashset_add(x,1::int))
from cte \gx


results:
-[ RECORD 1 ]----+-------------------------
x                | {8,1,10,3,9,4,6,2,1,5,7}
hashset_capacity | 64
hashset_count    | 10
delim            | |
hashset_add      | {8,1,10,3,9,4,6,2,1,5,7}
hashset_capacity | 64
hashset_count    | 11

but:
with cte as (select '{1,2}'::int4hashset as x) select
x, hashset_add(x,3::int) from cte;

returns
   x   | hashset_add
---+-
 {1,2} | {3,1,2}
(1 row)
The last simple query seems more sensible to me.


Re: Do we want a hashset type?

2023-06-14 Thread jian he
On Thu, Jun 15, 2023 at 5:04 AM Joel Jacobson  wrote:

> On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
> > On 6/14/23 14:57, Joel Jacobson wrote:
> >> Would it be feasible to teach the planner to utilize the internal hash
> table of
> >> hashset directly? In the case of arrays, the hash table construction is
> an
> ...
> > It's definitely something I'd leave out of v0, personally.
>
> OK, thanks for guidance, I'll stay away from it.
>
> I've been doing some preparatory work on this todo item:
>
> > 3) support for other types (now it only works with int32)
>
> I've renamed the type from "hashset" to "int4hashset",
> and the SQL-functions are now prefixed with "int4"
> when necessary. The overloaded functions with
> int4hashset as input parameters don't need to be prefixed,
> e.g. hashset_add(int4hashset, int).
>
> Other changes since last update (4e60615):
>
> * Support creation of empty hashset using '{}'::hashset
> * Introduced a new function hashset_capacity() to return the current
> capacity
>   of a hashset.
> * Refactored hashset initialization:
>   - Replaced hashset_init(int) with int4hashset() to initialize an empty
> hashset
> with zero capacity.
>   - Added int4hashset_with_capacity(int) to initialize a hashset with
> a specified capacity.
> * Improved README.md and testing
>
> As a next step, I'm planning on adding int8 support.
>
> Looks and sounds good?
>
> /Joel


Still playing around with hashset-0.0.1-a8a282a.patch.

I think "postgres.h" should be on the top, (someone have said it on another
email thread, I forgot who said that)

In my
local /home/jian/postgres/pg16/include/postgresql/server/libpq/pqformat.h:

> /*
>  * Append a binary integer to a StringInfo buffer
>  *
>  * This function is deprecated; prefer use of the functions above.
>  */
> static inline void
> pq_sendint(StringInfo buf, uint32 i, int b)


So I changed to pq_sendint32.

Leading, trailing, and in-between whitespace should be stripped. The
following C example seems OK for now, but I am not sure; I don't know how
to glue it into hashset_in.

Forgive the patch name.

/*
gcc /home/jian/Desktop/regress_pgsql/strip_white_space.c && ./a.out
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <limits.h>
#include <errno.h>

/*
 * array_isspace() --- a non-locale-dependent isspace()
 *
 * We used to use isspace() for parsing array values, but that has
 * undesirable results: an array value might be silently interpreted
 * differently depending on the locale setting.  Now we just hard-wire
 * the traditional ASCII definition of isspace().
 */
static bool
array_isspace(char ch)
{
if (ch == ' ' ||
ch == '\t' ||
ch == '\n' ||
ch == '\r' ||
ch == '\v' ||
ch == '\f')
return true;
return false;
}

int main(void)
{
    long *temp = malloc(10 * sizeof(long));

    memset(temp, 0, 10 * sizeof(long)); /* zero the whole buffer */
    char source[5][50] = {{0}};
    snprintf(source[0], sizeof(source[0]), "%s", "  { 1   ,   20  }");
    snprintf(source[1], sizeof(source[0]), "%s", "   { 1  ,20 ,   30 ");
    snprintf(source[2], sizeof(source[0]), "%s", "   {1  ,20 ,   30 ");
    snprintf(source[3], sizeof(source[0]), "%s", "   {1  ,  20 ,   30  }");
    snprintf(source[4], sizeof(source[0]), "%s", "   {1  ,  20 ,   30  }   ");

    /* Make a modifiable copy of the input */
    char *p;
    char string_save[50];

    for (int j = 0; j < 5; j++)
    {
        snprintf(string_save, sizeof(string_save), "%s", source[j]);
        p = string_save;

        int i = 0;
        while (array_isspace(*p))
            p++;
        if (*p != '{')
        {
            printf("line: %d should be {\n", __LINE__);
            exit(EXIT_FAILURE);
        }

        for (;;)
        {
            char *q;

            if (*p == '{')
                p++;
            temp[i] = strtol(p, &q, 10);
            printf("temp[j=%d] [%d]=%ld\n", j, i, temp[i]);

            if (*q == '}' && (*(q + 1) == '\0'))
            {
                printf("all works ok now exit\n");
                break;
            }
            if (!array_isspace(*q) && *q != ',')
            {
                printf("wrong format. program will exit\n");
                exit(EXIT_FAILURE);
            }
            while (array_isspace(*q))
                q++;
            if (*q != ',')
                break;
            else
                p = q + 1;
            i++;
        }
    }
    free(temp);
    return 0;
}
diff --git a/hashset.c b/hashset.c
index d62e0491..e71764c5 100644
--- a/hashset.c
+++ b/hashset.c
@@ -3,15 +3,13 @@
  *
  * Copyright (C) Tomas Vondra, 2019
  */
+#include "postgres.h"
 
-#include 
 #include 
-#include 
 #include 
 #include 
 #include 
 
-#include "postgres.h"
 #include "libpq/pqformat.h"
 #include "nodes/memnodes.h"
 #include "utils/array.h"
@@ -255,10 +253,10 @@ hashset_send(PG_FUNCTION_ARGS)
 	pq_begintypsend(&buf);
 
 	/* Send the non-data fields */
-	pq_sendint(&buf, set->flags, 4);
-	pq_sendint(&buf, set->maxelements, 4);
-	pq_sendint(&buf, set->nelements, 4);
-	pq_sendint(&buf, set->hashfn_id, 4);
+	pq_sendint32(&buf, set->flags);
+	

Re: Do we want a hashset type?

2023-06-14 Thread Tom Dunstan
On Wed, 14 Jun 2023 at 19:14, Tomas Vondra 
wrote:

> > ...So we'd want the same index usage as
> > =ANY(array) but would like faster row checking than we get with an array
> > when other indexes are used.
>
> We kinda already do this since PG14 (commit 50e17ad281), actually. If
> the list is long enough (9 values or more), we'll build a hash table
> during query execution. So pretty much exactly what you're asking for.
>

Ha! That is great. Unfortunately we can't rely on it as we have customers
using versions back to 12. But good to know that it's available when we
bump the required versions.

Thanks

Tom


Re: Do we want a hashset type?

2023-06-14 Thread Joel Jacobson
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
> On 6/14/23 14:57, Joel Jacobson wrote:
>> Would it be feasible to teach the planner to utilize the internal hash table 
>> of
>> hashset directly? In the case of arrays, the hash table construction is an
...
> It's definitely something I'd leave out of v0, personally.

OK, thanks for guidance, I'll stay away from it.

I've been doing some preparatory work on this todo item:

> 3) support for other types (now it only works with int32)

I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).

Other changes since last update (4e60615):

* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current capacity
  of a hashset.
* Refactored hashset initialization:
  - Replaced hashset_init(int) with int4hashset() to initialize an empty hashset
with zero capacity.
  - Added int4hashset_with_capacity(int) to initialize a hashset with
a specified capacity.
* Improved README.md and testing

As a next step, I'm planning on adding int8 support.

Looks and sounds good?

/Joel

hashset-0.0.1-48e29eb
Description: Binary data


Re: Do we want a hashset type?

2023-06-14 Thread Andrew Dunstan


On 2023-06-14 We 05:44, Tomas Vondra wrote:


The thing is, adding stuff to core is not free - it means the community
becomes responsible for maintenance, testing, fixing issues, etc. It's
not feasible (or desirable) to have all extensions in core, and cloud
vendors generally do have ways to support some pre-vetted extensions
that they deem useful enough. Granted, it means vetting/maintenance for
them, but that's kinda the point of managed services. And it'd not be
free for us either.



I agree it's a judgement call. But the code burden here seems pretty 
small, far smaller than, say, the SQL/JSON patches. And I think the 
range of applications that could benefit is quite significant.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-14 Thread Tomas Vondra



On 6/14/23 14:57, Joel Jacobson wrote:
> On Wed, Jun 14, 2023, at 11:44, Tomas Vondra wrote:
>>> Perspective from a potential user: I'm currently working on something
>>> where an array-like structure with fast membership test performance
>>> would be very useful. The main type of query is doing an =ANY(the set)
>>> filter, where the set could contain anywhere from very few to thousands
>>> of entries (ints in our case). So we'd want the same index usage as
>>> =ANY(array) but would like faster row checking than we get with an array
>>> when other indexes are used.
>>>
>>
>> We kinda already do this since PG14 (commit 50e17ad281), actually. If
>> the list is long enough (9 values or more), we'll build a hash table
>> during query execution. So pretty much exactly what you're asking for.
> 
> Would it be feasible to teach the planner to utilize the internal hash table 
> of
> hashset directly? In the case of arrays, the hash table construction is an
> ad hoc operation, whereas with hashset, the hash table already exists, which
> could potentially lead to a faster execution.
> 
> Essentially, the aim would be to support:
> 
> =ANY(hashset)
> 
> Instead of the current:
> 
> =ANY(hashset_to_array(hashset))
> 
> Thoughts?

That should be possible, but probably only when hashset is a built-in
data type (maybe polymorphic).

I don't know if it'd be worth it, the general idea is that building the
hash table is way cheaper than repeated lookups in an array. Yeah, it
might save something, but likely only a tiny fraction of the runtime.

It's definitely something I'd leave out of v0, personally.

=ANY(set) should probably work with an implicit ARRAY cast, I believe.
It'll do the ad hoc build, ofc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-14 Thread Joel Jacobson
On Wed, Jun 14, 2023, at 11:44, Tomas Vondra wrote:
>> Perspective from a potential user: I'm currently working on something
>> where an array-like structure with fast membership test performance
>> would be very useful. The main type of query is doing an =ANY(the set)
>> filter, where the set could contain anywhere from very few to thousands
>> of entries (ints in our case). So we'd want the same index usage as
>> =ANY(array) but would like faster row checking than we get with an array
>> when other indexes are used.
>> 
>
> We kinda already do this since PG14 (commit 50e17ad281), actually. If
> the list is long enough (9 values or more), we'll build a hash table
> during query execution. So pretty much exactly what you're asking for.

Would it be feasible to teach the planner to utilize the internal hash table of
hashset directly? In the case of arrays, the hash table construction is an
ad hoc operation, whereas with hashset, the hash table already exists, which
could potentially lead to a faster execution.

Essentially, the aim would be to support:

=ANY(hashset)

Instead of the current:

=ANY(hashset_to_array(hashset))

Thoughts?

/Joel




Re: Do we want a hashset type?

2023-06-14 Thread Tomas Vondra
On 6/14/23 06:31, Tom Dunstan wrote:
On Mon, 12 Jun 2023 at 22:37, Tomas Vondra wrote:
> 
> Perhaps. So you're proposing to have this as a regular built-in type?
> It's hard for me to judge how popular this feature would be, but I guess
> people often use arrays while they actually want set semantics ...
> 
> 
> Perspective from a potential user: I'm currently working on something
> where an array-like structure with fast membership test performance
> would be very useful. The main type of query is doing an =ANY(the set)
> filter, where the set could contain anywhere from very few to thousands
> of entries (ints in our case). So we'd want the same index usage as
> =ANY(array) but would like faster row checking than we get with an array
> when other indexes are used.
> 

We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.

> Our app runs connecting to either an embedded postgres database that we
> control or an external one controlled by customers - this is typically
> RDS or some other cloud vendor's DB. Having such a type as a separate
> extension would make it unusable for us until all relevant cloud vendors
> decided that it was popular enough to include - something that may never
> happen, or even if it did, now any time soon.
> 

Understood, but that's really a problem / choice of the cloud vendors.

The thing is, adding stuff to core is not free - it means the community
becomes responsible for maintenance, testing, fixing issues, etc. It's
not feasible (or desirable) to have all extensions in core, and cloud
vendors generally do have ways to support some pre-vetted extensions
that they deem useful enough. Granted, it means vetting/maintenance for
them, but that's kinda the point of managed services. And it'd not be
free for us either.

Anyway, that's mostly irrelevant, as PG14 already does the hash table
for this kind of queries. And I'm not strictly against adding some of
this into core, if it ends up being useful enough.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-14 Thread Joel Jacobson
On Wed, Jun 14, 2023, at 06:31, Tom Dunstan wrote:
> On Mon, 12 Jun 2023 at 22:37, Tomas Vondra 
>  wrote:
>> Perhaps. So you're proposing to have this as a regular built-in type?
>> It's hard for me to judge how popular this feature would be, but I guess
>> people often use arrays while they actually want set semantics ...
>
> Perspective from a potential user: I'm currently working on something 
> where an array-like structure with fast membership test performance 
> would be very useful. The main type of query is doing an =ANY(the set) 
> filter, where the set could contain anywhere from very few to thousands 
> of entries (ints in our case). So we'd want the same index usage as 
> =ANY(array) but would like faster row checking than we get with an 
> array when other indexes are used.

Thanks for providing an interesting use-case.

If you would like to help, one thing that would be helpful
would be a complete runnable SQL script,
that demonstrates exactly the various array-based queries
you currently use, with random data that resembles
reality as closely as possible, i.e. the same number of rows
in the tables, and similar distribution of values etc.

This would be helpful in terms of documentation,
as I think it would be good to provide Usage examples
that are based on real-life scenarios.

It would also be helpful to create realistic benchmarks when
evaluating and optimising the performance.

> Our app runs connecting to either an embedded postgres database that we 
> control or an external one controlled by customers - this is typically 
> RDS or some other cloud vendor's DB. Having such a type as a separate 
> extension would make it unusable for us until all relevant cloud 
> vendors decided that it was popular enough to include - something that 
> may never happen, or even if it did, now any time soon.

Good point.




Re: Do we want a hashset type?

2023-06-13 Thread Tom Dunstan
On Mon, 12 Jun 2023 at 22:37, Tomas Vondra 
wrote:

> Perhaps. So you're proposing to have this as a regular built-in type?
> It's hard for me to judge how popular this feature would be, but I guess
> people often use arrays while they actually want set semantics ...
>

Perspective from a potential user: I'm currently working on something where
an array-like structure with fast membership test performance would be very
useful. The main type of query is doing an =ANY(the set) filter, where the
set could contain anywhere from very few to thousands of entries (ints in
our case). So we'd want the same index usage as =ANY(array) but would like
faster row checking than we get with an array when other indexes are used.

Our app runs connecting to either an embedded postgres database that we
control or an external one controlled by customers - this is typically RDS
or some other cloud vendor's DB. Having such a type as a separate extension
would make it unusable for us until all relevant cloud vendors decided that
it was popular enough to include - something that may never happen, or even
if it did, now any time soon.

Cheers

Tom


Re: Do we want a hashset type?

2023-06-13 Thread Joel Jacobson
On Tue, Jun 13, 2023, at 20:50, Joel Jacobson wrote:
> hashset is now using hash_bytes_uint32() from hashfn.h

I spotted a problem in the ordering logic of the comparison functions.

The issue was with handling hashsets containing empty positions,
causing non-lexicographic ordering.

The updated implementation now correctly iterates over the hashsets,
skipping any empty positions, which results in proper comparison
and ordering of elements present in the hashset.

New patch attached.


hashset-4e60615.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-13 Thread Joel Jacobson
On Mon, Jun 12, 2023, at 22:36, Joel Jacobson wrote:
> On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
>> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
>> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store an ID of the
>> hash function in the set, so that we can support a different algorithm
>> in the future, if we choose to.

hashset is now using hash_bytes_uint32() from hashfn.h

Other changes in the same commit:

* Introduce hashfn_id field to specify hash function ID
* Implement hashset_send and hashset_recv and add C-test using libpq
* Add operators and operator classes for hashset comparison, sorting
  and distinct queries

Looks good? If so, I wonder what's best to focus on next?
Perhaps adding support for bigint? Other ideas?

/Joel

hashset-0.0.1-da3b024.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-12 Thread Joel Jacobson
On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:

Oh, sorry, I completely failed to read the meaning of the Columns.

> lookup3:
>
> Small key speed test - 4-byte keys - 30.17 cycles/hash
> Small key speed test - 8-byte keys - 31.00 cycles/hash
> Small key speed test - 16-byte keys - 49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test - 4-byte keys - 29.00 cycles/hash
> Small key speed test - 8-byte keys - 29.58 cycles/hash
> Small key speed test - 16-byte keys - 37.00 cycles/hash

The winner of the "Small key speed test" competition seems to be:

ahash64 "ahash 64bit":
Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash

Looks like it's a popular one, e.g. it's used by Rust in their 
std::collections::HashSet.

Another notable property of ahash64 is that it's "DOS resistant",
but it isn't crucial for our use case, since we exclusively target
auto-generated primary keys which are not influenced by end-users.

> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with tiny amount of
> free space (say 10% and not ~50%).

With "optimizing for space" I meant trying to find some alternative or
intermediate data structure that is more likely to be compressible,
like your idea of sorting the data.
What I wondered was if the on-disk format would be affected by
the choice of hash function. I guess it wouldn't, if the hashset
is created by adding the elements one-by-one by iterating
through the elements by reading the on-disk format.
But I thought maybe something more advanced could be
done, where conversion between the on-disk format
and the in-memory format could be done without naively
iterating through all elements, i.e. something less complex
than O(n).
No idea what that mechanism would be though.

> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store an ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.

Sounds good to me.

Smart idea to include the hash function algorithm ID in the set,
to allow implementing a different one in the future!

/Joel




Re: Do we want a hashset type?

2023-06-12 Thread Tomas Vondra



On 6/12/23 19:34, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
 1) better hash table implementation
> 
> I noticed src/include/common/hashfn.h provides implementation
> of the Jenkins/lookup3 hash function, and thought maybe
> we could simply use it in hashset?
> 
> However, I noticed that according to SMHasher [1],
> Jenkins/lookup3 has some quality problems:
> "UB, 28% bias, collisions, 30% distr, BIC"
> 
> Not sure if that's true or maybe not a problem in the PostgreSQL 
> implementation?
> According to SMHasher, the two fastest 32/64-bit hash functions
> for non-cryptographic purposes without any quality problems
> that are also portable seem to be these two:
> 
> wyhash v4.1 (64-bit) [2]
> MiB/sec: 22513.04
> cycl./hash: 29.01
> size: 474
> 
> xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
> MiB/sec: 20722.94
> cycl./hash: 30.26
> size: 756
> 
> [1] https://github.com/rurban/smhasher
> [2] https://github.com/wangyi-fudan/wyhash
> [3] https://github.com/Cyan4973/xxHash
> 

But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:

lookup3:

Small key speed test - 4-byte keys - 30.17 cycles/hash
Small key speed test - 8-byte keys - 31.00 cycles/hash
Small key speed test - 16-byte keys - 49.00 cycles/hash

xxh3low:

Small key speed test - 4-byte keys - 29.00 cycles/hash
Small key speed test - 8-byte keys - 29.58 cycles/hash
Small key speed test - 16-byte keys - 37.00 cycles/hash

But you can try doing some measurements, of course. Or just do profiling
to see how much time we spend in the hash function - I'd bet it's a pretty
tiny fraction of the total time.

As for the "quality" issues - it's the same algorithm in Postgres, so it
has the same issues. I don't know if that has a measurable impact, though. I'd
guess it does not, particularly for "reasonably small" sets.

 5) more efficient storage format, with versioning etc.
>> I think the main question is whether to serialize the hash table as is,
>> or compact it in some way. The current code just uses the same thing for
>> both cases - on-disk format and in-memory representation (aggstate).
>> That's simple, but it also means the on-disk value is likely not well
> compressible (because it's ~50% random data).
>>
>> I've thought about serializing just the values (as a simple array), but
>> that defeats the whole purpose of fast membership checks. I have two ideas:
>>
>> a) sort the data, and use binary search for this compact variant (and
>> then expand it into "full" hash table if needed)
>>
>> b) use a more compact hash table (with load factor much closer to 1.0)
>>
>> Not sure which if this option is the right one, each has cost for
>> converting between formats (but large on-disk value is not free either).
>>
>> That's roughly what I did for the tdigest extension.
> 
> Is the choice of hash function (and its in-memory representation)
> independent of the on-disk format question, i.e. could we work
> on experimenting and evaluating different hash functions first,
> to optimise for speed and quality, and then when done, proceed
> and optimise for space, or are the two intertwined somehow?
> 

Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with tiny amount of
free space (say 10% and not ~50%).


My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store an ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.
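
In code, that could be as simple as reserving a field in the on-disk struct.
Here is a sketch based on the fields this thread mentions in hashset_send()
(flags, maxelements, nelements, hashfn_id); the real layout may differ:

typedef struct int4hashset_t
{
	int32	vl_len_;		/* varlena header, don't touch directly */
	int32	flags;			/* reserved for future use */
	int32	maxelements;	/* capacity, i.e. number of slots */
	int32	nelements;		/* number of elements currently stored */
	int32	hashfn_id;		/* which hash algorithm built this set */
	char	data[FLEXIBLE_ARRAY_MEMBER];	/* usage bitmap + values */
} int4hashset_t;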


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-12 Thread Joel Jacobson
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 1) better hash table implementation

I noticed src/include/common/hashfn.h provides implementation
of the Jenkins/lookup3 hash function, and thought maybe
we could simply use it in hashset?

However, I noticed that according to SMHasher [1],
Jenkins/lookup3 has some quality problems:
"UB, 28% bias, collisions, 30% distr, BIC"

Not sure if that's true or maybe not a problem in the PostgreSQL implementation?

According to SMHasher, the two fastest 32/64-bit hash functions
for non-cryptographic purposes without any quality problems
that are also portable seem to be these two:

wyhash v4.1 (64-bit) [2]
MiB/sec: 22513.04
cycl./hash: 29.01
size: 474

xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
MiB/sec: 20722.94
cycl./hash: 30.26
size: 756

[1] https://github.com/rurban/smhasher
[2] https://github.com/wangyi-fudan/wyhash
[3] https://github.com/Cyan4973/xxHash

>>> 5) more efficient storage format, with versioning etc.
> I think the main question is whether to serialize the hash table as is,
> or compact it in some way. The current code just uses the same thing for
> both cases - on-disk format and in-memory representation (aggstate).
> That's simple, but it also means the on-disk value is likely not well
> compressible (because it's ~50% random data).
>
> I've thought about serializing just the values (as a simple array), but
> that defeats the whole purpose of fast membership checks. I have two ideas:
>
> a) sort the data, and use binary search for this compact variant (and
> then expand it into "full" hash table if needed)
>
> b) use a more compact hash table (with load factor much closer to 1.0)
>
> Not sure which if this option is the right one, each has cost for
> converting between formats (but large on-disk value is not free either).
>
> That's roughly what I did for the tdigest extension.

Is the choice of hash function (and its in-memory representation)
independent of the on-disk format question, i.e. could we work
on experimenting and evaluating different hash functions first,
to optimise for speed and quality, and then when done, proceed
and optimise for space, or are the two intertwined somehow?

/Joel




Re: Do we want a hashset type?

2023-06-12 Thread Joel Jacobson
On Sat, Jun 10, 2023, at 17:46, Andrew Dunstan wrote:
> Maybe you can post a full patch as well as incremental?

Attached patch is based on tvondra's last commit 375b072.

> Stylistically I think you should reduce reliance on magic numbers (like 13). 
> Probably need some #define's?

Great idea, fixed, I've added a HASHSET_STEP definition (set to the value 13).

On Sat, Jun 10, 2023, at 17:51, jian he wrote:
> int32 value = strtol(str, &endptr, 10);
> there is no int32 value range check? 
> imitate src/backend/utils/adt/int.c. the following way is what I came up with.
>
>
> int64 value = strtol(str, &endptr, 10);
>
> if (errno == ERANGE || value < INT_MIN || value > INT_MAX)

Thanks, fixed as suggested, except I used PG_INT32_MIN and PG_INT32_MAX,
which explicitly represent the minimum and maximum values for a 32-bit integer,
regardless of the platform or C implementation.

> also it will loop infinitely in hashset_in if you supply a wrong value,
> for example: select '{1,2s}'::hashset;
> I needed kill -9 to kill the process.

Thanks. I've added a new test, `sql/invalid.sql` with that example query.

Here is a summary of all other changes:
 
* README.md: Added sections Usage, Data types, Functions and Aggregate Functions

* Added test/ directory with some tests.

* Added "not production-ready" notice at top of README, warning for breaking
changes and no migration scripts, until our first release.

* Change version from 1.0.0 to 0.0.1 to indicate current status.

* Added CEIL_DIV macro (see the sketch after this list)

* Implemented hashset_in(), hashset_out()
  The syntax for the serialized format is comma-separated integer values
  wrapped in curly braces, e.g. '{1,2,3}'

* Implemented hashset_recv() to match the existing hashset_send()

* Removed/rewrote some tdigest related comments
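
For reference, CEIL_DIV is presumably the usual round-up integer division
idiom (an assumption; the patch may define it differently):

/* integer division rounding up, e.g. CEIL_DIV(10, 8) == 2 */
#define CEIL_DIV(a, b)	(((a) + (b) - 1) / (b))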

/Joel

hashset-0.0.1-a8a282a.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-12 Thread Andrew Dunstan


On 2023-06-12 Mo 09:00, Tomas Vondra wrote:



What was the json/jsonb story, was it ever an extension before
being included in core?

I don't recall what the exact story was, but I guess the "json" type was
added to core very long ago (before we started to push back a bit), and
we added some SQL grammar stuff too, which can't be done from extension.
So when we added jsonb (much later than json), there wasn't much point
in not having it in core.




Not quite.

The json type was added in 9.2 (Sept 2012) and jsonb in 9.4 (Dec 2014). I 
wouldn't call those far apart or very long ago. Neither included any 
grammar changes AFAIR.


But if they had been added as extensions we'd probably be in a whole lot 
more trouble now implementing SQL/JSON, so whether that was foresight or 
laziness I think we landed on our feet there.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-12 Thread Tomas Vondra



On 6/12/23 14:46, Andrew Dunstan wrote:
> 
> On 2023-06-11 Su 16:15, Joel Jacobson wrote:
>> On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
>>> On 6/11/23 12:26, Joel Jacobson wrote:
 I think there are some good arguments that speak in favour of including 
 it in core:
>> ...
>>> I agree with all of that, but ...
>>>
>>> It's just past feature freeze, so the earliest release this could appear
>>> in is 17, about 15 months away.
>>>
>>> Once stuff gets added to core, it's tied to the release cycle, so no new
>>> features in between.
>>>
>>> Presumably people would like to use the extension in the release they
>>> already use, without backporting.
>>>
>>> Postgres is extensible for a reason, exactly so that we don't need to
>>> have everything in core.
>> Interesting, I've never thought about this one before:
>> What if something is deemed to be fundamental and therefore qualifies for core 
>> inclusion,
>> and at the same time is suitable to be made an extension.
>> Would it be possible to ship such an extension as pre-installed?
>>
>> What was the json/jsonb story, was it ever an extension before
>> being included in core?
> 
> 
> No, and the difficulty is that an in-core type and associated functions
> will have different oids, so migrating from one to the other would be at
> best painful.
> 
> So it's a kind of now or never decision. I think extensions are
> excellent for specialized types. But I don't regard a set type in that
> light.
> 

Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...

If we do that, I wonder if we could do something similar to arrays, with
the polymorphism and SQL grammar support. Does SQL have any concept of
sets (or arrays, for that matter) as data types?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-12 Thread Tomas Vondra



On 6/11/23 22:15, Joel Jacobson wrote:
> On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
>> On 6/11/23 12:26, Joel Jacobson wrote:
>>> I think there are some good arguments that speak in favour of including it 
>>> in core:
> ...
>>
>> I agree with all of that, but ...
>>
>> It's just past feature freeze, so the earliest release this could appear
>> in is 17, about 15 months away.
>>
>> Once stuff gets added to core, it's tied to the release cycle, so no new
>> features in between.
>>
>> Presumably people would like to use the extension in the release they
>> already use, without backporting.
>>
>> Postgres is extensible for a reason, exactly so that we don't need to
>> have everything in core.
> 
> Interesting, I've never thought about this one before:
> What if something is deemed to be fundamental and therefore qualifies for core 
> inclusion,
> and at the same time is suitable to be made an extension.
> Would it be possible to ship such an extension as pre-installed?
> 

I think it's always a matter of judgment - I don't think there's any
clear rule set in stone. If something is "fundamental" and can be done in
an extension, there's always the option to have it in contrib (with all
the limitations I mentioned).

FWIW I'm not strictly against adding it to contrib, if there's agreement
it's worth it. But if we consider it to be a fundamental data structure
and widely useful, maybe we should consider making it a built-in data
type, with fixed OID etc. That'd allow support at the SQL grammar level,
and perhaps also from the proposed SQL/PGQ (and GQL?). AFAIK moving data
types from extension (even if from contrib) to core is not painless.

Either way it might be a nice / useful first patch, I guess.

> What was the json/jsonb story, was it ever an extension before
> being included in core?

I don't recall what the exact story was, but I guess the "json" type was
added to core very long ago (before we started to push back a bit), and
we added some SQL grammar stuff too, which can't be done from an extension.
So when we added jsonb (much later than json), there wasn't much point
in not having it in core.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-12 Thread Andrew Dunstan


On 2023-06-11 Su 16:15, Joel Jacobson wrote:

On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:

On 6/11/23 12:26, Joel Jacobson wrote:

I think there are some good arguments that speak in favour of including it in 
core:

...

I agree with all of that, but ...

It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.

Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.

Presumably people would like to use the extension in the release they
already use, without backporting.

Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.

Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualifies for core 
inclusion,
and at the same time is suitable to be made an extension.
Would it be possible to ship such an extension as pre-installed?

What was the json/jsonb story, was it ever an extension before
being included in core?



No, and the difficulty is that an in-core type and associated functions 
will have different oids, so migrating from one to the other would be at 
best painful.


So it's a kind of now or never decision. I think extensions are 
excellent for specialized types. But I don't regard a set type in that 
light.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-11 Thread Joel Jacobson
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
> On 6/11/23 12:26, Joel Jacobson wrote:
>> I think there are some good arguments that speak in favour of including it 
>> in core:
...
>
> I agree with all of that, but ...
>
> It's just past feature freeze, so the earliest release this could appear
> in is 17, about 15 months away.
>
> Once stuff gets added to core, it's tied to the release cycle, so no new
> features in between.
>
> Presumably people would like to use the extension in the release they
> already use, without backporting.
>
> Postgres is extensible for a reason, exactly so that we don't need to
> have everything in core.

Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualifies for core 
inclusion,
and at the same time is suitable to be made an extension.
Would it be possible to ship such an extension as pre-installed?

What was the json/jsonb story, was it ever an extension before
being included in core?

/Joel




Re: Do we want a hashset type?

2023-06-11 Thread Joel Jacobson
On Sun, Jun 11, 2023, at 16:58, Andrew Dunstan wrote:
>> On 2023-06-11 Su 06:26, Joel Jacobson wrote:
>> Perhaps "set" would have been a better name, since the use of hash functions
>> from an end-user perspective is an implementation detail, but we cannot use
>> that word since it's a reserved keyword, hence "hashset".
>
> Rather than use "hashset", which as you say is based on an implementation
> detail, I would prefer something like
> "integer_set" - what it's a set of.

Apologies for the confusion previously.
Upon further reflection, I recognize that the term "hash" in "hashset"
conveys more than an implementation detail:
it imparts key information about the set's performance characteristics
as well as its functionality.

In hindsight, "hashset" does emerge as the most suitable terminology.

/Joel

Re: Do we want a hashset type?

2023-06-11 Thread Tomas Vondra



On 6/11/23 12:26, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
>> On 6/10/23 17:46, Andrew Dunstan wrote:
>>>
>>> Maybe you can post a full patch as well as incremental?
>>>
>>
>> I wonder if we should keep discussing this extension here, considering
>> it's going to be out of core (at least for now). Not sure how many
>> pgsql-hackers are interested in this, so maybe we should just move it to
>> github PRs or something ...
> 
> I think there are some good arguments that speak in favour of including it 
> in core:
> 
> 1. It's a fundamental data structure. Perhaps "set" would have been a better 
> name,
> since the use of hash functions from an end-user perspective is an
> implementation detail, but we cannot use that word since it's a reserved 
> keyword, hence "hashset".
> 
> 2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
> among SQL users to evaluate graph queries. Even if a future implementation of 
> SQL/PGQ
> would mean users wouldn't need to deal with the hashset type directly, the 
> same
> type could hopefully be used, in part or in whole, under the hood by the 
> future 
> SQL/PGQ implementation. If low-level functionality is useful on its own, I 
> think it's
> a benefit to expose it to users, since it allows system testing of the 
> component
> in isolation, even if it's primarily going to be used as a smaller part of a 
> larger, more
> high-level component.
> 
> 3. I think there is a general need for hashset, experienced by myself, Andrew 
> and
> I would guess lots of other users. The general pattern that will be improved 
> is
> when you currently would do array_agg(DISTINCT ...);
> probably there are other situations too, since it's a fundamental data 
> structure.
> 

I agree with all of that, but ...

It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.

Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.

Presumably people would like to use the extension in the release they
already use, without backporting.

Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.

> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
 3) support for other types (now it only works with int32)
>> I think we should decide what types we want/need to support, and add one
>> or two types early. Otherwise we'll have code / on-disk format making
>> various assumptions about the type length etc.
>>
>> I have no idea what types people use as node IDs - is it likely we'll
>> need to support types passed by reference / varlena types? Or can we
>> just assume it's int/bigint?
> 
> I think we should just support data types that would be sensible
> to use as a PRIMARY KEY in a fully normalised data model,
> which I believe would only include "int", "bigint" and "uuid".
> 

OK, makes sense.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-11 Thread Andrew Dunstan


On 2023-06-11 Su 06:26, Joel Jacobson wrote:

On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:

On 6/10/23 17:46, Andrew Dunstan wrote:

Maybe you can post a full patch as well as incremental?


I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...

I think there are some good arguments that speak in favour of including it in 
core:

1. It's a fundamental data structure.



That's reason enough IMNSHO.



Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is an
implementation detail, but we cannot use that word since it's a reserved 
keyword, hence "hashset".



Rather than use "hashset", which as you say is based on an 
implementation detail, I would prefer something like "integer_set" - 
what it's a set of.



cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-11 Thread Joel Jacobson
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
> On 6/10/23 17:46, Andrew Dunstan wrote:
>> 
>> Maybe you can post a full patch as well as incremental?
>> 
>
> I wonder if we should keep discussing this extension here, considering
> it's going to be out of core (at least for now). Not sure how many
> pgsql-hackers are interested in this, so maybe we should just move it to
> github PRs or something ...

I think there are some good arguments that speak in favour of including it in 
core:

1. It's a fundamental data structure. Perhaps "set" would have been a better 
name,
since the use of hash functions from an end-user perspective is an
implementation detail, but we cannot use that word since it's a reserved 
keyword, hence "hashset".

2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
among SQL users to evaluate graph queries. Even if a future implementation of 
SQL/PGQ
would mean users wouldn't need to deal with the hashset type directly, the same
type could hopefully be used, in part or in whole, under the hood by the future 
SQL/PGQ implementation. If low-level functionality is useful on its own, I 
think it's
a benefit to expose it to users, since it allows system testing of the 
component
in isolation, even if it's primarily going to be used as a smaller part of a 
larger, more
high-level component.

3. I think there is a general need for hashset, experienced by myself, Andrew 
and
I would guess lots of other users. The general pattern that will be improved is
when you currently would do array_agg(DISTINCT ...);
probably there are other situations too, since it's a fundamental data 
structure.

On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 3) support for other types (now it only works with int32)
> I think we should decide what types we want/need to support, and add one
> or two types early. Otherwise we'll have code / on-disk format making
> various assumptions about the type length etc.
>
> I have no idea what types people use as node IDs - is it likely we'll
> need to support types passed by reference / varlena types? Or can we
> just assume it's int/bigint?

I think we should just support data types that would be sensible
to use as a PRIMARY KEY in a fully normalised data model,
which I believe would only include "int", "bigint" and "uuid".

/Joel




Re: Do we want a hashset type?

2023-06-10 Thread Tomas Vondra



On 6/10/23 17:46, Andrew Dunstan wrote:
> 
> On 2023-06-09 Fr 07:56, Joel Jacobson wrote:
>> On Fri, Jun 9, 2023, at 13:33, jian he wrote:
>> > Hi, I am quite new to C.
>> > I have 3 questions about the following function.
>> > 1. 7691,4201, I assume they are just random prime ints?
>>
>> Yes, 7691 and 4201 are likely chosen as random prime numbers.
>> In hash functions, prime numbers are often used to help in evenly
>> distributing
>> the hash values across the range and reduce the chance of collisions.
>>
>> > 2. I don't get the last return set, even the return type should be bool.
>>
>> Thanks, you found a mistake!
>>
>> The line
>>
>>     return set;
>>
>> is actually unreachable and should be removed.
>> The function will always return either true or false within the while
>> loop and
>> never reach the final return statement.
>>
>> I've attached a new incremental patch with this fix.
>>
>> > 3. I don't understand 13 in hash = (hash + 13) % set->maxelements;
>>
>> The value 13 is used for linear probing [1] in handling hash collisions.
>> Linear probing sequentially checks the next slot in the array when a
>> collision
>> occurs. 13, being a small prime number not near a power of 2, helps in
>> uniformly
>> distributing data and ensuring that all slots are probed, as it's
>> relatively prime
>> to the hash table size.
>>
>> Hm, I realise we actually don't ensure the hash table size and step
>> size (13)
>> are coprime. I've fixed that in the attached patch as well.
>>
>> [1] https://en.wikipedia.org/wiki/Linear_probing
>>
>>
> 
> 
> Maybe you can post a full patch as well as incremental?
> 

I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...


> Stylistically I think you should reduce reliance on magic numbers (like
> 13). Probably need some #define's?
> 

Yeah, absolutely. This was just pure laziness.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-10 Thread Tomas Vondra



On 6/9/23 12:58, Joel Jacobson wrote:
> On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
>> Would you be interested in helping with / working on some of that? I
>> don't have immediate need for this stuff, so it's not very high on my
>> TODO list.
> 
> Sure, I'm willing to help!
> 
> I've attached a patch that works on some of the items on your list,
> including some additions to the README.md.
> 
> There were a bunch of places where `maxelements / 8` caused bugs,
> that had to be changed to do proper integer ceiling division:
> 
> -   values = (int32 *) (set->data + set->maxelements / 8);
> +   values = (int32 *) (set->data + (set->maxelements + 7) / 8);
> 
> Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
> to the PostgreSQL source code in general, since it's easy to make this 
> mistake,
> and quite verbose/error-prone to write it out manually everywhere.
> Such macros could simplify code in e.g. numeric.c.
> 

Yeah, it'd be good to have macros to calculate the sizes. We'll need
that in many places.

>> There's a bunch of stuff that needs to be improved to make this properly
>> usable, like:
>>
>> 1) better hash table implementation
> TODO
> 
>> 2) input/output functions
> I've attempted to implement these.
> I thought comma-separated values wrapped in curly braces felt like the most 
> natural format,
> example:
> SELECT '{1,2,3}'::hashset;
> 

+1 to that. I'd mimic the array in/out functions as much as possible.

>> 3) support for other types (now it only works with int32)
> TODO
> 

I think we should decide what types we want/need to support, and add one
or two types early. Otherwise we'll have code / on-disk format making
various assumptions about the type length etc.

I have no idea what types people use as node IDs - is it likely we'll
need to support types passed by reference / varlena types? Or can we
just assume it's int/bigint?

>> 4) I wonder if this might be done as an array-like polymorphic type.
> That would be nice!
> I guess the work-around would be to store the actual value of non-int type
> in a lookup table, and then hash the int-based primary key in such a table.
> 
> Do you think later implementing polymorphic type support would
> mean a more or less complete rewrite, or can we carry on with int32-support
> and add it later on?
> 

I don't know. From the storage perspective it doesn't matter much, I
think, we would not need to change that. But I think adding a
polymorphic type (array-like) would require changes to grammar, and
that's not possible for an extension. If there's a way, I'm not aware of
it and I don't recall an extension doing that.

>> 5) more efficient storage format, with versioning etc.
> TODO
> 

I think the main question is whether to serialize the hash table as is,
or compact it in some way. The current code just uses the same thing for
both cases - on-disk format and in-memory representation (aggstate).
That's simple, but it also means the on-disk value is likely not well
compressible (because it's ~50% random data).

I've thought about serializing just the values (as a simple array), but
that defeats the whole purpose of fast membership checks. I have two ideas:

a) sort the data, and use binary search for this compact variant (and
then expand it into "full" hash table if needed)

b) use a more compact hash table (with load factor much closer to 1.0)

Not sure which of these options is the right one; each has a cost for
converting between formats (but a large on-disk value is not free either).

That's roughly what I did for the tdigest extension.
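
To make option (a) a bit more concrete, the membership test on the compact
variant could be a plain binary search over the serialized values - a sketch,
assuming the compact format is simply a sorted int32 array (illustrative
names, not actual code from the extension):

static bool
sorted_set_contains(const int32 *values, int nvalues, int32 value)
{
    int     lo = 0;
    int     hi = nvalues - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (values[mid] == value)
            return true;
        else if (values[mid] < value)
            lo = mid + 1;           /* search the upper half */
        else
            hi = mid - 1;           /* search the lower half */
    }

    return false;
}

That keeps the on-disk value near 4 bytes per element (plus header), and
sorted data compresses much better than a half-empty hash table, at the cost
of O(log n) lookups plus a conversion step when expanding back into the full
hash table.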


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-10 Thread jian he
In function hashset_in:

int32 value = strtol(str, &endptr, 10);
there is no int32 value range check?
Imitating src/backend/utils/adt/int.c, the following is what I came up
with.

int64 value = strtol(str, &endptr, 10);
if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
    ereturn(fcinfo->context, (Datum) 0,
            (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("value \"%s\" is out of range for type %s", str,
                    "integer")));

 set = hashset_add_element(set, (int32)value);

also it will loop infinitely in hashset_in if supplied a malformed value,
for example: select '{1,2s}'::hashset;
I needed kill -9 to kill the process.


Re: Do we want a hashset type?

2023-06-10 Thread Andrew Dunstan


On 2023-06-09 Fr 07:56, Joel Jacobson wrote:

On Fri, Jun 9, 2023, at 13:33, jian he wrote:
> Hi, I am quite new to C.
> I have 3 questions about the following function.
> 1. 7691,4201, I assume they are just random prime ints?

Yes, 7691 and 4201 are likely chosen as random prime numbers.
In hash functions, prime numbers are often used to help in evenly 
distributing

the hash values across the range and reduce the chance of collisions.

> 2. I don't get the last return set, even the return type should be bool.

Thanks, you found a mistake!

The line

    return set;

is actually unreachable and should be removed.
The function will always return either true or false within the while 
loop and

never reach the final return statement.

I've attached a new incremental patch with this fix.

> 3. I don't understand 13 in hash = (hash + 13) % set->maxelements;

The value 13 is used for linear probing [1] in handling hash collisions.
Linear probing sequentially checks the next slot in the array when a 
collision
occurs. 13, being a small prime number not near a power of 2, helps in 
uniformly
distributing data and ensuring that all slots are probed, as it's 
relatively prime

to the hash table size.

Hm, I realise we actually don't ensure the hash table size and step 
size (13)

are coprime. I've fixed that in the attached patch as well.

[1] https://en.wikipedia.org/wiki/Linear_probing





Maybe you can post a full patch as well as incremental?

Stylistically I think you should reduce reliance on magic numbers (like 
13). Probably need some #define's?



cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Do we want a hashset type?

2023-06-09 Thread Joel Jacobson
On Fri, Jun 9, 2023, at 13:33, jian he wrote:
> Hi, I am quite new to C.
> I have 3 questions about the following function.
> 1. 7691,4201, I assume they are just random prime ints?

Yes, 7691 and 4201 are likely chosen as random prime numbers.
In hash functions, prime numbers are often used to help in evenly distributing
the hash values across the range and reduce the chance of collisions.

> 2. I don't get the last return set, even the return type should be bool.

Thanks, you found a mistake!

The line

return set;

is actually unreachable and should be removed.
The function will always return either true or false within the while loop and
never reach the final return statement.

I've attached a new incremental patch with this fix.

> 3. I don't understand 13 in hash = (hash + 13) % set->maxelements;

The value 13 is used for linear probing [1] in handling hash collisions.
Linear probing sequentially checks the next slot in the array when a collision
occurs. 13, being a small prime number not near a power of 2, helps in uniformly
distributing data and ensuring that all slots are probed, as it's relatively 
prime
to the hash table size.

Hm, I realise we actually don't ensure the hash table size and step size (13)
are coprime. I've fixed that in the attached patch as well.

[1] https://en.wikipedia.org/wiki/Linear_probing
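
For reference, one simple way to guarantee the coprimality is to never pick a
capacity that is a multiple of the step: since 13 is prime, any other capacity
is automatically coprime with it. A sketch (illustrative helper, not
necessarily what the attached patch does):

static int32
hashset_choose_capacity(int32 needed)
{
    /*
     * The probe sequence hash = (hash + 13) % capacity visits every slot
     * exactly once before repeating iff capacity and 13 are coprime; as 13
     * is prime, it suffices that capacity is not a multiple of 13.
     */
    while (needed % 13 == 0)
        needed++;

    return needed;
}

With the capacity chosen like this, the probing loop is guaranteed to
terminate as long as the table keeps at least one empty slot.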

/Joel

hashset-1.0.0-joel-0002.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-09 Thread jian he
On Fri, Jun 9, 2023 at 6:58 PM Joel Jacobson  wrote:

> On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
> > Would you be interested in helping with / working on some of that? I
> > don't have immediate need for this stuff, so it's not very high on my
> > TODO list.
>
> Sure, I'm willing to help!
>
> I've attached a patch that works on some of the items on your list,
> including some additions to the README.md.
>
> There were a bunch of places where `maxelements / 8` caused bugs,
> that had to be changed to do proper integer ceiling division:
>
> -   values = (int32 *) (set->data + set->maxelements / 8);
> +   values = (int32 *) (set->data + (set->maxelements + 7) / 8);
>
> Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV
> macros
> to the PostgreSQL source code in general, since it's easy to make this
> mistake,
> and quite verbose/error-prone to write it out manually everywhere.
> Such macros could simplify code in e.g. numeric.c.
>
> > There's a bunch of stuff that needs to be improved to make this properly
> > usable, like:
> >
> > 1) better hash table implementation
> TODO
>
> > 2) input/output functions
> I've attempted to implement these.
> I thought comma-separated values wrapped in curly braces felt like the
> most natural format,
> example:
> SELECT '{1,2,3}'::hashset;
>
> > 3) support for other types (now it only works with int32)
> TODO
>
> > 4) I wonder if this might be done as an array-like polymorphic type.
> That would be nice!
> I guess the work-around would be to store the actual value of non-int type
> in a lookup table, and then hash the int-based primary key in such a table.
>
> Do you think later implementing polymorphic type support would
> mean a more or less complete rewrite, or can we carry on with int32-support
> and add it later on?
>
> > 5) more efficient storage format, with versioning etc.
> TODO
>
> > 6) regression tests
> I've added some regression tests.
>
> > Right. IMHO the query language is a separate thing, you still need to
> > evaluate the query somehow - which is where hashset applies.
>
> Good point, I fully agree.
>
> /Joel




Hi, I am quite new to C.
I have 3 questions about the following function.
1. 7691,4201, I assume they are just random prime ints?
2. I don't get the last return set, even the return type should be bool.
3. I don't understand 13 in hash = (hash + 13) % set->maxelements;

static bool
hashset_contains_element(hashset_t *set, int32 value)
{
int byte;
int bit;
uint32  hash;
char   *bitmap;
int32  *values;

hash = ((uint32) value * 7691 + 4201) % set->maxelements;

bitmap = set->data;
values = (int32 *) (set->data + set->maxelements / 8);

while (true)
{
byte = (hash / 8);
bit = (hash % 8);

/* found an empty slot, value is not there */
if ((bitmap[byte] & (0x01 << bit)) == 0)
return false;

/* is it the same value? */
if (values[hash] == value)
return true;

/* move to the next element */
hash = (hash + 13) % set->maxelements;
}

return set;
}


Re: Do we want a hashset type?

2023-06-09 Thread Joel Jacobson
On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
> Would you be interested in helping with / working on some of that? I
> don't have immediate need for this stuff, so it's not very high on my
> TODO list.

Sure, I'm willing to help!

I've attached a patch that works on some of the items on your list,
including some additions to the README.md.

There were a bunch of places where `maxelements / 8` caused bugs,
that had to be changed to do proper integer ceiling division:

-   values = (int32 *) (set->data + set->maxelements / 8);
+   values = (int32 *) (set->data + (set->maxelements + 7) / 8);

Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
to the PostgreSQL source code in general, since it's easy to make this mistake,
and quite verbose/error-prone to write it out manually everywhere.
Such macros could simplify code in e.g. numeric.c.
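
For the record, such macros could look like this (a sketch assuming
non-negative operands; like most function-like macros they evaluate their
arguments more than once):

#define CEIL_DIV(a, b)      (((a) + (b) - 1) / (b))
#define FLOOR_DIV(a, b)     ((a) / (b))

/* e.g. the bitmap-size computation above becomes: */
values = (int32 *) (set->data + CEIL_DIV(set->maxelements, 8));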

> There's a bunch of stuff that needs to be improved to make this properly
> usable, like:
>
> 1) better hash table implementation
TODO

> 2) input/output functions
I've attempted to implement these.
I thought comma-separated values wrapped in curly braces felt like the most 
natural format,
example:
SELECT '{1,2,3}'::hashset;

> 3) support for other types (now it only works with int32)
TODO

> 4) I wonder if this might be done as an array-like polymorphic type.
That would be nice!
I guess the work-around would be to store the actual value of non-int type
in a lookup table, and then hash the int-based primary key in such a table.

Do you think later implementing polymorphic type support would
mean a more or less complete rewrite, or can we carry on with int32-support
and add it later on?

> 5) more efficient storage format, with versioning etc.
TODO

> 6) regression tests
I've added some regression tests.

> Right. IMHO the query language is a separate thing, you still need to
> evaluate the query somehow - which is where hashset applies.

Good point, I fully agree.

/Joel

hashset-1.0.0-joel-0001.patch
Description: Binary data


Re: Do we want a hashset type?

2023-06-08 Thread Tomas Vondra
On 6/8/23 11:41, Joel Jacobson wrote:
> On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
>> Interesting, considering how dumb the hash table implementation is.
> 
> That's promising.
> 

Yeah, not bad for sleep-deprived on-plane hacking ...

There's a bunch of stuff that needs to be improved to make this properly
usable, like:

1) better hash table implementation

2) input/output functions

3) support for other types (now it only works with int32)

4) I wonder if this might be done as an array-like polymorphic type.

5) more efficient storage format, with versioning etc.

6) regression tests

Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.

>>> I tested Neo4j and the results are surprising; it appears to be 
>>> significantly *slower*.
>>> However, I've probably misunderstood something, maybe I need to add some 
>>> index or something.
>>> Even so, it's interesting it's apparently not fast "by default".
>>>
>>
>> No idea how to fix that, but it's rather suspicious.
> 
> I've had a graph-db expert review my benchmark, and he suggested adding an 
> index:
> 
> CREATE INDEX FOR (n:User) ON (n.id)
> 
> This did improve the execution time for Neo4j a bit, down from 819 ms to 528 
> ms, but PostgreSQL 299 ms is still a win.
> 
> Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
> Note, in this benchmark, I only test the naive RECURSIVE CTE approach using 
> array_agg(DISTINCT ...).
> And I couldn't even test the most connected user with Neo4j, the query never 
> finished for some reason,
> so I had to test with a less connected user.
> 

Interesting. I'd have expected the graph db to be much faster.

> The graph expert also said that other more realistic graph use-cases might be 
> "multi-relational",
> and pointed me to a link: 
> https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
> No idea how such multi-relational datasets would affect the benchmarks.
> 

Not sure either, but I don't have ambition to improve everything at
once. If the hashset improves one practical use case, fine with me.

> I think we already have strong indicators that PostgreSQL with a hashset type 
> will, from a relative
> performance perspective, do just fine processing basic graph queries, even 
> with large datasets.
> 
> Then there will always be users who primarily write very different 
> graph queries all day long,
> who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher 
> or Gremlin.
> 

Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-08 Thread Joel Jacobson
On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
> Interesting, considering how dumb the hash table implementation is.

That's promising.

>> I tested Neo4j and the results are surprising; it appears to be 
>> significantly *slower*.
>> However, I've probably misunderstood something, maybe I need to add some 
>> index or something.
>> Even so, it's interesting it's apparently not fast "by default".
>> 
>
> No idea how to fix that, but it's rather suspicious.

I've had a graph-db expert review my benchmark, and he suggested adding an 
index:

CREATE INDEX FOR (n:User) ON (n.id)

This did improve the execution time for Neo4j a bit, down from 819 ms to 528 
ms, but PostgreSQL 299 ms is still a win.

Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
Note, in this benchmark, I only test the naive RECURSIVE CTE approach using 
array_agg(DISTINCT ...).
And I couldn't even test the most connected user with Neo4j, the query never 
finished for some reason,
so I had to test with a less connected user.

The graph expert also said that other more realistic graph use-cases might be 
"multi-relational",
and pointed me to a link: 
https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
No idea how such multi-relational datasets would affect the benchmarks.

I think we already have strong indicators that PostgreSQL with a hashset type 
will, from a relative
performance perspective, do just fine processing basic graph queries, even with 
large datasets.

Then there will always be users who primarily write very different 
graph queries all day long,
who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or 
Gremlin.

/Joel




Re: Do we want a hashset type?

2023-06-07 Thread Tomas Vondra



On 6/7/23 16:21, Joel Jacobson wrote:
> On Tue, Jun 6, 2023, at 13:20, Tomas Vondra wrote:
>> it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
>> on your system. There's a bunch of opportunities for more improvements,
>> as the hash table implementation is pretty naive/silly, the on-disk
>> format is wasteful and so on.
>>
>> But before spending more time on that, it'd be interesting to know what
>> would be a competitive timing. I mean, what would be "good enough"? What
>> timings are achievable with graph databases?
> 
> Your hashset is now almost exactly as fast as the corresponding roaringbitmap 
> query, +/- 1 ms on my machine.
> 

Interesting, considering how dumb the hash table implementation is.

> I tested Neo4j and the results are surprising; it appears to be significantly 
> *slower*.
> However, I've probably misunderstood something, maybe I need to add some 
> index or something.
> Even so, it's interesting it's apparently not fast "by default".
> 

No idea how to fix that, but it's rather suspicious.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-07 Thread Joel Jacobson
On Tue, Jun 6, 2023, at 13:20, Tomas Vondra wrote:
> it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
> on your system. There's a bunch of opportunities for more improvements,
> as the hash table implementation is pretty naive/silly, the on-disk
> format is wasteful and so on.
>
> But before spending more time on that, it'd be interesting to know what
> would be a competitive timing. I mean, what would be "good enough"? What
> timings are achievable with graph databases?

Your hashset is now almost exactly as fast as the corresponding roaringbitmap 
query, +/- 1 ms on my machine.

I tested Neo4j and the results are surprising; it appears to be significantly 
*slower*.
However, I've probably misunderstood something, maybe I need to add some index 
or something.
Even so, it's interesting it's apparently not fast "by default".

The query I tested:
MATCH (user:User {id: '5867'})-[:FRIENDS_WITH*3..3]->(fof)
RETURN COUNT(DISTINCT fof)

Here is how I loaded the data into it:

% pwd
/Users/joel/Library/Application Support/Neo4j 
Desktop/Application/relate-data/dbmss/dbms-3837aa22-c830-4dcf-8668-ef8e302263c7

% head import/*
==> import/friendships.csv <==
1,13,FRIENDS_WITH
1,11,FRIENDS_WITH
1,6,FRIENDS_WITH
1,3,FRIENDS_WITH
1,4,FRIENDS_WITH
1,5,FRIENDS_WITH
1,15,FRIENDS_WITH
1,14,FRIENDS_WITH
1,7,FRIENDS_WITH
1,8,FRIENDS_WITH

==> import/friendships_header.csv <==
:START_ID(User),:END_ID(User),:TYPE

==> import/users.csv <==
1,User
2,User
3,User
4,User
5,User
6,User
7,User
8,User
9,User
10,User

==> import/users_header.csv <==
id:ID(User),:LABEL

% ./bin/neo4j-admin database import full --overwrite-destination 
--nodes=User=import/users_header.csv,import/users.csv 
--relationships=FRIENDS_WIDTH=import/friendships_header.csv,import/friendships.csv
 neo4j

/Joel




Re: Do we want a hashset type?

2023-06-06 Thread Tomas Vondra


On 6/5/23 21:52, Joel Jacobson wrote:
> On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
>> Anyway, I hacked together a trivial set backed by an open addressing
>> hash table:
>>
>>   https://github.com/tvondra/hashset
>>
>> It's super-basic, providing just some bare minimum of functions, but
>> hopefully good enough for experiments.
>>
>> - hashset data type - hash table in varlena
>> - hashset_init - create new hashset
>> - hashset_add / hashset_contains - add value / check membership
>> - hashset_merge - merge two hashsets
>> - hashset_count - count elements
>> - hashset_to_array - return
>> - hashset(int) aggregate
>>
>> This allows rewriting the recursive query like this:
>>
>> WITH RECURSIVE friends_of_friends AS (
> ...
> 
> Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
> 
> I think if you just add one more hashset function, it will be a win against 
> roaringbitmap, which is 400 ms.
> 
> The missing function is an agg that takes hashset as input and returns 
> hashset, similar to roaringbitmap's rb_or_agg().
> 
> With such a function, we could add an adjacent nodes hashset column to the 
> `nodes` table, which would eliminate the need to scan the edges table for 
> graph traversal:
> 

I added a trivial version of such aggregate hashset(hashset), and if I
rewrite the CTE like this:

WITH RECURSIVE friends_of_friends AS (
SELECT
(select hashset(v) from (values (5867)) as s(v)) AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
hashset(edges.to_node) AS new_current
FROM
edges
WHERE
from_node =
ANY(hashset_to_array(friends_of_friends.current))
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
hashset_count(current)
FROM
friends_of_friends
WHERE
depth = 3;

it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
on your system. There's a bunch of opportunities for more improvements,
as the hash table implementation is pretty naive/silly, the on-disk
format is wasteful and so on.

But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-05 Thread Joel Jacobson
On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
> Anyway, I hacked together a trivial set backed by an open addressing
> hash table:
>
>   https://github.com/tvondra/hashset
>
> It's super-basic, providing just some bare minimum of functions, but
> hopefully good enough for experiments.
>
> - hashset data type - hash table in varlena
> - hashset_init - create new hashset
> - hashset_add / hashset_contains - add value / check membership
> - hashset_merge - merge two hashsets
> - hashset_count - count elements
> - hashset_to_array - return the elements as an array
> - hashset(int) aggregate
>
> This allows rewriting the recursive query like this:
>
> WITH RECURSIVE friends_of_friends AS (
...

Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).

I think if you just add one more hashset function, it will be a win against 
roaringbitmap, which is 400 ms.

The missing function is an agg that takes hashset as input and returns hashset, 
similar to roaringbitmap's rb_or_agg().
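
For illustration, the transition step of such an aggregate would essentially
be the existing hashset_merge applied repeatedly: walk the input set's bitmap
and re-add every occupied slot's value into the accumulated state. A sketch,
reusing the bitmap/values layout from the hashset_contains_element snippet
posted elsewhere in this thread, and glossing over the memory-context handling
a real aggregate transition function needs:

static hashset_t *
hashset_merge_sketch(hashset_t *state, hashset_t *input)
{
    char   *bitmap = input->data;
    int32  *values = (int32 *) (input->data + (input->maxelements + 7) / 8);
    int     i;

    for (i = 0; i < input->maxelements; i++)
    {
        /* slot i is occupied iff its bit is set in the bitmap */
        if (bitmap[i / 8] & (0x01 << (i % 8)))
            state = hashset_add_element(state, values[i]);
    }

    return state;
}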

With such a function, we could add an adjacent nodes hashset column to the 
`nodes` table, which would eliminate the need to scan the edges table for graph 
traversal:

We could then benchmark roaringbitmap against hashset querying the same table:

CREATE TABLE users AS
SELECT
from_node AS id,
rb_build_agg(to_node) AS friends_roaringbitmap,
hashset(to_node) AS friends_hashset
FROM edges
GROUP BY 1;

/Joel




Re: Do we want a hashset type?

2023-06-05 Thread Joel Jacobson
On Fri, Jun 2, 2023, at 10:01, Ants Aasma wrote:
> Have you looked at roaring bitmaps? There is a pg_roaringbitmap
> extension [1] already available that offers very fast unions,
> intersections and membership tests over integer sets. I used it to get
> some pretty impressive performance results for faceting search on
> large document sets. [2]

Many thanks for the tip!

New benchmark:

We already had since before:
> wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
> gunzip soc-pokec-relationships.txt.gz
> CREATE TABLE edges (from_node INT, to_node INT);
> \copy edges from soc-pokec-relationships.txt;
> ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);

I've created a new `users` table from the `edges` table,
with a new `friends` roaringbitmap column:

CREATE TABLE users AS
SELECT from_node AS id, rb_build_agg(to_node) AS friends FROM edges GROUP BY 1;
ALTER TABLE users ADD PRIMARY KEY (id);

Old query from before:

CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
coalesce(array_length(current, 1), 0) AS count_friends_at_depth_3
FROM
friends_of_friends
WHERE
depth = 3;
;

New roaringbitmap-based query using users instead:

CREATE OR REPLACE VIEW friends_of_friends_roaringbitmap AS
WITH RECURSIVE friends_of_friends AS
(
SELECT
friends,
1 AS depth
FROM users WHERE id = 5867
UNION ALL
SELECT
new_friends,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
rb_or_agg(users.friends) AS new_friends
FROM
users
WHERE
users.id = ANY(rb_to_array(friends_of_friends.friends))
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
rb_cardinality(friends) AS count_friends_at_depth_3
FROM
friends_of_friends
WHERE
depth = 3
;

Note, depth is 1 at the first level since we already have user 5867's friends in 
the friends column of the users table.

Maybe there is a better way to make use of the btree index on users.id,
than to convert the roaringbitmap to an array in order to use = ANY(...),
that is, this part: `users.id = ANY(rb_to_array(friends_of_friends.friends))`?

Or maybe there is some entirely different but equivalent way of writing the 
query
in a more efficient way?


SELECT * FROM friends_of_friends;
 count_friends_at_depth_3
--
  1035293
(1 row)

SELECT * FROM friends_of_friends_roaringbitmap;
 count_friends_at_depth_3
--
  1035293
(1 row)

EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
   
QUERY PLAN
-
 CTE Scan on friends_of_friends  (cost=5722.03..5722.73 rows=1 width=4) (actual 
time=2232.896..2233.289 rows=1 loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
 ->  Recursive Union  (cost=0.00..5722.03 rows=31 width=36) (actual 
time=0.003..2228.707 rows=4 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual 
time=0.001..0.001 rows=1 loops=1)
   ->  Subquery Scan on "*SELECT* 2"  (cost=190.59..572.17 rows=3 
width=36) (actual time=556.806..556.837 rows=1 loops=4)
 ->  Nested Loop  (cost=190.59..572.06 rows=3 width=36) (actual 
time=553.748..553.748 rows=1 loops=4)
   ->  WorkTable Scan on friends_of_friends 
friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual 
time=0.000..0.001 rows=1 loops=4)
 Filter: (depth < 3)
 Rows Removed by Filter: 0
   ->  Aggregate  (cost=190.59..190.60 rows=1 width=32) 
(actual time=737.427..737.427 rows=1 loops=3)
 ->  Sort  (cost=179.45..185.02 rows=2227 width=4) 
(actual time=577.192..649.812 rows=3486910 loops=3)
   Sort Key: edges.to_node
   Sort Method: quicksort  Memory: 393217kB
   ->  Index Only Scan using edges_pkey on 
edges  (cost=0.56..55.62 rows=2227 width=4) (actual time=0.027..225.609 
rows=3486910 loops=3)
 Index Cond: (from_node = ANY 
(friends_of_friends_1.current))
 Heap Fetches: 0
 Planning 

Re: Do we want a hashset type?

2023-06-04 Thread Tomas Vondra



On 6/1/23 09:14, Joel Jacobson wrote:
> On Thu, Jun 1, 2023, at 09:02, Joel Jacobson wrote:
>> Here is an example using a real anonymised social network.
> 
> I realised the "found" column is not necessary in this particular example,
> since we only care about the friends at the exact depth level. Simplified 
> query:
> 
> CREATE OR REPLACE VIEW friends_of_friends AS
> WITH RECURSIVE friends_of_friends AS (
> SELECT 
> ARRAY[5867::bigint] AS current,
> 0 AS depth
> UNION ALL
> SELECT
> new_current,
> friends_of_friends.depth + 1
> FROM
> friends_of_friends
> CROSS JOIN LATERAL (
> SELECT
> array_agg(DISTINCT edges.to_node) AS new_current
> FROM
> edges
> WHERE
> from_node = ANY(friends_of_friends.current)
> ) q
> WHERE
> friends_of_friends.depth < 3
> )
> SELECT
> depth,
> coalesce(array_length(current, 1), 0)
> FROM
> friends_of_friends
> WHERE
> depth = 3;
> ;
> 
> -- PostgreSQL 15.2:
> 
> EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
> 
> QUERY PLAN
> ---
>  CTE Scan on friends_of_friends  (cost=2687.88..2688.58 rows=1 width=8) 
> (actual time=2076.362..2076.454 rows=1 loops=1)
>Filter: (depth = 3)
>Rows Removed by Filter: 3
>CTE friends_of_friends
>  ->  Recursive Union  (cost=0.00..2687.88 rows=31 width=36) (actual 
> time=0.008..2075.073 rows=4 loops=1)
>->  Result  (cost=0.00..0.01 rows=1 width=36) (actual 
> time=0.002..0.002 rows=1 loops=1)
>->  Subquery Scan on "*SELECT* 2"  (cost=89.44..268.75 rows=3 
> width=36) (actual time=518.613..518.622 rows=1 loops=4)
>  ->  Nested Loop  (cost=89.44..268.64 rows=3 width=36) 
> (actual time=515.523..515.523 rows=1 loops=4)
>->  WorkTable Scan on friends_of_friends 
> friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual 
> time=0.001..0.001 rows=1 loops=4)
>  Filter: (depth < 3)
>  Rows Removed by Filter: 0
>->  Aggregate  (cost=89.44..89.45 rows=1 width=32) 
> (actual time=687.356..687.356 rows=1 loops=3)
>  ->  Index Only Scan using edges_pkey on edges  
> (cost=0.56..83.96 rows=2191 width=4) (actual time=0.139..290.996 rows=3486910 
> loops=3)
>Index Cond: (from_node = ANY 
> (friends_of_friends_1.current))
>Heap Fetches: 0
>  Planning Time: 0.557 ms
>  Execution Time: 2076.990 ms
> (17 rows)
> 

I've been thinking about this a bit on the way back from pgcon. Per CPU
profile it seems most of the job is actually spent on qsort, calculating
the array_agg(distinct) bit. I don't think we build

We could replace this part by a hash-based set aggregate, which would be
faster, but I doubt that may yield a massive improvement that'd change
the fundamental cost.

I forgot I did something like that a couple years back, implementing a
count_distinct() aggregate that was meant as a faster alternative to
count(distinct). And then I mostly abandoned it because the built-in
sort-based approach improved significantly - it was still slower in
some cases, but the gap got small enough.

Anyway, I hacked together a trivial set backed by an open addressing
hash table:

  https://github.com/tvondra/hashset

It's super-basic, providing just some bare minimum of functions, but
hopefully good enough for experiments.

- hashset data type - hash table in varlena
- hashset_init - create new hashset
- hashset_add / hashset_contains - add value / check membership
- hashset_merge - merge two hashsets
- hashset_count - count elements
- hashset_to_array - return the elements as an array
- hashset(int) aggregate

This allows rewriting the recursive query like this:

WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
0 AS depth
UNION ALL
SELECT
new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
hashset_to_array(hashset(edges.to_node)) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
coalesce(array_length(current, 1), 0)
FROM
friends_of_friends
WHERE
depth = 3;

On my laptop this cuts the timing roughly in half - which is nice, but as I
said I don't think it's a fundamental speedup. The aggregate can be
also parallelized, but I don't think that changes much.

Re: Do we want a hashset type?

2023-06-02 Thread Ants Aasma
On Wed, 31 May 2023 at 18:40, Joel Jacobson  wrote:
>
> On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
> > I think this needs a better explanation - what exactly is a hashset in
> > this context? Something like an array with a hash for faster lookup of
> > unique elements, or what?
>
> In this context, by "hashset" I am indeed referring to a data structure 
> similar
> to an array, where each element would be unique, and lookups would be faster
> than arrays for larger numbers of elements due to hash-based lookups.
>
> This data structure would store identifiers (IDs) of the nodes, not the 
> complete
> nodes themselves.

Have you looked at roaring bitmaps? There is a pg_roaringbitmap
extension [1] already available that offers very fast unions,
intersections and membership tests over integer sets. I used it to get
some pretty impressive performance results for faceting search on
large document sets. [2]

Depending on the graph fan-outs and operations it might make sense in
the graph use case. For small sets it's probably not too different
from the intarray extension in contrib. But for finding intersections
over large sets (i.e. a join) it's very-very fast. If the workload is
traversal heavy it might make sense to even cache materialized
transitive closures up to some depth (a friend-of-a-friend list).

Roaring bitmaps only support int4 right now, but that is easily
fixable. And they need a relatively dense ID space to get the
performance boost, which seems essential to the approach. The latter
issue means that it can't be easily dropped into GIN or B-tree indexes
for ctid storage.

[1] https://github.com/ChenHuajun/pg_roaringbitmap
[2] https://github.com/cybertec-postgresql/pgfaceting
-- 
Ants Aasma
www.cybertec-postgresql.com



