Re: WIP: BRIN multi-range indexes

2021-03-27 Thread Tomas Vondra
On 3/27/21 7:09 PM, Alvaro Herrera wrote:
> On 2021-Mar-26, Tomas Vondra wrote:
> 
>> Hi,
>>
>> I've pushed both the bloom and minmax-multi indexes today.
> 
> One thing I've been wondering all along is how useful are these
> BRIN-backed bloom indexes compared to contrib-supplied bloom indexes.
> My guess is that the BRIN implementation has some advantage, or you
> would not have worked so much on it.  But what is it?
> 

The contrib/bloom indexes are a completely different type of index. They
are not BRIN but a completely separate AM. The bloom filters are per-row
(so the index is larger than BRIN) and it's useful when you have a table
with many attributes, and need to test various combinations of them.


create table t (a int, b int, c int);

insert into t select 10 * random(), 10 * random(), 10 * random()
  from generate_series(1,100) s(i);

analyze t;

create index bloom_idx on t using bloom (a,b,c)
  with (length=80, col1=4, col2=4, col3=4);

create index brin_bloom_idx on t using
  brin (a int4_bloom_ops, b int4_bloom_ops, c int4_bloom_ops);


test=# \di+
                            List of relations
 Schema |      Name      | Table | Access Method | Size  | Description
--------+----------------+-------+---------------+-------+-------------
 public | bloom_idx      | t     | bloom         | 15 MB |
 public | brin_bloom_idx | t     | brin          | 88 kB |
(2 rows)


So it's a completely different kind of animal, perhaps closer to btree
than to BRIN. I'm sure there are cases where contrib/bloom works better
than brin/bloom, but also the other way around.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-27 Thread Alvaro Herrera
On 2021-Mar-26, Tomas Vondra wrote:

> Hi,
> 
> I've pushed both the bloom and minmax-multi indexes today.

One thing I've been wondering all along is how useful are these
BRIN-backed bloom indexes compared to contrib-supplied bloom indexes.
My guess is that the BRIN implementation has some advantage, or you
would not have worked so much on it.  But what is it?


Thanks

-- 
Álvaro Herrera   Valdivia, Chile




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tomas Vondra
On 3/26/21 3:45 PM, Tomas Vondra wrote:
> On 3/26/21 3:04 PM, Tomas Vondra wrote:
>> On 3/26/21 2:55 PM, Tom Lane wrote:
>>> Tomas Vondra  writes:
 I recall seeing "bus error" on sparc with other patches because of
 alignment issues, so I wonder if this is what's happening here.
>>>
>>> Try compiling with the address sanitizer enabled.  Per c.h,
>>>
>>>  * Testing can be done with "-fsanitize=alignment -fsanitize-trap=alignment"
>>>  * on clang, or "-fsanitize=alignment -fno-sanitize-recover=alignment" on 
>>> gcc.
>>>
>>
>> Bingo! I see the one failing x86_64 machine has -fsanitize=alignment.
>>
> 
> Yeah, the deserialization is borked. It assumes it can just point into
> the serialized representation of the summary, but that is "compacted" by
> ignoring alignment. Hence the failures. For me it fails only for timetz
> and interval types, but perhaps sparc is more sensitive, or maybe it's
> about 32/64 bits too (the only backtrace I'm aware of is from snapper,
> so assuming it's 32bits it'd make sense it fails on int8).
> 
> I have a feeling I made the same mistake in serialization of MCV stats
> some time ago, shame on me. I'll get this fixed.
> 

I've pushed a fix, ensuring proper alignment. There was a second issue
that'd affect big endian systems, so I fixed that too.
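
The general pattern of the fix (just a sketch with made-up names, not the
committed code - it assumes the usual c.h / string.h includes) is to memcpy
each value into properly aligned local storage instead of dereferencing a
pointer into the packed buffer:

/* illustration only - read a 64-bit value from an arbitrary offset in a
 * packed buffer without assuming any alignment */
static int64
read_packed_int64(const char *ptr)
{
    int64   val;

    memcpy(&val, ptr, sizeof(val));     /* fine on strict-alignment platforms */
    return val;
    /* whereas "return *((const int64 *) ptr);" may SIGBUS e.g. on sparc */
}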


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tomas Vondra
On 3/26/21 3:04 PM, Tomas Vondra wrote:
> On 3/26/21 2:55 PM, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> I recall seeing "bus error" on sparc with other patches because of
>>> alignment issues, so I wonder if this is what's happening here.
>>
>> Try compiling with the address sanitizer enabled.  Per c.h,
>>
>>  * Testing can be done with "-fsanitize=alignment -fsanitize-trap=alignment"
>>  * on clang, or "-fsanitize=alignment -fno-sanitize-recover=alignment" on 
>> gcc.
>>
> 
> Bingo! I see the one failing x86_64 machine has -fsanitize=alignment.
> 

Yeah, the deserialization is borked. It assumes it can just point into
the serialized representation of the summary, but that is "compacted" by
ignoring alignment. Hence the failures. For me it fails only for timetz
and interval types, but perhaps sparc is more sensitive, or maybe it's
about 32/64 bits too (the only backtrace I'm aware of is from snapper,
so assuming it's 32bits it'd make sense it fails on int8).

I have a feeling I made the same mistake in serialization of MCV stats
some time ago, shame on me. I'll get this fixed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tomas Vondra
On 3/26/21 2:55 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> I recall seeing "bus error" on sparc with other patches because of
>> alignment issues, so I wonder if this is what's happening here.
> 
> Try compiling with the address sanitizer enabled.  Per c.h,
> 
>  * Testing can be done with "-fsanitize=alignment -fsanitize-trap=alignment"
>  * on clang, or "-fsanitize=alignment -fno-sanitize-recover=alignment" on gcc.
> 

Bingo! I see the one failing x86_64 machine has -fsanitize=alignment.


thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tom Lane
Tomas Vondra  writes:
> I recall seeing "bus error" on sparc with other patches because of
> alignment issues, so I wonder if this is what's happening here.

Try compiling with the address sanitizer enabled.  Per c.h,

 * Testing can be done with "-fsanitize=alignment -fsanitize-trap=alignment"
 * on clang, or "-fsanitize=alignment -fno-sanitize-recover=alignment" on gcc.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tomas Vondra



On 3/26/21 2:08 PM, Tomas Vondra wrote:
> Hi,
> 
> I've pushed both the bloom and minmax-multi indexes today.
> 

Hmmm, I see a couple buildfarm animals are upset about the minmax-multi
patch. It does pass for me both on x86_64 and 32-bit ARM (rpi4), so I'm
unable to reproduce this. But most of the machines having issues seem to
be sparc variants, and snapper does this:


[New LWP 1471]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/sparc-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: parallel worker for PID 1350
 '.
Program terminated with signal 10, Bus error.
#0  0x007167fc in int82gt (fcinfo=0xffcc66d0) at int8.c:399
399 int64   val1 = PG_GETARG_INT64(0);
#0  0x007167fc in int82gt (fcinfo=0xffcc66d0) at int8.c:399
#1  0x00887a94 in FunctionCall2Coll (flinfo=0xb81a2c, collation=0,
arg1=12242916, arg2=0) at fmgr.c:1163


I recall seeing "bus error" on sparc with other patches because of
alignment issues, so I wonder if this is what's happening here.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Alvaro Herrera
On 2021-Mar-26, Tomas Vondra wrote:

> I've pushed both the bloom and minmax-multi indexes today.

Congratulations!  I think this reimplementation of the minmax opclass
infrastructure makes BRIN much more useful (read: actually usable).

-- 
Álvaro Herrera   Valdivia, Chile




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tomas Vondra
Hi,

I've pushed both the bloom and minmax-multi indexes today.

Based on the feedback and limitations described before I decided to keep
them in core (i.e. not move them to contrib), but it was a very useful
experiment as it uncovered a couple of issues - both in existing code, and
in the definition of the new opclasses.

After further thought I've concluded that the decision to ditch the old
signature (in a1c649d889) was probably wrong, so I've added the support
back, per what Alexander originally proposed.

While doing so, I've noticed that the NULL handling may be a few bricks
shy, because with (oi_regular_nulls == false) it should probably keep
passing the NULL scan keys to the consistent function. I'll look into
that and get it fixed.


Thanks everyone who helped with those patches!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-26 Thread Tomas Vondra



On 3/23/21 7:28 PM, Alvaro Herrera wrote:
> On 2021-Mar-23, Tomas Vondra wrote:
> 
>> FWIW there's yet another difference between the current BRIN opclass
>> definition, compared to what CREATE OPERATOR CLASS would do. Or more
>> precisely, how we'd define opfamily for the cross-type cases (integer,
>> float and timestamp cases).
>>
>> AFAICS we don't really need pg_amproc entries for the cross-type cases,
>> we just need the operators, so pg_amproc entries like
>>
>> { amprocfamily => 'brin/datetime_minmax_ops', amproclefttype =>
>> 'timestamptz',
>>   amprocrighttype => 'timestamp', amprocnum => '1',
>>   amproc => 'brin_minmax_opcinfo' },
>>
>> are unnecessary. The attached patch cleans that up, without breaking any
>> regression tests. Or is there a reason why we need those?
> 
> ... ooh ...
> 
> When you say "just the operators" you mean the pg_amop entries, right?
> 
> I think I agree -- cross-type amproc entries are unlikely to have any
> use.
> 

I've pushed a cleanup of the unnecessary pg_amproc entries for the
built-in opclasses, and I have omitted them from the definition of the
new opclasses. Buildfarm seems happy, so far.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-23 Thread Alvaro Herrera
On 2021-Mar-23, Tomas Vondra wrote:

> FWIW there's yet another difference between the current BRIN opclass
> definition, compared to what CREATE OPERATOR CLASS would do. Or more
> precisely, how we'd define opfamily for the cross-type cases (integer,
> float and timestamp cases).
> 
> AFAICS we don't really need pg_amproc entries for the cross-type cases,
> we just need the operators, so pg_amproc entries like
> 
> { amprocfamily => 'brin/datetime_minmax_ops', amproclefttype =>
> 'timestamptz',
>   amprocrighttype => 'timestamp', amprocnum => '1',
>   amproc => 'brin_minmax_opcinfo' },
> 
> are unnecessary. The attached patch cleans that up, without breaking any
> regression tests. Or is there a reason why we need those?

... ooh ...

When you say "just the operators" you mean the pg_amop entries, right?

I think I agree -- cross-type amproc entries are unlikely to have any
use.

-- 
Álvaro Herrera   Valdivia, Chile




Re: WIP: BRIN multi-range indexes

2021-03-23 Thread Tomas Vondra


On 3/23/21 2:36 PM, Alvaro Herrera wrote:
> On 2021-Mar-22, Tomas Vondra wrote:
> 
>> I don't know what's the right fix, but it seems like this patch has
>> nothing to do with it. If we want to move the opclasses into an
>> extension, we can comment out that one (cidr/inet) case for now.
> 
> I don't know what would be a good reason to define the opclasses in
> separate contrib extensions.  I think it's going to be a nuisance to
> users, so unless there is some strong argument for it, I'd suggest not
> to do it.  I found it being discussed here:
> https://www.postgresql.org/message-id/CA%2BTgmoajaQKBUx%3DvaTUFo6z80dsRzBw__Nu41Q4t06baZep3Ug%40mail.gmail.com
> but there weren't any strong arguments put forward.
> 

OK, thanks for the opinion. Yeah, you're right there were no strong
opinions either way, and I see both pros/cons for the contrib option.
Defining the opclasses using SQL is way more convenient and less
error-prone than doing that directly in .dat files, for example. But
there are some limitations too, so not sure it's worth it.

> It seems a good experiment to have done it, though, since we now know
> that there is a limitation in the existing SQL interface.  Maybe the fix
> to that problem is to add a new clause to CREATE/ALTER OPERATOR CLASS to
> let you define what goes into opckeytype.  However I don't think it's
> this patch's responsibility to fix that problem.
> 

Yeah, it was a good experiment. I think we still need to figure out what
to do about the opckeytype - what needs to be fixed, exactly.

FWIW there's yet another difference between the current BRIN opclass
definition, compared to what CREATE OPERATOR CLASS would do. Or more
precisely, how we'd define opfamily for the cross-type cases (integer,
float and timestamp cases).

AFAICS we don't really need pg_amproc entries for the cross-type cases,
we just need the operators, so pg_amproc entries like

{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype =>
'timestamptz',
  amprocrighttype => 'timestamp', amprocnum => '1',
  amproc => 'brin_minmax_opcinfo' },

are unnecessary. The attached patch cleans that up, without breaking any
regression tests. Or is there a reason why we need those?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 9192a66a88..50082a1801 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -843,28 +843,7 @@
   amproc => 'brin_minmax_consistent' },
 { amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
   amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int2', amprocnum => '1',
-  amproc => 'brin_minmax_opcinfo' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int2', amprocnum => '2',
-  amproc => 'brin_minmax_add_value' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int2', amprocnum => '3',
-  amproc => 'brin_minmax_consistent' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int4', amprocnum => '1',
-  amproc => 'brin_minmax_opcinfo' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int4', amprocnum => '2',
-  amproc => 'brin_minmax_add_value' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int4', amprocnum => '3',
-  amproc => 'brin_minmax_consistent' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
-  amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
+
 { amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1',
   amproc => 'brin_minmax_opcinfo' },
@@ -876,28 +855,7 @@
   amproc => 'brin_minmax_consistent' },
 { amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
-  amprocrighttype => 'int8', amprocnum => '1',
-  amproc => 'brin_minmax_opcinfo' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
-  amprocrighttype => 'int8', amprocnum => '2',
-  amproc => 'brin_minmax_add_value' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
-  amprocrighttype => 'int8', amprocnum => '3',
-  amproc => 'brin_minmax_consistent' },
-{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
-  amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
-{ 

Re: WIP: BRIN multi-range indexes

2021-03-23 Thread Alvaro Herrera
On 2021-Mar-22, Tomas Vondra wrote:

> I don't know what's the right fix, but it seems like this patch has
> nothing to do with it. If we want to move the opclasses into an
> extension, we can comment out that one (cidr/inet) case for now.

I don't know what would be a good reason to define the opclasses in
separate contrib extensions.  I think it's going to be a nuisance to
users, so unless there is some strong argument for it, I'd suggest not
to do it.  I found it being discussed here:
https://www.postgresql.org/message-id/CA%2BTgmoajaQKBUx%3DvaTUFo6z80dsRzBw__Nu41Q4t06baZep3Ug%40mail.gmail.com
but there weren't any strong arguments put forward.

It seems a good experiment to have done it, though, since we now know
that there is a limitation in the existing SQL interface.  Maybe the fix
to that problem is to add a new clause to CREATE/ALTER OPERATOR CLASS to
let you define what goes into opckeytype.  However I don't think it's
this patch's responsibility to fix that problem.

-- 
Álvaro Herrera   Valdivia, Chile
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)




Re: WIP: BRIN multi-range indexes

2021-03-22 Thread Tomas Vondra
On 3/22/21 6:27 AM, Tomas Vondra wrote:
> 
> ...
>
> All the regression tests work fine, with the exception of minmax-multi
> on a CIDR column. I don't know why, but the CREATE INDEX then fails like
> this:
> 
>   ERROR:  missing operator 1(650,650) in opfamily 16463
> 
> 650 is cidr, so this essentially says there's no (cidr < cidr) operator.
> With the opclasses defined in .dat files this however worked, so I
> suppose it's related to the missing operator families.
> 

Turns out this is likely a bug elsewhere. After a couple fixes to the
extension SQL script, the only real difference in catalog contents
(compared to e.g. the built-in BRIN minmax inet opclass) is that the
built-in one has opckeytype set to 869 (i.e. inet), while the one
created from extension has it set to 0.

The opclasses for inet (OID 869) look like this:

test=# select oid, opcname, opcfamily, opcintype, opckeytype from
pg_opclass where opcintype = 869 order by oid;
  oid  |        opcname         | opcfamily | opcintype | opckeytype
-------+------------------------+-----------+-----------+------------
 10009 | cidr_ops               |      1974 |       869 |          0
 10010 | cidr_ops               |      1975 |       869 |          0
 10015 | inet_ops               |      1974 |       869 |          0
 10016 | inet_ops               |      1975 |       869 |          0
 10017 | inet_ops               |      3550 |       869 |          0
 10018 | inet_ops               |      3794 |       869 |          0
 10105 | inet_minmax_ops        |      4075 |       869 |        869
 10106 | inet_inclusion_ops     |      4102 |       869 |        869
 16451 | inet_bloom_ops         |     16450 |       869 |          0
 17398 | inet_minmax_multi_ops  |     17397 |       869 |          0
(10 rows)

The only difference between the two minmax variants is the opckeytype,
and if I update it to 869 for inet_minmax_multi_ops, it starts working.
Likewise, if I set it to 0 for inet_minmax_ops, it breaks the same way.

Turns out, this is impossible to set from CREATE OPERATOR CLASS, because
opclasscmds.c does this:

/* Just drop the spec if same as column datatype */
if (storageoid == typeoid && false)
storageoid = InvalidOid;

but index.c looks only at opclassTup->opckeytype. The built-in opclasses
don't have this issue, because they put the data into the catalog
directly, without going through this code.
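
For context, the relevant bit in index.c is ConstructTupleDescriptor(),
which (paraphrasing from memory, so treat this as a rough sketch) picks the
index attribute's type roughly like this:

    if (OidIsValid(opclassTup->opckeytype))
        keyType = opclassTup->opckeytype;   /* opclass storage type wins */
    else
        keyType = amroutine->amkeytype;     /* else fall back to the AM */

and since BRIN's amkeytype is InvalidOid, the attribute keeps the cidr type
when opckeytype ends up 0 - which is presumably why the (cidr, cidr)
operator lookup in the opfamily then fails.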

I don't know what's the right fix, but it seems like this patch has
nothing to do with it. If we want to move the opclasses into an
extension, we can comment out that one (cidr/inet) case for now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-17 Thread John Naylor
On Wed, Mar 17, 2021 at 3:16 PM Tomas Vondra 
wrote:

> Ummm, no attachment ;-)

Oops, here it is.

--
John Naylor
EDB: http://www.enterprisedb.com


jcn-costing-test.sql
Description: Binary data


Re: WIP: BRIN multi-range indexes

2021-03-17 Thread Tomas Vondra



On 3/17/21 7:59 PM, John Naylor wrote:
> On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra
> <tomas.von...@enterprisedb.com>
> wrote:
>>
>> Hi,
>>
>> Here is an updated version of the patch series.
>>
>> It fixes the assert failure (just remove the multiplication from it) and
>> adds a simple regression test that exercises this.
>>
>> Based on the discussion so far, I've decided to keep just the new
>> signature of the consistent function. That's a bit simpler than having
>> to support both 3 and 4 parameters, and it would not deal with the NULL
>> changes anyway (mostly harmless code duplication, but still). I've also
>> realized the API documentation in SGML needs updating.
>>
>> At this point, I think 0001-0006 parts are mostly committable.
> 
> I think so. I've marked it RFC for these six.
> 
>> As for the remaining two parts, the one dealing with correlation may not
>> be strictly necessary, but not having it (or something like it) may
>> result in not picking the BRIN index in some cases.
>>
>> But maybe it's not a major problem. I tried the example from [1] but it
>> no longer triggers the issue for me - I'm not entirely sure why, but the
>> costing changed for some reason. It used to look like this:
> 
>> ...
> 
>> The index scan cost is about the same, but the heap scan is about half
>> the cost. The row estimates are a bit different, perhaps that's related.
>> The seqscan cost (169248) and duration (~500ms) is still about the same,
>> but so now we still pick the bitmap heap scan.
> 
> With only 0001-0006, I get a parallel bitmap scan in both cases:
> 
>                                                             QUERY PLAN
> ---
>  Gather  (cost=6542.42..52779.35 rows=10 width=4) (actual
> time=3.283..22.308 rows=10 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    ->  Parallel Bitmap Heap Scan on t0  (cost=5542.42..51778.35 rows=4
> width=4) (actual time=3.434..17.257 rows=3 loops=3)
>          Recheck Cond: (a = 1)
>          Rows Removed by Index Recheck: 83165
>          Heap Blocks: lossy=421
>          ->  Bitmap Index Scan on t0_a_idx  (cost=0.00..5542.42
> rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1)
>                Index Cond: (a = 1)
>  Planning Time: 0.088 ms
>  Execution Time: 22.341 ms
> (11 rows)
> 
>> Not sure we can rely on
>> this, though. Would be quite sad if we had new improved opclasses but
>> the planner often decided not to use them.
> 
> Yeah, I'm not sure what to do here. It might be okay to leave it out now
> and study it further as a PG14 open item or PG15 improvement.
> 

Yeah, that's definitely an option.

>> I had an idea of tweaking choose_bitmap_and to consider both the cost
>> and selectivity (similarly to how add_path considers statup/total cost),
>> and that did indeed resolve this particular case. This is what the 0008
>> part does.
>>
>> But it may also have negative consequence, as demonstrated by the
>> reproducer2.sql script. So maybe the logic would need to be more
>> complicated. Or maybe there's no reliable solution, considering how
>> tricky/unreliable BRIN estimates are.
> 
> Ok, so with 0008 in reproducer2, it chooses the more selective path,
> even though it has a higher total cost:
> 
> 0001-0007:
> 
>                                                      QUERY PLAN
> -
>  Bitmap Heap Scan on t2  (cost=29.03..24032.28 rows=1 width=8) (actual
> time=0.498..1.755 rows=1 loops=1)
>    Recheck Cond: (a = 1000)
>    Rows Removed by Index Recheck: 7167
>    Heap Blocks: lossy=128
>    ->  Bitmap Index Scan on idx_2  (cost=0.00..29.03 rows=7163 width=0)
> (actual time=0.278..0.278 rows=1280 loops=1)
>          Index Cond: (a = 1000)
>  Planning Time: 0.148 ms
>  Execution Time: 1.774 ms
> (8 rows)
> 
> DROP INDEX
>                                                     QUERY PLAN
> ---
>  Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
> time=9.695..9.708 rows=1 loops=1)
>    Recheck Cond: (a = 1000)
>    Rows Removed by Index Recheck: 223
>    Heap Blocks: lossy=4
>    ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0)
> (actual time=9.675..9.675 rows=40 loops=1)
>          Index Cond: (a = 1000)
>  Planning Time: 0.110 ms
>  Execution Time: 9.727 ms
> (8 rows)
> 
> and with 0008:
> 
>                                                     QUERY PLAN
> ---
>  Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
> time=8.540..8.577 rows=1 loops=1)
>    Recheck Cond: (a = 1000)
>    Rows Removed by 

Re: WIP: BRIN multi-range indexes

2021-03-17 Thread John Naylor
On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra 
wrote:
>
> Hi,
>
> Here is an updated version of the patch series.
>
> It fixes the assert failure (just remove the multiplication from it) and
> adds a simple regression test that exercises this.
>
> Based on the discussion so far, I've decided to keep just the new
> signature of the consistent function. That's a bit simpler than having
> to support both 3 and 4 parameters, and it would not deal with the NULL
> changes anyway (mostly harmless code duplication, but still). I've also
> realized the API documentation in SGML needs updating.
>
> At this point, I think 0001-0006 parts are mostly committable.

I think so. I've marked it RFC for these six.

> As for the remaining two parts, the one dealing with correlation may not
> be strictly necessary, but not having it (or something like it) may
> result in not picking the BRIN index in some cases.
>
> But maybe it's not a major problem. I tried the example from [1] but it
> no longer triggers the issue for me - I'm not entirely sure why, but the
> costing changed for some reason. It used to look like this:

> ...

> The index scan cost is about the same, but the heap scan is about half
> the cost. The row estimates are a bit different, perhaps that's related.
> The seqscan cost (169248) and duration (~500ms) is still about the same,
> but so now we still pick the bitmap heap scan.

With only 0001-0006, I get a parallel bitmap scan in both cases:

QUERY PLAN
---
 Gather  (cost=6542.42..52779.35 rows=10 width=4) (actual
time=3.283..22.308 rows=10 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on t0  (cost=5542.42..51778.35 rows=4
width=4) (actual time=3.434..17.257 rows=3 loops=3)
 Recheck Cond: (a = 1)
 Rows Removed by Index Recheck: 83165
 Heap Blocks: lossy=421
 ->  Bitmap Index Scan on t0_a_idx  (cost=0.00..5542.42 rows=381682
width=0) (actual time=2.996..2.996 rows=11040 loops=1)
   Index Cond: (a = 1)
 Planning Time: 0.088 ms
 Execution Time: 22.341 ms
(11 rows)

> Not sure we can rely on
> this, though. Would be quite sad if we had new improved opclasses but
> the planner often decided not to use them.

Yeah, I'm not sure what to do here. It might be okay to leave it out now
and study it further as a PG14 open item or PG15 improvement.

> I had an idea of tweaking choose_bitmap_and to consider both the cost
> and selectivity (similarly to how add_path considers statup/total cost),
> and that did indeed resolve this particular case. This is what the 0008
> part does.
>
> But it may also have negative consequence, as demonstrated by the
> reproducer2.sql script. So maybe the logic would need to be more
> complicated. Or maybe there's no reliable solution, considering how
> tricky/unreliable BRIN estimates are.

Ok, so with 0008 in reproducer2, it chooses the more selective path, even
though it has a higher total cost:

0001-0007:

 QUERY PLAN
-
 Bitmap Heap Scan on t2  (cost=29.03..24032.28 rows=1 width=8) (actual
time=0.498..1.755 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 7167
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on idx_2  (cost=0.00..29.03 rows=7163 width=0)
(actual time=0.278..0.278 rows=1280 loops=1)
 Index Cond: (a = 1000)
 Planning Time: 0.148 ms
 Execution Time: 1.774 ms
(8 rows)

DROP INDEX
QUERY PLAN
---
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
time=9.695..9.708 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0)
(actual time=9.675..9.675 rows=40 loops=1)
 Index Cond: (a = 1000)
 Planning Time: 0.110 ms
 Execution Time: 9.727 ms
(8 rows)

and with 0008:

QUERY PLAN
---
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual
time=8.540..8.577 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0)
(actual time=8.507..8.508 rows=40 loops=1)
 Index Cond: (a = 1000)
 Planning Time: 0.175 ms
 Execution Time: 8.601 ms
(8 rows)

DROP INDEX
   

Re: WIP: BRIN multi-range indexes

2021-03-09 Thread Tomas Vondra



On 3/9/21 9:51 PM, John Naylor wrote:
> 
> On Sun, Mar 7, 2021 at 8:53 PM Tomas Vondra
> <tomas.von...@enterprisedb.com>
> wrote:
> [v20210308b]
> 
> I managed to trap an assertion that somehow doesn't happen during the
> regression tests. The callers of fill_expanded_ranges() do math like this:
> 
> /* both ranges and points are expanded into a separate element */
> neranges = ranges->nranges + ranges->nvalues;
> 
> but inside fill_expanded_ranges() we have this assertion:
> 
> /* Check that the output array has the right size. */
> Assert(neranges == (2 * ranges->nranges + ranges->nvalues));
> 

Yeah, that assert is bogus. It's calculating the number of boundary
values (and ranges have two), but in ExpandedRanges we still represent
each range as a single element. So the Assert should be just

Assert(neranges == (ranges->nranges + ranges->nvalues));

But maybe it's just an overkill and we don't really need it.

> In the regression test data, a debugging elog() shows that nranges is
> most often zero, so in that case, the math happens to be right either
> way. I can reliably get nranges above zero by running
> 
> update brintest_multi set int8col = int8col - 1;
> 
> a few times, at which point I get the crash.
>

Hmm, so maybe we should do something like this in regression tests too?
It's not good that we don't trigger the "nranges > 0" case at all.

> 
> Aside from that, the new changes look good. Just a couple small things:
> 
> +    allowed parameters.  Only the bloom operator class
> +    allows specifying parameters:
> 
> minmax-multi aren't mentioned here, but are mentioned further down.
> 

I forgot to add this bit. Will fix.

> + * Addresses from different families are consider to be in maximum
> 
> (comment above brin_minmax_multi_distance_inet)
> s/consider/considered/
> 

Will fix.

>> > 2) moving minmax/inclusion changes from 0002 to a separate patch 0003
>> >
>> > I think we should either ditch the 0003 (i.e. keep the existing
>> > opclasses unchanged) or commit 0003 (in which case I'd vote to just stop
>> > supporting the old signature of the consistent function).
>> >
>>
>> Still not sure what do to about this. I'm leaning towards keeping 0003
>> and just removing the "old" signature entirely, to keep the API cleaner.
>> It might cause some breakage in out-of-core BRIN opclasses, but that
>> seems like a reasonable price. Moreover, the opclasses may need some
>> updating anyway, because of the changes in handling NULL scan keys (0004
>> moves that from the opclass to the bringetbitmap function).
> 
> Keeping 0003 seems reasonable, given the above.
> 

And do you agree with removing the old signature entirely? That might
break some out-of-core opclasses, but we're not aware of any, and
they'll be broken anyway. Seems fine to me.

>> > The remaining part that didn't get much review is the very last patch,
>> > adding an option to ignore correlation for some BRIN opclases. This is
>> > needed as the regular BRIN costing is quite sensitive to correlation,
>> > and the cost gets way too high for poorly correlated data, making it
>> > unlikely the index will be used. But handling such data sets efficiently
>> > is the main point of those new opclasses. Any opinions on this?
>> >
>>
>> Not sure about this.
> 
> I hadn't given it much thought (nor tested), but I just took a look at
> brincostestimate(). If the table is badly correlated, I'm thinking the
> number of ranges that need to be scanned will increase regardless. Maybe
> rather than ignoring correlation, we could clamp it or otherwise tweak
> it. Not sure about the details, though, that would require some testing.
> 

Well, maybe. In any case we need to do something about this, otherwise
the new opclasses won't be used even in cases where it's perfectly OK.
And it needs to be opclass-dependent, in some way.

I'm pretty sure even the simple examples you've used to test
minmax-multi (with updating a fraction of tuples to low/high value)
would be affected by this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-09 Thread John Naylor
On Sun, Mar 7, 2021 at 8:53 PM Tomas Vondra 
wrote:
[v20210308b]

I managed to trap an assertion that somehow doesn't happen during the
regression tests. The callers of fill_expanded_ranges() do math like this:

/* both ranges and points are expanded into a separate element */
neranges = ranges->nranges + ranges->nvalues;

but inside fill_expanded_ranges() we have this assertion:

/* Check that the output array has the right size. */
Assert(neranges == (2 * ranges->nranges + ranges->nvalues));

In the regression test data, a debugging elog() shows that nranges is most
often zero, so in that case, the math happens to be right either way. I can
reliably get nranges above zero by running

update brintest_multi set int8col = int8col - 1;

a few times, at which point I get the crash.


Aside from that, the new changes look good. Just a couple small things:

+allowed parameters.  Only the bloom operator class
+allows specifying parameters:

minmax-multi aren't mentioned here, but are mentioned further down.

+ * Addresses from different families are consider to be in maximum

(comment above brin_minmax_multi_distance_inet)
s/consider/considered/

> > 2) moving minmax/inclusion changes from 0002 to a separate patch 0003
> >
> > I think we should either ditch the 0003 (i.e. keep the existing
> > opclasses unchanged) or commit 0003 (in which case I'd vote to just stop
> > supporting the old signature of the consistent function).
> >
>
> Still not sure what do to about this. I'm leaning towards keeping 0003
> and just removing the "old" signature entirely, to keep the API cleaner.
> It might cause some breakage in out-of-core BRIN opclasses, but that
> seems like a reasonable price. Moreover, the opclasses may need some
> updating anyway, because of the changes in handling NULL scan keys (0004
> moves that from the opclass to the bringetbitmap function).

Keeping 0003 seems reasonable, given the above.

> > The remaining part that didn't get much review is the very last patch,
> > adding an option to ignore correlation for some BRIN opclases. This is
> > needed as the regular BRIN costing is quite sensitive to correlation,
> > and the cost gets way too high for poorly correlated data, making it
> > unlikely the index will be used. But handling such data sets efficiently
> > is the main point of those new opclasses. Any opinions on this?
> >
>
> Not sure about this.

I hadn't given it much thought (nor tested), but I just took a look at
brincostestimate(). If the table is badly correlated, I'm thinking the
number of ranges that need to be scanned will increase regardless. Maybe
rather than ignoring correlation, we could clamp it or otherwise tweak it.
Not sure about the details, though, that would require some testing.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-03-04 Thread Tomas Vondra
On 3/4/21 5:30 PM, John Naylor wrote:
> 
> On Tue, Mar 2, 2021 at 7:32 PM Tomas Vondra
> <tomas.von...@enterprisedb.com>
> wrote:
>> Ummm, in brin_minmax_multi_distance_timetz()? I don't think timetz can
>> be infinite, no? I think brin_minmax_multi_distance_date you meant?
> 
> In timestamp.c there are plenty of checks for TIMESTAMP_NOT_FINITE
> for TimestampTz types and I assumed that was applicable here.
> 

I don't think so. The NOT_FINITE macros are defined only for timestamps
and dates, not for TimeTzADT. So I think the current code is correct.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-04 Thread John Naylor
On Tue, Mar 2, 2021 at 7:32 PM Tomas Vondra 
wrote:
> Ummm, in brin_minmax_multi_distance_timetz()? I don't think timetz can
> be infinite, no? I think brin_minmax_multi_distance_date you meant?

In timestamp.c there are plenty of checks for TIMESTAMP_NOT_FINITE
for TimestampTz types and I assumed that was applicable here.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-03-02 Thread Tomas Vondra



On 3/3/21 1:17 AM, Alvaro Herrera wrote:
> On 2021-Mar-03, Tomas Vondra wrote:
> 
>> That's kinda my point - I agree the size of the patch is not the
>> primary concern, but it makes the minmax/inclusion code a bit more
>> complicated (because they now have to loop over the keys), with
>> very little benefit (there might be some speedup, but IMO it's
>> rather negligible).
> 
> Yeah, OK.
> 
>> Alternatively we could simply remove the code supporting the old
>> API with "consistent" functions without the additional parameter.
>> But the idea was to seamlessly support existing opclasses / not
>> breaking them unnecessarily (I know we don't guarantee that in
>> major upgrades, but as they may not benefit from this, why break
>> them?). It'd simplify the code in brin.c a little bit, but the
>> opclasses a bit more complex.
> 
> Well, I doubt any opclass-support functions exist outside of core. Or
> am I just outdated and we do know of some?
> 

I'm not aware of any. This was proposed by Nikita Glukhov:

https://www.postgresql.org/message-id/49cb668f-d6f9-3493-681d-7d40b715ef64%40postgrespro.ru

Alexander Korotkov seemed to agree with Nikita, but I don't recall any
references to actual existing opclasses.

Then again - the amount of code to support two signatures is fairly
small. If we decide to get rid of it, then fine - but the new complexity
in minmax/inclusion likely exceeds that.

Which is why I was thinking about still supporting both signatures, but
reverting the changes in brin_minmax and brin_inclusion.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-02 Thread Alvaro Herrera
On 2021-Mar-03, Tomas Vondra wrote:

> That's kinda my point - I agree the size of the patch is not the primary
> concern, but it makes the minmax/inclusion code a bit more complicated
> (because they now have to loop over the keys), with very little benefit
> (there might be some speedup, but IMO it's rather negligible).

Yeah, OK.

> Alternatively we could simply remove the code supporting the old API
> with "consistent" functions without the additional parameter. But the
> idea was to seamlessly support existing opclasses / not breaking them
> unnecessarily (I know we don't guarantee that in major upgrades, but as
> they may not benefit from this, why break them?). It'd simplify the code
> in brin.c a little bit, but the opclasses a bit more complex.

Well, I doubt any opclass-support functions exist outside of core.
Or am I just outdated and we do know of some?

-- 
Álvaro Herrera   Valdivia, Chile
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)




Re: WIP: BRIN multi-range indexes

2021-03-02 Thread Tomas Vondra
On 3/3/21 12:44 AM, Alvaro Herrera wrote:
> On 2021-Mar-03, Tomas Vondra wrote:
> 
>> 1) The 0001 patch allows passing of all scan keys to BRIN opclasses,
>> which is needed for the minmax-multi to work. But it also modifies the
>> existing opclasses (minmax and inclusion) to do this - but for those
>> opclasses it does not make much difference, I think. Initially this was
>> done because the patch did this for all opclasses, but then we added the
>> detection based on number of parameters. So I wonder if we should just
>> remove this, to make the patch a bit smaller. It'll also test the other
>> code path (for opclasses without the last parameter).
> 
> I think it makes sense to just do them all in one pass.  I think trying
> to keep the old way working just because it's how it was working
> previously does not have much benefit.  I don't think we care about the
> *patch* being small as much as the resulting *code* being as simple as
> possible (but no simpler).
> 

That's kinda my point - I agree the size of the patch is not the primary
concern, but it makes the minmax/inclusion code a bit more complicated
(because they now have to loop over the keys), with very little benefit
(there might be some speedup, but IMO it's rather negligible).

Alternatively we could simply remove the code supporting the old API
with "consistent" functions without the additional parameter. But the
idea was to seamlessly support existing opclasses / not breaking them
unnecessarily (I know we don't guarantee that in major upgrades, but as
they may not benefit from this, why break them?). It'd simplify the code
in brin.c a little bit, but the opclasses a bit more complex.

> 
>> 2) This needs bsearch_arg, but we only have that defined/used in
>> extended_statistics_internal.h - it seems silly to include that here, as
>> this has nothing to do with extended stats, so I simply added another
>> prototype (which gets linked correctly). But, I suppose a better way
>> would be to define our "portable" variant pg_bsearch_arg, next to
>> pg_qsort etc.
> 
> Yeah, I think it makes sense to move bsearch_arg to a place where it's
> more generally accesible (src/port/bsearch_arg.c I suppose), and make
> both places use that.
> 

OK, will do.
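
For reference, the signature being moved (copied from memory - it simply
mirrors bsearch(), with an extra pass-through argument for the comparator):

extern void *bsearch_arg(const void *key, const void *base,
                         size_t nmemb, size_t size,
                         int (*compar) (const void *, const void *, void *),
                         void *arg);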


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-03-02 Thread Alvaro Herrera
On 2021-Mar-03, Tomas Vondra wrote:

> 1) The 0001 patch allows passing of all scan keys to BRIN opclasses,
> which is needed for the minmax-multi to work. But it also modifies the
> existing opclasses (minmax and inclusion) to do this - but for those
> opclasses it does not make much difference, I think. Initially this was
> done because the patch did this for all opclasses, but then we added the
> detection based on number of parameters. So I wonder if we should just
> remove this, to make the patch a bit smaller. It'll also test the other
> code path (for opclasses without the last parameter).

I think it makes sense to just do them all in one pass.  I think trying
to keep the old way working just because it's how it was working
previously does not have much benefit.  I don't think we care about the
*patch* being small as much as the resulting *code* being as simple as
possible (but no simpler).


> 2) This needs bsearch_arg, but we only have that defined/used in
> extended_statistics_internal.h - it seems silly to include that here, as
> this has nothing to do with extended stats, so I simply added another
> prototype (which gets linked correctly). But, I suppose a better way
> would be to define our "portable" variant pg_bsearch_arg, next to
> pg_qsort etc.

Yeah, I think it makes sense to move bsearch_arg to a place where it's
more generally accesible (src/port/bsearch_arg.c I suppose), and make
both places use that.

-- 
Álvaro Herrera   Valdivia, Chile




Re: WIP: BRIN multi-range indexes

2021-02-22 Thread John Naylor
On Mon, Feb 15, 2021 at 10:37 AM Tomas Vondra 
wrote:
> [v20210215]

Hi Tomas,

This time I only looked at the cumulative changes in the multiminmax
opclass, since I'm pretty sure all the bloom issues have been addressed.

 * XXX CombineRange name seems a bit weird. Consider renaming, perhaps to
 * something ExpandedRange or so.

The time to do it is now, if you like, or remove the XXX. I agree with the
comment, FWIW.

In has_matching_range():

 * So we know it's in the general min/max, the question is whether it
 * falls in one of the ranges or gaps. We'll use a binary search on
 * the ranges.
 *
 * it's in the general range, but is it actually covered by any
 * of the ranges? Repeat the check for each range.
 *
 * XXX We simply walk the ranges sequentially, but maybe we could
 * further leverage the ordering and non-overlap and use bsearch to
 * speed this up a bit.

It looks to me like you already implemented binary search and the last part
is out of date, or am I missing something?

Same in range_contains_value():

 * XXX This might benefit from the fact that both the intervals and exact
 * values are sorted - we might do bsearch or something. Currently that
 * does not make much difference (there are only ~32 intervals), but if
 * this gets increased and/or the comparator function is more expensive,
 * it might be a huge win.

Below that it does binary search if the number of elements > 16.

In merge_combine_ranges():

There are a couple assert-related TODOs.

In brin_minmax_multi_distance_timetz():

 * XXX Does this need to consider the time zones?

I wouldn't think so, because the stored values are in UTC -- the time zone
calculation only happens during storage and retrieval, and they've already
been stored, IIUC.

Also, I think you need to copy this part from
brin_minmax_multi_distance_timestamp() here as well:

if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
PG_RETURN_FLOAT8(0);


At this point, I think it's pretty close to commit-ready. I thought maybe I
would create a small index with every type, and make sure it looks sane in
page_inspect, but that's about it.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-02-09 Thread Tomas Vondra




On 2/9/21 3:46 PM, John Naylor wrote:
On Wed, Feb 3, 2021 at 7:54 PM Tomas Vondra 
<tomas.von...@enterprisedb.com>
wrote:

 >
 > [v-20210203]

Hi Tomas,

I have some random comments from reading the patch, but haven't gone 
into detail in the newer aspects. I'll do so in the near future.


The cfbot seems to crash on this patch during make check, but it doesn't 
crash for me. I'm not even sure what date that cfbot status is from.




Yeah, I noticed that too, and I'm investigating.

I tried running the regression tests on a 32-bit machine (rpi4), which 
sometimes uncovers strange failures, and that indeed fails. There are 
two or three bugs.


Firstly, the allocation optimization patch does this:

MAXALIGN(sizeof(ScanKey) * scan->numberOfKeys * natts)

instead of

MAXALIGN(sizeof(ScanKey) * scan->numberOfKeys) * natts

and that sometimes produces the wrong result, triggering an assert.
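
To make the difference concrete (hypothetical numbers: on a 32-bit ARM
build sizeof(ScanKey) is 4 - it's a pointer - and MAXALIGN rounds up to a
multiple of 8), take numberOfKeys = 1 and natts = 3:

    MAXALIGN(4 * 1 * 3) = 16    /* what the first expression allocates */
    MAXALIGN(4 * 1) * 3 = 24    /* what the per-attribute offsets expect */

so the buffer ends up 8 bytes short.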


Secondly, there seems to be an issue with cross-type bloom indexes. 
Imagine you have an int8 column, with a bloom index on it, and then you 
do this:


   WHERE column = '122'::int4;

Currently, we end up passing this to the consistent function, which 
tries to call hashint8 on the int4 datum - that succeeds on 64 bits 
(because both types are byval), but fails on 32-bits (where int8 is 
byref, so it fails on int4). Which causes a segfault.
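
To spell out the failure mode (a sketch, not the actual brin_bloom.c code):
without USE_FLOAT8_BYVAL, i.e. on 32-bit builds, fetching an int8 argument
boils down to

    /* what DatumGetInt64() does on such builds */
    int64   val = *((int64 *) DatumGetPointer(datum));

but the datum we passed in is a by-value int4 (the literal 122), so that
"pointer" is just the small integer itself and the dereference crashes.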


I think I included those cross-type operators as a copy-paste from 
minmax indexes, but I see hash indexes don't allow this. And removing 
those cross-type rows from pg_amop.dat makes the crashes go away.


It's also possible I simplified the get_strategy_procinfo a bit too 
much. I see the minmax variant has subtype, so maybe we could do this 
instead (I recall the integer types should have "compatible" results).


There are a couple of failures where the index does not produce the right 
number of results, though. I haven't investigated that yet. Once I fix 
this, I'll post an updated patch - hopefully that'll make cfbot happy.




regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-02-03 Thread Zhihong Yu
Hi,
bq. Not sure why would that be an issue

Moving the (start > end) check is up to your discretion.

But the midpoint computation should follow text book :-)
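
Just to illustrate (with these small array indexes both forms behave the
same; the second is simply the overflow-safe idiom):

    int     midpoint;

    midpoint = (start + end) / 2;           /* can overflow if start + end exceeds INT_MAX */
    midpoint = start + (end - start) / 2;   /* textbook overflow-safe form */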

Cheers

On Wed, Feb 3, 2021 at 4:59 PM Tomas Vondra 
wrote:

>
>
> On 2/4/21 1:49 AM, Zhihong Yu wrote:
> > Hi,
> > For 0007-Remove-the-special-batch-mode-use-a-larger--20210203.patch :
> >
> > +   /* same as preceding value, so store it */
> > +   if (compare_values(&range->values[start + i - 1],
> > +                      &range->values[start + i],
> > +                      (void *) &cxt) == 0)
> > +   continue;
> > +
> > +   range->values[start + n] = range->values[start + i];
> >
> > It seems the comment doesn't match the code: the value is stored when
> > subsequent value is different from the previous.
> >
>
> Yeah, you're right the comment is wrong - the code is doing exactly the
> opposite. I'll need to go through this more carefully.
>
> > For has_matching_range():
> > +   int midpoint = (start + end) / 2;
> >
> > I think the standard notion for midpoint is start + (end-start)/2.
> >
> > +   /* this means we ran out of ranges in the last step */
> > +   if (start > end)
> > +   return false;
> >
> > It seems the above should be ahead of computation of midpoint.
> >
>
> Not sure why would that be an issue, as we're not using the value and
> the values are just plain integers (so no overflows ...).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: WIP: BRIN multi-range indexes

2021-02-03 Thread Tomas Vondra



On 2/4/21 1:49 AM, Zhihong Yu wrote:
> Hi,
> For 0007-Remove-the-special-batch-mode-use-a-larger--20210203.patch :
> 
> +       /* same as preceding value, so store it */
> +       if (compare_values(&range->values[start + i - 1],
> +                          &range->values[start + i],
> +                          (void *) &cxt) == 0)
> +           continue;
> +
> +       range->values[start + n] = range->values[start + i];
> 
> It seems the comment doesn't match the code: the value is stored when
> subsequent value is different from the previous.
> 

Yeah, you're right the comment is wrong - the code is doing exactly the
opposite. I'll need to go through this more carefully.

> For has_matching_range():
> +       int     midpoint = (start + end) / 2;
> 
> I think the standard notion for midpoint is start + (end-start)/2.
> 
> +       /* this means we ran out of ranges in the last step */
> +       if (start > end)
> +           return false;
> 
> It seems the above should be ahead of computation of midpoint.
> 

Not sure why would that be an issue, as we're not using the value and
the values are just plain integers (so no overflows ...).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-02-03 Thread Zhihong Yu
Hi,
For 0007-Remove-the-special-batch-mode-use-a-larger--20210203.patch :

+   /* same as preceding value, so store it */
+   if (compare_values(&range->values[start + i - 1],
+                      &range->values[start + i],
+                      (void *) &cxt) == 0)
+   continue;
+
+   range->values[start + n] = range->values[start + i];

It seems the comment doesn't match the code: the value is stored when
subsequent value is different from the previous.

For has_matching_range():
+   int midpoint = (start + end) / 2;

I think the standard notion for midpoint is start + (end-start)/2.

+   /* this means we ran out of ranges in the last step */
+   if (start > end)
+   return false;

It seems the above should be ahead of computation of midpoint.

Similar comment for the code in AssertCheckRanges().

Cheers

On Wed, Feb 3, 2021 at 3:55 PM Tomas Vondra 
wrote:

> Hi,
>
> Here's an updated and significantly improved version of the patch
> series, particularly the multi-minmax part. I've fixed a number of
> stupid bugs in that, discovered by either valgrind or stress-tests.
>
> I was surprised by some of the bugs, or rather that the existing
> regression tests failed to crash on them, so it's probably worth briefly
> discussing the details. There were two main classes of such bugs:
>
>
> 1) missing datumCopy
>
> AFAICS this happened because there were a couple missing datumCopy
> calls, and BRIN covers multiple pages, so with by-ref data types we
> added a pointer but the buffer might have gone away unexpectedly.
> Regular regression tests passed just fine, because brin_multi runs
> almost separately, so the chance of the buffer being evicted was low.
> Valgrind reported this (with a rather enigmatic message, as usual), and
> so did a simple stress-test creating many indexes concurrently. Anything
> causing aggressive eviction of buffer would do the trick, I think,
> triggering segfaults, asserts, etc.
>
>
> 2) bogus opclass definitions
>
> There were a couple opclasses referencing incorrect distance function,
> intended for a different data type. I was scratching my head WTH the
> regression tests pass, as there is a table to build multi-minmax index
> on all supported data types. The reason is pretty silly - the table is
> very small, just 100 rows, with very low fillfactor (so only couple
> values per page), and the index was created with pages_per_range=1. So
> the compaction was not triggered and the distance function was never
> actually called. I've decided to build the indexes on a larger data set
> first, to test this. But maybe this needs somewhat different approach.
>
>
> BLOOM
> -
>
> The attached patch series addresses comments from the last review. As
> for the size limit, I've defined a new macro
>
> #define BloomMaxFilterSize \
> MAXALIGN_DOWN(BLCKSZ - \
>   (MAXALIGN(SizeOfPageHeaderData + \
> sizeof(ItemIdData)) + \
>MAXALIGN(sizeof(BrinSpecialSpace)) + \
>SizeOfBrinTuple))
>
> and use that to determine if the bloom filter is too large. IMO that's
> close enough, considering that this is a best-effort check anyway (due
> to not being able to consider multi-column indexes).
>
>
> MINMAX-MULTI
> 
>
> As mentioned, there's a lot of fixes and improvements in this part, but
> the basic principle is still the same. I've kept it split into three
> parts with different approaches to building, so that it's possible to do
> benchmarks and comparisons, and pick the best one.
>
> a) 0005 - Aggressively compacts the summary, by always keeping it within
> the limit defined by values_per_range. So if the range contains more
> values, this may trigger compaction very often in some cases (e.g. for
> monotonic series).
>
> One drawback is that the more often the compactions happen, the less
> optimal the result is - the algorithm is kinda greedy, picking something
> like local optimums in each step.
>
> b) 0006 - Batch build, exactly the opposite of 0005. Accumulates all
> values in a buffer, then does a single round of compaction at the very
> end. This obviously doesn't have the "greediness" issues, but it may
> consume quite a bit of memory for some data types and/or indexes with
> large BRIN ranges.
>
> c) 0007 - A hybrid approach, using a buffer that is multiple of the
> user-specified value, with some safety min/max limits. IMO this is what
> we should use, although perhaps with some tuning of the exact limits.
>
>
> Attached is a spreadsheet with benchmark results for each of those three
> approaches, on different data types (byval/byref), data set types, index
> parameters (pages/values per range) etc. I think 0007 is a reasonable
> compromise overall, with performance somewhere in betwen 0005 and 0006.
> Of course, there are cases where it's somewhat slow, e.g. for data types
> with expensive comparisons and data sets forcing frequent 

Re: WIP: BRIN multi-range indexes

2021-01-30 Thread John Naylor
On Tue, Jan 26, 2021 at 6:59 PM Tomas Vondra 
wrote:
>
>
>
> On 1/26/21 7:52 PM, John Naylor wrote:
> > On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra
> > <tomas.von...@enterprisedb.com>
> > wrote:
> >  > Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
> >  > replacement for minmax (essentially, using these opclasses as the
> >  > default ones, with the option to switch back to plain minmax). I'm not
> >  > convinced we should do that - though. Imagine you have minmax indexes in
> >  > your existing DB, it's working perfectly fine, and then we come and just
> >  > silently change that during dump/restore. Is there some past example
> >  > when we did something similar and it turned it to be OK?
> >
> > I was assuming pg_dump can be taught to insert explicit opclasses for
> > minmax indexes, so that upgrade would not cause surprises. If that's
> > true, only new indexes would have the different default opclass.
> >
>
> Maybe, I suppose we could do that. But I always found such changes
> happening silently in the background a bit suspicious, because it may be
> quite confusing. I certainly wouldn't expect such a difference between
> creating a new index and an index created by dump/restore. Did we do such
> changes in the past? That might be a precedent, but I don't recall any
> example ...

I couldn't think of a comparable example either. It comes down to
evaluating risk. On the one hand it's nice if users get an enhancement
without having to know about it, on the other hand if there is some kind of
noticeable regression, that's bad.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-01-26 Thread Tomas Vondra




On 1/26/21 7:52 PM, John Naylor wrote:
On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra 
mailto:tomas.von...@enterprisedb.com>> 
wrote:

 >
 >
 > On 1/23/21 12:27 AM, John Naylor wrote:

 > > Still, it would be great if multi-minmax can be a drop in 
replacement. I

 > > know there was a sticking point of a distance function not being
 > > available on all types, but I wonder if that can be remedied or worked
 > > around somehow.
 > >
 >
 > Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
 > replacement for minmax (essentially, using these opclasses as the
 > default ones, with the option to switch back to plain minmax). I'm not
 > convinced we should do that - though. Imagine you have minmax indexes in
 > your existing DB, it's working perfectly fine, and then we come and just
 > silently change that during dump/restore. Is there some past example
 > when we did something similar and it turned out to be OK?

I was assuming pg_dump can be taught to insert explicit opclasses for 
minmax indexes, so that upgrade would not cause surprises. If that's 
true, only new indexes would have the different default opclass.




Maybe, I suppose we could do that. But I always found such changes 
happening silently in the background a bit suspicious, because it may be 
quite confusing. I certainly wouldn't expect such a difference between 
creating a new index and an index created by dump/restore. Did we do such 
changes in the past? That might be a precedent, but I don't recall any 
example ...



 > As for the distance functions, I'm pretty sure there are data types
 > without "natural" distance - like most strings, for example. We could
 > probably invent something, but the question is how much we can rely on
 > it working well enough in practice.
 >
 > Of course, is minmax even the right index type for such data types?
 > Strings are usually "labels" and not queried using range queries,
 > although sometimes people encode stuff as strings (but then it's very
 > unlikely we'll define the distance definition well). So maybe for those
 > types a hash / bloom would be a better fit anyway.

Right.

 > But I do have an idea - maybe we can do without distances, in those
 > cases. Essentially, the primary issue of minmax indexes are outliers, so
 > what if we simply sort the values, keep one range in the middle and as
 > many single points on each tail?

That's an interesting idea. I think it would be a nice bonus to try to 
do something along these lines. On the other hand, I'm not the one 
volunteering to do the work, and the patch is useful as is.




IMO it's a fairly small amount of code, so I'll take a stab at it in the 
next version of the patch.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-01-26 Thread John Naylor
On Fri, Jan 22, 2021 at 10:59 PM Tomas Vondra 
wrote:
>
>
> On 1/23/21 12:27 AM, John Naylor wrote:

> > Still, it would be great if multi-minmax can be a drop in replacement. I
> > know there was a sticking point of a distance function not being
> > available on all types, but I wonder if that can be remedied or worked
> > around somehow.
> >
>
> Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
> replacement for minmax (essentially, using these opclasses as the
> default ones, with the option to switch back to plain minmax). I'm not
> convinced we should do that - though. Imagine you have minmax indexes in
> your existing DB, it's working perfectly fine, and then we come and just
> silently change that during dump/restore. Is there some past example
> when we did something similar and it turned out to be OK?

I was assuming pg_dump can be taught to insert explicit opclasses for
minmax indexes, so that upgrade would not cause surprises. If that's true,
only new indexes would have the different default opclass.
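
For illustration (not actual pg_dump output), the difference is between
dumping the index with the default opclass and spelling it out explicitly;
table and column names below are made up:

create table t (a int4, ts timestamptz);

-- relies on whatever the default BRIN opclass is at restore time
create index t_brin_default on t using brin (a, ts);

-- pins the behavior to plain minmax even if the default ever changes
create index t_brin_explicit on t using brin
  (a int4_minmax_ops, ts timestamptz_minmax_ops);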

> As for the distance functions, I'm pretty sure there are data types
> without "natural" distance - like most strings, for example. We could
> probably invent something, but the question is how much we can rely on
> it working well enough in practice.
>
> Of course, is minmax even the right index type for such data types?
> Strings are usually "labels" and not queried using range queries,
> although sometimes people encode stuff as strings (but then it's very
> unlikely we'll define the distance definition well). So maybe for those
> types a hash / bloom would be a better fit anyway.

Right.

> But I do have an idea - maybe we can do without distances, in those
> cases. Essentially, the primary issue of minmax indexes are outliers, so
> what if we simply sort the values, keep one range in the middle and as
> many single points on each tail?

That's an interesting idea. I think it would be a nice bonus to try to do
something along these lines. On the other hand, I'm not the one
volunteering to do the work, and the patch is useful as is.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-01-26 Thread John Naylor
Hi Tomas,
I took another look through the Bloom opclass portion (0004) with
sorted_mode omitted, and it looks good to me code-wise. I think this part
is close to commit-ready. I also did one more proofreading pass for
minor details.

+ rows per block). The default values is -0.1, and

+ greater than 0.0 and smaller than 1.0. The default values is 0.01,

s/values/value/

+ * bloom filter, we can easily and cheaply test wheter it contains values

s/wheter/whether/

+ * XXX We assume the bloom filters have the same parameters fow now. In the

s/fow/for/

+ * or null if it does not exists.

s/exists/exist/

+ * We do expect the bloom filter to eventually switch to hashing mode,
+ * and it's bound to be almost perfectly random, so not compressible.

Leftover from when it started out in sorted mode.

+ if ((m/8) > BLCKSZ)

It seems we need something more safe, to account for page header and tuple
header at least. As the comment before says, the filter will eventually not
be compressible. I remember we can't be exact here, with the possibility of
multiple columns, but we can leave a little slack space.

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-01-22 Thread Tomas Vondra



On 1/23/21 12:27 AM, John Naylor wrote:
On Thu, Jan 21, 2021 at 9:06 PM Tomas Vondra 
mailto:tomas.von...@enterprisedb.com>> 
wrote:

 > [wip optimizations]

 > The pathological case (monotonic-asc) is now 4x faster, roughly equal to
 > regular minmax index build. The opposite (monotonic-desc) is about 33%
 > faster, roughly in line with btree.

Those numbers look good. I get similar results, shown below. I've read 
0007-8 briefly but not in depth.

 >  > There are a couple cases where it's actually a bit slower - those are
 > the cases with very few distinct values per range. I believe this
 > happens because in the batch mode the code does not check if the summary
 > already contains this value, adds it to the buffer and the last step
 > ends up being more expensive than this.

I think if it's worst case a bit faster than btree and best case a bit 
slower than traditional minmax, that's acceptable.


 > I believe there's some "compromise" between those two extremes, i.e. we
 > should use a buffer that is neither too small nor too large, but something in
 > between, so that the reduction happens once in a while but not too often
 > (as with the original aggressive approach).

This sounds good also.



Yeah, I agree.

I think the reason why some of the cases got a bit slower is that in 
those cases, with the original approach (ranges being built fairly 
frequently, not just once at the end), we quickly built something that 
represented the whole range, so adding a new value was often a no-op. 
The add_value callback found that the range already "includes" the new 
value, etc.


With the batch mode, that's no longer true - we accumulate everything, 
so we have to sort it etc. Which I guess may be fairly expensive, thanks 
to calling comparator functions etc. I wonder if this could be optimized 
a bit, e.g. by first "deduplicating" the values using memcmp() or so.


But ultimately, I think the right solution will be to limit the buffer 
size to something like 10x the target, and roll with that. Typically, 
increasing the buffer size from e.g. 100B to 1000B brings a much clearer 
improvement than increasing it from 1000B to 10000B. I'd bet this 
follows the pattern.



 > FWIW, none of this is likely to be an issue in practice, because (a)
 > tables usually don't have such strictly monotonic patterns, (b) people
 > should stick to plain minmax for cases that do.

Still, it would be great if multi-minmax can be a drop in replacement. I 
know there was a sticking point of a distance function not being 
available on all types, but I wonder if that can be remedied or worked 
around somehow.




Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in 
replacement for minmax (essentially, using these opclasses as the 
default ones, with the option to switch back to plain minmax). I'm not 
convinced we should do that - though. Imagine you have minmax indexes in 
your existing DB, it's working perfectly fine, and then we come and just 
silently change that during dump/restore. Is there some past example 
when we did something similar and it turned out to be OK?


As for the distance functions, I'm pretty sure there are data types 
without "natural" distance - like most strings, for example. We could 
probably invent something, but the question is how much we can rely on 
it working well enough in practice.


Of course, is minmax even the right index type for such data types? 
Strings are usually "labels" and not queried using range queries, 
although sometimes people encode stuff as strings (but then it's very 
unlikely we'll define the distance definition well). So maybe for those 
types a hash / bloom would be a better fit anyway.



But I do have an idea - maybe we can do without distances, in those 
cases. Essentially, the primary issue of minmax indexes are outliers, so 
what if we simply sort the values, keep one range in the middle and as 
many single points on each tail?


Imagine we have N values, and we want to represent this by K values. We 
simply sort the N values, keep (k-2)/2 values on each tail as outliers, 
and use 2 values for the values in between.


Example:

input: [1, 2, 100, 110, 111, ..., 120, , ..., 130, 201, 202]

Given k = 6, we would keep 2 values on tails, and range for the rest:

  [1, 2, (100, 130), 201, 202]

Of course, this does not optimize for the same thing as when we have 
distance - in that case we try to minimize the "covering" of the input 
data, something like


 sum(length(r) for r in ranges) / (max(ranges) - min(ranges))

But maybe it's good enough when there's no distance function ...
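
Just to make the example above concrete, here is the same summary worked
out in plain SQL, keeping (k-2)/2 = 2 values on each tail (a sketch only,
not the patch's code):

with vals(v) as (values (1), (2), (100), (110), (120), (130), (201), (202)),
ordered as (
  select v,
         row_number() over (order by v) as rn,
         count(*) over () as n
  from vals
)
select array_agg(v order by v) filter (where rn <= 2)     as low_outliers,
       min(v) filter (where rn > 2 and rn <= n - 2)       as range_min,
       max(v) filter (where rn > 2 and rn <= n - 2)       as range_max,
       array_agg(v order by v) filter (where rn > n - 2)  as high_outliers
from ordered;

-- low_outliers | range_min | range_max | high_outliers
-- {1,2}        |       100 |       130 | {201,202}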



And (c) regular tables
 > tend to have much wider rows, so there are fewer values per range (so
 > that other stuff is likely more expensive than building BRIN).

True. I'm still puzzled that it didn't help to use 8 pages per range, 
but it's moot now.




I'd bet that even with just 8 pages, there were quite a few values in 
the range - possibly hundreds per page. I haven't tried if the patches 

Re: WIP: BRIN multi-range indexes

2021-01-22 Thread John Naylor
On Thu, Jan 21, 2021 at 9:06 PM Tomas Vondra 
wrote:
> [wip optimizations]

> The pathological case (monotonic-asc) is now 4x faster, roughly equal to
> regular minmax index build. The opposite (monotonic-desc) is about 33%
> faster, roughly in line with btree.

Those numbers look good. I get similar results, shown below. I've read
0007-8 briefly but not in depth.

> There are a couple cases where it's actually a bit slower - those are
> the cases with very few distinct values per range. I believe this
> happens because in the batch mode the code does not check if the summary
> already contains this value, adds it to the buffer and the last step
> ends up being more expensive than this.

I think if it's worst case a bit faster than btree and best case a bit
slower than traditional minmax, that's acceptable.

> I believe there's some "compromise" between those two extremes, i.e. we
> should use a buffer that is neither too small nor too large, but something in
> between, so that the reduction happens once in a while but not too often
> (as with the original aggressive approach).

This sounds good also.

> FWIW, none of this is likely to be an issue in practice, because (a)
> tables usually don't have such strictly monotonic patterns, (b) people
> should stick to plain minmax for cases that do.

Still, it would be great if multi-minmax can be a drop in replacement. I
know there was a sticking point of a distance function not being available
on all types, but I wonder if that can be remedied or worked around somehow.

And (c) regular tables
> tend to have much wider rows, so there are fewer values per range (so
> that other stuff is likely more expensive than building BRIN).

True. I'm still puzzled that it didn't help to use 8 pages per range, but
it's moot now.


Here are some numbers (median of 3) with a similar scenario as before,
repeated here with some added details. I didn't bother with what you call
"unpatched":

               btree   minmax   multi
monotonic-asc  44.4    26.5     27.8
mono-del-ins   38.7    24.6     30.4
mono-10-asc    61.8    25.6     33.5


create unlogged table iot (
id bigint generated by default as identity primary key,
num double precision not null,
create_dt timestamptz not null,
stuff text generated always as (md5(id::text)) stored
)
with (fillfactor = 95);

-- monotonic-asc:

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-01 0:00'::timestamptz +'5 years'::interval,
  '1 second'::interval) x;

-- mono-del-ins:
-- Here I deleted a few values from (likely) each page in the above table,
and reinserted values that shouldn't be in existing ranges:

delete from iot
where num < 0.05
or num > 0.95;

vacuum iot;

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-02-01 23:59'::timestamptz,
  '1 second'::interval) x;

-- mono-10-asc

truncate table iot;

insert into iot (num, create_dt)
select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || '
seconds')::interval
from generate_series(1,5*365*24*60*60) x;
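
For reference, the three index builds being timed above are presumably the
same ones as in the earlier message, i.e.:

create index cd_bt on iot using btree (create_dt);
create index cd_single on iot using brin (create_dt);
create index cd_multi on iot using brin (create_dt timestamptz_minmax_multi_ops);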

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-01-19 Thread Tomas Vondra

On 1/19/21 9:44 PM, John Naylor wrote:
On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra 
mailto:tomas.von...@enterprisedb.com>> 
wrote:

 > I suspect it's due to minmax having to decide which "ranges" to merge,
 > which requires repeated sorting, etc. I certainly don't dare to claim
 > the current algorithm is perfect. I wouldn't have expected such a big
 > difference, though - so definitely worth investigating.

It seems that monotonically increasing (or decreasing) values in a table 
are a worst case scenario for multi-minmax indexes, or basically, unique 
values within a range. I'm guessing it's because it requires many passes 
to fit all the values into a limited number of ranges. I tried using 
smaller pages_per_range numbers, 32 and 8, and that didn't help.


Now, with a different data distribution, using only 10 values that 
repeat over and over, the results are much more sympathetic to multi-minmax:


insert into iot (num, create_dt)
select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || ' 
seconds')::interval

from generate_series(1,5*365*24*60*60) x;

create index cd_single on iot using brin(create_dt);
27.2s

create index cd_multi on iot using brin(create_dt 
timestamptz_minmax_multi_ops);

30.4s

create index cd_bt on iot using btree(create_dt);
61.8s

Circling back to the monotonic case, I tried running a simple perf 
record on a backend creating a multi-minmax index on a timestamptz 
column and these were the highest non-kernel calls:
+   21.98%    21.91%  postgres  postgres   [.] FunctionCall2Coll
+    9.31%     9.29%  postgres  postgres   [.] compare_combine_ranges
+    8.60%     8.58%  postgres  postgres   [.] qsort_arg
+    5.68%     5.66%  postgres  postgres   [.] brin_minmax_multi_add_value
+    5.63%     5.60%  postgres  postgres   [.] timestamp_lt
+    4.73%     4.71%  postgres  postgres   [.] reduce_combine_ranges
+    3.80%     0.00%  postgres  [unknown]  [.] 0x032001680004
+    3.51%     3.50%  postgres  postgres   [.] timestamp_eq

There's no one place that's pathological enough to explain the 4x 
slowness over traditional BRIN and nearly 3x slowness over btree when 
using a large number of unique values per range, so making progress here 
would have to involve a more holistic approach.




Yeah. This very much seems like the primary problem is in how we build 
the ranges incrementally - with monotonic sequences, we end up having to 
merge the ranges over and over again. I don't know what the structure 
of the table was, but I guess it was kinda narrow (very few columns), 
which exacerbates the problem further, because the number of rows per 
range will be way higher than in the real world.


I do think the solution to this might be to allow more values during 
batch index creation, and only "compress" to the requested number at the 
very end (when serializing to on-disk format).


There are a couple additional comments about possibly replacing 
sequential scan with a binary search, that could help a bit too.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-01-19 Thread John Naylor
On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra 
wrote:
> I suspect it's due to minmax having to decide which "ranges" to merge,
> which requires repeated sorting, etc. I certainly don't dare to claim
> the current algorithm is perfect. I wouldn't have expected such a big
> difference, though - so definitely worth investigating.

It seems that monotonically increasing (or decreasing) values in a table
are a worst case scenario for multi-minmax indexes, or basically, unique
values within a range. I'm guessing it's because it requires many passes
to fit all the values into a limited number of ranges. I tried using
smaller pages_per_range numbers, 32 and 8, and that didn't help.

Now, with a different data distribution, using only 10 values that repeat
over and over, the results are much more sympathetic to multi-minmax:

insert into iot (num, create_dt)
select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || '
seconds')::interval
from generate_series(1,5*365*24*60*60) x;

create index cd_single on iot using brin(create_dt);
27.2s

create index cd_multi on iot using brin(create_dt
timestamptz_minmax_multi_ops);
30.4s

create index cd_bt on iot using btree(create_dt);
61.8s

Circling back to the monotonic case, I tried running a simple perf record
on a backend creating a multi-minmax index on a timestamptz column and
these were the highest non-kernel calls:
+   21.98%    21.91%  postgres  postgres   [.] FunctionCall2Coll
+    9.31%     9.29%  postgres  postgres   [.] compare_combine_ranges
+    8.60%     8.58%  postgres  postgres   [.] qsort_arg
+    5.68%     5.66%  postgres  postgres   [.] brin_minmax_multi_add_value
+    5.63%     5.60%  postgres  postgres   [.] timestamp_lt
+    4.73%     4.71%  postgres  postgres   [.] reduce_combine_ranges
+    3.80%     0.00%  postgres  [unknown]  [.] 0x032001680004
+    3.51%     3.50%  postgres  postgres   [.] timestamp_eq

There's no one place that's pathological enough to explain the 4x slowness
over traditional BRIN and nearly 3x slowness over btree when using a large
number of unique values per range, so making progress here would have to
involve a more holistic approach.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-01-12 Thread Tomas Vondra




On 1/12/21 6:28 PM, John Naylor wrote:
On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra 
mailto:tomas.von...@enterprisedb.com>> 
wrote:

 > [12-20 version]

Hi Tomas,

The measurements look good. In case it fell through the cracks, my 
earlier review comments for Bloom BRIN indexes regarding minor details 
don't seem to have been addressed in this version. I'll point to earlier 
discussion for convenience:


https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com 



https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com 





Whooops :-( I'll go through those again, thanks for reminding me.


 > The improvements are fairly minor:
 >
 > 1) Rejecting bloom filters that are clearly too large (larger than page)
 > early. This is imperfect, as it works for individual index keys, not the
 > whole row. But per discussion it seems useful.

I think this is good enough.

 > So based on this I'm tempted to just use the version with two hashes, as
 > implemented in 0005. It's much simpler than the partitioning scheme,
 > does not need any of the logic to generate primes etc.

Sounds like the best engineering decision.

Circling back to multi-minmax build times, I ran a couple quick tests on 
bigger hardware, and found that not only is multi-minmax slower than 
minmax, which is to be expected, but also slower than btree. (unlogged 
table ~12GB in size, maintenance_work_mem = 1GB, median of three runs)


btree          38.3s
minmax         26.2s
multi-minmax  110s

Since btree indexes are much larger, I imagine something algorithmic is 
involved. Is it worth digging further to see if some code path is taking 
more time than we would expect?




I suspect it's due to minmax having to decide which "ranges" to merge, 
which requires repeated sorting, etc. I certainly don't dare to claim 
the current algorithm is perfect. I wouldn't have expected such a big 
difference, though - so definitely worth investigating.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-01-12 Thread John Naylor
On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra 
wrote:
> [12-20 version]

Hi Tomas,

The measurements look good. In case it fell through the cracks, my earlier
review comments for Bloom BRIN indexes regarding minor details don't seem
to have been addressed in this version. I'll point to earlier discussion
for convenience:

https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com

https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com

> The improvements are fairly minor:
>
> 1) Rejecting bloom filters that are clearly too large (larger than page)
> early. This is imperfect, as it works for individual index keys, not the
> whole row. But per discussion it seems useful.

I think this is good enough.

> So based on this I'm tempted to just use the version with two hashes, as
> implemented in 0005. It's much simpler than the partitioning scheme,
> does not need any of the logic to generate primes etc.

Sounds like the best engineering decision.

Circling back to multi-minmax build times, I ran a couple quick tests on
bigger hardware, and found that not only is multi-minmax slower than
minmax, which is to be expected, but also slower than btree. (unlogged
table ~12GB in size, maintenance_work_mem = 1GB, median of three runs)

btree  38.3s
minmax 26.2s
multi-minmax  110s

Since btree indexes are much larger, I imagine something algorithmic is
involved. Is it worth digging further to see if some code path is taking
more time than we would expect?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: WIP: BRIN multi-range indexes

2021-01-02 Thread Tomas Vondra
Hi,

On 1/2/21 7:42 AM, Thomas Munro wrote:
> On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra
>  wrote:
>> Attached is an updated version of the patch series, rebased on current
>> master, and results for benchmark comparing the various bloom variants.
> 
Perhaps src/include/utils/inet.h needs to include <sys/socket.h>,
> because FreeBSD says:
> 
> brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET'
> if (ip_family(ipa) == PGSQL_AF_INET)
> ^
> ../../../../src/include/utils/inet.h:39:24: note: expanded from macro
> 'PGSQL_AF_INET'
> #define PGSQL_AF_INET (AF_INET + 0)
> ^

Not sure. The other files using PGSQL_AF_INET just include sys/socket.h
directly, so maybe this should just do the same thing ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP: BRIN multi-range indexes

2021-01-01 Thread Thomas Munro
On Sun, Dec 20, 2020 at 1:16 PM Tomas Vondra
 wrote:
> Attached is an updated version of the patch series, rebased on current
> master, and results for benchmark comparing the various bloom variants.

Perhaps src/include/utils/inet.h needs to include <sys/socket.h>,
because FreeBSD says:

brin_minmax_multi.c:1693:24: error: use of undeclared identifier 'AF_INET'
if (ip_family(ipa) == PGSQL_AF_INET)
^
../../../../src/include/utils/inet.h:39:24: note: expanded from macro
'PGSQL_AF_INET'
#define PGSQL_AF_INET (AF_INET + 0)
^




Re: WIP: BRIN multi-range indexes

2020-11-09 Thread John Naylor
On Mon, Nov 9, 2020 at 1:39 PM Tomas Vondra 
wrote:
>
>
> While investigating the failures, I've tried increasing the values a
> lot, without observing any measurable increase in runtime. IIRC I've
> even used (10 * target_partlen) or something like that. That tells me
> it's not a very sensitive part of the code, so I'd suggest to simply use
> something that we know is large enough to be safe.

Okay, then it's not worth being clever.

> For example, the largest bloom filter we can have is 32kB, i.e. 262kb,
> at which point the largest gap is less than 95 (per the gap table). And
> we may use up to BLOOM_MAX_NUM_PARTITIONS, so let's just use
> BLOOM_MAX_NUM_PARTITIONS * 100

Sure.

> FWIW I wonder if we should do something about bloom filters that we know
> can get larger than page size. In the example I used, we know that
> nbits=575104 is larger than page, so as the filter gets more full (and
> thus more random and less compressible) it won't possibly fit. Maybe we
> should reject that right away, instead of "delaying it" until later, on
> the basis that it's easier to fix at CREATE INDEX time (compared to when
> inserts/updates start failing at a random time).

Yeah, I'd be inclined to reject that right away.

> The problem with this is of course that if the index is multi-column,
> this may not be strict enough (i.e. each filter would fit independently,
> but the whole index row is too large). But it's probably better to do at
> least something, and maybe improve that later with some whole-row check.

A whole-row check would be nice, but I don't know how hard that would be.

As a Devil's advocate proposal, how awful would it be to not allow
multicolumn brin-bloom indexes?

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: BRIN multi-range indexes

2020-11-09 Thread Tomas Vondra



On 11/9/20 3:29 PM, John Naylor wrote:
> On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra
> mailto:tomas.von...@enterprisedb.com>>
> wrote:
> 
>> Overall, I think there's very little difference, particularly in the
>> "match" cases when we're searching for a value that we know is in the
>> table. The one-hash variant seems to perform a bit better, but the
>> difference is fairly small.
>>
>> In the "mismatch" cases (searching for value that is not in the table)
>> the differences are more significant, but it might be noise. It does
>> seem much more "green" than "red", i.e. the one-hash variant seems to be
>> faster (although this does depend on the values for formatting).
>>
>> To sum this up, I think the one-hash approach seems interesting. It's
>> not going to give us huge speedups because we're only hashing int32
>> values anyway (not the source data), but it's worth exploring.
> 
> Thanks for testing! It seems you tested against the version with two
> moduli, and not the alternative discussed in
> 
> https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development
> 
> 
> which would in fact be rehashing the 32 bit values. I think that would
> be the way to go if we don't use the one-hashing approach.
> 

Yeah. I forgot about this detail, and I may try again with the two-hash
variant, but I wonder how much difference it would make, considering the
results match the expectations (that is, the "scan fraction" results
for fill_factor=100 match the target fpr almost perfectly).

I think there's a possibly-more important omission in the testing - I
forgot about the "sort mode" used initially, when the filter keeps the
actual hash values and only switches to hashing later. I wonder if that
plays role for some of the cases.

I'll investigate this a bit in the next round of tests.


>> a) set_bloom_partitions does this:
>>
>>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>>        pidx++;
>>
>> which is broken, because the second part of the condition only checks
>> the current index - we may end up using nhashes primes after that, and
>> some of them may be 0. So this needs to be:
>>
>>     while (primes[pidx + nhashes - 1] <= target &&
>>            primes[pidx + nhashes] > 0)
>>        pidx++;
> 
> Good catch.
> 
>> b) set_bloom_partitions does this to generate primes:
>>
>>     /*
>>      * Increase the limit to ensure we have some primes higher than
>>      * the target partition length. The 100 value is arbitrary, but
>>      * should be well over what we need.
>>      */
>>     primes = generate_primes(target_partlen + 100);
>>
>> It's not clear to me why 100 is sufficient, particularly for large page
>> sizes. AFAIK the primes get more and more sparse, so how does this
>> guarantee we'll get enough "sufficiently large" primes?
> 
> This value is not rigorous and should be improved, but I started with
> that by looking at the table in section 3 in
> 
> https://primes.utm.edu/notes/gaps.html
> 
> 
> I see two ways to make a stronger guarantee:
> 
> 1. Take the average gap between primes near n, which is log(n), and
> multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target
> seems a better heuristic than a constant, and is still simple to calculate.
> 
> With the pathological example you gave of n=575104, k=3 (target_partlen
> = 191701), the number to add is log(191701) * 10 = 122.  By the table
> referenced above, the largest prime gap under 360653 is 95, so we're
> guaranteed to find at least one prime in the space of 122 above the
> target. That will likely be enough to find the closest-to-target filter
> size for k=3. Even if it weren't, nbits is so large that the relative
> difference is tiny. I'd say a heuristic like this is most likely to be
> off precisely when it matters the least. At this size, even if we find
> zero primes above our target, the relative filter size is close to 
> 
> (575104 - 3 * 95) / 575104 = 0.9995
> 
> For a more realistic bad-case target partition length, log(1327) * 10 =
> 72. There are 33 composites after 1327, the largest such gap below 9551.
> That would give five primes larger than the target
> 1361   1367   1373   1381   1399
> 
> which is more than enough for k<=10:
> 
> 1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 +
> 1373 = 13276
> 
> 2. Use a "segmented range" algorithm for the sieve and iterate until we
> get k*2 primes, half below and half above the target. This would be an
> absolute guarantee, but also more code, so I'm inclined against that.
> 

Thanks, that makes sense.

While investigating the failures, I've tried increasing the values a
lot, without observing any measurable increase in runtime. IIRC I've
even used (10 * target_partlen) or something like that. That tells me
it's not a very sensitive part of the code, so I'd suggest to simply use
something that we know is large enough to be safe.

Re: WIP: BRIN multi-range indexes

2020-11-09 Thread John Naylor
On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra 
wrote:

> Overall, I think there's very little difference, particularly in the
> "match" cases when we're searching for a value that we know is in the
> table. The one-hash variant seems to perform a bit better, but the
> difference is fairly small.
>
> In the "mismatch" cases (searching for value that is not in the table)
> the differences are more significant, but it might be noise. It does
> seem much more "green" than "red", i.e. the one-hash variant seems to be
> faster (although this does depend on the values for formatting).
>
> To sum this up, I think the one-hash approach seems interesting. It's
> not going to give us huge speedups because we're only hashing int32
> values anyway (not the source data), but it's worth exploring.

Thanks for testing! It seems you tested against the version with two
moduli, and not the alternative discussed in

https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development

which would in fact be rehashing the 32 bit values. I think that would be
the way to go if we don't use the one-hashing approach.

> a) set_bloom_partitions does this:
>
> while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>pidx++;
>
> which is broken, because the second part of the condition only checks
> the current index - we may end up using nhashes primes after that, and
> some of them may be 0. So this needs to be:
>
> while (primes[pidx + nhashes - 1] <= target &&
>primes[pidx + nhashes] > 0)
>pidx++;

Good catch.

> b) set_bloom_partitions does this to generate primes:
>
> /*
>  * Increase the limit to ensure we have some primes higher than
>  * the target partition length. The 100 value is arbitrary, but
>  * should be well over what we need.
>  */
> primes = generate_primes(target_partlen + 100);
>
> It's not clear to me why 100 is sufficient, particularly for large page
> sizes. AFAIK the primes get more and more sparse, so how does this
> guarantee we'll get enough "sufficiently large" primes?

This value is not rigorous and should be improved, but I started with that
by looking at the table in section 3 in

https://primes.utm.edu/notes/gaps.html

I see two ways to make a stronger guarantee:

1. Take the average gap between primes near n, which is log(n), and
multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target seems
a better heuristic than a constant, and is still simple to calculate.

With the pathological example you gave of n=575104, k=3 (target_partlen =
191701), the number to add is log(191701) * 10 = 122.  By the table
referenced above, the largest prime gap under 360653 is 95, so we're
guaranteed to find at least one prime in the space of 122 above the target.
That will likely be enough to find the closest-to-target filter size for
k=3. Even if it weren't, nbits is so large that the relative difference is
tiny. I'd say a heuristic like this is most likely to be off precisely when
it matters the least. At this size, even if we find zero primes above our
target, the relative filter size is close to

(575104 - 3 * 95) / 575104 = 0.9995

For a more realistic bad-case target partition length, log(1327) * 10 = 72.
There are 33 composites after 1327, the largest such gap below 9551. That
would give five primes larger than the target
1361   1367   1373   1381   1399

which is more than enough for k<=10:

1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 + 1373
= 13276

2. Use a "segmented range" algorithm for the sieve and iterate until we get
k*2 primes, half below and half above the target. This would be an absolute
guarantee, but also more code, so I'm inclined against that.

> c) generate_primes uses uint16 to store the primes, so it can only
> generate primes up to 32768. That's (probably) enough for 8kB pages, but
> for 32kB pages it's clearly insufficient.

Okay.

> As for the original question of how expensive this naive sieve is, I
> haven't been able to measure any significant timings. The logging around
> generate_primes usually looks like this:
>
> 2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
> 575104 nhashes 3 target_partlen 191701
> 2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>
> So it takes 0.000 seconds for this extreme page size. I don't think we
> need to invent anything more elaborate.

Okay, good to know. If we were concerned about memory, we could have it
check only odd numbers. That's a common feature of sieves, but also makes
the code a bit harder to understand if you haven't seen it before.

Also to fill in something I left for later, the reference for this

/* upper bound of number of primes below limit */
/* WIP: reference for this number */
int numprimes = 1.26 * limit / log(limit);

is

Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas for
some functions of prime numbers". Illinois J. Math. 6: 64–94.
doi:10.1215/ijm/1255631807

More 

Re: WIP: BRIN multi-range indexes

2020-11-02 Thread Anastasia Lubennikova
Status update for a commitfest entry.

According to cfbot the patch no longer compiles.
Tomas, can you send an update, please?

I also see that the last few messages mention a data corruption bug. Sounds 
pretty serious.
Alvaro, have you had a chance to look at it? I don't see anything committed 
yet, nor any active discussion in other threads.

Re: WIP: BRIN multi-range indexes

2020-10-01 Thread Alvaro Herrera
On 2020-Oct-01, Tomas Vondra wrote:

> OK, so this seems like a data corruption bug in BRIN, actually.

Oh crap.  You're right -- the data needs to be detoasted before being
put in the index.

I'll have a look at how this can be fixed.




Re: WIP: BRIN multi-range indexes

2020-10-01 Thread Tomas Vondra

On Wed, Sep 30, 2020 at 07:57:19AM -0400, John Naylor wrote:

On Mon, Sep 28, 2020 at 10:12 PM Tomas Vondra
 wrote:


Is it actually all that different from the existing BRIN indexes?
Consider this example:

create table x (a text, b text, c text);

create index on x using brin (a,b,c);

create or replace function random_str(p_len int) returns text as $$
select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from 
generate_series(1,$1)) foo;
$$ language sql;

test=# insert into x select random_str(1000), random_str(1000), 
random_str(1000);
ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"


Hmm, okay. As for which comes first, insert or index creation, I'm
baffled, too. I also would expect the example above would take up a
bit over 6000 bytes, but not 9000.



OK, so this seems like a data corruption bug in BRIN, actually.

The ~9000 bytes is actually about right: the strings are in UTF-8, so
that's roughly 1.5 bytes per character, and we have 6 values to store
(3 columns, min/max for each), so 6 * 1500 = 9000.

The real question is how come INSERT + CREATE INDEX actually manages to
create an index tuple. And the answer is pretty simple - brin_form_tuple
kinda ignores toasting, happily building index tuples where some values
are toasted.

Consider this:

create table x (a text, b text, c text);
insert into x select random_str(1000), random_str(1000), random_str(1000);

create index on x using brin (a,b,c);
delete from x;
vacuum x;

set enable_seqscan=off;

insert into x select random_str(10), random_str(10), random_str(10);
ERROR:  missing chunk number 0 for toast value 16530 in pg_toast_16525

explain analyze select * from x where a = 'xxx';
ERROR:  missing chunk number 0 for toast value 16530 in pg_toast_16525

select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 
'x_a_b_c_idx'::regclass);
ERROR:  missing chunk number 0 for toast value 16547 in pg_toast_16541
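
The get_raw_page / brin_page_items calls here come from the pageinspect
extension, so reproducing this assumes:

create extension if not exists pageinspect;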


Interestingly enough, running the select before the insert seems to be
working - not sure why.

Anyway, it has behaved like this since 9.5 :-(


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-09-30 Thread John Naylor
On Mon, Sep 28, 2020 at 10:12 PM Tomas Vondra
 wrote:

> Is it actually all that different from the existing BRIN indexes?
> Consider this example:
>
> create table x (a text, b text, c text);
>
> create index on x using brin (a,b,c);
>
> create or replace function random_str(p_len int) returns text as $$
> select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x 
> from generate_series(1,$1)) foo;
> $$ language sql;
>
> test=# insert into x select random_str(1000), random_str(1000), 
> random_str(1000);
> ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"

Hmm, okay. As for which comes first, insert or index creation, I'm
baffled, too. I also would expect the example above would take up a
bit over 6000 bytes, but not 9000.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-28 Thread Tomas Vondra

On Mon, Sep 28, 2020 at 04:42:39PM -0400, John Naylor wrote:

On Thu, Sep 24, 2020 at 7:50 PM Tomas Vondra
 wrote:


On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:



>Hmm, how ugly would it be to change the default range size depending
>on the opclass?
>

Not sure. What would happen for multi-column BRIN indexes with different
opclasses?


Sounds like a can of worms. In any case I suspect if there is no more
graceful way to handle too-large filters than ERROR out the first time
trying to write to the index, this feature might meet some resistance.
Not sure what to suggest, though.



Is it actually all that different from the existing BRIN indexes?
Consider this example:

create table x (a text, b text, c text);

create index on x using brin (a,b,c);

create or replace function random_str(p_len int) returns text as $$
select string_agg(x, '') from (select chr(1 + (254 * random())::int ) as x from 
generate_series(1,$1)) foo;
$$ language sql;

test=# insert into x select random_str(1000), random_str(1000), 
random_str(1000);
ERROR:  index row size 9056 exceeds maximum 8152 for index "x_a_b_c_idx"


I'm a bit puzzled, though, because both of these things seem to work:

1) insert before creating the index

create table x (a text, b text, c text);
insert into x select random_str(1000), random_str(1000), random_str(1000);
create index on x using brin (a,b,c);
-- and there actually is a non-empty summary with real data
select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 
'x_a_b_c_idx'::regclass);


2) insert "small" row before inserting the over-sized one

create table x (a text, b text, c text);
insert into x select random_str(10), random_str(10), random_str(10);
insert into x select random_str(1000), random_str(1000), random_str(1000);
create index on x using brin (a,b,c);
-- and there actually is a non-empty summary with the "big" values
select * from brin_page_items(get_raw_page('x_a_b_c_idx', 2), 
'x_a_b_c_idx'::regclass);


I find this somewhat strange - how come we don't fail here too?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-28 Thread John Naylor
On Thu, Sep 24, 2020 at 7:50 PM Tomas Vondra
 wrote:
>
> On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:

> >Hmm, how ugly would it be to change the default range size depending
> >on the opclass?
> >
>
> Not sure. What would happen for multi-column BRIN indexes with different
> opclasses?

Sounds like a can of worms. In any case I suspect if there is no more
graceful way to handle too-large filters than ERROR out the first time
trying to write to the index, this feature might meet some resistance.
Not sure what to suggest, though.

> >> I don't think the efficiency of this code matters too much - it's only
> >> used once when creating the index, so the simpler the better. Certainly
> >> for now, while testing the partitioning approach.
> >
> >To check my understanding, isn't bloom_init() called for every tuple?
> >Agreed on simplicity so done this way.
> >
>
> No, it's only called for the first non-NULL value in the page range
> (unless I made a boo boo when writing that code).

Ok, then I basically understood -- by tuple I meant BRIN tuple, pardon
my ambiguity. After thinking about it, I agree that CPU cost is
probably trivial (and if not, something is seriously wrong).

> >n    k   m      p
> >928   7  8895   0.01
> >928  10  13343  0.001  (lowest p supported in patch set)
> >928  13  17790  0.0001
> >928  10  18280  0.0001 (still works with lower k, needs higher m)
> >928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
> >degrade p much)
> >
> >Also, k seems pretty robust against small changes as long as m isn't
> >artificially constrained and as long as p is small.
> >
> >So I *think* it's okay to cap k at 10 or 12, and not bother with
> >adjusting m, which worsens space issues. As I found before, lowering k
> >raises target fpr, but seems more robust to overshooting ndistinct. In
> >any case, we only need k * 2 bytes to store the partition lengths.
> >
> >The only way I see to avoid any limitation is to make the array of
> >primes variable length, which could be done by putting the filter
> >offset calculation into a macro. But having two variable-length arrays
> >seems messy.
> >
>
> Hmmm. I wonder how would these limitations impact the conclusions from
> the one-hashing paper? Or was this just for the sake of a demonstration?

Using "10" in the patch is a demonstration, which completely supports
the current fpr allowed by the reloption, and showing what happens if
fpr is allowed to go lower. But for your question, I *think* this
consideration is independent from the conclusions. The n, k, m values
give a theoretical false positive rate, assuming a completely perfect
hashing scheme. The numbers I'm playing with show consequences in the
theoretical fpr. The point of the paper (and others like it) is how to
get the real fpr as close as possible to the fpr predicted by the
theory. My understanding anyway.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-24 Thread Tomas Vondra

On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:

On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
 wrote:


On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:



>While playing around with the numbers I had an epiphany: At the
>defaults, the filter already takes up ~4.3kB, over half the page.
>There is no room for another tuple, so if we're only indexing one
>column, we might as well take up the whole page.

Hmm, yeah. I may be wrong but IIRC indexes don't support external
storage but compression is still allowed. So even if those defaults are
a bit higher than needed that should make the bloom filters a bit more
compressible, and thus fit multiple BRIN tuples on a single page.



Not sure about how much we want to rely on these optimizations, though,
considering multi-column indexes kinda break this.


Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With default p, that's about 1.1 kB per brin tuple, so one brin
page can index 224 pages, much more than with the default 128.

Hmm, how ugly would it be to change the default range size depending
on the opclass?



Not sure. What would happen for multi-column BRIN indexes with different
opclasses?


If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.



True.


There is already this item from the brin readme:

* Different-size page ranges?
 In the current design, each "index entry" in a BRIN index covers the same
 number of pages.  There's no hard reason for this; it might make sense to
 allow the index to self-tune so that some index entries cover smaller page
 ranges, if this allows the summary values to be more compact.  This
would incur
 larger BRIN overhead for the index itself, but might allow better pruning of
 page ranges during scan.  In the limit of one index tuple per page, the index
 itself would occupy too much space, even though we would be able to skip
 reading the most heap pages, because the summary values are tight; in the
 opposite limit of a single tuple that summarizes the whole table, we wouldn't
 be able to prune anything even though the index is very small.  This can
 probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.



Yeah. I think it's a fairly independent / orthogonal project.


Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.


>We could also generate the primes via a sieve instead, which is really
>fast (and more code). That would be more robust, but that would require
>the filter to store the actual primes used, so 20 more bytes at max k =
>10. We could hard-code space for that, or to keep from hard-coding
>maximum k and thus lowest possible false positive rate, we'd need more
>bookkeeping.
>

I don't think the efficiency of this code matters too much - it's only
used once when creating the index, so the simpler the better. Certainly
for now, while testing the partitioning approach.


To check my understanding, isn't bloom_init() called for every tuple?
Agreed on simplicity so done this way.



No, it's only called for the first non-NULL value in the page range
(unless I made a boo boo when writing that code).


>So, with the partition approach, we'd likely have to set in stone
>either max nbits, or min target false positive rate. The latter option
>seems more principled, not only for the block size, but also since the
>target fp rate is already fixed by the reloption, and as I've
>demonstrated, we can often go above and beyond the reloption even
>without changing k.
>

That seems like a rather annoying limitation, TBH.


I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

n    k   m      p
928   7  8895   0.01
928  10  13343  0.001  (lowest p supported in patch set)
928  13  17790  0.0001
928  10  18280  0.0001 (still works with lower k, needs higher m)
928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
degrade p much)

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to 

Re: WIP: BRIN multi-range indexes

2020-09-24 Thread John Naylor
On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
 wrote:
>
> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

> >While playing around with the numbers I had an epiphany: At the
> >defaults, the filter already takes up ~4.3kB, over half the page.
> >There is no room for another tuple, so if we're only indexing one
> >column, we might as well take up the whole page.
>
> Hmm, yeah. I may be wrong but IIRC indexes don't support external
> storage but compression is still allowed. So even if those defaults are
> a bit higher than needed that should make the bloom filters a bit more
> compressible, and thus fit multiple BRIN tuples on a single page.

> Not sure about how much we want to rely on these optimizations, though,
> considering multi-column indexes kinda break this.

Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With default p, that's about 1.1 kB per brin tuple, so one brin
page can index 224 pages, much more than with the default 128.
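
A documented example along those lines might look roughly like this (a
sketch only; the opclass parameter and reloption names are assumptions
and may differ from the patch, and the table is hypothetical):

create table t (a int4);

create index t_bloom_idx on t using brin
  (a int4_bloom_ops(false_positive_rate = 0.01))
  with (pages_per_range = 32);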

Hmm, how ugly would it be to change the default range size depending
on the opclass?

If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.

There is already this item from the brin readme:

* Different-size page ranges?
  In the current design, each "index entry" in a BRIN index covers the same
  number of pages.  There's no hard reason for this; it might make sense to
  allow the index to self-tune so that some index entries cover smaller page
  ranges, if this allows the summary values to be more compact.  This
would incur
  larger BRIN overhead for the index itself, but might allow better pruning of
  page ranges during scan.  In the limit of one index tuple per page, the index
  itself would occupy too much space, even though we would be able to skip
  reading the most heap pages, because the summary values are tight; in the
  opposite limit of a single tuple that summarizes the whole table, we wouldn't
  be able to prune anything even though the index is very small.  This can
  probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.

Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.

> >We could also generate the primes via a sieve instead, which is really
> >fast (and more code). That would be more robust, but that would require
> >the filter to store the actual primes used, so 20 more bytes at max k =
> >10. We could hard-code space for that, or to keep from hard-coding
> >maximum k and thus lowest possible false positive rate, we'd need more
> >bookkeeping.
> >
>
> I don't think the efficiency of this code matters too much - it's only
> used once when creating the index, so the simpler the better. Certainly
> for now, while testing the partitioning approach.

To check my understanding, isn't bloom_init() called for every tuple?
Agreed on simplicity so done this way.

> >So, with the partition approach, we'd likely have to set in stone
> >either max nbits, or min target false positive rate. The latter option
> >seems more principled, not only for the block size, but also since the
> >target fp rate is already fixed by the reloption, and as I've
> >demonstrated, we can often go above and beyond the reloption even
> >without changing k.
> >
>
> That seems like a rather annoying limitation, TBH.

I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

n    k   m      p
928   7  8895   0.01
928  10  13343  0.001  (lowest p supported in patch set)
928  13  17790  0.0001
928  10  18280  0.0001 (still works with lower k, needs higher m)
928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
degrade p much)
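
For reference, these rows follow from the usual Bloom filter sizing
formulas, m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2; e.g. n = 928
and p = 0.01 gives m ~= 8895 bits and k ~= 6.6, rounded to 7.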

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to make the array of
primes variable length, which could be done by putting the filter
offset calculation into a macro. But having two variable-length arrays
seems messy.

> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
> >bits, but we can't set/read off the 

Re: WIP: BRIN multi-range indexes

2020-09-21 Thread Tomas Vondra

On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
 wrote:


But maybe we could still use this scheme by actually computing

h1 = hash_uint32_extended(value, seed1);
h2 = hash_uint32_extended(value, seed2);

and then use this as the independent hash functions. I think that would
meet the requirements of the paper.


Yeah, that would work algorithmically. It would be trivial to add to
the patch, too of course. There'd be a higher up-front cpu cost. Also,
I'm a bit cautious of rehashing hashes, and whether the two values
above are independent enough. I'm not sure either of these points
matters. My guess is the partition approach is more sound, but it has
some minor organizational challenges (see below).



OK. I don't think rehashing hashes is an issue as long as the original
hash has sufficiently low collision rate (and while we know it's not
perfect we know it works well enough for hash indexes etc.). And I doubt
the cost of the extra hash of uint32 would be noticeable.

That being said the partitioning approach might be more sound and it's
definitely worth giving it a try.


Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits.


Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:

ERROR:  index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

While playing around with the numbers I had an epiphany: At the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page.


Hmm, yeah. I may be wrong but IIRC indexes don't support external
storage but compression is still allowed. So even if those defaults are
a bit higher than needed that should make the bloom filters a bit more
compressible, and thus fit multiple BRIN tuples on a single page.


Here MT = max tuples per 128 8k pages, or 37120, so default ndistinct
is 3712.

n  k  m  p
MT/10  7  35580  0.01
MT/10  7  64000  0.0005
MT/10  12 64000  0.00025

Assuming ndistinct isn't way off reality, we get 20x-40x lower false
positive rate almost for free, and it'd be trivial to code! Keeping k
at 7 would be more robust, since it's equivalent to starting with n =
~6000, p = 0.006, which is still almost 2x less false positives than
you asked for. It also means nearly doubling the number of sorted
values before switching.

Going the other direction, capping nbits to 64k bits when ndistinct
gets too high, the false positive rate we can actually support starts
to drop. Here, the user requested 0.001 fpr.

n      k  p
4500   9  0.001
6000   7  0.006
7500   6  0.017
15000  3  0.129  (probably useless by now)
MT     1  0.440
64000  1  0.63   (possible with > 128 pages per range)

I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to see that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, like
brin_bloom_get_supported_fpr(). Or document to check with pageinspect.
And that's not even considering multi-column indexes, like you
mentioned.



I agree giving users visibility into this would be useful.

Not sure about how much we want to rely on these optimizations, though,
considering multi-column indexes kinda break this.


But I guess we can simply make the table of primes a bit larger,
right?


If we want to support all the above cases without falling down
entirely, it would have to go up to 32k to be safe (When k = 1 we could
degenerate to one modulus on the whole filter). That would be a table
of about 7kB, which we could binary search. [thinks for a
moment]...Actually, if we wanted to be clever, maybe we could
precalculate the primes needed for the 64k bit cases and stick them at
the end of the array. The usual algorithm will find them. That way, we
could keep the array around 2kB. However, for >8kB block size, we
couldn't adjust the 64k number, which might be okay, but not really
project policy.

We could also generate the primes via a sieve instead, which is really
fast (and more code). That would be more robust, but that would require
the filter to store the actual primes used, so 20 more bytes at max k =
10. We could hard-code space for that, or to keep from hard-coding
maximum k and thus lowest possible false positive rate, we'd need more
bookkeeping.



I don't think the efficiency of this code matters too much - it's only
used once when creating the index, so the simpler the better. Certainly
for now, while testing the partitioning approach.


So, with the partition approach, we'd likely have to set in stone
either max nbits, or min 

Re: WIP: BRIN multi-range indexes

2020-09-21 Thread John Naylor
On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
 wrote:

> But maybe we could still use this scheme by actually computing
>
> h1 = hash_uint32_extended(value, seed1);
> h2 = hash_uint32_extended(value, seed2);
>
> and then use this as the independent hash functions. I think that would
> meet the requirements of the paper.

Yeah, that would work algorithmically. It would be trivial to add to
the patch, too of course. There'd be a higher up-front cpu cost. Also,
I'm a bit cautious of rehashing hashes, and whether the two values
above are independent enough. I'm not sure either of these points
matters. My guess is the partition approach is more sound, but it has
some minor organizational challenges (see below).

> Why would 11k bits be more than we'd like to store? Assuming we could
> use the whole 8kB page for the bloom filter, that'd be about 64k bits.
> In practice there'd be a bit of overhead (page header ...) but it's
> still much more than 11k bits.

Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:

ERROR:  index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

While playing around with the numbers I had an epiphany: At the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page. Here MT = max tuples
per 128 8k pages, or 37120, so default ndistinct is 3712.

n  k  m  p
MT/10  7  35580  0.01
MT/10  7  64000  0.0005
MT/10  12 64000  0.00025

Assuming ndistinct isn't way off reality, we get 20x-40x lower false
positive rate almost for free, and it'd be trivial to code! Keeping k
at 7 would be more robust, since it's equivalent to starting with n =
~6000, p = 0.006, which is still almost 2x less false positives than
you asked for. It also means nearly doubling the number of sorted
values before switching.

Going the other direction, capping nbits to 64k bits when ndistinct
gets too high, the false positive rate we can actually support starts
to drop. Here, the user requested 0.001 fpr.

n      k  p
4500   9  0.001
6000   7  0.006
7500   6  0.017
15000  3  0.129  (probably useless by now)
MT     1  0.440
64000  1  0.63   (possible with > 128 pages per range)

I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to see that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, like
brin_bloom_get_supported_fpr(). Or document to check with
pageinspect. And that's not even considering multi-column indexes,
like you mentioned.

> But I guess we can simply make the table
> of primes a bit larger, right?

If we want to support all the above cases without falling down
entirely, it would have to go up to 32k to be safe (When k = 1 we
could degenerate to one modulus on the whole filter). That would be a
table of about 7kB, which we could binary search. [thinks for a
moment]...Actually, if we wanted to be clever, maybe we could
precalculate the primes needed for the 64k bit cases and stick them at
the end of the array. The usual algorithm will find them. That way, we
could keep the array around 2kB. However, for >8kB block size, we
couldn't adjust the 64k number, which might be okay, but not really
project policy.

We could also generate the primes via a sieve instead, which is really
fast (and more code). That would be more robust, but that would
require the filter to store the actual primes used, so 20 more bytes
at max k = 10. We could hard-code space for that, or to keep from
hard-coding maximum k and thus lowest possible false positive rate,
we'd need more bookkeeping.
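
For illustration, the sieve itself is only a handful of lines (just a
sketch, not meant to match whatever ends up in the patch):

#include <stdlib.h>

/* Sieve of Eratosthenes: collect primes up to 'limit', return the count */
static int
generate_primes(int limit, int *primes, int maxprimes)
{
    char   *composite = calloc(limit + 1, 1);
    int     nprimes = 0;

    if (composite == NULL)
        return 0;

    for (int p = 2; p <= limit && nprimes < maxprimes; p++)
    {
        if (composite[p])
            continue;
        primes[nprimes++] = p;
        for (int q = 2 * p; q <= limit; q += p)
            composite[q] = 1;
    }

    free(composite);
    return nprimes;
}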

So, with the partition approach, we'd likely have to set in stone
either max nbits, or min target false positive rate. The latter option
seems more principled, not only for the block size, but also since the
target fp rate is already fixed by the reloption, and as I've
demonstrated, we can often go above and beyond the reloption even
without changing k.

> >One wrinkle is that the sum of k primes is not going to match m
> >exactly. If the sum is too small, we can trim a few bits off of the
> >filter bitmap. If the sum is too large, the last partition can spill
> >into the front of the first one. This shouldn't matter much in the
> >common case since we need to round m to the nearest byte anyway.
> >
>
> AFAIK the paper simply says that as long as the sum of partitions is
> close to the requested nbits, it's good enough. So I guess we could just
> roll with that, no need to trim/wrap or something like that.

Hmm, I'm not sure I understand you. I can see not caring to trim
wasted bits, but we can't set/read off the end of the filter. If we
don't wrap, we could just skip reading/writing that bit. So a tiny
portion of access would be at k - 1. The paper is not 

Re: WIP: BRIN multi-range indexes

2020-09-18 Thread Tomas Vondra

On Fri, Sep 18, 2020 at 05:06:49PM -0400, John Naylor wrote:

On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
 wrote:


On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
>I wrote:
>
>> Hmm, I came across that paper while doing background reading. Okay,
>> now I get that "% (filter->nbits - 1)" is the second hash function in
>> that scheme. But now I wonder if that second function should actually
>> act on the passed "value" (the original hash), so that they are
>> actually independent, as required. In the language of that paper, the
>> patch seems to have
>>
>> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>>
>> instead of
>>
>> g(x) = h1(x) + i*h2(x) + f(i)
>>
>> Concretely, I'm wondering if it should be:
>>
>>  big_h = DatumGetUint32(hash_uint32(value));
>>  h = big_h % filter->nbits;
>> -d = big_h % (filter->nbits - 1);
>> +d = value % (filter->nbits - 1);
>>
>> But I could be wrong.
>
>I'm wrong -- if we use different operands to the moduli, we throw away
>the assumption of co-primeness. But I'm still left wondering why we
>have to re-hash the hash for this to work. In any case, there should
>be some more documentation around the core algorithm, so that future
>readers are not left scratching their heads.
>

Hmm, good question. I think we don't really need to hash it twice. It
does not really achieve anything - it won't reduce the number of collisions
or anything like that.


Yeah, looking back at the discussion you linked previously, I think
it's a holdover from when the uint32 was rehashed with k different
seeds. Anyway, after thinking about it some more, I still have doubts
about the mapping algorithm. There are two stages to a hash mapping --
hashing and modulus. I don't think a single hash function (whether
rehashed or not) can be turned into two independent functions via a
choice of second modulus. At least, that's not what the Kirsch &
Mitzenmacher paper is claiming. Since we're not actually applying two
independent hash functions on the scan key, we're kind of shooting in
the dark.



OK. I admit the modulo by nbits and (nbits - 1) is a bit suspicious, so
you may be right that this is not quite a correct construction.

The current scheme was meant to reduce the number of expensive hashing
calls (especially for low fpr values we may require quite a few of
those, easily 10 or more).

But maybe we could still use this scheme by actually computing

   h1 = hash_uint32_extended(value, seed1);
   h2 = hash_uint32_extended(value, seed2);

and then use this as the independent hash functions. I think that would
meet the requirements of the paper.
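
Roughly like this, I mean (just a sketch -- the seeds are arbitrary, and
the BloomFilter fields are placeholders, not the actual struct):

static void
bloom_add_hashed(BloomFilter *filter, uint32 value)
{
    uint32  h1 = (uint32) hash_uint32_extended(value, 0x4079ba95);
    uint32  h2 = (uint32) hash_uint32_extended(value, 0x87c37b91);
    int     i;

    for (i = 0; i < filter->nhashes; i++)
    {
        /* classic g_i(x) = h1(x) + i * h2(x), mapped into the bitmap */
        uint32  bit = (h1 + (uint32) i * h2) % filter->nbits;

        filter->data[bit / 8] |= (1 << (bit % 8));
    }
}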


It turns out there is something called a one-hash bloom filter, and
the paper in [1] has a straightforward algorithm. Since we can
implement it exactly as stated in the paper, that gives me more
confidence in the real-world false positive rate. It goes like this:

Partition the filter bitmap into k partitions of similar but unequal
length, corresponding to consecutive prime numbers. Use the primes for
moduli of the uint32 value and map it to the bit of the corresponding
partition. For a simple example, let's use 7, 11, 13 for partitions in
a filter of size 31. The three bits are:

value % 7
7 + (value % 11)
7 + 11 + (value % 13)

We could store a const array of the first 256 primes. The largest such
prime is 1613, so with k=7 we can support up to ~11k bits, which is
more than we'd like to store anyway. Then we store the array index of
the largest prime in the 8bits of padding we currently have in
BloomFilter struct.



Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits. But I guess we can simply make the table
of primes a bit larger, right?

FWIW I don't think we need to be that careful about the space to store
stuff in padding etc. If we can - great, but compared to the size of the
filter it's negligible and I'd prioritize simplicity over a byte or two.


One wrinkle is that the sum of k primes is not going to match m
exactly. If the sum is too small, we can trim a few bits off of the
filter bitmap. If the sum is too large, the last partition can spill
into the front of the first one. This shouldn't matter much in the
common case since we need to round m to the nearest byte anyway.



AFAIK the paper simply says that as long as the sum of partitions is
close to the requested nbits, it's good enough. So I guess we could just
roll with that, no need to trim/wrap or something like that.


This should be pretty straightforward to turn into code and I can take
a stab at it. Thoughts?



Sure, go ahead. I'm happy someone is actually looking at those patches
and proposing alternative solutions, and this might easily be a better
hashing scheme.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-18 Thread John Naylor
On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
 wrote:
>
> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
> >I wrote:
> >
> >> Hmm, I came across that paper while doing background reading. Okay,
> >> now I get that "% (filter->nbits - 1)" is the second hash function in
> >> that scheme. But now I wonder if that second function should actually
> >> act on the passed "value" (the original hash), so that they are
> >> actually independent, as required. In the language of that paper, the
> >> patch seems to have
> >>
> >> g(x) = h1(x) + i*h2(h1(x)) + f(i)
> >>
> >> instead of
> >>
> >> g(x) = h1(x) + i*h2(x) + f(i)
> >>
> >> Concretely, I'm wondering if it should be:
> >>
> >>  big_h = DatumGetUint32(hash_uint32(value));
> >>  h = big_h % filter->nbits;
> >> -d = big_h % (filter->nbits - 1);
> >> +d = value % (filter->nbits - 1);
> >>
> >> But I could be wrong.
> >
> >I'm wrong -- if we use different operands to the moduli, we throw away
> >the assumption of co-primeness. But I'm still left wondering why we
> >have to re-hash the hash for this to work. In any case, there should
> >be some more documentation around the core algorithm, so that future
> >readers are not left scratching their heads.
> >
>
> Hmm, good question. I think we don't really need to hash it twice. It
> does not really achieve anything - it won't reduce the number of collisions
> or anything like that.

Yeah, looking back at the discussion you linked previously, I think
it's a holdover from when the uint32 was rehashed with k different
seeds. Anyway, after thinking about it some more, I still have doubts
about the mapping algorithm. There are two stages to a hash mapping --
hashing and modulus. I don't think a single hash function (whether
rehashed or not) can be turned into two independent functions via a
choice of second modulus. At least, that's not what the Kirsch &
Mitzenmacher paper is claiming. Since we're not actually applying two
independent hash functions on the scan key, we're kind of shooting in
the dark.

It turns out there is something called a one-hash bloom filter, and
the paper in [1] has a straightforward algorithm. Since we can
implement it exactly as stated in the paper, that gives me more
confidence in the real-world false positive rate. It goes like this:

Partition the filter bitmap into k partitions of similar but unequal
length, corresponding to consecutive prime numbers. Use the primes for
moduli of the uint32 value and map it to the bit of the corresponding
partition. For a simple example, let's use 7, 11, 13 for partitions in
a filter of size 31. The three bits are:

value % 7
7 + (value % 11)
7 + 11 + (value % 13)
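
In code the mapping is just a running offset over the partition lengths.
A toy sketch of the example above (not the eventual patch):

/* partition lengths for the example: 7 + 11 + 13 = 31 bits total */
static const int partlen[] = {7, 11, 13};

static void
one_hash_bloom_add(unsigned char *bitmap, unsigned int value)
{
    int     offset = 0;
    int     i;

    for (i = 0; i < 3; i++)
    {
        int     bit = offset + (int) (value % partlen[i]);

        bitmap[bit / 8] |= (1 << (bit % 8));
        offset += partlen[i];
    }
}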

We could store a const array of the first 256 primes. The largest such
prime is 1613, so with k=7 we can support up to ~11k bits, which is
more than we'd like to store anyway. Then we store the array index of
the largest prime in the 8bits of padding we currently have in
BloomFilter struct.

One wrinkle is that the sum of k primes is not going to match m
exactly. If the sum is too small, we can trim a few bits off of the
filter bitmap. If the sum is too large, the last partition can spill
into the front of the first one. This shouldn't matter much in the
common case since we need to round m to the nearest byte anyway.

This should be pretty straightforward to turn into code and I can take
a stab at it. Thoughts?

[1] https://www.researchgate.net/publication/284283336_One-Hashing_Bloom_Filter

-- 
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-18 Thread Tomas Vondra

On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:

I wrote:


Hmm, I came across that paper while doing background reading. Okay,
now I get that "% (filter->nbits - 1)" is the second hash function in
that scheme. But now I wonder if that second function should actually
act on the passed "value" (the original hash), so that they are
actually independent, as required. In the language of that paper, the
patch seems to have

g(x) = h1(x) + i*h2(h1(x)) + f(i)

instead of

g(x) = h1(x) + i*h2(x) + f(i)

Concretely, I'm wondering if it should be:

 big_h = DatumGetUint32(hash_uint32(value));
 h = big_h % filter->nbits;
-d = big_h % (filter->nbits - 1);
+d = value % (filter->nbits - 1);

But I could be wrong.


I'm wrong -- if we use different operands to the moduli, we throw away
the assumption of co-primeness. But I'm still left wondering why we
have to re-hash the hash for this to work. In any case, there should
be some more documentation around the core algorithm, so that future
readers are not left scratching their heads.



Hmm, good question. I think we don't really need to hash it twice. It
does not really achieve anything - it won't reduce the number of collisions
or anything like that.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-18 Thread Tomas Vondra

On Thu, Sep 17, 2020 at 05:42:59PM -0400, John Naylor wrote:

On Thu, Sep 17, 2020 at 12:34 PM Tomas Vondra
 wrote:


On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
>On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
> wrote:
>> <20200913 patch set>
But those are opclass parameters, so the parameters are not specified in
the WITH clause, but right after the opclass name:

CREATE INDEX idx ON table USING brin (
bigint_col int8_minmax_multi_ops(values_per_range = 15)
);


D'oh!


>+ * The index only supports equality operator, similarly to hash indexes.
>
>s/operator/operators/
>

Hmmm, are there really multiple equality operators?


Ah, I see what you meant -- then "_the_ equality operator" is what we want.


The number of items the comment refers to is this:

 /* how many uint32 hashes can we fit into the bitmap */
 int maxvalues = filter->nbits / (8 * sizeof(uint32));

where nbits is the size of the bloom filter. So I think the "same" is
quite right here.


Ok, I get it now.


>+ /*
>+ * The 1% value is mostly arbitrary, it just looks nice.
>+ */
>+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
>
>I think we'd want better stated justification for this default, even
>if just precedence in other implementations. Maybe I can test some
>other values here?
>

Well, I don't know how to pick a better default :-( Ultimately it's a
trade-off between larger indexes and scanning a larger fraction of a
table due to a higher false positive rate. Then there's the restriction
that the whole index tuple needs to fit into a single 8kB page.


Well, it might be a perfectly good default, and it seems common in
articles on the topic, but the comment is referring to aesthetics. :-)
I still intend to test some cases.



I think we may formulate this as a question of how much I/O we need to
do for a random query, and pick the false positive rate minimizing that.
For a single BRIN range an approximation might look like this:

  bloom_size(fpr, ...) + (fpr * range_size) + (selectivity * range_size)

The "selectivity" shows the true selectivity of ranges, and it might be
esimated from a per-row selectivity I guess. But it does not matter much
because this is constant and independent of the false-positive rate, so
we can ignore it. Which leaves us with

  bloom_size(fpr, ...) + (fpr * range_size)

We might solve this for fixed parameters (range_size, ndistinct, ...),
either analytically or by brute force, giving us the "optimal" fpr.

The trouble is the bloom_size is restricted, and we don't really know
the limit - the whole index tuple needs to fit on a single 8kB page, and
there may be other BRIN summaries etc. So I've opted to use a somewhat
defensive default for the false positive rate.
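
Just to illustrate the brute-force idea (a sketch only -- bloom_size here
is the usual -n*ln(p)/(ln 2)^2 sizing, the cost units are bytes, and the
numbers are purely indicative):

#include <math.h>

/* filter size in bytes for ndistinct values and false positive rate fpr */
static double
bloom_size_bytes(double ndistinct, double fpr)
{
    return ceil(-ndistinct * log(fpr) / pow(log(2.0), 2)) / 8.0;
}

/* pick the fpr minimizing (filter size + expected extra heap I/O) */
static double
optimal_fpr(double ndistinct, double range_size_bytes)
{
    double  best_fpr = 0.01;
    double  best_cost = -1;
    double  fpr;

    for (fpr = 0.0001; fpr <= 0.1; fpr *= 1.25)
    {
        double  cost = bloom_size_bytes(ndistinct, fpr)
            + fpr * range_size_bytes;

        if (best_cost < 0 || cost < best_cost)
        {
            best_cost = cost;
            best_fpr = fpr;
        }
    }

    return best_fpr;
}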


>Also, is there a reference for the algorithm for hash values that
>follows? I didn't see anything like it in my cursory reading of the
>topic. Might be good to include it in the comments.
>

This was suggested by Yura Sokolov [1] in 2017. The post refers to a
paper [2] but I don't recall which part describes "our" algorithm.

[1] 
https://www.postgresql.org/message-id/94c173b54a0aef6ae9b18157ef52f...@postgrespro.ru
[2] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf


Hmm, I came across that paper while doing background reading. Okay,
now I get that "% (filter->nbits - 1)" is the second hash function in
that scheme. But now I wonder if that second function should actually
act on the passed "value" (the original hash), so that they are
actually independent, as required. In the language of that paper, the
patch seems to have

g(x) = h1(x) + i*h2(h1(x)) + f(i)

instead of

g(x) = h1(x) + i*h2(x) + f(i)

Concretely, I'm wondering if it should be:

big_h = DatumGetUint32(hash_uint32(value));
h = big_h % filter->nbits;
-d = big_h % (filter->nbits - 1);
+d = value % (filter->nbits - 1);

But I could be wrong.

Also, I take it that f(i) = 1 in the patch, hence the increment here:

+ h += d++;

But it's a little hidden. That might be worth commenting, if I haven't
completely missed something.



OK


>+ * Tweak the ndistinct value based on the pagesPerRange value. First,
>
>Nitpick: "Tweak" to my mind means to adjust an existing value. The
>above is only true if ndistinct is negative, but we're really not
>tweaking, but using it as a scale factor. Otherwise it's not adjusted,
>only clamped.
>

OK. Perhaps 'adjust' would be a better term?


I felt like rewriting the whole thing, but your original gets the
point across ok, really.

"If the passed ndistinct value is positive, we can just use that, but
we also clamp the value to prevent over-sizing the bloom filter
unnecessarily. If it's negative, it represents a multiplier to apply
to the maximum number of tuples in the range (assuming each page gets
MaxHeapTuplesPerPage tuples, which is likely a significant
over-estimate), similar to the corresponding value in table
statistics."


>+ /* TODO include the sorted/unsorted values */
>

This was implemented as 

Re: WIP: BRIN multi-range indexes

2020-09-17 Thread John Naylor
I wrote:

> Hmm, I came across that paper while doing background reading. Okay,
> now I get that "% (filter->nbits - 1)" is the second hash function in
> that scheme. But now I wonder if that second function should actually
> act on the passed "value" (the original hash), so that they are
> actually independent, as required. In the language of that paper, the
> patch seems to have
>
> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>
> instead of
>
> g(x) = h1(x) + i*h2(x) + f(i)
>
> Concretely, I'm wondering if it should be:
>
>  big_h = DatumGetUint32(hash_uint32(value));
>  h = big_h % filter->nbits;
> -d = big_h % (filter->nbits - 1);
> +d = value % (filter->nbits - 1);
>
> But I could be wrong.

I'm wrong -- if we use different operands to the moduli, we throw away
the assumption of co-primeness. But I'm still left wondering why we
have to re-hash the hash for this to work. In any case, there should
be some more documentation around the core algorithm, so that future
readers are not left scratching their heads.

-- 
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-17 Thread John Naylor
On Thu, Sep 17, 2020 at 12:34 PM Tomas Vondra
 wrote:
>
> On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
> >On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
> > wrote:
> >> <20200913 patch set>
> But those are opclass parameters, so the parameters are not specified in
> the WITH clause, but right after the opclass name:
>
> CREATE INDEX idx ON table USING brin (
> bigint_col int8_minmax_multi_ops(values_per_range = 15)
> );

D'oh!

> >+ * The index only supports equality operator, similarly to hash indexes.
> >
> >s/operator/operators/
> >
>
> Hmmm, are there really multiple equality operators?

Ah, I see what you meant -- then "_the_ equality operator" is what we want.

> The number of items the comment refers to is this:
>
>  /* how many uint32 hashes can we fit into the bitmap */
>  int maxvalues = filter->nbits / (8 * sizeof(uint32));
>
> where nbits is the size of the bloom filter. So I think the "same" is
> quite right here.

Ok, I get it now.

> >+ /*
> >+ * The 1% value is mostly arbitrary, it just looks nice.
> >+ */
> >+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
> >
> >I think we'd want better stated justification for this default, even
> >if just precedence in other implementations. Maybe I can test some
> >other values here?
> >
>
> Well, I don't know how to pick a better default :-( Ultimately it's a
> trade-off between larger indexes and scanning a larger fraction of a
> table due to a higher false positive rate. Then there's the restriction
> that the whole index tuple needs to fit into a single 8kB page.

Well, it might be a perfectly good default, and it seems common in
articles on the topic, but the comment is referring to aesthetics. :-)
I still intend to test some cases.

> >Also, is there a reference for the algorithm for hash values that
> >follows? I didn't see anything like it in my cursory reading of the
> >topic. Might be good to include it in the comments.
> >
>
> This was suggested by Yura Sokolov [1] in 2017. The post refers to a
> paper [2] but I don't recall which part describes "our" algorithm.
>
> [1] 
> https://www.postgresql.org/message-id/94c173b54a0aef6ae9b18157ef52f...@postgrespro.ru
> [2] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

Hmm, I came across that paper while doing background reading. Okay,
now I get that "% (filter->nbits - 1)" is the second hash function in
that scheme. But now I wonder if that second function should actually
act on the passed "value" (the original hash), so that they are
actually independent, as required. In the language of that paper, the
patch seems to have

g(x) = h1(x) + i*h2(h1(x)) + f(i)

instead of

g(x) = h1(x) + i*h2(x) + f(i)

Concretely, I'm wondering if it should be:

 big_h = DatumGetUint32(hash_uint32(value));
 h = big_h % filter->nbits;
-d = big_h % (filter->nbits - 1);
+d = value % (filter->nbits - 1);

But I could be wrong.

Also, I take it that f(i) = 1 in the patch, hence the increment here:

+ h += d++;

But it's a little hidden. That might be worth commenting, if I haven't
completely missed something.
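
To spell out my reading of it (a sketch, not the patch's code -- the
field names and the wrap-around details are guesses on my part):

static void
bloom_add_value(BloomFilter *filter, uint32 big_h)
{
    uint32  h = big_h % filter->nbits;
    uint32  d = big_h % (filter->nbits - 1);
    int     i;

    for (i = 0; i < filter->nhashes; i++)
    {
        filter->data[h / 8] |= (1 << (h % 8));

        h = (h + d) % filter->nbits;        /* advance by d */
        d = (d + 1) % (filter->nbits - 1);  /* the "h += d++" part */
    }
}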

> >+ * Tweak the ndistinct value based on the pagesPerRange value. First,
> >
> >Nitpick: "Tweak" to my mind means to adjust an existing value. The
> >above is only true if ndistinct is negative, but we're really not
> >tweaking, but using it as a scale factor. Otherwise it's not adjusted,
> >only clamped.
> >
>
> OK. Perhaps 'adjust' would be a better term?

I felt like rewriting the whole thing, but your original gets the
point across ok, really.

"If the passed ndistinct value is positive, we can just use that, but
we also clamp the value to prevent over-sizing the bloom filter
unnecessarily. If it's negative, it represents a multiplier to apply
to the maximum number of tuples in the range (assuming each page gets
MaxHeapTuplesPerPage tuples, which is likely a significant
over-estimate), similar to the corresponding value in table
statistics."
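
Or, in code form, roughly (a sketch of the logic as I read it, using the
usual MaxHeapTuplesPerPage, not the actual function):

static double
bloom_ndistinct(double ndistinct, BlockNumber pagesPerRange)
{
    double  maxtuples = (double) pagesPerRange * MaxHeapTuplesPerPage;

    if (ndistinct > 0)
        return (ndistinct < maxtuples) ? ndistinct : maxtuples; /* clamp */

    /* negative value is a multiplier on the per-range maximum */
    return (-ndistinct) * maxtuples;
}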

> >+ /* TODO include the sorted/unsorted values */
> >
>
> This was implemented as part of the discussion about pageinspect, and
> I wanted some confirmation if the approach is acceptable or not before
> spending more time on it. Also, the values are really just hashes of the
> column values, so I'm not quite sure it makes sense to include that.
> What would you suggest?

My gut feeling is the hashes are not useful for this purpose, but I
don't feel strongly either way.

--
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-17 Thread John Naylor
On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
 wrote:
> <20200913 patch set>

Hi Tomas,

The cfbot fails to apply this, but with 0001 from 0912 it works on my
end, so going with that.

One problem I have is I don't get success with the new reloptions:

create index cd_multi on iot using brin(create_dt
timestamptz_minmax_multi_ops) with (values_per_range = 64);
ERROR:  unrecognized parameter "values_per_range"

create index  on iot using brin(create_dt timestamptz_bloom_ops) with
(n_distinct_per_range = 16);
ERROR:  unrecognized parameter "n_distinct_per_range"


Aside from that, I'm going to try to understand the code, and ask
questions. Some of the code might still change, but I don't think it's
too early to do some comment and docs proofreading. I'll do this in
separate emails for bloom and multi-minmax to keep it from being too
long. Perf testing will come sometime later.


Bloom
-

+ greater than 0.0 and smaller than 1.0. The default values is 0.01,

+ rows per block). The default values is -0.1, and

s/values/value/

+ the minimum number of distinct non-null values is 128.

I don't see 128 in the code, but I do see this, is this the intention?:

#define BLOOM_MIN_NDISTINCT_PER_RANGE 16

+ * Bloom filters allow efficient test whether a given page range contains
+ * a particular value. Therefore, if we summarize each page range into a
+ * bloom filter, we can easily and cheaply test wheter it containst values
+ * we get later.

s/test/testing/
s/wheter it containst/whether it contains/

+ * The index only supports equality operator, similarly to hash indexes.

s/operator/operators/

+ * The number of distinct elements (in a page range) depends on the data,
+ * we can consider it fixed. This simplifies the trade-off to just false
+ * positive rate vs. size.

Sounds like the first sentence should start with "although".

+ * of distinct items to be stored in the filter. We can't quite the input
+ * data, of course, but we may make the BRIN page ranges smaller - instead

I think you accidentally a word.

+ * Of course, good sizing decisions depend on having the necessary data,
+ * i.e. number of distinct values in a page range (of a given size) and
+ * table size (to estimate cost change due to change in false positive
+ * rate due to having larger index vs. scanning larger indexes). We may
+ * not have that data - for example when building an index on empty table
+ * it's not really possible. And for some data we only have estimates for
+ * the whole table and we can only estimate per-range values (ndistinct).

and

+ * The current logic, implemented in brin_bloom_get_ndistinct, attempts to
+ * make some basic sizing decisions, based on the table ndistinct estimate.

+ * XXX We might also fetch info about ndistinct estimate for the column,
+ * and compute the expected number of distinct values in a range. But

Maybe I'm missing something, but the first two comments don't match
the last one -- I don't see where we get table ndistinct, which I take
to mean from the stats catalog?

+ * To address these issues, the opclass stores the raw values directly, and
+ * only switches to the actual bloom filter after reaching the same space
+ * requirements.

IIUC, it's after reaching a certain size (BLOOM_MAX_UNSORTED * 4), so
"same" doesn't make sense here.

+ /*
+ * The 1% value is mostly arbitrary, it just looks nice.
+ */
+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */

I think we'd want better stated justification for this default, even
if just precedence in other implementations. Maybe I can test some
other values here?

+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.

Yeah, I think it's obvious enough to leave out.

+ m = ceil((ndistinct * log(false_positive_rate)) / log(1.0 /
(pow(2.0, log(2.0);

I find this pretty hard to read and pgindent is going to mess it up
further. I would have a comment with the formula in math notation
(note that you can dispense with the reciprocal and just use
negation), but in code fold the last part to a constant. That might go
against project style, though:

m = ceil(ndistinct * log(false_positive_rate) * -2.08136);
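
(For the record, that constant is just the original expression folded:
log(1.0 / pow(2.0, log(2.0))) works out to -(ln 2)^2 ~= -0.4805, and
1 / 0.4805 ~= 2.08136, so dividing by the former is the same as
multiplying by -2.08136.)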

+ * XXX Maybe the 64B min size is not really needed?

Something to figure out before commit?

+ /* assume 'not updated' by default */
+ Assert(filter);

I don't see how these are related.

+ big_h = ((uint32) DatumGetInt64(hash_uint32(value)));

I'm curious about the Int64 part -- most callers use the bare value or
with DatumGetUInt32().

Also, is there a reference for the algorithm for hash values that
follows? I didn't see anything like it in my cursory reading of the
topic. Might be good to include it in the comments.

+ * Tweak the ndistinct value based on the pagesPerRange value. First,

Nitpick: "Tweak" to my mind means to adjust an existing value. The
above is only true if ndistinct is negative, but we're really not
tweaking, 

Re: WIP: BRIN multi-range indexes

2020-09-11 Thread John Naylor
On Fri, Sep 11, 2020 at 2:05 PM Tomas Vondra
 wrote:
> that bad. But yeah, testing and benchmarking it would be nice. Do you
> plan to test just the minmax-multi opclass, or will you look at the
> bloom one too?

Yeah, I'll start looking at bloom next week, and I'll include it when
I do perf testing.

-- 
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-11 Thread John Naylor
On Fri, Sep 11, 2020 at 6:14 AM Tomas Vondra
 wrote:

> I understand. I just feel a bit uneasy about replacing an index with
> something that may or may not be better for a certain use case. I mean,
> if you have data set for which regular minmax works fine, wouldn't you
> be annoyed if we just switched it for something slower?

How about making multi minmax the default for new indexes, and those
who know their data will stay very well correlated can specify simple
minmax ops for speed? Upgraded indexes would stay the same, and only
new ones would have the risk of slowdown if not attended to.

Also, I wonder if the slowdown in building a new index is similar to
the slowdown for updates. I'd like to run some TPC-H tests (that will
take some time).

-- 
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-10 Thread John Naylor
Ok, here's an attempt at a somewhat more natural test, to see what
happens after bulk updates and deletes, followed by more inserts. The
short version is that multi-minmax is resilient to a change that
causes a 4x degradation for simple minmax.


shared_buffers = 1GB
random_page_cost = 1.1
effective_cache_size = 4GB
work_mem = 64MB
maintenance_work_mem = 512MB

create unlogged table iot (
id bigint generated by default as identity primary key,
num double precision not null,
create_dt timestamptz not null,
stuff text generated always as (md5(id::text)) stored
)
with (fillfactor = 95);

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-01 0:00'::timestamptz +'49000999 seconds'::interval,
  '2 seconds'::interval) x;

INSERT 0 24500500

(01:18s, 2279 MB)

-- done in separate tests so the planner can choose each in turn
create index cd_single on iot using brin(create_dt);
6.7s
create index cd_multi on iot using brin(create_dt timestamptz_minmax_multi_ops);
34s

vacuum analyze;

-- aggregate February
-- single minmax and multi-minmax same plan and same Heap Blocks
below, so only one plan shown
-- query times between the opclasses within noise of variation

explain analyze select date_trunc('day', create_dt), avg(num)
from iot
where create_dt >= '2020-02-01 0:00' and create_dt < '2020-03-01 0:00'
group by 1;


  QUERY PLAN

 HashAggregate  (cost=357664.79..388181.83 rows=1232234 width=16)
(actual time=559.805..561.649 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=323.74..313622.05 rows=1232234
width=16) (actual time=1.787..368.256 rows=1252800 loops=1)
 Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Rows Removed by Index Recheck: 15936
 Heap Blocks: lossy=15104
 ->  Bitmap Index Scan on cd_single  (cost=0.00..15.68
rows=1236315 width=0) (actual time=0.933..0.934 rows=151040 loops=1)
   Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.118 ms
 Execution Time: 568.653 ms
(11 rows)


-- delete first month and hi/lo values to create some holes in the table
delete from iot
where create_dt < '2020-02-01 0:00'::timestamptz;

DELETE 1339200

delete from iot
where num < 0.05
or num > 0.95;

DELETE 2316036

vacuum analyze iot;

-- add back the first month, but with double density (1s step rather
than 2s) so it spills over into other parts of the table, causing more
block ranges to have a lower bound in this month.

insert into iot (num, create_dt)
select random(), x
from generate_series(
  '2020-01-01 0:00'::timestamptz,
  '2020-01-31 23:59'::timestamptz,
  '1 second'::interval) x;

INSERT 0 2678341

vacuum analyze;

-- aggregate February again

explain analyze select date_trunc('day', create_dt), avg(num)
from iot
where create_dt >= '2020-02-01 0:00' and create_dt < '2020-03-01 0:00'
group by 1;

-- simple minmax:

  QUERY PLAN

 HashAggregate  (cost=354453.63..383192.38 rows=1160429 width=16)
(actual time=2375.075..2376.982 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  (cost=305.85..312977.36 rows=1160429
width=16) (actual time=8.162..2201.547 rows=1127668 loops=1)
 Recheck Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Rows Removed by Index Recheck: 12278985
 Heap Blocks: lossy=159616
 ->  Bitmap Index Scan on cd_single  (cost=0.00..15.74
rows=1206496 width=0) (actual time=7.177..7.178 rows=1596160 loops=1)
   Index Cond: ((create_dt >= '2020-02-01
00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01
00:00:00-04'::timestamp with time zone))
 Planning Time: 0.117 ms
 Execution Time: 2383.685 ms
(11 rows)

-- multi minmax:

  QUERY PLAN

 HashAggregate  (cost=354089.57..382932.46 rows=1164634 width=16)
(actual time=535.773..537.731 rows=29 loops=1)
   Group Key: date_trunc('day'::text, create_dt)
   Planned Partitions: 4  Batches: 1  Memory Usage: 24601kB
   ->  Bitmap Heap Scan on iot  

Re: WIP: BRIN multi-range indexes

2020-09-10 Thread Alvaro Herrera
On 2020-Sep-10, Tomas Vondra wrote:

> I've spent a bit of time experimenting with this. My idea was to allow
> keeping an "expanded" version of the summary somewhere. As the addValue
> function only receives BrinValues I guess one option would be to just
> add bv_mem_values field. Or do you have a better idea?

Maybe it's okay to pass the BrinMemTuple to the add_value function, and
keep something there.  Or maybe that's pointless and just a new field in
BrinValues is okay.

> Of course, more would need to be done:
> 
> 1) We'd need to also pass the right memory context (bt_context seems
> like the right thing, but that's not something addValue sees now).

You could use GetMemoryChunkContext() for that.

> 2) We'd also need to specify some sort of callback that serializes the
> in-memory value into bt_values. That's not something addValue can do,
> because it doesn't know whether it's the last value in the range etc. I
> guess one option would be to add yet another support proc, but I guess a
> simple callback would be enough.

Hmm.

> I've hacked together an experimental version of this to see how much
> would it help, and it reduces the duration from ~4.6s to ~3.3s. Which is
> nice, but plain minmax is ~1.1s. I suppose there's room for further
> improvements in compare_combine_ranges/reduce_combine_ranges and so on,
> but I still think there'll always be a gap compared to plain minmax.

The main reason I'm talking about desupporting plain minmax is that,
even if it's amazingly fast, it loses quite quickly in real-world cases
because of loss of correlation.  Minmax's build time is pretty much
determined by speed at which you can seqscan the table.  I don't think
we lose much if we add overhead in order to create an index that is 100x
more useful.

-- 
Álvaro Herrera   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-10 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 10:26:00PM +0200, Tomas Vondra wrote:

On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, Tomas Vondra wrote:


There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.


Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.



Sure. I was just suggesting it's not something that'd make this very
close to plain minmax opclass.


The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.


Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:


Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.




Well, the patch already has Ranges (memory) and SerializedRanges (disk)
but it's not very clear to me where to stash the in-memory data and
where to make the conversion.



I've spent a bit of time experimenting with this. My idea was to allow
keeping an "expanded" version of the summary somewhere. As the addValue
function only receives BrinValues I guess one option would be to just
add bv_mem_values field. Or do you have a better idea?

Of course, more would need to be done:

1) We'd need to also pass the right memory context (bt_context seems
like the right thing, but that's not something addValue sees now).

2) We'd also need to specify some sort of callback that serializes the
in-memory value into bt_values. That's not something addValue can do,
because it doesn't know whether it's the last value in the range etc. I
guess one option would be to add yet another support proc, but I guess a
simple callback would be enough.

I've hacked together an experimental version of this to see how much
would it help, and it reduces the duration from ~4.6s to ~3.3s. Which is
nice, but plain minmax is ~1.1s. I suppose there's room for further
improvements in compare_combine_ranges/reduce_combine_ranges and so on,
but I still think there'll always be a gap compared to plain minmax.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, Tomas Vondra wrote:


There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.


Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.



Sure. I was just suggesting it's not something that'd make this very
close to plain minmax opclass.


The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.


Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:


Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.




Well, the patch already has Ranges (memory) and SerializedRanges (disk)
but it's not very clear to me where to stash the in-memory data and
where to make the conversion.



Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.



Yeah, I agree. I plan to review those once again in a couple days and
then push them.



On a different train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.



I'm not a huge fan of that. I think it's unlikely we'll ever make this
new set of opclasses just as fast as the plain minmax, and moreover it
does have some additional requirements (e.g. the distance proc, which
may not make sense for some data types).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Tomas Vondra wrote:

> There are some minor optimizations possible - for example I noticed we
> call minmax_multi_get_strategy_procinfo often because it happens in a
> loop, and we could easily do it just once. But that saves only about 10%
> or so, it's not a ground-breaking optimization.

Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.

> The main reason for the slowness is that we pass the values one by one
> to brin_minmax_multi_add_value - and on each call we need to deserialize
> (and then sometimes also serialize) the summary, which may be quite
> expensive. The regular minmax does not have this issue, it just swaps
> the Datum value and that's it.

Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:

> Another option would be to teach add_value to keep the deserialized
> summary somewhere, and then force serialization at the end of the BRIN
> page range. The end result would be roughly the same, I think.


Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.


On a different train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.

-- 
Álvaro Herrera   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 03:40:41PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, John Naylor wrote:


create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)



create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)


It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?



There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.

The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.

I see two possible optimizations - firstly, adding some sort of batch
variant of the add_value function, which would get a bunch of values
instead of just a single one, amortizing the serialization costs.

Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, John Naylor wrote:

> create index on t using brin (a);
> CREATE INDEX
> Time: 1631.452 ms (00:01.631)

> create index on t using brin (a int8_minmax_multi_ops);
> CREATE INDEX
> Time: 6521.026 ms (00:06.521)

It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?

> The second thing is, with parallel seq scan, the query is faster than
> a BRIN bitmap scan, with this pathological data distribution, but the
> planner won't choose it unless forced to:
> 
> set enable_bitmapscan = 'off';
> explain analyze select * from t
>   where a between 1923300::int and 1923600::int;

This is probably explained by the fact that you likely have the whole
table in shared buffers, or at least in OS cache.  I'm not sure if the
costing should necessarily account for this.

-- 
Álvaro Herrera   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 12:04:28PM -0400, John Naylor wrote:

On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
 wrote:


OK, here is a rebased version. Most of the breakage was due to changes
to the BRIN sgml docs.


Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
 from generate_series(1,1000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 1000, b = 1000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
 where a between 1923300::int and 1923600::int;

   QUERY PLAN
--
Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
  Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
  Rows Removed by Index Recheck: 712
  Heap Blocks: lossy=56819
  ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 0.328 ms
Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it does it choose the multi-minmax index:

explain analyze select * from t
 where a between 1923300::int and 1923600::int;

  QUERY PLAN
-
Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
  Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
  Rows Removed by Index Recheck: 22240
  Heap Blocks: lossy=128
  ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 0.250 ms
Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.



I think this is much more an estimation issue than a costing one. Notice
that in the "regular" BRIN minmax index we have this:

   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
   width=0) (actual time=3.054..3.055 rows=568320 loops=1)

while for the multi-minmax we have this:

   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
   width=0) (actual time=0.691..0.691 rows=1280 loops=1)

So yes, the multi-minmax index is costed a bit higher, mostly because
the index is a bit larger. (There's also a tweak to the correlation, but
that does not make much difference because it's just 0.99 vs. 1.0.)

The main difference is that for minmax the bitmap index scan actually
matches ~568k rows (a bit confusing, considering the heap scan has to
process almost 10M rows during recheck). But the multi-minmax only
matches ~1300 rows, with a recheck of 22k.

I'm not sure how to consider this during costing, as we only see these
numbers at execution time. One way would be to also consider "size" of
the ranges (i.e. max-min) vs. range of the whole column. But that's not
something we already have.
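
For instance, pageinspect already lets us eyeball how wide the stored
ranges are, which is roughly the information such an estimate would need
(a sketch - block 2 should be the first regular BRIN page here, and the
index names are the ones created above):

create extension if not exists pageinspect;

select itemoffset, value
  from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx')
 limit 5;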

I'm not sure how troublesome this issue really is - I don't think people
are very likely to have both minmax and multi-minmax indexes on the same
column.


The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
 where a between 1923300::int and 1923600::int;
 QUERY PLAN
---
Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)

Re: WIP: BRIN multi-range indexes

2020-09-09 Thread John Naylor
On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
 wrote:
>
> OK, here is a rebased version. Most of the breakage was due to changes
> to the BRIN sgml docs.

Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
  from generate_series(1,1000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 1000, b = 1000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 712
   Heap Blocks: lossy=56819
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
 Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.328 ms
 Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it, does it choose the multi-minmax index:

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

   QUERY PLAN
-
 Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 22240
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
 Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.250 ms
 Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.

The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
  where a between 1923300::int and 1923600::int;
  QUERY PLAN
---
 Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)
 Filter: ((a >= 1923300) AND (a <= 1923600))
 Rows Removed by Filter: 237
 Planning Time: 0.089 ms
 Execution Time: 380.434 ms
(8 rows)

And just to compare size:

BRIN         32kB
BRIN multi   136kB
Btree        188MB
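
(For reference, something like this reproduces the comparison - a sketch,
listing whatever indexes exist on t at that point:)

select c.relname, pg_size_pretty(pg_relation_size(c.oid)) as size
  from pg_class c
  join pg_index i on i.indexrelid = c.oid
 where i.indrelid = 't'::regclass;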

[1] 
https://www.postgresql.org/message-id/459eef3e-48c7-0f5a-8356-992442a78bb6%402ndquadrant.com

-- 
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-04 Thread Michael Paquier
On Fri, Aug 07, 2020 at 06:27:01PM +0200, Tomas Vondra wrote:
> Attached is an updated version of the patch series, implementing this.
> Adding the extra data types was fairly simple, because both bloom and
> minmax-multi indexes already used "struct as varlena" approach, so all
> that needed was a bunch of in/out functions and catalog records.
> 
> I've left the changes in separate patches for clarity, ultimately it'll
> get merged into the other parts.

This fails to apply per the CF bot, so please provide a rebase.
--
Michael


signature.asc
Description: PGP signature


Re: WIP: BRIN multi-range indexes

2020-08-04 Thread Tomas Vondra

On Tue, Aug 04, 2020 at 05:36:51PM +0300, Alexander Korotkov wrote:

Hi, Tomas!

Sorry for the late reply.

On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
 wrote:

I think there's a number of weak points in this approach.

Firstly, it assumes the summaries can be represented as arrays of
built-in types, which I'm not really sure about. It clearly is not true
for the bloom opclasses, for example. But even for minmax oclasses it's
going to be tricky because the ranges may be on different data types so
presumably we'd need somewhat nested data structure.

Moreover, multi-minmax summary contains either points or intervals,
which requires additional fields/flags to indicate that. That further
complicates the things ...

maybe we could decompose that into separate arrays or something, but
honestly it seems somewhat premature - there are far more important
aspects to discuss, I think (e.g. how the ranges are built/merged in
multi-minmax, or whether bloom opclasses are useful at all).


I see.  But there is at least a second option to introduce a new
datatype with just an output function.  In the similar way
gist/tsvector_ops uses gtsvector key type.  I think it would be more
transparent than using just bytea.  Also, this is the way we already
use in the core.



So you're proposing to have new data types "brin_minmax_multi_summary"
and "brin_bloom_summary" (or some other names), with output functions
printing something nicer? I suppose that could work, and we could even
add pageinspect functions returning the value as raw bytea.

Good idea!
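
Just to sketch it (the names are placeholders from this thread; in core
this would be catalog entries rather than plain SQL, and the input
function would simply throw an error, the way pg_trgm's gtrgm_in does):

create type brin_bloom_summary;            -- shell type

create function brin_bloom_summary_in(cstring)
returns brin_bloom_summary
as 'MODULE_PATHNAME' language c strict immutable;   -- just errors out

create function brin_bloom_summary_out(brin_bloom_summary)
returns cstring
as 'MODULE_PATHNAME' language c strict immutable;   -- prints filter stats

create type brin_bloom_summary (
    internallength = variable,
    input = brin_bloom_summary_in,
    output = brin_bloom_summary_out,
    storage = extended
);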


>BTW, I've applied the patchset to the current master, but I got a lot
>of duplicate oids.  Could you please resolve these conflicts.  I think
>it would be good to use high oid numbers to evade conflicts during
>development/review, and rely on committer to set final oids (as
>discussed in [1]).
>
>Links
>1. 
https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com

Did you use the patchset from 2020/07/03? I don't get any duplicate OIDs
with it, and it's already using quite high OIDs (part 4 uses >= 8000,
part 5 uses >= 9000).


Yep, it appears that I was using the wrong version of patchset.
Patchset from 2020/07/03 works good on the current master.



OK, good.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-08-04 Thread Alexander Korotkov
Hi, Tomas!

Sorry for the late reply.

On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
 wrote:
> I think there's a number of weak points in this approach.
>
> Firstly, it assumes the summaries can be represented as arrays of
> built-in types, which I'm not really sure about. It clearly is not true
> for the bloom opclasses, for example. But even for minmax oclasses it's
> going to be tricky because the ranges may be on different data types so
> presumably we'd need somewhat nested data structure.
>
> Moreover, multi-minmax summary contains either points or intervals,
> which requires additional fields/flags to indicate that. That further
> complicates the things ...
>
> maybe we could decompose that into separate arrays or something, but
> honestly it seems somewhat premature - there are far more important
> aspects to discuss, I think (e.g. how the ranges are built/merged in
> multi-minmax, or whether bloom opclasses are useful at all).

I see.  But there is at least a second option: introduce a new
datatype with just an output function, in a similar way to how
gist/tsvector_ops uses the gtsvector key type.  I think it would be more
transparent than using just bytea.  Also, this is an approach we already
use in core.

> >BTW, I've applied the patchset to the current master, but I got a lot
> >of duplicate oids.  Could you please resolve these conflicts.  I think
> >it would be good to use high oid numbers to evade conflicts during
> >development/review, and rely on committer to set final oids (as
> >discussed in [1]).
> >
> >Links
> >1. 
> >https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com
>
> Did you use the patchset from 2020/07/03? I don't get any duplicate OIDs
> with it, and it's already using quite high OIDs (part 4 uses >= 8000,
> part 5 uses >= 9000).

Yep, it appears that I was using the wrong version of patchset.
Patchset from 2020/07/03 works good on the current master.

--
Regards,
Alexander Korotkov




Re: WIP: BRIN multi-range indexes

2020-07-19 Thread Tomas Vondra

On Wed, Jul 15, 2020 at 05:34:05AM +0300, Alexander Korotkov wrote:

On Mon, Jul 13, 2020 at 5:59 PM Tomas Vondra
 wrote:

>> > If we really want to print something nicer, I'd say it needs to be a
>> > special function in the BRIN opclass.
>>
>> If that can be done, then +1.  We just need to ensure that the function
>> knows and can verify the type of index that the value comes from.  I
>> guess we can pass the index OID so that it can extract the opclass from
>> catalogs to verify.
>
>+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
>doesn't have it, the 'values' can be null.
>

I'd say that if the opclass does not have it, then we should print the
bytea value (or whatever the opclass uses to store the summary) using
the type functions.


I've read the recent messages in this thread and I'd like to share my thoughts.

I think the way brin_page_items() displays values is not really
generic.  It uses a range-like textual representation of an array of
values, while that array doesn't necessarily have range semantics.

However, I think it's good that brin_page_items() uses a type output
function to display values.  So, it's not necessary to introduce a new
BRIN opclass function in order to get values displayed in a
human-readable way.  Instead, we could just make it a standard that BRIN
values are human readable.  I see at least two possibilities for
that.
1. Use standard container data types to represent BRIN values.  For
instance we could use an array of ranges instead of bytea for
multirange.  Not sure about how convenient/performant it would be.
2. Introduce a new data type to represent values in a BRIN index, and for
that type define an output function with user-readable output.  We
did similar things for GiST.  For instance, pg_trgm defines the gtrgm
type, which has no input and no output.  But for a BRIN opclass we can
define a type with just an output function.



I think there's a number of weak points in this approach.

Firstly, it assumes the summaries can be represented as arrays of
built-in types, which I'm not really sure about. It clearly is not true
for the bloom opclasses, for example. But even for minmax oclasses it's
going to be tricky because the ranges may be on different data types so
presumably we'd need somewhat nested data structure.

Moreover, multi-minmax summary contains either points or intervals,
which requires additional fields/flags to indicate that. That further
complicates the things ...

maybe we could decompose that into separate arrays or something, but
honestly it seems somewhat premature - there are far more important
aspects to discuss, I think (e.g. how the ranges are built/merged in
multi-minmax, or whether bloom opclasses are useful at all).



BTW, I've applied the patchset to the current master, but I got a lot
of duplicate oids.  Could you please resolve these conflicts.  I think
it would be good to use high oid numbers to evade conflicts during
development/review, and rely on committer to set final oids (as
discussed in [1]).

Links
1. 
https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com



Did you use the patchset from 2020/07/03? I don't get any duplicate OIDs
with it, and it's already using quite high OIDs (part 4 uses >= 8000,
part 5 uses >= 9000).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-14 Thread Alexander Korotkov
On Mon, Jul 13, 2020 at 5:59 PM Tomas Vondra
 wrote:
> >> > If we really want to print something nicer, I'd say it needs to be a
> >> > special function in the BRIN opclass.
> >>
> >> If that can be done, then +1.  We just need to ensure that the function
> >> knows and can verify the type of index that the value comes from.  I
> >> guess we can pass the index OID so that it can extract the opclass from
> >> catalogs to verify.
> >
> >+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
> >doesn't have it, the 'values' can be null.
> >
>
> I'd say that if the opclass does not have it, then we should print the
> bytea value (or whatever the opclass uses to store the summary) using
> the type functions.

I've read the recent messages in this thread and I'd like to share my thoughts.

I think the way brin_page_items() displays values is not really
generic.  It uses a range-like textual representation of an array of
values, while that array doesn't necessarily have range semantics.

However, I think it's good that brin_page_items() uses a type output
function to display values.  So, it's not necessary to introduce a new
BRIN opclass function in order to get values displayed in a
human-readable way.  Instead, we could just make it a standard that BRIN
values are human readable.  I see at least two possibilities for
that.
1. Use standard container data types to represent BRIN values.  For
instance we could use an array of ranges instead of bytea for
multirange.  Not sure about how convenient/performant it would be.
2. Introduce a new data type to represent values in a BRIN index, and for
that type define an output function with user-readable output.  We
did similar things for GiST.  For instance, pg_trgm defines the gtrgm
type, which has no input and no output.  But for a BRIN opclass we can
define a type with just an output function.

BTW, I've applied the patchset to the current master, but I got a lot
of duplicate oids.  Could you please resolve these conflicts.  I think
it would be good to use high oid numbers to evade conflicts during
development/review, and rely on committer to set final oids (as
discussed in [1]).

Links
1. 
https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: WIP: BRIN multi-range indexes

2020-07-13 Thread Tomas Vondra

On Mon, Jul 13, 2020 at 02:54:56PM +0900, Masahiko Sawada wrote:

On Mon, 13 Jul 2020 at 09:33, Alvaro Herrera  wrote:


On 2020-Jul-13, Tomas Vondra wrote:

> On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:

> > Maybe we can try to handle this with some other function that interprets
> > the bytea in 'value' and returns a user-readable text.  I think it'd
> > have to be a superuser-only function, because otherwise you could easily
> > cause a crash by passing a value of a different opclass.  But since this
> > seems a developer-only thing, that restriction seems fine to me.
>
> Ummm, I disagree a superuser check is sufficient protection from a
> segfault or similar issues.

My POV there is that it's the user's responsibility to call the right
function; and if they fail to do so, it's their fault.  I agree it's not
ideal, but frankly these pageinspect things are not critical to get 100%
user-friendly.

> If we really want to print something nicer, I'd say it needs to be a
> special function in the BRIN opclass.

If that can be done, then +1.  We just need to ensure that the function
knows and can verify the type of index that the value comes from.  I
guess we can pass the index OID so that it can extract the opclass from
catalogs to verify.


+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
doesn't have it, the 'values' can be null.



I'd say that if the opclass does not have it, then we should print the
bytea value (or whatever the opclass uses to store the summary) using
the type functions.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Masahiko Sawada
On Mon, 13 Jul 2020 at 09:33, Alvaro Herrera  wrote:
>
> On 2020-Jul-13, Tomas Vondra wrote:
>
> > On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:
>
> > > Maybe we can try to handle this with some other function that interprets
> > > the bytea in 'value' and returns a user-readable text.  I think it'd
> > > have to be a superuser-only function, because otherwise you could easily
> > > cause a crash by passing a value of a different opclass.  But since this
> > > seems a developer-only thing, that restriction seems fine to me.
> >
> > Ummm, I disagree a superuser check is sufficient protection from a
> > segfault or similar issues.
>
> My POV there is that it's the user's responsibility to call the right
> function; and if they fail to do so, it's their fault.  I agree it's not
> ideal, but frankly these pageinspect things are not critical to get 100%
> user-friendly.
>
> > If we really want to print something nicer, I'd say it needs to be a
> > special function in the BRIN opclass.
>
> If that can be done, then +1.  We just need to ensure that the function
> knows and can verify the type of index that the value comes from.  I
> guess we can pass the index OID so that it can extract the opclass from
> catalogs to verify.

+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
doesn't have it, the 'values' can be null.

Regards,

-- 
Masahiko Sawada               http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Alvaro Herrera
On 2020-Jul-13, Tomas Vondra wrote:

> On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:

> > Maybe we can try to handle this with some other function that interprets
> > the bytea in 'value' and returns a user-readable text.  I think it'd
> > have to be a superuser-only function, because otherwise you could easily
> > cause a crash by passing a value of a different opclass.  But since this
> > seems a developer-only thing, that restriction seems fine to me.
> 
> Ummm, I disagree a superuser check is sufficient protection from a
> segfault or similar issues.

My POV there is that it's the user's responsibility to call the right
function; and if they fail to do so, it's their fault.  I agree it's not
ideal, but frankly these pageinspect things are not critical to get 100%
user-friendly.

> If we really want to print something nicer, I'd say it needs to be a
> special function in the BRIN opclass.

If that can be done, then +1.  We just need to ensure that the function
knows and can verify the type of index that the value comes from.  I
guess we can pass the index OID so that it can extract the opclass from
catalogs to verify.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Tomas Vondra

On Sun, Jul 12, 2020 at 07:58:54PM -0400, Alvaro Herrera wrote:

On 2020-Jul-10, Tomas Vondra wrote:


> postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> 2), 'mul');
> -[ RECORD 1 ]----------------------------------------------------------------------
> itemoffset  | 1
> blknum      | 0
> attnum      | 1
> allnulls    | f
> hasnulls    | f
> placeholder | f
> value       | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}

Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?


Maybe we can try to handle this with some other function that interprets
the bytea in 'value' and returns a user-readable text.  I think it'd
have to be a superuser-only function, because otherwise you could easily
cause a crash by passing a value of a different opclass.  But since this
seems a developer-only thing, that restriction seems fine to me.



Ummm, I disagree a superuser check is sufficient protection from a
segfault or similar issues. If we really want to print something nicer,
I'd say it needs to be a special function in the BRIN opclass.


(I don't know what's a good way to represent a bloom filter, mind.)



Me neither, but I guess we could print either some stats (size, number
of bits set, etc.) and/or then the bitmap.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Alvaro Herrera
On 2020-Jul-10, Tomas Vondra wrote:

> > postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> > 2), 'mul');
> > -[ RECORD 1 ]----------------------------------------------------------------------
> > itemoffset  | 1
> > blknum      | 0
> > attnum      | 1
> > allnulls    | f
> > hasnulls    | f
> > placeholder | f
> > value       | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}
> 
> Hmm. I'm not sure we can do much better, without making the function
> much more complicated. I mean, even with regular BRIN indexes we don't
> really know if the value is plain min/max, right?

Maybe we can try to handle this with some other function that interprets
the bytea in 'value' and returns a user-readable text.  I think it'd
have to be a superuser-only function, because otherwise you could easily
cause a crash by passing a value of a different opclass.  But since this
seems a developer-only thing, that restriction seems fine to me.

(I don't know what's a good way to represent a bloom filter, mind.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-07-12 Thread Sascha Kuhl
Thanks, I see there is some understanding, though.

Tomas Vondra  wrote on Sun, 12 July 2020, 01:30:

> On Sat, Jul 11, 2020 at 03:32:43PM +0200, Sascha Kuhl wrote:
> >Tomas Vondra  schrieb am Sa., 11. Juli
> 2020,
> >13:24:
> >
> >> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
> >> >Tomas Vondra  schrieb am Fr., 10. Juli
> >> 2020,
> >> >14:09:
> >> >
> >> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
> >> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
> >> tomas.von...@2ndquadrant.com>
> >> >> wrote:
> >> >> >>
> >> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov
> wrote:
> >> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> >> >> >> > wrote:
> >> >> >> ...
> >> >> >> >> >
> >> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >> >> >> >inclined to rush on these three as well.  But you're willing
> to
> >> >> commit
> >> >> >> >> >them, you can count round of review on me.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> I have no intention to get 0001-0003 committed. I think those
> >> changes
> >> >> >> >> are beneficial on their own, but the primary reason was to
> support
> >> >> the
> >> >> >> >> new opclasses (which require those changes). And those parts
> are
> >> not
> >> >> >> >> going to make it into v13 ...
> >> >> >> >
> >> >> >> >OK, no problem.
> >> >> >> >Let's do this for v14.
> >> >> >> >
> >> >> >>
> >> >> >> Hi Alexander,
> >> >> >>
> >> >> >> Are you still interested in reviewing those patches? I'll take a
> >> look at
> >> >> >> 0001-0003 to check that your previous feedback was addressed. Do
> you
> >> >> >> have any comments about 0004 / 0005, which I think are the more
> >> >> >> interesting parts of this series?
> >> >> >>
> >> >> >>
> >> >> >> Attached is a rebased version - I realized I forgot to include
> 0005
> >> in
> >> >> >> the last update, for some reason.
> >> >> >>
> >> >> >
> >> >> >I've done a quick test with this patch set. I wonder if we can
> improve
> >> >> >brin_page_items() SQL function in pageinspect as well. Currently,
> >> >> >brin_page_items() is hard-coded to support only normal brin indexes.
> >> >> >When we pass brin-bloom or brin-multi-range to that function the
> >> >> >binary values are shown in 'value' column but it seems not helpful
> for
> >> >> >users. For instance, here is an output of brin_page_items() with a
> >> >> >brin-multi-range index:
> >> >> >
> >> >> >postgres(1:12801)=# select * from
> brin_page_items(get_raw_page('mul',
> >> >> >2), 'mul');
> >> >> >-[ RECORD 1
> >> >>
> >>
> ]--
> >> >>
> >> >>
> >>
> >---
> >> >> >
> >> >> >itemoffset  | 1
> >> >> >blknum  | 0
> >> >> >attnum  | 1
> >> >> >allnulls| f
> >> >> >hasnulls| f
> >> >> >placeholder | f
> >> >> >value   |
> >> >>
> >>
> {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
> >> >>
> >> >>
> >>
> >70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
> >> >> >071}
> >> >> >
> >> >>
> >> >> Hmm. I'm not sure we can do much better, without making the function
> >> >> much more complicated. I mean, even with regular BRIN indexes we
> don't
> >> >> really know if the value is plain min/max, right?
> >> >>
> >> >You can be sure with the next node. The value is in can be false
> positiv.
> >> >The value is out is clear. You can detect the change between in and
> out.
> >> >
> >>
> >> I'm sorry, I don't understand what you're suggesting. How is any of this
> >> related to false positive rate, etc?
> >>
> >
> >Hi,
> >
> >You check by the bloom filter if a value you're searching is part of the
> >node, right?
> >
> >In case, the value is in the bloom filter you could be mistaken, because
> >another value could have the same hash profile, no?
> >
> >However if the value is out, the filter can not react. You can be sure
> that
> >the value is out.
> >
> >If you looking for a range or many ranges of values, you traverse many
> >nodes. By knowing the value is out, you can state a clear set of nodes
> that
> >form the range. However the border is somehow unsharp because of the false
> >positives.
> >
> >I am not sure if we write about the same. Please confirm, this can be
> >needed. Please.
> >
>
> Probably not. Masahiko-san pointed out that pageinspect (which also has
> a function to print pages from a BRIN index) does not understand the
> summary of the new opclasses and just prints the bytea verbatim.
>
> That has nothing to do with inspecting the bloom filter, or anything
> like that. So I think there's some confusion ...
>
>
> regards
>
> --
> Tomas Vondra  

Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 03:32:43PM +0200, Sascha Kuhl wrote:

Tomas Vondra  wrote on Sat, 11 July 2020, 13:24:


On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>Tomas Vondra  schrieb am Fr., 10. Juli
2020,
>14:09:
>
>> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
tomas.von...@2ndquadrant.com>
>> wrote:
>> >>
>> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> > wrote:
>> >> ...
>> >> >> >
>> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >inclined to rush on these three as well.  But you're willing to
>> commit
>> >> >> >them, you can count round of review on me.
>> >> >> >
>> >> >>
>> >> >> I have no intention to get 0001-0003 committed. I think those
changes
>> >> >> are beneficial on their own, but the primary reason was to support
>> the
>> >> >> new opclasses (which require those changes). And those parts are
not
>> >> >> going to make it into v13 ...
>> >> >
>> >> >OK, no problem.
>> >> >Let's do this for v14.
>> >> >
>> >>
>> >> Hi Alexander,
>> >>
>> >> Are you still interested in reviewing those patches? I'll take a
look at
>> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> have any comments about 0004 / 0005, which I think are the more
>> >> interesting parts of this series?
>> >>
>> >>
>> >> Attached is a rebased version - I realized I forgot to include 0005
in
>> >> the last update, for some reason.
>> >>
>> >
>> >I've done a quick test with this patch set. I wonder if we can improve
>> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >When we pass brin-bloom or brin-multi-range to that function the
>> >binary values are shown in 'value' column but it seems not helpful for
>> >users. For instance, here is an output of brin_page_items() with a
>> >brin-multi-range index:
>> >
>> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >2), 'mul');
>> >-[ RECORD 1
>>
]--
>>
>>
>---
>> >
>> >itemoffset  | 1
>> >blknum  | 0
>> >attnum  | 1
>> >allnulls| f
>> >hasnulls| f
>> >placeholder | f
>> >value   |
>>
{\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
>>
>>
>70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
>> >071}
>> >
>>
>> Hmm. I'm not sure we can do much better, without making the function
>> much more complicated. I mean, even with regular BRIN indexes we don't
>> really know if the value is plain min/max, right?
>>
>You can be sure with the next node. The value is in can be false positiv.
>The value is out is clear. You can detect the change between in and out.
>

I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?



Hi,

You check by the bloom filter if a value you're searching is part of the
node, right?

In case, the value is in the bloom filter you could be mistaken, because
another value could have the same hash profile, no?

However if the value is out, the filter can not react. You can be sure that
the value is out.

If you looking for a range or many ranges of values, you traverse many
nodes. By knowing the value is out, you can state a clear set of nodes that
form the range. However the border is somehow unsharp because of the false
positives.

I am not sure if we write about the same. Please confirm, this can be
needed. Please.



Probably not. Masahiko-san pointed out that pageinspect (which also has
a function to print pages from a BRIN index) does not understand the
summary of the new opclasses and just prints the bytea verbatim.

That has nothing to do with inspecting the bloom filter, or anything
like that. So I think there's some confusion ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Sascha Kuhl
Sorry, my topic is different

Sascha Kuhl  wrote on Sat, 11 July 2020, 15:32:

>
>
>
> Tomas Vondra  schrieb am Sa., 11. Juli
> 2020, 13:24:
>
>> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
>> >Tomas Vondra  schrieb am Fr., 10. Juli
>> 2020,
>> >14:09:
>> >
>> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
>> tomas.von...@2ndquadrant.com>
>> >> wrote:
>> >> >>
>> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> >> >> > wrote:
>> >> >> ...
>> >> >> >> >
>> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >> >> >inclined to rush on these three as well.  But you're willing to
>> >> commit
>> >> >> >> >them, you can count round of review on me.
>> >> >> >> >
>> >> >> >>
>> >> >> >> I have no intention to get 0001-0003 committed. I think those
>> changes
>> >> >> >> are beneficial on their own, but the primary reason was to
>> support
>> >> the
>> >> >> >> new opclasses (which require those changes). And those parts are
>> not
>> >> >> >> going to make it into v13 ...
>> >> >> >
>> >> >> >OK, no problem.
>> >> >> >Let's do this for v14.
>> >> >> >
>> >> >>
>> >> >> Hi Alexander,
>> >> >>
>> >> >> Are you still interested in reviewing those patches? I'll take a
>> look at
>> >> >> 0001-0003 to check that your previous feedback was addressed. Do you
>> >> >> have any comments about 0004 / 0005, which I think are the more
>> >> >> interesting parts of this series?
>> >> >>
>> >> >>
>> >> >> Attached is a rebased version - I realized I forgot to include 0005
>> in
>> >> >> the last update, for some reason.
>> >> >>
>> >> >
>> >> >I've done a quick test with this patch set. I wonder if we can improve
>> >> >brin_page_items() SQL function in pageinspect as well. Currently,
>> >> >brin_page_items() is hard-coded to support only normal brin indexes.
>> >> >When we pass brin-bloom or brin-multi-range to that function the
>> >> >binary values are shown in 'value' column but it seems not helpful for
>> >> >users. For instance, here is an output of brin_page_items() with a
>> >> >brin-multi-range index:
>> >> >
>> >> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>> >> >2), 'mul');
>> >> >-[ RECORD 1
>> >>
>> ]--
>> >>
>> >>
>> >---
>> >> >
>> >> >itemoffset  | 1
>> >> >blknum  | 0
>> >> >attnum  | 1
>> >> >allnulls| f
>> >> >hasnulls| f
>> >> >placeholder | f
>> >> >value   |
>> >>
>> {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
>> >>
>> >>
>> >70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
>> >> >071}
>> >> >
>> >>
>> >> Hmm. I'm not sure we can do much better, without making the function
>> >> much more complicated. I mean, even with regular BRIN indexes we don't
>> >> really know if the value is plain min/max, right?
>> >>
>> >You can be sure with the next node. The value is in can be false positiv.
>> >The value is out is clear. You can detect the change between in and out.
>> >
>>
>> I'm sorry, I don't understand what you're suggesting. How is any of this
>> related to false positive rate, etc?
>>
>
> Hi,
>
> You check by the bloom filter if a value you're searching is part of the
> node, right?
>
> In case, the value is in the bloom filter you could be mistaken, because
> another value could have the same hash profile, no?
>
> However if the value is out, the filter can not react. You can be sure
> that the value is out.
>
> If you looking for a range or many ranges of values, you traverse many
> nodes. By knowing the value is out, you can state a clear set of nodes that
> form the range. However the border is somehow unsharp because of the false
> positives.
>
> I am not sure if we write about the same. Please confirm, this can be
> needed. Please.
>
> I will try to understand what you write. Interesting
>
> Sascha
>
>>
>> The problem here is that while plain BRIN opclasses have fairly simple
>> summary that can be stored using a fixed number of simple data types
>> (e.g. minmax will store two values with the same data types as the
>> indexd column)
>>
>>  result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
>>   sizeof(MinmaxOpaque));
>>  result->oi_nstored = 2;
>>  result->oi_opaque = (MinmaxOpaque *)
>>  MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
>>  result->oi_typcache[0] = result->oi_typcache[1] =
>>  lookup_type_cache(typoid, 0);
>>
>> The opclassed introduced here have somewhat 

Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Sascha Kuhl
Tomas Vondra  wrote on Sat, 11 July 2020, 13:24:

> On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:
> >Tomas Vondra  schrieb am Fr., 10. Juli
> 2020,
> >14:09:
> >
> >> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
> >> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> >> wrote:
> >> >>
> >> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
> >> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> >> >> > wrote:
> >> >> ...
> >> >> >> >
> >> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >> >> >inclined to rush on these three as well.  But you're willing to
> >> commit
> >> >> >> >them, you can count round of review on me.
> >> >> >> >
> >> >> >>
> >> >> >> I have no intention to get 0001-0003 committed. I think those
> changes
> >> >> >> are beneficial on their own, but the primary reason was to support
> >> the
> >> >> >> new opclasses (which require those changes). And those parts are
> not
> >> >> >> going to make it into v13 ...
> >> >> >
> >> >> >OK, no problem.
> >> >> >Let's do this for v14.
> >> >> >
> >> >>
> >> >> Hi Alexander,
> >> >>
> >> >> Are you still interested in reviewing those patches? I'll take a
> look at
> >> >> 0001-0003 to check that your previous feedback was addressed. Do you
> >> >> have any comments about 0004 / 0005, which I think are the more
> >> >> interesting parts of this series?
> >> >>
> >> >>
> >> >> Attached is a rebased version - I realized I forgot to include 0005
> in
> >> >> the last update, for some reason.
> >> >>
> >> >
> >> >I've done a quick test with this patch set. I wonder if we can improve
> >> >brin_page_items() SQL function in pageinspect as well. Currently,
> >> >brin_page_items() is hard-coded to support only normal brin indexes.
> >> >When we pass brin-bloom or brin-multi-range to that function the
> >> >binary values are shown in 'value' column but it seems not helpful for
> >> >users. For instance, here is an output of brin_page_items() with a
> >> >brin-multi-range index:
> >> >
> >> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> >> >2), 'mul');
> >> >-[ RECORD 1
> >>
> ]--
> >>
> >>
> >---
> >> >
> >> >itemoffset  | 1
> >> >blknum  | 0
> >> >attnum  | 1
> >> >allnulls| f
> >> >hasnulls| f
> >> >placeholder | f
> >> >value   |
> >>
> {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
> >>
> >>
> >70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
> >> >071}
> >> >
> >>
> >> Hmm. I'm not sure we can do much better, without making the function
> >> much more complicated. I mean, even with regular BRIN indexes we don't
> >> really know if the value is plain min/max, right?
> >>
> >You can be sure with the next node. The value is in can be false positiv.
> >The value is out is clear. You can detect the change between in and out.
> >
>
> I'm sorry, I don't understand what you're suggesting. How is any of this
> related to false positive rate, etc?
>

Hi,

You check with the bloom filter whether a value you're searching for is
part of the node, right?

In case the value is in the bloom filter you could be mistaken, because
another value could have the same hash profile, no?

However, if the value is out, the filter cannot be mistaken. You can be
sure that the value is out.

If you are looking for a range or many ranges of values, you traverse many
nodes. By knowing the value is out, you can state a clear set of nodes that
form the range. However, the border is somewhat unsharp because of the false
positives.
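
(In BRIN terms this is visible in EXPLAIN ANALYZE - a sketch, assuming a
column c indexed with the bloom opclass from this patch: ranges the filter
rules out are skipped entirely, while the false positives show up as lossy
heap blocks whose rows the recheck then throws away.)

explain (analyze, buffers)
  select * from t where c = 123456;
-- see "Heap Blocks: lossy=..." and "Rows Removed by Index Recheck"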

I am not sure if we are writing about the same thing. Please confirm
whether this is what is needed.

I will try to understand what you write. Interesting

Sascha

>
> The problem here is that while plain BRIN opclasses have fairly simple
> summary that can be stored using a fixed number of simple data types
> (e.g. minmax will store two values with the same data types as the
> indexd column)
>
>  result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
>   sizeof(MinmaxOpaque));
>  result->oi_nstored = 2;
>  result->oi_opaque = (MinmaxOpaque *)
>  MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
>  result->oi_typcache[0] = result->oi_typcache[1] =
>  lookup_type_cache(typoid, 0);
>
> The opclassed introduced here have somewhat more complex summary, stored
> as a single bytea value - which is what gets printed by brin_page_items.
>
> To print something easier to read (for humans) we'd either have to teach
> brin_page_items about the diffrent opclasses (multi-range, 

Re: WIP: BRIN multi-range indexes

2020-07-11 Thread Tomas Vondra

On Fri, Jul 10, 2020 at 04:44:41PM +0200, Sascha Kuhl wrote:

Tomas Vondra  wrote on Fri, 10 July 2020, 14:09:


On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
>On Fri, 3 Jul 2020 at 09:58, Tomas Vondra 
wrote:
>>
>> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
>> > wrote:
>> ...
>> >> >
>> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >> >inclined to rush on these three as well.  But you're willing to
commit
>> >> >them, you can count round of review on me.
>> >> >
>> >>
>> >> I have no intention to get 0001-0003 committed. I think those changes
>> >> are beneficial on their own, but the primary reason was to support
the
>> >> new opclasses (which require those changes). And those parts are not
>> >> going to make it into v13 ...
>> >
>> >OK, no problem.
>> >Let's do this for v14.
>> >
>>
>> Hi Alexander,
>>
>> Are you still interested in reviewing those patches? I'll take a look at
>> 0001-0003 to check that your previous feedback was addressed. Do you
>> have any comments about 0004 / 0005, which I think are the more
>> interesting parts of this series?
>>
>>
>> Attached is a rebased version - I realized I forgot to include 0005 in
>> the last update, for some reason.
>>
>
>I've done a quick test with this patch set. I wonder if we can improve
>brin_page_items() SQL function in pageinspect as well. Currently,
>brin_page_items() is hard-coded to support only normal brin indexes.
>When we pass brin-bloom or brin-multi-range to that function the
>binary values are shown in 'value' column but it seems not helpful for
>users. For instance, here is an output of brin_page_items() with a
>brin-multi-range index:
>
>postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
>2), 'mul');
>-[ RECORD 1 ]----------------------------------------------------------------------
>itemoffset  | 1
>blknum      | 0
>attnum      | 1
>allnulls    | f
>hasnulls    | f
>placeholder | f
>value       | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}
>

Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?


You can be sure with the next node. The value is in can be false positiv.
The value is out is clear. You can detect the change between in and out.



I'm sorry, I don't understand what you're suggesting. How is any of this
related to false positive rate, etc?

The problem here is that while plain BRIN opclasses have a fairly simple
summary that can be stored using a fixed number of simple data types
(e.g. minmax will store two values with the same data type as the
indexed column)

result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) +
 sizeof(MinmaxOpaque));
result->oi_nstored = 2;
result->oi_opaque = (MinmaxOpaque *)
MAXALIGN((char *) result + SizeofBrinOpcInfo(2));
result->oi_typcache[0] = result->oi_typcache[1] =
lookup_type_cache(typoid, 0);

The opclasses introduced here have a somewhat more complex summary, stored
as a single bytea value - which is what gets printed by brin_page_items.

To print something easier to read (for humans) we'd either have to teach
brin_page_items about the different opclasses (multi-range, bloom) and
how to parse the summary bytea, or we'd have to extend the opclasses
with a function formatting the summary. Or rework how the summary is
stored, but that seems like the worst option.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: WIP: BRIN multi-range indexes

2020-07-10 Thread Sascha Kuhl
Tomas Vondra  wrote on Fri, 10 July 2020, 14:09:

> On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:
> >On Fri, 3 Jul 2020 at 09:58, Tomas Vondra 
> wrote:
> >>
> >> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
> >> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> >> > wrote:
> >> ...
> >> >> >
> >> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >> >inclined to rush on these three as well.  But you're willing to
> commit
> >> >> >them, you can count round of review on me.
> >> >> >
> >> >>
> >> >> I have no intention to get 0001-0003 committed. I think those changes
> >> >> are beneficial on their own, but the primary reason was to support
> the
> >> >> new opclasses (which require those changes). And those parts are not
> >> >> going to make it into v13 ...
> >> >
> >> >OK, no problem.
> >> >Let's do this for v14.
> >> >
> >>
> >> Hi Alexander,
> >>
> >> Are you still interested in reviewing those patches? I'll take a look at
> >> 0001-0003 to check that your previous feedback was addressed. Do you
> >> have any comments about 0004 / 0005, which I think are the more
> >> interesting parts of this series?
> >>
> >>
> >> Attached is a rebased version - I realized I forgot to include 0005 in
> >> the last update, for some reason.
> >>
> >
> >I've done a quick test with this patch set. I wonder if we can improve
> >brin_page_items() SQL function in pageinspect as well. Currently,
> >brin_page_items() is hard-coded to support only normal brin indexes.
> >When we pass brin-bloom or brin-multi-range to that function the
> >binary values are shown in 'value' column but it seems not helpful for
> >users. For instance, here is an output of brin_page_items() with a
> >brin-multi-range index:
> >
> >postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
> >2), 'mul');
> >-[ RECORD 1
> ]--
>
> >---
> >
> >itemoffset  | 1
> >blknum  | 0
> >attnum  | 1
> >allnulls| f
> >hasnulls| f
> >placeholder | f
> >value   |
> {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef
>
> >70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700
> >071}
> >
>
> Hmm. I'm not sure we can do much better, without making the function
> much more complicated. I mean, even with regular BRIN indexes we don't
> really know if the value is plain min/max, right?
>
You can be sure with the next node. "The value is in" can be a false positive.
"The value is out" is definite. You can detect the change between in and out.

>
>
> >Also, I got an assertion failure when setting false_positive_rate
> reloption:
> >
> >postgres(1:12448)=# create index blm on t using brin (c int4_bloom_ops
> >(false_positive_rate = 1));
> >TRAP: FailedAssertion("(false_positive_rate > 0) &&
> >(false_positive_rate < 1.0)", File: "brin_bloom.c", Line: 300)
> >
> >I'll look at the code in depth and let you know if I find a problem.
> >
>
> Yeah, the assert should say (f_p_r <= 1.0).
>
> But I'm not convinced we should allow values up to 1.0, really. The
> f_p_r is the fraction of the table that will get matched always, so 1.0
> would mean we get to scan the whole table. Seems kinda pointless. So
> maybe we should cap it to something like 0.1 or so, but I agree the
> value seems kinda arbitrary.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>


Re: WIP: BRIN multi-range indexes

2020-07-10 Thread Tomas Vondra

On Fri, Jul 10, 2020 at 06:01:58PM +0900, Masahiko Sawada wrote:

On Fri, 3 Jul 2020 at 09:58, Tomas Vondra  wrote:


On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
>On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> wrote:
...
>> >
>> >Assuming we're not going to get 0001-0003 into v13, I'm not so
>> >inclined to rush on these three as well.  But you're willing to commit
>> >them, you can count round of review on me.
>> >
>>
>> I have no intention to get 0001-0003 committed. I think those changes
>> are beneficial on their own, but the primary reason was to support the
>> new opclasses (which require those changes). And those parts are not
>> going to make it into v13 ...
>
>OK, no problem.
>Let's do this for v14.
>

Hi Alexander,

Are you still interested in reviewing those patches? I'll take a look at
0001-0003 to check that your previous feedback was addressed. Do you
have any comments about 0004 / 0005, which I think are the more
interesting parts of this series?


Attached is a rebased version - I realized I forgot to include 0005 in
the last update, for some reason.



I've done a quick test with this patch set. I wonder if we can improve
brin_page_items() SQL function in pageinspect as well. Currently,
brin_page_items() is hard-coded to support only normal brin indexes.
When we pass brin-bloom or brin-multi-range to that function the
binary values are shown in 'value' column but it seems not helpful for
users. For instance, here is an output of brin_page_items() with a
brin-multi-range index:

postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
2), 'mul');
-[ RECORD 1 ]----------------------------------------------------------------------
itemoffset  | 1
blknum      | 0
attnum      | 1
allnulls    | f
hasnulls    | f
placeholder | f
value       | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}



Hmm. I'm not sure we can do much better, without making the function
much more complicated. I mean, even with regular BRIN indexes we don't
really know if the value is plain min/max, right?



Also, I got an assertion failure when setting false_positive_rate reloption:

postgres(1:12448)=# create index blm on t using brin (c int4_bloom_ops
(false_positive_rate = 1));
TRAP: FailedAssertion("(false_positive_rate > 0) &&
(false_positive_rate < 1.0)", File: "brin_bloom.c", Line: 300)

I'll look at the code in depth and let you know if I find a problem.



Yeah, the assert should say (f_p_r <= 1.0).

But I'm not convinced we should allow values up to 1.0, really. The
f_p_r is the fraction of the table that will get matched always, so 1.0
would mean we get to scan the whole table. Seems kinda pointless. So
maybe we should cap it to something like 0.1 or so, but I agree the
value seems kinda arbitrary.
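
For illustration, using the reloption syntax from above (a sketch; the
exact upper cap is the open question):

-- reasonable: a small false positive rate
create index blm on t using brin (c int4_bloom_ops (false_positive_rate = 0.01));

-- pointless (matches the whole table anyway), and currently trips the assert
create index blm2 on t using brin (c int4_bloom_ops (false_positive_rate = 1));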


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-07-10 Thread Masahiko Sawada
On Fri, 3 Jul 2020 at 09:58, Tomas Vondra  wrote:
>
> On Sun, Apr 05, 2020 at 08:01:50PM +0300, Alexander Korotkov wrote:
> >On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
> > wrote:
> ...
> >> >
> >> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >> >inclined to rush on these three as well.  But you're willing to commit
> >> >them, you can count round of review on me.
> >> >
> >>
> >> I have no intention to get 0001-0003 committed. I think those changes
> >> are beneficial on their own, but the primary reason was to support the
> >> new opclasses (which require those changes). And those parts are not
> >> going to make it into v13 ...
> >
> >OK, no problem.
> >Let's do this for v14.
> >
>
> Hi Alexander,
>
> Are you still interested in reviewing those patches? I'll take a look at
> 0001-0003 to check that your previous feedback was addressed. Do you
> have any comments about 0004 / 0005, which I think are the more
> interesting parts of this series?
>
>
> Attached is a rebased version - I realized I forgot to include 0005 in
> the last update, for some reason.
>

I've done a quick test with this patch set. I wonder if we can improve
brin_page_items() SQL function in pageinspect as well. Currently,
brin_page_items() is hard-coded to support only normal brin indexes.
When we pass brin-bloom or brin-multi-range to that function the
binary values are shown in 'value' column but it seems not helpful for
users. For instance, here is an output of brin_page_items() with a
brin-multi-range index:

postgres(1:12801)=# select * from brin_page_items(get_raw_page('mul',
2), 'mul');
-[ RECORD 1 ]----------------------------------------------------------------------
itemoffset  | 1
blknum      | 0
attnum      | 1
allnulls    | f
hasnulls    | f
placeholder | f
value       | {\x01001b002100e570e670e770e870e970ea70eb70ec70ed70ee70ef70f070f170f270f370f470f570f670f770f870f970fa70fb70fc70fd70fe70ff700071}

Also, I got an assertion failure when setting false_positive_rate reloption:

postgres(1:12448)=# create index blm on t using brin (c int4_bloom_ops
(false_positive_rate = 1));
TRAP: FailedAssertion("(false_positive_rate > 0) &&
(false_positive_rate < 1.0)", File: "brin_bloom.c", Line: 300)

I'll look at the code in depth and let you know if I find a problem.

Regards,

--
Masahiko Sawada               http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra
 wrote:
> On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote:
> >On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
> > wrote:
> >> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
> >> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> >> > wrote:
> >> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> >> >> top of caba97a9d9 for review.
> >> >> >
> >> >> >This has been close to two months now, so I have the patch as RwF.
> >> >> >Feel free to update if you think that's incorrect.
> >> >>
> >> >> I see the opclass parameters patch got committed a couple days ago, so
> >> >> I've rebased the patch series on top of it. The pach was marked RwF
> >> >> since 2019-11, so I'll add it to the next CF.
> >> >
> >> >I think this patchset was marked RwF mainly because slow progress on
> >> >opclass parameters.  Now we got opclass parameters committed, and I
> >> >think this patchset is in a pretty good shape.  Moreover, opclass
> >> >parameters patch comes with very small examples.  This patchset would
> >> >be great showcase for opclass parameters.
> >> >
> >> >I'd like to give this patchset a chance for v13.  I'm going to make
> >> >another pass trough this patchset.  If I wouldn't find serious issues,
> >> >I'm going to commit it.  Any objections?
> >> >
> >>
> >> I'm an author of the patchset and I'd love to see it committed, but I
> >> think that might be a bit too rushed and unfair (considering it was not
> >> included in the current CF at all).
> >>
> >> I think the code is correct and I'm not aware of any bugs, but I'm not
> >> sure there was sufficient discussion about things like costing, choosing
> >> parameter values (e.g.  number of values in the multi-minmax or bloom
> >> filter parameters).
> >
> >Ok!
> >
> >> That being said, I think the first couple of patches (that modify how
> >> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
> >> and non-controversial, so maybe we could get 0001-0003 committed, and
> >> leave the bloom/multi-minmax opclasses for v14.
> >
> >Regarding 0001-0003 I have a couple of notes:
> >1) They should revise the BRIN extensibility documentation section.
> >2) I think 0002 and 0003 should be merged.  NULL ScanKeys should still
> >be passed to the consistent function when oi_regular_nulls == false.
> >
> >Assuming we're not going to get 0001-0003 into v13, I'm not so
> >inclined to rush on these three as well.  But if you're willing to commit
> >them, you can count on a round of review from me.
> >
>
> I have no intention to get 0001-0003 committed. I think those changes
> are beneficial on their own, but the primary reason was to support the
> new opclasses (which require those changes). And those parts are not
> going to make it into v13 ...

OK, no problem.
Let's do this for v14.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Tomas Vondra

On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote:

On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
 wrote:

On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
>On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> wrote:
>> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> >> Yeah, the opclass params patches got broken by 773df883e adding enum
>> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> >> Nikita to fix it in [1]. Until that happens, apply the patches on
>> >> top of caba97a9d9 for review.
>> >
>> >This has been close to two months now, so I have marked the patch as RwF.
>> >Feel free to update if you think that's incorrect.
>>
>> I see the opclass parameters patch got committed a couple days ago, so
>> I've rebased the patch series on top of it. The patch has been marked RwF
>> since 2019-11, so I'll add it to the next CF.
>
>I think this patchset was marked RwF mainly because of slow progress on
>opclass parameters.  Now we've got opclass parameters committed, and I
>think this patchset is in pretty good shape.  Moreover, the opclass
>parameters patch comes with very small examples.  This patchset would
>be a great showcase for opclass parameters.
>
>I'd like to give this patchset a chance for v13.  I'm going to make
>another pass through this patchset.  If I don't find serious issues,
>I'm going to commit it.  Any objections?
>

I'm an author of the patchset and I'd love to see it committed, but I
think that might be a bit too rushed and unfair (considering it was not
included in the current CF at all).

I think the code is correct and I'm not aware of any bugs, but I'm not
sure there was sufficient discussion about things like costing, choosing
parameter values (e.g.  number of values in the multi-minmax or bloom
filter parameters).


Ok!


That being said, I think the first couple of patches (that modify how
BRIN deals with multi-key scans and IS NULL clauses) are simple enough
and non-controversial, so maybe we could get 0001-0003 committed, and
leave the bloom/multi-minmax opclasses for v14.


Regarding 0001-0003 I have a couple of notes:
1) They should revise the BRIN extensibility documentation section.
2) I think 0002 and 0003 should be merged.  NULL ScanKeys should still
be passed to the consistent function when oi_regular_nulls == false.

Assuming we're not going to get 0001-0003 into v13, I'm not so
inclined to rush on these three as well.  But if you're willing to commit
them, you can count on a round of review from me.



I have no intention to get 0001-0003 committed. I think those changes
are beneficial on their own, but the primary reason was to support the
new opclasses (which require those changes). And those parts are not
going to make it into v13 ...

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra
 wrote:
> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:
> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
> > wrote:
> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> >> top of caba97a9d9 for review.
> >> >
> >> >This has been close to two months now, so I have marked the patch as RwF.
> >> >Feel free to update if you think that's incorrect.
> >>
> >> I see the opclass parameters patch got committed a couple days ago, so
> >> I've rebased the patch series on top of it. The patch has been marked RwF
> >> since 2019-11, so I'll add it to the next CF.
> >
> >I think this patchset was marked RwF mainly because of slow progress on
> >opclass parameters.  Now we've got opclass parameters committed, and I
> >think this patchset is in pretty good shape.  Moreover, the opclass
> >parameters patch comes with very small examples.  This patchset would
> >be a great showcase for opclass parameters.
> >
> >I'd like to give this patchset a chance for v13.  I'm going to make
> >another pass through this patchset.  If I don't find serious issues,
> >I'm going to commit it.  Any objections?
> >
>
> I'm an author of the patchset and I'd love to see it committed, but I
> think that might be a bit too rushed and unfair (considering it was not
> included in the current CF at all).
>
> I think the code is correct and I'm not aware of any bugs, but I'm not
> sure there was sufficient discussion about things like costing, choosing
> parameter values (e.g.  number of values in the multi-minmax or bloom
> filter parameters).

Ok!

> That being said, I think the first couple of patches (that modify how
> BRIN deals with multi-key scans and IS NULL clauses) are simple enough
> and non-controversial, so maybe we could get 0001-0003 committed, and
> leave the bloom/multi-minmax opclasses for v14.

Regarding 0001-0003 I have a couple of notes:
1) They should revise the BRIN extensibility documentation section.
2) I think 0002 and 0003 should be merged.  NULL ScanKeys should still
be passed to the consistent function when oi_regular_nulls == false.

Assuming we're not going to get 0001-0003 into v13, I'm not so
inclined to rush on these three as well.  But if you're willing to commit
them, you can count on a round of review from me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
On Sun, Apr 5, 2020 at 6:51 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > I'd like to give this patchset a chance for v13.  I'm going to make
> > another pass through this patchset.  If I don't find serious issues,
> > I'm going to commit it.  Any objections?
>
> I think it is way too late to be reviving major features that nobody
> has been looking at for months, that indeed were never even listed
> in the final CF.  At this point in the cycle I think we should just be
> trying to get small stuff over the line, not shove in major patches
> and figure they can be stabilized later.
>
> In this particular case, the last serious work on the patchset seems
> to have been Tomas' revision of 2019-09-14, and he specifically stated
> then that the APIs still needed work.  That doesn't sound like
> "it's about ready to commit" to me.

OK, got it.  Thank you for the feedback.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Tomas Vondra

On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote:

Hi!

On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
 wrote:

On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
>On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
>> Yeah, the opclass params patches got broken by 773df883e adding enum
>> reloptions. The breakage is somewhat extensive so I'll leave it up to
>> Nikita to fix it in [1]. Until that happens, apply the patches on
>> top of caba97a9d9 for review.
>
>This has been close to two months now, so I have marked the patch as RwF.
>Feel free to update if you think that's incorrect.

I see the opclass parameters patch got committed a couple days ago, so
I've rebased the patch series on top of it. The patch has been marked RwF
since 2019-11, so I'll add it to the next CF.


I think this patchset was marked RwF mainly because of slow progress on
opclass parameters.  Now we've got opclass parameters committed, and I
think this patchset is in pretty good shape.  Moreover, the opclass
parameters patch comes with very small examples.  This patchset would
be a great showcase for opclass parameters.

I'd like to give this patchset a chance for v13.  I'm going to make
another pass through this patchset.  If I don't find serious issues,
I'm going to commit it.  Any objections?



I'm an author of the patchset and I'd love to see it committed, but I
think that might be a bit too rushed and unfair (considering it was not
included in the current CF at all).

I think the code is correct and I'm not aware of any bugs, but I'm not
sure there was sufficient discussion about things like costing, choosing
parameter values (e.g.  number of values in the multi-minmax or bloom
filter parameters).
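
(For context, these are opclass parameters on the new opclasses. A rough
sketch of what tuning them might look like, treating the exact parameter
names other than false_positive_rate, and the values shown, as tentative
illustrations rather than settled choices:

create index on t using brin (a int4_minmax_multi_ops (values_per_range = 32));

create index on t using brin (a int4_bloom_ops (n_distinct_per_range = 128,
                                                false_positive_rate = 0.01));

The open question is what defaults and allowed ranges make sense.)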

That being said, I think the first couple of patches (that modify how
BRIN deals with multi-key scans and IS NULL clauses) are simple enough
and non-controversial, so maybe we could get 0001-0003 committed, and
leave the bloom/multi-minmax opclasses for v14.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Tom Lane
Alexander Korotkov  writes:
> I'd like to give this patchset a chance for v13.  I'm going to make
> another pass through this patchset.  If I don't find serious issues,
> I'm going to commit it.  Any objections?

I think it is way too late to be reviving major features that nobody
has been looking at for months, that indeed were never even listed
in the final CF.  At this point in the cycle I think we should just be
trying to get small stuff over the line, not shove in major patches
and figure they can be stabilized later.

In this particular case, the last serious work on the patchset seems
to have been Tomas' revision of 2019-09-14, and he specifically stated
then that the APIs still needed work.  That doesn't sound like
"it's about ready to commit" to me.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-04-05 Thread Alexander Korotkov
Hi!

On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra
 wrote:
> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote:
> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> >> Yeah, the opclass params patches got broken by 773df883e adding enum
> >> reloptions. The breakage is somewhat extensive so I'll leave it up to
> >> Nikita to fix it in [1]. Until that happens, apply the patches on
> >> top of caba97a9d9 for review.
> >
> >This has been close to two months now, so I have marked the patch as RwF.
> >Feel free to update if you think that's incorrect.
>
> I see the opclass parameters patch got committed a couple days ago, so
> I've rebased the patch series on top of it. The patch has been marked RwF
> since 2019-11, so I'll add it to the next CF.

I think this patchset was marked RwF mainly because of slow progress on
opclass parameters.  Now we've got opclass parameters committed, and I
think this patchset is in pretty good shape.  Moreover, the opclass
parameters patch comes with very small examples.  This patchset would
be a great showcase for opclass parameters.

I'd like to give this patchset a chance for v13.  I'm going to make
another pass through this patchset.  If I don't find serious issues,
I'm going to commit it.  Any objections?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: BRIN multi-range indexes

2019-11-30 Thread Michael Paquier
On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote:
> Yeah, the opclass params patches got broken by 773df883e adding enum
> reloptions. The breakage is somewhat extensive so I'll leave it up to
> Nikita to fix it in [1]. Until that happens, apply the patches on
> top of caba97a9d9 for review.

This has been close to two months now, so I have marked the patch as RwF.
Feel free to update if you think that's incorrect.
--
Michael




Re: WIP: BRIN multi-range indexes

2019-09-26 Thread Tomas Vondra

On Wed, Sep 25, 2019 at 05:07:48PM -0300, Alvaro Herrera wrote:

This patch fails to apply (or the opclass params one, maybe).  Please
update.



Yeah, the opclass params patches got broken by 773df883e adding enum
reloptions. The breakage is somewhat extensive so I'll leave it up to
Nikita to fix it in [1]. Until that happens, apply the patches on
top of caba97a9d9 for review.

Thanks

[1] https://commitfest.postgresql.org/24/2183/

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



