Re: POC: Cleaning up orphaned files using undo logs

2019-07-20 Thread Amit Kapila
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas  wrote:
>
> On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila  wrote:
> > We are doing exactly what you have written in the last line of the
> > next paragraph "stop the transaction from writing undo when the hash
> > table is already too full.".  So we will
> > never face the problems related to repeated crash recovery.  The
> > definition of too full is that we stop allowing the new transactions
> > that can write undo when we have the hash table already have entries
> > equivalent to (UndoRollbackHashTableSize() - MaxBackends).  Does this
> > make sense?
>
> Oops, I was looking in the wrong place.  Yes, that makes sense, but:
>
> 1. It looks like the test you added to PrepareUndoInsert has no
> locking, and I don't see how that can be right.
>
> 2. It seems like this is would result in testing for each new undo
> insertion that gets prepared, whereas surely we would want to only
> test when first attaching to an undo log.  If you've already attached
> to the undo log, there's no reason not to continue inserting into it,
> because doing so doesn't increase the number of transactions (or
> transaction-persistence level combinations) that need undo.
>

I agree that it should not be done for each undo insertion, but rather
whenever a transaction attaches to an undo log.

> 3. I don't think the test itself is correct. It can fire even when
> there's no problem. It is correct (or would be if it said 2 *
> MaxBackends) if every other backend in the system is already attached
> to an undo log (or two). But if they are not, it will block
> transactions from being started for no reason.
>

Right, we should find a way to know the exact number of transactions
that are attached to an undo log at any point in time; then we can have
a more precise check.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-07-20 Thread Dent John
Hi Nikolay,

Thanks for the revised patch. It now applies with no problem, and seems to work fine.

For me, I still find the relopts area quite odd. I wonder if your patch doesn’t 
go far enough?

For example, take log_autovacuum_min_duration. It’s described in intRelOpts, which 
implicitly defines its type (an int), and explicitly sets its min/max, default, 
and lock level, as well as its kind (HEAP | TOAST). But then it’s described 
again in the relopt_parse_elt[] in toast_reloptions() and heap_reloptions(), 
which implicitly defines it to apply to HEAP | TOAST, and the fact that it is an 
int. (It’s, of course, the same for all reloptions.)
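
To illustrate (paraphrasing from memory rather than quoting the exact source, 
so the field layout may be slightly off), the same option ends up spelled out 
in two places, roughly like this:

  /* reloptions.c: static descriptor -- type, kinds, lock level, bounds */
  static relopt_int intRelOpts[] =
  {
      /* ... */
      {
          {
              "log_autovacuum_min_duration",
              "Sets the minimum execution time above which autovacuum "
              "actions will be logged",
              RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
              ShareUpdateExclusiveLock
          },
          -1, -1, INT_MAX
      },
      /* ... */
  };

  /* heap_reloptions()/toast_reloptions(): the same option again --
   * this time just name, type, and struct offset */
  static const relopt_parse_elt tab[] = {
      /* ... */
      {"log_autovacuum_min_duration", RELOPT_TYPE_INT,
       offsetof(StdRdOptions, autovacuum) +
       offsetof(AutoVacOpts, log_min_duration)},
      /* ... */
  };

The only thing the second table really adds is the offset; everything else 
restates the first.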

The relopt_parse_elt[] is thus almost entirely duplicative. I wonder if it is not 
best simply to consolidate — either push all of the information into the static 
{bool,int,real,string}RelOpts[] structures, or push it all into the 
relopt_parse_elt[].

I notice the thread earlier talks of some of the APIs being public interfaces, 
which may be used by EXTENSIONs. I’m not sure I buy that in its fullest sense. 
For sure, an EXTENSION may invoke the APIs. But no EXTENSION can define new 
options or alter existing ones, because the {bool,int,real,string}RelOpts[] are 
not currently runtime-extendable. I guess we probably should preserve the API’s 
functionality, and allow fillRelOptions(), when provided with a tab, to 
restrict filling to only those options supplied in the tab. But in core code 
generally, my opinion is that fillRelOptions() could be passed a NULL tab and 
scavenge all the information it needs from the static 
{bool,int,real,string}RelOpts[] structures.

That links to what I initially found most confusing: AUTOVACUUM_RELOPTIONS. I 
understand it’s there because there are a bunch of shared reloptions. 
Pre-patch, default_reloptions() meant there was no need for the macro, but your 
patch drops default_reloptions(). default_reloptions() is horrible, but I feel 
the macro approach is worse. :-)

Sorry for the delay providing the feedback. It took me a few sittings to grok 
what was going on, and to work out what I thought about it.

denty.

> On 12 Jul 2019, at 15:13, Nikolay Shaplov  wrote:
> 
> In a message of Thursday, 4 July 2019 19:44:42 MSK, Dent John 
> wrote:
>> Hi Nikolay,
>> 
>> I have had a crack at re-basing the current patch against 12b2, but I didn’t
>> trivially succeed.
>> 
>> It’s probably more my bodged fixing of the rejections than a fundamental
>> problem. But I now get compile fails in — and seems like only — vacuum.c.
> 
> I am really sorry for such big delay. Two new relation options were added, it 
> needed careful checking while importing them back to the patch.
> I did not find proper timeslot for doing this quick enough.
> Hope I am not terribly late.
> Here goes an updated version.
> 
> 
> 
> 





Re: POC: Cleaning up orphaned files using undo logs

2019-07-20 Thread Dilip Kumar
On Sat, Jul 20, 2019 at 12:40 PM Amit Kapila  wrote:
>
> On Fri, Jul 19, 2019 at 6:35 PM Robert Haas  wrote:
> >
> > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila  
> > wrote:
> > > We are doing exactly what you have written in the last line of the
> > > next paragraph "stop the transaction from writing undo when the hash
> > > table is already too full.".  So we will
> > > never face the problems related to repeated crash recovery.  The
> > > definition of too full is that we stop allowing the new transactions
> > > that can write undo when we have the hash table already have entries
> > > equivalent to (UndoRollbackHashTableSize() - MaxBackends).  Does this
> > > make sense?
> >
> > Oops, I was looking in the wrong place.  Yes, that makes sense, but:
> >
> > 1. It looks like the test you added to PrepareUndoInsert has no
> > locking, and I don't see how that can be right.
> >
> > 2. It seems like this is would result in testing for each new undo
> > insertion that gets prepared, whereas surely we would want to only
> > test when first attaching to an undo log.  If you've already attached
> > to the undo log, there's no reason not to continue inserting into it,
> > because doing so doesn't increase the number of transactions (or
> > transaction-persistence level combinations) that need undo.
> >
>
> I agree that it should not be for each undo insertion rather whenever
> any transaction attached to an undo log.
>
> > 3. I don't think the test itself is correct. It can fire even when
> > there's no problem. It is correct (or would be if it said 2 *
> > MaxBackends) if every other backend in the system is already attached
> > to an undo log (or two). But if they are not, it will block
> > transactions from being started for no reason.
> >
>
> Right, we should find a way to know the exact number of transactions
> that are attached to undo-log at any point in time, then we can have a
> more precise check.

Maybe we can make ProcGlobal->xactsHavingPendingUndo an atomic
variable.  We can increment its value atomically whenever
a) a transaction writes its first undo record for a persistence level, or
b) an abort request is inserted by the 'StartupPass'.

And we will decrement it when
a) the transaction commits (decrementing by 1 for each persistence level
it has written undo for), or
b) a rollback request is processed.
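
Roughly like the following sketch (the names and placement are hypothetical,
not existing code -- it just shows the shape of it using the existing
atomics API):

  #include "postgres.h"
  #include "port/atomics.h"

  /* Hypothetical counter living in shared memory (e.g. in ProcGlobal). */
  static pg_atomic_uint32 *xactsHavingPendingUndo;

  /*
   * Increment when a transaction writes its first undo record for a
   * persistence level, and once per abort request inserted during startup.
   */
  static inline void
  PendingUndoXactRegister(void)
  {
      pg_atomic_fetch_add_u32(xactsHavingPendingUndo, 1);
  }

  /*
   * Decrement at commit (once per persistence level that wrote undo) and
   * when a rollback request has been processed.
   */
  static inline void
  PendingUndoXactUnregister(void)
  {
      pg_atomic_fetch_sub_u32(xactsHavingPendingUndo, 1);
  }

  /* The "too full" check when attaching to an undo log could then be: */
  static inline bool
  UndoHashTableTooFull(uint32 hashTableSize)
  {
      return pg_atomic_read_u32(xactsHavingPendingUndo) >= hashTableSize;
  }

That would give us the precise count needed for the check Amit mentioned
above.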

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pgsql: Sync our copy of the timezone library with IANA release tzcode20

2019-07-20 Thread Michael Paquier
On Fri, Jul 19, 2019 at 02:56:34PM -0400, Tom Lane wrote:
> I'm not quite convinced whether that will silence the warning, but
> at least it's a bit less unreadable.

Thanks for working with upstream on this.  From what I can see,
woodlouse & friends do not complain anymore.
--
Michael




Re: Compiler warnings with MinGW

2019-07-20 Thread Michael Paquier
On Fri, Jul 19, 2019 at 08:41:28AM -0400, Andrew Dunstan wrote:
> On 7/19/19 1:08 AM, Michael Paquier wrote:
>> The second one is rather obvious to fix, because we don't care about
>> the file mode on Windows, so the attached should do the work.  I am
>> actually surprised that the Visual Studio compilers don't complain
>> about that, but let's fix it.
>>
>> Thoughts?
> 
> +1.

Just wanted to double-check something.  We usually don't bother
back-patching warning fixes like this one, right?
--
Michael




thoughts on "prevent wraparound" vacuum

2019-07-20 Thread Michail Nikolaev
Hello.

Currently I am working a lot with a cluster that consists of a few big tables,
about 2-3 TB. These tables are heavily updated: some rows are removed, new
rows are inserted... a kind of typical OLTP workload.

Physical table size stays mostly stable while regular VACUUM is working; it
is fast enough to free up some space from removed rows.

But from time to time a "to prevent wraparound" vacuum comes, and it runs for
something like 8-9 days. During that time the relation size starts to expand
quickly: freezing all blocks in such a table takes a lot of time, and bloat is
generated much more quickly.

Of course, after the aggressive vacuum finishes, the table does not shrink back
(some kind of repacking is required). And even after repacking, relation
shrinking causes the whole cluster to get stuck for some time (due to exclusive
locking, see [1]).

So, I was thinking about it, and I see two possible solutions:

 1. Track two block pointers for aggressive vacuum: one to freeze all
blocks, and the other to perform regular vacuum on non-all-visible blocks.
The second one is circular (it could process the table multiple times while
the first one moves from start to end of the table). Some parameters to
spread resources between the pointers would be required (see the rough
sketch after this list).

 2. Separate "to prevent wraparound" from regular Vacuum to allow them run
concurrently. But it seems to be much more work here.
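
For option 1, what I have in mind is very roughly the following (a pure
sketch; the per-block helper functions are hypothetical):

  #include <stdbool.h>
  #include <stdint.h>

  typedef uint32_t BlockNumber;

  /* Hypothetical per-block work hooks. */
  extern void freeze_block(BlockNumber blk);
  extern void regular_vacuum_block(BlockNumber blk);
  extern bool block_is_all_visible(BlockNumber blk);

  void
  aggressive_vacuum_two_pointers(BlockNumber nblocks,
                                 int freeze_blocks_per_round,
                                 int regular_blocks_per_round)
  {
      BlockNumber freeze_ptr = 0;   /* moves once from start to end */
      BlockNumber regular_ptr = 0;  /* circular, may lap the table */

      while (freeze_ptr < nblocks)
      {
          int i;

          /* spend one slice of resources on the freezing pointer ... */
          for (i = 0; i < freeze_blocks_per_round && freeze_ptr < nblocks; i++)
              freeze_block(freeze_ptr++);

          /* ... and one slice on regular vacuuming of non-all-visible blocks */
          for (i = 0; i < regular_blocks_per_round; i++)
          {
              if (!block_is_all_visible(regular_ptr))
                  regular_vacuum_block(regular_ptr);
              regular_ptr = (regular_ptr + 1) % nblocks;
          }
      }
  }

The ratio between the two *_per_round parameters would be the knob to spread
resources between the pointers.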

Could you please share some thoughts on this? Is it worth implementing?

Thanks.

[1]
https://www.postgresql.org/message-id/c9374921e50a5e8fb1ecf04eb8c6ebc3%40postgrespro.ru


Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-20 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
 wrote:

Now, consider this example:

  create table t (a int, b int, c int);
  insert into t select mod(i,100),mod(i,100),i from generate_series(1,1000) 
s(i);
  create index on t (a);
  analyze t;
  explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;

With 0001+0002+0003 pathes, I get a plan like this:

 QUERY PLAN

 Limit  (cost=10375.39..10594.72 rows=1 width=16)
   ->  Incremental Sort  (cost=10375.39..2203675.71 rows=1 width=16)
 Sort Key: a, b, (sum(c))
 Presorted Key: a, b
 ->  GroupAggregate  (cost=10156.07..2203225.71 rows=1 width=16)
   Group Key: a, b
   ->  Gather Merge  (cost=10156.07..2128124.39 rows=1175 
width=12)
 Workers Planned: 2
 ->  Incremental Sort  (cost=9156.04..972856.05 
rows=4166740 width=12)
   Sort Key: a, b
   Presorted Key: a
   ->  Parallel Index Scan using t_a_idx on t  
(cost=0.43..417690.30 rows=4166740 width=12)
(12 rows)


and with 0004, I get this:

  QUERY PLAN
--
 Limit  (cost=20443.84..20665.32 rows=1 width=16)
   ->  Incremental Sort  (cost=20443.84..2235250.05 rows=1 width=16)
 Sort Key: a, b, (sum(c))
 Presorted Key: a, b
 ->  GroupAggregate  (cost=20222.37..2234800.05 rows=1 width=16)
   Group Key: a, b
   ->  Incremental Sort  (cost=20222.37..2159698.74 rows=1175 
width=12)
 Sort Key: a, b
 Presorted Key: a
 ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 
rows=1175 width=12)
(10 rows)

Notice that cost of the second plan is almost double the first one. That
means 0004 does not even generate the first plan, i.e. there are cases
where we don't try to add the explicit sort before passing the path to
generate_gather_paths().

And I think I know why is that - while gather_grouping_paths() tries to
add explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

   parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach planner to do this.


I've been working on figuring out sample queries for each of the
places we're looking at adding create_increment_sort() (starting with
the cases enabled by gather-merge nodes). The
generate_useful_gather_paths() call in
apply_scanjoin_target_to_paths() is required to generate the above
preferred plan.

But I found that if I set enable_sort=off (with our without the
_useful_ variant of generate_gather_paths()) I get a very similar plan
that's actually lower cost yet:

   QUERY PLAN
--
Limit  (cost=10255.98..10355.77 rows=1 width=16)
  ->  Incremental Sort  (cost=10255.98..1008228.67 rows=1 width=16)
Sort Key: a, b, (sum(c))
Presorted Key: a, b
->  Finalize GroupAggregate  (cost=10156.20..1007778.67
rows=1 width=16)
  Group Key: a, b
  ->  Gather Merge  (cost=10156.20..1007528.67 rows=2 width=16)
Workers Planned: 2
->  Partial GroupAggregate
(cost=9156.18..1004220.15 rows=1 width=16)
  Group Key: a, b
  ->  Incremental Sort
(cost=9156.18..972869.60 rows=4166740 width=12)
Sort Key: a, b
Presorted Key: a
->  Parallel Index Scan using t_a_idx
on t  (cost=0.43..417703.85 rows=4166740 width=12)

Is that something we should consider a bug at this stage? It's also
not clear to mean (costing aside) which plan is intuitively
preferable.



This seems like a thinko in add_partial_path() - it only looks at total
cost of the paths, and ignores the startup cost entirely. I've debugged
it a bit, and what's happening for the partially-grouped relation is
roughly this:

1) We add a partial path with startup/total costs 696263 / 738029

2) We attempt to add the "Partial GroupAggregate" path, but it loses the
fight because its total cost (1004207) is higher than the first path's,
which entirely misses that its startup cost is way lower.

3) W

Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-20 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 02:37:19PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

[ mcv fixes ]


These patches look OK to me.



OK, thanks. Pushed.


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-20 Thread Tomas Vondra

On Sat, Jul 20, 2019 at 10:33:02AM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra
 wrote:


On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:
>On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
> wrote:
>> Now, consider this example:
>>
>>   create table t (a int, b int, c int);
>>   insert into t select mod(i,100),mod(i,100),i from 
generate_series(1,1000) s(i);
>>   create index on t (a);
>>   analyze t;
>>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>>
>> With 0001+0002+0003 pathes, I get a plan like this:
>>
>>  QUERY PLAN
>> 

>>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>>->  Incremental Sort  (cost=10375.39..2203675.71 rows=1 width=16)
>>  Sort Key: a, b, (sum(c))
>>  Presorted Key: a, b
>>  ->  GroupAggregate  (cost=10156.07..2203225.71 rows=1 width=16)
>>Group Key: a, b
>>->  Gather Merge  (cost=10156.07..2128124.39 rows=1175 
width=12)
>>  Workers Planned: 2
>>  ->  Incremental Sort  (cost=9156.04..972856.05 
rows=4166740 width=12)
>>Sort Key: a, b
>>Presorted Key: a
>>->  Parallel Index Scan using t_a_idx on t  
(cost=0.43..417690.30 rows=4166740 width=12)
>> (12 rows)
>>
>>
>> and with 0004, I get this:
>>
>>   QUERY PLAN
>> 
--
>>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>>->  Incremental Sort  (cost=20443.84..2235250.05 rows=1 width=16)
>>  Sort Key: a, b, (sum(c))
>>  Presorted Key: a, b
>>  ->  GroupAggregate  (cost=20222.37..2234800.05 rows=1 width=16)
>>Group Key: a, b
>>->  Incremental Sort  (cost=20222.37..2159698.74 
rows=1175 width=12)
>>  Sort Key: a, b
>>  Presorted Key: a
>>  ->  Index Scan using t_a_idx on t  
(cost=0.43..476024.65 rows=1175 width=12)
>> (10 rows)
>>
>> Notice that cost of the second plan is almost double the first one. That
>> means 0004 does not even generate the first plan, i.e. there are cases
>> where we don't try to add the explicit sort before passing the path to
>> generate_gather_paths().
>>
>> And I think I know why is that - while gather_grouping_paths() tries to
>> add explicit sort below the gather merge, there are other places that
>> call generate_gather_paths() that don't do that. In this case it's
>> probably apply_scanjoin_target_to_paths() which simply builds
>>
>>parallel (seq|index) scan + gather merge
>>
>> and that's it. The problem is likely the same - the code does not know
>> which pathkeys are "interesting" at that point. We probably need to
>> teach planner to do this.
>
>I've been working on figuring out sample queries for each of the
>places we're looking at adding create_increment_sort() (starting with
>the cases enabled by gather-merge nodes). The
>generate_useful_gather_paths() call in
>apply_scanjoin_target_to_paths() is required to generate the above
>preferred plan.


As I continue this, I've added a couple of test cases (notably for
generate_useful_gather_paths() in both standard_join_search() and
apply_scanjoin_target_to_paths()). Those, plus the current WIP state
of my hacking on your patch adding generate_useful_gather_paths() is
attached as 0001-parallel-and-more-paths.patch.

My current line of investigation is whether we need to do anything in
the parallel portion of create_ordered_paths(). I noticed that the
first-pass patch adding generate_useful_gather_paths() modified that
section but wasn't actually adding any new gather-merge paths (just
bare incremental sort paths). That seems pretty clearly just a
prototype miss, so I modified the prototype to build gather-merge
paths instead (as a side note that change seems to fix an oddity I was
seeing where plans would include a parallel index scan node even
though they weren't parallel plans). While the resulting plan for
something like:



Yes, that seems to be a bug. The block above it clearly has a gather
merge node, so this one should too.


explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
t.a, t.b limit 50;

changes cost (to be cheaper) ever so slightly with the gather-merge
addition to create_ordered_paths(), the plan itself is otherwise
identical (including row estimates):

Limit
 -> Gather Merge
  -> Incremental Sort
 -> Parallel Index Scan

(Note: I'm forcing parallel plans here with: set
max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
set parallel_tuple_cost=0; set parallel_setup_cost=0; set
min_

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-20 Thread James Coleman
On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
 wrote:
...
> >My current line of investigation is whether we need to do anything in
> >the parallel portion of create_ordered_paths(). I noticed that the
> >first-pass patch adding generate_useful_gather_paths() modified that
> >section but wasn't actually adding any new gather-merge paths (just
> >bare incremental sort paths). That seems pretty clearly just a
> >prototype miss, so I modified the prototype to build gather-merge
> >paths instead (as a side note that change seems to fix an oddity I was
> >seeing where plans would include a parallel index scan node even
> >though they weren't parallel plans). While the resulting plan for
> >something like:
> >
>
> Yes, that seems to be a bug. The block above it clealy has a gather
> merge nodes, so this one should too.
>
> >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> >t.a, t.b limit 50;
> >
> >changes cost (to be cheaper) ever so slightly with the gather-merge
> >addition to create_ordered_paths(), the plan itself is otherwise
> >identical (including row estimates):
> >
> >Limit
> >  -> Gather Merge
> >   -> Incremental Sort
> >  -> Parallel Index Scan
> >
> >(Note: I'm forcing parallel plans here with: set
> >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> >min_parallel_index_scan_size=0;)
> >
> >I can't seem to come up with a case where adding these gather-merge
> >paths in create_ordered_paths() isn't entirely duplicative of paths
> >already created by generate_useful_gather_paths() as called from
> >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> >are called by grouping_planner().
> >
> >Can you think of a case I'm missing here that would make it valuable
> >to generate new parallel plans in create_ordered_paths()?
> >
>
> Good question. Not sure. I think such path would need to do something on
> a relation that is neither a join nor a scan - in which case the path
> should not be created by apply_scanjoin_target_to_paths().
>
> So for example a query like this:
>
>   SELECT
>   a, b, sum(expensive_function(c))
>   FROM
>   t
>   GROUP BY a,b
>   ORDER BY a,sum(...) LIMIT 10;
>
> should be able to produce a plan like this:
>
>   -> Limit
>   -> Gather Merge
> -> Incremental Sort (pathkeys: a, sum)
>   -> Group Aggregate
>  a, b, sum(expensive_function(c))
> -> Index Scan (pathkeys: a, b)
>
> or something like that, maybe. I haven't tried this, though. The other
> question is whether such queries are useful in practice ...

Hmm, when I step through on that example input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
 Sort Key: a, (sum(expensive_function(c)))
 Presorted Key: a
 ->  Finalize GroupAggregate
   Group Key: a, b
   ->  Gather Merge
 Workers Planned: 4
 ->  Partial GroupAggregate
   Group Key: a, b
   ->  Sort
 Sort Key: a, b
 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.

James Coleman




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-20 Thread James Coleman
On Sat, Jul 20, 2019 at 12:12 PM James Coleman  wrote:
>
> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
>  wrote:
> ...
> > >My current line of investigation is whether we need to do anything in
> > >the parallel portion of create_ordered_paths(). I noticed that the
> > >first-pass patch adding generate_useful_gather_paths() modified that
> > >section but wasn't actually adding any new gather-merge paths (just
> > >bare incremental sort paths). That seems pretty clearly just a
> > >prototype miss, so I modified the prototype to build gather-merge
> > >paths instead (as a side note that change seems to fix an oddity I was
> > >seeing where plans would include a parallel index scan node even
> > >though they weren't parallel plans). While the resulting plan for
> > >something like:
> > >
> >
> > Yes, that seems to be a bug. The block above it clealy has a gather
> > merge nodes, so this one should too.
> >
> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> > >t.a, t.b limit 50;
> > >
> > >changes cost (to be cheaper) ever so slightly with the gather-merge
> > >addition to create_ordered_paths(), the plan itself is otherwise
> > >identical (including row estimates):
> > >
> > >Limit
> > >  -> Gather Merge
> > >   -> Incremental Sort
> > >  -> Parallel Index Scan
> > >
> > >(Note: I'm forcing parallel plans here with: set
> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> > >min_parallel_index_scan_size=0;)
> > >
> > >I can't seem to come up with a case where adding these gather-merge
> > >paths in create_ordered_paths() isn't entirely duplicative of paths
> > >already created by generate_useful_gather_paths() as called from
> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> > >are called by grouping_planner().
> > >
> > >Can you think of a case I'm missing here that would make it valuable
> > >to generate new parallel plans in create_ordered_paths()?
> > >
> >
> > Good question. Not sure. I think such path would need to do something on
> > a relation that is neither a join nor a scan - in which case the path
> > should not be created by apply_scanjoin_target_to_paths().
> >
> > So for example a query like this:
> >
> >   SELECT
> >   a, b, sum(expensive_function(c))
> >   FROM
> >   t
> >   GROUP BY a,b
> >   ORDER BY a,sum(...) LIMIT 10;
> >
> > should be able to produce a plan like this:
> >
> >   -> Limit
> >   -> Gather Merge
> > -> Incremental Sort (pathkeys: a, sum)
> >   -> Group Aggregate
> >  a, b, sum(expensive_function(c))
> > -> Index Scan (pathkeys: a, b)
> >
> > or something like that, maybe. I haven't tried this, though. The other
> > question is whether such queries are useful in practice ...
>
> Hmm, when I step through on that example input_rel->partial_pathlist
> != NIL is false, so we don't even attempt to consider any extra
> parallel paths in create_ordered_paths(). Nonetheless we get a
> parallel plan, but with a different shape:
>
>  Limit
>->  Incremental Sort
>  Sort Key: a, (sum(expensive_function(c)))
>  Presorted Key: a
>  ->  Finalize GroupAggregate
>Group Key: a, b
>->  Gather Merge
>  Workers Planned: 4
>  ->  Partial GroupAggregate
>Group Key: a, b
>->  Sort
>  Sort Key: a, b
>  ->  Parallel Seq Scan on t
>
> (or if I disable seqscan then the sort+seq scan above becomes inc sort
> + index scan)
>
> To be honest, I don't think I understand how you would get a plan like
> that anyway since the index here only has "a" and so can't provide
> (pathkeys: a, b).
>
> Perhaps there's something I'm still missing though.

Also just realized I don't think (?) we can order by the sum inside a
gather-merge -- at least not without having another sort above the
parallel portion? Or is the group aggregate able to also provide
ordering on the final sum after aggregating the partial sums?

James Coleman




Re: [HACKERS] Block level parallel vacuum

2019-07-20 Thread Sergei Kornilov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hello

I reviewed v25 patches and have just a few notes.

missing synopsis for the "PARALLEL" option (synopsis block in 
doc/src/sgml/ref/vacuum.sgml)
missing prototype for vacuum_log_cleanup_info in "non-export function prototypes"

>   /*
>* Do post-vacuum cleanup, and statistics update for each index if
>* we're not in parallel lazy vacuum. If in parallel lazy vacuum, do
>* only post-vacum cleanup and update statistics at the end of parallel
>* lazy vacuum.
>*/
>   if (vacrelstats->useindex)
>   lazy_vacuum_or_cleanup_indexes(vacrelstats, Irel, nindexes,
>  
> indstats, lps, true);
>
>   if (ParallelVacuumIsActive(lps))
>   {
>   /* End parallel mode and update index statistics */
>   end_parallel_vacuum(lps, Irel, nindexes);
>   }

I personally do not like updating statistics in different places.
Can we change lazy_vacuum_or_cleanup_indexes to write stats for both the parallel 
and non-parallel cases? I mean something like this:

>   if (ParallelVacuumIsActive(lps))
>   {
>   /* Do parallel index vacuuming or index cleanup */
>   lazy_parallel_vacuum_or_cleanup_indexes(vacrelstats, Irel,
>   
> nindexes, stats,
>   
> lps, for_cleanup);
>   if (for_cleanup)
>   {
>...
>   for (i = 0; i < nindexes; i++)
>   lazy_update_index_statistics(...);
>   }
>   return;
>   }

So all lazy_update_index_statistics calls would be in one place. 
lazy_parallel_vacuum_or_cleanup_indexes is called only from the parallel leader and 
waits for all workers. Possibly we can update the stats in 
lazy_parallel_vacuum_or_cleanup_indexes after the WaitForParallelWorkersToFinish 
call.

Also a discussion question: will the vacuumdb parameters --parallel= and --jobs= 
confuse users? Do we need more description for these options?

regards, Sergei

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-20 Thread Tomas Vondra

On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 12:12 PM James Coleman  wrote:


On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
 wrote:
...
> >My current line of investigation is whether we need to do anything in
> >the parallel portion of create_ordered_paths(). I noticed that the
> >first-pass patch adding generate_useful_gather_paths() modified that
> >section but wasn't actually adding any new gather-merge paths (just
> >bare incremental sort paths). That seems pretty clearly just a
> >prototype miss, so I modified the prototype to build gather-merge
> >paths instead (as a side note that change seems to fix an oddity I was
> >seeing where plans would include a parallel index scan node even
> >though they weren't parallel plans). While the resulting plan for
> >something like:
> >
>
> Yes, that seems to be a bug. The block above it clealy has a gather
> merge nodes, so this one should too.
>
> >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> >t.a, t.b limit 50;
> >
> >changes cost (to be cheaper) ever so slightly with the gather-merge
> >addition to create_ordered_paths(), the plan itself is otherwise
> >identical (including row estimates):
> >
> >Limit
> >  -> Gather Merge
> >   -> Incremental Sort
> >  -> Parallel Index Scan
> >
> >(Note: I'm forcing parallel plans here with: set
> >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> >min_parallel_index_scan_size=0;)
> >
> >I can't seem to come up with a case where adding these gather-merge
> >paths in create_ordered_paths() isn't entirely duplicative of paths
> >already created by generate_useful_gather_paths() as called from
> >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> >are called by grouping_planner().
> >
> >Can you think of a case I'm missing here that would make it valuable
> >to generate new parallel plans in create_ordered_paths()?
> >
>
> Good question. Not sure. I think such path would need to do something on
> a relation that is neither a join nor a scan - in which case the path
> should not be created by apply_scanjoin_target_to_paths().
>
> So for example a query like this:
>
>   SELECT
>   a, b, sum(expensive_function(c))
>   FROM
>   t
>   GROUP BY a,b
>   ORDER BY a,sum(...) LIMIT 10;
>
> should be able to produce a plan like this:
>
>   -> Limit
>   -> Gather Merge
> -> Incremental Sort (pathkeys: a, sum)
>   -> Group Aggregate
>  a, b, sum(expensive_function(c))
> -> Index Scan (pathkeys: a, b)
>
> or something like that, maybe. I haven't tried this, though. The other
> question is whether such queries are useful in practice ...

Hmm, when I step through on that example input_rel->partial_pathlist
!= NIL is false, so we don't even attempt to consider any extra
parallel paths in create_ordered_paths(). Nonetheless we get a
parallel plan, but with a different shape:

 Limit
   ->  Incremental Sort
 Sort Key: a, (sum(expensive_function(c)))
 Presorted Key: a
 ->  Finalize GroupAggregate
   Group Key: a, b
   ->  Gather Merge
 Workers Planned: 4
 ->  Partial GroupAggregate
   Group Key: a, b
   ->  Sort
 Sort Key: a, b
 ->  Parallel Seq Scan on t

(or if I disable seqscan then the sort+seq scan above becomes inc sort
+ index scan)

To be honest, I don't think I understand how you would get a plan like
that anyway since the index here only has "a" and so can't provide
(pathkeys: a, b).

Perhaps there's something I'm still missing though.


I wasn't strictly adhering to the example we used before, and I imagined
there would be an index on (a,b). Sorry if that wasn't clear.



Also just realized I don't think (?) we can order by the sum inside a
gather-merge -- at least not without having another sort above the
parallel portion? Or is the group aggregate able to also provide
ordering on the final sum after aggregating the partial sums?



Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing a large sort
on top of it - but I don't see why that wouldn't be the case, and if we
only need a couple of rows from the beginning ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-20 Thread Tomas Vondra

On Fri, Jul 19, 2019 at 04:02:19PM +0200, Antonin Houska wrote:

Tomas Vondra  wrote:


On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote:
>Tomas Vondra  wrote:
>
>> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote:
>> >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote:
>> >> One extra thing we should consider is authenticated encryption. We can't
>> >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...),
>> >> as that does not provide integrity protection (i.e. can't detect when
>> >> the ciphertext was corrupted due to disk failure or intentionally). And
>> >> we can't quite rely on checksums, because that checksums the plaintext
>> >> and is stored encrypted.
>> >
>> >Uh, if someone modifies a few bytes of the page, we will decrypt it, but
>> >the checksum (per-page or WAL) will not match our decrypted output.  How
>> >would they make it match the checksum without already knowing the key.
>> >I read [1] but could not see that explained.
>> >
>>
>> Our checksum is only 16 bits, so perhaps one way would be to just
>> generate 64k of randomly modified pages and hope one of them happens to
>> hit the right checksum value. Not sure how practical such attack is, but
>> it does require just filesystem access.
>
>I don't think you can easily generate 64k of different checksums this way. If
>the data is random, I suppose that each set of 2^(128 - 16) blocks will
>contain the the same checksum after decryption. Thus even you generate 64k of
>different ciphertext blocks that contain the checksum, some (many?)  checksums
>will be duplicate. Unfortunately the math to describe this problem does not
>seem to be trivial.
>

I'm not sure what's your point, or why you care about the 128 bits, but I
don't think the math is very complicated (and it's exactly the same with
or without encryption). The probability of checksum collision for randomly
modified page is 1/64k, so p=~0.00153%. So probability of *not* getting a
collision is (1-p)=99.9985%. So with N pages, the probability of no
collisions is pow((1-p),N) which behaves like this:

       N      pow((1-p),N)
   -------------------------
    10000        85%
    20000        73%
    30000        63%
    46000        49%
   200000         4%

So with 1.6GB relation you have about 96% chance of a checksum collision.


I thought your attack proposal was to find valid (encrypted) checksum for a
given encrypted page. Instead it seems that you were only trying to say that
it's not too hard to generate page with a valid checksum in general. Thus the
attacker can try to modify the ciphertext again and again in a way that is not
quite random, but the chance to pass the checksum verification may still be
relatively high.


>Also note that if you try to generate ciphertext, decryption of which will
>result in particular value of checksum, you can hardly control the other 14
>bytes of the block, which in turn are used to verify the checksum.
>

Now, I'm not saying this attack is particularly practical - it would
generate a fair number of checkpoint failures before getting the first
collision. So it'd trigger quite a few alerts, I guess.


You probably mean "checksum failures". I agree. And even if the checksum
passed the verification, page or tuple headers would probably be incorrect and
cause other errors.


>> FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor
>> a cryptographic hash algorithm. Now, maybe we don't want authenticated
>> encryption (e.g. XTS is not authenticated, unlike GCM/CCM).
>
>I'm also not sure if we should try to guarantee data authenticity /
>integrity. As someone already mentioned elsewhere, page MAC does not help if
>the whole page is replaced. (An extreme case is that old filesystem snapshot
>containing the whole data directory is restored, although that will probably
>make the database crash soon.)
>
>We can guarantee integrity and authenticity of backup, but that's a separate
>feature: someone may need this although it's o.k. for him to run the cluster
>unencrypted.
>

Yes, I do agree with that. I think attempts to guarantee data authenticity
and/or integrity at the page level is mostly futile (replay attacks are an
example of why). IMHO we should consider that to be outside the threat
model TDE is expected to address.


When writing my previous email I forgot that, besides improving data
integrity, the authenticated encryption also tries to detect an attempt to get
encryption key via "chosen-ciphertext attack (CCA)". The fact that pages are
encrypted / decrypted independent from each other should not be a problem
here. We just need to consider if this kind of CCA is the threat we try to
protect against.


IMO a better way to handle authenticity/integrity would be based on WAL,
which is essentially an authoritative log of operations. We should be able
to parse WAL, deduce expected state (min LSN, checksums) for each page,
and validate the cluster state based on that.


o

Re: block-level incremental backup

2019-07-20 Thread vignesh C
Hi Jeevan,

The idea is very nice.
When insert/update/delete and truncate/drop happen in various
combinations, how does the incremental backup handle the copying of the
blocks?


On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke
 wrote:
>
>
>
> On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed  wrote:
>>
>>
>>
>> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke 
>>  wrote:
>>>
>>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  wrote:


 At what stage you will apply the WAL generated in between the START/STOP 
 backup.
>>>
>>>
>>> In this design, we are not touching any WAL related code. The WAL files will
>>> get copied with each backup either full or incremental. And thus, the last
>>> incremental backup will have the final WAL files which will be copied as-is
>>> in the combined full-backup and they will get apply automatically if that
>>> the data directory is used to start the server.
>>
>>
>> Ok, so you keep all the WAL files since the first backup, right?
>
>
> The WAL files will anyway be copied while taking a backup (full or 
> incremental),
> but only last incremental backup's WAL files are copied to the combined
> synthetic full backup.
>
>>>

 --
 Ibrar Ahmed
>>>
>>>
>>> --
>>> Jeevan Chalke
>>> Technical Architect, Product Development
>>> EnterpriseDB Corporation
>>>
>>
>>
>> --
>> Ibrar Ahmed
>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
>


--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: [RFC] Removing "magic" oids

2019-07-20 Thread Noah Misch
On Fri, Jul 19, 2019 at 10:12:57AM -0700, Andres Freund wrote:
> On 2019-07-07 10:00:35 -0700, Noah Misch wrote:
> > +# Test concurrent OID generation via pg_enum_oid_index.  This indirectly
> > +# exercises LWLock and spinlock concurrency.
> > +my $labels = join ',', map { "'l$_'" } 1 .. 1000;
> >  pgbench(
> > '--no-vacuum --client=5 --protocol=prepared --transactions=25',
> > 0,
> > [qr{processed: 125/125}],
> > [qr{^$}],
> > -   'concurrent insert workload',
> > +   'concurrent OID generation',
> > {
> > '001_pgbench_concurrent_insert' =>
> > - 'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);'
> > + "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE 
> > pg_temp.e;"
> > });
> 
> Hm, perhaps we should just do something stupid an insert into a catalog
> table, determining the oid to insert with pg_nextoid?  That ought to be a
> lot faster and thus more "stress testing" than going through a full
> blown DDL statement?  But perhaps that's just too ugly.

I expect the pg_nextoid strategy could have sufficed.  The ENUM strategy
wastes some time parsing 1000 label names, discarding odd-numbered OIDs, and
dropping the type.  The pg_nextoid strategy wastes time by performing the
insertion loop in the executor instead of dedicated C code of
EnumValuesCreate().  Hard to say how to weight those factors.




Re: Index Skip Scan

2019-07-20 Thread Dmitry Dolgov
>  On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee  
> wrote:
> I verified that the backwards index scan is indeed functioning now. However,
> I'm afraid it's not that simple, as I think the cursor case is broken now. I
> think having just the 'scan direction' in the btree code is not enough to get
> this working properly, because we need to know whether we want the minimum or
> maximum element of a certain prefix. There are basically four cases:
>
> - Forward Index Scan + Forward cursor: we want the minimum element within a
>   prefix and we want to skip 'forward' to the next prefix
>
> - Forward Index Scan + Backward cursor: we want the minimum element within a
>   prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Forward cursor: we want the maximum element within a
>   prefix and we want to skip 'backward' to the previous prefix
>
> - Backward Index Scan + Backward cursor: we want the maximum element within a
>   prefix and we want to skip 'forward' to the next prefix
>
> These cases make it rather complicated unfortunately. They do somewhat tie in
> with the previous discussion on this thread about being able to skip to the
> min or max value. If we ever want to support a sort of minmax scan, we'll
> encounter the same issues.

Yes, these four cases are indeed a very good point. I've prepared a new version
of the patch where they are addressed, along with an index condition and the
handling of situations when it eliminates one or more unique elements. It seems
to fix the issues and also works for those hypothetical examples you've
mentioned above, but of course it looks pretty complicated and I need to polish
it a bit before posting.

> On Thu, Jul 11, 2019 at 12:13 PM David Rowley  
> wrote:
>
> On Thu, 11 Jul 2019 at 19:41, Floris Van Nee  wrote:
> > SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
> > Data (imagine every tuple here actually occurs 10.000 times in the index to
> > see the benefit of skipping):
> > 1,1,1
> > 1,1,2
> > 1,2,2
> > 1,2,3
> > 2,2,1
> > 2,2,3
> > 3,1,1
> > 3,1,2
> > 3,2,2
> > 3,2,3
> >
> > Creating a cursor on this query and then moving forward, you should get
> > (1,1,2), (3,1,2). In the current implementation of the patch, after
> > bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves
> > forward one-by-one until it finds a match. This match only comes at (3,1,2)
> > however. Then it skips to the end.
> >
> > If you move the cursor backwards from the end of the cursor, you should
> > still get (3,1,2) (1,1,2). A possible implementation would start at the end
> > and do a skip to the beginning of the prefix: (3,1,1). Then it needs to
> > move forward one-by-one in order to find the first matching (minimum) item
> > (3,1,2). When it finds it, it needs to skip backwards to the beginning of
> > prefix 2 (2,2,1). It needs to move forwards to find the minimum element,
> > but should stop as soon as it detects that the prefix doesn't match anymore
> > (because there is no match for prefix 2, it will move all the way from
> > (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of
> > prefix 1: (1,1,1) and scan forward to find (1,1,2).
> > Perhaps anyone can think of an easier way to implement it?
>
> One option is just don't implement it and just change
> ExecSupportsBackwardScan() so that it returns false for skip index
> scans, or perhaps at least implement an index am method to allow the
> planner to be able to determine if the index implementation supports
> it... amcanskipbackward

Yep, it was discussed a few times in this thread, and after we discovered
(thanks to Floris) so many issues I was also one step away from implementing
this idea. But at the same time, as Thomas correctly noticed, our implementation
needs to be extensible to handle future use cases, and this particular cursor
juggling already seems like a pretty good example of such a "future use case". So
I hope that by dealing with it we can also figure out what needs to be extensible.

> On Tue, Jul 16, 2019 at 6:53 PM Jesper Pedersen  
> wrote:
>
> Here is a patch more in that direction.
>
> Some questions:
>
> 1) Do we really need the UniqueKey struct ?  If it only contains the
> EquivalenceClass field then we could just have a list of those instead.
> That would make the patch simpler.
>
> 2) Do we need both canon_uniquekeys and query_uniquekeys ?  Currently
> the patch only uses canon_uniquekeys because the we attach the list
> directly on the Path node.
>
> I'll clean the patch up based on your feedback, and then start to rebase
> v21 on it.

Thanks! I'll also take a look as soon as I'm finished with the last updates.




Re: SQL/JSON path issues/questions

2019-07-20 Thread Alexander Korotkov
Hi Steven,

On Fri, Jul 19, 2019 at 9:53 PM Steven Pousty  wrote:
> I would like to help review this documentation. Can you please point me in 
> the right direction?

Thank you for your interest.  You're welcome to do a review.

Please take a look at the instructions for reviewing a patch [1] and
working with git [2].  Also, in order to build the docs you will need to
set up a toolset first [3].

Links

1. https://wiki.postgresql.org/wiki/Reviewing_a_Patch
2. https://wiki.postgresql.org/wiki/Working_with_git#Testing_a_patch
3. https://www.postgresql.org/docs/devel/docguide-toolsets.html


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-20 Thread Sehrope Sarkuni
On Sat, Jul 20, 2019 at 1:30 PM Tomas Vondra
 wrote:
> Forbid checksums? I don't see how that could be acceptable. We either have
> to accept the limitations of the current design (having to decrypt
> everything before checking the checksums) or change the design.
>
> I personally think we should do the latter - not just because of this
> "decrypt-then-verify" issue, but consider how much work we've done to
> allow enabling checksums on-line (it's still not there, but it's likely
> doable in PG13). How are we going to do that with encryption? ISTM we
> should design it so that we can enable encryption on-line too - maybe not
> in v1, but it should be possible. So how are we going to do that? With
> checksums it's (fairly) easy because we can "not verify" the page before
> we know all pages have a checksum, but with encryption that's not
> possible. And if the whole page is encrypted, then what?
>
> Of course, maybe we don't need such capability for the use-case we're
> trying to solve with encryption. I can imagine that someone is running a
> large system, has issues with data corruption, and decides to enable
> checksums to remedy that. Maybe there's no such scenario in the privacy
> case? But we can probably come up with scenarios where a new company
> policy forces people to enable encryption on all systems, or something
> like that.
>
> That being said, I don't know how to solve this, but it seems to me that
> any system where we can't easily decide whether a page is encrypted or not
> (because everything including the page header) is encrypted has this
> exact issue. Maybe we could keep some part of the header unencrypted
> (likely an information leak, does not solve decrypt-then-verify). Or maybe
> we need to store some additional information on each page (which breaks
> on-disk format).

How about storing the CRC of the encrypted pages? It would not leak
any additional information and serves the same purpose as a
non-encrypted one, namely I/O corruption detection. I took a look at
pg_checksum and besides checking for empty pages, the checksum
validation path does not interpret any other fields to calculate the
checksum. I think even the offline checksum enabling path looks like
it may work out of the box. Checksums of encrypted data are not a
replacement for a MAC and this would allow that validation to run
without any knowledge of keys.

Related, I think CTR mode should be considered for pages. It has
performance advantages at 8K data sizes, but even better, it allows
arbitrary bytes of the ciphertext to be replaced. For example, after
encrypting a block you can replace the two checksum bytes with the CRC
of the ciphertext, vs. CBC mode where that would cause corruption to
cascade forward. The same could be used for leaving things like
pd_pagesize_version in plaintext at its current offset. For anything
deemed non-sensitive, having it readable without having to decrypt the
page is useful.
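
As a rough standalone sketch of that idea (this uses OpenSSL's EVP API, not
PostgreSQL code; the checksum offset and the toy checksum function are
placeholders I made up for illustration):

  #include <openssl/evp.h>
  #include <stdint.h>
  #include <string.h>

  #define PAGE_SIZE       8192
  #define CHECKSUM_OFFSET 8     /* assumed location of the 2-byte checksum */

  /* Trivial 16-bit checksum stand-in (a real one would differ). */
  static uint16_t
  toy_checksum16(const unsigned char *buf, size_t len)
  {
      uint32_t    sum = 0;
      size_t      i;

      for (i = 0; i < len; i++)
          sum = (sum * 31 + buf[i]) & 0xFFFF;
      return (uint16_t) sum;
  }

  /*
   * Encrypt one page with AES-256-CTR, then stamp a checksum of the
   * ciphertext into the checksum slot.  Because CTR is a stream cipher,
   * overwriting those two bytes does not corrupt decryption of the rest of
   * the page; in CBC the damage would cascade into the following blocks.
   */
  static int
  encrypt_page_ctr(const unsigned char *key,  /* 32 bytes */
                   const unsigned char *iv,   /* 16 bytes: per-page nonce/counter */
                   const unsigned char *plain,
                   unsigned char *cipher)
  {
      EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
      int         outlen = 0;
      uint16_t    crc;

      if (ctx == NULL ||
          EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) != 1 ||
          EVP_EncryptUpdate(ctx, cipher, &outlen, plain, PAGE_SIZE) != 1)
      {
          EVP_CIPHER_CTX_free(ctx);
          return -1;
      }
      EVP_CIPHER_CTX_free(ctx);

      /* Checksum the ciphertext with the checksum slot zeroed, then store it;
       * a verification tool can re-check it without knowing any keys. */
      memset(cipher + CHECKSUM_OFFSET, 0, sizeof(crc));
      crc = toy_checksum16(cipher, PAGE_SIZE);
      memcpy(cipher + CHECKSUM_OFFSET, &crc, sizeof(crc));
      return 0;
  }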

It does not have to be full bytes either. CTR mode operates as a
stream of bits, so it's possible to replace nibbles or even individual
bits. It can be something as small as one bit for an "is_encrypted" flag
or a handful of bits used to infer a derived key. For example, with
2 bits you could have 00 represent unencrypted, 01/10 represent the
old/new key, and 11 be reserved for future use. Something like that could
facilitate online key rotation.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: SQL/JSON path issues/questions

2019-07-20 Thread Steven Pousty
Thanks so much, hope to get to it over this weekend.

On Sat, Jul 20, 2019, 11:48 AM Alexander Korotkov 
wrote:

> Hi Steven,
>
> On Fri, Jul 19, 2019 at 9:53 PM Steven Pousty 
> wrote:
> > I would like to help review this documentation. Can you please point me
> in the right direction?
>
> Thank you for your interest.  You're welcome to do review.
>
> Please take a look at instruction for reviewing a patch [1] and
> working with git [2].  Also, in order to build a doc you will need to
> setup a toolset first [3].
>
> Links
>
> 1. https://wiki.postgresql.org/wiki/Reviewing_a_Patch
> 2. https://wiki.postgresql.org/wiki/Working_with_git#Testing_a_patch
> 3. https://www.postgresql.org/docs/devel/docguide-toolsets.html
>
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


Re: [RFC] Removing "magic" oids

2019-07-20 Thread Andres Freund
Hi,

On 2019-07-20 11:21:52 -0700, Noah Misch wrote:
> On Fri, Jul 19, 2019 at 10:12:57AM -0700, Andres Freund wrote:
> > On 2019-07-07 10:00:35 -0700, Noah Misch wrote:
> > > +# Test concurrent OID generation via pg_enum_oid_index.  This indirectly
> > > +# exercises LWLock and spinlock concurrency.
> > > +my $labels = join ',', map { "'l$_'" } 1 .. 1000;
> > >  pgbench(
> > >   '--no-vacuum --client=5 --protocol=prepared --transactions=25',
> > >   0,
> > >   [qr{processed: 125/125}],
> > >   [qr{^$}],
> > > - 'concurrent insert workload',
> > > + 'concurrent OID generation',
> > >   {
> > >   '001_pgbench_concurrent_insert' =>
> > > -   'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);'
> > > +   "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE 
> > > pg_temp.e;"
> > >   });
> > 
> > Hm, perhaps we should just do something stupid an insert into a catalog
> > table, determining the oid to insert with pg_nextoid?  That ought to be a
> > lot faster and thus more "stress testing" than going through a full
> > blown DDL statement?  But perhaps that's just too ugly.
> 
> I expect the pg_nextoid strategy could have sufficed.  The ENUM strategy
> wastes some time parsing 1000 label names, discarding odd-numbered OIDs, and
> dropping the type.  The pg_nextoid strategy wastes time by performing the
> insertion loop in the executor instead of dedicated C code of
> EnumValuesCreate().  Hard to say how to weight those factors.

Fair enough. Are you planning to commit your changes?

Greetings,

Andres Freund




Re: [RFC] Removing "magic" oids

2019-07-20 Thread Noah Misch
On Sat, Jul 20, 2019 at 12:56:34PM -0700, Andres Freund wrote:
> On 2019-07-20 11:21:52 -0700, Noah Misch wrote:
> > On Fri, Jul 19, 2019 at 10:12:57AM -0700, Andres Freund wrote:
> > > On 2019-07-07 10:00:35 -0700, Noah Misch wrote:
> > > > +# Test concurrent OID generation via pg_enum_oid_index.  This 
> > > > indirectly
> > > > +# exercises LWLock and spinlock concurrency.
> > > > +my $labels = join ',', map { "'l$_'" } 1 .. 1000;
> > > >  pgbench(
> > > > '--no-vacuum --client=5 --protocol=prepared --transactions=25',
> > > > 0,
> > > > [qr{processed: 125/125}],
> > > > [qr{^$}],
> > > > -   'concurrent insert workload',
> > > > +   'concurrent OID generation',
> > > > {
> > > > '001_pgbench_concurrent_insert' =>
> > > > - 'INSERT INTO insert_tbl SELECT FROM 
> > > > generate_series(1,1000);'
> > > > + "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE 
> > > > pg_temp.e;"
> > > > });
> > > 
> > > Hm, perhaps we should just do something stupid an insert into a catalog
> > > table, determining the oid to insert with pg_nextoid?  That ought to be a
> > > lot faster and thus more "stress testing" than going through a full
> > > blown DDL statement?  But perhaps that's just too ugly.
> > 
> > I expect the pg_nextoid strategy could have sufficed.  The ENUM strategy
> > wastes some time parsing 1000 label names, discarding odd-numbered OIDs, and
> > dropping the type.  The pg_nextoid strategy wastes time by performing the
> > insertion loop in the executor instead of dedicated C code of
> > EnumValuesCreate().  Hard to say how to weight those factors.
> 
> Fair enough. Are you planning to commit your changes?

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a0cbb88524e8b6121597285b811640ee793b3e8




Re: thoughts on "prevent wraparound" vacuum

2019-07-20 Thread Andres Freund
Hi,

On 2019-07-20 15:35:57 +0300, Michail Nikolaev wrote:
> Currently I am working a lot with cluster consist a few of big tables.
> About 2-3 TB. These tables are heavily updated, some rows are removed, new
> rows are inserted... Kind of typical OLTP workload.
> 
> Physical table size keeps mostly stable while regular VACUUM is working. It
> is fast enough to clean some place from removed rows.
> 
> But time to time "to prevent wraparound" comes. And it works like 8-9 days.
> During that time relation size starting to expand quickly. Freezing all
> blocks in such table takes a lot of time and bloat is generated much more
> quickly.

Several questions:
- Which version of postgres is this? Newer versions avoid scanning
  unchanged parts of the heap even for freezing (9.6+, with additional
  smaller improvements in 11).
- Have you increased the vacuum cost limits? Before PG 12 they're so low
  that they're entirely unsuitable for larger databases, and even in 12 you
  should likely increase them for a multi-TB database.

Unfortunately even if those are fixed the indexes are still likely going
to be scanned in their entirety - but most of the time not modified
much, so that's not as bad.

Greetings,

Andres Freund




Re: [RFC] Removing "magic" oids

2019-07-20 Thread Andres Freund
Hi,

On 2019-07-20 12:58:55 -0700, Noah Misch wrote:
> On Sat, Jul 20, 2019 at 12:56:34PM -0700, Andres Freund wrote:
> > Fair enough. Are you planning to commit your changes?
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a0cbb88524e8b6121597285b811640ee793b3e8

Oh, sorry for that.  Still haven't fully caught up with things that
happened while I was on vacation.

Thanks.




Re: benchmarking Flex practices

2019-07-20 Thread Tom Lane
John Naylor  writes:
> The pre-existing ecpg var "state_before" was a bit confusing when
> combined with the new var "state_before_quote_stop", and the former is
> also used with C-comments, so I decided to go with
> "state_before_lit_start" and "state_before_lit_stop". Even though
> comments aren't literals, it's less of a stretch than referring to
> quotes. To keep things consistent, I went with the latter var in psql
> and core.

Hm, what do you think of "state_before_str_stop" instead?  It seems
to me that both "quote" and "lit" are pretty specific terms, so
maybe we need something a bit vaguer.

> To get the regression tests to pass, I had to add this:
>  psql_scan_in_quote(PsqlScanState state)
>  {
> - return state->start_state != INITIAL;
> + return state->start_state != INITIAL &&
> + state->start_state != xqs;
>  }
> ...otherwise with parens we sometimes don't get the right prompt and
> we get empty lines echoed. Adding xuend and xuchar here didn't seem to
> make a difference. There might be something subtle I'm missing, so I
> thought I'd mention it.

I think you would see a difference if the regression tests had any cases
with blank lines between a Unicode string/ident and the associated
UESCAPE and escape-character literal.
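
For instance (an illustrative case, not one taken from the regression tests),
the scanner has to treat all of the following as a single Unicode literal
even though whitespace and a blank line separate the pieces:

    SELECT U&'d!0061t!+000061'

           UESCAPE '!';   -- evaluates to 'data'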

While poking at that, I also came across this unhappiness:

regression=# select u&'foo' uescape 'bogus';
regression'# 

that is, psql thinks we're still in a literal at this point.  That's
because the uesccharfail rule eats "'b" and then we go to INITIAL
state, so that consuming the last "'" puts us back in a string state.
The backend would have thrown an error before parsing as far as the
incomplete literal, so it doesn't care (or probably not, anyway),
but that's not an option for psql.

My first reaction as to how to fix this was to rip the xuend and
xuchar states out of psql, and let it just lex UESCAPE as an
identifier and the escape-character literal like any other literal.
psql doesn't need to account for the escape character's effect on
the meaning of the Unicode literal, so it doesn't have any need to
lex the sequence as one big token.  I think the same is true of ecpg
though I've not looked really closely.

However, my second reaction was that maybe you were on to something
upthread when you speculated about postponing de-escaping of
Unicode literals into the grammar.  If we did it like that then
we would not need to have this difference between the backend and
frontend lexers, and we'd not have to worry about what
psql_scan_in_quote should do about the whitespace before and after
UESCAPE, either.

So I'm feeling like maybe we should experiment to see what that
solution looks like, before we commit to going in this direction.
What do you think?


> With the unicode escape rules brought over, the diff to the ecpg
> scanner is much cleaner now. The diff for the C-comment rules was still
> pretty messy in comparison, so I made an attempt to clean that up in
> 0002. A bit off-topic, but I thought I should offer that while it was
> fresh in my head.

I didn't really review this, but it looked like a fairly plausible
change of the same ilk, ie combine rules by adding memory of the
previous start state.

regards, tom lane




Re: [PATCH v2] Introduce spgist quadtree @<(point,circle) operator

2019-07-20 Thread Alexander Korotkov
Hi Matwey,

On Tue, May 21, 2019 at 10:23 AM Matwey V. Kornilov
 wrote:
> вт, 21 мая 2019 г. в 08:43, Michael Paquier :
> >
> > On Mon, May 20, 2019 at 02:32:39PM +0300, Matwey V. Kornilov wrote:
> > > This patch series is to add support for spgist quadtree @<(point,circle)
> > > operator. The first two patches are to refactor existing code before
> > > implementing the new feature. The third commit is the actual
> > > implementation
> > > provided with a set of simple unit tests.
> >
> > Could you add that to the next commit fest please?  Here you go:
> > https://commitfest.postgresql.org/23/
>
> Done

Thank you for posting this patch.  I took a look at it.

It appears that you use quadrant-based checks.  But that seems to be
lossy in comparison with box-based checks.  Let me explain with an
example.  Imagine centroids (0,1) and (1,0).  The square (0,0)-(1,1) is
the intersection of quadrant 2 of (0,1) and quadrant 4 of (1,0).  Now
imagine a circle with center (2,2) and radius 1.  It intersects both
quadrants, but does not intersect the square.
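
A quick arithmetic check of that example, just as an illustration: the
square's corner nearest to the circle's center is (1,1), and

    SELECT point '(2,2)' <-> point '(1,1)' AS dist;  -- ~1.414, i.e. > radius 1

so the circle indeed misses the square while still touching both quadrants.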

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-07-20 Thread James Coleman
On Sat, Jul 20, 2019 at 1:00 PM Tomas Vondra
 wrote:
>
> On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:
> >On Sat, Jul 20, 2019 at 12:12 PM James Coleman  wrote:
> >>
> >> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
> >>  wrote:
> >> ...
> >> > >My current line of investigation is whether we need to do anything in
> >> > >the parallel portion of create_ordered_paths(). I noticed that the
> >> > >first-pass patch adding generate_useful_gather_paths() modified that
> >> > >section but wasn't actually adding any new gather-merge paths (just
> >> > >bare incremental sort paths). That seems pretty clearly just a
> >> > >prototype miss, so I modified the prototype to build gather-merge
> >> > >paths instead (as a side note that change seems to fix an oddity I was
> >> > >seeing where plans would include a parallel index scan node even
> >> > >though they weren't parallel plans). While the resulting plan for
> >> > >something like:
> >> > >
> >> >
> > > > Yes, that seems to be a bug. The block above it clearly has gather
> >> > merge nodes, so this one should too.
> >> >
> >> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
> >> > >t.a, t.b limit 50;
> >> > >
> >> > >changes cost (to be cheaper) ever so slightly with the gather-merge
> >> > >addition to create_ordered_paths(), the plan itself is otherwise
> >> > >identical (including row estimates):
> >> > >
> >> > >Limit
> >> > >  -> Gather Merge
> >> > >   -> Incremental Sort
> >> > >  -> Parallel Index Scan
> >> > >
> >> > >(Note: I'm forcing parallel plans here with: set
> >> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
> >> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
> >> > >min_parallel_index_scan_size=0;)
> >> > >
> >> > >I can't seem to come up with a case where adding these gather-merge
> >> > >paths in create_ordered_paths() isn't entirely duplicative of paths
> >> > >already created by generate_useful_gather_paths() as called from
> >> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
> >> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
> >> > >are called by grouping_planner().
> >> > >
> >> > >Can you think of a case I'm missing here that would make it valuable
> >> > >to generate new parallel plans in create_ordered_paths()?
> >> > >
> >> >
> >> > Good question. Not sure. I think such path would need to do something on
> >> > a relation that is neither a join nor a scan - in which case the path
> >> > should not be created by apply_scanjoin_target_to_paths().
> >> >
> >> > So for example a query like this:
> >> >
> >> >   SELECT
> >> >   a, b, sum(expensive_function(c))
> >> >   FROM
> >> >   t
> >> >   GROUP BY a,b
> >> >   ORDER BY a,sum(...) LIMIT 10;
> >> >
> >> > should be able to produce a plan like this:
> >> >
> >> >   -> Limit
> >> >   -> Gather Merge
> >> > -> Incremental Sort (pathkeys: a, sum)
> >> >   -> Group Aggregate
> >> >  a, b, sum(expensive_function(c))
> >> > -> Index Scan (pathkeys: a, b)
> >> >
> >> > or something like that, maybe. I haven't tried this, though. The other
> >> > question is whether such queries are useful in practice ...
> >>
> >> Hmm, when I step through that example, input_rel->partial_pathlist
> >> != NIL is false, so we don't even attempt to consider any extra
> >> parallel paths in create_ordered_paths(). Nonetheless we get a
> >> parallel plan, but with a different shape:
> >>
> >>  Limit
> >>->  Incremental Sort
> >>  Sort Key: a, (sum(expensive_function(c)))
> >>  Presorted Key: a
> >>  ->  Finalize GroupAggregate
> >>Group Key: a, b
> >>->  Gather Merge
> >>  Workers Planned: 4
> >>  ->  Partial GroupAggregate
> >>Group Key: a, b
> >>->  Sort
> >>  Sort Key: a, b
> >>  ->  Parallel Seq Scan on t
> >>
> >> (or if I disable seqscan then the sort+seq scan above becomes inc sort
> >> + index scan)
> >>
> >> To be honest, I don't think I understand how you would get a plan like
> >> that anyway since the index here only has "a" and so can't provide
> >> (pathkeys: a, b).
> >>
> >> Perhaps there's something I'm still missing though.
>
> I wasn't strictly adhering to the example we used before, and I imagined
> there would be an index on (a,b). Sorry if that wasn't clear.
>
> >
> >Also just realized I don't think (?) we can order by the sum inside a
> >gather-merge -- at least not without having another sort above the
> >parallel portion? Or is the group aggregate able to also provide
> >ordering on the final sum after aggregating the partial sums?
> >
>
> Yes, you're right - an extra sort node would be necessary. But I don't
> think that changes the idea behind that example. The question is whether

Re: Catching missing Datum conversions

2019-07-20 Thread Corey Huinker
>
> I should probably split this into "actionable" (categories 3 and 4)
> and "noise and scaffolding" patches.
>

Breaking down the noise-and-scaffolding into some subgroups might make the
rather long patches more palatable/exceedingly-obvious:
* (Datum) 0 ---> NullDatum
* 0 ---> NullDatum
* The DatumGetPointer(allParameterTypes) null tests

Having said that, everything you did seems really straightforward, except
for

src/backend/rewrite/rewriteDefine.c
src/backend/statistics/mcv.c
src/backend/tsearch/ts_parse.c

and those seem like cases where the DatumGetXXX was a no-op before Datum
was a struct.


Re: contrib make check-world fail if data have been modified and there's vpath

2019-07-20 Thread Tom Lane
didier  writes:
> Because there's existing destination data, the src/makefiles/pgxs.mk line
>     ln -s $< $@
> fails, and make clean doesn't remove these links.
> ln -sf is an option, but it's not tested in configure; or rm -f.

Can you be more specific about what the problem case is?

regards, tom lane




Fix typos and inconsistencies for HEAD (take 7)

2019-07-20 Thread Alexander Lakhin
Hello hackers,

Please consider fixing the next pack of typos and inconsistencies in the
tree:
7.1. h04m05s06 -> h04mm05s06 (in fact it's broken since 6af04882, but
h04mm05s06.789 still accepted)
7.2. hasbucketcleanup -> hashbucketcleanup
7.3. _hashm_spare -> hashm_spares
7.4. hashtbl -> hash table
7.5. HAS_RELFILENODES -> XINFO_HAS_RELFILENODES
7.6. HAS_SUBXACT -> XINFO_HAS_SUBXACT
7.7. HAVE_FCVT -> remove (survived after ff4628f3)
7.8. HAVE_FINITE -> remove (orphaned after cac2d912)
7.9. HAVE_STRUCT_SOCKADDR_UN -> remove (not used since introduction in
399a36a7)
7.10. HAVE_SYSCONF -> remove (survived after ff4628f3)
7.11. HAVE_ZLIB -> HAVE_LIBZ
7.12. HEAP_CLEAN -> XLOG_HEAP2_CLEAN
7.13. HEAP_CONTAINS_NEW_TUPLE_DATA -> XLH_UPDATE_CONTAINS_NEW_TUPLE,
XLOG_HEAP_CONTAINS_OLD_TUPLE -> XLH_UPDATE_CONTAINS_OLD_TUPLE,
XLOG_HEAP_CONTAINS_OLD_KEY -> XLH_UPDATE_CONTAINS_OLD_KEY,
XLOG_HEAP_PREFIX_FROM_OLD -> XLH_UPDATE_PREFIX_FROM_OLD,
XLOG_HEAP_SUFFIX_FROM_OLD -> XLH_UPDATE_SUFFIX_FROM_OLD (renamed in
168d5805)
7.14. HEAP_FREEZE -> FREEZE_PAGE (an inconsistency since introduction in
48188e16)
7.15. heapio.c -> hio.c
7.16. heap_newpage -> XLOG_FPI (orphaned since 54685338)
7.17. heaxadecimal -> hexadecimal
7.18. hostlen -> nodelen, host -> node, serv -> service, servlen ->
servicelen
7.19. i386s -> x86_64
7.20. IConst/FConst  -> ICONST/FCONST
7.21. imit -> limit
7.22. IN_ARCHIVE_RECOVERY -> DB_IN_ARCHIVE_RECOVERY
7.23. ind_arraysize, ind_value -> ind_arrsize, ind_pointer
7.24. index_getnext -> index_getnext_slot
7.25. IndexTupleVector -> IndexTuple vector
7.26. innerConsistent -> innerConsistentFn
7.27. in-progres -> in-progress
7.28. inspire with -> inspired by the (sync with 192b0c94)
7.29. internalerrpos -> internalerrposition
7.30. internal_procname -> internal_proname
7.31. interruptOK -> remove (orphaned after d0699571)
7.32. intratransaction -> intra-transaction
7.33. InvalidOffset -> InvalidOffsetNumber
7.34. invtrans -> invtransfn
7.35. isbuiltin -> fmgr_isbuiltin
7.36. iself -> itself
7.37. isnoinherit -> noinherit
7.38. ISO_DATES -> USE_ISO_DATES
7.39. isParentRoot -> remove (orphaned after 218f5158)
7.40. isPrefix -> prefix
7.41. ItemPointerIsMax -> remove (orphaned after e20c70cb)
7.42. itemsin -> items in
7.43. jbVal -> jbval
7.44. json_plperl -> jsonb_plperlu
7.45. jvbBinary -> jbvBinary
7.46. keyAttrs -> attrKind
7.47. keyinfo -> key info
7.48. key_modified -> key_changed
7.49. killitems -> killedItems
7.50. KnownAssignedTransactions -> KnownAssignedTransactionIds

Also, I found the e-mail headers in optimizer/plan/README to be no longer
relevant, so I propose removing them.
And another finding is related to the sleep effective resolution. `man 7
time` says "Since kernel 2.6.13, the HZ value is a kernel configuration 
parameter  and  can  be 100, 250 (the default) ...", so the 10
milliseconds is not the most common effective resolution nowadays.
I propose the corresponding patch for pgsleep.c, but we have a similar
statement in doc/.../config.sgml. I think it should be fixed too.

Best regards,
Alexander
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a07bd27a0e..487103fb79 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -793,7 +793,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	be confused into returning the same tuple more than once or some tuples
  *	not at all by the rearrangement we are performing here.  To prevent
  *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	similar to hashbucketcleanup.  Refer comments atop hashbucketcleanup.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 376ee2a63b..defdc9b408 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -509,7 +509,7 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
 	 * total number of buckets which has to be allocated before using its
-	 * _hashm_spare element. However always force at least 2 bucket pages. The
+	 * hashm_spares element. However always force at least 2 bucket pages. The
 	 * upper limit is determined by considerations explained in
 	 * _hash_expandtable().
 	 */
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 54ea69f7f1..4d8db1a060 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -1855,7 +1855,7 @@ DecodeTimeOnly(char **field, int *ftype, int nf,
 
 /*
  * Was this an "ISO time" with embedded field labels? An
- * example is "h04m05s06" - thomas 2001-02-04
+ * example is "h04mm05s06" - thom

Re: Performance issue in foreign-key-aware join estimation

2019-07-20 Thread David Rowley
On Thu, 18 Jul 2019 at 19:24, David Rowley  wrote:
> Unless there's some objection, I'll be looking into pushing both 0001
> and 0002 in a single commit in the next few days.

I've pushed this after doing a bit of final tweaking.

After a re-read, I didn't really like all the code that rechecked that
ec->ec_relids matched the relation we're searching for.  The only code
that seemed able to put entries into eclass_indexes for an EC that makes
no mention of the given relation was add_child_rel_equivalences().  The
comments on the definition of eclass_indexes said nothing about the
field possibly containing an index for an EC that knows nothing about
the given relation.  Fixing that meant either amending the comment to
say "they *may* contain", or fixing up the code so that it's strict
about which ECs can be mentioned in eclass_indexes.  I went with fixing
the code, since that also allows us to get rid of some redundant checks,
which I turned into Asserts() to catch any future bugs that might be
introduced by code that one day removes rels from an EC, e.g. something
like
https://commitfest.postgresql.org/23/1712/

I also did some performance tests on the simplest query I could think
of that uses eclasses.

select2.sql: SELECT * FROM t1 INNER JOIN t2 ON t1.a=t2.a
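
The table definitions weren't given in the mail; a plausible setup for
reproducing the numbers below (purely an assumption) would be something
like:

    CREATE TABLE t1 (a int PRIMARY KEY, b int);
    CREATE TABLE t2 (a int PRIMARY KEY, b int);
    INSERT INTO t1 SELECT g, g FROM generate_series(1, 1000) g;
    INSERT INTO t2 SELECT g, g FROM generate_series(1, 1000) g;
    VACUUM ANALYZE t1;
    VACUUM ANALYZE t2;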

Master:

$ pgbench -n -f select2.sql -T 60 postgres
tps = 12143.597276 (excluding connections establishing)
tps = 12100.773839 (excluding connections establishing)
tps = 12086.209389 (excluding connections establishing)
tps = 12098.194927 (excluding connections establishing)
tps = 12105.140058 (excluding connections establishing)

Patched:

$ pgbench -n -f select2.sql -T 60 postgres
tps = 12224.597530 (excluding connections establishing)
tps = 12097.286522 (excluding connections establishing)
tps = 12035.306729 (excluding connections establishing)
tps = 11965.848289 (excluding connections establishing)
tps = 12059.846474 (excluding connections establishing)

There's a bit of noise there, but on average we're just 0.25% slower in
the worst case, while the benchmarks shown above put us ~406% better off
with the fairly complex query that Tom posted in the initial email on
this thread.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: thoughts on "prevent wraparound" vacuum

2019-07-20 Thread Michail Nikolaev
Hello.

>- Which version of postgres is this? Newer versions avoid scanning
>  unchanged parts of the heap even for freezing (9.6+, with additional
>  smaller improvements in 11).

Oh, totally forgot about version and settings...

server_version 10.9 (Ubuntu 10.9-103)

So, "don't vacuum all-frozen pages" included.

> - have you increased the vacuum cost limits? Before PG 12 they're so low
>   they're entirely unsuitable for larger databases, and even in 12 you
>   should likely increase them for a multi-TB database

Current settings are:

autovacuum_max_workers 8
autovacuum_vacuum_cost_delay 5ms
autovacuum_vacuum_cost_limit 400
autovacuum_work_mem -1

vacuum_cost_page_dirty 40
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10

"autovacuum_max_workers" set to 8 because server needs to process a lot of
changing relations.
Settings were more aggressive previously (autovacuum_vacuum_cost_limit was
2800) but it leads to very high IO load causing issues with application
performance and stability (even on SSD).
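
If raising the global limit again is too risky for the cluster as a whole,
per-table overrides on just the large tables are another knob; the table
name and values below are placeholders, not a recommendation from this
thread:

    ALTER TABLE my_big_table SET (
        autovacuum_vacuum_cost_limit = 1000,
        autovacuum_vacuum_cost_delay = 2
    );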

 "vacuum_cost_page_dirty" was set to 40 few month ago. High IO write peaks
were causing application requests to stuck into WALWriteLock.
After some investigations we found it was caused by WAL-logging peaks.
Such WAL-peaks are mostly consist of such records:

Type                      N      (%)     Record size      (%)       FPI size      (%)   Combined size      (%)
----                      -      ---     -----------      ---       --------      ---   -------------      ---
Heap2/CLEAN           10520 (  0.86)          623660 (  0.21)        5317532 (  0.53)         5941192 (  0.46)
Heap2/FREEZE_PAGE    113419 (  9.29)         6673877 (  2.26)      635354048 ( 63.12)       642027925 ( 49.31)

another example:

Heap2/CLEAN          196707 (  6.96)        12116527 (  1.56)      292317231 ( 37.77)       304433758 ( 19.64)
Heap2/FREEZE_PAGE      1819 (  0.06)          104012 (  0.01)       13324269 (  1.72)        13428281 (  0.87)

Thanks,
Michail.