Re: POC: Cleaning up orphaned files using undo logs
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas wrote: > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila wrote: > > We are doing exactly what you have written in the last line of the > > next paragraph "stop the transaction from writing undo when the hash > > table is already too full.". So we will > > never face the problems related to repeated crash recovery. The > > definition of too full is that we stop allowing the new transactions > > that can write undo when we have the hash table already have entries > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > make sense? > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > 1. It looks like the test you added to PrepareUndoInsert has no > locking, and I don't see how that can be right. > > 2. It seems like this is would result in testing for each new undo > insertion that gets prepared, whereas surely we would want to only > test when first attaching to an undo log. If you've already attached > to the undo log, there's no reason not to continue inserting into it, > because doing so doesn't increase the number of transactions (or > transaction-persistence level combinations) that need undo. > I agree that it should not be for each undo insertion rather whenever any transaction attached to an undo log. > 3. I don't think the test itself is correct. It can fire even when > there's no problem. It is correct (or would be if it said 2 * > MaxBackends) if every other backend in the system is already attached > to an undo log (or two). But if they are not, it will block > transactions from being started for no reason. > Right, we should find a way to know the exact number of transactions that are attached to undo-log at any point in time, then we can have a more precise check. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead
Hi Nikolay, Thanks for the revised patch. It applies now with no problem, and seems to work fine. For me, I still find the relopts area quite odd, and I wonder if your patch doesn’t go far enough? For example, take log_autovacuum_min_duration. It’s described in intRelOpts, which implicitly defines its type (an int), and explicitly sets its min/max, default, and lock level, as well as its kind (HEAP | TOAST). But then it’s described again in the relopt_parse_elt[] in toast_reloptions() and heap_reloptions(), which implicitly defines it to apply to HEAP | TOAST, and the fact that it is an int. (It’s, of course, the same for all reloptions.) The relopt_parse_elt[] is almost entirely duplicative. I wonder if it is not best simply to consolidate — either push all the info into the static {bool,int,real,string}RelOpts[] structures, or push it all into the relopt_parse_elt[]. I notice the thread earlier talks of some of the APIs being public interfaces, which may be used by EXTENSIONs. I’m not sure I buy that in its fullest sense. For sure, an EXTENSION may invoke the APIs. But no EXTENSION can define new or alter existing options, because the {bool,int,real,string}RelOpts[] are not currently runtime-extendable. I guess we probably should preserve the API’s functionality, and allow fillRelOptions(), if provided with a tab, to restrict filling to only those options supplied in the tab. But in general core code, my opinion is that fillRelOptions() could be provided with a NULL tab, and scavenge all the needed information from the static {bool,int,real,string}RelOpts[] structures. That links to what I initially found most confusing: AUTOVACUUM_RELOPTIONS. I understand it’s there because there are a bunch of shared reloptions. Pre-patch, default_reloptions() meant there was no need for the macro, but your patch drops default_reloptions(). default_reloptions() is horrible, but I feel the macro approach is worse. :-) Sorry for the delay providing the feedback. It took me a few sittings to grok what was going on, and to work out what I thought about it. denty. > On 12 Jul 2019, at 15:13, Nikolay Shaplov wrote: > > In a message on Thursday, 4 July 2019 at 19:44:42 MSK, Dent John wrote: >> Hi Nikolay, >> >> I have had a crack at re-basing the current patch against 12b2, but I didn’t >> trivially succeed. >> >> It’s probably more my bodged fixing of the rejections than a fundamental >> problem. But I now get compile fails in — and seems like only — vacuum.c. > > I am really sorry for such a big delay. Two new relation options were added, and it > needed careful checking while importing them back into the patch. > I did not find a proper timeslot to do this quickly enough. > Hope I am not terribly late. > Here goes an updated version.
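For readers following along, the duplication being described looks roughly like the following. This is an abridged paraphrase from memory of src/backend/access/common/reloptions.c (descriptions, field order, and the exact offset expression may differ in the tree), not a verbatim quote; it only illustrates how the same option ends up declared twice:

/* once in the static option catalog: type, kinds, limits, lock level */
static relopt_int intRelOpts[] =
{
	/* ... */
	{
		{
			"log_autovacuum_min_duration",
			"Sets the minimum execution time above which autovacuum actions will be logged",
			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
			ShareUpdateExclusiveLock
		},
		-1, -1, INT_MAX
	},
	/* ... */
};

/* ... and again in the parse table that maps it into StdRdOptions */
static const relopt_parse_elt tab[] =
{
	/* ... */
	{"log_autovacuum_min_duration", RELOPT_TYPE_INT,
	 offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
	/* ... */
};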
Re: POC: Cleaning up orphaned files using undo logs
On Sat, Jul 20, 2019 at 12:40 PM Amit Kapila wrote: > > On Fri, Jul 19, 2019 at 6:35 PM Robert Haas wrote: > > > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila > > wrote: > > > We are doing exactly what you have written in the last line of the > > > next paragraph "stop the transaction from writing undo when the hash > > > table is already too full.". So we will > > > never face the problems related to repeated crash recovery. The > > > definition of too full is that we stop allowing the new transactions > > > that can write undo when we have the hash table already have entries > > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > > make sense? > > > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > > > 1. It looks like the test you added to PrepareUndoInsert has no > > locking, and I don't see how that can be right. > > > > 2. It seems like this is would result in testing for each new undo > > insertion that gets prepared, whereas surely we would want to only > > test when first attaching to an undo log. If you've already attached > > to the undo log, there's no reason not to continue inserting into it, > > because doing so doesn't increase the number of transactions (or > > transaction-persistence level combinations) that need undo. > > > > I agree that it should not be for each undo insertion rather whenever > any transaction attached to an undo log. > > > 3. I don't think the test itself is correct. It can fire even when > > there's no problem. It is correct (or would be if it said 2 * > > MaxBackends) if every other backend in the system is already attached > > to an undo log (or two). But if they are not, it will block > > transactions from being started for no reason. > > > > Right, we should find a way to know the exact number of transactions > that are attached to undo-log at any point in time, then we can have a > more precise check. Maybe we can make ProcGlobal->xactsHavingPendingUndo an atomic variable. We can increment its value atomically whenever a) A transaction writes the first undo record for each persistence level. b) For each abort request inserted by the 'StartupPass'. And, we will decrement it when a) The transaction commits (decrement by 1 for each persistence level it has written undo for). b) Rollback request is processed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
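A minimal sketch of the counting protocol described above, written against PostgreSQL's existing atomics API (port/atomics.h). The counter name xactsHavingPendingUndo comes from the proposal in this message; the function names and the placement of the counter are illustrative assumptions, not existing code:

#include "postgres.h"
#include "port/atomics.h"

/* hypothetical shared counter; in the proposal it would live in ProcGlobal */
static pg_atomic_uint32 xactsHavingPendingUndo;

void
UndoXactCountInit(void)
{
	pg_atomic_init_u32(&xactsHavingPendingUndo, 0);
}

/*
 * Increment when a transaction writes its first undo record for a
 * persistence level, and once per abort request queued by the startup pass.
 */
void
UndoXactCountIncrement(void)
{
	pg_atomic_fetch_add_u32(&xactsHavingPendingUndo, 1);
}

/*
 * Decrement at commit (once per persistence level that wrote undo) and when
 * a rollback request has been fully processed.
 */
void
UndoXactCountDecrement(void)
{
	pg_atomic_fetch_sub_u32(&xactsHavingPendingUndo, 1);
}

/*
 * The admission check discussed upthread: refuse to let a transaction start
 * writing undo once the rollback hash table could no longer absorb one entry
 * per transaction-persistence combination that already needs undo.
 */
bool
UndoXactCountAdmits(uint32 hash_table_capacity)
{
	return pg_atomic_read_u32(&xactsHavingPendingUndo) < hash_table_capacity;
}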
Re: pgsql: Sync our copy of the timezone library with IANA release tzcode20
On Fri, Jul 19, 2019 at 02:56:34PM -0400, Tom Lane wrote: > I'm not quite convinced whether that will silence the warning, but > at least it's a bit less unreadable. Thanks for working with upstream on this. From what I can see, woodlouse & friends do not complain anymore. -- Michael
Re: Compiler warnings with MinGW
On Fri, Jul 19, 2019 at 08:41:28AM -0400, Andrew Dunstan wrote: > On 7/19/19 1:08 AM, Michael Paquier wrote: >> The second one is rather obvious to fix, because we don't care about >> the file mode on Windows, so the attached should do the work. I am >> actually surprised that the Visual Studio compilers don't complain >> about that, but let's fix it. >> >> Thoughts? > > +1. Just wanted to double-check something. We usually don't bother back-patching warning fixes like this one, right? -- Michael
thoughts on "prevent wraparound" vacuum
Hello. Currently I am working a lot with a cluster consisting of a few big tables, about 2-3 TB each. These tables are heavily updated, some rows are removed, new rows are inserted... a kind of typical OLTP workload. The physical table size stays mostly stable while regular VACUUM is working; it is fast enough to reclaim the space freed by removed rows. But from time to time a "to prevent wraparound" vacuum comes, and it runs for something like 8-9 days. During that time the relation size starts to expand quickly: freezing all blocks in such a table takes a lot of time, and bloat is generated much more quickly than it can be cleaned up. Of course, after the aggressive vacuum finishes, the table does not shrink back (some kind of repacking is required). And even after repacking, relation shrinking causes the whole cluster to get stuck for some time (due to exclusive locking, see [1]). So, I was thinking about it and I see two possible solutions: 1. Track two block pointers for aggressive vacuum. One is used to freeze all blocks, and the other is used to perform regular vacuum on non-all-visible blocks. The second one is circular (it could process the table multiple times while the first one moves from the start to the end of the table), and some parameter to spread resources between the pointers is required (a rough sketch of this idea follows after this message). 2. Separate the "to prevent wraparound" vacuum from regular vacuum to allow them to run concurrently. But that seems to be much more work. Could you please share some thoughts on it? Is it worth implementing? Thanks. [1] https://www.postgresql.org/message-id/c9374921e50a5e8fb1ecf04eb8c6ebc3%40postgrespro.ru
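A rough sketch of option 1, the two-pointer idea. Everything here is hypothetical: the type and all helper functions are made-up stand-ins for the real freezing and cleanup machinery, and serve only to show how a resource-split parameter could interleave the two pointers:

#include <stdbool.h>

typedef unsigned int BlockNumber;

/* made-up helpers standing in for the real per-block work */
static void freeze_block(BlockNumber blk) { (void) blk; }
static void vacuum_one_block(BlockNumber blk) { (void) blk; }
static bool block_is_all_visible(BlockNumber blk) { (void) blk; return false; }

typedef struct
{
	BlockNumber nblocks;		/* size of the relation */
	BlockNumber freeze_blkno;	/* linear pointer: visits every block once */
	BlockNumber cleanup_blkno;	/* circular pointer: regular cleanup */
	int			freeze_share;	/* freeze steps taken per cleanup step */
} TwoPointerVacuumState;

/*
 * One step of the combined pass: spend freeze_share units of work on
 * freezing, then one unit on ordinary cleanup, wrapping the cleanup pointer
 * around.  Returns false once the freeze pointer reaches the end.
 */
static bool
aggressive_vacuum_step(TwoPointerVacuumState *st)
{
	for (int i = 0; i < st->freeze_share && st->freeze_blkno < st->nblocks; i++)
		freeze_block(st->freeze_blkno++);

	if (!block_is_all_visible(st->cleanup_blkno))
		vacuum_one_block(st->cleanup_blkno);
	st->cleanup_blkno = (st->cleanup_blkno + 1) % st->nblocks;

	return st->freeze_blkno < st->nblocks;
}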
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote: On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra wrote: Now, consider this example: create table t (a int, b int, c int); insert into t select mod(i,100),mod(i,100),i from generate_series(1,1000) s(i); create index on t (a); analyze t; explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1; With 0001+0002+0003 pathes, I get a plan like this: QUERY PLAN Limit (cost=10375.39..10594.72 rows=1 width=16) -> Incremental Sort (cost=10375.39..2203675.71 rows=1 width=16) Sort Key: a, b, (sum(c)) Presorted Key: a, b -> GroupAggregate (cost=10156.07..2203225.71 rows=1 width=16) Group Key: a, b -> Gather Merge (cost=10156.07..2128124.39 rows=1175 width=12) Workers Planned: 2 -> Incremental Sort (cost=9156.04..972856.05 rows=4166740 width=12) Sort Key: a, b Presorted Key: a -> Parallel Index Scan using t_a_idx on t (cost=0.43..417690.30 rows=4166740 width=12) (12 rows) and with 0004, I get this: QUERY PLAN -- Limit (cost=20443.84..20665.32 rows=1 width=16) -> Incremental Sort (cost=20443.84..2235250.05 rows=1 width=16) Sort Key: a, b, (sum(c)) Presorted Key: a, b -> GroupAggregate (cost=20222.37..2234800.05 rows=1 width=16) Group Key: a, b -> Incremental Sort (cost=20222.37..2159698.74 rows=1175 width=12) Sort Key: a, b Presorted Key: a -> Index Scan using t_a_idx on t (cost=0.43..476024.65 rows=1175 width=12) (10 rows) Notice that cost of the second plan is almost double the first one. That means 0004 does not even generate the first plan, i.e. there are cases where we don't try to add the explicit sort before passing the path to generate_gather_paths(). And I think I know why is that - while gather_grouping_paths() tries to add explicit sort below the gather merge, there are other places that call generate_gather_paths() that don't do that. In this case it's probably apply_scanjoin_target_to_paths() which simply builds parallel (seq|index) scan + gather merge and that's it. The problem is likely the same - the code does not know which pathkeys are "interesting" at that point. We probably need to teach planner to do this. I've been working on figuring out sample queries for each of the places we're looking at adding create_increment_sort() (starting with the cases enabled by gather-merge nodes). The generate_useful_gather_paths() call in apply_scanjoin_target_to_paths() is required to generate the above preferred plan. But I found that if I set enable_sort=off (with our without the _useful_ variant of generate_gather_paths()) I get a very similar plan that's actually lower cost yet: QUERY PLAN -- Limit (cost=10255.98..10355.77 rows=1 width=16) -> Incremental Sort (cost=10255.98..1008228.67 rows=1 width=16) Sort Key: a, b, (sum(c)) Presorted Key: a, b -> Finalize GroupAggregate (cost=10156.20..1007778.67 rows=1 width=16) Group Key: a, b -> Gather Merge (cost=10156.20..1007528.67 rows=2 width=16) Workers Planned: 2 -> Partial GroupAggregate (cost=9156.18..1004220.15 rows=1 width=16) Group Key: a, b -> Incremental Sort (cost=9156.18..972869.60 rows=4166740 width=12) Sort Key: a, b Presorted Key: a -> Parallel Index Scan using t_a_idx on t (cost=0.43..417703.85 rows=4166740 width=12) Is that something we should consider a bug at this stage? It's also not clear to mean (costing aside) which plan is intuitively preferable. This seems like a thinko in add_partial_path() - it only looks at total cost of the paths, and ignores the startup cost entirely. 
I've debugged it a bit, and what's happening for the partially-grouped relation is roughly this: 1) We add a partial path with startup/total costs 696263 / 738029 2) We attempt to add the "Partial GroupAggregate" path, but it loses the fight because the total cost (1004207) is higher than the first path. Which entirely misses that the startup cost is way lower. 3) W
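To make the cost trade-off concrete, here is a small standalone program using the startup/total costs of the two partial paths mentioned above (rounded from the plans earlier in the thread). It approximates the cost of fetching the first fraction f of the rows as startup + f * (total - startup), which is a simplification of the planner's fractional-cost logic, and shows that the path rejected on total cost alone is far cheaper whenever only a small fraction of rows is needed (the LIMIT 1 case):

#include <stdio.h>

int
main(void)
{
	/* partial path kept by add_partial_path(): lower total cost */
	double		kept_start = 696263,
				kept_total = 738029;
	/* "Partial GroupAggregate" path rejected for its higher total cost */
	double		rej_start = 9156,
				rej_total = 1004220;
	double		fracs[] = {0.0001, 0.001, 0.01, 0.1, 1.0};

	for (int i = 0; i < 5; i++)
	{
		double		f = fracs[i];
		double		kept = kept_start + f * (kept_total - kept_start);
		double		rej = rej_start + f * (rej_total - rej_start);

		printf("fraction %.4f: kept %.0f, rejected %.0f -> %s path is cheaper\n",
			   f, kept, rej, rej < kept ? "rejected" : "kept");
	}
	return 0;
}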
Re: [sqlsmith] Crash in mcv_get_match_bitmap
On Fri, Jul 19, 2019 at 02:37:19PM -0400, Tom Lane wrote: Tomas Vondra writes: [ mcv fixes ] These patches look OK to me. OK, thanks. Pushed. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sat, Jul 20, 2019 at 10:33:02AM -0400, James Coleman wrote: On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra wrote: On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote: >On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra > wrote: >> Now, consider this example: >> >> create table t (a int, b int, c int); >> insert into t select mod(i,100),mod(i,100),i from generate_series(1,1000) s(i); >> create index on t (a); >> analyze t; >> explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1; >> >> With 0001+0002+0003 pathes, I get a plan like this: >> >> QUERY PLAN >> >> Limit (cost=10375.39..10594.72 rows=1 width=16) >>-> Incremental Sort (cost=10375.39..2203675.71 rows=1 width=16) >> Sort Key: a, b, (sum(c)) >> Presorted Key: a, b >> -> GroupAggregate (cost=10156.07..2203225.71 rows=1 width=16) >>Group Key: a, b >>-> Gather Merge (cost=10156.07..2128124.39 rows=1175 width=12) >> Workers Planned: 2 >> -> Incremental Sort (cost=9156.04..972856.05 rows=4166740 width=12) >>Sort Key: a, b >>Presorted Key: a >>-> Parallel Index Scan using t_a_idx on t (cost=0.43..417690.30 rows=4166740 width=12) >> (12 rows) >> >> >> and with 0004, I get this: >> >> QUERY PLAN >> -- >> Limit (cost=20443.84..20665.32 rows=1 width=16) >>-> Incremental Sort (cost=20443.84..2235250.05 rows=1 width=16) >> Sort Key: a, b, (sum(c)) >> Presorted Key: a, b >> -> GroupAggregate (cost=20222.37..2234800.05 rows=1 width=16) >>Group Key: a, b >>-> Incremental Sort (cost=20222.37..2159698.74 rows=1175 width=12) >> Sort Key: a, b >> Presorted Key: a >> -> Index Scan using t_a_idx on t (cost=0.43..476024.65 rows=1175 width=12) >> (10 rows) >> >> Notice that cost of the second plan is almost double the first one. That >> means 0004 does not even generate the first plan, i.e. there are cases >> where we don't try to add the explicit sort before passing the path to >> generate_gather_paths(). >> >> And I think I know why is that - while gather_grouping_paths() tries to >> add explicit sort below the gather merge, there are other places that >> call generate_gather_paths() that don't do that. In this case it's >> probably apply_scanjoin_target_to_paths() which simply builds >> >>parallel (seq|index) scan + gather merge >> >> and that's it. The problem is likely the same - the code does not know >> which pathkeys are "interesting" at that point. We probably need to >> teach planner to do this. > >I've been working on figuring out sample queries for each of the >places we're looking at adding create_increment_sort() (starting with >the cases enabled by gather-merge nodes). The >generate_useful_gather_paths() call in >apply_scanjoin_target_to_paths() is required to generate the above >preferred plan. As I continue this, I've added a couple of test cases (notably for generate_useful_gather_paths() in both standard_join_search() and apply_scanjoin_target_to_paths()). Those, plus the current WIP state of my hacking on your patch adding generate_useful_gather_paths() is attached as 0001-parallel-and-more-paths.patch. My current line of investigation is whether we need to do anything in the parallel portion of create_ordered_paths(). I noticed that the first-pass patch adding generate_useful_gather_paths() modified that section but wasn't actually adding any new gather-merge paths (just bare incremental sort paths). 
That seems pretty clearly just a prototype miss, so I modified the prototype to build gather-merge paths instead (as a side note that change seems to fix an oddity I was seeing where plans would include a parallel index scan node even though they weren't parallel plans). While the resulting plan for something like: Yes, that seems to be a bug. The block above it clealy has a gather merge nodes, so this one should too. explain analyze select * from t where t.a in (1,2,3,4,5,6) order by t.a, t.b limit 50; changes cost (to be cheaper) ever so slightly with the gather-merge addition to create_ordered_paths(), the plan itself is otherwise identical (including row estimates): Limit -> Gather Merge -> Incremental Sort -> Parallel Index Scan (Note: I'm forcing parallel plans here with: set max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0; set parallel_tuple_cost=0; set parallel_setup_cost=0; set min_
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra wrote: ... > >My current line of investigation is whether we need to do anything in > >the parallel portion of create_ordered_paths(). I noticed that the > >first-pass patch adding generate_useful_gather_paths() modified that > >section but wasn't actually adding any new gather-merge paths (just > >bare incremental sort paths). That seems pretty clearly just a > >prototype miss, so I modified the prototype to build gather-merge > >paths instead (as a side note that change seems to fix an oddity I was > >seeing where plans would include a parallel index scan node even > >though they weren't parallel plans). While the resulting plan for > >something like: > > > > Yes, that seems to be a bug. The block above it clealy has a gather > merge nodes, so this one should too. > > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by > >t.a, t.b limit 50; > > > >changes cost (to be cheaper) ever so slightly with the gather-merge > >addition to create_ordered_paths(), the plan itself is otherwise > >identical (including row estimates): > > > >Limit > > -> Gather Merge > > -> Incremental Sort > > -> Parallel Index Scan > > > >(Note: I'm forcing parallel plans here with: set > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0; > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set > >min_parallel_index_scan_size=0;) > > > >I can't seem to come up with a case where adding these gather-merge > >paths in create_ordered_paths() isn't entirely duplicative of paths > >already created by generate_useful_gather_paths() as called from > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given > >that both apply_scanjoin_target_to_paths() and create_ordered_paths() > >are called by grouping_planner(). > > > >Can you think of a case I'm missing here that would make it valuable > >to generate new parallel plans in create_ordered_paths()? > > > > Good question. Not sure. I think such path would need to do something on > a relation that is neither a join nor a scan - in which case the path > should not be created by apply_scanjoin_target_to_paths(). > > So for example a query like this: > > SELECT > a, b, sum(expensive_function(c)) > FROM > t > GROUP BY a,b > ORDER BY a,sum(...) LIMIT 10; > > should be able to produce a plan like this: > > -> Limit > -> Gather Merge > -> Incremental Sort (pathkeys: a, sum) > -> Group Aggregate > a, b, sum(expensive_function(c)) > -> Index Scan (pathkeys: a, b) > > or something like that, maybe. I haven't tried this, though. The other > question is whether such queries are useful in practice ... Hmm, when I step through on that example input_rel->partial_pathlist != NIL is false, so we don't even attempt to consider any extra parallel paths in create_ordered_paths(). Nonetheless we get a parallel plan, but with a different shape: Limit -> Incremental Sort Sort Key: a, (sum(expensive_function(c))) Presorted Key: a -> Finalize GroupAggregate Group Key: a, b -> Gather Merge Workers Planned: 4 -> Partial GroupAggregate Group Key: a, b -> Sort Sort Key: a, b -> Parallel Seq Scan on t (or if I disable seqscan then the sort+seq scan above becomes inc sort + index scan) To be honest, I don't think I understand how you would get a plan like that anyway since the index here only has "a" and so can't provide (pathkeys: a, b). Perhaps there's something I'm still missing though. James Coleman
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sat, Jul 20, 2019 at 12:12 PM James Coleman wrote: > > On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra > wrote: > ... > > >My current line of investigation is whether we need to do anything in > > >the parallel portion of create_ordered_paths(). I noticed that the > > >first-pass patch adding generate_useful_gather_paths() modified that > > >section but wasn't actually adding any new gather-merge paths (just > > >bare incremental sort paths). That seems pretty clearly just a > > >prototype miss, so I modified the prototype to build gather-merge > > >paths instead (as a side note that change seems to fix an oddity I was > > >seeing where plans would include a parallel index scan node even > > >though they weren't parallel plans). While the resulting plan for > > >something like: > > > > > > > Yes, that seems to be a bug. The block above it clealy has a gather > > merge nodes, so this one should too. > > > > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by > > >t.a, t.b limit 50; > > > > > >changes cost (to be cheaper) ever so slightly with the gather-merge > > >addition to create_ordered_paths(), the plan itself is otherwise > > >identical (including row estimates): > > > > > >Limit > > > -> Gather Merge > > > -> Incremental Sort > > > -> Parallel Index Scan > > > > > >(Note: I'm forcing parallel plans here with: set > > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0; > > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set > > >min_parallel_index_scan_size=0;) > > > > > >I can't seem to come up with a case where adding these gather-merge > > >paths in create_ordered_paths() isn't entirely duplicative of paths > > >already created by generate_useful_gather_paths() as called from > > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given > > >that both apply_scanjoin_target_to_paths() and create_ordered_paths() > > >are called by grouping_planner(). > > > > > >Can you think of a case I'm missing here that would make it valuable > > >to generate new parallel plans in create_ordered_paths()? > > > > > > > Good question. Not sure. I think such path would need to do something on > > a relation that is neither a join nor a scan - in which case the path > > should not be created by apply_scanjoin_target_to_paths(). > > > > So for example a query like this: > > > > SELECT > > a, b, sum(expensive_function(c)) > > FROM > > t > > GROUP BY a,b > > ORDER BY a,sum(...) LIMIT 10; > > > > should be able to produce a plan like this: > > > > -> Limit > > -> Gather Merge > > -> Incremental Sort (pathkeys: a, sum) > > -> Group Aggregate > > a, b, sum(expensive_function(c)) > > -> Index Scan (pathkeys: a, b) > > > > or something like that, maybe. I haven't tried this, though. The other > > question is whether such queries are useful in practice ... > > Hmm, when I step through on that example input_rel->partial_pathlist > != NIL is false, so we don't even attempt to consider any extra > parallel paths in create_ordered_paths(). 
Nonetheless we get a > parallel plan, but with a different shape: > > Limit >-> Incremental Sort > Sort Key: a, (sum(expensive_function(c))) > Presorted Key: a > -> Finalize GroupAggregate >Group Key: a, b >-> Gather Merge > Workers Planned: 4 > -> Partial GroupAggregate >Group Key: a, b >-> Sort > Sort Key: a, b > -> Parallel Seq Scan on t > > (or if I disable seqscan then the sort+seq scan above becomes inc sort > + index scan) > > To be honest, I don't think I understand how you would get a plan like > that anyway since the index here only has "a" and so can't provide > (pathkeys: a, b). > > Perhaps there's something I'm still missing though. Also just realized I don't think (?) we can order by the sum inside a gather-merge -- at least not without having another sort above the parallel portion? Or is the group aggregate able to also provide ordering on the final sum after aggregating the partial sums? James Coleman
Re: [HACKERS] Block level parallel vacuum
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation: not tested Hello, I reviewed the v25 patches and have just a few notes. The synopsis for the "PARALLEL" option is missing (the synopsis block in doc/src/sgml/ref/vacuum.sgml). The prototype for vacuum_log_cleanup_info is missing from the "non-export function prototypes" section. > /* >* Do post-vacuum cleanup, and statistics update for each index if >* we're not in parallel lazy vacuum. If in parallel lazy vacuum, do >* only post-vacum cleanup and update statistics at the end of parallel >* lazy vacuum. >*/ > if (vacrelstats->useindex) > lazy_vacuum_or_cleanup_indexes(vacrelstats, Irel, nindexes, > > indstats, lps, true); > > if (ParallelVacuumIsActive(lps)) > { > /* End parallel mode and update index statistics */ > end_parallel_vacuum(lps, Irel, nindexes); > } I personally do not like updating statistics in different places. Can we change lazy_vacuum_or_cleanup_indexes to write the stats for both the parallel and non-parallel cases? I mean something like this: > if (ParallelVacuumIsActive(lps)) > { > /* Do parallel index vacuuming or index cleanup */ > lazy_parallel_vacuum_or_cleanup_indexes(vacrelstats, Irel, > > nindexes, stats, > > lps, for_cleanup); > if (for_cleanup) > { >... > for (i = 0; i < nindexes; i++) > lazy_update_index_statistics(...); > } > return; > } That way, all the lazy_update_index_statistics calls would be in one place. lazy_parallel_vacuum_or_cleanup_indexes is called only from the parallel leader and waits for all workers, so possibly we could update the stats in lazy_parallel_vacuum_or_cleanup_indexes after the WaitForParallelWorkersToFinish call. Also, a question for discussion: will the vacuumdb parameters --parallel= and --jobs= confuse users? Do we need more description for these options? regards, Sergei
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote: On Sat, Jul 20, 2019 at 12:12 PM James Coleman wrote: On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra wrote: ... > >My current line of investigation is whether we need to do anything in > >the parallel portion of create_ordered_paths(). I noticed that the > >first-pass patch adding generate_useful_gather_paths() modified that > >section but wasn't actually adding any new gather-merge paths (just > >bare incremental sort paths). That seems pretty clearly just a > >prototype miss, so I modified the prototype to build gather-merge > >paths instead (as a side note that change seems to fix an oddity I was > >seeing where plans would include a parallel index scan node even > >though they weren't parallel plans). While the resulting plan for > >something like: > > > > Yes, that seems to be a bug. The block above it clealy has a gather > merge nodes, so this one should too. > > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by > >t.a, t.b limit 50; > > > >changes cost (to be cheaper) ever so slightly with the gather-merge > >addition to create_ordered_paths(), the plan itself is otherwise > >identical (including row estimates): > > > >Limit > > -> Gather Merge > > -> Incremental Sort > > -> Parallel Index Scan > > > >(Note: I'm forcing parallel plans here with: set > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0; > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set > >min_parallel_index_scan_size=0;) > > > >I can't seem to come up with a case where adding these gather-merge > >paths in create_ordered_paths() isn't entirely duplicative of paths > >already created by generate_useful_gather_paths() as called from > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given > >that both apply_scanjoin_target_to_paths() and create_ordered_paths() > >are called by grouping_planner(). > > > >Can you think of a case I'm missing here that would make it valuable > >to generate new parallel plans in create_ordered_paths()? > > > > Good question. Not sure. I think such path would need to do something on > a relation that is neither a join nor a scan - in which case the path > should not be created by apply_scanjoin_target_to_paths(). > > So for example a query like this: > > SELECT > a, b, sum(expensive_function(c)) > FROM > t > GROUP BY a,b > ORDER BY a,sum(...) LIMIT 10; > > should be able to produce a plan like this: > > -> Limit > -> Gather Merge > -> Incremental Sort (pathkeys: a, sum) > -> Group Aggregate > a, b, sum(expensive_function(c)) > -> Index Scan (pathkeys: a, b) > > or something like that, maybe. I haven't tried this, though. The other > question is whether such queries are useful in practice ... Hmm, when I step through on that example input_rel->partial_pathlist != NIL is false, so we don't even attempt to consider any extra parallel paths in create_ordered_paths(). Nonetheless we get a parallel plan, but with a different shape: Limit -> Incremental Sort Sort Key: a, (sum(expensive_function(c))) Presorted Key: a -> Finalize GroupAggregate Group Key: a, b -> Gather Merge Workers Planned: 4 -> Partial GroupAggregate Group Key: a, b -> Sort Sort Key: a, b -> Parallel Seq Scan on t (or if I disable seqscan then the sort+seq scan above becomes inc sort + index scan) To be honest, I don't think I understand how you would get a plan like that anyway since the index here only has "a" and so can't provide (pathkeys: a, b). Perhaps there's something I'm still missing though. 
I wasn't stricly adhering to the example we used before, and I imagined there would be an index on (a,b). Sorry if that wasn't clear. Also just realized I don't think (?) we can order by the sum inside a gather-merge -- at least not without having another sort above the parallel portion? Or is the group aggregate able to also provide ordering on the final sum after aggregating the partial sums? Yes, you're right - an extra sort node would be necessary. But I don't think that changes the idea behind that example. The question is whether the extra sorts below the gather would be cheaper than doing a large sort on top of it - but I don't see why wouldn't that be the case, and if we only need a couple of rows from the beginning ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Fri, Jul 19, 2019 at 04:02:19PM +0200, Antonin Houska wrote: Tomas Vondra wrote: On Fri, Jul 19, 2019 at 12:04:36PM +0200, Antonin Houska wrote: >Tomas Vondra wrote: > >> On Mon, Jul 15, 2019 at 03:42:39PM -0400, Bruce Momjian wrote: >> >On Sat, Jul 13, 2019 at 11:58:02PM +0200, Tomas Vondra wrote: >> >> One extra thing we should consider is authenticated encryption. We can't >> >> just encrypt the pages (no matter which AES mode is used - XTS/CBC/...), >> >> as that does not provide integrity protection (i.e. can't detect when >> >> the ciphertext was corrupted due to disk failure or intentionally). And >> >> we can't quite rely on checksums, because that checksums the plaintext >> >> and is stored encrypted. >> > >> >Uh, if someone modifies a few bytes of the page, we will decrypt it, but >> >the checksum (per-page or WAL) will not match our decrypted output. How >> >would they make it match the checksum without already knowing the key. >> >I read [1] but could not see that explained. >> > >> >> Our checksum is only 16 bits, so perhaps one way would be to just >> generate 64k of randomly modified pages and hope one of them happens to >> hit the right checksum value. Not sure how practical such attack is, but >> it does require just filesystem access. > >I don't think you can easily generate 64k of different checksums this way. If >the data is random, I suppose that each set of 2^(128 - 16) blocks will >contain the the same checksum after decryption. Thus even you generate 64k of >different ciphertext blocks that contain the checksum, some (many?) checksums >will be duplicate. Unfortunately the math to describe this problem does not >seem to be trivial. > I'm not sure what's your point, or why you care about the 128 bits, but I don't think the math is very complicated (and it's exactly the same with or without encryption). The probability of checksum collision for randomly modified page is 1/64k, so p=~0.00153%. So probability of *not* getting a collision is (1-p)=99.9985%. So with N pages, the probability of no collisions is pow((1-p),N) which behaves like this: N = 10000 -> 85%, N = 20000 -> 73%, N = 30000 -> 63%, N = 46000 -> 49%, N = 200000 -> 4%. So with 1.6GB relation you have about 96% chance of a checksum collision. I thought your attack proposal was to find valid (encrypted) checksum for a given encrypted page. Instead it seems that you were only trying to say that it's not too hard to generate page with a valid checksum in general. Thus the attacker can try to modify the ciphertext again and again in a way that is not quite random, but the chance to pass the checksum verification may still be relatively high. >Also note that if you try to generate ciphertext, decryption of which will >result in particular value of checksum, you can hardly control the other 14 >bytes of the block, which in turn are used to verify the checksum. > Now, I'm not saying this attack is particularly practical - it would generate a fair number of checkpoint failures before getting the first collision. So it'd trigger quite a few alerts, I guess. You probably mean "checksum failures". I agree. And even if the checksum passed the verification, page or tuple headers would probably be incorrect and cause other errors. >> FWIW our CRC algorithm is not quite HMAC, because it's neither keyed nor >> a cryptographic hash algorithm. Now, maybe we don't want authenticated >> encryption (e.g. XTS is not authenticated, unlike GCM/CCM). > >I'm also not sure if we should try to guarantee data authenticity / >integrity. 
As someone already mentioned elsewhere, page MAC does not help if >the whole page is replaced. (An extreme case is that old filesystem snapshot >containing the whole data directory is restored, although that will probably >make the database crash soon.) > >We can guarantee integrity and authenticity of backup, but that's a separate >feature: someone may need this although it's o.k. for him to run the cluster >unencrypted. > Yes, I do agree with that. I think attempts to guarantee data authenticity and/or integrity at the page level is mostly futile (replay attacks are an example of why). IMHO we should consider that to be outside the threat model TDE is expected to address. When writing my previous email I forgot that, besides improving data integrity, the authenticated encryption also tries to detect an attempt to get encryption key via "chosen-ciphertext attack (CCA)". The fact that pages are encrypted / decrypted independent from each other should not be a problem here. We just need to consider if this kind of CCA is the threat we try to protect against. IMO a better way to handle authenticity/integrity would be based on WAL, which is essentially an authoritative log of operations. We should be able to parse WAL, deduce expected state (min LSN, checksums) for each page, and validate the cluster state based on that. o
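A quick standalone check of the collision-probability arithmetic discussed above: with a 16-bit checksum, p = 1/65536, and for roughly 200,000 8kB pages (about 1.6GB) the chance of at least one random modification passing the checksum comes out near 96%, as stated in the quoted text (build with -lm):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		p = 1.0 / 65536.0;	/* chance one random page passes the checksum */
	long		n[] = {10000, 20000, 30000, 46000, 200000};

	for (int i = 0; i < 5; i++)
	{
		double		none = pow(1.0 - p, n[i]);

		printf("N = %6ld: P(no collision) = %2.0f%%, P(at least one) = %2.0f%%\n",
			   n[i], 100.0 * none, 100.0 * (1.0 - none));
	}
	return 0;
}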
Re: block-level incremental backup
Hi Jeevan, The idea is very nice. When inserts/updates/deletes and truncates/drops happen in various combinations, how does the incremental backup handle the copying of the blocks? On Wed, Jul 17, 2019 at 8:12 PM Jeevan Chalke wrote: > > > > On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed wrote: >> >> >> >> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke >> wrote: >>> >>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed wrote: At what stage you will apply the WAL generated in between the START/STOP backup. >>> >>> >>> In this design, we are not touching any WAL related code. The WAL files will >>> get copied with each backup either full or incremental. And thus, the last >>> incremental backup will have the final WAL files which will be copied as-is >>> in the combined full-backup and they will get apply automatically if that >>> the data directory is used to start the server. >> >> >> Ok, so you keep all the WAL files since the first backup, right? > > > The WAL files will anyway be copied while taking a backup (full or > incremental), > but only last incremental backup's WAL files are copied to the combined > synthetic full backup. > >>> -- Ibrar Ahmed >>> >>> >>> -- >>> Jeevan Chalke >>> Technical Architect, Product Development >>> EnterpriseDB Corporation >>> >> >> >> -- >> Ibrar Ahmed > > > > -- > Jeevan Chalke > Technical Architect, Product Development > EnterpriseDB Corporation > -- Regards, vignesh EnterpriseDB: http://www.enterprisedb.com
Re: [RFC] Removing "magic" oids
On Fri, Jul 19, 2019 at 10:12:57AM -0700, Andres Freund wrote: > On 2019-07-07 10:00:35 -0700, Noah Misch wrote: > > +# Test concurrent OID generation via pg_enum_oid_index. This indirectly > > +# exercises LWLock and spinlock concurrency. > > +my $labels = join ',', map { "'l$_'" } 1 .. 1000; > > pgbench( > > '--no-vacuum --client=5 --protocol=prepared --transactions=25', > > 0, > > [qr{processed: 125/125}], > > [qr{^$}], > > - 'concurrent insert workload', > > + 'concurrent OID generation', > > { > > '001_pgbench_concurrent_insert' => > > - 'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);' > > + "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE > > pg_temp.e;" > > }); > > Hm, perhaps we should just do something stupid an insert into a catalog > table, determining the oid to insert with pg_nextoid? That ought to be a > lot faster and thus more "stress testing" than going through a full > blown DDL statement? But perhaps that's just too ugly. I expect the pg_nextoid strategy could have sufficed. The ENUM strategy wastes some time parsing 1000 label names, discarding odd-numbered OIDs, and dropping the type. The pg_nextoid strategy wastes time by performing the insertion loop in the executor instead of dedicated C code of EnumValuesCreate(). Hard to say how to weight those factors.
Re: Index Skip Scan
> On Thu, Jul 11, 2019 at 2:40 AM Floris Van Nee > wrote: > I verified that the backwards index scan is indeed functioning now. However, > I'm afraid it's not that simple, as I think the cursor case is broken now. I > think having just the 'scan direction' in the btree code is not enough to get > this working properly, because we need to know whether we want the minimum or > maximum element of a certain prefix. There are basically four cases: > > - Forward Index Scan + Forward cursor: we want the minimum element within a > prefix and we want to skip 'forward' to the next prefix > > - Forward Index Scan + Backward cursor: we want the minimum element within a > prefix and we want to skip 'backward' to the previous prefix > > - Backward Index Scan + Forward cursor: we want the maximum element within a > prefix and we want to skip 'backward' to the previous prefix > > - Backward Index Scan + Backward cursor: we want the maximum element within a > prefix and we want to skip 'forward' to the next prefix > > These cases make it rather complicated unfortunately. They do somewhat tie in > with the previous discussion on this thread about being able to skip to the > min or max value. If we ever want to support a sort of minmax scan, we'll > encounter the same issues. Yes, these four cases are indeed a very good point. I've prepared a new version of the patch, where they + an index condition and handling of situations when it eliminated one or more unique elements are addressed. It seems fixes issues and works also for those hypothetical examples you've mentioned above, but of course it looks pretty complicated and I need to polish it a bit before posting. > On Thu, Jul 11, 2019 at 12:13 PM David Rowley > wrote: > > On Thu, 11 Jul 2019 at 19:41, Floris Van Nee wrote: > > SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c) > > Data (imagine every tuple here actually occurs 10.000 times in the index to > > see the benefit of skipping): > > 1,1,1 > > 1,1,2 > > 1,2,2 > > 1,2,3 > > 2,2,1 > > 2,2,3 > > 3,1,1 > > 3,1,2 > > 3,2,2 > > 3,2,3 > > > > Creating a cursor on this query and then moving forward, you should get > > (1,1,2), (3,1,2). In the current implementation of the patch, after > > bt_first, it skips over (1,1,2) to (2,2,1). It checks quals and moves > > forward one-by-one until it finds a match. This match only comes at (3,1,2) > > however. Then it skips to the end. > > > > If you move the cursor backwards from the end of the cursor, you should > > still get (3,1,2) (1,1,2). A possible implementation would start at the end > > and do a skip to the beginning of the prefix: (3,1,1). Then it needs to > > move forward one-by-one in order to find the first matching (minimum) item > > (3,1,2). When it finds it, it needs to skip backwards to the beginning of > > prefix 2 (2,2,1). It needs to move forwards to find the minimum element, > > but should stop as soon as it detects that the prefix doesn't match anymore > > (because there is no match for prefix 2, it will move all the way from > > (2,2,1) to (3,1,1)). It then needs to skip backwards again to the start of > > prefix 1: (1,1,1) and scan forward to find (1,1,2). > > Perhaps anyone can think of an easier way to implement it? > > One option is just don't implement it and just change > ExecSupportsBackwardScan() so that it returns false for skip index > scans, or perhaps at least implement an index am method to allow the > planner to be able to determine if the index implementation supports > it... 
amcanskipbackward Yep, it was discussed a few times in this thread, and after we've discovered (thanks to Floris) so many issues I was also one step away from implementing this idea. But at the same time, as Thomas correctly noticed, our implementation needs to be extensible to handle future use cases, and this particular cursor juggling already seems like a pretty good example of such a "future use case". So I hope that by dealing with it we can also figure out what needs to be extensible. > On Tue, Jul 16, 2019 at 6:53 PM Jesper Pedersen > wrote: > > Here is a patch more in that direction. > > Some questions: > > 1) Do we really need the UniqueKey struct ? If it only contains the > EquivalenceClass field then we could just have a list of those instead. > That would make the patch simpler. > > 2) Do we need both canon_uniquekeys and query_uniquekeys ? Currently > the patch only uses canon_uniquekeys because the we attach the list > directly on the Path node. > > I'll clean the patch up based on your feedback, and then start to rebase > v21 on it. Thanks! I'll also take a look as soon as I'm finished with the last updates.
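For reference, the four scan-direction/cursor-direction combinations quoted at the top of this message can be written down as a small decision table. The type and field names below are invented purely for illustration and do not correspond to anything in the patch:

#include <stdbool.h>

typedef enum {DIR_FORWARD, DIR_BACKWARD} Dir;

typedef struct
{
	Dir			index_scan_dir;	/* direction of the underlying index scan */
	Dir			cursor_dir;		/* direction the cursor is being moved */
	bool		want_minimum;	/* want the minimum (true) or maximum (false)
								 * element within each distinct prefix */
	Dir			skip_dir;		/* skip to the next or to the previous prefix */
} SkipCase;

static const SkipCase skip_cases[] =
{
	{DIR_FORWARD, DIR_FORWARD, true, DIR_FORWARD},		/* min, skip forward */
	{DIR_FORWARD, DIR_BACKWARD, true, DIR_BACKWARD},	/* min, skip backward */
	{DIR_BACKWARD, DIR_FORWARD, false, DIR_BACKWARD},	/* max, skip backward */
	{DIR_BACKWARD, DIR_BACKWARD, false, DIR_FORWARD},	/* max, skip forward */
};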
Re: SQL/JSON path issues/questions
Hi Steven, On Fri, Jul 19, 2019 at 9:53 PM Steven Pousty wrote: > I would like to help review this documentation. Can you please point me in > the right direction? Thank you for your interest. You're welcome to do review. Please take a look at instruction for reviewing a patch [1] and working with git [2]. Also, in order to build a doc you will need to setup a toolset first [3]. Links 1. https://wiki.postgresql.org/wiki/Reviewing_a_Patch 2. https://wiki.postgresql.org/wiki/Working_with_git#Testing_a_patch 3. https://www.postgresql.org/docs/devel/docguide-toolsets.html -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Sat, Jul 20, 2019 at 1:30 PM Tomas Vondra wrote: > Forbid checksums? I don't see how that could be acceptable. We either have > to accept the limitations of the current design (having to decrypt > everything before checking the checksums) or change the design. > > I personally think we should do the latter - not just because of this > "decrypt-then-verify" issue, but consider how much work we've done to > allow enabling checksums on-line (it's still not there, but it's likely > doable in PG13). How are we going to do that with encryption? ISTM we > should design it so that we can enable encryption on-line too - maybe not > in v1, but it should be possible. So how are we going to do that? With > checksums it's (fairly) easy because we can "not verify" the page before > we know all pages have a checksum, but with encryption that's not > possible. And if the whole page is encrypted, then what? > > Of course, maybe we don't need such capability for the use-case we're > trying to solve with encryption. I can imagine that someone is running a > large system, has issues with data corruption, and decides to enable > checksums to remedy that. Maybe there's no such scenario in the privacy > case? But we can probably come up with scenarios where a new company > policy forces people to enable encryption on all systems, or something > like that. > > That being said, I don't know how to solve this, but it seems to me that > any system where we can't easily decide whether a page is encrypted or not > (because everything including the page header) is encrypted has this > exact issue. Maybe we could keep some part of the header unencrypted > (likely an information leak, does not solve decrypt-then-verify). Or maybe > we need to store some additional information on each page (which breaks > on-disk format). How about storing the CRC of the encrypted pages? It would not leak any additional information and serves the same purpose as a non-encrypted one, namely I/O corruption detection. I took a look at pg_checksum and besides checking for empty pages, the checksum validation path does not interpret any other fields to calculate the checksum. I think even the offline checksum enabling path looks like it may work out of the box. Checksums of encrypted data are not a replacement for a MAC and this would allow that validation to run without any knowledge of keys. Related, I think CTR mode should be considered for pages. It has performance advantages at 8K data sizes, but even better, allows for arbitrary bytes of the cipher text to be replaced. For example, after encrypting a block you can replace the two checksum bytes with the CRC of the cipher text v.s. CBC mode where that would cause corruption to cascade forward. Same could be used for leaving things like pd_pagesize_version in plaintext at its current offset. For anything deemed non-sensitive, having it readable without having to decrypt the page is useful. It does not have to be full bytes either. CTR mode operates as a stream of bits so its possible to replace nibbles or even individual bits. It can be something as small one bit for an "is_encrypted" flag or a handful of bits used to infer a derived key. For example, with 2-bits you could have 00 represent unencrypted, 01/10 represent old/new key, and 11 be future use. Something like that could facilitate online key rotation. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
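A small demonstration of the CTR property described above, written against OpenSSL's EVP API (this assumes OpenSSL is available; the page contents, the all-zero key/IV, and the two-byte "checksum" offset are arbitrary stand-ins, not PostgreSQL's actual page format). Because CTR is a keystream XOR, flipping two bytes of the ciphertext changes only those two bytes of the decrypted page, whereas with CBC the damage would spread into the following block:

#include <openssl/evp.h>
#include <stdio.h>

#define PAGE_SIZE 8192
#define CKSUM_OFF 8				/* arbitrary stand-in for the checksum offset */

/* AES-256-CTR: encryption and decryption are the same keystream XOR */
static void
ctr_crypt(const unsigned char *key, const unsigned char *iv,
		  const unsigned char *in, unsigned char *out, int len)
{
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	int			outl;

	EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv);
	EVP_EncryptUpdate(ctx, out, &outl, in, len);
	EVP_CIPHER_CTX_free(ctx);
}

int
main(void)
{
	unsigned char key[32] = {0};	/* demo key/IV only */
	unsigned char iv[16] = {0};
	unsigned char page[PAGE_SIZE],
				enc[PAGE_SIZE],
				dec[PAGE_SIZE];
	int			changed = 0;

	for (int i = 0; i < PAGE_SIZE; i++)
		page[i] = (unsigned char) i;

	ctr_crypt(key, iv, page, enc, PAGE_SIZE);

	/* flip two ciphertext bytes in place, e.g. where a CRC of the
	 * ciphertext could be stored */
	enc[CKSUM_OFF] ^= 0xFF;
	enc[CKSUM_OFF + 1] ^= 0xFF;

	ctr_crypt(key, iv, enc, dec, PAGE_SIZE);

	for (int i = 0; i < PAGE_SIZE; i++)
		if (dec[i] != page[i])
			changed++;
	printf("bytes that differ after decryption: %d\n", changed);	/* prints 2 */
	return 0;
}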
Re: SQL/JSON path issues/questions
Thanks so much, hope to get to it over this weekend. On Sat, Jul 20, 2019, 11:48 AM Alexander Korotkov wrote: > Hi Steven, > > On Fri, Jul 19, 2019 at 9:53 PM Steven Pousty > wrote: > > I would like to help review this documentation. Can you please point me > in the right direction? > > Thank you for your interest. You're welcome to do review. > > Please take a look at instruction for reviewing a patch [1] and > working with git [2]. Also, in order to build a doc you will need to > setup a toolset first [3]. > > Links > > 1. https://wiki.postgresql.org/wiki/Reviewing_a_Patch > 2. https://wiki.postgresql.org/wiki/Working_with_git#Testing_a_patch > 3. https://www.postgresql.org/docs/devel/docguide-toolsets.html > > > -- > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company >
Re: [RFC] Removing "magic" oids
Hi, On 2019-07-20 11:21:52 -0700, Noah Misch wrote: > On Fri, Jul 19, 2019 at 10:12:57AM -0700, Andres Freund wrote: > > On 2019-07-07 10:00:35 -0700, Noah Misch wrote: > > > +# Test concurrent OID generation via pg_enum_oid_index. This indirectly > > > +# exercises LWLock and spinlock concurrency. > > > +my $labels = join ',', map { "'l$_'" } 1 .. 1000; > > > pgbench( > > > '--no-vacuum --client=5 --protocol=prepared --transactions=25', > > > 0, > > > [qr{processed: 125/125}], > > > [qr{^$}], > > > - 'concurrent insert workload', > > > + 'concurrent OID generation', > > > { > > > '001_pgbench_concurrent_insert' => > > > - 'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);' > > > + "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE > > > pg_temp.e;" > > > }); > > > > Hm, perhaps we should just do something stupid an insert into a catalog > > table, determining the oid to insert with pg_nextoid? That ought to be a > > lot faster and thus more "stress testing" than going through a full > > blown DDL statement? But perhaps that's just too ugly. > > I expect the pg_nextoid strategy could have sufficed. The ENUM strategy > wastes some time parsing 1000 label names, discarding odd-numbered OIDs, and > dropping the type. The pg_nextoid strategy wastes time by performing the > insertion loop in the executor instead of dedicated C code of > EnumValuesCreate(). Hard to say how to weight those factors. Fair enough. Are you planning to commit your changes? Greetings, Andres Freund
Re: [RFC] Removing "magic" oids
On Sat, Jul 20, 2019 at 12:56:34PM -0700, Andres Freund wrote: > On 2019-07-20 11:21:52 -0700, Noah Misch wrote: > > On Fri, Jul 19, 2019 at 10:12:57AM -0700, Andres Freund wrote: > > > On 2019-07-07 10:00:35 -0700, Noah Misch wrote: > > > > +# Test concurrent OID generation via pg_enum_oid_index. This > > > > indirectly > > > > +# exercises LWLock and spinlock concurrency. > > > > +my $labels = join ',', map { "'l$_'" } 1 .. 1000; > > > > pgbench( > > > > '--no-vacuum --client=5 --protocol=prepared --transactions=25', > > > > 0, > > > > [qr{processed: 125/125}], > > > > [qr{^$}], > > > > - 'concurrent insert workload', > > > > + 'concurrent OID generation', > > > > { > > > > '001_pgbench_concurrent_insert' => > > > > - 'INSERT INTO insert_tbl SELECT FROM > > > > generate_series(1,1000);' > > > > + "CREATE TYPE pg_temp.e AS ENUM ($labels); DROP TYPE > > > > pg_temp.e;" > > > > }); > > > > > > Hm, perhaps we should just do something stupid an insert into a catalog > > > table, determining the oid to insert with pg_nextoid? That ought to be a > > > lot faster and thus more "stress testing" than going through a full > > > blown DDL statement? But perhaps that's just too ugly. > > > > I expect the pg_nextoid strategy could have sufficed. The ENUM strategy > > wastes some time parsing 1000 label names, discarding odd-numbered OIDs, and > > dropping the type. The pg_nextoid strategy wastes time by performing the > > insertion loop in the executor instead of dedicated C code of > > EnumValuesCreate(). Hard to say how to weight those factors. > > Fair enough. Are you planning to commit your changes? https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a0cbb88524e8b6121597285b811640ee793b3e8
Re: thoughts on "prevent wraparound" vacuum
Hi, On 2019-07-20 15:35:57 +0300, Michail Nikolaev wrote: > Currently I am working a lot with cluster consist a few of big tables. > About 2-3 TB. These tables are heavily updated, some rows are removed, new > rows are inserted... Kind of typical OLTP workload. > > Physical table size keeps mostly stable while regular VACUUM is working. It > is fast enough to clean some place from removed rows. > > But time to time "to prevent wraparound" comes. And it works like 8-9 days. > During that time relation size starting to expand quickly. Freezing all > blocks in such table takes a lot of time and bloat is generated much more > quickly. Several questions: - Which version of postgres is this? Newer versions avoid scanning unchanged parts of the heap even for freezing (9.6+, with additional smaller improvements in 11). - have you increased the vacuum cost limits? Before PG 12 they're so low they're entirely unsuitable for larger databases, and even in 12 you should likely increase them for a multi-TB database Unfortunately even if those are fixed the indexes are still likely going to be scanned in their entirety - but most of the time not modified much, so that's not as bad. Greetings, Andres Freund
Re: [RFC] Removing "magic" oids
Hi, On 2019-07-20 12:58:55 -0700, Noah Misch wrote: > On Sat, Jul 20, 2019 at 12:56:34PM -0700, Andres Freund wrote: > > Fair enough. Are you planning to commit your changes? > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8a0cbb88524e8b6121597285b811640ee793b3e8 Oh, sorry for that. Still haven't fully cought up with things that happened while I was on vacation. Thanks.
Re: benchmarking Flex practices
John Naylor writes: > The pre-existing ecpg var "state_before" was a bit confusing when > combined with the new var "state_before_quote_stop", and the former is > also used with C-comments, so I decided to go with > "state_before_lit_start" and "state_before_lit_stop". Even though > comments aren't literals, it's less of a stretch than referring to > quotes. To keep things consistent, I went with the latter var in psql > and core. Hm, what do you think of "state_before_str_stop" instead? It seems to me that both "quote" and "lit" are pretty specific terms, so maybe we need something a bit vaguer. > To get the regression tests to pass, I had to add this: > psql_scan_in_quote(PsqlScanState state) > { > - return state->start_state != INITIAL; > + return state->start_state != INITIAL && > + state->start_state != xqs; > } > ...otherwise with parens we sometimes don't get the right prompt and > we get empty lines echoed. Adding xuend and xuchar here didn't seem to > make a difference. There might be something subtle I'm missing, so I > thought I'd mention it. I think you would see a difference if the regression tests had any cases with blank lines between a Unicode string/ident and the associated UESCAPE and escape-character literal. While poking at that, I also came across this unhappiness: regression=# select u&'foo' uescape 'bogus'; regression'# that is, psql thinks we're still in a literal at this point. That's because the uesccharfail rule eats "'b" and then we go to INITIAL state, so that consuming the last "'" puts us back in a string state. The backend would have thrown an error before parsing as far as the incomplete literal, so it doesn't care (or probably not, anyway), but that's not an option for psql. My first reaction as to how to fix this was to rip the xuend and xuchar states out of psql, and let it just lex UESCAPE as an identifier and the escape-character literal like any other literal. psql doesn't need to account for the escape character's effect on the meaning of the Unicode literal, so it doesn't have any need to lex the sequence as one big token. I think the same is true of ecpg though I've not looked really closely. However, my second reaction was that maybe you were on to something upthread when you speculated about postponing de-escaping of Unicode literals into the grammar. If we did it like that then we would not need to have this difference between the backend and frontend lexers, and we'd not have to worry about what psql_scan_in_quote should do about the whitespace before and after UESCAPE, either. So I'm feeling like maybe we should experiment to see what that solution looks like, before we commit to going in this direction. What do you think? > With the unicode escape rules brought over, the diff to the ecpg > scanner is much cleaner now. The diff for C-comment rules were still > pretty messy in comparison, so I made an attempt to clean that up in > 0002. A bit off-topic, but I thought I should offer that while it was > fresh in my head. I didn't really review this, but it looked like a fairly plausible change of the same ilk, ie combine rules by adding memory of the previous start state. regards, tom lane
Re: [PATCH v2] Introduce spgist quadtree @<(point,circle) operator
Hi Matwey, On Tue, May 21, 2019 at 10:23 AM Matwey V. Kornilov wrote: > On Tue, May 21, 2019 at 08:43, Michael Paquier wrote: > > > > On Mon, May 20, 2019 at 02:32:39PM +0300, Matwey V. Kornilov wrote: > > > This patch series is to add support for spgist quadtree @<(point,circle) > > > operator. The first two patches are to refactor existing code before > > > implementing the new feature. The third commit is the actual > > > implementation > > > provided with a set of simple unit tests. > > > > Could you add that to the next commit fest please? Here you go: > > https://commitfest.postgresql.org/23/ > > Done Thank you for posting this patch. I took a look at it. It appears that you make quadrant-based checks, but they seem to be lossy in comparison with box-based checks. Let me explain with an example. Imagine centroids (0,1) and (1,0). The square (0,0)-(1,1) is the intersection of quadrant 2 of (0,1) and quadrant 4 of (1,0). Now imagine a circle with center at (2,2) and radius 1. It intersects both quadrants, but it doesn't intersect the square. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
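To make that counterexample concrete, here is a quick check with existing geometric operators (the values are just the ones from the example; the proposed SP-GiST operator class itself is not exercised):

    -- The corner of the square (0,0)-(1,1) nearest to the circle's centre
    -- (2,2) is (1,1), at distance sqrt(2) > 1, so the circle misses the
    -- square even though it overlaps both quadrants.
    SELECT point '(1,1)' <-> point '(2,2)' AS dist_to_nearest_corner,   -- ~1.414
           circle '((2,2),1)' @> point '(1,1)' AS circle_contains_corner;  -- f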
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sat, Jul 20, 2019 at 1:00 PM Tomas Vondra wrote: > > On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote: > >On Sat, Jul 20, 2019 at 12:12 PM James Coleman wrote: > >> > >> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra > >> wrote: > >> ... > >> > >My current line of investigation is whether we need to do anything in > >> > >the parallel portion of create_ordered_paths(). I noticed that the > >> > >first-pass patch adding generate_useful_gather_paths() modified that > >> > >section but wasn't actually adding any new gather-merge paths (just > >> > >bare incremental sort paths). That seems pretty clearly just a > >> > >prototype miss, so I modified the prototype to build gather-merge > >> > >paths instead (as a side note that change seems to fix an oddity I was > >> > >seeing where plans would include a parallel index scan node even > >> > >though they weren't parallel plans). While the resulting plan for > >> > >something like: > >> > > > >> > > >> > Yes, that seems to be a bug. The block above it clealy has a gather > >> > merge nodes, so this one should too. > >> > > >> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by > >> > >t.a, t.b limit 50; > >> > > > >> > >changes cost (to be cheaper) ever so slightly with the gather-merge > >> > >addition to create_ordered_paths(), the plan itself is otherwise > >> > >identical (including row estimates): > >> > > > >> > >Limit > >> > > -> Gather Merge > >> > > -> Incremental Sort > >> > > -> Parallel Index Scan > >> > > > >> > >(Note: I'm forcing parallel plans here with: set > >> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0; > >> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set > >> > >min_parallel_index_scan_size=0;) > >> > > > >> > >I can't seem to come up with a case where adding these gather-merge > >> > >paths in create_ordered_paths() isn't entirely duplicative of paths > >> > >already created by generate_useful_gather_paths() as called from > >> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given > >> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths() > >> > >are called by grouping_planner(). > >> > > > >> > >Can you think of a case I'm missing here that would make it valuable > >> > >to generate new parallel plans in create_ordered_paths()? > >> > > > >> > > >> > Good question. Not sure. I think such path would need to do something on > >> > a relation that is neither a join nor a scan - in which case the path > >> > should not be created by apply_scanjoin_target_to_paths(). > >> > > >> > So for example a query like this: > >> > > >> > SELECT > >> > a, b, sum(expensive_function(c)) > >> > FROM > >> > t > >> > GROUP BY a,b > >> > ORDER BY a,sum(...) LIMIT 10; > >> > > >> > should be able to produce a plan like this: > >> > > >> > -> Limit > >> > -> Gather Merge > >> > -> Incremental Sort (pathkeys: a, sum) > >> > -> Group Aggregate > >> > a, b, sum(expensive_function(c)) > >> > -> Index Scan (pathkeys: a, b) > >> > > >> > or something like that, maybe. I haven't tried this, though. The other > >> > question is whether such queries are useful in practice ... > >> > >> Hmm, when I step through on that example input_rel->partial_pathlist > >> != NIL is false, so we don't even attempt to consider any extra > >> parallel paths in create_ordered_paths(). 
Nonetheless we get a > >> parallel plan, but with a different shape: > >> > >> Limit > >>-> Incremental Sort > >> Sort Key: a, (sum(expensive_function(c))) > >> Presorted Key: a > >> -> Finalize GroupAggregate > >>Group Key: a, b > >>-> Gather Merge > >> Workers Planned: 4 > >> -> Partial GroupAggregate > >>Group Key: a, b > >>-> Sort > >> Sort Key: a, b > >> -> Parallel Seq Scan on t > >> > >> (or if I disable seqscan then the sort+seq scan above becomes inc sort > >> + index scan) > >> > >> To be honest, I don't think I understand how you would get a plan like > >> that anyway since the index here only has "a" and so can't provide > >> (pathkeys: a, b). > >> > >> Perhaps there's something I'm still missing though. > > I wasn't strictly adhering to the example we used before, and I imagined > there would be an index on (a,b). Sorry if that wasn't clear. > > > > >Also just realized I don't think (?) we can order by the sum inside a > >gather-merge -- at least not without having another sort above the > >parallel portion? Or is the group aggregate able to also provide > >ordering on the final sum after aggregating the partial sums? > > > > Yes, you're right - an extra sort node would be necessary. But I don't > think that changes the idea behind that example. The question is whether
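For anyone trying to reproduce the hypothetical plan above, the setup it presumes is roughly the following (the index name is made up here, expensive_function() is only a placeholder from the example, and the parallel-cost GUCs are the ones already mentioned upthread):

    -- Index on (a, b) so the scan can supply pathkeys (a, b) beneath the
    -- partial aggregate.
    CREATE INDEX t_a_b_idx ON t (a, b);
    -- Encourage parallel plans, as in the earlier experiments.
    SET max_parallel_workers_per_gather = 4;
    SET parallel_setup_cost = 0;
    SET parallel_tuple_cost = 0;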
Re: Catching missing Datum conversions
> > I should probably split this into "actionable" (categories 3 and 4) > and "noise and scaffolding" patches. > Breaking down the noise-and-scaffolding into some subgroups might make the rather long patches more palatable/exceedingly-obvious:
* (Datum) 0 ---> NullDatum
* 0 ---> NullDatum
* The DatumGetPointer(allParameterTypes) null tests
Having said that, everything you did seems really straightforward, except for
  src/backend/rewrite/rewriteDefine.c
  src/backend/statistics/mcv.c
  src/backend/tsearch/ts_parse.c
and those seem like cases where the DatumGetXXX was a no-op before Datum was a struct.
Re: contrib make check-world fail if data have been modified and there's vpath
didier writes: > because there's destination data > src/makefiles/pgxs.mk line > ln -s $< $@ > fails and make clean doesn't remove these links. > ln -sf > is an option but it's not tested in configure > or rm -f Can you be more specific about what the problem case is? regards, tom lane
Fix typos and inconsistencies for HEAD (take 7)
Hello hackers, Please consider fixing the next pack of typos and inconsistencies in the tree: 7.1. h04m05s06 -> h04mm05s06 (in fact it's broken since 6af04882, but h04mm05s06.789 still accepted) 7.2. hasbucketcleanup -> hashbucketcleanup 7.3. _hashm_spare -> hashm_spares 7.4. hashtbl -> hash table 7.5. HAS_RELFILENODES -> XINFO_HAS_RELFILENODES 7.6. HAS_SUBXACT -> XINFO_HAS_SUBXACT 7.7. HAVE_FCVT -> remove (survived after ff4628f3) 7.8. HAVE_FINITE -> remove (orphaned after cac2d912) 7.9. HAVE_STRUCT_SOCKADDR_UN -> remove (not used since introduction in 399a36a7) 7.10. HAVE_SYSCONF -> remove (survived after ff4628f3) 7.11. HAVE_ZLIB -> HAVE_LIBZ 7.12. HEAP_CLEAN -> XLOG_HEAP2_CLEAN 7.13. HEAP_CONTAINS_NEW_TUPLE_DATA -> XLH_UPDATE_CONTAINS_NEW_TUPLE, XLOG_HEAP_CONTAINS_OLD_TUPLE -> XLH_UPDATE_CONTAINS_OLD_TUPLE, XLOG_HEAP_CONTAINS_OLD_KEY -> XLH_UPDATE_CONTAINS_OLD_KEY, XLOG_HEAP_PREFIX_FROM_OLD -> XLH_UPDATE_PREFIX_FROM_OLD, XLOG_HEAP_SUFFIX_FROM_OLD -> XLH_UPDATE_SUFFIX_FROM_OLD (renamed in 168d5805) 7.14. HEAP_FREEZE -> FREEZE_PAGE (an inconsistency since introduction in 48188e16) 7.15. heapio.c -> hio.c 7.16. heap_newpage -> XLOG_FPI (orphaned since 54685338) 7.17. heaxadecimal -> hexadecimal 7.18. hostlen -> nodelen, host -> node, serv -> service, servlen -> servicelen 7.19. i386s -> x86_64 7.20. IConst/FConst -> ICONST/FCONST 7.21. imit -> limit 7.22. IN_ARCHIVE_RECOVERY -> DB_IN_ARCHIVE_RECOVERY 7.23. ind_arraysize, ind_value -> ind_arrsize, ind_pointer 7.24. index_getnext -> index_getnext_slot 7.25. IndexTupleVector -> IndexTuple vector 7.26. innerConsistent -> innerConsistentFn 7.27. in-progres -> in-progress 7.28. inspire with -> inspired by the (sync with 192b0c94) 7.29. internalerrpos -> internalerrposition 7.30. internal_procname -> internal_proname 7.31. interruptOK -> remove (orphaned after d0699571) 7.32. intratransaction -> intra-transaction 7.33. InvalidOffset -> InvalidOffsetNumber 7.34. invtrans -> invtransfn 7.35. isbuiltin -> fmgr_isbuiltin 7.36. iself -> itself 7.37. isnoinherit -> noinherit 7.38. ISO_DATES -> USE_ISO_DATES 7.39. isParentRoot -> remove (orphaned after 218f5158) 7.40. isPrefix -> prefix 7.41. ItemPointerIsMax -> remove (orphaned after e20c70cb) 7.42. itemsin -> items in 7.43. jbVal -> jbval 7.44. json_plperl -> jsonb_plperlu 7.45. jvbBinary -> jbvBinary 7.46. keyAttrs -> attrKind 7.47. keyinfo -> key info 7.48. key_modified -> key_changed 7.49. killitems -> killedItems 7.50. KnownAssignedTransactions -> KnownAssignedTransactionIds Also, I found e-mail headers in optimizer/plan/README not relevant, so I propose to remove them. And another finding is related to the sleep effective resolution. `man 7 time` says "Since kernel 2.6.13, the HZ value is a kernel configuration parameter and can be 100, 250 (the default) ...", so the 10 milliseconds is not the most common effective resolution nowadays. I propose the corresponding patch for pgsleep.c, but we have a similar statement in doc/.../config.sgml. I think It should be fixed too. Best regards, Alexander diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a07bd27a0e..487103fb79 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -793,7 +793,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage) * be confused into returning the same tuple more than once or some tuples * not at all by the rearrangement we are performing here. 
To prevent * any concurrent scan to cross the squeeze scan we use lock chaining - * similar to hasbucketcleanup. Refer comments atop hashbucketcleanup. + * similar to hashbucketcleanup. Refer comments atop hashbucketcleanup. * * We need to retain a pin on the primary bucket to ensure that no concurrent * split can start. diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index 376ee2a63b..defdc9b408 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -509,7 +509,7 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid, * Choose the number of initial bucket pages to match the fill factor * given the estimated number of tuples. We round up the result to the * total number of buckets which has to be allocated before using its - * _hashm_spare element. However always force at least 2 bucket pages. The + * hashm_spares element. However always force at least 2 bucket pages. The * upper limit is determined by considerations explained in * _hash_expandtable(). */ diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index 54ea69f7f1..4d8db1a060 100644 --- a/src/backend/utils/adt/datetime.c +++ b/src/backend/utils/adt/datetime.c @@ -1855,7 +1855,7 @@ DecodeTimeOnly(char **field, int *ftype, int nf, /* * Was this an "ISO time" with embedded field labels? An - * example is "h04m05s06" - thomas 2001-02-04 + * example is "h04mm05s06" - thom
Re: Performance issue in foreign-key-aware join estimation
On Thu, 18 Jul 2019 at 19:24, David Rowley wrote: > Unless there's some objection, I'll be looking into pushing both 0001 > and 0002 in a single commit in the next few days. I've pushed this after doing a bit of final tweaking. After a re-read, I didn't really like all the code that rechecked that ec->ec_relids matched the relation we're searching for. The only code that seemed able to put additional members into eclass_indexes for which there's no mention in the actual class was add_child_rel_equivalences(). The comments for the definition of eclass_indexes said nothing about there being a possibility of the field containing an index for an EC that knows nothing about the given relation. Fixing that either meant fixing the comment to say that "they *may* contain", or fixing up the code so that it's strict about what ECs can be mentioned in eclass_indexes. I went with fixing the code, since it also allows us to get rid of some redundant checks, which I turned into Asserts() to catch any possible future bugs that might be introduced by code that might one day remove rels from an EC, e.g. something like https://commitfest.postgresql.org/23/1712/ I also did some performance tests on the simplest query I could think of that uses eclasses. select2.sql: SELECT * FROM t1 INNER JOIN t2 ON t1.a=t2.a

Master:
$ pgbench -n -f select2.sql -T 60 postgres
tps = 12143.597276 (excluding connections establishing)
tps = 12100.773839 (excluding connections establishing)
tps = 12086.209389 (excluding connections establishing)
tps = 12098.194927 (excluding connections establishing)
tps = 12105.140058 (excluding connections establishing)

Patched:
$ pgbench -n -f select2.sql -T 60 postgres
tps = 12224.597530 (excluding connections establishing)
tps = 12097.286522 (excluding connections establishing)
tps = 12035.306729 (excluding connections establishing)
tps = 11965.848289 (excluding connections establishing)
tps = 12059.846474 (excluding connections establishing)

There's a bit of noise there, but on average we're just 0.25% slower in the worst case, and the benchmarks shown above put us ~406% better off with the fairly complex query that Tom posted in the initial email on this thread. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
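A sketch of the schema select2.sql was presumably run against (the thread only shows the query, so these definitions are an assumption):

    -- Two trivial tables joined on "a"; the equi-join is what creates the
    -- equivalence class whose lookup the patch speeds up.
    CREATE TABLE t1 (a int);
    CREATE TABLE t2 (a int);
    INSERT INTO t1 SELECT generate_series(1, 1000);
    INSERT INTO t2 SELECT generate_series(1, 1000);
    ANALYZE t1;
    ANALYZE t2;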
Re: thoughts on "prevent wraparound" vacuum
Hello. >- Which version of postgres is this? Newer versions avoid scanning > unchanged parts of the heap even for freezing (9.6+, with additional > smaller improvements in 11). Oh, totally forgot about the version and settings... server_version 10.9 (Ubuntu 10.9-103) So, "don't vacuum all-frozen pages" is included. > - have you increased the vacuum cost limits? Before PG 12 they're so low > they're entirely unsuitable for larger databases, and even in 12 you > should likely increase them for a multi-TB database Current settings are:

autovacuum_max_workers        8
autovacuum_vacuum_cost_delay  5ms
autovacuum_vacuum_cost_limit  400
autovacuum_work_mem           -1
vacuum_cost_page_dirty        40
vacuum_cost_page_hit          1
vacuum_cost_page_miss         10

"autovacuum_max_workers" is set to 8 because the server needs to process a lot of changing relations. The settings were more aggressive previously (autovacuum_vacuum_cost_limit was 2800), but that led to very high IO load, causing issues with application performance and stability (even on SSD). "vacuum_cost_page_dirty" was set to 40 a few months ago. High IO write peaks were causing application requests to get stuck on WALWriteLock. After some investigation we found it was caused by WAL-logging peaks. Such WAL peaks mostly consist of records like these:

Type                     N      (%)     Record size   (%)      FPI size      (%)      Combined size   (%)
----                     -      ---     -----------   ---      --------      ---      -------------   ---
Heap2/CLEAN          10520   ( 0.86)       623660   ( 0.21)      5317532   ( 0.53)        5941192   ( 0.46)
Heap2/FREEZE_PAGE   113419   ( 9.29)      6673877   ( 2.26)    635354048   (63.12)      642027925   (49.31)

Another example:

Heap2/CLEAN         196707   ( 6.96)     12116527   ( 1.56)    292317231   (37.77)      304433758   (19.64)
Heap2/FREEZE_PAGE     1819   ( 0.06)       104012   ( 0.01)     13324269   ( 1.72)       13428281   ( 0.87)

Thanks, Michail.
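Not from the thread, but for anyone comparing a similar setup: the settings quoted above (plus the freeze-related ones that drive anti-wraparound vacuums) can be pulled in one query; the per-record-type breakdown itself looks like pg_waldump --stats=record output.

    -- Review the vacuum cost and freeze GUCs in one place (PostgreSQL 10 names).
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('autovacuum_max_workers',
                   'autovacuum_vacuum_cost_delay',
                   'autovacuum_vacuum_cost_limit',
                   'autovacuum_freeze_max_age',
                   'vacuum_freeze_min_age',
                   'vacuum_freeze_table_age',
                   'vacuum_cost_page_dirty',
                   'vacuum_cost_page_hit',
                   'vacuum_cost_page_miss');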