Re: Creating foreign key on partitioned table is too slow
David Rowley writes:
> In Ottawa this year, Andres and I briefly talked about the possibility
> of making a series of changes to how equalfuncs.c works. The idea was
> to make it easy by using some pre-processor magic to allow us to
> create another version of equalfuncs which would let us have an equal
> comparison function that returns -1 / 0 / +1 rather than just true or
> false. This would allow us to use Binary Search Trees of objects. I
> identified that EquivalenceClass.ec_members would be better written as
> a BST to allow much faster lookups in get_eclass_for_sort_expr().

This seems like a good idea, but why would we want to maintain two
versions? Just change equalfuncs.c into comparefuncs.c, full stop.
equal() would be a trivial wrapper for (compare_objects(a,b) == 0).

Andres' ideas about autogenerating all that boilerplate aren't bad,
but that's no justification for carrying two full sets of per-node
logic when one set would do.

			regards, tom lane
Re: tableam vs. TOAST
On Wed, Oct 30, 2019 at 9:46 PM Robert Haas wrote:
> On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu
> <prabhat.s...@enterprisedb.com> wrote:
>> While testing the Toast patch (PG + v7 patch) I found the below server
>> crash.
>> System configuration:
>> VCPUs: 4, RAM: 8GB, Storage: 320GB
>>
>> This issue is not frequently reproducible; we need to repeat the same
>> testcase multiple times.
>
> I wonder if this is an independent bug, because the backtrace doesn't
> look like it's related to the stuff this is changing. Your report
> doesn't specify whether you can also reproduce the problem without the
> patch, which is something that you should always check before reporting
> a bug in a particular patch.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi Robert,

My sincere apologies for not describing the issue in more detail. I
have run the same case against both PG HEAD and HEAD+Patch multiple
times (7, 10, 20 runs), and the issue does not occur on HEAD while the
same case is reproducible on HEAD+Patch. (Again, I am not sure from the
backtrace whether it is related to the patch or not.)

--
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
Re: [BUG] Partition creation fails after dropping a column and adding a partial index
On Tue, Oct 29, 2019 at 01:16:58PM +0900, Michael Paquier wrote:
> Yes, something looks wrong with that. I have not looked at it in
> detail yet though. I'll see about that tomorrow.

So.. When building the attribute map for a cloned index (with either
LIKE during the transformation or for partition indexes), we store each
attribute number, with 0 used for dropped columns. Unfortunately, if
you look at the way the attribute map is built in this case, the code
correctly generates the mapping with convert_tuples_by_name_map, but
the length of the mapping used is incorrect: it uses the number of
attributes of the newly-created child relation, and not that of the
parent, which includes the dropped columns in its count. So the answer
is simply to use the parent as reference for the mapping length.

The patch is rather simple, as per the attached, with extended
regression tests included. I have not checked on back-branches yet, but
that's visibly wrong since 8b08f7d down to v11 (will do that when
back-patching).

There could be a point in changing convert_tuples_by_name_map & co so
that they return the length of the map on top of the map itself, to
avoid such mistakes in the future. That's a more invasive patch not
really adapted for a back-patch, but we could do that on HEAD once this
bug is fixed. I have also checked the other callers of this API, and
their handling is done correctly.

Wyatt, what do you think?
--
Michael

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8d25d14772..5597be6e3d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1091,7 +1091,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 											RelationGetDescr(parent));
 			idxstmt =
 				generateClonedIndexStmt(NULL, idxRel,
-										attmap, RelationGetDescr(rel)->natts,
+										attmap, RelationGetDescr(parent)->natts,
 										);
 			DefineIndex(RelationGetRelid(rel),
 						idxstmt,
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index ce2484fd4e..f63016871c 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -1168,3 +1168,52 @@ insert into defcheck_def values (0, 0);
 create table defcheck_0 partition of defcheck for values in (0);
 ERROR:  updated partition constraint for default partition "defcheck_def" would be violated by some row
 drop table defcheck;
+-- tests of column drop with partition tables and indexes using
+-- predicates and expressions.
+create table part_column_drop (
+  useless_1 int,
+  id int,
+  useless_2 int,
+  d int,
+  b int,
+  useless_3 int
+) partition by range (id);
+alter table part_column_drop drop column useless_1;
+alter table part_column_drop drop column useless_2;
+alter table part_column_drop drop column useless_3;
+create index part_column_drop_b_pred on part_column_drop(b) where b = 1;
+create index part_column_drop_b_expr on part_column_drop((b = 1));
+create index part_column_drop_d_pred on part_column_drop(d) where d = 2;
+create index part_column_drop_d_expr on part_column_drop((d = 2));
+create table part_column_drop_1_10 partition of
+  part_column_drop for values from (1) to (10);
+\d part_column_drop
+      Partitioned table "public.part_column_drop"
+ Column |  Type   | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id     | integer |           |          |
+ d      | integer |           |          |
+ b      | integer |           |          |
+Partition key: RANGE (id)
+Indexes:
+    "part_column_drop_b_expr" btree ((b = 1))
+    "part_column_drop_b_pred" btree (b) WHERE b = 1
+    "part_column_drop_d_expr" btree ((d = 2))
+    "part_column_drop_d_pred" btree (d) WHERE d = 2
+Number of partitions: 1 (Use \d+ to list them.)
+
+\d part_column_drop_1_10
+       Table "public.part_column_drop_1_10"
+ Column |  Type   | Collation | Nullable | Default
+--------+---------+-----------+----------+---------
+ id     | integer |           |          |
+ d      | integer |           |          |
+ b      | integer |           |          |
+Partition of: part_column_drop FOR VALUES FROM (1) TO (10)
+Indexes:
+    "part_column_drop_1_10_b_idx" btree (b) WHERE b = 1
+    "part_column_drop_1_10_d_idx" btree (d) WHERE d = 2
+    "part_column_drop_1_10_expr_idx" btree ((b = 1))
+    "part_column_drop_1_10_expr_idx1" btree ((d = 2))
+
+drop table part_column_drop;
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index 144eeb480d..e835b65ac4 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -903,3 +903,26 @@ create table defcheck_1 partition of defcheck for values in (1, null);
 insert into defcheck_def values (0, 0);
 create table defcheck_0 partition of defcheck for values in (0);
 drop table defcheck;
+
+-- tests of column drop with partition tables and indexes using
+--
Re: Allow CREATE OR REPLACE VIEW to rename the columns
Fujii Masao writes:
> Currently CREATE OR REPLACE VIEW command fails if the column names
> are changed.

That is, I believe, intentional. It's an effective aid to catching
mistakes in view redefinitions, such as misaligning the new set of
columns relative to the old. That's particularly important given that
we allow you to add columns during CREATE OR REPLACE VIEW. Consider the
oversimplified case where you start with

	CREATE VIEW v AS SELECT 1 AS x, 2 AS y;

and you want to add a column z, and you get sloppy and write

	CREATE OR REPLACE VIEW v AS SELECT 1 AS x, 3 AS z, 2 AS y;

If we did not throw an error on this, references that formerly pointed
to column y would now point to z (as that's still attnum 2), which is
highly unlikely to be what you wanted.

The right way to handle a column rename in a view is to do a separate
ALTER VIEW RENAME COLUMN, making it totally clear what you intend to
happen. (Right now, we make you spell that "ALTER TABLE RENAME COLUMN",
but it'd be reasonable to add that syntax to ALTER VIEW too.)

I don't think this functionality should be folded into redefinition of
the content of the view. It'd add more opportunity for mistakes than
anything else.

			regards, tom lane
Re: Allow cluster_name in log_line_prefix
> Hi folks
>
> I was recently surprised to notice that log_line_prefix doesn't support
> a cluster_name placeholder. I suggest adding one. If I don't hear
> objections I'll send a patch.

I think it'd be a good thing for users.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: Add SQL function to show total block numbers in the relation
On Wed, Oct 30, 2019 at 10:09:47AM -0400, Tom Lane wrote:
> btkimurayuzk writes:
>> I propose new simple sql query, which shows total block numbers in the
>> relation.
>> ...
>> Of course, we can know this value such as
>> select (pg_relation_size('t') /
>>     current_setting('block_size')::bigint)::int;
>
> I don't really see why the existing solution isn't sufficient.

+1.
--
Michael

signature.asc
Description: PGP signature
Re: Allow cluster_name in log_line_prefix
On Mon, Oct 28, 2019 at 3:33 PM Craig Ringer wrote:
> I was recently surprised to notice that log_line_prefix doesn't support
> a cluster_name placeholder. I suggest adding one. If I don't hear
> objections I'll send a patch.

+1
Allow CREATE OR REPLACE VIEW to rename the columns
Hi,

Currently CREATE OR REPLACE VIEW command fails if the column names
are changed. For example,

    =# CREATE VIEW test AS SELECT 0 AS a;
    =# CREATE OR REPLACE VIEW test AS SELECT 0 AS x;
    ERROR:  cannot change name of view column "a" to "x"

I'd like to propose the attached patch that allows CREATE OR REPLACE
VIEW to rename the view columns. Thoughts?

Regards,

--
Fujii Masao

rename_view_column_v1.patch
Description: Binary data
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 05:43:04PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 30 Oct 2019 17:21:17 +0900, Fujii Masao wrote in
>> This change causes every ending backend to always take the exclusive
>> lock even when it's not in the SyncRep queue. This may be problematic,
>> for example, when terminating multiple backends at the same time? If
>> yes, it might be better to check SHMQueueIsDetached() again after
>> taking the lock. That is,
>
> I'm not sure how much that harms but double-checked locking
> (releasing) is simple enough for reducing possible congestion here, I
> think.

FWIW, I could not measure any actual difference with pgbench -C, up to
500 sessions, an empty input file (just one meta-command) and -c 20.

I have added some comments in SyncRepCleanupAtProcExit(), and adjusted
the patch with the suggestion from Fujii-san. Any comments?
--
Michael

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 16aee1de4c..f72b8a75f3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -149,6 +149,13 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
+	/*
+	 * This should be called while holding interrupts during a transaction
+	 * commit to prevent the follow-up shared memory queue cleanups to be
+	 * influenced by external interruptions.
+	 */
+	Assert(InterruptHoldoffCount > 0);
+
 	/* Cap the level for anything other than commit to remote flush only. */
 	if (commit)
 		mode = SyncRepWaitMode;
@@ -361,10 +368,18 @@ SyncRepCancelWait(void)
 void
 SyncRepCleanupAtProcExit(void)
 {
+	/*
+	 * First check if we are removed from the queue without the lock to
+	 * not slow down backend exit.
+	 */
 	if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
 	{
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-		SHMQueueDelete(&(MyProc->syncRepLinks));
+
+		/* maybe we have just been removed, so recheck */
+		if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
+			SHMQueueDelete(&(MyProc->syncRepLinks));
+
 		LWLockRelease(SyncRepLock);
 	}
 }
@@ -508,6 +523,8 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
+ * The caller must hold SyncRepLock.
+ *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
  * store the positions into *writePtr, *flushPtr and *applyPtr.
@@ -521,6 +538,8 @@ SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 {
 	List	   *sync_standbys;
 
+	Assert(LWLockHeldByMe(SyncRepLock));
+
 	*writePtr = InvalidXLogRecPtr;
 	*flushPtr = InvalidXLogRecPtr;
 	*applyPtr = InvalidXLogRecPtr;
@@ -680,6 +699,8 @@ cmp_lsn(const void *a, const void *b)
 List *
 SyncRepGetSyncStandbys(bool *am_sync)
 {
+	Assert(LWLockHeldByMe(SyncRepLock));
+
 	/* Set default result */
 	if (am_sync != NULL)
 		*am_sync = false;
@@ -984,7 +1005,7 @@ SyncRepGetStandbyPriority(void)
  * Pass all = true to wake whole queue; otherwise, just wake up to
  * the walsender's LSN.
  *
- * Must hold SyncRepLock.
+ * The caller must hold SyncRepLock in exclusive mode.
  */
 static int
 SyncRepWakeQueue(bool all, int mode)
@@ -995,6 +1016,7 @@ SyncRepWakeQueue(bool all, int mode)
 	int			numprocs = 0;
 
 	Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
+	Assert(LWLockHeldByMeInMode(SyncRepLock, LW_EXCLUSIVE));
 	Assert(SyncRepQueueIsOrderedByLSN(mode));
 
 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),

signature.asc
Description: PGP signature
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 12:34:28PM +0900, Kyotaro Horiguchi wrote:
> If we do that strictly, other functions like
> SyncRepGetOldestSyncRecPtr need the same Assert()s. I think static
> functions don't need Assert() and caution in their comments would be
> enough.

Perhaps. I'd rather be careful though if we meddle again with this code
in the future, as it is shared across multiple places and callers.

> SyncRepReleaseWaiters reads MyWalSnd->sync_standby_priority without
> holding SyncRepLock, which could lead to a message with wrong
> priority. I'm not sure it matters, though.

The WAL sender is the only writer of its info in shared memory, so
there is no problem with having it read that data without holding its
spinlock.

> Seems reasonable for holdoffs. The same assertion would be needed in
> more places but it's another issue.

Sure.

> By the way, while I was looking at this, I found a typo. Please find
> the attached.

Thanks, applied that one.
--
Michael

signature.asc
Description: PGP signature
Re: Creating foreign key on partitioned table is too slow
Hi,

On 2019-10-31 11:19:05 +1300, David Rowley wrote:
> In Ottawa this year, Andres and I briefly talked about the possibility
> of making a series of changes to how equalfuncs.c works. The idea was
> to make it easy by using some pre-processor magic to allow us to
> create another version of equalfuncs which would let us have an equal
> comparison function that returns -1 / 0 / +1 rather than just true or
> false.

See also the thread at
https://www.postgresql.org/message-id/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de
which would make this fairly easy, without having to compile equalfuncs
twice or such.

Greetings,

Andres Freund
Re: Zedstore - compressed in-core columnar storage
Alex Wang and I have been doing some performance analysis of the most
recent version of the zedstore branch, and have some interesting
statistics to share. We specifically focused on TPC-DS query 2, because
it plays to what should be the strength of zedstore -- namely, it does
a full table scan of only a subset of columns. I've attached the
explain verbose output for reference. We scan two columns of
'catalog_sales', and two columns of 'web_sales'.

->  Parallel Append
      ->  Parallel Seq Scan on tpcds.catalog_sales
            Output: catalog_sales.cs_ext_sales_price, catalog_sales.cs_sold_date_sk
      ->  Parallel Seq Scan on tpcds.web_sales
            Output: web_sales.ws_ext_sales_price, web_sales.ws_sold_date_sk

For heap, it needs to do a full table scan of both tables, and we need
to read the entire table into memory. For our dataset, that totals
around 119GB of data.

***HEAP***
tpcds=# select pg_size_pretty(pg_relation_size('web_sales'));
 pg_size_pretty
----------------
 39 GB
(1 row)

tpcds=# select pg_size_pretty(pg_relation_size('catalog_sales'));
 pg_size_pretty
----------------
 80 GB
(1 row)
***/HEAP***

With Zedstore the total relation size is smaller because of
compression. When scanning the table, we only scan the blocks with data
we are interested in, and leave the rest alone. So the total size we
need to scan for these tables totals around 4GB.

***ZEDSTORE***
zedstore=# select pg_size_pretty(pg_relation_size('web_sales'));
 pg_size_pretty
----------------
 20 GB
(1 row)

zedstore=# select pg_size_pretty(pg_relation_size('catalog_sales'));
 pg_size_pretty
----------------
 40 GB
(1 row)

zedstore=# with zedstore_tables as (select d.oid, f.*
zedstore(#   from (select c.oid
zedstore(#           from pg_am am
zedstore(#           join pg_class c on (c.relam = am.oid)
zedstore(#          where am.amname = 'zedstore') d,
zedstore(#        pg_zs_btree_pages(d.oid) f)
zedstore-# select zs.attno, att.attname, zs.oid::regclass, count(zs.attno) as pages,
zedstore-#        pg_size_pretty(count(zs.attno) * 8 * 1024) from zedstore_tables zs
zedstore-# left join pg_attribute att on zs.attno = att.attnum
zedstore-#   and zs.oid = att.attrelid
zedstore-# where zs.oid in ('catalog_sales'::regclass, 'web_sales'::regclass)
zedstore-#   and (att.attname in ('cs_ext_sales_price','cs_sold_date_sk','ws_ext_sales_price','ws_sold_date_sk')
zedstore(#        or zs.attno = 0)
zedstore-# group by zs.attno, att.attname, zs.oid
zedstore-# order by zs.oid, zs.attno;
 attno |      attname       |      oid      | pages  | pg_size_pretty
-------+--------------------+---------------+--------+----------------
     0 |                    | catalog_sales |  39549 | 309 MB
     1 | cs_sold_date_sk    | catalog_sales |   2441 | 19 MB
    24 | cs_ext_sales_price | catalog_sales | 289158 | 2259 MB
     0 |                    | web_sales     |  20013 | 156 MB
     1 | ws_sold_date_sk    | web_sales     |  17578 | 137 MB
    24 | ws_ext_sales_price | web_sales     | 144860 | 1132 MB
***/ZEDSTORE***

On our test machine, our tables were stored on a single spinning disk,
so our read speed was pretty abysmal with this query. This query is I/O
bound for us, so read speed was the single largest factor. With heap,
the tables are scanned sequentially, and therefore we can scan around
150MB of table data per second:

***HEAP***
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.54    0.00    1.85   11.62    0.00   77.98

Device   r/s      w/s   rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdd      1685.33  0.00  157069.33  0.00   18.67   0.00    1.10   0.00   1.56     0.00     2.62    93.20     0.00      0.59   100.00

Device   r/s      w/s   rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdd      1655.33  0.00  154910.67  0.00   21.33   0.00    1.27   0.00   1.62     0.00     2.68    93.58     0.00      0.60   100.13

Device   r/s      w/s   rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdd      1746.33  0.00  155121.33  0.00   28.00   0.00    1.58   0.00   1.48     0.00     2.61    88.83     0.00      0.57   100.00
***/HEAP***

Because zedstore access resembled random I/O, the read speed was
significantly hindered on our single disk. As a result, we saw ~150x
slower read speeds.

***ZEDSTORE***
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.24    0.00    1.22    6.34    0.00   86.20

Device   r/s     w/s   rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb      129.33  0.00  1034.67  0.00   0.00    0.00    0.00   0.00   15.89    0.00     2.05    8.00      0.00      7.67   99.20

Device   r/s     w/s   rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb      120.67
Re: Creating foreign key on partitioned table is too slow
On Thu, 31 Oct 2019 at 07:30, Tomas Vondra wrote:
> On Thu, Oct 24, 2019 at 04:28:38PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2019-10-23 05:59:01 +0000, kato-...@fujitsu.com wrote:
>>> To benchmark with the tpcb model, I tried to create a foreign key on
>>> the partitioned history table, but the backend process was killed by
>>> OOM. The number of partitions is 8192. I tried on master (commit:
>>> ad4b7aeb84).
>>
>> Obviously this should be improved. But I think it's also worthwhile to
>> note that using 8k partitions is very unlikely to be a good choice for
>> anything. The metadata, partition pruning, etc. overhead is just going
>> to be very substantial.
>
> True. Especially with two partitioned tables, each with 8k partitions.

In Ottawa this year, Andres and I briefly talked about the possibility
of making a series of changes to how equalfuncs.c works. The idea was
to make it easy, by using some pre-processor magic, to create another
version of equalfuncs which would let us have an equal comparison
function that returns -1 / 0 / +1 rather than just true or false. This
would allow us to use Binary Search Trees of objects. I identified that
EquivalenceClass.ec_members would be better written as a BST to allow
much faster lookups in get_eclass_for_sort_expr().

The implementation I had in mind for the BST was a compact tree that,
instead of using pointers for the left and right children, just uses an
integer to reference the array element number. This would allow us to
maintain very fast insert-order traversals. Deletes would need to
decrement all child references greater than the deleted index. This is
sort of on par with how the new List implementation in master works,
i.e. deletes take additional effort, but inserts are fast if there's
enough space in the array for a new element, traversals are
cache-friendly, etc.
I think trees might be better than hash tables for this, as a hash
function needs to hash all fields, whereas a comparison function can
stop when it finds the first non-match.

This may also be able to help simplify the code in setrefs.c to get rid
of the complex code around indexed tlists. tlist_member() would become
O(log n) instead of O(n), so perhaps there'd be not much point in
having both search_indexed_tlist_for_var() and
search_indexed_tlist_for_non_var().
--
David Rowley            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Wed, Oct 30, 2019 at 1:25 PM Andres Freund wrote:
> I assume you mean that the index would dynamically recognize when it
> needs the wider tids ("for the higher portion")? If so, yea, that makes
> sense to me. Would that need to encode the 6/8byteness of a tid on a
> per-element basis? Or are you thinking of being able to upconvert in a
> general way?

I think that index access methods should continue to control the layout
of TIDs/item pointers, while mostly just treating them as fixed-width
integers. Maybe you just have 6 and 8 byte structs, or maybe you also
have a 16 byte struct, but it isn't really variable length to the index
AM (it's more like a union). It becomes the index AM's responsibility
to remember which TID width applies at a per-tuple level (or per
posting list level, etc.). In general, index AMs have to "supply their
own status bits", which should be fine (the alternative is that they
refuse to support anything other than the traditional 6 byte TID
format).

Table AMs don't get to supply their own operator class for sorting
these integers -- they had better be happy with the TID sort order that
is baked in (ascending int order) in the context of index scans that
return duplicates, and things like that. There is probably generic
infrastructure for up-converting, too, but the index AM is
fundamentally in the driving seat with this design.

> If we had variable width tids varying by more than 2 bytes, it might be
> reasonable for cases like this to store all tids padded to the width of
> the widest tid. I think that'd still be pretty OK, because you'd only
> pay the price if you actually have long tids, and only on pages where
> such tids are referenced. Obviously that means that such a posting list
> could grow more than by the size of the inserted tid upon insertion,
> but that might still be OK? That'd require storing the width of the
> posting list elements somewhere, unfortunately - not sure if that's a
> problem?

My solution is to require the index AM to look after itself. The index
AM is responsible for not mixing them up. For nbtree with
deduplication, this means that having different width TIDs makes it
impossible to merge together posting lists/tuples. For GIN, this means
that we can expand the width of the TIDs in the posting list to match
the widest TID. We can then make it into a posting tree if necessary --
GIN has the benefit of always being able to fall back on the option of
making a new posting tree (unlike nbtree with deduplication). GIN's
B-Tree is very primitive in some ways (no deletion of items in the
entry tree, no deletion of pages in the entry tree), which gives it the
ability to safely fall back on creating a new posting tree when it runs
out of space.

> ISTM if we had varint style tids, we'd probably still save space on
> average for today's heap that way. How important is it for you to be
> able to split out the "block offset" and "page offset" bits?

Pretty important. The nbtree deduplication patch is very compelling
because it almost offers the best of both worlds -- the concurrency
characteristics of today's nbtree, combined with very much improved
space efficiency. Keeping the space accounting as simple as possible
seems like a big part of why this is possible at all. There is only one
new type of WAL record required for deduplication, and they're pretty
small. (Existing WAL records are changed to support posting list
splits, but these are small, low-overhead changes.)

> I'm somewhat inclined to think that tids should just be a varint (or
> maybe two, if we want to make it slightly simpler to keep compat to how
> they look today), and that the AM internally makes sense of that.

I am opposed to adding anything that is truly varint. The index access
method ought to be able to have fine control over the layout, without
being burdened by an overly complicated TID representation.

>> I can support varwidth TIDs in the future pretty well if the TID
>> doesn't have to be *arbitrarily* wide.
>
> I think it'd be perfectly reasonable to have a relatively low upper
> limit for tid width. Not 8 bytes, but also not 200 bytes.

My point is that we should find a way to make TIDs continue to be an
array of fixed width integers in any given context. Lots of index
access method code can be ported in a relatively straightforward
fashion this way. This has some downsides, but I believe that they're
worth it.

>> Individual posting lists can themselves either use 6 byte or 8 byte
>> TIDs, preserving the ability to access a posting list entry at random
>> using simple pointer arithmetic. This makes converting over index AMs
>> a lot less painful -- it'll be pretty easy to avoid mixing together
>> the 6 byte and 8 byte structs.
>
> With varint style tids as I suggested, that ought to be fairly simple?

nbtree probably won't be able to tolerate having to widen every TID in
the posting list all at once when new tuples are inserted that have
TIDs that are one byte wider,
Re: v12.0: ERROR: could not find pathkey item to sort
On Thu, 31 Oct 2019 at 05:09, Tom Lane wrote:
> David --- much of the complexity here comes from the addition of
> the eclass_indexes infrastructure, so do you have any thoughts?

Hindsight makes me think I should have mentioned in the comment for
eclass_indexes that it's only used for simple rels and remains NULL for
anything else.

All the code in equivclass.c uses get_common_eclass_indexes() and
get_eclass_indexes_for_relids(), which go down to the simple-rel level
to obtain their eclass_indexes. When calling
get_eclass_indexes_for_relids() we'll build a union Bitmapset with the
indexes from each simple rel that the join rel is made from. We only
ever directly use the eclass_indexes field when we're certain we're
dealing with a simple rel already; get_eclass_indexes_for_relids()
would do the same job there, but using the field directly saves a bit
of needless effort and memory allocations.

So, in short, I don't really see why we need to set eclass_indexes for
anything other than simple rels.
--
David Rowley            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Hi,

On 2019-10-30 12:37:50 -0700, Peter Geoghegan wrote:
> On Wed, Oct 30, 2019 at 12:03 PM Andres Freund wrote:
>> I'd much rather not entrench this further, even leaving global indexes
>> aside. The 4 byte block number is a significant limitation for heap
>> tables too, and we should lift that at some point not too far away.
>> Then there's also other AMs that could really use a wider tid space.
>
> I agree that that limitation is a problem that should be fixed before
> too long. But the solution probably shouldn't be a radical departure
> from what we have today. The vast majority of tables are not affected
> by the TID space limitation. Those tables that are can tolerate
> supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are
> used for the higher portion of the heap TID space alone.
>
> The idea here is that TID is varwidth, but actually uses the existing
> heap TID format most of the time. For larger tables it uses a wider
> fixed width struct that largely works the same as the old 6 byte
> struct.

I assume you mean that the index would dynamically recognize when it
needs the wider tids ("for the higher portion")? If so, yea, that makes
sense to me. Would that need to encode the 6/8byteness of a tid on a
per-element basis? Or are you thinking of being able to upconvert in a
general way?

>>> Though I suppose a posting list almost has to have fixed width TIDs
>>> to perform acceptably.
>>
>> Hm. It's not clear to me why that is?
>
> Random access matters for things like determining the correct offset
> to split a posting list at. This is needed in the event of an
> overlapping insertion of a new duplicate tuple whose heap TID falls
> within the range of the posting list. Also, I want to be able to scan
> posting lists backwards for backwards scans. In general, fixed-width
> TIDs make the page space accounting fairly simple, which matters a lot
> in nbtree.
If we had variable width tids varying by more than 2 bytes, it might be
reasonable for cases like this to store all tids padded to the width of
the widest tid. I think that'd still be pretty OK, because you'd only
pay the price if you actually have long tids, and only on pages where
such tids are referenced. Obviously that means that such a posting list
could grow more than by the size of the inserted tid upon insertion,
but that might still be OK? That'd require storing the width of the
posting list elements somewhere, unfortunately - not sure if that's a
problem?

ISTM if we had varint style tids, we'd probably still save space on
average for today's heap that way. How important is it for you to be
able to split out the "block offset" and "page offset" bits?

I'm somewhat inclined to think that tids should just be a varint (or
maybe two, if we want to make it slightly simpler to keep compat to how
they look today), and that the AM internally makes sense of that.

> I can support varwidth TIDs in the future pretty well if the TID
> doesn't have to be *arbitrarily* wide.

I think it'd be perfectly reasonable to have a relatively low upper
limit for tid width. Not 8 bytes, but also not 200 bytes.

> Individual posting lists can themselves either use 6 byte or 8 byte
> TIDs, preserving the ability to access a posting list entry at random
> using simple pointer arithmetic. This makes converting over index AMs
> a lot less painful -- it'll be pretty easy to avoid mixing together
> the 6 byte and 8 byte structs.

With varint style tids as I suggested, that ought to be fairly simple?

> I don't think "indirect" indexes are a realistic goal for
> Postgres. VACUUM is just too messy there (as is any other garbage
> collection mechanism). Zedstore and Zheap don't change this.

Hm, I don't think there's a fundamental problem, but let's leave that
aside, there's enough other reasons to improve this.

Greetings,

Andres Freund
pgstat.c has brittle response to transient problems
While fooling with the NetBSD-vs-libpython issue noted in a nearby thread, I observed that the core regression tests sometimes hang up in the "stats" test on this platform (NetBSD 8.1/amd64). Investigation found that the stats collector process was sometimes exiting like this:

2019-10-29 19:38:14.563 EDT [7018] FATAL:  could not read statistics message: No buffer space available
2019-10-29 19:38:14.563 EDT [7932] LOG:  statistics collector process (PID 7018) exited with exit code 1

The postmaster then restarts the collector, but possibly with a time delay (see the PGSTAT_RESTART_INTERVAL respawn throttling logic). This seems to interact badly with the wait-for-stats-collector logic in backend_read_statsfile, so that each cycle of the wait_for_stats() test function takes a long time ... and we will do 300 of those unconditionally. (Possibly wait_for_stats ought to be modified so that it pays attention to elapsed wall-clock time rather than iterating for a fixed number of times?)

NetBSD's recv() man page glosses ENOBUFS as "A message was not delivered because it would have overflowed the buffer", but I don't believe that's actually what's happening. (Just to be sure, I added an Assert on the sending side that no message exceeds sizeof(PgStat_Msg). I wonder why we didn't have one already.) Trawling the NetBSD kernel code, it seems like ENOBUFS could get returned as a result of transient shortages of kernel working memory --- most of the uses of that error code seem to be on the sending side, but I found some that seem to be in receiving code.

In short: it's evidently possible to get ENOBUFS as a transient failure condition on this platform, and having the stats collector die seems like an overreaction. I'm inclined to have it log the error and press on, instead. Looking at the POSIX spec for recv() suggests that ENOMEM is another plausible transient failure, so maybe we should do the same with that.
Roughly:

    if (len < 0)
    {
+       /* silently ignore these cases (no data available) */
        if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
            break;              /* out of inner loop */

+       /* noisily ignore these cases (soft errors) */
+       if (errno == ENOBUFS || errno == ENOMEM)
+       {
+           ereport(LOG,
+                   (errcode_for_socket_access(),
+                    errmsg("could not read statistics message: %m")));
+           break;              /* out of inner loop */
+       }
+
+       /* hard failure, but maybe hara-kiri will fix it */
        ereport(ERROR,
                (errcode_for_socket_access(),
                 errmsg("could not read statistics message: %m")));
    }

A variant idea is to treat all unexpected errnos as LOG-and-push-on, but maybe we ought to have a limit on how many times we'll do that before erroring out.

Thoughts?

regards, tom lane
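The capped LOG-and-push-on variant could amount to little more than a consecutive-failure counter that a successful read resets (a standalone sketch of the escalation logic only; the threshold and function names are invented, not actual pgstat.c code):

```c
#include <assert.h>
#include <stdbool.h>

#define PGSTAT_MAX_SOFT_ERRORS 100  /* invented threshold for consecutive soft errors */

static int soft_error_streak = 0;

/*
 * Called after a recoverable recv() failure.  Returns true if the caller
 * should LOG and keep going, false once too many failures occur in a row
 * (at which point the caller would escalate to ERROR).
 */
static bool
pgstat_tolerate_error(void)
{
    return ++soft_error_streak < PGSTAT_MAX_SOFT_ERRORS;
}

/* Called after any successful read: a working socket clears the streak. */
static void
pgstat_note_success(void)
{
    soft_error_streak = 0;
}
```

This keeps the collector alive through transient ENOBUFS/ENOMEM storms while still dying (and getting respawned) if the condition never clears.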
Re: MarkBufferDirtyHint() and LSN update
On Wed, Oct 30, 2019 at 02:44:18PM +0100, Antonin Houska wrote:
>Please consider this scenario (race conditions):
>
>1. FlushBuffer() has written the buffer but hasn't yet managed to clear
>   the BM_DIRTY flag (however BM_JUST_DIRTIED could be cleared by now).
>
>2. Another backend modified a hint bit and called MarkBufferDirtyHint().
>
>3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true
>   (e.g. due to checksums enabled), a new LSN is computed, however it's
>   not assigned to the page because the buffer is still dirty:
>
>	if (!(buf_state & BM_DIRTY))
>	{
>		...
>		if (!XLogRecPtrIsInvalid(lsn))
>			PageSetLSN(page, lsn);
>	}
>
>4. MarkBufferDirtyHint() completes.
>
>5. In the first session, FlushBuffer()->TerminateBufferIO() will not
>   clear BM_DIRTY because MarkBufferDirtyHint() has eventually set
>   BM_JUST_DIRTIED. Thus the hint bit change itself will be written by
>   the next call of FlushBuffer(). However the page LSN hasn't been
>   updated, so the requirement that WAL must be flushed first is not met.
>
>I think that PageSetLSN() should be called regardless of BM_DIRTY. Do I
>miss any subtle detail?

Isn't this prevented by locking of the buffer header? Both FlushBuffer and MarkBufferDirtyHint do obtain that lock. I see MarkBufferDirtyHint does a bit of work before, but that's related to BM_PERMANENT.

If there really is a race condition, it shouldn't be that difficult to trigger it by adding a sleep or a breakpoint, I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Wed, Oct 30, 2019 at 12:03 PM Andres Freund wrote: > I'd much rather not entrench this further, even leaving global indexes > aside. The 4 byte block number is a significant limitation for heap > tables too, and we should lift that at some point not too far away. > Then there's also other AMs that could really use a wider tid space. I agree that that limitation is a problem that should be fixed before too long. But the solution probably shouldn't be a radical departure from what we have today. The vast majority of tables are not affected by the TID space limitation. Those tables that are can tolerate supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are used for the higher portion of the heap TID space alone. The idea here is that TID is varwidth, but actually uses the existing heap TID format most of the time. For larger tables it uses a wider fixed width struct that largely works the same as the old 6 byte struct. > > Though I suppose a posting list almost has to have fixed width TIDs to > > perform acceptably. > > Hm. It's not clear to me why that is? Random access matters for things like determining the correct offset to split a posting list at. This is needed in the event of an overlapping insertion of a new duplicate tuple whose heap TID falls within the range of the posting list. Also, I want to be able to scan posting lists backwards for backwards scans. In general, fixed-width TIDs make the page space accounting fairly simple, which matters a lot in nbtree. I can support varwidth TIDs in the future pretty well if the TID doesn't have to be *arbitrarily* wide. Individual posting lists can themselves either use 6 byte or 8 byte TIDs, preserving the ability to access a posting list entry at random using simple pointer arithmetic. This makes converting over index AMs a lot less painful -- it'll be pretty easy to avoid mixing together the 6 byte and 8 byte structs. 
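The two fixed widths described above might look like the following standalone sketch (invented struct names and field split, not the actual ItemPointerData definition): the wide form keeps the 16-bit offset and simply extends the block number, so a classic tid upconverts losslessly.

```c
#include <assert.h>
#include <stdint.h>

/* Today's 6-byte heap tid: 32-bit block number, 16-bit offset. */
typedef struct ItemPointer6
{
    uint16_t blk_hi;
    uint16_t blk_lo;
    uint16_t offset;
} ItemPointer6;

/* A hypothetical 8-byte "long" tid: 48-bit block number, 16-bit offset. */
typedef struct ItemPointer8
{
    uint16_t blk_hi;
    uint16_t blk_mid;
    uint16_t blk_lo;
    uint16_t offset;
} ItemPointer8;

/* Upconvert a classic tid into the wide format without losing anything. */
static ItemPointer8
tid_widen(ItemPointer6 t)
{
    ItemPointer8 w = {0, t.blk_hi, t.blk_lo, t.offset};

    return w;
}
```

Because both forms are fixed width, a posting list can commit to one of them up front and keep using plain pointer arithmetic for random access.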
> > Can we steal some bits that are currently used for offset number
> > instead? 16 bits is far more than we ever need to use for heap offset
> > numbers in practice.
>
> I think that's a terrible idea. For one, some AMs will have significantly
> higher limits, especially taking compression and larger block sizes into
> account. Also not all AMs need identifiers tied so closely to a disk
> position, e.g. zedstore does not. We shouldn't hack evermore
> information into the offset, given that background.

Fair enough, but somebody needs to cut some scope here.

> Having to walk through the index tuple might be acceptable - in all
> likelihood we'll have to do so anyway. It does however not *really*
> resolve the issue that we still need to pass something tid-like back
> from the indexam, so we can fetch the associated tuple from the heap,
> or add the tid to a bitmap. But that could be done separately from the
> index internal data structures.

I agree.

> > Generalizing the nbtree AM to be able to work with an arbitrary type
> > of table row identifier that isn't at all like a TID raises tricky
> > definitional questions.
>
> Hm. I don't see why different types of TID would imply them being
> stable?

It is unclear what it means. I would like to see a sketch of a design for varwidth TIDs that balances everybody's concerns.

I don't think "indirect" indexes are a realistic goal for Postgres. VACUUM is just too messy there (as is any other garbage collection mechanism). Zedstore and Zheap don't change this.

> > Frankly I am not very enthusiastic about working on a project that has
> > unclear scope and unclear benefits for users.
>
> Why would properly supporting AMs like zedstore, global indexes,
> "indirect" indexes etc benefit users?

Global indexes seem doable. I don't see how "indirect" indexes can ever work in Postgres. I don't know exactly what zedstore needs here, but maybe it can work well with a less ambitious design for varwidth TIDs along the lines I've sketched.
-- Peter Geoghegan
Re: Parallel leader process info in EXPLAIN
On Wed, Oct 30, 2019 at 10:39:04AM -0700, Peter Geoghegan wrote: On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman wrote: Checked out the patches a bit and noticed that the tuplesort instrumentation uses spaceUsed and I saw this comment in tuplesort_get_stats() might it be worth trying out the memory accounting API 5dd7fc1519461548eebf26c33eac6878ea3e8788 here? I made exactly the same suggestion several years back, not long after the API was first proposed by Jeff. However, tuplesort.c has changed a lot since that time, to the extent that that comment now seems kind of obsolete. These days, availMem accounting still isn't used at all for disk sorts. Rather, the slab allocator is used. Virtually all the memory used to merge is now managed by logtape.c, with only fixed per-tape memory buffers within tuplesort.c. This per-tape stuff is tiny anyway -- slightly more than 1KiB per tape. It would now be relatively straightforward to report the memory used by disk-based sorts, without needing to use the memory accounting API. I imagine that it would report the high watermark memory usage during the final merge. It's possible for this to be lower than the high watermark during initial run generation (e.g. because of the MaxAllocSize limit in buffer size within logtape.c), but that still seems like the most useful figure to users. There'd be a new "LogicalTapeSetMemory()" function to go along with the existing LogicalTapeSetBlocks() function, or something along those lines. Not planning to work on this now, but perhaps you have time for it. Another thing worth mentioning is that the memory accounting API does nothing about the pfree() calls, mentioned in the comment. The memory is tracked at the block level, so unless the pfree() frees the whole block (which only really happens for oversized chunks) the accounting info does not change. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
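The block-level behaviour Tomas describes can be shown with a toy model (a standalone illustration, not the actual mcxt.c logic): freeing a chunk inside a block leaves the accounted total unchanged, and only releasing a whole block back to the OS moves it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: the accounting counts blocks obtained from malloc, not chunks. */
typedef struct ToyContext
{
    size_t mem_allocated;   /* what a block-level accounting API would report */
} ToyContext;

static void
toy_alloc_block(ToyContext *cxt, size_t blksize)
{
    cxt->mem_allocated += blksize;  /* a newly malloc'd block is counted... */
}

static void
toy_pfree_chunk(ToyContext *cxt)
{
    /* ...but freeing a chunk merely returns it to the block's freelist */
    (void) cxt;
}

static void
toy_free_block(ToyContext *cxt, size_t blksize)
{
    cxt->mem_allocated -= blksize;  /* only releasing a whole block changes the count */
}
```

So the pfree()s mentioned in the tuplesort.c comment would indeed be invisible to the accounting unless they emptied an entire block (which in practice mostly happens for oversized chunks).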
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Hi,

On 2019-10-30 11:33:21 -0700, Peter Geoghegan wrote:
> On Mon, Apr 22, 2019 at 9:35 AM Andres Freund wrote:
> > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote:
> > > Andres has suggested that I work on teaching nbtree to accommodate
> > > variable-width, logical table identifiers, such as those required for
> > > indirect indexes, or clustered indexes, where secondary indexes must
> > > use a logical primary key value instead of a heap TID.
>
> I'm revisiting this thread now because it may have relevance to the
> nbtree deduplication patch. If nothing else, the patch further commits
> us to the current heap TID format by making assumptions about the
> width of posting lists with 6 byte TIDs.

I'd much rather not entrench this further, even leaving global indexes aside. The 4 byte block number is a significant limitation for heap tables too, and we should lift that at some point not too far away. Then there's also other AMs that could really use a wider tid space.

> Though I suppose a posting list almost has to have fixed width TIDs to
> perform acceptably.

Hm. It's not clear to me why that is?

> > I think it's two more cases:
> >
> > - table AMs that want to support tables that are bigger than 32TB. That
> >   used to be unrealistic, but it's not anymore. Especially when the need
> >   to VACUUM etc is largely removed / reduced.
>
> Can we steal some bits that are currently used for offset number
> instead? 16 bits is far more than we ever need to use for heap offset
> numbers in practice.

I think that's a terrible idea. For one, some AMs will have significantly higher limits, especially taking compression and larger block sizes into account. Also not all AMs need identifiers tied so closely to a disk position, e.g. zedstore does not. We shouldn't hack evermore information into the offset, given that background.

> (I wonder if this would also have benefits for the representation of
> in-memory bitmaps?)

Hm. Not sure how?
> > - global indexes (for cross-partition unique constraints and such),
> >   which need a partition identifier as part of the tid (or as part of
> >   the index key, but I think that actually makes interaction with
> >   indexam from other layers more complicated - the inside of the index
> >   maybe may want to represent it as a column, but to the outside that
> >   ought not to be visible)
>
> Can we just use an implementation level attribute for this? Would it
> be so bad if we weren't able to jump straight to the partition number
> without walking through the tuple when the tuple has varwidth
> attributes? (If that isn't acceptable, then we can probably make it
> work for global indexes without having to generalize everything.)

Having to walk through the index tuple might be acceptable - in all likelihood we'll have to do so anyway. It does however not *really* resolve the issue that we still need to pass something tid-like back from the indexam, so we can fetch the associated tuple from the heap, or add the tid to a bitmap. But that could be done separately from the index internal data structures.

> Generalizing the nbtree AM to be able to work with an arbitrary type
> of table row identifier that isn't at all like a TID raises tricky
> definitional questions. It would have to work in a way that made the
> new variety of table row identifier stable, which is a significant new
> requirement (and one that zheap is clearly not interested in).

Hm. I don't see why different types of TID would imply them being stable?

> I am not suggesting that these issues are totally insurmountable. What
> I am saying is this: If we already had "stable logical" TIDs instead
> of "mostly physical TIDs", then generalizing nbtree index tuples to
> store arbitrary table row identifiers would more or less be all about
> the data structure managed by nbtree.
> But that isn't the case, and that strongly discourages me from working
> on this -- we shouldn't talk about the problem as if it is mostly just
> a matter of settling on the best index tuple format.
>
> Frankly I am not very enthusiastic about working on a project that has
> unclear scope and unclear benefits for users.

Why would properly supporting AMs like zedstore, global indexes, "indirect" indexes etc benefit users?

Greetings,

Andres Freund
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund wrote: > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > > Andres has suggested that I work on teaching nbtree to accommodate > > variable-width, logical table identifiers, such as those required for > > indirect indexes, or clustered indexes, where secondary indexes must > > use a logical primary key value instead of a heap TID. I'm revisiting this thread now because it may have relevance to the nbtree deduplication patch. If nothing else, the patch further commits us to the current heap TID format by making assumptions about the width of posting lists with 6 byte TIDs. Though I suppose a posting list almost has to have fixed width TIDs to perform acceptably. Database systems with a varwidth TID format probably don't support anything like posting lists. > I think it's two more cases: > > - table AMs that want to support tables that are bigger than 32TB. That > used to be unrealistic, but it's not anymore. Especially when the need > to VACUUM etc is largely removed / reduced. Can we steal some bits that are currently used for offset number instead? 16 bits is far more than we ever need to use for heap offset numbers in practice. (I wonder if this would also have benefits for the representation of in-memory bitmaps?) > - global indexes (for cross-partition unique constraints and such), > which need a partition identifier as part of the tid (or as part of > the index key, but I think that actually makes interaction with > indexam from other layers more complicated - the inside of the index > maybe may want to represent it as a column, but to the outside that > ought not to be visible) Can we just use an implementation level attribute for this? Would it be so bad if we weren't able to jump straight to the partition number without walking through the tuple when the tuple has varwidth attributes? (If that isn't acceptable, then we can probably make it work for global indexes without having to generalize everything.) 
In general, Postgres heap TIDs are not stable identifiers of rows (they only stably identify HOT chains). This is not the case in other database systems, which may go to great trouble to make it possible to assume that TIDs are stable over time (Andy Pavlo says that our TIDs are physical while Oracle's are logical -- I don't like that terminology, but know what he means).

Generalizing the nbtree AM to be able to work with an arbitrary type of table row identifier that isn't at all like a TID raises tricky definitional questions. It would have to work in a way that made the new variety of table row identifier stable, which is a significant new requirement (and one that zheap is clearly not interested in). If we're willing to support something like a clustered index in nbtree, I wonder what we need to do to make things like numeric display scale still work. For bonus points, describe how unique checking works with a secondary unique index.

I am not suggesting that these issues are totally insurmountable. What I am saying is this: If we already had "stable logical" TIDs instead of "mostly physical TIDs", then generalizing nbtree index tuples to store arbitrary table row identifiers would more or less be all about the data structure managed by nbtree. But that isn't the case, and that strongly discourages me from working on this -- we shouldn't talk about the problem as if it is mostly just a matter of settling on the best index tuple format.

Frankly I am not very enthusiastic about working on a project that has unclear scope and unclear benefits for users.

-- 
Peter Geoghegan
Re: Creating foreign key on partitioned table is too slow
On Thu, Oct 24, 2019 at 04:28:38PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-10-23 05:59:01 +, kato-...@fujitsu.com wrote:
>> To benchmark with tpcb model, I tried to create a foreign key in the
>> partitioned history table, but backend process killed by OOM. the
>> number of partitions is 8192. I tried in master(commit: ad4b7aeb84).
>
>Obviously this should be improved. But I think it's also worthwhile to
>note that using 8k partitions is very unlikely to be a good choice for
>anything. The metadata, partition pruning, etc overhead is just going
>to be very substantial.

True. Especially with two partitioned tables, each with 8k partitions.

I do think it makes sense to reduce the memory usage, because just eating all available memory (in the extreme case) is not very nice. I've added that patch to the CF, although the patch I shared is very crude and I'm by no means suggesting it's how it should be done ultimately.

The other bit (speed of planning with 8k partitions) is probably a more general issue, and I suppose we'll improve that over time. I don't think there's a simple change magically improving that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel leader process info in EXPLAIN
On Wed, Oct 30, 2019 at 9:24 AM Melanie Plageman wrote: > Checked out the patches a bit and noticed that the tuplesort > instrumentation uses spaceUsed and I saw this comment in > tuplesort_get_stats() > might it be worth trying out the memory accounting API > 5dd7fc1519461548eebf26c33eac6878ea3e8788 here? I made exactly the same suggestion several years back, not long after the API was first proposed by Jeff. However, tuplesort.c has changed a lot since that time, to the extent that that comment now seems kind of obsolete. These days, availMem accounting still isn't used at all for disk sorts. Rather, the slab allocator is used. Virtually all the memory used to merge is now managed by logtape.c, with only fixed per-tape memory buffers within tuplesort.c. This per-tape stuff is tiny anyway -- slightly more than 1KiB per tape. It would now be relatively straightforward to report the memory used by disk-based sorts, without needing to use the memory accounting API. I imagine that it would report the high watermark memory usage during the final merge. It's possible for this to be lower than the high watermark during initial run generation (e.g. because of the MaxAllocSize limit in buffer size within logtape.c), but that still seems like the most useful figure to users. There'd be a new "LogicalTapeSetMemory()" function to go along with the existing LogicalTapeSetBlocks() function, or something along those lines. Not planning to work on this now, but perhaps you have time for it. -- Peter Geoghegan
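The proposed function could amount to summing the per-tape buffer sizes plus the tape set's own bookkeeping (a standalone model with simplified stand-in structs and an invented signature; the real logtape.c data structures differ):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for logtape.c structures. */
typedef struct LogicalTape
{
    size_t buffer_size;     /* read/write buffer held by this tape */
} LogicalTape;

typedef struct LogicalTapeSet
{
    int          ntapes;
    LogicalTape *tapes;
    size_t       overhead;  /* bookkeeping outside the per-tape buffers */
} LogicalTapeSet;

/*
 * Hypothetical LogicalTapeSetMemory(): total memory currently held by the
 * tape set, analogous to the existing LogicalTapeSetBlocks() for disk use.
 */
static size_t
LogicalTapeSetMemory(const LogicalTapeSet *lts)
{
    size_t total = lts->overhead;

    for (int i = 0; i < lts->ntapes; i++)
        total += lts->tapes[i].buffer_size;
    return total;
}
```

Sampling this at the start of the final merge would capture the high watermark figure discussed above, since that is when all tape buffers are allocated at once.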
Re: PL/Python fails on new NetBSD/PPC 8.0 install
I wrote: > Thomas Munro writes: >> From a quick look at the relevant trees, isn't the problem here that >> cpython thinks it can reserve pthread_t value -1 (or rather, that >> number cast to unsigned long, which is the type it uses for its own >> thread IDs): > Yeah, this. I shall now go rant at the Python people about that. Done at https://bugs.python.org/issue38646 regards, tom lane
Re: Proposal: Global Index
Hi,

On 2019-10-30 13:05:57 -0400, Tom Lane wrote:
> Peter Geoghegan writes:
> > On Wed, Oct 30, 2019 at 9:23 AM Tom Lane wrote:
> >> Well, the *effects* of the feature seem desirable, but that doesn't
> >> mean that we want an implementation that actually has a shared index.
> >> As soon as you do that, you've thrown away most of the benefits of
> >> having a partitioned data structure in the first place.
>
> > Right, but that's only the case for the global index. Global indexes
> > are useful when used judiciously.
>
> But ... why bother with partitioning then? To me, the main reasons
> why you might want a partitioned table are

Quite commonly there's a lot of *other* indexes, often on a lot wider data than the primary key, that don't need to be global. And whereas in a lot of cases the primary key in a partitioned table has pretty good locality (and thus will be mostly buffered IO), other indexes will often not have that property (i.e. not have much correlation with table position).

> * ability to cheaply add and remove partitions, primarily so that
>   you can cheaply do things like "delete the oldest month's data".

You can still do that to some degree with a global index. Imagine e.g. keeping a 'partition id' as a sort-of column in the global index. That allows you to drop the partition, without having to immediately rebuild the index, by checking the partition id against the live partitions during lookup. So sure, you're wasting space for a bit in the global index, but it'll also be space that is likely to be fairly efficiently reclaimed the next time vacuum touches the index. And if not, the global index can be rebuilt concurrently without blocking writes.

> * ability to scale past our limits on the physical size of one table
>   --- both the hard BlockNumber-based limit, and the performance
>   constraints of e.g. vacuuming a very large table.
For that to be a problem for a global index the global index (which will often be something like two int4 or int8 columns) itself would need to be above the block number based limit - which doesn't seem that close. WRT vacuuming - based on my observations the table itself isn't a performance problem for vacuuming all that commonly anymore, it's the associated index scans. So yea, that's a problem. Greetings, Andres Freund
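The partition-id filtering described earlier in this message could look roughly like this during an index scan (a self-contained sketch with invented names; a real implementation would consult the catalogs rather than an array): entries from dropped partitions are simply skipped, so DROP/DETACH need not rebuild the global index immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A global-index entry carries a partition id alongside the heap tid. */
typedef struct GlobalIndexTuple
{
    uint32_t partition_id;
    uint64_t heap_tid;      /* opaque here */
} GlobalIndexTuple;

/* Stand-in for "is this partition still attached?" */
static bool
partition_is_live(const uint32_t *live, int nlive, uint32_t pid)
{
    for (int i = 0; i < nlive; i++)
        if (live[i] == pid)
            return true;
    return false;
}

/*
 * Filter scan results: tuples whose partition is no longer attached are
 * dropped on the way out; vacuum would reclaim them from the index later.
 */
static int
filter_live_entries(const GlobalIndexTuple *in, int n,
                    const uint32_t *live, int nlive,
                    GlobalIndexTuple *out)
{
    int nout = 0;

    for (int i = 0; i < n; i++)
        if (partition_is_live(live, nlive, in[i].partition_id))
            out[nout++] = in[i];
    return nout;
}
```

The same partition id is what would let a scan be "pushed down" to a single partition, as mentioned elsewhere in this thread.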
Re: [Proposal] Add accumulated statistics
On Tue, Jan 15, 2019 at 2:14 AM Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> wrote:

> From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
> > the cumulated lock statistics maybe doesn't help with debugging - but it
> > is very good indicator of database (in production usage) health.
>
> I think it will help both. But I don't think the sampling will be as
> helpful as the precise lock statistics accumulation, because the sampling
> doesn't give us exactly how effective our improvements to PostgreSQL code
> are. I remember PG developers used LOCK_STATS to see how many (or ratio
> of) lwlock waits decreased by applying patches.
>
> We can use the cumulated lock stats like:
>
> 1. SELECT * FROM pg_session_waits;
> 2. Run a benchmark.
> 3. SELECT * FROM pg_session_waits;
> 4. Calculate the difference between 1 and 3.
>
> Or, reset the wait stats before the benchmark run and just use the stats
> as-is.
>
> I'd like to know why you thought the cumulated wait stats isn't helpful
> for debugging.

I don't remember my thoughts, and maybe I used the wrong sentences - usually lock times are very small, and very unstable if you use too detailed a level. But if you use aggregated values per some longer time window, then these values can be stable and very interesting. More - usually lock time has a correlation with database (application) health.

Like you, I don't think sampled values are too helpful.

Regards

Pavel

> Regards
> Takayuki Tsunakawa
Re: Proposal: Global Index
Peter Geoghegan writes:
> On Wed, Oct 30, 2019 at 9:23 AM Tom Lane wrote:
>> Well, the *effects* of the feature seem desirable, but that doesn't
>> mean that we want an implementation that actually has a shared index.
>> As soon as you do that, you've thrown away most of the benefits of
>> having a partitioned data structure in the first place.

> Right, but that's only the case for the global index. Global indexes
> are useful when used judiciously.

But ... why bother with partitioning then? To me, the main reasons why you might want a partitioned table are

* ability to cheaply add and remove partitions, primarily so that you can cheaply do things like "delete the oldest month's data".

* ability to scale past our limits on the physical size of one table --- both the hard BlockNumber-based limit, and the performance constraints of e.g. vacuuming a very large table.

Both of those go out the window with a global index. So you might as well just have one table and forget all the overhead.

regards, tom lane
Re: Proposal: Global Index
On Wed, Oct 30, 2019 at 9:23 AM Tom Lane wrote: > Well, the *effects* of the feature seem desirable, but that doesn't > mean that we want an implementation that actually has a shared index. > As soon as you do that, you've thrown away most of the benefits of > having a partitioned data structure in the first place. Right, but that's only the case for the global index. Global indexes are useful when used judiciously. They enable the use of partitioning for use cases where not being able to enforce uniqueness across all partitions happens to be a deal breaker. I bet that this is fairly common. > No, I don't have an idea how we might support, eg, uniqueness of > non-partition-key columns without that. But we need to spend our > effort on figuring that out, not on building a complicated mechanism > whose performance is never going to not suck. I don't think that there is a way to solve the problem that doesn't look very much like a global index. Also, being able to push down a partition number when scanning a global index seems like it would be very compelling in some scenarios. I'm a bit worried about the complexity that will need to be added to nbtree to make global indexes work, but it's probably possible to come up with something that isn't too bad. GIN already uses an implementation level attribute number column for multi-column GIN indexes, which is a little like what Ibrar has in mind. The really complicated new code required for global indexes will be in places like vacuumlazy.c. -- Peter Geoghegan
Re: Parallel leader process info in EXPLAIN
On Wed, Oct 23, 2019 at 12:30 AM Thomas Munro wrote:
>
> While working on some slides explaining EXPLAIN, I couldn't resist the
> urge to add the missing $SUBJECT. The attached 0001 patch gives the
> following:
>
>  Gather ... time=0.146..33.077 rows=1 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    Buffers: shared hit=4425
>    ->  Parallel Seq Scan on public.t ... time=19.421..30.092 rows=0 loops=3)
>          Filter: (t.i = 42)
>          Rows Removed by Filter: 33
>          Leader: actual time=0.013..32.025 rows=1 loops=1    <--- NEW
>            Buffers: shared hit=1546                          <--- NEW
>          Worker 0: actual time=29.069..29.069 rows=0 loops=1
>            Buffers: shared hit=1126
>          Worker 1: actual time=29.181..29.181 rows=0 loops=1
>            Buffers: shared hit=1753
>
> Without that, you have to deduce what work was done in the leader, but
> I'd rather just show it.
>
> The 0002 patch adjusts Sort for consistency with that scheme, so you get:
>
>  Sort ... time=84.303..122.238 rows=33 loops=3)
>    Output: t1.i
>    Sort Key: t1.i
>    Leader: Sort Method: external merge  Disk: 5864kB    <--- DIFFERENT
>    Worker 0: Sort Method: external merge  Disk: 3376kB
>    Worker 1: Sort Method: external merge  Disk: 4504kB
>    Leader: actual time=119.624..165.949 rows=426914 loops=1
>    Worker 0: actual time=61.239..90.984 rows=245612 loops=1
>    Worker 1: actual time=72.046..109.782 rows=327474 loops=1
>
> Without the "Leader" label, it's not really clear to the uninitiated
> whether you're looking at combined, average or single process numbers.

Cool! I dig it.

Checked out the patches a bit and noticed that the tuplesort instrumentation uses spaceUsed and I saw this comment in tuplesort_get_stats():

	/*
	 * Note: it might seem we should provide both memory and disk usage for a
	 * disk-based sort.  However, the current code doesn't track memory space
	 * accurately once we have begun to return tuples to the caller (since we
	 * don't account for pfree's the caller is expected to do), so we cannot
	 * rely on availMem in a disk sort.  This does not seem worth the overhead
	 * to fix.  Is it worth creating an API for the memory context code to
	 * tell us how much is actually used in sortcontext?
	 */

might it be worth trying out the memory accounting API 5dd7fc1519461548eebf26c33eac6878ea3e8788 here?

> Of course there are some more things that could be reported in a
> similar way eventually, such as filter counters and hash join details.

This made me think about other explain wishlist items. For parallel hashjoin, I would find it useful to know which batches each worker participated in (maybe just probing to start with, but loading would be great too). I'm not sure anyone else (especially users) would care about this, though.

-- 
Melanie Plageman
Re: Proposal: Global Index
Robert Haas writes: > On Wed, Oct 30, 2019 at 10:13 AM Tom Lane wrote: >> I believe that the current design of partitioning is explicitly intended >> to avoid the need for such a construct. It'd be absolutely disastrous >> to have such a thing from many standpoints, including the breadth of >> locking needed to work with the global index, the difficulty of vacuuming, >> and the impossibility of cheaply attaching or detaching partitions. >> In other words, this is a "feature" we do not want. > I don't think that's true. Certainly, a lot of EnterpriseDB customers > want this feature - it comes up regularly in discussions here. But > that is not to say that the technical challenges are not formidable, > and I don't think this proposal really grapples with any of them. > However, that doesn't mean that the feature isn't desirable. Well, the *effects* of the feature seem desirable, but that doesn't mean that we want an implementation that actually has a shared index. As soon as you do that, you've thrown away most of the benefits of having a partitioned data structure in the first place. No, I don't have an idea how we might support, eg, uniqueness of non-partition-key columns without that. But we need to spend our effort on figuring that out, not on building a complicated mechanism whose performance is never going to not suck. regards, tom lane
Re: tableam vs. TOAST
On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu wrote: > While testing the Toast patch(PG+v7 patch) I found below server crash. > System configuration: > VCPUs: 4, RAM: 8GB, Storage: 320GB > > This issue is not frequently reproducible, we need to repeat the same > testcase multiple times. > I wonder if this is an independent bug, because the backtrace doesn't look like it's related to the stuff this is changing. Your report doesn't specify whether you can also reproduce the problem without the patch, which is something that you should always check before reporting a bug in a particular patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: Global Index
On Wed, Oct 30, 2019 at 10:13 AM Tom Lane wrote: > I believe that the current design of partitioning is explicitly intended > to avoid the need for such a construct. It'd be absolutely disastrous > to have such a thing from many standpoints, including the breadth of > locking needed to work with the global index, the difficulty of vacuuming, > and the impossibility of cheaply attaching or detaching partitions. > > In other words, this is a "feature" we do not want. I don't think that's true. Certainly, a lot of EnterpriseDB customers want this feature - it comes up regularly in discussions here. But that is not to say that the technical challenges are not formidable, and I don't think this proposal really grapples with any of them. However, that doesn't mean that the feature isn't desirable. One of the biggest reasons why people want it is to enforce uniqueness for secondary keys - e.g. the employees table is partitioned by employee ID, but SSN should also be unique, at least among employees for whom it's not NULL. But people also want it for faster data retrieval: if you're looking for a commonly-occurring value, an index per partition is fine. But if you're looking for values that occur only once or a few times across the whole hierarchy, an index scan per partition is very costly. Consider, e.g.:

Nested Loop
  -> Seq Scan
  -> Append
       -> Index Scan on each_partition

You don't have to have very many partitions for that to suck, and it's a thing that people want to do. Runtime partition pruning helps with this case a lot, but, once again, only for the primary key. Secondary keys are a big problem for partitioning today, in many ways. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
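[Editor's note: the secondary-key limitation Robert describes can be sketched in SQL. This is a hypothetical illustration with invented table and column names, not from the thread; on current PostgreSQL a unique index on a partitioned table must include the partition key, which is exactly the gap a global index would fill.]

```sql
CREATE TABLE employees (
    employee_id int NOT NULL,
    ssn         text
) PARTITION BY HASH (employee_id);

-- Works: the index includes the partition key, so each partition
-- can enforce uniqueness locally.
CREATE UNIQUE INDEX ON employees (employee_id);

-- Fails today: the unique index omits the partition key, so no
-- single per-partition index can enforce it across the hierarchy.
CREATE UNIQUE INDEX ON employees (ssn);
```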
Re: v12.0: ERROR: could not find pathkey item to sort
Amit Langote writes: > Attached updated patches. [ looks at that... ] I seriously, seriously dislike what you did in build_join_rel, ie adding the new joinrel to the global data structures before it's fully filled in. That's inevitably going to bite us on the ass someday, and you couldn't even be bothered with a comment. Worse, the reason you did that seems to be so that generate_join_implied_equalities can find and modify the joinrel, which is an undocumented and not very well thought out side-effect. There are other call sites for that where the joinrel may or may not already exist, meaning that it might or might not add more members into the joinrel's eclass_indexes. At best that's wasted work, and at worst it costs efficiency, since we could in principle get different sets of common relids depending on which join input pair we're considering. They're equally valid sets, but unioning them is just going to add more relids than we need. Also, the existing logic around eclass_indexes is that it's only set for baserels and we know it is valid after we've finished EC merging. I don't much like modifying add_child_rel_equivalences to have some different opinions about that for joinrels. It'd probably be better to just eat the cost of doing get_common_eclass_indexes over again when it's time to do add_child_rel_equivalences for a joinrel, since we aren't (I hope) going to do that more than once per joinrel anyway. This would probably require refactoring things so that there are separate entry points to add child equivalences for base rels and join rels. But that seems cleaner anyway than what you've got here. David --- much of the complexity here comes from the addition of the eclass_indexes infrastructure, so do you have any thoughts? regards, tom lane
Re: PL/Python fails on new NetBSD/PPC 8.0 install
Thomas Munro writes: > On Wed, Oct 30, 2019 at 9:25 AM Tom Lane wrote: >> What I'm inclined to do is go file a bug report saying that this >> behavior contradicts both POSIX and NetBSD's own man page, and >> see what they say about that. So I went and filed that bug, http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=54661 and the answer seems to be that netbsd's libpthread is operating as designed. They don't support creating new threads if libpthread wasn't present at main program start, so redirecting all the entry points to the libc stub functions in that case is actually pretty sane, self-consistent behavior. This behavior is actually kinda useful from our standpoint: it means that a perlu/pythonu/tclu function *can't* cause a backend to become multithreaded, even if it tries. So I definitely don't want to "fix" this by linking libpthread to the core backend; that would open us up to problems we needn't have, on this platform anyway. > From a quick look at the relevant trees, isn't the problem here that > cpython thinks it can reserve pthread_t value -1 (or rather, that > number cast to unsigned long, which is the type it uses for its own > thread IDs): Yeah, this. I shall now go rant at the Python people about that. regards, tom lane
Re: WIP/PoC for parallel backup
On Mon, Oct 28, 2019 at 8:29 PM Robert Haas wrote: > On Mon, Oct 28, 2019 at 10:03 AM Asif Rehman > wrote: > > I have updated the patch to include the changes suggested by Jeevan. > This patch also implements the thread workers instead of > > processes and fetches a single file at a time. The tar format has been > disabled for first version of parallel backup. > > Looking at 0001-0003: > > It's not clear to me what the purpose of the start WAL location is > supposed to be. As far as I can see, SendBackupFiles() stores it in a > variable which is then used for exactly nothing, and nothing else uses > it. It seems like that would be part of a potential incremental > backup feature, but I don't see what it's got to do with parallel full > backup. > 'startptr' is used by sendFile() during checksum verification. Since SendBackupFiles() uses sendFile(), we have to set a valid WAL location. > The tablespace_path option appears entirely unused, and I don't know > why that should be necessary here, either. > This is to calculate the basepathlen. We need to exclude the tablespace location (or base path) from the filename before it is sent to the client with the sendFile() call. I added this option primarily to avoid performing string manipulation on the filename to extract the tablespace location and then calculate the basepathlen. Alternatively, we can do it by extracting the base path from the received filename. What do you suggest? > > STORE_BACKUPFILE() seems like maybe it should be a function rather > than a macro, and also probably be renamed, because it doesn't store > files and the argument's not necessarily a file. > Sure. > > SendBackupManifest() does not send a backup manifest in the sense > contemplated by the email thread on that subject. It sends a file > list. That seems like the right idea - IMHO, anyway - but you need to > do a thorough renaming.
> I'm considering the following command names:

START_BACKUP - Starts the backup process.
SEND_BACKUP_FILELIST (instead of SEND_BACKUP_MANIFEST) - Sends the list of all files to be backed up (along with file information such as filename, file type (directory/file/link), file size, and file mtime for each file).
SEND_BACKUP_FILES - Sends one or more files to the client.
STOP_BACKUP - Stops the backup process.

I'll update the function names accordingly after your confirmation. Of course, suggestions for better names are welcome. > > I think it would be fine to decide that this facility won't support > exclusive-mode backup. > Sure. Will drop this patch. > > I don't think much of having both sendDir() and sendDir_(). The latter > name is inconsistent with any naming convention we have, and there > seems to be no reason not to just add an argument to sendDir() and > change the callers. > I think we should rename - perhaps as a preparatory patch - the > sizeonly flag to dryrun, or something like that. > Sure, will take care of it. > The resource cleanup does not look right. You've included calls to > PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, 0) in both StartBackup() > and StopBackup(), but what happens if there is an error or even a > clean shutdown of the connection in between? I think that there needs to be some change here to ensure that a walsender will always call > base_backup_cleanup() when it exits; I think that'd probably remove > the need for any PG_ENSURE_ERROR_CLEANUP calls at all, including ones > we have already. This might also be something that could be done as a > separate, preparatory refactoring patch. > You're right. I didn't handle this case properly. I will remove the PG_ENSURE_ERROR_CLEANUP calls and replace them with a before_shmem_exit handler. This way, whenever the backend process exits, base_backup_cleanup will be called: - If it exits before calling do_pg_stop_backup, base_backup_cleanup will take care of the cleanup.
- Otherwise, in the case of a clean shutdown (after calling do_pg_stop_backup), base_backup_cleanup will simply return without doing anything. -- Asif Rehman Highgo Software (Canada/China/Pakistan) URL : www.highgo.ca
Re: Proposal: Global Index
Ibrar Ahmed writes: > A global index by very definition is a single index on the parent table > that maps to many > underlying table partitions. I believe that the current design of partitioning is explicitly intended to avoid the need for such a construct. It'd be absolutely disastrous to have such a thing from many standpoints, including the breadth of locking needed to work with the global index, the difficulty of vacuuming, and the impossibility of cheaply attaching or detaching partitions. In other words, this is a "feature" we do not want. regards, tom lane
Re: Add SQL function to show total block numbers in the relation
btkimurayuzk writes: > I propose new simple sql query, which shows total block numbers in the > relation. > ... > Of cource, we can know this value such as > select (pg_relation_size('t') / > current_setting('block_size')::bigint)::int; I don't really see why the existing solution isn't sufficient. regards, tom lane
Re: Binary support for pgoutput plugin
On Sun, 27 Oct 2019 at 11:00, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Mon, Jun 17, 2019 at 10:29:26AM -0400, Dave Cramer wrote: > > > Which is what I have done. Thanks > > > > > > I've attached both patches for comments. > > > I still have to add documentation. > > > > Additional patch for documentation. > > Thank you for the patch! Unfortunately 0002 has some conflicts, could > you please send a rebased version? In the meantime I have few questions: > > -LogicalRepRelId > +void > logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup) > { > charaction; > - LogicalRepRelId relid; > - > - /* read the relation id */ > - relid = pq_getmsgint(in, 4); > > action = pq_getmsgbyte(in); > if (action != 'N') > @@ -175,7 +173,6 @@ logicalrep_read_insert(StringInfo in, > LogicalRepTupleData *newtup) > > logicalrep_read_tuple(in, newtup); > > - return relid; > } > > > > - relid = logicalrep_read_insert(s, ); > + /* read the relation id */ > + relid = pq_getmsgint(s, 4); > rel = logicalrep_rel_open(relid, RowExclusiveLock); > + > + logicalrep_read_insert(s, ); > > Maybe I'm missing something, what is the reason of moving pq_getmsgint > out of logicalrep_read_*? Just from the patch it seems that the code is > equivalent. > > > There is one obvious hack where in binary mode I reset the input > > cursor to allow the binary input to be re-read From what I can tell > > the alternative is to convert the data in logicalrep_read_tuple but > > that would require moving a lot of the logic currently in worker.c to > > proto.c. This seems minimally invasive. > > Which logic has to be moved for example? Actually it sounds more natural > to me, if this functionality would be in e.g. logicalrep_read_tuple > rather than slot_store_data, since it has something to do with reading > data. And it seems that in pglogical it's also located in > pglogical_read_tuple. 
> Ok, I've rebased and reverted logicalrep_read_insert. As to the last comment, honestly it's been so long I can't remember why I put that comment in there. Thanks for reviewing, Dave 0001-First-pass-at-working-code-without-subscription-opti.patch Description: Binary data 0003-document-new-binary-option-for-CREATE-SUBSCRIPTION.patch Description: Binary data 0002-add-binary-column-to-pg_subscription.patch Description: Binary data 0004-get-relid-inside-of-logical_read_insert.patch Description: Binary data
Remove HAVE_LONG_LONG_INT
HAVE_LONG_LONG_INT is now implied by the requirement for C99, so the separate Autoconf check can be removed. The uses are almost all in ecpg code, and AFAICT the check was originally added specifically for ecpg. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 913823e1985f14908d7e1abe6215f63982b722a1 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 30 Oct 2019 14:39:15 +0100 Subject: [PATCH] Remove HAVE_LONG_LONG_INT The presence of long long int is now implied in the requirement for C99 and the configure check for the same. We keep the define hard-coded in ecpg_config.h for backward compatibility. --- configure | 117 -- configure.in | 1 - src/include/c.h | 2 - src/include/pg_config.h.in| 6 - src/include/pg_config.h.win32 | 3 - src/interfaces/ecpg/ecpglib/data.c| 14 --- src/interfaces/ecpg/ecpglib/descriptor.c | 4 - src/interfaces/ecpg/ecpglib/execute.c | 6 +- src/interfaces/ecpg/ecpglib/misc.c| 6 - src/interfaces/ecpg/include/ecpg_config.h.in | 2 +- src/interfaces/ecpg/preproc/ecpg.trailer | 36 +- src/interfaces/ecpg/test/expected/sql-sqlda.c | 2 - src/interfaces/ecpg/test/sql/sqlda.pgc| 2 - 13 files changed, 7 insertions(+), 194 deletions(-) diff --git a/configure b/configure index 6b1c779ee3..7312bd7a7a 100755 --- a/configure +++ b/configure @@ -14170,123 +14170,6 @@ fi - { $as_echo "$as_me:${as_lineno-$LINENO}: checking for unsigned long long int" >&5 -$as_echo_n "checking for unsigned long long int... " >&6; } -if ${ac_cv_type_unsigned_long_long_int+:} false; then : - $as_echo_n "(cached) " >&6 -else - ac_cv_type_unsigned_long_long_int=yes - if test "x${ac_cv_prog_cc_c99-no}" = xno; then - cat confdefs.h - <<_ACEOF >conftest.$ac_ext -/* end confdefs.h. */ - - /* For now, do not test the preprocessor; as of 2007 there are too many -implementations with broken preprocessors. Perhaps this can -be revisited in 2012. 
In the meantime, code should not expect -#if to work with literals wider than 32 bits. */ - /* Test literals. */ - long long int ll = 9223372036854775807ll; - long long int nll = -9223372036854775807LL; - unsigned long long int ull = 18446744073709551615ULL; - /* Test constant expressions. */ - typedef int a[((-9223372036854775807LL < 0 && 0 < 9223372036854775807ll) -? 1 : -1)]; - typedef int b[(18446744073709551615ULL <= (unsigned long long int) -1 -? 1 : -1)]; - int i = 63; -int -main () -{ -/* Test availability of runtime routines for shift and division. */ - long long int llmax = 9223372036854775807ll; - unsigned long long int ullmax = 18446744073709551615ull; - return ((ll << 63) | (ll >> 63) | (ll < i) | (ll > i) - | (llmax / ll) | (llmax % ll) - | (ull << 63) | (ull >> 63) | (ull << i) | (ull >> i) - | (ullmax / ull) | (ullmax % ull)); - ; - return 0; -} - -_ACEOF -if ac_fn_c_try_link "$LINENO"; then : - -else - ac_cv_type_unsigned_long_long_int=no -fi -rm -f core conftest.err conftest.$ac_objext \ -conftest$ac_exeext conftest.$ac_ext - fi -fi -{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_type_unsigned_long_long_int" >&5 -$as_echo "$ac_cv_type_unsigned_long_long_int" >&6; } - if test $ac_cv_type_unsigned_long_long_int = yes; then - -$as_echo "#define HAVE_UNSIGNED_LONG_LONG_INT 1" >>confdefs.h - - fi - - - - { $as_echo "$as_me:${as_lineno-$LINENO}: checking for long long int" >&5 -$as_echo_n "checking for long long int... " >&6; } -if ${ac_cv_type_long_long_int+:} false; then : - $as_echo_n "(cached) " >&6 -else - ac_cv_type_long_long_int=yes - if test "x${ac_cv_prog_cc_c99-no}" = xno; then - ac_cv_type_long_long_int=$ac_cv_type_unsigned_long_long_int - if test $ac_cv_type_long_long_int = yes; then - if test "$cross_compiling" = yes; then : - : -else - cat confdefs.h - <<_ACEOF >conftest.$ac_ext -/* end confdefs.h. 
*/ -#include -#ifndef LLONG_MAX -# define HALF \ - (1LL << (sizeof (long long int) * CHAR_BIT - 2)) -# define LLONG_MAX (HALF - 1 + HALF) -#endif -int -main () -{ -long long int n = 1; -int i; -for (i = 0; ; i++) - { -long long int m = n << i; -if (m >> i != n) - return 1; -if (LLONG_MAX / 2 < m) - break; - } -return 0; - ; - return 0; -} -_ACEOF -if ac_fn_c_try_run "$LINENO"; then : - -else -
MarkBufferDirtyHint() and LSN update
Please consider this scenario (race conditions):

1. FlushBuffer() has written the buffer but hasn't yet managed to clear the BM_DIRTY flag (however, BM_JUST_DIRTIED could be cleared by now).

2. Another backend modified a hint bit and called MarkBufferDirtyHint().

3. In MarkBufferDirtyHint(), if XLogHintBitIsNeeded() evaluates to true (e.g. due to checksums being enabled), a new LSN is computed; however, it's not assigned to the page because the buffer is still dirty:

if (!(buf_state & BM_DIRTY))
{
    ...
    if (!XLogRecPtrIsInvalid(lsn))
        PageSetLSN(page, lsn);
}

4. MarkBufferDirtyHint() completes.

5. In the first session, FlushBuffer()->TerminateBufferIO() will not clear BM_DIRTY because MarkBufferDirtyHint() has eventually set BM_JUST_DIRTIED.

Thus the hint bit change itself will be written by the next call of FlushBuffer(). However, the page LSN hasn't been updated, so the requirement that WAL must be flushed first is not met. I think that PageSetLSN() should be called regardless of BM_DIRTY. Do I miss any subtle detail? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 05:21:17PM +0900, Fujii Masao wrote: > This change causes every ending backends to always take the exclusive lock > even when it's not in SyncRep queue. This may be problematic, for example, > when terminating multiple backends at the same time? If yes, > it might be better to check SHMQueueIsDetached() again after taking the lock. > That is, > > if (!SHMQueueIsDetached(&(MyProc->syncRepLinks))) > { > LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > if (!SHMQueueIsDetached(&(MyProc->syncRepLinks))) > SHMQueueDelete(&(MyProc->syncRepLinks)); > LWLockRelease(SyncRepLock); > } Makes sense. Thanks for the suggestion. -- Michael signature.asc Description: PGP signature
Re: Unix-domain socket support on Windows
To move this topic a long, I'll submit some preparatory patches in a committable order. First is the patch to deal with getpeereid() that was already included in the previous patch series. This is just some refactoring that reduces the difference between Windows and other platforms and prepares the Unix-domain socket specific code to compile cleanly on Windows. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From adcad11992f4fde0a1eda4f96fef4abeb8570cdc Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 30 Oct 2019 12:58:32 +0100 Subject: [PATCH v4] Sort out getpeereid() and peer auth handling on Windows The getpeereid() uses have so far been protected by HAVE_UNIX_SOCKETS, so they didn't ever care about Windows support. But in anticipation of Unix-domain socket support on Windows, that needs to be handled differently. Windows doesn't support getpeereid() at this time, so we use the existing not-supported code path. We let configure do its usual thing of picking up the replacement from libpgport, instead of the custom overrides that it was doing before. But then Windows doesn't have struct passwd, so this patch sprinkles some additional #ifdef WIN32 around to make it work. This is similar to existing code that deals with this issue. 
--- configure | 36 +-- configure.in | 11 -- src/backend/libpq/auth.c | 17 +++ src/include/port.h| 2 +- src/interfaces/libpq/fe-connect.c | 10 ++--- src/tools/msvc/Mkvcbuild.pm | 2 +- 6 files changed, 36 insertions(+), 42 deletions(-) diff --git a/configure b/configure index 6b1c779ee3..5d4998d9e1 100755 --- a/configure +++ b/configure @@ -15743,6 +15743,19 @@ esac fi +ac_fn_c_check_func "$LINENO" "getpeereid" "ac_cv_func_getpeereid" +if test "x$ac_cv_func_getpeereid" = xyes; then : + $as_echo "#define HAVE_GETPEEREID 1" >>confdefs.h + +else + case " $LIBOBJS " in + *" getpeereid.$ac_objext "* ) ;; + *) LIBOBJS="$LIBOBJS getpeereid.$ac_objext" + ;; +esac + +fi + ac_fn_c_check_func "$LINENO" "getrusage" "ac_cv_func_getrusage" if test "x$ac_cv_func_getrusage" = xyes; then : $as_echo "#define HAVE_GETRUSAGE 1" >>confdefs.h @@ -15921,19 +15934,13 @@ $as_echo "$as_me: On $host_os we will use our strtof wrapper." >&6;} esac case $host_os in - # Windows uses a specialised env handler -# and doesn't need a replacement getpeereid because it doesn't use -# Unix sockets. 
mingw*) $as_echo "#define HAVE_UNSETENV 1" >>confdefs.h - -$as_echo "#define HAVE_GETPEEREID 1" >>confdefs.h - ac_cv_func_unsetenv=yes -ac_cv_func_getpeereid=yes;; +;; *) ac_fn_c_check_func "$LINENO" "unsetenv" "ac_cv_func_unsetenv" if test "x$ac_cv_func_unsetenv" = xyes; then : @@ -15948,21 +15955,8 @@ esac fi -ac_fn_c_check_func "$LINENO" "getpeereid" "ac_cv_func_getpeereid" -if test "x$ac_cv_func_getpeereid" = xyes; then : - $as_echo "#define HAVE_GETPEEREID 1" >>confdefs.h -else - case " $LIBOBJS " in - *" getpeereid.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS getpeereid.$ac_objext" - ;; -esac - -fi - - - ;; +;; esac # System's version of getaddrinfo(), if any, may be used only if we found diff --git a/configure.in b/configure.in index 2b9025cac3..ebbeea8b2b 100644 --- a/configure.in +++ b/configure.in @@ -1718,6 +1718,7 @@ AC_REPLACE_FUNCS(m4_normalize([ explicit_bzero fls getopt + getpeereid getrusage inet_aton mkdtemp @@ -1746,18 +1747,14 @@ case $host_os in esac case $host_os in - # Windows uses a specialised env handler -# and doesn't need a replacement getpeereid because it doesn't use -# Unix sockets. mingw*) AC_DEFINE(HAVE_UNSETENV, 1, [Define to 1 because replacement version used.]) -AC_DEFINE(HAVE_GETPEEREID, 1, [Define to 1 because function not required.]) ac_cv_func_unsetenv=yes -ac_cv_func_getpeereid=yes;; +;; *) -AC_REPLACE_FUNCS([unsetenv getpeereid]) - ;; +AC_REPLACE_FUNCS([unsetenv]) +;; esac # System's version of getaddrinfo(), if any, may be used only if we found diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index d28271c1d8..0bafb83557 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -78,9 +78,7 @@ static intident_inet(hbaPort *port); * Peer authentication * */ -#ifdef HAVE_UNIX_SOCKETS static int auth_peer(hbaPort *port); -#endif /* @@ -559,11 +557,7
Re: Remove one use of IDENT_USERNAME_MAX
On 2019-10-29 15:34, Tom Lane wrote: Peter Eisentraut writes: On 2019-10-28 14:45, Tom Lane wrote: Kyotaro Horiguchi writes: In think one of the reasons for the coding is the fact that *pw is described to be placed in the static area, which can be overwritten by succeeding calls to getpw*() functions. Good point ... so maybe pstrdup instead of using a fixed-size buffer? Maybe. Or we just decide that check_usermap() is not allowed to call getpw*(). It's just a string-matching routine, so it doesn't have any such business anyway. I'm okay with that as long as you add a comment describing this assumption. Committed with a pstrdup(). That seemed more consistent with other code in that file. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - extend initialization phase control
On Mon, Oct 28, 2019 at 10:36 PM Fabien COELHO wrote: > > > Hello Masao-san, > > >> Maybe. If you cannot check, you can only guess. Probably it should be > >> small, but the current version does not allow to check whether it is so. > > > > Could you elaborate what you actually want to measure the performance > > impact by adding explicit begin and commit? Currently pgbench -i issues > > the following queries. The data generation part is already executed within > > single transaction. You want to execute not only data generation but also > > drop/creation of tables within single transaction, and measure how much > > performance impact happens? I'm sure that would be negligible. > > Or you want to execute data generate in multiple transactions, i.e., > > execute each statement for data generation (e.g., one INSERT) in single > > transaction, and then want to measure the performance impact? > > But the patch doesn't enable us to do such data generation yet. > > Indeed, you cannot do this precise thing, but you can do others. > > > So I'm thinking that it's maybe better to commit the addtion of "G" option > > first separately. And then we can discuss how much "(" and ")" options > > are useful later. > > Attached patch v6 only provides G - server side data generation. Thanks for the patch! + snprintf(sql, sizeof(sql), + "insert into pgbench_branches(bid,bbalance) " + "select bid, 0 " + "from generate_series(1, %d) as bid", scale); "scale" should be "nbranches * scale". + snprintf(sql, sizeof(sql), + "insert into pgbench_accounts(aid,bid,abalance,filler) " + "select aid, (aid - 1) / %d + 1, 0, '' " + "from generate_series(1, %d) as aid", naccounts, scale * naccounts); Like client-side data generation, INT64_FORMAT should be used here instead of %d? If large scale factor is specified, the query for generating pgbench_accounts data can take a very long time. While that query is running, operators may be likely to do Ctrl-C to cancel the data generation. 
In this case, IMO pgbench should cancel the query, i.e., call PQcancel(). Otherwise, the query will keep running to the end. - for (step = initialize_steps; *step != '\0'; step++) + for (const char *step = initialize_steps; *step != '\0'; step++) Per PostgreSQL basic coding style, ISTM that "const char *step" should be declared separately from "for" loop, like the original. - fprintf(stderr, "unrecognized initialization step \"%c\"\n", + fprintf(stderr, + "unrecognized initialization step \"%c\"\n" + "Allowed step characters are: \"" ALL_INIT_STEPS "\".\n", *step); - fprintf(stderr, "allowed steps are: \"d\", \"t\", \"g\", \"v\", \"p\", \"f\"\n"); The original message seems better to me. So what about just appending "G" into the above latter message? That is, "allowed steps are: \"d\", \"t\", \"g\", \"G\", \"v\", \"p\", \"f\"\n" - g (Generate data) + g or G (Generate data, client or server side) Isn't it better to explain a bit more what "client-side / server-side data generation" is? For example, something like When "g" (client-side data generation) is specified, data is generated in pgbench client and sent to the server. When "G" (server-side data generation) is specified, only queries are sent from pgbench client and then data is generated in the server. If the network bandwidth is low between pgbench and the server, using "G" might make the data generation faster. Regards, -- Fujii Masao
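[Editor's note: for context, initialization steps are selected with pgbench's -I option. A run using the server-side generation step discussed above might look like this hypothetical invocation -- the G step exists only with the patch applied; the remaining letters are existing steps.]

```
# drop tables, create tables, server-side data generation,
# vacuum, primary keys
pgbench -i -I dtGvp -s 100 testdb
```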
Re: v12.0: ERROR: could not find pathkey item to sort
Thanks for taking a look and sorry about the delay in replying. On Fri, Oct 25, 2019 at 1:51 AM Tom Lane wrote: > Amit Langote writes: > > On Mon, Oct 14, 2019 at 11:54 PM Tom Lane wrote: > >> In view of the proposed patches being dependent on some other > >> 13-only changes, I wonder if we should fix v12 by reverting > >> d25ea0127. The potential planner performance loss for large > >> partition sets could be nasty, but failing to plan at all is worse. > > > Actually, the patch I proposed to fix equivalence code can be applied > > on its own. The example I gave on that thread needs us to fix > > partitionwise code to even work, which is perhaps a 13-only change, > > but we have an example here that is broken due to d25ea01275. > > Perhaps, considering applying my patch seems better than alternatives > > which are either reverting d25ea01275 or avoiding doing partitionwise > > join for such queries. > > I looked at this a bit, and I see that the core idea is to generate > the missing EC members by applying add_child_rel_equivalences when > building child join rels. Perhaps we can make that work, but this > patch fails to, because you've ignored the comment pointing out > that the nullable_relids fixup logic only works for baserels: > > * And likewise for nullable_relids. Note this code assumes > * parent and child relids are singletons. > > We could improve that to work for joinrels, I think, but it would become > very much more complicated (and slower). AFAICS the logic would have > to be "iterate through top_parent_relids, see which ones are in > em_nullable_relids, and for each one that is, find the corresponding > child_relid and substitute that in new_nullable_relids". This is a > bit of a problem because we don't have any really convenient way to > map individual top parent relids to child relids. Actually, there is adjust_child_relids_multilevel() which translates the top parent relids in em_nullable_relids to child relids. 
I have updated the patches that way. > I wonder if we > should extend AppendRelInfo to include the top parent relid as well as > the immediate parent. (Perhaps, while we're at it, we could make > adjust_appendrel_attrs_multilevel less of an inefficient and > underdocumented mess.) Hmm, I agree we should try to fix that situation somehow. Even better if we could do away with child expressions in ECs, because they cause EC related code to show up in profiles when executing complex queries with thousands of partitions. > Also, I'm pretty sure this addition is wrong/broken: > > + > +/* > + * There aren't going to be more expressions to translate in > + * the same EC. > + */ > +break; > > What makes you think that an EC can only contain one entry per rel? I was wrong about that. Fixed. > More generally, as long as this patch requires changing > add_child_rel_equivalences' API anyway, I wonder if we should > rethink that altogether. Looking at the code now, I realize that > d25ea01275 resulted in horribly bastardizing that function's API, > because the parent_rel and appinfo arguments are only consulted in > some cases, while in other cases we disregard them and rely on > child_rel->top_parent_relids to figure out how to translate stuff. > It would be possible to make the argument list be just (root, child_rel) > and always rely on child_rel->top_parent_relids. Actually, as of 3373c71553, add_child_rel_equivalences() looks at parent_rel->eclass_indexes to look up ECs, so maybe we can't take out parent_rel. > At the very least, > if we keep the extra arguments, we should document them as being just > optimizations. For common cases that doesn't involve multi-level partitioning, it really helps to have the appinfos be supplied by the caller because they're already available. I've added a comment at the top about that. Attached updated patches. Thanks, Amit d25ea01275-fixup-HEAD-v2.patch Description: Binary data d25ea01275-fixup-PG12-v2.patch Description: Binary data
Re: [HACKERS] Block level parallel vacuum
On Mon, Oct 28, 2019 at 3:50 PM Amit Kapila wrote:
>
> On Sun, Oct 27, 2019 at 12:52 PM Dilip Kumar wrote:
> >
> > On Fri, Oct 25, 2019 at 9:19 PM Masahiko Sawada wrote:
> > >
> > I haven't yet read the new set of the patch. But I have noticed one
> > thing: we are getting the size of the statistics using the AM
> > routine, but we are copying those statistics from local memory to
> > shared memory directly using memcpy. Wouldn't it be a good idea to
> > have an AM-specific routine to copy them from local memory to shared
> > memory? I am not sure whether it is worth it, but my thought behind
> > this point is that it would allow the AM to keep its local stats in
> > any form (e.g. it could store a pointer in them) and serialize them
> > while copying to the shared stats. Later, when the shared stats are
> > passed back to the AM, it can deserialize them into its local form
> > and use them.
> >
> You have a point, but after changing the gist index, we don't have any
> current usage for indexes that need something like that. So, on one
> side there is some value in having an API to copy the stats, but on
> the other side, without a clear use for such an API, it might not be
> good to expose a new one. I think we can expose such an API in the
> future if there is a need for it. Do you or anyone know of any
> external IndexAM that has such a need?
>
> Few minor comments while glancing through the latest patchset.
>
> 1. I think you can merge the 0001*, 0002*, 0003* patches into one patch as
> all three expose a new variable/function from IndexAmRoutine.

Fixed.

>
> 2.
> +prepare_index_statistics(LVShared *lvshared, Relation *Irel, int nindexes)
> +{
> +	char *p = (char *) GetSharedIndStats(lvshared);
> +	int vac_work_mem = IsAutoVacuumWorkerProcess() &&
> +		autovacuum_work_mem != -1 ?
> +		autovacuum_work_mem : maintenance_work_mem;
>
> I think this function won't be called from AutoVacuumWorkerProcess, at
> least not as of now, so isn't it a better idea to have an Assert for
> it?

Fixed.

>
> 3.
> +void
> +heap_parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
>
> This function is for performing a parallel operation on the index, so
> why start the name with heap?

Because parallel vacuum supports only indexes that are created on heaps.

> It is better to name it as
> index_parallel_vacuum_main or simply parallel_vacuum_main.

I'm concerned that both names, index_parallel_vacuum_main and parallel_vacuum_main, seem too generic even though this code is heap-specific.

>
> 4.
> /* useindex = true means two-pass strategy; false means one-pass */
> @@ -128,17 +280,12 @@ typedef struct LVRelStats
> 	BlockNumber pages_removed;
> 	double tuples_deleted;
> 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
> -	/* List of TIDs of tuples we intend to delete */
> -	/* NB: this list is ordered by TID address */
> -	int num_dead_tuples; /* current # of entries */
> -	int max_dead_tuples; /* # slots allocated in array */
> -	ItemPointer dead_tuples; /* array of ItemPointerData */
> +	LVDeadTuples *dead_tuples;
> 	int num_index_scans;
> 	TransactionId latestRemovedXid;
> 	bool lock_waiter_detected;
> } LVRelStats;
>
> -
> /* A few variables that don't seem worth passing around as parameters */
> static int elevel = -1;
>
> It seems like a spurious line removal.

Fixed.

The above comments are incorporated in the latest patch set (v32) [1].

[1] https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com

Regards,

--
Masahiko Sawada
Re: Problem with synchronous replication
Hello.

At Wed, 30 Oct 2019 17:21:17 +0900, Fujii Masao wrote in
> This change causes every exiting backend to always take the exclusive lock
> even when it's not in the SyncRep queue. This may be problematic, for example,
> when terminating multiple backends at the same time? If yes,
> it might be better to check SHMQueueIsDetached() again after taking the lock.
> That is,

I'm not sure how much harm the extra lock acquisition does, but double-checked locking (releasing) is simple enough and reduces the possible congestion here, I think. In short, +1 for that.

> if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
> {
>     LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
>     if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
>         SHMQueueDelete(&(MyProc->syncRepLinks));
>     LWLockRelease(SyncRepLock);
> }

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Proposal: Global Index
A global index, by definition, is a single index on the parent table that maps to many underlying table partitions. The parent table itself does not have any underlying storage, so it must retrieve the data satisfying the index constraints from the underlying tables. In very crude terms, it is an accumulation of data from the table partitions, so that data spanning multiple partitions is accessed in one go, as opposed to querying each partition individually. For the initial version of this work, we are only considering building B-tree global indexes.

- Partitioned Index (Index Partitioning)

When global indexes become too large, they are partitioned to keep the performance and maintenance overhead manageable. These are not within the scope of this work.

- Local Index

A local index is an index that is local to a specific table partition; i.e. it doesn't span multiple partitions. So, when we create an index on a parent table, it will create a separate index on each of its partitions. Unfortunately, PostgreSQL uses the terminology "partitioned index" when it refers to local indexes. This work will fix this terminology in PostgreSQL so that the nomenclature remains consistent with other DBMSs.

- Why Do We Need a Global Index?

A global index is expected to give two very important upgrades to the partitioning feature set in PostgreSQL. It is expected to give a significant improvement in read performance for queries targeting multiple local indexes of partitions. It also adds a unique constraint across partitions.

- Unique Constraint

Data uniqueness is a critical requirement for building an index. For global indexes that span multiple partitions, uniqueness will have to be enforced on the index column(s). This effectively translates into a unique constraint.

- Performance

Currently, the pseudo index created on the parent table of partitions does not contain any data. Rather, it dereferences to the local indexes when an index search is required.
This means that multiple indexes have to be evaluated and the data combined thereafter. However, with global indexes, the data resides in the global index declared on the parent table. This avoids the need for multi-level index lookups, so read performance is expected to be significantly higher in many cases. There will, however, be a negative performance impact during writes (insert/update) of data. This is discussed in more detail later on.

- Creating a GLOBAL Index

- Syntax

A global index may be created with the addition of a "GLOBAL" keyword to the index statement. Alternatively, one could specify the "LOCAL" keyword to create local indexes on partitions. We suggest calling this set of keywords "partition_index_type". By default, partition_index_type will be set to LOCAL. Here is a sample of the create index syntax:

CREATE INDEX idx ON parent (columns) [GLOBAL | LOCAL];

Note: adding these options introduces no shift/reduce conflicts in the grammar.

- Pointing the Index to a Tuple

Currently, a CTID carries page and offset information for a known heap (table name). However, in the context of global indexes, this information within an index is insufficient. Since the index is expected to carry tuples from multiple partitions (heaps), the CTID alone will not be able to link an index node to a tuple. This requires additional data identifying the heap to be stored with each index node. How this should be implemented is a point to be discussed. A few possibilities are listed below:

-- Expand the CTID to include a relfilenode id. At PostgreSQL Conf Asia, Bruce suggested having the OID instead of the relfilenode, as relfilenodes can be duplicated across tablespaces.

-- Using the OID introduces another complication: we would need to query the catalog for the OID-to-heap mapping.

-- The second option is to have a variable-length CTID. We can reserve some top-level bits to distinguish a global CTID from a standard CTID. Robert Haas suggested at PostgreSQL EU discussing this with Peter Geoghegan.
-- I discussed it with Peter, and he believes that it is a very invasive approach that would require a whole lot of effort to get committed.

-- Heikki suggested including the heap-specific information using the INCLUDE keyword, so that the heap information is stored with each index node as data.

-- We (Peter and I) also discussed that option, and it looks like an easier and less invasive option.

- Optimizer

The challenge for the optimizer is the choice between local and global indexes when both are present. How do we:

-- Evaluate the cost of scanning a global index?

-- Decide when the LOCAL index should be preferred over the GLOBAL index and vice versa?

-- Decide whether to hit a GLOBAL index when the query targets only a couple of partitions?

-- We need to consider the sizes of the partitions being hit and the sizes of the partitions not being hit.

-- Bruce suggested that we prioritize a GLOBAL index in the first version so that in every case, the GLOBAL
Re: Problem with synchronous replication
On Wed, Oct 30, 2019 at 4:16 PM lingce.ldm wrote:
>
> On Oct 29, 2019, at 18:50, Kyotaro Horiguchi wrote:
> >
> > Hello.
> >
> > At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" wrote in
> > > Hi,
> > >
> > > I recently discovered two possible bugs about synchronous replication.
> > >
> > > 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
> > > SyncRepCleanupAtProcExit first checks whether the queue is detached; if it
> > > is not detached, it acquires the SyncRepLock lock and deletes it. If this
> > > element has already been deleted by the walsender, it will be deleted
> > > again, and SHMQueueDelete will crash with a segmentation fault.
> > >
> > > IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then
> > > check whether the queue is detached or not.
> >
> > I think you're right here.

This change causes every exiting backend to always take the exclusive lock even when it's not in the SyncRep queue. This may be problematic, for example, when terminating multiple backends at the same time? If yes, it might be better to check SHMQueueIsDetached() again after taking the lock. That is,

    if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
    {
        LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
        if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
            SHMQueueDelete(&(MyProc->syncRepLinks));
        LWLockRelease(SyncRepLock);
    }

Regards,

--
Fujii Masao
Re: Join Correlation Name
Bonjour Vik,

> Is quoting the spec good enough?
>
> SQL:2016 Part 2 Foundation, Section 7.10 <joined table>:
>
> <named columns join> ::=
>     USING <left paren> <join column list> <right paren>
>     [ AS <correlation name> ]
>
> <join column list> ::=
>     <column name list>

Ah, this is the one piece of information I did not have when reviewing Peter's patch.

> I think possibly what the spec says (and that neither my patch nor
> Peter's implements) is assigning the alias just to the
> <join column list>.

I think you are right: the alias applies only to the common columns. That solves the issue I raised about inaccessible attributes, and explains why it is only available with USING and no other join variants.

> So my original example query should actually be:
>
> SELECT a.x, b.y, j.z FROM a INNER JOIN b USING (z) AS j;

Yep, only z should be in j; it is really just about the USING clause.

--
Fabien.
Re: tableam vs. TOAST
Hi All,

While testing the Toast patch (PG + v7 patch) I found the server crash below.

System configuration:
VCPUs: 4, RAM: 8GB, Storage: 320GB

This issue is not frequently reproducible; we need to repeat the same testcase multiple times.

CREATE OR REPLACE FUNCTION toast_chunks_cnt_func(p1 IN text)
RETURNS int AS $$
DECLARE
  chunks_cnt int;
  v_tbl text;
BEGIN
  SELECT reltoastrelid::regclass INTO v_tbl FROM pg_class WHERE RELNAME = p1;
  EXECUTE 'SELECT count(*) FROM ' || v_tbl::regclass INTO chunks_cnt;
  RETURN chunks_cnt;
END;
$$ LANGUAGE PLPGSQL;

-- Server crash after multiple runs of the testcase below --
CHECKPOINT;
CREATE TABLE toast_tab (c1 text);
\d+ toast_tab
-- ALTER table column c1 for storage as "EXTERNAL" to make sure that the column value is pushed to the TOAST table but not COMPRESSED.
ALTER TABLE toast_tab ALTER COLUMN c1 SET STORAGE EXTERNAL;
\d+ toast_tab
\timing
INSERT INTO toast_tab ( select repeat('a', 20) from generate_series(1,4) x);
\timing
SELECT reltoastrelid::regclass FROM pg_class WHERE RELNAME = 'toast_tab';
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 9,10) FROM toast_tab;
CHECKPOINT;
\timing
UPDATE toast_tab SET c1 = UPPER(c1);
\timing
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 9,10) FROM toast_tab;
DROP TABLE toast_tab;
-- --

Stacktrace as below:

[centos@host-192-168-1-249 bin]$ gdb -q -c data2/core.3151 postgres
Reading symbols from /home/centos/PG/PGsrc/postgresql/inst/bin/postgres...done.
[New LWP 3151]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer '.
Program terminated with signal 6, Aborted.
#0 0x7f2267d33207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-13.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x7f2267d33207 in raise () from /lib64/libc.so.6
#1 0x7f2267d348f8 in abort () from /lib64/libc.so.6
#2 0x00eb3a80 in errfinish (dummy=0) at elog.c:552
#3 0x00c26530 in ProcessSyncRequests () at sync.c:393
#4 0x00bbbc57 in CheckPointBuffers (flags=256) at bufmgr.c:2589
#5 0x00604634 in CheckPointGuts (checkPointRedo=51448358328, flags=256) at xlog.c:8992
#6 0x00603b5e in CreateCheckPoint (flags=256) at xlog.c:8781
#7 0x00aed8fa in CheckpointerMain () at checkpointer.c:481
#8 0x006240de in AuxiliaryProcessMain (argc=2, argv=0x7ffe887c0880) at bootstrap.c:461
#9 0x00b0e834 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5414
#10 0x00b09283 in reaper (postgres_signal_arg=17) at postmaster.c:2995
#11
#12 0x7f2267df1f53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00b05000 in ServerLoop () at postmaster.c:1682
#14 0x00b0457b in PostmasterMain (argc=5, argv=0x349bce0) at postmaster.c:1391
#15 0x00971c9f in main (argc=5, argv=0x349bce0) at main.c:210
(gdb)

On Sat, Oct 5, 2019 at 12:03 AM Robert Haas wrote:
> On Fri, Sep 6, 2019 at 10:59 AM Robert Haas wrote:
> > On Thu, Sep 5, 2019 at 4:07 PM Andres Freund wrote:
> > > Yea, makes sense to me.
> >
> > OK, done. Here's the remaining patches again, with a slight update to
> > the renaming patch (now 0002). In the last version, I renamed
> > toast_insert_or_update to heap_toast_insert_or_update but did not
> > rename toast_delete to heap_toast_delete. Actually, I'm not seeing
> > any particular reason not to go ahead and push the renaming patch at
> > this point also.
>
> And, hearing no objections, done.
> > Here's the last patch back, rebased over that renaming. Although I > think that Andres (and Tom) are probably right that there's room for > improvement here, I currently don't see a way around the issues I > wrote about in > http://postgr.es/m/ca+tgmoa0zfcacpojcsspollgpztvfsyvcvb-uss8yokzmo5...@mail.gmail.com > -- so not quite sure where to go next. Hopefully Andres or someone > else will give me a quick whack with the cluebat if I'm missing > something obvious. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > -- With Regards, Prabhat Kumar Sahu Skype ID: prabhat.sahu1984 EnterpriseDB Software India Pvt. Ltd. The Postgres Database Company
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Oct 30, 2019 at 9:38 AM vignesh C wrote:
> I have noticed one more problem in the logic of setting the logical
> decoding work mem from the create subscription command. Suppose the
> subscription command does not specify the work mem; then it sends a
> garbage value to the walsender, and the walsender overwrites its value
> with that garbage value.

After investigating a bit I have found the reason for this.

@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
 	appendStringInfo(&cmd, "proto_version '%u'",
 					 options->proto.logical.proto_version);
+	appendStringInfo(&cmd, ", work_mem '%d'",
+					 options->proto.logical.work_mem);

I think the problem is that we unconditionally send work_mem as part of the CREATE REPLICATION SLOT command, without checking whether it's valid or not.

--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
 	sub->name = pstrdup(NameStr(subform->subname));
 	sub->owner = subform->subowner;
 	sub->enabled = subform->subenabled;
+	sub->workmem = subform->subworkmem;

Another problem is that there is no handling for the case where subform->subworkmem is NULL.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Add SQL function to show total block numbers in the relation
Hello,

I propose a new simple SQL function that shows the total number of blocks in a relation.

I am now reviewing this patch (https://commitfest.postgresql.org/25/2211/), and I think it is useful to know how many blocks there are in a relation to determine whether to use VACUUM RESUME or not. Of course, we can get this value with something like

select (pg_relation_size('t') / current_setting('block_size')::bigint)::int;

but I think that is a little complex.

Comments and feedback are very welcome.

Regards,
Yu Kimura

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28eb322f3f..99834e7286 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21290,6 +21290,9 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 pg_relation_size
+
+pg_relation_block_number
+
 pg_size_bytes
@@ -21363,6 +21366,15 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 Shorthand for pg_relation_size(..., 'main')
+
+
+pg_relation_block_number(relation regclass)
+
+ bigint
+
+Shorthand for pg_relation_block_number(..., 'main')
+
+
 pg_size_bytes(text)
@@ -21504,6 +21516,11 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
+
+pg_relation_block_number accepts the OID or name of a table
+and returns the number of blocks of that relation.
+
+ pg_size_pretty can be used to format the result of one of the other functions in a human-readable way, using bytes, kB, MB, GB or TB
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index a87e7214e9..2462d65570 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -23,6 +23,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
+#include "storage/bufmgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/numeric.h"
@@ -335,6 +336,25 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(size);
 }
+Datum
+pg_relation_block_number(PG_FUNCTION_ARGS)
+{
+	Oid			relOid = PG_GETARG_OID(0);
+	Relation	rel;
+	int64		size;
+
+	rel = try_relation_open(relOid, AccessShareLock);
+
+	if (rel == NULL)
+		PG_RETURN_NULL();
+
+	size = RelationGetNumberOfBlocks(rel);
+
+	relation_close(rel, AccessShareLock);
+
+	PG_RETURN_INT64(size);
+}
+
 /*
  * Calculate total on-disk size of a TOAST relation, including its indexes.
  * Must not be applied to non-TOAST relations.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 58ea5b982b..b68a523d3b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6929,6 +6929,10 @@
 descr => 'disk space usage for the specified fork of a table or index',
 proname => 'pg_relation_size', provolatile => 'v', prorettype => 'int8',
 proargtypes => 'regclass text', prosrc => 'pg_relation_size' },
+{ oid => '2228',
+  descr => 'total block number for the main fork of the specified table or index',
+  proname => 'pg_relation_block_number', provolatile => 'v', prorettype => 'int8',
+  proargtypes => 'regclass', prosrc => 'pg_relation_block_number' },
 { oid => '2286',
 descr => 'total disk space usage for the specified table and associated indexes',
 proname => 'pg_total_relation_size', provolatile => 'v', prorettype => 'int8',
Re: RFC: split OBJS lines to one object per line
On Tue, Oct 29, 2019 at 11:32:09PM -0700, Andres Freund wrote:
> Cool. Any opinion on whether to go for
>
> OBJS = \
> 	dest.o \
> 	fastpath.o \
> 	...
>
> or
>
> OBJS = dest.o \
> 	fastpath.o \
> 	...
>
> I'm mildly inclined to go for the former.

FWIW, I am more used to the latter, but the former is also fine by me.

--
Michael

signature.asc
Description: PGP signature
Re: Problem with synchronous replication
On Oct 30, 2019, at 09:45, Michael Paquier wrote:
>
> On Tue, Oct 29, 2019 at 07:50:01PM +0900, Kyotaro Horiguchi wrote:
>> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" wrote in
>>> I recently discovered two possible bugs about synchronous replication.
>>>
>>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>>> SyncRepCleanupAtProcExit first checks whether the queue is detached; if it
>>> is not detached, it acquires the SyncRepLock lock and deletes it. If this
>>> element has been deleted by the walsender, it will be deleted again, and
>>> SHMQueueDelete will crash with a segmentation fault.
>>>
>>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then
>>> check whether the queue is detached or not.
>>
>> I think you're right here.
>
> Indeed. Looking at the surroundings, we expect some code paths to hold
> SyncRepLock in exclusive or shared mode, but we don't actually check
> that the lock is held. So let's add some assertions while we're at it.
>
>> This is not right. It is in transaction commit so it is in a
>> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
>> cancel/die interrupts, thus the ereport should return.
>
> Yeah. There is an easy way to check after that: InterruptHoldoffCount
> needs to be strictly positive.
>
> My suggestions are attached. Any thoughts?

Thanks, this patch looks good to me.

smime.p7s
Description: S/MIME cryptographic signature
Re: Problem with synchronous replication
On Oct 29, 2019, at 18:50, Kyotaro Horiguchi wrote:
>
> Hello.
>
> At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" wrote in
>>
>> Hi,
>>
>> I recently discovered two possible bugs about synchronous replication.
>>
>> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
>> SyncRepCleanupAtProcExit first checks whether the queue is detached; if it
>> is not detached, it acquires the SyncRepLock lock and deletes it. If this
>> element has been deleted by the walsender, it will be deleted again, and
>> SHMQueueDelete will crash with a segmentation fault.
>>
>> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then
>> check whether the queue is detached or not.
>
> I think you're right here.

Thanks.

>
>> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport detects an
>> interrupt.
>> In SyncRepWaitForLSN, if a query cancel interrupt arrives, we just
>> terminate the wait with a suitable warning, as follows:
>>
>> a. set QueryCancelPending to false
>> b. ereport outputs one warning
>> c. call SyncRepCancelWait to delete one element from the queue
>>
>> If another cancel interrupt arrives while we are outputting the warning at
>> step b, errfinish will call CHECK_FOR_INTERRUPTS, which will output an
>> ERROR such as "canceling autovacuum task", and the process will jump to
>> the sigsetjmp. Unfortunately, step c will then be skipped and the element
>> that should have been deleted by SyncRepCancelWait remains.
>>
>> The easiest way to fix this is to swap the order of steps b and c. On the
>> other hand, letting sigsetjmp clean up the queue may also be a good
>> choice. What do you think?
>>
>> Attached the patch, any feedback is greatly appreciated.
>
> This is not right. It is in transaction commit so it is in a
> HOLD_INTERRUPTS section. ProcessInterrupt does not respond to
> cancel/die interrupts, thus the ereport should return.
I read the relevant code; you are right. I had called SyncRepWaitForLSN somewhere else, but forgot to put it inside a HOLD_INTERRUPTS section, which triggered an exception.

regards.

--
Dongming Liu

smime.p7s
Description: S/MIME cryptographic signature
Re: RFC: split OBJS lines to one object per line
Andres Freund writes:
> On 2019-10-29 16:31:11 -0400, Tom Lane wrote:
>> We did something similar not too long ago in configure.in (bfa6c5a0c),
>> and it seems to have helped. +1

> Cool. Any opinion on whether to go for ...

Not here.

regards, tom lane
Re: RFC: split OBJS lines to one object per line
Hi,

On 2019-10-29 16:31:11 -0400, Tom Lane wrote:
> Andres Freund writes:
> > one of the most frequent conflicts I see is that two patches add files
> > to OBJS (or one of its other spellings), and there are conflicts because
> > another file has been added.
> > ...
> > Now, obviously these types of conflicts are easy enough to resolve, but
> > it's still annoying. It seems that this would be substantially less
> > often a problem if we just split such lines to one file per line.
>
> We did something similar not too long ago in configure.in (bfa6c5a0c),
> and it seems to have helped. +1

Cool. Any opinion on whether to go for

OBJS = \
	dest.o \
	fastpath.o \
	...

or

OBJS = dest.o \
	fastpath.o \
	...

I'm mildly inclined to go for the former.

Greetings,

Andres Freund
Re: [HACKERS] Block level parallel vacuum
On Tue, Oct 29, 2019 at 3:11 PM Dilip Kumar wrote: > > On Tue, Oct 29, 2019 at 1:59 PM Masahiko Sawada wrote: > > > > On Tue, Oct 29, 2019 at 4:06 PM Masahiko Sawada > > wrote: > > > > > > On Mon, Oct 28, 2019 at 2:13 PM Dilip Kumar wrote: > > > > > > > > On Thu, Oct 24, 2019 at 4:33 PM Dilip Kumar > > > > wrote: > > > > > > > > > > On Thu, Oct 24, 2019 at 4:21 PM Amit Kapila > > > > > wrote: > > > > > > > > > > > > On Thu, Oct 24, 2019 at 11:51 AM Dilip Kumar > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Oct 18, 2019 at 12:18 PM Dilip Kumar > > > > > > > wrote: > > > > > > > > > > > > > > > > On Fri, Oct 18, 2019 at 11:25 AM Amit Kapila > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > I am thinking if we can write the patch for both the > > > > > > > > > approaches (a. > > > > > > > > > compute shared costs and try to delay based on that, b. try > > > > > > > > > to divide > > > > > > > > > the I/O cost among workers as described in the email > > > > > > > > > above[1]) and do > > > > > > > > > some tests to see the behavior of throttling, that might help > > > > > > > > > us in > > > > > > > > > deciding what is the best strategy to solve this problem, if > > > > > > > > > any. > > > > > > > > > What do you think? > > > > > > > > > > > > > > > > I agree with this idea. I can come up with a POC patch for > > > > > > > > approach > > > > > > > > (b). Meanwhile, if someone is interested to quickly hack with > > > > > > > > the > > > > > > > > approach (a) then we can do some testing and compare. > > > > > > > > Sawada-san, > > > > > > > > by any chance will you be interested to write POC with approach > > > > > > > > (a)? > > > > > > > > Otherwise, I will try to write it after finishing the first one > > > > > > > > (approach b). > > > > > > > > > > > > > > > I have come up with the POC for approach (a). 
> > > > > > > > > > Can we compute the overall throttling (sleep time) in the operation > > > > > > separately for heap and index, then divide the index's sleep_time > > > > > > with > > > > > > a number of workers and add it to heap's sleep time? Then, it will > > > > > > be > > > > > > a bit easier to compare the data between parallel and non-parallel > > > > > > case. > > > > I have come up with a patch to compute the total delay during the > > > > vacuum. So the idea of computing the total cost delay is > > > > > > > > Total cost delay = Total dealy of heap scan + Total dealy of > > > > index/worker; Patch is attached for the same. > > > > > > > > I have prepared this patch on the latest patch of the parallel > > > > vacuum[1]. I have also rebased the patch for the approach [b] for > > > > dividing the vacuum cost limit and done some testing for computing the > > > > I/O throttling. Attached patches 0001-POC-compute-total-cost-delay > > > > and 0002-POC-divide-vacuum-cost-limit can be applied on top of > > > > v31-0005-Add-paralell-P-option-to-vacuumdb-command.patch. I haven't > > > > rebased on top of v31-0006, because v31-0006 is implementing the I/O > > > > throttling with one approach and 0002-POC-divide-vacuum-cost-limit is > > > > doing the same with another approach. But, > > > > 0001-POC-compute-total-cost-delay can be applied on top of v31-0006 as > > > > well (just 1-2 lines conflict). > > > > > > > > Testing: I have performed 2 tests, one with the same size indexes and > > > > second with the different size indexes and measured total I/O delay > > > > with the attached patch. 
> > > >
> > > > Setup:
> > > > VacuumCostDelay=10ms
> > > > VacuumCostLimit=2000
> > > >
> > > > Test1 (Same size index):
> > > > create table test(a int, b varchar, c varchar);
> > > > create index idx1 on test(a);
> > > > create index idx2 on test(b);
> > > > create index idx3 on test(c);
> > > > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > > > generate_series(1,50) as i;
> > > > delete from test where a < 20;
> > > >
> > > >                Vacuum (Head)   Parallel Vacuum   Vacuum Cost Divide Patch
> > > > Total Delay    1784 (ms)       1398 (ms)         1938 (ms)
> > > >
> > > > Test2 (Variable size dead tuple in index)
> > > > create table test(a int, b varchar, c varchar);
> > > > create index idx1 on test(a);
> > > > create index idx2 on test(b) where a > 10;
> > > > create index idx3 on test(c) where a > 15;
> > > >
> > > > insert into test select i, repeat('a',30)||i, repeat('a',20)||i from
> > > > generate_series(1,50) as i;
> > > > delete from test where a < 20;
> > > >
> > > >                Vacuum (Head)   Parallel Vacuum   Vacuum Cost Divide Patch
> > > > Total Delay    1438 (ms)       1029 (ms)         1529 (ms)
> > > >
> > > > Conclusion:
> > > > 1. The tests prove that the total I/O delay is significantly less with the
Re: pg_waldump erroneously outputs newline for FPWs, and another minor bug
Hi, On 2019-10-29 16:33:41 -0700, Andres Freund wrote: > Hi, > > When using -b, --bkp-details pg_waldump outputs an unnecessary newline > for blocks that contain an FPW. > > In --bkp-details block references are output on their own lines, like: > > rmgr: SPGist len (rec/tot): 4348/ 4348, tx:980, lsn: > 0/01985818, prev 0/01983850, desc: PICKSPLIT ndel 92; nins 93 > blkref #0: rel 1663/16384/16967 fork main blk 3 > blkref #1: rel 1663/16384/16967 fork main blk 6 > blkref #2: rel 1663/16384/16967 fork main blk 5 > blkref #3: rel 1663/16384/16967 fork main blk 1 > rmgr: Heaplen (rec/tot): 69/69, tx:980, lsn: > 0/01986930, prev 0/01985818, desc: INSERT off 2 flags 0x00 > blkref #0: rel 1663/16384/16961 fork main blk 1 > > but unfortunately, when there's actually an FPW present, it looks like: > > rmgr: XLOGlen (rec/tot): 75/ 11199, tx:977, lsn: > 0/019755E0, prev 0/0194EDD8, desc: FPI > blkref #0: rel 1663/16384/16960 fork main blk 32 (FPW); hole: offset: > 548, length: 4484 > > blkref #1: rel 1663/16384/16960 fork main blk 33 (FPW); hole: offset: > 548, length: 4484 > > blkref #2: rel 1663/16384/16960 fork main blk 34 (FPW); hole: offset: > 548, length: 4484 > > rmgr: Heaplen (rec/tot):188/ 188, tx:977, lsn: > 0/019781D0, prev 0/019755E0, desc: INPLACE off 23 > > which clearly seems unnecessary. Looking at the code it seems to me that > > static void > XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) > { > ... > printf("\tblkref #%u: rel %u/%u/%u fork %s blk %u", >block_id, >rnode.spcNode, rnode.dbNode, rnode.relNode, >forkNames[forknum], >blk); > if (XLogRecHasBlockImage(record, block_id)) > { > if (record->blocks[block_id].bimg_info & > BKPIMAGE_IS_COMPRESSED) > { > printf(" (FPW%s); hole: offset: %u, length: %u, " >"compression saved: %u\n", >XLogRecBlockImageApply(record, block_id) ? 
>"" : " for WAL verification", >record->blocks[block_id].hole_offset, >record->blocks[block_id].hole_length, >BLCKSZ - >record->blocks[block_id].hole_length - >record->blocks[block_id].bimg_len); > } > else > { > printf(" (FPW%s); hole: offset: %u, length: %u\n", >XLogRecBlockImageApply(record, block_id) ? >"" : " for WAL verification", >record->blocks[block_id].hole_offset, >record->blocks[block_id].hole_length); > } > } > putchar('\n'); > > was intended to not actually print a newline in the printfs in the if > preceding the putchar. > > This is a fairly longstanding bug, introduced in: > > commit 2c03216d831160bedd72d45f712601b6f7d03f1c > Author: Heikki Linnakangas > Date: 2014-11-20 17:56:26 +0200 > > Revamp the WAL record format. > > > Does anybody have an opinion about fixing it just in master or also > backpatching it? I guess there could be people having written parsers > for the waldump output? I'm inclined to backpatch. > > > I also find a second minor bug: > > static void > XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) > { > ... > const char *id; > ... > id = desc->rm_identify(info); > if (id == NULL) > id = psprintf("UNKNOWN (%x)", info & ~XLR_INFO_MASK); > ... > printf("desc: %s ", id); > > after that "id" is not referenced anymore. Which means we would leak > memory if there were a lot of UNKNOWN records. This is from > commit 604f7956b946019dd37bd3baea24cb669a47 > Author: Andres Freund > Date: 2014-09-22 16:48:14 +0200 > > Improve code around the recently added rm_identify rmgr callback. > > While not a lot of memory, it's not absurd to run pg_waldump against a > large amount of WAL, so backpatching seems mildly advised. > > I'm inlined to think that the best fix is to just move the relevant code > to the callsite, and not psprintf'ing into a temporary buffer. We'd need > additional state to free the memory, as rm_identify returns a static > buffer. 
>
> So I'll make it
>
> id = desc->rm_identify(info);
> if (id == NULL)
>     printf("desc: UNKNOWN (%x) ", info & ~XLR_INFO_MASK);
> else
>     printf("desc: %s ", id);

Pushed fixes for these.

Greetings,

Andres Freund