Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Hi Alvaro,

Studying the patch to understand how it works.

On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera wrote:
> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

+	/*
+	 * Concurrent mode has to work harder; first we add a new constraint to the
+	 * partition that matches the partition constraint.  The reason for this is
+	 * that the planner may have made optimizations that depend on the
+	 * constraint.  XXX Isn't it sufficient to invalidate the partition's
+	 * relcache entry?
...
+	/* Add constraint on partition, equivalent to the partition constraint */
+	n = makeNode(Constraint);
+	n->contype = CONSTR_CHECK;
+	n->conname = NULL;
+	n->location = -1;
+	n->is_no_inherit = false;
+	n->raw_expr = NULL;
+	n->cooked_expr = nodeToString(make_ands_explicit(RelationGetPartitionQual(partRel)));
+	n->initially_valid = true;
+	n->skip_validation = true;
+	/* It's a re-add, since it nominally already exists */
+	ATAddCheckConstraint(wqueue, tab, partRel, n,
+						 true, false, true, ShareUpdateExclusiveLock);

I suspect that we don't really need this defensive constraint.  I mean,
even after committing the 1st transaction, the partition being detached
still has relispartition set to true and RelationGetPartitionQual()
still returns the partition constraint, so it seems the constraint being
added above is duplicative.

Moreover, the constraint is not removed as part of any cleanup after the
DETACH process, so repeated attach/detach of the same partition results
in the constraints piling up:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);

\d foo2
                Table "public.foo2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition of: foo FOR VALUES FROM (2) TO (3)
Check constraints:
    "foo2_a_check" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)
    "foo2_a_check1" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)

Noticed a bug/typo in the patched RelationBuildPartitionDesc():

-	inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+	inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
+										include_detached);

You're passing NoLock for include_detached, which means you never
actually end up including detached partitions from here.  I think it is
due to this bug that the partition-concurrent-attach test fails in my
run.

Also, the error seen in the following hunk of the
detach-partition-concurrently-1 test is not intentional:

+starting permutation: s1brr s1prep s1s s2d s1s s1exec2 s1c
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f(int) AS INSERT INTO d_listp VALUES ($1);
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2 CONCURRENTLY;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+step s1exec2: EXECUTE f(2); DEALLOCATE f;
+step s2d: <... completed>
+error in steps s1exec2 s2d: ERROR:  no partition of relation "d_listp" found for row
+step s1c: COMMIT;

As you're intending to make the executor always *include* detached
partitions, the insert should be able to route (2) to d_listp2, the
detached partition.  It's the bug mentioned above that causes the
executor's partition descriptor build to falsely fail to include the
detached partition.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
RE: Use appendStringInfoString and appendPQExpBufferStr where possible
Hi

I made a slight update to the patch.

> > > I found some more places that should use appendPQExpBufferStr instead
> > > of appendPQExpBuffer.
> > >
> > > Here is the patch.
> >
> > Seems like a good idea.  Please add it to the next commitfest.
>
> Thanks, added it to the next commitfest.
> https://commitfest.postgresql.org/30/2735/

Best regards,
houzj

0001-Latest-appendStringInfoString-instead-of-appendStringInfo.patch
Description: 0001-Latest-appendStringInfoString-instead-of-appendStringInfo.patch

0001-Latest-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch
Description: 0001-Latest-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch
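To illustrate the kind of substitution these patches make, here is a
minimal generic example (not a hunk from the patches themselves): when
the appended text is a fixed string with no % escapes, the ...Str
variant skips the vsnprintf formatting pass entirely, which is both
cheaper and safer if the literal ever contains a stray %.  The same
reasoning applies to appendStringInfo vs. appendStringInfoString in
backend code.

#include "pqexpbuffer.h"

/* Hypothetical helper, for illustration only. */
static void
build_query(PQExpBuffer buf)
{
	/* Before: format machinery invoked although nothing is formatted */
	appendPQExpBuffer(buf, "SELECT relname FROM pg_class ");

	/* After: same output, cheaper call, no format-string hazards */
	appendPQExpBufferStr(buf, "WHERE relkind = 'r'");
}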
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> I revised the patch based on my understanding of Horiguchi-san's comment,
> but I could be wrong.
> Quoting:
>
> "
> +		/* Get the number of blocks for the supplied relation's fork */
> +		nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> +		Assert(BlockNumberIsValid(nblocks));
> +
> +		if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
>
> As mentioned upthread, the criteria whether we do full-scan or
> lookup-drop is how large a portion of NBuffers this relation-drop is
> going to invalidate.  So the nblocks above should be the sum of the
> number of blocks to be truncated (not just the total number of blocks)
> of all designated forks.  Then once we decide to do the lookup-drop
> method, we do that for all forks."

One takeaway from Horiguchi-san's comment is to use the number of blocks
to invalidate for comparison, instead of all blocks in the fork.  That
is, use

    nblocks = smgrnblocks(fork) - firstDelBlock[fork];

Does this make sense?

What do you think is the reason for summing up all forks?  I didn't
understand why.  Typically, FSM and VM forks are very small.  If the
main fork is larger than NBuffers / 500, then v14 scans the entire
shared buffers for the FSM and VM forks as well as the main fork,
resulting in three scans in total.

Also, if you want to judge the criteria based on the total blocks of all
forks, the following if should be placed outside the for loop, right?
Because this if condition doesn't change inside the for loop.

+		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+			BlockNumberIsValid(nblocks))
+		{

> > (2)
> > +		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> > +			BlockNumberIsValid(nblocks))
> > +		{
> >
> > The division by NBuffers is not necessary, because both sides of < are
> > numbers of blocks.
>
> Again, I based it on my understanding of the comment above,
> so nblocks is the sum of all blocks to be truncated for all forks.

But the left expression of "<" is a percentage, while the right one is a
block count.  Two different units are compared.

> > Why is the BlockNumberIsValid(nblocks) call needed?
>
> I thought we need to ensure that nblocks is not invalid, so I also
> added it.

When is it invalid?  smgrnblocks() seems to always return a valid block
number.  Am I seeing a different source code (I saw HEAD)?

Regards
Takayuki Tsunakawa
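For what it's worth, the alternative being suggested here -- summing
only the blocks each fork is about to lose, rather than each fork's full
size -- could look roughly like this.  A sketch only: the identifiers
(smgr_reln, forkNum, firstDelBlock, BUF_DROP_FULLSCAN_THRESHOLD) follow
the patch under discussion, and the surrounding loop is illustrative.

	BlockNumber	nBlocksToInvalidate = 0;
	int			i;

	/*
	 * Sum, over all designated forks, only the tail of each fork that
	 * the truncation will actually drop.
	 */
	for (i = 0; i < nforks; i++)
	{
		BlockNumber nblocks = smgrnblocks(smgr_reln, forkNum[i]);

		nBlocksToInvalidate += nblocks - firstDelBlock[i];
	}

	if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
	{
		/* look up and drop the affected buffers individually */
	}
	else
	{
		/* fall back to scanning all of shared buffers */
	}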
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Sep 23, 2020 at 8:04 AM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Amit Kapila
> > > > > Don't
> > > > > we need to guarantee the cache to be valid while recovery?
> > > > >
> > > >
> > > > One possibility could be that we somehow detect that the value we
> > > > are using is a cached one and if so then only do this optimization.
> > >
> > > I basically like this direction. But I'm not sure the additional
> > > parameter for smgrnblocks is acceptable.
> > >
> > > But on the contrary, it might be a better design that
> > > DropRelFileNodeBuffers gives up the optimization when
> > > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> > >
> >
> > I haven't thought about what is the best way to achieve this. Let us see if
> > Tsunakawa-San or Kirk-San has other ideas on this?
>
> I see no need for smgrnblocks() to add an argument as it returns the correct
> cached or measured value.
>

The idea is that we can't use this optimization if the value is not
cached because we can't rely on lseek behavior. See all the discussion
between Horiguchi-San and me in the thread above. So, how would you
ensure that if we don't use Kirk-San's proposal?

--
With Regards,
Amit Kapila.
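To make the disagreement concrete, the proposal being referenced could
take roughly this shape.  This is a hypothetical sketch mirroring the
smgrnblocks(,,must_accurate = true) idea quoted above; the extra
parameter does not exist in current smgrnblocks().

	/*
	 * Hypothetical: ask smgrnblocks() to answer from its cache only,
	 * returning InvalidBlockNumber rather than probing the file with
	 * lseek(), whose result we cannot rely on here.
	 */
	BlockNumber nblocks = smgrnblocks(smgr_reln, forknum,
									  true /* must_accurate */ );

	if (nblocks == InvalidBlockNumber)
	{
		/* not cached: give up the optimization, scan all of shared buffers */
	}
	else
	{
		/* safe to drop only the buffers in [firstDelBlock, nblocks) */
	}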
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Sep 23, 2020 at 7:56 AM tsunakawa.ta...@fujitsu.com wrote:
>
> (3)
> 	if (reln->smgr_cached_nblocks[forknum] == blocknum)
> 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
> 	else
> +	{
> +		/*
> +		 * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> +		 * won't be invalidated by file extension while recovering.
> +		 */
> +		Assert(!InRecovery);
> 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +	}
>
> I think this change is not directly related to this patch and can be a
> separate patch, but I want to leave the decision up to a committer.
>

We have added this mainly for testing purposes; basically this assertion
should not fail during the regression tests. We can keep it in a
separate patch but need to ensure that. If this fails then we can't rely
on the caching behaviour during recovery which is actually required for
the correctness of the patch.

--
With Regards,
Amit Kapila.
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Sep 22, 2020 at 5:18 PM Ajin Cherian wrote:
>
> On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar wrote:
>
> > 3.
> >
> > + /*
> > + * If it's ROLLBACK PREPARED then handle it via callbacks.
> > + */
> > + if (TransactionIdIsValid(xid) &&
> > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> > + parsed->dbId == ctx->slot->data.database &&
> > + !FilterByOrigin(ctx, origin_id) &&
> > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> > + {
> > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> > + commit_time, origin_id, origin_lsn,
> > + parsed->twophase_gid, false);
> > + return;
> > + }
> >
> >
> > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> > so if those are not true then we wouldn't have prepared this
> > transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do
> > we need to recheck these conditions?
>
> I didn't change this, as I am seeing cases where the Abort is getting
> called for transactions that need to be skipped. I also see that the
> same check is there both in DecodePrepare and DecodeCommit.
> So, while the same transactions were not getting prepared or
> committed, it tries to get ROLLBACK PREPARED (as part of finish
> prepared handling). The check in ReorderBufferTxnIsPrepared() is
> also not proper.
>

If the transaction is prepared, which you can ensure via
ReorderBufferTxnIsPrepared() (considering you have a proper check in
that function), it should not require skipping the transaction in
Abort. One way it could happen is if you clean up the ReorderBufferTxn
in Prepare, which you were doing in an earlier version of the patch and
which I pointed out was wrong. If you have changed that, then I don't
know why it could fail; maybe someplace else during prepare the patch
is freeing it. Just check that.

> I will need to relook
> at this logic again in a future patch.
>

No problem. I think you can handle the other comments and then we can
come back to this, and you might want to share the exact details of the
test (maybe a narrowed-down version of the original test) and I or
someone else might be able to help you with that.

--
With Regards,
Amit Kapila.
Re: Removing unneeded self joins
On Fri, 3 Apr 2020 at 17:43, Andrey Lepikhov wrote:
> v.23 in attachment:

Hi,

This is only a partial review, as I see the patch still does not
incorporate the self-join detection method I mentioned in March 2019,
and the method the patch does use only partially works. Here's the
partial review which contains the details:

1. Change to aset.c not relevant to this patch.

--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -1485,7 +1485,6 @@ AllocSetCheck(MemoryContext context)
 			chsize = chunk->size;	/* aligned chunk size */
 			dsize = chunk->requested_size;	/* real data */
-
 			/*
 			 * Check chunk size
 			 */

2. Why GUC_NOT_IN_SAMPLE?

+	{"enable_self_join_removal", PGC_USERSET, QUERY_TUNING_METHOD,
+		gettext_noop("Enable removal of unique self-joins."),
+		NULL,
+		GUC_NO_SHOW_ALL | GUC_NOT_IN_SAMPLE

3. I'd expect this to remove the join:

create table t1 (
  a int not null unique,
  b int not null unique,
  c int not null unique,
  d int not null,
  e int not null
);

explain select * from t1 inner join t1 t2 on t1.a=t2.a and t1.b=t2.c;
                             QUERY PLAN
--------------------------------------------------------------------
 Hash Join  (cost=52.50..88.42 rows=1 width=40)
   Hash Cond: ((t1.a = t2.a) AND (t1.b = t2.c))
   ->  Seq Scan on t1  (cost=0.00..27.00 rows=1700 width=20)
   ->  Hash  (cost=27.00..27.00 rows=1700 width=20)
         ->  Seq Scan on t1 t2  (cost=0.00..27.00 rows=1700 width=20)
(5 rows)

This one *does* seem to work:

explain select * from t1 inner join t1 t2 on t1.a=t2.a and t1.d=t2.e;

You should likely implement the method I wrote in the final paragraph
in [1] (see the sketch after this message).

The basic idea here is you first process the join quals:

t1.a=t2.a and t1.b=t2.c

and keep just the ones that use the same varattno on either side
(t1.a=t2.a). Perhaps it's not worth thinking about Exprs for now, or if
you think it is, you can normalise the varnos to 1 and equal().

Then just pass the remaining quals to relation_has_unique_index_for().
If it returns true then there's no need to perform the opposite check
for the other relation. It would just return the same thing.

This will allow you to get rid of using the following as proofs:

/* We must have the same unique index for both relations. */
if (outerinfo->index->indexoid != innerinfo->index->indexoid)
	return false;

... as I've shown above. This only works in some cases and that's not
really good enough. Doing things the way I describe will allow you to
get rid of all the UniqueRelInfo stuff.

4. Should be ok for partitioned tables though:

/*
 * This optimization won't work for tables that have inheritance
 * children.
 */
if (rte->inh)
	continue;

David

[1] https://www.postgresql.org/message-id/CAKJS1f8p-KiEujr12k-oa52JNWWaQUjEjNg%2Bo1MGZk4mHBn_Rg%40mail.gmail.com
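A rough sketch of the flow described in point 3, for readers following
along.  The helper is_same_attno_equality() is assumed (it does not
exist); which subset of quals feeds the uniqueness proof follows the
description above; this is illustrative, not working planner code.

	List	   *selfquals = NIL;	/* quals like t1.a = t2.a */
	List	   *otherquals = NIL;	/* everything else, e.g. t1.b = t2.c */
	ListCell   *lc;

	foreach(lc, joinquals)
	{
		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);

		/* assumed helper: same varattno on both sides of an equality */
		if (is_same_attno_equality(rinfo))
			selfquals = lappend(selfquals, rinfo);
		else
			otherquals = lappend(otherquals, rinfo);
	}

	/*
	 * One call is enough: for a self join, running the same test
	 * against the other side must produce the same answer.
	 */
	if (selfquals != NIL &&
		relation_has_unique_index_for(root, rel, selfquals, NIL, NIL))
	{
		/* the join is provably a unique self join and can be removed */
	}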
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, September 23, 2020 11:26 AM, Tsunakawa, Takayuki wrote:
> I looked at v14.

Thank you for checking it!

> (1)
> +		/* Get the total number of blocks for the supplied relation's fork */
> +		for (j = 0; j < nforks; j++)
> +		{
> +			BlockNumber	block = smgrnblocks(smgr_reln, forkNum[j]);
> +			nblocks += block;
> +		}
>
> Why do you sum all forks?

I revised the patch based on my understanding of Horiguchi-san's
comment, but I could be wrong.
Quoting:

"
+		/* Get the number of blocks for the supplied relation's fork */
+		nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+		Assert(BlockNumberIsValid(nblocks));
+
+		if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criteria whether we do full-scan or
lookup-drop is how large a portion of NBuffers this relation-drop is
going to invalidate.  So the nblocks above should be the sum of the
number of blocks to be truncated (not just the total number of blocks)
of all designated forks.  Then once we decide to do the lookup-drop
method, we do that for all forks."

> (2)
> +		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> +			BlockNumberIsValid(nblocks))
> +		{
>
> The division by NBuffers is not necessary, because both sides of < are
> numbers of blocks.

Again, I based it on my understanding of the comment above,
so nblocks is the sum of all blocks to be truncated for all forks.

> Why is the BlockNumberIsValid(nblocks) call needed?

I thought we need to ensure that nblocks is not invalid, so I also
added it.

> (3)
> 	if (reln->smgr_cached_nblocks[forknum] == blocknum)
> 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
> 	else
> +	{
> +		/*
> +		 * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> +		 * won't be invalidated by file extension while recovering.
> +		 */
> +		Assert(!InRecovery);
> 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +	}
>
> I think this change is not directly related to this patch and can be a
> separate patch, but I want to leave the decision up to a committer.
>

This is noted. Once we have clarified the above comments, I'll put it in
a separate patch if it's necessary.

Thank you very much for the reviews.

Best regards,
Kirk Jamison
Re: Syncing pg_multixact directories
On Wed, Sep 23, 2020 at 4:09 PM Michael Paquier wrote:
> On Tue, Sep 22, 2020 at 07:05:36PM -0700, Andres Freund wrote:
> > On 2020-09-23 13:45:51 +1200, Thomas Munro wrote:
> >> I think we should be ensuring that directory entries for newly created
> >> multixact files are durable at checkpoint time.  Please see attached.
> >
> > Good catch!  That should probably be backpatched...
>
> +1.  Passing that down to the SLRU layer is a nice thing to do.  Were
> you planning to send a second patch here?  The commit log generated
> mentions patch 1/2.

Oh, that's just because I also have another patch, for master only, to
go on top, but that's in another thread about SLRU fsync offloading.
Sorry for the confusion.
Re: Syncing pg_multixact directories
On Tue, Sep 22, 2020 at 07:05:36PM -0700, Andres Freund wrote:
> On 2020-09-23 13:45:51 +1200, Thomas Munro wrote:
>> I think we should be ensuring that directory entries for newly created
>> multixact files are durable at checkpoint time.  Please see attached.
>
> Good catch!  That should probably be backpatched...

+1.  Passing that down to the SLRU layer is a nice thing to do.  Were
you planning to send a second patch here?  The commit log generated
mentions patch 1/2.
--
Michael

signature.asc
Description: PGP signature
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Wed, Sep 23, 2020 at 2:27 PM David Rowley wrote:
> I've gone as far as running the recovery tests on the v3-0001 patch
> using a Windows machine. They pass:

Thanks!  I pushed that one, because it was effectively a bug fix
(WaitLatch() without a latch was supposed to work).

I'll wait longer for feedback on the main patch; perhaps someone has a
better idea, or wants to take issue with the magic number 1024 (ie the
limit on how many records we'll replay before we notice the postmaster
has exited), or my plan to harmonise those wait loops?  It has a CF
entry for the next CF.
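For readers following along, the rate-limiting idea behind the magic
number can be pictured like this.  A sketch only, not the committed
code; the constant and helper name here are illustrative, and only
PostmasterIsAlive() is a real function.

#define POLL_RATE_LIMIT 1024	/* the magic number under discussion */

/*
 * On builds without a postmaster-death signal, check for postmaster
 * exit at most once per POLL_RATE_LIMIT replayed records, bounding the
 * cost of the check while still noticing postmaster death promptly.
 */
static bool
ShouldContinueRecovery(void)
{
	static uint32 records_since_check = 0;

	if (++records_since_check >= POLL_RATE_LIMIT)
	{
		records_since_check = 0;
		if (!PostmasterIsAlive())
			return false;		/* postmaster died; stop replaying */
	}
	return true;
}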
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> > > > Don't
> > > > we need to guarantee the cache to be valid while recovery?
> > > >
> > >
> > > One possibility could be that we somehow detect that the value we
> > > are using is a cached one and if so then only do this optimization.
> >
> > I basically like this direction. But I'm not sure the additional
> > parameter for smgrnblocks is acceptable.
> >
> > But on the contrary, it might be a better design that
> > DropRelFileNodeBuffers gives up the optimization when
> > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> >
>
> I haven't thought about what is the best way to achieve this. Let us see if
> Tsunakawa-San or Kirk-San has other ideas on this?

I see no need for smgrnblocks() to add an argument as it returns the
correct cached or measured value.

Regards
Takayuki Tsunakawa
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Sun, 20 Sep 2020 at 09:29, Thomas Munro wrote:
>
> Although I know from CI that this builds and passes "make check" on
> Windows, I'm hoping to attract some review of the 0001 patch from a
> Windows person, and confirmation that it passes "check-world" (or at
> least src/test/recovery check) there, because I don't have CI scripts
> for that yet.

I've gone as far as running the recovery tests on the v3-0001 patch
using a Windows machine. They pass:

L:\Projects\Postgres\d\src\tools\msvc>vcregress taptest src/test/recovery
...
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. ok
t/005_replay_delay.pl ................ ok
t/006_logical_decoding.pl ............ ok
t/007_sync_rep.pl .................... ok
t/008_fsm_truncation.pl .............. ok
t/009_twophase.pl .................... ok
t/010_logical_decoding_timelines.pl .. ok
t/011_crash_recovery.pl .............. skipped: Test fails on Windows perl
t/012_subtransactions.pl ............. ok
t/013_crash_restart.pl ............... ok
t/014_unlogged_reinit.pl ............. ok
t/015_promotion_pages.pl ............. ok
t/016_min_consistency.pl ............. ok
t/017_shm.pl ......................... skipped: SysV shared memory not supported by this platform
t/018_wal_optimize.pl ................ ok
t/019_replslot_limit.pl .............. ok
t/020_archive_status.pl .............. ok
All tests successful.
Files=20, Tests=222, 397 wallclock secs ( 0.08 usr +  0.00 sys =  0.08 CPU)
Result: PASS

L:\Projects\Postgres\d\src\tools\msvc>git diff --stat
 src/backend/storage/ipc/latch.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

David
RE: [Patch] Optimize dropping of relation buffers using dlist
I looked at v14.

(1)
+		/* Get the total number of blocks for the supplied relation's fork */
+		for (j = 0; j < nforks; j++)
+		{
+			BlockNumber	block = smgrnblocks(smgr_reln, forkNum[j]);
+			nblocks += block;
+		}

Why do you sum all forks?

(2)
+		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+			BlockNumberIsValid(nblocks))
+		{

The division by NBuffers is not necessary, because both sides of < are
numbers of blocks.
Why is the BlockNumberIsValid(nblocks) call needed?

(3)
	if (reln->smgr_cached_nblocks[forknum] == blocknum)
		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
	else
+	{
+		/*
+		 * DropRelFileNodeBuffers relies on the behavior that cached nblocks
+		 * won't be invalidated by file extension while recovering.
+		 */
+		Assert(!InRecovery);
		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+	}

I think this change is not directly related to this patch and can be a
separate patch, but I want to leave the decision up to a committer.

Regards
Takayuki Tsunakawa
Re: Command statistics system (cmdstats)
On Wed, Sep 23, 2020 at 10:41:10AM +0900, Michael Paquier wrote:
> I think that this remark is a bit unfair.  When it comes to any patch
> in the commit fest, and I have managed a couple of them over the
> years, I have tried to keep a neutral view for all of them, meaning
> that I only let the numbers speak by themselves, mostly:
> - When has a patch lastly been updated.
> - What's the current state in the CF app
>
> If the patch had no reviews and still in "Needs review", simply switch
> it to "waiting on author".

Sorry, forgot to mention here that this is in the case where a patch
does not apply anymore or that the CF bot complains, requiring a check
of the patch from the author.

> If the patch has been waiting on author for more than two weeks, it
> would get marked as RwF at the end of the CF.  There are also cases
> where the status of the patch is incorrect, mostly that a patch
> waiting on author because of a review was not updated as such in the
> CF app.  In this case, and in the middle of a CF, this would get get
> marked as Rwf, but the author would get a reminder of that.

Last sentence here is incorrect as well: s/get get/not get/, and the
author would be reminded to update his/her patch.
--
Michael

signature.asc
Description: PGP signature
Re: Syncing pg_multixact directories
Hi,

On 2020-09-23 13:45:51 +1200, Thomas Munro wrote:
> I think we should be ensuring that directory entries for newly created
> multixact files are durable at checkpoint time.  Please see attached.

Good catch!  That should probably be backpatched...

Greetings,

Andres Freund
Re: Missing TOAST table for pg_class
On Tue, Sep 22, 2020 at 05:35:48PM -0400, Tom Lane wrote:
> What exactly do you argue has changed since the previous decision
> that should cause us to change it?  In particular, where is the
> additional data to change our minds about the safety of such a thing?

Not sure that's safe, as we also want to avoid circular dependencies
similarly for pg_class, pg_index and pg_attribute.
--
Michael

signature.asc
Description: PGP signature
Re: Handing off SLRU fsyncs to the checkpointer
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro wrote:
> On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro wrote:
> > While scanning for comments and identifier names that needed updating,
> > I realised that this patch changed the behaviour of the ShutdownXXX()
> > functions, since they currently flush the SLRUs but are not followed
> > by a checkpoint.  I'm not entirely sure I understand the logic of
> > that, but it wasn't my intention to change it.  So here's a version
> > that converts the existing fsync_fname() to fsync_fname_recurse() to
>
> Bleugh, that was probably a bad idea, it's too expensive.  But it
> forces me to ask the question: *why* do we need to call
> Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
> shutdown checkpoint?  I wondered if this might date from before the
> WAL, but I see that the pattern was introduced when the CLOG was moved
> out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
> but even in that commit the preceding CreateCheckPoint() call included
> a call to CheckPointCLOG().

I complained about the apparently missing multixact fsync in a new
thread, because if I'm right about that it requires a back-patchable
fix.  As for the ShutdownXXX() functions, I haven't yet come up with
any reason for this code to exist.  Emboldened by a colleague's
inability to explain to me what that code is doing for us, here is a
new version that just rips it all out.

From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH v6 1/2] Fix missing fsync of multixact directories.

Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.

Back-patch to all supported releases.
---
 src/backend/access/transam/clog.c      | 7 -------
 src/backend/access/transam/commit_ts.c | 6 ------
 src/backend/access/transam/slru.c      | 7 +++++++
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
 	/* Flush dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
 	SimpleLruFlush(XactCtl, true);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
 {
 	/* Flush dirty CommitTs pages to disk */
 	SimpleLruFlush(CommitTsCtl, true);
-
-	/*
-	 * fsync pg_commit_ts to ensure that any files flushed previously are
-	 * durably on disk.
-	 */
-	fsync_fname("pg_commit_ts", true);
 }

 /*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 	}
 	if (!ok)
 		SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+	/*
+	 * Make sure that the directory entries for any newly created files are on
+	 * disk.
+	 */
+	if (ctl->do_fsync)
+		fsync_fname(ctl->Dir, true);
 }

 /*
--
2.20.1

From ae51604c43e10ba6d24b628c2b907eca251177e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v6 2/2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests
for the same file into a single system call as part of the next
checkpoint, as we already did for relation files, using the
infrastructure developed by commit 3eb77eba.  This causes a significant
improvement to recovery performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +-------
 src/backend/access/transam/slru.c      | 154 +++++++++++++++---------
 src/backend/access/transam/subtrans.c  |  25 +---
Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration
On Tue, Sep 22, 2020 at 11:45:14PM +0200, Peter Eisentraut wrote:
> However, I still think the integer type use is a bit inconsistent.  In both
> cases, using strtoul() and dealing with unsigned integer types between
> parsing and final use would be more consistent.

No objections to that either, so changed this way, though keeping those
variables signed would have been fine too, as applying values of 2B~4B
is not really going to matter much here ;p
--
Michael

diff --git a/src/bin/pg_test_fsync/.gitignore b/src/bin/pg_test_fsync/.gitignore
index f3b5932498..5eb5085f45 100644
--- a/src/bin/pg_test_fsync/.gitignore
+++ b/src/bin/pg_test_fsync/.gitignore
@@ -1 +1,3 @@
 /pg_test_fsync
+
+/tmp_check/
diff --git a/src/bin/pg_test_fsync/Makefile b/src/bin/pg_test_fsync/Makefile
index 7632c94eb7..c4f9ae0664 100644
--- a/src/bin/pg_test_fsync/Makefile
+++ b/src/bin/pg_test_fsync/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'

+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_fsync$(X)'
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 6e47293123..7119ed0512 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -5,6 +5,7 @@

 #include "postgres_fe.h"

+#include <limits.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
@@ -62,7 +63,7 @@ do { \

 static const char *progname;

-static int	secs_per_test = 5;
+static uint32	secs_per_test = 5;
 static int	needs_unlink = 0;
 static char full_buf[DEFAULT_XLOG_SEG_SIZE],
 		   *buf,
@@ -148,6 +149,8 @@ handle_args(int argc, char *argv[])
 	int			option;			/* Command line option */
 	int			optindex = 0;	/* used by getopt_long */
+	unsigned long optval;		/* used for option parsing */
+	char	   *endptr;

 	if (argc > 1)
 	{
@@ -173,7 +176,24 @@ handle_args(int argc, char *argv[])
 				break;

 			case 's':
-				secs_per_test = atoi(optarg);
+				errno = 0;
+				optval = strtoul(optarg, &endptr, 10);
+
+				if (endptr == optarg || *endptr != '\0' ||
+					errno != 0 || optval != (uint32) optval)
+				{
+					pg_log_error("invalid argument for option %s", "--secs-per-test");
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+
+				secs_per_test = (uint32) optval;
+				if (secs_per_test == 0)
+				{
+					pg_log_error("%s must be in range %u..%u",
+								 "--secs-per-test", 1, UINT_MAX);
+					exit(1);
+				}
 				break;

 			default:
@@ -193,8 +213,8 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}

-	printf(ngettext("%d second per test\n",
-					"%d seconds per test\n",
+	printf(ngettext("%u second per test\n",
+					"%u seconds per test\n",
 					secs_per_test),
 		   secs_per_test);
 #if PG_O_DIRECT != 0
diff --git a/src/bin/pg_test_fsync/t/001_basic.pl b/src/bin/pg_test_fsync/t/001_basic.pl
new file mode 100644
index 0000000000..fe9c295c49
--- /dev/null
+++ b/src/bin/pg_test_fsync/t/001_basic.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 12;
+
+#################################################
+# Basic checks
+
+program_help_ok('pg_test_fsync');
+program_version_ok('pg_test_fsync');
+program_options_handling_ok('pg_test_fsync');
+
+#################################################
+# Test invalid option combinations
+
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', 'a' ],
+	qr/\Qpg_test_fsync: error: invalid argument for option --secs-per-test\E/,
+	'pg_test_fsync: invalid argument for option --secs-per-test');
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', '0' ],
+	qr/\Qpg_test_fsync: error: --secs-per-test must be in range 1..4294967295\E/,
+	'pg_test_fsync: --secs-per-test must be in range');
diff --git a/src/bin/pg_test_timing/.gitignore b/src/bin/pg_test_timing/.gitignore
index f6c664c765..e5aac2ab12 100644
--- a/src/bin/pg_test_timing/.gitignore
+++ b/src/bin/pg_test_timing/.gitignore
@@ -1 +1,3 @@
 /pg_test_timing
+
+/tmp_check/
diff --git a/src/bin/pg_test_timing/Makefile b/src/bin/pg_test_timing/Makefile
index 334d6ff5c0..52994b4103 100644
--- a/src/bin/pg_test_timing/Makefile
+++ b/src/bin/pg_test_timing/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'

+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_timing$(X)'
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e14802372b..c878b19035 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -6,15 +6,17 @@

 #include "postgres_fe.h"

+#include <limits.h>
+
 #include "getopt_long.h"
 #include "portability/instr_time.h"

 static const char *progname;

-static int32 test_duration = 3;
+static uint32 test_duration = 3;

 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(int32);
+static uint64
Syncing pg_multixact directories
Hello hackers,

I think we should be ensuring that directory entries for newly created
multixact files are durable at checkpoint time.  Please see attached.

From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH 1/2] Fix missing fsync of multixact directories.

Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.

Back-patch to all supported releases.
---
 src/backend/access/transam/clog.c      | 7 -------
 src/backend/access/transam/commit_ts.c | 6 ------
 src/backend/access/transam/slru.c      | 7 +++++++
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
 	/* Flush dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
 	SimpleLruFlush(XactCtl, true);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
 {
 	/* Flush dirty CommitTs pages to disk */
 	SimpleLruFlush(CommitTsCtl, true);
-
-	/*
-	 * fsync pg_commit_ts to ensure that any files flushed previously are
-	 * durably on disk.
-	 */
-	fsync_fname("pg_commit_ts", true);
 }

 /*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 	}
 	if (!ok)
 		SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+	/*
+	 * Make sure that the directory entries for any newly created files are on
+	 * disk.
+	 */
+	if (ctl->do_fsync)
+		fsync_fname(ctl->Dir, true);
 }

 /*
--
2.20.1
Re: Command statistics system (cmdstats)
On Mon, Sep 21, 2020 at 10:41:46AM -0300, Alvaro Herrera wrote:
> "A couple of weeks" of inactivity is not sufficient, in my view, to boot
> a patch out of the commitfest process.  Whenever the patch is
> resurrected, it will be a new entry which won't have the history that it
> had accumulated in the long time since it was created -- which biases
> it against other new patches.

Any of the patches that have been marked as RwF last week have been
waiting on author since at least the end of the commit fest of July,
with zero actions taken on them.  In my experience, this points
strongly to the fact that these are unlikely to be worked on actively
in a short time frame.

> I'm not really arguing about this one patch only, but more generally
> about the process you're following.  The problem is that you as CFM are
> imposing your personal priorities on the whole process by punting
> patches ahead of time -- you are saying that nobody else should be
> looking at updating this patch because it is now closed.  It seems fair
> to do so at the end of the commitfest, but it does not seem fair to do
> it when it's still mid-cycle.

I think that this remark is a bit unfair.  When it comes to any patch
in the commit fest, and I have managed a couple of them over the
years, I have tried to keep a neutral view for all of them, meaning
that I only let the numbers speak by themselves, mostly:
- When has a patch lastly been updated.
- What's the current state in the CF app

If the patch had no reviews and still in "Needs review", simply switch
it to "waiting on author".  If the patch has been waiting on author for
more than two weeks, it would get marked as RwF at the end of the CF.
There are also cases where the status of the patch is incorrect, mostly
that a patch waiting on author because of a review was not updated as
such in the CF app.  In this case, and in the middle of a CF, this
would get get marked as Rwf, but the author would get a reminder of
that.

Any of the patches that have been discarded by me last week were
waiting in the stack for much, much, longer than that, and I think that
it does not help reviewers and committers to keep this stuff longer
than necessary, because this mainly bloats the CF app in a
non-necessary way, hiding patches that should deserve more attention.
Some of those patches have been left idle for half a year.

Another point that I'd like to make here is that my intention was to
lift the curve here (a term quite fashionable this year particularly,
isn't it?), due to the high number of patches to handle in a single CF,
and the shortage of hands actually doing this kind of work.

Then, let's be clear about one last thing.  If at *any* moment people
here feel that the state of at least *one* patch of this CF has been
changed based on some of my "personal" priority views, I will
immediately stop any action as CFM and resign from it.  So, if that's
the case, please feel free to speak up here or on any of the threads
related to those patches.

> Putting also in perspective the history that the patch had prior to your
> unfairly closing it, there was a lot of effort in getting it reviewed to
> a good state and updated by the author -- yet a single unsubstantiated
> comment, without itself a lot of effort on the reviewer's part, that
> "maybe it has a perf drawback" is sufficient to get it booted?  It
> doesn't seem appropriate.

I agree on the fact that not being able to recreate a new CF entry that
keeps the history of an old one, with all the past threads or
annotations, could be helpful in some cases.  Now, I have seen as well
cases where a patch that had been around for so long with a very long
thread made things confusing, and people also like starting a new
thread with a summary of the past state as a first message to attempt
to clarify things.  So both have advantages.
--
Michael

signature.asc
Description: PGP signature
Re: new heapcheck contrib module
Peter Geoghegan writes:
> On Mon, Sep 21, 2020 at 2:09 PM Robert Haas wrote:
>> +REVOKE ALL ON FUNCTION
>> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
>> +FROM PUBLIC;
>>
>> This too.

> Do we really want to use a cstring as an enum-like argument?

Ugh.  We should not be using cstring as a SQL-exposed datatype unless
there really is no alternative.  Why wasn't this argument declared
"text"?

			regards, tom lane
Re: Lift line-length limit for pg_service.conf
I wrote:
> Yeah.  In a quick scan, it appears that there is only one caller that
> tries to save the result directly.  So I considered making that caller
> do a pstrdup and eliminating the extra thrashing in t_readline itself.
> But it seemed too fragile; somebody would get it wrong and then have
> excess space consumption for their dictionary.

I had a better idea: if we get rid of t_readline() itself, which has
been deprecated for years anyway, we can have the calling layer
tsearch_readline_xxx maintain a StringInfo across the whole file read
process and thus get rid of alloc/free cycles for the StringInfo.

However ... while working on that, I realized that the usage of the
"curline" field in that code is completely unsafe.  We save a pointer
to the result of tsearch_readline() for possible use by the error
context callback, *but the caller is likely to free that string at some
point*.  So there is a window where an error would result in trying to
print already-freed data.

It looks like all of the core-code dictionaries free the result string
at the bottoms of their loops, so that the window for trouble is pretty
much empty.  But contrib/dict_xsyn doesn't do it like that, and so it's
surely at risk.  Any external dictionaries that have copied that code
would also be at risk.

So the attached adds a pstrdup/pfree to ensure that "curline" has its
own storage, putting us right back at two palloc/pfree cycles per line.
I don't think there's a lot of choice though; in fact, I'm leaning to
the idea that we need to back-patch that part of this.  The odds of
trouble in a production build probably aren't high, but still...

			regards, tom lane

diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index cb0835982d..64c979086d 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -286,11 +286,6 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
 					 errmsg("unexpected end of line")));

-		/*
-		 * Note: currently, tsearch_readline can't return lines exceeding 4KB,
-		 * so overflow of the word counts is impossible.  But that may not
-		 * always be true, so let's check.
-		 */
 		if (nwrd != (uint16) nwrd || posinsubst != (uint16) posinsubst)
 			ereport(ERROR,
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index a916dd6cb6..c85939631e 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -14,6 +14,7 @@
 #include "postgres.h"

 #include "catalog/pg_collation.h"
+#include "common/string.h"
 #include "storage/fd.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
@@ -128,6 +129,7 @@ tsearch_readline_begin(tsearch_readline_state *stp,
 		return false;
 	stp->filename = filename;
 	stp->lineno = 0;
+	initStringInfo(&stp->buf);
 	stp->curline = NULL;
 	/* Setup error traceback support for ereport() */
 	stp->cb.callback = tsearch_readline_callback;
@@ -146,11 +148,44 @@ char *
 tsearch_readline(tsearch_readline_state *stp)
 {
 	char	   *result;
+	int			len;

+	/* Advance line number to use in error reports */
 	stp->lineno++;
-	stp->curline = NULL;
-	result = t_readline(stp->fp);
-	stp->curline = result;
+
+	/* Clear curline, it's no longer relevant */
+	if (stp->curline)
+	{
+		pfree(stp->curline);
+		stp->curline = NULL;
+	}
+
+	/* Collect next line, if there is one */
+	if (!pg_get_line_buf(stp->fp, &stp->buf))
+		return NULL;
+
+	/* Make sure the input is valid UTF-8 */
+	len = stp->buf.len;
+	(void) pg_verify_mbstr(PG_UTF8, stp->buf.data, len, false);
+
+	/* And convert */
+	result = pg_any_to_server(stp->buf.data, len, PG_UTF8);
+	if (result == stp->buf.data)
+	{
+		/*
+		 * Conversion didn't pstrdup, so we must.  We can use the length of
+		 * the original string, because no conversion was done.
+		 */
+		result = pnstrdup(result, len);
+	}
+
+	/*
+	 * Save a copy of the line for possible use in error reports.  (We cannot
+	 * just save "result", since it's likely to get pfree'd at some point by
+	 * the caller; an error after that would try to access freed data.)
+	 */
+	stp->curline = pstrdup(result);
+
 	return result;
 }

@@ -160,7 +195,16 @@ tsearch_readline(tsearch_readline_state *stp)
 void
 tsearch_readline_end(tsearch_readline_state *stp)
 {
+	/* Suppress use of curline in any error reported below */
+	if (stp->curline)
+	{
+		pfree(stp->curline);
+		stp->curline = NULL;
+	}
+
+	pfree(stp->buf.data);
 	FreeFile(stp->fp);
+
 	/* Pop the error context stack */
 	error_context_stack = stp->cb.previous;
 }
@@ -176,8 +220,7 @@ tsearch_readline_callback(void *arg)

 	/*
 	 * We can't include the text of the config line for errors that occur
-	 * during t_readline() itself.  This is only partly a consequence of our
-	 * arms-length use of that routine: the major cause of such errors is
+	 * during tsearch_readline() itself.  The major cause of
RE: Global snapshots
From: Andrey Lepikhov
> Thank you for this work!
> As I can see, main development difficulties placed in other areas: CSN,
> resolver, global deadlocks, 2PC commit... I'm not lawyer too. But if we
> get remarks from the patent holders, we can rewrite our Clock-SI
> implementation.

Yeah, I understand your feeling.  I personally don't like patents, and
don't want to be disturbed by them.  But the world is not friendly...
We are not lawyers, but we have to do our best to make sure PostgreSQL
will be patent-free, by checking the technologies as engineers.

Among the above items, CSN is the only concerning one.  Other items are
written in textbooks, well-known, and used in other DBMSs, so they
should be free from patents.  However, CSN is not (at least to me.)
Have you checked if CSN is not related to some patent?  Or is CSN or
similar technology already widely used in famous software, so that we
can regard it as patent-free?

And please wait.  As below, the patent holder just says that Clock-SI
is not based on the patent and was an independent development.  He
doesn't say Clock-SI does not overlap with the patent, or that
implementing Clock-SI does not infringe on the patent.  Rather, he
suggests that Clock-SI has many similarities and thus those may match
the claims of the patent (unintentionally?)  I felt this is a sign of
risking infringement.

"The answer to your question is: No, Clock-SI is not based on the
patent - it was an entirely independent development. The two approaches
are similar in the sense that there is no global clock, the commit time
of a distributed transaction is the same in every partition where it
modified data, and a transaction gets its snapshot timestamp from a
local clock. The difference is whether a distributed transaction gets
its commit timestamp before or after the prepare phase in 2PC."

The timeline of events also worries me.  It seems unnatural to consider
that Clock-SI and the patent are independent.

2010/6 - 2010/9    One Clock-SI author worked for Microsoft Research as
a research intern
2010/10            Microsoft filed the patent
2011/9 - 2011/12   The same Clock-SI author worked for Microsoft
Research as a research intern
2013               The same author moved to EPFL and published the
Clock-SI paper with another author who has worked for Microsoft
Research since then.

So, could you give your opinion on whether we can use Clock-SI without
overlapping with the patent claims?  I also will try to check and see,
so that I can understand your technical analysis.

And I've just noticed that I got in touch with another author of
Clock-SI via SNS, and sent an inquiry to him.  I'll report again when I
have a reply.

Regards
Takayuki Tsunakawa
Re: new heapcheck contrib module
On Sat, Aug 29, 2020 at 10:48 AM Mark Dilger wrote:
> I had an earlier version of the verify_heapam patch that included a
> non-throwing interface to clog.  Ultimately, I ripped that out.  My
> reasoning was that a simpler patch submission was more likely to be
> acceptable to the community.

Isn't some kind of pragmatic compromise possible?

> But I don't want to make this patch dependent on that hypothetical
> patch getting written and accepted.

Fair enough, but if you're alluding to what I said then about
check_tuphdr_xids()/clog checking a while back, then FWIW I didn't
intend to block progress on clog/xact status verification at all. I
just don't think that it is sensible to impose an iron clad guarantee
about having no assertion failures with corrupt clog data -- that leads
to far too much code duplication. But why should you need to provide an
absolute guarantee of that?

I for one would be fine with making the clog checks an optional extra,
one that rescinds the no-crash guarantee that you're keen on -- just
like with the TOAST checks that you have already in v15.

It might make sense to review how often crashes occur with simulated
corruption, and then to minimize the number of occurrences in the real
world. Maybe we could tolerate a usually-no-crash interface to clog --
if it could still have assertion failures. Making a strong guarantee
about assertions seems unnecessary.

I don't see how verify_heapam will avoid raising an error during basic
validation from PageIsVerified(), which will violate the guarantee
about not throwing errors. I don't see that as a problem myself, but
presumably you will.

--
Peter Geoghegan
Re: new heapcheck contrib module
On Mon, Sep 21, 2020 at 2:09 PM Robert Haas wrote:
> +REVOKE ALL ON FUNCTION
> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
> +FROM PUBLIC;
>
> This too.

Do we really want to use a cstring as an enum-like argument?

I think that I see a bug at this point in check_tuple() (in
v15-0001-Adding-function-verify_heapam-to-amcheck-module.patch):

> +	/* If xmax is a multixact, it should be within valid range */
> +	xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
> +	if ((infomask & HEAP_XMAX_IS_MULTI) && !mxid_valid_in_rel(xmax, ctx))
> +	{

*** SNIP ***

> +	}
> +
> +	/* If xmax is normal, it should be within valid range */
> +	if (TransactionIdIsNormal(xmax))
> +	{

Why should it be okay to call TransactionIdIsNormal(xmax) at this
point? It isn't certain that xmax is an XID at all (it could be a
MultiXactId, since you called HeapTupleHeaderGetRawXmax() to get the
value in the first place). Don't you need to check "(infomask &
HEAP_XMAX_IS_MULTI) == 0" here?

This does look like it's shaping up. Thanks for working on it, Mark.

--
Peter Geoghegan
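Spelled out, the guard being suggested would look roughly like this (a
sketch following the identifiers in the patch under review, not a hunk
from it):

	/*
	 * Only treat xmax as a plain transaction ID when the multixact bit
	 * is clear; HeapTupleHeaderGetRawXmax() can return either kind of
	 * value depending on infomask.
	 */
	if (!(infomask & HEAP_XMAX_IS_MULTI) && TransactionIdIsNormal(xmax))
	{
		/* xmax is a normal XID; range-check it as such */
	}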
Re: Lift line-length limit for pg_service.conf
Daniel Gustafsson writes:
>> On 22 Sep 2020, at 23:24, Tom Lane wrote:
>> In the same vein, here's a patch to remove the hard-coded line length
>> limit for tsearch dictionary files.

> LGTM.  I like the comment on why not to return buf.data directly, that
> detail would be easy to miss.

Yeah.  In a quick scan, it appears that there is only one caller that
tries to save the result directly.  So I considered making that caller
do a pstrdup and eliminating the extra thrashing in t_readline itself.
But it seemed too fragile; somebody would get it wrong and then have
excess space consumption for their dictionary.

			regards, tom lane
Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration
On 2020-09-20 05:41, Michael Paquier wrote:
> On Fri, Sep 18, 2020 at 05:22:15PM +0900, Michael Paquier wrote:
>> Okay, after looking at that, here is v3.  This includes range checks
>> as well as errno checks based on strtol().  What do you think?
>
> This fails in the CF bot on Linux because of pg_logging_init()
> returning with errno=ENOTTY in the TAP tests, for which I began a new
> thread:
> https://www.postgresql.org/message-id/20200918095713.ga20...@paquier.xyz
> Not sure if this will lead anywhere, but we can also address the
> failure by enforcing errno=0 for the new calls of strtol() introduced
> in this patch.  So here is an updated patch doing so.

I think the error checking is now structurally correct in this patch.

However, I still think the integer type use is a bit inconsistent.  In
both cases, using strtoul() and dealing with unsigned integer types
between parsing and final use would be more consistent.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Missing TOAST table for pg_class
Fabrízio de Royes Mello writes:
> Attached patch adds the TOAST to pg_class, and let's open again the
> discussion around it.

What exactly do you argue has changed since the previous decision
that should cause us to change it?  In particular, where is the
additional data to change our minds about the safety of such a thing?

One thing I'd want to see is some amount of testing of pg_class toast
accesses under CLOBBER_CACHE_ALWAYS and even CLOBBER_CACHE_RECURSIVELY.
AFAICT from reviewing the prior thread, nobody did any such thing.

			regards, tom lane
Re: Lift line-length limit for pg_service.conf
> On 22 Sep 2020, at 23:24, Tom Lane wrote:
>
> In the same vein, here's a patch to remove the hard-coded line length
> limit for tsearch dictionary files.

LGTM.  I like the comment on why not to return buf.data directly, that
detail would be easy to miss.

cheers ./daniel
Re: Binaries on s390x arch
Hi,

On Tue, 2020-09-22 at 10:57 +0200, Christoph Berg wrote:
> Note that I'm working on the .deb packages for Debian and Ubuntu. For
> .rpm (RHEL, SLES) you need to get Devrim on board. (I'm putting him
> on Cc.)

Thanks!

> To be able to keep up with the other architectures when building
> binaries for PostgreSQL releases, the machine would have about the
> same specs as the existing arm64 or ppc64el build daemons:
>
> VMs: 1
> OS: Debian buster
> vCPUs: 8
> Memory: 16 GB
> Storage: 100GB (most of that is swap space for tmpfs)

@Namrata: Same here for RHEL RPMs:

VMs: 2
OS: RHEL 7 and RHEL 8
vCPUs: 8 (each)
Memory: 16 GB (each)
Storage: 100GB

Regards,
--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

signature.asc
Description: This is a digitally signed message part
Re: Lift line-length limit for pg_service.conf
In the same vein, here's a patch to remove the hard-coded line length
limit for tsearch dictionary files.

			regards, tom lane

diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index cb0835982d..64c979086d 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -286,11 +286,6 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
 					 errmsg("unexpected end of line")));

-		/*
-		 * Note: currently, tsearch_readline can't return lines exceeding 4KB,
-		 * so overflow of the word counts is impossible.  But that may not
-		 * always be true, so let's check.
-		 */
 		if (nwrd != (uint16) nwrd || posinsubst != (uint16) posinsubst)
 			ereport(ERROR,
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index a916dd6cb6..247180d56e 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -14,6 +14,8 @@
 #include "postgres.h"

 #include "catalog/pg_collation.h"
+#include "common/string.h"
+#include "lib/stringinfo.h"
 #include "storage/fd.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
@@ -204,29 +206,41 @@ tsearch_readline_callback(void *arg)
 char *
 t_readline(FILE *fp)
 {
+	StringInfoData buf;
 	int			len;
 	char	   *recoded;
-	char		buf[4096];		/* lines must not be longer than this */

-	if (fgets(buf, sizeof(buf), fp) == NULL)
+	initStringInfo(&buf);
+
+	if (!pg_get_line_buf(fp, &buf))
+	{
+		pfree(buf.data);
 		return NULL;
+	}

-	len = strlen(buf);
+	len = buf.len;

 	/* Make sure the input is valid UTF-8 */
-	(void) pg_verify_mbstr(PG_UTF8, buf, len, false);
+	(void) pg_verify_mbstr(PG_UTF8, buf.data, len, false);

 	/* And convert */
-	recoded = pg_any_to_server(buf, len, PG_UTF8);
-	if (recoded == buf)
+	recoded = pg_any_to_server(buf.data, len, PG_UTF8);
+	if (recoded == buf.data)
 	{
 		/*
 		 * conversion didn't pstrdup, so we must.  We can use the length of the
 		 * original string, because no conversion was done.
+		 *
+		 * Note: it might seem attractive to just return buf.data, and in most
+		 * usages that'd be fine.  But a few callers save the returned string
+		 * as long-term data, so returning a palloc chunk that's bigger than
+		 * necessary is a bad idea.
 		 */
 		recoded = pnstrdup(recoded, len);
 	}

+	pfree(buf.data);
+
 	return recoded;
 }
Missing TOAST table for pg_class
Hi all,

I know it has been discussed before [1] but one of our customers
complained about something weird on one of their multi-tenancy
databases (thousands of schemas with a lot of objects inside and one
user per schema).

So when I checked, the problem is because of the missing TOAST for
pg_class, and it is easy to break it by just:

fabrizio=# create table foo (a int, b int);
CREATE TABLE

fabrizio=# do $$
begin
  for i in 1..2500 loop
    execute 'create user u' || i;
    execute 'grant all on foo to u' || i;
  end loop;
end;
$$;
ERROR:  row is too big: size 8168, maximum size 8160
CONTEXT:  SQL statement "grant all on foo to u2445"
PL/pgSQL function inline_code_block line 6 at EXECUTE

Attached patch adds the TOAST to pg_class, and let's open again the
discussion around it.

Regards,

[1] https://www.postgresql.org/message-id/flat/84ddff04-f122-784b-b6c5-3536804495f8%40joeconway.com

--
Fabrízio de Royes Mello
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index 51491c4513..729bc2cc66 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -51,6 +51,7 @@ extern void BootstrapToastTable(char *relName,
 /* normal catalogs */
 DECLARE_TOAST(pg_aggregate, 4159, 4160);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
+DECLARE_TOAST(pg_class, 4179, 4180);
 DECLARE_TOAST(pg_collation, 4161, 4162);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
 DECLARE_TOAST(pg_default_acl, 4143, 4144);
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 8538173ff8..0a2f5cc2a2 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -100,12 +100,9 @@ ORDER BY 1, 2;
 pg_attribute            | attfdwoptions | text[]
 pg_attribute            | attmissingval | anyarray
 pg_attribute            | attoptions    | text[]
- pg_class                | relacl        | aclitem[]
- pg_class                | reloptions    | text[]
- pg_class                | relpartbound  | pg_node_tree
 pg_index                | indexprs      | pg_node_tree
 pg_index                | indpred       | pg_node_tree
 pg_largeobject          | data          | bytea
 pg_largeobject_metadata | lomacl        | aclitem[]
-(11 rows)
+(8 rows)
RE: Transactions involving multiple postgres foreign servers, take 2
From: Ashutosh Bapat > parallelism here has both pros and cons. If one of the servers errors > out while preparing for a transaction, there is no point in preparing > the transaction on other servers. In parallel execution we will > prepare on multiple servers before realising that one of them has > failed to do so. On the other hand preparing on multiple servers in > parallel provides a speed up. And the pros are dominant in practice. If many transactions are erroring out (during prepare), the system is not functioning for the user. Such an application should be corrected before it is put into production. > But this can be an improvement on version 1. The current approach > doesn't render such an improvement impossible. So if that's something > hard to do, we should do that in the next version rather than > complicating this patch. Could you share your idea on how the current approach could enable parallelism? This is an important point, because (1) the FDW may not lead us to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or implementation could be considered for non-parallel interaction if we have the realization of parallelism in mind. I think that kind of consideration is part of the design (for the future). Regards Takayuki Tsunakawa
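To make the trade-off concrete, here is a minimal sketch (my own illustration, not code from any patch in this thread) of what firing the prepares in parallel could look like with libpq's asynchronous API; connection setup, GID generation, and rollback of already-prepared servers are omitted, and all names are hypothetical:

#include <stdbool.h>
#include <libpq-fe.h>

/* Fire PREPARE TRANSACTION at every server, then collect the results. */
static bool
prepare_all_in_parallel(PGconn **conns, int nservers)
{
	bool		all_ok = true;

	for (int i = 0; i < nservers; i++)
	{
		/* queue the command without waiting for the reply */
		if (!PQsendQuery(conns[i], "PREPARE TRANSACTION 'fx_demo'"))
			all_ok = false;
	}

	for (int i = 0; i < nservers; i++)
	{
		PGresult   *res;

		/* drain all results; NULL marks the end of this command */
		while ((res = PQgetResult(conns[i])) != NULL)
		{
			if (PQresultStatus(res) != PGRES_COMMAND_OK)
				all_ok = false;	/* caller must roll back the survivors */
			PQclear(res);
		}
	}
	return all_ok;
}

The downside Ashutosh describes is visible here: every send completes before any failure is noticed, so a failed prepare still costs a round trip to all the other servers.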
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
On Tue, 22 Sep 2020 at 19:08, David Rowley wrote: > I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc > 9.3. I'm unable to see any gains with this, however, the results were > pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely > need to run that a bit longer. I'll do that tonight. I've attached the results of a TPCH scale=5 run of master (f859c2ffa) vs master + elog_ereport_attribute_cold_v4.patch. It does not look great. The patched version seems to have done about 1.17% less work than master did. David tpch_scale5_elog_ereport_cold_v4_vs_master_10min.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: new heapcheck contrib module
On Tue, Sep 22, 2020 at 12:41 PM Robert Haas wrote: > But now I see that there's no secondary permission check in the > verify_nbtree.c code. Is that intentional? Peter, what's the > justification for that? As noted by comments in contrib/amcheck/sql/check_btree.sql (the verify_nbtree.c tests), this is intentional. Note that we explicitly test that a non-superuser role can perform verification following GRANT EXECUTE ON FUNCTION ... . As I mentioned earlier, this is supported (or at least it is supported in my interpretation of things). It just isn't documented anywhere outside the test itself. -- Peter Geoghegan
Re: Improper use about DatumGetInt32
Robert Haas writes: > On Mon, Sep 21, 2020 at 3:53 PM Andres Freund wrote: >> I think we mostly use it for the few places where we currently expose >> data as a signed integer on the SQL level, but internally actually treat >> it as a unsigned data. > So why is the right solution to that not DatumGetInt32() + a cast to uint32? You're ignoring the xid use-case, for which DatumGetUInt32 actually is the right thing. I tend to agree though that if the SQL argument is of a signed type, the least API-abusing answer is a signed DatumGetXXX macro followed by whatever cast you need. regards, tom lane
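As a concrete illustration of the recommended pattern, a sketch (the function itself is hypothetical; only the macros are real):

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(demo_uint_arg);

/* hypothetical SQL signature: demo_uint_arg(int4) returns int4 */
Datum
demo_uint_arg(PG_FUNCTION_ARGS)
{
	int32		signed_arg = PG_GETARG_INT32(0);	/* matches the SQL type */
	uint32		uval = (uint32) signed_arg;	/* explicit, visible cast */

	/* ... treat uval as unsigned internally ... */

	PG_RETURN_INT32((int32) uval);
}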
Re: Lift line-length limit for pg_service.conf
Daniel Gustafsson writes: > [ 0001-Refactor-pg_service.conf-and-pg_restore-TOC-file-par.patch ] I reviewed this and noticed that you'd missed adding resetStringInfo calls in some code paths, which made me realize that while pg_get_line_append() is great for its original customer in hba.c, it kinda sucks for most other callers. Having to remember to do resetStringInfo in every path through a loop is too error-prone, and it's unnecessary. So I made another subroutine that just adds that step, and updated the existing callers that could use it. Pushed with that correction. regards, tom lane
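For readers following along, a sketch of the difference (the wrapper body is my assumption from the description above; the real subroutine lives in src/common/string.c):

#include "postgres.h"
#include "common/string.h"
#include "lib/stringinfo.h"

/* the added subroutine, roughly: reset the buffer, then read one line */
static bool
get_line_buf_sketch(FILE *stream, StringInfo buf)
{
	resetStringInfo(buf);
	return pg_get_line_append(stream, buf);
}

/* callers no longer need a resetStringInfo() on every path through the loop */
static void
read_all_lines(FILE *fp)
{
	StringInfoData buf;

	initStringInfo(&buf);
	while (get_line_buf_sketch(fp, &buf))
	{
		/* ... process buf.data / buf.len; no reset required here ... */
	}
	pfree(buf.data);
}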
Re: Improper use about DatumGetInt32
On Mon, Sep 21, 2020 at 3:53 PM Andres Freund wrote: > On 2020-09-21 14:08:22 -0400, Robert Haas wrote: > > There is no SQL type corresponding to the C data type uint32, so I'm > > not sure why we even have DatumGetUInt32. I'm sort of suspicious that > > there's some fuzzy thinking going on there. > > I think we mostly use it for the few places where we currently expose > data as a signed integer on the SQL level, but internally actually treat > it as a unsigned data. So why is the right solution to that not DatumGetInt32() + a cast to uint32? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: new heapcheck contrib module
On Tue, Sep 22, 2020 at 1:55 PM Mark Dilger wrote: > I am inclined to just restrict verify_heapam() to superusers and be done. > What do you think? I think that's an old and largely failed approach. If you want to use pg_class_ownercheck here rather than pg_class_aclcheck or something like that, that seems fair enough. But I don't think there should be an is-superuser check in the code, because we've been trying really hard to get rid of those in most places. And I also don't think there should be no secondary permissions check, because if somebody does grant execute permission on these functions, it's unlikely that they want the person getting that permission to be able to check every relation in the system, even those on which they have no other privileges at all. But now I see that there's no secondary permission check in the verify_nbtree.c code. Is that intentional? Peter, what's the justification for that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
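To illustrate the kind of secondary check being discussed, a hypothetical sketch (verify_heapam() may end up doing something different, e.g. using pg_class_aclcheck() instead):

#include "postgres.h"
#include "miscadmin.h"
#include "utils/acl.h"
#include "utils/rel.h"

/* hypothetical guard: EXECUTE on the function is not enough by itself */
static void
check_relation_ownership(Relation rel)
{
	if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_TABLE,
					   RelationGetRelationName(rel));
}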
Re: SEARCH and CYCLE clauses
Hi, I found another bug: create view xx as WITH recursive destinations (departure, arrival, connections, cost, itinerary) AS (SELECT f.departure, f.arrival, 1, price, CAST(f.departure || f.arrival AS VARCHAR(2000)) FROM flights f WHERE f.departure = 'New York' UNION ALL SELECT r.departure, b.arrival, r.connections + 1 , r.cost + b.price, CAST(r.itinerary || b.arrival AS VARCHAR(2000)) FROM destinations r, flights b WHERE r.arrival = b.departure) CYCLE arrival SET cyclic_data TO '1' DEFAULT '0' using path SELECT departure, arrival, itinerary, cyclic_data FROM destinations ; postgres=# select * from xx; ERROR: attribute number 6 exceeds number of columns 5 Regards Pavel
Re: SEARCH and CYCLE clauses
Hi On Tue 22. 9. 2020 at 20:01, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > I have implemented the SEARCH and CYCLE clauses. > > This is standard SQL syntax attached to a recursive CTE to compute a > depth- or breadth-first ordering and cycle detection, respectively. > This is just convenience syntax for what you can already do manually. > The original discussion about recursive CTEs briefly mentioned these as > something to do later but then it was never mentioned again. > > SQL specifies these in terms of syntactic transformations, and so that's > how I have implemented them also, mainly in the rewriter. > > I have successfully tested this against examples I found online that > were aimed at DB2. > > The contained documentation and the code comment in rewriteHandler.c > explain the details. > I am playing with this patch. It looks good, but I found some issues (the example is from the attached data.sql) WITH recursive destinations (departure, arrival, connections, cost) AS (SELECT f.departure, f.arrival, 0, price FROM flights f WHERE f.departure = 'New York' UNION ALL SELECT r.departure, b.arrival, r.connections + 1, r.cost + b.price FROM destinations r, flights b WHERE r.arrival = b.departure) cycle departure, arrival set is_cycle to true default false using path SELECT * FROM destinations ; ; The result is correct. But when I tried to use UNION instead of UNION ALL, PostgreSQL crashes: Program received signal SIGABRT, Aborted. 0x7f761338ebc5 in raise () from /lib64/libc.so.6 (gdb) bt #0 0x7f761338ebc5 in raise () from /lib64/libc.so.6 #1 0x7f76133778a4 in abort () from /lib64/libc.so.6 #2 0x0090e7eb in ExceptionalCondition (conditionName=, errorType=, fileName=, lineNumber=) at assert.c:67 #3 0x007205e7 in generate_setop_grouplist (targetlist=targetlist@entry=0x7f75fce5d018, op=, op=) at prepunion.c:1412 #4 0x007219d0 in generate_recursion_path (pTargetList=0x7fff073ee728, refnames_tlist=, root=0xf90bd8, setOp=0xf90840) at prepunion.c:502 #5 plan_set_operations (root=0xf90bd8) at prepunion.c:156 #6 0x0070f79b in grouping_planner (root=0xf90bd8, inheritance_update=false, tuple_fraction=) at planner.c:1886 #7 0x00712ea7 in subquery_planner (glob=, parse=, parent_root=, hasRecursion=, tuple_fraction=0) at planner.c:1015 #8 0x0071a614 in SS_process_ctes (root=0xf7abd8) at subselect.c:952 #9 0x007125d4 in subquery_planner (glob=glob@entry=0xf8a010, parse=parse@entry=0xf6cf20, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0) at planner.c:645 #10 0x0071425b in standard_planner (parse=0xf6cf20, query_string=, cursorOptions=256, boundParams=) at planner.c:405 #11 0x007e5f68 in pg_plan_query (querytree=0xf6cf20, query_string=query_string@entry=0xea6370 "WITH recursive destinations (departure, arrival, connections, cost) AS \n(SELECT f.departure, f.arrival, 0, price\n", ' ' , "FROM flights f \n", ' ' , "WHERE f.departure = 'New York' \n UNION "..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:875 #12 0x007e6061 in pg_plan_queries (querytrees=0xf8b690, query_string=query_string@entry=0xea6370 "WITH recursive destinations (departure, arrival, connections, cost) AS \n(SELECT f.departure, f.arrival, 0, price\n", ' ' , "FROM flights f \n", ' ' , "WHERE f.departure = 'New York' \n UNION "..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:966 #13 0x007e63b8 in exec_simple_query ( query_string=0xea6370 "WITH recursive destinations (departure, arrival,
connections, cost) AS \n(SELECT f.departure, f.arrival, 0, price\n", ' ' , "FROM flights f \n", ' ' , "WHERE f.departure = 'New York' \n UNION "...) at postgres.c:1158 #14 0x007e81e4 in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:4309 #15 0x007592b9 in BackendRun (port=0xecaf20) at postmaster.c:4541 #16 BackendStartup (port=0xecaf20) at postmaster.c:4225 #17 ServerLoop () at postmaster.c:1742 #18 0x0075a0ed in PostmasterMain (argc=, argv=0xea0c90) at postmaster.c:1415 #19 0x004832ec in main (argc=3, argv=0xea0c90) at main.c:209 It looks like the USING clause in cycle detection is unsupported by DB2 and Oracle; the examples from these databases don't work on PG without modifications. Regards Pavel > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > data.sql Description: application/sql
Re: new heapcheck contrib module
On Tue, Sep 22, 2020 at 10:55 AM Mark Dilger wrote: > I am inclined to just restrict verify_heapam() to superusers and be done. > What do you think? The existing amcheck functions were designed to have execute privilege granted to non-superusers, though we never actually advertised that fact. Maybe now would be a good time to start doing so. -- Peter Geoghegan
Re: new heapcheck contrib module
> On Sep 21, 2020, at 2:09 PM, Robert Haas wrote: > > I think there should be a call to pg_class_aclcheck() here, just like > the one in pg_prewarm, so that if the superuser does choose to grant > access, users given access can check tables they anyway have > permission to access, but not others. Maybe put that in > check_relation_relkind_and_relam() and rename it. Might want to look > at the pg_surgery precedent, too. In the presence of corruption, verify_heapam() reports to the user (in other words, leaks) metadata about the corrupted rows. Reasoning about the attack vectors this creates is hard, but a conservative approach is to assume that an attacker can cause corruption in order to benefit from the leakage, and make sure the leakage does not violate any reasonable security expectations. Basing the security decision on whether the user has access to read the table seems insufficient, as it ignores row level security. Perhaps that is ok if row level security is not enabled for the table or if the user has been granted BYPASSRLS. There is another problem, though. There is no grantable privilege to read dead rows. In the case of corruption, verify_heapam() may well report metadata about dead rows. pg_surgery also appears to leak information about dead rows. Owners of tables can probe whether supplied TIDs refer to dead rows. If a table containing sensitive information has rows deleted prior to ownership being transferred, the new owner of the table could probe each page of deleted data to determine something of the content that was there. Information about the number of deleted rows is already available through the pg_stat_* views, but those views don't give such a fine-grained approach to figuring out how large each deleted row was. For a table with fixed content options, the content can sometimes be completely inferred from the length of the row. (Consider a table with a single text column containing either "approved" or "denied".) But pg_surgery is understood to be a collection of sharp tools only to be used under fairly exceptional conditions. amcheck, on the other hand, is something that feels safer and more reasonable to use on a regular basis, perhaps from a cron job executed by a less trusted user. Forcing the user to be superuser makes it clearer that this feeling of safety is not justified. I am inclined to just restrict verify_heapam() to superusers and be done. What do you think? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
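For reference, the restriction being contemplated is the classic pattern (a sketch, assuming the function keeps its current name):

#include "postgres.h"
#include "miscadmin.h"

/* hypothetical guard at the top of the SQL-callable entry point */
static void
require_superuser(void)
{
	if (!superuser())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("must be superuser to use verify_heapam()")));
}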
Re: Asynchronous Append on postgres_fdw nodes.
On Tue, Sep 22, 2020 at 9:52 PM Konstantin Knizhnik wrote: > On 20.08.2020 10:36, Kyotaro Horiguchi wrote: > > Come to think of "complex", ExecAsync stuff in this patch might be > > too-much for a short-term solution until executor overhaul, if it > > comes shortly. (the patch of mine here as a whole is like that, > > though..). The queueing stuff in postgres_fdw is, too. > Looks like current implementation of asynchronous append incorrectly > handles the LIMIT clause: > > psql:append.sql:10: ERROR: another command is already in progress > CONTEXT: remote SQL command: CLOSE c1 Thanks for the report (and patch)! The same issue has already been noticed in [1]. I too think the cause of the issue would be in the 0003 patch (ie, “the queueing stuff” in postgres_fdw), but I’m not sure it is really a good idea to have that in postgres_fdw in the first place, because it would impact performance negatively in some cases (see [1]). Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com
Re: pgindent vs dtrace on macos
After further thought about this, I concluded that a much better idea is to just exclude fmgrprotos.h from the pgindent run. While renaming these two functions may or may not be worthwhile, it doesn't provide any sort of permanent fix for fmgrprotos.h, because new typedef conflicts could arise at any time. (The fact that the typedef list isn't fully under our control, thanks to contributions from system headers, makes this a bigger risk than it might appear.) So I did that, and also added a README comment along the lines you suggested. regards, tom lane
Re: Asynchronous Append on postgres_fdw nodes.
On 22.09.2020 16:40, Konstantin Knizhnik wrote: On 22.09.2020 15:52, Konstantin Knizhnik wrote: On 20.08.2020 10:36, Kyotaro Horiguchi wrote: At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby wrote in On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote: As the result of a discussion with Fujita-san off-list, I'm going to hold off development until he decides whether mine or Thomas' is better. The latest patch doesn't apply so I set as WoA. https://commitfest.postgresql.org/29/2491/ Thanks. This is rebased version. At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro wrote in Either way, we definitely need patch 0001. One comment: -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) I wonder if it's better to have it receive ResourceOwner like that, or to have it capture CurrentResourceOwner. I think the latter is more common in existing code. There's no existing WaitEventSets belonging to a resowner. So unconditionally capturing CurrentResourceOwner doesn't work well. I could pass a bool instead but that make things more complex. Come to think of "complex", ExecAsync stuff in this patch might be too-much for a short-term solution until executor overhaul, if it comes shortly. (the patch of mine here as a whole is like that, though..). The queueing stuff in postgres_fdw is, too. regards. Hi, Looks like current implementation of asynchronous append incorrectly handles the LIMIT clause: psql:append.sql:10: ERROR: another command is already in progress CONTEXT: remote SQL command: CLOSE c1 Just FYI: the following patch fixes the problem: --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node) if (cur == node) { + PGconn *conn = curstate->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + prev_state->waiter = curstate->waiter; /* relink to the previous node if the last node was removed */ Sorry, but it is not the only problem. If you execute the query above and then in the same backend try to insert more records, then the backend crashes: Program terminated with signal SIGSEGV, Segmentation fault. #0 0x7f5dfc59a231 in fetch_received_data (node=0x230c130) at postgres_fdw.c:3736 3736 Assert(fsstate->s.commonstate->leader == node); (gdb) p sstate->s.commonstate No symbol "sstate" in current context. (gdb) p fsstate->s.commonstate Cannot access memory at address 0x7f7f7f7f7f7f7f87 Also my patch doesn't solve the problem for a small number of records (100) in the table. I attach yet another patch which fixes both problems. Please notice that I did not go deep into the code of async append, so I am not sure that my patch is complete and correct. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1482436..ff15642 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1623,7 +1623,19 @@ remove_async_node(ForeignScanState *node) PgFdwScanState *prev_state; ForeignScanState *cur; - /* no need to remove me */ + if (fsstate->s.commonstate->busy) + { + /* + * this node is waiting for result, absorb the result first so + * that the following commands can be sent on the connection.
+ */ + PGconn *conn = fsstate->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + } + +/* no need to remove me */ if (!leader || !fsstate->queued) return; @@ -1631,23 +1643,7 @@ remove_async_node(ForeignScanState *node) if (leader == node) { - if (leader_state->s.commonstate->busy) - { - /* - * this node is waiting for result, absorb the result first so - * that the following commands can be sent on the connection. - */ - PgFdwScanState *leader_state = GetPgFdwScanState(leader); - PGconn *conn = leader_state->s.conn; - - while(PQisBusy(conn)) -PQclear(PQgetResult(conn)); - - leader_state->s.commonstate->busy = false; - } - move_to_next_waiter(node); - return; } @@ -1858,7 +1854,7 @@ postgresEndForeignScan(ForeignScanState *node) /* Release remote connection */ ReleaseConnection(fsstate->s.conn); fsstate->s.conn = NULL; - + fsstate->s.commonstate->leader = NULL; /* MemoryContexts will be deleted automatically. */ }
Re: [patch] Fix checksum verification in base backups for zero page headers
Hi, On Tuesday, 22.09.2020, 16:26 +0200, Michael Banck wrote: > On Wednesday, 02.09.2020, 16:50 +0300, Anastasia Lubennikova wrote: > > I've looked through the previous discussion. As far as I got it, most of > > the controversy was about online checksums improvements. > > > > The warning about pd_upper inconsistency that you've added is a good > > addition. The patch is a bit messy, though, because a huge code block > > was shifted. > > > > Will it be different, if you just leave > > "if (!PageIsNew(page) && PageGetLSN(page) < startptr)" > > block as it was, and add > > "else if (PageIsNew(page) && !PageIsZero(page))" ? > > Thanks, that does indeed look better as a patch and I think it's fine > as-is for the code as well, I've attached a v2. Sorry, I forgot to add you as a reviewer in the proposed commit message; I've fixed that up now in V3. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz From a6706ec63709137881d415a8acf98c390a39ee56 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Tue, 22 Sep 2020 16:09:36 +0200 Subject: [PATCH] Fix checksum verification in base backups for zero page headers We currently do not checksum a page if it is considered new by PageIsNew(). However, this means that if the whole page header is zero, PageIsNew() will consider that page new due to the pd_upper field being zero and no subsequent checksum verification is done. To fix, factor out the all-zeroes check from PageIsVerified() into a new method PageIsZero() and call that for pages considered new by PageIsNew(). If a page is all zero, consider that a checksum failure. Add one more test to the pg_basebackup TAP tests to check this error. Reported-By: Andres Freund Reviewed-By: Asif Rehman, Anastasia Lubennikova Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de --- src/backend/replication/basebackup.c | 30 + src/backend/storage/page/bufpage.c | 35 +++- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +- src/include/storage/bufpage.h| 11 +++--- 4 files changed, 73 insertions(+), 21 deletions(-) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index b89df01fa7..c82765a429 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename, "be reported", readfilename))); } } +else +{ + /* + * We skip completely new pages after checking they are + * all-zero, since they don't have a checksum yet. + */ + if (PageIsNew(page) && !PageIsZero(page)) + { + /* + * pd_upper is zero, but the page is not all zero. We + * cannot run pg_checksum_page() on the page as it + * would throw an assertion failure. Consider this a + * checksum failure.
+ */ + checksum_failures++; + + if (checksum_failures <= 5) + ereport(WARNING, + (errmsg("checksum verification failed in " + "file \"%s\", block %d: pd_upper " + "is zero but page is not all-zero", + readfilename, blkno))); + if (checksum_failures == 5) + ereport(WARNING, + (errmsg("further checksum verification " + "failures in file \"%s\" will not " + "be reported", readfilename))); + } +} + block_retry = false; blkno++; } diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index 4bc2bf955d..8be3cd6190 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -82,11 +82,8 @@ bool PageIsVerified(Page page, BlockNumber blkno) { PageHeader p = (PageHeader) page; - size_t *pagebytes; - int i; bool checksum_failure = false; bool header_sane = false; - bool all_zeroes = false; uint16 checksum = 0; /* @@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno) } /* Check all-zeroes case */ - all_zeroes = true; - pagebytes = (size_t *) page; - for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) - { - if (pagebytes[i] != 0) - { - all_zeroes = false; - break; - } - } - - if (all_zeroes) + if (PageIsZero(page)) return true; /* @@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno) return false; } +/* + * PageIsZero + * Check that the page consists only of zero bytes. + * + */ +bool +PageIsZero(Page page) +{ + int i; + size_t *pagebytes = (size_t *) page; + + for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) + { +
Re: [patch] Fix checksum verification in base backups for zero page headers
Hi, On Wednesday, 02.09.2020, 16:50 +0300, Anastasia Lubennikova wrote: > On 01.09.2020 13:22, Michael Banck wrote: > > as a continuation of [1], I've split-off the zero page header case from > > the last patch, as this one seems less contentious. > > [1] https://commitfest.postgresql.org/28/2308/ > > I've looked through the previous discussion. As far as I got it, most of > the controversy was about online checksums improvements. > > The warning about pd_upper inconsistency that you've added is a good > addition. The patch is a bit messy, though, because a huge code block > was shifted. > > Will it be different, if you just leave > "if (!PageIsNew(page) && PageGetLSN(page) < startptr)" > block as it was, and add > "else if (PageIsNew(page) && !PageIsZero(page))" ? Thanks, that does indeed look better as a patch and I think it's fine as-is for the code as well, I've attached a v2. > While on it, I also have a few questions about the code around: All fair points, but I think those should be handled in another patch, if any. > 1) Maybe we can use other header sanity checks from PageIsVerified() as > well? Or even the function itself. Yeah, I'll keep that in mind. > 2) > /* Reset loop to validate the block again */ > How many times do we try to reread the block? Is one reread enough? Not sure whether more rereads would help, but I think the general direction was to replace this with something better anyway (see below). > Speaking of which, 'reread_cnt' name looks confusing to me. I would > expect that this variable contains a number of attempts, not the number > of bytes read. Well, there are other "cnt" variables that keep the number of bytes read in that file, so it does not seem to be out of place to me. A comment might not hurt though. On second glance, it should maybe also be of size_t and not int type, as is the other cnt variable? > If everyone agrees, that for basebackup purpose it's enough to rely on a > single reread, I'm ok with it. > Another approach is to read the page directly from shared buffers to > ensure that the page is fine. This way is more complicated, but should > be almost bullet-proof. Using it we can also check pages with lsn >= > startptr. Right, that's what Andres also advocated for initially I think, and what should be done going forward. > 3) Judging by warning messages, we count checksum failures per file, not > per page, and don't report after a fifth failure. Why so? Is it a > common case that so many pages of one file are corrupted? I think this was a request during the original review, on the basis that if we have more than a few checksum errors, it's likely there is something fundamentally broken and it does not make sense to spam the log with potentially millions of broken checksums. Cheers, Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz From 317df17289d4fd057ffdb4410f9effa8c7a2b975 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Tue, 22 Sep 2020 16:09:36 +0200 Subject: [PATCH] Fix checksum verification in base backups for zero page headers We currently do not checksum a page if it is considered new by PageIsNew().
However, this means that if the whole page header is zero, PageIsNew() will consider that page new due to the pd_upper field being zero and no subsequent checksum verification is done. To fix, factor out the all-zeroes check from PageIsVerified() into a new method PageIsZero() and call that for pages considered new by PageIsNew(). If a page is all zero, consider that a checksum failure. Add one more test to the pg_basebackup TAP tests to check this error. Reported-By: Andres Freund Reviewed-By: Asif Rehman Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de --- src/backend/replication/basebackup.c | 30 + src/backend/storage/page/bufpage.c | 35 +++- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +- src/include/storage/bufpage.h| 11 +++--- 4 files changed, 73 insertions(+), 21 deletions(-) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index b89df01fa7..c82765a429 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename, "be reported", readfilename))); } } +else +{ + /* + * We skip completely new pages after checking they are + * all-zero, since they don't have a checksum yet. + */ + if (PageIsNew(page) &&
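The hunk above is cut off in the archive; reconstructed from the all-zeroes loop that the same patch removes from PageIsVerified(), the complete function presumably reads:

/*
 * PageIsZero
 *		Check that the page consists only of zero bytes.
 */
bool
PageIsZero(Page page)
{
	int			i;
	size_t	   *pagebytes = (size_t *) page;

	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
	{
		if (pagebytes[i] != 0)
			return false;
	}

	return true;
}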
Re: pgindent vs dtrace on macos
Daniel Gustafsson writes: >> On 21 Sep 2020, at 20:55, Tom Lane wrote: >> Here's a proposed patch to fix it that way. > +1 on this patch. Do you think it's worth adding a note about this in the > documentation to save the next one staring at this a few minutes? Something > along the lines of: > +If an exported function shares a name with a typedef, the header file with > the > +prototype can get incorrect spacing for the function. AFAIK, a function name that's the same as a typedef name will get misformatted whether it's exported or not; pgindent doesn't really know the difference. But yeah, we could add something about this. regards, tom lane
Re: Asynchronous Append on postgres_fdw nodes.
On 22.09.2020 15:52, Konstantin Knizhnik wrote: On 20.08.2020 10:36, Kyotaro Horiguchi wrote: At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby wrote in On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote: As the result of a discussion with Fujita-san off-list, I'm going to hold off development until he decides whether mine or Thomas' is better. The latest patch doesn't apply so I set as WoA. https://commitfest.postgresql.org/29/2491/ Thanks. This is rebased version. At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro wrote in Either way, we definitely need patch 0001. One comment: -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) I wonder if it's better to have it receive ResourceOwner like that, or to have it capture CurrentResourceOwner. I think the latter is more common in existing code. There's no existing WaitEventSets belonging to a resowner. So unconditionally capturing CurrentResourceOwner doesn't work well. I could pass a bool instead but that make things more complex. Come to think of "complex", ExecAsync stuff in this patch might be too-much for a short-term solution until executor overhaul, if it comes shortly. (the patch of mine here as a whole is like that, though..). The queueing stuff in postgres_fdw is, too. regards. Hi, Looks like current implementation of asynchronous append incorrectly handles the LIMIT clause: psql:append.sql:10: ERROR: another command is already in progress CONTEXT: remote SQL command: CLOSE c1 Just FYI: the following patch fixes the problem: --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node) if (cur == node) { + PGconn *conn = curstate->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + prev_state->waiter = curstate->waiter; /* relink to the previous node if the last node was removed */ -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1482436..9fe16cf 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node) if (cur == node) { + PGconn *conn = curstate->s.conn; + + while(PQisBusy(conn)) +PQclear(PQgetResult(conn)); + prev_state->waiter = curstate->waiter; /* relink to the previous node if the last node was removed */
Re: Transactions involving multiple postgres foreign servers, take 2
On Tue, Sep 22, 2020 at 6:48 AM tsunakawa.ta...@fujitsu.com wrote: > > > > I think it's not necessarily that all FDW implementations need to be > > able to support xa_complete(). We can support both synchronous and > > asynchronous executions of prepare/commit/rollback. > > Yes, I think parallel prepare and commit can be an option for FDW. But I > don't think it's an option for a serious scale-out DBMS. If we want to use > FDW as part of PostgreSQL's scale-out infrastructure, we should design (if > not implemented in the first version) how the parallelism can be realized. > That design is also necessary because it could affect the FDW API. parallelism here has both pros and cons. If one of the servers errors out while preparing for a transaction, there is no point in preparing the transaction on other servers. In parallel execution we will prepare on multiple servers before realising that one of them has failed to do so. On the other hand preparing on multiple servers in parallel provides a speed up. But this can be an improvement on version 1. The current approach doesn't render such an improvement impossible. So if that's something hard to do, we should do that in the next version rather than complicating this patch. -- Best Wishes, Ashutosh Bapat
Re: Asynchronous Append on postgres_fdw nodes.
On 20.08.2020 10:36, Kyotaro Horiguchi wrote: At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby wrote in On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote: As the result of a discussion with Fujita-san off-list, I'm going to hold off development until he decides whether mine or Thomas' is better. The latest patch doesn't apply so I set as WoA. https://commitfest.postgresql.org/29/2491/ Thanks. This is rebased version. At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro wrote in Either way, we definitely need patch 0001. One comment: -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) I wonder if it's better to have it receive ResourceOwner like that, or to have it capture CurrentResourceOwner. I think the latter is more common in existing code. There's no existing WaitEventSets belonging to a resowner. So unconditionally capturing CurrentResourceOwner doesn't work well. I could pass a bool instead but that make things more complex. Come to think of "complex", ExecAsync stuff in this patch might be too-much for a short-term solution until executor overhaul, if it comes shortly. (the patch of mine here as a whole is like that, though..). The queueing stuff in postgres_fdw is, too. regards. Hi, Looks like current implementation of asynchronous append incorrectly handles the LIMIT clause: psql:append.sql:10: ERROR: another command is already in progress CONTEXT: remote SQL command: CLOSE c1 -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company append.sql Description: application/sql
Re: OpenSSL 3.0.0 compatibility
> On 22 Sep 2020, at 11:37, Peter Eisentraut > wrote: > > On 2020-09-18 16:11, Daniel Gustafsson wrote: >> Since we support ciphers that are now deprecated, we have no other choice >> than >> to load the legacy provider. > > Well, we could just have deprecated ciphers fail, unless the user loads the > legacy provider in the OS configuration. There might be an argument that > that is more proper. That's a fair point. The issue I have with that is that we'd impose a system-wide loading of the legacy provider, impacting other consumers of libssl as well. > As a more extreme analogy, what if OpenSSL remove a cipher from the legacy > provider? Are we then obliged to reimplement it manually for the purpose of > pgcrypto? Probably not. I don't think we have made any such guarantees of support, especially since it's in contrib/. That doesn't mean that some users won't expect it though. Another option would be to follow OpenSSL's deprecations and mark these ciphers as deprecated such that we can remove them in case they indeed get removed from libcrypto. That would give users a heads-up that they should have a migration plan if that time comes. > The code you wrote to load the necessary providers is small enough that I > think it's fine, but it's worth pondering this question briefly. Absolutely. cheers ./daniel
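For readers without the patch at hand, provider loading in OpenSSL 3.0.0 looks roughly like this (a sketch against the documented API, not necessarily the patch's exact code; note that once any provider is loaded explicitly, the default provider is no longer loaded implicitly, so both must be requested):

#include <stddef.h>
#include <openssl/provider.h>

/* load "legacy" (Blowfish, CAST5, DES, ...) plus "default" explicitly */
static int
load_legacy_and_default(void)
{
	OSSL_PROVIDER *legacy = OSSL_PROVIDER_load(NULL, "legacy");
	OSSL_PROVIDER *dflt = OSSL_PROVIDER_load(NULL, "default");

	return (legacy != NULL && dflt != NULL);
}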
Re: Lift line-length limit for pg_service.conf
> On 21 Sep 2020, at 17:09, Tom Lane wrote: > > Daniel Gustafsson writes: >> The pg_service.conf parsing thread [0] made me realize that we have a >> hardwired >> line length of max 256 bytes. Lifting this would be in line with recent work >> for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50). The attached >> moves >> pg_service.conf to use the new pg_get_line_append API and a StringInfo to >> lift >> the restriction. Any reason not to do that while we're lifting other such >> limits? > > +1. I'd been thinking of looking around at our fgets calls to see > which ones need this sort of work, but didn't get to it yet. I took a quick look and found the TOC parsing in pg_restore which used a 100-byte buffer and then did some juggling to find EOL for >100b long lines. There we won't see a bug report for exceeded line length, but simplifying the code seemed like a win to me, so I included that in the updated patch as well. > Personally, I'd avoid depending on StringInfo.cursor here, as the > dependency isn't really buying you anything. Fair enough, I was mainly a bit excited at finally finding a use for .cursor =) Fixed. > Also, the need for inserting the pfree into multiple exit paths kind > of makes me itch. I wonder if we should change the ending code to > look like > > exit: > fclose(f); > pfree(linebuf.data); > > return result; > > and then the early exit spots would be replaced with "result = x; > goto exit". (Some of them could use "break", but not all, so > it's probably better to consistently use "goto".) Agreed, fixed. I was a bit tempted to use something less magic and more descriptive than result = 3; but in the end opted for keeping changes to one thing at a time. cheers ./daniel 0001-Refactor-pg_service.conf-and-pg_restore-TOC-file-par.patch Description: Binary data
Re: [HACKERS] logical decoding of two-phase transactions
On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar wrote: > 1. > + /* > + * Process invalidation messages, even if we're not interested in the > + * transaction's contents, since the various caches need to always be > + * consistent. > + */ > + if (parsed->nmsgs > 0) > + { > + if (!ctx->fast_forward) > + ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr, > + parsed->nmsgs, parsed->msgs); > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr); > + } > + > > I think we don't need to add prepare-time invalidation messages as now we are already logging the invalidations at the command level and adding them to the reorder buffer. > Removed. > 2. > > + /* > + * Tell the reorderbuffer about the surviving subtransactions. We need to > + * do this because the main transaction itself has not committed since we > + * are in the prepare phase right now. So we need to be sure the snapshot > + * is setup correctly for the main transaction in case all changes > + * happened in subtransactions > + */ > + for (i = 0; i < parsed->nsubxacts; i++) > + { > + ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i], > + buf->origptr, buf->endptr); > + } > + > + if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) || > + (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) || > + ctx->fast_forward || FilterByOrigin(ctx, origin_id)) > + return; > > Do we need to call ReorderBufferCommitChild if we are skipping this > transaction? > I think the below check should be before calling ReorderBufferCommitChild. > Done. > 3. > > + /* > + * If it's ROLLBACK PREPARED then handle it via callbacks. > + */ > + if (TransactionIdIsValid(xid) && > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) && > + parsed->dbId == ctx->slot->data.database && > + !FilterByOrigin(ctx, origin_id) && > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)) > + { > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr, > + commit_time, origin_id, origin_lsn, > + parsed->twophase_gid, false); > + return; > + } > > > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare > so if those are not true then we wouldn't have prepared this > transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do we > need > to recheck these conditions. I didn't change this, as I am seeing cases where the Abort is getting called for transactions that need to be skipped. I also see that the same check is there both in DecodePrepare and DecodeCommit. So, while the same transactions were not getting prepared or committed, it tries to get ROLLBACK PREPARED (as part of finish prepared handling). The check in ReorderBufferTxnIsPrepared() is also not proper. I will need to revisit this logic in a future patch. > > > 4. > > + /* If streaming, reset the TXN so that it is allowed to stream > remaining data. */ > + if (streaming && stream_started) > + { > + ReorderBufferResetTXN(rb, txn, snapshot_now, > + command_id, prev_lsn, > + specinsert); > + } > + else > + { > + elog(LOG, "stopping decoding of %s (%u)", > + txn->gid[0] != '\0'? txn->gid:"", txn->xid); > + ReorderBufferTruncateTXN(rb, txn, true); > + } > > Why is only if (streaming) not enough? I agree that if we are coming here > in streaming mode then stream_started must be true, > but we already have an assert for that. > Changed.
Amit, I have also changed test_decoding startup to support two_phase commits only if specified, similar to how it was done for streaming. I have also changed the test cases accordingly. However, I have not added it to the pgoutput startup as that would require CREATE SUBSCRIPTION changes. I will do that in a future patch. Some other pending changes are: 1. Remove snapshots on prepare truncate. 2. Look at why ReorderBufferCleanupTXN is failing after a ReorderBufferTruncateTXN 3. Add prepare support to streaming regards, Ajin Cherian Fujitsu Australia v5-0001-Support-decoding-of-two-phase-transactions.patch Description: Binary data v5-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch Description: Binary data v5-0003-pgoutput-output-plugin-support-for-logical-decodi.patch Description: Binary data
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote: > > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote: > > > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > > wrote: > > > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small > > > > comment: > > > > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > > day or two. > > > > > > > Just a thought: > > > > Should we change the sequence of these #define in > > src/include/replication/logicalproto.h? > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > I would have changed above to something like this: > > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > I am not sure if this suggestion makes it better than what is proposed > by Dilip but I think we can declare them in define number order like > below: > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM Done this way. > Another thing: can we also test by having a publisher of v14 and > a subscriber of v13 or a prior version, just the reverse of what Ashutosh has > tested? I have also tested this and it works fine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v3-0001-Bugfix-in-logical-protocol-version.patch Description: Binary data
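For context, a sketch of how these constants gate the startup handshake on the publisher side (simplified; the surrounding function and variable names are assumptions, not the patch's exact code):

#include "postgres.h"
#include "replication/logicalproto.h"

/* reject protocol versions outside the range this walsender understands */
static void
check_client_proto_version(int client_version)
{
	if (client_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
		ereport(ERROR,
				(errmsg("client sent proto_version=%d but we only support protocol %d or lower",
						client_version, LOGICALREP_PROTO_MAX_VERSION_NUM)));

	if (client_version < LOGICALREP_PROTO_MIN_VERSION_NUM)
		ereport(ERROR,
				(errmsg("client sent proto_version=%d but we only support protocol %d or higher",
						client_version, LOGICALREP_PROTO_MIN_VERSION_NUM)));
}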
Re: Parallel copy
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > > On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow wrote: > > > > Fortunately I have been given permission to share the exact table > > definition and data I used, so you can check the behaviour and timings > > on your own test machine. > > > > Thanks Greg for the script. I ran your test case and I didn't observe > any increase in exec time with 1 worker, indeed, we have benefitted a > few seconds from 0 to 1 worker as expected. > > Execution time is in seconds. Each test case is executed 3 times on > release build. Each time the data directory is recreated. > > Case 1: 100 rows, 2GB > With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423 > With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678 > > With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822 > With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573 > > Case 2: 255 rows, 5GB > With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683 > With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307 > > With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049 > With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090 > Hi Greg, If you still observe the issue in your testing environment, I'm attaching a testing patch (that applies on top of the latest parallel copy patch set, i.e. v5 1 to 6) to capture various timings such as total copy time in leader and worker, index and table insertion time, and leader and worker waiting time. These logs are shown in the server log file. Few things to follow before testing: 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is expected because the 1st time the index will be built in a bottom-up manner (from leaves to root), but the 2nd time it has to search and insert at the proper leaves and inner B+Tree nodes. 2. If possible, can you also run with custom postgresql.conf settings [2] along with default? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our test case. For instance, with the default postgresql.conf file, it looks like checkpointing [3] is happening frequently; could you please let us know if that happens at your end? 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue. 4. I ran the tests in a performance test system where no other user processes (except system processes) are running. Is it possible for you to do the same? Please capture and share the timing logs with us. Here's a snapshot of how the added timings show up in the logs (I captured this with your test case 1: 100 rows, 2GB, custom postgresql.conf settings [2]).
with 0 workers: 2020-09-22 10:49:27.508 BST [163910] LOG: totaltableinsertiontime = 24072.034 ms 2020-09-22 10:49:27.508 BST [163910] LOG: totalindexinsertiontime = 60.682 ms 2020-09-22 10:49:27.508 BST [163910] LOG: totalcopytime = 59664.594 ms with 1 worker: 2020-09-22 10:53:58.409 BST [163947] LOG: totalcopyworkerwaitingtime = 59.815 ms 2020-09-22 10:53:58.409 BST [163947] LOG: totaltableinsertiontime = 23585.881 ms 2020-09-22 10:53:58.409 BST [163947] LOG: totalindexinsertiontime = 30.946 ms 2020-09-22 10:53:58.409 BST [163947] LOG: totalcopytimeworker = 47047.956 ms 2020-09-22 10:53:58.429 BST [163946] LOG: totalcopyleaderwaitingtime = 26746.744 ms 2020-09-22 10:53:58.429 BST [163946] LOG: totalcopytime = 47150.002 ms [1] 0 worker: LOG: totaltableinsertiontime = 25491.881 ms LOG: totalindexinsertiontime = 14136.104 ms LOG: totalcopytime = 75606.858 ms table is not dropped and so are indexes 1 worker: LOG: totalcopyworkerwaitingtime = 64.582 ms LOG: totaltableinsertiontime = 21360.875 ms LOG: totalindexinsertiontime = 24843.570 ms LOG: totalcopytimeworker = 69837.162 ms LOG: totalcopyleaderwaitingtime = 49548.441 ms LOG: totalcopytime = 69997.778 ms [2] custom postgresql.conf configuration: shared_buffers = 40GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = off checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off [3] LOG: checkpoints are occurring too frequently (14 seconds apart) HINT: Consider increasing the configuration parameter "max_wal_size". With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v1-0001-Parallel-Copy-Exec-Time-Capture.patch Description: Binary data
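The attached patch is essentially instrumentation; the timing capture it adds looks roughly like this (a sketch with a hypothetical helper, using one of the counter names reported in the log lines above):

#include "postgres.h"
#include "portability/instr_time.h"

static double totaltableinsertiontime = 0;	/* milliseconds */

/* wrap the timed operation between two instr_time snapshots */
static void
timed_table_insert(void)
{
	instr_time	start,
				duration;

	INSTR_TIME_SET_CURRENT(start);
	/* ... perform the actual table insertion here ... */
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);
	totaltableinsertiontime += INSTR_TIME_GET_MILLISEC(duration);
}

/* called once when the COPY finishes */
static void
report_copy_timings(void)
{
	elog(LOG, "totaltableinsertiontime = %.3f ms", totaltableinsertiontime);
}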
Re: OpenSSL 3.0.0 compatibility
On 2020-09-18 16:11, Daniel Gustafsson wrote: Since we support ciphers that are now deprecated, we have no other choice than to load the legacy provider. Well, we could just have deprecated ciphers fail, unless the user loads the legacy provider in the OS configuration. There might be an argument that that is more proper. As a more extreme analogy, what if OpenSSL remove a cipher from the legacy provider? Are we then obliged to reimplement it manually for the purpose of pgcrypto? Probably not. The code you wrote to load the necessary providers is small enough that I think it's fine, but it's worth pondering this question briefly. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binaries on s390x arch
Re: Namrata Bhave > We will be glad to obtain binaries for s390x on RHEL, SLES and Ubuntu > distros. Hi Namrata, thanks for getting back to me. Note that I'm working on the .deb packages for Debian and Ubuntu. For .rpm (RHEL, SLES) you need to get Devrim on board. (I'm putting him on Cc.) > We are ready to provide the necessary infra. To enable this, we would need > more information about VM configuration (No. of VMs, OS, vCPUs, memory, > Storage). Secondly, T&Cs would need to be signed electronically. > Please let me know if you are OK to proceed and we can communicate further. To be able to keep up with the other architectures when building binaries for PostgreSQL releases, the machine would have about the same specs as the existing arm64 or ppc64el build daemons: VMs: 1 OS: Debian buster vCPUs: 8 Memory: 16 GB Storage: 100GB (most of that is swap space for tmpfs) What does "T&Cs" mean? Not sure which body on the PostgreSQL side would be able to sign something. We are an open source project after all. Thanks, Christoph signature.asc Description: PGP signature
RE: Use appendStringInfoString and appendPQExpBufferStr where possible
> On Tue, 22 Sep 2020 at 17:00, Hou, Zhijie wrote: > > I found some more places that should use appendPQExpBufferStr instead of appendPQExpBuffer. > > > > Here is the patch. > > Seems like a good idea. Please add it to the next commitfest. Thanks, added it to the next commitfest. https://commitfest.postgresql.org/30/2735/ Best regards, houzj
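For anyone unfamiliar with the distinction being applied here, a small sketch (the query text and function are only illustrations; quoting/escaping is omitted for brevity):

#include "pqexpbuffer.h"

static void
build_query(PQExpBuffer buf, const char *pattern)
{
	/* no % placeholders: the Str variant skips printf-style parsing */
	appendPQExpBufferStr(buf, "SELECT relname FROM pg_class");

	/* placeholders present: the formatting variant is actually needed */
	appendPQExpBuffer(buf, " WHERE relname ~ '%s'", pattern);
}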
Re: pgindent vs dtrace on macos
> On 21 Sep 2020, at 20:55, Tom Lane wrote: > > Oh wait, I forgot about the fmgrprotos.h discrepancy. > > I wrote: >> It strikes me that a low-cost workaround would be to rename these >> C functions. There's no law that their C names must match the >> SQL names. > > Here's a proposed patch to fix it that way. +1 on this patch. Do you think it's worth adding a note about this in the documentation to save the next person who stares at this a few minutes? Something along the lines of: --- a/src/tools/pgindent/README +++ b/src/tools/pgindent/README @@ -110,6 +110,9 @@ Sometimes, if pgindent or perltidy produces odd-looking output, it's because of minor bugs like extra commas. Don't hesitate to clean that up while you're at it. +If an exported function shares a name with a typedef, the header file with the +prototype can get incorrect spacing for the function. + cheers ./daniel
Re: Global snapshots
22.09.2020 03:47, tsunakawa.ta...@fujitsu.com writes: Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infringing on the patent? According to the patent holder, the difference between Clock-SI and the patent seems to be fewer than the similarities. Thank you for this work! As I can see, the main development difficulties lie in other areas: CSN, resolver, global deadlocks, 2PC commit... I'm not a lawyer either. But if we get remarks from the patent holders, we can rewrite our Clock-SI implementation. -- regards, Andrey Lepikhov Postgres Professional
Re: Use appendStringInfoString and appendPQExpBufferStr where possible
On Tue, 22 Sep 2020 at 17:00, Hou, Zhijie wrote: > I found some more places that should use appendPQExrBufferStr instead of > appendPQExpBuffer. > > Here is the patch. Seems like a good idea. Please add it to the next commitfest. David
Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers
On Mon, 21 Sep 2020 at 19:15, Peter Eisentraut wrote: > > On 2020-09-21 05:48, Amit Kapila wrote: > > What according to you should be the behavior here and how will it be > > better than current? > > I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers > (up to the number of indexes), even if max_parallel_maintenance_workers > is 2. It would be good if we were consistent with these parallel options. Right now max_parallel_workers_per_gather will restrict the parallel_workers reloption. I'd say this max_parallel_workers_per_gather is similar to max_parallel_maintenance_workers here and the PARALLEL vacuum option is like the parallel_workers reloption. If we want VACUUM's parallel option to work the same way as that then max_parallel_maintenance_workers should restrict whatever is mentioned in VACUUM PARALLEL. Or perhaps this is slightly different as the user is explicitly asking for this in the command, but you could likely say the same about ALTER TABLE SET (parallel_workers = N); too. David
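The behavior under discussion boils down to a clamp along these lines (a simplified sketch; the real logic in the vacuum code also accounts for which indexes can actually participate in parallel processing):

#include "postgres.h"

extern int	max_parallel_maintenance_workers;	/* GUC, declared in the backend headers */

/* simplified sketch of how the launched worker count is derived today */
static int
compute_workers_sketch(int nindexes, int nrequested)
{
	int			parallel_workers;

	/* PARALLEL N caps at the number of indexes; otherwise use them all */
	parallel_workers = (nrequested > 0)
		? Min(nrequested, nindexes)
		: nindexes;

	/* the clamp being debated: PARALLEL N cannot exceed the GUC */
	return Min(parallel_workers, max_parallel_maintenance_workers);
}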
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
On Fri, 11 Sep 2020 at 02:01, Peter Eisentraut wrote: > > On 2020-09-06 02:24, David Rowley wrote: > >> I would add DEBUG1 back into the conditional, like > >> > >> if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <= > >> DEBUG1) ? \ > > > > hmm, but surely we don't want to move all code that's in the same > > branch as an elog(DEBUG1) call into a cold area. > > Yeah, nevermind that. I've reattached the v4 patch since it just does the >= ERROR case. > > The v3 patch just put an unlikely() around the errstart() call if the > > level was <= DEBUG1. That just to move the code that's inside the if > > (errstart(...)) in the macro into a cold area. > > That could be useful. Depends on how much effect it has. I wonder if it is. I'm having trouble even seeing gains from the ERROR case and I'm considering dropping this patch due to that. I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc 9.3. I'm unable to see any gains with this, however, the results were pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely need to run that a bit longer. I'll do that tonight. It would be good if someone else could run some tests on their own hardware to see if they can see any gains. David elog_ereport_attribute_cold_v4.patch Description: Binary data tpch_scale5_elog_ereport_cold_v4_vs_master.ods Description: application/vnd.oasis.opendocument.spreadsheet
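For reference, the mechanism being benchmarked, as a sketch (assuming GCC/Clang; ERRSTART and errstart_cold follow the thread's naming conventions, not necessarily the final patch):

/* mark the error-raising starter as cold, so the compiler lays out every
 * constant-elevel >= ERROR call site's branch outside the hot path */
#if defined(__GNUC__) || defined(__clang__)
#define pg_attribute_cold __attribute__((cold))
#else
#define pg_attribute_cold
#endif

extern bool errstart(int elevel, const char *domain);
extern pg_attribute_cold bool errstart_cold(int elevel, const char *domain);

/* inside the ereport() machinery: constant elevel >= ERROR goes cold */
#define ERRSTART(elevel, domain) \
	((__builtin_constant_p(elevel) && (elevel) >= ERROR) ? \
	 errstart_cold(elevel, domain) : \
	 errstart(elevel, domain))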
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:22 PM Ashutosh Sharma wrote: > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote: > > > > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote: > > > > > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > > > > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > > > wrote: > > > > > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small > > > > > comment: > > > > > > > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > > > day or two. > > > > > > > > > > Just a thought: > > > > > > Should we change the sequence of these #define in > > > src/include/replication/logicalproto.h? > > > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > > > I would have changed above to something like this: > > > > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > > > > I am not sure if this suggestion makes it better than what is proposed > > by Dilip but I think we can declare them in define number order like > > below: > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > The only reason I proposed that was because for the *_MAX_VERSION_NUM > we are using the latest PROTOCOL version name in its definition so why > not do the same for defining *_MIN_VERSION_NUM as well. Other than > that, I also wanted to correct the sequence so that they are defined > in increasing order, which you have already done here. > > > Another thing: can we also test by having a publisher of v14 and > > a subscriber of v13 or a prior version, just the reverse of what Ashutosh has > > tested? > > I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine. I meant LR from PGv14 (Publisher) to PGv12 (Subscriber), not the other way around. > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote:
>
> On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote:
> >
> > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote:
> > >
> > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote:
> > > >
> > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small
> > > > comment:
> > > >
> > >
> > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > > day or two.
> > >
> >
> > Just a thought:
> >
> > Should we change the sequence of these #defines in
> > src/include/replication/logicalproto.h?
> >
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > I would have changed the above to something like this:
> >
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> >
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
>
> I am not sure this suggestion makes it better than what is proposed by
> Dilip, but I think we can declare them in increasing numeric order, like
> below:
> #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>

The only reason I proposed that was that for *_MAX_VERSION_NUM we are
using the latest protocol version name in its definition, so why not do
the same for *_MIN_VERSION_NUM as well. Other than that, I also wanted to
correct the sequence so that they are defined in increasing order, which
you have already done here.

> Another thing: can we also test with a publisher of v14 and a
> subscriber of v13 or a prior version, just the reverse of what Ashutosh
> has tested?
>

I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote:
>
> On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote:
> >
> > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote:
> > >
> > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
> > >
> >
> > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > day or two.
> >
>
> Just a thought:
>
> Should we change the sequence of these #defines in
> src/include/replication/logicalproto.h?
>
> #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>
> I would have changed the above to something like this:
>
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
>
> #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>

I am not sure this suggestion makes it better than what is proposed by
Dilip, but I think we can declare them in increasing numeric order, like
below:

#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM

Another thing: can we also test with a publisher of v14 and a subscriber
of v13 or a prior version, just the reverse of what Ashutosh has tested?

--
With Regards,
Amit Kapila.
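For context on why both bounds exist at all: the subscriber requests a proto_version and the publisher must reject anything outside the range it can speak. Here is a minimal self-contained sketch of such a range check, assuming a simple reject-and-exit handler; the helper name and error handling are illustrative, not the actual pgoutput code:

#include <stdio.h>
#include <stdlib.h>

#define LOGICALREP_PROTO_MIN_VERSION_NUM    1
#define LOGICALREP_PROTO_VERSION_NUM        1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_MAX_VERSION_NUM    LOGICALREP_PROTO_STREAM_VERSION_NUM

/*
 * Reject any requested protocol version outside [MIN, MAX]. Everything
 * inside the range must be understood, which is why both bounds are kept
 * as explicit, named limits.
 */
static void
check_proto_version(int client_version)
{
	if (client_version < LOGICALREP_PROTO_MIN_VERSION_NUM ||
		client_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
	{
		fprintf(stderr,
				"client requested proto_version %d, server supports %d..%d\n",
				client_version,
				LOGICALREP_PROTO_MIN_VERSION_NUM,
				LOGICALREP_PROTO_MAX_VERSION_NUM);
		exit(1);
	}
	printf("proto_version %d accepted\n", client_version);
}

int
main(void)
{
	check_proto_version(2);		/* v14 subscriber: streaming protocol */
	check_proto_version(1);		/* v13-and-below subscriber */
	check_proto_version(3);		/* unknown future version: rejected */
	return 0;
}

Tying MIN to LOGICALREP_PROTO_VERSION_NUM and MAX to the newest named version, as suggested above, keeps a check like this correct automatically when a new protocol version is added.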
Re: Dynamic gathering the values for seq_page_cost/xxx_cost
>> It's probably worth testing on various other storage systems to see
>> how that applies to those.
>
> Yes, I can test more on new hardware once I get it; that is still in
> progress. However, I can only get a physical machine with an SSD or a
> virtual machine with an SSD; other types are hard for me to get right now.

Here is a result on different hardware. The test method is unchanged. [1]

Hardware info: virtual machine with 61GB memory.
Linux kernel: 5.4.0-31-generic Ubuntu

# lshw -short -C disk
H/W path      Device    Class  Description
==========================================
/0/100/4/0    /dev/vda  disk   42GB Virtual I/O device
/0/100/5/0    /dev/vdb  disk   42GB Virtual I/O device

The disk on the physical machine is claimed to be an SSD. This time FIO
and my tool produce exactly the same result:

fs_cache_lat = 0.957756us, seq_read_lat = 70.780327us, random_page_lat = 438.837257us

cache hit ratio: 1.00  random_page_cost 1.00
cache hit ratio: 0.90  random_page_cost 5.635470
cache hit ratio: 0.50  random_page_cost 6.130565
cache hit ratio: 0.10  random_page_cost 6.192183
cache hit ratio: 0.00  random_page_cost 6.199989

|         | seq_read_lat(us) | random_read_lat(us) |
|---------+------------------+---------------------|
| FIO     |               70 |                 437 |
| My tool |               70 |                 438 |

The following query plans changed after we raised random_page_cost from 4
to 6.2, and the execution times changed as well:

|     | random_page_cost=4 | random_page_cost=6.2 |
|-----+--------------------+----------------------|
| Q1  |               2561 |             2528.272 |
| Q10 |           4675.749 |             4684.225 |
| Q13 |          18858.048 |            18565.929 |
| Q2  |            329.279 |              308.723 |
| Q5  |          46248.132 |             7900.173 |
| Q6  |          52526.462 |            47639.503 |
| Q7  |          27348.900 |            25829.221 |

Q5 improved by 5.8 times, and Q6 & Q7 improved by ~10%.

[1] https://www.postgresql.org/message-id/CAKU4AWpRv50k8E3tC3tiLWGe2DbKaoZricRh_YJ8y_zK%2BHdSjQ%40mail.gmail.com

--
Best Regards
Andy Fan
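The cache-hit-ratio figures above are reproduced exactly by a simple latency-blend model in which both sequential and random reads hit the OS cache with probability h. This is an inference from the posted numbers; the tool itself may compute it differently. A self-contained sketch:

#include <stdio.h>

/*
 * Blend model: with cache hit ratio h, a random page costs
 * h*cache_lat + (1-h)*random_lat, and a sequential page costs
 * h*cache_lat + (1-h)*seq_lat. random_page_cost is their ratio,
 * since seq_page_cost is normalized to 1.
 */
static double
effective_random_page_cost(double h, double cache_lat,
						   double seq_lat, double random_lat)
{
	double random_eff = h * cache_lat + (1.0 - h) * random_lat;
	double seq_eff    = h * cache_lat + (1.0 - h) * seq_lat;

	return random_eff / seq_eff;
}

int
main(void)
{
	/* the latencies measured above, in microseconds */
	double cache_lat = 0.957756;
	double seq_lat = 70.780327;
	double random_lat = 438.837257;
	double ratios[] = {1.00, 0.90, 0.50, 0.10, 0.00};

	for (int i = 0; i < 5; i++)
		printf("cache hit ratio: %.2f random_page_cost %f\n",
			   ratios[i],
			   effective_random_page_cost(ratios[i], cache_lat,
										  seq_lat, random_lat));
	return 0;
}

At h = 0 this reduces to random_page_lat / seq_read_lat, which is about 6.2, and at h = 1 both blends collapse to fs_cache_lat, giving exactly 1.0, matching the endpoints of the table above.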