Re: Transparent Data Encryption (TDE) and encrypted files
On Mon, Oct 14, 2019 at 3:42 PM Antonin Houska wrote:
>
> Masahiko Sawada wrote:
> > On Wed, Oct 9, 2019 at 3:57 PM Antonin Houska wrote:
> > > Moon, Insung wrote:
> > >
> > > v04-0011-Make-buffile.c-aware-of-encryption.patch in [1] changes
> > > buffile.c so that data is read and written in 8kB blocks if encryption
> > > is enabled. In order to record the IV per block, the computation of the
> > > buffer position within the file would have to be adjusted somehow. I
> > > can check it soon but not in the next few days.
> >
> > As far as I read the patch, the nonce consists of pid, counter and block
> > number, where the counter is incremented each time a BufFile is created.
> > Therefore the buffer data could be rewritten with the same nonce and key,
> > which is bad.
>
> This patch was written before the requirement on a non-repeating IV was
> raised, and it does not use the AES-CTR mode. I mentioned it here because
> it reads / writes data in 8kB blocks.
>
> > So I think we can have a rewrite counter of the block in each 8kB block
> > header. And then the nonce consists of the block number within a segment
> > file (4 bytes), temp file counter (8 bytes), rewrite counter (2 bytes)
> > and CTR mode counter (2 bytes). And then if we have a single-use
> > encryption key per backend process, I guess we can guarantee the
> > uniqueness of the combination of key and nonce.
>
> Since the segment size is 1 GB, the segment consists of 2^17 blocks, so 4
> bytes will not be fully utilized.
>
> As for the "CTR mode counter", consider that it gets incremented once per
> 16 bytes of input. So even if BLCKSZ is 32 kB, we need no more than 11 bits
> for this counter.
>
> If these two parts become smaller, we can perhaps increase the size of the
> "rewrite counter".

Yeah, I designed it that way to make the implementation easier, but we can
increase the size of the rewrite counter to 3 bytes while the block number
uses 3 bytes.

Regards,

--
Masahiko Sawada
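The adjusted field widths discussed above (3-byte block number, 8-byte temp file counter, 3-byte rewrite counter, 2-byte CTR counter) add up to exactly one 16-byte AES block. A minimal sketch of such a packing follows; the field order, the little-endian layout, and the build_iv() helper are illustrative assumptions, not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical IV layout: pack the proposed fields into 16 bytes.
 * Widths follow the adjusted proposal: 3-byte block number, 8-byte
 * temp-file counter, 3-byte rewrite counter, 2-byte CTR counter.
 * Names and ordering are illustrative only. */
static void
build_iv(uint8_t iv[16], uint32_t blockno, uint64_t file_counter,
         uint32_t rewrite_counter, uint16_t ctr_counter)
{
    memset(iv, 0, 16);
    /* 3 bytes: block number within a 1 GB segment (only 17 bits needed) */
    iv[0] = blockno & 0xFF;
    iv[1] = (blockno >> 8) & 0xFF;
    iv[2] = (blockno >> 16) & 0xFF;
    /* 8 bytes: per-backend temp file counter (native byte order here) */
    memcpy(iv + 3, &file_counter, 8);
    /* 3 bytes: rewrite counter for this 8 kB block */
    iv[11] = rewrite_counter & 0xFF;
    iv[12] = (rewrite_counter >> 8) & 0xFF;
    iv[13] = (rewrite_counter >> 16) & 0xFF;
    /* 2 bytes: AES-CTR per-16-byte-chunk counter (11 bits suffice at 32 kB) */
    iv[14] = ctr_counter & 0xFF;
    iv[15] = (ctr_counter >> 8) & 0xFF;
}
```

The point of the layout is simply that bumping the rewrite counter on every block rewrite yields a distinct IV even when the same block number and file counter recur.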
Re: [HACKERS] Block level parallel vacuum
On Mon, Oct 14, 2019 at 6:37 PM Amit Kapila wrote:
>
> On Sat, Oct 12, 2019 at 4:50 PM Amit Kapila wrote:
> > On Sat, Oct 12, 2019 at 11:29 AM Masahiko Sawada wrote:
> > >
> >
> > I see a much bigger problem with the way this patch collects the index
> > stats in shared memory. IIUC, it allocates the shared memory (DSM) for
> > all the index stats, in the same way, considering its size as
> > IndexBulkDeleteResult. For the first time, it gets the stats from local
> > memory as returned by the ambulkdelete/amvacuumcleanup call and then
> > copies it into shared memory space. From there onwards, it always
> > updates the stats in shared memory by pointing each index's stats to
> > that memory. In this scheme, you overlooked the point that an index AM
> > could choose to return a larger structure of which IndexBulkDeleteResult
> > is just the first field. This generally provides a way for ambulkdelete
> > to communicate additional private data to amvacuumcleanup. We use this
> > idea in the gist index; see how gistbulkdelete and gistvacuumcleanup
> > work. The current design won't work for such cases.

Indeed. That's a very good point. Thank you for pointing that out.

> Today, I looked at gistbulkdelete and gistvacuumcleanup closely and I have
> a few observations about those which might help us to solve this problem
> for gist indexes:
> 1. Are we using memory context GistBulkDeleteResult->page_set_context? It
> seems to me it is not being used.

Yes, I also think this memory context is not being used.

> 2. Each time we perform gistbulkdelete, we always seem to reset the
> GistBulkDeleteResult stats, see gistvacuumscan. So, how will it accumulate
> them for the cleanup phase when the vacuum needs to call gistbulkdelete
> multiple times because the available space for dead tuples is filled? It
> seems to me like we only use the stats from the very last call to
> gistbulkdelete.

I think you're right.
gistbulkdelete scans all pages and collects all internal pages and all empty pages, and then gistvacuumcleanup uses them to unlink all the empty pages. Currently it accumulates such information over multiple gistbulkdelete calls because it fails to switch to that memory context, but I guess this code intends to use only the information from the very last call to gistbulkdelete.

> 3. Do we really need to give the responsibility of deleting empty pages
> (gistvacuum_delete_empty_pages) to gistvacuumcleanup? Can't we do it in
> gistbulkdelete? I see one advantage of postponing it till the cleanup
> phase, which is if somehow we can accumulate stats over multiple calls of
> gistbulkdelete, but I am not sure if it is feasible. At least, the way the
> current code works, it seems that there is no advantage to postponing
> deleting empty pages till the cleanup phase.

Considering the current page-deletion strategy of the gist index, the advantage of postponing page deletion till the cleanup phase is that we can do the bulk deletion in the cleanup phase, which is called at most once. But I wonder if we can do the page deletion in a similar way to the btree index. Or, even if we keep the current strategy, I think we can do that without passing the page information from bulkdelete to vacuumcleanup via GistBulkDeleteResult.

> If we avoid postponing deleting empty pages till the cleanup phase, then
> we don't have the problem for gist indexes.

Yes. But given the point you raised, I guess there might be other index AMs that use the stats returned from bulkdelete in a similar way to the gist index (i.e. using a larger structure of which IndexBulkDeleteResult is just the first field). If we have the same concern there, the parallel vacuum still needs to deal with that, as you mentioned.

Regards,

--
Masahiko Sawada
Re: Collation versioning
On Fri, Oct 11, 2019 at 11:41 PM Thomas Munro wrote: > On Thu, Oct 10, 2019 at 8:38 AM Peter Eisentraut > wrote: > > Actually, I had to revert that because pg_dump and pg_upgrade tests need > > to be updated, but that seems doable. > > [Returning from a couple of weeks mostly away from computers] > > Right, sorry about that. Here is a new version that fixes that test, > and also gives credit to Christoph for the idea in the commit message. Here's a version with a small note added to the documentation. I'm planning to commit this tomorrow. To actually make this useful for most users, we need version tracking for the default collation. I noticed that the ICU-as-default patch[1] can do that for ICU collations (though I haven't looked closely yet). Currently, its change to get_collation_actual_version() for the default collation applies only when the default provider is ICU, but if you just take out that condition when rebasing it should do the right thing, I think? [1] https://www.postgresql.org/message-id/attachment/104646/v1-0002-Add-option-to-use-ICU-as-global-collation-provide_rebased.patch 0001-Use-libc-version-as-a-collation-version-on-glibc--v3.patch Description: Binary data
ProcArrayGroupClearXid() compare-exchange style
ProcArrayGroupClearXid() has this:

    while (true)
    {
        nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
        ...
        if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
                                           &nextidx,
                                           (uint32) proc->pgprocno))
            break;
    }

This, from UnpinBuffer(), is our more-typical style:

    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        ...
        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
            break;
    }

That is, we typically put the pg_atomic_read_u32() outside the loop. After the first iteration, it is redundant with the side effect of pg_atomic_compare_exchange_u32(). I haven't checked whether this materially improves performance, but, for style, I would like to change it in HEAD.

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8abcfdf..3da5307 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -490,15 +490,15 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
 	/* We should definitely have an XID to clear. */
 	Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
 	/* Add ourselves to the list of processes needing a group XID clear. */
 	proc->procArrayGroupMember = true;
 	proc->procArrayGroupMemberXid = latestXid;
+	nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
 	while (true)
 	{
-		nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
 		pg_atomic_write_u32(&proc->procArrayGroupNext, nextidx);
 
 		if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
 										   &nextidx,
 										   (uint32) proc->pgprocno))
 			break;
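The side effect that makes the hoisted read safe can be sketched with C11 atomics, whose strong compare-exchange behaves like pg_atomic_compare_exchange_u32() in the relevant respect: on failure it writes the currently observed value back into the "expected" argument. The function and variable names below are illustrative, not PostgreSQL code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the preferred style: read the head once before the loop.
 * Each failed compare-exchange refreshes "old" with the latest value
 * as a side effect, so re-reading inside the loop is redundant. */
static void
push_list_head(_Atomic uint32_t *head, uint32_t *my_next, uint32_t my_index)
{
    uint32_t old = atomic_load(head);   /* single read, outside the loop */

    for (;;)
    {
        *my_next = old;                 /* link ourselves to the current head */
        if (atomic_compare_exchange_strong(head, &old, my_index))
            break;                      /* success; on failure "old" was updated */
    }
}
```

Under contention this saves one atomic read per retry; as the email says, the bigger point is stylistic consistency with the rest of the tree.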
Re: CREATE TEXT SEARCH DICTIONARY segfaulting on 9.6+
I spent a bit of time investigating this, and it seems the new code is somewhat too trusting when it comes to data from the affix/dict files. In this particular case, it boils down to this code in NISortDictionary:

    if (Conf->useFlagAliases)
    {
        for (i = 0; i < Conf->nspell; i++)
        {
            char       *end;

            if (*Conf->Spell[i]->p.flag != '\0')
            {
                curaffix = strtol(Conf->Spell[i]->p.flag, &end, 10);
                if (Conf->Spell[i]->p.flag == end || errno == ERANGE)
                    ereport(ERROR,
                            (errcode(ERRCODE_CONFIG_FILE_ERROR),
                             errmsg("invalid affix alias \"%s\"",
                                    Conf->Spell[i]->p.flag)));
            }
            ...
            Conf->Spell[i]->p.d.affix = curaffix;
            ...
        }
        ...
    }

So it simply grabs whatever it finds in the dict file, parses it, and then (later) we use it as an index into the AffixData array, even if the value is way out of bounds. In the example, hunspell_sample_long.affix contains about 10 affixes, but then we parse the hunspell_sample_num.dict file, stumble upon book/302,301,202,303, parse the flags as integers, and interpret them as indexes into the AffixData array. Clearly, 303 is way out of bounds, triggering the segfault.

So I think we need some sort of cross-check here. We certainly need to make NISortDictionary() check that the affix value is within AffixData bounds, and error out when the index is nonsensical (maybe negative and/or exceeding nAffixData).

Maybe there's a simple way to check whether the affix/dict files match. The failing affix file has FLAG num, while with FLAG long it works just fine. But I'm not sure that's actually possible, because I don't see anything in hunspell_sample_num.dict that would allow us to decide that it expects "FLAG num" and not "FLAG long". Furthermore, we certainly can't rely on this - we still need to check the range.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
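The kind of cross-check being proposed can be sketched in isolation. parse_affix_alias() below is a hypothetical standalone helper, not the actual NISortDictionary() code, and the assumption that valid indexes lie in [0, nAffixData) is mine:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a flag-alias string and reject anything
 * that is not a valid AffixData index, instead of letting a bogus
 * value like 303 be used as an array index later. Returns the index,
 * or -1 where the real code would ereport(ERROR). */
static int
parse_affix_alias(const char *flag, int nAffixData)
{
    char   *end;
    long    curaffix;

    errno = 0;
    curaffix = strtol(flag, &end, 10);
    if (end == flag || errno == ERANGE)
        return -1;              /* "invalid affix alias" */
    if (curaffix < 0 || curaffix >= nAffixData)
        return -1;              /* out of AffixData bounds: error, not crash */
    return (int) curaffix;
}
```

With such a check, the mismatched FLAG num dict file would produce a clean configuration-file error instead of an out-of-bounds array access.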
Re: Fix most -Wundef warnings
Peter Eisentraut writes: > During the cleanup of the _MSC_VER versions (commit > 38d8dce61fff09daae0edb6bcdd42b0c7f10ebcd), I found it useful to use > -Wundef, but that resulted in a bunch of gratuitous warnings. Here is a > patch to fix those. Most of these are just stylistic cleanups, but the > change in pg_bswap.h is potentially useful to avoid misuse by > third-party extensions. Looks reasonable offhand. regards, tom lane
Re: "pg_ctl: the PID file ... is empty" at end of make check
On Tue, Oct 15, 2019 at 1:55 PM Tom Lane wrote:
> Thomas Munro writes:
> > Agreed. Secret non-shareable bug report filed. Fingers crossed.
>
> Since that conversation, longfin has shown the same symptom just once more:
>
> longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for server to shut down..pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STABLE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata/postmaster.pid" is empty
>
> and now prairiedog has shown it too:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47
>
> which is positively fascinating, because prairiedog is running a bronze-age
> version of macOS that surely never heard of APFS. So this makes it look
> like this is a basic macOS bug that's not as filesystem-dependent as one
> might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

> Trawling the buildfarm database finds no other matches in the last year,
> so whatever it is, it's pretty darn improbable.
>
> Did you ever get any reaction to that bug report?

No acknowledgement. I did learn from an anonymous source that "it’s prioritized and in the queue".
Re: v12.0: ERROR: could not find pathkey item to sort
On Sun, Oct 13, 2019 at 02:06:02PM -0400, Tom Lane wrote:
> Justin Pryzby writes:
> > On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
> >> Could you provide a self-contained test case please?
>
> > [ test case ]
>
> Oh, this is the same issue Amit described in
> https://www.postgresql.org/message-id/flat/CA%2BHiwqG2WVUGmLJqtR0tPFhniO%3DH%3D9qQ%2BZ3L_ZC%2BY3-EVQHFGg%40mail.gmail.com
> namely that we're not generating EquivalenceClass members corresponding
> to sub-joins of a partitionwise join.
>
> Are you interested in helping to test the patches proposed there?

Sure. Any requests other than testing that our original query works correctly, and maybe endeavoring to read the patch?

BTW, it probably should've been documented as an "Open Item" for v12.

--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581
Re: "pg_ctl: the PID file ... is empty" at end of make check
[ blast from the past dept. ]

Thomas Munro writes:
> On Thu, Nov 29, 2018 at 3:30 AM Tom Lane wrote:
>> Thomas Munro writes:
>>> https://github.com/macdice/unlinktest
>> Bleah. But you can do better than ask whether it's a bug: you can
>> quote POSIX:
>> ...
>> Not a lot of wiggle room there.

> Agreed. Secret non-shareable bug report filed. Fingers crossed.

Since that conversation, longfin has shown the same symptom just once more:

longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for server to shut down..pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STABLE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata/postmaster.pid" is empty

and now prairiedog has shown it too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47

which is positively fascinating, because prairiedog is running a bronze-age version of macOS that surely never heard of APFS. So this makes it look like this is a basic macOS bug that's not as filesystem-dependent as one might think.

Trawling the buildfarm database finds no other matches in the last year, so whatever it is, it's pretty darn improbable.

Did you ever get any reaction to that bug report?

			regards, tom lane
Re: v12.0: segfault in reindex CONCURRENTLY
On Sun, Oct 13, 2019 at 03:10:21PM -0300, Alvaro Herrera wrote:
> On 2019-Oct-13, Justin Pryzby wrote:
>
> > Looks like it's a race condition and dereferencing *holder=NULL. The
> > first crash was probably the same bug, due to report query running
> > during "reindex CONCURRENTLY", and probably finished at nearly the same
> > time as another locker.
>
> Ooh, right, makes sense. There's another spot with the same mistake ...
> this patch should fix it.

I would maybe chop off the 2nd sentence, since conditionalizing indicates that we do actually care.

    +         * If requested, publish who we're going to wait for. This is not
    +         * 100% accurate if they're already gone, but we don't care.

Justin
Re: stress test for parallel workers
I wrote:
> Filed at https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...

Further to this --- I went back and looked at the outlier events where we saw an infinite_recurse failure on a non-Linux-PPC64 platform. There were only three:

 mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse();
 mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse();
 mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | HEAD          | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse();

Looking closer at these, though, they were *not* SIGSEGV failures, but SIGKILLs. Seeing that they were all on the same machine on the same day, I'm thinking we can write them off as a transiently misconfigured OOM killer.

So, pending some other theory emerging from the kernel hackers, we're down to it's-a-PPC64-kernel-bug. That leaves me wondering what, if anything, we want to do about it. Even if it's fixed reasonably promptly in Linux upstream, and then we successfully nag assorted vendors to incorporate the fix quickly, that's still going to leave us with frequent buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux versions.

Should we move the infinite_recurse test to happen alone in a parallel group, just to stop these failures? That's annoying from a parallelism standpoint, but I don't see any other way to avoid them.

			regards, tom lane
Re: v12.0: segfault in reindex CONCURRENTLY
On 2019-Oct-13, Justin Pryzby wrote:

> Looks like it's a race condition and dereferencing *holder=NULL. The first
> crash was probably the same bug, due to report query running during
> "reindex CONCURRENTLY", and probably finished at nearly the same time as
> another locker.

Ooh, right, makes sense. There's another spot with the same mistake ... this patch should fix it.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 70f9b6729a..5f4ee86f70 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -388,8 +388,9 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 			{
 				PGPROC	   *holder = BackendIdGetProc(old_snapshots[i].backendId);
 
-				pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
-											 holder->pid);
+				if (holder)
+					pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
+												 holder->pid);
 			}
 			VirtualXactLock(old_snapshots[i], true);
 		}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4682438114..63c4188efc 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -908,8 +908,9 @@ WaitForLockersMultiple(List *locktags, LOCKMODE lockmode, bool progress)
 			{
 				PGPROC	   *holder = BackendIdGetProc(lockholders->backendId);
 
-				pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
-											 holder->pid);
+				if (holder)
+					pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
+												 holder->pid);
 			}
 			VirtualXactLock(*lockholders, true);
 			lockholders++;
tuplesort test coverage
Hi,

[1] made me look at tuplesort's test coverage at
https://coverage.postgresql.org/src/backend/utils/sort/tuplesort.c.gcov.html

We don't have coverage for quite a number of things:

- cluster for expression indexes (line 935)
- sorts exceeding INT_MAX / 2 memory (line 1337), but that seems hard to test realistically
- aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)
- in-memory backwards scans (lines 1936, 3042)
- *any* coverage for TSS_SORTEDONTAPE (line 1964)
- disk sort skiptuples (line 2325)
- mergeruns without abbrev key (line 2582)
- disk sorts with more than one run (lines 2707, 2789)
- any disk-based tuplesort_begin_heap() (lines 3649, 3676)
- copytup_index currently seems essentially dead, because tuplesort_putindextuplevalues() doesn't use COPYTUP (line 4142)
- any disk-based tuplesort_begin_datum (lines 4282, 4323)

I'm pretty unhappy that tuplesort has been whacked around pretty heavily in the last few years while *reducing* effective test coverage noticeably, rather than increasing it. There are pretty substantial and nontrivial areas without any tests - do we actually have any confidence that they work? The largest culprits for that seem to be abbreviated keys, the tape logic overhaul, and the increase of work_mem.

Greetings,

Andres Freund
Re: Columns correlation and adaptive query optimization
Hello Konstantin,

What you have proposed regarding join_selectivity and multicolumn statistics is very good news!

Regarding your auto_explain modification, maybe an "advisor" mode would also be helpful (with auto_explain_add_statistics_threshold=-1, for example). This would allow tracking which missing statistics should be tested (manually or in another environment).

In my view, this advice should be an option of the EXPLAIN command, which would also permit the auto_explain module to offer a "learning" phase.

Regards

PAscal

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: dropping column prevented due to inherited index
On Sat, Oct 12, 2019 at 03:08:41PM +0900, Michael Paquier wrote: > On Fri, Oct 11, 2019 at 06:39:47PM -0300, Alvaro Herrera wrote: >> Typo "resursing". This comment seems a bit too long to me. Maybe >> "Recursion having ended, drop everything that was collected." suffices. >> (Fits in one line.) > > Sounds fine to me, thanks. I have been able to look at this issue once again, and applied the fix down to v12. Thanks, all! -- Michael signature.asc Description: PGP signature
Re: Fix most -Wundef warnings
On 10/13/19 12:25 PM, Peter Eisentraut wrote:
> diff --git a/contrib/hstore/hstore_compat.c b/contrib/hstore/hstore_compat.c
> index 1d4e7484e4..d75e9cb23f 100644
> --- a/contrib/hstore/hstore_compat.c
> +++ b/contrib/hstore/hstore_compat.c
> @@ -299,7 +299,7 @@ hstoreUpgrade(Datum orig)
>  	if (valid_new)
>  	{
> -#if HSTORE_IS_HSTORE_NEW
> +#ifdef HSTORE_IS_HSTORE_NEW
>  		elog(WARNING, "ambiguous hstore value resolved as hstore-new");

Checking the current sources, git history, and various older commits, I did not find where HSTORE_IS_HSTORE_NEW was ever defined. I expect it was defined at some point, but I checked back as far as 9.0 (where the current contrib/hstore was originally committed) and did not see it. Where did you find this, and can we add a code comment? This #ifdef is the only line in the entire repository where this symbol is used, making it hard to check whether changing from #if was the right decision.

The check on HSTORE_IS_HSTORE_NEW goes back at least as far as 2006, suggesting it was needed for migrating from some pre-9.0 version, making me wonder if anybody would still need this in the field. Should we drop support for it? I don't have a strong reason to advocate dropping support, other than that this #define appears to be undocumented.

mark
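For anyone checking the semantics of the change above: with an undefined macro, "#if SOME_MACRO" silently evaluates it as 0 (and -Wundef warns about exactly that), while "#ifdef SOME_MACRO" merely tests whether it is defined and draws no warning. SOME_MACRO and the helper below are stand-ins for illustration, not PostgreSQL symbols:

```c
#include <assert.h>

/* SOME_MACRO is deliberately left undefined here.  Under -Wundef,
 * writing "#if SOME_MACRO" at this point would warn; "#ifdef" does
 * not, and both take the "undefined" branch when nothing defines
 * the macro. */
#ifdef SOME_MACRO
#define TOOK_DEFINED_BRANCH 1
#else
#define TOOK_DEFINED_BRANCH 0
#endif

static int
took_defined_branch(void)
{
    return TOOK_DEFINED_BRANCH;
}
```

So the #if-to-#ifdef change is behavior-preserving whenever the macro, if defined at all, is defined to a nonzero value; it only changes behavior for the unusual case of a macro explicitly defined to 0.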
Re: BRIN index which is much faster never chosen by planner
On Tue, 15 Oct 2019 at 08:43, Jeremy Finzel wrote:
> I wanted to follow up on this specific issue. Isn't this the heart of the
> matter and a fundamental problem? If I want to rely on BRIN indexes in as
> straightforward a case as explained in the OP, but I don't know whether
> the planner will be nearly reliable enough, how can I depend on them in
> production? Is this not considered a planner bug, or should this kind of
> case be documented as problematic for BRIN? As another way to look at it:
> is there a configuration parameter that could be added, specific to BRIN
> or bitmapscan, to provide help in cases like this?
>
> On freshly analyzed tables, I tried my original query again and found that
> even with now() - 3 days it does not choose the BRIN index. In fact, it
> chose another btree on the table like (id1, id2, rec_insert_time). With a
> warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I
> force a BRIN scan it takes only 4 seconds.

Another thing which you might want to look at is the correlation column in the pg_stats view for the rec_insert_time column.

Prior to 7e534adcd, BRIN indexes were costed based on the selectivity estimate alone. There was no accounting for the fact that the pages for those records might be spread out over the entire table. Post 7e534adcd, we use the correlation estimate to attempt to estimate how many pages (more specifically, "ranges") we're likely to hit based on that and the selectivity estimate. This commit intended to fix the issue we had with BRIN indexes being selected far too often.

Of course, the correlation is based on the entire table; if there are subsets of the table that are perhaps perfectly correlated, then the planner is not going to know about that. It's possible that some of your older rec_insert_times are spread out far more than the newer ones.
As a test, you could try creating a new table and copying the records over to it in rec_insert_time order, then seeing if the BRIN index is selected for that table (after having performed an ANALYZE). It would be interesting if you could show the pg_stats row for the column so that we can see whether the correlation is low.

You can see from the code below that the final selectivity is strongly influenced by the correlation value (REF: brincostestimate):

    qualSelectivity = clauselist_selectivity(root, indexQuals,
                                             baserel->relid,
                                             JOIN_INNER, NULL);

    /* work out the actual number of ranges in the index */
    indexRanges = Max(ceil((double) baserel->pages /
                           statsData.pagesPerRange), 1.0);

    /*
     * Now calculate the minimum possible ranges we could match with if all of
     * the rows were in the perfect order in the table's heap.
     */
    minimalRanges = ceil(indexRanges * qualSelectivity);

    /*
     * Now estimate the number of ranges that we'll touch by using the
     * indexCorrelation from the stats. Careful not to divide by zero (note
     * we're using the absolute value of the correlation).
     */
    if (*indexCorrelation < 1.0e-10)
        estimatedRanges = indexRanges;
    else
        estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);

    /* we expect to visit this portion of the table */
    selec = estimatedRanges / indexRanges;

    CLAMP_PROBABILITY(selec);

My overall view on this is that the BRIN index is not that great here, since it's not eliminating that many rows by using it.
From above we see:

    Bitmap Heap Scan on foo.log_table l (cost=2391.71..24360848.29 rows=735 width=99) (actual time=824.133..21329.054 rows=466 loops=1)
      Output:
      Recheck Cond: (l.rec_insert_time >= (now() - '10 days'::interval))
      Rows Removed by Index Recheck: 8187584
      Filter: ((l.field1 IS NOT NULL) AND (l.category = 'music'::name))
      Rows Removed by Filter: 19857107
      Heap Blocks: lossy=1509000

So you have just 466 rows matching these quals, but the executor had to scan 1.5 million pages to get them, filter out 8.1 million rows in the recheck, and then 19.8 million more in the filter. You've mentioned that the table's heap is 139 GB, which is about 18 million pages.

It seems your query would perform much better if you had a btree index such as (category, rec_insert_time) where field1 is not null. Of course, you've mentioned that you're finding that when the plan uses the BRIN index it executes more quickly, but I think you're going to find BRIN unreliable for anything other than INSERT-only tables in which records are always inserted with an ever-increasing or decreasing value in the BRIN-indexed column. If you start performing UPDATEs, then that's going to create holes that new records will fill, causing the correlation to drop and the BRIN index scan cost to go up.

On the other hand, if you think you can do better than what was done in 7e534adcd, then it would be good to see
Re: v12.0: ERROR: could not find pathkey item to sort
On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
> BTW it probably should've been documented as an "Open Item" for v12.

https://commitfest.postgresql.org/25/2278/

I realize people were possibly thinking of that as a "feature" and not a bugfix to backpatch(?) But my issue is a query which worked under v11 PWJ but fails under v12 (apparently broken by d25ea01275). In my mind, if the planner doesn't support that query with PWJ, it should run without PWJ rather than fail.

Justin
Re: BRIN index which is much faster never chosen by planner
> The other issue is that the estimation of pages fetched using bitmap heap
> scan is rather crude - but that's simply hard, and I don't think we can
> fundamentally improve this.

I wanted to follow up on this specific issue. Isn't this the heart of the matter and a fundamental problem? If I want to rely on BRIN indexes in as straightforward a case as explained in the OP, but I don't know whether the planner will be nearly reliable enough, how can I depend on them in production? Is this not considered a planner bug, or should this kind of case be documented as problematic for BRIN? As another way to look at it: is there a configuration parameter that could be added, specific to BRIN or bitmapscan, to provide help in cases like this?

On freshly analyzed tables, I tried my original query again and found that even with now() - 3 days it does not choose the BRIN index. In fact, it chose another btree on the table like (id1, id2, rec_insert_time). With a warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I force a BRIN scan it takes only 4 seconds. I could understand more if the execution times were close, but the actual BRIN index is orders of magnitude faster than the plan Postgres is choosing.

I appreciate the feedback on this very much, as I am quite eager to use BRIN indexes!

Thanks,

Jeremy
Re: [HACKERS] Block level parallel vacuum
On Mon, Oct 14, 2019 at 3:10 PM Amit Kapila wrote:
>
> On Sat, Oct 12, 2019 at 4:50 PM Amit Kapila wrote:
> > On Sat, Oct 12, 2019 at 11:29 AM Masahiko Sawada wrote:
> > >
> >
> > I see a much bigger problem with the way this patch collects the index
> > stats in shared memory. IIUC, it allocates the shared memory (DSM) for
> > all the index stats, in the same way, considering its size as
> > IndexBulkDeleteResult. For the first time, it gets the stats from local
> > memory as returned by the ambulkdelete/amvacuumcleanup call and then
> > copies it into shared memory space. From there onwards, it always
> > updates the stats in shared memory by pointing each index's stats to
> > that memory. In this scheme, you overlooked the point that an index AM
> > could choose to return a larger structure of which IndexBulkDeleteResult
> > is just the first field. This generally provides a way for ambulkdelete
> > to communicate additional private data to amvacuumcleanup. We use this
> > idea in the gist index; see how gistbulkdelete and gistvacuumcleanup
> > work. The current design won't work for such cases.
>
> Today, I looked at gistbulkdelete and gistvacuumcleanup closely and I have
> a few observations about those which might help us to solve this problem
> for gist indexes:
> 1. Are we using memory context GistBulkDeleteResult->page_set_context? It
> seems to me it is not being used.

It appears to me as well that it is not being used.

> 2. Each time we perform gistbulkdelete, we always seem to reset the
> GistBulkDeleteResult stats, see gistvacuumscan. So, how will it accumulate
> them for the cleanup phase when the vacuum needs to call gistbulkdelete
> multiple times because the available space for dead tuples is filled? It
> seems to me like we only use the stats from the very last call to
> gistbulkdelete.

IIUC, it is fine to use the stats from the latest gistbulkdelete call, because we are trying to collect the information about the empty pages while scanning the tree. So I think it would be fine to just use the information collected from the latest scan; otherwise we would get duplicate information.

> 3. Do we really need to give the responsibility of deleting empty pages
> (gistvacuum_delete_empty_pages) to gistvacuumcleanup? Can't we do it in
> gistbulkdelete? I see one advantage of postponing it till the cleanup
> phase, which is if somehow we can accumulate stats over multiple calls of
> gistbulkdelete, but I am not sure if it is feasible.

It seems that we want to use the latest result. That might be the reason for postponing it to the cleanup phase.

> At least, the way the current code works, it seems that there is no
> advantage to postponing deleting empty pages till the cleanup phase.
>
> If we avoid postponing deleting empty pages till the cleanup phase, then
> we don't have the problem for gist indexes.
>
> This is not directly related to this patch, so we can discuss these
> observations in a separate thread as well, but before that, I wanted to
> check your opinion to see if this makes sense to you, as this will help us
> in moving this patch forward.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: d25ea01275 and partitionwise join
On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote: > Please find attached updated patches. Tom pointed me to this thread, since we hit it in 12.0 https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d I can't say much about the patch; there's a little typo: "The nullability of inner relation keys prevents them to" ..should say "prevent them from". In order to compile it against REL12, I tried to cherry-pick this one: 3373c715: Speed up finding EquivalenceClasses for a given set of rels But then it crashes in check-world (possibly due to misapplied hunks). -- Justin Pryzby System Administrator Telsasoft +1-952-707-8581
Add A Glossary
Attached is a v1 patch to add a Glossary to the appendix of our current documentation. I believe that our documentation needs a glossary for a few reasons: 1. It's hard to ask for help if you don't know the proper terminology of the problem you're having. 2. Readers who are new to databases may not understand a few of the terms that are used casually both in the documentation and in forums. This helps to make our documentation a bit more useful as a teaching tool. 3. Readers whose primary language is not English may struggle to find the correct search terms, and this glossary may help them grasp that a given term has a usage in databases that is different from common English usage. 3b. If we are not able to find the resources to translate all of the documentation into a given language, translating the glossary page would be a good first step. 4. The glossary would be web-searchable, and draw viewers to the official documentation. 5. adding link anchors to each term would make them cite-able, useful in forum conversations. A few notes about this patch: 1. It's obviously incomplete. There are more terms, a lot more, to add. 2. The individual definitions supplied are off-the-cuff, and should be thoroughly reviewed. 3. The definitions as a whole should be reviewed by an actual tech writer (one was initially involved but had to step back due to prior commitments), and the definitions should be normalized in terms of voice, tone, audience, etc. 4. My understanding of DocBook is not strong. The glossary vs glosslist tag issue is a bit confusing to me, and I'm not sure if the glossary tag is even appropriate for our needs. 5. I've made no effort at making each term an anchor, nor have I done any CSS styling at all. 6. I'm not quite sure how to handle terms that have different definitions in different contexts. Should that be two glossdefs following one glossterm, or two separate def/term pairs? Please review and share your thoughts. 
From 343d5c18bf23f98341b510595e3e042e002242cb Mon Sep 17 00:00:00 2001 From: Corey Huinker Date: Sun, 13 Oct 2019 17:57:36 + Subject: [PATCH] add glossary page with sample terms and definitions --- doc/src/sgml/filelist.sgml | 1 + doc/src/sgml/glossary.sgml | 618 doc/src/sgml/stylesheet.css | 2 + 3 files changed, 621 insertions(+) create mode 100644 doc/src/sgml/glossary.sgml diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index 3da2365ea9..504c8a6326 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -170,6 +170,7 @@ + diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml new file mode 100644 index 00..016eee2d76 --- /dev/null +++ b/doc/src/sgml/glossary.sgml @@ -0,0 +1,618 @@ + + + + Glossary + + + This is a list of terms and their in the context of PostgreSQL and databases in general. + + + + +Aggregate + + + To combine a collection of data values into a single value, whose value +may not be of the same type as the original values. Aggregate functions combine +multiple rows that share a common set of values into one row, which means that +the only data visible in the values in common, and the aggregates of the +non-common data. + + + + + +Analytic + + + A function whose computed value can reference values found in nearby rows +of the same result set. + + + + + +Atomic + + + In reference to the value of an Attribute or Datum: cannot be broken up +into smaller components. + + + In reference to an operation: An event that cannot be completed in part: +it must either entirely succeed or entirely fail. A series of SQL statements can +be combined into a Transaction, and that transaction is said to be Atomic. + + + + + +Attribute + + + A typed data element found within a Tuple or Relation or Table. + + + + + +Cast + + + A conversion of a Datum from its current data type to another data type. 
+ + + + + +Check Constraint + + + A type of constraint defined on a relation which restricts the values +allowed in one or more Attributes. The check constraint can make reference to +any Attribute in the Relation, but cannot reference other rows of the same +relation or other relations. + + + + + +Column + + + An Attribute found in a Table or View. + + + + + +Commit + + + The act of finalizing a Transaction within the database. + + + + + +Concurrency + + + The concept that multiple independent operations can be happening within +the database at the same time. + + + + + +Constraint + + + A method of restricting the values of data allowed within a Table. + + + + + +Datum + + + The internal
Re: v12.0: segfault in reindex CONCURRENTLY
On Sun, Oct 13, 2019 at 06:06:43PM +0900, Michael Paquier wrote: > On Fri, Oct 11, 2019 at 07:44:46PM -0500, Justin Pryzby wrote: > > Unfortunately, there was no core file, and I'm still trying to reproduce it. > > Forgot to set ulimit -c? Having a backtrace would surely help. Fortunately (?) another server crashed last night. (Doesn't appear to be relevant, but this table has no inheritance/partition-ness). Looks like it's a race condition, dereferencing *holder=NULL. The first crash was probably the same bug, due to a report query running during "reindex CONCURRENTLY", which probably finished at nearly the same time as another locker. Relevant code introduced here: commit ab0dfc961b6a821f23d9c40c723d11380ce195a6 Author: Alvaro Herrera Date: Tue Apr 2 15:18:08 2019 -0300 Report progress of CREATE INDEX operations Needs to be conditionalized (as anticipated by the comment): + if (holder) pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID, holder->pid); Core was generated by `postgres: postgres ts [local] REINDEX '. Program terminated with signal 11, Segmentation fault. 
#0 WaitForLockersMultiple (locktags=locktags@entry=0x1d30548, lockmode=lockmode@entry=5, progress=progress@entry=true) at lmgr.c:911 #1 0x005c2ac8 in ReindexRelationConcurrently (relationOid=relationOid@entry=17618, options=options@entry=0) at indexcmds.c:3090 #2 0x005c328a in ReindexIndex (indexRelation=, options=0, concurrent=) at indexcmds.c:2352 #3 0x007657fe in standard_ProcessUtility (pstmt=pstmt@entry=0x1d05468, queryString=queryString@entry=0x1d046e0 "REINDEX INDEX CONCURRENTLY loaded_cdr_files_filename", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x1d05548, completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at utility.c:787 #4 0x7f21517204ef in pgss_ProcessUtility (pstmt=0x1d05468, queryString=0x1d046e0 "REINDEX INDEX CONCURRENTLY loaded_cdr_files_filename", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") at pg_stat_statements.c:1006 #5 0x00762816 in PortalRunUtility (portal=0x1d7a4e0, pstmt=0x1d05468, isTopLevel=, setHoldSnapshot=, dest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") at pquery.c:1175 #6 0x00763267 in PortalRunMulti (portal=portal@entry=0x1d7a4e0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x1d05548, altdest=altdest@entry=0x1d05548, completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at pquery.c:1328 #7 0x00763e45 in PortalRun (portal=, count=9223372036854775807, isTopLevel=, run_once=, dest=0x1d05548, altdest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") at pquery.c:796 #8 0x0075ff45 in exec_simple_query (query_string=) at postgres.c:1215 #9 0x00761212 in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:4236 #10 0x00483d02 in BackendRun (port=, port=) at postmaster.c:4431 #11 BackendStartup (port=0x1d2b340) at postmaster.c:4122 #12 ServerLoop () at postmaster.c:1704 #13 0x006f0b1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1cff280) 
at postmaster.c:1377 #14 0x00484c93 in main (argc=3, argv=0x1cff280) at main.c:228 bt f #0 WaitForLockersMultiple (locktags=locktags@entry=0x1d30548, lockmode=lockmode@entry=5, progress=progress@entry=true) at lmgr.c:911 holder = 0x0 lockholders = 0x1d9b778 holders = lc = 0x1d9bf80 total = done = 1 #1 0x005c2ac8 in ReindexRelationConcurrently (relationOid=relationOid@entry=17618, options=options@entry=0) at indexcmds.c:3090 heapRelationIds = 0x1d30360 indexIds = 0x1d303b0 newIndexIds = relationLocks = lockTags = lc = 0x0 lc2 = 0x0 private_context = oldcontext = relkind = 105 'i' relationName = 0x0 relationNamespace = 0x0 ru0 = {tv = {tv_sec = 30592544, tv_usec = 7232025}, ru = {ru_utime = {tv_sec = 281483566645394, tv_usec = 75668733820930}, ru_stime = {tv_sec = 0, tv_usec = 30592272}, { ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 105, __ru_idrss_word = 105}, {ru_isrss = -926385342574214656, __ru_isrss_word = -926385342574214656}, {ru_minflt = 8924839, __ru_minflt_word = 8924839}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 17618, __ru_nswap_word = 17618}, {ru_inblock = 139781327898864, __ru_inblock_word = 139781327898864}, {ru_oublock = 30430312, __ru_oublock_word = 30430312}, { ru_msgsnd = 139781327898864, __ru_msgsnd_word = 139781327898864}, {ru_msgrcv =
Re: v12.0: ERROR: could not find pathkey item to sort
Justin Pryzby writes: > On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote: >> Could you provide a self-contained test case please? > [ test case ] Oh, this is the same issue Amit described in https://www.postgresql.org/message-id/flat/CA%2BHiwqG2WVUGmLJqtR0tPFhniO%3DH%3D9qQ%2BZ3L_ZC%2BY3-EVQHFGg%40mail.gmail.com namely that we're not generating EquivalenceClass members corresponding to sub-joins of a partitionwise join. Are you interested in helping to test the patches proposed there? regards, tom lane
Re: stress test for parallel workers
Hi, On 2019-10-13 13:44:59 +1300, Thomas Munro wrote: > On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote: > > I don't think any further proof is required that this is > > a kernel bug. Where would be a good place to file it? > > linuxppc-...@lists.ozlabs.org might be the right place. > > https://lists.ozlabs.org/listinfo/linuxppc-dev Probably requires reproducing on a pretty recent kernel first, to have a decent chance of being investigated... Greetings, Andres Freund
Re: stress test for parallel workers
Andres Freund writes: > Probably requires reproducing on a pretty recent kernel first, to have a > decent chance of being investigated... How recent do you think it needs to be? The machine I was testing on yesterday is under a year old: uname -m = ppc64le uname -r = 4.18.19-100.fc27.ppc64le uname -s = Linux uname -v = #1 SMP Wed Nov 14 21:53:32 UTC 2018 The latest-by-version-number ppc64 kernel I can find in the buildfarm is bonito, uname -m = ppc64le uname -r = 4.19.15-300.fc29.ppc64le uname -s = Linux uname -v = #1 SMP Mon Jan 14 16:21:04 UTC 2019 and that's certainly shown it too. regards, tom lane
Columns correlation and adaptive query optimization
Hi hackers, Errors in selectivity estimation are one of the main reasons for bad plan generation by the Postgres optimizer. Postgres estimates selectivity based on collected statistics (histograms). While it is able to estimate the selectivity of a simple predicate on a particular table more or less precisely, it is much more difficult to estimate selectivity for the result of a join of several tables, or for a complex predicate consisting of several conjuncts/disjuncts accessing different columns. Postgres is not able to take into account correlation between columns unless the corresponding multicolumn statistics are explicitly created. But even if such statistics are created, they can not be used in join selectivity estimation. The problem of adjusting selectivity using machine learning based on the results of EXPLAIN ANALYZE was addressed in the AQO project: https://github.com/postgrespro/aqo There are still many issues with the proposed AQO approach (for example, it doesn't take into account concrete constant values). We are going to continue its improvement. But here I want to propose a much simpler patch which allows two things: 1. Use extended statistics in the estimation of join selectivity. 2. Create multicolumn statistics on demand in the auto_explain extension if there is a larger gap between the real and estimated number of tuples for a concrete plan node. 
create table inner_tab(x integer, y integer); create table outer_tab(pk integer primary key, x integer, y integer); create index on inner_tab(x,y); insert into outer_tab values (generate_series(1,10), generate_series(1,10), generate_series(1,10)*10); insert into inner_tab values (generate_series(1,100)/10, generate_series(1,100)/10*10); analyze inner_tab; analyze outer_tab; Without this patch: explain select * from outer_tab join inner_tab using(x,y) where pk=1; QUERY PLAN -- Nested Loop (cost=0.72..16.77 rows=1 width=12) -> Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31 rows=1 width=12) Index Cond: (pk = 1) -> Index Only Scan using inner_tab_x_y_idx on inner_tab (cost=0.42..8.45 rows=1 width=8) Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y)) (5 rows) With this patch: load 'auto_explain'; set auto_explain.log_min_duration=0; set auto_explain.add_statistics_threshold=10.0; set auto_explain.log_analyze=on; select * from outer_tab join inner_tab using(x,y) where pk=1; analyze inner_tab; analyze outer_tab; explain select * from outer_tab join inner_tab using(x,y) where pk=1; QUERY PLAN Nested Loop (cost=0.72..32.79 rows=10 width=12) -> Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31 rows=1 width=12) Index Cond: (pk = 1) -> Index Only Scan using inner_tab_x_y_idx on inner_tab (cost=0.42..24.38 rows=10 width=8) Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y)) (5 rows) As you can see now estimation of join result is correct (10). I attached two patches: one for using extended statistic for join selectivity estimation and another for auto_explain to implicitly add this extended statistic on demand. 
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 4bf777d..a356df9 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -25,6 +25,7 @@ #include "utils/lsyscache.h" #include "utils/selfuncs.h" #include "statistics/statistics.h" +#include "catalog/pg_statistic_ext.h" /* @@ -159,6 +160,9 @@ clauselist_selectivity_simple(PlannerInfo *root, RangeQueryClause *rqlist = NULL; ListCell *l; int listidx; + Bitmapset *clauses_attnums = NULL; + int n_clauses_attnums = 0; + int innerRelid = varRelid; /* * If there's exactly one clause (and it was not estimated yet), just go @@ -170,6 +174,9 @@ clauselist_selectivity_simple(PlannerInfo *root, return clause_selectivity(root, (Node *) linitial(clauses), varRelid, jointype, sjinfo); + if (innerRelid == 0 && sjinfo) + bms_get_singleton_member(sjinfo->min_righthand, ); + /* * Anything that doesn't look like a potential rangequery clause gets * multiplied into s1 and forgotten. Anything that does gets inserted into @@ -181,7 +188,6 @@ clauselist_selectivity_simple(PlannerInfo *root, Node *clause = (Node *) lfirst(l); RestrictInfo *rinfo; Selectivity s2; - listidx++; /* @@ -213,6 +219,7 @@ clauselist_selectivity_simple(PlannerInfo *root, else rinfo = NULL; + /* * See if it looks like a restriction clause with a pseudoconstant on * one side. (Anything more
Re: Non-Active links being referred in our source code
On Tue, Oct 8, 2019 at 10:35 AM Michael Paquier wrote: > > On Mon, Oct 07, 2019 at 05:11:40PM +0200, Juan José Santamaría Flecha wrote: > > About the broken links in win32_port.h, they are all referring to > > ntstatus. As for first case that shows the code groups, there is an up > > to date alternative. There is also an alternative for second case that > > points to their codes and descriptions. On the other hand, the last > > case is quoting a document that is no longer available, I would > > suggest to rephrase the comment, thus eliminating the quote. > > > > Please find attached a patch with the proposed alternatives. > > Thanks Juan for the patch. I have checked your suggestions and it > looks good to me, so committed. Good idea to tell about > WIN32_NO_STATUS. I have noticed one typo on the way. About pg_crc.h, I have made the changes with the correct links. The patch for the same is attached. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com 0001-Updated-CRC-links-that-are-not-working.patch Description: Binary data
Re: v12.0: ERROR: could not find pathkey item to sort
Justin Pryzby writes: > On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote: >> BTW it probably should've been documented as an "Open Item" for v12. > https://commitfest.postgresql.org/25/2278/ > I realized possibly people were thinking of that as a "feature" and not a > bugfix for backpatch (?) > But, my issue is a query which worked under v11 PWJ but fails under v12 > (apparently broken by d25ea01275). Yeah, this should have been dealt with as an open item, but it slipped through the cracks. We'll make sure to get it fixed, one way or another, for 12.1. In view of the proposed patches being dependent on some other 13-only changes, I wonder if we should fix v12 by reverting d25ea0127. The potential planner performance loss for large partition sets could be nasty, but failing to plan at all is worse. regards, tom lane
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra wrote: > > On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote: > >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra > >wrote: > > > >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote: > >> > > >> >On further testing, I found that the patch seems to have problems with > >> >toast. Consider below scenario: > >> >Session-1 > >> >Create table large_text(t1 text); > >> >INSERT INTO large_text > >> >SELECT (SELECT string_agg('x', ',') > >> >FROM generate_series(1, 100)) FROM generate_series(1, 1000); > >> > > >> >Session-2 > >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot', > >> >'test_decoding'); > >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL); > >> >*--kaboom* > >> > > >> >The second statement in Session-2 leads to a crash. > >> > > >> > >> OK, thanks for the report - will investigate. > >> > > > >It was an assertion failure in ReorderBufferCleanupTXN at below line: > >+ /* Check we're not mixing changes from different transactions. */ > >+ Assert(change->txn == txn); > > > > Can you still reproduce this issue with the patch I sent on 28/9? I have > been unable to trigger the failure, and it seems pretty similar to the > failure you reported (and I fixed) on 28/9. > > >> >Other than that, I am not sure if the changes related to spill to disk > >> >after logical_decoding_work_mem works for toast table as I couldn't hit > >> >that code for toast table case, but I might be missing something. As > >> >mentioned previously, I feel there should be some way to test whether this > >> >patch works for the cases it claims to work. As of now, I have to check > >> >via debugging. Let me know if there is any way, I can test this. > >> > > >> > >> That's one of the reasons why I proposed to move the statistics (which > >> say how many transactions / bytes were spilled to disk) from a later > >> patch in the series. I don't think there's a better way. 
> >> > >> > >I like that idea, but I think you need to split that patch to only get the > >stats related to the spill. It would be easier to review if you can > >prepare that atop of > >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer. > > > > Sure, I wasn't really proposing to adding all stats from that patch, > including those related to streaming. We need to extract just those > related to spilling. And yes, it needs to be moved right after 0001. > I have extracted the spilling related code to a separate patch on top of 0001. I have also fixed some bugs and review comments and attached as a separate patch. Later I can merge it to the main patch if you agree with the changes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch Description: Binary data bugs_and_review_comments_fix.patch Description: Binary data 0002-Track-statistics-for-spilling.patch Description: Binary data
Re: Minimal logical decoding on standbys
Sorry to intervene in between, but what about a timeline change? Thank you.
Re: [HACKERS] Block level parallel vacuum
On Sat, Oct 12, 2019 at 4:50 PM Amit Kapila wrote: > On Sat, Oct 12, 2019 at 11:29 AM Masahiko Sawada > wrote: > > > > I see a much bigger problem with the way this patch collects the index > stats in shared memory. IIUC, it allocates the shared memory (DSM) > for all the index stats, in the same way, considering its size as > IndexBulkDeleteResult. For the first time, it gets the stats from > local memory as returned by ambulkdelete/amvacuumcleanup call and then > copies it in shared memory space. There onwards, it always updates > the stats in shared memory by pointing each index stats to that > memory. In this scheme, you overlooked the point that an index AM > could choose to return a larger structure of which > IndexBulkDeleteResult is just the first field. This generally > provides a way for ambulkdelete to communicate additional private data > to amvacuumcleanup. We use this idea in the gist index, see how > gistbulkdelete and gistvacuumcleanup works. The current design won't > work for such cases. > Today, I looked at gistbulkdelete and gistvacuumcleanup closely and I have a few observations about those which might help us to solve this problem for gist indexes: 1. Are we using memory context GistBulkDeleteResult->page_set_context? It seems to me it is not being used. 2. Each time we perform gistbulkdelete, we always seem to reset the GistBulkDeleteResult stats, see gistvacuumscan. So, how will it accumulate it for the cleanup phase when the vacuum needs to call gistbulkdelete multiple times because the available space for dead-tuple is filled. It seems to me like we only use the stats from the very last call to gistbulkdelete. 3. Do we really need to give the responsibility of deleting empty pages (gistvacuum_delete_empty_pages) to gistvacuumcleanup. Can't we do it in gistbulkdelete? 
I see one advantage of postponing it till the cleanup phase which is if somehow we can accumulate stats over multiple calls of gistbulkdelete, but I am not sure if it is feasible. At least, the way current code works, it seems that there is no advantage to postpone deleting empty pages till the cleanup phase. If we avoid postponing deleting empty pages till the cleanup phase, then we don't have the problem for gist indexes. This is not directly related to this patch, so we can discuss these observations in a separate thread as well, but before that, I wanted to check your opinion to see if this makes sense to you as this will help us in moving this patch forward. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Remove obsolete information schema tables
I propose this patch to remove the information schema tables SQL_LANGUAGES, which was eliminated in SQL:2008, and SQL_PACKAGES, which was eliminated in SQL:2011. Since they were dropped by the SQL standard, the information in them was no longer updated and therefore no longer useful. This also removes the feature-package association information in sql_feature_packages.txt, but for the time being we are keeping the information about which features are in the Core package (that is, mandatory SQL features). Maybe at some point someone wants to invent a way to store that that does not involve using the "package" mechanism anymore. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 304270da9168134df77d59c537268bd6bcaf1a06 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 14 Oct 2019 09:56:38 +0200 Subject: [PATCH] Remove obsolete information schema tables Remove SQL_LANGUAGES, which was eliminated in SQL:2008, and SQL_PACKAGES, which was eliminated in SQL:2011. Since they were dropped by the SQL standard, the information in them was no longer updated and therefore no longer useful. This also removes the feature-package association information in sql_feature_packages.txt, but for the time being we are keeping the information about which features are in the Core package (that is, mandatory SQL features). Maybe at some point someone wants to invent a way to store that that does not involve using the "package" mechanism anymore. 
--- doc/src/sgml/features.sgml | 9 +- doc/src/sgml/information_schema.sgml | 154 --- src/backend/catalog/information_schema.sql | 50 -- src/backend/catalog/sql_feature_packages.txt | 29 src/backend/catalog/sql_features.txt | 2 - src/test/regress/expected/sanity_check.out | 2 - 6 files changed, 3 insertions(+), 243 deletions(-) diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml index f767bee46e..2c5a7e5d0c 100644 --- a/doc/src/sgml/features.sgml +++ b/doc/src/sgml/features.sgml @@ -44,10 +44,7 @@ SQL Conformance broad three levels found in SQL-92. A large subset of these features represents the Core features, which every conforming SQL implementation must supply. - The rest of the features are purely optional. Some optional - features are grouped together to form packages, which - SQL implementations can claim conformance to, thus claiming - conformance to particular groups of features. + The rest of the features are purely optional. @@ -116,7 +113,7 @@ Supported Features Identifier -Package +Core? Description Comment @@ -143,7 +140,7 @@ Unsupported Features Identifier -Package +Core? Description Comment diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml index 906fe7819f..7d3be1afdb 100644 --- a/doc/src/sgml/information_schema.sgml +++ b/doc/src/sgml/information_schema.sgml @@ -4963,160 +4963,6 @@ sql_implementation_info Columns - - sql_languages - - - The table sql_languages contains one row for - each SQL language binding that is supported by - PostgreSQL. - PostgreSQL supports direct SQL and - embedded SQL in C; that is all you will learn from this table. - - - - This table was removed from the SQL standard in SQL:2008, so there - are no entries referring to standards later than SQL:2003. 
- - - - sql_languages Columns - - - - - Name - Data Type - Description - - - - - - sql_language_source - character_data - - The name of the source of the language definition; always - ISO 9075, that is, the SQL standard - - - - - sql_language_year - character_data - - The year the standard referenced in - sql_language_source was approved. - - - - - sql_language_conformance - character_data - - The standard conformance level for the language binding. For - ISO 9075:2003 this is always CORE. - - - - - sql_language_integrity - character_data - Always null (This value is relevant to an earlier version of the SQL standard.) - - - - sql_language_implementation - character_data - Always null - - - - sql_language_binding_style - character_data - - The language binding style, either DIRECT or - EMBEDDED - - - - - sql_language_programming_language - character_data - - The programming language, if the binding style is - EMBEDDED, else null. PostgreSQL only - supports the language C. - - - - - - - - - sql_packages - - - The table sql_packages contains information -
Re: Transparent Data Encryption (TDE) and encrypted files
Masahiko Sawada wrote: > On Wed, Oct 9, 2019 at 3:57 PM Antonin Houska wrote: > > > > Moon, Insung wrote: > > > > v04-0011-Make-buffile.c-aware-of-encryption.patch in [1] changes buffile.c > > so > > that data is read and written in 8kB blocks if encryption is enabled. In > > order > > to record the IV per block, the computation of the buffer position within > > the > > file would have to be adjusted somehow. I can check it soon but not in the > > next few days. > > As far as I read the patch the nonce consists of pid, counter and > block number where the counter is the number incremented each time of > creating a BufFile. Therefore it could happen to rewrite the buffer > data with the same nonce and key, which is bad. This patch was written before the requirement on non-repeating IV was raised, and it does not use the AES-CTR mode. I mentioned it here because it reads / writes data in 8kB blocks. > So I think we can have the rewrite counter of the block in the each 8k > block header. And then the nonce consists of block number within a > segment file (4 bytes), temp file counter (8 bytes), rewrite counter > (2 bytes) and CTR mode counter (2 bytes). And then if we have a > single-use encryption key per backend processes I guess we can > guarantee the uniqueness of the combination of key and nonce. Since the segment size is 1 GB, the segment consists of 2^17 blocks, so 4 bytes will not be utilized. As for the "CTR mode counter", consider that it gets incremented once per 16 bytes of input. So even if BLCKSZ is 32 kB, we need no more than 11 bits for this counter. If these two parts become smaller, we can perhaps increase the size of the "rewrite counter". -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Fri, Oct 04, 2019 at 01:39:38PM -0700, Andres Freund wrote: > but that reasoning seems bogus to me. For one, on just about any > platform close always closes the fd, even when returning an error > (unless you pass in a bad fd, in which case it obviously doesn't). So > the reasoning that this fixes unnoticed fd leaks doesn't really seem to > make sense. For another, even if it did, throwing an ERROR seems to > achieve very little: We continue with a leaked fd *AND* we cause the > operation to error out. I have read the commit log again a couple of times, and it is about letting users know that an fd is leaking, not about fixing it. Still, we get to know about it, while previously that was not possible. In some cases we may see errors in close() after a previous write(2). Of course this does not apply to all the code paths patched here, but it seems to me that's a good habit to spread, no? > I can see reasoning for: > - LOG, so it can be noticed, but operations continue to work > - FATAL, to fix the leak > - PANIC, so we recover from the problem, in case of the close indicating > a durability issue LOG or WARNING would not be visible enough and would likely be skipped by users. Not sure that this justifies a FATAL either, and PANIC would cause more harm than necessary, so for most of them ERROR sounded like a good compromise. Still, the elevel choice is not innocent: for each code path patched, the elevel used is consistent with the error handling of the surroundings. > And if your goal was just to achieve consistency, I also don't > understand, because you left plenty close()'s unchecked? E.g. you added > an error check in get_controlfile(), but not one in > ReadControlFile(). alterSystemSetConfigFile() writes, but you didn't add > one. Because I have not considered these when looking at transient files. That may be worth an extra lookup. -- Michael signature.asc Description: PGP signature