Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-14 Thread Masahiko Sawada
On Mon, Oct 14, 2019 at 3:42 PM Antonin Houska  wrote:
>
> Masahiko Sawada  wrote:
>
> > On Wed, Oct 9, 2019 at 3:57 PM Antonin Houska  wrote:
> > >
> > > Moon, Insung  wrote:
> > >
> > > v04-0011-Make-buffile.c-aware-of-encryption.patch in [1] changes 
> > > buffile.c so
> > > that data is read and written in 8kB blocks if encryption is enabled. In 
> > > order
> > > to record the IV per block, the computation of the buffer position within 
> > > the
> > > file would have to be adjusted somehow. I can check it soon but not in the
> > > next few days.
> >
> > As far as I read the patch, the nonce consists of pid, counter and
> > block number, where the counter is a number incremented each time a
> > BufFile is created. Therefore the buffer data could be rewritten with
> > the same nonce and key, which is bad.
>
> This patch was written before the requirement on non-repeating IV was raised,
> and it does not use the AES-CTR mode. I mentioned it here because it reads /
> writes data in 8kB blocks.
>
> > So I think we can have a rewrite counter for the block in each 8kB
> > block header. Then the nonce consists of the block number within a
> > segment file (4 bytes), temp file counter (8 bytes), rewrite counter
> > (2 bytes) and CTR mode counter (2 bytes). And if we have a
> > single-use encryption key per backend process, I guess we can
> > guarantee the uniqueness of the combination of key and nonce.
>
> Since the segment size is 1 GB, the segment consists of 2^17 blocks, so 4 bytes
> will not be utilized.
>
> As for the "CTR mode counter", consider that it gets incremented once per 16
> bytes of input. So even if BLCKSZ is 32 kB, we need no more than 11 bits for
> this counter.
>
> If these two parts become smaller, we can perhaps increase the size of the
> "rewrite counter".

Yeah, I designed it that way to make the implementation easier, but we can
increase the size of the rewrite counter to 3 bytes while the block number
uses 3 bytes.
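
Just to make the arithmetic concrete, a purely illustrative 16-byte IV
layout with those sizes could look like the sketch below (the struct and
field names are made up, not taken from any patch):

	/*
	 * Hypothetical per-block IV for encrypted BufFile data.
	 *
	 * Size check: a 1 GB segment holds 2^30 / 2^13 = 2^17 8kB blocks, so
	 * 3 bytes of block number are plenty; AES-CTR consumes one counter
	 * value per 16 bytes, so even a 32 kB block needs only 2^11 values,
	 * which easily fits in 2 bytes.
	 */
	typedef struct BufFileIV
	{
		uint8		block_number[3];	/* block within the segment file */
		uint8		rewrite_counter[3];	/* bumped on every rewrite of the block */
		uint8		buffile_counter[8];	/* per-backend temporary file counter */
		uint8		ctr_counter[2];		/* AES-CTR counter within the block */
	} BufFileIV;

The property that matters is only that a (key, IV) pair is never reused,
which the single-use per-backend key plus the rewrite counter is meant to
guarantee.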

Regards,

--
Masahiko Sawada




Re: [HACKERS] Block level parallel vacuum

2019-10-14 Thread Masahiko Sawada
On Mon, Oct 14, 2019 at 6:37 PM Amit Kapila  wrote:
>
> On Sat, Oct 12, 2019 at 4:50 PM Amit Kapila  wrote:
> > On Sat, Oct 12, 2019 at 11:29 AM Masahiko Sawada  
> > wrote:
> > >
> >
> > I see a much bigger problem with the way this patch collects the index
> > stats in shared memory.  IIUC, it allocates the shared memory (DSM)
> > for all the index stats, in the same way, considering its size as
> > IndexBulkDeleteResult.  For the first time, it gets the stats from
> > local memory as returned by ambulkdelete/amvacuumcleanup call and then
> > copies it in shared memory space.  There onwards, it always updates
> > the stats in shared memory by pointing each index stats to that
> > memory.  In this scheme, you overlooked the point that an index AM
> > could choose to return a larger structure of which
> > IndexBulkDeleteResult is just the first field.  This generally
> > provides a way for ambulkdelete to communicate additional private data
> > to amvacuumcleanup.  We use this idea in the gist index, see how
> > gistbulkdelete and gistvacuumcleanup works. The current design won't
> > work for such cases.

Indeed. That's a very good point. Thank you for pointing it out.
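
For anyone following along, the pattern being referred to looks roughly
like this in gistvacuum.c (reproduced from memory, so please treat it as a
sketch rather than an exact copy):

	typedef struct
	{
		IndexBulkDeleteResult stats;	/* must be first */

		/*
		 * Private state carried from gistbulkdelete to gistvacuumcleanup:
		 * the internal and empty leaf pages collected during the scan.
		 */
		IntegerSet *internal_page_set;
		IntegerSet *empty_leaf_set;
		MemoryContext page_set_context;
	} GistBulkDeleteResult;

Copying only sizeof(IndexBulkDeleteResult) bytes into shared memory and
then pointing the index stats at that copy silently drops everything after
the stats field, which is exactly the problem described above.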

> >
>
> Today, I looked at gistbulkdelete and gistvacuumcleanup closely and I
> have a few observations about those which might help us to solve this
> problem for gist indexes:
> 1. Are we using memory context GistBulkDeleteResult->page_set_context?
>  It seems to me it is not being used.

Yes, I also think this memory context is not being used.

> 2. Each time we perform gistbulkdelete, we always seem to reset the
> GistBulkDeleteResult stats, see gistvacuumscan.  So, how will it
> accumulate it for the cleanup phase when the vacuum needs to call
> gistbulkdelete multiple times because the available space for
> dead-tuple is filled.  It seems to me like we only use the stats from
> the very last call to gistbulkdelete.

I think you're right. gistbulkdelete scans all pages and collects all
internal pages and all empty pages, and then gistvacuumcleanup uses them
to unlink all the empty pages. Currently it accumulates that information
over multiple gistbulkdelete calls because the memory context is never
switched, but I guess this code intends to use only the information from
the very last call to gistbulkdelete.

> 3. Do we really need to give the responsibility of deleting empty
> pages (gistvacuum_delete_empty_pages) to gistvacuumcleanup.  Can't we
> do it in gistbulkdelete?  I see one advantage of postponing it till the
> cleanup phase which is if somehow we can accumulate stats over
> multiple calls of gistbulkdelete, but I am not sure if it is feasible.
> At least, the way current code works, it seems that there is no
> advantage to postpone deleting empty pages till the cleanup phase.
>

Considering the current page-deletion strategy of the gist index, the
advantage of postponing page deletion until the cleanup phase is that we
can do the bulk deletion in the cleanup phase, which is called at most
once. But I wonder if we can do page deletion in a similar way to the
btree index. Or, even if we keep the current strategy, I think we could
do it without passing the page information from bulkdelete to
vacuumcleanup via GistBulkDeleteResult.

> If we avoid postponing deleting empty pages till the cleanup phase,
> then we don't have the problem for gist indexes.

Yes. But given the point you raised, I guess there might be other index
AMs that use the stats returned from bulkdelete in a similar way to the
gist index (i.e. using a larger structure of which IndexBulkDeleteResult
is just the first field). If the same concern applies to them, parallel
vacuum still needs to deal with that, as you mentioned.

Regards,

--
Masahiko Sawada




Re: Collation versioning

2019-10-14 Thread Thomas Munro
On Fri, Oct 11, 2019 at 11:41 PM Thomas Munro  wrote:
> On Thu, Oct 10, 2019 at 8:38 AM Peter Eisentraut
>  wrote:
> > Actually, I had to revert that because pg_dump and pg_upgrade tests need
> > to be updated, but that seems doable.
>
> [Returning from a couple of weeks mostly away from computers]
>
> Right, sorry about that.  Here is a new version that fixes that test,
> and also gives credit to Christoph for the idea in the commit message.

Here's a version with a small note added to the documentation.  I'm
planning to commit this tomorrow.

To actually make this useful for most users, we need version tracking
for the default collation.  I noticed that the ICU-as-default patch[1]
can do that for ICU collations (though I haven't looked closely yet).
Currently, its change to get_collation_actual_version() for the
default collation applies only when the default provider is ICU, but
if you just take out that condition when rebasing it should do the
right thing, I think?

[1] 
https://www.postgresql.org/message-id/attachment/104646/v1-0002-Add-option-to-use-ICU-as-global-collation-provide_rebased.patch
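
For anyone skimming: per its name, the attached 0001 patch's idea is to
report the C library's own version string as the collation version on
glibc.  A minimal sketch of that idea (assuming the existing
get_collation_actual_version() signature; this is not the patch itself):

	#include <gnu/libc-version.h>

	char *
	get_collation_actual_version(char collprovider, const char *collcollate)
	{
		/* on glibc, the libc version is a usable proxy for collation changes */
		if (collprovider == COLLPROVIDER_LIBC)
			return pstrdup(gnu_get_libc_version());

		/* (ICU and other branches unchanged, elided from this sketch) */
		return NULL;
	}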


0001-Use-libc-version-as-a-collation-version-on-glibc--v3.patch
Description: Binary data


ProcArrayGroupClearXid() compare-exchange style

2019-10-14 Thread Noah Misch
ProcArrayGroupClearXid() has this:

	while (true)
	{
		nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);

		...

		if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
										   &nextidx,
										   (uint32) proc->pgprocno))
			break;
	}

This, from UnpinBuffer(), is our more-typical style:

	old_buf_state = pg_atomic_read_u32(&buf->state);
	for (;;)
	{
		...

		if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
										   buf_state))
			break;
	}

That is, we typically put the pg_atomic_read_u32() outside the loop.  After
the first iteration, it is redundant with the side effect of
pg_atomic_compare_exchange_u32().  I haven't checked whether this materially
improves performance, but, for style, I would like to change it in HEAD.
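
As background for why the re-read is redundant: pg_atomic_compare_exchange_u32()
follows the usual compare-and-swap convention of writing the value it
actually found back into the "expected" argument when the exchange fails,
so a failed iteration already leaves us with a fresh value.  A generic
sketch of the idiom (illustrative only, operating on a hypothetical
pg_atomic_uint32 named counter):

	uint32		expected = pg_atomic_read_u32(&counter);	/* single read up front */

	for (;;)
	{
		uint32		desired = expected + 1;	/* whatever new value we want */

		if (pg_atomic_compare_exchange_u32(&counter, &expected, desired))
			break;
		/* on failure, "expected" was overwritten with the current value */
	}
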
diff --git a/src/backend/storage/ipc/procarray.c 
b/src/backend/storage/ipc/procarray.c
index 8abcfdf..3da5307 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -490,15 +490,15 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId 
latestXid)
/* We should definitely have an XID to clear. */
Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
/* Add ourselves to the list of processes needing a group XID clear. */
proc->procArrayGroupMember = true;
proc->procArrayGroupMemberXid = latestXid;
+	nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
 	while (true)
 	{
-		nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
 		pg_atomic_write_u32(&proc->procArrayGroupNext, nextidx);
 
 		if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
 										   &nextidx,
 										   (uint32) proc->pgprocno))
 			break;


Re: CREATE TEXT SEARCH DICTIONARY segfaulting on 9.6+

2019-10-14 Thread Tomas Vondra

I spent a bit of time investigating this, and it seems the new code is
somewhat too trusting when it comes to data from the affix/dict files.
In this particular case, it boils down to this code in NISortDictionary:

   if (Conf->useFlagAliases)
   {
   for (i = 0; i < Conf->nspell; i++)
   {
   char   *end;

   if (*Conf->Spell[i]->p.flag != '\0')
   {
   curaffix = strtol(Conf->Spell[i]->p.flag, &end, 10);
   if (Conf->Spell[i]->p.flag == end || errno == ERANGE)
   ereport(ERROR,
   (errcode(ERRCODE_CONFIG_FILE_ERROR),
errmsg("invalid affix alias \"%s\"",
   Conf->Spell[i]->p.flag)));
   }
   ...
   Conf->Spell[i]->p.d.affix = curaffix;
   ...
   }
   ...
   }

So it simply grabs whatever it finds in the dict file, parses it and
then (later) we use it as an index into the AffixData array, even if
the value is way out of bounds.

For example, in this case hunspell_sample_long.affix contains about
10 affixes, but then we parse the hunspell_sample_num.dict file, and we
stumble upon

   book/302,301,202,303

and we parse the flags as integers, and interpret them as indexes in the
AffixData array. Clearly, 303 is way out of bounds, triggering the
segfault crash.

So I think we need some sort of cross-check here. We certainly need to
make NISortDictionary() check that the affix value is within AffixData
bounds, and error out when the index is nonsensical (maybe negative
and/or exceeding nAffixData). Maybe there's a simple way to check if the
affix/dict files match. The failing affix has

   FLAG num

while with

   FLAG long

it works just fine. But I'm not sure that's actually possible, because I
don't see anything in hunspell_sample_num.dict that would allow us to
decide that it expects "FLAG num" and not "FLAG long". Furthermore, we
certainly can't rely on this - we still need to check the range.
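
The range check itself should be small; something along these lines in
NISortDictionary(), assuming nAffixData is (as today) the number of valid
entries in AffixData -- a sketch, not a tested patch:

	if (curaffix < 0 || curaffix >= Conf->nAffixData)
		ereport(ERROR,
				(errcode(ERRCODE_CONFIG_FILE_ERROR),
				 errmsg("invalid affix alias \"%s\"",
						Conf->Spell[i]->p.flag)));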


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Fix most -Wundef warnings

2019-10-14 Thread Tom Lane
Peter Eisentraut  writes:
> During the cleanup of the _MSC_VER versions (commit
> 38d8dce61fff09daae0edb6bcdd42b0c7f10ebcd), I found it useful to use
> -Wundef, but that resulted in a bunch of gratuitous warnings.  Here is a
> patch to fix those.  Most of these are just stylistic cleanups, but the
> change in pg_bswap.h is potentially useful to avoid misuse by
> third-party extensions.

Looks reasonable offhand.

regards, tom lane




Re: "pg_ctl: the PID file ... is empty" at end of make check

2019-10-14 Thread Thomas Munro
On Tue, Oct 15, 2019 at 1:55 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Agreed.  Secret non-shareable bug report filed.  Fingers crossed.
>
> Since that conversation, longfin has shown the same symptom
> just once more:
>
>  longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for 
> ser
> ver to shut down..pg_ctl: the PID file 
> "/Users/buildfarm/bf-data/REL_11_STAB
> LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
> a/postmaster.pid" is empty
>
> and now prairiedog has shown it too:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47
>
> which is positively fascinating, because prairiedog is running a
> bronze-age version of macOS that surely never heard of APFS.
> So this makes it look like this is a basic macOS bug that's not
> as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

> Trawling the buildfarm database finds no other matches in the last year,
> so whatever it is, it's pretty darn improbable.
>
> Did you ever get any reaction to that bug report?

No acknowledgement.   I did learn from an anonymous source that "it’s
prioritized and in the queue".




Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-14 Thread Justin Pryzby
On Sun, Oct 13, 2019 at 02:06:02PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
> >> Could you provide a self-contained test case please?
> 
> > [ test case ]
> 
> Oh, this is the same issue Amit described in 
> 
> https://www.postgresql.org/message-id/flat/CA%2BHiwqG2WVUGmLJqtR0tPFhniO%3DH%3D9qQ%2BZ3L_ZC%2BY3-EVQHFGg%40mail.gmail.com
> 
> namely that we're not generating EquivalenceClass members corresponding
> to sub-joins of a partitionwise join.
> 
> Are you interested in helping to test the patches proposed there?

Sure.  Any requests other than testing that our original query works correctly
and maybe endeavoring to read the patch?

BTW it probably should've been documented as an "Open Item" for v12.

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581




Re: "pg_ctl: the PID file ... is empty" at end of make check

2019-10-14 Thread Tom Lane
[ blast from the past dept. ]

Thomas Munro  writes:
> On Thu, Nov 29, 2018 at 3:30 AM Tom Lane  wrote:
>> Thomas Munro  writes:
>>> https://github.com/macdice/unlinktest

>> Bleah.  But you can do better than ask whether it's a bug: you can
>> quote POSIX:
>> ...
>> Not a lot of wiggle room there.

> Agreed.  Secret non-shareable bug report filed.  Fingers crossed.

Since that conversation, longfin has shown the same symptom
just once more:

 longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser
ver to shut down..pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB
LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
a/postmaster.pid" is empty

and now prairiedog has shown it too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog=2019-10-14%2021%3A45%3A47

which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Trawling the buildfarm database finds no other matches in the last year,
so whatever it is, it's pretty darn improbable.

Did you ever get any reaction to that bug report?

regards, tom lane




Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-14 Thread Justin Pryzby
On Sun, Oct 13, 2019 at 03:10:21PM -0300, Alvaro Herrera wrote:
> On 2019-Oct-13, Justin Pryzby wrote:
> 
> > Looks like it's a race condition and dereferencing *holder=NULL.  The first
> > crash was probably the same bug, due to report query running during "reindex
> > CONCURRENTLY", and probably finished at nearly the same time as another 
> > locker.
> 
> Ooh, right, makes sense.  There's another spot with the same mistake ...
> this patch should fix it.

I would maybe chop off the 2nd sentence, since conditionalizing indicates that
we do actually care.

+			 * If requested, publish who we're going to wait for.  This is not
+			 * 100% accurate if they're already gone, but we don't care.

Justin




Re: stress test for parallel workers

2019-10-14 Thread Tom Lane
I wrote:
> Filed at
> https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...

Further to this --- I went back and looked at the outlier events
where we saw an infinite_recurse failure on a non-Linux-PPC64
platform.  There were only three:

 mereswine| ARMv7| Linux debian-armhf | Clarence Ho | 
REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 
PDT [5004:4] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine| ARMv7| Linux debian-armhf | Clarence Ho | 
REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 
PDT [6804:5] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine| ARMv7| Linux debian-armhf | Clarence Ho | HEAD  
| 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT 
[9954:76] DETAIL:  Failed process was running: select infinite_recurse();
 
Looking closer at these, though, they were *not* SIGSEGV failures,
but SIGKILLs.  Seeing that they were all on the same machine on the
same day, I'm thinking we can write them off as a transiently
misconfigured OOM killer.

So, pending some other theory emerging from the kernel hackers, we're
down to it's-a-PPC64-kernel-bug.  That leaves me wondering what if
anything we want to do about it.  Even if it's fixed reasonably promptly
in Linux upstream, and then we successfully nag assorted vendors to
incorporate the fix quickly, that's still going to leave us with frequent
buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux
versions.

Should we move the infinite_recurse test to happen alone in a parallel
group just to stop these failures?  That's annoying from a parallelism
standpoint, but I don't see any other way to avoid these failures.

regards, tom lane




Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-14 Thread Alvaro Herrera
On 2019-Oct-13, Justin Pryzby wrote:

> Looks like it's a race condition and dereferencing *holder=NULL.  The first
> crash was probably the same bug, due to report query running during "reindex
> CONCURRENTLY", and probably finished at nearly the same time as another 
> locker.

Ooh, right, makes sense.  There's another spot with the same mistake ...
this patch should fix it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 70f9b6729a..5f4ee86f70 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -388,8 +388,9 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 			{
 PGPROC	   *holder = BackendIdGetProc(old_snapshots[i].backendId);
 
-pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
-			 holder->pid);
+if (holder)
+	pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
+ holder->pid);
 			}
 			VirtualXactLock(old_snapshots[i], true);
 		}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 4682438114..63c4188efc 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -908,8 +908,9 @@ WaitForLockersMultiple(List *locktags, LOCKMODE lockmode, bool progress)
 			{
 PGPROC	   *holder = BackendIdGetProc(lockholders->backendId);
 
-pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
-			 holder->pid);
+if (holder)
+	pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
+ holder->pid);
 			}
 			VirtualXactLock(*lockholders, true);
 			lockholders++;


tuplesort test coverage

2019-10-14 Thread Andres Freund
Hi,

[1] made me look at tuplesort's test coverage at
https://coverage.postgresql.org/src/backend/utils/sort/tuplesort.c.gcov.html
We don't have coverage for a quite a number of things:
- cluster for expression indexes (line 935)
- sorts exceeding INT_MAX / 2 memory (line 1337), but that seems hard to
  test realistically
- aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)
- in memory backwards scans (lines 1936, 3042)
- *any* coverage for TSS_SORTEDONTAPE (line 1964)
- disk sort skiptuples (line 2325)
- mergeruns without abbrev key (line 2582)
- disk sorts with more than one run (lines 2707, 2789)
- any disk based tuplesort_begin_heap() (lines 3649, 3676)
- Seems copytup_index currently is essentially dead, because
  tuplesort_putindextuplevalues() doesn't use COPYTUP (line 4142)
- any disk based tuplesort_begin_datum (lines 4282, 4323)

I'm pretty unhappy that tuplesort has been whacked around pretty heavily
in the last few years, while *reducing* effective test coverage
noticeably, rather than increasing it.  There are pretty substantial and
nontrivial areas without any tests - do we actually have any
confidence that they work?

The largest culprits for that seem to be abbreviated keys, the tape
logic overhaul, and the increase of work mem.

Greetings,

Andres Freund




Re: Columns correlation and adaptive query optimization

2019-10-14 Thread legrand legrand
Hello Konstantin,

What you have proposed regarding join_selectivity and multicolumn statistics
is very good news!

Regarding your auto_explain modification, maybe an "advisor" mode would also
be helpful (with auto_explain_add_statistics_threshold=-1, for example).
This would make it possible to track which missing statistics should be tested
(manually or in another environment).

In my view this advice should be an option of the EXPLAIN command, which
would also allow the auto_explain module to offer a "learning" phase.

Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: dropping column prevented due to inherited index

2019-10-14 Thread Michael Paquier
On Sat, Oct 12, 2019 at 03:08:41PM +0900, Michael Paquier wrote:
> On Fri, Oct 11, 2019 at 06:39:47PM -0300, Alvaro Herrera wrote:
>> Typo "resursing".  This comment seems a bit too long to me.  Maybe
>> "Recursion having ended, drop everything that was collected." suffices.
>> (Fits in one line.)
> 
> Sounds fine to me, thanks.

I have been able to look at this issue once again, and applied the fix
down to v12.  Thanks, all!
--
Michael


signature.asc
Description: PGP signature


Re: Fix most -Wundef warnings

2019-10-14 Thread Mark Dilger




On 10/13/19 12:25 PM, Peter Eisentraut wrote:

diff --git a/contrib/hstore/hstore_compat.c b/contrib/hstore/hstore_compat.c
index 1d4e7484e4..d75e9cb23f 100644
--- a/contrib/hstore/hstore_compat.c
+++ b/contrib/hstore/hstore_compat.c
@@ -299,7 +299,7 @@ hstoreUpgrade(Datum orig)
  
  	if (valid_new)

{
-#if HSTORE_IS_HSTORE_NEW
+#ifdef HSTORE_IS_HSTORE_NEW
elog(WARNING, "ambiguous hstore value resolved as hstore-new");


Checking the current sources, git history, and various older commits, I 
did not find where HSTORE_IS_HSTORE_NEW was ever defined.  I expect it 
was defined at some point, but I checked back as far as 9.0 (where the 
current contrib/hstore was originally committed) and did not see it. 
Where did you find this, and can we add a code comment?  This one #ifdef 
is the only line in the entire repository where this label is used, 
making it hard to check if changing from #if was the right decision.


The check on HSTORE_IS_HSTORE_NEW goes back at least as far as 2006, 
suggesting it was needed for migrating from some version pre-9.0, making 
me wonder if anybody would need this in the field.  Should we drop 
support for this?  I don't have a strong reason to advocate dropping 
support other than that this #define appears to be undocumented.


mark




Re: BRIN index which is much faster never chosen by planner

2019-10-14 Thread David Rowley
On Tue, 15 Oct 2019 at 08:43, Jeremy Finzel  wrote:
> I wanted to follow up on this specific issue.  Isn't this the heart of the 
> matter and a fundamental problem?  If I want to rely on BRIN indexes as in a 
> straightforward case as explained in OP, but I don't know if the planner will 
> be nearly reliable enough, how can I depend on them in production?  Is this 
> not considered a planner bug or should this kind of case be documented as 
> problematic for BRIN?  As another way to look at it: is there a configuration 
> parameter that could be added specific to BRIN or bitmapscan to provide help 
> to cases like this?
>
> On freshly analyzed tables, I tried my original query again and found that 
> even with now() - 3 days it does not choose the BRIN index.  In fact, it 
> chose another btree on the table like (id1, id2, rec_insert_time).  With warm 
> cache, the pg-chosen plan takes 40 seconds to execute, whereas when I force a 
> BRIN scan it takes only 4 seconds.

Another thing which you might want to look at is the correlation
column in the pg_stats view for the rec_insert_time column. Prior
to 7e534adcd, BRIN indexes were costed based only on the selectivity
estimate; no account was taken of the fact that the pages for those
records might have been spread out over the entire table.
Post 7e534adcd, we use the correlation estimate to attempt to estimate
how many pages (more specifically "ranges") we're likely to hit based
on that and the selectivity estimate. This commit intended to fix the
issue we had with BRIN indexes being selected far too often.  Of
course, the correlation is based on the entire table, if there are
subsets of the table that are perhaps perfectly correlated, then the
planner is not going to know about that. It's possible that some of
your older rec_insert_times are spread out far more than the newer
ones.  As a test, you could try creating a new table and copying the
records over to it in rec_insert_time order and seeing if the BRIN
index is selected for that table (after having performed an ANALYZE).

It would be interesting if you could show the pg_stats row for the
column so that we can see if the correlation is low.

You can see from the code below that the final selectivity is strongly
influenced by the correlation value (REF: brincostestimate)

qualSelectivity = clauselist_selectivity(root, indexQuals,
baserel->relid,
JOIN_INNER, NULL);

/* work out the actual number of ranges in the index */
indexRanges = Max(ceil((double) baserel->pages / statsData.pagesPerRange),
  1.0);

/*
* Now calculate the minimum possible ranges we could match with if all of
* the rows were in the perfect order in the table's heap.
*/
minimalRanges = ceil(indexRanges * qualSelectivity);

/*
* Now estimate the number of ranges that we'll touch by using the
* indexCorrelation from the stats. Careful not to divide by zero (note
* we're using the absolute value of the correlation).
*/
if (*indexCorrelation < 1.0e-10)
estimatedRanges = indexRanges;
else
estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);

/* we expect to visit this portion of the table */
selec = estimatedRanges / indexRanges;

CLAMP_PROBABILITY(selec);


My overall view on this is that the BRIN index is not that great here,
since using it doesn't eliminate that many rows.

From above we see:

>  Bitmap Heap Scan on foo.log_table l  (cost=2391.71..24360848.29 rows=735 
> width=99) (actual time=824.133..21329.054 rows=466 loops=1)
   Output: 
   Recheck Cond:
(l.rec_insert_time >= (now() - '10 days'::interval))
   Rows Removed by Index Recheck: 8187584
   Filter: ((l.field1 IS NOT NULL)
AND (l.category = 'music'::name))
   Rows Removed by Filter: 19857107
   Heap Blocks: lossy=1509000

So you have just 466 rows matching these quals, but the executor had
to scan 1.5 million pages to get those and filter out 8.1 million rows
on the recheck then 19.8 million on the filter. You've mentioned that
the table's heap is 139 GB, which is about 18 million pages.  It seems
your query would perform much better if you had a btree index such as
(category, rec_insert_time) where field1 is not null.

Of course, you've mentioned that the plan that uses the BRIN index
executes more quickly, but I think you're going to find BRIN unreliable
for anything other than INSERT-only tables where records are always
inserted with an ever-increasing or decreasing value in the BRIN-indexed
column.  If you start performing UPDATEs, that's going to create holes
that new records will fill, causing the correlation to drop and the
BRIN index scan cost to go up.

On the other hand, if you think you can do better than what was done
in 7e534adcd, then it would be good to see 

Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-14 Thread Justin Pryzby
On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
> BTW it probably should've been documented as an "Open Item" for v12.

https://commitfest.postgresql.org/25/2278/
I realized possibly people were thinking of that as a "feature" and not a
bugfix for backpatch (?)

But, my issue is a query which worked under v11 PWJ but fails under v12
(apparently broken by d25ea01275).

In my mind, if the planner doesn't support that query with PWJ, I think it
should run without PWJ rather than fail.

Justin




Re: BRIN index which is much faster never chosen by planner

2019-10-14 Thread Jeremy Finzel
>
> The other issue is that the estimation of pages fetched using bitmap
> heap scan is rather crude - but that's simply hard, and I don't think we
> can fundamentally improve this.
>

I wanted to follow up on this specific issue.  Isn't this the heart of the
matter and a fundamental problem?  If I want to rely on BRIN indexes as in
a straightforward case as explained in OP, but I don't know if the planner
will be nearly reliable enough, how can I depend on them in production?  Is
this not considered a planner bug or should this kind of case be documented
as problematic for BRIN?  As another way to look at it: is there a
configuration parameter that could be added specific to BRIN or bitmapscan
to provide help to cases like this?

On freshly analyzed tables, I tried my original query again and found that
even with now() - 3 days it does not choose the BRIN index.  In fact it
chose another btree on the table like (id1, id2, rec_insert_time).  With
warm cache, the pg-chosen plan takes 40 seconds to execute, whereas when I
force a BRIN scan it takes only 4 seconds.

I could understand more if the execution times were close, but the actual
BRIN index is orders of magnitude faster than the plan Postgres is
choosing.  I appreciate the feedback on this very much, as I am quite eager
to use BRIN indexes!!!

Thanks,
Jeremy


Re: [HACKERS] Block level parallel vacuum

2019-10-14 Thread Dilip Kumar
On Mon, Oct 14, 2019 at 3:10 PM Amit Kapila  wrote:
>
> On Sat, Oct 12, 2019 at 4:50 PM Amit Kapila  wrote:
> > On Sat, Oct 12, 2019 at 11:29 AM Masahiko Sawada  
> > wrote:
> > >
> >
> > I see a much bigger problem with the way this patch collects the index
> > stats in shared memory.  IIUC, it allocates the shared memory (DSM)
> > for all the index stats, in the same way, considering its size as
> > IndexBulkDeleteResult.  For the first time, it gets the stats from
> > local memory as returned by ambulkdelete/amvacuumcleanup call and then
> > copies it in shared memory space.  There onwards, it always updates
> > the stats in shared memory by pointing each index stats to that
> > memory.  In this scheme, you overlooked the point that an index AM
> > could choose to return a larger structure of which
> > IndexBulkDeleteResult is just the first field.  This generally
> > provides a way for ambulkdelete to communicate additional private data
> > to amvacuumcleanup.  We use this idea in the gist index, see how
> > gistbulkdelete and gistvacuumcleanup works. The current design won't
> > work for such cases.
> >
>
> Today, I looked at gistbulkdelete and gistvacuumcleanup closely and I
> have a few observations about those which might help us to solve this
> problem for gist indexes:
> 1. Are we using memory context GistBulkDeleteResult->page_set_context?
>  It seems to me it is not being used.
It also appears to me that it's not being used.

> 2. Each time we perform gistbulkdelete, we always seem to reset the
> GistBulkDeleteResult stats, see gistvacuumscan.  So, how will it
> accumulate it for the cleanup phase when the vacuum needs to call
> gistbulkdelete multiple times because the available space for
> dead-tuple is filled.  It seems to me like we only use the stats from
> the very last call to gistbulkdelete.
IIUC, it is fine to use the stats from the latest gistbulkdelete call
because we are trying to collect the information about empty pages
while scanning the tree.  So I think it would be fine to just use the
information collected from the latest scan; otherwise we would get
duplicate information.

> 3. Do we really need to give the responsibility of deleting empty
> pages (gistvacuum_delete_empty_pages) to gistvacuumcleanup.  Can't we
> do it in gistbulkdelete?  I see one advantage of postponing it till the
> cleanup phase which is if somehow we can accumulate stats over
> multiple calls of gistbulkdelete, but I am not sure if it is feasible.
It seems that we want to use the latest result. That might be the
reason for postponing it to the cleanup phase.


> At least, the way current code works, it seems that there is no
> advantage to postpone deleting empty pages till the cleanup phase.
>
> If we avoid postponing deleting empty pages till the cleanup phase,
> then we don't have the problem for gist indexes.
>
> This is not directly related to this patch, so we can discuss these
> observations in a separate thread as well, but before that, I wanted
> to check your opinion to see if this makes sense to you as this will
> help us in moving this patch forward.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: d25ea01275 and partitionwise join

2019-10-14 Thread Justin Pryzby
On Thu, Sep 19, 2019 at 05:15:37PM +0900, Amit Langote wrote:
> Please find attached updated patches.

Tom pointed me to this thread, since we hit it in 12.0
https://www.postgresql.org/message-id/flat/16802.1570989962%40sss.pgh.pa.us#070f6675a11dff17760b1cfccf1c038d

I can't say much about the patch; there's a little typo:
"The nullability of inner relation keys prevents them to"
..should say "prevent them from".

In order to compile it against REL12, I tried to cherry-pick this one:
3373c715: Speed up finding EquivalenceClasses for a given set of rels

But then it crashes in check-world (possibly due to misapplied hunks).

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581




Add A Glossary

2019-10-14 Thread Corey Huinker
Attached is a v1 patch to add a Glossary to the appendix of our current
documentation.

I believe that our documentation needs a glossary for a few reasons:

1. It's hard to ask for help if you don't know the proper terminology of
the problem you're having.

2. Readers who are new to databases may not understand a few of the terms
that are used casually both in the documentation and in forums. This helps
to make our documentation a bit more useful as a teaching tool.

3. Readers whose primary language is not English may struggle to find the
correct search terms, and this glossary may help them grasp that a given
term has a usage in databases that is different from common English usage.

3b. If we are not able to find the resources to translate all of the
documentation into a given language, translating the glossary page would be
a good first step.

4. The glossary would be web-searchable, and draw viewers to the official
documentation.

5. Adding link anchors to each term would make them cite-able and useful in
forum conversations.


A few notes about this patch:

1. It's obviously incomplete. There are more terms, a lot more, to add.

2. The individual definitions supplied are off-the-cuff, and should be
thoroughly reviewed.

3. The definitions as a whole should be reviewed by an actual tech writer
(one was initially involved but had to step back due to prior commitments),
and the definitions should be normalized in terms of voice, tone, audience,
etc.

4. My understanding of DocBook is not strong. The glossary vs glosslist tag
issue is a bit confusing to me, and I'm not sure if the glossary tag is
even appropriate for our needs.

5. I've made no effort at making each term an anchor, nor have I done any
CSS styling at all.

6. I'm not quite sure how to handle terms that have different definitions
in different contexts. Should that be two glossdefs following one
glossterm, or two separate def/term pairs?

Please review and share your thoughts.
From 343d5c18bf23f98341b510595e3e042e002242cb Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Sun, 13 Oct 2019 17:57:36 +
Subject: [PATCH] add glossary page with sample terms and definitions

---
 doc/src/sgml/filelist.sgml  |   1 +
 doc/src/sgml/glossary.sgml  | 618 
 doc/src/sgml/stylesheet.css |   2 +
 3 files changed, 621 insertions(+)
 create mode 100644 doc/src/sgml/glossary.sgml

diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 3da2365ea9..504c8a6326 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 00..016eee2d76
--- /dev/null
+++ b/doc/src/sgml/glossary.sgml
@@ -0,0 +1,618 @@
+
+
+
+ Glossary
+
+ 
+  This is a list of terms and their in the context of PostgreSQL and databases in general.
+
+  
+
+   
+Aggregate
+
+ 
+  To combine a collection of data values into a single value, whose value
+may not be of the same type as the original values. Aggregate functions combine
+multiple rows that share a common set of values into one row, which means that
+the only data visible in the values in common, and the aggregates of the
+non-common data.
+ 
+
+   
+
+   
+Analytic
+
+ 
+  A function whose computed value can reference values found in nearby rows
+of the same result set.
+ 
+
+   
+
+   
+Atomic
+
+ 
+  In reference to the value of an Attribute or Datum: cannot be broken up
+into smaller components.
+ 
+ 
+  In reference to an operation: An event that cannot be completed in part:
+it must either entirely succeed or entirely fail. A series of SQL statements can
+be combined into a Transaction, and that transaction is said to be Atomic.
+ 
+
+   
+
+   
+Attribute
+
+ 
+  A typed data element found within a Tuple or Relation or Table.
+ 
+
+   
+
+   
+Cast
+
+ 
+  A conversion of a Datum from its current data type to another data type.
+ 
+
+   
+
+   
+Check Constraint
+
+ 
+  A type of constraint defined on a relation which restricts the values
+allowed in one or more Attributes. The check constraint can make reference to
+any Attribute in the Relation, but cannot reference other rows of the same
+relation or other relations.
+ 
+
+   
+
+   
+Column
+
+ 
+  An Attribute found in a Table or View.
+ 
+
+   
+
+   
+Commit
+
+ 
+  The act of finalizing a Transaction within the database.
+ 
+
+   
+
+   
+Concurrency
+
+ 
+  The concept that multiple independent operations can be happening within
+the database at the same time.
+ 
+
+   
+
+   
+Constraint
+
+ 
+  A method of restricting the values of data allowed within a Table.
+ 
+
+   
+
+   
+Datum
+
+ 
+  The internal 

Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-14 Thread Justin Pryzby
On Sun, Oct 13, 2019 at 06:06:43PM +0900, Michael Paquier wrote:
> On Fri, Oct 11, 2019 at 07:44:46PM -0500, Justin Pryzby wrote:
> > Unfortunately, there was no core file, and I'm still trying to reproduce it.
> 
> Forgot to set ulimit -c?  Having a backtrace would surely help.

Fortunately (?) another server hit crashed last night.
(Doesn't appear to be relevant, but this table has no 
inheritance/partition-ness).

Looks like it's a race condition and dereferencing *holder=NULL.  The first
crash was probably the same bug, due to report query running during "reindex
CONCURRENTLY", and probably finished at nearly the same time as another locker.

Relevant code introduced here:

commit ab0dfc961b6a821f23d9c40c723d11380ce195a6
Author: Alvaro Herrera 
Date:   Tue Apr 2 15:18:08 2019 -0300

Report progress of CREATE INDEX operations

Needs to be conditionalized (as anticipated by the comment)

+   if (holder)

pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,

 holder->pid);


Core was generated by `postgres: postgres ts [local] REINDEX  '.
Program terminated with signal 11, Segmentation fault.

#0  WaitForLockersMultiple (locktags=locktags@entry=0x1d30548, 
lockmode=lockmode@entry=5, progress=progress@entry=true) at lmgr.c:911
#1  0x005c2ac8 in ReindexRelationConcurrently 
(relationOid=relationOid@entry=17618, options=options@entry=0) at 
indexcmds.c:3090
#2  0x005c328a in ReindexIndex (indexRelation=, 
options=0, concurrent=) at indexcmds.c:2352
#3  0x007657fe in standard_ProcessUtility (pstmt=pstmt@entry=0x1d05468, 
queryString=queryString@entry=0x1d046e0 "REINDEX INDEX CONCURRENTLY 
loaded_cdr_files_filename",
context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, 
queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x1d05548,
completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at utility.c:787
#4  0x7f21517204ef in pgss_ProcessUtility (pstmt=0x1d05468, 
queryString=0x1d046e0 "REINDEX INDEX CONCURRENTLY loaded_cdr_files_filename", 
context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") 
at pg_stat_statements.c:1006
#5  0x00762816 in PortalRunUtility (portal=0x1d7a4e0, pstmt=0x1d05468, 
isTopLevel=, setHoldSnapshot=, dest=0x1d05548,
completionTag=0x7ffc05e6c0a0 "") at pquery.c:1175
#6  0x00763267 in PortalRunMulti (portal=portal@entry=0x1d7a4e0, 
isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, 
dest=dest@entry=0x1d05548,
altdest=altdest@entry=0x1d05548, 
completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at pquery.c:1328
#7  0x00763e45 in PortalRun (portal=, 
count=9223372036854775807, isTopLevel=, run_once=, dest=0x1d05548, altdest=0x1d05548,
completionTag=0x7ffc05e6c0a0 "") at pquery.c:796
#8  0x0075ff45 in exec_simple_query (query_string=) at 
postgres.c:1215
#9  0x00761212 in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:4236
#10 0x00483d02 in BackendRun (port=, port=) at postmaster.c:4431
#11 BackendStartup (port=0x1d2b340) at postmaster.c:4122
#12 ServerLoop () at postmaster.c:1704
#13 0x006f0b1f in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1cff280) at postmaster.c:1377
#14 0x00484c93 in main (argc=3, argv=0x1cff280) at main.c:228

bt f

#0  WaitForLockersMultiple (locktags=locktags@entry=0x1d30548, 
lockmode=lockmode@entry=5, progress=progress@entry=true) at lmgr.c:911
holder = 0x0
lockholders = 0x1d9b778
holders = 
lc = 0x1d9bf80
total = 
done = 1
#1  0x005c2ac8 in ReindexRelationConcurrently 
(relationOid=relationOid@entry=17618, options=options@entry=0) at 
indexcmds.c:3090
heapRelationIds = 0x1d30360
indexIds = 0x1d303b0
newIndexIds = 
relationLocks = 
lockTags = 
lc = 0x0
lc2 = 0x0
private_context = 
oldcontext = 
relkind = 105 'i'
relationName = 0x0
relationNamespace = 0x0
ru0 = {tv = {tv_sec = 30592544, tv_usec = 7232025}, ru = {ru_utime = 
{tv_sec = 281483566645394, tv_usec = 75668733820930}, ru_stime = {tv_sec = 0, 
tv_usec = 30592272}, {
  ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, 
__ru_ixrss_word = 0}, {ru_idrss = 105, __ru_idrss_word = 105}, {ru_isrss = 
-926385342574214656, 
  __ru_isrss_word = -926385342574214656}, {ru_minflt = 8924839, 
__ru_minflt_word = 8924839}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 
17618, 
  __ru_nswap_word = 17618}, {ru_inblock = 139781327898864, 
__ru_inblock_word = 139781327898864}, {ru_oublock = 30430312, __ru_oublock_word 
= 30430312}, {
  ru_msgsnd = 139781327898864, __ru_msgsnd_word = 139781327898864}, 
{ru_msgrcv = 

Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-14 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
>> Could you provide a self-contained test case please?

> [ test case ]

Oh, this is the same issue Amit described in 

https://www.postgresql.org/message-id/flat/CA%2BHiwqG2WVUGmLJqtR0tPFhniO%3DH%3D9qQ%2BZ3L_ZC%2BY3-EVQHFGg%40mail.gmail.com

namely that we're not generating EquivalenceClass members corresponding
to sub-joins of a partitionwise join.

Are you interested in helping to test the patches proposed there?

regards, tom lane




Re: stress test for parallel workers

2019-10-14 Thread Andres Freund
Hi,

On 2019-10-13 13:44:59 +1300, Thomas Munro wrote:
> On Sun, Oct 13, 2019 at 1:06 PM Tom Lane  wrote:
> > I don't think any further proof is required that this is
> > a kernel bug.  Where would be a good place to file it?
> 
> linuxppc-...@lists.ozlabs.org might be the right place.
> 
> https://lists.ozlabs.org/listinfo/linuxppc-dev

Probably requires reproducing on a pretty recent kernel first, to have a
decent chance of being investigated...

Greetings,

Andres Freund




Re: stress test for parallel workers

2019-10-14 Thread Tom Lane
Andres Freund  writes:
> Probably requires reproducing on a pretty recent kernel first, to have a
> decent chance of being investigated...

How recent do you think it needs to be?  The machine I was testing on
yesterday is under a year old:

uname -m = ppc64le
uname -r = 4.18.19-100.fc27.ppc64le
uname -s = Linux
uname -v = #1 SMP Wed Nov 14 21:53:32 UTC 2018

The latest-by-version-number ppc64 kernel I can find in the buildfarm
is bonito,

uname -m = ppc64le
uname -r = 4.19.15-300.fc29.ppc64le
uname -s = Linux
uname -v = #1 SMP Mon Jan 14 16:21:04 UTC 2019

and that's certainly shown it too.

regards, tom lane




Columns correlation and adaptive query optimization

2019-10-14 Thread Konstantin Knizhnik

Hi hackers,

Errors in selectivity estimation are one of the main reasons for bad plan
generation by the Postgres optimizer.
Postgres estimates selectivity based on the collected statistics
(histograms).
While it can estimate the selectivity of a simple predicate on a particular
table more or less precisely, it is much more difficult to estimate
selectivity for the result of a join of several tables, or for a complex
predicate consisting of several conjuncts/disjuncts accessing different
columns.

Postgres is not able to take into account correlation between columns
unless the corresponding multicolumn statistics are explicitly created.
But even if such statistics are created, they cannot be used in join
selectivity estimation.


The problem of adjusting selectivity using machine learning based on
the results of EXPLAIN ANALYZE was addressed in the AQO project:


https://github.com/postgrespro/aqo

There are still many issues with the proposed AQO approach (for example, it
doesn't take into account concrete constant values).

We are going to continue improving it.

But here I want to propose a much simpler patch which allows two things:
1. Use extended statistics in join selectivity estimation.
2. Create multicolumn statistics on demand in the auto_explain extension if
there is a large gap between the real and estimated number of tuples for a
concrete plan node.
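
To illustrate the second point, the decision in auto_explain could look
roughly like the sketch below.  This is an illustration of the mechanism
only, not code from the attached patch; suggest_statistics() and
AddMultiColumnStatisticsForNode() are made-up names.

	static void
	suggest_statistics(PlanState *planstate, double threshold)
	{
		Plan	   *plan = planstate->plan;
		double		actual;

		if (planstate->instrument == NULL)
			return;

		/* average actual rows per loop, comparable to plan->plan_rows */
		actual = planstate->instrument->ntuples /
			Max(planstate->instrument->nloops, 1);

		/* a large underestimate suggests creating extended statistics */
		if (threshold > 0 && plan->plan_rows > 0 &&
			actual / plan->plan_rows >= threshold)
			AddMultiColumnStatisticsForNode(planstate);	/* hypothetical */
	}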



create table inner_tab(x integer, y integer);
create table outer_tab(pk integer primary key, x integer, y integer);
create index on inner_tab(x,y);
insert into outer_tab values (generate_series(1,10), 
generate_series(1,10), generate_series(1,10)*10);
insert into inner_tab values (generate_series(1,100)/10, 
generate_series(1,100)/10*10);

analyze inner_tab;
analyze outer_tab;


Without this patch:
explain select * from outer_tab join inner_tab using(x,y) where pk=1;
  QUERY PLAN
--
 Nested Loop  (cost=0.72..16.77 rows=1 width=12)
   ->  Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31 
rows=1 width=12)

 Index Cond: (pk = 1)
   ->  Index Only Scan using inner_tab_x_y_idx on inner_tab 
(cost=0.42..8.45 rows=1 width=8)

 Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)


With this patch:

load 'auto_explain';
set auto_explain.log_min_duration=0;
set auto_explain.add_statistics_threshold=10.0;
set auto_explain.log_analyze=on;
select * from outer_tab join inner_tab using(x,y) where pk=1;
analyze inner_tab;
analyze outer_tab;

explain select * from outer_tab join inner_tab using(x,y) where pk=1;
   QUERY PLAN

 Nested Loop  (cost=0.72..32.79 rows=10 width=12)
   ->  Index Scan using outer_tab_pkey on outer_tab (cost=0.29..8.31 
rows=1 width=12)

 Index Cond: (pk = 1)
   ->  Index Only Scan using inner_tab_x_y_idx on inner_tab 
(cost=0.42..24.38 rows=10 width=8)

 Index Cond: ((x = outer_tab.x) AND (y = outer_tab.y))
(5 rows)


As you can see, the estimate of the join result is now correct (10).

I attached two patches: one for using extended statistics for join
selectivity estimation and another for auto_explain to implicitly add
these extended statistics on demand.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 4bf777d..a356df9 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,7 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 
 /*
@@ -159,6 +160,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -170,6 +174,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
   varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -181,7 +188,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -213,6 +219,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more 

Re: Non-Active links being referred in our source code

2019-10-14 Thread vignesh C
On Tue, Oct 8, 2019 at 10:35 AM Michael Paquier  wrote:
>
> On Mon, Oct 07, 2019 at 05:11:40PM +0200, Juan José Santamaría Flecha wrote:
> > About the broken links in win32_port.h, they are all referring to
> > ntstatus. As for first case that shows the code groups, there is an up
> > to date alternative. There is also an alternative for second case that
> > points to their codes and descriptions. On the other hand, the last
> > case is quoting a document that is no longer available, I would
> > suggest to rephrase the comment, thus eliminating the quote.
> >
> > Please find attached a patch with the proposed alternatives.
>
> Thanks Juan for the patch.  I have checked your suggestions and it
> looks good to me, so committed.  Good idea to tell about
> WIN32_NO_STATUS.  I have noticed one typo on the way.
About pg_crc.h, I have made the changes with the correct links.
The patch for the same is attached.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


0001-Updated-CRC-links-that-are-not-working.patch
Description: Binary data


Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-14 Thread Tom Lane
Justin Pryzby  writes:
> On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
>> BTW it probably should've been documented as an "Open Item" for v12.

> https://commitfest.postgresql.org/25/2278/
> I realized possibly people were thinking of that as a "feature" and not a
> bugfix for backpatch (?)
> But, my issue is a query which worked under v11 PWJ but fails under v12
> (apparently broken by d25ea01275).

Yeah, this should have been dealt with as an open item, but it
slipped through the cracks.  We'll make sure to get it fixed,
one way or another, for 12.1.

In view of the proposed patches being dependent on some other
13-only changes, I wonder if we should fix v12 by reverting
d25ea0127.  The potential planner performance loss for large
partition sets could be nasty, but failing to plan at all is worse.

regards, tom lane




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-10-14 Thread Dilip Kumar
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
 wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra 
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast.  Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 100)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something.  As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work.  As of now, I have to check
> >> >via debugging.  Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill.  It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to adding all stats from that patch,
> including those related to streaming.  We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
I have extracted the spilling-related code into a separate patch on top
of 0001.  I have also fixed some bugs and addressed review comments,
attached as a separate patch.  Later I can merge it into the main patch
if you agree with the changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.patch
Description: Binary data


bugs_and_review_comments_fix.patch
Description: Binary data


0002-Track-statistics-for-spilling.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2019-10-14 Thread nil socket
Sorry to intervene in between,

But what about timeline change?


Thank you.

Re: [HACKERS] Block level parallel vacuum

2019-10-14 Thread Amit Kapila
On Sat, Oct 12, 2019 at 4:50 PM Amit Kapila  wrote:
> On Sat, Oct 12, 2019 at 11:29 AM Masahiko Sawada  
> wrote:
> >
>
> I see a much bigger problem with the way this patch collects the index
> stats in shared memory.  IIUC, it allocates the shared memory (DSM)
> for all the index stats, in the same way, considering its size as
> IndexBulkDeleteResult.  For the first time, it gets the stats from
> local memory as returned by ambulkdelete/amvacuumcleanup call and then
> copies it in shared memory space.  There onwards, it always updates
> the stats in shared memory by pointing each index stats to that
> memory.  In this scheme, you overlooked the point that an index AM
> could choose to return a larger structure of which
> IndexBulkDeleteResult is just the first field.  This generally
> provides a way for ambulkdelete to communicate additional private data
> to amvacuumcleanup.  We use this idea in the gist index, see how
> gistbulkdelete and gistvacuumcleanup works. The current design won't
> work for such cases.
>

Today, I looked at gistbulkdelete and gistvacuumcleanup closely and I
have a few observations about those which might help us to solve this
problem for gist indexes:
1. Are we using memory context GistBulkDeleteResult->page_set_context?
 It seems to me it is not being used.
2. Each time we perform gistbulkdelete, we always seem to reset the
GistBulkDeleteResult stats; see gistvacuumscan.  So how will the stats be
accumulated for the cleanup phase when vacuum needs to call gistbulkdelete
multiple times because the available space for dead tuples is filled?  It
seems to me that we only use the stats from the very last call to
gistbulkdelete.
3. Do we really need to give the responsibility of deleting empty
pages (gistvacuum_delete_empty_pages) to gistvacuumcleanup?  Can't we
do it in gistbulkdelete?  I see one advantage of postponing it till the
cleanup phase, which is that we could somehow accumulate stats over
multiple calls of gistbulkdelete, but I am not sure that is feasible.
At least the way the current code works, there seems to be no
advantage to postponing the deletion of empty pages till the cleanup phase.

If we avoid postponing deleting empty pages till the cleanup phase,
then we don't have the problem for gist indexes.

This is not directly related to this patch, so we can discuss these
observations in a separate thread as well, but before that, I wanted
to check your opinion to see if this makes sense to you as this will
help us in moving this patch forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Remove obsolete information schema tables

2019-10-14 Thread Peter Eisentraut
I propose this patch to remove the information schema tables
SQL_LANGUAGES, which was eliminated in SQL:2008, and SQL_PACKAGES, which
was eliminated in SQL:2011.  Since they were dropped by the SQL
standard, the information in them was no longer updated and therefore no
longer useful.

This also removes the feature-package association information in
sql_feature_packages.txt, but for the time being we are keeping the
information about which features are in the Core package (that is, mandatory
SQL features).  Maybe at some point someone will want to invent a way to
store that which does not involve using the "package" mechanism
anymore.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 304270da9168134df77d59c537268bd6bcaf1a06 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 14 Oct 2019 09:56:38 +0200
Subject: [PATCH] Remove obsolete information schema tables

Remove SQL_LANGUAGES, which was eliminated in SQL:2008, and
SQL_PACKAGES, which was eliminated in SQL:2011.  Since they were
dropped by the SQL standard, the information in them was no longer
updated and therefore no longer useful.

This also removes the feature-package association information in
sql_feature_packages.txt, but for the time being we are keeping the
information about which features are in the Core package (that is, mandatory
SQL features).  Maybe at some point someone will want to invent a way to
store that which does not involve using the "package" mechanism
anymore.
---
 doc/src/sgml/features.sgml   |   9 +-
 doc/src/sgml/information_schema.sgml | 154 ---
 src/backend/catalog/information_schema.sql   |  50 --
 src/backend/catalog/sql_feature_packages.txt |  29 
 src/backend/catalog/sql_features.txt |   2 -
 src/test/regress/expected/sanity_check.out   |   2 -
 6 files changed, 3 insertions(+), 243 deletions(-)

diff --git a/doc/src/sgml/features.sgml b/doc/src/sgml/features.sgml
index f767bee46e..2c5a7e5d0c 100644
--- a/doc/src/sgml/features.sgml
+++ b/doc/src/sgml/features.sgml
@@ -44,10 +44,7 @@ SQL Conformance
   broad three levels found in SQL-92.  A large
   subset of these features represents the Core
   features, which every conforming SQL implementation must supply.
-  The rest of the features are purely optional.  Some optional
-  features are grouped together to form packages, which
-  SQL implementations can claim conformance to, thus claiming
-  conformance to particular groups of features.
+  The rest of the features are purely optional.
  
 
  
@@ -116,7 +113,7 @@ Supported Features
   

 Identifier
-Package
+Core?
 Description
 Comment

@@ -143,7 +140,7 @@ Unsupported Features
   

 Identifier
-Package
+Core?
 Description
 Comment

diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index 906fe7819f..7d3be1afdb 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -4963,160 +4963,6 @@ sql_implementation_info 
Columns
   
  
 
- 
-  sql_languages
-
-  
-   The table sql_languages contains one row for
-   each SQL language binding that is supported by
-   PostgreSQL.
-   PostgreSQL supports direct SQL and
-   embedded SQL in C; that is all you will learn from this table.
-  
-
-  
-   This table was removed from the SQL standard in SQL:2008, so there
-   are no entries referring to standards later than SQL:2003.
-  
-
-  
-   sql_languages Columns
-
-   
-
- 
-  Name
-  Data Type
-  Description
- 
-
-
-
- 
-  sql_language_source
-  character_data
-  
-   The name of the source of the language definition; always
-   ISO 9075, that is, the SQL standard
-  
- 
-
- 
-  sql_language_year
-  character_data
-  
-   The year the standard referenced in
-   sql_language_source was approved.
-  
- 
-
- 
-  sql_language_conformance
-  character_data
-  
-   The standard conformance level for the language binding.  For
-   ISO 9075:2003 this is always CORE.
-  
- 
-
- 
-  sql_language_integrity
-  character_data
-  Always null (This value is relevant to an earlier version of the 
SQL standard.)
- 
-
- 
-  sql_language_implementation
-  character_data
-  Always null
- 
-
- 
-  sql_language_binding_style
-  character_data
-  
-   The language binding style, either DIRECT or
-   EMBEDDED
-  
- 
-
- 
-  sql_language_programming_language
-  character_data
-  
-   The programming language, if the binding style is
-   EMBEDDED, else null.  
PostgreSQL only
-   supports the language C.
-  
- 
-
-   
-  
- 
-
- 
-  sql_packages
-
-  
-   The table sql_packages contains information
-  

Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-14 Thread Antonin Houska
Masahiko Sawada  wrote:

> On Wed, Oct 9, 2019 at 3:57 PM Antonin Houska  wrote:
> >
> > Moon, Insung  wrote:
> >
> > v04-0011-Make-buffile.c-aware-of-encryption.patch in [1] changes buffile.c so
> > that data is read and written in 8kB blocks if encryption is enabled. In order
> > to record the IV per block, the computation of the buffer position within the
> > file would have to be adjusted somehow. I can check it soon but not in the
> > next few days.
> 
> As far as I read the patch the nonce consists of pid, counter and
> block number where the counter is the number incremented each time of
> creating a BufFile. Therefore it could happen to rewrite the buffer
> data with the same nonce and key, which is bad.

This patch was written before the requirement for a non-repeating IV was raised,
and it does not use the AES-CTR mode. I mentioned it here because it reads /
writes data in 8kB blocks.

> So I think we can have the rewrite counter of the block in the each 8k
> block header. And then the nonce consists of block number within a
> segment file (4 bytes), temp file counter (8 bytes), rewrite counter
> (2 bytes) and CTR mode counter (2 bytes). And then if we have a
> single-use encryption key per backend processes I guess we can
> guarantee the uniqueness of the combination of key and nonce.

Since the segment size is 1 GB, the segment consists of 2^17 blocks, so 4 bytes
will not be fully utilized.

As for the "CTR mode counter", consider that it gets incremented once per 16
bytes of input. So even if BLCKSZ is 32 kB, we need no more than 11 bits for
this counter.

If these two parts become smaller, we can perhaps increase the size of the
"rewrite counter".

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-10-14 Thread Michael Paquier
On Fri, Oct 04, 2019 at 01:39:38PM -0700, Andres Freund wrote:
> but that reasoning seems bogus to me. For one, on just about any
> platform close always closes the fd, even when returning an error
> (unless you pass in a bad fd, in which case it obviously doesn't). So
> the reasoning that this fixes unnoticed fd leaks doesn't really seem to
> make sense. For another, even if it did, throwing an ERROR seems to
> achieve very little: We continue with a leaked fd *AND* we cause the
> operation to error out.

I have read the commit log again a couple of times, and it talks about
letting users know that an fd is leaking, not about fixing the leak.
Still, we get to know about it, while previously that was not possible.
In some cases we may see errors in close() after a previous write(2).
Of course this does not apply to all the code paths patched here, but
it seems to me that this is a good habit to spread, no?

> I can see reasoning for:
> - LOG, so it can be noticed, but operations continue to work
> - FATAL, to fix the leak
> - PANIC, so we recover from the problem, in case of the close indicating
>   a durability issue

LOG or WARNING would not be visible enough and would likely be missed
by users.  I am not sure this justifies FATAL either, and PANIC
would cause more harm than necessary, so for most of these call sites
ERROR sounded like a good compromise.  Still, the elevel choice is not
arbitrary: it depends on the code path patched, because the elevel used
is kept consistent with the error handling of the surroundings.
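
For reference, the checks being discussed have roughly the following shape (a
sketch assuming the usual backend headers; the elevel, ERROR here, is chosen
per call site to match the surrounding error handling):

    fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not open file \"%s\": %m", path)));

    /* ... read or write ... */

    if (CloseTransientFile(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));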

> And if your goal was just to achieve consistency, I also don't
> understand, because you left plenty close()'s unchecked? E.g. you added
> an error check in get_controlfile(), but not one in
> ReadControlFile(). alterSystemSetConfigFile() writes, but you didn't add
> one.

Because I had not considered those when looking at transient files.
That may be worth an extra look.
--
Michael


signature.asc
Description: PGP signature