Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-11 Thread Michael Paquier
On Fri, Jul 12, 2019 at 07:49:13AM +0200, Julien Rouhaud wrote:
> It shouldn't be a problem; I reused the same infrastructure as
> vacuumdb.  run_reindex_command has a new "async" parameter, so when
> there's no parallelism it uses executeMaintenanceCommand (instead
> of PQsendQuery), which will block until query completion.  That's why
> there's no isFree usage at all in this case.

My point is more about consistency and simplification with the case
where n > 1: we could actually move the async and sync code paths
under the same banner, as the async mode also waits until a slot is
free, or in short until the query completes.
--
Michael




Re: Unknown type name bool

2019-07-11 Thread Igal Sapir
On Thu, Jul 11, 2019 at 10:27 PM Michael Paquier wrote:

> On Thu, Jul 11, 2019 at 10:21:06PM -0700, Igal Sapir wrote:
> > Any thoughts?  (disclaimer: I have much more experience with Java than C)
>
> We don't support cmake directly.  Here is the documentation about how
> to build the beast:
> https://www.postgresql.org/docs/current/install-procedure.html


Thank you, Michael, but my goal is not to just build from source, but to
run Postgres in an IDE.  I tried CLion because it's modern and cross
platform, but I am open to other IDEs.

What IDEs do Postgres hackers use (other than vi with gcc)?  Is there any
documentation or posts on how to set up the project in an IDE?

Thanks,

Igal


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-11 Thread Julien Rouhaud
On Fri, Jul 12, 2019 at 3:20 AM Michael Paquier  wrote:
>
> On Thu, Jul 11, 2019 at 06:22:25PM +0200, Julien Rouhaud wrote:
> > I think that it makes the code quite cleaner to have the selection
> > outside ConsumeIdleSlot.
>
> Actually, you have an issue with ConsumeIdleSlot() if there is only
> one parallel slot, no?  In this case the current patch immediately
> returns the available slot without waiting.  I think that we
> should wait until the slot becomes free in that case as well, and
> switch isFree to false.

It shouldn't be a problem; I reused the same infrastructure as
vacuumdb.  run_reindex_command has a new "async" parameter, so when
there's no parallelism it uses executeMaintenanceCommand (instead
of PQsendQuery), which will block until query completion.  That's why
there's no isFree usage at all in this case.
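
For illustration, the dispatch boils down to something like this (a rough
sketch only, not the actual patch; the variable and field names here are
just placeholders):

if (!parallel)
{
	/*
	 * Synchronous path: executeMaintenanceCommand() blocks until the
	 * query completes, so the slot's isFree flag never needs checking.
	 */
	executeMaintenanceCommand(conn, sql, echo);
}
else
{
	/*
	 * Asynchronous path: send the query and return immediately; the
	 * slot machinery later polls the connection for completion.
	 */
	PQsendQuery(slot->connection, sql);
	slot->isFree = false;
}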

>  If you want to keep things split, that's
> fine by me; I would still use "Get" within the name of the routine,
> and rename the other to get_idle_slot_internal() or
> get_idle_slot_guts() to point out that it has an internal role.

OK, I'll change it to get_idle_slot_internal then.




Re: Unknown type name bool

2019-07-11 Thread Michael Paquier
On Thu, Jul 11, 2019 at 10:21:06PM -0700, Igal Sapir wrote:
> Any thoughts?  (disclaimer: I have much more experience with Java than C)

We don't support cmake directly.  Here is the documentation about how
to build the beast:
https://www.postgresql.org/docs/current/install-procedure.html
--
Michael




Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-07-11 Thread Amit Langote
Horiguchi-san,

Thanks for the comment.  My reply is a bit late now, but

On Wed, Jul 10, 2019 at 5:39 PM Kyotaro Horiguchi wrote:
> At Wed, 10 Jul 2019 16:35:18 +0900, Amit Langote wrote:
> - * Switch to appropriate context for constructing querytrees (again,
> - * these must outlive the execution context).
> + * Switch to appropriate context for constructing query and plan trees
> + * (again, these must outlive the execution context).  Normally, it's
> + * MessageContext, but if there are more queries to plan, we use a
> + * temporary child context that will be reset after executing this
> + * query.  We avoid that overhead of setting up a separate context
> + * for the common case of having just a single query.
>
> It might be stupid, but I feel uneasy about "usually it must live in
> MessageContext, but not necessarily if there is a succeeding
> query"...  *I* need more explanation of why it is safe to use that
> short-lived context.

So the problem we're trying to solve with this is that the memory
consumed when parsing/planning individual statements piles up in a
single context (currently, MessageContext), which can lead to severe
bloat, especially if the planning of individual statements consumes a
huge amount of memory.  The solution is to use a sub-context of
MessageContext for each statement that's reset when its execution is
finished.  I think it's safe to use a shorter-lived context for each
statement because the memory of a given statement should not need to
be referenced once its execution finishes.  Do you see any problem
with that assumption?
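
In code terms the idea is roughly this (only a sketch, not the actual
patch, using the usual memory context API from utils/memutils.h):

/* short-lived child of MessageContext for the current statement */
MemoryContext stmt_cxt = AllocSetContextCreate(MessageContext,
											   "per-statement context",
											   ALLOCSET_DEFAULT_SIZES);
MemoryContext oldcxt = MemoryContextSwitchTo(stmt_cxt);

/* ... parse, rewrite and plan the current statement here ... */

MemoryContextSwitchTo(oldcxt);
/* ... execute the statement ... */
MemoryContextDelete(stmt_cxt);	/* everything the statement allocated goes away */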

Thanks,
Amit




Unknown type name bool

2019-07-11 Thread Igal Sapir
I am trying to build Postgres in an IDE (CLion on Ubuntu), and I'm getting
the following error message:

In file included from /workspace/src/postgres/src/include/c.h:61:0,
 from /workspace/src/postgres/src/include/postgres.h:46,
 from /workspace/src/postgres/contrib/bloom/blcost.c:13:
/workspace/src/postgres/src/include/common/string.h:13:8: error: unknown
type name ‘bool’
 extern bool pg_str_endswith(const char *str, const char *end);

CLion created a CMakeLists.txt file with the following at the top:

cmake_minimum_required(VERSION 3.14)
project(postgres)
set(CMAKE_CXX_STANDARD 14)

And my compiler version is: gcc version 7.4.0 (Ubuntu
7.4.0-1ubuntu1~18.04.1)

Any thoughts?  (disclaimer: I have much more experience with Java than C)

Thanks,

Igal


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Masahiko Sawada
On Fri, Jul 12, 2019 at 7:37 AM Bruce Momjian  wrote:
>
> On Wed, Jul 10, 2019 at 12:26:24PM -0400, Bruce Momjian wrote:
> > On Wed, Jul 10, 2019 at 08:31:17AM -0400, Joe Conway wrote:
> > > Please see my other reply (and
> > > https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
> > > appendix C as pointed out by Ryan downthread).
> > >
> > > At least in my mind, I trust a published specification from the
> > > nation-state level over random blogs or wikipedia. If we can find some
> > > equivalent published standards that contradict NIST we should discuss
> > > it, but for my money I would prefer to stick with the NIST recommended
> > > method to produce the IVs.
> >
> > So, we have had a flurry of activity on this thread in the past day, so
> > let me summarize:
>
> Seems we have an updated approach:
>
> First, we need to store the symmetric encryption key in the data
> directory, like we do for SSL certificates and private keys.  (Crash
> recovery needs access to this key, so we can't easily store it in a
> database table.)  We will pattern it after the GUC
> ssl_passphrase_command.   We will need to decide on a format for the
> symmetric encryption key in the file so we can check that the supplied
> passphrase properly unlocks the key.
>
> Our first implementation will encrypt the entire cluster.  We can later
> consider encryption per table or tablespace.  It is unclear if
> encrypting different parts of the system with different keys is useful
> or feasible.  (This is separate from key rotation.)
>
> We will use CBC AES128 mode for tables/indexes, and CTR AES128 for WAL.
> 8k pages will use the LSN as a nonce, which will be encrypted to
> generate the initialization vector (IV).  We will not encrypt the first
> 16 bytes of each page so the LSN can be used in this way.  The WAL will
> use the WAL file segment number as the nonce and the IV will be created
> in the same way.
>
> wal_log_hints will be enabled automatically in encryption mode, like we
> do for checksum mode, so we never encrypt different 8k pages with the
> same IV.

I guess that two different pages can have the same LSN when a heap
update modifies both the page for the old tuple and the page for the
new tuple.

heapam.c:3707
recptr = log_heap_update(relation, buffer,
 newbuf, &oldtup, heaptup,
 old_key_tuple,
 all_visible_cleared,
 all_visible_cleared_new);
if (newbuf != buffer)
{
PageSetLSN(BufferGetPage(newbuf), recptr);
}
PageSetLSN(BufferGetPage(buffer), recptr);

Wouldn't it be a problem?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Brazil disables DST - 2019b update

2019-07-11 Thread Robins Tharakan
On Fri, 12 Jul 2019 at 14:04, Michael Paquier  wrote:

> On Fri, Jul 12, 2019 at 01:42:59PM +1000, Robins Tharakan wrote:
> So 2019b has been released on the 1st of July.  Usually tzdata updates
> happen just before a minor release, so this would get pulled in at the
> beginning of August (https://www.postgresql.org/developer/roadmap/).
> Tom, I guess that would be again the intention here?
> --
> Michael
>

An August release does give a little more comfort.  (I was expecting that
the August date would get pushed out, since 11.4 was an emergency release
at the end of June.)

-
robins


Re: pg_stat_database update stats_reset only by pg_stat_reset

2019-07-11 Thread Michael Paquier
On Thu, Jul 11, 2019 at 04:34:20PM +0200, Daniel Verite wrote:
> I can understand why you'd want resetting the stats for a single object
> not to reset the per-database timestamp, but this would revert an 8+
> year old decision that seems intentional and has apparently not been
> criticized since then (based on searching for
> pg_stat_reset_single_table_counters in the archives).  More opinions are
> probably needed in favor of this change (or against, in which case the
> fate of the patch might be a rejection).

I agree with Daniel that breaking an 8-year-old behavior may not be to
the taste of folks relying on the current behavior, particularly
because we have not had complaints about the current behavior being
bad.  So -1 from me.
--
Michael




Re: Excessive memory usage in multi-statement queries w/ partitioning

2019-07-11 Thread Amit Langote
On Thu, Jul 11, 2019 at 3:46 AM Tom Lane  wrote:
>
> Amit Langote  writes:
> > Attached updated patch.  Thanks again.
>
> Pushed with a bit of further cleanup

Thanks a lot.

> --- most notably, the way
> you had execute_sql_string(), it was still leaking any cruft
> ProcessUtility might generate.  We can fix that by running
> ProcessUtility in the per-statement context too.

Ah, I was thinking only about planning.

> I also dropped the optimization for a single/last statement in
> execute_sql_string(), and simplified it to just always create
> and delete a child context.  This was based on a couple of
> thoughts.  The norm in this code path is that there's multiple
> statements, probably lots of them, so that the percentage savings
> from getting rid of one context creation is likely negligible.
> Also, unlike exec_simple_query, we *don't* know that the outer
> context is due to be cleared right afterwards.  Since
> execute_sql_string() can run multiple times in one extension
> command, in principle we could get bloat from not cleaning up
> after the last command of each string.   Admittedly, it's not
> likely that you'd have so many strings involved that that
> amounts to a lot, but between following upgrade-script chains
> and cascaded module loads, there could be more than a couple.
> So it seems like the argument for saving a context creation is
> much weaker here than in exec_simple_query.

Agreed.

> I tried to improve the comments too.  I noticed that the bit about
> "(again, these must outlive the execution context)" seemed to be
> a dangling reference --- whatever previous comment it was referring
> to is not to be found anymore.  So I made that self-contained.

Thanks.

Regards,
Amit




Re: Brazil disables DST - 2019b update

2019-07-11 Thread Michael Paquier
On Fri, Jul 12, 2019 at 01:42:59PM +1000, Robins Tharakan wrote:
> The 2019b DST update [1] disables DST for Brazil. This would take effect
> starting November 2019. The last DST update in Postgres was 2019a in v11.3
> (since this update came in after the recent-most Postgres release).
> 
> Since a ~3 month release cycle may be too close for some users, are there
> any plans for an early 11.5 (or are such occurrences not a candidate for an
> early release)?
> 
> Reference:
> a) https://mm.icann.org/pipermail/tz-announce/2019-July/56.html

So 2019b has been released on the 1st of July.  Usually tzdata updates
happen just before a minor release, so this would get pulled in at the
beginning of August (https://www.postgresql.org/developer/roadmap/).
Tom, I guess that would be again the intention here?
--
Michael




Re: Comment fix of config_default.pl

2019-07-11 Thread Michael Paquier
On Fri, Jul 12, 2019 at 12:15:29PM +0900, Kyotaro Horiguchi wrote:
> In src/tools/msvc/config_default.pl, parameter "perl" requires a
> path string, not a bool differently from that of configure
> script. --with-python has the same characteristics and the
> comment is suggesting that.
> 
> We need to fix --with-perl and --with-uuid the same way.
>
> + uuid  => undef,# --with-ossp-uuid=

--with-ossp-uuid is an obsolete spelling.  Wouldn't it be better to
replace it with --with-uuid=?  That would be a bit inconsistent
with configure, which can only take a set of hardcoded names, but
there is little point in keeping an option which would get removed
sooner rather than later?
--
Michael




Brazil disables DST - 2019b update

2019-07-11 Thread Robins Tharakan
Hi,

The 2019b DST update [1] disables DST for Brazil. This would take effect
starting November 2019. The last DST update in Postgres was 2019a in v11.3
(since this update came in after the recent-most Postgres release).

Since a ~3 month release cycle may be too close for some users, are there
any plans for an early 11.5 (or are such occurrences not a candidate for an
early release)?

Reference:
a) https://mm.icann.org/pipermail/tz-announce/2019-July/56.html
-
robins


Comment fix of config_default.pl

2019-07-11 Thread Kyotaro Horiguchi
Hello.

In src/tools/msvc/config_default.pl, parameter "perl" requires a
path string, not a bool differently from that of configure
script. --with-python has the same characteristics and the
comment is suggesting that.

We need to fix --with-perl and --with-uuid the same way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/tools/msvc/config_default.pl b/src/tools/msvc/config_default.pl
index 2553636dc1..9f37ba53f8 100644
--- a/src/tools/msvc/config_default.pl
+++ b/src/tools/msvc/config_default.pl
@@ -18,10 +18,10 @@ our $config = {
 	nls   => undef,# --enable-nls=
 	tap_tests => undef,# --enable-tap-tests
 	tcl   => undef,# --with-tcl=
-	perl  => undef,# --with-perl
+	perl  => undef,# --with-perl=
 	python=> undef,# --with-python=
 	openssl   => undef,# --with-openssl=
-	uuid  => undef,# --with-ossp-uuid
+	uuid  => undef,# --with-ossp-uuid=
 	xml   => undef,# --with-libxml=
 	xslt  => undef,# --with-libxslt=
 	iconv => undef,# (not in configure, path to iconv)


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Bruce Momjian
On Thu, Jul 11, 2019 at 06:37:41PM -0400, Bruce Momjian wrote:
> wal_log_hints will be enabled automatically in encryption mode, like we
> do for checksum mode, so we never encrypt different 8k pages with the
> same IV.

Checksum mode can be enabled in encrypted clusters to detect modified
data, since the checksum is encrypted.  The WAL already has checksums.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Peter Geoghegan
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan  wrote:
> > I think unique indexes may benefit from deduplication not only because
> > of NULL values.  Non-HOT updates produce duplicates of non-NULL values
> > in unique indexes.  And those duplicates can take significant space.
>
> I agree that we should definitely have an open mind about unique
> indexes, even with non-NULL values. If we can prevent a page split by
> deduplicating the contents of a unique index page, then we'll probably
> win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting
within _bt_check_unique() is generally more important than the more
complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even
though the _bt_check_unique() thing can only be used with unique
indexes. Also, I have often thought that we don't do enough to take
advantage of the special characteristics of unique indexes -- they
really are quite different. I believe that other database systems do
this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one
tuples of the same value that are visible to any possible MVCC
snapshot. Within the index AM, there is little difference between an
UPDATE by a transaction and a DELETE + INSERT of the same value by a
transaction. If there are 3 or 5 duplicates within a unique index,
then there is a strong chance that VACUUM could reclaim some of them,
given the chance. It is worth going to a little effort to find out.

In a traditional serial/bigserial primary key, the key space that is
typically "owned" by the left half of a rightmost page split describes
a range of about ~366 items, with few or no gaps for other values that
didn't exist at the time of the split (i.e. the two pivot tuples on
each side cover a range that is equal to the number of items itself).
If the page ever splits again, the chances of it being due to non-HOT
updates are perhaps 100%.  Maybe VACUUM just didn't get around to the
index in time, or maybe there is a long running xact, or whatever. If
we can delay page splits in indexes like this, then we could easily
prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes
will probably always be LP_DEAD bits set within _bt_check_unique(),
because it costs so little -- same as today. We could also add a
second line of defense: deduplication -- same as with non-unique
indexes with the patch. But we can even add a third line of defense on
top of those two: more aggressive reclaiming of posting list space, by
going to the heap to check the visibility status of earlier posting
list entries. We can do this optimistically when there is no LP_DEAD
bit set, based on heuristics.

The high level principle here is that we can justify going to a small
amount of extra effort for the chance to avoid a page split, and maybe
even more than a small amount. Our chances of reversing the split by
merging pages later on are almost zero. The two halves of the split
will probably each get dirtied again and again in the future if we
cannot avoid it, plus we have to dirty the parent page, and the old
sibling page (to update its left link). In general, a page split is
already really expensive. We could do something like amortize the cost
of accessing the heap a second time for tuples that we won't have
considered setting the LP_DEAD bit on within _bt_check_unique() by
trying the *same* heap page a *second* time where possible (distinct
values are likely to be nearby on the same page). I think that an
approach like this could work quite well for many workloads. You only
pay a cost (visiting the heap an extra time) when it looks like you'll
get a benefit (not splitting the page).

As you know, Andres already changed nbtree to get an XID for conflict
purposes on the primary by visiting the heap a second time (see commit
558a9165e08), when we need to actually reclaim LP_DEAD space. I
anticipated that we could extend this to do more clever/eager/lazy
cleanup of additional items before that went in, which is a closely
related idea. See:

https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7

I know that this is a bit hand-wavy; the details certainly need to be
worked out. However, it is not so different to the "ghost bit" design
that other systems use with their non-unique indexes (though this idea
applies specifically to unique indexes in our case). The main
difference is that we're going to the heap rather than to UNDO,
because that's where we store our visibility information. That doesn't
seem like such a big difference -- we are also reasonably confident
that we'll find that the TID is dead, even without LP_DEAD bits being
set, because we only do the extra stuff with unique indexes. And, we
do it lazily.

--
Peter Geoghegan




Re: Comment typo in tableam.h

2019-07-11 Thread Brad DeJong
More typos in tableam.h along with a few grammar changes.


v1-more-typos-in-tableam.h.patch
Description: Binary data


Re: [Patch] Mingw: Fix import library extension, build actual static libraries

2019-07-11 Thread Noah Misch
On Tue, Jul 09, 2019 at 08:26:52AM -0400, Andrew Dunstan wrote:
> On 4/16/19 1:22 AM, Noah Misch wrote:
> > On Thu, Mar 07, 2019 at 03:23:50PM +0100, Sandro Mani wrote:
> >> Related, no actual static libraries are produced alongside the respective
> >> dlls. The attached patch 0002-Build-static-libraries.patch addresses this,
> >> in a similar fashion as is already done for the AIX case in Makefile.shlib.
> > We don't build static libraries on AIX, though Makefile.shlib uses the
> > $(stlib) variable to get a name for the *.a shared library it makes.  Here's
> > an example of one AIX Makefile.shlib build sequence, from
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet=2019-04-15%2022%3A35%3A52=make
> >
> > rm -f libpq.a
> > ar crs libpq.a fe-auth.o fe-auth-scram.o fe-connect.o fe-exec.o fe-misc.o 
> > fe-print.o fe-lobj.o fe-protocol2.o fe-protocol3.o pqexpbuffer.o 
> > fe-secure.o libpq-events.o encnames.o wchar.o fe-secure-openssl.o 
> > fe-secure-common.o
> > touch libpq.a
> > ../../../src/backend/port/aix/mkldexport.sh libpq.a libpq.so.5 >libpq.exp
> > xlc_r -qmaxmem=33554432 -D_LARGE_FILES=1  -qnoansialias -g -O2 
> > -qmaxmem=16384 -qsrcmsg -D_REENTRANT -D_THREAD_SAFE 
> > -D_POSIX_PTHREAD_SEMANTICS  -o libpq.so.5 libpq.a -Wl,-bE:libpq.exp 
> > -L../../../src/port -L../../../src/common -lpgcommon_shlib -lpgport_shlib  
> > -L/home/nm/sw/nopath/libxml2-64/lib -L/home/nm/sw/nopath/icu58.2-64/lib  
> > -L/home/nm/sw/nopath/uuid-64/lib -L/home/nm/sw/nopath/openldap-64/lib 
> > -L/home/nm/sw/nopath/icu58.2-64/lib -L/home/nm/sw/nopath/libxml2-64/lib 
> > -Wl,-blibpath:'/home/nm/farm/xlc64/HEAD/inst/lib:/home/nm/sw/nopath/libxml2-64/lib:/home/nm/sw/nopath/icu58.2-64/lib:/home/nm/sw/nopath/uuid-64/lib:/home/nm/sw/nopath/openldap-64/lib:/home/nm/sw/nopath/icu58.2-64/lib:/home/nm/sw/nopath/libxml2-64/lib:/usr/lib:/lib'
> >   -Wl,-bnoentry -Wl,-H512 -Wl,-bM:SRE -lintl -lssl -lcrypto -lm -lldap_r 
> > -llber -lpthreads
> > rm -f libpq.a
> > ar crs libpq.a libpq.so.5
> 
> I'm wondering if it wouldn't be better to set the value of stlib for
> windows, instead of changes like this:
> 
> -    $(CC) $(CFLAGS)  -shared -static-libgcc -o $@  $(OBJS) $(LDFLAGS)
> $(LDFLAGS_SL) $(SHLIB_LINK) $(LIBS) -Wl,--export-all-symbols
> -Wl,--out-implib=$(stlib)
> +    $(CC) $(CFLAGS)  -shared -static-libgcc -o $@  $(OBJS) $(LDFLAGS)
> $(LDFLAGS_SL) $(SHLIB_LINK) $(LIBS) -Wl,--export-all-symbols
> -Wl,--out-implib=lib$(NAME).dll.a

If we're going to have both static libs and import libs, they need different
names.  But it may be an improvement to set implib = lib$(NAME).dll.a and use
$(implib) in Make recipes like this one.

> I'm also wondering if changing this will upset third party authors.

I'd make it master-only, but I don't expect big trouble.  GNU ld does pick
libFOO.dll.a when both that and libFOO.a are available.  Folks probably have
scripts that copy libpq.a to libpq.lib for the benefit of MSVC, and those
would need to change.




XLogRecGetFullXid()

2019-07-11 Thread Thomas Munro
Hello hackers,

Here is a small patch extracted from the undo log patch set that I'd
like to discuss separately and commit soon.  I'm pretty sure that
zheap, zedstore, and anyone else developing new AMs based on 64-bit
xids will need this, but there are no plans to extend WAL records to
actually carry 64-bit xids yet, and I want to discourage people from
making generic xid-expanding functions that don't have any
interlocking, as I mentioned recently[1].

It's defined in xlogreader.c, because that's where such things
naturally live, but it can only work during replay for now, so I
wrapped it in a FRONTEND invisibility cloak.  That means that
front-end tools (pg_waldump.c) can't use it and will have to continue
to show 32-bit xids for now.
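
To give an idea of the shape, the conversion amounts to something like the
sketch below (a simplified illustration, not necessarily the attached
patch; it assumes the epoch is taken from ShmemVariableCache->nextFullXid,
which is only safe to consult like this during replay):

#ifndef FRONTEND
FullTransactionId
XLogRecGetFullXid(XLogReaderState *record)
{
	TransactionId		xid = XLogRecGetXid(record);
	FullTransactionId	next_fxid = ShmemVariableCache->nextFullXid;
	uint32				epoch = EpochFromFullTransactionId(next_fxid);

	/*
	 * An xid that appears to be ahead of nextFullXid must belong to the
	 * previous epoch; this deduction only holds while nextFullXid cannot
	 * advance underneath us, hence the replay-only restriction.
	 */
	if (xid > XidFromFullTransactionId(next_fxid))
		epoch--;

	return FullTransactionIdFromEpochAndXid(epoch, xid);
}
#endif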

Better ideas?

[1] 
https://www.postgresql.org/message-id/CA+hUKGJPuKR7i7UvmXRXhjhdW=3v1-nso3afn4xdldkbjru...@mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-XLogRecGetFullXid-function.patch
Description: Binary data


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-11 Thread Michael Paquier
On Thu, Jul 11, 2019 at 06:22:25PM +0200, Julien Rouhaud wrote:
> I think that it makes the code quite cleaner to have the selection
> outside ConsumeIdleSlot.

Actually, you have an issue with ConsumeIdleSlot() if there is only
one parallel slot, no?  In this case the current patch immediately
returns the available slot without waiting.  I think that we
should wait until the slot becomes free in that case as well, and
switch isFree to false.  If you want to keep things split, that's
fine by me; I would still use "Get" within the name of the routine,
and rename the other to get_idle_slot_internal() or
get_idle_slot_guts() to point out that it has an internal role.

> You're talking about getQuerySuccess, right?  That was actually the
> original comment of a function I renamed.  +1 to improve it, but this
> function is in common.c and doesn't deal with parallel slots at all, so
> I'll just drop the slang parts.

If we can design a clean interface with better comments, we can use
this occasion to browse the whole thing and make it better.
--
Michael




Re: Memory-Bounded Hash Aggregation

2019-07-11 Thread Jeff Davis
On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:
> Makes sense. I haven't thought about how the hybrid approach would be
> implemented very much, so I can't quite judge how complicated it would
> be to extend "approach 1" later. But if you think it's a sensible first
> step, I trust you. And I certainly agree we need something to compare
> the other approaches against.

Is this a duplicate of your previous email?

I'm slightly confused but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.

You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.

Regards,
Jeff Davis

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1550..d2f97d5fce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1702,6 +1702,23 @@ include_dir 'conf.d'
   
  
 
+ 
+  hashagg_mem_overflow (boolean)
+  
+   hashagg_mem_overflow configuration parameter
+  
+  
+  
+   
+ If hash aggregation exceeds work_mem at query
+ execution time, and hashagg_mem_overflow is set
+ to on, continue consuming more memory rather than
+ performing disk-based hash aggregation. The default
+ is off.
+   
+  
+ 
+
  
   max_stack_depth (integer)
   
@@ -4354,6 +4371,24 @@ ANY num_sync ( 
+  enable_hashagg_spill (boolean)
+  
+   enable_hashagg_spill configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's use of hashed aggregation plan
+types when the memory usage is expected to
+exceed work_mem. This only affects the planner
+choice; actual behavior at execution time is dictated by
+. The default
+is on.
+   
+  
+ 
+
  
   enable_hashjoin (boolean)
   
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dff2ed3f97..a5b7b73b13 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -102,6 +102,7 @@ static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 			 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1826,6 +1827,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Agg:
 			show_agg_keys(castNode(AggState, planstate), ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (es->analyze)
+show_hashagg_info((AggState *) planstate, es);
 			if (plan->qual)
 show_instrumentation_count("Rows Removed by Filter", 1,
 		   planstate, es);
@@ -2715,6 +2718,56 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	}
 }
 
+/*
+ * If EXPLAIN ANALYZE, show information on hash aggregate memory usage and
+ * batches.
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+	Agg		*agg	   = (Agg *)aggstate->ss.ps.plan;
+	long	 memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+	long	 diskKb= (aggstate->hash_disk_used + 1023) / 1024;
+
+
+	Assert(IsA(aggstate, AggState));
+
+	if (agg->aggstrategy != AGG_HASHED &&
+		agg->aggstrategy != AGG_MIXED)
+		return;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfoSpaces(es->str, es->indent * 2);
+		appendStringInfo(
+			es->str,
+			"Memory Usage: %ldkB",
+			memPeakKb);
+
+		if (aggstate->hash_batches_used > 0)
+		{
+			appendStringInfo(
+es->str,
+"  Batches: %d  Disk Usage:%ldkB",
+aggstate->hash_batches_used, diskKb);
+		}
+
+		appendStringInfo(
+			es->str,
+			"\n");
+	}
+	else
+	{
+		ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+		if (aggstate->hash_batches_used > 0)
+		{
+			ExplainPropertyInteger("HashAgg Batches", NULL,
+   aggstate->hash_batches_used, es);
+			ExplainPropertyInteger("Disk Usage", "kB", diskKb, es);
+		}
+	}
+}
+
 /*

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Bruce Momjian
On Thu, Jul 11, 2019 at 08:41:52PM -0400, Joe Conway wrote:
> On 7/11/19 6:37 PM, Bruce Momjian wrote:
> > Our first implementation will encrypt the entire cluster.  We can later
> > consider encryption per table or tablespace.  It is unclear if
> > encrypting different parts of the system with different keys is useful
> > or feasible.  (This is separate from key rotation.)
> 
> I still object strongly to using a single key for the entire database. I
> think we can use a single key for WAL, but we need some way to split the
> heap so that multiple keys are used. If not by tablespace, then some
> other method.

What do you base this on?

> Regardless of the method to split the heap into different keys, I think
> there should be an option for some tables to not be encrypted. If we
> decide it must be all or nothing for the first implementation I guess I
> could live with it but would be very disappointed.

What does it mean you "could live this it"?  Why do you consider having
some tables unencrypted important?

> The keys themselves should be in a file which is encrypted by a master
> key.  Obtaining the master key should be patterned after the GUC
> ssl_passphrase_command.
> 
> > We will use CBC AES128 mode for tables/indexes, and CTR AES128 for WAL.
> > 8k pages will use the LSN as a nonce, which will be encrypted to
> > generate the initialization vector (IV).  We will not encrypt the first
> > 16 bytes of each page so the LSN can be used in this way.  The WAL will
> > use the WAL file segment number as the nonce and the IV will be created
> > in the same way.
> 
> I vote for AES 256 rather than 128.

Why?  This page seems to think 128 is sufficient:


https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc

For practical purposes, 128-bit keys are sufficient to ensure security.
The larger key sizes exist mostly to satisfy some US military
regulations which call for the existence of several distinct "security
levels", regardless of whether breaking the lowest level is already far
beyond existing technology.

We might need to run some benchmarks to determine the overhead of going
to AES256, because I am unclear of the security value.

> Did we determine that we no longer need table oid because LSNs are
> sufficiently unique?

I think so.

> > wal_log_hints will be enabled automatically in encryption mode, like we
> > do for checksum mode, so we never encrypt different 8k pages with the
> > same IV.
> 
> check
> 
> > There will need to be a pg_control field to indicate that encryption is
> > in use.
> 
> I didn't see that discussed but it makes sense.

Yes, it seems to make sense, but was not discussed.

> > Right now we don't support the online changing of a cluster's checksum
> > mode, so I suggest we create a utility like pg_checksums --enable to
> > allow offline key rotation.  Once we get online checksum mode changing
> > ability, we can look into using that for encryption key rotation.
> 
> Master key rotation should be trivial if we do it the way I discussed
> above. Rotating the individual heap and WAL keys would certainly be a
> bigger problem.

Yes, sorry, master key rotation is simple.  It is encryption key
rotation that I think needs a tool.

> Thinking out loud (and I believe somewhere in this massive thread
> someone else already said this), if we had a way to flag "key version"
> at the page level it seems like we could potentially rekey page-by-page
> while online, locking only one page at a time. We really only need to
> support 2 key versions and could ping-pong between them as they change.
> Or maybe this is a crazy idea.

Yes, we did talk about this.  It is certainly possible, but we would
still need a tool to guarantee all pages are using the new version, so I
am not sure what per-page buys us except making the later check faster. 
I don't see this as a version-1 feature, frankly.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-11 Thread Noah Misch
On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> Hello. Rebased the patch to master(bd56cd75d2).

It looks like you did more than just a rebase, because this v16 no longer
modifies many files that v14 did modify.  (That's probably good, since you had
pending review comments.)  What other changes did you make?




Re: base backup client as auxiliary backend process

2019-07-11 Thread Kyotaro Horiguchi
Hello.

At Sat, 29 Jun 2019 22:05:22 +0200, Peter Eisentraut wrote in
<61b8d18d-c922-ac99-b990-a31ba63cd...@2ndquadrant.com>
> Setting up a standby instance is still quite complicated.  You need to
> run pg_basebackup with all the right options.  You need to make sure
> pg_basebackup has the right permissions for the target directories.  The
> created instance has to be integrated into the operating system's start
> scripts.  There is this slightly awkward business of the --recovery-conf
> option and how it interacts with other features.  And you should
> probably run pg_basebackup under screen.  And then how do you get
> notified when it's done.  And when it's done you have to log back in and
> finish up.  Too many steps.
> 
> My idea is that the postmaster can launch a base backup worker, wait
> till it's done, then proceed with the rest of the startup.  initdb gets
> a special option to create a "minimal" data directory with only a few
> files, directories, and the usual configuration files.  Then you create
> a $PGDATA/basebackup.signal, start the postmaster as normal.  It sees
> the signal file, launches an auxiliary process that runs the base
> backup, then proceeds with normal startup in standby mode.
> 
> This makes a whole bunch of things much nicer: The connection
> information for where to get the base backup from comes from
> postgresql.conf, so you only need to specify it in one place.
> pg_basebackup is completely out of the picture; no need to deal with
> command-line options, --recovery-conf, screen, monitoring for
> completion, etc.  If something fails, the base backup process can
> automatically be restarted (maybe).  Operating system integration is
> much easier: You only call initdb and then pg_ctl or postgres, as you
> are already doing.  Automated deployment systems don't need to wait for
> pg_basebackup to finish: You only call initdb, then start the server,
> and then you're done -- waiting for the base backup to finish can be
> done by the regular monitoring system.
> 
> Attached is a very hackish patch to implement this.  It works like this:
> 
> # (assuming you have a primary already running somewhere)
> initdb -D data2 --minimal
> $EDITOR data2/postgresql.conf  # set primary_conninfo
> pg_ctl -D data2 start

Nice idea! 

> (Curious side note: If you don’t set primary_conninfo in these steps,
> then libpq defaults apply, so the default behavior might end up being
> that a given instance attempts to replicate from itself.)

We may be able to have different settings for primary and replica
if we could have sections in the configuration file; defining, say,
a [replica] section would give us more flexibility.  Though it is a
bit far from the topic, a dedicated command-line configuration
editor that can find and replace a specified parameter would
eliminate the subtle editing step.  It is annoying to find the
specific separator in the conf file, trim it, and then add the new
content.

> It works for basic cases.  It's missing tablespace support, proper
> fsyncing, progress reporting, probably more.  Those would be pretty

While catching up to the master, connections to the replica are
first accepted and then result in a FATAL error; I receive
inquiries about that now and then.  With the new feature, we would
also get FATAL errors during the base backup phase.  That could
alarm users more frequently.

> straightforward I think.  The interesting bit is the delicate ordering
> of the postmaster startup:  Normally, the pg_control file is read quite
> early, but if starting from a minimal data directory, we need to wait
> until the base backup is done.  There is also the question what you do
> if the base backup fails halfway through.  Currently you probably need
> to delete the whole data directory and start again with initdb.  Better
> might be a way to start again and overwrite any existing files, but that
> can clearly also be dangerous.  All this needs some careful analysis,
> but I think it's doable.
> 
> Any thoughts?

Just overwriting won't work, since files removed just before
retrying would be left alone on the replica.  I think it should work
similarly to initdb, that is, remove everything and then retry.

It's easy if we don't consider reducing startup time: just do
initdb and then start the existing postmaster internally.  But
melding them together leaves room for reducing the startup time.
We could even redirect read-only queries to the master while
setting up the server.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Joe Conway
On 7/11/19 6:37 PM, Bruce Momjian wrote:
> On Wed, Jul 10, 2019 at 12:26:24PM -0400, Bruce Momjian wrote:
>> On Wed, Jul 10, 2019 at 08:31:17AM -0400, Joe Conway wrote:
>>> Please see my other reply (and
>>> https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
>>> appendix C as pointed out by Ryan downthread).
>>>
>>> At least in my mind, I trust a published specification from the
>>> nation-state level over random blogs or wikipedia. If we can find some
>>> equivalent published standards that contradict NIST we should discuss
>>> it, but for my money I would prefer to stick with the NIST recommended
>>> method to produce the IVs.
>>
>> So, we have had a flurry of activity on this thread in the past day, so
>> let me summarize:
> 
> Seems we have an updated approach:

I tried to keep up with this thread, and may have failed, but comments
inline...

> First, we need to store the symmetric encryption key in the data
> directory, like we do for SSL certificates and private keys.  (Crash
> recovery needs access to this key, so we can't easily store it in a
> database table.)  We will pattern it after the GUC
> ssl_passphrase_command.   We will need to decide on a format for the
> symmetric encryption key in the file so we can check that the supplied
> passphrase properly unlocks the key.
> 
> Our first implementation will encrypt the entire cluster.  We can later
> consider encryption per table or tablespace.  It is unclear if
> encrypting different parts of the system with different keys is useful
> or feasible.  (This is separate from key rotation.)

I still object strongly to using a single key for the entire database. I
think we can use a single key for WAL, but we need some way to split the
heap so that multiple keys are used. If not by tablespace, then some
other method.

Regardless of the method to split the heap into different keys, I think
there should be an option for some tables to not be encrypted. If we
decide it must be all or nothing for the first implementation I guess I
could live with it but would be very disappointed.

The keys themselves should be in a file which is encrypted by a master
key.  Obtaining the master key should be patterned after the GUC
ssl_passphrase_command.

> We will use CBC AES128 mode for tables/indexes, and CTR AES128 for WAL.
> 8k pages will use the LSN as a nonce, which will be encrypted to
> generate the initialization vector (IV).  We will not encrypt the first
> 16 bytes of each page so the LSN can be used in this way.  The WAL will
> use the WAL file segment number as the nonce and the IV will be created
> in the same way.

I vote for AES 256 rather than 128.

Did we determine that we no longer need table oid because LSNs are
sufficiently unique?

> wal_log_hints will be enabled automatically in encryption mode, like we
> do for checksum mode, so we never encrypt different 8k pages with the
> same IV.

check

> There will need to be a pg_control field to indicate that encryption is
> in use.

I didn't see that discussed but it makes sense.

> Right now we don't support the online changing of a cluster's checksum
> mode, so I suggest we create a utility like pg_checksums --enable to
> allow offline key rotation.  Once we get online checksum mode changing
> ability, we can look into using that for encryption key rotation.

Master key rotation should be trivial if we do it the way I discussed
above. Rotating the individual heap and WAL keys would certainly be a
bigger problem.

Thinking out loud (and I believe somewhere in this massive thread
someone else already said this), if we had a way to flag "key version"
at the page level it seems like we could potentially rekey page-by-page
while online, locking only one page at a time. We really only need to
support 2 key versions and could ping-pong between them as they change.
Or maybe this is a crazy idea.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: heapam_index_build_range_scan's anyvisible

2019-07-11 Thread Ashwin Agrawal
On Mon, Jun 10, 2019 at 5:38 PM Ashwin Agrawal  wrote:

>
> On Mon, Jun 10, 2019 at 2:56 PM Andres Freund  wrote:
>
>> > Currently, all AMs need to build a HeapTuple in the
>> > index_build_range_scan function.  I looked into all the callback
>> > functions and only htup->t_self is used from the heap tuple in all of
>> > them (unless I missed something).  So, if it seems fine, I will be
>> > happy to write a patch to make that argument an ItemPointer instead
>> > of a HeapTuple?
>>
>> I wonder if it should better be the slot? It's not inconceivable that
>> some AMs could benefit from that. Although it'd add some complication
>> to the heap HeapTupleIsHeapOnly case.
>>
>
> I like the slot suggestion, but only if we can push FormIndexDatum() out
> of the AM code as a result and pass the slot to the callback.  Not sure
> what else it would help with.  HeapTupleIsHeapOnly could possibly be made
> to work with a slot, similar to the current hack of updating t_self and
> the slot's tid field, maybe.
>
> An index callback using the slot can form the index datum.  Though that
> would mean every index AM callback function needs to do it; not sure
> which way is better.  Plus, it would need to create an ExecutorState for
> that.  With the current setup every AM needs to do so.  It feels better
> if that belongs to the indexing code instead of the AM, though.
>
> Currently, the index build needing to create an ExecutorState and so on
> at the AM layer doesn't seem nice either.  Maybe there is fairly generic
> logic here that could be extracted into common code which most AMs could
> leverage.  Or possibly the API itself could be simplified to take minimal
> input from the AM and keep the rest of the flow/machinery at the upper
> layer.  As Robert pointed out at the start of the thread, at heart it's a
> pretty simple flow and will possibly remain the same for any AM.
>
>
Please find attached the patch to remove IndexBuildCallback's dependency on
HeapTuple, as discussed.  The argument is changed to an ItemPointer
instead of a HeapTuple.  Other, larger refactoring of the
index_build_range_scan API itself, if feasible, can be performed as
follow-up changes.
From f73b0061795f0c320f96ecfed0c0602ae318d73e Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Thu, 11 Jul 2019 16:58:50 -0700
Subject: [PATCH v1] Remove IndexBuildCallback's dependency on HeapTuple.

With IndexBuildCallback taking a HeapTuple as input, all table AMs,
whether or not they store tuples in HeapTuple form, are forced to
construct a HeapTuple to insert the tuple into the index.  Since the
only thing required by the index callbacks is the TID and not the
full tuple, modify the callback to take only an ItemPointer.
---
 src/backend/access/brin/brin.c   |  4 ++--
 src/backend/access/gin/gininsert.c   |  4 ++--
 src/backend/access/gist/gistbuild.c  |  6 +++---
 src/backend/access/hash/hash.c   |  8 
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/access/nbtree/nbtsort.c  |  8 
 src/backend/access/spgist/spginsert.c|  4 ++--
 src/include/access/tableam.h |  2 +-
 8 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd9..d27d503c7f1 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -597,7 +597,7 @@ brinendscan(IndexScanDesc scan)
  */
 static void
 brinbuildCallback(Relation index,
-  HeapTuple htup,
+  ItemPointer tid,
   Datum *values,
   bool *isnull,
   bool tupleIsAlive,
@@ -607,7 +607,7 @@ brinbuildCallback(Relation index,
 	BlockNumber thisblock;
 	int			i;
 
-	thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+	thisblock = ItemPointerGetBlockNumber(tid);
 
 	/*
 	 * If we're in a block that belongs to a future range, summarize what
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab146173..d19cbcb2f90 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -273,7 +273,7 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 }
 
 static void
-ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
+ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  bool *isnull, bool tupleIsAlive, void *state)
 {
 	GinBuildState *buildstate = (GinBuildState *) state;
@@ -285,7 +285,7 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 			   values[i], isnull[i],
-			   &htup->t_self);
+			   tid);
 
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size) maintenance_work_mem * 1024L)
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ecef0ff0724..2a004937078 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -80,7 +80,7 @@ typedef struct
 static void 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 12:26:24PM -0400, Bruce Momjian wrote:
> On Wed, Jul 10, 2019 at 08:31:17AM -0400, Joe Conway wrote:
> > Please see my other reply (and
> > https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
> > appendix C as pointed out by Ryan downthread).
> > 
> > At least in my mind, I trust a published specification from the
> > nation-state level over random blogs or wikipedia. If we can find some
> > equivalent published standards that contradict NIST we should discuss
> > it, but for my money I would prefer to stick with the NIST recommended
> > method to produce the IVs.
> 
> So, we have had a flurry of activity on this thread in the past day, so
> let me summarize:

Seems we have an updated approach:

First, we need to store the symmetric encryption key in the data
directory, like we do for SSL certificates and private keys.  (Crash
recovery needs access to this key, so we can't easily store it in a
database table.)  We will pattern it after the GUC
ssl_passphrase_command.   We will need to decide on a format for the
symmetric encryption key in the file so we can check that the supplied
passphrase properly unlocks the key.

Our first implementation will encrypt the entire cluster.  We can later
consider encryption per table or tablespace.  It is unclear if
encrypting different parts of the system with different keys is useful
or feasible.  (This is separate from key rotation.)

We will use CBC AES128 mode for tables/indexes, and CTR AES128 for WAL.
8k pages will use the LSN as a nonce, which will be encrypted to
generate the initialization vector (IV).  We will not encrypt the first
16 bytes of each page so the LSN can be used in this way.  The WAL will
use the WAL file segment number as the nonce and the IV will be created
in the same way.
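
(For illustration only, not code from this thread: the "encrypted nonce"
construction described above boils down to something like the following
sketch, assuming OpenSSL's EVP API.)

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

/*
 * Derive the per-page IV by encrypting the zero-padded LSN with the key,
 * i.e. the encrypted-nonce construction of NIST SP 800-38A appendix C.
 */
static void
derive_page_iv(const unsigned char key[16], uint64_t lsn, unsigned char iv[16])
{
	unsigned char	nonce[16] = {0};
	int				outlen;
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

	memcpy(nonce, &lsn, sizeof(lsn));	/* 8-byte LSN, zero-padded to one block */
	EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, key, NULL);
	EVP_CIPHER_CTX_set_padding(ctx, 0);
	EVP_EncryptUpdate(ctx, iv, &outlen, nonce, sizeof(nonce));	/* one block in, one out */
	EVP_CIPHER_CTX_free(ctx);
}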

wal_log_hints will be enabled automatically in encryption mode, like we
do for checksum mode, so we never encrypt different 8k pages with the
same IV.

There will need to be a pg_control field to indicate that encryption is
in use.

Right now we don't support the online changing of a cluster's checksum
mode, so I suggest we create a utility like pg_checksums --enable to
allow offline key rotation.  Once we get online checksum mode changing
ability, we can look into using that for encryption key rotation.

Did I miss anything?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Adding SMGR discriminator to buffer tags

2019-07-11 Thread Thomas Munro
On Fri, May 10, 2019 at 8:54 AM Shawn Debnath  wrote:
> On Wed, May 08, 2019 at 06:31:04PM +1200, Thomas Munro wrote:
> Looks good to me. Minor nit: update the comment for XLogRecGetBlockTag:

Fixed.  Also fixed broken upgrade scripts for pg_buffercache
extension, as pointed out by Robert[1] on the main thread where undo
stuff is being discussed. Attempts to keep subtopics separated have so
far failed, so the thread ostensibly about orphaned file cleanup is
now about undo work allocation, but I figured it'd be useful to
highlight this patch separately as it'll be the first to go in, and
it's needed by your work Shawn.  So I hope we're still on the same
page with this refactoring patch.

One thing I'm not sure about is the TODO message in parsexlog.c's
extractPageInfo() function.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmob4htT-9Tq7eHG3wS%3DdpKFbQZOyqgSr1iWmV_65Duz6Pw%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com


0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch
Description: Binary data


0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
Description: Binary data


Re: base backup client as auxiliary backend process

2019-07-11 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-07-11 22:20, Robert Haas wrote:
>> On Thu, Jul 11, 2019 at 4:10 PM Tom Lane  wrote:
>>> How's the postmaster to know that it's supposed to run pg_basebackup
>>> rather than start normally?  Where will it get the connection information?
>>> Seem to need configuration data *somewhere*.
>> 
>> Maybe just:
>> 
>> ./postgres --replica='connstr' -D createme

> What you are describing is of course theoretically possible, but it
> doesn't really fit with how existing tooling normally deals with this,
> which is one of the problems I want to address.

I don't care for Robert's suggestion for a different reason: it presumes
that all data that can possibly be needed to set up a new replica is
feasible to cram onto the postmaster command line, and always will be.

An immediate counterexample is that's not where you want to be specifying
the password for a replication connection.  But even without that sort of
security issue, this approach won't scale.  It also does not work even a
little bit nicely for tooling in which the postmaster is not supposed to
be started directly by the user.  (Which is to say, all postgres-service
tooling everywhere.)

regards, tom lane




REINDEX filtering in the backend

2019-07-11 Thread Julien Rouhaud
Hi,

I had some spare time tonight so I started a prototype to allow
filtering the indexes that are processed using the REINDEX command, as
Peter suggested in the parallel reindexdb thread [1].

I didn't want to spend too much time enjoying bison and adding new
unreserved keywords, so for now I just implemented this syntax to
start a discussion for this feature in the next commitfest:

REINDEX ( FILTER = COLLATION ) [...]

The FILTER clause can be used multiple times; each one is OR-ed into
the ReindexStmt's options, so we could easily add LIBC, ICU, and other
filters, also making COLLATION (or, more realistically, a better new
keyword) an alias for (LIBC | ICU), for instance.

The filtering is done at table level (with and without the
concurrently option), so SCHEMA, DATABASE and SYSTEM automatically
benefit from it.  If this clause is used with a REINDEX INDEX, the
statement errors out, as I don't see a valid use case for providing a
single index name and asking to possibly filter it at the same time.

Under the hood, the filtering is for now done in a single function by
appending elements, not removing them.  An empty OID list is created;
all indexes belonging to the underlying relation are processed by the
specific filter(s), and any index that fails to be discarded by at
least one filter, even partially, is added to the final list.
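
As a rough sketch of that flow (illustrative only; the option flag and the
collation-dependency helper are placeholder names, not the actual patch):

static List *
get_filtered_index_list(Relation rel, int options)
{
	List	   *result = NIL;
	ListCell   *lc;

	foreach(lc, RelationGetIndexList(rel))
	{
		Oid			indexOid = lfirst_oid(lc);

		/* keep the index unless every requested filter discards it */
		if ((options & REINDEXOPT_FILTER_COLLATION) == 0 ||
			index_has_collation_dependency(indexOid))	/* placeholder helper */
			result = lappend_oid(result, indexOid);
	}

	return result;
}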

I also added some minimal documentation and regression tests.  I'll
add this patch to the next commitfest.

[1] 
https://www.postgresql.org/message-id/7140716c-679e-a0b9-a273-b201329d8891%402ndquadrant.com


reindex_filter-v1.diff
Description: Binary data


Re: base backup client as auxiliary backend process

2019-07-11 Thread Peter Eisentraut
On 2019-07-11 22:20, Robert Haas wrote:
> On Thu, Jul 11, 2019 at 4:10 PM Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Thu, Jul 11, 2019 at 10:36 AM Tom Lane  wrote:
 Gotta have config files in place already, no?
>>
>>> Why?
>>
>> How's the postmaster to know that it's supposed to run pg_basebackup
>> rather than start normally?  Where will it get the connection information?
>> Seem to need configuration data *somewhere*.
> 
> Maybe just:
> 
> ./postgres --replica='connstr' -D createme

What you are describing is of course theoretically possible, but it
doesn't really fit with how existing tooling normally deals with this,
which is one of the problems I want to address.

initdb has all the knowledge of how to create the data *directory*, how
to set permissions, deal with existing and non-empty directories, how to
set up a separate WAL directory.  Packaged environments might wrap this
further by using the correct OS users, creating the directory first as
root, then changing owner, etc.  This is all logic that we can reuse and
probably don't want to duplicate elsewhere.

Furthermore, we have for the longest time encouraged packagers *not* to
create data directories automatically when a service is started, because
this might store data in places that will be hidden by a later mount.
Keeping this property requires making the initialization of the data
directory a separate step somehow.  That step doesn't have to be called
"initdb", it could be a new "pg_mkdirs", but for the reasons described
above, this would create a fair amount of code duplication and not really
gain anything.

Finally, many installations want to have the configuration files under
control of some centralized configuration management system.  The way
those want to work is usually: (1) create file system structures, (2)
install configuration files from some templates, (3) start service.
This is of course how setting up a primary works.  Having such a system
set up a standby is currently seemingly impossible in an elegant way,
because the order and timing of how things work is all wrong.  My
proposed change would fix this because things would be set up in the
same three-step process.  (As has been pointed out, this would require
that the base backup does not copy over the configuration files from the
remote, which my patch currently doesn't do correctly.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-07-11 Thread Mike Palmiotto
On Thu, Jul 11, 2019 at 8:49 AM Thomas Munro  wrote:
> 
>
> Here are some comments on superficial aspects of the patch:
>
> +/* Custom partition child access hook. Provides further partition pruning 
> given
> + * child OID.
> + */
>
> Should be like:
>
> /*
>  * Multi-line comment...
>  */

Fixed in attached patch.

>
> Why "child"?  Don't you really mean "Partition pruning hook.  Provides
> custom pruning given partition OID." or something?
>
> +typedef bool (*partitionChildAccess_hook_type) (Oid childOID);
> +PGDLLIMPORT partitionChildAccess_hook_type partitionChildAccess_hook;
>
> Hmm, I wonder if this could better evoke the job that it's doing...
> partition_filter_hook?
> partition_access_filter_hook?  partition_prune_hook?

Ended up going with partition_prune_hook. Good call.

>
> +/* Macro to use partitionChildAccess_hook. Handles NULL-checking. */
>
> It's not a macro, it's a function.

Copy-pasta. Fixed.

>
> +static inline bool InvokePartitionChildAccessHook (Oid childOID)
> +{
> +if (partitionChildAccess_hook && enable_partition_pruning && childOID)
> +{
>
> Normally we write OidIsValid(childOID) rather than comparing with 0.
> I wonder if you should call the variable relId?  Single line if
> branches don't usually get curly braces.

Fixed.

>
> +return (*partitionChildAccess_hook) (childOID);
>
> The syntax we usually use for calling function pointers is just
> partitionChildAccess_hook(childOID).

Fixed.

-- 
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com
From 7a2494c8829df95dc75d037cf54000d350c25d62 Mon Sep 17 00:00:00 2001
From: Mike Palmiotto 
Date: Mon, 8 Jul 2019 12:46:21 +
Subject: [PATCH] Flexible partition pruning hook

This hook allows partition pruning to be performed for entire partitions which
are deemed inaccessible by RLS policy prior to those relations being opened on
disk.
---
 src/backend/catalog/pg_inherits.c |  5 +
 src/backend/optimizer/path/allpaths.c |  7 +++
 src/include/partitioning/partprune.h  | 17 +
 3 files changed, 29 insertions(+)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 00f7957..644cd59 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -25,6 +25,7 @@
 #include "catalog/indexing.h"
 #include "catalog/pg_inherits.h"
 #include "parser/parse_type.h"
+#include "partitioning/partprune.h"
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -92,6 +93,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
 	while ((inheritsTuple = systable_getnext(scan)) != NULL)
 	{
 		inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
+
+		if (!InvokePartitionPruneHook(inhrelid))
+			continue;
+
 		if (numoids >= maxoids)
 		{
 			maxoids *= 2;
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b772348..b45e086 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1038,6 +1038,13 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 		}
 
+		if (!InvokePartitionPruneHook(childRTE->relid))
+		{
+			/* Implement custom partition pruning filter*/
+			set_dummy_rel_pathlist(childrel);
+			continue;
+		}
+
 		/*
 		 * Constraint exclusion failed, so copy the parent's join quals and
 		 * targetlist to the child, with appropriate variable substitutions.
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index b3e9268..c3e30f9 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -15,6 +15,7 @@
 #define PARTPRUNE_H
 
 #include "nodes/execnodes.h"
+#include "optimizer/cost.h"
 #include "partitioning/partdefs.h"
 
 struct PlannerInfo;/* avoid including pathnodes.h here */
@@ -68,6 +69,22 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+/*
+ * Custom partition pruning hook. Provides further partition pruning given
+ * partition OID.
+ */
+typedef bool (*partition_prune_hook_type) (Oid relid);
+PGDLLIMPORT partition_prune_hook_type partition_prune_hook;
+
+/* Inline function for calling partition_prune_hook with NULL-checking. */
+static inline bool InvokePartitionPruneHook (Oid relid)
+{
+	if (partition_prune_hook && enable_partition_pruning && OidIsValid(relid))
+		return partition_prune_hook(relid);
+
+	return true;
+}
+
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 	struct RelOptInfo *parentrel,
 	List *subpaths,
-- 
1.8.3.1
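
As an illustration of how the proposed hook could be consumed, an
extension module might install it roughly as sketched below.  The
policy function my_partition_is_visible() is invented for the example,
and the hook-chaining pattern is just the usual convention, not part of
the patch.

static partition_prune_hook_type prev_partition_prune_hook = NULL;

/* Return false to have the partition treated as pruned. */
static bool
my_partition_prune_hook(Oid relid)
{
	if (prev_partition_prune_hook && !prev_partition_prune_hook(relid))
		return false;

	/* my_partition_is_visible() is a hypothetical policy check */
	return my_partition_is_visible(relid);
}

void
_PG_init(void)
{
	prev_partition_prune_hook = partition_prune_hook;
	partition_prune_hook = my_partition_prune_hook;
}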



Re: initdb recommendations

2019-07-11 Thread David Fetter
On Thu, Jul 11, 2019 at 09:34:25PM +0200, Julien Rouhaud wrote:
> On Tue, Jun 18, 2019 at 10:33 PM Peter Eisentraut
>  wrote:
> >
> > On 2019-05-23 18:54, Peter Eisentraut wrote:
> > > To recap, the idea here was to change the default authentication methods
> > > that initdb sets up, in place of "trust".
> > >
> > > I think the ideal scenario would be to use "peer" for local and some
> > > appropriate password method (being discussed elsewhere) for host.
> 
> I'm also personally all for that change.
> 
> > Patch for that attached.
> 
> Patch applies and compiles cleanly, same for documentation.  The
> change works as intended, so I don't have much to say.
> 
> > Note that with this change, running initdb without arguments will now
> > error on those platforms: You need to supply either a password or select
> > a different default authentication method.
> 
> Should we make this explicitly stated in the documentation?  As a
> reference, it's saying:
> 
> The default client authentication setup is such that users can connect
> over the Unix-domain socket to the same database user name as their
> operating system user names (on operating systems that support this,
> which are most modern Unix-like systems, but not Windows)

It turns out that really recent versions of Windows do have it.

https://bsmadhu.wordpress.com/2018/08/22/unix-domain-socket-support-in-windows/

Not that this is relevant, or will be, for another couple of years...

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: base backup client as auxiliary backend process

2019-07-11 Thread Robert Haas
On Thu, Jul 11, 2019 at 4:10 PM Tom Lane  wrote:
> Robert Haas  writes:
> > On Thu, Jul 11, 2019 at 10:36 AM Tom Lane  wrote:
> >> Gotta have config files in place already, no?
>
> > Why?
>
> How's the postmaster to know that it's supposed to run pg_basebackup
> rather than start normally?  Where will it get the connection information?
> Seem to need configuration data *somewhere*.

Maybe just:

./postgres --replica='connstr' -D createme

?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: base backup client as auxiliary backend process

2019-07-11 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jul 11, 2019 at 10:36 AM Tom Lane  wrote:
>> Gotta have config files in place already, no?

> Why?

How's the postmaster to know that it's supposed to run pg_basebackup
rather than start normally?  Where will it get the connection information?
Seem to need configuration data *somewhere*.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Bruce Momjian
On Thu, Jul 11, 2019 at 03:47:50PM -0400, Robert Haas wrote:
> On Tue, Jul 9, 2019 at 10:43 AM Bruce Momjian  wrote:
> > FYI, pg_upgrade already preserves the pg_class.oid, which is why I
> > recommended it over pg_class.relfilenode:
> 
> I think it's strange that pg_upgrade does not preserve the
> relfilenode.  I think it would probably make more sense if it did.
> 
> Anyway, leaving that aside, you have to be able to read pg_class to
> know the OID of a table, and you can't do that in recovery before
> reaching consistency. Yet, you still need to be able to modify disk
> blocks at that point, to finish recovery. So I can't see how any
> system that involves figuring out the nonce from the OID would ever
> work.
> 
> If we end up with random nonces, we're going to have to store them
> someplace - either in some unencrypted portion of the disk blocks
> themselves, or in a separate fork, or someplace else. If it's OK for
> them to predictable as long as they vary a lot, we could derive them
> from DBOID + RELFILENODE + FORK + BLOCK, but not from DBOID + RELOID +
> FORK + BLOCK, because of the aforementioned recovery problem.

Later in this thread, we decided that the page LSN was the best option
as a nonce because it changes every time the page changes.  (We will
enable wal_log_hints.)  We will not encrypt the first 16 bytes of the
page so it can be used for the nonce.  (AES block ciphers use 16-byte
blocks.)
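
To sketch what that could mean in code (purely illustrative, not the
agreed design): the IV for a page would start with the page header's
LSN, with the remaining bytes of the 16-byte AES block either zeroed or
filled with something like the block number.  Including the block
number here is my assumption, not something settled in the thread.

/* Illustrative only: build a 16-byte IV from the unencrypted header. */
static void
build_page_iv(Page page, BlockNumber blkno, unsigned char iv[16])
{
	XLogRecPtr	lsn = PageGetLSN(page);

	memset(iv, 0, 16);
	memcpy(iv, &lsn, sizeof(lsn));				/* bytes 0-7: page LSN */
	memcpy(iv + 8, &blkno, sizeof(blkno));		/* bytes 8-11: assumption */
	/* remaining bytes left zero */
}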

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Robert Haas
On Tue, Jul 9, 2019 at 10:43 AM Bruce Momjian  wrote:
> FYI, pg_upgrade already preserves the pg_class.oid, which is why I
> recommended it over pg_class.relfilenode:

I think it's strange that pg_upgrade does not preserve the
relfilenode.  I think it would probably make more sense if it did.

Anyway, leaving that aside, you have to be able to read pg_class to
know the OID of a table, and you can't do that in recovery before
reaching consistency. Yet, you still need to be able to modify disk
blocks at that point, to finish recovery. So I can't see how any
system that involves figuring out the nonce from the OID would ever
work.

If we end up with random nonces, we're going to have to store them
someplace - either in some unencrypted portion of the disk blocks
themselves, or in a separate fork, or someplace else. If it's OK for
them to predictable as long as they vary a lot, we could derive them
from DBOID + RELFILENODE + FORK + BLOCK, but not from DBOID + RELOID +
FORK + BLOCK, because of the aforementioned recovery problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: initdb recommendations

2019-07-11 Thread Julien Rouhaud
On Tue, Jun 18, 2019 at 10:33 PM Peter Eisentraut
 wrote:
>
> On 2019-05-23 18:54, Peter Eisentraut wrote:
> > To recap, the idea here was to change the default authentication methods
> > that initdb sets up, in place of "trust".
> >
> > I think the ideal scenario would be to use "peer" for local and some
> > appropriate password method (being discussed elsewhere) for host.

I'm also personally all for that change.

> Patch for that attached.

Patch applies and compiles cleanly, same for documentation.  The
change works as intended, so I don't have much to say.

> Note that with this change, running initdb without arguments will now
> error on those platforms: You need to supply either a password or select
> a different default authentication method.

Should we make this explicitly stated in the documentation?  As a
reference, it's saying:

The default client authentication setup is such that users can connect
over the Unix-domain socket to the same database user name as their
operating system user names (on operating systems that support this,
which are most modern Unix-like systems, but not Windows) and
otherwise with a password. To assign a password to the initial
database superuser, use one of initdb's -W, --pwprompt or -- pwfile
options.




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Dave Cramer
On Thu, 11 Jul 2019 at 15:07, Robert Haas  wrote:

> On Thu, Jul 11, 2019 at 2:29 PM Dave Cramer  wrote:
> > So if I understand this correctly if user bob has altered his search
> path and there is a security-definer function called owned by him then
> > the search path will be changed for the duration of the function and
> reported for every iteration? The implications of this are "interesting" to
> say the least.
>
> I don't believe that it matters what search path has been set using
> ALTER USER bob.


Why wouldn't it ???


Dave Cramer

da...@postgresintl.com
www.postgresintl.com


>
>


Re: base backup client as auxiliary backend process

2019-07-11 Thread Robert Haas
On Thu, Jul 11, 2019 at 10:36 AM Tom Lane  wrote:
> Gotta have config files in place already, no?

Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Robert Haas
On Thu, Jul 11, 2019 at 2:29 PM Dave Cramer  wrote:
> So if I understand this correctly if user bob has altered his search path and 
> there is a security-definer function called owned by him then
> the search path will be changed for the duration of the function and reported 
> for every iteration? The implications of this are "interesting" to say the 
> least.

I don't believe that it matters what search path has been set using
ALTER USER bob.  But it does matter whether there is a SET clause
attached to the function definition.  If you're not familiar with
this, reread the CREATE FUNCTION docs... it's a cool feature.  And of
course the function could also have an explicit SET inside of it, or
several of them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: buildfarm's typedefs list has gone completely nutso

2019-07-11 Thread Andrew Dunstan


On 7/11/19 12:50 PM, Tom Lane wrote:
>
>> This looks like a bug in the version of objdump unless I'm reading it
>> wrong. Why would this be tagged as a typedef?
> Maybe.  We still need to explain the other non-typedef symbols that have
> just appeared and are not being reported by calliphoridae.
>


The others come from komodoensis (also Andres' machine)


pgbfprod=> select l.sysname, l.snapshot, l.branch from build_status_log
l where snapshot > now() - interval '12 days' and log_stage =
'typedefs.log' and log_text ~
'ECPGt_bytea|in_addr|connection_name|send_appname';
   sysname   |      snapshot       | branch
-------------+---------------------+--------
 komodoensis | 2019-07-08 23:34:01 | HEAD
 komodoensis | 2019-07-10 02:38:01 | HEAD
(2 rows)



cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Dave Cramer
On Wed, 10 Jul 2019 at 16:22, Robert Haas  wrote:

> On Wed, Jul 10, 2019 at 9:59 AM Dave Cramer  wrote:
> > I'm still a bit conflicted about what to do with search_path as I do
> believe this is potentially a security issue.
> > It may be that we always want to report that and possibly back patch it.
>
> I don't see that as a feasible option unless we make the logic that
> does the reporting smarter.  If it changes transiently inside of a
> security-definer function, and then changes back, my recollection is
> that right now we would report both changes.  I think that could cause
> a serious efficiency problem if you are calling such a function in a
> loop.
>

Not being intimately familiar with the backend, the implications of the
above just struck home.

So if I understand this correctly if user bob has altered his search path
and there is a security-definer function called owned by him then
the search path will be changed for the duration of the function and
reported for every iteration? The implications of this are "interesting" to
say the least.


Dave Cramer

da...@postgresintl.com
www.postgresintl.com

>
>


Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Peter Geoghegan
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih  wrote:
> I tried this patch and found the improvements impressive. However,
> when I tried with multi-column indexes it wasn't giving any
> improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column
indexes, provided the entire set of values is duplicated -- not just a
prefix. Prefix compression is possible, but it's more complicated. It
seems to generally require the DBA to specify a prefix length,
expressed as a number of prefix columns.

> I am surprised to find that such a patch is on radar since quite some
> years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to
have made the general approach a lot more effective. Compression is
performed lazily, not eagerly, which seems to work a lot better.

> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> IndexTupleSize %zu free %zu",
> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> +
> and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

> /*
> * If we have only 10 uncompressed items on the full page, it probably
> * won't worth to compress them.
> */
> if (maxoff - n_posting_on_page < 10)
> return;
>
> Is this a magic number...?

I think that this should be a constant or something.

> /*
> * We do not expect to meet any DEAD items, since this function is
> * called right after _bt_vacuum_one_page(). If for some reason we
> * found dead item, don't compress it, to allow upcoming microvacuum
> * or vacuum clean it up.
> */
> if (ItemIdIsDead(itemId))
> continue;
>
> This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page()
is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
is set for the page.

> Caller is responsible for checking BTreeTupleIsPosting to ensure that
> + * he will get what he expects
>
> This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

> Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in
nbtree.h, but the version number isn't changed. I think that we may be
able to get away with not increasing the B-Tree version from 4 to 5,
actually. Deduplication is performed lazily when it looks like we
might have to split the page, so there isn't any expectation that
tuples will either be compressed or uncompressed in any context.

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Peter Geoghegan
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov
 wrote:
> BTW, I think deduplication could cause some small performance
> degradation in some particular cases, because page-level locks became
> more coarse grained once pages hold more tuples.  However, this
> doesn't seem like something we should much care about.  Providing an
> option to turn deduplication off looks enough for me.

There was an issue like this with my v12 work on nbtree, with the
TPC-C indexes. They were always ~40% smaller, but there was a
regression when TPC-C was used with a small number of warehouses, when
the data could easily fit in memory (which is not allowed by the TPC-C
spec, in effect). TPC-C is very write-heavy, which combined with
everything else causes this problem. I wasn't doing anything too fancy
there -- the regression seemed to happen simply because the index was
smaller, not because of the overhead of doing page splits differently
or anything like that (there were far fewer splits).

I expect there to be some regression for workloads like this. I am
willing to accept that provided it's not too noticeable, and doesn't
have an impact on other workloads. I am optimistic about it.

> Regarding bitmap indexes itself, I think our BRIN could provide them.
> However, it would be useful to have opclass parameters to make them
> tunable.

I thought that we might implement them in nbtree myself. But we don't
need to decide now.

-- 
Peter Geoghegan




Re: Ltree syntax improvement

2019-07-11 Thread Dmitry Belyavsky
Dear Thomas,

On Thu, Jul 11, 2019 at 11:20 AM Thomas Munro 
wrote:

> On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky 
> wrote:
> > [ltree_20190709.diff]
>
> Hi Dmitry,
>
> You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
> otherwise check-world fails when built with Python support.  The good
> news is that it looks like it fails because you fixed something!
> (Though I didn't check the details).
>
>  CREATE FUNCTION test2() RETURNS ltree
>  LANGUAGE plpythonu
>  TRANSFORM FOR TYPE ltree
>  AS $$
>  return ['foo', 'bar', 'baz']
>  $$;
>  -- plpython to ltree is not yet implemented, so this will fail,
>  -- because it will try to parse the Python list as an ltree input
>  -- string.
>  SELECT test2();
> -ERROR:  syntax error at position 0
> -CONTEXT:  while creating return value
> -PL/Python function "test2"
> +  test2
> +-
> + "['foo', 'bar', 'baz']"
> +(1 row)
> +
>
>
See attached. I'm not familiar enough with python so I just removed the
failing tests.
If the main patch is accepted, the ltree_python extension should be
redesigned, I think...

-- 
SY, Dmitry Belyavsky
diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out
index 8226930905..5f45726229 100644
--- a/contrib/ltree/expected/ltree.out
+++ b/contrib/ltree/expected/ltree.out
@@ -1,4 +1,5 @@
 CREATE EXTENSION ltree;
+SET standard_conforming_strings=on;
 -- Check whether any of our opclasses fail amvalidate
 SELECT amname, opcname
 FROM pg_opclass opc LEFT JOIN pg_am am ON am.oid = opcmethod
@@ -7679,3 +7680,1587 @@ SELECT count(*) FROM _ltreetest WHERE t ? '{23.*.1,23.*.2}' ;
 15
 (1 row)
 
+-- Extended syntax, escaping, quoting etc
+-- success
+SELECT E'\\.'::ltree;
+ ltree 
+---
+ "."
+(1 row)
+
+SELECT E'\\ '::ltree;
+ ltree 
+---
+ " "
+(1 row)
+
+SELECT E''::ltree;
+ ltree 
+---
+ "\"
+(1 row)
+
+SELECT E'\\a'::ltree;
+ ltree 
+---
+ a
+(1 row)
+
+SELECT E'\\n'::ltree;
+ ltree 
+---
+ n
+(1 row)
+
+SELECT E'x'::ltree;
+ ltree 
+---
+ "x\"
+(1 row)
+
+SELECT E'x\\ '::ltree;
+ ltree 
+---
+ "x "
+(1 row)
+
+SELECT E'x\\.'::ltree;
+ ltree 
+---
+ "x."
+(1 row)
+
+SELECT E'x\\a'::ltree;
+ ltree 
+---
+ xa
+(1 row)
+
+SELECT E'x\\n'::ltree;
+ ltree 
+---
+ xn
+(1 row)
+
+SELECT 'a b.с d'::ltree;
+ltree
+-
+ "a b"."с d"
+(1 row)
+
+SELECT ' e . f '::ltree;
+ ltree 
+---
+ e.f
+(1 row)
+
+SELECT ' '::ltree;
+ ltree 
+---
+ 
+(1 row)
+
+SELECT E'\\ g  . h\\ '::ltree;
+   ltree   
+---
+ " g"."h "
+(1 row)
+
+SELECT E'\\ g'::ltree;
+ ltree 
+---
+ " g"
+(1 row)
+
+SELECT E' h\\ '::ltree;
+ ltree 
+---
+ "h "
+(1 row)
+
+SELECT '" g  "." h "'::ltree;
+ltree 
+--
+ " g  "." h "
+(1 row)
+
+SELECT '" g  " '::ltree;
+ ltree  
+
+ " g  "
+(1 row)
+
+SELECT '" g  "   ." h "  '::ltree;
+ltree 
+--
+ " g  "." h "
+(1 row)
+
+SELECT nlevel(E'Bottom\\.Test'::ltree);
+ nlevel 
+
+  1
+(1 row)
+
+SELECT subpath(E'Bottom\\.'::ltree, 0, 1);
+  subpath  
+---
+ "Bottom."
+(1 row)
+
+SELECT subpath(E'a\\.b', 0, 1);
+ subpath 
+-
+ "a.b"
+(1 row)
+
+SELECT subpath(E'a\\..b', 1, 1);
+ subpath 
+-
+ b
+(1 row)
+
+SELECT subpath(E'a\\..\\b', 1, 1);
+ subpath 
+-
+ b
+(1 row)
+
+SELECT subpath(E'a b.с d'::ltree, 1, 1);
+ subpath 
+-
+ "с d"
+(1 row)
+
+SELECT(
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'\z\z\z\z\z')::ltree;
+  ltree  
+-
+ 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789z
+(1 row)
+
+SELECT('   ' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'01234567890123456789012345678901234567890123456789' ||
+'\a\b\c\d\e   ')::ltree;
+  ltree  

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Peter Geoghegan
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov
 wrote:
> Could you please elaborate more on preserving the logical contents?  I
> can understand it as following: "B-Tree should have the same structure
> and invariants as if each TID in posting list be a separate tuple".

That's exactly what I mean.

> So, if we imagine each TID to become separate tuple it would be the
> same B-tree, which just can magically sometimes store more tuples in
> page.  Is my understanding correct?

Yes.

> But outside code will still
> notice changes as soon as it directly accesses B-tree pages (like
> contrib/amcheck does).  Do you mean we need an API for accessing
> logical B-tree tuples or something?

Well, contrib/amcheck isn't really outside code. But amcheck's
"rootdescend" option will still need to be able to supply a heap TID
as just another column, and get back zero or one logical tuples from
the index. This is important because retail index tuple deletion needs
to be able to think about logical tuples in the same way. I also think
that it might be useful for the planner to expect to get back
duplicates in heap TID order in the future (or in reverse order in the
case of a backwards scan). Query execution and VACUUM code outside of
nbtree should be able to pretend that there is no such thing as a
posting list.

The main thing that the patch is missing that is needed to "preserve
logical contents" is the ability to update/expand an *existing*
posting list due to a retail insertion of a new duplicate that happens
to be within the range of that existing posting list. This will
usually be a non-HOT update that doesn't change the value for the row
in the index -- that must change the posting list, even when there is
available space on the page without recompressing. We must still
occasionally be eager, like GIN always is, though in practice we'll
almost always add to posting lists in a lazy fashion, when it looks
like we might have to split the page -- the lazy approach seems to
perform best.

> I think in order to deduplicate "equal but distinct" values we need at
> least to give up with index only scans.  Because we have no
> restriction that equal according to B-tree opclass values are same for
> other operations and/or user output.

We can either prevent index-only scans in the case of affected
indexes, or prevent compression, or give the user a choice. I'm not
too worried about how that will work for users just yet.

> Do I understand correctly that current patch may produce posting lists
> of the same value with overlapping ranges of TIDs?  If so, it's
> definitely wrong.

Yes, it can, since the assertion fails. It looks like the assertion
itself was changed to match what I expect, so I assume that this bug
will be fixed in the next version of the patch. It fails with a fairly
big index on text for me.

> > * Maybe we could do compression with unique indexes when inserting
> > values with NULLs? Note that we now treat an insertion of a tuple with
> > NULLs into a unique index as if it wasn't even a unique index -- see
> > the "checkingunique" optimization at the beginning of _bt_doinsert().
> > Having many NULL values in a unique index is probably fairly common.
>
> I think unique indexes may benefit from deduplication not only because
> of NULL values.  Non-HOT updates produce duplicates of non-NULL values
> in unique indexes.  And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.

-- 
Peter Geoghegan




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Robert Haas
On Thu, Jul 11, 2019 at 10:35 AM Tom Lane  wrote:
> All of the above is based on the assumption that this isn't a plain
> old USERSET GUC, which I'm not really seeing the argument for.
> OK, there might be *implementation* reasons why we would rather not
> deal with on-the-fly changes to the list, but there's no advantage
> to users in it.

It's pretty important that this value remain fixed for the lifetime of
a session.  If it can be changed during the session via SQL, or via
SET commands attached to a function for example, then driver authors
can't count on it to really be set to the value that they requested.

Also, think about something like a connection pooler. The user might
request one list of GUCs from the driver, and the driver might request
a different (probably larger) list from the connection pooler, and the
connection pooler might then request a different (and presumably still
larger) list of GUCs from the server. If the end-user can shoot a SET
command through and change the setting on the server, the connection
pooler and the driver will break, because they were presumably
counting on getting the reports for which they asked (and forwarding
the ones someone else asked).

We have steadfastly refused to provide protocol-level tools for things
like "please change my user ID, and don't let anyone change it again
via SQL," and that's a huge problem for things like connection poolers
which can't parse all the SQL flowing through the connection (because
figuring out what it does requires solving the Halting Problem) and
wouldn't want to if they could for performance reasons. I think that's
a huge mistake. The FEBE protocol endpoint is not the application, but
some intermediate piece of software that needs control over its own
destiny. Even when there is no pooling involved, the FEBE endpoint is
the driver, not the application. That distinction is subtle, but it is
real, and it matters here too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Peter Geoghegan
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian  wrote:
> Wow, I never thought of that.  The only things I know we lock until
> transaction end are rows we update (against concurrent updates), and
> additions to unique indexes.  By definition, indexes with many
> duplicates are not unique, so that doesn't apply.

Right. Another advantage of their approach is that you can make
queries like this work:

UPDATE tab SET unique_col = unique_col + 1

This will not throw a unique violation error on most/all other DB
systems when the updated column (in this case "unique_col") has a
unique constraint/is the primary key. This behavior is actually
required by the SQL standard. An SQL statement is supposed to be
all-or-nothing, which Postgres doesn't quite manage here.

The section "6.6 Interdependencies of Transactional Storage" from the
paper "Architecture of a Database System" provides additional
background information (I should have suggested reading both 6.6 and
6.7 together).

-- 
Peter Geoghegan




Re: accounting for memory used for BufFile during hash joins

2019-07-11 Thread Robert Haas
On Mon, May 6, 2019 at 9:49 PM Thomas Munro  wrote:
> Stepping back a bit, I think there is something fishy about the way we
> detect extreme skew.  Is that a factor in this case?  Right now we
> wait until we have a batch that gets split into child batches
> containing exactly 0% and 100% of the tuples before we give up.
> Previously I had thought of that as merely a waste of time, but
> clearly it's also a waste of unmetered memory.  Oops.
>
> I think our extreme skew detector should go off sooner, because
> otherwise if you have N nicely distributed unique keys and also M
> duplicates of one bad egg key that'll never fit in memory, we keep
> repartitioning until none of the N keys fall into the batch containing
> the key for the M duplicates before we give up!  You can use
> balls-into-bins maths to figure out the number, but I think that means
> we expect to keep splitting until we have N * some_constant batches,
> and that's just silly and liable to create massive numbers of
> partitions proportional to N, even though we're trying to solve a
> problem with M.  In another thread I suggested we should stop when
> (say) 95% of the tuples go to one child batch.  I'm not sure how you
> pick the number.

Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all.  Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.

I wonder whether we should be dividing things into batches unevenly,
based on the distribution of tuples we've seen so far.  For example,
suppose we've gotten to 1024 buckets and that's all we can fit in
memory. If we decide to go to 2 batches, we'll use the next bit of the
hash key to decide which things go into batch 0 and which things go
into batch 1. But if we know that 50% of the data is in bucket 17, why
are we not making bucket 17 into a batch and everything else into
another batch? Then, when we process the batch that was derived from
bucket-17, we can use 10 completely new bits from the hash key to
slice the data from that bucket as finely as possible.

Now the bucket might be entirely duplicates, in which case no number
of additional bits will help.  However, even in that case it's still a
good idea to make it its own batch, and then use some other algorithm
to process that batch. And if it's *not* entirely duplicates, but
there are say 2 or 3 really common values that unluckily hash to the
same bucket, then being able to use a lot more bits for that portion
of the data gives us the best chance of managing to spread it out into
different buckets.

Similarly, if we split the hash join into four batches, and batch 0
fits in memory but batch 1 does not, we cannot further split batch 1
without splitting batch 2 and batch 3 also.  That's not good either,
because those batches might be small and not need splitting.

I guess what I'm trying to say is that our algorithms for dealing with
mis-estimation seem to be largely oblivious to the problem of skew,
and I don't think the problem is confined to extreme skew. Suppose you
have some data that is only moderately skewed, so that when you build
a hash table with 1024 buckets, 25% of the data is in buckets 0-19,
25% in buckets 20-768, 25% in buckets 769-946, and the last 25% in
buckets 947-1023. If you knew that, then when you discover that the
data is 4x too large to fit in memory, you can divide the data into 4
batches using those bucket number ranges, and get it done in exactly 4
batches. As it is, you'll need to split until every uniform range of
buckets fits in memory: 0-31 is going to be too big a range, so you're
going to go with 0-15, which means you'll have 64 batches instead of
4.

It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way.  Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier."  Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
matches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.
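
To illustrate the uneven-batching idea in the simplest possible terms,
the assignment step could look like the toy function below; it is
standalone C, nothing like the real ExecHashGetBucketAndBatch, and the
names are made up.  It walks the per-bucket sizes and starts a new
batch whenever the running total would exceed the memory budget.

/* Toy sketch: group contiguous buckets into work_mem-sized batches. */
static int
assign_batches(const size_t *bucket_bytes, int nbuckets,
			   size_t budget, int *batch_of_bucket)
{
	int			batch = 0;
	size_t		used = 0;

	for (int i = 0; i < nbuckets; i++)
	{
		if (used > 0 && used + bucket_bytes[i] > budget)
		{
			batch++;
			used = 0;
		}
		batch_of_bucket[i] = batch;
		used += bucket_bytes[i];
	}

	return batch + 1;			/* number of batches created */
}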

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Index Skip Scan

2019-07-11 Thread Jesper Pedersen

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:

The UniqueKeys idea is quite separate from pathkeys.  Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.

As of now, if you look at how standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined.  Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which return
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }.  That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version.  This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.
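
To make that concrete, a first-cut UniqueKey node and superset test
could look something like the sketch below.  This is not an agreed
design; per the description above, uniquekeys_contained_in(keys1,
keys2) returns true when keys1 is a superset of keys2, and canonical
keys can be compared by pointer, as compare_pathkeys() does.

typedef struct UniqueKey
{
	NodeTag		type;
	EquivalenceClass *eclass;	/* canonical, so compare by pointer */
} UniqueKey;

/* true if keys1 is a superset of keys2; order does not matter */
static bool
uniquekeys_contained_in(List *keys1, List *keys2)
{
	ListCell   *lc;

	foreach(lc, keys2)
	{
		if (!list_member_ptr(keys1, lfirst(lc)))
			return false;
	}

	return true;
}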



Thanks for the feedback ! I'll work on these changes for the next 
uniquekey patch.


Best regards,
 Jesper




Re: buildfarm's typedefs list has gone completely nutso

2019-07-11 Thread Tom Lane
Andrew Dunstan  writes:
> On 7/10/19 8:24 PM, Andres Freund wrote:
>> I think that's kinda what I'm complaining about... It seems like a bad
>> idea to have this in the buildfarm code, rather than our code. IMO the
>> buildfarm code should invoke an updated src/tools/find_typedef - that
>> way people at least can create typedefs manually locally.

> OK, I don't have a problem with that. I think the sane way to go would
> be to replace it with what the buildfarm is using, more or less.

+1 for the idea --- I've not studied the code, but surely the buildfarm's
version of this code is more bulletproof.

> This looks like a bug in the version of objdump unless I'm reading it
> wrong. Why would this be tagged as a typedef?

Maybe.  We still need to explain the other non-typedef symbols that have
just appeared and are not being reported by calliphoridae.

regards, tom lane




Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-11 Thread Julien Rouhaud
On Thu, Jul 11, 2019 at 3:34 PM Michael Paquier  wrote:
>
> On Thu, Jul 11, 2019 at 11:48:20AM +0200, Julien Rouhaud wrote:
> > On Thu, Jul 11, 2019 at 6:04 AM Michael Paquier  wrote:
> >> Get* would be my choice, because we look at the set of parallel slots,
> >> and get an idle one.
> >
> > That's what the former GetIdleSlot (that I renamed to get_idle_slot as
> > it's not static) is doing.  ConsumeIdleSlot() actually get an idle
> > slot and mark it as being used.  That's probably some leakage of
> > internal implementation, but having a GetIdleParallelSlot (or
> > ParallelSlotGetIdle) *and* a get_idle_slot sounds like a bad idea, and
> > I don't have a better idea on how to rename get_idle_slot.  If you
> > really prefer Get* and you're fine with a static get_idle_slot, that's
> > fine by me.
>
> Do we actually need get_idle_slot?  ConsumeIdleSlot is its only
> caller.

I think that it makes the code quite cleaner to have the selection
outside ConsumeIdleSlot.

> And while scanning the code...
>
> + * getQuerySucess
> Typo here.

Argh, I thought I caught all of them, thanks!

> - * Pump the conn till it's dry of results; return false if any are errors.
> This comment could be improved on the way, like "Go through all the
> connections and make sure to consume any remaining results.  If any
> error is found, false is returned after processing all the parallel
> slots."

You're talking about getQuerySuccess right? That was actually the
original comment of a function I renamed.  +1 to improve it, but this
function is in common.c and doesn't deal with parallel slot at all, so
I'll just drop the slang parts.




Re: pgbench - add option to show actual builtin script code

2019-07-11 Thread Fabien COELHO



Hello Andrew,


Now the patch is good now.

The new status of this patch is: Ready for Committer


Why aren't we instead putting the exact scripts in the documentation?
Having to call pgbench with a special flag to get the script text seems
a bit odd.


A typical use case I had is to create a new script by modifying an
existing one for testing or debugging. I prefer "command > file.sql ; vi
file.sql" to hazardous copy-pasting of stuff from HTML pages.


I do not think that it is worth replicating all scripts inside the doc;
they are not that interesting, especially if more are added. Currently,
out of the 3 scripts, only one is in the doc, and nobody has complained :-)


Now, they could be added to the documentation, but I'd like the option 
anyway.


--
Fabien.




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Dave Cramer
On Thu, 11 Jul 2019 at 10:19, Robert Haas  wrote:

> On Thu, Jul 11, 2019 at 10:11 AM Tom Lane  wrote:
> > Robert Haas  writes:
> > > 3. I'm not sure that just ignoring any GUCs we don't find is the right
> > > thing.  I'm also not sure that it's the wrong thing, but it might be.
> > > My question is: what if there's an extension-owned GUC in play? The
> > > library probably isn't even loaded at this stage, unless it's in
> > > shared_preload_libraries.
> >
> > Gut reaction is that define_custom_variable would need to consult
> > the list to see if a newly-defined variable should be marked GUC_REPORT.
>

This suggests creating a list in guc.c instead. I'm unclear as to the
visibility of variables in there
How do I make this list visible only to the session ?


Dave Cramer

da...@postgresintl.com
www.postgresintl.com


>
>


Re: pgbench - add option to show actual builtin script code

2019-07-11 Thread Andrew Dunstan


On 5/2/19 12:35 PM, Fabien COELHO wrote:
>
>> Now the patch is good now.
>>
>> The new status of this patch is: Ready for Committer
>
> Ok, thanks.
>


Why aren't we instead putting the exact scripts in the documentation?
Having to call pgbench with a special flag to get the script text seems
a bit odd.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Memory-Bounded Hash Aggregation

2019-07-11 Thread Tomas Vondra

On Wed, Jul 03, 2019 at 07:03:06PM -0700, Jeff Davis wrote:

On Wed, 2019-07-03 at 02:17 +0200, Tomas Vondra wrote:

What does "partitioned hash strategy" do? It's probably explained in
one
of the historical discussions, but I'm not sure which one. I assume
it
simply hashes the group keys and uses that to partition the data, and
then
passing it to hash aggregate.


Yes. When spilling, it is cheap to partition on the hash value at the
same time, which dramatically reduces the need to spill multiple times.
Previous discussions:



Unfortunately the second link does not work :-(


It's supposed to be:


https://www.postgresql.org/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com
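
For readers following along, the partition-on-spill idea described
above boils down to something like the sketch below.  The names and the
partition count are illustrative only: when a tuple has to be spilled,
its spill partition is chosen from hash bits that the in-memory hash
table is not already using, so each partition can later be
re-aggregated with a much smaller table.

#define NUM_SPILL_PARTITIONS 32		/* power of two, illustrative */

static int
spill_partition(uint32 hashvalue, int bits_used_for_buckets)
{
	/* take bits above the ones already used to pick the hash bucket */
	return (hashvalue >> bits_used_for_buckets) & (NUM_SPILL_PARTITIONS - 1);
}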



I'm not going to block Approach 1, althought I'd really like to see
something that helps with array_agg.


I have a WIP patch that I just posted. It doesn't yet work with
ARRAY_AGG, but I think it can be made to work by evicting the entire
hash table, serializing the transition states, and then later combining
them.


Aren't all three approaches a way to "fix" hash aggregate? In any case,
it's certainly reasonable to make incremental changes. The question is
whether "approach 1" is a sensible step towards some form of "approach 3".


Disk-based hashing certainly seems like a reasonable algorithm on paper
that has some potential advantages over sorting. It certainly seems
sensible to me that we explore the disk-based hashing strategy first,
and then we would at least know what we are missing (if anything) by
going with the hybrid approach later.

There's also a fair amount of design space to explore in the hybrid
strategy. That could take a while to converge, especially if we don't
have anything in place to compare against.



Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would be
to extend "approach 1" later. But if you think it's a sensible first step,
I trust you. And I certainly agree we need something to compare the other
approaches against.



> * It means we have a hash table and sort running concurrently, each
>  using memory. Andres said this might not be a problem[3], but I'm
>  not convinced that the problem is zero. If you use small work_mem
>  for the write phase of sorting, you'll end up with a lot of runs
> to
>  merge later and that has some kind of cost.
>

Why would we need to do both concurrently? I thought we'd empty the
hash table before doing the sort, no?


So you are saying we spill the tuples into a tuplestore, then feed the
tuplestore through a tuplesort? Seems inefficient, but I guess we can.



I think the question is whether we see this as "emergency fix" (for cases
that are misestimated and could/would fail with OOM at runtime), or as
something that is meant to make "hash agg" more widely applicable.

I personally see it as an emergency fix, in which case it's perfectly
fine if it's not 100% efficient, assuming it kicks in only rarely.
Effectively, we're betting on hash agg, and from time to time we lose.

But even if we see it as a general optimization technique it does not have
to be perfectly efficient, as long as it's properly costed (so the planner
only uses it when appropriate).

If we have a better solution (in terms of efficiency, code complexity,
etc.) then sure - let's use that. But considering we've started this
discussion in ~2015 and we still don't have anything, I wouldn't hold my
breath. Let's do something good enough, and maybe improve it later.


Do we actually need to handle that case? How many such aggregates are
there? I think it's OK to just ignore that case (and keep doing what
we do now), and require serial/deserial functions for anything better.


Punting on a few cases is fine with me, if the user has a way to fix
it.



+1 to doing that


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: buildfarm's typedefs list has gone completely nutso

2019-07-11 Thread Andrew Dunstan


On 7/10/19 8:24 PM, Andres Freund wrote:
> Hi,
>
> On 2019-07-10 16:40:20 -0400, Andrew Dunstan wrote:
>> On 7/10/19 1:34 PM, Andres Freund wrote:
>>> Hm, it has gotten gcc-9 installed recently, but calliphoridae isn't
>>> using that. So it's probably not the compiler side. But I also see a
>>> binutils upgrade:
>>>
>>> 2019-07-08 06:22:48 upgrade binutils-multiarch:amd64 2.31.1-16 
>>> 2.32.51.20190707-1
>>>
>>> and corresponding upgrades forall the arch specific packages. I suspect
>>> it might be that.
>>>
>>> I can't immediately reproduce that locally though, using the same
>>> version of binutils. It's somewhat annoying that the buildfarm uses a
>>> different form of computing the typedefs than src/tools/find_typedef ...
>> That script is notably non-portable, and hasn't seen any significant
>> update in a decade.
> I think that's kinda what I'm complaining about... It seems like a bad
> idea to have this in the buildfarm code, rather than our code. IMO the
> buildfarm code should invoke an updated src/tools/find_typedef - that
> way people at least can create typedefs manually locally.




OK, I don't have a problem with that. I think the sane way to go would
be to replace it with what the buildfarm is using, more or less.



>
> Not yet sure what's actually going on, but there's something odd with
> debug information afaict:
>
> objdump -w spits out warnings for a few files, all static libraries:



I assume you mean -W rather than -w


>
> ../install/lib/libpgcommon.a
> objdump: Warning: Location list starting at offset 0x0 is not terminated.
> objdump: Warning: Hole and overlap detection requires adjacent view lists and 
> loclists.
> objdump: Warning: There are 3411 unused bytes at the end of section .debug_loc
>


That seems new. I didn't get this in my attempt to recreate the setup of
your animal. That was on debian/buster which has binutils 2.31.1-16




> Not sure if that's related - I don't get that locally (newer compiler,
> different compiler options, same binutils).
>
> Interestingly pg_fprintf is only generated by pg_fprintf, as far as I
> can tell, and the 1/2 typedefs are from libpq. The relevant entries look
> like:
>
>  <1><4b>: Abbrev Number: 4 (DW_TAG_typedef)
> <4c>   DW_AT_name: (indirect string, offset: 0x0): 
> USE_SSE42_CRC32C_WITH_RUNTIME_CHECK 1
> <50>   DW_AT_decl_file   : 2
> <51>   DW_AT_decl_line   : 216
> <52>   DW_AT_decl_column : 23
> <53>   DW_AT_type: <0x57>
>
> So I suspect it might be the corruption/misparsing of objdump that's to
> blame. Kinda looks like there's an issue with the dwarf stringtable, and
> thus the wrong strings (debug information for macros in this case) is
> being picked up.
>

This looks like a bug in the version of objdump unless I'm reading it
wrong. Why would this be tagged as a typedef?


I would tentatively suggest that you revert to the previous version of
binutils if possible, and consider reporting a bug in the bleeding edge
of objdump.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: accounting for memory used for BufFile during hash joins

2019-07-11 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:

Okay, so, while I do have specific, actual code review/commitfest-y
feedback for the patch in this thread registered for this commitfest,
I wanted to defer that for a later email and use this one to cover off
on a few higher level issues.

1) How this patch's approach fits into the wider set of problems with
hybrid hashjoin.

2) Parallel HashJoin implementation of this patch's approach

I think implementing support for parallel hashjoin or explicitly
disabling it would be the bare minimum for this patch, which is why I
made 2 its own item. I've marked it as returned to author for this
reason.



OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe add
BNL on top.


I do think that accounting for Buffile overhead when estimating the
size of the hashtable during ExecChooseHashTableSize() so it can be
used during planning is a worthwhile patch by itself (though I know it
is not even part of this patch).



+1 to that


I'll start with 2 since I have less to say there.

From comments upthread, I take it this would not work with parallel
hashjoin as expected. Is this because each worker operates on batches
independently and now batches are lumped into slices?

Thinking through a parallel-aware implementation, it seems like you
would use slice-based barriers for the build phase but batch-based
barriers for the probe phase to avoid getting out of sync (workers
with outer tuples from one batch should not try and join those with
tuples from another batch, even if in the same slice).

You would, of course, need to add code to make slices work with
SharedTuplestore--caveat here is I still haven't tried to understand
how parallel-aware hashjoin works/uses SharedTuplestore.



I don't know. I haven't thought about the parallel version very much. I
wonder if Thomas Munro has some thoughts about it ...


Now, addressing 1, how this patch fits into the wider set of problem's
with current hybrid hashjoin:

Thomas Munro nicely summarized roughly what I'm about to lay out like
this (upthread) -- he called them "three separate but related
problems":


A.  Broken escape valve:  sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys.  We could fix that
with (say) a 95% rule.
B.  Lack of good alternative execution strategy when the escape valve
is triggered.  A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C.  Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.


However, I would like to lay out the problem space a little bit
differently. (using the end of the alphabet to differentiate).

The following scenarios are how you could end up running out of
memory:

Y. Plan-time underestimation of the number of required batches with
relatively uniform data distribution

In this case, the best join execution strategy is a plain hashjoin
with spilling as needed.
nbatches should be increased as needed, because the data is ~evenly
distributed.
slicing should be employed when buffile overhead exceeds some
threshold for the ratio of work_mem to be used for buffile overhead



OK, makes sense. But at some point we get so many slices the overflow
files alone use more than work_mem. Of course, to hit that the
underestimate needs to be sufficiently serious. My understanding was we'll
roll until that point and then switch to BNL.


Z. Plan and or execution time underestimation of the number of
required batches with skewed data

If you knew this at planning time, you could have picked another
join-type, though, there might be cases where it would actually be
less costly to use plain hashjoin for all batches except the bad batch
and fall back to hash block nested loop join just for the duplicate
values.

If you could not have known this at planning time, the best join
execution strategy is a hybrid hashjoin/hash block nested loop join.

To do this, preview if increasing nbatches would move tuples, and, if
it would, do this (also, employing slicing if buffile overhead exceeds
the threshold)

If increasing nbatches wouldn't move tuples, process this batch with
hash block nested loop join.



OK.


Essentially, what we want is logical units of tuples which are
work_mem-sized. In some cases, each unit may contain multiple batches
(a slice in Tomas' patch) and in other cases, each unit may contain
only part of a batch (a chunk is the term I used in my hash block
nested loop join patch [1]).



OK, although with slicing the work_mem-sized unit is still one batch. The
slice just ensures the metadata we need to 

Fwd: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Rafia Sabih
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan  wrote:

> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with
+1

I tried this patch and found the improvements impressive. However,
when I tried it with multi-column indexes it wasn't giving any
improvement; is that a known limitation of the patch?
I am surprised to find that such a patch has been on the radar for
quite some years now and is not yet committed.

Going through the patch, here are a few comments from me,

 /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?
I didn't find any other such statements in this API, and there are many
in this patch, so I am not sure how much they are needed.

/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;

Is this a magic number...?

/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This could be rephrased to make the wording gender-neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih




Re: SQL/JSON path issues/questions

2019-07-11 Thread Alexander Korotkov
On Thu, Jul 11, 2019 at 5:10 PM Thom Brown  wrote:
> On Wed, 10 Jul 2019 at 05:58, Alexander Korotkov
>  wrote:
> >
> > On Mon, Jul 8, 2019 at 12:30 AM Alexander Korotkov
> >  wrote:
> > > On Thu, Jul 4, 2019 at 4:38 PM Liudmila Mantrova
> > >  wrote:
> > > > Thank  you!
> > > >
> > > > I think we can make this sentence even shorter, the fix is attached:
> > > >
> > > > "To refer to a JSON element stored at a lower nesting level, add one or
> > > > more accessor operators after @."
> > >
> > > Thanks, looks good to me.  Attached revision of patch contains commit
> > > message.  I'm going to commit this on no objections.
> >
> > So, pushed!
>
> I've just noticed the >= operator shows up as just > in the "jsonpath
> Filter Expression Elements" table, and the <= example only shows <.

Thank you for catching this!  Fix just pushed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Bruce Momjian
On Thu, Jul 11, 2019 at 06:45:36AM -0600, Ryan Lambert wrote:
> > >>Uh, what if a transaction modifies page 0 and page 1 of the same table
> 
> > >>--- don't those pages have the same LSN.
> > >
> > >No, because WAL being a physical change log, each page gets its own
> > >WAL record with its own LSN.
> > >
> >
> > What if you have wal_log_hints=off? AFAIK that won't change the page LSN.
> 
> >
> 
> > Alvaro suggested elsewhere that we require checksums for these, which
> > would also force wal_log_hints to be on, and therefore the LSN would
> > change.
>  
> Yes, it sounds like the agreement was LSN is unique when wal_log_hints is on. 
> I don't know enough about the internals to know if pg_class.oid is also needed
> or not.

Well, so, as far as we know now, every change to a heap/index page
advances the LSN, except for hint bits, which we can force to advance
LSN via wal_log_hints.  We automatically enable wal_log_hints for
checksums, so we can easily enable it automatically for encrypted
clusters.

I assume the LSN used for 8k pages and the segment numbers (for WAL) do
not overlap in numbering, for our nonce.

I think we will eventually have to have this setup verified by security
experts but I think we need to work through what is possible using
existing Postgres facilities before we present a possible solution.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-11 Thread Tomas Vondra

On Wed, Jul 10, 2019 at 06:48:16PM -0400, Tom Lane wrote:

Oh ... while we're piling on here, it just sunk into me that
mcv_get_match_bitmap is deciding what the semantics of an operator
are by seeing what it's using for a selectivity estimator.
That is just absolutely, completely wrong.  For starters, it
means that the whole mechanism fails for any operator that wants
to use a specialized estimator --- hardly an unreasonable thing
to do.  For another, it's going to be pretty unreliable for
extensions, because I do not think they're all careful about using
the right estimator --- a lot of 'em probably still haven't adapted
to the introduction of separate <= / >= estimators, for instance.

The right way to determine operator semantics is to look to see
whether they are in a btree opclass.  That's what the rest of the
planner does, and there is no good reason for the mcv code to
do it some other way.



Hmmm, but that will mean we're unable to estimate operators that are not
part of a btree opclass. Which is a bit annoying, because that would also
affect equalities (and thus functional dependencies), in which case the
type may easily have just a hash opclass or something.

But maybe I just don't understand how the btree opclass detection works
and it'd be fine for sensibly defined operators. Can you point me to code
doing this elsewhere in the planner?
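
(For reference, the kind of lookup being referred to is presumably along
these lines -- a rough, untested sketch using get_op_btree_interpretation()
from utils/lsyscache.h; the classification function itself is made up for
illustration:)

#include "postgres.h"
#include "access/stratnum.h"
#include "utils/lsyscache.h"

/* Sketch: classify an operator by btree opfamily membership rather than
 * by which restriction estimator it happens to use. */
static bool
operator_has_lessthan_semantics(Oid opno)
{
	List	   *btree_ops = get_op_btree_interpretation(opno);
	ListCell   *lc;
	bool		result = false;

	foreach(lc, btree_ops)
	{
		OpBtreeInterpretation *interp = (OpBtreeInterpretation *) lfirst(lc);

		if (interp->strategy == BTLessStrategyNumber ||
			interp->strategy == BTLessEqualStrategyNumber)
		{
			result = true;
			break;
		}
	}
	list_free_deep(btree_ops);

	return result;
}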

We may need to do something about the code in pg10, because functional
dependencies do this too (although just with F_EQSEL).  But maybe we should
leave pg10 alone, we got exactly 0 reports about it until now anyway.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Alexander Korotkov
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan  wrote:
> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantages of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance
degradation in some particular cases, because page-level locks become
more coarse-grained once pages hold more tuples.  However, this
doesn't seem like something we should care much about.  Providing an
option to turn deduplication off looks sufficient to me.

Regarding bitmap indexes themselves, I think our BRIN could provide them.
However, it would be useful to have opclass parameters to make them
tunable.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Alexander Korotkov
Hi Peter,

Thank you very much for your attention to this patch. Let me comment
some points of your message.

On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan  wrote:
> On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
>  wrote:
> > The new version of the patch is attached.
> > This version is even simpler than the previous one,
> > thanks to the recent btree design changes and all the feedback I received.
> > I consider it ready for review and testing.
>
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.
>
> Preserving the *logical contents* of B-Tree indexes that use
> compression is very important -- that should not change in a way that
> outside code can notice. The heap TID itself should count as logical
> contents here, since we want to be able to implement retail index
> tuple deletion in the future. Even without retail index tuple
> deletion, amcheck's "rootdescend" verification assumes that it can
> find one specific tuple (which could now just be one specific "logical
> tuple") using specific key values from the heap, including the heap
> tuple's heap TID. This requirement makes things a bit harder for your
> patch, because you have to deal with one or two edge-cases that you
> currently don't handle: insertion of new duplicates that fall inside
> the min/max range of some existing posting list. That should be rare
> enough in practice, so the performance penalty won't be too bad. This
> probably means that code within _bt_findinsertloc() and/or
> _bt_binsrch_insert() will need to think about a logical tuple as a
> distinct thing from a physical tuple, though that won't be necessary
> in most places.

Could you please elaborate more on preserving the logical contents?  I
understand it as follows: "B-Tree should have the same structure
and invariants as if each TID in a posting list were a separate tuple".
So, if we imagine each TID becoming a separate tuple, it would be the
same B-tree, which just can magically sometimes store more tuples per
page.  Is my understanding correct?  But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does).  Do you mean we need an API for accessing
logical B-tree tuples or something?

> The need to "preserve the logical contents" also means that the patch
> will need to recognize when indexes are not safe as a target for
> compression/deduplication (maybe we should call this feature
> deduplilcation, so it's clear how it differs from TOAST?). For
> example, if we have a case-insensitive ICU collation, then it is not
> okay to treat an opclass-equal pair of text strings that use the
> collation as having the same value when considering merging the two
> into one. You don't actually do that in the patch, but you also don't
> try to deal with the fact that such a pair of strings are equal, and
> so must have their final positions determined by the heap TID column
> (deduplication within _bt_compress_one_page() must respect that).
> Possibly equal-but-distinct values seems like a problem that's not
> worth truly fixing, but it will be necessary to store metadata about
> whether or not we're willing to do deduplication in the meta page,
> based on operator class and collation details. That seems like a
> restriction that we're just going to have to accept, though I'm not
> too worried about exactly what that will look like right now. We can
> work it out later.

I think that in order to deduplicate "equal but distinct" values we would
at least need to give up on index-only scans, because we have no
guarantee that values which are equal according to the B-tree opclass are
the same for other operations and/or user output.

> I think that the need to be careful about the logical contents of
> indexes already causes bugs, even with "safe for compression" indexes.
> For example, I can sometimes see an assertion failure
> within_bt_truncate(), at the point where we check if heap TID values
> are safe:
>
> /*
>  * Lehman and Yao require that the downlink to the right page, which is to
>  * be inserted into the parent page in the second phase of a page split be
>  * a strict lower bound on items on the right page, and a non-strict upper
>  * bound for items on the left page.  Assert that heap TIDs follow these
>  * invariants, since a heap TID value is apparently needed as a
>  * tiebreaker.
>  */
> #ifndef DEBUG_NO_TRUNCATE
> Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
>   BTreeTupleGetMinTID(firstright)) < 0);
> ...
>
> This bug is not that easy to see, but it will happen with a big index,
> even without updates or deletes. I think that this happens because
> compression can allow the "logical tuples" to be in the wrong heap TID
> order when there are multiple posting lists for the same value. As I
> said, I think that it's necessary to see a posting list as being
> comprised of multiple 

Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-11 Thread Tomas Vondra

On Thu, Jul 11, 2019 at 09:55:01AM +0900, Michael Paquier wrote:

On Wed, Jul 10, 2019 at 11:26:09PM +0200, Tomas Vondra wrote:

Yeah, that's a bug. Will fix (not sure how yet).


Please note that I have added an open item for it.


Thanks.

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: base backup client as auxiliary backend process

2019-07-11 Thread Tom Lane
Robert Haas  writes:
> On Sat, Jun 29, 2019 at 4:05 PM Peter Eisentraut
>  wrote:
>> My idea is that the postmaster can launch a base backup worker, wait
>> till it's done, then proceed with the rest of the startup.  initdb gets
>> a special option to create a "minimal" data directory with only a few
>> files, directories, and the usual configuration files.

> Why do we even have to do that much?  Can we remove the need for an
> initdb altogether?

Gotta have config files in place already, no?

regards, tom lane




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Tom Lane
Robert Haas  writes:
> I had the same thought, but I just realized that's probably
> unfriendly: at the point when the client is assembling the list of
> names to send to the server, it doesn't know the server version.  I
> think we're probably best off assuming that any names we don't
> recognize are something that got added in a newer server version and
> just ignoring them. The client is not powerless to sort this out
> after-the-fact: once the connection is made, they'll know the server
> version and also have the option to interrogate pg_settings if they
> wish.

> We also need to think about how to write a test for this patch...

All of the above is based on the assumption that this isn't a plain
old USERSET GUC, which I'm not really seeing the argument for.
OK, there might be *implementation* reasons why we would rather not
deal with on-the-fly changes to the list, but there's no advantage
to users in it.

regards, tom lane




Re: pg_stat_database update stats_reset only by pg_stat_reset

2019-07-11 Thread Daniel Verite
张连壮 wrote:

> it resets statistics for a single table and updates the column stats_reset of
> pg_stat_database.
> but I think that stats_reset should be database-level statistics; a single
> table should not update the column stats_reset.

This patch is a current CF entry at
https://commitfest.postgresql.org/23/2116/

The issue it addresses was submitted as bug #15801:
https://www.postgresql.org/message-id/flat/15801-21c7fbff08b6c10c%40postgresql.org

As mentioned in the discussion on -bugs, it's not necessarily a bug
because:

* the comment in the code specifically states that it's intentional,
in pgstat_recv_resetsinglecounter():

/* Set the reset timestamp for the whole database */
dbentry->stat_reset_timestamp = GetCurrentTimestamp();

* the commit message also states the same:

commit 4c468b37a281941afd3bf61c782b20def8c17047
Author: Magnus Hagander 
Date:   Thu Feb 10 15:09:35 2011 +0100

Track last time for statistics reset on databases and bgwriter

Tracks one counter for each database, which is reset whenever
the statistics for any individual object inside the database is
reset, and one counter for the background writer.

Tomas Vondra, reviewed by Greg Smith


I can understand why you'd want resetting the stats for a single object
not to reset the per-database timestamp, but this would revert an 8+ year
old decision that seems intentional and has apparently not been criticized
since then (based on searching for pg_stat_reset_single_table_counters in
the archives).  More opinions are probably needed in favor of this
change (or against, in which case the fate of the patch might be
rejection).


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] [PATCH] Generic type subscripting

2019-07-11 Thread David Fetter
On Thu, Jul 11, 2019 at 04:30:46PM +0200, Dmitry Dolgov wrote:
> > On Thu, Jul 11, 2019 at 9:47 AM David Fetter  wrote:
> >
> > Looks great!
> >
> > The tutorial piece has bit-rotted slightly. Please find attached a
> > patch atop yours that fixes it.
> 
> Indeed, I've missed this change, thank you! Although it's supposed to be an
> upperindex, not numupper (since the latter is just the number of upper indexes).

Oops! Fooled by a suggestion from the compiler. My bad for not
checking more carefully. Thanks for making this happen.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-11 Thread Bruce Momjian
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote:
> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
> 
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantages of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Wow, I never thought of that.  The only things I know we lock until
transaction end are rows we update (against concurrent updates), and
additions to unique indexes.  By definition, indexes with many
duplicates are not unique, so that doesn't apply.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: base backup client as auxiliary backend process

2019-07-11 Thread Robert Haas
On Sat, Jun 29, 2019 at 4:05 PM Peter Eisentraut
 wrote:
> My idea is that the postmaster can launch a base backup worker, wait
> till it's done, then proceed with the rest of the startup.  initdb gets
> a special option to create a "minimal" data directory with only a few
> files, directories, and the usual configuration files.

Why do we even have to do that much?  Can we remove the need for an
initdb altogether?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Robert Haas
On Thu, Jul 11, 2019 at 10:11 AM Tom Lane  wrote:
> Robert Haas  writes:
> > 3. I'm not sure that just ignoring any GUCs we don't find is the right
> > thing.  I'm also not sure that it's the wrong thing, but it might be.
> > My question is: what if there's an extension-owned GUC in play? The
> > library probably isn't even loaded at this stage, unless it's in
> > shared_preload_libraries.
>
> Gut reaction is that define_custom_variable would need to consult
> the list to see if a newly-defined variable should be marked GUC_REPORT.

Yeah, that seems like a good idea.

> Therefore, at least for qualified GUC names, we can't issue an error
> for unrecognized names.  But maybe it should complain about unrecognized
> unqualified names.

I had the same thought, but I just realized that's probably
unfriendly: at the point when the client is assembling the list of
names to send to the server, it doesn't know the server version.  I
think we're probably best off assuming that any names we don't
recognize are something that got added in a newer server version and
just ignoring them. The client is not powerless to sort this out
after-the-fact: once the connection is made, they'll know the server
version and also have the option to interrogate pg_settings if they
wish.

We also need to think about how to write a test for this patch...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Tom Lane
Robert Haas  writes:
> 3. I'm not sure that just ignoring any GUCs we don't find is the right
> thing.  I'm also not sure that it's the wrong thing, but it might be.
> My question is: what if there's an extension-owned GUC in play? The
> library probably isn't even loaded at this stage, unless it's in
> shared_preload_libraries.

Gut reaction is that define_custom_variable would need to consult
the list to see if a newly-defined variable should be marked GUC_REPORT.

Therefore, at least for qualified GUC names, we can't issue an error
for unrecognized names.  But maybe it should complain about unrecognized
unqualified names.

regards, tom lane




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Dave Cramer
On Thu, 11 Jul 2019 at 10:06, Robert Haas  wrote:

> On Thu, Jul 11, 2019 at 8:23 AM Dave Cramer  wrote:
> > See attached for an initial patch. If this is an acceptable way to go I
> will add tests and documentation
>
> And clean up the code?  Doesn't look crazy on a quick glance but I
> think I see at least half a dozen coding style problems.  More
> substantively:
>

Sure, of course

>
> 1. I don't really like putting 'guc' into an externally visible name;
> that's why I suggested 'report'.
>

sure, no problem

>
> 2. I haven't really scrutinized whether what SetConfigReport is an OK
> way of implementing this.  That probably needs some study.  It may be
> fine.
>
> 3. I'm not sure that just ignoring any GUCs we don't find is the right
> thing.  I'm also not sure that it's the wrong thing, but it might be.
> My question is: what if there's an extension-owned GUC in play? The
> library probably isn't even loaded at this stage, unless it's in
> shared_preload_libraries.
>
Well, we haven't even established the connection. I don't really see a way
to find extensions.
I thought about checking the available GUCs. I'll add that.

Thanks for your quick response


Dave Cramer

da...@postgresintl.com
www.postgresintl.com


>
>


Re: SQL/JSON path issues/questions

2019-07-11 Thread Thom Brown
On Wed, 10 Jul 2019 at 05:58, Alexander Korotkov
 wrote:
>
> On Mon, Jul 8, 2019 at 12:30 AM Alexander Korotkov
>  wrote:
> > On Thu, Jul 4, 2019 at 4:38 PM Liudmila Mantrova
> >  wrote:
> > > Thank  you!
> > >
> > > I think we can make this sentence even shorter, the fix is attached:
> > >
> > > "To refer to a JSON element stored at a lower nesting level, add one or
> > > more accessor operators after @."
> >
> > Thanks, looks good to me.  Attached revision of patch contains commit
> > message.  I'm going to commit this on no objections.
>
> So, pushed!

I've just noticed the >= operator shows up as just > in the "jsonpath
Filter Expression Elements" table, and the <= example only shows <.

Thom




Re: base backup client as auxiliary backend process

2019-07-11 Thread Euler Taveira
On Sat, Jun 29, 2019 at 17:05, Peter Eisentraut
 wrote:
>
> Setting up a standby instance is still quite complicated.  You need to
> run pg_basebackup with all the right options.  You need to make sure
> Attached is a very hackish patch to implement this.  It works like this:
>
> # (assuming you have a primary already running somewhere)
> initdb -D data2 --minimal
> $EDITOR data2/postgresql.conf  # set primary_conninfo
> pg_ctl -D data2 start
>
Great! The main complaints about pg_basebackup usage in TB clusters
are: (a) it can't be restarted and (b) it can't be parallelized.
AFAICS your proposal doesn't solve them. It would be nice if they could be
solved in future releases (using rsync or another in-house tool is as
fragile as using pg_basebackup).


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: complier warnings from ecpg tests

2019-07-11 Thread Sergei Kornilov
Hi

> Are you using -Wformat-overflow? At which level?

I use: ./configure --prefix=somepath --enable-cassert --enable-debug 
CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" --enable-tap-tests
No other explicit options.

pg_config reports:

CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g 
-ggdb -Og -g3 -fno-omit-frame-pointer
CFLAGS_SL = -fPIC

regards, Sergei




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Antonin Houska
Joe Conway  wrote:

> Please see my other reply (and
> https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf
> appendix C as pointed out by Ryan downthread).

Thanks.

> At least in my mind, I trust a published specification from the
> nation-state level over random blogs or wikipedia. If we can find some
> equivalent published standards that contradict NIST we should discuss
> it, but for my money I would prefer to stick with the NIST recommended
> method to produce the IVs.

I don't think of this as a problem of trusting A over B. Those blogs try to
explain the attacks in detail, while the NIST standard is just a set of
recommendations that does not (try to) provide technical details of
comparable depth.

Although I prefer understanding things in detail, I think it's o.k. to say in
the documentation that "we use ... cipher because it complies with ... standard".

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Conflict handling for COPY FROM

2019-07-11 Thread Surafel Temesgen
Hi

>
> Also, I would prefer having an option to ignore all errors, e.g. with
> option ERROR_LIMIT set to -1. Because it is rather difficult to estimate
> a number of future errors if you are playing with some badly structured
> data, while always setting it to 100500k looks ugly.
>

Here is a patch that contains all of the review comments, except adding a
way to specify ignoring all errors, because specifying a very high number
can do that job, and it may be that trying to store such badly structured
data is a bad idea anyway.

regards
Surafel
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 5e2992ddac..f16108b23b 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { table_name [ ( column_name [, ...] )
 FORCE_NULL ( column_name [, ...] )
 ENCODING 'encoding_name'
+ERROR 'limit_number'
 
  
 
@@ -355,6 +356,22 @@ COPY { table_name [ ( 

 
+   
+ERROR
+
+ 
+  Specifies to ignore error record up to limit_number number.
+ 
+
+ 
+  Currently, only unique or exclusion constraint violation
+  and same record formatting error is ignored.
+ 
+
+
+   
+

 WHERE
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4f04d122c3..2a6bc48f78 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -48,6 +48,7 @@
 #include "port/pg_bswap.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/fd.h"
+#include "storage/lmgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -154,6 +155,7 @@ typedef struct CopyStateData
 	List	   *convert_select; /* list of column names (can be NIL) */
 	bool	   *convert_select_flags;	/* per-column CSV/TEXT CS flags */
 	Node	   *whereClause;	/* WHERE condition (or NULL) */
+	int			error_limit;	/* total number of error to ignore */
 
 	/* these are just for error messages, see CopyFromErrorCallback */
 	const char *cur_relname;	/* table name for error messages */
@@ -1291,6 +1293,21 @@ ProcessCopyOptions(ParseState *pstate,
 defel->defname),
 		 parser_errposition(pstate, defel->location)));
 		}
+		else if (strcmp(defel->defname, "error_limit") == 0)
+		{
+			if (cstate->error_limit > 0)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("conflicting or redundant options"),
+		 parser_errposition(pstate, defel->location)));
+			cstate->error_limit = defGetInt64(defel);
+			if (cstate->error_limit <= 0)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("argument to option \"%s\" must be positive integer or -1",
+defel->defname),
+		 parser_errposition(pstate, defel->location)));
+		}
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
@@ -1441,6 +1458,10 @@ ProcessCopyOptions(ParseState *pstate,
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("CSV quote character must not appear in the NULL specification")));
+	if (cstate->error_limit && !cstate->is_copy_from)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("ERROR LIMIT only available using COPY FROM")));
 }
 
 /*
@@ -2678,6 +2699,7 @@ CopyFrom(CopyState cstate)
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
+	DestReceiver *dest;
 
 	Assert(cstate->rel);
 
@@ -2841,7 +2863,17 @@ CopyFrom(CopyState cstate)
 	/* Verify the named relation is a valid target for INSERT */
 	CheckValidResultRel(resultRelInfo, CMD_INSERT);
 
-	ExecOpenIndices(resultRelInfo, false);
+	if (cstate->error_limit)
+	{
+		TupleDesc	tupDesc;
+
+		tupDesc = RelationGetDescr(cstate->rel);
+		ExecOpenIndices(resultRelInfo, true);
+		dest = CreateDestReceiver(DestRemoteSimple);
+		dest->rStartup(dest, (int) CMD_SELECT, tupDesc);
+	}
+	else
+		ExecOpenIndices(resultRelInfo, false);
 
 	estate->es_result_relations = resultRelInfo;
 	estate->es_num_result_relations = 1;
@@ -2946,6 +2978,13 @@ CopyFrom(CopyState cstate)
 		 */
 		insertMethod = CIM_SINGLE;
 	}
+	else if (cstate->error_limit)
+	{
+		/*
+		 * Can't support speculative insertion in multi-inserts.
+		 */
+		insertMethod = CIM_SINGLE;
+	}
 	else
 	{
 		/*
@@ -3289,6 +3328,63 @@ CopyFrom(CopyState cstate)
 		 */
 		myslot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 	}
+	else if (cstate->error_limit && resultRelInfo->ri_NumIndices > 0)
+	{
+		/* Perform a speculative insertion. */
+		uint32		specToken;
+		ItemPointerData conflictTid;
+		bool		specConflict;
+
+		/*
+		 * Do a non-conclusive check for conflicts first.
+		 */
+		specConflict = false;
+
+		if (!ExecCheckIndexConstraints(myslot, estate, ,
+	   NIL))
+		{
+			(void) dest->receiveSlot(myslot, dest);
+			cstate->error_limit--;
+			continue;
+		}
+
+		/*
+		 * Acquire our speculative insertion lock".
+		 */
+		specToken = 

Re: complier warnings from ecpg tests

2019-07-11 Thread Michael Paquier
On Thu, Jul 11, 2019 at 03:21:15PM +0300, Sergei Kornilov wrote:
> I noticed few warnings from my compiler (gcc version 8.3.0 (Debian
> 8.3.0-6)) during make check-world:
>
> They come from the src/interfaces/ecpg tests (
> ./src/interfaces/ecpg/test/sql/array.pgc ).
> This code seems to be 4 years old, but I did not find any discussion
> related to such compiler warnings. Is this expected?

Are you using -Wformat-overflow?  At which level?
--
Michael


signature.asc
Description: PGP signature


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-11 Thread Michael Paquier
On Thu, Jul 11, 2019 at 11:48:20AM +0200, Julien Rouhaud wrote:
> On Thu, Jul 11, 2019 at 6:04 AM Michael Paquier  wrote:
>> Get* would be my choice, because we look at the set of parallel slots,
>> and get an idle one.
> 
> That's what the former GetIdleSlot (that I renamed to get_idle_slot as
> it's not static) is doing.  ConsumeIdleSlot() actually get an idle
> slot and mark it as being used.  That's probably some leakage of
> internal implementation, but having a GetIdleParallelSlot (or
> ParallelSlotGetIdle) *and* a get_idle_slot sounds like a bad idea, and
> I don't have a better idea on how to rename get_idle_slot.  If you
> really prefer Get* and you're fine with a static get_idle_slot, that's
> fine by me.

Do we actually need get_idle_slot?  ConsumeIdleSlot is its only
caller.

>> and we'd
>> need to be careful of the same where pg_catalog is listed.  Actually,
>> your patch breaks if we do a parallel run with pg_catalog and another
>> schema, no?
> 
> It can definitely cause problems if you ask for pg_catalog and other
> schema, same as if you ask a parallel reindex of some catalog tables
> (possibly with other tables).  There's a --system switch for that
> need, so I don't know if documenting the limitation to avoid some
> extra code to deal with it is a good enough solution?

vacuumdb --full still has limitations in this area and we had some
reports on the matter about this behavior being annoying.  Its
documentation also mentions that mixing catalog relations with  --full
can cause deadlocks.

Documenting it may be fine at the end, but my take is that it would be
nice to make sure that we don't have deadlocks if we can avoid them
easily.  It is also a matter of balance.  If, for example, the patch
gets 3 times bigger in size because of that, we may have an argument
for not doing it and keeping the code simple.  What do people think about
that?  It would be nice to get more opinions here.

And while scanning the code...

+ * getQuerySucess
Typo here.

- * Pump the conn till it's dry of results; return false if any are errors.
This comment could be improved on the way, like "Go through all the
connections and make sure to consume any remaining results.  If any
error is found, false is returned after processing all the parallel
slots."
--
Michael


signature.asc
Description: PGP signature


Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-07-11 Thread Thomas Munro
On Tue, Jul 9, 2019 at 3:31 PM Mike Palmiotto
 wrote:
> Attached you will find a patch (rebased on master) that passes all
> tests on my local CentOS 7 box. Thanks again for the catch!

Hi Mike,

Here are some comments on superficial aspects of the patch:

+/* Custom partition child access hook. Provides further partition pruning given
+ * child OID.
+ */

Should be like:

/*
 * Multi-line comment...
 */

Why "child"?  Don't you really mean "Partition pruning hook.  Provides
custom pruning given partition OID." or something?

+typedef bool (*partitionChildAccess_hook_type) (Oid childOID);
+PGDLLIMPORT partitionChildAccess_hook_type partitionChildAccess_hook;

Hmm, I wonder if this could better evoke the job that it's doing...
partition_filter_hook?
partition_access_filter_hook?  partition_prune_hook?

+/* Macro to use partitionChildAccess_hook. Handles NULL-checking. */

It's not a macro, it's a function.

+static inline bool InvokePartitionChildAccessHook (Oid childOID)
+{
+if (partitionChildAccess_hook && enable_partition_pruning && childOID)
+{

Normally we write OidIsValid(childOID) rather than comparing with 0.
I wonder if you should call the variable relId?  Single line if
branches don't usually get curly braces.

+return (*partitionChildAccess_hook) (childOID);

The syntax we usually use for calling function pointers is just
partitionChildAccess_hook(childOID).
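
Putting those comments together, the declarations might end up looking
roughly like this (a sketch only, keeping the hook name from the patch and
assuming the existing enable_partition_pruning GUC is the right gate):

/*
 * Partition pruning hook.  Provides custom pruning given a partition OID.
 */
typedef bool (*partitionChildAccess_hook_type) (Oid relId);
extern PGDLLIMPORT partitionChildAccess_hook_type partitionChildAccess_hook;

/* Inline helper to invoke partitionChildAccess_hook; handles NULL-checking. */
static inline bool
InvokePartitionChildAccessHook(Oid relId)
{
	if (partitionChildAccess_hook && enable_partition_pruning &&
		OidIsValid(relId))
		return partitionChildAccess_hook(relId);

	return true;
}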

-- 
Thomas Munro
https://enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-11 Thread Ryan Lambert
> >>Uh, what if a transaction modifies page 0 and page 1 of the same table

> >>--- don't those pages have the same LSN.
> >
> >No, because WAL being a physical change log, each page gets its own
> >WAL record with its own LSN.
> >
>
> What if you have wal_log_hints=off? AFAIK that won't change the page LSN.
>

> Alvaro suggested elsewhere that we require checksums for these, which
> would also force wal_log_hints to be on, and therefore the LSN would
> change.


Yes, it sounds like the agreement was LSN is unique when wal_log_hints is
on.  I don't know enough about the internals to know if pg_class.oid is
also needed or not.

Ryan

On Wed, Jul 10, 2019 at 6:07 PM Bruce Momjian  wrote:

> On Thu, Jul 11, 2019 at 12:18:47AM +0200, Tomas Vondra wrote:
> > On Wed, Jul 10, 2019 at 06:04:30PM -0400, Stephen Frost wrote:
> > > Greetings,
> > >
> > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > > On Wed, Jul 10, 2019 at 04:11:21PM -0400, Alvaro Herrera wrote:
> > > > >On 2019-Jul-10, Bruce Momjian wrote:
> > > > >
> > > > >>Uh, what if a transaction modifies page 0 and page 1 of the same
> table
> > > > >>--- don't those pages have the same LSN.
> > > > >
> > > > >No, because WAL being a physical change log, each page gets its own
> > > > >WAL record with its own LSN.
> > > > >
> > > >
> > > > What if you have wal_log_hints=off? AFAIK that won't change the page
> LSN.
> > >
> > > Alvaro suggested elsewhere that we require checksums for these, which
> > > would also force wal_log_hints to be on, and therefore the LSN would
> > > change.
> > >
> >
> > Oh, I see - yes, that would solve the hint bits issue. Not sure we want
> > to combine the features like this, though, as it increases the costs of
> > TDE. But maybe it's the best solution.
>
> Uh, why can't we just force log_hint_bits for encrypted tables?  Why
> would we need to use checksums as well?
>
> Why is page-number not needed in the nonce?  Because it is duplicative
> of the LSN?  Can we use just LSN?  Do we need pg_class.oid too?
>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>


Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-07-11 Thread Dave Cramer
On Wed, 10 Jul 2019 at 16:34, Tom Lane  wrote:

> Robert Haas  writes:
> > On Wed, Jul 10, 2019 at 9:59 AM Dave Cramer  wrote:
> >> I'm still a bit conflicted about what to do with search_path as I do
> believe this is potentially a security issue.
> >> It may be that we always want to report that and possibly back patch it.
>
> > I don't see that as a feasible option unless we make the logic that
> > does the reporting smarter.  If it changes transiently inside of a
> > security-definer function, and then changes back, my recollection is
> > that right now we would report both changes.  I think that could cause
> > a serious efficiency problem if you are calling such a function in a
> > loop.
>
> And, even more to the point, what's the client side going to do with
> the information?  If there was a security problem inside the
> security-definer function, it's too late.  And the client can't do
> much about it anyway.
>
> If we have a configurable GUC_REPORT list, and somebody thinks it's useful
> to them to report search_path, I don't wish to stand in their way.
> But the argument that this is useful is so tissue-thin that we have no
> business imposing the overhead on everybody, much less back-patching it.
>
> regards, tom lane
>

See attached for an initial patch. If this is an acceptable way to go I
will add tests and documentation


0001-add-special-startup-parameter-_pq_.guc_report-to-add.patch
Description: Binary data


complier warnings from ecpg tests

2019-07-11 Thread Sergei Kornilov
Hello

I noticed few warnings from my compiler (gcc version 8.3.0 (Debian 8.3.0-6)) 
during make check-world:

array.pgc: In function ‘main’:
array.pgc:41:16: warning: ‘%d’ directive writing between 1 and 11 bytes into a 
region of size 10 [-Wformat-overflow=]
   sprintf(str, "2000-1-1 0%d:00:00", j);
^~~~
array.pgc:41:16: note: directive argument in the range [-2147483648, 9]
array.pgc:41:3: note: ‘sprintf’ output between 18 and 28 bytes into a 
destination of size 20
   sprintf(str, "2000-1-1 0%d:00:00", j);
   ^
array.pgc:43:16: warning: ‘sprintf’ may write a terminating nul past the end of 
the destination [-Wformat-overflow=]
   sprintf(str, "2000-1-1%d\n", j);
^~
array.pgc:43:3: note: ‘sprintf’ output between 11 and 21 bytes into a 
destination of size 20
   sprintf(str, "2000-1-1%d\n", j);
   ^~~

They come from the src/interfaces/ecpg tests (
./src/interfaces/ecpg/test/sql/array.pgc ).
This code seems to be 4 years old, but I did not find any discussion related
to such compiler warnings. Is this expected?
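
For what it's worth, giving the buffer enough room for any int value (or
bounding the writes with snprintf) should silence both warnings.  A small
standalone sketch, not a proposal for how the ecpg test should be changed:

#include <stdio.h>

int
main(void)
{
	int		j = 3;
	char	str[32];	/* the test uses str[20]; 32 fits any int value */

	snprintf(str, sizeof(str), "2000-1-1 0%d:00:00", j);
	puts(str);
	snprintf(str, sizeof(str), "2000-1-1%d\n", j);
	fputs(str, stdout);

	return 0;
}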

regards, Sergei




Re: Tid scan improvements

2019-07-11 Thread Edmund Horner
On Thu, 11 Jul 2019 at 10:22, David Rowley  wrote:
> > A.  Continue to target only heapam tables, making the bare minimum
> > changes necessary for the new tableam api.
> > B.  Try to do something more general that works on all tableam
> > implementations for which it may be useful.
>
> Going by the conversation with Andres above:
>
> On Tue, 26 Mar 2019 at 10:52, Andres Freund  wrote:
> >
> > On 2019-03-18 22:35:05 +1300, David Rowley wrote:
> > > The commit message in 8586bf7ed mentions:
> > >
> > > > Subsequent commits will incrementally abstract table access
> > > > functionality to be routed through table access methods. That change
> > > > is too large to be reviewed & committed at once, so it'll be done
> > > > incrementally.
> > >
> > > and looking at [1] I see patch 0004 introduces some changes in
> > > nodeTidscan.c to call a new tableam API function named
> > > heapam_fetch_row_version. I see this function does have a ItemPointer
> > > argument, so I guess we must be keeping those as unique row
> > > identifiers in the API.
> >
> > Right, we are. At least for now - there's some discussions around
> > allowing different format for TIDs, to allow things like index organized
> > tables, but that's for later.
>
> So it seems that the plan is to insist that TIDs are tuple identifiers
> for all table AMs, for now.  If that changes in the future, then so be
> it, but I don't think that's cause for delaying any work on TID Range
> Scans.  Also from scanning around tableam.h, I see that there's no
> shortage of usages of BlockNumber, so it seems reasonable to assume
> table AMs must use blocks... It's hard to imagine moving away from
> that given that we have shared buffers.
>
> We do appear to have some table AM methods that are optional, although
> I'm not sure where the documentation is about that. For example, in
> get_relation_info() I see:
>
> info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
> relation->rd_tableam->scan_bitmap_next_block != NULL;
>
> so it appears that at least scan_bitmap_next_block is optional.

> I think what I'd do would be to add a table_setscanlimits API method
> for table AM and perhaps have the planner only add TID range scan
> paths if the relation has a non-NULL function pointer for that API
> function.  It would be good to stick a comment at least in tableam.h
> that mentions that the callback is optional.  That might help a bit
> when it comes to writing documentation on each API function and what
> they do.

This seems like a good path forward.
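
For what it's worth, a minimal sketch of the gating David describes,
mirroring the amhasgetbitmap example quoted above ("scan_setlimits" and
"amhastidrangescan" are illustrative names here, not existing fields):

/* Only consider TID range scan paths when the table AM supplies the
 * optional callback for bounding a scan to a block range. */
info->amhastidrangescan =
	relation->rd_tableam->scan_setlimits != NULL;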

> > There may not be much different between them, but B. means a bit more
> > research into zheap, zstore and other possible tableams.
> >
> > Next question, how will the executor access the table?
> >
> > 1. Continue to use the seqscan tableam methods, by setting limits.
> > 2. Use the bitmap scan methods, for instance by faking a 
> > BitmapIteratorResuit.
> > 3. Add new tableam methods specially for scanning a range of TIDs.
> >
> > Any thoughts?
>
> I think #1 is fine for now. #3 might be slightly more efficient since
> you'd not need to read each tuple in the given page before the given
> offset and throw it away, but I don't really think it's worth adding a
> new table AM method for.

Yeah, it's not perfectly efficient in that regard.  But it's at least
a step in the right direction.

Thanks for the guidance David.  I think I'll be able to make some
progress now before hitting the next obstacle!

Edmund




Re: Index Skip Scan

2019-07-11 Thread David Rowley
On Thu, 11 Jul 2019 at 02:48, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Jul 2, 2019 at 2:27 PM David Rowley  
> > wrote:
> >
> > The more I think about these UniqueKeys, the more I think they need to
> > be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
> > should be equivalent to { y, x }, but with PathKeys, that's not the
> > case, since the order of each key matters. UniqueKeys equivalent
> > version of pathkeys_contained_in() would not care about the order of
> > individual keys, it would say things like, { a, b, c } is contained in
> > { b, a }, since if the path is unique on columns { b, a } then it must
> > also be unique on { a, b, c }.
>
> > On Tue, Jul 9, 2019 at 3:32 PM Jesper Pedersen  
> > wrote:
> >
> > David, are you thinking about something like the attached ?
> >
> > Some questions.
> >
> > * Do you see UniqueKey as a "complete" planner node ?
> >- I didn't update the nodes/*.c files for this yet
> >
> > * Is a UniqueKey with a list of EquivalenceClass best, or a list of
> > UniqueKey with a single EquivalenceClass
>
> Just for me to clarify, the idea is to replace PathKeys with a new concept of
> "UniqueKey" for skip scans, right? If I see it correctly, of course
>
> UniqueKeys { x, y } == UniqueKeys { y, x }
>
> from the result point of view, but the execution costs could be different due
> to different values distribution. In fact there are efforts to utilize this to
> produce more optimal order [1], but with UniqueKeys concept this information 
> is
> lost. Obviously it's not something that could be immediately (or even never) a
> problem for skip scan functionality, but I guess it's still worth to point it
> out.

The UniqueKeys idea is quite separate from pathkeys.  Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Path's UniqueKey columns/exprs.

As of now, if you look at how standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined.  Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which return
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }.  That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version.  This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.
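
To make that concrete, a very rough sketch of what the struct and the
containment test could look like (field and function names here are
assumptions, not existing code):

/* Sketch only: a canonical UniqueKey holding just an EquivalenceClass. */
typedef struct UniqueKey
{
	NodeTag		type;
	EquivalenceClass *uk_eclass;	/* expressions this path is unique on */
} UniqueKey;

/*
 * uniquekeys_contained_in
 *		Return true if keys1 is a superset of keys2, i.e. a path known to
 *		be unique on keys2 is also unique on keys1.  Since UniqueKeys are
 *		canonical, membership can be tested by pointer, as with PathKeys.
 */
static bool
uniquekeys_contained_in(List *keys1, List *keys2)
{
	ListCell   *lc;

	foreach(lc, keys2)
	{
		if (!list_member_ptr(keys1, lfirst(lc)))
			return false;
	}

	return true;
}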

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: block-level incremental backup

2019-07-11 Thread Jeevan Chalke
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> 23.04.2019 14:08, Anastasia Lubennikova wrote:
> > I'm volunteering to write a draft patch or, more likely, set of
> > patches, which
> > will allow us to discuss the subject in more detail.
> > And to do that I wish we agree on the API and data format (at least
> > broadly).
> > Looking forward to hearing your thoughts.
>
> Though the previous discussion stalled,
> I still hope that we could agree on basic points such as a map file
> format and protocol extension,
> which is necessary to start implementing the feature.
>

It's great that you too came up with a PoC patch. I didn't look at your
changes in much detail, but we at EnterpriseDB are also working on this
feature and have started implementing it.

Attached is the series of patches I have so far (which need further
optimization and adjustment, though).

Here is the overall design (as proposed by Robert) we are trying to
implement:

1. Extend the BASE_BACKUP command that can be used with replication
connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
the option added to the server in #1.

Here are the implementation details for when we have a valid LSN:

sendFile() in basebackup.c is the function which does most of the work for
us. If the filename looks like a relation file, then we'll need to consider
sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have
an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file
just as if this were a full backup. This 90% is a constant now; we might
make it a GUC later.

D. Otherwise, send a file with .partial added to the name. The .partial
file contains an indication of which blocks were changed at the beginning,
followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks
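
For illustration, that header layout could be expressed as something like
the following (the struct and field names are illustrative only, not taken
from the attached patches):

/* Illustrative layout of the .partial file header described above. */
typedef struct PartialFileHeader
{
	uint32		magic;			/* 4-byte magic number */
	uint32		checksum;		/* 4-byte CRC covering the header */
	uint32		nblocks;		/* number of data blocks included */
	/* followed by nblocks 4-byte block numbers, then the data blocks */
} PartialFileHeader;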


We are also working on combining these incremental backups with the full
backup, and for that we are planning to add a new utility called
pg_combinebackup. Will post the details on that later once we are on the
same page about taking the backup.

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


IncrBackup.tar.gz
Description: application/gzip


Re: base backup client as auxiliary backend process

2019-07-11 Thread Sergei Kornilov
Hello

>>  Attached is a very hackish patch to implement this. It works like this:
>>
>>  # (assuming you have a primary already running somewhere)
>>  initdb -D data2 --minimal
>>  $EDITOR data2/postgresql.conf # set primary_conninfo
>>  pg_ctl -D data2 start
>
> +1, very nice. How about --replica?

+1

Also not works with -DEXEC_BACKEND for me.

> There is also the question what you do
> if the base backup fails halfway through. Currently you probably need
> to delete the whole data directory and start again with initdb. Better
> might be a way to start again and overwrite any existing files, but that
> can clearly also be dangerous.

I think needing to delete the directory and rerun initdb is better than
overwriting files.

- we need to check the major version. Basebackup can work with different
versions, but it would be useless to copy a cluster which we cannot run
- basebackup silently overwrites configs (pg_hba.conf, postgresql.conf,
postgresql.auto.conf) in $PGDATA. This is ok for pg_basebackup but not for a
backend process
- I think we need to start walreceiver. At best, without interruption during
startup replay (if possible)

> XXX Is there a use for
>* switching into (non-standby) recovery here?

I think not.

regards, Sergei




Re: progress report for ANALYZE

2019-07-11 Thread Kyotaro Horiguchi
Hello.

At Tue, 9 Jul 2019 17:38:44 +0900, Tatsuro Yamada 
 wrote in 
<244cb241-168b-d6a9-c45f-a80c34cdc...@nttcom.co.jp>
> Hi Alvaro, Anthony, Julien and Robert,
> 
> On 2019/07/09 3:47, Julien Rouhaud wrote:
> > On Mon, Jul 8, 2019 at 8:44 PM Robert Haas 
> > wrote:
> >>
> >> On Mon, Jul 8, 2019 at 2:18 PM Alvaro Herrera
> >>  wrote:
> >>> Yeah, I got the impression that that was determined to be the
> >>> desirable
> >>> behavior, so I made it do that, but I'm not really happy about it
> >>> either.  We're not too late to change the CREATE INDEX behavior, but
> >>> let's discuss what is it that we want.
> >>
> >> I don't think I intended to make any such determination -- which
> >> commit do you think established this as the canonical behavior?
> >>
> >> I propose that once a field is set, we should leave it set until the
> >> end.
> > +1
> > Note that this patch is already behaving like that if the table only
> > contains dead rows.

+1 from me.

> I fixed the patch including:
> 
>   - Replace "if" to "else if". (Suggested by Julien)
>   - Fix typo s/ech/each/. (Suggested by Anthony)
>   - Add Phase "analyzing complete" in the pgstat view. (Suggested by
>   - Julien, Robert and me)
> It was overlooked to add it in system_views.sql.
> 
> I share my re-test result, see below:
> 
> -
> [Session #1]
> create table hoge as select * from generate_series(1, 100) a;
> analyze verbose hoge;
> 
> [Session #2]
> \a \t
> select * from pg_stat_progress_analyze; \watch 0.001
> 
> 3785|13599|postgres|16384|f|16384|scanning table|4425|6
> 3785|13599|postgres|16384|f|16384|scanning table|4425|31
> 3785|13599|postgres|16384|f|16384|scanning table|4425|70
> 3785|13599|postgres|16384|f|16384|scanning table|4425|109
> ...
> 3785|13599|postgres|16384|f|16384|scanning table|4425|4425
> 3785|13599|postgres|16384|f|16384|scanning table|4425|4425
> 3785|13599|postgres|16384|f|16384|scanning table|4425|4425
> 3785|13599|postgres|16384|f|16384|analyzing sample|0|0
> 3785|13599|postgres|16384|f|16384|analyzing complete|0|0 <-- Added and
> fixed. :)
> -

I have some comments.

+   const int   index[] = {
+   PROGRESS_ANALYZE_PHASE,
+   PROGRESS_ANALYZE_TOTAL_BLOCKS,
+   PROGRESS_ANALYZE_BLOCKS_DONE
+   };
+   const int64 val[] = {
+   PROGRESS_ANALYZE_PHASE_ANALYSIS,
+   0, 0

Do you object to leaving the total/done blocks alone here? :p
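
In other words, just flip the phase and leave the counters alone; a minimal
sketch of that alternative, using the same constants as in the hunk quoted
above, would be:

    pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
                                 PROGRESS_ANALYZE_PHASE_ANALYSIS);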



+  BlockNumber  nblocks;
+  double       blksdone = 0;

Why is it a double even though blksdone is of the same domain
as nblocks? And finally:

+pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+   ++blksdone);

It is converted into int64.



+  WHEN 2 THEN 'analyzing sample'
+  WHEN 3 THEN 'analyzing sample (extended stats)'

I think we should avoid parenthesized phrases unless they are
necessary. They make the column unnecessarily long. The
phase is internally called "compute stats", so *I* would prefer
something like the following:

+  WHEN 2 THEN 'computing stats'
+  WHEN 3 THEN 'computing extended stats'



+  WHEN 4 THEN 'analyzing complete'

And if what you intend by this is (as mentioned in the
documentation) "working to complete this analyze", this would be:

+  WHEN 4 THEN 'completing analyze'
+  WHEN 4 THEN 'finalizing analyze'


+ Process ID of backend.

of "the" backend. ? And period is not attached on all
descriptions consisting of a single sentence.

+  OID of the database to which this backend is connected.
+  Name of the database to which this backend is connected.

"database to which .. is connected" is phrased as "database this
backend is connected to" in pg_stat_acttivity. (Just for consistency)


+  Whether the current scan includes legacy inheritance 
children.

This apparently excludes partition tables, but actually they are
included.

  "Whether scanning through child tables" ?

I'm not sure "child tables" is established as the word to mean
both "partition tables and inheritance children"..


+   The table being scanned (differs from relid
+   only when processing partitions or inheritance children).

Is  needed? And I think the parentheses are not needed.

  OID of the table currently being scanned. Can differ from relid
  when analyzing tables that have child tables.


+   Total number of heap blocks to scan in the current table.

   Number of heap blocks on scanning_table to scan?

It might be better for this description to mention that this
and the next column are meaningful only during the
"scanning table" phase.


+   The command is currently scanning the table.

"sample(s)" comes somewhat abruptly in the next item. 

Re: [HACKERS] WIP: Aggregation push-down

2019-07-11 Thread Richard Guo
On Tue, Jul 9, 2019 at 9:47 PM Antonin Houska  wrote:

> Richard Guo  wrote:
>
> > Another rebase is needed for the patches.
>
> Done.
>


I didn't fully follow the whole thread and mainly looked into the latest
patch set. So what are the considerations for abandoning the aggmultifn
concept? In my opinion, aggmultifn would enable us to do a lot more
types of transformation. For example, consider the query below:

select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, bar.a;

With the latest patch, the plan looks like:

Finalize HashAggregate          <-- sum(psum)
  Group Key: foo.a, bar.a
  ->  Hash Join
        Hash Cond: (bar.b = foo.b)
        ->  Seq Scan on bar
        ->  Hash
              ->  Partial HashAggregate          <-- sum(foo.c) as psum
                    Group Key: foo.a, foo.b
                    ->  Seq Scan on foo


If we have aggmultifn, we can perform the query this way:

Finalize HashAggregate          <-- sum(foo.c)*cnt
  Group Key: foo.a, bar.a
  ->  Hash Join
        Hash Cond: (foo.b = bar.b)
        ->  Seq Scan on foo
        ->  Hash
              ->  Partial HashAggregate          <-- count(*) as cnt
                    Group Key: bar.a, bar.b
                    ->  Seq Scan on bar


And this way:

Finalize HashAggregate          <-- sum(psum)*cnt
  Group Key: foo.a, bar.a
  ->  Hash Join
        Hash Cond: (foo.b = bar.b)
        ->  Partial HashAggregate          <-- sum(foo.c) as psum
              Group Key: foo.a, foo.b
              ->  Seq Scan on foo
        ->  Hash
              ->  Partial HashAggregate          <-- count(*) as cnt
                    Group Key: bar.a, bar.b
                    ->  Seq Scan on bar


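To make the aggmultifn idea concrete, a hypothetical aggmultifn for an int8
partial sum could be as simple as multiplying it by the match count, as in
the sum(psum)*cnt step above. The function below is made up purely for
illustration; nothing like it exists in core or in the patch set:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(int8_sum_mult);

/*
 * Hypothetical aggmultifn: scale a partially aggregated sum by the number
 * of rows it gets matched against on the other side of the join.
 */
Datum
int8_sum_mult(PG_FUNCTION_ARGS)
{
    int64       psum = PG_GETARG_INT64(0);
    int64       cnt = PG_GETARG_INT64(1);

    PG_RETURN_INT64(psum * cnt);
}
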
Another question I have is about add_grouped_path(), when creating a
sorted aggregation path on top of a subpath. If the subpath is not sorted,
then the sorted aggregation path is not generated. Why don't we, in this
case, create a sort path on top of the subpath first and then create the
group aggregation path on top of the sort path?


Core dump when running one query in agg_pushdown.sql

EXPLAIN ANALYZE
SELECT p.x, avg(c1.v) FROM agg_pushdown_parent AS p JOIN agg_pushdown_child1
AS c1 ON c1.parent = p.i GROUP BY p.i;


#0  0x006def98 in CheckVarSlotCompatibility (slot=0x0, attnum=1,
vartype=23) at execExprInterp.c:1850
#1  0x006def5d in CheckExprStillValid (state=0x2b63a28,
econtext=0x2ba4958) at execExprInterp.c:1814
#2  0x006dee38 in ExecInterpExprStillValid (state=0x2b63a28,
econtext=0x2ba4958, isNull=0x7fff7cd16a37) at execExprInterp.c:1763
#3  0x007144dd in ExecEvalExpr (state=0x2b63a28,
econtext=0x2ba4958, isNull=0x7fff7cd16a37)
at ../../../src/include/executor/executor.h:288
#4  0x00715475 in ExecIndexEvalRuntimeKeys (econtext=0x2ba4958,
runtimeKeys=0x2b63910, numRuntimeKeys=1) at nodeIndexscan.c:630
#5  0x0071533b in ExecReScanIndexScan (node=0x2b62bf8) at
nodeIndexscan.c:568
#6  0x006d4ce6 in ExecReScan (node=0x2b62bf8) at execAmi.c:182
#7  0x007152a0 in ExecIndexScan (pstate=0x2b62bf8) at
nodeIndexscan.c:530


This is really a cool feature. Thank you for working on this.

Thanks
Richard


Re: Index Skip Scan

2019-07-11 Thread David Rowley
On Thu, 11 Jul 2019 at 19:41, Floris Van Nee  wrote:
> SELECT DISTINCT ON (a) a,b,c FROM a WHERE c = 2 (with an index on a,b,c)
> Data (imagine every tuple here actually occurs 10.000 times in the index to 
> see the benefit of skipping):
> 1,1,1
> 1,1,2
> 1,2,2
> 1,2,3
> 2,2,1
> 2,2,3
> 3,1,1
> 3,1,2
> 3,2,2
> 3,2,3
>
> Creating a cursor on this query and then moving forward, you should get 
> (1,1,2), (3,1,2). In the current implementation of the patch, after bt_first, 
> it skips over (1,1,2) to (2,2,1). It checks quals and moves forward 
> one-by-one until it finds a match. This match only comes at (3,1,2) however. 
> Then it skips to the end.
>
> If you move the cursor backwards from the end of the cursor, you should still 
> get (3,1,2) (1,1,2). A possible implementation would start at the end and do 
> a skip to the beginning of the prefix: (3,1,1). Then it needs to move forward 
> one-by-one in order to find the first matching (minimum) item (3,1,2). When 
> it finds it, it needs to skip backwards to the beginning of prefix 2 (2,2,1). 
> It needs to move forwards to find the minimum element, but should stop as 
> soon as it detects that the prefix doesn't match anymore (because there is no 
> match for prefix 2, it will move all the way from (2,2,1) to (3,1,1)). It 
> then needs to skip backwards again to the start of prefix 1: (1,1,1) and scan 
> forward to find (1,1,2).
> Perhaps anyone can think of an easier way to implement it?

One option is just to not implement it and change
ExecSupportsBackwardScan() so that it returns false for skip index
scans, or perhaps at least implement an index AM method to allow the
planner to determine whether the index implementation supports
it... amcanskipbackward

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Feature: Add Greek language fulltext search

2019-07-11 Thread Adrien Nayrat
On 7/4/19 1:39 PM, Peter Eisentraut wrote:
> On 2019-03-25 12:04, Panagiotis Mavrogiorgos wrote:
>> Last November snowball added support for Greek language [1]. Following
>> the instructions [2], I wrote a patch that adds fulltext search for
>> Greek in Postgres. The patch is attached. 
> 
> I have committed a full sync from the upstream snowball repository,
> which pulled in the new greek stemmer.
> 
> Could you please clarify where you got the stopword list from?  The
> README says those need to be downloaded separately, but I wasn't able to
> find the download location.  It would be good to document this, for
> example in the commit message.  I haven't committed the stopword list yet.
> 

Thanks. I noticed snowball pushed a new commit related to the Greek stemmer a
few days after your sync:
https://github.com/snowballstem/snowball/commit/533602101f963eeb0c38343d94c428ceef740c0c

As there seems to be no stable release policy for Snowball, I don't know what
the best way to keep in sync is :(




signature.asc
Description: OpenPGP digital signature


Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-11 Thread Julien Rouhaud
On Thu, Jul 11, 2019 at 6:04 AM Michael Paquier  wrote:
>
> On Wed, Jul 10, 2019 at 09:44:14PM +0200, Julien Rouhaud wrote:
> > On Wed, Jul 10, 2019 at 4:15 PM Alvaro Herrera  
> > wrote:
> >> Looking good!
> >
> > Thanks!
>
> Confirmed.  The last set is much easier to go through.
>
> >>  I'm not sure about the "Consume" word in ConsumeIdleSlot;
> >> maybe "Reserve"? "Obtain"? "Get"?
> >
> > Yes, Consume is maybe a little bit weird.  I wanted to make it clear
> > that this function is actually removing a slot from the free list,
> > especially since there's a (historical) get_idle_slot().  I like
> > Reserve, but Obtain and Get are probably too ambiguous.
>
> The refactoring patch is getting in shape.  Now reindex_one_database()
> is the only place setting and manipulating the slots.  I am wondering
> if we should have a wrapper which disconnects all the slots (doing
> conn = NULL after the disconnectDatabase() call does not matter).

You already mentioned that in a previous mail.  I was afraid it'd be
overkill, but it'll make caller code easier, so let's do it.

> Get* would be my choice, because we look at the set of parallel slots,
> and get an idle one.

That's what the former GetIdleSlot (that I renamed to get_idle_slot as
it's not static) is doing.  ConsumeIdleSlot() actually gets an idle
slot and marks it as being used.  That's probably some leakage of the
internal implementation, but having a GetIdleParallelSlot (or
ParallelSlotGetIdle) *and* a get_idle_slot sounds like a bad idea, and
I don't have a better idea on how to rename get_idle_slot.  If you
really prefer Get* and you're fine with a static get_idle_slot, that's
fine by me.

>  It would be nice to have more consistency in the
> names for the routines, say:
> - ParallelSlotInit() instead of SetupParallelSlots (still my
> suggestion is not perfect either as that sounds like one single slot,
> but we have a set of these).
> - ParallelSlotGetIdle() instead of ConsumeIdleSlot().  Still that's
> more a wait-then-get behavior.
> - ParallelSlotWaitCompletion() instead of WaitForSlotsCompletion()
> - ParallelSlotDisconnect, as a wrapper for the calls to
> DisconnectDatabase().

I don't have an opinion on whether to use parallel slot as prefix or
postfix, so I'm fine with postfixing.
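
For the disconnect wrapper, I was thinking of something as simple as this
(just a sketch; it assumes the ParallelSlot "connection" field and the
DisconnectDatabase(slot) helper that vacuumdb already has, and the name is
of course up for debate):

static void
ParallelSlotsDisconnect(ParallelSlot *slots, int numslots)
{
    int         i;

    for (i = 0; i < numslots; i++)
    {
        if (slots[i].connection != NULL)
            DisconnectDatabase(&slots[i]);
    }
}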

> +/*
> + * Database-wide parallel reindex requires special processing.  If
> + * multiple jobs were asked, we have to reindex system catalogs first,
> + * as they can't be processed in parallel.
> + */
> +   if (process_type == REINDEX_DATABASE)
>
> In patch 0002, a parallel database REINDEX first processes the catalog
> relations in a serializable fashion, and then all the other relations
> in parallel, which is right.  Could we have schema-level reindexes also
> process things in parallel with all the relations from all the schemas
> listed?  This would be profitable in particular for callers listing
> multiple schemas with an unbalanced number of tables in each

It would also be beneficial for a parallel reindex of a single schema.

> and we'd
> need to be careful of the same where pg_catalog is listed.  Actually,
> your patch breaks if we do a parallel run with pg_catalog and another
> schema, no?

It can definitely cause problems if you ask for pg_catalog and another
schema, the same as if you ask for a parallel reindex of some catalog
tables (possibly with other tables).  There's a --system switch for that
need, so I don't know whether documenting the limitation, to avoid some
extra code to deal with it, is a good enough solution?




Re: Detailed questions about pg_xact_commit_timestamp

2019-07-11 Thread Adrien Nayrat
Hi,

On 7/9/19 12:22 AM, Morris de Oryx wrote:
> I have some specific questions about pg_xact_commit_timestamp, and am hoping
> that this is the right place to ask. I read a lot of the commentary about the
> original patch, and the contributors seem to be here. If I'm asking in the 
> wrong
> place, just let me know.
> 
> I'm working on a design for a concurrency-safe incremental aggregate rollup
> system,and pg_xact_commit_timestamp sounds perfect. But I've found very little
> commentary on it generally, and couldn't figure out how it works in detail 
> from
> the source code.
> 
> Hopefully, someone knows the answers to a few questions:
> 
> * Is it possible for pg_xact_commit_timestamp to produce times out of order?
> What I'm after is a way to identify records that have been chagned since a
> specific time so that I can get any later changes for processing. I don't need
> them in commit order, so overlapping timestamps aren't a problem. 

I think yes. For example, you can have a session "A" xid 34386826 that commits
after session "B" xid 34386827:
postgres=# select pg_xact_commit_timestamp('34386827'::xid);
   pg_xact_commit_timestamp
-------------------------------
 2019-07-11 09:32:29.806183+00
(1 row)

postgres=# select pg_xact_commit_timestamp('34386826'::xid);
   pg_xact_commit_timestamp
-------------------------------
 2019-07-11 09:32:38.99444+00
(1 row)


> 
> * How many bytes are added to each row in the final implementation? The
> discussions I saw seemed to be ranging from 12-24 bytes. There was discussion 
> of
> adding in extra bytes for "just in case." This is pre 9.5, so a world ago.

src/backend/access/transam/commit_ts.c says 8+4 bytes per xact.

Note it is not per row but per xact: we only have to store one timestamp per
xid.
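
As a quick back-of-the-envelope check, using the 12 bytes per transaction
mentioned above, the default and maximum autovacuum_freeze_max_age horizons
work out to sizes consistent with the documentation excerpt quoted further
down:

#include <stdio.h>

int
main(void)
{
    const double bytes_per_xact = 12.0;     /* 8 + 4, as quoted above */
    const double horizons[] = {200000000.0, 2000000000.0};

    for (int i = 0; i < 2; i++)
        printf("freeze_max_age %.0f -> ~%.1f GB of pg_commit_ts\n",
               horizons[i],
               horizons[i] * bytes_per_xact / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

which prints about 2.2 GB for the default of 200 million transactions and
about 22.4 GB for the maximum of two billion.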

> 
> * Are the timestamps indexed internally? With a B-tree? I ask for
> capacity-planning reasons.

I think no.

> 
> * I've seen on StackOverflow and the design discussions that the timestamps 
> are
> not kept indefinitely, but can't find the details on exactly how long they are
> stored.
> 

Yes, timestamps are stored in the pg_commit_ts directory. Old timestamps are
removed after freeze, as explained in
https://www.postgresql.org/docs/current/routine-vacuuming.html :

> The sole disadvantage of increasing autovacuum_freeze_max_age (and
vacuum_freeze_table_age along with it) is that the pg_xact and pg_commit_ts
subdirectories of the database cluster will take more space, because it must
store the commit status and (if track_commit_timestamp is enabled) timestamp of
all transactions back to the autovacuum_freeze_max_age horizon. The commit
status uses two bits per transaction, so if autovacuum_freeze_max_age is set to
its maximum allowed value of two billion, pg_xact can be expected to grow to
about half a gigabyte and pg_commit_ts to about 20GB. If this is trivial
compared to your total database size, setting autovacuum_freeze_max_age to its
maximum allowed value is recommended. Otherwise, set it depending on what you
are willing to allow for pg_xact and pg_commit_ts storage. (The default, 200
million transactions, translates to about 50MB of pg_xact storage and about 2GB
of pg_commit_ts storage.)

> * Any rules of thumb on the performance impact of enabling
> pg_xact_commit_timestamp? I don't need the data on all tables but, where I do,
> it sounds like it might work perfectly.
> 
> Many thanks for any assistance!

I didn't notice any performance impact, but I didn't do any extensive testing.



Regards,





signature.asc
Description: OpenPGP digital signature


Re: [proposal] de-TOAST'ing using a iterator

2019-07-11 Thread Binguo Bao
I have set the local build configuration to be the same as on the CI. This
patch should be correct.

Best regards,
Binguo Bao

Binguo Bao  于2019年7月11日周四 上午12:39写道:

> This is the patch that fix warnings.
>
> Best Regards,
> Binguo Bao
>
> Binguo Bao  于2019年7月10日周三 下午10:18写道:
>
>> Hi Thomas,
>> I've fixed the warnings.
>>
>> Thomas Munro  于2019年7月5日周五 下午12:21写道:
>>
>>> On Thu, Jun 20, 2019 at 1:51 AM Binguo Bao  wrote:
>>> > Hi hackers!
>>> > This proposal aims to provide the ability to de-TOAST a fully TOAST'd
>>> and compressed field using an iterator and then update the appropriate
>>> parts of the code to use the iterator where possible instead of
>>> de-TOAST'ing and de-compressing the entire value. Examples where this can
>>> be helpful include using position() from the beginning of the value, or
>>> doing a pattern or substring match.
>>> >
>>> > de-TOAST iterator overview:
>>> > 1. The caller requests the slice of the attribute value from the
>>> de-TOAST iterator.
>>> > 2. The de-TOAST iterator checks if there is a slice available in the
>>> output buffer, if there is, return the result directly,
>>> > otherwise go to step 3.
>>> > 3. The de-TOAST iterator checks if there is the slice available in the
>>> input buffer, if there is, go to step 4. Otherwise,
>>> > call fetch_datum_iterator to fetch datums from disk to input
>>> buffer.
>>> > 4. If the data in the input buffer is compressed, extract some data
>>> from the input buffer to the output buffer until the caller's
>>> > needs are met.
>>> >
>>> > I've implemented the prototype and apply it to the position() function
>>> to test performance.
>>>
>>> Hi Binguo,
>>>
>>> Interesting work, and nice performance improvements so far.  Just by
>>> the way, the patch currently generates warnings:
>>>
>>> https://travis-ci.org/postgresql-cfbot/postgresql/builds/554345719
>>>
>>> --
>>> Thomas Munro
>>> https://enterprisedb.com
>>>
>>
From 556e7945ee6e9c6ceba723a2fc5c5f1d60a9f412 Mon Sep 17 00:00:00 2001
From: BBG 
Date: Tue, 4 Jun 2019 22:56:42 +0800
Subject: [PATCH] de-TOASTing using a iterator

---
 src/backend/access/heap/tuptoaster.c | 491 +++
 src/backend/utils/adt/varlena.c  |  49 ++--
 src/include/access/tuptoaster.h  |  92 +++
 3 files changed, 616 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55d6e91..8f7faf6 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -83,6 +83,13 @@ static int	toast_open_indexes(Relation toastrel,
 static void toast_close_indexes(Relation *toastidxs, int num_indexes,
 LOCKMODE lock);
 static void init_toast_snapshot(Snapshot toast_snapshot);
+static FetchDatumIterator create_fetch_datum_iterator(struct varlena *attr);
+static bool free_fetch_datum_iterator(FetchDatumIterator iter);
+static int32 fetch_datum_iterate(FetchDatumIterator iter);
+static void init_toast_buffer(ToastBuffer *buf, int size, bool compressed);
+static bool free_toast_buffer(ToastBuffer *buf);
+static int32 pglz_decompress_iterate(ToastBuffer *source, ToastBuffer *dest,
+	 DetoastIterator iter, int32 length);
 
 
 /* --
@@ -347,6 +354,145 @@ heap_tuple_untoast_attr_slice(struct varlena *attr,
 
 
 /* --
+ * create_detoast_iterator -
+ *
+ * Initialize detoast iterator.
+ * --
+ */
+DetoastIterator create_detoast_iterator(struct varlena *attr) {
+	struct varatt_external toast_pointer;
+	DetoastIterator iterator = NULL;
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
+	{
+		/*
+		 * This is an externally stored datum --- create fetch datum iterator
+		 */
+		iterator = (DetoastIterator) palloc0(sizeof(DetoastIteratorData));
+		iterator->fetch_datum_iterator = create_fetch_datum_iterator(attr);
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+		{
+			/* If it's compressed, prepare buffer for raw data */
+			iterator->buf = (ToastBuffer *) palloc0(sizeof(ToastBuffer));
+			init_toast_buffer(iterator->buf, toast_pointer.va_rawsize, false);
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = true;
+			iterator->done = false;
+		}
+		else
+		{
+			iterator->buf = iterator->fetch_datum_iterator->buf;
+			iterator->source = NULL;
+			iterator->ctrlc = 0;
+			iterator->compressed = false;
+			iterator->done = false;
+		}
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		/*
+		 * This is an indirect pointer --- dereference it
+		 */
+		struct varatt_indirect redirect;
+
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *) redirect.pointer;
+
+		/* nested indirect Datums aren't allowed */
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		/* recurse in case value is still extended in some other way */
+		iterator = create_detoast_iterator(attr);
+
+	}
+	else if (VARATT_IS_COMPRESSED(attr))
+	{
+		/*
+		 * This is a compressed 

Re: pg_checksums (or checksums in general) vs tableam

2019-07-11 Thread Magnus Hagander
On Thu, Jul 11, 2019 at 2:30 AM Michael Paquier  wrote:

> On Wed, Jul 10, 2019 at 09:19:03AM -0700, Andres Freund wrote:
> > On July 10, 2019 9:12:18 AM PDT, Magnus Hagander 
> wrote:
> >> That would be fine, if we actually knew. Should we (or have we already?)
> >> defined a rule that they are not allowed to use the same naming standard
> >> unless they have the same type of header?
> >
> > No, don't think we have already.  There's the related problem of
> > what to include in base backups, too.
>
> Yes.  This one needs a careful design and I am not sure exactly what
> that would be.  At least one new callback would be needed, called from
> basebackup.c to decide if a given file should be backed up or not
> based on a path.


That wouldn't be at all enough, of course. We have to think of everybody
who uses the pg_start_backup/pg_stop_backup functions (including the
deprecated versions we don't want to get rid of :P). So whatever it is, it
has to be externally reachable. And just calling something before you start
your backup won't be enough, as there can be files showing up during the
backup etc.

Having a strict naming standard would help a lot with that; then you'd just
need the metadata. For example, one could say that each non-default storage
engine has to put all their files in a subdirectory, and inside that
subdirectory they can name them whatever they want. If we do that, then all
a backup tool would need to know about is all the possible subdirectories
in the current installation (and *that* doesn't change frequently).



>   But then how do you make sure that a path applies to
> one table AM or another, by using a regex given by all table AMs to
> see if there is a match?  How do we handle conflicts?  I am not sure
> either that it is a good design to restrict table AMs to have a given
> format for paths as that actually limits the possibilities when it
> comes to split across data across multiple files for attributes and/or
> tablespaces.  (I am a pessimistic guy by nature.)
>

As long as the restriction contains enough wildcards, it should hopefully
be enough :) E.g. data/base/1234/zheap/whatever.they.like.
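
To make that concrete, here is a toy sketch of what such a convention would
buy a backup tool: with a fixed per-AM subdirectory rule, classifying a file
only needs the path and a small list of AM directory names ("zheap" below is
just an example; no such rule exists today):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static const char *known_am_subdirs[] = {"zheap", NULL};

/* Return true if the path lives under a known non-default AM directory. */
static bool
path_is_nondefault_am(const char *path)
{
    for (int i = 0; known_am_subdirs[i] != NULL; i++)
    {
        char        needle[64];

        snprintf(needle, sizeof(needle), "/%s/", known_am_subdirs[i]);
        if (strstr(path, needle) != NULL)
            return true;
    }
    return false;
}

int
main(void)
{
    printf("%d\n", path_is_nondefault_am("base/1234/zheap/16384"));
    printf("%d\n", path_is_nondefault_am("base/1234/16384"));
    return 0;
}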

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_stat_statements vs. SELECT FOR UPDATE

2019-07-11 Thread Sergei Kornilov
Hello

The patch still applies cleanly on HEAD and passes check-world. I think ignoring
the FOR UPDATE clause is incorrect behavior.

PS: my note about comments in tests from my previous review still applies, too.

regards, Sergei




Re: Ltree syntax improvement

2019-07-11 Thread Thomas Munro
On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky  wrote:
> [ltree_20190709.diff]

Hi Dmitry,

You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
otherwise check-world fails when built with Python support.  The good
news is that it looks like it fails because you fixed something!
(Though I didn't check the details).

 CREATE FUNCTION test2() RETURNS ltree
 LANGUAGE plpythonu
 TRANSFORM FOR TYPE ltree
 AS $$
 return ['foo', 'bar', 'baz']
 $$;
 -- plpython to ltree is not yet implemented, so this will fail,
 -- because it will try to parse the Python list as an ltree input
 -- string.
 SELECT test2();
-ERROR:  syntax error at position 0
-CONTEXT:  while creating return value
-PL/Python function "test2"
+  test2
+-
+ "['foo', 'bar', 'baz']"
+(1 row)
+

-- 
Thomas Munro
https://enterprisedb.com



