Re: undefined symbol: PQgssEncInUse

2019-05-29 Thread Donald Dong
On May 29, 2019, at 8:23 PM, Paul Guo  wrote:
> Have you used the correct libpq library? If yes, you might want to check the
> build logs and related files to see what is wrong. In my environment, it's
> OK with gssapi either enabled or disabled.

Thank you! Resetting libpq's path fixes it.

Regards,
Donald Dong



Re: Dead stores in src/common/sha2.c

2019-05-29 Thread Heikki Linnakangas

On 29/05/2019 18:47, Hamlin, Garick L wrote:

On Wed, May 29, 2019 at 11:01:05AM -0400, Tom Lane wrote:

At the same time, I'm not sure if we should just write this off as an
ignorable warning.  If the C compiler concludes these are dead stores
it'll probably optimize them away, leading to not accomplishing the
goal of wiping the values.


Yeah, I mean it's odd to put code in to zero/hide state knowing it's
probably optimized out.

We could also take it out, but maybe it does help somewhere?

... or put in a comment that says: This probably gets optimized away, but
we don't consider it much of a risk.


There's a function called explicit_bzero() in glibc, for this purpose. 
See 
https://www.gnu.org/software/libc/manual/html_node/Erasing-Sensitive-Data.html. 
It's not totally portable, but it's also available in some BSDs, at 
least. That documentation mentions the possibility that it might force 
variables to be stored in memory that would've otherwise been kept only 
in registers, but says that in most situations it's nevertheless better 
to use explicit_bzero() than not. So I guess we should use that, if it's 
available.
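
For platforms that lack explicit_bzero(), we would presumably need a fallback
in src/port. As a minimal sketch of one common idiom (just an illustration of
the technique, not a concrete proposal for what we'd commit), calling memset()
through a volatile function pointer keeps the compiler from proving the store
dead and discarding it:

    #include <string.h>

    /*
     * The volatile pointer must be re-read at the call site, so the call
     * cannot be optimized away even if the buffer is dead afterwards.
     */
    static void *(*const volatile memset_vol) (void *, int, size_t) = memset;

    static void
    fallback_explicit_bzero(void *buf, size_t len)
    {
        memset_vol(buf, 0, len);
    }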


- Heikki




Re: undefined symbol: PQgssEncInUse

2019-05-29 Thread Paul Guo
Have you used the correct libpq library? If yes, you might want to check
the build logs and related files to see what is wrong. In my environment,
it's OK with gssapi either enabled or disabled.

On Thu, May 30, 2019 at 9:22 AM Donald Dong  wrote:

> Hi,
>
> After I make temp-install on HEAD with a clean build, I fail to start
> psql (tmp_install/usr/local/pgsql/bin/psql) and get an error message:
>
> ./psql: symbol lookup error: ./psql: undefined symbol: PQgssEncInUse
>
> However, make check and other tests still work. For me, it was fine
> until commit b0b39f72b9904bcb80f97b35837ccff1578aa4b8. I wonder if
> this occurs only for me?
>
> Thank you,
> Donald Dong
>
>
>


undefined symbol: PQgssEncInUse

2019-05-29 Thread Donald Dong
Hi,

After I make temp-install on HEAD with a clean build, I fail to start
psql (tmp_install/usr/local/pgsql/bin/psql) and get an error message:

./psql: symbol lookup error: ./psql: undefined symbol: PQgssEncInUse

However, make check and other tests still work. For me, it was fine
until commit b0b39f72b9904bcb80f97b35837ccff1578aa4b8. I wonder if
this occurs only for me?

Thank you,
Donald Dong




Print baserestrictinfo for varchar fields

2019-05-29 Thread Donald Dong
Hi,

I noticed that debug_print_rel outputs "unknown expr" when the fields
in baserestrictinfo are typed as varchar.

create table tbl_a(id int, info varchar(32));

RELOPTINFO (tbl_a): rows=4 width=86
baserestrictinfo: unknown expr = pattern

My approach is to handle the RelabelType case in print_expr. After
the patch, I get:

RELOPTINFO (tbl_a): rows=4 width=86
baserestrictinfo: tbl_a.info = pattern

I wonder if this is a proper way of fixing it?
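
For reference, the change I have in mind is roughly the following extra branch
in print_expr() (a sketch only; the actual change is in the attached patch):

    else if (IsA(expr, RelabelType))
    {
        /* Binary-compatible cast (e.g. varchar -> text): print the argument. */
        const RelabelType *r = (const RelabelType *) expr;

        print_expr((const Node *) r->arg, rtable);
    }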

Thank you,
Donald Dong



001_print_baserestrictinfo_varchar.patch
Description: Binary data


Re: coverage increase for worker_spi

2019-05-29 Thread Tom Lane
Alvaro Herrera  writes:
> Tom pointed out that coverage for worker_spi is 0%.  For a module that
> only exists to provide coverage, that's pretty stupid.  This patch
> increases coverage to 90.9% line-wise and 100% function-wise, which
> seems like a sufficient starting point.

> How would people feel about me getting this in master at this point in
> the cycle, it being just some test code?  We can easily revert if
> it seems too unstable.

I'm not opposed to adding a new test case at this point in the cycle,
but as written this one seems more or less guaranteed to fail under
load.  You can't just sleep for worker_spi.naptime and expect that
the worker will certainly have run.

Perhaps you could use a plpgsql DO block with a loop to wait up
to X seconds until the expected state appears, for X around 120
to 180 seconds (compare poll_query_until in the TAP tests).

regards, tom lane




coverage increase for worker_spi

2019-05-29 Thread Alvaro Herrera
Tom pointed out that coverage for worker_spi is 0%.  For a module that
only exists to provide coverage, that's pretty stupid.  This patch
increases coverage to 90.9% line-wise and 100% function-wise, which
seems like a sufficient starting point.

How would people feel about me getting this in master at this point in
the cycle, it being just some test code?  We can easily revert if
it seems too unstable.

-- 
Álvaro Herrera
diff --git a/src/test/modules/worker_spi/Makefile b/src/test/modules/worker_spi/Makefile
index 7cdb33c9df..cbf9b2e37f 100644
--- a/src/test/modules/worker_spi/Makefile
+++ b/src/test/modules/worker_spi/Makefile
@@ -6,6 +6,14 @@ EXTENSION = worker_spi
 DATA = worker_spi--1.0.sql
 PGFILEDESC = "worker_spi - background worker example"
 
+REGRESS = worker_spi
+
+# enable our module in shared_preload_libraries for dynamic bgworkers
+REGRESS_OPTS = --temp-config $(top_srcdir)/src/test/modules/worker_spi/dynamic.conf
+
+# Disable installcheck to ensure we cover dynamic bgworkers.
+NO_INSTALLCHECK = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/worker_spi/dynamic.conf b/src/test/modules/worker_spi/dynamic.conf
new file mode 100644
index 00..646885a9c7
--- /dev/null
+++ b/src/test/modules/worker_spi/dynamic.conf
@@ -0,0 +1,2 @@
+worker_spi.naptime = 1
+shared_preload_libraries = worker_spi
diff --git a/src/test/modules/worker_spi/expected/worker_spi.out b/src/test/modules/worker_spi/expected/worker_spi.out
new file mode 100644
index 00..28b2970d01
--- /dev/null
+++ b/src/test/modules/worker_spi/expected/worker_spi.out
@@ -0,0 +1,22 @@
+\c postgres
+CREATE EXTENSION worker_spi;
+SELECT worker_spi_launch(1) IS NOT NULL;
+ ?column? 
+--
+ t
+(1 row)
+
+INSERT INTO schema1.counted VALUES ('total', 0), ('delta', 1);
+SELECT pg_sleep(CASE WHEN count(*) = 0 THEN 0 ELSE current_setting('worker_spi.naptime')::int END)
+  FROM schema1.counted WHERE type = 'delta';
+ pg_sleep 
+--
+ 
+(1 row)
+
+SELECT * FROM schema1.counted;
+ type  | value 
+---+---
+ total | 1
+(1 row)
+
diff --git a/src/test/modules/worker_spi/sql/worker_spi.sql b/src/test/modules/worker_spi/sql/worker_spi.sql
new file mode 100644
index 00..ac454cff57
--- /dev/null
+++ b/src/test/modules/worker_spi/sql/worker_spi.sql
@@ -0,0 +1,7 @@
+\c postgres
+CREATE EXTENSION worker_spi;
+SELECT worker_spi_launch(1) IS NOT NULL;
+INSERT INTO schema1.counted VALUES ('total', 0), ('delta', 1);
+SELECT pg_sleep(CASE WHEN count(*) = 0 THEN 0 ELSE current_setting('worker_spi.naptime')::int END)
+  FROM schema1.counted WHERE type = 'delta';
+SELECT * FROM schema1.counted;
-- 
2.17.1



Re: incorrect xlog.c coverage report

2019-05-29 Thread Alvaro Herrera
On 2018-Nov-22, Michael Paquier wrote:

> On Thu, Nov 22, 2018 at 10:56:39AM +0900, Masahiko Sawada wrote:
> > On Thu, Nov 22, 2018 at 10:43 AM Thomas Munro 
> >  wrote:
> >> Presumably you could add your own call to __gcov_flush() in
> >> quickdie(), so that we get GCOV data but no other atexit()-like stuff.
> >> I see that some people advocate doing that in signal handlers, but I
> >> don't know if it's really safe.  If that is somehow magically OK,
> >> you'd probably also need the chdir() hack from proc_exit() to get
> >> per-pid files.
> > 
> > That's probably a good idea, I'm also not sure if it's really safe
> > though. An alternative approach could be that we can do $node->restart
> > after recovering from $node->teardown_node() to reliably write the gcda file,
> > although it would make the tests hard to read.
> 
> Thanks for looking at the details around that.  I'd much prefer if we
> had a solution like what's outlined here, because we should really try to
> have coverage even for code paths which involve an immediate shutdown
> (mainly for recovery).  Manipulating the tests to get a better coverage
> feels more like a band-aid solution, and does not help folks with custom
> TAP tests in their plugins.

I've just realized that we didn't do anything about this (see line 5380
in https://coverage.postgresql.org/src/backend/access/transam/xlog.c.gcov.html),
and we should.  I still prefer the solution I proposed (which is to
edit the test files to avoid immediate shutdown in certain places), but
I admit that adding __gcov_flush() to quickdie() seems to have gotten
more votes.

Are there objections to doing that now on the master branch?
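
For concreteness, the change would be roughly this, at the bottom of
quickdie() just before _exit(2) (a sketch only; the guard symbol here is made
up, and we would need to settle how an --enable-coverage build advertises
itself to the preprocessor):

    #ifdef COVERAGE_BUILD           /* hypothetical guard for coverage builds */
        {
            extern void __gcov_flush(void);     /* libgcov entry point */

            __gcov_flush();     /* write .gcda data despite skipping atexit() */
        }
    #endif
        _exit(2);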

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Why does SPI_connect change the memory context?

2019-05-29 Thread Tom Lane
Jeff Davis  writes:
> SPI_connect() changes the memory context to a newly-created one, and
> then SPI_finish() restores it. That seems a bit dangerous because the
> caller might not be expecting it. Is there a reason it doesn't just
> change to the new memory context as-needed?

Because the expectation is that palloc inside the SPI procedure will
allocate in a procedure-specific context.  If the caller isn't expecting
that, they haven't read the documentation, specifically

https://www.postgresql.org/docs/devel/spi-memory.html

which says

  
   SPI_connect creates a new memory context and
   makes it current.  SPI_finish restores the
   previous current memory context and destroys the context created by
   SPI_connect.  These actions ensure that
   transient memory allocations made inside your C function are
   reclaimed at C function exit, avoiding memory leakage.
  
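
In other words, the usual pattern in a C function is to remember the caller's
context and explicitly copy anything that must survive SPI_finish back into
it.  A minimal sketch (illustrative only; the function and variable names here
are made up):

    #include "postgres.h"
    #include "fmgr.h"
    #include "executor/spi.h"
    #include "utils/builtins.h"

    PG_FUNCTION_INFO_V1(my_func);

    Datum
    my_func(PG_FUNCTION_ARGS)
    {
        MemoryContext oldcxt = CurrentMemoryContext;
        char         *result;

        if (SPI_connect() != SPI_OK_CONNECT)
            elog(ERROR, "SPI_connect failed");

        /*
         * Any palloc() done from here on goes into the SPI procedure context
         * and is freed by SPI_finish(), so data that must outlive it gets
         * copied back into the caller's context explicitly.
         */
        result = MemoryContextStrdup(oldcxt, "some value");

        SPI_finish();           /* restores the previous memory context */

        PG_RETURN_TEXT_P(cstring_to_text(result));
    }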

regards, tom lane




Re: Rearranging ALTER TABLE to avoid multi-operations bugs

2019-05-29 Thread Tom Lane
Robert Haas  writes:
> From my point of view, the DDL code doesn't do a great job separating
> parsing/parse analysis from optimization/execution.  The ALTER TABLE
> stuff is actually pretty good in this regard.

Meh.  I think a pretty fair characterization of the bug(s) I'm trying to
fix is "we separated parse analysis from execution when we should not
have, because it leads to parse analysis being done against the wrong
database state".  So I'm *very* suspicious of any argument that we should
try to separate them more, let alone that doing so will somehow fix this
set of bugs.

>> Also, recursive ProcessUtility cases exist independently of this issue,
>> in particular in CreateSchemaCommand.  My worry about my patch upthread
>> is not really that it introduces another one, but that it doesn't do
>> anything towards providing a uniform framework/notation for all these
>> cases.

> I'm not really sure I understand this.

Well, I tried to wrap what are currently a random set of ProcessUtility
arguments into one struct to reduce the notational burden.  But as things
are set up, that's specific to the ALTER TABLE case.  I'm feeling like it
should not be, but I'm not very sure where to draw the line between
arguments that should be folded into the struct and ones that shouldn't.

Note that I think there are live bugs in here that are directly traceable
to not having tried to fold those arguments before.  Of the four existing
recursive ProcessUtility calls with context = PROCESS_UTILITY_SUBCOMMAND,
two pass down the outer call's "ParamListInfo params", and two don't ---
how is it not a bug that they don't all behave alike?  And none of the
four pass down the outer call's QueryEnvironment, which seems like even
more of a bug.  So it feels like we ought to have a uniform approach
to what gets passed down during recursion, and enforce it by passing
all such values in a struct rather than as independent arguments.

regards, tom lane




Re: [HACKERS] Runtime Partition Pruning

2019-05-29 Thread Robert Haas
On Wed, May 29, 2019 at 6:02 PM Tom Lane  wrote:
> Well, it *is* a problem.  The whole point of this discussion I think is
> to try to get better information "by default" for routine bug reports.
> So if those come from production servers without debug symbols, which
> I believe will be the usual case, then it seems likely to me that
> libunwind will produce no better results than glibc.  (But perhaps
> I'm wrong about that --- I have not experimented with libunwind.)

Sure, I agree.

> Now it's true that "install debug symbols" is less of an ask than
> "install debug symbols, *and* gdb, and make sure server core dumps are
> enabled, and then go through this arcane manual procedure next time
> you get a core dump".  But we shouldn't fool ourselves that it isn't
> an ask that's going to be hard for people with corporate policies
> against installing extra stuff on production servers.

There may be cases where that is true, but as you say, it's better
than what we have now.  Plus, what exactly is the alternative?  We
could:

- encourage packagers to install debug symbols by default (but they
might not; it might even be against policy), or
- invent our own system for generating backtraces and ignore what the
OS toolchain knows how to do (sounds painfully complex and expensive),
or
- just live with the fact that it's imperfect.

Is there a fourth option?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Rearranging ALTER TABLE to avoid multi-operations bugs

2019-05-29 Thread Robert Haas
On Wed, May 29, 2019 at 5:52 PM Tom Lane  wrote:
> Hm ... I'm not exactly clear on why that would be a superior solution.
> It would imply that standalone CREATE INDEX etc would call into the
> ALTER TABLE framework --- how is that not equally a layering violation?

Well, the framework could be renamed to something more general, I
suppose, but I don't see a *layering* concern.

From my point of view, the DDL code doesn't do a great job separating
parsing/parse analysis from optimization/execution.  The ALTER TABLE
stuff is actually pretty good in this regard.  But when you build
something that is basically a parse tree and pass it to some other
function that thinks that parse tree may well be coming straight from
the user, you are not doing a good job distinguishing between a
statement and an action which that statement may cause to be
performed.

> Also, recursive ProcessUtility cases exist independently of this issue,
> in particular in CreateSchemaCommand.  My worry about my patch upthread
> is not really that it introduces another one, but that it doesn't do
> anything towards providing a uniform framework/notation for all these
> cases.

I'm not really sure I understand this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Runtime Partition Pruning

2019-05-29 Thread Tom Lane
Robert Haas  writes:
> On Fri, May 24, 2019 at 12:09 PM Tom Lane  wrote:
>> Is it actually better?  The basic problem with backtrace() is that it
>> only knows about global functions, and so reports call sites in static
>> functions as if they were in whatever global function physically precedes
>> the static one.  I think doing materially better requires depending on
>> debug symbols, which (at least in the Red Hat world) aren't going to
>> be there in a typical production situation.

> I don't have an opinion on glibc vs. libunwind, but I don't understand
> this argument.  If you are unlucky enough to have a production server
> that is crashing due to some hitherto-unknown bug, and if it's not
> possible to get a good backtrace without installing debugging symbols,
> then you are going to have to pick between (1) installing those
> debugging symbols and (2) getting a poor backtrace.  I don't really
> see that as a problem so much as just the way life is.

Well, it *is* a problem.  The whole point of this discussion I think is
to try to get better information "by default" for routine bug reports.
So if those come from production servers without debug symbols, which
I believe will be the usual case, then it seems likely to me that
libunwind will produce no better results than glibc.  (But perhaps
I'm wrong about that --- I have not experimented with libunwind.)

Now it's true that "install debug symbols" is less of an ask than
"install debug symbols, *and* gdb, and make sure server core dumps are
enabled, and then go through this arcane manual procedure next time
you get a core dump".  But we shouldn't fool ourselves that it isn't
an ask that's going to be hard for people with corporate policies
against installing extra stuff on production servers.

regards, tom lane




Re: Rearranging ALTER TABLE to avoid multi-operations bugs

2019-05-29 Thread Tom Lane
Robert Haas  writes:
> On Sun, May 26, 2019 at 6:24 PM Tom Lane  wrote:
>> Anybody have thoughts about a different way to approach it?

> I mean, in an ideal world, I think we'd never call back out to
> ProcessUtility() from within AlterTable().  That seems like a pretty
> clear layering violation.  I assume the reason we've never tried to do
> better is a lack of round tuits and/or sufficient motivation.

> In terms of what we'd do instead, I suppose we'd try to move as much
> as possible inside the ALTER TABLE framework proper and have
> everything call into that.

Hm ... I'm not exactly clear on why that would be a superior solution.
It would imply that standalone CREATE INDEX etc would call into the
ALTER TABLE framework --- how is that not equally a layering violation?

Also, recursive ProcessUtility cases exist independently of this issue,
in particular in CreateSchemaCommand.  My worry about my patch upthread
is not really that it introduces another one, but that it doesn't do
anything towards providing a uniform framework/notation for all these
cases.

regards, tom lane




Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3

2019-05-29 Thread Robert Haas
On Tue, May 28, 2019 at 11:27 AM Antonin Houska  wrote:
> We thought that it's more convenient for the administrator to enter a password. To
> achieve that, we can instead tell the admin how to use the key derivation tool
> (pg_keysetup) and send its output to postgres. So yes, it's possible to always
> receive the key from the "encryption key command". I'll change this.

Sounds good.  I'm not quite sure how this is going to work.  Ideally
you'd like the encryption key command to fetch the key from something
like ssh-agent, or maybe pop up a window on the user's terminal with a
key prompt.  Just reading from stdin and writing to stdout is not
going to be very convenient...

> Maintenance of a long patch series requires additional effort and I was busy
> enough with major code rearrangements so far. I'll continue splitting the
> stuff into more diffs.

Right, I understand.  Whatever can be split off can be reviewed and
committed separately, which will start to ease the burden of
maintaining the patch set.  I hope.  But getting over the initial hump
can be a lot of work.

> > +pg_upgrade. (Therefore the encrypted cluster does not support pg_upgrade 
> > with
> > +the --link option.)
> >
> > That seems pretty unfortunate.  Any way to do better?
>
> The reason is that the initialization vector (IV) for relation page encryption
> is derived from RelFileNode, and both dbNode and relNode can be different in
> the new cluster. Therefore the data of the old cluster need to be decrypted
> and encrypted using the new IV before it's copied to the new cluster. That's
> why we can't use symlink.
>
> As an alternative, I thought of deriving the IV by hashing the
> .. string, but that would mean that the
> "reencryption" is triggered by commands like "ALTER TABLE ... RENAME TO ...",
> "ALTER TABLE ... SET SCHEMA ...", etc.
>
> Currently I'm thinking of making the IV less dependent on RelFileNode
> (e.g. generate a random number for which RelFileNode serves as the seed) and
> storing it in pg_class as a storage parameter. Therefore we wouldn't have to
> pay any extra attention to transfer of the IV to the new cluster. However, the
> IV storage parameter would need special treatment:
>
> 1. The DBA should not be able to remove or change it using the ALTER TABLE
> command, because the only possible outcome is an inability to decrypt the data.
>
> 2. The \d+ command of psql client should not display the IV. Displaying it
> probably wouldn't be a security risk, but as we encrypt the whole instance,
> the IV would appear for every single table and that might be annoying.
>
> What do you think about this approach?

I don't think it's a very good idea.  For one thing, you can't store
the IV for pg_class in pg_class; that has a circularity problem.  For
another thing, I'm not sure we can or should try to do catalog access
from every place where an IV might be needed.  In fact, I'd go so far
as to say that sounds like a recipe for a lot of pain and fragility.

One potential approach is to modify pg_upgrade to preserve the dbnode
and relfilenode.  I don't know of any fundamental reason why such a
change shouldn't be possible, although it is probably a bunch of work,
and there may be problems with which I am not acquainted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Inconsistent error message wording for REINDEX CONCURRENTLY

2019-05-29 Thread Robert Haas
On Mon, May 27, 2019 at 4:02 AM Michael Paquier  wrote:
> On Mon, May 27, 2019 at 12:20:58AM -0400, Alvaro Herrera wrote:
> > I wonder if we really want to abolish all distinction between "cannot do
> > X" and "Y is not supported".  I take the former to mean that the
> > operation is impossible to do for some reason, while the latter means we
> > just haven't implemented it yet and it seems likely to get implemented
> in a reasonable timeframe.  See some excellent commentary about
> > the "can not" wording at
> > https://postgr.es/m/CA+TgmoYS8jKhETyhGYTYMcbvGPwYY=qa6yyp9b47mx7mwee...@mail.gmail.com
>
> Incorrect URL?

That's one of my messages that never made it through to the list.

Try 
http://postgr.es/m/ca+tgmoz0hzulgvlkf_lrtnydijic4nqd-epcdf_ngtmksfn...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: message style

2019-05-29 Thread Robert Haas
While reading another thread that attempted to link to this email, I
discovered that this email never made it to the list archives.  I am
trying again.

On Tue, Apr 30, 2019 at 12:05 PM Robert Haas  wrote:
>
> On Tue, Apr 30, 2019 at 10:58 AM Alvaro Herrera
>  wrote:
> > My problem here is not really the replacement of the name to %s, but the
> > "may not be" part of it.  We don't use "may not be" anywhere else; most
> > places seem to use "foo cannot be X" and a small number of other places
> > use "foo must not be Y".  I'm not able to judge which of the two is
> > better (so change all messages to use that form), or if there's a
> > semantic difference and if so which one to use in this case.
>
> The message style guidelines specifically discourage the use of "may",
> IMHO for good reason.  "mumble may not be flidgy" could be trying to
> tell you that something is impermissible, as in "the children may not
> run in the house."  But it could also be warning you that something is
> doubtful, as in "the children may not receive Easter candy this year
> because there is a worldwide chocolate shortage."  Sometimes the same
> sentence can be read either way, like "this table may not be
> truncated," which can mean either that TRUNCATE is going to fail if
> run in the future, or that it is unclear whether TRUNCATE was already
> run in the past.
>
> As for "cannot" versus "must not", that is murkier, but it looks to me as
> though we prefer "cannot" to "must not" about 9:1, so most often
> "cannot" is the right thing, but not always.  The distinction seems to
> be that we use "cannot" to talk about things that we are unwilling or
> unable to do in the future, whereas "must not" is used to admonish the
> user about what has already taken place.  Consider:
>
> array must not contain nulls
> header key must not contain newlines
> cast function must not return a set
> interval time zone \"%s\" must not include months or days
> function \"%s\" must not be SECURITY DEFINER
>
> vs.
>
> cannot drop %s because %s requires it
> cannot PREPARE a transaction that has manipulated logical replication workers
> cannot reindex temporary tables of other sessions
> cannot inherit from partitioned table \"%s\"
>
> The first set of messages are essentially complaints about the past.
> The array shouldn't have contained nulls, but it did!  The header key
> should not have contained newlines, but it did!  The cast function
> should not return a set, but it does!  Hence, we are sad and are
> throwing an error.  The second set are statements that we are
> unwilling or unable to proceed, but they don't necessarily carry the
> connotation that there is a problem already in the past.  You've just
> asked for something you are not going to get.
>
> I think that principle still leaves some ambiguity.  For example, you
> could phrase the second of the "cannot" messages as "must not try to
> PREPARE a transaction that has manipulated logical replication
> workers."  That's grammatical and everything, but it sounds a bit
> accusatory, like the user is in trouble or something.  I think that's
> probably why we tend to prefer "cannot" in most cases.  But sometimes
> that would lead to a longer or less informative message.  For example,
> you can't just change
>
> function \"%s\" must not be SECURITY DEFINER
>
> to
>
> function \"%s\" can not be SECURITY DEFINER
>
> ...because the user will rightly respond "well, I guess it can,
> because it is."  We could say
>
> can not execute security definer functions from PL/Tcl
>
> ...but that sucks because we now have no reasonable place to put the
> function name.  We could say
>
> can not execute security definer function \"%s\" from PL/Tcl
>
> ...but that also sucks because now the message only says that this one
> particular security definer function cannot be executed, rather than
> saying that ALL security definer functions cannot be executed.   To
> really get there, you'd have to do something like
>
> function "\%s" cannot be executed by PL/Tcl because it is a security
> definer function
>
> ...which is fine, but kinda long.  On the plus side it's more clear
> about the source of the error (PL/Tcl) than the current message which
> doesn't state that explicitly, so perhaps it's an improvement anyway,
> but the general point is that sometimes I think there is no succinct
> way of expressing the complaint clearly without using "must not".
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: some questions about fast-path-lock

2019-05-29 Thread Robert Haas
On Mon, May 27, 2019 at 2:01 AM Alex  wrote:
> I got some ideas from the README under storage/lmgr and read some code of
> LockAcquireExtended, but I still have some questions now.
>
> LWLockAcquire(&MyProc->backendLock, LW_EXCLUSIVE);
> if (FastPathStrongRelationLocks->count[fasthashcode] != 0)
> acquired = false;
> else
>  acquired = FastPathGrantRelationLock(locktag->locktag_field2,
> lockmode);
>
> 1.  In the README,   it says:  "A key point of this algorithm is that it must 
> be possible to verify the
> absence of possibly conflicting locks without fighting over a shared LWLock or
> spinlock.  Otherwise, this effort would simply move the contention bottleneck
> from one place to another."
>
> but in the code above there is an LWLockAcquire call.  Actually I can't
> figure out how we could proceed without a lock.

The per-backend lock is not heavily contended, because under normal
circumstances it is only accessed by a single backend.  If there is a
potential lock conflict that must be analyzed then another backend may
acquire it and that might lead to a little bit of contention, but it
happens quite rarely -- so the overall contention is still much less
than if everyone is fighting over the lock manager partition locks.

> 2. Why does MyProc->backendLock work?  It is MyProc, not a global lock.

It's still an LWLock.  Putting it inside of MyProc doesn't make it
magically stop working.  MyProc is in shared memory, not backend-local
memory, if that's what you are confused about.

> 3. For the line "acquired = FastPathGrantRelationLock(locktag->locktag_field2,
> lockmode);", I think it should be replaceable with "acquired = true"
> (but obviously I'm wrong).  I read "FastPathGrantRelationLock" but can't
> understand it.

It can't say 'acquired = true' because each backend can only acquire a
maximum of 16 relation locks via the fast-path mechanism.  If a
process acquires more than 16 relation locks, at least some of them
will have to be acquired without benefit of the fast-path.  This value
could be changed by changing the value of the constant
FP_LOCK_SLOTS_PER_BACKEND, but since we scan the array linearly,
making it too big will lead to other problems.  I don't quite
understand what about FastPathGrantRelationLock you don't understand -
it's a pretty straightforwardly-coded search for either (a) an
existing fastpath slot for the specified relid or failing that (b) an
unused fastpath slot.
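
In outline, and ignoring the real FAST_PATH_* bit-twiddling macros, the search
looks like this (a simplified, self-contained paraphrase -- the names below
are illustrative, not the actual lock.c code):

    #include <stdbool.h>

    #define FP_LOCK_SLOTS_PER_BACKEND 16

    typedef struct FastPathSlots
    {
        unsigned int relid[FP_LOCK_SLOTS_PER_BACKEND];
        unsigned int modebits[FP_LOCK_SLOTS_PER_BACKEND];  /* 0 => slot unused */
    } FastPathSlots;

    static bool
    fastpath_grant(FastPathSlots *fp, unsigned int relid, unsigned int modebit)
    {
        int     f;
        int     unused_slot = FP_LOCK_SLOTS_PER_BACKEND;

        for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
        {
            if (fp->modebits[f] == 0)
                unused_slot = f;                /* remember an empty slot */
            else if (fp->relid[f] == relid)
            {
                fp->modebits[f] |= modebit;     /* existing entry: add this mode */
                return true;
            }
        }
        if (unused_slot < FP_LOCK_SLOTS_PER_BACKEND)
        {
            fp->relid[unused_slot] = relid;     /* claim the remembered empty slot */
            fp->modebits[unused_slot] = modebit;
            return true;
        }
        return false;                           /* array full: take the slow path */
    }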

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Names

2019-05-29 Thread Robert Haas
On Tue, May 28, 2019 at 12:35 AM Sascha Kuhl  wrote:
> Are there teams behind the names or does everybody write with their personal 
> name?

I think if you spend some time reading the mailing list, you'll be
able to figure out the answer to this question and many others you
might have.  People are generally pretty clear in their messages
whether or not they collaborated with others on the work which they
are presenting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Why does SPI_connect change the memory context?

2019-05-29 Thread Jeff Davis
SPI_connect() changes the memory context to a newly-created one, and
then SPI_finish() restores it. That seems a bit dangerous because the
caller might not be expecting it. Is there a reason it doesn't just
change to the new memory context as-needed?

spi.c:161:

   /* ... and switch to procedure's context */
_SPI_current->savedcxt = MemoryContextSwitchTo(_SPI_current->procCxt);

Regards,
Jeff Davis






Re: [HACKERS] Runtime Partition Pruning

2019-05-29 Thread Robert Haas
On Fri, May 24, 2019 at 12:09 PM Tom Lane  wrote:
> Is it actually better?  The basic problem with backtrace() is that it
> only knows about global functions, and so reports call sites in static
> functions as if they were in whatever global function physically precedes
> the static one.  I think doing materially better requires depending on
> debug symbols, which (at least in the Red Hat world) aren't going to
> be there in a typical production situation.

I don't have an opinion on glibc vs. libunwind, but I don't understand
this argument.  If you are unlucky enough to have a production server
that is crashing due to some hitherto-unknown bug, and if it's not
possible to get a good backtrace without installing debugging symbols,
then you are going to have to pick between (1) installing those
debugging symbols and (2) getting a poor backtrace.  I don't really
see that as a problem so much as just the way life is.  You can't
expect to get good debugging output without debugging symbols, just as
you can't expect to get a clean audit without bank statements or a
clear idea of how to find your way to an unknown destination without a
map.  If you don't have the thing that contains the information that
is needed, you can't get the information; c'est la vie.
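
(For reference, the glibc facility being discussed is only a couple of calls;
a minimal sketch, assuming glibc's <execinfo.h> is available:

    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void
    print_stack(void)
    {
        void   *frames[32];
        int     nframes = backtrace(frames, 32);
        char  **names = backtrace_symbols(frames, nframes);

        if (names == NULL)
            return;
        for (int i = 0; i < nframes; i++)
            fprintf(stderr, "%s\n", names[i]);
        free(names);
    }

Without debug symbols, backtrace_symbols() can only resolve exported
functions, which is exactly the static-function misattribution Tom describes
above.)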

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Rearranging ALTER TABLE to avoid multi-operations bugs

2019-05-29 Thread Robert Haas
On Sun, May 26, 2019 at 6:24 PM Tom Lane  wrote:
> Anybody have thoughts about a different way to approach it?

I mean, in an ideal world, I think we'd never call back out to
ProcessUtility() from within AlterTable().  That seems like a pretty
clear layering violation.  I assume the reason we've never tried to do
better is a lack of round tuits and/or sufficient motivation.

In terms of what we'd do instead, I suppose we'd try to move as much
as possible inside the ALTER TABLE framework proper and have
everything call into that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Different row estimations on base rels

2019-05-29 Thread Donald Dong
On May 29, 2019, at 1:36 PM, Robert Haas  wrote:
> Well, it's all there in the code.  I believe the issue is that the
> final estimates are based on the number of rows that will be returned
> from the relation, which is often less, and occasionally more, than
> the total of the rows in the relation.  The reason it's often less is
> because there might be a WHERE clause or similar which rules out some
> of the rows.  The reason it might be more is because a nested loop
> could return the same rows multiple times.

Yes, indeed. I was confused, and I guess I could've thought about it a
bit more before posting here. Thank you for answering this
question!

Regards,
Donald Dong




Re: [HACKERS] Unlogged tables cleanup

2019-05-29 Thread Robert Haas
On Sun, May 26, 2019 at 10:11 PM Michael Paquier  wrote:
> If we should do something about such cases, then this brings us back
> to do (INIT | CLEANUP) at the end of recovery anyway?

I believe that could fix problems #1 and #2, but we'd have to think
about what new issues it would introduce.

#3 seems like it needs a different fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Different row estimations on base rels

2019-05-29 Thread Robert Haas
On Sun, May 26, 2019 at 1:00 PM Donald Dong  wrote:
> I noticed the estimated rows of the base relations during the join
> search are *very* different from the estimations in the final plan.
>
> Join search (rows of the initial_rels):
> RELOPTINFO (ct): rows=1 width=4
> RELOPTINFO (it): rows=1 width=4
> RELOPTINFO (mc): rows=17567 width=32
> RELOPTINFO (mi_idx): rows=1380035 width=8
> RELOPTINFO (t): rows=2528356 width=25
>
> The final plan:
> Seq Scan on company_type ct
>   (cost=0.00..1.05 rows=1 width=4)
> Seq Scan on info_type it
>   (cost=0.00..2.41 rows=1 width=4)
> Parallel Seq Scan on movie_companies mc
>   (cost=0.00..37814.90 rows=7320 width=32)
> Parallel Seq Scan on movie_info_idx mi_idx
>   (cost=0.00..13685.15 rows=575015 width=8)
> Index Scan using title_pkey on title t
>   (cost=0.43..0.58 rows=1 width=25)
>
> By looking at the joinrel->rows, I would expect relation t to have
> the largest size; however, this is not true at all. I wonder what's
> causing this observation, and how to get estimations close to the
> final plan?

Well, it's all there in the code.  I believe the issue is that the
final estimates are based on the number of rows that will be returned
from the relation, which is often less, and occasionally more, than
the total of the rows in the relation.  The reason it's often less is
because there might be a WHERE clause or similar which rules out some
of the rows.  The reason it might be more is because a nested loop
could return the same rows multiple times.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Pinned files at Windows

2019-05-29 Thread Michael Paquier
On Mon, May 27, 2019 at 05:52:13PM +0300, Konstantin Knizhnik wrote:
> Postgres opens files with the FILE_SHARE_DELETE flag, which makes it
> possible to unlink an opened file.
> But unlike on Unixes, the file is not actually deleted. You can see it using
> the "dir" command.
> And the stat() function also doesn't return an error in this case:
> 
> https://stackoverflow.com/questions/27270374/deletefile-or-unlink-calls-succeed-but-doesnt-remove-file
> 
> So the first check in pgwin32_safestat (r < 0) is not working at all: stat()
> returns 0, but a subsequent call of GetFileAttributesEx
> returns 5 (ERROR_ACCESS_DENIED).

So you would basically hijack the result of GetFileAttributesEx() so
that any error returned by this function is reported as ENOENT.  Why
would that be a sane idea?  What if, say, a permission error or some
other error is legitimate?  If ENOENT is returned instead, as you
propose, the caller would be confused by an incorrect status.

As you mention, what we did as of 9951741 may not be completely right,
and the reason why it was done this way comes from here:
https://www.postgresql.org/message-id/20160712083220.1426.58...@wrigleys.postgresql.org

Could we instead come up with a reliable way to detect if a file is in
a deletion pending state?  Mapping blindly EACCES to ENOENT is not a
solution I think we can rely on (perhaps we could check only after
ERROR_ACCESS_DENIED using GetLastError() and map back to ENOENT in
that case, but this can still be triggered if a virus scanner holds the
file for read, no?).  stat() returning 0 for a file pending for
deletion which will go away physically once the handles still keeping
the file around are closed is not something I would have imagined is
sane, but that's what we need to deal with...  Windows has a long
history of keeping things compatible, sometimes in their own weird
way, and it seems that we have one here so I cannot imagine that this
behavior is going to change.
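
To be concrete about the parenthetical idea above, such a check in
pgwin32_safestat() could look roughly like this (an illustrative fragment
only, assuming the stat'd path is in "path" -- and this is precisely the
blind mapping I am not sure we can rely on):

    WIN32_FILE_ATTRIBUTE_DATA attr;

    if (!GetFileAttributesEx(path, GetFileExInfoStandard, &attr))
    {
        DWORD       err = GetLastError();

        if (err == ERROR_ACCESS_DENIED)
        {
            /* possibly a delete-pending file: report it as already gone */
            errno = ENOENT;
            return -1;
        }
        _dosmaperr(err);        /* normal Windows-to-errno mapping otherwise */
        return -1;
    }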

Looking around, I have found out about NtCreateFile() which could be
able to report a proper pending deletion status, still that's only
available in kernel mode.  Perhaps others have ideas?
--
Michael


signature.asc
Description: PGP signature


Re: Dead encoding conversion functions

2019-05-29 Thread Daniel Gustafsson
> On 29 May 2019, at 15:03, Tom Lane  wrote:
> 
> Pursuant to today's discussion at PGCon about code coverage, I went
> nosing into some of the particularly under-covered subdirectories
> in our tree,

On a similar, but much less important/interesting note.  I fat-fingered when
compiling isolationtester on the plane over here and happened to compile
src/test/examples, and in there testlo.c and testlo64.c have two dead functions
for which the callsites have been commented out since the Postgres95 import
(and now cause a warning).  Is there any (historic?) reason to keep that code?
It also seems kind of broken as it doesn’t really handle the open() call
failure very well.

cheers ./daniel



Dead encoding conversion functions

2019-05-29 Thread Tom Lane
Pursuant to today's discussion at PGCon about code coverage, I went
nosing into some of the particularly under-covered subdirectories
in our tree, and immediately tripped over an interesting factoid:
the ASCII<->MIC and ASCII<->UTF8 encoding conversion functions are
untested ... not because the regression tests don't try, but because
those conversions are unreachable.  pg_do_encoding_conversion() and
its sister functions have hard-wired fast paths for any conversion
in which the source or target encoding is SQL_ASCII, so that an
encoding conversion function declared for such a case will never
be used.

(The coverage results do show ascii_to_utf8 as being covered, but
that's just because alter_table.sql randomly chose to test
ALTER CONVERSION using a user-defined conversion from SQL_ASCII
to UTF8, rather than any other case.  CreateConversionCommand()
will invoke the specified function on an empty string just to see
if it works, so that's where that "coverage" comes from.)

This situation seems kinda silly.  My inclination is to delete
these functions as useless, but I suppose another approach is
to suppress the fast paths if there's a declared conversion function.
(Doing so would likely require added catalog lookups in places we
might not want them...)

If we do delete them as useless, it might also be advisable to change
CreateConversionCommand() to refuse creation of conversions to/from
SQL_ASCII, to prevent future confusion.

Thoughts?

regards, tom lane




Re: Quick doc typo fix

2019-05-29 Thread Guillaume Lelarge
On Wed, May 29, 2019 at 19:52, Michael Paquier wrote:

> On Wed, May 29, 2019 at 07:47:12PM +0200, Guillaume Lelarge wrote:
> > Yeah, I still have quite a lot to process. It might be better to do it
> > all at once.
>
> OK, thanks!  Could you ping me on this thread once you think you are
> done?
>

Sure.


Re: Quick doc typo fix

2019-05-29 Thread Michael Paquier
On Wed, May 29, 2019 at 07:47:12PM +0200, Guillaume Lelarge wrote:
> Yeah, I still have quite a lot to process. It might be better to do it
> all at once.

OK, thanks!  Could you ping me on this thread once you think you are
done?
--
Michael


signature.asc
Description: PGP signature


Re: Indexing - comparison of tree structures

2019-05-29 Thread Michael Paquier
On Tue, May 28, 2019 at 11:37:54AM -0700, Andres Freund wrote:
> 1) Please respect the list style of properly quoting responses inline,
>and only responding to messages that are somewhat related to the
>previous content
> 2) You ask a lot of question, without actually responding to responses
> 3) Please do some of your own research, before asking
>questions. E.g. there's documentation about our btree implementation
>etc in our source tree.

In this case, you may find the various README in the code to be
of interest.  All index access methods are located in
src/backend/access/, and nbtree/README includes documentation for
btree indexes.
--
Michael


signature.asc
Description: PGP signature


Re: Quick doc typo fix

2019-05-29 Thread Guillaume Lelarge
On Wed, May 29, 2019 at 19:45, Michael Paquier wrote:

> On Wed, May 29, 2019 at 05:30:33PM +0200, Guillaume Lelarge wrote:
> > And here is another one. See patch attached.
>
> Are you still going through some parts of the documentation?  Perhaps
> you may notice something else?  I am wondering if it would be better
> to wait a bit more so that we can group all the issues you are finding at
> once.
>

Yeah, I still have quite a lot to process. It might be better to do it
all at once.


Re: Quick doc typo fix

2019-05-29 Thread Michael Paquier
On Wed, May 29, 2019 at 05:30:33PM +0200, Guillaume Lelarge wrote:
> And here is another one. See patch attached.

Are you still going through some parts of the documentation?  Perhaps
you may notice something else?  I am wondering if it would be better
to wait a bit more so that we can group all the issues you are finding at
once.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Built-in plugin for logical decoding output

2019-05-29 Thread Dave Cramer
Reviving this thread.



On Tue, 26 Sep 2017 at 13:57, Henry  wrote:

> It seems test_decoding.c could be easily changed to support JSON by using
> the built in PostgreSQL functions (json.c composite_to_json) to convert a
> Datum into SQL. Its use of OidOutputFunctionCall could be modified to emit
> arrays and composite types as JSON. This might be enough to enable
> downstream frameworks to parse (without having to code to the terse and
> positional composite structure format).
>
> It could be a minimal change to have in core using the built in JSON
> support with no additional libraries. I have not made changes to this code
> but it seems like it should work.
>
> Thank you,
> Henry
>
> On Tue, Sep 26, 2017 at 9:37 AM Alvaro Hernandez  wrote:
>
>>
>>
>> On 26/09/17 17:50, Craig Ringer wrote:
>>
>> On 26 September 2017 at 22:14, Magnus Hagander 
>> wrote:
>>
>>>
>>>
>>> On Tue, Sep 26, 2017 at 2:16 PM, Alvaro Hernandez 
>>> wrote:
>>>



 But what about earlier versions? Any chance it could be backported
 down to 9.4? If that would be acceptable, I could probably help/do that...
>>>
>>>
>>> The likelihood is zero if you mean backported into core of earlier
>>> versions.
>>>
>>
>> Right. We don't add features to back branches.
>>
>>
>> Yeah, I know the policy. But asking is free ;) and in my opinion this
>> would be a very good reason to have an exception, if there would be a clear
>> desire to have a single, unified, production quality output plugin across
>> all PostgreSQL versions. At least I tried ;)
>>
>>
>>
>>
>>
>>>
>>> If you mean backported as a standalone extension that could be installed
>>> on a previous version, probably. I'm not sure if it relies on any internals
>>> not present before that would make it harder, but it would probably at
>>> least be possible.
>>>
>>>
>> All the pub/sub stuff is new and hooked into syscache etc. So you'd be
>> doing a bunch of emulation/shims using user catalogs. Not impossible, but
>> probably irritating and verbose. And you'd have none of the DDL required to
>> manage it, so you'd need SQL-function equivalents.
>>
>> I suspect you'd be better off tweaking pglogical to speak the same
>> protocol as pg10, since the pgoutput protocol is an evolution of
>> pglogical's protocol. Then using pglogical on older versions.
>>
>>
>>
>> Given all this, if I were doing an app based on logical decoding,
>> I think I would stick to test_decoding for <10
>>
>>
>> Thanks,
>>
>>
>> Álvaro
>>
>>
>> --
>>
>> Alvaro Hernandez
>>
>>
>> ---
>> OnGres
>>
>>

I believe there is a valid reason for providing a reasonably feature-complete
plugin in core. Specifically, in settings such as cloud providers, where the
user does not control what is installed on the server, it would be useful to
have a decent output plugin.

Having had a cursory look at pgoutput, I see no reason why it could not be
used as a general-purpose output plugin.

One thing that would be nice is to remove the requirement for a publication,
as creating a publication on all tables requires a superuser.
I'm also curious why pgoutput does not send attributes in binary? This
seems like a rather small change that should provide some significant
performance benefits.


Dave Cramer

da...@postgresintl.com
www.postgresintl.com


Re: [HACKERS] [PATCH] Generic type subscripting

2019-05-29 Thread Pavel Stehule
On Wed, May 29, 2019 at 17:49, Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> Rebase after pg_indent. Besides, off the list there was a suggestion that
> this
> could be useful to accept more than one data type as a key for
> subscripting.
> E.g. for jsonb it probably makes sense to understand both a simple key
> name and
> jsonpath:
>
> jsonb['a'] and jsonb['$.a']
>
> While implementing it is probably technically relatively straightforward, I
> wonder if there is any opinion about how valuable it could be and what it
> should look like from the syntax point of view (since I believe a user needs
> to specify which type needs to be used).
>

It is a difficult decision - the possibility of using jsonpath looks great,
but the necessity to cast every time is not friendly.

Probably there can be a preferred type when the subscript is of unknown type.
There could be rules similar to those for function parameters.

so jsonb['a'] -- key
jsonb['$.a'] -- key
jsonb['$.a'::jsonpath] -- jsonpath

but it can be a source of bad issues - so I think we don't need this feature
at the moment. It can be implemented later, I think.

Regards

Pavel


Re: Index Skip Scan

2019-05-29 Thread Dmitry Dolgov
> On Sat, May 11, 2019 at 6:35 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> Here is the updated version with the changes I was talking about (mostly about
> readability and code cleanup). I've also added a few tests for cursor
> behaviour.

And one more cosmetic rebase after pg_indent.


v16-0001-Index-skip-scan.patch
Description: Binary data


Re: Dead stores in src/common/sha2.c

2019-05-29 Thread Hamlin, Garick L
On Wed, May 29, 2019 at 11:01:05AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > On Wed, May 29, 2019 at 01:24:19PM +, Hamlin, Garick L wrote:
> >> I ran clang checker and noticed these.   It looks like the 
> >> sha2 implementation is trying to zero out state on exit, but
> >> clang checker finds at least 'a' is a dead store.  
> >> 
> >> Should we fix this?
> >> Is something like the attached sensible?
> >> Is there a common/better approach to zero-out in PG ?
> 
> > This code comes from the SHA-2 implementation of OpenBSD, so it is not
> > adapted to directly touch it.  What's the current state of this code
> > in upstream?  Should we perhaps try to sync with the upstream
> > implementation instead?
> > After a quick search I am not seeing that this area has actually
> > changed:
> > http://fxr.watson.org/fxr/source/crypto/sha2.c?v=OPENBSD
> 
> Hm ... plastering "volatile"s all over it isn't good for readability
> or for quality of the generated code.  (In particular, I'm worried
> about this patch causing all those variables to be forced into memory
> instead of registers.)

Yeah, I don't actually think it's a great approach, which is why I
was wondering whether PG had a preferred approach.  I figured it was
the clearest way to start the discussion, especially since I wasn't
sure whether people would want to fix it.

> At the same time, I'm not sure if we should just write this off as an
> ignorable warning.  If the C compiler concludes these are dead stores
> it'll probably optimize them away, leading to not accomplishing the
> goal of wiping the values.

Yeah, I mean it's odd to put code in to zero/hide state knowing it's
probably optimized out.  

We could also take it out, but maybe it does help somewhere?

... or put in a comment that says: This probably gets optimized away, but
we don't consider it much of a risk.

> On the third hand, that goal may not be worth much, particularly not
> if the variables do get kept in registers.

I haven't looked at the asm.  
Maybe they are in registers...

Garick




Re: Quick doc typo fix

2019-05-29 Thread Michael Paquier
On Tue, May 28, 2019 at 09:46:48PM +0200, Guillaume Lelarge wrote:
> Hehe, that was the first thing I wrote :) but went with "table and index"
> as it was also used a bit later in the chapter. Both are fine with me.

Okay, done this way.  Thanks for the report.
--
Michael


signature.asc
Description: PGP signature


Re: Quick doc typo fix

2019-05-29 Thread Guillaume Lelarge
On Tue, May 28, 2019 at 21:46, Guillaume Lelarge
wrote:

> On Tue, May 28, 2019 at 21:28, Michael Paquier
> wrote:
>
>> On Tue, May 28, 2019 at 05:05:10PM +0200, Guillaume Lelarge wrote:
>> >   
>> >> linkend="catalog-pg-am">pg_am
>> > -  index access methods
>> > +  table and index access methods
>> >   
>>
>> Perhaps we could just say "relation" here?  That's the term used in
>> the paragraph describing pg_am.
>>
>
> Hehe, that was the first thing I wrote :) but went with "table and index"
> as it was also used a bit later in the chapter. Both are fine with me.
>
>
And here is another one. See patch attached.


-- 
Guillaume.
diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
index 2413089ffe..4c34ad9cc9 100644
--- a/doc/src/sgml/file-fdw.sgml
+++ b/doc/src/sgml/file-fdw.sgml
@@ -174,7 +174,7 @@
  
 
  
-  COPY's FORCE_QUOTE options is
+  COPY's FORCE_QUOTE option is
   currently not supported by file_fdw.
  
 


Re: docs about FKs referencing partitioned tables

2019-05-29 Thread Michael Paquier
On Wed, May 29, 2019 at 02:33:26PM +0900, Amit Langote wrote:
> But a couple of other points mentioned in 5.11.3.3. Caveats (of 5.11. Table
> Partitioning) are already repeated in 5.10.1. Caveats; for example, note
> the point about VACUUM, ANALYZE, INSERT ON CONFLICT, etc. applying to
> single tables.  So, perhaps it won't hurt to repeat the caveat about
> indexes and foreign keys too.

OK, committed as such.  Your patch linked to the top of the
inheritance section, so I redirected that to the actual section about
caveats for clarity.
--
Michael


signature.asc
Description: PGP signature


Re: accounting for memory used for BufFile during hash joins

2019-05-29 Thread Tomas Vondra

On Tue, May 28, 2019 at 03:40:01PM +0800, Hubert Zhang wrote:

On Sat, May 4, 2019 at 8:34 AM Tomas Vondra 
wrote:

Hi Tomas

I read your second patch which uses overflow buf files to reduce the total
number of batches.
It would solve the hash join OOM problem that you discussed above: 8KB per
batch leads to the batch-bloating problem.

I mentioned in another thread:

https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com
There is another hash join OOM problem, caused by disabling batch splitting
too early. PG uses a flag hashtable->growEnable to determine whether to split
batches. Once one split has failed (all the tuples are assigned to only one
of the two resulting batches), the growEnable flag is turned off forever.

This is the opposite side of the batch-bloating problem: there are too few
batches, which makes the in-memory hash table too large to fit into memory.



Yes. There are definitely multiple separate issues in the hash join code,
and the various improvements discussed in this (and other) thread usually
address just a subset of them. We need to figure out how to combine them
or maybe devise some more generic solution.

So I think we need to take a step back, and figure out how to combine
these improvements - otherwise we might commit a fix for one issue, making
it much harder/impossible to improve the other issues.

The other important question is whether we see these cases as outliers
(and the solutions as last-resort-attempt-to-survive kind of fix) or more
widely applicable optimizations. I've seen some interesting speedups with
the overflow-batches patch, but my feeling is we should really treat it as
a last resort to survive.


I had a chat about this with Thomas Munro yesterday. Unfortunately, some
beer was involved but I do vaguely remember he more or less convinced me
the BNL (block nested loop join) might be the right approach here. We
don't have any patch for that yet, though :-(


Here is the tradeoff: each batch takes more than 8KB (8KB makes sense for
performance reasons), the in-memory hash table takes memory as well, and
splitting batches may (but need not) reduce the in-memory hash table size
while introducing more batches (and thus more memory usage, 8KB * #batches).
Can we conclude that splitting would be worthwhile if the following holds:
(the reduced memory of the in-memory hash table) - (8KB * number of new
batches) > 0



Something like that, yes.


So I'm considering combining our patch with your patch to fix the join OOM
problem, no matter whether the OOM is introduced by the memory usage of the
in-memory hash table or by 8KB * number of batches.

nbatch_inmemory in your patch could also be redefined using the rule above.

What's your opinion?



One of the issues with my "overflow batches" patch, pointed out to me by
Thomas yesterday, is that it only works with non-parallel hash join. And
we don't know how to make it work in the parallel mode :-(


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Dead stores in src/common/sha2.c

2019-05-29 Thread Tom Lane
Michael Paquier  writes:
> On Wed, May 29, 2019 at 01:24:19PM +, Hamlin, Garick L wrote:
>> I ran clang checker and noticed these.   It looks like the 
>> sha2 implementation is trying to zero out state on exit, but
>> clang checker finds at least 'a' is a dead store.  
>> 
>> Should we fix this?
>> Is something like the attached sensible?
>> Is there a common/better approach to zero-out in PG ?

> This code comes from the SHA-2 implementation of OpenBSD, so it is not
> adapted to directly touch it.  What's the current state of this code
> in upstream?  Should we perhaps try to sync with the upstream
> implementation instead?
> After a quick search I am not seeing that this area has actually
> changed:
> http://fxr.watson.org/fxr/source/crypto/sha2.c?v=OPENBSD

Hm ... plastering "volatile"s all over it isn't good for readability
or for quality of the generated code.  (In particular, I'm worried
about this patch causing all those variables to be forced into memory
instead of registers.)

At the same time, I'm not sure if we should just write this off as an
ignorable warning.  If the C compiler concludes these are dead stores
it'll probably optimize them away, leading to not accomplishing the
goal of wiping the values.

On the third hand, that goal may not be worth much, particularly not
if the variables do get kept in registers.

regards, tom lane




Re: Dead stores in src/common/sha2.c

2019-05-29 Thread Michael Paquier
On Wed, May 29, 2019 at 01:24:19PM +, Hamlin, Garick L wrote:
> I ran clang checker and noticed these.   It looks like the 
> sha2 implementation is trying to zero out state on exit, but
> clang checker finds at least 'a' is a dead store.  
> 
> Should we fix this?
> Is something like the attached sensible?
> Is there a common/better approach to zero-out in PG ?

This code comes from the SHA-2 implementation of OpenBSD, so it is not
adapted to directly touch it.  What's the current state of this code
in upstream?  Should we perhaps try to sync with the upstream
implementation instead?

After a quick search I am not seeing that this area has actually
changed:
http://fxr.watson.org/fxr/source/crypto/sha2.c?v=OPENBSD
--
Michael


signature.asc
Description: PGP signature


Dead stores in src/common/sha2.c

2019-05-29 Thread Hamlin, Garick L
I ran clang checker and noticed these.   It looks like the 
sha2 implementation is trying to zero out state on exit, but
clang checker finds at least 'a' is a dead store.  

Should we fix this?
Is something like the attached sensible?
Is there a common/better approach to zero-out in PG ?

Garick
diff --git a/src/common/sha2.c b/src/common/sha2.c
index ae0a1a3..9c19ef2 100644
--- a/src/common/sha2.c
+++ b/src/common/sha2.c
@@ -457,7 +457,16 @@ SHA256_Transform(pg_sha256_ctx *context, const uint8 *data)
context->state[7] += h;
 
/* Clean up */
-   a = b = c = d = e = f = g = h = T1 = T2 = 0;
+   *((volatile uint32 *) &a) = 0;
+   *((volatile uint32 *) &b) = 0;
+   *((volatile uint32 *) &c) = 0;
+   *((volatile uint32 *) &d) = 0;
+   *((volatile uint32 *) &e) = 0;
+   *((volatile uint32 *) &f) = 0;
+   *((volatile uint32 *) &g) = 0;
+   *((volatile uint32 *) &h) = 0;
+   *((volatile uint32 *) &T1) = 0;
+   *((volatile uint32 *) &T2) = 0;
 }
 #endif /* SHA2_UNROLL_TRANSFORM */
 
@@ -693,7 +702,15 @@ SHA512_Transform(pg_sha512_ctx *context, const uint8 *data)
context->state[7] += h;
 
/* Clean up */
-   a = b = c = d = e = f = g = h = T1 = 0;
+   *((volatile uint32 *) &a) = 0;
+   *((volatile uint32 *) &b) = 0;
+   *((volatile uint32 *) &c) = 0;
+   *((volatile uint32 *) &d) = 0;
+   *((volatile uint32 *) &e) = 0;
+   *((volatile uint32 *) &f) = 0;
+   *((volatile uint32 *) &g) = 0;
+   *((volatile uint32 *) &h) = 0;
+   *((volatile uint32 *) &T1) = 0;
 }
 #else  /* SHA2_UNROLL_TRANSFORM */
 
@@ -783,7 +800,16 @@ SHA512_Transform(pg_sha512_ctx *context, const uint8 *data)
context->state[7] += h;
 
/* Clean up */
-   a = b = c = d = e = f = g = h = T1 = T2 = 0;
+   *((volatile uint32 *) &a) = 0;
+   *((volatile uint32 *) &b) = 0;
+   *((volatile uint32 *) &c) = 0;
+   *((volatile uint32 *) &d) = 0;
+   *((volatile uint32 *) &e) = 0;
+   *((volatile uint32 *) &f) = 0;
+   *((volatile uint32 *) &g) = 0;
+   *((volatile uint32 *) &h) = 0;
+   *((volatile uint32 *) &T1) = 0;
+   *((volatile uint32 *) &T2) = 0;
 }
 #endif /* SHA2_UNROLL_TRANSFORM */
 


Doc fix on information_schema.views

2019-05-29 Thread Gilles Darold

Hi,


Seems that per the documentation on information_schema.views [1] we do 
not support check_options ("Applies to a feature not available in 
PostgreSQL").



Attached is a patch that fixes this description. As CHECK OPTION has been 
supported since 9.4, the patch can be applied to all versions since 9.4.



[1] https://www.postgresql.org/docs/current/infoschema-views.html

--
Gilles Darold
http://www.darold.net/

diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index 234a3bb6d1..d8c7902ff8 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -6737,7 +6737,7 @@ ORDER BY c.ordinal_position;
  
   check_option
   character_data
-  Applies to a feature not available in PostgreSQL
+  CASCADED or LOCAL if a check option is defined, NONE if not
  
 
  


Server crash due to assertion failure in CheckOpSlotCompatibility()

2019-05-29 Thread Ashutosh Sharma
Hi All,

I'm getting a server crash when executing the following test-case:

create table t1(a int primary key, b text);
insert into t1 values (1, 'aa'), (2, 'bb'), (3, 'aa'), (4, 'bb');
select a, b, array_agg(a order by a) from t1 group by grouping sets ((a),
(b));

*Backtrace:*
#0  0x7f37d0630277 in raise () from /lib64/libc.so.6
#1  0x7f37d0631968 in abort () from /lib64/libc.so.6
#2  0x00a5685e in ExceptionalCondition (conditionName=0xc29fd0
"!(op->d.fetch.kind == slot->tts_ops)", errorType=0xc29cc1
"FailedAssertion",
fileName=0xc29d09 "execExprInterp.c", lineNumber=1905) at assert.c:54
#3  0x006dfa2b in CheckOpSlotCompatibility (op=0x2e84e38,
slot=0x2e6e268) at execExprInterp.c:1905
#4  0x006dd446 in ExecInterpExpr (state=0x2e84da0,
econtext=0x2e6d8e8, isnull=0x7ffe53cba4af) at execExprInterp.c:439
#5  0x007010e5 in ExecEvalExprSwitchContext (state=0x2e84da0,
econtext=0x2e6d8e8, isNull=0x7ffe53cba4af)
at ../../../src/include/executor/executor.h:307
#6  0x00701be7 in advance_aggregates (aggstate=0x2e6d6b0) at
nodeAgg.c:679
#7  0x00703a5d in agg_retrieve_direct (aggstate=0x2e6d6b0) at
nodeAgg.c:1847
#8  0x007034da in ExecAgg (pstate=0x2e6d6b0) at nodeAgg.c:1572
#9  0x006e797f in ExecProcNode (node=0x2e6d6b0) at
../../../src/include/executor/executor.h:239
#10 0x006ea174 in ExecutePlan (estate=0x2e6d458,
planstate=0x2e6d6b0, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x2e76b30,
execute_once=true) at execMain.c:1648
#11 0x006e7f91 in standard_ExecutorRun (queryDesc=0x2e7b3b8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:365
#12 0x006e7dc7 in ExecutorRun (queryDesc=0x2e7b3b8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:309
#13 0x008e40c7 in PortalRunSelect (portal=0x2e10bc8, forward=true,
count=0, dest=0x2e76b30) at pquery.c:929
#14 0x008e3d66 in PortalRun (portal=0x2e10bc8,
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2e76b30,
altdest=0x2e76b30,
completionTag=0x7ffe53cba850 "") at pquery.c:770

The following Assert statement in *CheckOpSlotCompatibility*() fails.

1905    Assert(op->d.fetch.kind == slot->tts_ops);

And the above assert statement was added by you as part of the following git
commit.

commit 15d8f83128e15de97de61430d0b9569f5ebecc26
Author: Andres Freund 
Date:   Thu Nov 15 22:00:30 2018 -0800

Verify that expected slot types match returned slot types.

This is important so JIT compilation knows what kind of tuple slot the
deforming routine can expect. There's also optimization potential for
expression initialization without JIT compilation. It e.g. seems
plausible to elide EEOP_*_FETCHSOME ops entirely when dealing with
virtual slots.

Author: Andres Freund

*Analysis:*
I did some quick investigation on this and found that when the aggregate is
performed on the first group, i.e. group by 'a', all the input tuples are
fetched from the outer plan and stored into the tuplesort object, and for
the subsequent groups, i.e. from the second group onwards, the tuples stored
in the tuplesort object during the first phase are used. But the tuples
stored in the tuplesort object are actually minimal tuples, whereas a heap
tuple is expected, which results in the assertion failure.

I might be wrong, but it seems to me like the slot fetched from the tuplesort
object needs to be converted to a heap tuple. Actually, the following
lines of code in agg_retrieve_direct() get executed only when we have
crossed a group boundary. I think, at least, the function call to
ExecCopySlotHeapTuple(outerslot); followed by ExecForceStoreHeapTuple();
should always happen irrespective of whether the group boundary is crossed or
not... Sorry if I'm saying something ...

    /*
     * If we are grouping, check whether we've crossed a group
     * boundary.
     */
    if (node->aggstrategy != AGG_PLAIN)
    {
        tmpcontext->ecxt_innertuple = firstSlot;
        if (!ExecQual(aggstate->phase->eqfunctions[node->numCols - 1],
                      tmpcontext))
        {
            aggstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
            break;
        }
    }

-- 
With 

Re: How to know referenced sub-fields of a composite type?

2019-05-29 Thread Haribabu Kommi
On Wed, May 29, 2019 at 4:51 PM Kohei KaiGai  wrote:

> Hi Amit,
>
> 2019年5月29日(水) 13:26 Amit Langote :
> >
> > Kaigai-san,
> >
> > On 2019/05/29 12:13, Kohei KaiGai wrote:
> > > One interesting data type in Apache Arrow is "Struct" data type. It is
> > > equivalent to composite
> > > type in PostgreSQL. The "Struct" type has sub-fields, and individual
> > > sub-fields have its own
> > > values array for each.
> > >
> > > It means we can skip to load the sub-fields unreferenced, if
> > > query-planner can handle
> > > referenced and unreferenced sub-fields correctly.
> > > On the other hands, it looks to me RelOptInfo or other optimizer
> > > related structure don't have
> > > this kind of information. RelOptInfo->attr_needed tells extension
> > > which attributes are referenced
> > > by other relation, however, its granularity is not sufficient for
> sub-fields.
> >
> > Isn't that true for some other cases as well, like when a query accesses
> > only some sub-fields of a json(b) column?  In that case too, planner
> > itself can't optimize away access to other sub-fields.  What it can do
> > though is match a suitable index to the operator used to access the
> > individual sub-fields, so that the index (if one is matched and chosen)
> > can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
> > that the optimizer leaves it up to the indexes (and plan nodes) to
> further
> > optimize access to within a field.  How is this case any different?
> >
> I think it is a little bit different scenario.
> Even if an index on sub-fields can indicate the tuples to be fetched,
> the fetched tuple contains all the sub-fields because heaptuple is
> row-oriented data.
> For example, if WHERE-clause checks a sub-field: "x" then aggregate
> function references other sub-field "y", Scan/Join node has to return
> a tuple that contains both "x" and "y". IndexScan also pops up a tuple
> with a full composite type, so here is no problem if we cannot know
> which sub-fields are referenced in the later stage.
> Maybe, if IndexOnlyScan supports to return a partial composite type,
> it needs similar infrastructure that can be used for a better composite
> type support on columnar storage.
>

There is another related issue, that a columnar store needs the targeted
columns for projection from the scan, which is discussed in the zedstore
thread [1]. Projecting all columns from a columnar store is much more
expensive than from a row store.

[1] -
https://www.postgresql.org/message-id/CALfoeivu-n5o8Juz9wW%2BkTjnis6_%2BrfMf%2BzOTky1LiTVk-ZFjA%40mail.gmail.com


Regards,
Haribabu Kommi
Fujitsu Australia


Re: MSVC Build support with visual studio 2019

2019-05-29 Thread Haribabu Kommi
On Mon, May 27, 2019 at 8:14 PM Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

>
> On Thu, May 23, 2019 at 3:44 AM Haribabu Kommi 
> wrote:
>
>>
>> Updated patches are attached for all branches.
>>
>>
> I have gone through all patches and there are a couple of typos:
>

Thanks for the review.


> 1. s/prodcutname/productname/
>
> 1.1 In file: 0001-support-building-with-visual-studio-2019_v9.4.patch
> @@ -97,8 +97,9 @@
>Visual Studio 2013. Building with
>Visual Studio 2015 is supported down to
>Windows Vista and Windows Server 2008.
> -   Building with Visual Studio 2017 is
> supported
> -   down to Windows 7 SP1 and Windows Server
> 2008 R2 SP1.
> +   Building with Visual Studio 2017 and
> +   Visual Studio 2019 are supported down to
>
> 1.2 In file:
> 0001-support-building-with-visual-studio-2019_v10_to_v9.5.patch
> @@ -97,8 +97,9 @@
>Visual Studio 2013. Building with
>Visual Studio 2015 is supported down to
>Windows Vista and Windows Server 2008.
> -   Building with Visual Studio 2017 is
> supported
> -   down to Windows 7 SP1 and Windows Server
> 2008 R2 SP1.
> +   Building with Visual Studio 2017 and
> +   Visual Studio 2019 are supported down to
>

Corrected.


> 2. s/stuido/studio/
>
>  2.1 In file: 0001-support-building-with-visual-studio-2019_v9.4.patch
>  @@ -143,12 +173,12 @@ sub DetermineVisualStudioVersion
>  sub _GetVisualStudioVersion
>  {
>   my ($major, $minor) = @_;
> - # visual 2017 hasn't changed the nmake version to 15, so still using
> the older version for comparison.
> + # The major visual stuido that is suppored has nmake version >= 14.20
> and < 15.
>
>  2.2 In file:
> 0001-support-building-with-visual-studio-2019_v10_to_v9.5.patch
> @@ -132,12 +162,12 @@ sub DetermineVisualStudioVersion
>  sub _GetVisualStudioVersion
>  {
>   my ($major, $minor) = @_;
> - # visual 2017 hasn't changed the nmake version to 15, so still using
> the older version for comparison.
> + # The major visual stuido that is suppored has nmake version >= 14.20
> and < 15.
>
>  2.3 In file:  0001-support-building-with-visual-studio-2019_v11.patch
> @@ -139,12 +165,12 @@ sub _GetVisualStudioVersion
>  {
>   my ($major, $minor) = @_;
>
> - # visual 2017 hasn't changed the nmake version to 15, so still using
> the older version for comparison.
> + # The major visual stuido that is suppored has nmake version >= 14.20
> and < 15.
>
>  2.4 In file:  0001-Support-building-with-visual-studio-2019_HEAD.patch
> @@ -106,17 +132,17 @@ sub _GetVisualStudioVersion
>  {
>   my ($major, $minor) = @_;
>
> - # visual 2017 hasn't changed the nmake version to 15, so still using
> the older version for comparison.
> + # The major visual stuido that is suppored has nmake version >= 14.20
> and < 15.
>

Corrected. The spelling of 'supported' was also wrong.

Updated patches are attached.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-support-building-with-visual-studio-2019_v11_v3.patch
Description: Binary data


0001-support-building-with-visual-studio-2019_v10_to_v9.6_v3.patch
Description: Binary data


0001-support-building-with-visual-studio-2019_v9.4_v3.patch
Description: Binary data


0001-Support-building-with-visual-studio-2019_HEAD_v3.patch
Description: Binary data


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-05-29 Thread Pavan Deolasee
On Tue, 28 May 2019 at 4:36 PM, Andres Freund  wrote:

> Hi,
>
> On 2019-04-07 18:04:27 -0700, Andres Freund wrote:
> > Here's a *prototype* patch for this.  It only implements what I
> > described for heap_multi_insert, not for plain inserts. I wanted to see
> > what others think before investing additional time.
>
> Pavan, are you planning to work on this for v13 CF1? Or have you lost
> interest on the topic?
>

Yes, I plan to work on it.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: How to know referenced sub-fields of a composite type?

2019-05-29 Thread Kohei KaiGai
Hi Amit,

2019年5月29日(水) 13:26 Amit Langote :
>
> Kaigai-san,
>
> On 2019/05/29 12:13, Kohei KaiGai wrote:
> > One interesting data type in Apache Arrow is "Struct" data type. It is
> > equivalent to composite
> > type in PostgreSQL. The "Struct" type has sub-fields, and individual
> > sub-fields have its own
> > values array for each.
> >
> > It means we can skip to load the sub-fields unreferenced, if
> > query-planner can handle
> > referenced and unreferenced sub-fields correctly.
> > On the other hands, it looks to me RelOptInfo or other optimizer
> > related structure don't have
> > this kind of information. RelOptInfo->attr_needed tells extension
> > which attributes are referenced
> > by other relation, however, its granularity is not sufficient for 
> > sub-fields.
>
> Isn't that true for some other cases as well, like when a query accesses
> only some sub-fields of a json(b) column?  In that case too, planner
> itself can't optimize away access to other sub-fields.  What it can do
> though is match a suitable index to the operator used to access the
> individual sub-fields, so that the index (if one is matched and chosen)
> can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
> that the optimizer leaves it up to the indexes (and plan nodes) to further
> optimize access to within a field.  How is this case any different?
>
I think it is a little bit different scenario.
Even if an index on sub-fields can indicate the tuples to be fetched,
the fetched tuple contains all the sub-fields because heaptuple is
row-oriented data.
For example, if WHERE-clause checks a sub-field: "x" then aggregate
function references other sub-field "y", Scan/Join node has to return
a tuple that contains both "x" and "y". IndexScan also pops up a tuple
with a full composite type, so here is no problem if we cannot know
which sub-fields are referenced in the later stage.
Maybe, if IndexOnlyScan supports to return a partial composite type,
it needs similar infrastructure that can be used for a better composite
type support on columnar storage.

> > Probably, all we can do right now is walk-on the RelOptInfo list to
> > lookup FieldSelect node
> > to see the referenced sub-fields. Do we have a good idea instead of
> > this expensive way?
> > # Right now, PG-Strom loads all the sub-fields of Struct column from
> > arrow_fdw foreign-table
> > # regardless of referenced / unreferenced sub-fields. Just a second best.
>
> I'm missing something, but if PG-Strom/arrow_fdw does look at the
> FieldSelect nodes to see which sub-fields are referenced, why doesn't it
> generate a plan that will only access those sub-fields or why can't it?
>
Likely, it is not a technical problem, just not a smart implementation.
If there is some existing infrastructure we can apply that I missed, it may
be more suitable than query/expression tree walking.
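
For reference, a rough sketch of such a walk (a hypothetical helper, not
existing infrastructure) that collects the fieldnum of every FieldSelect in
an expression tree via the standard expression_tree_walker() recursion:

#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

static bool
collect_fieldselects(Node *node, List **fieldnums)
{
    if (node == NULL)
        return false;
    if (IsA(node, FieldSelect))
    {
        FieldSelect *fs = (FieldSelect *) node;

        /* remember the referenced sub-field, then keep recursing into fs->arg */
        *fieldnums = lappend_int(*fieldnums, fs->fieldnum);
    }
    return expression_tree_walker(node, collect_fieldselects,
                                  (void *) fieldnums);
}

One would have to run it over the relation's reltarget expressions and the
baserestrictinfo clauses, which is exactly the "expensive way" mentioned
above.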

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei