Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)

2017-08-14 Thread Andres Freund
On 2017-08-14 18:57:39 +0200, Magnus Hagander wrote:
> On Mon, Aug 14, 2017 at 5:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 
> > On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > >> b) I think our tendency to dump all stats whenever we crash isn't really
> > >>    tenable, given how autovacuum etc are tied to them.
> > >
> > > Eh, maybe.  I don't think crashes are really so common on production
> > > systems.  As developers we have an inflated view of their frequency ;-)
> >
> 
> From a stats perspective I think the crash problem is the small problem.
> The big problem is we nuke them on replication promotion and we nuke them
> on PITR. If we solve those two, I'm pretty sure we would also solve the
> on-crash case more or less at the same time.

I don't think they're that rare, but the others are problems
too. Unfortunately I think changing that is a bit more complicated,
given that some stats are actually updated on standbys.

I previously thought that an option to occasionally WAL log the stats
file would be useful (e.g. just before a checkpoint). That'd make them
persistent, and available on the standby. But that'd still require
somehow dealing with stats being produced on the standby - I presume
we'd need multiple stats files and provide functions for merging them.
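To make "merging" concrete, a minimal sketch assuming it boils down to
summing the purely additive per-table counters (PgStat_StatTabEntry is the
existing struct from pgstat.h; the helper name is made up):

static void
pgstat_merge_tabentry(PgStat_StatTabEntry *dst, const PgStat_StatTabEntry *src)
{
	/* purely additive counters can simply be summed */
	dst->tuples_inserted += src->tuples_inserted;
	dst->tuples_updated += src->tuples_updated;
	dst->tuples_deleted += src->tuples_deleted;
	dst->blocks_fetched += src->blocks_fetched;
	dst->blocks_hit += src->blocks_hit;
}

Timestamps and other non-additive fields would obviously need different
treatment.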

It'd be good if we somehow could figure out a way to WAL log the stats
data in a way that's consistent, i.e. doesn't contain data about dumped
relations. But I don't quite see how...


> > Without taking a position on the point under debate, AFAIK it wouldn't
> > be technically complex either under our current architecture or the
> > proposed new one to dump out a new permanent stats file every 10
> > minutes or so.  So if there is an issue here I think it might not be
> > very hard to fix, whatever else we do.
> >
> 
> I started working on that patch at some point, I think I have a rotting
> branch somewhere. That part was indeed fairly easy.
> 
> I got stalled when I feature-crept myself by realizing I wanted it
> snapshotted to WAL so it could fix the PITR and replication issues. And
> then properly bogged down when I realized that on the standby I'd want
> *both* the stats from the standby (while it's running) and the stats from
> the master (after switchover).

Hah, indeed.



Greetings,

Andres Freund




Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)

2017-08-14 Thread Andres Freund
On 2017-08-14 12:28:39 -0400, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
> > On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> Just FYI, the only values being reported by buildfarm animals are
> >> "posix", "sysv", and "windows".  So while mmap may be a thing,
> >> it's an untested thing.
> 
> > I'm pretty sure I dev-tested it before committing anything, but,
> > certainly, having ongoing BF coverage wouldn't be a bad thing.
> 
> Looking closer, the reason those are the only reported values is
> that those are the only possible results from initdb's
> choose_dsm_implementation().  So the real question here is whether
> "mmap" should be considered to dominate "sysv" if it's available.

No, mmap isn't a good option - it's file-backed mmap, rather than
anonymous mmap. To my knowledge there's no good portable way to use
anonymous mmap to share memory across processes unless established
before a fork().
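(For reference, a minimal non-PostgreSQL illustration of the fork() caveat -
an anonymous shared mapping is only visible to processes that inherit it,
which is exactly what dynamically created segments can't rely on:)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	/* anonymous shared mapping, created *before* fork() */
	int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
					   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (shared == MAP_FAILED)
		exit(1);

	*shared = 0;
	if (fork() == 0)
	{
		*shared = 42;			/* child's write is visible to the parent... */
		_exit(0);
	}
	wait(NULL);
	printf("%d\n", *shared);	/* ...so this prints 42 */
	return 0;
}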


> If so, why isn't choose_dsm_implementation() trying it; and if not,
> why are we carrying it?

I think the idea was that there might be platforms that require it, but
...


Greetings,

Andres Freund




Re: [HACKERS] Orphaned files in base/[oid]

2017-08-14 Thread Andres Freund
Hi,

On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> Problem:
> The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
> with disk space.  Querying pg_database_size, as well as du on the
> subdirectory of base/ show total usage to be around 3.8TB.  Summing up the
> size of the relations in pg_class though shows around 2.1TB.
> 
> Initial troubleshooting found around 150 GB of space in pg_temp which had
> never been cleared and was at least several days old.  Restarting the
> server cleared these up.
> 
> Poking around the base/[oid] directory, I found a large number of files
> which did not correspond with a pg_class entry.  One of the apparent
> relations was nearly 1TB in size.
> 
> What I think happened:
> I think various pg_temp/* and orphaned relation files (In base/[oid]) were
> created when PostgreSQL crashed due to running out of space in various
> operations including creating materialised views.
> 
> So my question is if there is a way we can safely clean these up on server
> restart?  If not does it make sense to try to create a utility that can
> connect to PostgreSQL, seek out valid files, and delete the rest?

I think the fix here is to call RemovePgTempFiles() during
crash-restarts, instead of just full starts. The previously stated need
to be able to inspect temp files after a crash can be met, with less
impact, by setting restart_after_crash = false.
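A rough sketch of the intended behavior in the postmaster's crash-recovery
path (the surrounding condition is purely illustrative; RemovePgTempFiles()
is the existing fd.c function):

	/*
	 * Hypothetical placement: when restarting after a backend crash, clean
	 * out leftover temporary files just as at postmaster start.  With
	 * restart_after_crash = off the postmaster exits instead, leaving the
	 * files in place for post-mortem inspection.
	 */
	if (restarting_after_crash)		/* illustrative condition, not real code */
		RemovePgTempFiles();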

Greetings,

Andres Freund




Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)

2017-08-14 Thread Andres Freund
On 2017-08-14 12:21:30 -0400, Robert Haas wrote:
> >> ... If somebody has a system where no other form of shared
> >> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> >> the use case for "none" seems very thin indeed.  I'd vote for just
> >> ripping it out in v11.
> >
> > Just FYI, the only values being reported by buildfarm animals are
> > "posix", "sysv", and "windows".  So while mmap may be a thing,
> > it's an untested thing.
> 
> I'm pretty sure I dev-tested it before committing anything, but,
> certainly, having ongoing BF coverage wouldn't be a bad thing.

Are there any platforms that require it?  I thought posix, sysv and
windows were sufficient?  If we're going to start relying more on DSMs
we probably don't want to use mmap anyway...

- Andres




Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)

2017-08-14 Thread Andres Freund
Hi,

On 2017-08-14 11:46:10 -0400, Robert Haas wrote:
> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Andres Freund <and...@anarazel.de> writes:
> >> You both are obviously right.  Another point of potential concern could
> >> be that we'd pretty much fully rely on dsm/dht's being available, for
> >> the server to function correctly. Are we ok with that? Right now
> >> postgres still works perfectly well, leaving parallelism aside, with
> >> dynamic_shared_memory_type = none.
> >
> > In principle we could keep the existing mechanism as a fallback.
> > Whether that's worth the trouble is debatable.  The current code
> > in initdb believes that every platform has some type of DSM support
> > (see choose_dsm_implementation).  Nobody's complained about that,
> > and it certainly works on every buildfarm animal.  So for all we know,
> > dynamic_shared_memory_type = none is broken already.
> 
> Actually, now that you mention it, I think it *is* broken already, or
> more to the point, if you configure that value, the autovacuum
> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
> added.  When I just tested it, the AV launcher somehow ended up
> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
> down.  That's actually not really entirely the fault of
> dynamic_shared_memory_type = none, though, because the code in
> autovacuum.c does this:
> 
> AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
> /* make sure it doesn't go away even if we do */
> dsa_pin(AutoVacuumDSA);
> dsa_pin_mapping(AutoVacuumDSA);
> 
> Now, that's actually really broken because if dsa_create() throws an
> error of any kind, you're going to have already assigned the value to
> AutoVacuumDSA, but you will not have pinned the DSA or the DSA
> mapping.  There's evidently some additional bug here because I'd sorta
> expect this code to just go into an infinite loop in this case,
> failing over and over trying to reattach the segment, but evidently
> something even worse is happening - perhaps the ERROR isn't releasing
> AutovacuumLock.  Anyway, this code has multiple error handling defects
> and IMHO it's pretty questionable whether DSA should have been used
> here at all.  Allowing the number of autovacuum work items to grow
> without bound would not be a good design - and if we've got a bound
> somewhere, why not just put this in the main shared memory segment?

CCing Alvaro.  This seems like it should be an open item.
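As an aside, the minimal reordering the above implies would be to publish
the area only once it's fully pinned - a sketch only, reusing the variable
names from the quoted snippet, not the actual fix:

	dsa_area   *area = dsa_create(AutovacuumLock->tranche);

	/* pin before making the area visible through the global variable */
	dsa_pin(area);
	dsa_pin_mapping(area);
	AutoVacuumDSA = area;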


> Leaving all that aside, when the DSM facility was introduced in 9.4, I
> was afraid that it would break things for people and that they'd
> demand a way to turn it off, and I included "none" as an option for
> that purpose.  Well, it did break a few things initially but those
> seem to have mostly been fixed during the 9.4 cycle itself.  I'm not
> aware of any subsequent problem reports caused by having DSM enabled
> (pointers welcome!) and given that we now have parallel query relying
> on DSM, people are much less likely to want to just give up on having
> DSM available.  If somebody has a system where no other form of shared
> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> the use case for "none" seems very thin indeed.  I'd vote for just
> ripping it out in v11.

It's possibly still useful for debugging - we'll continue not to be able
to rely on allocations to succeed...


> >> a) It'd be quite useful to avoid needing a whole cluster's stats in
> > >>    memory. Even if $subject would save memory, I'm hesitant committing
> > >>    to something requiring all stats to be in memory forever. As a first
> > >>    step it seems reasonable to e.g. not require state for all databases
> > >>    to be in memory. The first time per-database stats are required, it
> > >>    could be "paged in". Could even be more aggressive and do that on a
> > >>    per-table level and just have smaller placeholder entries for
> > >>    non-accessed tables, but that seems more work.
> >
> > Huh?  As long as we don't lock the shared memory into RAM, which on most
> > platforms we couldn't do without being root anyway, ISTM the kernel should
> > do just fine at paging out little-used areas of the stats.  Let's not
> > reinvent that wheel until there's demonstrable proof of need.
> 
> I think some systems aren't able to page out data stored in shared
> memory segments, and it would probably be pretty awful for performance
> if they did (there's a reason large sorts need to switch to using temp
> files ... and the same rationale seems to a

[HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)

2017-08-13 Thread Andres Freund
Hi,

Since we're getting a bit into the weeds of a different topic, and since
I think it's an interesting feature, I'm detaching this into a separate
thread.

On 2017-08-12 23:37:27 -0400, Tom Lane wrote:
> >> On 2017-08-12 22:52:57 -0400, Robert Haas wrote:
> >>> I think it'd be pretty interesting to look at replacing parts of the
> >>> stats collector machinery with something DHT-based.
> > On Sat, Aug 12, 2017 at 11:30 PM, Andres Freund <and...@anarazel.de> wrote:
> >> That seems to involve a lot more than this though, given that currently
> >> the stats collector data doesn't entirely have to be in memory. I've
> >> seen sites with a lot of databases with quite some per-database stats
> >> data. Don't think we can just require that to be in memory :(
>
> Robert Haas <robertmh...@gmail.com> writes:
> > Hmm.  I'm not sure it wouldn't end up being *less* memory.  Don't we
> > end up caching 1 copy of it per backend, at least for the database to
> > which that backend is connected?  Accessing a shared copy would avoid
> > that sort of thing.
>
> Yeah ... the collector itself has got all that in memory anyway.
> We do need to think about synchronization issues if we make that
> memory globally available, but I find it hard to see how that would
> lead to more memory consumption overall than what happens now.

You both are obviously right.  Another point of potential concern could
be that we'd pretty much fully rely on dsm/dht's being available, for
the server to function correctly. Are we ok with that? Right now
postgres still works perfectly well, leaving parallelism aside, with
dynamic_shared_memory_type = none.


What are your thoughts about how to actually implement this? It seems
we'd have to do something like:

1) Keep the current per-backend & per-transaction state in each
   backend. That allows both to throw away the information and avoids
   increasing contention quite noticeably.

2) Some plain shared memory with metadata.  A set of shared hashtables
   for per database, per relation contents.

3) Individual database/relation entries are either individual atomics
   (we don't rely on consistency anyway), or seqcount (like
   st_changecount) based.

4) Instead of sending stats at transaction end, copy them into a
   "pending" entry.  Nontransactional contents can be moved to
   the pending entry more frequently.

5) Occasionally, try to flush the pending array into the global hash.
   The lookup in the table would be protected by something
   LWLockConditionalAcquire() based, to avoid blocking - don't want to
   introduce chokepoints due to commonly used tables and such.  Updating
   the actual stats can happen without the partition locks being held.
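Roughly, for 5), something along these lines - all type and helper names
other than LWLockConditionalAcquire()/LWLockRelease() and the atomics are
made-up placeholders:

static bool
pgstat_flush_pending_entry(PgStatPendingEntry *pending)
{
	/* placeholder: partition lock covering this relation's hash entry */
	LWLock	   *partlock = pgstat_partition_lock(pending->relid);
	PgStatShmEntry *entry;

	/* don't block on a busy partition, just retry on a later occasion */
	if (!LWLockConditionalAcquire(partlock, LW_SHARED))
		return false;

	entry = pgstat_shared_lookup(pending->relid);	/* placeholder lookup */
	LWLockRelease(partlock);

	/* per 3), the counters are individual atomics - no partition lock held */
	pg_atomic_fetch_add_u64(&entry->tuples_inserted, pending->tuples_inserted);
	pg_atomic_fetch_add_u64(&entry->tuples_updated, pending->tuples_updated);

	pending->tuples_inserted = 0;
	pending->tuples_updated = 0;
	return true;
}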

I think there's two other relevant points here:

a) It'd be quite useful to avoid needing a whole cluster's stats in
   memory. Even if $subject would save memory, I'm hesitant committing
   to something requiring all stats to be in memory forever. As a first
   step it seems reasonable to e.g. not require state for all databases
   to be in memory. The first time per-database stats are required, it
   could be "paged in". Could even be more aggressive and do that on a
   per-table level and just have smaller placeholder entries for
   non-accessed tables, but that seems more work.

   On the other hand, autovacuum is likely going to make that approach
   useless anyway, given it's probably going to access otherwise unneeded
   stats regularly.

b) I think our tendency to dump all stats whenever we crash isn't really
   tenable, given how autovacuum etc are tied to them. We should think
   about ways to avoid that if we're going to do a major rewrite of the
   stats stuff, which this certainly sounds like.


If there weren't HS to worry about, these two points kinda sound like
the data should be persisted into an actual table, rather than some
weird other storage format. But HS seems to make that untenable.

Greetings,

Andres Freund




Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-13 Thread Andres Freund
Hi,

On 2017-08-11 20:39:13 +1200, Thomas Munro wrote:
> Please find attached a new patch series.

Review for 0001:

I think you made a few long lines even longer, like:

@@ -1106,11 +1106,11 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, pltcl_call_state *call_state,
 		Tcl_ListObjAppendElement(NULL, tcl_trigtup, Tcl_NewObj());
 		for (i = 0; i < tupdesc->natts; i++)
 		{
-			if (tupdesc->attrs[i]->attisdropped)
+			if (TupleDescAttr(tupdesc, i)->attisdropped)
 				Tcl_ListObjAppendElement(NULL, tcl_trigtup, Tcl_NewObj());
 			else
 				Tcl_ListObjAppendElement(NULL, tcl_trigtup,
-										 Tcl_NewStringObj(utf_e2u(NameStr(tupdesc->attrs[i]->attname)), -1));
+										 Tcl_NewStringObj(utf_e2u(NameStr(TupleDescAttr(tupdesc, i)->attname)), -1));


as it's not particularly pretty to access tupdesc->attrs[i] repeatedly,
it'd be good if you instead had a local variable for the individual
attribute.

Similar:
			if (OidIsValid(get_base_element_type(TupleDescAttr(tupdesc, i)->atttypid)))
				sv = plperl_ref_from_pg_array(attr, TupleDescAttr(tupdesc, i)->atttypid);
			else if ((funcid = get_transform_fromsql(TupleDescAttr(tupdesc, i)->atttypid,
													 current_call_data->prodesc->lang_oid,
													 current_call_data->prodesc->trftypes)))
				sv = (SV *) DatumGetPointer(OidFunctionCall1(funcid, attr));


@@ -150,7 +148,7 @@ ValuesNext(ValuesScanState *node)
 			 */
 			values[resind] = MakeExpandedObjectReadOnly(values[resind],
 														isnull[resind],
-														att[resind]->attlen);
+														TupleDescAttr(slot->tts_tupleDescriptor, resind)->attlen);

@@ -158,9 +158,9 @@ convert_tuples_by_position(TupleDesc indesc,
 		 * must agree.
 		 */
 		if (attrMap[i] == 0 &&
-			indesc->attrs[i]->attisdropped &&
-			indesc->attrs[i]->attlen == outdesc->attrs[i]->attlen &&
-			indesc->attrs[i]->attalign == outdesc->attrs[i]->attalign)
+			TupleDescAttr(indesc, i)->attisdropped &&
+			TupleDescAttr(indesc, i)->attlen == TupleDescAttr(outdesc, i)->attlen &&
+			TupleDescAttr(indesc, i)->attalign == TupleDescAttr(outdesc, i)->attalign)
 			continue;


I think you get the drift, there's more

Otherwise this seems fairly boring...


Review for 0002:

@@ -71,17 +71,17 @@ typedef struct tupleConstr
 typedef struct tupleDesc
 {
 	int			natts;			/* number of attributes in the tuple */
-	Form_pg_attribute *attrs;
-	/* attrs[N] is a pointer to the description of Attribute Number N+1 */
 	TupleConstr *constr;		/* constraints, or NULL if none */
 	Oid			tdtypeid;		/* composite type ID for tuple type */
 	int32		tdtypmod;		/* typmod for tuple type */
 	bool		tdhasoid;		/* tuple has oid attribute in its header */
 	int			tdrefcount;		/* reference count, or -1 if not counting */
+	/* attrs[N] is the description of Attribute Number N+1 */
+	FormData_pg_attribute attrs[FLEXIBLE_ARRAY_MEMBER];
 } *TupleDesc;

sorry if I'm beating on my hobby horse, but if we're re-ordering anyway,
can you move TupleConstr to the second-to-last? That a) seems more
consistent but b) (hobby horse, sorry) avoids unnecessary alignment
padding.


@@ -734,13 +708,13 @@ BuildDescForRelation(List *schema)
 		/* Override TupleDescInitEntry's settings as requested */
 		TupleDescInitEntryCollation(desc, attnum, attcollation);
 		if (entry->storage)
-			desc->attrs[attnum - 1]->attstorage = entry->storage;
+			desc->attrs[attnum - 1].attstorage = entry->storage;

 		/* Fill in additional stuff not handled by TupleDescInitEntry */
-		desc->attrs[attnum - 1]->attnotnull = entry->is_not_null;
+		desc->attrs[attnum - 1].attnotnull = entry->is_not_null;
 		has_not_null |= entry->is_not_null;
-		desc->attrs[attnum - 1]->attislocal 

Re: [HACKERS] Patches I'm thinking of pushing shortly

2017-08-13 Thread Andres Freund
On 2017-08-13 17:43:10 -0400, Robert Haas wrote:
> On Sun, Aug 13, 2017 at 5:24 PM, Tom Lane  wrote:
> >> I'd vote for including this in v10.  There doesn't seem to be any
> >> downside to this: it's a no brainer to avoid our exploding hash table
> >> case when we can see it coming.
> >
> > Anybody else want to vote that way?  For myself it's getting a bit late
> > in the beta process to be including inessential changes, but I'm willing
> > to push it to v10 not just v11 if there's multiple people speaking for
> > that.
> 
> I'd vote for waiting until v11.  I think it's too late to be doing
> things that might change good plans into bad ones or vice versa;
> that's a recipe for having to put out 10.1 and 10.2 a little quicker
> than I'd like.

Similar here, there doesn't seem to be that much urgency.

- Andres




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-13 Thread Andres Freund
On 2017-08-13 16:55:33 -0400, Tom Lane wrote:
> Peter Geoghegan <p...@bowt.ie> writes:
> > I think that it's useful for these things to be handled in an
> > adversarial manner, in the same way that litigation is adversarial in
> > a common law court. I doubt that Noah actually set out to demoralize
> > anyone. He is just doing the job he was assigned.
> 
> FWIW, I agree that Noah is just carrying out the RMT's task as
> assigned.

Well, then that's a sign that the tasks/process need to be rescoped.


> However, the only thing that Peter could really do about this of his own
> authority is to revert 1e8a850, which I certainly think should be the last
> resort not the first.  We are continuing to make progress towards finding
> a better solution, I think, and since no particular deadline is imminent
> we should let that process play out.

I agree that that'd be a bad fix. There's other things that should, but
don't, really use the asynchronous API, e.g. postgres_fdw.

As a measure of last resort we could add a libpq workaround that forces
a pqSocketCheck() at the right moment, while still establishing a
connection.  That's not good from an interruptability perspective, but
better than blocking for the entire connection establishment.

Greetings,

Andres Freund




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-13 Thread Andres Freund
Hi,

On 2017-08-11 18:11:03 +0530, Augustine, Jobin wrote:
> Appears that patch is not helping.
> Errors like below are still appearing in the log
> ===
> 2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x2749/10057)
> could not send startup packet: Socket is not connected
> (0x2749/10057)
> 2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x2749/10057)
> could not send startup packet: Socket is not connected
> (0x2749/10057)
> 2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL:  could
> not connect to the primary server: could not send data to server: Socket is
> not connected (0x2749/10057)
> could not send startup packet: Socket is not connected
> (0x2749/10057)
> ===

That's too bad. Any chance you could install
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon and
activate monitoring just for that phase? I think it can export to a text
file...

Greetings,

Andres




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-13 Thread Andres Freund
On 2017-08-11 20:56:22 -0700, Noah Misch wrote:
> > > If nobody volunteers, you could always resolve this by reverting 1e8a850 
> > > and
> > > successors.
> > 
> > I think you're blaming the victim.  Our current theory about the cause
> > of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> > completion of a nonblocking connect() call.  That seems pretty broken
> > independently of whether libpqwalreceiver needs the capability.
> 
> Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
> itself.  Nonetheless, commit 1e8a850 promoted the defect from one reachable
> only by writing C code to one reachable by merely configuring replication on
> Windows according to the documentation.  For that, its committer owns this
> open item.  Besides the one approach I mentioned, there exist several other
> fine ways to implement said ownership.

FWIW, I'm personally quite demotivated by this style of handling
issues. You're essentially saying that any code change, even if it just
increases exposure of a preexisting bug, needs to be handled by the
committer of the exposing change. And even if that bug is on a platform
the committer doesn't have.  And all that despite the issue getting
attention.

I'm not ok with this.




Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-12 Thread Andres Freund
On 2017-08-12 22:52:57 -0400, Robert Haas wrote:
> On Fri, Aug 11, 2017 at 9:55 PM, Andres Freund <and...@anarazel.de> wrote:
> > Well, most of the potential usecases for dsmhash I've heard about so
> > far, don't actually benefit much from incremental growth. In nearly all
> > the implementations I've seen incremental move ends up requiring more
> > total cycles than doing it at once, and for parallelism type usecases
> > the stall isn't really an issue.  So yes, I think this is something
> > worth considering.   If we were to actually use DHT for shared caches or
> > such, this'd be different, but that seems darned far off.
> 
> I think it'd be pretty interesting to look at replacing parts of the
> stats collector machinery with something DHT-based.

That seems to involve a lot more than this though, given that currently
the stats collector data doesn't entirely have to be in memory. I've
seen sites with a lot of databases with quite some per-database stats
data. Don't think we can just require that to be in memory :(

- Andres




Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-11 Thread Andres Freund
Hi,

On 2017-08-11 20:39:13 +1200, Thomas Munro wrote:
> Please find attached a new patch series.  I apologise in advance for
> 0001 and note that the patchset now weighs in at ~75kB compressed.
> Here are my in-line replies to your two reviews:

Replying to a few points here, then I'll do a pass through your
submission...


> On Tue, Jul 25, 2017 at 10:09 PM, Andres Freund <and...@anarazel.de> wrote:
> > It does concern me that we're growing yet another somewhat different
> > hashtable implementation. Yet I don't quite see how we could avoid
> > it. dynahash relies on proper pointers, simplehash doesn't do locking
> > (and shouldn't) and also relies on pointers, although to a much lesser
> > degree.  All the open coded tables aren't a good match either.  So I
> > don't quite see an alternative, but I'd love one.
> 
> Yeah, I agree.  To deal with data structures with different pointer
> types, locking policy, inlined hash/eq functions etc, perhaps there is
> a way we could eventually do 'policy based design' using the kind of
> macro trickery you started where we generate N different hash table
> variations but only have to maintain source code for one chaining hash
> table implementation?  Or perl scripts that effectively behave as a
> cfront^H^H^H nevermind.  I'm not planning to investigate that for this
> cycle.

Whaaa, what have I done!  But more seriously, I'm doubtful it's worth
going there.


> > + * level.  However, when a resize operation begins, all partition locks 
> > must
> > + * be acquired simultaneously for a brief period.  This is only expected to
> > + * happen a small number of times until a stable size is found, since 
> > growth is
> > + * geometric.
> >
> > I'm a bit doubtful that we need partitioning at this point, and that it
> > doesn't actually *degrade* performance for your typmod case.
> 
> Yeah, partitioning not needed for this case, but this is supposed to
> be more generally useful.  I thought about making the number of
> partitions a construction parameter, but it doesn't really hurt does
> it?

Well, using multiple locks and such certainly isn't free. An exclusively
owned cacheline mutex is nearly an order of magnitude faster than one
that's currently shared, not to speak of modified. Also it does increase
the size overhead, which might end up happening for a few other cases.


> > + * Resizing is done incrementally so that no individual insert operation 
> > pays
> > + * for the potentially large cost of splitting all buckets.
> >
> > I'm not sure this is a reasonable tradeoff for the use-case suggested so
> > far, it doesn't exactly make things simpler. We're not going to grow
> > much.
> 
> Yeah, designed to be more generally useful.  Are you saying you would
> prefer to see the DHT patch split into an initial submission that does
> the simplest thing possible, so that the unlucky guy who causes the
> hash table to grow has to do all the work of moving buckets to a
> bigger hash table?  Then we could move the more complicated
> incremental growth stuff to a later patch.

Well, most of the potential usecases for dsmhash I've heard about so
far, don't actually benefit much from incremental growth. In nearly all
the implementations I've seen incremental move ends up requiring more
total cycles than doing it at once, and for parallelism type usecases
the stall isn't really an issue.  So yes, I think this is something
worth considering.   If we were to actually use DHT for shared caches or
such, this'd be different, but that seems darned far off.


> This is complicated, and in the category that I would normally want a
> stack of heavy unit tests for.  If you don't feel like making
> decisions about this now, perhaps iteration (and incremental resize?)
> could be removed, leaving only the most primitive get/put hash table
> facilities -- just enough for this purpose?  Then a later patch could
> add them back, with a set of really convincing unit tests...

I'm inclined to go for that, yes.


> > +/*
> > + * Detach from a hash table.  This frees backend-local resources associated
> > + * with the hash table, but the hash table will continue to exist until it 
> > is
> > + * either explicitly destroyed (by a backend that is still attached to 
> > it), or
> > + * the area that backs it is returned to the operating system.
> > + */
> > +void
> > +dht_detach(dht_hash_table *hash_table)
> > +{
> > +   /* The hash table may have been destroyed.  Just free local memory. 
> > */
> > +   pfree(hash_table);
> > +}
> >
> > Somewhat inclined to add debugging refcount - seems like bugs around
> > that might be annoying to find. Maybe

Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-11 Thread Andres Freund
On 2017-08-11 11:14:44 -0400, Robert Haas wrote:
> On Fri, Aug 11, 2017 at 4:39 AM, Thomas Munro
>  wrote:
> > OK.  Now it's ds_hash_table.{c,h}, where "ds" stands for "dynamic
> > shared".  Better?  If we were to do other data structures in DSA
> > memory they could follow that style: ds_red_black_tree.c, ds_vector.c,
> > ds_deque.c etc and their identifier prefix would be drbt_, dv_, dd_
> > etc.
> >
> > Do you want to see a separate patch to rename dsa.c?  Got a better
> > name?  You could have spoken up earlier :-)  It does sound like a bit
> > like the thing from crypto or perhaps a scary secret government
> > department.

I, and I bet a lot of other people, kind of missed dsa being merged for
a while...


> I doubt that we really want to have accessor functions with names like
> dynamic_shared_hash_table_insert or ds_hash_table_insert. Long names
> are fine, even desirable, for APIs that aren't too widely used,
> because they're relatively self-documenting, but a 30-character
> function name gets annoying in a hurry if you have to call it very
> often, and this is intended to be reusable for other things that want
> a dynamic shared memory hash table.  I think we should (a) pick some
> reasonably short prefix for all the function names, like dht or dsht
> or ds_hash, but not ds_hash_table or dynamic_shared_hash_table and (b)
> also use that prefix as the name for the .c and .h files.

Yea, I agree with this. Something dsmhash_{insert,...}... seems like
it'd kinda work without being too ambiguous like dht imo is, while still
being reasonably short.


> Right now, we've got a situation where the most widely-used hash table
> implementation uses dynahash.c for the code, hsearch.h for the
> interface, and "hash" as the prefix for the names, and that's really
> hard to remember.  I think having a consistent naming scheme
> throughout would be a lot better.

Yea, that situation still occasionally confuses me, a good 10 years
after starting to look at pg...  There's even a dynahash.h, except
it's useless. And dynahash.c doesn't even include hsearch.h directly
(included via shmem.h)!  Personally I'd actually be in favor of moving
hsearch.h stuff into dynahash.h and leave hsearch as a wrapper.

- Andres




Re: [HACKERS] WIP: Failover Slots

2017-08-11 Thread Andres Freund
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that.  Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to
be submitted/merged to core.


> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above.  It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.


I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axes I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
  available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
  slots do *NOT* provide proper PITR support, there's not enough
  interlock easily available (you'd have to save slots at the end, then
  increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
  standbys, and failover slots. I'm probably a lot closer to zero than
  Craig is.
- How much divergence are we going to accept between infrastructure for
  logical failover, and logical failover via failover slots (or however
  we're naming this)? Again, I'm probably a lot closer to zero than
  Craig is.


Greetings,

Andres Freund




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-08-11 Thread Andres Freund
Hi,

On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> Here is the third version of the patch for pgbench thanks to Fabien Coelho
> comments. As in the previous one, transactions with serialization and
> deadlock failures are rolled back and retried until they end successfully or
> their number of tries reaches maximum.

Just had a need for this feature, and took it for a short test
drive. So some comments:
- it'd be useful to display a retry percentage of all transactions,
  similar to what's displayed for failed transactions.
- it appears that we now unconditionally do not disregard a connection
  after a serialization / deadlock failure. Good. But that's useful far
  beyond just deadlocks / serialization errors, and should probably be exposed.
- it'd be useful to also conveniently display the number of retried
  transactions, rather than the total number of retries.

Nice feature!

- Andres




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-10 Thread Andres Freund
On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
> On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> > On 8/7/17 21:06, Noah Misch wrote:
> > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no 
> > >> in-tree
> > >> callers outside of libpq itself.
> > > [Action required within three days.  This is a generic notification.]
> > > 
> > > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > > since you committed the patch believed to have created it, you own this 
> > > open
> > > item.
> > 
> > I don't think I can usefully contribute to this.  Could someone else
> > take it?
> 
> If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> successors.

I've volunteered a fix nearby, just can't test it. I don't think we can
require committers to work on Windows.

- Andres




Re: [HACKERS] Thoughts on unit testing?

2017-08-10 Thread Andres Freund
Hi,

On 2017-08-11 09:53:23 +1200, Thomas Munro wrote:
> The current regression tests, isolation tests and TAP tests are very
> good (though I admit my experience with TAP is limited), but IMHO we
> are lacking support for C-level unit testing.  Complicated, fiddly
> things with many states, interactions, edge cases etc can be hard to
> get full test coverage on from the outside.  Consider
> src/backend/utils/mmgr/freepage.c as a case in point.
> 
> I guess we could consider the various xUnit systems for C[1] and have
> an entirely new kind of test that runs independently of PostgreSQL.  I
> guess that'd be difficult politically (choosing external project,
> cognitive load) and technically (global state problems as soon as you
> get away from completely stand-alone components).

Yea, I think there's little chance for something like that. The amount
of infrastructure you need to link in and initialize is prohibitive imo.


> One idea that keeps coming back to me is that we could probably extend
> our existing regression tests to cover C tests with automatic
> discovery/minimal boilerplate.

What's your definition of boilerplate here? All the "expected" state
checks in C unit tests are plenty of boilerplate...


> Imagine if you just had to create a
> file bitmapset.test.c that sits beside bitmapset.c (and perhaps add it
> to TEST_OBJS), and in it write tests using a tiny set of macros a bit
> like Google Test's[2].  It could get automagically sucked into a test
> driver shlib module, perhaps one per source directory/subsystem, that
> is somehow discovered, loaded and run inside PostgreSQL as part of the
> regression suite, or perhaps it's just explicitly listed in the
> regression schedule with a .sql file that loads the module and runs an
> entry point function.
> 
> One problem is that if this was happening inside an FMGR function it'd
> be always in a transaction, which has implications.  There are
> probably better ways to do it.

You can already kinda avoid that in various ways, some more some less
hacky. I think it depends a bit on which scenarios we want to test.  How
much infrastructure do you want around? Do you want to be able to start
transactions? Write to the WAL, etc?  One relatively easy way would be
to simply have a 'postgres' commandline option (like we already kinda
have for EXEC_BACKEND style subprocesses), that loads a shared library
and invokes an entry point in it. That'd then be invoked at some defined
point in time. Alternatively you could just have a bgworker that does
its thing at startup and then shuts down the cluster...
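FWIW, a bare-bones sketch of what such a loadable test module's entry point
could look like - the file and function names are invented for illustration,
only the bitmapset/elog calls are real backend API:

/* bitmapset_test.c - hypothetical C-level test module built as a shlib */
#include "postgres.h"

#include "fmgr.h"
#include "nodes/bitmapset.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(run_bitmapset_tests);

Datum
run_bitmapset_tests(PG_FUNCTION_ARGS)
{
	Bitmapset  *a = bms_make_singleton(3);

	/* assert-style checks against the backend-internal API */
	if (!bms_is_member(3, a))
		elog(ERROR, "bitmapset test failed: 3 should be a member");
	if (bms_is_member(4, a))
		elog(ERROR, "bitmapset test failed: 4 should not be a member");

	PG_RETURN_VOID();
}

A .sql file in the regression schedule would then just CREATE FUNCTION ...
LANGUAGE C and SELECT run_bitmapset_tests();.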


Greetings,

Andres Freund




Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-09 Thread Andres Freund
On 2017-08-08 19:25:37 -0400, Peter Eisentraut wrote:
> On 8/7/17 21:06, Noah Misch wrote:
> >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had no 
> >> in-tree
> >> callers outside of libpq itself.
> > [Action required within three days.  This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> > since you committed the patch believed to have created it, you own this open
> > item.
> 
> I don't think I can usefully contribute to this.  Could someone else
> take it?

I've written up a patch at
http://archives.postgresql.org/message-id/20170806215521.d6fq4esvx7s5ejka%40alap3.anarazel.de
but I can't test this either. We really need somebody with access to
Windows to verify whether it works.  That patch needs some adjustments as
remarked by Tom, but can be tested as is.

Regards,

Andres




Re: [HACKERS] Error : undefined symbol : LWLockAssign in 9.6.3

2017-08-08 Thread Andres Freund
Hi,

On 2017-08-09 13:07:53 +0900, 송기훈 wrote:
> Hi.
> I'm trying to use the imcs module with 9.6 and got this error message:
> the LWLockAssign function has been removed in 9.6, so I can't use this
> module anymore from 9.6 on.
> 
> What I want to ask is whether your team has decided not to support the
> imcs module anymore, or simply isn't concerned with it - and whether
> there are any other ways to run PostgreSQL in memory only?

imcs was never supported by "the postgres team" - it was made by an
individual contributor "outside" of postgres. I've CCed him, maybe he
can comment on his plans around imcs.

Greetings,

Andres Freund




[HACKERS] Infrastructure for JIT compiling tuple deforming

2017-08-08 Thread Andres Freund
Hi,

As previously mentioned, tuple deforming is a major bottleneck, and
JITing it can be highly beneficial.  I previously had posted a prototype
that does JITing at the slot_deform_tuple() level, caching the deformed
function in the tupledesc.

Storing things in the tupledesc isn't a great concept however - the
lifetime of the generated function is hard to manage. But more
importantly, and even if we moved this into the slot, it precludes
important optimizations.

JITing the deforming is a *lot* more efficient if we can combine it with
the JITing of the expressions using the deformed tuple. There's a
couple of reasons for that:

1) By knowing the exact attnum the caller is going to request, the code
   can be optimized. No need to generate code for columns not
   deformed. If there's NOT NULL columns at/after the last
   to-be-deformed column, there's no need to generate checks about the
   length of the null-bitmap - getting rid of about half the branches!

2) By generating the deforming code in the generated expression code,
   the code will be generated together. That saves a good chunk of the
   overhead, in particular the memory mapping overhead, and it noticeably reduces
   function call overhead (because relative near calls can be used).

3) LLVM's optimizer can inline parts / all of the tuple deforming code
   into the expression evaluation function, further reducing
   overhead. In simpler cases and with some additional prodding, llvm
   even can interleave deforming of individual columns and their use
   (note that I'm not proposing to do so initially).

4) If we know that the underlying tuple is an actual nonvirtual tuple,
   e.g. on the scan level, the slot deforming of NOT NULL can be
   replaced with direct byte accesses to the relevant column - a good
   chunk faster again.
   (note that I'm not proposing to do so initially)
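As an illustration of 4), assuming the leading columns are NOT NULL and
fixed width - say an int4 followed by an int8, with "tuple" being a
HeapTuple - the generated code could boil down to plain loads at constant
offsets instead of a slot_getattr() style loop (sketch only):

	/* t_hoff skips the tuple header (and any null bitmap) */
	char	   *tp = (char *) tuple->t_data + tuple->t_data->t_hoff;
	int32		col1 = *((int32 *) (tp + 0));	/* attnum 1: int4 at offset 0 */
	int64		col2 = *((int64 *) (tp + 8));	/* attnum 2: int8, maxaligned */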

The problem however is that when generating the expression code we don't
have the necessary information. In my current prototype I'm emitting the
LLVM IR (the input to LLVM) at ExecInitExpr() time for all expressions
in a tree. That allows emitting the code for all functions in the executor
tree in one go.  But unfortunately the current executor initialization
"framework" doesn't provide information about the underlying slot
tupledescs at that time.  Nor does it actually guarantee that the
tupledesc / slots stay the same over the course of the execution.

Therefore I'd like to somehow change things so that the executor keeps
track of whether the tupledesc of inner/outer/scan are going to change,
and if not provide them.

The right approach here seems to be to add a bit of extra data to
ExecAssignScanType etc., and move ExecInitExpr / ExecInitQual /
ExecAssignScanProjectionInfo /... to after that.  We then could keep
track of of the relevant tupledescs somewhere in PlanState - that's a
bit ugly, but I don't quite see how to avoid that unless we want to add
major executor-node awareness into expression evaluation.

Thoughts? Better ideas?

Greetings,

Andres Freund




Re: [HACKERS] max_files_per_processes vs others uses of file descriptors

2017-08-07 Thread Andres Freund
Hi,

On 2017-08-07 17:30:13 -0400, Tom Lane wrote:
> Meh.  The lack of field complaints about this doesn't indicate to me that
> we have a huge problem, and in any case, just increasing NUM_RESERVED_FDS
> would do nothing for the system-wide limits.

How so? Via count_usable_fds() we test for max_files_per_process /
RLIMIT_NOFILE fds, and *then* subtract NUM_RESERVED_FDS.

- Andres




Re: [HACKERS] max_files_per_processes vs others uses of file descriptors

2017-08-07 Thread Andres Freund
On 2017-08-07 17:05:06 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2017-08-07 16:52:42 -0400, Tom Lane wrote:
> >> No, I don't think so.  If you're depending on the NUM_RESERVED_FDS
> >> headroom for anything meaningful, *you're doing it wrong*.  You should be
> >> getting an FD via fd.c, so that there is an opportunity to free up an FD
> >> (by closing a VFD) if you're up against system limits.  Relying on
> >> NUM_RESERVED_FDS headroom can only protect against EMFILE not ENFILE.
> 
> > How would this work for libpq based stuff like postgres fdw? Or some
> > random PL doing something with files? There's very little headroom here.
> 
> Probably the best we can hope for there is to have fd.c provide a function
> "close an FD please", which postgres_fdw could call if libpq fails because
> of ENFILE/EMFILE, and then retry.

Unless that takes up a slot in fd.c while in use, that'll still leave us
open to failures to open files in some critical parts, unless I miss
something.

And then we'd have to teach similar things to PLs etc.  I agree that
having some more slop isn't a proper solution, but only having ~30 fds
as slop on the most common systems seems mightily small.


> (Though I'm unsure how reliably postgres_fdw can detect that failure
> reason right now --- I don't know that we preserve errno on the way
> out of PQconnect.)

Yea, probably not really...


Regards,

Andres




Re: [HACKERS] max_files_per_processes vs others uses of file descriptors

2017-08-07 Thread Andres Freund
On 2017-08-07 16:52:42 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > These days there's a number of other consumers of
> > fds. E.g. postgres_fdw, epoll, ...  All these aren't accounted for by
> > fd.c.
> 
> > Given how close max_files_per_process is to the default linux limit of
> > 1024 fds, I wonder if we shouldn't increase NUM_RESERVED_FDS by quite a
> > bit?
> 
> No, I don't think so.  If you're depending on the NUM_RESERVED_FDS
> headroom for anything meaningful, *you're doing it wrong*.  You should be
> getting an FD via fd.c, so that there is an opportunity to free up an FD
> (by closing a VFD) if you're up against system limits.  Relying on
> NUM_RESERVED_FDS headroom can only protect against EMFILE not ENFILE.

How would this work for libpq based stuff like postgres fdw? Or some
random PL doing something with files? There's very little headroom here.


> What this means is that the epoll stuff needs to be tied into fd.c more
> than it is now, but that's likely a good thing anyway; it would for
> example provide a more robust way of ensuring we don't leak epoll FDs at
> transaction abort.

Not arguing against that.


Greetings,

Andres Freund




[HACKERS] max_files_per_processes vs others uses of file descriptors

2017-08-07 Thread Andres Freund
Hi,

currently the default max_files_per_process is 1000. fd.c uses close to
that many (minus NUM_RESERVED_FDS, i.e. 10). count_usable_fds() makes sure that at
server start there's at most that many fds available, but that doesn't
mean that much for runtime.

These days there's a number of other consumers of
fds. E.g. postgres_fdw, epoll, ...  All these aren't accounted for by
fd.c.

Given how close max_files_per_process is to the default linux limit of
1024 fds, I wonder if we shouldn't increase NUM_RESERVED_FDS by quite a
bit?
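(Rough arithmetic behind that concern, assuming the stock limits: with
RLIMIT_NOFILE = 1024 and max_files_per_process = 1000, fd.c may eventually
hold on to close to 1000 - NUM_RESERVED_FDS = 990 descriptors, leaving only
a few dozen for postgres_fdw connections, epoll sets, PLs and whatever else
opens files behind fd.c's back.)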

Greetings,

Andres Freund




Re: [HACKERS] snapbuild woes

2017-08-06 Thread Andres Freund
On 2017-06-09 09:25:34 +0200, Antonin Houska wrote:
> Andres Freund <and...@anarazel.de> wrote:
> 
> > Looking at 0001:
> > - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
> >   safe for decoding (note how procArray->replication_slot_catalog_xmin
> >   is checked), not one for the initial snapshot - so afaics this whole
> >   exercise doesn't guarantee much so far.
> 
> I happen to use CreateInitDecodingContext() in an extension, so I had to think
> what the new argumen exactly means (as for the incompatibility between PG
> 9.6.2 and 9.6.3, I suppose preprocessor directives can handle it).
> 
> One thing I'm failing to understand is: if TRUE is passed for
> need_full_snapshot, then slot->effective_xmin receives the result of
> 
> GetOldestSafeDecodingTransactionId(need_full_snapshot)
> 
> but this does include "catalog xmin".
> 
> If the value returned is determined by an existing slot which has valid
> effective_catalog_xmin and invalid effective_xmin (i.e. that slot only
> protects catalog tables from VACUUM but not the regular ones), then the new
> slot will think it (i.e. the new slot) protects even non-catalog tables, but
> that's not true.
> 
> Shouldn't the xmin_horizon be computed by this call instead?
> 
> GetOldestSafeDecodingTransactionId(!need_full_snapshot);
> 
> (If so, I think "considerCatalog" argument name would be clearer than
> "catalogOnly".)

Good catch. Pushing a fix. Afaict it's luckily inconsequential atm
because of the way we wait for concurrent snapshots when creating a
slot. But it obviously nevertheless needs to be fixed.

Greetings,

Andres Freund




Re: [HACKERS] Draft release notes up for review

2017-08-05 Thread Andres Freund
Hi Tom,

On 2017-08-04 18:41:12 -0400, Tom Lane wrote:
> I've committed the first-draft release notes for 9.6.4 at
> https://git.postgresql.org/pg/commitdiff/03378c4da598840b0520a53580dd7713c95f21c8

I just pushed a 9.4 specific bugfix. Do you want me to fix up the
release notes after you backpatch the minor release to 9.4, or what's
the best process?

Andres




Re: [HACKERS] PostgreSQL not setting OpenSSL session id context?

2017-08-03 Thread Andres Freund
On 2017-08-04 07:22:42 +0300, Shay Rojansky wrote:
> I'm still not convinced of the risk/problem of simply setting the session
> id context as I explained above (rather than disabling the optimization),
> but of course either solution resolves my problem.

How would that do anything? Each backend has its own local
memory. I.e. any cache state that openssl would maintain wouldn't be
useful. If you want to take advantage of features around this you really
need to cache tickets in shared memory...

Greetings,

Andres Freund




Re: [HACKERS] Hash Functions

2017-08-03 Thread Andres Freund
On 2017-08-03 17:57:37 -0400, Robert Haas wrote:
> On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> >> For me, the basic point here is that we need a set of hash functions
> >> for hash partitioning that are different than what we use for hash
> >> indexes and hash joins -- otherwise when we hash partition a table and
> >> create hash indexes on each partition, those indexes will have nasty
> >> clustering.  Partitionwise hash joins will have similar problems.  So,
> >> a new set of hash functions specifically for hash partitioning is
> >> quite desirable.
> >
> > Couldn't that just as well solved by being a bit smarter with an IV? I
> > doubt we want to end up with different hashfunctions for sharding,
> > partitioning, hashjoins (which seems to form a hierarchy). Having a
> > working hash-combine function, or even better a hash API that can
> > continue to use the hash's internal state, seems a more scalable
> > solution.
> 
> That's another way to go, but it requires inventing a way to thread
> the IV through the hash opclass interface.

Only if we really want to do it really well :P. Using a hash_combine()
like

/*
 * Combine two hash values, resulting in another hash value, with decent bit
 * mixing.
 *
 * Similar to boost's hash_combine().
 */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
return a;
}

between hash(IV) and the hashfunction should do the trick (the IV needs
to be hashed once, otherwise the bit mix is bad).
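I.e., roughly (hash, hashfn and iv being placeholders for the IV hash, the
type's existing hash function and the per-use-case initialization value):

	uint32		hashvalue = hash_combine(hash(iv), hashfn(value));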


> That's actually sort of a
> problem anyway.  Maybe I ought to have started with the question of
> how we're going to make that end of things work.

+1 one for that plan.


> We could:
> 
> - Invent a new hash_partition AM that doesn't really make indexes but
> supplies hash functions for hash partitioning.
> - Add a new, optional support function 2 to the hash AM that takes a
> value of the type *and* an IV as an argument.
> - Something else.

Not arguing for it, but one option could also have pg_type.hash*
function(s).

One thing that I think might be advisable to think about is that we're
atm stuck with a relatively bad hash function for hash indexes (and hash
joins/aggs), and we should probably evolve it at some point. At the same
time there are currently people out there relying on the current hash
functions remaining stable.

Greetings,

Andres Freund




Re: [HACKERS] Hash Functions

2017-08-03 Thread Andres Freund
Hi,

On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> For me, the basic point here is that we need a set of hash functions
> for hash partitioning that are different than what we use for hash
> indexes and hash joins -- otherwise when we hash partition a table and
> create hash indexes on each partition, those indexes will have nasty
> clustering.  Partitionwise hash joins will have similar problems.  So,
> a new set of hash functions specifically for hash partitioning is
> quite desirable.

Couldn't that just as well solved by being a bit smarter with an IV? I
doubt we want to end up with different hashfunctions for sharding,
partitioning, hashjoins (which seems to form a hierarchy). Having a
working hash-combine function, or even better a hash API that can
continue to use the hash's internal state, seems a more scalable
solution.

Greetings,

Andres Freund




Re: [HACKERS] Hash Functions

2017-08-03 Thread Andres Freund
Hi,

On 2017-08-03 17:09:41 -0400, Robert Haas wrote:
> On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <and...@anarazel.de> wrote:
> > Just to clarify: I don't think it's a problem to do so for integers and
> > most other simple scalar types. There are plenty of hash algorithms that are
> > endianness independent, and the rest is just a bit of care.
> 
> Do you have any feeling for which of those endianness-independent hash
> functions might be a reasonable choice for us?

Not a strong / very informed one, TBH.

I'm not convinced it's worth trying to achieve this in the first place,
now that we "nearly" have the insert-via-parent feature. Isn't this a
lot of work, for very little practical gain? Having to select that when
switching architectures still seems unproblematic. People will just
about never switch endianess, so even a tiny performance & effort
overhead doesn't seem worth it to me.


Leaving that aside:


> https://github.com/markokr/pghashlib implements various hash functions
> for PostgreSQL, and claims that, of those implemented, crc32, Jenkins,
> lookup3be and lookup3le, md5, and siphash24 are endian-independent.

> An interesting point here is that Jeff Davis asserted in the original
> post on this thread that our existing hash_any() wasn't portable, but
> our current hash_any seems to be the Jenkins algorithm -- so I'm
> confused.  Part of the problem seems to be that, according to
> https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of
> those.  I don't know whether the one in pghashlib is the same one
> we've implemented.

IIUC lookup3be/le from Marko's hashlib just has an endianness conversion
added.  I'd personally not go for jenkins3, it's not particularly fast,
nor does it balance that out w/ being cryptographically secure.


> Kennel Marshall suggested xxhash as an endian-independent algorithm
> upthread.  Code for that is available under a 2-clause BSD license.

Yea, that'd have been one of my suggestions too. Especially as I still
want to implement better compression using lz4, and that'll depend on
xxhash in turn.


> PostgreSQL page checksums use an algorithm based on, but not exactly,
> FNV-1a.  See storage/checksum_impl.h.  The comments there say this
> algorithm was chosen with speed in mind.  Our version is not
> endian-independent because it folds in 4-byte integers rather than
> 1-byte integers, but plain old FNV-1a *is* endian-independent and
> could be used.

Non-SIMDed (which we hope to achieve with our implementation, which is
why we have separate compiler flags for that file) implementations of
FNV are, to my knowledge, not particularly fast. And the SIMD tricks
are, to my knowledge, not really applicable to the case at hand here. So
I'm not a fan of choosing FNV.
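
For reference, plain byte-at-a-time FNV-1a looks like this - just a
sketch to show why the one-byte variant is endian-independent, not a
proposal to adopt this exact code:

static uint32
fnv1a_32(const unsigned char *data, size_t len)
{
	uint32		hash = 2166136261U; /* FNV offset basis */
	size_t		i;

	for (i = 0; i < len; i++)
	{
		hash ^= data[i];			/* fold in one byte at a time */
		hash *= 16777619U;			/* FNV prime */
	}

	return hash;
}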


> We also have an implementation of CRC32C in core - see port/pg_crc32.h
> and port/pg_crc32c_sb8.c.  It's not clear to me whether this is
> Endian-independent or not, although there is stuff that depends on
> WORDS_BIGENDIAN, so, uh, maybe?

The results should be endian independent. It depends on WORDS_BIGENDIAN
because we need different pre-computed tables depending on endianness.


Greetings,

Andres Freund




Re: [HACKERS] Tightly packing expressions

2017-08-03 Thread Andres Freund
Hi Doug,

On 2017-08-03 18:01:06 +, Douglas Doole wrote:
> Back when you were getting ready to commit your faster expressions, I
> volunteered to look at switching from an array of fixed sized steps to
> tightly packed structures. (
> https://www.postgresql.org/message-id/20170314221648.jrcgh5n7ld4ej...@alap3.anarazel.de).
> I've finally found time to make it happen.

Thanks for working on it!


> Using tightly packed structures makes the expressions a lot smaller. I
> instrumented some before and after builds and ran "make check" just to see
> how much memory expressions were using. What I found was:
> 
> There were ~104K expressions.
> 
> Memory - bytes needed to hold the steps of the expressions
> Allocated Memory - memory is allocated in blocks, this is the bytes
> allocated to hold the expressions
> Wasted Memory - the difference between the allocated and used memory
> 
> Original code:
> Memory: min=64 max=9984 total=28232512 average=265
> Allocated Memory: min=1024 max=16384 total=71584 average=1045
> Wasted Memory: min=0 max=6400 total=82939072 average=780
> 
> New code:
> Memory: min=32 (50%) max=8128 (82%) total=18266712 (65%) average=175 (66%)
> Allocated Memory: min=192 (19%) max=9408 (57%) total=29386176 (26%)
> average=282 (27%)
> Wasted Memory: min=0 (0%) max=1280 (20%) total=9464 (13%) average=106
> (14%)

That's actually not *that* big of a saving...


> Another benefit of the way I did this is that the expression memory is
> never reallocated. This means that it is safe for one step to point
> directly to a field in another step without needing to separately palloc
> storage. That should allow us to simplify the code and reduce pointer
> traversals. (I haven't exploited this in the patch. I figured it would be a
> task for round 2 assuming you like what I've done here.)

Yes, that's a neat benefit. Although I think it'd be even better if we
could work to the point where the mutable and the unchanging data are
separately allocated, so we can at some point avoid redundant expression
"compilations".


> The work was mostly mechanical. The only tricky bit was dealing with the
> places where you jump to step n+1 while building step n. Since we can't
> tell the address of step n+1 until it is actually built, I had to defer
> resolution of the jumps. So all the interesting bits of this patch are in
> ExprEvalPushStep().

I was wondering about a more general "symbol resolution" stage
anyway. Then we'd allocate individual steps during ExecInitExprRec, and
allocate the linear array after we know the exact size.
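
To illustrate, a rough sketch of the deferred resolution I have in mind
(structure and names invented here, not the ones from your patch):
steps record the index of their jump target while being built, and a
final pass turns indexes into addresses once the array is contiguous:

typedef struct DemoStep
{
	int			opcode;
	int			jump_target_index;	/* -1 if this step doesn't jump */
	void	   *jump_target_addr;	/* filled in by the resolve pass */
} DemoStep;

static void
resolve_jumps(DemoStep *steps, int nsteps)
{
	int			i;

	for (i = 0; i < nsteps; i++)
	{
		if (steps[i].jump_target_index >= 0)
			steps[i].jump_target_addr = &steps[steps[i].jump_target_index];
	}
}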


I think it'd be important to see some performance measurements,
especially for larger queries. It'd not be too surprising if the
different method of dereffing the next expression actually has a
negative performance effect, but I'm absolutely not sure of that.

Regards,

Andres




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-08-02 Thread Andres Freund
On 2017-02-10 07:52:57 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> > Up until two minutes ago I assumed that policy would leave only two
> > possibilities: you attach to the DSM segment and attach to the
> > SharedBufFileManager successfully or you attach to the DSM segment and
> > then die horribly (but not throw) and the postmaster restarts the
> > whole cluster and blows all temp files away with RemovePgTempFiles().
> > But I see now in the comment of that function that crash-induced
> > restarts don't call that because "someone might want to examine the
> > temp files for debugging purposes".  Given that policy for regular
> > private BufFiles, I don't see why that shouldn't apply equally to
> > shared files: after a crash restart, you may have some junk files that
> > won't be cleaned up until your next clean restart, whether they were
> > private or shared BufFiles.
> 
> I think most people (other than Tom) would agree that that policy
> isn't really sensible any more; it probably made sense when the
> PostgreSQL user community was much smaller and consisted mostly of the
> people developing PostgreSQL, but these days it's much more likely to
> cause operational headaches than to help a developer debug.

FWIW, we have restart_after_crash = false. If you need to debug things,
you can enable that. Hence the whole RemovePgTempFiles() crash-restart
exemption isn't required anymore, we have a much more targeted solution.

- Andres




Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2017-08-02 Thread Andres Freund
On 2017-08-02 17:09:10 -0400, Tom Lane wrote:
> Even if we fix that, though, I think it is reasonable to downgrade it to
> DEBUG1.  We did that already for other standard background processes such
> as the autovac launcher, and it's not apparent to me why the logrep
> launcher should be chattier.  Now, *unexpected* starts or stops should
> certainly deserve a higher log rating ... but the general rule ought to be
> that totally-expected behavior does not deserve a log entry by default.

Well, every exit *other* than during a shutdown is unexpected.  That's
why I suggested changing the log level for exit code 1 depending on the
cluster being shut down or not.

- Andres




Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2017-08-02 Thread Andres Freund
On 2017-08-02 16:52:01 -0400, Robert Haas wrote:
> On Wed, Aug 2, 2017 at 4:38 PM, Peter Eisentraut
>  wrote:
> > Maybe it doesn't need to be logged at all (other than perhaps as DEBUG)?
> >  A few months ago, people were complaining about too many messages about
> > background workers starting.  Now we are having complaints about
> > messages about background workers stopping.
> 
> I actually don't think it's that unreasonable to get notified when
> system-wide processes like the autovacuum launcher or the logical
> replication launcher start or stop.  That's stuff somebody might want
> to know.  It's not going to generate a lot of log volume, and it might
> be useful, so why suppress it?
> 
> Where things get ugly is if you start to get a high rate of messages -
> e.g. from starting and stopping parallel query workers or other kinds
> of things where you might have workers starting and stopping very
> frequently.  But surely this isn't an example of that.

I generally agree.  But in the shutdown case it's just useless and
confusing - the launcher is stopping because the entire server is being
stopped, and that's very much not clear from the message.

- Andres




Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-02 Thread Andres Freund
Hi,

I think this patch should have a "cover letter" explaining what it's
trying to achieve, how it does so and why it's safe/correct.  I think
it'd also be good to try to show some of the worst cases of this patch
(i.e. where the caching only adds overhead, but never gets used), and
some of the best cases (where the cache gets used quite often, and
reduces contention significantly).

I wonder if there's a way to avoid copying the snapshot into the cache
in situations where we're barely ever going to need it. But right now
the only way I can think of is to look at the length of the
ProcArrayLock wait queue and count the exclusive locks therein - which'd
be expensive and intrusive...


On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
> index a7e8cf2..4d7debe 100644
> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c
> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>   (arrayP->numProcs - index - 1) * sizeof(int));
>   arrayP->pgprocnos[arrayP->numProcs - 1] = -1;   /* for debugging */
>   arrayP->numProcs--;
> + ProcGlobal->is_snapshot_valid = false;
>   LWLockRelease(ProcArrayLock);
>   return;
>   }
>   }
>  
> + ProcGlobal->is_snapshot_valid = false;
>   /* Oops */
>   LWLockRelease(ProcArrayLock);

Why do we need to do anything here? And if we need to, why not in
ProcArrayAdd?


> @@ -1552,6 +1557,8 @@ GetSnapshotData(Snapshot snapshot)
>errmsg("out of memory")));
>   }
>  
> + snapshot->takenDuringRecovery = RecoveryInProgress();
> +
>   /*
>* It is sufficient to get shared lock on ProcArrayLock, even if we are
>* going to set MyPgXact->xmin.
> @@ -1566,100 +1573,200 @@ GetSnapshotData(Snapshot snapshot)
>   /* initialize xmin calculation with xmax */
>   globalxmin = xmin = xmax;
>  
> - snapshot->takenDuringRecovery = RecoveryInProgress();
> -

This movement of code seems fairly unrelated?


>   if (!snapshot->takenDuringRecovery)
>   {

Do we really need to restrict this to !recovery snapshots? It'd be nicer
if we could generalize it - I don't immediately see anything !recovery
specific in the logic here.


> - int*pgprocnos = arrayP->pgprocnos;
> - int numProcs;
> + if (ProcGlobal->is_snapshot_valid)
> + {
> + Snapshot    csnap = &ProcGlobal->cached_snapshot;
> + TransactionId *saved_xip;
> + TransactionId *saved_subxip;
>  
> - /*
> -  * Spin over procArray checking xid, xmin, and subxids.  The goal is
> -  * to gather all active xids, find the lowest xmin, and try to record
> -  * subxids.
> -  */
> - numProcs = arrayP->numProcs;
> - for (index = 0; index < numProcs; index++)
> + saved_xip = snapshot->xip;
> + saved_subxip = snapshot->subxip;
> +
> + memcpy(snapshot, csnap, sizeof(SnapshotData));
> +
> + snapshot->xip = saved_xip;
> + snapshot->subxip = saved_subxip;
> +
> + memcpy(snapshot->xip, csnap->xip,
> +sizeof(TransactionId) * csnap->xcnt);
> + memcpy(snapshot->subxip, csnap->subxip,
> +sizeof(TransactionId) * csnap->subxcnt);
> + globalxmin = ProcGlobal->cached_snapshot_globalxmin;
> + xmin = csnap->xmin;
> +
> + count = csnap->xcnt;
> + subcount = csnap->subxcnt;
> + suboverflowed = csnap->suboverflowed;
> +
> + Assert(TransactionIdIsValid(globalxmin));
> + Assert(TransactionIdIsValid(xmin));
> + }

Can we move this to a separate function? In fact, could you check how
much effort it'd be to split cached, !recovery, recovery cases into
three static inline helper functions? This is starting to be hard to
read, the indentation added in this patch doesn't help... Consider doing
recovery, !recovery cases in a preliminary separate patch.
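
Roughly the shape I have in mind, as a sketch (the helper names are
invented here, they're not in your patch):

static inline void GetSnapshotDataRecovery(Snapshot snapshot);
static inline void GetSnapshotDataCached(Snapshot snapshot);
static inline void GetSnapshotDataNormal(Snapshot snapshot);

	/* in GetSnapshotData(), once the lock is held */
	if (snapshot->takenDuringRecovery)
		GetSnapshotDataRecovery(snapshot);
	else if (ProcGlobal->is_snapshot_valid)
		GetSnapshotDataCached(snapshot);
	else
		GetSnapshotDataNormal(snapshot);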

> +  * Let only one of the callers cache the computed snapshot, and
> +  * others can continue as before.
>*/
> - if (!suboverflowed)
> + if (!ProcGlobal->is_snapshot_valid &&
> + LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock,
> +  LW_EXCLUSIVE))

Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2017-08-01 Thread Andres Freund
On 2017-08-01 20:37:07 -0400, Robert Haas wrote:
> On Tue, Aug 1, 2017 at 7:03 PM, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:
> >> When I shut down a cluster that isn't using logical replication, it
> >> always logs a line like the following.  So do the build farm members I
> >> looked at.  I didn't see anything about this in the open items list --
> >> isn't it a bug?
> >>
> >> 2017-08-02 10:39:25.007 NZST [34781] LOG:  worker process: logical
> >> replication launcher (PID 34788) exited with exit code 1
> >
> > Exit code 0 signals that a worker should be restarted. Therefore
> > graceful exit can't really use that.  I think a) we really need to
> > improve bgworker infrastructure around that b) shows the limit of using
> > bgworkers for this kinda thing - we should probably have a more bgworker
> > like infrastructure for internal workers.
> 
> You might've missed commit be7558162acc5578d0b2cf0c8d4c76b6076ce352.

Not really, just thinko-ing it. We don't want to unregister, so we can't
return 0, IOW, I just * -1'd my comment ;)

We intentionally return 1, so we *do* get restarted:
else if (IsLogicalLauncher())
{
ereport(DEBUG1,
(errmsg("logical replication launcher 
shutting down")));

/*
 * The logical replication launcher can be stopped at 
any time.
 * Use exit status 1 so the background worker is 
restarted.
 */
proc_exit(1);
}

- Andres




Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2017-08-01 Thread Andres Freund
On 2017-08-02 12:14:18 +1200, Thomas Munro wrote:
> On Wed, Aug 2, 2017 at 11:03 AM, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:
> >> When I shut down a cluster that isn't using logical replication, it
> >> always logs a line like the following.  So do the build farm members I
> >> looked at.  I didn't see anything about this in the open items list --
> >> isn't it a bug?
> >>
> >> 2017-08-02 10:39:25.007 NZST [34781] LOG:  worker process: logical
> >> replication launcher (PID 34788) exited with exit code 1
> >
> > Exit code 0 signals that a worker should be restarted. Therefore
> > graceful exit can't really use that.  I think a) we really need to
> > improve bgworker infrastructure around that b) shows the limit of using
> > bgworkers for this kinda thing - we should probably have a more bgworker
> > like infrastructure for internal workers.
> 
> I see.  In the meantime IMHO I think we should try to find a way to
> avoid printing out this message -- it looks like something is wrong to
> the uninitiated.

Well, that's how it is for all bgworkers - maybe a better solution is to
adjust that message in the postmaster rather than fiddle with the worker
exit code?  Seems like we could easily take pmStatus into account
inside LogChildExit() and set the log level to DEBUG1 even for
EXIT_STATUS_1 in that case?  Additionally we probably should always log
a better message for bgworkers exiting with exit 1, something about
unregistering the worker or such.
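
Something like the following, as a sketch only - it reuses the existing
Shutdown flag and the EXIT_STATUS_1() macro from postmaster.c, and the
rest of the function would stay as it is:

static void
LogChildExit(int lev, const char *procname, int pid, int exitstatus)
{
	/*
	 * During an orderly shutdown an exit status of 1 from a background
	 * worker is expected, so demote the message.
	 */
	if (Shutdown > NoShutdown && EXIT_STATUS_1(exitstatus))
		lev = DEBUG1;

	/* ... existing ereport() logic unchanged ... */
}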


> Possibly stupid question: why do we restart workers when we know we're
> shutting down anyway?  Hmm, I suppose there might conceivably be
> workers that need to do something during shutdown and they might not
> have done it yet.

The launcher doesn't really know the reason for the shutdown.

- Andres




Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2017-08-01 Thread Andres Freund
On 2017-08-02 11:19:39 +1200, Gavin Flower wrote:
> Returning zero to indicate success is a holdover from the time computers could
> only run one program at a time.  At the end of the code there was a jump
> table of 4-byte entries.  The first entry, with a displacement of zero, was
> the location to jump to for a normal exit; subsequent entries were for
> various error conditions.  This is why return codes were often positive
> multiples of 4.  Since we don't use jump tables now, more & more people are
> using any integers they want.
> 
> So apart from convention, returning zero is no longer held to be sacred as a
> way to indicate that something exited okay.  In fact, since zero could simply
> mean a value was not explicitly assigned, it is actually a very dangerous
> value to be used to indicate success!

This has nothing to do with this thread.

- Andres




Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2017-08-01 Thread Andres Freund
Hi,

On 2017-08-02 10:58:32 +1200, Thomas Munro wrote:
> When I shut down a cluster that isn't using logical replication, it
> always logs a line like the following.  So do the build farm members I
> looked at.  I didn't see anything about this in the open items list --
> isn't it a bug?
> 
> 2017-08-02 10:39:25.007 NZST [34781] LOG:  worker process: logical
> replication launcher (PID 34788) exited with exit code 1

Exit code 0 signals that a worker should be restarted. Therefore
graceful exit can't really use that.  I think a) we really need to
improve bgworker infrastructure around that b) shows the limit of using
bgworkers for this kinda thing - we should probably have a more bgworker
like infrastructure for internal workers.

- Andres




Re: [HACKERS] [PATCH] A hook for session start

2017-08-01 Thread Andres Freund
On 2017-08-01 15:37:40 -0400, Peter Eisentraut wrote:
> On 7/21/17 12:59, Robert Haas wrote:
> > That's an exceedingly-weak argument for rejecting this patch.  The
> > fact that you can probably hack around the lack of a hook for most
> > reasonable use cases is not an argument for having a hook that does
> > what people actually want to do.
> 
> Still nobody has presented a concrete use case so far.

Citus for example starts a background worker (performing
e.g. distributed deadlock detection) if the citus extension exists. We
atm need annoying hacks to do so when the first query is executed.

- Andres




Re: [HACKERS] Better way to handle suppression of CASCADE detail messages

2017-08-01 Thread Andres Freund
On 2017-08-01 13:48:34 -0400, Robert Haas wrote:
> On Tue, Aug 1, 2017 at 1:39 PM, Andres Freund <and...@anarazel.de> wrote:
> > Oid is probably not good enough - with parallel tests and such it's not
> > necessarily predicable. Even less so when the tests are run against an
> > existing cluster.  Sorting by name would probably be better...
> 
> It's arguably more user-friendly, too, although part of me feels like
> it would be better to try to preserve the topological ordering in some
> way.  If something cascades to foo and from there to bar and from
> there to baz to and from there to quux, emitting the messages as
> 
> drop cascades to bar
> drop cascades to baz
> drop cascades to foo
> drop cascades to quux
> 
> is arguably not going to be too helpful to the user in understanding
> the chain of events, however nice it may be for regression testing
> purposes.

I'm not sure that's going to easily be better - won't the OID order in
turn determine the topological order? Which then again isn't very easy
for users to understand.

- Andres




Re: [HACKERS] Better way to handle suppression of CASCADE detail messages

2017-08-01 Thread Andres Freund
Hi,

On 2017-08-01 13:34:45 -0400, Tom Lane wrote:
> BTW, in the long run maybe we should instead make the CASCADE message
> ordering more predictable, perhaps by sorting the objects by OID.
> But that's not a job for beta time.

Oid is probably not good enough - with parallel tests and such it's not
necessarily predicable. Even less so when the tests are run against an
existing cluster.  Sorting by name would probably be better...
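
As a sketch of what I mean (not a patch; collecting the description
strings is assumed to happen in the existing reporting code):

static int
cascade_description_cmp(const void *a, const void *b)
{
	return strcmp(*(const char *const *) a, *(const char *const *) b);
}

	/* before emitting the DETAIL lines: */
	qsort(descriptions, ndescriptions, sizeof(char *),
		  cascade_description_cmp);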

Greetings,

Andres Freund




Re: [HACKERS] [PATCH v3] pg_progress() SQL function to monitor progression of long running SQL queries/utilities

2017-08-01 Thread Andres Freund
On 2017-08-01 19:11:55 +0200, Michael Paquier wrote:
> I think that Depesz makes use of a non-default format for its
> explain.depesz.com, or he would have a hard time maintaining a
> deparsing API for its application.

Hm? e.d.c accepts the text explain format, so I'm unclear on what you're
saying here.

Andres




[HACKERS] Re: [BUGS] BUG #14758: Segfault with logical replication on a function index

2017-07-31 Thread Andres Freund
On 2017-07-31 09:40:34 +0900, Masahiko Sawada wrote:
> Moved to -hackers.
> 
> On Sat, Jul 29, 2017 at 4:35 AM, Scott Milliken  wrote:
> > Thank you Masahiko! I've tested and confirmed that this patch fixes the
> > problem.
> >
> 
> Thank you for the testing. This issue should be added to the open item
> since this cause of the server crash. I'll add it.

Adding Petr to CC list.

- Andres




Re: [HACKERS] Parallel Hash take II

2017-07-31 Thread Andres Freund
From: Thomas Munro 
Date: Wed 26 Jul 2017 19:58:20 NZST
Subject: [PATCH] Add support for parallel-aware hash joins.

Hi,

WRT the main patch:

- Echoing concerns from other threads (Robert: ping): I'm doubtful that
  it makes sense to size the number of parallel workers solely based on
  the parallel scan node's size.  I don't think it's this patch's job to
  change that, but to me it seriously amplifies that - I'd bet there's a
  lot of cases with nontrivial joins where the benefit from parallelism
  on the join level is bigger than on the scan level itself.  And the
  number of rows in the upper nodes might also be bigger than on the
  scan node level, making it more important to have a higher number of
  nodes.

- If I understand the code in initial_cost_hashjoin() correctly, we
  count the synchronization overhead once, independent of the number of
  workers.  But on the other hand we calculate the throughput by
  dividing by the number of workers.  Do you think that's right?

- I haven't really grokked the deadlock issue you address. Could you
  expand the comments on that? Possibly somewhere central referenced by
  the various parts.

- maybe I'm overly paranoid, but it might not be bad to add some extra
  checks for ExecReScanHashJoin ensuring that it doesn't get called when
  workers are still doing something.

- seems like you're dereffing tuple unnecessarily here:

+   /*
+* If we detached a chain of tuples, transfer them to the main hash table
+* or batch storage.
+*/
+   if (regainable_space > 0)
+   {
+   HashJoinTuple tuple;
+
+   tuple = (HashJoinTuple)
+   dsa_get_address(hashtable->area, detached_chain_shared);
+   ExecHashTransferSkewTuples(hashtable, detached_chain,
+  detached_chain_shared);
+
+   /* Remove from the total space used. */
+   LWLockAcquire(&hashtable->shared->chunk_lock, LW_EXCLUSIVE);
+   Assert(hashtable->shared->size >= regainable_space);
+   hashtable->shared->size -= regainable_space;
+   LWLockRelease(&hashtable->shared->chunk_lock);
+
+   /*
+* If the bucket we removed is the same as the bucket the caller just
+* overflowed, then we can forget about the overflowing part of the
+* tuple.  It's been moved out of the skew hash table.  Otherwise, the
+* caller will call again; eventually we'll either succeed in
+* allocating space for the overflow or reach this case.
+*/
+   if (bucket_to_remove == bucketno)
+   {
+   hashtable->spaceUsedSkew = 0;
+   hashtable->spaceAllowedSkew = 0;
+   }
+   }


- The names here could probably be improved some:
+   case WAIT_EVENT_HASH_SHRINKING1:
+   event_name = "Hash/Shrinking1";
+   break;
+   case WAIT_EVENT_HASH_SHRINKING2:
+   event_name = "Hash/Shrinking2";
+   break;
+   case WAIT_EVENT_HASH_SHRINKING3:
+   event_name = "Hash/Shrinking3";
+   break;
+   case WAIT_EVENT_HASH_SHRINKING4:
+   event_name = "Hash/Shrinking4";

- why are we restricting the rows_total bit to parallel-aware?

+   /*
+* If parallel-aware, the executor will also need an estimate of the total
+* number of rows expected from all participants so that it can size the
+* shared hash table.
+*/
+   if (best_path->jpath.path.parallel_aware)
+   {
+   hash_plan->plan.parallel_aware = true;
+   hash_plan->rows_total = best_path->inner_rows_total;
+   }
+

- seems we need a few more tests - I don't think the existing tests are
  properly going to exercise the skew stuff, multiple batches, etc?
  This is nontrivial code, I'd really like to see a high test coverage
  of the new code.

- might not hurt to reindent before the final submission

- Unsurprisingly, please implement the FIXME ;)


Regards,

Andres




Re: [HACKERS] Parallel Hash take II

2017-07-31 Thread Andres Freund
Hi,

On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
> Here is a new version of my parallel-aware hash join patchset.

Yay!

Working on reviewing this. Will send separate emails for individual
patch reviews.


> 2.  Simplified costing.  There is now just one control knob
> "parallel_synchronization_cost", which I charge for each time the
> participants will wait for each other at a barrier, to be set high
> enough to dissuade the planner from using Parallel Hash for tiny hash
> tables that would be faster in a parallel-oblivious hash join.
> Earlier ideas about modelling the cost of shared memory access didn't
> work out.

Hm. You say, "didn't work out" - could you expand a bit on that? I'm
quite doubtful that just accounting for barriers will be good enough.


> I'll report on performance separately.

Looking forward to that ;)

Greetings,

Andres Freund




Re: [HACKERS] POC: Sharing record typmods between backends

2017-07-31 Thread Andres Freund
Hi,

diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 9fd7b4e019b..97c0125a4ba 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -337,17 +337,75 @@ DecrTupleDescRefCount(TupleDesc tupdesc)
 {
Assert(tupdesc->tdrefcount > 0);
 
-   ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
+   if (CurrentResourceOwner != NULL)
+   ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
if (--tupdesc->tdrefcount == 0)
FreeTupleDesc(tupdesc);
 }

What's this about? CurrentResourceOwner should always be valid here, no?
If so, why did that change? I don't think it's good to detach this from
the resowner infrastructure...


 /*
- * Compare two TupleDesc structures for logical equality
+ * Compare two TupleDescs' attributes for logical equality
  *
  * Note: we deliberately do not check the attrelid and tdtypmod fields.
  * This allows typcache.c to use this routine to see if a cached record type
  * matches a requested type, and is harmless for relcache.c's uses.
+ */
+bool
+equalTupleDescAttrs(Form_pg_attribute attr1, Form_pg_attribute attr2)
+{

comment not really accurate, this routine afaik isn't used by
typcache.c?


/*
- * Magic numbers for parallel state sharing.  Higher-level code should use
- * smaller values, leaving these very large ones for use by this module.
+ * Magic numbers for per-context parallel state sharing.  Higher-level code
+ * should use smaller values, leaving these very large ones for use by this
+ * module.
  */
 #define PARALLEL_KEY_FIXED 
UINT64CONST(0x0001)
 #define PARALLEL_KEY_ERROR_QUEUE   
UINT64CONST(0x0002)
@@ -63,6 +74,16 @@
 #define PARALLEL_KEY_ACTIVE_SNAPSHOT   UINT64CONST(0x0007)
 #define PARALLEL_KEY_TRANSACTION_STATE UINT64CONST(0x0008)
 #define PARALLEL_KEY_ENTRYPOINT
UINT64CONST(0x0009)
+#define PARALLEL_KEY_SESSION_DSM   
UINT64CONST(0x000A)
+
+/* Magic number for per-session DSM TOC. */
+#define PARALLEL_SESSION_MAGIC 0xabb0fbc9
+
+/*
+ * Magic numbers for parallel state sharing in the per-session DSM area.
+ */
+#define PARALLEL_KEY_SESSION_DSA   
UINT64CONST(0x0001)
+#define PARALLEL_KEY_RECORD_TYPMOD_REGISTRYUINT64CONST(0x0002)

Not this patch's fault, but this infrastructure really isn't great. We
should really replace it with a shmem.h style infrastructure, using a
dht hashtable as backing...


+/* The current per-session DSM segment, if attached. */
+static dsm_segment *current_session_segment = NULL;
+

I think it'd be better if we had a proper 'SessionState' and
'BackendSessionState' infrastructure that then contains the dsm segment
etc. I think we'll otherwise just end up with a bunch of parallel
infrastructures.
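
Very roughly something like this, all names invented:

typedef struct SharedSessionState
{
	dsm_handle	segment_handle; /* the per-session DSM segment */
	dsa_handle	area_handle;	/* per-session DSA area within it */
	/* ... record-typmod registry and other session-scoped state ... */
} SharedSessionState;

typedef struct BackendSessionState
{
	dsm_segment *segment;		/* this backend's attachment, if any */
	dsa_area   *area;			/* attached DSA area */
	SharedSessionState *shared; /* pointer into the segment */
} BackendSessionState;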



+/*
+ * A mechanism for sharing record typmods between backends.
+ */
+struct SharedRecordTypmodRegistry
+{
+   dht_hash_table_handle atts_index_handle;
+   dht_hash_table_handle typmod_index_handle;
+   pg_atomic_uint32 next_typmod;
+};
+

I think the code needs to explain better how these are intended to be
used. IIUC, atts_index is used to find typmods by "identity", and
typmod_index by the typmod, right? And we need both to avoid
all workers generating different tupledescs, right?  Kinda guessable by
reading typcache.c, but that shouldn't be needed.


+/*
+ * A flattened/serialized representation of a TupleDesc for use in shared
+ * memory.  Can be converted to and from regular TupleDesc format.  Doesn't
+ * support constraints and doesn't store the actual type OID, because this is
+ * only for use with RECORD types as created by CreateTupleDesc().  These are
+ * arranged into a linked list, in the hash table entry corresponding to the
+ * OIDs of the first 16 attributes, so we'd expect to get more than one entry
+ * in the list when named and other properties differ.
+ */
+typedef struct SerializedTupleDesc
+{
+   dsa_pointer next;   /* next with the same attribute OIDs */
+   int natts;  /* number of attributes in the tuple */
+   int32   typmod; /* typmod for tuple type */
+   bool    hasoid; /* tuple has oid attribute in its header */
+
+   /*
+* The attributes follow.  We only ever access the first
+* ATTRIBUTE_FIXED_PART_SIZE bytes of each element, like the code in
+* tupdesc.c.
+*/
+   FormData_pg_attribute attributes[FLEXIBLE_ARRAY_MEMBER];
+} SerializedTupleDesc;

Not a fan of a separate tupledesc representation, that's just going to
lead to divergence over time. I think we should rather change the normal
tupledesc representation to be 

Re: [HACKERS] Persistent wait event sets and socket changes

2017-07-31 Thread Andres Freund
Hi Craig,

On 2017-07-31 14:08:58 +0800, Craig Ringer wrote:
> Hi all
> 
> I've been looking into the wait event set interface added in 9.6 with an
> eye to using it in an extension that maintains a set of non-blocking libpq
> connections to other PostgreSQL instances.
> 
> In the process I've been surprised to find that there does not appear to be
> any interface to remove a socket once added to the wait event set, or
> replace it with a new socket.
> 
> ModifyWaitEvent(...) doesn't take a pgsocket. There doesn't seem to be a
> way to remove an event from the set either.

Yea, and what's even more annoying, you can't "disable" waiting for
readiness events on a socket, we've an assert that we're waiting for
something.


> Discussion in
> https://www.postgresql.org/message-id/flat/20160114143931.GG10941%40awork2.anarazel.de
> talks about a proposed SetSocketToWaitOn(...)/AddSocketToWaitSet and
> RemoveSocketFromWaitSet etc. But it looks like it petered out and went
> nowhere, apparently mainly due to not being needed by any current core
> users.

I think it just needs somebody to push this forward.


> I'd like to add such interfaces at some point, but for now can work around
> it by destroying and re-creating the wait event set when the fd-set
> changes. So I'm posting mostly to confirm that it's not supposed to work,
> and ask if anyone thinks I should submit a comment patch to latch.c
> documenting it.

It doesn't work, right. I'm not sure I see the point of adding comments
explaining that a nonexistent interface doesn't exist?
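
For anyone hitting the same limitation, the destroy-and-recreate
workaround looks roughly like this - a sketch against the existing
latch API, with error handling and the usual latch/postmaster-death
events omitted:

static WaitEventSet *
rebuild_wait_event_set(MemoryContext cxt, pgsocket *socks, int nsocks)
{
	WaitEventSet *set = CreateWaitEventSet(cxt, nsocks);
	int			i;

	for (i = 0; i < nsocks; i++)
		AddWaitEventToSet(set, WL_SOCKET_READABLE, socks[i], NULL, NULL);

	return set;
}

/* whenever the set of sockets changes: */
/*     FreeWaitEventSet(set); set = rebuild_wait_event_set(cxt, socks, n); */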

Greetings,

Andres Freund




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-30 Thread Andres Freund
Hi,

On 2017-07-29 16:14:08 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > [ 0002-Move-ExecProcNode-from-dispatch-to-function-pointer-.patch ]
> 
> Here's a reviewed version of this patch.

Thanks!  I pushed both now.


> I added dummy ExecProcNodeMtd functions to the various node types that
> lacked them because they expect to be called through MultiExecProcNode
> instead.  In the old coding, trying to call ExecProcNode on one of those
> node types would have led to a useful error message; as you had it,
> it'd have dumped core, which is not an improvement.

Ok, makes sense.


> Also, I removed the ExecReScan stanza from ExecProcNodeFirst; that
> should surely be redundant, because we should only get to that function
> through ExecProcNode().  If somehow it's not redundant, please add a
> comment explaining why not.

Makes sense too.


> Some other minor cosmetic changes, mostly comment wordsmithing.

Thanks!


Julien, could you quickly verify that everything's good for you now too?

Regards,

Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-29 Thread Andres Freund
On 2017-07-29 14:20:32 -0400, Tom Lane wrote:
> Here's a reviewed version of this patch.

Thanks!

> * I think you put ExecScan's CFI in the wrong place; AFAICT yours
> only covers its fast path.

Sure - but the old path already has a CFI? And it has to be inside the
loop, because well, the loop ;).


> * I think ExecAgg needs a CFI at the head, just to be sure it's hit
> in any path through that.

Yep, makes sense.


> * I agree that putting CFI inside ExecHashJoin's state machine loop
> is a good idea, because it might have to trawl through quite a lot of
> a batch file before finding a returnable tuple.  But I think in merge
> and nestloop joins it's sufficient to put one CFI at the head.  Neither
> of those cases can do very much processing without invoking a child
> node, where a CFI will happen.

Ok, I can live with that.


> * You missed ExecLockRows altogether.

Well, it directly calls the next ExecProcNode(), so I didn't think it
was necessary. One of the aforementioned judgement calls. But I'm
perfectly happy to have one there.

Thanks,

Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-28 Thread Andres Freund
On 2017-07-28 09:29:58 -0700, Noah Misch wrote:
> On Thu, Jul 27, 2017 at 10:08:57PM -0700, Andres Freund wrote:
> > For me that means the policy isn't quite right.  It's not like I can
> > force Tom to review the patch at a specific date. But the thread has
> > been progressing steadily over the last days, so I'm not particularly
> > concerned.
> 
> Your colleagues achieve compliance despite uncertainty; for inspiration, I
> recommend examining Alvaro's status updates as examples of this.  The policy
> currently governs your open items even if you disagree with it.

That's just process over substance.  Publishing repeated "I'll look at
it again in $small_n days" doesn't really provide something useful.

Anyway, I'll commit it after another pass in ~1 week if it doesn't get a
review till then, but I assume it'll.

Greetings,

Andres Freund




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-27 Thread Andres Freund
On 2017-07-27 22:04:59 -0700, Noah Misch wrote:
> On Thu, Jul 27, 2017 at 09:49:18PM -0700, Andres Freund wrote:
> > On 2017-07-27 21:46:57 -0700, Noah Misch wrote:
> > > On Thu, Jul 27, 2017 at 02:29:32AM +, Noah Misch wrote:
> > > > On Mon, Jul 24, 2017 at 08:04:30AM +0100, Andres Freund wrote:
> > > > > 
> > > > > 
> > > > > On July 24, 2017 7:10:19 AM GMT+01:00, Noah Misch <n...@leadboat.com> 
> > > > > wrote:
> > > > > >On Tue, Jul 18, 2017 at 01:04:10PM -0700, Andres Freund wrote:
> > > > > >> Ok, I'll flesh out the patch till Thursday.  But I do think we're
> > > > > >going
> > > > > >> to have to do something about the back branches, too.
> > > > > >
> > > > > >This PostgreSQL 10 open item is past due for your status update. 
> > > > > >Kindly send
> > > > > >a status update within 24 hours, and include a date for your 
> > > > > >subsequent
> > > > > >status
> > > > > >update.  Refer to the policy on open item ownership:
> > > > > >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > > > > 
> > > > > I sent out a note fleshed out patch last week, which Tom reviewed. 
> > > > > Planning to update it to address that review today or tomorrow.
> > > > 
> > > > This PostgreSQL 10 open item is past due for your status update.  
> > > > Kindly send
> > > > a status update within 24 hours, and include a date for your subsequent 
> > > > status
> > > > update.  Refer to the policy on open item ownership:
> > > > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > > 
> > > IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past 
> > > due
> > > for your status update.  Please reacquaint yourself with the policy on 
> > > open
> > > item ownership[1] and then reply immediately.  If I do not hear from you 
> > > by
> > > 2017-07-29 05:00 UTC, I will transfer this item to release management team
> > > ownership without further notice.
> > > 
> > > [1] 
> > > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > 
> > I've updated the patch based on review (today). Awaiting new review.
> > 
> > FWIW, I don't see the point of these messages when there is a new patch
> > version posted today.
> 
> The policy says, "Each update shall state a date when the community will
> receive another update".  Nothing you've sent today specifies a deadline for
> your next update, so your ownership of this item remains out of
> compliance.

For me that means the policy isn't quite right.  It's not like I can
force Tom to review the patch at a specific date. But the thread has
been progressing steadily over the last days, so I'm not particularly
concerned.

Greetings,

Andres Freund




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-27 Thread Andres Freund
On 2017-07-27 21:46:57 -0700, Noah Misch wrote:
> On Thu, Jul 27, 2017 at 02:29:32AM +, Noah Misch wrote:
> > On Mon, Jul 24, 2017 at 08:04:30AM +0100, Andres Freund wrote:
> > > 
> > > 
> > > On July 24, 2017 7:10:19 AM GMT+01:00, Noah Misch <n...@leadboat.com> 
> > > wrote:
> > > >On Tue, Jul 18, 2017 at 01:04:10PM -0700, Andres Freund wrote:
> > > >> Ok, I'll flesh out the patch till Thursday.  But I do think we're
> > > >going
> > > >> to have to do something about the back branches, too.
> > > >
> > > >This PostgreSQL 10 open item is past due for your status update. 
> > > >Kindly send
> > > >a status update within 24 hours, and include a date for your subsequent
> > > >status
> > > >update.  Refer to the policy on open item ownership:
> > > >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > > 
> > > I sent out a note fleshed out patch last week, which Tom reviewed. 
> > > Planning to update it to address that review today or tomorrow.
> > 
> > This PostgreSQL 10 open item is past due for your status update.  Kindly 
> > send
> > a status update within 24 hours, and include a date for your subsequent 
> > status
> > update.  Refer to the policy on open item ownership:
> > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
> for your status update.  Please reacquaint yourself with the policy on open
> item ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-07-29 05:00 UTC, I will transfer this item to release management team
> ownership without further notice.
> 
> [1] 
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I've updated the patch based on review (today). Awaiting new review.

FWIW, I don't see the point of these messages when there is a new patch
version posted today.




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-27 Thread Andres Freund
On 2017-07-26 16:28:38 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2017-07-26 15:03:37 -0400, Tom Lane wrote:
> >> Hm, that seems kinda backwards to me; I was envisioning the checks
> >> moving to the callees not the callers.  I think it'd end up being
> >> about the same number of occurrences of CHECK_FOR_INTERRUPTS(),
> >> and there would be less of a judgment call about where to put them.
> 
> > Hm, that seems a bit riskier - easy to forget one of the places where we
> > might need a CFI().
> 
> I would argue the contrary.  If we put a CFI at the head of each node
> execution function, then it's just boilerplate that you copy-and-paste
> when you invent a new node type.  The way you've coded it here, it
> seems to involve a lot of judgment calls.  That's very far from being
> copy and paste, and the more different it looks from one node type
> to another, the easier it will be to forget it.
> 
> > We certainly are missing a bunch of them in various nodes
> 
> It's certainly possible that there are long-running loops not involving
> any ExecProcNode recursion at all, but that would be a bug independent
> of this issue.  The CFI in ExecProcNode itself can be replaced exactly
> either by asking all callers to do it, or by asking all callees to do it.
> I think the latter is going to be more uniform and harder to screw up.

Looks a bit better.  Still a lot of judgement-y calls tho, e.g. when one
node function just calls the next, or when there's loops etc.   I found
a good number of missing CFIs...

What do you think?

- Andres
>From a8dd50100915f4bc889bc923d749d0661a88b973 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Tue, 25 Jul 2017 17:37:17 -0700
Subject: [PATCH 1/2] Move interrupt checking from ExecProcNode() to executor
 nodes.

In a followup commit ExecProcNode(), and especially the large switch
it contains, will be replaced by a function pointer directly to the
correct node. The node functions will then get invoked by a thin
inline function wrapper. To avoid having to include miscadmin.h in
headers - CHECK_FOR_INTERRUPTS() - move the interrupt checks into the
individual executor routines.

While looking through all executor nodes, I noticed a number of
missing interrupt checks, add these too.

Author: Andres Freund
Reviewed-By: Tom Lane
Discussion:
https://postgr.es/m/22833.1490390...@sss.pgh.pa.us
---
 src/backend/executor/execProcnode.c   | 2 --
 src/backend/executor/execScan.c   | 1 +
 src/backend/executor/nodeAgg.c| 5 +
 src/backend/executor/nodeAppend.c | 3 +++
 src/backend/executor/nodeBitmapHeapscan.c | 3 +++
 src/backend/executor/nodeCustom.c | 3 +++
 src/backend/executor/nodeGather.c | 4 
 src/backend/executor/nodeGatherMerge.c| 4 
 src/backend/executor/nodeGroup.c  | 3 +++
 src/backend/executor/nodeHash.c   | 6 ++
 src/backend/executor/nodeHashjoin.c   | 9 ++---
 src/backend/executor/nodeIndexonlyscan.c  | 3 +++
 src/backend/executor/nodeIndexscan.c  | 6 ++
 src/backend/executor/nodeLimit.c  | 3 +++
 src/backend/executor/nodeMaterial.c   | 2 ++
 src/backend/executor/nodeMergeAppend.c| 4 +++-
 src/backend/executor/nodeMergejoin.c  | 3 +++
 src/backend/executor/nodeModifyTable.c| 2 ++
 src/backend/executor/nodeNestloop.c   | 3 +++
 src/backend/executor/nodeProjectSet.c | 3 +++
 src/backend/executor/nodeRecursiveunion.c | 2 ++
 src/backend/executor/nodeResult.c | 3 +++
 src/backend/executor/nodeSetOp.c  | 5 +
 src/backend/executor/nodeSort.c   | 2 ++
 src/backend/executor/nodeSubplan.c| 5 +
 src/backend/executor/nodeTableFuncscan.c  | 2 ++
 src/backend/executor/nodeTidscan.c| 3 +++
 src/backend/executor/nodeUnique.c | 3 +++
 src/backend/executor/nodeWindowAgg.c  | 5 +
 29 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2cff9..20fd9f822e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -399,8 +399,6 @@ ExecProcNode(PlanState *node)
 {
 	TupleTableSlot *result;
 
-	CHECK_FOR_INTERRUPTS();
-
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 4f131b3ee0..dc40f6b699 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -139,6 +139,7 @@ ExecScan(ScanState *node,
 	 */
 	if (!qual && !projInfo)
 	{
+		CHECK_FOR_INTERRUPTS();
 		ResetExprContext(econtext);
 		return ExecScanFetch(node, accessMtd, recheckMtd);
 	}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index de9a18e

Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-26 Thread Andres Freund
On 2017-07-26 15:03:37 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > I've moved the CHECK_FOR_INTERRUPTS() to the callsites. That
> > unsurprisingly ends up being somewhat verbose, and there's a bunch of
> > minor judgement calls where exactly to place them. While doing so I've
> > also added a few extra ones.  Did this in a separate patch to make it
> > easier to review.
> 
> Hm, that seems kinda backwards to me; I was envisioning the checks
> moving to the callees not the callers.  I think it'd end up being
> about the same number of occurrences of CHECK_FOR_INTERRUPTS(),
> and there would be less of a judgment call about where to put them.

Hm, that seems a bit riskier - easy to forget one of the places where we
might need a CFI(). We certainly are missing a bunch of them in various
nodes - I tried to add ones I saw as missing, but it's quite some
code. Keeping them close to ExecProcNode() makes that call easier.  I'm
not quite seing how solely putting them in callees removes the judgement
call issue?

Greetings,

Andres Freund




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-25 Thread Andres Freund
Hi,

On 2017-07-24 13:27:58 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> >> * I think the comments need more work.  Am willing to make a pass over
> >> that if you want.
> 
> > That'd be good, but let's wait till we have something more final.
> 
> Agreed, I'll wait till you produce another version.

Attached. Did a bunch of cleanup myself already.


I've moved the CHECK_FOR_INTERRUPTS() to the callsites. That
unsurprisingly ends up being somewhat verbose, and there's a bunch of
minor judgement calls where exactly to place them. While doing so I've
also added a few extra ones.  Did this in a separate patch to make it
easier to review.

I'm pretty jetlagged right now, so I want to do another pass to make
sure I didn't forget any CFI()s, but the general shape looks right.

Tried to address the rest of your feedback too.

> >> * Can we redefine the ExecCustomScan function pointer as type
> >> ExecProcNodeCB, eliminating the unsightly cast in nodeCustom.c?
> 
> > That'd change an "extension API", which is why I skipped it at this
> > point of the release cycle. It's not like we didn't have this type of
> > cast all over before. Ok, with changing it, but that's where I came
> > down.
> 
> Is this patch really not changing anything else that a custom-scan
> extension would touch?  If not, I'm okay with postponing this bit
> of cleanup to v11.

FWIW, I've reintroduced ExecCustomScan() which I'd previously removed,
because it now contains a CHECK_FOR_INTERRUPTS(). So this seems moot.


Greetings,

Andres Freund
>From c70603ac35665ba78f0a83d0abbd080b05a9442d Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Tue, 25 Jul 2017 17:37:17 -0700
Subject: [PATCH 1/2] Move interrupt checking from ExecProcNode() to callers.

---
 contrib/postgres_fdw/postgres_fdw.c   |  2 ++
 src/backend/executor/execMain.c   |  2 ++
 src/backend/executor/execProcnode.c   |  2 --
 src/backend/executor/nodeAgg.c|  2 ++
 src/backend/executor/nodeAppend.c |  3 +++
 src/backend/executor/nodeCtescan.c|  2 ++
 src/backend/executor/nodeCustom.c |  3 +++
 src/backend/executor/nodeForeignscan.c|  3 +++
 src/backend/executor/nodeGather.c |  2 ++
 src/backend/executor/nodeGatherMerge.c|  2 ++
 src/backend/executor/nodeGroup.c  |  6 +
 src/backend/executor/nodeHash.c   |  4 +++
 src/backend/executor/nodeHashjoin.c   | 21 +--
 src/backend/executor/nodeLimit.c  |  4 +++
 src/backend/executor/nodeLockRows.c   |  2 ++
 src/backend/executor/nodeMaterial.c   |  2 ++
 src/backend/executor/nodeMergeAppend.c|  6 -
 src/backend/executor/nodeMergejoin.c  |  3 +++
 src/backend/executor/nodeModifyTable.c|  2 ++
 src/backend/executor/nodeNestloop.c   |  3 +++
 src/backend/executor/nodeProjectSet.c |  5 
 src/backend/executor/nodeRecursiveunion.c |  2 ++
 src/backend/executor/nodeResult.c |  3 +++
 src/backend/executor/nodeSetOp.c  |  9 +++
 src/backend/executor/nodeSort.c   |  4 +++
 src/backend/executor/nodeSubplan.c| 44 +++
 src/backend/executor/nodeSubqueryscan.c   |  3 +++
 src/backend/executor/nodeUnique.c |  3 +++
 src/backend/executor/nodeWindowAgg.c  |  8 +-
 29 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d77c2a70e4..0b2093f34b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2079,6 +2079,8 @@ postgresRecheckForeignScan(ForeignScanState *node, TupleTableSlot *slot)
 
 	Assert(outerPlan != NULL);
 
+	CHECK_FOR_INTERRUPTS();
+
 	/* Execute a local join execution plan */
 	result = ExecProcNode(outerPlan);
 	if (TupIsNull(result))
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 78cbcd1a32..a58a70e3f5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1542,6 +1542,8 @@ ExecPostprocessPlan(EState *estate)
 		{
 			TupleTableSlot *slot;
 
+			CHECK_FOR_INTERRUPTS();
+
 			/* Reset the per-output-tuple exprcontext each time */
 			ResetPerTupleExprContext(estate);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2cff9..20fd9f822e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -399,8 +399,6 @@ ExecProcNode(PlanState *node)
 {
 	TupleTableSlot *result;
 
-	CHECK_FOR_INTERRUPTS();
-
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index de9a18e71c..f9073e79aa 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executo

Re: [HACKERS] Create language syntax is not proper in pg_dumpall and not working using pg_upgrade

2017-07-25 Thread Andres Freund
On 2017-07-25 13:18:25 -0400, Robert Haas wrote:
> On Tue, Jul 25, 2017 at 1:13 PM, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-07-25 13:10:11 -0400, Robert Haas wrote:
> >> On Tue, Jul 25, 2017 at 1:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> >> Is this assumption, like, documented someplace?
> >> >
> >> > Uh, right there?
> >>
> >> I don't think we can expect end-users to read the code comments to
> >> determine whether their apparently-legal SQL is fully supported.
> >
> > I don't think plain end-users are going to create differently named PLs
> > using builtin handlers. There's plenty of special-casing of system objects
> > in pg_dump and elsewhere. Dependency tracking doesn't quite work right
> > if you refer to system objects either, etc.  This is superuser only
> > stuff, for a reason.
> 
> But superuser != developer.  Superusers aren't obliged to read the
> code comments any more than any other user.

And yet we tell them that they're to blame if they do a CREATE FUNCTION
with the wrong signature, or a DELETE FROM pg_class; or ...


> I think the only reason we don't get people whining about stuff like
> this more than we do is that it's pretty obscure.  But I bet if we
> look through the pgsql-bugs archives we can find people complaining
> about various cases where they did assorted seemingly-legal things
> that turned out not to be supported by pg_dump.  Whether this
> particular thing has been discovered by anyone before, I dunno.  But
> there's certainly a whole category of bug reports along the line of
> "pg_dump works mostly, except when I do X".

Yes, and?  We can try to address countless intentionally unsupported
edge-cases, but it's going to cost code, complexity and time. And very
likely it's going to add hard to find, test and address bugs. pg_dump is
complicated as is, I don't think trying to address every conceivable
weirdness is a good idea. There are plenty of more fundamental things wrong
(e.g. DDL concurrent with a dump sometimes breaking that dump).

I'm not sure what you're arguing for here.

- Andres




Re: [HACKERS] Create language syntax is not proper in pg_dumpall and not working using pg_upgrade

2017-07-25 Thread Andres Freund
On 2017-07-25 13:10:11 -0400, Robert Haas wrote:
> On Tue, Jul 25, 2017 at 1:06 PM, Tom Lane  wrote:
> >> Is this assumption, like, documented someplace?
> >
> > Uh, right there?
> 
> I don't think we can expect end-users to read the code comments to
> determine whether their apparently-legal SQL is fully supported.

I don't think plain end-users are going to create differently named PLs
using builtin handlers. There's plenty of special-casing of system objects
in pg_dump and elsewhere. Dependency tracking doesn't quite work right
if you refer to system objects either, etc.  This is superuser only
stuff, for a reason.

- Andres




Re: [HACKERS] POC: Sharing record typmods between backends

2017-07-25 Thread Andres Freund
On 2017-07-10 21:39:09 +1200, Thomas Munro wrote:
> Here's a new version that introduces a per-session DSM segment to hold
> the shared record typmod registry (and maybe more things later).

You like to switch it up. *.patchset.tgz??? ;)


It does concern me that we're growing yet another somewhat different
hashtable implementation. Yet I don't quite see how we could avoid
it. dynahash relies on proper pointers, simplehash doesn't do locking
(and shouldn't) and also relies on pointers, although to a much lesser
degree.  All the open coded tables aren't a good match either.  So I
don't quite see an alternative, but I'd love one.


Regards,

Andres

diff --git a/src/backend/lib/dht.c b/src/backend/lib/dht.c
new file mode 100644
index 000..2fec70f7742
--- /dev/null
+++ b/src/backend/lib/dht.c

FWIW, not a big fan of dht as a filename (nor of dsa.c). For one DHT
usually refers to distributed hash tables, which this is not, and for
another the abbreviation is so short it's not immediately
understandable, and likely to conflict further.  I think it'd possibly
ok to have dht as symbol prefixes, but rename the file to be longer.

+ * To deal with currency, it has a fixed size set of partitions, each of which
+ * is independently locked.

s/currency/concurrency/ I presume.


+ * Each bucket maps to a partition; so insert, find
+ * and iterate operations normally only acquire one lock.  Therefore, good
+ * concurrency is achieved whenever they don't collide at the lock partition

s/they/operations/?


+ * level.  However, when a resize operation begins, all partition locks must
+ * be acquired simultaneously for a brief period.  This is only expected to
+ * happen a small number of times until a stable size is found, since growth is
+ * geometric.

I'm a bit doubtful that we need partitioning at this point, and that it
doesn't actually *degrade* performance for your typmod case.


+ * Resizing is done incrementally so that no individual insert operation pays
+ * for the potentially large cost of splitting all buckets.

I'm not sure this is a reasonable tradeoff for the use-case suggested so
far, it doesn't exactly make things simpler. We're not going to grow
much.


+/* The opaque type used for tracking iterator state. */
+struct dht_iterator;
+typedef struct dht_iterator dht_iterator;

Isn't it actually the iterator state? Rather than tracking it? Also, why
is it opaque given you're actually defining it below? Guess you'd moved
it at some point.


+/*
+ * The set of parameters needed to create or attach to a hash table.  The
+ * members tranche_id and tranche_name do not need to be initialized when
+ * attaching to an existing hash table.
+ */
+typedef struct
+{
+   Size key_size;  /* Size of the key (initial bytes of entry) */
+   Size entry_size;/* Total size of entry */

Let's use size_t, like we kind of concluded in the thread you started:
http://archives.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us
:)

+   dht_compare_function compare_function;  /* Compare function */
+   dht_hash_function hash_function;/* Hash function */

Might be worth explaining that these need to be provided when attaching
because they're possibly process local. Did you test this with
EXEC_BACKEND?

+   int tranche_id; /* The tranche ID to use for locks. */
+} dht_parameters;


+struct dht_iterator
+{
+   dht_hash_table *hash_table; /* The hash table we are iterating over. */
+   bool exclusive; /* Whether to lock buckets exclusively. */
+   Size partition; /* The index of the next partition to visit. */
+   Size bucket;    /* The index of the next bucket to visit. */
+   dht_hash_table_item *item;  /* The most recently returned item. */
+   dsa_pointer last_item_pointer; /* The last item visited. */
+   Size table_size_log2;   /* The table size when we started iterating. */
+   bool locked;    /* Whether the current partition is locked. */

Haven't gotten to the actual code yet, but this kinda suggests we leave a
partition locked when iterating? Hm, that seems likely to result in a
fair bit of pain...


+/* Iterating over the whole hash table. */
+extern void dht_iterate_begin(dht_hash_table *hash_table,
+ dht_iterator *iterator, bool exclusive);
+extern void *dht_iterate_next(dht_iterator *iterator);
+extern void dht_iterate_delete(dht_iterator *iterator);

s/delete/delete_current/? Otherwise it looks like it's part of
manipulating just the iterator.

+extern void dht_iterate_release(dht_iterator *iterator);

I'd add lock to the name.


+/*
+ * An item in the hash table.  This wraps the user's entry object in an
+ * envelop that holds a pointer back to the bucket and a pointer to the next
+ * item in the bucket.
+ */
+struct 

Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-24 Thread Andres Freund
On 2017-07-24 13:27:58 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> >> I seriously, seriously, seriously dislike that.  You practically might as
> >> well put miscadmin.h into postgres.h.  Instead, what do you think of
> >> requiring the individual ExecProcNode functions to perform
> >> CHECK_FOR_INTERRUPTS?  Since they're already responsible for doing that
> >> if they have any long-running internal loops, this doesn't seem like a
> >> modularity violation.  It is a risk for bugs-of-omission, sure, but so
> >> are a lot of other things that the per-node code has to do.
> 
> > That'd work. Another alternative would be to move the inline definition
> > of ExecProcNode() (and probably a bunch of other related functions) into
> > a more internals oriented header. It seems likely that we're going to
> > add more inline functions to the executor, and that'd reduce the
> > coupling of external and internal users a bit.
> 
> Well, it still ends up that the callers of ExecProcNode need to include
> miscadmin.h, whereas if we move it into the per-node functions, then the
> per-node files need to include miscadmin.h.  I think the latter is better
> because those files may need to have other CHECK_FOR_INTERRUPTS calls
> anyway.

> It's less clear from a modularity standpoint that executor
> callers should need miscadmin.h.

Well, that's why I'm pondering an executor_internal.h or something -
there shouldn't be ExecProcNode() callers that don't also need CFI(),
and no executor callers should need ExecProcNode(). executor.h right now
really defines infrastructure to *use* the executor
(Executor{Start,Run,Finish,End,Rewind}), functions internal to the
executor (lots of initialization functions, EPQ, partition logic), some
things inbetween (e.g. expression related stuff), and some things that
really should be separate ExecOpenIndices etc, execReplication.c
functions. But that's not something we can easily clear up just now.


> (Or in short, I'm not really okay with *any* header file including
> miscadmin.h.)

Perhaps that's a sign that we should split it up? It's a weird grab bag
atm. utils/interrupt.h or such would e.g. make sense for the
*INTERRUPTS, and *CRIT_SECTION macros, as well as ProcessInterrupts()
itself, which imo isn't super well placed in postgres.c
either. Including utils/interrupt.h in a header would be much less
odious in my opinion than including miscadmin.h.


> >> * Can we redefine the ExecCustomScan function pointer as type
> >> ExecProcNodeCB, eliminating the unsightly cast in nodeCustom.c?
> 
> > That'd change an "extension API", which is why I skipped it at this
> > point of the release cycle. It's not like we didn't have this type of
> > cast all over before. Ok, with changing it, but that's where I came
> > down.
> 
> Is this patch really not changing anything else that a custom-scan
> extension would touch?  If not, I'm okay with postponing this bit
> of cleanup to v11.

Not that I can see - I've built & tested citus, which uses custom scans
these days with and without patch without trouble.  Nor do I see any
change in the current patch that'd be troublesome - after all the
API of ExecProcNode() stays the same.

- Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-24 Thread Andres Freund
Hi,

On 2017-07-21 20:17:54 -0400, Tom Lane wrote:
> > I dislike having the miscadmin.h include in executor.h - but I don't
> > quite see a better alternative.
> 
> I seriously, seriously, seriously dislike that.  You practically might as
> well put miscadmin.h into postgres.h.  Instead, what do you think of
> requiring the individual ExecProcNode functions to perform
> CHECK_FOR_INTERRUPTS?  Since they're already responsible for doing that
> if they have any long-running internal loops, this doesn't seem like a
> modularity violation.  It is a risk for bugs-of-omission, sure, but so
> are a lot of other things that the per-node code has to do.

That'd work. Another alternative would be to move the inline definition
of ExecProcNode() (and probably a bunch of other related functions) into
a more internals oriented header. It seems likely that we're going to
add more inline functions to the executor, and that'd reduce the
coupling of external and internal users a bit.


> * I think the comments need more work.  Am willing to make a pass over
> that if you want.

That'd be good, but let's wait till we have something more final.


> * In most places, if there's an immediate return-if-trivial-case test,
> we check stack depth only after that.  There's no point in checking
> and then returning; either you already crashed, or you're at peak
> stack so far as this code path is concerned.

I went back/forth a bit on that one. The calling code might call other
functions that go deeper on the stack, which won't have the checks. Fine
with moving, just wanted to explain why I got there.


> * Can we redefine the ExecCustomScan function pointer as type
> ExecProcNodeCB, eliminating the unsightly cast in nodeCustom.c?

That'd change an "extension API", which is why I skipped it at this
point of the release cycle. It's not like we didn't have this type of
cast all over before. Ok, with changing it, but that's where I came
down.


> * The various callers of ExecScan() are pretty inconsistently coded.
> I don't care that much whether they use castNode() or just forcibly
> cast to ScanState*, but let's make them all look the same.

I tried changing the minimum; perfectly fine to move to castNode in a
wholesale manner.  Btw, I really want to get rid of ExecScan(), at least
as an external function. Does a lot of unnecessary things in a lot of
cases, and makes branch prediction a lot worse. Not v10 stuff tho.


> * I believe the usual term for what these function pointers are is
> "methods", not "callbacks".  Certainly we'd call them that if we
> were working in C++.

K.

- Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-24 Thread Andres Freund


On July 24, 2017 7:10:19 AM GMT+01:00, Noah Misch <n...@leadboat.com> wrote:
>On Tue, Jul 18, 2017 at 01:04:10PM -0700, Andres Freund wrote:
>> Ok, I'll flesh out the patch till Thursday.  But I do think we're
>going
>> to have to do something about the back branches, too.
>
>This PostgreSQL 10 open item is past due for your status update. 
>Kindly send
>a status update within 24 hours, and include a date for your subsequent
>status
>update.  Refer to the policy on open item ownership:
>https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I sent out a more fleshed out patch last week, which Tom reviewed. Planning to 
update it to address that review today or tomorrow.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [COMMITTERS] pgsql: Add a Gather executor node.

2017-07-21 Thread Andres Freund


On July 21, 2017 5:34:32 PM GMT+02:00, Alvaro Herrera wrote:
>Robert Haas wrote:
>
>> I think those top-of-file lists are just annoying, and would prefer
>to
>> rip them out entirely rather than spend time maintaining them.
>
>+1

I'm quite sympathetic to that view. But I think it's either maintaining them
or ripping them out, not leaving them to bit rot.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [PATCH] Make sure all statistics is sent after a few DML are performed

2017-07-21 Thread Andres Freund
Hi,

(please don't top-reply on this list)

On 2017-07-19 14:04:39 +0900, Yugo Nagata wrote:
> On Tue, 18 Jul 2017 10:10:49 -0400
> Tom Lane  wrote:
> 
> Thank you for your comments. I understand the problem of my proposal
> patch.

Does that mean you're trying to rewrite it in the way that was
suggested:

> > > Another,
> > > pretty half-baked, approach would be to add a procsignal triggering idle
> > > backends to send stats, and send that to all idle backends when querying
> > > stats. We could even publish the number of outstanding stats updates in
> > > PGXACT or such, without any locking, and send it only to those that have
> > > outstanding ones.
> > 
> > If somebody wanted to do the work, that'd be a viable answer IMO.  You'd
> > really want to not wake backends that have nothing more to send, but
> > I agree that it'd be possible to advertise that in shared memory.

or are you planning to just leave the issue be?

- Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-21 Thread Andres Freund
On 2017-07-18 16:53:43 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > ...  If we were to go this route we'd have to probably move
> > the callback assignment into the ExecInit* routines, and possibly
> > replace the switch in ExecInitNode() with another callback, assigned in
> > make_*, and implement callbacks for ExecShutdown, MultiExecProcNode etc.
> 
> BTW, I don't see why you really need to mess with anything except
> ExecProcNode?  Surely the other cases such as MultiExecProcNode are
> not called often enough to justify changing them away from the
> switch technology.  Yeah, maybe it would be a bit cleaner if they
> all looked alike ... but if you're trying to make a patch that's
> as little invasive as possible for v10, I'd suggest converting just
> ExecProcNode to this style.

Yea, I was making that statement when not aiming for v10.  Attached is a
patch that converts just ExecProcNode. The big change in comparison to
the earlier patch is that the assignment of the callback is now done in
the respective ExecInit* routines. As a consequence the ExecProcNode
callbacks now are static.  I think we should do this fully in v11, I
removing dispatch routines like ExecInitNode() is a good idea, both
because it moves concerns more towards the nodes themselves - given the
growth of executor nodes that strikes me as a good idea.

I've also added stack depth checks to ExecEndNode(),
MultiExecProcNode(), ExecShutdownNode(), because it's not guaranteed
that ExecProcNode is called for every node...

I dislike having the miscadmin.h include in executor.h - but I don't
quite see a better alternative.

I want to run this through pgindent before merging, otherwise we'll
presumably end up with a lot of noise.

I still think we should backpatch at least the check_stack_depth() calls
in ExecInitNode(), ExecEndNode().

Greetings,

Andres Freund
>From 120218e4c6aab1c716aab12495c1a466baa8e37a Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 17 Jul 2017 00:33:49 -0700
Subject: [PATCH] Move ExecProcNode from dispatch to callback based model.

This allows us to add stack-depth checks the first time an executor
node is called, and skip that overhead on following calls.
Additionally it yields a nice speedup.

We should move towards that model for further routines, but as this is
required for v10, it seems better to only do the necessary (which
already is quite large).

Todo: Expand (scope, need in v10) & link.

Author: Andres Freund
Reported-By: Julien Rouhaud
Discussion:
https://postgr.es/m/22833.1490390...@sss.pgh.pa.us
https://postgr.es/m/b0af9eaa-130c-60d0-9e4e-7a135b1e0...@dalibo.com
---
 src/backend/executor/execProcnode.c| 257 +++--
 src/backend/executor/nodeAgg.c |   6 +-
 src/backend/executor/nodeAppend.c  |   8 +-
 src/backend/executor/nodeBitmapHeapscan.c  |   7 +-
 src/backend/executor/nodeCtescan.c |   7 +-
 src/backend/executor/nodeCustom.c  |   8 +-
 src/backend/executor/nodeForeignscan.c |   7 +-
 src/backend/executor/nodeFunctionscan.c|   7 +-
 src/backend/executor/nodeGather.c  |   7 +-
 src/backend/executor/nodeGatherMerge.c |   7 +-
 src/backend/executor/nodeGroup.c   |   6 +-
 src/backend/executor/nodeHashjoin.c|   6 +-
 src/backend/executor/nodeIndexonlyscan.c   |   7 +-
 src/backend/executor/nodeIndexscan.c   |   7 +-
 src/backend/executor/nodeLimit.c   |   6 +-
 src/backend/executor/nodeLockRows.c|   6 +-
 src/backend/executor/nodeMaterial.c|   6 +-
 src/backend/executor/nodeMergeAppend.c |   7 +-
 src/backend/executor/nodeMergejoin.c   |   6 +-
 src/backend/executor/nodeModifyTable.c |   6 +-
 src/backend/executor/nodeNamedtuplestorescan.c |   7 +-
 src/backend/executor/nodeNestloop.c|   6 +-
 src/backend/executor/nodeProjectSet.c  |   6 +-
 src/backend/executor/nodeRecursiveunion.c  |   6 +-
 src/backend/executor/nodeResult.c  |   6 +-
 src/backend/executor/nodeSamplescan.c  |   7 +-
 src/backend/executor/nodeSeqscan.c |   7 +-
 src/backend/executor/nodeSetOp.c   |   6 +-
 src/backend/executor/nodeSort.c|   6 +-
 src/backend/executor/nodeSubqueryscan.c|   7 +-
 src/backend/executor/nodeTableFuncscan.c   |   7 +-
 src/backend/executor/nodeTidscan.c |   7 +-
 src/backend/executor/nodeUnique.c  |   6 +-
 src/backend/executor/nodeValuesscan.c  |   7 +-
 src/backend/executor/nodeWindowAgg.c   |   6 +-
 src/backend/executor/nodeWorktablescan.c   |   7 +-
 src/include/executor/executor.h|  28 ++-
 src/include/executor/nodeAgg.h |   1 -
 src/include/executor/nodeAppend.h  |   1 

Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-18 Thread Andres Freund
On 2017-07-18 15:45:53 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > Attached is a trivial patch implementing 1) and a proof-of-concept hack
> > for 2).
> 
> The comment you made previously, as well as the proposed commit message
> for 0001, suggest that you've forgotten that pre-v10 execQual.c had
> several check_stack_depth() calls.  Per its header comment,

>  *	During expression evaluation, we check_stack_depth only in
>  *	ExecMakeFunctionResult (and substitute routines) rather than at every
>  *	single node.  This is a compromise that trades off precision of the
>  *	stack limit setting to gain speed.

No, I do remember that fact. But
a) unfortunately that's not really that useful because by far not all
   function calls go through ExecMakeFunctionResult()
   (e.g. ExecEvalDistinct() and a bunch of other FunctionCallInvoke()
   containing functions).
b) deeply nested executor nodes - and that's what the commit's about to
   a good degree - aren't necessarily guaranteed to actually evaluate
   expressions. There's no guarantee there's any expressions (you could
   just nest joins without conditions), and a bunch of nodes like
   hashjoins invoke functions themselves.


> and there was also a check in the recursive ExecInitExpr routine.

Which still is there.


> While it doesn't seem to be documented anywhere, I believe that we'd
> argued that ExecProcNode and friends didn't need their own stack depth
> checks because any nontrivial node would surely do expression evaluation
> which would contain a check.

Yea, I don't buy that at all.


> I agree that adding a check to ExecInitNode() is really necessary,
> but I'm not convinced that it's a sufficient substitute for checking
> in ExecProcNode().  The two flaws I see in that argument are
> 
> (a) you've provided no hard evidence backing up the argument that node
> initialization will take strictly more stack space than node execution;
> as far as I can see that's just wishful thinking.

I think it's pretty likely to be roughly (within slop) the case in
realistic scenarios, but I do feel fairly uncomfortable about the
assumption. That's why I really want to do something like what I'm
proposing in the second patch. I just don't think we can realistically
add the check to the back branches, given that it's quite measurable
performancewise.


> (b) there's also an implicit assumption that ExecutorRun is called from
> a point not significantly more deeply nested than the corresponding
> call to ExecutorStart.  This seems also to depend on nothing much better
> than wishful thinking.  Certainly, many ExecutorRun calls are right next
> to ExecutorStart, but several of them aren't; the portal code and
> SQL-function code are both scary in this respect.

:/


> > The latter obviously isn't ready, but might make clearer what I'm
> > thinking about. If we were to go this route we'd have to probably move
> > the callback assignment into the ExecInit* routines, and possibly
> > replace the switch in ExecInitNode() with another callback, assigned in
> > make_*, and implement callbacks for ExecShutdown, MultiExecProcNode etc.
> 
> While I'm uncomfortable with making such a change post-beta2, I'm one
> heck of a lot *more* uncomfortable with shipping v10 with no stack depth
> checks in the executor mainline.  Fleshing this out and putting it in
> v10 might be an acceptable compromise.

Ok, I'll flesh out the patch till Thursday.  But I do think we're going
to have to do something about the back branches, too.

Greetings,

Andres Freund




Re: [HACKERS] [PATCH] Make sure all statistics is sent after a few DML are performed

2017-07-18 Thread Andres Freund
Hi,


On 2017-07-18 09:42:31 -0400, Tom Lane wrote:
> I wonder if a better answer wouldn't be to reduce PGSTAT_STAT_INTERVAL.
> I don't think that value has been reconsidered since the code was written,
> circa turn of the century.  Maybe even make it configurable, though that
> could be overkill.

Not sure if that really does that much to solve the concern.  Another,
pretty half-baked, approach would be to add a procsignal triggering idle
backends to send stats, and send that to all idle backends when querying
stats. We could even publish the number of outstanding stats updates in
PGXACT or such, without any locking, and send it only to those that have
outstanding ones.

Greetings,

Andres Freund




Re: [HACKERS] [PATCH] Make sure all statistics is sent after a few DML are performed

2017-07-18 Thread Andres Freund
Hi,

On 2017-07-18 21:49:27 +0900, Yugo Nagata wrote:
> After a DML is performed, the statistics are sent to pgstat by
> pgstat_report_stat(). However, the minimum time between stats file updates
> is PGSTAT_STAT_INTERVAL, so if a few DMLs are performed within a short
> interval, some statistics might not be sent until the backend shuts down.
> 
> This is not a problem in usual cases, but if a session remains idle for a
> long time, for example when using connection pooling, statistics about a
> huge change to a table are not sent for a long time, and as a result,
> starting autovacuum might be delayed.

Hm.


> An idea to resolve this is to call pgstat_report_stat() again with a delay
> after the last DML to make sure the statistics are sent. The attached patch
> implements this.

That seems like it'd add a good number of new wakeups, or at least
scheduling of wakeups.  Now a timer would be armed whenever a query is
run outside of a transaction - something happening quite regularly. And
it'd do so even if a) there are no outstanding stat reports b) there's
already a timer running.  Additionally it'd, unless I mis-skimmed,
trigger pgstat_report_stat() from within
CHECK_FOR_INTERRUPTS/ProcessInterrupts even when inside a transaction,
if that transaction is started after the enable_timeout_after().


I'm not entirely sure what a good solution would be.  One option would
be to have a regular wakeup setting a flag (considerably longer than
500ms, that's way too many additional wakeups), rearm the timer in the
handler, but only process the flag when outside of a transaction.
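
Very roughly, and entirely untested (where exactly this would hook into the
idle loop, and the concrete interval, are just assumptions for illustration):

static volatile sig_atomic_t stats_report_pending = false;
static TimeoutId stats_report_timeout;

static void
StatsReportTimeoutHandler(void)
{
	/* just note that a report is due; the real work happens later */
	stats_report_pending = true;
}

/* once, at backend startup */
static void
setup_stats_report_timer(void)
{
	stats_report_timeout = RegisterTimeout(USER_TIMEOUT, StatsReportTimeoutHandler);
	enable_timeout_after(stats_report_timeout, 10000);	/* e.g. 10s, not 500ms */
}

/* called from the idle loop, only between transactions */
static void
maybe_report_stats(void)
{
	if (stats_report_pending && !IsTransactionOrTransactionBlock())
	{
		stats_report_pending = false;
		pgstat_report_stat(false);
		enable_timeout_after(stats_report_timeout, 10000);	/* rearm */
	}
}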

Or we could do nothing - I actually think that's a viable option.

- Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-18 Thread Andres Freund
On 2017-07-17 23:04:43 +0200, Julien Rouhaud wrote:
> On 17/07/2017 16:57, Andres Freund wrote:
> > The latter obviously isn't ready, but might make clearer what I'm
> > thinking about.
> 
> It did for me, and FWIW I like this approach.

Cool.


> > If we were to go this route we'd have to probably move
> > the callback assignment into the ExecInit* routines, and possibly
> > replace the switch in ExecInitNode() with another callback, assigned in
> > make_*, and implement callbacks for ExecShutdown, MultiExecProcNode etc.
> > 
> > This results in a good speedup in tpc-h btw:
> > master total min: 46434.897 cb min: 44778.228 [diff -3.70]
> 
> Is it v11 material or is there any chance to make it in v10?

I think it'd make sense to apply the first to v10 (and earlier), and the
second to v11.  I think this isn't a terribly risky patch, but it's
still a relatively large change for this point in the development
cycle.  I'm willing to reconsider, but that's my default.

- Andres




Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-17 Thread Andres Freund
Hi,

On 2017-07-17 00:26:29 -0700, Andres Freund wrote:
> I'm less convinced of that, due to the overhead argument.  I think
> there's a couple ways around that however:
> 
> 1) We could move ExecProcNode() to be callback based. The first time a
>node is executed a "wrapper" function is called that does the stack
>and potentially other checks. That also makes ExecProcNode() small
>enough to be inlined, which ends up being good for jump target
>prediction.   I've done something similar for v11 for expression
>evaluation, getting rid of EEOP_*_FIRST duplication etc, and it seems
>to work well.  The big disadvantage to that is that it's a bit
>invasive for v10, and very likely too invasive to backpatch.
> 2) I think there's some fair argument to be made that ExecInitNode()'s
>stack-space needs are similar enough to ExecProcNode()'s allowing us
>to put a check_stack_depth() into the former.  That seems like it's
>required anyway, since in many cases that's going to trigger
>stack-depth exhaustion first anyway (unless we hit it in parse
>analysis, which also seems quite common).

Attached is a trivial patch implementing 1) and a proof-of-concept hack
for 2).

The latter obviously isn't ready, but might make clearer what I'm
thinking about. If we were to go this route we'd have to probably move
the callback assignment into the ExecInit* routines, and possibly
replace the switch in ExecInitNode() with another callback, assigned in
make_*, and implement callbacks for ExecShutdown, MultiExecProcNode etc.

This results in a good speedup in tpc-h btw:
master total min: 46434.897 cb min: 44778.228 [diff -3.70]

- Andres
>From 06b88ddf63427e1f6aeb49feb48916c3351e3380 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 17 Jul 2017 00:33:49 -0700
Subject: [PATCH 1/2] Check stack-depth when initializing executor nodes.

Previously we only checked stack-depth during parse-analysis, but
that's not necessarily sufficient. In v10, where targetlist SRF
evaluation now is executor node based, this can indeed cause
problems. It's quite possible that earlier versions are also affected
in a bit different manner, or might become vulnerable due to future
changes.

To shore this up, add a stack-depth check to ExecInitNode(). As that's
called in the same recursive manner as ExecProcNode() /
MultiExecProcNode(), it should have similar stack usage as those,
without incurring noticeable overhead in queries processing many rows.

Author: Andres Freund
Reported-By: Julien Rouhaud
Discussion:
https://postgr.es/m/22833.1490390...@sss.pgh.pa.us
https://postgr.es/m/b0af9eaa-130c-60d0-9e4e-7a135b1e0...@dalibo.com
Backpatch: 9.2-, issue has existed for a long while
---
 src/backend/executor/execProcnode.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2cff9..360479d081 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -143,6 +143,15 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 	List	   *subps;
 	ListCell   *l;
 
+	/*
+	 * Make sure there's enough stack available. Check that here, rather than
+	 * ExecProcNode() / MultiExecProcNode(), to avoid incurring overhead for
+	 * every single tuple. The assumption making this valid is that the
+	 * difference in stack use between node initialization and execution
+	 * should be far less than the safety margin from check_stack_depth().
+	 */
+	check_stack_depth();
+
 	/*
 	 * do nothing when we get to the end of a leaf on tree.
 	 */
-- 
2.13.1.392.g8d1b10321b.dirty

>From 2321fb83a8973c970ec934d3c1f121e739a7a20a Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 17 Jul 2017 07:50:40 -0700
Subject: [PATCH 2/2] WIP: Move to callback based executor nodes.

---
 src/backend/executor/execProcnode.c | 436 +++-
 src/include/executor/executor.h |  30 ++-
 src/include/nodes/execnodes.h   |  13 ++
 3 files changed, 118 insertions(+), 361 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 360479d081..9d8727f41c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,15 @@
 #include "miscadmin.h"
 
 
+
+static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
+static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+
+#define INIT_CB(node, name) \
+	(node)->ExecProcNode = ExecProcNodeFirst; \
+	(node)->ExecProcNodeReal = (ExecProcNodeCB) Exec##name; \
+	(node)->ExecEndNode = (ExecEndNodeCB) ExecEnd##name
+
 /* 
  *		ExecInitNode
  *
@@ -166,41 +175,49 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 		case T_Result:
 			result = (PlanS

Re: [HACKERS] segfault in HEAD when too many nested functions call

2017-07-17 Thread Andres Freund
Hi,

On 2017-07-15 11:22:37 -0400, Tom Lane wrote:
> The thread drifted off without any resolution, but clearly we need to
> do something before 10.0 final.

We need to do something, I'm less convinced that it's really v10
specific :/


> "SELECT so()" has a nontrivial targetlist so we end up running
> ExecBuildProjectionInfo on that, meaning that a fresh expression
> compilation happens at every nesting depth, and there are
> check_stack_depth calls in expression compilation.  Surely that's
> something we'd try to improve someday.  Or it could accidentally get
> broken by unrelated changes in the way plpgsql sets up queries to be
> executed.

Independent of $subject: What are you thinking of here? You want to
avoid the ExecBuildProjectionInfo() in more cases - I'm unconvinced
that's actually helpful. In my WIP JIT compilation patch the
ExecBuildProjectionInfo() ends up being a good bit faster than paths
avoiding it.


> I still think that we really need to add a check in ExecProcNode().
> Even if there's an argument to be made that every recursion would
> somewhere go through ExecMakeTableFunctionResult, very large/complex
> queries could result in substantial stack getting chewed up before
> we get to that --- and we don't have an infinite amount of stack slop.

I'm less convinced of that, due to the overhead argument.  I think
there's a couple ways around that however:

1) We could move ExecProcNode() to be callback based. The first time a
   node is executed a "wrapper" function is called that does the stack
   and potentially other checks. That also makes ExecProcNode() small
   enough to be inlined, which ends up being good for jump target
   prediction.   I've done something similar for v11 for expression
   evaluation, getting rid of EEOP_*_FIRST duplication etc, and it seems
   to work well.  The big disadvantage to that is that it's a bit
   invasive for v10, and very likely too invasive to backpatch.
2) I think there's some fair argument to be made that ExecInitNode()'s
   stack-space needs are similar enough to ExecProcNode()'s allowing us
   to put a check_stack_depth() into the former.  That seems like it's
   required anyway, since in many cases that's going to trigger
   stack-depth exhaustion first anyway (unless we hit it in parse
   analysis, which also seems quite common).
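
To make 1) a bit more concrete, the shape I have in mind is roughly the
following (simplified sketch; the real thing also has to deal with
instrumentation, chgParam etc., and the field names just follow the WIP
patch, nothing is committed):

static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
	/* pay for the check only once, on the first execution of this node */
	check_stack_depth();

	/* switch to the real, check-free method for all subsequent calls */
	node->ExecProcNode = node->ExecProcNodeReal;

	return node->ExecProcNode(node);
}

/* ExecProcNode() itself then becomes trivially inlinable: */
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
	return node->ExecProcNode(node);
}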

Greetings,

Andres Freund




Re: [HACKERS] building libpq.a static library

2017-07-12 Thread Andres Freund
On 2017-07-12 23:55:56 +0100, Greg Stark wrote:
> Fwiw I think the real problem is that building static libraries
> "properly" requires different compiler options -- notably they're not
> normally built with -fPIC. So that means building every object twice
> which kind of breaks make's build model which has a simple dependency
> graph where each object appears once. Some packages do this by
> inventing a foo-shared.o and foo-static.o but that introduces its own
> weirdness.
> 
> I don't know what the downsides would be of creating a static library
> out of objects built with -fPIC. It might just be a small performance
> penalty which might be no big deal for libpq. That may be a good
> compromise.

FWIW, most linux distributions build everything with -fPIC/PIE anyway
these days, to allow address space randomization. So I don't think this
is a huge concern for modern platforms.

- Andres




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-07 Thread Andres Freund
On 2017-07-07 12:54:33 +0200, Michael Meskes wrote:
> > Some people use http://libpqtypes.esilo.com/
>
> Never before saw this. It does not seem to have more in common than the
> name, though.

It has binary to text conversion functionality for various types.  I
don't exactly know what Kaare needs, but that seems like a relevant
need.

- Andres




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-06 Thread Andres Freund
Hi,

On 2017-07-06 20:26:29 +0200, Michael Meskes wrote:
> > But is this possible, or feasible? I see that the makefile refers to 
> 
> Possible yes, but in general I'm not a big fan of duplicating code. I
> spend too much time to keep those copies in sync.

Indeed. I'm quite strongly against exposing / using pgtypeslib in more
places. If anything it should be phased out. Because that code is
definitely not always kept up2date, and it's a lot of work to do so.  If
anything the code should be rewritten to *not* require so much
duplication, then we could talk.


> > How much of the source tree would I have to carve out?
> > Or from another perspective; how do other language (if any) solve this?
> 
> I'm not aware of any other language binding for pgtypeslib.

Some people use http://libpqtypes.esilo.com/

Greetings,

Andres Freund




Re: [BUGS] [HACKERS] Segmentation fault in libpq

2017-07-02 Thread Andres Freund
Hi,

On 2017-07-02 20:58:52 +0200, Michal Novotný wrote:
> Thank you all for your advice. I've been investigating this a little more,
> and it finally turned out it's not a bug in libpq, although I got confused
> by digging deep into several libpq functions. The bug was really on our
> side: we were using the connection pointer after calling PQfinish(). The
> code is pretty complex, so it took some time to investigate; I would like
> to apologize for "blaming" libpq instead of our code.

Usually using a tool like valgrind is quite helpful to find issues like
that, because it'll show you the call-stack accessing the memory and
*also* the call-stack that led to the memory being freed.
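
For anyone else debugging something similar, the pattern is trivial to
reproduce (purely illustrative example):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

	PQfinish(conn);				/* frees the PGconn */

	/*
	 * BUG: conn is dangling from here on.  Under valgrind this shows up as
	 * an invalid read, together with the stack that free()d the memory
	 * (pointing into PQfinish()).
	 */
	PQexec(conn, "SELECT 1");

	return 0;
}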

- Andres




Re: [HACKERS] Apparent walsender bug triggered by logical replication

2017-06-29 Thread Andres Freund
Hi,

On 2017-06-29 20:07:14 -0400, Tom Lane wrote:
> I was able to make the hang go away by means of the attached patch that
> allows WalSndWaitForWal to exit early if the client has shut down the
> COPY.  However, since that function is miserably underdocumented (like
> most of this code :-(), I have little idea if this is correct or safe.

Seems reasonable to me.


> I also wonder why WalSndWaitForWal is being called for WAL that seemingly
> doesn't exist yet, and whether that doesn't indicate another bug somewhere
> in this stack.

That's pretty normal - we can only send back something once a
transaction is complete, and until that happens we'll just block waiting
for more WAL.

Greetings,

Andres Freund




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-29 Thread Andres Freund


On June 29, 2017 10:19:46 AM PDT, Jeff Janes <jeff.ja...@gmail.com> wrote:
>On Tue, Jun 27, 2017 at 11:59 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> I wrote:
>> > Andres Freund <and...@anarazel.de> writes:
>> >> On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
>> >>> Hm.  Take that a bit further, and we could drop the connection
>probes
>> >>> altogether --- just put the whole responsibility on the
>postmaster to
>> >>> show in the pidfile whether it's ready for connections or not.
>>
>> >> Yea, that seems quite appealing, both from an architectural,
>simplicity,
>> >> and log noise perspective. I wonder if there's some added
>reliability by
>> >> the connection probe though? Essentially wondering if it'd be
>worthwhile
>> >> to keep a single connection test at the end. I'm somewhat
>disinclined
>> >> though.
>>
>> > I agree --- part of the appeal of this idea is that there could be
>a net
>> > subtraction of code from pg_ctl.  (I think it wouldn't have to link
>libpq
>> > anymore at all, though maybe I forgot something.)  And you get rid
>of a
>> > bunch of can't-connect failure modes, eg kernel packet filter in
>the way,
>> > which probably outweighs any hypothetical reliability gain from
>> confirming
>> > the postmaster state the old way.
>>
>> Here's a draft patch for that.  I quite like the results --- this
>seems
>> way simpler and more reliable than what pg_ctl has done up to now.
>> However, it's certainly arguable that this is too much change for an
>> optional post-beta patch.
>
>
>In the now-committed version of this, the 'pg_ctl start' returns
>successfully as soon as the server reaches a consistent state. Which is
>OK,
>except that it does the same thing when hot_standby=off.  When
>hot_standby=off, I would expect it to wait for the end of recovery
>before
>exiting with a success code.

I've not looked at the committed version, but earlier versions had code dealing 
with that difference; essentially doing what you suggest.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().

2017-06-28 Thread Andres Freund
On 2017-06-28 19:07:50 +1200, Thomas Munro wrote:
> I think this line is saying that it won't restart automatically:
> 
> https://github.com/torvalds/linux/blob/590dce2d4934fb909b112cd80c80486362337744/mm/shmem.c#L2884

Indeed.


> So I think we either need to mask signals with or put in an explicit
> retry loop, as shown in the attached version of the patch.  With the
> v3 patch I posted earlier, I see interrupted system call failures in
> the select_parallel regression test, but with the v4 it passes.
> Thoughts?

I think a retry loop is a lot better than blocking signals.


> > Unfounded speculation: fallocate() might actually *improve*
> > performance of DSM segments if your access pattern involves random
> > access (just to pick an example out of the air, something like...
> > building a hash table), since it's surely easier to allocate a big
> > contiguous chunk than a squillion random pages most of which divide an
> > existing hole into two smaller holes...
> 
> Bleugh... I retract this, of course we initialise the hash table in
> order anyway so this doesn't make any sense.

It's still faster to create larger mappings at once, rather than through
subsequent page faults...


> diff --git a/configure.in b/configure.in
> index 11eb9c8acfc..47452bbac43 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -1429,7 +1429,7 @@ PGAC_FUNC_WCSTOMBS_L
>  LIBS_including_readline="$LIBS"
>  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
>  
> -AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred 
> getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink 
> setproctitle setsid shm_open symlink sync_file_range towlower utime utimes 
> wcstombs wcstombs_l])
> +AC_CHECK_FUNCS([cbrt clock_gettime dlopen fallocate fdatasync getifaddrs 
> getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np 
> readlink setproctitle setsid shm_open symlink sync_file_range towlower utime 
> utimes wcstombs wcstombs_l])

Why are we going for fallocate rather than posix_fallocate()? There are
plenty of use cases that can only be done with the former, but this doesn't
seem like one of them?

Currently this is a linux specific path - but I don't actually see any
reason to keep it that way? It seems far from unlikely that this is just
an issue with linux...

>
>  #ifdef USE_DSM_POSIX
>  /*
> + * Set the size of a virtual memory region associate with a file descriptor.
> + * On Linux, also ensure that virtual memory is actually allocated by the
> + * operating system to avoid nasty surprises later.
> + *
> + * Returns non-zero if either truncation or allocation fails, and sets errno.
> + */
> +static int
> +dsm_impl_posix_resize(int fd, off_t size)
> +{
> + int rc;
> +
> + /* Truncate (or extend) the file to the requested size. */
> + rc = ftruncate(fd, size);
> +
> +#ifdef HAVE_FALLOCATE
> +#ifdef __linux__
> + /*
> +  * On Linux, a shm_open fd is backed by a tmpfs file.  After resizing
> +  * with ftruncate it may contain a hole.  Accessing memory backed by a
> +  * hole causes tmpfs to allocate pages, which fails with SIGBUS if
> +  * there is no virtual memory available.  So we ask tmpfs to allocate
> +  * pages here, so we can fail gracefully with ENOSPC now rather than
> +  * risking SIGBUS later.
> +  */
> + if (rc == 0)
> + {
> + do
> + {
> + rc = fallocate(fd, 0, 0, size);
> + } while (rc == -1 && errno == EINTR);
> + if (rc != 0 && errno == ENOSYS)
> + {
> + /* Kernel too old (< 2.6.23). */
> + rc = 0;
> + }
> + }
> +#endif
> +#endif

I'd rather fall-back to just write() initializing the buffer, and do
either of those in all implementations rather than just linux.
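
Very roughly what I mean, as an illustrative sketch only (not a proposed
patch; HAVE_POSIX_FALLOCATE is assumed to be set up by configure):

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Pre-allocate the pages behind fd, so we get ENOSPC now instead of SIGBUS
 * later; falls back to writing zeroes where posix_fallocate() isn't there.
 */
static int
prealloc_fd(int fd, off_t size)
{
#ifdef HAVE_POSIX_FALLOCATE
	int			rc;

	do
	{
		/* posix_fallocate() returns the error directly, not via errno */
		rc = posix_fallocate(fd, 0, size);
	} while (rc == EINTR);

	if (rc != 0)
	{
		errno = rc;
		return -1;
	}
	return 0;
#else
	char		zeroes[8192];
	off_t		written = 0;

	memset(zeroes, 0, sizeof(zeroes));
	while (written < size)
	{
		size_t		chunk = sizeof(zeroes);
		ssize_t		rc;

		if ((off_t) chunk > size - written)
			chunk = (size_t) (size - written);
		rc = pwrite(fd, zeroes, chunk, written);
		if (rc < 0)
		{
			if (errno == EINTR)
				continue;
			return -1;
		}
		written += rc;
	}
	return 0;
#endif
}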

> diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
> index 7a05c7e5b85..dcb9a846c7b 100644
> --- a/src/include/pg_config.h.in
> +++ b/src/include/pg_config.h.in
> @@ -167,6 +167,9 @@
>  /* Define to 1 if you have the  header file. */
>  #undef HAVE_EDITLINE_READLINE_H
>  
> +/* Define to 1 if you have the `fallocate' function. */
> +#undef HAVE_FALLOCATE
> +
>  /* Define to 1 if you have the `fdatasync' function. */
>  #undef HAVE_FDATASYNC

Needs pg_config.h.win32 adjustment...

Greetings,

Andres Freund




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-28 Thread Andres Freund
On 2017-06-28 13:31:27 -0400, Tom Lane wrote:
> I'm not hearing anyone speaking against doing this now, so I'm going
> to go ahead with it.

Cool.


> While looking this over again, I got worried about the fact that pg_ctl
> is #including "miscadmin.h".  That's a pretty low-level backend header
> and it wouldn't be surprising at all if somebody tried to put stuff in
> it that wouldn't compile frontend-side.  I think we should take the
> opportunity, as long as we're touching this stuff, to split the #defines
> that describe the contents of postmaster.pid into a separate header file.
> Maybe "utils/pidfile.h" ?

Yes, that sounds like a valid concern, and solution.

- Andres




Re: [HACKERS] Fast promotion not used when doing a recovery_target PITR restore?

2017-06-27 Thread Andres Freund
On 2017-06-28 06:04:23 +0900, Michael Paquier wrote:
> On Wed, Jun 28, 2017 at 3:44 AM, Andres Freund <and...@anarazel.de> wrote:
> > I'm far from convinced by this.  By now WAL replay with checkpointer,
> > bgwriter, etc. active is actually *more* tested than the cases without
> > it. The likelihood of bugs is higher in the less frequently exercised
> > paths, and given that replication exercises the situation with all those
> > processes active on a continuous basis, I'm fairly unconvinced by your
> > argument.
> 
> Crash recovery is the last thing where failures should never happen.
> Don't you think that it should remain simple as it has been designed
> originally? It seems to me that the argument for keeping things simple
> has higher priority than performance in being able to reconnect by
> delaying the checkpoint.

You seem to be arguing right past my point that the replication path
is *more* robust by now?  And there are plenty of scenarios where a faster
startup is quite crucial for performance. The difference between an
immediate shutdown + recovery without checkpoint to a fast shutdown can
be very large, and that matters a lot for faster postgres updates etc.

Andres




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-27 Thread Andres Freund
Hi,

On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
> Here's a draft patch for that.  I quite like the results --- this seems
> way simpler and more reliable than what pg_ctl has done up to now.

Yea, I like that too.


> However, it's certainly arguable that this is too much change for an
> optional post-beta patch.

Yea, I think there's a valid case to be made for that. I'm still
inclined to go along with this, it seems we're otherwise just adding
complexity we're going to remove in a bit again.


> If we decide that it has to wait for v11,
> I'd address Jeff's complaint by hacking the loop behavior in
> test_postmaster_connection, which'd be ugly but not many lines of code.

Basically increasing the wait time over time?


> Note that I followed the USE_SYSTEMD patch's lead as to where to report
> postmaster state changes.  Arguably, in the standby-server case, we
> should not report the postmaster is ready until we've reached consistency.
> But that would require additional signaling from the startup process
> to the postmaster, so it seems like a separate change if we want it.

I suspect we're going to want to add more states to this over time, but
as you say, there's no need to hurry.


>   /*
> +  * Report postmaster status in the postmaster.pid file, to allow pg_ctl 
> to
> +  * see what's happening.  Note that all strings written to the status 
> line
> +  * must be the same length, per comments for AddToDataDirLockFile().  We
> +  * currently make them all 8 bytes, padding with spaces.
> +  */
> + AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "starting");

The 8-space thing in multiple places is a bit ugly.  How about having a
bunch of constants declared in one place? Alternatively make it
something like: status: $c, where $c is a one character code for the
various states?
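
I.e. something along these lines (names invented here just for illustration;
the important bit is that all the values have the same length):

/* hypothetically, in some header shared between postmaster.c and pg_ctl.c */
#define PM_STATUS_STARTING	"starting"	/* still starting up */
#define PM_STATUS_STOPPING	"stopping"	/* shutting down */
#define PM_STATUS_READY		"ready   "	/* accepting connections */
#define PM_STATUS_STANDBY	"standby "	/* consistent, read-only */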


> @@ -1149,8 +1149,9 @@ TouchSocketLockFiles(void)
>   *
>   * Note: because we don't truncate the file, if we were to rewrite a line
>   * with less data than it had before, there would be garbage after the last
> - * line.  We don't ever actually do that, so not worth adding another kernel
> - * call to cover the possibility.
> + * line.  While we could fix that by adding a truncate call, that would make
> + * the file update non-atomic, which we'd rather avoid.  Therefore, callers
> + * should endeavor never to shorten a line once it's been written.
>   */
>  void
>  AddToDataDirLockFile(int target_line, const char *str)
> @@ -1193,19 +1194,26 @@ AddToDataDirLockFile(int target_line, co
>   srcptr = srcbuffer;
>   for (lineno = 1; lineno < target_line; lineno++)
>   {
> - if ((srcptr = strchr(srcptr, '\n')) == NULL)
> - {
> - elog(LOG, "incomplete data in \"%s\": found only %d 
> newlines while trying to add line %d",
> -  DIRECTORY_LOCK_FILE, lineno - 1, target_line);
> - close(fd);
> - return;
> - }
> - srcptr++;
> + char   *eol = strchr(srcptr, '\n');
> +
> + if (eol == NULL)
> + break;  /* not enough lines in 
> file yet */
> + srcptr = eol + 1;
>   }
>   memcpy(destbuffer, srcbuffer, srcptr - srcbuffer);
>   destptr = destbuffer + (srcptr - srcbuffer);
>  
>   /*
> +  * Fill in any missing lines before the target line, in case lines are
> +  * added to the file out of order.
> +  */
> + for (; lineno < target_line; lineno++)
> + {
> + if (destptr < destbuffer + sizeof(destbuffer))
> + *destptr++ = '\n';
> + }
> +
> + /*
>* Write or rewrite the target line.
>*/
>   snprintf(destptr, destbuffer + sizeof(destbuffer) - destptr,
> "%s\n", str);

Not this patches fault, and shouldn't be changed now, but this is a
fairly weird way to manage and update this file.

Greetings,

Andres Freund




Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Andres Freund
On 2017-06-27 11:13:38 -0700, Peter Geoghegan wrote:
> On Tue, Jun 27, 2017 at 11:04 AM, Andres Freund <and...@anarazel.de> wrote:
> >
> > On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
> >> I looked at this again recently. I wrote a patch to prove to myself
> >> that we can fairly easily reclaim 15 bits from every nbtree internal
> >> page ItemId, and put an abbreviated key there instead.
> >
> > Interesting. Not sure however that really addresses my layout / cache
> > efficiency concern? Or is that just a largely independent optimization,
> > except it's affecting the same area of code?
> 
> Well, you'd only do this on internal pages, which are a tiny minority
> of the total, and yet are where the majority of binary searches for an
> index scan occur in practice. The optimization has the effect of
> making the binary search only need to access the much smaller ItemId
> array in that best case. In the best case, where you resolve all
> comparisons on internal pages, you still have to get the index tuple
> that you need to follow the TID of to go to a page on the next level
> down, once the binary search for an internal page actually finds it.
> But that's all.
> 
> In the best case, and maybe the average case, this could be highly
> effective, I think. There would definitely be cases where the
> optimization wouldn't help at all, but hopefully it would also not
> hurt.

In other words, it's an independent optimization.  That's cool, but I'd
rather talk about it in an independent thread, to avoid conflating the
issues.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast promotion not used when doing a recovery_target PITR restore?

2017-06-27 Thread Andres Freund
On 2017-06-23 10:56:07 +0900, Michael Paquier wrote:
> > And even there it might actually be
> > a pretty good idea to not force a full checkpoint - getting up fast
> > after a crash is kinda important..
> 
> But not that. Crash recovery is designed to be simple and robust, with
> only the postmaster and the startup processes running when doing so.
> Not having the startup process doing by itself checkpoints would
> require the need of the bgwriter, which increases the likelihood of
> bugs. In short, I don't think that improving performance is the matter
> for crash recovery, robustness and simplicity are.

I'm far from convinced by this.  By now WAL replay with checkpointer,
bgwriter, etc. active is actually *more* tested than the cases without
it. The likelihood of bugs is higher in the less frequently exercised
paths, and given that replication exercises the situation with all those
processes active on a continuous basis, I'm fairly unconvinced by your
argument.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Andres Freund
On 2016-05-19 19:38:02 -0700, Peter Geoghegan wrote:
> On Wed, May 18, 2016 at 6:25 AM, Andres Freund <and...@anarazel.de> wrote:
> > currently we IIRC use linearly sorted datums for the search in
> > individual btree nodes.  Not surprisingly that's often one of the
> > dominant entries in profiles. We could probably improve upon that by
> > using an order more optimized for efficient binary search.
> 
> Did you ever try running a pgbench SELECT benchmark, having modified
> things such that all PKs are on columns that are not of type
> int4/int8, but rather are of type numeric? It's an interesting
> experiment, that I've been meaning to re-run on a big box.

> Obviously this will be slower than an equivalent plain pgbench SELECT,
> but the difference may be smaller than you expect.

I'm not sure what that has to do with the topic?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Andres Freund
Hi,

On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
> I looked at this again recently. I wrote a patch to prove to myself
> that we can fairly easily reclaim 15 bits from every nbtree internal
> page ItemId, and put an abbreviated key there instead.

Interesting. Not sure however that really addresses my layout / cache
efficiency concern? Or is that just a largely independent optimization,
except it's affecting the same area of code?


> Can you suggest a workload/hardware to assess the benefits of an
> optimization like this, Andres? Is there a profile available somewhere
> in the archives that shows many cache misses within _bt_binsrch()?

I don't quite remember what triggered my report, but it's quite commonly
visible in any workloads that have btrees above toy sizes, but still
fitting in shared_buffers. Obviously you need something where btree
lookups are a significant share of the time, but that's easy enough with
index nested loop joins and such.  I do recall seeing it recently-ish in
a number of tpc-h queries.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-26 Thread Andres Freund
On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2017-06-26 17:30:30 -0400, Tom Lane wrote:
> >> No, I don't like that at all.  Has race conditions against updates
> >> coming from the startup process.
> 
> > You'd obviously have to take the appropriate locks.  I think the issue
> > here is less race conditions, and more that architecturally we'd
> > interact with shmem too much.
> 
> Uh, we are *not* taking any locks in the postmaster.

I'm not sure why you're 'Uh'ing, when my point pretty much is that we
do not want to do so?


> Hm.  Take that a bit further, and we could drop the connection probes
> altogether --- just put the whole responsibility on the postmaster to
> show in the pidfile whether it's ready for connections or not.

Yea, that seems quite appealing, both from an architectural, simplicity,
and log noise perspective. I wonder if there's some added reliability by
the connection probe though? Essentially wondering if it'd be worthwhile
to keep a single connection test at the end. I'm somewhat disinclined
though.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-26 Thread Andres Freund
On 2017-06-26 17:30:30 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > It'd be quite possible to address the race-condition by moving the
> > updating of the control file to postmaster, to the
> > CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) block. That'd require
> > updating the control file from postmaster, which'd be somewhat ugly.
> 
> No, I don't like that at all.  Has race conditions against updates
> coming from the startup process.

You'd obviously have to take the appropriate locks.  I think the issue
here is less race conditions, and more that architecturally we'd
interact with shmem too much.

> > Perhaps that indicates that field shouldn't be in pg_control, but in the
> > pid file?
> 
> Yeah, that would be a different way to go at it.  The postmaster would
> probably just write the state of the hot_standby GUC to the file, and
> pg_ctl would have to infer things from there.

I'd actually say we should just mirror the existing
#ifdef USE_SYSTEMD
    if (!EnableHotStandby)
        sd_notify(0, "READY=1");
#endif
with corresponding pidfile updates - doesn't really seem necessary for
pg_ctl to do more?
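
Roughly like this, i.e. at the same places we currently notify systemd
(LOCK_FILE_LINE_PM_STATUS and PM_STATUS_STANDBY are made-up names for a
new pidfile line and its contents, just to sketch the shape):

    if (!EnableHotStandby)
    {
#ifdef USE_SYSTEMD
        sd_notify(0, "READY=1");
#endif
        AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
    }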

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-26 Thread Andres Freund
Hi,

On 2017-06-26 16:49:07 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > Arguably we could and should improve the logic when the server has
> > started, right now it's pretty messy because we never treat a standby as
> > up if hot_standby is disabled...
> 
> True.  If you could tell the difference between "HS disabled" and "HS not
> enabled yet" from pg_control, that would make pg_ctl's behavior with
> cold-standby servers much cleaner.  Maybe it *is* worth messing with the
> contents of pg_control at this late hour.

I'm +0.5.


> My inclination for the least invasive fix is to leave the DBState
> enum alone and add a separate hot-standby state field with three
> values (disabled/not-yet-enabled/enabled).

Yea, that seems sane.
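
Something like this in pg_control, i.e. (names entirely made up here):

typedef enum HotStandbyControlState
{
    HOTSTANDBY_DISABLED,        /* hot_standby = off */
    HOTSTANDBY_NOT_YET_ENABLED, /* in recovery, not accepting connections yet */
    HOTSTANDBY_ENABLED          /* accepting read-only connections */
} HotStandbyControlState;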


> Then pg_ctl would start
> probing the postmaster when it saw either DB_IN_PRODUCTION DBstate
> or hot-standby-enabled.  (It'd almost not have to probe the postmaster
> at all, except there's a race condition that the startup process
> will probably change the field a little before the postmaster gets
> the word to open the gates.)  On the other hand, if it saw
> DB_IN_ARCHIVE_RECOVERY with hot standby disabled, it'd stop waiting.

It'd be quite possible to address the race-condition by moving the
updating of the control file to postmaster, to the
CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) block. That'd require
updating the control file from postmaster, which'd be somewhat ugly.
Perhaps that indicates that field shouldn't be in pg_control, but in the
pid file?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-26 Thread Andres Freund
On 2017-06-26 16:26:00 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2017-06-26 16:19:16 -0400, Tom Lane wrote:
> >> Sure, what do you think an appropriate behavior would be?
> 
> > It'd not be unreasonable to check pg_control first, and only after that
> > indicates readiness check via the protocol.
> 
> Hm, that's a thought.  The problem here isn't the frequency of checks,
> but the log spam.

Right.  I think to deal with hot-standby we'd probably have to add new
state to the control file however. We don't just want to treat the
server as ready once DB_IN_PRODUCTION is reached.

Arguably we could and should improve the logic when the server has
started, right now it's pretty messy because we never treat a standby as
up if hot_standby is disabled...

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-26 Thread Andres Freund
On 2017-06-26 16:19:16 -0400, Tom Lane wrote:
> Jeff Janes  writes:
> > The 10 fold increase in log spam during long PITR recoveries is a bit
> > unfortunate.
> 
> > 9153  2017-06-26 12:55:40.243 PDT FATAL:  the database system is starting up
> > 9154  2017-06-26 12:55:40.345 PDT FATAL:  the database system is starting up
> > 9156  2017-06-26 12:55:40.447 PDT FATAL:  the database system is starting up
> > 9157  2017-06-26 12:55:40.550 PDT FATAL:  the database system is starting up
> > ...
> 
> > I can live with it, but could we use an escalating wait time so it slows
> > back down to once a second after a while?
> 
> Sure, what do you think an appropriate behavior would be?

It'd not be unreasonable to check pg_control first, and only after that
indicates readiness check via the protocol. Doesn't quite seem like
something backpatchable tho.
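
As a sketch of both ideas combined - get_control_dbstate() and
test_postmaster_connection() are just stand-ins for "read pg_control"
and "probe via the protocol", not pg_ctl's actual helpers:

static bool
wait_for_postmaster_ready(int wait_seconds)
{
    int         delay_ms = 100;
    int         waited_ms = 0;

    while (waited_ms < wait_seconds * 1000)
    {
        /*
         * Only bother the postmaster (and the log) with a protocol-level
         * probe once pg_control claims we're up; hot standby would need
         * additional state, as discussed elsewhere.
         */
        if (get_control_dbstate() == DB_IN_PRODUCTION &&
            test_postmaster_connection())
            return true;

        pg_usleep(delay_ms * 1000);
        waited_ms += delay_ms;

        /* escalate back towards the old once-a-second rhythm */
        if (delay_ms < 1000)
            delay_ms *= 2;
    }
    return false;
}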

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Andres Freund
On 2017-06-26 13:42:52 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> >> ... But I wonder whether it's intentional that the old
> >> walreceiver dies in the first place.  That FATAL exit looks suspiciously
> >> like it wasn't originally-designed-in behavior.
> 
> > It's quite intentional afaik - I've complained about the bad error
> > message recently (we really shouldn't say "no COPY in progress"), but
> > exiting seems quite reasonable.  Otherwise we'd have to add separate
> > retry logic into the walreceiver, so it reconnects without a new
> > walreceiver being started.
> 
> Ah, I see.  I'd misinterpreted the purpose of the infinite loop in
> WalReceiverMain() --- now I see that's for receiving requests from the
> startup proc for different parts of the WAL stream, not for reconnecting
> to the master.

Right.  And if the connection fails, we intentionally (whether that's
good or bad is another question) switch to restore_command (or
pg_xlog...) based recovery, in which case we certainly do not want the
walreceiver around.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Another reason why the recovery tests take a long time

2017-06-26 Thread Andres Freund
Hi,


On 2017-06-26 12:32:00 -0400, Tom Lane wrote:
> I've found another edge-case bug through investigation of unexpectedly
> slow recovery test runs.  It goes like this:
> 
> * While streaming from master to slave, test script shuts down master
> while slave is left running.  We soon restart the master, but meanwhile:
> 
> * slave's walreceiver process fails, reporting
> 
> 2017-06-26 16:06:50.209 UTC [13209] LOG:  replication terminated by primary 
> server
> 2017-06-26 16:06:50.209 UTC [13209] DETAIL:  End of WAL reached on timeline 1 
> at 0/398.
> 2017-06-26 16:06:50.209 UTC [13209] FATAL:  could not send end-of-streaming 
> message to primary: no COPY in progress
> 
> * slave's startup process observes that walreceiver is gone and sends
> PMSIGNAL_START_WALRECEIVER to ask for a new one
> 
> * more often than you would guess, in fact nearly 100% reproducibly for
> me, the postmaster receives/services the PMSIGNAL before it receives
> SIGCHLD for the walreceiver.  In this situation sigusr1_handler just
> throws away the walreceiver start request, reasoning that the walreceiver
> is already running.

Yuck.

I've recently seen a bunch of symptoms vaguely around this, e.g. I can
atm frequently reconnect to a new backend after a
PANIC/segfault/whatnot, before postmaster gets the message.  I've not
yet figured out whether that's a kernel change, or whether some of the
more recent tinkering in postmaster.c changed this.


> * eventually, it dawns on the startup process that the walreceiver
> isn't starting, and it asks for a new one.  But that takes ten seconds
> (WALRCV_STARTUP_TIMEOUT).
> 
> So this looks like a pretty obvious race condition in the postmaster,
> which should be resolved by having it set a flag on receipt of
> PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
> new walreceiver.  But I wonder whether it's intentional that the old
> walreceiver dies in the first place.  That FATAL exit looks suspiciously
> like it wasn't originally-designed-in behavior.

It's quite intentional afaik - I've complained about the bad error
message recently (we really shouldn't say "no COPY in progress"), but
exiting seems quite reasonable.  Otherwise we'd have to add separate
retry logic into the walreceiver, so it reconnects without a new
walreceiver being started.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] subscription worker signalling wal writer too much

2017-06-24 Thread Andres Freund
Hi,

On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> > Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> > this afaics would allow wal writer to go into sleep mode with half a
> > page filled, and it'd not be woken up again until the page is filled.
> > No?
> >
> 
> If it is taking the big sleep, then we always wake it up, since
> acd4c7d58baf09f.
> 
> I didn't change that part.  I only changed what we do if it not hibernating.

Right, I hadn't read enough of the context.


> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch.  I think the reason
> is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> suspends the calling process and gives the signalled process a chance to
> run in that time slice.  The Wal Writer gets woken up, sees that it only
> has a partial page to write and decides not to do anything, but resets the
> latch.  It does this fast enough that the subscription worker hasn't
> migrated CPUs yet.

I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless.  In cases where walwriter is busy, that
actually causes a lot of superfluous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superfluous context switches.  I think we can avoid doing so quite
easily, as e.g. with the attached patch.  Could you check how much that
helps your benchmark?
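
The gist of it - this is just a sketch of the idea, not necessarily what
the attached patch does, and maybe_sleeping is a flag that would have to
be added to Latch:

    /* in the wait path, before actually blocking */
    latch->maybe_sleeping = true;
    pg_memory_barrier();
    if (latch->is_set)
    {
        latch->maybe_sleeping = false;
        return;                 /* already set, no need to sleep */
    }
    /* ... block in poll/epoll ... */
    latch->maybe_sleeping = false;

    /* in SetLatch(), after setting latch->is_set and a barrier */
    if (latch->owner_pid != MyProcPid && latch->maybe_sleeping)
        kill(latch->owner_pid, SIGUSR1);

That way a busy walwriter that never reaches the latch wait doesn't get
pelted with signals it can't do anything useful with.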


> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written.  This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit is flushed.
> 
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long the page LSN is newer than that
> (so the wal covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
> 
> Then the third change makes the wal writer only write the full pages most
> of the times it is woken up, not flush them.  It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
> 
> It seems like the third change rendered the first one useless.

I don't think so. Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?

I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing.  Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work.  I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);


> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?

I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit.  In fact
you'll always hit that case unless there are also concurrent writes
touching said page.


> If that were done, would the third change still be
> needed?  (It did seem to add some other features as well, so I'm going to
> assume we still want those...).

It's still useful, yes.  It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.



> But now the first change is even worse than useless, it is positively
> harmful.  The only thing to stop it from waking the WALWriter for every
> async commit is this line:
> 
> /* if we have already flushed that far, we're done */
> if (WriteRqstPtr <= LogwrtResult.Flush)
> return;
> 
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> becomes false due to us waking walwriter, it only becomes false once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded.  So now every async commit is
> waking the walwriter.  Most of those wake up do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.

s/completely new patch/completely new page/?

In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic.  Have to think about it a bit more.

Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease

2017-06-23 Thread Andres Freund
On 2017-06-23 13:26:58 -0400, Alvaro Herrera wrote:
> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have
> some elog(DEBUG1) calls with the slot spinlock held.  That's probably
> uncool.

It definitely is not cool, rather daft even (it's probably me who wrote
that).  No idea why I've done it like that, makes no sense.
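
The fix is presumably just to copy whatever the message needs into
locals while holding the spinlock and only elog() afterwards, roughly
(the exact fields and wording don't matter here):

    XLogRecPtr  restart_lsn;
    XLogRecPtr  candidate_lsn;

    SpinLockAcquire(&slot->mutex);
    restart_lsn = slot->data.restart_lsn;
    candidate_lsn = slot->candidate_restart_lsn;
    /* ... adjust the candidate fields as before ... */
    SpinLockRelease(&slot->mutex);

    elog(DEBUG1, "got new restart lsn %X/%X at %X/%X",
         (uint32) (candidate_lsn >> 32), (uint32) candidate_lsn,
         (uint32) (restart_lsn >> 32), (uint32) restart_lsn);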

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix a typo in snapmgr.c

2017-06-23 Thread Andres Freund
On 2017-06-23 19:21:57 +0100, Simon Riggs wrote:
> On 23 June 2017 at 08:23, Simon Riggs <si...@2ndquadrant.com> wrote:
> > On 23 June 2017 at 08:21, Andres Freund <and...@anarazel.de> wrote:
> >> On 2017-06-07 10:17:31 -0700, Andres Freund wrote:
> >>> On 2017-05-08 09:12:13 -0400, Tom Lane wrote:
> >>> > Simon Riggs <si...@2ndquadrant.com> writes:
> >>> > > So rearranged code a little to keep it lean.
> >>> >
> >>> > Didn't you break it with that?  As it now stands, the memcpy will
> >>> > copy the nonzero value.
> >>>
> >>> I've not seen a fix and/or alleviating comment about this so far.  Did I
> >>> miss something?
> >>
> >> Simon, FWIW, I plan to either revert or fix this up soon-ish.  Unless
> >> you're going to actually respond on this thread?
> >
> > Sorry, I've only just seen Tom's reply. Will fix.
> 
> I don't see a bug. Perhaps I'm tired and can't see it yet.
> 
> Will fix if you thwack me with the explanation.

Wasn't my complaint, but here we go:

Previous code:

    /*
     * Ignore the SubXID array if it has overflowed, unless the snapshot was
     * taken during recovery - in that case, top-level XIDs are in subxip as
     * well, and we mustn't lose them.
     */
    if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery)
        serialized_snapshot.subxcnt = 0;

    /* Copy struct to possibly-unaligned buffer */
    memcpy(start_address,
           &serialized_snapshot, sizeof(SerializedSnapshotData));

i.e. if suboverflowed, start_address would contain subxcnt = 0.

New code:


    /* Copy struct to possibly-unaligned buffer */
    memcpy(start_address,
           &serialized_snapshot, sizeof(SerializedSnapshotData));

    /* Copy XID array */
    if (snapshot->xcnt > 0)
        memcpy((TransactionId *) (start_address +
                                  sizeof(SerializedSnapshotData)),
               snapshot->xip, snapshot->xcnt * sizeof(TransactionId));

    /*
     * Copy SubXID array. Don't bother to copy it if it had overflowed,
     * though, because it's not used anywhere in that case. Except if it's a
     * snapshot taken during recovery; all the top-level XIDs are in subxip as
     * well in that case, so we mustn't lose them.
     */
    if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery)
        serialized_snapshot.subxcnt = 0;

Here the copy is done before subxcnt = 0.
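
I.e. the minimal fix is presumably to move the zeroing back above the
struct copy, so the struct that gets copied already has subxcnt = 0:

    /*
     * Ignore the SubXID array if it has overflowed, unless the snapshot was
     * taken during recovery - in that case, top-level XIDs are in subxip as
     * well, and we mustn't lose them.
     */
    if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery)
        serialized_snapshot.subxcnt = 0;

    /* Copy struct to possibly-unaligned buffer */
    memcpy(start_address,
           &serialized_snapshot, sizeof(SerializedSnapshotData));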

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REPLICA IDENTITY FULL

2017-06-23 Thread Andres Freund
On 2017-06-23 13:05:21 -0400, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Peter Eisentraut  writes:
> > > Any thoughts about keeping datumAsEqual() as a first check?  I did some
> > > light performance tests, but it was inconclusive.
> > 
> > Seems like it would tend to be a win if, in fact, the values are
> > usually equal.  If they're usually not, then it's a loser.  Do
> > we have any feeling for which case is more common?

Seems like a premature optimization to me - if you care about
performance and do this frequently, you're not going to end up using
FULL.  If we want to optimize for performance, it'd probably be better
to look up candidate keys and use those if available.


> Though, thinking about it, maybe the datumIsEqual test would give the
> wrong answer for floating point values, and there'd be no fallback to
> equality with the logic I propose.  But then maybe that's all
> right ---

I don't think it'd be ok, we shouldn't just do the wrong thing because
we think it's unlikely to happen.


> who in their right minds would use floating point columns as part of
> replica identity ...?

Since this is FULL, it'll be all columns...


- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix a typo in snapmgr.c

2017-06-23 Thread Andres Freund
On 2017-06-07 10:17:31 -0700, Andres Freund wrote:
> On 2017-05-08 09:12:13 -0400, Tom Lane wrote:
> > Simon Riggs <si...@2ndquadrant.com> writes:
> > > So rearranged code a little to keep it lean.
> > 
> > Didn't you break it with that?  As it now stands, the memcpy will
> > copy the nonzero value.
> 
> I've not seen a fix and/or alleviating comment about this so far.  Did I
> miss something?

Simon, FWIW, I plan to either revert or fix this up soon-ish.  Unless
you're going to actually respond on this thread?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Dynamic instrumentation of lwlock wait times (lwlock flamegraphs)

2017-06-22 Thread Andres Freund
Hi,

At PGCon some people were talking about the difficulty of instrumenting
the time actually spent waiting for lwlocks and related measurements.
I'd mentioned that Linux these days provides infrastructure to measure
such things in unmodified binaries.

Attached is a prototype of a script that measures the time spent inside
PGSemaphoreLock(), aggregates that inside the kernel, grouped by pid and
stacktrace.  That allows one to generate nice flame graphs showing which
part of the code waits how long for lwlocks.

The attached script clearly needs improvements, but I thought I'd show
some of the results it can get.  To run it you need the the python
library of the 'bcc' project [1], and a sufficiently new kernel.  Some
distributions, e.g. newer debian versions, package this as python-bpfcc
and similar.

The output of the script can be turned into a flamegraph with
https://github.com/brendangregg/FlameGraph 's flamegraph.pl.

Here's a few examples of a pgbench run. The number is the number of
clients, sync/async indicates synchronous_commit on/off.  All the
numbers here were generated with the libpq & pgbench batch patches
applied and in use, but that's just because that was the state of my
working tree.

http://anarazel.de/t/2017-06-22/pgsemwait_8_sync.svg
http://anarazel.de/t/2017-06-22/pgsemwait_8_async.svg
http://anarazel.de/t/2017-06-22/pgsemwait_64_sync.svg
http://anarazel.de/t/2017-06-22/pgsemwait_64_async.svg

A bonus, though not that representative, is the wait profile for a pgbench
read-only run after the above, with autovacuum previously disabled:
http://anarazel.de/t/2017-06-22/pgsemwait_64_select.svg
Interesting to see how the backends themselves never end up having to
flush WAL, or at least not in a manner triggering contention.

I plan to write a few more of these, because they're hugely useful to
understand actual locking behaviour. Among them:
- time between Acquire/Release of lwlocks, grouped by lwlock
- time between Acquire/Release of lwlocks, grouped by stack
- callstack of acquirer and waker of lwlocks, grouped by caller stack, waiter 
stack
- ...

I think it might be interesting to collect a few of these somewhere
centrally once halfway mature.  Maybe in src/tools or such.

Greetings,

Andres Freund

[1] https://github.com/iovisor/bcc
#!/usr/bin/python

from __future__ import print_function
from bcc import BPF
from sys import stderr
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct stats_key_t {
    u32 pid;
    int user_stack_id;
};

struct stats_value_t {
    u64 total_time;
};

struct start_key_t {
    u32 pid;
};

struct start_value_t {
    u64 last_start;
};

// map_type, key_type, leaf_type, table_name, num_entry
BPF_HASH(stats, struct stats_key_t, struct stats_value_t);
BPF_HASH(start, struct start_key_t, struct start_value_t);

BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)

int trace_sem_entry(struct pt_regs *ctx)
{
    struct start_key_t start_key = {};
    struct start_value_t start_value = {};

    start_key.pid = bpf_get_current_pid_tgid();
    start_value.last_start = bpf_ktime_get_ns();

    start.update(&start_key, &start_value);

    return 0;
}

int trace_sem_return(struct pt_regs *ctx)
{
    struct stats_key_t stats_key = {};
    struct start_key_t start_key = {};
    struct stats_value_t zero = {};
    struct stats_value_t *stats_value;
    struct start_value_t *start_value;
    u64 delta;

    start_key.pid = bpf_get_current_pid_tgid();
    start_value = start.lookup(&start_key);

    if (!start_value)
        return 0; /* missed start */

    delta = bpf_ktime_get_ns() - start_value->last_start;
    start.delete(&start_key);

    stats_key.pid = bpf_get_current_pid_tgid();
    stats_key.user_stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID | BPF_F_USER_STACK);

    stats_value = stats.lookup_or_init(&stats_key, &zero);
    stats_value->total_time += delta;

    return 0;
}

"""

# set stack storage size
bpf_text = bpf_text.replace('STACK_STORAGE_SIZE', str(1000))

b = BPF(text=bpf_text)


libpath = BPF.find_exe('/home/andres/build/postgres/dev-optimize/vpath/src/backend/postgres')
if not libpath:
    exit("can't resolve postgres binary")
library = libpath

b.attach_uprobe(name=library, sym_re='PGSemaphoreLock',
                fn_name="trace_sem_entry",
                pid=-1)

b.attach_uretprobe(name=library, sym_re='PGSemaphoreLock',
                   fn_name="trace_sem_return",
                   pid=-1)

matched = b.num_open_uprobes()
if matched == 0:
    print("error: 0 functions traced. Exiting.", file=stderr)
    exit(1)

sleep(3)

stats = b.get_table("stats")
stack_traces = b.get_table("stack_traces")
folded = True #False

for k, v in stats.items(): #, key=lambda v: v.total_time):
    #print(dir(k))
    #print(dir(v))
    user_stack = [] if k.user_stack_id < 0 else \
        stack_traces.walk(k.user_stack_id)
    name = 'postgres'
    if v.total_time == 0:
        continue
    if folded:
        # print folded stack output

Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2017-06-22 Thread Andres Freund
On 2017-06-22 10:41:41 -0700, Andres Freund wrote:
> On 2017-02-02 12:18:22 -0800, Jimmy Yih wrote:
> > In the above pull request, Heikki also mentions that a similar scenario can
> > happen during palloc() as well... which is similar to what we saw in
> > Greenplum a couple years back for a deadlock in a malloc() call where we
> > responded by changing exit() to _exit() in quickdie as a fix.  That could
> > possibly be applicable to latest Postgres as well.
> 
> Isn't the quickdie() issue that we palloc/malloc in the first place,
> rather than the exit call?  There's some risk for exit() too, but we
> reset our own atexit handlers before exiting, so the risk seems fairly
> small.
> 
> 
> In my opinion the fix here would be to do it right and never emit error
> messages from signal handlers via elog.c - we've progressed a lot
> towards the goal and do a lot less in signal handlers these days - but
> quickdie() is one glaring exception to that.  I think a reasonable fix
> here would be to use write() of a statically allocated message, rather
> then elog proper, and not to send the message to the client.  Libpq, and
> I presume other clients, synthethize a message upon unexpected socket
> closure anyway, and it's completely unclear whether we can send a
> message anyway.

Or, probably more robust: Simply _exit(2) without further ado, and rely
on postmaster to output an appropriate error message. Arguably it's not
actually useful to see hundreds of "WARNING: terminating connection because of
crash of another server process" messages in the log anyway.
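
A sketch of what the write() variant would look like - whether to write
anything at all (rather than just _exit(2)ing and letting postmaster do
the logging) is exactly the question above:

static void
quickdie(SIGNAL_ARGS)
{
    /*
     * Async-signal-safe on purpose: no elog, no palloc, no atexit
     * processing - just a static string, write() and _exit().
     */
    static const char msg[] =
    "WARNING:  terminating connection because of crash of another server process\n";

    (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(2);
}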

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

