date:20180108

Re: Finalizing logical replication limitations as well as potential features

2018-01-08 Thread Alvaro Hernandez




On 05/01/18 05:35, Joshua D. Drake wrote:

On 01/04/2018 01:26 PM, Alvaro Herrera wrote:

Joshua D. Drake wrote:


We just queue/audit the changes as they happen and sync up the changes
after the initial sync completes.

This already happens.  There is an initial sync, and there's logical
decoding that queues any changes that exist "after" the sync's snapshot.

What you seem to want is to have multiple processes doing the initial
COPY in parallel -- each doing one fraction of the table.  Of course,
they would have to use the same snapshot.  That would make sense only
if the COPY itself is the bottleneck and not the network, or the I/O
speed of the origin server.  This doesn't sound a common scenario to me.


Not quite but close. My thought process is that we don't want to sync 
within a single snapshot a 100-500mil row table (or worse). Unless I 
am missing something there, that has the potential to be a very long 
running transaction especially if we are syncing more than one relation.


JD



    That's indeed the way it works, you need to hold the snapshot 
possibly for a long time. But not doing so seems to go a very complex, 
even though it's not impossible. Changes after initial sync are 
definitely registered (via logical decoding), that's not an issue. But 
if you don't keep a snapshot of the database, you will also see some or 
all of these changes applied to the tables mid-way. How to make the 
whole table copy consistent with potential mid-way changes and the 
recorded ones on logical decoding is difficult and bug-prone.


    Surprisingly, this is how MongoDB replication works, as they don't 
have the equivalent of a snapshot facility. But actually they need to do 
really weird stuff, like re-applying changes up to 3 (why?) times and 
comments on the source code point to strange hacks to make all 
consistent. I (want to) believe they made it correctly, but it is hacky, 
complicated, and MongoDB doesn't support FKs and other features that I'm 
sure complicate matters even more.


    I'm not a PG hacker, but all this sounds too complicated to me. I'd 
keep the snapshot open that makes things very easy. If inside you want 
to do parallel COPY, that's fine (if, as the other Álvaro said, it is 
COPY the limiting factor).



    Cheers,

    Álvaro

--

Alvaro Hernandez


---
OnGres

Re: FP16 Support?

2018-01-08 Thread Kohei KaiGai

Just for your information.

I tried to implement "float2" according to IEEE 754 specification,
as a custom data type of PG-Strom.

https://github.com/heterodb/pg-strom/blob/master/src/float2.c
https://github.com/heterodb/pg-strom/blob/master/sql/float2.sql

The recent GPU device (Maxwell or later) supports "half" precision
data type by the hardware, so it might be valuable for somebody.

Thanks,

2017-11-14 14:49 GMT+09:00 Kohei KaiGai :
> 2017-11-14 10:33 GMT+09:00 Thomas Munro :
>> On Tue, Nov 14, 2017 at 1:11 PM, Kohei KaiGai  wrote:
>>> Any opinions?
>>
>> The only reason I can think of for having it in core is that you might
>> want to use standard SQL notation FLOAT(10) to refer to it.  Right now
>> our parser converts that to float4 but it could map precisions up to
>> 10 to float2.  The need for such special treatment is one of my
>> arguments for considering SQL:2016 DECFLOAT(n) in core PostgreSQL.
>> But this case is different: FLOAT(10) already works, it just maps to a
>> type with a larger significand, as permitted by the standard.  So why
>> not just do these short floats as an extension type?
>>
> Our extension will be able to provide its own "half" or "float2" data type
> using CREATE TYPE, indeed. I thought it is useful to other people, even
> if they are not interested in the in-database analytics with GPU, to reduce
> amount of storage consumption.
>
> Of course, it is my opinion.
>
> Thanks,
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei 



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-08 Thread Rushabh Lathia

On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan  wrote:

> On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia 
> wrote:
> > I agree that plan_create_index_workers() needs to count the leader as a
> > normal worker for the CREATE INDEX.  So what you proposing is - when
> > parallel_leader_participation is true launch (return value of
> > compute_parallel_worker() - 1)
> > workers. true ?
>
> Almost. We need to not subtract one when only one worker is indicated
> by compute_parallel_worker(). I also added some new stuff there, to
> consider edge cases with the parallel_leader_participation GUC.
>
> >> I'm working on fixing up what you posted. I'm probably not more than a
> >> week away from posting a patch that I'm going to mark "ready for
> >> committer". I've already made the change above, and once I spend time
> >> on trying to break the few small changes needed within buffile.c I'll
> >> have taken it as far as I can, most likely.
> >>
> >
> > Okay, once you submit the patch with changes - I will do one round of
> > review for the changes.
>
> I've attached my revision. Changes include:
>
> * Changes to plan_create_index_workers() were made along the lines
> recently discussed.
>
> * plan_create_index_workers() is now called right before the ambuild
> routine is called (nbtree index builds only, of course).
>
> * Significant overhaul of tuplesort.h contract. This had references to
> the old approach, and to tqueue.c's tuple descriptor thing that was
> since superseded by the typmod registry added for parallel hash join.
> These were updated/removed.
>
> * Both tuplesort.c and logtape.c now say that they cannot write to the
> writable/last tape, while still acknowledging that it is in fact the
> leader tape, and that this restriction is due to a restriction with
> BufFiles. They also point out that if the restriction within buffile.c
> ever was removed, everything would work fine.
>
> * Added new call to BufFileExportShared() when freezing tape in logtape.c.
>
> * Tweaks to documentation.
>
> * pgindent ran on modified files.
>
> * Polished the stuff that is added to buffile.c. Mostly comments that
> clarify its reason for existing. Also added Assert()s.
>
> Note that I added Heikki as an author in the commit message.
> Technically, Heikki didn't actually write code for parallel CREATE
> INDEX, but he did loads of independently useful work on merging + temp
> file I/O that went into Postgres 10 (though this wasn't listed in the
> v10 release notes). That work was done in large part to help the
> parallel CREATE INDEX patch, and it did in fact help it quite
> noticeably, so I think that this is warranted. Remember that with
> parallel CREATE INDEX, the leader's merge occurs serially, so anything
> that we can do to speed that part up is very helpful.
>
> This revision does seem very close, but I'll hold off on changing the
> status of the patch for a few more days, to give you time to give some
> feedback.
>
>
Thanks Peter for the updated patch.

I gone through the changes and perform the basic testing. Changes
looks good and haven't found any unusual during testing

Thanks,
Rushabh Lathia

Re: [HACKERS] taking stdbool.h into use

2018-01-08 Thread Michael Paquier

On Sat, Dec 30, 2017 at 10:31:45AM -0500, Tom Lane wrote:
> Surely bool and bool8 should have identical Datum representations,
> so I'm not sure they need different DatumGet/GetDatum macros.

[... refresh update ...]
Yeah sure. I am a bit disturbed by the fact that DatumGetBool() enforces
a case to bool though.

> Although this opens up another point: just above those macros,
> postgres.h says
> 
>  * When a type narrower than Datum is stored in a Datum, we place it in the
>  * low-order bits and are careful that the DatumGetXXX macro for it discards
>  * the unused high-order bits (as opposed to, say, assuming they are zero).
>  * This is needed to support old-style user-defined functions, since depending
>  * on architecture and compiler, the return value of a function returning char
>  * or short may contain garbage when called as if it returned Datum.
> 
> Since we flushed support for V0 functions, the stated rationale doesn't
> apply anymore.  I wonder whether there is anything to be gained by
> changing DatumGetXXX and XXXGetDatum to be simple casts (as, if memory
> serves, they once were until we noticed the stated hazard).  Or, if
> there's still a reason to keep the masking steps in place, we'd better
> update this comment.

23a41573 did not do it rightly visibly, and 23b09e1 got it better, still
GET_1_BYTE() was getting used only because of V0 functions being used in
contrib/seg and because of the way MSVC 2015 handles boolean evaluation
as per thread [1].

In short, your argument for a switch to simple casts makes sense for me.

[1]: 
https://www.postgresql.org/message-id/e1atdps-0005zu...@gemulon.postgresql.org
--
Michael


signature.asc
Description: PGP signature

Re: refactor subscription tests to use PostgresNode's wait_for_catchup

2018-01-08 Thread Michael Paquier

On Mon, Jan 08, 2018 at 09:46:21PM -0500, Peter Eisentraut wrote:
> It appears that we have unwittingly created some duplicate and
> copy-and-paste-prone code in src/test/subscription/ to wait for a
> replication subscriber to catch up, when we already have
> almost-sufficient code in PostgresNode to do that more compactly.  So I
> propose this patch to consolidate that.

This looks sane to me. I have two comments while I read the
surroundings.

> @@ -1505,7 +1515,7 @@ sub wait_for_catchup
> . $target_lsn . " on "
> . $self->name . "\n";
>   my $query =
> -qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_stat_replication 
> WHERE application_name = '$standby_name';];
> +qq[SELECT $lsn_expr <= ${mode}_lsn FROM pg_catalog.pg_stat_replication WHERE 
> application_name = '$standby_name';];
>   $self->poll_query_until('postgres', $query)
> or die "timed out waiting for catchup, current location is "
> . ($self->safe_psql('postgres', $query) || '(unknown)');

This log is wrong from the beginning. Here $query returns a boolean
status and not a location. I think that when the poll dies because of a
timeout you should do a lookup at ${mode}_lsn from pg_stat_replication
when application_name matching $standby_name. Could you fix that as
well?

Could you also update promote_standby in RewindTest.pm? Your refactoring
to use pg_current_wal_lsn() if a target_lsn is not possible makes this
move possible. Using the generic APIs gives better logs as well.
--
Michael


signature.asc
Description: PGP signature

Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-08 Thread David Rowley

On 9 January 2018 at 09:54, Alvaro Herrera  wrote:
> I removed the pg_index.indparentidx column that previous versions add,
> and replaced it with pg_inherits rows.  This makes the code a little bit
> bulkier in a couple of places, but nothing terrible.  As a benefit,
> there's no extra index in pg_index now.

Hi Álvaro,

I've made a pass over this patch. It's mostly in pretty good shape,
but I did find a few things to note down.

1. "either partition" -> "either the partition"

+   so the partition index is dropped together with either
partition it indexes,
+   or with the parent index it is attached to.

2. In findDependentObjects(), should the following if test be below
the ReleaseDeletionLock call?

+ if (foundDep->deptype == DEPENDENCY_INTERNAL_AUTO)
+ break;
  ReleaseDeletionLock(object);

3. The following existing comment indicates that we'll be creating a
disk file for the index. Should we update this comment to say that
this is the case only for RELKIND_INDEX?

/*
* create the index relation's relcache entry and physical disk file. (If
* we fail further down, it's the smgr's responsibility to remove the disk
* file again.)
*/

Maybe:

/*
* create the index relation's relcache entry and for non-partitioned
* indexes, a physical disk file. (If we fail further down, it's the
* smgr's responsibility to remove the disk file again.)
*/

4. Extra newline

+ recordDependencyOn(&myself, &referenced, DEPENDENCY_INTERNAL_AUTO);
+ }
+
+

5. Do you think it's out of scope to support expression indexes?

/*
* Expression indexes are currently not considered equal.  Not needed for
* current callers.
*/
if (info1->ii_Expressions != NIL || info2->ii_Expressions != NIL)
return false;

6. pg_inherits.inhseqno is int, not smallint. I understand you copied
this from StoreCatalogInheritance1(), so maybe a fix should be put in
before this patch is?

void
StoreSingleInheritance(Oid relationId, Oid parentOid, int16 seqNumber)

and

values[Anum_pg_inherits_inhseqno - 1] = Int16GetDatum(seqNumber);

7. Is the following a bug fix? If so, should it not be fixed and
backpatched, rather than sneaked in here?

@@ -10119,6 +10195,7 @@ ATExecChangeOwner(Oid relationOid, Oid
newOwnerId, bool recursing, LOCKMODE lock
  * relation, as well as its toast table (if it has one).
  */
  if (tuple_class->relkind == RELKIND_RELATION ||
+ tuple_class->relkind == RELKIND_PARTITIONED_TABLE ||

8. Missing if (attachInfos) pfree(attachInfos);

+ if (idxes)
+ pfree(idxes);
+ if (attachRelIdxs)
+ pfree(attachRelIdxs);

9. The first OR branch of the Assert will always be false, so might as
well get rid of it.

+ if (!has_superclass(idxid))
+ continue;
+
+ Assert(!has_superclass(idxid) ||
+(IndexGetRelation(get_partition_parent(idxid), false) ==
+RelationGetRelid(rel)));

10. lock -> locks:

/* we already hold lock on both tables, so this is safe: */

11. "already the right" -> "already in the right"

/* Silently do nothing if already the right state */

12. It seems to be RangeVarGetRelidExtended() that locks the partition
index, not the callback.

/* no deadlock risk: our callback above already acquired the lock */

13. We seem to only be holding an AccessShareLock on the partitioned
table. What happens if someone concurrently does ALTER TABLE
partitioned_table DETACH our_partition;?

parentTbl = relation_open(parentIdx->rd_index->indrelid, AccessShareLock);

and;
/* Make sure it indexes a partition of the other index's table */
partDesc = RelationGetPartitionDesc(parentTbl);
found = false;
for (i = 0; i < partDesc->nparts; i++)
{
if (partDesc->oids[i] == state.partitionOid)
{
found = true;
break;
}
}

Wouldn't you also need an AccessExclusiveLock to prevent another
session trying to ALTER INDEX part_index ATTACH PARTITION leaf_index;
at the same time? The concurrent session could be trying to ATTACH it
to a subpartition and this comment could be attaching to the parent.

14. validatePartitionedIndex does not verify that the index it is
checking is valid. Consider the following:

create table part (a int, b int) partition by list(a);
create table part1 partition of part for values in(1) partition by list(b)
create table part1 partition of part for values in(1) partition by list(b);
create table part2 partition of part1 for values in (1);
create index on only part1 (a);
create index on only part (a);
alter index part_a_idx attach partition part1_a_idx;
\d part
Table "public.part"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Partition key: LIST (a)
Indexes:
"part_a_idx" btree (a) <--- should be INVALID
Number of partitions: 1 (Use \d+ to list them.)

\d part1
   Table "public.part1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |

Re: [HACKERS] Replication status in logical replication

2018-01-08 Thread Masahiko Sawada

On Sun, Jan 7, 2018 at 7:50 PM, Simon Riggs  wrote:
> On 26 December 2017 at 00:26, Masahiko Sawada  wrote:
>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>>  wrote:
>>> On 21/11/17 22:06, Masahiko Sawada wrote:

 After investigation, I found out that my previous patch was wrong
 direction. I should have changed XLogSendLogical() so that we can
 check the read LSN and set WalSndCaughtUp = true even after read a
 record without wait. Attached updated patch passed 'make check-world'.
 Please review it.

>>>
>>> Hi,
>>>
>>> This version looks good to me and seems to be in line with what we do in
>>> physical replication.
>>>
>>> Marking as ready for committer.
>>>
>>
>> Thank you for reviewing this patch!
>
> The patch calls this AFTER processing the record
>if (sentPtr >= GetFlushRecPtr())
> but it seems better to call GetFlushRecPtr() before we process the
> record, otherwise the flush pointer might have moved forwards while we
> process the record and it wouldn't catch up. (Physical replication
> works like that).

Agreed.

> New patch version attached for discussion before commit. (v4)

v4 patch looks good to me.

>
> I'd rather not call it at all at that point though, so if we made
> RecentFlushPtr static at the module level rather than within
> WalSndWaitForWal we could use it here also. That's a bit more invasive
> for backpatching, so not implemented that here.
>
> Overall, I find setting WalSndCaughtUp = false at the top of
> XLogSendLogical() to be incredibly ugly and I would like to remove it.
> It can't be correct to have a static status variable that oscillates
> between false and true with every record. This seems to be done
> because of the lack of a logical initialization call. Petr? Peter?
> Version with this removed (v4alt2)
>
> I've removed the edit that fusses over English grammar: both ways are correct.
>
>> I think this patch can be
>> back-patched to 9.4 as Simon mentioned.
>
> This patch appears to cause this DEBUG1 message
>
> "standby \"%s\" has now caught up with primary"
>
> which probably isn't the right message, but might be OK to backpatch.
>
> Thoughts on better wording?
>

I think that this DEBUG1 message appears when sent any WAL after
caught up even without this patch. This patch makes this message
appear at a properly timing. Or am I missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: BUG #14941: Vacuum crashes

2018-01-08 Thread Michael Paquier

On Thu, Dec 28, 2017 at 10:46:18PM +, Bossart, Nathan wrote:
> I agree, this makes more sense.  I've made this change in v3 of 0003.

Based on the opinions gathered on this thread, 0001 and 0002 seem to be
in the shape wanted, and those look good for shipping. Now for 0003 we
are not there yet.

-   if (relation == NULL)
+   if (relation != NULL)
+   relname = relation->relname;
+   else if (!rel_lock)
+   relname = get_rel_name(relid);
+
+   if (relname == NULL)
I think that you are doing it wrong here. In get_all_vacuum_rels() you
should build a RangeVar to be reused in the context of this error
message, and hence you'll save an extra lookup based on the relation
OID here, saving from any timing issues that you have overseen as in
this code path a lock on the relation whose name is looked at is not
taken. Relying on the RangeVar being NULL to not generate any logs is
fine as a concept to me, but let's fill it where it is needed, and in
the case of this patch a VACUUM NOWAIT on the whole database is such a
case.
--
Michael


signature.asc
Description: PGP signature

Re: Parallel append plan instability/randomness

2018-01-08 Thread Amit Kapila

On Tue, Jan 9, 2018 at 12:48 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila  wrote:
>>> One theory that can explain above failure is that the costs of
>>> scanning some of the sub-paths is very close due to which sometimes
>>> the results can vary.  If that is the case, then probably using
>>> fuzz_factor in costs comparison (as is done in attached patch) can
>>> improve the situation, may be we have to consider some other factors
>>> like number of rows in each subpath.
>
>> This isn't an acceptable solution because sorting requires that the
>> comparison operator satisfy the transitive property; that is, if a = b
>> and b = c then a = c.  With your proposed comparator, you could have a
>> = b and b = c but a < c.  That will break stuff.
>
>> It seems like the obvious fix here is to use a query where the
>> contents of the partitions are such that the sorting always produces
>> the same result.  We could do that either by changing the query or by
>> changing the data in the partitions or, maybe, by inserting ANALYZE
>> someplace.
>
> The foo_star tables are made in create_table.sql, filled in
> create_misc.sql, and not modified thereafter.  The fact that we have
> accurate rowcounts for them in select_parallel.sql is because of the
> database-wide VACUUM that happens at the start of sanity_check.sql.
> Given the lack of any WHERE condition, the costs in this particular query
> depend only on the rowcount and physical table size, so inserting an
> ANALYZE shouldn't (and doesn't, for me) change anything.  I would be
> concerned about side-effects on other queries anyway if we were to ANALYZE
> tables that have never been ANALYZEd in the regression tests before.
>

Fair point.  This seems to indicate that wrong rowcounts is probably
not the reason for the failure.  However, I think it might still be
good to use a different set of tables (probably create new tables with
appropriate data for these queries) and analyze them explicitly before
these queries rather than relying on the execution order of some
not-directly related tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Condition variable live lock

2018-01-08 Thread Robert Haas

On Mon, Jan 8, 2018 at 7:02 PM, Tom Lane  wrote:
> I'm not real sure BTW why we have some callers that ereport and some
> that just exit(1).  Seems like it would be better to be consistent,
> though I'm not entirely sure which behavior to standardize on.

I think at one point we had an idea that regular backends would FATAL
if the postmaster fell over and other processes (e.g. checkpointer,
bgwriter) would exit silently.  Whether that was the right idea, and
whether it's still is/was ever what the code did, I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Invalid pg_upgrade error message during live check

2018-01-08 Thread Bruce Momjian

On Mon, Jan  8, 2018 at 06:26:21PM -0500, Bruce Momjian wrote:
> On Mon, Jan  8, 2018 at 10:43:00AM -0500, Robert Haas wrote:
> > On Fri, Jan 5, 2018 at 9:11 PM, Bruce Momjian  wrote:
> > > On Fri, Jan  5, 2018 at 04:58:39PM -0500, Bruce Momjian wrote:
> > >> Pg_upgrade is able to run in --check mode when the old server is still
> > >> running.  Unfortunately, all supported versions of pg_upgrade generate
> > >> an incorrect (but harmless) "failure" message when doing this:
> > >
> > > Based on recent discussions, I am thinking we would just fix this in PG
> > > 10 and HEAD and leave the harmless error message in the other branches,
> > > right?
> > 
> > Hmm, not sure why you say that.  If the message is buggy and the fix
> > isn't too risky, might as well fix it all the way back.
> 
> OK, I will do that then.  The message is harmless, but confusing, so I
> would like to fix it, but it requires changing the API of two functions.
> I will see how cleanly it applies back to 9.3 and act accordingly.

I ended up just doing HEAD and PG10.  When you change a function API,
you have to change every call site, and that can cause the patch to be
large and make backpatching too far difficult, which was the case for
this patch.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Use of term hostaddrs for multiple hostaddr values

2018-01-08 Thread Michael Paquier

Hi all,

While looking at the documentation of libpq, I have noticed that the
term hostaddrs is used to qualify multiple values of hostaddr. This
looks incorrect to me, as this is not the name of a connection
parameter. Please find attached a patch to address this
inconsistency. One error message is also touched, impacting
translability.

Thanks,
--
Michael
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 4e4645136c..f0a288cd3e 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1010,8 +1010,8 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname

 

-A comma-separated list of hostaddrs is also 
accepted, in
-which case each host in the list is tried in order. See
+A comma-separated list of hostaddr values is also
+accepted, in which case each host in the list is tried in order. See
  for details.


diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
index 8d543334ae..77eebb0ba1 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -972,7 +972,7 @@ connectOptions2(PGconn *conn)
{
conn->status = CONNECTION_BAD;
printfPQExpBuffer(&conn->errorMessage,
- libpq_gettext("could 
not match %d host names to %d hostaddrs\n"),
+ libpq_gettext("could 
not match %d host names to %d hostaddr values\n"),
  
count_comma_separated_elems(conn->pghost), conn->nconnhost);
return false;
}


signature.asc
Description: PGP signature

Incorrect comment for expand_single_inheritance_child

2018-01-08 Thread Etsuro Fujita

Here is part of comments for expand_single_inheritance_child:

 * expand_single_inheritance_child
 *  Expand a single inheritance child, if needed.
 *
 * If this is a temp table of another backend, we'll return without doing
 * anything at all.  Otherwise, build a RangeTblEntry and an
AppendRelInfo, if
 * appropriate, plus maybe a PlanRowMark.

The sentence "If this is a temp table of another backend, we'll return
without doing anything at all" is not correct, because that function
doesn't even check whether the given child is such a temp table.  (The
caller does, so that function assumes that the given child is not such a
temp table.)  So I think we should remove that sentence entirely.  Also,
I think we should remove "if needed" from the headline: Expand a single
inheritance child, if needed.  IMO I don't think that "Expand a single
inheritance child" says much, so I'd like to propose changing that part
simply to something like this:

 * expand_single_inheritance_child
 *  Build a RangeTblEntry and an AppendRelInfo, if appropriate, plus
 *  maybe a PlanRowMark.

Attached is a patch for that.

Best regards,
Etsuro Fujita
*** a/src/backend/optimizer/prep/prepunion.c
--- b/src/backend/optimizer/prep/prepunion.c
***
*** 1634,1644  expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
  
  /*
   * expand_single_inheritance_child
!  *		Expand a single inheritance child, if needed.
!  *
!  * If this is a temp table of another backend, we'll return without doing
!  * anything at all.  Otherwise, build a RangeTblEntry and an AppendRelInfo, if
!  * appropriate, plus maybe a PlanRowMark.
   *
   * We now expand the partition hierarchy level by level, creating a
   * corresponding hierarchy of AppendRelInfos and RelOptInfos, where each
--- 1634,1641 
  
  /*
   * expand_single_inheritance_child
!  *		Build a RangeTblEntry and an AppendRelInfo, if appropriate, plus
!  *		maybe a PlanRowMark.
   *
   * We now expand the partition hierarchy level by level, creating a
   * corresponding hierarchy of AppendRelInfos and RelOptInfos, where each

Re: Comment typo in postgres_fdw.c

2018-01-08 Thread Etsuro Fujita


(2017/12/28 4:03), Robert Haas wrote:

On Wed, Dec 27, 2017 at 4:35 AM, Etsuro Fujita
  wrote:

Here is a small patch for fixing $Subject: s/it's/its/.


Committed.


Thanks!

Best regards,
Etsuro Fujita

refactor subscription tests to use PostgresNode's wait_for_catchup

2018-01-08 Thread Peter Eisentraut

It appears that we have unwittingly created some duplicate and
copy-and-paste-prone code in src/test/subscription/ to wait for a
replication subscriber to catch up, when we already have
almost-sufficient code in PostgresNode to do that more compactly.  So I
propose this patch to consolidate that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From daf3c2d02a409a33bff349c558d79dcdd966982e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 8 Jan 2018 17:32:09 -0500
Subject: [PATCH] Refactor subscription tests to use PostgresNode's
 wait_for_catchup

---
 src/test/perl/PostgresNode.pm  | 16 +---
 src/test/subscription/t/001_rep_changes.pl | 19 +--
 src/test/subscription/t/002_types.pl   | 15 ---
 src/test/subscription/t/003_constraints.pl | 15 ---
 src/test/subscription/t/004_sync.pl| 14 +++---
 src/test/subscription/t/005_encoding.pl| 13 ++---
 src/test/subscription/t/006_rewrite.pl | 17 -
 src/test/subscription/t/007_ddl.pl | 11 +--
 src/test/subscription/t/008_diff_schema.pl | 15 +++
 9 files changed, 39 insertions(+), 96 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 80f68df246..6062f41480 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1465,7 +1465,8 @@ sub lsn
 
 =item $node->wait_for_catchup(standby_name, mode, target_lsn)
 
-Wait for the node with application_name standby_name (usually from node->name)
+Wait for the node with application_name standby_name (usually from node->name,
+also works for logical subscriptions)
 until its replication location in pg_stat_replication equals or passes the
 upstream's WAL insert point at the time this function is called. By default
 the replay_lsn is waited for, but 'mode' may be specified to wait for any of
@@ -1477,6 +1478,7 @@ poll_query_until timeout.
 Requires that the 'postgres' db exists and is accessible.
 
 target_lsn may be any arbitrary lsn, but is typically 
$master_node->lsn('insert').
+If omitted, pg_current_wal_lsn() is used.
 
 This is not a test. It die()s on failure.
 
@@ -1497,7 +1499,15 @@ sub wait_for_catchup
{
$standby_name = $standby_name->name;
}
-   die 'target_lsn must be specified' unless defined($target_lsn);
+   my $lsn_expr;
+   if (defined($target_lsn))
+   {
+   $lsn_expr = "'$target_lsn'";
+   }
+   else
+   {
+   $lsn_expr = 'pg_current_wal_lsn()'
+   }
print "Waiting for replication conn "
  . $standby_name . "'s "
  . $mode
@@ -1505,7 +1515,7 @@ sub wait_for_catchup
  . $target_lsn . " on "
  . $self->name . "\n";
my $query =
-qq[SELECT '$target_lsn' <= ${mode}_lsn FROM pg_catalog.pg_stat_replication 
WHERE application_name = '$standby_name';];
+qq[SELECT $lsn_expr <= ${mode}_lsn FROM pg_catalog.pg_stat_replication WHERE 
application_name = '$standby_name';];
$self->poll_query_until('postgres', $query)
  or die "timed out waiting for catchup, current location is "
  . ($self->safe_psql('postgres', $query) || '(unknown)');
diff --git a/src/test/subscription/t/001_rep_changes.pl 
b/src/test/subscription/t/001_rep_changes.pl
index 0136c79d4b..e0104cd8d0 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -60,11 +60,7 @@
 "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr 
application_name=$appname' PUBLICATION tap_pub, tap_pub_ins_only"
 );
 
-# Wait for subscriber to finish initialization
-my $caughtup_query =
-"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication WHERE 
application_name = '$appname';";
-$node_publisher->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for subscriber to catch up";
+$node_publisher->wait_for_catchup($appname);
 
 # Also wait for initial table sync to finish
 my $synced_query =
@@ -93,8 +89,7 @@
 $node_publisher->safe_psql('postgres',
"INSERT INTO tab_mixed VALUES (2, 'bar')");
 
-$node_publisher->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for subscriber to catch up";
+$node_publisher->wait_for_catchup($appname);
 
 $result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab_ins");
@@ -132,9 +127,7 @@
 $node_publisher->safe_psql('postgres',
"UPDATE tab_full2 SET x = 'bb' WHERE x = 'b'");
 
-# Wait for subscription to catch up
-$node_publisher->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for subscriber to catch up";
+$node_publisher->wait_for_catchup($appname);
 
 $result = $node_subscriber->safe_psql('postgres',
"SELECT count(*), min(a), max(a) FROM tab_full");
@@ -176,8 +169,7 @@

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-08 Thread Peter Eisentraut

On 1/3/18 15:13, Tomas Vondra wrote:
> Forgot to mention that the v4 also extends the CREATE SUBSCRIPTION to
> allow customizing the streaming and memory limit. So you can do
> 
> CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
> 
> and this subscription will allow streaming, and the logica_work_mem (on
> provider) will be set to 1MB.

I was wondering already during PG10 development whether we should give
subscriptions a generic configuration array, like databases and roles
have, so we don't have to hardcode a bunch of similar stuff every time
we add an option like this.  At the time we only had synchronous_commit,
but now we're adding more.

Also, instead of sticking this into the START_REPLICATION command, could
we just run a SET command?  That should work over replication
connections as well.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-08 Thread Peter Eisentraut

On 1/3/18 14:53, Tomas Vondra wrote:
>> I don't see the need to tie this setting to maintenance_work_mem. 
>> maintenance_work_mem is often set to very large values, which could
>> then have undesirable side effects on this use.
> 
> Well, we need to pick some default value, and we can either use a fixed
> value (not sure what would be a good default) or tie it to an existing
> GUC. We only really have work_mem and maintenance_work_mem, and the
> walsender process will never use more than one such buffer. Which seems
> to be closer to maintenance_work_mem.
> 
> Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better.  We
don't have a lot of settings that depend on other settings, and the ones
we do have a very specific relationship.

>> Moreover, the name logical_work_mem makes it sound like it's a logical
>> version of work_mem.  Maybe we could think of another name.
> 
> I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

>> I think we need a way to report on how much memory is actually used,
>> so the setting can be tuned. Something analogous to log_temp_files
>> perhaps.
> 
> Yes, I agree. I'm just about to submit an updated version of the patch
> series, that also introduces a few columns pg_stat_replication, tracking
> this type of stats (amount of data spilled to disk or streamed, etc.).

That seems OK.  Perhaps we could bring forward the part of that patch
that applies to this feature.

That would also help testing *this* feature and determine what
appropriate settings are.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: Configurable file mode mask

2018-01-08 Thread Peter Eisentraut

On 1/3/18 08:11, Robert Haas wrote:
> On Tue, Jan 2, 2018 at 11:43 AM, David Steele  wrote:
 I think MakeDirectory() is a good wrapper, but isn't
>>> MakeDirectoryPerm() sort of silly?
>>
>> There's one place in the backend (storage/ipc/ipc.c) that sets non-default
>> directory permissions.  This function is intended to support that and any
>> extensions that need to set custom perms.
> 
> Yeah, but all it does is call mkdir(), which could just as well be
> called directly.  I think there's a pointer to a wrapper when it does
> something for you -- supply an argument, log something, handle
> portability concerns -- but this wrapper does exactly nothing.

Yeah, I didn't like this aspect when this patch was originally
submitted.  We want to keep the code legible for future new
contributors.  Having these generic-sounding but specific-in-purpose
wrapper functions can be pretty confusing.  Let's use mkdir() when it's
the appropriate function, and let's figure out a different name for
"make a data directory subdirectory in a secure and robust way".

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Andres Freund

Hi,

On 2018-01-08 20:20:09 -0500, Tom Lane wrote:
> No, you have it right.  I checked locally and confirmed Andres' assertion
> that by default, gcc (my version anyway) is not persuaded to inline
> ExecHashJoinImpl simply by "inline", but "always_inline" persuades it.
> Maybe at some level higher than -O2, or with some other weird flag,
> it would do what we want; but we probably don't want to mess with global
> compiler flags for this.

There's -finline-limit=n, but as you say, I don't think we want to mess
with this.

Greetings,

Andres Freund

Re: proposal: alternative psql commands quit and exit

2018-01-08 Thread Alvaro Herrera

Everaldo Canuto wrote:
> Change my patch will make psql different from other sql clients I use
> (sqlplus and mysql).

Well, I don't think we're too hot about copying their behavior exactly.
This thread discussed the behavior at length and a consensus was found
after much thinking.  No one is imposing anything.

> I am happy with my custom version of psql so you can cancel/delete/ignore
> my patch.

We're not happy to lose a potential contributor, but of course that's
your prerogative.

Can we find somebody else willing to implement the agreed-upon behavior?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: portal pinning

2018-01-08 Thread Peter Eisentraut

On 1/8/18 15:27, Andrew Dunstan wrote:
> This seems like a good idea, and the code change is tiny and clean. I
> don't know of any third party PLs or other libraries might be pinning
> the portals already on their own. How would they be affected if they did?

They would get an error if they tried to pin it a second time.  So this
would require a small source-level adjustment.  But I doubt this is
actually the case anywhere, seeing that we are not even using this
consistently in core.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Tom Lane

Peter Geoghegan  writes:
> On Mon, Jan 8, 2018 at 5:09 PM, Peter Eisentraut
>  wrote:
>> However, at  it says,
>> "GCC does not inline any functions when not optimizing unless you
>> specify the ‘always_inline’ attribute for the function".  So,
>> apparently, if the goal is to turn off inlining when not optimizing,
>> then we should just use the normal inline attribute.

> The compiler isn't obligated to inline anything with the normal inline
> attribute. The whole point of always_inline is that the programmer may
> know better than the compiler about inlining in some specific cases,
> and may therefore want to make inlining absolutely mandatory. IIUC,
> that's almost what we want, except that it also inlines with -O0,
> which we do not want.

> Have I missed the point here?

No, you have it right.  I checked locally and confirmed Andres' assertion
that by default, gcc (my version anyway) is not persuaded to inline
ExecHashJoinImpl simply by "inline", but "always_inline" persuades it.
Maybe at some level higher than -O2, or with some other weird flag,
it would do what we want; but we probably don't want to mess with global
compiler flags for this.

regards, tom lane

Re: Enhance pg_stat_wal_receiver view to display connected host

2018-01-08 Thread Michael Paquier

On Fri, Jan 05, 2018 at 09:15:36AM -0300, Alvaro Herrera wrote:
> Haribabu Kommi wrote:
> 
> > And also not returning "default host" details, because for the conninfo
> > without any host details, the return value must be NULL. But this change
> > may break the backward compatibility of the function.
> 
> I wouldn't want to have to fight that battle.

Hm. Any users of psql's PROMPT would be equally confused, and this can
actually lead to more confusion from the user prospective I think than
just pg_stat_wal_receiver. If you take the case of Haribabu from
upthread with say this bit in psqlrc:
\set PROMPT1 '[host=%M;port=%>]=%# '

Then you get on HEAD the following set of results using different
connection strings:
1) host=localhost,localhost hostaddr=127.0.0.1,127.0.0.1 port=5432,5433
[host=localhost,localhost;port=5432]=#

2) host=localhost,localhost port=5432,5433
[host=localhost;port=5432]=# 

3) hostaddr=127.0.0.1,127.0.0.1 port=5432,5433
[host=[local];port=5432]=# 

4) host=/tmp,tmp hostaddr=127.0.0.1,127.0.0.1
[host=[local:/tmp,tmp];port=5432]=# 

So for cases 2) and 4), mixing both hostaddr and host is hurting the
experience. The documentation in [1] also specifies that if both host
and hostaddrs are specified then host is ignored. The same rule applies
for multiple values so for 2) and 4) the correct values ought to be
"local" for both of them. This would be more consistent with the pre-9.6
behavior as well.

[1]: https://www.postgresql.org/docs/current/static/libpq-connect.html
--
Michael

signature.asc
Description: PGP signature

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Peter Geoghegan

On Mon, Jan 8, 2018 at 5:09 PM, Peter Eisentraut
 wrote:
> However, at  it says,
> "GCC does not inline any functions when not optimizing unless you
> specify the ‘always_inline’ attribute for the function".  So,
> apparently, if the goal is to turn off inlining when not optimizing,
> then we should just use the normal inline attribute.

The compiler isn't obligated to inline anything with the normal inline
attribute. The whole point of always_inline is that the programmer may
know better than the compiler about inlining in some specific cases,
and may therefore want to make inlining absolutely mandatory. IIUC,
that's almost what we want, except that it also inlines with -O0,
which we do not want.

Have I missed the point here?

-- 
Peter Geoghegan

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Andres Freund

Hi,

On 2018-01-08 20:09:27 -0500, Peter Eisentraut wrote:
> On 1/8/18 19:56, Tom Lane wrote:
> > Peter Geoghegan  writes:
> >> Anyway, ISTM that it should be possible to make
> >> pg_attribute_always_inline have no effect in typical debug builds.
> >> Wouldn't that make everyone happy?
> > 
> > That would improve matters, but do we have access to the -O switch
> > level as an #if condition?
> 
> See __OPTIMIZE__ and __NO_INLINE__ here:
> https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html

Yea, __OPTIMIZE__ might work.

> However, at  it says,
> "GCC does not inline any functions when not optimizing unless you
> specify the ‘always_inline’ attribute for the function".  So,
> apparently, if the goal is to turn off inlining when not optimizing,
> then we should just use the normal inline attribute.

See 
http://archives.postgresql.org/message-id/20180109001935.s42ovj3uwmwygqzu%40alap3.anarazel.de

Greetings,

Andres Freund

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Peter Eisentraut

On 1/8/18 19:56, Tom Lane wrote:
> Peter Geoghegan  writes:
>> Anyway, ISTM that it should be possible to make
>> pg_attribute_always_inline have no effect in typical debug builds.
>> Wouldn't that make everyone happy?
> 
> That would improve matters, but do we have access to the -O switch
> level as an #if condition?

See __OPTIMIZE__ and __NO_INLINE__ here:
https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html

However, at  it says,
"GCC does not inline any functions when not optimizing unless you
specify the ‘always_inline’ attribute for the function".  So,
apparently, if the goal is to turn off inlining when not optimizing,
then we should just use the normal inline attribute.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-01-08 Thread Masahiko Sawada

On Mon, Jan 1, 2018 at 7:12 PM, Ashutosh Bapat
 wrote:
> On Thu, Dec 28, 2017 at 11:08 AM, Masahiko Sawada  
> wrote:
>>
>>> (1)
>>> Why don't you use the existing global variable MyXactFlags instead of the 
>>> new TransactionDidWrite?  Or, how about using XactLastRecEnd != 0 to 
>>> determine the transaction did any writes?  When the transaction only 
>>> modified temporary tables on the local database and some data on one remote 
>>> database, I think 2pc is unnecessary.
>>
>> Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if
>> we did any writes on local node which requires the atomic commit. Will
>> fix.
>>
>
> I haven't checked how much code it needs to track whether the local
> transaction wrote anything. But probably we can post-pone this
> optimization. If it's easy to incorporate, it's good to have in the
> first set itself.
>

Without the track of local writes, we always have to use two-phase
commit even when the transaction write data on only one foreign
server. It will be cause of unnecessary performance degradation and
cause of transaction failure on existing systems. We can set the using
two-phase commit per foreign server by ALTER SERVER but it will affect
other transactions. If we can configure it per transaction perhaps it
will be worth to postpone.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Tom Lane

Peter Geoghegan  writes:
> Anyway, ISTM that it should be possible to make
> pg_attribute_always_inline have no effect in typical debug builds.
> Wouldn't that make everyone happy?

That would improve matters, but do we have access to the -O switch
level as an #if condition?  The use-case I'm generally worried about
involves recompiling individual files at -O0, with something like
make PROFILE=-O0 nodeHashjoin.o
so trying to, say, get configure to adjust the macro isn't really going
to help.

The other half of the problem is the warnings, but I think Thomas'
patch would alleviate that.

regards, tom lane

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Peter Geoghegan

On Mon, Jan 8, 2018 at 4:22 PM, Andres Freund  wrote:
>> Isn't that an argument against inlining in general, rather than
>> forcing inlining in particular?
>
> No. Normal 'inline' annotation doesn't do anything on -O0 / debug
> builds. But always_inline does, even though the goal of the usage is
> just to override the compiler's inlining heuristics.

Sorry, I found the way this was discussed confusing.

Anyway, ISTM that it should be possible to make
pg_attribute_always_inline have no effect in typical debug builds.
Wouldn't that make everyone happy?

-- 
Peter Geoghegan

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Andres Freund

On 2018-01-08 16:20:26 -0800, Peter Geoghegan wrote:
> On Mon, Jan 8, 2018 at 4:12 PM, Tom Lane  wrote:
> > When I complained that always_inline inhibits debuggability, I did NOT
> > mean what shows up in perf reports.  I'm talking about whether you can
> > break at, or single-step through, a function reliably and whether gdb
> > knows where all the variables are.  In my experience, inlining hurts
> > both of those things, which is why I'm saying that forcing inlining
> > even in non-optimized builds is a bad idea.
> 
> Isn't that an argument against inlining in general, rather than
> forcing inlining in particular?

No. Normal 'inline' annotation doesn't do anything on -O0 / debug
builds. But always_inline does, even though the goal of the usage is
just to override the compiler's inlining heuristics.

Greetings,

Andres Freund

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Peter Geoghegan

On Mon, Jan 8, 2018 at 4:12 PM, Tom Lane  wrote:
> When I complained that always_inline inhibits debuggability, I did NOT
> mean what shows up in perf reports.  I'm talking about whether you can
> break at, or single-step through, a function reliably and whether gdb
> knows where all the variables are.  In my experience, inlining hurts
> both of those things, which is why I'm saying that forcing inlining
> even in non-optimized builds is a bad idea.

Isn't that an argument against inlining in general, rather than
forcing inlining in particular?

-- 
Peter Geoghegan

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Andres Freund

On 2018-01-08 19:12:08 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Unless this pg_attribute_always_inline gets a lot more widely
> > proliferated I don't see a need to change anything. Debuggability isn't
> > meaningfully impacted by seing more runtime attributed to
> > ExecHashJoin/ExecParallelHashJoin rather than ExecHashJoinImpl.
> 
> When I complained that always_inline inhibits debuggability, I did NOT
> mean what shows up in perf reports.  I'm talking about whether you can
> break at, or single-step through, a function reliably and whether gdb
> knows where all the variables are.  In my experience, inlining hurts
> both of those things, which is why I'm saying that forcing inlining
> even in non-optimized builds is a bad idea.

Yea, I got that, I just don't think it's a strong argument for the cases
here.


> If we needed this thing to cause inlining even in optimized builds,
> there might be a case for it; but that is not what I'm reading in
> the gcc manual.

That's what it's doing here however:

(gdb) disassemble ExecHashJoin
Dump of assembler code for function ExecHashJoin:
   0x12f0 <+0>: xor%esi,%esi
   0x12f2 <+2>: jmpq   0x530 
End of assembler dump.
(gdb) quit

$ git diff
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -161,7 +161,7 @@ static void 
ExecParallelHashJoinPartitionOuter(HashJoinState *node);
  *   the other one is "outer".
  * 
  */
-pg_attribute_always_inline
+//pg_attribute_always_inline
 static inline TupleTableSlot *
 ExecHashJoinImpl(PlanState *pstate, bool parallel)
 {


reverting that hunk:

make nodeHashjoin.o
gcc-7 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -O3 -ggdb -g3 
-Wmissing-declarations -Wmissing-prototypes -Wall -Wextra -Wno-unused-parameter 
-Wno-sign-compare -Wno-missing-field-initializers -Wempty-body -Wno-clobbered 
-march=native -mtune=native -Wno-implicit-fallthrough -I../../../src/include 
-I/home/andres/src/postgresql-master/src/include  -D_GNU_SOURCE 
-I/usr/include/libxml2   -c -o nodeHashjoin.o 
/home/andres/src/postgresql-master/src/backend/executor/nodeHashjoin.c -MMD -MP 
-MF .deps/nodeHashjoin.Po

$ gdb nodeHashjoin.o
(gdb) disassemble ExecHashJoin
Dump of assembler code for function ExecHashJoin:
   0x0530 <+0>: push   %r15
   0x0532 <+2>: mov%rdi,%r15
   0x0535 <+5>: push   %r14
...[whole function]

If this were just to get it to force inlining in assertion builds, I'd
certainly not agree that it makes sense. But here it's really about
forcing the compilers hand in the optimized case.

Greetings,

Andres Freund

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Tom Lane

Andres Freund  writes:
> Unless this pg_attribute_always_inline gets a lot more widely
> proliferated I don't see a need to change anything. Debuggability isn't
> meaningfully impacted by seing more runtime attributed to
> ExecHashJoin/ExecParallelHashJoin rather than ExecHashJoinImpl.

When I complained that always_inline inhibits debuggability, I did NOT
mean what shows up in perf reports.  I'm talking about whether you can
break at, or single-step through, a function reliably and whether gdb
knows where all the variables are.  In my experience, inlining hurts
both of those things, which is why I'm saying that forcing inlining
even in non-optimized builds is a bad idea.

If we needed this thing to cause inlining even in optimized builds,
there might be a case for it; but that is not what I'm reading in
the gcc manual.

regards, tom lane

Re: Condition variable live lock

2018-01-08 Thread Tom Lane

Thomas Munro  writes:
> On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane  wrote:
>> Another loose end that I'm seeing here is that while a process waiting on
>> a condition variable will respond to a cancel or die interrupt, it will
>> not notice postmaster death.  This seems unwise to me.  I think we should
>> adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
>> condition and just do a summary proc_exit(1) if it sees that.  I'd even
>> argue that that is a back-patchable bug fix.

> Yeah.  As far as I know so far, every place where we wait on a
> WaitEventSet falls into one of 4 categories when it comes to
> postmaster death:

> 1.  We proc_exit(1) if WL_POSTMASTER_DEATH is returned.
> 2.  We ereport(FATAL) if WL_POSTMASTER_DEATH is returned.
> 3.  We asked for WL_POSTMASTER_DEATH in the WaitEventSet, but we don't
> actually bother checking for it in the returned value.  Instead we
> call PostmasterIsAlive() every time through the loop (pgarch.c,
> syncrep.c, walsender.c and walreceiver.c).
> 4.  We didn't ask for WL_POSTMASTER_DEATH.

> Do I have that right?  I guess category 3 is suboptimal especially on
> some clunky but loveable kernels[1] and all cases of 4 including this
> one are probably bugs.  That makes me wonder why we don't change the
> WaitEventSet API so that it calls proc_exit(1) for you by default if
> you didn't ask to receive WL_POSTMASTER_DEATH explicitly, to handle
> category 1 for free.  If you asked for WL_POSTMASTER_DEATH explicitly
> then we could return it, to support category 2 callers that want to do
> something different.  Just a thought.

Yeah, that's worth thinking about, because I'm pretty sure that we have
had this type of bug before.  We have to be a bit careful because
Andres is thinking of using the WaitEventState support in the postmaster
loop, where it definitely shouldn't check WL_POSTMASTER_DEATH, and it
is probably also possible to reach some of those calls in standalone
backends, which shouldn't either.  So I'm imagining:

1. The WaitEventSet code always includes WL_POSTMASTER_DEATH checking
if IsUnderPostmaster --- and conversely, if !IsUnderPostmaster, it should
ignore any caller request for that.

2. If caller specifies WL_POSTMASTER_DEATH then we'll return that
flag bit, otherwise just exit(1) --- or should the default be
ereport(FATAL)?

3. Remove explicit specification of WL_POSTMASTER_DEATH anywhere that
the default handling is OK, which is probably almost everywhere.
Get rid of those now-redundant PostmasterIsAlive checks, too.

I'm not real sure BTW why we have some callers that ereport and some
that just exit(1).  Seems like it would be better to be consistent,
though I'm not entirely sure which behavior to standardize on.

(Of course, this would only be an appropriate thing to do in HEAD.
In v10 I think we should do the narrow fix I suggested up top.)

regards, tom lane

Re: Condition variable live lock

2018-01-08 Thread Tom Lane

Thomas Munro  writes:
> On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane  wrote:
>> I began with the intention of making no non-cosmetic changes, but then
>> I started to wonder why ConditionVariablePrepareToSleep bothers with a
>> !proclist_contains test, when the calling process surely ought not be
>> in the list -- or any other list -- since it wasn't prepared to sleep.
>> And that led me down a different rabbit hole ending in the conclusion
>> that proclist.h could stand some improvements too.

> +1

>> The second attached patch is the cosmetic changes I want to make in
>> condition_variable.c/.h.

> +1

I pushed these, with an addition to ConditionVariablePrepareToSleep's
comment emphasizing that you're supposed to call it before the loop test;
this because some desultory digging found that one of the existing callers
is violating that rule.  replorigin_drop() in replication/logical/origin.c
has

cv = &state->origin_cv;

LWLockRelease(ReplicationOriginLock);
ConditionVariablePrepareToSleep(cv);
ConditionVariableSleep(cv, WAIT_EVENT_REPLICATION_ORIGIN_DROP);
ConditionVariableCancelSleep();
goto restart;

and unless I'm much mistaken, that is flat broken.  Somebody could
change the ReplicationState's acquired_by before we reach
ConditionVariablePrepareToSleep, in which case we won't get any signal
for that and will just sleep here indefinitely.

It would still be broken if we removed the PrepareToSleep call, but
at least it'd be a busy-wait not a hang.

Not sure about a convenient fix.  One idea is to move the
PrepareToSleep call inside the hold on ReplicationOriginLock, which
would fix things only if the various signallers of origin_cv did the
signalling while holding that lock, which they do not.  It's not clear
to me whether making them do so inside that lock would be problematic
for performance.

We can't just move the prepare/cancel sleep calls to around this whole
loop, because we do not know outside the loop which CV is to be waited
on.  So another idea is to get rid of that and have just one CV for all
ReplicationStates.

Or (and I'm sure Robert sees this coming), if we applied my proposed
patch to let ConditionVariableSleep auto-switch to a different CV,
then we could fix this code by simply removing the PrepareToSleep and
moving the ConditionVariableCancelSleep call to below the loop.
This would work even in the perhaps-unlikely case where the slot as
identified by roident changed positions, though you'd get an additional
trip through the loop (a/k/a test of the exit condition) if that
happened.

>> Another loose end that I'm seeing here is that while a process waiting on
>> a condition variable will respond to a cancel or die interrupt, it will
>> not notice postmaster death.

> Yeah.

I'll respond to that separately to keep it from getting confused with
this replorigin_drop() bug.

regards, tom lane

Re: Unimpressed with pg_attribute_always_inline

2018-01-08 Thread Andres Freund

On 2018-01-05 00:11:19 +1300, Thomas Munro wrote:
> On Tue, Jan 2, 2018 at 4:58 PM, Thomas Munro
>  wrote:
> > On Tue, Jan 2, 2018 at 4:17 PM, Tom Lane  wrote:
> >>  gaur  | nodeHashjoin.c:167: warning: `always_inline' attribute 
> >> directive ignored
> >>  mastodon  | .\src\backend\executor\nodeHashjoin.c(165): warning C4141: 
> >> 'inline' : used more than once
> >
> > 1. MSVC doesn't like you to say both "__forceinline" and "inline".
> >
> > 2.  GCC 2.95.3 doesn't understand always_inline.  From a quick look at
> > archived manuals, it seems that that attribute arrived in 3.1.
> 
> Here is one way to fix those warnings.  Thoughts?

That looks good to me.

> >> Therefore, I think that pg_attribute_always_inline is not merely
> >> useless but actively bad, and should be removed.
> 
> How about a macro PG_NOINLINE, which, if defined, inhibits this?  Or I
> could give up this strategy and maintain two separate very similar
> functions, ExecHashJoin and ExecParallelHashJoin.

Unless this pg_attribute_always_inline gets a lot more widely
proliferated I don't see a need to change anything. Debuggability isn't
meaningfully impacted by seing more runtime attributed to
ExecHashJoin/ExecParallelHashJoin rather than ExecHashJoinImpl. If this
were used on functions that are called from a lot of places that
argument would hold some weight, but for now I'm really not seing it.

- Andres

Re: [HACKERS] [PATCH] Generic type subscripting

2018-01-08 Thread Andres Freund

Hi,

On 2018-01-04 16:47:50 +0100, Dmitry Dolgov wrote:
> diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
> index 16f908037c..ee50fda4ef 100644
> --- a/src/backend/executor/execExpr.c
> +++ b/src/backend/executor/execExpr.c
> @@ -64,7 +64,8 @@ static void ExecInitExprSlots(ExprState *state, Node *node);
>  static bool get_last_attnums_walker(Node *node, LastAttnumInfo *info);
>  static void ExecInitWholeRowVar(ExprEvalStep *scratch, Var *variable,
>   ExprState *state);
> -static void ExecInitArrayRef(ExprEvalStep *scratch, ArrayRef *aref,
> +static void ExecInitSubscriptingRef(ExprEvalStep *scratch,
> +  SubscriptingRef *sbsref,
>ExprState *state,
>Datum *resv, bool *resnull);
>  static bool isAssignmentIndirectionExpr(Expr *expr);
> @@ -857,11 +858,11 @@ ExecInitExprRec(Expr *node, ExprState *state,
>   break;
>   }
>  
> - case T_ArrayRef:
> + case T_SubscriptingRef:
>   {
> - ArrayRef   *aref = (ArrayRef *) node;
> + SubscriptingRef   *sbsref = (SubscriptingRef *) 
> node;
>  
> - ExecInitArrayRef(&scratch, aref, state, resv, 
> resnull);
> + ExecInitSubscriptingRef(&scratch, sbsref, 
> state, resv, resnull);
>   break;
>   }
>  
> @@ -1176,7 +1177,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
>   /*
>* Use the CaseTestExpr mechanism to 
> pass down the old
>* value of the field being replaced; 
> this is needed in
> -  * case the newval is itself a 
> FieldStore or ArrayRef that
> +  * case the newval is itself a 
> FieldStore or SubscriptingRef that
>* has to obtain and modify the old 
> value.  It's safe to
>* reuse the CASE mechanism because 
> there cannot be a CASE
>* between here and where the value 
> would be needed, and a
> @@ -2401,34 +2402,40 @@ ExecInitWholeRowVar(ExprEvalStep *scratch, Var 
> *variable, ExprState *state)
>  }
>  
>  /*
> - * Prepare evaluation of an ArrayRef expression.
> + * Prepare evaluation of a SubscriptingRef expression.
>   */
>  static void
> -ExecInitArrayRef(ExprEvalStep *scratch, ArrayRef *aref,
> +ExecInitSubscriptingRef(ExprEvalStep *scratch, SubscriptingRef *sbsref,
>ExprState *state, Datum *resv, bool *resnull)
>  {
> - boolisAssignment = (aref->refassgnexpr != NULL);
> - ArrayRefState *arefstate = palloc0(sizeof(ArrayRefState));
> - List   *adjust_jumps = NIL;
> - ListCell   *lc;
> - int i;
> + bool isAssignment = (sbsref->refassgnexpr 
> != NULL);
> + SubscriptingRefState *sbsrefstate = 
> palloc0(sizeof(SubscriptingRefState));
> + List *adjust_jumps = NIL;
> + ListCell *lc;
> + int   i;
> + FmgrInfo *eval_finfo, *nested_finfo;
> +
> + eval_finfo = palloc0(sizeof(FmgrInfo));
> + nested_finfo = palloc0(sizeof(FmgrInfo));
> +
> + fmgr_info(sbsref->refevalfunc, eval_finfo);
> + if (OidIsValid(sbsref->refnestedfunc))
> + {
> + fmgr_info(sbsref->refnestedfunc, nested_finfo);
> + }
>  
> - /* Fill constant fields of ArrayRefState */
> - arefstate->isassignment = isAssignment;
> - arefstate->refelemtype = aref->refelemtype;
> - arefstate->refattrlength = get_typlen(aref->refarraytype);
> - get_typlenbyvalalign(aref->refelemtype,
> -  &arefstate->refelemlength,
> -  &arefstate->refelembyval,
> -  &arefstate->refelemalign);
> + /* Fill constant fields of SubscriptingRefState */
> + sbsrefstate->isassignment = isAssignment;
> + sbsrefstate->refelemtype = sbsref->refelemtype;
> + sbsrefstate->refattrlength = get_typlen(sbsref->refcontainertype);
>  
>   /*
>* Evaluate array input.  It's safe to do so into resv/resnull, because 
> we
>* won't use that as target for any of the other subexpressions, and 
> it'll
> -  * be overwritten by the final EEOP_ARRAYREF_FETCH/ASSIGN step, which is
> +  * be overwritten by the final EEOP_SBSREF_FETCH/ASSIGN step, which is
>* pushed last.
>*/
> - ExecInitExprRec(aref->refexpr, state, resv, resnull);
> + ExecInitExprRec(sbs

Re: Invalid pg_upgrade error message during live check

2018-01-08 Thread Bruce Momjian

On Mon, Jan  8, 2018 at 10:43:00AM -0500, Robert Haas wrote:
> On Fri, Jan 5, 2018 at 9:11 PM, Bruce Momjian  wrote:
> > On Fri, Jan  5, 2018 at 04:58:39PM -0500, Bruce Momjian wrote:
> >> Pg_upgrade is able to run in --check mode when the old server is still
> >> running.  Unfortunately, all supported versions of pg_upgrade generate
> >> an incorrect (but harmless) "failure" message when doing this:
> >
> > Based on recent discussions, I am thinking we would just fix this in PG
> > 10 and HEAD and leave the harmless error message in the other branches,
> > right?
> 
> Hmm, not sure why you say that.  If the message is buggy and the fix
> isn't too risky, might as well fix it all the way back.

OK, I will do that then.  The message is harmless, but confusing, so I
would like to fix it, but it requires changing the API of two functions.
I will see how cleanly it applies back to 9.3 and act accordingly.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

I wrote:
> BTW, I had not looked closely at list_qsort before, but now that I have
> it seems seriously bizarre.  Why is it allocating a new List header
> rather than reusing the old one?  Because it stomps on the next-links
> of the list cells, the old header is effectively corrupted and can't
> possibly be useful afterwards, so we might as well re-use it and save
> a palloc cycle.  (By the same token, it's pretty ludicrous that it
> claims the input List is "const".  Not stomping on the list header
> is a pretty poor definition of not modifying the list.)

Actually, after looking closer, I think the current implementation
of list_qsort is actively dangerous.  If we left the code like this,
we would have to document that create_append_path corrupts its caller's
copies of subpaths and partial_subpaths when parallel_aware is true
(and by "corrupt" I don't just mean "reorder": the list headers the
caller has pointers to are now flat wrong).  Even with the change
I imagine above, we'd have to document that the caller's lists get
reordered.  Although the sole caller that's currently affected doesn't
currently do anything with those lists after the call, it's not terribly
hard to imagine that future improvements there would lead to trying to
build multiple paths with the same lists or something like that, so
that we'd have hidden interactions between different Paths.  The same
types of hazards would apply to every future use of list_qsort.

(In fact, I spent some time trying to think of a way that this could have
led to the silverfish failure we started the thread with.  I couldn't
convince myself that it was possible, but I'm not entirely convinced it's
not, either.)

So I think that re-using the caller's list cells is another penny-wise-
but-pound-foolish idea, and that we ought to just build a new list to
avoid the hazard.  As per attached.

regards, tom lane

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 083538f..fe085d2 100644
*** a/src/backend/nodes/list.c
--- b/src/backend/nodes/list.c
*** list_copy_tail(const List *oldlist, int 
*** 1250,1290 
  }
  
  /*
!  * Sort a list using qsort. A sorted list is built but the cells of the
!  * original list are re-used.  The comparator function receives arguments of
!  * type ListCell **
   */
  List *
  list_qsort(const List *list, list_qsort_comparator cmp)
  {
- 	ListCell   *cell;
- 	int			i;
  	int			len = list_length(list);
  	ListCell  **list_arr;
! 	List	   *new_list;
  
  	if (len == 0)
  		return NIL;
  
  	i = 0;
- 	list_arr = palloc(sizeof(ListCell *) * len);
  	foreach(cell, list)
  		list_arr[i++] = cell;
  
  	qsort(list_arr, len, sizeof(ListCell *), cmp);
  
! 	new_list = (List *) palloc(sizeof(List));
! 	new_list->type = list->type;
! 	new_list->length = len;
! 	new_list->head = list_arr[0];
! 	new_list->tail = list_arr[len - 1];
  
! 	for (i = 0; i < len - 1; i++)
! 		list_arr[i]->next = list_arr[i + 1];
  
! 	list_arr[len - 1]->next = NULL;
  	pfree(list_arr);
! 	return new_list;
  }
  
  /*
--- 1250,1312 
  }
  
  /*
!  * Sort a list as though by qsort.
!  *
!  * A new list is built and returned.
!  * The comparator function receives arguments of type ListCell **.
   */
  List *
  list_qsort(const List *list, list_qsort_comparator cmp)
  {
  	int			len = list_length(list);
  	ListCell  **list_arr;
! 	List	   *newlist;
! 	ListCell   *newlist_prev;
! 	ListCell   *cell;
! 	int			i;
  
+ 	/* Empty list is easy */
  	if (len == 0)
  		return NIL;
  
+ 	/* Flatten list cells into an array, so we can use qsort */
+ 	list_arr = (ListCell **) palloc(sizeof(ListCell *) * len);
  	i = 0;
  	foreach(cell, list)
  		list_arr[i++] = cell;
  
  	qsort(list_arr, len, sizeof(ListCell *), cmp);
  
! 	/* Construct new list (this code is much like list_copy) */
! 	newlist = new_list(list->type);
! 	newlist->length = len;
  
! 	/*
! 	 * Copy over the data in the first cell; new_list() has already allocated
! 	 * the head cell itself
! 	 */
! 	newlist->head->data = list_arr[0]->data;
  
! 	newlist_prev = newlist->head;
! 	for (i = 1; i < len; i++)
! 	{
! 		ListCell   *newlist_cur;
! 
! 		newlist_cur = (ListCell *) palloc(sizeof(*newlist_cur));
! 		newlist_cur->data = list_arr[i]->data;
! 		newlist_prev->next = newlist_cur;
! 
! 		newlist_prev = newlist_cur;
! 	}
! 
! 	newlist_prev->next = NULL;
! 	newlist->tail = newlist_prev;
! 
! 	/* Might as well free the workspace array */
  	pfree(list_arr);
! 
! 	check_list_invariants(newlist);
! 	return newlist;
  }
  
  /*

Re: Challenges preventing us moving to 64 bit transaction id (XID)?

2018-01-08 Thread Alexander Korotkov

Hi, Ryan!

On Sat, Jan 6, 2018 at 10:10 PM, Ryan Murphy  wrote:

> Thanks for this contribution!
> I think it's a hard but important problem to upgrade these xids.
>
> Unfortunately, I've confirmed that this patch
> 0001-64bit-guc-relopt-3.patch doesn't apply correctly on my computer.
>
> Here's what I did:
>
> I did a "git pull" to the current HEAD, which is
> 6271fceb8a4f07dafe9d67dcf7e849b319bb2647
>
> Then I attempted to apply the patch, here's what I saw:
>
> $ git apply patches/0001-64bit-guc-relopt-3.patch
> error: src/backend/access/common/reloptions.c: already exists in working
> directory
> error: src/backend/utils/misc/guc.c: already exists in working directory
> error: src/include/access/reloptions.h: already exists in working
> directory
> error: src/include/utils/guc.h: already exists in working directory
> error: src/include/utils/guc_tables.h: already exists in working directory
>
> Alexander, what is the process you're using to create the patch?  I've
> heard someone (maybe Tom Lane?) say that he sometimes uses "patch" directly
> instead of "git" to create the patch, with better results.  I forget the
> exact command.

I've created patches using context diff, as described in PostgreSQL wiki.
https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
I already noticed that it causing troubles to some community members who
use 'git apply'.  And also I noticed that majority of patches nowadays are
sent using universal format.  So, I decided to switch to universal format
too.  I'm working on rebasing patchset, that takes some time...  Next
revision will be sent in universal format.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

I wrote:
>> In the regression test case at hand, the startup costs are all zero so
>> this change wouldn't improve the test case's stability.  So I'm thinking
>> that in addition, it would be a good idea for these functions to break
>> exact compare_path_costs ties in some arbitrary but deterministic way,
>> rather than letting qsort() have the final say on what happens.  If the
>> subplans were all simple relation scans we could order them by planner
>> relid, but I'm not sure what to do if they're not.

> Ah, I have an idea --- let's create a bms_compare() qsort comparator for
> bitmapsets, and use that on the paths' relid sets.  It hardly matters
> what the exact semantics of the comparator are, as long as it behaves
> sanely for equal sets.

Concretely, I propose the attached.  I spent a little bit of extra effort
on the comparator in hopes that it might be useful for something else.

regards, tom lane

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 733fe3c..edcd19a 100644
*** a/src/backend/nodes/bitmapset.c
--- b/src/backend/nodes/bitmapset.c
*** bms_equal(const Bitmapset *a, const Bitm
*** 173,178 
--- 173,222 
  }
  
  /*
+  * bms_compare - qsort-style comparator for bitmapsets
+  *
+  * This guarantees to report values as equal iff bms_equal would say they are
+  * equal.  Otherwise, the highest-numbered bit that is set in one value but
+  * not the other determines the result.  (This rule means that, for example,
+  * {6} is greater than {5}, which seems plausible.)
+  */
+ int
+ bms_compare(const Bitmapset *a, const Bitmapset *b)
+ {
+ 	int			shortlen;
+ 	int			i;
+ 
+ 	/* Handle cases where either input is NULL */
+ 	if (a == NULL)
+ 		return bms_is_empty(b) ? 0 : -1;
+ 	else if (b == NULL)
+ 		return bms_is_empty(a) ? 0 : +1;
+ 	/* Handle cases where one input is longer than the other */
+ 	shortlen = Min(a->nwords, b->nwords);
+ 	for (i = shortlen; i < a->nwords; i++)
+ 	{
+ 		if (a->words[i] != 0)
+ 			return +1;
+ 	}
+ 	for (i = shortlen; i < b->nwords; i++)
+ 	{
+ 		if (b->words[i] != 0)
+ 			return -1;
+ 	}
+ 	/* Process words in common */
+ 	i = shortlen;
+ 	while (--i >= 0)
+ 	{
+ 		bitmapword	aw = a->words[i];
+ 		bitmapword	bw = b->words[i];
+ 
+ 		if (aw != bw)
+ 			return (aw > bw) ? +1 : -1;
+ 	}
+ 	return 0;
+ }
+ 
+ /*
   * bms_make_singleton - build a bitmapset containing a single member
   */
  Bitmapset *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761..5c2b996 100644
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*** create_append_path(RelOptInfo *rel,
*** 1274,1311 
  
  /*
   * append_total_cost_compare
!  *	  list_qsort comparator for sorting append child paths by total_cost
   */
  static int
  append_total_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
  
! 	if (path1->total_cost > path2->total_cost)
! 		return -1;
! 	if (path1->total_cost < path2->total_cost)
! 		return 1;
! 
! 	return 0;
  }
  
  /*
   * append_startup_cost_compare
!  *	  list_qsort comparator for sorting append child paths by startup_cost
   */
  static int
  append_startup_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
  
! 	if (path1->startup_cost > path2->startup_cost)
! 		return -1;
! 	if (path1->startup_cost < path2->startup_cost)
! 		return 1;
! 
! 	return 0;
  }
  
  /*
--- 1274,1317 
  
  /*
   * append_total_cost_compare
!  *	  qsort comparator for sorting append child paths by total_cost descending
!  *
!  * For equal total costs, we fall back to comparing startup costs; if those
!  * are equal too, break ties using bms_compare on the paths' relids.
!  * (This is to avoid getting unpredictable results from qsort.)
   */
  static int
  append_total_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
+ 	int			cmp;
  
! 	cmp = compare_path_costs(path1, path2, TOTAL_COST);
! 	if (cmp == 0)
! 		cmp = bms_compare(path1->parent->relids, path2->parent->relids);
! 	return -cmp;
  }
  
  /*
   * append_startup_cost_compare
!  *	  qsort comparator for sorting append child paths by startup_cost descending
!  *
!  * For equal startup costs, we fall back to comparing total costs; if those
!  * are equal too, break ties using bms_compare on the paths' relids.
!  * (This is to avoid getting unpredictable results from qsort.)
   */
  static int
  append_startup_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
+ 	int			cmp;
  
! 	cmp = compare_path_costs(path1, path2, STARTUP_COST);
! 	if (cmp

Re: update portal-related memory context names and API

2018-01-08 Thread Alvaro Herrera

Peter Eisentraut wrote:

> Subject: [PATCH 1/2] Update portal-related memory context names and API
> 
> Rename PortalMemory to TopPortalContext, to avoid confusion with
> PortalContext and align naming with similar top-level memory contexts.
> 
> Rename PortalData's "heap" field to portalContext.  The "heap" naming
> seems quite antiquated and confusing.  Also get rid of the
> PortalGetHeapMemory() macro and access the field directly, which we do
> for other portal fields, so this abstraction doesn't buy anything.

This one looks good to me.  I was long confused by that "heap"
terminology.

> Subject: [PATCH 2/2] Remove PortalGetQueryDesc()
> 
> After having gotten rid of PortalGetHeapMemory(), there seems little
> reason to keep one Portal access macro around that offers no actual
> abstraction and isn't consistently used anyway.

This macro does look quite pointless.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-08 Thread Jesper Pedersen


Hi Alvaro,

On 01/08/2018 03:36 PM, Alvaro Herrera wrote:

Jesper Pedersen wrote:


Maybe a warning for existing indexes on the same column(s), but with a
different type, should be emitted during ATTACH, e.g.



[detach one partition, replace index with a different one, attach
partition]



Of course, this could also fall under index maintenance and no warning
emitted. Docs have "Each partition is first checked to determine whether an
equivalent index already exists," so it is covered.


Yeah, I'm of two minds about this also -- in the initial versions I had
a code comment wondering exactly about having a hash index in a
partition attached to a btree index on parent.

As another example, having a documents table with two partitions (one
"long term archival" and other "documents currently being messed with")
you could have a text search index which is GIN in the former and GiST
in the latter.  There is a performance argument for doing it that way,
so it's not merely academic.

Anyway, while I think attaching an index that doesn't match the
properties of the index on parent can be a useful feature, on the other
hand it could be surprising (you end up losing an index because it was
matched during attach that you didn't think was going to be matched).
One idea would be to have a way to specify at ATTACH time what to do
about indexes when they don't match exactly, but I think the user
interface would be pretty messy: exactly how different do you want to
allow the indexes to be?  Is an index having one more column than the
one in parent good enough?  I think the answer to this is mostly a
judgement call, and I'd rather not spend my time debating those.



Yeah, agreed - lets leave as is.

Migrating an index to another type would mean to drop the top-level 
definition anyway.


Best regards,
 Jesper

Re: Condition variable live lock

2018-01-08 Thread Tom Lane

Robert Haas  writes:
> I guess my aversion to converting the existing Assert-test into an
> elog test was really a concern that we'd be countenancing the use of
> CVs in any coding pattern more complicated than a very simple
> test-and-wait loop.  Suppose someone were to propose adding runtime
> checks that when we release a spinlock, it is held by the process that
> tried to release it.  Someone might reply that such a check ought to
> be unnecessary because the code protected by a spinlock ought to be a
> short, straight-line critical section and therefore we should be able
> to spot any such coding error by inspection.

Right, *because* we have the rule that spinlock-protected code must be
short straight-line segments.  There is certainly no such rule for
LWLocks, and I'm not sure why you think that CVs have more restricted
use-cases than LWLocks.

There is a potential issue, if you call random code inside the wait loop,
that that code might reset the process latch and thereby lose a signal for
the CV.  As far as I can see at the moment, that'd only be a hazard for
code executed before the loop's ConditionVariableSleep, not after, so even
that can be avoided if you code properly.  In any case, the proposed
generalization of CVs poses no such hazard: yes, we might reset the latch,
but if so the outer CV loop will see that it's lost its prepared sleep
state and will be forced to recheck its exit condition.

> It's often the case that mechanisms like this end up getting
> used profitably in a lot of places not imagined by their original
> creator, and that might be the case here.

Yeah, that's exactly my expectation.  If I thought we were only going to
have five uses of condition variables forevermore, I wouldn't be putting
much time into them.

> I think an extremely *likely* programming error when programming with
> CVs is to have a code path where ConditionVariableCancelSleep() does
> not get called.  The proposed change could make such mistakes much
> less noticeable.

Actually, the proposed change would turn it into barely an error at all.
The only advantage of cancelling a sleep immediately is that you avoid
possibly getting a no-longer-useful latch event later.

regards, tom lane

Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-08 Thread Alvaro Herrera

Jesper Pedersen wrote:

> Maybe a warning for existing indexes on the same column(s), but with a
> different type, should be emitted during ATTACH, e.g.

> [detach one partition, replace index with a different one, attach
> partition]

> Of course, this could also fall under index maintenance and no warning
> emitted. Docs have "Each partition is first checked to determine whether an
> equivalent index already exists," so it is covered.

Yeah, I'm of two minds about this also -- in the initial versions I had
a code comment wondering exactly about having a hash index in a
partition attached to a btree index on parent.

As another example, having a documents table with two partitions (one
"long term archival" and other "documents currently being messed with")
you could have a text search index which is GIN in the former and GiST
in the latter.  There is a performance argument for doing it that way,
so it's not merely academic.

Anyway, while I think attaching an index that doesn't match the
properties of the index on parent can be a useful feature, on the other
hand it could be surprising (you end up losing an index because it was
matched during attach that you didn't think was going to be matched).
One idea would be to have a way to specify at ATTACH time what to do
about indexes when they don't match exactly, but I think the user
interface would be pretty messy: exactly how different do you want to
allow the indexes to be?  Is an index having one more column than the
one in parent good enough?  I think the answer to this is mostly a
judgement call, and I'd rather not spend my time debating those.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: update portal-related memory context names and API

2018-01-08 Thread Andrew Dunstan



On 12/19/2017 12:31 PM, Peter Eisentraut wrote:
> ISTM that some of the portal-related memory context naming is a bit
> antiquated and at odds with current terminology.  In this patch, I
> propose to rename PortalMemory to TopPortalContext and rename
> Portal->heap to Portal->portalContext, and then clean up some
> surrounding APIs.
>

Seems reasonable - 100% cosmetic AFAICT. Marking ready for committer.

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2018-01-08 Thread Tomas Vondra



On 01/08/2018 08:39 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> As I already mentioned, Tom's updated patch is better than what I
>> posted initially, and I agree with his approach to the remaining
>> issues he pointed out. But I somehow assumed that he's already
>> looking into that. Tom, do you plan to look into this patch soon,
>> or should I?
> 
> No, I thought you were going to run with those ideas. I have a lot
> of other stuff on my plate ...
> 

OK, will do.

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: portal pinning

2018-01-08 Thread Andrew Dunstan



On 12/15/2017 03:36 PM, Peter Eisentraut wrote:
> On 12/12/17 10:34, Peter Eisentraut wrote:
>> But I also wonder whether we shouldn't automatically pin/unpin portals
>> in SPI_cursor_open() and SPI_cursor_close().  This makes sense if you
>> consider "pinned" to mean "internally generated".  I don't think there
>> is a scenario in which user code should directly operate on a portal
>> created by SPI.
> Here is a patch for this option.
>
> The above sentence was not quite correct.  Only unnamed portals should
> be pinned automatically.  Named portals are of course possible to be
> passed around as refcursors for example.
>


This seems like a good idea, and the code change is tiny and clean. I
don't know of any third party PLs or other libraries might be pinning
the portals already on their own. How would they be affected if they did?

Anyway, marking ready for committer.

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Condition variable live lock

2018-01-08 Thread Robert Haas

On Sun, Jan 7, 2018 at 6:38 PM, Tom Lane  wrote:
>> Actually ... perhaps a better design would be to have
>> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
>> a different condition variable, analogously to what we just did in
>> ConditionVariableBroadcast, on the same theory that whenever control
>> returns to the other CV wait loop it can re-establish the relevant
>> state easily enough.  I have to think that if the use of CVs grows
>> much, the existing restriction is going to become untenable anyway,
>> so why not just get rid of it?
>
> Concretely, as per attached.

I guess my aversion to converting the existing Assert-test into an
elog test was really a concern that we'd be countenancing the use of
CVs in any coding pattern more complicated than a very simple
test-and-wait loop.  Suppose someone were to propose adding runtime
checks that when we release a spinlock, it is held by the process that
tried to release it.  Someone might reply that such a check ought to
be unnecessary because the code protected by a spinlock ought to be a
short, straight-line critical section and therefore we should be able
to spot any such coding error by inspection.

And I wonder if the same isn't true here.  Is it really sensible,
within a CV-wait loop, to call some other function that contains its
own CV-wait loop?  Is that really a use case we want to support?  If
you do have such a thing, with the present coding, you don't *need* an
Assert() at runtime; you just need to run the code with assertions
enabled AT LEAST ONCE.  If the code flow is so complex that it doesn't
reliably fail an assertion in test environments, maybe it's just too
complex.  Perhaps our answer to that, or so my thinking went, ought to
be "don't write code like that in the first place".

Now, that may be myopic on my part.  If we want to support complex
control flows involving CVs, the approach here has a lot to recommend
it.  Instead of making it the caller's problem to work it out, we do
it automatically.  There is some loss of efficiency, perhaps, since
when control returns to the outer CV-wait loop it will have to recheck
the condition twice before potentially waiting, but maybe that doesn't
matter.  It's often the case that mechanisms like this end up getting
used profitably in a lot of places not imagined by their original
creator, and that might be the case here.

I think an extremely *likely* programming error when programming with
CVs is to have a code path where ConditionVariableCancelSleep() does
not get called.  The proposed change could make such mistakes much
less noticeable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel append plan instability/randomness

2018-01-08 Thread Robert Haas

On Mon, Jan 8, 2018 at 1:26 PM, Jim Finnerty  wrote:
> I see.  Maybe the EXPLAIN output can be made more stable when "EXPLAIN (costs
> FALSE)", though?

I think that would be fixing the problem from the wrong end.  We want
to actually get a stable choice of plan, not display something other
than the actual plan because that other thing is more stable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

I wrote:
> In the regression test case at hand, the startup costs are all zero so
> this change wouldn't improve the test case's stability.  So I'm thinking
> that in addition, it would be a good idea for these functions to break
> exact compare_path_costs ties in some arbitrary but deterministic way,
> rather than letting qsort() have the final say on what happens.  If the
> subplans were all simple relation scans we could order them by planner
> relid, but I'm not sure what to do if they're not.

Ah, I have an idea --- let's create a bms_compare() qsort comparator for
bitmapsets, and use that on the paths' relid sets.  It hardly matters
what the exact semantics of the comparator are, as long as it behaves
sanely for equal sets.

regards, tom lane

Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-08 Thread Jesper Pedersen


Hi Alvaro,

On 01/04/2018 09:30 AM, Alvaro Herrera wrote:

v11 fixes the dependency problem I mentioned in
https://postgr.es/m/20171229203818.pqxf2cyl4g2wre6h@alvherre.pgsql
and adds some test to cover that stuff.



Thanks, make check-world passes.

Maybe a warning for existing indexes on the same column(s), but with a 
different type, should be emitted during ATTACH, e.g.


-- test.sql --
CREATE TABLE test (a integer NOT NULL) PARTITION BY HASH(a);
CREATE TABLE test_p00 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 0);
CREATE TABLE test_p01 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 1);

CREATE INDEX idx_test_a ON test (a);

INSERT INTO test (SELECT i FROM generate_series(1, 100) AS i);

ANALYZE;

ALTER TABLE test DETACH PARTITION test_p00;
DROP INDEX test_p00_a_idx;
CREATE INDEX test_p00_a_idx ON test_p00 USING hash (a);
ALTER TABLE test ATTACH PARTITION test_p00 FOR VALUES WITH (MODULUS 2, 
REMAINDER 0);


-- test.sql --

Of course, this could also fall under index maintenance and no warning 
emitted. Docs have "Each partition is first checked to determine whether 
an equivalent index already exists," so it is covered.


Best regards,
 Jesper

Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2018-01-08 Thread Tom Lane

Tomas Vondra  writes:
> As I already mentioned, Tom's updated patch is better than what I posted
> initially, and I agree with his approach to the remaining issues he
> pointed out. But I somehow assumed that he's already looking into that.
> Tom, do you plan to look into this patch soon, or should I?

No, I thought you were going to run with those ideas.  I have a lot of
other stuff on my plate ...

regards, tom lane

Re: PL/Python SD dict wiped?

2018-01-08 Thread Peter Eisentraut

On 1/8/18 11:59, Ken Huffman wrote:
> I'm fine with per-session initializing of prepared statements, but is
> there PL/Python mechanism that spans sessions to minimize ConfigParser
> calls?

Nothing built-in.  You would have to use a separate external storage
mechanism of some kind.  It could be a file, separately managed shared
memory, or perhaps something like memcached.

However, if this is the problem, then you might want to figure out a way
to keep your connections open longer, such as by using a connection pool.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] [PATCH] Incremental sort

2018-01-08 Thread Alexander Korotkov

On Mon, Jan 8, 2018 at 2:29 PM, Antonin Houska  wrote:

> Alexander Korotkov  wrote:
>
> > Antonin Houska  wrote:
>
> > >  Shouldn't the test contain *both* cases?
>
> > Thank you for pointing that. Sure, both cases are better. I've added
> second case as well as comments. Patch is attached.
>
> I'm fine with the tests now but have a minor comment on this comment:
>
> -- CROSS JOIN, not pushed down, because we don't push down LIMIT and
> remote side
> -- can't perform top-N sort like local side can.
>
> I think the note on LIMIT push-down makes the comment less clear because
> there's no difference in processing the LIMIT: EXPLAIN shows that both
>
> SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1
> OFFSET
> 100 LIMIT 10;
>
> and
>
> SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3
> OFFSET
> 100 LIMIT 10;
>
> evaluate the LIMIT clause only locally.
>
> What I consider the important difference is that the 2nd case does not
> generate the appropriate input for remote incremental sort (while
> incremental
> sort tends to be very cheap). Therefore it's cheaper to do no remote sort
> at
> all and perform the top-N sort locally than to do a regular
> (non-incremental)
> remote sort.
>

Agree, these comments are not clear enough.  I've rewritten comments: they
became much
more wordy, but now they look clearer for me.  Also I've swapped the
queries order, for me
it seems to easier for understanding.


> I have no other questions about this patch. I expect the CFM to set the
> status
> to "ready for committer" as soon as the other reviewers confirm they're
> happy
> about the patch status.


Good, thank you.  Let's see what other reviewers will say.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


incremental-sort-15.patch
Description: Binary data

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

Robert Haas  writes:
> On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila  wrote:
>> One theory that can explain above failure is that the costs of
>> scanning some of the sub-paths is very close due to which sometimes
>> the results can vary.  If that is the case, then probably using
>> fuzz_factor in costs comparison (as is done in attached patch) can
>> improve the situation, may be we have to consider some other factors
>> like number of rows in each subpath.

> This isn't an acceptable solution because sorting requires that the
> comparison operator satisfy the transitive property; that is, if a = b
> and b = c then a = c.  With your proposed comparator, you could have a
> = b and b = c but a < c.  That will break stuff.

> It seems like the obvious fix here is to use a query where the
> contents of the partitions are such that the sorting always produces
> the same result.  We could do that either by changing the query or by
> changing the data in the partitions or, maybe, by inserting ANALYZE
> someplace.

The foo_star tables are made in create_table.sql, filled in
create_misc.sql, and not modified thereafter.  The fact that we have
accurate rowcounts for them in select_parallel.sql is because of the
database-wide VACUUM that happens at the start of sanity_check.sql.
Given the lack of any WHERE condition, the costs in this particular query
depend only on the rowcount and physical table size, so inserting an
ANALYZE shouldn't (and doesn't, for me) change anything.  I would be
concerned about side-effects on other queries anyway if we were to ANALYZE
tables that have never been ANALYZEd in the regression tests before.

I remain pretty much at a loss for exactly what happened on silverfish,
but that doesn't mean I have no ideas for improving things.  Both
append_total_cost_compare and append_startup_cost_compare are seriously
deficient IMO, because they ignore the question of what to do when total
costs (resp. startup costs) are exactly equal, leaving the outcome to
the none too tender mercies of qsort().  Which you'll recall is not a
stable algorithm.

(For fewer than 7 items, our implementation of qsort falls back to an
insertion sort, which *is* stable, perhaps explaining why we've not
seen failures in this test case many times before.  So this line of
thought doesn't explain what happened on silverfish.  But there's
surely room for improvement here.)

The normal rule in the planner, as embodied in places like
compare_path_costs, is to break ties on total cost by comparing startup
cost, and vice versa, which these two functions fail to do; so my first
point is that they ought to be using compare_path_costs rather than
inventing their own approach.  This would especially affect
append_startup_cost_compare, because of the frequency with which plans
have identical startup costs (ie 0).  Right now I'm afraid we're getting
basically random ordering of the partial subpaths in most cases.

In the regression test case at hand, the startup costs are all zero so
this change wouldn't improve the test case's stability.  So I'm thinking
that in addition, it would be a good idea for these functions to break
exact compare_path_costs ties in some arbitrary but deterministic way,
rather than letting qsort() have the final say on what happens.  If the
subplans were all simple relation scans we could order them by planner
relid, but I'm not sure what to do if they're not.

BTW, I had not looked closely at list_qsort before, but now that I have
it seems seriously bizarre.  Why is it allocating a new List header
rather than reusing the old one?  Because it stomps on the next-links
of the list cells, the old header is effectively corrupted and can't
possibly be useful afterwards, so we might as well re-use it and save
a palloc cycle.  (By the same token, it's pretty ludicrous that it
claims the input List is "const".  Not stomping on the list header
is a pretty poor definition of not modifying the list.)

regards, tom lane

Re: pgbench - add \if support

2018-01-08 Thread Fabien COELHO



Hello Dmitry,

Thanks for working on this. I had just a quick look at it, but I hope 
I'll have time to post a proper review. In the meantime I'm wondering 
what am I doing wrong here (I see a similar example in your first 
message)?


```
-- test.sql
\if random(0, 99) < 85
   \set test 1
\else
   \set test 2
\endif
select :test;
```

```
$ pgbench -s 10 -f test.sql
test.sql:1: unexpected character (<) in command "if"
\if random(0, 99) < 85
 ^ error found here


Sure.

What you are trying to do is the result of combining the pgbench-if patch 
and the pgbench-more-ops-and-funcs patch.


There is also with the ability to put the result of a SELECT into a 
variable, which would then enable doing some if about data coming

from the database.

https://commitfest.postgresql.org/16/985/
https://commitfest.postgresql.org/16/669/
https://commitfest.postgresql.org/16/1385/

These are distinct entries in the CF, because they do quite distinct 
things, and interact weakly one with the other.


However, it really makes full sense when they are all available together, 
so I put an example which combines all three. "\if 1" is not that 
interesting in itself, obviously.


Everytime I sent a (relatively) big patch in the past I was asked to cut 
it in bites, which is tiresome when everything is intermixed, so now I'm 
doing the bytes first.


--
Fabien.

Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2018-01-08 Thread Tomas Vondra


On 01/05/2018 05:25 AM, Stephen Frost wrote:
> Tomas,
> 
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> On Sun, Nov 19, 2017 at 6:40 AM, Tomas Vondra
>>  wrote:
>>> Thanks for looking into this. I agree your patch is more consistent and
>>> generally cleaner.
>>
>> This is classified as a bug fix, and is marked as waiting on author. I
>> am moving it to next CF as work continues.
> 
> This is still in waiting-on-author state; it'd be great to get your 
> thoughts on where this patch is and what the next steps are to move
> it forward. If you need additional feedback or there needs to be
> more discussion (perhaps with Tom) then maybe this should be in
> needs-review instead, but it'd be great if you could review the
> discussion and patch again and at least provide an update on it (last
> update from you was in November).
> 

Hmmm, I'm not sure, TBH.

As I already mentioned, Tom's updated patch is better than what I posted
initially, and I agree with his approach to the remaining issues he
pointed out. But I somehow assumed that he's already looking into that.

Tom, do you plan to look into this patch soon, or should I?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

Robert Haas  writes:
> [ example where current parallel-append dispatch algorithm loses ]

I should clarify my thinking a bit more: I was imagining that we might
switch to a system where, as each worker comes free, it makes a dynamic
choice of which subplan to take next.  That means that not only do we have
more knowledge of the number of available workers than the planner did,
but we also have hard knowledge of which subplans actually finished first.
So even (or especially?) granted that the cost estimates are imperfect,
that might provide a leg up in choosing how to schedule the remaining
tasks.

I haven't thought about it hard enough to have a clear idea of how that
might win.  But for sure I don't think we should dismiss out of hand
the idea that it could be better than choosing a fixed dispatch order
at plan time.

Anyway, I'm merely suggesting that as an approach worth investigating in
future research; I doubt anyone is going to produce an implementation
right away.  So we still want to figure out exactly what happened on
silverfish and how we can nail down that expected plan shape.  (Or,
perhaps, decide that we don't need to see that plan in the output?)

regards, tom lane

Re: [HACKERS] WIP: Covering + unique indexes.

2018-01-08 Thread Andrey Borodin

Hello!

The patch does not apply currently.
Anastasia, can you, please, rebase the patch?

Best regards, Andrey Borodin.

Re: Parallel append plan instability/randomness

2018-01-08 Thread Robert Haas

On Mon, Jan 8, 2018 at 11:57 AM, Tom Lane  wrote:
>> Ignoring some details around partial vs. non-partial plans, that's
>> pretty much what we ARE doing, but to make it efficient, we sort the
>> paths at plan time so that those choices are easy to make at runtime.
>> If we didn't do that, we could have every worker sort the paths at
>> execution time instead, or have the first process to arrive perform
>> the sort and store the results in shared memory while everyone else
>> waits, but that seems to be more complicated and less efficient, so I
>> don't understand why you're proposing it.
>
> The main bit of info we'd have at runtime that we lack at plan time is
> certainty about the number of available workers.  Maybe that doesn't
> really add anything useful to the order in which subplans would be doled
> out; not sure.

The current algorithm doesn't have any use for that information; I'm
not sure whether some other algorithm might.  If we had perfect
information, each participant would always choose the next task in
such a way as to minimize overall execution time.  Is it possible that
the correct choice might depend on how many workers we have in total?

/me thinks for a bit.  Yes, that's possible. Suppose that the
execution times of 5 non-partial plans are 100 99 60 40 1 and that the
number of participants is either 2 or 3.  Suppose further that the
number of rows returned is nominal so that there's no issue of the
worker(s) having to wait.  If there are 2 participants, one process
should get the plans with execution times 99 and 60 and the other
should get the rest.  If there are 3 participants, one process should
get only 100, the second should get 99 and 1, and the last should get
60 and 40.

In the actually implemented algorithm, the leader will grab the cheap
plan at the end, the worker(s) will grab the expensive plans at the
beginning, and things probably won't actually work out very nicely.
With three participants, the leader will end up with 1 and 40, the
first worker with 100 only, and the second worker with 99 and 60.
Fail.

The choice of which plan the leader takes might also depend on the
number of rows being returned by the various plans.  If there's one
plan that returns a lot of rows, it would be nice to run that one in
the leader, because then we save the cost of transferring those rows
between processes.  On the other hand, that also risks starving all of
the workers; it could be faster to incur the overhead of transferring
those rows so that the leader focuses on servicing the tuple queues
rather than running plans.

There are also problems with partial plans having high startup costs.
In general, if a partial plan has a high startup cost, we'd like the
leader not to choose it, because then it's not available to read
tuples.  So, if we have one plan with a startup cost with 1000 and
another plan with a startup cost of 800, then we might think that the
leader should prefer the latter.  However, if the former plan is a
parallel hash join and the latter is a non-parallel-aware join, then
the reverse *may* be true, because the startup cost of parallel hash
can (mostly) be *shared* among as many participants as we have,
whereas non-parallel-aware nodes will *repeat* the startup work.  Of
course, the plan tree doesn't carry information about whether startup
costs of parallel plans are likely to be amortized across workers or
repeated in each one, and even if it did, it does not seem to be
trivial to put that information to good use.

Yet another problem with parallel append -- and really, with parallel
planning in general -- is that it ignores the cost of resource
contention between cooperating backends.  If we do a Parallel Append
over Parallel Seq Scan operations, spreading workers out figures to
win if (1) there are enough of them to general significant contention
on the in-memory data structures, which is especially likely if the
data is resident in memory or if (2) the different partitions are on
different filesystems that can satisfy I/O requests in parallel or
even if (3) the different partitions are all on the same filesystem
but that filesystem has enough spindles to sequential-scan them all at
once.  But it could easily lose if, say, the partitions are on
different filesystems serviced by a single disk head, and the
alternating I/O pattern produces a continuous stream of long seeks.
On the the third hand, that case could work out too if the per-row
processing is enough that we weren't I/O-bound in the first place.

I'd be quite happy if somebody wanted to dig into these issues more.
I don't pretend that the existing implementation is anything more than
a first approximation of what we really want here.  It's just based on
the observations that (1) in practice, the leader is very often the
bottleneck when executing parallel queries and (2) a task that takes a
long time to run and can't be sped up by adding workers must get
started as soon as possible.  I have a feeling that to do t

Re: [HACKERS] GSoC 2017 : Patch for predicate locking in Gist index

2018-01-08 Thread Shubham Barai

On 8 January 2018 at 22:44, Shubham Barai  wrote:

>
>
> On 5 January 2018 at 03:18, Alexander Korotkov 
> wrote:
>
>> On Thu, Jan 4, 2018 at 7:07 PM, Andrey Borodin 
>> wrote:
>>
>>> 29 нояб. 2017 г., в 22:50, Shubham Barai 
>>> написал(а):
>>>
>>>  I have fixed formatting style. Please take a look at updated patch.
>>>
>>>
>>> Here's rebased patch. Every issue has been addressed, so I'm marking
>>> this patch as ready for committer.
>>>
>>
>> I'm sorry for concentrating on boring things, but formatting of
>> predicate-gist.spec still doesn't look good for me.
>>
>> # To verify serialization failures, queries and permutations are written
>>> in such
>>> # a way that an index scan(from one transaction) and an index
>>> insert(from another
>>> # transaction) will try to access the same part(sub-tree) of the index.
>>> #
>>> # To check reduced false positives, queries and permutations are written
>>> in such
>>> # a way that an index scan(from one transaction) and an index
>>> insert(from another
>>> # transaction) will try to access different parts(sub-tree) of the index.
>>>
>>
>> No space before open bracket (I think it should be when there are
>> multiple words brackets).
>> Also, we're trying to fit our lines to 80 characters (if it's not
>> objectively difficult).
>> And these are two almost same paragraphs.  I think it should be
>> simplified.
>>
>> setup
>>> {
>>>  create table gist_point_tbl(id int4, p point);
>>>  create index gist_pointidx on gist_point_tbl using gist(p);
>>>  insert into gist_point_tbl (id, p)
>>>  select g, point(g*10, g*10) from generate_series(1, 1000) g;
>>> }
>>> setup
>>> {
>>>   BEGIN ISOLATION LEVEL SERIALIZABLE;
>>>   set enable_seqscan=off;
>>>   set enable_bitmapscan=off;
>>>   set enable_indexonlyscan=on;
>>> }
>>> setup {
>>>   BEGIN ISOLATION LEVEL SERIALIZABLE;
>>>   set enable_seqscan=off;
>>>   set enable_bitmapscan=off;
>>>   set enable_indexonlyscan=on;
>>> }
>>
>>
>> I didn't get idea of using various indentation styles for same purpose.
>>
>> step "wx3" { insert into gist_point_tbl (id, p)
>>>   select g, point(g*500, g*500) from generate_series(12,
>>> 18) g; }
>>
>>
>> Indented using spaces here...
>>
>>
>
>
I have fixed formatting issues. Please have a look at updated patch.


Regards,
Shubham


Predicate-locking-in-gist_index_v7.patch
Description: Binary data

Re: pgbench - add \if support

2018-01-08 Thread Dmitry Dolgov

> On 4 January 2018 at 07:32, Fabien COELHO  wrote:
>
> Another rebase to try to please the patch tester.

Thanks for working on this. I had just a quick look at it, but I hope I'll have
time to post a proper review. In the meantime I'm wondering what am I doing
wrong here (I see a similar example in your first message)?

```
-- test.sql
\if random(0, 99) < 85
\set test 1
\else
\set test 2
\endif
select :test;
```

```
$ pgbench -s 10 -f test.sql
test.sql:1: unexpected character (<) in command "if"
\if random(0, 99) < 85
  ^ error found here
```

I'm using `pgbench-if-4.patch`, and everything is fine for simple
conditions like `\if 1`.

Re: Doc tweak for huge_pages?

2018-01-08 Thread Catalin Iacob

On Fri, Dec 1, 2017 at 10:09 PM, Thomas Munro
 wrote:
>> On 11/30/17 23:35, Thomas Munro wrote:
>>> Hmm.  Yeah, it does, but apparently it's not so transparent.  So if we
>>> mention that we'd better indicate in the same paragraph that you
>>> probably don't actually want to use it.  How about the attached?

Here's a review for v3.

I find that the first paragraph is an improvement as it's more precise.

What I didn't like about the second paragraph is that it pointed out
Linux transparent huge pages too favorably while they are actually
known to cause big (huge?, pardon the pun) issues (as witnessed in
this thread as well). v3 basically says "in Linux it can be
transparent or explicit and explicit is faster than transparent".
Reading that, and seeing that explicit needs tweaking of kernel
parameters and so on, one might very well conclude "I'll use the
slightly-slower-but-still-better-than-nothing transparent version".

So I tried to redo the second paragraph and ended up with the
attached. Rationale for the changes:
* changed "this feature" to "explicitly requesting huge pages" to
contrast with the automatic one described below
* made the wording of Linux THP more negative (but still with some
wiggle room for future kernel versions which might improve THP),
contrasting with the positive explicit request from this GUC
* integrated your mention of other OSes with automatic huge pages
* moved the new text to the last paragraph to lower its importance

What do you think?
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a01699e4..b6b309a943 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1363,14 +1363,15 @@ include_dir 'conf.d'

-Enables/disables the use of huge memory pages. Valid values are
-try (the default), on,
-and off.
+Controls whether huge memory pages are requested for the main shared
+memory area. Valid values are try (the default),
+on, and off.

-At present, this feature is supported only on Linux. The setting is
-ignored on other systems when set to try.
+At present, explicitly requesting huge pages is supported only on
+Linux. The setting is ignored on other systems when set to
+try.

@@ -1386,6 +1387,18 @@ include_dir 'conf.d'
 to use huge pages will prevent the server from starting up. With
 off, huge pages will not be used.

+
+   
+Note that, besides explicitly requesting huge pages via
+huge_pages, operating systems including Linux,
+FreeBSD and Illumos can also use huge pages (sometimes known as "super"
+pages or "large" pages) automatically, without an explicit request from
+PostgreSQL. In Linux this automatic use is
+called "transparent huge pages" but, for some Linux kernel versions,
+transparent huge pages are known to cause performance degradation with
+PostgreSQL so, unlike
+huge_pages, their use is discouraged.
+

Re: [HACKERS] [PATCH] Generic type subscripting

2018-01-08 Thread Dmitry Dolgov

> On 7 January 2018 at 23:39, Tom Lane  wrote:
>> Dmitry Dolgov <9erthali...@gmail.com> writes:
>>> On 4 January 2018 at 03:05, Tom Lane  wrote:
>>> I wonder what happened to the plan to separate assignment and fetch into two
>>> different node types. I can see that that didn't happen so far as 
>>> primnodes.h
>>> is concerned, but you've made some changes that seem to assume it did 
>>> happen.
>
>> There was one version of this patch that followed this plan. It turns out 
>> that
>> it's quite unnatural approach (at least within the current implementation),
>> because I had to duplicate or reuse a lot of code for those two node types.
>
> I'm not entirely convinced by that argument.  Sure, there would be a lot of
> duplicative code in the node support functions, but that's inherent in our
> approach to those functions.  I'm more concerned about readability,
> especially given that exposing this feature to extensions is going to set
> all these decisions in concrete forever.

That version I'm talking about actually have got not that much readability from
this separation (at least from my perspective). For extenions it maybe more
justified, but at the same time `isAssignment` flag the only thing you need to
check to figure out whether it's a fetch or assignment operation - it doesn't
sound as something significantly worse for readability. Or do you mean
something else?

Anyway, it's quite possible that I just failed to implement a proper approach
to implement this. I'm going to keep pondering this in case I'll come up with
something better - it should be more or less straightforward to change this
logic at any time.

> * You changed transformArrayType so that it doesn't throw an error if
> the given type isn't an array --- without, I note, updating either the
> function header comment or the internal comment that you falsified.
>
> * That in turn means that where transformAssignmentSubscripts thinks
> it's determining the result type, it may or may not be producing
> InvalidOid instead (again, with no comment to warn the reader).
>
> * And then you had to kludge transformAssignmentIndirection horribly
> (and I'm not sure you kludged it enough, either, because there are
> still a bunch of places where it uses targetTypeId without any concern
> for the possibility that that's zero).  It doesn't seem to me to be
> acceptable to just ignore coercion failure, as that code does now.
> If it's no longer the responsibility of this function to guarantee
> that the result is of the right type, why is it attempting coercion
> at all?  In any case you failed to update its header comment to
> explain what it's doing differently than before.
>
> In short the division of labor in this area still needs to be thought
> about.  I don't think we really want transformAssignmentSubscripts
> determining the required input data type at all; that should be
> farmed out to the datatype-specific code.  I'm also pretty unconvinced
> about refnestedfunc --- why do we need that?

Yes, I see, there is a room for improvements. I'm going to post a new version
soon, where I'll try to address this. About `refnestedfunc` - I added it as
part of my modification for functions caching, but looks like now this function
is useless.

> While I'm on the topic, I am not really happy with s/array/container/
> as you've done in some of this code.  To my mind, "container type"
> includes composite types.  Particularly in the parse_target code, where
> we're currently dealing with either composites or arrays, making it say
> that we're dealing with either composites or containers is just a recipe
> for confusion.  Unfortunately I can't think of a better word offhand,
> but some attention to this is needed.  As far as the comments go,
> we might be able to use the term "subscriptable type", but not sure if
> that will work for function/variable names.

Is there a plausible use case when "container type" can be something else than
a "composite type"? Maybe it's ok just to say that "container type" is equal to
"composite type"?

> On the executor side of things, I suspect Andres will be unhappy that
> you are making ExprEvalStep part of the API for datatypes --- he
> objected to my exposing it to plpgsql param eval in
> https://postgr.es/m/20171220174243.n4y3hgzf7xd3m...@alap3.anarazel.de
> and there was a lot more reason to do so there than there is here, IMO.
> It looks like what you actually need is just the SubscriptingRefState and
> an isnull flag, so it might be better to specify that the fetch and assign
> functions have signatures like
> Datum fetch(Datum val, SubscriptingRefState *state, bool *isnull)
> (representing both of the last two arguments as INTERNAL at SQL level).

Yes, I agree.

> Now on the other hand, maybe the right way to go is to embrace a similar
> approach to what I did for plpgsql param eval, and let the datatype
> control what gets generated as the expression execution step.  The main
> point here would be to let the

Re: [HACKERS] GSoC 2017 : Patch for predicate locking in Gist index

2018-01-08 Thread Shubham Barai

On 5 January 2018 at 03:18, Alexander Korotkov 
wrote:

> On Thu, Jan 4, 2018 at 7:07 PM, Andrey Borodin 
> wrote:
>
>> 29 нояб. 2017 г., в 22:50, Shubham Barai 
>> написал(а):
>>
>>  I have fixed formatting style. Please take a look at updated patch.
>>
>>
>> Here's rebased patch. Every issue has been addressed, so I'm marking this
>> patch as ready for committer.
>>
>
> I'm sorry for concentrating on boring things, but formatting of
> predicate-gist.spec still doesn't look good for me.
>
> # To verify serialization failures, queries and permutations are written
>> in such
>> # a way that an index scan(from one transaction) and an index insert(from
>> another
>> # transaction) will try to access the same part(sub-tree) of the index.
>> #
>> # To check reduced false positives, queries and permutations are written
>> in such
>> # a way that an index scan(from one transaction) and an index insert(from
>> another
>> # transaction) will try to access different parts(sub-tree) of the index.
>>
>
> No space before open bracket (I think it should be when there are multiple
> words brackets).
> Also, we're trying to fit our lines to 80 characters (if it's not
> objectively difficult).
> And these are two almost same paragraphs.  I think it should be simplified.
>
> setup
>> {
>>  create table gist_point_tbl(id int4, p point);
>>  create index gist_pointidx on gist_point_tbl using gist(p);
>>  insert into gist_point_tbl (id, p)
>>  select g, point(g*10, g*10) from generate_series(1, 1000) g;
>> }
>> setup
>> {
>>   BEGIN ISOLATION LEVEL SERIALIZABLE;
>>   set enable_seqscan=off;
>>   set enable_bitmapscan=off;
>>   set enable_indexonlyscan=on;
>> }
>> setup {
>>   BEGIN ISOLATION LEVEL SERIALIZABLE;
>>   set enable_seqscan=off;
>>   set enable_bitmapscan=off;
>>   set enable_indexonlyscan=on;
>> }
>
>
> I didn't get idea of using various indentation styles for same purpose.
>
> step "wx3" { insert into gist_point_tbl (id, p)
>>   select g, point(g*500, g*500) from generate_series(12,
>> 18) g; }
>
>
> Indented using spaces here...
>
>
I have fixed formatting issues. Please have a look at updated patch.


Regards,
Shubham


Predicate-locking-in-gist-index_7.patch
Description: Binary data

Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-08 Thread Tomas Vondra

Hi Arthur,

Sorry for the delay, I somehow missed this thread ...

On 12/27/2017 10:20 AM, Arthur Zakirov wrote:
> On Tue, Dec 26, 2017 at 07:03:48PM +0100, Pavel Stehule wrote:
>>
>> Tomas had some workable patches related to this topic
>>
> 
> Tomas, have you planned to propose it?
> 

I believe Pavel was referring to this extension:

https://github.com/tvondra/shared_ispell

I wasn't going to submit that as in-core solution, but I'm happy you're
making improvements in that direction. I'll take a look at your patch
shortly.

ragards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

RE: PL/Python SD dict wiped?

2018-01-08 Thread Ken Huffman

I'm fine with per-session initializing of prepared statements, but is there 
PL/Python mechanism that spans sessions to minimize ConfigParser calls?

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Monday, January 08, 2018 10:23 AM
To: Ken Huffman 
Cc: Pg Hackers 
Subject: Re: PL/Python SD dict wiped?

Ken Huffman  writes:
> But if I quit and restart psql and do a subsequent db update, the SD dict is 
> empty again and the one-time logic has to run again. What is clearing the 
> dict? Is that expected?

You've got a new backend process.  That dict (or any other Python state) is 
only kept within a session.

regards, tom lane

Disclaimer

The information contained in this communication from the sender is 
confidential. It is intended solely for use by the recipient and others 
authorized to receive it. If you are not the recipient, you are hereby notified 
that any disclosure, copying, distribution or taking action in relation of the 
contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been 
automatically archived by Mimecast Ltd, an innovator in Software as a Service 
(SaaS) for business. Providing a safer and more useful place for your human 
generated data. Specializing in; Security, archiving and compliance. To find 
out more visit the Mimecast website.

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

Robert Haas  writes:
> On Mon, Jan 8, 2018 at 11:42 AM, Tom Lane  wrote:
>> More generally, I wonder if
>> it wouldn't be better to implement this behavior at runtime rather
>> than plan time.

> Ignoring some details around partial vs. non-partial plans, that's
> pretty much what we ARE doing, but to make it efficient, we sort the
> paths at plan time so that those choices are easy to make at runtime.
> If we didn't do that, we could have every worker sort the paths at
> execution time instead, or have the first process to arrive perform
> the sort and store the results in shared memory while everyone else
> waits, but that seems to be more complicated and less efficient, so I
> don't understand why you're proposing it.

The main bit of info we'd have at runtime that we lack at plan time is
certainty about the number of available workers.  Maybe that doesn't
really add anything useful to the order in which subplans would be doled
out; not sure.

regards, tom lane

Re: Parallel append plan instability/randomness

2018-01-08 Thread Robert Haas

On Mon, Jan 8, 2018 at 11:42 AM, Tom Lane  wrote:
> Jim Finnerty  writes:
>> Ordering the plan output elements by estimated cost will cause spurious plan
>> changes to be reported after table cardinalities change.  Can we choose an
>> explain output order that is stable to changes in cardinality, please?
>
> I found the code that's doing this, in create_append_path, and it says:
>
>  * For parallel append, non-partial paths are sorted by descending total
>  * costs. That way, the total time to finish all non-partial paths is
>  * minimized.  Also, the partial paths are sorted by descending startup
>  * costs.  There may be some paths that require to do startup work by a
>  * single worker.  In such case, it's better for workers to choose the
>  * expensive ones first, whereas the leader should choose the cheapest
>  * startup plan.
>
> There's some merit in that argument, although I'm not sure how much.
> It's certainly pointless to sort that way if the expected number of
> workers is >= the number of subpaths.

That is not completely true, because the leader will generally pick
first before the workers have finished starting up, and we want it to
pick something cheap and, ideally, partial, so that it's not committed
to doing a ton of work while the workers sit there idle after filling
their tuple queues.

> More generally, I wonder if
> it wouldn't be better to implement this behavior at runtime rather
> than plan time.  Something along the line of "workers choose the
> highest-cost remaining subplan, but leader chooses the lowest-cost one".

Ignoring some details around partial vs. non-partial plans, that's
pretty much what we ARE doing, but to make it efficient, we sort the
paths at plan time so that those choices are easy to make at runtime.
If we didn't do that, we could have every worker sort the paths at
execution time instead, or have the first process to arrive perform
the sort and store the results in shared memory while everyone else
waits, but that seems to be more complicated and less efficient, so I
don't understand why you're proposing it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Refactoring identifier checks to consistently use strcmp

2018-01-08 Thread Daniel Gustafsson

> On 08 Jan 2018, at 17:27, Robert Haas  wrote:
> 
> On Sat, Jan 6, 2018 at 7:38 PM, Tom Lane  wrote:
>> It's definitely concerning that the submitted patch introduced a new bug,
>> but we have seldom taken the position that bugs in an initial submission
>> are sufficient grounds for rejecting the entire concept.
> 
> Fair point.  I withdraw my categorical -1 vote and replace it with the
> statement that the patch hasn't been sufficiently-carefully checked by
> the patch author or other reviewers for bugs to consider committing it
> -- nor has anyone taken the trouble to list with precision all of the
> places where the behavior will change.  I think the latter is
> absolutely indispensable, which is why I started to compile such a
> list in my previous post.

Sorry for dropping off the radar, my available time for hacking was severely
limited (well, down to zero really).

Seeing the surface again I’ve started on the complete list and hope to have
something quite soon, and I think (as seems to be the consensus on this thread)
that that list is a prerequisite for the review.

cheers ./daniel

Re: Parallel append plan instability/randomness

2018-01-08 Thread Tom Lane

Jim Finnerty  writes:
> Ordering the plan output elements by estimated cost will cause spurious plan
> changes to be reported after table cardinalities change.  Can we choose an
> explain output order that is stable to changes in cardinality, please?

I found the code that's doing this, in create_append_path, and it says:

 * For parallel append, non-partial paths are sorted by descending total
 * costs. That way, the total time to finish all non-partial paths is
 * minimized.  Also, the partial paths are sorted by descending startup
 * costs.  There may be some paths that require to do startup work by a
 * single worker.  In such case, it's better for workers to choose the
 * expensive ones first, whereas the leader should choose the cheapest
 * startup plan.

There's some merit in that argument, although I'm not sure how much.
It's certainly pointless to sort that way if the expected number of
workers is >= the number of subpaths.  More generally, I wonder if
it wouldn't be better to implement this behavior at runtime rather
than plan time.  Something along the line of "workers choose the
highest-cost remaining subplan, but leader chooses the lowest-cost one".

regards, tom lane

Re: Parallel append plan instability/randomness

2018-01-08 Thread Robert Haas

On Mon, Jan 8, 2018 at 11:28 AM, Jim Finnerty  wrote:
> Ordering the plan output elements by estimated cost will cause spurious plan
> changes to be reported after table cardinalities change.  Can we choose an
> explain output order that is stable to changes in cardinality, please?

No.  The order of the plans in the EXPLAIN output isn't some kind of
display artifact.  Parallel Append sorts the paths by cost and by
whether they are partial, for reasons explained in the comments in
create_append_path(), and EXPLAIN prints them in that order.

In general, if you expect to be able to change the contents of the
tables in the regression test database without changing the results of
the regression test queries, you are in for a lot of disappointment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Refactoring identifier checks to consistently use strcmp

2018-01-08 Thread Robert Haas

On Sat, Jan 6, 2018 at 7:38 PM, Tom Lane  wrote:
> It's definitely concerning that the submitted patch introduced a new bug,
> but we have seldom taken the position that bugs in an initial submission
> are sufficient grounds for rejecting the entire concept.

Fair point.  I withdraw my categorical -1 vote and replace it with the
statement that the patch hasn't been sufficiently-carefully checked by
the patch author or other reviewers for bugs to consider committing it
-- nor has anyone taken the trouble to list with precision all of the
places where the behavior will change.  I think the latter is
absolutely indispensable, which is why I started to compile such a
list in my previous post.  The former needs to be done as well, of
course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2018-01-08 Thread Robert Haas

On Sat, Jan 6, 2018 at 3:29 PM, Simon Riggs  wrote:
> I'm not very keen on adopting new syntax that isn't in the
> SQLStandard. They have a bad habit of doing something completely
> different. So a flexible approach will allow us to have functionality
> now and we can adopt any new syntax later.
>
> For any new syntax, I think the right approach would be to create a
> new parser plugin. That way we could make all of this part of an
> extension.
> * a parser plugin for any new syntax
> * various PostJoinSetProjection() functions to be called as needed
> So the idea is we enable Postgres to allow major new functionality, as
> was done for PostGIS so successfully.
>
> We can adopt syntax into the main parser later once SQLStandard
> accepts this, or some munged version of it.

We don't currently have any concept of a parser plugin, and I don't
think it's reasonable for this patch to try to invent one.  I can't
see where you could possibly put a hook that would handle something
like this.  I've thought about suggesting a hook that gets called if
parsing fails, before we give up and throw a syntax error.  That
would, perhaps, allow extension to implement additional DDL commands,
which would be pretty cool.  However, it wouldn't be practical to use
something like that for this because this is syntax that appears
buried deep down inside FROM clauses, and you'd basically have to
reimplement the core parser's entire handling of SELECT statements,
which would be immensely complicated to develop and a real pain to
maintain.

Furthermore, the changes here aren't only parsing changes.  There is a
proposal to add a new executor node which has to be supported by new
planner code.  A great deal more than parser pluggability would be
needed.  And if we did all that, it doesn't really solve anything
anyway: the potential problem with committing to this syntax is that
if we change it later, there could be upgrade problems.
Extension-izing this wouldn't make those problems go away.  People are
going to want to be able to run pg_dump on their old database and
restore the result on their new database, full stop.  If we add this
syntax, in core or in a hypothetical extension system, we're stuck
with it and must maintain it going forward.

I also don't agree with the idea that we should reject syntax that
doesn't appear in the SQL standard.  We have a great deal of such
syntax already, and we add more of it in every release, and a good
deal of it is contributed by you and your colleagues.  I don't see why
this patch should be held to a stricter standard than we do in
general.  I agree that there is some possibility for pain if the SQL
standards committee adopts syntax that is similar to whatever we pick
but different in detail, but I don't think we should be too worried
about that unless other database systems, such as Oracle, have syntax
that is similar to what is proposed here but different in detail.  The
SQL standards committee seems to like standardizing on whatever
companies with a lot of money have already implemented; it's unlikely
that they are going to adopt something totally different from any
existing system but inconveniently similar to ours.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parallel append plan instability/randomness

2018-01-08 Thread Robert Haas

On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila  wrote:
> One theory that can explain above failure is that the costs of
> scanning some of the sub-paths is very close due to which sometimes
> the results can vary.  If that is the case, then probably using
> fuzz_factor in costs comparison (as is done in attached patch) can
> improve the situation, may be we have to consider some other factors
> like number of rows in each subpath.

This isn't an acceptable solution because sorting requires that the
comparison operator satisfy the transitive property; that is, if a = b
and b = c then a = c.  With your proposed comparator, you could have a
= b and b = c but a < c.  That will break stuff.

It seems like the obvious fix here is to use a query where the
contents of the partitions are such that the sorting always produces
the same result.  We could do that either by changing the query or by
changing the data in the partitions or, maybe, by inserting ANALYZE
someplace.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Invalid pg_upgrade error message during live check

2018-01-08 Thread Robert Haas

On Fri, Jan 5, 2018 at 9:11 PM, Bruce Momjian  wrote:
> On Fri, Jan  5, 2018 at 04:58:39PM -0500, Bruce Momjian wrote:
>> Pg_upgrade is able to run in --check mode when the old server is still
>> running.  Unfortunately, all supported versions of pg_upgrade generate
>> an incorrect (but harmless) "failure" message when doing this:
>
> Based on recent discussions, I am thinking we would just fix this in PG
> 10 and HEAD and leave the harmless error message in the other branches,
> right?

Hmm, not sure why you say that.  If the message is buggy and the fix
isn't too risky, might as well fix it all the way back.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PL/Python SD dict wiped?

2018-01-08 Thread Tom Lane

Ken Huffman  writes:
> But if I quit and restart psql and do a subsequent db update, the SD dict is 
> empty again and the one-time logic has to run again. What is clearing the 
> dict? Is that expected?

You've got a new backend process.  That dict (or any other Python state)
is only kept within a session.

regards, tom lane

PL/Python SD dict wiped?

2018-01-08 Thread Ken Huffman

Is there a known issue with the PL/Python extension clearing the SD dict under 
some circumstances? As suggested by the documentation, my plpythonu uses the SD 
dict for one-time (prepare and ConfigParser) logic:

CREATE OR REPLACE FUNCTION change_notify() RETURN TRIGGER AS $$
...
If len(SD) == 0:
# one time logic
config = ConfigParser.RawConfigParser()
config.read('...')
SD['x'] = config.get('...','...')
SD['plan'] = plpy.prepare('...')
...
x = SD['x']
plan = SD['plan']

To test this trigger after creation, I use psql and force a db update to cause 
it to fire. If I do multiple updates within psql, the SD dict is only empty for 
the first update. Yeah!

But if I quit and restart psql and do a subsequent db update, the SD dict is 
empty again and the one-time logic has to run again. What is clearing the dict? 
Is that expected?

Ken

Disclaimer

The information contained in this communication from the sender is 
confidential. It is intended solely for use by the recipient and others 
authorized to receive it. If you are not the recipient, you are hereby notified 
that any disclosure, copying, distribution or taking action in relation of the 
contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been 
automatically archived by Mimecast Ltd, an innovator in Software as a Service 
(SaaS) for business. Providing a safer and more useful place for your human 
generated data. Specializing in; Security, archiving and compliance. To find 
out more visit the Mimecast website.

Re: Condition variable live lock

2018-01-08 Thread Tom Lane

Thomas Munro  writes:
> One very small thing after another look:

> -   Assert(cv_sleep_target == NULL);
> +   if (cv_sleep_target != NULL)
> +   ConditionVariableCancelSleep();

> The test for cv_sleep_target != NULL is redundant since
> ConditionVariableCancelSleep() would return early.

True.  I did that because Robert was already objecting to the cost
of an added test-and-branch here, so I figured he'd be really
unhappy with the cost of a function call plus test-and-branch ;-)

> ConditionVariableBroadcast() doesn't do that.

Yup.  I considered removing the discrepancy by adding a similar
if-guard in ConditionVariableBroadcast().  The internal test in
ConditionVariableCancelSleep would then be only for the benefit
of outside callers such as AbortTransaction, but that seems fine
and per its documentation.

Or we could remove those if's and save a few bytes at the
cost of some cycles.  I don't care much.

regards, tom lane

Re: [HACKERS] UPDATE of partition key

2018-01-08 Thread David Rowley

On 4 January 2018 at 02:52, David Rowley  wrote:
> I'll try to look at the tests tomorrow and also do some testing.

I've made a pass over the tests. Again, sometimes I'm probably a bit
pedantic. The reason for that is that the tests are not that easy to
follow. Moving creation and cleanup of objects closer to where they're
used and no longer needed makes it easier to read through and verify
the tests. There are some genuine mistakes in there too.

1.

   NEW.c = NEW.c + 1; -- Make even number odd, or vice versa

This seems to be worded as if there'd only ever be one number. I think
it should be plural and read "Make even numbers odd, and vice versa"

2. The following comment does not make a huge amount of sense.

-- UPDATE with
-- partition key or non-partition columns, with different column ordering,
-- triggers.

Should "or" be "on"? Does ", triggers" mean "with triggers"?

3. The follow test tries to test a BEFORE DELETE trigger stopping a
DELETE on sub_part1, but going by the SELECT, there are no rows in
that table to stop being DELETEd.

select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
+---++
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part2  | 1 |  2 | 10
 sub_part2  | 1 |  2 | 70
(4 rows)

drop trigger parted_mod_b ON sub_part1 ;
-- If BR DELETE trigger prevented DELETE from happening, we should also skip
-- the INSERT if that delete is part of UPDATE=>DELETE+INSERT.
create or replace function func_parted_mod_b() returns trigger as $$
begin return NULL; end $$ language plpgsql;
create trigger trig_skip_delete before delete on sub_part1
   for each row execute procedure func_parted_mod_b();
update list_parted set b = 1 where c = 70;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
+---++
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
 sub_part1  | 1 |  1 | 70
 sub_part2  | 1 |  2 | 10
(4 rows)

You've added the BEFORE DELETE trigger to sub_part1, but you can see
the tuple was DELETEd from sub_part2 and INSERTed into sub_part1, so
the test is not working as you've commented.

It's probably a good idea to RAISE NOTICE 'something useful here'; in
the trigger function to verify they're actually being called in the
test.

4. I think the final drop function in the following should be before
the UPDATE FROM test. You've already done some cleanup for that test
by doing "drop trigger trig_skip_delete ON sub_part1 ;"

drop trigger trig_skip_delete ON sub_part1 ;
-- UPDATE partition-key with FROM clause. If join produces multiple output
-- rows for the same row to be modified, we should tuple-route the row
only once.
-- There should not be any rows inserted.
create table non_parted (id int);
insert into non_parted values (1), (1), (1), (2), (2), (2), (3), (3), (3);
update list_parted t1 set a = 2 from non_parted t2 where t1.a = t2.id and a = 1;
select tableoid::regclass::text , * from list_parted order by 1, 2, 3, 4;
  tableoid  | a | b  | c
+---++
 list_part1 | 2 |  1 | 70
 list_part1 | 2 |  2 | 10
 list_part1 | 2 | 52 | 50
 list_part1 | 3 |  6 | 60
(4 rows)

drop table non_parted;
drop function func_parted_mod_b();

Also, there's a space before the ; in the drop trigger above. Can that
be removed?

5. The following comment:

-- update to a partition should check partition bound constraint for
the new tuple.
-- If partition key is updated, the row should be moved to the appropriate
-- partition. updatable views using partitions should enforce the check options
-- for the rows that have been moved.

Can this be changed a bit? I think it's not accurate to say that an
update to a partition key causes the row to move. The row movement
only occurs when the new tuple does not match the partition bound and
another partition exists that does have a partition bound that matches
the tuple. How about:

-- When a partitioned table receives an UPDATE to the partitioned key and the
-- new values no longer meet the partition's bound, the row must be moved to
-- the correct partition for the new partition key (if one exists). We must
-- also ensure that updatable views on partitioned tables properly enforce any
-- WITH CHECK OPTION that is defined. The situation with triggers in this case
-- also requires thorough testing as partition key updates causing row
-- movement convert UPDATEs into DELETE+INSERT.

6. What does the following actually test?

-- This tests partition-key UPDATE on a partitioned table that does
not have any child partitions
update part_b_10_b_20 set b = b - 6;

There are no records in that partition, or anywhere in the hierarchy.
Are you just testing that there's no error? If so then the comment
should say so.

7. I think the following comment:

-- As mentioned above, the partition creation is intentionally kept in
descending bound order.

should instead say:

-- Create some more partitions following the above pattern of descendi

Re: [HACKERS] Timeline ID in backup_label file

2018-01-08 Thread Michael Paquier

On Sat, Jan 06, 2018 at 09:56:03AM +0900, Michael Paquier wrote:
> Previous patch was incorrect, this was the same v2 as upthread.
> Attached is the correct v3.

For the archive's sake, this version of the patch has been pushed as
6271fce. Thanks Simon for the commit, and David for the review!
--
Michael


signature.asc
Description: PGP signature

Re: pgsql: Implement channel binding tls-server-end-point for SCRAM

2018-01-08 Thread Michael Paquier

On Sat, Jan 06, 2018 at 09:10:51AM +0900, Michael Paquier wrote:
> On Sat, Jan 6, 2018 at 2:56 AM, Alvaro Herrera  
> wrote:
> Good idea as a whole, but I don't think this is the right approach. As
> we include $(bindir) in PATH when running the prove command, why not
> getting the include path from "pg_config --includedir"?

So, I have been looking at that, and I propose the following
counter-patch which implements this idea by adding a new routine as
TestLib::config_check which is able to check within pg_config.h if a
given regexp matches or not for the installation on which TAP tests are
being run. I have tested with Postgres compiled with both OpenSSL 1.0.1
and 1.0.2, in which case the connection test respectively fails and
passes, causing the test to be correctly handled. This is based on
Peter's patch upthread.
--
Michael
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 72826d5bad..c49b4336c1 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -26,6 +26,7 @@ our @EXPORT = qw(
   slurp_dir
   slurp_file
   append_to_file
+  config_check
   system_or_bail
   system_log
   run_log
@@ -221,6 +222,27 @@ sub append_to_file
close $fh;
 }
 
+# Check presence of a given regexp within pg_config.h for the installation
+# where tests are running, returning a match status result depending on
+# that.
+sub config_check
+{
+   my ($regexp) = @_;
+   my ($stdout, $stderr);
+   my $result = IPC::Run::run [ 'pg_config', '--includedir' ], '>',
+   \$stdout, '2>', \$stderr;
+
+   print("# Checking for match with expression \"$regexp\" in 
pg_config.h\n");
+
+   die "could not execute pg_config" if $result != 1;
+   chomp($stdout);
+
+   open my $pg_config_h, '<', "$stdout/pg_config.h" or die "$!";
+   my $match = (grep {/^$regexp/} <$pg_config_h>);
+   close $pg_config_h;
+   return $match;
+}
+
 #
 # Test functions
 #
diff --git a/src/test/ssl/t/002_scram.pl b/src/test/ssl/t/002_scram.pl
index 3f425e00f0..73c8f7adf2 100644
--- a/src/test/ssl/t/002_scram.pl
+++ b/src/test/ssl/t/002_scram.pl
@@ -11,6 +11,10 @@ use File::Copy;
 # This is the hostname used to connect to the server.
 my $SERVERHOSTADDR = '127.0.0.1';
 
+# Determine whether build supports tls-server-end-point.
+my $supports_tls_server_end_point =
+   TestLib::config_check("#define HAVE_X509_GET_SIGNATURE_NID 1");
+
 # Allocation of base connection string shared among multiple tests.
 my $common_connstr;
 
@@ -44,10 +48,19 @@ test_connect_ok($common_connstr,
"SCRAM authentication with tls-unique as channel binding");
 test_connect_ok($common_connstr,
"scram_channel_binding=''",
-   "SCRAM authentication without channel binding");
-test_connect_ok($common_connstr,
-   "scram_channel_binding=tls-server-end-point",
-   "SCRAM authentication with tls-server-end-point as channel binding");
+   "SCRAM authentication without channel binding");
+if ($supports_tls_server_end_point)
+{
+   test_connect_ok($common_connstr,
+   
"scram_channel_binding=tls-server-end-point",
+   "SCRAM authentication with 
tls-server-end-point as channel binding");
+}
+else
+{
+   test_connect_fails($common_connstr,
+   
"scram_channel_binding=tls-server-end-point",
+   "SCRAM authentication with 
tls-server-end-point as channel binding");
+}
 test_connect_fails($common_connstr,
"scram_channel_binding=not-exists",
"SCRAM authentication with invalid channel binding");


signature.asc
Description: PGP signature

Re: Condition variable live lock

2018-01-08 Thread Thomas Munro

On Mon, Jan 8, 2018 at 5:25 PM, Thomas Munro
 wrote:
> On Sun, Jan 7, 2018 at 10:43 AM, Tom Lane  wrote:
>> Actually ... perhaps a better design would be to have
>> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
>> a different condition variable, analogously to what we just did in
>> ConditionVariableBroadcast, on the same theory that whenever control
>> returns to the other CV wait loop it can re-establish the relevant
>> state easily enough.  I have to think that if the use of CVs grows
>> much, the existing restriction is going to become untenable anyway,
>> so why not just get rid of it?
>
> +1
>
> It's a more robust API this way.

One very small thing after another look:

-   Assert(cv_sleep_target == NULL);
+   if (cv_sleep_target != NULL)
+   ConditionVariableCancelSleep();

The test for cv_sleep_target != NULL is redundant since
ConditionVariableCancelSleep() would return early.
ConditionVariableBroadcast() doesn't do that.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: [HACKERS] [PATCH] Incremental sort

2018-01-08 Thread Antonin Houska

Alexander Korotkov  wrote:

> Antonin Houska  wrote:

> >  Shouldn't the test contain *both* cases?

> Thank you for pointing that. Sure, both cases are better. I've added second 
> case as well as comments. Patch is attached.

I'm fine with the tests now but have a minor comment on this comment:

-- CROSS JOIN, not pushed down, because we don't push down LIMIT and remote side
-- can't perform top-N sort like local side can.

I think the note on LIMIT push-down makes the comment less clear because
there's no difference in processing the LIMIT: EXPLAIN shows that both

SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET
100 LIMIT 10;

and

SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET
100 LIMIT 10;

evaluate the LIMIT clause only locally.

What I consider the important difference is that the 2nd case does not
generate the appropriate input for remote incremental sort (while incremental
sort tends to be very cheap). Therefore it's cheaper to do no remote sort at
all and perform the top-N sort locally than to do a regular (non-incremental)
remote sort.

I have no other questions about this patch. I expect the CFM to set the status
to "ready for committer" as soon as the other reviewers confirm they're happy
about the patch status.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Re: Parallel append plan instability/randomness

2018-01-08 Thread Amit Kapila

On Mon, Jan 8, 2018 at 2:11 PM, Amit Khandekar  wrote:
> On 8 January 2018 at 13:35, Amit Kapila  wrote:
>> On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane  wrote:
>>> Amit Khandekar  writes:
 The fact that b_star gets moved from 5th position to  the first
 position in the scans, indicates that it's cost shoots up from 1.04 to
 a value greater than 1.16. It does not look like a case where two
 costs are almost same due to which their positions swap sometimes. I
 am trying to figure out what else can it be ...
>>>
>>
>> That occurred to me as well, but still, the change in plan can happen
>> due to the similar costs.
>
> Agreed. But I think we should first fix the issue due to which the
> test might be failing in this case. BTW, for your patch, I am thinking
> we can have a separate factor other than STD_FUZZ_FACTOR ? This way,
> we can make it much smaller than 1.01 also.
>

Sounds good.  However, I am not sure what should be the right value
for it, apart from STD_FUZZ_FACTOR we are using 1.01 also as
fuzz_factor in the code.  Any suggestions?

 And anyways,
> STD_FUZZ_FACTOR is used only for comparing paths on the same relation,
> whereas in our case, our comparision goal is different.
>
>> Another possibility as indicated in the
>> previous email is that if somehow the stats of table (reltuples,
>> relpages) is not appropriate, say due to some reason analyze doesn't
>> happen on the table.
>
> Yes, I am also thinking on the same lines. E.g., if the relpages is 0
> (due to no analyze run yet), tuple density calculation follows a
> different logic, due to which reltuples can be quite bigger. I suspect
> this also might be the reason. So yes, I think it's worth having
> ANALYZE on *_star.
>
>> For example, if you just manually update the
>> value of reltuples for b_star in pg_class to 20 or so, you will see
>> the plan as indicated in the failure.  If that is true, then probably
>> doing Analyze before Parallel Append should do the trick.
>
>  Or better still, we can have Analyze in create_misc.sql and
> create_table.sql where the table is populated.
>

Then probably we might want to do in misc.sql file as that Alters some
of these tables which can lead to a rewrite of the table.  I think we
can do it either way, but lets first wait for Tom or Robert to comment
on whether they agree with this theory or if they see some other
reason for this behavior.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Parallel append plan instability/randomness

2018-01-08 Thread Amit Khandekar

On 8 January 2018 at 13:35, Amit Kapila  wrote:
> On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane  wrote:
>> Amit Khandekar  writes:
>>> The fact that b_star gets moved from 5th position to  the first
>>> position in the scans, indicates that it's cost shoots up from 1.04 to
>>> a value greater than 1.16. It does not look like a case where two
>>> costs are almost same due to which their positions swap sometimes. I
>>> am trying to figure out what else can it be ...
>>
>
> That occurred to me as well, but still, the change in plan can happen
> due to the similar costs.

Agreed. But I think we should first fix the issue due to which the
test might be failing in this case. BTW, for your patch, I am thinking
we can have a separate factor other than STD_FUZZ_FACTOR ? This way,
we can make it much smaller than 1.01 also. And anyways,
STD_FUZZ_FACTOR is used only for comparing paths on the same relation,
whereas in our case, our comparision goal is different.

> Another possibility as indicated in the
> previous email is that if somehow the stats of table (reltuples,
> relpages) is not appropriate, say due to some reason analyze doesn't
> happen on the table.

Yes, I am also thinking on the same lines. E.g., if the relpages is 0
(due to no analyze run yet), tuple density calculation follows a
different logic, due to which reltuples can be quite bigger. I suspect
this also might be the reason. So yes, I think it's worth having
ANALYZE on *_star.

> For example, if you just manually update the
> value of reltuples for b_star in pg_class to 20 or so, you will see
> the plan as indicated in the failure.  If that is true, then probably
> doing Analyze before Parallel Append should do the trick.

 Or better still, we can have Analyze in create_misc.sql and
create_table.sql where the table is populated.

>
>> The gut feeling I had upon seeing the failure was that the plan shape
>> depends on the order in which rows happen to be read from the system
>> catalogs by a heapscan.  I've not tried to run that idea to ground yet.
>>
>
> I don't see how something like that can happen because we internally
> sort the subpaths for parallel append.

True. Or may I didn't understand. Tom, are you referring to reading
pg_class.reltuples or similar, when say "read from the system
catalogs" ?

>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Re: Parallel append plan instability/randomness

2018-01-08 Thread Amit Kapila

On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane  wrote:
> Amit Khandekar  writes:
>> The fact that b_star gets moved from 5th position to  the first
>> position in the scans, indicates that it's cost shoots up from 1.04 to
>> a value greater than 1.16. It does not look like a case where two
>> costs are almost same due to which their positions swap sometimes. I
>> am trying to figure out what else can it be ...
>

That occurred to me as well, but still, the change in plan can happen
due to the similar costs.  Another possibility as indicated in the
previous email is that if somehow the stats of table (reltuples,
relpages) is not appropriate, say due to some reason analyze doesn't
happen on the table.  For example, if you just manually update the
value of reltuples for b_star in pg_class to 20 or so, you will see
the plan as indicated in the failure.  If that is true, then probably
doing Analyze before Parallel Append should do the trick.

> The gut feeling I had upon seeing the failure was that the plan shape
> depends on the order in which rows happen to be read from the system
> catalogs by a heapscan.  I've not tried to run that idea to ground yet.
>

I don't see how something like that can happen because we internally
sort the subpaths for parallel append.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

88 matches

Mail list logo