[HACKERS] System load consideration before spawning parallel workers

2016-07-28 Thread Haribabu Kommi
We observed that spawning the specified number of parallel workers for
every query that qualifies for parallelism sometimes leads to a
performance drop, rather than an improvement, during peak system load
with other processes. Adding more processes to the system leads to
more context switches, which reduces the performance of other SQL
operations.

In order to avoid this problem, how about taking the system load into
account before spawning the parallel workers?

This may not be a problem for some users, so instead of adding the
system-load calculation code and so on into the core, how about
providing an additional hook? A user who wants to consider the system
load could register a function there, and the hook would provide the
number of parallel workers that can be started.
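
To make the idea concrete, here is a rough sketch of what such a hook
could look like. Everything below is hypothetical -- the type name, the
hook variable, and the getloadavg()-based policy -- since nothing like
this exists in core today:

```
#include <stdlib.h>             /* getloadavg() on most Unixes */

/* Hypothetical hook type: given the planner's worker count, return how
 * many workers may actually be launched. */
typedef int (*parallel_workers_hook_type) (int planned_workers);
static parallel_workers_hook_type parallel_workers_hook = NULL;

/* Example policy an extension might register (e.g. from its _PG_init):
 * back off when the 1-minute load average is high; the threshold is
 * arbitrary. */
static int
load_aware_parallel_workers(int planned_workers)
{
    double      load;

    if (getloadavg(&load, 1) == 1 && load > 8.0)
        return planned_workers / 2;     /* halve workers under load */
    return planned_workers;
}

/* Sketch of the call site: core would consult the hook, if any, just
 * before launching workers. */
static int
workers_to_launch(int planned_workers)
{
    if (parallel_workers_hook != NULL)
        return parallel_workers_hook(planned_workers);
    return planned_workers;
}
```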

Comments?

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Kyotaro HORIGUCHI
At Fri, 29 Jul 2016 12:47:53 +0900, Michael Paquier  
wrote in 

[HACKERS] Double invocation of InitPostmasterChild in bgworker with -DEXEC_BACKEND

2016-07-28 Thread Thomas Munro
Hi

I discovered that if you build with -DEXEC_BACKEND on a Unix system
and then try to start a background worker, it dies in
InitializeLatchSupport:

TRAP: FailedAssertion("!(selfpipe_readfd == -1)", File: "latch.c", Line: 161)

That's because InitPostmasterChild is called twice.  I can
successfully use regular parallel query workers and bgworkers created
by extensions if I apply the attached patch.
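
For context, the assertion that fails guards one-time initialization of
the latch self-pipe. A simplified sketch of the pattern (the real
latch.c does considerably more) shows why a second InitPostmasterChild
call trips it:

```
#include <assert.h>
#include <unistd.h>

static int selfpipe_readfd = -1;
static int selfpipe_writefd = -1;

/* Simplified sketch: this must run exactly once per process, so a
 * duplicate call fails the assertion reported in the TRAP above. */
void
InitializeLatchSupport(void)
{
    int         pipefd[2];

    assert(selfpipe_readfd == -1);      /* fires on the second call */

    if (pipe(pipefd) == 0)
    {
        selfpipe_readfd = pipefd[0];
        selfpipe_writefd = pipefd[1];
    }
    /* ... making the pipe nonblocking etc. elided ... */
}
```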

-- 
Thomas Munro
http://www.enterprisedb.com


fix.patch
Description: Binary data



Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Kouhei Kaigai
> I wrote:
> >> That may be so, but my point is that the target relations involved in
> >> the foreign join (ie, ft1 and ft2) should be printed somewhere in the
> >> EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.
> 
> > Why? According to your rule, Hash Join should show "on t0,t1,t2".
> >
> > postgres=# explain select id from t0 natural join t1 natural join t2;
> >  QUERY PLAN
> > -----------------------------------------------------------------
> >  Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
> >Hash Cond: (t0.aid = t1.aid)
> >->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
> >  Hash Cond: (t0.bid = t2.bid)
> >  ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
> >  ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
> >->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
> >->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
> >  ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
> > (9 rows)
> 
> I don't think it needs "on t0,t1,t2", because we can see joining
> relations from inner/outer plans in that case.  In a foreign-join case,
> however, we can't see such relations from the EXPLAIN printed *by core*.
>   postgres_fdw avoids this issue by adding such relations to the EXPLAIN
> using ExplainForeignScan as shown in the below example, but since such
> relations are essential, I think that information should be shown by
> core itself.
>
I have never opposed having the core print the relation names; however,
it has to be controllable by the extension that provides the
ForeignScan/CustomScan, because only the extension knows exactly how the
underlying relations will be scanned.

> postgres=# explain select * from (select ft1.a from ft1 left join ft2 on
> ft1.a = ft2.a where ft1.b = 1) ss1(a) full join (select ft3.a from ft3
> left join ft4 on ft3.a = ft4.a where ft3.b = 1) ss2(a) on ss1.a = ss2.a;
> QUERY PLAN
> ------------------------------------------------------------------
>   Hash Full Join  (cost=202.06..204.12 rows=1 width=8)
> Hash Cond: (ft1.a = ft3.a)
> ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
>   Relations: (public.ft1) LEFT JOIN (public.ft2)
> ->  Hash  (cost=102.05..102.05 rows=1 width=4)
>   ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
> Relations: (public.ft3) LEFT JOIN (public.ft4)
> (7 rows)
> 
>  From the Relations line shown by postgres_fdw, we can see which foreign
> join joins which foreign tables, but without such lines, we couldn't.
>
In the case of postgres_fdw, if the extension can direct the core EXPLAIN
to print all of its underlying relations, the core EXPLAIN routine will
print the names of the relations. There is no problem here.

> >>> postgres=# explain select id from t0 natural join t1 natural join t2;
> >>> QUERY PLAN
> >>> ---------------------------------------------------------------------
> >>>  Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
> >>>GPU Projection: t0.id
> >>>Depth 1: GpuHashJoin, HashKeys: (t0.bid)
> >>> JoinQuals: (t0.bid = t2.bid)
> >>> Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
> >>>Depth 2: GpuHashJoin, HashKeys: (t0.aid)
> >>> JoinQuals: (t0.aid = t1.aid)
> >>> Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
> >>>->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
> >>>->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
> >>>->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
> >>> (11 rows)
> 
> > My largest concern with your proposition is that the ForeignScan/CustomScan
> > node would be forced to print the names of its underlying relations,
> > regardless of its actual behavior. The GpuJoin above never scans the tables
> > at all, so it would mislead users if we had no choice about printing the
> > underlying relation names.
> 
> OK, I understand we would need special handling for such custom joins.
>
It is not a case only for CustomScan.

Suppose an enhancement of postgres_fdw that reads a small local table
(tbl_1) and embeds its rows as a VALUES() clause within a remote query, to
execute a remote join with foreign tables (ftbl_2, ftbl_3).
This ForeignScan node has three underlying relations: tbl_1, ftbl_2, and
ftbl_3. Likely, tbl_1 will be scanned by a SeqScan, not by the ForeignScan
itself. In this case, which relations should be printed for the
ForeignScan? Is it possible to choose the proper relation names without a
hint from the extension?
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Ashutosh Bapat
I wrote:
>
>> Probably something like this:
>>>
>>>Foreign Processing
>>>  Remote Operations: ...
>>>
>>> In the Remote Operations line, the FDW/extension could print any info
>>> about remote operations, eg, "Scan/Join + Aggregate".
>>>
>>
> "Foreign" implies this node is processed by FDW, but "Procesing" gives us
>> no extra information; seems to me redundant.
>>
>
> I intentionally chose that word and thought we could leave detailed
> descriptions about remote operations to the FDW/extension; a broader word
> like "Processing" seems to work well because we allow various kinds of
> operations to the remote side, in addition to scans/joins, to be performed
> in that one Foreign Scan node indicated by "Foreign Processing", such as
> aggregation, window functions, distinct, order by, row locking, table
> modification, or combinations of them.
>
> "Scan" is a better word than "Processing". From plan's perspective it's
ultimately a Scan (on the data produced by the foreign server) and not
processing.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Etsuro Fujita

On 2016/07/28 22:11, Kouhei Kaigai wrote:

I wrote:

That may be so, but my point is that the target relations involved in
the foreign join (ie, ft1 and ft2) should be printed somewhere in the
EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.



Why? According to your rule, Hash Join should show "on t0,t1,t2".

postgres=# explain select id from t0 natural join t1 natural join t2;
 QUERY PLAN
-----------------------------------------------------------------
 Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
   Hash Cond: (t0.aid = t1.aid)
   ->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
 Hash Cond: (t0.bid = t2.bid)
 ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
 ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
 ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(9 rows)


I don't think it needs "on t0,t1,t2", because we can see joining  
relations from inner/outer plans in that case.  In a foreign-join case,  
however, we can't see such relations from the EXPLAIN printed *by core*.  
 postgres_fdw avoids this issue by adding such relations to the EXPLAIN  
using ExplainForeignScan as shown in the below example, but since such  
relations are essential, I think that information should be shown by  
core itself.


postgres=# explain select * from (select ft1.a from ft1 left join ft2 on  
ft1.a = ft2.a where ft1.b = 1) ss1(a) full join (select ft3.a from ft3  
left join ft4 on ft3.a = ft4.a where ft3.b = 1) ss2(a) on ss1.a = ss2.a;

   QUERY PLAN
------------------------------------------------------------
 Hash Full Join  (cost=202.06..204.12 rows=1 width=8)
   Hash Cond: (ft1.a = ft3.a)
   ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
 Relations: (public.ft1) LEFT JOIN (public.ft2)
   ->  Hash  (cost=102.05..102.05 rows=1 width=4)
 ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
   Relations: (public.ft3) LEFT JOIN (public.ft4)
(7 rows)

From the Relations line shown by postgres_fdw, we can see which foreign  
join joins which foreign tables, but without such lines, we couldn't.
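
For reference, this is roughly how postgres_fdw emits that line today,
via its ExplainForeignScan callback (simplified from the real source;
the fdw_private bookkeeping is abbreviated):

```
/* Simplified from postgres_fdw: the FDW, not core, prints the
 * Relations line by handing a prebuilt string to EXPLAIN. */
static void
postgresExplainForeignScan(ForeignScanState *node, ExplainState *es)
{
    List       *fdw_private = ((ForeignScan *) node->ss.ps.plan)->fdw_private;

    /* For a pushed-down join, a relation list was stashed at plan time. */
    if (list_length(fdw_private) > FdwScanPrivateRelations)
    {
        char       *relations = strVal(list_nth(fdw_private,
                                                FdwScanPrivateRelations));

        /* prints e.g.:  Relations: (public.ft1) LEFT JOIN (public.ft2) */
        ExplainPropertyText("Relations", relations, es);
    }

    /* The Remote SQL line is shown only under VERBOSE. */
    if (es->verbose)
    {
        char       *sql = strVal(list_nth(fdw_private,
                                          FdwScanPrivateSelectSql));

        ExplainPropertyText("Remote SQL", sql, es);
    }
}
```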


I wrote:

Probably something like this:

   Foreign Processing
 Remote Operations: ...

In the Remote Operations line, the FDW/extension could print any info
about remote operations, eg, "Scan/Join + Aggregate".



"Foreign" implies this node is processed by FDW, but "Procesing" gives us
no extra information; seems to me redundant.


I intentionally chose that word and thought we could leave detailed  
descriptions about remote operations to the FDW/extension; a broader  
word like "Processing" seems to work well because we allow various kinds  
of operations to the remote side, in addition to scans/joins, to be  
performed in that one Foreign Scan node indicated by "Foreign  
Processing", such as aggregation, window functions, distinct, order by,  
row locking, table modification, or combinations of them.



Prior to the new invention, please explain first why you don't want to go
with my suggestion. "Annoying" is how you feel; it is not a logical
argument to persuade others.


I'm not saying that the idea I proposed is better than your suggestion.
Just brainstorming.  I want to know what options we have and the pros
and cons of each approach.



postgres=# explain select id from t0 natural join t1 natural join t2;
QUERY PLAN
---------------------------------------------------------------------
 Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
   GPU Projection: t0.id
   Depth 1: GpuHashJoin, HashKeys: (t0.bid)
JoinQuals: (t0.bid = t2.bid)
Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
   Depth 2: GpuHashJoin, HashKeys: (t0.aid)
JoinQuals: (t0.aid = t1.aid)
Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
   ->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(11 rows)



My largest concern with your proposition is that the ForeignScan/CustomScan
node would be forced to print the names of its underlying relations,
regardless of its actual behavior. The GpuJoin above never scans the tables
at all, so it would mislead users if we had no choice about printing the
underlying relation names.


OK, I understand we would need special handling for such custom joins.

Best regards,
Etsuro Fujita






Re: [HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Michael Paquier
On Fri, Jul 29, 2016 at 12:18 PM, Kyotaro HORIGUCHI
 wrote:
> At Thu, 28 Jul 2016 10:46:00 -0400, Tom Lane  wrote in 
> <4313.1469717...@sss.pgh.pa.us>
>> Fujii Masao  writes:
>> > 3. Several source comments in pqcomm.c have not been updated.
>> > Some comments still use the old function name like pq_putmessage().
>>
>> > Attached patch fixes the above issues.
>>
>> I dunno, this seems like it's doubling down on some extremely poor
>> decisions.  Why is it that you now have to flip a coin to guess whether
>> the prefix is pq_ or socket_ for functions in this module?  I would
>> rather see that renaming reverted.

Yes, I agree with that. I cannot understand the intention behind
2bd9e41 to rename those routines as they are now, so getting them back
with pq_ as the prefix looks like a good idea to me.

> The set of functions in PQcommMethods doesn't seem clean. They
> are chosen arbitrarily just so that other pq_* functions used in
> parallel workers will work as expected. I suppose that it needs
> some refactoring.

Any work in this area is likely 10.0 material at this point.

> By the way, pq_start/endcopyout() are used only in FE protocols
> below 3.0, which had already become obsolete as of PG7.4. While
> the next dev cycle is for PG10, if there is no particular reason
> to support such ancient protocols, removing them would make things
> easier and cleaner.

Removing support for protocol 2 has been in the air for some time, but
that's a separate discussion. If you want to discuss this issue in
particular, raising a new thread would be a good idea.
-- 
Michael




Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement

2016-07-28 Thread Andres Freund
On 2016-07-28 23:08:13 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2016 at 7:29 PM, Andres Freund  wrote:
> > I think just iterating through the active snapshots would have been
> > fine. Afaics there's no guarantee that the first active snapshot pushed
> > is the relevant one - in contrast to registered one, which are ordered
> > by virtue of the heap.
> 
> I think the oldest snapshot has to be on the bottom of the stack; how not?

Well, one might push a previously acquired (and registered) snapshot onto
the stack. Afaics that'll only happen if the snapshot is already
registered, but I'm not sure.




Re: [HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Kyotaro HORIGUCHI
At Thu, 28 Jul 2016 10:46:00 -0400, Tom Lane  wrote in 
<4313.1469717...@sss.pgh.pa.us>
> Fujii Masao  writes:
> > 3. Several source comments in pqcomm.c have not been updated.
> > Some comments still use the old function name like pq_putmessage().
> 
> > Attached patch fixes the above issues.
> 
> I dunno, this seems like it's doubling down on some extremely poor
> decisions.  Why is it that you now have to flip a coin to guess whether
> the prefix is pq_ or socket_ for functions in this module?  I would
> rather see that renaming reverted.

The set of functions in PQcommMethods doesn't seem clean. They
are chosen arbitrarily just so that other pq_* functions used in
parallel workers will work as expected. I suppose that it needs
some refactoring.

By the way, pq_start/endcopyout() are used only in FE protocols
below 3.0, which had already become obsolete as of PG7.4. While
the next dev cycle is for PG10, if there is no particular reason
to support such ancient protocols, removing them would make things
easier and cleaner.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement

2016-07-28 Thread Robert Haas
On Thu, Jul 28, 2016 at 7:29 PM, Andres Freund  wrote:
> I think just iterating through the active snapshots would have been
> fine. Afaics there's no guarantee that the first active snapshot pushed
> is the relevant one - in contrast to registered one, which are ordered
> by virtue of the heap.

I think the oldest snapshot has to be on the bottom of the stack; how not?

>> > Hm. Could we perhaps assert that the session has a valid xmin?
>>
>> I don't think so.  CLUSTER?
>
> That should have one during any toast lookups afaics - the relevant code
> is
> /* Start a new transaction for each relation. */
> StartTransactionCommand();
> /* functions in indexes may want a snapshot set */
> PushActiveSnapshot(GetTransactionSnapshot());
> /* Do the job. */
> cluster_rel(rvtc->tableOid, rvtc->indexOid, true, 
> stmt->verbose);
> PopActiveSnapshot();
> CommitTransactionCommand();
> right? And
> Snapshot
> GetSnapshotData(Snapshot snapshot)
> {
> ...
>
> if (!TransactionIdIsValid(MyPgXact->xmin))
> MyPgXact->xmin = TransactionXmin = xmin;
> sets xmin.

Hmm, OK, I'll have to check on that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] A Modest Upgrade Proposal

2016-07-28 Thread Bruce Momjian
On Thu, Jul 28, 2016 at 09:22:17AM +0800, Craig Ringer wrote:
> It might, actually. One approach for online upgrade is to:
> 
> * pg_basebackup the master
> * start the replica and let it catch up
> * create a logical replication slot on the master
> * replace the recovery.conf on the basebackup so it stops recovery at the
> lsn of the replication slot's confirmed_flush_lsn
> * stop the replica and pg_upgrade it
> * have the upgraded replica, now a master, replay from the old master over
> logical replication
> * once caught up, switch over
> 
> This means a full dump and reload with a full rebuild of all indexes, etc,
> isn't needed. All shared catalog stuff is copied (until we switch to logical
> rep for the final catch-up).

Right, using pg_upgrade as part of a logical upgrade procedure makes
sense.  I was referring to having pg_upgrade --online do all of those
bullets plus what it does now --- that just seems like it would fail as
too complex, and if someone wanted to do just pg_upgrade without the
logical, the manual page would be incomprehensible.

In summary, I think we need to keep pg_upgrade doing what it does well,
and come up with another tool that does those bullets.  Heck, people
already can't find --link.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] "Strong sides of MySQL" talk from PgDay16Russia, translated

2016-07-28 Thread Nikolay Samokhvalov
On Fri, Jul 29, 2016 at 4:39 AM, Tatsuo Ishii  wrote:
>
> Great translation.
>
> BTW, is there any information in the opposite direction, i.e. showing the
> limitations of MySQL compared with PostgreSQL? I'm not familiar with
> MySQL, but I occasionally hear about surprising (to a PostgreSQL user)
> limitations of MySQL and wonder if there's any summary of that information.
>

Sorry, I cannot help with that :-) I stopped using MySQL in 2005, when I
discovered that we speak different languages (I learned standard ISO/ANSI
SQL at university, and then easily communicated with Oracle and SQL Server,
but failed to do so with MySQL; that's why I switched to Postgres).

In recent years, the focus of the "let's compare Postgres to ..." activity
has switched from MySQL to MongoDB and Oracle.
Maybe it's time to refresh the data -- for those who work with both
Postgres and MySQL.


Re: [HACKERS] "Strong sides of MySQL" talk from PgDay16Russia, translated

2016-07-28 Thread Tatsuo Ishii
> Following Uber's case discussion, I found this talk by Alexey Kopytov to be
> really interesting:
>   http://kaamos.me/talks/pgday16/strongmysql/strongmysql.html (online html,
> in Russian)
> 
> I translated it to English:
> 
> https://www.dropbox.com/s/t6a15s66jxg50tg/mysqlstrong_pgday16russia.pdf?dl=0
> (pdf)
> 
> The slide deck contains a lot of details. The author claims that in
> recent years MySQL has made a lot of progress in defending and advancing
> its position as the "most popular database for the web"; he provides
> detailed reasoning for that, and then concludes that PostgreSQL will need
> years and maybe even decades to close the gaps in certain fields that are
> very sensitive for large companies:
>  - replication
>  - storage engines / compression / direct IO / etc.
>  - partitioning,
> etc.
> 
> Of course this information is biased (Alexey works at Percona), but IMO
> it's a much more detailed, qualitative, and useful analysis than Uber's
> recent article.

Great translation.

BTW, is there any information in the opposite direction, i.e. showing the
limitations of MySQL compared with PostgreSQL? I'm not familiar with
MySQL, but I occasionally hear about surprising (to a PostgreSQL user)
limitations of MySQL and wonder if there's any summary of that information.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




[HACKERS] "Strong sides of MySQL" talk from PgDay16Russia, translated

2016-07-28 Thread Nikolay Samokhvalov
Following Uber's case discussion, I found this talk by Alexey Kopytov to be
really interesting:
  http://kaamos.me/talks/pgday16/strongmysql/strongmysql.html (online html,
in Russian)

I translated it to English:

https://www.dropbox.com/s/t6a15s66jxg50tg/mysqlstrong_pgday16russia.pdf?dl=0
(pdf)

The slide deck contains a lot of details. The author claims that in
recent years MySQL has made a lot of progress in defending and advancing
its position as the "most popular database for the web"; he provides
detailed reasoning for that, and then concludes that PostgreSQL will need
years and maybe even decades to close the gaps in certain fields that are
very sensitive for large companies:
 - replication
 - storage engines / compression / direct IO / etc.
 - partitioning,
etc.

Of course this information is biased (Alexey works at Percona), but IMO
it's a much more detailed, qualitative, and useful analysis than Uber's
recent article.


[HACKERS] Duplicate prototype for socket_set_nonblocking.

2016-07-28 Thread Kyotaro HORIGUCHI
Hello,

Looking into pqcomm.c, I noticed that the prototype of
socket_set_nonblocking is defined twice. Actually the prototype is
not necessary at all, but one may remain as a convention.

It does no harm, so backpatching seems unnecessary.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fa8911cbab690c8dc2f8b0d8864dee5470730bbe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 29 Jul 2016 10:01:53 +0900
Subject: [PATCH] Remove duplicate prototype definition for
 socket_set_nonblocking.

---
 src/backend/libpq/pqcomm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index ba42753..90b6946 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -145,7 +145,6 @@ static void socket_startcopyout(void);
 static void socket_endcopyout(bool errorAbort);
 static int	internal_putbytes(const char *s, size_t len);
 static int	internal_flush(void);
-static void socket_set_nonblocking(bool nonblocking);
 
 #ifdef HAVE_UNIX_SOCKETS
 static int	Lock_AF_UNIX(char *unixSocketDir, char *unixSocketPath);
-- 
1.8.3.1




Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Josh Berkus
On 07/28/2016 03:58 AM, Geoff Winkless wrote:
> On 27 July 2016 at 17:04, Bruce Momjian wrote:
> 
> Well, their big complaint about binary replication is that a bug can
> spread from a master to all slaves, which doesn't happen with statement
> level replication.  
> 
>
> I'm not sure that that makes sense to me. If there's a database bug
> that occurs when you run a statement on the master, it seems there's a
> decent chance that that same bug is going to occur when you run the same
> statement on the slave.
> 
> Obviously it depends on the type of bug and how identical the slave is,
> but statement-level replication certainly doesn't preclude such a bug
> from propagating.​

That's correct, which is why I ignored that part of their post.

However, we did have issues for a couple of years where replication
accuracy was poorly tested, and did have several bugs associated with
that.  It's not surprising that a few major users got hit hard by those
bugs and decided to switch.

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] pg_upgrade: exit_hook_registered variable

2016-07-28 Thread Kyotaro HORIGUCHI
At Thu, 28 Jul 2016 10:58:24 -0400, Tom Lane  wrote in 
<4782.1469717...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> >> I noticed that exit_hook_registered variable in start_postmaster() is
> >> local variable. Shouldn't it be a static variable?
> 
> > Good catch! It is totally useless unless being static.
> 
> Indeed.
> 
> > But I think it is not necessary to go outside
> > start_postmaster. Just being static staying in start_postmaster
> > is enough.
> 
> I agree, since there's no need for any other function to touch it.
> But personally I'd want to visually separate the static variable
> from the non-static ones.  So maybe like this:

It looks better indeed.

> {
>   charcmd[MAXPGPATH * 4 + 1000];
>   PGconn *conn;
> - boolexit_hook_registered = false;
>   boolpg_ctl_return = false;
>   charsocket_string[MAXPGPATH + 200];
> +
> + static bool exit_hook_registered = false;
> 
>   if (!exit_hook_registered)
>   {
> 
> 
> Will push it in a bit.

Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement

2016-07-28 Thread Andres Freund
On 2016-07-28 15:40:48 -0400, Robert Haas wrote:
> On Wed, Jul 27, 2016 at 7:26 PM, Andres Freund  wrote:
> >> Also, I wonder why it's right to use
> >> pairingheap_first() instead of looking at the oldest snapshot on the
> >> active snapshot stack, or conversely why that is right and this is
> >> not.  Or maybe we need to check both.
> >
> > Well, generally, the older the snapshot we use is, the more we'll error
> > out spuriously. The reason to use the oldest from the pairing heap is
> > that that'll be the most conservative value, right?  If there's neither
> > an active, nor a registered snapshot, we'll not prevent pruning in the
> > first place (and xmin ought to be invalid).  As registered snapshots are
> > usually "older" than active ones, we definitely have to check those. But
> > you're right, it seems safer to also check active ones - which squares
> > with SnapshotResetXmin().
> 
> OK.  That's a bit inconvenient because we don't have an O(1) way to
> access the bottom of the active snapshot stack; I've attempted to add
> such a mechanism here.

I think just iterating through the active snapshots would have been
fine. Afaics there's no guarantee that the first active snapshot pushed
is the relevant one - in contrast to registered one, which are ordered
by virtue of the heap.


> >> But there's a bit of spooky action at a
> >> distance: if we don't see any snapshots, then we have to assume that
> >> the scan is being performed with a non-MVCC snapshot and therefore we
> >> don't need to worry.  But there's no way to verify that via an
> >> assertion, because the connection between the relevant MVCC snapshot
> >> and the corresponding TOAST snapshot is pretty indirect, mediated only
> >> by snapmgr.c.
> >
> > Hm. Could we perhaps assert that the session has a valid xmin?
> 
> I don't think so.  CLUSTER?

That should have one during any toast lookups afaics - the relevant code
is
/* Start a new transaction for each relation. */
StartTransactionCommand();
/* functions in indexes may want a snapshot set */
PushActiveSnapshot(GetTransactionSnapshot());
/* Do the job. */
cluster_rel(rvtc->tableOid, rvtc->indexOid, true, 
stmt->verbose);
PopActiveSnapshot();
CommitTransactionCommand();
right? And
Snapshot
GetSnapshotData(Snapshot snapshot)
{
...

if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
sets xmin.


Greetings,

Andres Freund




Re: [HACKERS] LWLocks in DSM memory

2016-07-28 Thread Thomas Munro
On Fri, Jul 29, 2016 at 2:12 AM, Robert Haas  wrote:
> On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro
>  wrote:
>> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund  wrote:
>>> I think the better fix here would acutally be to get rid of a pointer
>>> based list here, and a) replace the queue with integer offsets ...
>>
>> Here is a prototype of that idea.  It replaces that dlist with a
>> proclist, a new specialised doubly-linked list for pgprocno values,
>> using INVALID_PGPROCNO for termination.  It works with proclist_node
>> objects inside PGPROC at an arbitrary offset which must be provided
>> when you initialise the proclist.
>
> Aside from the fact that this allows LWLocks inside DSM segments,
> which I definitely want to support, this seems to have the nice
> property of reducing the size of an LWLock by 8 bytes.  We need to
> consider what to do about LWLOCK_MINIMAL_SIZE.  We could (a) decide to
> still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce
> false sharing, and rename this constant not to imply minimality; or
> (b) alter the macro definition to allow for 16 bytes as a possible
> result.

That version didn't actually make LWLock any smaller, because of the
additional offset stored in proclist_head when initialising the list.
Here is a new version that requires callers to provide it when
accessing the list, bringing sizeof(LWLock) down to 16 on LP64/LLP64
systems.  In the spirit of the dlist_container macro, I added macros
so you can name the link member rather than providing its offset:
proclist_push_tail(&list, procno, lwWaitLink).
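
For reference, the proclist shape, simplified from the patch (the patch
itself is authoritative): links are pgprocno integers rather than
pointers, which is what makes the structure usable at different mapped
addresses in different processes:

```
/* Simplified from the patch: nodes and heads hold pgprocno values,
 * with INVALID_PGPROCNO as the terminator, so no raw pointers are
 * stored in shared (or DSM) memory. */
typedef struct proclist_node
{
    int         next;           /* pgprocno of next waiter */
    int         prev;           /* pgprocno of previous waiter */
} proclist_node;

typedef struct proclist_head
{
    int         head;           /* pgprocno of first waiter */
    int         tail;           /* pgprocno of last waiter */
} proclist_head;

/* dlist_container-style convenience: name the PGPROC link member and
 * let the macro supply its offset. */
#define proclist_push_tail(list, procno, link_member) \
    proclist_push_tail_offset((list), (procno), offsetof(PGPROC, link_member))
```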

-- 
Thomas Munro
http://www.enterprisedb.com


lwlocks-in-dsm-v3.patch
Description: Binary data



Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE

2016-07-28 Thread David Fetter
On Wed, Jul 27, 2016 at 02:59:17PM +0200, Vik Fearing wrote:
> On 27/07/16 06:11, David Fetter wrote:
> > On Wed, Jul 27, 2016 at 03:24:28AM +0200, Vik Fearing wrote:
> >> On 27/07/16 03:15, Peter Eisentraut wrote:
> >>> On 7/26/16 6:14 PM, Vik Fearing wrote:
>  As mentioned elsewhere in the thread, you can just do WHERE true
>  to get around it, so why on Earth have it PGC_SUSET?
> >>>
> >>> I'm not sure whether it's supposed to guard against typos and
> >>> possibly buggy SQL string concatenation in application code.  So
> >>> it would help against accidental mistakes, whereas putting a WHERE
> >>> TRUE in there would be an intentional override.
> >>
> >> If buggy SQL string concatenation in application code is your
> >> argument, quite a lot of them add "WHERE true" so that they can just
> >> append a bunch of "AND ..." clauses without worrying if it's the
> >> first (or last, whatever), so I'm not sure this is protecting
> >> anything.
> > 
> > I am sure that I'm not the only one who's been asked for this feature
> > because people other than me have piped up on this thread to that very
> > effect.
> 
> Sure.  I'm just saying that I think it is poorly designed.  I think it
> would be far better to error out if the command affects x rows, or an
> estimated y% of the table.

What else would constitute a good design?

I am a little wary of relying on estimates, at least those provided by
EXPLAIN, because the row counts they produce can be off by several
orders of magnitude.

Are there more accurate ways to estimate?

Would you want x and y to be parameters somewhere?

> Doing that, and also allowing the user to turn it off, would solve the
> problem as I understand your presentation of it.

I made it PGC_USERSET in the third patch.

> > I understand that there may well be lots of really meticulous people
> > on this list, people who would never accidentally do an unqualified
> > DELETE on a table in production, but I can't claim to be one of them
> > because I have, and not just once.  It's under once a decade, but even
> > that's too many.
> 
> That doesn't mean that requiring a WHERE clause -- without even looking
> at what's in it -- is a good idea.
> 
> Why not start by turning off autocommit, for example?

Because that setting is client side, and even more vulnerable to not
being turned on for everyone everywhere.

> > I'm not proposing to make this feature default, or even available by
> > default, but I am totally certain that this is the kind of feature
> > people would really appreciate, even if it doesn't prevent every
> > catastrophe.
> 
> This kind of feature, why not.  This feature, no.

I would very much value your input into the design of the feature.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] BRIN vs. HOT

2016-07-28 Thread David Fetter
On Thu, Jul 28, 2016 at 10:53:47AM -0400, Robert Haas wrote:
> [1] I look forward to a future PostgreSQL conference in which the
> struggle to pronounce "HMT" forms a recurring theme.

For maximal confusion, this should really be called a Heap Often
Tuple.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement

2016-07-28 Thread Robert Haas
On Wed, Jul 27, 2016 at 7:26 PM, Andres Freund  wrote:
> Why did you decide to introduce MaximumXLogRecPtr? Wouldn't just using
> InvalidXLogRecPtr do the trick? That already prevents errors.

Oh, right.

>> Also, I wonder why it's right to use
>> pairingheap_first() instead of looking at the oldest snapshot on the
>> active snapshot stack, or conversely why that is right and this is
>> not.  Or maybe we need to check both.
>
> Well, generally, the older the snapshot we use is, the more we'll error
> out spuriously. The reason to use the oldest from the pairing heap is
> that that'll be the most conservative value, right?  If there's neither
> an active, nor a registered snapshot, we'll not prevent pruning in the
> first place (and xmin ought to be invalid).  As registered snapshots are
> usually "older" than active ones, we definitely have to check those. But
> you're right, it seems safer to also check active ones - which squares
> with SnapshotResetXmin().

OK.  That's a bit inconvenient because we don't have an O(1) way to
access the bottom of the active snapshot stack; I've attempted to add
such a mechanism here.
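
To give the idea in miniature (the attached patch is authoritative; this
sketch leaves out refcounting and copy-vs-reference details): keep a
second pointer next to the stack top that only changes when the stack
becomes empty.

```
static ActiveSnapshotElt *ActiveSnapshot = NULL;        /* top of stack */
static ActiveSnapshotElt *OldestActiveSnapshot = NULL;  /* bottom of stack */

void
PushActiveSnapshot(Snapshot snap)
{
    ActiveSnapshotElt *newactive;

    newactive = MemoryContextAlloc(TopTransactionContext,
                                   sizeof(ActiveSnapshotElt));
    newactive->as_snap = snap;
    newactive->as_next = ActiveSnapshot;
    ActiveSnapshot = newactive;

    /* The first element pushed stays the oldest until the stack empties. */
    if (OldestActiveSnapshot == NULL)
        OldestActiveSnapshot = ActiveSnapshot;
}

void
PopActiveSnapshot(void)
{
    ActiveSnapshotElt *olde = ActiveSnapshot;

    ActiveSnapshot = olde->as_next;
    if (ActiveSnapshot == NULL)
        OldestActiveSnapshot = NULL;    /* stack is empty again */
    pfree(olde);
}
```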

>> But there's a bit of spooky action at a
>> distance: if we don't see any snapshots, then we have to assume that
>> the scan is being performed with a non-MVCC snapshot and therefore we
>> don't need to worry.  But there's no way to verify that via an
>> assertion, because the connection between the relevant MVCC snapshot
>> and the corresponding TOAST snapshot is pretty indirect, mediated only
>> by snapmgr.c.
>
> Hm. Could we perhaps assert that the session has a valid xmin?

I don't think so.  CLUSTER?

New version attached.

(Official status update: I expect to have this issue wrapped up in the
next few days, assuming that review and discussion continue.  If for
some reason that fails to be the case, I will provide a further
official status update no later than Tuesday of next week: August 2,
2016.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


ost-heap-toast-disagreement-v2.patch
Description: application/download



Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-07-28 Thread Tom Lane
Aleksander Alekseev  writes:
>> Can you explain a use case where you need it?

> ... Or maybe you have different objects, e.g. IndexScanDescs, that should
> iterate over some trees independently somewhere in the indexam.c
> procedures. The exact order may depend on the user's query, so you don't
> even control it.

It seems clear to me that the existing arrangement is hazardous for any
RBTree that hasn't got exactly one consumer.  I think Aleksander's plan to
decouple the iteration state is probably a good one (NB: I've not read the
patch, so this is not an endorsement of details).  I'd go so far as to say
that we should remove the old API as being dangerous, rather than preserve
it on backwards-compatibility grounds.  We make bigger changes than that
in internal APIs all the time.
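
To sketch what a decoupled interface could look like (illustrative only;
as said, I've not read the patch, so the details may well differ): the
caller owns the iterator state, so any number of traversals of one tree
can be open at once.

```
/* Illustrative sketch of a caller-owned iterator. */
typedef struct RBTreeIterator
{
    RBTree     *rb;                         /* tree being traversed */
    RBNode     *(*iterate) (struct RBTreeIterator *iter);   /* ordering */
    RBNode     *last_visited;               /* current position */
    bool        is_over;                    /* traversal finished? */
} RBTreeIterator;

/* The caller supplies the state object, so nothing is stored in RBTree. */
extern void rb_begin_iterate(RBTree *rb, RBOrderControl ctrl,
                             RBTreeIterator *iter);
extern RBNode *rb_iterate(RBTreeIterator *iter);
```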

Having said that, though: if the iteration state is not part of the
object, it's not very clear whether we can behave sanely if someone
changes the tree while an iteration is open.  It will need careful
thought as to what sort of guarantees we can make about that.  If it's
too weak, then a separated-state version would have enough hazards of
its own that it's not necessarily any safer.

regards, tom lane




Re: [HACKERS] BRIN vs. HOT

2016-07-28 Thread Alvaro Herrera
Robert Haas wrote:
> If I understand correctly, we currently deem an update to be non-HOT
> whenever any indexed column is updated.  This is because tuple
> versions created by HOT updates can later be removed by HOT pruning,
> which means that they must not be referenced by index entries.
> Otherwise, after HOT pruning removed the tuple, the index entries
> would at best be pointing at nothing and at worse be pointing at some
> completely unrelated tuple.
> 
> But what about index types that do not store TIDs - i.e. BRIN?  

Oh, I had a note to get back to the topic of HOT updates and it fell
through the cracks :-(  You're right, this needs to be addressed.

> If the
> indexed column is updated, we can't actually create a Heap Only Tuple
> (HOT), because then the index might be wrong.  But we could create a
> Heap Mostly Tuple[1].  We'd construct the update chain in the heap
> page just as we would for HOT, and set all the same flags.  But then
> we'd also insert new index entries for any TID-free indexes, currently
> just BRIN.  For BRIN, that would have the effect of updating the
> summary data for that page in such a way that it would encompass both
> the old and new values.

Sounds reasonable.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] BRIN vs. HOT

2016-07-28 Thread Petr Jelinek

On 28/07/16 16:53, Robert Haas wrote:

If I understand correctly, we currently deem an update to be non-HOT
whenever any indexed column is updated.  This is because tuple
versions created by HOT updates can later be removed by HOT pruning,
which means that they must not be referenced by index entries.
Otherwise, after HOT pruning removed the tuple, the index entries
would at best be pointing at nothing and at worse be pointing at some
completely unrelated tuple.

But what about index types that do not store TIDs - i.e. BRIN?  If the


I was thinking about this as well when I saw the Uber post.


indexed column is updated, we can't actually create a Heap Only Tuple
(HOT), because then the index might be wrong.  But we could create a
Heap Mostly Tuple[1].  We'd construct the update chain in the heap
page just as we would for HOT, and set all the same flags.  But then
we'd also insert new index entries for any TID-free indexes, currently
just BRIN.  For BRIN, that would have the effect of updating the
summary data for that page in such a way that it would encompass both
the old and new values.


I thought about adding another AM API function to let the index decide
whether an update can be HOT or not, but I like your idea more.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Alex Ignatov


On 28.07.2016 17:53, Vladimir Sitnikov wrote:



>> That's a recipe for runaway table bloat; VACUUM can't do much because
>> there's always some minutes-old transaction hanging around (and SNAPSHOT
>> TOO OLD doesn't really help, we're talking about minutes here), and
>> because of all of the indexes HOT isn't effective.


Just curious: what if PostgreSQL supported an index that stores the
"primary key" (or unique key) instead of tids?
Am I right that that kind of index would not suffer from that bloat? I'm
assuming the primary key is not updated, so secondary indices built
that way should be much less prone to bloat when updates land on other
columns (even if the tid moves, its PK does not change, so the secondary
index row could be reused).

If that works, it could reduce index bloat and reduce the amount of WAL
(fewer indices will need to be updated). Of course it will make index
scans a bit slower; however, it looks like at least Uber is fine with
that extra cost of index scans.

Does it make sense to implement that kind of index as an access method?

Vladimir


You mean an IOT (index-organized table), like Oracle has?

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-07-28 Thread Aleksander Alekseev
> Can you explain a use case where you need it?

Sure. You can consider an RBTree as a container that always keeps its
elements in sorted order.  Now imagine you would like to write code
like this:

```
/* iterate over items in sorted order */
while ((item1 = left_right_walk(tree)) != NULL)
{
  /* another iteration, probably even in a different procedure */
  while ((item2 = left_right_walk(tree)) != NULL)
  {
    /* ... some logic ... */
  }
}
```

Currently you can't do it.

Or maybe you have different objects, e.g. IndexScanDescs, that should
iterate over some trees independently somewhere in the indexam.c
procedures. The exact order may depend on the user's query, so you don't
even control it.

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] pg_upgrade: exit_hook_registered variable

2016-07-28 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
>> I noticed that exit_hook_registered variable in start_postmaster() is
>> local variable. Shouldn't it be a static variable?

> Good catch! It is totally useless unless being static.

Indeed.

> But I think it is not necessary to go outside
> start_postmaster. Just being static staying in start_postmaster
> is enough.

I agree, since there's no need for any other function to touch it.
But personally I'd want to visually separate the static variable
from the non-static ones.  So maybe like this:

{
charcmd[MAXPGPATH * 4 + 1000];
PGconn *conn;
-   boolexit_hook_registered = false;
boolpg_ctl_return = false;
charsocket_string[MAXPGPATH + 200];
+
+   static bool exit_hook_registered = false;

if (!exit_hook_registered)
{


Will push it in a bit.

regards, tom lane




Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Vladimir Sitnikov
>
>
> >> That's a recipe for runaway table bloat; VACUUM can't do much because
> >> there's always some minutes-old transaction hanging around (and SNAPSHOT
> >> TOO OLD doesn't really help, we're talking about minutes here), and
> >> because of all of the indexes HOT isn't effective.
>

Just curious: what if PostgreSQL supported an index that stores the
"primary key" (or unique key) instead of tids?
Am I right that that kind of index would not suffer from that bloat? I'm
assuming the primary key is not updated, so secondary indices built
that way should be much less prone to bloat when updates land on other
columns (even if the tid moves, its PK does not change, so the secondary
index row could be reused).

If that works, it could reduce index bloat and reduce the amount of WAL
(fewer indices will need to be updated). Of course it will make index
scans a bit slower; however, it looks like at least Uber is fine with
that extra cost of index scans.

Does it make sense to implement that kind of index as an access method?

Vladimir


[HACKERS] BRIN vs. HOT

2016-07-28 Thread Robert Haas
If I understand correctly, we currently deem an update to be non-HOT
whenever any indexed column is updated.  This is because tuple
versions created by HOT updates can later be removed by HOT pruning,
which means that they must not be referenced by index entries.
Otherwise, after HOT pruning removed the tuple, the index entries
would at best be pointing at nothing and at worse be pointing at some
completely unrelated tuple.

But what about index types that do not store TIDs - i.e. BRIN?  If the
indexed column is updated, we can't actually create a Heap Only Tuple
(HOT), because then the index might be wrong.  But we could create a
Heap Mostly Tuple[1].  We'd construct the update chain in the heap
page just as we would for HOT, and set all the same flags.  But then
we'd also insert new index entries for any TID-free indexes, currently
just BRIN.  For BRIN, that would have the effect of updating the
summary data for that page in such a way that it would encompass both
the old and new values.
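
Purely as a hypothetical sketch of that decision -- nothing like this
exists in the tree, and the two helper predicates and every other name
below are invented for illustration:

```
/* Hypothetical: decide whether an update can form a "heap mostly
 * tuple" chain, collecting the TID-free indexes that then need new
 * summary entries. */
static bool
update_can_be_heap_mostly(Bitmapset *modified_attrs, List *index_infos,
                          List **indexes_to_reinsert)
{
    ListCell   *lc;

    foreach(lc, index_infos)
    {
        IndexInfo  *ii = (IndexInfo *) lfirst(lc);

        /* Indexes whose columns didn't change never block HOT. */
        if (!index_touches_attrs(ii, modified_attrs))   /* hypothetical */
            continue;

        /* A TID-bearing index (btree, gin, ...) forces a non-HOT update. */
        if (!index_is_tid_free(ii))                     /* hypothetical */
            return false;

        /* A TID-free index (BRIN): widen its summary instead. */
        *indexes_to_reinsert = lappend(*indexes_to_reinsert, ii);
    }
    return true;
}
```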

It seems to me that this would make BRIN indexes a lot better for
people who want to perform ad-hoc analytic queries on data sets that
suffer at least some updates.  Right now, even if you knew that you
might want to run some occasional ad-hoc reporting queries, the
prospect of putting a BRIN index on all of your otherwise-unindexed
columns is pretty unappetizing because of this issue.  (Has anyone
tried running pgbench with a BRIN index on the abalance column, for
example?  I have not, but I bet it slows things down a lot.)  This
could make that a lot cheaper; instead of additional bloat in the
table and every index, you'd just pay the cost of widening BRIN
summary ranges as needed, which seems way better.

Apologies if this has been discussed before; a quick search did not
find any previous discussion on this topic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] I look forward to a future PostgreSQL conference in which the
struggle to pronounce "HMT" forms a recurring theme.




Re: [HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Tom Lane
Fujii Masao  writes:
> 3. Several source comments in pqcomm.c have not been updated.
> Some comments still use the old function name like pq_putmessage().

> Attached patch fixes the above issues.

I dunno, this seems like it's doubling down on some extremely poor
decisions.  Why is it that you now have to flip a coin to guess whether
the prefix is pq_ or socket_ for functions in this module?  I would
rather see that renaming reverted.

regards, tom lane




Re: [HACKERS] LWLocks in DSM memory

2016-07-28 Thread Robert Haas
On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro
 wrote:
> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund  wrote:
>> I think the better fix here would acutally be to get rid of a pointer
>> based list here, and a) replace the queue with integer offsets ...
>
> Here is a prototype of that idea.  It replaces that dlist with a
> proclist, a new specialised doubly-linked list for pgprocno values,
> using INVALID_PGPROCNO for termination.  It works with proclist_node
> objects inside PGPROC at an arbitrary offset which must be provided
> when you initialise the proclist.

Aside from the fact that this allows LWLocks inside DSM segments,
which I definitely want to support, this seems to have the nice
property of reducing the size of an LWLock by 8 bytes.  We need to
consider what to do about LWLOCK_MINIMAL_SIZE.  We could (a) decide to
still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce
false sharing, and rename this constant not to imply minimality; or
(b) alter the macro definition to allow for 16 bytes as a possible
result.
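
Option (b) could be as simple as the following sketch (illustrative, not
a finished patch):

```
/* Sketch of option (b): pad to 16 bytes when the struct fits, else 32. */
#define LWLOCK_MINIMAL_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)

/* Arrays that favor memory usage over false-sharing avoidance would
 * then use a union padded to the (possibly smaller) minimal size. */
typedef union LWLockMinimallyPadded
{
    LWLock      lock;
    char        pad[LWLOCK_MINIMAL_SIZE];
} LWLockMinimallyPadded;
```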

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Merlin Moncure
On Thu, Jul 28, 2016 at 8:16 AM, pgwhatever  wrote:
> Statement-based replication has a lot of problems, like nondeterministic
> UDFs.  Here is a link to see them all:
> https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages

Sure.  It's also incredibly efficient with respect to bandwidth -- so,
if your application was engineered to work around those problems,
it's a huge win.  They could have used pgpool, but I guess the fix was
already in.

Taking a step back, from the outside, it looks like uber:
*) has a very thick middleware, very thin database with respect to
logic and complexity
*) has a very high priority on quick and cheap (in terms of bandwidth)
replication
*) has decided the database needs to be interchangeable
*) is not afraid to make weak or erroneous technical justifications as
a basis of stack selection (the futex vs ipc argument I felt was
particularly awful -- it ignored the fact we use spinlocks)

The very fact that they swapped it out so easily suggests that they
were not utilizing the database as they could have, and a different
technical team might have come to a different result.   Postgres is a
very general system and rewards deep knowledge such that it can
outperform even specialty systems in the hands of a capable developer
(for example, myself).  I'm just now hammering in the final coffin
nails that will get solr swapped out for jsonb backed postgres.

I guess it's fair to say that they felt mysql is closer to what they
felt a database should do out of the box.  That's disappointing, but
life moves on.  The takeaways are:

*) people like different choices of replication mechanics -- statement
level sucks a lot of the time, but not all the time
*) hs/sr simplicity of configuration and operation is a big issue.
it's continually gotten better and still needs to
*) bad QC can cost you customers.   how much regression coverage do we
have of hs/sr?
*) postgres may not be the ideal choice for those who want a thin and
simple database

merlin




Re: [HACKERS] pg_basebackup wish list

2016-07-28 Thread Fujii Masao
On Thu, Jul 28, 2016 at 10:16 PM, Amit Kapila  wrote:
> On Thu, Jul 28, 2016 at 5:37 PM, Fujii Masao  wrote:
>>
>> Maybe I failed to parse his proposal. It would be helpful if you could elaborate.
>>
>
> As per mail [1], it seems the proposal is not to use .tar for -Z 0.

I was thinking that the proposal is "output uncompressed tar data,
and not add the ".gz" to the "base.tar" file name" part. So, if -Z 0 is
specified with tar format, .gz should not be added as a file extension.

> Now here actually we are on the fence, one can argue that if user
> doesn't want compression, he or she can use -F p (plain format).
> OTOH, without compression getting the backup as a single .tar file
> makes it simple to manage.

Right now we are providing both methods, plain and tar formats
(without compression, i.e., neither -z nor -Z options are specified).

> I think there is some value in providing
> .tar for -Z 0,

I was thinking that "-Ft -Z0" is something like an alias of "-Ft".
That is, the backup is taken in uncompressed tar format.

> however, in that case how should we define the usage of -F p
> -Z 0?  Shall we say that with plain format -Z 0 gets ignored, or throw an
> error, or do something else?  If the first, then I think it is better to
> mention that in the docs.

ISTM that it's better to ignore that case, like pg_dump -Ft -Z0
doesn't throw an error.

Regards,

-- 
Fujii Masao




Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-07-28 Thread Amit Kapila
On Wed, Jul 27, 2016 at 7:56 PM, Aleksander Alekseev
 wrote:
> Hello
>
> While working on some new feature for PostgreSQL (which is still in
> development and is irrelevant in this context) I noticed that the current
> RBTree implementation's interface is the following:
>
> ```
> extern void rb_begin_iterate(RBTree *rb, RBOrderControl ctrl);
> extern RBNode *rb_iterate(RBTree *rb);
> ```
>
> As you see, it doesn't allow multiple iterations over a single
> tree. I believe it's a major flaw for a general-purpose container.
>

Can you explain a use case where you need it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread pgwhatever
Statement-based replication has a lot of problems, like nondeterministic
UDFs.  Here is a link to see them all:
https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages



--
View this message in context: 
http://postgresql.nabble.com/Why-we-lost-Uber-as-a-user-tp5913417p5913750.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.




Re: [HACKERS] pg_basebackup wish list

2016-07-28 Thread Amit Kapila
On Thu, Jul 28, 2016 at 5:37 PM, Fujii Masao  wrote:
>
> Maybe I failed to parse his proposal. It would be helpful if you could elaborate.
>

As per mail [1], it seems the proposal is not to use .tar for -Z 0.
Now we are actually on the fence here: one can argue that if the user
doesn't want compression, he or she can use -F p (plain format).
OTOH, even without compression, getting the backup as a single .tar file
makes it simple to manage.  I think there is some value in providing
.tar for -Z 0; however, in that case how should we define the usage of -F p
-Z 0?  Shall we say that with plain format -Z 0 gets ignored, or throw an
error, or do something else?  If the first, then I think it is better to
mention that in the docs.

[1] - 
https://www.postgresql.org/message-id/CAMkU%3D1zzj0et2x9fCqxMGJ6XP-FtMSUwtNQGwF01698FRWQ6uA%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Kouhei Kaigai
> On 2016/07/28 10:01, Kouhei Kaigai wrote:
> > What I'm saying is here:
> 
> > EXPLAIN (COSTS false, VERBOSE)
> > SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY
> > t1.c3, t1.c1 OFFSET 100 LIMIT 10;
> >   QUERY PLAN
> > ---
> > Limit
> >   Output: t1.c1, t2.c1, t1.c3
> >   ->  Foreign Scan
> >   
> > Output: t1.c1, t2.c1, t1.c3
> > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> > Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T
> > 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY
r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST
> > (6 rows)
> 
> > Your concern is that EXPLAIN shows ForeignScan node as "Foreign Scan"
> > although it actually works as "Foreign Join".
> 
> That may be so, but my point is that the target relations involved in
> the foreign join (ie, ft1 and ft2) should be printed somewhere in the
> EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.
>
Why? According to your rule, Hash Join should take "on t0,t1,t2".

postgres=# explain select id from t0 natural join t1 natural join t2;
 QUERY PLAN
-
 Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
   Hash Cond: (t0.aid = t1.aid)
   ->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
 Hash Cond: (t0.bid = t2.bid)
 ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
 ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
 ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(9 rows)

> > My suggestion makes EXPLAIN print "Foreign %s" according to the label
> > assigned by the extension. Only FDW driver knows how this ForeignScan
> > node actually performs on, thus, only FDW driver can set meaningful label.
> >
> > This label may be "Join", may be "Partial Aggregate + Join", or something
> > others according to the actual job of this node.
> 
> OK, thanks for that.  But sorry, I'm not sure that's a good idea.  One
> reason for that is that FDWs/extensions might give different labels to
> the same upper-planner action, which would lead to confusing EXPLAINs.
>
If an extension puts on a label that is hard to understand, that is the extension's problem.

> Another is that labeling might be annoying, so some FDWs/extensions may
> not want to do that.
>
I'm happy with an arbitrary label; it's not annoying. :-)

> Couldn't we print EXPLAINs in a more unified way?  A simple idea is to
> introduce a broad label, eg, "Foreign Processing", as a unified label;
> if the FDW/extension performs any upper-planner actions remotely, then
> the label "Foreign Processing" will be printed by core, and following
> that, the FDW/extension would print what they want, using
> ExplainForeignScan/ExplainCustomScan.  Probably something like this:
> 
>Foreign Processing
>  Remote Operations: ...
> 
> In the Remote Operations line, the FDW/extension could print any info
> about remote operations, eg, "Scan/Join + Aggregate".  Different data
> sources have different concepts/terms, so it seems reasonable to me that
> there would be different descriptions by different data sources, to the
> same plan actions done remotely.
>
"Foreign" implies this node is processed by FDW, but "Procesing" gives us
no extra information; seems to me redundant.
Prior to the new invention, please explain why you don't want to by my
suggestion first? Annoying is a feel of you, but not a logic to persuade
others.

> >>>  - A flag to turn on/off printing relation(s) name
> >>
>> ISTM that the relation information should always be printed even in the
> >> case of scanrelid=0 by the core, because that would be essential for
> >> analyzing EXPLAIN results.
> 
> > We have no information about which relations are actually scanned by ForeignScan
> > and CustomScan. Your patch abuses fs_relids and custom_relids; however,
> > these fields don't indicate the relations actually scanned by these nodes.
> > They just track the relations processed by this node or its children.
> 
> Maybe my explanation was not enough, but I just mean that *joining
> relations* should be printed somewhere in the EXPLAIN output for a
> foreign/custom join as well, by core.
> 
> > postgres=# explain select id from t0 natural join t1 natural join t2;
> > QUERY PLAN
> > ---
> >  Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
> >GPU Projection: t0.id
> >Depth 1: GpuHashJoin, HashKeys: (t0.bid)
> > JoinQuals: (t0.bid = t2.bid)
Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)

[HACKERS] Fix for PL/Python slow input arrays traversal issue

2016-07-28 Thread Alexey Grishchenko
Hi

The following issue exists in PL/Python: when your function takes an array as
an input parameter, processing an array of fixed-size elements containing null
values is many times slower than processing the same array without nulls. Here
is an example:

-- Function

create or replace function test(a int8[]) returns int8 as $BODY$
return sum([x for x in a if x is not None])
$BODY$ language plpythonu volatile;

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 22.248 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# union all
pl_regression(# select null::int8 as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 7179.921 ms


As you can see, a single null in the array introduces a 320x slowdown. The
reason for this is the following:
The original implementation uses array_ref for each element of the array. Each
call to array_ref causes a subsequent call to array_seek. The function
array_seek in turn has a shortcut for fixed-size arrays with no nulls. But if
your array does not have fixed-size elements, or if it contains nulls, each
call to array_seek recalculates the Kth element's offset starting from the
first element. This is an O(N^2) algorithm, resulting in high processing times
for arrays of variable-size elements and arrays with nulls.

The fix I propose applies the same logic used in the array_out function for
efficient array traversal, keeping a pointer to the last fetched element's
offset, which results in a dramatic performance improvement for the affected
cases. With this implementation, arrays of fixed-size elements without nulls,
fixed-size elements with nulls, and variable-size elements are all processed
at the same speed. Here is the test after the fix is applied:

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 21.056 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-# from (
pl_regression(# select generate_series(1,10) as a
pl_regression(# union all
pl_regression(# select null::int8 as a
pl_regression(# ) as q;
 test
------
   55
(1 row)

Time: 22.839 ms
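For readers unfamiliar with the array_out-style traversal mentioned above,
here is a minimal sketch of the pattern (illustrative only, not the patch
itself; walk_array is a made-up name):

```
/*
 * O(N) traversal over a PostgreSQL array in the style of array_out():
 * advance one pointer over the data area instead of calling array_ref(),
 * which re-seeks from the first element on every call when the array
 * has nulls or variable-length elements.
 */
#include "postgres.h"
#include "access/tupmacs.h"
#include "utils/array.h"
#include "utils/lsyscache.h"

static void
walk_array(ArrayType *array)
{
    Oid         elmtype = ARR_ELEMTYPE(array);
    int         nitems = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
    char       *ptr = ARR_DATA_PTR(array);
    bits8      *bitmap = ARR_NULLBITMAP(array);
    int         bitmask = 1;
    int16       typlen;
    bool        typbyval;
    char        typalign;
    int         i;

    get_typlenbyvalalign(elmtype, &typlen, &typbyval, &typalign);

    for (i = 0; i < nitems; i++)
    {
        if (bitmap && (*bitmap & bitmask) == 0)
        {
            /* NULL element: nothing is stored for it in the data area */
        }
        else
        {
            Datum       value = fetch_att(ptr, typbyval, typlen);

            (void) value;       /* hand the value to the caller here */

            /* step past this element, then realign for the next one */
            ptr = att_addlength_pointer(ptr, typlen, ptr);
            ptr = (char *) att_align_nominal(ptr, typalign);
        }

        /* advance the null bitmap in lockstep */
        if (bitmap)
        {
            bitmask <<= 1;
            if (bitmask == 0x100)
            {
                bitmap++;
                bitmask = 1;
            }
        }
    }
}
```

Each element is visited exactly once, so the whole pass stays O(N)
regardless of nulls or variable-length elements.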


-- 
Best regards,
Alexey Grishchenko


0001-Fix-for-PL-Python-slow-input-arrays-traversal-issue.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Fujii Masao
On Thu, Jul 28, 2016 at 9:08 PM, Fujii Masao  wrote:
> On Thu, Jul 28, 2016 at 6:52 PM, Kyotaro HORIGUCHI
>  wrote:
>> Hello,
>>
>> While testing replication for 9.5, we found that the replication master can
>> ignore wal_sender_timeout and seemingly waits for the TCP
>> retransmission timeout in the case of a sudden power-off of a
>> standby.
>>
>> My investigation told me that the immediate cause could be that
>> secure_write() is called with *blocking mode* (that is,
>> port->noblock = false) under *pq_putmessage_noblock* macro called
>> from XLogSendPhysical().
>>
>> libpq.h of 9.5 and newer defines it as the following,
>>
>>> #define pq_putmessage(msgtype, s, len) \
>>>   (PqCommMethods->putmessage(msgtype, s, len))
>>> #define pq_putmessage_noblock(msgtype, s, len) \
>>>   (PqCommMethods->putmessage(msgtype, s, len))
>>
>> which apparently should be the following.
>>
>>> #define pq_putmessage_noblock(msgtype, s, len) \
>>>   (PqCommMethods->putmessage_noblock(msgtype, s, len))
>>
>> The attached patch fixes it.
>
> Good catch! Barring objection, I will push this to both master and 9.5.

Regarding this patch, while reading pqcomm.c, I found the following things.

1. socket_comm_reset() calls pq_endcopyout().
I think that socket_endcopyout() should be called, instead.

2. socket_putmessage_noblock() calls pq_putmessage().
I think that socket_putmessage() should be called, instead.

3. Several source comments in pqcomm.c have not been updated.
Some comments still use the old function name like pq_putmessage().

Attached patch fixes the above issues.

Regards,

-- 
Fujii Masao
*** a/src/backend/libpq/pqcomm.c
--- b/src/backend/libpq/pqcomm.c
***
*** 18,24 
   * NOTE: generally, it's a bad idea to emit outgoing messages directly with
   * pq_putbytes(), especially if the message would require multiple calls
   * to send.  Instead, use the routines in pqformat.c to construct the message
!  * in a buffer and then emit it in one call to pq_putmessage.  This ensures
   * that the channel will not be clogged by an incomplete message if execution
   * is aborted by ereport(ERROR) partway through the message.  The only
   * non-libpq code that should call pq_putbytes directly is old-style COPY OUT.
--- 18,24 
   * NOTE: generally, it's a bad idea to emit outgoing messages directly with
   * pq_putbytes(), especially if the message would require multiple calls
   * to send.  Instead, use the routines in pqformat.c to construct the message
!  * in a buffer and then emit it in one call to socket_putmessage.  This ensures
   * that the channel will not be clogged by an incomplete message if execution
   * is aborted by ereport(ERROR) partway through the message.  The only
   * non-libpq code that should call pq_putbytes directly is old-style COPY OUT.
***
*** 44,50 
   *		StreamClose			- Close a client/backend connection
   *		TouchSocketFiles	- Protect socket files against /tmp cleaners
   *		pq_init			- initialize libpq at backend startup
!  *		pq_comm_reset	- reset libpq during error recovery
   *		pq_close		- shutdown libpq at backend exit
   *
   * low-level I/O:
--- 44,50 
   *		StreamClose			- Close a client/backend connection
   *		TouchSocketFiles	- Protect socket files against /tmp cleaners
   *		pq_init			- initialize libpq at backend startup
!  *		socket_comm_reset	- reset libpq during error recovery
   *		pq_close		- shutdown libpq at backend exit
   *
   * low-level I/O:
***
*** 53,68 
   *		pq_getmessage	- get a message with length word from connection
   *		pq_getbyte		- get next byte from connection
   *		pq_peekbyte		- peek at next byte from connection
!  *		pq_putbytes		- send bytes to connection (not flushed until pq_flush)
!  *		pq_flush		- flush pending output
!  *		pq_flush_if_writable - flush pending output if writable without blocking
   *		pq_getbyte_if_available - get a byte if available without blocking
   *
   * message-level I/O (and old-style-COPY-OUT cruft):
!  *		pq_putmessage	- send a normal message (suppressed in COPY OUT mode)
!  *		pq_putmessage_noblock - buffer a normal message (suppressed in COPY OUT)
!  *		pq_startcopyout - inform libpq that a COPY OUT transfer is beginning
!  *		pq_endcopyout	- end a COPY OUT transfer
   *
   *
   */
--- 53,68 
   *		pq_getmessage	- get a message with length word from connection
   *		pq_getbyte		- get next byte from connection
   *		pq_peekbyte		- peek at next byte from connection
!  *		pq_putbytes		- send bytes to connection (not flushed until socket_flush)
!  *		socket_flush		- flush pending output
!  *		socket_flush_if_writable - flush pending output if writable without blocking
   *		pq_getbyte_if_available - get a byte if available without blocking
   *
   * message-level I/O (and old-style-COPY-OUT cruft):
!  *		socket_putmessage	- send a normal message (suppressed in COPY OUT mode)
!  

Re: [HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Fujii Masao
On Thu, Jul 28, 2016 at 6:52 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> While testing replication for 9.5, we found that the replication master can
> ignore wal_sender_timeout and seemingly waits for the TCP
> retransmission timeout in the case of a sudden power-off of a
> standby.
>
> My investigation told me that the immediate cause could be that
> secure_write() is called with *blocking mode* (that is,
> port->noblock = false) under *pq_putmessage_noblock* macro called
> from XLogSendPhysical().
>
> libpq.h of 9.5 and newer defines it as the following,
>
>> #define pq_putmessage(msgtype, s, len) \
>>   (PqCommMethods->putmessage(msgtype, s, len))
>> #define pq_putmessage_noblock(msgtype, s, len) \
>>   (PqCommMethods->putmessage(msgtype, s, len))
>
> which apparently should be the following.
>
>> #define pq_putmessage_noblock(msgtype, s, len) \
>>   (PqCommMethods->putmessage_noblock(msgtype, s, len))
>
> The attached patch fixes it.

Good catch! Barring objection, I will push this to both master and 9.5.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup wish list

2016-07-28 Thread Fujii Masao
On Thu, Jul 28, 2016 at 8:44 PM, Amit Kapila  wrote:
> On Tue, Jul 26, 2016 at 11:58 AM, Fujii Masao  wrote:
>> On Wed, Jul 13, 2016 at 3:06 AM, Jeff Janes  wrote:
>>> On Tue, Jul 12, 2016 at 10:48 AM, Peter Eisentraut
>>>  wrote:
 On 7/12/16 12:53 PM, Jeff Janes wrote:
> The --help message for pg_basebackup says:
>
> -Z, --compress=0-9 compress tar output with given compression level
>
> But -Z0 is then rejected as 'invalid compression level "0"'.  The real
> docs do say 1-9, only the --help message has this bug.  Trivial patch
> attached.

 pg_dump --help and man page say it supports 0..9.  Maybe we should make
 that more consistent.
>>>
>>> pg_dump actually does support -Z0, though.  Well, sort of.  It outputs
>>> plain text.  Rather than plain text wrapped in some kind of dummy gzip
>>> header, which is what I had naively expected.
>>>
>>> Is that what -Z0 in pg_basebackup should do as well, just output
>>> uncompressed tar data, and not add the ".gz" to the "base.tar" file
>>> name?
>>
>> Yes, I think. What about the attached patch?
>>
>
> What if the user tries to use -Z 0 with tar format, won't it generate
> base.tar without any compression?

Yes, with the -Z 0 -F t options, the patched version of pg_basebackup generates
base.tar without compression.

> I am not sure if that is what Jeff
> intended in his proposal.

Maybe I failed to parse his proposal. It would be helpful if you could elaborate.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Etsuro Fujita

On 2016/07/28 10:01, Kouhei Kaigai wrote:

What I'm saying is here:



EXPLAIN (COSTS false, VERBOSE)
SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY
t1.c3, t1.c1 OFFSET 100 LIMIT 10;
  QUERY PLAN
---
Limit
  Output: t1.c1, t2.c1, t1.c3
  ->  Foreign Scan
  
Output: t1.c1, t2.c1, t1.c3
Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T
1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY
r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST
(6 rows)



Your concern is that EXPLAIN shows ForeignScan node as "Foreign Scan"
although it actually works as "Foreign Join".


That may be so, but my point is that the target relations involved in  
the foreign join (ie, ft1 and ft2) should be printed somewhere in the  
EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.



My suggestion makes EXPLAIN print "Foreign %s" according to the label
assigned by the extension. Only FDW driver knows how this ForeignScan
node actually performs on, thus, only FDW driver can set meaningful label.

This label may be "Join", may be "Partial Aggregate + Join", or something
others according to the actual job of this node.


OK, thanks for that.  But sorry, I'm not sure that's a good idea.  One  
reason for that is that FDWs/extensions might give different labels to
the same upper-planner action, which would lead to confusing EXPLAINs.  
Another is that labeling might be annoying, so some FDWs/extensions may  
not want to do that.


Couldn't we print EXPLAINs in a more unified way?  A simple idea is to  
introduce a broad label, eg, "Foreign Processing", as a unified label;  
if the FDW/extension performs any upper-planner actions remotely, then  
the label "Foreign Processing" will be printed by core, and following  
that, the FDW/extension would print what they want, using  
ExplainForeignScan/ExplainCustomScan.  Probably something like this:


  Foreign Processing
Remote Operations: ...

In the Remote Operations line, the FDW/extension could print any info  
about remote operations, eg, "Scan/Join + Aggregate".  Different data  
sources have different concepts/terms, so it seems reasonable to me that  
there would be different descriptions by different data sources, to the  
same plan actions done remotely.



 - A flag to turn on/off printing relation(s) name


ISTM that the relation information should always be printed even in the
case of scanrelid=0 by the core, because that would be essential for
analyzing EXPLAIN results.



We have no information about which relations are actually scanned by ForeignScan
and CustomScan. Your patch abuses fs_relids and custom_relids; however,
these fields don't indicate the relations actually scanned by these nodes.
They just track the relations processed by this node or its children.


Maybe my explanation was not enough, but I just mean that *joining  
relations* should be printed somewhere in the EXPLAIN output for a  
foreign/custom join as well, by core.



postgres=# explain select id from t0 natural join t1 natural join t2;
QUERY PLAN
---
 Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
   GPU Projection: t0.id
   Depth 1: GpuHashJoin, HashKeys: (t0.bid)
JoinQuals: (t0.bid = t2.bid)
Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
   Depth 2: GpuHashJoin, HashKeys: (t0.aid)
JoinQuals: (t0.aid = t1.aid)
Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
   ->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(11 rows)



If the EXPLAIN command attaches something like "on t0,t1,t2" after the GpuJoin
*with no choice*, it is problematic and misleading.
It shall be controllable by the extension, which knows what tables are
actually scanned. (Please note that I never said the extension makes the string;
that is a job of core EXPLAIN. I suggest it needs to be controllable.)


Again, I just mean that the info "on t0,t1,t2" should be printed after  
the GpuJoin label, as its joining relations.


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup wish list

2016-07-28 Thread Amit Kapila
On Tue, Jul 26, 2016 at 11:58 AM, Fujii Masao  wrote:
> On Wed, Jul 13, 2016 at 3:06 AM, Jeff Janes  wrote:
>> On Tue, Jul 12, 2016 at 10:48 AM, Peter Eisentraut
>>  wrote:
>>> On 7/12/16 12:53 PM, Jeff Janes wrote:
 The --help message for pg_basebackup says:

 -Z, --compress=0-9 compress tar output with given compression level

 But -Z0 is then rejected as 'invalid compression level "0"'.  The real
 docs do say 1-9, only the --help message has this bug.  Trivial patch
 attached.
>>>
>>> pg_dump --help and man page say it supports 0..9.  Maybe we should make
>>> that more consistent.
>>
>> pg_dump actually does support -Z0, though.  Well, sort of.  It outputs
>> plain text.  Rather than plain text wrapped in some kind of dummy gzip
>> header, which is what I had naively expected.
>>
>> Is that what -Z0 in pg_basebackup should do as well, just output
>> uncompressed tar data, and not add the ".gz" to the "base.tar" file
>> name?
>
> Yes, I think. What about the attached patch?
>

What if the user tries to use -Z 0 with tar format, won't it generate
base.tar without any compression?  I am not sure if that is what Jeff
intended in his proposal.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Geoff Winkless
On 28 Jul 2016 12:19, "Vitaly Burovoy"  wrote:
>
> On 7/28/16, Geoff Winkless  wrote:
> > On 27 July 2016 at 17:04, Bruce Momjian  wrote:
> >
> >> Well, their big complaint about binary replication is that a bug can
> >> spread from a master to all slaves, which doesn't happen with statement
> >> level replication.
> >
> > ​
> > I'm not sure that that makes sense to me. If there's a database bug that
> > occurs when you run a statement on the master, it seems there's a decent
> > chance that that same bug is going to occur when you run the same
statement
> > on the slave.
> >
> > Obviously it depends on the type of bug and how identical the slave is,
but
> > statement-level replication certainly doesn't preclude such a bug from
> > propagating.
> >
> > ​Geoff
>
> Please, read the article first! The bug is about wrong visibility of
> tuples after applying WAL on slaves.
> For example, you can see two different records selecting from a table
> by a primary key (moreover, their PKs are the same, but other columns
> differ).

I read the article. It affected slaves as well as the master.

I quote:
"because of the way replication works, this issue has the potential to
spread into all of the databases in a replication hierarchy"

I maintain that this is a nonsense argument, especially since (as you
pointed out, and as I missed the first time around) the bug actually occurred
on different records on different slaves, so he invalidates his own point.

Geoff


Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Vitaly Burovoy
On 7/28/16, Geoff Winkless  wrote:
> On 27 July 2016 at 17:04, Bruce Momjian  wrote:
>
>> Well, their big complaint about binary replication is that a bug can
>> spread from a master to all slaves, which doesn't happen with statement
>> level replication.
>
> ​
> I'm not sure that that makes sense to me. If there's a database bug that
> occurs when you run a statement on the master, it seems there's a decent
> chance that that same bug is going to occur when you run the same statement
> on the slave.
>
> Obviously it depends on the type of bug and how identical the slave is, but
> statement-level replication certainly doesn't preclude such a bug from
> propagating.
>
> ​Geoff

Please, read the article first! The bug is about wrong visibility of
tuples after applying WAL on slaves.
For example, you can see two different records selecting from a table
by a primary key (moreover, their PKs are the same, but other columns
differ).

From the article (emphasis mine):
The following query illustrates how this bug would affect our users
table example:
SELECT * FROM users WHERE id = 4;
This query would return *TWO* records: ...


And it affected slaves, not the master.
Slaves exist to decrease the load on the master; if you run all queries
(even read-only ones) on the master, why would you (or anyone) have so many slaves?

-- 
Best regards,
Vitaly Burovoy


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why we lost Uber as a user

2016-07-28 Thread Geoff Winkless
On 27 July 2016 at 17:04, Bruce Momjian  wrote:

> Well, their big complaint about binary replication is that a bug can
> spread from a master to all slaves, which doesn't happen with statement
> level replication.


​
​I'm not sure that that makes sense to me. If there's a database bug that
occurs when you run a statement on the master, it seems there's a decent
chance that that same bug is going to occur when you run the same statement
on the slave.

Obviously it depends on the type of bug and how identical the slave is, but
statement-level replication certainly doesn't preclude such a bug from
propagating.​


​Geoff​


Re: [HACKERS] pg_upgrade: exit_hook_registered variable

2016-07-28 Thread Kyotaro HORIGUCHI
Hello,

> I noticed that the exit_hook_registered variable in start_postmaster() is
> a local variable. Shouldn't it be a static variable?
> 
> I attached a patch. Thank you!

Good catch! It is totally useless unless it is static.

But I think it is not necessary to move it outside
start_postmaster. Making it static while keeping it inside start_postmaster
is enough.
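As a standalone illustration of why the flag must be static, a minimal
sketch (plain C, not pg_upgrade code; the names here are made up):

```
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

static void
cleanup_on_exit(void)
{
    puts("exit hook runs exactly once");
}

static void
start_server_like(void)
{
    /*
     * Without "static" this flag is re-initialized to false on every
     * call, so atexit() would register the hook once per invocation
     * instead of once per process; that is the bug being fixed here.
     */
    static bool exit_hook_registered = false;

    if (!exit_hook_registered)
        atexit(cleanup_on_exit);
    exit_hook_registered = true;
}

int
main(void)
{
    start_server_like();
    start_server_like();        /* second call must not re-register */
    return 0;
}
```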

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 21695d427b146a0a44297558c747bb75e48c62dd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 28 Jul 2016 19:13:46 +0900
Subject: [PATCH] start_postmaster of pg_upgrade registers its exit function
 more than once.

start_postmaster of pg_upgrade intends to register an exit
function, but it fails to prevent duplicate registration because the
control variable is defined as function-local. It should be static.
---
 src/bin/pg_upgrade/server.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 969e5d6..3fe169d 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -174,7 +174,7 @@ start_postmaster(ClusterInfo *cluster, bool throw_error)
 {
 	char		cmd[MAXPGPATH * 4 + 1000];
 	PGconn	   *conn;
-	bool		exit_hook_registered = false;
+	static bool	exit_hook_registered = false;
 	bool		pg_ctl_return = false;
 	char		socket_string[MAXPGPATH + 200];
 
-- 
1.8.3.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Wrong definition of pq_putmessage_noblock since 9.5

2016-07-28 Thread Kyotaro HORIGUCHI
Hello,

While testing replication for 9.5, we found that the replication master can
ignore wal_sender_timeout and seemingly waits for the TCP
retransmission timeout in the case of a sudden power-off of a
standby.

My investigation told me that the immediate cause could be that
secure_write() is called with *blocking mode* (that is,
port->noblock = false) under *pq_putmessage_noblock* macro called
from XLogSendPhysical().

libpq.h of 9.5 and newer defines it as the following,

> #define pq_putmessage(msgtype, s, len) \
>   (PqCommMethods->putmessage(msgtype, s, len))
> #define pq_putmessage_noblock(msgtype, s, len) \
>   (PqCommMethods->putmessage(msgtype, s, len))

which apparently should be the following.

> #define pq_putmessage_noblock(msgtype, s, len) \
>   (PqCommMethods->putmessage_noblock(msgtype, s, len))

The attached patch fixes it.
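For context, both macros dispatch through a table of function pointers, which
is what made the copy-and-paste slip easy to miss. An abridged sketch of that
table (see libpq.h for the full definition):

```
typedef struct
{
    int         (*putmessage) (char msgtype, const char *s, size_t len);
    void        (*putmessage_noblock) (char msgtype, const char *s, size_t len);
    /* ... comm_reset, flush, startcopyout, endcopyout, and so on ... */
} PQcommMethods;

extern PGDLLIMPORT PQcommMethods *PqCommMethods;
```

Because both macros expanded to ->putmessage, callers asking for the
non-blocking variant silently got the blocking implementation.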

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b15db80734215b7e3b4bbae849cf89ffd990e8be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 28 Jul 2016 18:28:57 +0900
Subject: [PATCH] Fix the definition of the pq_putmessage_noblock macro.

The pq_putmessage_noblock macro is mistakenly defined the same as
pq_putmessage. Fix it.
---
 src/include/libpq/libpq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 547d3b8..18052cb 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -43,7 +43,7 @@ extern PGDLLIMPORT PQcommMethods *PqCommMethods;
 #define pq_putmessage(msgtype, s, len) \
 	(PqCommMethods->putmessage(msgtype, s, len))
 #define pq_putmessage_noblock(msgtype, s, len) \
-	(PqCommMethods->putmessage(msgtype, s, len))
+	(PqCommMethods->putmessage_noblock(msgtype, s, len))
 #define pq_startcopyout() (PqCommMethods->startcopyout())
 #define pq_endcopyout(errorAbort) (PqCommMethods->endcopyout(errorAbort))
 
-- 
1.8.3.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_upgrade: exit_hook_registered variable

2016-07-28 Thread Artur Zakirov

Hello hackers,

I noticed that the exit_hook_registered variable in start_postmaster() is
a local variable. Shouldn't it be a static variable?


I attached a patch. Thank you!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 969e5d6..a13800f 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -14,6 +14,8 @@
 
 static PGconn *get_db_conn(ClusterInfo *cluster, const char *db_name);
 
+static bool exit_hook_registered = false;
+
 
 /*
  * connectToServer()
@@ -174,7 +176,6 @@ start_postmaster(ClusterInfo *cluster, bool throw_error)
 {
 	char		cmd[MAXPGPATH * 4 + 1000];
 	PGconn	   *conn;
-	bool		exit_hook_registered = false;
 	bool		pg_ctl_return = false;
 	char		socket_string[MAXPGPATH + 200];
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL logging problem in 9.4.3?

2016-07-28 Thread Michael Paquier
On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
 wrote:
> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
>  wrote:
>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele  wrote:
>>> I would prefer not to bump it to the next CF unless we decide this will
>>> not get fixed for 9.6.
>>
>> It may make sense to add that to the list of open items for 9.6
>> instead. That's not a feature.
>
> So I have moved this patch to the next CF for now, and will work on
> fixing it rather soonishly as an effort to stabilize 9.6 as well as
> back-branches.

Well, not that soon in the end, but I am back on this... I have not
completely reviewed all the code yet, and the case of an index relation
referring to a relation optimized with truncate is still broken, but
for now here is a rebased patch if people are interested. I am also
going to get a TAP test out of my pocket to ease testing.
-- 
Michael
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..bbc09cd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -55,6 +55,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2331,12 +2332,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2440,7 +2435,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2639,12 +2634,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	   *scratch = NULL;
 	Page		page;
-	bool		needwal;
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
 	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2659,7 +2652,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	 * palloc() within a critical section is not safe, so we allocate this
 	 * beforehand.
 	 */
-	if (needwal)
+	if (RelationNeedsWAL(relation))
 		scratch = palloc(BLCKSZ);
 
 	/*
@@ -2727,7 +2720,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			 * We don't use heap_multi_insert for catalog tuples yet, but
 			 * better be prepared...
 			 */
-			if (needwal && need_cids)
+			if (HeapNeedsWAL(relation, buffer) && need_cids)
 log_heap_new_cid(relation, heaptup);
 		}
 
@@ -2747,7 +2740,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (needwal)
+		if (HeapNeedsWAL(relation, buffer))
 		{
 			XLogRecPtr	recptr;
 			xl_heap_multi_insert *xlrec;
@@ -3261,7 +3254,7 @@ l1:
 	 * NB: heap_abort_speculative() uses the same xlog record and replay
 	 * routines.
 	 */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_delete xlrec;
 		XLogRecPtr	recptr;
@@ -3982,7 +3975,7 @@ l2:
 
 		MarkBufferDirty(buffer);
 
-		if (RelationNeedsWAL(relation))
+		if (HeapNeedsWAL(relation, buffer))
 		{
 			xl_heap_lock xlrec;
 			XLogRecPtr	recptr;
@@ -4194,7 +4187,7 @@ l2:
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		XLogRecPtr	recptr;
 
@@ -5148,7 +5141,7 @@ failed:
 	 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
 	 * entries for everything anyway.)
 	 */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, *buffer))
 	{
 		xl_heap_lock xlrec;
 		XLogRecPtr	recptr;
@@ -5825,7 +5818,7 @@ l4:
 		MarkBufferDirty(buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (HeapNeedsWAL(rel, buf))
 		{
 			xl_heap_lock_updated xlrec;
 			XLogRecPtr	recptr;
@@ -5980,7 +5973,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 	htup->t_ctid = tuple->t_self;
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_confirm xlrec;
 		XLogRecPtr	recptr;
@@ 

Re: [HACKERS] pg_basebackup wish list

2016-07-28 Thread Fujii Masao
On Tue, Jul 26, 2016 at 3:28 PM, Fujii Masao  wrote:
> On Wed, Jul 13, 2016 at 3:06 AM, Jeff Janes  wrote:
>> On Tue, Jul 12, 2016 at 10:48 AM, Peter Eisentraut
>>  wrote:
>>> On 7/12/16 12:53 PM, Jeff Janes wrote:
 The --help message for pg_basebackup says:

 -Z, --compress=0-9 compress tar output with given compression level

 But -Z0 is then rejected as 'invalid compression level "0"'.  The real
 docs do say 1-9, only the --help message has this bug.  Trivial patch
 attached.
>>>
>>> pg_dump --help and man page say it supports 0..9.  Maybe we should make
>>> that more consistent.
>>
>> pg_dump actually does support -Z0, though.  Well, sort of.  It outputs
>> plain text.  Rather than plain text wrapped in some kind of dummy gzip
>> header, which is what I had naively expected.
>>
>> Is that what -Z0 in pg_basebackup should do as well, just output
>> uncompressed tar data, and not add the ".gz" to the "base.tar" file
>> name?
>
> Yes, I think. What about the attached patch?


Barring any objection, I will commit this patch.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_replication_origin_xact_reset() and its argument variables

2016-07-28 Thread Fujii Masao
On Sat, Jul 2, 2016 at 7:01 AM, Tom Lane  wrote:
> Andres Freund  writes:
>> On 2016-06-30 10:14:04 -0400, Tom Lane wrote:
>>> Fujii Masao  writes:
 As far as I read the code of the function, those arguments don't seem to
 be necessary. So I'm afraid that the pg_proc entry for the function might
 be incorrect and those two arguments should be removed from the definition.
>
>>> Sure looks that way from here.  Copy-and-paste from the previous
>>> line in pg_proc.h, perhaps?
>
>> Yes, that's clearly wrong.

The attached patch (pg_replication_origin_xact_reset_9.6.patch) fixes this.
We need to apply it at least before RC1 of PostgreSQL 9.6 is released,
because the patch requires a catalog version bump.

>> Damn.  Can't fix that for 9.5 anymore. The
>> function isn't all that important (especially not from SQL), but still,
>> that's annoying.  I'm inclined to just remove the args in 9.6. We could
>> also add a note to the 9.5 docs, adding that the arguments are there by
>> error?

What about the attached patch (pg_replication_origin_xact_reset_9.5.patch)?

Regards,

-- 
Fujii Masao
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d6ed0ce..0a3d3de 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17519,7 +17519,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 
  pg_replication_origin_xact_reset
 
-pg_replication_origin_xact_reset()
+pg_replication_origin_xact_reset(origin_lsn pg_lsn, origin_timestamp timestamptz)


 void
@@ -17527,6 +17527,12 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());

 Cancel the effects of
 pg_replication_origin_xact_setup().
+Note that the two arguments were introduced by mistake
+during the PostgreSQL 9.5 development cycle;
+pg_replication_origin_xact_reset() actually
+does not use them at all. Therefore, any dummy values such as
+NULL can safely be passed as the arguments.
+This mistake will be fixed in a future release.

   
 
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 4fd532a..34735fb 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*			mmddN */
-#define CATALOG_VERSION_NO	201607071
+#define CATALOG_VERSION_NO	201607281
 
 #endif
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5d233e3..270dd21 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5284,7 +5284,7 @@ DESCR("get the replication progress of the current session");
 DATA(insert OID = 6010 ( pg_replication_origin_xact_setup PGNSP PGUID 12 1 0 0 0 f f f f t f v r 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ _null_ pg_replication_origin_xact_setup _null_ _null_ _null_ ));
 DESCR("setup the transaction's origin lsn and timestamp");
 
-DATA(insert OID = 6011 ( pg_replication_origin_xact_reset PGNSP PGUID 12 1 0 0 0 f f f f t f v r 2 0 2278 "3220 1184" _null_ _null_ _null_ _null_ _null_ pg_replication_origin_xact_reset _null_ _null_ _null_ ));
+DATA(insert OID = 6011 ( pg_replication_origin_xact_reset PGNSP PGUID 12 1 0 0 0 f f f f t f v r 0 0 2278 "" _null_ _null_ _null_ _null_ _null_ pg_replication_origin_xact_reset _null_ _null_ _null_ ));
 DESCR("reset the transaction's origin lsn and timestamp");
 
 DATA(insert OID = 6012 ( pg_replication_origin_advance PGNSP PGUID 12 1 0 0 0 f f f f t f v u 2 0 2278 "25 3220" _null_ _null_ _null_ _null_ _null_ pg_replication_origin_advance _null_ _null_ _null_ ));

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] handling unconvertible error messages

2016-07-28 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 27 Jul 2016 19:53:01 +0800, Craig Ringer  wrote 
in 
> On 25 July 2016 at 22:43, Peter Eisentraut wrote:
> 
> > Example: I have a database cluster initialized with --locale=ru_RU.UTF-8
> > (built with NLS).  Let's say for some reason, I have client encoding set
> > to LATIN1.  All error messages come back like this:
> >
> > test=> select * from notthere;
> > ERROR:  character with byte sequence 0xd0 0x9e in encoding "UTF8" has no
> > equivalent in encoding "LATIN1"
> >
> > There is no straightforward way for the client to learn that there is a
> > real error message, but it could not be converted.
> >
> > I think ideally we could make this better in two ways:
> >
> > 1) Send the original error message untranslated.  That would require
> > saving the original error message in errmsg(), errdetail(), etc.  That
> > would be a lot of work for only the occasional use.  But it would also
> > facilitate an occasionally-requested feature of writing untranslated
> > error messages into the server log or the csv log, while sending
> > translated messages to the client (or some variant thereof).
> >
> > 2) Send an indication that there was an encoding problem.  Maybe a
> > NOTICE, or an error context?  Wiring all this into elog.c looks a bit
> > tricky, however.
> >
> >
> We have a similar problem with the server logs. But there, there's also an
> additional problem: if there isn't any character-mapping issue, we just
> totally ignore text encoding concerns and log into the log files in whatever
> encoding the client asked the backend to use. So log files can be a
> line-by-line mix of UTF-8, ISO-8859-1, and whatever other fun encodings
> someone asks for. There is *no* way to correctly read such a file, since
> lines don't carry any marking as to their encoding, and no tools out there
> support line-by-line differently encoded text files anyway.

Cyrillic messages with such a conversion failure look like a
series of '?' characters delimited by spaces. The same occurs for Japanese
(or CJK scripts generally), which contain
(almost) no characters compatible with ASCII. We are
sometimes obliged to count the '?'s to identify messages
like the following:

> $ LANG=C postgres
> ?:  ??? ??  ?: 2016-07-28 14:08:32 JST
> ?:  ?? ?? ?  ?? 
> ?:  ??? ?? ?? ? ???
> ?:  ??? ??? ??? ??

> I'm not sure how closely it ties in to the issue you mention, but I think
> it's at least related enough to keep in mind while considering the
> client_encoding issue.

The issue this thread is about is a failure of character-set
conversion performed by backend code; the other is
gettext(3)'s behavior according to LC_CTYPE.

I think that data in tables *must* follow the specified encoding
and should raise an error for incompatible characters, but I
don't think the same applies to messages from PostgreSQL.

We Japanese already see such log messages at the very
beginning of postmaster startup.

> LOG:  データベースシステムは 2016-07-28 14:14:06 JST にシャットダウンしました
> LOG:  MultiXact member wraparound protections are now enabled
> LOG:  データベースシステムの接続受付準備が整いました。

The reason the second line is in English is that it simply has no
corresponding translation in ja.po. That is far more acceptable than the
sequence of question marks shown above.

> I suggest (3) "log the message with unmappable characters masked". Though I
> would definitely like to be able to also send the raw original, along with
> a field indicating the encoding of the original (since it won't be the
> client_encoding), as we need some way to get to the info.

So, I don't think (3) would do much for these
languages. I prefer (1) for this issue. Putting aside the log
issue, PostgreSQL's error system already does a very similar
thing in err_sendstring for the error-recursion case.

It seems possible to add a silent fallback for conversion
failures there.
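A minimal sketch of that fallback, assuming it would hook in near the
existing conversion call (the function name, placement, and error handling
here are hypothetical, not an actual patch):

```
/*
 * Hypothetical fallback: try the normal server-to-client encoding
 * conversion; if it fails, swallow the error and send the raw bytes
 * in the server encoding instead of failing the whole message.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static char *
convert_or_fallback(const char *msg, int len)
{
    char       *converted;

    PG_TRY();
    {
        converted = pg_server_to_client(msg, len);
    }
    PG_CATCH();
    {
        /* conversion failed: discard the error, fall back to raw bytes */
        FlushErrorState();
        converted = pstrdup(msg);
    }
    PG_END_TRY();

    return converted;
}
```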

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers