[HACKERS] Quorum commit for multiple synchronous replication.
Hi all, In the 9.6 development cycle we discussed the configuration syntax for a long time with future expansion in mind. As a result we got a new dedicated language for multiple synchronous replication, but it supports only the priority method. We know that quorum commit is very useful for many users, and the dedicated language can easily be extended to support it, so I'd like to propose quorum commit for multiple synchronous replication here. The attached patches make the following changes:

- Add new syntax 'Any N ( node1, node2, ... )' to synchronous_standby_names for quorum commit.
- In quorum commit, the master can return commit to the client after receiving ACK from *at least* any N servers of the listed standbys.
- sync_priority of all listed servers is the same, 1.
- Add a regression test for quorum commit.

I was thinking that the syntax for the quorum method would use '[ ... ]', but that could be confused with the '( ... )' used by the priority method. The 001 patch adds the 'Any N ( ... )' style syntax, but we might still need to discuss a better syntax; discussion is very welcome. Attached is a draft patch; please give me feedback.

Regards,

--
Masahiko Sawada

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 67249d8..0ce5399 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -76,9 +76,9 @@ char *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
@@ -89,7 +89,12 @@ static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 					   XLogRecPtr *flushPtr,
 					   XLogRecPtr *applyPtr,
 					   bool *am_sync);
+static bool SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr,
+					   XLogRecPtr *flushPtr,
+					   XLogRecPtr *applyPtr,
+					   int pos, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -391,7 +396,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -418,11 +423,16 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+												&applyPtr, &am_sync);
+	else	/* SYNC_REP_QUORUM */
+		got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+												 &applyPtr, SyncRepConfig->num_sync,
+												 &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -440,7 +450,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,6 +486,88 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
+ * Calculate the 'pos' newest Write, Flush and Apply positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the 'pos' newest positions into *writePtr, *flushPtr, *applyPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * a sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+							XLogRecPtr *applyPtr, int pos, bool *am_sync)
+{
+	XLogRecPtr *write_array;
+	XLogRecPtr *flush_array;
+	XLogRecPtr *apply_array;
+	List	   *sync_standbys;
+	ListCell   *cell;
+	int			len;
+	int			i = 0;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*applyPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+
+	/*
+	 * Quick exit if we are not managing a sync standby or there are not
+	 * enough synchronous standbys.
+	 */
+	if (!(*am_sync) ||
+		SyncRepConfig == NULL ||
+		list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		list_free(sync_standbys);
+		return false;
+	}
+
+	len = list_length(sync_standbys);
+	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	flush_array =
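To make the proposed semantics concrete, here is a minimal sketch of the setting (the keyword spelling and standby names follow the patch's proposed grammar, not any committed syntax):

    # postgresql.conf: commit returns once ACKs arrive from any 2 of the 3 standbys
    synchronous_standby_names = 'Any 2 (s1, s2, s3)'

By contrast, the existing priority form '2 (s1, s2, s3)' waits specifically for the two highest-priority connected standbys in the list.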
Re: [HACKERS] Why we lost Uber as a user
On 03/08/16 02:27, Robert Haas wrote:
> Personally, I think that incremental surgery on our current heap format to try to fix this is not going to get very far. If you look at the history of this, 8.3 was a huge release for timely cleanup of dead tuples. There was also significant progress in 8.4 as a result of 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we then made no progress at all in 9.0 - 9.4. We made a very small improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but that's pretty niche. In 9.6, we have "snapshot too old", which I'd argue is potentially a large improvement, but it was big and invasive and will no doubt pose code maintenance hazards in the years to come; also, many people won't be able to use it or won't realize that they should use it. I think it is likely that further incremental improvements here will be quite hard to find, and the amount of effort will be large relative to the amount of benefit. I think we need a new storage format where the bloat is cleanly separated from the data rather than intermingled with it; every other major RDBMS works that way. Perhaps this is a case of "the grass is greener on the other side of the fence", but I don't think so.

Yeah, I think this is a good summary of the state of play. The only other new db development to use a non-overwriting design like ours that I know of was Jim Starkey's Falcon engine for (ironically) MySQL 6.0. Not sure if anyone is still progressing that at all now.

I do wonder if Uber could have successfully tamed dead tuple bloat with aggressive per-table autovacuum settings (and if in fact they tried), but as I think Robert said earlier, it is pretty easy to come up with a high-update (or insert + delete) workload that makes for a pretty ugly bloat component even with really aggressive autovacuuming.

Cheers

Mark
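For reference, per-table tuning of the sort Mark mentions might look like this (the table name and numbers are purely illustrative; the right values are workload-dependent):

    -- hypothetical aggressive per-table autovacuum settings
    ALTER TABLE hot_table SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- trigger at ~1% dead tuples
        autovacuum_vacuum_threshold = 1000,
        autovacuum_vacuum_cost_delay = 0        -- don't throttle the worker
    );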
Re: [HACKERS] regression test for extended query protocol
[Sorry for a top post. For some unknown reason I couldn't get mails from pgsql-hackers since this morning and had to check the archive.]

>> In my understanding, we don't have any regression test for protocol
>> level prepared query (we do have SQL level prepared query tests,
>> though). Shouldn't we add those tests to the regression test suites?
>
> I thought that ECPG was covering a portion of that. Right?
> --
> Michael

In my understanding, ECPG calls libpq, thus the test cases are limited to the cases which are only possible with libpq (or it may be even limited to the cases which ECPG deals with).

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
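For what it's worth, a protocol-level prepared query is reachable from plain libpq, so a test could exercise something like this (a minimal sketch with invented statement name and query; error handling omitted):

    /* Parse, then Bind/Execute, without any SQL-level PREPARE/EXECUTE */
    const char *values[1] = {"42"};
    PGresult   *res;

    res = PQprepare(conn, "tst1", "SELECT $1::int", 1, NULL);     /* Parse message */
    PQclear(res);
    res = PQexecPrepared(conn, "tst1", 1, values, NULL, NULL, 0); /* Bind + Execute */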
Re: [HACKERS] Slowness of extended protocol
On Sun, Jul 31, 2016 at 05:57:12PM -0400, Tom Lane wrote:
> In hindsight it seems clear that what a lot of apps want out of extended
> protocol is only the ability to send parameter values out-of-line instead
> of having to quote/escape them into SQL literals. Maybe an idea for the
> fabled V4 protocol update is some compromise query type that corresponds
> precisely to PQexecParams's feature set: you can send parameter values
> out-of-line, and you can specify text or binary results, but there's no
> notion of any persistent state being created and no feedback about
> parameter data types.

Do you want this on the TODO list?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
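For readers who haven't used it, the feature set Tom refers to is what PQexecParams already exposes on the client side (a sketch; the table and values are illustrative):

    /* out-of-line parameter, no persistent server-side state created */
    const char *values[1] = {"123"};
    PGresult   *res = PQexecParams(conn,
                                   "SELECT * FROM t WHERE id = $1",
                                   1,      /* one parameter */
                                   NULL,   /* let the server infer its type */
                                   values,
                                   NULL,   /* text-format params need no lengths */
                                   NULL,   /* all params in text format */
                                   0);     /* ask for text-format results */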
Re: [HACKERS] regression test for extended query protocol
On Wed, Aug 3, 2016 at 11:33 AM, Tatsuo Ishii wrote:
> In my understanding, we don't have any regression test for protocol
> level prepared query (we do have SQL level prepared query tests,
> though). Shouldn't we add those tests to the regression test suites?

I thought that ECPG was covering a portion of that. Right?
--
Michael
[HACKERS] regression test for extended query protocol
In my understanding, we don't have any regression test for protocol level prepared query (we do have SQL level prepared query tests, though). Shouldn't we add those tests to the regression test suites?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Why we lost Uber as a user
On Tue, Aug 2, 2016 at 07:30:22PM -0700, Alfred Perlstein wrote:
> So for instance, let's say there is a bug in the master's write to disk.
> The logical replication acts as a barrier from that bad write going to the
> slaves. With bad writes going to slaves then any corruption experienced on
> the master will quickly reach the slaves and they too will be corrupted.
>
> With logical replication a bug may be stopped at the replication layer. At
> that point you can resync the slave from the master.
>
> Now in the case of physical replication all your base are belong to zuul and
> you are in a very bad state.
>
> That said with logical replication, who's to say that if the statement is
> replicated to a slave that the slave won't experience the same bug and also
> corrupt itself.
>
> We may be saying the same thing, but still there is something to be said for
> logical replication... also, didn't they show that logical replication was
> faster for some use cases at Uber?

I saw from the Uber article that they were moving not to per-row logical replication but to _statement_ replication, which is very hard to do because typical SQL doesn't record which concurrent transactions committed before a new statement's transaction snapshot is taken, and doesn't record lock order for row updates blocked by concurrent activity --- both of which affect the final result of the query. So, for statement replication, it is not a question of whether the code has bugs, but that the replay is not 100% possible in all cases, unless you switch to some statement-row-lock hybrid ability.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
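A tiny illustration of the lock-order point Bruce raises (hypothetical table t with one row, id = 1 and n = 10):

    -- session A:
    UPDATE t SET n = n * 2  WHERE id = 1;
    -- session B, concurrently, blocked on the same row:
    UPDATE t SET n = n + 10 WHERE id = 1;

If A commits first, n ends up 30; if B wins the row lock first, n ends up 40. A statement-level replica that replays the two statements in the other order silently diverges from the master, which is why the raw SQL text alone cannot guarantee faithful replay.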
Re: [HACKERS] Why we lost Uber as a user
On 8/2/16 2:14 PM, Tom Lane wrote:
> Stephen Frost writes:
>> With physical replication, there is the concern that a bug in *just* the
>> physical (WAL) side of things could cause corruption.
>
> Right. But with logical replication, there's the same risk that the
> master's state could be fine but a replication bug creates corruption on
> the slave. Assuming that the logical replication works by issuing valid
> SQL commands to the slave, one could hope that this sort of "corruption"
> only extends to having valid data on the slave that fails to match the
> master. But that's still not a good state to be in. And to the extent
> that performance concerns lead the implementation to bypass some levels
> of the SQL engine, you can easily lose that guarantee too.
>
> In short, I think Uber's position that logical replication is somehow
> more reliable than physical is just wishful thinking. If anything, my
> money would be on the other way around: there's a lot less mechanism
> that can go wrong in physical replication. Which is not to say there
> aren't good reasons to use logical replication; I just do not believe
> that one.
>
> regards, tom lane

The reason it can be less catastrophic is that with logical replication you may futz up your data, but you are safe from corrupting your entire db. Meaning if an update is missed or doubled, that may be addressed by a fixup SQL stmt; however, if the replication causes a write to the entirely wrong place in the db file, then you need to "fsck" your db and hope that nothing super critical was blown away. The impact across a cluster is potentially magnified by physical replication.

So for instance, let's say there is a bug in the master's write to disk. The logical replication acts as a barrier from that bad write going to the slaves. With bad writes going to slaves, any corruption experienced on the master will quickly reach the slaves and they too will be corrupted.

With logical replication a bug may be stopped at the replication layer. At that point you can resync the slave from the master.

Now in the case of physical replication all your base are belong to zuul and you are in a very bad state.

That said, with logical replication, who's to say that if the statement is replicated to a slave the slave won't experience the same bug and also corrupt itself.

We may be saying the same thing, but still there is something to be said for logical replication... also, didn't they show that logical replication was faster for some use cases at Uber?

-Alfred
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
On Wed, Aug 3, 2016 at 7:21 AM, Alvaro Herrera wrote:
> Michael Paquier wrote:
>
>> Here using pg_xlog_replay_resume() is not the correct solution because
>> this would cause the node to finish recovery before we want it to, and
>> so is recovery_target_action = 'promote'. If we look at the test, it
>> is doing the following when getting the TXID that is used as recovery
>> target:
>> $node_master->safe_psql('postgres',
>> 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
>> my $recovery_txid =
>> 	$node_master->safe_psql('postgres', "SELECT txid_current()");
>> my $lsn2 =
>> 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
>> What I think we had better do is reverse the calls
>> pg_current_xlog_location() and txid_current() so as we are sure that
>> the LSN we track for replay is lower than the real target LSN. The
>> same problem exists when defining the timestamp target.
>>
>> The patch attached does that,
>
> Why not capture both items in a single select, such as in the attached
> patch?

Let me test this [... A while after ...] This looks to work properly. 12 runs in a row have passed.
--
Michael
Re: [HACKERS] Slowness of extended protocol
Doesn't this patch break an existing behavior of unnamed statements? That is, an unnamed statement shall exist until the next Parse message using the unnamed statement is received. It is possible to use the same unnamed statement multiple times in a transaction.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
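A minimal libpq sketch of the behavior Ishii describes (purely illustrative): the empty statement name selects the unnamed statement, and today it can be executed repeatedly until the next Parse replaces it:

    const char *values[1] = {"7"};
    PGresult   *res;

    res = PQprepare(conn, "", "SELECT $1::int", 1, NULL);      /* Parse into unnamed stmt */
    PQclear(res);
    res = PQexecPrepared(conn, "", 1, values, NULL, NULL, 0);  /* first use */
    PQclear(res);
    res = PQexecPrepared(conn, "", 1, values, NULL, NULL, 0);  /* reuse must still work */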
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 5:05 PM, Robert Haas wrote:
> I think that's just making life difficult. If nothing else, sqlsmith
> hunts around for functions it can call that return internal errors,
> and if we refuse to fix all of them to return user-facing errors, then
> it's just crap for the people running sqlsmith to sift through and
> it's a judgment call whether to fix each particular case. Even aside
> from that, I think it's much better to have a clear and unambiguous
> rule that elog is only for can't-happen things, not
> we-don't-recommend-it things.

+1. This also has value in the context of automatically surfacing situations where "can't happen" errors do in fact happen at scale.

--
Peter Geoghegan
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 7:16 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar wrote:
>>> There are many more such exposed functions, which can throw cache lookup
>>> failure error if we pass wrong value.
>>>
>>> i.e.
>>> record_in
>>> domain_in
>>> fmgr_c_validator
>>> plpgsql_validator
>>> pg_procedure_is_visible
>>>
>>> Are we planning to change these also..
>
>> I think if they are SQL-callable functions that users can invoke from
>> a SQL prompt, they shouldn't throw "cache lookup failed" errors or,
>> for that matter, any other error that is reported with elog().
>
> FWIW, I would restrict that to "functions that users are *intended* to
> call from a SQL prompt". If you are fooling around calling internal
> functions with garbage input, you should not be surprised to get an
> internal error back. So of the above list, only pg_procedure_is_visible
> seems particularly worth changing to me. And for it, returning NULL
> (ie, "unknown") seems reasonable enough.

I think that's just making life difficult. If nothing else, sqlsmith hunts around for functions it can call that return internal errors, and if we refuse to fix all of them to return user-facing errors, then it's just crap for the people running sqlsmith to sift through, and it's a judgment call whether to fix each particular case. Even aside from that, I think it's much better to have a clear and unambiguous rule that elog is only for can't-happen things, not we-don't-recommend-it things.

> There's no such animal as pg_procedure_is_visible. I suspect that's an
> EDB-ism that hasn't caught up with commit 66bb74dbe8206a35.

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
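To illustrate the rule being argued for, here is a sketch of the two error-reporting conventions (the specific checks and OID variable are illustrative, adapted from the style used throughout the backend):

    /* internal, can't-happen condition: elog() is appropriate */
    tup = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for function %u", funcid);

    /* user-reachable condition: ereport() with a proper SQLSTATE */
    if (!OidIsValid(funcid))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("function with OID %u does not exist", funcid)));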
Re: [HACKERS] Changed SRF in targetlist handling
On 2016-08-02 19:02:38 -0400, Tom Lane wrote:
> Andres Freund writes:
>> I've an implementation that
>
>> 1) turns all targetlist SRFs (tSRF from now on) into ROWS FROM
>>    expressions. If there's a tSRF in the argument of a tSRF, that becomes
>>    a separate, lateral, ROWS FROM expression.
>
>> 2) If grouping/window functions are present, the entire query is wrapped
>>    in a subquery RTE, except for the set-returning function. All
>>    referenced Var|Aggref|GroupingFunc|WindowFunc|Param nodes in the
>>    original targetlist are made to reference that subquery, which gets a
>>    TargetEntry for them.
>
> FWIW, I'd be inclined to do the subquery RTE all the time,

Yea, that's what I ended up doing.

> adding some optimization fence to ensure it doesn't get folded back.
> That fixes your problem here:
>> So far I have one problem without an easy solution: Historically queries
>> like
>> =# SELECT id, generate_series(1,2) FROM (VALUES(1),(2)) few(id);
>> ┌────┬─────────────────┐
>> │ id │ generate_series │
>> ├────┼─────────────────┤
>> │  1 │               1 │
>> │  1 │               2 │
>> │  2 │               1 │
>> │  2 │               2 │
>> └────┴─────────────────┘
>> have preserved the SRF output ordering. But by turning the SRF into a
>> ROWS FROM, there's no guarantee that the cross join between "few" and
>> generate_series(1,3) above is implemented in that order.

But I don't see how that fixes the above problem? The join, on the top level because of aggregates, can still be implemented as subquery join srf or as srf join subquery, with the different output order that implies. I've duct-taped together a solution for that, by forcing the lateral machinery to always see a dependency from the SRF to the subquery; but that probably needs a nicer fix than a RangeTblEntry->deps field which is processed in extract_lateral_references() ;)

>> Besides that I'm structurally wondering whether turning the original
>> query into a subquery is the right thing to do. It requires some kind of
>> ugly munching of Query->*, and has the above problem.
>
> It does not seem like it should be that hard, certainly no worse than
> subquery pullup. Want to show code?

It's not super hard; there's some stuff like pushing/not-pushing various sortgrouprefs to the subquery. But I think we can live with it. Let me clean up the code some; I hope to have something today or tomorrow.

>> An alternative approach would be to do this during parse-analysis, but I
>> think that might end up being confusing, because stored rules would
>> suddenly have a noticeably different structure, and it'd tie us a lot
>> more to the details of that transformation than I'd like.
>
> -1 on that; we do not want this transformation visible in stored rules.

Agreed.

>> Besides the removal of the least-common-multiple behaviour of tSRF queries,
>> there's some other consequences that using function scans have:
>> Previously if a tSRF was never evaluated, it didn't cause the number of
>> rows from being increased. E.g.
>> SELECT id, COALESCE(1, generate_series(1,2)) FROM (VALUES(1),(2)) few(id);
>> only produced two rows. But using joins means that a simple
>> implementation of using ROWS FROM returns four rows.
>
> Hmm. I don't mind changing behavior in that sort of corner case.
> If we're prepared to discard the LCM behavior, this seems at least
> an order of magnitude less likely to be worth worrying about.

I think it's fine, and potentially less confusing.

> Would it be worth detecting SRFs below CASE/COALESCE/etc and throwing
> an error? It would be easier to sell throwing an error than silently
> changing behavior, I think.

Hm. We could, but I think the new behaviour would actually make sense in the long run. Interpreting the coalesce to run on the output of the SRF doesn't seem bad to me.

I found another edge case which we need to make a decision about. 'record'-returning SRFs can't be transformed easily into a ROWS FROM. Consider e.g. the following from the regression tests:

create function array_to_set(anyarray) returns setof record as $$
  select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
$$ language sql strict immutable;

select array_to_set(array['one', 'two']);
┌──────────────┐
│ array_to_set │
├──────────────┤
│ (1,one)      │
│ (2,two)      │
└──────────────┘
(2 rows)

which currently works. That can't be modeled as ROWS FROM() directly, because that desperately wants to return the columns as columns, which we can't do for 'record'-returning things, because they don't have defined columns. For composite-returning SRFs I've currently implemented that by generating a ROWS() expression, but that doesn't work for record. So it seems we need some, not necessarily user-exposed, way of making nodeFunctionscan.c return the return value as one datum. One way, as suggested by Andrew G. on IRC, was to interpret empty column
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Robert Haas writes:
> On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar wrote:
>> There are many more such exposed functions, which can throw cache lookup
>> failure error if we pass wrong value.
>>
>> i.e.
>> record_in
>> domain_in
>> fmgr_c_validator
>> plpgsql_validator
>> pg_procedure_is_visible
>>
>> Are we planning to change these also..

> I think if they are SQL-callable functions that users can invoke from
> a SQL prompt, they shouldn't throw "cache lookup failed" errors or,
> for that matter, any other error that is reported with elog().

FWIW, I would restrict that to "functions that users are *intended* to call from a SQL prompt". If you are fooling around calling internal functions with garbage input, you should not be surprised to get an internal error back. So of the above list, only pg_procedure_is_visible seems particularly worth changing to me. And for it, returning NULL (ie, "unknown") seems reasonable enough.

Actually, it looks to me like we already fixed that for the built-in is_visible functions:

    # select pg_function_is_visible(0) is null;
     ?column?
    ----------
     t
    (1 row)

There's no such animal as pg_procedure_is_visible. I suspect that's an EDB-ism that hasn't caught up with commit 66bb74dbe8206a35.

			regards, tom lane
Re: [HACKERS] Changed SRF in targetlist handling
Andres Freund writes:
> I've an implementation that

> 1) turns all targetlist SRFs (tSRF from now on) into ROWS FROM
>    expressions. If there's a tSRF in the argument of a tSRF, that becomes
>    a separate, lateral, ROWS FROM expression.

> 2) If grouping/window functions are present, the entire query is wrapped
>    in a subquery RTE, except for the set-returning function. All
>    referenced Var|Aggref|GroupingFunc|WindowFunc|Param nodes in the
>    original targetlist are made to reference that subquery, which gets a
>    TargetEntry for them.

FWIW, I'd be inclined to do the subquery RTE all the time, adding some optimization fence to ensure it doesn't get folded back. That fixes your problem here:

> So far I have one problem without an easy solution: Historically queries
> like
> =# SELECT id, generate_series(1,2) FROM (VALUES(1),(2)) few(id);
> ┌────┬─────────────────┐
> │ id │ generate_series │
> ├────┼─────────────────┤
> │  1 │               1 │
> │  1 │               2 │
> │  2 │               1 │
> │  2 │               2 │
> └────┴─────────────────┘
> have preserved the SRF output ordering. But by turning the SRF into a
> ROWS FROM, there's no guarantee that the cross join between "few" and
> generate_series(1,3) above is implemented in that order.

> Besides that I'm structurally wondering whether turning the original
> query into a subquery is the right thing to do. It requires some kind of
> ugly munching of Query->*, and has the above problem.

It does not seem like it should be that hard, certainly no worse than subquery pullup. Want to show code?

> An alternative approach would be to do this during parse-analysis, but I
> think that might end up being confusing, because stored rules would
> suddenly have a noticeably different structure, and it'd tie us a lot
> more to the details of that transformation than I'd like.

-1 on that; we do not want this transformation visible in stored rules.

> Besides the removal of the least-common-multiple behaviour of tSRF queries,
> there's some other consequences that using function scans have:
> Previously if a tSRF was never evaluated, it didn't cause the number of
> rows from being increased. E.g.
> SELECT id, COALESCE(1, generate_series(1,2)) FROM (VALUES(1),(2)) few(id);
> only produced two rows. But using joins means that a simple
> implementation of using ROWS FROM returns four rows.

Hmm. I don't mind changing behavior in that sort of corner case. If we're prepared to discard the LCM behavior, this seems at least an order of magnitude less likely to be worth worrying about.

Having said that, I do seem to recall a bug report about misbehavior when a SRF was present in just one arm of a CASE statement. That would have the same type of behavior as you describe here, and evidently there's at least one person out there depending on it. Would it be worth detecting SRFs below CASE/COALESCE/etc and throwing an error? It would be easier to sell throwing an error than silently changing behavior, I think.

			regards, tom lane
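For readers following along, the transformation under discussion would turn the query Andres shows above into something roughly like this (a sketch of the intended plan shape, not the literal rewritten parse tree):

    SELECT few.id, gs.g
    FROM (VALUES (1),(2)) few(id),
         LATERAL ROWS FROM (generate_series(1,2)) gs(g);

The ordering question is then whether the join is free to produce gs-major rather than few-major row order, which is exactly what the historical tSRF behavior forbade.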
Re: [HACKERS] Double invocation of InitPostmasterChild in bgworker with -DEXEC_BACKEND
Thomas Munro writes:
> I discovered that if you build with -DEXEC_BACKEND on a Unix system
> and then try to start a background worker, it dies in
> InitializeLatchSupport:
> TRAP: FailedAssertion("!(selfpipe_readfd == -1)", File: "latch.c", Line: 161)
> That's because InitPostmasterChild is called twice. I can
> successfully use regular parallel query workers and bgworkers created
> by extensions if I apply the attached patch.

Confirmed, fix pushed. I wonder if we should have a buildfarm member running the Unix EXEC_BACKEND code ...

			regards, tom lane
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
Michael Paquier wrote:

> Here using pg_xlog_replay_resume() is not the correct solution because
> this would cause the node to finish recovery before we want it to, and
> so is recovery_target_action = 'promote'. If we look at the test, it
> is doing the following when getting the TXID that is used as recovery
> target:
> $node_master->safe_psql('postgres',
> 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
> my $recovery_txid =
> 	$node_master->safe_psql('postgres', "SELECT txid_current()");
> my $lsn2 =
> 	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
> What I think we had better do is reverse the calls
> pg_current_xlog_location() and txid_current() so as we are sure that
> the LSN we track for replay is lower than the real target LSN. The
> same problem exists when defining the timestamp target.
>
> The patch attached does that,

Why not capture both items in a single select, such as in the attached patch?

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 20b878e..3864e60 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -66,17 +66,16 @@ $node_master->backup('my_backup');
 # target TXID.
 $node_master->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-my $recovery_txid =
-  $node_master->safe_psql('postgres', "SELECT txid_current()");
-my $lsn2 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+my $ret =
+  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), txid_current();");
+my ($lsn2, $recovery_txid) = split /\|/, $ret;
 
 # More data, with recovery target timestamp
 $node_master->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(2001,3000))");
-my $recovery_time = $node_master->safe_psql('postgres', "SELECT now()");
-my $lsn3 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+$ret =
+  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location(), now();");
+my ($lsn3, $recovery_time) = split /\|/, $ret;
 
 # Even more data, this time with a recovery target name
 $node_master->safe_psql('postgres',
Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database
Moving to -hackers since this is getting into details. Bug Report #14247.

On Tue, Aug 2, 2016 at 4:58 PM, Tom Lane wrote:
> "David G. Johnston" writes:
>> Do you have an opinion on the following?
>
> I think the real problem in this area is that the division of labor
> we have between pg_dump and pg_dumpall isn't very good for real-world
> use cases.

I won't disagree here.

> I'm not at all sure what a better answer would look like.
> But I'd rather see us take two steps back and consider the issue
> holistically instead of trying to band-aid individual misbehaviors.
>
> The fact that pg_dump is emitting COMMENT ON DATABASE at all is
> fundamentally wrong given the existing division-of-labor decisions,
> namely that pg_dump is responsible for objects within a database,
> not for database-level properties.

But I have to disagree with this in the presence of the --create flag.

> It's entirely possible that some feature like COMMENT ON CURRENT DATABASE
> would be a necessary component of a holistic solution,
> [ various other ON CURRENT commands elided ]

In reading the code for the public schema bug, I never fully got my head around how dump/restore interacts with multiple databases. Specifically, in RestoreArchive, why we basically continue instead of breaking once we've encountered the "DATABASE" entry and decided that we need to drop the DB due to (createDB && dropSchema).

OTOH, seeing the code for the public schema exception, I'm inclined to think we could add something like the following just after the public schema block just patched:

/*
 * If the user didn't request that a new database be created, skip emitting
 * commands targeting the current database, as it may not have the same
 * name as the database being restored into.
 *
 * XXX This too is ugly, but the treatment of globals is not presently
 * amenable to pretty solutions.
 */
if (!ropt->createDB)
{
    if (strcmp(te->desc, "DATABASE") == 0)  /* this is quite possibly too broad a
                                             * check... I'm always concerned about
                                             * false positives in cases like these */
        return;
}

Though reading it now, that seems a bit like throwing out the baby with the bath water; but then again, if they are not requesting that a database be created and they happen to have a database of the same name already in place, it would be unexpected that it would not have been properly configured.

Assuming I'm not missing something here, it seems like an easy and internally consistent hack/fix for a rare but real concern. However, since the existing code doesn't provoke an error, it doesn't seem like something that should be back-patched (I'm not convinced either way though).

David J.
Re: [HACKERS] Why we lost Uber as a user
Stephen Frost writes:
> With physical replication, there is the concern that a bug in *just* the
> physical (WAL) side of things could cause corruption.

Right. But with logical replication, there's the same risk that the master's state could be fine but a replication bug creates corruption on the slave. Assuming that the logical replication works by issuing valid SQL commands to the slave, one could hope that this sort of "corruption" only extends to having valid data on the slave that fails to match the master. But that's still not a good state to be in. And to the extent that performance concerns lead the implementation to bypass some levels of the SQL engine, you can easily lose that guarantee too.

In short, I think Uber's position that logical replication is somehow more reliable than physical is just wishful thinking. If anything, my money would be on the other way around: there's a lot less mechanism that can go wrong in physical replication. Which is not to say there aren't good reasons to use logical replication; I just do not believe that one.

			regards, tom lane
Re: [HACKERS] parallel.c is not marked as test covered
On Tue, Aug 2, 2016 at 1:17 PM, Peter Eisentraut wrote:
> On 6/19/16 10:00 PM, Robert Haas wrote:
>>> Independent of that, it would help testing things like this if we allowed
>>> setting max_worker_processes to 0, instead of the current minimum 1. Is
>>> there a reason for the minimum of 1?
>> I believe that's pure brain fade on my part. I think the minimum should be 0.
>
> Fixed.

Thank you.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement
On Thu, Jul 28, 2016 at 7:29 PM, Andres Freund wrote:
> I think just iterating through the active snapshots would have been
> fine. Afaics there's no guarantee that the first active snapshot pushed
> is the relevant one - in contrast to registered ones, which are ordered
> by virtue of the heap.

After a bit of scrutiny, I don't think this is a problem, and I don't want to introduce belt-and-suspenders protections against can't-happen conditions that may get cargo-culted into other places. (How's that for cramming as many Tom-Lane-isms as possible into one sentence? More might be done, but it would be some sort of bespoke crock of the first water.)

Anyway, the reason I think it's OK is that PushActiveSnapshot() does not copy the input snapshot unless it's CurrentSnapshot or SecondarySnapshot -- so every snapshot that gets pushed on the active snapshot stack is either one that's already registered or one that was just taken (because CurrentSnapshot will get overwritten on the next call to GetTransactionSnapshot). In the former case, it's included in the registered snapshots, so it doesn't even matter whether we notice that it is also on the active snapshot stack. In the latter case, having just been taken, it must be newer than the oldest registered snapshot, which was necessarily taken earlier.

I think that the only time it matters whether we even look at the active snapshot stack is when there are no registered snapshots whatsoever. In many cases -- like QueryDescs -- we register the snapshot first and then later put it on the active snapshot stack, but there are some things like CLUSTER that only make the snapshot active and don't register it. But if they do that, they can only do it in first-in, first-out fashion.

>> >> But there's a bit of spooky action at a
>> >> distance: if we don't see any snapshots, then we have to assume that
>> >> the scan is being performed with a non-MVCC snapshot and therefore we
>> >> don't need to worry. But there's no way to verify that via an
>> >> assertion, because the connection between the relevant MVCC snapshot
>> >> and the corresponding TOAST snapshot is pretty indirect, mediated only
>> >> by snapmgr.c.
>> >
>> > Hm. Could we perhaps assert that the session has a valid xmin?
>>
>> I don't think so. CLUSTER?
>
> That should have one during any toast lookups afaics - the relevant code is
>
> 	/* Start a new transaction for each relation. */
> 	StartTransactionCommand();
> 	/* functions in indexes may want a snapshot set */
> 	PushActiveSnapshot(GetTransactionSnapshot());
> 	/* Do the job. */
> 	cluster_rel(rvtc->tableOid, rvtc->indexOid, true,
> 				stmt->verbose);
> 	PopActiveSnapshot();
> 	CommitTransactionCommand();
>
> right? And
>
> 	Snapshot
> 	GetSnapshotData(Snapshot snapshot)
> 	{
> 	...
> 		if (!TransactionIdIsValid(MyPgXact->xmin))
> 			MyPgXact->xmin = TransactionXmin = xmin;
>
> sets xmin.

Yeah. Actually, consistent with the above, I discovered that as long as we consult both the active snapshot stack and the pairing heap of registered snapshots, it seems to be fine to just insist that we always get a snapshot back from the snapmgr and use that to initialize the TOAST snapshot. So I did it that way in the attached version.

New patch attached. Barring objections, I will commit this tomorrow. If there are objections, I will send another status update tomorrow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

ost-heap-toast-disagreement-v3.patch (Description: application/download)
Re: [HACKERS] PostgreSQL 10 kick-off
Peter Eisentraut wrote:
> On 8/1/16 1:08 PM, Fabrízio de Royes Mello wrote:
>> What knowledge is expected for a CFM? I would really like to help and
>> also learn more about our development.
>
> If you search the wiki for "commitfest" you will find several pages of
> varying outdatedness that describe the process.

I have cleaned up the wiki a bit, so the authoritative info should be at https://wiki.postgresql.org/wiki/CommitFest_Checklist

Note that what that page describes is not exactly what we do nowadays; if someone wants to edit that page to better reflect reality, be my guest. Also see the old (2009) page where an older workflow was described: https://wiki.postgresql.org/index.php?title=Running_a_CommitFest&oldid=9369

Perhaps it'd be good to resurrect some of that content, in modernized form, into the current page.

Thanks

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] HandleParallelMessages contains CHECK_FOR_INTERRUPTS?
I wrote:
> Alvaro Herrera writes:
>> I notice you just removed the CHECK_FOR_INTERRUPTS in
>> HandleParallelMessages(). Did you notice that HandleParallelMessages
>> calls shm_mq_receive(), which calls shm_mq_receive_bytes(), which
>> contains a CHECK_FOR_INTERRUPTS() call?

After study, I believe that that CHECK_FOR_INTERRUPTS() is unreachable given that HandleParallelMessages passes nowait = true. But it's not unlikely that future changes in shm_mq.c might introduce such calls that are reachable.

> I wonder whether we should make use of HOLD_INTERRUPTS/RESUME_INTERRUPTS
> to avoid the recursion scenario here.

I concluded that that would be good future-proofing, whether or not it's strictly necessary today, so I pushed it.

			regards, tom lane
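For context, the pattern in question looks roughly like this (a sketch; the real code lives in HandleParallelMessages):

    /* prevent CHECK_FOR_INTERRUPTS() from recursing into message handling */
    HOLD_INTERRUPTS();

    /* ... drain the parallel-worker message queue via shm_mq_receive() ... */

    RESUME_INTERRUPTS();

While InterruptHoldoffCount is nonzero, CHECK_FOR_INTERRUPTS() is effectively a no-op, so any such call buried inside shm_mq.c cannot re-enter the handler.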
Re: [HACKERS] Why we lost Uber as a user
* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein wrote:
>> You are quite technical, my feeling is that you will understand it, however
>> it will need to be a self learned lesson.
>
> I don't know what this is supposed to mean, but I think that Geoff's
> point is somewhat valid. No matter how you replicate data, there is
> always the possibility that you will replicate any corruption along
> with the data - or that your copy will be unfaithful to the original.

I believe what Geoff was specifically getting at is probably best demonstrated with an example. Consider a bug in the btree index code which will accept a value but not store it correctly:

INSERT INTO mytable (indexed_column) VALUES (-10);
/* oops, bug, this value gets stored in the wrong place in the btree */

We happily accept the record and insert it into the btree index, but that insert is incorrect and results in the btree being corrupted, because some bug doesn't handle such values correctly. In such a case, either approach to replication (replicating the query statement, or replicating the changes to the btree page exactly) would result in corruption on the replica. The above represents a bug in *just* the btree side of things (the physical replication did its job correctly, even though the result is a corrupted index on the replica).

With physical replication, there is the concern that a bug in *just* the physical (WAL) side of things could cause corruption. That is, we correctly accept and store the value on the primary, but the records generated to send that data to the replica are incorrect and result in an invalid state on the replica.

Of course, a bug in the physical side of things which caused corruption would mean that *crash recovery* would also cause corruption. As I understand it, that same concern exists for MySQL, so moving to logical replication doesn't actually mean you don't need to worry about bugs in the crash recovery side of things, assuming you depend on the database to come back up in a consistent manner after a crash.

Thanks!

Stephen
Re: [HACKERS] No longer possible to query catalogs for index capabilities?
On Mon, Aug 1, 2016 at 10:19:43AM -0400, Tom Lane wrote:
> As far as I understood Andrew's use case, he was specifically *not*
> interested in a complete representation of an index definition, but
> rather about whether it had certain properties that would be of
> interest to query-constructing applications.

Would it be helpful to output an array of strings representing the index definition?

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On Tue, Aug 2, 2016 at 2:33 PM, Bruce Momjian wrote:
> On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote:
>> Even so, I'd be curious whether it would break anything to have
>> xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero
>> pages written to fill out a segment. At least until it's felt
>> that archive_timeout has been so decidedly obsoleted by streaming
>> replication that it is removed, and the log-tail zeroing code
>> with it.
>>
>> That at least would eliminate the risk of anyone else repeating
>> my astonishment. :) I had read that 9.4 added built-in log-zeroing
>> code, and my first reaction was "cool! that may make the compression
>> technique we're using unnecessary, but certainly can't make it worse"
>> only to discover that it did, by ~ 300x, becoming now 3x *worse* than
>> plain gzip, which itself is ~ 100x worse than what we had.
>
> My guess is that the bytes are there to detect problems where a 512-byte
> disk sector is zeroed by a disk failure. I don't see us changing that
> for the use-case you have described.

Is there actually any code that makes such a check? I'm inclined to doubt that was the motivation, though admittedly we're both speculating about the contents of Heikki's brain, a tricky proposition on a good day.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Why we lost Uber as a user
On Tue, Aug 2, 2016 at 3:07 PM, Alfred Perlstein wrote:
> You are quite technical, my feeling is that you will understand it, however
> it will need to be a self learned lesson.

I don't know what this is supposed to mean, but I think that Geoff's point is somewhat valid. No matter how you replicate data, there is always the possibility that you will replicate any corruption along with the data - or that your copy will be unfaithful to the original. The possible advantage of logical replication rather than physical replication is that any errors you replicate will be logical errors rather than physical errors - so if the heap gets out of step with the indexes on the master, the same problem will not necessarily occur on the slave.

On the flip side, despite what Uber found in their environment, physical replication tends to be high-performance because the replay is dead simple. Andres and others have done a good job making our logical decoding facility fast, but I believe it's still slower than plain old physical replication and probably always will be, and the trigger-based logical replication solutions are slower still.

Consequently, I believe that both physical and logical replication have advantages, and that's why we should support both of them. Then, each individual user can make the trade-offs they prefer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On 08/02/2016 02:33 PM, Bruce Momjian wrote:
> My guess is that the bytes are there to detect problems where
> a 512-byte disk sector is zeroed by a disk failure.

Does that seem plausible? (a) there is only one such header for every 16 512-byte disk sectors, so it only affords a 6% chance of detecting a zeroed sector, and (b) the header contains other non-zero values in fields other than xlp_pageaddr, so the use of a fixed value for _that field_ in zeroed tail blocks would not prevent (or even reduce the 6% probability of) detecting a sector zeroed by a defect.

-Chap
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Karan Sikka writes:
> Just for the record, I'm leaning to the side of "not worth it". My
> thoughts are, if you are at a scale where you care about this 20%
> speedup, you would want to go all the way to an indexed structure,
> because the gains you would realize would exceed 20%, and 20% may not be
> a sufficient speedup anyway.

If I'm reading your test case correctly, 20% is actually a rather impressive number, because it's 20% *overall* gain on queries that will also involve TOAST fetch and decompress on the source data. (Decompress definitely, and I'm guessing those 5K strings don't compress well enough to avoid getting pushed out-of-line; though it might be worth repeating the test with chunks of 10K or 20K to be sure.) So the percentage improvement in the LIKE test proper must have been a lot more than that.

However, I'm dubious that LIKE patterns with long fixed substrings are a common use-case, so I'm afraid that this might be quite a lot of work for something that won't much benefit most users. I'm also worried that the setup costs might be enough to make it a net loss in many cases. There are probably ways to amortize the setup costs, since typical scenarios involve the same LIKE pattern across many rows, but implementing that would add even more work.

(Having said that, I've had a bee in my bonnet for a long time about removing per-row setup cost for repetitive regex matches, and whatever infrastructure that needs would work for this too. And for strpos' B-M-H setup, it looks like. So this might be something to look into with a suitably wide view of what the problem is.)

Not sure what advice to give you here. I think this is in the grey zone where it's hard to be sure whether it's worth putting work into.

			regards, tom lane
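For readers unfamiliar with the algorithm being discussed, here is a minimal Boyer-Moore-Horspool sketch (a standalone illustration, not the strpos implementation in the backend):

    /* Return offset of first occurrence of needle in hay, or -1. */
    static int
    bmh_search(const unsigned char *hay, int haylen,
               const unsigned char *needle, int ndlen)
    {
        int     skip[256];
        int     i;

        if (ndlen == 0 || ndlen > haylen)
            return (ndlen == 0) ? 0 : -1;

        /* The setup cost Tom mentions: one pass to build the skip table. */
        for (i = 0; i < 256; i++)
            skip[i] = ndlen;
        for (i = 0; i < ndlen - 1; i++)
            skip[needle[i]] = ndlen - 1 - i;

        /* Scan, jumping ahead by the skip distance of the window's last byte. */
        for (i = 0; i + ndlen <= haylen; i += skip[hay[i + ndlen - 1]])
        {
            if (memcmp(hay + i, needle, ndlen) == 0)
                return i;
        }
        return -1;
    }

Longer needles allow bigger skips, which matches the benchmark shape above: clear wins on long substrings, little or nothing on short ones.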
Re: [HACKERS] Slowness of extended protocol
> I really don't get what's problematic with posting a message on a mailing
> list about a potential performance issue, to try to get people's reactions,
> without diving into profiling right away

"Benchmark data is a perfect substitute for benchmarking results. Data is easy to misinterpret, so try not to do that." (see [1], and slide 73 of [2])

The key points are:
0) It is extremely easy to take a wrong turn unless you analyze the benchmark results.
1) If you (or someone else) thinks "ok, the original email did meet its goal, as Vladimir did provide a patch with measurements", then I failed. The only reason for me doing the benchmark and patch was to teach you how to do that.
2) Have you seen the recent discussion "TODO item: Implement Boyer-Moore searching in LIKE queries" on the list? It does include relevant details right from the start. (see https://www.postgresql.org/message-id/CALkFZpcbipVJO%3DxVvNQMZ7uLUgHzBn65GdjtBHdeb47QV4XzLw%40mail.gmail.com )

> I'm not a PostgreSQL developer

Neither am I.

> I have other urgent things to do

So do I.

> and don't even spend most of my programming time in C.

Java and SQL cover 99% of my time.

[1]: https://twitter.com/shipilev/status/760387758246486017
[2]: https://shipilev.net/talks/jvmls-July2014-benchmarking.pdf

Vladimir
Re: [HACKERS] Why we lost Uber as a user
> On Aug 2, 2016, at 2:33 AM, Geoff Winkless wrote:
>
>> On 2 August 2016 at 08:11, Alfred Perlstein wrote:
>>> On 7/2/16 4:39 AM, Geoff Winkless wrote:
>>> I maintain that this is a nonsense argument. Especially since (as you
>>> pointed out and as I missed first time around) the bug actually occurred at
>>> different records on different slaves, so he invalidates his own point.
>
>> Seriously?
>
> No, I make a habit of spouting off random arguments to a list full of
> people whose opinions I massively respect purely for kicks. What do
> you think?
>
>> There's a valid point here, you're sending over commands at the block level,
>> effectively "write to disk at this location" versus "update this record
>> based on PK", obviously this has some drawbacks that are reason for concern.
>
> Writing values directly into file offsets is only problematic if
> something else has failed that has caused the file to be an inexact
> copy. If a different bug occurred that caused the primary key to be
> corrupted on the slave (or indeed the master), PK-based updates would
> exhibit similar propagation errors.
>
> To reiterate my point, Uber's described problem came about because of
> a bug. Every software has bugs at some point in its life; to pretend
> otherwise is simply naive. I'm not trying to excuse the bug, or to
> belittle the impact that such a bug has on data integrity or on Uber
> or indeed on the reputation of PostgreSQL. While I'm prepared to
> accept (because I have a job that requires I spend time on things
> other than digging through obscure reddits and mailing lists to
> understand more fully the exact cause) that in _this particular
> instance_ the bug was propagated because of the replication mechanism
> (although I'm still dubious about that, as per my comment above), that
> does _not_ preclude other bugs propagating in a statement-based
> replication. That's what I said is a nonsense argument, and no-one has
> yet explained in what way that's incorrect.
>
> Geoff

Geoff,

You are quite technical; my feeling is that you will understand it, however it will need to be a self-learned lesson.

-Alfred
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera wrote:
>>> How desirable is this feature? To begin answering that question,
>>> I did some initial benchmarking with an English text corpus to find how much
>>> faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results
>>> were as follows: BMH can be up to 20% faster than LIKE, but for short
>>> substrings, it's roughly comparable or slower.
>>>
>>> Here are the results visualized:
>>>
>>> http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
>>> http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
>>
>> Based on these results, this looks to me like a pretty unexciting
>> thing upon which to spend time.
>
> Uh, a 20% difference is "unexciting" for you? I think it's interesting.
> Now, really, users shouldn't be running LIKE on constant strings all the
> time but rather use some sort of indexed search, but once in a while
> there is a need to run some custom query and you need to LIKE-scan a
> large portion of a table. For those cases an algorithm that performs
> 20% better is surely welcome.

Sure, but an algorithm that performs 20% faster in the best case and worse in some other cases is not the same thing as a 20% across-the-board performance improvement. I guess if we had a way of deciding which algorithm to use in particular cases it might make sense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Yeah, you make a fair point. I was hoping more people would chime in with strong opinions to make the decision easier, but perhaps the lack of noise indicates this is a low-priority feature. I will ask around in my coworker circles to see what they think; could you do the same? Just for the record, I'm leaning to the side of "not worth it". My thoughts are: if you are at a scale where you care about this 20% speedup, you would want to go all the way to an indexed structure, because the gains you would realize would exceed 20%, and 20% may not be a sufficient speedup anyway. On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera wrote: > Robert Haas wrote: > > On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka wrote: > > > Following the patch to implement strpos with Boyer-Moore-Horspool, > > > it was proposed we bring BMH to LIKE as well. > > > > > > Original thread: > > > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us > > > > > > I'm a first time hacker and I found this task on the TODO list. It seemed > > > interesting and isolated enough. But after looking at the code in > > > like_match.c, it's clearly a non-trivial task. > > > > > > How desirable is this feature? To begin answering that question, > > > I did some initial benchmarking with an English text corpus to find how much > > > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results > > > were as follows: BMH can be up to 20% faster than LIKE, but for short > > > substrings, it's roughly comparable or slower. > > > > > > Here are the results visualized: > > > > > > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png > > > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png > > > > Based on these results, this looks to me like a pretty unexciting > > thing upon which to spend time. > > Uh, a 20% difference is "unexciting" for you? I think it's interesting. > Now, really, users shouldn't be running LIKE on constant strings all the > time but rather use some sort of indexed search, but once in a while > there is a need to run some custom query and you need to LIKE-scan a > large portion of a table. For those cases an algorithm that performs > 20% better is surely welcome. > > I wouldn't be so quick to dismiss this. > > Of course, it needs to work in all cases, or failing that, be able to > fall back to the original code if it cannot support some corner case. > > -- > Álvaro Herrera http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
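For readers following the thread, here is a minimal, self-contained C sketch of the Boyer-Moore-Horspool idea being benchmarked; it is illustrative only (bmh_search is a made-up name, and this is not the proposed like_match.c patch). The skip table lets the scan advance by more than one byte on a mismatch, which is where the speedup on longer needles comes from:

#include <stddef.h>
#include <string.h>

/* Illustrative BMH substring search: returns first match or NULL. */
static const char *
bmh_search(const char *haystack, size_t hlen,
           const char *needle, size_t nlen)
{
    size_t      skip[256];
    size_t      i;

    if (nlen == 0)
        return haystack;
    if (nlen > hlen)
        return NULL;

    /* Default shift is the whole needle length. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Needle characters (except the last) allow smaller shifts. */
    for (i = 0; i < nlen - 1; i++)
        skip[(unsigned char) needle[i]] = nlen - 1 - i;

    i = 0;
    while (i + nlen <= hlen)
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return haystack + i;
        /* Shift by the skip value of the last byte in the window. */
        i += skip[(unsigned char) haystack[i + nlen - 1]];
    }
    return NULL;
}

For short needles the per-window memcmp dominates and the shifts are small, which is consistent with the benchmark result above that BMH only wins on longer substrings.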
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Alvaro Herrera writes: > Tom Lane wrote: >> It looks to me like the reason for it is simply not having bothered to >> copy the rw->rw_worker data to somewhere that would survive deletion >> of the PostmasterContext. I wonder though if anyone remembers a more >> fundamental reason? Surely the bgworker is not supposed to touch any >> of the rest of the BackgroundWorkerList? > I just checked BDR, which is the more complex code using workers I know > of, and I don't see any reason why this cannot be changed. The attached patch passes "make check-world" for me. Can you check it against BDR? (I'd be hesitant to back-patch it in any case, but I think it's okay for HEAD unless we can easily find something it breaks.) regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 19d11e0..e48703a 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*** do_start_bgworker(RegisteredBgWorker *rw
*** 5529,5537 ****
    /* Close the postmaster's sockets */
    ClosePostmasterPorts(false);

!   /* Do NOT release postmaster's working memory context */

-   MyBgworkerEntry = &rw->rw_worker;
    StartBackgroundWorker();
    break;
  #endif
--- 5529,5547 ----
    /* Close the postmaster's sockets */
    ClosePostmasterPorts(false);

!   /*
!    * Before blowing away PostmasterContext, save this bgworker's
!    * data where it can find it.
!    */
!   MyBgworkerEntry = (BackgroundWorker *)
!       MemoryContextAlloc(TopMemoryContext, sizeof(BackgroundWorker));
!   memcpy(MyBgworkerEntry, &rw->rw_worker, sizeof(BackgroundWorker));
!
!   /* Release postmaster's working memory context */
!   MemoryContextSwitchTo(TopMemoryContext);
!   MemoryContextDelete(PostmasterContext);
!   PostmasterContext = NULL;
    StartBackgroundWorker();
    break;
  #endif
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
Michael Paquier wrote: > On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier wrote: > > There is still an issue with pg_basebackup when testing stream mode > > and replication slots. I am digging into this one now.. > > After 5 hours running this test in a row and 30 attempts torturing > hamster with a script running make check in an infinite loop with set > -e, I am giving up on that for the time being... I have added the patch > to make the tests more stable to the next CF so that it is not forgotten: > https://commitfest.postgresql.org/10/693/ Great, thanks for the effort, will push. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote: > Even so, I'd be curious whether it would break anything to have > xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero > pages written to fill out a segment. At least until it's felt > that archive_timeout has been so decidedly obsoleted by streaming > replication that it is removed, and the log-tail zeroing code > with it. > > That at least would eliminate the risk of anyone else repeating > my astonishment. :) I had read that 9.4 added built-in log-zeroing > code, and my first reaction was "cool! that may make the compression > technique we're using unnecessary, but certainly can't make it worse" > only to discover that it did, by ~ 300x, becoming now 3x *worse* than > plain gzip, which itself is ~ 100x worse than what we had. My guess is that the bytes are there to detect problems where a 512-byte disk sector is zeroed by a disk failure. I don't see us changing that for the use-case you have described. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Tom Lane wrote: > Robert Haas writes: > > On Mon, Aug 1, 2016 at 4:18 PM, Tom Lane wrote: > >> Now, I'm undecided whether to flush that context only in parallel workers, > >> or to try to make it go away for all bgworkers of any stripe. The latter > >> seems a little better from a security standpoint, but I wonder if anyone > >> has a use-case where that'd be a bad idea? > > > I think it would be better to get rid of it in all bgworkers. > > I looked into this, and immediately found this in the spot in postmaster.c > that would be the obvious place to kill the PostmasterContext: > > /* Do NOT release postmaster's working memory context */ > > MyBgworkerEntry = &rw->rw_worker; > StartBackgroundWorker(); > > This comment was in Alvaro's original commit adding bgworkers (da07a1e8). Hm, I don't have the development branch on this laptop. I might find some evidence in the old one, but I won't be able to reach it till tonight. > It looks to me like the reason for it is simply not having bothered to > copy the rw->rw_worker data to somewhere that would survive deletion > of the PostmasterContext. I wonder though if anyone remembers a more > fundamental reason? Surely the bgworker is not supposed to touch any > of the rest of the BackgroundWorkerList? I just checked BDR, which is the more complex code using workers I know of, and I don't see any reason why this cannot be changed. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PostgreSQL 10 kick-off
On 8/1/16 1:08 PM, Fabrízio de Royes Mello wrote: > What knowledge is expected for a CFM? I really would like to help and > also learn more about our development. If you search the wiki for "commitfest" you will find several pages of varying outdatedness that describe the process. Mainly, you will need to be nagging a lot of people. ;-) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Robert Haas writes: > On Mon, Aug 1, 2016 at 4:18 PM, Tom Lane wrote: >> Now, I'm undecided whether to flush that context only in parallel workers, >> or to try to make it go away for all bgworkers of any stripe. The latter >> seems a little better from a security standpoint, but I wonder if anyone >> has a use-case where that'd be a bad idea? > I think it would be better to get rid of it in all bgworkers. I looked into this, and immediately found this in the spot in postmaster.c that would be the obvious place to kill the PostmasterContext: /* Do NOT release postmaster's working memory context */ MyBgworkerEntry = &rw->rw_worker; StartBackgroundWorker(); This comment was in Alvaro's original commit adding bgworkers (da07a1e8). It looks to me like the reason for it is simply not having bothered to copy the rw->rw_worker data to somewhere that would survive deletion of the PostmasterContext. I wonder though if anyone remembers a more fundamental reason? Surely the bgworker is not supposed to touch any of the rest of the BackgroundWorkerList? regards, tom lane
Re: [HACKERS] Slowness of extended protocol
On Mon, Aug 1, 2016 at 12:12 PM, Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > The attached patch passes `make check` and it gains 31221 -> 33547 improvement for "extended pgbench of SELECT 1". > > The same version gains 35682 in "simple" mode, and "prepared" mode achieves 46367 (just in case). That's great, thanks for looking into it! I hope your patch gets merged. > Shay, why don't you use a profiler? Seriously. > I'm afraid "iterate the per-message loop in PostgresMain five times not once" /"just discussing what may or may not be a problem..." is just hand-waving. > > Come on, it is not that hard. I really don't get what's problematic with posting a message on a mailing list about a potential performance issue, to try to get people's reactions, without diving into profiling right away. I'm not a PostgreSQL developer, have other urgent things to do and don't even spend most of my programming time in C.
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Robert Haas wrote: > On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka wrote: > > Following the patch to implement strpos with Boyer-Moore-Horspool, > > it was proposed we bring BMH to LIKE as well. > > > > Original thread: > > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us > > > > I'm a first time hacker and I found this task on the TODO list. It seemed > > interesting and isolated enough. But after looking at the code in > > like_match.c, it's clearly a non-trivial task. > > > > How desirable is this feature? To begin answering that question, > > I did some initial benchmarking with an English text corpus to find how much > > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results > > were as follows: BMH can be up to 20% faster than LIKE, but for short > > substrings, it's roughly comparable or slower. > > > > Here are the results visualized: > > > > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png > > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png > > Based on these results, this looks to me like a pretty unexciting > thing upon which to spend time. Uh, a 20% difference is "unexciting" for you? I think it's interesting. Now, really, users shouldn't be running LIKE on constant strings all the time but rather use some sort of indexed search, but once in a while there is a need to run some custom query and you need to LIKE-scan a large portion of a table. For those cases an algorithm that performs 20% better is surely welcome. I wouldn't be so quick to dismiss this. Of course, it needs to work in all cases, or failing that, be able to fall back to the original code if it cannot support some corner case. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] MSVC pl-perl error message is not verbose enough
On Mon, Aug 1, 2016 at 9:39 PM, Michael Paquier wrote: > On Tue, Aug 2, 2016 at 2:08 AM, Robert Haas wrote: > > Did you add this to the next CommitFest? > > I have added it here: > https://commitfest.postgresql.org/10/691/ > John, it would be good if you could get a community account and add > your name to this patch as its author. I could not find you. > Done. If there's anything else that I should do, please let me know. I'd be glad to help out. Thanks! -John
Re: [HACKERS] parallel.c is not marked as test covered
On 6/19/16 10:00 PM, Robert Haas wrote: >> Independent of that, it would help testing things like this if we allowed >> setting max_worker_processes to 0, instead of the current minimum 1. Is >> there a reason for the minimum of 1? > I believe that's pure brain fade on my part. I think the minimum should be 0. Fixed. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_size_pretty, SHOW, and spaces
On Tue, Aug 2, 2016 at 11:29:01AM +0200, Christoph Berg wrote: > > The issue is that we output "10 bytes", not "10bytes", but for units we > > use "977KB". That seems inconsistent, but it is the normal policy > > people use. I think this is because "977KB" is really "977K bytes", but > > we just append the "B" after the "K" for brevity. > > It's the other way round: > > https://en.wikipedia.org/wiki/International_System_of_Units#General_rules > > | The value of a quantity is written as a number followed by a space > | (representing a multiplication sign) and a unit symbol; e.g., 2.21 kg > [...] > > I'd opt to omit the space anywhere where the value is supposed to be > fed back into the config (SHOW, --parameters), but use the "pretty" > format with space everywhere otherwise (documentation, memory counts > in explain output, pg_size_pretty() etc.) Yes, that's a strong argument for using a space. I have adjusted the patch to use spaces in all reasonable places. Patch attached, which I have gzipped because it was 133 KB. (Ah, see what I did there?) :-) I am thinking of leaving the 9.6 docs alone as I have already made them consistent (no space) with minimal changes. We can make it consistent the other way in PG 10. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription + kilo2.diff.gz
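For context, a toy C sketch of the kind of formatting pg_size_pretty does, written with the SI-style space discussed above ("977 kB" rather than "977kB"); the thresholds and truncating division here are simplified assumptions of mine, not the server's exact rounding logic:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Toy size formatter: scale down by 1024 while the magnitude stays
 * large, then print "<number> <unit>" with a space, per the SI rule
 * quoted above.
 */
static void
size_pretty(int64_t size, char *buf, size_t buflen)
{
    static const char *units[] = {"bytes", "kB", "MB", "GB", "TB"};
    int         u = 0;

    while ((size >= 10240 || size <= -10240) && u < 4)
    {
        size /= 1024;           /* no rounding; real code rounds */
        u++;
    }
    snprintf(buf, buflen, "%lld %s", (long long) size, units[u]);
}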
Re: [HACKERS] Wanting to learn about pgsql design decision
> "Tom" == Tom Lanewrites: >> - Why to read from a table, both a usage permission on the schema >> and a read access permission on the table is needed? Tom> Because the SQL standard says so. You'd think, but in fact it doesn't; the spec (at least 2008 and the 2011 drafts) has no concept of grantable permissions on schemas, and ties table ownership and schema ownership together. (See the definition of to see that there's nothing there for schemas, and the definition of for the fact that it's the schema owner who also owns the table and gets the initial grants on it, and and to confirm that only the schema owner can alter or drop the table. The access rules for only require permission on a table column, no mention of schemas.) -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PATCH: two slab-like memory allocators
Hi, Back in the bug #14231 thread [1], dealing with performance issues in reorderbuffer due to an excessive number of expensive free() calls, I proposed to resolve that with a custom slab-like memory allocator, suitable for fixed-size allocations. I'd like to put this into the next CF, and it's probably too invasive a change to count as a bugfix anyway. [1] https://www.postgresql.org/message-id/flat/20160706185502.1426.28143%40wrigleys.postgresql.org This patch actually includes two new memory allocators (not one). Very brief summary (for more detailed explanation of the ideas, see comments at the beginning of slab.c and genslab.c): Slab * suitable for fixed-length allocations (other pallocs fail) * much simpler than AllocSet (no global freelist management etc.) * free space is tracked per block (using a simple bitmap) * which allows freeing the block once all chunks are freed (AllocSet will hold the memory forever, in the hope of reusing it) GenSlab --- * suitable for non-fixed-length allocations, but with chunks of mostly the same size (initially unknown, the context will tune itself) * a combination of AllocSet and Slab (or a sequence of Slab allocators) * the goal is to do most allocations in the Slab context * there's always a single 'current' Slab context, and every now and then it's replaced with a new generation (with the chunk size computed from recent requests) * the AllocSet context is used for chunks too large for the current Slab So none of this is meant as a universal replacement of AllocSet, but in the suitable cases the results seem really promising. For example for the simple test query in [1], the performance improvement is this:

  N  | master    | patched
   1 |     100ms |   100ms
   5 |   15000ms |   350ms
  10 |  146000ms |   700ms
  20 |         ? |  1400ms

That's a fairly significant improvement, and the submitted version of the patches should perform even better (~2x, IIRC). There's a bunch of TODOs - e.g. handling of realloc() calls in the GenSlab, and probably things I haven't thought about. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-simple-slab-allocator-fixed-size-allocations.patch 0002-generational-slab-auto-tuning-allocator.patch
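To make the per-block free-space tracking concrete, here is a toy C sketch under my own naming (SlabBlockSketch and friends are hypothetical, not taken from the patch): each block holds a fixed number of equal-size chunks, a bitmap records which chunks are free, and once every chunk is free the whole block can be returned to the OS, which is exactly what AllocSet never does.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block bookkeeping for a slab of fixed-size chunks. */
typedef struct SlabBlockSketch
{
    int      nchunks;       /* chunks per block */
    int      nfree;         /* currently free chunks */
    uint8_t *freemap;       /* one bit per chunk, 1 = free */
    char    *chunks;        /* the chunk storage itself */
} SlabBlockSketch;

static void
slab_mark_free(SlabBlockSketch *block, int chunkno)
{
    block->freemap[chunkno / 8] |= (uint8_t) (1 << (chunkno % 8));
    block->nfree++;
}

static bool
slab_block_is_empty(const SlabBlockSketch *block)
{
    /* Once every chunk is free, the whole block can be free()d. */
    return block->nfree == block->nchunks;
}

static int
slab_alloc_chunk(SlabBlockSketch *block)
{
    int i;

    for (i = 0; i < block->nchunks; i++)
    {
        if (block->freemap[i / 8] & (1 << (i % 8)))
        {
            block->freemap[i / 8] &= (uint8_t) ~(1 << (i % 8));
            block->nfree--;
            return i;       /* caller uses block->chunks + i * chunksize */
        }
    }
    return -1;              /* block is full */
}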
Re: [HACKERS] Why we lost Uber as a user
On 2 August 2016 at 15:27, Robert Haas wrote: > On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila wrote: >> Why do we need to add a record in all indexes if only the key >> corresponding to one of the indexes is updated? Basically, if the tuple >> can fit on the same page, why can't we consider it as HOT (or HPT - heap >> partial tuple or something like that), unless it updates the keys >> for all the indexes? Now, we can't consider such tuple versions for >> pruning as we do for HOT. The downside of this could be that we might >> need to retain some of the line pointers for more time (as we won't be >> able to reuse the line pointer till it is used in any one of the >> indexes, and those could be reused once we make the next non-HOT update). >> However, this should allow us not to update the indexes for which the >> corresponding column in the tuple is not updated. I think it is a basic >> premise that if any index column is updated then the update will be >> considered as non-HOT, so there is a good chance that I might be >> missing something here. > > Well, I think that the biggest advantage of a HOT update is the fact > that it enables HOT pruning. In other words, we're not primarily > trying to minimize index traffic; we're trying to make cleanup of the > heap cheaper. So this could certainly be done, but I'm not sure it > would buy us enough to be worth the engineering effort involved. (Hi, just back from leave and catching up on emails.) The above suggested design is something I've been working on for the last few days. In my design I referred to "intermediate root" tuples. I've got a detailed design for it and it works, yay!... but Pavan has managed to shoot it down with some accurate observations about it leading to an annoying accumulation of root pointers and complex logic to remove them. So I'm not pursuing it further at this stage. I'm writing up my conclusions around what we should do now, so should post later today. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Wanting to learn about pgsql design decision
Tal Walter writes: > - Why in the roles system, users are actually roles with a login attribute > and not a separate entity. Groups and users used to be separate concepts, actually, a long time ago. We got rid of that because it was a PITA; in particular, grants to groups had to be represented separately from grants to individual users. Looking at the git history, that happened in mid-2005, so you might trawl the pgsql-hackers archives from around that time for discussion. > - Why, to read from a table, are both a usage permission on the schema and a > read access permission on the table needed? Because the SQL standard says so. You might want to get a copy. While the "official" releases cost lots o' money, draft versions are freely available on the net, and are generally close enough. regards, tom lane
Re: [HACKERS] PostmasterContext survives into parallel workers!?
Robert Haas writes: > On Mon, Aug 1, 2016 at 7:42 PM, Tom Lane wrote: >> ... This is what makes me dubious that getting rid >> of doing work in the postmaster's signal handlers is really going >> to add any noticeable increment of safety. It might make the >> code look cleaner, but I'm afraid it's going to be a lot of churn >> for not much gain. > It's not just a cosmetic issue. > See > https://www.postgresql.org/message-id/CA+Tgmobr+Q2WgWeasdbDNefVwJkAGALxA=-vtegntqgl1v2...@mail.gmail.com > and d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4. I do not think that patch particularly bears on this question. We have had logic bugs in the postmaster in the current coding structure, and we undoubtedly would still have logic bugs in the postmaster if it were rewritten to avoid using nontrivial signal handlers. They'd just be in different places. Also, the message you mention does a fine job of summarizing why getting out of the signal-handler-based design will be fraught with a bunch of new portability hazards. But what I was responding to here is the claim that we have portability hazards built into the current code as a consequence of doing work in the signal handlers. AFAICT, that's just FUD, and is sufficiently disproven by the many years of reliable service we've gotten out of the current design. Anyway, if someone is really motivated to rewrite the postmaster, I won't stand in the way. I'm just opining that it will be a lot of work that would be better expended elsewhere. regards, tom lane
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote: > On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote: > > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote: > >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote: > >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote: > >> >> Consider the below scenario. > >> >> > >> >> Vacuum > >> >> a. acquires a cleanup lock for page - 10 > >> >> b. busy in checking visibility of tuples > >> >> --assume, here it takes some time and in the meantime Session-1 > >> >> performs steps (a) and (b) and starts waiting in step (c) > >> >> c. marks the page as all-visible (PageSetAllVisible) > >> >> d. unlock and release the buffer > >> >> > >> >> Session-1 > >> >> a. In heap_lock_tuple(), read buffer for page-10 > >> >> b. check PageIsAllVisible(), found page is not all-visible, so didn't > >> >> acquire the visibilitymap_pin > >> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to > >> >> release the lock > >> >> d. Got the lock, but now the page is marked as all-visible, so ideally > >> >> need to recheck the page and acquire the visibilitymap_pin > >> > > >> > So, I've tried pretty hard to reproduce that. While the theory above is > >> > sound, I believe the relevant code-path is essentially dead for SQL > >> > callable code, because we'll always hold a buffer pin before even > >> > entering heap_update/heap_lock_tuple. > >> > > >> > >> It is possible that we don't hold any buffer pin before entering > >> heap_update() and or heap_lock_tuple(). For heap_update(), it is > >> possible when it enters via the simple_heap_update() path. For > >> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement > >> and maybe others as well. > > > > This is currently listed as a 9.6 open item. Is it indeed a regression in > > 9.6, or do released versions have the same defect? If it is a 9.6 regression, > > do you happen to know which commit, or at least which feature, caused it? > > > > Commit eca0f1db is the reason for this specific issue. [Action required within 72 hours. This is a generic notification.] The above-described topic is currently a PostgreSQL 9.6 open item. Andres, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed in advance of shipping 9.6rc1 next week. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com
[HACKERS] PostgreSQL 10 Roadmaps
https://wiki.postgresql.org/wiki/PostgreSQL10_Roadmap -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Why we lost Uber as a user
On Tue, Aug 2, 2016 at 5:51 AM, Amit Kapila wrote: > Why do we need to add a record in all indexes if only the key > corresponding to one of the indexes is updated? Basically, if the tuple > can fit on the same page, why can't we consider it as HOT (or HPT - heap > partial tuple or something like that), unless it updates the keys > for all the indexes? Now, we can't consider such tuple versions for > pruning as we do for HOT. The downside of this could be that we might > need to retain some of the line pointers for more time (as we won't be > able to reuse the line pointer till it is used in any one of the > indexes, and those could be reused once we make the next non-HOT update). > However, this should allow us not to update the indexes for which the > corresponding column in the tuple is not updated. I think it is a basic > premise that if any index column is updated then the update will be > considered as non-HOT, so there is a good chance that I might be > missing something here. Well, I think that the biggest advantage of a HOT update is the fact that it enables HOT pruning. In other words, we're not primarily trying to minimize index traffic; we're trying to make cleanup of the heap cheaper. So this could certainly be done, but I'm not sure it would buy us enough to be worth the engineering effort involved. Personally, I think that incremental surgery on our current heap format to try to fix this is not going to get very far. If you look at the history of this, 8.3 was a huge release for timely cleanup of dead tuples. There was also significant progress in 8.4 as a result of 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we then made no progress at all in 9.0 - 9.4. We made a very small improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but that's pretty niche. In 9.6, we have "snapshot too old", which I'd argue is potentially a large improvement, but it was big and invasive and will no doubt pose code maintenance hazards in the years to come; also, many people won't be able to use it or won't realize that they should use it. I think it is likely that further incremental improvements here will be quite hard to find, and the amount of effort will be large relative to the amount of benefit. I think we need a new storage format where the bloat is cleanly separated from the data rather than intermingled with it; every other major RDBMS works that way. Perhaps this is a case of "the grass is greener on the other side of the fence", but I don't think so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] New version numbering practices
Greg Stark writes: > On Tue, Aug 2, 2016 at 2:10 AM, Alvaro Herrera > wrote: >> That said, I'm not opposed to REL_10 and so on. In 89 years there will >> be a problem with sorting REL_100 but I'm sure they can find a solution >> then, if computers still need humans to write programs for them. > It would be nice if there was a consistent way of referring to a > version regardless of how old it was. > There would be nothing stopping us from going back and adding tags for > existing versions. The discussion here is about branches, not tags. I don't know of any way to have an alias for a branch (though I'm no git expert). > It would also give a convenient chance > to fix the inconsistencies in how some of the older branches were > tagged. I thought we'd pretty much done that cleanup during the cvs->git conversion? regards, tom lane
Re: [HACKERS] Tracking wait event for latches
On Wed, Jun 8, 2016 at 10:11 AM, Michael Paquier wrote: > Yeah, that's as well my line of thought on the matter since the > beginning: keep it simple and done. What is written just after those > words is purely hand-waving and I have no way to prove it, but my > instinctive guess is that more than 90% of the real use cases where we > need to track the latch waits in pgstat would be covered without the > need of this extra shared memory infrastructure for extensions. So, I have done some extra tests with my patch to see, if I move the event reporting down to WaitEventSetWait, what the effect is on secure_read and secure_write. And the conclusion is that I am seeing no difference, so I changed the patch to the way suggested by Thomas. In this test, what I did was use the following pgbench script with -C via an SSL connection: \set id random(1,1000) As the script file is not empty, a connection to the server is opened, so this goes through secure_read at minimal cost on the backend. Also, I have changed the event ID notation as follows to be consistent with the routine names: WAL -> Wal PG -> Pg I also found that LATCH_RECOVERY_WAL_ALL was reporting "XLogAfterAvailable" as its name, which was incorrect. Finally, I have changed the patch so that it does not track "Latch" events, but "EventSet" events instead, per the suggestion of Thomas. "WaitEventSet" is too close to wait_event for my taste, so I shortened the suggestion. There were also some conflicts caused by the recent commit 887feefe, which are fixed. Attached is an updated patch. -- Michael wait-event-set-v2.patch
Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans
> -----Original Message----- > From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp] > Sent: Tuesday, August 02, 2016 9:36 PM > To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans > > On 2016/08/01 20:15, Etsuro Fujita wrote: > > I thought about the Relations line a bit more and noticed that there are > > cases where the table reference names for joining relations in the > > Relations line are printed incorrectly. Here is an example: > > > > postgres=# explain verbose select * from (select t1.a, t2.a from ft1 t1, > > ft2 t2 where t1.a = t2.a union select t1.a, t2.a from ft1 t1, ft2 t2 > > where t1.a = t2.a) as t(t1a, t2a); > > QUERY PLAN > > -------------------------------------------------------------------------- > > Unique (cost=204.12..204.13 rows=2 width=8) > > Output: t1.a, t2.a > > -> Sort (cost=204.12..204.12 rows=2 width=8) > > Output: t1.a, t2.a > > Sort Key: t1.a, t2.a > > -> Append (cost=100.00..204.11 rows=2 width=8) > > -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) > > Output: t1.a, t2.a > > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) > > Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 > > INNER JOIN public.t2 r2 ON (((r1.a = r2.a > > -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) > > Output: t1_1.a, t2_1.a > > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) > > Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 > > INNER JOIN public.t2 r2 ON (((r1.a = r2.a > > (14 rows) > > > > The table reference names for ft1 and ft2 in the Relations line for the > > second Foreign Scan should be t1_1 and t2_1 respectively. > > > > Another concern about the Relations line is that it represents just an > > internal representation of a pushed-down join, so it would not > > necessarily match a deparsed query shown in the Remote SQL line. Here > > is an example, which I found when working on supporting pushing down > > full outer join a lot more, by improving the deparsing logic so that > > postgres_fdw can build a remote query that involves subqueries [1], > > which I'll work on for 10.0: > > > > + -- full outer join with restrictions on the joining relations > > + EXPLAIN (COSTS false, VERBOSE) > > + SELECT t1.c1, t2.c1 FROM (SELECT c1 FROM ft4 WHERE c1 BETWEEN 50 AND > > 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 BETWEEN 50 AND 60) t2 ON > > (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1; > > + QUERY PLAN > > + ---------------------------------------------------------------------- > > + Foreign Scan > > + Output: ft4.c1, ft5.c1 > > + Relations: (public.ft4) FULL JOIN (public.ft5) > > + Remote SQL: SELECT ss1.c1, ss2.c1 FROM ((SELECT c1 FROM "S 1"."T 3" > > WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss1(c1) FULL JOIN (SELECT c1 FROM > > "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss2(c1) ON (((ss1.c1 = > > ss2.c1 ORDER BY ss1.c1 ASC NULLS LAST, ss2.c1 ASC NULLS LAST > > + (4 rows) > > > > "(public.ft4) FULL JOIN (public.ft5)" in the Relations line does not > > exactly match the deparsed query in the Remote SQL line, which I think > > would be rather confusing for users. (We may be able to print more > > exact information in the Relations line so as to match the deparsed > > query, but I think that would make the Relations line redundant.) > > > > Would we really need the Relations line? If joining relations are > > printed by core like "Foreign Join on public.ft1 t1_1, public.ft2 t2_1" > > as proposed upthread, we can see those relations from that, not the > > Relations line. Also we can see the join tree structure from the > > deparsed query in the Remote SQL line. The Relations line seems to be > > not that useful anymore, then. What do you think about that? > > I removed the Relations line. Here is an updated version of the patch. > > * As I said upthread, I left the upper-relation handling for another patch. Currently, the patch prints "Foreign Scan" with no relations in that case. > > * I kept custom joins as-is. We would need discussions about how to choose the relations we print in EXPLAIN, so I'd also like to leave that for yet another patch. > Please don't rely on the fs_relids bitmap to pick up relations to be printed. It just holds a set of underlying relations, but it does not mean all of these relations are actually scanned inside of the ForeignScan. You
Re: [HACKERS] Wrong defeinition of pq_putmessage_noblock since 9.5
On Fri, Jul 29, 2016 at 11:58 AM, Tom Lane wrote: > Maybe the real problem here is that the abstraction layer is so incomplete > and apparently haphazard, so that it's not very obvious where you should > use a pq_xxx name and where you should refer to socket_xxx. For some of > the functions in pqcomm.c, such as internal_flush, it's impossible to tell > which side of the abstraction boundary they're supposed to be on. > (I suspect that that particular function has simply been ignored on the > undocumented assumption that a bgworker could never call it, but that's > the kind of half-baked work that I don't find acceptable.) > > I think the core work that needs to be done here is to clearly identify > each function in pqcomm.c as either above or below the PqCommMethods > abstraction boundary. Maybe we should split one set or the other out > to a new source file. In any case, the naming convention ought to > reflect that hierarchy. > > I withdraw the idea that we should try to revert all these functions back > to their old names, since that would obscure the layering distinction > between socket-specific and general-usage functions. I now think that > the problem is that that distinction hasn't been drawn sharply enough. If memory serves, there was some discussion of this at the time the patch that changed this was originally submitted. The original patch, IIRC, just installed one or two hooks and it was argued that I should instead create some kind of abstraction layer. The present coding is the result of my attempt to do just that. I have to admit that I wasn't very eager to churn the contents of this file more than necessary. It seems like old, crufty code to me. I don't object if you want to refactor it, but I'm not sure what problem it solves. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans
On 2016/08/01 20:15, Etsuro Fujita wrote: I thought about the Relations line a bit more and noticed that there are cases where the table reference names for joining relations in the Relations line are printed incorrectly. Here is an example: postgres=# explain verbose select * from (select t1.a, t2.a from ft1 t1, ft2 t2 where t1.a = t2.a union select t1.a, t2.a from ft1 t1, ft2 t2 where t1.a = t2.a) as t(t1a, t2a); QUERY PLAN -------------------------------------------------------------------------- Unique (cost=204.12..204.13 rows=2 width=8) Output: t1.a, t2.a -> Sort (cost=204.12..204.12 rows=2 width=8) Output: t1.a, t2.a Sort Key: t1.a, t2.a -> Append (cost=100.00..204.11 rows=2 width=8) -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) Output: t1.a, t2.a Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 INNER JOIN public.t2 r2 ON (((r1.a = r2.a -> Foreign Scan (cost=100.00..102.04 rows=1 width=8) Output: t1_1.a, t2_1.a Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2) Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1 INNER JOIN public.t2 r2 ON (((r1.a = r2.a (14 rows) The table reference names for ft1 and ft2 in the Relations line for the second Foreign Scan should be t1_1 and t2_1 respectively. Another concern about the Relations line is that it represents just an internal representation of a pushed-down join, so it would not necessarily match a deparsed query shown in the Remote SQL line. Here is an example, which I found when working on supporting pushing down full outer join a lot more, by improving the deparsing logic so that postgres_fdw can build a remote query that involves subqueries [1], which I'll work on for 10.0: + -- full outer join with restrictions on the joining relations + EXPLAIN (COSTS false, VERBOSE) + SELECT t1.c1, t2.c1 FROM (SELECT c1 FROM ft4 WHERE c1 BETWEEN 50 AND 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 BETWEEN 50 AND 60) t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1; + QUERY PLAN + ---------------------------------------------------------------------- + Foreign Scan + Output: ft4.c1, ft5.c1 + Relations: (public.ft4) FULL JOIN (public.ft5) + Remote SQL: SELECT ss1.c1, ss2.c1 FROM ((SELECT c1 FROM "S 1"."T 3" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss1(c1) FULL JOIN (SELECT c1 FROM "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss2(c1) ON (((ss1.c1 = ss2.c1 ORDER BY ss1.c1 ASC NULLS LAST, ss2.c1 ASC NULLS LAST + (4 rows) "(public.ft4) FULL JOIN (public.ft5)" in the Relations line does not exactly match the deparsed query in the Remote SQL line, which I think would be rather confusing for users. (We may be able to print more exact information in the Relations line so as to match the deparsed query, but I think that would make the Relations line redundant.) Would we really need the Relations line? If joining relations are printed by core like "Foreign Join on public.ft1 t1_1, public.ft2 t2_1" as proposed upthread, we can see those relations from that, not the Relations line. Also we can see the join tree structure from the deparsed query in the Remote SQL line. The Relations line seems to be not that useful anymore, then. What do you think about that? I removed the Relations line. Here is an updated version of the patch. * As I said upthread, I left the upper-relation handling for another patch. Currently, the patch prints "Foreign Scan" with no relations in that case. * I kept custom joins as-is. We would need discussions about how to choose the relations we print in EXPLAIN, so I'd also like to leave that for yet another patch.
Best regards, Etsuro Fujita explain-for-foreign-join-pushdown.patch
Re: [HACKERS] old_snapshot_threshold allows heap:toast disagreement
On Tue, Aug 2, 2016 at 9:10 AM, Robert Haas wrote: > On Sat, Jul 30, 2016 at 8:17 AM, Amit Kapila wrote: >> On Fri, Jul 29, 2016 at 1:10 AM, Robert Haas wrote: >>> On Wed, Jul 27, 2016 at 7:26 PM, Andres Freund wrote: >>> >>> New version attached. >> >> +static inline void >> +InitToastSnapshot(Snapshot snapshot, XLogRecPtr lsn) >> +{ >> + snapshot->satisfies = HeapTupleSatisfiesToast; >> + snapshot->lsn = lsn; >> +} >> >> Here, don't you need to initialize whenTaken as well, since that is also used in >> TestForOldSnapshot_impl() to report the "snapshot too old" error? > > Hmm, yeah. This is actually a bit confusing. We want the "oldest" > snapshot, but there are three different notions of "oldest": > > 1. Smallest LSN. > 2. Smallest whenTaken. > 3. Smallest xmin. > > Which one do we use? > Whichever notion we choose, I think we should use the LSN and whenTaken from the same snapshot. I think we can choose the snapshot with the smallest xmin and then use the LSN and whenTaken from it. I think the xmin comparison makes more sense because you are already retrieving the smallest-xmin snapshot from the registered snapshots. Whatever value we choose here, I think we can't guarantee that it will be in sync with the value used for the heap. So there is a chance that we might spuriously raise the "snapshot too old" error. I think we should mention this as a caveat in the docs [1] where we explain the behaviour of old_snapshot_threshold. [1] - https://www.postgresql.org/docs/9.6/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
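For concreteness, a sketch of the quoted initializer extended to carry whenTaken as well, so that TestForOldSnapshot_impl() has a valid timestamp to compare against; whether the lsn/whenTaken pair should come from the smallest-xmin snapshot, as suggested above, is the open question, so treat the parameter choice here as an assumption rather than the final fix:

/*
 * Sketch only: take all three fields from the same registered snapshot
 * (e.g. the one with the smallest xmin), per the discussion above.
 */
static inline void
InitToastSnapshot(Snapshot snapshot, XLogRecPtr lsn, TimestampTz whenTaken)
{
    snapshot->satisfies = HeapTupleSatisfiesToast;
    snapshot->lsn = lsn;
    snapshot->whenTaken = whenTaken;
}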
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 6:03 AM, Dilip Kumar wrote: > There are many more such exposed functions, which can throw a cache lookup > failure error if we pass a wrong value. > > i.e. > record_in > domain_in > fmgr_c_validator > plpgsql_validator > pg_procedure_is_visible > > Are we planning to change these also? I think if they are SQL-callable functions that users can invoke from a SQL prompt, they shouldn't throw "cache lookup failed" errors or, for that matter, any other error that is reported with elog(). Now, I don't necessarily think that these functions should return NULL in that case the way we did with the others; that's not a sensible behavior for a type-input function AFAIK. But we should emit a better error. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] New version numbering practices
On Tue, Aug 2, 2016 at 2:10 AM, Alvaro Herrera wrote: > That said, I'm not opposed to REL_10 and so on. In 89 years there will > be a problem with sorting REL_100 but I'm sure they can find a solution > then, if computers still need humans to write programs for them. It would be nice if there was a consistent way of referring to a version regardless of how old it was. There would be nothing stopping us from going back and adding tags for existing versions. We could add REL_09_5 back to REL_06_5 if we wanted to. Then we could easily refer to any version without special cases or rules about pre-10 vs post-10. It would also give a convenient chance to fix the inconsistencies in how some of the older branches were tagged. -- greg
Re: [HACKERS] Broken order-of-operations in parallel query latch manipulation
On Mon, Aug 1, 2016 at 8:45 PM, Tom Lane wrote: > Amit Kapila writes: >> On Mon, Aug 1, 2016 at 1:58 AM, Tom Lane wrote: >>> I believe this is wrong and the CHECK_FOR_INTERRUPTS needs to be before >>> or after the two latch operations. As-is, if the reason somebody set >>> our latch was to get us to notice that a CHECK_FOR_INTERRUPTS condition >>> happened, there's a race condition where we'd fail to realize that. > >> I could see that in nodeGather.c, it might fail to notice the SetLatch >> done by a worker process or be spuriously woken up due to a SetLatch for some >> unrelated reason. However, I don't see what problem it can cause >> apart from one extra loop cycle where it will try to process the tuple >> when actually there is no tuple in the queue. > > Consider the following sequence of events: > > 1. gather_readnext reaches the WaitLatch, and is allowed to pass through > it for some unrelated reason, perhaps some long-since-handled SIGUSR1 > from a worker process. > > 2. gather_readnext does CHECK_FOR_INTERRUPTS(), and sees nothing pending. > > 3. A SIGINT is received. StatementCancelHandler sets QueryCancelPending > and does SetLatch(MyLatch). > > 4. gather_readnext does ResetLatch(MyLatch). > > 5. gather_readnext runs through its loop again, finds nothing to do, and > reaches the WaitLatch. > > 6. The process is now sleeping on its latch, and might sit there a long > time before noticing the pending query cancel. > > Obviously the window for this race condition is pretty tight --- there's > not many instructions between steps 2 and 4. But it can happen. If > memory serves, we've had actual field reports for race condition bugs > where the window that was being hit wasn't much more than a single > instruction. > > Also, it's entirely possible that the bug could be masked, if there was > another CHECK_FOR_INTERRUPTS lurking anywhere in the code called within > the loop. That doesn't excuse this coding practice, though. > Right, and thanks for the detailed explanation. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
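For readers following along, a sketch of the safe ordering the sequence above implies: ResetLatch() comes first, and CHECK_FOR_INTERRUPTS() runs after it, so a SetLatch() done by a signal handler after the interrupt check leaves the latch set and the next WaitLatch() returns immediately instead of sleeping. gather_try_read_tuple() is a hypothetical stand-in for the real loop body in gather_readnext():

for (;;)
{
    ResetLatch(MyLatch);

    /*
     * Any SetLatch() that happens after this check leaves the latch
     * set, so the WaitLatch() below cannot sleep past it.
     */
    CHECK_FOR_INTERRUPTS();

    if (gather_try_read_tuple())    /* hypothetical loop body */
        break;

    WaitLatch(MyLatch, WL_LATCH_SET, 0);
}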
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Tue, Aug 2, 2016 at 3:33 PM, Dilip Kumar wrote: > There are many more such exposed functions, which can throw a cache lookup > failure error if we pass a wrong value. > > i.e. > record_in > domain_in > fmgr_c_validator > edb_get_objecttypeheaddef > plpgsql_validator > pg_procedure_is_visible > > Are we planning to change these also? > > The below query is one example (from sqlsmith). > > ERROR: cache lookup failed for procedure 0 > > postgres=# select > > postgres-# pg_catalog.record_in( > > postgres(# cast(pg_catalog.numerictypmodout( > > postgres(# cast(98 as integer)) as cstring), > > postgres(# cast(1 as oid), > > postgres(# cast(35 as integer)); > > ERROR: cache lookup failed for type 1 > By mistake I mentioned the edb_get_objecttypeheaddef function as well; please ignore that. So the actual list is as below: record_in domain_in fmgr_c_validator plpgsql_validator pg_procedure_is_visible -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Sat, Jul 30, 2016 at 4:27 AM, Michael Paquier wrote: > OK for me. Thanks for the commit. There are many more such exposed functions, which can throw a cache lookup failure error if we pass a wrong value, i.e.: record_in domain_in fmgr_c_validator edb_get_objecttypeheaddef plpgsql_validator pg_procedure_is_visible Are we planning to change these also? The below query is one example (from sqlsmith). ERROR: cache lookup failed for procedure 0 postgres=# select postgres-# pg_catalog.record_in( postgres(# cast(pg_catalog.numerictypmodout( postgres(# cast(98 as integer)) as cstring), postgres(# cast(1 as oid), postgres(# cast(35 as integer)); ERROR: cache lookup failed for type 1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Why we lost Uber as a user
On Sat, Jul 30, 2016 at 12:06 AM, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: >> On Fri, Jul 29, 2016 at 10:44:29AM -0400, Stephen Frost wrote: >> > Another thought that was kicking around in my head related to that is if >> > we could have indexes that only provide page-level information (similar >> > to BRIN, but maybe a btree) and which also would allow HOT updates. >> > Those indexes would typically be used in a bitmap index scan where we're >> > going to be doing a bitmap heap scan with a recheck, of course, though I >> > wonder if we could come up with a way to do an in-order bitmap index >> > scan where we sort the tuples on the page and then perform some kind of >> > mergejoin recheck (or just pull out whatever the lowest-not-seen each >> > time we sort the tuples on the page). >> >> So allow HOT updates if the updated row is on the same page, even if the >> indexed column was changed, by scanning the page --- got it. > > The idea I had was to allow creation of indexes which *only* include the > page ID. Your rephrase seems to imply that we'd have a regular index > but then be able to figure out if a given tuple had any HOT updates > performed on it and, if so, scan the entire page. As I understand it, > it's more complicated than that because we must involve an index when > updating a tuple in some cases (UNIQUE?) and therefore we don't perform > HOT in the case where any indexed column is being changed. > Why do we need to add a record in all indexes if only the key corresponding to one of the indexes is updated? Basically, if the tuple can fit on the same page, why can't we consider it as HOT (or HPT - heap partial tuple or something like that), unless it updates the keys for all the indexes? Now, we can't consider such tuple versions for pruning as we do for HOT. The downside of this could be that we might need to retain some of the line pointers for more time (as we won't be able to reuse the line pointer till it is used in any one of the indexes, and those could be reused once we make the next non-HOT update). However, this should allow us not to update the indexes for which the corresponding column in the tuple is not updated. I think it is a basic premise that if any index column is updated then the update will be considered as non-HOT, so there is a good chance that I might be missing something here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Why we lost Uber as a user
On 2 August 2016 at 08:11, Alfred Perlstein wrote: > On 7/2/16 4:39 AM, Geoff Winkless wrote: > > I maintain that this is a nonsense argument. Especially since (as you > > pointed out and as I missed first time around) the bug actually occurred at > > different records on different slaves, so he invalidates his own point. > Seriously? No, I make a habit of spouting off random arguments to a list full of people whose opinions I massively respect purely for kicks. What do you think? > There's a valid point here, you're sending over commands at the block level, > effectively "write to disk at this location" versus "update this record based > on PK", obviously this has some drawbacks that are reason for concern. Writing values directly into file offsets is only problematic if something else has failed that has caused the file to be an inexact copy. If a different bug occurred that caused the primary key to be corrupted on the slave (or indeed the master), PK-based updates would exhibit similar propagation errors. To reiterate my point, Uber's described problem came about because of a bug. Every piece of software has bugs at some point in its life; to pretend otherwise is simply naive. I'm not trying to excuse the bug, or to belittle the impact that such a bug has on data integrity or on Uber or indeed on the reputation of PostgreSQL. While I'm prepared to accept (because I have a job that requires I spend time on things other than digging through obscure reddits and mailing lists to understand more fully the exact cause) that in _this particular instance_ the bug was propagated because of the replication mechanism (although I'm still dubious about that, as per my comment above), that does _not_ preclude other bugs propagating in statement-based replication. That's what I said is a nonsense argument, and no-one has yet explained in what way that's incorrect. Geoff
Re: [HACKERS] pg_size_pretty, SHOW, and spaces
Re: Peter Eisentraut 2016-08-01 > > PostgreSQL uses the spaces inconsistently, though. pg_size_pretty uses > > spaces: > > > > # select pg_size_pretty((2^20)::bigint); > > pg_size_pretty > > > > 1024 kB > > because it's "pretty" :) :) > > SHOW does not: > > > > # show work_mem; > > work_mem > > ── > > 1MB > > The original idea might have been to allow that value to be passed back > into the settings system, without having to quote the space. I'm not > sure, but I think changing that might cause some annoyance. That's a good argument for keeping it that way, yes. Re: Bruce Momjian 2016-08-01 <20160801162508.ga28...@momjian.us> > Looking at the Wikipedia article I posted earlier, that also doesn't use > spaces: > > https://en.wikipedia.org/wiki/Binary_prefix That article has plenty of occurrences of "10 MB" "528 MB/s" and the like. > I think the only argument _for_ spaces is the output of pg_size_pretty() > now looks odd, e.g.: > >10 | 10 bytes | -10 bytes > 1000 | 1000 bytes | -1000 bytes > 100 | 977KB | -977KB >10 | 954MB | -954MB > 1 | 931GB | -931GB > 1000 | 909TB | -909TB > ^ ^ > > The issue is that we output "10 bytes", not "10bytes", but for units we > use "977KB". That seems inconsistent, but it is the normal policy > people use. I think this is because "977KB" is really "977K bytes", but > we just append the "B" after the "K" for brevity. It's the other way round: https://en.wikipedia.org/wiki/International_System_of_Units#General_rules | The value of a quantity is written as a number followed by a space | (representing a multiplication sign) and a unit symbol; e.g., 2.21 kg [...] I'd opt to omit the space anywhere where the value is supposed to be fed back into the config (SHOW, --parameters), but use the "pretty" format with space everywhere otherwise (documentation, memory counts in explain output, pg_size_pretty() etc.) Christoph
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote:
> On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
>> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
>> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
>> >> Consider the below scenario.
>> >>
>> >> Vacuum
>> >> a. acquires a cleanup lock for page - 10
>> >> b. busy in checking visibility of tuples
>> >>    --assume it takes some time here, and in the meantime Session-1
>> >>    performs steps (a) and (b) and starts waiting in step (c)
>> >> c. marks the page as all-visible (PageSetAllVisible)
>> >> d. unlocks and releases the buffer
>> >>
>> >> Session-1
>> >> a. In heap_lock_tuple(), read the buffer for page-10
>> >> b. check PageIsAllVisible(); found the page is not all-visible, so
>> >>    didn't acquire the visibilitymap_pin
>> >> c. LockBuffer in exclusive mode - here it will wait for vacuum to
>> >>    release the lock
>> >> d. Got the lock, but now the page is marked as all-visible, so ideally
>> >>    we need to recheck the page and acquire the visibilitymap_pin
>> >
>> > So, I've tried pretty hard to reproduce that. While the theory above is
>> > sound, I believe the relevant code path is essentially dead for
>> > SQL-callable code, because we'll always hold a buffer pin before even
>> > entering heap_update/heap_lock_tuple.
>>
>> It is possible that we don't hold any buffer pin before entering
>> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
>> possible when it enters via the simple_heap_update() path. For
>> heap_lock_tuple(), it is possible for ON CONFLICT DO UPDATE statements
>> and maybe others as well.
>
> This is currently listed as a 9.6 open item. Is it indeed a regression in
> 9.6, or do released versions have the same defect? If it is a 9.6
> regression, do you happen to know which commit, or at least which feature,
> caused it?

Commit eca0f1db is the reason for this specific issue.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
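For the archives, a minimal SQL sketch (names hypothetical) of one statement-level path that reaches heap_lock_tuple() without a prior buffer pin, per the ON CONFLICT note above; reproducing the race itself additionally requires a concurrent VACUUM hitting the same page:

    create table t (k int primary key, v int);
    insert into t values (1, 0);
    -- Run while VACUUM is busy setting all-visible bits on t's pages;
    -- the conflict arm locks the existing tuple via heap_lock_tuple().
    insert into t values (1, 1)
    on conflict (k) do update set v = excluded.v;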
[HACKERS] Wanting to learn about pgsql design decision
Dear list,

I'm interested in pgsql and would like to know more about the design decisions behind its features. Where should I search to learn more about subjects such as:

- Why, in the roles system, users are actually roles with the login attribute and not a separate entity.
- Why, to read from a table, are both a usage permission on the schema and a read permission on the table needed? Alternatively, there could be a usage permission on the schema just for altering the schema itself or adding tables to it, without requiring it for selecting from a table inside it.

And other questions of this sort. Thank you very much!
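On the second question, a minimal sketch (role, schema, and table names are made up) showing that both permissions are indeed required:

    create role reader login;
    create schema app;
    create table app.items (id int);
    grant select on app.items to reader;
    -- As reader, this still fails:
    --   select * from app.items;
    --   ERROR:  permission denied for schema app
    grant usage on schema app to reader;
    -- Only now does the select succeed: USAGE on the schema gates name
    -- resolution inside it, SELECT gates access to the table itself.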
Re: [HACKERS] asynchronous and vectorized execution
Thank you for the comment.

At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar wrote in
> On 21 July 2016 at 15:20, Kyotaro HORIGUCHI wrote:
> >
> > After some consideration, I found that ExecAsyncWaitForNode
> > cannot be reentrant, because that would mean control going into
> > async-unaware nodes while not-ready nodes remain, which is an
> > inconsistent state. To inhibit such reentering, I allocated node
> > identifiers in depth-first order so that the ascendant-descendant
> > relationship can be checked in a simple way (the nested-set model),
> > and call ExecAsyncConfigureWait only for the descendant nodes of the
> > parameter planstate.
>
> We have estate->waiting_nodes containing a mix of async-aware and
> non-async-aware nodes. I was thinking, an asynchrony tree would have only
> async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
> Somehow, if we restrict the bubbling up of events only up to the root of
> the asynchrony subtree, do you think we can simplify some of the
> complexities?

The current code prohibits registration of nodes outside the current subtree to avoid the reentering disaster. Indeed, leaving a "waiting node" mark or something like it on every root node at the first visit would let the propagation stop at the root of any async subtree. Nevertheless, when an async child under an inactive async root fires, the new tuple is loaded but not consumed, and a subsequent firing on the same child then leads to a deadlock (without result queueing). That can be avoided, however, if ExecAsyncConfigureWait doesn't register nodes in the ready state.

On the other hand, two or more asynchronous nodes can share a synchronization object. For instance, multiple postgres_fdw scan nodes can share one server connection, and only one of them can be in the waitable state at a time. If no async child in the current async subtree is waitable, the subtree must be stuck. So I think it is crucial for ExecAsyncWaitForNode to force at least one child *in the current async subtree* into the waiting state in such a situation. The ascendant-descendant relationship is necessary to do that anyway. Since we need node ids to detect the ascendant-descendant relationship anyway, and finally have to restrict async nodes with them, activating only descendant nodes from the start makes things simpler than avoiding the possible deadlock later, as described above.

# I implemented it as a per-subtree waiting-node list, but it was
# fragile and too ugly..

> For e.g. ExecAsyncWaitForNode() has become a bit complex seemingly because
> it has to handle non-async nodes also, and that's the reason I believe you
> have introduced modes such as ASYNCCONF_FORCE_ADD.

As explained above, ASYNCCONF_FORCE_ADD is not for non-async nodes, but for sets of async nodes that share a synchronization object. We could let ExecAsyncConfigureWait force-acquire the async object from the start, but that in turn causes possibly unnecessary transfers of a sync object among the nodes sharing it.

I hope the above sentences are readable enough, but any questions are welcome, even about the meaning of a sentence.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
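As a toy SQL illustration of the nested-set numbering described above (the table and values are hypothetical): each planstate gets depth-first (lft, rgt) identifiers, and the ascendant-descendant check becomes a simple interval-containment test.

    create table nodes (name text, lft int, rgt int);
    insert into nodes values
      ('Append',        1, 6),
      ('ForeignScan A', 2, 3),
      ('ForeignScan B', 4, 5);
    -- The descendants of a node are exactly the strictly contained intervals.
    select c.name
      from nodes p, nodes c
     where p.name = 'Append'
       and c.lft > p.lft and c.rgt < p.rgt;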
Re: [HACKERS] Why we lost Uber as a user
On 7/26/16 9:54 AM, Joshua D. Drake wrote:
> Hello,
>
> The following article is a very good look at some of our limitations and
> highlights some of the pains many of us have been working "around" since
> we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
>
> * Inefficient architecture for writes
> * Inefficient data replication
> * Issues with table corruption
> * Poor replica MVCC support
> * Difficulty upgrading to newer releases
>
> It is a very good read and I encourage our hackers to do so with an open
> mind.
>
> Sincerely,
> JD

It was a good read. Having based a high-performance web tracking service as well as a high-performance security appliance on PostgreSQL, I too have been bitten by these issues. I had a few questions that maybe the folks with core knowledge can answer:

1) Would it be possible to create a "star-like" schema to fix this problem? For example, let's say you have a table that is similar to Uber's:

  col0pk, col1, col2, col3, col4, col5

All cols are indexed. Assuming that updates happen to only 1 column at a time, why not figure out some way to encourage or automate the splitting of this table into multiple tables that present themselves as a single table? What I mean is that you would then wind up with the following tables:

  table1: col0pk, col1
  table2: col0pk, col2
  table3: col0pk, col3
  table4: col0pk, col4
  table5: col0pk, col5

Now when you update "col5" on a row, you only have to update the indexes on table5:col5 and table5:col0pk, as opposed to beforehand where you would have to update more indexes. In addition, I believe vacuum overhead would be somewhat mitigated in this case.

2) Why not have a look at how InnoDB does its storage? Would it be possible to do this?

3) For the small-ish table that Uber mentioned, is there a way to "have it in memory" yet provide some level of sync to disk so that it is consistent?

thanks!
-Alfred
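A minimal sketch of the splitting idea in (1), using a view to present the pieces as a single table (all names hypothetical):

    create table t_col1 (col0pk bigint primary key, col1 text);
    create table t_col5 (col0pk bigint primary key, col5 text);
    create view t as
      select c1.col0pk, c1.col1, c5.col5
        from t_col1 c1
        join t_col5 c5 using (col0pk);
    -- Updating col5 now touches only t_col5's heap and its indexes,
    -- not every index of one wide table.
    update t_col5 set col5 = 'x' where col0pk = 42;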
Re: [HACKERS] Why we lost Uber as a user
On 7/28/16 7:08 AM, Merlin Moncure wrote:
> *) postgres may not be the ideal choice for those who want a thin and
> simple database

This is a huge market; addressing it will bring mindshare and more jobs, code, and braintrust to PostgreSQL.

-Alfred
Re: [HACKERS] Why we lost Uber as a user
On 7/28/16 4:39 AM, Geoff Winkless wrote:

On 28 Jul 2016 12:19, "Vitaly Burovoy" wrote:
>
> On 7/28/16, Geoff Winkless wrote:
> > On 27 July 2016 at 17:04, Bruce Momjian wrote:
> >
> >> Well, their big complaint about binary replication is that a bug can
> >> spread from a master to all slaves, which doesn't happen with statement
> >> level replication.
> >
> > I'm not sure that that makes sense to me. If there's a database bug that
> > occurs when you run a statement on the master, it seems there's a decent
> > chance that that same bug is going to occur when you run the same
> > statement on the slave.
> >
> > Obviously it depends on the type of bug and how identical the slave is,
> > but statement-level replication certainly doesn't preclude such a bug
> > from propagating.
> >
> > Geoff
>
> Please, read the article first! The bug is about wrong visibility of
> tuples after applying WAL at slaves.
> For example, you can see two different records selecting from a table
> by a primary key (moreover, their PKs are the same, but other columns
> differ).

I read the article. It affected slaves as well as the master. I quote:

"because of the way replication works, this issue has the potential to spread into all of the databases in a replication hierarchy"

I maintain that this is a nonsense argument. Especially since (as you pointed out and as I missed first time around) the bug actually occurred at different records on different slaves, so he invalidates his own point.

Geoff

Seriously?

There's a valid point here: you're sending over commands at the block level, effectively "write to disk at this location" versus "update this record based on PK"; obviously this has some drawbacks that are reason for concern.

Does it validate the move on its own? NO. Does it add to the reasons to move away? Yes, that much is obvious.

Please read this thread:

https://www.reddit.com/r/programming/comments/4vms8x/why_we_lost_uber_as_a_user_postgresql_mailing_list/d5zx82n

Do I love PostgreSQL? Yes. Have I been bitten by things such as this? Yes. Should the community learn from these things and think of ways to avoid it? Absolutely!

-Alfred
Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans
> -----Original Message-----
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, August 02, 2016 2:45 PM
> To: Kaigai Kouhei(海外 浩平); Ashutosh Bapat
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown
> plans
>
> On 2016/08/02 13:32, Kouhei Kaigai wrote:
> > I wrote:
> >> My concern here is EXPLAIN for foreign joins. I think it's another
> >> problem how we handle Foreign Scan plan nodes representing
> >> post-scan/join operations in EXPLAIN, so I'd like to leave that for
> >> another patch.
>
> > What are the post-scan/join operations? Are you saying EPQ recheck and
> > the alternative local join plan?
>
> No. I mean e.g., aggregation, window functions, sorting, or table
> modification. In other words, Foreign Scan plan nodes created from
> ForeignPath paths created from GetForeignUpperPaths.

Why do you need to handle these upper paths individually? The FDW driver knows whether the remote query contains aggregation, window functions, sorting, or table modification, so it can give a proper label. If the remote query contains partial aggregation plus a join of relations, for example, "Partial Agg + Scan" would be a proper label to fill in the "Foreign %s" token. All you need are the following two enhancements:

- Add "const char *explain_label" on the ForeignScanState, or somewhere else the extension can set it. It gives the label to be printed. If a NULL string is set, EXPLAIN shows "Foreign Scan" as the default.
- Add "Bitmapset *explain_rels" on the ForeignScanState, or somewhere else the extension can set it. It indicates the relations to be printed after the "Foreign %s" token. If you want to print the names of all the relations underlying this ForeignScan node, just copy the scanrelids bitmap. If a NULL bitmap is set, EXPLAIN prints nothing as the default.

Please note that the defaults do not change the existing behavior.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei
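To make the proposal concrete, a sketch of how EXPLAIN output might look with the two fields above (hypothetical output; ft1 and ft2 are assumed foreign tables, and neither field exists in any released API):

    explain (costs off)
      select count(*) from ft1 join ft2 using (id);
    -- With explain_label = "Partial Agg + Scan" and explain_rels covering
    -- both relations, the node might print as:
    --   Foreign Partial Agg + Scan on (ft1, ft2)
    -- With both fields left NULL, the default is unchanged:
    --   Foreign Scan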
Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests
On Tue, Aug 2, 2016 at 10:28 AM, Michael Paquier wrote:
> There is still an issue with pg_basebackup when testing stream mode
> and replication slots. I am digging into this one now..

After 5 hours of running this test in a row, and 30 attempts at torturing hamster with a script running make check in an infinite loop under set -e, I am giving up on that for the time being... I have added the patch that makes the tests more stable to the next CF so that it is not forgotten:

https://commitfest.postgresql.org/10/693/

--
Michael