Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2014-01-08 Thread Michael Paquier
On Thu, Jan 9, 2014 at 12:34 PM, Mark Dilger  wrote:
> Michael Paquier wrote:
>> A possible input for a test that users could provide would be something
>> like that:
>>
>> # Node information for tests
>> nodes
>> {
>> {node1, postgresql.conf params, recovery.conf params}
>> {node2, postgresql.conf params, recovery.conf params, slave of node1}
>> }
>> # Run test
>> init node1
>> run_sql node1 file1.sql
>> # Check output
>> init node2
>> run_sql node2 file2.sql
>> # Check that results are fine
>> # Process
>>
>> The main problem is actually how to do that. Having some smart shell
>> infrastructure would be simple and would facilitate (?) the maintenance
>> of code used to run the tests. On the contrary having a C program would
>> make the maintenance of code to run the tests more difficult (?) for a
>> trade with more readable test suite input like the one I wrote above.
>> This might also make the test input more readable for a human eye, in
>> the shape of what is already available in src/test/isolation.
>
> I like making this part of src/test/isolation, if folks do not object.
> The core infrastructure in src/test/isolation seems applicable to
> replication testing, and I'd hate to duplicate that code.
>
> As for the node setup in your example above, I don't think it can be as
> simple as defining nodes first, then running tests.  The configurations
> themselves may need to be changed during the execution of a test, and
> services stopped and started, all under test control and specified in
> the same easy format.
Yes, my example was very basic :). What you actually need is the
possibility to perform actions on nodes during a test run, basically:
stop, start, init, reload, run SQL, and change params/create new conf
files (putting a node in recovery, for example, would be creating
recovery.conf + restarting). The place of the code does not matter much,
but I don't think it should be part of isolation, as clustering and
isolation are two different test suites. I would, for example, have
seen that as src/test/cluster, with src/test/common for things that are
shared between test infrastructures.
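
As a rough illustration of the kind of per-node control such a suite needs,
here is a minimal shell sketch using only standard tools; the helper names
(node_init, node_start, run_sql) and layout are purely illustrative, not
part of any proposal:

# One data directory and port per node; helper names are hypothetical.
node_init() {                       # node_init <datadir> <port>
    initdb -D "$1" >/dev/null
    echo "port = $2" >> "$1/postgresql.conf"
}
node_start()  { pg_ctl -D "$1" -w start; }
node_stop()   { pg_ctl -D "$1" -w stop; }
node_reload() { pg_ctl -D "$1" reload; }
run_sql()     { psql -X -p "$1" -f "$2" postgres; }  # run_sql <port> <file.sql>

Putting a node into recovery would then just be a matter of writing a
recovery.conf into its data directory and restarting it with the same helpers.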

As mentioned by Steve, the test suite of Slony might be interesting to
look at to get some ideas.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Planning time in explain/explain analyze

2014-01-08 Thread Andreas Karlsson

A patch with updated documentation is attached.

On 01/02/2014 04:08 AM, Robert Haas wrote:

I'm wondering whether the time should be stored inside the PlannedStmt
node instead of passing it around separately. One possible problem
with the way you've done things here is that, in the case of a
prepared statement, EXPLAIN ANALYZE will emit the time needed to call
GetCachedPlan(), even if that function didn't do any replanning.  Now
you could argue that this is the correct behavior, but I think there's
a decent argument that what we ought to show there is the amount of
time that was required to create the plan that we're displaying at the
time it was created, rather than the amount of time that was required
to figure out that we didn't need to replan.

A minor side benefit of this approach is that you wouldn't need to
change the signature for ExplainOnePlan(), which would avoid breaking
extensions that may call it.


A possible argument against printing the time to create the plan is that 
unless it was created when running EXPLAIN we will not know it. I do not 
think we want to always measure the time it took to generate a plan due 
to slow clocks on some architectures. Also I feel that such a patch 
would be more invasive.


Just my reasoning for the current solution. I welcome any opinions about 
how to print planning time for prepared statements since I am not a 
heavy user of them.
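
For illustration, the prepared-statement case under discussion looks roughly
like this (a sketch only, assuming the patch is applied and using the tenk1
table from the regression database; the exact point at which the plan cache
switches to a generic plan is not guaranteed):

psql regression <<'SQL'
PREPARE q(int) AS SELECT * FROM tenk1 WHERE unique1 = $1;
EXPLAIN EXECUTE q(42);  -- early executions plan a fresh custom plan
EXPLAIN EXECUTE q(42);
EXPLAIN EXECUTE q(42);
EXPLAIN EXECUTE q(42);
EXPLAIN EXECUTE q(42);
EXPLAIN EXECUTE q(42);  -- later ones may return a cached generic plan,
                        -- which is where the reported time gets ambiguous
SQL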


--
Andreas Karlsson
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
new file mode 100644
index 2af1738..482490b
*** a/doc/src/sgml/perform.sgml
--- b/doc/src/sgml/perform.sgml
*** EXPLAIN SELECT * FROM tenk1;
*** 89,94 
--- 89,95 
   QUERY PLAN
  -
   Seq Scan on tenk1  (cost=0.00..458.00 rows=1 width=244)
+  Planning time: 0.113 ms
  
 
  
*** EXPLAIN SELECT * FROM tenk1;
*** 162,167 
--- 163,174 
 
  
 
+ The Planning time shown is the time it took to generate
+ the query plan from the parsed query and optimize it. It does not include
+ rewriting and parsing.
+
+ 
+
  Returning to our example:
  
  
*** EXPLAIN SELECT * FROM tenk1;
*** 170,175 
--- 177,183 
   QUERY PLAN
  -
   Seq Scan on tenk1  (cost=0.00..458.00 rows=1 width=244)
+  Planning time: 0.113 ms
  
 
  
*** EXPLAIN SELECT * FROM tenk1 WHERE unique
*** 198,203 
--- 206,212 
  
   Seq Scan on tenk1  (cost=0.00..483.00 rows=7001 width=244)
 Filter: (unique1 < 7000)
+  Planning time: 0.104 ms
  
  
  Notice that the EXPLAIN output shows the WHERE
*** EXPLAIN SELECT * FROM tenk1 WHERE unique
*** 234,239 
--- 243,249 
 Recheck Cond: (unique1 < 100)
 ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
   Index Cond: (unique1 < 100)
+  Planning time: 0.093 ms
  
  
  Here the planner has decided to use a two-step plan: the child plan
*** EXPLAIN SELECT * FROM tenk1 WHERE unique
*** 262,267 
--- 272,278 
 Filter: (stringu1 = 'xxx'::name)
 ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
   Index Cond: (unique1 < 100)
+  Planning time: 0.089 ms
  
  
  The added condition stringu1 = 'xxx' reduces the
*** EXPLAIN SELECT * FROM tenk1 WHERE unique
*** 283,288 
--- 294,300 
  -
   Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244)
 Index Cond: (unique1 = 42)
+  Planning time: 0.076 ms
  
  
  In this type of plan the table rows are fetched in index order, which
*** EXPLAIN SELECT * FROM tenk1 WHERE unique
*** 311,316 
--- 323,329 
 Index Cond: (unique1 < 100)
   ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0)
 Index Cond: (unique2 > 9000)
+  Planning time: 0.094 ms
  
  
  But this requires visiting both indexes, so it's not necessarily a win
*** EXPLAIN SELECT * FROM tenk1 WHERE unique
*** 331,336 
--- 344,350 
 ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..71.27 rows=10 width=244)
   Index Cond: (unique2 > 9000)
   Filter: (unique1 < 100)
+  Planning time: 0.087 ms
  
 
  
*** WHERE t1.unique1 < 10 AND t1.unique2
*** 364,369 
--- 378,384 
 Index Cond: (unique1 < 10)
 ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.91 rows=1 width=244)
   Index Cond: (unique2 = t1.unique2)
+  Planning time: 0.117 ms
  
 
  
*** WHERE t1.unique1 < 10 AND t2.unique2
*** 415,420 
--- 430,436 
 ->  Materialize  (cost=0.29..8.51 ro

Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Robert Treat
On Wed, Jan 8, 2014 at 6:15 PM, Josh Berkus  wrote:
> Stephen,
>
>
>> I'm aware, my point was simply that we should state, up-front in
>> 25.2.7.3 *and* where we document synchronous_standby_names, that it
>> requires at least three servers to be involved to be a workable
>> solution.
>
> It's a workable solution with 2 servers.  That's a "low-availability,
> high-integrity" solution; the user has chosen to double their risk of
> not accepting writes against never losing a write.  That's a perfectly
> valid configuration, and I believe that NTT runs several applications
> this way.
>
> In fact, that can already be looked at as a kind of "auto-degrade" mode:
> if there aren't two nodes, then the database goes read-only.
>
> Might I also point out that transactions are synchronous or not
> individually?  The sensible configuration is for only the important
> writes being synchronous -- in which case auto-degrade makes even less
> sense.
>
> I really think that demand for auto-degrade is coming from users who
> don't know what sync rep is for in the first place.  The fact that other
> vendors are offering auto-degrade as a feature instead of the ginormous
> foot-gun it is adds to the confusion, but we can't help that.
>

I think the problem here is that we tend to have a limited view of
"the right way to use synch rep". If I have 5 nodes, and I set 1
synchronous and the other 3 asynchronous, I've set up a "known
successor" in the event that the leader fails. In this scenario
though, if the "successor" fails, you probably want to keep accepting
writes, since you weren't using synchronous replication for durability
but for operational simplicity. I suspect there are other scenarios
where users are willing to trade latency for improved and/or directed
durability, but not at the expense of availability, don't you?

In fact there are entire systems that provide that type of thing. I
feel like it's worth mentioning that there's a nice primer on tunable
consistency in the Riak docs; strongly recommended.
http://docs.basho.com/riak/1.1.0/tutorials/fast-track/Tunable-CAP-Controls-in-Riak/.
I'm not entirely sure how well it maps into our problem space, but it
at least gives you a sane working model to think about. If you were
trying to explain the Postgres case, async is like the N value (I want
the data to end up on this many nodes eventually) and sync is like the
W value (it must be written to this many nodes, or it should fail). Of
course, we only offer an R = 1, W = 1 or 2, and N = all. And it's
worse than that, because we have golden nodes.

This isn't to say there isn't a lot of confusion around the issue.
Designing, implementing, and configuring different guarantees in the
presence of node failures is a non-trivial problem. Still, I'd prefer
to see Postgres head in the direction of providing more options in
this area rather than drawing a firm line at being a CP-oriented
system.

Robert Treat
play: xzilla.net
work: omniti.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] In-core regression tests for replication, cascading, archiving, PITR, etc.

2014-01-08 Thread Mark Dilger
Michael Paquier wrote:
> A possible input for a test that users could provide would be something like
> that:
>
> # Node information for tests
> nodes
> {
> {node1, postgresql.conf params, recovery.conf params}
> {node2, postgresql.conf params, recovery.conf params, slave of node1}
> }
> # Run test
> init node1
> run_sql node1 file1.sql
> # Check output
> init node2
> run_sql node2 file2.sql
> # Check that results are fine
> # Process
>
> The main problem is actually how to do that. Having some smart shell
> infrastructure would be simple and would facilitate (?) the maintenance
> of code used to run the tests. On the contrary having a C program would
> make the maintenance of code to run the tests more difficult (?) for a
> trade with more readable test suite input like the one I wrote above.
> This might also make the test input more readable for a human eye, in
> the shape of what is already available in src/test/isolation.

I like making this part of src/test/isolation, if folks do not object.
The core infrastructure in src/test/isolation seems applicable to 
replication testing, and I'd hate to duplicate that code.

As for the node setup in your example above, I don't think it can be as
simple as defining nodes first, then running tests.  The configurations
themselves may need to be changed during the execution of a test, and
services stopped and started, all under test control and specified in
the same easy format.

I have started working on this, and will post WIP patches from time to
time, unless you all feel the need to point me in a different direction.


mark






On Sunday, January 5, 2014 6:13 PM, Michael Paquier wrote:

On Mon, Jan 6, 2014 at 4:51 AM, Mark Dilger  wrote:
> I am building a regression test system for replication and came across
> this email thread.  I have gotten pretty far into my implementation, but
> would be happy to make modifications if folks have improvements to
> suggest.  If the community likes my design, or a modified version based
> on your feedback, I'd be happy to submit a patch.
Yeah, this would be nice to look at; core code definitely needs to have some 
more infrastructure for such a test suite. I haven't had time to go back to 
it since I began this thread, though :)

> Currently I am cannibalizing src/test/pg_regress.c, but that could instead
> be copied to src/test/pg_regress_replication.c or whatever.  The regression
> test creates and configures multiple database clusters, sets up the
> replication configuration for them, runs them each in nonprivileged mode
> and bound to different ports, feeds all the existing 141 regression tests
> into the master database with the usual checking that all the right results
> are obtained, and then checks that the standbys have the expected
> data.  This is possible all on one system because the database clusters
> are chroot'ed to see their own /data directory and not the /data directory
> of the other chroot'ed clusters, although the rest of the system, like /bin
> and /etc and /dev are all bind mounted and visible to each cluster.
Having vanilla regressions run in a cluster with multiple nodes and checking the 
results on a standby is only the tip of the iceberg though. What I had in mind when 
I began this thread was to have more than a copy/paste of pg_regress, but an 
infrastructure that people could use to create and customize tests by having an 
additional control layer on the cluster itself. For example, testing 
replication is not only a matter of creating and setting up the nodes, but you 
might want to be able to initialize, add, remove nodes during the tests. Node 
addition would be either a new fresh master (this would be damn useful for a 
test suite for logical replication I think), or a slave node with custom 
recovery parameters to test replication, as well as PITR, archiving, etc. Then 
you need to be able to run SQL commands on top of that to check if the results 
are consistent with what you want.

A possible input for a test that users could provide would be something like 
that:
# Node information for tests
nodes
{
    {node1, postgresql.conf params, recovery.conf params}
    {node2, postgresql.conf params, recovery.conf params, slave of node1}
}
# Run test
init node1
run_sql node1 file1.sql
# Check output
init node2
run_sql node2 file2.sql
# Check that results are fine
# Process


The main problem is actually how to do that. Having some smart shell 
infrastructure would be simple and would facilitate (?) the maintenance of code 
used to run the tests. On the contrary having a C program would make the 
maintenance of code to run the tests more difficult (?) for a trade with more 
readable test suite input like the one I wrote above. This might also make the 
test input more readable for a human eye, in the shape of what is already 
available in src/test/isolation.


Another possibility could also be to integrate a recovery/backup manager 
directly in PG core, and have some tests for it, or e

Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb

2014-01-08 Thread Tom Lane
Peter Eisentraut  writes:
> pg_upgrade creates a script analyze_new_cluster.{sh|bat} that runs
> vacuumdb --analyze-only in three stages with different statistics target
> settings to get a fresh cluster analyzed faster.  I think this behavior
> is also useful for clusters or databases freshly created by pg_restore
> or any other loading mechanism, so it's suboptimal to have this
> constrained to pg_upgrade.

> Therefore, I suggest to add this functionality into the vacuumdb
> program.

Seems reasonable.

> There are some details to be considered about how pg_upgrade would call
> this.  For example, would we keep creating the script and just have the
> script call vacuumdb with the new option, or would we skip the script
> altogether and just print a message from pg_upgrade?  Also, pg_upgrade
> contains logic to run a vacuum (not only analyze) in the final run when
> upgrading from PostgreSQL <8.4 to deal with the freespace map.  Not sure
> how to adapt that; maybe just keep the script and run a non-analyze
> vacuum after the analyze.

I don't think this vacuumdb feature should deal with any
version-conversion issues.  So it sounds like the thing to do is keep the
wrapper script, which will give us a place to put any such special actions
without having to kluge up vacuumdb's behavior.  That'll avoid breaking
scripts that users might've built for using pg_upgrade, too.
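
For reference, what the generated script does today is roughly the following,
expressed with plain vacuumdb (a sketch; the statistics targets mirror the
pg_upgrade script, and the eventual vacuumdb option name is still to be decided):

# Three-stage analyze: coarse statistics first so the cluster becomes usable
# quickly, then progressively better statistics.
PGOPTIONS="-c default_statistics_target=1 -c vacuum_cost_delay=0" \
    vacuumdb --all --analyze-only
PGOPTIONS="-c default_statistics_target=10 -c vacuum_cost_delay=0" \
    vacuumdb --all --analyze-only
vacuumdb --all --analyze-only    # final pass with the configured target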

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-01-08 Thread Alvaro Herrera
Craig Ringer escribió:

> Instead, can't we use your already proposed subclause structure?
> 
> {"authorization":{"authorization_role":"some guy",
>   "output":"AUTHORIZATION %i{authorization_role}"},
>  "if_not_exists": {"output": "IF NOT EXISTS"},
>  "name":"some schema",
>  "output":"CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}"}
> 
> i.e. "if_not_exists" becomes an object. All clauses are objects, all
> non-object values are user data. (right?). If the clause is absent, the
> "output" key is the empty string.
> 
> The issue with that (and with your original proposal) is that you can't
> tell what these clauses are supposed to be if they're not present in the
> original query. You can't *enable* "IF NOT EXISTS" without pulling
> knowledge of that syntax from somewhere else.
> 
> Depending on the problem you intend to solve there, that might be fine.

Hmm.  This seems like a reasonable thing to do, except that I would like
the "output" to always be the constant, and have some other way to
enable the clause or disable it.  With your "present" boolean, it would be
something like:

"if_not_exists": {"output": "IF NOT EXISTS",
  "present": true/false}

In fact, I'm now wondering whether this is a better idea than not
emitting anything when some element in the output expands to NULL, so it
would apply to "authorization" as well.  If the command includes the
clause, it'd be

 {"authorization":{"authorization_role":"some guy",
   "present": true,
   "output":"AUTHORIZATION %i{authorization_role}"},

and if there wasn't anything, you'd have

 {"authorization":{"authorization_role": null,
   "present": false,
   "output":"AUTHORIZATION %i{authorization_role}"},

so if you want to turn it on and it wasn't, you need to change both the
present boolean and also set the authorization_role element; and if you
want to turn it off when it was present, just set present to false.

> Am I just over-complicating something simple here?

I think it's a fair point.

> My reasoning is that it'd be good to be able to easily tell the
> difference between *structure* and *user data* in these query trees and
> do so without possibly version-specific and certainly
> syntax/clause-specific knowledge about the meaning of every key of every
> clause.

Sounds reasonable.

-- 
Álvaro Herrera      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] newlines at end of generated SQL

2014-01-08 Thread Peter Eisentraut
Is there a reason why the programs in src/bin/scripts all put newlines
at the end of the SQL commands they generate?  This produces useless
empty lines in the server log (and client output, if selected).




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] commit fest manager?

2014-01-08 Thread Fabrízio de Royes Mello
On Wed, Jan 8, 2014 at 8:43 PM, Joshua D. Drake wrote:
>
>
> On 01/08/2014 02:38 PM, Josh Berkus wrote:
>>
>>
>> On 01/08/2014 02:04 PM, Peter Eisentraut wrote:
>>>
>>> Anyone else?
>>>
>>> Or you'll have to deal with me again?
>>>
>>
>> I vote for Peter.
>>
>
> +1
>

+1

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] Add CREATE support to event triggers

2014-01-08 Thread Craig Ringer
On 01/09/2014 04:42 AM, Alvaro Herrera wrote:

> If there's a NULL element when expanding an object, the whole thing
> expands to empty.  For example, if no AUTHORIZATION
> clause is specified, "authorization" element is still there, but the
> "authorization_role" element within it is NULL, and so the whole
> AUTHORIZATION clause expands to empty and the resulting command contains
> no authorization clause.

I'd like to see this applied consistently to argument-less clauses like
IF NOT EXISTS too. So the same rules apply.

> IF NOT EXISTS is handled by defining it to either the string IF NOT
> EXISTS or to empty if no such clause was specified.

I'm not keen on this bit. It puts clauses of syntax into value strings
other than the special "output" key. Those keys aren't easily
distinguished from user data without clause specific knowledge. I'm not
keen on that.

Instead, can't we use your already proposed subclause structure?

{"authorization":{"authorization_role":"some guy",
  "output":"AUTHORIZATION %i{authorization_role}"},
 "if_not_exists": {"output": "IF NOT EXISTS"},
 "name":"some schema",
 "output":"CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}"}

i.e. "if_not_exists" becomes an object. All clauses are objects, all
non-object values are user data. (right?). If the clause is absent, the
"output" key is the empty string.

The issue with that (and with your original proposal) is that you can't
tell what these clauses are supposed to be if they're not present in the
original query. You can't *enable* "IF NOT EXISTS" without pulling
knowledge of that syntax from somewhere else.

Depending on the problem you intend to solve there, that might be fine.

If it isn't, then instead there just needs to be a key to flag such
clauses as present or not.


{"authorization":{"authorization_role":"some guy",
  "output":"AUTHORIZATION %i{authorization_role}"},
 "if_not_exists": {"output": "IF NOT EXISTS"
   "present": true},
 "name":"some schema",
 "output":"CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}"}

Am I just over-complicating something simple here?

My reasoning is that it'd be good to be able to easily tell the
difference between *structure* and *user data* in these query trees and
do so without possibly version-specific and certainly
syntax/clause-specific knowledge about the meaning of every key of every
clause.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simple improvements to freespace allocation

2014-01-08 Thread Jim Nasby

On 1/8/14, 1:43 AM, Heikki Linnakangas wrote:

I've wanted the cluster case for a long time. I also see the use for the RECENT 
scenario, especially if we had CLUSTER CONCURRENT that let you shrink the head 
of the table as needed.

I suspect the in-memory case would only be useful if it could look into the OS 
cache as well, at least until we can recommend you give Postgres 90% of memory 
instead of 25%. Even then, I'm not sure how useful it would ultimately be...


* PACK
We want the table to be smaller, so rather than run a VACUUM FULL we
want to force the table to choose an NBT at start of table, even at
the expense of concurrency. By avoiding putting new data at the top of
the table we allow the possibility that VACUUM will shrink table size.
This is same as current except we always reset the FSM pointer to zero
and re-seek from there. This takes some time to have an effect, but is
much less invasive than VACUUM FULL.


We already reset the FSM pointer to zero on vacuum. Would the above actually 
make any difference in practice?


What if your first request is for a large chunk of free space? You could skip a 
lot of blocks, even if the FSM is bucketized.

But there's probably a more important point to this one: for you to have any 
chance of packing you MUST get everything out of the tail of the table. 
Resetting to zero on every request is one possible way to do that, though it 
might be better to do something like reset only once the pointer goes past 
block X. The other thing you'd want is a way to force tuples off the last X 
pages. Due to a lack of ctid operators that was already hard, and HOT makes 
that even harder (BTW, related to this you'd ideally want HOT to continue to 
operate on the front of the table, but not the back.)

All that said, I've definitely wanted the ability to shrink tables in the past, 
though TBH I've wanted that more for indexes.

Ultimately, what I really want on this front is:

PACK TABLE blah BACKGROUND;
CLUSTER TABLE blah BACKGROUND;
REINDEX INDEX blah BACKGROUND;

where BACKGROUND would respect a throttle setting. (While I'm dreaming, it'd be 
nice to have DATABASE/TABLESPACE/SCHEMA alternate specifications too...)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested hstore patch

2014-01-08 Thread Jim Nasby

On 12/23/13, 9:47 AM, Pavel Stehule wrote:

Has anybody looked into how hard it would be to add "method" notation
to postgreSQL, so that instead of calling

getString(hstorevalue, n)

we could use

hstorevalue.getString(n)


yes, I played with it some years ago. I stopped early; there was a problem with 
the parser when I tried to append a new rule. And because there was no simple 
solution, I didn't continue.

But it can be nice feature - minimally for plpgsql coders.


Isn't there also some major problem with differentiating between 
schema/table/field with that too? I recall discussion along those lines, though 
maybe it was for the idea of recursive schemas.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Jim Nasby

On 1/8/14, 6:05 PM, Tom Lane wrote:

Josh Berkus  writes:

>On 01/08/2014 03:27 PM, Tom Lane wrote:

>>What we lack, and should work on, is a way for sync mode to have M larger
>>than one.  AFAICS, right now we'll report commit as soon as there's one
>>up-to-date replica, and some high-reliability cases are going to want
>>more.

>"Sync N times" is really just a guarantee against data loss as long as
>you lose N-1 servers or fewer.  And it becomes an even
>lower-availability solution if you don't have at least N+1 replicas.
>For that reason, I'd like to see some realistic actual user demand
>before we take the idea seriously.

Sure.  I wasn't volunteering to implement it, just saying that what
we've got now is not designed to guarantee data survival across failure
of more than one server.  Changing things around the margins isn't
going to improve such scenarios very much.

It struck me after re-reading your example scenario that the most
likely way to figure out what you had left would be to see if some
additional system (think Nagios monitor, or monitors) had records
of when the various database servers went down.  This might be
what you were getting at when you said "logging", but the key point
is it has to be logging done on an external server that could survive
failure of the database server.  postmaster.log ain't gonna do it.


Yeah, and I think that the logging command that was suggested allows for that 
*if configured correctly*.

Automatic degradation to async is useful for protecting you against all modes 
of a single failure: Master fails, you've got the replica. Replica fails, 
you've got the master.

But fit hits the shan as soon as you get a double failure, and that double 
failure can be very subtle. Josh's case is not subtle: You lost power AND the 
master died. You KNOW you have two failures.

But what happens if there's a network blip that's not large enough to notice 
(but large enough to degrade your replication) and the master dies? Now you 
have no clue if you've lost data.

Compare this to async: if the master goes down (one failure), you have zero 
clue if you lost data or not. At least with auto-degradation you know you have 
to have 2 failures to suffer data loss.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
Josh Berkus  writes:
> On 01/08/2014 03:27 PM, Tom Lane wrote:
>> What we lack, and should work on, is a way for sync mode to have M larger
>> than one.  AFAICS, right now we'll report commit as soon as there's one
>> up-to-date replica, and some high-reliability cases are going to want
>> more.

> "Sync N times" is really just a guarantee against data loss as long as
> you lose N-1 servers or fewer.  And it becomes an even
> lower-availability solution if you don't have at least N+1 replicas.
> For that reason, I'd like to see some realistic actual user demand
> before we take the idea seriously.

Sure.  I wasn't volunteering to implement it, just saying that what
we've got now is not designed to guarantee data survival across failure
of more than one server.  Changing things around the margins isn't
going to improve such scenarios very much.

It struck me after re-reading your example scenario that the most
likely way to figure out what you had left would be to see if some
additional system (think Nagios monitor, or monitors) had records
of when the various database servers went down.  This might be
what you were getting at when you said "logging", but the key point
is it has to be logging done on an external server that could survive
failure of the database server.  postmaster.log ain't gonna do it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Jeff Janes
On Wed, Jan 8, 2014 at 2:56 PM, Stephen Frost  wrote:

> * Andres Freund (and...@2ndquadrant.com) wrote:
> > That's why you should configure a second standby as another (candidate)
> > synchronous replica, also listed in synchronous_standby_names.
>
> Perhaps we should stress in the docs that this is, in fact, the *only*
> reasonable mode in which to run with sync rep on?


I don't think it is the only reasonable way to run it.  Most of the time
that the master can't communicate with rep1, it is because of a network
problem.  So, the master probably can't talk to rep2 either, and adding the
second one doesn't really get you all that much in terms of availability.

Cheers,

Jeff


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Josh Berkus
On 01/08/2014 03:27 PM, Tom Lane wrote:
> Good point, but C can't solve this for you just by logging.  If C was the
> first to go down, it has no way to know whether A and B committed more
> transactions before dying; and it's unlikely to have logged its own crash,
> either.

Sure.  But if we *knew* that C was not in synchronous mode when it went
down, then we'd expect some data loss.  As you point out, though, the
converse is not true; even if C was in sync mode, we don't know that
there's been no data loss, since B could come back up as a sync replica
before going down again.

> What we lack, and should work on, is a way for sync mode to have M larger
> than one.  AFAICS, right now we'll report commit as soon as there's one
> up-to-date replica, and some high-reliability cases are going to want
> more.

Yeah, we talked about having this when sync rep originally went in.  It
involves a LOT more bookkeeping on the master though, which is why nobody
has been willing to attempt it -- and why we went with the
single-replica solution in the first place.  Especially since most
people who want "quorum sync" really want MM replication anyway.

"Sync N times" is really just a guarantee against data loss as long as
you lose N-1 servers or fewer.  And it becomes an even
lower-availability solution if you don't have at least N+1 replicas.
For that reason, I'd like to see some realistic actual user demand
before we take the idea seriously.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Jeff Janes
On Wed, Jan 8, 2014 at 2:23 PM, Joshua D. Drake wrote:

>
> On 01/08/2014 01:55 PM, Tom Lane wrote:
>
>> Sync mode is about providing a guarantee that the data exists on more than
>> one server *before* we tell the client it's committed.  If you don't need
>> that guarantee, you shouldn't be using sync mode.  If you do need it,
>> it's not clear to me why you'd suddenly not need it the moment the going
>> actually gets tough.
>>
>
> As I understand it what is being suggested is that if a subscriber or
> target goes down, then the master will just sit there and wait. When I read
> that, I read that the master will no longer process write transactions. If
> I am wrong in that understanding then cool. If I am not then that is a
> serious problem with a production scenario. There is an expectation that a
> master will continue to function if the target is down, synchronous or not.
>

My expectation is that the master stops writing checks when it finds it can
no longer cash them.

Cheers,

Jeff


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
Josh Berkus  writes:
> HOWEVER, we've already kind of set up an indeterminate situation with
> allowing sync rep groups and candidate sync rep servers.  Consider this:

> 1. Master server A is configured with sync replica B and candidate sync
> replica C

> 2. A rolling power/network failure event occurs, which causes B and C to
> go down sometime before A, and all of them to go down before the
> application does.

> 3. On restore, only C is restorable; both A and B are a total loss.

> Again, we have no way to know whether or not C was in sync replication
> when it went down.  If C went down before B, then we've lost data; if B
> went down before C, we haven't.  But we can't find out.  *This* is where
> it would be useful to have C log whenever it went into (or out of)
> synchronous mode.

Good point, but C can't solve this for you just by logging.  If C was the
first to go down, it has no way to know whether A and B committed more
transactions before dying; and it's unlikely to have logged its own crash,
either.

More fundamentally, if you want to survive the failure of M out of N
nodes, you need a sync configuration that guarantees data is on at least
M+1 nodes before reporting commit.  The above example doesn't meet that,
so it's not surprising that you're screwed.

What we lack, and should work on, is a way for sync mode to have M larger
than one.  AFAICS, right now we'll report commit as soon as there's one
up-to-date replica, and some high-reliability cases are going to want
more.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Josh Berkus
On 01/08/2014 03:18 PM, Stephen Frost wrote:
> Do you really feel that a WARNING and increasing the docs to point
> out that three systems are necessary, particularly under the 'high
> availability' documentation and options, is a bad idea?  I fail to see
> how that does anything but clarify the use-case for our users.

I think the warning is dumb, and that the suggested documentation change
is insufficient.  If we're going to clarify things, then we need to have
a full-on several-page doc showing several examples of different sync
rep configurations and explaining their tradeoffs (including the
different sync modes and per-transaction sync).  Anything short of that
is just going to muddy the waters further.

Mind you, someone needs to take a machete to the HA section of the docs
anyway.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Stephen Frost
Josh,

* Josh Berkus (j...@agliodbs.com) wrote:
> > I'm aware, my point was simply that we should state, up-front in
> > 25.2.7.3 *and* where we document synchronous_standby_names, that it
> > requires at least three servers to be involved to be a workable
> > solution.
> 
> It's a workable solution with 2 servers.  That's a "low-availability,
> high-integrity" solution; the user has chosen to double their risk of
> not accepting writes against never losing a write.  That's a perfectly
> valid configuration, and I believe that NTT runs several applications
> this way.

I really don't agree with that when the standby going offline can take
out the master.  Note that I didn't say we shouldn't allow it, but I
don't think we should accept that it's a real-world solution.

> I really think that demand for auto-degrade is coming from users who
> don't know what sync rep is for in the first place.  The fact that other
> vendors are offering auto-degrade as a feature instead of the ginormous
> foot-gun it is adds to the confusion, but we can't help that.

Do you really feel that a WARNING and increasing the docs to point
out that three systems are necessary, particularly under the 'high
availability' documentation and options, is a bad idea?  I fail to see
how that does anything but clarify the use-case for our users.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
Stephen Frost  writes:
> I'm aware, my point was simply that we should state, up-front in
> 25.2.7.3 *and* where we document synchronous_standby_names, that it
> requires at least three servers to be involved to be a workable
> solution.

It only requires that if your requirements include both redundant
data storage and tolerating single-node failure.  Now admittedly,
most people who want replication want it so they can have failure
tolerance, but I don't think it's insane to say that you want to
stop accepting writes if either node of a 2-node server drops out.
If you can only afford two nodes, and you need guaranteed redundancy
for business reasons, then that's where you end up.

Or in short, I'm against throwing warnings for this kind of setup.
I do agree that we need some doc improvements, since this is
evidently not clear enough yet.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Josh Berkus
Stephen,


> I'm aware, my point was simply that we should state, up-front in
> 25.2.7.3 *and* where we document synchronous_standby_names, that it
> requires at least three servers to be involved to be a workable
> solution.

It's a workable solution with 2 servers.  That's a "low-availability,
high-integrity" solution; the user has chosen to double their risk of
not accepting writes against never losing a write.  That's a perfectly
valid configuration, and I believe that NTT runs several applications
this way.

In fact, that can already be looked at as a kind of "auto-degrade" mode:
if there aren't two nodes, then the database goes read-only.

Might I also point out that transactions are synchronous or not
individually?  The sensible configuration is for only the important
writes being synchronous -- in which case auto-degrade makes even less
sense.
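
To make the per-transaction point concrete, a minimal sketch (the tables are
made up for illustration):

psql <<'SQL'
BEGIN;
SET LOCAL synchronous_commit TO local;  -- unimportant write: don't wait for the standby
INSERT INTO app_log VALUES (now(), 'click');          -- hypothetical table
COMMIT;

BEGIN;
SET LOCAL synchronous_commit TO on;     -- important write: wait for the sync standby
INSERT INTO payments VALUES (42, 'credit', 100.00);   -- hypothetical table
COMMIT;
SQL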

I really think that demand for auto-degrade is coming from users who
don't know what sync rep is for in the first place.  The fact that other
vendors are offering auto-degrade as a feature instead of the ginormous
foot-gun it is adds to the confusion, but we can't help that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
> On 2014-01-08 17:56:37 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@2ndquadrant.com) wrote:
> > > That's why you should configure a second standby as another (candidate)
> > > synchronous replica, also listed in synchronous_standby_names.
> > 
> > Perhaps we should stress in the docs that this is, in fact, the *only*
> > reasonable mode in which to run with sync rep on?  Where there are
> > multiple replicas, because otherwise Drake is correct that you'll just
> > end up having both nodes go offline if the slave fails.
> 
> Which, as it happens, is actually documented.

I'm aware, my point was simply that we should state, up-front in
25.2.7.3 *and* where we document synchronous_standby_names, that it
requires at least three servers to be involved to be a workable
solution.

Perhaps we should even log a warning if only one value is found in
synchronous_standby_names...

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Andres Freund
On 2014-01-08 14:52:07 -0800, Joshua D. Drake wrote:
> On 01/08/2014 02:46 PM, Andres Freund wrote:
> >>The idea is that we know that data on db0 is not written until we know for a
> >>fact that db1 also has that data. That is great and a guarantee of data
> >>integrity between the two nodes.
> >
> >That guarantee is never there. The only thing guaranteed is that the
> >client isn't notified of the commit until db1 has received the data.
> 
> Well ugh on that.. but that is for another reply.

You do realize that locally you have the same guarantees? If the client
didn't receive a reply to a COMMIT you won't know whether the tx
committed or not. If that's not sufficient you need to use 2pc and a
transaction manager.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Josh Berkus
On 01/08/2014 01:49 PM, Tom Lane wrote:
> Josh Berkus  writes:
>> If we really want auto-degrading sync rep, then we'd (at a minimum) need
>> a way to determine *from the replica* whether or not it was in degraded
>> mode when the master died.  What good do messages to the master log do
>> you if the master no longer exists?
> 
> How would it be possible for a replica to know whether the master had
> committed more transactions while communication was lost, if the master
> dies without ever restoring communication?  It sounds like pie in the
> sky from here ...

Oh, right.  Because the main reason for a sync replica degrading is that
it's down.  In which case it isn't going to record anything.  This would
still be useful for sync rep candidates, though, and I'll document why
below.  But first, lemme demolish the case for auto-degrade.

So here's the case that we can't possibly solve for auto-degrade.
Anyone who wants auto-degrade needs to come up with a solution for this
case as a first requirement:

1. A data center network/power event starts.

2. The sync replica goes down.

3. A short time later, the master goes down.

4. Data center power is restored.

5. The master is fried and is a permanent loss.  The replica is ok, though.

Question: how does the DBA know whether data has been lost or not?

With current sync rep, it's easy: no data was lost, because the master
stopped accepting writes once the replica went down.  If we support
auto-degrade, though, there's no way to know; the replica doesn't have
that information, and anything which was on the master is permanently
lost.  And the point several people have made is: if you can live with
indeterminacy, then you're better off with async rep in the first place.

Now, what we COULD definitely use is a single-command way of degrading
the master when the sync replica is down.  Something like "ALTER SYSTEM
DEGRADE SYNC".  Right now you have to push a change to the conf file and
reload, and there's no way to salvage the transaction which triggered
the sync failure.  This would be a nice 9.5 feature.
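
For concreteness, the manual path referred to above looks roughly like this
(a sketch; the path and the exact edit are illustrative):

# Manual "degrade" today: empty synchronous_standby_names and reload.
sed -i "s/^synchronous_standby_names.*/synchronous_standby_names = ''/" \
    "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" reload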

HOWEVER, we've already kind of set up an indeterminate situation with
allowing sync rep groups and candidate sync rep servers.  Consider this:

1. Master server A is configured with sync replica B and candidate sync
replica C

2. A rolling power/network failure event occurs, which causes B and C to
go down sometime before A, and all of them to go down before the
application does.

3. On restore, only C is restorable; both A and B are a total loss.

Again, we have no way to know whether or not C was in sync replication
when it went down.  If C went down before B, then we've lost data; if B
went down before C, we haven't.  But we can't find out.  *This* is where
it would be useful to have C log whenever it went into (or out of)
synchronous mode.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Andres Freund
On 2014-01-08 17:56:37 -0500, Stephen Frost wrote:
> * Andres Freund (and...@2ndquadrant.com) wrote:
> > That's why you should configure a second standby as another (candidate)
> > synchronous replica, also listed in synchronous_standby_names.
> 
> Perhaps we should stress in the docs that this is, in fact, the *only*
> reasonable mode in which to run with sync rep on?  Where there are
> multiple replicas, because otherwise Drake is correct that you'll just
> end up having both nodes go offline if the slave fails.

Which, as it happens, is actually documented.

http://www.postgresql.org/docs/devel/static/warm-standby.html#SYNCHRONOUS-REPLICATION
25.2.7.3. Planning for High Availability

"Commits made when synchronous_commit is set to on or remote_write will
wait until the synchronous standby responds. The response may never
occur if the last, or only, standby should crash.

The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming
multiple potential synchronous standbys using
synchronous_standby_names. The first named standby will be used as the
synchronous standby. Standbys listed after this will take over the role
of synchronous standby if the first one should fail."
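
A minimal sketch of that configuration on the primary (names and paths are
illustrative):

# postgresql.conf on the primary: two candidate synchronous standbys.
# "standby1" is synchronous while connected; "standby2" takes over if it fails.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
synchronous_standby_names = 'standby1, standby2'
synchronous_commit = on
EOF
pg_ctl -D "$PGDATA" reload

Each standby identifies itself through application_name in its
primary_conninfo, e.g. 'host=primary application_name=standby1 ...'.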


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Joshua D. Drake


On 01/08/2014 02:49 PM, Tom Lane wrote:


Then you don't understand the point of sync mode, and you shouldn't be
using it.  The point is *exactly* to refuse to commit transactions unless
we can guarantee the data's been replicated.


I understand exactly that and I don't disagree, except in the case where 
it is going to bring down the master (see my further reply). I now 
remember arguing about this a few years ago when we started down the 
sync path.


Anyway, perhaps this is just something of a knob that can be turned. We 
don't have to continue the argument. Thank you for considering what I 
was saying.


Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
> That's why you should configure a second standby as another (candidate)
> synchronous replica, also listed in synchronous_standby_names.

Perhaps we should stress in the docs that this is, in fact, the *only*
reasonable mode in which to run with sync rep on?  Where there are
multiple replicas, because otherwise Drake is correct that you'll just
end up having both nodes go offline if the slave fails.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
Andres Freund  writes:
> On 2014-01-08 14:42:37 -0800, Joshua D. Drake wrote:
>> Using the model (as I understand it) that is being discussed we have
>> increased our failure rate because the moment db1:down we also lose db0. The
>> node db0 may be up but if it isn't going to process transactions it is
>> useless. I can tell you that I have exactly 0 customers that would want that
>> model because a single node failure would cause a double node failure.

> That's why you should configure a second standby as another (candidate)
> synchronous replica, also listed in synchronous_standby_names.

Right.  If you want to tolerate one node failure, *and* have a guarantee
that committed data is on at least two nodes, you need at least three
nodes.  Simple arithmetic.  If you only have two nodes, you only get to
have one of those properties.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Joshua D. Drake


On 01/08/2014 02:46 PM, Andres Freund wrote:


db0->db1 in synchronous mode

The idea is that we know that data on db0 is not written until we know for a
fact that db1 also has that data. That is great and a guarantee of data
integrity between the two nodes.


That guarantee is never there. The only thing guaranteed is that the
client isn't notified of the commit until db1 has received the data.


Well ugh on that.. but that is for another reply.



That's why you should configure a second standby as another (candidate)
synchronous replica, also listed in synchronous_standby_names.


I don't have a response to this that does not involve a great deal of 
sarcasm.


Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested hstore patch (sgml typo)

2014-01-08 Thread Erik Rijkers
On Wed, January 8, 2014 22:29, Oleg Bartunov wrote:
> Attached is a new version of the patch, which addresses most issues raised
> by Andres.

> [ nested_hstore-0.42.patch.gz  ]

Building documentation fails:

openjade:hstore.sgml:1010:18:E: end tag for element "A" which is not open
openjade:hstore.sgml:1011:13:E: document type does not allow element "TYPE" here
openjade:hstore.sgml:1012:8:E: document type does not allow element "TYPE" here
openjade:hstore.sgml:1012:27:E: document type does not allow element "TYPE" here
openjade:hstore.sgml:1013:15:E: document type does not allow element 
"PROGRAMLISTING" here
openjade:hstore.sgml:1024:8:E: end tag for "TYPE" omitted, but OMITTAG NO was 
specified
openjade:hstore.sgml:1010:3: start tag was here
make: *** [HTML.index] Error 1
make: *** Deleting file `HTML.index'

This is caused by a small tag typo.

The attached fixes that hstore.sgml typo.

thanks,

Erikjan


--- doc/src/sgml/hstore.sgml.orig	2014-01-08 23:32:29.493548857 +0100
+++ doc/src/sgml/hstore.sgml	2014-01-08 23:33:02.554527949 +0100
@@ -1007,7 +1007,7 @@
 
   
But populate_record() supports more complicated records and nested
-   hstore values, as well. It makes an effort to convert
+   hstore values, as well. It makes an effort to convert
from hstore data types to PostgreSQL types, including arrays,
json, and hstore values:
 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
"Joshua D. Drake"  writes:
> On 01/08/2014 01:55 PM, Tom Lane wrote:
>> Sync mode is about providing a guarantee that the data exists on more than
>> one server *before* we tell the client it's committed.  If you don't need
>> that guarantee, you shouldn't be using sync mode.  If you do need it,
>> it's not clear to me why you'd suddenly not need it the moment the going
>> actually gets tough.

> As I understand it what is being suggested is that if a subscriber or 
> target goes down, then the master will just sit there and wait. When I 
> read that, I read that the master will no longer process write 
> transactions. If I am wrong in that understanding then cool. If I am not 
> then that is a serious problem with a production scenario. There is an 
> expectation that a master will continue to function if the target is 
> down, synchronous or not.

Then you don't understand the point of sync mode, and you shouldn't be
using it.  The point is *exactly* to refuse to commit transactions unless
we can guarantee the data's been replicated.

There might be other interpretations of "synchronous replication" in which
it makes sense to continue accepting transactions whether or not there are
any up-to-date replicas; but in the meaning Postgres ascribes to the term,
it does not make sense.  You should just use async mode if that behavior
is what you want.

Possibly we need to rename "synchronous replication", or document it
better.  And I don't have any objection in principle to developing
additional replication modes that offer different sets of guarantees and
performance tradeoffs.  But for the synchronous mode that we've got, the
proposed switch is insane, and asking for it merely proves that you don't
understand the difference between async and sync modes.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-01-08 Thread Alvaro Herrera
CC to hackers restored.

Pavel Stehule escribió:
> Dne 8.1.2014 23:17 "Alvaro Herrera"  napsal(a):
> >
> > Pavel Stehule escribió:
> > > Hello
> > >
> > > I don't like this direction. What can we do with JSON from plpgsql?
> >
> > We have plenty of JSON functions and operators in SQL, and more to come
> > soon.  Is that not enough?
> 
> No, it is not. I'm sure. It is wrong to require parsing of system-internal
> data that is available in structured form. You create a string that then
> has to be parsed again.
> 
> A few functions with OUT parameters are better than any semistructured string.

That was shot down, for good reasons: we assume that the users of this
are going to want to modify the command before considering it final.
Maybe they want to add a column to each table being created, or they
want to change the tablespace if the table name ends with "_big", or
they want to change the schema in which it is created.

This JSON representation lets you receive the table creation data in a
well-known JSON schema; you can tweak individual elements without having
to parse the SQL command.  And when you're done tweaking, there's a
function that lets you produce the SQL command that corresponds to the
original with the tweaks you just did.

(Please note that, thus far, this facility DOES NOT let you change the
table that was created, at least not directly: these event triggers are
run AFTER the creation command has completed.  You can tweak the command
that would be sent to a remote server in a replication swarm, for
example.  Or create a mirror table for audit purposes.  Or perhaps even
generate an ALTER TABLE command for the new table.)
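
To make the plpgsql side of that workflow concrete, here is a minimal sketch.
The two pg_event_trigger_*_creation_command(s) functions are the ones proposed
in this thread and do not exist in any released version; treating them as
taking and returning a single json value is an assumption, since the exact
signatures are still being worked out.

-- sketch only: proposed functions, assumed single-json-value signatures
CREATE FUNCTION replay_creation_commands() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    blob json;
BEGIN
    blob := pg_event_trigger_get_creation_commands();
    -- ... tweak individual elements of blob here (schema, tablespace, ...) ...
    RAISE NOTICE 'command for the replication swarm: %',
        pg_event_trigger_expand_creation_command(blob);
END;
$$;

CREATE EVENT TRIGGER replay_creation_commands
    ON ddl_command_end
    EXECUTE PROCEDURE replay_creation_commands();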

If by "few functions with OUT parameters" you mean that we need to have
one record type that is able to receive all possible CREATE TABLE
options, so that you can change them as a record in plpgsql, this
doesn't sound too good to me, for three reasons: a) it's going to
require too many types and functions (one per statement type); b)
cramming the stuff in pg_type.h / pg_proc.h is going to be a horrid
task; c) any change is going to require an initdb.

-- 
Alvaro Herrera


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Andres Freund
On 2014-01-08 14:42:37 -0800, Joshua D. Drake wrote:
> 
> On 01/08/2014 02:34 PM, Andres Freund wrote:
> 
> >I don't think you've understood synchronous replication. There wouldn't
> >be *any* benefit to using it if it worked the way you wish since there
> >wouldn't be any additional guarantees. A single reconnect of the
> >streaming rep connection, without any permanent outage, would
> >potentially lead to data loss if the primary crashed in the wrong
> >moment.
> >So you'd buy no guarantees with a noticeable loss in performance.
> >
> >Just use async mode if you want things to work like that.
> 
> Well no. That isn't what I am saying. Consider the following scenario:
> 
> db0->db1 in synchronous mode
> 
> The idea is that we know that data on db0 is not written until we know for a
> fact that db1 also has that data. That is great and a guarantee of data
> integrity between the two nodes.

That guarantee is never there. The only thing guaranteed is that the
client isn't notified of the commit until db1 has received the data.

> If we have the following:
> 
> db0->db1:down
> 
> Using the model (as I understand it) that is being discussed we have
> increased our failure rate because the moment db1:down we also lose db0. The
> node db0 may be up but if it isn't going to process transactions it is
> useless. I can tell you that I have exactly 0 customers that would want that
> model because a single node failure would cause a double node failure.

That's why you should configure a second standby as another (candidate)
synchronous replica, also listed in synchronous_standby_names.
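
For illustration (the standby names are made up), the primary then has
something like this in postgresql.conf, and pg_stat_replication shows which
standby currently holds the synchronous role:

# 'standby1' and 'standby2' are the application_name values the standbys
# use in their primary_conninfo; the first connected one is synchronous,
# and the next in the list takes over if it drops out
synchronous_standby_names = 'standby1, standby2'

-- run on the primary to see who is synchronous right now
SELECT application_name, state, sync_state FROM pg_stat_replication;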

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] commit fest manager?

2014-01-08 Thread Joshua D. Drake


On 01/08/2014 02:38 PM, Josh Berkus wrote:


On 01/08/2014 02:04 PM, Peter Eisentraut wrote:

Anyone else?

Or you'll have to deal with me again?



I vote for Peter.



+1

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Joshua D. Drake


On 01/08/2014 02:34 PM, Andres Freund wrote:


I don't think you've understood synchronous replication. There wouldn't
be *any* benefit to using it if it worked the way you wish since there
wouldn't be any additional guarantees. A single reconnect of the
streaming rep connection, without any permanent outage, would
potentially lead to data loss if the primary crashed in the wrong
moment.
So you'd buy no guarantees with a noticeable loss in performance.

Just use async mode if you want things to work like that.


Well no. That isn't what I am saying. Consider the following scenario:

db0->db1 in synchronous mode

The idea is that we know that data on db0 is not written until we know 
for a fact that db1 also has that data. That is great and a guarantee of 
data integrity between the two nodes.


If we have the following:

db0->db1:down

Using the model (as I understand it) that is being discussed we have 
increased our failure rate because the moment db1:down we also lose db0. 
The node db0 may be up but if it isn't going to process transactions it 
is useless. I can tell you that I have exactly 0 customers that would 
want that model because a single node failure would cause a double node 
failure.


All the other stuff with wal_keep_segments is just idea throwing. I 
don't care about that at this point. What I care about specifically is 
that a single node failure regardless of replication mode should not be 
able to (automatically) stop the operation of the master node.


Sincerely,

JD




--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] commit fest manager?

2014-01-08 Thread Josh Berkus
On 01/08/2014 02:04 PM, Peter Eisentraut wrote:
> Anyone else?
> 
> Or you'll have to deal with me again?
> 
> 

I vote for Peter.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Andres Freund
On 2014-01-08 14:23:34 -0800, Joshua D. Drake wrote:
> 
> On 01/08/2014 01:55 PM, Tom Lane wrote:
> 
> >Sync mode is about providing a guarantee that the data exists on more than
> >one server *before* we tell the client it's committed.  If you don't need
> >that guarantee, you shouldn't be using sync mode.  If you do need it,
> >it's not clear to me why you'd suddenly not need it the moment the going
> >actually gets tough.
> 
> As I understand it what is being suggested is that if a subscriber or target
> goes down, then the master will just sit there and wait. When I read that, I
> read that the master will no longer process write transactions. If I am
> wrong in that understanding then cool. If I am not then that is a serious
> problem with a production scenario. There is an expectation that a master
> will continue to function if the target is down, synchronous or not.

I don't think you've understood synchronous replication. There wouldn't
be *any* benefit to using it if it worked the way you wish since there
wouldn't be any additional guarantees. A single reconnect of the
streaming rep connection, without any permanent outage, would
potentially lead to data loss if the primary crashed in the wrong
moment.
So you'd buy no guarantees with a noticeable loss in performance.

Just use async mode if you want things to work like that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Joshua D. Drake


On 01/08/2014 01:55 PM, Tom Lane wrote:


Sync mode is about providing a guarantee that the data exists on more than
one server *before* we tell the client it's committed.  If you don't need
that guarantee, you shouldn't be using sync mode.  If you do need it,
it's not clear to me why you'd suddenly not need it the moment the going
actually gets tough.


As I understand it what is being suggested is that if a subscriber or 
target goes down, then the master will just sit there and wait. When I 
read that, I read that the master will no longer process write 
transactions. If I am wrong in that understanding then cool. If I am not 
then that is a serious problem with a production scenario. There is an 
expectation that a master will continue to function if the target is 
down, synchronous or not.


Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-01-08 Thread Alvaro Herrera
Pavel Stehule wrote:
> Hello
> 
> I don't like this direction. What can we do with JSON from plpgsql?

We have plenty of JSON functions and operators in SQL, and more to come
soon.  Is that not enough?

> Moreover, JSON is not a very robust format against future changes.

Not sure what you mean.  This JSON is generated and consumed by our own
code, so we only need to concern ourselves with making sure that we can
consume (in the expansion function) what we generated in the previous
phase.  There is no transmission to the exterior.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-01-08 Thread Pavel Stehule
Hello

I don't like this direction. What can we do with JSON from plpgsql? Moreover,
JSON is not a very robust format against future changes.

Regards

Pavel
On 8.1.2014 21:43 "Alvaro Herrera"  wrote:

> Alvaro Herrera wrote:
> > Robert Haas wrote:
> >
> > > I think this direction has some potential.  I'm not sure it's right in
> > > detail.  The exact scheme you propose above won't work if you want to
> > > leave out the schema name altogether, and more generally it's not
> > > going to help very much with anything other than substituting in
> > > identifiers.  What if you want to add a column called satellite_id to
> > > every table that gets created, for example?  What if you want to make
> > > the tables UNLOGGED?  I don't see how that kind of things is going to
> > > work at all cleanly.
> >
> > Thanks for the discussion.  I am building some basic infrastructure to
> > make this possible, and will explore ideas to cover these oversights
> > (not posting anything concrete yet because I expect several iterations
> > to crash and burn before I have something sensible to post).
>
> Here's a working example.  Suppose the user runs
>
> CREATE SCHEMA IF NOT EXISTS "some schema" AUTHORIZATION "some guy";
>
> In an event trigger, the function pg_event_trigger_get_creation_commands()
> returns the following JSON blob:
>
> {"authorization":{"authorization_role":"some guy",
>   "output":"AUTHORIZATION %i{authorization_role}"},
>  "if_not_exists":"IF NOT EXISTS",
>  "name":"some schema",
>  "output":"CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}"}
>
> wherein I have chosen to have a JSON element with the hardcoded name of
> "output" which is what needs to be expanded; for each %{} parameter
> found in it, there is an equally-named element in the JSON blob.  This
> can be a string, a NULL, or another JSON object.
>
> If it's a string, it expands to that value; if it's an object,
> recursively an "output" element is expanded in the same way, and the
> expanded string is used.
>
> If there's a NULL element when expanding an object, the whole thing
> expands to empty.  For example, if no AUTHORIZATION
> clause is specified, "authorization" element is still there, but the
> "authorization_role" element within it is NULL, and so the whole
> AUTHORIZATION clause expands to empty and the resulting command contains
> no authorization clause.  This is useful to support the case that
> someone doesn't have an AUTHORIZATION clause in the CREATE SCHEMA
> command, and the event trigger injects one simply by setting the
> authorization_role to some role name.
>
> IF NOT EXISTS is handled by defining it to either the string IF NOT
> EXISTS or to empty if no such clause was specified.
>
> The user can modify elements in the JSON to get a different version of
> the command.  (I reckon the "output" can also be modified, but this is
> probably a bad idea in most/all cases.  I don't think there's a need to
> prohibit this explicitly.)  Also, someone might define "if_not_exists"
> to something completely unrelated, but that would be their own fault.
> (Maybe we can have some cross-check that the if_not_exists element in
> JSON cannot be anything other than "IF NOT EXISTS" or the empty string;
> and that the "output" element remains the same at expansion time than it
> was at generation time.  Perhaps we should even hide the "output"
> element from the user completely and only add them to the JSON at time
> of expansion.  Not sure it's worth the trouble.)
>
> There is another function,
> pg_event_trigger_expand_creation_command(json), which will expand the
> above JSON blob and return the following text:
>
> CREATE SCHEMA IF NOT EXISTS "some schema" AUTHORIZATION "some guy"
>
> Note the identifiers are properly quoted (there are quotes in the JSON
> blob, but they correspond to JSON's own delimiters).  I have defined a
> 'i' modifier to have %i{} elements, which means that the element is an
> identifier which might need quoting.
>
> I have also defined a %d{} modifier that means to use the element to
> expand a possibly-qualified dotted name.  (There would be no "output"
> element in this case.)  This is to support the case where you have
>
> CREATE TABLE public.foo
> which results in
> {"table_name":{"schema":"public",
>"relname":"foo"}}
>
> and you want to edit the "table_name" element in the root JSON and set
> the schema to something else (perhaps NULL), so in the event trigger
> after expansion you can end up with "CREATE TABLE foo" or "CREATE TABLE
> private.foo" or whatever.
>
> Most likely there are some more rules that will need to be created, but
> so far this looks sensible.
>
> I'm going to play some more with the %d{} stuff, and also with the idea
> of representing table elements such as columns and constraints as an
> array.  In the meantime please let me know whether this makes sense.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

[HACKERS] commit fest manager?

2014-01-08 Thread Peter Eisentraut
Anyone else?

Or you'll have to deal with me again?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN improvements part 1: additional information

2014-01-08 Thread Alexander Korotkov
On Mon, Jan 6, 2014 at 12:35 PM, Amit Langote wrote:

> On Sat, Dec 21, 2013 at 4:36 AM, Heikki Linnakangas
>  wrote:
> >
> > Yet another version. The encoding/decoding code is now quite isolated in
> > ginpostinglist.c, so it's easy to experiment with different encodings.
> This
> > patch uses varbyte encoding again.
> >
> > I got a bit carried away, experimented with a bunch of different
> encodings.
> > I tried rice encoding, rice encoding with block and offset number delta
> > stored separately, the simple9 variant, and varbyte encoding.
> >
> > The compressed size obviously depends a lot on the distribution of the
> > items, but in the test set I used, the differences between different
> > encodings were quite small.
> >
> > One fatal problem with many encodings is VACUUM. If a page is completely
> > full and you remove one item, the result must still fit. In other words,
> > removing an item must never enlarge the space needed. Otherwise we have
> to
> > be able to split on vacuum, which adds a lot of code, and also makes it
> > possible for VACUUM to fail if there is no disk space left. That's
> > unpleasant if you're trying to run VACUUM to release disk space. (gin
> fast
> > updates already has that problem BTW, but let's not make it worse)
> >
> > I believe that eliminates all encodings in the Simple family, as well as
> > PForDelta, and surprisingly also Rice encoding. For example, if you have
> > three items in consecutive offsets, the differences between them are
> encoded
> > as 11 in rice encoding. If you remove the middle item, the encoding for
> the
> > next item becomes 010, which takes more space than the original.
> >
> > AFAICS varbyte encoding is safe from that. (a formal proof would be nice
> > though).
> >
> > So, I'm happy to go with varbyte encoding now, indeed I don't think we
> have
> > much choice unless someone can come up with an alternative that's
> > VACUUM-safe. I have to put this patch aside for a while now, I spent a
> lot
> > more time on these encoding experiments than I intended. If you could
> take a
> > look at this latest version, spend some time reviewing it and cleaning up
> > any obsolete comments, and re-run the performance tests you did earlier,
> > that would be great. One thing I'm slightly worried about is the
> overhead of
> > merging the compressed and uncompressed posting lists in a scan. This
> patch
> > will be in good shape for the final commitfest, or even before that.
> >
>
>
> I just tried out the patch "gin-packed-postinglists-varbyte2.patch"
> (which looks like the latest one in this thread) as follows:
>
> 1) Applied patch to the HEAD (on commit
> 94b899b829657332bda856ac3f06153d09077bd1)
> 2) Created a test table and index
>
> create table test (a text);
> copy test from '/usr/share/dict/words';
> create index test_trgm_idx on test using gin (a gin_trgm_ops);
>
> 3) Got the following error on a wildcard query:
>
> postgres=# explain (buffers, analyze) select count(*) from test where
> a like '%tio%';
> ERROR:  lock 9447 is not held
> STATEMENT:  explain (buffers, analyze) select count(*) from test where
> a like '%tio%';
> ERROR:  lock 9447 is not held
>

Thanks for reporting. Fixed version is attached.

--
With best regards,
Alexander Korotkov.


gin-packed-postinglists-varbyte3.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Kevin Grittner
Andres Freund  wrote:
> On 2014-01-08 13:34:08 -0800, Kevin Grittner wrote:
>
>> On the other hand, we keep getting people saying they want the
>> database to make the promise of synchronous replication, and
>> tell applications that it has been successful even when it
>> hasn't been, as long as there's a line in the server log to
>> record the lie.
>
> Most people having such a position I've talked to have held that
> position because they thought synchronous replication would mean
> that apply (and thus visibility) would also be synchronous. Is
> that different from your experience?

I haven't pursued it that far because we don't have
maybe-synchronous mode yet and seem unlikely to ever support it.
I'm not sure why that use-case is any better than any other.  You
still would never really know whether the data read is current.  If
we were to implement this, the supposedly synchronous replica could
be out-of-date by any arbitrary amount of time (from milliseconds
to months).  (Consider what could happen if the replication
connection authorizations got messed up while application
connections to the replica were fine.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
"Joshua D. Drake"  writes:
> However, if the subscriber is down, the origin should NEVER wait. That 
> is just silly behavior and makes synchronous replication pretty much 
> useless. Machines go down, that is the nature of things. Yes, we should 
> log and log loudly if the subscriber is down:

> ERROR: target xyz is non-communicative: switching to async replication.

> We then should store the wal logs up to wal_keep_segments.

> When the subscriber comes back up, it will then replicate in async mode 
> until the two are back in sync and then switch (perhaps by hand) to sync 
> mode. This of course assumes that we have a valid database on the 
> subscriber and we have not overrun wal_keep_segments.

It sounds to me like you are describing the existing behavior of async
mode, with the possible exception of exactly what shows up in the
postmaster log.

Sync mode is about providing a guarantee that the data exists on more than
one server *before* we tell the client it's committed.  If you don't need
that guarantee, you shouldn't be using sync mode.  If you do need it,
it's not clear to me why you'd suddenly not need it the moment the going
actually gets tough.
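
For completeness: an application that needs the guarantee for most commits but
not all of them can already relax it per transaction with synchronous_commit,
instead of turning synchronous replication off cluster-wide.  A rough sketch,
with a hypothetical audit_log table:

BEGIN;
SET LOCAL synchronous_commit = local;   -- wait only for the local WAL flush
INSERT INTO audit_log (message) VALUES ('non-critical event');
COMMIT;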

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
Josh Berkus  writes:
> If we really want auto-degrading sync rep, then we'd (at a minimum) need
> a way to determine *from the replica* whether or not it was in degraded
> mode when the master died.  What good do messages to the master log do
> you if the master no longer exists?

How would it be possible for a replica to know whether the master had
committed more transactions while communication was lost, if the master
dies without ever restoring communication?  It sounds like pie in the
sky from here ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Heikki Linnakangas

On 01/08/2014 11:37 PM, Andres Freund wrote:

On 2014-01-08 13:34:08 -0800, Kevin Grittner wrote:

On the other hand, we keep getting people saying they want the
database to make the promise of synchronous replication, and tell
applications that it has been successful even when it hasn't been,
as long as there's a line in the server log to record the lie.


Most people having such a position I've talked to have held that
position because they thought synchronous replication would mean that
apply (and thus visibility) would also be synchronous.


And I totally agree that it would be a useful mode if apply was 
synchronous. You could then build a master-standby pair where it's 
guaranteed that when you commit a transaction in the master, it's 
thereafter always seen as committed in the standby too. In that usage, 
if the link between the two is broken, you could set up timeouts, e.g. so 
that the standby stops accepting new queries after 20 seconds, and then 
the master proceeds without the standby after 25 seconds. Then the 
guarantee would hold.


I don't know if the people asking for the fallback mode are thinking 
that synchronous replication means synchronous apply, or if they're 
trying to have the cake and eat it too wrt. durability and availability.


Synchronous apply would be cool..

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Joshua D. Drake


On 01/08/2014 01:34 PM, Kevin Grittner wrote:


I'm torn on whether we should cave to popular demand on this; but
if we do, we sure need to be very clear in the documentation about
what a successful return from a commit request means.  Sooner or
later, Murphy's Law being what it is, if we do this someone will
lose the primary and blame us because the synchronous replica is
missing gobs of transactions that were successfully committed.


I am trying to follow this thread and perhaps I am just being dense but 
it seems to me that:


If you are running synchronous replication, as long as the target 
(subscriber) is up, synchronous replication operates as it should. That 
is, the origin will wait for a notification from the subscriber that 
the write has been successful before continuing.


However, if the subscriber is down, the origin should NEVER wait. That 
is just silly behavior and makes synchronous replication pretty much 
useless. Machines go down, that is the nature of things. Yes, we should 
log and log loudly if the subscriber is down:


ERROR: target xyz is non-communicative: switching to async replication.

We then should store the wal logs up to wal_keep_segments.

When the subscriber comes back up, it will then replicate in async mode 
until the two are back in sync and then switch (perhaps by hand) to sync 
mode. This of course assumes that we have a valid database on the 
subscriber and we have not overrun wal_keep_segments.


Sincerely,

Joshua D. Drake



--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Josh Berkus
On 01/08/2014 12:27 PM, Bruce Momjian wrote:
> I am glad Heikki and Simon agree, but I don't.  ;-)
> 
> The way that I understand it is that you might want durability, but
> might not want to sacrifice availability.  Phrased that way, it makes
> sense, and notifying the administrator seems the appropriate action.

I think there's a valid argument to want things the other way, but I
find the argument not persuasive.  In general, people who want
auto-degrade for sync rep either:

a) don't understand what sync rep actually does (lots of folks confuse
synchronous with simultaneous), or

b) want more infrastructure than we actually have around managing sync
replicas

Now, the folks who want (b) have a legitimate need, and I'll point out
that we always planned to have more features around sync rep, it's just
that we never actually worked on any.  For example, "quorum sync" was
extensively discussed and originally projected for 9.2, but then certain
hackers changed jobs and interests.

If we just did the minimal change, that is, added an "auto-degrade" GUC
and an alert to the logs each time the master server went into degraded
mode, as Heikki says we'd be loading a big foot-gun for a bunch of
ill-informed DBAs.  People who want that are really much better off with
async rep in the first place.

If we really want auto-degrading sync rep, then we'd (at a minimum) need
a way to determine *from the replica* whether or not it was in degraded
mode when the master died.  What good do messages to the master log do
you if the master no longer exists?

Mind you, being able to determine on the replica whether it was
synchronous or not when it lost communication with the master would be a
great feature to have for sync rep groups as well, and would make them
practical (right now, they're pretty useless).  However, I seriously
doubt that someone is going to code that up in the next 5 days.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Andres Freund
On 2014-01-08 13:34:08 -0800, Kevin Grittner wrote:
> On the other hand, we keep getting people saying they want the
> database to make the promise of synchronous replication, and tell
> applications that it has been successful even when it hasn't been,
> as long as there's a line in the server log to record the lie.

Most people having such a position I've talked to have held that
position because they thought synchronous replication would mean that
apply (and thus visibility) would also be synchronous. Is that
different from your experience?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Tom Lane
Kevin Grittner  writes:
> I'm torn on whether we should cave to popular demand on this; but
> if we do, we sure need to be very clear in the documentation about
> what a successful return from a commit request means.  Sooner or
> later, Murphy's Law being what it is, if we do this someone will
> lose the primary and blame us because the synchronous replica is
> missing gobs of transactions that were successfully committed.

I'm for not caving.  I think people who are asking for this don't
actually understand what they'd be getting.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Kevin Grittner
Bruce Momjian  wrote:
> Heikki Linnakangas wrote:

>> They want to have the cake and eat it too. But they're not
>> actually getting that. What they actually get is extra latency
>> when things work, with no gain in durability.
>
> They are getting guaranteed durability until they get a
> notification --- that seems valuable.  When they get the
> notification, they can reevaluate if they want that tradeoff.

My first reaction to this has been that if you want synchronous
replication without having the system wait if the synchronous
target goes down, you should configure an alternate target.  With
the requested change we can no longer state that when a COMMIT
returns with an indication of success that the data has been
persisted to multiple clusters.  We would be moving to a situation
where the difference between synchronous and asynchronous is subtle -- either way
the data may or may not be on a second cluster by the time the
committer is notified of success.  We wait up to some threshold
time to try to make the success indication indicate that, but then
return success even if the guarantee has not been provided, without
any way for the committer to know the difference.

On the other hand, we keep getting people saying they want the
database to make the promise of synchronous replication, and tell
applications that it has been successful even when it hasn't been,
as long as there's a line in the server log to record the lie.  Or,
more likely, to record the boundaries of time blocks where it has
been a lie.  This appears to be requested because other products
behave that way.

I'm torn on whether we should cave to popular demand on this; but
if we do, we sure need to be very clear in the documentation about
what a successful return from a commit request means.  Sooner or
later, Murphy's Law being what it is, if we do this someone will
lose the primary and blame us because the synchronous replica is
missing gobs of transactions that were successfully committed.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How to reproduce serialization failure for a read only transaction.

2014-01-08 Thread Kevin Grittner
Florian Pflug  wrote:
> On Jan7, 2014, at 20:11 , Kevin Grittner  wrote:

>> Yeah, neither of the provided examples rolled back the read only
>> transaction itself;
>
> Actually, the fixed version [1] of my example does.
>
> [1] 
> http://www.postgresql.org/message-id/8721aad3-7a3a-4576-b10e-f2cbd1e53...@phlo.org

Due to my lame email provider, that post didn't show for me until I
had already replied.  :-(  You had already showed an example almost
exactly like what I described in my post.  I tweaked it a bit more
for the Wiki page to show more clearly why SSI has to care about
what the writing transaction reads.  For all the database engine
knows, what was read contributed to whether the application allowed
it to successfully commit.  By using the value from the SELECT in
the UPDATE it is easier to see why it matters, although it needs to
be considered either way.

In other words, we seem to be in full agreement, just using
different language to describe it.  :-)
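
For anyone trying to reproduce this from scratch, a schematic version of that
kind of schedule looks like the sketch below (the schema and values are just
illustrative, not the exact example from the Wiki page).  Depending on the
precise timing, either the writing transaction or the read-only one is the one
cancelled with SQLSTATE 40001.

-- setup, once
CREATE TABLE control (deposit_no int NOT NULL);
INSERT INTO control VALUES (1);
CREATE TABLE receipt (id serial PRIMARY KEY,
                      deposit_no int NOT NULL,
                      amount numeric NOT NULL);

-- session A (read-write): file a receipt for the current batch
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt (deposit_no, amount)
    SELECT deposit_no, 100.00 FROM control;   -- reads control, writes receipt
-- A does not commit yet

-- session B (read-write): close the batch
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE control SET deposit_no = deposit_no + 1;
COMMIT;

-- session C (read only): report on the batch that just closed
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY;
SELECT deposit_no FROM control;               -- sees that batch 1 is closed
-- session A commits here
SELECT * FROM receipt WHERE deposit_no = 1;   -- A or C fails with a serialization error
COMMIT;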

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Bruce Momjian
On Wed, Jan  8, 2014 at 10:46:51PM +0200, Heikki Linnakangas wrote:
> On 01/08/2014 10:27 PM, Bruce Momjian wrote:
> >On Wed, Jan  8, 2014 at 05:39:23PM +, Simon Riggs wrote:
> >>On 8 January 2014 09:07, Heikki Linnakangas  wrote:
> >>
> >>>I'm going to say right off the bat that I think the whole notion to
> >>>automatically disable synchronous replication when the standby goes down is
> >>>completely bonkers.
> >>
> >>Agreed
> >>
> >>We had this discussion across 3 months and we don't want it again.
> >>This should not have been added as a TODO item.
> >
> >I am glad Heikki and Simon agree, but I don't.  ;-)
> >
> >The way that I understand it is that you might want durability, but
> >might not want to sacrifice availability.  Phrased that way, it makes
> >sense, and notifying the administrator seems the appropriate action.
> 
> They want to have the cake and eat it too. But they're not actually
> getting that. What they actually get is extra latency when things
> work, with no gain in durability.

They are getting guaranteed durability until they get a notification ---
that seems valuable.  When they get the notification, they can
reevaluate if they want that tradeoff.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How to reproduce serialization failure for a read only transaction.

2014-01-08 Thread Florian Pflug
On Jan7, 2014, at 20:11 , Kevin Grittner  wrote:
> Yeah, neither of the provided examples rolled back the read only
> transaction itself;

Actually, the fixed version [1] of my example does.

[1] 
http://www.postgresql.org/message-id/8721aad3-7a3a-4576-b10e-f2cbd1e53...@phlo.org

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Heikki Linnakangas

On 01/08/2014 10:27 PM, Bruce Momjian wrote:

On Wed, Jan  8, 2014 at 05:39:23PM +, Simon Riggs wrote:

On 8 January 2014 09:07, Heikki Linnakangas  wrote:


I'm going to say right off the bat that I think the whole notion to
automatically disable synchronous replication when the standby goes down is
completely bonkers.


Agreed

We had this discussion across 3 months and we don't want it again.
This should not have been added as a TODO item.


I am glad Heikki and Simon agree, but I don't.  ;-)

The way that I understand it is that you might want durability, but
might not want to sacrifice availability.  Phrased that way, it makes
sense, and notifying the administrator seems the appropriate action.


They want to have the cake and eat it too. But they're not actually 
getting that. What they actually get is extra latency when things work, 
with no gain in durability.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-01-08 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Robert Haas wrote:
> 
> > I think this direction has some potential.  I'm not sure it's right in
> > detail.  The exact scheme you propose above won't work if you want to
> > leave out the schema name altogether, and more generally it's not
> > going to help very much with anything other than substituting in
> > identifiers.  What if you want to add a column called satellite_id to
> > every table that gets created, for example?  What if you want to make
> > the tables UNLOGGED?  I don't see how that kind of things is going to
> > work at all cleanly.
> 
> Thanks for the discussion.  I am building some basic infrastructure to
> make this possible, and will explore ideas to cover these oversights
> (not posting anything concrete yet because I expect several iterations
> to crash and burn before I have something sensible to post).

Here's a working example.  Suppose the user runs

CREATE SCHEMA IF NOT EXISTS "some schema" AUTHORIZATION "some guy";

In an event trigger, the function pg_event_trigger_get_creation_commands()
returns the following JSON blob:

{"authorization":{"authorization_role":"some guy",
  "output":"AUTHORIZATION %i{authorization_role}"},
 "if_not_exists":"IF NOT EXISTS",
 "name":"some schema",
 "output":"CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}"}

wherein I have chosen to have a JSON element with the hardcoded name of
"output" which is what needs to be expanded; for each %{} parameter
found in it, there is an equally-named element in the JSON blob.  This
can be a string, a NULL, or another JSON object.

If it's a string, it expands to that value; if it's an object,
recursively an "output" element is expanded in the same way, and the
expanded string is used.

If there's a NULL element when expanding an object, the whole thing
expands to empty.  For example, if no AUTHORIZATION
clause is specified, "authorization" element is still there, but the
"authorization_role" element within it is NULL, and so the whole
AUTHORIZATION clause expands to empty and the resulting command contains
no authorization clause.  This is useful to support the case that
someone doesn't have an AUTHORIZATION clause in the CREATE SCHEMA
command, and the event trigger injects one simply by setting the
authorization_role to some role name.

IF NOT EXISTS is handled by defining it to either the string IF NOT
EXISTS or to empty if no such clause was specified.

The user can modify elements in the JSON to get a different version of
the command.  (I reckon the "output" can also be modified, but this is
probably a bad idea in most/all cases.  I don't think there's a need to
prohibit this explicitly.)  Also, someone might define "if_not_exists"
to something completely unrelated, but that would be their own fault.
(Maybe we can have some cross-check that the if_not_exists element in
JSON cannot be anything other than "IF NOT EXISTS" or the empty string;
and that the "output" element remains the same at expansion time than it
was at generation time.  Perhaps we should even hide the "output"
element from the user completely and only add them to the JSON at time
of expansion.  Not sure it's worth the trouble.)

There is another function,
pg_event_trigger_expand_creation_command(json), which will expand the
above JSON blob and return the following text:

CREATE SCHEMA IF NOT EXISTS "some schema" AUTHORIZATION "some guy"

Note the identifiers are properly quoted (there are quotes in the JSON
blob, but they correspond to JSON's own delimiters).  I have defined a
'i' modifier to have %i{} elements, which means that the element is an
identifier which might need quoting.

I have also defined a %d{} modifier that means to use the element to
expand a possibly-qualified dotted name.  (There would be no "output"
element in this case.)  This is to support the case where you have 

CREATE TABLE public.foo
which results in
{"table_name":{"schema":"public",
   "relname":"foo"}}

and you want to edit the "table_name" element in the root JSON and set
the schema to something else (perhaps NULL), so in the event trigger
after expansion you can end up with "CREATE TABLE foo" or "CREATE TABLE
private.foo" or whatever.

Most likely there are some more rules that will need to be created, but
so far this looks sensible.

I'm going to play some more with the %d{} stuff, and also with the idea
of representing table elements such as columns and constraints as an
array.  In the meantime please let me know whether this makes sense.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Hans-Jürgen Schönig

On Jan 8, 2014, at 9:27 PM, Bruce Momjian wrote:

> On Wed, Jan  8, 2014 at 05:39:23PM +, Simon Riggs wrote:
>> On 8 January 2014 09:07, Heikki Linnakangas  wrote:
>> 
>>> I'm going to say right off the bat that I think the whole notion to
>>> automatically disable synchronous replication when the standby goes down is
>>> completely bonkers.
>> 
>> Agreed
>> 
>> We had this discussion across 3 months and we don't want it again.
>> This should not have been added as a TODO item.
> 
> I am glad Heikki and Simon agree, but I don't.  ;-)
> 
> The way that I understand it is that you might want durability, but
> might not want to sacrifice availability.  Phrased that way, it makes
> sense, and notifying the administrator seems the appropriate action.
> 

Technically and conceptually I agree with Andres and Simon, but from daily 
experience I would say that we should make it configurable.
Some people have had nasty experiences when their systems stopped working.

+1 for a GUC to control this one.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Bruce Momjian
On Wed, Jan  8, 2014 at 05:39:23PM +, Simon Riggs wrote:
> On 8 January 2014 09:07, Heikki Linnakangas  wrote:
> 
> > I'm going to say right off the bat that I think the whole notion to
> > automatically disable synchronous replication when the standby goes down is
> > completely bonkers.
> 
> Agreed
> 
> We had this discussion across 3 months and we don't want it again.
> This should not have been added as a TODO item.

I am glad Heikki and Simon agree, but I don't.  ;-)

The way that I understand it is that you might want durability, but
might not want to sacrifice availability.  Phrased that way, it makes
sense, and notifying the administrator seems the appropriate action.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread Josh Berkus

> What about doing the triage the first two days or so of the CF? Then we
> know which patches have been submitted/updated and we haven't yet
> performed too many reviews of patches which aren't going to make it.

If we triage the patches we already have NOW, then it'll be possible to
triage the new/updated stuff which comes in in the first 48 hours of the
CF.  If we wait until the CF begins, we'll spend at least the first week
of the CF triaging.

That's why we set this schedule at the developer meeting.

And besides, we already know what category *your* patch belongs in ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread Andres Freund
On 2014-01-08 14:50:52 -0500, Tom Lane wrote:
> David Fetter  writes:
> > With utmost respect, there are unsent patches which don't fit into the
> > categories below.  Are you saying now that the cut-off is today, and
> > not actually 1/15, as everyone understands it to be?
> 
> I think Josh is merely suggesting that we could start triaging the
> patches we have in hand.  If you're aware of new submissions pending
> for any of them, of course, it wouldn't hurt to tell people about it.

What about doing the triage the first two days or so of the CF? Then we
know which patches have been submitted/updated and we haven't yet
performed too many reviews of patches which aren't going to make it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread Tom Lane
David Fetter  writes:
> With utmost respect, there are unsent patches which don't fit into the
> categories below.  Are you saying now that the cut-off is today, and
> not actually 1/15, as everyone understands it to be?

I think Josh is merely suggesting that we could start triaging the
patches we have in hand.  If you're aware of new submissions pending
for any of them, of course, it wouldn't hurt to tell people about it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-08 Thread knizhnik

On 01/08/2014 10:51 PM, Robert Haas wrote:

On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila  wrote:

On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas  wrote:

On Mon, Jan 6, 2014 at 4:04 PM, james  wrote:

The point remains that you need to duplicate it into every process that
might
want to use it subsequently, so it makes sense to DuplicateHandle into the
parent, and then to advertise that  handle value publicly so that other
child
processes can DuplicateHandle it back into their own process.

Well, right now we just reopen the same object from all of the
processes, which seems to work fine and doesn't require any of this
complexity.  The only problem I don't know how to solve is how to make
a segment stick around for the whole postmaster lifetime.  If
duplicating the handle into the postmaster without its knowledge gets
us there, it may be worth considering, but that doesn't seem like a
good reason to rework the rest of the existing mechanism.

I think one has to try this to see if it works as per the need. If it's not
urgent, I can try this early next week?

Anything we want to get into 9.4 has to be submitted by next Tuesday,
but I don't know that we're going to get this into 9.4.


I wonder what the intended use case of dynamic shared memory is.
Is it primarily aimed at PostgreSQL extensions, or will it also be used 
in PostgreSQL core?
In the case of extensions, shared memory may be needed to store some 
collected/calculated information which will be used by extension functions.


The main advantage of DSM (from my point of view), compared with the existing 
mechanism of preloaded extensions, is that it is not necessary to restart the 
server to add a new extension requiring shared memory.
A DSM segment can be attached or created by the _PG_init function of the 
loaded module.
But there is not much point in this mechanism if the segment 
is deleted when there are no more processes attached to it.
So to make DSM really useful for extensions, it needs some mechanism to 
pin a segment in memory for the whole server/extension lifetime.


Maybe I am wrong, but I do not see a reason for the same extension to create 
multiple DSM segments.
And the total number of DSM segments is expected to be fairly small (<10). 
The same is true for the synchronization primitives (LWLocks, for example) 
needed to synchronize access to these DSM segments. So I am not sure the 
ability to place locks in DSM is really so critical...
We could just reserve some space for LWLocks that can be used by 
extensions, so that LWLockAssign() can be used without 
RequestAddinLWLocks(), or RequestAddinLWLocks() can be used not only from 
preloaded extensions.


IMHO the main trouble with DSM is the lack of a guarantee that a segment is 
always mapped at the same virtual address.
Without such a guarantee it is not possible to use direct (normal) 
pointers inside DSM.

But there seems to be no reasonable solution.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread Josh Berkus
On 01/08/2014 11:07 AM, David Fetter wrote:
> On Wed, Jan 08, 2014 at 10:45:37AM -0800, Josh Berkus wrote:
>> Hackers,
>>
>> Per the Developer Meeting, we are scheduled to do a final triage of 9.4
>> patches the week before CF4 starts, which is *now*.  The goal of this
>> triage is to divide patches already in queue into 5 groups:
> 
> With utmost respect, there are unsent patches which don't fit into the
> categories below.  Are you saying now that the cut-off is today, and
> not actually 1/15, as everyone understands it to be?

It has always been our policy that non-trivial patches which appear for
the first time in CF4 go to the bottom of the priority list.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread Stephen Frost
* David Fetter (da...@fetter.org) wrote:
> On Wed, Jan 08, 2014 at 10:45:37AM -0800, Josh Berkus wrote:
> > Per the Developer Meeting, we are scheduled to do a final triage of 9.4
> > patches the week before CF4 starts, which is *now*.  The goal of this
> > triage is to divide patches already in queue into 5 groups:
> 
> With utmost respect, there are unsent patches which don't fit into the
> categories below.  Are you saying now that the cut-off is today, and
> not actually 1/15, as everyone understands it to be?

The categories are based on the degree of completion- how can a patch
not fit into one of those?

As for the cut-off, I would guess that the "by-the-book" approach would
be to allow until the 15th, but if it's a big patch and it hasn't been
seen in prior CF's then it shouldn't particularly matter.  That doesn't
preclude us from working on the triage in advance of the CF starting.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread David Fetter
On Wed, Jan 08, 2014 at 10:45:37AM -0800, Josh Berkus wrote:
> Hackers,
> 
> Per the Developer Meeting, we are scheduled to do a final triage of 9.4
> patches the week before CF4 starts, which is *now*.  The goal of this
> triage is to divide patches already in queue into 5 groups:

With utmost respect, there are unsent patches which don't fit into the
categories below.  Are you saying now that the cut-off is today, and
not actually 1/15, as everyone understands it to be?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER SYSTEM SET command to change postgresql.conf parameters

2014-01-08 Thread Robert Haas
On Mon, Jan 6, 2014 at 11:37 PM, Amit Kapila  wrote:
> On Tue, Jan 7, 2014 at 12:52 AM, Robert Haas  wrote:
>> On Mon, Jan 6, 2014 at 9:48 AM, Amit Kapila  wrote:
 Couldn't we also handle this by postponing FreeConfigVariables until
 after the if (error) block?
>>>
>>>Wouldn't doing it that way lead to a bigger memory leak if the error level
>>>is ERROR? Though the current fix can also leak memory, it will be
>>>just for ErrorConfFile_save. I think a similar case can currently happen for
>>>'pre_value' in the code as well; that's why I have fixed it in a
>>>similar way in the patch.
>>
>> I was assuming that error-recovery would reset the containing memory
>> context, but I'm not sure what memory context we're executing in at
>> this point.
>
> This function is called from multiple places and based on when it would
> get called the memory context varies. During Startup, it gets called in
> Postmaster context and if some backend runs pg_reload_conf(), then
> it will get called from other background processes (WAL Writer,
> Checpointer, etc..) in their respective contexts (for WAL Writer, the
> context will be WAL Writer, ..).
>
> In current code, the only time it can go to error path with elevel as
> ERROR is during Postmaster startup
> (context == PGC_POSTMASTER), at which it will anyway upgrade
> ERROR to FATAL, so it should not be a problem to move
> function FreeConfigVariables() after error block check. However
> in future, if someone added any more ERROR (the chances of which
> seems to be quite less), it can cause leak, may be thats why original
> code has been written that way.
>
> If you think it's better to fix by moving FreeConfigVariables() after error
> block check, then I can update the patch by doing so and incorporate other
> change (directly use PG_AUTOCONF_FILENAME) suggested by you
> as well?

Yeah, let's do it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BDR-project

2014-01-08 Thread Andreas Joseph Krogh
På onsdag 08. januar 2014 kl. 18:57:52, skrev Simon Riggs mailto:si...@2ndquadrant.com>>: On 25 December 2013 12:01, Andreas Joseph 
Krogh  wrote:

 > Ref:
 > http://wiki.postgresql.org/wiki/BDR_Project
 >
 > Is implementing main BDR features into core Postgres a probable objective to
 > version 9.4?

 I've not updated hackers recently on this point, so thanks for asking
 the question. This was discussed in the ending keynote of the PG
 Europe conference, but I appreciate that's not the same thing as
 saying it here.

 The plan is

 * submit the core logical replication technology for 9.4
 * submit online upgrade as a feature for 9.5, allowing upgrades from 9.4+
 * submit full BDR features for 9.6

 BDR code will be released as a separate open source project until/if
 core accepts/modifies that. There's lots of work and discussion to be
 had yet, so the above plan is a reasonable schedule for achieving
 change allowing input from all. Design to full feature submission
 would be 4.5 years, plus we expect the features to mature/extend after
 that, so there's no rush, just steady movement.

 No attempts to publicise that as yet, but if all goes well we expect
 to do that once 9.4 is released.

Thanks for the update!

--
Andreas Joseph Krogh
mob: +47 909 56 963
Senior Software Developer / CTO - OfficeNet AS - http://www.officenet.no
Public key: http://home.officenet.no/~andreak/public_key.asc

Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-08 Thread Robert Haas
On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila  wrote:
> On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas  wrote:
>> On Mon, Jan 6, 2014 at 4:04 PM, james  wrote:
>>> The point remains that you need to duplicate it into every process that
>>> might
>>> want to use it subsequently, so it makes sense to DuplicateHandle into the
>>> parent, and then to advertise that  handle value publicly so that other
>>> child
>>> processes can DuplicateHandle it back into their own process.
>>
>> Well, right now we just reopen the same object from all of the
>> processes, which seems to work fine and doesn't require any of this
>> complexity.  The only problem I don't know how to solve is how to make
>> a segment stick around for the whole postmaster lifetime.  If
>> duplicating the handle into the postmaster without its knowledge gets
>> us there, it may be worth considering, but that doesn't seem like a
>> good reason to rework the rest of the existing mechanism.
>
> I think one has to try this to see if it works as per the need. If it's not
> urgent, I can try this early next week?

Anything we want to get into 9.4 has to be submitted by next Tuesday,
but I don't know that we're going to get this into 9.4.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Time to do our Triage for 9.4

2014-01-08 Thread Josh Berkus
Hackers,

Per the Developer Meeting, we are scheduled to do a final triage of 9.4
patches the week before CF4 starts, which is *now*.  The goal of this
triage is to divide patches already in queue into 5 groups:

1) Good To Go: patches which are 100% ready for final testing and commit.

2) Needs a Little Work: smaller patches which can be included in 9.4 if
they get a few hours of love from a committer or major hacker.

3) Big Patches: big, important patches which will need a major time
commitment to commit even though they are 90% ready, just due to size.

4) Not Nearly Ready: Patches which need major work and/or spec
discussions before commitment.

5) WIP: any patch which is acknowledged to be there just for review, or any
brand-new patch which wasn't in CF3 and is non-trivial.

Obviously, any patches in groups 4 and 5 aren't going into 9.4.  The
idea is to handle patches in CF4 in this order:

1. do immediately

2. do after (1) is complete

3. assign 1 senior hacker reviewer to each patch

4. review as time permits after 1-3

5. review as time permits after 1-3

Let the triage begin!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: improve documentation for user defined final functions of aggregates

2014-01-08 Thread Tom Lane
Mark Dilger  writes:
> The fact that a final function may be called
> multiple times does not appear to be mentioned
> in the docs,

It's in xaggr.sgml, near the discussion of AggCheckCallContext
(the end of section 35.10, in current HEAD's section numbering).

In general, people writing C functions should not expect that they
can scribble on pass-by-ref arguments, so I don't see a need to
belabor the point in multiple places.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP patch: improve documentation for user defined final functions of aggregates

2014-01-08 Thread Mark Dilger
The fact that a final function may be called
multiple times does not appear to be mentioned
in the docs, and a reasonable reading of the
docs gives the opposite impression.  For
people writing final functions in C, there
should be some warning about changing the
transition value, such as exists in the comments
in src/backend/executor/nodeWindowAgg.c


I'm not sure if this warning should be placed
in the docs of the CREATE AGGREGATE command, or
in section 35.10.  The documentation for the
CREATE AGGREGATE command talks more about what
final functions do, and hence is the more natural
placement in that sense, but for users who are
not doing C programming, changing the state of
the transition value is probably not a problem.
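
For what it's worth, the situation is easy to reproduce (this example
is purely illustrative, not part of the patch): when an aggregate is
used as a window function, its final function is invoked once per
output row over the same accumulating transition state, so a final
function that scribbled on that state would corrupt later results.

-- array_agg's final function is invoked for each output row,
-- each time over the same accumulating transition value
SELECT x, array_agg(x) OVER (ORDER BY x) AS running
FROM generate_series(1, 3) AS t(x);
-- expected: {1}, {1,2}, {1,2,3}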

WIP WIP WIP WIP WIP


diff --git a/doc/src/sgml/ref/create_aggregate.sgml 
b/doc/src/sgml/ref/create_aggregate.sgml
index d15fcba..d4603cc 100644
--- a/doc/src/sgml/ref/create_aggregate.sgml
+++ b/doc/src/sgml/ref/create_aggregate.sgml
@@ -111,6 +111,12 @@ CREATE AGGREGATE name (
   
 
   
+   The final function may be invoked at other times during the processing
+   of all rows and must not damage the internal state value.  It is unwise
+   to assume that the final function will only be invoked once.
+  
+
+  
    An aggregate function can provide an initial condition,
    that is, an initial value for the internal state value.
    This is specified and stored in the database as a value of type


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BDR-project

2014-01-08 Thread Simon Riggs
On 25 December 2013 12:01, Andreas Joseph Krogh  wrote:

> Ref:
> http://wiki.postgresql.org/wiki/BDR_Project
>
> Is implementing main BDR features into core Postgres a probable objective to
> version 9.4?

I've not updated hackers recently on this point, so thanks for asking
the question. This was discussed in the ending keynote of the PG
Europe conference, but I appreciate that's not the same thing as
saying it here.

The plan is

* submit the core logical replication technology for 9.4
* submit online upgrade as a feature for 9.5, allowing upgrades from 9.4+
* submit full BDR features for 9.6

BDR code will be released as a separate open source project until/if
core accepts/modifies that. There's lots of work and discussion to be
had yet, so the above plan is a reasonable schedule for achieving
change allowing input from all. Design to full feature submission
would be 4.5 years, plus we expect the features to mature/extend after
that, so there's no rush, just steady movement.

No attempts to publicise that as yet, but if all goes well we expect
to do that once 9.4 is released.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for pg_stat_archiver view

2014-01-08 Thread Magnus Hagander
On Wed, Jan 8, 2014 at 6:42 PM, Simon Riggs  wrote:

> On 4 January 2014 13:01, Gabriele Bartolini
>  wrote:
>
> >> I'd suggest making the view on top of an SRF like pg_stat_replication
> >> and pg_stat_activity (for example), instead of a whole lot of separate
> >> function calls like the older stats views.
> >
> > Ok, good idea.
>
> Not sure I see why it needs to be an SRF. It only returns one row.
>

Good point, it could/should be a general function returning a composite
type.
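
For illustration, a rough sketch of that shape (names and columns are
invented here, not taken from the patch): one function returning a composite
row, wrapped in a view.

-- illustrative only; the real thing would be a C function reading the
-- archiver statistics, with whatever columns the patch defines
CREATE TYPE archiver_stats AS (
    archived_count     bigint,
    last_archived_wal  text,
    last_archived_time timestamptz,
    failed_count       bigint,
    stats_reset        timestamptz
);

CREATE FUNCTION pg_stat_get_archiver() RETURNS archiver_stats
    LANGUAGE sql STABLE
    AS $$ SELECT 0::bigint, NULL::text, NULL::timestamptz,
                 0::bigint, NULL::timestamptz $$;

CREATE VIEW pg_stat_archiver AS
    SELECT a.* FROM pg_stat_get_archiver() AS a;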

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Specifying both recovery_target_xid and recovery_target_time

2014-01-08 Thread Simon Riggs
On 8 January 2014 15:38, Heikki Linnakangas  wrote:
> The docs say:
>
>> At most one of recovery_target_time, recovery_target_name or
>> recovery_target_xid can be specified
>
>
> However, the code actually allows them all to be specified at the same time:
>
>> else if (strcmp(item->name, "recovery_target_name") == 0)
>> {
>> /*
>>  * if recovery_target_xid specified, then this
>> overrides
>>  * recovery_target_name
>>  */
>> if (recoveryTarget == RECOVERY_TARGET_XID)
>> continue;
>> recoveryTarget = RECOVERY_TARGET_NAME;
>
>
> The precedence is XID, time, name.
>
> I think the documented behavior would make more sense, ie. throw an error if
> you try to specify multiple targets. Anyone remember if that was
> intentional? Any objections to change the code to match the docs, in master?

It seems like I was grasping at some meaning but didn't quite achieve it.

Changing it to mean OR would make sense, but that would be more work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for pg_stat_archiver view

2014-01-08 Thread Simon Riggs
On 4 January 2014 13:01, Gabriele Bartolini
 wrote:

>> I'd suggest making the view on top of an SRF like pg_stat_replication
>> and pg_stat_activity (for example), instead of a whole lot of separate
>> function calls like the older stats views.
>
> Ok, good idea.

Not sure I see why it needs to be an SRF. It only returns one row.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Simon Riggs
On 8 January 2014 09:07, Heikki Linnakangas  wrote:

> I'm going to say right off the bat that I think the whole notion to
> automatically disable synchronous replication when the standby goes down is
> completely bonkers.

Agreed

We had this discussion across 3 months and we don't want it again.
This should not have been added as a TODO item.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gSoC - ADD MERGE COMMAND - code patch submission

2014-01-08 Thread taskov
Hello,
could you tell me where I can find the latest version of the MERGE PATCH
file? I need to use it on PostgreSQL 9.3.
I couldn't find it anywhere in git.

Regards,
Nikolay




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/gSoC-ADD-MERGE-COMMAND-code-patch-submission-tp1956415p5785822.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Changeset Extraction Interfaces

2014-01-08 Thread Andres Freund
On 2014-01-07 17:54:21 +0100, Andres Freund wrote:
> On 2013-12-12 16:49:33 +0100, Andres Freund wrote:
> > On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> > > On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund  
> > > wrote:
> > > > As far as I have been thinking of, this would be another catalog table 
> > > > like
> > > > pg_decoding_plugin(oid, dpname name, dpload regproc).
> > > 
> > > Instead of adding another catalog table, I think we should just define
> > > a new type.  Again, please look at the way that foreign data wrappers
> > > do this:
> > 
> > I don't really see what the usage of a special type has to do with this,
> > but I think that's besides your main point. What you're saying is that
> > the output plugin is just defined by a function name, possibly schema
> > prefixed. That has an elegance to it. +1
> 
> Ok, so I've implemented this, but I am not so sure it's sufficient,
> there's some issue:
> Currently a logical replication slot has a plugin assigned, previously
> that has just been identified by the basename of a .so. But with the
> above proposal the identifier is pointing to a function, currently via
> its oid. But what happens if somebody drops or recreates the function?
> We can't make pg_depend entries or anything since that won't work on a
> standby.
> Earlier, if somebody removed the .so we'd just error out, but pg's
> dependency tracking always only mattered to things inside the catalogs.
> 
> I see the following possible solutions for this:
> 
> 1) accept that fact, and throw an error if the function doesn't exist
> anymore, or has an unsuitable signature. We can check the return type of
> output_plugin_callbacks, so that's a pretty specific test.
> 
> 2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or
> similar) when there's a slot defined. But how'd that work if the slot is
> only defined on standbys? We could have the redo routine block and/or
> kill the slot if necessary?
> 
> 3) Don't assign a specific output plugin to a slot, but have it
> specified everytime data is streamed, not just when a slot is
> created. Currently that wouldn't be a problem, but I am afraid it will
> constrict some future optimizations.
> 
> Good ideas?

So, Robert and I had a IM discussion about this. Neither of us was
particularly happy about the proposed solutions.

So, what we've concluded is that using a function as the handler doesn't
work out well enough given the constraints (primarily the inability to
create dependency records on a HS node). We've concluded that the best
way forward is a variant of the current implementation where the output
plugin is specified as a dynamic library. Which is:
CREATE_REPLICATION_SLOT slot_name LOGICAL OUTPUT_PLUGIN library_name;
but in contrast to the current code where each individual output plugin
callback is dlsym()ed via a fixed function name, only a
_PG_output_plugin_init() function is looked up & called which fills out
a struct containing the individual callbacks. Additionally the "init"
and "cleanup" output plugin callbacks will be renamed to
startup/shutdown to avoid possible confusion.

This unfortunately still prohibits implementing output plugins within
the core postgres binary, but that can be solved by shipping
core-provided output plugins - should they ever exist - as shared
objects, like it's already done for libpqwalreceiver.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simple improvements to freespace allocation

2014-01-08 Thread Tom Lane
Heikki Linnakangas  writes:
> On 01/08/2014 08:56 AM, Simon Riggs wrote:
>> * IN-MEMORY
>> A large table may only have some of its blocks in memory. It would be
>> useful to force a NBT to be a block already in shared_buffers IFF a
>> table is above a certain size (use same threshold as seq scans, i.e.
>> 25% of shared_buffers). That may be difficult to achieve in practice,
>> so not sure about this one. Like it? Any ideas?

> Yeah, that seems nice, although I have a feeling that it's not worth the 
> complexity.

Not only would that be rather expensive to do, but I think it would be
self-defeating.  Pages that are in memory would be particularly likely
to have been modified by someone else recently, so that the FSM's info
about their available space is stale, and thus once you actually got
to the page it'd be more likely to not have the space you need.

The existing FSM algorithm is intentionally designed to hand out pages
that nobody else has tried to insert into lately, with one goal being to
minimize the number of retries needed because of stale info.  (Or at
least, it worked that way originally, and I don't think Heikki's rewrite
changed that aspect.)  I'm concerned that the alternatives Simon proposes
would lead to more processes ganging up on the same pages, with not only a
direct cost in concurrency but an indirect cost in repeated FSM searches
due to believing stale available-space data.  Indeed, a naive
implementation could easily get into infinite loops of handing back the
same page.

I realize that the point is exactly to sacrifice some insertion
performance in hopes of getting better table packing, but it's not clear
to me that there's an easy fix that makes packing better without a very
large hit on the other side.

Anyway, these fears could be proven or disproven with some benchmarking of
a trial patch, and I have no objection if Simon wants to do that
experimentation.  But I'd be hesitant to see such a feature committed
in advance of experimental proof that it's actually useful.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specifying both recovery_target_xid and recovery_target_time

2014-01-08 Thread Tom Lane
Heikki Linnakangas  writes:
> The docs say:
>> At most one of recovery_target_time, recovery_target_name or 
>> recovery_target_xid can be specified

> However, the code actually allows them all to be specified at the same time:

>> else if (strcmp(item->name, "recovery_target_name") == 0)
>> {
>> /*
>> * if recovery_target_xid specified, then this overrides
>> * recovery_target_name
>> */
>> if (recoveryTarget == RECOVERY_TARGET_XID)
>> continue;
>> recoveryTarget = RECOVERY_TARGET_NAME;

> The precedence is XID, time, name.

> I think the documented behavior would make more sense, ie. throw an 
> error if you try to specify multiple targets. Anyone remember if that 
> was intentional? Any objections to change the code to match the docs, in 
> master?

Hm.  I can see potential uses for specifying more than one if the behavior
were "OR", that is stop as soon as any of the specified conditions is
satisfied.

It looks like the actual behavior is to randomly choose one specified
mode and ignore the others, which I concur is bogus.  But maybe we
should try to do something useful instead of just throwing an error.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Specifying both recovery_target_xid and recovery_target_time

2014-01-08 Thread Heikki Linnakangas

The docs say:


At most one of recovery_target_time, recovery_target_name or 
recovery_target_xid can be specified


However, the code actually allows them all to be specified at the same time:


else if (strcmp(item->name, "recovery_target_name") == 0)
{
/*
 * if recovery_target_xid specified, then this overrides
 * recovery_target_name
 */
if (recoveryTarget == RECOVERY_TARGET_XID)
continue;
recoveryTarget = RECOVERY_TARGET_NAME;


The precedence is XID, time, name.

I think the documented behavior would make more sense, ie. throw an 
error if you try to specify multiple targets. Anyone remember if that 
was intentional? Any objections to change the code to match the docs, in 
master?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Heikki Linnakangas

On 01/08/2014 07:29 AM, Greg Stark wrote:

On Tue, Jan 7, 2014 at 11:36 AM, Heikki Linnakangas
 wrote:

Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is
incorrect, but I don't immediately see how that could cause the PANIC. Why
is the page uninitialized in the standby? If VACUUM is removing some dead
tuples from it, it certainly should exist and be correctly initialized.


Unless the vacuum subsequently truncated the file to be shorter and
the backup was taken after that?


In that case WAL replay should also see the truncation record before 
reaching consistency. We only PANIC on an uninitialized/missing page 
after reaching consistency, before that it's indeed normal if the file 
was later truncated or deleted.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Andres Freund
On 2014-01-07 22:42:59 -0200, Matheus de Oliveira wrote:
> @andres, if it is really removing backup_label it could also cause that
> other issue we saw on Monday, right? (yes I did run the same script).

It might be in your case since that's an easy to way to generate that
situation, but there have been several other reports of that bug, so
it's good that we've discussed it ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Andres Freund
On 2014-01-08 14:37:34 +0200, Heikki Linnakangas wrote:
> That seems to be a very common mistake to make. I wish we could do something
> about it. Do you think it would've helped in your case if there was a big
> fat warning in the beginning of backup_label, along the lines of: "# DO NOT
> REMOVE THIS FILE FROM A BACKUP" ? Any other ideas how we could've made it
> more obvious to the script author to not remove it?

I've been wondering about the possibility of setting a boolean in
checkpoint records indicating that a backup label needs to be used when
starting from that checkpoint. That boolean would only get checked when
using a recovery.conf and we've started with pg_control indicating that
it was written by a primary (i.e. state <= DB_SHUTDOWNING).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Heikki Linnakangas

On 01/08/2014 02:32 PM, Matheus de Oliveira wrote:

On Tue, Jan 7, 2014 at 10:42 PM, Matheus de Oliveira <
matioli.math...@gmail.com> wrote:


How did you set up the standby? Did you initialize it from an offline

backup of the master's data directory, perhaps? The log shows that the
startup took the "crash recovery first, then start archive recovery"
path, because there was no backup label file. In that mode, the standby
assumes that the system is consistent after replaying all the WAL in
pg_xlog, which is correct if you initialize from an offline backup or
atomic filesystem snapshot, for example. But "WAL contains references to
invalid pages" could also be a symptom of an inconsistent base backup,
caused by an incorrect backup procedure. In particular, I have to ask because
I've seen it before: you didn't delete backup_label from the backup, did
you?


Well, I cannot answer this right now, but it makes sense and is possible.


I've just confirmed. That was indeed the case, the script was removing the
backup_label. I've just removed this line and synced it again, it is
running nicely (for the past hour at least).


A-ha! ;-)


Thank you guys for all your help, and sorry for all the confusion I caused.


That seems to be a very common mistake to make. I wish we could do 
something about it. Do you think it would've helped in your case if 
there was a big fat warning in the beginning of backup_label, along the 
lines of: "# DO NOT REMOVE THIS FILE FROM A BACKUP" ? Any other ideas 
how we could've made it more obvious to the script author to not remove it?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in visibility map WAL-logging

2014-01-08 Thread Matheus de Oliveira
On Tue, Jan 7, 2014 at 10:42 PM, Matheus de Oliveira <
matioli.math...@gmail.com> wrote:

> How did you set up the standby? Did you initialize it from an offline
>> backup of the master's data directory, perhaps? The log shows that the
>> startup took the "crash recovery first, then start archive recovery"
>> path, because there was no backup label file. In that mode, the standby
>> assumes that the system is consistent after replaying all the WAL in
>> pg_xlog, which is correct if you initialize from an offline backup or
>> atomic filesystem snapshot, for example. But "WAL contains references to
>> invalid pages" could also be a symptom of an inconsistent base backup,
>> caused by an incorrect backup procedure. In particular, I have to ask because
>> I've seen it before: you didn't delete backup_label from the backup, did
>> you?
>>
>
> Well, I cannot answer this right now, but it makes sense and is possible.
>

I've just confirmed. That was indeed the case, the script was removing the
backup_label. I've just removed this line and synced it again, it is
running nicely (for the past hour at least).

Thank you guys for all your help, and sorry for all the confusion I caused.

Best regards,
-- 
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres


Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Andres Freund
On 2014-01-08 11:07:48 +0200, Heikki Linnakangas wrote:
> I'm going to say right off the bat that I think the whole notion to
> automatically disable synchronous replication when the standby goes down is
> completely bonkers. If you don't need the strong guarantee that your
> transaction is safe in at least two servers before it's acknowledged to the
> client, there's no point enabling synchronous replication in the first
> place.

I think that's likely caused by the misconception that synchronous
replication is synchronous in apply, not just in remote write/fsync. I have
now seen several sites that assumed that and set up sync rep with that
goal, so they could then query standbys instead of the primary after
the commit finished.
If that assumption were true, supporting a timeout that way would
possibly be helpful, but it is not atm...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extra_float_digits and casting from real to numeric

2014-01-08 Thread Christoph Berg
Re: Tom Lane 2014-01-07 <14979.1389112...@sss.pgh.pa.us>
> > But if extra_float_digits > 0 is set, I'd expect not only the float4
> > output to be affected by it, but also casts to other datatypes,
> 
> This proposal scares me.  extra_float_digits is strictly a matter of
> I/O representation, it does not affect any internal calculations.
> Moreover, since one of the fundamental attributes of type numeric
> is that it's supposed to give platform-independent results, I don't
> like the idea that you're likely to get platform-dependent results
> of conversions from float4/float8.

I forgot to mention one bit here, and that's actually what made me
think "wtf" and post here. The server log is of course also affected
by this, so you even get different parameters depending on
extra_float_digits, yet the numeric result is the same "bad" one:

2014-01-08 10:13:53 CET LOG:  execute : INSERT INTO s VALUES($1)
2014-01-08 10:13:53 CET DETAIL:  parameters: $1 = '1.2'
2014-01-08 10:14:18 CET LOG:  execute : INSERT INTO s VALUES($1)
2014-01-08 10:14:18 CET DETAIL:  parameters: $1 = '1.1797'

Of course this is all consistent and in practice sums up to "don't use
real/single"...
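
To spell the gotcha out (digit-for-digit output is platform-dependent, so it
is not shown here):

SET extra_float_digits = 3;
SELECT '1.2'::real;            -- text output changes: extra "noise" digits appear
SELECT '1.2'::real::numeric;   -- the cast result does not change; the setting
                               -- only affects the text representation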

> I think your customer got bit by his own bad coding practice, and
> that should be the end of it.

What about this patch to mention this gotcha more explicitly in the
documentation?

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
new file mode 100644
index 0386330..968f4a7
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
*** NUMERIC
*** 689,694 
--- 689,697 
0, the output is the same on every platform
supported by PostgreSQL.  Increasing it will produce output that
more accurately represents the stored value, but may be unportable.
+   Casts to other numeric datatypes and the to_char
+   function are not affected by this setting, it affects only the text
+   representation.
   
  
  

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] Standalone synchronous master

2014-01-08 Thread Heikki Linnakangas

On 11/13/2013 03:09 PM, Rajeev rastogi wrote:

This patch implements the following TODO item:

Add a new "eager" synchronous mode that starts out synchronous but reverts to 
asynchronous after a failure timeout period
This would require some type of command to be executed to alert administrators 
of this change.
http://archives.postgresql.org/pgsql-hackers/2011-12/msg01224.php

This patch implementation is in the same line as it was given in the earlier 
thread.
Some Of the additional important changes are:

1.   Have added two GUC variables to take commands from the user to be executed

a.   Master_to_standalone_cmd: To be executed before master switches to 
standalone mode.

b.  Master_to_sync_cmd: To be executed before master switches from standalone
mode back to sync mode.

2.   Master mode switch will happen only if the corresponding command 
executed successfully.

3.   Taken care of replication timeout to decide whether synchronous 
standby has gone down. i.e. only after expiry of

wal_sender_timeout, the master will switch from sync mode to standalone mode.

Please provide your opinion or any other expectation out of this patch.


I'm going to say right off the bat that I think the whole notion to 
automatically disable synchronous replication when the standby goes down 
is completely bonkers. If you don't need the strong guarantee that your 
transaction is safe in at least two servers before it's acknowledged to 
the client, there's no point enabling synchronous replication in the 
first place. If you do need it, then you shouldn't fall back to a 
degraded mode, at least not automatically. It's an idea that keeps 
coming back, but I have not heard a convincing argument why it makes 
sense. It's been discussed many times before, most recently in that 
thread you linked to.


Now that I got that out of the way, I concur that some sort of hooks or 
commands that fire when a standby goes down or comes back up makes 
sense, for monitoring purposes. I don't much like this particular 
design. If you just want to write log entry, when all the standbys are 
disconnected, running a shell command seems like an awkward interface. 
It's OK for raising an alarm, but there are many other situations where 
you might want to raise alarms, so I'd rather have us implement some 
sort of a generic trap system, instead of adding this one particular 
extra config option. What do people usually use to monitor replication?


There are two things we're trying to solve here: raising an alarm when 
something interesting happens, and changing the configuration to 
temporarily disable synchronous replication. What would be a good API to 
disable synchronous replication? Editing the config file and SIGHUPing 
is not very nice. There's been talk of an ALTER command to change the 
config, but I'm not sure that's a very good API either. Perhaps expose 
the sync_master_in_standalone_mode variable you have in your patch to 
new SQL-callable functions. Something like:


pg_disable_synchronous_replication()
pg_enable_synchronous_replication()

I'm not sure where that state would be stored. Should it persist 
restarts? And you probably should get some sort of warnings in the log 
when synchronous replication is disabled.


In summary, more work is required to design a good 
user/admin/programming interface. Let's hear a solid proposal for that, 
before writing patches.


BTW, calling an external command with system(), while holding 
SyncRepLock in exclusive-mode, seems like a bad idea. For starters, 
holding a lock will prevent a new WAL sender from starting up and 
becoming a synchronous standby, and the external command might take a 
long time to return.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch (v2) for updatable security barrier views

2014-01-08 Thread Craig Ringer
Dean,

Short version
-

Looks amazing overall. Very clever to zip up the s.b. quals, let the
rest of the rewriter and planner do their work normally, then unpack them
into subqueries inserted in the planner once inheritance appendrels are
expanded, etc.

My main concern is that the securityQuals appear to bypass all later
rewrite stages, inheritance expansion during planning, etc. I suspect
this might be hard to get around (because these are disembodied quals
which may have nonsense varnos), but I'm looking into it now.

There's also an assertion failure whenever a correlated subquery appears
as a security barrier view qual.  Again, looking at it.

Ideas on that issue?



Much longer version: My understanding of how it works
-

My understanding from reading the patch is that this:

- Flattens target views in rewriteTargetView, as in current master. If
the target view is a security barrier view, the view quals are appended
to a list of security barrier quals on the new RTE, instead of appended
to the RTE's normal quals like for normal views.

After rewrite the views are fully flattened down to a RTE_RELATION,
which becomes the resultRelation. An unreferenced RTE for each view
that's been rewritten is preserved in the range-table for permissions
checking purposes only (same as current master).

- Inheritance expansion, tlist expansion, etc. then occurs as normal.

- In planning, in inheritance_planner, if any RTE has any stashed
security quals in its RangeTableEntry, expand_security_qual is invoked.
This iteratively wraps the base relation in a subquery with the saved
security barrier quals, creating nested subqueries around the original
RTE. At each pass resultRelation is changed to point to the new
outer-most subquery.


As a result of this approach everything looks normal to
preprocess_targetlist, row-marking, etc, because they're seeing a normal
RTE_RELATION as resultRelation. The security barrier quals are, at this
stage, stashed aside. If there's inheritance involved, RTEs copied
during appendrel expansion get copies of the security quals on in the
parent RTE.

Problem with inheritance, views, etc in s.b. quals
--

After inheritance expansion, tlist expansion, etc, the s.b. quals are
unpacked to create subqueries wrapping the original RTEs.


So, with:

CREATE TABLE t1 (x float, b integer, secret1 text, secret2 text);
CREATE TABLE t1child (z integer) INHERITS (t1);

INSERT INTO t1 (x, b, secret1, secret2)
VALUES
(0,0,'secret0', 'supersecret'),
(1,1,'secret1', 'supersecret'),
(2,2,'secret2', 'supersecret'),
(3,3,'secret3', 'supersecret'),
(4,4,'secret4', 'supersecret'),
(5,6,'secret5', 'supersecret');

INSERT INTO t1child (x, b, secret1, secret2, z)
VALUES
(8,8,'secret8', 'ss', 8),
(9,9,'secret8', 'ss', 9),
(10,10,'secret8', 'ss', 10);

CREATE VIEW v1
WITH (security_barrier)
AS
SELECT b AS b1, x AS x1, secret1
FROM t1 WHERE b % 2 = 0;

CREATE VIEW v2
WITH (security_barrier)
AS
SELECT b1 AS b2, x1 AS x2
FROM v1 WHERE b1 % 4 = 0;



then a statement like:

UPDATE v2
SET x2 = x2 + 32;

will be rewritten into something like (imaginary sql)

UPDATE t1 WITH SECURITY QUALS ((b % 2 == 0), (b % 4 == 0))
SET x = x + 32

inheritance-expanded and tlist-expanded into something like (imaginary SQL)


UPDATE
 (t1 WITH SECURITY QUALS ((b % 2 == 0), (b % 4 == 0)))
 UNION ALL
 (t1child WITH SECURITY QUALS ((b % 2 == 0), (b % 4 == 0)))
SET x = x + 32;


after which security qual expansion occurs, giving us something like:


UPDATE
 t1, t1child <--- resultRelations
 (
SELECT v2.ctid, v2.*
FROM (
  SELECT v1.ctid, v1.*
  FROM (
SELECT t1.ctid, t1.*
FROM t1
WHERE b % 2 == 0
  ) v1
  WHERE b % 4 == 0
) v2

UNION ALL

SELECT v2.ctid, v2.*
FROM (
  SELECT v1.ctid, v1.*
  FROM (
SELECT t1child.ctid, t1child.*
FROM t1child
WHERE b % 2 == 0
  ) v1
  WHERE b % 4 == 0
) v2

 )
SET x = x + 32;


Giving a plan looking like:

EXPLAIN UPDATE v2 SET x2 = 32


QUERY PLAN
---
 Update on t1 t1_2  (cost=0.00..23.35 rows=2 width=76)
   ->  Subquery Scan on t1  (cost=0.00..2.18 rows=1 width=74)
 ->  Subquery Scan on t1_3  (cost=0.00..2.17 rows=1 width=74)
   Filter: ((t1_3.b % 4) = 0)
   ->  Seq Scan on t1 t1_4  (cost=0.00..2.16 rows=1 width=74)
 Filter: ((b % 2) = 0)
   ->  Subquery Scan on t1_1  (cost=0.00..21.17 rows=1 width=78)
 ->  Subquery Scan on t1_5  (cost=0.00..21.16 rows=1 width=78)
   Filter: ((t1_5.b % 4) = 0)
   ->  Seq Scan on t1child  (cost=0.00..21.10 rows=4 width=78)
 Filter: ((b % 2) = 0)
(11 rows)




So far this looks like a really clever approach. My only real concern is
that the security quals are curr

[HACKERS] Turning off HOT/Cleanup sometimes

2014-01-08 Thread Simon Riggs
VACUUM cleans up blocks, which is nice because it happens offline in a
lazy manner.

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

Effects of that are that long running statements often run much longer
than we want, increasing bloat as a result. It also produces wildly
varying response times, depending upon extent of cleanup required.

It is a simple task to make that behaviour optional on the master.

I propose a USERSET parameter, prune_cost_limit (<---insert better name here)
which will make the behaviour optional, default -1, in normal user
processes. VACUUM will ignore this parameter and so its actions will
never be deferred.

In detail, this parameter would disable pruning for any scan larger
than the cost limit. So large scans will disable the behaviour. The
default, -1, means never disable pruning, which is the current
behaviour.

We track the number of pages dirtied by the current statement. When
this reaches prune_cost_limit, we will apply these behaviours to all
shared_buffer block accesses...

(1) avoid running heap_page_prune_opt()

(2) avoid dirtying the buffer for hints. (This is safe because the
hinted changes will either be lost or will be part of the full page
image when we make a logged change).

(i.e. doesn't apply to temp tables)

For example, if we set prune_cost_limit = 4 this behaviour allows
small index lookups via bitmapheapscan to continue to cleanup, while
larger index and seq scans will avoid cleanup.



There would be a postgresql.conf parameter prune_cost_limit, as well
as a table level parameter that would prevent pruning except via
VACUUM.
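
Purely to visualise the interface (none of this syntax exists yet; it is just
one way the proposed knobs might be spelled):

SET prune_cost_limit = 4;                   -- proposed GUC, per session
ALTER TABLE big_history SET (prune = off);  -- hypothetical reloption: never
                                            -- prune this table except via VACUUM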

This will help in these ways
* Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries
* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

Prototype patch shows this is possible and simple enough for 9.4.
Major objections? Or should I polish up and submit?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Simple improvements to freespace allocation

2014-01-08 Thread Simon Riggs
On 8 January 2014 07:43, Heikki Linnakangas  wrote:
> On 01/08/2014 08:56 AM, Simon Riggs wrote:
>>
>> Current freespace code gives a new block insert target (NBT) from
>> anywhere in table. That isn't very useful with bigger tables and it
>> would be useful to be able to specify different algorithms for
>> producing NBTs.
>
>
> I've actually been surprised how little demand there has been for
> alternative algorithms. When I wrote the current FSM implementation, I
> expected people to start coming up with all kinds of wishes, but it didn't
> happen. There has been very few complaints, everyone seems to be satisfied
> with the way it works now. So I'm not convinced there's much need for this.

That would require someone to conduct detailed analysis on problems,
which few people are capable of doing at a level that we would accept.
No doubt I will soon be challenged to prove beyond doubt that anything
here is required, which becomes chicken and egg. For the vast majority
of cases, the general approach we have works well enough - this area
has had lots of very useful attention over the years.

The problem is tables have multiple use cases and we support only one,
with no easy way for people to experiment with alternatives in
production.

Its been on my list for years... but its not been a top priority, for sure.


>> ALTER TABLE foo WITH (freespace = );
>>
>> Three simple and useful models come to mind
>>
>> * CONCURRENT
>> This is the standard/current model. Naming it likes this emphasises
>> why we pick NBTs in the way we do.
>>
>> * PACK
>> We want the table to be smaller, so rather than run a VACUUM FULL we
>> want to force the table to choose an NBT at start of table, even at
>> the expense of concurrency. By avoiding putting new data at the top of
>> the table we allow the possibility that VACUUM will shrink table size.
>> This is same as current except we always reset the FSM pointer to zero
>> and re-seek from there. This takes some time to have an effect, but is
>> much less invasive than VACUUM FULL.
>
>
> We already reset the FSM pointer to zero on vacuum. Would the above actually
> make any difference in practice?

The Pack algo would emphasise tight packing over assigning concurrent
blocks. It would be useful if that also included not doing HOT updates
in favour of migrating rows to an earlier block in the table. Emphasis
on avoiding VACUUM FULL in certain cases, not for general use.


>> * RECENT
>> For large tables that are append-mostly use case it would be easier to
>> prefer NBTs from the last two 1GB segments of a table, allowing them
>> to be more easily cached. This is same as current except when we wrap
>> we don't go to block 0 we go to first block of penultimate (max - 1)
>> segment. For tables <= 2 segments this is no change from existing
>> algorithm. For larger tables it would focus updates/inserts into a
>> much reduced and yet still large area and allow better cacheing.
>
>
> Umm, wouldn't that bloat the table with no limit? Putting my DBA/developer
> hat on, I don't understand when I would want to use that setting.

That would depend on the use case; no algo suggested works for all or
even general cases.

If you have a large table, allocating freespace in blocks unlikely to
be accessed by queries means we introduce additional cache pressure.
If we allocate NBTs from newer blocks we will more likely find them in
cache.

Allowing older data to become write-seldom allows us to consider
things like compressing particular segments or moving them onto
cheaper storage.


(Another suggestion might be to use the VM so that we tend to add data
to already dirty blocks.)

> There's one policy that I'd like to see: maintaining cluster order. When
> inserting a new tuple, try to place it close to other tuples with similar
> keys, to keep the table clustered.

Agreed. Do you have a particular algorithm in mind? I can think of a few.

> In practice, CLUSTER CONCURRENTLY might be more useful, though.

I think we want both wholesale and retail. Certainly in the absence of
the former, the latter seems a good addition.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers