Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
> Amit Langote wrote:
>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>> Will it handle deadlocks across different table partitions. Consider
>>> a case as below:
>>>
>>> T1
>>> 1. Updates row R1 of T1 on shard S1
>>> 2. Updates row R2 of T2 on shard S2
>>>
>>> T2
>>> 1. Updates row R2 of T2 on shard S2
>>> 2. Updates row R1 of T1 on shard S1
> 
>> As long as shards are processed in the same order in different
>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>> a concern if parallel shard processing enters the scene. Am I missing
>> something?
> 
> That would only hold for a single query, right?
> 
> If 1. and 2. in the above example come from different queries within one
> transaction, you cannot guarantee that shards are processed in the same order.
> 
> So T1 and T2 could deadlock.
> 

Sorry, I failed to see why that would be the case. Could you elaborate?

Thanks,
Amit





Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Michael Paquier
On Wed, Sep 2, 2015 at 6:21 AM, Michael Paquier wrote:
> Yes, that's what I have been looking at actually by having some markers in 
> relcache.c. The reference count of this relation get incremented here:

So, I have been playing more with this code, and as mentioned by
Andres this test case goes twice through the PG_CATCH block in
pl_exec.c, causing the crash because the temporary relation does not
get dereferenced. I may be missing something, but isn't it a matter of
moving the switch back to the old memory context to the end of the block
so that we get a correct cleanup of the estate in the first block?
See for example the attached, which fixes the problem for me and passes
make check-world, though I expect Tom and Andres to jump in and say
that this is not correct within the next 12 hours :)
I haven't written a test case yet, but I think we could reproduce this
simply by having a relation referenced in the exception block of a
first function that calls a second function which itself raises an
exception, causing the referencing error.
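A rough, untested sketch of the shape I have in mind (function and
relation names are made up for illustration, and I haven't verified
that this exact form reproduces the crash):

CREATE FUNCTION raise_error() RETURNS void AS $$
BEGIN
  RAISE EXCEPTION 'error from second function';
END;
$$ LANGUAGE plpgsql;

CREATE FUNCTION outer_func() RETURNS void AS $$
BEGIN
  BEGIN
    CREATE TEMP TABLE tmp_rel ON COMMIT DROP AS SELECT 1 AS id;
    RAISE EXCEPTION 'force the first exception block';
  EXCEPTION WHEN OTHERS THEN
    -- the temporary relation is referenced inside the exception block...
    PERFORM count(*) FROM tmp_rel;
    -- ...and the second function raises, going through PG_CATCH again
    PERFORM raise_error();
  END;
END;
$$ LANGUAGE plpgsql;

SELECT outer_func();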
Regards,
-- 
Michael
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 935fa62..56084c3 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -1253,8 +1253,6 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 
 			/* Abort the inner transaction */
 			RollbackAndReleaseCurrentSubTransaction();
-			MemoryContextSwitchTo(oldcontext);
-			CurrentResourceOwner = oldowner;
 
 			/* Revert to outer eval_econtext */
 			estate->eval_econtext = old_eval_econtext;
@@ -1326,6 +1324,9 @@ exec_stmt_block(PLpgSQL_execstate *estate, PLpgSQL_stmt_block *block)
 ReThrowError(edata);
 			else
 FreeErrorData(edata);
+
+			MemoryContextSwitchTo(oldcontext);
+			CurrentResourceOwner = oldowner;
 		}
 		PG_END_TRY();
 



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Albe Laurenz
Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>> Amit Langote wrote:
 On 2015-09-02 PM 03:25, Amit Kapila wrote:
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
>
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
>
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1
>>>
 As long as shards are processed in the same order in different
 transactions, ISTM, this issue should not arise? I can imagine it becoming
 a concern if parallel shard processing enters the scene. Am I missing
 something?
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same 
>>> order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's
> right, T1 and T2 could deadlock, but I think we assume here to run
> transactions over shards *in an inherited manner*.

Yes, but does every update affect all shards?

If I say "UPDATE t1 SET col = 1 WHERE id = 42" and the row with id 42
happens to be on shard S1, the update would only affect that shard, right?

Now if "UPDATE t2 SET col = 1 WHERE id = 42" would only take place on
shard S2, and two transactions issue both updates in different order,
one transaction would be waiting for a lock on shard S1, while the other
would be waiting for a lock on shard S2, right?
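To spell out the interleaving I have in mind (just a sketch, assuming
the row with id 42 of t1 lives on shard S1 and that of t2 on shard S2):

-- session 1 (transaction T1)
BEGIN;
UPDATE t1 SET col = 1 WHERE id = 42;  -- locks the row on shard S1

-- session 2 (transaction T2)
BEGIN;
UPDATE t2 SET col = 1 WHERE id = 42;  -- locks the row on shard S2
UPDATE t1 SET col = 1 WHERE id = 42;  -- blocks, waiting for T1

-- session 1 again
UPDATE t2 SET col = 1 WHERE id = 42;  -- blocks, waiting for T2: deadlock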

But maybe I'm missing something fundamental.

Yours,
Laurenz Albe



Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-02 Thread Shulgin, Oleksandr
On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund  wrote:

> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
> > But I think it's quite wrong to assume that the infrastructure for
> > this is available and usable everywhere, because in my experience,
> > that's far from the case.
>
> Especially when the alternative is a rather short patch implementing an
> otherwise widely available feature.
>

But that won't actually help in the case described by Robert: if the master
server A has failed, the client has no idea whether B or C will become the
new master.

Unless it actually tries to connect to them in turn and checks the result
of pg_is_in_recovery().  I think that brings enough complexity to justify
keeping this outside of libpq.  Also think about all the extra flexibility
people will likely want to have: number of retries, delay between retries,
delay backoff, etc., to the point where we'll have to support some sort of
client-code retry_policy_callback.
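(For reference, the probe each candidate host would be asked to run is
simply the following; it returns false on the primary and true on a
standby that is still in recovery:)

SELECT pg_is_in_recovery();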

For read-only clients you might want to include a number of slave
hostnames and let the connector choose one, but then again you can't
really achieve load balancing on the client side; you're better off using
round-robin DNS.  To add or remove a slave you then only need to update
DNS, and not the configuration on all the clients.

For the master failover I think a common technique is to just move the
floating IP address from the old master to the new one.  This doesn't
require touching the DNS record.

--
Alex


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Shulgin, Oleksandr
On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
wrote:

>
>> But do we really need the slots mechanism?  Would it not be OK to just
>> let the LWLock do the sequencing of concurrent requests?  Given that we
>> only going to use one message queue per cluster, there's not much
>> concurrency you can gain by introducing slots I believe.
>>
>
> I'm afraid of problems in production. When a queue is tied to a particular
> process, any problems go away once that process ends. With one message
> queue per cluster, some pathological problems would require a cluster
> restart - and you cannot restart a production cluster for a week,
> sometimes weeks. The slots are more robust.
>

Yes, but in your implementation the slots themselves don't have a
queue/buffer.  Did you intend to have a message queue per slot?

What sort of pathological problems are you concerned about?  The
communicating backends should just detach from the message queue properly
and have some timeout configured to prevent deadlocks.  Other than that, I
don't see how having N slots really helps the problem: in case of
pathological problems you will just deplete them all sooner or later.

--
Alex


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 16:40, Amit Langote wrote:

On 2015-09-02 PM 04:07, Albe Laurenz wrote:

Amit Langote wrote:

On 2015-09-02 PM 03:25, Amit Kapila wrote:

Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?


That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.



Sorry, I failed to see why that would be the case. Could you elaborate?


I think Laurenz is assuming that the updates 1. and 2. in the above
transactions are performed *in a non-inherited manner*.  If that's
right, T1 and T2 could deadlock, but I think the assumption here is that
transactions over shards are run *in an inherited manner*.


Best regards,
Etsuro Fujita





Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-09-02 Thread Amit Langote

Peter,

On 2015-08-11 AM 07:37, Peter Geoghegan wrote:
> What I'm going to do is roll this into my own pending patch to fix the
> issue with wholerow vars, which is also down to a problem with the
> excluded targetlist initially generated by calling expandRelAttrs().
> Andres might want to take an alternative approach with that, so a
> consolidation makes sense.

Did you get around to making a patch for this?
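For reference, the kind of statement the subject line refers to looks
like this (hypothetical table t with a unique index on id):

INSERT INTO t (id, val) VALUES (1, 'x')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;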

Thanks,
Amit





Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-02 Thread Pavel Stehule
2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr 
:

> On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule 
> wrote:
>
>>
>>> But do we really need the slots mechanism?  Would it not be OK to just
>>> let the LWLock do the sequencing of concurrent requests?  Given that we
>>> only going to use one message queue per cluster, there's not much
>>> concurrency you can gain by introducing slots I believe.
>>>
>>
>> I'm afraid of problems in production. When a queue is tied to a particular
>> process, any problems go away once that process ends. With one message
>> queue per cluster, some pathological problems would require a cluster
>> restart - and you cannot restart a production cluster for a week,
>> sometimes weeks. The slots are more robust.
>>
>
> Yes, but in your implementation the slots themselves don't have a
> queue/buffer.  Did you intend to have a message queue per slot?
>

The message queue cannot be reused, so I expect one slot per caller to be
used for passing parameters; the message queue will be created/released on
demand by the caller.


>
> What sort of pathological problems are you concerned of?  The
> communicating backends should just detach from the message queue properly
> and have some timeout configured to prevent deadlocks.  Other than that, I
> don't see how having N slots really help the problem: in case of
> pathological problems you will just deplete them all sooner or later.
>

I'm afraid of unexpected problems :) - any part of signal handling or
multiprocess communication is fragile. Slots are simple and can simply be
attached to any process without having to allocate/free any memory.

Pavel


>
> --
> Alex
>
>


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 15:40, Amit Langote wrote:

On 2015-09-02 PM 03:25, Amit Kapila wrote:

On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 

The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.



Will it be able to do it for row level locks, row level locking occurs
during updation of a row, so will it be possible to ensure the order of
locks on rows?



Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?


Yeah, I think so, too.

Sorry, maybe my explanation above was not enough, but with the inherited
UPDATEs/DELETEs, the table modification is also ensured to be done in
the same order.  So, as Amit Langote said, both transactions will do the
updates in the same order.


Best regards,
Etsuro Fujita





Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Jim Nasby

On 9/1/15 11:59 PM, Michael Paquier wrote:

On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:

On 9/1/15 8:42 PM, Michael Paquier wrote:
The crash is triggered by having an exception raised in this particular call
stack. Since there's no syntax error in master/0.2.1, there's no assert
failure either. Were you running asserts?


I thought I was, but visibly it was mistaken. So, after re-rolling
configure, I have been able to reproduce the crash, and as far as I
can see all supported versions are impacted. I tested down to 9.1
where extensions were introduced, and the stack trace, as well as the
assertion failing are the same, similar to what Jim has reported. I am
just digging more into this thing now.


I just had a theory that reference counts were messed up because there 
was both a temp table and a real table with the same name. Turns out 
that's not the case, but I do now know that the assert is failing for 
the reference count on the temp table.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Michael Paquier
On Wed, Sep 2, 2015 at 6:13 AM, Jim Nasby  wrote:
> On 9/1/15 11:59 PM, Michael Paquier wrote:
>>
>> On Wed, Sep 2, 2015 at 12:59 PM, Jim Nasby wrote:
>>>
>>> On 9/1/15 8:42 PM, Michael Paquier wrote:
>>> The crash is triggered by having an exception raised in this particular
>>> call
>>> stack. Since there's no syntax error in master/0.2.1, there's no assert
>>> failure either. Were you running asserts?
>>
>>
>> I thought I was, but visibly it was mistaken. So, after re-rolling
>> configure, I have been able to reproduce the crash, and as far as I
>> can see all supported versions are impacted. I tested down to 9.1
>> where extensions were introduced, and the stack trace, as well as the
>> assertion failing are the same, similar to what Jim has reported. I am
>> just digging more into this thing now.
>
>
> I just had a theory that reference counts were messed up because there was
> both a temp table and a real table with the same name. Turns out that's
not
> the case, but I do now know that the assert is failing for the reference
> count on the temp table.

Yes, that's what I have been looking at, actually, by adding some markers in
relcache.c. The reference count of this relation gets incremented here:
LOG:  increment ref count relation: invoice_03, rd_refcnt: 1
CONTEXT:  SQL statement "
CREATE TEMP TABLE invoice_03 ON COMMIT DROP AS
WITH i AS (
  INSERT INTO invoice VALUES(
DEFAULT
, (tf.get( NULL::customer, 'insert' )).customer_id
, current_date
, current_date + 30
  ) RETURNING *
)
  SELECT *
FROM i
"
PL/pgSQL function tf.get(anyelement,text) line 29 at EXECUTE
PL/pgSQL function results_eq(refcursor,refcursor,text) line 11 at
FETCH
PL/pgSQL function results_eq(text,text,text) line 9 at assignment
STATEMENT:  SELECT results_eq(
  $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
  , $$VALUES( 1, 1, current_date, current_date + 30 )$$
  , 'invoice factory output'
);
And it gets cleared here without being decremented when cleaning up the
second exception:
LOG:  clear relation: invoice_03, rd_refcnt: 1
CONTEXT:  PL/pgSQL function results_eq(refcursor,refcursor,text) line 11
during exception cleanup
PL/pgSQL function results_eq(text,text,text) line 9 at assignment
STATEMENT:  SELECT results_eq(
  $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$
  , $$VALUES( 1, 1, current_date, current_date + 30 )$$
  , 'invoice factory output'
);
-- 
Michael


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Langote
On 2015-09-02 PM 03:25, Amit Kapila wrote:
> On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 
>>
>> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
>> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
>> UPDATE/DELETE does, because inheritance_planner guarantees that all
>> backends lock inheritance children in the same order to avoid needless
>> deadlocks.
>>
>>
> Will it be able to do it for row level locks, row level locking occurs
> during updation of a row, so will it be possible to ensure the order of
> locks on rows?
> 
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
> 
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
> 
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1
> 

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?

Thanks,
Amit





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Etsuro Fujita

On 2015/09/02 14:28, Amit Langote wrote:

On 2015-09-02 PM 01:28, Amit Kapila wrote:

On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas  wrote:

I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better.  On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize.  Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.



Will pushing down writes (Update/Delete) sufficient to maintain sane locking
behaviour and deadlock detection that can occur during writes on multiple
shards?  For example it could easily be the case where a single Update
statement could effect multiple shards and cause deadlock due to waits
across the nodes.  Now unless we have some distributed lock manager or
some other way to know the information of locks that happens across
shards, it could be difficult to detect deadlocks.



I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...


The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane 
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down 
UPDATE/DELETE does, because inheritance_planner guarantees that all 
backends lock inheritance children in the same order to avoid needless 
deadlocks.


Best regards,
Etsuro Fujita





Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Kapila
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita 
wrote:

> On 2015/09/02 14:28, Amit Langote wrote:
>
>> On 2015-09-02 PM 01:28, Amit Kapila wrote:
>>
>>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas 
 wrote:

> I'm not averse to making the "connect to the remote nodes" part of
> this solution use something other than the FDW infrastructure at some
> point in time if somebody's prepared to build something better.  On
> the other hand, I think it's extremely clear that the FDW
> infrastructure has a large amount of potential upon which we have
> thoroughly failed to capitalize.  Patches have already been written
> for UPDATE/DELETE pushdown and for join pushdown.
>

> Will pushing down writes (Update/Delete) sufficient to maintain sane
>>> locking
>>> behaviour and deadlock detection that can occur during writes on multiple
>>> shards?  For example it could easily be the case where a single Update
>>> statement could effect multiple shards and cause deadlock due to waits
>>> across the nodes.  Now unless we have some distributed lock manager or
>>> some other way to know the information of locks that happens across
>>> shards, it could be difficult to detect deadlocks.
>>>
>>
> I wonder if Ashutosh's atomic foreign transactions patch would address any
>> issues inherent in such cases...
>>
>
> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
> UPDATE/DELETE does, because inheritance_planner guarantees that all
> backends lock inheritance children in the same order to avoid needless
> deadlocks.
>
>
Will it be able to do that for row-level locks?  Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?

Will it handle deadlocks across different table partitions?  Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Multi-tenancy with RLS

2015-09-02 Thread Haribabu Kommi
On Fri, Aug 14, 2015 at 12:00 PM, Haribabu Kommi
 wrote:
>
> Here I attached the proof concept patch.

Here I have attached an updated patch that adds policies to most of the
system catalog tables, except the following:
AggregateRelationId

AccessMethodRelationId
AccessMethodOperatorRelationId
AccessMethodProcedureRelationId

AuthMemRelationId
CastRelationId
EnumRelationId
EventTriggerRelationId
ExtensionRelationId

LargeObjectRelationId
LargeObjectMetadataRelationId

PLTemplateRelationId
RangeRelationId
RewriteRelationId
TransformRelationId

TSConfigRelationId
TSConfigMapRelationId
TSDictionaryRelationId
TSParserRelationId
TSTemplateRelationId

The following catalog tables need a policy based on the object class, so
I have not added any policy for them yet:

SecLabelRelationId
SharedDependRelationId
SharedDescriptionRelationId
SharedSecLabelRelationId


If a user is granted any permission on an object, then that user can view
that object's metadata in the catalog tables.  To check the user's
permissions on the object, instead of checking each and every available
option, I added a new privilege check option called "any".  If the user
has any permission on the object, the corresponding permission check
function returns true.  A patch for that is attached as well.
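Conceptually, the policy on a catalog like pg_class amounts to something
like the following (this is just a sketch of the intent, not the literal
DDL from the patch, and it assumes the new "any" option is exposed through
has_table_privilege()):

CREATE POLICY tenant_catalog_visibility ON pg_class
    USING (has_table_privilege(current_user, oid, 'any'));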

Any thoughts/comments?

Regards,
Hari Babu
Fujitsu Australia


multi-tenancy_with_rls_poc_2.patch
Description: Binary data


any_privilege_check_option.patch
Description: Binary data



Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Albe Laurenz
Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1

> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.

Yours,
Laurenz Albe



Re: [HACKERS] creating extension including dependencies

2015-09-02 Thread Andres Freund
Hi,

I'm looking at committing this patch. I found some nitpick-level things
that I can easily fix up. But I dislike two things:

1) Passing the list of parents through the cascade DefElem strikes me as
incredibly ugly.

For one, the cascade option really should take a true/false type option
at the C level (so you can use defGetBoolean()); for another, passing the
list of parents through DefElem->arg seems wrong. You're supposed to be
able to copy parse nodes, and at the very least that's broken by this
approach.

2) I don't like the control flow around the schema selection.

It seems to be getting a bit arcane. How about instead moving the
"extension \"%s\" must be installed in schema \"%s\"" check into the
if (control->schema != NULL) block and checking for d_schema after it?
That should look cleaner.
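For context, the user-visible behaviour under review is, assuming the
CASCADE syntax used by the patch:

-- installs earthdistance and, because of CASCADE, its dependency cube too
CREATE EXTENSION earthdistance CASCADE;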

Greetings,

Andres Freund




[HACKERS] PSA: Upcoming Linux scheduler changes

2015-09-02 Thread Joshua D. Drake

Folks,

This is something we should be watching, and if people have time,
testing to see how it affects us:


http://lkml.iu.edu/hypermail/linux/kernel/1508.3/04818.html
--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane  wrote:
> But if you have in mind that typical FDWs would actually create join paths
> at that point, consider that
>
> 1. The FDW would have to find all the combinations of its supplied
> relations (unless you are only intending to generate one path for the
> union of all such rels, which seems pretty narrow-minded from here).

Well, if the remote end is another database server, presumably we can
leave it to optimize the query, so why would we need more than one
path?  I can see that we need more than one path because of sort-order
considerations, which would affect the query we ship to the remote
side.  But I don't see the point of considering multiple join orders
unless the remote end is dumber than our optimizer, which might be
true in some cases, but not if the remote end is PostgreSQL.

> 2. The FDW would have to account for join_is_legal considerations.

I agree with this.

> 3. The FDW would have to arrange for creation of joinrel RelOptInfo
> structures.  While that's possible, the available infrastructure for it
> assumes that joinrels are built up from pairs of simpler joinrels, so
> you couldn't go directly to the union of all the FDW's rels anyway.

And with this.

> So I still think that the most directly useful infrastructure here
> would involve, when build_join_rel() first creates a given joinrel,
> noticing whether both sides belong to the same foreign server and
> if so giving the FDW a callback to consider producing pushed-down
> joins.  That would be extremely cheap to do and it would not involve
> adding overhead for an FDW to discover what the valid sets of joins
> are.  In a large join problem, that's *not* going to be a cheap
> thing to duplicate.  If there are multiple FDWs involved, the idea
> that each one of them would do its own join search is particularly
> horrid.

So, the problem is that I don't think this entirely skirts the
join_is_legal issues, which are a principal point of concern for me.
Say this is a joinrel between (A B) and (C D E).  We need to generate
an SQL query for (A B C D E).  We know that the outermost syntactic
join can be (A B) to (C D E).  But how do we know which join orders
are legal as among (C D E)?  Maybe there's a simple way to handle this
that I'm not seeing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Alvaro Herrera
Andres Freund wrote:

> > As I recall, Andrew Dunstan has a module that
> > tests cross-version pg_upgrade and one thing he does is dump both and
> > compare; the problem is that there are differences, so he keeps a count
> > of how many lines he expect to differ between any two releases.
> 
> I'm not suggesting to do anything cross-release - that'd indeed be
> another ballpark.

Ah, you're right.

> Just that instead of a pg_dump test that tests some individual things we
> have one that tests the whole regression test output and then does a
> diff?

Sorry if I'm slow -- A diff against what?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Merlin Moncure
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
> It would be a bad idea to cling blindly to the FDW infrastructure if
> it's fundamentally inadequate to do what we want.  On the other hand,
> it would also be a bad idea to set about recreating it without a
> really good reason, and - just to take one example - the fact that it
> doesn't currently push down DML operations to the remote side is not a
> really good reason to rewrite the whole thing.  On the contrary, it's
> a reason to put some energy into the already-written patch which
> implements that optimization.

The problem with FDW for these purposes, as I see it, is that too much
intelligence is delegated to the implementer of the API.  There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after-the-fact processing.  This is exactly what citus does:
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that the aggregate can be correctly recombined from the per-shard
results.  To do this (distributed OLAP-type processing) right, the
planner needs to *know* that this table is in fact distributed and also
know that it can make SQL-compatible adjustments to the query.
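A concrete sketch of that rewrite (table and column names are
hypothetical, and foo is assumed to be numeric):

-- what the client asks the coordinator
SELECT avg(foo) FROM measurements;

-- what gets pushed down to each shard instead
SELECT sum(foo) AS sum_foo, count(foo) AS count_foo FROM measurements;

-- how the coordinator recombines the per-shard partial results
SELECT sum(sum_foo) / sum(count_foo) AS avg_foo FROM shard_results;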

This strikes me as a bit of a conflict of interest with FDW, which
seems to want to hide the fact that the table is foreign; the FDW
implementation makes its own optimization decisions, which might make
sense for single-table queries but break down in the face of joins.

merlin




Re: [HACKERS] creating extension including dependencies

2015-09-02 Thread Andres Freund
On 2015-09-02 17:27:38 +0200, Andres Freund wrote:
> 1) Passing the list of parents through the cascade DefElem strikes me as
> incredibly ugly.
> 
> For one the cascade option really should take a true/false type option
> on the C level (so you can do defGetBoolean()), for another passing
> through the list of parents via DefElem->arg seems wrong. You're
> supposed to be able to copy parsenodes and at the very least that's
> broken by the approach.

I think the fix here is to split off the bulk of CreateExtension() into
an internal function that takes additional parameters.




Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra

Hi

On 09/02/2015 03:53 PM, Andres Freund wrote:


Hi,

On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:

I didn't know that the thread must exists on -hackers to be able to add
a commitfest entry, so I transfer the thread here.


Please, in the future, also update the title of the thread to something
fitting.


@@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
    BitmapHeapScanState *scanstate;
    Relation    currentRelation;
+#ifdef USE_PREFETCH
+   int         new_io_concurrency;
+#endif

    /* check for unsupported flags */
    Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -598,6 +603,25 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
     */
    currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);

+#ifdef USE_PREFETCH
+   /* check if the effective_io_concurrency has been overloaded for the
+    * tablespace storing the relation and compute the target_prefetch_pages,
+    * or just get the current target_prefetch_pages
+    */
+   new_io_concurrency = get_tablespace_io_concurrency(
+       currentRelation->rd_rel->reltablespace);
+
+   scanstate->target_prefetch_pages = target_prefetch_pages;
+
+   if (new_io_concurrency != effective_io_concurrency)
+   {
+       double  prefetch_pages;
+
+       if (compute_io_concurrency(new_io_concurrency, &prefetch_pages))
+           scanstate->target_prefetch_pages = rint(prefetch_pages);
+   }
+#endif


Maybe it's just me - but imo there should be as few USE_PREFETCH
dependant places in the code as possible. It'll just be 0 when not
supported, that's fine?


It's not just you. Dealing with code with plenty of ifdefs is annoying - 
it compiles just fine most of the time, until you compile it with some 
rare configuration. Then it either starts producing strange warnings or 
the compilation fails entirely.


It might make a tiny difference on builds without prefetching support 
because of code size, but realistically I think it's noise.



Especially changing the size of externally visible structs depending
on a configure-detected ifdef seems wrong to me.


+100 to that


+   /*--
+* The user-visible GUC parameter is the number of drives (spindles),
+* which we need to translate to a number-of-pages-to-prefetch target.
+* The target value is stashed in *extra and then assigned to the actual
+* variable by assign_effective_io_concurrency.
+*
+* The expected number of prefetch pages needed to keep N drives busy 
is:
+*
+* drives |   I/O requests
+* ---+
+*  1 |   1
+*  2 |   2/1 + 2/2 = 3
+*  3 |   3/1 + 3/2 + 3/3 = 5 1/2
+*  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
+*  n |   n * H(n)


I know you just moved this code. But: I don't buy this formula. Like at
all. Doesn't queuing and reordering entirely invalidate the logic here?


Well, even the comment right after the formula says that:

* Experimental results show that both of these formulas aren't
* aggressive enough, but we don't really have any better proposals.

That's the reason why users generally either use 0 or some rather high
value (16 or 32 are the most common values I see). The problem is that we
don't really care about the number of spindles (and not just because SSDs
don't have them at all), but about the target queue length per device.
Spinning rust uses TCQ/NCQ to optimize the head movement; SSDs are
parallel by nature (stacking multiple chips with separate channels).


I doubt we can really improve the formula, except maybe by saying "we
want 16 requests per device" and multiplying the number of devices by
that. We don't really have the necessary introspection to determine better
values anyway, because the devices may be hidden behind a RAID controller
(or a SAN), so we can't really do much.

Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch".
Possibly with an explanation of how to estimate it (devices * queue
length).
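For reference, assuming the patch exposes this as a regular tablespace
option, the per-tablespace override would then be set roughly like this,
using the devices * queue length estimate (e.g. 2 devices * 16; fast_ssd
is a made-up tablespace name):

ALTER TABLESPACE fast_ssd SET (effective_io_concurrency = 32);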


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Alvaro Herrera
Andres Freund wrote:

> Isn't a full test with a separate initdb, create extension etc. a really
> heavyhanded way to test this? I mean that's a test where the setup takes
> up to 10s, whereas the actual runtime is in the millisecond range?

I spent some time looking over this patch yesterday, and that was one
concern I had too.  I was thinking that if we turn this into a single
module that can contain several extensions, or several pg_dump-related
tests, and then test multiple features without requiring an initdb for
each, it would be more palatable.

> Adding tests in this manner doesn't seem scalable to me.

I was thinking of having this renamed to src/test/modules/extensions/,
and the extension contained here renamed to ext001_fk_tables or something
like that; later we could add ext002_something for testing some other
angle of extensions, not necessarily pg_dump related.

> I think we should rather add *one* test that does dump/restore over the
> normal regression test database. Something similar to the pg_upgrade
> tests. And then work at adding more content to the regression test
> database - potentially sourcing from src/test/modules.

That's another option, but we've had this idea for many years now and it
hasn't materialized.  As I recall, Andrew Dunstan has a module that
tests cross-version pg_upgrade and one thing he does is dump both and
compare; the problem is that there are differences, so he keeps a count
of how many lines he expects to differ between any two releases.  Or
something like that.  While it's a good enough start, I don't think it's
robust enough to be in core.  How would we do it?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Oleg Bartunov
On Tue, Sep 1, 2015 at 7:08 PM, Robert Haas  wrote:

> On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
>  wrote:
> > My worry is that if we start implementing them again from scratch, it
> will
> > take a few years before we get them in a usable state. What XC/XL lacked
> is
> > probably a Robert Haas or a Tom Lane who could look at the work and
> suggest
> > major edits. If that had happened, the quality of the product could have
> > been much better today. I don't mean to derate the developers who worked
> on
> > XC/XL, but there is no harm in accepting that if someone with a much
> better
> > understanding of the whole system was part of the team, that would have
> > positively impacted the project. Is that an angle worth exploring? Does
> it
> > make sense to commit some more resources to say XC or XL and try to
> improve
> > the quality of the product even further? To be honest, XL is in far far
> > better shape (haven't really tried XC in a while) and some more
> QA/polishing
> > can make it production ready much sooner.
>
> From my point of view, and EnterpriseDB's point of view, anything that
> doesn't go into the core PostgreSQL distribution isn't really getting
> us where we need to be.  If there's code in XL that would be valuable
> to merge into core PostgreSQL, then let's do it.  If the code cannot
> be used but there are lessons we can learn that will make what does go
> into core PostgreSQL better, let's learn them.  However, I don't think
> it's serving anybody very well that we have the XC fork, and multiple
> forks of the XC fork, floating around out there and people are working
> on those instead of working on core PostgreSQL.  The reality is that
> we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
> different projects and have all of them be good.  The reality is,
> also, that horizontal scalability isn't an optional feature.  There
> was a point in time at which the PostgreSQL project's official policy
> on replication was that it did not belong in core.  That was a bad
> policy; thankfully, it was reversed, and the result was Hot Standby
> and Streaming Replication, incredibly important technologies without
> which we would not be where we are today. Horizontal scalability is
> just as essential.
>

Agree with you, Robert.

One lesson we learned from XL is that we need a testing framework for
clusters, so that any cluster project can at least pass functional and
performance testing. XL was very easy to break, and I'm wondering how many
corner cases still exist. We tried several other approaches, and while
reading the papers was fun, in practice we found many devilish details
which meant the papers remained just papers.



>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] WIP: About CMake v2

2015-09-02 Thread Noah Misch
On Tue, Sep 01, 2015 at 11:46:05AM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2015-09-01 10:32:39 -0400, Noah Misch wrote:
> >> A monolithic patch replacing the GNU make build system with a CMake build
> >> system sounds far too hard to write and review; we should expect to 
> >> maintain
> >> those two build systems in parallel for awhile.
> 
> > I don't buy this.
> 
> Me either.

Okay.  If the author, reviewer(s) and committer deliver a correct monolithic
patch, I won't have anything to complain about.  For the PostgreSQL hackers
not involved in this work, that outcome is better than my suggestion.




Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Andres Freund
On 2015-07-30 18:03:56 +0200, Fabien COELHO wrote:
> 
> >v6 is just a rebase after a bug fix by Andres Freund.
> >
> >Also a small question: The patch currently displays pgbench scripts
> >starting numbering at 0. Probably a little too geek... should start at 1?
> 
> v7 is a rebase after another small bug fix in pgbench.
> 
> -- 
> Fabien.

> diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
> index 2517a3a..99670d4 100644
> --- a/doc/src/sgml/ref/pgbench.sgml
> +++ b/doc/src/sgml/ref/pgbench.sgml
> @@ -261,6 +261,23 @@ pgbench  options  
> dbname
>  benchmarking arguments:
>  
>  
> + 
> +  -b scriptname[@weight]
> +  --builtin scriptname[@weight]
> +  
> +   
> +Add the specified builtin script to the list of executed scripts.
> +An optional integer weight after @ allows to adjust the
> +probability of drawing the test.
> +Available builtin scripts are: tpcb-like,
> +simple-update and select-only.
> +The provided scriptname needs only to be a prefix
> +of the builtin name, hence simp would be enough to select
> +simple-update.
> +   
> +  
> + 

Maybe add --builtin list to show them?

> @@ -404,10 +422,7 @@ pgbench  options  
> dbname
>--skip-some-updates
>
> 
> -Do not update pgbench_tellers and
> -pgbench_branches.
> -This will avoid update contention on these tables, but
> -it makes the test case even less like TPC-B.
> +Shorthand for -b simple-update@1.
> 
>
>   

> @@ -511,7 +526,7 @@ pgbench  options  
> dbname
>--select-only
>
> 
> -Perform select-only transactions instead of TPC-B-like test.
> +Shorthand for -b select-only@1.
> 
>
>   

I'm a bit inclined to remove these options.

>
> -   The default transaction script issues seven commands per transaction:
> +   Pgbench executes test scripts chosen randomly from a specified list.
> +   They include built-in scripts with -b and
> +   user-provided custom scripts with -f.
> +   Each script may be given a relative weight specified after a
> +   @ so as to change its drawing probability.
> +   The default weight is 1.
> + 

I'm wondering if percentages instead of weights would be a better
idea. That'd mean you'd be forced to be more careful when adding another
script (having to adjust the percentages of other scripts) but arguably
that's a good thing?

> +static SQLScript sql_script[MAX_SCRIPTS];
> +static struct {
> + char *name;   /* very short name for -b ...*/
> + char *desc;   /* short description */
> + char *script; /* actual pgbench script */
> +} builtin_script[]

Can't we put these in the same array?

> + printf("transaction type: %s\n",
> +num_scripts == 1? sql_script[0].name: "multiple scripts");

Seems like it'd be more useful to simply always list the scripts +
weights here.

Greetings,

Andres Freund




Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Andres Freund
On 2015-09-02 14:30:33 -0300, Alvaro Herrera wrote:
> I was thinking in having this be renamed src/test/modules/extensions/
> and then the extension contained here would be renamed ext001_fk_tables
> or something like that; later we could ext002_something for testing some
> other angle of extensions, not necessarily pg_dump related.

The largest dataset we have for this is the current regression test
database; it seems a waste not to include it...

> That's another option, but we've had this idea for many years now and it
> hasn't materialized.

But that's just minimally more complex than what's currently done in
that test and in the pg_upgrade test?

> As I recall, Andrew Dunstan has a module that
> tests cross-version pg_upgrade and one thing he does is dump both and
> compare; the problem is that there are differences, so he keeps a count
> of how many lines he expect to differ between any two releases.

I'm not suggesting to do anything cross-release - that'd indeed be
another ballpark.

Just that instead of a pg_dump test that tests some individual things we
have one that tests the whole regression test output and then does a
diff?

Greetings,

Andres Freund




Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 2, 2015 at 10:30 AM, Tom Lane  wrote:
>> But if you have in mind that typical FDWs would actually create join paths
>> at that point, consider that
>> 
>> 1. The FDW would have to find all the combinations of its supplied
>> relations (unless you are only intending to generate one path for the
>> union of all such rels, which seems pretty narrow-minded from here).

> Well, if the remote end is another database server, presumably we can
> leave it to optimize the query, so why would we need more than one
> path?

If you have say 5 relations in the query, 3 of which are foreign, it might
make sense to join all 3 at the remote end, or maybe you should only join
2 of them remotely because it's better to then join to one of the local
rels before joining the last remote rel.  Even if you claim that that
would never make sense from a cost standpoint (a claim easily seen to be
silly), there might not be any legal way to join all 3 directly because of
join order constraints.
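A sketch of the kind of query I mean, with f1, f2, f3 foreign tables on
the same server and l1, l2 local tables (all names made up):

SELECT *
  FROM f1
  JOIN f2 ON f1.a = f2.a
  JOIN l1 ON f2.b = l1.b
  JOIN f3 ON l1.c = f3.c
  JOIN l2 ON f3.d = l2.d;

Here it may well be cheaper to push down only the f1/f2 join and join f3
locally after l1 has reduced the row count, rather than shipping a single
f1/f2/f3 join to the remote server.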

The larger point is that we can't expect the remote server to be fully
responsible for optimizing, because it will know nothing of what's being
done on our end.

> I can see that we need more than one path because of sort-order
> considerations, which would affect the query we ship to the remote
> side.  But I don't see the point of considering multiple join orders
> unless the remote end is dumber than our optimizer, which might be
> true in some cases, but not if the remote end is PostgreSQL.

(1) not all remote ends are Postgres, (2) the remote end doesn't have any
access to info about our end.

> So, the problem is that I don't think this entirely skirts the
> join_is_legal issues, which are a principal point of concern for me.
> Say this is a joinrel between (A B) and (C D E).  We need to generate
> an SQL query for (A B C D E).  We know that the outermost syntactic
> join can be (A B) to (C D E).  But how do we know which join orders
> are legal as among (C D E)?  Maybe there's a simple way to handle this
> that I'm not seeing.

Well, if the joins get built up in the way I think should happen, we'd
have already considered (C D E), and we could have recorded the legal join
orders within that at the time.  (I imagine that we should allow FDWs to
store some data within RelOptInfo structs that represent foreign joins
belonging entirely to them, so that there'd be a handy place to keep that
data till later.)  Or we could trawl through the paths associated with the
child joinrel, which will presumably include instances of every reasonable
sub-join combination.  Or the FDW could look at the SpecialJoinInfo data
and determine things for itself (or more likely, ask join_is_legal about
that).

In the case of postgres_fdw, I think the actual requirement will be to be
able to reconstruct a SQL query that correctly expresses the join; that
is, we need to send over something like "from c left join d on (...) full
join e on (...)", not just "from c, d, e", or we'll get totally bogus
estimates as well as bogus execution results.  Offhand I think that the
most likely way to build that text will be to examine the query's jointree
to see where c,d,e appear in it.  But in any case, that's a separate issue
and I fail to see how plopping the join search problem into the FDW's lap
would make it any easier.
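That is, for something like (c left join d) full join e, the deparsed
query needs to preserve the join structure, roughly the first form below
rather than the second (c, d, e being the foreign tables; the join
conditions are made up for illustration):

-- preserves the syntactic join structure, so remote results match
SELECT c.id, d.x, e.y
  FROM c LEFT JOIN d ON (c.id = d.id)
         FULL JOIN e ON (d.id = e.id);

-- a flat FROM list is an inner join and would return different rows
SELECT c.id, d.x, e.y
  FROM c, d, e
 WHERE c.id = d.id AND d.id = e.id;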

regards, tom lane




Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Andres Freund
On 2015-08-02 19:15:58 -0700, Michael Paquier wrote:
> +psql 'postgres', 'CREATE EXTENSION tables_fk';
> +
> +# Insert some data before running the dump, this is needed to check
> +# consistent data dump of tables with foreign key dependencies
> +psql 'postgres', 'INSERT INTO cc_tab_fkey VALUES (1)';
> +psql 'postgres', 'INSERT INTO bb_tab_fkey VALUES (1)';
> +psql 'postgres', 'INSERT INTO aa_tab_fkey VALUES (1)';
> +
> +# Create a table depending on a FK defined in the extension
> +psql 'postgres', 'CREATE TABLE dd_tab_fkey (id int REFERENCES 
> bb_tab_fkey(id))';
> +psql 'postgres', 'INSERT INTO dd_tab_fkey VALUES (1)';
> +
> +# Take a dump then re-deploy it to a new database
> +command_ok(['pg_dump', '-d', 'postgres', '-f', "$tempdir/dump.sql"],
> +'taken dump with installed extension');
> +command_ok(['createdb', 'foobar1'], 'created new database to redeploy dump');
> +command_ok(['psql', '--set=ON_ERROR_STOP=on', '-d', 'foobar1', '-f',
> + "$tempdir/dump.sql"], 'dump restored');
> +
> +# And check its content
> +my $result = psql 'foobar1',
> + q{SELECT id FROM aa_tab_fkey UNION ALL
> +SELECT id FROM bb_tab_fkey UNION ALL
> +SELECT id FROM cc_tab_fkey UNION ALL
> +SELECT id FROM dd_tab_fkey};
> +is($result, qq(1
> +1
> +1
> +1),
> + 'consistency of data restored');
> diff --git a/src/test/modules/tables_fk/tables_fk--1.0.sql 
> b/src/test/modules/tables_fk/tables_fk--1.0.sql
> new file mode 100644
> index 000..e424610
> --- /dev/null
> +++ b/src/test/modules/tables_fk/tables_fk--1.0.sql
> @@ -0,0 +1,20 @@
> +/* src/test/modules/tables_fk/tables_fk--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION tables_fk" to load this file. \quit
> +
> +CREATE TABLE IF NOT EXISTS cc_tab_fkey (
> + id int PRIMARY KEY
> +);
> +
> +CREATE TABLE IF NOT EXISTS bb_tab_fkey (
> + id int PRIMARY KEY REFERENCES cc_tab_fkey(id)
> +);
> +
> +CREATE TABLE IF NOT EXISTS aa_tab_fkey (
> + id int REFERENCES bb_tab_fkey(id)
> +);
> +
> +SELECT pg_catalog.pg_extension_config_dump('aa_tab_fkey', '');
> +SELECT pg_catalog.pg_extension_config_dump('bb_tab_fkey', '');
> +SELECT pg_catalog.pg_extension_config_dump('cc_tab_fkey', '');
> diff --git a/src/test/modules/tables_fk/tables_fk.control 
> b/src/test/modules/tables_fk/tables_fk.control

Isn't a full test with a separate initdb, create extension etc. a really
heavy-handed way to test this? I mean, that's a test where the setup takes
up to 10s, whereas the actual runtime is in the millisecond range.

Adding tests in this manner doesn't seem scalable to me.

I think we should rather add *one* test that does dump/restore over the
normal regression test database. Something similar to the pg_upgrade
tests. And then work at adding more content to the regression test
database - potentially sourcing from src/test/modules.

Greetings,

Andres Freund




Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-02 Thread Andres Freund
On 2015-08-09 12:47:53 +1200, David Rowley wrote:
> I took a bit of weekend time to finish this one off. Patch attached.
> 
> A quick test shows a pretty good performance increase:
> 
> create table ts (ts timestamp not null);
> insert into ts select generate_series('2010-01-01 00:00:00', '2011-01-01
> 00:00:00', '1 sec'::interval);
> vacuum freeze ts;
> 
> Master:
> david=# copy ts to 'l:/ts.sql';
> COPY 31536001
> Time: 20444.468 ms
> 
> Patched:
> david=# copy ts to 'l:/ts.sql';
> COPY 31536001
> Time: 10947.097 ms

Yes, that's pretty cool.

> There's probably a bit more to squeeze out of this.
> 1. EncodeDateTime() could return the length of the string to allow callers
> to use pnstrdup() instead of pstrdup(), which would save the strlen() call.
> 2. Have pg_int2str_zeropad() and pg_int2str() skip appending the NUL char
> and leave this up to the calling function.
> 3. Make something to replace the sprintf(str, " %.*s", MAXTZLEN, tzn); call.
> 
> Perhaps 1 and 3 are worth looking into, but I think 2 is likely too error
> prone for the small gain we'd get from it.

I'm inclined to first get the majority of the optimization - as in
something similar to the current patch. If we then feel a need to
optimize further we can do that. Otherwise we might end up not getting
the 95% performance improvement in 9.6 because we're playing with the
remaining 5% ;)


> Also I was not too sure on if pg_int2str() was too similar to pg_ltoa().
> Perhaps pg_ltoa() should just be modified to return the end of string?

I don't think the benefits are worth breaking pg_ltoa interface.

>  /*
> - * Append sections and fractional seconds (if any) at *cp.
> + * Append seconds and fractional seconds (if any) at *cp.
>   * precision is the max number of fraction digits, fillzeros says to
>   * pad to two integral-seconds digits.
>   * Note that any sign is stripped from the input seconds values.
> + * Note 'precision' must not be a negative number.
>   */
> -static void
> +static char *
>  AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
>  {
> +#ifdef HAVE_INT64_TIMESTAMP

Wouldn't it be better to just normalize fsec to an integer in that case?
Afaics that's the only remaining reason for the alternative path?
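
For what it's worth, a minimal sketch of that normalization (untested,
and the variable name is assumed) could look like:

#ifdef HAVE_INT64_TIMESTAMP
	int32		usec = (int32) fsec;	/* already integer microseconds */
#else
	int32		usec = (int32) rint(fsec * 1000000.0);	/* normalize */
#endif

after which the digit-emitting code would only need a single path.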

> +/*
> + * pg_int2str
> + *   Converts 'value' into a decimal string representation of the 
> number.
> + *
> + * Caller must ensure that 'str' points to enough memory to hold the result
> + * (at least 12 bytes, counting a leading sign and trailing NUL).
> + * Return value is a pointer to the new NUL terminated end of string.
> + */
> +char *
> +pg_int2str(char *str, int32 value)
> +{
> + char *start;
> + char *end;
> +
> + /*
> +  * Handle negative numbers in a special way. We can't just append a '-'
> +  * prefix and reverse the sign as on two's complement machines negative
> +  * numbers can be 1 further from 0 than positive numbers, we do it this 
> way
> +  * so we properly handle the smallest possible value.
> +  */
> + if (value < 0)
> + {

We could alternatively handle this by special-casing INT_MIN, probably
resulting in a bit less duplicated code.
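A rough sketch of that special case (untested; assumes a PG_INT32_MIN-style
constant is available):

char *
pg_int2str(char *str, int32 value)
{
	char	   *start;
	char	   *end;

	/*
	 * INT32_MIN cannot be negated in an int32, so handle it up front and
	 * keep a single code path for every other value.
	 */
	if (value == PG_INT32_MIN)
	{
		memcpy(str, "-2147483648", 12);		/* includes trailing NUL */
		return str + 11;
	}

	if (value < 0)
	{
		*str++ = '-';
		value = -value;
	}

	/* emit digits in reverse order, then reverse them in place */
	start = str;
	do
	{
		*str++ = '0' + (char) (value % 10);
		value /= 10;
	} while (value > 0);

	end = str;
	*str = '\0';

	while (start < --str)
	{
		char		swap = *start;

		*start++ = *str;
		*str = swap;
	}

	return end;
}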
>  
>  /*
>   *   Per-opclass comparison functions for new btrees.  These are
> diff --git a/src/tools/msvc/build.pl b/src/tools/msvc/build.pl
> index e107d41..1e2dd62 100644
> --- a/src/tools/msvc/build.pl
> +++ b/src/tools/msvc/build.pl
> @@ -61,7 +61,7 @@ elsif ($buildwhat)
>  }
>  else
>  {
> - system("msbuild pgsql.sln /verbosity:detailed /p:Configuration=$bconf");
> + system("msbuild pgsql.sln /verbosity:quiet /p:Configuration=$bconf");
>  }

Uh? I assume that's an accidental change?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-09-02 Thread Andres Freund
Hi,

I don't know too much about gist, but did a quick read through. Mostly
spotting some stylistic issues. Please fix those to make it easier for
the next reviewer.

> *** a/src/backend/access/gist/gist.c
> --- b/src/backend/access/gist/gist.c
> ***
> *** 36,42  static bool gistinserttuples(GISTInsertState *state, 
> GISTInsertStack *stack,
>bool unlockbuf, bool unlockleftchild);
>   static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
>   GISTSTATE *giststate, List *splitinfo, bool 
> releasebuf);
> ! 

>   #define ROTATEDIST(d) do { \
>   SplitedPageLayout 
> *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
> --- 36,42 
>bool unlockbuf, bool unlockleftchild);
>   static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
>   GISTSTATE *giststate, List *splitinfo, bool 
> releasebuf);
> ! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
>   
>   #define ROTATEDIST(d) do { \
>   SplitedPageLayout
>   *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \

Newline removed.

> + /*
> +  * If leaf page is full, try at first to delete dead tuples. And then
> +  * check again.
> +  */
> + if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))

superfluous parens around is_split
> + /*
> +  * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
> +  * told us were killed.
> +  *
> +  * We match items by heap TID before mark them. If an item has moved off
> +  * the current page due to a split, we'll fail to find it and do nothing
> +  * (this is not an error case --- we assume the item will eventually get
> +  * marked in a future indexscan).
> +  *
> +  * We re-read page here, so it's significant to check page LSN. If page
> +  * has been modified since the last read (as determined by LSN), we dare not
> +  * flag any antries because it is possible that the old entry was vacuumed
> +  * away and the TID was re-used by a completely different heap tuple.

s/significant/important/?.
s/If page/If the page/
s/dare not/cannot/

> +  */
> + static void
> + gistkillitems(IndexScanDesc scan)
> + {
> + GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
> + Buffer  buffer;
> + Pagepage;
> + OffsetNumber minoff;
> + OffsetNumber maxoff;
> + int i;
> + boolkilledsomething = false;
> + 
> + Assert(so->curBlkno != InvalidBlockNumber);
> + 
> + buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
> + if (!BufferIsValid(buffer))
> + return;
> + 
> + LockBuffer(buffer, GIST_SHARE);
> + gistcheckpage(scan->indexRelation, buffer);
> + page = BufferGetPage(buffer);
> + 
> + /*
> +  * If page LSN differs it means that the page was modified since the 
> last read.
> +  * killedItemes could be not valid so LP_DEAD hints applying is not 
> safe.
> +  */
> + if(PageGetLSN(page) != so->curPageLSN)
> + {
> + UnlockReleaseBuffer(buffer);
> + so->numKilled = 0; /* reset counter */
> + return;
> + }
> + 
> + minoff = FirstOffsetNumber;
> + maxoff = PageGetMaxOffsetNumber(page);
> + 
> + maxoff = PageGetMaxOffsetNumber(page);

duplicated line.

> + for (i = 0; i < so->numKilled; i++)
> + {
> + if (so->killedItems != NULL)
> + {
> + OffsetNumber offnum = FirstOffsetNumber;
> + 
> + while (offnum <= maxoff)
> + {

This nested loop deserves a comment.

> + ItemId  iid = PageGetItemId(page, 
> offnum);
> + IndexTuple  ituple = (IndexTuple) 
> PageGetItem(page, iid);
> + 
> + if (ItemPointerEquals(&ituple->t_tid, 
> &(so->killedItems[i])))
> + {
> + /* found the item */
> + ItemIdMarkDead(iid);
> + killedsomething = true;
> + break;  /* out of inner search 
> loop */
> + }
> + offnum = OffsetNumberNext(offnum);
> + }
> + }
> + }

I know it's the same approach nbtree takes, but if there's a significant
number of deleted items this takes me as a rather suboptimal
approach. The constants are small, but this still essentially is O(n^2).
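
One way to avoid that - a sketch only, untested, reusing the names from
the patch - would be to sort the killed TIDs once and then make a single
pass over the page, probing with bsearch():

static int
gist_itemptr_cmp(const void *a, const void *b)
{
	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

	...
	qsort(so->killedItems, so->numKilled,
		  sizeof(ItemPointerData), gist_itemptr_cmp);

	for (offnum = FirstOffsetNumber; offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		iid = PageGetItemId(page, offnum);
		IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);

		if (bsearch(&ituple->t_tid, so->killedItems, so->numKilled,
					sizeof(ItemPointerData), gist_itemptr_cmp) != NULL)
		{
			ItemIdMarkDead(iid);
			killedsomething = true;
		}
	}

That turns O(n * m) into O((n + m) log n), though with such small
constants it may not matter much in practice.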

> ***
> *** 451,456  getNextNearest(IndexScanDesc scan)
> --- 553,575 
>   
>   if (scan->xs_itup)
>   {
> + /*
> +  * If previously returned index tuple is not visible save it 
> into
> +  * so->killedItems. And at the end of the page scan 

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
> 
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication.  We don't have that at
present (sync rep is whole-node, and BDR is asynchronous).  There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node.  In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>>
> 
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication?  I didn't think
there was.

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly.  But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>>
> 
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly!

Let me explain why I'm so vocal on this point.  PostgresXC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened).  I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
   2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems.  Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> Maybe the best thing we can do is just completely abandon the "number of
> spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> with an explanation of how to estimate it (devices * queue length).

I think that'd be a lot better.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending clean up is not interruptable

2015-09-02 Thread Andres Freund
On 2015-08-12 11:59:48 -0700, Jeff Janes wrote:
> Attached patch does it that way.  There was also a free-standing
> CHECK_FOR_INTERRUPTS() which had no reason that I could see not be a
> vacuum_delay_point, so I changed that one as well.

I think we should backpatch this - any arguments against?

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:43 PM, and...@anarazel.de  wrote:
> Why a new tranche for each of these? And it can't be correct that each
> has the same base?

I complained about the same-base problem before.  Apparently, that got ignored.

> I don't really like the tranche model as in the patch right now. I'd
> rather have it in a way where we have one tranche for all the individual
> lwlocks, where the tranche points to an array of names alongside the
> tranche's name. And then for the others we just supply the tranche name,
> but leave the name array empty, so that a name can be generated.

That's an interesting idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 11:41 AM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
>> Even if it's only on paper, any new sharding design needs to address
>> these questions:
>>
>> 1. How do we ensure no/minimal data is lost if we lose a node?
>> 2. How do we replace a lost node (without taking the cluster down)?
>>2. a. how do we allow an out-of-sync node to "catch up"?
>> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
>> 4. How do we add nodes to expand the cluster?
>>
>> There doesn't need to be code for all of the above from version 0.1, but
>> there needs to be a plan to tackle those problems.  Otherwise, we'll
>> just end up with another dead-end, not-useful-in-production technology.
> 
> This is a good point, and I think I agree with it.  Let me make a few
> observations:
> 
> 1. None of this stuff matters very much when the data is strictly
> read-only. 

Yep.

> 2. None of this stuff matters when you only have one copy of the data.
> Your system is low-availability, but you just don't care for whatever
> reason. 

Uh-huh.

> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.

For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader).  So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".

> But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
 I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Robert Haas
On Wed, Aug 26, 2015 at 3:37 AM, Michael Paquier
 wrote:
> On Wed, Aug 26, 2015 at 4:35 PM, Michael Paquier
>  wrote:
>> Only HEAD is impacted, and attached is a patch to fix the problem.
>
> Actually this version is better, I forgot to update a comment.

For so long as this test suite is not run by 'make check-world' or by
the buildfarm, it's likely to keep getting broken, and we're likely to
keep not noticing.  I realize that the decision to exclude this from
'make check-world' was deliberately made for security reasons, but
that doesn't take anything away from the maintenance headache.

Still, that's not a reason not to commit this, so done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Wed, Sep 2, 2015 at 11:32 PM, Fujii Masao wrote:
> Did you confirm that replication user can complete pg_rewind
> after this patch is applied?

Yes.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 5:55 PM, Andres Freund  wrote:
> On 2015-09-02 14:36:51 -0400, Robert Haas wrote:
>> On Wed, Sep 2, 2015 at 2:20 PM, Fabien COELHO  wrote:
>> >> I'm wondering if percentages instead of weights would be a better
>> >> idea. That'd mean you'd be forced to be more careful when adding another
>> >> script (having to adjust the percentages of other scripts) but arguably
>> >> that's a good thing?
>> >
>> > If you use only percent, then you have to check that the total is 100,
>> > probably you have to use floats, to do something when the total is not 100,
>> > checking would complicate the code and test people mental calculus
>> > abilities. Not sure this is a good idea:-)
>>
>> I agree.  I don't see a reason to enforce that the total of the
>> weights must be 100.
>
> I'm slightly worried that using weights will be a bit confusing because
> adding another script will obviously reduce the frequency of already
> defined scripts. But it's probably not worth worrying.

That sounds like a feature to me, not a bug.  I wouldn't worry.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Stephen Frost
Michael,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  
> wrote:
> > Michael Paquier wrote:
> >> As of now, file access functions in genfile.c can only be used by
> >> superusers. This proposal is to relax those functions so as
> >> replication users can use them as well. Here are the functions aimed
> >> by this patch:
> >> - pg_stat_file
> >> - pg_read_binary_file
> >> - pg_read_file
> >> - pg_ls_dir
> >> The main argument for this change is that pg_rewind makes use of those
> >> functions, forcing users to use a superuser role when rewinding a
> >> node.
> >
> > Can this piggyback on Stephen Frost's proposed patch for reserved roles et 
> > al?
> 
> (Adding Stephen in the loop)

Thanks!

> I thought that this was rather independent on Stephen's default role
> patch because this is not based on a new role permission type.
> Stephen, do you think so, or should we merge the effort with your
> things?

It's not directly related to the default roles as it's adding
permissions to an existing role attribute control, but it's certainly
covering related territory and I'd like to make sure we're careful with
any increase in permissions for existing role attributes.

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Michael Paquier  writes:
> > On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  
> > wrote:
> >> Michael Paquier wrote:
> >>> The main argument for this change is that pg_rewind makes use of those
> >>> functions, forcing users to use a superuser role when rewinding a
> >>> node.
> 
> >> Can this piggyback on Stephen Frost's proposed patch for reserved roles et 
> >> al?
> 
> > (Adding Stephen in the loop)
> > I thought that this was rather independent on Stephen's default role
> > patch because this is not based on a new role permission type.
> > Stephen, do you think so, or should we merge the effort with your
> > things?
> 
> Just on general principles, this seems like a pretty horrid idea.

Agreed- it's far broader than what is necessary, as I understand it.

> To me replication privilege means the ability to transfer data out of
> the master, not to cause arbitrary state changes on the master.

Agreed, but as Andres points out, this doesn't allow state changes on
the master.

> Therefore, "allow replication users to run pg_rewind" seems less like
> a feature and more like a security bug.  Am I missing something?

I believe Andres clarified this- this is about remastering an old master
to follow a *new* master, and at this point, the old master needs
information from the new master to reset the old master system back to a
point where the old master can turn into a replica of the new master.

> Another problem with it is that access to the filesystem is about halfway
> to being superuser anyway, even if it's only read-only access.  It would
> certainly let you read things that one would not expect a replication
> user to be able to read (like files unrelated to Postgres).

The replication role already has read-only access to everything
(nearly?) in the PGDATA directory.  The specific issue in this case, I
believe, is if the functions mentioned provide access beyond what the
replication role already has access to.

> I don't entirely understand why pg_rewind should be invoking any of these
> functions to begin with.  Isn't it working directly with the disk data,
> with both postmasters shut down?

Certain information is needed from the new master to rewind the old
master.  What I would hope is that we'd provide a way for *just* that
information to be made available to the replication role rather than
completely general access through the functions mentioned.

* Andres Freund (and...@anarazel.de) wrote:
> On 2015-09-02 19:48:15 -0400, Tom Lane wrote:
> > Just on general principles, this seems like a pretty horrid idea.
> > To me replication privilege means the ability to transfer data out of
> > the master, not to cause arbitrary state changes on the master.
> 
> It's not about the permission to trigger pg_rewind on the master - it's
> about being able to run pg_rewind (as the necessary OS user) on the
> *standby* when the connection to the primary has only replication rather
> than superuser privs.

Understood.

> > Another problem with it is that access to the filesystem is about halfway
> > to being superuser anyway, even if it's only read-only access.  It would
> > certainly let you read things that one would not expect a replication
> > user to be able to read (like files unrelated to Postgres).
> 
> Doesn't pg_read_file et al insist that the path is below the data
> directory?

I don't believe that's the case, actually..  I might be wrong though,
I'm not looking at the code right now.  This is definitely a big part of
the question, but I'd like to ask- what, exactly, does pg_rewind
actually need access to?  Is it only the WAL, or are heap and WAL files
needed?  How much 

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure  wrote:
> On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas  wrote:
>> It would be a bad idea to cling blindly to the FDW infrastructure if
>> it's fundamentally inadequate to do what we want.  On the other hand,
>> it would also be a bad idea to set about recreating it without a
>> really good reason, and - just to take one example - the fact that it
>> doesn't currently push down DML operations to the remote side is not a
>> really good reason to rewrite the whole thing.  On the contrary, it's
>> a reason to put some energy into the already-written patch which
>> implements that optimization.
>
> The problem with FDW for these purposes as I see it is that too much
> intelligence is relegated to the implementer of the API.  There needs
> to be a mechanism so that the planner can rewrite the remote query and
> then do some after the fact processing.  This exactly what citus does;
> if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
> so that aggregation can be properly weighted to the result.   To do
> this (distributed OLAP-type processing) right, the planner needs to
> *know* that this table is in fact distributed and also know that it
> can make SQL compatible adjustments to the query.
>
> This strikes me as a bit of a conflict of interest with FDW which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes it's own optimization decisions which might make
> sense for single table queries but breaks down in the face of joins.

Well, I don't think that ALL of the logic should go into the FDW.  In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both
parallel and distributed queries get the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> Hm. Is a compiler test actually testing anything reliably here? Won't this
> just throw a warning during compile time about an unknown function?

I'll need to look into that.

>> +/*
>> + * Prefetch support -- Support memory prefetching hints on some platforms.
>> + *
>> + * pg_rfetch() is specialized for the case where an array is accessed
>> + * sequentially, and we can prefetch a pointer within the next element (or 
>> an
>> + * even later element) in order to hide memory latency.  This case involves
>> + * prefetching addresses with low temporal locality.  Note that it's rather
>> + * difficult to get any kind of speedup using pg_rfetch();  any use of the
>> + * intrinsic should be carefully tested.  Also note that it's okay to pass 
>> it
>> + * an invalid or NULL address, although it's best avoided.
>> + */
>> +#if defined(USE_MEM_PREFETCH)
>> +#define pg_rfetch(addr)  __builtin_prefetch((addr), 0, 0)
>> +#endif
>
> What is that name actually supposed to mean? 'rfetch' doesn't seem to be
> particularly clear - or is it a meant to be a wordplay combined with the
> p?

"Read fetch". One argument past to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.

> I don't think you should specify the read/write and locality parameters
> if we don't hand-pick them - right now you're saying the memory will
> only be read and that it has no temporal locality.
>
> I think there should be a empty fallback definition even if the current
> only caller ends up not needing it - not all callers will require it.
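
A minimal sketch of such a fallback, reusing the macro name from the
patch (an illustration only, not something the patch currently does):

#ifdef USE_MEM_PREFETCH
#define pg_rfetch(addr)	__builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr)	((void) 0)
#endif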

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

>> + /*
>> +  * Perform memory prefetch of tuple 
>> that's three places
>> +  * ahead of current (which is returned 
>> to caller).
>> +  * Testing shows that this 
>> significantly boosts the
>> +  * performance for TSS_INMEM "forward" 
>> callers by
>> +  * hiding memory latency behind their 
>> processing of
>> +  * returned tuples.
>> +  */
>> +#ifdef USE_MEM_PREFETCH
>> + if (readptr->current + 3 < 
>> state->memtupcount)
>> + 
>> pg_rfetch(state->memtuples[readptr->current + 3]);
>> +#endif
>
> Hm. Why do we need a branch here? The target of prefetches don't need to
> be valid addresses and adding a bunch of arithmetic and a branch for the
> corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid". Also, for
cases that don't benefit (like datum sorts, which never have a "tuple
proper" -- the above code is for tuplestore, not tuplesort, so you
can't see this), it seemed faster to have the branch than to rely on
__builtin_prefetch() being forgiving about invalid/wild pointers. I
saw a regression without the branch for that case.

> What worries me about adding explicit prefetching is that it's *highly*
> platform and even micro-architecture dependent. Why is looking three
> elements ahead the right value?

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch. But right now, time spent in
_bt_buildadd() is massively reduced with the patch, which makes it
worthwhile in my mind. It's really very effective if you look at the
only part of (say) a CREATE INDEX that is affected one way or another
-- the final fetching of tuples.

BTW, I am very close to posting a patch that makes (say) CREATE INDEX
very fast for external sorts. That uses a memory pool to make final
on-the-fly merge memory access sequential per tape. That's what I'll
do rather than explicitly prefetching for external sorts. It makes a
huge difference on top of the external sort 

Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 8:59 AM, Andres Freund  wrote:
> On 2015-09-02 19:48:15 -0400, Tom Lane wrote:
>> Just on general principles, this seems like a pretty horrid idea.
>> To me replication privilege means the ability to transfer data out of
>> the master, not to cause arbitrary state changes on the master.
>
> It's not about the permission to trigger pg_rewind on the master - it's
> about being able to run pg_rewind (as the necessary OS user) on the
> *standby* when the connection to the primary has only replication rather
> than superuser privs.

Yeah, I got poked by this limitation of pg_rewind some time ago
internally actually, folks willing to be able to manage their cluster
only with a replication role, and they were not really willing to have
a superuser for such operations being used across the network.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Andres Freund
On 2015-09-03 01:59:13 +0200, Tomas Vondra wrote:
> That's a bit surprising, especially considering that e_i_c=30 means ~100
> pages to prefetch if I'm doing the math right.
> 
> AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 pages
> should not make a difference), 256 for SAS drives and ~1000 for most current
> RAID controllers.

I think the point is that after an initial buildup we'll again only
prefetch pages in smaller increments, because we have already prefetched
most pages. The actual number of concurrent reads will then be determined
by how fast pages are processed.  A prefetch_target of 100 does *not*
mean 100 requests are continuously in flight. And even if it did, the
OS won't issue many more requests than the queue depth, so they'll just
sit in the OS queue.

> So instead of "How many blocks I need to prefetch to saturate the devices?"
> you're asking "How many blocks I need to prefetch to never actually wait for
> the I/O?"

Yes, pretty much.
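
Put a bit more formally: if processing one heap page takes t_cpu and a
prefetched read completes after t_io, the prefetch distance d only has
to satisfy

    d \cdot t_{\mathrm{cpu}} \;\gtrsim\; t_{\mathrm{io}}
    \quad\Longrightarrow\quad
    d \;\gtrsim\; t_{\mathrm{io}} / t_{\mathrm{cpu}},

independently of the number of spindles.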

> I do like this view, but I'm not really sure how we could determine the
> right value. It seems to be very dependent on hardware and workload.

Right.

> For spinning drives the speedup comes from optimizing random seeks to a more
> optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the parallel
> channels (and possibly faster access to the same block).

+ OS reordering & coalescing.

Don't forget that the OS processes the OS IO queues while the userland
process isn't scheduled - in a concurrent workload with more processes
than hardware threads that means that pending OS requests are being
issued while the query isn't actively being processed.

> I guess the best thing we could do at this level is simply keep the
> on-device queues fully saturated, no?

Well, being too aggressive can hurt throughput and latency of concurrent
processes, without being beneficial.

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 5:31 PM, Joe Conway  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 09/02/2015 05:25 PM, Tom Lane wrote:
>> Robert Haas  writes:
>>> But I'm not sure I like the idea of adding a server dependency on
>>> the ability to exec pg_controldata.  That seems like it could be
>>> unreliable at best, and a security vulnerability at worst.
>>
>> I hadn't been paying attention --- the proposed patch actually
>> depends on exec'ing pg_controldata?  That's horrid!  There is no
>> expectation that that's installed.
>
> No it doesn't. I'm confused :-/

No, I'm confused.  Sorry.  Somehow I misread your patch.

Pay no attention to that man behind the curtain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 2:30 AM, Alvaro Herrera wrote:
> Andres Freund wrote:
>> Isn't a full test with a separate initdb, create extension etc. a really
>> heavyhanded way to test this? I mean that's a test where the setup takes
>> up to 10s, whereas the actual runtime is in the millisecond range?
>
> I spent some time looking over this patch yesterday, and that was one
> concern I had too.  I was thinking that if we turn this into a single
> module that can contain several extensions, or several pg_dump-related
> tests, and then test multiple features without requiring an initdb for
> each, it would be more palatable.

Yeah, sure. That's a statement you could generalize to all the TAP
tests, see for example the pg_rewind tests.

>> Adding tests in this manner doesn't seem scalable to me.
>
> I was thinking of having this be renamed src/test/modules/extensions/
> and then the extension contained here would be renamed ext001_fk_tables
> or something like that; later we could add ext002_something for testing some
> other angle of extensions, not necessarily pg_dump related.

So, if I am following correctly, what you are suggesting is:
1) create src/test/modules/extensions/, and include fk_tables in it for now
2) offer the possibility to run make check in this path, doing
complicated tests on a single instance, pg_dump on an instance being one
of them.
Note that for now the TAP test machinery does not allow sharing a
single cluster instance across multiple perl scripts, and it seems to me
that's what you mean here (in my opinion that's out of scope for
this patch, and I don't think we should do it either; it would make
error handling a nightmare). Alternatively, we could have a single TAP
script doing all the tests on a single instance. So which one are you
suggesting?

The patch proposed here is actually doing the second, initializing an
instance and performing pg_dump on it. If we change it your way we
should just rename the test script to something like
001_test_extensions.pl and mention in it that new tests for extensions
should be included in it. Still, I would imagine that each extension
is actually going to need slightly different test patterns to work or
cover what it is intended to cover. See for example the set of
tests added in the CREATE EXTENSION CASCADE patch: those are not
adapted to run with the TAP machinery. I may be of course wrong, we
may have other extensions in the future using that, still I am seeing
none around for now.

If you are worrying about run time, we could as well have make check
add an EXTRA_INSTALL pointing to src/test/modules/extensions/, with an
additional test, say extensions.sql that creates data based on it.
Then we let pg_upgrade do the dump/restore checks. This way the run
time of make check-world is minimally impacted, and the code paths we
want tested are done by the buildfarm. That's ugly, still it won't
impact performance.

>> I think we should rather add *one* test that does dump/restore over the
>> normal regression test database. Something similar to the pg_upgrade
>> tests. And then work at adding more content to the regression test
>> database - potentially sourcing from src/test/modules.
>
> That's another option, but we've had this idea for many years now and it
> hasn't materialized.  As I recall, Andrew Dunstan has a module that
> tests cross-version pg_upgrade and one thing he does is dump both and
> compare; the problem is that there are differences, so he keeps a count
> of how many lines he expect to differ between any two releases.  Or
> something like that.  While it's a good enough start, I don't think it's
> robust enough to be in core.  How would we do it?

That does not seem plausible to me for integration in core: you
would need to keep switching between branches in a given git tree if
the test is being done within a git repository, and it would simply
not be doable from a raw tarball, which stores just one given version.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 12:38 AM, Andres Freund wrote:
> Isn't a full test with a separate initdb, create extension etc. a really
> heavyhanded way to test this? I mean that's a test where the setup takes
> up to 10s, whereas the actual runtime is in the millisecond range?
>
> Adding tests in this manner doesn't seem scalable to me.

How to include this kind of test has been at the heart of the discussion
since this patch showed up. I think we are now close to 5
different opinions from 4 different hackers on the matter; Riggs'
theorem applies nicely.

> I think we should rather add *one* test that does dump/restore over the
> normal regression test database. Something similar to the pg_upgrade
> tests. And then work at adding more content to the regression test
> database - potentially sourcing from src/test/modules.

If you are worrying about run time, pg_upgrade already does exactly
that. What would be the point of doubling the amount of time to do the
same thing in two different places?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Jim Nasby

On 9/2/15 2:56 PM, Jim Nasby wrote:

On 9/2/15 2:17 PM, Alvaro Herrera wrote:

Michael Paquier wrote:


I haven't written yet a test case but I think that we could reproduce
that simply by having a relation referenced in the exception block of
a first function, calling a second function that itself raises an
exception, causing the referencing error.


Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?


I had attempted one but the issue is that it's more than just an
exception block thing. If it was that simple then we'd still get the
crash without involving pgTap. So I suspect you need to have a named
cursor in the mix as well.

Let me make another attempt at something simpler.


After some time messing around with this I've been able to remove pgTap 
from the equation using the attached crash.sql (relevant snippet below).


This only works if you pass a refcursor into the function. It doesn't 
work if you do


OPEN have FOR EXECUTE $$SELECT * FROM tf.get( NULL::invoice, 'base' )$$;

inside the function instead. This code also produces different results; 
it outputs the error message before crashing and the stack trace shows 
the assert is now failing while trying to abort the top-level 
transaction. So it looks like the scenario is:


BEGIN;
DECLARE (open?) cursor that calls function with exception handler that 
calls function with exception handler that calls something that fails


The double set of exceptions seems to be critical; if instead of calling 
tf.get(invoice) (which recursively does tf.get(customer)) I define the 
cursor to call tf.get(customer) directly there's no assert.


I can poke at this more tomorrow, but it'd be very helpful if someone 
could explain this failure mode, as I'm basically stumbling in the dark 
on a test case right now.


CREATE OR REPLACE FUNCTION a(
have refcursor
) RETURNS void LANGUAGE plpgsql AS $body$
DECLARE
have_rec record;
BEGIN
FETCH have INTO have_rec;
END
$body$;
DECLARE h CURSOR FOR SELECT * FROM tf.get( NULL::invoice, 'base' );
SELECT a(
  'h'::refcursor
);


(lldb) bt
* thread #1: tid = 0x722ed8, 0x7fff92a5a866 
libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', 
stop reason = signal SIGABRT
  * frame #0: 0x7fff92a5a866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff9001b35c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x7fff8cf89b1a libsystem_c.dylib`abort + 125
frame #3: 0x00010cdb4039 
postgres`ExceptionalCondition(conditionName=0x00010cf59cfd, 
errorType=0x00010cec392d, fileName=0x00010cf59045, lineNumber=1972) + 
137 at assert.c:54
frame #4: 0x00010cd9b332 
postgres`RelationClearRelation(relation=0x00011658f110, rebuild='\0') + 162 
at relcache.c:1970
frame #5: 0x00010cd9cc0f 
postgres`AtEOSubXact_cleanup(relation=0x00011658f110, isCommit='\0', 
mySubid=15, parentSubid=1) + 79 at relcache.c:2665
frame #6: 0x00010cd9cb92 
postgres`AtEOSubXact_RelationCache(isCommit='\0', mySubid=15, parentSubid=1) + 
242 at relcache.c:2633
frame #7: 0x00010c9b6e88 postgres`AbortSubTransaction + 440 at 
xact.c:4373
frame #8: 0x00010c9b7208 postgres`AbortCurrentTransaction + 312 at 
xact.c:2947
frame #9: 0x00010cc249f3 postgres`PostgresMain(argc=1, 
argv=0x7fa3f4800378, dbname=0x7fa3f4800200, 
username=0x7fa3f30031f8) + 1875 at postgres.c:3902
frame #10: 0x00010cb9da48 postgres`PostmasterMain [inlined] BackendRun 
+ 8024 at postmaster.c:4155
frame #11: 0x00010cb9da22 postgres`PostmasterMain [inlined] 
BackendStartup at postmaster.c:3829
frame #12: 0x00010cb9da22 postgres`PostmasterMain [inlined] ServerLoop 
at postmaster.c:1597
frame #13: 0x00010cb9da22 postgres`PostmasterMain(argc=, 
argv=) + 7986 at postmaster.c:1244
frame #14: 0x00010cb218cd postgres`main(argc=, 
argv=) + 1325 at main.c:228
frame #15: 0x7fff8e9a35fd libdyld.dylib`start + 1

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
BEGIN;
\i test/helpers/tap_setup.sql

CREATE EXTENSION test_factory;
SET search_path=tap;
\i test/helpers/create.sql

SELECT tf.register(
  'customer'
  , array[
row(
  'insert'
  , $$INSERT INTO customer VALUES (DEFAULT, 'first', 'last' ) RETURNING *$$
)::tf.test_set
, row(
  'function'
  , $$SELECT * FROM customer__add( 'func first', 'func last' )$$
)::tf.test_set
  ]
);
SELECT tf.register(
  'invoice'
  , array[
  row(
'insert'
, $$INSERT INTO invoice VALUES(
DEFAULT
, (tf.get( NULL::customer, 'insert' )).customer_id
, current_date
, current_date + 30
  ) RETURNING *$$
  )::tf.test_set
  , row(
'function'
, $$INSERT INTO invoice VALUES(
  

Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-09-02 16:23:13 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund  wrote:
> >> Well, still needs to work for tuplestore, which does not have a SortTuple.
> >
> > Isn't it even more trivial there? It's just an array of void*'s? So
> > prefetch(state->memtuples + 3 + readptr->current)?
> 
> All I meant is that there couldn't be one centralized definition for
> both. I don't mind if you want to bake it into a macro for
> tupelstore.c and tuplesort.c.

I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
readptr->current) and the corresponding version for tuplesort in the
callsites?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  wrote:
> Michael Paquier wrote:
>> Hi all,
>>
>> As of now, file access functions in genfile.c can only be used by
>> superusers. This proposal is to relax those functions so as
>> replication users can use them as well. Here are the functions aimed
>> by this patch:
>> - pg_stat_file
>> - pg_read_binary_file
>> - pg_read_file
>> - pg_ls_dir
>> The main argument for this change is that pg_rewind makes use of those
>> functions, forcing users to use a superuser role when rewinding a
>> node.
>
> Can this piggyback on Stephen Frost's proposed patch for reserved roles et al?

(Adding Stephen in the loop)
I thought that this was rather independent on Stephen's default role
patch because this is not based on a new role permission type.
Stephen, do you think so, or should we merge the effort with your
things?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 4:26 PM, Andres Freund  wrote:
> I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
> readptr->current) and the corresponding version for tuplesort in the
> callsites?

Oh, I see. Maybe I'll do it that way when I pick this up in a little
while. I need to consider the "no tuple proper" (pass-by-value datum
sort) case within tuplesort.c, where stup.tuple is always NULL.
Currently, we discriminate against such SortTuples for various other
reasons (e.g. for memory accounting) by testing if the pointer is set,
so I suppose I copied that pattern here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Tomas Vondra



On 09/03/2015 12:23 AM, Andres Freund wrote:

On 2015-09-02 14:31:35 -0700, Josh Berkus wrote:

On 09/02/2015 02:25 PM, Tomas Vondra wrote:


As I explained, spindles have very little to do with it - you need
multiple I/O requests per device, to get the benefit. Sure, the DBAs
should know how many spindles they have and should be able to determine
optimal IO depth. But we actually say this in the docs:


My experience with performance tuning is that values above 3 have no
real effect on how queries are executed.


I saw pretty much the opposite - the benefits seldomly were
significant below 30 or so. Even on single disks.


That's a bit surprising, especially considering that e_i_c=30 means ~100 
pages to prefetch if I'm doing the math right.


AFAIK queue depth for SATA drives generally is 32 (so prefetching 100 
pages should not make a difference), 256 for SAS drives and ~1000 for 
most current RAID controllers.
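
(For reference, the formula in question is the coupon-collector
expectation, which expands effective_io_concurrency = n into a prefetch
distance of

    n \cdot H_n \;=\; n \sum_{k=1}^{n} \frac{1}{k},

so n = 30 gives 30 * H_30, roughly 120 pages - in the same ballpark as
the estimate above.)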


I'm not entirely surprised that values beyond 30 make a difference, but 
I am surprised that you seldom saw significant improvements below this value.


No doubt there are workloads like that, but I'd expect them to be quite 
rare, and not as prevalent as you're claiming.



Which actually isn't that surprising - to be actually beneficial
(that is, turn an IO into a CPU bound workload) the prefetched buffer
needs to actually have been read in by the time its needed. In many
queries processing a single heap page takes far shorter than
prefetching the data from storage, even if it's on good SSDs.

>

Therefore what you actually need is a queue of prefetches for the
next XX buffers so that between starting a prefetch and actually
needing the buffer enough time has passed that the data is
completely read in. And the point is that that's the case even for a
single rotating disk!


So instead of "How many blocks I need to prefetch to saturate the 
devices?" you're asking "How many blocks I need to prefetch to never 
actually wait for the I/O?"


I do like this view, but I'm not really sure how we could determine the 
right value. It seems to be very dependent on hardware and workload.


For spinning drives the speedup comes from optimizing random seeks to a 
more optimal path (thanks to NCQ/TCQ), and on SSDs thanks to using the 
parallel channels (and possibly faster access to the same block).


I guess the best thing we could do at this level is simply keep the 
on-device queues fully saturated, no?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Andres Freund
On 2015-09-02 19:48:15 -0400, Tom Lane wrote:
> Just on general principles, this seems like a pretty horrid idea.
> To me replication privilege means the ability to transfer data out of
> the master, not to cause arbitrary state changes on the master.

It's not about the permission to trigger pg_rewind on the master - it's
about being able to run pg_rewind (as the necessary OS user) on the
*standby* when the connection to the primary has only replication rather
than superuser privs.

> Another problem with it is that access to the filesystem is about halfway
> to being superuser anyway, even if it's only read-only access.  It would
> certainly let you read things that one would not expect a replication
> user to be able to read (like files unrelated to Postgres).

Doesn't pg_read_file et al insist that the path is below the data
directory?

> I don't entirely understand why pg_rewind should be invoking any of these
> functions to begin with.  Isn't it working directly with the disk data,
> with both postmasters shut down?

There are two modes: if both data directories are on the same server, both
need to be shut down. If, as is pretty commonly the case, the "source"
db is on another system, the source system must be started (as a
primary). In the second mode all the files that have changed are copied
over a libpq connection.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Andres Freund
On 2015-09-02 17:14:12 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan  wrote:
> > On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> >> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> >> Hm. Is a compiler test actually testing anything reliably here? Won't this
> >> just throw a warning during compile time about an unknown function?
> >
> > I'll need to look into that.
> 
> My error here was not considering why the existing __builtin_bswap32()
> test worked alright with AC_COMPILE_IFELSE(), while my test needed
> AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
> failure (not just success) when writing compiler tests in the future.
> 
> AC_LINK_IFELSE() appears to work fine following a straight
> substitution. When I get around to revising this patch in a while,
> that fix will be included.

Be sure to also test with optimizations enabled - often tests can get
optimized away if you don't force the result to be used (or volatile) :/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 5:22 AM, Robert Haas wrote:
> Still, that's not a reason not commit this, so done.

Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 6:56 PM, Bruce Momjian  wrote:
> On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.  But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Can you explain why logical replication is better than binary
> replication for this use-case?

Uh, well, for the same reasons it is better in many other cases.
Particularly, you probably don't want to replicate all the data on
machine A to machine B, just some of it.

Typically, sharding solutions store multiple copies of each piece of
data.  So let's say you have 4 machines.  You divide the data into 12
chunks.  Each machine is the write-master for 2 of those chunks, but
has secondary copies of 3 others.  So maybe things start out like
this:

machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9

If machine #1 is run over by a rabid triceratops, you can make machine
#2 the master for chunk 1, machine #3 the master for chunk 2, and
machine #4 the master for chunk 3.  The write load therefore remains
evenly divided.  If you can only copy entire machines, you can't
achieve that in this situation.

I'm not saying that the above is exactly what we're going to end up
with, or even necessarily close.  But a big part of the point of
sharding is that not all the machines have the same data - otherwise
you are not write scaling.  But it will frequently be the case, for
various reasons, that they have *overlapping* sets of data.  Logical
replication can handle that; physical replication can't.

In Postgres-XC, all tables are either sharded (part of the table is
present on each node) or distributed (all of the table is present on
every node).  Clearly, there's no way to use physical replication in
that scenario except if you are OK with having two copies of every
node.  But that's not a very good solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
That doesn't match any of the empirical tests I did at the time. I posted
graphs of the throughput for varying numbers of spindles with varying
amounts of prefetch. In every case more prefetching increased throughput up
to N times the single-platter throughput, where N was the number of spindles.

There can be a small speedup from overlapping CPU with I/O, but that's a
really small effect. At most that can be a single request, and only very
rarely would the amount of CPU time be even a moderate fraction of the
I/O latency. The only case where they're comparable is when you're reading
sequentially, and then hopefully you wouldn't be using postgres prefetching
at all, which is only really intended to help random I/O.
On 2 Sep 2015 23:38, "Andres Freund"  wrote:

> On 2015-09-02 19:49:13 +0100, Greg Stark wrote:
> > I can take the blame for this formula.
> >
> > It's called the "Coupon Collector Problem". If you hit get a random
> > coupon from a set of n possible coupons, how many random coupons would
> > you have to collect before you expect to have at least one of each.
>
> My point is that that's just the entirely wrong way to model
> prefetching. Prefetching can be massively beneficial even if you only
> have a single platter! Even if there were no queues on the hardware or
> OS level! Concurrency isn't the right way to look at prefetching.
>
> You need to prefetch so far ahead that you'll never block on reading
> heap pages - and that's only the case if processing the next N heap
> blocks takes longer than the prefetch of the N+1 th page. That doesn't
> mean there continuously have to be N+1 prefetches in progress - in fact
> that actually often will only be the case for the first few, after that
> you hopefully are bottlenecked on CPU.
>
> If you additionally take into account hardware realities where you have
> multiple platters, multiple spindles, command queueing etc, that's even
> more true. A single rotation of a single platter with command queuing
> can often read several non consecutive blocks if they're on a similar
>
> - Andres
>
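
(For reference, a worked form of the coupon-collector expectation referred to
above - assuming each outstanding request lands uniformly at random on one of
n spindles, the expected number of random requests needed before every
spindle has at least one in flight is

    E[T] = n * H(n) = n * (1 + 1/2 + ... + 1/n) ~= n * (ln n + 0.5772)

so roughly 8-9 for n = 4, which is approximately how effective_io_concurrency
has been translated into a prefetch distance so far.)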


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund  wrote:
>> Well, still needs to work for tuplestore, which does not have a SortTuple.
>
> Isn't it even more trivial there? It's just an array of void*'s? So
> prefetch(state->memtuples + 3 + readptr->current)?

All I meant is that there couldn't be one centralized definition for
both. I don't mind if you want to bake it into a macro for
tuplestore.c and tuplesort.c.
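
For illustration only, separate definitions might end up looking something
like this (pg_prefetch() stands in for __builtin_prefetch() or whatever the
configure test selects, the struct fields are quoted from memory, and the
lookahead of 3 is just the value being discussed):

    /* per-file wrapper around the compiler builtin */
    #define pg_prefetch(addr)   __builtin_prefetch((addr), 0, 3)

    /* tuplesort.c: while returning memtuples[state->current] */
    if (state->current + 3 < state->memtupcount)
        pg_prefetch(state->memtuples[state->current + 3].tuple);

    /* tuplestore.c: while returning memtuples[readptr->current] */
    if (readptr->current + 3 < state->memtupcount)
        pg_prefetch(state->memtuples[readptr->current + 3]);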

>> Because of the way tuples are fetched across translation unit
>> boundaries in the cases addressed by the patch, it isn't hard to see
>> why the compiler does not do this automatically (prefetch instructions
>> added by the compiler are not common anyway, IIRC).
>
> Hardware prefetchers just have gotten to be rather good and obliterated
> most of the cases where it's beneficial.

Well, if hardware prefetchers are bright enough to do this perfectly
nowadays, then that implies a cost for useless prefetch instructions.
It might still be worth it even then, if the cost is very low for
these platforms. It's not as if they're required to respect the
prefetch hint in any way.

> I'd be interested to see a perf stat -ddd comparison to the patch
> with/without prefetches. It'll be interesting to see how the number of
> cache hits/misses and prefetches changes.
>
> Which microarchitecture did you test this on?

My laptop has an Intel Core i7-3520M, which is a mobile processor that
is a bit old. So, Ivy Bridge.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera  
> wrote:
>> Michael Paquier wrote:
>>> The main argument for this change is that pg_rewind makes use of those
>>> functions, forcing users to use a superuser role when rewinding a
>>> node.

>> Can this piggyback on Stephen Frost's proposed patch for reserved roles et 
>> al?

> (Adding Stephen in the loop)
> I thought that this was rather independent on Stephen's default role
> patch because this is not based on a new role permission type.
> Stephen, do you think so, or should we merge the effort with your
> things?

Just on general principles, this seems like a pretty horrid idea.
To me replication privilege means the ability to transfer data out of
the master, not to cause arbitrary state changes on the master.
Therefore, "allow replication users to run pg_rewind" seems less like
a feature and more like a security bug.  Am I missing something?

Another problem with it is that access to the filesystem is about halfway
to being superuser anyway, even if it's only read-only access.  It would
certainly let you read things that one would not expect a replication
user to be able to read (like files unrelated to Postgres).

I don't entirely understand why pg_rewind should be invoking any of these
functions to begin with.  Isn't it working directly with the disk data,
with both postmasters shut down?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan  wrote:
> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
>> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> Hm. Does a compiler test actually test anything reliably here? Won't this
>> just throw a warning during compile time about an unknown function?
>
> I'll need to look into that.

My error here was not considering why the existing __builtin_bswap32()
test worked alright with AC_COMPILE_IFELSE(), while my test needed
AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
failure (not just success) when writing compiler tests in the future.

AC_LINK_IFELSE() appears to work fine following a straight
substitution. When I get around to revising this patch in a while,
that fix will be included.

Thanks
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-02 Thread Robert Haas
On Wed, Aug 26, 2015 at 4:05 AM, Kouhei Kaigai  wrote:
>> On 2015/08/26 16:07, Kouhei Kaigai wrote:
>> I wrote:
>> >> Maybe I'm missing something, but why do we need such flexibility for
>> >> the columnar-stores?
>>
>> > Even if we enforce them a new interface specification comfortable to RDBMS,
>> > we cannot guarantee it is also comfortable to other type of FDW drivers.
>>
>> Specifically, what kind of points about the patch are specific to RDBMS?
>>
>
>   *** 88,93  ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
>   --- 99,122 
>     TupleTableSlot *
>     ExecForeignScan(ForeignScanState *node)
>     {
>   +     EState     *estate = node->ss.ps.state;
>   +
>   +     if (estate->es_epqTuple != NULL)
>   +     {
>   +         /*
>   +          * We are inside an EvalPlanQual recheck.  If foreign join, get next
>   +          * tuple from subplan.
>   +          */
>   +         Index       scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
>   +
>   +         if (scanrelid == 0)
>   +         {
>   +             PlanState  *outerPlan = outerPlanState(node);
>   +
>   +             return ExecProcNode(outerPlan);
>   +         }
>   +     }
>   +
>         return ExecScan((ScanState *) node,
>                         (ExecScanAccessMtd) ForeignNext,
>                         (ExecScanRecheckMtd) ForeignRecheck);
>
> It might not be specific to RDBMS, however, we cannot guarantee all the FDWs
> are comfortable running the alternative plan node on EPQ recheck.
> This design does not allow FDW drivers to implement their own EPQ recheck,
> possibly more efficient than the built-in logic.

I'm not convinced that this problem is more than hypothetical.  EPQ
rechecks should be quite rare, so it shouldn't really matter if we
jump through a few extra hoops when they happen.  And really, are
those hoops all that expensive?  It's not as if ExecInitNode should be
doing any sort of expensive operation, or ExecEndScan either.  And
they will be able to tell if they're being called for an EPQ-recheck
by fishing out the estate, so if there's some processing that they
want to short-circuit for that case, they can.  So I'm not seeing the
problem.  Do you have any evidence that either the performance cost or
the code complexity cost is significant for PG-Strom or any other
extension?

That having been said, I don't entirely like Fujita-san's patch
either.  Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.  And much of it is contingent on whether estate->es_epqTuple
!= NULL and scanrelid == 0, but perhaps it would be better to check
whether the subplan is actually present instead of checking whether we
think it should be present.  Also, the naming is a bit weird:
node->fs_subplan gets shoved into outerPlanState(), which seems like a
kludge.

I'm wondering if there's another approach.  If I understand correctly,
there are two reasons why the current situation is untenable.  The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there.  The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.  The
second is that ExecScanFetch needs scanrelid > 0 so that
estate->es_epqTupleSet[scanrelid - 1] isn't indexing off the beginning
of the array, and similarly estate->es_epqScanDone[scanrelid - 1] and
estate->es_epqTuple[scanrelid - 1].  But, waving my hands wildly, that
also seems like a solvable problem.  I mean, we're joining a non-empty
set of relations, so the entries in the EPQ-related arrays for those
RTIs are not getting used for anything, so we can use any of them for
the joinrel.  We need some way for this code to decide what RTI to
use, but that shouldn't be too hard to finagle.
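
To make the first point concrete, the recheck hook could be as small as this
(RecheckForeignScan is an invented name here, not an existing FdwRoutine
member):

    static bool
    ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
    {
        FdwRoutine *fdwroutine = node->fdwroutine;

        /* If the FDW supplies a recheck routine, let it decide. */
        if (fdwroutine->RecheckForeignScan != NULL)
            return fdwroutine->RecheckForeignScan(node, slot);

        /* Otherwise keep today's behaviour: assume the tuple still passes. */
        return true;
    }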

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 10:05 AM, Stephen Frost wrote:
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera wrote:
> The replication role already has read-only access to everything
> (nearly?) in the PGDATA directory.  The specific issue in this case, I
> believe, is if the functions mentioned provide access beyond what the
> replication role already has access to.

It seems to me that this difference of files is symbolized by what is
filtered out in a base backup, like postmaster.opts or postmaster.pid.
The matter here is: are we going to have in the future files that we
do not want to include in a base backup that are superuser-only? Today
I cannot recall we have any of that, we may have in the future though.

>> I don't entirely understand why pg_rewind should be invoking any of these
>> functions to begin with.  Isn't it working directly with the disk data,
>> with both postmasters shut down?
>
> Certain information is needed from the new master to rewind the old
> master.  What I would hope is that we'd provide a way for *just* that
> information to be made available to the replication role rather than
> completely general access through the functions mentioned.

The rewound node does not only need the diffs of blocks from a source,
but also some more information like clog. There are actually three
different approaches to solve that:
1) Use a differential backup to me, or the possibility to fetch a set
of data block diffs from a source node using an LSN and then re-apply
them on the target node. The major disadvantage of this approach is
actually performance: we would need to scan the source node
completely, so that's slow. And pg_rewind is fast because it only
scans the WAL records of the node to be rewound from the last
checkpoint before WAL forked.
2) Request data blocks from the source node using the replication
protocol, that's what pg_read_binary_file actually does, we just don't
have the logic on replication protocol side, though we could.
3) Create a new set of functions similar to the existing file access
functions that are usable by the replication user, except that they
refuse to return file entries that match the existing filters in
basebackup.c. That is doable, with a routine in basebackup.c that
decides if a given file string can be read or not. Base backups could
use this routine as well.
I just came up with 3), which looks rather appealing to me.
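
A rough sketch of what 3) could look like - the function name and the list
contents below are placeholders, the real exclusion rules would have to be
shared with basebackup.c:

    #include "postgres.h"

    static const char *const noreplication_files[] = {
        "postmaster.pid",
        "postmaster.opts",
        NULL
    };

    /* Can a replication (non-superuser) role read this file? */
    bool
    file_is_visible_to_replication(const char *filename)
    {
        int         i;

        for (i = 0; noreplication_files[i] != NULL; i++)
        {
            if (strcmp(filename, noreplication_files[i]) == 0)
                return false;
        }

        return true;
    }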

>> > Another problem with it is that access to the filesystem is about halfway
>> > to being superuser anyway, even if it's only read-only access.  It would
>> > certainly let you read things that one would not expect a replication
>> > user to be able to read (like files unrelated to Postgres).
>>
>> Doesn't pg_read_file et al insist that the path is below the data
>> directory?
>
> I don't believe that's the case, actually..  I might be wrong though,
> I'm not looking at the code right now.

The use of PGDATA is enforced
=# select pg_read_file('/etc/passwd');
ERROR:  42501: absolute path not allowed
LOCATION:  convert_and_check_filename, genfile.c:73
If PGDATA includes soft links it is possible to point to those places
though, but that's not something anybody sane would do, still it can
be done and I've seen worse:
$ cd $PGDATA && ln -s /etc etc
$ psql -c "select pg_read_file('etc/passwd')"
# Adult content, not an ERROR
From this perspective, allowing a replication user to have access to
those functions is dangerous: a superuser is the equivalent of an
almighty god on a server, but that's not the case for the
replication user.

> This is definitely a big part of
> the question, but I'd like to ask- what, exactly, does pg_rewind
> actually need access to? Is it only the WAL, or are heap and WAL files
> needed?

Not only, +clog, configuration files, etc.

> Consider the discussion about delta backups, et al, using LSNs.  Perhaps
> the replication protocol should be extended to allow access to arbitrary
> WAL, querying what WAL is available, and access to the heap files,
> perhaps even specific pages in the heap files and relation forks,
> instead of giving pg_rewind access to these extremely general
> nearly-OS-user-level functions.

The problem when using differential backups in this case is
performance as mentioned above. We would need to scan the whole target
cluster, which may take time, the current approach of pg_rewind only
needs to scan WAL records to find the list of blocks modified, and
directly requests them from the source. I would expect pg_rewind to be
as quick as possible.

> Further, for my two cents, I view any access to the system by a
> replication role to a non-replication-protocol communication with the
> server as being inappropriate.  To that point, I'd actually remove the
> ability for the replication roles to call pg_start/stop_backup as normal
> SQL calls since that implies the replication user has connected to a
> normal backend instead of talking the 

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Joshua D. Drake

On 09/02/2015 03:56 PM, Bruce Momjian wrote:

On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.


Can you explain why logical replication is better than binary
replication for this use-case?



Selectivity?

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-09-02 Thread David Rowley
On 3 September 2015 at 05:10, Andres Freund  wrote:

> On 2015-08-09 12:47:53 +1200, David Rowley wrote:
> > I took a bit of weekend time to finish this one off. Patch attached.
> >
> > A quick test shows a pretty good performance increase:
> >
> > create table ts (ts timestamp not null);
> > insert into ts select generate_series('2010-01-01 00:00:00', '2011-01-01
> > 00:00:00', '1 sec'::interval);
> > vacuum freeze ts;
> >
> > Master:
> > david=# copy ts to 'l:/ts.sql';
> > COPY 31536001
> > Time: 20444.468 ms
> >
> > Patched:
> > david=# copy ts to 'l:/ts.sql';
> > COPY 31536001
> > Time: 10947.097 ms
>
> Yes, that's pretty cool.
>
> > There's probably a bit more to squeeze out of this.
> > 1. EncodeDateTime() could return the length of the string to allow callers
> > to use pnstrdup() instead of pstrdup(), which would save the strlen() call.
> > 2. Have pg_int2str_zeropad() and pg_int2str() skip appending the NUL char
> > and leave this up to the calling function.
> > 3. Make something to replace the sprintf(str, " %.*s", MAXTZLEN, tzn); call.
> >
> > Perhaps 1 and 3 are worth looking into, but I think 2 is likely too error
> > prone for the small gain we'd get from it.
>

I see the logic there, but a few weekends ago I modified the patch to
implement number 2. This is perhaps one optimization that would be hard to
change later since, if we suddenly change pg_int2str() to not append the
NUL, then it might break any 3rd party code that's started using them. The
more I think about it the more I think allowing those to skip appending the
NUL is ok. They're very useful functions for incrementally building strings.
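
For what it's worth, the incremental-building pattern then looks roughly like
this (a sketch only; I'm assuming the patch's pg_int2str_zeropad(str, value,
padding) signature, returning the end-of-string pointer and appending no NUL):

    static char *
    append_hms(char *str, int hour, int min, int sec)
    {
        str = pg_int2str_zeropad(str, hour, 2);
        *str++ = ':';
        str = pg_int2str_zeropad(str, min, 2);
        *str++ = ':';
        str = pg_int2str_zeropad(str, sec, 2);
        *str = '\0';            /* caller adds the terminator once */
        return str;
    }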

Attached is a patch to implement this, which performs as follows
postgres=# copy ts to 'l:/ts.sql';
COPY 31536001
Time: 10665.297 ms

However, this version of the patch does change many existing functions in
datetime.c so that they no longer append the NUL char... The good news is
that these are all static functions, so nothing else should be using them.

I'm happy to leave 1 and 3 for another rainy weekend one day.


> > Also I was not too sure on if pg_int2str() was too similar to pg_ltoa().
> > Perhaps pg_ltoa() should just be modified to return the end of string?
>
> I don't think the benefits are worth breaking pg_ltoa interface.
>
>
Ok let's leave pg_ltoa() alone


> >  /*
> > - * Append sections and fractional seconds (if any) at *cp.
> > + * Append seconds and fractional seconds (if any) at *cp.
> >   * precision is the max number of fraction digits, fillzeros says to
> >   * pad to two integral-seconds digits.
> >   * Note that any sign is stripped from the input seconds values.
> > + * Note 'precision' must not be a negative number.
> >   */
> > -static void
> > +static char *
> >  AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
> >  {
> > +#ifdef HAVE_INT64_TIMESTAMP
>
> Wouldn't it be better to just normalize fsec to an integer in that case?
> Afaics that's the only remaining reason for the alternative path?
>
A special case would need to exist somewhere, unless we just modified
timestamp2tm() to multiply fsec by 1 million when HAVE_INT64_TIMESTAMP is
not defined, but that changes the function signature.


> > +/*
> > + * pg_int2str
> > + *   Converts 'value' into a decimal string representation of the number.
> > + *
> > + * Caller must ensure that 'str' points to enough memory to hold the result
> > + * (at least 12 bytes, counting a leading sign and trailing NUL).
> > + * Return value is a pointer to the new NUL terminated end of string.
> > + */
> > +char *
> > +pg_int2str(char *str, int32 value)
> > +{
> > +    char *start;
> > +    char *end;
> > +
> > +    /*
> > +     * Handle negative numbers in a special way. We can't just append a '-'
> > +     * prefix and reverse the sign as on two's complement machines negative
> > +     * numbers can be 1 further from 0 than positive numbers, we do it this
> > +     * way so we properly handle the smallest possible value.
> > +     */
> > +    if (value < 0)
> > +    {
>
> We could alternatively handle this by special-casing INT_MIN, probably
> resulting in a bit less duplicated code.
>

Those special cases seem to do some weird stuff to the performance
characteristics of those functions. I find my method to handle negative
numbers to produce much faster code.
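
Just to illustrate the idea (a simplified sketch, not the patch itself): if
the digit loop always works with negative values, INT32_MIN needs no special
case at all, e.g.

    char *
    pg_int2str(char *str, int32 value)
    {
        char   *start = str;
        char   *end;
        int32   rem = (value > 0) ? -value : value;  /* always <= 0 */

        /* emit digits least-significant first; rem % 10 is in (-9 .. 0] */
        do
        {
            *str++ = '0' - (rem % 10);
            rem /= 10;
        } while (rem != 0);

        if (value < 0)
            *str++ = '-';

        *str = '\0';
        end = str;

        /* reverse the digits (and sign) into conventional order */
        for (str--; start < str; start++, str--)
        {
            char    tmp = *start;

            *start = *str;
            *str = tmp;
        }

        return end;
    }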

I experimented with finding the fastest way to convert an int to a string
at the weekend, I did happen to be testing with int64's but likely int32
will be the same.

This is the time taken to convert 30 into "30" 2 billion times.

The closest implementation I'm using in the patch is pg_int64tostr() one.
The pg_int64tostr_memcpy() only performs better with bigger numbers /
longer strings.

david@ubuntu:~/C$ gcc5.2 tostring.c -o tostring -O2
david@ubuntu:~/C$ ./tostring
pg_int64tostr_memcpy in 13.653252 seconds
pg_int64tostr in 2.042616 seconds
pg_int64tostr_new in 2.056688 seconds
pg_lltoa in 

Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Stephen Frost
* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Thu, Sep 3, 2015 at 10:05 AM, Stephen Frost wrote:
> > * Michael Paquier (michael.paqu...@gmail.com) wrote:
> >> On Thu, Sep 3, 2015 at 4:52 AM, Alvaro Herrera wrote:
> > The replication role already has read-only access to everything
> > (nearly?) in the PGDATA directory.  The specific issue in this case, I
> > believe, is if the functions mentioned provide access beyond what the
> > replication role already has access to.
> 
> It seems to me that this difference of files is symbolized by what is
> filtered out in a base backup, like postmaster.opts or postmaster.pid.

Neither of those strikes me as particularly sensitive information.

> The matter here is: are we going to have in the future files that we
> do not want to include in a base backup that are superuser-only? Today
> I cannot recall we have any of that, we may have in the future though.

This also seems unlikely.

> >> I don't entirely understand why pg_rewind should be invoking any of these
> >> functions to begin with.  Isn't it working directly with the disk data,
> >> with both postmasters shut down?
> >
> > Certain information is needed from the new master to rewind the old
> > master.  What I would hope is that we'd provide a way for *just* that
> > information to be made available to the replication role rather than
> > completely general access through the functions mentioned.
> 
> The rewound node does not only need the diffs of blocks from a source,
> but also some more information like clog. There are actually three
> different approaches to solve that:

Ok, so CLOG information is also required, but that's not a particularly
large issue and the replication role has access to those files already,
no?  It's just inconvenient to access them using the current protocol,
if that's all you're interested in.

> 1) Use a differential backup to me, or the possibility to fetch a set
> of data block diffs from a source node using an LSN and then re-apply
> them on the target node. The major disadvantage of this approach is
> actually performance: we would need to scan the source node
> completely, so that's slow. And pg_rewind is fast because it only
> scans the WAL records of the node to be rewound from the last
> checkpoint before WAL forked.

I don't follow this performance concern at all.  Why can't pg_rewind
look through the WAL and find what it needs, and then request exactly
that information?  Clearly, we would need to add to the existing
replication protocol, but I don't see any reason to not consider that a
perfectly reasonable approach for this.

> 2) Request data blocks from the source node using the replication
> protocol, that's what pg_read_binary_file actually does, we just don't
> have the logic on replication protocol side, though we could.

Right, we would need to modify the replication protocol to allow such
requests, but that's not particularly difficult.

> 3) Create a new set of functions similar to the existing file access
> functions that are usable by the replication user, except that they
> refuse to return file entries that match the existing filters in
> basebackup.c. That is doable, with a routine in basebackup.c that
> decides if a given file string can be read or not. Base backups could
> use this routine as well.

I don't particularly like this approach as it implies SQL access for the
replication role.

> >> > Another problem with it is that access to the filesystem is about halfway
> >> > to being superuser anyway, even if it's only read-only access.  It would
> >> > certainly let you read things that one would not expect a replication
> >> > user to be able to read (like files unrelated to Postgres).
> >>
> >> Doesn't pg_read_file et al insist that the path is below the data
> >> directory?
> >
> > I don't believe that's the case, actually..  I might be wrong though,
> > I'm not looking at the code right now.
> 
> The use of PGDATA is enforced

We know that there are a lot of potential risks around enforcing this,
or similar, requirements.  Right now, we don't have to worry about any
of those corner cases, but we would if we used this approach.  If,
instead, the replication protocol is modified to accept requests for the
specific information (CLOG ), we could be sure to only ever return
exactly that information from the current cluster, or throw an error.

> > This is definitely a big part of
> > the question, but I'd like to ask- what, exactly, does pg_rewind
> > actually need access to? Is it only the WAL, or are heap and WAL files
> > needed?
> 
> Not only, +clog, configuration files, etc.

Configuration files?  Perhaps you could elaborate?

> > Consider the discussion about delta backups, et al, using LSNs.  Perhaps
> > the replication protocol should be extended to allow access to arbitrary
> > WAL, querying what WAL is available, and access to the heap files,
> > perhaps even specific pages in the heap files and relation forks,
> > instead of 

Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> >Can you explain why logical replication is better than binary
> >replication for this use-case?
> >
> 
> Selectivity?

I was assuming you would just create identical slaves to handle failure,
rather than moving selected data around.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending clean up is not interruptable

2015-09-02 Thread Fujii Masao
On Thu, Sep 3, 2015 at 2:15 AM, Andres Freund  wrote:
> On 2015-08-12 11:59:48 -0700, Jeff Janes wrote:
>> Attached patch does it that way.  There was also a free-standing
>> CHECK_FOR_INTERRUPTS() which had no reason that I could see not be a
>> vacuum_delay_point, so I changed that one as well.

-if (vac_delay)
-vacuum_delay_point();
+vacuum_delay_point();

If vac_delay is false, e.g., when ginInsertCleanup() is called by the backend,
vacuum_delay_point() should not be called. No?
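
(Presumably something along these lines instead - just a sketch:)

    if (vac_delay)
        vacuum_delay_point();
    else
        CHECK_FOR_INTERRUPTS();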

> I think we should backpatch this - any arguments against?

+1 for backpatch.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 3:41 AM, Robert Haas wrote:
> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

I don't recall that. Immutable functions are switched to constants in
the query sent to datanodes. Volatile and stable functions are
evaluated locally after fetching the results from the remote node. Not
that efficient for warehouse loads. My 2c.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.

That's something that the quorum synchronous patch would address.
Still, having the possibility to be synchronous across multiple nodes
does not seem to be something at the top of the list.

> Also, the GTM needs to be aware that this stuff is happening, or it will 
> DTWT.  That too seems like a problem that can be solved.

If I understood correctly, yes it is: with its centralized transaction
facility, each node is aware of the transaction status via the global
snapshot.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Michael Paquier
On Thu, Sep 3, 2015 at 11:20 AM, Stephen Frost wrote:
> * Michael Paquier wrote:
>> 1) Use a differential backup to me, or the possibility to fetch a set
>> of data block diffs from a source node using an LSN and then re-apply
>> them on the target node. The major disadvantage of this approach is
>> actually performance: we would need to scan the source node
>> completely, so that's slow. And pg_rewind is fast because it only
>> scans the WAL records of the node to be rewound from the last
>> checkpoint before WAL forked.
>
> I don't follow this performance concern at all.  Why can't pg_rewind
> look through the WAL and find what it needs, and then request exactly
> that information?  Clearly, we would need to add to the existing
> replication protocol, but I don't see any reason to not consider that a
> perfectly reasonable approach for this.

See below, visibly I misunderstood what you meant.

>> 2) Request data blocks from the source node using the replication
>> protocol, that's what pg_read_binary_file actually does, we just don't
>> have the logic on replication protocol side, though we could.
>
> Right, we would need to modify the replication protocol to allow such
> requests, but that's not particularly difficult.

Check.

>> 3) Create a new set of functions similar to the existing file access
>> functions that are usable by the replication user, except that they
>> refuse to return file entries that match the existing filters in
>> basebackup.c. That is doable, with a routine in basebackup.c that
>> decides if a given file string can be read or not. Base backups could
>> use this routine as well.
>
> I don't particularly like this approach as it implies SQL access for the
> replication role.

Check.

>> > This is definitely a big part of
>> > the question, but I'd like to ask- what, exactly, does pg_rewind
>> > actually need access to? Is it only the WAL, or are heap and WAL files
>> > needed?
>>
>> Not only, +clog, configuration files, etc.
>
> Configuration files?  Perhaps you could elaborate?

Sure. Sorry for being unclear. It copies everything that is not a
relation file, a kind of base backup without the relation files then.

>> > Consider the discussion about delta backups, et al, using LSNs.  Perhaps
>> > the replication protocol should be extended to allow access to arbitrary
>> > WAL, querying what WAL is available, and access to the heap files,
>> > perhaps even specific pages in the heap files and relation forks,
>> > instead of giving pg_rewind access to these extremely general
>> > nearly-OS-user-level functions.
>>
>> The problem when using differential backups in this case is
>> performance as mentioned above. We would need to scan the whole target
>> cluster, which may take time, the current approach of pg_rewind only
>> needs to scan WAL records to find the list of blocks modified, and
>> directly requests them from the source. I would expect pg_rewind to be
>> as quick as possible.
>
> I don't follow why the current approach of pg_rewind would have to
> change.  All I'm suggesting is that we have a different way, one which
> is much more restricted, for pg_rewind to request exactly the
> information it needs for efficient operation.

Ah, OK. I thought that you were referring to a protocol where the caller
sends a single LSN from which it gets a differential backup that needs
to scan all the relation files of the source cluster to get the data
blocks with an LSN newer than the one sent, and then sends them back
to the caller.

I guess that what you are suggesting instead is an approach where
the caller sends something like this through the replication protocol, with
a relation OID and a block list:
BLOCK_DIFF relation_oid BLOCK_LIST m,n,[o, ...]
Which is close to what pg_read_binary_file does now for a superuser.
We would also need to extend BASE_BACKUP so that it does not include
relation files for this use case, though.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_statements query jumbling question

2015-09-02 Thread Satoshi Nagayasu
On 2015/09/01 14:39, Satoshi Nagayasu wrote:
> On 2015/09/01 14:01, Tom Lane wrote:
>> Satoshi Nagayasu  writes:
>>> On 2015/09/01 13:41, Peter Geoghegan wrote:
 If you want to use the queryId field directly, which I recall you
 mentioning before, then that's harder. There is simply no contract
 among extensions for "owning" a queryId. But when the fingerprinting
 code is moved into core, then I think at that point queryId may cease
 to be even a thing that pg_stat_statements theoretically has the right
 to write into. Rather, it just asks the core system to do the
 fingerprinting, and finds it within queryId. At the same time, other
 extensions may do the same, and don't need to care about each other.

 Does that work for you?
>>
>>> Yes. I think so.
>>
>>> I need some query fingerprint to determine query group. I want queryid
>>> to keep the same value when query strings are the same (except literal
>>> values).
>>
>> The problem I've got with this is the unquestioned assumption that every
>> application for query IDs will have exactly the same requirements for
>> what the ID should include or ignore.
> 
> I'm not confident about that too, but at least, I think we will be able
> to collect most common use cases as of today. (aka best guess. :)
> 
> And IMHO it would be ok to change the spec in future release.

I know this still needs to be discussed, but I would like to submit
a patch for further discussion at the next CF, 2015-11.

Regards,
-- 
NAGAYASU Satoshi 
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..2f6fb99 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -76,6 +76,8 @@
 #include "tcop/utility.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/relcache.h"
 
 PG_MODULE_MAGIC;
 
@@ -2286,6 +2288,7 @@ static void
 JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 {
ListCell   *lc;
+   Relation   rel;
 
foreach(lc, rtable)
{
@@ -2296,7 +2299,9 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
switch (rte->rtekind)
{
case RTE_RELATION:
-   APP_JUMB(rte->relid);
+   rel = RelationIdGetRelation(rte->relid);
+   APP_JUMB_STRING(RelationGetRelationName(rel));
+   RelationClose(rel);
JumbleExpr(jstate, (Node *) rte->tablesample);
break;
case RTE_SUBQUERY:

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
 wrote:
> Peter, would you be able to share the test case which you saw the speedup
> on. So far I've been unable to see much of an improvement.

The case I tested was an internal sort CREATE INDEX. I don't recall
the exact details, but testing showed it to be a fairly robust
speedup. It was not a very brief CREATE INDEX operation, or a very
lengthily one. trace_sort output made it quite visible that there was
a significant saving after the sort is "performed", but before it is
"done". It wasn't hard to see an improvement on a variety of other
cases, although the Intel vTune tool made the difference particularly
obvious.

The only thing that definitely won't be helped is pass-by-value datum
sort cases. In case it matters, I used GCC 4.8.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Bruce Momjian
On Wed, Sep  2, 2015 at 09:03:25PM -0400, Robert Haas wrote:
> > Can you explain why logical replication is better than binary
> > replication for this use-case?
> 
> Uh, well, for the same reasons it is better in many other cases.
> Particularly, you probably don't want to replicate all the data on
> machine A to machine B, just some of it.
> 
> Typically, sharding solutions store multiple copies of each piece of
> data.  So let's say you have 4 machines.  You divide the data into 12
> chunks.  Each machine is the write-master for 2 of those chunks, but
> has secondary copies of 3 others.  So maybe things start out like
> this:
> 
> machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
> machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
> machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
> machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9
> 
> If machine #1 is run over by a rabid triceratops, you can make machine
> #2 the master for chunk 1, machine #3 the master for chunk 2, and
> machine #4 the master for chunk 3.  The write load therefore remains
> evenly divided.  If you can only copy entire machines, you can't
> achieve that in this situation.

I see the advantage of this now.  My original idea was that each shard
would have its own standby for disaster recovery, but your approach
above, which I know is typical, allows the shards to back up each other.
You could say shard 2 is the backup for shard 1, but then if shard 1
goes bad, the entire workload of shard 1 goes to shard 2.  With the
above approach, the load of shard 1 is shared by all the shards.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Amit Kapila
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian  wrote:
>
> On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> > >Can you explain why logical replication is better than binary
> > >replication for this use-case?
> > >
> >
> > Selectivity?
>
> I was assuming you would just create identical slaves to handle failure,
> rather than moving selected data around.
>

Yes, I also think so, otherwise when the shard goes down and its replica
has to take the place of the shard, it will take more time to make the replica
available, as it won't have all the data the original shard had.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-09-02 Thread David Rowley
On 3 September 2015 at 07:24, Peter Geoghegan  wrote:

> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund  wrote:
> > What worries me about adding explicit prefetching is that it's *highly*
> > platform and even micro-architecture dependent. Why is looking three
> > elements ahead the right value?
>
> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT. So, yes,
> certainly, more testing is required across a reasonable cross-section
> of platforms to justify the patch.
>

FWIW someone else found 3 to be good on the platform they tested on:
http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

Peter, would you be able to share the test case which you saw the speedup
on. So far I've been unable to see much of an improvement.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-02 Thread Etsuro Fujita

On 2015/09/03 9:41, Robert Haas wrote:

That having been said, I don't entirely like Fujita-san's patch
either.  Much of the new code is called immediately adjacent to an FDW
callback which could pretty trivially do the same thing itself, if
needed.


Another idea about that code is to call that code in, e.g., ExecProcNode, 
instead of calling ExecForeignScan there.  I think that might be 
much cleaner and resolve the naming problem below.



And much of it is contingent on whether estate->es_epqTuple
!= NULL and scanrelid == 0, but perhaps it would be better to check
whether the subplan is actually present instead of checking whether we
think it should be present.


Agreed with this.


Also, the naming is a bit weird:
node->fs_subplan gets shoved into outerPlanState(), which seems like a
kludge.


And with this.  Proposals welcome.


I'm wondering if there's another approach.  If I understand correctly,
there are two reasons why the current situation is untenable.  The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there.  The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.


My question about this is, is the callback really needed?  If there are 
any FDWs that want to do the work *in their own way*, instead of just 
doing ExecProcNode for executing a local join execution plan in case of 
foreign join (or just doing ExecQual for checking remote quals in case 
of foreign table), I'd agree with introducing the callback, but if not, 
I don't think that that makes much sense.


Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Pavan Deolasee
On Wed, Sep 2, 2015 at 9:04 PM, Oleg Bartunov  wrote:

>
>
>
> One lesson from XL we got is that we need testing framework for cluster,
> so any cluster project should at least pass functional and performance
> testing.
>

+1. In early XC days, we focused a lot on adding newer features and
supporting as many PG features as possible. That took its toll on the
testing and QA. It was a mistake, though my feeling is we tried to correct
that to some extent with XL. We did a 9.5 merge, which of course was a big
deal, but other than that, more time is being spent on improving stability
and performance.

> XL was very easy to break and I'm wondering how many corner cases still
> exist.
>

Your team reported 2 or 3 major issues which I think we were able to fix
quite quickly. But if there are more such issues which your team has
recorded somewhere, I would request you to send them to the XL mailing
list. I would definitely want to look at them and address them.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tatsuo Ishii
>> One lesson from XL we got is that we need testing framework for cluster,
>> so any cluster project should at least pass functional and performance
>> testing.
>>
> 
> +1. In early XC days, we focused a lot on adding newer features and
> supporting as many PG features as possible. That took its toll on the
> testing and QA. It was a mistake though my feeling was we tried to correct
> that to some extend with XL. We did a 9.5 merge, which of course was a big
> deal, but other than more time is being spent on improving stability and
> performance

Agreed. Any cluster project needs a cluster testing framework.

The pgpool-II project runs a "build farm" which runs cluster regression
tests every day. The tests include several combinations of pgpool-II and
PostgreSQL versions using docker. It still needs more tests, but even
with limited test cases it is pretty useful for detecting bugs.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-02 Thread dinesh kumar
On Mon, Aug 31, 2015 at 10:08 PM, Pavel Stehule 
wrote:

>
>
> 2015-09-01 6:59 GMT+02:00 Pavel Stehule :
>
>>
>>
>> 2015-08-31 20:43 GMT+02:00 dinesh kumar :
>>
>>> Hi,
>>>
>>> On Sat, Aug 29, 2015 at 4:22 PM, Pavel Stehule 
>>> wrote:
>>>
 Hi

 I am starting to work review of this patch

 2015-07-13 9:54 GMT+02:00 dinesh kumar :

> Hi All,
>
> Greetings for the day.
>
> Would like to discuss on below feature here.
>
> Feature:
> Having an SQL function, to write messages to log destination.
>
> Justification:
> As of now, we don't have an SQL function to write
> custom/application messages to log destination. We have "RAISE" clause
> which is controlled by
> log_ parameters. If we have an SQL function which works irrespective
> of log settings, that would be a good for many log parsers. What i mean 
> is,
> in DBA point of view, if we route all our native OS stats to log files in 
> a
> proper format, then we can have our log reporting tools to give most
> effective reports. Also, Applications can log their own messages to
> postgres log files, which can be monitored by DBAs too.
>
> Implementation:
> Implemented a new function "pg_report_log" which takes one
> argument as text, and returns void. I took, "LOG" prefix for all the
> reporting messages.I wasn't sure to go with new prefix for this, since
> these are normal LOG messages. Let me know, if i am wrong here.
>
> Here is the attached patch.
>

 This patch is not complex, but the implementation doesn't cover a
 "ereport" well.

 Although this functionality should be replaced by custom function in
 any PL (now or near future), I am not against to have this function in
 core. There are lot of companies with strong resistance against stored
 procedures - and sometimes this functionality can help with SQL debugging.

 Issues:

 1. Support only MESSAGE field in exception - I am expecting to support
 all fields: HINT, DETAIL, ...

>>>
>>> Added these functionalities too.
>>>
>>>
 2. Missing regress tests

>>>
>>> Adding here.
>>>
>>>
 3. the parsing ereport level should be public function shared with
 PLpgSQL and other PL

>>>
>>> Sorry Pavel. I am not getting your point here. Would you give me an
>>> example.
>>>
>>
>> The transformation: text -> error level is common task - and PLpgSQL it
>> does in pl_gram.y. My idea is to add new function to error utils named
>> "parse_error_level" and use it from PLpgSQL and from your code.
>>
>>
>>>
>>>
 4. should be hidestmt mandatory parameter?

>>>
>>> I changed this argument's default value as "true".
>>>
>>>
 5. the function declaration is strange

 postgres=# \sf pg_report_log (text, anyelement, boolean)
 CREATE OR REPLACE FUNCTION pg_catalog.pg_report_log(text, anyelement,
 boolean)
  RETURNS void
  LANGUAGE sql
  STABLE STRICT COST 1
 AS $function$SELECT pg_report_log($1::pg_catalog.text,
 $2::pg_catalog.text, $3::boolean)$function$

 Why polymorphic? It is useless on any modern release


>>> I took quote_ident(anyelement) as referral code, for implementing this.
>>> Could you guide me if I am doing wrong here.
>>>
>>
>> I was wrong - this is ok - it is emulation of force casting to text
>>
>>
>>>
>>>
 postgres=# \sf pg_report_log (text, text, boolean)
 CREATE OR REPLACE FUNCTION pg_catalog.pg_report_log(text, text, boolean)
  RETURNS void
  LANGUAGE internal
  IMMUTABLE STRICT
 AS $function$pg_report_log$function$

 Why stable, why immutable? This function should be volatile.

 Fixed these to volatile.
>>>
>>>
 6. using elog level enum as errcode is wrong idea - errcodes are
 defined in table
 http://www.postgresql.org/docs/9.4/static/errcodes-appendix.html

>>>
>>> You mean, if the elevel is 'ERROR', then we need to allow errcode. Let
>>> me know your inputs.
>>>
>>
>> I was blind, but the code was not good. Yes, error and higher needs error
>> code. From ANSI/SQL anything can has error code "00 is ok", "01 ..
>> warnings" ...
>>
>> There is more possibilities - look to PLpgSQL implementation - it can be
>> optional parameter - it default can use ERRCODE_RAISE_EXCEPTION
>>
>>
>>>
>>> Adding new patch, with the above fixes.
>>>
>>
> the code looks better
>
> my objections:
>
> 1. I prefer default values be NULL
>

Fixed it.


> 2. readability:
>   * parsing error level should be in alone cycle
>   * you don't need to use more ereport calls - one is good enough - look
> on implementation of stmt_raise in PLpgSQL
>

Sorry for my ignorance. I have tried to implement parse_error_level in
pl_gram.y, but was not able to do it. I was not able to parse 

Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Jim Nasby

On 9/2/15 2:17 PM, Alvaro Herrera wrote:

Michael Paquier wrote:


I haven't written yet a test case but I think that we could reproduce
that simply by having a relation referenced in the exception block of
a first function, calling a second function that itself raises an
exception, causing the referencing error.


Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?


I had attempted one but the issue is that it's more than just an 
exception block thing. If it was that simple then we'd still get the 
crash without involving pgTap. So I suspect you need to have a named 
cursor in the mix as well.


Let me make another attempt at something simpler.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-09-02 Thread dinesh kumar
On Tue, Sep 1, 2015 at 10:58 PM, Stefan Kaltenbrunner <
ste...@kaltenbrunner.cc> wrote:

> On 07/25/2015 03:38 AM, dinesh kumar wrote:
> >
> >
> > On Fri, Jul 24, 2015 at 10:22 AM, Robert Haas  > > wrote:
> >
> > On Thu, Jul 23, 2015 at 8:15 PM, dinesh kumar
> > > wrote:
> > > On Thu, Jul 23, 2015 at 9:21 AM, Robert Haas
> > > wrote:
> > >>
> > >> On Thu, Jul 23, 2015 at 12:19 PM, dinesh kumar
> > >
> > >> wrote:
> > >> > Sorry for my  unclear description about the proposal.
> > >> >
> > >> > "WITH PERMISSIVE" is equal to our existing behavior. That is,
> chmod=644
> > >> > on
> > >> > the created files.
> > >> >
> > >> > If User don't specify "PERMISSIVE" as an option, then the
> chmod=600 on
> > >> > created files. In this way, we can restrict the other users
> from reading
> > >> > these files.
> > >>
> > >> There might be some benefit in allowing the user to choose the
> > >> permissions, but (1) I doubt we want to change the default
> behavior
> > >> and (2) providing only two options doesn't seem flexible enough.
> > >>
> > >
> > > Thanks for your inputs Robert.
> > >
> > > 1) IMO, we will keep the exiting behavior as it is.
> > >
> > > 2) As the actual proposal talks about the permissions of
> group/others. So,
> > > we can add few options as below to the WITH clause
> > >
> > > COPY
> > > ..
> > > ..
> > > WITH
> > > [
> > > NO
> > > (READ,WRITE)
> > > PERMISSION TO
> > > (GROUP,OTHERS)
> > > ]
> >
> > If we're going to do anything here, it should use COPY's
> > extensible-options syntax, I think.
> >
> >
> > Thanks Robert. Let me send a patch for this.
>
>
> how are you going to handle windows or unix ACLs here?
> Its permission model is quite different and more powerful than (non-acl
> based) unix in general, handling this in a flexible way might soon get
> very complicated and complex for limited gain...
>
>
Hi Stefan,

I had the same questions too. But, I believe, our initdb works in these
cases, after creating the data cluster. Doesn't it?

Regards,
Dinesh
manojadinesh.blogspot.com

>
>
> Stefan
>


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Tomas Vondra



On 09/02/2015 08:27 PM, Robert Haas wrote:

On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure 
wrote:


This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions which might
make sense for single-table queries but break down in the face of
joins.


+1 to these concerns


Well, I don't think that ALL of the logic should go into the FDW.


Then maybe we shouldn't call this "FDW-based sharding" (or "FDW 
approach" or whatever was used in this thread so far) because that kinda 
implies that the proposal is to build on FDW.


In my mind, FDW is a wonderful tool to integrate PostgreSQL with 
external data sources, and it's nicely shaped for this purpose, which 
implies the abstractions and assumptions in the code.


The truth however is that many current uses of the FDW API are actually 
using it for different purposes because there's no other way to do that, 
not because FDWs are the "right way". And this includes the attempts to 
build sharding on FDW, I think.


Situations like this result in "improvements" of the API that seem to 
improve the API for the second group, but make the life harder for the 
original FDW API audience by making the API needlessly complex. And I 
say "seem to improve" because the second group eventually runs into the 
fundamental abstractions and assumptions the API is based on anyway.


And based on the discussions at pgcon, I think this is the main reason 
why people cringe when they hear "FDW" and "sharding" in the same sentence.


I'm not opposed to reusing the FDW infrastructure, of course.

> In particular, in this example, parallel aggregate needs the same
> query rewrite, so the logic for that should live in core so that
> both parallel and distributed queries get the benefit.

I'm not sure the parallel query is a great example here - maybe I'm 
wrong, but I think it's a fairly isolated piece of code, and we have a 
pretty clear idea of the two use cases.


I'm sure it's non-trivial to design it well for both cases, but I think 
the questions for FDW/sharding will be much more about abstract concepts 
than particular technical solutions.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx

2015-09-02 Thread Pavel Stehule
Hi

2015-09-02 15:23 GMT+02:00 Andres Freund :

> Hi,
>
> On 2015-07-08 14:50:37 +0200, Pavel Stehule wrote:
> > - static const char *const my_list[] =
> > - {"DEFAULT", NULL};
> > + /* fallback for GUC settings */
> >
> > - COMPLETE_WITH_LIST(my_list);
> > + char *vartype = get_vartype(prev2_wd);
> > +
> > + if (strcmp(vartype, "enum") == 0)
> > + {
> > + char querybuf[1024];
> > +
> > + snprintf(querybuf, 1024, Query_for_enum,
> prev2_wd);
> > + COMPLETE_WITH_QUERY(querybuf);
> > + }
>
> Won't that mean that enum variables don't complete to default anymore?
>

no, it still does - DEFAULT is included via the UNION:

#define Query_for_enum \
" SELECT name FROM ( "\
"   SELECT unnest(enumvals) AS name "\
"FROM pg_catalog.pg_settings "\
"   WHERE pg_catalog.lower(name)=pg_catalog.lower('%s') "\
"   UNION SELECT 'DEFAULT' ) ss "\

"  WHERE pg_catalog.substring(name,1,%%d)='%%s'"



>
> > +static char *
> > +get_vartype(const char *varname)
> > +{
> > + PQExpBufferData query_buffer;
> > + char*e_varname;
> > + PGresult *result;
> > + int string_length;
> > + static char resbuf[10];
> > +
> > + initPQExpBuffer(&query_buffer);
> > +
> > + string_length = strlen(varname);
> > + e_varname = pg_malloc(string_length * 2 + 1);
> > + PQescapeString(e_varname, varname, string_length);
>
> Independent of this patch, we really shouldn't do this in several places
> :(
>

fixed

>
> > > + appendPQExpBuffer(&query_buffer,
> > + "SELECT vartype FROM pg_settings WHERE
> pg_catalog.lower(name) = pg_catalog.lower('%s')",
> > +  e_varname);
>
> Missing pg_catalog for pg_settings.
>

fixed

>
> Greetings,
>
> Andres Freund
>

I am sending new version

Regards

Pavel
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
new file mode 100644
index 816deda..61216e1
*** a/src/bin/psql/tab-complete.c
--- b/src/bin/psql/tab-complete.c
*** static const SchemaQuery Query_for_list_
*** 757,762 
--- 757,770 
  "   (SELECT polrelid FROM pg_catalog.pg_policy "\
  " WHERE pg_catalog.quote_ident(polname)='%s')"
  
+ #define Query_for_enum \
+ " SELECT name FROM ( "\
+ "   SELECT unnest(enumvals) AS name "\
+ "FROM pg_catalog.pg_settings "\
+ "   WHERE pg_catalog.lower(name)=pg_catalog.lower('%s') "\
+ "   UNION SELECT 'DEFAULT' ) ss "\
+ "  WHERE pg_catalog.substring(name,1,%%d)='%%s'"
+ 
  /*
   * This is a list of all "things" in Pgsql, which can show up after CREATE or
   * DROP; and there is also a query to get a list of them.
*** static PGresult *exec_query(const char *
*** 847,852 
--- 855,863 
  
  static void get_previous_words(int point, char **previous_words, int nwords);
  
+ static char *get_vartype(const char *varname);
+ static char *escape_string(const char *text);
+ 
  #ifdef NOT_USED
  static char *quote_file_name(char *text, int match_type, char *quote_pointer);
  static char *dequote_file_name(char *text, char quote_char);
*** psql_completion(const char *text, int st
*** 3604,3623 
  
  			COMPLETE_WITH_LIST(my_list);
  		}
- 		else if (pg_strcasecmp(prev2_wd, "IntervalStyle") == 0)
- 		{
- 			static const char *const my_list[] =
- 			{"postgres", "postgres_verbose", "sql_standard", "iso_8601", NULL};
- 
- 			COMPLETE_WITH_LIST(my_list);
- 		}
- 		else if (pg_strcasecmp(prev2_wd, "GEQO") == 0)
- 		{
- 			static const char *const my_list[] =
- 			{"ON", "OFF", "DEFAULT", NULL};
- 
- 			COMPLETE_WITH_LIST(my_list);
- 		}
  		else if (pg_strcasecmp(prev2_wd, "search_path") == 0)
  		{
  			COMPLETE_WITH_QUERY(Query_for_list_of_schemas
--- 3615,3620 
*** psql_completion(const char *text, int st
*** 3627,3636 
  		}
  		else
  		{
! 			static const char *const my_list[] =
! 			{"DEFAULT", NULL};
  
! 			COMPLETE_WITH_LIST(my_list);
  		}
  	}
  
--- 3624,3654 
  		}
  		else
  		{
! 			/* fallback for GUC settings */
  
! 			char *vartype = get_vartype(prev2_wd);
! 
! 			if (strcmp(vartype, "enum") == 0)
! 			{
! char querybuf[1024];
! 
! snprintf(querybuf, 1024, Query_for_enum, prev2_wd);
! COMPLETE_WITH_QUERY(querybuf);
! 			}
! 			else if (strcmp(vartype, "bool") == 0)
! 			{
! static const char *const my_list[] =
! {"ON", "OFF", "DEFAULT", NULL};
! 
! COMPLETE_WITH_LIST(my_list);
! 			}
! 			else
! 			{
! static const char *const my_list[] =
! {"DEFAULT", NULL};
! 
! COMPLETE_WITH_LIST(my_list);
! 			}
  		}
  	}
  
*** _complete_from_query(int is_schema_query
*** 4166,4195 
  		result = NULL;
  
  		/* Set up suitably-escaped copies of textual inputs */
! 		e_text = pg_malloc(string_length * 2 + 1);
! 		PQescapeString(e_text, text, string_length);
  
  		if 

Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Fabien COELHO


Hello Andres,


Maybe add --builtin list to show them?


Yep, easy enough.


[...]
+Shorthand for -b simple-update@1.
+Shorthand for -b select-only@1.


I'm a bit inclined to remove these options.


Hm...

This is really about backward compatibility, and people may find references 
to these in blogs or elsewhere, so I think that it would make sense to 
be upward compatible.

I would certainly be against adding any other of these options, though.


+   Each script may be given a relative weight specified after a
+   @ so as to change its drawing probability.
+   The default weight is 1.


I'm wondering if percentages instead of weights would be a better
idea. That'd mean you'd be forced to be more careful when adding another
script (having to adjust the percentages of other scripts) but arguably
that's a good thing?


If you use only percentages, then you have to check that the total is 100, 
you probably have to use floats, and you have to do something when the 
total is not 100; the checking would complicate the code and test people's 
mental calculus abilities. Not sure this is a good idea:-)


In the use case you outline, when adding a script, maybe you know that it 
runs "as much as" this other script, so you can pick up the same weight 
without bothering.


Also, when testing, there is an issue when you want to remove one script 
for a quick test, and that would mean changing all percentages on the 
command line...


So I would advise not to put such a constraint.
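
To illustrate why raw weights are sufficient, here is a minimal standalone
sketch of drawing a script with probability weight/total. The struct and
function names below are only an illustration and do not claim to match
the patch:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	const char *name;
	int			weight;			/* relative weight given after '@' */
} ScriptEntry;

/*
 * Return the index of one script, chosen with probability
 * weight[i] / sum(weights).  The weights need not add up to 100,
 * or to any particular total.  (Modulo bias ignored for brevity;
 * pgbench would use its own PRNG.)
 */
static int
choose_script(const ScriptEntry *scripts, int num_scripts)
{
	int			total = 0;
	int			i,
				r;

	for (i = 0; i < num_scripts; i++)
		total += scripts[i].weight;

	r = rand() % total;

	for (i = 0; i < num_scripts; i++)
	{
		r -= scripts[i].weight;
		if (r < 0)
			return i;
	}
	return num_scripts - 1;		/* not reached */
}

int
main(void)
{
	ScriptEntry scripts[] = {
		{"simple-update", 5},
		{"select-only", 2},
		{"custom.sql", 1}		/* hypothetical user script */
	};
	int			counts[3] = {0, 0, 0};
	int			i;

	for (i = 0; i < 100000; i++)
		counts[choose_script(scripts, 3)]++;

	/* expect roughly 5/8, 2/8 and 1/8 of the draws */
	for (i = 0; i < 3; i++)
		printf("%s: %d\n", scripts[i].name, counts[i]);
	return 0;
}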


+static SQLScript sql_script[MAX_SCRIPTS];

+static struct {
+   char *name;   /* very short name for -b ...*/
+   char *desc;   /* short description */
+   char *script; /* actual pgbench script */
+} builtin_script[]


Can't we put these in the same array?


I do not understand.


+   printf("transaction type: %s\n",
+  num_scripts == 1? sql_script[0].name: "multiple scripts");


Seems like it'd be more useful to simply always list the scripts +
weights here.


The detailed list is shown later, with the summary performance figure for 
each script, so ISTM that it would be redundant? Maybe the transaction 
type could be moved down just before said list?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 4:52 AM, Shulgin, Oleksandr wrote:
> On Tue, Sep 1, 2015 at 8:12 PM, Andres Freund  wrote:
>>
>> On 2015-09-01 14:07:19 -0400, Robert Haas wrote:
>> > But I think it's quite wrong to assume that the infrastructure for
>> > this is available and usable everywhere, because in my experience,
>> > that's far from the case.
>>
>> Especially when the alternative is a rather short patch implementing an
>> otherwise widely available feature.
>
> But that won't actually help in the case described by Robert: if the master
> server A failed, the client has no idea if B or C would become the new
> master.

Sure it does.  You just need to ensure that whichever of those is the
new master accepts connections, and the other one doesn't.  There are
lots of ways to do this; e.g. give the machine a second IP that
accepts connections only when the machine is the designated master,
and have read-write clients connect to that IP, and read-only clients
connect to the machine's main IP.

Andres's point is the same as mine: we ought to accept this feature,
in some form, because it's really quite useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] src/test/ssl broken on HEAD

2015-09-02 Thread Andrew Dunstan



On 09/02/2015 04:22 PM, Robert Haas wrote:

On Wed, Aug 26, 2015 at 3:37 AM, Michael Paquier wrote:

On Wed, Aug 26, 2015 at 4:35 PM, Michael Paquier wrote:

Only HEAD is impacted, and attached is a patch to fix the problem.

Actually this version is better, I forgot to update a comment.

For so long as this test suite is not run by 'make check-world' or by
the buildfarm, it's likely to keep getting broken, and we're likely to
keep not noticing.  I realize that the decision to exclude this from
'make check-world' was deliberately made for security reasons, but
that doesn't take anything away from the maintenance headache.

Still, that's not a reason not commit this, so done.




Tell me what's needed and I'll look at creating a buildfarm test module 
for it.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stats per script & other stuff

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 2:20 PM, Fabien COELHO  wrote:
>> I'm wondering if percentages instead of weights would be a better
>> idea. That'd mean you'd be forced to be more careful when adding another
>> script (having to adjust the percentages of other scripts) but arguably
>> that's a good thing?
>
> If you use only percent, then you have to check that the total is 100,
> probably you have to use floats, to do something when the total is not 100,
> checking would complicate the code and test people mental calculus
> abilities. Not sure this is a good idea:-)

I agree.  I don't see a reason to enforce that the total of the
weights must be 100.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-09-02 Thread Robert Haas
On Mon, Aug 10, 2015 at 1:16 PM, Jeff Janes  wrote:
> I have a simple test case that inserts an array of 101 md5 digests into each
> row.  With 10_000 of these rows inserted into an already indexed table, I
> get 40MB for the table and 80MB for the index unpatched.  With the patch, I
> get 7.3 MB for the index.

Uh, wow.  Seems like we should do something about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Core dump with nested CREATE TEMP TABLE

2015-09-02 Thread Alvaro Herrera
Michael Paquier wrote:

> I haven't written yet a test case but I think that we could reproduce
> that simply by having a relation referenced in the exception block of
> a first function, calling a second function that itself raises an
> exception, causing the referencing error.

Hm, so function 2 is called in the exception block of function 1?

Are you going to produce the test case for this?  Jim?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-02 Thread Tom Lane
I wrote:
> ... I imagine that we should allow FDWs to
> store some data within RelOptInfo structs that represent foreign joins
> belonging entirely to them, so that there'd be a handy place to keep that
> data till later.

Actually, if we do that (ie, provide a "void *fdw_state" field in join
RelOptInfos), then the FDW could use the nullness or not-nullness of
such a field to realize whether or not it had already considered this
join relation.  So I'm now thinking that the best API is to call the
FDW at the end of each make_join_rel call, whether it's the first one
for the joinrel or not.  If the FDW wants a call for each legal pair of
input sub-relations, it's got one.  If it only wants one call per joinrel,
it can just make sure to put something into fdw_state, and then on
subsequent calls for the same joinrel it can just exit immediately if
fdw_state is already non-null.  So we have both use-cases covered.
Also, by doing this at the end, the FDW can look at the "regular" (local
join execution) paths that were already generated, should it wish to.
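
To make that concrete, here is a toy sketch of the "exit early if
fdw_state is already set" pattern.  The struct below is a stand-in for a
join RelOptInfo carrying the proposed "void *fdw_state" field; it is not
the real planner data structure or an existing FDW API:

#include <stdlib.h>

/* Toy stand-in for a join RelOptInfo with the proposed field. */
typedef struct ToyJoinRel
{
	void	   *fdw_state;		/* NULL until the FDW has seen this joinrel */
} ToyJoinRel;

/*
 * Would be called (under the proposal) at the end of every make_join_rel
 * for a foreign joinrel, i.e. once per legal pair of input sub-relations.
 * An FDW that wants only one call per joinrel stashes something in
 * fdw_state and bails out on later calls.
 */
static void
consider_foreign_join(ToyJoinRel *joinrel)
{
	if (joinrel->fdw_state != NULL)
		return;					/* already considered this joinrel */

	joinrel->fdw_state = malloc(1);		/* any non-NULL marker/private state */

	/*
	 * ... cost the remote join, look at the local join paths already added
	 * to the joinrel, and add a foreign-join path if it wins ...
	 */
}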

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow replication roles to use file access functions

2015-09-02 Thread Alvaro Herrera
Michael Paquier wrote:
> Hi all,
> 
> As of now, file access functions in genfile.c can only be used by
> superusers. This proposal is to relax those functions so as
> replication users can use them as well. Here are the functions aimed
> by this patch:
> - pg_stat_file
> - pg_read_binary_file
> - pg_read_file
> - pg_ls_dir
> The main argument for this change is that pg_rewind makes use of those
> functions, forcing users to use a superuser role when rewinding a
> node.

Can this piggyback on Stephen Frost's proposed patch for reserved roles et al?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-09-02 Thread Stefan Kaltenbrunner
On 09/02/2015 10:10 PM, dinesh kumar wrote:
> On Tue, Sep 1, 2015 at 10:58 PM, Stefan Kaltenbrunner wrote:
> 
> On 07/25/2015 03:38 AM, dinesh kumar wrote:
> >
> >
> > On Fri, Jul 24, 2015 at 10:22 AM, Robert Haas wrote:
> >
> > On Thu, Jul 23, 2015 at 8:15 PM, dinesh kumar wrote:
> > > On Thu, Jul 23, 2015 at 9:21 AM, Robert Haas wrote:
> > >>
> > >> On Thu, Jul 23, 2015 at 12:19 PM, dinesh kumar wrote:
> > >> > Sorry for my  unclear description about the proposal.
> > >> >
> > >> > "WITH PERMISSIVE" is equal to our existing behavior. That
> is, chmod=644
> > >> > on
> > >> > the created files.
> > >> >
> > >> > If User don't specify "PERMISSIVE" as an option, then the
> chmod=600 on
> > >> > created files. In this way, we can restrict the other
> users from reading
> > >> > these files.
> > >>
> > >> There might be some benefit in allowing the user to choose the
> > >> permissions, but (1) I doubt we want to change the default
> behavior
> > >> and (2) providing only two options doesn't seem flexible
> enough.
> > >>
> > >
> > > Thanks for your inputs Robert.
> > >
> > > 1) IMO, we will keep the exiting behavior as it is.
> > >
> > > 2) As the actual proposal talks about the permissions of
> group/others. So,
> > > we can add few options as below to the WITH clause
> > >
> > > COPY
> > > ..
> > > ..
> > > WITH
> > > [
> > > NO
> > > (READ,WRITE)
> > > PERMISSION TO
> > > (GROUP,OTHERS)
> > > ]
> >
> > If we're going to do anything here, it should use COPY's
> > extensible-options syntax, I think.
> >
> >
> > Thanks Robert. Let me send a patch for this.
> 
> 
> how are you going to handle windows or unix ACLs here?
> Its permission model is quite different and more powerful than (non-acl
> based) unix in general, handling this in a flexible way might soon get
> very complicated and complex for limited gain...
> 
> 
> Hi Stefan,
> 
> I had the same questions too. But, I believe, our initdb works in these
> cases, after creating the data cluster. Isn't ?

Maybe - but having a fixed "default" is very different from baking a
classic unix permission concept of user/group/others into actual
DDL or into a COPY option. The proposed syntax might make some sense to
an admin used to a unix-style system, but it is likely utterly
incomprehensible to somebody who is used to (windows-style) ACLs.

I don't have a good answer on what to do instead atm, but I don't think we
should embed traditional/historical unix permission models in our
grammar unless really, really needed...


Stefan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-09-02 Thread Peter Geoghegan
On Wed, Sep 2, 2015 at 1:18 AM, Amit Langote wrote:
> Did you get around to making a patch for this?

I've worked on it inconsistently. I'll pick this up again soon. I may
take the opportunity to talk this over with Andres in person when we
meet at Postgres Open shortly.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus  wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
>2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it.  Let me make a few
observations:

1. None of this stuff matters very much when the data is strictly
read-only.  You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't.  You
replace a lost node by taking a new copy. Nodes never need to catch up
because there are no changes happening.  To bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime).  So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.  The issue arises when you have multiple copies of the data,
and the data is being changed.  Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.

3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants.  Then it
generates an SQL statement to be sent to the data node to make the
appropriate change.  If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node.  I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good.  Fundamentally, this model doesn't
allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-02 Thread Robert Haas
On Tue, Sep 1, 2015 at 6:13 PM, Jim Nasby  wrote:
> My thought is that there's a fair amount of places where we do string
> comparison for not a great reason. Perhaps a better example is data that
> comes back from a trigger; AFTER/BEFORE, INSERT/UPDATE/..., which is more
> expensive to setup the variables for (strdup a fixed string, which means a
> palloc), and then comparisons are done as text varlena (iirc).
>
> Instead if this information came back as an ENUM the variable would be a
> simple int as would the comparison. We'd still have a raw string being
> parsed in the function body, but that would happen once during initial
> compilation and it would be replaced with an ENUM value.
>
> For RAISE, AFAIK we still end up converting the raw string into a varlena
> CONST, which means a palloc. If it was an ENUM it'd be converted to an int.
>
> If we're worried about the overhead of the enum machinery we could create a
> relcache for system enums, but I suspect that even without that it'd be a
> win over the string stuff. Especially since I bet most people run UTF8.

I agree with Pavel on this one: creating an extra type here is going
to cause more pain than it removes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Greg Stark
On 2 Sep 2015 14:54, "Andres Freund"  wrote:
>
>
> > + /*--
> > +  * The user-visible GUC parameter is the number of drives (spindles),
> > +  * which we need to translate to a number-of-pages-to-prefetch target.
> > +  * The target value is stashed in *extra and then assigned to the 
> > actual
> > +  * variable by assign_effective_io_concurrency.
> > +  *
> > +  * The expected number of prefetch pages needed to keep N drives busy 
> > is:
> > +  *
> > +  * drives |   I/O requests
> > +  * ---+
> > +  *  1 |   1
> > +  *  2 |   2/1 + 2/2 = 3
> > +  *  3 |   3/1 + 3/2 + 3/3 = 5 1/2
> > +  *  4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
> > +  *  n |   n * H(n)
>
> I know you just moved this code. But: I don't buy this formula. Like at
> all. Doesn't queuing and reordering entirely invalidate the logic here?

I can take the blame for this formula.

It's called the "Coupon Collector Problem": if each hit gets you a random
coupon from a set of n possible coupons, how many random coupons would
you have to collect before you expect to have at least one of each?

This computation model assumes we have no information about which
spindle each block will hit. That's basically true for the case of
bitmapheapscan for most cases because the idea of bitmapheapscan is to
be picking a sparse set of blocks and there's no reason the blocks
being read will have any regularity that causes them all to fall on
the same spindles. If in fact you're reading a fairly dense set then
bitmapheapscan probably is a waste of time and simply reading
sequentially would be exactly as fast or even faster.
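
For reference, a minimal standalone sketch of that expectation - n * H(n),
where H(n) is the n-th harmonic number.  This mirrors the harmonic-series
loop used to turn the GUC into a prefetch-page target, but the program
below is only an illustration, not the actual backend code:

#include <stdio.h>

/*
 * Expected number of concurrent I/O requests needed to keep n_drives
 * busy, per the coupon-collector argument above: n * H(n).
 */
static double
expected_prefetch_pages(int n_drives)
{
	double		pages = 0.0;
	int			i;

	for (i = 1; i <= n_drives; i++)
		pages += (double) n_drives / (double) i;

	return pages;
}

int
main(void)
{
	int			n;

	/* prints 1.00, 3.00, 5.50, 8.33 - matching the table quoted above */
	for (n = 1; n <= 4; n++)
		printf("%d drive(s) -> %.2f requests\n", n, expected_prefetch_pages(n));
	return 0;
}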

We talked about this quite a bit back then and there was no dispute
that the aim is to provide GUCs that mean something meaningful to the
DBA who can actually measure them. They know how many spindles they
have. They do not know what the optimal prefetch depth is and the only
way to determine it would be to experiment with Postgres. Worse, I
think the above formula works for essentially random I/O but for more
predictable I/O it might be possible to use a different formula. But
if we made the GUC something low level like "how many blocks to
prefetch" then we're left in the dark about how to handle that
different access pattern.

I did speak to a dm developer and he suggested that the kernel could
help out with an API. He suggested something of the form "how many
blocks do I have to read before the end of the current device". I
wasn't sure exactly what we would do with something like that but it
would be better than just guessing how many I/O operations we need to
issue to keep all the spindles busy.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

On 02/09/2015 18:06, Tomas Vondra wrote:
> Hi
> 
> On 09/02/2015 03:53 PM, Andres Freund wrote:
>> 
>> Hi,
>> 
>> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
>>> I didn't know that the thread must exists on -hackers to be
>>> able to add a commitfest entry, so I transfer the thread here.
>> 
>> Please, in the future, also update the title of the thread to
>> something fitting.
>> 

Sorry for that.

>>> @@ -539,6 +541,9 @@ ExecInitBitmapHeapScan(BitmapHeapScan
>>> *node, EState *estate, int eflags) { BitmapHeapScanState
>>> *scanstate; RelationcurrentRelation; +#ifdef USE_PREFETCH +
>>> int new_io_concurrency; +#endif
>>> 
>>> /* check for unsupported flags */ Assert(!(eflags &
>>> (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK))); @@ -598,6 +603,25 @@
>>> ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate,
>>> int eflags) */ currentRelation = ExecOpenScanRelation(estate, 
>>> node->scan.scanrelid, eflags);
>>> 
>>> +#ifdef USE_PREFETCH +/* check if the
>>> effective_io_concurrency has been overloaded for the + *
>>> tablespace storing the relation and compute the 
>>> target_prefetch_pages, + * or just get the current
>>> target_prefetch_pages + */ +new_io_concurrency =
>>> get_tablespace_io_concurrency( +
>>> currentRelation->rd_rel->reltablespace); + + +
>>> scanstate->target_prefetch_pages = target_prefetch_pages; + +
>>> if (new_io_concurrency != effective_io_concurrency) +{ +
>>> double prefetch_pages; +   if
>>> (compute_io_concurrency(new_io_concurrency, &prefetch_pages)) +
>>> scanstate->target_prefetch_pages = rint(prefetch_pages); +
>>> } +#endif
>> 
>> Maybe it's just me - but imo there should be as few USE_PREFETCH 
>> dependant places in the code as possible. It'll just be 0 when
>> not supported, that's fine?
> 
> It's not just you. Dealing with code with plenty of ifdefs is
> annoying - it compiles just fine most of the time, until you
> compile it with some rare configuration. Then it either starts
> producing strange warnings or the compilation fails entirely.
> 
> It might make a tiny difference on builds without prefetching
> support because of code size, but realistically I think it's
> noise.
> 
>> Especially changing the size of externally visible structs
>> depending ona configure detected ifdef seems wrong to me.
> 
> +100 to that
> 

I totally agree. I'll remove the ifdefs.

>> Nitpick: Wrong comment style (/* stands on its own line).

I did run pgindent before submitting the patch, but apparently I picked
the wrong one. Already fixed in local branch.

>>> +/*-- + * The user-visible GUC parameter is the
>>> number of drives (spindles), + * which we need to translate
>>> to a number-of-pages-to-prefetch target. + * The target
>>> value is stashed in *extra and then assigned to the actual +
>>> * variable by assign_effective_io_concurrency. + * + *
>>> The expected number of prefetch pages needed to keep N drives 
>>> busy is: + * + * drives |   I/O requests + *
>>> ---+ + *1 |   1 + *
>>> 2 |   2/1 + 2/2 = 3 + *3 |   3/1 + 3/2 + 3/3 = 5
>>> 1/2 + *4 |   4/1 + 4/2 + 4/3 + 4/4 = 8 1/3 + *
>>> n |   n * H(n)
>> 
>> I know you just moved this code. But: I don't buy this formula.
>> Like at all. Doesn't queuing and reordering entirely invalidate
>> the logic here?
> 
> Well, even the comment right next after the formula says that:
> 
> * Experimental results show that both of these formulas aren't
> * aggressive enough, but we don't really have any better proposals.
> 
> That's the reason why users generally either use 0 or some rather
> high value (16 or 32 are the most common values seen). The problem
> is that we don't really care about the number of spindles (and not
> just because SSDs don't have them at all), but about the target
> queue length per device. Spinning rust uses TCQ/NCQ to optimize the
> head movement, SSDs are parallel by nature (stacking multiple chips
> with separate channels).
> 
> I doubt we can really improve the formula, except maybe for saying
> "we want 16 requests per device" and multiplying the number by
> that. We don't really have the necessary introspection to determine
> better values (and it's not really possible, because the devices
> may be hidden behind a RAID controller (or a SAN). So we can't
> really do much.
> 
> Maybe the best thing we can do is just completely abandon the
> "number of spindles" idea, and just say "number of I/O requests to
> prefetch". Possibly with an explanation of how to estimate it
> (devices * queue length).
> 
>> I think that'd be a lot better.

+1 for that too.

If everyone's ok with this change, I can submit a patch for that too.
Should I split that into two patches, and/or start a new thread?



- -- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (GNU/Linux)


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Robert Haas
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.

Yep.  It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions.  Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

How is that different from the previous thing?  Just that we'd treat
"lagging" as "down" beyond some threshold?  That doesn't seem like a
mandatory feature.

>> But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.

I don't think so.  If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there.  That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read.  It's
complicated, but it doesn't seem intractable.

>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
> harder to set up, and becomes a SPOF.

I partially agree.  I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component.  People who don't want it shouldn't have to
pay the price in performance and administrative complexity.  But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow a per-tablespace effective_io_concurrency setting

2015-09-02 Thread Julien Rouhaud
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 02/09/2015 15:53, Andres Freund wrote:
> On 2015-07-18 12:17:39 +0200, Julien Rouhaud wrote:
> 
> You also didn't touch /* * How many buffers PrefetchBuffer callers
> should try to stay ahead of their * ReadBuffer calls by.  This is
> maintained by the assign hook for * effective_io_concurrency.  Zero
> means "never prefetch". */ inttarget_prefetch_pages = 
> 0; which
> surely doesn't make sense anymore after these changes.
> 
> But do we even need that variable now?
> 

I thought this was related to the effective_io_concurrency GUC
(possibly overloaded by the per-tablespace setting), so I didn't make
any change on that.

I also just found an issue with my previous patch: the global
effective_io_concurrency GUC was ignored if the tablespace had a
specific seq_page_cost or random_page_cost setting. I just fixed that
in my local branch.


>> diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h 
>> index dc167f9..57008fc 100644 --- a/src/include/utils/guc.h +++
>> b/src/include/utils/guc.h @@ -26,6 +26,9 @@ #define MAX_KILOBYTES
>> (INT_MAX / 1024) #endif
>> 
>> +/* upper limit for effective_io_concurrency */ +#define
>> MAX_IO_CONCURRENCY 1000 + /* * Automatic configuration file name
>> for ALTER SYSTEM. * This file will be used to store values of
>> configuration parameters @@ -256,6 +259,8 @@ extern int
>> temp_file_limit;
>> 
>> extern int   num_temp_buffers;
>> 
>> +extern int  effective_io_concurrency; +
> 
> target_prefetch_pages is declared in bufmgr.h - that seems like a
> better place for these.
> 

I was rather sceptical about that too. I'll move these into bufmgr.h.


Regards.


- -- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (GNU/Linux)

iQEcBAEBAgAGBQJV51pcAAoJELGaJ8vfEpOqIV0H/Rj1e/DtJS60X2mReWDyfooD
G3j0Ptblhy+brYIIxo9Bdp9hVeYFmEqlOJIht9T/3gjfkg5IMz+5bV2waEbAan/m
9uedR/RmS9sz2YpwGgpd21bfSt2LwB+UC448t3rq8KtuzwmXgSVVEflmDmN1qV3z
PseUFzS74HeIJWfxLRLGsJ5amN0hJ8bdolIfxdFR0FyFDv0tRv1DzppdMebVJmHs
uIdJOU49sIDHjcnsUcq67jkP+IfTUon+nnwvk5FYVVKdBX2ka1Q/1VAvTfmWo0oV
WZSlIjQdMUnlTX91zke0NdmsTnagIeRy1oISn/K1v+YmSqnsPqPAcZ6FFQhUMqI=
=4ofZ
-END PGP SIGNATURE-


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Horizontal scalability/sharding

2015-09-02 Thread Josh Berkus
On 09/02/2015 12:30 PM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus  wrote:
>>> 4. Therefore, I think that we should instead use logical replication,
>>> which might be either synchronous or asynchronous.  When you modify
>>> one copy of the data, that change will then be replicated to all other
>>> nodes.  If you are OK with eventual consistency, this replication can
>>> be asynchronous, and nodes that are off-line will catch up when they
>>> are on-line.  If you are not OK with that, then you must replicate
>>> synchronously to every node before transaction commit; or at least you
>>> must replicate synchronously to every node that is currently on-line.
>>> This presents some challenges: logical decoding currently can't
>>> replicate transactions that are still in process - replication starts
>>> when the transaction commits.  Also, we don't have any way for
>>> synchronous replication to wait for multiple nodes.
>>
>> Well, there is a WIP patch for that, which IMHO would be much improved
>> by having a concrete use-case like this one.  What nobody is working on
>> -- and we've vetoed in the past -- is a way of automatically failing and
>> removing from replication any node which repeatedly fails to sync, which
>> would be a requirement for this model.
> 
> Yep.  It's clear to me we need that in general, not just for sharding.
> To me, the key is to make sure there's a way for the cluster-ware to
> know about the state transitions.  Currently, when the synchronous
> standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

There are many parts of our replication which are still effectively
unmonitorable. For example, there's still no way to tell from the
replica that it's lost contact with the master except by tailing the
log.  If we try to build bigger systems on top of these components,
we'll find that we need to add a lot of instrumentation.

> 
>> You'd also need a way to let the connection nodes know when a replica
>> has fallen behind so that they can be taken out of
>> load-balancing/sharding for read queries.  For the synchronous model,
>> that would be "fallen behind at all"; for asynchronous it would be
>> "fallen more than ### behind".
> 
> How is that different from the previous thing?  Just that we'd treat
> "lagging" as "down" beyond some threshold?  That doesn't seem like a
> mandatory feature.

It's a mandatory feature if you want to load-balance reads.  We have to
know which nodes not to send reads to because they are out of sync.

>> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
> 
> I don't think so.  If transaction A writes data on X which is
> replicated to Y and then commits, a new snapshot which shows A as
> committed can't be used on Y until A's changes have been replicated
> there.  That could be enforced by having the commit of A wait for
> replication, or by having an attempt by a later transaction to use the
> snapshot on Y wait until replication completes, or some even more
> sophisticated strategy that considers whether the replication backlog
> touches the same data that the new transaction will read.  It's
> complicated, but it doesn't seem intractable.

I need to see this on a chalkboard to understand it.

>>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
>> harder to set up, and becomes a SPOF.
> 
> I partially agree.  I think it's very important that the GTM is an
> optional feature of whatever we end up with, rather than an
> indispensable component.  People who don't want it shouldn't have to
> pay the price in performance and administrative complexity.  But at
> the same time, I think a lot of people will want it, because without
> it, the fact that sharding is in use is much less transparent to the
> application.

If it can be optional, then we're pretty close to covering most use
cases with one general infrastructure.  That would be nice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

