Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Noah Misch
On Tue, Sep 15, 2015 at 03:18:21PM -0400, Adam Brightwell wrote:
> On Tue, Sep 15, 2015 at 2:26 PM, Joe Conway  wrote:
> >> Joe Conway  writes:
> >>> There are use cases where row_security=force will be set in production
> >>> environments, not only in testing.

> > Noah's suggestion of using a per table attribute
> > would work -- in fact I like the idea of that better than using the
> > current GUC.
> 
FWIW, I also concur with a per-table attribute for this purpose.  In
fact, I think I prefer the per-table flexibility over an
'all-or-nothing' approach too.

Great.  Robert, does that work for you, too?  If so, this sub-thread is
looking at three patches:

1. remove row_security=force
2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies
3. add DDL-controlled, per-table policy forcing

They ought to land in that order.  PostgreSQL 9.5 would need at least (1) and
(2); would RLS experts find it beneficial for me to take care of those?

Thanks,
nm




Re: [HACKERS] creating extension including dependencies

2015-09-18 Thread Jeff Janes
On Sep 17, 2015 7:52 PM, "Petr Jelinek"  wrote:
>
> On 2015-09-17 17:31, Jeff Janes wrote:
>>
>>
>> Also, it would be nice to have psql tab complete the word CASCADE.
>>
>
> Hmm, it already does?

Indeed it does.  Oops.  I need to run the program I just compiled, and not
some other version that happens to be in my $PATH.  I've learned that for
pg_ctl mostly, but still forget for psql.

Cheers,


Jeff


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Michael Paquier
On Fri, Sep 18, 2015 at 12:34 PM, Michael Paquier
 wrote:
> On Fri, Sep 18, 2015 at 9:00 AM, Alexander Korotkov wrote:
>> BTW, it would be an option to generate system_identifier to each new
>> timeline, by analogy of initdb do for the whole WAL.
>> Having such system_identifiers we can distinguish different timeline which
>> have assigned same ids.
>> Any thoughts?
>
> If you mean a new field incorporated in XLogLongPageHeader and
> ControlFile to ensure that a new timeline generated is unique across
> the same installation identified with system_identifier, then I'm not
> against it for something up to 2 bytes (even 1 byte would be fine),

Er, 2 bytes may be a bad idea as well; a roughly 1-in-65536 chance of
collision looks dangerous when rewinding a node... It could cause silent
data corruption on the standby being rewound.
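
For scale, here is a standalone sketch (not from any patch in this
thread) computing the birthday-problem collision probability for k-byte
random identifiers:

    #include <stdio.h>

    /* chance that at least two of n random k-byte ids collide */
    static double collision_prob(int n, int k)
    {
        double space = (double) (1UL << (8 * k));
        double p_unique = 1.0;

        for (int i = 0; i < n; i++)
            p_unique *= (space - i) / space;
        return 1.0 - p_unique;
    }

    int main(void)
    {
        printf("2 timelines, 2 bytes: %.6f%%\n", 100.0 * collision_prob(2, 2));
        printf("10 timelines, 2 bytes: %.6f%%\n", 100.0 * collision_prob(10, 2));
        return 0;
    }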
-- 
Michael




Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Michael Paquier
On Fri, Sep 18, 2015 at 6:25 PM, Michael Paquier wrote:
> The refactoring of getTimelineHistory as you propose looks like a good
> idea to me. I tried myself to remove the difference between source
> and target in copy_fetch.c and friends, but it gets uglier,
> particularly because of datadir_source in copy_file_range. Not worth
> it.

Forgot that:
    if (ControlFile_target.state != DB_SHUTDOWNED)
        pg_fatal("target server must be shut down cleanly\n");
We may want to allow a target node that was shut down while in recovery
here as well.
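
A minimal sketch of that relaxation (my assumption of how it could look,
not a committed change), using the existing DB_SHUTDOWNED_IN_RECOVERY
control-file state:

    /* also accept a target that was shut down while in recovery */
    if (ControlFile_target.state != DB_SHUTDOWNED &&
        ControlFile_target.state != DB_SHUTDOWNED_IN_RECOVERY)
        pg_fatal("target server must be shut down cleanly\n");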
-- 
Michael




Re: [HACKERS] TABLESAMPLE patch is really in pretty sad shape

2015-09-18 Thread Peter Eisentraut
On 7/23/15 6:39 PM, Tom Lane wrote:
> + 2202H    E    ERRCODE_INVALID_TABLESAMPLE_ARGUMENT    invalid_tablesample_argument
> + 2202G    E    ERRCODE_INVALID_TABLESAMPLE_REPEAT    invalid_tablesample_repeat

Where did you get these error codes from?  The constants in the SQL
standard would map to

ERRCODE_INVALID_SAMPLE_SIZE
ERRCODE_INVALID_REPEAT_ARGUMENT_IN_A_SAMPLE_CLAUSE

Were you looking at a different standard, or did you intentionally
choose to rephrase?
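
For reference, entries using the standard-derived names would presumably
look like this in errcodes.txt (the layout follows the file's existing
columns; the lowercase condition-name spellings are my assumption):

    2202H    E    ERRCODE_INVALID_SAMPLE_SIZE    invalid_sample_size
    2202G    E    ERRCODE_INVALID_REPEAT_ARGUMENT_IN_A_SAMPLE_CLAUSE    invalid_repeat_argument_in_a_sample_clause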





Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Noah Misch
On Fri, Sep 18, 2015 at 09:01:15AM -0400, Adam Brightwell wrote:
> > 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to 
> > policies
> 
> I believe this one has already been addressed by Stephen
> (20150910192313.gt3...@tamriel.snowman.net)?  Are there further
> considerations for his proposed changes?

Right.  I'll use that patch, minus the bits examining InLocalUserIdChange().




Re: [HACKERS] extend pgbench expressions with functions

2015-09-18 Thread Fabien COELHO


Hello Kyotaro-san,


My description must have been obscure. Indeed, the call tree is
finite for a *sane* expression node, but it makes infinite calls for
a value of expr->etype unknown by both evalDouble and evalInt.


Such an issue would be detected if the function were actually tested,
which hopefully is the case... :-)


However I agree that relying implicitly on the "default" case is not very 
good practice, so I updated the code in the attached v11 to fail 
explicitly on such errors.


I also attached a small test script, which exercises most (all?) 
functions:


  ./pgbench -f functions.sql -t 1


[...]
By the way, the complexity comes from separating integer and
double. If there is no serios reason to separate them, handling
all values as double makes things far simpler.


Yep, but no.

Could you let me know why it strictly separates integer and double?
I don't see any problem with possible floating-point calculation errors
for this purpose. Is there any?


Indeed it would make things simpler, but it would break large integers as 
the int64 -> double -> int64 casts would result in approximations. The 
integer type is the important one here because it is used for primary 
keys, and you do not want a key to be approximated in any way, so the 
int64 type must be fully and exactly supported.
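
To illustrate the point with a standalone sketch (not part of the
patch): a double has a 53-bit mantissa, so int64 values above 2^53
cannot survive a round trip through double.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        /* 2^53 + 1 is the first integer a double cannot represent */
        int64_t key = (INT64_C(1) << 53) + 1;
        int64_t back = (int64_t) (double) key;  /* int64 -> double -> int64 */

        printf("original:   %" PRId64 "\n", key);    /* 9007199254740993 */
        printf("round trip: %" PRId64 "\n", back);   /* 9007199254740992 */
        return 0;
    }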


--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0ac40f1..11aa518 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -771,17 +771,20 @@ pgbench  options  dbname
   Sets variable varname to an integer value calculated
   from expression.
   The expression may contain integer constants such as 5432,
-  references to variables :variablename,
+  double constants such as 3.14156,
+  references to integer variables :variablename,
   and expressions composed of unary (-) or binary operators
   (+, -, *, /, %)
-  with their usual associativity, and parentheses.
+  with their usual associativity, function calls and parentheses.
+   shows the available
+  functions.
  
 
  
   Examples:
 
 \set ntellers 10 * :scale
-\set aid (1021 * :aid) % (10 * :scale) + 1
+\set aid (1021 * random(1, 10 * :scale)) % (10 * :scale) + 1
 
 

@@ -931,18 +934,110 @@ pgbench  options  dbname

   
 
+   
+   
+PgBench Functions
+
+ 
+  
+   Function
+   Return Type
+   Description
+   Example
+   Result
+  
+ 
+ 
+  
+   abs(a)
+   same as a
+   integer or double absolute value
+   abs(-17)
+   17
+  
+  
+   ddebug(x)
+   double
+   stderr print for debug and return argument
+   ddebug(5432.1)
+   5432.1
+  
+  
+   double(i)
+   double
+   evaluate as int and cast to double
+   double(5432)
+   5432.0
+  
+  
+   exporand(i, j, t)
+   integer
+   exponentially distributed random integer in the bounds, see below
+   exporand(1, 10, 3.0)
+   int between 1 and 10
+  
+  
+   idebug(i)
+   integer
+   stderr print for debug and return argument
+   idebug(5432)
+   5432
+  
+  
+   int(x)
+   integer
+   evaluate as double and cast to int
+   int(5.4 + 3.8)
+   9
+  
+  
+   gaussrand(i, j, t)
+   integer
+   gaussian distributed random integer in the bounds, see below
+   gaussrand(1, 10, 2.5)
+   int between 1 and 10
+  
+  
+   min(i, ...)
+   integer
+   minimum value
+   min(5, 4, 3, 2)
+   2
+  
+  
+   max(i, ...)
+   integer
+   maximum value
+   max(5, 4, 3, 2)
+   5
+  
+  
+   random(i, j)
+   integer
+   uniformly distributed random integer in the bounds
+   random(1, 10)
+   int between 1 and 10
+  
+  
+   sqrt(x)
+   double
+   square root
+   sqrt(2.0)
+   1.414213562
+  
+ 
+ 
+   
+
   
As an example, the full definition of the built-in TPC-B-like
transaction is:
 
 
-\set nbranches :scale
-\set ntellers 10 * :scale
-\set naccounts 10 * :scale
-\setrandom aid 1 :naccounts
-\setrandom bid 1 :nbranches
-\setrandom tid 1 :ntellers
-\setrandom delta -5000 5000
+\set aid random(1, 10 * :scale)
+\set bid random(1, 1 * :scale)
+\set tid random(1, 10 * :scale)
+\set delta random(-5000, 5000)
 BEGIN;
 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index e68631e..97bb559 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -16,10 +16,14 @@
 
 PgBenchExpr *expr_parse_result;
 
+static PgBenchExprList *make_elist(PgBenchExpr *exp, PgBenchExprList *list);
 static PgBenchExpr *make_integer_constant(int64 ival);
+static 

Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Pavel Stehule
2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr 
:

> On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas 
> wrote:
>
>> On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule 
>> wrote:
>>
>> >> Second, using a shm_mq manipulates the state of the process latch.  I
>> >> don't think you can make the assumption that it's safe to reset the
>> >> process latch at any and every place where we check for interrupts.
>> >> For example, suppose the process is already using a shm_mq and the
>> >> CHECK_FOR_INTERRUPTS() call inside that code then discovers that
>> >> somebody has activated this mechanism and you now go try to send and
>> >> receive from a new shm_mq.  But even if that and every other
>> >> CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
>> >> today, it's a new coding rule that could easily trip people up in the
>> >> future.
>> >
>> > It is valid, and probably most important. But if we introduce own
>> mechanism,
>> > we will play with process latch too (although we can use LWlocks)
>>
>> With the design I proposed, there is zero need to touch the process
>> latch, which is good, because I'm pretty sure that is going to be a
>> problem.  I don't think there is any need to use LWLocks here either.
>> When you get a request for data, you can just publish a DSM segment
>> with the data and that's it.  Why do you need anything more?  You
>> could set the requestor's latch if it's convenient; that wouldn't be a
>> problem.  But the process supplying the data can't end up in a
>> different state than it was before supplying that data, or stuff WILL
>> break.
>>
>
> There is still the whole problem of where exactly the backend being
> queried for the status should publish that DSM segment and when to free it?
>
> If it's a location shared between all backends, there should be locking
> around it.  Probably this is not a big problem, if you don't expect all the
> backends start querying each other rapidly.  That is how it was implemented
> in the first versions of this patch actually.
>
> If we take the per-backend slot approach the locking seems unnecessary and
> there are principally two options:
>
> 1) The backend puts the DSM handle in its own slot and notifies the
> requester to read it.
> 2) The backend puts the DSM handle in the slot of the requester (and
> notifies it).
>
> If we go with the first option, the backend that has created the DSM will
> not know when it's OK to free it, so this has to be responsibility of the
> requester.  If the latter exits before reading and freeing the DSM, we have
> a leak.  Even bigger is the problem that the sender backend can no longer
> send responses to a number of concurrent requestors: if its slot is
> occupied by a DSM handle, it can not send a reply to another backend until
> the slot is freed.
>
> With the second option we have all the same problems with not knowing when
> to free the DSM and potentially leaking it, but we can handle concurrent
> requests.
>

That should not be true - the data sender creates the DSM and fills it.
Then it sets the caller's slot and sends a signal to the caller. The
caller can free the DSM at any time, because the data sender will never
touch it again.
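
As a rough sketch of that protocol (the slot and requester structures
here are invented for illustration; dsm_create, dsm_segment_address,
dsm_segment_handle and SetLatch are the real APIs, and all error
handling is omitted):

    /* sender side: create and fill the segment, then hand off the handle */
    dsm_segment *seg = dsm_create(payload_size, 0);

    memcpy(dsm_segment_address(seg), plan_text, payload_size);

    slot->handle = dsm_segment_handle(seg);  /* "slot" is hypothetical */
    SetLatch(&requester_proc->procLatch);    /* wake up the requester */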


>
> The current approach where the requester creates and frees the DSM doesn't
> suffer from these problems, so if we pre-allocate the segment just big
> enough we can avoid the use of shm_mq.  That will take another GUC for the
> segment size.  Certainly no one expects a query plan to weigh a bloody
> megabyte, but this is what happens to Pavel apparently.
>

It is plan C - the last variant, from my point of view. Yet another GUC :(

Pavel


>
> --
> Alex
>
>


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Pavel Stehule
> 3. When the data are multiattribute - merging them together with a space
> separator is not practical
>
>   * important information is lost
>   * the same transformation can be done as an expression, so this feature is
> useless
>
> Is it possible to use one cell per attribute (no merging)?
>
> DATA QUERY: SELECT dim1, dim2, sum(x), avg(x) FROM .. GROUP BY dim1, dim2
>
> and the result header of rotate could be
>
> DIM1   | dim2_val1/sum | dim2_val1/avg | dim2_val2/sum | dim2_val2/avg |
> ...
>

The last point can wait - we don't need to show the pivot table with all
details perfectly in the first step.

The main issue with this patch is the name - "rotate" seems really strange
to me. Please change it :) - crosstab is much better

Regards

Pavel


Re: [HACKERS] checkpointer continuous flushing

2015-09-18 Thread Fabien COELHO


Hello,

[...] If you make the sorting criterion include the tablespace id you 
wouldn't need the lookahead loop in NextBufferToWrite().


I'm considering this precise point, i.e. including the tablespace as
a sorting criterion.

Currently the array used for sorting is 16 bytes per buffer (although I 
wrote 12 in another mail, I was wrong...). The data include the bufid (4 
bytes), the relation & fork num (8 bytes, though really only 4 bytes + 2 
bits are used), and the block number (4 bytes), which is the offset within 
the relation. These three pieces of data combined allow finding the file, 
and the offset within that file, for the given buffer id.
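
A sketch of what such a 16-byte entry could look like (the struct and
field names are my guesses, not the patch's actual declarations):

    typedef struct BufferSortItem
    {
        int         buf_id;     /* buffer id (4 bytes) */
        Oid         relNode;    /* relation (4 bytes) */
        ForkNumber  forkNum;    /* fork number (4 bytes, ~2 bits used) */
        BlockNumber blockNum;   /* block offset in the relation (4 bytes) */
    } BufferSortItem;           /* 16 bytes total */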


I'm concerned that these 16 bytes are already significant and I do not 
want to extend them any more. I was already pretty happy with the previous 
version with 4 bytes per buffer.


Now, as the number of tablespaces is expected to be very small (1, 2, maybe 
3), there is no problem packing it into the unused 30 bits of forknum. 
That would mean some masking and casting here and there, so it would not be 
very beautiful, but it would make it easy to find the buffers for a given 
tablespace and, indeed, remove the lookahead stuff in the next-buffer 
function, as you suggest.
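
A minimal standalone sketch of that packing (illustrative masks only,
not the patch's code):

    #include <stdio.h>
    #include <stdint.h>

    #define FORKNUM_BITS  2
    #define FORKNUM_MASK  ((1u << FORKNUM_BITS) - 1)

    int main(void)
    {
        uint32_t forknum = 1;   /* stand-in for a ForkNumber value */
        uint32_t ts_index = 2;  /* small per-checkpoint tablespace index */

        /* pack the tablespace index into the bits above the fork number */
        uint32_t packed = (ts_index << FORKNUM_BITS) | (forknum & FORKNUM_MASK);

        /* unpack */
        printf("fork=%u tablespace=%u\n",
               packed & FORKNUM_MASK, packed >> FORKNUM_BITS);
        return 0;
    }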



My question is: would that be acceptable, or would someone object to the 
use of masks and things like that?  The benefit would be a simpler/more 
direct next buffer function, but some more tinkering around the sorting 
criterion to use a packed representation.


Note that I do not think that it would have any actual impact on 
performance... it would only make a difference if there were really many 
tablespaces (the scanning complexity would be Nbuffer instead of 
Nbuffer*Ntablespace, but as Ntablespace is small...).  My motivation is 
rather to help the patch get through, so I'm fine if this is not needed.


--
Fabien.




Re: [HACKERS] Freeze avoidance of very large table.

2015-09-18 Thread Fujii Masao
On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada  wrote:
> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao  wrote:
>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada  
>> wrote:
>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada  
>>> wrote:
 On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
  wrote:
> Jim Nasby wrote:
>
>> I think things like pageinspect are very different; I really can't see 
>> any
>> use for those beyond debugging (and debugging by an expert at that).
>
> I don't think that necessarily means it must continue to be in contrib.
> Quite the contrary, I think it is a tool critical enough that it should
> not be relegated to be a second-class citizen as it is now (let's face
> it, being in contrib *is* second-class citizenship).
>

 Attached patch is latest patch.
>>>
>>> The previous patch lacks some files for regression test.
>>> Attached fixed v12 patch.
>>
>> The patch could be applied cleanly. "make check" could pass successfully.
>> But "make check-world -j 2" failed.
>>
>
> Thank you for looking at this patch.
> Could you tell me what test you got failed?
> make check-world -j 2 or more is done successfully in my environment.

I tried to do the test again, but initdb failed with the following error.

creating template1 database in data/base/1 ... FATAL:  invalid
input syntax for type oid: "f"

This error didn't happen when I tested before. So the commit which was
applied recently might interfere with the patch.

Regards,

-- 
Fujii Masao




Re: [HACKERS] numbering plan nodes

2015-09-18 Thread Kouhei Kaigai
> On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai  wrote:
> > I entirely agree with the idea of a plan-node identifier; however, I am
> > uncertain whether the node-id should represent a physical location in
> > the dynamic shared memory segment, because
> > (1) relatively few node types need shared state,
> > so most of the array items would be empty, and
> > (2) an extension that tries to modify the plan tree using planner_hook
> > may need to adjust node-ids as well.
> >
> > Even though shm_toc_lookup() has to walk the toc entries to find
> > the node-id, that happens once, at the beginning of executor startup on
> > the background worker side. I don't think it is a significant problem.
> 
> Yes, I was thinking that what would make sense is to have each
> parallel-aware node call shm_toc_insert() using its ID as the key.
> Then, we also need Instrumentation nodes.  For those, I thought we
> could use some fixed, high-numbered key, and Tom's idea.
>
Hmm, indeed, run-time statistics are needed for every node.
If the array indexed by node-id were a hash table instead, we could handle
non-contiguous node-ids with no trouble.
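
A rough sketch of the registration scheme Robert describes above (the
key offset and variable names are illustrative; shm_toc_insert and
shm_toc_lookup are the real APIs):

    /* each parallel-aware node publishes its shared state under its id */
    shm_toc_insert(toc, PARALLEL_KEY_FIRST_NODE + node_id, shared_state);

    /* a worker finds it the same way */
    void *state = shm_toc_lookup(toc, PARALLEL_KEY_FIRST_NODE + node_id);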

> Are there extensions that use planner_hook to do surgery on the plan
> tree?  What do they do, exactly?
>
(Even though it will not work under Funnel,) PG-Strom often injects
a preprocessor node under the Agg node to produce partial aggregation,
reducing the number of rows to be processed by the CPU.

Also, I have seen a paper published by Fujitsu folks. Their module
modifies the plan tree to replace a built-in scan node with their own
columnar-storage scan node.
  http://db-event.jpn.org/deim2015/paper/195.pdf
The paper is written in Japanese, but figure 3 on page 4 shows
what I described above.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Shulgin, Oleksandr
On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas  wrote:

> On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule 
> wrote:
>
> >> Second, using a shm_mq manipulates the state of the process latch.  I
> >> don't think you can make the assumption that it's safe to reset the
> >> process latch at any and every place where we check for interrupts.
> >> For example, suppose the process is already using a shm_mq and the
> >> CHECK_FOR_INTERRUPTS() call inside that code then discovers that
> >> somebody has activated this mechanism and you now go try to send and
> >> receive from a new shm_mq.  But even if that and every other
> >> CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
> >> today, it's a new coding rule that could easily trip people up in the
> >> future.
> >
> > It is valid, and probably most important. But if we introduce own
> mechanism,
> > we will play with process latch too (although we can use LWlocks)
>
> With the design I proposed, there is zero need to touch the process
> latch, which is good, because I'm pretty sure that is going to be a
> problem.  I don't think there is any need to use LWLocks here either.
> When you get a request for data, you can just publish a DSM segment
> with the data and that's it.  Why do you need anything more?  You
> could set the requestor's latch if it's convenient; that wouldn't be a
> problem.  But the process supplying the data can't end up in a
> different state than it was before supplying that data, or stuff WILL
> break.
>

There is still the whole problem of where exactly the backend being queried
for the status should publish that DSM segment and when to free it?

If it's a location shared between all backends, there should be locking
around it.  Probably this is not a big problem, if you don't expect all the
backends start querying each other rapidly.  That is how it was implemented
in the first versions of this patch actually.

If we take the per-backend slot approach the locking seems unnecessary and
there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the
requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and
notifies it).

If we go with the first option, the backend that has created the DSM will
not know when it's OK to free it, so this has to be responsibility of the
requester.  If the latter exits before reading and freeing the DSM, we have
a leak.  Even bigger is the problem that the sender backend can no longer
send responses to a number of concurrent requestors: if its slot is
occupied by a DSM handle, it can not send a reply to another backend until
the slot is freed.

With the second option we have all the same problems with not knowing when
to free the DSM and potentially leaking it, but we can handle concurrent
requests.

The current approach where the requester creates and frees the DSM doesn't
suffer from these problems, so if we pre-allocate the segment just big
enough we can avoid the use of shm_mq.  That will take another GUC for the
segment size.  Certainly no one expects a query plan to weigh a bloody
megabyte, but this is what happens to Pavel apparently.

--
Alex


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Shulgin, Oleksandr
On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule 
wrote:

> 2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de>:
>
>>
>> If we take the per-backend slot approach the locking seems unnecessary
>> and there are principally two options:
>>
>> 1) The backend puts the DSM handle in its own slot and notifies the
>> requester to read it.
>> 2) The backend puts the DSM handle in the slot of the requester (and
>> notifies it).
>>
>> If we go with the first option, the backend that has created the DSM will
>> not know when it's OK to free it, so this has to be responsibility of the
>> requester.  If the latter exits before reading and freeing the DSM, we have
>> a leak.  Even bigger is the problem that the sender backend can no longer
>> send responses to a number of concurrent requestors: if its slot is
>> occupied by a DSM handle, it can not send a reply to another backend until
>> the slot is freed.
>>
>> With the second option we have all the same problems with not knowing
>> when to free the DSM and potentially leaking it, but we can handle
>> concurrent requests.
>>
>
> That should not be true - the data sender creates the DSM and fills it.
> Then it sets the caller's slot and sends a signal to the caller. The
> caller can free the DSM at any time, because the data sender will never
> touch it again.
>

But the requester can time out waiting for the reply and exit before it sees
the reply DSM.  Actually, I now don't even think a backend can free a DSM
it has not created.  First it will need to attach it, effectively
increasing the refcount, and upon detach it will only decrease the
refcount, but not actually release the segment...

So this has to be the responsibility of the reply sending backend in the
end: to create and release the DSM *at some point*.
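
In code terms, a minimal sketch of the refcounting behavior just
described (dsm_attach, dsm_segment_address and dsm_detach are the real
calls; read_reply is hypothetical):

    /* requester side: attaching bumps the segment's refcount */
    dsm_segment *seg = dsm_attach(reply_handle);

    read_reply(dsm_segment_address(seg));

    /* detaching only drops our reference; the memory is not released
     * until the creating backend detaches as well */
    dsm_detach(seg);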

--
Alex


Re: [HACKERS] synchronous_commit = apply

2015-09-18 Thread Thomas Munro
On Fri, Sep 18, 2015 at 7:06 PM, Kyotaro HORIGUCHI
 wrote:
> Hello, I have some random comments.

Thanks for the feedback!  I have fixed several of the things that you
found in the attached new version -- see comments inline below.
However, I now know that Simon has a better patch in development to do
this, so I won't be developing this further.  (Until that work is
available, this patch is temporarily useful as a prerequisite for
something else that I'm working on so I'll still be using it...)

> At Wed, 16 Sep 2015 23:07:03 +1200, Thomas Munro 
>  wrote in 
> 

Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM

2015-09-18 Thread Jeff Janes
On Fri, Sep 18, 2015 at 6:27 AM, Tom Lane  wrote:

> [ moving thread to -hackers ]
>
> Fujii Masao  writes:
> > So autoanalyze still doesn't call IndexFreeSpaceMapVacuum().
> > That is, only backend can clean the list in INSERT-only workload.
> > I don't think that this is desirable. Because the backend will
> > periodically take a big performance penalty.
>

Calling IndexFreeSpaceMapVacuum is only needed to reuse the space freed up by
cleaning the pending list.

The list is still getting cleaned and truncated by autoanalyze; it is just
that the space is not getting marked for reuse.

When I wrote this, my thought was that a table vacuum is going to call
IndexFreeSpaceMapVacuum() from another place in its code and so should not
call it here as well.  I neglected to consider the autoanalyze case.



> > So I'm thinking that even autoanalyze should call
> IndexFreeSpaceMapVacuum()
> > to clean the list in a background, in order to avoid such spikes in
> > INSERT response time. Thought?
>

But it already is cleaning the list.  If that space is not marked as
reusable, that causes index bloat, but doesn't cause latency spikes.



> It seems quite bizarre for auto-analyze to do that.  auto-vacuum, sure,
> but I do not think this should get plugged into ANALYZE.
>

It may be odd that autoanalyze cleans the list at all (which this patch
doesn't change), but given that it does clean the list I don't see why it
would be bizarre to make the cleaned-up space re-usable.

Cheers,

Jeff


Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 12:56 PM, Robert Haas  wrote:
> On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila  wrote:
>> Okay, but I think the same can be achieved with this as well.  The basic
>> idea is that each worker will work on one planned statement at a time; in
>> the above case there will be two different planned statements, and they
>> will store partial seq scan related information in two different locations
>> in the toc, although the key (PARALLEL_KEY_SCAN) would be the same.  I
>> think this is quite similar to what we are already doing for response
>> queues.  The worker will work on one of those keys based on the planned
>> statement it chooses to execute.  I have explained this in somewhat more
>> detail in one of my previous mails [1].
>
> shm_toc keys are supposed to be unique.  If you added more than one
> with the same key, there would be no way to look up the second one.  That
> was intentional, and I don't want to revise it.
>
> I don't want to have multiple PlannedStmt objects in any case.  That
> doesn't seem like the right approach.  I think passing down an Append
> tree with multiple Partial Seq Scan children to be run in order is
> simple and clear, and I don't see why we would do it any other way.
> The master should be able to generate a plan and then copy the part of
> it below the Funnel and send it to the worker.  But there's clearly
> never more than one PlannedStmt in the master, so where would the
> other ones come from in the worker?  There's no reason to introduce
> that complexity.

Also, as KaiGai pointed out on the other thread, even if you DID pass
two PlannedStmt nodes to the worker, you still need to know which one
goes with which ParallelHeapScanDesc.  If both of the
ParallelHeapScanDesc nodes are stored under the same key, then you
can't do that.  That's why, as discussed in the other thread, we need
some way of uniquely identifying a plan node.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Daniel Verite
Pavel Stehule wrote:

> 2. Data columns are not well aligned - numbers are aligned as text

Thanks for spotting that, it's fixed in the attached new iteration of
the patch.

> 3. When the data are multiattribute - merging them together with a space
> separator is not practical
> 
>   * important information is lost
>   * the same transformation can be done as an expression, so this feature is
> useless

The primary use case is for queries with 3 output columns
(A,B,C) where A and B are dimensions and C is uniquely
determined by (A,B).

How columns 4 and above get displayed is not essential to the
feature, as it's somewhat of a degenerate case. As you note, it
could be avoided by the user limiting the query to 3 columns,
and providing whatever expression fits for the 3rd column.

Still if the query has > 3 columns, it has to be dealt with.
The choices I've considered:

a- Just error out.

b- Force the user to specify which single column should be taken
   as the value. That would be an additional argument to the command,
   or the fixed 3rd column in the invocation without arg.

c- Stack the values horizontally in the same cell with a separator.
   As the query implies f(A,B)=(C,D,E) we display C D E in the cell
   at coordinates (A,B). It's what it does currently.

[a] is not very user friendly.
[b] seems acceptable. It discards columns but the user decides which.
[c] is meant as a best effort at not discarding anything.

When [c] gives poor results, the next step from my point of view
would be for the user to rework the query, just like in the general case
when a query is not satisfying.

You're suggesting a [d] choice, subdividing the horizontal headers.
It seems to me like a pretty radical change, multiplying the number
of columns, and it has also the potential to give poor results visually.
Let's see if more feedback comes.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index f5c9552..0ed2e3d 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2445,6 +2445,96 @@ lo_import 152801
 
   
 
+  
+\rotate [ colV  [-]colH] 
+
+
+Execute the current query buffer (like \g) and show the results
+inside a crosstab grid.  The contents at
+output column colV are transformed into a
+vertical header and the contents at
+output column colH into a
+horizontal header. The results for the other output columns are projected inside the grid.
+
+
+
+colV
+and colH can indicate a
+column position (starting at 1), or a column name. Normal case folding
+and quoting rules apply on column names. By default,
+colV designates column 1
+and colH column 2.
+A query having less than two output columns cannot be rotated, and
+colH must differ from
+colV.
+
+
+
+The vertical header, displayed as the leftmost column in the output,
+contains the set of all distinct values found in
+column colV, in the order
+of their first appearance in the results.
+
+
+The horizontal header, displayed as the first row in the output,
+contains the set of all distinct non-null values found in
+column colH.  It is
+sorted in ascending order of values, unless a minus (-) sign
+precedes colH, in which
+case this order is reversed.
+
+
+
+The query results being tuples of N columns
+(including colV and
+colH),
+for each distinct value x of
+colH
+and each distinct value y of
+colV,
+a cell is output at the intersection (x,y) in the grid,
+and its contents are determined by these rules:
+
+
+
+ if there is no corresponding row in the results such that the value
+ for colH
+ is x and the value
+ for colV
+ is y, the cell is empty.
+
+
+
+
+
+ if there is exactly one row such that the value
+ for colH
+ is x and the value
+ for colV
+ is y, then the N-2 other
+ columns are displayed in the cell, separated between each other by
+ a space character if needed.
+
+ If N=2, the letter X is displayed in the cell as
+ if a virtual third column contained that character.
+
+
+
+
+
+ if there are several corresponding rows, the behavior is identical to one row
+ except that the values coming from different rows are stacked
+ vertically, rows being separated by newline characters inside
+ the same cell.
+
+
+
+
+
+
+
+  
+
 
   
 \s [ filename ]
diff --git 

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-18 Thread Vladimir Borodin

> On 18 Sep 2015, at 20:16, Robert Haas wrote:
> 
> On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin  wrote:
>> For both scenarios on linux we got approximately the same results - version
>> with timings was faster then version with sampling (sampling was done every
>> 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps and version
>> with timings gave ~14500 tps while version with sampling gave ~13800 tps. In
>> all cases processor was 100% utilized. Comparing vanilla PostgreSQL and
>> version with timings on constant workload (12000 tps) gave the following
>> results in latencies for queries:
> 
> If the timing is speeding things up, that's most likely a sign that
> the spinlock contention on that workload is so severe that you are
> spending a lot of time in s_lock.  Adding more things for the system
> to do that don't require that lock will speed the system up by
> reducing the contention.  Instead of inserting gettimeofday() calls,
> you could insert a for loop that counts to some large number without
> doing any useful work, and that would likely have a similar effect.
> 
> In any case, I think your experiment clearly proves that the presence
> or absence of this instrumentation *is* performance-relevant and that
> we *do* need to worry about what it costs. If the system gets 20%
> faster when you call gettimeofday() a lot, does that mean we should
> insert gettimeofday() calls all over the system in random places to
> speed it up?

No, you probably misunderstood the results; let me explain one more time. 
Unpatched PostgreSQL from REL9_4_STABLE gave 15500 tps. The version with 
timings gave 14500 tps, which is 6.5% worse. The version sampling wait events 
every 10 ms gave 13800 tps (11% worse than unpatched and 5% worse than with 
timings).

We also ran a test with a stable workload of 12000 tps for the unpatched 
version and the version with timings. In that test we saw that response times 
are a bit worse in the version with timings, as shown in the table below. Read 
the table as follows: 99% of all queries in the unpatched version fit in 79 ms, 
while in the version with timings 99% of all queries fit in 97 ms, which is 
18 ms slower, and so on. That test also showed that the version with timings 
consumes an extra 7% of CPU to handle the same workload as the unpatched 
version.

So this is the cost of waits monitoring with timings on an lwlock stress 
workload: 6.5% less throughput, slightly worse latencies, and an extra 7% of 
CPU. If you insert gettimeofday() calls all over the system in random places, 
you will, as expected, not speed up; you will get slower.

q'th   vanilla   timing
99%    79.0      97.0 (+18.0)
98%    64.0      76.0 (+12.0)
95%    38.0      47.0 (+9.0)
90%    16.0      21.0 (+5.0)
85%     7.0      11.0 (+4.0)
80%     5.0       7.0 (+2.0)
75%     4.0       5.0 (+1.0)
50%     2.0       3.0 (+1.0)

> 
> I do agree that if we're going to include support for timings, having
> them be controlled by a GUC is a good idea.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


--
May the force be with you…
https://simply.name



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-09-18 Thread Amit Kapila
On Fri, Sep 18, 2015 at 11:08 PM, Jesper Pedersen <
jesper.peder...@redhat.com> wrote:

> On 09/11/2015 10:31 AM, Amit Kapila wrote:
>
>> Updated comments and the patch (increate_clog_bufs_v2.patch)
>> containing the same is attached.
>>
>>
> I have done various runs on an Intel Xeon 28C/56T w/ 256Gb mem and 2 x
> RAID10 SSD (data + xlog) with Min(64,).
>
>
The benefit of this patch shows up at somewhat higher
client counts, as you can see in my initial mail. Can you please
try with a client count > 64?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing ClogControlLock contention

2015-09-18 Thread Amit Kapila
On Fri, Sep 18, 2015 at 11:31 PM, Jesper Pedersen <
jesper.peder...@redhat.com> wrote:

> On 08/31/2015 07:34 AM, Amit Kapila wrote:
>
>> I have updated the patch (attached with mail) to show
>> you what I have in mind.
>>
>>
> I haven't been able to get a successful run with _v5 using pgbench.
>
>
This patch is not yet ready for performance testing; as you might
have seen in the comments, we are still discussing locking
issues.


> TransactionIdSetStatusBit assumes an exclusive lock on CLogControlLock
> when called, but that part is removed from TransactionIdSetPageStatus now.
>

It doesn't seem to be a necessary condition; that's why this patch
introduces a smaller-granularity lock.


> I tried an
>
>   if (!LWLockHeldByMe(CLogControlLock))
>   {
>       LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
>       mode = LW_EXCLUSIVE;
>   }
>
> approach, but didn't get further.


I suspect the problem is something else.


> Plus that probably isn't the best way, since we will traverse all LWLock's,


Yes, but it does so in such a way that the caller will probably find it
within the first few entries of the array.


Thank you very much for doing valuable performance tests with the
CLog patches.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Pavel Stehule
> You're suggesting a [d] choice, subdividing the horizontal headers.
> It seems to me like a pretty radical change, multiplying the number
> of columns, and it has also the potential to give poor results visually.
> Let's see if more feedback comes.
>

Yes, I know; plan @d needs a lot of new code - and the described feature is
of the "nice to have" kind.

The prerequisite is enhancing the drawing system in psql to support
multiattribute (record) cells - which could be a nice feature generally.

Something like:

 id |      C1      |      C2      |
    +----+----+----+----+----+----+
    | A1 | A2 | A3 | A4 | A5 | A6 |
+===+====+====+====+====+====+====+
    |    |    |    |    |    |    |

Without this capability, plan @d is impossible.

Regards

Pavel


>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Haribabu Kommi
On Fri, Sep 18, 2015 at 9:45 PM, Amit Kapila  wrote:
> On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi 
> wrote:
>>
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila 
>> wrote:
>> >
>> > Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>>
>> Query        selectivity  HashAgg         HashAgg + parallel seq scan (ms)
>>              (million)    + seqscan (ms)  2 workers   4 workers   8 workers
>> $1 <= '001'  0.1          16717.00        7086.00     4459.00     2912.00
>> $1 <= '004'  0.4          17962.00        7410.00     4651.00     2977.00
>> $1 <= '008'  0.8          18870.00        7849.00     4868.00     3092.00
>> $1 <= '016'  1.5          21368.00        8645.00     6800.00     3486.00
>> $1 <= '030'  2.7          24622.00        14796.00    13108.00    9981.00
>> $1 <= '060'  5.4          31690.00        29839.00    26544.00    23814.00
>> $1 <= '080'  7.2          37147.00        40485.00    35763.00    32679.00
>>
>
> I think that when the selectivity is more than 5, it probably should
> not have selected the Funnel plan.  Have you by any chance changed
> cpu_tuple_comm_cost?  If not, you can try setting
> parallel_setup_cost (maybe 10) and then see if it selects the Funnel
> plan.  Is it possible for you to check the cost difference of the Sequence
> and Funnel plans? Hopefully explain or explain analyze should be sufficient.

Yes, I changed cpu_tuple_comm_cost to zero to observe how parallel seq scan
performs at high selectivity; I forgot to mention that in the earlier mail.
Overall, the parallel seq scan performance is good.


>> And also attached perf results for selectivity of 0.1 million and 5.4
>> million cases for analysis.
>>
>
> I have checked the perf reports, and it seems that when selectivity is
> higher, time is spent in some kernel calls, which could be due to the
> communication of tuples.

Yes. Also, at low selectivity, usage of the tas and s_lock functions
increases with the number of workers. Maybe that is also one of the
reasons for the scaling problem.

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Michael Paquier
On Wed, Sep 16, 2015 at 9:47 AM, Alexander Korotkov wrote:
> On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier wrote:
> OK, I get it. I tried to get rid of code duplication in the attached patch.
> There is getTimelineHistory() function which gets control file as argument
> and fetches history from appropriate path.

Cool, thanks!

> Imagine following configuration of server.
>   1
>  / \
> 2   3
> |
> 4
>
> Initially, node 1 is master, nodes 2 and 3 are standbys for node 1. node 4
> is cascaded standby for node 3.
> Then node 2 and node 3 are promoted. They both got timeline number 2. Then
> node 3 is promoted and gets timeline number 3.

Typo. You mean node 4 getting timeline 3 here.

> Then we could try to rewind node 4 with node 2 as source. How pg_rewind
> could know that timeline number 2 for those nodes is not the same?
> We can only check that those timelines are forked from timeline 1 at the
> same place. But coincidence is still possible.

OK, I see your point and you are right. This additional check allows
pg_rewind to switch one timeline back and make the scan of blocks
begin at the real origin of both timelines. I had in mind the case
where you tried to actually rewind node 2 to node 3, which
would not be possible with your patch, and thinking more about it,
I believe it could be possible to remove the check I am listing
below and rely on the checks in the history files, basically what is
in findCommonAncestorTimeline:
    if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
        ControlFile_source.checkPointCopy.ThisTimeLineID)
        pg_fatal("source and target cluster are on the same timeline\n");
Alexander, what do you think about that? I think that we should be
able to rewind with for example node 2 as target and node 3 as source,
and vice-versa as per the example you gave even if both nodes are on
the same timeline, just that they forked at a different point. Could
you test this particular case? Using my script, simply be sure to
switch archive_mode to on/off depending on the node, aka only 3 and 4
do archiving.

+   /*
+    * Since incomplete segments are copied into next timelines, find the
+    * lastest timeline holding required segment. Assuming we can move
+    * in xlog both forward and backward, consider also switching timeline
+    * back.
+    */
s/lastest/latest? "lastest" doesn't sound very English to me even if
that's grammatically correct.

+ * Find minimum from two xlog pointers assuming invalid pointer (0) means
+ * infinity as timeline.h states.
This needs an update: timeline.h now correctly points out that this is
not 0, but InvalidXLogRecPtr.

+   /* Retreive timelines for both source and target */
s/Retreive/Retrieve.

The refactoring of getTimelineHistory as you propose looks like a good
idea to me. I tried myself to remove the difference between source
and target in copy_fetch.c and friends, but it gets uglier,
particularly because of datadir_source in copy_file_range. Not worth
it.
-- 
Michael




Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Amit Kapila
On Thu, Sep 17, 2015 at 2:28 AM, Robert Haas  wrote:
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila 
wrote:
> +/*
> + * We expect each worker to populate the BufferUsage structure
> + * allocated by master backend and then master backend will aggregate
> + * all the usage along with it's own, so account it for each worker.
> + */
>
> This also needs improvement.  Especially because...
>
> +/*
> + * We expect each worker to populate the instrumentation structure
> + * allocated by master backend and then master backend will aggregate
> + * all the information, so account it for each worker.
> + */
>
> ...it's almost identical to this one.
>

I think we can combine them and have one comment.


>
> +GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
>
> Could this be a static function?  Will it really be needed outside this
file?
>

It is already declared static, but I will add static to the function
definition as well.

> And is there any use case for letting some of the arguments be NULL?

In earlier versions of the patch this API was used from other places, but
there is no such use now, so I will change it accordingly.

>
> +bool
> +ExecParallelBufferUsageAccum(Node *node)
> +{
> +if (node == NULL)
> +return false;
> +
> +switch (nodeTag(node))
> +{
> +case T_FunnelState:
> +{
> +FinishParallelSetupAndAccumStats((FunnelState*)node);
> +return true;
> +}
> +break;
> +default:
> +break;
> +}
> +
> +(void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree,
NULL,
> + ExecParallelBufferUsageAccum, 0);
> +(void) planstate_tree_walker((Node*)((PlanState *)node)->righttree,
NULL,
> + ExecParallelBufferUsageAccum, 0);
> +return false;
> +}
>
> This seems wacky.  I mean, isn't the point of planstate_tree_walker()
> that the callback itself doesn't have to handle recursion like this?
> And if not, then this wouldn't be adequate anyway, because some
> planstate nodes have children that are not in lefttree or righttree
> (cf. explain.c).
>

Will change it according to the recent commit for planstate_tree_walker.

> +currentRelation = ExecOpenScanRelation(estate,
> +   ((SeqScan *)
> node->ss.ps.plan)->scanrelid,
> +   eflags);
>
> I can't see how this can possibly be remotely correct.  The funnel
> node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).
>

This is mainly used to generate the tuple descriptor, and that tuple
descriptor will be used to form the scan slot; the funnel node itself won't
do any scan. However, we can completely eliminate the InitFunnel()
function and use ExecAssignProjectionInfo() instead of
ExecAssignScanProjectionInfo() to form the projection info.


>
> +buffer_usage_worker = (BufferUsage *)(buffer_usage + (i *
> sizeof(BufferUsage)));
>
> Cast it to a BufferUsage * first.  Then you can use [i] to find
> the i'th element.
>

Do you mean that the code as written won't work?
The BufferUsage values for each worker are copied into the byte buffer
buffer_usage, which I believe can be fetched the above way.
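
For reference, a small sketch of the two equivalent addressing styles
under discussion, assuming buffer_usage is a char * pointing at an array
of BufferUsage structs:

    BufferUsage *bu;

    /* style currently in the patch: byte arithmetic, then cast */
    bu = (BufferUsage *) (buffer_usage + i * sizeof(BufferUsage));

    /* style Robert suggests: cast once, then index */
    bu = &((BufferUsage *) buffer_usage)[i];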

> +/*
> + * Re-initialize the parallel context and workers to perform
> + * rescan of relation.  We want to gracefully shutdown all the
> + * workers so that they should be able to propagate any error
> + * or other information to master backend before dying.
> + */
> +FinishParallelSetupAndAccumStats(node);
>
> Somehow, this makes me feel like that function is badly named.
>

I think the comment here is slightly misleading; shall we
change it as below?

Destroy the parallel context to gracefully shutdown all the
workers so that they should be able to propagate any error
or other information to master backend before dying.

> +/*
> + * _readPlanInvalItem
> + */
> +static PlanInvalItem *
> +_readPlanInvalItem(void)
> +{
> +READ_LOCALS(PlanInvalItem);
> +
> +READ_INT_FIELD(cacheId);
> +READ_UINT_FIELD(hashValue);
> +
> +READ_DONE();
> +}
>
> I don't see why we should need to be able to copy PlanInvalItems.  In
> fact, it seems like a bad idea.
>

We are not copying PlanInvalItems, so I don't think this is required.
I don't remember exactly why I added it in the first place;
one reason could be that in earlier versions PlanInvalItems might
have been copied.  Anyway, I will verify it once more and if not
required, I will remove it.

> +#parallel_setup_cost = 0.0  # same scale as above
> +#define DEFAULT_PARALLEL_SETUP_COST  0.0
>
> This value is probably a bit on the low side.
>

How about keeping it as 10.0?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Amit Kapila
On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi 
wrote:

> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila 
> wrote:
> >
> > Attached, find the rebased version of patch.
>
> Here are the performance test results:
>
> Query        selectivity  HashAgg         HashAgg + parallel seq scan (ms)
>              (million)    + seqscan (ms)  2 workers   4 workers   8 workers
> $1 <= '001'  0.1          16717.00        7086.00     4459.00     2912.00
> $1 <= '004'  0.4          17962.00        7410.00     4651.00     2977.00
> $1 <= '008'  0.8          18870.00        7849.00     4868.00     3092.00
> $1 <= '016'  1.5          21368.00        8645.00     6800.00     3486.00
> $1 <= '030'  2.7          24622.00        14796.00    13108.00    9981.00
> $1 <= '060'  5.4          31690.00        29839.00    26544.00    23814.00
> $1 <= '080'  7.2          37147.00        40485.00    35763.00    32679.00
>
>
I think that when the selectivity is more than 5, it probably should
not have selected the Funnel plan.  Have you by any chance changed
cpu_tuple_comm_cost?  If not, you can try setting
parallel_setup_cost (maybe 10) and then see if it selects the Funnel
plan.  Is it possible for you to check the cost difference of the Sequence
and Funnel plans? Hopefully explain or explain analyze should be sufficient.


>
> And also attached perf results for selectivity of 0.1 million and 5.4
> million cases for analysis.
>
>
I have checked the perf reports, and it seems that when selectivity is
higher, time is spent in some kernel calls, which could be due to the
communication of tuples.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Daniel Verite
Pavel Stehule wrote:


> my comments:
> 
> 1. I don't understand why you are use two methods for sorting columns
> (qsort, and query with ORDER BY)

qsort (with strcmp as the comparator) is only used to determine the
set of distinct values for the vertical and horizontal headers.
In fact this is just to allow the use of bsearch() when doing that
instead of a slower sequential search.

Once the values for the horizontal headers are known, they
are passed to the server for sorting with server-side semantics
according to their type. The order will differ from what qsort/strcmp
produced, as the comparison semantics are different.

The values for the vertical header are not sorted server-side
because we keep the order of the query for displaying
top to bottom.
In the typical use case, it will have ORDER BY 1[,2], possibly
with one or two DESC qualifiers.

> 2. Data columns are not well aligned - numbers are aligned as text

Yes. I'll look shortly into fixing that.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Shulgin, Oleksandr
On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule 
wrote:

> 2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <
> oleksandr.shul...@zalando.de>:
>
>> On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule 
>> wrote:
>>
>>>
>>> That should not be true - the data sender creates the DSM and fills it.
>>> Then it sets the caller's slot and sends a signal to the caller. The
>>> caller can free the DSM at any time, because the data sender will never
>>> touch it again.
>>>
>>
>> But the requester can time out waiting for the reply and exit before it
>> sees the reply DSM.  Actually, I now don't even think a backend can free
>> a DSM it has not created.  First it will need to attach it, effectively
>> increasing the refcount, and upon detach it will only decrease the
>> refcount, but not actually release the segment...
>>
>
> I am afraid there is no simple and nice solution - if the data sender
> waits for the moment when the data are received, then we have the same
> complexity as if we used shm_mq.
>
> Isn't it better to introduce a new background worker responsible for
> cleaning up orphaned DSM segments?
>

I'm not thrilled by this idea.

We don't even need a worker to do that, as leaked segments are reported by
the backend itself upon transaction commit (see
ResourceOwnerReleaseInternal), e.g.:

WARNING:  dynamic shared memory leak: segment 808539725 still referenced

Still I believe relying on some magic cleanup mechanism and not caring
about managing the shared memory properly is not the way to go.

--
Alex


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Daniel Verite
Pavel Stehule wrote:

> in the help inside your last patch, you are using "crosstab". Cannto be
> crosstab the name for this feature?

If it weren't already taken by contrib/tablefunc, that would be the first
choice. But as it is, searching for crosstab+postgresql turns up pages of
results concerning the crosstab() function.

So not using \crosstab is deliberate; it's to prevent confusion with
the server-side function.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Amit Kapila
On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas  wrote:
> On Thu, Sep 17, 2015 at 2:54 AM, Amit Kapila 
wrote:
>
>
> > Here the subplan is generated before the top-level plan, and while
> > generating the subplan we can't predict whether it is okay to generate
> > it as a Funnel or not, because the top-level plan might be non-Funnel.
> > Also, if such a subplan is actually an InitPlan, then we are safe (as we
> > execute the InitPlans in the master backend and then pass the result to
> > the parallel worker) even if the top-level plan is a Funnel.  I think
> > the place where we can catch this is during the generation of the Funnel
> > path: basically, we can check whether any node beneath the Funnel node
> > has another Funnel node in its 'filter' or 'targetlist', and then we
> > have two options to proceed:
> > a. Mark such a filter or target list as non-pushable, which will
> > indicate that it needs to be executed only in the master backend.  If we
> > go with this option, then we have to make the Funnel node capable of
> > evaluating a Filter and Targetlist, which is not a big thing.
> > b. Don't choose the current path as a Funnel path.
> >
> > I prefer the second one, as it seems simpler than the first, and there
> > doesn't seem to be much benefit in going with the first.
> >
> > Any better ideas?
>
> I haven't studied the planner logic in enough detail yet to have a
> clear opinion on this.  But what I do think is that this is a very
> good reason why we should bite the bullet and add outfuncs/readfuncs
> support for all Plan nodes.  Otherwise, we're going to have to scan
> subplans for nodes we're not expecting to see there, which seems
> silly.  We eventually want to allow all of those nodes in the worker
> anyway.
>

makes sense to me.  There are 39 plan nodes and it seems we have
support for all of them in outfuncs, and we need to add support for most
of them in readfuncs.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Pavel Stehule
2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr 
:

> On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule 
> wrote:
>
>> 2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <
>> oleksandr.shul...@zalando.de>:
>>
>>>
>>> If we take the per-backend slot approach the locking seems unnecessary
>>> and there are principally two options:
>>>
>>> 1) The backend puts the DSM handle in its own slot and notifies the
>>> requester to read it.
>>> 2) The backend puts the DSM handle in the slot of the requester (and
>>> notifies it).
>>>
>>> If we go with the first option, the backend that has created the DSM
>>> will not know when it's OK to free it, so this has to be responsibility of
>>> the requester.  If the latter exits before reading and freeing the DSM, we
>>> have a leak.  Even bigger is the problem that the sender backend can no
>>> longer send responses to a number of concurrent requestors: if its slot is
>>> occupied by a DSM handle, it can not send a reply to another backend until
>>> the slot is freed.
>>>
>>> With the second option we have all the same problems with not knowing
>>> when to free the DSM and potentially leaking it, but we can handle
>>> concurrent requests.
>>>
>>
>> That should not be true - the data sender creates the DSM and fills it. Then
>> it sets the caller's slot and sends a signal to the caller. The caller can
>> free the DSM at any time, because the data sender will never touch it again.
>>
>
> But the requester can time out waiting for the reply and exit before it sees
> the reply DSM.  Actually, I now don't even think a backend can free the DSM
> it has not created.  First it will need to attach it, effectively
> increasing the refcount, and upon detach it will only decrease the
> refcount, but not actually release the segment...
>


I am afraid there is no simple and nice solution - if the data sender
waits for the moment when the data are received, then we have the same
complexity as when we use shm_mq.

Isn't it better to introduce a new background worker with responsibility
for cleaning up orphaned DSM segments?

Regards

Pavel


>
> So this has to be the responsibility of the reply sending backend in the
> end: to create and release the DSM *at some point*.
>
> --
> Alex
>
>


Re: [HACKERS] Freeze avoidance of very large table.

2015-09-18 Thread Masahiko Sawada
On Fri, Sep 18, 2015 at 6:13 PM, Fujii Masao  wrote:
> On Fri, Sep 4, 2015 at 2:55 PM, Masahiko Sawada  wrote:
>> On Fri, Sep 4, 2015 at 10:35 AM, Fujii Masao  wrote:
>>> On Fri, Sep 4, 2015 at 2:23 AM, Masahiko Sawada  wrote:
>>>> On Thu, Aug 27, 2015 at 1:54 AM, Masahiko Sawada  wrote:
> On Thu, Aug 20, 2015 at 11:46 PM, Alvaro Herrera
>  wrote:
>> Jim Nasby wrote:
>>
>>> I think things like pageinspect are very different; I really can't see 
>>> any
>>> use for those beyond debugging (and debugging by an expert at that).
>>
>> I don't think that necessarily means it must continue to be in contrib.
>> Quite the contrary, I think it is a tool critical enough that it should
>> not be relegated to be a second-class citizen as it is now (let's face
>> it, being in contrib *is* second-class citizenship).
>>
>
> Attached patch is latest patch.

 The previous patch lacks some files for regression test.
 Attached fixed v12 patch.
>>>
>>> The patch could be applied cleanly. "make check" could pass successfully.
>>> But "make check-world -j 2" failed.
>>>
>>
>> Thank you for looking at this patch.
>> Could you tell me what test you got failed?
>> make check-world -j 2 or more is done successfully in my environment.
>
> I tried to do the test again, but initdb failed with the following error.
>
> creating template1 database in data/base/1 ... FATAL:  invalid
> input syntax for type oid: "f"
>
> This error didn't happen when I tested before. So the commit which was
> applied recently might interfere with the patch.
>

Thank you for testing!
Attached is a fixed version of the patch.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v13.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Pavel Stehule
2015-09-18 13:22 GMT+02:00 Shulgin, Oleksandr 
:

> On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule 
> wrote:
>
>> 2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <
>> oleksandr.shul...@zalando.de>:
>>
>>> On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule  wrote:
>>>
>>>> That should not be true - the data sender creates the DSM and fills it.
>>>> Then it sets the caller's slot and sends a signal to the caller. The
>>>> caller can free the DSM at any time, because the data sender will never
>>>> touch it again.
>>>
>>> But the requester can time out waiting for the reply and exit before it
>>> sees the reply DSM.  Actually, I now don't even think a backend can free
>>> the DSM it has not created.  First it will need to attach it, effectively
>>> increasing the refcount, and upon detach it will only decrease the
>>> refcount, but not actually release the segment...
>>>
>>
>> I am afraid there is no simple and nice solution - if the data sender
>> waits for the moment when the data are received, then we have the same
>> complexity as when we use shm_mq.
>>
>> Isn't it better to introduce a new background worker with responsibility
>> for cleaning up orphaned DSM segments?
>>
>
> I'm not thrilled by this idea.
>
> We don't even need a worker to do that, as leaked segments are reported by
> the backend itself upon transaction commit (see
> ResourceOwnerReleaseInternal), e.g:
>
> WARNING:  dynamic shared memory leak: segment 808539725 still referenced
>
> Still I believe relying on some magic cleanup mechanism and not caring
> about managing the shared memory properly is not the way to go.
>

It was one of my ideas too, to check shared memory on exit. The disadvantage
is clear - sometimes you can wait too long - although the practical
limit for a session is about 2h.




>
> --
> Alex
>
>


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Pavel Stehule
2015-09-18 13:36 GMT+02:00 Daniel Verite :

> Pavel Stehule wrote:
>
> > in the help inside your last patch, you are using "crosstab". Cannot
> > "crosstab" be the name for this feature?
>
> If it wasn't taken already by contrib/tablefunc, that would be a first
> choice. But now, when searching for crosstab+postgresql, pages of
> results come out concerning the crosstab() function.
>
> So not using \crosstab is deliberate; it's to prevent confusion with
> the server-side function.
>

I am not afraid of this - crosstab is a function in an extension, and psql
backslash commands live in a different world. When we introduce the new
command, Google will adapt to it.

For this use case the correct keywords are "crosstab psql postgres"

Regards

Pavel



>
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Robert Haas
On Thu, Sep 17, 2015 at 12:47 PM, Shulgin, Oleksandr
 wrote:
> But if we make the sender backend create the DSM with the response payload,
> it suddenly becomes really unclear at which point and who should make the
> final detach of that DSM.  We're getting back to the problem of
> synchronization between these processes all over again.

Yes, some thought is needed here.  There's not really a problem if
somebody asks for information just once: you could just have the
publishing process keep it attached until process exit, and nothing
bad would happen.  The real problem comes when multiple people ask for
information in quick succession: if you detach the old segment too
quickly after publishing it, the guy who requested that data might not
have attached it yet, and then when he gets around to looking at it,
it won't be there.

This seems like a pretty solvable problem, though.  For example, when
somebody asks for information, they store in the main shared segment a
pointer to their PGPROC and then they signal the target process, which
responds by publishing a dsm_segment_id.  When the requesting process
has accessed it, or when it errors out or exits, it clears the PGPROC
and the dsm_segment_id.  The publisher avoids unmapping it until that
happens.
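To make that concrete, a rough sketch of the handshake (the slot structure
and the function are hypothetical, not an existing PostgreSQL API):

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/proc.h"

/* Hypothetical per-backend slot in the main shared memory segment. */
typedef struct
{
	PGPROC	   *requester;	/* set by the asking backend; NULL when free */
	dsm_handle	response;	/* set by the target once the data is ready */
} QueryInfoSlot;

/*
 * Target side (sketch): publish the reply, then keep the segment mapped
 * until the requester clears the slot; only then is dsm_detach() safe,
 * because by that point the requester has attached it, errored out, or
 * exited.
 */
static void
publish_reply(QueryInfoSlot *slot, dsm_segment *seg)
{
	slot->response = dsm_segment_handle(seg);
	/* wake the requester here, e.g. by setting its latch */
}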

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-09-18 Thread Pavel Stehule
2015-09-18 13:59 GMT+02:00 Daniel Verite :

> Pavel Stehule wrote:
>
>
> > my comments:
> >
> > 1. I don't understand why you use two methods for sorting columns
> > (qsort, and query with ORDER BY)
>
> qsort (with strcmp as the comparator) is only used to determine the
> set of distinct values for the vertical and horizontal headers.
> In fact this is just to allow the use of bsearch() when doing that
> instead of a slower sequential search.
>
> Once the values for the horizontal headers are known, they
> are passed to the server for sorting with server-side semantics
> according to their type. The order will differ from the one
> qsort/strcmp found, as the comparison semantics are different.
>
> The values for the vertical header are not sorted server-side
> because we keep the order of the query for displaying
> top to bottom.
> In the typical use case, the query will have ORDER BY 1[,2], possibly
> with one or two DESC qualifiers.
>

ok
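(For context, a minimal sketch of the pattern Daniel describes - sort the
distinct values once with qsort(), then test membership with bsearch();
the names are illustrative, not the patch's actual code:)

#include <stdlib.h>
#include <string.h>

static int
cmp_str(const void *a, const void *b)
{
	return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sort the collected distinct header values once... */
static void
headers_sort(char **headers, size_t nheaders)
{
	qsort(headers, nheaders, sizeof(char *), cmp_str);
}

/* ...so each membership test is O(log n) instead of a sequential scan. */
static int
header_exists(char **headers, size_t nheaders, const char *value)
{
	return bsearch(&value, headers, nheaders, sizeof(char *),
				   cmp_str) != NULL;
}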


>
> > 2. Data columns are not well aligned - numbers are aligned as text
>
> Yes. I'll look shortly into fixing that.
>
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] synchronous_commit = apply

2015-09-18 Thread Kyotaro HORIGUCHI
Hello, I have some random comments.

At Wed, 16 Sep 2015 23:07:03 +1200, Thomas Munro 
 wrote in 

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-18 Thread Vladimir Borodin

> On 16 Sep 2015, at 20:52, Robert Haas wrote:
> 
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
>  wrote:
>> Yes, the major question is cost. But I think we should validate our thoughts
>> by experiments assuming there are more possible synchronization protocols.
>> Ildus posted implemention of double buffering approach that showed quite low
>> cost.
> 
> I'm not sure exactly which email you are referring to, but I don't
> believe that anyone has done experiments that are anywhere near
> comprehensive enough to convince ourselves that this won't be a
> problem.  If a particular benchmark doesn't show an issue, that can
> just mean that the benchmark isn't hitting the case where there is a
> problem.  For example, EDB has had customers who have severe
> contention apparently on the buffer content lwlocks, resulting in big
> slowdowns.  You don't see that in, say, a pgbench run.  But for people
> who have certain kinds of queries, it's really bad.  Those sort of
> loads, where the lwlock system really gets stressed, are cases where
> adding overhead seems likely to pinch.

Alexander and Ildus gave us two patches for REL9_4_STABLE - one with sampling
every N milliseconds and one with measuring timings. We have tested both of
them on two kinds of our workload besides pgbench runs. One workload is OLTP
where all the data fits in shared buffers, synchronous_commit is off, and the
bottleneck is ProcArrayLock (all backtraces look something like the following):

[Thread debugging using libthread_db enabled]
0x7f10e01d4f27 in semop () from /lib64/libc.so.6
#0  0x7f10e01d4f27 in semop () from /lib64/libc.so.6
#1  0x0061fe27 in PGSemaphoreLock (sema=0x7f11f2d4f430, interruptOK=0 '\000') at pg_sema.c:421
#2  0x006769ba in LWLockAcquireCommon (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:626
#3  LWLockAcquire (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:467
#4  0x00667862 in ProcArrayEndTransaction (proc=0x7f11f2d4f420, latestXid=182562881) at procarray.c:404
#5  0x004b579b in CommitTransaction () at xact.c:1957
#6  0x004b6ae5 in CommitTransactionCommand () at xact.c:2727
#7  0x006819d9 in finish_xact_command () at postgres.c:2437
#8  0x00684f05 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x21e1a70 "xivadb", username=<optimized out>) at postgres.c:4270
#9  0x00632d7d in BackendRun (argc=<optimized out>, argv=<optimized out>) at postmaster.c:4155
#10 BackendStartup (argc=<optimized out>, argv=<optimized out>) at postmaster.c:3829
#11 ServerLoop (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1597
#12 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1244
#13 0x005cadb8 in main (argc=3, argv=0x21e0aa0) at main.c:228


Another is when all the data fits in RAM but does not fit in shared buffers, and
the bottleneck is mostly BufFreelistLock, with noticeable contention around
buffer partition locking and buffer locking. Taking several backtraces with GDB of
several backends and doing some bash magic gives the following:

root@pgload01g ~ # grep '^#4 ' /tmp/bt | awk '{print $2, $4, $NF}' | sort | uniq -c | sort -rn
126 0x0065db61 BufferAlloc bufmgr.c:591
 67 0x0065e03a BufferAlloc bufmgr.c:760
 43 0x005c8c3b pq_getbyte pqcomm.c:899
 39 0x0065dd93 BufferAlloc bufmgr.c:765
  6 0x004b52bb RecordTransactionCommit xact.c:1194
  4 0x0065da0e ReadBuffer_common bufmgr.c:476
  1 ReadBuffer_common relpersistence=112 bufmgr.c:340
  1 exec_eval_expr expr=0x166e908, pl_exec.c:4796
  1 0x7f78b8cb217b ?? /usr/pgsql-9.4/lib/pg_stat_statements.so
  1 0x005d4cbb _copyList copyfuncs.c:3849
root@pgload01g ~ #

For both scenarios on linux we got approximately the same results - the version
with timings was faster than the version with sampling (sampling was done every
10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps, the version with
timings gave ~14500 tps, and the version with sampling gave ~13800 tps. In all
cases the processor was 100% utilized. Comparing vanilla PostgreSQL and the
version with timings on a constant workload (12000 tps) gave the following
results in query latencies:

q'th    vanilla    timing
99%     79.0       97.0 (+18.0)
98%     64.0       76.0 (+12.0)
95%     38.0       47.0 (+9.0)
90%     16.0       21.0 (+5.0)
85%     7.0        11.0 (+4.0)
80%     5.0        7.0 (+2.0)
75%     4.0        5.0 (+1.0)
50%     2.0        3.0 (+1.0)

And in that test the version with timings consumed about 7% more CPU. Does
that look like a result from a workload where the lwlock system is stressed?

And when the data does not fit in RAM you really don’t see much difference
between all three versions, because the contention moves from the lwlock system
to I/O, even with the newest NVMe SSDs, or at least is divided between lwlocks
and other wait events.

> 
>> Yes, but some competing products also provides 

Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Joe Conway
On 09/18/2015 09:25 AM, Adam Brightwell wrote:
>>> 1. remove row_security=force
>>> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to 
>>> policies
>>> 3. add DDL-controlled, per-table policy forcing
>>>
>>> They ought to land in that order.  PostgreSQL 9.5 would need at least (1) 
>>> and
>>> (2); would RLS experts find it beneficial for me to take care of those?
>>
>> That would be awesome, but I would say that if we do #1 & 2 for 9.5, we
>> also need #3.
> 
> Agreed.  Please let me know if there is anything I can do to help.


Yes, same here.

Joe


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM

2015-09-18 Thread Teodor Sigaev

Reply to both emails in one
Fujii:
> So autoanalyze still doesn't call IndexFreeSpaceMapVacuum().

Fixed, see patch

> ginvacuumcleanup calls RecordFreeIndexPage() twice for the same block.

fixed too

Tom:

> It seems quite bizarre for auto-analyze to do that.  auto-vacuum, sure,
> but I do not think this should get plugged into ANALYZE.


I agree, but we want to keep the pending list as short as possible.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 76bebad..7be9362 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -434,7 +434,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
END_CRIT_SECTION();
 
if (needCleanup)
-   ginInsertCleanup(ginstate, false, NULL);
+   ginInsertCleanup(ginstate, false, true, NULL);
 }
 
 /*
@@ -505,7 +505,7 @@ ginHeapTupleFastCollect(GinState *ginstate,
  */
 static bool
 shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
- IndexBulkDeleteResult *stats)
+ bool fill_fsm, IndexBulkDeleteResult *stats)
 {
Pagemetapage;
GinMetaPageData *metadata;
@@ -732,13 +732,19 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
  * action of removing a page from the pending list really needs exclusive
  * lock.
  *
- * vac_delay indicates that ginInsertCleanup is called from vacuum process,
- * so call vacuum_delay_point() periodically.
+ * vac_delay indicates that ginInsertCleanup should call
+ * vacuum_delay_point() periodically.
+ *
+ * fill_fsm indicates that ginInsertCleanup should add deleted pages
+ * to FSM; otherwise the caller is responsible for putting deleted pages into
+ * FSM.
+ *
  * If stats isn't null, we count deleted pending pages into the counts.
  */
 void
 ginInsertCleanup(GinState *ginstate,
-bool vac_delay, IndexBulkDeleteResult *stats)
+bool vac_delay, bool fill_fsm,
+IndexBulkDeleteResult *stats)
 {
Relationindex = ginstate->index;
Buffer  metabuffer,
@@ -899,7 +905,7 @@ ginInsertCleanup(GinState *ginstate,
 * remove read pages from pending list, at this point all
 * content of read pages is in regular structure
 */
-   if (shiftList(index, metabuffer, blkno, stats))
+   if (shiftList(index, metabuffer, blkno, fill_fsm, stats))
{
/* another cleanup process is running concurrently */
LockBuffer(metabuffer, GIN_UNLOCK);
@@ -948,7 +954,7 @@ ginInsertCleanup(GinState *ginstate,
 * desirable to recycle them immediately to the FreeSpace Map when
 * ordinary backends clean the list.
 */
-   if (fsm_vac && !vac_delay)
+   if (fsm_vac && fill_fsm)
IndexFreeSpaceMapVacuum(index);
 
 
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 1315762..323cfb0 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -544,7 +544,7 @@ ginbulkdelete(PG_FUNCTION_ARGS)
/* Yes, so initialize stats to zeroes */
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
/* and cleanup any pending inserts */
-   ginInsertCleanup(&gvs.ginstate, true, stats);
+   ginInsertCleanup(&gvs.ginstate, true, false, stats);
}
 
/* we'll re-count the tuples each time */
@@ -659,7 +659,7 @@ ginvacuumcleanup(PG_FUNCTION_ARGS)
if (IsAutoVacuumWorkerProcess())
{
initGinState(&ginstate, index);
-   ginInsertCleanup(&ginstate, true, stats);
+   ginInsertCleanup(&ginstate, true, true, stats);
}
PG_RETURN_POINTER(stats);
}
@@ -672,7 +672,7 @@ ginvacuumcleanup(PG_FUNCTION_ARGS)
{
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
initGinState(&ginstate, index);
-   ginInsertCleanup(&ginstate, true, stats);
+   ginInsertCleanup(&ginstate, true, false, stats);
}
 
memset(&idxStat, 0, sizeof(idxStat));
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index acbe36a..5021887 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -936,7 +936,7 @@ extern void ginHeapTupleFastCollect(GinState *ginstate,
OffsetNumber attnum, Datum value, bool isNull,
ItemPointer ht_ctid);
 extern void 

Re: [HACKERS] tsvector work with citext

2015-09-18 Thread Teodor Sigaev



> Fixable?


Fixed (9acb9007de30b3daaa9efc16763c3bc6e3e0a92d), but didn't backpatch because 
it isn't a critical bug.


Thank you for the report!

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Adam Brightwell
>> 1. remove row_security=force
>> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies
>> 3. add DDL-controlled, per-table policy forcing
>>
>> They ought to land in that order.  PostgreSQL 9.5 would need at least (1) and
>> (2); would RLS experts find it beneficial for me to take care of those?
>
> That would be awesome, but I would say that if we do #1 & 2 for 9.5, we
> also need #3.

Agreed.  Please let me know if there is anything I can do to help.

-Adam


-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.

2015-09-18 Thread Tom Lane
Jan Wieck  writes:
> Attached is a complete rework of the fix from scratch, based on Tom's 
> suggestion.

> The code now maintains a double linked list as suggested, but only uses 
> it to mark all currently valid entries as invalid when hashvalue == 0. 
> If a specific entry is invalidated it performs a hash lookup for maximum 
> efficiency in that case.

That does not work; you're ignoring the possibility of hashvalue
collisions, and for that matter not considering that the hash value
is not equal to the hash key.  I don't think your code is invalidating
the right entry at all during a foreign key constraint update, and
it certainly cannot do the right thing if there's a hash value collision.

Attached is something closer to what I was envisioning; can you do
performance testing on it?

regards, tom lane

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 61edde9..db1787a 100644
*** a/src/backend/utils/adt/ri_triggers.c
--- b/src/backend/utils/adt/ri_triggers.c
***
*** 40,45 
--- 40,46 
  #include "commands/trigger.h"
  #include "executor/executor.h"
  #include "executor/spi.h"
+ #include "lib/ilist.h"
  #include "parser/parse_coerce.h"
  #include "parser/parse_relation.h"
  #include "miscadmin.h"
*** typedef struct RI_ConstraintInfo
*** 125,130 
--- 126,132 
   * PK) */
  	Oid			ff_eq_oprs[RI_MAX_NUMKEYS];		/* equality operators (FK =
   * FK) */
+ 	dlist_node	valid_link;		/* Link in list of valid entries */
  } RI_ConstraintInfo;
  
  
*** typedef struct RI_CompareHashEntry
*** 185,190 
--- 187,194 
  static HTAB *ri_constraint_cache = NULL;
  static HTAB *ri_query_cache = NULL;
  static HTAB *ri_compare_cache = NULL;
+ static dlist_head ri_constraint_cache_valid_list;
+ static int	ri_constraint_cache_valid_count = 0;
  
  
  /* --
*** ri_LoadConstraintInfo(Oid constraintOid)
*** 2924,2929 
--- 2928,2940 
  
  	ReleaseSysCache(tup);
  
+ 	/*
+ 	 * For efficient processing of invalidation messages below, we keep a
+ 	 * doubly-linked list, and a count, of all currently valid entries.
+ 	 */
+ 	dlist_push_tail(&ri_constraint_cache_valid_list, &riinfo->valid_link);
+ 	ri_constraint_cache_valid_count++;
+ 
  	riinfo->valid = true;
  
  	return riinfo;
*** ri_LoadConstraintInfo(Oid constraintOid)
*** 2936,2956 
   * gets enough update traffic that it's probably worth being smarter.
   * Invalidate any ri_constraint_cache entry associated with the syscache
   * entry with the specified hash value, or all entries if hashvalue == 0.
   */
  static void
  InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
  {
! 	HASH_SEQ_STATUS status;
! 	RI_ConstraintInfo *hentry;
  
  	Assert(ri_constraint_cache != NULL);
  
! 	hash_seq_init(&status, ri_constraint_cache);
! 	while ((hentry = (RI_ConstraintInfo *) hash_seq_search(&status)) != NULL)
  	{
! 		if (hentry->valid &&
! 			(hashvalue == 0 || hentry->oidHashValue == hashvalue))
! 			hentry->valid = false;
  	}
  }
  
--- 2947,2987 
   * gets enough update traffic that it's probably worth being smarter.
   * Invalidate any ri_constraint_cache entry associated with the syscache
   * entry with the specified hash value, or all entries if hashvalue == 0.
+  *
+  * Note: at the time a cache invalidation message is processed there may be
+  * active references to the cache.  Because of this we never remove entries
+  * from the cache, but only mark them invalid, which is harmless to active
+  * uses.  (Any query using an entry should hold a lock sufficient to keep that
+  * data from changing under it --- but we may get cache flushes anyway.)
   */
  static void
  InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
  {
! 	dlist_mutable_iter iter;
  
  	Assert(ri_constraint_cache != NULL);
  
! 	/*
! 	 * If the list of currently valid entries gets excessively large, we mark
! 	 * them all invalid so we can empty the list.  This arrangement avoids
! 	 * O(N^2) behavior in situations where a session touches many foreign keys
! 	 * and also does many ALTER TABLEs, such as a restore from pg_dump.
! 	 */
! 	if (ri_constraint_cache_valid_count > 1000)
! 		hashvalue = 0;			/* pretend it's a cache reset */
! 
! 	dlist_foreach_modify(iter, &ri_constraint_cache_valid_list)
  	{
! 		RI_ConstraintInfo *riinfo = dlist_container(RI_ConstraintInfo,
! 	valid_link, iter.cur);
! 
! 		if (hashvalue == 0 || riinfo->oidHashValue == hashvalue)
! 		{
! 			riinfo->valid = false;
! 			/* Remove invalidated entries from the list, too */
! 			dlist_delete(iter.cur);
! 			ri_constraint_cache_valid_count--;
! 		}
  	}
  }
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Adam Brightwell
> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies

I believe this one has already been addressed by Stephen
(20150910192313.gt3...@tamriel.snowman.net)?  Are there further
considerations for his proposed changes?

-Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Joe Conway
On 09/18/2015 01:07 AM, Noah Misch wrote:
> Great.  Robert, does that work for you, too?  If so, this sub-thread is
> looking at three patches:
> 
> 1. remove row_security=force
> 2. remove SECURITY_ROW_LEVEL_DISABLED; make ri_triggers.c subject to policies
> 3. add DDL-controlled, per-table policy forcing
> 
> They ought to land in that order.  PostgreSQL 9.5 would need at least (1) and
> (2); would RLS experts find it beneficial for me to take care of those?

That would be awesome, but I would say that if we do #1 & 2 for 9.5, we
also need #3.

Joe


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM

2015-09-18 Thread Teodor Sigaev



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add pages deleted from pending list to FSM

2015-09-18 Thread Tom Lane
[ moving thread to -hackers ]

Fujii Masao  writes:
> So autoanalyze still doesn't call IndexFreeSpaceMapVacuum().
> That is, only backend can clean the list in INSERT-only workload.
> I don't think that this is desirable. Because the backend will
> periodically take a big performance penalty.

> So I'm thinking that even autoanalyze should call IndexFreeSpaceMapVacuum()
> to clean the list in a background, in order to avoid such spikes in
> INSERT response time. Thought?

It seems quite bizarre for auto-analyze to do that.  auto-vacuum, sure,
but I do not think this should get plugged into ANALYZE.

Maybe we need to rethink the rules for what is autovacuum vs. autoanalyze,
or come up with some third thing that the autovac daemon does for indexes.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row_security GUC, BYPASSRLS

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 2:07 AM, Noah Misch  wrote:
> On Tue, Sep 15, 2015 at 03:18:21PM -0400, Adam Brightwell wrote:
>> On Tue, Sep 15, 2015 at 2:26 PM, Joe Conway  wrote:
>> >> Joe Conway  writes:
>> >>> There are use cases where row_security=force will be set in production
>> >>> environments, not only in testing.
>
>> > Noah's suggestion of using a per table attribute
>> > would work -- in fact I like the idea of that better than using the
>> > current GUC.
>>
>> FWIW, I also concur with a per table attribute for this purpose.  In
>> fact, I think I really like the per-table flexibility over an
>> 'all-or-nothing' approach better too.
>
> Great.  Robert, does that work for you, too?

Yes, that seems like a fine design from my point of view.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Alexander Korotkov
On Wed, Sep 16, 2015 at 7:47 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 10, 2015 at 8:33 AM, Michael Paquier <
> michael.paqu...@gmail.com> wrote:
>
>> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:
>
>> > I also added an additional check in findCommonAncestorTimeline(). Two
>> > standbys could be independently promoted and get the same new timeline
>> > ID. Now, it's checked that timelines that we assume to be the same have
>> > equal begins. Begins could still match by coincidence. But the same
>> > risk exists in original pg_rewind, though.
>>
>> Really? pg_rewind blocks attempts to rewind two nodes that are already
>> on the same timeline before checking if their timeline history map at
>> some point or not:
>> /*
>>  * If both clusters are already on the same timeline, there's nothing to
>>  * do.
>>  */
>> if (ControlFile_target.checkPointCopy.ThisTimeLineID ==
>>     ControlFile_source.checkPointCopy.ThisTimeLineID)
>>     pg_fatal("source and target cluster are on the same timeline\n");
>> And this seems really justified to me. Imagine that you have one
>> master, with two standbys linked to it. If both standbys are promoted
>> to the same timeline, you could easily replug them to the master, but
>> I fail to see the point to be able to replug one promoted standby with
>> the other in this case: those nodes have segment and history files
>> that map with each other, an operator could easily mess up things in
>> such a configuration.
>>
>
> Imagine the following configuration of servers:
>   1
>  / \
> 2   3
>     |
>     4
>
> Initially, node 1 is the master and nodes 2 and 3 are standbys for node 1;
> node 4 is a cascaded standby for node 3.
> Then node 2 and node 3 are promoted. They both get timeline number 2. Then
> node 4 is promoted and gets timeline number 3.
> Then we could try to rewind node 4 with node 2 as source. How could pg_rewind
> know that timeline number 2 is not the same timeline on those two nodes?
> We can only check that those timelines forked from timeline 1 at the
> same place. But coincidence is still possible.
>

BTW, it would be an option to generate a system_identifier for each new
timeline, by analogy with what initdb does for the whole WAL.
Having such system identifiers, we could distinguish different timelines
which have been assigned the same id.
Any thoughts?
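For illustration, a sketch of how such an identifier could be derived,
mirroring the way initdb (BootStrapXLOG) mixes the clock and PID into the
cluster-wide system identifier; the function itself is hypothetical:

#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical: stamp a newly created timeline with a 64-bit identifier. */
static uint64_t
generate_timeline_identifier(void)
{
	struct timeval tv;
	uint64_t	id;

	gettimeofday(&tv, NULL);
	id = ((uint64_t) tv.tv_sec) << 32;
	id |= ((uint64_t) tv.tv_usec) << 12;
	id |= getpid() & 0xFFF;	/* same recipe as the WAL system identifier */
	return id;
}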

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Robert Haas
On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila  wrote:
> Okay, but I think the same can be achieved with this as well.  Basic idea
> is that each worker will work on one planned statement at a time and in
> above case there will be two different planned statements and they will
> store partial seq scan related information in two different locations in
> toc, although the key (PARALLEL_KEY_SCAN) would be same and I think this
> will quite similar to what we are already doing for response queues.
> The worker will work on one of those keys based on planned statement
> which it chooses to execute.  I have explained this in somewhat more details
> in one of my previous mails [1].

shm_toc keys are supposed to be unique.  If you added more than one
with the same key, there would be no way to look up the second one.  That was
intentional, and I don't want to revise it.
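For reference, a minimal sketch of the one-key-per-node pattern
(shm_toc_allocate/insert/lookup are the real API; the key base and the
helper functions are illustrative):

#include "postgres.h"
#include "storage/shm_toc.h"

/* Illustrative: reserve a key range for per-node scan state. */
#define PARALLEL_KEY_SCAN_BASE	100

static void
publish_scan_state(shm_toc *toc, int plan_node_id,
				   void *scan_state, Size size)
{
	void	   *ptr = shm_toc_allocate(toc, size);

	memcpy(ptr, scan_state, size);
	/* The key must be unique, hence base + per-node ID, never reused. */
	shm_toc_insert(toc, PARALLEL_KEY_SCAN_BASE + plan_node_id, ptr);
}

static void *
lookup_scan_state(shm_toc *toc, int plan_node_id)
{
	/* Worker side: find this node's state by the same key. */
	return shm_toc_lookup(toc, PARALLEL_KEY_SCAN_BASE + plan_node_id);
}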

I don't want to have multiple PlannedStmt objects in any case.  That
doesn't seem like the right approach.  I think passing down an Append
tree with multiple Partial Seq Scan children to be run in order is
simple and clear, and I don't see why we would do it any other way.
The master should be able to generate a plan and then copy the part of
it below the Funnel and send it to the worker.  But there's clearly
never more than one PlannedStmt in the master, so where would the
other ones come from in the worker?  There's no reason to introduce
that complexity.

>> Each partial sequential scan needs to have a *separate* key, which
>> will need to be stored in either the Plan or the PlanState or both
>> (not sure exactly).  Each partial seq scan needs to get assigned a
>> unique key there in the master, probably starting from 0 or 100 or
>> something and counting up, and then this code needs to extract that
>> value and use it to look up the correct data for that scan.
>
> In that case also, multiple workers can work on the same key, assuming
> in your above example, multiple workers will be required to execute
> each partial seq scan.  In this case we might need to see how to map
> instrumentation information for a particular execution.

That was discussed on the nearby thread about numbering plan nodes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extend pgbench expressions with functions

2015-09-18 Thread Fabien COELHO



> -1.  double is an inexact type, whereas integer is an exact type.


Sure. I already argued along that very line.


> The typical way to handle this sort of thing is to define a struct
> whose first member is a type field and whose second field is a union
> of all the types you need to care about.


Yep.

> Then that gets passed around everywhere.  This patch should be designed
> in such a way that if we eventually end up with functions that have 10
> different return types instead of 2 different return types, we don't
> need to add 8 more parameters to any functions.  Instead, those still
> return PgBench_Value (or whatever we call it) which is the
> aforementioned struct, but there are more options for what that can
> contain.


I just put the double type in as a proof of concept, but for pgbench only
integers really matter.


What you suggest would work, but it would also result in ugly and lengthy
code, as I argued in another mail, because for overloaded operators and
functions you have to decide which actual typed operator must be called,
and then perform the necessary type conversions depending on the actual
types of the operands. The implicit descending typing used in the patch
hides this, and is more than enough for pgbench, IMO.


If this is a blocker, I would rather remove the support for doubles than 
write verbose and inelegant code.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tsvector work with citext

2015-09-18 Thread Tom Lane
"David E. Wheeler"  writes:
> On Sep 18, 2015, at 7:29 AM, Teodor Sigaev  wrote:
>> Fixed (9acb9007de30b3daaa9efc16763c3bc6e3e0a92d), but didn't backpatch 
>> because it isn't a critical bug.

> Great, thank you!
> For those on older versions, what’s the simplest workaround?

FWIW, I thought this would be a reasonable thing to back-patch.
It's not as though contrib/citext hasn't been around for awhile.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] vacuumdb sentence

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 12:18 AM, Euler Taveira  wrote:
> Yeah, I know. [Too sleepy to be writing emails...] What I want to say is:
> when we want to refer to an option, we usually add "option" after the quoted
> name (in this case, it won't make sense). I propose to remove the quotation
> because the way the sentence is written it seems we are referring to the
> task instead of the option.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
 wrote:
> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila  wrote:
>>
>> Attached, find the rebased version of patch.
>
> Here are the performance test results:

Thanks, this is really interesting.  I'm very surprised by how much
kernel overhead this shows.  I wonder where that's coming from.  The
writes to and reads from the shm_mq shouldn't need to touch the kernel
at all except for page faults; that's why I chose this form of IPC.
It could be that the signals which are sent for flow control are
chewing up a lot of cycles, but if that's the problem, it's not very
clear from here.  copy_user_generic_string doesn't sound like
something related to signals.  And why all the kernel time in
_spin_lock?  Maybe perf -g would help us tease out where this kernel
time is coming from.

Some of this may be due to rapid context switching.  Suppose the
master process is the bottleneck.  Then each worker will fill up the
queue and go to sleep.  When the master reads a tuple, the worker has
to wake up and write a tuple, and then it goes back to sleep.  This
might be an indication that we need a bigger shm_mq size.  I think
that would be worth experimenting with: if we double or quadruple or
increase by 10x the queue size, what happens to performance?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-18 Thread Nikolay Shaplov
In a message of 11 September 2015 15:12:04, you wrote:

> > Ok.Let's come to the final decision with tuple_data_parse, and i will add
> > this switch there and to pure sql heap_page_item_attrs
> 
> Fine for me.

So I've modified the code; now we have:

heap_page_items - has a column with the raw tuple data.

tuple_data_split - takes the table oid, raw tuple data, infomask, infomask2
and the t_bits string, and returns a bytea[] with the raw values of the
attributes. It also has two optional arguments: do_detoast, which forces the
function to detoast the attributes, and warning_mode, which puts the function
into warning mode so it does not stop working if some inconsistency is found.

There is also a pure SQL function heap_page_item_attrs that takes page data
and a table oid, and returns the same data as heap_page_items but with a
bytea[] of attributes instead of one whole piece of raw data. It also has an
optional argument do_detoast that allows getting a bytea[] of detoasted
attribute data.

I've decided that there is no real need for warning_mode in
heap_page_item_attrs, so there is no such argument there.

So for now this is still an RFC. The final patch with documentation will come
soon.

-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Companydiff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index aec5258..e4bc1af 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -5,7 +5,7 @@ OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
 		  brinfuncs.o ginfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.3.sql pageinspect--1.2--1.3.sql \
+DATA = pageinspect--1.4.sql pageinspect--1.3--1.4.sql pageinspect--1.2--1.3.sql \
 	pageinspect--1.1--1.2.sql pageinspect--1.0--1.1.sql \
 	pageinspect--unpackaged--1.0.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..d42ca48 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -29,7 +29,11 @@
 #include "funcapi.h"
 #include "utils/builtins.h"
 #include "miscadmin.h"
+#include "utils/array.h"
+#include "utils/rel.h"
+#include "catalog/pg_type.h"
 
+Datum split_tuple_data(char *tuple_data, uint16 tuple_data_len, TupleDesc tuple_desc, uint16 t_infomask, uint16 t_infomask2, bits8 *t_bits, bool do_detoast, int error_level);
 
 /*
  * bits_to_text
@@ -122,8 +126,8 @@ heap_page_items(PG_FUNCTION_ARGS)
 		HeapTuple	resultTuple;
 		Datum		result;
 		ItemId		id;
-		Datum		values[13];
-		bool		nulls[13];
+		Datum		values[14];
+		bool		nulls[14];
 		uint16		lp_offset;
 		uint16		lp_flags;
 		uint16		lp_len;
@@ -155,6 +159,8 @@ heap_page_items(PG_FUNCTION_ARGS)
 		{
 			HeapTupleHeader tuphdr;
 			int			bits_len;
+			bytea		*tuple_data_bytea;
+			int			tuple_data_len;
 
 			/* Extract information from the tuple header */
 
@@ -168,6 +174,13 @@ heap_page_items(PG_FUNCTION_ARGS)
 			values[9] = UInt32GetDatum(tuphdr->t_infomask);
 			values[10] = UInt8GetDatum(tuphdr->t_hoff);
 
+			/* Copy raw tuple data into bytea attribute */
+			tuple_data_len = lp_len - tuphdr->t_hoff;
+			tuple_data_bytea = (bytea *) palloc(tuple_data_len + VARHDRSZ);
+			SET_VARSIZE(tuple_data_bytea, tuple_data_len + VARHDRSZ);
+			memcpy(VARDATA(tuple_data_bytea), (char *) tuphdr + tuphdr->t_hoff, tuple_data_len);
+			values[13] = PointerGetDatum(tuple_data_bytea);
+			nulls[13] = false;
 			/*
 			 * We already checked that the item is completely within the raw
 			 * page passed to us, with the length given in the line pointer.
@@ -208,7 +221,7 @@ heap_page_items(PG_FUNCTION_ARGS)
 			 */
 			int			i;
 
-			for (i = 4; i <= 12; i++)
+			for (i = 4; i <= 13; i++)
 nulls[i] = true;
 		}
 
@@ -223,3 +236,210 @@ heap_page_items(PG_FUNCTION_ARGS)
 	else
 		SRF_RETURN_DONE(fctx);
 }
+
+PG_FUNCTION_INFO_V1(tuple_data_split);
+Datum
+tuple_data_split(PG_FUNCTION_ARGS)
+{
+	Oidrel_oid;
+	bytea			*raw_data;
+	uint16			t_infomask;
+	uint16			t_infomask2;
+	text			*t_bits_str;
+	RelationData	*rel;
+	TupleDesc		tuple_desc;
+	bool			do_detoast = false;
+	interror_level = ERROR;
+
+	bits8			*t_bits = NULL;
+	Datum			res;
+
+	rel_oid		= PG_GETARG_OID(0);
+	raw_data	= PG_GETARG_BYTEA_P(1);
+	t_infomask	= PG_GETARG_INT16(2);
+	t_infomask2	= PG_GETARG_INT16(3);
+	t_bits_str	= PG_ARGISNULL(4) ? NULL : PG_GETARG_TEXT_P(4);
+	if (PG_NARGS()>=6) do_detoast = PG_GETARG_BOOL(5);
+	if (PG_NARGS()>=7) error_level = PG_GETARG_BOOL(6)?WARNING:ERROR;
+
+	if (!superuser())
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+	(errmsg("must be superuser to use raw page functions"))));
+
+	/*
+	 * Here we converting t_bits string back to the bits8 array,
+	 * as it really is in the tuple header
+	 */
+	if (t_infomask & HEAP_HASNULL)
+	{
+		int bits_str_len;
+		int bits_len;
+		char * p;
+		int off;
+		char byte = 0;
+		bits_len = (t_infomask2 & HEAP_NATTS_MASK) / 8 + 1;
+		if (! t_bits_str)
+		{
+			

Re: [HACKERS] tsvector work with citext

2015-09-18 Thread David E. Wheeler
On Sep 18, 2015, at 7:29 AM, Teodor Sigaev  wrote:

>> Fixable?
> 
> Fixed (9acb9007de30b3daaa9efc16763c3bc6e3e0a92d), but didn't backpatch 
> because it isn't a critical bug.

Great, thank you!

For those on older versions, what’s the simplest workaround?

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 6:59 AM, Pavel Stehule  wrote:
> I am afraid there is no simple and nice solution - if the data sender
> waits for the moment when the data are received, then we have the same
> complexity as when we use shm_mq.
>
> Isn't it better to introduce a new background worker with responsibility
> for cleaning up orphaned DSM segments?

That won't work, or at least not easily.  On Windows, the DSM is
cleaned up by the operating system as soon as nobody has it mapped.

Frankly, I think you guys are making this out to be way more
complicated than it really is.  Basically, I think the process being
queried should publish a DSM via a slot it owns.  The recipient is
responsible for reading it and then notifying the sender.  If a second
process requests data before the first process reads its data, the
second process can either (1) become an additional reader of the
already-published data or (2) wait for the first process to finish,
and then send its own inquiry.

There are some problems to solve here, but they hardly seem impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] numbering plan nodes

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 5:24 AM, Kouhei Kaigai  wrote:
>> On Thu, Sep 17, 2015 at 9:01 PM, Kouhei Kaigai  wrote:
>> > I entirely agree with the idea of plan-node identifier, however,
>> > uncertain whether the node-id shall represent physical location on
>> > the dynamic shared memory segment, because
>> > (1) Relatively smaller number of node type needs shared state,
>> > thus most of array items are empty.
>> > (2) Extension that tries to modify plan-tree using planner_hook
>> > may need to adjust node-id also.
>> >
>> > Even though shm_toc_lookup() has to walk on the toc entries to find
>> > out the node-id, it happens at once on beginning of the executor at
>> > background worker side. I don't think it makes a significant problem.
>>
>> Yes, I was thinking that what would make sense is to have each
>> parallel-aware node call shm_toc_insert() using its ID as the key.
>> Then, we also need Instrumentation nodes.  For those, I thought we
>> could use some fixed, high-numbered key, and Tom's idea.
>>
> Hmm, indeed, run-time statistics are needed for every node.
> If the array indexed by node-id were a hash table, we could handle
> non-contiguous node-ids with no trouble.

Yeah, we could do that, but it seems more complex than we really need.

>> Are there extensions that use planner_hook to do surgery on the plan
>> tree?  What do they do, exactly?
>>
> (Even though it will not work under Funnel,) PG-Strom often injects
> a preprocessor node under the Agg node to produce partial aggregation
> to reduce the number of rows to be processed by the CPU.

OK.  So, if you wanted to make it work under a Funnel, you'd need to
stick a node ID on it that was higher than any assigned to any plan
node thus far.  Doesn't seem like a big deal.

> Also, I have seen a paper published by Fujitsu folks. Their module
> modifies plan-tree to replace built-in scan node with their own
> columnar storage scan node.
>   http://db-event.jpn.org/deim2015/paper/195.pdf
> This paper is written in Japanese, however, figure-3 in page.4 shows
> what I explain above.

If you replaced a node, you'd just copy the ID from the existing node.
That seems simple enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extend pgbench expressions with functions

2015-09-18 Thread Robert Haas
On Thu, Sep 17, 2015 at 10:58 PM, Kyotaro HORIGUCHI
 wrote:
> By the way, the complexity comes from separating integer and
> double. If there is no serios reason to separate them, handling
> all values as double makes things far simpler.

-1.  double is an inexact type, whereas integer is an exact type.

The typical way to handle this sort of thing is to define a struct
whose first member is a type field and whose second field is a union
of all the types you need to care about.  Then that gets passed around
everywhere.  This patch should be designed in such a way that if we
eventually end up with functions that have 10 different return types
instead of 2 different return types, we don't need to add 8 more
parameters to any functions.  Instead, those still return
PgBench_Value (or whatever we call it) which is the aforementioned
struct, but there are more options for what that can contain.
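For illustration, a minimal sketch of that tagged-union shape (the names
are placeholders, not the patch's actual definitions):

#include <stdint.h>

typedef enum
{
	PGBENCH_VALUE_INT,
	PGBENCH_VALUE_DOUBLE
	/* more tags can be added later without changing any signatures */
} PgBenchValueType;

typedef struct
{
	PgBenchValueType type;	/* says which union member is valid */
	union
	{
		int64_t		ival;
		double		dval;
	}			u;
} PgBench_Value;

/* Every evaluator returns the same struct type, whatever it holds. */
static PgBench_Value
make_int_value(int64_t v)
{
	PgBench_Value r;

	r.type = PGBENCH_VALUE_INT;
	r.u.ival = v;
	return r;
}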

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin  wrote:
> For both scenarios on linux we got approximately the same results - version
> with timings was faster than version with sampling (sampling was done every
> 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps and version
> with timings gave ~14500 tps while version with sampling gave ~13800 tps. In
> all cases processor was 100% utilized. Comparing vanilla PostgreSQL and
> version with timings on constant workload (12000 tps) gave the following
> results in latencies for queries:

If the timing is speeding things up, that's most likely a sign that
the spinlock contention on that workload is so severe that you are
spending a lot of time in s_lock.  Adding more things for the system
to do that don't require that lock will speed the system up by
reducing the contention.  Instead of inserting gettimeofday() calls,
you could insert a for loop that counts to some large number without
doing any useful work, and that would likely have a similar effect.
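(A trivial sketch of such a do-nothing loop, purely illustrative - the
point is that any added work off the lock path can reduce contention on
the lock:)

/* Busy-work that takes time without touching any lock. */
static volatile unsigned long sink;

static void
waste_cycles(unsigned long n)
{
	unsigned long i;

	for (i = 0; i < n; i++)
		sink += i;		/* the volatile store defeats optimization */
}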

In any case, I think your experiment clearly proves that the presence
or absence of this instrumentation *is* performance-relevant and that
we *do* need to worry about what it costs. If the system gets 20%
faster when you call gettimeofday() a lot, does that mean we should
insert gettimeofday() calls all over the system in random places to
speed it up?

I do agree that if we're going to include support for timings, having
them be controlled by a GUC is a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-09-18 Thread Robert Haas
On Fri, Sep 18, 2015 at 6:55 AM, Amit Kapila  wrote:
>> +currentRelation = ExecOpenScanRelation(estate,
>> +   ((SeqScan *)
>> node->ss.ps.plan)->scanrelid,
>> +   eflags);
>>
>> I can't see how this can possibly be remotely correct.  The funnel
>> node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).
>>
>
> This is mainly used for generating tuple descriptor and that tuple
> descriptor will be used for forming scanslot, funnel node itself won't
> do any scan. However, we can completely eliminate this InitFunnel()
> function and use ExecAssignProjectionInfo() instead of
> ExecAssignScanProjectionInfo() to form the projection info.

That sounds like a promising approach.

>> +buffer_usage_worker = (BufferUsage *)(buffer_usage + (i *
>> sizeof(BufferUsage)));
>>
>> Cast it to a BufferUsage * first.  Then you can use [i] to find
>> the i'th element.
>
> Do you mean to say that the way code is written won't work?
> Values of structure BufferUsage for each worker is copied into string
> buffer_usage which I believe could be fetched in above way.

I'm just complaining about the style.  If bar is a char*, then these
are all equivalent:

foo = (Quux *) (bar + (i * sizeof(Quux)));

foo = ((Quux *) bar) + i;

foo = &((Quux *) bar)[i];

baz = (Quux *) bar;
foo = &baz[i];

>> +/*
>> + * Re-initialize the parallel context and workers to perform
>> + * rescan of relation.  We want to gracefully shutdown all the
>> + * workers so that they should be able to propagate any error
>> + * or other information to master backend before dying.
>> + */
>> +FinishParallelSetupAndAccumStats(node);
>>
>> Somehow, this makes me feel like that function is badly named.
>
> I think here comment seems to be slightly misleading, shall we
> change the comment as below:
>
> Destroy the parallel context to gracefully shutdown all the
> workers so that they should be able to propagate any error
> or other information to master backend before dying.

Well, why does a function that destroys the parallel context have a
name that starts with FinishParallelSetup?  It sounds like it is
tearing things down, not finishing setup.

>> +#parallel_setup_cost = 0.0  # same scale as above
>> +#define DEFAULT_PARALLEL_SETUP_COST  0.0
>>
>> This value is probably a bit on the low side.
>
> How about keeping it as 10.0?

Really?  I would have guessed that the correct value was in the tens
of thousands.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-18 Thread Michael Paquier
On Fri, Sep 18, 2015 at 9:00 AM, Alexander Korotkov wrote:
> BTW, it would be an option to generate a system_identifier for each new
> timeline, by analogy with what initdb does for the whole WAL.
> Having such system identifiers, we could distinguish different timelines
> which have been assigned the same id.
> Any thoughts?

If you mean a new field incorporated in XLogLongPageHeader and
ControlFile to ensure that a new timeline generated is unique across
the same installation identified with system_identifier, then I'm not
against it for something up to 2 bytes (even 1 byte would be fine),
though it seems that there is a very narrow need for it. Do you have
cases in mind that could use it? Even in the case of pg_rewind, it
seems to me we would not need this extra timeline-based ID. Per your
case above, it is possible to rewind node 4 on timeline 3 using node 2
on timeline 2 when both timelines have forked exactly at the same
point from timeline 1, by having node 4 begin to replay from
timeline 1 (last checkpoint record before WAL forked), and if I am
reading your patch correctly that's what you do.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers