Re: [HACKERS] Add generate_series(numeric, numeric)

2015-01-13 Thread Ali Akbar
2014-12-18 19:35 GMT+07:00 Fujii Masao masao.fu...@gmail.com:

 On Mon, Dec 15, 2014 at 2:38 PM, Andrew Gierth
 and...@tao11.riddles.org.uk wrote:
  I was thinking something like this, added just after that para:
 
  <warning>
   <para>
    While the actual arguments to the function remain unchanged between
    calls, if you detoast the argument values (which is normally done
    transparently by the
    <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro)
    in the transient context then the detoasted copies will be freed on
    each cycle. Accordingly, if you keep references to such values in
    your <structfield>user_fctx</>, you must either copy them into the
    <structfield>multi_call_memory_ctx</> after detoasting, or ensure
    that you detoast the values only in that context.
   </para>
  </warning>

 I'm OK with this.


Wrapping the doc changes in a patch. Will add to next commitfest so it
won't be lost.

-- 
Ali Akbar
*** a/doc/src/sgml/xfunc.sgml
--- b/doc/src/sgml/xfunc.sgml
***
*** 2986,2991  SRF_RETURN_DONE(funcctx)
--- 2986,3005 
   <structfield>multi_call_memory_ctx</> while doing the first-call setup.
  /para
  
+ <warning>
+  <para>
+   While the actual arguments to the function remain unchanged between
+   calls, if you detoast the argument values (which is normally done
+   transparently by the
+   <function>PG_GETARG_<replaceable>xxx</replaceable></function> macro)
+   in the transient context then the detoasted copies will be freed on
+   each cycle. Accordingly, if you keep references to such values in
+   your <structfield>user_fctx</>, you must either copy them into the
+   <structfield>multi_call_memory_ctx</> after detoasting, or ensure
+   that you detoast the values only in that context.
+  </para>
+ </warning>
+ 
  para
   A complete pseudo-code example looks like the following:
  programlisting

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-01-13 Thread Robert Haas
On Thu, Jan 8, 2015 at 1:00 PM, Kevin Grittner kgri...@ymail.com wrote:
 Clearly, all the nodes other than the local one need to use 2PC.  I am
 unconvinced that the local node must write a 2PC state file only to
 turn around and remove it again almost immediately thereafter.

 The key point is that the distributed transaction data must be
 flagged as needing to commit rather than roll back between the
 prepare phase and the final commit.  If you try to avoid the
 PREPARE, flagging, COMMIT PREPARED sequence by building the
 flagging of the distributed transaction metadata into the COMMIT
 process, you still have the problem of what to do on crash
 recovery.  You really need to use 2PC to keep that clean, I think.

I don't really follow this.  You need to write a list of the
transactions that you're going to prepare to stable storage before
preparing any of them.  And then you need to write something to stable
storage when you've definitively determined that you're going to
commit.  But we have no current mechanism for the first thing (so
reusing 2PC doesn't help) and we already have the second thing (it's
the commit record itself).
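Robert's two stable-storage writes (first the list of transactions about to be prepared, then the commit decision itself) can be sketched as a toy coordinator log. This is hypothetical Python with invented names, not PostgreSQL code:

```python
# Toy sketch of the ordering described above: persist the list of
# transactions to prepare *before* preparing them, then persist the
# commit decision. All names are invented for illustration; this is
# not how PostgreSQL implements anything.

class CoordinatorLog:
    """Append-only 'stable storage'; a list stands in for an fsync'd file."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)   # imagine an fsync() here

def run_distributed_commit(log, gtx_id, participants):
    # 1. Durably record which participants we are about to PREPARE on.
    log.write(("intent", gtx_id, tuple(participants)))
    # 2. PREPARE on every participant (not shown).
    # 3. Durably record the decision to commit; this plays the role of
    #    the local commit record.
    log.write(("commit", gtx_id))
    # 4. COMMIT PREPARED everywhere (not shown); retried after a crash.

def recover(log):
    """After a crash: commit anything with a decision, roll back the rest."""
    intents = {r[1] for r in log.records if r[0] == "intent"}
    committed = {r[1] for r in log.records if r[0] == "commit"}
    return {gtx: ("commit" if gtx in committed else "rollback")
            for gtx in intents}
```

Crash recovery then has an unambiguous rule: an intent without a matching commit decision rolls back, anything with a decision is re-driven to completion.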

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Minor configure tweak to simplify adjusting gcc warnings

2015-01-13 Thread Tom Lane
Would anyone object to modifying configure.in like this:
 
 if test "$GCC" = yes -a "$ICC" = no; then
-  CFLAGS="$CFLAGS -Wall -Wmissing-prototypes -Wpointer-arith"
+  CFLAGS="-Wall $CFLAGS -Wmissing-prototypes -Wpointer-arith"
   # These work in some but not all gcc versions
   PGAC_PROG_CC_CFLAGS_OPT([-Wdeclaration-after-statement])

The reason I got interested in this is that I attempted to pass in
CFLAGS=-Wno-format to configure, to suppress format warnings on
buildfarm member gaur (whose gcc is too old to recognize z modifiers).
That doesn't work because -Wall turns the warnings right back on again.
If the user-supplied CFLAGS were inserted after -Wall then it would work.
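For gcc warning flags the last occurrence on the command line wins, which is the whole point of the reordering. A toy resolver (illustrative Python, modelling only the format warnings) shows the difference:

```python
# Toy model of gcc's "last flag wins" rule, restricted to the format
# warnings. Illustrative only; gcc itself implements far more than this.

def format_warnings_enabled(flags):
    enabled = False
    for f in flags:
        if f in ("-Wall", "-Wformat"):   # -Wall enables -Wformat, among others
            enabled = True
        elif f == "-Wno-format":
            enabled = False
    return enabled

# Current order: user CFLAGS first, -Wall appended after, so -Wall
# re-enables the warnings the user tried to suppress.
assert format_warnings_enabled(["-Wno-format", "-Wall"]) is True
# Proposed order: -Wall first, user CFLAGS after, so -Wno-format sticks.
assert format_warnings_enabled(["-Wall", "-Wno-format"]) is False
```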

A slightly more complicated change could be applied to make sure that
*all* of the CFLAGS forcibly inserted by configure appear before any
externally-sourced CFLAGS, allowing any of them to be overridden from the
environment variable.  I'm not sure if it's worth the trouble to do that,
but if there's interest I could make it happen.

regards, tom lane




Re: [HACKERS] Parallel Seq Scan

2015-01-13 Thread Amit Kapila
On Tue, Jan 13, 2015 at 4:55 PM, John Gorman johngorm...@gmail.com wrote:



 On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com
wrote:

 On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net
wrote:
  So, for my 2c, I've long expected us to parallelize at the relation-file
  level for these kinds of operations.  This goes back to my other
  thoughts on how we should be thinking about parallelizing inbound data
  for bulk data loads but it seems appropriate to consider it here also.
  One of the issues there is that 1G still feels like an awful lot for a
  minimum work size for each worker and it would mean we don't parallelize
  for relations less than that size.

 Yes, I think that's a killer objection.


 One approach that has worked well for me is to break big jobs into much
 smaller bite-size tasks. Each task is small enough to complete quickly.


Here we have to decide what the strategy should be and how much
each worker should scan.  As an example, one strategy could be:
if the table size is X MB and there are 8 workers, then divide the
work as X/8 MB for each worker (which I have currently used in the
patch); another could be that each worker scans one block at a time
and then checks some global structure to see which block it needs
to scan next, which in my view could lead to a random scan pattern.
I have read that some other databases also divide the work based on
partitions or segments (the size of a segment is not very clear).
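The two strategies can be sketched side by side. This is illustrative Python, not the patch's C code, and the names are invented:

```python
# Side-by-side sketch of the two division strategies. Illustrative only.

def equal_split(nblocks, nworkers):
    """Strategy 1: hand each worker a contiguous ~1/nworkers slice up front."""
    per, extra = divmod(nblocks, nworkers)
    ranges, start = [], 0
    for i in range(nworkers):
        size = per + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

class BlockDispenser:
    """Strategy 2: a shared structure handing out one block at a time.
    Consecutive blocks go to different workers, hence the worry that the
    resulting on-disk access pattern degrades toward random I/O."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.next_block = 0        # in shared memory, bumped atomically

    def claim(self):
        if self.next_block >= self.nblocks:
            return None            # relation fully scanned
        b = self.next_block
        self.next_block += 1
        return b
```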


 We add the tasks to a task queue and spawn a generic worker pool which
eats through the task queue items.

 This solves a lot of problems.

 - Small to medium jobs can be parallelized efficiently.
 - No need to split big jobs perfectly.
 - We don't get into a situation where we are waiting around for a worker
to finish chugging through a huge task while the other workers sit idle.
 - Worker memory footprint is tiny so we can afford many of them.
 - Worker pool management is a well known problem.
 - Worker spawn time disappears as a cost factor.
 - The worker pool becomes a shared resource that can be managed and
reported on and becomes considerably more predictable.


Yeah, it is a good idea to maintain a shared worker pool, but it seems
to me that for an initial version, even if the workers are not shared,
it is still meaningful to make parallel sequential scan work.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-01-13 Thread Ashutosh Bapat
On Wed, Jan 14, 2015 at 9:12 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Tue, Jan 13, 2015 at 4:55 PM, John Gorman johngorm...@gmail.com
 wrote:
 
 
 
  On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com
 wrote:
 
  On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net
 wrote:
   So, for my 2c, I've long expected us to parallelize at the relation-file
   level for these kinds of operations.  This goes back to my other
   thoughts on how we should be thinking about parallelizing inbound data
   for bulk data loads but it seems appropriate to consider it here also.
   One of the issues there is that 1G still feels like an awful lot for a
   minimum work size for each worker and it would mean we don't parallelize
   for relations less than that size.
 
  Yes, I think that's a killer objection.
 
 
  One approach that has worked well for me is to break big jobs into
  much smaller bite-size tasks. Each task is small enough to complete quickly.
 

 Here we have to decide what the strategy should be and how much
 each worker should scan.  As an example, one strategy could be:
 if the table size is X MB and there are 8 workers, then divide the
 work as X/8 MB for each worker (which I have currently used in the
 patch); another could be that each worker scans one block at a time
 and then checks some global structure to see which block it needs
 to scan next, which in my view could lead to a random scan pattern.
 I have read that some other databases also divide the work based on
 partitions or segments (the size of a segment is not very clear).


A block can contain useful tuples, i.e. tuples which are visible and
fulfil the quals, as well as useless tuples, i.e. tuples which are dead,
invisible, or which do not fulfil the quals. Depending upon the contents
of these blocks, esp. the ratio of (useful tuples)/(useless tuples),
each worker may take a different amount of time even though we divide
the relation into equal-sized runs. So, instead of dividing the relation
into number of runs = number of workers, it might be better to divide it
into fixed-size runs with size < (total number of blocks / number of
workers), and let a worker pick up a new run after it finishes with the
previous one. The smaller the runs, the better the load balancing, but
the higher the cost of starting each run. So, we have to strike a
balance.
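The trade-off can be shown with a small simulation: fixed-size runs claimed by whichever worker frees up first balance a skewed relation better than one big run per worker. Hypothetical Python with invented names, not the proposed implementation:

```python
# Small simulation of the run-size trade-off: runs are claimed greedily
# by whichever worker becomes free first. Illustrative only.

def simulate(block_costs, nworkers, run_size):
    """Return the per-worker total cost for a given run size."""
    nblocks = len(block_costs)
    runs = [range(s, min(s + run_size, nblocks))
            for s in range(0, nblocks, run_size)]
    worker_time = [0.0] * nworkers
    for run in runs:
        w = worker_time.index(min(worker_time))   # worker that frees up first
        worker_time[w] += sum(block_costs[b] for b in run)
    return worker_time

# A skewed relation: the first half of the blocks is 10x as expensive.
costs = [10.0] * 50 + [1.0] * 50
coarse = simulate(costs, 4, 25)   # one run per worker
fine = simulate(costs, 4, 5)      # many small runs
```

With one run per worker the two workers that drew the expensive half finish last (makespan 250), while 5-block runs even things out (makespan 150), at the price of starting more runs.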




  We add the tasks to a task queue and spawn a generic worker pool which
 eats through the task queue items.
 
  This solves a lot of problems.
 
  - Small to medium jobs can be parallelized efficiently.
  - No need to split big jobs perfectly.
  - We don't get into a situation where we are waiting around for a worker
 to finish chugging through a huge task while the other workers sit idle.
  - Worker memory footprint is tiny so we can afford many of them.
  - Worker pool management is a well known problem.
  - Worker spawn time disappears as a cost factor.
  - The worker pool becomes a shared resource that can be managed and
 reported on and becomes considerably more predictable.
 

 Yeah, it is a good idea to maintain a shared worker pool, but it seems
 to me that for an initial version, even if the workers are not shared,
 it is still meaningful to make parallel sequential scan work.


 With Regards,
 Amit Kapila.
 EnterpriseDB: http://www.enterprisedb.com




-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-01-13 Thread Robert Haas
On Sun, Jan 11, 2015 at 3:36 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 My understanding is that once you get a successful PREPARE that should mean
 that it's basically impossible for the transaction to fail to commit. If
 that's not the case, I fail to see how you can get any decent level of
 sanity out of this...
 When giving the responsibility of a group of COMMIT PREPARED to a set
 of nodes in a network, there could be a couple of problems showing up,
 of the split-brain type for example.

I think this is just confusing the issue.  When a machine reports that
a transaction is successfully prepared, any future COMMIT PREPARED
operation *must* succeed.  If it doesn't, the machine has broken its
promises, and that's not OK.  Period.  It doesn't matter whether
that's due to split-brain or sunspots or Oscar Wilde having bad
breath.  If you say that it's prepared, then you're not allowed to
change your mind later and say that it can't be committed.  If you do,
then you have a broken 2PC implementation and, as Jim says, all bets
are off.

Now of course nothing is certain in life except death and taxes.  If
you PREPARE a transaction, and then go into the data directory and
corrupt the 2PC state file using dd, and then try to commit it, it
might fail.  But no system can survive that sort of thing, whether 2PC
is involved or not; in such extraordinary situations, of course
operator intervention will be required.  But in a more normal
situation where you just have a failover, if the failover causes your
prepared transaction to come unprepared, that means your failover
mechanism is broken.  If you're using synchronous replication, this
shouldn't happen.

 There could be as well failures
 at hardware-level, so you would need a mechanism ensuring that WAL is
 consistent among all the nodes, with for example the addition of a
 common restore point on all the nodes once PREPARE is successfully
 done with for example XLOG_RESTORE_POINT. That's a reason why I think
 that the local Coordinator should use 2PC as well, to ensure a
 consistency point once all the remote nodes have successfully
 PREPAREd, and a reason why things can get complicated for either the
 DBA or the upper application in charge of ensuring the DB consistency
 even in case of critical failures.

It's up to the DBA to decide whether they care about surviving
complete loss of a node while having 2PC still work.  If they do, they
should use sync rep, and they should be fine -- the machine on which
the transaction is prepared shouldn't acknowledge the PREPARE as
having succeeded until the WAL is safely on disk on the standby.  Most
probably don't, though; that's a big performance penalty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Heikki Linnakangas

On 01/13/2015 07:24 PM, Tom Lane wrote:

It is, but FDWs are not at risk here: they merely reference ExplainStates
that were allocated by core backend code.  So as long as we add the new
field at the end it's not a problem for them.  Problematic usage would be
like what auto_explain does:

 ExplainState es;

 ExplainInitState(&es);
 ...

In hindsight, that's a bad API and we should change it to something like

 ExplainState *es = NewExplainState();

so that the sizeof the struct isn't embedded in extension code.  But we
definitely can't do that in back branches.


Actually, it would make sense to do exactly that, to break any 
extensions that are doing the unsafe thing in an obvious way. The 
downside would be that an extension using the new API would then not 
work on an old server.


We could repurpose one of the existing fields in ExplainState to point 
to another struct that contains more fields. Something like this:


*** a/src/include/commands/explain.h
--- b/src/include/commands/explain.h
***
*** 40,48 ****
  	List	   *rtable;			/* range table */
  	List	   *rtable_names;	/* alias names for RTEs */
  	int			indent;			/* current indentation level */
! 	List	   *grouping_stack;	/* format-specific grouping state */
  } ExplainState;
  
  /* Hook for plugins to get control in ExplainOneQuery() */
  typedef void (*ExplainOneQuery_hook_type) (Query *query,
  										   IntoClause *into,
--- 40,55 ----
  	List	   *rtable;			/* range table */
  	List	   *rtable_names;	/* alias names for RTEs */
  	int			indent;			/* current indentation level */
! 	ExplainStateExtraFields *extra;	/* more fields */
  } ExplainState;
  
+ typedef struct ExplainStateExtraFields
+ {
+ 	List	   *grouping_stack;	/* format-specific grouping state */
+ 	...
+ }
+ 

That's pretty ugly, but it would work even if there are ExplainState 
structs embedded in extensions. As long as they don't try to look at the 
grouping_stack field; I think that's a fairly safe assumption.


But do we really need to backpatch any of this?

- Heikki





Re: [HACKERS] [PATCH] explain sortorder

2015-01-13 Thread Heikki Linnakangas

On 01/13/2015 06:04 PM, Timmer, Marius wrote:

  -malloc() (StringInfo is used as suggested now).


There really shouldn't be any snprintf() calls in the patch, when 
StringInfo is used correctly...



@@ -1187,6 +1187,7 @@ explain (verbose, costs off) select * from matest0 order by 1-id;
  Sort
    Output: matest0.id, matest0.name, ((1 - matest0.id))
    Sort Key: ((1 - matest0.id))
+   Sort Order: ASC NULLS LAST
    ->  Result
          Output: matest0.id, matest0.name, (1 - matest0.id)
          ->  Append


This patch isn't going to be committed with this output format. Please 
change per my suggestion earlier:



I don't like this output. If there are a lot of sort keys, it gets
difficult to match the right ASC/DESC element to the sort key it applies
to. (Also, there seems to be double-spaces in the list)

I would suggest just adding the information to the Sort Key line. As
long as you don't print the modifiers when they are defaults (ASC and
NULLS LAST), we could print the information even in non-VERBOSE mode. So
it would look something like:

Sort Key: sortordertest.n1 NULLS FIRST, sortordertest.n2 DESC


Or if you don't agree with that, explain why.
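The suggested rule (print a modifier only when it differs from the default for that direction: ASC defaults to NULLS LAST, DESC to NULLS FIRST) could look like this hypothetical helper, which reproduces the example line above. It is a sketch, not the patch's code:

```python
# Hypothetical helper for the suggested format: emit a modifier only
# when it differs from the default for that sort direction.

def format_sort_key(expr, descending=False, nulls_first=False):
    parts = [expr]
    if descending:
        parts.append("DESC")
        if not nulls_first:           # non-default for DESC
            parts.append("NULLS LAST")
    elif nulls_first:                 # non-default for ASC
        parts.append("NULLS FIRST")
    return " ".join(parts)

def format_sort_keys(keys):
    return "Sort Key: " + ", ".join(format_sort_key(*k) for k in keys)

line = format_sort_keys([("sortordertest.n1", False, True),
                         ("sortordertest.n2", True, True)])
# line == "Sort Key: sortordertest.n1 NULLS FIRST, sortordertest.n2 DESC"
```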

- Heikki





Re: [HACKERS] Check that streaming replica received all data after master shutdown

2015-01-13 Thread Heikki Linnakangas

On 01/13/2015 12:11 PM, Vladimir Borodin wrote:


On 05 Jan 2015, at 18:15, Vladimir Borodin r...@simply.name wrote:


Hi all.

I have a simple script for planned switchover of PostgreSQL (9.3 and 9.4) 
master to one of its replicas. This script checks a lot of things before doing 
it and one of them is that all data from master has been received by replica 
that is going to be promoted. Right now the check is done like below:

On the master:

postgres@pgtest03d ~ $ psql -t -A -c 'select pg_current_xlog_location();'
0/3390
postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast
waiting for server to shut down done
server stopped
postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head
pg_control version number:937
Catalog version number:   201306121
Database system identifier:   6061800518091528182
Database cluster state:   shut down
pg_control last modified: Mon 05 Jan 2015 06:47:57 PM MSK
Latest checkpoint location:   0/3428
Prior checkpoint location:0/3328
Latest checkpoint's REDO location:0/3428
Latest checkpoint's REDO WAL file:001B0034
Latest checkpoint's TimeLineID:   27
postgres@pgtest03d ~ $

On the replica (after shutdown of master):

postgres@pgtest03g ~ $ psql -t -A -c "select 
pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/3428');"
104
postgres@pgtest03g ~ $

These 104 bytes seem to be the size of the shutdown checkpoint record (as I can 
understand from pg_xlogdump output).

postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump -s 
0/3390 -t 27
rmgr: XLOGlen (rec/tot):  0/32, tx:  0, lsn: 
0/3390, prev 0/3328, bkp: , desc: xlog switch
rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
0/3428, prev 0/3390, bkp: , desc: checkpoint: redo 0/3428; tli 
27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0; oldest xid 
1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
pg_xlogdump: FATAL:  error in WAL record at 0/3428: record with zero length 
at 0/3490

postgres@pgtest03g ~/9.3/data/pg_xlog $

I’m not sure that these 104 bytes will always be 104 bytes, so checking with 
strict equality seems fragile. Could it change in the future? Or is there a 
better way to understand that a streaming replica has received all data after 
master shutdown? The check that pg_xlog_location_diff returns 104 bytes seems 
a bit strange.


Don't rely on it being 104 bytes. It can vary across versions, and 
across different architectures.


You could simply check that the standby's pg_last_xlog_replay_location() 
>= master's "Latest checkpoint location", and not care about the exact 
difference.
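Done client-side, that check just needs the usual LSN comparison: split the "hi/lo" string on the slash, treat the halves as the high and low 32 bits, and compare. A hedged sketch in Python (helper names invented):

```python
# Sketch of the suggested check, done from a monitoring script. An LSN
# like '0/3428' is two hex halves: high 32 bits / low 32 bits.

def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def standby_caught_up(replay_lsn, checkpoint_lsn):
    """True when the standby has replayed at least up to the master's
    'Latest checkpoint location' from pg_controldata."""
    return parse_lsn(replay_lsn) >= parse_lsn(checkpoint_lsn)
```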


- Heikki




Re: [HACKERS] [PATCH] explain sortorder

2015-01-13 Thread Timmer, Marius
Hi,

we removed:
 - malloc() (StringInfo is used as suggested now)
 - leftover commented-out code
 - the splitting of existing declaration and initialization in show_group_keys()

Missing tests and documentation are WIP and will follow with the next patch 
version.

Best regards

Marius




---
Marius Timmer
Zentrum für Informationsverarbeitung
Westfälische Wilhelms-Universität Münster
Einsteinstraße 60

mtimm...@uni-muenster.de


explain_sortorder-v7.patch
Description: explain_sortorder-v7.patch



Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 A difficulty with either your patch or my idea is that they require adding
 another field to ExplainState, which is an ABI break for any third-party
 code that might be declaring variables of that struct type.  That's fine
 for HEAD but would be risky to back-patch.  Any thoughts about whether we
 can get away with that (ie, anybody have an idea if there are third-party
 extensions that call explain.c)?

 codesearch.debian.net shows a couple of hits for ExplainState in
 multicorn (an extension for FDW from Python data sources); I didn't look
 but it seems that the FDW API is using that struct.

It is, but FDWs are not at risk here: they merely reference ExplainStates
that were allocated by core backend code.  So as long as we add the new
field at the end it's not a problem for them.  Problematic usage would be
like what auto_explain does:

ExplainState es;

ExplainInitState(&es);
...

In hindsight, that's a bad API and we should change it to something like

ExplainState *es = NewExplainState();

so that the sizeof the struct isn't embedded in extension code.  But we
definitely can't do that in back branches.

regards, tom lane




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Alvaro Herrera
Tom Lane wrote:

 A difficulty with either your patch or my idea is that they require adding
 another field to ExplainState, which is an ABI break for any third-party
 code that might be declaring variables of that struct type.  That's fine
 for HEAD but would be risky to back-patch.  Any thoughts about whether we
 can get away with that (ie, anybody have an idea if there are third-party
 extensions that call explain.c)?

codesearch.debian.net shows a couple of hits for ExplainState in
multicorn (an extension for FDW from Python data sources); I didn't look
but it seems that the FDW API is using that struct.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] EXEC_BACKEND + logging_collector=on is broken

2015-01-13 Thread Andres Freund
Hi,

Currently the combination from $subject fails for me with "could not
read from backend variables file".

The origin for that problem seems to be b94ce6e80 which moved
RemovePgTempFiles() to after SysLogger_Start(). Unless the syslogger
starts up very quickly RemovePgTempFiles() will have deleted the server
variables file.

I think moving the RemovePgTempFiles() to just above SysLogger_Start()
should preserve the faster reporting for pg_ctl intended by b94ce6e80
and fix the startup issue?

I'm rather surprised that this hasn't caused more problems. Apparently
Windows users don't use the logging_collector?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 01/13/2015 07:24 PM, Tom Lane wrote:
 In hindsight, that's a bad API and we should change it to something like
 ExplainState *es = NewExplainState();
 so that the sizeof the struct isn't embedded in extension code.  But we
 definitely can't do that in back branches.

 Actually, it would make sense to do exactly that, to break any 
 extensions that are doing the unsafe thing in an obvious way. The 
 downside would be that an extension using the new API would then not 
 work on an old server.

I guess that's a possibility ...

 We could repurpose one of the existing fields in ExplainState to point 
 to another struct that contains more fields. Something like this:
 ...
  That's pretty ugly, but it would work even if there are ExplainState 
  structs embedded in extensions. As long as they don't try to look at the 
  grouping_stack field; I think that's a fairly safe assumption.

Yeah, I was thinking the same thing, but it's *mighty* ugly and would also
create a back-patch hazard, since presumably we'd not do that in HEAD.

 But do we really need to backpatch any of this?

Alexey's example consumes only a couple hundred MB in 9.2, vs about 7GB
peak in 9.3 and up.  That seems like a pretty nasty regression.

regards, tom lane




Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-13 Thread Noah Misch
On Mon, Jan 12, 2015 at 05:16:40PM -0500, Stephen Frost wrote:
 Alright, here's an updated patch which doesn't return any detail if no
 values are visible or if only a partial key is visible.

I browsed this patch.  There's been no mention of foreign key constraints, but
ri_ReportViolation() deserves similar hardening.  If a user has only DELETE
privilege on a PK table, FK violation messages should not leak the PK values.
Modifications to the foreign side are less concerning, since the user will
often know the attempted value; still, I would lock down both sides.

Please add a comment explaining the safety of each row-emitting message you
haven't changed.  For example, the one in refresh_by_match_merge() is safe
because getting there requires MV ownership.

 --- a/src/backend/access/nbtree/nbtinsert.c
 +++ b/src/backend/access/nbtree/nbtinsert.c
 @@ -388,18 +388,30 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  			{
  				Datum		values[INDEX_MAX_KEYS];
  				bool		isnull[INDEX_MAX_KEYS];
 +				char	   *key_desc;
  
  				index_deform_tuple(itup, RelationGetDescr(rel),
  								   values, isnull);
 -				ereport(ERROR,
 -						(errcode(ERRCODE_UNIQUE_VIOLATION),
 -						 errmsg("duplicate key value violates unique constraint \"%s\"",
 -								RelationGetRelationName(rel)),
 -						 errdetail("Key %s already exists.",
 -								   BuildIndexValueDescription(rel,
 -															  values, isnull)),
 -						 errtableconstraint(heapRel,
 -								RelationGetRelationName(rel))));
 +
 +				key_desc = BuildIndexValueDescription(rel, values,
 +													  isnull);
 +
 +				if (!strlen(key_desc))
 +					ereport(ERROR,
 +							(errcode(ERRCODE_UNIQUE_VIOLATION),
 +							 errmsg("duplicate key value violates unique constraint \"%s\"",
 +									RelationGetRelationName(rel)),
 +							 errtableconstraint(heapRel,
 +									RelationGetRelationName(rel))));
 +				else
 +					ereport(ERROR,
 +							(errcode(ERRCODE_UNIQUE_VIOLATION),
 +							 errmsg("duplicate key value violates unique constraint \"%s\"",
 +									RelationGetRelationName(rel)),
 +							 errdetail("Key %s already exists.",
 +									   key_desc),
 +							 errtableconstraint(heapRel,
 +									RelationGetRelationName(rel))));

Instead of duplicating an entire ereport() to change whether you include an
errdetail, use condition ? errdetail(...) : 0.
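The redaction logic under review reduces to: attach the key detail only when BuildIndexValueDescription returned a non-empty description, i.e. when the user is allowed to see the key values. A toy Python rendering of that rule (illustrative names, not the C patch):

```python
# Toy rendering of the message-building rule under discussion. Invented
# names; the real code lives in _bt_check_unique() and ereport().

def unique_violation_report(constraint_name, key_desc):
    lines = ['ERROR:  duplicate key value violates unique constraint "%s"'
             % constraint_name]
    # Noah's suggestion maps this branch to: condition ? errdetail(...) : 0
    if key_desc:
        lines.append("DETAIL:  Key %s already exists." % key_desc)
    return "\n".join(lines)
```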




[HACKERS] Typo fix in alter_table.sgml

2015-01-13 Thread Michael Paquier
Hi all,

I noticed that SET STATISTICS was not in a literal block in alter_table.sgml:
  <para>
-  SET STATISTICS acquires a <literal>SHARE UPDATE EXCLUSIVE</literal> lock.
+  <literal>SET STATISTICS</literal> acquires a
+  <literal>SHARE UPDATE EXCLUSIVE</literal> lock.
  </para>
That's a small detail, still.. Patch is attached.
Regards,
-- 
Michael
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index b5ef09e..b3a4970 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -190,7 +190,8 @@ ALTER TABLE ALL IN TABLESPACE <replaceable class="PARAMETER">name</replaceable>
   <xref linkend="planner-stats">.
  </para>
  <para>
-  SET STATISTICS acquires a <literal>SHARE UPDATE EXCLUSIVE</literal> lock.
+  <literal>SET STATISTICS</literal> acquires a
+  <literal>SHARE UPDATE EXCLUSIVE</literal> lock.
  </para>
 </listitem>
 </varlistentry>



Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken

2015-01-13 Thread Magnus Hagander
On Tue, Jan 13, 2015 at 10:23 AM, Andres Freund and...@2ndquadrant.com
wrote:

 Hi,

 Currently the combination from $subject fails for me with could not
 read from backend variables file 

 The origin for that problem seems to be b94ce6e80 which moved
 RemovePgTempFiles() to after SysLogger_Start(). Unless the syslogger
 starts up very quickly RemovePgTempFiles() will have deleted the server
 variables file.

 I think moving the RemovePgTempFiles() to just above SysLogger_Start()
 should preserve the faster reporting for pg_ctl intended by b94ce6e80
 and fix the startup issue?

 I'm rather surprised that this hasn't caused more problems. Apparently
 windows users don't use the logging_collector?


I haven't looked at the actual code. But AFAIK, logging_collector=on with
logs stored in pg_log is the default logging behaviour for pg on Windows
as installed by the EDB installers. Always has been. Surely we didn't
break logging on Windows completely back in 9.2?! Any chance it's working
on Windows, just not with EXEC_BACKEND on Unix?

(Dave or someone else from the EDB team can probably confirm that this is
still the default in the installers?)

EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm
guessing that's it. It does an anonymous memory mapping (see top of
internal_forkexec) out of the swapfile and passes a handle down. And only
closes its own handle once it's been inherited into the subprocess.

I'm pretty sure that means it's not actually broken on Windows.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken

2015-01-13 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2015-01-13 11:10:06 -0800, Magnus Hagander wrote:
 EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm
 guessing that's it.

 Ah! Then this really is fairly harmless. Will fix and backpatch anyway,
 but the number of affected people should be pretty much zero.

Agreed.  We want this case to work for our own testing purposes, but
it's unlikely to be hurting anybody in the field.

regards, tom lane




Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken

2015-01-13 Thread Andres Freund
On 2015-01-13 11:10:06 -0800, Magnus Hagander wrote:
 EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm
 guessing that's it.

Ah! Then this really is fairly harmless. Will fix and backpatch anyway,
but the number of affected people should be pretty much zero.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Peter Geoghegan
On Tue, Jan 13, 2015 at 3:54 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't remember seeing _bt_moveright()  or _bt_compare() figuring so
 prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
 reference to _bt_binsrch() in either profile.

 Well, we do a _bt_moveright pretty early on, so that actually might be
 cache misses we're primarily seeing.

Still, I see zero references to _bt_binsrch(). Not one. Even if there
was only one non-meta page (i.e. pre-first-root-split), and regardless
of whether this was a read or a write to the B-Tree, there'd still be
some call there as the B-Tree was initially scanned (before _bt_next()
calls, of which I also see none).

-- 
Peter Geoghegan




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Peter Geoghegan
On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Some more information what's happening:
 This is a ghetto logical replication engine that migrates data from
 SQL Server to Postgres, consolidating a sharded database into a single
 set of tables (of which there are only two). There is only one index
 on the destination table, and it's composite int,int in both cases.

Does the logical replication engine perform dynamic DDL at all? Does
it dynamically add columns to the table that everything is being
consolidated to?

-- 
Peter Geoghegan




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Peter Geoghegan
On Tue, Jan 13, 2015 at 4:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm not convinced that Peter is barking up the right tree.  I'm noticing
 that the profiles seem rather skewed towards parser/planner work; so I
 suspect the contention is probably on access to system catalogs.  No
 idea exactly why though.

I see no int4cmp() calls at all, but plenty of _bt_compare(), and some
FunctionCall2Coll(). And yet, no _bt_binsrch(). I can see btoidcmp(),
so if I'm right then I guess it's a system catalog index.

I too would like to see a stack trace.

Trivia: pg_attribute_relid_attnam_index has deleted B-Tree pages after
a fresh initdb.
-- 
Peter Geoghegan




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Tom Lane
I wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
 But do we really need to backpatch any of this?

 Alexey's example consumes only a couple hundred MB in 9.2, vs about 7GB
 peak in 9.3 and up.  That seems like a pretty nasty regression.

I did a bit more measurement of the time and backend memory consumption
for Alexey's example EXPLAIN:

9.2: 0.9 sec, circa 200 MB
HEAD: 56 sec, circa 7300 MB
with patch below: 3.7 sec, circa 300 MB

So while this doesn't get us all the way back down to where we were before
we started trying to guarantee unique table/column identifiers in EXPLAIN
printouts, it's at least a lot closer.

Not sure whether to just commit this to HEAD and call it a day, or to
risk back-patching.

It occurred to me that we could use your idea of a secondary data
structure and minimize the code impact with a macro layer, i.e.
#define grouping_stack pointer_field->groupingstack
But I'm not sure if it's worth the trouble.  9.3 has been like this
right along, and this is the first complaint.

One compromise idea would be to back-patch only as far as 9.4.
If there are extensions out there that have this ABI issue, expecting
them to rebuild for 9.4.1 might be OK.

regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..39ceaf2 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*** ExplainPrintPlan(ExplainState *es, Query
*** 563,568 
--- 563,570 
  	es->rtable = queryDesc->plannedstmt->rtable;
  	ExplainPreScanNode(queryDesc->planstate, &rels_used);
  	es->rtable_names = select_rtable_names_for_explain(es->rtable, rels_used);
+ 	es->deparse_cxt = deparse_context_for_plan_rtable(es->rtable,
+ 	  es->rtable_names);
  	ExplainNode(queryDesc->planstate, NIL, NULL, NULL, es);
  }
  
*** show_plan_tlist(PlanState *planstate, Li
*** 1678,1687 
  		return;
  
  	/* Set up deparsing context */
! 	context = deparse_context_for_planstate((Node *) planstate,
! 			ancestors,
! 			es->rtable,
! 			es->rtable_names);
  	useprefix = list_length(es->rtable) > 1;
  
  	/* Deparse each result column (we now include resjunk ones) */
--- 1680,1688 
  		return;
  
  	/* Set up deparsing context */
! 	context = set_deparse_context_planstate(es->deparse_cxt,
! 			(Node *) planstate,
! 			ancestors);
  	useprefix = list_length(es->rtable) > 1;
  
  	/* Deparse each result column (we now include resjunk ones) */
*** show_expression(Node *node, const char *
*** 1710,1719 
  	char	   *exprstr;
  
  	/* Set up deparsing context */
! 	context = deparse_context_for_planstate((Node *) planstate,
! 			ancestors,
! 			es->rtable,
! 			es->rtable_names);
  
  	/* Deparse the expression */
  	exprstr = deparse_expression(node, context, useprefix, false);
--- 1711,1719 
  	char	   *exprstr;
  
  	/* Set up deparsing context */
! 	context = set_deparse_context_planstate(es->deparse_cxt,
! 			(Node *) planstate,
! 			ancestors);
  
  	/* Deparse the expression */
  	exprstr = deparse_expression(node, context, useprefix, false);
*** show_sort_group_keys(PlanState *planstat
*** 1855,1864 
  		return;
  
  	/* Set up deparsing context */
! 	context = deparse_context_for_planstate((Node *) planstate,
! 			ancestors,
! 			es->rtable,
! 			es->rtable_names);
  	useprefix = (list_length(es->rtable) > 1 || es->verbose);
  
  	for (keyno = 0; keyno < nkeys; keyno++)
--- 1855,1863 
  		return;
  
  	/* Set up deparsing context */
! 	context = set_deparse_context_planstate(es->deparse_cxt,
! 			(Node *) planstate,
! 			ancestors);
  	useprefix = (list_length(es->rtable) > 1 || es->verbose);
  
  	for (keyno = 0; keyno < nkeys; keyno++)
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index dd748ac..cacd72c 100644
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
*** deparse_context_for(const char *aliasnam
*** 2520,2526 
  }
  
  /*
!  * deparse_context_for_planstate	- Build deparse context for a plan
   *
   * When deparsing an expression in a Plan tree, we might have to resolve
   * OUTER_VAR, INNER_VAR, or INDEX_VAR references.  To do this, the caller must
--- 2520,2562 
  }
  
  /*
!  * deparse_context_for_plan_rtable - Build deparse context for a plan's rtable
!  *
!  * When deparsing an expression in a Plan tree, we use the plan's rangetable
!  * to resolve names of simple Vars.  The initialization of column names for
!  * this is rather expensive if the rangetable is large, and it'll be the same
!  * for every expression in the Plan tree; so we do it just once and re-use
!  * the result of this function for each expression.  (Note that the result
!  * is not usable until set_deparse_context_planstate() is applied to it.)
!  *
!  * In addition to the plan's 

Re: [HACKERS] EXEC_BACKEND + logging_collector=on is broken

2015-01-13 Thread Andres Freund
On 2015-01-13 20:21:55 +0100, Andres Freund wrote:
 On 2015-01-13 11:10:06 -0800, Magnus Hagander wrote:
  EXEC_BACKEND on Windows doesn't actually use a tempfile though, so I'm
  guessing that's it.
 
 Ah! Then this really is fairly harmless. Will fix and backpatch anyway,
 but the number of affected people should be pretty much zero.

Done that now.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2015-01-13 Thread Petr Jelinek

On 13/01/15 13:24, Tomas Vondra wrote:

On 12.1.2015 22:33, Petr Jelinek wrote:

On 15/12/14 11:36, Petr Jelinek wrote:

On 10/12/14 03:33, Petr Jelinek wrote:

On 24/11/14 12:16, Heikki Linnakangas wrote:

About the rough edges:
- The AlterSequence is not prettiest code around as we now have to
create new relation when sequence AM is changed and I don't know how to
do that nicely
- I am not sure if I did the locking, command order and dependency
handling in AlterSequence correcly


This version does AlterSequence differently and better. Didn't attach
the gapless sequence again as that one is unchanged.




And another version, separated into patch-set of 3 patches.
First patch contains the seqam patch itself, not many changes there,
mainly docs/comments related. What I wrote in the previous email for
version 3 still applies.


I did a review of the first part today - mostly by reading through the
diff. I plan to do a bit more thorough testing in a day or two. I'll
also look at the two (much smaller) patches.



Thanks!


comments/questions/general nitpicking:

(1) Why treat an empty string as equal to 'local'? Isn't enforcing the
 actual value a better approach?



Álvaro also mentioned this on IM; I already changed it to a saner normal GUC 
with 'local' as the default value in my working copy.



(2) NITPICK: Maybe we could use access_method in the docs (instead of
 sequence_access_method), as the 'sequence' part is clear from the
 context?


Yes.


(3) Why index_reloptions / sequence_reloptions when both do exactly the
 same thing (call to common_am_reloptions)? I'd understand this if
 the patch(es) then change the sequence_reloptions() but that's not
 happening. Maybe that's expected to happen?


That's a leftover from the original design where the AM was supposed to call 
this. Since this is not exposed to the AM anymore, I think it can be a single 
function now.




(4) DOCS: Each sequence can only use one access method at a time.

 Does that mean a sequence can change the access method during its
 lifetime? My understanding is that the AM is fixed after creating
 the sequence?



Oh, I forgot to add ALTER SEQUENCE USING to the docs. You can change the AM, 
and even though you probably don't want to do it often, it's useful for 
migrations.



(8) check_default_seqam without a transaction

  * If we aren't inside a transaction, we cannot do database access so
  * cannot verify the name. Must accept the value on faith.

 In which situation this happens? Wouldn't it be better to simply
 enforce the transaction and fail otherwise?


Reading postgresql.conf during startup, I don't think we want to fail in 
that scenario ;)



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] POLA violation with \c service=

2015-01-13 Thread David Fetter
On Sat, Jan 10, 2015 at 04:41:16PM -0800, David Fetter wrote:
 On Sat, Jan 10, 2015 at 09:30:57AM +0100, Erik Rijkers wrote:
  On Fri, January 9, 2015 20:15, David Fetter wrote:
   [psql_fix_uri_service_003.patch]
  
  Applies on master; the feature (switching services) works well but a \c 
  without any parameters produces a segfault:
  
  (centos 6.6, 4.9.2, 64-bit)
  
  
  $ echo -en $PGSERVICEFILE\n$PGSERVICE\n$PGPORT\n
  /home/aardvark/.pg_service
  service_pola
  6968
  
  $ psql
  Timing is on.
  psql (9.5devel_service_pola_20150109_2340_ac7009abd228)
  Type "help" for help.
  
  testdb=# \c service=HEAD
  You are now connected to database testdb as user aardvark via socket in 
  /tmp at port 6545.
  testdb=# \c service=service_pola
  You are now connected to database testdb as user aardvark via socket in 
  /tmp at port 6968.
  testdb=# \c
  Segmentation fault (core dumped)
 
 Fixed by running that function only if the argument exists.
 
 More C cleanups, too.

Added to the upcoming commitfest.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




[HACKERS] authentication_timeout ineffective for replication connections

2015-01-13 Thread Andres Freund
Hi,

I just noticed that authentication_timeout is ineffective for
replication=true type connections. That's because walsender doesn't
register a SIGINT handler and authentication_timeout relies on having
one.

There's no problem with reading the initial startup packet
(ProcessStartupPacket/BackendInitialize) because we use a separate
handler there. But once that's done, before finishing authentication,
WalSndSignals() will have set SIGINT's handler to SIG_IGN.

Demo python program attached. You'll only see the problem if the
authentication method requires a password/additional packets.

I think we could fix this by simply mapping SIGINT to die() instead of
SIG_IGN.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
import socket
import time
import struct
import sys
import os

if len(sys.argv) != 6:
    print("wrong number of params: host port database user replication")
    sys.exit(1)

print "connecting"
conn = socket.create_connection((sys.argv[1], sys.argv[2]))

data = bytes();

# send protocol version
data = struct.pack('I', 0x0003)

# database
data += "database"
data += struct.pack('B', 0)
data += sys.argv[3]
data += struct.pack('B', 0)

# user
data += "user"
data += struct.pack('B', 0)
data += sys.argv[4]
data += struct.pack('B', 0)

# replication
data += "replication"
data += struct.pack('B', 0)
data += sys.argv[5]
data += struct.pack('B', 0)

# start packet terminator
data += struct.pack('B', 0)

print "sending startup packet"
# length of the length + length of the startup packet
conn.send(struct.pack('i', 4 + len(data)))

# version
conn.send(data)

print "and going idle"
# and now sleep
time.sleep(3600)



Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-13 Thread Dean Rasheed
On 13 January 2015 at 13:50, Stephen Frost sfr...@snowman.net wrote:
 * Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 12 January 2015 at 22:16, Stephen Frost sfr...@snowman.net wrote:
  Alright, here's an updated patch which doesn't return any detail if no
  values are visible or if only a partial key is visible.
 
  Please take a look.  I'm not thrilled with simply returning an empty
  string and then checking that for BuildIndexValueDescription and
  ExecBuildSlotValueDescription, but I figured we might have other users
  of BuildIndexValueDescription and I wasn't seeing a particularly better
  solution.  Suggestions welcome, of course.

 Actually I'm starting to wonder if it's even worth bothering about the
 index case. To leak information, you'd need to have a composite key
 for which you only had access to a subset of the key columns (which in
 itself seems like a pretty rare case), and then you'd need to specify
 new values to make the entire key conflict with an existing value, at
 which point the fact that an exception is thrown tells you that the
 values in the index must be the same as your new values. I'm
 struggling to imagine a realistic scenario where this would be a
 problem.

 I'm not sure that I follow..  From re-reading the above a couple times,
 I take it you're making an argument that people would be silly to set
 their database up that way, but that's not an argument we can stand on
 when it comes to privileges.  Additionally, as the regression tests
 hopefully show, if you have update on one column of a composite key, you
 could find out the entire key today by issuing an update against that
 column to set it to the same value throughout.  You don't need to know
 what the rest of the key is but only that two records somewhere have the
 same values except for the one column you have update rights on.


Hmm, yes I guess that's right.

One improvement we could trivially make is to only do this for
multi-column indexes. If there is only one column there is no danger
of information leakage, right?

 Also, if we change BuildIndexValueDescription() in this way, it's
 going to make it more or less useless for updatable views, since in
 the most common case the user won't have permissions on the underlying
 table.

 That's certainly something to consider.  I looked at trying to get which
 columns the user had provided down to BuildIndexValueDescription, but I
 couldn't find a straight-forward way to do that as it involves the index
 AMs which we can't change (and I'm not really sure we'd want to anyway).


Yeah, I couldn't see any easy way of doing it. Two possibilities sprang
to mind -- (1) wrap the index update in a PG_TRY() and add the detail
in the catch block, or (2) track the currently active EState and make
GetModifiedColumns() into an exported function taking a single EState
argument (the EState has the currently active ResultRelInfo on it).
Neither of those alternatives seems particularly attractive to me
though.

Regards,
Dean




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Alexey Bashtanov

On 13.01.2015 16:47, Heikki Linnakangas wrote:

Hmm, something like the attached? Seems reasonable...

- Heikki


Yes, I have tested something like this; it stopped eating memory.

Just one small notice on the patch you attached: maybe it would be 
safer to switch to oldcxt before calling 
ExplainPropertyText/ExplainPropertyList?
Otherwise, are you pretty sure ExplainPropertyText and 
ExplainPropertyList are not going to perform any pallocs without immediate 
pfrees in the current memory context, even in the future?


Regards,
  Alexey Bashtanov





Re: [HACKERS] Check that streaming replica received all data after master shutdown

2015-01-13 Thread Vladimir Borodin

On 05 Jan 2015, at 18:15, Vladimir Borodin r...@simply.name wrote:

 Hi all.
 
 I have a simple script for planned switchover of PostgreSQL (9.3 and 9.4) 
  master to one of its replicas. This script checks a lot of things before 
  doing it, and one of them is that all data from the master has been received by 
  the replica that is going to be promoted. Right now the check is done like below:
 
 On the master:
 
 postgres@pgtest03d ~ $ psql -t -A -c 'select pg_current_xlog_location();'
 0/3390
 postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast
  waiting for server to shut down.... done
 server stopped
 postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head
 pg_control version number:937
 Catalog version number:   201306121
 Database system identifier:   6061800518091528182
 Database cluster state:   shut down
 pg_control last modified: Mon 05 Jan 2015 06:47:57 PM MSK
 Latest checkpoint location:   0/3428
 Prior checkpoint location:0/3328
 Latest checkpoint's REDO location:0/3428
 Latest checkpoint's REDO WAL file:001B0034
 Latest checkpoint's TimeLineID:   27
 postgres@pgtest03d ~ $
 
 On the replica (after shutdown of master):
 
  postgres@pgtest03g ~ $ psql -t -A -c "select 
  pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/3428');"
 104
 postgres@pgtest03g ~ $
 
  These 104 bytes seem to be the size of the shutdown checkpoint record (as I can 
  understand from pg_xlogdump output).
 
 postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump -s 
 0/3390 -t 27
 rmgr: XLOGlen (rec/tot):  0/32, tx:  0, lsn: 
 0/3390, prev 0/3328, bkp: , desc: xlog switch
 rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
 0/3428, prev 0/3390, bkp: , desc: checkpoint: redo 0/3428; 
 tli 27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0; 
 oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; 
 shutdown
 pg_xlogdump: FATAL:  error in WAL record at 0/3428: record with zero 
 length at 0/3490
 
 postgres@pgtest03g ~/9.3/data/pg_xlog $
 
  I’m not sure that these 104 bytes will always be 104 bytes, which a strict 
  equality check would rely on. Could it change in the future? Or is there a better 
  way to understand that a streaming replica has received all data after master 
  shutdown? The check that pg_xlog_location_diff returns 104 bytes seems a bit 
  strange.
 

+hackers

Could anyone help?

Thanks.
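For what it's worth, the arithmetic that pg_xlog_location_diff performs can be sketched client-side in Python. This is just an illustration with hypothetical helper names, not part of the script under discussion: an LSN of the form 'X/Y' is two hex halves of a 64-bit WAL byte position, so the diff is plain integer subtraction.

```python
def parse_lsn(lsn):
    # An LSN like '1B/34000028' is 'xlogid/xrecoff': the high and low
    # 32-bit halves of a 64-bit WAL byte position, in hex.
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) + int(lo, 16)

def lsn_diff(a, b):
    # Byte distance between two LSNs, like pg_xlog_location_diff(a, b).
    return parse_lsn(a) - parse_lsn(b)
```

With helpers like these, the check amounts to comparing the replica's last replay location against the master's final checkpoint location plus the size of the shutdown checkpoint record.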

 Thanks.
 
 --
 May the force be with you...
 http://simply.name
 
 
 
 


--
May the force be with you...
http://simply.name






Re: [HACKERS] Memory leak in receivelog.c when receiving stream

2015-01-13 Thread Michael Paquier
On Tue, Jan 13, 2015 at 5:45 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 This looks like a false positive to me. PQgetCopyData() will only return a
 buffer if its return value is > 0
Right. Sorry for the noise.
-- 
Michael




Re: [HACKERS] Removing INNER JOINs

2015-01-13 Thread David Rowley
On 12 January 2015 at 15:57, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jan 8, 2015 at 6:31 AM, David Rowley dgrowle...@gmail.com wrote:
  I'd be keen to know what people's thoughts are about the
 nodeAlternativePlan
  only surviving until the plan is initialised.

 I find it scary, although sometimes I am easily frightened.


Ok, remember I'm not actually modifying the plan like I was in the earlier
version of the patch. The Alternative Plan node simply initialises the
correct plan and, instead of returning its own initialised state, it
returns the initialised state of the selected plan's root node.

I have to admit, it didn't really become clear in my head whether the frenzy
of discussion above gave any sort of indication that this Alternative plan
node would remain and be shown in the EXPLAIN output, or whether the
appropriate plan would be selected during plan initialisation and the new
root node would become that of the selected plan. When I was implementing
what was discussed, I decided that it would be better to choose the correct
plan during initialisation rather than transitioning through the Alternative
plan node for each row. As proved already on this thread, transitioning
through needless nodes adds needless executor time overhead.

Also if we kept the alternative plan node, then I think the explain would
look rather weird and frighteningly complex, as it would effectively be 2
plans in 1.

I'm actually quite happy with how simple the executor changes became. It's
far more simple and clean than the node stripping code that I had in an
earlier version. The parts of the patch that I'm concerned might raise a
few eyebrows are the changes to query_planner() as it now returns a List of
RelOptInfo instead of a RelOptInfo.

Regards

David Rowley


Re: [HACKERS] [RFC] LSN Map

2015-01-13 Thread Marco Nenciarini
On 08/01/15 20:18, Jim Nasby wrote:
 On 1/7/15, 3:50 AM, Marco Nenciarini wrote:
 The current implementation tracks only heap LSN. It currently does not
 track any kind of indexes, but this can be easily added later.
 
 Would it make sense to do this at a buffer level, instead of at the heap
 level? That means it would handle both heap and indexes.
  I don't know if LSN is visible that far down though.

Where exactly you are thinking to handle it?

 
 Also, this pattern is repeated several times; it would be good to put it
 in it's own function:
 + lsnmap_pin(reln, blkno, lmbuffer);
 + lsnmap_set(reln, blkno, lmbuffer, lsn);
 + ReleaseBuffer(lmbuffer);

Right.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Parallel Seq Scan

2015-01-13 Thread John Gorman
On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net wrote:
  So, for my 2c, I've long expected us to parallelize at the relation-file
  level for these kinds of operations.  This goes back to my other
  thoughts on how we should be thinking about parallelizing inbound data
  for bulk data loads but it seems appropriate to consider it here also.
  One of the issues there is that 1G still feels like an awful lot for a
  minimum work size for each worker and it would mean we don't parallelize
  for relations less than that size.

 Yes, I think that's a killer objection.


One approach that has worked well for me is to break big jobs into much
smaller bite-size tasks. Each task is small enough to complete quickly.

We add the tasks to a task queue and spawn a generic worker pool which eats
through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker to
finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and
reported on and becomes considerably more predictable.
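As a rough illustration of the pattern described above -- not Postgres code, just a minimal sketch using Python threads and a shared task queue:

```python
import queue
import threading

def run_pool(tasks, nworkers=4):
    # Fill a shared queue with small, bite-size tasks...
    q = queue.Queue()
    for t in tasks:
        q.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        # ...and let a generic worker pool eat through it. Workers exit
        # as soon as the queue is drained, so nobody sits idle waiting
        # on one huge task.
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

The point of the sketch is only the shape of the design: many small tasks, one shared queue, a fixed pool of generic workers.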


Re: [HACKERS] Unused variables in hstore_to_jsonb

2015-01-13 Thread Michael Paquier
On Tue, Jan 13, 2015 at 5:36 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 You'll need to use (void) pushJsonbValue(...), otherwise you'll just get a
 different warning. See commit c8315930.
Oh, I see. So this portion in contrib/ had visibly been missed.
Attached is a new patch.
-- 
Michael
*** a/contrib/hstore/hstore_io.c
--- b/contrib/hstore/hstore_io.c
***
*** 1338,1344  hstore_to_jsonb(PG_FUNCTION_ARGS)
  	JsonbParseState *state = NULL;
  	JsonbValue *res;
  
! 	res = pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
  
  	for (i = 0; i < count; i++)
  	{
--- 1338,1344 
  	JsonbParseState *state = NULL;
  	JsonbValue *res;
  
! 	(void) pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
  
  	for (i = 0; i < count; i++)
  	{
***
*** 1349,1355  hstore_to_jsonb(PG_FUNCTION_ARGS)
  		key.val.string.len = HS_KEYLEN(entries, i);
  		key.val.string.val = HS_KEY(entries, base, i);
  
! 		res = pushJsonbValue(&state, WJB_KEY, &key);
  
  		if (HS_VALISNULL(entries, i))
  		{
--- 1349,1355 
  		key.val.string.len = HS_KEYLEN(entries, i);
  		key.val.string.val = HS_KEY(entries, base, i);
  
! 		(void) pushJsonbValue(&state, WJB_KEY, &key);
  
  		if (HS_VALISNULL(entries, i))
  		{
***
*** 1361,1367  hstore_to_jsonb(PG_FUNCTION_ARGS)
  			val.val.string.len = HS_VALLEN(entries, i);
  			val.val.string.val = HS_VAL(entries, base, i);
  		}
! 		res = pushJsonbValue(&state, WJB_VALUE, &val);
  	}
  
  	res = pushJsonbValue(&state, WJB_END_OBJECT, NULL);
--- 1361,1367 
  			val.val.string.len = HS_VALLEN(entries, i);
  			val.val.string.val = HS_VAL(entries, base, i);
  		}
! 		(void) pushJsonbValue(&state, WJB_VALUE, &val);
  	}
  
  	res = pushJsonbValue(&state, WJB_END_OBJECT, NULL);
***
*** 1385,1391  hstore_to_jsonb_loose(PG_FUNCTION_ARGS)
  
  	initStringInfo(&tmp);
  
! 	res = pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
  
  	for (i = 0; i < count; i++)
  	{
--- 1385,1391 
  
  	initStringInfo(&tmp);
  
! 	(void) pushJsonbValue(&state, WJB_BEGIN_OBJECT, NULL);
  
  	for (i = 0; i < count; i++)
  	{
***
*** 1396,1402  hstore_to_jsonb_loose(PG_FUNCTION_ARGS)
  		key.val.string.len = HS_KEYLEN(entries, i);
  		key.val.string.val = HS_KEY(entries, base, i);
  
! 		res = pushJsonbValue(&state, WJB_KEY, &key);
  
  		if (HS_VALISNULL(entries, i))
  		{
--- 1396,1402 
  		key.val.string.len = HS_KEYLEN(entries, i);
  		key.val.string.val = HS_KEY(entries, base, i);
  
! 		(void) pushJsonbValue(&state, WJB_KEY, &key);
  
  		if (HS_VALISNULL(entries, i))
  		{
***
*** 1471,1477  hstore_to_jsonb_loose(PG_FUNCTION_ARGS)
  val.val.string.val = HS_VAL(entries, base, i);
  			}
  		}
! 		res = pushJsonbValue(&state, WJB_VALUE, &val);
  	}
  
  	res = pushJsonbValue(&state, WJB_END_OBJECT, NULL);
--- 1471,1477 
  				val.val.string.val = HS_VAL(entries, base, i);
  			}
  		}
! 		(void) pushJsonbValue(&state, WJB_VALUE, &val);
  	}
  
  	res = pushJsonbValue(&state, WJB_END_OBJECT, NULL);



Re: [HACKERS] Escaping from blocked send() reprised.

2015-01-13 Thread Andres Freund
On 2015-01-12 00:40:50 +0100, Andres Freund wrote:
 Fixed in what I've since pushed (as Heikki basically was ok with the
 patch sent a couple months back, modulo some fixes)...

I'd not actually pushed that patch... I had pushed some patches
(barriers, atomics), but had decided to hold off on this. I've now done
so.

I've mentioned the portability concerns over select() bugs in the commit
message & a comment. ATM I'm not inclined to add a relatively elaborate
test for the bug on pretty fringe platforms.

Thanks for looking at this!

I plan to continue with committing
1) Commonalize process startup code
2) Add a default local latch for use in signal handlers
3) Use a nonblocking socket for FE/BE communication and block using latches

pretty soon.

As we already seem to assume that WaitLatch() is signal safe/reentrant
(c.f. walsender.c), I'm fine with committing 3) in isolation, without
4). I need a test that properly exercises catchup interrupts before
committing that.

Once I have that test I plan to commit
4) Introduce and use infrastructure for interrupt processing during
client reads.

I'd like some input from others what they think about the problem that
5) Process 'die' interrupts while reading/writing from a socket.

can reduce the likelihood of clients getting the error message. I
personally think that's more than outweighed by not having backends
stuck (save quickdie) for a long time when the client is gone/stuck. I
think the middleground in the patch to only process die events when
actually blocked in writes reduces the likelihood of this sufficiently.

I have hacks on top of this to get rid of ImmediateInterrupt altogether,
although I'm not sure how well this will work for some parts of
auth/crypt.c. Everything else, including the deadlock checker, seems
quite doable.

Andres




Re: [HACKERS] Unused variables in hstore_to_jsonb

2015-01-13 Thread Heikki Linnakangas

On 01/13/2015 01:09 PM, Michael Paquier wrote:

On Tue, Jan 13, 2015 at 5:36 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

You'll need to use (void) pushJsonbValue(...), otherwise you'll just get a
different warning. See commit c8315930.

Oh, I see. So this portion in contrib/ has been visibly missing.
Attached is a new patch.


Thanks, committed.

- Heikki




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Heikki Linnakangas

On 01/13/2015 02:08 PM, Alexey Bashtanov wrote:

I found that the EXPLAIN command takes a very large amount of memory to execute when huge
unions are used.
For example the following sql
-- begin sql
create table t (a000 int, a001 int, ... a099 int);
explain select * from (
  select a001 a from t
  union all
  select a001 a from t
  union all
  ... (1000 times) ...
  union all
  select a001 a from t
) _ where a = 1;
-- end sql
took more than 1GB of memory to execute.

Namely, converting the plan to a human-readable form causes the excessive
memory usage, not the planning itself.

By varying the parameters and reading source code I determined that
memory usage linearly depends on (plan nodes count)*(overall columns
count), thus it quadratically depends on number of tables unionized.

To remove this excessive memory usage I propose
to run deparse_context_for_planstate+deparse_expression in a separate
memory context and free it after a plan node is generated.


Hmm, something like the attached? Seems reasonable...

- Heikki

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..4509aab 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -27,6 +27,7 @@
 #include "utils/builtins.h"
 #include "utils/json.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
@@ -263,6 +264,15 @@ ExplainInitState(ExplainState *es)
 	es->costs = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
+
+	/*
+	 * Create a temporary memory context, used for deparsing expressions.
+	 */
+	es->tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
+									   "Temporary EXPLAIN context",
+									   ALLOCSET_DEFAULT_MINSIZE,
+									   ALLOCSET_DEFAULT_INITSIZE,
+									   ALLOCSET_DEFAULT_MAXSIZE);
 }
 
 /*
@@ -1664,6 +1674,7 @@ show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
 	List	   *result = NIL;
 	bool		useprefix;
 	ListCell   *lc;
+	MemoryContext oldcxt;
 
 	/* No work if empty tlist (this occurs eg in bitmap indexscans) */
 	if (plan->targetlist == NIL)
@@ -1677,6 +1688,9 @@ show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
 	if (IsA(plan, RecursiveUnion))
 		return;
 
+	/* Switch to a temporary memory context to limit memory usage */
+	oldcxt = MemoryContextSwitchTo(es->tmpCxt);
+
 	/* Set up deparsing context */
 	context = deparse_context_for_planstate((Node *) planstate,
 			ancestors,
@@ -1696,6 +1710,10 @@ show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
 
 	/* Print results */
 	ExplainPropertyList("Output", result, es);
+
+	/* Clean up */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextReset(es->tmpCxt);
 }
 
 /*
@@ -1708,6 +1726,10 @@ show_expression(Node *node, const char *qlabel,
 {
 	List	   *context;
 	char	   *exprstr;
+	MemoryContext oldcxt;
+
+	/* Switch to a temporary memory context to limit memory usage */
+	oldcxt = MemoryContextSwitchTo(es->tmpCxt);
 
 	/* Set up deparsing context */
 	context = deparse_context_for_planstate((Node *) planstate,
@@ -1720,6 +1742,10 @@ show_expression(Node *node, const char *qlabel,
 
 	/* And add to es->str */
 	ExplainPropertyText(qlabel, exprstr, es);
+
+	/* Clean up */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextReset(es->tmpCxt);
 }
 
 /*
@@ -1850,10 +1876,14 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 	bool		useprefix;
 	int			keyno;
 	char	   *exprstr;
+	MemoryContext oldcxt;
 
 	if (nkeys <= 0)
 		return;
 
+	/* Switch to a temporary memory context to limit memory usage */
+	oldcxt = MemoryContextSwitchTo(es->tmpCxt);
+
 	/* Set up deparsing context */
 	context = deparse_context_for_planstate((Node *) planstate,
 			ancestors,
@@ -1877,6 +1907,10 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
 	}
 
 	ExplainPropertyList(qlabel, result, es);
+
+	/* Clean up */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextReset(es->tmpCxt);
 }
 
 /*
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 6e26950..0a819e9 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -41,6 +41,8 @@ typedef struct ExplainState
 	List	   *rtable_names;	/* alias names for RTEs */
 	int			indent;			/* current indentation level */
 	List	   *grouping_stack; /* format-specific grouping state */
+
+	MemoryContext tmpCxt;		/* working context for deparsing expressions */
 } ExplainState;
 
 /* Hook for plugins to get control in ExplainOneQuery() */



Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC

2015-01-13 Thread Gabriele Bartolini
Hi Marco,

  could you please send an updated version of the patch against the current
HEAD in order to facilitate reviewers?

Thanks,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia - Managing Director
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it

2015-01-07 11:00 GMT+01:00 Marco Nenciarini marco.nenciar...@2ndquadrant.it
:

 On 06/01/15 14:26, Robert Haas wrote:
  I suggest leaving this out altogether for the first version.  I can
  think of three possible ways that we can determine which blocks need
  to be backed up.  One, just read every block in the database and look
  at the LSN of each one.  Two, maintain a cache of LSN information on a
  per-segment (or smaller) basis, as you suggest here.  Three, scan the
  WAL generated since the incremental backup and summarize it into a
  list of blocks that need to be backed up.  This last idea could either
  be done when the backup is requested, or it could be done as the WAL
  is generated and used to populate the LSN cache.  In the long run, I
  think some variant of approach #3 is likely best, but in the short
  run, approach #1 (scan everything) is certainly easiest.  While it
  doesn't optimize I/O, it still gives you the benefit of reducing the
  amount of data that needs to be transferred and stored, and that's not
  nothing.  If we get that much working, we can improve things more
  later.
 

 Hi,
 The patch now uses the approach #1, but I've just sent a patch that uses
 the #2 approach.

 54ad016e.9020...@2ndquadrant.it

 Regards,
 Marco

 --
 Marco Nenciarini - 2ndQuadrant Italy
 PostgreSQL Training, Services and Support
 marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it




[HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Alexey Bashtanov

Hello!

I found that the EXPLAIN command takes a very large amount of memory to execute when huge 
unions are used.

For example the following sql
-- begin sql
create table t (a000 int, a001 int, ... a099 int);
explain select * from (
select a001 a from t
union all
select a001 a from t
union all
... (1000 times) ...
union all
select a001 a from t
) _ where a = 1;
-- end sql
took more than 1GB of memory to execute.

Namely, converting the plan to a human-readable form causes the excessive 
memory usage, not the planning itself.


By varying the parameters and reading source code I determined that
memory usage linearly depends on (plan nodes count)*(overall columns 
count), thus it quadratically depends on number of tables unionized.


To remove this excessive memory usage I propose
to run deparse_context_for_planstate+deparse_expression in a separate 
memory context and free it after a plan node is generated.


Any reasons to treat this idea as bad?

BTW in this case EXPLAIN execution also takes quite long (tens of 
seconds in my case). But I have no immediate ideas how to improve it.


Regards,
  Alexey Bashtanov




Re: [HACKERS] Parallel Seq Scan

2015-01-13 Thread John Gorman
On Tue, Jan 13, 2015 at 7:25 AM, John Gorman johngorm...@gmail.com wrote:



 On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com
 wrote:

 On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost sfr...@snowman.net
 wrote:
  So, for my 2c, I've long expected us to parallelize at the relation-file
  level for these kinds of operations.  This goes back to my other
  thoughts on how we should be thinking about parallelizing inbound data
  for bulk data loads but it seems appropriate to consider it here also.
  One of the issues there is that 1G still feels like an awful lot for a
  minimum work size for each worker and it would mean we don't parallelize
  for relations less than that size.

 Yes, I think that's a killer objection.


 One approach that has worked well for me is to break big jobs into much
 smaller bite-size tasks. Each task is small enough to complete quickly.

 We add the tasks to a task queue and spawn a generic worker pool which
 eats through the task queue items.

 This solves a lot of problems.

 - Small to medium jobs can be parallelized efficiently.
 - No need to split big jobs perfectly.
 - We don't get into a situation where we are waiting around for a worker
 to finish chugging through a huge task while the other workers sit idle.
 - Worker memory footprint is tiny so we can afford many of them.
 - Worker pool management is a well known problem.
 - Worker spawn time disappears as a cost factor.
 - The worker pool becomes a shared resource that can be managed and
 reported on and becomes considerably more predictable.


I forgot to mention that a running task queue can provide metrics such as
current utilization, current average throughput, current queue length and
estimated queue wait time. These can become dynamic cost factors in
deciding whether to parallelize.


Re: [HACKERS] Sequence Access Method WIP

2015-01-13 Thread Tomas Vondra
On 12.1.2015 22:33, Petr Jelinek wrote:
 On 15/12/14 11:36, Petr Jelinek wrote:
 On 10/12/14 03:33, Petr Jelinek wrote:
 On 24/11/14 12:16, Heikki Linnakangas wrote:

 About the rough edges:
  - The AlterSequence is not the prettiest code around, as we now have to
  create a new relation when the sequence AM is changed and I don't know how to
  do that nicely
  - I am not sure if I did the locking, command order and dependency
  handling in AlterSequence correctly

 This version does AlterSequence differently and better. Didn't attach
 the gapless sequence again as that one is unchanged.


 
 And another version, separated into patch-set of 3 patches.
 First patch contains the seqam patch itself, not many changes there,
 mainly docs/comments related. What I wrote in the previous email for
 version 3 still applies.

I did a review of the first part today - mostly by reading through the
diff. I plan to do a bit more thorough testing in a day or two. I'll
also look at the two (much smaller) patches.

comments/questions/general nitpicking:

(1) Why treat an empty string as equal to 'local'? Isn't enforcing the
actual value a better approach?

(2) NITPICK: Maybe we could use access_method in the docs (instead of
sequence_access_method), as the 'sequence' part is clear from the
context?

(3) Why index_reloptions / sequence_reloptions when both do exactly the
same thing (call common_am_reloptions)? I'd understand this if
the patch(es) then changed sequence_reloptions(), but that's not
happening. Maybe that's expected to happen later?

(4) DOCS: Each sequence can only use one access method at a time.

Does that mean a sequence can change the access method during its
lifetime? My understanding is that the AM is fixed after creating
the sequence?

(5) DOCS/NITPICK: SeqAM / SeqAm ... (probably should be the same?)

(6) cache lookup failed for relation %u

I believe this should reference 'sequence access method' instead of
a relation.

(7) seqam_init

I believe this is a bug (argnull[3] is set twice and argnull[4] is never set):

+   fcinfo.argnull[0] = seqparams == NULL;
+   fcinfo.argnull[1] = reloptions_parsed == NULL;
+   fcinfo.argnull[2] = false;
+   fcinfo.argnull[3] = false;
+   fcinfo.argnull[3] = false;

(8) check_default_seqam without a transaction

 * If we aren't inside a transaction, we cannot do database access so
 * cannot verify the name.  Must accept the value on faith.

In which situation this happens? Wouldn't it be better to simply
enforce the transaction and fail otherwise?

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Async execution of postgres_fdw.

2015-01-13 Thread Kyotaro HORIGUCHI
Hello. This is a version 3 patch.

- PgFdwConnState is removed

- PgFdwConn is isolated as a separate module.

- The state transition was simplified, I think.

- Comment about multiple scans on a connection is added.

- The issue of PREPARE is not addressed yet.

- It is to show how the new style looks, so it is lacking
  comments for the PgFdwConn functions.

- Rebased to current master.

===
  I added a comment describing the logic and meaning of the
  states just above the enum declaration.
 
 
 This needs to be clarified further. But that can wait till we finalise the
 approach and the patch. Esp. comment below is confusing
 1487 + * PGFDW_CONN_SYNC_RUNNING is rather an internal state in
 1488 + * fetch_more_data(). It indicates that the function shouldn't send
 the next
  1489 + * fetch request after getting the result.
 
 I couldn't get the meaning of the second sentence, esp. it's connection
 with synchronous-ness.

In this version, I removed PgFdwConnState. Now what indicates
whether an async fetch is running is the existence of
async_fetch. I think the complicated state transition has
disappeared.

   The if condition if (entry->conn == NULL) in GetConnection(), used to
  track
   whether there is a PGConn active for the given entry, now it tracks
  whether
   it has PgFdwConn for the same.
 
  After some consideration, I decided to make PgFdwConn be
  handled more similarly to PGconn. This patch has shrunk as a
  result and became clearer.
 
 
 I think it's still prone to segfaults considering two pointer indirections.

PGconn itself already involves two levels of indirection, and PgFdwConn
has hidden its details mainly using macros. I may have misunderstood
you, but if you're worried that PgFdwConn.pgconn can be set from
anywhere, we should separate PgFdwConn into another
C module and hide all the details as PGconn does. It is shown as
the separate patch. But I feel it a bit overdone because it is not
an end-user interface.

  I wrote the outline of the logic in the comment for enum
  PgFdwConnState in postgres_fdw.h. Does it make sense?
 
 
 The point about two different ForeignScan nodes using the same connection
 needs some clarification, I guess. It's not very clear, why would there be
 more queries run on the same connection. I know why this happens, but it's
 important to mention it somewhere. If it's already mentioned somewhere in
 the file, sorry for not paying attention to that.

Yeah. It is just what I stumbled on. I changed the comment in
fetch_more_data() like below. Does it make sense?

| /*
|  * On the current postgres_fdw implementation, multiple PgFdwScanState
|  * on the same foreign server and mapped user share the same
|  * connection to the remote server (see GetConnection() in
|  * connection.c) and individual scans on it are separated using
|  * cursors. Since one connection cannot accept two or more
|  * asynchronous queries simultaneously, we should stop the async
|  * fetching if the another scan comes.
|  */
| 
| if (PFCgetNscans(conn) > 1)
|   PFCsetAsyncScan(conn, NULL);

 I think there is more chance that there will more than one ForeignScan
 nodes trying interact with a foreign server, even after the push-down work.
 The current solution doesn't address that. We actually need parallel
 querying in two cases
 1. Querying multiple servers in parallel
 2. Querying same server (by two querists) in parallel within the same query
 e.g. an un-pushable join.
 
 We need a solution which is work in both the cases.

The first point is realized by this patch with some
limitations. The second point is what my old patch did: it made a
dedicated connection for individual scans up to some fixed number
aside from the main connection; the overflowed scans then go to the
main connection and are done in the manner the unpatched
postgres_fdw does.

I was thinking that the 'some fixed number' could be set by a
parameter of a foreign server, but I got a comment that it could
fill up the remote server without reasonable load and/or bandwidth
management. So I abandoned the multiple-connection solution and
decided to do everything on the first connection. It's how the
current patch came.

 Is it possible to use the parallel query infrastructure being built by
 Robert or to do something like parallel seq scan? That will work, not just
 for Postgres FDW but all the FDWs.

I haven't looked closely at the patch, but if my understanding at a
glance is correct, the parallel scan divides the target table
into multiple parts, then runs subscans on every part in parallel.

It might allocate dedicated processes for every child scan on a
partitioned table.

But I think, from the performance view, the individual scans of
multiple foreign scans don't each need a corresponding local
process. But if the parallel scan infrastructure makes the
mechanism simpler, using it is a very promising choice.

 In case of Prepared statements, ExecInit is called at the end of planning,
 without subsequent execution like the case of 

Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-13 Thread Dean Rasheed
On 12 January 2015 at 22:16, Stephen Frost sfr...@snowman.net wrote:
 Alright, here's an updated patch which doesn't return any detail if no
 values are visible or if only a partial key is visible.

 Please take a look.  I'm not thrilled with simply returning an empty
 string and then checking that for BuildIndexValueDescription and
 ExecBuildSlotValueDescription, but I figured we might have other users
 of BuildIndexValueDescription and I wasn't seeing a particularly better
 solution.  Suggestions welcome, of course.


Actually I'm starting to wonder if it's even worth bothering about the
index case. To leak information, you'd need to have a composite key
for which you only had access to a subset of the key columns (which in
itself seems like a pretty rare case), and then you'd need to specify
new values to make the entire key conflict with an existing value, at
which point the fact that an exception is thrown tells you that the
values in the index must be the same as your new values. I'm
struggling to imagine a realistic scenario where this would be a
problem.

Also, if we change BuildIndexValueDescription() in this way, it's
going to make it more or less useless for updatable views, since in
the most common case the user won't have permissions on the underlying
table.

For CHECK constraints/options, the change looks more reasonable, and I
guess that approach could be extended to RLS by only including the
modified columns, not the ones with select permissions, if RLS is
enabled on the table. One minor comment on the code -- you could save
a few cycles by only calling GetModifiedColumns() in the exceptional
case.

Regards,
Dean




Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-13 Thread Stephen Frost
* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 12 January 2015 at 22:16, Stephen Frost sfr...@snowman.net wrote:
  Alright, here's an updated patch which doesn't return any detail if no
  values are visible or if only a partial key is visible.
 
  Please take a look.  I'm not thrilled with simply returning an empty
  string and then checking that for BuildIndexValueDescription and
  ExecBuildSlotValueDescription, but I figured we might have other users
  of BuildIndexValueDescription and I wasn't seeing a particularly better
  solution.  Suggestions welcome, of course.
 
 Actually I'm starting to wonder if it's even worth bothering about the
 index case. To leak information, you'd need to have a composite key
 for which you only had access to a subset of the key columns (which in
 itself seems like a pretty rare case), and then you'd need to specify
 new values to make the entire key conflict with an existing value, at
 which point the fact that an exception is thrown tells you that the
 values in the index must be the same as your new values. I'm
 struggling to imagine a realistic scenario where this would be a
 problem.

I'm not sure that I follow..  From re-reading the above a couple times,
I take it you're making an argument that people would be silly to set
their database up that way, but that's not an argument we can stand on
when it comes to privileges.  Additionally, as the regression tests
hopefully show, if you have update on one column of a composite key, you
could find out the entire key today by issuing an update against that
column to set it to the same value throughout.  You don't need to know
what the rest of the key is but only that two records somewhere have the
same values except for the one column you have update rights on.

 Also, if we change BuildIndexValueDescription() in this way, it's
 going to make it more or less useless for updatable views, since in
 the most common case the user won't have permissions on the underlying
 table.

That's certainly something to consider.  I looked at trying to get which
columns the user had provided down to BuildIndexValueDescription, but I
couldn't find a straight-forward way to do that as it involves the index
AMs which we can't change (and I'm not really sure we'd want to anyway).

 For CHECK constraints/options, the change looks more reasonable, and I
 guess that approach could be extended to RLS by only including the
 modified columns, not the ones with select permissions, if RLS is
 enabled on the table. One minor comment on the code -- you could save
 a few cycles by only calling GetModifiedColumns() in the exceptional
 case.

Agreed, that sounds like a good approach for how to address the RLS
concern.  Also, good point about GetModifiedColumns.  There's a few
other minor changes that I want to make on re-reading it also.  I should
be able to post a new version later today.

Thanks!

Stephen




Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 01/13/2015 02:08 PM, Alexey Bashtanov wrote:
 By varying the parameters and reading source code I determined that
 memory usage linearly depends on (plan nodes count)*(overall columns
 count), thus it quadratically depends on number of tables unionized.
 
 To remove this excessive memory usage I propose
 to run deparse_context_for_planstate+deparse_expression in a separate
 memory context and free it after a plan node is generated.

 Hmm, something like the attached? Seems reasonable...

This looks pretty unsafe to me: it assumes, without much justification,
that there is no memory allocated during show_expression() that will be
needed later.

I suspect the O(N^2) consumption comes directly from some workspace
allocated for variable deparsing in ruleutils.c.  A safer fix would
be to do retail pfrees on that.

regards, tom lane




Re: [HACKERS] Safe memory allocation functions

2015-01-13 Thread Tom Lane
I wrote:
 Michael Paquier michael.paqu...@gmail.com writes:
 Attached is a patch adding the following set of functions for frontend
 and backends returning NULL instead of reporting ERROR when allocation
 fails:
 - palloc_safe
 - palloc0_safe
 - repalloc_safe

 Unimpressed with this naming convention.  _unsafe would be nearer
 the mark ;-)

Less snarkily: _noerror would probably fit better with existing
precedents in our code.

However, there is a larger practical problem with this whole concept,
which is that experience should teach us to be very wary of the assumption
that asking for memory the system can't give us will just lead to nice
neat malloc-returns-NULL behavior.  Any small perusal of the mailing list
archives will remind you that very often the end result will be SIGSEGV,
OOM kills, unrecoverable trap-on-write when the kernel realizes it can't
honor a copy-on-write promise, yadda yadda.  Agreed that it's arguable
that these only occur in misconfigured systems ... but misconfiguration
appears to be the default in a depressingly large fraction of systems.
(This is another reason for _safe not being the mot juste :-()

In that light, I'm not really convinced that there's a safe use-case
for a behavior like this.  I certainly wouldn't risk asking for a couple
of gigabytes on the theory that I could just ask for less if it fails.

regards, tom lane




Re: [HACKERS] [BUGS] BUG #12070: hstore extension: hstore_to_json_loose produces invalid JSON

2015-01-13 Thread Peter Eisentraut
On 11/30/14 11:45 AM, Tom Lane wrote:
 The API break isn't a big issue imo.  The net effect would be that eg
 hstore 9.3.6 wouldn't work against a 9.3.5 server.  We do that sort of
 thing *all the time* --- at least twice in the past year, according to
 a quick scan of the commit logs.  If you were changing or removing a
 function that third-party code might depend on, it'd be problematic,
 but an addition has no such risk.

This sort of thing is actually a bit of an annoyance, because it means
that for minor-version upgrades, you need to stop the server before
unpacking the new version; otherwise the old running server will try to
load the new hstore module and fail with a symbol lookup error. This can
increase the downtime significantly.

Yes, we've done this before, and people have gotten bitten by it before.





Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC

2015-01-13 Thread Marco Nenciarini
On 13/01/15 12:53, Gabriele Bartolini wrote:
 Hi Marco,
 
    could you please send an updated version of the patch against the current
  HEAD in order to facilitate reviewers?
 

Here is the updated patch for incremental file based backup.

It is based on the current HEAD.

I'm now working to the client tool to rebuild a full backup starting
from a file based incremental backup.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From 50ff0872d3901a30b6742900170052eabe0e06dd Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v4

Add backup profile to pg_basebackup
INCREMENTAL option implementation
---
 src/backend/access/transam/xlog.c  |   7 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 335 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  |  83 ++--
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   4 +
 8 files changed, 409 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 839ea7c..625a5df 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** XLogFileNameP(TimeLineID tli, XLogSegNo
*** 9249,9255 ****
  * permissions of the calling user!
  */
 XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 				   char **labelfile)
 {
 	bool		exclusive = (labelfile == NULL);
--- 9249,9256 ----
  * permissions of the calling user!
  */
 XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast,
! 				   XLogRecPtr incremental_startpoint, TimeLineID *starttli_p,
 				   char **labelfile)
 {
 	bool		exclusive = (labelfile == NULL);
*************** do_pg_start_backup(const char *backupids
*** 9468,9473 ****
--- 9469,9478 ----
  					 (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
  	appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
  					 (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
+ 	if (incremental_startpoint > 0)
+ 		appendStringInfo(&labelfbuf, "INCREMENTAL FROM LOCATION: %X/%X\n",
+ 						 (uint32) (incremental_startpoint >> 32),
+ 						 (uint32) incremental_startpoint);
  	appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  					 exclusive ? "pg_start_backup" : "streamed");
  	appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 2179bf7..ace84d8 100644
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
*************** pg_start_backup(PG_FUNCTION_ARGS)
*** 59,65 ****
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				 errmsg("must be superuser or replication role to run a backup")));
  
! 	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
  	PG_RETURN_LSN(startpoint);
  }
--- 59,65 ----
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				 errmsg("must be superuser or replication role to run a backup")));
  
! 	startpoint = do_pg_start_backup(backupidstr, fast, 0, NULL, NULL);
  
  	PG_RETURN_LSN(startpoint);
  }
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 07030a2..05b19c5 100644
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 30,40 ****
--- 30,42 ----
  #include "replication/basebackup.h"
  #include "replication/walsender.h"
  #include "replication/walsender_private.h"
+ #include "storage/bufpage.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
  #include "utils/builtins.h"
  #include "utils/elog.h"
  #include "utils/ps_status.h"
+ #include "utils/pg_lsn.h"
  #include "utils/timestamp.h"
  
  
*************** typedef struct
*** 46,56 ****
  	bool		nowait;
  	bool		includewal;
  	uint32		maxrate;
  } basebackup_options;
  
  
! static int64 sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces);
! static int64 sendTablespace(char *path, bool sizeonly);
  static bool sendFile(char *readfilename, char *tarfilename,
  			 struct stat * statbuf, bool missing_ok);
  static void sendFileWithContent(const char
Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-13 Thread Stephen Frost
* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 One improvement we could trivially make is to only do this for
 multi-column indexes. If there is only one column there is no danger
 of information leakage, right?

That's an interesting thought.  If there's only one column then to have
a conflict you must be changing it and providing the new value either as
a constant, from a column on which you must have select rights, or
via a function you have execute rights on.

So, no, I can't think of a way that would leak information.  I'm still
on the fence about it though as it might be confusing to have
single-column indexes behave differently and I'm a bit worried that,
even if there isn't a way now to exploit this, there might be in the
future.

 Yeah I couldn't see any easy way of doing it. 2 possibilities sprung
 to mind -- (1) wrap the index update in a PG_TRY() and add the detail
 in the catch block, or (2) track the currently active EState and make
 GetModifiedColumns() into an exported function taking a single EState
 argument (the EState has the currently active ResultRelInfo on it).
 Neither of those alternatives seems particularly attractive to me
 though.

The EState is available when dealing with exclusion constraints but it's
not available to _bt_check_unique easily, which is the bigger issue.
GetModifiedColumns() could (and probably should, really) be moved into a
.h somewhere as it's also in trigger.c (actually, that's where I pulled
it from :).

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] OOM on EXPLAIN with lots of nodes

2015-01-13 Thread Tom Lane
I wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
 Hmm, something like the attached? Seems reasonable...

 This looks pretty unsafe to me: it assumes, without much justification,
 that there is no memory allocated during show_expression() that will be
 needed later.

 I suspect the O(N^2) consumption comes directly from some workspace
 allocated for variable deparsing in ruleutils.c.  A safer fix would
 be to do retail pfrees on that.

I looked at this a bit more, and realized that we've got worse problems
than just O(N^2) space consumption: there's also O(N^2) time consumption
in ruleutils' set_simple_column_names().  The constant factor is not too
bad in this example but would be horrible with 1000 regular relations in
the rtable :-(.  Safe or not, the extra-context solution doesn't improve
the code's time consumption.

Really the issue here is that explain.c assumes that
deparse_context_for_planstate() is negligibly cheap, which was once true
but is so no longer.  What's more, most of the work is dependent only on
the rtable and so there is no good reason at all to be doing it more than
once per plan tree.

I think we can fix this by refactoring so that we construct a deparse
context once in ExplainPrintPlan(), and then just repoint it at
different planstate nodes as we work through the plan tree.

A difficulty with either your patch or my idea is that they require adding
another field to ExplainState, which is an ABI break for any third-party
code that might be declaring variables of that struct type.  That's fine
for HEAD but would be risky to back-patch.  Any thoughts about whether we
can get away with that (ie, anybody have an idea if there are third-party
extensions that call explain.c)?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Merlin Moncure
On my workstation today (running vanilla 9.4.0) I was testing some new
code that does aggressive parallel loading to a couple of tables. It
ran ok several dozen times and froze up with no external trigger.
There were at most 8 active backends that were stuck (the loader is
threaded to a cap) -- each query typically resolves in a few seconds
but they were hung for 30 minutes+.  Had to do restart immediate as
backends were not responding to cancel...but I snapped a 'perf top'
before I did so.  The results were interesting so I'm posting them
here.  So far I have not been able to reproduce...FYI

 61.03%  postgres [.] s_lock
 13.56%  postgres [.] LWLockRelease
 10.11%  postgres [.] LWLockAcquire
  4.02%  perf [.] 0x000526d3
  1.65%  postgres [.] _bt_compare
  1.60%  libc-2.17.so [.] 0x00081069
  0.66%  [kernel] [k] kallsyms_expand_symbol.constprop.1
  0.60%  [kernel] [k] format_decode
  0.57%  [kernel] [k] number.isra.1
  0.47%  [kernel] [k] memcpy
  0.44%  postgres [.] ReleaseAndReadBuffer
  0.44%  postgres [.] FunctionCall2Coll
  0.41%  [kernel] [k] vsnprintf
  0.41%  [kernel] [k] module_get_kallsym
  0.32%  postgres [.] _bt_relandgetbuf
  0.31%  [kernel] [k] string.isra.5
  0.31%  [kernel] [k] strnlen
  0.31%  postgres [.] _bt_moveright
  0.28%  libc-2.17.so [.] getdelim
  0.22%  postgres [.] LockBuffer
  0.16%  [kernel] [k] seq_read
  0.16%  libc-2.17.so [.] __libc_calloc
  0.13%  postgres [.] _bt_checkpage
  0.09%  [kernel] [k] pointer.isra.15
  0.09%  [kernel] [k] update_iter
  0.08%  plugin_host  [.] PyObject_GetAttr
  0.06%  [kernel] [k] strlcpy
  0.06%  [kernel] [k] seq_vprintf
  0.06%  [kernel] [k] copy_user_enhanced_fast_string
  0.06%  libc-2.17.so [.] _IO_feof
  0.06%  postgres [.] btoidcmp
  0.06%  [kernel] [k] page_fault
  0.06%  libc-2.17.so [.] free
  0.06%  libc-2.17.so [.] memchr
  0.06%  libpthread-2.17.so   [.] __pthread_mutex_unlock_usercnt

merlin




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Andres Freund
Hi,

On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
 On my workstation today (running vanilla 9.4.0) I was testing some new
 code that does aggressive parallel loading to a couple of tables. It
 ran ok several dozen times and froze up with no external trigger.
 There were at most 8 active backends that were stuck (the loader is
 threaded to a cap) -- each query typically resolves in a few seconds
 but they were hung for 30 minutes+.

Interesting.

 Had to do restart immediate as
 backends were not responding to cancel...but I snapped a 'perf top'
 before I did so.  The results were interesting so I'm posting them
 here.  So far I have not been able to reproduce...FYI

Can you compile postgres with -fno-omit-frame-pointer? Then, when this
happens the next time, you can take a perf record -g, which will tell us
which lock the contention is at.
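As a concrete sketch of that workflow (configure flags and the 30-second window are illustrative; the essential pieces are `-fno-omit-frame-pointer` at build time and `perf record -g` during the hang):

```shell
# Rebuild postgres with frame pointers so perf can unwind call stacks
./configure CFLAGS="-O2 -fno-omit-frame-pointer"
make -s && make -s install
# ...restart the cluster, reproduce the hang, then while it is hung:

# Record call graphs system-wide for 30 seconds
perf record -g -a -- sleep 30

# Hierarchical report showing which callers reach s_lock
perf report -g
```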

  61.03%  postgres [.] s_lock
  13.56%  postgres [.] LWLockRelease
  10.11%  postgres [.] LWLockAcquire

That profile looks like it might end up being helped by the lwlock
and/or freelist changes in 9.5.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Merlin Moncure
On Tue, Jan 13, 2015 at 4:33 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
 On my workstation today (running vanilla 9.4.0) I was testing some new
 code that does aggressive parallel loading to a couple of tables. It
 ran ok several dozen times and froze up with no external trigger.
 There were at most 8 active backends that were stuck (the loader is
 threaded to a cap) -- each query typically resolves in a few seconds
 but they were hung for 30 minutes+.

 Interesting.

 Had to do restart immediate as
 backends were not responding to cancel...but I snapped a 'perf top'
 before I did so.  The results were interesting so I'm posting them
 here.  So far I have not been able to reproduce...FYI

 Can you compile postgres with -fno-omit-frame-pointer? Then, when this
 happens the next time, you can take a perf record -g, which will tell us
 which lock the contention is at.

will do, and I'll loop it for a while and see if I can get it to re-occur.

merlin




Re: [HACKERS] pg_basebackup fails with long tablespace paths

2015-01-13 Thread Peter Eisentraut
On 1/7/15 3:19 PM, Robert Haas wrote:
 On Tue, Jan 6, 2015 at 4:33 PM, Peter Eisentraut pete...@gmx.net wrote:
 Currently, when you unpack a tarred basebackup with tablespaces, the
 symlinks will tell you whether you have unpacked the tablespace tars at
 the right place.  Otherwise, how do you know?  Secondly, you also have
 the option of putting the tablespaces somewhere else by changing the
 symlinks.
 
 That's a good argument for making the tablespace-map file
 human-readable and human-editable, but I don't think it's an argument
 for duplicating its contents inaccurately in the filesystem.
 
 One way to address this would be to do away with the symlinks altogether
 and have pg_tblspc/12345 be a text file that contains the tablespace
 location.  Kind of symlinks implemented in user space.
 
 Well, that's just spreading the tablespace-map file out into several
 files, and maybe keeping it around after we've restored from backup.

I think the key point I'm approaching is that the information should
only ever be in one place, all the time.  This is not dissimilar from
why we took the tablespace location out of the system catalogs.  Users
might have all kinds of workflows for how they back up, restore, and
move their tablespaces.  This works pretty well right now, because the
authoritative configuration information is always in plain view.  The
proposal is essentially that we add another location for this
information, because the existing location is incompatible with some
operating system tools.  And, when considered by a user, that second
location might or might not collide with or overwrite the first location
at some mysterious times.

So I think the preferable fix is not to add a second location, but to
make the first location compatible with said operating system tools,
possibly in the way I propose above.





Re: [HACKERS] pg_rewind in contrib

2015-01-13 Thread Peter Eisentraut
There are two ways in which access control for replication connections
is separate:

- replication pseudo-database in pg_hba.conf

- replication role attribute

If someone has a restrictive setup for replication and pg_basebackup,
and then pg_rewind enters, they will have to

- add a line to pg_hba.conf to allow non-replication access

- connect as superuser

The first is an inconvenience, although it might be useful for this and
other reasons to have a variant of "all" that includes replication.

The second, however, is a problem, because you have to change from a
restricted, quasi-read-only user to full superuser access.

One way to address that might be if we added backend functions that are
variants of pg_ls_dir() and pg_stat_file() that are accessible only with
the replication attribute.  Or we allow calling pg_ls_dir() and
pg_stat_file() with relative paths with the replication attribute.
(While we're at it, add a recursive variant of pg_ls_dir().)  Or some
variant of that.
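For context, the calls in question are superuser-only today; a hypothetical session holding only the REPLICATION attribute would, under this proposal, be allowed to run something like:

```sql
-- List a directory relative to the data directory
SELECT pg_ls_dir('global');

-- Stat a single file (size, timestamps, isdir)
SELECT size, modification, isdir
FROM pg_stat_file('global/pg_control');
```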






Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Peter Geoghegan
On Tue, Jan 13, 2015 at 2:29 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On my workstation today (running vanilla 9.4.0) I was testing some new
 code that does aggressive parallel loading to a couple of tables.

Could you give more details, please? For example, I'd like to see
representative data, or at least the table definitions involved. Do
you have any idea what index is involved with the _bt_compare() calls
above? I'm curious as to what its definition is.

You might also consider using my B-Tree verification tool on a
preserved data directory:
http://www.postgresql.org/message-id/cam3swzrk2yxkegj94_2wks_vbemjehc1ye23ecss01wrakz...@mail.gmail.com

I'm inclined to think that this is a livelock, and so the problem
isn't evident from the structure of the B-Tree, but it can't hurt to
check. The tool is bleeding edge, so don't use it in production.

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Andres Freund
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
 I'm inclined to think that this is a livelock, and so the problem
 isn't evident from the structure of the B-Tree, but it can't hurt to
 check.

My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen profiles like this due to exactly that
before - and it fits to parallel loading quite well.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Merlin Moncure
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
 I'm inclined to think that this is a livelock, and so the problem
 isn't evident from the structure of the B-Tree, but it can't hurt to
 check.

 My guess is rather that it's contention on the freelist lock via
 StrategyGetBuffer's. I've seen profiles like this due to exactly that
 before - and it fits to parallel loading quite well.

I think I've got it to pop again. s_lock is only showing 35%
(increasing very slowly if at all) but performance is mostly halted.
Frame pointer is compiled out.  perf report attached.

merlin

 35.82%  postgres[.] s_lock
 23.71%  postgres[.] tas
 14.01%  postgres[.] tas
  6.82%  postgres[.] spin_delay
  5.93%  postgres[.] LWLockRelease
  4.36%  postgres[.] LWLockAcquireCommon
  2.34%  postgres[.] FunctionCall2Coll
  1.79%  postgres[.] _bt_compare
  0.80%  postgres[.] LockBuffer
  0.52%  postgres[.] btoidcmp
  0.49%  postgres[.] ReleaseAndReadBuffer
  0.47%  postgres[.] _bt_moveright
  0.36%  postgres[.] _bt_checkpage
  0.36%  postgres[.] _bt_relandgetbuf
  0.20%  perf[.] 0x0004312a
  0.19%  postgres[.] LWLockAcquire
  0.13%  libv8.so[.] 0x001bbbe0
  0.11%  libc-2.17.so[.] 0x00151134
  0.09%  libwebkit.so[.] 0x0106ccb7
  0.05%  libgdk_pixbuf-2.0.so.0.2800.1   [.] 0x000139c7
  0.04%  Xorg[.] 0x000efb00
  0.03%  libglib-2.0.so.0.3800.1 [.] 0x000876a2
  0.03%  [kernel][k] __ticket_spin_lock
  0.02%  perf-12966.map  [.] 0x0625db944621
  0.02%  libskia.so  [.] 0x0021d15f
  0.02%  libcairo.so.2.11200.16  [.] 0x0002440b
  0.02%  libpthread-2.17.so  [.] __pthread_mutex_unlock_usercnt
  0.02%  [kernel]            [k] copy_user_enhanced_fast_string
  0.02%  radeon_drv.so   [.] 0x00045002
  0.02%  libpthread-2.17.so  [.] pthread_mutex_lock
  0.02%  [kernel][k] fget_light
  0.01%  [kernel][k] __schedule
  0.01%  libexa.so   [.] 0x6e1d
  0.01%  libgdk-x11-2.0.so.0.2400.20 [.] 0x0002f0b0
  0.01%  libvte.so.9.2800.2  [.] 0x00039028
  0.01%  [radeon][k] r100_mm_rreg
  0.01%  [kernel][k] xhci_irq
  0.01%  [kernel][k] native_write_msr_safe
  0.01%  [kernel]            [k] update_cfs_rq_blocked_load
  0.01%  libglib-2.0.so.0.3800.1 [.] g_hash_table_lookup
  0.01%  libgobject-2.0.so.0.3800.1  [.] g_type_check_instance_is_a
Press '?' for help on key bindings


perf.report.gz
Description: GNU Zip compressed data



Re: [HACKERS] Safe memory allocation functions

2015-01-13 Thread Michael Paquier
Tom Lane writes:
 [blah]
 (This is another reason for _safe not being the mot juste :-()
My wording was definitely incorrect, but I am sure you got it: I should
have said "safe on error". "noerror" or "error_safe" would definitely be
more correct.

 In that light, I'm not really convinced that there's a safe use-case
 for a behavior like this.  I certainly wouldn't risk asking for a couple
 of gigabytes on the theory that I could just ask for less if it fails.
That's as well a matter of documentation. We could add a couple of
lines in for example xfunc.sgml to describe the limitations of such
APIs.
-- 
Michael




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Andres Freund
On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
 On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
  I'm inclined to think that this is a livelock, and so the problem
  isn't evident from the structure of the B-Tree, but it can't hurt to
  check.
 
  My guess is rather that it's contention on the freelist lock via
  StrategyGetBuffer's. I've seen profiles like this due to exactly that
  before - and it fits to parallel loading quite well.
 
 I think I've got it to pop again. s_lock is only showing 35%
 (increasing very slowly if at all) but performance is mostly halted.
 Frame pointer is compiled out.  perf report attached.

  35.82%  postgres[.] s_lock
  23.71%  postgres[.] tas
  14.01%  postgres[.] tas
   6.82%  postgres[.] spin_delay
   5.93%  postgres[.] LWLockRelease
   4.36%  postgres[.] LWLockAcquireCommon

Interesting. This profile looks quite different?

What kind of hardware is this on?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Peter Geoghegan
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote:
 My guess is rather that it's contention on the freelist lock via
 StrategyGetBuffer's. I've seen profiles like this due to exactly that
 before - and it fits to parallel loading quite well.

I'm not saying you're wrong, but the breakdown of _bt_moveright()
relative to _bt_relandgetbuf() calls seems a bit fishy to me.

I don't remember seeing _bt_moveright()  or _bt_compare() figuring so
prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
reference to _bt_binsrch() in either profile.

-- 
Peter Geoghegan




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Merlin Moncure
On Tue, Jan 13, 2015 at 5:49 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote:
 My guess is rather that it's contention on the freelist lock via
 StrategyGetBuffer's. I've seen profiles like this due to exactly that
 before - and it fits to parallel loading quite well.

 I'm not saying you're wrong, but the breakdown of _bt_moveright()
 relative to _bt_relandgetbuf() calls seems a bit fishy to me.

 I don't remember seeing _bt_moveright()  or _bt_compare() figuring so
 prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
 reference to _bt_binsrch() in either profile.

hm, this is hand compiled now, I bet the symbols are wrong.

merlin




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Andres Freund
On 2015-01-13 15:49:33 -0800, Peter Geoghegan wrote:
 On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund and...@2ndquadrant.com wrote:
  My guess is rather that it's contention on the freelist lock via
  StrategyGetBuffer's. I've seen profiles like this due to exactly that
  before - and it fits to parallel loading quite well.
 
 I'm not saying you're wrong, but the breakdown of _bt_moveright()
 relative to _bt_relandgetbuf() calls seems a bit fishy to me.

The hierarchical profile definitely doesn't look like my guess was
right.

 I don't remember seeing _bt_moveright()  or _bt_compare() figuring so
 prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
 reference to _bt_binsrch() in either profile.

Well, we do a _bt_moveright pretty early on, so that actually might be
cache misses we're primarily seeing.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Peter Geoghegan
On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
 I don't remember seeing _bt_moveright()  or _bt_compare() figuring so
 prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
 reference to _bt_binsrch() in either profile.

 hm, this is hand compiled now, I bet the symbols are wrong.

In case it isn't clear, I think that the proximate cause here may well
be either one (or both) of commits
efada2b8e920adfdf7418862e939925d2acd1b89 and/or
40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think
that the profile is roughly consistent with that, although I may well
be wrong.

-- 
Peter Geoghegan




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Merlin Moncure
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
 I don't remember seeing _bt_moveright()  or _bt_compare() figuring so
 prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
 reference to _bt_binsrch() in either profile.

 hm, this is hand compiled now, I bet the symbols are wrong.

 In case it isn't clear, I think that the proximate cause here may well
 be either one (or both) of commits
 efada2b8e920adfdf7418862e939925d2acd1b89 and/or
 40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think
 that the profile is roughly consistent with that, although I may well
 be wrong.

I'm out of time for the day, but I'm fairly confident I can reproduce.
I'll see if I can reverse those commits tomorrow and retest (I'm on
development box).

merlin




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Merlin Moncure
On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
 On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
  I'm inclined to think that this is a livelock, and so the problem
  isn't evident from the structure of the B-Tree, but it can't hurt to
  check.
 
  My guess is rather that it's contention on the freelist lock via
  StrategyGetBuffer's. I've seen profiles like this due to exactly that
  before - and it fits to parallel loading quite well.

 I think I've got it to pop again. s_lock is only showing 35%
 (increasing very slowly if at all) but performance is mostly halted.
 Frame pointer is compiled out.  perf report attached.

  35.82%  postgres[.] s_lock
  23.71%  postgres[.] tas
  14.01%  postgres[.] tas
   6.82%  postgres[.] spin_delay
   5.93%  postgres[.] LWLockRelease
   4.36%  postgres[.] LWLockAcquireCommon

 Interesting. This profile looks quite different?

yep, it's very stable, and the database is approximately frozen.

 What kind of hardware is this on?

linux on crap workstation box, except I have good ssd (intel 3500).

 model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz


top - 17:44:00 up 11 days,  6:25, 15 users,  load average: 6.94, 6.97, 5.73
Tasks: 259 total,   8 running, 250 sleeping,   0 stopped,   1 zombie
%Cpu0  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 93.5 us,  2.2 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 93.6 us,  2.1 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8131028 total,  7977608 used,   153420 free,21424 buffers
KiB Swap:  8340476 total,  1884900 used,  6455576 free,  5201188 cached

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEMTIME+  COMMAND
32492 mmoncure  20   0  175m  12m 9040 R  64.0  0.2   9:28.42 postgres
32367 mmoncure  20   0  175m  12m 9128 R  57.3  0.2   9:35.87 postgres
31792 mmoncure  20   0  175m  92m  88m R  52.9  1.2   9:57.50 postgres
32725 mmoncure  20   0  175m  13m 9852 R  52.9  0.2   9:08.50 postgres
31899 mmoncure  20   0  179m 104m  97m R  50.7  1.3   9:49.51 postgres
32368 mmoncure  20   0  175m  12m 9200 R  50.7  0.2   9:33.77 postgres
31282 mmoncure  20   0  184m 152m 140m R  46.3  1.9  10:26.29 postgres
  666 mmoncure  20   0 20640 1684 1116 R   2.2  0.0   0:00.01 top

Some more information what's happening:
This is a ghetto logical replication engine that migrates data from
SQL Server to postgres, consolidating a sharded database into a single
set of tables (of which there are only two). There is only one index
on the destination table, and it's composite int,int in both cases.

A single source table is replicated in a single transaction, and it's
staged to a set of two temp tables, one for deletes, one for inserts.
I log each step in the database.  Here are the times:

cds2=# select cdsruntableid, finished - started, inserted from
cdsruntable where cdsrunid = 62 order by started;
 cdsruntableid │?column? │ inserted
───┼─┼──
   833 │ 00:00:33.24044  │11860
   834 │ 00:00:35.44981  │19099
   835 │ 00:02:01.725186 │   530923
   836 │ 00:01:29.578811 │   211737
   838 │ 00:01:17.393461 │64258
   837 │ 00:00:56.849106 │   227844
   839 │ 00:02:12.016504 │   630309
   840 │ 00:00:00.00111  │
   841 │ 00:01:09.058806 │   155309
   842 │ 00:01:54.723747 │   704422
   843 │ 00:01:14.965304 │
   844 │ 00:00:45.410953 │59934
   845 │ 00:00:34.302632 │14848
   846 │ 00:00:46.913832 │89893
   848 │ 00:05:22.155189 │  2410622
   847 │ 00:01:25.199141 │   268157
   849 │ 00:01:16.168352 │   117511
   850 │ 00:00:34.809862 │ 1175
   851 │ 00:01:12.565397 │67214
   852 │ 00:01:03.734776 │20129
   853 │ 00:00:41.038456 │62765
   854 │ 00:01:03.478917 │14967
   855 │ 00:00:33.88803  │ 6901
   856 │ 00:00:36.381886 │ 6670
   857 │ 00:00:36.626623 │ 8268
   858 │ 00:01:14.391584 │   363448
   859 │ 00:01:44.619005 │   533395
   860 │ 00:01:02.042255 │   212202
   861 │ 00:00:00.001065 │
   863 │ 00:00:58.265506 │   215234
   862 │ 00:02:45.572635 │   827941
   865 │ 00:01:28.049165 │   241020
   864 │ 00:01:51.465643 │   531012
   866 │ 00:01:20.273391 │   419357
   868 │ 00:01:25.479443 │   294262
   869 │ 00:00:46.400825 │46136
   870 │ 00:00:42.337286 │25934
   

Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote:
 In case it isn't clear, I think that the proximate cause here may well
 be either one (or both) of commits
 efada2b8e920adfdf7418862e939925d2acd1b89 and/or
 40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think
 that the profile is roughly consistent with that, although I may well
 be wrong.

 I'm out of time for the day, but I'm fairly confident I can reproduce.
 I'll see if I can reverse those commits tomorrow and retest (I'm on
 development box).

I'm not convinced that Peter is barking up the right tree.  I'm noticing
that the profiles seem rather skewed towards parser/planner work; so I
suspect the contention is probably on access to system catalogs.  No
idea exactly why though.

It would be good to collect a few stack traces from the hung backends,
rather than relying on perf statistics.
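A minimal way to collect those traces (the user name and the `-x postgres` match are illustrative; gdb must be able to attach, and debug symbols make the backtraces far more useful):

```shell
# Take a backtrace of every running backend without killing it.
for pid in $(pgrep -u postgres -x postgres); do
    echo "=== backend $pid ==="
    gdb --batch -p "$pid" -ex "bt" 2>/dev/null
done
```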

regards, tom lane




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-13 Thread Andres Freund
On 2015-01-13 19:05:10 -0500, Tom Lane wrote:
 Merlin Moncure mmonc...@gmail.com writes:
  On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan p...@heroku.com wrote:
  In case it isn't clear, I think that the proximate cause here may well
  be either one (or both) of commits
  efada2b8e920adfdf7418862e939925d2acd1b89 and/or
  40dae7ec537c5619fc93ad602c62f37be786d161. Probably the latter. I think
  that the profile is roughly consistent with that, although I may well
  be wrong.
 
  I'm out of time for the day, but I'm fairly confident I can reproduce.
  I'll see if I can reverse those commits tomorrow and retest (I'm on
  development box).
 
 I'm not convinced that Peter is barking up the right tree.  I'm noticing
 that the profiles seem rather skewed towards parser/planner work; so I
 suspect the contention is probably on access to system catalogs.  No
 idea exactly why though.

The plan contains plpgsql and exec_stmt_dynexecute(). So it might just
be executing crazy amounts of dynamic statements.

I'm still wondering if this isn't a different issue to the first one; the
plans do look different.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

