Re: [HACKERS] Turning recovery.conf into GUCs

2015-01-16 Thread Michael Paquier
On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote:
 I have written similar logic, and while it's not pleasant, it's doable.
  This issue would really only go away if you don't use a file to signal
 recovery at all, which you have argued for, but which is really a
 separate and more difficult problem.
Moving this patch to the next CF and marking it as returned with
feedback for the current CF, as visibly no consensus has been reached.
-- 
Michael




Re: [HACKERS] Turning recovery.conf into GUCs

2015-01-16 Thread Andres Freund
On 2015-01-16 21:43:43 +0900, Michael Paquier wrote:
 On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote:
  I have written similar logic, and while it's not pleasant, it's doable.
   This issue would really only go away if you don't use a file to signal
  recovery at all, which you have argued for, but which is really a
  separate and more difficult problem.
 Moving this patch to the next CF and marking it as returned with
 feedback for the current CF, as visibly no consensus has been reached.

I don't think that's a good idea. If we defer this another couple of months
we'll *never* reach anything coming close to consensus.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning recovery.conf into GUCs

2015-01-16 Thread Michael Paquier
On Fri, Jan 16, 2015 at 9:45 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-16 21:43:43 +0900, Michael Paquier wrote:
 On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote:
  I have written similar logic, and while it's not pleasant, it's doable.
   This issue would really only go away if you don't use a file to signal
  recovery at all, which you have argued for, but which is really a
  separate and more difficult problem.
 Moving this patch to the next CF and marking it as returned with
 feedback for the current CF, as visibly no consensus has been reached.

 I don't think that's a good idea. If we defer this another couple of months
 we'll *never* reach anything coming close to consensus.
What makes you think that the situation could suddenly move in one
direction more than another?
(FWIW, my vote goes to the all GUC approach with standby.enabled.)
-- 
Michael




Re: [HACKERS] parallel mode and parallel contexts

2015-01-16 Thread Andres Freund
On 2015-01-05 11:27:57 -0500, Robert Haas wrote:
 On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote:
  * I wonder if parallel contexts shouldn't be tracked via resowners
 
 That is a good question.  I confess that I'm somewhat fuzzy about
 which things should be tracked via the resowner mechanism vs. which
 things should have their own bespoke bookkeeping.  However, the
 AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
 which makes merging them seem not particularly clean.

I'm not sure either. But I think the current location is wrong anyway -
doing it during AtEOXact_Parallel(), before user-defined queries are run via
AfterTriggerFireDeferred(), seems wrong.

  * combocid.c should error out in parallel mode, as insurance
 
 Eh, which function?  HeapTupleHeaderGetCmin(),
 HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work
 in parallel mode. HeapTupleHeaderAdjustCmax() could
 Assert(!IsInParallelMode()) but the only calls are in heap_update()
 and heap_delete() which already have error checks, so putting yet
 another elog() there seems like overkill.

To me it seems like a good idea, but whatever.

  * I doubt it's a good idea to allow heap_insert, heap_inplace_update,
index_insert. I'm not convinced that those will be handled correctly, and
relaxing restrictions is easier than adding them.
 
 I'm fine with adding checks to heap_insert() and
 heap_inplace_update().  I'm not sure we really need to add anything to
 index_insert(); how are we going to get there without hitting some
 other prohibited operation first?

I'm not sure. But it's not that hard to imagine that somebody will start
adding codepaths that insert into indexes first. Think upsert.
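
To be concrete, the check being discussed for heap_insert() and
heap_inplace_update() would just be a small guard at the top of those
functions; a rough sketch only (error code and wording are illustrative,
not taken from the patch):

    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot insert tuples during a parallel operation")));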

  * I'd much rather split HandleParallelMessageInterrupt() into one
variant that just tells the machinery there's an interrupt (called from
signal handlers) and one that actually handles them. We shouldn't even
consider adding any new code that does allocations, errors and such in
signal handlers. That's just a *bad* idea.
 
 That's a nice idea, but I didn't invent the way this crap works today.
 ImmediateInterruptOK gets set to true while performing a lock wait,
 and we need to be able to respond to errors while in that state.  I
 think there's got to be a better way to handle that than force every
 asynchronous operation to have to cope with the fact that
 ImmediateInterruptOK may be set or not set, but as of today that's
 what you have to do.

I personally think it's absolutely unacceptable to add any more of
that. That the current mess works is more luck than anything else - and
I'm pretty sure there are several bugs in it. But since I realize I can't
force you to develop an alternative solution, I tried to implement enough
infrastructure to deal with it without too much work.

As far as I can see this could relatively easily be implemented on top of
the removal of ImmediateInterruptOK in the patch series I posted
yesterday? AFAICS you just need to remove most of
HandleParallelMessageInterrupt() and replace it with a single SetLatch().
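
To sketch the split being suggested (names and details are illustrative,
not the actual patch): the signal handler only records that a message is
pending and wakes the process, and the real handling -- which may allocate
memory and raise errors -- runs later from a safe point:

    #include <signal.h>
    #include <errno.h>

    /* illustrative sketch only -- SetLatch()/MyProc come from the backend's
     * latch and proc headers; names below are made up for the example */
    static volatile sig_atomic_t parallel_message_pending = 0;

    static void
    parallel_message_signal_handler(int signum)
    {
        int     save_errno = errno;

        (void) signum;
        parallel_message_pending = 1;
        SetLatch(&MyProc->procLatch);   /* just wake the main loop */
        errno = save_errno;
    }

    /* called later from a safe spot, e.g. the CHECK_FOR_INTERRUPTS() path */
    static void
    maybe_handle_parallel_messages(void)
    {
        if (parallel_message_pending)
        {
            parallel_message_pending = 0;
            /* read the message queue and handle messages here; palloc()
             * and ereport() are fine at this point */
        }
    }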

  * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
much rather place a struct there and be careful about not using
pointers. That would also obviate the need to reserve some ids.
 
 I don't see how that's going to work with variable-size data
 structures, and a bunch of the things that we need to serialize are
 variable-size.

Meh. Just appending the variable data to the end of the structure and
calculating offsets works just fine.
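
As a minimal sketch of that layout (field names are made up for the
example, not the actual parallel-context structures): a fixed-size struct
sits at the start of the segment, everything variable-size is appended
after it, and it is addressed by offsets rather than pointers, so it stays
valid even if the segment is mapped at a different address in each worker:

    #include <stddef.h>

    /* illustrative layout only */
    typedef struct FixedParallelHeader
    {
        int     nworkers;
        size_t  guc_state_off;      /* offsets from the start of the segment */
        size_t  guc_state_len;
        size_t  snapshot_off;
        size_t  snapshot_len;
        /* variable-size data follows immediately after this struct */
    } FixedParallelHeader;

    static void *
    segment_ptr(void *segment_base, size_t off)
    {
        /* offsets survive different mapping addresses; pointers would not */
        return (char *) segment_base + off;
    }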

 Moreover, I'm not really interested in rehashing this
 design again.  I know it's not your favorite; you've said that before.

Well, if I keep having to read code using it, I'll keep being annoyed by
it. I guess we both have to live with that.


Greetings,

Andres Freund




Re: [HACKERS] Turning recovery.conf into GUCs

2015-01-16 Thread Andres Freund
On 2015-01-16 21:50:16 +0900, Michael Paquier wrote:
 On Fri, Jan 16, 2015 at 9:45 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2015-01-16 21:43:43 +0900, Michael Paquier wrote:
  On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote:
   I have written similar logic, and while it's not pleasant, it's doable.
This issue would really only go away if you don't use a file to signal
   recovery at all, which you have argued for, but which is really a
   separate and more difficult problem.
  Moving this patch to the next CF and marking it as returned with
  feedback for the current CF, as visibly no consensus has been reached.
 
  I don't think that's a good idea. If we defer this another couple of months
  we'll *never* reach anything coming close to consensus.

 What makes you think that the situation could suddenly move in one
 direction more than another?

That we have to fix this.

I see absolutely no advantage of declaring the discussion closed for
now. That doesn't exactly increase the chance of this ever succeeding.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bug in pg_dump

2015-01-16 Thread Gilles Darold
On 16/01/2015 01:06, Jim Nasby wrote:
 On 1/15/15 5:26 AM, Gilles Darold wrote:
 Hello,

 There's a long-standing issue with pg_dump and extensions that have
 member tables with foreign keys. This was previously reported in this
 thread
 http://www.postgresql.org/message-id/ca+tgmoyvzkadmgh_8el7uvm472geru0b4pnnfjqye6ss1k9...@mail.gmail.com
 and discussed by Robert. All PostgreSQL users that use the PostGIS
 extension postgis_topology are facing the issue because the two
 member tables (topology and layer) are linked by foreign keys.

 If you dump a database with this extension and try to import it you
 will experience this error:

 pg_restore: [archiver (db)] Error while PROCESSING TOC:
 pg_restore: [archiver (db)] Error from TOC entry 3345; 0
 157059176 TABLE DATA layer gilles
 pg_restore: [archiver (db)] COPY failed for table layer: ERROR:
 insert or update on table layer violates foreign key constraint
 layer_topology_id_fkey
 DETAIL:  Key (topology_id)=(1) is not present in table topology.
 WARNING: errors ignored on restore: 1


 The problem is that, whatever export type you choose (plain/custom
 and full-export/data-only), the data of the topology and layer tables
 is always exported in alphabetical order. I think this is a bug
 because outside of extensions, in a data-only export, pg_dump is able to
 find foreign key dependencies and dump table data in the right order,
 but not for extension members. The default is alphabetical order, but
 that should not be the case for extension members because the
 constraints are recreated by the CREATE EXTENSION command. I hope I
 am clear enough.

 Here we have three solutions:

  1/ Inform developers of extensions to take alphabetical order
 into account when they have member tables using foreign keys.
  2/ Inform DBAs that they have to restore the failing table
 independently. The use case above can be resolved using the following
 command:

   pg_restore -h localhost -n topology -t layer -Fc -d
 testdb_empty testdump.dump

  3/ Inform DBAs that they have to restore the schema first then
 the data only using --disable-triggers

 I don't like 1-3, and I doubt anyone else does...

  4/ Patch pg_dump to solve this issue.

 5. Disable FK's during load.
 This is really a bigger item than just extensions. It would have the
 nice benefit of doing a wholesale FK validation instead of firing
 per-row triggers, but it would leave the database in a weird state if
 a restore failed...

I think this is a different problem. Here we just need to apply to
extension member tables the same treatment as to normal tables. I guess
this is what this patch tries to solve.


 I attach a patch that solves the issue in pg_dump; let me know if it
 might be included in the Commit Fest or if the three other solutions are
 a better choice. I also attach a sample extension (test_fk_in_ext) to
 be able to reproduce the issue and test the patch. Note that there might
 exist a simpler solution than the one I used in this patch; if that
 is the case please point me in the right direction, and I will be pleased
 to rewrite and send another patch.

 The only problem I see with this approach is circular FK's:

 decibel@decina.local=# create table a(a_id serial primary key, b_id int);
 CREATE TABLE
 decibel@decina.local=# create table b(b_id serial primary key, a_id
 int references a);
 CREATE TABLE
 decibel@decina.local=# alter table a add foreign key(b_id) references b;
 ALTER TABLE
 decibel@decina.local=#

 That's esoteric enough that I think it's OK not to directly support
 them, but pg_dump shouldn't puke on them (and really should throw a
 warning). Though it looks like it doesn't handle that in the data-only
 case anyway...

The patch takes care of circular references, and you will be warned if
pg_dump finds them in the extension members. That was not the case before.
If you try to dump a database with the postgis extension you will be
warned about the FK defined on the edge_data table.


-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org






Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL

2015-01-16 Thread Andres Freund
Hi,

   /*
 -  * Grab an exclusive lock on the pk table, so that someone doesn't 
 delete
 -  * rows out from under us. (Although a lesser lock would do for that
 -  * purpose, we'll need exclusive lock anyway to add triggers to the pk
 -  * table; trying to start with a lesser lock will just create a risk of
 -  * deadlock.)
 +  * Grab ShareRowExclusiveLock on the pk table, so that someone doesn't
 +  * delete rows out from under us. Note that this does not create risks
 +  * of deadlocks as triggers add added to the pk table using the same
 +  * lock.
*/

"add added" doesn't look intended. The rest of the sentence doesn't look
entirely right either.

   /*
* Triggers must be on tables or views, and there are additional
 @@ -526,8 +526,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char 
 *queryString,
* can skip this for internally generated triggers, since the name
* modification above should be sufficient.
*
 -  * NOTE that this is cool only because we have AccessExclusiveLock on 
 the
 -  * relation, so the trigger set won't be changing underneath us.
 +  * NOTE that this is cool only because of the unique contraint.

I fail to see what the unique constraint has to do with this? The
previous comment refers to the fact that the AccessExclusiveLock is what
prevents a race where another transaction adds a trigger with the same
name. Yes, the unique index would, as noted earlier in
the comment, catch the error. But that's not the point of the
check. Unless I miss something the comment is just as true if you
replace the access exclusive lock with a share row exclusive lock, as
it's also self-conflicting.

 @@ -1272,8 +1271,7 @@ renametrig(RenameStmt *stmt)
* on tgrelid/tgname would complain anyway) and to ensure a trigger does
* exist with oldname.
*
 -  * NOTE that this is cool only because we have AccessExclusiveLock on 
 the
 -  * relation, so the trigger set won't be changing underneath us.
 +  * NOTE that this is cool only because there is a unique constraint.
*/

Same as above.

   tgrel = heap_open(TriggerRelationId, RowExclusiveLock);
  
 diff --git a/src/backend/utils/adt/ruleutils.c 
 b/src/backend/utils/adt/ruleutils.c
 index dd748ac..8eeccf2 100644
 --- a/src/backend/utils/adt/ruleutils.c
 +++ b/src/backend/utils/adt/ruleutils.c
 @@ -699,7 +699,8 @@ pg_get_triggerdef_worker(Oid trigid, bool pretty)
   HeapTuple   ht_trig;
   Form_pg_trigger trigrec;
   StringInfoData buf;
 - Relationtgrel;
 + Snapshotsnapshot = RegisterSnapshot(GetTransactionSnapshot());
 + Relationtgrel = heap_open(TriggerRelationId, AccessShareLock);
   ScanKeyData skey[1];
   SysScanDesc tgscan;
   int findx = 0;
 @@ -710,18 +711,18 @@ pg_get_triggerdef_worker(Oid trigid, bool pretty)
   /*
* Fetch the pg_trigger tuple by the Oid of the trigger
*/
 - tgrel = heap_open(TriggerRelationId, AccessShareLock);
 -
   ScanKeyInit(skey[0],
   ObjectIdAttributeNumber,
   BTEqualStrategyNumber, F_OIDEQ,
   ObjectIdGetDatum(trigid));
  
   tgscan = systable_beginscan(tgrel, TriggerOidIndexId, true,
 - NULL, 1, skey);
 + snapshot, 1, 
 skey);
  
   ht_trig = systable_getnext(tgscan);
  
 + UnregisterSnapshot(snapshot);
 +
   if (!HeapTupleIsValid(ht_trig))
 		elog(ERROR, "could not find tuple for trigger %u", trigid);


Hm. Pushing the snapshot is supposed to make this fully MVCC? I don't
think that's actually sufficient because of the deparsing of the WHEN
clause and of the function name.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Merlin Moncure
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Running this test on another set of hardware to verify -- if this
 turns out to be a false alarm which it may very well be, I can only
 offer my apologies!  I've never had a new drive fail like that, in
 that manner.  I'll burn the other hardware in overnight and report
 back.

huh -- well, possibly not.  This is on a virtual machine attached to a
SAN.  It ran clean for several hours (this is 9.4 vanilla, asserts off,
checksums on) and then started having issues:

[cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING:  page
verification failed, calculated checksum 59143 but expected 59137 at
character 20
[cds2 21952 2015-01-15 22:54:51.852 CST 5502]QUERY:
  DELETE FROM onesitepmc.propertyguestcard t
  WHERE EXISTS
  (
SELECT 1 FROM propertyguestcard_insert d
WHERE (t.prptyid, t.gcardid) = (d.prptyid, d.gcardid)
  )

[cds2 21952 2015-01-15 22:54:51.852 CST 5502]CONTEXT:  PL/pgSQL
function cdsreconciletable(text,text,text,text,boolean) line 197 at
EXECUTE statement
SQL statement SELECT* FROM CDSReconcileTable(
  t.CDSServer,
  t.CDSDatabase,
  t.SchemaName,
  t.TableName)
PL/pgSQL function cdsreconcileruntable(bigint) line 35 at SQL statement

After that, several hours of clean running, followed by:

[cds2 32353 2015-01-16 04:40:57.814 CST 7549]WARNING:  did not find
subXID 7553 in MyProc
[cds2 32353 2015-01-16 04:40:57.814 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.018 CST 7549]WARNING:  you don't own a
lock of type AccessShareLock
[cds2 32353 2015-01-16 04:40:58.018 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]LOG:  could not send data
to client: Broken pipe
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]STATEMENT:  SELECT
CDSReconcileRunTable(1160)
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:
ReleaseLockIfHeld: failed??
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:  you don't own a
lock of type AccessShareLock
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:
ReleaseLockIfHeld: failed??
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:  you don't own a
lock of type AccessShareLock
[cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
ReleaseLockIfHeld: failed??
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:  you don't own a
lock of type AccessShareLock
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
ReleaseLockIfHeld: failed??
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:  you don't own a
lock of type ShareLock
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
ReleaseLockIfHeld: failed??
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]ERROR:  failed to re-find
shared lock object
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
function cdsreconcileruntable(bigint) line 35 during exception cleanup
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]STATEMENT:  SELECT
CDSReconcileRunTable(1160)
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
AbortSubTransaction while in ABORT state
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:  did not find
subXID 7553 in MyProc
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:  you don't own a
lock of type RowExclusiveLock
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
ReleaseLockIfHeld: failed??
[cds2 32353 2015-01-16 

Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Michael Paquier
On Fri, Jan 16, 2015 at 8:47 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Voting for palloc_noerror() as well.
And here is an updated patch using this naming, added to the next CF as well.
-- 
Michael
From b636c809c2f2cb4177bedc2e5a4883a79b61fbc6 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 13 Jan 2015 15:40:38 +0900
Subject: [PATCH] Add memory allocation APIs able to return NULL instead of
 ERROR

The following functions are added to the existing set for frontend and
backend:
- palloc_noerror
- palloc0_noerror
- repalloc_noerror
---
 src/backend/utils/mmgr/aset.c| 529 +++
 src/backend/utils/mmgr/mcxt.c| 124 +
 src/common/fe_memutils.c |  72 --
 src/include/common/fe_memutils.h |   3 +
 src/include/nodes/memnodes.h |   2 +
 src/include/utils/palloc.h   |   3 +
 6 files changed, 451 insertions(+), 282 deletions(-)

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 85b3c9a..974e018 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -243,11 +243,22 @@ typedef struct AllocChunkData
 	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
 
 /*
+ * Wrappers for allocation functions.
+ */
+static void *set_alloc_internal(MemoryContext context,
+Size size, bool noerror);
+static void *set_realloc_internal(MemoryContext context, void *pointer,
+  Size size, bool noerror);
+
+/*
  * These functions implement the MemoryContext API for AllocSet contexts.
  */
 static void *AllocSetAlloc(MemoryContext context, Size size);
+static void *AllocSetAllocNoError(MemoryContext context, Size size);
 static void AllocSetFree(MemoryContext context, void *pointer);
 static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
+static void *AllocSetReallocNoError(MemoryContext context,
+ void *pointer, Size size);
 static void AllocSetInit(MemoryContext context);
 static void AllocSetReset(MemoryContext context);
 static void AllocSetDelete(MemoryContext context);
@@ -264,8 +275,10 @@ static void AllocSetCheck(MemoryContext context);
  */
 static MemoryContextMethods AllocSetMethods = {
 	AllocSetAlloc,
+	AllocSetAllocNoError,
 	AllocSetFree,
 	AllocSetRealloc,
+	AllocSetReallocNoError,
 	AllocSetInit,
 	AllocSetReset,
 	AllocSetDelete,
@@ -517,140 +530,16 @@ AllocSetContextCreate(MemoryContext parent,
 }
 
 /*
- * AllocSetInit
- *		Context-type-specific initialization routine.
- *
- * This is called by MemoryContextCreate() after setting up the
- * generic MemoryContext fields and before linking the new context
- * into the context tree.  We must do whatever is needed to make the
- * new context minimally valid for deletion.  We must *not* risk
- * failure --- thus, for example, allocating more memory is not cool.
- * (AllocSetContextCreate can allocate memory when it gets control
- * back, however.)
- */
-static void
-AllocSetInit(MemoryContext context)
-{
-	/*
-	 * Since MemoryContextCreate already zeroed the context node, we don't
-	 * have to do anything here: it's already OK.
-	 */
-}
-
-/*
- * AllocSetReset
- *		Frees all memory which is allocated in the given set.
- *
- * Actually, this routine has some discretion about what to do.
- * It should mark all allocated chunks freed, but it need not necessarily
- * give back all the resources the set owns.  Our actual implementation is
- * that we hang onto any keeper block specified for the set.  In this way,
- * we don't thrash malloc() when a context is repeatedly reset after small
- * allocations, which is typical behavior for per-tuple contexts.
- */
-static void
-AllocSetReset(MemoryContext context)
-{
-	AllocSet	set = (AllocSet) context;
-	AllocBlock	block;
-
-	AssertArg(AllocSetIsValid(set));
-
-#ifdef MEMORY_CONTEXT_CHECKING
-	/* Check for corruption and leaks before freeing */
-	AllocSetCheck(context);
-#endif
-
-	/* Clear chunk freelists */
-	MemSetAligned(set-freelist, 0, sizeof(set-freelist));
-
-	block = set-blocks;
-
-	/* New blocks list is either empty or just the keeper block */
-	set-blocks = set-keeper;
-
-	while (block != NULL)
-	{
-		AllocBlock	next = block-next;
-
-		if (block == set-keeper)
-		{
-			/* Reset the block, but don't return it to malloc */
-			char	   *datastart = ((char *) block) + ALLOC_BLOCKHDRSZ;
-
-#ifdef CLOBBER_FREED_MEMORY
-			wipe_mem(datastart, block-freeptr - datastart);
-#else
-			/* wipe_mem() would have done this */
-			VALGRIND_MAKE_MEM_NOACCESS(datastart, block-freeptr - datastart);
-#endif
-			block-freeptr = datastart;
-			block-next = NULL;
-		}
-		else
-		{
-			/* Normal case, release the block */
-#ifdef CLOBBER_FREED_MEMORY
-			wipe_mem(block, block-freeptr - ((char *) block));
-#endif
-			free(block);
-		}
-		block = next;
-	}
-
-	/* Reset block size allocation sequence, too */
-	set-nextBlockSize = set-initBlockSize;
-}
-
-/*
- * AllocSetDelete
- *		Frees all memory which is allocated in the given set,
- 

Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Andres Freund
On 2015-01-16 08:47:10 +0900, Michael Paquier wrote:
 On Fri, Jan 16, 2015 at 12:57 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  I do think that safe is the wrong suffix.  Maybe palloc_soft_fail()
  or palloc_null() or palloc_no_oom() or palloc_unsafe().
 
  I liked palloc_noerror() better myself FWIW.
 Voting for palloc_noerror() as well.

I don't like that name. It very well can error out. E.g. because of the
allocation size. And we definitely do not want to ignore that case. How
about palloc_try()?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Andres Freund
On 2015-01-16 23:06:12 +0900, Michael Paquier wrote:
  /*
 + * Wrappers for allocation functions.
 + */
 +static void *set_alloc_internal(MemoryContext context,
 + Size size, bool 
 noerror);
 +static void *set_realloc_internal(MemoryContext context, void *pointer,
 +   Size size, 
 bool noerror);
 +
 +/*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
  static void *AllocSetAlloc(MemoryContext context, Size size);
 +static void *AllocSetAllocNoError(MemoryContext context, Size size);
  static void AllocSetFree(MemoryContext context, void *pointer);
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size 
 size);
 +static void *AllocSetReallocNoError(MemoryContext context,
 +  void *pointer, 
 Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
  static void AllocSetDelete(MemoryContext context);
 @@ -264,8 +275,10 @@ static void AllocSetCheck(MemoryContext context);
   */
  static MemoryContextMethods AllocSetMethods = {
   AllocSetAlloc,
 + AllocSetAllocNoError,
   AllocSetFree,
   AllocSetRealloc,
 + AllocSetReallocNoError,
   AllocSetInit,
   AllocSetReset,
   AllocSetDelete,
 @@ -517,140 +530,16 @@ AllocSetContextCreate(MemoryContext parent,
  }

Wouldn't it make more sense to change the MemoryContext API to return
NULLs in case of allocation failure and do the error checking in the
mcxt.c callers?
 +/* wrapper routines for allocation */
 +static void* palloc_internal(Size size, bool noerror);
 +static void* repalloc_internal(void *pointer, Size size, bool noerror);
 +
  /*
   * You should not do memory allocations within a critical section, because
   * an out-of-memory error will be escalated to a PANIC. To enforce that
 @@ -684,8 +688,8 @@ MemoryContextAllocZeroAligned(MemoryContext context, Size 
 size)
   return ret;
  }
  
 -void *
 -palloc(Size size)
 +static void*
 +palloc_internal(Size size, bool noerror)
  {
   /* duplicates MemoryContextAlloc to avoid increased overhead */
   void   *ret;
 @@ -698,31 +702,85 @@ palloc(Size size)
  
   CurrentMemoryContext-isReset = false;
  
 - ret = (*CurrentMemoryContext->methods->alloc) (CurrentMemoryContext, size);
 + if (noerror)
 + ret = (*CurrentMemoryContext->methods->alloc_noerror)
 + (CurrentMemoryContext, size);
 + else
 + ret = (*CurrentMemoryContext->methods->alloc)
 + (CurrentMemoryContext, size);
   VALGRIND_MEMPOOL_ALLOC(CurrentMemoryContext, ret, size);
  
   return ret;
  }

I'd be rather surprised if these branches don't show up in
profiles. This is really rather hot code. At the very least this helper
function should be inlined. Also, calling the Valgrind function on an
allocation failure surely isn't correct.
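
For what it's worth, the latter issue is just a matter of doing the
Valgrind bookkeeping only when the allocation succeeded, along these lines
(a sketch against the quoted hunk, not the actual fix):

    /* only tell Valgrind about allocations that actually succeeded */
    if (ret != NULL)
        VALGRIND_MEMPOOL_ALLOC(CurrentMemoryContext, ret, size);

    return ret;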

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL

2015-01-16 Thread Andreas Karlsson

On 01/16/2015 03:01 PM, Andres Freund wrote:

Hi,


/*
-* Grab an exclusive lock on the pk table, so that someone doesn't 
delete
-* rows out from under us. (Although a lesser lock would do for that
-* purpose, we'll need exclusive lock anyway to add triggers to the pk
-* table; trying to start with a lesser lock will just create a risk of
-* deadlock.)
+* Grab ShareRowExclusiveLock on the pk table, so that someone doesn't
+* delete rows out from under us. Note that this does not create risks
+* of deadlocks as triggers add added to the pk table using the same
+* lock.
 */


"add added" doesn't look intended. The rest of the sentence doesn't look
entirely right either.


It was intended to be "are added", but the sentence is pretty awful.
I am not sure the sentence is really necessary anyway.



/*
 * Triggers must be on tables or views, and there are additional
@@ -526,8 +526,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
 * can skip this for internally generated triggers, since the name
 * modification above should be sufficient.
 *
-* NOTE that this is cool only because we have AccessExclusiveLock on 
the
-* relation, so the trigger set won't be changing underneath us.
+* NOTE that this is cool only because of the unique contraint.


I fail to see what the unique constraint has to do with this? The
previous comment refers to the fact that the AccessExclusiveLock is what
prevents a race where another transaction adds a trigger with the same
name. Yes, the unique index would, as noted earlier in
the comment, catch the error. But that's not the point of the
check. Unless I miss something the comment is just as true if you
replace the access exclusive lock with a share row exclusive lock, as
it's also self-conflicting.


Yeah, this must have been a leftover from the version where I only
grabbed a ShareLock. The comment should be restored.



Hm. Pushing the snapshot is supposed to make this fully MVCC? I don't
think that's actually sufficient because of the deparsing of the WHEN
clause and of the function name.


Indeed. As Noah and I discussed previously in this thread, we would need
to do quite a bit of refactoring of ruleutils.c to make it fully MVCC.
For this reason I opted to only lower the lock levels of ADD and ALTER
TRIGGER, and not DROP TRIGGER. Neither of those requires MVCC of the
WHEN clause.


Andreas






Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Merlin Moncure
On Fri, Jan 16, 2015 at 8:05 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Running this test on another set of hardware to verify -- if this
 turns out to be a false alarm which it may very well be, I can only
 offer my apologies!  I've never had a new drive fail like that, in
 that manner.  I'll burn the other hardware in overnight and report
 back.

 huh -- well, possibly not.  This is on a virtual machine attached to a
 SAN.  It ran clean for several hours (this is 9.4 vanilla, asserts off,
 checksums on) and then started having issues:

I'm going to bisect this down...FYI.  I'll start with the major
releases first.   This is not going to be a fast process...I'm out of
pocket for the weekend and have a lot of other stuff going on.

merlin




Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Heikki Linnakangas

On 01/16/2015 04:05 PM, Merlin Moncure wrote:

On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:

On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:

Running this test on another set of hardware to verify -- if this
turns out to be a false alarm which it may very well be, I can only
offer my apologies!  I've never had a new drive fail like that, in
that manner.  I'll burn the other hardware in overnight and report
back.


huh -- well, possibly not.  This is on a virtual machine attached to a
SAN.  It ran clean for several hours (this is 9.4 vanilla, asserts off,
checksums on) and then started having issues:

[cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING:  page
verification failed, calculated checksum 59143 but expected 59137 at
character 20


The calculated checksum is suspiciously close to the expected one. It
could be coincidence, but the previous checksum warning you posted was
also quite close:



[cds2 18347 2015-01-15 15:58:29.955 CST 1779]WARNING:  page
verification failed, calculated checksum 28520 but expected 28541


I believe the checksum algorithm is supposed to mix the bits quite 
thoroughly, so that a difference in a single byte in the input will lead 
to a completely different checksum. However, we add the block number to 
the checksum last:



/* Mix in the block number to detect transposed pages */
checksum ^= blkno;

/*
 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset 
of
 * one. That avoids checksums of zero, which seems like a good idea.
 */
return (checksum % 65535) + 1;


It looks very much like a page has for some reason been moved to a
different block number. And that's exactly what Peter found out in his
investigation too; an index page was mysteriously copied to a different
block with identical content.
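
To make the "close" observation concrete: when an identical page image
ends up at two nearby block numbers, the bulk of the checksum computation
is the same and only the final XOR with the block number differs, so the
resulting uint16 values differ only in a few low-order bits. A toy
illustration of just that final mixing step (the base value below is an
arbitrary stand-in, not a real page checksum):

    #include <stdio.h>
    #include <stdint.h>

    /* toy model of the final mixing step only */
    static uint16_t
    final_mix(uint32_t base, uint32_t blkno)
    {
        uint32_t checksum = base ^ blkno;           /* block number mixed in last */

        return (uint16_t) ((checksum % 65535) + 1); /* reduce, avoiding zero */
    }

    int
    main(void)
    {
        uint32_t base = 123459140;  /* arbitrary stand-in for the page hash */

        /* identical page content stored at two nearby blocks: the results
         * differ only slightly, much like 59137 vs 59143 in the log above */
        printf("%d %d\n", (int) final_mix(base, 1000), (int) final_mix(base, 1006));
        return 0;
    }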


- Heikki





Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Andres Freund
Hi,

On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
 On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
  On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
  Running this test on another set of hardware to verify -- if this
  turns out to be a false alarm which it may very well be, I can only
  offer my apologies!  I've never had a new drive fail like that, in
  that manner.  I'll burn the other hardware in overnight and report
  back.
 
 huh -- well, possibly not.  This is on a virtual machine attached to a
 SAN.  It ran clean for several hours (this is 9.4 vanilla, asserts off,
 checksums on) and then started having issues:

Damn.

Is there any chance you can package this somehow so that others can run
it locally? It looks hard to find the actual bug here without adding
instrumentation to postgres.

 [cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING:  page
 verification failed, calculated checksum 59143 but expected 59137 at
 character 20
 [cds2 21952 2015-01-15 22:54:51.852 CST 5502]QUERY:
   DELETE FROM onesitepmc.propertyguestcard t
   WHERE EXISTS
   (
 SELECT 1 FROM propertyguestcard_insert d
 WHERE (t.prptyid, t.gcardid) = (d.prptyid, d.gcardid)
   )

 [cds2 21952 2015-01-15 22:54:51.852 CST 5502]CONTEXT:  PL/pgSQL
 function cdsreconciletable(text,text,text,text,boolean) line 197 at
 EXECUTE statement
 SQL statement SELECT* FROM CDSReconcileTable(
   t.CDSServer,
   t.CDSDatabase,
   t.SchemaName,
   t.TableName)
 PL/pgSQL function cdsreconcileruntable(bigint) line 35 at SQL statement


This was the first error? None of the 'could not find subXID' errors
beforehand?


 [cds2 32353 2015-01-16 04:40:57.814 CST 7549]WARNING:  did not find
 subXID 7553 in MyProc
 [cds2 32353 2015-01-16 04:40:57.814 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.018 CST 7549]WARNING:  you don't own a
 lock of type AccessShareLock
 [cds2 32353 2015-01-16 04:40:58.018 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]LOG:  could not send data
 to client: Broken pipe
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]STATEMENT:  SELECT
 CDSReconcileRunTable(1160)
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:
 ReleaseLockIfHeld: failed??
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:  you don't own a
 lock of type AccessShareLock
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:
 ReleaseLockIfHeld: failed??
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:  you don't own a
 lock of type AccessShareLock
 [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
 ReleaseLockIfHeld: failed??
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:  you don't own a
 lock of type AccessShareLock
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
 ReleaseLockIfHeld: failed??
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:  you don't own a
 lock of type ShareLock
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
 ReleaseLockIfHeld: failed??
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]ERROR:  failed to re-find
 shared lock object
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT:  PL/pgSQL
 function cdsreconcileruntable(bigint) line 35 during exception cleanup
 [cds2 32353 2015-01-16 04:40:58.027 CST 7549]STATEMENT:  SELECT
 CDSReconcileRunTable(1160)
 [cds2 32353 2015-01-16 04:40:58.027 CST 

Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Merlin Moncure
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund and...@2ndquadrant.com wrote:
 Is there any chance you can package this somehow so that others can run
 it locally? It looks hard to find the actual bug here without adding
 instrumentation to to postgres.

That's possible but involves a lot of complexity in the setup because
of the source database (SQL Server) dependency.

Thinking outside the box here, I'm going to migrate the source to
postgres.   This will rule out pl/sh, which is the only non-core
dependency, but will take some setup work on my end first.  If I can
still reproduce the error at that point, maybe we can go in this
direction, and it would make local reproduction easier anyway.

 [cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING:  page

 This was the first error? None of the 'could not find subXID' errors
 beforehand?

Yep, no caught exceptions or anything interesting in the log before that.

 Could you add an EmitErrorReport(); before the FlushErrorState() in
 pl_exec.c's exec_stmt_block()?

will do.

merlin




Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Robert Haas
On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Hmm, I understood Tom to be opposing the idea of a palloc variant that
 returns NULL on failure, and I understand you to be supporting it.
 But maybe I'm confused.

 Your understanding seems correct to me.  I was just saying that your
 description of Tom's argument to dislike the idea seemed at odds with
 what he was actually saying.

OK, that may be.  I'm not sure.

 Anyway, I support it.  I agree that there are
 systems (or circumstances?) where malloc is going to succeed and then
 the world will blow up later on anyway, but I don't think that means
 that an out-of-memory error is the only sensible response to a palloc
 failure; returning NULL seems like a sometimes-useful alternative.

 I do think that safe is the wrong suffix.  Maybe palloc_soft_fail()
 or palloc_null() or palloc_no_oom() or palloc_unsafe().

 I liked palloc_noerror() better myself FWIW.

I don't care for noerror() because it probably still will error in
some circumstances; just not for OOM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Alvaro Herrera
Robert Haas wrote:
 On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:

  I do think that safe is the wrong suffix.  Maybe palloc_soft_fail()
  or palloc_null() or palloc_no_oom() or palloc_unsafe().
 
  I liked palloc_noerror() better myself FWIW.
 
 I don't care for noerror() because it probably still will error in
 some circumstances; just not for OOM.

Yes, but that seems fine to me.  We have other functions with noerror
flags, and they can still fail under some circumstances -- just not if
the error is the most commonly considered scenario in which they fail.
The first example I found is LookupAggNameTypeNames(); there are many
more.  I don't think this causes any confusion in practice.

Another precedent we have is something like missing_ok as a flag name
in get_object_address() and other places; following that, we could have
this new function as palloc_oom_ok or something like that.  But it
doesn't seem an improvement to me.  (I'm pretty sure we all agree that
this must not be a flag to palloc but rather a new function.)

Of all the ones you proposed above, the one I like the most is
palloc_no_oom, but IMO palloc_noerror is still better.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Error check always bypassed in tablefunc.c

2015-01-16 Thread Michael Paquier
Hi all,

As mentioned in $subject, commit 08c33c4 of 2003 has made the
following block of code dead in tablefunc.c:1320 because level is
incremented to at least 1:
    /* First time through, do a little more setup */
    if (level == 0)
    {
        /*
         * Check that return tupdesc is compatible with the one we got
         * from the query, but only at level 0 -- no need to check more
         * than once
         */
        if (!compatConnectbyTupleDescs(tupdesc, spi_tupdesc))
            ereport(ERROR,
                    (errcode(ERRCODE_SYNTAX_ERROR),
                     errmsg("invalid return type"),
                     errdetail("Return and SQL tuple descriptions are "
                               "incompatible.")));
    }

A simple fix is to change level == 0 to level == 1 so that this
error is actually checked for, as in the attached patch. This issue was
spotted by Coverity.
Regards,
-- 
Michael
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 3388fab..878255e 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -1317,12 +1317,12 @@ build_tuplestore_recursively(char *key_fld,
 		StringInfoData chk_current_key;
 
 		/* First time through, do a little more setup */
-		if (level == 0)
+		if (level == 1)
 		{
 			/*
 			 * Check that return tupdesc is compatible with the one we got
-			 * from the query, but only at level 0 -- no need to check more
-			 * than once
+			 * from the query, but only at the first level -- no need to check
+			 * more than once
 			 */
 
 			if (!compatConnectbyTupleDescs(tupdesc, spi_tupdesc))



Re: [HACKERS] orangutan seizes up during isolation-check

2015-01-16 Thread Noah Misch
On Fri, Jan 16, 2015 at 05:43:44AM -0500, Dave Cramer wrote:
 On 16 January 2015 at 01:33, Noah Misch n...@leadboat.com wrote:
  Sure, done.  Dave, orangutan should now be able to pass with --enable-nls.
  Would you restore that option?
 
 I can, but is this for HEAD or all versions ?

All versions.




Re: [HACKERS] Error check always bypassed in tablefunc.c

2015-01-16 Thread Alvaro Herrera
Michael Paquier wrote:

 As mentioned in $subject, commit 08c33c4 of 2003 has made the
 following block of code dead in tablefunc.c:1320 because level is
 incremented to at least 1:
 /* First time through, do a little more setup */
 if (level == 0)
 {

Uh.  This means the error case has no test coverage ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] orangutan seizes up during isolation-check

2015-01-16 Thread Dave Cramer
On 16 January 2015 at 01:33, Noah Misch n...@leadboat.com wrote:

 On Thu, Jan 15, 2015 at 09:24:01AM -0500, Robert Haas wrote:
  On Thu, Jan 15, 2015 at 1:04 AM, Noah Misch n...@leadboat.com wrote:
   On Wed, Jan 14, 2015 at 04:48:53PM -0500, Peter Eisentraut wrote:
   What I'm seeing now is that the unaccent regression tests when run
 under
   make check-world abort with
  
   FATAL:  postmaster became multithreaded during startup
   HINT:  Set the LC_ALL environment variable to a valid locale.
  
   contrib/unaccent/Makefile sets NO_LOCALE=1, so that makes sense.  I
 expect the
   patch over here will fix it:
  
 http://www.postgresql.org/message-id/20150109063015.ga2491...@tornado.leadboat.com
 
  I just hit this same problem; are you going to commit that patch soon?
   It's rather annoying to have make check-world fail.

 Sure, done.  Dave, orangutan should now be able to pass with --enable-nls.
 Would you restore that option?


I can, but is this for HEAD or all versions ?


Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Andres Freund
On 2015-01-16 12:09:25 -0300, Alvaro Herrera wrote:
 Robert Haas wrote:
  On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera
  alvhe...@2ndquadrant.com wrote:
 
   I do think that safe is the wrong suffix.  Maybe palloc_soft_fail()
   or palloc_null() or palloc_no_oom() or palloc_unsafe().
  
   I liked palloc_noerror() better myself FWIW.
  
  I don't care for noerror() because it probably still will error in
  some circumstances; just not for OOM.
 
 Yes, but that seems fine to me.  We have other functions with noerror
 flags, and they can still fail under some circumstances -- just not if
 the error is the most commonly considered scenario in which they fail.

We rely on palloc erroring out on large allocations in a couple places
as a crosscheck. I don't think this argument holds much water.

 The first example I found is LookupAggNameTypeNames(); there are many
 more.  I don't think this causes any confusion in practice.
 
 Another precendent we have is something like missing_ok as a flag name
 in get_object_address() and other places; following that, we could have
 this new function as palloc_oom_ok or something like that.  But it
 doesn't seem an improvement to me.  (I'm pretty sure we all agree that
 this must not be a flag to palloc but rather a new function.)
 
 Of all the ones you proposed above, the one I like the most is
 palloc_no_oom, but IMO palloc_noerror is still better.

Neither seems very accurate. no_oom isn't true because they actually can
cause OOMs. _noerror isn't true because they can error out - we
e.g. rely on palloc erroring out when reading toast tuples (to detect
invalid datum lengths) and during parsing of WAL as an additional
defense.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Alvaro Herrera
Andres Freund wrote:
 On 2015-01-16 12:09:25 -0300, Alvaro Herrera wrote:
  Robert Haas wrote:
   On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera
   alvhe...@2ndquadrant.com wrote:
  
I do think that safe is the wrong suffix.  Maybe palloc_soft_fail()
or palloc_null() or palloc_no_oom() or palloc_unsafe().
   
I liked palloc_noerror() better myself FWIW.
   
   I don't care for noerror() because it probably still will error in
   some circumstances; just not for OOM.
  
  Yes, but that seems fine to me.  We have other functions with noerror
  flags, and they can still fail under some circumstances -- just not if
  the error is the most commonly considered scenario in which they fail.
 
 We rely on palloc erroring out on large allocations in a couple places
 as a crosscheck. I don't think this argument holds much water.

I don't understand what that has to do with it.  Surely we're not going
to have palloc_noerror() not error out when presented with a huge
allocation.  My point is just that the noerror bit in palloc_noerror()
means that it doesn't fail in OOM, and that there are other causes for
it to error.


One thought I just had is that we also have MemoryContextAllocHuge; are
we going to consider a mixture of both things in the future, i.e. allow
huge allocations to return NULL when OOM?  It sounds a bit useless
currently, but it doesn't seem extremely far-fetched that we will need
further flags in the future.  (Or, perhaps, we will want to have code
that retries a Huge allocation that returns NULL with a smaller size,
just in case it does work.)  Maybe what we need is to turn these things
into flags to a new generic function.  Furthermore, I question whether
we really need a palloc variant -- I mean, can we live with just the
MemoryContext API instead?  As with the Huge variant (which does not
have a corresponding palloc equivalent), possible use cases seem very
limited so there's probably not much point in providing a shortcut.

So how about something like

#define ALLOCFLAG_HUGE  0x01
#define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
void *
MemoryContextAllocFlags(MemoryContext context, Size size, int flags);

and perhaps even

#define MemoryContextAllocHuge(cxt, sz) \
MemoryContextAllocFlags(cxt, sz, ALLOCFLAG_HUGE)
for source-level compatibility.
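
A hypothetical caller of the flag-based API could then degrade gracefully
instead of erroring out, along these lines (a sketch only; the function and
parameter names are illustrative):

    /* sketch of a caller under the proposed flags API */
    static void *
    alloc_with_fallback(MemoryContext cxt, Size request_size, Size fallback_size)
    {
        void   *buf;

        /* first try the big request, without erroring out on OOM */
        buf = MemoryContextAllocFlags(cxt, request_size,
                                      ALLOCFLAG_HUGE | ALLOCFLAG_NO_ERROR_ON_OOM);
        if (buf == NULL)
            buf = MemoryContextAllocFlags(cxt, fallback_size,
                                          ALLOCFLAG_NO_ERROR_ON_OOM);
        return buf;
    }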


(Now we all agree that palloc() itself is a very hot spot and shouldn't
be touched at all.  I don't think these new functions are used as commonly
as that, so the fact that they are slightly slower shouldn't be too
troublesome.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL

2015-01-16 Thread Andres Freund
On 2015-01-16 15:16:20 +0100, Andreas Karlsson wrote:
 Indeed. As Noah and I discussed previously in this thread we would need to
 do quite a bit of refactoring of ruleutils.c to make it fully MVCC.

Right.

 For this reason I opted to only lower the lock levels of ADD and ALTER
 TRIGGER, and not DROP TRIGGER. Neither of those require MVCC of then
 WHEN clause.

I'm unconvinced that this is true. Using a snapshot for part of getting
a definition certainly opens the door for getting strange
results.

Acquiring a lock that prevents schema changes on the table and then
getting the definition using the syscaches guarantees that that
definition is at least self-consistent, because no further schema changes
are possible and the catalog caches will be up to date.

What you're doing though is doing part of the scan using the
transaction's snapshot (as used by pg_dump that will usually be a
repeatable read snapshot and possibly quite old) and the other using a
fresh catalog snapshot. This can result in rather odd things.

Just consider:
S1: CREATE TABLE flubber(id serial primary key, data text);
S1: CREATE FUNCTION blarg() RETURNS TRIGGER LANGUAGE plpgsql AS $$BEGIN RETURN 
NEW; END;$$;
S1: CREATE TRIGGER flubber_blarg BEFORE INSERT ON flubber FOR EACH ROW WHEN 
(NEW.data IS NOT NULL) EXECUTE PROCEDURE blarg();
S2: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
S2: SELECT 'dosomethingelse';
S1: ALTER TABLE flubber RENAME TO wasflubber;
S1: ALTER TABLE wasflubber RENAME COLUMN data TO wasdata;
S1: ALTER TRIGGER flubber_blarg ON wasflubber RENAME TO wasflubber_blarg;
S1: ALTER FUNCTION blarg() RENAME TO wasblarg;
S2: SELECT pg_get_triggerdef(oid) FROM pg_trigger;

This will give you: The old trigger name. The new table name. The new
column name. The new function name.

I don't think using a snapshot for tiny parts of these functions
actually buys anything. Now, this isn't a pattern you introduced. But I
think we should think about this for a second before expanding it
further.

Before you argue that this isn't relevant for pg_dump: It is. Precisely
the above can happen - just replace the 'dosomethingelse' with several
LOCK TABLEs as pg_dump does. The first blocks after a snapshot has been
acquired. While waiting, all the ALTERs happen.

Arguably the benefit here is that the window for this issue is becoming
smaller as pg_dump (and hopefully other possible callers) acquire
exclusive locks on the table. I.e. that the lowering of the lock level
doesn't introduce new races. But on the other side of the coin, this
makes the result of pg_get_triggerdef() even *more* inaccurate in many
cases.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Andres Freund
On 2015-01-16 12:56:18 -0300, Alvaro Herrera wrote:
 Andres Freund wrote:
  We rely on palloc erroring out on large allocations in a couple places
  as a crosscheck. I don't think this argument holds much water.
 
 I don't understand what that has to do with it.  Surely we're not going
 to have palloc_noerror() not error out when presented with a huge
 allocation.

Precisely. That means it *does* error out in a somewhat expected path.

 My point is just that the noerror bit in palloc_noerror() means that
 it doesn't fail in OOM, and that there are other causes for it to
 error.

That description pretty much describes why it's a misnomer, no?

 One thought I just had is that we also have MemoryContextAllocHuge; are
 we going to consider a mixture of both things in the future, i.e. allow
 huge allocations to return NULL when OOM?

I definitely think we should. I'd even say that the usecase is larger
for huge allocations. It'd e.g. be rather nice to first try sorting with
the huge 16GB work mem and then try 8GB/4/1GB if that fails.

 It sounds a bit useless
 currently, but it doesn't seem extremely far-fetched that we will need
 further flags in the future.  (Or, perhaps, we will want to have code
 that retries a Huge allocation that returns NULL with a smaller size,
 just in case it does work.)  Maybe what we need is to turn these things
 into flags to a new generic function.  Furthermore, I question whether
 we really need a palloc variant -- I mean, can we live with just the
 MemoryContext API instead?  As with the Huge variant (which does not
 have a corresponding palloc equivalent), possible use cases seem very
 limited so there's probably not much point in providing a shortcut.

I'm fine with not providing a palloc() equivalent, but I also am fine
with having it.

 So how about something like
 
 #define ALLOCFLAG_HUGE0x01
 #define ALLOCFLAG_NO_ERROR_ON_OOM 0x02
 void *
 MemoryContextAllocFlags(MemoryContext context, Size size, int flags);
 
 and perhaps even
 
 #define MemoryContextAllocHuge(cxt, sz) \
   MemoryContextAllocFlags(cxt, sz, ALLOCFLAG_HUGE)
 for source-level compatibility.

I don't know, this seems a bit awkward to use.  Your earlier example with
the *Huge variant that returns a smaller allocation doesn't really
convince me - that'd need a separate API anyway.

I definitely do not want to push the nofail stuff via the
MemoryContextData-> API into aset.c. Imo aset.c should always return
NULL and then mcxt.c should throw the error in the normal palloc()
function.
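
For illustration only, a minimal sketch of that split, assuming the aset.c
callback (context->methods->alloc) is changed to return NULL on failure;
the flag name is taken from the proposal above, the rest is made up:

/* sketch: the wrapper in mcxt.c decides whether OOM is an error */
void *
MemoryContextAllocFlags(MemoryContext context, Size size, int flags)
{
    void       *ret;

    /* assumed: aset.c's alloc callback returns NULL instead of erroring out */
    ret = (*context->methods->alloc) (context, size);
    if (ret == NULL && (flags & ALLOCFLAG_NO_ERROR_ON_OOM) == 0)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory"),
                 errdetail("Failed on request of size %zu.", size)));

    return ret;
}

(Size-limit checks for the HUGE case are omitted in this sketch.)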

 (Now we all agree that palloc() itself is a very hot spot and shouldn't
 be touched at all.  I don't think these new functions are used as commonly
 as that, so the fact that they are slightly slower shouldn't be too
 troublesome.)

Yea, the speed of the new functions really shouldn't matter.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread Sawada Masahiko
On Fri, Jan 16, 2015 at 12:54 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Jan 15, 2015 at 9:48 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:
 On Thu, Jan 15, 2015 at 2:02 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
 
  One thought I have in this line is that currently there doesn't seem to
  be
  a way to know if the setting has an entry both in postgresql.conf and
  postgresql.auto.conf, if we can have some way of knowing the same
  (pg_settings?), then it could be convenient for user to decide if the
  value
  in postgresql.auto.conf is useful or not and if it's not useful then use
  Alter System .. Reset command to remove the same from
  postgresql.auto.conf.

 I think one way is that pg_settings has the file name for each variable. But
 it would not reflect the current status of postgresql.conf, so we would need
 to parse postgresql.conf again at that time.


 Yeah that could be a possibility, but I think that will break the existing
 command('s) as this is the common infrastructure used for SHOW ..
 commands as well which displays the guc value that is used by
 current session rather than the value in postgresql.conf.

You're right.
pg_settings and the SHOW command use the value in the current session rather
than the value in the config file.
It might break this common infrastructure.


 I don't know how appealing it would be to others, but a new view
 like pg_file_settings which would display the settings in file could
 be meaningful for your need.

 Another way is user can do pg_reload_conf() to see the latest
 values (excluding values for server startup time parameters).


I prefer the former, because the latter would not handle all GUC
variables, as you said.
A new view like pg_file_settings would show each GUC variable together with
the config file that contains the corresponding entry.
It would be a helpful view for something like ALTER SYSTEM ... RESET, and
that command might be a beneficial feature for users who want to manage the
configuration file manually, so I would like to propose them.

Regards,

---
Sawada Masahiko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Safe memory allocation functions

2015-01-16 Thread Alvaro Herrera
Andres Freund wrote:
 On 2015-01-16 12:56:18 -0300, Alvaro Herrera wrote:

  So how about something like
  
  #define ALLOCFLAG_HUGE  0x01
  #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
  void *
  MemoryContextAllocFlags(MemoryContext context, Size size, int flags);

 I don't know, this seems a bit awkward to use.  Your earlier example with
 the *Huge variant that returns a smaller allocation doesn't really
 convince me - that'd need a separate API anyway.

What example was that?  My thinking was that the mcxt.c function would
return NULL if the request was not satisfied; only the caller would be
entitled to retry with a smaller size.  I was thinking of something like

baseflag = ALLOCFLAG_NO_ERROR_ON_OOM;
reqsz = SomeHugeValue;
while (true)
{
ptr = MemoryContextAllocFlags(cxt, reqsz,
ALLOCFLAG_HUGE | baseflag);
if (ptr != NULL)
break;  /* success */

/* too large, retry with a smaller allocation */
reqsz *= 0.75;

/* if under some limit, have it fail next time */
if (reqsz < SomeHugeValue * 0.1)
baseflag = 0;
}
/* by here, you know ptr points to a memory area of size reqsz, which is
   between SomeHugeValue * 0.1 and SomeHugeValue. */


Were you thinking of something else?

 I definitely do not want to push the nofail stuff via the
 MemoryContextData-> API into aset.c. Imo aset.c should always return
 NULL and then mcxt.c should throw the error in the normal palloc()
 function.

Sure, that seems reasonable ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] speedup tidbitmap patch: cache page

2015-01-16 Thread Andres Freund
On 2014-12-25 01:26:53 +1300, David Rowley wrote:
 So I think v3 is the one to go with, and I can't see any problems with it,
 so I'm marking it as ready for committer.

And committed.

Thanks Teodor and David.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC

2015-01-16 Thread Marco Nenciarini
On 14/01/15 17:22, Gabriele Bartolini wrote:
 
 My opinion, Marco, is that for version 5 of this patch, you:
 
 1) update the information on the wiki (it is outdated - I know you have
 been busy with LSN map optimisation)

Done.

 2) modify pg_basebackup in order to accept a directory (or tar file) and
 automatically detect the LSN from the backup profile

New version of patch attached. The -I parameter now requires a backup
profile from a previous backup. I've added a sanity check that forbids
incremental file-level backups if the base timeline is different from
the current one.

 3) add the documentation regarding the backup profile and pg_basebackup
 

Next on my TODO list.

 Once we have all of this, we can continue trying the patch. Some
 unexplored paths are:
 
 * tablespace usage

I've improved my pg_restorebackup python PoC. It now supports tablespaces.

 * tar format
 * performance impact (in both read-only and heavily updated contexts)

From the server point of view, the current code generates a load similar
to a normal backup. It only adds an initial scan of each data file to
decide whether it has to send it. Once it finds a single newer page it
immediately stops scanning and starts sending the file. The IO impact
should not be that big due to the filesystem cache, but I agree with you
that it has to be measured.
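
Just to illustrate the idea (this is a sketch, not the patch itself: the
function name is made up and the real check is presumably hooked into
basebackup.c's file-sending path), the per-file decision boils down to:

/*
 * Scan a relation file block by block and stop at the first page whose
 * LSN is newer than the incremental start point.  If such a page exists,
 * the whole file is sent as in a normal base backup.
 */
static bool
file_has_newer_pages(FILE *fp, XLogRecPtr incremental_startpoint)
{
    char        buf[BLCKSZ];

    while (fread(buf, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (PageGetLSN((Page) buf) > incremental_startpoint)
            return true;
    }
    return false;
}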

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From f7cf8b9dd7d32f64a30dafaeeaeb56cbcd2eafff Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v5

Add backup profile to pg_basebackup
INCREMENTAL option implementation
---
 src/backend/access/transam/xlog.c  |   7 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 335 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  | 147 +--
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   4 +
 8 files changed, 473 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 629a457..1e50625 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*** XLogFileNameP(TimeLineID tli, XLogSegNo 
*** 9249,9255 
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
   char **labelfile)
  {
 	bool		exclusive = (labelfile == NULL);
--- 9249,9256 
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast,
!                    XLogRecPtr incremental_startpoint, TimeLineID *starttli_p,
                     char **labelfile)
  {
 	bool		exclusive = (labelfile == NULL);
*** do_pg_start_backup(const char *backupids
*** 9468,9473 
--- 9469,9478 
                     (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
     appendStringInfo(labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
                      (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
+    if (incremental_startpoint > 0)
+        appendStringInfo(labelfbuf, "INCREMENTAL FROM LOCATION: %X/%X\n",
+                         (uint32) (incremental_startpoint >> 32),
+                         (uint32) incremental_startpoint);
     appendStringInfo(labelfbuf, "BACKUP METHOD: %s\n",
                      exclusive ? "pg_start_backup" : "streamed");
     appendStringInfo(labelfbuf, "BACKUP FROM: %s\n",
diff --git a/src/backend/access/transam/xlogfuncs.c 
b/src/backend/access/transam/xlogfuncs.c
index 2179bf7..ace84d8 100644
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 59,65 
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
   errmsg(must be superuser or replication role to run a 
backup)));
  
!   startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
PG_RETURN_LSN(startpoint);
  }
--- 59,65 
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
   errmsg(must be superuser or replication role to run a 
backup)));
  
!   startpoint = do_pg_start_backup(backupidstr, fast, 0, NULL, NULL);
  
PG_RETURN_LSN(startpoint);
  }
diff --git a/src/backend/replication/basebackup.c 
b/src/backend/replication/basebackup.c
index 07030a2..05b19c5 100644

Re: [HACKERS] proposal: searching in array function - array_position

2015-01-16 Thread Jim Nasby

On 1/16/15 3:39 AM, Pavel Stehule wrote:

I am proposing a simple function, that returns a position of element in array.


Yes please!


FUNCTION array_position(anyarray, anyelement) RETURNS int


That won't work on a multi-dimensional array. Ideally it needs to accept a 
slice or an element and return the specifier for the slice.

This wouldn't be so bad if we had an easier way to extract subsets of an array, 
but even that is really ugly because whatever you extract keeps the original 
number of dimensions.


Implementation is simple (plpgsql code)


This would actually be written in C though, yes? Otherwise it's not really any 
better than just doing an extension...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporal features in PostgreSQL

2015-01-16 Thread pe...@vanroose.be
What's the current status of this topic?
Has someone worked on temporal tables for PostgreSQL since 2012?

I'm giving a presentation on Fosdem later this month in Brussels, on the
topic of temporal tables, and would like to give all possibly relevant
information to the audience!

--  Peter Vanroose,
 Leuven, Belgium.



--
View this message in context: 
http://postgresql.nabble.com/Temporal-features-in-PostgreSQL-tp5737881p5834312.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Pavel Stehule
Hi all,

Some time ago, I proposed a lock time measurement related to queries. The
main issue was the method of how to show this information. Today's proposal
is a little bit simpler, but still useful. We can show the total lock time
per database in the pg_stat_database statistics. A high number can be a
signal of lock issues.
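
To illustrate the kind of measurement I have in mind (only a sketch;
pgstat_report_lock_wait_time() is a hypothetical reporting call, not an
existing API), the heavyweight-lock wait path could be instrumented like:

    instr_time  wait_start,
                wait_time;

    INSTR_TIME_SET_CURRENT(wait_start);
    WaitOnLock(locallock, owner);       /* existing wait path in lock.c */
    INSTR_TIME_SET_CURRENT(wait_time);
    INSTR_TIME_SUBTRACT(wait_time, wait_start);

    /* hypothetical: accumulate microseconds into the current database's stats */
    pgstat_report_lock_wait_time(INSTR_TIME_GET_MICROSEC(wait_time));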

Comments, ideas, notices?

Regards

Pavel


Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2015-01-16 Thread Andres Freund
On 2014-12-15 19:38:16 +0300, Alex Shulgin wrote:
 Attached is the modified version of the original patch by Craig,
 addressing the handling of the new hint_log error data field and
 removing the client-side HINT.

I'm not a big fan of this implementation. We're adding a fair bit of
infrastructure (i.e. server-only hints) where in other places we use
NOTICE for similar hints.

Why don't we just emit a NOTICE or WARNING in the relevant place
saying that pg_hba.conf is outdated? Then the server won't log those if
configured appropriately, which doesn't seem like a bad thing. Note that
>= ERROR messages aren't sent to the client during authentication.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-16 Thread Jim Nasby

On 1/16/15 3:45 AM, Pavel Stehule wrote:

I am returning back to processing records in plpgsql.

I am thinking so it can be simply processed with transformations to array.

Now we have similar functions - hstore(row), row_to_json, ... but using of 
these functions can be a useless step. Any row variable can be transformed to 
2D text array.


How is it useless? Why wouldn't you just use JSON and be done with it?

Do you have some use cases you can share?


There two possible transformations:

row_to_array -- [[key1, value1],[key2, value2], ...]
row_to_row_array -- [(key1, value1), (key2, value2), ... ]


If we're going to go that route, I think it makes more sense to create an 
actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return 
an array of that.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2015-01-16 Thread Andres Freund
On 2015-01-16 18:01:24 +0100, Andres Freund wrote:
 Why don't we just emit a NOTICE or WARNING in the relevant place
 saying that pg_hba.conf is outdated? Then the server won't log those if
 configured appropriately, which doesn't seem like a bad thing. Note that
 >= ERROR messages aren't sent to the client during authentication.

I think a 'WARNING: client authentication failed/succeeded with an
outdated pg_hba.conf in effect' would actually be a good way to present
this. It's not only failed logins where this is relevant.
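
Purely as a sketch of what that could look like at the end of
ClientAuthentication() (hba_file_changed() is a hypothetical check for
"pg_hba.conf modified since it was last loaded", and the elevel is exactly
what's being debated here):

    if (hba_file_changed())
        ereport(WARNING,
                (errmsg("client authentication completed with an outdated pg_hba.conf in effect"),
                 errhint("Reload the server configuration to apply pending pg_hba.conf changes.")));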

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind in contrib

2015-01-16 Thread Heikki Linnakangas

On 01/16/2015 03:30 AM, Peter Eisentraut wrote:

Here is a random bag of comments for the v5 patch:


Addressed most of your comments, and Michael's. Another version 
attached. More on a few of the comments below.



The option --source-server had me confused at first, because the entry
above under --source-pgdata also talks about a source server.  Maybe
--source-connection would be clearer?


Hmm. I would rather emphasize that the source is a running server, than 
the connection to the server. I can see the confusion, though. What 
about phrasing it as:


--source-pgdata

  Specifies path to the data directory of the source server, to
  synchronize the target with. When <option>--source-pgdata</option> is used,
  the source server must be cleanly shut down.

The point is that whether you use --source-pgdata or --source-server, 
the source is a PostgreSQL server. Perhaps it should be mentioned here 
that you only need one of --source-pgdata and --source-server, even 
though the synopsis says that too.


Another idea is to rename --source-server to just --source. That would 
make sense if we assume that it's more common to connect to a live server:


pg_rewind --target mypgdata --source host=otherhost user=dba

pg_rewind --target mypgdata --source-pgdata /other/pgdata



Reference pages have standardized top-level headers, so Theory of
operation should be under something like Notes.

Similarly for Restrictions, but that seems important enough to go
into the description.


Oh. That's rather limiting. I'll rename the Restrictions to Notes - 
that seems to be where we have listed limitations like that in many 
other pages. I also moved Theory of operation as a How it works 
subsection under Notes.


The top-level headers aren't totally standardized in man pages. Googling 
around, a few seem to have a section called IMPLEMENTATION NOTES, 
which would be a good fit here. We don't have such sections currently, 
but how about it?



There should be an installcheck target.


Doesn't seem appropriate, as there are no regression tests that would 
run against an existing cluster. Or should it just use the installed 
pg_rewind and postgres binary, but create the temporary clusters all the 
same?



RewindTest.pm should be in the t/ directory.


I used a similar layout in src/test/ssl, so that ought to be changed too 
if we're not happy with it.


make check will not find the module if I just move it to t/, so that 
would require changes to Makefiles or something. I don't know how to do 
that offhand.



Instead of FILE_TYPE_DIRECTORY etc., why not use S_IFDIR etc.?


I don't see what that would buy us. Most of the code only knows about a 
few S_IF* types, namely regular files, directories and symlinks. Those 
are the types that there are FILE_TYPE_* codes for. If we started using 
the more general S_IF* constants, it would not be clear which types can 
appear in which parts of the code. Also, the compiler would no longer 
warn about omitting one of the types in a switch (file->type) statement.



TestLib.pm addition command_is sounds a bit wrong.  It's evidently
modelled after command_like, but that now sounds wrong too.  How about
command_stdout_is?


Works for me. How about also renaming command_like to 
command_stdout_like, and committing that along with the new 
command_stdout_is function as a separate patch, before pg_rewind?



The test suite needs to silence all non-TAP output.  So psql needs to
be run with -q, pg_ctl with -s, etc.  Any important output needs to be
through diag() or note().

Test cases like

ok 6 - psql -A -t --no-psqlrc -c port=10072 -c SELECT datname FROM pg_database 
exit code 0

should probably get a real name.

The whole structure of the test suite still looks too much like the
old hack.  I'll try to think of other ways to structure it.


Yeah. It currently works with callback functions, like:

test program:
- call RewindTest::run_rewind_test
 set up both clusters
 - call before_standby callback
 start standby
 - call standby_following_master callback
 promote standby
 - call after_promotion callback
 stop master
 run pg_rewind
 restart master
 - call after_rewind callback

It might be better to turn the control around, so that all the code 
that's now in the callbacks is in the test program's main flow, and the 
test program calls functions in RewindTest.pm to execute the parts that 
are currently between the callbacks:


test program
  - call RewindTest::setup_cluster
  do stuff that's now in before_standby callback
  - call RewindTest::start_standby
  do stuff that's now in standby_following_master callback
  - call RewindTest::promote_standby
  do stuff that's now in after_promotion callback
  - call RewindTest::run_pg_rewind
  do stuff that's now in after_rewind callback

- Heikki



pg_rewind-bin-6.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] speedup tidbitmap patch: cache page

2015-01-16 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-12-25 01:26:53 +1300, David Rowley wrote:
 So I think v3 is the one to go with, and I can't see any problems with it,
 so I'm marking it as ready for committer.

 And committed.

It strikes me that this patch leaves some lookups on the table,
specifically that it fails to avoid repeated hash_search lookups
inside tbm_page_is_lossy() in the situation where we're adding
new TIDs to an already-lossified page.  Is it worth adding a few
more lines to handle that case as well?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: searching in array function - array_position

2015-01-16 Thread Pavel Stehule
2015-01-16 17:57 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 3:39 AM, Pavel Stehule wrote:

 I am proposing a simple function, that returns a position of element in
 array.


 Yes please!

  FUNCTION array_position(anyarray, anyelement) RETURNS int


 That won't work on a multi-dimensional array. Ideally it needs to accept a
 slice or an element and return the specifier for the slice.


It is a question what the result should be - probably there can be a
multidimensional variant, where the result will be an array

array_position([1,2,3],2) -- 2
array_position([[1,2],[2,3],[3,4]], [2,3]) -- 2 /* 2nd parameter should
have N-1 dimensions of the first parameter */
array_position_md([1,2,3],2) -- [2]
array_position_md([[1,2],[2,3],[3,4]], 2) -- [2,1]

another question is how to handle more than one occurrence of a value -
probably two sets of functions - the first returns the first occurrence of
the value, the second returns the set of all occurrences



 This wouldn't be so bad if we had an easier way to extract subsets of an
 array, but even that is really ugly because whatever you extract keeps the
 original number of dimensions.

  Implementation is simple (plpgsql code)


 This would actually be written in C though, yes? Otherwise it's not really
 any better than just doing an extension...


Sure, I expect a C implementation
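
Roughly, such a C implementation could look like the sketch below (for
illustration only: it handles one-dimensional arrays, skips NULL elements,
and uses datumIsEqual() instead of the type's real equality operator;
needs the usual fmgr/array/lsyscache/datum headers):

Datum
array_position(PG_FUNCTION_ARGS)
{
    ArrayType  *arr = PG_GETARG_ARRAYTYPE_P(0);
    Datum       needle = PG_GETARG_DATUM(1);
    Oid         elmtype = ARR_ELEMTYPE(arr);
    int16       typlen;
    bool        typbyval;
    char        typalign;
    Datum      *elems;
    bool       *nulls;
    int         nelems;
    int         i;

    get_typlenbyvalalign(elmtype, &typlen, &typbyval, &typalign);
    deconstruct_array(arr, elmtype, typlen, typbyval, typalign,
                      &elems, &nulls, &nelems);

    for (i = 0; i < nelems; i++)
    {
        if (!nulls[i] && datumIsEqual(needle, elems[i], typbyval, typlen))
            PG_RETURN_INT32(i + 1);     /* array positions are 1-based */
    }

    PG_RETURN_NULL();
}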

Pavel


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] speedup tidbitmap patch: cache page

2015-01-16 Thread Andres Freund
On 2015-01-16 12:15:35 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-12-25 01:26:53 +1300, David Rowley wrote:
  So I think v3 is the one to go with, and I can't see any problems with it,
  so I'm marking it as ready for committer.
 
  And committed.
 
 It strikes me that this patch leaves some lookups on the table,
 specifically that it fails to avoid repeated hash_search lookups
 inside tbm_page_is_lossy() in the situation where we're adding
 new TIDs to an already-lossified page.  Is it worth adding a few
 more lines to handle that case as well?

There was an alternative version (v2.3 in 549950fb.2050...@sigaev.ru) of
the patch that cached the lossiness of a page, but Teodor/David didn't
find it to be beneficial in their benchmarking.

Teodor's argument was basically that it's completely lost in the noise
due to the much bigger overhead of rechecks.

I thought it'd be better to get this improvement committed rather than
waiting for someone to find an even bigger improvement for some case.

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2015-01-16 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 Why don't we just emit a NOTICE or WARNING in the relevant place
 saying that pg_hba.conf is outdated? Then the server won't log those if
 configured appropriately, which doesn't seem like a bad thing. Note that
 >= ERROR messages aren't sent to the client during authentication.

I think people felt that sending that information to the client wouldn't
be a good idea security-wise.  But I'd phrase it as "why not just emit
a LOG message?".  I agree that new logging infrastructure seems like
overkill.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-16 Thread Pavel Stehule
2015-01-16 18:03 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 3:45 AM, Pavel Stehule wrote:

 I am returning back to processing records in plpgsql.

 I am thinking so it can be simply processed with transformations to array.

 Now we have similar functions - hstore(row), row_to_json, ... but using
 of these functions can be a useless step. Any row variable can be
 transformed to 2D text array.


 How is it useless? Why wouldn't you just use JSON and be done with it?


We can use a FOREACH IN ARRAY iteration in plpgsql (a second variant is an
implementation of FOREACH for jsonb)

so ROW->ARRAY is shorter than ROW->JSON->ARRAY or ROW->HSTORE->ARRAY



 Do you have some use cases you can share?


processing of NEW, OLD variables in triggers



  There two possible transformations:

 row_to_array -- [[key1, value1],[key2, value2], ...]
 row_to_row_array -- [(key1, value1), (key2, value2), ... ]


 If we're going to go that route, I think it makes more sense to create an
 actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and
 return an array of that.


ok


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Jim Nasby

On 1/16/15 11:00 AM, Pavel Stehule wrote:

Hi all,

some time ago, I proposed a lock time measurement related to query. A main 
issue was a method, how to show this information. Today proposal is little bit 
simpler, but still useful. We can show a total lock time per database in 
pg_stat_database statistics. High number can be signal about lock issues.


Would this not use the existing stats mechanisms? If so, couldn't we do this per table? 
(I realize that won't handle all cases; we'd still need a lock_time_other 
somewhere).

Also, what do you mean by 'lock'? Heavyweight? We already have some visibility 
there. What I wish we had was some way to know if we're spending a lot of time 
in a particular non-heavy lock. Actually measuring time probably wouldn't make 
sense but we might be able to count how often we fail initial acquisition or 
something.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2015-01-16 Thread Andres Freund
On 2015-01-16 12:21:13 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  Why don't we just emit a NOTICE or WARNING in the relevant place
  saying that pg_hba.conf is outdated? Then the server won't log those if
  configured appropriately, which doesn't seem like a bad thing. Note that
  >= ERROR messages aren't sent to the client during authentication.
 
 I think people felt that sending that information to the client wouldn't
 be a good idea security-wise.

It won't if issued during the right phase of the authentication:
/*
 * client_min_messages is honored only after we complete the
 * authentication handshake.  This is required both for security
 * reasons and because many clients can't handle NOTICE messages
 * during authentication.
 */
if (ClientAuthInProgress)
output_to_client = (elevel >= ERROR);
else
output_to_client = (elevel >= client_min_messages ||
elevel == INFO);
}

Surely deserves a comment on the emitting site.

 But I'd phrase it as why not just emit a LOG message?.

Well, LOGs can be sent to the client just the same, no? Just requires a
nondefault client_min_messages.

But as I don't think sending logs to the client is a unsurmountable
problem (due to the above) I don't really care if we use WARNING or LOG.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-16 Thread Robert Haas
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 * It has been repeatedly pointed out that we may want to decouple
 partitioning from inheritance because implementing partitioning as an
 extension of inheritance mechanism means that we have to keep all the
 existing semantics which might limit what we want to do with the special
 case of it which is partitioning; in other words, we would find
 ourselves in difficult position where we have to inject a special case
 code into a very generalized mechanism that is inheritance.
 Specifically, do we regard a partitions as pg_inherits children of its
 partitioning parent?

I don't think this is totally an all-or-nothing decision.  I think
everyone is agreed that we need to not break things that work today --
e.g. Merge Append.  What that implies for pg_inherits isn't altogether
clear.

 * Syntax: do we want to make it similar to one of the many other
 databases out there? Or we could invent our own?

Well, what I think we don't want is something that is *almost* like
some other database but not quite.  I lean toward inventing our own
since I'm not aware of something that we'd want to copy exactly.

 I wonder if we could add a clause like DISTRIBUTED BY to complement
 PARTITION ON that represents a hash distributed/partitioned table (that
 could be a syntax to support sharded tables maybe; we would definitely
 want to move ahead in that direction I guess)

Maybe eventually, but let's not complicate things by worrying too much
about that now.

 * Catalog: We would like to have a catalog structure suitable to
 implement capabilities like multi-column partitioning, sub-partitioning
 (with arbitrary number of levels in the hierarchy). I had suggested
 that we create two new catalogs viz. pg_partitioned_rel,
 pg_partition_def to store metadata about a partition key of a
 partitioned relation and partition bound info of a partition,
 respectively. Also, see the point about on-disk representation of
 partition bounds

I'm not convinced that there is any benefit in spreading this
information across two tables neither of which exist today.  If the
representation of the partitioning scheme is going to be a node tree,
then there's no point in taking what would otherwise have been a List
and storing each element of it in a separate tuple. The overarching
point here is that the system catalog structure should be whatever is
most convenient for the system internals; I'm not sure we understand
what that is yet.

 * It is desirable to treat partitions as pg_class relations with perhaps
 a new relkind(s). We may want to choose an implementation where only the
 lowest level relations in a partitioning hierarchy have storage; those
 at the upper layers are mere placeholder relations though of course with
 associated constraints determined by partitioning criteria (with
 appropriate metadata entered into the additional catalogs).

I think the storage-less parents need a new relkind precisely to
denote that they have no storage; I am not convinced that there's any
reason to change the relkind for the leaf nodes.  But that's been
proposed, so evidently someone thinks there's a reason to do it.

 I am not
 quite sure if each kind of the relations involved in the partitioning
 scheme have separate namespaces and, if they are, how we implement that

I am in favor of having all of the nodes in the hierarchy have names
just as relations do today -- pg_class.relname.  Anything else seems
to me to be complex to implement and of very marginal benefit.  But
again, it's been proposed.

 * In the initial implementation, we could just live with partitioning on
 a set of columns (and not arbitrary expressions of them)

Seems quite fair.

 * We perhaps do not need multi-column LIST partitions as they are not
 very widely used and may complicate the implementation

I agree that the use case is marginal; but I'm not sure it needs to
complicate the implementation much.  Depending on how the
implementation shakes out, prohibiting it might come to seem like more
of a wart than allowing it.

 * There are a number of suggestions about how we represent partition
 bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
 associated with the relation itself), etc. Important point to consider
 here may be that partition key may contain more than one column

Yep.

 * How we represent partition definition in memory (for a given
 partitioned relation) - important point to remember is that such a
 representation should be efficient to iterate through or
 binary-searchable. Also see the points about tuple-routing and
 partition-pruning

Yep.

 * Overflow/catchall partition: it seems we do not want/need them. It
 might seem desirable for example in cases where a big transaction enters
 a large number of tuples all but one of which find a defined partition;
 we may not want to error out in such case but instead enter that erring
 tuple into the overflow partition 

Re: [HACKERS] speedup tidbitmap patch: cache page

2015-01-16 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2015-01-16 12:15:35 -0500, Tom Lane wrote:
 It strikes me that this patch leaves some lookups on the table,
 specifically that it fails to avoid repeated hash_search lookups
 inside tbm_page_is_lossy() in the situation where we're adding
 new TIDs to an already-lossified page.  Is it worth adding a few
 more lines to handle that case as well?

 There was a alternative version (v2.3 in 549950fb.2050...@sigaev.ru) of
 the patch that cached the lossyness of a page, but Teodor/David didn't
 find it to be beneficial in their benchmarking.

 Teodor's argument was basically that it's completely lost in the noise
 due to the much bigger overhead of rechecks.

That's a fair point, but on reflection it seems like a patch that covered
this case too wouldn't actually be any more complicated, so why not?
v2.3 is pretty brute-force and I agree it's not very attractive, but
I was thinking about something like

BlockNumber cached_blkno = InvalidBlockNumber;
PagetableEntry *page = NULL;

inside loop:

/* look up the target page unless we already have it */
if (blk != cached_blkno)
{
if (tbm_page_is_lossy(tbm, blk))
page = NULL;
else
page = tbm_get_pageentry(tbm, blk);
cached_blkno = blk;
}
if (page == NULL)
continue;  /* page is already marked lossy */

The reset after calling tbm_lossify() would just need to be
cached_blkno = InvalidBlockNumber.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Pavel Stehule
2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:00 AM, Pavel Stehule wrote:

 Hi all,

 some time ago, I proposed a lock time measurement related to query. A
 main issue was a method, how to show this information. Today proposal is
 little bit simpler, but still useful. We can show a total lock time per
 database in pg_stat_database statistics. High number can be signal about
 lock issues.


 Would this not use the existing stats mechanisms? If so, couldn't we do
 this per table? (I realize that won't handle all cases; we'd still need a
 lock_time_other somewhere).



it can use the currently existing stats mechanisms

I'm afraid it isn't possible to assign waiting time to a table - because it
depends on the order



 Also, what do you mean by 'lock'? Heavyweight? We already have some
 visibility there. What I wish we had was some way to know if we're spending
 a lot of time in a particular non-heavy lock. Actually measuring time
 probably wouldn't make sense but we might be able to count how often we
 fail initial acquisition or something.


now, when I am thinking about it, lock_time is not a good name - maybe
waiting lock time (lock time should not be interesting, waiting is
interesting) - it can be divided into some more categories - in GoodData we
use heavyweight, pages, and other categories.

 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] proposal: searching in array function - array_position

2015-01-16 Thread Jim Nasby

On 1/16/15 11:16 AM, Pavel Stehule wrote:



2015-01-16 17:57 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

On 1/16/15 3:39 AM, Pavel Stehule wrote:

I am proposing a simple function, that returns a position of element in 
array.


Yes please!

FUNCTION array_position(anyarray, anyelement) RETURNS int


That won't work on a multi-dimensional array. Ideally it needs to accept a 
slice or an element and return the specifier for the slice.


It is question, what is a result - probably, there can be a multidimensional 
variant, where result will be a array

array_position([1,2,3],2) -- 2
array_position([[1,2],[2,3],[3,4]], [2,3]) -- 2 /* 2nd parameter should to 
have N-1 dimension of first parameter */


The problem with that is you can't actually use '2' to get [2,3] back:

select (array[[1,2,3],[4,5,6],[7,8,9]])[1] IS NULL;
 ?column?
--
 t
(1 row)

I think the bigger problem here is we need something better than slices for 
handling subsets of arrays. Even if the function returned [2:2] it's still 
going to behave differently than it will in the non-array case because you 
won't be getting the expected number of dimensions back. :(


array_position_md([1,2,3],2) -- [2]
array_position_md([[1,2],[2,3],[3,4]], 2) -- [2,1]

another question is how to solve more than one occurrence on one value - 
probably two sets of functions - first returns first occurrence of value, 
second returns set of occurrence


Gee, if only we had some way to return multiple elements of something... ;P

In other words, I think all of these should actually return an array of 
positions. I think it's OK for someone that only cares about the first instance 
to just do [1].
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2015-01-16 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2015-01-16 12:21:13 -0500, Tom Lane wrote:
 I think people felt that sending that information to the client wouldn't
 be a good idea security-wise.

 It won't if issued during the right phase of the authentication:

Good point.

 But as I don't think sending logs to the client is a unsurmountable
 problem (due to the above) I don't really care if we use WARNING or LOG.

If we're expecting the DBA to need to read it in the postmaster log,
remember that LOG is actually *more* likely to get to the log than
WARNING is.  Choosing WARNING because you think it sounds more significant
is a mistake.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fillfactor for GIN indexes

2015-01-16 Thread Robert Haas
On Thu, Jan 15, 2015 at 7:06 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Alexander Korotkov wrote:
 I'm not sure. On the one hand it's unclear why fillfactor should be
 different from 9.4.
 On the other hand it's unclear why it should be different from btree.
 I propose marking this ready for committer. So, committer can make a final
 decision.
 OK let's do so then. My preference is to fully pack the index at
 build. GIN compression has been one of the headlines of 9.4.

I'm struggling to understand why we shouldn't just reject this patch.
On November 27th, Cedric said:

what are the benefits of this patch ? (maybe you had some test case
or a benchmark ?)

Nobody replied.  On January 15th, you (Michael) hypothesized that
this patch has value to control random updates on GIN indexes but
there seem to be absolutely no test results showing that any such
value exists.

There's only value in adding a fillfactor parameter to GIN indexes if
it improves performance.  There are no benchmarks showing it does.
So, why are we still talking about this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-16 Thread Jim Nasby

On 1/16/15 11:22 AM, Pavel Stehule wrote:



2015-01-16 18:03 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

On 1/16/15 3:45 AM, Pavel Stehule wrote:

I am returning back to processing records in plpgsql.

I am thinking so it can be simply processed with transformations to 
array.

Now we have similar functions - hstore(row), row_to_json, ... but using 
of these functions can be a useless step. Any row variable can be transformed 
to 2D text array.


How is it useless? Why wouldn't you just use JSON and be done with it?


We can use a FOREACH IN ARRAY iteration in plpgsql (second variant is a 
implementation FOREACH for jsonb)

so ROW-ARRAY is shorter than ROW-JSON-ARRAY or ROW-HSTORE-ARRAY


I think the real problem here is that we're inventing a bunch of different ways 
to do the same thing: iterate over a set. Instead of doing that, should we add 
the idea of an iterator to the type system? That would make sense for arrays, 
hstore, json and XML.


Do you have some use cases you can share?


processing of NEW, OLD variables in triggers


Note that last time I checked you couldn't do something like NEW.variable, and 
I don't think you could use EXEC to do it either. So there's more needed here 
than just converting a record to an array.


There two possible transformations:

row_to_array -- [[key1, value1],[key2, value2], ...]
row_to_row_array -- [(key1, value1), (key2, value2), ... ]


If we're going to go that route, I think it makes more sense to create an actual
key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an
array of that.


ok

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com





--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-01-16 Thread Andres Freund
Hi,

On 2015-01-05 17:16:43 +0900, Michael Paquier wrote:
 + <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
 +  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
 +  <indexterm>
 +    <primary><varname>restore_command_retry_interval</varname> recovery parameter</primary>
 +  </indexterm>
 +  </term>
 +  <listitem>
 +   <para>
 +    If <varname>restore_command</varname> returns nonzero exit status code, retry
 +    command after the interval of time specified by this parameter,
 +    measured in milliseconds if no unit is specified. Default value is
 +    <literal>5s</literal>.
 +   </para>
 +   <para>
 +    This command is particularly useful for warm standby configurations
 +    to leverage the amount of retries done to restore a WAL segment file
 +    depending on the activity of a master node.
 +   </para>

Don't really like this paragraph. What's "leverage the amount of
retries done" supposed to mean?

 +# restore_command_retry_interval
 +#
 +# specifies an optional retry interval for restore_command command, if
 +# previous command returned nonzero exit status code. This can be useful
 +# to increase or decrease the number of calls of restore_command.

It's not really the number but the frequency, right?

 +      else if (strcmp(item->name, "restore_command_retry_interval") == 0)
 +      {
 +          const char *hintmsg;
 +
 +          if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
 +                         &hintmsg))
 +              ereport(ERROR,
 +                      (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 +                       errmsg("parameter \"%s\" requires a temporal value",
 +                              "restore_command_retry_interval"),
 +                       hintmsg ? errhint("%s", _(hintmsg)) : 0));

"temporal value" sounds odd to my ears. I'd rather keep in line with
what guc.c outputs in such a case.

 +              now = GetCurrentTimestamp();
 +              if (!TimestampDifferenceExceeds(last_fail_time, now,
 +                                              restore_command_retry_interval))
                {
 -                  pg_usleep(1000000L * (5 - (now - last_fail_time)));
 -                  now = (pg_time_t) time(NULL);
 +                  long        secs;
 +                  int         microsecs;
 +
 +                  TimestampDifference(last_fail_time, now, &secs, &microsecs);
 +                  pg_usleep(restore_command_retry_interval * 1000L -
 +                            (1000000L * secs + 1L * microsecs));
 +                  now = GetCurrentTimestamp();
                }
                last_fail_time = now;
                currentSource = XLOG_FROM_ARCHIVE;

Am I missing something or does this handle/affect streaming failures
just the same? That might not be a bad idea, because the current
interval is far too long for e.g. a sync replication setup. But it
certainly makes the name a complete misnomer.

Not this patch's fault, but I'm getting a bit tired of seeing the above open
coded. How about adding a function that does the sleeping based on a
timestamptz and a ms interval?
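
Something along these lines, say (name and placement are just an
illustration, not a concrete proposal):

static void
WaitUntilIntervalElapsed(TimestampTz start, int interval_ms)
{
    TimestampTz now = GetCurrentTimestamp();
    long        secs;
    int         microsecs;

    if (TimestampDifferenceExceeds(start, now, interval_ms))
        return;                 /* interval already over, nothing to do */

    TimestampDifference(start, now, &secs, &microsecs);
    pg_usleep(interval_ms * 1000L - (secs * 1000000L + microsecs));
}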

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-16 Thread Pavel Stehule
2015-01-16 18:42 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:22 AM, Pavel Stehule wrote:



 2015-01-16 18:03 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 3:45 AM, Pavel Stehule wrote:

 I am returning back to processing records in plpgsql.

 I am thinking so it can be simply processed with transformations
 to array.

 Now we have similar functions - hstore(row), row_to_json, ... but
 using of these functions can be a useless step. Any row variable can be
 transformed to 2D text array.


 How is it useless? Why wouldn't you just use JSON and be done with it?


 We can use a FOREACH IN ARRAY iteration in plpgsql (second variant is a
 implementation FOREACH for jsonb)

 so ROW-ARRAY is shorter than ROW-JSON-ARRAY or ROW-HSTORE-ARRAY


 I think the real problem here is that we're inventing a bunch of different
 ways to do the same thing: iterate over a set. Instead of doing that,
 should we add the idea of an iterator to the type system? That would make
 sense for arrays, hstore, json and XML.


what do you think? How could this be implemented?





  Do you have some use cases you can share?


 processing of NEW, OLD variables in triggers


 Note that last time I checked you couldn't do something like NEW.variable,
 and I don't think you could use EXEC to do it either. So there's more
 needed here than just converting a record to an array.

  There two possible transformations:

 row_to_array -- [[key1, value1],[key2, value2], ...]
 row_to_row_array -- [(key1, value1), (key2, value2), ... ]


 If we're going to go that route, I think it makes more sense to
 create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html)
 and return an array of that.


 ok

 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com




 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Jim Nasby

On 1/16/15 11:35 AM, Pavel Stehule wrote:



2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

On 1/16/15 11:00 AM, Pavel Stehule wrote:

Hi all,

some time ago, I proposed a lock time measurement related to query. A 
main issue was a method, how to show this information. Today proposal is little 
bit simpler, but still useful. We can show a total lock time per database in 
pg_stat_database statistics. High number can be signal about lock issues.


Would this not use the existing stats mechanisms? If so, couldn't we do this per 
table? (I realize that won't handle all cases; we'd still need a 
lock_time_other somewhere).



it can use a current existing stats mechanisms

I afraid so isn't possible to assign waiting time to table - because it depends 
on order


Huh? Order of what?


Also, what do you mean by 'lock'? Heavyweight? We already have some 
visibility there. What I wish we had was some way to know if we're spending a 
lot of time in a particular non-heavy lock. Actually measuring time probably 
wouldn't make sense but we might be able to count how often we fail initial 
acquisition or something.


now, when I am thinking about it, lock_time is not good name - maybe waiting lock 
time (lock time should not be interesting, waiting is interesting) - it can be 
divided to some more categories - in GoodData we use Heavyweight, pages, and others 
categories.


So do you see this somehow encompassing locks other than heavyweight locks? 
Because I think that's the biggest need here. Basically, something akin to 
TRACE_POSTGRESQL_LWLOCK_WAIT_START() that doesn't depend on dtrace.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: searching in array function - array_position

2015-01-16 Thread Pavel Stehule
2015-01-16 18:37 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:16 AM, Pavel Stehule wrote:



 2015-01-16 17:57 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 3:39 AM, Pavel Stehule wrote:

 I am proposing a simple function, that returns a position of
 element in array.


 Yes please!

 FUNCTION array_position(anyarray, anyelement) RETURNS int


 That won't work on a multi-dimensional array. Ideally it needs to
 accept a slice or an element and return the specifier for the slice.


 It is question, what is a result - probably, there can be a
 multidimensional variant, where result will be a array

 array_position([1,2,3],2) -- 2
 array_position([[1,2],[2,3],[3,4]], [2,3]) -- 2 /* 2nd parameter should
 to have N-1 dimension of first parameter */


 The problem with that is you can't actually use '2' to get [2,3] back:

 select (array[[1,2,3],[4,5,6],[7,8,9]])[1] IS NULL;
  ?column?
 --
  t
 (1 row)


yes, but when you are searching for an array in an array you can use a full
slice selection:

postgres=# select (ARRAY[[1,2],[4,5]])[1][1:2]; -- [1:2] should be a
constant every time in this case -- so it should not be returned
  array
-
 {{1,2}}
(1 row)





 I think the bigger problem here is we need something better than slices
 for handling subsets of arrays. Even if the function returned [2:2] it's
 still going to behave differently than it will in the non-array case
 because you won't be getting the expected number of dimensions back. :(


you cannot return a slice and I don't propose it, although we could return
a range type or an array of range types - but still we cannot use ranges
for arrays.


  array_position_md([1,2,3],2) -- [2]
 array_position_md([[1,2],[2,3],[3,4]], 2) -- [2,1]

 another question is how to solve more than one occurrence on one value -
 probably two sets of functions - first returns first occurrence of value,
 second returns set of occurrence


 Gee, if only way had some way to return multiple elements of something...
 ;P

 In other words, I think all of these should actually return an array of
 positions. I think it's OK for someone that only cares about the first
 instance to just do [1].


there can be two functions - position returns the first occurrence and
positions returns all of them as an array



 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] Parallel Seq Scan

2015-01-16 Thread Robert Haas
On Wed, Jan 14, 2015 at 9:00 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 Simply doing
 something like blkno % num_workers is going to cause imbalances,

Yes.

 but trying
 to do this on a per-block basis seems like too much overhead.

...but no.  Or at least, I doubt it.  The cost of handing out blocks
one at a time is that, for each block, a worker's got to grab a
spinlock, increment and record the block number counter, and release
the spinlock.  Or, use an atomic add.  Now, it's true that spinlock
cycles and atomic ops can have sometimes impose severe overhead, but
you have to look at it as a percentage of the overall work being done.
In this case, the backend has to read, pin, and lock the page and
process every tuple on the page.  Processing every tuple on the page
may involve de-TOASTing the tuple (leading to many more page
accesses), or evaluating a complex expression, or hitting CLOG to
check visibility, but even if it doesn't, I think the amount of work
that it takes to process all the tuples on the page will be far larger
than the cost of one atomic increment operation per block.

As mentioned downthread, a far bigger consideration is the I/O pattern
we create.  A sequential scan is so-called because it reads the
relation sequentially.  If we destroy that property, we will be more
than slightly sad.  It might be OK to do sequential scans of, say,
each 1GB segment separately, but I'm pretty sure it would be a real
bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
BlockNumber last_block;
Size prefetch_distance;
Size prefetch_increment;
slock_t mutex;
BlockNumber next_prefetch_block;
BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block -
next_scan_block < prefetch_distance and also whether
next_prefetch_block < last_block.  If both are true, it prefetches
some number of additional blocks, as specified by prefetch_increment.
Otherwise, it increments next_scan_block and scans the block
corresponding to the old value.

So in this way, the prefetching runs ahead of the scan by a
configurable amount (prefetch_distance), which should be chosen so
that the prefetches have time to complete before the scan actually
reaches those blocks.  Right now, of course, we rely on the operating
system to prefetch for sequential scans, but I have a strong hunch
that may not work on all systems if there are multiple processes doing
the reads.

Now, what of other strategies like dividing up the relation into 1GB
chunks and reading each one in a separate process?  We could certainly
DO that, but what advantage does it have over this?  The only benefit
I can see is that you avoid accessing a data structure of the type
shown above for every block, but that only matters if that cost is
material, and I tend to think it won't be.  On the flip side, it means
that the granularity for dividing up work between processes is now
very coarse - when there are less than 6GB of data left in a relation,
at most 6 processes can work on it.  That might be OK if the data is
being read in from disk anyway, but it's certainly not the best we can
do when the data is in memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Pavel Stehule
2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:35 AM, Pavel Stehule wrote:



 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:00 AM, Pavel Stehule wrote:

 Hi all,

 some time ago, I proposed a lock time measurement related to
 query. A main issue was a method, how to show this information. Today
 proposal is little bit simpler, but still useful. We can show a total lock
 time per database in pg_stat_database statistics. High number can be signal
 about lock issues.


 Would this not use the existing stats mechanisms? If so, couldn't we
 do this per table? (I realize that won't handle all cases; we'd still need
 a lock_time_other somewhere).



 it can use a current existing stats mechanisms

 I afraid so isn't possible to assign waiting time to table - because it
 depends on order


 Huh? Order of what?


When you have a SELECT FROM T1, T2 and T1 is locked for time t1, and T2 is
locked for time t2 -- but if t2 < t1 then t2 is not important -- then what
should I count as the lock time for T1 and T2?

DDL statements are an exception - there the mapping between relations and the
reason for the lock wait is almost straightforward.



  Also, what do you mean by 'lock'? Heavyweight? We already have some
 visibility there. What I wish we had was some way to know if we're spending
 a lot of time in a particular non-heavy lock. Actually measuring time
 probably wouldn't make sense but we might be able to count how often we
 fail initial acquisition or something.


 now, when I am thinking about it, lock_time is not good name - maybe
 waiting lock time (lock time should not be interesting, waiting is
 interesting) - it can be divided to some more categories - in GoodData we
 use Heavyweight, pages, and others categories.


 So do you see this somehow encompassing locks other than heavyweight
 locks? Because I think that's the biggest need here. Basically, something
 akin to TRACE_POSTGRESQL_LWLOCK_WAIT_START() that doesn't depend on
 dtrace.


For these global statistics, what I see as important is a common total waiting
time for locks - we could use a more detailed granularity, but I am not sure
a common statistic is the best tool for that.

My motivation is: I look at the statistics and I can see ... lots of
rollbacks -- an issue; lots of deadlocks -- an issue; lots of waiting time --
an issue too. It is a tool for people who have no possibility to use dtrace
and similar tools, and for everyday usage - a simple check that locks are not
an issue (or that locking is stable).
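
For illustration, the kind of everyday check I have in mind would be nothing
more than (lock_time is the proposed column - it does not exist today):

SELECT datname, deadlocks, lock_time
  FROM pg_stat_database
 ORDER BY lock_time DESC;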



 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] infinite loop in _bt_getstackbuf

2015-01-16 Thread Robert Haas
On Thu, Jan 15, 2015 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote:
 I think that it might be a good idea to have circular _bt_moveright()
 moves (the direct offender in Merlin's case, which has very similar
 logic to your _bt_getstackbuf() problem case) detected. I'm pretty
 sure that it's exceptional for there to be more than 2 or 3 retries in
 _bt_moveright(). It would probably be fine to consider the possibility
 that we'll never finish once we get past 5 retries or something like
 that. We'd then start keeping track of blocks visited, and raise an
 error when a page was visited a second time.

Yeah, I could go for that.  Possibly somebody might object that it's a
lot of code that will never get tested in normal operation, but as
this problem doesn't seem to be strictly theoretical I'm not sure I
subscribe to that objection.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Pavel Stehule
2015-01-16 19:24 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



 2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:35 AM, Pavel Stehule wrote:



 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:00 AM, Pavel Stehule wrote:

 Hi all,

 some time ago, I proposed a lock time measurement related to
 query. A main issue was a method, how to show this information. Today
 proposal is little bit simpler, but still useful. We can show a total lock
 time per database in pg_stat_database statistics. High number can be signal
 about lock issues.


 Would this not use the existing stats mechanisms? If so, couldn't we
 do this per table? (I realize that won't handle all cases; we'd still need
 a lock_time_other somewhere).



 it can use a current existing stats mechanisms

 I afraid so isn't possible to assign waiting time to table - because it
 depends on order


 Huh? Order of what?


 when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is
 locked for t2 -- but if t2 < t1 then t2 is not important -- so what I have
 to cont as lock time for T1 and T2?

 DDL statements are exception - there is almost simple mapping between
 relations and lock time reason.



  Also, what do you mean by 'lock'? Heavyweight? We already have some
 visibility there. What I wish we had was some way to know if we're spending
 a lot of time in a particular non-heavy lock. Actually measuring time
 probably wouldn't make sense but we might be able to count how often we
 fail initial acquisition or something.


 now, when I am thinking about it, lock_time is not good name - maybe
 waiting lock time (lock time should not be interesting, waiting is
 interesting) - it can be divided to some more categories - in GoodData we
 use Heavyweight, pages, and others categories.


 So do you see this somehow encompassing locks other than heavyweight
 locks? Because I think that's the biggest need here. Basically, something
 akin to TRACE_POSTGRESQL_LWLOCK_WAIT_START() that doesn't depend on
 dtrace.


 For these global statistics I see as important a common total waiting time
 for locks - we can use a more detailed granularity but I am not sure, if a
 common statistics are best tool.

 My motivations is - look to statistics -- and I can see ... lot of
 rollbacks -- issue, lot of deadlocks -- issue, lot of waiting time -- issue
 too. It is tool for people without possibility to use dtrace and similar
 tools and for everyday usage - simple check if locks are not a issue (or if
 locking is stable).


And this proposal makes sense only for heavyweight locks - because the other
kinds of locks are taken everywhere.




 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com





Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Merlin Moncure
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
 On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote:
  On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote:
  Running this test on another set of hardware to verify -- if this
  turns out to be a false alarm which it may very well be, I can only
  offer my apologies!  I've never had a new drive fail like that, in
  that manner.  I'll burn the other hardware in overnight and report
  back.

 huh -- well possibly. not.  This is on a virtual machine attached to a
 SAN.  It ran clean for several (this is 9.4 vanilla, asserts off,
 checksums on) hours then the starting having issues:

 Damn.

 Is there any chance you can package this somehow so that others can run
 it locally? It looks hard to find the actual bug here without adding
 instrumentation to postgres.

FYI, a two-hour burn-in on my workstation on 9.3 ran with no issues.
An overnight run would probably be required to prove it, ruling out
both hardware and pl/sh.   If proven, it's possible we may be facing a
regression, perhaps a serious one.

ISTM the next step is to bisect the problem down over the weekend in
order to narrow the search.  If that doesn't turn up anything
productive I'll look into taking other steps.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Jim Nasby

On 1/16/15 12:30 PM, Pavel Stehule wrote:



2015-01-16 19:24 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

On 1/16/15 11:35 AM, Pavel Stehule wrote:



2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:00 AM, Pavel Stehule wrote:

 Hi all,

 some time ago, I proposed a lock time measurement related 
to query. A main issue was a method, how to show this information. Today 
proposal is little bit simpler, but still useful. We can show a total lock time 
per database in pg_stat_database statistics. High number can be signal about 
lock issues.


 Would this not use the existing stats mechanisms? If so, couldn't we do 
this per table? (I realize that won't handle all cases; we'd still need a 
lock_time_other somewhere).



it can use a current existing stats mechanisms

I afraid so isn't possible to assign waiting time to table - 
because it depends on order


Huh? Order of what?


when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is locked 
for t2 -- but if t2 < t1 then t2 is not important -- so what I have to cont as 
lock time for T1 and T2?


If that select is waiting on a lock on t2, then it's waiting on that lock on 
that table. It doesn't matter who else has the lock.


 Also, what do you mean by 'lock'? Heavyweight? We already have 
some visibility there. What I wish we had was some way to know if we're 
spending a lot of time in a particular non-heavy lock. Actually measuring time 
probably wouldn't make sense but we might be able to count how often we fail 
initial acquisition or something.


now, when I am thinking about it, lock_time is not good name - maybe 
waiting lock time (lock time should not be interesting, waiting is 
interesting) - it can be divided to some more categories - in GoodData we use 
Heavyweight, pages, and others categories.


So do you see this somehow encompassing locks other than heavyweight 
locks? Because I think that's the biggest need here. Basically, something akin 
to TRACE_POSTGRESQL_LWLOCK_WAIT___START() that doesn't depend on dtrace.


For these global statistics I see as important a common total waiting time 
for locks - we can use a more detailed granularity but I am not sure, if a 
common statistics are best tool.


Locks may be global, but what you're waiting for a lock on certainly isn't. 
It's almost always a lock either on a table or a row in a table. Of course this 
does mean you can't just blindly report that you're blocked on some XID; that 
doesn't tell anyone anything.


My motivations is - look to statistics -- and I can see ... lot of 
rollbacks -- issue, lot of deadlocks -- issue, lot of waiting time -- issue 
too. It is tool for people without possibility to use dtrace and similar tools 
and for everyday usage - simple check if locks are not a issue (or if locking 
is stable).


Meh. SELECT sum(state_change) FROM pg_stat_activity WHERE waiting is just about 
as useful. Or just turn on lock logging.
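
By lock logging I mean the existing log_lock_waits machinery, e.g.:

ALTER SYSTEM SET log_lock_waits = on;  -- logs waits longer than deadlock_timeout
SELECT pg_reload_conf();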

If you really want to add it at the database level I'm not opposed (so long as 
it leaves the door open for more granular locking later), but I can't really 
get excited about it either.


and this proposal has sense only for heavyweight locks - because others locks 
are everywhere


So what if they're everywhere? Right now if you're spending a lot of time 
waiting for LWLocks you have no way to know what's going on unless you happen 
to have dtrace. Obviously we're not going to do something like issue a stats 
update every time we attempt to acquire an LWLock, but that doesn't mean we 
can't keep some counters on the locks and periodically report that.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-16 Thread Andrew Dunstan


On 01/16/2015 12:22 PM, Pavel Stehule wrote:



There two possible transformations:

row_to_array -- [[key1, value1],[key2, value2], ...]
row_to_row_array -- [(key1, value1), (key2, value2), ... ]


If we're going to go that route, I think it makes more sense to
create an actual key/value type (ie:
http://pgxn.org/dist/pair/doc/pair.html) and return an array of that.


ok

http://BlueTreble.com




I think we'd possibly be better off with simply returning a flat array, 
[key1, value1, ...]


That's what the hstore(text[]) and json_object(text[]) functions 
accept, along with the 2D variant, if we want a precedent.
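
For reference, both shapes already work there, e.g.:

SELECT json_object('{a,1,b,2}');      -- flat key/value array
SELECT json_object('{{a,1},{b,2}}');  -- 2D variant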


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Decoding follows timelines

2015-01-16 Thread Andres Freund
On 2015-01-03 12:07:29 +0200, Heikki Linnakangas wrote:
 @@ -941,6 +936,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
   * Force a disconnect, so that the decoding code doesn't need to care
   * about an eventual switch from running in recovery, to running in a
   * normal environment. Client code is expected to handle reconnects.
 + * This covers the race condition where we are promoted half way
 + * through starting up.
   */
  if (am_cascading_walsender && !RecoveryInProgress())
  {
 
 We could exit recovery immediately after this check. Why is this check
 needed?

I probably wrote that check and I don't think it really is needed. I
think that's a remnant of what the physical counterpart used to do.

I think this needs slightly more abstraction because the infrastructure
is local to walsender.c - but logical decoding is also possible via
SQL. I'm not yet sure what that should look like. It'd be awesome if in
the course of that we could get rid of the nearly duplicated XLogRead()
:(

Simon, have you checked that this actually correctly follows timelines?
Afaics the patch as is won't allow starting logical decoding on a standby.

To allow logical decoding from clients I (apparently) wrote the
following comment:
/*
 * TODO: We got to change that someday soon...
 *
 * There's basically three things missing to allow this:
 * 1) We need to be able to correctly and quickly identify the timeline a
 *    LSN belongs to
 * 2) We need to force hot_standby_feedback to be enabled at all times so
 *    the primary cannot remove rows we need.
 * 3) support dropping replication slots referring to a database, in
 *    dbase_redo. There can't be any active ones due to HS recovery
 *    conflicts, so that should be relatively easy.
 */
if (RecoveryInProgress())
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("logical decoding cannot be used while in recovery")));

You're implementing 1) here. 3) doesn't look very challenging.

But 2) imo is rather more interesting/complex. I guess we'd have to
force that streaming replication is used, that a physical replication
slot is used and that hot_standby_feedback is enabled.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] advance local xmin more aggressively

2015-01-16 Thread Heikki Linnakangas

On 01/15/2015 09:57 PM, Robert Haas wrote:

On Thu, Jan 15, 2015 at 3:08 AM, Michael Paquier
michael.paqu...@gmail.com wrote:

On Mon, Dec 22, 2014 at 7:31 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Here's an updated version, rebased over the pairing heap code that I just
committed, and fixing those bugs.

So, are we reaching an outcome for the match happening here?


Well, I still like using the existing ResourceOwner pointers to find
the snapshots more than introducing a separate data structure just for
that.  But I'm OK with Heikki committing his version and, if anybody
notices the new code becoming a hotspot, we can revisit the issue.


Ok, committed.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Peter Geoghegan
On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure mmonc...@gmail.com wrote:
 ISTM the next step is to bisect the problem down over the weekend in
 order to to narrow the search.  If that doesn't turn up anything
 productive I'll look into taking other steps.

That might be the quickest way to do it, provided you can isolate the
bug fairly reliably. It might be a bit tricky to write a shell script
that treats a certain amount of time passing without the bug tripping
as proof that the bug isn't present in that revision, and to have that
work consistently. I'm slightly concerned that you'll hit other bugs that
have since been fixed, given the large number of possible symptoms
here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fillfactor for GIN indexes

2015-01-16 Thread Michael Paquier
On Sat, Jan 17, 2015 at 2:40 AM, Robert Haas robertmh...@gmail.com wrote:
 There's only value in adding a fillfactor parameter to GIN indexes if
 it improves performance.  There are no benchmarks showing it does.
 So, why are we still talking about this?
Indeed. There is no such benchmark as of now, and I am not sure I'll
get the time to work more on that soon, so let's reject the patch for
now. And sorry for the useless noise.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] explain sortorder

2015-01-16 Thread Tom Lane
Timmer, Marius marius.tim...@uni-muenster.de writes:
 attached is version 8, fixing remaining issues, adding docs and tests as 
 requested/agreed.

I've committed this with some rather substantial alterations, notably:

* Having get_sortorder_by_keyno dig into the plan state node by itself
seemed like a bad idea; it's certainly at variance with the existing
division of knowledge in this code, and I think it might outright do
the wrong thing if there's a Sort node underneath an Agg or Group node
(since in those cases the child Sort node, not the parent node, would
get passed in).  I fixed it so that show_sort_keys and siblings are
responsible for extracting and passing in the correct data arrays.

* There were some minor bugs in the rules for when to print NULLS
FIRST/LAST too.  I think the way I've got it now is a precise inverse of
what addTargetToSortList() will do.

* The proposed new regression test cases were not portable (en_US
isn't guaranteed to exist), and I thought adding a new regression
script file for just one test was a bit excessive.  The DESC and
USING behaviors were already covered by existing test cases so I just
added something to exercise COLLATE and NULLS FIRST in collate.sql.

* I took out the change in perform.sgml.  The change as proposed was
seriously confusing because it injected off-topic discussion into an
example that was meant to be just about the additional information printed
by EXPLAIN ANALYZE.  I'm not really sure that this feature needs any
specific documentation (other than its eventual mention in the release
notes); but if it does, we should probably add a brand new example
someplace before the EXPLAIN ANALYZE subsection.

* Assorted cleanups such as removal of irrelevant whitespace changes.
That sort of thing just makes a reviewer's job harder, so it's best
to avoid it if you can.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] n_live_tup smaller than the number of rows in a table

2015-01-16 Thread Lisa Guo
Hi,

We are seeing a strange behavior where n_live_tup is way smaller than the 
number of rows in a table. The table has > 18m rows, but n_live_tup only has
< 100K. We tried to do vacuum analyze to clear up any sticky errors, but it 
didn’t correct the problem. We are running Postgres 9.2. Any pointers on how we 
could debug this problem and how to correct the stats?
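
For reference, the comparison we are doing is roughly this (the table name is
just a placeholder):

SELECT count(*) FROM our_table;                    -- ~18m rows
SELECT n_live_tup, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname = 'our_table';                      -- n_live_tup ~100K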

Thanks,
Lisa


Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-16 Thread Peter Geoghegan
On Fri, Jan 16, 2015 at 6:21 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 It looks very much like that a page has for some reason been moved to a
 different block number. And that's exactly what Peter found out in his
 investigation too; an index page was mysteriously copied to a different
 block with identical content.

What I found suspicious about that was that the spuriously identical
pages were not physically adjacent, but logically adjacent (i.e. the
good page - the page whose contents the bad page spuriously duplicated -
had the bad page as its B-Tree right link). It also seems likely that that
small catalog index on pg_class(oid) was well cached in
shared_buffers. So I agree that it's unlikely that this is actually a
hardware or filesystem problem. Beyond that, if I had to guess, I'd
say that the problem is more likely to be in the B-Tree code than it
is in the buffer manager or whatever (so the logically adjacent
thing is probably not an artifact of the order that the pages were
accessed, since it appears there was a downlink to the bad page. This
downlink was not added recently. Also, this logical adjacency is
unlikely to be mere coincidence - Postgres seemed to fairly
consistently break this way).

Does anyone have a better developed sense of where the ultimate
problem here is than I do? I guess I've never thought too much about
how the system fails when a catalog index is this thoroughly broken.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] n_live_tup smaller than the number of rows in a table

2015-01-16 Thread Tom Lane
Lisa Guo l...@fb.com writes:
 We are seeing a strange behavior where n_live_tup is way smaller than the 
 number of rows in a table. The table has > 18m rows, but n_live_tup only has
 < 100K. We tried to do vacuum analyze to clear up any sticky errors, but it 
 didn’t correct the problem. We are running Postgres 9.2. Any pointers on how 
 we could debug this problem and how to correct the stats?

n_live_tup is a moving average over the last few observations, so in
theory it should get better if you repeat ANALYZE several times.
AFAIR, VACUUM isn't likely to help much.  (VACUUM FREEZE might though.)
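
Something along these lines, with your table name substituted:

ANALYZE your_table;   -- repeat a few times
SELECT n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table';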

It seems odd that you have a value that's so far off ... have you been
using this table in any unusual way?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] n_live_tup smaller than the number of rows in a table

2015-01-16 Thread Lisa Guo
The server did crash yesterday. It has many schemas that have the same table 
(different shards). Operations done on these tables are very similar, but only 
a few of them actually have this problem. Other than this error, we do not see 
other abnormalities on the box.

Lisa

From: Tom Lane t...@sss.pgh.pa.us
Date: Friday, January 16, 2015 at 4:23 PM
To: s l...@fb.com
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] n_live_tup smaller than the number of rows in a table

Lisa Guo l...@fb.com writes:
We are seeing a strange behavior where n_live_tup is way smaller than the 
number of rows in a table. The table has > 18m rows, but n_live_tup only has
< 100K. We tried to do vacuum analyze to clear up any sticky errors, but it 
didn’t correct the problem. We are running Postgres 9.2. Any pointers on how we 
could debug this problem and how to correct the stats?

n_live_tup is a moving average over the last few observations, so in
theory it should get better if you repeat ANALYZE several times.
AFAIR, VACUUM isn't likely to help much.  (VACUUM FREEZE might though.)

It seems odd that you have a value that's so far off ... have you been
using this table in any unusual way?

regards, tom lane



Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread Amit Kapila
On Fri, Jan 16, 2015 at 9:55 PM, Sawada Masahiko sawada.m...@gmail.com
wrote:
 On Fri, Jan 16, 2015 at 12:54 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
 
  I don't know how appealing it would be to others, but a new view
  like pg_file_settings which would display the settings in file could
  be meaningful for your need.
 
  Another way is user can do pg_reload_conf() to see the latest
  values (excluding values for server startup time parameters).
 

 I prefer the former, because the latter would not handle all GUC
 variables, as you said.
 The new view, something like pg_file_settings, would have the GUC
 variable and the config file that contains the corresponding entry.
 It would also be a helpful view for something like ALTER SYSTEM ...
 RESET, and that command might be a beneficial feature for users who
 want to manage the configuration file manually; I would like to
 propose both.


Okay, but I think it would be better to start a new thread
as this proposal seems to be different than your original
idea.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-01-16 Thread Amit Kapila
On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas robertmh...@gmail.com wrote:

 As mentioned downthread, a far bigger consideration is the I/O pattern
 we create.  A sequential scan is so-called because it reads the
 relation sequentially.  If we destroy that property, we will be more
 than slightly sad.  It might be OK to do sequential scans of, say,
 each 1GB segment separately, but I'm pretty sure it would be a real
 bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

 What I'm thinking about is that we might have something like this:

 struct this_lives_in_dynamic_shared_memory
 {
 BlockNumber last_block;
 Size prefetch_distance;
 Size prefetch_increment;
 slock_t mutex;
 BlockNumber next_prefetch_block;
 BlockNumber next_scan_block;
 };

 Each worker takes the mutex and checks whether next_prefetch_block -
 next_scan_block < prefetch_distance and also whether
 next_prefetch_block < last_block.  If both are true, it prefetches
 some number of additional blocks, as specified by prefetch_increment.
 Otherwise, it increments next_scan_block and scans the block
 corresponding to the old value.


Assuming we increment next_prefetch_block only after prefetching the
blocks (prefetch_increment of them), can't 2 workers simultaneously see
the same value for next_prefetch_block and try to prefetch the same
blocks?

What will the value of prefetch_increment be?
Will it be equal to prefetch_distance or prefetch_distance/2 or
prefetch_distance/4 or .. or will it be totally unrelated
to prefetch_distance?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread David G Johnston
Sawada Masahiko wrote
 On Fri, Jan 16, 2015 at 12:54 PM, Amit Kapila amit.kapila16@ wrote:
 On Thu, Jan 15, 2015 at 9:48 PM, Sawada Masahiko sawada.mshk@ wrote:
 On Thu, Jan 15, 2015 at 2:02 PM, Amit Kapila amit.kapila16@ wrote:
 
  One thought I have in this line is that currently there doesn't seem
 to
  be
  a way to know if the setting has an entry both in postgresql.conf and
  postgresql.auto.conf, if we can have some way of knowing the same
  (pg_settings?), then it could be convenient for user to decide if the
  value
  in postgresql.auto.conf is useful or not and if it's not useful then
 use
  Alter System .. Reset command to remove the same from
  postgresql.auto.conf.

 I think one way is that pg_settings has file name of variables,  But
 It would not affect to currently status of postgresql.conf
 So we would need to parse postgresql.conf again at that time.


 Yeah that could be a possibility, but I think that will break the
 existing
 command('s) as this is the common infrastructure used for SHOW ..
 commands as well which displays the guc value that is used by
 current session rather than the value in postgresql.conf.
 
 You're right.
 pg_setting and SHOW command use value in current session rather than
 config file.
 It might break these common infrastructure.

Two changes solve this problem in what seems to be a clean way.
1) Upon each parsing of postgresql.conf we store all assigned variables
somewhere
2) We display these assignments in a new pg_settings column named
system_reset_val

I would also extend this to include:
a) upon each parsing of postgresql.auto.conf we store all assigned variables
somewhere (maybe the same place as postgresql.conf and simply label the file
source)
b) add an alter_system_val field to show that value (or null)
c) add a db_role_val to show the current value for the session via
pg_db_role_setting
c.1) add a db_role_id to show the named user that is being used for the
db_role_val lookup

The thinking for c.1 is that in situations with role hierarchies and SET
ROLE usage it would be nice to be able to query what the connection role -
the one used during variable lookup - is.

I'm probably going overkill on this, but there are not a lot of different
sources, nor do they change frequently, so extending the pg_settings view to
be more of a one-stop shop for this information seems to make sense.

As it relates back to this thread the desired merging ends up being done
inside this view and at least gives the DBA a central location (well, along
with pg_db_role_setting) to go and look at the configuration landscape for
the system.
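
To make the shape concrete, a query against the extended view sketched above
might look like this (all of the *_val/db_role_id columns are hypothetical):

SELECT name, setting, system_reset_val, alter_system_val,
       db_role_val, db_role_id
  FROM pg_settings
 WHERE name = 'work_mem';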

David J.







-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-16 Thread Robert Haas
On Fri, Jan 16, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas robertmh...@gmail.com wrote:
 As mentioned downthread, a far bigger consideration is the I/O pattern
 we create.  A sequential scan is so-called because it reads the
 relation sequentially.  If we destroy that property, we will be more
 than slightly sad.  It might be OK to do sequential scans of, say,
 each 1GB segment separately, but I'm pretty sure it would be a real
 bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

 What I'm thinking about is that we might have something like this:

 struct this_lives_in_dynamic_shared_memory
 {
 BlockNumber last_block;
 Size prefetch_distance;
 Size prefetch_increment;
 slock_t mutex;
 BlockNumber next_prefetch_block;
 BlockNumber next_scan_block;
 };

 Each worker takes the mutex and checks whether next_prefetch_block -
 next_scan_block < prefetch_distance and also whether
 next_prefetch_block < last_block.  If both are true, it prefetches
 some number of additional blocks, as specified by prefetch_increment.
 Otherwise, it increments next_scan_block and scans the block
 corresponding to the old value.

 Assuming we will increment next_prefetch_block only after prefetching
 blocks (equivalent to prefetch_increment), won't 2 workers can
 simultaneously see the same value for next_prefetch_block and try to
 perform prefetch for same blocks?

The idea is that you can only examine and modify next_prefetch_block
or next_scan_block while holding the mutex.

 What will be value of prefetch_increment?
 Will it be equal to prefetch_distance or prefetch_distance/2 or
 prefetch_distance/4 or .. or will it be totally unrelated to
 prefetch_distance?

I dunno, that might take some experimentation.  prefetch_distance/2
doesn't sound stupid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread Amit Kapila
On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston 
david.g.johns...@gmail.com wrote:
  You're right.
  pg_setting and SHOW command use value in current session rather than
  config file.
  It might break these common infrastructure.

 Two changes solve this problem in what seems to be a clean way.
 1) Upon each parsing of postgresql.conf we store all assigned variables
 somewhere
 2) We display these assignments in a new pg_settings column named
 system_reset_val

 I would also extend this to include:
 a) upon each parsing of postgresql.auto.conf we store all assigned
variables
 somewhere (maybe the same place as postgresql.conf and simply label the
file
 source)

Do we need to perform this parsing whenever the user queries pg_settings?
I think it might lead to extra cycles of reading the file when the user won't
even need it, and as the code is shared with SHOW commands that could be
slightly complicated.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread David Johnston
On Fri, Jan 16, 2015 at 10:08 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston 
 david.g.johns...@gmail.com wrote:
   You're right.
   pg_setting and SHOW command use value in current session rather than
   config file.
   It might break these common infrastructure.
 
  Two changes solve this problem in what seems to be a clean way.
  1) Upon each parsing of postgresql.conf we store all assigned variables
  somewhere
  2) We display these assignments in a new pg_settings column named
  system_reset_val
 
  I would also extend this to include:
  a) upon each parsing of postgresql.auto.conf we store all assigned
 variables
  somewhere (maybe the same place as postgresql.conf and simply label the
 file
  source)

 Do we need to perform this parsing whenever user queries pg_settings?
 I think it might lead to extra cycles of reading file when user won't even
 need it and as the code is shared with SHOW commands that could be
 slightly complicated.


There would be no parsing upon reading of pg_settings, only lookups.  The
existing parsing would simply have its values saved to the catalogs that
will be involved in the underlying pg_setting view query.

David J.​


Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread Amit Kapila
On Sat, Jan 17, 2015 at 10:41 AM, David Johnston david.g.johns...@gmail.com
wrote:
 On Fri, Jan 16, 2015 at 10:08 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston 
david.g.johns...@gmail.com wrote:
   You're right.
   pg_setting and SHOW command use value in current session rather than
   config file.
   It might break these common infrastructure.
 
  Two changes solve this problem in what seems to be a clean way.
  1) Upon each parsing of postgresql.conf we store all assigned variables
  somewhere
  2) We display these assignments in a new pg_settings column named
  system_reset_val
 
  I would also extend this to include:
  a) upon each parsing of postgresql.auto.conf we store all assigned
variables
  somewhere (maybe the same place as postgresql.conf and simply label
the file
  source)

 Do we need to perform this parsing whenever user queries pg_settings?
 I think it might lead to extra cycles of reading file when user won't
even
 need it and as the code is shared with SHOW commands that could be
 slightly complicated.


 There would be no parsing upon reading of pg_settings, only lookups.  The
existing parsing would simply have its values saved to the catalogs that
will be involved in the underlying pg_setting view query.

So are you saying that whenever we read the files, we save the settings
to some catalog (probably a new one)?

Will that completely address the problem specified in this thread,
given that those values could be stale (the values from the last server
start or the last SIGHUP)?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf

2015-01-16 Thread David Johnston
On Fri, Jan 16, 2015 at 10:19 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Sat, Jan 17, 2015 at 10:41 AM, David Johnston 
 david.g.johns...@gmail.com wrote:
  On Fri, Jan 16, 2015 at 10:08 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
 
  On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston 
 david.g.johns...@gmail.com wrote:
You're right.
pg_setting and SHOW command use value in current session rather than
config file.
It might break these common infrastructure.
  
   Two changes solve this problem in what seems to be a clean way.
   1) Upon each parsing of postgresql.conf we store all assigned
 variables
   somewhere
   2) We display these assignments in a new pg_settings column named
   system_reset_val
  
   I would also extend this to include:
   a) upon each parsing of postgresql.auto.conf we store all assigned
 variables
   somewhere (maybe the same place as postgresql.conf and simply label
 the file
   source)
 
  Do we need to perform this parsing whenever user queries pg_settings?
  I think it might lead to extra cycles of reading file when user won't
 even
  need it and as the code is shared with SHOW commands that could be
  slightly complicated.
 
 
  There would be no parsing upon reading of pg_settings, only lookups.
 The existing parsing would simply have its values saved to the catalogs
 that will be involved in the underlying pg_setting view query.
 
 So are you telling that whenever we read, save the settings
 to some catalog (probably a new one)?

 Will that completely address the problem specified in this thread,
 as those values could probably be old (when last time server is
 started or at last SIGHUP time values)?


Cache invalidation is a hard problem to solve :)

I am reasonably content with showing the user - the one who has just updated
their postgresql.conf file and restarted/SIGHUPed the server only to find that
said value hasn't taken effect - that their value is indeed sitting in
postgresql.conf but that other parts of the system are preempting it from
being the active value.

David J.


Re: [HACKERS] proposal: row_to_array function

2015-01-16 Thread Pavel Stehule
2015-01-16 22:35 GMT+01:00 Andrew Dunstan and...@dunslane.net:


 On 01/16/2015 12:22 PM, Pavel Stehule wrote:



 There two possible transformations:

 row_to_array -- [[key1, value1],[key2, value2], ...]
 row_to_row_array -- [(key1, value1), (key2, value2), ... ]


 If we're going to go that route, I think it makes more sense to
 create an actual key/value type (ie:
 http://pgxn.org/dist/pair/doc/pair.html) and return an array of that.


 ok

 http://BlueTreble.com



 I think we'd possibly be better off with simply returning a flat array,
 [key1, value1, ...]

 Thats's what the hstore(text[]) and json_object(text[]) functions accept,
 along with the 2D variant, if we want a precedent.


It can be one of the supported variants. It should not be the only one,
because we cannot simply iterate over it.

Another possibility is to teach FOREACH to take a key and a value in one step.
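
Something like this (purely hypothetical syntax - neither row_to_array nor
this form of FOREACH exists today):

FOREACH key, value IN ARRAY row_to_array(r)
LOOP
  RAISE NOTICE '% = %', key, value;
END LOOP;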

Regards

Pavel



 cheers

 andrew




Re: [HACKERS] proposal: lock_time for pg_stat_database

2015-01-16 Thread Pavel Stehule
2015-01-16 20:33 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 12:30 PM, Pavel Stehule wrote:



 2015-01-16 19:24 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



 2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/16/15 11:35 AM, Pavel Stehule wrote:



 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

  On 1/16/15 11:00 AM, Pavel Stehule wrote:

  Hi all,

  some time ago, I proposed a lock time measurement
 related to query. A main issue was a method, how to show this information.
 Today proposal is little bit simpler, but still useful. We can show a total
 lock time per database in pg_stat_database statistics. High number can be
 signal about lock issues.


  Would this not use the existing stats mechanisms? If so,
 couldn't we do this per table? (I realize that won't handle all cases; we'd
 still need a lock_time_other somewhere).



 it can use a current existing stats mechanisms

 I afraid so isn't possible to assign waiting time to table -
 because it depends on order


 Huh? Order of what?


 when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is
 locked for t2 -- but if t2 < t1 then t2 is not important -- so what I have
 to cont as lock time for T1 and T2?


 If that select is waiting on a lock on t2, then it's waiting on that lock
 on that table. It doesn't matter who else has the lock.

   Also, what do you mean by 'lock'? Heavyweight? We
 already have some visibility there. What I wish we had was some way to know
 if we're spending a lot of time in a particular non-heavy lock. Actually
 measuring time probably wouldn't make sense but we might be able to count
 how often we fail initial acquisition or something.


 now, when I am thinking about it, lock_time is not good name
 - maybe waiting lock time (lock time should not be interesting, waiting
 is interesting) - it can be divided to some more categories - in GoodData
 we use Heavyweight, pages, and others categories.


 So do you see this somehow encompassing locks other than
 heavyweight locks? Because I think that's the biggest need here. Basically,
 something akin to TRACE_POSTGRESQL_LWLOCK_WAIT___START() that doesn't
 depend on dtrace.


 For these global statistics I see as important a common total waiting
 time for locks - we can use a more detailed granularity but I am not sure,
 if a common statistics are best tool.


 Locks may be global, but what you're waiting for a lock on certainly
 isn't. It's almost always a lock either on a table or a row in a table. Of
 course this does mean you can't just blindly report that you're blocked on
 some XID; that doesn't tell anyone anything.

  My motivations is - look to statistics -- and I can see ... lot of
 rollbacks -- issue, lot of deadlocks -- issue, lot of waiting time -- issue
 too. It is tool for people without possibility to use dtrace and similar
 tools and for everyday usage - simple check if locks are not a issue (or if
 locking is stable).


 Meh. SELECT sum(state_change) FROM pg_stat_activity WHERE waiting is just
 about as useful. Or just turn on lock logging.

 If you really want to add it at the database level I'm not opposed (so
 long as it leaves the door open for more granular locking later), but I
 can't really get excited about it either.

  and this proposal has sense only for heavyweight locks - because others
 locks are everywhere


 So what if they're everywhere? Right now if you're spending a lot of time
 waiting for LWLocks you have no way to know what's going on unless you
 happen to have dtrace. Obviously we're not going to something like issue a
 stats update every time we attempt to acquire an LWLock, but that doesn't
 mean we can't keep some counters on the locks and periodically report that.


I have a plan to update the statistics once all necessary locks are acquired -
so it is once per statement - which puts a similar pressure on the stats
system as now.

Pavel



 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com