Re: [HACKERS] Turning recovery.conf into GUCs
On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote: I have written similar logic, and while it's not pleasant, it's doable. This issue would really only go away if you don't use a file to signal recovery at all, which you have argued for, but which is really a separate and more difficult problem. Moving this patch to the next CF and marking it as returned with feedback for current CF as there is visibly no consensus reached. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning recovery.conf into GUCs
On 2015-01-16 21:43:43 +0900, Michael Paquier wrote: On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote: I have written similar logic, and while it's not pleasant, it's doable. This issue would really only go away if you don't use a file to signal recovery at all, which you have argued for, but which is really a separate and more difficult problem. Moving this patch to the next CF and marking it as returned with feedback for current CF as there is visibly no consensus reached. I don't think that's a good idea. If we defer this another couple months we'll *never* reach anything coming close to consensus. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning recovery.conf into GUCs
On Fri, Jan 16, 2015 at 9:45 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-16 21:43:43 +0900, Michael Paquier wrote: On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote: I have written similar logic, and while it's not pleasant, it's doable. This issue would really only go away if you don't use a file to signal recovery at all, which you have argued for, but which is really a separate and more difficult problem. Moving this patch to the next CF and marking it as returned with feedback for current CF as there is visibly no consensus reached. I don't think that's a good idea. If we defer this another couple months we'll *never* reach anything coming close to consensus. What makes you think that the situation could suddenly move in one direction more than another? (FWIW, my vote goes to the all GUC approach with standby.enabled.) -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallel mode and parallel contexts
On 2015-01-05 11:27:57 -0500, Robert Haas wrote: On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote: * I wonder if parallel contexts shouldn't be tracked via resowners That is a good question. I confess that I'm somewhat fuzzy about which things should be tracked via the resowner mechanism vs. which things should have their own bespoke bookkeeping. However, the AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), which makes merging them seem not particularly clean. I'm not sure either. But I think the current location is wrong anyway - during AtEOXact_Parallel() before running user defined queries via AfterTriggerFireDeferred() seems wrong. * combocid.c should error out in parallel mode, as insurance Eh, which function? HeapTupleHeaderGetCmin(), HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work in parallel mode. HeapTupleHeaderAdjustCmax() could Assert(!IsInParallelMode()) but the only calls are in heap_update() and heap_delete() which already have error checks, so putting yet another elog() there seems like overkill. To me it seems like a good idea, but whatever. * I doubt it's a good idea to allow heap_insert, heap_inplace_update, index_insert. I'm not convinced that those will be handled correct and relaxing restrictions is easier than adding them. I'm fine with adding checks to heap_insert() and heap_inplace_update(). I'm not sure we really need to add anything to index_insert(); how are we going to get there without hitting some other prohibited operation first? I'm not sure. But it's not that hard to imagine that somebody will start adding codepaths that insert into indexes first. Think upsert. * I'd much rather separate HandleParallelMessageInterrupt() in one variant that just tells the machinery there's a interrupt (called from signal handlers) and one that actually handles them. We shouldn't even consider adding any new code that does allocations, errors and such in signal handlers. 
That's just a *bad* idea. That's a nice idea, but I didn't invent the way this crap works today. ImmediateInterruptOK gets set to true while performing a lock wait, and we need to be able to respond to errors while in that state. I think there's got to be a better way to handle that than forcing every asynchronous operation to have to cope with the fact that ImmediateInterruptOK may be set or not set, but as of today that's what you have to do. I personally think it's absolutely unacceptable to add any more of that. That the current mess works is more luck than anything else - and I'm pretty sure there are several bugs in it. But since I realize I can't force you to develop an alternative solution, I tried to implement enough infrastructure to deal with it without too much work. As far as I can see this could relatively easily be implemented on top of the removal of ImmediateInterruptOK in the patch series I posted yesterday? Afaics you just need to remove most of +HandleParallelMessageInterrupt() and replace it by a single SetLatch(). * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd much rather place a struct there and be careful about not using pointers. That also would obviate the need to reserve some ids. I don't see how that's going to work with variable-size data structures, and a bunch of the things that we need to serialize are variable-size. Meh. Just appending the variable data to the end of the structure and calculating offsets works just fine. Moreover, I'm not really interested in rehashing this design again. I know it's not your favorite; you've said that before. Well, if I keep having to read code using it, I'll keep being annoyed by it. I guess we both have to live with that. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
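The layout Andres suggests - a fixed struct at the start of the shared segment with variable-size payloads appended after it, addressed by byte offsets rather than pointers - can be sketched as follows. This is an illustrative sketch with hypothetical names, not the actual shm_toc code or its replacement:

```c
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

/* Hypothetical fixed header placed at the start of a shared segment.
 * Variable-size payloads follow the header; we record byte offsets
 * from the segment base rather than raw pointers, so the layout stays
 * valid even if each process maps the segment at a different address. */
typedef struct FixedHeader
{
    size_t  query_off;      /* offset of the query string */
    size_t  query_len;
    size_t  snapshot_off;   /* offset of an opaque snapshot blob */
    size_t  snapshot_len;
} FixedHeader;

/* Append variable data after the header and return its offset. */
size_t
append_blob(char *base, size_t *used, const void *data, size_t len)
{
    size_t  off = *used;

    memcpy(base + off, data, len);
    *used += len;
    return off;
}

/* Build a segment containing the header plus two variable payloads. */
char *
build_segment(const char *query, const void *snap, size_t snaplen)
{
    size_t      qlen = strlen(query) + 1;
    char       *base = malloc(sizeof(FixedHeader) + qlen + snaplen);
    size_t      used = sizeof(FixedHeader);
    FixedHeader *hdr = (FixedHeader *) base;

    hdr->query_len = qlen;
    hdr->query_off = append_blob(base, &used, query, qlen);
    hdr->snapshot_len = snaplen;
    hdr->snapshot_off = append_blob(base, &used, snap, snaplen);
    return base;
}

/* Reader side: recover a usable pointer from base + offset. */
const char *
segment_query(const char *base)
{
    const FixedHeader *hdr = (const FixedHeader *) base;

    return base + hdr->query_off;
}
```

Because only offsets are stored, the reader works regardless of where the segment is mapped, which is the point of "being careful about not using pointers".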
Re: [HACKERS] Turning recovery.conf into GUCs
On 2015-01-16 21:50:16 +0900, Michael Paquier wrote: On Fri, Jan 16, 2015 at 9:45 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-01-16 21:43:43 +0900, Michael Paquier wrote: On Wed, Jan 7, 2015 at 6:22 AM, Peter Eisentraut pete...@gmx.net wrote: I have written similar logic, and while it's not pleasant, it's doable. This issue would really only go away if you don't use a file to signal recovery at all, which you have argued for, but which is really a separate and more difficult problem. Moving this patch to the next CF and marking it as returned with feedback for current CF as there is visibly no consensus reached. I don't think that's a good idea. If we defer this another couple months we'll *never* reach anything coming close to consensus. What makes you think that the situation could suddenly move in one direction more than another? That we have to fix this. I see absolutely no advantage in declaring the discussion closed for now. That doesn't exactly increase the chance of this ever succeeding. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in pg_dump
On 16/01/2015 01:06, Jim Nasby wrote: On 1/15/15 5:26 AM, Gilles Darold wrote: Hello, There's a long-pending issue with pg_dump and extensions that have table members with foreign keys. This was previously reported in this thread http://www.postgresql.org/message-id/ca+tgmoyvzkadmgh_8el7uvm472geru0b4pnnfjqye6ss1k9...@mail.gmail.com and discussed by Robert. All PostgreSQL users that use the PostGis extension postgis_topology are facing the issue because the two member tables (topology and layer) are linked by foreign keys. If you dump a database with this extension and try to import it you will experience this error: pg_restore: [archiver (db)] Error while PROCESSING TOC: pg_restore: [archiver (db)] Error from TOC entry 3345; 0 157059176 TABLE DATA layer gilles pg_restore: [archiver (db)] COPY failed for table layer: ERROR: insert or update on table layer violates foreign key constraint layer_topology_id_fkey DETAIL: Key (topology_id)=(1) is not present in table topology. WARNING: errors ignored on restore: 1 The problem is that, whatever export type you choose (plain/custom and full-export/data-only) the data of tables topology and layer is always exported in alphabetic order. I think this is a bug because outside of extensions, in data-only export, pg_dump is able to find foreign key dependencies and dump tables' data in the right order, but not with extension members. Default is alphabetic order, but that should not be the case with extension members because constraints are recreated by the CREATE EXTENSION command. I hope I am clear enough. Here we have three solutions: 1/ Inform developers of extensions to take alphabetical order into account when they have member tables using foreign keys. 2/ Inform DBAs that they have to restore the failing table independently. 
The use case above can be handled using the following command: pg_restore -h localhost -n topology -t layer -Fc -d testdb_empty testdump.dump 3/ Inform DBAs that they have to restore the schema first then the data only using --disable-triggers I don't like 1-3, and I doubt anyone else does... 4/ Patch pg_dump to solve this issue. 5. Disable FK's during load. This is really a bigger item than just extensions. It would have the nice benefit of doing a wholesale FK validation instead of firing per-row triggers, but it would leave the database in a weird state if a restore failed... I think this is another problem. Here we just need to apply to extension member tables the same handling as for normal tables. I guess this is what this patch tries to solve. I attach a patch that solves the issue in pg_dump, let me know if it might be included in Commit Fest or if the three other solutions are a better choice. I also attach a sample extension (test_fk_in_ext) to be able to reproduce the issue and test the patch. Note that there might be a simpler solution than the one I used in this patch; if this is the case please point me in the right direction, I will be pleased to rewrite and send another patch. The only problem I see with this approach is circular FK's: decibel@decina.local=# create table a(a_id serial primary key, b_id int); CREATE TABLE decibel@decina.local=# create table b(b_id serial primary key, a_id int references a); CREATE TABLE decibel@decina.local=# alter table a add foreign key(b_id) references b; ALTER TABLE decibel@decina.local=# That's esoteric enough that I think it's OK not to directly support them, but pg_dump shouldn't puke on them (and really should throw a warning). Though it looks like it doesn't handle that in the data-only case anyway... The patch takes care of circular references and you will be warned if pg_dump finds them among the extension members. That was not the case before. 
If you try to dump a database with the postgis extension you will be warned about the FKs defined on the edge_data table. -- Gilles Darold Consultant PostgreSQL http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
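The ordering problem being patched boils down to a topological sort of tables by foreign-key dependency, with a warning when a circular FK chain (like the a/b example above) makes a safe data order impossible. A minimal sketch of that idea, purely illustrative and not pg_dump's actual dependency sort:

```c
#include <stdio.h>

#define MAXT 16

/* dep[i][j] != 0 means table i has a foreign key referencing table j,
 * so table j's data must be restored before table i's.  Writes a safe
 * restore order into out[] and returns the number of tables ordered,
 * or -1 when a circular FK chain prevents any safe order (the caller
 * should emit a warning in that case, as the patch does). */
int
order_tables(int ntab, int dep[MAXT][MAXT], int out[MAXT])
{
    int     done[MAXT] = {0};
    int     emitted = 0;

    while (emitted < ntab)
    {
        int     progressed = 0;

        for (int i = 0; i < ntab; i++)
        {
            int     ready = !done[i];

            /* a table is ready once every table it references is done */
            for (int j = 0; ready && j < ntab; j++)
                if (dep[i][j] && !done[j])
                    ready = 0;
            if (ready)
            {
                out[emitted++] = i;
                done[i] = 1;
                progressed = 1;
            }
        }
        if (!progressed)
            return -1;          /* cycle: no table became ready */
    }
    return emitted;
}
```

For the postgis_topology case, layer references topology, so topology's data is emitted first; for the circular a/b example the function reports the cycle instead of producing a wrong order.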
Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL
Hi, /* - * Grab an exclusive lock on the pk table, so that someone doesn't delete - * rows out from under us. (Although a lesser lock would do for that - * purpose, we'll need exclusive lock anyway to add triggers to the pk - * table; trying to start with a lesser lock will just create a risk of - * deadlock.) + * Grab ShareRowExclusiveLock on the pk table, so that someone doesn't + * delete rows out from under us. Note that this does not create risks + * of deadlocks as triggers add added to the pk table using the same + * lock. */ "add added" doesn't look intended. The rest of the sentence doesn't look entirely right either. /* * Triggers must be on tables or views, and there are additional @@ -526,8 +526,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString, * can skip this for internally generated triggers, since the name * modification above should be sufficient. * - * NOTE that this is cool only because we have AccessExclusiveLock on the - * relation, so the trigger set won't be changing underneath us. + * NOTE that this is cool only because of the unique contraint. I fail to see what the unique constraint has to do with this? The previous comment refers to the fact that the AccessExclusiveLock is what prevents a race where another transaction concurrently adds a trigger with the same name. Yes, the unique index would, as noted earlier in the comment, catch the error. But that's not the point of the check. Unless I miss something the comment is just as true if you replace the access exclusive with share row exclusive as it's also self-conflicting. @@ -1272,8 +1271,7 @@ renametrig(RenameStmt *stmt) * on tgrelid/tgname would complain anyway) and to ensure a trigger does * exist with oldname. * - * NOTE that this is cool only because we have AccessExclusiveLock on the - * relation, so the trigger set won't be changing underneath us. + * NOTE that this is cool only because there is a unique constraint. */ Same as above. 
tgrel = heap_open(TriggerRelationId, RowExclusiveLock); diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index dd748ac..8eeccf2 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -699,7 +699,8 @@ pg_get_triggerdef_worker(Oid trigid, bool pretty) HeapTuple ht_trig; Form_pg_trigger trigrec; StringInfoData buf; - Relation tgrel; + Snapshot snapshot = RegisterSnapshot(GetTransactionSnapshot()); + Relation tgrel = heap_open(TriggerRelationId, AccessShareLock); ScanKeyData skey[1]; SysScanDesc tgscan; int findx = 0; @@ -710,18 +711,18 @@ pg_get_triggerdef_worker(Oid trigid, bool pretty) /* * Fetch the pg_trigger tuple by the Oid of the trigger */ - tgrel = heap_open(TriggerRelationId, AccessShareLock); - ScanKeyInit(&skey[0], ObjectIdAttributeNumber, BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(trigid)); tgscan = systable_beginscan(tgrel, TriggerOidIndexId, true, - NULL, 1, skey); + snapshot, 1, skey); ht_trig = systable_getnext(tgscan); + UnregisterSnapshot(snapshot); + if (!HeapTupleIsValid(ht_trig)) elog(ERROR, "could not find tuple for trigger %u", trigid); Hm. Pushing the snapshot is supposed to make this fully MVCC? I don't think that's actually sufficient because of the deparsing of the WHEN clause and of the function name. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote: Running this test on another set of hardware to verify -- if this turns out to be a false alarm which it may very well be, I can only offer my apologies! I've never had a new drive fail like that, in that manner. I'll burn the other hardware in overnight and report back. huh -- well possibly. not. This is on a virtual machine attached to a SAN. It ran clean for several hours (this is 9.4 vanilla, asserts off, checksums on), then started having issues: [cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING: page verification failed, calculated checksum 59143 but expected 59137 at character 20 [cds2 21952 2015-01-15 22:54:51.852 CST 5502]QUERY: DELETE FROM onesitepmc.propertyguestcard t WHERE EXISTS ( SELECT 1 FROM propertyguestcard_insert d WHERE (t.prptyid, t.gcardid) = (d.prptyid, d.gcardid) ) [cds2 21952 2015-01-15 22:54:51.852 CST 5502]CONTEXT: PL/pgSQL function cdsreconciletable(text,text,text,text,boolean) line 197 at EXECUTE statement SQL statement SELECT * FROM CDSReconcileTable( t.CDSServer, t.CDSDatabase, t.SchemaName, t.TableName) PL/pgSQL function cdsreconcileruntable(bigint) line 35 at SQL statement After that, several hours of clean running, followed by: [cds2 32353 2015-01-16 04:40:57.814 CST 7549]WARNING: did not find subXID 7553 in MyProc [cds2 32353 2015-01-16 04:40:57.814 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.018 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.018 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]LOG: could not send data to client: Broken pipe [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during 
exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]STATEMENT: SELECT CDSReconcileRunTable(1160) [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a lock of type ShareLock [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? 
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]ERROR: failed to re-find shared lock object [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]STATEMENT: SELECT CDSReconcileRunTable(1160) [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: AbortSubTransaction while in ABORT state [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: did not find subXID 7553 in MyProc [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a lock of type RowExclusiveLock [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16
Re: [HACKERS] Safe memory allocation functions
On Fri, Jan 16, 2015 at 8:47 AM, Michael Paquier michael.paqu...@gmail.com wrote: Voting for palloc_noerror() as well. And here is an updated patch using this naming, added to the next CF as well. -- Michael From b636c809c2f2cb4177bedc2e5a4883a79b61fbc6 Mon Sep 17 00:00:00 2001 From: Michael Paquier mich...@otacoo.com Date: Tue, 13 Jan 2015 15:40:38 +0900 Subject: [PATCH] Add memory allocation APIs able to return NULL instead of ERROR The following functions are added to the existing set for frontend and backend: - palloc_noerror - palloc0_noerror - repalloc_noerror --- src/backend/utils/mmgr/aset.c| 529 +++ src/backend/utils/mmgr/mcxt.c| 124 + src/common/fe_memutils.c | 72 -- src/include/common/fe_memutils.h | 3 + src/include/nodes/memnodes.h | 2 + src/include/utils/palloc.h | 3 + 6 files changed, 451 insertions(+), 282 deletions(-) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 85b3c9a..974e018 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -243,11 +243,22 @@ typedef struct AllocChunkData ((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ)) /* + * Wrappers for allocation functions. + */ +static void *set_alloc_internal(MemoryContext context, +Size size, bool noerror); +static void *set_realloc_internal(MemoryContext context, void *pointer, + Size size, bool noerror); + +/* * These functions implement the MemoryContext API for AllocSet contexts. 
*/ static void *AllocSetAlloc(MemoryContext context, Size size); +static void *AllocSetAllocNoError(MemoryContext context, Size size); static void AllocSetFree(MemoryContext context, void *pointer); static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); +static void *AllocSetReallocNoError(MemoryContext context, + void *pointer, Size size); static void AllocSetInit(MemoryContext context); static void AllocSetReset(MemoryContext context); static void AllocSetDelete(MemoryContext context); @@ -264,8 +275,10 @@ static void AllocSetCheck(MemoryContext context); */ static MemoryContextMethods AllocSetMethods = { AllocSetAlloc, + AllocSetAllocNoError, AllocSetFree, AllocSetRealloc, + AllocSetReallocNoError, AllocSetInit, AllocSetReset, AllocSetDelete, @@ -517,140 +530,16 @@ AllocSetContextCreate(MemoryContext parent, } /* - * AllocSetInit - * Context-type-specific initialization routine. - * - * This is called by MemoryContextCreate() after setting up the - * generic MemoryContext fields and before linking the new context - * into the context tree. We must do whatever is needed to make the - * new context minimally valid for deletion. We must *not* risk - * failure --- thus, for example, allocating more memory is not cool. - * (AllocSetContextCreate can allocate memory when it gets control - * back, however.) - */ -static void -AllocSetInit(MemoryContext context) -{ - /* - * Since MemoryContextCreate already zeroed the context node, we don't - * have to do anything here: it's already OK. - */ -} - -/* - * AllocSetReset - * Frees all memory which is allocated in the given set. - * - * Actually, this routine has some discretion about what to do. - * It should mark all allocated chunks freed, but it need not necessarily - * give back all the resources the set owns. Our actual implementation is - * that we hang onto any keeper block specified for the set. 
In this way, - * we don't thrash malloc() when a context is repeatedly reset after small - * allocations, which is typical behavior for per-tuple contexts. - */ -static void -AllocSetReset(MemoryContext context) -{ - AllocSet set = (AllocSet) context; - AllocBlock block; - - AssertArg(AllocSetIsValid(set)); - -#ifdef MEMORY_CONTEXT_CHECKING - /* Check for corruption and leaks before freeing */ - AllocSetCheck(context); -#endif - - /* Clear chunk freelists */ - MemSetAligned(set-freelist, 0, sizeof(set-freelist)); - - block = set-blocks; - - /* New blocks list is either empty or just the keeper block */ - set-blocks = set-keeper; - - while (block != NULL) - { - AllocBlock next = block-next; - - if (block == set-keeper) - { - /* Reset the block, but don't return it to malloc */ - char *datastart = ((char *) block) + ALLOC_BLOCKHDRSZ; - -#ifdef CLOBBER_FREED_MEMORY - wipe_mem(datastart, block-freeptr - datastart); -#else - /* wipe_mem() would have done this */ - VALGRIND_MAKE_MEM_NOACCESS(datastart, block-freeptr - datastart); -#endif - block-freeptr = datastart; - block-next = NULL; - } - else - { - /* Normal case, release the block */ -#ifdef CLOBBER_FREED_MEMORY - wipe_mem(block, block-freeptr - ((char *) block)); -#endif - free(block); - } - block = next; - } - - /* Reset block size allocation sequence, too */ - set-nextBlockSize = set-initBlockSize; -} - -/* - * AllocSetDelete - * Frees all memory which is allocated in the given set, -
Re: [HACKERS] Safe memory allocation functions
On 2015-01-16 08:47:10 +0900, Michael Paquier wrote: On Fri, Jan 16, 2015 at 12:57 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I do think that safe is the wrong suffix. Maybe palloc_soft_fail() or palloc_null() or palloc_no_oom() or palloc_unsafe(). I liked palloc_noerror() better myself FWIW. Voting for palloc_noerror() as well. I don't like that name. It very well can error out. E.g. because of the allocation size. And we definitely do not want to ignore that case. How about palloc_try()? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Safe memory allocation functions
On 2015-01-16 23:06:12 +0900, Michael Paquier wrote: /* + * Wrappers for allocation functions. + */ +static void *set_alloc_internal(MemoryContext context, + Size size, bool noerror); +static void *set_realloc_internal(MemoryContext context, void *pointer, + Size size, bool noerror); + +/* * These functions implement the MemoryContext API for AllocSet contexts. */ static void *AllocSetAlloc(MemoryContext context, Size size); +static void *AllocSetAllocNoError(MemoryContext context, Size size); static void AllocSetFree(MemoryContext context, void *pointer); static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); +static void *AllocSetReallocNoError(MemoryContext context, + void *pointer, Size size); static void AllocSetInit(MemoryContext context); static void AllocSetReset(MemoryContext context); static void AllocSetDelete(MemoryContext context); @@ -264,8 +275,10 @@ static void AllocSetCheck(MemoryContext context); */ static MemoryContextMethods AllocSetMethods = { AllocSetAlloc, + AllocSetAllocNoError, AllocSetFree, AllocSetRealloc, + AllocSetReallocNoError, AllocSetInit, AllocSetReset, AllocSetDelete, @@ -517,140 +530,16 @@ AllocSetContextCreate(MemoryContext parent, } Wouldn't it make more sense to change the MemoryContext API to return NULLs in case of allocation failure and do the error checking in the mcxt.c callers? +/* wrapper routines for allocation */ +static void* palloc_internal(Size size, bool noerror); +static void* repalloc_internal(void *pointer, Size size, bool noerror); + /* * You should not do memory allocations within a critical section, because * an out-of-memory error will be escalated to a PANIC. 
To enforce that @@ -684,8 +688,8 @@ MemoryContextAllocZeroAligned(MemoryContext context, Size size) return ret; } -void * -palloc(Size size) +static void* +palloc_internal(Size size, bool noerror) { /* duplicates MemoryContextAlloc to avoid increased overhead */ void *ret; @@ -698,31 +702,85 @@ palloc(Size size) CurrentMemoryContext-isReset = false; - ret = (*CurrentMemoryContext-methods-alloc) (CurrentMemoryContext, size); + if (noerror) + ret = (*CurrentMemoryContext-methods-alloc_noerror) + (CurrentMemoryContext, size); + else + ret = (*CurrentMemoryContext-methods-alloc) + (CurrentMemoryContext, size); VALGRIND_MEMPOOL_ALLOC(CurrentMemoryContext, ret, size); return ret; } I'd be rather surprised if these branches won't show up in profiles. This is really rather hot code. At the very least this helper function should be inlined. Also, calling the valgrind function on an allocation failure surely isn't correct. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL
On 01/16/2015 03:01 PM, Andres Freund wrote: Hi, /* -* Grab an exclusive lock on the pk table, so that someone doesn't delete -* rows out from under us. (Although a lesser lock would do for that -* purpose, we'll need exclusive lock anyway to add triggers to the pk -* table; trying to start with a lesser lock will just create a risk of -* deadlock.) +* Grab ShareRowExclusiveLock on the pk table, so that someone doesn't +* delete rows out from under us. Note that this does not create risks +* of deadlocks as triggers add added to the pk table using the same +* lock. */ "add added" doesn't look intended. The rest of the sentence doesn't look entirely right either. It was intended to be "are added", but the sentence is pretty awful anyway. I am not sure the sentence is really necessary. /* * Triggers must be on tables or views, and there are additional @@ -526,8 +526,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString, * can skip this for internally generated triggers, since the name * modification above should be sufficient. * -* NOTE that this is cool only because we have AccessExclusiveLock on the -* relation, so the trigger set won't be changing underneath us. +* NOTE that this is cool only because of the unique contraint. I fail to see what the unique constraint has to do with this? The previous comment refers to the fact that the AccessExclusiveLock is what prevents a race where another transaction adds a trigger with the same name already exists. Yes, the unique index would, as noted earlier in the comment, catch the error. But that's not the point of the check. Unless I miss something the comment is just as true if you replace the access exclusive with share row exclusive as it's also self-conflicting. Yeah, this must have been a leftover from the version where I only grabbed a ShareLock. The comment should be restored. Hm. Pushing the snapshot is supposed to make this fully MVCC? 
I don't think that's actually sufficient because of the deparsing of the WHEN clause and of the function name. Indeed. As Noah and I discussed previously in this thread we would need to do quite a bit of refactoring of ruleutils.c to make it fully MVCC. For this reason I opted to only lower the lock levels of ADD and ALTER TRIGGER, and not DROP TRIGGER. Neither of those requires MVCC of the WHEN clause. Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Fri, Jan 16, 2015 at 8:05 AM, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote: Running this test on another set of hardware to verify -- if this turns out to be a false alarm which it may very well be, I can only offer my apologies! I've never had a new drive fail like that, in that manner. I'll burn the other hardware in overnight and report back. huh -- well possibly. not. This is on a virtual machine attached to a SAN. It ran clean for several hours (this is 9.4 vanilla, asserts off, checksums on), then started having issues: I'm going to bisect this down...FYI. I'll start with the major releases first. This is not going to be a fast process...I'm out of pocket for the weekend and have a lot of other stuff going on. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On 01/16/2015 04:05 PM, Merlin Moncure wrote: On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote: Running this test on another set of hardware to verify -- if this turns out to be a false alarm, which it may very well be, I can only offer my apologies! I've never had a new drive fail like that, in that manner. I'll burn the other hardware in overnight and report back.

huh -- well, possibly not. This is on a virtual machine attached to a SAN. It ran clean for several hours (this is 9.4 vanilla, asserts off, checksums on), then it started having issues:

[cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING: page verification failed, calculated checksum 59143 but expected 59137 at character 20

The calculated checksum is suspiciously close to the expected one. It could be coincidence, but the previous checksum warning you posted was also quite close:

[cds2 18347 2015-01-15 15:58:29.955 CST 1779]WARNING: page verification failed, calculated checksum 28520 but expected 28541

I believe the checksum algorithm is supposed to mix the bits quite thoroughly, so that a difference in a single byte in the input will lead to a completely different checksum. However, we add the block number to the checksum last:

/* Mix in the block number to detect transposed pages */
checksum ^= blkno;

/*
 * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
 * one. That avoids checksums of zero, which seems like a good idea.
 */
return (checksum % 65535) + 1;

It looks very much like a page has for some reason been moved to a different block number. And that's exactly what Peter found out in his investigation too; an index page was mysteriously copied to a different block with identical content.

- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
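To make the observation concrete, the final mixing step quoted above can be modeled in isolation. This is only a sketch of the finalization step — the `finalize_checksum` helper name and the stand-in "mixed" input value are mine, not PostgreSQL's — but it shows why identical page content stored under two adjacent block numbers yields two nearby 16-bit checksums, matching pairs like 59143/59137 in the log:

```c
#include <stdint.h>

/*
 * Stand-in for the last two steps of the page checksum: XOR in the block
 * number, then reduce to a uint16 with an offset of one.  The "checksum"
 * argument represents the already fully mixed hash of the page contents.
 */
static uint16_t
finalize_checksum(uint32_t checksum, uint32_t blkno)
{
    checksum ^= blkno;                          /* mix in the block number */
    return (uint16_t) ((checksum % 65535) + 1); /* avoid a checksum of zero */
}
```

A page copied verbatim from, say, block 100 to block 101 therefore fails verification with a calculated checksum only a few counts away from the stored one, because only the final XOR differs between the two blocks — exactly the pattern in the two warnings quoted above.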
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
Hi,

On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote: On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote: Running this test on another set of hardware to verify -- if this turns out to be a false alarm, which it may very well be, I can only offer my apologies! I've never had a new drive fail like that, in that manner. I'll burn the other hardware in overnight and report back.

huh -- well, possibly not. This is on a virtual machine attached to a SAN. It ran clean for several hours (this is 9.4 vanilla, asserts off, checksums on), then it started having issues:

Damn. Is there any chance you can package this somehow so that others can run it locally? It looks hard to find the actual bug here without adding instrumentation to postgres.

[cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING: page verification failed, calculated checksum 59143 but expected 59137 at character 20
[cds2 21952 2015-01-15 22:54:51.852 CST 5502]QUERY: DELETE FROM onesitepmc.propertyguestcard t WHERE EXISTS ( SELECT 1 FROM propertyguestcard_insert d WHERE (t.prptyid, t.gcardid) = (d.prptyid, d.gcardid) )
[cds2 21952 2015-01-15 22:54:51.852 CST 5502]CONTEXT: PL/pgSQL function cdsreconciletable(text,text,text,text,boolean) line 197 at EXECUTE statement SQL statement SELECT * FROM CDSReconcileTable( t.CDSServer, t.CDSDatabase, t.SchemaName, t.TableName) PL/pgSQL function cdsreconcileruntable(bigint) line 35 at SQL statement

This was the first error? None of the 'could not find subXID' errors beforehand? 
[cds2 32353 2015-01-16 04:40:57.814 CST 7549]WARNING: did not find subXID 7553 in MyProc [cds2 32353 2015-01-16 04:40:57.814 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.018 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.018 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]LOG: could not send data to client: Broken pipe [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]STATEMENT: SELECT CDSReconcileRunTable(1160) [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? 
[cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a lock of type AccessShareLock [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a lock of type ShareLock [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: ReleaseLockIfHeld: failed?? [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]ERROR: failed to re-find shared lock object [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35 during exception cleanup [cds2 32353 2015-01-16 04:40:58.027 CST 7549]STATEMENT: SELECT CDSReconcileRunTable(1160) [cds2 32353 2015-01-16 04:40:58.027 CST
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund and...@2ndquadrant.com wrote: Is there any chance you can package this somehow so that others can run it locally? It looks hard to find the actual bug here without adding instrumentation to postgres.

That's possible but involves a lot of complexity in the setup because of the source database (SQL Server) dependency. Thinking outside the box here, I'm going to migrate the source to postgres. This will rule out pl/sh, which is the only non-core dependency, but will take some setup work on my end first. If I can still reproduce the error at that point, maybe we can go in this direction, and it would make local reproduction easier anyway.

[cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING: page

This was the first error? None of the 'could not find subXID' errors beforehand?

yep. no caught exceptions or anything interesting in the log before that.

Could you add an EmitErrorReport(); before the FlushErrorState() in pl_exec.c's exec_stmt_block()?

will do.

merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Safe memory allocation functions
On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Hmm, I understood Tom to be opposing the idea of a palloc variant that returns NULL on failure, and I understand you to be supporting it. But maybe I'm confused. Your understanding seems correct to me. I was just saying that your description of Tom's argument to dislike the idea seemed at odds with what he was actually saying. OK, that may be. I'm not sure. Anyway, I support it. I agree that there are systems (or circumstances?) where malloc is going to succeed and then the world will blow up later on anyway, but I don't think that means that an out-of-memory error is the only sensible response to a palloc failure; returning NULL seems like a sometimes-useful alternative. I do think that safe is the wrong suffix. Maybe palloc_soft_fail() or palloc_null() or palloc_no_oom() or palloc_unsafe(). I liked palloc_noerror() better myself FWIW. I don't care for noerror() because it probably still will error in some circumstances; just not for OOM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Safe memory allocation functions
Robert Haas wrote: On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I do think that "safe" is the wrong suffix. Maybe palloc_soft_fail() or palloc_null() or palloc_no_oom() or palloc_unsafe(). I liked palloc_noerror() better myself FWIW. I don't care for noerror() because it probably still will error in some circumstances; just not for OOM.

Yes, but that seems fine to me. We have other functions with noerror flags, and they can still fail under some circumstances -- just not if the error is the most commonly considered scenario in which they fail. The first example I found is LookupAggNameTypeNames(); there are many more. I don't think this causes any confusion in practice.

Another precedent we have is something like missing_ok as a flag name in get_object_address() and other places; following that, we could have this new function as palloc_oom_ok or something like that. But it doesn't seem an improvement to me. (I'm pretty sure we all agree that this must not be a flag to palloc but rather a new function.)

Of all the ones you proposed above, the one I like the most is palloc_no_oom, but IMO palloc_noerror is still better.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Error check always bypassed in tablefunc.c
Hi all, As mentioned in $subject, commit 08c33c4 of 2003 has made the following block of code dead in tablefunc.c:1320, because level is incremented to at least 1:

/* First time through, do a little more setup */
if (level == 0)
{
    /*
     * Check that return tupdesc is compatible with the one we got
     * from the query, but only at level 0 -- no need to check more
     * than once
     */
    if (!compatConnectbyTupleDescs(tupdesc, spi_tupdesc))
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("invalid return type"),
                 errdetail("Return and SQL tuple descriptions are "
                           "incompatible.")));
}

A simple fix is to change level == 0 to level == 1 to check for this error, as in the attached patch. This issue has been spotted by Coverity. Regards,

-- Michael

diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 3388fab..878255e 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -1317,12 +1317,12 @@ build_tuplestore_recursively(char *key_fld,
 	StringInfoData chk_current_key;
 
 	/* First time through, do a little more setup */
-	if (level == 0)
+	if (level == 1)
 	{
 		/*
 		 * Check that return tupdesc is compatible with the one we got
-		 * from the query, but only at level 0 -- no need to check more
-		 * than once
+		 * from the query, but only at the first level -- no need to check
+		 * more than once
 		 */
 		if (!compatConnectbyTupleDescs(tupdesc, spi_tupdesc))

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] orangutan seizes up during isolation-check
On Fri, Jan 16, 2015 at 05:43:44AM -0500, Dave Cramer wrote: On 16 January 2015 at 01:33, Noah Misch n...@leadboat.com wrote: Sure, done. Dave, orangutan should now be able to pass with --enable-nls. Would you restore that option? I can, but is this for HEAD or all versions ? All versions. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Error check always bypassed in tablefunc.c
Michael Paquier wrote: As mentioned in $subject, commit 08c33c4 of 2003 has made the following block of code dead in tablefunc.c:1320 because level is incremented to at least 1:

/* First time through, do a little more setup */
if (level == 0)
{

Uh. This means the error case has no test coverage ...

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] orangutan seizes up during isolation-check
On 16 January 2015 at 01:33, Noah Misch n...@leadboat.com wrote: On Thu, Jan 15, 2015 at 09:24:01AM -0500, Robert Haas wrote: On Thu, Jan 15, 2015 at 1:04 AM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 14, 2015 at 04:48:53PM -0500, Peter Eisentraut wrote: What I'm seeing now is that the unaccent regression tests when run under make check-world abort with FATAL: postmaster became multithreaded during startup HINT: Set the LC_ALL environment variable to a valid locale. contrib/unaccent/Makefile sets NO_LOCALE=1, so that makes sense. I expect the patch over here will fix it: http://www.postgresql.org/message-id/20150109063015.ga2491...@tornado.leadboat.com I just hit this same problem; are you going to commit that patch soon? It's rather annoying to have make check-world fail. Sure, done. Dave, orangutan should now be able to pass with --enable-nls. Would you restore that option? I can, but is this for HEAD or all versions ?
Re: [HACKERS] Safe memory allocation functions
On 2015-01-16 12:09:25 -0300, Alvaro Herrera wrote: Robert Haas wrote: On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I do think that "safe" is the wrong suffix. Maybe palloc_soft_fail() or palloc_null() or palloc_no_oom() or palloc_unsafe(). I liked palloc_noerror() better myself FWIW. I don't care for noerror() because it probably still will error in some circumstances; just not for OOM.

Yes, but that seems fine to me. We have other functions with noerror flags, and they can still fail under some circumstances -- just not if the error is the most commonly considered scenario in which they fail.

We rely on palloc erroring out on large allocations in a couple places as a crosscheck. I don't think this argument holds much water.

The first example I found is LookupAggNameTypeNames(); there are many more. I don't think this causes any confusion in practice. Another precedent we have is something like missing_ok as a flag name in get_object_address() and other places; following that, we could have this new function as palloc_oom_ok or something like that. But it doesn't seem an improvement to me. (I'm pretty sure we all agree that this must not be a flag to palloc but rather a new function.) Of all the ones you proposed above, the one I like the most is palloc_no_oom, but IMO palloc_noerror is still better.

Neither seems very accurate. no_oom isn't true because they actually can cause OOMs. _noerror isn't true because they can error out - we e.g. rely on palloc erroring out when reading toast tuples (to detect invalid datum lengths) and during parsing of WAL as an additional defense.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Safe memory allocation functions
Andres Freund wrote: On 2015-01-16 12:09:25 -0300, Alvaro Herrera wrote: Robert Haas wrote: On Thu, Jan 15, 2015 at 10:57 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I do think that safe is the wrong suffix. Maybe palloc_soft_fail() or palloc_null() or palloc_no_oom() or palloc_unsafe(). I liked palloc_noerror() better myself FWIW. I don't care for noerror() because it probably still will error in some circumstances; just not for OOM. Yes, but that seems fine to me. We have other functions with noerror flags, and they can still fail under some circumstances -- just not if the error is the most commonly considered scenario in which they fail. We rely on palloc erroring out on large allocations in a couple places as a crosscheck. I don't think this argument holds much water. I don't understand what that has to do with it. Surely we're not going to have palloc_noerror() not error out when presented with a huge allocation. My point is just that the noerror bit in palloc_noerror() means that it doesn't fail in OOM, and that there are other causes for it to error. One thought I just had is that we also have MemoryContextAllocHuge; are we going to consider a mixture of both things in the future, i.e. allow huge allocations to return NULL when OOM? It sounds a bit useless currently, but it doesn't seem extremely far-fetched that we will need further flags in the future. (Or, perhaps, we will want to have code that retries a Huge allocation that returns NULL with a smaller size, just in case it does work.) Maybe what we need is to turn these things into flags to a new generic function. Furthermore, I question whether we really need a palloc variant -- I mean, can we live with just the MemoryContext API instead? As with the Huge variant (which does not have a corresponding palloc equivalent), possible use cases seem very limited so there's probably not much point in providing a shortcut. 
So how about something like

#define ALLOCFLAG_HUGE              0x01
#define ALLOCFLAG_NO_ERROR_ON_OOM   0x02

void *
MemoryContextAllocFlags(MemoryContext context, Size size, int flags);

and perhaps even

#define MemoryContextAllocHuge(cxt, sz) \
    MemoryContextAllocFlags(cxt, sz, ALLOCFLAG_HUGE)

for source-level compatibility. (Now we all agree that palloc() itself is a very hot spot and shouldn't be touched at all. I don't think these new functions are used as commonly as that, so the fact that they are slightly slower shouldn't be too troublesome.)

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
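A rough sketch of what such a flag-based entry point could look like, heavily simplified: no MemoryContext argument, `malloc` as a stand-in for the aset.c allocator, `abort()` standing in for `ereport(ERROR)`, and the `alloc_flags` name chosen here to avoid claiming it is the real API:

```c
#include <stdio.h>
#include <stdlib.h>

#define ALLOCFLAG_HUGE              0x01
#define ALLOCFLAG_NO_ERROR_ON_OOM   0x02

#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)    /* ~1 GB cap, as in PostgreSQL */

/*
 * Hypothetical single entry point: flags select huge-request support and
 * soft OOM failure.  An "error" here is just abort(); real code would
 * ereport(ERROR, ...).
 */
static void *
alloc_flags(size_t size, int flags)
{
    void *ret;

    /* Requests beyond the normal cap need ALLOCFLAG_HUGE. */
    if (size > MAX_ALLOC_SIZE && !(flags & ALLOCFLAG_HUGE))
    {
        fprintf(stderr, "invalid memory alloc request size %zu\n", size);
        abort();
    }

    ret = malloc(size);                         /* stand-in for the allocator */
    if (ret == NULL && !(flags & ALLOCFLAG_NO_ERROR_ON_OOM))
    {
        fprintf(stderr, "out of memory\n");
        abort();
    }
    return ret;     /* may be NULL only when ALLOCFLAG_NO_ERROR_ON_OOM is set */
}
```

With this shape, a Huge-style macro reduces to a one-line wrapper passing `ALLOCFLAG_HUGE`, keeping source-level compatibility for existing callers.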
Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL
On 2015-01-16 15:16:20 +0100, Andreas Karlsson wrote: Indeed. As Noah and I discussed previously in this thread we would need to do quite a bit of refactoring of ruleutils.c to make it fully MVCC.

Right.

For this reason I opted to only lower the lock levels of ADD and ALTER TRIGGER, and not DROP TRIGGER. Neither of those requires MVCC of the WHEN clause.

I'm unconvinced that this is true. Using a snapshot for part of getting a definition certainly opens the door for getting strange results. Acquiring a lock that prevents schema changes on the table and then getting the definition using the syscaches guarantees that the definition is at least self-consistent, because no further schema changes are possible and the catalog caches will be up to date. What you're doing though is doing part of the scan using the transaction's snapshot (as used by pg_dump that will usually be a repeatable read snapshot, and possibly quite old) and the other part using a fresh catalog snapshot. This can result in rather odd things. Just consider:

S1: CREATE TABLE flubber(id serial primary key, data text);
S1: CREATE FUNCTION blarg() RETURNS TRIGGER LANGUAGE plpgsql AS $$BEGIN RETURN NEW; END;$$;
S1: CREATE TRIGGER flubber_blarg BEFORE INSERT ON flubber FOR EACH ROW WHEN (NEW.data IS NOT NULL) EXECUTE PROCEDURE blarg();
S2: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
S2: SELECT 'dosomethingelse';
S1: ALTER TABLE flubber RENAME TO wasflubber;
S1: ALTER TABLE wasflubber RENAME COLUMN data TO wasdata;
S1: ALTER TRIGGER flubber_blarg ON wasflubber RENAME TO wasflubber_blarg;
S1: ALTER FUNCTION blarg() RENAME TO wasblarg;
S2: SELECT pg_get_triggerdef(oid) FROM pg_trigger;

This will give you: the old trigger name, the new table name, the new column name, and the new function name.

I don't think using a snapshot for tiny parts of these functions actually buys anything. Now, this isn't a pattern you introduced. But I think we should think about this for a second before expanding it further. 
Before you argue that this isn't relevant for pg_dump: It is. Precisely the above can happen - just replace the 'dosomethingelse' with several LOCK TABLEs as pg_dump does. The first blocks after a snapshot has been acquired. While waiting, all the ALTERs happen. Arguably the benefit here is that the window for this issue is becoming smaller as pg_dump (and hopefully other possible callers) acquire exclusive locks on the table. I.e. that the lowering of the lock level doesn't introduce new races. But on the other side of the coin, this makes the result of pg_get_triggerdef() even *more* inaccurate in many cases. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Safe memory allocation functions
On 2015-01-16 12:56:18 -0300, Alvaro Herrera wrote: Andres Freund wrote: We rely on palloc erroring out on large allocations in a couple places as a crosscheck. I don't think this argument holds much water. I don't understand what that has to do with it. Surely we're not going to have palloc_noerror() not error out when presented with a huge allocation. Precisely. That means it *does* error out in a somewhat expected path. My point is just that the noerror bit in palloc_noerror() means that it doesn't fail in OOM, and that there are other causes for it to error. That description pretty much describes why it's a misnomer, no? One thought I just had is that we also have MemoryContextAllocHuge; are we going to consider a mixture of both things in the future, i.e. allow huge allocations to return NULL when OOM? I definitely think we should. I'd even say that the usecase is larger for huge allocations. It'd e.g. be rather nice to first try sorting with the huge 16GB work mem and then try 8GB/4/1GB if that fails. It sounds a bit useless currently, but it doesn't seem extremely far-fetched that we will need further flags in the future. (Or, perhaps, we will want to have code that retries a Huge allocation that returns NULL with a smaller size, just in case it does work.) Maybe what we need is to turn these things into flags to a new generic function. Furthermore, I question whether we really need a palloc variant -- I mean, can we live with just the MemoryContext API instead? As with the Huge variant (which does not have a corresponding palloc equivalent), possible use cases seem very limited so there's probably not much point in providing a shortcut. I'm fine with not providing a palloc() equivalent, but I also am fine with having it. 
So how about something like

#define ALLOCFLAG_HUGE              0x01
#define ALLOCFLAG_NO_ERROR_ON_OOM   0x02

void *
MemoryContextAllocFlags(MemoryContext context, Size size, int flags);

and perhaps even

#define MemoryContextAllocHuge(cxt, sz) \
    MemoryContextAllocFlags(cxt, sz, ALLOCFLAG_HUGE)

for source-level compatibility.

I don't know, this seems a bit awkward to use. Your earlier example with the *Huge variant that returns a smaller allocation doesn't really convince me - that'd need a separate API anyway. I definitely do not want to push the nofail stuff via the MemoryContextData API into aset.c. Imo aset.c should always return NULL, and then mcxt.c should throw the error in the normal palloc() function.

(Now we all agree that palloc() itself is a very hot spot and shouldn't be touched at all. I don't think these new functions are used as commonly as that, so the fact that they are slightly slower shouldn't be too troublesome.)

Yea, the speed of the new functions really shouldn't matter.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
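The layering Andres argues for — the allocator layer never throws, the wrapper layer decides whether NULL is an error — can be sketched as follows. The function names and the use of `malloc`/`abort()` are simplified stand-ins of mine, not the actual aset.c/mcxt.c code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocator layer (aset.c in the real tree): never errors, and reports
 * exhaustion purely by returning NULL. */
static void *
allocator_alloc(size_t size)
{
    return malloc(size);
}

/* Wrapper layer (mcxt.c): the normal entry point turns NULL into an error... */
static void *
context_alloc(size_t size)
{
    void *ret = allocator_alloc(size);

    if (ret == NULL)
    {
        fprintf(stderr, "out of memory\n");     /* ereport(ERROR) in reality */
        abort();
    }
    return ret;
}

/* ...while the no-error variant simply passes the NULL through to the caller. */
static void *
context_alloc_noerror(size_t size)
{
    return allocator_alloc(size);
}
```

The point of the split is that the error policy lives in exactly one layer, so adding a no-error variant does not require touching the allocator internals at all.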
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
On Fri, Jan 16, 2015 at 12:54 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jan 15, 2015 at 9:48 PM, Sawada Masahiko sawada.m...@gmail.com wrote: On Thu, Jan 15, 2015 at 2:02 PM, Amit Kapila amit.kapil...@gmail.com wrote: One thought I have in this line is that currently there doesn't seem to be a way to know if a setting has an entry both in postgresql.conf and postgresql.auto.conf; if we can have some way of knowing the same (pg_settings?), then it could be convenient for the user to decide if the value in postgresql.auto.conf is useful or not, and if it's not useful, then use the ALTER SYSTEM .. RESET command to remove it from postgresql.auto.conf.

I think one way is that pg_settings has the file name of variables, but it would not reflect the current status of postgresql.conf, so we would need to parse postgresql.conf again at that time.

Yeah, that could be a possibility, but I think that will break the existing command(s), as this is the common infrastructure used for SHOW .. commands as well, which display the GUC value that is used by the current session rather than the value in postgresql.conf.

You're right. pg_settings and the SHOW command use the value in the current session rather than the config file; it might break that common infrastructure.

I don't know how appealing it would be to others, but a new view like pg_file_settings, which would display the settings in the file, could be meaningful for your need. Another way is that the user can do pg_reload_conf() to see the latest values (excluding values for server-startup-time parameters).

I prefer the former, because the latter would not handle all GUC variables as you said. The new view, like pg_file_settings, would show each GUC variable and the config file that contains it. It would be a helpful view for things like ALTER SYSTEM ... RESET, which might be a beneficial feature for users who want to manage configuration files manually. I would like to propose them. 
Regards, --- Sawada Masahiko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Safe memory allocation functions
Andres Freund wrote: On 2015-01-16 12:56:18 -0300, Alvaro Herrera wrote: So how about something like

#define ALLOCFLAG_HUGE              0x01
#define ALLOCFLAG_NO_ERROR_ON_OOM   0x02

void *
MemoryContextAllocFlags(MemoryContext context, Size size, int flags);

I don't know, this seems a bit awkward to use. Your earlier example with the *Huge variant that returns a smaller allocation doesn't really convince me - that'd need a separate API anyway.

What example was that? My thinking was that the mcxt.c function would return NULL if the request was not satisfied; only the caller would be entitled to retry with a smaller size. I was thinking in something like

baseflag = ALLOCFLAG_NO_ERROR_ON_OOM;
reqsz = SomeHugeValue;
while (true)
{
    ptr = MemoryContextAllocFlags(cxt, reqsz, ALLOCFLAG_HUGE | baseflag);
    if (ptr != NULL)
        break;              /* success */

    /* too large, retry with a smaller allocation */
    reqsz *= 0.75;

    /* if under some limit, have it fail next time */
    if (reqsz < SomeHugeValue * 0.1)
        baseflag = 0;
}

/* by here, you know ptr points to a memory area of size reqsz, which is
 * between SomeHugeValue * 0.1 and SomeHugeValue. */

Were you thinking of something else?

I definitely do not want to push the nofail stuff via the MemoryContextData API into aset.c. Imo aset.c should always return NULL and then mcxt.c should throw the error in the normal palloc() function.

Sure, that seems reasonable ...

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] speedup tidbitmap patch: cache page
On 2014-12-25 01:26:53 +1300, David Rowley wrote: So I think v3 is the one to go with, and I can't see any problems with it, so I'm marking it as ready for committer. And committed. Thanks Teodor and David. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC
On 14/01/15 17:22, Gabriele Bartolini wrote: My opinion, Marco, is that for version 5 of this patch, you: 1) update the information on the wiki (it is outdated - I know you have been busy with the LSN map optimisation)

Done.

2) modify pg_basebackup in order to accept a directory (or tar file) and automatically detect the LSN from the backup profile

New version of patch attached. The -I parameter now requires a backup profile from a previous backup. I've added a sanity check that forbids incremental file-level backups if the base timeline is different from the current one.

3) add the documentation regarding the backup profile and pg_basebackup

Next on my TODO list.

Once we have all of this, we can continue trying the patch. Some unexplored paths are: * tablespace usage

I've improved my pg_restorebackup python PoC. It now supports tablespaces.

* tar format * performance impact (in both read-only and heavily updated contexts)

From the server point of view, the current code generates a load similar to a normal backup. It only adds an initial scan of each data file to decide whether it has to send it. Once it finds a single newer page, it immediately stops scanning and starts sending the file. The IO impact should not be that big due to the filesystem cache, but I agree with you that it has to be measured. 
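The server-side decision Marco describes — scan a file's pages and send the whole file as soon as one page is newer than the incremental start point — can be sketched as below. This is only an illustrative model (the `file_needs_backup` helper is hypothetical); it assumes the standard PostgreSQL page layout, where pd_lsn occupies the first 8 bytes of each 8 kB page as two native-endian 32-bit halves:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192     /* standard PostgreSQL block size */

/*
 * Hypothetical helper: return 1 if any page in the file has an LSN newer
 * than the incremental start point, i.e. the file must be sent in full.
 */
static int
file_needs_backup(FILE *fp, uint64_t incremental_start_lsn)
{
    unsigned char page[BLCKSZ];

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        uint32_t xlogid, xrecoff;
        uint64_t page_lsn;

        /* pd_lsn = {xlogid (high 32 bits), xrecoff (low 32 bits)} */
        memcpy(&xlogid, page, 4);
        memcpy(&xrecoff, page + 4, 4);
        page_lsn = ((uint64_t) xlogid << 32) | xrecoff;

        /* A single newer page is enough: stop scanning, send the file. */
        if (page_lsn > incremental_start_lsn)
            return 1;
    }
    return 0;   /* every page predates the incremental start point */
}
```

This also makes the worst case visible: a file that has not changed at all must still be read end to end before it can be skipped, which is why the read-only performance impact deserves measuring.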
Regards, Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it

From f7cf8b9dd7d32f64a30dafaeeaeb56cbcd2eafff Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v5

Add backup profile to pg_basebackup
INCREMENTAL option implementation
---
 src/backend/access/transam/xlog.c      |   7 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 335 +++--
 src/backend/replication/repl_gram.y    |   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  | 147 +--
 src/include/access/xlog.h              |   3 +-
 src/include/replication/basebackup.h   |   4 +
 8 files changed, 473 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..1e50625 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*** XLogFileNameP(TimeLineID tli, XLogSegNo
*** 9249,9255 ****
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
  				   char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
--- 9249,9256 ----
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast,
! 				   XLogRecPtr incremental_startpoint,
  				   TimeLineID *starttli_p, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
*** do_pg_start_backup(const char *backupids
*** 9468,9473 ****
--- 9469,9478 ----
  					 (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
  	appendStringInfo(labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
  					 (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
+ 	if (incremental_startpoint > 0)
+ 		appendStringInfo(labelfbuf, "INCREMENTAL FROM LOCATION: %X/%X\n",
+ 						 (uint32) (incremental_startpoint >> 32),
+ 						 (uint32) incremental_startpoint);
  	appendStringInfo(labelfbuf, "BACKUP METHOD: %s\n",
  					 exclusive ? "pg_start_backup" : "streamed");
  	appendStringInfo(labelfbuf, "BACKUP FROM: %s\n",
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 2179bf7..ace84d8 100644
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 59,65 ****
  				 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				  errmsg("must be superuser or replication role to run a backup")));

! 	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);

  	PG_RETURN_LSN(startpoint);
  }
--- 59,65 ----
  				 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  				  errmsg("must be superuser or replication role to run a backup")));

! 	startpoint = do_pg_start_backup(backupidstr, fast, 0, NULL, NULL);

  	PG_RETURN_LSN(startpoint);
  }
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 07030a2..05b19c5 100644
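The stop-at-first-newer-page scan Marco describes can be sketched roughly like this (all names here are invented for illustration; the real patch reads each 8 kB page header from disk, whereas an in-memory LSN array stands in below):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's 64-bit LSN */

/*
 * Toy sketch of the per-file decision: scan the page LSNs of a data file
 * and stop at the first page newer than the start LSN of the previous
 * backup.  Finding one newer page is enough to send the whole file.
 */
static bool
file_needs_backup(const XLogRecPtr *page_lsns, size_t npages,
                  XLogRecPtr prev_backup_lsn)
{
    for (size_t i = 0; i < npages; i++)
    {
        if (page_lsns[i] > prev_backup_lsn)
            return true;        /* newer page found: stop scanning */
    }
    return false;               /* every page predates the base backup */
}
```

In the common case where a file was modified recently, the newer page tends to be found early, which is why the extra scan adds little I/O on top of the filesystem cache.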
Re: [HACKERS] proposal: searching in array function - array_position
On 1/16/15 3:39 AM, Pavel Stehule wrote: I am proposing a simple function, that returns a position of element in array. Yes please! FUNCTION array_position(anyarray, anyelement) RETURNS int That won't work on a multi-dimensional array. Ideally it needs to accept a slice or an element and return the specifier for the slice. This wouldn't be so bad if we had an easier way to extract subsets of an array, but even that is really ugly because whatever you extract keeps the original number of dimensions. Implementation is simple (plpgsql code) This would actually be written in C though, yes? Otherwise it's not really any better than just doing an extension... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
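For the simple one-dimensional case, the semantics under discussion amount to something like this C sketch (`array_position_int` is an invented name; a real server-side implementation would iterate over Datums and compare with the element type's equality operator):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy sketch of array_position(anyarray, anyelement) for a 1-D int array:
 * return the 1-based position of the first occurrence, or 0 when absent
 * (SQL would return NULL instead of 0).
 */
static size_t
array_position_int(const int *arr, size_t nelems, int elem)
{
    for (size_t i = 0; i < nelems; i++)
        if (arr[i] == elem)
            return i + 1;       /* SQL array subscripts start at 1 */
    return 0;                   /* not found */
}
```

As Jim notes, none of this generalizes cleanly to multi-dimensional arrays, where the "position" would have to identify a slice rather than an element.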
Re: [HACKERS] Temporal features in PostgreSQL
What's the current status of this topic? Has someone worked on temporal tables for PostgreSQL since 2012? I'm giving a presentation at Fosdem later this month in Brussels, on the topic of temporal tables, and would like to give all possibly relevant information to the audience! -- Peter Vanroose, Leuven, Belgium. -- View this message in context: http://postgresql.nabble.com/Temporal-features-in-PostgreSQL-tp5737881p5834312.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] proposal: lock_time for pg_stat_database
Hi all, some time ago I proposed a lock time measurement related to queries. The main issue was how to show this information. Today's proposal is a little simpler, but still useful. We can show the total lock time per database in the pg_stat_database statistics. A high number can be a signal of lock issues. Comments, ideas, notices? Regards Pavel
Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload
On 2014-12-15 19:38:16 +0300, Alex Shulgin wrote: Attached is the modified version of the original patch by Craig, addressing the handling of the new hint_log error data field and removing the client-side HINT. I'm not a big fan of this implementation. We're adding a fair bit of infrastructure (i.e. server-only hints) where in other places we use NOTICE for similar hints. Why don't we just emit a NOTICE or WARNING in the relevant place saying that pg_hba.conf is outdated? Then the server won't log those if configured appropriately, which doesn't seem like a bad thing. Note that >= ERROR messages aren't sent to the client during authentication. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: row_to_array function
On 1/16/15 3:45 AM, Pavel Stehule wrote: I am returning to processing records in plpgsql. I think they can be simply processed with a transformation to an array. Now we have similar functions - hstore(row), row_to_json, ... but using these functions can add a useless step. Any row variable can be transformed to a 2D text array. How is it useless? Why wouldn't you just use JSON and be done with it? Do you have some use cases you can share? There are two possible transformations: row_to_array -- [[key1, value1],[key2, value2], ...] row_to_row_array -- [(key1, value1), (key2, value2), ... ] If we're going to go that route, I think it makes more sense to create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an array of that. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload
On 2015-01-16 18:01:24 +0100, Andres Freund wrote: Why don't we just emit a NOTICE or WARNING in the relevant place saying that pg_hba.conf is outdated? Then the server won't log those if configured appropriately, which doesn't seem like a bad thing. Note that >= ERROR messages aren't sent to the client during authentication. I think a 'WARNING: client authentication failed/succeeded with an outdated pg_hba.conf in effect' would actually be a good way to present this. It's not only failed logins where this is relevant. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind in contrib
On 01/16/2015 03:30 AM, Peter Eisentraut wrote: Here is a random bag of comments for the v5 patch: Addressed most of your comments, and Michael's. Another version attached. More on a few of the comments below.

The option --source-server had me confused at first, because the entry above under --source-pgdata also talks about a source server. Maybe --source-connection would be clearer?

Hmm. I would rather emphasize that the source is a running server than the connection to the server. I can see the confusion, though. What about phrasing it as:

--source-pgdata

    Specifies path to the data directory of the source server, to
    synchronize the target with. When <option>--source-pgdata</option> is
    used, the source server must be cleanly shut down.

The point is that whether you use --source-pgdata or --source-server, the source is a PostgreSQL server. Perhaps it should be mentioned here that you only need one of --source-pgdata and --source-server, even though the synopsis says that too. Another idea is to rename --source-server to just --source. That would make sense if we assume that it's more common to connect to a live server:

pg_rewind --target mypgdata --source host=otherhost user=dba
pg_rewind --target mypgdata --source-pgdata /other/pgdata

Reference pages have standardized top-level headers, so "Theory of operation" should be under something like "Notes". Similarly for "Restrictions", but that seems important enough to go into the description.

Oh. That's rather limiting. I'll rename "Restrictions" to "Notes" - that seems to be where we have listed limitations like that in many other pages. I also moved "Theory of operation" as a "How it works" subsection under "Notes". The top-level headers aren't totally standardized in man pages. Googling around, a few seem to have a section called IMPLEMENTATION NOTES, which would be a good fit here. We don't have such sections currently, but how about it?

There should be an installcheck target.
Doesn't seem appropriate, as there are no regression tests that would run against an existing cluster. Or should it just use the installed pg_rewind and postgres binary, but create the temporary clusters all the same?

RewindTest.pm should be in the t/ directory.

I used a similar layout in src/test/ssl, so that ought to be changed too if we're not happy with it. make check will not find the module if I just move it to t/, so that would require changes to Makefiles or something. I don't know how to do that offhand.

Instead of FILE_TYPE_DIRECTORY etc., why not use S_IFDIR etc.?

I don't see what that would buy us. Most of the code only knows about a few S_IF* types, namely regular files, directories and symlinks. Those are the types that there are FILE_TYPE_* codes for. If we started using the more general S_IF* constants, it would not be clear which types can appear in which parts of the code. Also, the compiler would no longer warn about omitting one of the types in a switch (file->type) statement.

TestLib.pm addition command_is sounds a bit wrong. It's evidently modelled after command_like, but that now sounds wrong too. How about command_stdout_is?

Works for me. How about also renaming command_like to command_stdout_like, and committing that along with the new command_stdout_is function as a separate patch, before pg_rewind?

The test suite needs to silence all non-TAP output. So psql needs to be run with -q, pg_ctl with -s etc. Any important output needs to be through diag() or note(). Test cases like ok 6 - psql -A -t --no-psqlrc -c port=10072 -c SELECT datname FROM pg_database exit code 0 should probably get a real name.

The whole structure of the test suite still looks too much like the old hack. I'll try to think of other ways to structure it.

Yeah.
It currently works with callback functions, like:

test program:
- call RewindTest::run_rewind_test
    set up both clusters
    - call before_standby callback
    start standby
    - call standby_following_master callback
    promote standby
    - call after_promotion callback
    stop master
    run pg_rewind
    restart master
    - call after_rewind callback

It might be better to turn the control around, so that all the code that's now in the callbacks is in the test program's main flow, and the test program calls functions in RewindTest.pm to execute the parts that are currently between the callbacks:

test program:
- call RewindTest::setup_cluster
    do stuff that's now in before_standby callback
- call RewindTest::start_standby
    do stuff that's now in standby_following_master callback
- call RewindTest::promote_standby
    do stuff that's now in after_promotion callback
- call RewindTest::run_pg_rewind
    do stuff that's now in after_rewind callback

- Heikki

pg_rewind-bin-6.patch.gz
Description: application/gzip

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
Re: [HACKERS] speedup tidbitmap patch: cache page
Andres Freund and...@2ndquadrant.com writes: On 2014-12-25 01:26:53 +1300, David Rowley wrote: So I think v3 is the one to go with, and I can't see any problems with it, so I'm marking it as ready for committer. And committed. It strikes me that this patch leaves some lookups on the table, specifically that it fails to avoid repeated hash_search lookups inside tbm_page_is_lossy() in the situation where we're adding new TIDs to an already-lossified page. Is it worth adding a few more lines to handle that case as well? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: searching in array function - array_position
2015-01-16 17:57 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 3:39 AM, Pavel Stehule wrote: I am proposing a simple function, that returns a position of element in array. Yes please! FUNCTION array_position(anyarray, anyelement) RETURNS int That won't work on a multi-dimensional array. Ideally it needs to accept a slice or an element and return the specifier for the slice. It is a question what the result should be - probably there can be a multidimensional variant, where the result will be an array array_position([1,2,3],2) -- 2 array_position([[1,2],[2,3],[3,4]], [2,3]) -- 2 /* 2nd parameter should have N-1 dimensions of the first parameter */ array_position_md([1,2,3],2) -- [2] array_position_md([[1,2],[2,3],[3,4]], 2) -- [2,1] another question is how to solve more than one occurrence of one value - probably two sets of functions - the first returns the first occurrence of the value, the second returns the set of occurrences This wouldn't be so bad if we had an easier way to extract subsets of an array, but even that is really ugly because whatever you extract keeps the original number of dimensions. Implementation is simple (plpgsql code) This would actually be written in C though, yes? Otherwise it's not really any better than just doing an extension... Sure, I expect a C implementation Pavel -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] speedup tidbitmap patch: cache page
On 2015-01-16 12:15:35 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-12-25 01:26:53 +1300, David Rowley wrote: So I think v3 is the one to go with, and I can't see any problems with it, so I'm marking it as ready for committer. And committed. It strikes me that this patch leaves some lookups on the table, specifically that it fails to avoid repeated hash_search lookups inside tbm_page_is_lossy() in the situation where we're adding new TIDs to an already-lossified page. Is it worth adding a few more lines to handle that case as well? There was an alternative version (v2.3 in 549950fb.2050...@sigaev.ru) of the patch that cached the lossiness of a page, but Teodor/David didn't find it to be beneficial in their benchmarking. Teodor's argument was basically that it's completely lost in the noise due to the much bigger overhead of rechecks. I thought it'd be better to get this improvement committed rather than waiting for someone to find an even bigger improvement for some case. Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload
Andres Freund and...@2ndquadrant.com writes: Why don't we just emit a NOTICE or WARNING in the relevant place saying that pg_hba.conf is outdated? Then the server won't log those if configured appropriately, which doesn't seem like a bad thing. Note that >= ERROR messages aren't sent to the client during authentication. I think people felt that sending that information to the client wouldn't be a good idea security-wise. But I'd phrase it as "why not just emit a LOG message?". I agree that new logging infrastructure seems like overkill. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: row_to_array function
2015-01-16 18:03 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 3:45 AM, Pavel Stehule wrote: I am returning to processing records in plpgsql. I think they can be simply processed with a transformation to an array. Now we have similar functions - hstore(row), row_to_json, ... but using these functions can add a useless step. Any row variable can be transformed to a 2D text array. How is it useless? Why wouldn't you just use JSON and be done with it? We can use a FOREACH IN ARRAY iteration in plpgsql (a second variant is an implementation of FOREACH for jsonb), so ROW->ARRAY is shorter than ROW->JSON->ARRAY or ROW->HSTORE->ARRAY Do you have some use cases you can share? processing of NEW, OLD variables in triggers There are two possible transformations: row_to_array -- [[key1, value1],[key2, value2], ...] row_to_row_array -- [(key1, value1), (key2, value2), ... ] If we're going to go that route, I think it makes more sense to create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an array of that. ok -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] proposal: lock_time for pg_stat_database
On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago, I proposed a lock time measurement related to query. A main issue was a method, how to show this information. Today proposal is little bit simpler, but still useful. We can show a total lock time per database in pg_stat_database statistics. High number can be signal about lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
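Jim's counting idea could be sketched roughly like this toy example (all names invented; in the server this would sit on the LWLock/spinlock fast path, not on a boolean flag, and the counter would feed the stats collector):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* how often the cheap, non-blocking acquisition attempt failed */
static uint64_t lock_contention_count = 0;

/*
 * Toy "try acquire" on a boolean flag standing in for a lock: count a
 * contention event whenever the initial attempt finds the lock held,
 * rather than trying to measure how long we then wait.
 */
static bool
try_acquire_counted(bool *held)
{
    if (*held)
    {
        lock_contention_count++;    /* initial acquisition failed */
        return false;               /* caller would now wait/retry */
    }
    *held = true;
    return true;
}
```

Counting failed first attempts is much cheaper than timestamping every wait, which is the trade-off Jim is pointing at for non-heavyweight locks.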
Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload
On 2015-01-16 12:21:13 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: Why don't we just emit a NOTICE or WARNING in the relevant place saying that pg_hba.conf is outdated? Then the server won't log those if configured appropriately, which doesn't seem like a bad thing. Note that >= ERROR messages aren't sent to the client during authentication. I think people felt that sending that information to the client wouldn't be a good idea security-wise. It won't if issued during the right phase of the authentication:

	/*
	 * client_min_messages is honored only after we complete the
	 * authentication handshake.  This is required both for security
	 * reasons and because many clients can't handle NOTICE messages
	 * during authentication.
	 */
	if (ClientAuthInProgress)
		output_to_client = (elevel >= ERROR);
	else
		output_to_client = (elevel >= client_min_messages ||
							elevel == INFO);
	}

Surely deserves a comment on the emitting site. But I'd phrase it as "why not just emit a LOG message?". Well, LOGs can be sent to the client just the same, no? Just requires a nondefault client_min_messages. But as I don't think sending logs to the client is an insurmountable problem (due to the above) I don't really care if we use WARNING or LOG. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: * It has been repeatedly pointed out that we may want to decouple partitioning from inheritance because implementing partitioning as an extension of inheritance mechanism means that we have to keep all the existing semantics which might limit what we want to do with the special case of it which is partitioning; in other words, we would find ourselves in a difficult position where we have to inject special case code into a very generalized mechanism that is inheritance. Specifically, do we regard partitions as pg_inherits children of their partitioning parent? I don't think this is totally an all-or-nothing decision. I think everyone is agreed that we need to not break things that work today -- e.g. Merge Append. What that implies for pg_inherits isn't altogether clear. * Syntax: do we want to make it similar to one of the many other databases out there? Or we could invent our own? Well, what I think we don't want is something that is *almost* like some other database but not quite. I lean toward inventing our own since I'm not aware of something that we'd want to copy exactly. I wonder if we could add a clause like DISTRIBUTED BY to complement PARTITION ON that represents a hash distributed/partitioned table (that could be a syntax to support sharded tables maybe; we would definitely want to move ahead in that direction I guess) Maybe eventually, but let's not complicate things by worrying too much about that now. * Catalog: We would like to have a catalog structure suitable to implement capabilities like multi-column partitioning, sub-partitioning (with arbitrary number of levels in the hierarchy). I had suggested that we create two new catalogs viz. pg_partitioned_rel, pg_partition_def to store metadata about a partition key of a partitioned relation and partition bound info of a partition, respectively.
Also, see the point about on-disk representation of partition bounds I'm not convinced that there is any benefit in spreading this information across two tables neither of which exist today. If the representation of the partitioning scheme is going to be a node tree, then there's no point in taking what would otherwise have been a List and storing each element of it in a separate tuple. The overarching point here is that the system catalog structure should be whatever is most convenient for the system internals; I'm not sure we understand what that is yet. * It is desirable to treat partitions as pg_class relations with perhaps a new relkind(s). We may want to choose an implementation where only the lowest level relations in a partitioning hierarchy have storage; those at the upper layers are mere placeholder relations though of course with associated constraints determined by partitioning criteria (with appropriate metadata entered into the additional catalogs). I think the storage-less parents need a new relkind precisely to denote that they have no storage; I am not convinced that there's any reason to change the relkind for the leaf nodes. But that's been proposed, so evidently someone thinks there's a reason to do it. I am not quite sure if each kind of the relations involved in the partitioning scheme have separate namespaces and, if they are, how we implement that I am in favor of having all of the nodes in the hierarchy have names just as relations do today -- pg_class.relname. Anything else seems to me to be complex to implement and of very marginal benefit. But again, it's been proposed. * In the initial implementation, we could just live with partitioning on a set of columns (and not arbitrary expressions of them) Seems quite fair. 
* We perhaps do not need multi-column LIST partitions as they are not very widely used and may complicate the implementation I agree that the use case is marginal; but I'm not sure it needs to complicate the implementation much. Depending on how the implementation shakes out, prohibiting it might come to seem like more of a wart than allowing it. * There are a number of suggestions about how we represent partition bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype associated with the relation itself), etc. Important point to consider here may be that partition key may contain more than one column Yep. * How we represent partition definition in memory (for a given partitioned relation) - important point to remember is that such a representation should be efficient to iterate through or binary-searchable. Also see the points about tuple-routing and partition-pruning Yep. * Overflow/catchall partition: it seems we do not want/need them. It might seem desirable for example in cases where a big transaction enters a large number of tuples all but one of which find a defined partition; we may not want to error out in such case but instead enter that erring tuple into the overflow partition
Re: [HACKERS] speedup tidbitmap patch: cache page
Andres Freund and...@2ndquadrant.com writes: On 2015-01-16 12:15:35 -0500, Tom Lane wrote: It strikes me that this patch leaves some lookups on the table, specifically that it fails to avoid repeated hash_search lookups inside tbm_page_is_lossy() in the situation where we're adding new TIDs to an already-lossified page. Is it worth adding a few more lines to handle that case as well? There was an alternative version (v2.3 in 549950fb.2050...@sigaev.ru) of the patch that cached the lossiness of a page, but Teodor/David didn't find it to be beneficial in their benchmarking. Teodor's argument was basically that it's completely lost in the noise due to the much bigger overhead of rechecks. That's a fair point, but on reflection it seems like a patch that covered this case too wouldn't actually be any more complicated, so why not? v2.3 is pretty brute-force and I agree it's not very attractive, but I was thinking about something like

	BlockNumber	cached_blkno = InvalidBlockNumber;
	PagetableEntry *page = NULL;

	inside loop:

		/* look up the target page unless we already have it */
		if (blk != cached_blkno)
		{
			if (tbm_page_is_lossy(tbm, blk))
				page = NULL;
			else
				page = tbm_get_pageentry(tbm, blk);
			cached_blkno = blk;
		}
		if (page == NULL)
			continue;			/* page is already marked lossy */

The reset after calling tbm_lossify() would just need to be cached_blkno = InvalidBlockNumber. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: lock_time for pg_stat_database
2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago I proposed a lock time measurement related to queries. The main issue was how to show this information. Today's proposal is a little simpler, but still useful. We can show the total lock time per database in the pg_stat_database statistics. A high number can be a signal of lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). It can use the existing stats mechanisms. I'm afraid it isn't possible to assign waiting time to a table, because it depends on ordering. Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something. Now that I think about it, lock_time is not a good name - maybe "waiting lock time" (the lock time itself is not interesting; the waiting is) - and it can be divided into some more categories - in GoodData we use Heavyweight, pages, and other categories. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] proposal: searching in array function - array_position
On 1/16/15 11:16 AM, Pavel Stehule wrote: 2015-01-16 17:57 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 3:39 AM, Pavel Stehule wrote: I am proposing a simple function, that returns a position of element in array. Yes please! FUNCTION array_position(anyarray, anyelement) RETURNS int That won't work on a multi-dimensional array. Ideally it needs to accept a slice or an element and return the specifier for the slice. It is a question what the result should be - probably there can be a multidimensional variant, where the result will be an array array_position([1,2,3],2) -- 2 array_position([[1,2],[2,3],[3,4]], [2,3]) -- 2 /* 2nd parameter should have N-1 dimensions of the first parameter */

The problem with that is you can't actually use '2' to get [2,3] back:

select (array[[1,2,3],[4,5,6],[7,8,9]])[1] IS NULL;
 ?column?
----------
 t
(1 row)

I think the bigger problem here is we need something better than slices for handling subsets of arrays. Even if the function returned [2:2] it's still going to behave differently than it will in the non-array case because you won't be getting the expected number of dimensions back. :( array_position_md([1,2,3],2) -- [2] array_position_md([[1,2],[2,3],[3,4]], 2) -- [2,1] another question is how to solve more than one occurrence of one value - probably two sets of functions - the first returns the first occurrence of the value, the second returns the set of occurrences Gee, if only we had some way to return multiple elements of something... ;P In other words, I think all of these should actually return an array of positions. I think it's OK for someone that only cares about the first instance to just do [1]. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
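The "array of positions" behaviour Jim suggests could be sketched like this (`array_positions_int` is a hypothetical name; a server-side version would likely accumulate results with accumArrayResult() and return an int[]):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy sketch: collect the 1-based position of every occurrence of elem
 * in a 1-D int array into "positions" (sized by the caller) and return
 * how many were found.  A caller wanting only the first match takes the
 * first result (SQL: subscript [1]).
 */
static size_t
array_positions_int(const int *arr, size_t nelems, int elem,
                    size_t *positions)
{
    size_t nfound = 0;

    for (size_t i = 0; i < nelems; i++)
        if (arr[i] == elem)
            positions[nfound++] = i + 1;
    return nfound;
}
```

Returning all occurrences sidesteps the "first vs. set of occurrences" split Pavel raises: one function covers both uses.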
Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload
Andres Freund and...@2ndquadrant.com writes: On 2015-01-16 12:21:13 -0500, Tom Lane wrote: I think people felt that sending that information to the client wouldn't be a good idea security-wise. It won't if issued during the right phase of the authentication: Good point. But as I don't think sending logs to the client is a unsurmountable problem (due to the above) I don't really care if we use WARNING or LOG. If we're expecting the DBA to need to read it in the postmaster log, remember that LOG is actually *more* likely to get to the log than WARNING is. Choosing WARNING because you think it sounds more significant is a mistake. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fillfactor for GIN indexes
On Thu, Jan 15, 2015 at 7:06 AM, Michael Paquier michael.paqu...@gmail.com wrote: Alexander Korotkov wrote: I'm not sure. On the one hand it's unclear why fillfactor should be different from 9.4. On the other hand it's unclear why it should be different from btree. I propose marking this ready for committer. So, the committer can make a final decision. OK, let's do so then. My preference is to fully pack the index at build. GIN compression has been one of the headlines of 9.4. I'm struggling to understand why we shouldn't just reject this patch. On November 27th, Cedric said: "what are the benefits of this patch? (maybe you had some test case or a benchmark?)" Nobody replied. On January 15th, you (Michael) hypothesized that this patch has value to control random updates on GIN indexes, but there seem to be absolutely no test results showing that any such value exists. There's only value in adding a fillfactor parameter to GIN indexes if it improves performance. There are no benchmarks showing it does. So, why are we still talking about this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: row_to_array function
On 1/16/15 11:22 AM, Pavel Stehule wrote: 2015-01-16 18:03 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 3:45 AM, Pavel Stehule wrote: I am returning to processing records in plpgsql. I think they can be simply processed with a transformation to an array. Now we have similar functions - hstore(row), row_to_json, ... but using these functions can add a useless step. Any row variable can be transformed to a 2D text array. How is it useless? Why wouldn't you just use JSON and be done with it? We can use a FOREACH IN ARRAY iteration in plpgsql (a second variant is an implementation of FOREACH for jsonb), so ROW->ARRAY is shorter than ROW->JSON->ARRAY or ROW->HSTORE->ARRAY I think the real problem here is that we're inventing a bunch of different ways to do the same thing: iterate over a set. Instead of doing that, should we add the idea of an iterator to the type system? That would make sense for arrays, hstore, json and XML. Do you have some use cases you can share? processing of NEW, OLD variables in triggers Note that last time I checked you couldn't do something like NEW.variable, and I don't think you could use EXEC to do it either. So there's more needed here than just converting a record to an array. There are two possible transformations: row_to_array -- [[key1, value1],[key2, value2], ...] row_to_row_array -- [(key1, value1), (key2, value2), ... ] If we're going to go that route, I think it makes more sense to create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an array of that. ok -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble!
http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Hi,

On 2015-01-05 17:16:43 +0900, Michael Paquier wrote:

+ <varlistentry id="restore-command-retry-interval" xreflabel="restore_command_retry_interval">
+  <term><varname>restore_command_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+    <primary><varname>restore_command_retry_interval</> recovery parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    If <varname>restore_command</> returns nonzero exit status code, retry
+    command after the interval of time specified by this parameter,
+    measured in milliseconds if no unit is specified. Default value is
+    <literal>5s</>.
+   </para>
+   <para>
+    This command is particularly useful for warm standby configurations
+    to leverage the amount of retries done to restore a WAL segment file
+    depending on the activity of a master node.
+   </para>

Don't really like this paragraph. What's "leveraging the amount of retries done" supposed to mean?

+# restore_command_retry_interval
+#
+# specifies an optional retry interval for restore_command command, if
+# previous command returned nonzero exit status code. This can be useful
+# to increase or decrease the number of calls of restore_command.

It's not really the number but the frequency, right?

+ else if (strcmp(item->name, "restore_command_retry_interval") == 0)
+ {
+     const char *hintmsg;
+
+     if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS,
+                    &hintmsg))
+         ereport(ERROR,
+                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                  errmsg("parameter \"%s\" requires a temporal value",
+                         "restore_command_retry_interval"),
+                  hintmsg ? errhint("%s", _(hintmsg)) : 0));

"temporal value" sounds odd to my ears. I'd rather keep in line with what guc.c outputs in such a case.
+ now = GetCurrentTimestamp();
+ if (!TimestampDifferenceExceeds(last_fail_time, now,
+                                 restore_command_retry_interval))
  {
-     pg_usleep(1000000L * (5 - (now - last_fail_time)));
-     now = (pg_time_t) time(NULL);
+     long    secs;
+     int     microsecs;
+
+     TimestampDifference(last_fail_time, now, &secs, &microsecs);
+     pg_usleep(restore_command_retry_interval * 1000L -
+               (1000000L * secs + 1L * microsecs));
+     now = GetCurrentTimestamp();
  }
  last_fail_time = now;
  currentSource = XLOG_FROM_ARCHIVE;

Am I missing something or does this handle/affect streaming failures just the same? That might not be a bad idea, because the current interval is far too long for e.g. a sync replication setup. But it certainly makes the name a complete misnomer.

Not this patch's fault, but I'm getting a bit tired of seeing the above open-coded. How about adding a function that does the sleeping based on a timestamptz and a ms interval?

Greetings,

Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
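The shared helper Andres asks for — sleep based on a timestamptz and a millisecond interval — could look roughly like this. This is only a standalone sketch, not PostgreSQL code: `retry_sleep_usecs` is a hypothetical name, and plain `int64_t` microsecond timestamps stand in for `TimestampTz` (which is likewise microseconds since an epoch); the real helper would pass the returned value to `pg_usleep()`.

```c
#include <stdint.h>

/* Hypothetical helper: given the time of the last failure and "now" (both
 * in microseconds, standing in for TimestampTz) plus a retry interval in
 * milliseconds, return how many microseconds remain to sleep, or 0 if the
 * interval has already elapsed. */
static int64_t
retry_sleep_usecs(int64_t last_fail_time, int64_t now, int interval_ms)
{
    int64_t deadline = last_fail_time + (int64_t) interval_ms * 1000;

    if (now >= deadline)
        return 0;              /* interval elapsed: retry immediately */
    return deadline - now;     /* value the caller would hand to pg_usleep() */
}
```

Centralizing the arithmetic this way would also make it harder to reintroduce the unit-conversion slips visible in the hunk above.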
Re: [HACKERS] proposal: row_to_array function
2015-01-16 18:42 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:22 AM, Pavel Stehule wrote: 2015-01-16 18:03 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 3:45 AM, Pavel Stehule wrote: I am returning back to processing records in plpgsql. I am thinking so it can be simply processed with transformations to array. Now we have similar functions - hstore(row), row_to_json, ... but using of these functions can be a useless step. Any row variable can be transformed to 2D text array. How is it useless? Why wouldn't you just use JSON and be done with it? We can use a FOREACH IN ARRAY iteration in plpgsql (second variant is a implementation FOREACH for jsonb) so ROW-ARRAY is shorter than ROW-JSON-ARRAY or ROW-HSTORE-ARRAY I think the real problem here is that we're inventing a bunch of different ways to do the same thing: iterate over a set. Instead of doing that, should we add the idea of an iterator to the type system? That would make sense for arrays, hstore, json and XML. what do you think? How this can be implemented? Do you have some use cases you can share? processing of NEW, OLD variables in triggers Note that last time I checked you couldn't do something like NEW.variable, and I don't think you could use EXEC to do it either. So there's more needed here than just converting a record to an array. There two possible transformations: row_to_array -- [[key1, value1],[key2, value2], ...] row_to_row_array -- [(key1, value1), (key2, value2), ... ] If we're going to go that route, I think it makes more sense to create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an array of that. ok -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] proposal: lock_time for pg_stat_database
On 1/16/15 11:35 AM, Pavel Stehule wrote: 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com mailto:jim.na...@bluetreble.com: On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago, I proposed a lock time measurement related to query. A main issue was a method, how to show this information. Today proposal is little bit simpler, but still useful. We can show a total lock time per database in pg_stat_database statistics. High number can be signal about lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). it can use a current existing stats mechanisms I afraid so isn't possible to assign waiting time to table - because it depends on order Huh? Order of what? Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something. now, when I am thinking about it, lock_time is not good name - maybe waiting lock time (lock time should not be interesting, waiting is interesting) - it can be divided to some more categories - in GoodData we use Heavyweight, pages, and others categories. So do you see this somehow encompassing locks other than heavyweight locks? Because I think that's the biggest need here. Basically, something akin to TRACE_POSTGRESQL_LWLOCK_WAIT_START() that doesn't depend on dtrace. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: searching in array function - array_position
2015-01-16 18:37 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:16 AM, Pavel Stehule wrote: 2015-01-16 17:57 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 3:39 AM, Pavel Stehule wrote: I am proposing a simple function, that returns a position of element in array. Yes please! FUNCTION array_position(anyarray, anyelement) RETURNS int That won't work on a multi-dimensional array. Ideally it needs to accept a slice or an element and return the specifier for the slice. It is a question what the result should be - probably there can be a multidimensional variant, where the result will be an array

array_position([1,2,3],2) -- 2
array_position([[1,2],[2,3],[3,4]], [2,3]) -- 2 /* 2nd parameter should have N-1 dimensions of first parameter */

The problem with that is you can't actually use '2' to get [2,3] back:

select (array[[1,2,3],[4,5,6],[7,8,9]])[1] IS NULL;
 ?column?
----------
 t
(1 row)

yes, but when you are searching an array in an array you can use a full slice selection:

postgres=# select (ARRAY[[1,2],[4,5]])[1][1:2];
-- [1:2] should be a constant every time in this case, so it should not be returned
  array
---------
 {{1,2}}
(1 row)

I think the bigger problem here is we need something better than slices for handling subsets of arrays. Even if the function returned [2:2] it's still going to behave differently than it will in the non-array case because you won't be getting the expected number of dimensions back. :( you cannot return a slice and I don't propose it, although we can return a range type or an array of range types - but still we cannot use ranges for arrays.

array_position_md([1,2,3],2) -- [2]
array_position_md([[1,2],[2,3],[3,4]], 2) -- [2,1]

another question is how to solve more than one occurrence of one value - probably two sets of functions - first returns first occurrence of value, second returns set of occurrences Gee, if only we had some way to return multiple elements of something... 
;P In other words, I think all of these should actually return an array of positions. I think it's OK for someone that only cares about the first instance to just do [1]. there can be two functions - position - returns the first, and positions - returns all as an array -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
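To make the first-occurrence vs. all-occurrences semantics being discussed concrete, here is a toy C sketch over a plain int array. The names are made up for illustration — the proposed feature would of course be a polymorphic SQL function — but the sketch mimics SQL's default 1-based array subscripts:

```c
#include <stddef.h>

/* Toy model of the "position" variant: returns the 1-based position of the
 * first occurrence of target, or 0 if it is absent (SQL would return NULL). */
static int
array_position_first(const int *arr, size_t n, int target)
{
    for (size_t i = 0; i < n; i++)
        if (arr[i] == target)
            return (int) (i + 1);
    return 0;
}

/* Toy model of the "positions" variant: writes every 1-based position of
 * target into out[] and returns how many were found. */
static size_t
array_positions_all(const int *arr, size_t n, int target, int *out)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++)
        if (arr[i] == target)
            out[found++] = (int) (i + 1);
    return found;
}
```

With this split, Jim's suggestion amounts to implementing only the second function and letting callers take element [1] when they want the first hit.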
Re: [HACKERS] Parallel Seq Scan
On Wed, Jan 14, 2015 at 9:00 PM, Jim Nasby jim.na...@bluetreble.com wrote: Simply doing something like blkno % num_workers is going to cause imbalances, Yes. but trying to do this on a per-block basis seems like too much overhead. ...but no. Or at least, I doubt it. The cost of handing out blocks one at a time is that, for each block, a worker's got to grab a spinlock, increment and record the block number counter, and release the spinlock. Or, use an atomic add. Now, it's true that spinlock cycles and atomic ops can sometimes impose severe overhead, but you have to look at it as a percentage of the overall work being done. In this case, the backend has to read, pin, and lock the page and process every tuple on the page. Processing every tuple on the page may involve de-TOASTing the tuple (leading to many more page accesses), or evaluating a complex expression, or hitting CLOG to check visibility, but even if it doesn't, I think the amount of work that it takes to process all the tuples on the page will be far larger than the cost of one atomic increment operation per block. As mentioned downthread, a far bigger consideration is the I/O pattern we create. A sequential scan is so-called because it reads the relation sequentially. If we destroy that property, we will be more than slightly sad. It might be OK to do sequential scans of, say, each 1GB segment separately, but I'm pretty sure it would be a real bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ... What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size        prefetch_distance;
    Size        prefetch_increment;
    slock_t     mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block - next_scan_block < prefetch_distance and also whether next_prefetch_block < last_block. If both are true, it prefetches some number of additional blocks, as specified by prefetch_increment. Otherwise, it increments next_scan_block and scans the block corresponding to the old value. So in this way, the prefetching runs ahead of the scan by a configurable amount (prefetch_distance), which should be chosen so that the prefetches have time to complete before the scan actually reaches those blocks. Right now, of course, we rely on the operating system to prefetch for sequential scans, but I have a strong hunch that may not work on all systems if there are multiple processes doing the reads. Now, what of other strategies like dividing up the relation into 1GB chunks and reading each one in a separate process? We could certainly DO that, but what advantage does it have over this? The only benefit I can see is that you avoid accessing a data structure of the type shown above for every block, but that only matters if that cost is material, and I tend to think it won't be. On the flip side, it means that the granularity for dividing up work between processes is now very coarse - when there are less than 6GB of data left in a relation, at most 6 processes can work on it. That might be OK if the data is being read in from disk anyway, but it's certainly not the best we can do when the data is in memory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
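A minimal single-process model of the handout scheme Robert describes might look like the following. The names follow his pseudo-struct, but the spinlock, the dynamic shared memory, and the actual prefetch call are omitted or stubbed — this only demonstrates the pointer arithmetic that keeps the prefetch window ahead of the scan:

```c
typedef unsigned int BlockNumber;

/* Single-process stand-in for the shared state; the real version would
 * live in dynamic shared memory and be updated under a spinlock. */
struct scan_state
{
    BlockNumber last_block;         /* one past the final block to scan */
    BlockNumber prefetch_distance;  /* how far ahead prefetch may run */
    BlockNumber prefetch_increment; /* blocks prefetched per top-up */
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

/* Claim the next block to scan, first topping up the prefetch window
 * if it has fallen within prefetch_distance of the scan position. */
static BlockNumber
claim_next_block(struct scan_state *st)
{
    if (st->next_prefetch_block - st->next_scan_block < st->prefetch_distance &&
        st->next_prefetch_block < st->last_block)
    {
        BlockNumber stop = st->next_prefetch_block + st->prefetch_increment;

        if (stop > st->last_block)
            stop = st->last_block;
        /* issue_prefetch(st->next_prefetch_block, stop) would go here */
        st->next_prefetch_block = stop;
    }
    return st->next_scan_block++;
}
```

Note that every worker still receives strictly sequential block numbers, which is the property Robert argues the I/O pattern must preserve.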
Re: [HACKERS] proposal: lock_time for pg_stat_database
2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:35 AM, Pavel Stehule wrote: 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago, I proposed a lock time measurement related to query. A main issue was a method, how to show this information. Today proposal is little bit simpler, but still useful. We can show a total lock time per database in pg_stat_database statistics. High number can be signal about lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). it can use a current existing stats mechanisms I afraid so isn't possible to assign waiting time to table - because it depends on order Huh? Order of what? when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is locked for t2 -- but if t2 > t1 then t2 is not important -- so what do I have to count as lock time for T1 and T2? DDL statements are exception - there is almost simple mapping between relations and lock time reason. Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something. now, when I am thinking about it, lock_time is not good name - maybe waiting lock time (lock time should not be interesting, waiting is interesting) - it can be divided to some more categories - in GoodData we use Heavyweight, pages, and others categories. So do you see this somehow encompassing locks other than heavyweight locks? Because I think that's the biggest need here. 
For these global statistics I see as important a common total waiting time for locks - we can use a more detailed granularity but I am not sure, if a common statistics are best tool. My motivations is - look to statistics -- and I can see ... lot of rollbacks -- issue, lot of deadlocks -- issue, lot of waiting time -- issue too. It is tool for people without possibility to use dtrace and similar tools and for everyday usage - simple check if locks are not a issue (or if locking is stable). -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] infinite loop in _bt_getstackbuf
On Thu, Jan 15, 2015 at 5:46 PM, Peter Geoghegan p...@heroku.com wrote: I think that it might be a good idea to have circular _bt_moveright() moves (the direct offender in Merlin's case, which has very similar logic to your _bt_getstackbuf() problem case) detected. I'm pretty sure that it's exceptional for there to be more than 2 or 3 retries in _bt_moveright(). It would probably be fine to consider the possibility that we'll never finish once we get past 5 retries or something like that. We'd then start keeping track of blocks visited, and raise an error when a page was visited a second time. Yeah, I could go for that. Possibly somebody might object that it's a lot of code that will never get tested in normal operation, but as this problem doesn't seem to be strictly theoretical I'm not sure I subscribe to that objection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
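The retry-threshold-plus-visited-set idea Peter sketches could look roughly like this, written as standalone C rather than real nbtree code. The names and the threshold value are illustrative only, and the real version would ereport(ERROR) on a repeated block instead of returning a flag:

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int BlockNumber;

#define MOVERIGHT_RETRY_THRESHOLD 5     /* retries assumed legitimate */
#define MOVERIGHT_MAX_TRACKED     1024  /* cap on remembered blocks */

/* Per-descent state: only allocated work happens after the threshold,
 * so the common case (0-3 right-moves) pays almost nothing. */
struct moveright_check
{
    unsigned    retries;
    size_t      nvisited;
    BlockNumber visited[MOVERIGHT_MAX_TRACKED];
};

/* Returns true if it is OK to keep moving right to blkno; false means a
 * circular right-link chain was detected (real code would ERROR here). */
static bool
moveright_ok(struct moveright_check *chk, BlockNumber blkno)
{
    if (chk->retries++ < MOVERIGHT_RETRY_THRESHOLD)
        return true;            /* normal case: a handful of retries */

    for (size_t i = 0; i < chk->nvisited; i++)
        if (chk->visited[i] == blkno)
            return false;       /* visited this block before: a cycle */

    if (chk->nvisited < MOVERIGHT_MAX_TRACKED)
        chk->visited[chk->nvisited++] = blkno;
    return true;
}
```

Because tracking only starts past the threshold, Robert's concern about code that never runs in normal operation is narrowed to the slow path that is already suspect.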
Re: [HACKERS] proposal: lock_time for pg_stat_database
2015-01-16 19:24 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com: 2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:35 AM, Pavel Stehule wrote: 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com mailto: jim.na...@bluetreble.com: On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago, I proposed a lock time measurement related to query. A main issue was a method, how to show this information. Today proposal is little bit simpler, but still useful. We can show a total lock time per database in pg_stat_database statistics. High number can be signal about lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). it can use a current existing stats mechanisms I afraid so isn't possible to assign waiting time to table - because it depends on order Huh? Order of what? when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is locked for t2 -- but if t2 t1 then t2 is not important -- so what I have to cont as lock time for T1 and T2? DDL statements are exception - there is almost simple mapping between relations and lock time reason. Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something. now, when I am thinking about it, lock_time is not good name - maybe waiting lock time (lock time should not be interesting, waiting is interesting) - it can be divided to some more categories - in GoodData we use Heavyweight, pages, and others categories. So do you see this somehow encompassing locks other than heavyweight locks? Because I think that's the biggest need here. 
Basically, something akin to TRACE_POSTGRESQL_LWLOCK_WAIT_START() that doesn't depend on dtrace. For these global statistics I see as important a common total waiting time for locks - we can use a more detailed granularity but I am not sure, if a common statistics are best tool. My motivations is - look to statistics -- and I can see ... lot of rollbacks -- issue, lot of deadlocks -- issue, lot of waiting time -- issue too. It is tool for people without possibility to use dtrace and similar tools and for everyday usage - simple check if locks are not a issue (or if locking is stable). and this proposal has sense only for heavyweight locks - because others locks are everywhere -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote: On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure mmonc...@gmail.com wrote: Running this test on another set of hardware to verify -- if this turns out to be a false alarm, which it may very well be, I can only offer my apologies! I've never had a new drive fail like that, in that manner. I'll burn the other hardware in overnight and report back. huh -- well, possibly not. This is on a virtual machine attached to a SAN. It ran clean for several hours (this is 9.4 vanilla, asserts off, checksums on), then started having issues: Damn. Is there any chance you can package this somehow so that others can run it locally? It looks hard to find the actual bug here without adding instrumentation to postgres. FYI, a two hour burn-in on my workstation on 9.3 ran with no issues. An overnight run would probably be required to prove it, ruling out both hardware and pl/sh. If proven, it's possible we may be facing a regression, perhaps a serious one. ISTM the next step is to bisect the problem down over the weekend in order to narrow the search. If that doesn't turn up anything productive I'll look into taking other steps. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: lock_time for pg_stat_database
On 1/16/15 12:30 PM, Pavel Stehule wrote: 2015-01-16 19:24 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com mailto:pavel.steh...@gmail.com: 2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com mailto:jim.na...@bluetreble.com: On 1/16/15 11:35 AM, Pavel Stehule wrote: 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com mailto:jim.na...@bluetreble.com mailto:Jim.Nasby@bluetreble.__com mailto:jim.na...@bluetreble.com: On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago, I proposed a lock time measurement related to query. A main issue was a method, how to show this information. Today proposal is little bit simpler, but still useful. We can show a total lock time per database in pg_stat_database statistics. High number can be signal about lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). it can use a current existing stats mechanisms I afraid so isn't possible to assign waiting time to table - because it depends on order Huh? Order of what? when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is locked for t2 -- but if t2 t1 then t2 is not important -- so what I have to cont as lock time for T1 and T2? If that select is waiting on a lock on t2, then it's waiting on that lock on that table. It doesn't matter who else has the lock. Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something. 
now, when I am thinking about it, lock_time is not good name - maybe waiting lock time (lock time should not be interesting, waiting is interesting) - it can be divided to some more categories - in GoodData we use Heavyweight, pages, and others categories. So do you see this somehow encompassing locks other than heavyweight locks? Because I think that's the biggest need here. Basically, something akin to TRACE_POSTGRESQL_LWLOCK_WAIT___START() that doesn't depend on dtrace. For these global statistics I see as important a common total waiting time for locks - we can use a more detailed granularity but I am not sure, if a common statistics are best tool. Locks may be global, but what you're waiting for a lock on certainly isn't. It's almost always a lock either on a table or a row in a table. Of course this does mean you can't just blindly report that you're blocked on some XID; that doesn't tell anyone anything. My motivations is - look to statistics -- and I can see ... lot of rollbacks -- issue, lot of deadlocks -- issue, lot of waiting time -- issue too. It is tool for people without possibility to use dtrace and similar tools and for everyday usage - simple check if locks are not a issue (or if locking is stable). Meh. SELECT sum(state_change) FROM pg_stat_activity WHERE waiting is just about as useful. Or just turn on lock logging. If you really want to add it at the database level I'm not opposed (so long as it leaves the door open for more granular locking later), but I can't really get excited about it either. and this proposal has sense only for heavyweight locks - because others locks are everywhere So what if they're everywhere? Right now if you're spending a lot of time waiting for LWLocks you have no way to know what's going on unless you happen to have dtrace. 
Obviously we're not going to something like issue a stats update every time we attempt to acquire an LWLock, but that doesn't mean we can't keep some counters on the locks and periodically report that. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
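The counting idea Jim floats — track failed fast-path acquisitions rather than timing waits — can be sketched like this. A toy test-and-set flag stands in for an LWLock, and the struct and function names are invented for illustration; real code would use atomic operations and some way to feed the counters through the stats machinery:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy lock that counts contention instead of timing it. */
struct counted_lock
{
    bool     held;           /* test-and-set flag standing in for an LWLock */
    uint64_t acquires;       /* successful acquisitions */
    uint64_t fastpath_miss;  /* times the initial attempt found it held */
};

/* Try to take the lock once; on contention, bump the miss counter -
 * the dtrace-free analogue of TRACE_POSTGRESQL_LWLOCK_WAIT_START(). */
static bool
counted_lock_try(struct counted_lock *lk)
{
    if (lk->held)
    {
        lk->fastpath_miss++;  /* caller would now queue up and sleep */
        return false;
    }
    lk->held = true;
    lk->acquires++;
    return true;
}

static void
counted_lock_release(struct counted_lock *lk)
{
    lk->held = false;
}
```

A high miss/acquire ratio on a particular lock would point at the hotspot without the cost of reading the clock on every acquisition, which matches the objection that actually measuring time probably wouldn't make sense.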
Re: [HACKERS] proposal: row_to_array function
On 01/16/2015 12:22 PM, Pavel Stehule wrote: There are two possible transformations: row_to_array -- [[key1, value1],[key2, value2], ...] row_to_row_array -- [(key1, value1), (key2, value2), ... ] If we're going to go that route, I think it makes more sense to create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an array of that. ok I think we'd possibly be better off with simply returning a flat array, [key1, value1, ...] That's what the hstore(text[]) and json_object(text[]) functions accept, along with the 2D variant, if we want a precedent. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical Decoding follows timelines
On 2015-01-03 12:07:29 +0200, Heikki Linnakangas wrote:

@@ -941,6 +936,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
  * Force a disconnect, so that the decoding code doesn't need to care
  * about an eventual switch from running in recovery, to running in a
  * normal environment. Client code is expected to handle reconnects.
+ * This covers the race condition where we are promoted half way
+ * through starting up.
  */
 if (am_cascading_walsender && !RecoveryInProgress())
 {

We could exit recovery immediately after this check. Why is this check needed? I probably wrote that check, and I don't think it really is needed. I think that's a remnant of what the physical pendant used to do. I think this needs slightly more abstraction because the infrastructure is local to walsender.c - but logical decoding is also possible via SQL. I'm not yet sure how that should look like. It'd be awesome if in the course of that we could get rid of the nearly duplicated XLogRead() :( Simon, have you checked that this actually correctly follows timelines? Afaics the patch as is won't allow to start logical decoding on a standby. To allow logical decoding from clients I (apparently) wrote the following comment:

/*
 * TODO: We got to change that someday soon...
 *
 * There's basically three things missing to allow this:
 * 1) We need to be able to correctly and quickly identify the timeline a
 *    LSN belongs to
 * 2) We need to force hot_standby_feedback to be enabled at all times so
 *    the primary cannot remove rows we need.
 * 3) support dropping replication slots referring to a database, in
 *    dbase_redo. There can't be any active ones due to HS recovery
 *    conflicts, so that should be relatively easy.
 */
if (RecoveryInProgress())
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("logical decoding cannot be used while in recovery")));

You're implementing 1) here. 3) doesn't look very challenging. But 2) imo is rather more interesting/complex. 
I guess we'd have to force that streaming replication is used, that a physical replication slot is used and that hot_standby_feedback is enabled. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advance local xmin more aggressively
On 01/15/2015 09:57 PM, Robert Haas wrote: On Thu, Jan 15, 2015 at 3:08 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Dec 22, 2014 at 7:31 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Here's an updated version, rebased over the pairing heap code that I just committed, and fixing those bugs. So, are we reaching an outcome for the match happening here? Well, I still like using the existing ResourceOwner pointers to find the snapshots more than introducing a separate data structure just for that. But I'm OK with Heikki committing his version and, if anybody notices the new code becoming a hotspot, we can revisit the issue. Ok, committed. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure mmonc...@gmail.com wrote: ISTM the next step is to bisect the problem down over the weekend in order to to narrow the search. If that doesn't turn up anything productive I'll look into taking other steps. That might be the quickest way to do it, provided you can isolate the bug fairly reliably. It might be a bit tricky to write a shell script that assumes a certain amount of time having passed without the bug tripping indicates that it doesn't exist, and have that work consistently. I'm slightly concerned that you'll hit other bugs that have since been fixed, given the large number of possible symptoms here. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fillfactor for GIN indexes
On Sat, Jan 17, 2015 at 2:40 AM, Robert Haas robertmh...@gmail.com wrote: There's only value in adding a fillfactor parameter to GIN indexes if it improves performance. There are no benchmarks showing it does. So, why are we still talking about this? Indeed. There is no such benchmark as of now, and I am not sure I'll get the time to work more on that soon, so let's reject the patch for now. And sorry for the useless noise. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] explain sortorder
Timmer, Marius marius.tim...@uni-muenster.de writes: attached is version 8, fixing remaining issues, adding docs and tests as requested/agreed. I've committed this with some rather substantial alterations, notably: * Having get_sortorder_by_keyno dig into the plan state node by itself seemed like a bad idea; it's certainly at variance with the existing division of knowledge in this code, and I think it might outright do the wrong thing if there's a Sort node underneath an Agg or Group node (since in those cases the child Sort node, not the parent node, would get passed in). I fixed it so that show_sort_keys and siblings are responsible for extracting and passing in the correct data arrays. * There were some minor bugs in the rules for when to print NULLS FIRST/LAST too. I think the way I've got it now is a precise inverse of what addTargetToSortList() will do. * The proposed new regression test cases were not portable (en_US isn't guaranteed to exist), and I thought adding a new regression script file for just one test was a bit excessive. The DESC and USING behaviors were already covered by existing test cases so I just added something to exercise COLLATE and NULLS FIRST in collate.sql. * I took out the change in perform.sgml. The change as proposed was seriously confusing because it injected off-topic discussion into an example that was meant to be just about the additional information printed by EXPLAIN ANALYZE. I'm not really sure that this feature needs any specific documentation (other than its eventual mention in the release notes); but if it does, we should probably add a brand new example someplace before the EXPLAIN ANALYZE subsection. * Assorted cleanups such as removal of irrelevant whitespace changes. That sort of thing just makes a reviewer's job harder, so it's best to avoid it if you can. 
regards, tom lane
[HACKERS] n_live_tup smaller than the number of rows in a table
Hi, We are seeing a strange behavior where n_live_tup is way smaller than the number of rows in a table. The table has 18m rows, but n_live_tup only has 100K. We tried to do vacuum analyze to clear up any sticky errors, but it didn’t correct the problem. We are running Postgres 9.2. Any pointers on how we could debug this problem and how to correct the stats? Thanks, Lisa
Re: [HACKERS] hung backends stuck in spinlock heavy endless loop
On Fri, Jan 16, 2015 at 6:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: It looks very much like that a page has for some reason been moved to a different block number. And that's exactly what Peter found out in his investigation too; an index page was mysteriously copied to a different block with identical content. What I found suspicious about that was that the spuriously identical pages were not physically adjacent, but logically adjacent (i.e. the bad page was considered the B-Tree right link of the good page by the good, spuriously-copied-by-bad page). It also seems likely that that small catalog index on pg_class(oid) was well cached in shared_buffers. So I agree that it's unlikely that this is actually a hardware or filesystem problem. Beyond that, if I had to guess, I'd say that the problem is more likely to be in the B-Tree code than it is in the buffer manager or whatever (so the logically adjacent thing is probably not an artifact of the order that the pages were accessed, since it appears there was a downlink to the bad page. This downlink was not added recently. Also, this logical adjacency is unlikely to be mere coincidence - Postgres seemed to fairly consistently break this way). Does anyone have a better developed sense of where the ultimate problem here is than I do? I guess I've never thought too much about how the system fails when a catalog index is this thoroughly broken. -- Peter Geoghegan
Re: [HACKERS] n_live_tup smaller than the number of rows in a table
Lisa Guo l...@fb.com writes: We are seeing a strange behavior where n_live_tup is way smaller than the number of rows in a table. The table has 18m rows, but n_live_tup only has 100K. We tried to do vacuum analyze to clear up any sticky errors, but it didn't correct the problem. We are running Postgres 9.2. Any pointers on how we could debug this problem and how to correct the stats? n_live_tup is a moving average over the last few observations, so in theory it should get better if you repeat ANALYZE several times. AFAIR, VACUUM isn't likely to help much. (VACUUM FREEZE might though.) It seems odd that you have a value that's so far off ... have you been using this table in any unusual way? regards, tom lane
Re: [HACKERS] n_live_tup smaller than the number of rows in a table
The server did crash yesterday. It has many schemas that have the same table (different shards). Operations done on these tables are very similar, but only a few of them actually have this problem. Other than this error, we do not see other abnormalities on the box. Lisa From: Tom Lane t...@sss.pgh.pa.us Date: Friday, January 16, 2015 at 4:23 PM To: l...@fb.com Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] n_live_tup smaller than the number of rows in a table Lisa Guo l...@fb.com writes: We are seeing a strange behavior where n_live_tup is way smaller than the number of rows in a table. The table has 18m rows, but n_live_tup only has 100K. We tried to do vacuum analyze to clear up any sticky errors, but it didn't correct the problem. We are running Postgres 9.2. Any pointers on how we could debug this problem and how to correct the stats? n_live_tup is a moving average over the last few observations, so in theory it should get better if you repeat ANALYZE several times. AFAIR, VACUUM isn't likely to help much. (VACUUM FREEZE might though.) It seems odd that you have a value that's so far off ... have you been using this table in any unusual way? regards, tom lane
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
On Fri, Jan 16, 2015 at 9:55 PM, Sawada Masahiko sawada.m...@gmail.com wrote: On Fri, Jan 16, 2015 at 12:54 PM, Amit Kapila amit.kapil...@gmail.com wrote: I don't know how appealing it would be to others, but a new view like pg_file_settings which would display the settings in file could be meaningful for your need. Another way is the user can do pg_reload_conf() to see the latest values (excluding values for server startup time parameters). I prefer the former, because the latter would not handle all GUC variables, as you said. A new view like pg_file_settings would show each GUC variable along with the config file that contains it. It would be a helpful view for things like ALTER SYSTEM ... RESET, and that command might be a beneficial feature for users who want to manage the configuration file manually, so I would like to propose them. Okay, but I think it would be better to start a new thread as this proposal seems to be different than your original idea. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Seq Scan
On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas robertmh...@gmail.com wrote: As mentioned downthread, a far bigger consideration is the I/O pattern we create. A sequential scan is so-called because it reads the relation sequentially. If we destroy that property, we will be more than slightly sad. It might be OK to do sequential scans of, say, each 1GB segment separately, but I'm pretty sure it would be a real bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ... What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size prefetch_distance;
    Size prefetch_increment;
    slock_t mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block - next_scan_block < prefetch_distance and also whether next_prefetch_block < last_block. If both are true, it prefetches some number of additional blocks, as specified by prefetch_increment. Otherwise, it increments next_scan_block and scans the block corresponding to the old value. Assuming we will increment next_prefetch_block only after prefetching blocks (equivalent to prefetch_increment), won't 2 workers be able to simultaneously see the same value for next_prefetch_block and try to perform prefetch for the same blocks? What will be the value of prefetch_increment? Will it be equal to prefetch_distance, or prefetch_distance/2, or prefetch_distance/4, or will it be totally unrelated to prefetch_distance? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
Sawada Masahiko wrote On Fri, Jan 16, 2015 at 12:54 PM, Amit Kapila amit.kapila16@ wrote: On Thu, Jan 15, 2015 at 9:48 PM, Sawada Masahiko sawada.mshk@ wrote: On Thu, Jan 15, 2015 at 2:02 PM, Amit Kapila amit.kapila16@ wrote: One thought I have in this line is that currently there doesn't seem to be a way to know if the setting has an entry both in postgresql.conf and postgresql.auto.conf, if we can have some way of knowing the same (pg_settings?), then it could be convenient for user to decide if the value in postgresql.auto.conf is useful or not and if it's not useful then use Alter System .. Reset command to remove the same from postgresql.auto.conf. I think one way is for pg_settings to carry the file name for each variable, but that would not reflect the current contents of postgresql.conf, so we would need to parse postgresql.conf again at that time. Yeah that could be a possibility, but I think that will break the existing command('s) as this is the common infrastructure used for SHOW .. commands as well which displays the guc value that is used by current session rather than the value in postgresql.conf. You're right. pg_settings and the SHOW command use the value in the current session rather than the config file, so this might break that common infrastructure. Two changes solve this problem in what seems to be a clean way.
1) Upon each parsing of postgresql.conf we store all assigned variables somewhere 2) We display these assignments in a new pg_settings column named system_reset_val I would also extend this to include: a) upon each parsing of postgresql.auto.conf we store all assigned variables somewhere (maybe the same place as postgresql.conf and simply label the file source) b) add an alter_system_val field to show that value (or null) c) add a db_role_val to show the current value for the session via pg_db_role_setting c.1) add a db_role_id to show the named user that is being used for the db_role_val lookup The thinking for c.1 is that in situations with role hierarchies and SET ROLE usage it would be nice to be able to query what the connection role - the one used during variable lookup - is. I'm probably going overkill on this but there are not a lot of different sources, nor do they change frequently, so extending the pg_settings view to be more of a one-stop shop for this information seems to make sense. As it relates back to this thread, the desired merging ends up being done inside this view and at least gives the DBA a central location (well, along with pg_db_role_setting) to go and look at the configuration landscape for the system. David J.
Re: [HACKERS] Parallel Seq Scan
On Fri, Jan 16, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas robertmh...@gmail.com wrote: As mentioned downthread, a far bigger consideration is the I/O pattern we create. A sequential scan is so-called because it reads the relation sequentially. If we destroy that property, we will be more than slightly sad. It might be OK to do sequential scans of, say, each 1GB segment separately, but I'm pretty sure it would be a real bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ... What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size prefetch_distance;
    Size prefetch_increment;
    slock_t mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block - next_scan_block < prefetch_distance and also whether next_prefetch_block < last_block. If both are true, it prefetches some number of additional blocks, as specified by prefetch_increment. Otherwise, it increments next_scan_block and scans the block corresponding to the old value. Assuming we will increment next_prefetch_block only after prefetching blocks (equivalent to prefetch_increment), won't 2 workers be able to simultaneously see the same value for next_prefetch_block and try to perform prefetch for the same blocks? The idea is that you can only examine and modify next_prefetch_block or next_scan_block while holding the mutex. What will be the value of prefetch_increment? Will it be equal to prefetch_distance, or prefetch_distance/2, or prefetch_distance/4, or will it be totally unrelated to prefetch_distance? I dunno, that might take some experimentation. prefetch_distance/2 doesn't sound stupid.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston david.g.johns...@gmail.com wrote: You're right. pg_setting and SHOW command use value in current session rather than config file. It might break these common infrastructure. Two changes solve this problem in what seems to be a clean way. 1) Upon each parsing of postgresql.conf we store all assigned variables somewhere 2) We display these assignments in a new pg_settings column named system_reset_val I would also extend this to include: a) upon each parsing of postgresql.auto.conf we store all assigned variables somewhere (maybe the same place as postgresql.conf and simply label the file source) Do we need to perform this parsing whenever user queries pg_settings? I think it might lead to extra cycles of reading file when user won't even need it and as the code is shared with SHOW commands that could be slightly complicated. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
On Fri, Jan 16, 2015 at 10:08 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston david.g.johns...@gmail.com wrote: You're right. pg_setting and SHOW command use value in current session rather than config file. It might break these common infrastructure. Two changes solve this problem in what seems to be a clean way. 1) Upon each parsing of postgresql.conf we store all assigned variables somewhere 2) We display these assignments in a new pg_settings column named system_reset_val I would also extend this to include: a) upon each parsing of postgresql.auto.conf we store all assigned variables somewhere (maybe the same place as postgresql.conf and simply label the file source) Do we need to perform this parsing whenever user queries pg_settings? I think it might lead to extra cycles of reading file when user won't even need it and as the code is shared with SHOW commands that could be slightly complicated. There would be no parsing upon reading of pg_settings, only lookups. The existing parsing would simply have its values saved to the catalogs that will be involved in the underlying pg_setting view query. David J.
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
On Sat, Jan 17, 2015 at 10:41 AM, David Johnston david.g.johns...@gmail.com wrote: On Fri, Jan 16, 2015 at 10:08 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston david.g.johns...@gmail.com wrote: You're right. pg_setting and SHOW command use value in current session rather than config file. It might break these common infrastructure. Two changes solve this problem in what seems to be a clean way. 1) Upon each parsing of postgresql.conf we store all assigned variables somewhere 2) We display these assignments in a new pg_settings column named system_reset_val I would also extend this to include: a) upon each parsing of postgresql.auto.conf we store all assigned variables somewhere (maybe the same place as postgresql.conf and simply label the file source) Do we need to perform this parsing whenever user queries pg_settings? I think it might lead to extra cycles of reading file when user won't even need it and as the code is shared with SHOW commands that could be slightly complicated. There would be no parsing upon reading of pg_settings, only lookups. The existing parsing would simply have its values saved to the catalogs that will be involved in the underlying pg_setting view query. So are you telling that whenever we read, save the settings to some catalog (probably a new one)? Will that completely address the problem specified in this thread, as those values could probably be old (when last time server is started or at last SIGHUP time values)? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Merging postgresql.conf and postgresql.auto.conf
On Fri, Jan 16, 2015 at 10:19 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Jan 17, 2015 at 10:41 AM, David Johnston david.g.johns...@gmail.com wrote: On Fri, Jan 16, 2015 at 10:08 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Jan 17, 2015 at 10:02 AM, David G Johnston david.g.johns...@gmail.com wrote: You're right. pg_setting and SHOW command use value in current session rather than config file. It might break these common infrastructure. Two changes solve this problem in what seems to be a clean way. 1) Upon each parsing of postgresql.conf we store all assigned variables somewhere 2) We display these assignments in a new pg_settings column named system_reset_val I would also extend this to include: a) upon each parsing of postgresql.auto.conf we store all assigned variables somewhere (maybe the same place as postgresql.conf and simply label the file source) Do we need to perform this parsing whenever user queries pg_settings? I think it might lead to extra cycles of reading file when user won't even need it and as the code is shared with SHOW commands that could be slightly complicated. There would be no parsing upon reading of pg_settings, only lookups. The existing parsing would simply have its values saved to the catalogs that will be involved in the underlying pg_setting view query. So are you telling that whenever we read, save the settings to some catalog (probably a new one)? Will that completely address the problem specified in this thread, as those values could probably be old (when last time server is started or at last SIGHUP time values)? Cache invalidation is a hard problem to solve :) I am reasonably content with showing the user who has just updated their postgresql.conf file and boots/SIGHUPs the server to find that said value hasn't taken effect: their value is indeed sitting in postgresql.conf, but other parts of the system are preempting it from being the active value. David J.
Re: [HACKERS] proposal: row_to_array function
2015-01-16 22:35 GMT+01:00 Andrew Dunstan and...@dunslane.net: On 01/16/2015 12:22 PM, Pavel Stehule wrote: There are two possible transformations: row_to_array -- [[key1, value1],[key2, value2], ...] row_to_row_array -- [(key1, value1), (key2, value2), ... ] If we're going to go that route, I think it makes more sense to create an actual key/value type (ie: http://pgxn.org/dist/pair/doc/pair.html) and return an array of that. ok http://BlueTreble.com I think we'd possibly be better off with simply returning a flat array, [key1, value1, ...] That's what the hstore(text[]) and json_object(text[]) functions accept, along with the 2D variant, if we want a precedent. It can be one of the supported variants, but it should not be the only one, because we cannot simply iterate over it. Next possibility is to teach FOREACH to take key and value in one step. Regards Pavel cheers andrew
Re: [HACKERS] proposal: lock_time for pg_stat_database
2015-01-16 20:33 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 12:30 PM, Pavel Stehule wrote: 2015-01-16 19:24 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com: 2015-01-16 19:06 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:35 AM, Pavel Stehule wrote: 2015-01-16 18:23 GMT+01:00 Jim Nasby jim.na...@bluetreble.com: On 1/16/15 11:00 AM, Pavel Stehule wrote: Hi all, some time ago, I proposed a lock time measurement related to query. A main issue was a method, how to show this information. Today's proposal is a little bit simpler, but still useful. We can show a total lock time per database in pg_stat_database statistics. A high number can be a signal about lock issues. Would this not use the existing stats mechanisms? If so, couldn't we do this per table? (I realize that won't handle all cases; we'd still need a lock_time_other somewhere). it can use the current existing stats mechanisms I'm afraid it isn't possible to assign waiting time to a table - because it depends on order Huh? Order of what? when you have a SELECT FROM T1, T2 and T1 is locked for t1, and T2 is locked for t2 -- but if t2 < t1 then t2 is not important -- so what do I have to count as lock time for T1 and T2? If that select is waiting on a lock on t2, then it's waiting on that lock on that table. It doesn't matter who else has the lock. Also, what do you mean by 'lock'? Heavyweight? We already have some visibility there. What I wish we had was some way to know if we're spending a lot of time in a particular non-heavy lock. Actually measuring time probably wouldn't make sense but we might be able to count how often we fail initial acquisition or something.
now, when I am thinking about it, lock_time is not a good name - maybe waiting lock time (lock time should not be interesting, waiting is interesting) - it can be divided into some more categories - in GoodData we use Heavyweight, pages, and other categories. So do you see this somehow encompassing locks other than heavyweight locks? Because I think that's the biggest need here. Basically, something akin to TRACE_POSTGRESQL_LWLOCK_WAIT_START() that doesn't depend on dtrace. For these global statistics I see as important a common total waiting time for locks - we can use a more detailed granularity but I am not sure if common statistics are the best tool. Locks may be global, but what you're waiting for a lock on certainly isn't. It's almost always a lock either on a table or a row in a table. Of course this does mean you can't just blindly report that you're blocked on some XID; that doesn't tell anyone anything. My motivation is - look at the statistics -- and I can see ... lots of rollbacks -- issue, lots of deadlocks -- issue, lots of waiting time -- issue too. It is a tool for people without the possibility to use dtrace and similar tools and for everyday usage - a simple check whether locks are an issue (or whether locking is stable). Meh. SELECT sum(state_change) FROM pg_stat_activity WHERE waiting is just about as useful. Or just turn on lock logging. If you really want to add it at the database level I'm not opposed (so long as it leaves the door open for more granular locking later), but I can't really get excited about it either. and this proposal makes sense only for heavyweight locks - because other locks are everywhere So what if they're everywhere? Right now if you're spending a lot of time waiting for LWLocks you have no way to know what's going on unless you happen to have dtrace.
Obviously we're not going to do something like issue a stats update every time we attempt to acquire an LWLock, but that doesn't mean we can't keep some counters on the locks and periodically report that. I have a plan to update statistics when all necessary keys are acquired - so it is once per statement - a similar load on the stats system as now. Pavel -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com