Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
On Mon, Jul 18, 2016 at 10:05 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Jan Wieck <j...@wi3ck.info> writes: > > In the meantime, would it be appropriate to backpatch the double linking > > of memory context children at this time? I believe it has had plenty of > > testing in the 9.6 cycle to be sure it didn't break anything. > > I'm concerned about the ABI breakage risk from changing a data structure > as fundamental as MemoryContext. Yeah, code outside utils/mmgr probably > *shouldn't* be looking at that struct, but that doesn't mean it isn't. > In the past we've generally only taken that sort of risk when there was > no other way to fix a bug; and this patch isn't a bug fix. While this > does help performance in some corner cases, I don't think it's enough of > an across-the-board win to justify the risk of back-patching. > I would consider mucking with the linked lists of memory context children inside of 3rd party code a really bad idea, but I concede. It isn't a bug fix and there is that small risk that somebody did precisely that, so no backpatch. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
On Mon, Jul 18, 2016 at 9:43 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Merlin Moncure <mmonc...@gmail.com> writes: > > BTW, while the fix does address the cleanup performance issue, it's > > still the case that anonymous code blocks burn up lots of resident > > memory (my 315k example I tested with ate around 8gb IIRC) when run > > like this. My question is, if the pl/pgsql code block is anonymous > > and not in some kind of a loop, why bother caching the plan at all? > > Nobody got around to it. Also, as you note, it's not as simple as > "don't cache if in a DO block". You'd need to track whether you were > inside any sort of looping construct. Depending on how difficult > that turned out to be, it might add overhead to regular functions > that we don't want. Agreed. And from the structures themselves it is not really easy to detect whether we are inside of a loop; the toplevel, while, for and if all use the same statement block and call exec_stmts(), which in turn calls exec_stmt() for each element in that list. It is not impossible to add a flag, set at PL compile time, to that element and check it every time the statement is executed. But such a change definitely needs more testing and probably won't qualify for backpatching. In the meantime, would it be appropriate to backpatch the double linking of memory context children at this time? I believe it has had plenty of testing in the 9.6 cycle to be sure it didn't break anything. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] One process per session lack of sharing
On Sun, Jul 17, 2016 at 9:28 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Sun, Jul 17, 2016 at 3:12 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > Robert Haas <robertmh...@gmail.com> writes: > >> On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <cr...@2ndquadrant.com> > wrote: > >>> I don't think anyone's considering moving from multi-processing to > >>> multi-threading in PostgreSQL. I really, really like the protection > that the > >>> shared-nothing-by-default process model gives us, among other things. > > > >> We get some very important protection by having the postmaster in a > >> separate address space from the user processes, but separating the > >> other backends from each other has no value. > > > > I do not accept that proposition in the least. For one thing, debugging > > becomes an order of magnitude harder when you've got multiple threads > > in the same address space: you have essentially zero guarantees about > > what one thread might have done to the supposedly-private state of > > another one. > > Well, that's true, in theory. In practice, random memory clobbers are > a pretty rare type of bug. The chances that thread A crashed because > thread B overwrote its supposedly-private state are just not very > high. Also, such bugs are extremely hard to troubleshoot even when > there is only one process and one thread involved, so it's not like things > are a rose garden today. I don't buy the argument that it's worth > giving up arbitrary amounts of performance and functionality for this. > Random memory clobbers are partly rare because they often aren't triggered. Many of them are of the dangling pointer type, and in a single threaded process there is less of a chance to allocate and overwrite the free'd and then used memory. But you are right, all of them are tricky to hunt and I remember using hardware watchpoints and whatnot in the past. They make your day though when you finally find them. 
So you will be sorry when the last one is hunted down. > > >> ... enough other people have >> written complex, long-running multithreaded programs that I think it >> is probably possible to do so without unduly compromising reliability. > > > > I would bet that every single successful project of that sort has been > > written with threading in mind from the get-go. Trying to retro-fit > > threading onto thirty years' worth of single-threaded coding is a recipe > > for breaking your project; even if you had control of all the code > running > > in the address space, which we assuredly do not. > > I admit that it is risky, but I think there are things that could be > done to limit the risk. I don't believe we can indefinitely continue > to ignore the potential performance benefits of making a switch like > this. Breaking a thirty-year old code base irretrievably would be > sad, but letting it fade into irrelevance because we're not willing to > make the architecture changes that are needed to remain relevant would > be sad, too. > I have to agree with Robert on that one. We were already "thinking" about multi-threading some 16 years ago. We were aware of the dangers and yet at least considered doing it some day for things like a parallel executor. And that would probably still be our best bang for the buck. The risks of jamming all sessions into a single, multi-threaded process are huge. Think of snapshot visibility together with catalog cache invalidations. I'd say no to that one as a first step. But multi-threading the executor or even certain utility commands at first should not be rejected purely on the notion that "we don't have multithreading today." 
Regards, Jan > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
BTW, here is the email thread about the double-linking MemoryContext children patch that Kevin eventually committed to master. https://www.postgresql.org/message-id/55F2D834.8040106%40wi3ck.info Regards, Jan On Sat, Jul 16, 2016 at 3:47 PM, Jan Wieck <j...@wi3ck.info> wrote: > > > On Tue, Jul 12, 2016 at 3:29 PM, Merlin Moncure <mmonc...@gmail.com> > wrote: > >> I've noticed that pl/pgsql functions/do commands do not behave well >> when the statement resolves and frees memory. To be clear: >> >> FOR i in 1..100 >> LOOP >> INSERT INTO foo VALUES (i); >> END LOOP; >> >> ...runs just fine while >> >> BEGIN >> INSERT INTO foo VALUES (1); >> INSERT INTO foo VALUES (2); >> ... >> INSERT INTO foo VALUES (100); >> END; >> > > This sounds very much like what led to > commit 25c539233044c235e97fd7c9dc600fb5f08fe065. > > It seems that patch was only applied to master and never backpatched to > 9.5 or earlier. > > > Regards, Jan > > > > >> >> (for the curious, create a script yourself via >> copy ( >> select >> 'do $$begin create temp table foo(i int);' >> union all select >> format('insert into foo values (%s);', i) from >> generate_series(1,100) i >> union all select 'raise notice ''abandon all hope!''; end; $$;' >> ) to '/tmp/breakit.sql'; >> >> ...which will consume amounts of resident memory proportional to the number >> of statements and eventually crash the server. The problem is obvious; >> each statement causes a plan to get created and the server gets stuck >> in a loop where SPI_freeplan() is called repeatedly. Everything is >> working as designed I guess, but when this happens it's really >> unpleasant: the query is uncancellable and unterminatable, nicht gut. >> A pg_ctl kill ABRT will do the trick but I was quite astonished >> to see linux take a few minutes to clean up the mess (!) on a somewhat >> pokey virtualized server with lots of memory. With even as few as >> ten thousand statements the cleanup time far exceeds the runtime of the >> statement block. 
>> >> I guess the key takeaway here is, "don't do that"; pl/pgsql >> aggressively generates plans and turns out to be a poor choice for >> bulk loading because of all the plan caching. Having said that, I >> can't help but wonder if there should be a (perhaps user configurable) >> limit to the number of SPI plans a single function call should be able >> to acquire, on the basis that you are going to smack into very poor >> behaviors in the memory subsystem. >> >> Stepping back, I can't help but wonder what the value of all the plan >> caching going on is at all for statement blocks. Loops might comprise >> a notable exception, noted. I'd humbly submit though that (relative >> to functions) it's much more likely to want to do something like >> insert a lot of statements and impossible to utilize any cached >> plans. >> >> This is not an academic gripe -- I just exploded production :-D. >> >> merlin >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> > > > > -- > Jan Wieck > Senior Postgres Architect > -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] A Modest Upgrade Proposal
On Sun, Jul 17, 2016 at 2:08 PM, Jim Nasby <jim.na...@bluetreble.com> wrote: > On 7/13/16 2:06 PM, Joshua D. Drake wrote: > >> On 07/07/2016 01:01 PM, Robert Haas wrote: >> >> There was an unconference session on this topic at PGCon and quite a >>> number of people there stated that they found DDL to be an ease-of-use >>> feature and wanted to have it. >>> >> >> Yeah, I haven't met anyone yet that would like to have: >> >> select replicate_these_relations('['public']); >> >> vs: >> >> ALTER SCHEMA public ENABLE REPLICATION; >> >> (or something like that). >> > > I generally agree, but I think the more important question is "Why?". Is > it because DDL looks more like a sentence? Is it because arrays are a PITA? > Is it too hard to call functions? Once you get fine-grained enough to support replicating different sets of possibly overlapping objects/namespaces to different groups of recipients, the DDL approach becomes just as convoluted as calling functions and nobody will memorize the entire syntax. Jan > > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] One process per session lack of sharing
On Sun, Jul 17, 2016 at 3:23 AM, Craig Ringer <cr...@2ndquadrant.com> wrote: > > > Lots more could be shared, too. Cached plans, for example. > But the fact that PostgreSQL has transactional DDL complicates things like a shared plan cache and shared PL/pgSQL execution trees. Those things are much easier in a trivial database implementation, where an ALTER TABLE is just trampling over a running transaction. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
On Tue, Jul 12, 2016 at 3:29 PM, Merlin Moncure <mmonc...@gmail.com> wrote: > I've noticed that pl/pgsql functions/do commands do not behave well > when the statement resolves and frees memory. To be clear: > > FOR i in 1..100 > LOOP > INSERT INTO foo VALUES (i); > END LOOP; > > ...runs just fine while > > BEGIN > INSERT INTO foo VALUES (1); > INSERT INTO foo VALUES (2); > ... > INSERT INTO foo VALUES (100); > END; > This sounds very much like what led to commit 25c539233044c235e97fd7c9dc600fb5f08fe065. It seems that patch was only applied to master and never backpatched to 9.5 or earlier. Regards, Jan > > (for the curious, create a script yourself via > copy ( > select > 'do $$begin create temp table foo(i int);' > union all select > format('insert into foo values (%s);', i) from > generate_series(1,100) i > union all select 'raise notice ''abandon all hope!''; end; $$;' > ) to '/tmp/breakit.sql'; > > ...which will consume amounts of resident memory proportional to the number > of statements and eventually crash the server. The problem is obvious; > each statement causes a plan to get created and the server gets stuck > in a loop where SPI_freeplan() is called repeatedly. Everything is > working as designed I guess, but when this happens it's really > unpleasant: the query is uncancellable and unterminatable, nicht gut. > A pg_ctl kill ABRT will do the trick but I was quite astonished > to see linux take a few minutes to clean up the mess (!) on a somewhat > pokey virtualized server with lots of memory. With even as few as > ten thousand statements the cleanup time far exceeds the runtime of the > statement block. > > I guess the key takeaway here is, "don't do that"; pl/pgsql > aggressively generates plans and turns out to be a poor choice for > bulk loading because of all the plan caching. 
Having said that, I > can't help but wonder if there should be a (perhaps user configurable) > limit to the number of SPI plans a single function call should be able > to acquire, on the basis that you are going to smack into very poor > behaviors in the memory subsystem. > > Stepping back, I can't help but wonder what the value of all the plan > caching going on is at all for statement blocks. Loops might comprise > a notable exception, noted. I'd humbly submit though that (relative > to functions) it's much more likely to want to do something like > insert a lot of statements and impossible to utilize any cached > plans. > > This is not an academic gripe -- I just exploded production :-D. > > merlin > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Jan Wieck Senior Postgres Architect
Re: [HACKERS] WAL Re-Writes
On 01/27/2016 08:30 AM, Amit Kapila wrote: operation. Now the reason the OS couldn't find the corresponding block in memory is that, while closing the WAL file, we use POSIX_FADV_DONTNEED if wal_level is less than 'archive', which leads to this problem. So with this experiment, the conclusion is that though we can avoid re-writes of WAL data by doing exact writes, it could lead to a significant reduction in TPS. POSIX_FADV_DONTNEED isn't the only way those blocks would vanish from OS buffers. If I am not mistaken, we recycle WAL segments in a round robin fashion. In a properly configured system, where the reason for a checkpoint is usually "time" rather than "xlog", a recycled WAL file being written to has been closed and not touched for about a complete checkpoint_timeout or longer. You must have a really big amount of spare RAM in the machine to still find those blocks in memory. Basically we are talking about the active portion of your database, shared buffers, the sum of all process local memory and the complete pg_xlog directory content fitting into RAM. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/18/2015 10:47 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: Attached is a complete rework of the fix from scratch, based on Tom's suggestion. The code now maintains a double linked list as suggested, but only uses it to mark all currently valid entries as invalid when hashvalue == 0. If a specific entry is invalidated it performs a hash lookup for maximum efficiency in that case. That does not work; you're ignoring the possibility of hashvalue collisions, and for that matter not considering that the hash value is not equal to the hash key. I don't think your code is invalidating the right entry at all during a foreign key constraint update, and it certainly cannot do the right thing if there's a hash value collision. Attached is something closer to what I was envisioning; can you do performance testing on it? Sorry for the delay. Yes, that patch also has the desired performance for restoring a schema with hundreds of thousands of foreign key constraints. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 12:02 PM, Jan Wieck wrote: On 09/15/2015 11:54 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: Would it be appropriate to use (Un)RegisterXactCallback() in case of detecting excessive sequential scanning of that cache and remove all invalid entries from it at that time? Another idea is that it's not the size of the cache that's the problem, it's the cost of finding the entries that need to be invalidated. So we could also consider adding list links that chain only the entries that are currently marked valid. If the list gets too long, we could mark them all invalid and thereby reset the list to nil. This doesn't do anything for the cache's space consumption, but that wasn't what you were worried about anyway was it? That sounds like a workable solution to this edge case. I'll see how that works. Attached is a complete rework of the fix from scratch, based on Tom's suggestion. The code now maintains a double linked list as suggested, but only uses it to mark all currently valid entries as invalid when hashvalue == 0. If a specific entry is invalidated it performs a hash lookup for maximum efficiency in that case. This code does pass the regression tests with CLOBBER_CACHE_ALWAYS, does have the desired performance impact on restore of hundreds of thousands of foreign key constraints and does not show any negative impact on bulk loading of data with foreign keys. 
Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 61edde9..eaec737 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -125,6 +125,8 @@ typedef struct RI_ConstraintInfo
 										 * PK) */
 	Oid			ff_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (FK =
 										 * FK) */
+	struct RI_ConstraintInfo *prev_valid;	/* Previous valid entry */
+	struct RI_ConstraintInfo *next_valid;	/* Next valid entry */
 } RI_ConstraintInfo;
 
@@ -185,6 +187,7 @@ typedef struct RI_CompareHashEntry
 static HTAB *ri_constraint_cache = NULL;
 static HTAB *ri_query_cache = NULL;
 static HTAB *ri_compare_cache = NULL;
+static RI_ConstraintInfo *ri_constraint_cache_valid_list = NULL;
 
 /* --
@@ -2924,6 +2927,19 @@ ri_LoadConstraintInfo(Oid constraintOid)
 	ReleaseSysCache(tup);
 
+	/*
+	 * At the time a cache invalidation message is processed there may be
+	 * active references to the cache. Because of this we never remove entries
+	 * from the cache, but only mark them invalid. For efficient processing
+	 * of invalidation messages below we keep a double linked list of
+	 * currently valid entries.
+	 */
+	if (ri_constraint_cache_valid_list != NULL)
+		ri_constraint_cache_valid_list->prev_valid = riinfo;
+	riinfo->prev_valid = NULL;
+	riinfo->next_valid = ri_constraint_cache_valid_list;
+	ri_constraint_cache_valid_list = riinfo;
+
 	riinfo->valid = true;
 
 	return riinfo;
@@ -2940,17 +2956,55 @@ ri_LoadConstraintInfo(Oid constraintOid)
 static void
 InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
 {
-	HASH_SEQ_STATUS status;
-	RI_ConstraintInfo *hentry;
+	RI_ConstraintInfo *hentry;
+	RI_ConstraintInfo *hnext;
+	bool		found;
 
 	Assert(ri_constraint_cache != NULL);
 
-	hash_seq_init(&status, ri_constraint_cache);
-	while ((hentry = (RI_ConstraintInfo *) hash_seq_search(&status)) != NULL)
+	if (hashvalue == 0)
 	{
-		if (hentry->valid &&
-			(hashvalue == 0 || hentry->oidHashValue == hashvalue))
+		/*
+		 * Set all valid entries in the cache to invalid and clear the
+		 * list of valid entries.
+		 */
+		hnext = ri_constraint_cache_valid_list;
+		while (hnext)
+		{
+			hentry = hnext;
+			hnext = hnext->next_valid;
+
+			Assert(hentry->valid);
+
+			hentry->prev_valid = NULL;
+			hentry->next_valid = NULL;
 			hentry->valid = false;
+		}
+		ri_constraint_cache_valid_list = NULL;
+	}
+	else
+	{
+		/*
+		 * Search for the specified entry and set that one invalid
+		 * and remove it from the list of valid entries.
+		 */
+		hentry = (RI_ConstraintInfo *) hash_search(ri_constraint_cache,
												   (void *) &hashvalue,
												   HASH_FIND, &found);
+		if (found && hentry->valid)
+		{
+			Assert(hentry->oidHashValue == hashvalue);
+
+			if (hentry->prev_valid != NULL)
+				hentry->prev_valid->next_valid = hentry->next_valid;
+			else
+				ri_constraint_cache_valid_list = hentry->next_valid;
+			if (hentry->next_valid != NULL)
+				hentry->next_valid->prev_valid = hentry->prev_valid;
+			hentry->prev_valid = NULL;
+			hentry->next_valid = NULL;
+			hentry->valid = false;
+		}
+	}
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 11:54 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: Would it be appropriate to use (Un)RegisterXactCallback() in case of detecting excessive sequential scanning of that cache and remove all invalid entries from it at that time? Another idea is that it's not the size of the cache that's the problem, it's the cost of finding the entries that need to be invalidated. So we could also consider adding list links that chain only the entries that are currently marked valid. If the list gets too long, we could mark them all invalid and thereby reset the list to nil. This doesn't do anything for the cache's space consumption, but that wasn't what you were worried about anyway was it? That sounds like a workable solution to this edge case. I'll see how that works. Thanks, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 10:58 AM, Tom Lane wrote: I wrote: Jan Wieck <j...@wi3ck.info> writes: I'm able to reproduce that failure with CLOBBER_CACHE_ALWAYS and it definitely is caused by this commit. Do you want to back it out for the time being? Kevin is on vacation right now. I'll take a look today, and either fix it if I can find the problem quickly or back it out. The answer is "back it out", because this patch is fundamentally and irretrievably broken. You can't just clobber the hashtable like that, because there are very possibly active uses of hashtable entries in outer function call levels. Ok. Would it be appropriate to use (Un)RegisterXactCallback() in case of detecting excessive sequential scanning of that cache and remove all invalid entries from it at that time? Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/14/2015 09:56 AM, Tom Lane wrote: Kevin Grittner <kgri...@postgresql.org> writes: Fix an O(N^2) problem in foreign key references. Judging from the buildfarm, this patch is broken under CLOBBER_CACHE_ALWAYS. See friarbird's results in particular. I might be too quick to finger this patch, but nothing else lately has touched foreign-key behavior, and the foreign_key test is where things are going south. I'm able to reproduce that failure with CLOBBER_CACHE_ALWAYS and it definitely is caused by this commit. Do you want to back it out for the time being? Kevin is on vacation right now. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 09:51 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: On 09/14/2015 09:56 AM, Tom Lane wrote: Kevin Grittner <kgri...@postgresql.org> writes: Fix an O(N^2) problem in foreign key references. Judging from the buildfarm, this patch is broken under CLOBBER_CACHE_ALWAYS. See friarbird's results in particular. I might be too quick to finger this patch, but nothing else lately has touched foreign-key behavior, and the foreign_key test is where things are going south. I'm able to reproduce that failure with CLOBBER_CACHE_ALWAYS and it definitely is caused by this commit. Do you want to back it out for the time being? Kevin is on vacation right now. I'll take a look today, and either fix it if I can find the problem quickly or back it out. (If anyone else is already looking at it, happy to defer to you...) I am looking at it. Unfortunately I'm not getting any usable backtrace yet. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Double linking MemoryContext children
On 09/12/2015 11:35 AM, Kevin Grittner wrote: On the other hand, a grep indicates that there are two places that MemoryContextData.nextchild is set (and we therefore probably need to also set the new field), and Jan's proposed patch only changes one of them. If we do this, I think we need to change both places that are affected, so ResourceOwnerCreate() in resowner.c would need a line or two added. ResourceOwnerCreate() sets ResourceOwnerData.nextchild, not MemoryContextData.nextchild. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Double linking MemoryContext children
The attached script demonstrates an O(N^2) problem we recently became aware of. The script simply executes a large anonymous code block that doesn't do anything useful. Usage is ./t1.py [NUM_STMTS] | psql [DBNAME] NUM_STMTS defaults to 20,000. The code block executes rather fast, but the entire script runs for over a minute (at 20,000 statements) on a 2.66 GHz Xeon. The time is spent when all the prepared SPI statements are freed at the end of execution. All prepared SPI statements are children of the Cache context. MemoryContext children are a single linked list where new members are inserted at the head. This works best when children are created and destroyed in a last-in-first-out pattern. SPI however destroys the SPI statements in the order they were created, forcing MemoryContextSetParent() to traverse the entire linked list for each child. The attached patch makes MemoryContext children a double linked list that no longer needs any list traversal to find the position of the child within the list. 
Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 705f3ef..d1a2e02 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -331,21 +331,13 @@ MemoryContextSetParent(MemoryContext context, MemoryContext new_parent)
 	{
 		MemoryContext parent = context->parent;
 
-		if (context == parent->firstchild)
-			parent->firstchild = context->nextchild;
+		if (context->prevchild != NULL)
+			context->prevchild->nextchild = context->nextchild;
 		else
-		{
-			MemoryContext child;
-
-			for (child = parent->firstchild; child; child = child->nextchild)
-			{
-				if (context == child->nextchild)
-				{
-					child->nextchild = context->nextchild;
-					break;
-				}
-			}
-		}
+			parent->firstchild = context->nextchild;
+
+		if (context->nextchild != NULL)
+			context->nextchild->prevchild = context->prevchild;
 	}
 
 	/* And relink */
@@ -353,12 +345,16 @@ MemoryContextSetParent(MemoryContext context, MemoryContext new_parent)
 	{
 		AssertArg(MemoryContextIsValid(new_parent));
 		context->parent = new_parent;
+		context->prevchild = NULL;
 		context->nextchild = new_parent->firstchild;
+		if (new_parent->firstchild != NULL)
+			new_parent->firstchild->prevchild = context;
 		new_parent->firstchild = context;
 	}
 	else
 	{
 		context->parent = NULL;
+		context->prevchild = NULL;
 		context->nextchild = NULL;
 	}
 }
@@ -714,6 +710,7 @@ MemoryContextCreate(NodeTag tag, Size size,
 	node->methods = methods;
 	node->parent = NULL;		/* for the moment */
 	node->firstchild = NULL;
+	node->prevchild = NULL;
 	node->nextchild = NULL;
 	node->isReset = true;
 	node->name = ((char *) node) + size;
@@ -728,6 +725,8 @@ MemoryContextCreate(NodeTag tag, Size size,
 	{
 		node->parent = parent;
 		node->nextchild = parent->firstchild;
+		if (parent->firstchild != NULL)
+			parent->firstchild->prevchild = node;
 		parent->firstchild = node;
 		/* inherit allowInCritSection flag from parent */
 		node->allowInCritSection = parent->allowInCritSection;
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index 577ab9c..bbb58bd 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
 	MemoryContextMethods *methods;	/* virtual function table */
 	MemoryContext parent;		/* NULL if no parent (toplevel context) */
 	MemoryContext firstchild;	/* head of linked list of children */
+	MemoryContext prevchild;	/* previous child of same parent */
 	MemoryContext nextchild;	/* next child of same parent */
 	char	   *name;			/* context name (just for debugging) */
 	MemoryContextCallback *reset_cbs;	/* list of reset/delete callbacks */

#!/usr/bin/env python
import sys

def main(argv):
    if len(argv) == 0:
        num_stmts = 20000
    else:
        num_stmts = int(argv[0])

    print "\\timing"
    print "DO $$"
    print "DECLARE"
    print "ts1 timestamp;"
    print "ts2 timestamp;"
    print "dummy integer;"
    print "BEGIN"
    print "SELECT timeofday() INTO ts1;"
    print "RAISE NOTICE 'start: %', ts1;"
    for i in range(0, num_stmts):
        print "SELECT %d INTO dummy;" % i
    print "SELECT timeofday() INTO ts2;"
    print "RAISE NOTICE 'end: %', ts2;"
    print "RAISE NOTICE 'duration: %', (ts2 - ts1);"
    print "END;"
    print "$$;"

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Double linking MemoryContext children
On 09/11/2015 09:38 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: The attached script demonstrates an O(N^2) problem we recently became aware of. The script simply executes a large anonymous code block that doesn't do anything useful. Usage is ./t1.py [NUM_STMTS] | psql [DBNAME] NUM_STMTS defaults to 20,000. The code block executes rather fast, but the entire script runs for over a minute (at 20,000 statements) on a 2.66 GHz Xeon. The time is spent when all the prepared SPI statements are freed at the end of execution. All prepared SPI statements are children of the Cache context. MemoryContext children are a single linked list where new members are inserted at the head. This works best when children are created and destroyed in a last-in-first-out pattern. SPI however destroys the SPI statements in the order they were created, forcing MemoryContextSetParent() to traverse the entire linked list for each child. The attached patch makes MemoryContext children a double linked list that no longer needs any list traversal no find the position of the child within the list. Seems less invasive to fix SPI to delete in the opposite order? That would assume that there are no other code paths that destroy memory contexts out of LIFO order. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Double linking MemoryContext children
On 09/11/2015 09:58 AM, Tom Lane wrote:
> Jan Wieck <j...@wi3ck.info> writes:
>> On 09/11/2015 09:38 AM, Tom Lane wrote:
>>> Seems less invasive to fix SPI to delete in the opposite order?
>> That would assume that there are no other code paths that destroy memory contexts out of LIFO order.
> Given that we've not seen this sort of problem reported in the dozen or more years the code has been like that, I'm a bit dubious that we need to worry about that. It's possible you're right that we need to expend more space in the MemoryContext lists --- but I'd like to see more than one example.

I added a bit of debug code to MemoryContextSetParent(), tracking loop iterations per context name. This is what happens during "make check" of current master:

    name                  | count | not_first | iterations
    ----------------------+-------+-----------+-----------
    CacheMemoryContext    |  6271 |      2634 |     104846
    ExecutorState         |    90 |      2951 |       7176
    TopMemoryContext      | 28464 |       620 |       2716
    SPI Proc              |  8424 |       225 |        381
    PortalMemory          | 24616 |        31 |        291
    TopTransactionContext | 19312 |       114 |        141
    ExprContext           |  3317 |         1 |          1

There was a total of 6,271 calls to MemoryContextSetParent() where the old parent is the CacheMemoryContext. For 2,634 of those the context is not parent->firstchild and this led to 104,846 loop iterations in total. The remaining numbers indicate that other contexts are mostly used in the intended fashion, but not strictly. This means there is definitely potential for more edge cases, similar to the SPI example above.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Small patch to fix an O(N^2) problem in foreign keys
On 09/08/2015 08:49 AM, Kevin Grittner wrote:
> Jan Wieck <j...@wi3ck.info> wrote:
>> The problem is a cache introduced in commit 45ba4247 that improves
> That's a bit off; 45ba424f seems to be what you mean.

Copy paste over paper.

>> foreign key lookups during bulk updates when the FK value does not change. When restoring a schema dump from a database with many (say 100,000) foreign keys, this cache is growing very big and every ALTER TABLE command is causing an InvalidateConstraintCacheCallBack(), which does a sequential hash table scan. The patch uses a heuristic method of detecting when the hash table should be destroyed and recreated. InvalidateConstraintCacheCallBack() adds the current size of the hash table to a counter. When that sum reaches 1,000,000, the hash table is flushed. This improves the schema restore of a database with 100,000 foreign keys by a factor of 3. According to my tests the patch does not interfere with the bulk updates the original feature was supposed to improve.
> In the real-world customer case that caused you to look into this, I thought 45ba424f drove schema-only restore time from 2 hours to about 25 hours, and that this patch takes it back down to 2 hours. Am I remembering right? And this came about because it added 20-some hours to a pg_upgrade run?

From 2 hours to >50, but yes, this is that case.

If there are no objections, I will push this as a bug fix back to 9.3, where the performance regression was introduced.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
[HACKERS] Small patch to fix an O(N^2) problem in foreign keys
Attached is a small patch and a script to reproduce the issue.

The problem is a cache introduced in commit 45ba4247 that improves foreign key lookups during bulk updates when the FK value does not change. When restoring a schema dump from a database with many (say 100,000) foreign keys, this cache is growing very big and every ALTER TABLE command is causing an InvalidateConstraintCacheCallBack(), which does a sequential hash table scan.

The patch uses a heuristic method of detecting when the hash table should be destroyed and recreated. InvalidateConstraintCacheCallBack() adds the current size of the hash table to a counter. When that sum reaches 1,000,000, the hash table is flushed. This improves the schema restore of a database with 100,000 foreign keys by a factor of 3. According to my tests the patch does not interfere with the bulk updates the original feature was supposed to improve.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 61edde9..d7023ce 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -183,6 +183,7 @@ typedef struct RI_CompareHashEntry
  * ----------
  */
 static HTAB *ri_constraint_cache = NULL;
+static long ri_constraint_cache_seq_count = 0;
 static HTAB *ri_query_cache = NULL;
 static HTAB *ri_compare_cache = NULL;
@@ -2945,6 +2946,27 @@ InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
 	Assert(ri_constraint_cache != NULL);
 
+	/*
+	 * Prevent an O(N^2) problem when creating large amounts of foreign
+	 * key constraints with ALTER TABLE, like it happens at the end of
+	 * a pg_dump with hundred-thousands of tables having references.
+	 */
+	ri_constraint_cache_seq_count += hash_get_num_entries(ri_constraint_cache);
+	if (ri_constraint_cache_seq_count > 1000000)
+	{
+		HASHCTL		ctl;
+
+		hash_destroy(ri_constraint_cache);
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(RI_ConstraintInfo);
+		ri_constraint_cache = hash_create("RI constraint cache",
+										  RI_INIT_CONSTRAINTHASHSIZE,
+										  &ctl, HASH_ELEM | HASH_BLOBS);
+		ri_constraint_cache_seq_count = 0;
+		return;
+	}
+
 	hash_seq_init(&status, ri_constraint_cache);
 	while ((hentry = (RI_ConstraintInfo *) hash_seq_search(&status)) != NULL)
 	{

t1.sh Description: application/shellscript
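To see why charging the current table size against a counter bounds the total invalidation work, here is a hedged Python sketch of the heuristic. The class name and structure are illustrative stand-ins for the C code (a dict plays the role of the dynahash table); the threshold mirrors the 1,000,000 counter limit described above:

```python
class ConstraintCache:
    """Sketch of the patch's flush heuristic for the RI constraint cache."""
    FLUSH_THRESHOLD = 1000000

    def __init__(self):
        self.table = {}       # stand-in for the dynahash table
        self.seq_count = 0    # ri_constraint_cache_seq_count
        self.flushes = 0

    def invalidate(self):
        # Each invalidation callback would otherwise scan the whole table;
        # charge its current size against the counter.
        self.seq_count += len(self.table)
        if self.seq_count > self.FLUSH_THRESHOLD:
            # hash_destroy + hash_create in the C code
            self.table.clear()
            self.seq_count = 0
            self.flushes += 1
            return
        # otherwise the (now cheap, because bounded) sequential scan
        # would invalidate matching entries here
```

Simulating a restore that creates 100,000 constraints, where each ALTER TABLE caches one entry and then fires the callback: the accumulated scan charge crosses 1,000,000 after 1,414 invalidations (1+2+...+1414 > 1,000,000), so the table is flushed on a regular cycle and never grows past about 1,400 entries, keeping the total scan work roughly linear instead of quadratic.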
Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets
The whole thing turns out to be based on wrong baseline data, taken with a pgbench client running from a remote machine. It all started out from an investigation against 9.3. Curiously enough, the s_lock() problem that existed in 9.3 has a very similar effect on throughput as a network bottleneck has on 9.5.

Sorry for the noise, Jan

On 06/10/2015 09:18 AM, Jan Wieck wrote:
> Hi, I think I may have found one of the problems PostgreSQL has on machines with many NUMA nodes. I am not yet sure what exactly happens on the NUMA bus, but there seems to be a tipping point at which the spinlock concurrency wreaks havoc and the performance of the database collapses.
> On a machine with 8 sockets, 64 cores, hyperthreaded to 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there.
> The attached patch demonstrates that less aggressive spinning and (much) more frequent delaying improves the performance on this type of machine. The 8 socket machine in question scales to over 350,000 TPS.
> The patch is meant to demonstrate this effect only. It has a negative performance impact on smaller machines and client counts < #cores, so the real solution will probably look much different. But I thought it would be good to share this and start the discussion about reevaluating the spinlock code before PGCon.
> Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] DBT-3 with SF=20 got failed
On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>> curious: what was work_mem set to?
> work_mem=48GB
> My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort and hash_aggregate may each allocate that much. You should set work_mem to a fraction of physical-RAM / concurrent-connections, depending on the complexity of your queries. 48GB does not sound reasonable.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
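The sizing rule above — a fraction of physical RAM divided by concurrent connections, scaled by query complexity — is simple arithmetic. The function below is a back-of-the-envelope sketch; its name, the 25% RAM fraction and the per-query node count are illustrative assumptions, not official PostgreSQL tuning advice:

```python
def suggest_work_mem_mb(phys_ram_mb, max_connections,
                        nodes_per_query=4, ram_fraction=0.25):
    """Divide a fraction of RAM by (connections x memory-hungry plan nodes),
    since each sort/hash node may allocate up to work_mem on its own."""
    return (phys_ram_mb * ram_fraction) / (max_connections * nodes_per_query)
```

For the 256GB machine discussed above with, say, 100 connections and four memory-hungry nodes per query, this suggests roughly 164MB per sort/hash node — three orders of magnitude below the 48GB that was configured.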
[HACKERS] s_lock() seems too aggressive for machines with many sockets
Hi,

I think I may have found one of the problems PostgreSQL has on machines with many NUMA nodes. I am not yet sure what exactly happens on the NUMA bus, but there seems to be a tipping point at which the spinlock concurrency wreaks havoc and the performance of the database collapses.

On a machine with 8 sockets, 64 cores, hyperthreaded to 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there.

The attached patch demonstrates that less aggressive spinning and (much) more frequent delaying improves the performance on this type of machine. The 8 socket machine in question scales to over 350,000 TPS.

The patch is meant to demonstrate this effect only. It has a negative performance impact on smaller machines and client counts < #cores, so the real solution will probably look much different. But I thought it would be good to share this and start the discussion about reevaluating the spinlock code before PGCon.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 0afcba1..edaa8fd 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -83,8 +83,8 @@ s_lock(volatile slock_t *lock, const char *file, int line)
  * the probability of unintended failure) than to fix the total time
  * spent.
  */
-#define MIN_SPINS_PER_DELAY 10
-#define MAX_SPINS_PER_DELAY 1000
+#define MIN_SPINS_PER_DELAY 4
+#define MAX_SPINS_PER_DELAY 10
 #define NUM_DELAYS			1000
 #define MIN_DELAY_USEC		1000L
 #define MAX_DELAY_USEC		1000000L
@@ -144,7 +144,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	{
 		/* we never had to delay */
 		if (spins_per_delay < MAX_SPINS_PER_DELAY)
-			spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
+			spins_per_delay = Min(spins_per_delay + 2, MAX_SPINS_PER_DELAY);
 	}
 	else
 	{
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c63cf54..42eb353 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -973,7 +973,7 @@ extern slock_t dummy_spinlock;
 extern int	s_lock(volatile slock_t *lock, const char *file, int line);
 
 /* Support for dynamic adjustment of spins_per_delay */
-#define DEFAULT_SPINS_PER_DELAY  100
+#define DEFAULT_SPINS_PER_DELAY  10
 
 extern void set_spins_per_delay(int shared_spins_per_delay);
 extern int	update_spins_per_delay(int shared_spins_per_delay);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
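For readers unfamiliar with the knob being tuned here, the following Python sketch models s_lock()'s spin/delay loop and its adaptive feedback on spins_per_delay. The constants are the stock pre-patch values; the simulation only models the control flow (the real code sleeps with pg_usleep and spins on a TAS macro):

```python
def s_lock_sim(tas_results, spins_per_delay, min_spins=10, max_spins=1000):
    """Model of s_lock(): spin until spins_per_delay failed TAS attempts,
    then "sleep" and reset. tas_results is a sequence of booleans, one per
    TAS attempt (True = lock acquired). Returns (delays_taken, new_spins_per_delay)."""
    spins = 0
    delays = 0
    for got_it in tas_results:
        if got_it:
            break
        spins += 1
        if spins >= spins_per_delay:
            delays += 1          # the real code calls pg_usleep() here
            spins = 0
    # Adaptive feedback, as in s_lock():
    if delays == 0:
        # never had to sleep: spin longer next time
        spins_per_delay = min(spins_per_delay + 100, max_spins)
    else:
        # had to sleep: favor sleeping sooner next time
        spins_per_delay = max(spins_per_delay - 1, min_spins)
    return delays, spins_per_delay
```

A lock obtained after 5 failed attempts causes no delay and raises spins_per_delay by 100, while one obtained after 250 failed attempts (at spins_per_delay = 100) sleeps twice and lowers it by one. The patch shrinks both the spin limits and the +100 step, shifting the balance sharply toward sleeping.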
Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets
On 06/10/2015 09:28 AM, Andres Freund wrote: On 2015-06-10 09:18:56 -0400, Jan Wieck wrote: On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there. 85k? Phew, that's pretty bad. What exact type of CPU is this? Which pgbench scale? Did you use -M prepared? model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz numactl --hardware shows the distance to the attached memory as 10, the distance to every other node as 21. I interpret that as the machine having one NUMA bus with all cpu packages attached to that, rather than individual connections from cpu to cpu or something different. pgbench scale=300, -Msimple. Could you share a call graph perf profile? I do not have them handy at the moment and the machine is in use for something else until tomorrow. I will forward perf and systemtap based graphs ASAP. What led me into that spinlock area was the fact that a wall clock based systemtap FlameGraph showed a large portion of the time spent in BufferPin() and BufferUnpin(). The attached patch demonstrates that less aggressive spinning and (much) more often delaying improves the performance on this type of machine. The 8 socket machine in question scales to over 350,000 TPS. Even that seems quite low. I've gotten over 500k TPS on a four socket x86 machine, and about 700k on a 8 socket x86 machine. There is more wrong with the machine in question than just that. But for the moment I am satisfied with having a machine where I can reproduce this phenomenon in what appears to be a worst case. Maybe we need to adjust the amount of spinning, but to me such drastic differences are a hint that we should tackle the actual contention point. Often a spinlock for something regularly heavily contended can be worse than a queued lock. 
I have the impression that the code assumes that there is little penalty for accessing the shared byte in a tight loop from any number of cores in parallel. That apparently is true for some architectures and core counts, but no longer holds for these machines with many sockets. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets
On 06/10/2015 10:07 AM, Nils Goroll wrote:
> On larger Linux machines, we have been running with spin locks replaced by generic posix mutexes for years now. I personally haven't looked at the code for ages, but we maintain a patch which pretty much does the same thing still. Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de
> I understand that there are systems out there which have less efficient posix mutex implementations than Linux (which uses futexes), but I think it would still be worth considering to do away with the roll-your-own spinlocks on systems whose posix mutexes are known to behave.

I have played with test code that isolates a stripped down version of s_lock() and uses it with multiple threads. I then implemented multiple different versions of that s_lock(). The results with 200 concurrent threads are that using a __sync_val_compare_and_swap() to acquire the lock and then falling back to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built-in seems identical in performance to the assembler xchgb operation used by PostgreSQL today on x86_64.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
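The "spin a few times, then sleep one millisecond" strategy measured above can be sketched in Python. This models only the control flow of that strategy — a non-blocking Lock.acquire stands in for the atomic test-and-set instruction, so it is not the actual micro-benchmark and says nothing about its throughput numbers:

```python
import threading
import time

class SpinThenSleepLock:
    """Sketch of the spin-then-usleep locking strategy: try a test-and-set
    a bounded number of times, then back off with a 1 ms sleep."""
    def __init__(self, spins=10, sleep_s=0.001):
        self._flag = threading.Lock()   # stand-in for the TAS/xchgb flag
        self.spins = spins
        self.sleep_s = sleep_s

    def acquire(self):
        while True:
            for _ in range(self.spins):
                if self._flag.acquire(blocking=False):  # the "TAS" attempt
                    return
            time.sleep(self.sleep_s)                    # like usleep(1000)

    def release(self):
        self._flag.release()
```

As a correctness check, four threads each performing 1,000 locked increments of a shared counter must end with exactly 4,000 — any lost update would indicate a broken mutual-exclusion protocol.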
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 02:15 PM, Jan Wieck wrote:
> On 02/05/2015 01:18 PM, Marko Tiikkaja wrote:
>>> pgcrypto bug
>> That doesn't look too good, but I can't reproduce it against 9.3.6 either.

Attached is an improved script and the final output from it. I ran it overnight and it did not reproduce the "pgcrypto bug" message, but instead the expected "Corrupt data" plus one that I haven't seen before, which is "Unsupported compression algorithm".

Jan

-- Jan Wieck Senior Software Engineer http://slony.info

pgcrypto_test2.sh Description: application/shellscript

select * from pgp_test2 order by 1;

 id | vdata | vkey | verrmsg
----+-------+------+---------
  1 | \xc30d040703027123de71fa32937175d23a01cc0377628c3b58119e4c8e51804b74cccf961c8b500b8a283db7084ed809833f9bfd7827b70cf06aa0254707e1c08b3db8419e6e2eda697637 | key_1   | ERROR: Wrong key or corrupt data
  2 | \xc30d040703027123de71fa32937175d23a01cc0377628c3b58119e4c8e51804b74cccf961c8b500b8a283db7084ed809833f9bfd7827b70cf06aa0254707e1c08b3db8419e6e2eda697637 | key_686 | ERROR: Not text data
  3 | \xc30d04070302b9688f5b4b1db6bb77d23a01ee1cad33770344503c564496fb6463d9b11a30013424fcae8b80cfbafba415ba3c7047aec7499b8e74254069b29390990c0f2a34740a7d2085 | key_972 | ERROR: Unsupported compression algorithm
  4 | \xc30d04070302c011ec7879402f4178d23a0165d2462bb4e6a7a204f8af20fc0cfadfc72f6a55902f013479697f316659bbc028ddfe624578e462aee3279b65f3e66f0993305d378ae35593 | key_816 | ERROR: Corrupt data
(4 rows)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Possible problem with pgcrypto
Hi,

I have encountered a small instability in the behavior of pgcrypto's pgp_sym_decrypt() function. Attached is a script that can reproduce the problem. It may have to be run repeatedly because the symptom occurs rather seldom.

What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different wrong passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:

    Not text data
    pgcrypto bug

This seems to be triggered by a combination of the random data included in the encrypted data as well as the wrong password, because for an instance of encrypted data only certain passwords cause this symptom.

I wonder if this may actually be a bug in pgcrypto or if this is an error inherent in the way the encrypted data is encoded. I.e. that the decryption algorithm cannot really figure out what is wrong and just sometimes gets a little further in the attempt to decrypt.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info

pgcrypto_test.sh Description: application/shellscript
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 10:58 AM, Tom Lane wrote:
> Jan Wieck <j...@wi3ck.info> writes:
>> I have encountered a small instability in the behavior of pgcrypto's pgp_sym_decrypt() function. Attached is a script that can reproduce the problem. It may have to be run repeatedly because the symptom occurs rather seldom. What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different wrong passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:
> Have you tested this with this week's releases? We fixed some memory-mishandling bugs in pgcrypto ...

The posted script reproduces the symptom in today's checkout of master as well as REL9_4_STABLE.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 01:18 PM, Marko Tiikkaja wrote:
> On 2/5/15 4:48 PM, Jan Wieck wrote:
>> What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different wrong passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:
>> Not text data
> That's not unexpected; the check for whether the data is text or not appears to happen quite early in the process of decoding. So it's enough to get to that point without anything being obviously broken.

I suspected something like that.

> In addition to the two errors above, it doesn't appear to be too difficult to see PXE_MBUF_SHORT_READ, which would give you "ERROR: Corrupt data".

I wonder why that error message is different, though. From reading the code as far as I did, I expected to see that, but haven't seen it yet.

>> pgcrypto bug
> That doesn't look too good, but I can't reproduce it against 9.3.6 either.

Let me improve the script to a point where it can run for a long time in the background and collect all different error cases as examples of encrypted data and wrong password.

Thanks so far.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
[HACKERS] Moving of INT64_FORMAT to c.h
Hi, PostgreSQL has for ages defined INT64_FORMAT and UINT64_FORMAT in pg_config.h. This commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ce486056ecd28050 moved those definitions to c.h, which breaks compilation of all released Slony-I versions against current master. Can those be moved back to where they used to be? Slony uses the definitions in external tools, like slon and slonik, to format sequence numbers in log output. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DDL Damage Assessment
On 10/02/2014 01:15 PM, Joe Conway wrote:
> On 10/02/2014 11:30 AM, Dimitri Fontaine wrote:
>> Questions: 1. Do you agree that a systematic way to report what a DDL command (or script, or transaction) is going to do on your production database is a feature we should provide to our growing user base?
> +1 I really like the idea and would find it useful/time-saving
>> 2. What do you think such a feature should look like?
> Elsewhere on this thread EXPLAIN was suggested. That makes a certain amount of sense. Maybe something like EXPLAIN IMPACT [...]
>> 3. Does it make sense to support the whole set of DDL commands from the get go (or ever) when most of them are only taking locks in their own pg_catalog entry anyway?
> Yes, I think it should cover all commands that can have an availability impact.

In principle I agree with the sentiment. However, that full coverage is a nice goal, seldom achieved. The real question is at what level of information, returned to the user, does this feature become user friendly? It is one thing to provide information of the kind of

    TAKE ACCESS EXCLUSIVE LOCK ON TABLE foo
    TEST EVERY ROW IN TABLE foo FOR FK (a, b) IN bar (id1, id2)

That information is useful, but only to an experienced DBA who knows their schema and data to a certain degree. The majority of users, I fear, will not be able to even remotely guesstimate if that will need seconds or hours. There needs to be more detailed information for those cases and I believe that tackling them one at a time in depth will lead to more useful results than trying to cover a lot but shallow.

My $.02

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pg_receivexlog and replication slots
On 10/01/2014 08:59 AM, Andres Freund wrote:
> On 2014-10-01 21:54:56 +0900, Michael Paquier wrote:
>> Thanks! On Wed, Oct 1, 2014 at 8:38 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>>> 0001) Old WIP patch for pg_recvlogical tests. Useful for easier development.
>> From where is this patch? Is it some rest from the time when pg_recvlogical was developed?
>>> 0002) Copy editing that should be in 9.4
>> Definitely makes sense for a backpatch.
>>> 0003) Rebased patches of yours
>>> 0004) My changes to 0003 besides the rebase. This'll be squashed, but I thought it might be interesting for you. (thanks for caring)
>> -/* Drop a replication slot */
>> +/* Drop a replication slot. */
>> I don't recall that comments on a single line have a dot even if this single line is a complete sentence.
> Then it shouldn't start with an uppercase - which you changed...
>> -static void StreamLog();
>> +static void StreamLogicalLog();
>> Not related at all to those patches, but for correctness it is better to add a declaration (void).
> Agreed.
>> Except those small things the changes look good to me.
> Cool. Will push and then, sometime this week, review the next patch.
> Greetings, Andres Freund

You might want to do that function renaming in pg_receivexlog.c as well.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pg_receivexlog and replication slots
On 10/01/2014 01:19 PM, Andres Freund wrote: On 2014-10-01 13:17:17 -0400, Jan Wieck wrote: -static void StreamLog(); +static void StreamLogicalLog(); Not related at all to those patches, but for correctness it is better to add a declaration (void). Agreed. Except those small things the changes look good to me. Cool. Will push and then, sometime this week, review the next patch. You might want to do that function renaming in pg_receivexlog.c as well. Why? You mean to StreamPhysicalLog()? Greetings, Andres Freund Like that. Plus, there is one more occurrence of StreamLog() in pg_recvlogical.c. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_receivexlog and replication slots
On 10/01/2014 01:32 PM, Andres Freund wrote: On 2014-10-01 13:22:53 -0400, Jan Wieck wrote: On 10/01/2014 01:19 PM, Andres Freund wrote: On 2014-10-01 13:17:17 -0400, Jan Wieck wrote: -static void StreamLog(); +static void StreamLogicalLog(); Not related at all to those patches, but for correctness it is better to add a declaration (void). Agreed. Except those small things the changes look good to me. Cool. Will push and then, sometime this week, review the next patch. You might want to do that function renaming in pg_receivexlog.c as well. Why? You mean to StreamPhysicalLog()? Like that. Don't see that much point - it just seemed wrong to have StreamLog do two somewhat different things. And StreamLog() in receivexlog is much older... The point is that StreamLog is semantically a superset of StreamWhateverLog. Jan Plus, there is one more occurrence of StreamLog() in pg_recvlogical.c. Gah. Wrongly split commit :(. Fixed in 9.4, it's already fixed by the subsequent commit in master. Greetings, Andres Freund -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 09:46 PM, Craig Ringer wrote: On 09/16/2014 07:44 AM, Peter Geoghegan wrote: FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Also, at a certain size the fact that Pg must rewrite the whole document for any change to it starts to introduce other practical changes. Anyway - this is looking like the change will go in, and with it a catversion bump. Introduction of a jsonb version/flags byte might be worthwhile at the same time. It seems likely that there'll be more room for improvement in jsonb, possibly even down to using different formats for different data. Is it worth paying a byte per value to save on possible upgrade pain? This comment seems to have drowned in the discussion. If there indeed has to be a catversion bump in the process of this, then I agree with Craig. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/17/2014 03:36 PM, Peter Eisentraut wrote: On 9/17/14 3:04 PM, Pavel Stehule wrote: What is difference between content of variable or content of database? You can test any prerequisite, but when this prerequisite is not solved, than exception is very very hard without possible handling. If the assertion tests arbitrary Boolean expressions, then we can't stop the user from abusing them. Exactly. Doing something like ASSERT (select count(*) from foo where fk not in (select pk from bar)) = 0; is a perfectly fine, arbitrary boolean expression. It will probably work well in a development environment too. And I am very sure that it will not scale well once that code gets deployed. And I know how DBAs react to the guaranteed following performance problem. They will disable ALL assert ... or was there some sort of assert class system proposed that I missed? But it's another thing if we design specific syntax that encourages such abuse, as proposed earlier. The design should explicitly discourage that sort of nonsense. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/14/2014 02:25 PM, Pavel Stehule wrote:
> a) It isn't possible to handle an assertion exception anywhere .. it enforces ROLLBACK in 100% of cases
> b) Assertions should be disabled globally .. I am not sure if it is a good idea, but I can understand that some tests based on queries to data can be a performance issue.
> An important question is the relation between assertions and exceptions. Is it only a shortcut for an exception or something different?

I think that most data integrity issues can be handled by a well designed database schema that uses UNIQUE, NOT NULL, REFERENCES and CHECK constraints. Assertions are usually found inside of complex code constructs to check values of local variables. I don't think it is even a good idea to implement assertions that can query arbitrary data.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 11:28 AM, Heikki Linnakangas wrote: On 09/10/2014 05:57 PM, Fabien COELHO wrote: Hello Heikki, I looked closer at the this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to .yy instead of the unreadable yyy, and be told not to, because some people have tool which process the output so the format MUST NOT CHANGE. So my behavior is not to avoid touching anything in this area. I'm fine if you do it, though:-) However I have not time to have a precise look at your patch to cross-check it before Friday. This is different from changing xxx yyy to xxx.yyy in two ways. First, this adds new information to the log that just isn't there otherwise, it's not just changing the format for the sake of it. Second, in this case it's easy to write a parser for the log format so that it works with the old and new formats. Many such programs would probably ignore any unexpected extra fields, as a matter of lazy programming, while changing the separator from space to a dot would surely require changing every parsing program. We could leave out the lag fields, though, when --rate is not used. Without --rate, the lag is always zero anyway. That would keep the file format unchanged, when you don't use the new --rate feature. I'm not sure if that would be better or worse for programs that might want to parse the information. We could also leave the default output format as is and introduce another option with a % style format string. 
Jan Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then convert it to a double between 0.0 and 1.0. The reason for this conversion is to have randomness but to still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if it is not often) a 0 tps when 100 tps is required. The 10000 granularity is basically random but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, bar issues from the database for processing the transactions). So although this feature can be discussed and amended, it is fully voluntary. I think that it makes sense so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. Ok, yeah, the comments indeed didn't mention anything about that. I don't think such clamping is necessary. With that 9.2x clamp on the multiplier, the probability that any given transaction hits it is about 1/10000. And a delay 9.2 times the average is still quite reasonable, IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, seems to be about 700x the average, which is still reasonable if you run a decent number of transactions. And of course, the probability of hitting such an extreme value is minuscule, so if you're just doing a few quick test runs with a small total number of transactions, you won't hit that. - Heikki -- Jan Wieck Senior Software Engineer http://slony.info
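The clamping trick being discussed is easy to see in a few lines of Python. This is a hedged sketch, not pgbench's actual C code: the 1.00055271703 correction factor comes from the quoted snippet, the 1..10000 granularity is assumed (consistent with the quoted 9.2 cap, since ln(10000) is about 9.21), and the helper name throttle_wait is invented for illustration. Drawing the uniform value on a finite grid bounds the exponential multiplier, while the correction factor keeps the mean wait at the requested delay:

```python
import math
import random

CORRECTION = 1.00055271703  # compensates for the mean shift caused by the grid

def throttle_wait(throttle_delay, rng):
    # Poisson-style transaction spacing: exponentially distributed with
    # mean throttle_delay. Drawing u from {1/10000 .. 10000/10000} instead
    # of a continuous (0, 1) caps -log(u) at log(10000) ~ 9.21, so no
    # single scheduled wait can explode.
    u = rng.randint(1, 10000) / 10000.0
    return throttle_delay * CORRECTION * -math.log(u)

rng = random.Random(42)
waits = [throttle_wait(100.0, rng) for _ in range(200000)]
bound = 100.0 * CORRECTION * math.log(10000)   # hard upper bound, ~921
mean = sum(waits) / len(waits)                 # hovers near the requested 100
```

With the clamp, a target of 100 tps can momentarily drop to roughly 100/9.2, i.e. about 11 tps, but never to zero, which matches the reasoning in the mail.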
Re: [HACKERS] PL/pgSQL 2
On 09/05/2014 10:32 PM, Marko Tiikkaja wrote: On 2014-09-02 8:52 PM, Kevin Grittner wrote: Marko Tiikkaja ma...@joh.to wrote: Sounds like in this case you'd only use set-oriented programming at the end of the transaction, no? I guess -- more properly I would say in the final database transaction for that financial transaction. Yes, I should have said financial transaction, but I hit send a bit too early. And no, that never made me wish that plpgsql functions defaulted to throwing errors for DML statements that affected more than one row. Fine. But you should still be able to see the point we're trying to make. The number one is special, and it's present everywhere. If you want to program defensively, you have to go through a lot of pain right now. We're looking for a way to alleviate that pain. Defaulting to throwing errors would be one way to do it, but that's not what's being suggested here anymore. You can dismiss what we're doing by saying that it doesn't follow the best practices or we just want an interface for a key-value store or whatever. And yes, to some extent, a simple interface for a key-value store would come in handy. But we still have the 5-15% (a big part of it being the reporting we need to do) of the code that *doesn't* want that, *and* we want to use all of the Postgres features where applicable. The point isn't about best practices. The point is that if you want to ensure that at most one row is affected, then qualify it by a unique set of columns. Making PL/pgSQL behave differently on UPDATE than SQL to enforce that by default was simply a misguided design idea. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 1.2
On 09/06/2014 04:21 AM, Marko Tiikkaja wrote: We wrap these things into (sometimes) simple-looking function so that none of the application developers ever run any SQL. We define an interface between the application and the database, and that interface is implemented using PL/PgSQL functions. Sure, sometimes one function will just fire off a single UPDATE .. RETURNING, or a SELECT, but that doesn't matter. The trick is to be consistent everywhere. There is precisely your root problem. Instead of educating your application developers on how to properly use a relational database system, you try to make it foolproof. Guess what, the second you made something foolproof, evolution will create a dumber fool. This is a race you cannot win. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/06/2014 12:17 PM, Marko Tiikkaja wrote: OK, fine. But that's not what I suggested on the wiki page, and is also not what I'm arguing for here right now. What the message you referred to was about was the condescending attitude where we were told to think in terms of sets (paraphrased), without considering whether that's even possible to do *all the time*. SQL is, by definition, a set oriented language. The name Procedural Language / pgSQL was supposed to suggest that this language adds some procedural elements to the PostgreSQL database. I never intended to create a 100% procedural language. It was from the very beginning, 16 years ago, intended to keep the set orientation when it comes to DML statements inside of functions. That means that you will have to think in sets *all the time*. The empty set and a set with one element are still sets. No matter how hard you try to make them special, in my mind they are not. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/06/2014 12:33 PM, Marko Tiikkaja wrote: On 2014-09-06 6:31 PM, Jan Wieck wrote: On 09/06/2014 12:17 PM, Marko Tiikkaja wrote: OK, fine. But that's not what I suggested on the wiki page, and is also not what I'm arguing for here right now. What the message you referred to was about was the condescending attitude where we were told to think in terms of sets (paraphrased), without considering whether that's even possible to do *all the time*. SQL is, by definition, a set oriented language. The name Procedural Language / pgSQL was supposed to suggest that this language adds some procedural elements to the PostgreSQL database. I never intended to create a 100% procedural language. It was from the very beginning, 16 years ago, intended to keep the set orientation when it comes to DML statements inside of functions. No matter how hard you try to make them special, in my mind they are not. Of course they are. That's why you have PRIMARY KEYs and UNIQUE constraints. Then please use those features instead of crippling the language. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/06/2014 12:47 PM, David G Johnston wrote: If the language, and the system as a whole, was only used by perfectionists who do not make errors - and with perfectly clean data - this adherence to purity would be acceptable. But the real world is not that clean and so enhancing the language to meet the needs of the real world is not crippling the language. Being able to state explicitly that the cardinality of the set I get back must be 1, no more and no less, doesn't remove the fact that I know I am dealing with a set and that I simply want to make an assertion as to its properties so that if a bug 3 layers deep into the application causes something other than 1 row to be affected I know immediately and can invoke the appropriate action - throw an error. As I already mentioned in the other thread, those assertions or checks do not belong into the PL. If they are desired they should be added to the main SQL syntax as COMMAND CONSTRAINT as suggested by Hannu. Your statement is not limited to PL functions. It is just as valid for NORMAL applications. However, this would be a proprietary extension that is not covered by any SQL standard and for that reason alone cannot be the default. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Improving PL/PgSQL
On 09/06/2014 02:08 PM, Marko Tiikkaja wrote: On 2014-09-06 7:56 PM, Pavel Stehule wrote: 2014-09-06 19:54 GMT+02:00 Marko Tiikkaja ma...@joh.to: Then that doesn't really solve our problem. Switching between two languages on a per-function basis, when both look exactly the same but have very different semantics, would be a nightmare. It is the maximum of what is possible - use a different language instead Sigh. OK, let's try and forget the cardinality assertions we've been talking about in the other thread(s). I seem to recall there being a generally welcoming atmosphere in the discussion about adding a set of pragmas (or options/whatever) to make some of PL/PgSQL's flaws go away, in a non-backwards compatible way. From the list here: https://wiki.postgresql.org/wiki/Improving_PL/PgSQL_(September_2014) do you think at least some of those would be reasonable candidates for these pragmas? Do you see other ones that are missing from this list? Please also keep discussion about ASSERT in the thread for that, and the suggestion under Single-row operations out of this.

+1 for SELECT INTO throwing TOO_MANY_ROWS if there is more than one. Zero rows should be dealt with using an IF NOT FOUND ... construct.

+1 for requiring that the number of result columns match the expression list of SELECT INTO.

-1 on removal of implicit type casting. This needs to go into a #pragma or GUC. Too drastic of a backwards compatibility break.

-1 on the single row operations. This belongs into the main SQL engine as COMMAND CONSTRAINTS.

+1 on EXECUTE and FOUND, where applicable (DML statements only). I do not recall why we decided to implement GET DIAGNOSTICS instead of an automatically set global ROW_COUNT variable. But there was a reason, I believe, and we should check the list archives for it.

+1 on the OUT alias.

-1 on the ASSERT as proposed. It would be too easy for application developers to abuse them to govern business logic, and a DBA could later turn off assertions for performance reasons.
Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/05/2014 04:40 AM, Pavel Stehule wrote: 2014-09-05 10:33 GMT+02:00 Marko Tiikkaja ma...@joh.to mailto:ma...@joh.to: On 9/5/14 10:09 AM, Pavel Stehule wrote: I think this could still be parsed correctly, though I'm not 100% sure on that: ASSERT WARNING (EXISTS(SELECT ..)), 'data are there'; PLpgSQL uses a ';' or some plpgsql keyword as the SQL statement delimiter. It is the reason why RETURN QUERY ... ';' works. So in this case it can be practical to place the SQL statement at the end of the plpgsql statement. *shrug* There are lots of cases where a comma is used as well, e.g. RAISE NOTICE '..', expr, expr; parentheses are not practical, because it is hard to identify bugs .. I don't see why. The PL/PgSQL SQL parser goes to great lengths to identify unmatched parentheses. But the parens probably aren't necessary in the first place; you could just omit them and keep parsing until the next comma AFAICT. So the syntax would be: RAISE [ NOTICE | WARNING | EXCEPTION/ASSERT/WHATEVER ] boolean_expr [, error_message [, error_message_param [, ... ] ] ]; extending RAISE with an ASSERT level has minimal new value Adding a WHEN clause to RAISE would have the benefit of not needing any new keywords at all. RAISE EXCEPTION 'format' [, expr ...] WHEN row_count <> 1; Regards, Jan The simplicity of integrating SQL and PLpgSQL is in using smart keywords - it is more verbose, and it allows for good diagnostics I disagree. The new keywords provide nothing of value here. They even encourage the use of quirky syntax in *exchange* for verbosity (IS NOT NULL pk? really?). It is about the semantics and conformity of the proposed tools. Sure, all can be reduced to ASSERT(expr) ..
but where is the benefit over a function call, which I am able to do without any change of the language, as a plpgsql extension - there is no need to change anything for too thin a proposal .marko -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pgbench throttling latency limit
On 08/27/2014 04:08 AM, Heikki Linnakangas wrote: That model might make some sense if you think e.g. of a web application, where the web server has a timeout for how long it waits to get a database connection from a pool, but once a query is started, the transaction is considered a success no matter how long it takes. The latency limit would be that timeout. But I think a more useful model is that when the user clicks a button, he waits at most X seconds for the result. If that deadline is exceeded, the web server will give a 404, or the user will simply get bored and go away, and the transaction is considered a failure. Correct, the whole TPC-B model better fits an application where client requests enter a queue at the specified TPS rate and that queue is processed. While we are at it, note that in the original TPC-B specification the transaction duration measured is the time between receiving the client request (in current pgbench under throttling that is when the transaction is scheduled) and answering the request. This is the client visible response time, which has nothing to do with the database latency. As per TPC-B, the entire test is only valid if 90% of all client response times are within 2 seconds. It would be useful if pgbench would A) measure and report that client response time in the per transaction log files and B) report at the end what percentage of transactions finished within a specified response time constraint (default 2 seconds). Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
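The validity criterion described above is trivial to compute from a list of per-transaction client response times. A hedged sketch - the function names are invented for illustration; the 90%-within-2-seconds rule is the TPC-B criterion as paraphrased in the mail:

```python
def fraction_within(response_times, limit_s=2.0):
    # Fraction of client-visible response times at or below the limit.
    return sum(1 for t in response_times if t <= limit_s) / len(response_times)

def tpcb_valid(response_times, limit_s=2.0, required=0.90):
    # TPC-B style pass/fail: at least 90% of responses within 2 seconds.
    return fraction_within(response_times, limit_s) >= required

times = [0.05] * 95 + [3.0] * 5   # 95% fast transactions, 5% slow ones
ok = tpcb_valid(times)            # passes: 95% are within 2 seconds
```

The same per-transaction list would also drive the proposed end-of-run report: one percentage number against a configurable response-time constraint.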
Re: [HACKERS] pgbench throttling latency limit
On 09/05/2014 10:12 AM, Fabien COELHO wrote: Note that despite pg's appalling latency performance, it may stay well over the 90% limit, or even 100%: when things are going well a lot of transactions run in about a millisecond, while when things are going bad transactions would take a long time (although possibly under or about 1s anyway), *but* very few transactions are passed, the throughput is very small. The fact that during 15 seconds only 30 transactions are processed is a detail that does not show up in the metric. I haven't used the real pgbench for a long time. I will have to look at your patch and see what the current version actually does or does not. What I have been using is a Python version of pgbench that I wrote for myself when I started learning that language. That one does record both values, the DB transaction latency and the client response time (time from the request being entered into the queue until transaction commit). When I look at those results it is possible to have an utterly failing run, with 60% of client response times being within 2 seconds, but all the DB transactions still in milliseconds. As said, I'll have to take a look at it. Since I am on vacation next week, getting ready for my first day at EnterpriseDB, this may actually happen. Jan -- Jan Wieck Senior Software Engineer http://slony.info
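The gap between DB transaction latency and client response time described here falls out of simple queueing arithmetic. The sketch below is an invented toy model (not the Python pgbench mentioned in the mail): requests arrive on a fixed schedule, a single worker serves them in order, and as soon as the service time exceeds the arrival interval, the growing backlog inflates client response times even though every individual transaction stays fast:

```python
def simulate(arrival_interval_s, service_time_s, n):
    # Requests enter the queue at a fixed rate; one worker serves them in
    # order. Returns per-request (client response time, DB transaction latency).
    worker_free_at = 0.0
    results = []
    for i in range(n):
        arrives = i * arrival_interval_s
        starts = max(arrives, worker_free_at)   # wait in queue if worker busy
        worker_free_at = starts + service_time_s
        results.append((worker_free_at - arrives, service_time_s))
    return results

# 100 tps requested, but each transaction takes 20 ms: the DB latency stays
# at 20 ms while the client response time climbs past a second by request 100.
runs = simulate(0.010, 0.020, 100)
```

This is exactly the "failing run" pattern from the mail: all DB transactions in milliseconds, yet a large share of client response times blowing through the 2-second constraint.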
Re: [HACKERS] PL/pgSQL 1.2
On 09/04/2014 01:14 AM, Pavel Stehule wrote: 2014-09-03 23:19 GMT+02:00 Hannu Krosing ha...@2ndquadrant.com A more SQL-ish way of doing the same could probably be called COMMAND CONSTRAINTS and look something like this SELECT ... CHECK (ROWCOUNT BETWEEN 0 AND 1); It is very near to my proposed ASSERT Only if the ASSERT syntax were to become part of the original statement it is supposed to check. In Hannu's command constraint example above, the statement that causes the error, and thus will be logged and become identified by the error message, is the actual SELECT (or other DML statement). I think I like the COMMAND CONSTRAINT the best so far. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 1.2
On 09/04/2014 09:31 AM, Pavel Stehule wrote: 2014-09-04 15:24 GMT+02:00 Jan Wieck j...@wi3ck.info I think I like the COMMAND CONSTRAINT the best so far. I do not, because if it is not part of SQL, the parser in plpgsql will be more complex. You have to inject it into SELECT, UPDATE, INSERT, DELETE Making the COMMAND CONSTRAINT part of the core SQL parser was how I understood Hannu's idea. It would be horrible to tuck that feature away inside of a PL, rather than making it available to all PLs as well as applications that use SQL directly (I think there still are two or three applications that do). Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 1.2
On 09/04/2014 11:16 AM, Joel Jacobson wrote: On 4 sep 2014, at 16:45, Hannu Krosing ha...@2ndquadrant.com wrote: When looking from the other end of the problem, we are using SELECT/INSERT/UPDATE/DELETE *SET statements* in pl/pgsql when we really want scalars. My understanding is that one main drivers of starting this thread was wanting also guaranteed SCALAR versions of these. And wanting them in a way that is easy to use. +1 Thank you! I have been trying to explain this in multiple cryptic ways but failed. You just nailed it! That's *exactly* what I mean! I believe we all agree that the availability of most of the proposed functionality is desirable. I think the main difference between your point of view and that of a few others (me included) is that you prefer a language that is easy and fast to type, with as few key strokes as possible, while we prefer a language that is similar to SQL, which is rather verbose to the reader. At least when the discussion is about the default procedural language installed with the core database system. Such a language should be as similar as possible to SQL. Which is the reason why I believe that the CHECK clause belongs into the main parser, not into the PL. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/12/2014 10:58 AM, Robert Haas wrote: What would really be ideal here is if the JSON code could inform the toast compression code "this many initial bytes are likely incompressible, just pass them through without trying, and then start compressing at byte N", where N is the byte following the TOC. But I don't know that there's a reasonable way to implement that. Sorry, being late for the party. Anyhow, this strikes me as a good basic direction of thought. But I think we should put the burden on the data type, not on toast. To do that, data types could have an optional toast_hint_values() function, which the toast code can call with the actual datum at hand and its default parameter array. The hint values function then can modify that parameter array, telling toast how much to skip, how hard to try (or not at all) and so on. A data type specific function should know much better how to figure out how compressible a particular datum may be. Certainly nothing for 9.4, but it might require changing the toast API in a different way than just handing it an oid and hard-coding the JSONBOID case into toast for 9.4. If we are going to change the API, we might as well do it right. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
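The proposed division of labor is easy to mock up. Everything below is hypothetical - there is no toast_hint_values() in PostgreSQL, and zlib merely stands in for pglz: the data type inspects its own datum and tells the compressor how many leading bytes to skip, and the compressor stores that prefix raw while compressing only the rest:

```python
import zlib

def jsonb_hint(datum, header_len):
    # A data-type-specific hint function (hypothetical): the data type knows
    # its leading offset table compresses poorly, so it says how much to skip.
    return {"skip": header_len}

def toast_compress(datum, hint=None):
    # Toast-side sketch: honor the hint by passing the prefix through raw
    # and running the compressor only on the remainder.
    skip = hint["skip"] if hint else 0
    raw_head = datum[:skip]
    packed_tail = zlib.compress(datum[skip:])
    return raw_head, packed_tail

header = bytes(range(256)) * 4             # stand-in for a 1024-byte offset table
payload = b"the quick brown fox " * 200    # highly repetitive payload
datum = header + payload

head, tail = toast_compress(datum, jsonb_hint(datum, len(header)))
stored = len(head) + len(tail)             # well below len(datum)
```

The point of the sketch is the interface shape: the compressor never needs to guess about the datum's layout, because the data type hands it the parameters.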
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 10:21 AM, Andrew Dunstan wrote: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. This interacts poorly with the code in pglz_compress() that gives up if it's found nothing compressible in the first first_success_by bytes of a value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) [snip] There is plenty of compressible data once we get into the repetitive strings in the payload part --- but that starts at offset 944, and up to that point there is nothing that pg_lzcompress can get a handle on. There are, by definition, no sequences of 4 or more repeated bytes in that area. I think in principle pg_lzcompress could decide to compress the 3-byte sequences consisting of the high-order 24 bits of each offset; but it doesn't choose to do so, probably because of the way its lookup hash table works: * pglz_hist_idx - * * Computes the history table slot for the lookup by the next 4 * characters in the input. * * NB: because we use the next 4 characters, we are not guaranteed to * find 3-character matches; they very possibly will be in the wrong * hash list. This seems an acceptable tradeoff for spreading out the * hash keys more. For jsonb header data, the next 4 characters are *always* different, so only a chance hash collision can result in a match. There is therefore a pretty good chance that no compression will occur before it gives up because of first_success_by. I'm not sure if there is any easy fix for this. We could possibly change the default first_success_by value, but I think that'd just be postponing the problem to larger jsonb objects/arrays, and it would hurt performance for genuinely incompressible data. 
A somewhat painful, but not yet out-of-the-question, alternative is to change the jsonb on-disk representation. Perhaps the JEntry array could be defined as containing element lengths instead of element ending offsets. Not sure though if that would break binary searching for JSON object keys. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. Maybe we should have pglz_compress() look at the *last* 1024 bytes if it can't find anything worth compressing in the first, for values larger than a certain size. It's worth noting that this is a fairly pathological case. AIUI the example you constructed has an array with 100k string elements. I don't think that's typical. So I suspect that unless I've misunderstood the statement of the problem we're going to find that almost all the jsonb we will be storing is still compressible. I also think that a substantial part of the problem of coming up with a representative data sample is because the size of the incompressible data at the beginning is somewhat tied to the overall size of the datum itself. This may or may not be true in any particular use case, but as a general rule of thumb I would assume that the larger the JSONB document, the larger the offset array at the beginning. Would changing 1024 to a fraction of the datum length for the time being give us enough room to come up with a proper solution for 9.5? Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 11:18 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. This is only because the input data was exact copies of the same strings over and over again. PGLZ can very well compress slightly less identical strings of varying lengths too. Not as well, but well enough. But I suspect such input data would make it fail again, even with lengths. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
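The offsets-versus-lengths point from this exchange can be demonstrated in a few lines. The sketch below uses Python's zlib as a stand-in for pglz (the two differ - zlib has a Huffman stage that pglz lacks - so treat the exact ratios as illustrative only): a strictly increasing series of end offsets, packed as 32-bit integers, compresses far worse than the equivalent series of lengths, which for identical elements degenerates into one repeated value:

```python
import struct
import zlib

n = 5000
elem_len = 17                       # every element has the same length

# jsonb-style header: strictly increasing end offsets -> nonrepetitive bytes
offsets = struct.pack("<%dI" % n, *[(i + 1) * elem_len for i in range(n)])
# alternative header: per-element lengths -> maximally repetitive bytes
lengths = struct.pack("<%dI" % n, *([elem_len] * n))

off_ratio = len(zlib.compress(offsets)) / len(offsets)
len_ratio = len(zlib.compress(lengths)) / len(lengths)
# len_ratio is tiny; off_ratio stays comparatively large
```

As the mail notes, with real-world input of varying element lengths the lengths array would be less perfectly repetitive than this, but still far friendlier to an LZ-style compressor than a monotonically increasing offset series.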
Re: [HACKERS] PL/pgSQL 2
On 09/03/2014 03:14 AM, Joel Jacobson wrote: I'm in favour of Tom's idea. To merely make the plpgsql2 language a way of explicitly saying you want a specific exact combination of features/behaviour/settings which we can implement in plpgsql's existing codebase. Since it was about 100 posts since Tom's post, maybe it's worth repeating for those who missed it: What I would think about is c) plpgsql and plpgsql2 are the same code base, with a small number of places that act differently depending on the language version. We could alternatively get the result by inventing a bunch of pragma declarations, or some similar notation, that control the behavioral changes one-at-a-time. That might even be worth doing anyway, in case somebody likes some of the ideas and others not so much. But I'd see the language version as a convenient shorthand for enabling a specified collection of pretty-localized incompatible behavior changes. If they're not pretty localized, there's going to be a barrier to uptake, very comparable to the python3 analogy mentioned upthread. regards, tom lane I fully agree on this approach. It's maintainable and it will be useful from day 1. One can take that approach to another, more generic level. Just as GUCs can be set on a per-ROLE basis with ALTER USER or ALTER ROLE, PL specific GUCs could be set via ALTER LANGUAGE foo SET The possibility to CREATE LANGUAGE mybetterpl, pointing to the same PL handler function, exists already. And the same mechanism could be used by other languages, like PL/Python (for whatever such language might need such settings). This way an application can define the language settings it needs by simply creating its own language, based on all the possible permutations of those PRAGMA/GUC settings. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
Without having read the entire thread ... On 09/01/2014 05:04 AM, Joel Jacobson wrote: From the top of my head, these are things I personally would want to see in plpgsql2: + Make UPDATE/INSERT/DELETE throw an error if they didn't modify exactly 1 row, as that's the most common use-case, and provide alternative syntax to modify multiple or zero rows. I think this is a completely flawed proposal by definition. SQL itself does not make such an assumption and, as a SET oriented language, never should. A SET is zero or more tuples. Would you also suggest that the PostgreSQL backend throw an ERROR if an UPDATE/INSERT/DELETE doesn't modify exactly 1 row? If not, explain why there should be a difference between SQL executed from the frontend and SQL executed by a PL/pgSQL function. -1 from me. + Change all warnings into errors I suggest being slightly more selective on that one. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/01/2014 09:06 PM, Andrew Dunstan wrote: On 09/01/2014 08:09 PM, Neil Tiffin wrote: The docs also tell you how to avoid having to do this, using dollar quoting. That should be enough alone to suggest postgreSQL start working on a modern, in core, fast, fully supported language. Of course PL/pgSQL works, but so did one-line 5k perl programs that nobody likes today. Everything can be done in assembler, but no one suggests that today. Today, it is all about programmer productivity. PL/pgSQL has a lot of unnecessary stuff that sucks the life out of programmer productivity. And this should be very much a concern of the professionals that support PostgreSQL For example: DECLARE declarations BEGIN statements END This looks a lot like COBOL or Pascal, and today is mostly unnecessary. It looks like Ada, and that's not an accident. (Nor is it a bad thing.) First of all it is [DECLARE declarations] BEGIN statements END; Second statements includes the whole definition above as a statement and controls identifier visibility and such compatible to PL/SQL. You want to lose that? Not such a great idea, IMHO. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 12:20 PM, Joel Jacobson wrote: On Tue, Sep 2, 2014 at 6:09 PM, Kevin Grittner kgri...@ymail.com wrote: Joel Jacobson j...@trustly.com wrote: Sorry for being unclear, I didn't mean to suggest the main concern is updating *all* rows. The main concern is when you have a rather complex UPDATE WHERE clause, aiming to update exactly one row. Some of the expressions might be assertions, to just double-verify the values and to make it stand out that you are checking those expressions. These are two different problems which probably need two different solutions. Making the default behavior of a set-based command be to throw an error if the resulting set is not exactly one row doesn't seem like the right solution to either one of them. I see your point. Basically, we have two types of applications where PL/pgSQL is commonly used. a) OLTP applications where you typically operate on one row for each UPDATE command. Your idea of what an OLTP application is seems flawed. b) Data warehousing applications where you process multiple rows in each UPDATE command. Ditto. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/01/2014 10:41 AM, Joel Jacobson wrote: On Mon, Sep 1, 2014 at 4:26 PM, Craig Ringer cr...@2ndquadrant.com wrote: Well, the idiom: EXECUTE format('SELECT %I FROM %I WHERE $1', col, tbl) USING val; is not lovely. It works, but it's clumsy. This is exactly why we need a new language. All the clumsy stuff we cannot fix in plpgsql can easily be fixed in plpgsql2, with the most beautiful syntax we can come up with. You know that you're running into problems with the SPI subsystem on that one, no? Identifiers cannot be parameters in SPI_prepare(). So how do you propose to make that pretty and performant? Because the moment your pretty language is out there, be sure users will kick your behind whenever they use that pretty stuff on anything but a toy setup and it spirals their servers into a DoS-attack state. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 09:32 AM, Joel Jacobson wrote: I think it's much better to make it the default behaviour in plpgsql2 than to add a new syntax to plpgsql, because then we don't have to argue what to call the keyword or where to put it. This should in NO CASE be any new syntax to plpgsql or plpgsql2. If you cannot get that same syntax to apply to default client queries, then the whole idea is bogus because it will confuse developers more than it helps you with your particular way of thinking. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 04:16 PM, Andres Freund wrote: On 2014-09-02 19:06:02 +0200, Joel Jacobson wrote: But what do you think about, STRICT UPDATE ...? I quite dislike it. An UPDATE isn't less 'strict' (whatever that means) if it updates more than one row. There's some sense in the way it's used for INTO because it's referring to the INTO. And it's much more obvious what it could mean there. For once I completely agree with Andres. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 06:41 PM, Joshua D. Drake wrote: On 09/02/2014 02:47 PM, Álvaro Hernández Tortosa wrote: Yeah, we differ there. I think having an Oracle compatibility layer in PostgreSQL would be the-next-big-thing we could have. Oracle has an orders-of-magnitude bigger user base than postgres has; and having the ability to attract them would bring us many many more users which, in turn, would benefit us all very significantly. It would be my #1 priority to do in postgres (but yes, I know -guess- how hard and what resources that would require). But dreaming is free :) Oracle compatibility certainly has merit, I just don't see it as useful for core. I would be far more interested in MSSQL compatibility honestly. That said, Postgres itself is a rockstar and I think we can make our own case without having to copy others. PL/pgSQL's syntax was modelled to look like PL/SQL, which is an Ada/COBOL lookalike. Instead of trying to mimic that, or a T-SQL thing ... maybe it is time to come up with a true PostgreSQL-specific PL for a change? Just for the sake of being something new, and not a copy of some old opossum that's been rotting like roadkill on the side of the highway for a decade already. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 06:56 PM, Andrew Dunstan wrote: People are free to do what they want, but to my mind that would be a massive waste of resources, and probably imposing a substantial extra maintenance burden on the core committers. I hear you and agree to some degree. But at the same time I remember that one of the strengths of Postgres used to be its ability to incorporate new ideas. This seems to be one of those cases. Instead of forking plpgsql2, what about designing a completely new PL/postgres from scratch? It will only take 3-10 years, but I bet it will be worth it after all. And I mean that. No sarcasm. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] How can we make beta testing better?
On 04/17/14 15:16, Merlin Moncure wrote: On Tue, Apr 15, 2014 at 4:47 PM, Josh Berkus j...@agliodbs.com wrote: Hackers, I think 9.3 has given us evidence that our users aren't giving new versions of PostgreSQL substantial beta testing, or if they are, they aren't sharing the results with us. How can we make beta testing better and more effective? How can we get more users to actually throw serious workloads at new versions and share the results? I've tried a couple of things over the last two years and they haven't worked all that well. Since we're about to go into another beta testing period, we need something new. Ideas? I've seen lots of bugs reported and fixed in the beta period over the years. My take is that it's basically unrealistic to expect volunteer beta testers to replace bona fide regression testing. I think it's a pretty fair statement that we've had some QC issues in the general area of replication technologies. What this is indicating to me is that replication needs substantially more coverage in 'make check'. Since I'm wishing for things, it would be nice to see an expansion of the buildfarm so that we could [optionally] run various performance tests as well as various replication scenarios. Then we could go back to users and say, please donate 'repeatable tests and machines to run them on' and reap the long term value. Not at all making light out of any of this...it's a huge project. The problem with testing replication is that it doesn't fit well into our standard regression testing. There are way too many moving parts as well as dependencies on the underlying OS and network topology. You will discover a ton of race conditions once you actually move from testing with multiple postmasters (so you can kill one) on the same box to using multiple virtual machines, and instead of completely severing a network connection, using some packet shaping/filtering to introduce packet loss, limited bandwidth, async routing and so on.
At least that is my experience from throwing that sort of sh*t at Slony at full speed. Not trying to discourage anyone from trying. Just saying that it doesn't fit into our existing regression test framework. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/13/14 08:27, Marko Kreen wrote: On Sat, Apr 12, 2014 at 02:10:13PM -0400, Jan Wieck wrote: Since it doesn't seem to produce any side effects, I'd think that making the snapshot unique within txid_current_snapshot() and filtering duplicates on input should be sufficient and eligible for backpatching. Agreed. The attached patch adds a unique loop to the internal sort_snapshot() function and skips duplicates on input. The git commit is here: https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00

 static void
 sort_snapshot(TxidSnapshot *snap)
 {
+	txid	last = 0;
+	int		nxip, idx1, idx2;
+
 	if (snap->nxip > 1)
+	{
 		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+		nxip = snap->nxip;
+		idx1 = idx2 = 0;
+		while (idx1 < nxip)
+		{
+			if (snap->xip[idx1] != last)
+				last = snap->xip[idx2++] = snap->xip[idx1];
+			else
+				snap->nxip--;
+			idx1++;
+		}
+	}
 }

I think you need to do SET_VARSIZE also here. Alternative is to move SET_VARSIZE after sort_snapshot(). And it seems the drop-double-txid logic should be added also to txid_snapshot_recv(). It seems weird to have it behave differently from txid_snapshot_in(). Thanks, yes on both issues. Will create another patch. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/13/14 14:22, Jan Wieck wrote: On 04/13/14 08:27, Marko Kreen wrote: I think you need to do SET_VARSIZE also here. Alternative is to move SET_VARSIZE after sort_snapshot(). And it seems the drop-double-txid logic should be added also to txid_snapshot_recv(). It seems weird to have it behave differently from txid_snapshot_in(). Thanks, yes on both issues. Will create another patch. New patch attached. New github commit is https://github.com/wieck/postgres/commit/b8fd0d2eb78791e5171e34aecd233fd06218890d Thanks again, Jan -- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index a005e67..1023566 100644
*** a/src/backend/utils/adt/txid.c
--- b/src/backend/utils/adt/txid.c
*** cmp_txid(const void *aa, const void *bb)
*** 131,146 ****
  }

  /*
!  * sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
  	if (snap->nxip > 1)
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
  }
--- 131,161 ----
  }

  /*
!  * unique sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+
  	if (snap->nxip > 1)
+ 	{
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
  }
*** parse_snapshot(const char *str)
*** 294,304 ****
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order */
! 		if (val < xmin || val >= xmax || val <= last_val)
  			goto bad_format;

! 		buf_add_txid(buf, val);
  		last_val = val;

  		if (*str == ',')
--- 309,320 ----
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order and skip duplicates */
! 		if (val < xmin || val >= xmax || val < last_val)
  			goto bad_format;

! 		if (val != last_val)
! 			buf_add_txid(buf, val);
  		last_val = val;

  		if (*str == ',')
*** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 361,368 ****
  {
  	TxidSnapshot *snap;
  	uint32		nxip,
! 				i,
! 				size;
  	TxidEpoch	state;
  	Snapshot	cur;
--- 377,383 ----
  {
  	TxidSnapshot *snap;
  	uint32		nxip,
! 				i;
  	TxidEpoch	state;
  	Snapshot	cur;
*** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 381,389 ****
  	/* allocate */
  	nxip = cur->xcnt;
! 	size = TXID_SNAPSHOT_SIZE(nxip);
! 	snap = palloc(size);
! 	SET_VARSIZE(snap, size);

  	/* fill */
  	snap->xmin = convert_xid(cur->xmin, &state);
--- 396,402 ----
  	/* allocate */
  	nxip = cur->xcnt;
! 	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));

  	/* fill */
  	snap->xmin = convert_xid(cur->xmin, &state);
*** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 395,400 ****
--- 408,416 ----
  	/* we want them guaranteed to be in ascending order */
  	sort_snapshot(snap);

+ 	/* we set the size here because the sort may have removed duplicate xips */
+ 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(snap->nxip));
+
  	PG_RETURN_POINTER(snap);
  }
*** txid_snapshot_recv(PG_FUNCTION_ARGS)
*** 472,489 ****
  	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  	snap->xmin = xmin;
  	snap->xmax = xmax;
- 	snap->nxip = nxip;
- 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(nxip));

  	for (i = 0; i < nxip; i++)
  	{
  		txid		cur = pq_getmsgint64(buf);

! 		if (cur <= last || cur < xmin || cur >= xmax)
  			goto bad_format;
  		snap->xip[i] = cur;
  		last = cur;
  	}

  	PG_RETURN_POINTER(snap);

  bad_format:
--- 488,514 ----
  	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  	snap->xmin = xmin;
  	snap->xmax = xmax;

  	for (i = 0; i < nxip; i++)
  	{
  		txid		cur = pq_getmsgint64(buf
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 03:27, Heikki Linnakangas wrote: On 04/12/2014 12:07 AM, Jan Wieck wrote: Hackers, the Slony team has been getting occasional reports of a problem with the txid_snapshot data type. The symptom is that a txid_snapshot on output lists the same txid multiple times in the xip list part of the external representation. This string is later rejected by txid_snapshot_in() when trying to determine if a particular txid is visible in that snapshot using the txid_visible_in_snapshot() function. I have not yet been able to reproduce this problem in a lab environment. It might be related to subtransactions and/or two-phase commit (at least one user is using both of them). The reported PostgreSQL version involved in that case was 9.1. It's two-phase commit. When preparing a transaction, the state of the transaction is first transferred to a dummy PGXACT entry, and then the PGXACT entry of the backend is cleared. There is a transient state when both PGXACT entries have the same xid. You can reproduce that by putting a sleep or breakpoint in PrepareTransaction(), just before the ProcArrayClearTransaction(MyProc); call. If you call txid_current_snapshot() from another session at that point, it will output two duplicate xids. (you will have to also commit one more unrelated transaction to bump up xmax). Thanks, that explains it. At this point I would find it extremely helpful to sanitize the external representation in txid_snapshot_out() while emitting some NOTICE level logging when this actually happens. I am aware that this does amount to a functional change for a back release, but considering that the _out() generated external representation of an existing binary datum won't pass the type's _in() function, I argue that such a change is warranted. Especially since this problem could possibly corrupt a dump. Hmm. Do we expect snapshots to be stored in tables, and included in a dump?
I don't think we can guarantee that will work, at least not across versions, as the way we handle snapshot internally can change. At least Londiste and Slony do store snapshots as well as xids in tables and assuming that the txid epoch is properly bumped, that information is useful and valid after a restore. But yeah, we probably should do something about that. The most straightforward fix would be to scan the array in txid_current_snapshot() and remove any duplicates. The code in txid_snapshot_in() checks that the xip list is ascending. txid_snapshot_out() does not sort the list, so it must already be sorted when the snapshot itself is created. That scan would be fairly simple. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 08:38, Andres Freund wrote: Since it's sorted there, that should be fairly straightforward. Won't fix already created and stored datums tho. Maybe _in()/parse_snapshot() should additionally skip over duplicate values? Looks easy enough. There is the sort ... missed that when glancing over the code earlier. Right, that is easy enough and looks like an acceptable fix for back branches too. Thanks, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 10:09, Greg Stark wrote: A pg_restore would start a new xid space from FirstNormalXid which would obviously not work with any stored txids. http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 11:18, Andres Freund wrote: On 2014-04-12 11:15:09 -0400, Jan Wieck wrote: On 04/12/14 10:09, Greg Stark wrote: A pg_restore would start a new xid space from FirstNormalXid which would obviously not work with any stored txids. http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html Using that as part of any sort of routine task IMNSHO is a seriously bad idea. Nobody is advocating doing so. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 10:03, Andres Freund wrote: On 2014-04-12 09:47:24 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 04/12/2014 12:07 AM, Jan Wieck wrote: the Slony team has been getting seldom reports of a problem with the txid_snapshot data type. The symptom is that a txid_snapshot on output lists the same txid multiple times in the xip list part of the external representation. It's two-phase commit. When preparing a transaction, the state of the transaction is first transfered to a dummy PGXACT entry, and then the PGXACT entry of the backend is cleared. There is a transient state when both PGXACT entries have the same xid. Hm, yeah, but why is that intermediate state visible to anyone else? Don't we have exclusive lock on the PGPROC array while we're doing this? It's done outside the remit of ProcArray lock :(. And documented to lead to duplicate xids in PGXACT. EndPrepare(): /* * Mark the prepared transaction as valid. As soon as xact.c marks * MyPgXact as not running our XID (which it will do immediately after * this function returns), others can commit/rollback the xact. * * NB: a side effect of this is to make a dummy ProcArray entry for the * prepared XID. This must happen before we clear the XID from MyPgXact, * else there is a window where the XID is not running according to * TransactionIdIsInProgress, and onlookers would be entitled to assume * the xact crashed. Instead we have a window where the same XID appears * twice in ProcArray, which is OK. */ MarkAsPrepared(gxact); It doesn't sound too hard to essentially move PrepareTransaction()'s ProcArrayClearTransaction() into MarkAsPrepared() and rejigger the locking to remove the intermediate state. But I think it'll lead to bigger changes than we'd be comfortable backpatching. If we don't, aren't we letting other backends see non-self-consistent state in regards to who holds which locks, for example? 
I think that actually works out ok, because the locks aren't owned by xids/xacts, but procs. Otherwise we'd be in deep trouble in CommitTransaction() as well, where ProcArrayEndTransaction() is clearing that state. After the whole xid transfer, there's PostPrepare_Locks() transferring the locks. Since it doesn't seem to produce any side effects, I'd think that making the snapshot unique within txid_current_snapshot() and filtering duplicates on input should be sufficient and eligible for backpatching. The attached patch adds a unique loop to the internal sort_snapshot() function and skips duplicates on input. The git commit is here: https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00 Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index a005e67..44f7880 100644
*** a/src/backend/utils/adt/txid.c
--- b/src/backend/utils/adt/txid.c
*** cmp_txid(const void *aa, const void *bb)
*** 131,146 ****
  }

  /*
!  * sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
  	if (snap->nxip > 1)
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
  }
--- 131,161 ----
  }

  /*
!  * unique sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+
  	if (snap->nxip > 1)
+ 	{
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
  }
*** parse_snapshot(const char *str)
*** 294,304 ****
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order */
! 		if (val < xmin || val >= xmax || val <= last_val)
  			goto bad_format;

! 		buf_add_txid(buf, val);
  		last_val = val;

  		if (*str == ',')
--- 309,320 ----
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order and skip duplicates */
! 		if (val
[HACKERS] Problem with txid_snapshot_in/out() functionality
Hackers, the Slony team has been getting occasional reports of a problem with the txid_snapshot data type. The symptom is that a txid_snapshot on output lists the same txid multiple times in the xip list part of the external representation. This string is later rejected by txid_snapshot_in() when trying to determine if a particular txid is visible in that snapshot using the txid_visible_in_snapshot() function. I have not yet been able to reproduce this problem in a lab environment. It might be related to subtransactions and/or two-phase commit (at least one user is using both of them). The reported PostgreSQL version involved in that case was 9.1. At this point I would find it extremely helpful to sanitize the external representation in txid_snapshot_out() while emitting some NOTICE level logging when this actually happens. I am aware that this does amount to a functional change for a back release, but considering that the _out() generated external representation of an existing binary datum won't pass the type's _in() function, I argue that such a change is warranted. Especially since this problem could possibly corrupt a dump. Comments? Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 05/01/13 04:57, Fabien COELHO wrote: Add --throttle to pgbench. Each client is throttled to the specified rate, which can be expressed in tps or in time (s, ms, us). Throttling is achieved by scheduling transactions along a Poisson distribution. This is an update of the previous proposal which fixes a typo in the sgml documentation. The use case of the option is to be able to generate a continuous gentle load for functional tests, eg in a practice session with students or for testing features on a laptop. Why does this need two option formats (-H and --throttle)? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 06/19/13 14:34, Fabien COELHO wrote: The use case of the option is to be able to generate a continuous gentle load for functional tests, eg in a practice session with students or for testing features on a laptop. Why does this need two option formats (-H and --throttle)? On the latest version it is --rate and -R. Because you may want to put something very readable and understandable in a script and like long options, or have to type it interactively every day in a terminal and like short ones. Most UNIX commands include both kinds. Would it make sense then to add long versions for all the other standard options too? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] (auto)vacuum truncate exclusive lock
On 4/12/2013 1:57 PM, Tom Lane wrote: Kevin Grittner kgri...@ymail.com writes: Tom Lane t...@sss.pgh.pa.us wrote: I think that the minimum appropriate fix here is to revert the hunk I quoted, ie take out the suppression of stats reporting and analysis. I'm not sure I understand -- are you proposing that is all we do for both the VACUUM command and autovacuum? No, I said that was the minimum fix. Looking again at the patch, I note this comment:

/*
 * We failed to establish the lock in the specified number of
 * retries. This means we give up truncating. Suppress the
 * ANALYZE step. Doing an ANALYZE at this point will reset the
 * dead_tuple_count in the stats collector, so we will not get
 * called by the autovacuum launcher again to do the truncate.
 */

and I suppose the rationale for suppressing the stats report was this same idea of lying to the stats collector in order to encourage a new vacuum attempt to happen right away. Now I'm not sure that that's a good idea at all --- what's the reasoning for thinking the table will be any less hot in thirty seconds? But if it is reasonable, we need a redesign of the reporting messages, not just a hack to not tell the stats collector what we did. Yes, that was the rationale behind it, combined with don't change function call sequences and more all over the place. The proper solution would eventually be to add a block number to the stats held by the stats collector, which preserves the information about the last filled block of the table. Then decouple the truncate from vacuum/autovacuum. Vacuum/autovacuum will set that block number when they detect the trailing free space. The analyze step can happen just as usual and reset stats, which doesn't reset that block number. The autovacuum launcher goes through its normal logic for launching autovac or autoanalyze. If it doesn't find any of those to do but the last-used-block is set, it launches the separate lazy truncate step.
This explicitly moves the truncate, which inherently requires the exclusive lock and therefore is undesirable even in a manual vacuum, into the background. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] (auto)vacuum truncate exclusive lock
On 4/12/2013 2:08 PM, Alvaro Herrera wrote: Tom Lane escribió: Are you saying you intend to revert that whole concept? That'd be okay with me, I think. Otherwise we need some thought about how to inform the stats collector what's really happening. Maybe what we need is to consider table truncation as a separate activity from vacuuming. Then autovacuum could invoke it without having to do a full-blown vacuum. For this to work, I guess we would like to separately store the status of the back-scan in pgstat somehow (I think a boolean flag suffices: were we able to truncate all pages that appeared to be empty?) Should have read the entire thread before responding :) Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] (auto)vacuum truncate exclusive lock
On 4/18/2013 11:44 AM, Jan Wieck wrote: Yes, that was the rationale behind it combined with don't change function call sequences and more all over the place. function call signatures -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange OOM errors with EXECUTE in PL/pgSQL
On 12/20/2012 4:47 PM, Dimitri Fontaine wrote: Tom Lane t...@sss.pgh.pa.us writes: The reason this fails is that you've got a half-megabyte source string, and each of the 11000 plans that are due to be created from it saves its own copy of the source string. Hence, 5500 megabytes needed just for source strings. We could possibly fix this by inventing some sort of reference-sharing arrangement (which'd be complicated and fragile) or by not storing the source strings with the plans (which'd deal a serious blow to our ability to provide helpful error messages). Neither answer seems appealing. I don't readily see how complicated and fragile it would be, it looks like a hash table of symbols pointing to source strings and a reference counting, and each plan would need to reference that symbol. Now maybe that's what you call complicated and fragile, and even if not, I'm not really sure it would pull its weight. The use case of sending over and over again *in a given session* the exact same query string without using PREPARE/EXECUTE looks like quite tiny. That sounds like a bit of overkill to me. Don't all the plans result as a plan list from a multi-statement query string, which was parsed into a query tree list and each single query tree then planned? I don't think there is any way that a single one of those trees (parse or plan) will be free'd separately. If that is true, then proper usage of memory contexts would make reference counting obsolete, even though all plans refer to the same copy. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MySQL search query is not executing in Postgres DB
On 12/14/2012 3:20 PM, Robert Haas wrote: On Fri, Dec 14, 2012 at 2:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: ... In more than ten years of working with PostgreSQL, I've never encountered where the restriction at issue here prevented a bug. It's only annoyed me and broken my application code (when moving from PostgreSQL 8.2 to PostgreSQL 8.3, never mind any other database!). There are quite a few examples in our archives of application bugs that would have been prevented, or later were prevented, by the 8.3 changes that reduced the system's willingness to apply implicit casts to text. I recall for instance cases where people got wrong/unexpected answers because the system was sorting what-they-thought-were-timestamp values textually. So I find such sweeping claims to be demonstrably false, and I'm suspicious of behavioral changes that are proposed with such arguments backing them. I think you're mixing apples and oranges. The whole point of the patch on the table - which, if you recall, was designed originally by you and merely implemented by me - was to change the behavior only in the cases where there's only one function with the appropriate name and argument count. The ambiguous cases that 8.3+ helpfully prevent are those where overloading is in use and the choice of which function to call is somewhat arbitrary and perhaps incorrectly-foreseen by the user. Those changes also have the side-effect of preventing a straightforward function call from working without casts even in cases where no overloading is in use - and making that case work is completely different from making the ambiguous case arbitrarily pick one of the available answers. FWIW I for one thought that casting more liberal in the case at hand, where there is only one function with that name and number of arguments, would be a good thing. 
My only concern with the patch presented was that changing make_fn_arguments() in that way may have unwanted side effects, because it is called from different locations and the use of COERCION_IMPLICIT may actually guard against something that we don't want to allow. I don't have any evidence that it does, just the concern that it may. Perhaps make_fn_arguments() needs to receive the coercion context as an argument, and the caller hands in COERCION_ASSIGNMENT only in the case at hand?

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
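A minimal sketch of the suggestion in the last paragraph (hypothetical stand-ins, not the actual parser code): make_fn_arguments() would receive the coercion context from its caller, which relaxes to assignment coercion only when exactly one candidate matched.

```c
/* Mirrors the shape of PostgreSQL's CoercionContext enum; this is an
 * illustrative stand-in, not the real parser code. */
typedef enum
{
    COERCION_IMPLICIT,   /* coercion in context of expression */
    COERCION_ASSIGNMENT, /* coercion in context of assignment */
    COERCION_EXPLICIT    /* explicit cast operation */
} CoercionContext;

/* Hypothetical caller-side choice: only when exactly one function
 * with the right name and argument count exists do we relax to
 * assignment-level coercion; ambiguous overloads keep the strict
 * implicit-only behavior. */
CoercionContext
context_for_call(int ncandidates)
{
    return (ncandidates == 1) ? COERCION_ASSIGNMENT : COERCION_IMPLICIT;
}
```

This keeps the guard intact for every other code path into make_fn_arguments(), since those callers would simply continue to pass COERCION_IMPLICIT.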
Re: [HACKERS] PRIVATE columns
On 12/12/2012 1:12 PM, Simon Riggs wrote:
> Currently, ANALYZE collects data on all columns and stores these samples in pg_statistic where they can be seen via the view pg_stats. In some cases we have data that is private and we do not wish others to see it, such as patient names. This becomes more important when we have row security. Perhaps that data can be protected, but it would be even better if we simply didn't store value-revealing statistic data at all.

Such private data is seldom the target of searches, or if it is, it is mostly evenly distributed anyway. Would protecting it the same way we protect the passwords in pg_authid be sufficient?

Jan

> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram. Doing that would avoid inadvertent leaking of potentially private information. (SET STATISTICS 0 simply skips collection of statistics altogether.) To implement this, one way would be to allow
> ALTER TABLE foo ALTER COLUMN foo1 SET STATISTICS PRIVATE;
> or we could use another magic value like -2 to request this case. (Yes, I am aware we could use a custom datatype with a custom typanalyze for this, but that breaks other things.)
> Thoughts?

--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] PRIVATE columns
On 12/12/2012 3:12 PM, Simon Riggs wrote:
> On 12 December 2012 19:13, Jan Wieck janwi...@yahoo.com wrote:
>> Such private data is seldom the target of searches, or if it is, it is mostly evenly distributed anyway. Would protecting it the same way we protect the passwords in pg_authid be sufficient?
>
> The user backend does need to be able to access the stats data during optimization. It's hard to have data accessible and yet impose limits on the uses to which that can be put. If we have row security on the table but no equivalent capability on the stats, then we'll have leakage. e.g. set statistics 1, ANALYZE, then leak 1 credit card number.

Like for the encrypted password column of pg_authid, I don't see any reason why the values in the stats columns need to be readable by anyone but a superuser at all. Do you?

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/9/2012 2:37 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> Based on the discussion and what I feel is a consensus, I have created an updated patch that has no GUC at all. The hard coded parameters in include/postmaster/autovacuum.h are
>>
>> AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
>> AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
>> AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
>
> Since these really aren't part of the external API and are only referenced in vacuumlazy.c, it seems more appropriate to define them there.
>
>> I gave that the worst workload I can think of: a pgbench-style application that throws about 10 transactions per second at it, so that there is constantly the need to give up the lock due to conflicting lock requests and then reacquire it. A cleanup process is periodically moving old tuples from the history table to an archive table, making history a rolling window table. And a third job 2-3 times per minute produces a transaction lasting 10 seconds, forcing autovacuum to give up on the lock reacquisition. Even with that workload, autovacuum is slowly but steadily chopping away at the table.
>
> Applies with minor offsets, builds without warning, and passes `make check-world`. My tests based on your earlier posted test script confirm the benefit. There are some minor white-space issues; for example git diff --color shows some trailing spaces in comments.

Cleaned up all of those.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin

[Attached patch: diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c, index 5036849..1686b88. It adds the storage/proc.h and portability/instr_time.h includes, the hard-coded AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL (20ms), AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL (50ms) and AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT (5000ms) settings, a lock_waiter_detected flag in LVRelStats, and changes to lazy_vacuum_rel() and lazy_truncate_heap() so that an early terminated truncate attempt suppresses the pgstat report and cancels a pending ANALYZE. The rest of the diff is cut off in the archive.]
Re: [HACKERS] MySQL search query is not executing in Postgres DB
I am aware that in the case at hand, the call to make_fn_arguments() is made with the only possible candidate function, so changing COERCION_IMPLICIT to COERCION_ASSIGNMENT inside of make_fn_arguments() is correct. But I wonder if this may have unwanted side effects for other code paths into make_fn_arguments(), like from optimizer/util/clauses.c or from parser/parse_oper.c. I'm not saying it does, just asking if that could be the case.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/6/2012 12:45 PM, Robert Haas wrote:
> On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck janwi...@yahoo.com wrote:
>> That sort of dynamic approach would indeed be interesting. But I fear that it is going to be complex at best. The amount of time spent in scanning heavily depends on the visibility map. The initial vacuum scan of a table can take hours or more, but it does update the visibility map even if the vacuum itself is aborted later. The next vacuum may scan that table in almost no time at all, because it skips all blocks that are marked all visible.
>
> Well, if that's true, then there's little reason to worry about giving up quickly, because the next autovacuum a minute later won't consume many resources.

"Almost no time" is of course relative to what an actual scan and dead tuple removal cost. Looking at a table with 3 GB of dead tuples at the end, the initial vacuum scan takes hours. When vacuum comes back to this table, cleaning up a couple megabytes of newly deceased tuples and then skipping over the all-visible pages may take a minute.

Based on the discussion and what I feel is a consensus, I have created an updated patch that has no GUC at all. The hard coded parameters in include/postmaster/autovacuum.h are

AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

I gave that the worst workload I can think of: a pgbench-style application that throws about 10 transactions per second at it, so that there is constantly the need to give up the lock due to conflicting lock requests and then reacquire it. A cleanup process is periodically moving old tuples from the history table to an archive table, making history a rolling window table. And a third job 2-3 times per minute produces a transaction lasting 10 seconds, forcing autovacuum to give up on the lock reacquisition. Even with that workload, autovacuum is slowly but steadily chopping away at the table.
Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin

[Attached patch: diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c, index c9253a9..fd3e91e. It adds the storage/proc.h and portability/instr_time.h includes, a lock_waiter_detected flag in LVRelStats, and changes to lazy_vacuum_rel() and lazy_truncate_heap() so that an early terminated truncate attempt suppresses the pgstat report and cancels a pending ANALYZE. The rest of the diff is cut off in the archive.]
Re: [HACKERS] autovacuum truncate exclusive lock round two
Kevin and Robert are well aware of most of the below. I just want to put this out here so other people who haven't followed the discussion too closely may chime in. Some details on the problem:

First of all, the vacuum scan must detect a minimum of 1000 pages as possibly being all empty at the end of a relation. Without at least 8MB of possible free space at the end, the code never calls lazy_truncate_heap(). This means we don't have to worry about tiny relations at all. Any relation that stays under 8MB of turnover between autovacuum VACUUM runs can never get into this situation.

Relations that have higher turnover than that, but at random places or with a high percentage of rather static rows, don't fall into the problem category either. They may never accumulate that much contiguous free space at the end. The turnover will be reusing free space all over the place. So again, lazy_truncate_heap() will never be called.

Relations that eventually build up more than 8MB of free space at the end aren't automatically a problem either. The autovacuum VACUUM scan just scanned those pages at the end, which means that the safety scan for truncate, done under exclusive lock, is checking exactly those pages, and most likely they are still in memory. The truncate safety scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case (I have found so far) is rolling window tables of significant size that can bloat to multiple times their normal size every now and then. This is indeed a rare corner case and I have no idea how many users may (unknowingly) be suffering from it. It triggers lazy_truncate_heap() with a significant amount of free space to truncate: the table bloats, then all the bloat is deleted, and the periodic 100% turnover guarantees that all live tuples will shortly circulate in lower block numbers again, with gigabytes of empty space at the end. Even this by itself isn't a problem yet.
The existing code may do the job just fine unless there is frequent access to that very table. Only with this special combination of circumstances do we actually have a problem. Only now, with a significant amount of free space at the end and frequent access to the table, does the truncate safety scan take long enough, and actually have to read pages from disk, to interfere with client transactions.

At this point, the truncate safety scan may have to be interrupted to let the frequent other traffic go through. This is what we accomplish with the autovacuum_truncate_lock_check interval, where we voluntarily release the lock whenever someone else needs it. I agree with Kevin that a 20ms check interval is reasonable, because the code to check this is even less expensive than releasing the exclusive lock we're holding.

At the same time, completely giving up and relying on the autovacuum launcher to restart another worker isn't as free as it looks either. The next autovacuum worker will have to do the VACUUM scan first, before getting to the truncate phase. We cannot just skip blindly to the truncate code: if the truncate were repeatedly abandoned, the table would deteriorate and accumulate dead tuples again, and the removal of dead tuples and their index tuples has priority. As said earlier in the discussion, the VACUUM scan will skip pages that are marked as completely visible, so it won't physically read the majority of the empty pages at the end of the table over and over. But it will at least scan all pages that have been modified since the last VACUUM run.

To me this means that we want to be more generous to the truncate code about acquiring the exclusive lock. In my tests, I've seen that a rolling window table with a live set of just 10 MB or so, but empty space of 3 GB, can still have a 2 minute VACUUM scan time. Throwing that work away because we can't acquire the exclusive lock within 2 seconds is a waste of effort.
Something between 2 and 60 seconds sounds more reasonable to me.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
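The entry condition described at the top of this mail can be sketched as follows (a simplification; the real gate in vacuumlazy.c is more involved):

```c
#include <stdbool.h>

typedef unsigned int BlockNumber;

#define REL_TRUNCATE_MINIMUM 1000   /* pages: 8MB at the default 8kB page size */

/* Simplified sketch of the guard: only bother taking the exclusive
 * lock when at least 1000 possibly-empty pages sit at the end of the
 * relation. nonempty_pages is "last nonempty page + 1", as in the
 * LVRelStats comment quoted later in this thread. */
bool
worth_trying_truncate(BlockNumber rel_pages, BlockNumber nonempty_pages)
{
    return rel_pages >= nonempty_pages + REL_TRUNCATE_MINIMUM;
}
```

This is why tiny relations, and tables whose turnover reuses space all over the place, never reach lazy_truncate_heap() at all.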
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/5/2012 2:00 PM, Robert Haas wrote:
> Maybe it'd be sensible to relate the retry time to the time spent vacuuming the table. Say, if the amount of time spent retrying exceeds 10% of the time spent vacuuming the table, with a minimum of 1s and a maximum of 1min, give up. That way, big tables will get a little more leeway than small tables, which is probably appropriate.

That sort of dynamic approach would indeed be interesting. But I fear that it is going to be complex at best. The amount of time spent in scanning heavily depends on the visibility map. The initial vacuum scan of a table can take hours or more, but it does update the visibility map even if the vacuum itself is aborted later. The next vacuum may scan that table in almost no time at all, because it skips all blocks that are marked all visible. So the total time the scan takes is no yardstick I'd use.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
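Robert's proposal, as a sketch (numbers taken straight from the quoted paragraph; the helper name is invented):

```c
/* Retry budget: 10% of the time spent vacuuming, clamped to
 * [1 second, 1 minute], per the suggestion above. All values in ms. */
long
truncate_retry_budget_ms(long vacuum_time_ms)
{
    long budget = vacuum_time_ms / 10;

    if (budget < 1000)
        budget = 1000;      /* minimum of 1s */
    if (budget > 60000)
        budget = 60000;     /* maximum of 1min */
    return budget;
}
```

As Jan notes, the input here is unreliable: a second vacuum pass skips all-visible pages, so the previous run's scan time says little about the next one.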
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/4/2012 8:06 AM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> I believe the check interval needs to be decoupled from the deadlock_timeout again.
>
> OK
>
>> This will leave us with 2 GUCs at least.
>
> Hmm. What problems do you see with hard-coding reasonable values?

The question is: what is reasonable?

Let's talk about the time to (re)acquire the lock first. In the cases where truncating a table can hurt, we are dealing with many gigabytes. The original vacuumlazy scan of them can take hours if not days. During that scan the vacuum worker has probably spent many hours napping in the vacuum delay points. For me, a 50ms interval for 5 seconds would be reasonable for (re)acquiring that lock. The reasoning behind it is that we need some sort of retry mechanism, because if autovacuum just gave up the exclusive lock because someone needed access, it is more or less guaranteed that an immediate attempt to reacquire it will fail until that waiter has committed. But if it can't get the lock after 5 seconds, the system seems busy enough that autovacuum should come back much later, when the launcher kicks it off again. I don't care much about occupying that autovacuum worker for a few seconds. It just spent hours vacuuming that very table. How much harm will a couple more seconds do?

The check interval for the LockHasWaiters() call, however, depends very much on the response time constraints of the application. A 200ms interval, for example, would cause the truncate phase to hold onto the exclusive lock for at least 200ms. That means a steady stream of short running transactions would see 100ms of blocking on average, 200ms max. For many applications that is probably OK. If your response time constraint is <= 50ms on 98% of transactions, you might want to have that knob though. I admit I really have no idea what the most reasonable default for that value would be. Something between 50ms and deadlock_timeout/2, I guess.
Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
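The 50ms-for-5-seconds reacquisition schedule described above can be sketched deterministically (succeed_on_attempt stands in for the moment the conflicting transaction commits; no real lock manager calls are made):

```c
#include <stdbool.h>

#define TRUNCATE_LOCK_WAIT_INTERVAL_MS 50    /* nap between attempts */
#define TRUNCATE_LOCK_TIMEOUT_MS       5000  /* give up after this long */

/* Returns true if the (simulated) AccessExclusiveLock was obtained
 * before the timeout budget ran out; *attempts reports how many tries
 * were made. succeed_on_attempt = 0 means the lock never frees up. */
bool
reacquire_with_retry(int succeed_on_attempt, int *attempts)
{
    int elapsed_ms = 0;

    *attempts = 0;
    for (;;)
    {
        (*attempts)++;
        if (succeed_on_attempt != 0 && *attempts >= succeed_on_attempt)
            return true;            /* conditional lock attempt succeeded */
        if (elapsed_ms >= TRUNCATE_LOCK_TIMEOUT_MS)
            return false;           /* busy system: wait for the launcher */
        elapsed_ms += TRUNCATE_LOCK_WAIT_INTERVAL_MS;   /* nap and retry */
    }
}
```

With these numbers a worker makes at most 101 attempts (5000ms of napping at 50ms each, plus the initial try) before standing down and leaving the table to the next autovacuum round.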
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/4/2012 1:51 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> [arguments for GUCs]
>
> This is getting confusing. I thought I had already conceded the case for autovacuum_truncate_lock_try, and you appeared to spend most of your post arguing for it anyway. I think. It's a little hard to tell. Perhaps the best thing is to present the issue to the list and solicit more opinions on what to do. Please correct me if I mis-state any of this.
>
> The primary problem this patch is solving is that in some workloads, autovacuum will repeatedly try to truncate the unused pages at the end of a table, but will continually get canceled after burning resources because another process wants to acquire a lock on the table which conflicts with the one held by autovacuum. This is handled by the deadlock checker, so another process must block for the deadlock_timeout interval each time. All work done by the truncate phase of autovacuum is lost on each interrupted attempt. Statistical information is not updated, so another attempt will trigger the next time autovacuum looks at whether to vacuum the table. It's obvious that this pattern not only fails to release potentially large amounts of unused space back to the OS, but the headbanging can continue to consume significant resources for an extended period, and the repeated blocking for deadlock_timeout could cause latency problems.
>
> The patch has the truncate work, which requires AccessExclusiveLock, check at intervals for whether another process is waiting on its lock. That interval is one of the timings we need to determine, and for which a GUC was initially proposed. I think that the check should be fast enough that doing it once every 20ms as a hard-coded interval would be good enough.
>
> When it sees this situation, it truncates the file for as far as it has managed to get, releases its lock on the table, sleeps for an interval, and then checks to see if the lock has become available again. How long it should sleep between tries to reacquire the lock is another possible GUC. Again, I'm inclined to think that this could be hard-coded. Since autovacuum was knocked off-task after doing some significant work, I'm inclined to make this interval a little bigger, but I don't think it matters a whole lot.

Anything between 20ms and 100ms seems sane. Maybe 50ms?

> At any point that it is unable to acquire the lock, there is a check for how long this autovacuum task has been starved for the lock. Initially I was arguing for twice the deadlock_timeout on the basis that this would probably be short enough not to leave the autovacuum worker sidelined for too long, but long enough for the attempt to get past a single deadlock between two other processes. This is the setting Jan is least willing to concede. If the autovacuum worker does abandon the attempt, it will keep retrying, since we go out of our way to prevent the autovacuum process from updating the statistics based on the incomplete processing. This last interval is not how long it will attempt to truncate, but how long it will keep one autovacuum worker making unsuccessful attempts to acquire the lock before it is put to other uses. Workers will keep coming back to this table until the truncate phase is completed, just as they do without the patch; the difference being that anytime it gets the lock, even briefly, it is able to persist some progress.

That is all correct.

> So the question on the table is which of these three intervals should be GUCs, and what values to use if they aren't.

I could live with all the above defaults, but would like to see more comments on them.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
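The truncate-and-yield loop Kevin summarizes can be sketched as a simulation (waiter_after_pages stands in for the periodic LockHasWaiters() poll; this is not the real vacuumlazy.c code):

```c
typedef unsigned int BlockNumber;

/* Walk backward from the end of the relation, dropping empty pages,
 * and yield as soon as a (simulated) waiter shows up. Returns the new
 * relation size; progress made so far is kept even when we yield.
 * waiter_after_pages = number of pages we manage to truncate before a
 * conflicting lock request appears (0 = a waiter is already there). */
BlockNumber
truncate_until_waiter(BlockNumber rel_pages, BlockNumber nonempty_pages,
                      BlockNumber waiter_after_pages)
{
    BlockNumber blkno = rel_pages;
    BlockNumber done = 0;

    while (blkno > nonempty_pages)
    {
        if (done == waiter_after_pages)
            break;              /* persist progress, release the lock */
        blkno--;                /* page is empty: truncate past it */
        done++;
    }
    return blkno;
}
```

Anytime the worker gets the lock, even briefly, some pages come off the end for good, which is the key difference from the pre-patch behavior where the deadlock checker threw all the progress away.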
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/3/2012 5:42 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> Attached is a new patch that addresses most of the points raised in discussion before. 1) Most of the configuration variables are derived from deadlock_timeout now. The check for conflicting lock request interval is deadlock_timeout/10, clamped to 10ms. The try to acquire exclusive lock interval is deadlock_timeout/20, also clamped to 10ms. The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 2ms.
>
> If we're going to keep this GUC, we need docs for it.

Certainly. But since we're still debating which and how many GUC variables we want, I don't think doc-time has come yet.

>> I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.
>
> Arguably we could simplify the deadlock resolution code a little, but it seems like it is probably safer to leave it as a failsafe, at least for now.

Thinking about it, I'm not really happy with removing the autovacuum_truncate_lock_check GUC at all. Fact is that the deadlock detection code and the configuration parameter for it should IMHO have nothing to do with all this in the first place. A properly implemented application does not deadlock. Someone running such an application should be able to safely set deadlock_timeout to minutes without the slightest ill side effect, but with the benefit that the deadlock detection code itself does not add to the lock contention. The only reason one cannot do so today is that autovacuum's truncate phase could then freeze the application with an exclusive lock for that long.

I believe the check interval needs to be decoupled from the deadlock_timeout again. This will leave us with 2 GUCs at least.
Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
Attached is a new patch that addresses most of the points raised in discussion before.

1) Most of the configuration variables are derived from deadlock_timeout now. The check for conflicting lock request interval is deadlock_timeout/10, clamped to 10ms. The try to acquire exclusive lock interval is deadlock_timeout/20, also clamped to 10ms. The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 2ms.

I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.

2) The partition lock in LockHasWaiters() was lowered to LW_SHARED. The LW_EXCLUSIVE was indeed a copy/paste result.

3) The instr_time handling was simplified as suggested.

4) Lower case TRUE/FALSE.

I did not touch the part about suppressing the stats and the ANALYZE step of auto vacuum+analyze. The situation is no different today: when the deadlock code kills autovacuum, stats aren't updated either. And this patch is meant to cause autovacuum to finish the truncate in a few minutes or hours, so that the situation fixes itself.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin

[Attached patch: diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c, index c9253a9..e8f88ad. It adds the storage/proc.h and portability/instr_time.h includes, a lock_waiter_detected flag in LVRelStats, static autovacuum_truncate_lock_check/retry/wait settings, and changes to lazy_vacuum_rel() and lazy_truncate_heap() so that an early terminated truncate attempt suppresses the pgstat report and cancels a pending ANALYZE. The rest of the diff is cut off in the archive.]
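The interval derivation in point 1) can be sketched as follows (invented helper names; the clamp is read here as a 10ms floor):

```c
/* deadlock_timeout-derived intervals from this patch version, with a
 * 10ms floor. With the default deadlock_timeout of 1000ms this yields
 * a 100ms waiter-check interval and a 50ms lock-retry interval. */
static int
clamp_min_10ms(int ms)
{
    return (ms < 10) ? 10 : ms;
}

int
truncate_check_interval_ms(int deadlock_timeout_ms)
{
    return clamp_min_10ms(deadlock_timeout_ms / 10);    /* LockHasWaiters() poll */
}

int
truncate_retry_interval_ms(int deadlock_timeout_ms)
{
    return clamp_min_10ms(deadlock_timeout_ms / 20);    /* lock reacquire nap */
}
```

Tying both intervals to deadlock_timeout is the coupling Jan later argues against in this thread, since a long deadlock_timeout would then stretch the truncate phase's lock-holding behavior along with it.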
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 11/28/2012 3:33 PM, Kevin Grittner wrote:

Kevin Grittner wrote: I still need to review the timing calls, since I'm not familiar with them so it wasn't immediately obvious to me whether they were being used correctly. I have no reason to believe that they aren't, but feel I should check.

It seems odd to me that assignment of one instr_time to another is done with INSTR_TIME_SET_ZERO() of the target followed by INSTR_TIME_ADD() with the target and the source. It seems to me that simple assignment would be more readable, and I can't see any down side. Why shouldn't:

    INSTR_TIME_SET_ZERO(elapsed);
    INSTR_TIME_ADD(elapsed, currenttime);
    INSTR_TIME_SUBTRACT(elapsed, starttime);

instead be?:

    elapsed = currenttime;
    INSTR_TIME_SUBTRACT(elapsed, starttime);

And why shouldn't:

    INSTR_TIME_SET_ZERO(starttime);
    INSTR_TIME_ADD(starttime, currenttime);

instead be?:

    starttime = currenttime;

Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is starttime = currenttime; portable if those are structs?

Also, I want to do another pass looking just for off-by-one errors on blkno. Again, I have no reason to believe that there is a problem; it just seems like it would be a particularly easy type of mistake to make and miss when a key structure has this field:

    BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */

No problems found with that.

And I want to test more. The patch worked as advertised in all my tests, but I became uncomfortable with the games being played with the last autovacuum timestamp and the count of dead tuples. Sure, that causes autovacuum to kick back in until it has dealt with the truncation, but it makes it hard for a human looking at the stat views to see what's going on, and I'm not sure it wouldn't lead to bad plans due to stale statistics. 
Personally, I would rather see us add a boolean to indicate that autovacuum was needed (regardless of the math on the other columns) and use that to control the retries -- leaving the other columns free to reflect reality. You mean to extend the stats by yet another column? The thing is that this whole case happens rarely. We see this every other month or so and only on a rolling window table after it got severely bloated due to some abnormal use (purging disabled for some operational reason). The whole situation resolves itself after a few minutes to maybe an hour or two. Personally I don't think living with a few wrong stats on one table for that time is so bad that it warrants changing that much more code. Lower casing TRUE/FALSE will be done. If the LW_SHARED is enough in LockHasWaiters(), that's fine too. I think we have a consensus that the check interval should be derived from deadlock_timeout/10 clamped to 10ms. I'm going to remove that GUC. About the other two, I'm just not sure. Maybe using half the value as for the lock waiter interval as the lock retry interval, again clamped to 10ms, and simply leaving one GUC that controls how long autovacuum should retry this. I'm not too worried about the retry interval. However, the problem with that overall retry duration is that this value very much depends on the usage pattern of the database. If long running transactions (like 5 seconds) are the norm, then a hard coded value of 2 seconds before autovacuum gives up isn't going to help much. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 11/29/2012 9:46 AM, Tom Lane wrote: Jan Wieck janwi...@yahoo.com writes: On 11/28/2012 3:33 PM, Kevin Grittner wrote: Resetting starttime this way seems especially odd. instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is starttime = currenttime; portable if those are structs?

Sure. We rely on struct assignment in lots of places.

Will be done then.

Thanks, Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] Logical to physical page mapping
On 10/27/2012 2:41 PM, Heikki Linnakangas wrote: And it's not at all clear to me that it would perform better than full_page_writes. You're writing and flushing out roughly the same amount of data AFAICS.

I think this assumption is wrong. full_page_writes=on means we write the full page content to WAL on the first change after a checkpoint. A change after a checkpoint logically means that the same page is dirty now and must also be written at the latest during the next checkpoint, which means 16K written minimum for every page changed after a checkpoint.

What exactly is the problem with full_page_writes that we're trying to solve?

Full page writes are meant to guard against torn pages. That is the case when an 8K page is written by the underlying OS/filesystem/HW in smaller chunks (for example 512 byte sectors), and in the case of a crash some of these chunks make it, others don't. Without full_page_writes, crash recovery can work if all 8K made it, or nothing made it (aka atomic writes). But it will fail otherwise.

The amount of WAL generated with full_page_writes=on is quite substantial. For pgbench, for example, the ratio is 20:1. Meaning with full_page_writes you write 20x the amount you do without.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] Logical to physical page mapping
On 10/28/2012 10:50 PM, Peter Geoghegan wrote: On 28 October 2012 22:35, Jan Wieck janwi...@yahoo.com wrote: The amount of WAL generated with full_page_writes=on is quite substantial. For pgbench for example the ratio 20:1. Meaning with full_page_writes you write 20x the amount you do without.

Sure, but as always, pgbench pessimises everything by having writes follow a uniform distribution, which is completely unrepresentative of the natural world. This will literally maximise the number of pristine full page images that need to be included. The actual frequency of checkpoints is a crucial factor here too, and you didn't mention anything about that. Higher buffer cache hit rates certainly reduce that ratio.

Well, I guess it was just one of those random thoughts that can't work in the end or aren't worth the work anyway.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
Stephen,

On 10/24/2012 10:46 PM, Stephen Frost wrote: Jan, * Jan Wieck (janwi...@yahoo.com) wrote: This problem has been discussed before. Those familiar with the subject please skip the next paragraph.

Apologies if this was already thought-of and ruled out for some reason, but...

Because all the scanning had been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty.

Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?

The notion of having to double-scan and the AccessExclusiveLock on the relation are telling me this work-around, while completely possible, isn't exactly ideal...

Under normal circumstances with just a few pages to trim off the end this is no problem. Those pages were the last pages just scanned by this very autovacuum, so they are found in the shared buffers anyway. All the second scan does in that case is to fetch the page once more from shared buffers to be 100% sure we are not truncating off new tuples. We definitely need the AccessExclusiveLock to prevent someone from extending the relation at the end between our check for relation size and the truncate. Fetching 50 empty blocks from the buffer cache while at it isn't that big of a deal and that is what it normally looks like.

The problem case this patch is dealing with is rolling window tables that experienced some bloat. The typical example is a log table that has new data constantly added and the oldest data constantly purged out. This data normally rotates through some blocks like a rolling window. 
If for some reason (purging turned off for example) this table bloats by several GB and later shrinks back to its normal content, soon all the used blocks are at the beginning of the heap and we find tens of thousands of empty pages at the end. Only now does the second scan take more than 1000ms and autovacuum is at risk to get killed while at it.

Since we have experienced this problem several times now on our production systems, something clearly needs to be done. But IMHO it doesn't happen often enough to take any risk here.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 10/25/2012 9:45 AM, Tom Lane wrote: Jan Wieck janwi...@yahoo.com writes: On 10/24/2012 10:46 PM, Stephen Frost wrote: Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)? I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?

No. Backends are under no obligation to update FSM for each individual tuple insertion, and typically don't do so. More to the point, you have to take AccessExclusiveLock *anyway*, because this is interlocking not only against new insertions but plain read-only seqscans: if a seqscan falls off the end of the table it will be very unhappy. So I don't see where we'd buy anything by consulting the FSM.

Thank you.

One thing that I haven't mentioned yet is that with this patch, we could actually insert a vacuum_delay_point() into the loop in count_nondeletable_pages(). We no longer cling to the exclusive lock but rather get out of the way as soon as somebody needs the table. Under this condition we no longer need to do the second scan full bore.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 10/25/2012 10:12 AM, Stephen Frost wrote: Jan, * Jan Wieck (janwi...@yahoo.com) wrote: The problem case this patch is dealing with is rolling window tables that experienced some bloat. The typical example is a log table, that has new data constantly added and the oldest data constantly purged out. This data normally rotates through some blocks like a rolling window. If for some reason (purging turned off for example) this table bloats by several GB and later shrinks back to its normal content, soon all the used blocks are at the beginning of the heap and we find tens of thousands of empty pages at the end. Only now does the second scan take more than 1000ms and autovacuum is at risk to get killed while at it.

My concern is that this could certainly also happen to a heavily updated table in an OLTP type of environment where the requirement to take a heavy lock to clean it up might prevent it from ever happening. I was simply hoping we could find a mechanism to lock just those pages we're getting ready to nuke rather than the entire relation. Perhaps we can consider how to make those changes alongside of changes to eliminate or reduce the extent locking that has been painful (for me at least) when doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated table. Preventing not only the clean up, but the following ANALYZE as well is precisely what happens. There may be multiple ways how to get into this situation, but once you're there the symptoms are the same. Vacuum fails to truncate it, causing a 1 second hiccup every minute, while vacuum is holding the exclusive lock until the deadlock detection code of another transaction kills it.

My patch doesn't change the logic how we ensure that we don't zap any data by accident with the truncate and Tom's comments suggest we should stick to it. It only makes autovacuum check frequently if the AccessExclusiveLock is actually blocking anyone and then get out of the way. 
I would rather like to discuss any ideas how to do all this without 3 new GUCs.

In the original code, the maximum delay that autovacuum can cause by holding the exclusive lock is one deadlock_timeout (default 1s). It would appear reasonable to me to use max(deadlock_timeout/10,10ms) as the interval to check for a conflicting lock request. For another transaction that needs to access the table this is 10 times faster than it is now and still guarantees that autovacuum will make some progress with the truncate.

The other two GUCs control how often and how fast autovacuum tries to acquire the exclusive lock in the first place. Since we actively release the lock *because someone needs it*, it is pretty much guaranteed that the immediate next lock attempt fails. We on purpose do a ConditionalLockRelation() because there is a chance to deadlock. The current code only tries one lock attempt and gives up immediately. I don't know from what to derive a good value for how long to retry, but the nap time in between tries could be a hardcoded 20ms or using the cost based vacuum nap time (which defaults to 20ms).

Any other ideas are welcome.

Thanks, Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
[HACKERS] autovacuum truncate exclusive lock round two
This problem has been discussed before. Those familiar with the subject please skip the next paragraph. When autovacuum finds a substantial amount of empty pages at the end of a relation, it attempts to truncate it in lazy_truncate_heap(). Because all the scanning had been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty. To do that autovacuum grabs an AccessExclusiveLock on the relation, then scans backwards to the last non-empty page. If any other backend needs to access that table during this time, it will kill the autovacuum from the deadlock detection code, which by default is done after a 1000ms timeout. The autovacuum launcher will start another vacuum after (default) 60 seconds, which most likely is getting killed again, and again, and again. The net result of this is that the table is never truncated and every 60 seconds there is a 1 second hiccup before the autovacuum is killed. Proposal: Add functions to lmgr that are derived from the lock release code, but instead of releasing the lock and waking up waiters, just return a boolean telling if there are any waiters that would be woken up if this lock was released. Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to periodically check, if there is a conflicting lock request waiting. If not, keep going. If there is a waiter, truncate the relation to the point checked thus far, release the AccessExclusiveLock, then loop back to where we acquire this lock in the first place and continue checking/truncating. I have a working patch here: https://github.com/wieck/postgres/tree/autovacuum-truncate-lock This patch does introduce three new postgresql.conf parameters, which I would be happy to get rid of if we could derive them from something else. Something based on the deadlock timeout may be possible. 
autovacuum_truncate_lock_check = 100ms   # how frequently to check
                                         # for conflicting locks
autovacuum_truncate_lock_retry = 50      # how often to try acquiring
                                         # the exclusive lock
autovacuum_truncate_lock_wait = 20ms     # nap in between attempts

With these settings, I see the truncate of a bloated table progressing at a rate of 3 minutes per GB, while that table is accessed 20 times per second.

The original kill autovacuum mechanism in the deadlock code is still there. All this code really does is 10 lmgr lookups per second and releasing the AccessExclusiveLock if there are any waiters. I don't think it can get any cheaper than this.

I am attaching a script that uses pgbench to demonstrate the actual problem of a bloated table with significant empty pages at the end.

Comments?

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin

t1.autovac-lock-issue.tgz Description: application/compressed
Re: [HACKERS] autovacuum truncate exclusive lock round two
Here is the patch for it.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c9253a9..9f880f0 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 57,62 ****
--- 57,63 ----
  #include "utils/pg_rusage.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "portability/instr_time.h"

  /*
*************** typedef struct LVRelStats
*** 103,108 ****
--- 104,110 ----
  	ItemPointer dead_tuples;	/* array of ItemPointerData */
  	int			num_index_scans;
  	TransactionId latestRemovedXid;
+ 	bool		lock_waiter_detected;
  } LVRelStats;
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 193,198 ****
--- 195,202 ----
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
  	vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples;
  	vacrelstats->num_index_scans = 0;
+ 	vacrelstats->pages_removed = 0;
+ 	vacrelstats->lock_waiter_detected = false;

  	/* Open all indexes of the relation */
  	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 259,268 ****
  						vacrelstats->hasindex,
  						new_frozen_xid);

! 	/* report results to the stats collector, too */
! 	pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);

  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
--- 263,280 ----
  						vacrelstats->hasindex,
  						new_frozen_xid);

! 	/*
! 	 * report results to the stats collector, too.
! 	 * An early terminated lazy_truncate_heap attempt
! 	 * suppresses the message and also cancels the
! 	 * execution of ANALYZE, if that was ordered.
! 	 */
! 	if (!vacrelstats->lock_waiter_detected)
! 		pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
+ 	else
+ 		vacstmt->options &= ~VACOPT_ANALYZE;

  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
*************** lazy_truncate_heap(Relation onerel, LVRe
*** 1255,1334 ****
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;

  	pg_rusage_init(&ru0);

  	/*
! 	 * We need full exclusive lock on the relation in order to do truncation.
! 	 * If we can't get it, give up rather than waiting --- we don't want to
! 	 * block other backends, and we don't want to deadlock (which is quite
! 	 * possible considering we already hold a lower-grade lock).
! 	 */
! 	if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
! 		return;
!
! 	/*
! 	 * Now that we have exclusive lock, look to see if the rel has grown
! 	 * whilst we were vacuuming with non-exclusive lock.  If so, give up; the
! 	 * newly added pages presumably contain non-deletable tuples.
  	 */
! 	new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 	if (new_rel_pages != old_rel_pages)
  	{
  		/*
! 		 * Note: we intentionally don't update vacrelstats->rel_pages with the
! 		 * new rel size here.  If we did, it would amount to assuming that the
! 		 * new pages are empty, which is unlikely.  Leaving the numbers alone
! 		 * amounts to assuming that the new pages have the same tuple density
! 		 * as existing ones, which is less unlikely.
  		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
!
! 	/*
! 	 * Scan backwards from the end to verify that the end pages actually
! 	 * contain no tuples.  This is *necessary*, not optional, because other
! 	 * backends could have added tuples to these pages whilst we were
! 	 * vacuuming.
! 	 */
! 	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
!
! 	if (new_rel_pages >= old_rel_pages)
! 	{
! 		/* can't do anything after all */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
!
! 	/*
! 	 * Okay to truncate.
! 	 */
! 	RelationTruncate(onerel, new_rel_pages);
!
! 	/*
! 	 * We can release the exclusive lock as soon as we have truncated.  Other
! 	 * backends can't safely access the relation until they have processed the
! 	 * smgr invalidation that smgrtruncate sent out ... but that should happen
! 	 * as part of standard invalidation processing once they acquire lock on
! 	 * the relation.
! 	 */
! 	UnlockRelation(onerel, AccessExclusiveLock);
!
! 	/*
! 	 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 	 * also touching reltuples, since the tuple count wasn't changed by the
! 	 * truncation.
! 	 */
! 	vacrelstats->rel_pages = new_rel_pages;
! 	vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
!
! 	ereport(elevel,
! 			(errmsg("\"%s\": truncated %u to %u pages",
! 					RelationGetRelationName(onerel),
! 					old_rel_pages, new_rel_pages),
! 			 errdetail("%s.",
! 					   pg_rusage_show(&ru0))));
  }

  /*
--- 1267,1388 ----
Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
On 10/15/2012 4:43 PM, Simon Riggs wrote: Jan spoke at length at PgCon, for all to hear, that what we are building is a much better way than the trigger logging approach Slony uses. I don't take that as carte blanche for approval of everything being done, but its going in the right direction with an open heart, which is about as good as it gets.

The mechanism you are building for capturing changes is certainly a lot better than what Bucardo, Londiste and Slony are doing today. That much is true.

The flip side of the coin however is that all of today's logical replication systems are designed Postgres version agnostic to a degree. This means that the transition time from the existing, trigger based approach to the new WAL based mechanism will see both technologies in parallel, which is no small thing to support. And that transition time may last for a good while. We still have people installing Slony 1.2 because 2.0 (3 years old by now) requires Postgres 8.3 minimum.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
On 10/15/2012 3:25 PM, Andres Freund wrote: On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote: On 15 October 2012 19:19, Bruce Momjian br...@momjian.us wrote: I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future?

Well. I don't think (hope) anybody proposed making something release worthy for slony but rather a POC patch that proves the API is generic enough to be used by them. If I (or somebody else familiar with this) work together with somebody familiar with slony internals I think such a POC shouldn't be too hard to do. I think some more input from that side is a good idea. I plan to send out an email to possibly interested parties in about two weeks...

What Slony essentially sends to the receiver node is a COPY stream in the format Christopher described. That stream is directly copied into the receiving node's sl_log_N table and picked up there by an apply trigger BEFORE INSERT, that performs the corresponding INSERT/UPDATE/DELETE operation via prepared plans against the user tables. For a POC I think it is sufficient to demonstrate that this copy stream can be generated out of the WAL decoding.

Note that Slony today does not touch columns in an UPDATE that have not changed in the original UPDATE on the origin. Sending toasted column values that haven't changed would be a substantial change to the storage efficiency on the replica. The consequence of this is that the number of columns that need to be in the UPDATE's SET clause varies. log_cmdupdncols is there to separate the new column/value pairs from the column/key pairs of the updated row. The old row key in Slony is based on a unique index (preferably a PK, but any unique key will do). 
This makes cmdupdncols simply the number of column/value pairs minus the number of key columns. So it isn't too hard to figure out.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin