Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
On Mon, Jul 18, 2016 at 10:05 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Jan Wieck <j...@wi3ck.info> writes: > > In the meantime, would it be appropriate to backpatch the double linking > > of memory context children at this time? I believe it has had plenty of > > testing in the 9.6 cycle to be sure it didn't break anything. > > I'm concerned about the ABI breakage risk from changing a data structure > as fundamental as MemoryContext. Yeah, code outside utils/mmgr probably > *shouldn't* be looking at that struct, but that doesn't mean it isn't. > In the past we've generally only taken that sort of risk when there was > no other way to fix a bug; and this patch isn't a bug fix. While this > does help performance in some corner cases, I don't think it's enough of > an across-the-board win to justify the risk of back-patching. > I would consider mucking with the linked lists of memory context children inside of 3rd party code a really bad idea, but I concede. It isn't a bug fix and there is that small risk that somebody did precisely that, so no backpatch. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
On Mon, Jul 18, 2016 at 9:43 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Merlin Moncure <mmonc...@gmail.com> writes: > > BTW, while the fix does address the cleanup performance issue, it's > > still the case that anonymous code blocks burn up lots of resident > > memory (my 315k example I tested with ate around 8gb IIRC) when run > > like this. My question is, if the pl/pgsql code block is anonymous > > and not in some kind of a loop, why bother caching the plan at all? > > Nobody got around to it. Also, as you note, it's not as simple as > "don't cache if in a DO block". You'd need to track whether you were > inside any sort of looping construct. Depending on how difficult > that turned out to be, it might add overhead to regular functions > that we don't want. Agreed. And from the structures themselves it is not really easy to detect whether we are inside of a loop; the toplevel, while, for and if all use the same statement block and call exec_stmts(), which in turn calls exec_stmt() for each element in that list. It is not impossible to add a flag, set at PL compile time, to that element and check it every time the statement is executed. But such a change definitely needs more testing and probably won't qualify for backpatching. In the meantime, would it be appropriate to backpatch the double linking of memory context children at this time? I believe it has had plenty of testing in the 9.6 cycle to be sure it didn't break anything. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] One process per session lack of sharing
On Sun, Jul 17, 2016 at 9:28 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Sun, Jul 17, 2016 at 3:12 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > > Robert Haas <robertmh...@gmail.com> writes: > >> On Fri, Jul 15, 2016 at 4:28 AM, Craig Ringer <cr...@2ndquadrant.com> > wrote: > >>> I don't think anyone's considering moving from multi-processing to > >>> multi-threading in PostgreSQL. I really, really like the protection > that the > >>> shared-nothing-by-default process model gives us, among other things. > > > >> We get some very important protection by having the postmaster in a > >> separate address space from the user processes, but separating the > >> other backends from each other has no value. > > > > I do not accept that proposition in the least. For one thing, debugging > > becomes an order of magnitude harder when you've got multiple threads > > in the same address space: you have essentially zero guarantees about > > what one thread might have done to the supposedly-private state of > > another one. > > Well, that's true, in theory. In practice, random memory clobbers are > a pretty rare type of bug. The chances that thread A crashed because > thread B overwrote its supposedly-private state are just not very > high. Also, such bugs are extremely hard to troubleshoot even when > there is only one process and one thread involved, so it's not like things > are a rose garden today. I don't buy the argument that it's worth > giving up arbitrary amounts of performance and functionality for this. > Random memory clobbers are partly rare because they often aren't triggered. Many of them are of the dangling pointer type, and in a single threaded process there is less of a chance to allocate and overwrite the free'd and then used memory. But you are right, all of them are tricky to hunt and I remember using hardware watchpoints and whatnot in the past. They make your day though when you finally find them. 
So you will be sorry when the last one is hunted down. > > >> ... enough other people have >> written complex, long-running multithreaded programs that I think it >> is probably possible to do so without unduly compromising reliability. > > > > I would bet that every single successful project of that sort has been > > written with threading in mind from the get-go. Trying to retro-fit > > threading onto thirty years' worth of single-threaded coding is a recipe > > for breaking your project; even if you had control of all the code > running > > in the address space, which we assuredly do not. > > I admit that it is risky, but I think there are things that could be > done to limit the risk. I don't believe we can indefinitely continue > to ignore the potential performance benefits of making a switch like > this. Breaking a thirty-year old code base irretrievably would be > sad, but letting it fade into irrelevance because we're not willing to > make the architecture changes that are needed to remain relevant would > be sad, too. > I have to agree with Robert on that one. We were already "thinking" about multi-threading some 16 years ago. We were aware of the dangers and yet at least considered doing it some day for things like a parallel executor. And that would probably still be our best bang for the buck. The risks of jamming all sessions into a single, multi-threaded process are huge. Think of snapshot visibility together with catalog cache invalidations. I'd say no to that one as a first step. But multi-threading the executor or even certain utility commands at first should not be rejected purely on the notion that "we don't have multithreading today." 
Regards, Jan > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
BTW, here is the email thread about the double-linking MemoryContext children patch that Kevin eventually committed to master. https://www.postgresql.org/message-id/55F2D834.8040106%40wi3ck.info Regards, Jan On Sat, Jul 16, 2016 at 3:47 PM, Jan Wieck <j...@wi3ck.info> wrote: > > > On Tue, Jul 12, 2016 at 3:29 PM, Merlin Moncure <mmonc...@gmail.com> > wrote: > >> I've noticed that pl/pgsql functions/do commands do not behave well >> when the statement resolves and frees memory. To be clear: >> >> FOR i in 1..100 >> LOOP >> INSERT INTO foo VALUES (i); >> END LOOP; >> >> ...runs just fine while >> >> BEGIN >> INSERT INTO foo VALUES (1); >> INSERT INTO foo VALUES (2); >> ... >> INSERT INTO foo VALUES (100); >> END; >> > > This sounds very much like what led to > commit 25c539233044c235e97fd7c9dc600fb5f08fe065. > > It seems that patch was only applied to master and never backpatched to > 9.5 or earlier. > > > Regards, Jan > > > > >> >> (for the curious, create a script yourself via >> copy ( >> select >> 'do $$begin create temp table foo(i int);' >> union all select >> format('insert into foo values (%s);', i) from >> generate_series(1,100) i >> union all select 'raise notice ''abandon all hope!''; end; $$;' >> ) to '/tmp/breakit.sql'; >> >> ...which will consume amounts of resident memory proportional to the number >> of statements and eventually crash the server. The problem is obvious; >> each statement causes a plan to get created and the server gets stuck >> in a loop where SPI_freeplan() is called repeatedly. Everything is >> working as designed I guess, but when this happens it's really >> unpleasant: the query is uncancellable and unterminatable, nicht gut. >> A pg_ctl kill ABRT will do the trick but I was quite astonished >> to see linux take a few minutes to clean up the mess (!) on a somewhat >> pokey virtualized server with lots of memory. With even as few as >> ten thousand statements the cleanup time far exceeds the runtime of the >> statement block. 
>> >> I guess the key takeaway here is, "don't do that"; pl/pgsql >> aggressively generates plans and turns out to be a poor choice for >> bulk loading because of all the plan caching. Having said that, I >> can't help but wonder if there should be a (perhaps user configurable) >> limit to the number of SPI plans a single function call should be able >> to acquire, on the basis that you are going to smack into very poor >> behaviors in the memory subsystem. >> >> Stepping back, I can't help but wonder what the value of all the plan >> caching going on is at all for statement blocks. Loops might comprise >> a notable exception, noted. I'd humbly submit though that (relative >> to functions) it's much more likely to want to do something like >> insert a lot of statements and impossible to utilize any cached >> plans. >> >> This is not an academic gripe -- I just exploded production :-D. >> >> merlin >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> > > > > -- > Jan Wieck > Senior Postgres Architect > -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] A Modest Upgrade Proposal
On Sun, Jul 17, 2016 at 2:08 PM, Jim Nasby <jim.na...@bluetreble.com> wrote: > On 7/13/16 2:06 PM, Joshua D. Drake wrote: > >> On 07/07/2016 01:01 PM, Robert Haas wrote: >> >> There was an unconference session on this topic at PGCon and quite a >>> number of people there stated that they found DDL to be an ease-of-use >>> feature and wanted to have it. >>> >> >> Yeah, I haven't met anyone yet that would like to have: >> >> select replicate_these_relations('['public']); >> >> vs: >> >> ALTER SCHEMA public ENABLE REPLICATION; >> >> (or something like that). >> > > I generally agree, but I think the more important question is "Why?". Is > it because DDL looks more like a sentence? Is it because arrays are a PITA? > Is it too hard to call functions? Once you get fine-grained enough to support replicating different sets of possibly overlapping objects/namespaces to different groups of recipients, the DDL approach becomes just as convoluted as calling functions and nobody will memorize the entire syntax. Jan > > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] One process per session lack of sharing
On Sun, Jul 17, 2016 at 3:23 AM, Craig Ringer <cr...@2ndquadrant.com> wrote: > > > Lots more could be shared, too. Cached plans, for example. > But the fact that PostgreSQL has transactional DDL complicates things like a shared plan cache and shared PL/pgSQL execution trees. Those things are much easier in a trivial database implementation, where an ALTER TABLE is just trampling over a running transaction. Regards, Jan -- Jan Wieck Senior Postgres Architect http://pgblog.wi3ck.info
Re: [HACKERS] DO with a large amount of statements get stuck with high memory consumption
On Tue, Jul 12, 2016 at 3:29 PM, Merlin Moncure <mmonc...@gmail.com> wrote: > I've noticed that pl/pgsql functions/do commands do not behave well > when the statement resolves and frees memory. To be clear: > > FOR i in 1..100 > LOOP > INSERT INTO foo VALUES (i); > END LOOP; > > ...runs just fine while > > BEGIN > INSERT INTO foo VALUES (1); > INSERT INTO foo VALUES (2); > ... > INSERT INTO foo VALUES (100); > END; > This sounds very much like what led to commit 25c539233044c235e97fd7c9dc600fb5f08fe065. It seems that patch was only applied to master and never backpatched to 9.5 or earlier. Regards, Jan > > (for the curious, create a script yourself via > copy ( > select > 'do $$begin create temp table foo(i int);' > union all select > format('insert into foo values (%s);', i) from > generate_series(1,100) i > union all select 'raise notice ''abandon all hope!''; end; $$;' > ) to '/tmp/breakit.sql'; > > ...which will consume amounts of resident memory proportional to the number > of statements and eventually crash the server. The problem is obvious; > each statement causes a plan to get created and the server gets stuck > in a loop where SPI_freeplan() is called repeatedly. Everything is > working as designed I guess, but when this happens it's really > unpleasant: the query is uncancellable and unterminatable, nicht gut. > A pg_ctl kill ABRT will do the trick but I was quite astonished > to see linux take a few minutes to clean up the mess (!) on a somewhat > pokey virtualized server with lots of memory. With even as few as > ten thousand statements the cleanup time far exceeds the runtime of the > statement block. > > I guess the key takeaway here is, "don't do that"; pl/pgsql > aggressively generates plans and turns out to be a poor choice for > bulk loading because of all the plan caching. 
Having said that, I > can't help but wonder if there should be a (perhaps user configurable) > limit to the number of SPI plans a single function call should be able > to acquire, on the basis that you are going to smack into very poor > behaviors in the memory subsystem. > > Stepping back, I can't help but wonder what the value of all the plan > caching going on is at all for statement blocks. Loops might comprise > a notable exception, noted. I'd humbly submit though that (relative > to functions) it's much more likely to want to do something like > insert a lot of statements and impossible to utilize any cached > plans. > > This is not an academic gripe -- I just exploded production :-D. > > merlin > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Jan Wieck Senior Postgres Architect
Re: [HACKERS] WAL Re-Writes
On 01/27/2016 08:30 AM, Amit Kapila wrote: operation. Now the reason the OS couldn't find the corresponding block in memory is that, while closing the WAL file, we use POSIX_FADV_DONTNEED if wal_level is less than 'archive', which leads to this problem. So with this experiment, the conclusion is that though we can avoid re-writes of WAL data by doing exact writes, it could lead to a significant reduction in TPS. POSIX_FADV_DONTNEED isn't the only way those blocks would vanish from OS buffers. If I am not mistaken, we recycle WAL segments in a round robin fashion. In a properly configured system, where the reason for a checkpoint is usually "time" rather than "xlog", a recycled WAL file being written to has been closed and not touched for about a complete checkpoint_timeout or longer. You must have a really big amount of spare RAM in the machine to still find those blocks in memory. Basically we are talking about the active portion of your database, shared buffers, the sum of all process local memory and the complete pg_xlog directory content fitting into RAM. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/18/2015 10:47 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: Attached is a complete rework of the fix from scratch, based on Tom's suggestion. The code now maintains a double linked list as suggested, but only uses it to mark all currently valid entries as invalid when hashvalue == 0. If a specific entry is invalidated it performs a hash lookup for maximum efficiency in that case. That does not work; you're ignoring the possibility of hashvalue collisions, and for that matter not considering that the hash value is not equal to the hash key. I don't think your code is invalidating the right entry at all during a foreign key constraint update, and it certainly cannot do the right thing if there's a hash value collision. Attached is something closer to what I was envisioning; can you do performance testing on it? Sorry for the delay. Yes, that patch also has the desired performance for restoring a schema with hundreds of thousands of foreign key constraints. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 12:02 PM, Jan Wieck wrote: On 09/15/2015 11:54 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: Would it be appropriate to use (Un)RegisterXactCallback() in case of detecting excessive sequential scanning of that cache and remove all invalid entries from it at that time? Another idea is that it's not the size of the cache that's the problem, it's the cost of finding the entries that need to be invalidated. So we could also consider adding list links that chain only the entries that are currently marked valid. If the list gets too long, we could mark them all invalid and thereby reset the list to nil. This doesn't do anything for the cache's space consumption, but that wasn't what you were worried about anyway was it? That sounds like a workable solution to this edge case. I'll see how that works. Attached is a complete rework of the fix from scratch, based on Tom's suggestion. The code now maintains a double linked list as suggested, but only uses it to mark all currently valid entries as invalid when hashvalue == 0. If a specific entry is invalidated it performs a hash lookup for maximum efficiency in that case. This code does pass the regression tests with CLOBBER_CACHE_ALWAYS, does have the desired performance impact on restore of hundreds of thousands of foreign key constraints and does not show any negative impact on bulk loading of data with foreign keys. 
Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 61edde9..eaec737 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -125,6 +125,8 @@ typedef struct RI_ConstraintInfo
 										 * PK) */
 	Oid			ff_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (FK =
 										 * FK) */
+	struct RI_ConstraintInfo *prev_valid;	/* Previous valid entry */
+	struct RI_ConstraintInfo *next_valid;	/* Next valid entry */
 } RI_ConstraintInfo;
 
@@ -185,6 +187,7 @@ typedef struct RI_CompareHashEntry
 static HTAB *ri_constraint_cache = NULL;
 static HTAB *ri_query_cache = NULL;
 static HTAB *ri_compare_cache = NULL;
+static RI_ConstraintInfo *ri_constraint_cache_valid_list = NULL;
 
 /* --
@@ -2924,6 +2927,19 @@ ri_LoadConstraintInfo(Oid constraintOid)
 	ReleaseSysCache(tup);
 
+	/*
+	 * At the time a cache invalidation message is processed there may be
+	 * active references to the cache. Because of this we never remove entries
+	 * from the cache, but only mark them invalid. For efficient processing
+	 * of invalidation messages below we keep a double linked list of
+	 * currently valid entries.
+	 */
+	if (ri_constraint_cache_valid_list != NULL)
+		ri_constraint_cache_valid_list->prev_valid = riinfo;
+	riinfo->prev_valid = NULL;
+	riinfo->next_valid = ri_constraint_cache_valid_list;
+	ri_constraint_cache_valid_list = riinfo;
+
 	riinfo->valid = true;
 
 	return riinfo;
@@ -2940,17 +2956,55 @@ ri_LoadConstraintInfo(Oid constraintOid)
 static void
 InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
 {
-	HASH_SEQ_STATUS status;
-	RI_ConstraintInfo *hentry;
+	RI_ConstraintInfo *hentry;
+	RI_ConstraintInfo *hnext;
+	bool		found;
 
 	Assert(ri_constraint_cache != NULL);
 
-	hash_seq_init(&status, ri_constraint_cache);
-	while ((hentry = (RI_ConstraintInfo *) hash_seq_search(&status)) != NULL)
+	if (hashvalue == 0)
 	{
-		if (hentry->valid &&
-			(hashvalue == 0 || hentry->oidHashValue == hashvalue))
+		/*
+		 * Set all valid entries in the cache to invalid and clear the
+		 * list of valid entries.
+		 */
+		hnext = ri_constraint_cache_valid_list;
+		while (hnext)
+		{
+			hentry = hnext;
+			hnext = hnext->next_valid;
+
+			Assert(hentry->valid);
+
+			hentry->prev_valid = NULL;
+			hentry->next_valid = NULL;
 			hentry->valid = false;
+		}
+		ri_constraint_cache_valid_list = NULL;
+	}
+	else
+	{
+		/*
+		 * Search for the specified entry and set that one invalid
+		 * and remove it from the list of valid entries.
+		 */
+		hentry = (RI_ConstraintInfo *) hash_search(ri_constraint_cache,
												   (void *) &hashvalue,
												   HASH_FIND, &found);
+		if (found && hentry->valid)
+		{
+			Assert(hentry->oidHashValue == hashvalue);
+
+			if (hentry->prev_valid != NULL)
+				hentry->prev_valid->next_valid = hentry->next_valid;
+			else
+				ri_constraint_cache_valid_list = hentry->next_valid;
+			if (hentry->next_valid != NULL)
+				hentry->next_valid->prev_valid = hentry->prev_valid;
+			hentry->prev_valid = NULL;
+			hentry->next_valid = NULL;
+			hentry->valid = false;
+		}
+	}
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 11:54 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: Would it be appropriate to use (Un)RegisterXactCallback() in case of detecting excessive sequential scanning of that cache and remove all invalid entries from it at that time? Another idea is that it's not the size of the cache that's the problem, it's the cost of finding the entries that need to be invalidated. So we could also consider adding list links that chain only the entries that are currently marked valid. If the list gets too long, we could mark them all invalid and thereby reset the list to nil. This doesn't do anything for the cache's space consumption, but that wasn't what you were worried about anyway was it? That sounds like a workable solution to this edge case. I'll see how that works. Thanks, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 10:58 AM, Tom Lane wrote: I wrote: Jan Wieck <j...@wi3ck.info> writes: I'm able to reproduce that failure with CLOBBER_CACHE_ALWAYS and it definitely is caused by this commit. Do you want to back it out for the time being? Kevin is on vacation right now. I'll take a look today, and either fix it if I can find the problem quickly or back it out. The answer is "back it out", because this patch is fundamentally and irretrievably broken. You can't just clobber the hashtable like that, because there are very possibly active uses of hashtable entries in outer function call levels. Ok. Would it be appropriate to use (Un)RegisterXactCallback() in case of detecting excessive sequential scanning of that cache and remove all invalid entries from it at that time? Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/14/2015 09:56 AM, Tom Lane wrote: Kevin Grittner <kgri...@postgresql.org> writes: Fix an O(N^2) problem in foreign key references. Judging from the buildfarm, this patch is broken under CLOBBER_CACHE_ALWAYS. See friarbird's results in particular. I might be too quick to finger this patch, but nothing else lately has touched foreign-key behavior, and the foreign_key test is where things are going south. I'm able to reproduce that failure with CLOBBER_CACHE_ALWAYS and it definitely is caused by this commit. Do you want to back it out for the time being? Kevin is on vacation right now. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix an O(N^2) problem in foreign key references.
On 09/15/2015 09:51 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: On 09/14/2015 09:56 AM, Tom Lane wrote: Kevin Grittner <kgri...@postgresql.org> writes: Fix an O(N^2) problem in foreign key references. Judging from the buildfarm, this patch is broken under CLOBBER_CACHE_ALWAYS. See friarbird's results in particular. I might be too quick to finger this patch, but nothing else lately has touched foreign-key behavior, and the foreign_key test is where things are going south. I'm able to reproduce that failure with CLOBBER_CACHE_ALWAYS and it definitely is caused by this commit. Do you want to back it out for the time being? Kevin is on vacation right now. I'll take a look today, and either fix it if I can find the problem quickly or back it out. (If anyone else is already looking at it, happy to defer to you...) I am looking at it. Unfortunately I'm not getting any usable backtrace yet. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Double linking MemoryContext children
On 09/12/2015 11:35 AM, Kevin Grittner wrote: On the other hand, a grep indicates that there are two places that MemoryContextData.nextchild is set (and we therefore probably need to also set the new field), and Jan's proposed patch only changes one of them. If we do this, I think we need to change both places that are affected, so ResourceOwnerCreate() in resowner.c would need a line or two added. ResourceOwnerCreate() sets ResourceOwnerData.nextchild, not MemoryContextData.nextchild. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Double linking MemoryContext children
The attached script demonstrates an O(N^2) problem we recently became aware of. The script simply executes a large anonymous code block that doesn't do anything useful. Usage is ./t1.py [NUM_STMTS] | psql [DBNAME] NUM_STMTS defaults to 20,000. The code block executes rather fast, but the entire script runs for over a minute (at 20,000 statements) on a 2.66 GHz Xeon. The time is spent when all the prepared SPI statements are freed at the end of execution. All prepared SPI statements are children of the Cache context. MemoryContext children are a single linked list where new members are inserted at the head. This works best when children are created and destroyed in a last-in-first-out pattern. SPI however destroys the SPI statements in the order they were created, forcing MemoryContextSetParent() to traverse the entire linked list for each child. The attached patch makes MemoryContext children a double linked list that no longer needs any list traversal to find the position of the child within the list. 
Regards, Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 705f3ef..d1a2e02 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -331,21 +331,13 @@ MemoryContextSetParent(MemoryContext context, MemoryContext new_parent)
 	{
 		MemoryContext parent = context->parent;
 
-		if (context == parent->firstchild)
-			parent->firstchild = context->nextchild;
+		if (context->prevchild != NULL)
+			context->prevchild->nextchild = context->nextchild;
 		else
-		{
-			MemoryContext child;
-
-			for (child = parent->firstchild; child; child = child->nextchild)
-			{
-				if (context == child->nextchild)
-				{
-					child->nextchild = context->nextchild;
-					break;
-				}
-			}
-		}
+			parent->firstchild = context->nextchild;
+
+		if (context->nextchild != NULL)
+			context->nextchild->prevchild = context->prevchild;
 	}
 
 	/* And relink */
@@ -353,12 +345,16 @@ MemoryContextSetParent(MemoryContext context, MemoryContext new_parent)
 	{
 		AssertArg(MemoryContextIsValid(new_parent));
 		context->parent = new_parent;
+		context->prevchild = NULL;
 		context->nextchild = new_parent->firstchild;
+		if (new_parent->firstchild != NULL)
+			new_parent->firstchild->prevchild = context;
 		new_parent->firstchild = context;
 	}
 	else
 	{
 		context->parent = NULL;
+		context->prevchild = NULL;
 		context->nextchild = NULL;
 	}
 }
@@ -714,6 +710,7 @@ MemoryContextCreate(NodeTag tag, Size size,
 	node->methods = methods;
 	node->parent = NULL;		/* for the moment */
 	node->firstchild = NULL;
+	node->prevchild = NULL;
 	node->nextchild = NULL;
 	node->isReset = true;
 	node->name = ((char *) node) + size;
@@ -728,6 +725,8 @@ MemoryContextCreate(NodeTag tag, Size size,
 	{
 		node->parent = parent;
 		node->nextchild = parent->firstchild;
+		if (parent->firstchild != NULL)
+			parent->firstchild->prevchild = node;
 		parent->firstchild = node;
 		/* inherit allowInCritSection flag from parent */
 		node->allowInCritSection = parent->allowInCritSection;
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index 577ab9c..bbb58bd 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -79,6 +79,7 @@ typedef struct MemoryContextData
 	MemoryContextMethods *methods;	/* virtual function table */
 	MemoryContext parent;		/* NULL if no parent (toplevel context) */
 	MemoryContext firstchild;	/* head of linked list of children */
+	MemoryContext prevchild;	/* previous child of same parent */
 	MemoryContext nextchild;	/* next child of same parent */
 	char	   *name;			/* context name (just for debugging) */
 	MemoryContextCallback *reset_cbs;	/* list of reset/delete callbacks */

#!/usr/bin/env python
import sys

def main(argv):
    if len(argv) == 0:
        num_stmts = 20000
    else:
        num_stmts = int(argv[0])

    print "\\timing"
    print "DO $$"
    print "DECLARE"
    print "ts1 timestamp;"
    print "ts2 timestamp;"
    print "dummy integer;"
    print "BEGIN"
    print "SELECT timeofday() INTO ts1;"
    print "RAISE NOTICE 'start: %', ts1;"
    for i in range(0, num_stmts):
        print "SELECT %d INTO dummy;" % i
    print "SELECT timeofday() INTO ts2;"
    print "RAISE NOTICE 'end: %', ts2;"
    print "RAISE NOTICE 'duration: %', (ts2 - ts1);"
    print "END;"
    print "$$;"

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Double linking MemoryContext children
On 09/11/2015 09:38 AM, Tom Lane wrote: Jan Wieck <j...@wi3ck.info> writes: The attached script demonstrates an O(N^2) problem we recently became aware of. The script simply executes a large anonymous code block that doesn't do anything useful. Usage is ./t1.py [NUM_STMTS] | psql [DBNAME] NUM_STMTS defaults to 20,000. The code block executes rather fast, but the entire script runs for over a minute (at 20,000 statements) on a 2.66 GHz Xeon. The time is spent when all the prepared SPI statements are freed at the end of execution. All prepared SPI statements are children of the Cache context. MemoryContext children are a single linked list where new members are inserted at the head. This works best when children are created and destroyed in a last-in-first-out pattern. SPI however destroys the SPI statements in the order they were created, forcing MemoryContextSetParent() to traverse the entire linked list for each child. The attached patch makes MemoryContext children a double linked list that no longer needs any list traversal no find the position of the child within the list. Seems less invasive to fix SPI to delete in the opposite order? That would assume that there are no other code paths that destroy memory contexts out of LIFO order. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Double linking MemoryContext children
On 09/11/2015 09:58 AM, Tom Lane wrote:
> Jan Wieck <j...@wi3ck.info> writes:
>> On 09/11/2015 09:38 AM, Tom Lane wrote:
>>> Seems less invasive to fix SPI to delete in the opposite order?
>> That would assume that there are no other code paths that destroy memory contexts out of LIFO order.
> Given that we've not seen this sort of problem reported in the dozen or more years the code has been like that, I'm a bit dubious that we need to worry about that. It's possible you're right that we need to expend more space in the MemoryContext lists --- but I'd like to see more than one example.

I added a bit of debug code to MemoryContextSetParent(), tracking loop iterations per context name. This is what happens during "make check" of current master:

    name                  | count | not_first | iterations
    ----------------------+-------+-----------+-----------
    CacheMemoryContext    |  6271 |      2634 |     104846
    ExecutorState         |    90 |      2951 |       7176
    TopMemoryContext      | 28464 |       620 |       2716
    SPI Proc              |  8424 |       225 |        381
    PortalMemory          | 24616 |        31 |        291
    TopTransactionContext | 19312 |       114 |        141
    ExprContext           |  3317 |         1 |          1

There was a total of 6,271 calls to MemoryContextSetParent() where the old parent is the CacheMemoryContext. For 2,634 of those the context is not parent->firstchild and this led to 104,846 loop iterations in total. The remaining numbers indicate that other contexts are mostly used in the intended fashion, but not strictly. This means there is definitely potential for more edge cases, similar to the SPI example above.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Small patch to fix an O(N^2) problem in foreign keys
On 09/08/2015 08:49 AM, Kevin Grittner wrote:
> Jan Wieck <j...@wi3ck.info> wrote:
>> The problem is a cache introduced in commit 45ba4247 that improves
> That's a bit off; 45ba424f seems to be what you mean.

Copy paste over paper.

>> foreign key lookups during bulk updates when the FK value does not change. When restoring a schema dump from a database with many (say 100,000) foreign keys, this cache is growing very big and every ALTER TABLE command is causing an InvalidateConstraintCacheCallBack(), which does a sequential hash table scan. The patch uses a heuristic method of detecting when the hash table should be destroyed and recreated. InvalidateConstraintCacheCallBack() adds the current size of the hash table to a counter. When that sum reaches 1,000,000, the hash table is flushed. This improves the schema restore of a database with 100,000 foreign keys by a factor of 3. According to my tests the patch does not interfere with the bulk updates the original feature was supposed to improve.
> In the real-world customer case that caused you to look into this, I thought 45ba424f drove schema-only restore time from 2 hours to about 25 hours, and that this patch takes it back down to 2 hours. Am I remembering right? And this came about because it added 20-some hours to a pg_upgrade run?

From 2 hours to >50, but yes, this is that case.

If there are no objections, I will push this as a bug fix back to 9.3, where the performance regression was introduced.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
[HACKERS] Small patch to fix an O(N^2) problem in foreign keys
Attached is a small patch and a script to reproduce the issue.

The problem is a cache introduced in commit 45ba4247 that improves foreign key lookups during bulk updates when the FK value does not change. When restoring a schema dump from a database with many (say 100,000) foreign keys, this cache is growing very big and every ALTER TABLE command is causing an InvalidateConstraintCacheCallBack(), which does a sequential hash table scan.

The patch uses a heuristic method of detecting when the hash table should be destroyed and recreated. InvalidateConstraintCacheCallBack() adds the current size of the hash table to a counter. When that sum reaches 1,000,000, the hash table is flushed. This improves the schema restore of a database with 100,000 foreign keys by a factor of 3. According to my tests the patch does not interfere with the bulk updates the original feature was supposed to improve.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 61edde9..d7023ce 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -183,6 +183,7 @@ typedef struct RI_CompareHashEntry
  * ----------
  */
 static HTAB *ri_constraint_cache = NULL;
+static long ri_constraint_cache_seq_count = 0;
 static HTAB *ri_query_cache = NULL;
 static HTAB *ri_compare_cache = NULL;
@@ -2945,6 +2946,27 @@ InvalidateConstraintCacheCallBack(Datum arg, int cacheid, uint32 hashvalue)
 	Assert(ri_constraint_cache != NULL);
 
+	/*
+	 * Prevent an O(N^2) problem when creating large amounts of foreign
+	 * key constraints with ALTER TABLE, like it happens at the end of
+	 * a pg_dump with hundred-thousands of tables having references.
+	 */
+	ri_constraint_cache_seq_count += hash_get_num_entries(ri_constraint_cache);
+	if (ri_constraint_cache_seq_count > 1000000)
+	{
+		HASHCTL		ctl;
+
+		hash_destroy(ri_constraint_cache);
+		memset(&ctl, 0, sizeof(ctl));
+		ctl.keysize = sizeof(Oid);
+		ctl.entrysize = sizeof(RI_ConstraintInfo);
+		ri_constraint_cache = hash_create("RI constraint cache",
+										  RI_INIT_CONSTRAINTHASHSIZE,
+										  &ctl, HASH_ELEM | HASH_BLOBS);
+		ri_constraint_cache_seq_count = 0;
+		return;
+	}
+
 	hash_seq_init(&status, ri_constraint_cache);
 	while ((hentry = (RI_ConstraintInfo *) hash_seq_search(&status)) != NULL)
 	{

t1.sh Description: application/shellscript
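To see why charging the current table size against a counter bounds the total invalidation work, here is a hedged Python sketch of the heuristic. The class name and structure are illustrative stand-ins for the C code (a dict plays the role of the dynahash table); the threshold mirrors the 1,000,000 counter limit described above:

```python
class ConstraintCache:
    """Sketch of the patch's flush heuristic for the RI constraint cache."""
    FLUSH_THRESHOLD = 1000000

    def __init__(self):
        self.table = {}       # stand-in for the dynahash table
        self.seq_count = 0    # ri_constraint_cache_seq_count
        self.flushes = 0

    def invalidate(self):
        # Each invalidation callback would otherwise scan the whole table;
        # charge its current size against the counter.
        self.seq_count += len(self.table)
        if self.seq_count > self.FLUSH_THRESHOLD:
            # hash_destroy + hash_create in the C code
            self.table.clear()
            self.seq_count = 0
            self.flushes += 1
            return
        # otherwise the (now cheap, because bounded) sequential scan
        # would invalidate matching entries here
```

Simulating a restore that creates 100,000 constraints, where each ALTER TABLE caches one entry and then fires the callback: the accumulated scan charge crosses 1,000,000 after 1,414 invalidations (1+2+...+1414 > 1,000,000), so the table is flushed on a regular cycle and never grows past about 1,400 entries, keeping the total scan work roughly linear instead of quadratic.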
Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets
The whole thing turns out to be based on wrong baseline data, taken with a pgbench client running from a remote machine. It all started out from an investigation against 9.3. Curiously enough, the s_lock() problem that existed in 9.3 has a very similar effect on throughput as a network bottleneck has on 9.5.

Sorry for the noise, Jan

On 06/10/2015 09:18 AM, Jan Wieck wrote:
> Hi, I think I may have found one of the problems PostgreSQL has on machines with many NUMA nodes. I am not yet sure what exactly happens on the NUMA bus, but there seems to be a tipping point at which the spinlock concurrency wreaks havoc and the performance of the database collapses.
> On a machine with 8 sockets, 64 cores, hyperthreaded to 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there.
> The attached patch demonstrates that less aggressive spinning and (much) more frequent delaying improves the performance on this type of machine. The 8 socket machine in question scales to over 350,000 TPS.
> The patch is meant to demonstrate this effect only. It has a negative performance impact on smaller machines and client counts < #cores, so the real solution will probably look much different. But I thought it would be good to share this and start the discussion about reevaluating the spinlock code before PGCon.
> Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] DBT-3 with SF=20 got failed
On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:
>> curious: what was work_mem set to?
> work_mem=48GB
> My machine mounts 256GB physical RAM.

work_mem can be allocated several times per backend. Nodes like sort and hash_aggregate may each allocate that much. You should set work_mem to a fraction of physical-RAM / concurrent-connections, depending on the complexity of your queries. 48GB does not sound reasonable.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
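The sizing rule above — a fraction of physical RAM divided by concurrent connections, scaled by query complexity — is simple arithmetic. The function below is a back-of-the-envelope sketch; its name, the 25% RAM fraction and the per-query node count are illustrative assumptions, not official PostgreSQL tuning advice:

```python
def suggest_work_mem_mb(phys_ram_mb, max_connections,
                        nodes_per_query=4, ram_fraction=0.25):
    """Divide a fraction of RAM by (connections x memory-hungry plan nodes),
    since each sort/hash node may allocate up to work_mem on its own."""
    return (phys_ram_mb * ram_fraction) / (max_connections * nodes_per_query)
```

For the 256GB machine discussed above with, say, 100 connections and four memory-hungry nodes per query, this suggests roughly 164MB per sort/hash node — three orders of magnitude below the 48GB that was configured.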
[HACKERS] s_lock() seems too aggressive for machines with many sockets
Hi,

I think I may have found one of the problems PostgreSQL has on machines with many NUMA nodes. I am not yet sure what exactly happens on the NUMA bus, but there seems to be a tipping point at which the spinlock concurrency wreaks havoc and the performance of the database collapses.

On a machine with 8 sockets, 64 cores, hyperthreaded to 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there.

The attached patch demonstrates that less aggressive spinning and (much) more frequent delaying improves the performance on this type of machine. The 8 socket machine in question scales to over 350,000 TPS.

The patch is meant to demonstrate this effect only. It has a negative performance impact on smaller machines and client counts < #cores, so the real solution will probably look much different. But I thought it would be good to share this and start the discussion about reevaluating the spinlock code before PGCon.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 0afcba1..edaa8fd 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -83,8 +83,8 @@ s_lock(volatile slock_t *lock, const char *file, int line)
  * the probability of unintended failure) than to fix the total time
  * spent.
  */
-#define MIN_SPINS_PER_DELAY 10
-#define MAX_SPINS_PER_DELAY 1000
+#define MIN_SPINS_PER_DELAY 4
+#define MAX_SPINS_PER_DELAY 10
 #define NUM_DELAYS			1000
 #define MIN_DELAY_USEC		1000L
 #define MAX_DELAY_USEC		1000000L
@@ -144,7 +144,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	{
 		/* we never had to delay */
 		if (spins_per_delay < MAX_SPINS_PER_DELAY)
-			spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
+			spins_per_delay = Min(spins_per_delay + 2, MAX_SPINS_PER_DELAY);
 	}
 	else
 	{
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c63cf54..42eb353 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -973,7 +973,7 @@ extern slock_t dummy_spinlock;
 extern int	s_lock(volatile slock_t *lock, const char *file, int line);
 
 /* Support for dynamic adjustment of spins_per_delay */
-#define DEFAULT_SPINS_PER_DELAY  100
+#define DEFAULT_SPINS_PER_DELAY  10
 
 extern void set_spins_per_delay(int shared_spins_per_delay);
 extern int	update_spins_per_delay(int shared_spins_per_delay);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
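For readers unfamiliar with the knob being tuned here, the following Python sketch models s_lock()'s spin/delay loop and its adaptive feedback on spins_per_delay. The constants are the stock pre-patch values; the simulation only models the control flow (the real code sleeps with pg_usleep and spins on a TAS macro):

```python
def s_lock_sim(tas_results, spins_per_delay, min_spins=10, max_spins=1000):
    """Model of s_lock(): spin until spins_per_delay failed TAS attempts,
    then "sleep" and reset. tas_results is a sequence of booleans, one per
    TAS attempt (True = lock acquired). Returns (delays_taken, new_spins_per_delay)."""
    spins = 0
    delays = 0
    for got_it in tas_results:
        if got_it:
            break
        spins += 1
        if spins >= spins_per_delay:
            delays += 1          # the real code calls pg_usleep() here
            spins = 0
    # Adaptive feedback, as in s_lock():
    if delays == 0:
        # never had to sleep: spin longer next time
        spins_per_delay = min(spins_per_delay + 100, max_spins)
    else:
        # had to sleep: favor sleeping sooner next time
        spins_per_delay = max(spins_per_delay - 1, min_spins)
    return delays, spins_per_delay
```

A lock obtained after 5 failed attempts causes no delay and raises spins_per_delay by 100, while one obtained after 250 failed attempts (at spins_per_delay = 100) sleeps twice and lowers it by one. The patch shrinks both the spin limits and the +100 step, shifting the balance sharply toward sleeping.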
Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets
On 06/10/2015 09:28 AM, Andres Freund wrote: On 2015-06-10 09:18:56 -0400, Jan Wieck wrote: On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It never recovers from there. 85k? Phew, that's pretty bad. What exact type of CPU is this? Which pgbench scale? Did you use -M prepared? model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz numactl --hardware shows the distance to the attached memory as 10, the distance to every other node as 21. I interpret that as the machine having one NUMA bus with all cpu packages attached to that, rather than individual connections from cpu to cpu or something different. pgbench scale=300, -Msimple. Could you share a call graph perf profile? I do not have them handy at the moment and the machine is in use for something else until tomorrow. I will forward perf and systemtap based graphs ASAP. What led me into that spinlock area was the fact that a wall clock based systemtap FlameGraph showed a large portion of the time spent in BufferPin() and BufferUnpin(). The attached patch demonstrates that less aggressive spinning and (much) more often delaying improves the performance on this type of machine. The 8 socket machine in question scales to over 350,000 TPS. Even that seems quite low. I've gotten over 500k TPS on a four socket x86 machine, and about 700k on a 8 socket x86 machine. There is more wrong with the machine in question than just that. But for the moment I am satisfied with having a machine where I can reproduce this phenomenon in what appears to be a worst case. Maybe we need to adjust the amount of spinning, but to me such drastic differences are a hint that we should tackle the actual contention point. Often a spinlock for something regularly heavily contended can be worse than a queued lock. 
I have the impression that the code assumes that there is little penalty for accessing the shared byte in a tight loop from any number of cores in parallel. That apparently is true for some architectures and core counts, but no longer holds for these machines with many sockets. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets
On 06/10/2015 10:07 AM, Nils Goroll wrote:
> On larger Linux machines, we have been running with spin locks replaced by generic posix mutexes for years now. I personally haven't looked at the code for ages, but we maintain a patch which pretty much does the same thing still. Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de
> I understand that there are systems out there which have less efficient posix mutex implementations than Linux (which uses futexes), but I think it would still be worth considering to do away with the roll-your-own spinlocks on systems whose posix mutexes are known to behave.

I have played with test code that isolates a stripped down version of s_lock() and uses it with multiple threads. I then implemented multiple different versions of that s_lock(). The results with 200 concurrent threads are that using a __sync_val_compare_and_swap() to acquire the lock and then falling back to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built-in seems identical in performance to the assembler xchgb operation used by PostgreSQL today on x86_64.

Regards, Jan

-- Jan Wieck Senior Software Engineer http://slony.info
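The "spin a few times, then sleep one millisecond" strategy measured above can be sketched in Python. This models only the control flow of that strategy — a non-blocking Lock.acquire stands in for the atomic test-and-set instruction, so it is not the actual micro-benchmark and says nothing about its throughput numbers:

```python
import threading
import time

class SpinThenSleepLock:
    """Sketch of the spin-then-usleep locking strategy: try a test-and-set
    a bounded number of times, then back off with a 1 ms sleep."""
    def __init__(self, spins=10, sleep_s=0.001):
        self._flag = threading.Lock()   # stand-in for the TAS/xchgb flag
        self.spins = spins
        self.sleep_s = sleep_s

    def acquire(self):
        while True:
            for _ in range(self.spins):
                if self._flag.acquire(blocking=False):  # the "TAS" attempt
                    return
            time.sleep(self.sleep_s)                    # like usleep(1000)

    def release(self):
        self._flag.release()
```

As a correctness check, four threads each performing 1,000 locked increments of a shared counter must end with exactly 4,000 — any lost update would indicate a broken mutual-exclusion protocol.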
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 02:15 PM, Jan Wieck wrote:
> On 02/05/2015 01:18 PM, Marko Tiikkaja wrote:
>>> pgcrypto bug
>> That doesn't look too good, but I can't reproduce it against 9.3.6 either.

Attached is an improved script and the final output from it. I ran it overnight and it did not reproduce the "pgcrypto bug" message, but instead the expected "Corrupt data" plus one that I haven't seen before, which is "Unsupported compression algorithm".

Jan

-- Jan Wieck Senior Software Engineer http://slony.info

pgcrypto_test2.sh Description: application/shellscript

select * from pgp_test2 order by 1;

 id | vdata | vkey | verrmsg
----+-------+------+---------
  1 | \xc30d040703027123de71fa32937175d23a01cc0377628c3b58119e4c8e51804b74cccf961c8b500b8a283db7084ed809833f9bfd7827b70cf06aa0254707e1c08b3db8419e6e2eda697637 | key_1   | ERROR: Wrong key or corrupt data
  2 | \xc30d040703027123de71fa32937175d23a01cc0377628c3b58119e4c8e51804b74cccf961c8b500b8a283db7084ed809833f9bfd7827b70cf06aa0254707e1c08b3db8419e6e2eda697637 | key_686 | ERROR: Not text data
  3 | \xc30d04070302b9688f5b4b1db6bb77d23a01ee1cad33770344503c564496fb6463d9b11a30013424fcae8b80cfbafba415ba3c7047aec7499b8e74254069b29390990c0f2a34740a7d2085 | key_972 | ERROR: Unsupported compression algorithm
  4 | \xc30d04070302c011ec7879402f4178d23a0165d2462bb4e6a7a204f8af20fc0cfadfc72f6a55902f013479697f316659bbc028ddfe624578e462aee3279b65f3e66f0993305d378ae35593 | key_816 | ERROR: Corrupt data
(4 rows)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Possible problem with pgcrypto
Hi,

I have encountered a small instability in the behavior of pgcrypto's pgp_sym_decrypt() function. Attached is a script that can reproduce the problem. It may have to be run repeatedly because the symptom occurs rather seldom.

What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different wrong passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:

    Not text data
    pgcrypto bug

This seems to be triggered by a combination of the random data included in the encrypted data as well as the wrong password, because for an instance of encrypted data only certain passwords cause this symptom.

I wonder if this may actually be a bug in pgcrypto or if this is an error inherent in the way the encrypted data is encoded. I.e. that the decryption algorithm cannot really figure out what is wrong and just sometimes gets a little further in the attempt to decrypt.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info

pgcrypto_test.sh Description: application/shellscript
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 10:58 AM, Tom Lane wrote:
> Jan Wieck <j...@wi3ck.info> writes:
>> I have encountered a small instability in the behavior of pgcrypto's pgp_sym_decrypt() function. Attached is a script that can reproduce the problem. It may have to be run repeatedly because the symptom occurs rather seldom. What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different wrong passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:
> Have you tested this with this week's releases? We fixed some memory-mishandling bugs in pgcrypto ...

The posted script reproduces the symptom in today's checkout of master as well as REL9_4_STABLE.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 01:18 PM, Marko Tiikkaja wrote:
> On 2/5/15 4:48 PM, Jan Wieck wrote:
>> What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different wrong passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:
>> Not text data
> That's not unexpected; the check for whether the data is text or not appears to happen quite early in the process of decoding. So it's enough to get to that point without anything being obviously broken.

I suspected something like that.

> In addition to the two errors above, it doesn't appear to be too difficult to see PXE_MBUF_SHORT_READ, which would give you "ERROR: Corrupt data".

I wonder why that error message is different, though. From reading the code as far as I did, I expected to see that, but haven't seen it yet.

>> pgcrypto bug
> That doesn't look too good, but I can't reproduce it against 9.3.6 either.

Let me improve the script to a point where it can run for a long time in the background and collect all different error cases as examples of encrypted data and wrong password.

Thanks so far.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
[HACKERS] Moving of INT64_FORMAT to c.h
Hi, PostgreSQL has for ages defined INT64_FORMAT and UINT64_FORMAT in pg_config.h. This commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ce486056ecd28050 moved those definitions to c.h, which breaks compilation of all released Slony-I versions against current master. Can those be moved back to where they used to be? Slony uses the definitions in external tools, like slon and slonik, to format sequence numbers in log output. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] DDL Damage Assessment
On 10/02/2014 01:15 PM, Joe Conway wrote:
> On 10/02/2014 11:30 AM, Dimitri Fontaine wrote:
>> Questions: 1. Do you agree that a systematic way to report what a DDL command (or script, or transaction) is going to do on your production database is a feature we should provide to our growing user base?
> +1 I really like the idea and would find it useful/time-saving
>> 2. What do you think such a feature should look like?
> Elsewhere on this thread EXPLAIN was suggested. That makes a certain amount of sense. Maybe something like EXPLAIN IMPACT [...]
>> 3. Does it make sense to support the whole set of DDL commands from the get go (or ever) when most of them are only taking locks in their own pg_catalog entry anyway?
> Yes, I think it should cover all commands that can have an availability impact.

In principle I agree with the sentiment. However, that full coverage is a nice goal, seldom achieved. The real question is at what level of information, returned to the user, does this feature become user friendly? It is one thing to provide information of the kind of

    TAKE ACCESS EXCLUSIVE LOCK ON TABLE foo
    TEST EVERY ROW IN TABLE foo FOR FK (a, b) IN bar (id1, id2)

That information is useful, but only to an experienced DBA who knows their schema and data to a certain degree. The majority of users, I fear, will not be able to even remotely guesstimate if that will need seconds or hours. There needs to be more detailed information for those cases and I believe that tackling them one at a time in depth will lead to more useful results than trying to cover a lot but shallow.

My $.02

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pg_receivexlog and replication slots
On 10/01/2014 08:59 AM, Andres Freund wrote:
> On 2014-10-01 21:54:56 +0900, Michael Paquier wrote:
>> Thanks! On Wed, Oct 1, 2014 at 8:38 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>>> 0001) Old WIP patch for pg_recvlogical tests. Useful for easier development.
>> From where is this patch? Is it some rest from the time when pg_recvlogical was developed?
>>> 0002) Copy editing that should be in 9.4
>> Definitely makes sense for a backpatch.
>>> 0003) Rebased patches of yours
>>> 0004) My changes to 0003 besides the rebase. This'll be squashed, but I thought it might be interesting for you. (thanks for caring)
>> -/* Drop a replication slot */
>> +/* Drop a replication slot. */
>> I don't recall that comments on a single line have a dot even if this single line is a complete sentence.
> Then it shouldn't start with an uppercase - which you changed...
>> -static void StreamLog();
>> +static void StreamLogicalLog();
>> Not related at all to those patches, but for correctness it is better to add a declaration (void).
> Agreed.
>> Except those small things the changes look good to me.
> Cool. Will push and then, sometime this week, review the next patch.
> Greetings, Andres Freund

You might want to do that function renaming in pg_receivexlog.c as well.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pg_receivexlog and replication slots
On 10/01/2014 01:19 PM, Andres Freund wrote: On 2014-10-01 13:17:17 -0400, Jan Wieck wrote: -static void StreamLog(); +static void StreamLogicalLog(); Not related at all to those patches, but for correctness it is better to add a declaration (void). Agreed. Except those small things the changes look good to me. Cool. Will push and then, sometime this week, review the next patch. You might want to do that function renaming in pg_receivexlog.c as well. Why? You mean to StreamPhysicalLog()? Greetings, Andres Freund Like that. Plus, there is one more occurrence of StreamLog() in pg_recvlogical.c. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_receivexlog and replication slots
On 10/01/2014 01:32 PM, Andres Freund wrote: On 2014-10-01 13:22:53 -0400, Jan Wieck wrote: On 10/01/2014 01:19 PM, Andres Freund wrote: On 2014-10-01 13:17:17 -0400, Jan Wieck wrote: -static void StreamLog(); +static void StreamLogicalLog(); Not related at all to those patches, but for correctness it is better to add a declaration (void). Agreed. Except those small things the changes look good to me. Cool. Will push and then, sometime this week, review the next patch. You might want to do that function renaming in pg_receivexlog.c as well. Why? You mean to StreamPhysicalLog()? Like that. Don't see that much point - it just seemed wrong to have StreamLog do two somewhat different things. And StreamLog() in receivexlog is much older... The point is that StreamLog is semantically a superset of StreamWhateverLog. Jan Plus, there is one more occurrence of StreamLog() in pg_recvlogical.c. Gah. Wrongly split commit :(. Fixed in 9.4, it's already fixed by the subsequent commit in master. Greetings, Andres Freund -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 09/15/2014 09:46 PM, Craig Ringer wrote: On 09/16/2014 07:44 AM, Peter Geoghegan wrote: FWIW, I am slightly concerned about weighing use cases around very large JSON documents too heavily. Having enormous jsonb documents just isn't going to work out that well, but neither will equivalent designs in popular document database systems for similar reasons. For example, the maximum BSON document size supported by MongoDB is 16 megabytes, and that seems to be something that their users don't care too much about. Having 270 pairs in an object isn't unreasonable, but it isn't going to be all that common either. Also, at a certain size the fact that Pg must rewrite the whole document for any change to it starts to introduce other practical changes. Anyway - this is looking like the change will go in, and with it a catversion bump. Introduction of a jsonb version/flags byte might be worthwhile at the same time. It seems likely that there'll be more room for improvement in jsonb, possibly even down to using different formats for different data. Is it worth paying a byte per value to save on possible upgrade pain? This comment seems to have drowned in the discussion. If there indeed has to be a catversion bump in the process of this, then I agree with Craig. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/17/2014 03:36 PM, Peter Eisentraut wrote: On 9/17/14 3:04 PM, Pavel Stehule wrote: What is difference between content of variable or content of database? You can test any prerequisite, but when this prerequisite is not solved, than exception is very very hard without possible handling. If the assertion tests arbitrary Boolean expressions, then we can't stop the user from abusing them. Exactly. Doing something like ASSERT (select count(*) from foo where fk not in (select pk from bar)) = 0; is a perfectly fine, arbitrary boolean expression. It will probably work well in a development environment too. And I am very sure that it will not scale well once that code gets deployed. And I know how DBAs react to the guaranteed following performance problem. They will disable ALL assert ... or was there some sort of assert class system proposed that I missed? But it's another thing if we design specific syntax that encourages such abuse, as proposed earlier. The design should explicitly discourage that sort of nonsense. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/14/2014 02:25 PM, Pavel Stehule wrote:
> a) It isn't possible to handle an assertion exception anywhere .. it enforces ROLLBACK in 100% of cases
> b) Assertions should be disabled globally .. I am not sure if it is a good idea, but I can understand that some tests based on queries to data can be a performance issue.
> An important question is the relation between assertions and exceptions. Is it only a shortcut for an exception or something different?

I think that most data integrity issues can be handled by a well designed database schema that uses UNIQUE, NOT NULL, REFERENCES and CHECK constraints. Assertions are usually found inside of complex code constructs to check values of local variables. I don't think it is even a good idea to implement assertions that can query arbitrary data.

Jan

-- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 11:28 AM, Heikki Linnakangas wrote: On 09/10/2014 05:57 PM, Fabien COELHO wrote: Hello Heikki, I looked closer at the this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to .yy instead of the unreadable yyy, and be told not to, because some people have tool which process the output so the format MUST NOT CHANGE. So my behavior is not to avoid touching anything in this area. I'm fine if you do it, though:-) However I have not time to have a precise look at your patch to cross-check it before Friday. This is different from changing xxx yyy to xxx.yyy in two ways. First, this adds new information to the log that just isn't there otherwise, it's not just changing the format for the sake of it. Second, in this case it's easy to write a parser for the log format so that it works with the old and new formats. Many such programs would probably ignore any unexpected extra fields, as a matter of lazy programming, while changing the separator from space to a dot would surely require changing every parsing program. We could leave out the lag fields, though, when --rate is not used. Without --rate, the lag is always zero anyway. That would keep the file format unchanged, when you don't use the new --rate feature. I'm not sure if that would be better or worse for programs that might want to parse the information. We could also leave the default output format as is and introduce another option with a % style format string. 
Jan Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then convert it to a double between 0.0 and 1.0. The reason for this conversion is to have randomness but to still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if it is not often) a 0 tps when 100 tps is required. The 10000 granularity is basically random but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, bar issues from the database for processing the transactions). So although this feature can be discussed and amended, it is fully voluntary. I think that it makes sense so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. Ok, yeah, the comments indeed didn't mention anything about that. I don't think such clamping is necessary. With that 9.2x clamp on the multiplier, the probability that any given transaction hits it is about 1/10000. And a delay 9.2 times the average is still quite reasonable, IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, seems to be about 700x the average, which is still reasonable if you run a decent number of transactions. And of course, the probability of hitting such an extreme value is minuscule, so if you're just doing a few quick test runs with a small total number of transactions, you won't hit that. - Heikki -- Jan Wieck Senior Software Engineer http://slony.info
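The clamping trick being discussed is easy to see in a few lines of Python. This is a hedged sketch, not pgbench's actual C code: the 1.00055271703 correction factor comes from the quoted snippet, the 1..10000 granularity is assumed (consistent with the quoted 9.2 cap, since ln(10000) is about 9.21), and the helper name throttle_wait is invented for illustration. Drawing the uniform value on a finite grid bounds the exponential multiplier, while the correction factor keeps the mean wait at the requested delay:

```python
import math
import random

CORRECTION = 1.00055271703  # compensates for the mean shift caused by the grid

def throttle_wait(throttle_delay, rng):
    # Poisson-style transaction spacing: exponentially distributed with
    # mean throttle_delay. Drawing u from {1/10000 .. 10000/10000} instead
    # of a continuous (0, 1) caps -log(u) at log(10000) ~ 9.21, so no
    # single scheduled wait can explode.
    u = rng.randint(1, 10000) / 10000.0
    return throttle_delay * CORRECTION * -math.log(u)

rng = random.Random(42)
waits = [throttle_wait(100.0, rng) for _ in range(200000)]
bound = 100.0 * CORRECTION * math.log(10000)   # hard upper bound, ~921
mean = sum(waits) / len(waits)                 # hovers near the requested 100
```

With the clamp, a target of 100 tps can momentarily drop to roughly 100/9.2, i.e. about 11 tps, but never to zero, which matches the reasoning in the mail.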
Re: [HACKERS] PL/pgSQL 2
On 09/05/2014 10:32 PM, Marko Tiikkaja wrote: On 2014-09-02 8:52 PM, Kevin Grittner wrote: Marko Tiikkaja ma...@joh.to wrote: Sounds like in this case you'd only use set-oriented programming at the end of the transaction, no? I guess -- more properly I would say in the final database transaction for that financial transaction. Yes, I should have said financial transaction, but I hit send a bit too early. And no, that never made me wish that plpgsql functions defaulted to throwing errors for DML statements that affected more than one row. Fine. But you should still be able to see the point we're trying to make. The number one is special, and it's present everywhere. If you want to program defensively, you have to go through a lot of pain right now. We're looking for a way to alleviate that pain. Defaulting to throwing errors would be one way to do it, but that's not what's being suggested here anymore. You can dismiss what we're doing by saying that it doesn't follow the best practices or we just want an interface for a key-value store or whatever. And yes, to some extent, a simple interface for a key-value store would come in handy. But we still have the 5-15% (a big part of it being the reporting we need to do) of the code that *doesn't* want that, *and* we want to use all of the Postgres features where applicable. The point isn't about best practices. The point is that if you want to ensure that at most one row is affected, then qualify it by a unique set of columns. Making PL/pgSQL behave differently on UPDATE than SQL to enforce that by default was simply a misguided design idea. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 1.2
On 09/06/2014 04:21 AM, Marko Tiikkaja wrote: We wrap these things into (sometimes) simple-looking function so that none of the application developers ever run any SQL. We define an interface between the application and the database, and that interface is implemented using PL/PgSQL functions. Sure, sometimes one function will just fire off a single UPDATE .. RETURNING, or a SELECT, but that doesn't matter. The trick is to be consistent everywhere. There is precisely your root problem. Instead of educating your application developers on how to properly use a relational database system, you try to make it foolproof. Guess what, the second you made something foolproof, evolution will create a dumber fool. This is a race you cannot win. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/06/2014 12:17 PM, Marko Tiikkaja wrote: OK, fine. But that's not what I suggested on the wiki page, and is also not what I'm arguing for here right now. What the message you referred to was about was the condescending attitude where we were told to think in terms of sets (paraphrased), without considering whether that's even possible to do *all the time*. SQL is, by definition, a set oriented language. The name Procedural Language / pgSQL was supposed to suggest that this language adds some procedural elements to the PostgreSQL database. I never intended to create a 100% procedural language. It was from the very beginning, 16 years ago, intended to keep the set orientation when it comes to DML statements inside of functions. That means that you will have to think in sets *all the time*. The empty set and a set with one element are still sets. No matter how hard you try to make them special, in my mind they are not. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/06/2014 12:33 PM, Marko Tiikkaja wrote: On 2014-09-06 6:31 PM, Jan Wieck wrote: On 09/06/2014 12:17 PM, Marko Tiikkaja wrote: OK, fine. But that's not what I suggested on the wiki page, and is also not what I'm arguing for here right now. What the message you referred to was about was the condescending attitude where we were told to think in terms of sets (paraphrased), without considering whether that's even possible to do *all the time*. SQL is, by definition, a set oriented language. The name Procedural Language / pgSQL was supposed to suggest that this language adds some procedural elements to the PostgreSQL database. I never intended to create a 100% procedural language. It was from the very beginning, 16 years ago, intended to keep the set orientation when it comes to DML statements inside of functions. No matter how hard you try to make them special, in my mind they are not. Of course they are. That's why you have PRIMARY KEYs and UNIQUE constraints. Then please use those features instead of crippling the language. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/06/2014 12:47 PM, David G Johnston wrote: If the language, and the system as a whole, was only used by perfectionists who do not make errors - and with perfectly clean data - this adherence to purity would be acceptable. But the real world is not that clean and so enhancing the language to meet the needs of the real world is not crippling the language. Being able to state explicitly that the cardinality of the set I get back must be 1, no more and no less, doesn't remove the fact that I know I am dealing with a set and that I simply want to make an assertion as to its properties so that if a bug 3 layers deep into the application causes something other than 1 row to be affected I know immediately and can invoke the appropriate action - throw an error. As I already mentioned in the other thread, those assertions or checks do not belong into the PL. If they are desired they should be added to the main SQL syntax as COMMAND CONSTRAINT as suggested by Hannu. Your statement is not limited to PL functions. It is just as valid for NORMAL applications. However, this would be a proprietary extension that is not covered by any SQL standard and for that reason alone cannot be the default. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Improving PL/PgSQL
On 09/06/2014 02:08 PM, Marko Tiikkaja wrote: On 2014-09-06 7:56 PM, Pavel Stehule wrote: 2014-09-06 19:54 GMT+02:00 Marko Tiikkaja ma...@joh.to: Then that doesn't really solve our problem. Switching between two languages on a per-function basis, when both look exactly the same but have very different semantics, would be a nightmare. It is the maximum of what is possible - use a different language instead Sigh. OK, let's try and forget the cardinality assertions we've been talking about in the other thread(s). I seem to recall there being a generally welcoming atmosphere in the discussion about adding a set of pragmas (or options/whatever) to make some of PL/PgSQL's flaws go away, in a non-backwards compatible way. From the list here: https://wiki.postgresql.org/wiki/Improving_PL/PgSQL_(September_2014) do you think at least some of those would be reasonable candidates for these pragmas? Do you see other ones that are missing from this list? Please also keep discussion about ASSERT in the thread for that, and the suggestion under Single-row operations out of this.

+1 for SELECT INTO throwing TOO_MANY_ROWS if there is more than one. Zero rows should be dealt with using an IF NOT FOUND ... construct.

+1 for requiring that the number of result columns match the expression list of SELECT INTO.

-1 on removal of implicit type casting. This needs to go into a #pragma or GUC. Too drastic of a backwards compatibility break.

-1 on the single row operations. This belongs into the main SQL engine as COMMAND CONSTRAINTS.

+1 on EXECUTE and FOUND, where applicable (DML statements only). I do not recall why we decided to implement GET DIAGNOSTICS instead of an automatically set global ROW_COUNT variable. But there was a reason, I believe, and we should check the list archives for it.

+1 on the OUT alias.

-1 on the ASSERT as proposed. It would be too easy for application developers to abuse them to govern business logic, and a DBA could later turn off assertions for performance reasons.
Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/05/2014 04:40 AM, Pavel Stehule wrote: 2014-09-05 10:33 GMT+02:00 Marko Tiikkaja ma...@joh.to mailto:ma...@joh.to: On 9/5/14 10:09 AM, Pavel Stehule wrote: I think this could still be parsed correctly, though I'm not 100% sure on that: ASSERT WARNING (EXISTS(SELECT ..)), 'data are there'; PLpgSQL uses a ';' or some plpgsql keyword as the SQL statement delimiter. It is the reason why RETURN QUERY ... ';' works. So in this case it can be practical to place the SQL statement at the end of the plpgsql statement. *shrug* There are lots of cases where a comma is used as well, e.g. RAISE NOTICE '..', expr, expr; parentheses are not practical, because it is hard to identify bugs .. I don't see why. The PL/PgSQL SQL parser goes to great lengths to identify unmatched parentheses. But the parens probably aren't necessary in the first place; you could just omit them and keep parsing until the next comma AFAICT. So the syntax would be: RAISE [ NOTICE | WARNING | EXCEPTION/ASSERT/WHATEVER ] boolean_expr [, error_message [, error_message_param [, ... ] ] ]; extending RAISE with an ASSERT level has minimal new value Adding a WHEN clause to RAISE would have the benefit of not needing any new keywords at all. RAISE EXCEPTION 'format' [, expr ...] WHEN row_count <> 1; Regards, Jan The simplicity of integrating SQL and PLpgSQL is in using smart keywords - it is more verbose, and it allows for good diagnostics I disagree. The new keywords provide nothing of value here. They even encourage the use of quirky syntax in *exchange* for verbosity (IS NOT NULL pk? really?). It is about the semantics and conformity of the proposed tools. Sure, all can be reduced to ASSERT(expr) ..
but where is the benefit over a function call, which I am able to do without any change of the language, as a plpgsql extension - there is no need to change anything for too thin a proposal .marko -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] pgbench throttling latency limit
On 08/27/2014 04:08 AM, Heikki Linnakangas wrote: That model might make some sense if you think e.g. of a web application, where the web server has a timeout for how long it waits to get a database connection from a pool, but once a query is started, the transaction is considered a success no matter how long it takes. The latency limit would be that timeout. But I think a more useful model is that when the user clicks a button, he waits at most X seconds for the result. If that deadline is exceeded, the web server will give a 404, or the user will simply get bored and go away, and the transaction is considered a failure. Correct, the whole TPC-B model better fits an application where client requests enter a queue at the specified TPS rate and that queue is processed. While we are at it, note that in the original TPC-B specification the transaction duration measured is the time between receiving the client request (in current pgbench under throttling that is when the transaction is scheduled) and answering the request. This is the client visible response time, which has nothing to do with the database latency. As per TPC-B, the entire test is only valid if 90% of all client response times are within 2 seconds. It would be useful if pgbench would A) measure and report that client response time in the per transaction log files and B) report at the end what percentage of transactions finished within a specified response time constraint (default 2 seconds). Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
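The validity criterion described above is trivial to compute from a list of per-transaction client response times. A hedged sketch - the function names are invented for illustration; the 90%-within-2-seconds rule is the TPC-B criterion as paraphrased in the mail:

```python
def fraction_within(response_times, limit_s=2.0):
    # Fraction of client-visible response times at or below the limit.
    return sum(1 for t in response_times if t <= limit_s) / len(response_times)

def tpcb_valid(response_times, limit_s=2.0, required=0.90):
    # TPC-B style pass/fail: at least 90% of responses within 2 seconds.
    return fraction_within(response_times, limit_s) >= required

times = [0.05] * 95 + [3.0] * 5   # 95% fast transactions, 5% slow ones
ok = tpcb_valid(times)            # passes: 95% are within 2 seconds
```

The same per-transaction list would also drive the proposed end-of-run report: one percentage number against a configurable response-time constraint.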
Re: [HACKERS] pgbench throttling latency limit
On 09/05/2014 10:12 AM, Fabien COELHO wrote: Note that despite pg's appalling latency performance, it may stay well over the 90% limit, or even 100%: when things are going well a lot of transactions run in about a millisecond, while when things are going bad transactions would take a long time (although possibly under or about 1s anyway), *but* very few transactions are passed, the throughput is very small. The fact that during 15 seconds only 30 transactions are processed is a detail that does not show up in the metric. I haven't used the real pgbench for a long time. I will have to look at your patch and see what the current version actually does or does not. What I have been using is a Python version of pgbench that I wrote for myself when I started learning that language. That one does record both values, the DB transaction latency and the client response time (time from the request being entered into the queue until transaction commit). When I look at those results it is possible to have an utterly failing run, with 60% of client response times being within 2 seconds, but all the DB transactions still in milliseconds. As said, I'll have to take a look at it. Since I am on vacation next week, getting ready for my first day at EnterpriseDB, this may actually happen. Jan -- Jan Wieck Senior Software Engineer http://slony.info
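The gap between DB transaction latency and client response time described here falls out of simple queueing arithmetic. The sketch below is an invented toy model (not the Python pgbench mentioned in the mail): requests arrive on a fixed schedule, a single worker serves them in order, and as soon as the service time exceeds the arrival interval, the growing backlog inflates client response times even though every individual transaction stays fast:

```python
def simulate(arrival_interval_s, service_time_s, n):
    # Requests enter the queue at a fixed rate; one worker serves them in
    # order. Returns per-request (client response time, DB transaction latency).
    worker_free_at = 0.0
    results = []
    for i in range(n):
        arrives = i * arrival_interval_s
        starts = max(arrives, worker_free_at)   # wait in queue if worker busy
        worker_free_at = starts + service_time_s
        results.append((worker_free_at - arrives, service_time_s))
    return results

# 100 tps requested, but each transaction takes 20 ms: the DB latency stays
# at 20 ms while the client response time climbs past a second by request 100.
runs = simulate(0.010, 0.020, 100)
```

This is exactly the "failing run" pattern from the mail: all DB transactions in milliseconds, yet a large share of client response times blowing through the 2-second constraint.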
Re: [HACKERS] PL/pgSQL 1.2
On 09/04/2014 01:14 AM, Pavel Stehule wrote: 2014-09-03 23:19 GMT+02:00 Hannu Krosing ha...@2ndquadrant.com A more SQL-ish way of doing the same could probably be called COMMAND CONSTRAINTS and look something like this SELECT ... CHECK (ROWCOUNT BETWEEN 0 AND 1); It is very near to my proposed ASSERT Only if the ASSERT syntax were to become part of the original statement it is supposed to check. In Hannu's command constraint example above, the statement that causes the error, and thus will be logged and become identified by the error message, is the actual SELECT (or other DML statement). I think I like the COMMAND CONSTRAINT the best so far. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 1.2
On 09/04/2014 09:31 AM, Pavel Stehule wrote: 2014-09-04 15:24 GMT+02:00 Jan Wieck j...@wi3ck.info I think I like the COMMAND CONSTRAINT the best so far. I do not, because if it is not part of SQL, the parser in plpgsql will be more complex. You have to inject it into SELECT, UPDATE, INSERT, DELETE Making the COMMAND CONSTRAINT part of the core SQL parser was how I understood Hannu's idea. It would be horrible to tuck that feature away inside of a PL, rather than making it available to all PLs as well as applications that use SQL directly (I think there still are two or three applications that do). Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 1.2
On 09/04/2014 11:16 AM, Joel Jacobson wrote: On 4 sep 2014, at 16:45, Hannu Krosing ha...@2ndquadrant.com wrote: When looking from the other end of the problem, we are using SELECT/INSERT/UPDATE/DELETE *SET statements* in pl/pgsql when we really want scalars. My understanding is that one main drivers of starting this thread was wanting also guaranteed SCALAR versions of these. And wanting them in a way that is easy to use. +1 Thank you! I have been trying to explain this in multiple cryptic ways but failed. You just nailed it! That's *exactly* what I mean! I believe we all agree that the availability of most of the proposed functionality is desirable. I think the main difference between your point of view and that of a few others (me included) is that you prefer a language that is easy and fast to type, with as few key strokes as possible, while we prefer a language that is similar to SQL, which is rather verbose to the reader. At least when the discussion is about the default procedural language installed with the core database system. Such a language should be as similar as possible to SQL. Which is the reason why I believe that the CHECK clause belongs into the main parser, not into the PL. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/12/2014 10:58 AM, Robert Haas wrote: What would really be ideal here is if the JSON code could inform the toast compression code "this many initial bytes are likely incompressible, just pass them through without trying, and then start compressing at byte N", where N is the byte following the TOC. But I don't know that there's a reasonable way to implement that. Sorry, being late for the party. Anyhow, this strikes me as a good basic direction of thought. But I think we should put the burden on the data type, not on toast. To do that, data types could have an optional toast_hint_values() function, which the toast code can call with the actual datum at hand and its default parameter array. The hint values function then can modify that parameter array, telling toast how much to skip, how hard to try (or not at all) and so on. A data type specific function should know much better how to figure out how compressible a particular datum may be. Certainly nothing for 9.4, but it might require changing the toast API in a different way than just handing it an oid and hard-coding the JSONBOID case into toast for 9.4. If we are going to change the API, we might as well do it right. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
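The proposed division of labor is easy to mock up. Everything below is hypothetical - there is no toast_hint_values() in PostgreSQL, and zlib merely stands in for pglz: the data type inspects its own datum and tells the compressor how many leading bytes to skip, and the compressor stores that prefix raw while compressing only the rest:

```python
import zlib

def jsonb_hint(datum, header_len):
    # A data-type-specific hint function (hypothetical): the data type knows
    # its leading offset table compresses poorly, so it says how much to skip.
    return {"skip": header_len}

def toast_compress(datum, hint=None):
    # Toast-side sketch: honor the hint by passing the prefix through raw
    # and running the compressor only on the remainder.
    skip = hint["skip"] if hint else 0
    raw_head = datum[:skip]
    packed_tail = zlib.compress(datum[skip:])
    return raw_head, packed_tail

header = bytes(range(256)) * 4             # stand-in for a 1024-byte offset table
payload = b"the quick brown fox " * 200    # highly repetitive payload
datum = header + payload

head, tail = toast_compress(datum, jsonb_hint(datum, len(header)))
stored = len(head) + len(tail)             # well below len(datum)
```

The point of the sketch is the interface shape: the compressor never needs to guess about the datum's layout, because the data type hands it the parameters.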
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 10:21 AM, Andrew Dunstan wrote: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. This interacts poorly with the code in pglz_compress() that gives up if it's found nothing compressible in the first first_success_by bytes of a value-to-be-compressed. (first_success_by is 1024 in the default set of compression parameters.) [snip] There is plenty of compressible data once we get into the repetitive strings in the payload part --- but that starts at offset 944, and up to that point there is nothing that pg_lzcompress can get a handle on. There are, by definition, no sequences of 4 or more repeated bytes in that area. I think in principle pg_lzcompress could decide to compress the 3-byte sequences consisting of the high-order 24 bits of each offset; but it doesn't choose to do so, probably because of the way its lookup hash table works: * pglz_hist_idx - * * Computes the history table slot for the lookup by the next 4 * characters in the input. * * NB: because we use the next 4 characters, we are not guaranteed to * find 3-character matches; they very possibly will be in the wrong * hash list. This seems an acceptable tradeoff for spreading out the * hash keys more. For jsonb header data, the next 4 characters are *always* different, so only a chance hash collision can result in a match. There is therefore a pretty good chance that no compression will occur before it gives up because of first_success_by. I'm not sure if there is any easy fix for this. We could possibly change the default first_success_by value, but I think that'd just be postponing the problem to larger jsonb objects/arrays, and it would hurt performance for genuinely incompressible data. 
A somewhat painful, but not yet out-of-the-question, alternative is to change the jsonb on-disk representation. Perhaps the JEntry array could be defined as containing element lengths instead of element ending offsets. Not sure though if that would break binary searching for JSON object keys. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. Maybe we should have pglz_compress() look at the *last* 1024 bytes if it can't find anything worth compressing in the first, for values larger than a certain size. It's worth noting that this is a fairly pathological case. AIUI the example you constructed has an array with 100k string elements. I don't think that's typical. So I suspect that unless I've misunderstood the statement of the problem we're going to find that almost all the jsonb we will be storing is still compressible. I also think that a substantial part of the problem of coming up with a representative data sample is because the size of the incompressible data at the beginning is somewhat tied to the overall size of the datum itself. This may or may not be true in any particular use case, but as a general rule of thumb I would assume that the larger the JSONB document, the larger the offset array at the beginning. Would changing 1024 to a fraction of the datum length for the time being give us enough room to come up with a proper solution for 9.5? Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/08/2014 11:18 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 08/07/2014 11:17 PM, Tom Lane wrote: I looked into the issue reported in bug #11109. The problem appears to be that jsonb's on-disk format is designed in such a way that the leading portion of any JSON array or object will be fairly incompressible, because it consists mostly of a strictly-increasing series of integer offsets. Ouch. Back when this structure was first presented at pgCon 2013, I wondered if we shouldn't extract the strings into a dictionary, because of key repetition, and convinced myself that this shouldn't be necessary because in significant cases TOAST would take care of it. That's not really the issue here, I think. The problem is that a relatively minor aspect of the representation, namely the choice to store a series of offsets rather than a series of lengths, produces nonrepetitive data even when the original input is repetitive. This is only because the input data was exact copies of the same strings over and over again. PGLZ can very well compress slightly less identical strings of varying lengths too. Not as well, but well enough. But I suspect such input data would make it fail again, even with lengths. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
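The offsets-versus-lengths point from this exchange can be demonstrated in a few lines. The sketch below uses Python's zlib as a stand-in for pglz (the two differ - zlib has a Huffman stage that pglz lacks - so treat the exact ratios as illustrative only): a strictly increasing series of end offsets, packed as 32-bit integers, compresses far worse than the equivalent series of lengths, which for identical elements degenerates into one repeated value:

```python
import struct
import zlib

n = 5000
elem_len = 17                       # every element has the same length

# jsonb-style header: strictly increasing end offsets -> nonrepetitive bytes
offsets = struct.pack("<%dI" % n, *[(i + 1) * elem_len for i in range(n)])
# alternative header: per-element lengths -> maximally repetitive bytes
lengths = struct.pack("<%dI" % n, *([elem_len] * n))

off_ratio = len(zlib.compress(offsets)) / len(offsets)
len_ratio = len(zlib.compress(lengths)) / len(lengths)
# len_ratio is tiny; off_ratio stays comparatively large
```

As the mail notes, with real-world input of varying element lengths the lengths array would be less perfectly repetitive than this, but still far friendlier to an LZ-style compressor than a monotonically increasing offset series.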
Re: [HACKERS] PL/pgSQL 2
On 09/03/2014 03:14 AM, Joel Jacobson wrote: I'm in favour of Tom's idea. To merely make the plpgsql2 language a way of explicitly saying you want a specific exact combination of features/behaviour/settings which we can implement in plpgsql's existing codebase. Since it was about 100 posts since Tom's post, maybe it's worth repeating for those who missed it: What I would think about is c) plpgsql and plpgsql2 are the same code base, with a small number of places that act differently depending on the language version. We could alternatively get the result by inventing a bunch of pragma declarations, or some similar notation, that control the behavioral changes one-at-a-time. That might even be worth doing anyway, in case somebody likes some of the ideas and others not so much. But I'd see the language version as a convenient shorthand for enabling a specified collection of pretty-localized incompatible behavior changes. If they're not pretty localized, there's going to be a barrier to uptake, very comparable to the python3 analogy mentioned upthread. regards, tom lane I fully agree on this approach. It's maintainable and it will be useful from day 1. One can take that approach to another, more generic level. Just as GUCs can be set on a per-ROLE basis with ALTER USER or ALTER ROLE, PL specific GUCs could be set via ALTER LANGUAGE foo SET The possibility to CREATE LANGUAGE mybetterpl, pointing to the same PL handler function, exists already. And the same mechanism could be used by other languages, like PL/Python (for whatever such language might need such settings). This way an application can define the language settings it needs by simply creating its own language, based on all the possible permutations of those PRAGMA/GUC settings. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
Without having read the entire thread ... On 09/01/2014 05:04 AM, Joel Jacobson wrote: From the top of my head, these are things I personally would want to see in plpgsql2: + Make UPDATE/INSERT/DELETE throw an error if they didn't modify exactly 1 row, as that's the most common use-case, and provide alternative syntax to modify multiple or zero rows. I think this is a completely flawed proposal by definition. SQL itself does not make such an assumption and, as a SET oriented language, never should. A SET is zero or more tuples. Would you also suggest that the PostgreSQL backend throw an ERROR if an UPDATE/INSERT/DELETE doesn't modify exactly 1 row? If not, explain why there should be a difference between SQL executed from the frontend and SQL executed by a PL/pgSQL function. -1 from me. + Change all warnings into errors I suggest being slightly more selective on that one. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/01/2014 09:06 PM, Andrew Dunstan wrote: On 09/01/2014 08:09 PM, Neil Tiffin wrote: The docs also tell you how to avoid having to do this, using dollar quoting. That should be enough alone to suggest postgreSQL start working on a modern, in core, fast, fully supported language. Of course PL/pgSQL works, but so did one-line 5k perl programs that nobody likes today. Everything can be done in assembler, but no one suggests that today. Today, it is all about programmer productivity. PL/pgSQL has a lot of unnecessary stuff that sucks the life out of programmer productivity. And this should be very much a concern of the professionals that support PostgreSQL For example: DECLARE declarations BEGIN statements END This looks a lot like COBOL or Pascal, and today is mostly unnecessary. It looks like Ada, and that's not an accident. (Nor is it a bad thing.) First of all it is [DECLARE declarations] BEGIN statements END; Second statements includes the whole definition above as a statement and controls identifier visibility and such compatible to PL/SQL. You want to lose that? Not such a great idea, IMHO. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 12:20 PM, Joel Jacobson wrote: On Tue, Sep 2, 2014 at 6:09 PM, Kevin Grittner kgri...@ymail.com wrote: Joel Jacobson j...@trustly.com wrote: Sorry for being unclear, I didn't mean to suggest the main concern is updating *all* rows. The main concern is when you have a rather complex UPDATE WHERE clause, aiming to update exactly one row. Some of the expressions might be assertions, to just double-verify the values and to make it stand out that you are checking those expressions. These are two different problems which probably need two different solutions. Making the default behavior of a set-based command be to throw an error if the resulting set is not exactly one row doesn't seem like the right solution to either one of them. I see your point. Basically, we have two types of applications where PL/pgSQL is commonly used. a) OLTP applications where you typically operate on one row for each UPDATE command. Your idea of what an OLTP application is seems flawed. b) Data warehousing applications where you process multiple rows in each UPDATE command. Ditto. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/01/2014 10:41 AM, Joel Jacobson wrote: On Mon, Sep 1, 2014 at 4:26 PM, Craig Ringer cr...@2ndquadrant.com wrote: Well, the idiom: EXECUTE format('SELECT %I FROM %I WHERE $1', col, tbl) USING val; is not lovely. It works, but it's clumsy. This is exactly why we need a new language. All the clumsy stuff we cannot fix in plpgsql can easily be fixed in plpgsql2, with the most beautiful syntax we can come up with. You know that you're running into problems with the SPI subsystem on that one, no? Identifiers cannot be parameters in SPI_prepare(). So how do you propose to make that pretty and performant? Because the moment your pretty language is out there, be sure users will kick your behind whenever they use that pretty stuff on anything but a toy setup and it spirals their servers into a DoS-attack state. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 09:32 AM, Joel Jacobson wrote: I think it's much better to make it the default behaviour in plpgsql2 than to add a new syntax to plpgsql, because then we don't have to argue what to call the keyword or where to put it. This should in NO CASE be any new syntax to plpgsql or plpgsql2. If you cannot get that same syntax to apply to default client queries, then the whole idea is bogus because it will confuse developers more than it helps you with your particular way of thinking. Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 04:16 PM, Andres Freund wrote: On 2014-09-02 19:06:02 +0200, Joel Jacobson wrote: But what do you think about, STRICT UPDATE ...? I quite dislike it. An UPDATE isn't less 'strict' (whatever that means) if it updates more than one row. There's some sense in the way it's used for INTO because it's referring to the INTO. And it's much more obvious what it could mean there. For once I completely agree with Andres. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 06:41 PM, Joshua D. Drake wrote: On 09/02/2014 02:47 PM, Álvaro Hernández Tortosa wrote: Yeah, we differ there. I think having an Oracle compatibility layer in PostgreSQL would be the-next-big-thing we could have. Oracle has an orders-of-magnitude bigger user base than postgres has; and having the ability to attract them would bring us many many more users which, in turn, would benefit us all very significantly. It would be my #1 priority to do in postgres (but yes, I know -guess- how hard and what resources that would require). But dreaming is free :) Oracle compatibility certainly has merit, I just don't see it as useful for core. I would be far more interested in MSSQL compatibility honestly. That said, Postgres itself is a rockstar and I think we can make our own case without having to copy others. PL/pgSQL's syntax was modelled to look like PL/SQL, which is an Ada/COBOL lookalike. Instead of trying to mimic that, or a T-SQL thing ... maybe it is time to come up with a true PostgreSQL-specific PL for a change? Just for the sake of being something new, and not a copy of some old opossum that's been rotting like roadkill on the side of the highway for a decade already. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] PL/pgSQL 2
On 09/02/2014 06:56 PM, Andrew Dunstan wrote: People are free to do what they want, but to my mind that would be a massive waste of resources, and probably imposing a substantial extra maintenance burden on the core committers. I hear you and agree to some degree. But at the same time I remember that one of the strengths of Postgres used to be its ability to incorporate new ideas. This seems to be one of those cases. Instead of forking plpgsql2, what about designing a completely new PL/postgres from scratch? It will only take 3-10 years, but I bet it will be worth it after all. And I mean that. No sarcasm. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] How can we make beta testing better?
On 04/17/14 15:16, Merlin Moncure wrote: On Tue, Apr 15, 2014 at 4:47 PM, Josh Berkus j...@agliodbs.com wrote: Hackers, I think 9.3 has given us evidence that our users aren't giving new versions of PostgreSQL substantial beta testing, or if they are, they aren't sharing the results with us. How can we make beta testing better and more effective? How can we get more users to actually throw serious workloads at new versions and share the results? I've tried a couple of things over the last two years and they haven't worked all that well. Since we're about to go into another beta testing period, we need something new. Ideas? I've seen lots of bugs reported and fixed in the beta period over the years. My take is that it's basically unrealistic to expect volunteer beta testers to replace bona fide regression testing. I think it's a pretty fair statement that we've had some QC issues in the general area of replication technologies. What this is indicating to me is that replication needs substantially more coverage in 'make check'. Since I'm wishing for things, it would be nice to see an expansion of the buildfarm so that we could [optionally] run various performance tests as well as various replication scenarios. Then we could go back to users and say, please donate 'repeatable tests and machines to run them on' and reap the long term value. Not at all making light out of any of this...it's a huge project. The problem with testing replication is that it doesn't fit well into our standard regression testing. There are way too many moving parts as well as dependencies on the underlying OS and network topology. You will discover a ton of race conditions once you actually move from testing with multiple postmasters (so you can kill one) on the same box to using multiple virtual machines, and instead of completely severing a network connection, using some packet shaping/filtering to introduce packet loss, limited bandwidth, async routing and so on.
At least that is my experience from throwing that sort of sh*t at Slony at full speed. Not trying to discourage anyone from trying. Just saying that it doesn't fit into our existing regression test framework. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/13/14 08:27, Marko Kreen wrote: On Sat, Apr 12, 2014 at 02:10:13PM -0400, Jan Wieck wrote: Since it doesn't seem to produce any side effects, I'd think that making the snapshot unique within txid_current_snapshot() and filtering duplicates on input should be sufficient and eligible for backpatching. Agreed. The attached patch adds a unique loop to the internal sort_snapshot() function and skips duplicates on input. The git commit is here: https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00

 static void
 sort_snapshot(TxidSnapshot *snap)
 {
+	txid	last = 0;
+	int		nxip, idx1, idx2;
+
 	if (snap->nxip > 1)
+	{
 		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+		nxip = snap->nxip;
+		idx1 = idx2 = 0;
+		while (idx1 < nxip)
+		{
+			if (snap->xip[idx1] != last)
+				last = snap->xip[idx2++] = snap->xip[idx1];
+			else
+				snap->nxip--;
+			idx1++;
+		}
+	}
 }

I think you need to do SET_VARSIZE also here. Alternative is to move SET_VARSIZE after sort_snapshot(). And it seems the drop-double-txid logic should be added also to txid_snapshot_recv(). It seems weird to have it behave differently from txid_snapshot_in(). Thanks, yes on both issues. Will create another patch. Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/13/14 14:22, Jan Wieck wrote: On 04/13/14 08:27, Marko Kreen wrote: I think you need to do SET_VARSIZE also here. Alternative is to move SET_VARSIZE after sort_snapshot(). And it seems the drop-double-txid logic should be added also to txid_snapshot_recv(). It seems weird to have it behave differently from txid_snapshot_in(). Thanks, yes on both issues. Will create another patch. New patch attached. New github commit is https://github.com/wieck/postgres/commit/b8fd0d2eb78791e5171e34aecd233fd06218890d Thanks again, Jan -- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index a005e67..1023566 100644
*** a/src/backend/utils/adt/txid.c
--- b/src/backend/utils/adt/txid.c
*** cmp_txid(const void *aa, const void *bb)
*** 131,146 ****
  }

  /*
!  * sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
  	if (snap->nxip > 1)
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
  }
--- 131,161 ----
  }

  /*
!  * unique sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+
  	if (snap->nxip > 1)
+ 	{
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
  }
*** parse_snapshot(const char *str)
*** 294,304 ****
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order */
! 		if (val < xmin || val >= xmax || val <= last_val)
  			goto bad_format;

! 		buf_add_txid(buf, val);
  		last_val = val;

  		if (*str == ',')
--- 309,320 ----
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order and skip duplicates */
! 		if (val < xmin || val >= xmax || val < last_val)
  			goto bad_format;

! 		if (val != last_val)
! 			buf_add_txid(buf, val);
  		last_val = val;

  		if (*str == ',')
*** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 361,368 ****
  {
  	TxidSnapshot *snap;
  	uint32		nxip,
! 				i,
! 				size;
  	TxidEpoch	state;
  	Snapshot	cur;
--- 377,383 ----
  {
  	TxidSnapshot *snap;
  	uint32		nxip,
! 				i;
  	TxidEpoch	state;
  	Snapshot	cur;
*** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 381,389 ****
  	/* allocate */
  	nxip = cur->xcnt;
! 	size = TXID_SNAPSHOT_SIZE(nxip);
! 	snap = palloc(size);
! 	SET_VARSIZE(snap, size);

  	/* fill */
  	snap->xmin = convert_xid(cur->xmin, &state);
--- 396,402 ----
  	/* allocate */
  	nxip = cur->xcnt;
! 	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));

  	/* fill */
  	snap->xmin = convert_xid(cur->xmin, &state);
*** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 395,400 ****
--- 408,416 ----
  	/* we want them guaranteed to be in ascending order */
  	sort_snapshot(snap);

+ 	/* we set the size here because the sort may have removed duplicate xips */
+ 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(snap->nxip));
+
  	PG_RETURN_POINTER(snap);
  }
*** txid_snapshot_recv(PG_FUNCTION_ARGS)
*** 472,489 ****
  	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  	snap->xmin = xmin;
  	snap->xmax = xmax;
- 	snap->nxip = nxip;
- 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(nxip));

  	for (i = 0; i < nxip; i++)
  	{
  		txid		cur = pq_getmsgint64(buf);

! 		if (cur <= last || cur < xmin || cur >= xmax)
  			goto bad_format;
  		snap->xip[i] = cur;
  		last = cur;
  	}

  	PG_RETURN_POINTER(snap);

  bad_format:
--- 488,514 ----
  	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  	snap->xmin = xmin;
  	snap->xmax = xmax;

  	for (i = 0; i < nxip; i++)
  	{
  		txid		cur = pq_getmsgint64(buf
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 03:27, Heikki Linnakangas wrote: On 04/12/2014 12:07 AM, Jan Wieck wrote: Hackers, the Slony team has been getting occasional reports of a problem with the txid_snapshot data type. The symptom is that a txid_snapshot on output lists the same txid multiple times in the xip list part of the external representation. This string is later rejected by txid_snapshot_in() when trying to determine if a particular txid is visible in that snapshot using the txid_visible_in_snapshot() function. I have not yet been able to reproduce this problem in a lab environment. It might be related to subtransactions and/or two-phase commit (at least one user is using both of them). The reported PostgreSQL version involved in that case was 9.1. It's two-phase commit. When preparing a transaction, the state of the transaction is first transferred to a dummy PGXACT entry, and then the PGXACT entry of the backend is cleared. There is a transient state when both PGXACT entries have the same xid. You can reproduce that by putting a sleep or breakpoint in PrepareTransaction(), just before the ProcArrayClearTransaction(MyProc); call. If you call txid_current_snapshot() from another session at that point, it will output two duplicate xids. (you will have to also commit one more unrelated transaction to bump up xmax). Thanks, that explains it. At this point I would find it extremely helpful to sanitize the external representation in txid_snapshot_out() while emitting some NOTICE level logging when this actually happens. I am aware that this does amount to a functional change for a back release, but considering that the _out() generated external representation of an existing binary datum won't pass the type's _in() function, I argue that such a change is warranted. Especially since this problem could possibly corrupt a dump. Hmm. Do we expect snapshots to be stored in tables, and included in a dump?
I don't think we can guarantee that will work, at least not across versions, as the way we handle snapshot internally can change. At least Londiste and Slony do store snapshots as well as xids in tables and assuming that the txid epoch is properly bumped, that information is useful and valid after a restore. But yeah, we probably should do something about that. The most straightforward fix would be to scan the array in txid_current_snapshot() and remove any duplicates. The code in txid_snapshot_in() checks that the xip list is ascending. txid_snapshot_out() does not sort the list, so it must already be sorted when the snapshot itself is created. That scan would be fairly simple. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 08:38, Andres Freund wrote: Since it's sorted there, that should be fairly straightforward. Won't fix already created and stored datums tho. Maybe _in()/parse_snapshot() should additionally skip over duplicate values? Looks easy enough. There is the sort ... missed that when glancing over the code earlier. Right, that is easy enough and looks like an acceptable fix for back branches too. Thanks, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 10:09, Greg Stark wrote: A pg_restore would start a new xid space from FirstNormalXid which would obviously not work with any stored txids. http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 11:18, Andres Freund wrote: On 2014-04-12 11:15:09 -0400, Jan Wieck wrote: On 04/12/14 10:09, Greg Stark wrote: A pg_restore would start a new xid space from FirstNormalXid which would obviously not work with any stored txids. http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html Using that as part of any sort of routine task IMNSHO is a seriously bad idea. Nobody is advocating doing so. Jan -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Problem with txid_snapshot_in/out() functionality
On 04/12/14 10:03, Andres Freund wrote: On 2014-04-12 09:47:24 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 04/12/2014 12:07 AM, Jan Wieck wrote: the Slony team has been getting seldom reports of a problem with the txid_snapshot data type. The symptom is that a txid_snapshot on output lists the same txid multiple times in the xip list part of the external representation. It's two-phase commit. When preparing a transaction, the state of the transaction is first transfered to a dummy PGXACT entry, and then the PGXACT entry of the backend is cleared. There is a transient state when both PGXACT entries have the same xid. Hm, yeah, but why is that intermediate state visible to anyone else? Don't we have exclusive lock on the PGPROC array while we're doing this? It's done outside the remit of ProcArray lock :(. And documented to lead to duplicate xids in PGXACT. EndPrepare(): /* * Mark the prepared transaction as valid. As soon as xact.c marks * MyPgXact as not running our XID (which it will do immediately after * this function returns), others can commit/rollback the xact. * * NB: a side effect of this is to make a dummy ProcArray entry for the * prepared XID. This must happen before we clear the XID from MyPgXact, * else there is a window where the XID is not running according to * TransactionIdIsInProgress, and onlookers would be entitled to assume * the xact crashed. Instead we have a window where the same XID appears * twice in ProcArray, which is OK. */ MarkAsPrepared(gxact); It doesn't sound too hard to essentially move PrepareTransaction()'s ProcArrayClearTransaction() into MarkAsPrepared() and rejigger the locking to remove the intermediate state. But I think it'll lead to bigger changes than we'd be comfortable backpatching. If we don't, aren't we letting other backends see non-self-consistent state in regards to who holds which locks, for example? 
I think that actually works out ok, because the locks aren't owned by xids/xacts, but procs. Otherwise we'd be in deep trouble in CommitTransaction() as well, where ProcArrayEndTransaction() is clearing that state. After the whole xid transfer, there's PostPrepare_Locks() transferring the locks. Since it doesn't seem to produce any side effects, I'd think that making the snapshot unique within txid_current_snapshot() and filtering duplicates on input should be sufficient and eligible for backpatching. The attached patch adds a unique loop to the internal sort_snapshot() function and skips duplicates on input. The git commit is here: https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00 Regards, Jan -- Jan Wieck Senior Software Engineer http://slony.info

diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index a005e67..44f7880 100644
*** a/src/backend/utils/adt/txid.c
--- b/src/backend/utils/adt/txid.c
*** cmp_txid(const void *aa, const void *bb)
*** 131,146 ****
  }

  /*
!  * sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
  	if (snap->nxip > 1)
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
  }
--- 131,161 ----
  }

  /*
!  * unique sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+
  	if (snap->nxip > 1)
+ 	{
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
  }
*** parse_snapshot(const char *str)
*** 294,304 ****
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order */
! 		if (val < xmin || val >= xmax || val <= last_val)
  			goto bad_format;

! 		buf_add_txid(buf, val);
  		last_val = val;

  		if (*str == ',')
--- 309,320 ----
  		val = str2txid(str, &endp);
  		str = endp;

! 		/* require the input to be in order and skip duplicates */
! 		if (val
[HACKERS] Problem with txid_snapshot_in/out() functionality
Hackers, the Slony team has been getting occasional reports of a problem with the txid_snapshot data type. The symptom is that a txid_snapshot on output lists the same txid multiple times in the xip list part of the external representation. This string is later rejected by txid_snapshot_in() when trying to determine if a particular txid is visible in that snapshot using the txid_visible_in_snapshot() function. I have not yet been able to reproduce this problem in a lab environment. It might be related to subtransactions and/or two-phase commit (at least one user is using both of them). The reported PostgreSQL version involved in that case was 9.1. At this point I would find it extremely helpful to sanitize the external representation in txid_snapshot_out() while emitting some NOTICE level logging when this actually happens. I am aware that this does amount to a functional change for a back release, but considering that the _out() generated external representation of an existing binary datum won't pass the type's _in() function, I argue that such a change is warranted. Especially since this problem could possibly corrupt a dump. Comments? Jan -- Jan Wieck Senior Software Engineer http://slony.info
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 05/01/13 04:57, Fabien COELHO wrote: Add --throttle to pgbench. Each client is throttled to the specified rate, which can be expressed in tps or in time (s, ms, us). Throttling is achieved by scheduling transactions along a Poisson distribution. This is an update of the previous proposal which fixes a typo in the sgml documentation. The use case of the option is to be able to generate a continuous gentle load for functional tests, eg in a practice session with students or for testing features on a laptop. Why does this need two option formats (-H and --throttle)? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)
On 06/19/13 14:34, Fabien COELHO wrote: The use case of the option is to be able to generate a continuous gentle load for functional tests, eg in a practice session with students or for testing features on a laptop. Why does this need two option formats (-H and --throttle)? On the latest version it is --rate and -R. Because you may want to put something very readable and understandable in a script and like long options, or have to type it interactively every day in a terminal and like short ones. Most UNIX commands include both kinds. Would it make sense then to add long versions for all the other standard options too? Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] (auto)vacuum truncate exclusive lock
On 4/12/2013 1:57 PM, Tom Lane wrote: Kevin Grittner kgri...@ymail.com writes: Tom Lane t...@sss.pgh.pa.us wrote: I think that the minimum appropriate fix here is to revert the hunk I quoted, ie take out the suppression of stats reporting and analysis. I'm not sure I understand -- are you proposing that is all we do for both the VACUUM command and autovacuum? No, I said that was the minimum fix. Looking again at the patch, I note this comment:

/*
 * We failed to establish the lock in the specified number of
 * retries. This means we give up truncating. Suppress the
 * ANALYZE step. Doing an ANALYZE at this point will reset the
 * dead_tuple_count in the stats collector, so we will not get
 * called by the autovacuum launcher again to do the truncate.
 */

and I suppose the rationale for suppressing the stats report was this same idea of lying to the stats collector in order to encourage a new vacuum attempt to happen right away. Now I'm not sure that that's a good idea at all --- what's the reasoning for thinking the table will be any less hot in thirty seconds? But if it is reasonable, we need a redesign of the reporting messages, not just a hack to not tell the stats collector what we did. Yes, that was the rationale behind it, combined with don't change function call sequences and more all over the place. The proper solution would eventually be to add a block number to the stats held by the stats collector, which preserves the information about the last filled block of the table. Then decouple the truncate from vacuum/autovacuum. Vacuum/autovacuum will set that block number when they detect the trailing free space. The analyze step can happen just as usual and reset stats, which doesn't reset that block number. The autovacuum launcher goes through its normal logic for launching autovac or autoanalyze. If it doesn't find any of those to do but the last-used-block is set, it launches the separate lazy truncate step.
This explicitly moves the truncate, which inherently requires the exclusive lock and therefore is undesirable even in a manual vacuum, into the background. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] (auto)vacuum truncate exclusive lock
On 4/12/2013 2:08 PM, Alvaro Herrera wrote: Tom Lane escribió: Are you saying you intend to revert that whole concept? That'd be okay with me, I think. Otherwise we need some thought about how to inform the stats collector what's really happening. Maybe what we need is to consider table truncation as a separate activity from vacuuming. Then autovacuum could invoke it without having to do a full-blown vacuum. For this to work, I guess we would like to separately store the status of the back-scan in pgstat somehow (I think a boolean flag suffices: were we able to truncate all pages that appeared to be empty?) Should have read the entire thread before responding :) Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] (auto)vacuum truncate exclusive lock
On 4/18/2013 11:44 AM, Jan Wieck wrote: Yes, that was the rationale behind it combined with don't change function call sequences and more all over the place. function call signatures -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange OOM errors with EXECUTE in PL/pgSQL
On 12/20/2012 4:47 PM, Dimitri Fontaine wrote: Tom Lane t...@sss.pgh.pa.us writes: The reason this fails is that you've got a half-megabyte source string, and each of the 11000 plans that are due to be created from it saves its own copy of the source string. Hence, 5500 megabytes needed just for source strings. We could possibly fix this by inventing some sort of reference-sharing arrangement (which'd be complicated and fragile) or by not storing the source strings with the plans (which'd deal a serious blow to our ability to provide helpful error messages). Neither answer seems appealing. I don't readily see how complicated and fragile it would be, it looks like a hash table of symbols pointing to source strings and a reference counting, and each plan would need to reference that symbol. Now maybe that's what you call complicated and fragile, and even if not, I'm not really sure it would pull its weight. The use case of sending over and over again *in a given session* the exact same query string without using PREPARE/EXECUTE looks like quite tiny. That sounds like a bit of overkill to me. Don't all the plans result as a plan list from a multi-statement query string, which was parsed into a query tree list and each single query tree then planned? I don't think there is any way that a single one of those trees (parse or plan) will be free'd separately. If that is true, then proper usage of memory contexts would make reference counting obsolete, even though all plans refer to the same copy. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MySQL search query is not executing in Postgres DB
On 12/14/2012 3:20 PM, Robert Haas wrote: On Fri, Dec 14, 2012 at 2:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: ... In more than ten years of working with PostgreSQL, I've never encountered where the restriction at issue here prevented a bug. It's only annoyed me and broken my application code (when moving from PostgreSQL 8.2 to PostgreSQL 8.3, never mind any other database!). There are quite a few examples in our archives of application bugs that would have been prevented, or later were prevented, by the 8.3 changes that reduced the system's willingness to apply implicit casts to text. I recall for instance cases where people got wrong/unexpected answers because the system was sorting what-they-thought-were-timestamp values textually. So I find such sweeping claims to be demonstrably false, and I'm suspicious of behavioral changes that are proposed with such arguments backing them. I think you're mixing apples and oranges. The whole point of the patch on the table - which, if you recall, was designed originally by you and merely implemented by me - was to change the behavior only in the cases where there's only one function with the appropriate name and argument count. The ambiguous cases that 8.3+ helpfully prevent are those where overloading is in use and the choice of which function to call is somewhat arbitrary and perhaps incorrectly-foreseen by the user. Those changes also have the side-effect of preventing a straightforward function call from working without casts even in cases where no overloading is in use - and making that case work is completely different from making the ambiguous case arbitrarily pick one of the available answers. FWIW I for one thought that casting more liberal in the case at hand, where there is only one function with that name and number of arguments, would be a good thing. 
My only concern with the patch presented was that changing make_fn_arguments() in that way may have unwanted side effects, because it is called from different locations and the use of COERCION_IMPLICIT may actually guard against something that we don't want to allow. I don't have any evidence that it does, just the concern that it may. Perhaps make_fn_arguments() needs to receive the coercion context as an argument, and the caller hands in COERCION_ASSIGNMENT only in the case at hand?

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
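A minimal sketch of the suggestion in the last paragraph (hypothetical stand-ins, not the actual parser code): make_fn_arguments() would receive the coercion context from its caller, which relaxes to assignment coercion only when exactly one candidate matched.

```c
/* Mirrors the shape of PostgreSQL's CoercionContext enum; this is an
 * illustrative stand-in, not the real parser code. */
typedef enum
{
    COERCION_IMPLICIT,   /* coercion in context of expression */
    COERCION_ASSIGNMENT, /* coercion in context of assignment */
    COERCION_EXPLICIT    /* explicit cast operation */
} CoercionContext;

/* Hypothetical caller-side choice: only when exactly one function
 * with the right name and argument count exists do we relax to
 * assignment-level coercion; ambiguous overloads keep the strict
 * implicit-only behavior. */
CoercionContext
context_for_call(int ncandidates)
{
    return (ncandidates == 1) ? COERCION_ASSIGNMENT : COERCION_IMPLICIT;
}
```

This keeps the guard intact for every other code path into make_fn_arguments(), since those callers would simply continue to pass COERCION_IMPLICIT.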
Re: [HACKERS] PRIVATE columns
On 12/12/2012 1:12 PM, Simon Riggs wrote:
> Currently, ANALYZE collects data on all columns and stores these samples in pg_statistic where they can be seen via the view pg_stats. In some cases we have data that is private and we do not wish others to see it, such as patient names. This becomes more important when we have row security. Perhaps that data can be protected, but it would be even better if we simply didn't store value-revealing statistic data at all.

Such private data is seldom the target of searches, or if it is, it is mostly evenly distributed anyway. Would protecting it the same way we protect the passwords in pg_authid be sufficient?

Jan

> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram. Doing that would avoid inadvertent leaking of potentially private information. (SET STATISTICS 0 simply skips collection of statistics altogether.) To implement this, one way would be to allow
> ALTER TABLE foo ALTER COLUMN foo1 SET STATISTICS PRIVATE;
> or we could use another magic value like -2 to request this case. (Yes, I am aware we could use a custom datatype with a custom typanalyze for this, but that breaks other things.)
> Thoughts?

--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] PRIVATE columns
On 12/12/2012 3:12 PM, Simon Riggs wrote:
> On 12 December 2012 19:13, Jan Wieck janwi...@yahoo.com wrote:
>> Such private data is seldom the target of searches, or if it is, it is mostly evenly distributed anyway. Would protecting it the same way we protect the passwords in pg_authid be sufficient?
>
> The user backend does need to be able to access the stats data during optimization. It's hard to have data accessible and yet impose limits on the uses to which that can be put. If we have row security on the table but no equivalent capability on the stats, then we'll have leakage. e.g. set statistics 1, ANALYZE, then leak 1 credit card number.

Like for the encrypted password column of pg_authid, I don't see any reason why the values in the stats columns need to be readable by anyone but a superuser at all. Do you?

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/9/2012 2:37 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> Based on the discussion and what I feel is a consensus, I have created an updated patch that has no GUC at all. The hard coded parameters in include/postmaster/autovacuum.h are
>>
>> AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
>> AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
>> AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
>
> Since these really aren't part of the external API and are only referenced in vacuumlazy.c, it seems more appropriate to define them there.
>
>> I gave that the worst workload I can think of: a pgbench-style application that throws about 10 transactions per second at it, so that there is constantly the need to give up the lock due to conflicting lock requests and then reacquire it. A cleanup process is periodically moving old tuples from the history table to an archive table, making history a rolling window table. And a third job 2-3 times per minute produces a transaction lasting 10 seconds, forcing autovacuum to give up on the lock reacquisition. Even with that workload, autovacuum is slowly but steadily chopping away at the table.
>
> Applies with minor offsets, builds without warning, and passes `make check-world`. My tests based on your earlier posted test script confirm the benefit. There are some minor white-space issues; for example git diff --color shows some trailing spaces in comments.

Cleaned up all of those.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin

[Attached patch: diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c, index 5036849..1686b88. It adds the storage/proc.h and portability/instr_time.h includes, the hard-coded AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL (20ms), AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL (50ms) and AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT (5000ms) settings, a lock_waiter_detected flag in LVRelStats, and changes to lazy_vacuum_rel() and lazy_truncate_heap() so that an early terminated truncate attempt suppresses the pgstat report and cancels a pending ANALYZE. The rest of the diff is cut off in the archive.]
Re: [HACKERS] MySQL search query is not executing in Postgres DB
I am aware that in the case at hand, the call to make_fn_arguments() is made with the only possible candidate function, so changing COERCION_IMPLICIT to COERCION_ASSIGNMENT inside of make_fn_arguments() is correct. But I wonder if this may have unwanted side effects for other code paths into make_fn_arguments(), like from optimizer/util/clauses.c or from parser/parse_oper.c. I'm not saying it does, just asking if that could be the case.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/6/2012 12:45 PM, Robert Haas wrote:
> On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck janwi...@yahoo.com wrote:
>> That sort of dynamic approach would indeed be interesting. But I fear that it is going to be complex at best. The amount of time spent in scanning heavily depends on the visibility map. The initial vacuum scan of a table can take hours or more, but it does update the visibility map even if the vacuum itself is aborted later. The next vacuum may scan that table in almost no time at all, because it skips all blocks that are marked all visible.
>
> Well, if that's true, then there's little reason to worry about giving up quickly, because the next autovacuum a minute later won't consume many resources.

"Almost no time" is of course relative to what an actual scan and dead tuple removal cost. Looking at a table with 3 GB of dead tuples at the end, the initial vacuum scan takes hours. When vacuum comes back to this table, cleaning up a couple megabytes of newly deceased tuples and then skipping over the all-visible pages may take a minute.

Based on the discussion and what I feel is a consensus, I have created an updated patch that has no GUC at all. The hard coded parameters in include/postmaster/autovacuum.h are

AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

I gave that the worst workload I can think of: a pgbench-style application that throws about 10 transactions per second at it, so that there is constantly the need to give up the lock due to conflicting lock requests and then reacquire it. A cleanup process is periodically moving old tuples from the history table to an archive table, making history a rolling window table. And a third job 2-3 times per minute produces a transaction lasting 10 seconds, forcing autovacuum to give up on the lock reacquisition. Even with that workload, autovacuum is slowly but steadily chopping away at the table.
Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin

[Attached patch: diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c, index c9253a9..fd3e91e. It adds the storage/proc.h and portability/instr_time.h includes, a lock_waiter_detected flag in LVRelStats, and changes to lazy_vacuum_rel() and lazy_truncate_heap() so that an early terminated truncate attempt suppresses the pgstat report and cancels a pending ANALYZE. The rest of the diff is cut off in the archive.]
Re: [HACKERS] autovacuum truncate exclusive lock round two
Kevin and Robert are well aware of most of the below. I just want to put this out here so other people who haven't followed the discussion too closely may chime in. Some details on the problem:

First of all, the vacuum scan must detect a minimum of 1000 pages as possibly being all empty at the end of a relation. Without at least 8MB of possible free space at the end, the code never calls lazy_truncate_heap(). This means we don't have to worry about tiny relations at all. Any relation that stays under 8MB of turnover between autovacuum VACUUM runs can never get into this situation.

Relations that have higher turnover than that, but at random places or with a high percentage of rather static rows, don't fall into the problem category either. They may never accumulate that much contiguous free space at the end. The turnover will be reusing free space all over the place. So again, lazy_truncate_heap() will never be called.

Relations that eventually build up more than 8MB of free space at the end aren't automatically a problem either. The autovacuum VACUUM scan just scanned those pages at the end, which means that the safety scan for truncate, done under exclusive lock, is checking exactly those pages, and most likely they are still in memory. The truncate safety scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case (I have found so far) is rolling window tables of significant size that can bloat to multiple times their normal size every now and then. This is indeed a rare corner case and I have no idea how many users may (unknowingly) be suffering from it. It triggers lazy_truncate_heap() with a significant amount of free space to truncate: the table bloats, then all the bloat is deleted, and the periodic 100% turnover guarantees that all live tuples will shortly circulate in lower block numbers again, with gigabytes of empty space at the end. Even this by itself isn't a problem yet.
The existing code may do the job just fine unless there is frequent access to that very table. Only with this special combination of circumstances do we actually have a problem. Only now, with a significant amount of free space at the end and frequent access to the table, does the truncate safety scan take long enough, and actually have to read pages from disk, to interfere with client transactions.

At this point, the truncate safety scan may have to be interrupted to let the frequent other traffic go through. This is what we accomplish with the autovacuum_truncate_lock_check interval, where we voluntarily release the lock whenever someone else needs it. I agree with Kevin that a 20ms check interval is reasonable, because the code to check this is even less expensive than releasing the exclusive lock we're holding.

At the same time, completely giving up and relying on the autovacuum launcher to restart another worker isn't as free as it looks either. The next autovacuum worker will have to do the VACUUM scan first, before getting to the truncate phase. We cannot just skip blindly to the truncate code: if the truncate were repeatedly abandoned, the table would deteriorate and accumulate dead tuples again, and the removal of dead tuples and their index tuples has priority. As said earlier in the discussion, the VACUUM scan will skip pages that are marked as completely visible, so it won't physically read the majority of the empty pages at the end of the table over and over. But it will at least scan all pages that have been modified since the last VACUUM run.

To me this means that we want to be more generous to the truncate code about acquiring the exclusive lock. In my tests, I've seen that a rolling window table with a live set of just 10 MB or so, but empty space of 3 GB, can still have a 2 minute VACUUM scan time. Throwing that work away because we can't acquire the exclusive lock within 2 seconds is a waste of effort.
Something between 2 and 60 seconds sounds more reasonable to me.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
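The entry condition described at the top of this mail can be sketched as follows (a simplification; the real gate in vacuumlazy.c is more involved):

```c
#include <stdbool.h>

typedef unsigned int BlockNumber;

#define REL_TRUNCATE_MINIMUM 1000   /* pages: 8MB at the default 8kB page size */

/* Simplified sketch of the guard: only bother taking the exclusive
 * lock when at least 1000 possibly-empty pages sit at the end of the
 * relation. nonempty_pages is "last nonempty page + 1", as in the
 * LVRelStats comment quoted later in this thread. */
bool
worth_trying_truncate(BlockNumber rel_pages, BlockNumber nonempty_pages)
{
    return rel_pages >= nonempty_pages + REL_TRUNCATE_MINIMUM;
}
```

This is why tiny relations, and tables whose turnover reuses space all over the place, never reach lazy_truncate_heap() at all.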
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/5/2012 2:00 PM, Robert Haas wrote:
> Maybe it'd be sensible to relate the retry time to the time spent vacuuming the table. Say, if the amount of time spent retrying exceeds 10% of the time spent vacuuming the table, with a minimum of 1s and a maximum of 1min, give up. That way, big tables will get a little more leeway than small tables, which is probably appropriate.

That sort of dynamic approach would indeed be interesting. But I fear that it is going to be complex at best. The amount of time spent in scanning heavily depends on the visibility map. The initial vacuum scan of a table can take hours or more, but it does update the visibility map even if the vacuum itself is aborted later. The next vacuum may scan that table in almost no time at all, because it skips all blocks that are marked all visible. So the total time the scan takes is no yardstick I'd use.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
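Robert's proposal, as a sketch (numbers taken straight from the quoted paragraph; the helper name is invented):

```c
/* Retry budget: 10% of the time spent vacuuming, clamped to
 * [1 second, 1 minute], per the suggestion above. All values in ms. */
long
truncate_retry_budget_ms(long vacuum_time_ms)
{
    long budget = vacuum_time_ms / 10;

    if (budget < 1000)
        budget = 1000;      /* minimum of 1s */
    if (budget > 60000)
        budget = 60000;     /* maximum of 1min */
    return budget;
}
```

As Jan notes, the input here is unreliable: a second vacuum pass skips all-visible pages, so the previous run's scan time says little about the next one.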
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/4/2012 8:06 AM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> I believe the check interval needs to be decoupled from the deadlock_timeout again.
>
> OK
>
>> This will leave us with 2 GUCs at least.
>
> Hmm. What problems do you see with hard-coding reasonable values?

The question is: what is reasonable?

Let's talk about the time to (re)acquire the lock first. In the cases where truncating a table can hurt, we are dealing with many gigabytes. The original vacuumlazy scan of them can take hours if not days. During that scan the vacuum worker has probably spent many hours napping in the vacuum delay points. For me, a 50ms interval for 5 seconds would be reasonable for (re)acquiring that lock. The reasoning behind it is that we need some sort of retry mechanism, because if autovacuum just gave up the exclusive lock because someone needed access, it is more or less guaranteed that an immediate attempt to reacquire it will fail until that waiter has committed. But if it can't get the lock after 5 seconds, the system seems busy enough that autovacuum should come back much later, when the launcher kicks it off again. I don't care much about occupying that autovacuum worker for a few seconds. It just spent hours vacuuming that very table. How much harm will a couple more seconds do?

The check interval for the LockHasWaiters() call, however, depends very much on the response time constraints of the application. A 200ms interval, for example, would cause the truncate phase to hold onto the exclusive lock for at least 200ms. That means a steady stream of short running transactions would see 100ms of blocking on average, 200ms max. For many applications that is probably OK. If your response time constraint is <= 50ms on 98% of transactions, you might want to have that knob though. I admit I really have no idea what the most reasonable default for that value would be. Something between 50ms and deadlock_timeout/2, I guess.
Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
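The 50ms-for-5-seconds reacquisition schedule described above can be sketched deterministically (succeed_on_attempt stands in for the moment the conflicting transaction commits; no real lock manager calls are made):

```c
#include <stdbool.h>

#define TRUNCATE_LOCK_WAIT_INTERVAL_MS 50    /* nap between attempts */
#define TRUNCATE_LOCK_TIMEOUT_MS       5000  /* give up after this long */

/* Returns true if the (simulated) AccessExclusiveLock was obtained
 * before the timeout budget ran out; *attempts reports how many tries
 * were made. succeed_on_attempt = 0 means the lock never frees up. */
bool
reacquire_with_retry(int succeed_on_attempt, int *attempts)
{
    int elapsed_ms = 0;

    *attempts = 0;
    for (;;)
    {
        (*attempts)++;
        if (succeed_on_attempt != 0 && *attempts >= succeed_on_attempt)
            return true;            /* conditional lock attempt succeeded */
        if (elapsed_ms >= TRUNCATE_LOCK_TIMEOUT_MS)
            return false;           /* busy system: wait for the launcher */
        elapsed_ms += TRUNCATE_LOCK_WAIT_INTERVAL_MS;   /* nap and retry */
    }
}
```

With these numbers a worker makes at most 101 attempts (5000ms of napping at 50ms each, plus the initial try) before standing down and leaving the table to the next autovacuum round.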
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/4/2012 1:51 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> [arguments for GUCs]
>
> This is getting confusing. I thought I had already conceded the case for autovacuum_truncate_lock_try, and you appeared to spend most of your post arguing for it anyway. I think. It's a little hard to tell. Perhaps the best thing is to present the issue to the list and solicit more opinions on what to do. Please correct me if I mis-state any of this.
>
> The primary problem this patch is solving is that in some workloads, autovacuum will repeatedly try to truncate the unused pages at the end of a table, but will continually get canceled after burning resources because another process wants to acquire a lock on the table which conflicts with the one held by autovacuum. This is handled by the deadlock checker, so another process must block for the deadlock_timeout interval each time. All work done by the truncate phase of autovacuum is lost on each interrupted attempt. Statistical information is not updated, so another attempt will trigger the next time autovacuum looks at whether to vacuum the table. It's obvious that this pattern not only fails to release potentially large amounts of unused space back to the OS, but the headbanging can continue to consume significant resources for an extended period, and the repeated blocking for deadlock_timeout could cause latency problems.
>
> The patch has the truncate work, which requires AccessExclusiveLock, check at intervals for whether another process is waiting on its lock. That interval is one of the timings we need to determine, and for which a GUC was initially proposed. I think that the check should be fast enough that doing it once every 20ms as a hard-coded interval would be good enough.
>
> When it sees this situation, it truncates the file for as far as it has managed to get, releases its lock on the table, sleeps for an interval, and then checks to see if the lock has become available again. How long it should sleep between tries to reacquire the lock is another possible GUC. Again, I'm inclined to think that this could be hard-coded. Since autovacuum was knocked off-task after doing some significant work, I'm inclined to make this interval a little bigger, but I don't think it matters a whole lot.

Anything between 20ms and 100ms seems sane. Maybe 50ms?

> At any point that it is unable to acquire the lock, there is a check for how long this autovacuum task has been starved for the lock. Initially I was arguing for twice the deadlock_timeout on the basis that this would probably be short enough not to leave the autovacuum worker sidelined for too long, but long enough for the attempt to get past a single deadlock between two other processes. This is the setting Jan is least willing to concede. If the autovacuum worker does abandon the attempt, it will keep retrying, since we go out of our way to prevent the autovacuum process from updating the statistics based on the incomplete processing. This last interval is not how long it will attempt to truncate, but how long it will keep one autovacuum worker making unsuccessful attempts to acquire the lock before it is put to other uses. Workers will keep coming back to this table until the truncate phase is completed, just as they do without the patch; the difference being that anytime it gets the lock, even briefly, it is able to persist some progress.

That is all correct.

> So the question on the table is which of these three intervals should be GUCs, and what values to use if they aren't.

I could live with all the above defaults, but would like to see more comments on them.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
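The truncate-and-yield loop Kevin summarizes can be sketched as a simulation (waiter_after_pages stands in for the periodic LockHasWaiters() poll; this is not the real vacuumlazy.c code):

```c
typedef unsigned int BlockNumber;

/* Walk backward from the end of the relation, dropping empty pages,
 * and yield as soon as a (simulated) waiter shows up. Returns the new
 * relation size; progress made so far is kept even when we yield.
 * waiter_after_pages = number of pages we manage to truncate before a
 * conflicting lock request appears (0 = a waiter is already there). */
BlockNumber
truncate_until_waiter(BlockNumber rel_pages, BlockNumber nonempty_pages,
                      BlockNumber waiter_after_pages)
{
    BlockNumber blkno = rel_pages;
    BlockNumber done = 0;

    while (blkno > nonempty_pages)
    {
        if (done == waiter_after_pages)
            break;              /* persist progress, release the lock */
        blkno--;                /* page is empty: truncate past it */
        done++;
    }
    return blkno;
}
```

Anytime the worker gets the lock, even briefly, some pages come off the end for good, which is the key difference from the pre-patch behavior where the deadlock checker threw all the progress away.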
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 12/3/2012 5:42 PM, Kevin Grittner wrote:
> Jan Wieck wrote:
>> Attached is a new patch that addresses most of the points raised in discussion before. 1) Most of the configuration variables are derived from deadlock_timeout now. The check for conflicting lock request interval is deadlock_timeout/10, clamped to 10ms. The try to acquire exclusive lock interval is deadlock_timeout/20, also clamped to 10ms. The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 2ms.
>
> If we're going to keep this GUC, we need docs for it.

Certainly. But since we're still debating which and how many GUC variables we want, I don't think doc-time has come yet.

>> I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.
>
> Arguably we could simplify the deadlock resolution code a little, but it seems like it is probably safer to leave it as a failsafe, at least for now.

Thinking about it, I'm not really happy with removing the autovacuum_truncate_lock_check GUC at all. Fact is that the deadlock detection code and the configuration parameter for it should IMHO have nothing to do with all this in the first place. A properly implemented application does not deadlock. Someone running such an application should be able to safely set deadlock_timeout to minutes without the slightest ill side effect, but with the benefit that the deadlock detection code itself does not add to the lock contention. The only reason one cannot do so today is that autovacuum's truncate phase could then freeze the application with an exclusive lock for that long.

I believe the check interval needs to be decoupled from the deadlock_timeout again. This will leave us with 2 GUCs at least.
Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
Attached is a new patch that addresses most of the points raised in discussion before.

1) Most of the configuration variables are derived from deadlock_timeout now. The check for conflicting lock request interval is deadlock_timeout/10, clamped to 10ms. The try to acquire exclusive lock interval is deadlock_timeout/20, also clamped to 10ms. The only GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a range from 0 (just try once) to 2ms.

I'd like to point out that this is a significant change in functionality, as without the config option for the check interval there is no longer any possibility to disable the call to LockHasWaiters() and return to the original (deadlock code kills autovacuum) behavior.

2) The partition lock in LockHasWaiters() was lowered to LW_SHARED. The LW_EXCLUSIVE was indeed a copy/paste result.

3) The instr_time handling was simplified as suggested.

4) Lower case TRUE/FALSE.

I did not touch the part about suppressing the stats and the ANALYZE step of auto vacuum+analyze. The situation is no different today: when the deadlock code kills autovacuum, stats aren't updated either. And this patch is meant to cause autovacuum to finish the truncate in a few minutes or hours, so that the situation fixes itself.

Jan
--
Anyone who trades liberty for security deserves neither liberty nor security.
-- Benjamin Franklin

[Attached patch: diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c, index c9253a9..e8f88ad. It adds the storage/proc.h and portability/instr_time.h includes, a lock_waiter_detected flag in LVRelStats, static autovacuum_truncate_lock_check/retry/wait settings, and changes to lazy_vacuum_rel() and lazy_truncate_heap() so that an early terminated truncate attempt suppresses the pgstat report and cancels a pending ANALYZE. The rest of the diff is cut off in the archive.]
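The interval derivation in point 1) can be sketched as follows (invented helper names; the clamp is read here as a 10ms floor):

```c
/* deadlock_timeout-derived intervals from this patch version, with a
 * 10ms floor. With the default deadlock_timeout of 1000ms this yields
 * a 100ms waiter-check interval and a 50ms lock-retry interval. */
static int
clamp_min_10ms(int ms)
{
    return (ms < 10) ? 10 : ms;
}

int
truncate_check_interval_ms(int deadlock_timeout_ms)
{
    return clamp_min_10ms(deadlock_timeout_ms / 10);    /* LockHasWaiters() poll */
}

int
truncate_retry_interval_ms(int deadlock_timeout_ms)
{
    return clamp_min_10ms(deadlock_timeout_ms / 20);    /* lock reacquire nap */
}
```

Tying both intervals to deadlock_timeout is the coupling Jan later argues against in this thread, since a long deadlock_timeout would then stretch the truncate phase's lock-holding behavior along with it.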
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 11/28/2012 3:33 PM, Kevin Grittner wrote:

Kevin Grittner wrote: I still need to review the timing calls, since I'm not familiar with them so it wasn't immediately obvious to me whether they were being used correctly. I have no reason to believe that they aren't, but feel I should check.

It seems odd to me that assignment of one instr_time to another is done with INSTR_TIME_SET_ZERO() of the target followed by INSTR_TIME_ADD() with the target and the source. It seems to me that simple assignment would be more readable, and I can't see any down side. Why shouldn't:

    INSTR_TIME_SET_ZERO(elapsed);
    INSTR_TIME_ADD(elapsed, currenttime);
    INSTR_TIME_SUBTRACT(elapsed, starttime);

instead be?:

    elapsed = currenttime;
    INSTR_TIME_SUBTRACT(elapsed, starttime);

And why shouldn't:

    INSTR_TIME_SET_ZERO(starttime);
    INSTR_TIME_ADD(starttime, currenttime);

instead be?:

    starttime = currenttime;

Resetting starttime this way seems especially odd.

instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is starttime = currenttime; portable if those are structs?

Also, I want to do another pass looking just for off-by-one errors on blkno. Again, I have no reason to believe that there is a problem; it just seems like it would be a particularly easy type of mistake to make and miss when a key structure has this field:

    BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */

No problems found with that.

And I want to test more. The patch worked as advertised in all my tests, but I became uncomfortable with the games being played with the last autovacuum timestamp and the count of dead tuples. Sure, that causes autovacuum to kick back in until it has dealt with the truncation, but it makes it hard for a human looking at the stat views to see what's going on, and I'm not sure it wouldn't lead to bad plans due to stale statistics. 
Personally, I would rather see us add a boolean to indicate that autovacuum was needed (regardless of the math on the other columns) and use that to control the retries -- leaving the other columns free to reflect reality. You mean to extend the stats by yet another column? The thing is that this whole case happens rarely. We see this every other month or so and only on a rolling window table after it got severely bloated due to some abnormal use (purging disabled for some operational reason). The whole situation resolves itself after a few minutes to maybe an hour or two. Personally I don't think living with a few wrong stats on one table for that time is so bad that it warrants changing that much more code. Lower casing TRUE/FALSE will be done. If the LW_SHARED is enough in LockHasWaiters(), that's fine too. I think we have a consensus that the check interval should be derived from deadlock_timeout/10 clamped to 10ms. I'm going to remove that GUC. About the other two, I'm just not sure. Maybe using half the value as for the lock waiter interval as the lock retry interval, again clamped to 10ms, and simply leaving one GUC that controls how long autovacuum should retry this. I'm not too worried about the retry interval. However, the problem with that overall retry duration is that this value very much depends on the usage pattern of the database. If long running transactions (like 5 seconds) are the norm, then a hard coded value of 2 seconds before autovacuum gives up isn't going to help much. Jan -- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 11/29/2012 9:46 AM, Tom Lane wrote: Jan Wieck janwi...@yahoo.com writes: On 11/28/2012 3:33 PM, Kevin Grittner wrote: Resetting starttime this way seems especially odd. instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is starttime = currenttime; portable if those are structs?

Sure. We rely on struct assignment in lots of places.

Will be done then.

Thanks, Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] Logical to physical page mapping
On 10/27/2012 2:41 PM, Heikki Linnakangas wrote: And it's not at all clear to me that it would perform better than full_page_writes. You're writing and flushing out roughly the same amount of data AFAICS.

I think this assumption is wrong. full_page_writes=on means we write the full page content to WAL on the first change after a checkpoint. A change after a checkpoint logically means that the same page is dirty now and must also be written at the latest during the next checkpoint, which means 16K written minimum for every page changed after a checkpoint.

What exactly is the problem with full_page_writes that we're trying to solve?

Full page writes are meant to guard against torn pages. That is the case when an 8K page is written by the underlying OS/filesystem/HW in smaller chunks (for example 512 byte sectors), and in the case of a crash some of these chunks make it, others don't. Without full_page_writes, crash recovery can work if all 8K made it, or nothing made it (aka atomic writes). But it will fail otherwise.

The amount of WAL generated with full_page_writes=on is quite substantial. For pgbench, for example, the ratio is 20:1. Meaning with full_page_writes you write 20x the amount you do without.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] Logical to physical page mapping
On 10/28/2012 10:50 PM, Peter Geoghegan wrote: On 28 October 2012 22:35, Jan Wieck janwi...@yahoo.com wrote: The amount of WAL generated with full_page_writes=on is quite substantial. For pgbench for example the ratio 20:1. Meaning with full_page_writes you write 20x the amount you do without.

Sure, but as always, pgbench pessimises everything by having writes follow a uniform distribution, which is completely unrepresentative of the natural world. This will literally maximise the number of pristine full page images that need to be included. The actual frequency of checkpoints is a crucial factor here too, and you didn't mention anything about that. Higher buffer cache hit rates certainly reduce that ratio.

Well, I guess it was just one of those random thoughts that can't work in the end or aren't worth the work anyway.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
Stephen,

On 10/24/2012 10:46 PM, Stephen Frost wrote: Jan, * Jan Wieck (janwi...@yahoo.com) wrote: This problem has been discussed before. Those familiar with the subject please skip the next paragraph.

Apologies if this was already thought-of and ruled out for some reason, but...

Because all the scanning had been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty.

Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)?

I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?

The notion of having to double-scan and the AccessExclusiveLock on the relation are telling me this work-around, while completely possible, isn't exactly ideal...

Under normal circumstances with just a few pages to trim off the end this is no problem. Those pages were the last pages just scanned by this very autovacuum, so they are found in the shared buffers anyway. All the second scan does in that case is to fetch the page once more from shared buffers to be 100% sure we are not truncating off new tuples. We definitely need the AccessExclusiveLock to prevent someone from extending the relation at the end between our check for relation size and the truncate. Fetching 50 empty blocks from the buffer cache while at it isn't that big of a deal and that is what it normally looks like.

The problem case this patch is dealing with is rolling window tables that experienced some bloat. The typical example is a log table that has new data constantly added and the oldest data constantly purged out. This data normally rotates through some blocks like a rolling window. 
If for some reason (purging turned off for example) this table bloats by several GB and later shrinks back to its normal content, soon all the used blocks are at the beginning of the heap and we find tens of thousands of empty pages at the end. Only now does the second scan take more than 1000ms and autovacuum is at risk to get killed while at it.

Since we have experienced this problem several times now on our production systems, something clearly needs to be done. But IMHO it doesn't happen often enough to take any risk here.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 10/25/2012 9:45 AM, Tom Lane wrote: Jan Wieck janwi...@yahoo.com writes: On 10/24/2012 10:46 PM, Stephen Frost wrote: Would it be possible to use the FSM to figure out if things have changed since the last scan..? Does that scan update the FSM, which would then be updated by another backend in the event that it decided to write something there? Or do we consider the FSM to be completely untrustworthy wrt this (and if so, I don't suppose there's any hope to using the visibility map...)? I honestly don't know if we can trust the FSM enough when it comes to throwing away heap pages. Can we?

No. Backends are under no obligation to update FSM for each individual tuple insertion, and typically don't do so. More to the point, you have to take AccessExclusiveLock *anyway*, because this is interlocking not only against new insertions but plain read-only seqscans: if a seqscan falls off the end of the table it will be very unhappy. So I don't see where we'd buy anything by consulting the FSM.

Thank you.

One thing that I haven't mentioned yet is that with this patch, we could actually insert a vacuum_delay_point() into the loop in count_nondeletable_pages(). We no longer cling to the exclusive lock but rather get out of the way as soon as somebody needs the table. Under this condition we no longer need to do the second scan full bore.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] autovacuum truncate exclusive lock round two
On 10/25/2012 10:12 AM, Stephen Frost wrote: Jan, * Jan Wieck (janwi...@yahoo.com) wrote: The problem case this patch is dealing with is rolling window tables that experienced some bloat. The typical example is a log table, that has new data constantly added and the oldest data constantly purged out. This data normally rotates through some blocks like a rolling window. If for some reason (purging turned off for example) this table bloats by several GB and later shrinks back to its normal content, soon all the used blocks are at the beginning of the heap and we find tens of thousands of empty pages at the end. Only now does the second scan take more than 1000ms and autovacuum is at risk to get killed while at it.

My concern is that this could certainly also happen to a heavily updated table in an OLTP type of environment where the requirement to take a heavy lock to clean it up might prevent it from ever happening. I was simply hoping we could find a mechanism to lock just those pages we're getting ready to nuke rather than the entire relation. Perhaps we can consider how to make those changes alongside of changes to eliminate or reduce the extent locking that has been painful (for me at least) when doing massive parallel loads into a table.

I've been testing this with loads of 20 writes/s to that bloated table. Preventing not only the clean up, but the following ANALYZE as well is precisely what happens. There may be multiple ways how to get into this situation, but once you're there the symptoms are the same. Vacuum fails to truncate it, causing a 1 second hiccup every minute, while vacuum is holding the exclusive lock until the deadlock detection code of another transaction kills it.

My patch doesn't change the logic how we ensure that we don't zap any data by accident with the truncate and Tom's comments suggest we should stick to it. It only makes autovacuum check frequently if the AccessExclusiveLock is actually blocking anyone and then get out of the way. 
I would rather like to discuss any ideas how to do all this without 3 new GUCs.

In the original code, the maximum delay that autovacuum can cause by holding the exclusive lock is one deadlock_timeout (default 1s). It would appear reasonable to me to use max(deadlock_timeout/10,10ms) as the interval to check for a conflicting lock request. For another transaction that needs to access the table this is 10 times faster than it is now and still guarantees that autovacuum will make some progress with the truncate.

The other two GUCs control how often and how fast autovacuum tries to acquire the exclusive lock in the first place. Since we actively release the lock *because someone needs it*, it is pretty much guaranteed that the immediate next lock attempt fails. We on purpose do a ConditionalLockRelation() because there is a chance to deadlock. The current code only tries one lock attempt and gives up immediately. I don't know from what to derive a good value for how long to retry, but the nap time in between tries could be a hardcoded 20ms or using the cost based vacuum nap time (which defaults to 20ms).

Any other ideas are welcome.

Thanks, Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
[HACKERS] autovacuum truncate exclusive lock round two
This problem has been discussed before. Those familiar with the subject please skip the next paragraph. When autovacuum finds a substantial amount of empty pages at the end of a relation, it attempts to truncate it in lazy_truncate_heap(). Because all the scanning had been done in parallel to normal DB activity, it needs to verify that all those blocks are still empty. To do that autovacuum grabs an AccessExclusiveLock on the relation, then scans backwards to the last non-empty page. If any other backend needs to access that table during this time, it will kill the autovacuum from the deadlock detection code, which by default is done after a 1000ms timeout. The autovacuum launcher will start another vacuum after (default) 60 seconds, which most likely is getting killed again, and again, and again. The net result of this is that the table is never truncated and every 60 seconds there is a 1 second hiccup before the autovacuum is killed. Proposal: Add functions to lmgr that are derived from the lock release code, but instead of releasing the lock and waking up waiters, just return a boolean telling if there are any waiters that would be woken up if this lock was released. Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to periodically check, if there is a conflicting lock request waiting. If not, keep going. If there is a waiter, truncate the relation to the point checked thus far, release the AccessExclusiveLock, then loop back to where we acquire this lock in the first place and continue checking/truncating. I have a working patch here: https://github.com/wieck/postgres/tree/autovacuum-truncate-lock This patch does introduce three new postgresql.conf parameters, which I would be happy to get rid of if we could derive them from something else. Something based on the deadlock timeout may be possible. 
autovacuum_truncate_lock_check = 100ms   # how frequently to check
                                         # for conflicting locks
autovacuum_truncate_lock_retry = 50      # how often to try acquiring
                                         # the exclusive lock
autovacuum_truncate_lock_wait = 20ms     # nap in between attempts

With these settings, I see the truncate of a bloated table progressing at a rate of 3 minutes per GB, while that table is accessed 20 times per second.

The original kill autovacuum mechanism in the deadlock code is still there. All this code really does is 10 lmgr lookups per second and releasing the AccessExclusiveLock if there are any waiters. I don't think it can get any cheaper than this.

I am attaching a script that uses pgbench to demonstrate the actual problem of a bloated table with significant empty pages at the end.

Comments?

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin

t1.autovac-lock-issue.tgz Description: application/compressed
Re: [HACKERS] autovacuum truncate exclusive lock round two
Here is the patch for it.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c9253a9..9f880f0 100644
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 57,62 ****
--- 57,63 ----
  #include "utils/pg_rusage.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "portability/instr_time.h"

  /*
*************** typedef struct LVRelStats
*** 103,108 ****
--- 104,110 ----
  	ItemPointer dead_tuples;	/* array of ItemPointerData */
  	int			num_index_scans;
  	TransactionId latestRemovedXid;
+ 	bool		lock_waiter_detected;
  } LVRelStats;
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 193,198 ****
--- 195,202 ----
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
  	vacrelstats->old_rel_tuples = onerel->rd_rel->reltuples;
  	vacrelstats->num_index_scans = 0;
+ 	vacrelstats->pages_removed = 0;
+ 	vacrelstats->lock_waiter_detected = false;

  	/* Open all indexes of the relation */
  	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
*************** lazy_vacuum_rel(Relation onerel, VacuumS
*** 259,268 ****
  						vacrelstats->hasindex,
  						new_frozen_xid);

! 	/* report results to the stats collector, too */
! 	pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);

  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
--- 263,280 ----
  						vacrelstats->hasindex,
  						new_frozen_xid);

! 	/*
! 	 * report results to the stats collector, too.
! 	 * An early terminated lazy_truncate_heap attempt
! 	 * suppresses the message and also cancels the
! 	 * execution of ANALYZE, if that was ordered.
! 	 */
! 	if (!vacrelstats->lock_waiter_detected)
! 		pgstat_report_vacuum(RelationGetRelid(onerel),
  						 onerel->rd_rel->relisshared,
  						 new_rel_tuples);
+ 	else
+ 		vacstmt->options &= ~VACOPT_ANALYZE;

  	/* and log the action if appropriate */
  	if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
*************** lazy_truncate_heap(Relation onerel, LVRe
*** 1255,1334 ****
  	BlockNumber old_rel_pages = vacrelstats->rel_pages;
  	BlockNumber new_rel_pages;
  	PGRUsage	ru0;

  	pg_rusage_init(&ru0);

  	/*
! 	 * We need full exclusive lock on the relation in order to do truncation.
! 	 * If we can't get it, give up rather than waiting --- we don't want to
! 	 * block other backends, and we don't want to deadlock (which is quite
! 	 * possible considering we already hold a lower-grade lock).
! 	 */
! 	if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
! 		return;
!
! 	/*
! 	 * Now that we have exclusive lock, look to see if the rel has grown
! 	 * whilst we were vacuuming with non-exclusive lock.  If so, give up; the
! 	 * newly added pages presumably contain non-deletable tuples.
  	 */
! 	new_rel_pages = RelationGetNumberOfBlocks(onerel);
! 	if (new_rel_pages != old_rel_pages)
  	{
  		/*
! 		 * Note: we intentionally don't update vacrelstats->rel_pages with the
! 		 * new rel size here.  If we did, it would amount to assuming that the
! 		 * new pages are empty, which is unlikely.  Leaving the numbers alone
! 		 * amounts to assuming that the new pages have the same tuple density
! 		 * as existing ones, which is less unlikely.
  		 */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
!
! 	/*
! 	 * Scan backwards from the end to verify that the end pages actually
! 	 * contain no tuples.  This is *necessary*, not optional, because other
! 	 * backends could have added tuples to these pages whilst we were
! 	 * vacuuming.
! 	 */
! 	new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
!
! 	if (new_rel_pages >= old_rel_pages)
! 	{
! 		/* can't do anything after all */
! 		UnlockRelation(onerel, AccessExclusiveLock);
! 		return;
! 	}
!
! 	/*
! 	 * Okay to truncate.
! 	 */
! 	RelationTruncate(onerel, new_rel_pages);
!
! 	/*
! 	 * We can release the exclusive lock as soon as we have truncated.  Other
! 	 * backends can't safely access the relation until they have processed the
! 	 * smgr invalidation that smgrtruncate sent out ... but that should happen
! 	 * as part of standard invalidation processing once they acquire lock on
! 	 * the relation.
! 	 */
! 	UnlockRelation(onerel, AccessExclusiveLock);
!
! 	/*
! 	 * Update statistics.  Here, it *is* correct to adjust rel_pages without
! 	 * also touching reltuples, since the tuple count wasn't changed by the
! 	 * truncation.
! 	 */
! 	vacrelstats->rel_pages = new_rel_pages;
! 	vacrelstats->pages_removed = old_rel_pages - new_rel_pages;
!
! 	ereport(elevel,
! 			(errmsg("\"%s\": truncated %u to %u pages",
! 					RelationGetRelationName(onerel),
! 					old_rel_pages, new_rel_pages),
! 			 errdetail("%s.",
! 					   pg_rusage_show(&ru0))));
  }

  /*
--- 1267,1388 ----
Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
On 10/15/2012 4:43 PM, Simon Riggs wrote: Jan spoke at length at PgCon, for all to hear, that what we are building is a much better way than the trigger logging approach Slony uses. I don't take that as carte blanche for approval of everything being done, but its going in the right direction with an open heart, which is about as good as it gets.

The mechanism you are building for capturing changes is certainly a lot better than what Bucardo, Londiste and Slony are doing today. That much is true.

The flip side of the coin however is that all of today's logical replication systems are designed Postgres version agnostic to a degree. This means that the transition time from the existing, trigger based approach to the new WAL based mechanism will see both technologies in parallel, which is no small thing to support. And that transition time may last for a good while. We still have people installing Slony 1.2 because 2.0 (3 years old by now) requires Postgres 8.3 minimum.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin
Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
On 10/15/2012 3:25 PM, Andres Freund wrote: On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote: On 15 October 2012 19:19, Bruce Momjian br...@momjian.us wrote: I think Robert is right that if Slony can't use the API, it is unlikely any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and someone has to go first - why should the Slony guys invest in adopting this technology if it is going to necessitate using a forked Postgres with an uncertain future?

Well. I don't think (hope) anybody proposed making something release worthy for slony but rather a POC patch that proves the API is generic enough to be used by them. If I (or somebody else familiar with this) work together with somebody familiar with slony internals I think such a POC shouldn't be too hard to do. I think some more input from that side is a good idea. I plan to send out an email to possibly interested parties in about two weeks...

What Slony essentially sends to the receiver node is a COPY stream in the format Christopher described. That stream is directly copied into the receiving node's sl_log_N table and picked up there by an apply trigger BEFORE INSERT, that performs the corresponding INSERT/UPDATE/DELETE operation via prepared plans against the user tables. For a POC I think it is sufficient to demonstrate that this copy stream can be generated out of the WAL decoding.

Note that Slony today does not touch columns in an UPDATE that have not changed in the original UPDATE on the origin. Sending toasted column values that haven't changed would be a substantial change to the storage efficiency on the replica. The consequence of this is that the number of columns that need to be in the UPDATE's SET clause varies. log_cmdupdncols is there to separate the new column/value pairs from the column/key pairs of the updated row. The old row key in Slony is based on a unique index (preferably a PK, but any unique key will do). 
This makes cmdupdncols simply the number of column/value pairs minus the number of key columns. So it isn't too hard to figure out.

Jan

-- Anyone who trades liberty for security deserves neither liberty nor security. -- Benjamin Franklin