Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
On 23 June 2013 03:16, Stephen Frost  wrote:

>  Still doesn't really address the issue of dups though.

Checking for duplicates in all cases would be wasteful, since often we
are joining to the PK of a smaller table.

If duplicates are possible at all for a join, then it would make sense
to build the hash table more carefully to remove dupes. I think we
should treat that as a separate issue.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] wrong state of patch in CF

2013-06-22 Thread Amit kapila

Patch (https://commitfest.postgresql.org/action/patch_view?id=1161) is already 
committed by Commit
http://git.postgresql.org/pg/commitdiff/b23160889c963dfe23d8cf1f9be64fb3c535a2d6

It should be marked as Committed in CF.

With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread MauMau

From: "Robert Haas" 

On Fri, Jun 21, 2013 at 10:02 PM, MauMau  wrote:
I'm comfortable with 5 seconds.  We are talking about the interval between
sending SIGQUIT to the children and then sending SIGKILL to them.  In most
situations, the backends should terminate immediately.  However, as I said a
few months ago, ereport() call in quickdie() can deadlock indefinitely.  This
is a PostgreSQL bug.


So let's fix that bug.  Then we don't need this.


There are two ways to fix that bug.  You are suggesting one of the following
two methods, aren't you?


1. (Robert-san's idea)
Upon receipt of immediate shutdown request, postmaster sends SIGKILL to its
children.


2. (Tom-san's idea)
Upon receipt of immediate shutdown request, postmaster first sends SIGQUIT
to its children, waits a while for them to terminate, then sends SIGKILL to
them.



At first I proposed 1.  Then Tom-san suggested 2 so that the server is as
friendly to the clients as now by notifying them of the server shutdown.  I
was convinced by that idea and chose 2.


Basically, I think both ideas are right.  They can solve the original
problem.


However, re-considering the meaning of "immediate" shutdown, I feel 1 is a
bit better, because trying to do something in quickdie() or somewhere does
not match the idea of "immediate".  We may not have to be friendly to the
clients at the immediate shutdown.  The code gets much simpler.  In
addition, it may be better that we similarly send SIGKILL in the backend
crash (FatalError) case, eliminate the use of SIGQUIT, and remove quickdie()
and other SIGQUIT handlers.


What do you think?  How should we make consensus and proceed?

Regards
MauMau




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Revive line type

2013-06-22 Thread rui hua
Hi,

Test results are as follows:

Contents & Purpose

This patch finishes the line type and the related functions that are listed
in the catalogs and documentation but were not implemented yet. There are no
other new features added in this patch.

Regression test cases are included in this patch, and the documentation is
also updated.


 Run

The regression tests all succeed, but several problems were found while
doing some simple tests. The updated documentation says that the points used
in the output are not necessarily the points used on input.  I understand
that, as long as they lie on the same line. But when the input data
represents a horizontal or vertical line, the output is not exactly the
same line; it is another line parallel to it.

For example:


postgres=# select line('1,3,2,3');
      line
-----------------
 [(0,-3),(1,-3)]
(1 row)


postgres=# select line('1,3,1,6');
      line
-----------------
 [(-1,0),(-1,1)]
(1 row)


In addition, when a straight line coincides with a coordinate axis, the
output contains -0; I do not know whether that is appropriate.


postgres=# select line('0,1,0,5');
      line
-----------------
 [(-0,0),(-0,1)]
(1 row)


A negative value appears when using <-> to calculate the distance between
two parallel lines.


postgres=# select line('1,1,2,1') <-> line('-1,-1,-2,-1') ;
 ?column?
----------
       -2
(1 row)


postgres=# select lseg('1,1,2,1') <-> line('-1,-1,-2,-1') ;
 ?column?
----------
       -2
(1 row)


The same situation occurs when calculating the distance between a point and
a straight line.

postgres=# select point('-1,1') <-> line('-3,0,-4,0') ;
 ?column?
----------
       -1
(1 row)



Should the distance be a positive number?
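[Editorial note: for what it is worth, here is a minimal standalone sketch of a
sign-correct point-to-line distance, assuming the line is kept in the
A*x + B*y + C = 0 form that the line type stores. It only illustrates where a
missing fabs() would produce the negative results shown above; it is not the
actual geo_ops.c code.]

#include <math.h>

/* Hedged sketch: distance from point (x, y) to the line A*x + B*y + C = 0.
 * Taking fabs() keeps the result non-negative regardless of which side of
 * the line the point lies on. */
static double
point_line_distance(double A, double B, double C, double x, double y)
{
    return fabs(A * x + B * y + C) / hypot(A, B);
}

With such a step, the last example (point (-1,1) against the line y = 0) would
give 1 rather than -1.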

Other functions seem to work properly.


Performance

==

Because these functions are implemented for the first time, there is no
earlier implementation against which to compare performance.


 Conclusion

This patch makes the line type work, but there are some bugs. Additionally,
the function "close_sl" is not implemented.


With Regards,

Rui


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread MauMau

From: "Alvaro Herrera" 

MauMau wrote:

I thought of adding some new state of pmState for some reason (that
might be the same as your idea).
But I refrained from doing that, because pmState has already many
states.  I was afraid adding a new pmState value for this bug fix
would complicate the state management (e.g. state transition in
PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was
appropriate because postmaster is actually waiting for backends to
terminate after sending SIGQUIT.  The state name is natural.


Well, a natural state name is of no use if we're using it to indicate
two different states, which I think is what would be happening here.
Basically, with your patch, PM_WAIT_BACKENDS means two different things
depending on AbortStartTime.


I probably got your feeling.  You aren't feeling comfortable about using the
time variable AbortStartTime as a state variable to change postmaster's
behavior, are you?


That may be right, but I'm not sure well... because in PM_WAIT_BACKENDS, as
the name indicates, postmaster is indeed waiting for the backends to
terminate regardless of AbortStartTime.  Apart from this, postmaster seems
to change its behavior in the same pmState depending on other variables such
as Shutdown and FatalError.  I'm not confident in which is better, so I
won't object if the reviewer or committer modifies the code.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
On Saturday, June 22, 2013, Simon Riggs wrote:

> On 22 June 2013 21:40, Stephen Frost wrote:
>
> > I'm actually not a huge fan of this as it's certainly not cheap to do.
> If it
> > can be shown to be better than an improved heuristic then perhaps it
> would
> > work but I'm not convinced.
>
> We need two heuristics, it would seem:
>
> * an initial heuristic to overestimate the number of buckets when we
> have sufficient memory to do so
>
> * a heuristic to determine whether it is cheaper to rebuild a dense
> hash table into a better one.
>
> Although I like Heikki's rebuild approach we can't do this every x2
> overstretch. Given large underestimates exist we'll end up rehashing
> 5-12 times, which seems bad. Better to let the hash table build and
> then re-hash once, if we can see it will be useful.
>
> OK?
>

I've been thinking a bit more on your notion of simply using as much memory
as we're permitted, but maybe adjust it down based on how big the input to
the hash table is (which I think we have better stats on, and even if we
don't, we could easily keep track of how many tuples we've seen and
consider rehashing as we go).  Still doesn't really address the issue of
dups though. It may still be much larger than it should be if there's a lot
of duplicates in the input that hash into a much smaller set of buckets.

Will think on it more.

Thanks,

Stephen


Re: [HACKERS] Patch for fast gin cache performance improvement

2013-06-22 Thread ian link
Looks like my community login is still not working. No rush, just wanted to
let you know. Thanks!

Ian


On Tue, Jun 18, 2013 at 11:41 AM, Ian Link  wrote:

>
> No worries! I'll just wait until it's up again.
> Thanks
> Ian
>
>   Kevin Grittner 
>  Tuesday, June 18, 2013 9:15 AM
>
> Oops -- we seem to have a problem with new community logins at the
> moment, which will hopefully be straightened out soon.  You might
> want to wait a few days if you don't already have a login.
>
> --
> Kevin Grittner
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>   Kevin Grittner 
>  Tuesday, June 18, 2013 9:09 AM
>
> Ian Link   wrote:
>
>
> This patch contains a performance improvement for the fast gin
> cache.
>
> Our test queries improve from about 0.9 ms to 0.030 ms.
>
> Impressive!
>
>
> Thanks for reading and considering this patch!
>
>
> Congratulations on your first PostgreSQL patch!  To get it
> scheduled for review, please add it to this page:
> https://commitfest.postgresql.org/action/commitfest_view/open
>
>
> You will need to get a community login (if you don't already have
> one), but that is a quick and painless process.  Choose an
> appropriate topic (like "Performance") and reference the message ID
> of the email to which you attached the patch.  Don't worry about
> the fields for reviewers, committer, or date closed.
>
> Sorry for the administrative overhead, but without it things can
> fall through the cracks.  You can find an overview of the review
> process with links to more detail here:
>
> http://wiki.postgresql.org/wiki/CommitFest
>
> Thanks for contributing!
>
>
>   Ian Link 
>  Monday, June 17, 2013 9:42 PM
>
>
> This patch contains a performance improvement for the fast gin cache. As
> you may know, the performance of the fast gin cache decreases with its
> size. Currently, the size of the fast gin cache is tied to work_mem. The
> size of work_mem can often be quite high. The large size of work_mem is
> inappropriate for the fast gin cache size. Therefore, we created a separate
> cache size called gin_fast_limit. This global variable controls the size of
> the fast gin cache, independently of work_mem. Currently, the default
> gin_fast_limit is set to 128kB. However, that value could need tweaking.
> 64kB may work better, but it's hard to say with only my single machine to
> test on.
>
> On my machine, this patch results in a nice speed up. Our test queries
> improve from about 0.9 ms to 0.030 ms. Please feel free to use the test
> case yourself: it should be attached. I can look into additional test cases
> (tsvectors) if anyone is interested.
>
> In addition to the global limit, we have provided a per-index limit:
> fast_cache_size. This per-index limit begins at -1, which means that it is
> disabled. If the user does not specify a per-index limit, the index will
> simply use the global limit.
>
> I would like to thank Andrew Gierth for all his help on this patch. As
> this is my first patch he was extremely helpful. The idea for this
> performance improvement was entirely his. I just did the implementation.
> Thanks for reading and considering this patch!
>
>
> Ian Link
>
>

Re: [HACKERS] Support for RANGE ... PRECEDING windows in OVER

2013-06-22 Thread ian link
Thanks Craig! That definitely does help. I probably still have some
questions but I think I will read through the rest of the code before
asking. Thanks again!

Ian

> Craig Ringer
> Friday, June 21, 2013 8:41 PM
>
> On 06/22/2013 03:30 AM, ian link wrote:
>>
>> Forgive my ignorance, but I don't entirely understand the problem. What
>> does '+' and '-' refer to exactly?
>
> Consider "RANGE 4.5 PRECEDING'.
>
> You need to be able to test whether, for the current row 'b', any given
> row 'a' is within the range (b - 4.5) < a <= b . Not 100% sure about the
> < vs <= boundaries, but that's irrelevant for the example.
>
> To test that, you have to be able to do two things: you have to be able
> to test whether one value is greater than another, and you have to be
> able to add or subtract a constant from one of the values.
>
> Right now, the b-tree access method provides information on the ordering
> operators < <= = > >= <> , which provides half the answer. But these
> don't give any concept of *distance* - you can test ordinality but not
> cardinality.
>
> To implement the "different by 4.5" part, you have to be able to add 4.5
> to one value or subtract it from the other.
>
> The obvious way to do that is to look up the function that implements
> the '+' or '-' operator, and do:
>
> ((OPERATOR(+))(a, 4.5)) > b AND (a <= b)
>
> or
>
> ((OPERATOR(-))(b, 4.5)) < a AND (a <= b);
>
> The problem outlined by Tom in prior discussion about this is that
> PostgreSQL tries really hard not to assume that particular operator
> names mean particular things. Rather than "knowing" that "+" is always
> "an operator that adds two values together; is transitive, symmetric and
> reflexive", PostgreSQL requires that you define an *operator class* that
> names the operator that has those properties.
>
> Or at least, it does for less-than, less-than-or-equals, equals,
> greater-than-or-equals, greater-than, and not-equals as part of the
> b-tree operator class, which *usually* defines these operators as
> < <= = >= > <>, but you could use any operator names you wanted if you
> really liked.
>
> Right now (as far as I know) there's no operator class that lets you
> identify operators for addition and subtraction in a similar way. So
> it's necessary to either add such an operator class (in which case
> support has to be added for it for every type), extend the existing
> b-tree operator class to provide the info, or blindly assume that "+"
> and "-" are always addition and subtraction.
>
> For an example of why such assumptions are a bad idea, consider matrix
> multiplication. Normally, "a * b" = "b * a", but this isn't true for
> multiplication of matrices. Similarly, if someone defined a "+" operator
> as an alias for string concatenation (||), we'd be totally wrong to
> assume we could use that for doing range-offset windowing.
>
> So. Yeah. Operator classes required, unless we're going to change the
> rules and make certain operator names "special" in PostgreSQL, so that
> if you implement them they *must* have certain properties. This seems
> like a pretty poor reason to add such a big change.
>
> I hope this explanation (a) is actually correct and (b) is helpful.
>
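[Editorial note: to make the shape of that test concrete, here is a small
hypothetical C sketch of the frame check Craig describes. The callback struct
and names are made up for the example (they stand in for whatever a future
opclass-style lookup would supply), and plain doubles stand in for Datums so
the snippet is self-contained; it is not actual PostgreSQL window-function
code.]

#include <stdbool.h>

/* Hypothetical datatype-specific callbacks an opclass-style lookup would
 * have to provide: a "subtract offset" operation and a btree comparator. */
typedef struct RangeFrameOps
{
    double (*subtract)(double value, double offset);  /* stands in for '-' */
    int    (*compare)(double a, double b);            /* returns <0, 0 or >0 */
} RangeFrameOps;

/* Is row value 'a' inside "RANGE offset PRECEDING" of current row value 'b',
 * i.e. (b - offset) < a AND a <= b, without hard-coding what '-' means? */
static bool
in_range_preceding(const RangeFrameOps *ops, double a, double b, double offset)
{
    double lower = ops->subtract(b, offset);

    return ops->compare(lower, a) < 0 && ops->compare(a, b) <= 0;
}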

Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
On Saturday, June 22, 2013, Heikki Linnakangas wrote:

> On 22.06.2013 19:19, Simon Riggs wrote:
>
>> So I think that (2) is the best route: Given that we know with much
>> better certainty the number of rows in the scanned-relation, we should
>> be able to examine our hash table after it has been built and decide
>> whether it would be cheaper to rebuild the hash table with the right
>> number of buckets, or continue processing with what we have now. Which
>> is roughly what Heikki proposed already, in January.
>>
>
> Back in January, I wrote a quick patch to experiment with rehashing when
> the hash table becomes too full. It was too late to make it into 9.3 so I
> didn't pursue it further back then, but IIRC it worked. If we have the
> capability to rehash, the accuracy of the initial guess becomes much less
> important.
>

What we're hashing isn't going to change mid-way through or be updated
after we've started doing lookups against it.

Why not simply scan and queue the data and then build the hash table right
the first time?  Also, this patch doesn't appear to address dups and
therefore would rehash unnecessarily. There's no point rehashing into more
buckets if the buckets are only deep due to lots of dups. Figuring out how
many distinct values there are, in order to build the best hash table, is
actually pretty expensive compared to how quickly we can build the table
today. Lastly, this still encourages collisions due to too few buckets. If
we would simply start with more buckets outright we'd reduce the need to
rehash..

Thanks,

Stephen


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
On 22 June 2013 21:40, Stephen Frost  wrote:

> I'm actually not a huge fan of this as it's certainly not cheap to do. If it
> can be shown to be better than an improved heuristic then perhaps it would
> work but I'm not convinced.

We need two heuristics, it would seem:

* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so

* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.

Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad. Better to let the hash table build and
then re-hash once, if we can see it will be useful.

OK?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Heikki Linnakangas

On 22.06.2013 19:19, Simon Riggs wrote:

So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.


Back in January, I wrote a quick patch to experiment with rehashing when 
the hash table becomes too full. It was too late to make it into 9.3 so 
I didn't pursue it further back then, but IIRC it worked. If we have the 
capability to rehash, the accuracy of the initial guess becomes much 
less important.


- Heikki
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6a2f236..32aebb9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -223,6 +223,60 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
+static void
+Rehash(HashJoinTable hashtable)
+{
+	int nbuckets_old = hashtable->nbuckets_inmem;
+	int nbuckets_new;
+	uint32 mask;
+	int i;
+	
+	if (nbuckets_old > (1<<30))
+		return;
+
+	nbuckets_new = nbuckets_old * 2;
+	hashtable->buckets = (HashJoinTuple *)
+		repalloc(hashtable->buckets, nbuckets_new * sizeof(HashJoinTuple));
+	memset(((char *) hashtable->buckets) + nbuckets_old * sizeof(HashJoinTuple),
+		   0, nbuckets_old * sizeof(HashJoinTuple));
+	hashtable->nbuckets_inmem = nbuckets_new;
+
+	mask = nbuckets_old;
+
+	for (i = 0; i < nbuckets_old; i++)
+	{
+		int			newbucketno = i + nbuckets_old;
+		HashJoinTuple prevtuple;
+		HashJoinTuple tuple;
+
+		prevtuple = NULL;
+		tuple = hashtable->buckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+
+			if ((tuple->hashvalue & mask) != 0)
+			{
+				/* move to the new bucket */
+				tuple->next = hashtable->buckets[newbucketno];
+				hashtable->buckets[newbucketno] = tuple;
+
+				if (prevtuple)
+					prevtuple->next = nexttuple;
+				else
+					hashtable->buckets[i] = nexttuple;
+			}
+			else
+			{
+				/* keep in this bucket */
+				prevtuple = tuple;
+			}
+			tuple = nexttuple;
+		}
+	}
+}
 
 /* 
  *		ExecHashTableCreate
@@ -271,6 +325,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_inmem = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +340,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->inmemTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -612,7 +668,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
+	for (i = 0; i < hashtable->nbuckets_inmem; i++)
 	{
 		HashJoinTuple prevtuple;
 		HashJoinTuple tuple;
@@ -734,6 +790,10 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		hashtable->inmemTuples += 1;
+		if (hashtable->inmemTuples > hashtable->nbuckets_inmem * NTUP_PER_BUCKET * 2)
+			Rehash(hashtable);
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
@@ -873,18 +933,18 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 		  int *bucketno,
 		  int *batchno)
 {
-	uint32		nbuckets = (uint32) hashtable->nbuckets;
+	uint32		nbuckets_inmem = (uint32) hashtable->nbuckets_inmem;
 	uint32		nbatch = (uint32) hashtable->nbatch;
 
 	if (nbatch > 1)
 	{
 		/* we can do MOD by masking, DIV by shifting */
-		*bucketno = hashvalue & (nbuckets - 1);
+		*bucketno = hashvalue & (nbuckets_inmem - 1);
 		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
 	}
 	else
 	{
-		*bucketno = hashvalue & (nbuckets - 1);
+		*bucketno = hashvalue & (nbuckets_inmem - 1);
 		*batchno = 0;
 	}
 }
@@ -995,7 +1055,7 @@ ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
 		 */
 		if (hashTuple != NULL)
 			hashTuple = hashTuple->next;
-		else if (hjstate->hj_CurBucketNo < hashtable->nbuckets)
+		else if (hjstate->hj_CurBucketNo < hashtable->nbuckets_inmem)
 		{
 			hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
 			hjstate->hj_CurBucketNo++;
@@ -1052,7 +1112,7 @@ void
 ExecHashTableReset(HashJoinTable hashtable)
 {
 	MemoryContext oldcxt;
-	int			nbuckets = hashtable->nbuckets;
+	int			nbuckets_inmem = hashtable->nbuckets_inmem;
 
 	/*
 	 * Release 

Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)

2013-06-22 Thread Fabien COELHO


Hello Mitsumasa,

Thanks for the review.


* 2. Output format in result for more readable.

5.0 s[thread 1]: tps = 1015.576032, AverageLatency(ms) = 0.000984663
5.0 s[thread 0]: tps = 1032.580794, AverageLatency(ms) = 0.000968447
10.0 s [thread 0]: tps = 1129.591189, AverageLatency(ms) = 0.000885276
10.0 s [thread 1]: tps = 1126.267776, AverageLatency(ms) = 0.000887888


However, the preferred output format (design) differs from person to
person :-). If you like another format, fix it.


I think that your suggestion is too verbose, and as far as automation is 
concerned I like "cut -f 2" unix filtering and other gnuplot processing...
but I see your point and it is a matter of taste. I'll try to propose 
something in between, if I can.



* 3. The thread name in the output format is not necessary.
I do not understand why the thread name is displayed in each progress line. I
think it is not needed. I hope the output can be simpler even with a lot of
threads. My image is here,



5.0 s: tps = 2030.576032, AverageLatency(ms) = 0.000984663
10.0 s : tps = 2250.591189, AverageLatency(ms) = 0.000885276


This output format is simpler and more intuitive. If you need the result for
each thread, please tell us the reason.


I agree that it would be better, but a thread only has access to its own
data if it must work with the "fork" pthread emulation, so each thread has
to do its own report... If the "fork" emulation is removed and only real
threads are used, it would be much better, and one thread would be able to
report for everyone. The alternative is to make this a feature which does
not work with fork emulation.


* 4. The progress time slips.
When I executed this patch for a long time, I found the progress time
slipping. This problem image is here.


Yep. I must change the test to align on the overall start time.

I'll submit a new patch later.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
On Saturday, June 22, 2013, Simon Riggs wrote:

> On 22 June 2013 14:48, Stephen Frost wrote:
>
> > Based on your argument that we want to have a bucket load which is, on
> > average, the size of NTUP_PER_BUCKET, I have to '-1' on this.
>
> That is not my argument. I am pointing out that the comments claim the
> code does that, and yet the code does not.


To be honest, I don't read that comment quite the same way as you do.

Tom's wish for an average tuples per bucket of 10 may be connected to
> that error; I can't say. But we definitely don't end up with an
> average of 10 in many cases, which is the basis for previous poor
> performance reports.


I don't believe Tom wishes for an average of 10..  That's just what
NTUP_PER_BUCKET has always been set to.

The bottom line is that you're hoping for perfection. If we knew
> exactly how many tuples were in the hash table then we could size it
> precisely for the hash join we are about to perform. We don't know
> that, so we can't size it precisely. Worse than that is that we know
> we badly estimate the number of buckets on a regular basis.


I'm actually not trying for perfection. What I think we should start with
is to at least not shoot ourselves in the foot by cutting the bucket size
to one-tenth of our distinct tuple estimate and virtually guaranteeing lots
of true collisions which are expensive.


> That gives us three choices: if we have a hash table fixed without
> prior knowledge then we either (0) we continue to underallocate them
> in many cases or (1) potentially overallocate buckets. Or (2) we learn
> to cope better with the variation in the number of entries. If we rule
> out the first two we have only the last one remaining.
>
> (1) will always be a heuristic, and as you point out could itself be
> an extreme setting


A heuristic based on our estimates is a good choice, imv.  I do think we
can do better than we are now though.

So I think that (2) is the best route: Given that we know with much
> better certainty the number of rows in the scanned-relation, we should
> be able to examine our hash table after it has been built and decide
> whether it would be cheaper to rebuild the hash table with the right
> number of buckets, or continue processing with what we have now. Which
> is roughly what Heikki proposed already, in January.
>

I'm actually not a huge fan of this as it's certainly not cheap to do. If
it can be shown to be better than an improved heuristic then perhaps it
would work but I'm not convinced.

One other way to address having a smaller actual hash table is to come up
with another way to detect empty parts of the hash space, perhaps by using
a bitmap to represent the hash space (obviously the individual bits would
represent some chunk of the 32bit hash space, perhaps as much as 1/64'th,
allowing our bitmap to fit into a 64bit int; that may end up being too
small to actually help though). This would mean that when we did get to
scanning a bucket we would at least be much more likely of finding a match
instead of scanning it and finding nothing.
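As a purely illustrative sketch of that bitmap idea (all names made up here,
not proposed code): the top six bits of the 32-bit hash value pick one of 64
slices, and a probe can skip any slice that no inner tuple ever hashed into.

#include <stdbool.h>
#include <stdint.h>

static uint64_t used_slices = 0;   /* one bit per 1/64th of the 32-bit hash space */

/* Called while loading the inner side, once per tuple. */
static void
note_inner_hash(uint32_t hashvalue)
{
    used_slices |= UINT64_C(1) << (hashvalue >> 26);   /* top 6 bits pick a slice */
}

/* Called while probing: if the slice is empty, this outer tuple cannot match. */
static bool
slice_may_contain(uint32_t hashvalue)
{
    return (used_slices >> (hashvalue >> 26)) & UINT64_C(1);
}

As noted above, with only 64 slices the filter may be too coarse to help much,
but it shows where such a check would sit.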

Thanks,

Stephen


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Fabien COELHO



I flag it 'return with feedback', please update the patch so it builds.
Everything else is ok.


Here it is.

--
Fabien.diff --git a/doc/src/sgml/ref/create_cast.sgml b/doc/src/sgml/ref/create_cast.sgml
index 29ea298..0ace996 100644
--- a/doc/src/sgml/ref/create_cast.sgml
+++ b/doc/src/sgml/ref/create_cast.sgml
@@ -20,15 +20,15 @@
 
 CREATE CAST (source_type AS target_type)
 WITH FUNCTION function_name (argument_type [, ...])
-[ AS ASSIGNMENT | AS IMPLICIT ]
+[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ]
 
 CREATE CAST (source_type AS target_type)
 WITHOUT FUNCTION
-[ AS ASSIGNMENT | AS IMPLICIT ]
+[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ]
 
 CREATE CAST (source_type AS target_type)
 WITH INOUT
-[ AS ASSIGNMENT | AS IMPLICIT ]
+[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ]
 
  
 
@@ -74,7 +74,8 @@ SELECT CAST(42 AS float8);
   
 
   
-   By default, a cast can be invoked only by an explicit cast request,
+   By default, or if the cast is declared AS EXPLICIT,
+   a cast can be invoked only by an explicit cast request,
that is an explicit CAST(x AS
typename) or
x::typename
@@ -239,6 +240,21 @@ SELECT CAST ( 2 AS numeric ) + 4.0;
 
 
 
+ AS EXPLICIT
+
+ 
+  
+   Indicates that the cast can be invoked only with an explicit
+   cast request, that is an explicit CAST(x AS
+   typename) or
+   x::typename
+   construct.
+   This is the default.
+  
+ 
+
+
+
  AS IMPLICIT
 
  
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 5094226..2c0694f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -533,7 +533,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP
 
 	EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ENUM_P ESCAPE EVENT EXCEPT
-	EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
+	EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN EXPLICIT
 	EXTENSION EXTERNAL EXTRACT
 
 	FALSE_P FAMILY FETCH FIRST_P FLOAT_P FOLLOWING FOR FORCE FOREIGN FORWARD
@@ -6720,6 +6720,7 @@ CreateCastStmt: CREATE CAST '(' Typename AS Typename ')'
 
 cast_context:  AS IMPLICIT_P	{ $$ = COERCION_IMPLICIT; }
 		| AS ASSIGNMENT			{ $$ = COERCION_ASSIGNMENT; }
+		| AS EXPLICIT			{ $$ = COERCION_EXPLICIT; }
 		| /*EMPTY*/{ $$ = COERCION_EXPLICIT; }
 		;
 
@@ -12723,6 +12724,7 @@ unreserved_keyword:
 			| EXCLUSIVE
 			| EXECUTE
 			| EXPLAIN
+			| EXPLICIT
 			| EXTENSION
 			| EXTERNAL
 			| FAMILY
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 68a13b7..f97389b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -149,6 +149,7 @@ PG_KEYWORD("exclusive", EXCLUSIVE, UNRESERVED_KEYWORD)
 PG_KEYWORD("execute", EXECUTE, UNRESERVED_KEYWORD)
 PG_KEYWORD("exists", EXISTS, COL_NAME_KEYWORD)
 PG_KEYWORD("explain", EXPLAIN, UNRESERVED_KEYWORD)
+PG_KEYWORD("explicit", EXPLICIT, UNRESERVED_KEYWORD)
 PG_KEYWORD("extension", EXTENSION, UNRESERVED_KEYWORD)
 PG_KEYWORD("external", EXTERNAL, UNRESERVED_KEYWORD)
 PG_KEYWORD("extract", EXTRACT, COL_NAME_KEYWORD)
diff --git a/src/test/regress/expected/create_cast.out b/src/test/regress/expected/create_cast.out
index 56cd86e..a8858fa 100644
--- a/src/test/regress/expected/create_cast.out
+++ b/src/test/regress/expected/create_cast.out
@@ -27,8 +27,8 @@ ERROR:  function casttestfunc(text) does not exist
 LINE 1: SELECT casttestfunc('foo'::text);
^
 HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
--- Try binary coercion cast
-CREATE CAST (text AS casttesttype) WITHOUT FUNCTION;
+-- Try binary coercion cast and verbose AS EXPLICIT
+CREATE CAST (text AS casttesttype) WITHOUT FUNCTION AS EXPLICIT;
 SELECT casttestfunc('foo'::text); -- doesn't work, as the cast is explicit
 ERROR:  function casttestfunc(text) does not exist
 LINE 1: SELECT casttestfunc('foo'::text);
@@ -54,7 +54,7 @@ SELECT 1234::int4::casttesttype; -- No cast yet, should fail
 ERROR:  cannot cast type integer to casttesttype
 LINE 1: SELECT 1234::int4::casttesttype;
  ^
-CREATE CAST (int4 AS casttesttype) WITH INOUT;
+CREATE CAST (int4 AS casttesttype) WITH INOUT; -- default AS EXPLICIT
 SELECT 1234::int4::casttesttype; -- Should work now
  casttesttype 
 --
diff --git a/src/test/regress/sql/create_cast.sql b/src/test/regress/sql/create_cast.sql
index ad348da..91123ca 100644
--- a/src/test/regress/sql/create_cast.sql
+++ b/src/test/regress/sql/create_cast.sql
@@ -27,8 +27,8 @@ $$ SELECT 1; $$;
 
 SELECT casttestfunc('foo'::text); -- fails, as there's no cast
 
--- Try binary coercion cast
-CREATE CAST (text AS casttesttype) WITHOUT FUNCTION;
+-- Try binary coercion cast and verbose AS EXPLICIT
+CREATE CAST (text AS casttesttype) WITHOUT FUNCTION AS EXPLICIT;
 SELECT casttestfunc('foo'::text); -- 

Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Fabien COELHO


Hello Cédric,


So maybe it is possible to rephrase this piece:
-   AS IMPLICIT is a PostgreSQL
-   extension, too.
+   AS IMPLICIT and AS EXPLICIT are
+   a PostgreSQL extension, too.


Ok.


Back in 2012 Tom exposed arguments against it, or at least not a clear +1.
The patch adds nothing but a more explicit creation statement; it has gone
untouched for 2 years without interest, so it is surely not something really
important for PostgreSQL users.


Elegance is important :-) See my answer to Robert's objection.


* Coding review
Patch does not pass test:
./check_keywords.pl gram.y ../../../src/include/parser/kwlist.h


Oops! That is not elegant!


I had to update the unreserved keyword list in order to be able to build
postgresql.



** Does it produce compiler warnings? don't build as is. Need patch update


Indeed.


I flag it 'return with feedback', please update the patch so it builds.
Everything else is ok.


Yep.

--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Fabien COELHO


Hello Robert,


I object to this patch.  This patch adds a new keyword, EXPLICIT, for
reasons that are strictly cosmetic.  Everything that you can do with
this patch can also be done without this patch.  It is not a good idea
to slow down parsing of every SQL statement we have just so that
someone can write CREATE CAST .. AS EXPLICIT.  Granted, the parsing
slowdown for just one keyword is probably not noticeable, but it's
cumulative with every new keyword we add.  Adding them to support new
features is one thing, but adding them to support purely optional
syntax is, I think, going too far.


I partly object to the objection:-)

I agree that it may induce a very small delay to the parser, however I 
*do* think that cosmetic things are important. In order to ease 
understanding, learning and memorizing a language, concepts must have 
names, syntaxes, and be orthogonal and symmetric where applicable.


In this example, there are 3 kinds of casts, all 3 have a conceptual name 
(explicit, implicit, assignment) but only two have a syntax, and the other 
one is the absence of syntax. So you have to memorize this stupid 
information (which one of the three does not have a syntax) or read the 
documentation every time to remember that "explicit" is the one without a 
syntax. Note also that you must state "implicit" explicitly, but
"explicit" is stated implicitly, which does not really help.


The documentation is also impacted: it is not symmetric because it follows
the syntax, which is not symmetric either, so the result is simply a little
harder to understand.


Every year I do my class about PL/pgSQL and extensions to Pg, and every 
year some students will try "as explicit" because it is logical to do so. 
I think that they are right and that it should work, instead of having to
explain that "explicit" is implicit when dealing with Pg casts. Although 
it is my job, I would prefer to spend time explaining more interesting 
things.


From the software engineering point of view, having a syntax for all cases
means that the developer must think about which kind of cast she really 
wants, instead of doing the default thing just because it is the default.


So in my mind the tradeoff is between people time & annoyance and a few 
machine cycles, and I have no hesitation in choosing the latter.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dump difference between 9.3 and master after upgrade

2013-06-22 Thread Andrew Dunstan


On 06/20/2013 11:16 AM, I wrote:


On 06/20/2013 10:43 AM, Robert Haas wrote:
On Tue, Jun 18, 2013 at 12:18 PM, Andrew Dunstan  wrote:
As I was updating my cross version upgrade tester to include support for the
9.3 branch, I noted this dump difference between the dump of the original
9.3 database and the dump of the converted database, which looks odd. Is it
correct?

It looks sketchy to me, but I'm not sure exactly what you're comparing.



Essentially, cross version upgrade testing runs pg_dumpall from the 
new version on the old cluster, runs pg_upgrade, and then runs 
pg_dumpall on the upgraded cluster, and compares the two outputs. This 
is what we get when the new version is HEAD and the old version is 9.3.


The reason this hasn't been caught by the standard same-version 
upgrade tests is that this module uses a more extensive cluster, which 
has had not only the core regression tests run but also all the 
contrib and pl regression tests, and this problem seems to be exposed 
by the postgres_fdw tests.


At first glance it looks to me like pg_dump in binary-upgrade mode is not
suppressing something that it should be suppressing.





Yeah, after examination I don't see why we should output anything for 
dropped columns of a foreign table in binary upgrade mode. This looks to 
me like it's been a bug back to 9.1 where we introduced foreign tables.


I think something like the attached should fix it, although I'm not sure 
if that's the right place for the fix.


cheers

andrew
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ec956ad..b25b475 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -6670,7 +6670,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
  * such a column it will mistakenly get pg_attribute.attislocal set to true.)
  * However, in binary_upgrade mode, we must print all such columns anyway and
  * fix the attislocal/attisdropped state later, so as to keep control of the
- * physical column order.
+ * physical column order, unless it's a Foreign Table.
  *
  * This function exists because there are scattered nonobvious places that
  * must be kept in sync with this decision.
@@ -6679,7 +6679,7 @@ bool
 shouldPrintColumn(TableInfo *tbinfo, int colno)
 {
 	if (binary_upgrade)
-		return true;
+		return (tbinfo->relkind != RELKIND_FOREIGN_TABLE || !tbinfo->attisdropped[colno]);
 	return (tbinfo->attislocal[colno] && !tbinfo->attisdropped[colno]);
 }
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Robert Haas
On Fri, Jun 21, 2013 at 10:02 PM, MauMau  wrote:
> I'm comfortable with 5 seconds.  We are talking about the interval between
> sending SIGQUIT to the children and then sending SIGKILL to them.  In most
> situations, the backends should terminate immediately.  However, as I said a
> few months ago, ereport() call in quickdie() can deadlock indefinitely. This
> is a PostgreSQL bug.

So let's fix that bug.  Then we don't need this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 9:16 AM, Cédric Villemain
 wrote:
> patch is in unified format and apply on HEAD.
> patch contains documentation, however I believe 'AS IMPLICIT' is a PostgreSQL
> extension with special behavior and 'AS EXPLICIT' respect the standard except
> that PostgreSQL adds only the expression 'AS EXPLICIT' (it is also the default
> in the standard).

I object to this patch.  This patch adds a new keyword, EXPLICIT, for
reasons that are strictly cosmetic.  Everything that you can do with
this patch can also be done without this patch.  It is not a good idea
to slow down parsing of every SQL statement we have just so that
someone can write CREATE CAST .. AS EXPLICIT.  Granted, the parsing
slowdown for just one keyword is probably not noticeable, but it's
cumulative with every new keyword we add.  Adding them to support new
features is one thing, but adding them to support purely optional
syntax is, I think, going too far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost  wrote:
> I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
> everything can handle that cleanly..?  I feel like it might be a bit
> safer to shy a bit short of INT_MAX (say, by 1K).

Maybe it would be better to stick with INT_MAX and fix any bugs we
find.  If there are magic numbers short of INT_MAX that cause
problems, it would likely be better to find out about those problems
and adjust the relevant code, rather than trying to dodge them.  We'll
have to confront all of those problems eventually as we come to
support larger and larger sorts; I don't see much value in putting it
off.

Especially since we're early in the release cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 9:48 AM, Stephen Frost  wrote:
>> The correct calculation that would match the objective set out in the
>> comment would be
>>
>>  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
>
> This looks to be driving the size of the hash table size off of "how
> many of this size tuple can I fit into memory?" and ignoring how many
> actual rows we have to hash.  Consider a work_mem of 1GB with a small
> number of rows to actually hash- say 50.  With a tupsize of 8 bytes,
> we'd be creating a hash table sized for some 13M buckets.

This is a fair point, but I still think Simon's got a good point, too.
 Letting the number of buckets ramp up when there's ample memory seems
like a broadly sensible strategy.  We might need to put a floor on the
effective load factor, though.
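[Editorial note: to put rough numbers on the trade-off, here is a small
self-contained illustration of the two formulas being debated, using the 1GB
work_mem / 8-byte-tuple / 50-row figures from upthread. The variable names
mirror ExecChooseHashTableSize() but this is only arithmetic, not the planner
code.]

#include <math.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10

int
main(void)
{
    double ntuples = 50;                              /* planner's (possibly wrong) row estimate */
    double hash_table_bytes = 1024.0 * 1024 * 1024;   /* work_mem = 1GB */
    double tupsize = 8;                               /* bytes per hashed tuple */

    /* current code: sized purely from the estimated row count */
    double dbuckets_current = ceil(ntuples / NTUP_PER_BUCKET);                 /* 5 */

    /* formula matching the comment: sized from what fits in memory */
    double dbuckets_proposed = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET; /* ~13.4M */

    /* the bucket array itself costs nbuckets * sizeof(pointer), i.e. roughly
     * 13.4M * 8 bytes = ~107MB here, which the function never charges against
     * work_mem -- the point made just above */
    printf("current: %.0f buckets, proposed: %.0f buckets\n",
           dbuckets_current, dbuckets_proposed);
    return 0;
}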

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 9:15 AM, Simon Riggs  wrote:
> Previous discussions of Hash Joins have noted that the performance
> decreases when the average number of tuples per bucket increases.
> O(N^2) effects are seen.
>
> We've argued this about many different ways, yet all of those
> discussions have centred around the constant NTUP_PER_BUCKET. I
> believe that was a subtle mistake and I have a proposal.
>
> The code in ExecChooseHashTableSize() line 460 says
>  /*
>   * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
>   * memory is filled.
> ...
>
> but the calculation then sets the number of buckets like this
>
>  dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
>
> **This doesn't match the comment.** If we underestimate the number
> of tuples and go on to fill the available memory, we then end up with
> an average number of tuples per bucket higher than NTUP_PER_BUCKET. A
> notational confusion that has been skewing the discussion.
>
> The correct calculation that would match the objective set out in the
> comment would be
>
>  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
>
> Which leads us to a much more useful value of dbuckets in the case
> where using ntuples occupies much less space than is available. This
> value is always same or higher than previously because of the if-test
> that surrounds it.

+1.  I think this is very clever.

The other thing that seems to be distorting things here is that this
function doesn't account for the memory consumed by the bucket array.
So as things stand today, changing NTUP_PER_BUCKET can't ever increase
the required number of batches.  But that's really nonsensical.
Setting NTUP_PER_BUCKET to a smaller value like 1 or even, in effect,
a value less than 1 would probably improve performance for large hash
tables, but there's no way to decide what value is too expensive
because the memory cost of changing it isn't accounted for in the
first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Robert Haas
On Fri, Jun 21, 2013 at 11:19 PM, Tom Lane  wrote:
>> I think that's the Tom Lane theory.  The Robert Haas theory is that if
>> the postmaster has died, there's no reason to suppose that it hasn't
>> corrupted shared memory on the way down, or that the system isn't
>> otherwise heavily fuxxored in some way.
>
> Eh?  The postmaster does its level best never to touch shared memory
> (after initialization anyway).

And yet it certainly does - see pmsignal.c, for example.  Besides
which, as Andres points out, if the postmaster is dead, there is zip
for a guarantee that some OTHER backend hasn't panicked.  I think it's
just ridiculous to suppose that the system can run in any sort of
reasonable way without the postmaster.  The whole reason why we work
so hard to make sure that the postmaster doesn't die in the first
place is because we need it to clean up when things go horribly wrong.
 If that cleanup function is important, then we need a living
postmaster at all times.  If it's not important, then our extreme
paranoia about what operations the postmaster is permitted to engage
in is overblown.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-22 Thread Fabien COELHO


Please find attached a v12, which under timer_exceeded interrupts clients 
which are being throttled instead of waiting for the end of the transaction, 
as the transaction is not started yet.


Oops, I forgot the attachment. Here it is!

--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 1303217..4c9e55d 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -200,11 +206,13 @@ typedef struct
 	int			listen;			/* 0 indicates that an async query has been
  * sent */
 	int			sleeping;		/* 1 indicates that the client is napping */
+	bool		throttling;		/* whether nap is for throttling */
 	int64		until;			/* napping until (usec) */
 	Variable   *variables;		/* array of variable definitions */
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	bool		throttled;  /* whether current transaction was throttled */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -222,6 +230,10 @@ typedef struct
 	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
 	int		   *exec_count;		/* number of cmd executions (per Command) */
 	unsigned short random_state[3];		/* separate randomness for each thread */
+	int64   throttle_trigger;  /* previous/next throttling (us) */
+	int64   throttle_lag;  /* total transaction lag behind throttling */
+	int64   throttle_lag_max;  /* max transaction lag */
+
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -230,6 +242,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64   throttle_lag;
+	int64   throttle_lag_max;
 } TResult;
 
 /*
@@ -355,6 +369,8 @@ usage(void)
 		   "  -n   do not run VACUUM before tests\n"
 		   "  -N   do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
 		   "  -r   report average latency per command\n"
+		   "  -R SPEC, --rate SPEC\n"
+		   "   target rate in transactions per second\n"
 		   "  -s NUM   report this scale factor in output\n"
 		   "  -S   perform SELECT-only transactions\n"
 	 "  -t NUM   number of transactions each client runs (default: 10)\n"
@@ -898,17 +914,58 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 {
 	PGresult   *res;
 	Command   **commands;
+	bool		do_throttle = false;
 
 top:
 	commands = sql_files[st->use_file];
 
+	/* handle throttling once per transaction by inserting a sleep.
+	 * this is simpler than doing it at the end.
+	 */
+	if (throttle_delay && ! st->throttled)
+	{
+		/* compute delay to approximate a Poisson distribution
+		 * 100 => 13.8 .. 0 multiplier
+		 *  10 => 11.5 .. 0
+		 *   1 =>  9.2 .. 0
+		 *1000 =>  6.9 .. 0
+		 * if transactions are too slow or a given wait shorter than
+		 * a transaction, the next transaction will start right away.
+		 */
+		int64 wait = (int64)
+			throttle_delay * -log(getrand(thread, 1, 1000)/1000.0);
+
+		thread->throttle_trigger += wait;
+
+		st->until = thread->throttle_trigger;
+		st->sleeping = 1;
+		st->throttling = true;
+		st->throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling "INT64_FORMAT" us\n",
+	st->id, wait);
+	}
+
 	if (st->sleeping)
 	{			/* are we sleeping? */
 		instr_time	now;
+		int64 now_us;
 
 		INSTR_TIME_SET_CURRENT(now);
-		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
+		now_us = INSTR_TIME_GET_MICROSEC(now);
+		if (st->until <= now_us)
+		{
 			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
+			if (st->throttling)
+			{
+				/* measure lag of throttled transaction */
+				int64 lag = now_us - st->until;
+				thread->throttle_lag += lag;
+				if (lag > thread->throttle_lag_max)
+					thread->throttle_lag_max = lag;
+				st->throttling = false;
+			}
+		}
 		else
 			return true;		/* Still sleeping, nothing to do here */
 	}
@@ -1037,7 +1094,7 @@ top:
 	 * This is more than we really ought to know about
 	 * instr_time
 	 */
-	fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
+	fprintf(logfile, "%d %d %.0f %d %ld.%06ld\n",
 			st->id, st->cnt, usec, st->use_file,
 			(long) now.tv_sec, (long) now.tv_usec);
 #else
@@ -1046,7 +1103,7 @@ top:
 	 * On Windows, instr_time doesn't provide a timestamp
 	 * anyway
 	 */
-	fprintf(logfile, "%d %d %.0f %d 0 0\n",
+	fprintf(logfile, "%d %d %.0f %d 0.0\n",
 			st->id, st->cnt, usec, st->use_file);
 #endif
 }
@@ -1095,6 +1152,13 @@ top:
 			st->state = 0;
 			st->use_fil

Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-22 Thread Fabien COELHO


Please find attached a v12, which under timer_exceeded interrupts clients 
which are being throttled instead of waiting for the end of the 
transaction, as the transaction is not started yet.


I've also changed the log format that I used for debugging the apparent 
latency issue:


  x y z 12345677 1234 -> x y z 12345677.001234

It seems much clearer that way.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Andres Freund
On 2013-06-21 16:56:57 -0400, Alvaro Herrera wrote:
> > What we could do to improve the robustness a bit - at least on linux -
> > is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be
> > killed if the postmaster goes away...
> 
> This is an interesting idea (even if it has no relationship to the patch
> at hand).

Well, we could just set the deathsig to SIGKILL and exit normally -
which would be a twoliner or so.
Admittedly the cross-platform issue makes this approach not so great...
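For reference, a minimal sketch of the Linux-only call being talked about
(hence the cross-platform caveat); this is not a proposed patch, just the
shape of the "twoliner" in a child process:

#include <signal.h>
#include <sys/prctl.h>

/* Ask the kernel to deliver SIGKILL to this process when its parent (the
 * postmaster) dies.  Linux-specific; would be called early in each child. */
static void
arm_parent_death_signal(void)
{
    prctl(PR_SET_PDEATHSIG, SIGKILL);
}

A real version would presumably also re-check getppid() right afterwards,
since the parent may already have exited before the call was made.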

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Alvaro Herrera
MauMau wrote:

> Are you suggesting simplifying the following part in ServerLoop()?
> I welcome the idea if this condition becomes simpler.  However, I
> cannot imagine how.

>  if (AbortStartTime > 0 &&  /* SIGKILL only once */
>   (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
>   now - AbortStartTime >= 10)
>  {
>   SignalAllChildren(SIGKILL);
>   AbortStartTime = 0;
>  }

Yes, that's what I'm suggesting.  I haven't tried coding it yet.

> I thought of adding some new state of pmState for some reason (that
> might be the same as your idea).
> But I refrained from doing that, because pmState has already many
> states.  I was afraid adding a new pmState value for this bug fix
> would complicate the state management (e.g. state transition in
> PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was
> appropriate because postmaster is actually waiting for backends to
> terminate after sending SIGQUIT.  The state name is natural.

Well, a natural state name is of no use if we're using it to indicate
two different states, which I think is what would be happening here.
Basically, with your patch, PM_WAIT_BACKENDS means two different things
depending on AbortStartTime.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Alvaro Herrera
Andres Freund wrote:
> On 2013-06-22 22:45:26 +0900, Michael Paquier wrote:

> > And I imagine that you have the same problem even with
> > RelationGetIndexList, not only RelationGetIndexListIfInvalid, because
> > this would appear as long as you try to open more than 1 index with an
> > index list.
> 
> No. RelationGetIndexList() returns a copy of the list for exactly that
> reason. The danger is not to see an outdated list - we should be
> protected by locks against that - but looking at uninitialized or reused
> memory.

Are we doing this only to save some palloc traffic?  Could we do this
by, say, teaching list_copy() to have a special case for lists of ints
and oids that allocates all the cells in a single palloc chunk?

(This has the obvious problem that list_free no longer works, of
course.  But I think that specific problem can be easily fixed.  Not
sure if it causes more breakage elsewhere.)

Alternatively, I guess we could grab an uncopied list, then copy the
items individually into a locally allocated array, avoiding list_copy.
We'd need to iterate differently than with foreach().
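
A rough, untested sketch of that second idea, reusing the function and
variable names from the patch upthread (toastrel and toastidxs come from the
surrounding code there):

    int         nindexes;
    Oid        *indexoids;
    int         i = 0;
    ListCell   *lc;

    /* grab the uncopied list, then copy just the OIDs... */
    RelationFetchIndexListIfInvalid(toastrel);
    nindexes = list_length(toastrel->rd_indexlist);
    indexoids = (Oid *) palloc(nindexes * sizeof(Oid));
    foreach(lc, toastrel->rd_indexlist)
        indexoids[i++] = lfirst_oid(lc);

    /* ... so that index_open()'s invalidations can no longer hurt us */
    for (i = 0; i < nindexes; i++)
        toastidxs[i] = index_open(indexoids[i], RowExclusiveLock);

The copying loop itself never calls anything that accepts invalidation
messages, so foreach() over the uncopied list is safe there; only the
index_open() loop has to work from the local array.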

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-22 Thread Fabien COELHO


Dear Robert and Greg,


I think so.  If it doesn't get fixed now, it's not likely to get fixed
later.  And the fact that nobody understands why it's happening is
kinda worrisome...


Possibly, but I think that it is not my fault :-)

So, I spent some time at performance debugging...

First, I finally succeeded in reproducing Greg Smith's spikes on my Ubuntu 
laptop. It takes short transactions and many clients per thread to produce a 
real spike; with the standard "pgbench" transactions and not too many clients 
per thread it is more of a little bump, or not there at all.


After some poking around, and pursuing various red herrings, I resorted to 
measuring the delay of calling PQfinish(), which is really the only special 
thing going on at the end of a pgbench run...


BINGO!

In his tests Greg is using one thread. Once the overall timer is exceeded, 
clients start closing their connections by calling PQfinish once their 
transactions are done. This call takes anywhere from a few us to a few ms. 
So if some client's closing time hits the bad figures, the transactions of 
the other clients are artificially delayed by that amount and appear to have 
a high latency, but that is really because the thread was inside another 
client's PQfinish and was not available to process their data. With one 
thread per client there is no problem, especially as the final PQfinish() 
time is not measured at all by pgbench :-) Also, the more clients, the higher 
the spike, because more of them must be stopped and more may hit the bad 
figures.


Here is a trace, with the simple SELECT transaction.

  sh> ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench
  ...

  sh> less pgbench_log.*

 # short transactions, about 0.250 ms
 ...
 20 4849 241 0 1371916583.455400
 21 4844 256 0 1371916583.455470
 22 4832 348 0 1371916583.455569
 23 4829 218 0 1371916583.455627

   ** TIMER EXCEEDED **

 25 4722 390 0 1371916583.455820
 25 done in 1560 [1,2]  # BING, 1560 us for PQfinish()
 26 4557 1969 0 1371916583.457407
 26 done in 21 [1,2]
 27 4372 1969 0 1371916583.457447
 27 done in 19 [1,2]
 28 4009 1910 0 1371916583.457486
 28 done in 1445 [1,2]  # BOUM
 2 interrupted in 1300 [0,0]# BANG
 3 interrupted in 15 [0,0]
 4 interrupted in 40 [0,0]
 5 interrupted in 203 [0,0] # boum?
 6 interrupted in 1352 [0,0]# ZIP
 7 interrupted in 18 [0,0]
 ...
 23 interrupted in 12 [0,0]

 ## the cumulated PQfinish() time above is about 6 ms which
 ## appears as an apparent latency for the next clients:

  0 4880 6521 0 1371916583.462157
  0 done in 9 [1,2]
  1 4877 6397 0 1371916583.462194
  1 done in 9 [1,2]
 24 4807 6796 0 1371916583.462217
 24 done in 9 [1,2]
 ...

Note that the bad measures also appear when there is no throttling:

  sh> ./pgbench -T 10 -r -l -c 30 -S bench
  sh> grep 'done.*[0-9][0-9][0-9]' pgbench_log.*
   0 done in 1974 [1,2]
   1 done in 312 [1,2]
   3 done in 2159 [1,2]
   7 done in 409 [1,2]
  11 done in 393 [1,2]
  12 done in 2212 [1,2]
  13 done in 1458 [1,2]
  ## other clients execute PQfinish in less than 100 us

This "done" is issued by my instrumented version of clientDone().

The issue does also appear if I instrument pgbench from master, without 
anything from the throttling patch at all:


  sh> git diff master
  diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
  index 1303217..7c5ea81 100644
  --- a/contrib/pgbench/pgbench.c
  +++ b/contrib/pgbench/pgbench.c
  @@ -869,7 +869,15 @@ clientDone(CState *st, bool ok)

  if (st->con != NULL)
  {
  +   instr_time now;
  +   int64 s0, s1;
  +   INSTR_TIME_SET_CURRENT(now);
  +   s0 = INSTR_TIME_GET_MICROSEC(now);
  PQfinish(st->con);
  +   INSTR_TIME_SET_CURRENT(now);
  +   s1 = INSTR_TIME_GET_MICROSEC(now);
  +   fprintf(stderr, "%d done in %ld [%d,%d]\n",
  +   st->id, s1-s0, st->listen, st->state);
  st->con = NULL;
  }
  return false;   /* always false */

  sh> ./pgbench -T 10 -r -l -c 30 -S bench 2> x.err

  sh> grep 'done.*[0-9][0-9][0-9]' x.err
  14 done in 1985 [1,2]
  16 done in 276 [1,2]
  17 done in 1418 [1,2]

So my evidence-backed conclusion is that the issue lies somewhere within 
PQfinish(), possibly in interaction with what pgbench is doing, but is *NOT* 
related in any way to the throttling patch, since it predates it. Greg 
stumbled upon it because he looked at latencies.


I'll submit a slightly improved v12 shortly.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
On 22 June 2013 14:48, Stephen Frost  wrote:

> Based on your argument that we want to have a bucket load which is, on
> average, the size of NTUP_PER_BUCKET, I have to '-1' on this.

That is not my argument. I am pointing out that the comments claim the
code does that, and yet the code does not.

So either we apply the patch to make the code fit the comment, or we
change the comment.

Tom's wish for an average tuples per bucket of 10 may be connected to
that error; I can't say. But we definitely don't end up with an
average of 10 in many cases, which is the basis for previous poor
performance reports.

> What we
> want is to size a table large enough that we never have any true
> collisions (cases where different 32-bit hash values end up in the same
> bucket) *for all tuples being hashed*, that includes both the side
> building the hash table and the side doing the scan.  This should be
> done when we can avoid batching- if we don't have enough to avoid
> batching while having such a large table, we should consider adjusting
> the hash table size down some because, in those cases, we're memory
> constrained.
>
> When we have enough memory to avoid batching, we never want to have to
> check down through a bucket for a tuple which doesn't exist- we should
> simply be able to look in the hash table itself, determine (with a
> single fetch) that there are no entries for that hash value, and throw
> the tuple away and move on to the next.

The bottom line is that you're hoping for perfection. If we knew
exactly how many tuples were in the hash table then we could size it
precisely for the hash join we are about to perform. We don't know
that, so we can't size it precisely. Worse than that is that we know
we badly estimate the number of buckets on a regular basis.

That gives us three choices: if we have a hash table sized without
prior knowledge, then we either (0) continue to underallocate buckets
in many cases or (1) potentially overallocate them. Or (2) we learn
to cope better with the variation in the number of entries. If we rule
out the first two we have only the last one remaining.

(1) will always be a heuristic, and as you point out could itself be
an extreme setting.

So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
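
As a strawman, the post-build check could be as simple as something like the
following (made-up names; in particular ExecHashRebuildBuckets does not
exist, and the real thing would live in nodeHash.c and have to cost the
rebuild against the expected probe savings):

    double      load = (double) hashtable->totalTuples / hashtable->nbuckets;

    if (load > 2 * NTUP_PER_BUCKET)
    {
        int         newbuckets = 1;

        /* round the needed bucket count up to a power of two */
        while ((double) newbuckets * NTUP_PER_BUCKET < hashtable->totalTuples)
            newbuckets <<= 1;

        /* hypothetical helper: repalloc the bucket array, re-link tuples */
        ExecHashRebuildBuckets(hashtable, newbuckets);
    }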

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-22 Thread Bruce Momjian
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote:
> (I'm still learning the details of Pg's WAL, WAL replay and recovery, so
> the below's just my understanding):
> 
> The problem is that WAL for all tablespaces is mixed together in the
> archives. If you lose your tablespace then you have to keep *all* WAL
> around and replay *all* of it again when the tablespace comes back
> online. This would be very inefficient, would require a lot of tricks to
> cope with applying WAL to a database that has an on-disk state in the
> future as far as the archives are concerned. It's not as simple as just
> replaying all WAL all over again - as I understand it, things like
> CLUSTER or TRUNCATE will result in relfilenodes not being where they're
> expected to be as far as old WAL archives are concerned. Selective
> replay would be required, and that leaves the door open to all sorts of
> new and exciting bugs in areas that'd hardly ever get tested.
> 
> To solve the massive disk space explosion problem I imagine we'd have to
> have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
> and loss of the rather nice property that WAL writes are nice sequential
> writes. It'd be complicated and probably cause nightmares during
> recovery, for archive-based replication, etc.
> 
> The only other thing I can think of is: When a tablespace is offline,
> write WAL records to a separate "tablespace recovery log" as they're
> encountered. Replay this log when the tablespace comes is restored,
> before applying any other new WAL to the tablespace. This wouldn't
> affect archive-based recovery since it'd already have the records from
> the original WAL.
> 
> None of these options seem exactly simple or pretty, especially given
> the additional complexities that'd be involved in allowing WAL records
> to be applied out-of-order, something that AFAIK _never_h happens at the
> moment.
> 
> The key problem, of course, is that this all sounds like a lot of
> complicated work for a case that's not really supposed to happen. Right
> now, the answer is "your database is unrecoverable, switch to your
> streaming warm standby and re-seed it from the standby". Not pretty, but
> at least there's the option of using a sync standby and avoiding data loss.
> 
> How would you approach this?

Sorry to be replying late.  You are right that you could record/apply
WAL separately for offline tablespaces.  The problem is that you could
have logical ties from the offline tablespace to online tablespaces. 
For example, what happens if data in an online tablespace references a
primary key in an offline tablespace.  What if the system catalogs are
stored in an offline tablespace?  Right now, we allow logical bindings
across physical tablespaces.  To do what you want, you would really need
to store each database in its own tablespace.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible bug in CASE evaluation

2013-06-22 Thread Andres Freund
On 2013-06-21 16:45:28 +0200, Andres Freund wrote:
> On 2013-06-21 09:51:05 -0400, Noah Misch wrote:
> > That being said, if we discover a simple-enough fix that performs well, we 
> > may
> > as well incorporate it.
> 
> What about passing another parameter down eval_const_expressions_mutator
> (which is static, so changing the API isn't a problem) that basically
> tells us whether we actually should evaluate expressions or just perform
> the transformation?
> There's seems to be basically a couple of places where we call dangerous
> things:
> * simplify_function (via ->evaluate_function->evaluate_expr) which is
>   called in a bunch of places
> * evaluate_expr which is directly called in T_ArrayExpr
>   T_ArrayCoerceExpr
> 
> All places I've inspected so far need to deal with simplify_function
> returning NULL anyway, so that seems like a viable fix.

*Prototype* patch - that seems simple enough - attached. Opinions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
>From 1349f8e4a8133eebbf66753b1aa3787f88ee3e33 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 22 Jun 2013 16:53:39 +0200
Subject: [PATCH] Don't evaluate potentially unreachable parts of CASE during
 planning

---
 src/backend/optimizer/util/clauses.c | 31 +--
 src/test/regress/expected/case.out   | 13 ++---
 src/test/regress/sql/case.sql|  4 ++--
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 6d5b204..94b52f6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -63,6 +63,7 @@ typedef struct
 	List	   *active_fns;
 	Node	   *case_val;
 	bool		estimate;
+	bool		doevil;
 } eval_const_expressions_context;
 
 typedef struct
@@ -2202,6 +2203,7 @@ eval_const_expressions(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = false;	/* safe transformations only */
+	context.doevil = true;		/* allowed to evaluate const expressions */
 	return eval_const_expressions_mutator(node, &context);
 }
 
@@ -2233,6 +2235,7 @@ estimate_expression_value(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = true;	/* unsafe transformations OK */
+	context.doevil = true;		/* allowed to evaluate const expressions */
 	return eval_const_expressions_mutator(node, &context);
 }
 
@@ -2751,7 +2754,7 @@ eval_const_expressions_mutator(Node *node,
  * If constant argument and it's a binary-coercible or
  * immutable conversion, we can simplify it to a constant.
  */
-if (arg && IsA(arg, Const) &&
+if (context->doevil && arg && IsA(arg, Const) &&
 	(!OidIsValid(newexpr->elemfuncid) ||
 func_volatile(newexpr->elemfuncid) == PROVOLATILE_IMMUTABLE))
 	return (Node *) evaluate_expr((Expr *) newexpr,
@@ -2843,16 +2846,20 @@ eval_const_expressions_mutator(Node *node,
 CaseExpr   *caseexpr = (CaseExpr *) node;
 CaseExpr   *newcase;
 Node	   *save_case_val;
+bool	save_doevil;
 Node	   *newarg;
 List	   *newargs;
 bool		const_true_cond;
 Node	   *defresult = NULL;
 ListCell   *arg;
 
+save_doevil = context->doevil;
+
 /* Simplify the test expression, if any */
 newarg = eval_const_expressions_mutator((Node *) caseexpr->arg,
 		context);
 
+
 /* Set up for contained CaseTestExpr nodes */
 save_case_val = context->case_val;
 if (newarg && IsA(newarg, Const))
@@ -2894,6 +2901,17 @@ eval_const_expressions_mutator(Node *node,
 		const_true_cond = true;
 	}
 
+	/*
+	 * We can only evaluate expressions as long we are sure the
+	 * the expression will be reached at runtime - otherwise we
+	 * might cause spurious errors to be thrown. The first
+	 * eval_const_expression above can always evaluate
+	 * expressions (unless we are in a nested CaseExpr) since
+	 * it will always be reached at runtime.
+	 */
+	if (!const_true_cond)
+		context->doevil = false;
+
 	/* Simplify this alternative's result value */
 	caseresult = eval_const_expressions_mutator((Node *) oldcasewhen->result,
 context);
@@ -2926,6 +2944,8 @@ eval_const_expressions_mutator(Node *node,
 
 context->case_val = save_case_val;
 
+context->doevil = save_doevil;
+
 /*
  * If no non-FALSE alternatives, CASE reduces to the default
  * result
@@ -2982,7 +3002,7 @@ eval_const_expressions_mutator(Node *node,
 newarray->multidims = arrayexpr->multidims;
 newarray->location = arrayexpr->location;
 
-if (all_const)
+if (all_const && context->doevil)
 	return (Node *) evaluate_expr((Ex

Re: [HACKERS] Implementing incremental backup

2013-06-22 Thread Andres Freund
On 2013-06-22 15:58:35 +0200, Cédric Villemain wrote:
> > A differential backup resulting from a bunch of WAL between W1 and Wn
> > would help to recover much faster to the time of Wn than replaying all
> > the WALs between W1 and Wn and saves a lot of space.
> > 
> > I was hoping to find some time to dig around this idea, but as the
> > subject rose here, then here are my 2¢!
> 
> something like that maybe :
> ./pg_xlogdump -b \
> ../data/pg_xlog/00010001 \   
> ../data/pg_xlog/00010005| \
> grep 'backup bkp' | awk '{print ($5,$9)}'

Note that it's a bit more complex than that for a number of reasons:
* we don't log full page images for e.g. new heap pages, we just set the
  XLOG_HEAP_INIT_PAGE flag on the record
* there also are XLOG_FPI records
* How do you get a base backup as the basis to apply those to? You need
  it to be recovered exactly to a certain point...

But yes, I think something can be done in the end. I think Heikki's
pg_rewind already has quite a bit of the required logic.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implementing incremental backup

2013-06-22 Thread Cédric Villemain
On Saturday 22 June 2013 01:09:20, Jehan-Guillaume (ioguix) de Rorthais wrote:
> On 20/06/2013 03:25, Tatsuo Ishii wrote:
> >> On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii  
wrote:
>  On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost  
wrote:
> > * Claudio Freire (klaussfre...@gmail.com) wrote:
> [...]
> 
> >> The only bottleneck here, is WAL archiving. This assumes you can
> >> afford WAL archiving at least to a local filesystem, and that the WAL
> >> compressor is able to cope with WAL bandwidth. But I have no reason to
> >> think you'd be able to cope with dirty-map updates anyway if you were
> >> saturating the WAL compressor, as the compressor is more efficient on
> >> amortized cost per transaction than the dirty-map approach.
> > 
> > Thank you for detailed explanation. I will think more about this.
> 
> Just for the record, I was mulling over this idea since a bunch of
> month. I even talked about that with Dimitri Fontaine some weeks ago
> with some beers :)
> 
> My idea came from a customer during a training explaining me the
> difference between differential and incremental backup in Oracle.
> 
> My approach would have been to create a standalone tool (say
> pg_walaggregate) which takes a bunch of WAL from archives and merge them
> in a single big file, keeping only the very last version of each page
> after aggregating all their changes. The resulting file, aggregating all
> the changes from given WAL files would be the "differential backup".
> 
> A differential backup resulting from a bunch of WAL between W1 and Wn
> would help to recover much faster to the time of Wn than replaying all
> the WALs between W1 and Wn and saves a lot of space.
> 
> I was hoping to find some time to dig around this idea, but as the
> subject rose here, then here are my 2¢!

something like that maybe :
./pg_xlogdump -b \
../data/pg_xlog/00010001 \   
../data/pg_xlog/00010005| \
grep 'backup bkp' | awk '{print ($5,$9)}'

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] Unaccent performance

2013-06-22 Thread Andres Freund
On 2013-06-21 22:52:04 +0100, Thom Brown wrote:
> > CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
> >  RETURNS text
> >  LANGUAGE sql
> >  IMMUTABLE
> > AS $function$
> > SELECT
> > replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> > ;
> > $function$
> >

> Another test passing in a string of 10 characters gives the following
> timings:
> 
> unaccent: 240619.395 ms
> myunaccent: 785.505 ms

The reason for that is that unaccent is 'stable' while your function is
'immutable', so the planner recognizes that and computes it only once
since you're always passing the same text string to it.

> Another test inserting long text strings into a text column of a table
> 100,000 times, then updating another column to have that unaccented value
> using both methods:
> 
> unaccent: 3867.306 ms
> myunaccent: 43611.732 ms

Whereas it cannot recognize that in this case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
Simon,

* Simon Riggs (si...@2ndquadrant.com) wrote:
> Previous discussions of Hash Joins have noted that the performance
> decreases when the average number of tuples per bucket increases.

Having the hash table so small that we have hash bucket collisions with
different 32bit hash values is extremely painful, however..

> The correct calculation that would match the objective set out in the
> comment would be
> 
>  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

This looks to be driving the hash table size off of "how many tuples of
this size can I fit into memory?" and ignoring how many
actual rows we have to hash.  Consider a work_mem of 1GB with a small
number of rows to actually hash- say 50.  With a tupsize of 8 bytes,
we'd be creating a hash table sized for some 13M buckets.

> This solves the problem in earlier discussions since we get a lower
> average number of tuples per bucket and yet we also get to keep the
> current NTUP_PER_BUCKET value. Everybody wins.

I agree that this should reduce the number of tuples per bucket, but I
don't think it's doing what you expect and based on what it's doing, it
seems wasteful.

> Comments?

Based on your argument that we want to have a bucket load which is, on
average, the size of NTUP_PER_BUCKET, I have to '-1' on this.  What we
want is to size a table large enough that we never have any true
collisions (cases where different 32-bit hash values end up in the same
bucket) *for all tuples being hashed*, that includes both the side
building the hash table and the side doing the scan.  This should be
done when we can avoid batching- if we don't have enough to avoid
batching while having such a large table, we should consider adjusting
the hash table size down some because, in those cases, we're memory
constrained.

When we have enough memory to avoid batching, we never want to have to
check down through a bucket for a tuple which doesn't exist- we should
simply be able to look in the hash table itself, determine (with a
single fetch) that there are no entries for that hash value, and throw
the tuple away and move on to the next.
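
In code terms, that single-fetch check is roughly the following (a fragment
with illustrative names, not the actual nodeHash.c code):

    int         bucketno = hashvalue % (uint32) hashtable->nbuckets;

    if (hashtable->buckets[bucketno] == NULL)
        return false;       /* no inner tuple can match this outer tuple */

With a big enough bucket array, that one fetch settles almost every
non-matching outer tuple.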

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Andres Freund
On 2013-06-22 22:45:26 +0900, Michael Paquier wrote:
> On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund  
> wrote:
> > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote:
> >> By looking at the comments of RelationGetIndexList:relcache.c,
> >> actually the method of the patch is correct because in the event of a
> >> shared cache invalidation, rd_indexvalid is set to 0 when the index
> >> list is reset, so the index list would get recomputed even in the case
> >> of shared mem invalidation.
> >
> > The problem I see is something else. Consider code like the following:
> >
> > RelationFetchIndexListIfInvalid(toastrel);
> > foreach(lc, toastrel->rd_indexlist)
> >toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);
> >
> > index_open calls relation_open calls LockRelationOid which does:
> > if (res != LOCKACQUIRE_ALREADY_HELD)
> >AcceptInvalidationMessages();
> >
> > So, what might happen is that you open the first index, which accepts an
> > invalidation message which in turn might delete the indexlist. Which
> > means we would likely read invalid memory if there are two indexes.
> And I imagine that you have the same problem even with
> RelationGetIndexList, not only RelationGetIndexListIfInvalid, because
> this would appear as long as you try to open more than 1 index with an
> index list.

No. RelationGetIndexList() returns a copy of the list for exactly that
reason. The danger is not to see an outdated list - we should be
protected by locks against that - but looking at uninitialized or reused
memory.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Michael Paquier
On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund  wrote:
> On 2013-06-22 12:50:52 +0900, Michael Paquier wrote:
>> On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund  
>> wrote:
>> > Hm. Looking at how this is currently used - I am afraid it's not
>> > correct... the reason RelationGetIndexList() returns a copy is that
>> > cache invalidations will throw away that list. And you do index_open()
>> > while iterating over it which will accept invalidation messages.
>> > Mybe it's better to try using RelationGetIndexList directly and measure
>> > whether that has a measurable impact=
>> By looking at the comments of RelationGetIndexList:relcache.c,
>> actually the method of the patch is correct because in the event of a
>> shared cache invalidation, rd_indexvalid is set to 0 when the index
>> list is reset, so the index list would get recomputed even in the case
>> of shared mem invalidation.
>
> The problem I see is something else. Consider code like the following:
>
> RelationFetchIndexListIfInvalid(toastrel);
> foreach(lc, toastrel->rd_indexlist)
>toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);
>
> index_open calls relation_open calls LockRelationOid which does:
> if (res != LOCKACQUIRE_ALREADY_HELD)
>AcceptInvalidationMessages();
>
> So, what might happen is that you open the first index, which accepts an
> invalidation message which in turn might delete the indexlist. Which
> means we would likely read invalid memory if there are two indexes.
And I imagine that you have the same problem even with
RelationGetIndexList, not only RelationGetIndexListIfInvalid, because
this would appear as long as you try to open more than 1 index with an
index list.
--
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Andres Freund
On 2013-06-21 23:19:27 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Fri, Jun 21, 2013 at 5:44 PM, Tom Lane  wrote:
> >> The traditional theory has been that that would be less robust, not
> >> more so.  Child backends are (mostly) able to carry out queries whether
> >> or not the postmaster is around.
> 
> > I think that's the Tom Lane theory.  The Robert Haas theory is that if
> > the postmaster has died, there's no reason to suppose that it hasn't
> > corrupted shared memory on the way down, or that the system isn't
> > otherwise heavily fuxxored in some way.
> 
> Eh?  The postmaster does its level best never to touch shared memory
> (after initialization anyway).

I am not sure that will never happen - but I think the chain of argument
misses the main point. Normally we rely on the postmaster to kill off
all other backends if a backend PANICs or segfaults for all the known
reasons. As soon as there's no postmaster anymore we lose that
capability.
And *that* is scary imo. Especially as I would say the chance of getting
PANICs or segfaults increases if there's no postmaster anymore since we
might reach code branches we otherwise won't.

> >> True, you can't make new connections,
> >> but how does killing the existing children make that better?
> 
> > It allows you to start a new postmaster in a timely fashion, instead
> > of waiting for an idle connection that may not ever terminate without
> > operator intervention.

And it's not always easy to figure out which cluster those backends
belong to if there are multiple postgres instances running as the same user.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Andres Freund
On 2013-06-22 12:50:52 +0900, Michael Paquier wrote:
> On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund  
> wrote:
> > Hm. Looking at how this is currently used - I am afraid it's not
> > correct... the reason RelationGetIndexList() returns a copy is that
> > cache invalidations will throw away that list. And you do index_open()
> > while iterating over it which will accept invalidation messages.
> > Mybe it's better to try using RelationGetIndexList directly and measure
> > whether that has a measurable impact=
> By looking at the comments of RelationGetIndexList:relcache.c,
> actually the method of the patch is correct because in the event of a
> shared cache invalidation, rd_indexvalid is set to 0 when the index
> list is reset, so the index list would get recomputed even in the case
> of shared mem invalidation.

The problem I see is something else. Consider code like the following:

RelationFetchIndexListIfInvalid(toastrel);
foreach(lc, toastrel->rd_indexlist)
   toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);

index_open calls relation_open calls LockRelationOid which does:
if (res != LOCKACQUIRE_ALREADY_HELD)
   AcceptInvalidationMessages();

So, what might happen is that you open the first index, which accepts an
invalidation message which in turn might delete the indexlist. Which
means we would likely read invalid memory if there are two indexes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hardware donation

2013-06-22 Thread Simon Riggs
On 21 June 2013 20:03, Jim Nasby  wrote:

> Who can be point of contact from the community to arrange shipping, etc?

Do they need to be shipped? Can we just leave them where they are and
arrange access and power charges to be passed to SPI? Sounds like it
would be cheaper and easier to leave them where they are, and they
won't get damaged in transit. Of course, that may not be possible.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Cédric Villemain
On Monday 17 June 2013 00:02:21, Fabien COELHO wrote:
> >> What activity would you expect?
> > 
> > A patch which applies cleanly to git HEAD.  This one doesn't for me,
> > although I'm not really sure why, I don't see any obvious conflicts.
> 
> Please find attached a freshly generated patch against current master.

* Submission review: 
The patch is in unified format and applies on HEAD.
The patch contains documentation; however, I believe 'AS IMPLICIT' is a 
PostgreSQL extension with special behavior, while 'AS EXPLICIT' respects the 
standard, except that PostgreSQL adds the spelling 'AS EXPLICIT' itself 
(explicit is also the default in the standard). So maybe it is possible to 
rephrase this piece:

@@ -411,8 +427,8 @@ CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS 
ASSIGNMENT;
SQL standard,
except that SQL does not make provisions for binary-coercible
types or extra arguments to implementation functions.
-   AS IMPLICIT is a PostgreSQL
-   extension, too.
+   AS IMPLICIT and AS EXPLICIT are
+   a PostgreSQL extension, too.
   
  

After digging in the archive and the CF: original request is at  
https://commitfest.postgresql.org/action/patch_view?id=563

* Usability review 
** Does the patch actually implement that? yes
** Do we want that? 
Back in 2012 Tom raised arguments against it, or at least did not give a 
clear +1. The patch adds nothing but a more explicit creation statement; it 
has gone untouched for 2 years without much interest, so it is surely not 
something really important for PostgreSQL users. However, we already have 
non-standard words for CREATE CAST, and this new one is not very intrusive.

** Does it follow SQL spec, or the community-agreed behavior? 
It does not follow SQL spec.

** Does it include pg_dump support (if applicable)?
No, but it is probably not interesting to add that to the pg_dump output: it 
would increase incompatibility with the SQL spec for no gain. The result is 
that the patch only allows CREATE CAST ... AS EXPLICIT without error; pg_dump 
cannot tell whether the CAST was created with the default or an 'explicit 
default'...
 
** Are there dangers? 
It seems no.

* Feature test 
** Does the feature work as advertised? Yes
** Are there corner cases the author has failed to consider?
I think no, but my skills with the parser are limited (gram.y, ...)
** Are there any assertion failures or crashes?
no

* Performance review: not relevant.

* Coding review 
Patch does not pass test:
./check_keywords.pl gram.y ../../../src/include/parser/kwlist.h

I had to update the unreserved keyword list in order to be able to build 
postgresql.

** Does it follow the project coding guidelines? yes
** Are there portability issues? no (only for SQL)
** Will it work on Windows/BSD etc?  yes
** Are the comments sufficient and accurate? Yes
** Does it do what it says, correctly? Yes
** Does it produce compiler warnings? It doesn't build as-is; the patch needs an update (see the keyword-list issue above)
** Can you make it crash? no

* Architecture review
** Is everything done in a way that fits together coherently with other 
features/modules? Yes
** Are there interdependencies that can cause problems? No.

I flag it 'returned with feedback'; please update the patch so it builds. 
Everything else is OK.
-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


[HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
Previous discussions of Hash Joins have noted that the performance
decreases when the average number of tuples per bucket increases.
O(N^2) effects are seen.

We've argued this about many different ways, yet all of those
discussions have centred around the constant NTUP_PER_BUCKET. I
believe that was a subtle mistake and I have a proposal.

The code in ExecChooseHashTableSize() line 460 says
 /*
  * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
  * memory is filled.
...

but the calculation then sets the number of buckets like this

 dbuckets = ceil(ntuples / NTUP_PER_BUCKET);

**This doesn't match the comment.** If we underestimate the number
of tuples and go on to fill the available memory, we then end up with
an average number of tuples per bucket higher than NTUP_PER_BUCKET, a
notational confusion that has been skewing the discussion.

The correct calculation that would match the objective set out in the
comment would be

 dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

Which leads us to a much more useful value of dbuckets in the case
where using ntuples occupies much less space than is available. This
value is always the same as or higher than before, because of the if-test
that surrounds it.
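
To put numbers on it (illustrative values only, not measurements): with
work_mem = 256MB, tupsize = 64 bytes, NTUP_PER_BUCKET = 10 and an estimate
of 100000 tuples,

  dbuckets = ceil(100000 / 10)       =  10000 buckets   (current)
  dbuckets = (268435456 / 64) / 10   = 419430 buckets   (proposed)

so if the estimate is badly low and the build side really fills work_mem
(~4.2M tuples), the current formula leaves ~420 tuples per bucket while the
proposed one stays at the target of 10.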

Given my experience that on larger tables we end up underestimating
ndistinct by 10-100-1000 times, I don't think this change is
unwarranted.

This solves the problem in earlier discussions since we get a lower
average number of tuples per bucket and yet we also get to keep the
current NTUP_PER_BUCKET value. Everybody wins.

Comments?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


rethink_num_hash_buckets.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
> On 22 June 2013 08:46, Stephen Frost  wrote:
> >>The next limit faced by sorts is
> >> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> >> 150 GiB when sorting int4.
> >
> > That's frustratingly small. :(
> 
> But that has nothing to do with this patch, right? And is easily fixed, yes?

I don't know about 'easily fixed' (consider supporting a HashJoin of >2B
records) but I do agree that dealing with places in the code where we are
using an int4 to keep track of the number of objects in memory is outside
the scope of this patch.

Hopefully we are properly range-checking and limiting ourselves to only
what a given node can support, rather than depending solely on MaxAllocSize
to keep us from overflowing some int4 which we're using as an array index
or as a count of how many objects we currently have in memory. We'll want
to consider carefully what happens with such large sets as we add support
for these Huge allocations into nodes (along with the recent change to
allow 1TB work_mem, which may encourage users with systems large enough to
actually try to set it that high... :)

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-22 Thread Heikki Linnakangas

On 21.06.2013 21:55, Jeff Janes wrote:

I think I'm getting an undetected deadlock between the checkpointer and a
user process running a TRUNCATE command.

This is the checkpointer:

#0  0x003a73eeaf37 in semop () from /lib64/libc.so.6
#1  0x005ff847 in PGSemaphoreLock (sema=0x7f8c0a4eb730,
interruptOK=0 '\000') at pg_sema.c:415
#2  0x004b0abf in WaitOnSlot (upto=416178159648) at xlog.c:1775
#3  WaitXLogInsertionsToFinish (upto=416178159648) at xlog.c:2086
#4  0x004b657a in CopyXLogRecordToWAL (write_len=32, isLogSwitch=1
'\001', rdata=0x0, StartPos=, EndPos=416192397312)
 at xlog.c:1389
#5  0x004b6fb2 in XLogInsert (rmid=0 '\000', info=, rdata=0x7fff0020) at xlog.c:1209
#6  0x004b7644 in RequestXLogSwitch () at xlog.c:8748


Hmm, it looks like the xlog-switch is trying to wait for itself to 
finish. The concurrent TRUNCATE is just being blocked behind the 
xlog-switch, which is stuck on itself.


I wasn't able to reproduce exactly that, but I got a PANIC by running 
pgbench and concurrently doing "select pg_switch_xlog()" many times in psql.


Attached is a new version that fixes at least the problem I saw. Not 
sure if it fixes what you saw, but it's worth a try. How easily can you 
reproduce that?



This is using the same testing harness as in the last round of this patch.


This one? 
http://www.postgresql.org/message-id/CAMkU=1xoa6fdyoj_4fmlqpiczr1v9gp7clnxjdhu+iggqb6...@mail.gmail.com



Is there a way for me to dump the list of held/waiting lwlocks from gdb?


You can print out the held_lwlocks array. Or to make it more friendly, 
write a function that prints it out and call that from gdb. There's no 
easy way to print out who's waiting for what that I know of.
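
Something along these lines would do, assuming the static held_lwlocks[] /
num_held_lwlocks variables in lwlock.c of that era (untested sketch; compile
it into lwlock.c and run "call print_held_lwlocks()" from gdb):

    void
    print_held_lwlocks(void)
    {
        int         i;

        for (i = 0; i < num_held_lwlocks; i++)
            elog(LOG, "held lwlock %d: id %d", i, (int) held_lwlocks[i]);
    }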


Thanks for the testing!

- Heikki


xloginsert-scale-24.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Simon Riggs
On 22 June 2013 08:46, Stephen Frost  wrote:

>>The next limit faced by sorts is
>> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
>> 150 GiB when sorting int4.
>
> That's frustratingly small. :(
>

But that has nothing to do with this patch, right? And is easily fixed, yes?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Simon Riggs
On 13 May 2013 15:26, Noah Misch  wrote:
> A memory chunk allocated through the existing palloc.h interfaces is limited
> to MaxAllocSize (~1 GiB).  This is best for most callers; SET_VARSIZE() need
> not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
> need not check for overflow.  However, a handful of callers are quite happy to
> navigate those hazards in exchange for the ability to allocate a larger chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
> a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother recording
> whether they were allocated as huge; one can start with palloc() and then
> repalloc_huge() to grow the value.

I like the design and think it's workable.
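
For reference, a minimal usage sketch, assuming the interface names from the
patch (SortTuple here just stands for whatever element type the caller
stores, the sizes are made up, and the caller now owns the overflow checks
that MaxAllocSize used to provide for free):

    Size        nelems = 500000000;     /* ~0.5 billion entries */
    SortTuple  *tuples;

    tuples = (SortTuple *) MemoryContextAllocHuge(CurrentMemoryContext,
                                                  nelems * sizeof(SortTuple));
    /* ... later, when more room is needed ... */
    tuples = (SortTuple *) repalloc_huge(tuples,
                                         2 * nelems * sizeof(SortTuple));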

I'm concerned that people will accidentally use MaxAllocSize. Can we
put in a runtime warning if someone tests AllocSizeIsValid() with a
larger value?

>  To demonstrate, I put this to use in
> tuplesort.c; the patch also updates tuplestore.c to keep them similar.  Here's
> the trace_sort from building the pgbench_accounts primary key at scale factor
> 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
>
> LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 
> 391.21 sec
>
> Compare:
>
> LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec 
> elapsed 1146.05 sec

Cool.

I'd like to put in an explicit test for this somewhere. Obviously not
part of normal regression, but somewhere, at least, so we have
automated testing that we all agree on. (yes, I know we don't have
that for replication/recovery yet, but that's why I don't want to
repeat that mistake).

> This was made easier by tuplesort growth algorithm improvements in commit
> 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
> (TODO item "Allow sorts to use more available memory"), and Tom floated the
> idea[1] behind the approach I've used.  The next limit faced by sorts is
> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> 150 GiB when sorting int4.
>
> I have not added variants like palloc_huge() and palloc0_huge(), and I have
> not added to the frontend palloc.h interface.  There's no particular barrier
> to doing any of that.  I don't expect more than a dozen or so callers, so most
> of the variations might go unused.
>
> The comment at MaxAllocSize said that aset.c expects doubling the size of an
> arbitrary allocation to never overflow, but I couldn't find the code in
> question.  AllocSetAlloc() does double sizes of blocks used to aggregate small
> allocations, so maxBlockSize had better stay under SIZE_MAX/2.  Nonetheless,
> that expectation does apply to dozens of repalloc() users outside aset.c, and
> I preserved it for repalloc_huge().  64-bit builds will never notice, and I
> won't cry for the resulting 2 GiB limit on 32-bit.

Agreed. Can we document this for the relevant parameters?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Stephen Frost
Noah,

* Noah Misch (n...@leadboat.com) wrote:
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
> a higher MaxAllocHugeSize limit of SIZE_MAX/2.  

Nice!  I've complained about this limit a few different times and just
never got around to addressing it.

> This was made easier by tuplesort growth algorithm improvements in commit
> 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
> (TODO item "Allow sorts to use more available memory"), and Tom floated the
> idea[1] behind the approach I've used.  The next limit faced by sorts is
> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
> 150 GiB when sorting int4.

That's frustratingly small. :(

[...]
> --- 1024,1041 
>* new array elements even if no other memory were currently 
> used.
>*
>* We do the arithmetic in float8, because otherwise the 
> product of
> !  * memtupsize and allowedMem could overflow.  Any inaccuracy in 
> the
> !  * result should be insignificant; but even if we computed a
> !  * completely insane result, the checks below will prevent 
> anything
> !  * really bad from happening.
>*/
>   double  grow_ratio;
>   
>   grow_ratio = (double) state->allowedMem / (double) memNowUsed;
> ! if (memtupsize * grow_ratio < INT_MAX)
> ! newmemtupsize = (int) (memtupsize * grow_ratio);
> ! else
> ! newmemtupsize = INT_MAX;
>   
>   /* We won't make any further enlargement attempts */
>   state->growmemtuples = false;

I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
everything can handle that cleanly..?  I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:

/* illustration of the hazard: x += 2 overflows past INT_MAX */
int x = INT_MAX - 1;
for (; x < INT_MAX; x += 2) {
    myarray[x] = 5;
}

Also, could this be used to support hashing larger sets..?  If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than MaxAllocSize since, with 8-byte bucket pointers, that'd only
allow around 134M buckets.

Haven't had a chance to review the rest, but +1 on the overall idea. :)

Thanks!

Stephen


signature.asc
Description: Digital signature