Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On 23 June 2013 03:16, Stephen Frost wrote: > Still doesn't really address the issue of dups though. Checking for duplicates in all cases would be wasteful, since often we are joining to the PK of a smaller table. If duplicates are possible at all for a join, then it would make sense to build the hash table more carefully to remove dupes. I think we should treat that as a separate issue. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] wrong state of patch in CF
Patch (https://commitfest.postgresql.org/action/patch_view?id=1161) is already committed by Commit http://git.postgresql.org/pg/commitdiff/b23160889c963dfe23d8cf1f9be64fb3c535a2d6 It should be marked as Committed in CF. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: "Robert Haas" On Fri, Jun 21, 2013 at 10:02 PM, MauMau wrote: I'm comfortable with 5 seconds. We are talking about the interval between sending SIGQUIT to the children and then sending SIGKILL to them. In most situations, the backends should terminate immediately. However, as I said a few months ago, ereport() call in quickdie() can deadlock indefinitely. This is a PostgreSQL's bug. So let's fix that bug. Then we don't need this. tHERE ARE TWO WAYS TO FIX THAT BUG. yOU ARE SUGGESTING 1 OF THE FOLLOWING TWO METHODS, AREN'T YOU? 1. (rOBERT-SAN'S IDEA) uPON RECEIPT OF IMMEDIATE SHUTDOWN REQUEST, POSTMASTER SENDS sigkill TO ITS CHILDREN. 2. (tOM-SAN'S IDEA) uPON RECEIPT OF IMMEDIATE SHUTDOWN REQUEST, POSTMASTER FIRST SENDS sigquit TO ITS CHILDREN, WAIT A WHILE FOR THEM TO TERMINATE, THEN SENDS sigkill TO THEM. aT FIRST i PROPOSED 1. tHEN tOM SAN SUGGESTED 2 SO THAT THE SERVER IS AS FRIENDLY TO THE CLIENTS AS NOW BY NOTIFYING THEM OF THE SERVER SHUTDOWN. i WAS CONVINCED BY THAT IDEA AND CHOSE 2. bASICALLY, i THINK BOTH IDEAS ARE RIGHT. tHEY CAN SOLVE THE ORIGINAL PROBLEM. hOWEVER, RE-CONSIDERING THE MEANING OF "IMMEDIATE" SHUTDOWN, i FEEL 1 IS A BIT BETTER, BECAUSE TRYING TO DO SOMETHING IN QUICKDIE() OR SOMEWHERE DOES NOT MATCH THE IDEA OF "IMMEDIATE". wE MAY NOT HAVE TO BE FRIENDLY TO THE CLIENTS AT THE IMMEDIATE SHUTDOWN. tHE CODE GETS MUCH SIMPLER. iN ADDITION, IT MAY BE BETTER THAT WE SIMILARLY SEND sigkill IN BACKEND CRASH (fATALeRROR) CASE, ELIMINATE THE USE OF sigquit AND REMOVE QUICKDIE() AND OTHER sigquit HANDLERS. wHAT DO YOU THINK? hOW SHOULD WE MAKE CONSENSUS AND PROCEED? rEGARDS mAUmAU -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH] Revive line type
Hi, Test results are as follows:

Contents & Purpose
This patch finishes the line type and the related functions that are listed in the catalogs and documentation but were never implemented. There are no other new features in this patch. Regression test cases are included, and the documentation is updated as well. The regression tests all pass, but several problems were found while doing some simple manual tests.

The updated documentation says that the points used in the output are not necessarily the points used on input. I understand that, as long as they lie on the same line. But when the input data represents a horizontal or vertical line, the output is not the same line; it is another line parallel to it. For example:

postgres=# select line('1,3,2,3');
      line
-----------------
 [(0,-3),(1,-3)]
(1 row)

postgres=# select line('1,3,1,6');
      line
-----------------
 [(-1,0),(-1,1)]
(1 row)

In addition, when a line coincides with a coordinate axis, the output contains -0; I do not know whether that is appropriate.

postgres=# select line('0,1,0,5');
      line
-----------------
 [(-0,0),(-0,1)]
(1 row)

A negative value appears when <-> is used to calculate the distance between two parallel lines.

postgres=# select line('1,1,2,1') <-> line('-1,-1,-2,-1');
 ?column?
----------
       -2

postgres=# select lseg('1,1,2,1') <-> line('-1,-1,-2,-1');
 ?column?
----------
       -2
(1 row)

The same thing happens when calculating the distance between a point and a line.

postgres=# select point('-1,1') <-> line('-3,0,-4,0');
 ?column?
----------
       -1
(1 row)

Shouldn't the distance be a positive number? Other functions seem to work properly.

Performance
Because these functions are implemented here for the first time, there is no earlier implementation to compare performance against.

Conclusion
This patch makes the line type work, but there are some bugs. Additionally, the function "close_sl" is not implemented.

With Regards, Rui
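For reference, the textbook distance from a point to a line Ax + By + C = 0 takes an absolute value, so the negative results above suggest a missing fabs() somewhere in the new operators. A minimal sketch of the usual formula (illustration only, not the patch's code):

#include <math.h>

/* distance from point (px, py) to the line Ax + By + C = 0 */
static double
point_line_distance(double A, double B, double C, double px, double py)
{
    return fabs(A * px + B * py + C) / hypot(A, B);
}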
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: "Alvaro Herrera" MauMau escribió: I thought of adding some new state of pmState for some reason (that might be the same as your idea). But I refrained from doing that, because pmState has already many states. I was afraid adding a new pmState value for this bug fix would complicate the state management (e.g. state transition in PostmasterStateMachine()). In addition, I felt PM_WAIT_BACKENDS was appropriate because postmaster is actually waiting for backends to terminate after sending SIGQUIT. The state name is natural. Well, a natural state name is of no use if we're using it to indicate two different states, which I think is what would be happening here. Basically, with your patch, PM_WAIT_BACKENDS means two different things depending on AbortStartTime. i PROBABLY GOT YOUR FEELING. yOU AREN'T FEELING COMFORTABLE ABOUT USING THE TIME VARIABLE aBORTsTARTtIME AS A STATE VARIABLE TO CHANGE POSTMASTER'S BEHAVIOR, ARE YOU? tHAT MAY BE RIGHT, BUT i'M NOT SURE WELL... BECAUSE IN pm_wait_backends, AS THE NAME INDICATES, POSTMASTER IS INDEED WAITING FOR THE BACKENDS TO TERMINATE REGARDLESS OF aBORTsTARTtIME. aPART FROM THIS, POSTMASTER SEEMS TO CHANGE ITS BEHAVIOR IN THE SAME PMsTATE DEPENDING ON OTHER VARIABLES SUCH AS sHUTDOWN AND fATALeRROR. i'M NOT CONFIDENT IN WHICH IS BETTER, SO i WON'T OBJECT IF THE REVIEWER OR COMMITTER MODIFIES THE CODE. rEGARDS mAUmAU -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Saturday, June 22, 2013, Simon Riggs wrote: > On 22 June 2013 21:40, Stephen Frost > > wrote: > > > I'm actually not a huge fan of this as it's certainly not cheap to do. > If it > > can be shown to be better than an improved heuristic then perhaps it > would > > work but I'm not convinced. > > We need two heuristics, it would seem: > > * an initial heuristic to overestimate the number of buckets when we > have sufficient memory to do so > > * a heuristic to determine whether it is cheaper to rebuild a dense > hash table into a better one. > > Although I like Heikki's rebuild approach we can't do this every x2 > overstretch. Given large underestimates exist we'll end up rehashing > 5-12 times, which seems bad. Better to let the hash table build and > then re-hash once, it we can see it will be useful. > > OK? > I've been thinking a bit more on your notion of simply using as much memory as we're permitted, but maybe adjust it down based on how big the input to the hash table is (which I think we have better stats on, and even if we don't, we could easily keep track of how many tuples we've seen and consider rehashing as we go). Still doesn't really address the issue of dups though. It may still be much larger than it should be if there's a lot of duplicates in the input that hash into a much smaller set of buckets. Will think on it more. Thanks, Stephen
Re: [HACKERS] Patch for fast gin cache performance improvement
Looks like my community login is still not working. No rush, just wanted to let you know. Thanks! Ian On Tue, Jun 18, 2013 at 11:41 AM, Ian Link wrote: > > No worries! I'll just wait until it's up again. > Thanks > Ian > > Kevin Grittner > Tuesday, June 18, 2013 9:15 AM > > Oops -- we seem to have a problem with new community logins at the > moment, which will hopefully be straightened out soon. You might > want to wait a few days if you don't already have a login. > > -- > Kevin Grittner > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > Kevin Grittner > Tuesday, June 18, 2013 9:09 AM > > Ian Link wrote: > > > This patch contains a performance improvement for the fast gin > cache. > > Our test queries improve from about 0.9 ms to 0.030 ms. > > Impressive! > > > Thanks for reading and considering this patch! > > > Congratulations on your first PostgreSQL patch! To get it > scheduled for review, please add it to this page: > https://commitfest.postgresql.org/action/commitfest_view/open > > > You will need to get a community login (if you don't already have > one), but that is a quick and painless process. Choose an > appropriate topic (like "Performance") and reference the message ID > of the email to which you attached the patch. Don't worry about > the fields for reviewers, committer, or date closed. > > Sorry for the administrative overhead, but without it things can > fall through the cracks. You can find an overview of the review > process with links to more detail here: > > http://wiki.postgresql.org/wiki/CommitFest > > Thanks for contributing! > > > Ian Link > Monday, June 17, 2013 9:42 PM > * > > This patch contains a performance improvement for the fast gin cache. As > you may know, the performance of the fast gin cache decreases with its > size. Currently, the size of the fast gin cache is tied to work_mem. The > size of work_mem can often be quite high. The large size of work_mem is > inappropriate for the fast gin cache size. Therefore, we created a separate > cache size called gin_fast_limit. This global variable controls the size of > the fast gin cache, independently of work_mem. Currently, the default > gin_fast_limit is set to 128kB. However, that value could need tweaking. > 64kB may work better, but it's hard to say with only my single machine to > test on. > > On my machine, this patch results in a nice speed up. Our test queries > improve from about 0.9 ms to 0.030 ms. Please feel free to use the test > case yourself: it should be attached. I can look into additional test cases > (tsvectors) if anyone is interested. > > In addition to the global limit, we have provided a per-index limit: > fast_cache_size. This per-index limit begins at -1, which means that it is > disabled. If the user does not specify a per-index limit, the index will > simply use the global limit. > > I would like to thank Andrew Gierth for all his help on this patch. As > this is my first patch he was extremely helpful. The idea for this > performance improvement was entirely his. I just did the implementation. > Thanks for reading and considering this patch!* > > > Ian Link > > <><>
Re: [HACKERS] Support for RANGE ... PRECEDING windows in OVER
Thanks Craig! That definitely does help. I probably still have some questions but I think I will read through the rest of the code before asking. Thanks again! Ian > Craig Ringer > Friday, June 21, 2013 8:41 PM > > On 06/22/2013 03:30 AM, ian link wrote: >> >> Forgive my ignorance, but I don't entirely understand the problem. What >> does '+' and '-' refer to exactly? > > Consider "RANGE 4.5 PRECEDING'. > > You need to be able to test whether, for the current row 'b', any given > row 'a' is within the range (b - 4.5) < a <= b . Not 100% sure about the > < vs <= boundaries, but that's irrelevant for the example. > > To test that, you have to be able to do two things: you have to be able > to test whether one value is greater than another, and you have to be > able to add or subtract a constant from one of the values. > > Right now, the b-tree access method provides information on the ordering > operators < <= = > >= <> , which provides half the answer. But these > don't give any concept of *distance* - you can test ordinality but not > cardinality. > > To implement the "different by 4.5" part, you have to be able to add 4.5 > to one value or subtract it from the other. > > The obvious way to do that is to look up the function that implements > the '+' or '-' operator, and do: > > ((OPERATOR(+))(a, 4.5)) > b AND (a <= b) > > or > > ((OPERATOR(-))(b, 4.5)) < a AND (a <= b); > > The problem outlined by Tom in prior discussion about this is that > PostgreSQL tries really hard not to assume that particular operator > names mean particular things. Rather than "knowing" that "+" is always > "an operator that adds two values together; is transitive, symmetric and > reflexive", PostgreSQL requires that you define an *operator class* that > names the operator that has those properties. > > Or at least, it does for less-than, less-than-or-equals, equals, > greater-than-or-equals, greater-than, and not-equals as part of the > b-tree operator class, which *usually* defines these operators as < <= = >> >> = > <>, but you could use any operator names you wanted if you really > > liked. > > Right now (as far as I know) there's no operator class that lets you > identify operators for addition and subtraction in a similar way. So > it's necessary to either add such an operator class (in which case > support has to be added for it for every type), extend the existing > b-tree operator class to provide the info, or blindly assume that "+" > and "-" are always addition and subtraction. > > For an example of why such assumptions are a bad idea, consider matrix > multiplication. Normally, "a * b" = "b * a", but this isn't true for > multiplication of matrices. Similarly, if someone defined a "+" operator > as an alias for string concatenation (||), we'd be totally wrong to > assume we could use that for doing range-offset windowing. > > So. Yeah. Operator classes required, unless we're going to change the > rules and make certain operator names "special" in PostgreSQL, so that > if you implement them they *must* have certain properties. This seems > like a pretty poor reason to add such a big change. > > I hope this explanation (a) is actually correct and (b) is helpful. > > ian link > Friday, June 21, 2013 12:30 PM > Forgive my ignorance, but I don't entirely understand the problem. What does '+' and '-' refer to exactly? > Thanks! > > > > Hitoshi Harada > Friday, June 21, 2013 4:35 AM > > > On 06/22/2013 03:30 AM, ian link wrote: > Forgive my ignorance, but I don't entirely understand the problem. 
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Saturday, June 22, 2013, Heikki Linnakangas wrote: > On 22.06.2013 19:19, Simon Riggs wrote: > >> So I think that (2) is the best route: Given that we know with much >> better certainty the number of rows in the scanned-relation, we should >> be able to examine our hash table after it has been built and decide >> whether it would be cheaper to rebuild the hash table with the right >> number of buckets, or continue processing with what we have now. Which >> is roughly what Heikki proposed already, in January. >> > > Back in January, I wrote a quick patch to experiment with rehashing when > the hash table becomes too full. It was too late to make it into 9.3 so I > didn't pursue it further back then, but IIRC it worked. If we have the > capability to rehash, the accuracy of the initial guess becomes much less > important. > What we're hashing isn't going to change mid-way through or be updated after we've started doing lookups against it. Why not simply scan and queue the data and then build the hash table right the first time? Also, this patch doesn't appear to address dups and therefore would rehash unnecessarily. There's no point rehashing into more buckets if the buckets are only deep due to lots of dups. Figuring out how many distinct values there are, in order to build the best hash table, is actually pretty expensive compared to how quickly we can build the table today. Lastly, this still encourages collisions due to too few buckets. If we would simply start with more buckets outright we'd reduce the need to rehash.. Thanks, Stephen
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On 22 June 2013 21:40, Stephen Frost wrote: > I'm actually not a huge fan of this as it's certainly not cheap to do. If it > can be shown to be better than an improved heuristic then perhaps it would > work but I'm not convinced. We need two heuristics, it would seem: * an initial heuristic to overestimate the number of buckets when we have sufficient memory to do so * a heuristic to determine whether it is cheaper to rebuild a dense hash table into a better one. Although I like Heikki's rebuild approach we can't do this at every x2 overstretch. Given that large underestimates exist we'll end up rehashing 5-12 times, which seems bad. Better to let the hash table build and then re-hash once, if we can see it will be useful. OK? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
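As a rough illustration of that "re-hash once if it will be useful" test, the decision could compare the one-off cost of moving every tuple against the expected savings on the probe side. This sketch is hypothetical (the function name and cost model are not from any patch); NTUP_PER_BUCKET is the existing nodeHash.c constant:

#define NTUP_PER_BUCKET 10      /* current value in nodeHash.c */

/* Decide, once the inner side is fully hashed, whether one rebuild pays off. */
static bool
worth_rebuilding_once(double ntuples, int nbuckets, double expected_probes)
{
    double tuples_per_bucket = ntuples / nbuckets;

    /* expected chain length walked per probe if we keep the current table */
    double keep_cost = expected_probes * tuples_per_bucket;

    /* one-off move of every tuple, then roughly NTUP_PER_BUCKET per probe */
    double rebuild_cost = ntuples + expected_probes * NTUP_PER_BUCKET;

    return rebuild_cost < keep_cost;
}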
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On 22.06.2013 19:19, Simon Riggs wrote: So I think that (2) is the best route: Given that we know with much better certainty the number of rows in the scanned-relation, we should be able to examine our hash table after it has been built and decide whether it would be cheaper to rebuild the hash table with the right number of buckets, or continue processing with what we have now. Which is roughly what Heikki proposed already, in January. Back in January, I wrote a quick patch to experiment with rehashing when the hash table becomes too full. It was too late to make it into 9.3 so I didn't pursue it further back then, but IIRC it worked. If we have the capability to rehash, the accuracy of the initial guess becomes much less important. - Heikki diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 6a2f236..32aebb9 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -223,6 +223,60 @@ ExecEndHash(HashState *node) ExecEndNode(outerPlan); } +static void +Rehash(HashJoinTable hashtable) +{ + int nbuckets_old = hashtable->nbuckets_inmem; + int nbuckets_new; + uint32 mask; + int i; + + if (nbuckets_old > (1<<30)) + return; + + nbuckets_new = nbuckets_old * 2; + hashtable->buckets = (HashJoinTuple *) + repalloc(hashtable->buckets, nbuckets_new * sizeof(HashJoinTuple)); + memset(((char *) hashtable->buckets) + nbuckets_old * sizeof(HashJoinTuple), + 0, nbuckets_old * sizeof(HashJoinTuple)); + hashtable->nbuckets_inmem = nbuckets_new; + + mask = nbuckets_old; + + for (i = 0; i < nbuckets_old; i++) + { + int newbucketno = i + nbuckets_old; + HashJoinTuple prevtuple; + HashJoinTuple tuple; + + prevtuple = NULL; + tuple = hashtable->buckets[i]; + + while (tuple != NULL) + { + /* save link in case we delete */ + HashJoinTuple nexttuple = tuple->next; + + if ((tuple->hashvalue & mask) != 0) + { +/* move to the new bucket */ +tuple->next = hashtable->buckets[newbucketno]; +hashtable->buckets[newbucketno] = tuple; + +if (prevtuple) + prevtuple->next = nexttuple; +else + hashtable->buckets[i] = nexttuple; + } + else + { +/* keep in this bucket */ +prevtuple = tuple; + } + tuple = nexttuple; + } + } +} /* * ExecHashTableCreate @@ -271,6 +325,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) */ hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData)); hashtable->nbuckets = nbuckets; + hashtable->nbuckets_inmem = nbuckets; hashtable->log2_nbuckets = log2_nbuckets; hashtable->buckets = NULL; hashtable->keepNulls = keepNulls; @@ -285,6 +340,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) hashtable->nbatch_outstart = nbatch; hashtable->growEnabled = true; hashtable->totalTuples = 0; + hashtable->inmemTuples = 0; hashtable->innerBatchFile = NULL; hashtable->outerBatchFile = NULL; hashtable->spaceUsed = 0; @@ -612,7 +668,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable) */ ninmemory = nfreed = 0; - for (i = 0; i < hashtable->nbuckets; i++) + for (i = 0; i < hashtable->nbuckets_inmem; i++) { HashJoinTuple prevtuple; HashJoinTuple tuple; @@ -734,6 +790,10 @@ ExecHashTableInsert(HashJoinTable hashtable, hashTuple->next = hashtable->buckets[bucketno]; hashtable->buckets[bucketno] = hashTuple; + hashtable->inmemTuples += 1; + if (hashtable->inmemTuples > hashtable->nbuckets_inmem * NTUP_PER_BUCKET * 2) + Rehash(hashtable); + /* Account for space used, and back off if we've used too much */ hashtable->spaceUsed += hashTupleSize; if (hashtable->spaceUsed > hashtable->spacePeak) @@ -873,18 +933,18 @@ 
ExecHashGetBucketAndBatch(HashJoinTable hashtable, int *bucketno, int *batchno) { - uint32 nbuckets = (uint32) hashtable->nbuckets; + uint32 nbuckets_inmem = (uint32) hashtable->nbuckets_inmem; uint32 nbatch = (uint32) hashtable->nbatch; if (nbatch > 1) { /* we can do MOD by masking, DIV by shifting */ - *bucketno = hashvalue & (nbuckets - 1); + *bucketno = hashvalue & (nbuckets_inmem - 1); *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1); } else { - *bucketno = hashvalue & (nbuckets - 1); + *bucketno = hashvalue & (nbuckets_inmem - 1); *batchno = 0; } } @@ -995,7 +1055,7 @@ ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext) */ if (hashTuple != NULL) hashTuple = hashTuple->next; - else if (hjstate->hj_CurBucketNo < hashtable->nbuckets) + else if (hjstate->hj_CurBucketNo < hashtable->nbuckets_inmem) { hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo]; hjstate->hj_CurBucketNo++; @@ -1052,7 +1112,7 @@ void ExecHashTableReset(HashJoinTable hashtable) { MemoryContext oldcxt; - int nbuckets = hashtable->nbuckets; + int nbuckets_inmem = hashtable->nbuckets_inmem; /* * Release
Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)
Hello Mitsumasa, Thanks for the review.

* 2. Output format of the result, for readability:
5.0 s [thread 1]: tps = 1015.576032, AverageLatency(ms) = 0.000984663
5.0 s [thread 0]: tps = 1032.580794, AverageLatency(ms) = 0.000968447
10.0 s [thread 0]: tps = 1129.591189, AverageLatency(ms) = 0.000885276
10.0 s [thread 1]: tps = 1126.267776, AverageLatency(ms) = 0.000887888
However, which output format (design) is most interesting differs from person to person :-). If you like another format, fix it.

I think that your suggestion is too verbose, and as far as automation is concerned I like "cut -f 2" unix filtering and other gnuplot processing... but I see your point and it is a matter of taste. I'll try to propose something in between, if I can.

* 3. The thread name in the output format is not necessary. I cannot understand why the thread name is displayed in each progress line; I think it is not needed. I hope the output can stay simple even with a lot of threads. My image is:
5.0 s: tps = 2030.576032, AverageLatency(ms) = 0.000984663
10.0 s: tps = 2250.591189, AverageLatency(ms) = 0.000885276
This output format is simpler and more intuitive. If you need the result for each thread, please tell us the reason.

I agree that it would be better, but only a thread has access to its own data if it must work with the "fork" pthread emulation, so each thread has to do its own report... If the "fork" emulation were removed and only real threads were used, it would be much better, and one thread would be able to report for everyone. The alternative is a feature which does not work with fork emulation.

* 4. Slipping of the progress time. When I executed this patch over a long time, I found the progress time slipping. This problem image is here.

Yep. I must change the test to align on the overall start time. I'll submit a new patch later. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Saturday, June 22, 2013, Simon Riggs wrote: > On 22 June 2013 14:48, Stephen Frost > > wrote: > > > Based on your argument that we want to have a bucket load which is, on > > average, the size of NTUP_PER_BUCKET, I have to '-1' on this. > > That is not my argument. I am pointing out that the comments claim the > code does that, and yet the code does not. To be honest, I don't read that comment quite the same way as you do. Tom's wish for an average tuples per bucket of 10 may be connected to > that error; I can't say. But we definitely don't end up with an > average of 10 in many cases, which is the basis for previous poor > performance reports. I don't believe Tom wishes for an average of 10.. That's just what NTUP_PER_BUCKET has always been set to. The bottom line is that you're hoping for perfection. If we knew > exactly how many tuples were in the hash table then we could size it > precisely for the hash join we are about to perform. We don't know > that, so we can't size it precisely. Worse than that is that we know > we badly estimate the number of buckets on a regular basis. I'm actually not trying for perfection. What I think we should start with is to at least not shoot ourselves in the foot by cutting the bucket size to one-tenth of our distinct tuple estimate and virtually guaranteeing lots of true collisions which are expensive. > That gives us three choices: if we have a hash table fixed without > prior knowledge then we either (0) we continue to underallocate them > in many cases or (1) potentially overallocate buckets. Or (2) we learn > to cope better with the variation in the number of entries. If we rule > out the first two we have only the last one remaining. > > (1) will always be a heuristic, and as you point out could itself be > an extreme setting A heuristic based on our estimates is a good choice, imv. I do think we can do better than we are now though. So I think that (2) is the best route: Given that we know with much > better certainty the number of rows in the scanned-relation, we should > be able to examine our hash table after it has been built and decide > whether it would be cheaper to rebuild the hash table with the right > number of buckets, or continue processing with what we have now. Which > is roughly what Heikki proposed already, in January. > I'm actually not a huge fan of this as it's certainly not cheap to do. If it can be shown to be better than an improved heuristic then perhaps it would work but I'm not convinced. One other way to address having a smaller actual hash table is to come up with another way to detect empty parts of the hash space, perhaps by using a bitmap to represent the hash space (obviously the individual bits would represent some chunk of the 32bit hash space, perhaps as much as 1/64'th, allowing our bitmap to fit into a 64bit int; that may end up being too small to actually help though). This would mean that when we did get to scanning a bucket we would at least be much more likely of finding a match instead of scanning it and finding nothing. Thanks, Stephen
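A minimal sketch of that bitmap idea (hypothetical, not from any patch): let each bit of a 64-bit word stand for 1/64th of the 32-bit hash space, set bits while building the table, and let probes skip bucket scans for regions that received no build-side tuples at all.

#include <stdint.h>
#include <stdbool.h>

typedef struct HashRegionMap
{
    uint64_t occupied;      /* bit i set => some inner tuple hashed into region i */
} HashRegionMap;

static inline void
region_map_add(HashRegionMap *map, uint32_t hashvalue)
{
    map->occupied |= UINT64_C(1) << (hashvalue >> 26);  /* top 6 bits pick one of 64 regions */
}

static inline bool
region_map_maybe_present(const HashRegionMap *map, uint32_t hashvalue)
{
    return (map->occupied & (UINT64_C(1) << (hashvalue >> 26))) != 0;
}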
Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
I flag it 'return with feedback', please update the patch so it builds. Everything else is ok. Here it is. -- Fabien.diff --git a/doc/src/sgml/ref/create_cast.sgml b/doc/src/sgml/ref/create_cast.sgml index 29ea298..0ace996 100644 --- a/doc/src/sgml/ref/create_cast.sgml +++ b/doc/src/sgml/ref/create_cast.sgml @@ -20,15 +20,15 @@ CREATE CAST (source_type AS target_type) WITH FUNCTION function_name (argument_type [, ...]) -[ AS ASSIGNMENT | AS IMPLICIT ] +[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ] CREATE CAST (source_type AS target_type) WITHOUT FUNCTION -[ AS ASSIGNMENT | AS IMPLICIT ] +[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ] CREATE CAST (source_type AS target_type) WITH INOUT -[ AS ASSIGNMENT | AS IMPLICIT ] +[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ] @@ -74,7 +74,8 @@ SELECT CAST(42 AS float8); - By default, a cast can be invoked only by an explicit cast request, + By default, or if the cast is declared AS EXPLICIT, + a cast can be invoked only by an explicit cast request, that is an explicit CAST(x AS typename) or x::typename @@ -239,6 +240,21 @@ SELECT CAST ( 2 AS numeric ) + 4.0; + AS EXPLICIT + + + + Indicates that the cast can be invoked only with an explicit + cast request, that is an explicit CAST(x AS + typename) or + x::typename + construct. + This is the default. + + + + + AS IMPLICIT diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 5094226..2c0694f 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -533,7 +533,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ENUM_P ESCAPE EVENT EXCEPT - EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN + EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN EXPLICIT EXTENSION EXTERNAL EXTRACT FALSE_P FAMILY FETCH FIRST_P FLOAT_P FOLLOWING FOR FORCE FOREIGN FORWARD @@ -6720,6 +6720,7 @@ CreateCastStmt: CREATE CAST '(' Typename AS Typename ')' cast_context: AS IMPLICIT_P { $$ = COERCION_IMPLICIT; } | AS ASSIGNMENT { $$ = COERCION_ASSIGNMENT; } + | AS EXPLICIT { $$ = COERCION_EXPLICIT; } | /*EMPTY*/{ $$ = COERCION_EXPLICIT; } ; @@ -12723,6 +12724,7 @@ unreserved_keyword: | EXCLUSIVE | EXECUTE | EXPLAIN + | EXPLICIT | EXTENSION | EXTERNAL | FAMILY diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h index 68a13b7..f97389b 100644 --- a/src/include/parser/kwlist.h +++ b/src/include/parser/kwlist.h @@ -149,6 +149,7 @@ PG_KEYWORD("exclusive", EXCLUSIVE, UNRESERVED_KEYWORD) PG_KEYWORD("execute", EXECUTE, UNRESERVED_KEYWORD) PG_KEYWORD("exists", EXISTS, COL_NAME_KEYWORD) PG_KEYWORD("explain", EXPLAIN, UNRESERVED_KEYWORD) +PG_KEYWORD("explicit", EXPLICIT, UNRESERVED_KEYWORD) PG_KEYWORD("extension", EXTENSION, UNRESERVED_KEYWORD) PG_KEYWORD("external", EXTERNAL, UNRESERVED_KEYWORD) PG_KEYWORD("extract", EXTRACT, COL_NAME_KEYWORD) diff --git a/src/test/regress/expected/create_cast.out b/src/test/regress/expected/create_cast.out index 56cd86e..a8858fa 100644 --- a/src/test/regress/expected/create_cast.out +++ b/src/test/regress/expected/create_cast.out @@ -27,8 +27,8 @@ ERROR: function casttestfunc(text) does not exist LINE 1: SELECT casttestfunc('foo'::text); ^ HINT: No function matches the given name and argument types. You might need to add explicit type casts. 
--- Try binary coercion cast -CREATE CAST (text AS casttesttype) WITHOUT FUNCTION; +-- Try binary coercion cast and verbose AS EXPLICIT +CREATE CAST (text AS casttesttype) WITHOUT FUNCTION AS EXPLICIT; SELECT casttestfunc('foo'::text); -- doesn't work, as the cast is explicit ERROR: function casttestfunc(text) does not exist LINE 1: SELECT casttestfunc('foo'::text); @@ -54,7 +54,7 @@ SELECT 1234::int4::casttesttype; -- No cast yet, should fail ERROR: cannot cast type integer to casttesttype LINE 1: SELECT 1234::int4::casttesttype; ^ -CREATE CAST (int4 AS casttesttype) WITH INOUT; +CREATE CAST (int4 AS casttesttype) WITH INOUT; -- default AS EXPLICIT SELECT 1234::int4::casttesttype; -- Should work now casttesttype -- diff --git a/src/test/regress/sql/create_cast.sql b/src/test/regress/sql/create_cast.sql index ad348da..91123ca 100644 --- a/src/test/regress/sql/create_cast.sql +++ b/src/test/regress/sql/create_cast.sql @@ -27,8 +27,8 @@ $$ SELECT 1; $$; SELECT casttestfunc('foo'::text); -- fails, as there's no cast --- Try binary coercion cast -CREATE CAST (text AS casttesttype) WITHOUT FUNCTION; +-- Try binary coercion cast and verbose AS EXPLICIT +CREATE CAST (text AS casttesttype) WITHOUT FUNCTION AS EXPLICIT; SELECT casttestfunc('foo'::text); --
Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
Hello Cédric,

So maybe it is possible to rephrase this piece:
-   AS IMPLICIT is a PostgreSQL
-   extension, too.
+   AS IMPLICIT and AS EXPLICIT are
+   a PostgreSQL extension, too.

Ok.

Back in 2012 Tom exposed arguments against it, or at least not a clear +1. The patch adds nothing but a more explicit creation statement; it has gone untouched for 2 years without interest, so it is surely not something really important for PostgreSQL users.

Elegant is important:-) See my answer to Robert's objection.

* Coding review
Patch does not pass test: ./check_keywords.pl gram.y ../../../src/include/parser/kwlist.h

Oops! That is not elegant! I had to update the unreserved keyword list in order to be able to build postgresql.

** Does it produce compiler warnings? Doesn't build as is. Needs a patch update.

Indeed.

I flag it 'return with feedback', please update the patch so it builds. Everything else is ok.

Yep. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
Hello Robert, I object to this patch. This patch a new keyword, EXPLICIT, for reasons that are strictly cosmetic. Everything that you can do with this patch can also be done without this patch. It is not a good idea to slow down parsing of every SQL statement we have just so that someone can write CREATE CAST .. AS EXPLICIT. Granted, the parsing slowdown for just one keyword is probably not noticeable, but it's cumulative with every new keyword we add. Adding them to support new features is one thing, but adding them to support purely optional syntax is, I think, going too far. I partly object to the objection:-) I agree that it may induce a very small delay to the parser, however I *do* think that cosmetic things are important. In order to ease understanding, learning and memorizing a language, concepts must have names, syntaxes, and be orthogonal and symmetric where applicable. In this example, there are 3 kinds of casts, all 3 have a conceptual name (explicit, implicit, assignment) but only two have a syntax, and the other one is the absence of syntax. So you have to memorize this stupid information (which one of the three does not have a syntax) or read the documentation every time to remember that "explicit" is the one without a syntax. Note also that you must state "implicit" explicitely, but "explicit" is told implicitely, which does not really help. The impact is also on the documentation which is not symmetric because it is based on the syntax which is not, so it is simply a little harder to understand. Every year I do my class about PL/pgSQL and extensions to Pg, and every year some students will try "as explicit" because it is logical to do so. I think that she is right and that it should work, instead of having to explain that "explicit" is implicit when dealing with Pg casts. Although it is my job, I would prefer to spend time explaining more interesting things. From the software engineering point of view, having a syntax for all case means that the developer must think about which kind of cast she really wants, instead of doing the default thing just because it is the default. So in my mind the tradeoff is between people time & annoyance and a few machine cycles, and I have no hesitation to choose the later. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dump difference between 9.3 and master after upgrade
On 06/20/2013 11:16 AM, I wrote: On 06/20/2013 10:43 AM, Robert Haas wrote: On Tue, Jun 18, 2013 at 12:18 PM, Andrew Dunstan wrote: As I was updating my cross version upgrade tester to include support for the 9.3 branch, I noted this dump difference between the dump of the original 9.3 database and the dump of the converted database, which looks odd. Is it correct? It looks sketchy to me, but I'm not sure exactly what you're comparing. Essentially, cross version upgrade testing runs pg_dumpall from the new version on the old cluster, runs pg_upgrade, and then runs pg_dumpall on the upgraded cluster, and compares the two outputs. This is what we get when the new version is HEAD and the old version is 9.3. The reason this hasn't been caught by the standard same-version upgrade tests is that this module uses a more extensive cluster, which has had not only the core regression tests run but also all the contrib and pl regression tests, and this problem seems to be exposed by the postgres_fdw tests. At first glance to me like pg_dump in binary-upgrade mode is not suppressing something that it should be suppressing. Yeah, after examination I don't see why we should output anything for dropped columns of a foreign table in binary upgrade mode. This looks to me like it's been a bug back to 9.1 where we introduced foreign tables. I think something like the attached should fix it, although I'm not sure if that's the right place for the fix. cheers andrew diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index ec956ad..b25b475 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -6670,7 +6670,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables) * such a column it will mistakenly get pg_attribute.attislocal set to true.) * However, in binary_upgrade mode, we must print all such columns anyway and * fix the attislocal/attisdropped state later, so as to keep control of the - * physical column order. + * physical column order, unless it's a Foreign Table. * * This function exists because there are scattered nonobvious places that * must be kept in sync with this decision. @@ -6679,7 +6679,7 @@ bool shouldPrintColumn(TableInfo *tbinfo, int colno) { if (binary_upgrade) - return true; + return (tbinfo->relkind != RELKIND_FOREIGN_TABLE || !tbinfo->attisdropped[colno]); return (tbinfo->attislocal[colno] && !tbinfo->attisdropped[colno]); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On Fri, Jun 21, 2013 at 10:02 PM, MauMau wrote: > I'm comfortable with 5 seconds. We are talking about the interval between > sending SIGQUIT to the children and then sending SIGKILL to them. In most > situations, the backends should terminate immediately. However, as I said a > few months ago, ereport() call in quickdie() can deadlock indefinitely. This > is a PostgreSQL's bug. So let's fix that bug. Then we don't need this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
On Sat, Jun 22, 2013 at 9:16 AM, Cédric Villemain wrote: > patch is in unified format and apply on HEAD. > patch contains documentation, however I believe 'AS IMPLICIT' is a PostgreSQL > extension with special behavior and 'AS EXPLICIT' respect the standard except > that PostgreSQL adds only the expression 'AS EXPLICIT' (it is also the default > in the standard). I object to this patch. This patch adds a new keyword, EXPLICIT, for reasons that are strictly cosmetic. Everything that you can do with this patch can also be done without this patch. It is not a good idea to slow down parsing of every SQL statement we have just so that someone can write CREATE CAST .. AS EXPLICIT. Granted, the parsing slowdown for just one keyword is probably not noticeable, but it's cumulative with every new keyword we add. Adding them to support new features is one thing, but adding them to support purely optional syntax is, I think, going too far. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost wrote: > I'm not a huge fan of moving directly to INT_MAX. Are we confident that > everything can handle that cleanly..? I feel like it might be a bit > safer to shy a bit short of INT_MAX (say, by 1K). Maybe it would be better to stick with INT_MAX and fix any bugs we find. If there are magic numbers short of INT_MAX that cause problems, it would likely be better to find out about those problems and adjust the relevant code, rather than trying to dodge them. We'll have to confront all of those problems eventually as we come to support larger and larger sorts; I don't see much value in putting it off. Especially since we're early in the release cycle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Sat, Jun 22, 2013 at 9:48 AM, Stephen Frost wrote: >> The correct calculation that would match the objective set out in the >> comment would be >> >> dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET; > > This looks to be driving the size of the hash table size off of "how > many of this size tuple can I fit into memory?" and ignoring how many > actual rows we have to hash. Consider a work_mem of 1GB with a small > number of rows to actually hash- say 50. With a tupsize of 8 bytes, > we'd be creating a hash table sized for some 13M buckets. This is a fair point, but I still think Simon's got a good point, too. Letting the number of buckets ramp up when there's ample memory seems like a broadly sensible strategy. We might need to put a floor on the effective load factor, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On Sat, Jun 22, 2013 at 9:15 AM, Simon Riggs wrote: > Previous discussions of Hash Joins have noted that the performance > decreases when the average number of tuples per bucket increases. > O(N^2) effects are seen. > > We've argued about this in many different ways, yet all of those > discussions have centred around the constant NTUP_PER_BUCKET. I > believe that was a subtle mistake and I have a proposal. > > The code in ExecChooseHashTableSize() line 460 says > /* > * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when > * memory is filled. > ... > > but the calculation then sets the number of buckets like this > > dbuckets = ceil(ntuples / NTUP_PER_BUCKET); > > **This doesn't match the comment.** If we underestimate the number > of tuples and go on to fill the available memory, we then end up with > an average number of tuples per bucket higher than NTUP_PER_BUCKET. A > notational confusion that has been skewing the discussion. > > The correct calculation that would match the objective set out in the > comment would be > > dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET; > > Which leads us to a much more useful value of dbuckets in the case > where using ntuples occupies much less space than is available. This > value is always the same or higher than previously because of the if-test > that surrounds it. +1. I think this is very clever. The other thing that seems to be distorting things here is that this function doesn't account for the memory consumed by the bucket array. So as things stand today, changing NTUP_PER_BUCKET can't ever increase the required number of batches. But that's really nonsensical. Setting NTUP_PER_BUCKET to a smaller value like 1 or even, in effect, a value less than 1 would probably improve performance for large hash tables, but there's no way to decide what value is too expensive because the memory cost of changing it isn't accounted for in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
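Putting the two points together, a hypothetical variant of the sizing logic (a sketch only, heavily simplified from ExecChooseHashTableSize() and not the committed code) could size the bucket array from available memory, as Simon suggests, while also charging the pointer array itself against the same memory budget:

#include <math.h>

#define NTUP_PER_BUCKET 10      /* the existing nodeHash.c constant */

/* hypothetical helper: choose a bucket count from the memory budget */
static int
choose_nbuckets(double ntuples, long hash_table_bytes, long tupsize, long bucket_ptr_size)
{
    /* size from what fits in memory rather than from the row estimate */
    double dbuckets = (hash_table_bytes / (double) tupsize) / NTUP_PER_BUCKET;

    /* no point in exceeding the estimated row count */
    if (dbuckets > ntuples)
        dbuckets = ntuples;

    /* charge the bucket pointer array against the same budget */
    if (dbuckets * bucket_ptr_size > hash_table_bytes / 2)
        dbuckets = (hash_table_bytes / 2) / (double) bucket_ptr_size;

    return (int) ceil(dbuckets);
}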
Re: [HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
On Fri, Jun 21, 2013 at 11:19 PM, Tom Lane wrote: >> I think that's the Tom Lane theory. The Robert Haas theory is that if >> the postmaster has died, there's no reason to suppose that it hasn't >> corrupted shared memory on the way down, or that the system isn't >> otherwise heavily fuxxored in some way. > > Eh? The postmaster does its level best never to touch shared memory > (after initialization anyway). And yet it certainly does - see pmsignal.c, for example. Besides which, as Andres points out, if the postmaster is dead, there is zip for a guarantee that some OTHER backend hasn't panicked. I think it's just ridiculous to suppose that the system can run in any sort of reasonable way without the postmaster. The whole reason why we work so hard to make sure that the postmaster doesn't die in the first place is because we need it to clean up when things go horribly wrong. If that cleanup function is important, then we need a living postmaster at all times. If it's not important, then our extreme paranoia about what operations the postmaster is permitted to engage in is overblown. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Please find attached a v12, which under timer_exceeded interrupts clients which are being throttled instead of waiting for the end of the transaction, as the transaction is not started yet. Oops, I forgot the attachment. Here it is! -- Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 1303217..4c9e55d 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -137,6 +137,12 @@ int unlogged_tables = 0; double sample_rate = 0.0; /* + * When threads are throttled to a given rate limit, this is the target delay + * to reach that rate in usec. 0 is the default and means no throttling. + */ +int64 throttle_delay = 0; + +/* * tablespace selection */ char *tablespace = NULL; @@ -200,11 +206,13 @@ typedef struct int listen; /* 0 indicates that an async query has been * sent */ int sleeping; /* 1 indicates that the client is napping */ +boolthrottling; /* whether nap is for throttling */ int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; instr_time txn_begin; /* used for measuring transaction latencies */ instr_time stmt_begin; /* used for measuring statement latencies */ + bool throttled; /* whether current transaction was throttled */ int use_file; /* index in sql_files for this client */ bool prepared[MAX_FILES]; } CState; @@ -222,6 +230,10 @@ typedef struct instr_time *exec_elapsed; /* time spent executing cmds (per Command) */ int *exec_count; /* number of cmd executions (per Command) */ unsigned short random_state[3]; /* separate randomness for each thread */ +int64 throttle_trigger; /* previous/next throttling (us) */ + int64 throttle_lag; /* total transaction lag behind throttling */ + int64 throttle_lag_max; /* max transaction lag */ + } TState; #define INVALID_THREAD ((pthread_t) 0) @@ -230,6 +242,8 @@ typedef struct { instr_time conn_time; int xacts; + int64 throttle_lag; + int64 throttle_lag_max; } TResult; /* @@ -355,6 +369,8 @@ usage(void) " -n do not run VACUUM before tests\n" " -N do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n" " -r report average latency per command\n" + " -R SPEC, --rate SPEC\n" + " target rate in transactions per second\n" " -s NUM report this scale factor in output\n" " -S perform SELECT-only transactions\n" " -t NUM number of transactions each client runs (default: 10)\n" @@ -898,17 +914,58 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa { PGresult *res; Command **commands; + booldo_throttle = false; top: commands = sql_files[st->use_file]; + /* handle throttling once per transaction by inserting a sleep. + * this is simpler than doing it at the end. + */ + if (throttle_delay && ! st->throttled) + { + /* compute delay to approximate a Poisson distribution + * 100 => 13.8 .. 0 multiplier + * 10 => 11.5 .. 0 + * 1 => 9.2 .. 0 + *1000 => 6.9 .. 0 + * if transactions are too slow or a given wait shorter than + * a transaction, the next transaction will start right away. + */ + int64 wait = (int64) + throttle_delay * -log(getrand(thread, 1, 1000)/1000.0); + + thread->throttle_trigger += wait; + + st->until = thread->throttle_trigger; + st->sleeping = 1; + st->throttling = true; + st->throttled = true; + if (debug) + fprintf(stderr, "client %d throttling "INT64_FORMAT" us\n", + st->id, wait); + } + if (st->sleeping) { /* are we sleeping? 
*/ instr_time now; + int64 now_us; INSTR_TIME_SET_CURRENT(now); - if (st->until <= INSTR_TIME_GET_MICROSEC(now)) + now_us = INSTR_TIME_GET_MICROSEC(now); + if (st->until <= now_us) + { st->sleeping = 0; /* Done sleeping, go ahead with next command */ + if (st->throttling) + { +/* measure lag of throttled transaction */ +int64 lag = now_us - st->until; +thread->throttle_lag += lag; +if (lag > thread->throttle_lag_max) + thread->throttle_lag_max = lag; +st->throttling = false; + } + } else return true; /* Still sleeping, nothing to do here */ } @@ -1037,7 +1094,7 @@ top: * This is more than we really ought to know about * instr_time */ - fprintf(logfile, "%d %d %.0f %d %ld %ld\n", + fprintf(logfile, "%d %d %.0f %d %ld.%06ld\n", st->id, st->cnt, usec, st->use_file, (long) now.tv_sec, (long) now.tv_usec); #else @@ -1046,7 +1103,7 @@ top: * On Windows, instr_time doesn't provide a timestamp * anyway */ - fprintf(logfile, "%d %d %.0f %d 0 0\n", + fprintf(logfile, "%d %d %.0f %d 0.0\n", st->id, st->cnt, usec, st->use_file); #endif } @@ -1095,6 +1152,13 @@ top: st->state = 0; st->use_fil
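For what it's worth, the wait computed in the patch above is an exponential draw, which is what makes the schedule of transaction starts a Poisson process at the target rate. A standalone illustration of the same idea (plain C, not pgbench internals):

#include <math.h>
#include <stdlib.h>

/* mean_delay_us = 1 / target rate; returns the wait before the next start, in us */
static long
next_wait_us(long mean_delay_us)
{
    double u = (rand() % 1000 + 1) / 1000.0;    /* uniform in (0.001, 1.0], avoids log(0) */

    return (long) (mean_delay_us * -log(u));    /* exponential with the given mean */
}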
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Please find attached a v12, which under timer_exceeded interrupts clients which are being throttled instead of waiting for the end of the transaction, as the transaction is not started yet. I've also changed the log format that I used for debugging the apparent latency issue: x y z 12345677 1234 -> x y z 12345677.001234 It seems much clearer that way. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
On 2013-06-21 16:56:57 -0400, Alvaro Herrera wrote: > > What we could do to improve the robustness a bit - at least on linux - > > is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be > > killed if the postmaster goes away... > > This is an interesting idea (even if it has no relationship to the patch > at hand). Well, we could just set the deathsig to SIGKILL and exit normally - which would be a twoliner or so. Admittedly the cross-platform issue makes this approach not so great... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
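A minimal sketch of that Linux-only approach (assuming <sys/prctl.h>, which is exactly the cross-platform caveat above): the child asks the kernel to deliver SIGKILL to it when its parent exits, and closes the race where the parent already died before the call.

#include <sys/prctl.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>

static void
arm_parent_death_signal(pid_t expected_parent)
{
    prctl(PR_SET_PDEATHSIG, SIGKILL);

    /* if the parent already exited between fork() and here, we were reparented */
    if (getppid() != expected_parent)
        _exit(1);
}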
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió: > Are you suggesting simplifying the following part in ServerLoop()? > I welcome the idea if this condition becomes simpler. However, I > cannot imagine how. > if (AbortStartTime > 0 && /* SIGKILL only once */ > (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) && > now - AbortStartTime >= 10) > { > SignalAllChildren(SIGKILL); > AbortStartTime = 0; > } Yes, that's what I'm suggesting. I haven't tried coding it yet. > I thought of adding some new state of pmState for some reason (that > might be the same as your idea). > But I refrained from doing that, because pmState has already many > states. I was afraid adding a new pmState value for this bug fix > would complicate the state management (e.g. state transition in > PostmasterStateMachine()). In addition, I felt PM_WAIT_BACKENDS was > appropriate because postmaster is actually waiting for backends to > terminate after sending SIGQUIT. The state name is natural. Well, a natural state name is of no use if we're using it to indicate two different states, which I think is what would be happening here. Basically, with your patch, PM_WAIT_BACKENDS means two different things depending on AbortStartTime. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
Andres Freund escribió: > On 2013-06-22 22:45:26 +0900, Michael Paquier wrote: > > And I imagine that you have the same problem even with > > RelationGetIndexList, not only RelationGetIndexListIfInvalid, because > > this would appear as long as you try to open more than 1 index with an > > index list. > > No. RelationGetIndexList() returns a copy of the list for exactly that > reason. The danger is not to see an outdated list - we should be > protected by locks against that - but looking at uninitialized or reused > memory. Are we doing this only to save some palloc traffic? Could we do this by, say, teaching list_copy() to have a special case for lists of ints and oids that allocates all the cells in a single palloc chunk? (This has the obvious problem that list_free no longer works, of course. But I think that specific problem can be easily fixed. Not sure if it causes more breakage elsewhere.) Alternatively, I guess we could grab an uncopied list, then copy the items individually into a locally allocated array, avoiding list_copy. We'd need to iterate differently than with foreach(). -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
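A sketch of that "local array" alternative (hypothetical, not an actual patch): snapshot the OIDs up front, before opening any index, so that a cache invalidation triggered later cannot leave the loop walking freed or reused list cells. It assumes a variant of RelationGetIndexList() that hands back the uncopied list:

    List       *indexlist = RelationGetIndexList(relation); /* uncopied variant assumed */
    int         nindexes = list_length(indexlist);
    Oid        *indexoids = (Oid *) palloc(nindexes * sizeof(Oid));
    ListCell   *lc;
    int         i = 0;

    foreach(lc, indexlist)              /* nothing here can trigger invalidation */
        indexoids[i++] = lfirst_oid(lc);

    for (i = 0; i < nindexes; i++)      /* safe even if the cached list is rebuilt */
    {
        Relation    indexrel = index_open(indexoids[i], RowExclusiveLock);

        /* ... work with the index ... */
        index_close(indexrel, NoLock);
    }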
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Dear Robert and Greg, I think so. If it doesn't get fixed now, it's not likely to get fixed later. And the fact that nobody understands why it's happening is kinda worrisome... Possibly, but I thing that it is not my fault:-) So, I spent some time at performance debugging... First, I finally succeeded in reproducing Greg Smith spikes on my ubuntu laptop. It needs short transactions and many clients per thread so as to be a spike. With "pgbench" standard transactions and not too many clients per thread it is more of a little bump, or even it is not there at all. After some poking around, and pursuing various red herrings, I resorted to measure the delay for calling "PQfinish()", which is really the only special thing going around at the end of pgbench run... BINGO! In his tests Greg is using one thread. Once the overall timer is exceeded, clients start closing their connections by calling PQfinish once their transactions are done. This call takes between a few us and a few ... ms. So if some client closing time hit the bad figures, the transactions of other clients are artificially delayed by this time, and it seems they have a high latency, but it is really because the thread is in another client's PQfinish and was not available to process the data. If you have one thread per client, no problem, especially as the final PQfinish() time is not measured at all by pgbench:-) Also, the more clients, the higher the spike because more are to be stopped and may hit the bad figures. Here is a trace, with the simple SELECT transaction. sh> ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench ... sh> less pgbench_log.* # short transactions, about 0.250 ms ... 20 4849 241 0 1371916583.455400 21 4844 256 0 1371916583.455470 22 4832 348 0 1371916583.455569 23 4829 218 0 1371916583.455627 ** TIMER EXCEEDED ** 25 4722 390 0 1371916583.455820 25 done in 1560 [1,2] # BING, 1560 us for PQfinish() 26 4557 1969 0 1371916583.457407 26 done in 21 [1,2] 27 4372 1969 0 1371916583.457447 27 done in 19 [1,2] 28 4009 1910 0 1371916583.457486 28 done in 1445 [1,2] # BOUM 2 interrupted in 1300 [0,0]# BANG 3 interrupted in 15 [0,0] 4 interrupted in 40 [0,0] 5 interrupted in 203 [0,0] # boum? 6 interrupted in 1352 [0,0]# ZIP 7 interrupted in 18 [0,0] ... 23 interrupted in 12 [0,0] ## the cumulated PQfinish() time above is about 6 ms which ## appears as an apparent latency for the next clients: 0 4880 6521 0 1371916583.462157 0 done in 9 [1,2] 1 4877 6397 0 1371916583.462194 1 done in 9 [1,2] 24 4807 6796 0 1371916583.462217 24 done in 9 [1,2] ... Note that the bad measures also appear when there is no throttling: sh> ./pgbench -T 10 -r -l -c 30 -S bench sh> grep 'done.*[0-9][0-9][0-9]' pgbench_log.* 0 done in 1974 [1,2] 1 done in 312 [1,2] 3 done in 2159 [1,2] 7 done in 409 [1,2] 11 done in 393 [1,2] 12 done in 2212 [1,2] 13 done in 1458 [1,2] ## other clients execute PQfinish in less than 100 us This "done" is issued by my instrumented version of clientDone(). 
The issue also appears if I instrument pgbench from master, without anything from the throttling patch at all: sh> git diff master diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 1303217..7c5ea81 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -869,7 +869,15 @@ clientDone(CState *st, bool ok) if (st->con != NULL) { + instr_time now; + int64 s0, s1; + INSTR_TIME_SET_CURRENT(now); + s0 = INSTR_TIME_GET_MICROSEC(now); PQfinish(st->con); + INSTR_TIME_SET_CURRENT(now); + s1 = INSTR_TIME_GET_MICROSEC(now); + fprintf(stderr, "%d done in %ld [%d,%d]\n", + st->id, s1-s0, st->listen, st->state); st->con = NULL; } return false; /* always false */ sh> ./pgbench -T 10 -r -l -c 30 -S bench 2> x.err sh> grep 'done.*[0-9][0-9][0-9]' x.err 14 done in 1985 [1,2] 16 done in 276 [1,2] 17 done in 1418 [1,2] So my reasoned conclusion is that the issue is somewhere within PQfinish(), possibly in interaction with what pgbench is doing, but is *NOT* related in any way to the throttling patch, as it predates the patch. Greg stumbled upon it because he looked at latencies. I'll submit a slightly improved v12 shortly. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
On 22 June 2013 14:48, Stephen Frost wrote: > Based on your argument that we want to have a bucket load which is, on > average, the size of NTUP_PER_BUCKET, I have to '-1' on this. That is not my argument. I am pointing out that the comments claim the code does that, and yet the code does not. So either we apply the patch to make the code fit the comment, or we change the comment. Tom's wish for an average tuples per bucket of 10 may be connected to that error; I can't say. But we definitely don't end up with an average of 10 in many cases, which is the basis for previous poor performance reports. > What we > want is to size a table large enough that we never have any true > collisions (cases where different 32-bit hash values end up in the same > bucket) *for all tuples being hashed*, that includes both the side > building the hash table and the side doing the scan. This should be > done when we can avoid batching- if we don't have enough to avoid > batching while having such a large table, we should consider adjusting > the hash table size down some because, in those cases, we're memory > constrained. > > When we have enough memory to avoid batching, we never want to have to > check down through a bucket for a tuple which doesn't exist- we should > simply be able to look in the hash table itself, determine (with a > single fetch) that there are no entries for that hash value, and throw > the tuple away and move on to the next. The bottom line is that you're hoping for perfection. If we knew exactly how many tuples were in the hash table then we could size it precisely for the hash join we are about to perform. We don't know that, so we can't size it precisely. Worse than that, we know we badly estimate the number of buckets on a regular basis. That gives us three choices: if we have a hash table fixed without prior knowledge then we either (0) continue to underallocate buckets in many cases or (1) potentially overallocate them. Or (2) we learn to cope better with the variation in the number of entries. If we rule out the first two we have only the last one remaining. (1) will always be a heuristic, and as you point out could itself be an extreme setting. So I think that (2) is the best route: given that we know with much better certainty the number of rows in the scanned relation, we should be able to examine our hash table after it has been built and decide whether it would be cheaper to rebuild the hash table with the right number of buckets, or continue processing with what we have now. Which is roughly what Heikki proposed already, in January. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
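A very rough sketch of the kind of post-build check that option (2) implies (the costs are hand-waved and every function name below is invented; this is not Heikki's January patch):

	/*
	 * After the inner relation has been loaded, compare the bucket count we
	 * guessed at plan time with what we now know, and re-bucket if the
	 * expected saving on probes outweighs the one-off cost of rebuilding.
	 */
	double	load = (double) hashtable->totalTuples / hashtable->nbuckets;

	if (load > NTUP_PER_BUCKET * 2)		/* threshold is a guess */
	{
		int		newbuckets = choose_nbuckets(hashtable->totalTuples);	/* hypothetical */

		if (rebucket_cost(hashtable, newbuckets) <
			probe_savings(hashtable, newbuckets, outer_path_rows))		/* hypothetical */
			ExecHashRebucket(hashtable, newbuckets);					/* hypothetical */
	}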
Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote: > (I'm still learning the details of Pg's WAL, WAL replay and recovery, so > the below's just my understanding): > > The problem is that WAL for all tablespaces is mixed together in the > archives. If you lose your tablespace then you have to keep *all* WAL > around and replay *all* of it again when the tablespace comes back > online. This would be very inefficient and would require a lot of tricks to > cope with applying WAL to a database that has an on-disk state in the > future as far as the archives are concerned. It's not as simple as just > replaying all WAL all over again - as I understand it, things like > CLUSTER or TRUNCATE will result in relfilenodes not being where they're > expected to be as far as old WAL archives are concerned. Selective > replay would be required, and that leaves the door open to all sorts of > new and exciting bugs in areas that'd hardly ever get tested. > > To solve the massive disk space explosion problem I imagine we'd have to > have per-tablespace WAL. That'd cause a *huge* increase in fsync costs > and loss of the rather nice property that WAL writes are nice sequential > writes. It'd be complicated and probably cause nightmares during > recovery, for archive-based replication, etc. > > The only other thing I can think of is: When a tablespace is offline, > write WAL records to a separate "tablespace recovery log" as they're > encountered. Replay this log when the tablespace is restored, > before applying any other new WAL to the tablespace. This wouldn't > affect archive-based recovery since it'd already have the records from > the original WAL. > > None of these options seem exactly simple or pretty, especially given > the additional complexities that'd be involved in allowing WAL records > to be applied out-of-order, something that AFAIK _never_ happens at the > moment. > > The key problem, of course, is that this all sounds like a lot of > complicated work for a case that's not really supposed to happen. Right > now, the answer is "your database is unrecoverable, switch to your > streaming warm standby and re-seed it from the standby". Not pretty, but > at least there's the option of using a sync standby and avoiding data loss. > > How would you approach this? Sorry to be replying late. You are right that you could record/apply WAL separately for offline tablespaces. The problem is that you could have logical ties from the offline tablespace to online tablespaces. For example, what happens if data in an online tablespace references a primary key in an offline tablespace? What if the system catalogs are stored in an offline tablespace? Right now, we allow logical bindings across physical tablespaces. To do what you want, you would really need to store each database in its own tablespace. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible bug in CASE evaluation
On 2013-06-21 16:45:28 +0200, Andres Freund wrote: > On 2013-06-21 09:51:05 -0400, Noah Misch wrote: > > That being said, if we discover a simple-enough fix that performs well, we > > may > > as well incorporate it. > > What about passing another parameter down eval_const_expressions_mutator > (which is static, so changing the API isn't a problem) that basically > tells us whether we actually should evaluate expressions or just perform > the transformation? > There's seems to be basically a couple of places where we call dangerous > things: > * simplify_function (via ->evaluate_function->evaluate_expr) which is > called in a bunch of places > * evaluate_expr which is directly called in T_ArrayExpr > T_ArrayCoerceExpr > > All places I've inspected so far need to deal with simplify_function > returning NULL anyway, so that seems like a viable fix. *Prototype* patch - that seems simple enough - attached. Opinions? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services >From 1349f8e4a8133eebbf66753b1aa3787f88ee3e33 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sat, 22 Jun 2013 16:53:39 +0200 Subject: [PATCH] Don't evaluate potentially unreachable parts of CASE during planning --- src/backend/optimizer/util/clauses.c | 31 +-- src/test/regress/expected/case.out | 13 ++--- src/test/regress/sql/case.sql| 4 ++-- 3 files changed, 41 insertions(+), 7 deletions(-) diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c index 6d5b204..94b52f6 100644 --- a/src/backend/optimizer/util/clauses.c +++ b/src/backend/optimizer/util/clauses.c @@ -63,6 +63,7 @@ typedef struct List *active_fns; Node *case_val; bool estimate; + bool doevil; } eval_const_expressions_context; typedef struct @@ -2202,6 +2203,7 @@ eval_const_expressions(PlannerInfo *root, Node *node) context.active_fns = NIL; /* nothing being recursively simplified */ context.case_val = NULL; /* no CASE being examined */ context.estimate = false; /* safe transformations only */ + context.doevil = true; /* allowed to evaluate const expressions */ return eval_const_expressions_mutator(node, &context); } @@ -2233,6 +2235,7 @@ estimate_expression_value(PlannerInfo *root, Node *node) context.active_fns = NIL; /* nothing being recursively simplified */ context.case_val = NULL; /* no CASE being examined */ context.estimate = true; /* unsafe transformations OK */ + context.doevil = true; /* allowed to evaluate const expressions */ return eval_const_expressions_mutator(node, &context); } @@ -2751,7 +2754,7 @@ eval_const_expressions_mutator(Node *node, * If constant argument and it's a binary-coercible or * immutable conversion, we can simplify it to a constant. 
*/ -if (arg && IsA(arg, Const) && +if (context->doevil && arg && IsA(arg, Const) && (!OidIsValid(newexpr->elemfuncid) || func_volatile(newexpr->elemfuncid) == PROVOLATILE_IMMUTABLE)) return (Node *) evaluate_expr((Expr *) newexpr, @@ -2843,16 +2846,20 @@ eval_const_expressions_mutator(Node *node, CaseExpr *caseexpr = (CaseExpr *) node; CaseExpr *newcase; Node *save_case_val; +bool save_doevil; Node *newarg; List *newargs; bool const_true_cond; Node *defresult = NULL; ListCell *arg; +save_doevil = context->doevil; + /* Simplify the test expression, if any */ newarg = eval_const_expressions_mutator((Node *) caseexpr->arg, context); + /* Set up for contained CaseTestExpr nodes */ save_case_val = context->case_val; if (newarg && IsA(newarg, Const)) @@ -2894,6 +2901,17 @@ eval_const_expressions_mutator(Node *node, const_true_cond = true; } + /* + * We can only evaluate expressions as long we are sure the + * the expression will be reached at runtime - otherwise we + * might cause spurious errors to be thrown. The first + * eval_const_expression above can always evaluate + * expressions (unless we are in a nested CaseExpr) since + * it will always be reached at runtime. + */ + if (!const_true_cond) + context->doevil = false; + /* Simplify this alternative's result value */ caseresult = eval_const_expressions_mutator((Node *) oldcasewhen->result, context); @@ -2926,6 +2944,8 @@ eval_const_expressions_mutator(Node *node, context->case_val = save_case_val; +context->doevil = save_doevil; + /* * If no non-FALSE alternatives, CASE reduces to the default * result @@ -2982,7 +3002,7 @@ eval_const_expressions_mutator(Node *node, newarray->multidims = arrayexpr->multidims; newarray->location = arrayexpr->location; -if (all_const) +if (all_const && context->doevil) return (Node *) evaluate_expr((Ex
Re: [HACKERS] Implementing incremental backup
On 2013-06-22 15:58:35 +0200, Cédric Villemain wrote: > > A differential backup resulting from a bunch of WAL between W1 and Wn > > would help to recover much faster to the time of Wn than replaying all > > the WALs between W1 and Wn and saves a lot of space. > > > > I was hoping to find some time to dig around this idea, but as the > > subject rose here, then here are my 2¢! > > something like that maybe : > ./pg_xlogdump -b \ > ../data/pg_xlog/00010001 \ > ../data/pg_xlog/00010005| \ > grep 'backup bkp' | awk '{print ($5,$9)}' Note that it's a bit more complex than that for a number of reasons: * we don't log full page images for e.g. new heap pages, we just set the XLOG_HEAP_INIT_PAGE flag on the record * there also are XLOG_FPI records * How do you get a base backup as the basis to apply those to? You need it to be recovered exactly to a certain point... But yes, I think something can be done in the end. I think Heikki's pg_rewind already has quite a bit of the required logic. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Implementing incremental backup
Le samedi 22 juin 2013 01:09:20, Jehan-Guillaume (ioguix) de Rorthais a écrit : > On 20/06/2013 03:25, Tatsuo Ishii wrote: > >> On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii wrote: > On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost wrote: > > * Claudio Freire (klaussfre...@gmail.com) wrote: > [...] > > >> The only bottleneck here, is WAL archiving. This assumes you can > >> afford WAL archiving at least to a local filesystem, and that the WAL > >> compressor is able to cope with WAL bandwidth. But I have no reason to > >> think you'd be able to cope with dirty-map updates anyway if you were > >> saturating the WAL compressor, as the compressor is more efficient on > >> amortized cost per transaction than the dirty-map approach. > > > > Thank you for detailed explanation. I will think more about this. > > Just for the record, I was mulling over this idea since a bunch of > month. I even talked about that with Dimitri Fontaine some weeks ago > with some beers :) > > My idea came from a customer during a training explaining me the > difference between differential and incremental backup in Oracle. > > My approach would have been to create a standalone tool (say > pg_walaggregate) which takes a bunch of WAL from archives and merge them > in a single big file, keeping only the very last version of each page > after aggregating all their changes. The resulting file, aggregating all > the changes from given WAL files would be the "differential backup". > > A differential backup resulting from a bunch of WAL between W1 and Wn > would help to recover much faster to the time of Wn than replaying all > the WALs between W1 and Wn and saves a lot of space. > > I was hoping to find some time to dig around this idea, but as the > subject rose here, then here are my 2¢! something like that maybe : ./pg_xlogdump -b \ ../data/pg_xlog/00010001 \ ../data/pg_xlog/00010005| \ grep 'backup bkp' | awk '{print ($5,$9)}' -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation signature.asc Description: This is a digitally signed message part.
Re: [HACKERS] Unaccent performance
On 2013-06-21 22:52:04 +0100, Thom Brown wrote: > > CREATE OR REPLACE FUNCTION public.myunaccent(sometext text) > > RETURNS text > > LANGUAGE sql > > IMMUTABLE > > AS $function$ > > SELECT > > replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A') > > ; > > $function$ > > > Another test passing in a string of 10 characters gives the following > timings: > > unaccent: 240619.395 ms > myunaccent: 785.505 ms The reason for that is that unaccent is 'stable' while your function is 'immutable', so the planner recognizes that and computes it only once since you're always passing the same text string to it. > Another test inserting long text strings into a text column of a table > 100,000 times, then updating another column to have that unaccented value > using both methods: > > unaccent: 3867.306 ms > myunaccent: 43611.732 ms Whereas it cannot recognize that in this case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
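One way to see the effect (and a common workaround) is to wrap unaccent() in an SQL function declared IMMUTABLE, which is only legitimate if you treat the unaccent rules file as fixed; a sketch, assuming the extension lives in schema public:

-- Wrapper marked IMMUTABLE so the planner may constant-fold calls with
-- constant arguments, just as it does for myunaccent() above.
CREATE OR REPLACE FUNCTION immutable_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE AS
$$ SELECT public.unaccent('public.unaccent', $1) $$;

-- Compare the plans: the immutable wrapper is folded to a constant at
-- plan time, the stable unaccent() call is not.
EXPLAIN VERBOSE SELECT immutable_unaccent('déjà vu');
EXPLAIN VERBOSE SELECT unaccent('déjà vu');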
Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET
Simon, * Simon Riggs (si...@2ndquadrant.com) wrote: > Previous discussions of Hash Joins have noted that the performance > decreases when the average number of tuples per bucket increases. Having the hash table so small that we have hash bucket collisions with different 32bit hash values is extremely painful, however.. > The correct calculation that would match the objective set out in the > comment would be > > dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET; This looks to be driving the size of the hash table size off of "how many of this size tuple can I fit into memory?" and ignoring how many actual rows we have to hash. Consider a work_mem of 1GB with a small number of rows to actually hash- say 50. With a tupsize of 8 bytes, we'd be creating a hash table sized for some 13M buckets. > This solves the problem in earlier discussions since we get a lower > average number of tuples per bucket and yet we also get to keep the > current NTUP_PER_BUCKET value. Everybody wins. I agree that this should reduce the number of tuples per bucket, but I don't think it's doing what you expect and based on what it's doing, it seems wasteful. > Comments? Based on your argument that we want to have a bucket load which is, on average, the size of NTUP_PER_BUCKET, I have to '-1' on this. What we want is to size a table large enough that we never have any true collisions (cases where different 32-bit hash values end up in the same bucket) *for all tuples being hashed*, that includes both the side building the hash table and the side doing the scan. This should be done when we can avoid batching- if we don't have enough to avoid batching while having such a large table, we should consider adjusting the hash table size down some because, in those cases, we're memory constrained. When we have enough memory to avoid batching, we never want to have to check down through a bucket for a tuple which doesn't exist- we should simply be able to look in the hash table itself, determine (with a single fetch) that there are no entries for that hash value, and throw the tuple away and move on to the next. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-06-22 22:45:26 +0900, Michael Paquier wrote: > On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund > wrote: > > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: > >> By looking at the comments of RelationGetIndexList:relcache.c, > >> actually the method of the patch is correct because in the event of a > >> shared cache invalidation, rd_indexvalid is set to 0 when the index > >> list is reset, so the index list would get recomputed even in the case > >> of shared mem invalidation. > > > > The problem I see is something else. Consider code like the following: > > > > RelationFetchIndexListIfInvalid(toastrel); > > foreach(lc, toastrel->rd_indexlist) > >toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); > > > > index_open calls relation_open calls LockRelationOid which does: > > if (res != LOCKACQUIRE_ALREADY_HELD) > >AcceptInvalidationMessages(); > > > > So, what might happen is that you open the first index, which accepts an > > invalidation message which in turn might delete the indexlist. Which > > means we would likely read invalid memory if there are two indexes. > And I imagine that you have the same problem even with > RelationGetIndexList, not only RelationGetIndexListIfInvalid, because > this would appear as long as you try to open more than 1 index with an > index list. No. RelationGetIndexList() returns a copy of the list for exactly that reason. The danger is not to see an outdated list - we should be protected by locks against that - but looking at uninitialized or reused memory. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund wrote: > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: >> On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund >> wrote: >> > Hm. Looking at how this is currently used - I am afraid it's not >> > correct... the reason RelationGetIndexList() returns a copy is that >> > cache invalidations will throw away that list. And you do index_open() >> > while iterating over it which will accept invalidation messages. >> > Mybe it's better to try using RelationGetIndexList directly and measure >> > whether that has a measurable impact= >> By looking at the comments of RelationGetIndexList:relcache.c, >> actually the method of the patch is correct because in the event of a >> shared cache invalidation, rd_indexvalid is set to 0 when the index >> list is reset, so the index list would get recomputed even in the case >> of shared mem invalidation. > > The problem I see is something else. Consider code like the following: > > RelationFetchIndexListIfInvalid(toastrel); > foreach(lc, toastrel->rd_indexlist) >toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); > > index_open calls relation_open calls LockRelationOid which does: > if (res != LOCKACQUIRE_ALREADY_HELD) >AcceptInvalidationMessages(); > > So, what might happen is that you open the first index, which accepts an > invalidation message which in turn might delete the indexlist. Which > means we would likely read invalid memory if there are two indexes. And I imagine that you have the same problem even with RelationGetIndexList, not only RelationGetIndexListIfInvalid, because this would appear as long as you try to open more than 1 index with an index list. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
On 2013-06-21 23:19:27 -0400, Tom Lane wrote: > Robert Haas writes: > > On Fri, Jun 21, 2013 at 5:44 PM, Tom Lane wrote: > >> The traditional theory has been that that would be less robust, not > >> more so. Child backends are (mostly) able to carry out queries whether > >> or not the postmaster is around. > > > I think that's the Tom Lane theory. The Robert Haas theory is that if > > the postmaster has died, there's no reason to suppose that it hasn't > > corrupted shared memory on the way down, or that the system isn't > > otherwise heavily fuxxored in some way. > > Eh? The postmaster does its level best never to touch shared memory > (after initialization anyway). I am not sure that will never happen - but I think the chain of argument misses the main point. Normally we rely on the postmaster to kill off all other backends if a backend PANICs or segfaults for all the known reasons. As soon as there's no postmaster anymore we lose that capability. And *that* is scary imo. Especially as I would say the chance of getting PANICs or segfaults increases if there's no postmaster anymore, since we might reach code branches we otherwise wouldn't. > >> True, you can't make new connections, > >> but how does killing the existing children make that better? > > > It allows you to start a new postmaster in a timely fashion, instead > > of waiting for an idle connection that may not ever terminate without > > operator intervention. And it's not always easy to figure out which cluster those backends belong to if there are multiple postgres instances running as the same user. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: > On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund > wrote: > > Hm. Looking at how this is currently used - I am afraid it's not > > correct... the reason RelationGetIndexList() returns a copy is that > > cache invalidations will throw away that list. And you do index_open() > > while iterating over it which will accept invalidation messages. > > Mybe it's better to try using RelationGetIndexList directly and measure > > whether that has a measurable impact= > By looking at the comments of RelationGetIndexList:relcache.c, > actually the method of the patch is correct because in the event of a > shared cache invalidation, rd_indexvalid is set to 0 when the index > list is reset, so the index list would get recomputed even in the case > of shared mem invalidation. The problem I see is something else. Consider code like the following: RelationFetchIndexListIfInvalid(toastrel); foreach(lc, toastrel->rd_indexlist) toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); index_open calls relation_open calls LockRelationOid which does: if (res != LOCKACQUIRE_ALREADY_HELD) AcceptInvalidationMessages(); So, what might happen is that you open the first index, which accepts an invalidation message which in turn might delete the indexlist. Which means we would likely read invalid memory if there are two indexes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hardware donation
On 21 June 2013 20:03, Jim Nasby wrote: > Who can be point of contact from the community to arrange shipping, etc? Do they need to be shipped? Can we just leave them where they are and arrange access and power charges to be passed to SPI? Sounds like it would be cheaper and easier to leave them where they are and they won't get damaged in transit then. Of course, may not be possible. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT
Le lundi 17 juin 2013 00:02:21, Fabien COELHO a écrit : > >> What activity would you expect? > > > > A patch which applies cleanly to git HEAD. This one doesn't for me, > > although I'm not really sure why, I don't see any obvious conflicts. > > Please find attached a freshly generated patch against current master. * Submission review: the patch is in unified format and applies on HEAD. The patch contains documentation; however, I believe 'AS IMPLICIT' is a PostgreSQL extension with special behavior, whereas 'AS EXPLICIT' respects the standard behavior: PostgreSQL only adds the spelling 'AS EXPLICIT' (which is also the default in the standard). So maybe it is possible to rephrase this piece: @@ -411,8 +427,8 @@ CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS ASSIGNMENT; SQL standard, except that SQL does not make provisions for binary-coercible types or extra arguments to implementation functions. - AS IMPLICIT is a PostgreSQL - extension, too. + AS IMPLICIT and AS EXPLICIT are + a PostgreSQL extension, too. After digging in the archive and the CF: the original request is at https://commitfest.postgresql.org/action/patch_view?id=563 * Usability review ** Does the patch actually implement that? Yes. ** Do we want that? Back in 2012 Tom raised arguments against it, or at least gave it less than a clear +1. The patch adds nothing but a more explicit creation statement; it has gone untouched for 2 years without much interest, so it is surely not something critically important for PostgreSQL users. However, we already have non-standard words for CREATE CAST, and this new one is not very intrusive. ** Does it follow SQL spec, or the community-agreed behavior? It does not follow the SQL spec. ** Does it include pg_dump support (if applicable)? No, but it is probably not interesting to add that to the pg_dump output: it increases incompatibility with the SQL spec for no gain. The result is that the patch only allows CREATE CAST ... AS EXPLICIT to run without error. pg_dump won't know whether the CAST has been created with the default or with an 'explicit default'... ** Are there dangers? It seems not. * Feature test ** Does the feature work as advertised? Yes. ** Are there corner cases the author has failed to consider? I think not, but my skills with the parser are limited (gram.y, ...). ** Are there any assertion failures or crashes? No. * Performance review: not relevant. * Coding review The patch does not pass this test: ./check_keywords.pl gram.y ../../../src/include/parser/kwlist.h I had to update the unreserved keyword list in order to be able to build PostgreSQL. ** Does it follow the project coding guidelines? Yes. ** Are there portability issues? No (only for SQL). ** Will it work on Windows/BSD etc? Yes. ** Are the comments sufficient and accurate? Yes. ** Does it do what it says, correctly? Yes. ** Does it produce compiler warnings? It doesn't build as is; the patch needs an update. ** Can you make it crash? No. * Architecture review ** Is everything done in a way that fits together coherently with other features/modules? Yes. ** Are there interdependencies that can cause problems? No. I flag it 'returned with feedback'; please update the patch so it builds. Everything else is OK. -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
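For reference, what the patch permits is just an explicit spelling of the current default; a sketch using the type pair from the docs hunk above (only one cast per type pair can exist at a time, of course):

-- Today: omitting the clause creates an explicit-only cast.
CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint);

-- With the patch applied, the same thing can be spelled out:
CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS EXPLICIT;

-- The existing non-default variants, for comparison:
CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS ASSIGNMENT;
CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS IMPLICIT;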
[HACKERS] A better way than tweaking NTUP_PER_BUCKET
Previous discussions of Hash Joins have noted that the performance decreases when the average number of tuples per bucket increases. O(N^2) effects are seen. We've argued about this in many different ways, yet all of those discussions have centred around the constant NTUP_PER_BUCKET. I believe that was a subtle mistake and I have a proposal. The code in ExecChooseHashTableSize() line 460 says /* * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when * memory is filled. ... but the calculation then sets the number of buckets like this dbuckets = ceil(ntuples / NTUP_PER_BUCKET); **This doesn't match the comment.** If we underestimate the number of tuples and go on to fill the available memory, we then end up with an average number of tuples per bucket higher than NTUP_PER_BUCKET. That is a notational confusion that has been skewing the discussion. The correct calculation that would match the objective set out in the comment would be dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET; which leads us to a much more useful value of dbuckets in the case where hashing ntuples occupies much less space than is available. This value is always the same or higher than before because of the if-test that surrounds it. Given my experience that on larger tables we end up underestimating ndistinct by 10-100-1000 times, I don't think this change is unwarranted. This solves the problem in earlier discussions since we get a lower average number of tuples per bucket and yet we also get to keep the current NTUP_PER_BUCKET value. Everybody wins. Comments? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services rethink_num_hash_buckets.v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
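To make the difference between the two formulas concrete, here is the arithmetic under some made-up numbers (large work_mem, badly underestimated ntuples); this is only an illustration, not the actual ExecChooseHashTableSize() code:

#include <math.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10				/* current value in nodeHash.c */

int
main(void)
{
	long	hash_table_bytes = 1024L * 1024 * 1024;	/* 1 GB of work_mem (assumed) */
	int		tupsize = 100;				/* assumed average tuple size */
	double	ntuples = 100000;			/* planner's (under)estimate */

	/* what the code does today: driven purely by the row estimate */
	long	dbuckets_now = (long) ceil(ntuples / NTUP_PER_BUCKET);

	/* what the comment describes: driven by what fits in memory */
	long	dbuckets_new = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

	printf("today: %ld buckets, proposed: %ld buckets\n",
		   dbuckets_now, dbuckets_new);		/* 10,000 vs ~1,073,741 */
	return 0;
}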
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
* Simon Riggs (si...@2ndquadrant.com) wrote: > On 22 June 2013 08:46, Stephen Frost wrote: > >>The next limit faced by sorts is > >> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about > >> 150 GiB when sorting int4. > > > > That's frustratingly small. :( > > But that has nothing to do with this patch, right? And is easily fixed, yes? I don't know about 'easily fixed' (consider supporting a HashJoin of >2B records) but I do agree that dealing with places in the code where we are using an int4 to keep track of the number of objects in memory is outside the scope of this patch. Hopefully we are properly range-checking and limiting ourselves to only what a given node can support and not solely depending on MaxAllocSize to keep us from overflowing some int4 which we're using as an index for an array or as a count of how many objects we've currently got in memory, but we'll want to consider carefully what happens with such large sets as we're adding support into nodes for these Huge allocations (along with the recent change to allow 1TB work_mem, which may encourage users with systems large enough to actually try to set it that high... :) Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] XLogInsert scaling, revisited
On 21.06.2013 21:55, Jeff Janes wrote: I think I'm getting an undetected deadlock between the checkpointer and a user process running a TRUNCATE command. This is the checkpointer: #0 0x003a73eeaf37 in semop () from /lib64/libc.so.6 #1 0x005ff847 in PGSemaphoreLock (sema=0x7f8c0a4eb730, interruptOK=0 '\000') at pg_sema.c:415 #2 0x004b0abf in WaitOnSlot (upto=416178159648) at xlog.c:1775 #3 WaitXLogInsertionsToFinish (upto=416178159648) at xlog.c:2086 #4 0x004b657a in CopyXLogRecordToWAL (write_len=32, isLogSwitch=1 '\001', rdata=0x0, StartPos=, EndPos=416192397312) at xlog.c:1389 #5 0x004b6fb2 in XLogInsert (rmid=0 '\000', info=, rdata=0x7fff0020) at xlog.c:1209 #6 0x004b7644 in RequestXLogSwitch () at xlog.c:8748 Hmm, it looks like the xlog-switch is trying to wait for itself to finish. The concurrent TRUNCATE is just being blocked behind the xlog-switch, which is stuck on itself. I wasn't able to reproduce exactly that, but I got a PANIC by running pgbench and concurrently doing "select pg_switch_xlog()" many times in psql. Attached is a new version that fixes at least the problem I saw. Not sure if it fixes what you saw, but it's worth a try. How easily can you reproduce that? This is using the same testing harness as in the last round of this patch. This one? http://www.postgresql.org/message-id/CAMkU=1xoa6fdyoj_4fmlqpiczr1v9gp7clnxjdhu+iggqb6...@mail.gmail.com Is there a way for me to dump the list of held/waiting lwlocks from gdb? You can print out the held_lwlocks array. Or to make it more friendly, write a function that prints it out and call that from gdb. There's no easy way to print out who's waiting for what that I know of. Thanks for the testing! - Heikki xloginsert-scale-24.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
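For the record, with a debug build gdb can already dump these directly, e.g. "print num_held_lwlocks" followed by "print held_lwlocks[0]@num_held_lwlocks" (both are static in lwlock.c, so do it from a frame inside that file). A callable helper is a little friendlier; a sketch, not something in the tree:

/* Call from gdb with "call dump_held_lwlocks()". */
void
dump_held_lwlocks(void)
{
	int		i;

	for (i = 0; i < num_held_lwlocks; i++)
		fprintf(stderr, "held lwlock %d: %d\n", i, (int) held_lwlocks[i]);
}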
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On 22 June 2013 08:46, Stephen Frost wrote: >>The next limit faced by sorts is >> INT_MAX concurrent tuples in memory, which limits helpful work_mem to about >> 150 GiB when sorting int4. > > That's frustratingly small. :( > But that has nothing to do with this patch, right? And is easily fixed, yes? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On 13 May 2013 15:26, Noah Misch wrote: > A memory chunk allocated through the existing palloc.h interfaces is limited > to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE() need > not check its own 1 GiB limit, and algorithms that grow a buffer by doubling > need not check for overflow. However, a handful of callers are quite happy to > navigate those hazards in exchange for the ability to allocate a larger chunk. > > This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check > a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording > whether they were allocated as huge; one can start with palloc() and then > repalloc_huge() to grow the value. I like the design and think its workable. I'm concerned that people will accidentally use MaxAllocSize. Can we put in a runtime warning if someone tests AllocSizeIsValid() with a larger value? > To demonstrate, I put this to use in > tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's > the trace_sort from building the pgbench_accounts primary key at scale factor > 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB: > > LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed > 391.21 sec > > Compare: > > LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec > elapsed 1146.05 sec Cool. I'd like to put in an explicit test for this somewhere. Obviously not part of normal regression, but somewhere, at least, so we have automated testing that we all agree on. (yes, I know we don't have that for replication/recovery yet, but thats why I don't want to repeat that mistake). > This was made easier by tuplesort growth algorithm improvements in commit > 8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before > (TODO item "Allow sorts to use more available memory"), and Tom floated the > idea[1] behind the approach I've used. The next limit faced by sorts is > INT_MAX concurrent tuples in memory, which limits helpful work_mem to about > 150 GiB when sorting int4. > > I have not added variants like palloc_huge() and palloc0_huge(), and I have > not added to the frontend palloc.h interface. There's no particular barrier > to doing any of that. I don't expect more than a dozen or so callers, so most > of the variations might go unused. > > The comment at MaxAllocSize said that aset.c expects doubling the size of an > arbitrary allocation to never overflow, but I couldn't find the code in > question. AllocSetAlloc() does double sizes of blocks used to aggregate small > allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless, > that expectation does apply to dozens of repalloc() users outside aset.c, and > I preserved it for repalloc_huge(). 64-bit builds will never notice, and I > won't cry for the resulting 2 GiB limit on 32-bit. Agreed. Can we document this for the relevant parameters? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
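A sketch of the kind of runtime warning suggested above (not part of the patch; AllocSizeIsValid() is currently just a macro comparing against MaxAllocSize, and the function name here is invented):

/*
 * Debug-build variant: warn when a request larger than MaxAllocSize is
 * about to be rejected by the ordinary palloc path, pointing the
 * developer at the *_huge() variants instead.
 */
static inline bool
AllocSizeIsValidWarn(Size size)
{
	if (size > MaxAllocSize)
	{
		elog(WARNING,
			 "request of " UINT64_FORMAT " bytes exceeds MaxAllocSize; "
			 "did you mean MemoryContextAllocHuge()?",
			 (uint64) size);
		return false;
	}
	return true;
}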
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Noah, * Noah Misch (n...@leadboat.com) wrote: > This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check > a higher MaxAllocHugeSize limit of SIZE_MAX/2. Nice! I've complained about this limit a few different times and just never got around to addressing it. > This was made easier by tuplesort growth algorithm improvements in commit > 8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before > (TODO item "Allow sorts to use more available memory"), and Tom floated the > idea[1] behind the approach I've used. The next limit faced by sorts is > INT_MAX concurrent tuples in memory, which limits helpful work_mem to about > 150 GiB when sorting int4. That's frustratingly small. :( [...] > --- 1024,1041 >* new array elements even if no other memory were currently > used. >* >* We do the arithmetic in float8, because otherwise the > product of > ! * memtupsize and allowedMem could overflow. Any inaccuracy in > the > ! * result should be insignificant; but even if we computed a > ! * completely insane result, the checks below will prevent > anything > ! * really bad from happening. >*/ > double grow_ratio; > > grow_ratio = (double) state->allowedMem / (double) memNowUsed; > ! if (memtupsize * grow_ratio < INT_MAX) > ! newmemtupsize = (int) (memtupsize * grow_ratio); > ! else > ! newmemtupsize = INT_MAX; > > /* We won't make any further enlargement attempts */ > state->growmemtuples = false; I'm not a huge fan of moving directly to INT_MAX. Are we confident that everything can handle that cleanly..? I feel like it might be a bit safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly paranoid, but there's an awful lot of callers and some loop which +2's and then overflows would suck, eg: int x = INT_MAX; for (x-1; (x-1) < INT_MAX; x += 2) { myarray[x] = 5; } Also, could this be used to support hashing larger sets..? If we change NTUP_PER_BUCKET to one, we could end up wanting to create a hash table larger than INT_MAX since, with 8-byte pointers, that'd only be around 134M tuples. Haven't had a chance to review the rest, but +1 on the overall idea. :) Thanks! Stephen signature.asc Description: Digital signature