[HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 30 July 2015 at 12:26, Greg Stark st...@mit.edu wrote: On Thu, Jul 30, 2015 at 12:09 PM, Heikki Linnakangas hlinn...@iki.fi wrote: True, you need a heap to hold the next tuple from each tape in the merge step. To avoid exceeding work_mem, you'd need to push some tuples from the in-memory array to the tape to make room for that. In practice, though, the memory needed for the merge step's heap is tiny. Even if you merge 1000 tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll need some logic to share/divide the in-memory array between the heap and the in-memory tail of the last tape. It's a bit worse than that because we buffer up a significant chunk of the tape to avoid randomly seeking between tapes after every tuple read. But I think in today's era of large memory we don't need anywhere near the entire work_mem just to buffer to avoid random access. Something simple like a fixed buffer size per tape, probably much less than 1MB/tape, would do. MERGE_BUFFER_SIZE is currently 0.25 MB, but there was benefit seen above that. I'd say we should scale that up to 1 MB if memory allows. Yes, that could be tiny for a small number of runs. I mention it because Heikki's comment that we could use this in the general case would not be true for larger numbers of runs. The number of runs decreases quickly with more memory anyway, so the technique is mostly good for larger memory, but certainly not for small memory allocations. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
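The heap-memory point above can be made concrete with a small, self-contained sketch (plain C, not PostgreSQL code; all names here are invented for illustration): a k-way merge keeps exactly one entry per tape in its heap, so heap memory is O(ntapes) regardless of total data volume, and the per-tape read buffers (MERGE_BUFFER_SIZE) are what actually dominate the memory budget.

```c
/* Simplified k-way merge sketch: the binary heap holds exactly one
 * entry per input run ("tape"), illustrating why the merge heap itself
 * only needs memory for ntapes tuples. Not PostgreSQL code. */
#include <assert.h>
#include <stdlib.h>

typedef struct {
    const int *run;   /* sorted input run ("tape") */
    int len;          /* number of tuples in the run */
    int pos;          /* next tuple to read */
} Tape;

typedef struct { int value; int tape; } HeapEntry;

static void sift_down(HeapEntry *h, int n, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].value < h[m].value) m = l;
        if (r < n && h[r].value < h[m].value) m = r;
        if (m == i) break;
        HeapEntry t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge ntapes sorted runs into out[]; returns tuples written. */
int merge_runs(Tape *tapes, int ntapes, int *out)
{
    /* Heap memory: one entry per tape, i.e. O(ntapes), independent of
     * the total number of tuples being merged. */
    HeapEntry *heap = malloc(ntapes * sizeof(HeapEntry));
    int n = 0, written = 0;

    for (int i = 0; i < ntapes; i++)
        if (tapes[i].len > 0)
            heap[n++] = (HeapEntry){ tapes[i].run[tapes[i].pos++], i };
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    while (n > 0) {
        out[written++] = heap[0].value;
        Tape *t = &tapes[heap[0].tape];
        if (t->pos < t->len)
            heap[0].value = t->run[t->pos++];   /* refill from same tape */
        else
            heap[0] = heap[--n];                /* tape exhausted */
        sift_down(heap, n, 0);
    }
    free(heap);
    return written;
}
```

The per-tape buffering the thread discusses would sit in front of `tapes[i].run`, reading MERGE_BUFFER_SIZE-sized chunks; the heap's footprint stays one tuple per tape either way.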
Re: [HACKERS] [PATCH] Microvacuum for gist.
Hi! On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova lubennikov...@gmail.com wrote: I have written microvacuum support for the gist access method. Briefly, microvacuum includes two steps: 1. When a search tells us that a tuple is invisible to all transactions, it is marked LP_DEAD and the page is marked as having dead tuples. 2. Then, when an insert touches a full page which has dead tuples, it calls microvacuum instead of splitting the page. You can find a kind of review here [1]. [1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120 Patch is in attachments. Please review it. Nice! Some notes about this patch. 1) Could you give some test case demonstrating that microvacuum really works with the patch? Ultimately, we should see the index growing less with microvacuum. 2) Generating notices for every dead tuple would be too noisy. I suggest replacing the notice with one of the debug levels. + elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d", offnum, BufferGetBlockNumber(buffer)); 3) Please, recheck the coding style. For instance, this line needs more spaces, and the open brace should be on the next line. + if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData)) { -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] multivariate statistics / patch v7
Hi, On 07/30/2015 10:21 AM, Heikki Linnakangas wrote: On 05/25/2015 11:43 PM, Tomas Vondra wrote: There are 6 files attached, but only 0002-0006 are actually part of the multivariate statistics patch itself. All of these patches are huge. In order to review this in a reasonable amount of time, we need to do this in several steps. So let's see what would be the minimal set of these patches that could be reviewed and committed, while still being useful. The main patches are: 1. shared infrastructure and functional dependencies 2. clause reduction using functional dependencies 3. multivariate MCV lists 4. multivariate histograms 5. multi-statistics estimation Would it make sense to commit only patches 1 and 2 first? Would that be enough to get a benefit from this? I agree that the patch can't be reviewed as a single chunk - that was the idea when I split the original (single chunk) patch into multiple smaller pieces. And yes, I believe committing pieces 1 and 2 might be enough to get something useful, which can then be improved by adding the usual MCV and histogram stats on top of that. I have some doubts about the clause reduction and functional dependencies part of this. It seems to treat functional dependency as a boolean property, but even with the classic zipcode and city case, it's not always an all-or-nothing thing. At least in some countries, there can be zipcodes that span multiple cities. So zipcode=X does not completely imply city=Y, although there is a strong correlation (if that's the right term). How strong does the correlation need to be for this patch to decide that zipcode implies city? I couldn't actually see a clear threshold stated anywhere. So rather than treating functional dependence as a boolean, I think it would make more sense to put a 0.0-1.0 number to it. That means that you can't do clause reduction like it's done in this patch, where you actually remove clauses from the query for cost estimation purposes. 
Instead, you need to calculate the selectivity for each clause independently, but instead of just multiplying the selectivities together, apply the dependence factor to it. Does that make sense? I haven't really looked at the MCV, histogram and multi-statistics estimation patches yet. Do those patches make the clause reduction patch obsolete? Should we forget about the clause reduction and functional dependency patch, and focus on those later patches instead? Perhaps. It's true that most real-world data sets are not 100% valid with respect to functional dependencies - either because of natural imperfections (multiple cities with the same ZIP code) or just noise in the data (incorrect entries ...). And it's even mentioned in the code comments somewhere, I guess. But there are two main reasons why I chose not to extend the functional dependencies with the [0.0-1.0] value you propose. Firstly, functional dependencies were meant to be the simplest possible implementation, illustrating how the infrastructure is supposed to work (which is the main topic of the first patch). Secondly, all kinds of statistics are simplifications of the actual data. So I think it's not incorrect to ignore the exceptions up to some threshold. I also don't think this will make the estimates globally better. Let's say you have 1% of rows that contradict the functional dependency - you may either ignore them and have good estimates for 99% of the values and incorrect estimates for 1%, or tweak the rule a bit and make the estimates worse for 99% (and possibly better for 1%). That being said, I'm not against improving the functional dependencies. I already do have some improvements on my TODO - like for example dependencies on more columns (not just A=B but [A,B]=C and such), but I think we should not squash this into those two patches. 
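The 0.0-1.0 dependency degree being discussed could combine per-clause selectivities roughly like this (a hedged sketch: the blending formula and names below are illustrative assumptions, not taken from the patch):

```c
/* Illustrative sketch of a [0.0, 1.0] functional-dependency degree.
 * f = 1.0: B is fully implied by A, so the implied clause adds no extra
 * filtering and the combined selectivity is min(sA, sB).
 * f = 0.0: independent clauses, plain multiplication as today.
 * The linear blend in between is an assumption for illustration. */
#include <assert.h>

double combined_selectivity(double sA, double sB, double f)
{
    double dependent = (sA < sB) ? sA : sB;   /* implied clause adds nothing */
    double independent = sA * sB;             /* classic independence model */
    return f * dependent + (1.0 - f) * independent;
}
```

With f near 1.0 this avoids the underestimate that multiplying the two selectivities would produce for zipcode-and-city style clauses, without removing either clause from the query.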
And yet another point - ISTM these cases might easily be handled better by the statistics based on ndistinct coefficients, as proposed by Kyotaro-san some time ago. That is, compute and track ndistinct(A) * ndistinct(B) / ndistinct(A,B) for all pairs of columns (or possibly larger groups). That seems to be similar to the coefficient you propose. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
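The ndistinct coefficient mentioned above is easy to compute naively for illustration (self-contained C with invented names; a real implementation would estimate the ndistinct values from a sample rather than scanning exact data):

```c
/* Sketch of the ndistinct-coefficient idea:
 *   ndistinct(A) * ndistinct(B) / ndistinct(A,B)
 * For fully dependent columns (B determined by A) the coefficient equals
 * ndistinct(B); for independent columns it approaches 1.0. */
#include <assert.h>

/* Naive distinct count over n (a,b) pairs; use_a/use_b select which
 * columns participate (O(n^2), fine for illustration). */
static int ndistinct(const int *a, const int *b, int n, int use_a, int use_b)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        int dup = 0;
        for (int j = 0; j < i; j++) {
            if ((!use_a || a[i] == a[j]) && (!use_b || b[i] == b[j])) {
                dup = 1;
                break;
            }
        }
        if (!dup)
            count++;
    }
    return count;
}

double ndistinct_coefficient(const int *a, const int *b, int n)
{
    double da  = ndistinct(a, b, n, 1, 0);
    double db  = ndistinct(a, b, n, 0, 1);
    double dab = ndistinct(a, b, n, 1, 1);
    return da * db / dab;
}
```

For a zipcode/city-style pair where each zipcode maps to one city, ndistinct(A,B) equals ndistinct(A), so the coefficient collapses to ndistinct(B); for independent columns it tends toward 1.0, which is what makes it usable as a continuous dependence measure.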
Re: [HACKERS] 9.5a1 BUG FIX: pgbench negative latencies
On 07/27/2015 03:43 PM, Fabien COELHO wrote: Under 9.5a1 pgbench -r negative latencies are reported on meta commands, probably as an oversight of 84f0ea3f. This patch ensures that now is reset on each loop inside doCustom. Applied, thanks! - Heikki
Re: [HACKERS] [COMMITTERS] pgsql: Row-Level Security Policies (RLS)
On 30 July 2015 at 01:35, Joe Conway m...@joeconway.com wrote: On 06/01/2015 02:21 AM, Dean Rasheed wrote: While going through this, I spotted another issue --- in a DML query with additional non-target relations, such as UPDATE t1 .. FROM t2 .., the old code was checking the UPDATE policies of both t1 and t2, but really I think it ought to be checking the SELECT policies of t2 (in the same way as this query requires SELECT table permissions on t2, not UPDATE permissions). I've changed that and added new regression tests to test that change. I assume the entire refactoring patch needs a fair bit of work to rebase against current HEAD, Actually, there haven't been any conflicting changes so far, so a git rebase was able to automatically merge correctly -- new patch attached, with some minor comment rewording (not affecting the bug-fix part). Even so, I agree that it makes sense to apply the bug-fix separately, since it's not really anything to do with the refactoring. but I picked out the attached to address just the above issue. Does this look correct, and if so does it make sense to apply at least this part right now? Looks correct to me. Thanks. Regards, Dean diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c new file mode 100644 index bcf4a8f..cb689ec --- a/src/backend/commands/policy.c +++ b/src/backend/commands/policy.c @@ -186,9 +186,6 @@ policy_role_list_to_array(List *roles, i /* * Load row security policy from the catalog, and store it in * the relation's relcache entry. - * - * We will always set up some kind of policy here. If no explicit policies - * are found then an implicit default-deny policy is created. 
*/ void RelationBuildRowSecurity(Relation relation) @@ -246,7 +243,6 @@ RelationBuildRowSecurity(Relati char *with_check_value; Expr *with_check_qual; char *policy_name_value; - Oid policy_id; bool isnull; RowSecurityPolicy *policy; @@ -298,14 +294,11 @@ RelationBuildRowSecurity(Relati else with_check_qual = NULL; - policy_id = HeapTupleGetOid(tuple); - /* Now copy everything into the cache context */ MemoryContextSwitchTo(rscxt); policy = palloc0(sizeof(RowSecurityPolicy)); policy->policy_name = pstrdup(policy_name_value); - policy->policy_id = policy_id; policy->polcmd = cmd_value; policy->roles = DatumGetArrayTypePCopy(roles_datum); policy->qual = copyObject(qual_expr); @@ -326,40 +319,6 @@ RelationBuildRowSecurity(Relati systable_endscan(sscan); heap_close(catalog, AccessShareLock); - - /* - * Check if no policies were added - * - * If no policies exist in pg_policy for this relation, then we need - * to create a single default-deny policy. We use InvalidOid for the - * Oid to indicate that this is the default-deny policy (we may decide - * to ignore the default policy if an extension adds policies). 
- */ - if (rsdesc->policies == NIL) - { - RowSecurityPolicy *policy; - Datum role; - - MemoryContextSwitchTo(rscxt); - - role = ObjectIdGetDatum(ACL_ID_PUBLIC); - - policy = palloc0(sizeof(RowSecurityPolicy)); - policy->policy_name = pstrdup("default-deny policy"); - policy->policy_id = InvalidOid; - policy->polcmd = '*'; - policy->roles = construct_array(role, 1, OIDOID, sizeof(Oid), true, - 'i'); - policy->qual = (Expr *) makeConst(BOOLOID, -1, InvalidOid, - sizeof(bool), BoolGetDatum(false), - false, true); - policy->with_check_qual = copyObject(policy->qual); - policy->hassublinks = false; - - rsdesc->policies = lcons(policy, rsdesc->policies); - - MemoryContextSwitchTo(oldcxt); - } } PG_CATCH(); { diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c new file mode 100644 index 2c65a90..c28eb2b --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -1815,14 +1815,26 @@ ExecWithCheckOptions(WCOKind kind, Resul break; case WCO_RLS_INSERT_CHECK: case WCO_RLS_UPDATE_CHECK: - ereport(ERROR, - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + if (wco->polname != NULL) + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("new row violates row level security policy \"%s\" for \"%s\"", + wco->polname, wco->relname))); + else + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("new row violates row level security policy for \"%s\"", wco->relname))); break; case WCO_RLS_CONFLICT_CHECK: - ereport(ERROR, - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + if (wco->polname != NULL) + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("new row violates row level security policy \"%s\" (USING expression) for \"%s\"", + wco->polname, wco->relname))); + else + ereport(ERROR, +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), +
[HACKERS] [PATCH] Microvacuum for gist.
Hi, I have written microvacuum support for the gist access method. Briefly, microvacuum includes two steps: 1. When a search tells us that a tuple is invisible to all transactions, it is marked LP_DEAD and the page is marked as having dead tuples. 2. Then, when an insert touches a full page which has dead tuples, it calls microvacuum instead of splitting the page. You can find a kind of review here [1]. [1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120 Patch is in attachments. Please review it. -- Best regards, Lubennikova Anastasia microvacuum_for_gist.patch Description: Binary data
[HACKERS] 64-bit XIDs again
Hackers, I know there were already a couple of threads about 64-bit XIDs. http://www.postgresql.org/message-id/42cccfe9.9040...@intellilink.co.jp http://www.postgresql.org/message-id/4f6c0e13.3080...@wiesinger.com I read them carefully, but I didn't find all the arguments for 64-bit XIDs mentioned. That's why I'd like to raise this subject again. Hardware capabilities are now much higher than when Postgres was designed. In modern PostgreSQL scalability tests it's typical to achieve 400,000-500,000 tps with pgbench. At such tps it takes only a few minutes to reach the default autovacuum_freeze_max_age = 200 million. The notion of wraparound has been evolving over time. Initially it was something that almost never happens. Then it became something that could happen rarely, and that we should be ready for (freeze tuples in advance). Now, it has become a quite frequent periodic event for high-load databases, and DB admins have to take its performance impact into account. A typical scenario that I've faced in real life is this. A database is divided into an operative part and an archive part. The operative part is small (dozens of gigabytes) and serves most of the transactions. The archive part is relatively large (some terabytes) and serves rare selects and bulk inserts. Autovacuum works very actively on the operative part and very lazily on the archive part (as expected). The system works well until one day the age of the archive tables exceeds autovacuum_freeze_max_age. Then all autovacuum workers start doing anti-wraparound autovacuum on the archive tables. Even if the system I/O survives this, the operative tables get bloated because all autovacuum workers are busy with the archive tables. In such situations I typically advise increasing autovacuum_freeze_max_age and running vacuum freeze manually when the system has enough free resources. As I mentioned in the CSN thread, it would be nice to replace XID with CSN when setting hint bits for a tuple. 
In this case, when hint bits are set we don't need any additional lookups to check visibility. http://www.postgresql.org/message-id/CAPpHfdv7BMwGv=ofug3s-jgvfkqhi79pr_zk1wsk-13oz+c...@mail.gmail.com Introducing a 32-bit CSN doesn't seem reasonable to me, because it would double our troubles with wraparound. Also, I think it's possible to migrate to 64-bit XIDs without breaking pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be created with 64-bit XIDs. We can use free bits in t_infomask2 to distinguish the old and new formats. Any thoughts? Do you think 64-bit XIDs are worth it? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
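The t_infomask2 idea can be sketched as follows (all names, bit values, and field layout here are invented for illustration; these are not actual PostgreSQL structures):

```c
/* Hedged sketch of the migration idea: reserve an assumed-free bit in
 * t_infomask2 to mean "this tuple carries 64-bit XIDs", so pages written
 * by an old cluster keep their 32-bit xmin/xmax readable after
 * pg_upgrade, while new tuples use the wide format. */
#include <assert.h>
#include <stdint.h>

#define SKETCH_XID_IS_64BIT 0x2000   /* hypothetical free t_infomask2 bit */

typedef struct {
    uint16_t t_infomask2;
    uint32_t xmin32;     /* old-format field */
    uint64_t xmin64;     /* new-format field (would live elsewhere on disk) */
} TupleSketch;

/* Read xmin as a 64-bit value regardless of on-disk format. */
uint64_t sketch_get_xmin(const TupleSketch *tup, uint64_t old_epoch)
{
    if (tup->t_infomask2 & SKETCH_XID_IS_64BIT)
        return tup->xmin64;
    /* Old-format tuples are interpreted relative to the epoch they were
     * written in (epoch 0 right after pg_upgrade). */
    return (old_epoch << 32) | tup->xmin32;
}
```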
[HACKERS] [BUG] CustomScan-custom_plans are not copied
Hello, The attached patch fixes my careless omissions from when the custom_plans field was added. Please apply this fix. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com pgsql-v9.5-fix-copy-custom-scan.patch Description: pgsql-v9.5-fix-copy-custom-scan.patch
Re: [HACKERS] creating extension including dependencies
On 2015-07-27 15:18, Michael Paquier wrote: On Sun, Jul 26, 2015 at 1:01 AM, Petr Jelinek wrote: Yes that's what I meant by the change of checking order in the explanation above. I did that because I thought the code would be more complicated otherwise, but apparently I was stupid... +In case the extension specifies schema in its control file, the schema s/schema/<literal>schema</literal>/ +++ b/src/test/modules/test_extensions/.gitignore @@ -0,0 +1,3 @@ +# Generated subdirectories +/results/ +/tmp_check/ test_extensions/.gitignore is missing /log/. Something also has not been discussed yet: what to do with new_version and old_version (the options of CreateExtensionStmt)? As of now, if those options are defined they are not passed down to the parent extensions, but shouldn't we raise an error if they are used in combination with CASCADE? In any case, I think that the behavior chosen should be documented. I don't see why we should raise an error; they are used just for the top-level extension and I think it makes sense that way. The CASCADE documentation says the SCHEMA option is cascaded to required extensions; do we need to say something more than that (i.e. explicitly saying that old_version and new_version aren't)? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] 64-bit XIDs again
On 07/30/2015 05:57 PM, Alexander Korotkov wrote: On Thu, Jul 30, 2015 at 5:24 PM, Heikki Linnakangas hlinn...@iki.fi wrote: I think we should move to 64-bit XIDs in in-memory structs snapshots, proc array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums, you could just let the clog grow arbitrarily large. Nice idea. Storing extra epoch would be extra 4 bytes per heap tuple instead of extra 8 bytes per tuple if storing 64 bits xmin/xmax. No, I was thinking that the epoch would be stored *per page*, in the page header. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LWLock deadlock and gdb advice
On 07/29/2015 09:35 PM, Andres Freund wrote: On 2015-07-29 20:23:24 +0300, Heikki Linnakangas wrote: Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on it. The lock holder releases the lock, and wakes up A. But before A wakes up and sees that the lock is free, another backend acquires the lock again. It runs LWLockAcquireWithVar to the point just before setting the variable's value. Now A wakes up, sees that the lock is still (or again) held, and that the variable's value still matches the old one, and goes back to sleep. The new lock holder won't wake it up until it updates the value again, or to releases the lock. I'm not sure whether this observation is about my patch or the general lwlock variable mechanism. In my opinion that behaviour exists today both in 9.4 and 9.5. In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held, until it has updated the variable. And LWLockWaitForVar() holds the spinlock when it checks that the lock is held and that the variable's value matches. So it cannot happen on 9.4. To reiterate, with 9.5, it's possible that a backend is sleeping in LWLockWaitForVar(oldvar=123), even though the lock is currently held by another backend with value 124. That seems wrong, or surprising at the very least. But I think that's fine because that race seems pretty fundamental. After all, you could have called LWLockWaitForVar() just after the second locker had set the variable to the same value. I'm not talking about setting it to the same value. I'm talking about setting it to a different value. (I talked about setting it to the same value later in the email, and I agree that's a pretty fundamental problem and exists with 9.4 as well). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 64-bit XIDs again
On 07/30/2015 08:04 AM, Simon Riggs wrote: There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. My feeling is that the overhead will recede in time. Having a nice, simple change to remove old bugs and new would help us be more robust. But let's measure the overhead before we try to optimize it away. Field experience would agree with you. The amount of memory people are arbitrarily throwing at databases now is pretty significant. It is common to have 64GB of memory. Heck, I run into 128GB all the time, and seeing 192GB is no longer a "Wow". JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development.
Re: [HACKERS] 64-bit XIDs again
On 07/30/2015 07:14 AM, Simon Riggs wrote: On 30 July 2015 at 14:26, Alexander Korotkov a.korot...@postgrespro.ru Any thoughts? Do you think 64-bit XIDs are worth it? The problem of freezing is painful, but not impossible, which is why we have held out so long. The problem of very long lived snapshots is coming closer at the same speed as freezing; there is no solution to that without 64-bit xids throughout the whole infrastructure, or CSNs. The opportunity for us to have SQL Standard historical databases becomes possible with 64-bit xids, or CSNs. That is a high value goal. I personally now think we should thoroughly investigate 64-bit xids. I don't see this as mere debate, I see this as something that we can make a patch for and scientifically analyze the pros and cons through measurement. +1 I've been thinking along similar lines to both of you for quite some time now. I think at the least we should explore an initdb-time option -- we can and should measure the pros and cons. 
-- Joe Conway
Re: [HACKERS] 64-bit XIDs again
On Thu, Jul 30, 2015 at 5:24 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/30/2015 04:26 PM, Alexander Korotkov wrote: Also, I think it's possible to migrate to 64-bit XIDs without breaking pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be created with 64-bit XIDs. We can use free bits in t_infomask2 to distinguish old and new formats. I think we should move to 64-bit XIDs in in-memory structs snapshots, proc array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums, you could just let the clog grow arbitrarily large. Nice idea. Storing an extra epoch would be an extra 4 bytes per heap tuple instead of an extra 8 bytes per tuple for storing 64-bit xmin/xmax. But if the first column is aligned to 8 bytes (e.g. bigserial), would we lose this 4-byte win to alignment? There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] 64-bit XIDs again
On 30 July 2015 at 14:26, Alexander Korotkov a.korot...@postgrespro.ru wrote: Any thoughts? Do you think 64-bit XIDs are worth it? The problem of freezing is painful, but not impossible, which is why we have held out so long. The problem of very long lived snapshots is coming closer at the same speed as freezing; there is no solution to that without 64-bit xids throughout the whole infrastructure, or CSNs. The opportunity for us to have SQL Standard historical databases becomes possible with 64-bit xids, or CSNs. That is a high value goal. I personally now think we should thoroughly investigate 64-bit xids. I don't see this as mere debate, I see this as something that we can make a patch for and scientifically analyze the pros and cons through measurement. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] 64-bit XIDs again
On 07/30/2015 04:26 PM, Alexander Korotkov wrote: Also, I think it's possible to migrate to 64-bit XIDs without breaking pg_upgrade. Old tuples can be leaved with 32-bit XIDs while new tuples would be created with 64-bit XIDs. We can use free bits in t_infomask2 to distinguish old and new formats. I think we should move to 64-bit XIDs in in-memory structs snapshots, proc array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums, you could just let the clog grow arbitrarily large. There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
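Heikki's page-epoch scheme can be sketched like this (invented names; a hedged illustration of the arithmetic, not actual PostgreSQL code):

```c
/* Sketch: each page header stores a 64-bit "base" XID, tuples keep
 * 32-bit xmin/xmax, and the logical 64-bit XID is reconstructed by
 * addition. The invariant is that all XIDs on one page fit within a
 * 2^31 window, so a 32-bit offset is always unambiguous; tuples must
 * be frozen before the window would be exceeded. */
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t base_xid;   /* per-page epoch base */
} PageHeaderSketch;

/* Reconstruct the logical 64-bit XID from the stored 32-bit value. */
uint64_t logical_xid(const PageHeaderSketch *page, uint32_t stored)
{
    return page->base_xid + stored;
}

/* Can xid be stored on this page without freezing old tuples first?
 * Only if its offset from the page's base fits in the 2^31 window. */
int xid_fits_on_page(const PageHeaderSketch *page, uint64_t xid)
{
    return xid >= page->base_xid &&
           xid - page->base_xid < ((uint64_t) 1 << 31);
}
```

When `xid_fits_on_page` returns false, the page's remaining live tuples would be frozen and `base_xid` advanced, which is the "freeze when about to happen" step in the description above; the clog, keyed by full 64-bit XIDs, can then grow without forcing anti-wraparound vacuums.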
Re: [HACKERS] 64-bit XIDs again
On 30 July 2015 at 15:24, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/30/2015 04:26 PM, Alexander Korotkov wrote: Also, I think it's possible to migrate to 64-bit XIDs without breaking pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be created with 64-bit XIDs. We can use free bits in t_infomask2 to distinguish old and new formats. I think we should move to 64-bit XIDs in in-memory structs snapshots, proc array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums, you could just let the clog grow arbitrarily large. This is a good scheme, but it assumes, as you say, that you can freeze tuples before they are more than 2^31 xids apart. That is no longer a safe assumption on high transaction rate systems with longer lived snapshots. There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. My feeling is that the overhead will recede in time. Having a nice, simple change to remove old bugs and new would help us be more robust. But let's measure the overhead before we try to optimize it away. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] brin index vacuum versus transaction snapshots
Alvaro Herrera alvhe...@2ndquadrant.com writes: Kevin Grittner wrote: If you run `make installcheck` against a cluster with default_transaction_isolation = 'repeatable read' you get one failure: I am tempted to say that we should just disallow running vacuum on a table containing a brin index in a transaction-snapshot transaction. Huh? We don't allow vacuum inside a (user started) transaction now. What new restriction are you proposing? regards, tom lane
Re: [HACKERS] 64-bit XIDs again
Gavin Flower gavinflo...@archidevsys.co.nz writes: On 31/07/15 02:24, Heikki Linnakangas wrote: There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. I think having a special case to save 32 bits per tuple would cause unnecessary complications, and the savings are minimal compared to the size of current modern storage devices and the typical memory used in serious database servers. I think the argument that the savings are minimal is pretty thin. It all depends on how wide your tables are --- but on a narrow table, say half a dozen ints, the current tuple size is 24 bytes header plus the same number of bytes of data. We'd be going up to 32 bytes header which makes for a 16% increase in physical table size. If your table is large, claiming that 16% doesn't hurt is just silly. But the elephant in the room is on-disk compatibility. There is absolutely no way that we can just change xmin/xmax to 64 bits without a disk format break. However, if we do something like what Heikki is suggesting, it's at least conceivable that we could convert incrementally (ie, if you find a page with the old header format, assume all tuples in it are part of epoch 0; and do not insert new tuples into it unless there is room to convert the header to new format ... but I'm not sure what we do about tuple deletion if the old page is totally full and we need to write an xmax that's past 4G). Only if you are willing to kiss off on-disk compatibility is it even worth having a discussion about whether we can afford more bloat in HeapTupleHeader. And that would be a pretty big pain point for a lot of users. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
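Tom's size arithmetic checks out: for the narrow-table example, growing the header from 24 to 32 bytes while the data stays at 24 bytes gives (32+24)/(24+24) ≈ 1.167, roughly a 16% increase in physical table size. A trivial check:

```c
/* Relative tuple growth when only the header widens: the narrower the
 * data portion, the larger the percentage hit from a bigger header. */
#include <assert.h>

double tuple_growth(int old_header, int new_header, int data)
{
    return (double)(new_header + data) / (old_header + data) - 1.0;
}
```

For wider rows the hit shrinks (e.g. 200 bytes of data gives under 4%), which is why the cost depends so strongly on how narrow the tables are.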
Re: [HACKERS] 64-bit XIDs again
On 07/30/2015 07:24 AM, Heikki Linnakangas wrote: I think we should move to 64-bit XIDs in in-memory structs (snapshots, proc array, etc.), and expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums; you could just let the clog grow arbitrarily large. When I introduced the same idea a few years back, having the clog get arbitrarily large was cited as a major issue. I was under the impression that clog size had some major performance impacts. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 64-bit XIDs again
On 31/07/15 02:24, Heikki Linnakangas wrote: On 07/30/2015 04:26 PM, Alexander Korotkov wrote: Also, I think it's possible to migrate to 64-bit XIDs without breaking pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be created with 64-bit XIDs. We can use free bits in t_infomask2 to distinguish old and new formats. I think we should move to 64-bit XIDs in in-memory structs (snapshots, proc array, etc.), and expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums, you could just let the clog grow arbitrarily large. There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. - Heikki I think having a special case to save 32 bits per tuple would cause unnecessary complications, and the savings are minimal compared to the size of current modern storage devices and the typical memory used in serious database servers. I think it is too much pain for very little gain, especially when looking into the future growth in storage capacity and bandwidth. The early mainframes used a base displacement technique to keep the size of addresses down in instructions: 16 bit addresses, comprising 4 bits for a base register and 12 bits for the displacement (hence the use of 4 KB page sizes now!). Necessary at the time when mainframes were often less than 128 KB! Now it would be ludicrous to do that for modern servers! 
Cheers, Gavin (Who is ancient enough, to have programmed such MainFrames!) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
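Heikki's page-epoch scheme can be sketched abstractly. This is an illustrative rendering, not PostgreSQL code; all names are mine. The key constraint is the one he states: no two XIDs on the same page may be more than 2^31 apart, or the page must be frozen first.

```python
# Hypothetical sketch: heap tuples keep 32-bit xmin/xmax, while the page
# header carries an epoch, so the logical XID is 64 bits wide.

def logical_xid(page_epoch: int, stored_xid: int) -> int:
    """Reconstruct the full 64-bit XID from the page epoch and the
    32-bit value physically stored in the tuple header."""
    return (page_epoch << 32) | stored_xid

def needs_freeze(xids_on_page: list[int], new_xid: int) -> bool:
    """The trick only works while no two logical XIDs on the same page
    are more than 2^31 apart; beyond that, old tuples must be frozen
    before the new XID can be stored."""
    xids = xids_on_page + [new_xid]
    return max(xids) - min(xids) >= (1 << 31)

xid = logical_xid(page_epoch=1, stored_xid=100)
print(xid)                                   # 4294967396, i.e. past the 4G mark
print(needs_freeze([xid], xid + (1 << 31)))  # True: this insert forces a freeze
```

Nothing wraps around at 4G here: the clog simply keeps growing, which is the trade-off Josh raises above.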
Re: [HACKERS] multivariate statistics / patch v7
Hi, On 07/30/2015 06:58 PM, Heikki Linnakangas wrote: The problem with a threshold is that around that threshold, even a small change in the data set can drastically change the produced estimates. For example, imagine that we know from the stats that zip code implies city. But then someone adds a single row to the table with an odd zip code city combination, which pushes the estimator over the threshold, and the columns are no longer considered dependent, and the estimates are now completely different. We should avoid steep cliffs like that. BTW, what is the threshold in the current patch? There's not a simple threshold - the algorithm mining the functional dependencies is a bit more complicated. I tried to explain it in the comment before build_mv_dependencies (in dependencies.c), but let me briefly summarize it here. To mine dependency [A => B], build_mv_dependencies does this:

(1) sort the sample by {A,B}
(2) split the sample into groups with the same value of A
(3) for each group, decide if it's consistent with the dependency
    (a) if the group is too small (less than 3 rows), ignore it
    (b) if the group is consistent, update n_supporting / n_supporting_rows
    (c) if the group is inconsistent, update n_contradicting / n_contradicting_rows
(4) decide whether the dependency is valid by checking

    n_supporting_rows >= n_contradicting_rows * 10

The limit is rather arbitrary and yes - I can imagine a more complex condition (e.g. looking at average number of tuples per group etc.), but I haven't looked into that - the point was to use something very simple, only to illustrate the infrastructure. I think we might come up with some elaborate way of associating a degree with the functional dependency, but at that point we really lose the simplicity, and also make it indistinguishable from the remaining statistics (because it won't be possible to reduce the clauses like this, before performing the regular estimation). 
Which is exactly what makes the functional dependencies so neat and efficient, so I'm not overly enthusiastic about doing that. What seems more interesting is implementing the ndistinct coefficient instead, as proposed by Kyotaro-san - that seems to have the nice smooth behavior you desire, while keeping the simplicity. Both statistics types (functional dependencies and ndistinct coeff) have one weak point, though - they somehow assume the queries use compatible values. For example if you use a query with WHERE city = 'New York' AND zip = 'zip for Detroit' they can't detect cases like this, because those statistics types are oblivious to individual values. I don't see this as a fatal flaw, though - it's rather a consequence of the nature of the stats. And I tend to look at the functional dependencies the same way. If you need stats without these issues you'll have to use MCV list or a histogram. Trying to fix the simple statistics types is futile, IMHO. regards Tomas -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
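The mining rule Tomas describes can be sketched compactly. This is an illustrative Python rendering of the steps, not the patch's actual C code; the 3-row minimum and the 10:1 support ratio are the thresholds he quotes, and the function name is mine.

```python
# Toy sketch of mining the functional dependency "A implies B" from a
# sample of (A, B) pairs, following the four steps described above.
from itertools import groupby

def implies(sample, min_group=3, ratio=10):
    """Return True if the sample supports the dependency A => B."""
    supporting = contradicting = 0
    rows = sorted(sample)                       # (1) sort by {A, B}
    for _, group in groupby(rows, key=lambda r: r[0]):  # (2) group by A
        group = list(group)
        if len(group) < min_group:              # (3a) ignore tiny groups
            continue
        if len({b for _, b in group}) == 1:     # (3b) one B value: consistent
            supporting += len(group)
        else:                                   # (3c) mixed B values
            contradicting += len(group)
    # (4) valid if support dominates contradiction 10:1
    return supporting >= contradicting * ratio

zip_city = [("10001", "NYC")] * 5 + [("48201", "Detroit")] * 5
print(implies(zip_city))  # True: each zip maps to exactly one city
```

Note how a handful of rows with an odd zip/city combination can flip the result, which is exactly the cliff Heikki worries about.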
Re: [HACKERS] brin index vacuum versus transaction snapshots
Kevin Grittner wrote: If you run `make installcheck` against a cluster with default_transaction_isolation = 'repeatable read' you get one failure: I am tempted to say that we should just disallow running vacuum on a table containing a brin index in a transaction-snapshot transaction. It is possible to silence the problem by checking for vacuum flags, but (without testing) I think there will be a problem because the snapshot is acquired too early and it is possible for concurrent transactions to insert tuples in the table that the summarizing scan will not see, which will cause the index to become effectively corrupt. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane t...@sss.pgh.pa.us wrote: Josh Berkus j...@agliodbs.com writes: On 07/17/2015 04:36 PM, Jim Nasby wrote: I'm guessing it'd be really ugly/hard to support at least this GUC being multi-line? Mind you, multi-line GUCs would be useful otherwise, but we don't want to hinge this feature on making that work. I'm pretty sure that changing the GUC parser to allow quoted strings to continue across lines would be trivial. The problem with it is not that it's hard, it's that omitting a closing quote mark would then result in the entire file being syntactically broken, with the error message(s) almost certainly pointing somewhere else than where the actual mistake is. Do we really want such a global reduction in friendliness to make this feature easier? Maybe shoehorning this into the GUC mechanism is the wrong thing, and what we really need is a new config file for this. The information we're proposing to store seems complex enough to justify that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 64-bit XIDs again
On Thu, Jul 30, 2015 at 5:31 PM, Gavin Flower gavinflo...@archidevsys.co.nz wrote: On 31/07/15 02:24, Heikki Linnakangas wrote: On 07/30/2015 04:26 PM, Alexander Korotkov wrote: Also, I think it's possible to migrate to 64-bit XIDs without breaking pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples would be created with 64-bit XIDs. We can use free bits in t_infomask2 to distinguish old and new formats. I think we should move to 64-bit XIDs in in-memory structs (snapshots, proc array, etc.), and expand clog to handle 64-bit XIDs. But keep the xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to the page header so that logically the xmin/xmax fields on the page are 64 bits wide, but physically stored in 32 bits. That's possible as long as no two XIDs on the same page are more than 2^31 XIDs apart. So you still need to freeze old tuples on the page when that's about to happen, but it would make it possible to have more than 2^32 XID transactions in the clog. You'd never be forced to do anti-wraparound vacuums, you could just let the clog grow arbitrarily large. There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. - Heikki I think having a special case to save 32 bits per tuple would cause unnecessary complications, and the savings are minimal compared to the size of current modern storage devices and the typical memory used in serious database servers. I think it is too much pain for very little gain, especially when looking into the future growth in storage capacity and bandwidth. The early mainframes used a base displacement technique to keep the size of addresses down in instructions: 16 bit addresses, comprising 4 bits for a base register and 12 bits for the displacement (hence the use of 4 KB page sizes now!). Necessary at the time when mainframes were often less than 128 KB! 
Now it would be ludicrous to do that for modern servers! Cheers, Gavin (Who is ancient enough to have programmed such MainFrames!) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers On the other hand, PG's tuple overhead is already the largest among the alternatives. Even if storage keeps getting faster and cheaper, you can't ignore the overhead of adding yet another 8 bytes to each tuple.
Re: [HACKERS] Improving test coverage of extensions with pg_dump
On Fri, Jul 31, 2015 at 6:41 AM, Andreas Karlsson andr...@proxel.se wrote: The comment is good, but what I personally do not like is that the path to the test suite is non-obvious and not self-explanatory; I would not expect src/test/tables_fk/t/ to test pg_dump and extensions. I find it too confusing to regard the test suite as testing an extension called tables_fk since in my mind that is just a necessary tool built to test something else. This is obviously a subjective thing, and I see that for example Heikki likes it the way it is. I am fine with setting this as ready for committer and letting a committer weigh in on the naming. Well, honestly, I would just have it in src/test/modules and call it a deal. Because it is now the only extension that has such interactions with perl tests. And because if modules/ gets bloated later on, we could extend prove_check to install modules from extra paths, just that it does not seem worth it to do it now for one module, and there is no guarantee that we'll have that many around. Of course there is no guarantee either that modules/ is not going to get bloated. Note as well that I will be fine with any decision taken by the committer who picks up this, this test case is useful by itself in any case. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Microvacuum for gist.
On 7/30/15 7:33 AM, Alexander Korotkov wrote: 2) Generating notices for every dead tuple would be too noisy. I suggest to replace notice with one of debug levels. + elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d", offnum, BufferGetBlockNumber(buffer)); Even that seems like pretty serious overkill. vacuumlazy.c doesn't have anything like that, and I don't think the BTree code does either. If you were debugging something and actually needed it I'd say drop in a temporary printf(). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory array must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. It's more complicated than it appears, I think. Tuples may be variable sized. WRITETUP() performs a pfree(), and gives us back a variable amount of availMem. What if we dumped a single, massive, outlier tuple out when a caller passes it and it goes to the root of the heap? We'd dump that massive tuple in one go (this would be an incremental dumptuples() call, which we still do in the patch), making things !LACKMEM() again, but by a usually comfortable margin. We read in a few more regular tuples, but we're done consuming tuples before things ever get LACKMEM() again (no more dumping needed, at least with this patch applied). What prevents the tuple at the top of the in-memory heap at the point of tuplesort_performsort() (say, one of the ones added to the heap as our glut of memory was *partially* consumed) being less than the last/greatest tuple on tape? If the answer is nothing, a merge step is clearly required. This is not a problem when every single tuple is dumped, but that doesn't happen anymore. I probably should have shown more tests that tested HeapTuple sorts (not just datum tuple sorts). I agree that things at least usually happen as you describe, FWIW. I think it'd make sense to structure the code differently, to match the way I described this optimization above. Instead of adding a new tuplesort state for this, abstract this in the logical tape code. Add a function to attach an in-memory tail to a tape, and have LogicalTapeRead() read from the tail after reading the on-disk tape. The rest of the code wouldn't need to care that sometimes part of the tape is actually in memory. I'll need to think about all of that. 
Certainly, quicksorting runs in a more general way seems like a very promising idea, and I know that this patch does not go far enough. But it also adds this idea of not dumping out most tuples, which seems impossible to generalize further, so maybe it's a useful special case to start from for that reason (and also because it's where the pain is currently felt most often).

+* Note that there might actually be 2 runs, but only the
+* contents of one of them went to tape, and so we can
+* safely pretend that there is only 1 run (since we're
+* about to give up on the idea of the memtuples array being
+* a heap). This means that if our sort happened to require
+* random access, the similar single run optimization
+* below (which sets TSS_SORTEDONTAPE) might not be used at
+* all. This is because dumping all tuples out might have
+* forced an otherwise equivalent randomAccess case to
+* acknowledge a second run, which we can avoid.

Is that really true? We don't start a second run until we have to, i.e. when it's time to dump the first tuple of the second run to tape. So I don't think the case you describe above, where you have two runs but only one of them has tuples on disk, can actually happen. I think we're talking about two slightly different things. I agree that I am avoiding starting a second run because I am avoiding dumping tuples, just as you say (I describe this as avoiding acknowledging a second run). But there could still be SortTuples that have a tupindex that isn't 0 (they could be 1, to be specific). It's pretty clear from looking at the TSS_BUILDRUNS case within puttuple_common() that this is true. So, if instead you define starting a run as adding a SortTuple with a tupindex that isn't 0, then yes, this comment is true. 
The important thing is that since we're not dumping every tuple, it doesn't matter whether or not that TSS_BUILDRUNS case within puttuple_common() ever took the currentRun + 1 insertion path (which can easily happen), provided things aren't so skewed that it ends up on tape even without dumping all tuples (which seems much less likely). As I've said, this optimization will occur a lot more often than the existing one run optimization (assuming !randomAccess), as a nice side benefit of not dumping every tuple. Quicksort does not use HEAPCOMPARE(), so clearly having multiple runs in that subrun is a non-issue. Whether or not we say that a second run started, or that there was merely the unfulfilled intent to start a new, second run is just semantics. While I certainly care about semantics, my point is that we agree that this useful "pretend there is only one run" thing happens (I think). The existing one run optimization only really occurs when the range of values in the set of tuples is well characterized by the tuple values observed during initial heapification, which is bad. Or would be bad, if the existing
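Peter's point about a residual merge step can be illustrated abstractly. This is a toy sketch (keys stand in for tuples, lists for tapes; the function name is mine): Heikki's cheap "just concatenate" path only applies when the last tuple on tape sorts at or below the first in-memory tuple, and otherwise a real merge is needed.

```python
# Sketch of finishing a run whose tail is still in memory: quicksort the
# in-memory array, then either append it to the on-tape prefix (when safe)
# or merge the two sorted sequences.
import heapq

def finish_run(on_tape, in_memory):
    """Combine an already-sorted on-tape prefix with an unsorted
    in-memory tail into one fully sorted run."""
    in_memory = sorted(in_memory)          # the "quicksort" step
    if not on_tape or on_tape[-1] <= in_memory[0]:
        return on_tape + in_memory         # cheap case: plain concatenation
    return list(heapq.merge(on_tape, in_memory))  # otherwise, merge

# A dumped outlier (9) can exceed later in-memory values (3, 7),
# so concatenation alone would produce an unsorted run here.
print(finish_run([1, 4, 9], [3, 7, 12]))  # [1, 3, 4, 7, 9, 12]
```

`heapq.merge` consumes both sequences lazily, which mirrors how a real merge step would stream from the tape rather than rereading it wholesale.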
Re: [HACKERS] TAP tests are badly named
On 07/30/2015 12:40 PM, Andrew Dunstan wrote: We should describe test sets by what they test, not by how they test. TAP is a testing tool/protocol. The current set of tests we have test the programs in src/bin, and we should really name the test set by a name that reflects that, rather than the fact that we are using TAP tools to run the tests. What if we decide to test something else using TAP? Would we call that set of tests TAP tests too? --enable-tap-tests is a reasonable configuration setting, because it's about whether or not we have a TAP testing framework available, but I think we should stop calling the bin tests TAP tests and we should change the test name in vcregress.pl to a more appropriate name. In the buildfarm I'm calling the step bin-check: http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2015-07-30%2012%3A25%3A58&stg=bin-check Thoughts? In fact, looking more closely at the changes that have been made to vcregress.pl, I don't really like the way this has been done. I'm putting together some changes to bring things more into line with how the Makefiles work. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
Hackers, In guc.c, the maximum for wal_buffers is INT_MAX. However, wal_buffers is actually measured in 8KB buffers, not in bytes. This means that users are able to set wal_buffers > 2GB. When the database is started, this can cause a core dump if the WAL offset is > 2GB. Attached patch fixes this issue by lowering the maximum for wal_buffers:

josh@Radegast:~/pg94a$ bin/pg_ctl -D data start
server starting
josh@Radegast:~/pg94a$ LOG: 393216 is outside the valid range for parameter "wal_buffers" (-1 .. 262143)
FATAL: configuration file "/home/josh/pg94a/data/postgresql.conf" contains errors

Thanks to Andrew Gierth for diagnosing this issue. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 1b7b914..b3dac51
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*** static struct config_int ConfigureNamesInt[] =
*** 2215,2221 ****
  		GUC_UNIT_XBLOCKS
  	},
  	&XLOGbuffers,
! 	-1, -1, INT_MAX,
  	check_wal_buffers, NULL, NULL
  },
--- 2215,2221 ----
  		GUC_UNIT_XBLOCKS
  	},
  	&XLOGbuffers,
! 	-1, -1, (INT_MAX / XLOG_BLCKSZ),
  	check_wal_buffers, NULL, NULL
  },

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
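The arithmetic behind the new limit is easy to verify; a small sketch (assuming the default XLOG_BLCKSZ of 8192 bytes) reproduces the 262143 ceiling shown in the error message:

```python
# wal_buffers is counted in XLOG_BLCKSZ units, so the byte total must
# stay representable in a signed 32-bit offset, i.e. strictly below 2 GB.
INT_MAX = 2**31 - 1
XLOG_BLCKSZ = 8192  # default WAL block size

max_wal_buffers = INT_MAX // XLOG_BLCKSZ
print(max_wal_buffers)                        # 262143, matching the error message
print(max_wal_buffers * XLOG_BLCKSZ < 2**31)  # True: byte offset stays under 2 GB
```

One more buffer (262144) would be exactly 2 GB of WAL buffers, which overflows a signed 32-bit byte offset; hence the patch caps the GUC at INT_MAX / XLOG_BLCKSZ rather than INT_MAX.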
Re: [HACKERS] 64-bit XIDs again
On 2015-07-30 23:23, Tom Lane wrote: Gavin Flower gavinflo...@archidevsys.co.nz writes: On 31/07/15 02:24, Heikki Linnakangas wrote: There is a big downside to expanding xmin/xmax to 64 bits: it takes space. More space means more memory needed for caching, more memory bandwidth, more I/O, etc. I think having a special case to save 32 bits per tuple would cause unnecessary complications, and the savings are minimal compared to the size of current modern storage devices and the typical memory used in serious database servers. I think the argument that the savings are minimal is pretty thin. It all depends on how wide your tables are --- but on a narrow table, say half a dozen ints, the current tuple size is 24 bytes header plus the same number of bytes of data. We'd be going up to 32 bytes header which makes for a 16% increase in physical table size. If your table is large, claiming that 16% doesn't hurt is just silly. But the elephant in the room is on-disk compatibility. There is absolutely no way that we can just change xmin/xmax to 64 bits without a disk format break. However, if we do something like what Heikki is suggesting, it's at least conceivable that we could convert incrementally (ie, if you find a page with the old header format, assume all tuples in it are part of epoch 0; and do not insert new tuples into it unless there is room to convert the header to new format ... We could theoretically do a similar thing with 64-bit xmin/xmax though - detect that a page is in the old format and convert all tuples there to 64-bit xmin/xmax. But I agree that we don't want to increase bloat per tuple as it's already too big. but I'm not sure what we do about tuple deletion if the old page is totally full and we need to write an xmax that's past 4G). If the page is too full we could move some data to a different (or new) page. 
For me the bigger issue is that we'll still have to refreeze pages: if tuples are updated or deleted in a different epoch than the one they were inserted in, the new version of the tuple has to go to a different page, and the old page will have free space that can't be used by new tuples since the system is now in a different epoch. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP
On 07/29/2015 03:24 AM, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: I don't have a problem with rebuilding the SSL context on every reload cycle. We already do a lot of extra reloading every time, so a bit more shouldn't hurt. But I'm not so sure whether we should do that in the SIGHUP handler. I don't know how we got into the situation of doing all the file reloads directly in the handler, but at least we can control that code. Making a bunch of calls into an external library is a different thing, though. Can we find a way to do this differently? Do we have an idea how expensive it is to load that data? A brute-force answer is to not have the postmaster load it at all, but to have new backends do so (if needed) during their connection acceptance/authentication phase. I'm not sure how much that would add to the SSL connection startup time though. It would also mean that problems with the SSL config files would only be reported during subsequent connection starts, not at SIGHUP time, and indeed that SIGHUP is more or less meaningless for SSL file changes: the instant you change a file, it's live for later connections. On the plus side, it would make Windows and Unix behavior closer, since (I suppose) we're reloading that stuff anyway in EXEC_BACKEND builds. I measured it taking ~0.3ms to build the new SSL context in a simple benchmark (certificate + CA + small crl). Personally I do not think moving this to connection start would be worth it since reporting errors that late is not nice for people who have misconfigured their database, and even if my benchmarks indicates it is relatively cheap to reload SSL adding more work to connection establishing is something I would want to avoid unless it gives us a clear benefit. I'm not entirely sure your concern is valid, though. We have always had the principle that almost everything of interest in the postmaster happens in signal handler functions. 
We could possibly change things so that reloading config files is done in the main loop of ServerLoop, but if we did, it would have to execute with all signals blocked, which seems like just about as much of a risk for third-party code as executing that code in a signal handler is. Agreed, I am not sure what moving it to the main loop would gain us. -- Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 11:32 AM, Robert Haas robertmh...@gmail.com wrote: Very interesting. And great performance numbers. Thanks for taking the time to investigate this - really cool. Thanks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 3:47 AM, Simon Riggs si...@2ndquadrant.com wrote: This is a good optimization for the common case where tuples are mostly already in order. We could increase the usefulness of this by making UPDATE pick blocks that are close to a tuple's original block, rather than putting them near the end of a relation. Not sure what you mean here. So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. It's about use of memory. If you have multiple runs on tape, then they will need to be merged and you need memory to do that efficiently. If there are tuples in the last batch still in memory then it can work, but it depends upon how full memory is from the last batch and how many batches there are. I agree that this optimization has a lot to do with exploiting the fact that you don't need to free the memtuples array for future runs because you've already received all tuples (or keep space free for previous runs). I think that we should still use quicksort on runs where this optimization doesn't work out, but I also still think that that's a different idea. Doing what I've proposed here when there are multiple runs seems less valuable, if only because you're not going to avoid that much writing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updatable view?
On 7/30/15 1:03 AM, Tatsuo Ishii wrote: Is anyone working on implementing or interested in implementing automatic updatable view which uses two or more tables involved (join)? SQL1999 allows it in certain conditions. I think it would be nice to have... but not to the point of working on it myself. Might be worth an email to -general to see how many people have immediate use for it. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: backend niceness / session_priority
On 7/30/15 10:54 AM, Tom Lane wrote: =?ISO-8859-15?Q?Jos=E9_Luis_Tall=F3n?= jltal...@adv-solutions.net writes: Since PostgreSQL lacks the resource management capabilities of the Big Ones ( Resource Groups - Red, WorkLoad Manager - Blue ) or the Resource Governor in MS SQL Server, we can try and approximate the requested behaviour by reducing the CPU priority (nice) of the backend in question. Please note that we would be using scheduler priority to try and modulate I/O, though I'm aware of the limitations of this mechanism. This has been proposed before, and rejected before, and I'm not seeing anything particularly new here. Without a credible mechanism for throttling I/O, nice alone does not seem very promising. Some OSes respect nice when it comes to IO scheduling, so it might still be useful. What I'm worried about is priority inversion[1]. What might be useful would be to add a set of GUCs similar to vacuum_cost_* that operated at the shared buffer level. Dunno where you'd put the sleep though (presumably all the functions where you'd put the accounting are too low-level to sleep in). [1] https://en.wikipedia.org/wiki/Priority_inversion -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 4:26 AM, Greg Stark st...@mit.edu wrote: I'm a bit confused where the big win comes from though. Is what's going on that the external sort only exceeded memory by a small amount so nearly all the tuples are still in memory? Yes, that's why this can be much faster just as the work_mem threshold is crossed. You get an almost internal sort, which means you can mostly quicksort, and you can avoid dumping most tuples. It's still a pretty nice win when less than half of tuples fit in memory, though -- just not as nice. Below that, the optimization isn't used. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improving test coverage of extensions with pg_dump
On 07/30/2015 04:48 AM, Michael Paquier wrote: On Thu, Jul 30, 2015 at 5:54 AM, Andreas Karlsson andr...@proxel.se wrote: What I do not like though is how the path src/test/tables_fk/t/ tells us nothing about what features of PostgreSQL are tested here. For this I personally prefer the earlier versions where I think that was clear. +Simple module used to check data dump ordering of pg_dump with tables +linked with foreign keys. The README mentions that this is to test pg_dump, perhaps referring to the TAP tests makes sense as well? The comment is good, but what I personally do not like is that the path to the test suite is non-obvious and not self-explanatory; I would not expect src/test/tables_fk/t/ to test pg_dump and extensions. I find it too confusing to regard the test suite as testing an extension called tables_fk since in my mind that is just a necessary tool built to test something else. This is obviously a subjective thing, and I see that for example Heikki likes it the way it is. I am fine with setting this as ready for committer and letting a committer weigh in on the naming. Another thought: would it be worthwhile to also add an assertion to check if the data really was restored properly or would that just be redundant code? That seems worth doing, hence added something for it. Thanks for the suggestion. At the same time I have added an entry in .gitignore for the generated tmp_check/. Nice changes. -- Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] creating extension including dependencies
On Thu, Jul 30, 2015 at 10:58 PM, Petr Jelinek wrote: On 2015-07-27 15:18, Michael Paquier wrote: Something also has not been discussed yet: what to do with new_version and old_version (the options of CreateExtensionStmt)? As of now if those options are defined they are not passed down to the parent extensions, but shouldn't we raise an error if they are used in combination with CASCADE? In any case, I think that the behavior chosen should be documented. I don't see why we should raise an error, they are used just for the top-level extension and I think it makes sense that way. CASCADE documentation says the SCHEMA option is cascaded to required extensions, do we need to say something more than that (ie explicitly saying that the old_version and new_version aren't)? OK, let's do so then. I think that we should still document the fact that the old and new version strings are not passed to the parent extensions when cascade is used, for clarity. Something like: Other options are not recursively applied when the CASCADE clause is used. I have been through this patch one last time and changed the following: - Improved documentation: missing markups with literal, SCHEMA is a clause and not a parameter, added explanation that options other than SCHEMA are not applied recursively with CASCADE - Corrected .gitignore in test_extensions, log/ was missing. Those are minor things though, hence I just switched the patch to Ready for committer.
-- Michael diff --git a/contrib/hstore_plperl/expected/hstore_plperl.out b/contrib/hstore_plperl/expected/hstore_plperl.out index cf384eb..25fc506 100644 --- a/contrib/hstore_plperl/expected/hstore_plperl.out +++ b/contrib/hstore_plperl/expected/hstore_plperl.out @@ -1,6 +1,6 @@ -CREATE EXTENSION hstore; -CREATE EXTENSION plperl; -CREATE EXTENSION hstore_plperl; +CREATE EXTENSION hstore_plperl CASCADE; +NOTICE: installing required extension hstore +NOTICE: installing required extension plperl SELECT transforms.udt_schema, transforms.udt_name, routine_schema, routine_name, group_name, transform_type diff --git a/contrib/hstore_plperl/expected/hstore_plperlu.out b/contrib/hstore_plperl/expected/hstore_plperlu.out index 8c689ad..3fc3201 100644 --- a/contrib/hstore_plperl/expected/hstore_plperlu.out +++ b/contrib/hstore_plperl/expected/hstore_plperlu.out @@ -1,6 +1,6 @@ -CREATE EXTENSION hstore; -CREATE EXTENSION plperlu; -CREATE EXTENSION hstore_plperlu; +CREATE EXTENSION hstore_plperlu CASCADE; +NOTICE: installing required extension hstore +NOTICE: installing required extension plperlu SELECT transforms.udt_schema, transforms.udt_name, routine_schema, routine_name, group_name, transform_type diff --git a/contrib/hstore_plperl/sql/hstore_plperl.sql b/contrib/hstore_plperl/sql/hstore_plperl.sql index 0f70f14..9398aed 100644 --- a/contrib/hstore_plperl/sql/hstore_plperl.sql +++ b/contrib/hstore_plperl/sql/hstore_plperl.sql @@ -1,6 +1,4 @@ -CREATE EXTENSION hstore; -CREATE EXTENSION plperl; -CREATE EXTENSION hstore_plperl; +CREATE EXTENSION hstore_plperl CASCADE; SELECT transforms.udt_schema, transforms.udt_name, routine_schema, routine_name, diff --git a/contrib/hstore_plperl/sql/hstore_plperlu.sql b/contrib/hstore_plperl/sql/hstore_plperlu.sql index 3cfb2fd..8d8508c 100644 --- a/contrib/hstore_plperl/sql/hstore_plperlu.sql +++ b/contrib/hstore_plperl/sql/hstore_plperlu.sql @@ -1,6 +1,4 @@ -CREATE EXTENSION hstore; -CREATE EXTENSION plperlu; -CREATE EXTENSION 
hstore_plperlu; +CREATE EXTENSION hstore_plperlu CASCADE; SELECT transforms.udt_schema, transforms.udt_name, routine_schema, routine_name, diff --git a/contrib/hstore_plpython/expected/hstore_plpython.out b/contrib/hstore_plpython/expected/hstore_plpython.out index b7a6a92..55f4efe 100644 --- a/contrib/hstore_plpython/expected/hstore_plpython.out +++ b/contrib/hstore_plpython/expected/hstore_plpython.out @@ -1,5 +1,5 @@ -CREATE EXTENSION plpython2u; -CREATE EXTENSION hstore_plpython2u; +CREATE EXTENSION hstore_plpython2u CASCADE; +NOTICE: installing required extension plpython2u -- test hstore - python CREATE FUNCTION test1(val hstore) RETURNS int LANGUAGE plpythonu diff --git a/contrib/hstore_plpython/sql/hstore_plpython.sql b/contrib/hstore_plpython/sql/hstore_plpython.sql index 9ff2ebc..d55beda 100644 --- a/contrib/hstore_plpython/sql/hstore_plpython.sql +++ b/contrib/hstore_plpython/sql/hstore_plpython.sql @@ -1,5 +1,4 @@ -CREATE EXTENSION plpython2u; -CREATE EXTENSION hstore_plpython2u; +CREATE EXTENSION hstore_plpython2u CASCADE; -- test hstore - python diff --git a/contrib/ltree_plpython/expected/ltree_plpython.out b/contrib/ltree_plpython/expected/ltree_plpython.out index 934529e..9bee6be 100644 --- a/contrib/ltree_plpython/expected/ltree_plpython.out +++ b/contrib/ltree_plpython/expected/ltree_plpython.out @@ -1,5 +1,5 @@ -CREATE EXTENSION plpython2u; -CREATE EXTENSION ltree_plpython2u; +CREATE EXTENSION ltree_plpython2u CASCADE; +NOTICE:
Re: [HACKERS] Proposal: backend niceness / session_priority
On 07/31/2015 12:18 AM, Jim Nasby wrote: This has been proposed before, and rejected before, and I'm not seeing anything particularly new here. Without a credible mechanism for throttling I/O, nice alone does not seem very promising. Some OSes respect nice when it comes to IO scheduling, so it might still be useful. Wouldn't the bgwriter remove a lot of the usefulness? Andreas
Re: [HACKERS] Reduce ProcArrayLock contention
On Wed, Jul 29, 2015 at 11:48 PM, Andres Freund and...@anarazel.de wrote: On 2015-07-29 12:54:59 -0400, Robert Haas wrote: I would try to avoid changing lwlock.c. It's pretty easy when so doing to create mechanisms that work now but make further upgrades to the general lwlock mechanism difficult. I'd like to avoid that. I'm massively doubtful that re-implementing parts of lwlock.c is the better outcome. Then you have two different infrastructures you need to improve over time. I agree and modified the patch to use 32-bit atomics based on idea suggested by Robert and didn't modify lwlock.c. I understand. IMHO it will be a good idea though to ensure that the patch does not cause regression for other setups such as a less powerful machine or while running with lower number of clients. Okay, I have tried it on CentOS VM, but the data is totally screwed (at same client count across 2 runs the variation is quite high ranging from 300 to 3000 tps) to make any meaning out of it. I think if you want to collect data on less powerful m/c, then also atleast we should ensure that it is taken in a way that we are sure that it is not due to noise and there is actual regression, then only we can decide whether we should investigate that or not. Can you please try taking data with attached script (perf_pgbench_tpcb_write.sh), few things you need to change in script based on your setup (environment variables defined in beginning of script like PGDATA), other thing is that you need to ensure that name of binaries for HEAD and Patch should be changed in script if you are naming them differently. It will generate the performance data in test*.txt files. Also try to ensure that checkpoint should be configured such that it should not occur in-between tests, else it will be difficult to conclude the results. Some parameters you might want to consider for the same are checkpoint_timeout, checkpoint_completion_target, min_wal_size, max_wal_size. 
I was telling that fact even without my patch. Basically I have tried by commenting ProcArrayLock in ProcArrayEndTransaction. I did not get that. You mean the TPS is same if you run with commenting out ProcArrayLock in ProcArrayEndTransaction? Yes, TPS is almost same as with Patch. Is that safe to do? No, that is not safe. I have done that just to see what is the maximum gain we can get by reducing the contention around ProcArrayLock. No, autovacuum generates I/O due to which sometimes there is more variation in Write tests. Sure, but on an average does it still show similar improvements? Yes, number with and without autovacuum show similar trend, it's just that you can see somewhat better results (more performance improvement) with autovacuum=off and do manual vacuum at end of each test. Or does the test becomes IO bound and hence the bottleneck shifts somewhere else? Can you please post those numbers as well when you get chance? The numbers posted in initial mail or in this mail are with autovacuum =on. So among these only step 2 can be common among different algorithms, other's need some work specific to each optimization. Right, but if we could encapsulate that work in a function that just needs to work on some shared memory, I think we can build such infrastructure. For now, I have encapsulated the code into 2 separate functions, rather than extending LWLock.c as that could easily lead to code which might not be used in future even though currently it sounds like that and in-future if we need to use same technique elsewhere then we can look into it. Regarding the patch, the compare-and-exchange function calls that you've used would work only for 64-bit machines, right? You would need to use equivalent 32-bit calls on a 32-bit machine. 
I thought that internal API will automatically take care of it, example for msvc it uses _InterlockedCompareExchange64 which if doesn't work on 32-bit systems or is not defined, then we have to use 32-bit version, but I am not certain about that fact. Hmm. The pointer will be a 32-bit field on a 32-bit machine. I don't know how exchanging that with 64-bit integer be safe. True, but that has to be in-general taken care by 64bit atomic API's, like in this case it should fallback to either 32-bit version of API or something that can work on 32-bit m/c. I think fallback support is missing as of now in 64-bit API's which we should have if we want to use those API's, but anyway for now I have modified the patch to use 32-bit atomics. Performance Data with modified patch. pgbench setup scale factor - 300 Data is on magnetic disk and WAL on ssd. pgbench -M prepared tpc-b Data is median of 3 - 30 min runs, the detailed data for all the 3 runs is in attached document (perf_write_procarraylock_data.ods) Head : commit c53f7387 Patch : group_xid_clearing_at_trans_end_rel_v2 Client_Count/Patch_ver (TPS) 1 8 16 32 64 128 HEAD 972 6004 11060 20074 23839
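The group-clearing approach with 32-bit atomics can be illustrated with a single-threaded toy model (the class and names below are illustrative, not the patch's actual API): backends CAS themselves onto a pending list identified by proc array index rather than by pointer, which is why a 32-bit word suffices, and whoever pushed onto an empty list becomes the leader that takes ProcArrayLock once and clears XIDs for the whole group.

```python
INVALID_PROC = 0xFFFFFFFF  # sentinel that fits in a 32-bit word

class AtomicU32:
    # Single-threaded stand-in for a 32-bit atomic word; a real
    # implementation would use pg_atomic_compare_exchange_u32().
    def __init__(self, value):
        self.value = value
    def compare_exchange(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False

def join_pending_group(head, next_links, my_proc_index):
    # CAS-push this backend onto the pending list, retrying on conflict.
    # Returns True if we pushed onto an empty list, i.e. we are the
    # leader who will acquire ProcArrayLock and clear XIDs for everyone.
    while True:
        old_head = head.value
        next_links[my_proc_index] = old_head
        if head.compare_exchange(old_head, my_proc_index):
            return old_head == INVALID_PROC
```

Because the links are array indexes, the same scheme works identically on 32-bit and 64-bit machines, sidestepping the pointer-width question raised above.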
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
I think it's related to the problem of figuring out how many dead tuples you expect to find in the overall heap, which you need to do to have any hope of this being a comprehensive estimate. An estimate of the number of index scans while vacuuming can be made from the estimated total dead tuples in the relation and maintenance_work_mem. n_dead_tuples in pg_stat_all_tables can be used as an estimate of dead tuples. The following can be a way to estimate:

if nindexes == 0
    index_scans = 0
else if pages_all_visible
    index_scans = 0
else
    index_scans = Max((n_dead_tuples * space occupied by single dead tuple) / m_w_m, 1)

This estimates index_scans = 1 if n_dead_tuples = 0, assuming lazy scan heap is likely to find some dead tuples. If n_dead_tuples is nonzero the above estimate gives a lower bound on the number of index scans possible. Thank you, Rahila Syed
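Rahila's estimate, written out as code (the parameter names, and in particular dead_tuple_size as the per-dead-tuple space accounting, are placeholders for illustration):

```python
def estimate_index_scans(nindexes, pages_all_visible,
                         n_dead_tuples, dead_tuple_size, m_w_m):
    # No indexes, or an all-visible heap, means no index scan passes.
    if nindexes == 0 or pages_all_visible:
        return 0
    # Otherwise at least one pass; more if the dead-tuple array
    # cannot fit in maintenance_work_mem all at once.
    return max(n_dead_tuples * dead_tuple_size // m_w_m, 1)
```

The max(..., 1) is what makes this a lower bound: a stats-reported zero can still turn up dead tuples during the actual heap scan.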
Re: [HACKERS] Updatable view?
I think it would be nice to have... but not to the point of working on it myself. Might be worth an email to -general to see how many people have immediate use for it. What I am thinking about is, 1) Implement certain class of updatable views allowed in SQL:1999 (UNION ALL, natural joins) 2) Anything beyond #1 (I have no idea for now) Let me see how people are interested in... Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Thu, Jul 30, 2015 at 11:28 PM, Andres Freund and...@anarazel.de wrote: @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] = If we go through this list, I'd rather make informed decisions about each reloption. Otherwise we're going to get patches for each of them separately over the next versions. I have no problem to do this change now instead of wait for next versions... @@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] = { fastupdate, Enables \fast update\ feature for this GIN index, - RELOPT_KIND_GIN + RELOPT_KIND_GIN, + AccessExclusiveLock }, true }, @@ -95,7 +99,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs table pages only to this percentage, - RELOPT_KIND_HEAP + RELOPT_KIND_HEAP, + AccessExclusiveLock }, HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100 [some other fillfactor settings] { { gin_pending_list_limit, Maximum size of the pending list for this GIN index, in kilobytes., - RELOPT_KIND_GIN + RELOPT_KIND_GIN, + AccessExclusiveLock }, -1, 64, MAX_KILOBYTES }, @@ -297,7 +325,8 @@ static relopt_string stringRelOpts[] = { buffering, Enables buffering build for this GiST index, - RELOPT_KIND_GIST + RELOPT_KIND_GIST, + AccessExclusiveLock }, 4, false, Why? These options just change things for the future and don't influence past decisions. It seems unproblematic to change them. 
+1 @@ -259,7 +283,8 @@ static relopt_real realRelOpts[] = { seq_page_cost, Sets the planner's estimate of the cost of a sequentially fetched disk page., - RELOPT_KIND_TABLESPACE + RELOPT_KIND_TABLESPACE, + AccessExclusiveLock }, -1, 0.0, DBL_MAX }, @@ -267,7 +292,8 @@ static relopt_real realRelOpts[] = { random_page_cost, Sets the planner's estimate of the cost of a nonsequentially fetched disk page., - RELOPT_KIND_TABLESPACE + RELOPT_KIND_TABLESPACE, + AccessExclusiveLock }, -1, 0.0, DBL_MAX }, @@ -275,7 +301,8 @@ static relopt_real realRelOpts[] = { n_distinct, Sets the planner's estimate of the number of distinct values appearing in a column (excluding child relations)., - RELOPT_KIND_ATTRIBUTE + RELOPT_KIND_ATTRIBUTE, + AccessExclusiveLock }, 0, -1.0, DBL_MAX }, @@ -283,7 +310,8 @@ static relopt_real realRelOpts[] = { n_distinct_inherited, Sets the planner's estimate of the number of distinct values appearing in a column (including child relations)., - RELOPT_KIND_ATTRIBUTE + RELOPT_KIND_ATTRIBUTE, + AccessExclusiveLock }, 0, -1.0, DBL_MAX }, These probably are the settings that are most interesting to change without access exlusive locks. +1 j = 0; for (i = 0; boolRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(boolRelOpts[i].gen.lockmode, + boolRelOpts[i].gen.lockmode)); j++; + } for (i = 0; intRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(intRelOpts[i].gen.lockmode, + intRelOpts[i].gen.lockmode)); j++; + } for (i = 0; realRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(realRelOpts[i].gen.lockmode, + realRelOpts[i].gen.lockmode)); j++; + } for (i = 0; stringRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(stringRelOpts[i].gen.lockmode, + stringRelOpts[i].gen.lockmode)); j++; + } j += num_custom_options; Doesn't really seem worth it to assert individually in each case here to me. What do you suggest then? 
+/* + * Determine the required LOCKMODE from an option list + */ +LOCKMODE +GetRelOptionsLockLevel(List *defList) +{ +
Re: [HACKERS] security labels on databases are bad for dump restore
1. pg_dumpall -g 2. pg_dump --create per database Gah, OK, I see your point. But we better document this, because if you need a PhD in PostgreSQL-ology to take a backup, we're not in a good place. Agreed. Though, honestly, I find this to be a cumbersome approach. I think it just makes things more confusing, even if it is well documented. Perhaps it might be necessary as a bridge to get to a better place. But my first question as an end user would be, 'why can't one tool do this?'. Also, by using 'pg_dumpall -g' aren't you potentially getting things that you don't want/need/care about? For instance, if database 'foo' is owned by 'user1' and database 'bar' is owned by 'user2' and neither have any knowledge/relation of/to the other, then when I dump 'foo', in this manner, wouldn't I also be including 'user2'? Said differently, a restore of a 'foo'-only dump would also include a 'bar' related role. That seems like a bad idea, IMHO. Maybe it can't be avoided, but I'd expect that only relevant information for the database being dumped would be included. -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Fri, Jul 31, 2015 at 11:28 AM, Andres Freund and...@anarazel.de wrote: @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] = If we go through this list, I'd rather make informed decisions about each reloption. Otherwise we're going to get patches for each of them separately over the next versions. Just dropping quickly a reply: I meant table relopts only, excluding the index stuff for now regarding the isolation tests. + AccessExclusiveLock + foreach(cell, defList) + { + DefElem *def = (DefElem *) lfirst(cell); + int i; + + for (i = 0; relOpts[i]; i++) + { + if (pg_strncasecmp(relOpts[i]-name, def-defname, relOpts[i]-namelen + 1) == 0) + { + if (lockmode relOpts[i]-lockmode) + lockmode = relOpts[i]-lockmode; + } + } + } + + return lockmode; +} We usually don't compare lock values that way, i.e. there's not guaranteed to be a strict monotonicity between lock levels. I don't really agree with that policy, but it's nonetheless there. Yeah, there are some in lock.c but that's rather localized. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Thu, Jul 30, 2015 at 10:46 PM, Michael Paquier michael.paqu...@gmail.com wrote: This patch size has increased from 16k to 157k because of the output of the isolation tests you just added. This is definitely too large and actually the test coverage is rather limited. Hence I think that they should be changed as follows: - Use only one table, the locks of tables a and b do not conflict, and that's what we want to look at here. - Check the locks of all the relation parameters, by that I mean as well fillfactor and user_catalog_table which still take AccessExclusiveLock on the relation - Use custom permutations. I doubt that there is much sense to test all the permutations in this context, and this will reduce the expected output size drastically. Ok. On further notice, I would recommend not to use the same string name for the session and the query identifiers. And I think that inserting only one tuple at initialization is just but fine. Here is an example of such a spec: setup { CREATE TABLE a (id int PRIMARY KEY); INSERT INTO a SELECT generate_series(1,100); } teardown { DROP TABLE a; } session s1 step b1 { BEGIN; } # TODO add one query per parameter step at11 { ALTER TABLE a SET (fillfactor=10); } step at12 { ALTER TABLE a SET (autovacuum_vacuum_scale_factor=0.001); } step c1 { COMMIT; } session s2 step b2 { BEGIN; } step wx1 { UPDATE a SET id = id + 1; } step c2 { COMMIT; } And by testing permutations like for example that it is possible to see which session is waiting for what. Input: permutation b1 b2 at11 wx1 c1 c2 Output where session 2 waits for lock taken after fillfactor update: step b1: BEGIN; step b2: BEGIN; step at11: ALTER TABLE a SET (fillfactor=10); step wx1: UPDATE a SET id = id + 1; waiting ... step c1: COMMIT; step wx1: ... completed step c2: COMMIT; Be careful as well to not include incorrect permutations in the expected output file, the isolation tester is smart enough to ping you about that. 
Changed the isolation tests according your suggestions. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 1c1c181..ad985cd 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -543,6 +543,10 @@ ALTER TABLE ALL IN TABLESPACE replaceable class=PARAMETERname/replaceable of commandALTER TABLE/ that forces a table rewrite. /para + para + Changing autovacuum storage parameters acquires a literalSHARE UPDATE EXCLUSIVE/literal lock. + /para + note para While commandCREATE TABLE/ allows literalOIDS/ to be specified diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 8176b6a..b8d2a92 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] = { autovacuum_enabled, Enables autovacuum in this relation, - RELOPT_KIND_HEAP | RELOPT_KIND_TOAST + RELOPT_KIND_HEAP | RELOPT_KIND_TOAST, + ShareUpdateExclusiveLock }, true }, @@ -65,7 +66,8 @@ static relopt_bool boolRelOpts[] = { user_catalog_table, Declare a table as an additional catalog table, e.g. 
for the purpose of logical replication, - RELOPT_KIND_HEAP + RELOPT_KIND_HEAP, + AccessExclusiveLock }, false }, @@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] = { fastupdate, Enables \fast update\ feature for this GIN index, - RELOPT_KIND_GIN + RELOPT_KIND_GIN, + AccessExclusiveLock }, true }, @@ -81,7 +84,8 @@ static relopt_bool boolRelOpts[] = { security_barrier, View acts as a row security barrier, - RELOPT_KIND_VIEW + RELOPT_KIND_VIEW, + AccessExclusiveLock }, false }, @@ -95,7 +99,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs table pages only to this percentage, - RELOPT_KIND_HEAP + RELOPT_KIND_HEAP, + AccessExclusiveLock }, HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100 }, @@ -103,7 +108,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs btree index pages only to this percentage, - RELOPT_KIND_BTREE + RELOPT_KIND_BTREE, + AccessExclusiveLock }, BTREE_DEFAULT_FILLFACTOR, BTREE_MIN_FILLFACTOR, 100 }, @@ -111,7 +117,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs hash index pages only to this percentage, - RELOPT_KIND_HASH + RELOPT_KIND_HASH, + AccessExclusiveLock }, HASH_DEFAULT_FILLFACTOR, HASH_MIN_FILLFACTOR, 100 }, @@ -119,7 +126,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs gist index pages only
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
@@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] = If we go through this list, I'd rather make informed decisions about each reloption. Otherwise we're going to get patches for each of them separately over the next versions. @@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] = { fastupdate, Enables \fast update\ feature for this GIN index, - RELOPT_KIND_GIN + RELOPT_KIND_GIN, + AccessExclusiveLock }, true }, @@ -95,7 +99,8 @@ static relopt_int intRelOpts[] = { fillfactor, Packs table pages only to this percentage, - RELOPT_KIND_HEAP + RELOPT_KIND_HEAP, + AccessExclusiveLock }, HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100 [some other fillfactor settings] { { gin_pending_list_limit, Maximum size of the pending list for this GIN index, in kilobytes., - RELOPT_KIND_GIN + RELOPT_KIND_GIN, + AccessExclusiveLock }, -1, 64, MAX_KILOBYTES }, @@ -297,7 +325,8 @@ static relopt_string stringRelOpts[] = { buffering, Enables buffering build for this GiST index, - RELOPT_KIND_GIST + RELOPT_KIND_GIST, + AccessExclusiveLock }, 4, false, Why? These options just change things for the future and don't influence past decisions. It seems unproblematic to change them. 
@@ -259,7 +283,8 @@ static relopt_real realRelOpts[] = { seq_page_cost, Sets the planner's estimate of the cost of a sequentially fetched disk page., - RELOPT_KIND_TABLESPACE + RELOPT_KIND_TABLESPACE, + AccessExclusiveLock }, -1, 0.0, DBL_MAX }, @@ -267,7 +292,8 @@ static relopt_real realRelOpts[] = { random_page_cost, Sets the planner's estimate of the cost of a nonsequentially fetched disk page., - RELOPT_KIND_TABLESPACE + RELOPT_KIND_TABLESPACE, + AccessExclusiveLock }, -1, 0.0, DBL_MAX }, @@ -275,7 +301,8 @@ static relopt_real realRelOpts[] = { n_distinct, Sets the planner's estimate of the number of distinct values appearing in a column (excluding child relations)., - RELOPT_KIND_ATTRIBUTE + RELOPT_KIND_ATTRIBUTE, + AccessExclusiveLock }, 0, -1.0, DBL_MAX }, @@ -283,7 +310,8 @@ static relopt_real realRelOpts[] = { n_distinct_inherited, Sets the planner's estimate of the number of distinct values appearing in a column (including child relations)., - RELOPT_KIND_ATTRIBUTE + RELOPT_KIND_ATTRIBUTE, + AccessExclusiveLock }, 0, -1.0, DBL_MAX }, These probably are the settings that are most interesting to change without access exlusive locks. j = 0; for (i = 0; boolRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(boolRelOpts[i].gen.lockmode, + boolRelOpts[i].gen.lockmode)); j++; + } for (i = 0; intRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(intRelOpts[i].gen.lockmode, + intRelOpts[i].gen.lockmode)); j++; + } for (i = 0; realRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(realRelOpts[i].gen.lockmode, + realRelOpts[i].gen.lockmode)); j++; + } for (i = 0; stringRelOpts[i].gen.name; i++) + { + Assert(DoLockModesConflict(stringRelOpts[i].gen.lockmode, + stringRelOpts[i].gen.lockmode)); j++; + } j += num_custom_options; Doesn't really seem worth it to assert individually in each case here to me. 
+/* + * Determine the required LOCKMODE from an option list + */ +LOCKMODE +GetRelOptionsLockLevel(List *defList) +{ + LOCKMODElockmode = NoLock; + ListCell*cell;
Re: [HACKERS] LWLock deadlock and gdb advice
On 2015-07-30 09:03:01 -0700, Jeff Janes wrote: On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund and...@anarazel.de wrote: What do you think about something roughly like the attached? I've not evaluated the code, but applying it does solve the problem I was seeing. Cool, thanks for testing! How long did you run the test and how long did it, on average, previously take for the issue to occur? Regards, Andres
[HACKERS] Updatable view?
Hi, Is anyone working on implementing, or interested in implementing, automatically updatable views that involve two or more tables (joins)? SQL:1999 allows it under certain conditions. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Re: [HACKERS] security labels on databases are bad for dump restore
On Wed, Jul 29, 2015 at 10:50:53AM -0400, Robert Haas wrote: On Wed, Jul 29, 2015 at 12:39 AM, Noah Misch n...@leadboat.com wrote: On Tue, Jul 28, 2015 at 03:36:13PM -0400, Robert Haas wrote: On Tue, Jul 28, 2015 at 3:33 PM, Andres Freund and...@anarazel.de wrote: Hm? Let me try again: If the admin does a ALTER DATABASE ... SET guc = ... *before* restoring a backup and the backup does contain a setting for the same guc, but with a different value it'll overwrite the previous explicit action by the DBA without any warning. If the backup does *not* contain that guc the previous action survives. That's confusing, because you're more likely to be in the 'the backup does not contain the guc' situation when testing where it thus will work. True. But I don't think modifying a database before restoring into it is terribly supported. Even pg_dump --clean, which is supposed to do this sort of thing, doesn't seem to work terribly reliably. We could try to fix this by having a command like ALTER DATABASE ... RESET ALL that we issue before restoring the settings, but I'm afraid that will take us into all sorts of unreasonable scenarios that are better just labeled as don't do that. Andres's example is a harbinger of the semantic morass ahead. Excepting database objects and the public schema object, pg_dump and pg_dumpall mutate only the objects they CREATE. They consistently restore object properties (owner, ACLs, security label, etc.) if and only if issuing a CREATE statement for the object. For example, restoring objects contained in a schema without restoring the schema itself changes none of those schema properties. pg_dump and pg_dumpall have mostly followed that rule for databases, too, but they depart from it for comment and security label. That was a mistake. We can't in general mutate an existing database to match, because we can't mutate the encoding, datcollate or datctype. Even discounting that problem, I value consistency with the rest of the dumpable object types. 
What we've proven so far (if Craig's comments are to be believed) is that the oft-recommended formula of pg_dumpall -g plus pg_dump of each database doesn't completely work. That's absolutely gotta be fixed. What exact formula did you have in mind? It must not be merely 1. pg_dumpall -g 2. pg_dump (without --create) per database which _never_ works: it emits no CREATE DATABASE statements. Perhaps this? 1. pg_dumpall -g 2. Issue a handwritten CREATE DATABASE statement per database with correct encoding, lc_ctype and lc_collate parameters. All other database properties can be wrong; the dump will fix them. 3. pg_dump (without --create) per database That neglects numerous database properties today, but we could make it work. Given the problems I described upthread, it's an inferior formula that I recommend against propping up. I much prefer making this work completely: 1. pg_dumpall -g 2. pg_dump --create per database Another formula I wouldn't mind offering: 1. pg_dumpall -g 2. pg_dumpall --empty-databases 3. pg_dump (without --create) per database Code for an --empty-databases option already exists for pg_dumpall -g --binary-upgrade. A patch turning that into a user-facing feature might be quite compact. I don't see much point given a complete pg_dump --create, but I wouldn't object. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Don'st start streaming after creating a slot in pg_receivexlog
On Wed, Jul 29, 2015 at 10:20 PM, Andres Freund and...@anarazel.de wrote: On 2015-07-29 22:17:27 +0900, Michael Paquier wrote: Here is a patch implementing those things. IMO if-not-exists does not make much sense anymore What? It's rather useful to be able to discern between 'slot was already there' and 'oops, some error occurred'. -1 OK, fine. To me the pg_recvlogical changes are pretty pointless. OK, you wrote it after all. I won't insist on it. So, perhaps the attached is more convincing then? It just changes --create-slot to leave immediately after creation to address the complaint of this thread. -- Michael
Re: [HACKERS] multivariate statistics / patch v7
On 05/25/2015 11:43 PM, Tomas Vondra wrote: There are 6 files attached, but only 0002-0006 are actually part of the multivariate statistics patch itself. All of these patches are huge. In order to review this in a reasonable amount of time, we need to do this in several steps. So let's see what would be the minimal set of these patches that could be reviewed and committed, while still being useful. The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be enough to get a benefit from this? I have some doubts about the clause reduction and functional dependencies part of this. It seems to treat functional dependency as a boolean property, but even with the classic zipcode and city case, it's not always an all-or-nothing thing. At least in some countries, there can be zipcodes that span multiple cities. So zipcode=X does not completely imply city=Y, although there is a strong correlation (if that's the right term). How strong does the correlation need to be for this patch to decide that zipcode implies city? I couldn't actually see a clear threshold stated anywhere. So rather than treating functional dependence as a boolean, I think it would make more sense to put a 0.0-1.0 number to it. That means that you can't do clause reduction like it's done in this patch, where you actually remove clauses from the query for cost estimation purposes. Instead, you need to calculate the selectivity for each clause independently, but instead of just multiplying the selectivities together, apply the dependence factor to it. Does that make sense? I haven't really looked at the MCV, histogram and multi-statistics estimation patches yet. Do those patches make the clause reduction patch obsolete?
Should we forget about the clause reduction and functional dependency patch, and focus on those later patches instead? - Heikki
Re: [HACKERS] dblink: add polymorphic functions.
Joe Conway m...@joeconway.com writes: What about just TYPE then? SELECT x::TYPE(some_expression) FROM ... SELECT CAST (x AS TYPE(some_expression)) FROM ... Yeah, that would work. Quick-hack proof-of-concept patch attached. Some usage examples in the regression database:

regression=# select pg_typeof(43::type(q1)) from int8_tbl;
 pg_typeof
-----------
 bigint
 bigint
 bigint
 bigint
 bigint
(5 rows)

regression=# select pg_typeof(43::type(q1/0.0)) from int8_tbl;
 pg_typeof
-----------
 numeric
 numeric
 numeric
 numeric
 numeric
(5 rows)

regression=# select pg_typeof(43::type(f1)) from point_tbl;
ERROR:  cannot cast type integer to point
LINE 1: select pg_typeof(43::type(f1)) from point_tbl;
                             ^

The main limitation of this patch is that it won't work for call sites that pass pstate == NULL to LookupTypeName. There are a fair number of them, some of which wouldn't care because they could never invoke this notation anyway, but for others we'd need to do some work to cons up a suitable pstate. regards, tom lane

diff --git a/src/backend/parser/parse_type.c b/src/backend/parser/parse_type.c
index 6616639..d5d0f73 100644
*** a/src/backend/parser/parse_type.c
--- b/src/backend/parser/parse_type.c
***************
*** 19,25 ****
--- 19,27 ----
  #include "catalog/pg_type.h"
  #include "lib/stringinfo.h"
  #include "nodes/makefuncs.h"
+ #include "nodes/nodeFuncs.h"
  #include "parser/parser.h"
+ #include "parser/parse_expr.h"
  #include "parser/parse_type.h"
  #include "utils/array.h"
  #include "utils/builtins.h"
*************** static int32 typenameTypeMod(ParseState
*** 52,58 ****
   * found but is a shell, and there is typmod decoration, an error will be
   * thrown --- this is intentional.
   *
!  * pstate is only used for error location info, and may be NULL.
   */
  Type
  LookupTypeName(ParseState *pstate, const TypeName *typeName,
--- 54,61 ----
   * found but is a shell, and there is typmod decoration, an error will be
   * thrown --- this is intentional.
   *
!  * In most cases pstate is only used for error location info, and may be NULL.
!  * However, the TYPE(expression) syntax is not accepted when pstate is NULL.
   */
  Type
  LookupTypeName(ParseState *pstate, const TypeName *typeName,
*************** LookupTypeName(ParseState *pstate, const
*** 143,148 ****
--- 146,188 ----
  							format_type_be(typoid))));
  		}
  	}
+ 	else if (pstate != NULL &&
+ 			 list_length(typeName->typmods) == 1 &&
+ 			 list_length(typeName->names) == 1 &&
+ 			 strcmp(strVal(linitial(typeName->names)), "type") == 0)
+ 	{
+ 		/* TYPE(expression) notation */
+ 		Node	   *typexpr = (Node *) linitial(typeName->typmods);
+ 
+ 		/* XXX should invent a new EXPR_KIND for this, likely */
+ 		typexpr = transformExpr(pstate, typexpr, EXPR_KIND_SELECT_TARGET);
+ 
+ 		/* We needn't bother assigning collations to the expr */
+ 
+ 		/* We use the expression's type/typmod and then throw the expr away */
+ 		typoid = exprType(typexpr);
+ 
+ 		/* If an array reference, return the array type instead */
+ 		if (typeName->arrayBounds != NIL)
+ 			typoid = get_array_type(typoid);
+ 
+ 		if (!OidIsValid(typoid))
+ 		{
+ 			if (typmod_p)
+ 				*typmod_p = -1;
+ 			return NULL;
+ 		}
+ 
+ 		if (typmod_p)
+ 			*typmod_p = exprTypmod(typexpr);
+ 
+ 		/* Duplicative, but I'm too lazy to refactor this function right now */
+ 		tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typoid));
+ 		if (!HeapTupleIsValid(tup))		/* should not happen */
+ 			elog(ERROR, "cache lookup failed for type %u", typoid);
+ 
+ 		return (Type) tup;
+ 	}
  	else
  	{
  		/* Normal reference to a type name */
Re: [HACKERS] CREATE FUNCTION .. LEAKPROOF docs
On 07/29/2015 04:46 PM, Joe Conway wrote: On 06/14/2015 03:46 AM, Dean Rasheed wrote: I think the docs for the LEAKPROOF option in create_function.sgml ought to mention RLS as well as security barrier views. Also the current text is no longer strictly correct in light of commit dcbf5948e12aa60b4d6ab65b6445897dfc971e01. Suggested rewording attached. Any objections out there to this doc patch? It rings true and reads well to me. I'd like to check another item off the 9.5 Open Items list.. Pushed to HEAD and 9.5 Joe
Re: [HACKERS] LWLock deadlock and gdb advice
On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund and...@anarazel.de wrote: On 2015-07-29 14:22:23 +0200, Andres Freund wrote: On 2015-07-29 15:14:23 +0300, Heikki Linnakangas wrote: Ah, ok, that should work, as long as you also re-check the variable's value after queueing. Want to write the patch, or should I? I'll try. Shouldn't be too hard. What do you think about something roughly like the attached? I've not evaluated the code, but applying it does solve the problem I was seeing. Cheers, Jeff
[HACKERS] TAP tests are badly named
We should describe test sets by what they test, not by how they test. TAP is a testing tool/protocol. The current set of tests we have test the programs in src/bin, and we should really name the test set by a name that reflects that, rather than the fact that we are using TAP tools to run the tests. What if we decide to test something else using TAP? Would we call that set of tests TAP tests too? --enable-tap-tests is a reasonable configuration setting, because it's about whether or not we have a TAP testing framework available, but I think we should stop calling the bin tests TAP tests and we should change the test name in vcregress.pl to a more appropriate name. In the buildfarm I'm calling the step bin-check: http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2015-07-30%2012%3A25%3A58&stg=bin-check Thoughts? cheers andrew
Re: [HACKERS] LWLock deadlock and gdb advice
On 2015-07-30 17:36:52 +0300, Heikki Linnakangas wrote: In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held, until it has updated the variable. And LWLockWaitForVar() holds the spinlock when it checks that the lock is held and that the variable's value matches. So it cannot happen on 9.4. The first paragraph talked about the same value, but that was just referring to it not yet having been cleared, I think... To reiterate, with 9.5, it's possible that a backend is sleeping in LWLockWaitForVar(oldvar=123), even though the lock is currently held by another backend with value 124. That seems wrong, or surprising at the very least. With my patch that can't really happen that way though? The value is re-checked after queuing. If it has changed by then we're done. And if it hasn't yet changed we're guaranteed to be woken up once it's being changed? I generally don't mind adding some sort of flag clearing or such, but I'd rather not have it in the retry loop in the general LWLockAttemptLock path - I found that very small amounts of instructions in there have a measurable impact. Greetings, Andres Freund
Re: [HACKERS] Volatility of pg_xact_commit_timestamp() and pg_last_committed_xact()
Alvaro Herrera wrote: Robert Haas wrote: On Thu, Jul 16, 2015 at 9:49 AM, Fujii Masao masao.fu...@gmail.com wrote: Volatilities of pg_xact_commit_timestamp() and pg_last_committed_xact() are now STABLE. But ISTM that those functions can return different results even within a single statement. So we should change the volatilities of them to VOLATILE? Sounds right to me. Will fix. Fix pushed. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] security labels on databases are bad for dump restore
On Thu, Jul 30, 2015 at 2:49 AM, Noah Misch n...@leadboat.com wrote: What exact formula did you have in mind? It must not be merely
1. pg_dumpall -g
2. pg_dump (without --create) per database
which _never_ works: it emits no CREATE DATABASE statements. Perhaps this?
1. pg_dumpall -g
2. Issue a handwritten CREATE DATABASE statement per database with correct encoding, lc_ctype and lc_collate parameters. All other database properties can be wrong; the dump will fix them.
3. pg_dump (without --create) per database
That neglects numerous database properties today, but we could make it work. Given the problems I described upthread, it's an inferior formula that I recommend against propping up. I much prefer making this work completely:
1. pg_dumpall -g
2. pg_dump --create per database
Gah, OK, I see your point. But we better document this, because if you need a PhD in PostgreSQL-ology to take a backup, we're not in a good place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Wed, Jul 29, 2015 at 9:05 PM, Peter Geoghegan p...@heroku.com wrote: The behavior of external sorts that do not require any merge step due to only having one run (what EXPLAIN ANALYZE output shows as an external sort, and not a merge sort) seems like an area that can be significantly improved upon. As noted in code comments, this optimization did not appear in The Art of Computer Programming, Volume III. It's not an unreasonable idea, but it doesn't work well on modern machines due to using heapsort, which is known to use the cache ineffectively. It also often implies significant additional I/O for little or no benefit. I suspect that all the reports we've heard of smaller work_mem sizes improving sort performance are actually down to this one-run optimization *hurting* performance. Very interesting. And great performance numbers. Thanks for taking the time to investigate this - really cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce ProcArrayLock contention
On Wed, Jul 29, 2015 at 2:18 PM, Andres Freund and...@anarazel.de wrote: On 2015-07-29 12:54:59 -0400, Robert Haas wrote: I would try to avoid changing lwlock.c. It's pretty easy when so doing to create mechanisms that work now but make further upgrades to the general lwlock mechanism difficult. I'd like to avoid that. I'm massively doubtful that re-implementing parts of lwlock.c is the better outcome. Then you have two different infrastructures you need to improve over time. That is also true, but I don't think we're going to be duplicating anything from lwlock.c in this case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 07/30/2015 04:05 AM, Peter Geoghegan wrote: Patch -- quicksort with spillover = With the attached patch, I propose to add an additional, better one run special case optimization. This new special case splits the single run into 2 subruns. One of the runs is comprised of whatever tuples were in memory when the caller finished passing tuples to tuplesort. To sort that, we use quicksort, which in general has various properties that make it much faster than heapsort -- it's a cache oblivious algorithm, which is important these days. The other subrun is whatever tuples were on-tape when tuplesort_performsort() was called. This will often be a minority of the total, but certainly never much more than half. This is already sorted when tuplesort_performsort() is reached. This spillover is already inserted at the front of the sorted-on-tape tuples, and so already has reasonably good cache characteristics. With the patch, we perform an on-the-fly merge that is somewhat similar to the existing (unaffected) merge sort TSS_FINALMERGE case, except that one of the runs is in memory, and is potentially much larger than the on-tape/disk run (but never much smaller), and is quicksorted. The existing merge sort case similarly is only used when the caller specified !randomAccess. Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? 
Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. I think it'd make sense to structure the code differently, to match the way I described this optimization above. Instead of adding a new tuplesort state for this, abstract this in the logical tape code. Add a function to attach an in-memory tail to a tape, and have LogicalTapeRead() read from the tail after reading the on-disk tape. The rest of the code wouldn't need to care that sometimes part of the tape is actually in memory. It should be pretty easy to support randomAccess too. If you think of the in-memory heap as a tail of the last tape, you can easily move backwards from the in-memory heap back to the on-disk tape, too.

+		 * Note that there might actually be 2 runs, but only the
+		 * contents of one of them went to tape, and so we can
+		 * safely pretend that there is only 1 run (since we're
+		 * about to give up on the idea of the memtuples array being
+		 * a heap). This means that if our sort happened to require
+		 * random access, the similar single run optimization
+		 * below (which sets TSS_SORTEDONTAPE) might not be used at
+		 * all. This is because dumping all tuples out might have
+		 * forced an otherwise equivalent randomAccess case to
+		 * acknowledge a second run, which we can avoid.

Is that really true? We don't start a second run until we have to, i.e. when it's time to dump the first tuple of the second run to tape. So I don't think the case you describe above, where you have two runs but only one of them has tuples on disk, can actually happen.

Performance
===========
Impressive!

Predictability
==============
Even more impressive!

Future work
===========
As an extra optimization, you could delay quicksorting the in-memory array until it's time to read the first tuple from it. If the caller reads only the top-N tuples from the sort for some reason (other than LIMIT, which we already optimize for), that could avoid a lot of work. 
- Heikki
Re: [HACKERS] LWLock deadlock and gdb advice
On 2015-07-29 20:23:24 +0300, Heikki Linnakangas wrote: Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on it. The lock holder releases the lock, and wakes up A. But before A wakes up and sees that the lock is free, another backend acquires the lock again. It runs LWLockAcquireWithVar to the point just before setting the variable's value. Now A wakes up, sees that the lock is still (or again) held, and that the variable's value still matches the old one, and goes back to sleep. The new lock holder won't wake it up until it updates the value again, or releases the lock. I'm not sure whether this observation is about my patch or the general lwlock variable mechanism. In my opinion that behaviour exists today both in 9.4 and 9.5. But I think that's fine because that race seems pretty fundamental. After all, you could have called LWLockWaitForVar() just after the second locker had set the variable to the same value. You didn't like the new LW_FLAG_VAR_SET flag and the API changes I proposed? I do slightly dislike the additional bit of math in AttemptLock. But what I really disliked about the patch is the reliance on holding the spinlock over longer times and conditionally acquiring spinlocks. I didn't see a need for the flags change to fix the issue at hand, that's why I didn't incorporate it. I'm not fundamentally against it. Either way, there is a race condition that if the new lock holder sets the variable to the *same* value as before, the old waiter won't necessarily wake up even though the lock was free or had a different value in between. But that's easier to explain and understand than the fact that the value set by LWLockAcquireWithVar() might not be seen by a waiter, until you release the lock or update it again. I can't really see a use for the API that'd allow that and care about the waits. Because quite fundamentally you could just have started waiting after the variable was set to the value at the same time. 
Another idea is to have LWLockAcquire check the wait queue after setting the variable, and wake up anyone who's already queued up. Ugh, not a fan of that. The changes in LWLockWaitForVar() and LWLockConflictsWithVar() seem OK in principle. You'll want to change LWLockConflictsWithVar() so that the spinlock is held only over the value = *valptr line, though; the other stuff just modifies local variables and doesn't need to be protected by the spinlock. Good point. Also, when you enter LWLockWaitForVar(), you're checking if the lock is held twice in a row; first at the top of the function, and again inside LWLockConflictsWithVar. You could just remove the quick test at the top. Yea, I was thinking about removing it. The first check was there previously, so I left it in place. We do execute a bit more code once we've disabled interrupts (extraWaits, re-enabling interrupts). I don't think it'll matter much, but it seemed like an independent thing. Greetings, Andres Freund
Re: [HACKERS] Reduce ProcArrayLock contention
On 2015-07-29 12:54:59 -0400, Robert Haas wrote: I would try to avoid changing lwlock.c. It's pretty easy when so doing to create mechanisms that work now but make further upgrades to the general lwlock mechanism difficult. I'd like to avoid that. I'm massively doubtful that re-implementing parts of lwlock.c is the better outcome. Then you have two different infrastructures you need to improve over time.
Re: [HACKERS] security labels on databases are bad for dump restore
Noah Misch wrote: What exact formula did you have in mind? It must not be merely
1. pg_dumpall -g
2. pg_dump (without --create) per database
which _never_ works: it emits no CREATE DATABASE statements. Perhaps this?
1. pg_dumpall -g
2. Issue a handwritten CREATE DATABASE statement per database with correct encoding, lc_ctype and lc_collate parameters. All other database properties can be wrong; the dump will fix them.
3. pg_dump (without --create) per database
That neglects numerous database properties today, but we could make it work. Given the problems I described upthread, it's an inferior formula that I recommend against propping up. Agreed, and IMO it's embarrassing that it's so complicated to get a fully working backup. I much prefer making this work completely:
1. pg_dumpall -g
2. pg_dump --create per database
My full support for this proposal. Another formula I wouldn't mind offering:
1. pg_dumpall -g
2. pg_dumpall --empty-databases
3. pg_dump (without --create) per database
Code for an --empty-databases option already exists for pg_dumpall -g --binary-upgrade. A patch turning that into a user-facing feature might be quite compact. I don't mind if this one is also made to work, but I don't care about this case all that much. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Proposal: backend niceness / session_priority
José Luis Tallón jltal...@adv-solutions.net writes: Since PostgreSQL lacks the resource management capabilities of the Big Ones ( Resource Groups - Red, WorkLoad Manager - Blue ) or the Resource Governor in MS SQL Server, we can try and approximate the requested behaviour by reducing the CPU priority (nice) of the backend in question. Please note that we would be using scheduler priority to try and modulate I/O, though I'm aware of the limitations of this mechanism. This has been proposed before, and rejected before, and I'm not seeing anything particularly new here. Without a credible mechanism for throttling I/O, nice alone does not seem very promising. regards, tom lane
Re: [HACKERS] pgbench stats per script other stuff
v6 is just a rebase after a bug fix by Andres Freund. Also a small question: The patch currently displays pgbench scripts starting numbering at 0. Probably a little too geek... should start at 1? v7 is a rebase after another small bug fix in pgbench. -- Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 2517a3a..99670d4 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -261,6 +261,23 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ benchmarking arguments: variablelist + varlistentry + termoption-b/ replaceablescriptname[@weight]//term + termoption--builtin/ replaceablescriptname[@weight]//term + listitem + para +Add the specified builtin script to the list of executed scripts. +An optional integer weight after literal@/ allows to adjust the +probability of drawing the test. +Available builtin scripts are: literaltpcb-like/, +literalsimple-update/ and literalselect-only/. +The provided repleacablescriptname/ needs only to be a prefix +of the builtin name, hence literalsimp/ would be enough to select +literalsimple-update/. + /para + /listitem + /varlistentry + varlistentry termoption-c/option replaceableclients//term @@ -307,14 +324,15 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ /varlistentry varlistentry - termoption-f/option replaceablefilename//term - termoption--file=/optionreplaceablefilename//term + termoption-f/ replaceablefilename[@weight]//term + termoption--file=/replaceablefilename[@weight]//term listitem para -Read transaction script from replaceablefilename/. +Add a transaction script read from replaceablefilename/ to +the list of executed scripts. +An optional integer weight after literal@/ allows to adjust the +probability of drawing the test. See below for details. -option-N/option, option-S/option, and option-f/option -are mutually exclusive. 
/para /listitem /varlistentry @@ -404,10 +422,7 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ termoption--skip-some-updates/option/term listitem para -Do not update structnamepgbench_tellers/ and -structnamepgbench_branches/. -This will avoid update contention on these tables, but -it makes the test case even less like TPC-B. +Shorthand for option-b simple-update@1/. /para /listitem /varlistentry @@ -499,9 +514,9 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ Report the specified scale factor in applicationpgbench/'s output. With the built-in tests, this is not necessary; the correct scale factor will be detected by counting the number of -rows in the structnamepgbench_branches/ table. However, when testing -custom benchmarks (option-f/ option), the scale factor -will be reported as 1 unless this option is used. +rows in the structnamepgbench_branches/ table. +However, when testing only custom benchmarks (option-f/ option), +the scale factor will be reported as 1 unless this option is used. /para /listitem /varlistentry @@ -511,7 +526,7 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ termoption--select-only/option/term listitem para -Perform select-only transactions instead of TPC-B-like test. +Shorthand for option-b select-only@1/. /para /listitem /varlistentry @@ -567,6 +582,16 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ /varlistentry varlistentry + termoption--per-script-stats/option/term + listitem + para +Report some statistics per script run by pgbench. 
+ /para + /listitem + /varlistentry + + + varlistentry termoption--sampling-rate=replaceablerate//option/term listitem para @@ -661,7 +686,20 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ titleWhat is the quoteTransaction/ Actually Performed in pgbench?/title para - The default transaction script issues seven commands per transaction: + Pgbench executes test scripts chosen randomly from a specified list. + They include built-in scripts with option-b/ and + user-provided custom scripts with option-f/. + Each script may be given a relative weight specified after a + literal@/ so as to change its drawing probability. + The default weight is literal1/. + /para + + para + The default builtin transaction script (also invoked with option-b tpcb-like/) + issues seven commands per transaction over randomly chosen literalaid/, + literaltid/, literalbid/ and
Re: [HACKERS] multivariate statistics / patch v7
On 07/30/2015 03:55 PM, Tomas Vondra wrote: On 07/30/2015 10:21 AM, Heikki Linnakangas wrote: I have some doubts about the clause reduction and functional dependencies part of this. It seems to treat functional dependency as a boolean property, but even with the classic zipcode and city case, it's not always an all or nothing thing. At least in some countries, there can be zipcodes that span multiple cities. So zipcode=X does not completely imply city=Y, although there is a strong correlation (if that's the right term). How strong does the correlation need to be for this patch to decide that zipcode implies city? I couldn't actually see a clear threshold stated anywhere. So rather than treating functional dependence as a boolean, I think it would make more sense to put a 0.0-1.0 number to it. That means that you can't do clause reduction like it's done in this patch, where you actually remove clauses from the query for cost esimation purposes. Instead, you need to calculate the selectivity for each clause independently, but instead of just multiplying the selectivities together, apply the dependence factor to it. Does that make sense? I haven't really looked at the MCV, histogram and multi-statistics estimation patches yet. Do those patches make the clause reduction patch obsolete? Should we forget about the clause reduction and functional dependency patch, and focus on those later patches instead? Perhaps. It's true that most real-world data sets are not 100% valid with respect to functional dependencies - either because of natural imperfections (multiple cities with the same ZIP code) or just noise in the data (incorrect entries ...). And it's even mentioned in the code comments somewhere, I guess. But there are two main reasons why I chose not to extend the functional dependencies with the [0.0-1.0] value you propose. 
Firstly, functional dependencies were meant to be the simplest possible implementation, illustrating how the infrastructure is supposed to work (which is the main topic of the first patch). Secondly, all kinds of statistics are simplifications of the actual data. So I think it's not incorrect to ignore the exceptions up to some threshold. The problem with a threshold is that around that threshold, even a small change in the data set can drastically change the produced estimates. For example, imagine that we know from the stats that zip code implies city. But then someone adds a single row to the table with an odd zip code / city combination, which pushes the estimator over the threshold, and the columns are no longer considered dependent, and the estimates are now completely different. We should avoid steep cliffs like that. BTW, what is the threshold in the current patch? - Heikki
[HACKERS] Proposal: backend niceness / session_priority
Hackers, I have found myself needing to run some maintenance routines (VACUUM, REINDEX, REFRESH MATERIALIZED VIEW mostly) at a lower priority so as not to disturb concurrent *highly transactional* connections. This issue is also noted in the TODO[0] list in the Wiki.

* There was some discussion in 2007 [1] regarding Priorities for users or queries? http://archives.postgresql.org/pgsql-general/2007-02/msg00493.php

Since PostgreSQL lacks the resource management capabilities of the Big Ones (Resource Groups - Red, WorkLoad Manager - Blue) or the Resource Governor in MS SQL Server, we can try and approximate the requested behaviour by reducing the CPU priority (nice) of the backend in question. Please note that we would be using scheduler priority to try and modulate I/O, though I'm aware of the limitations of this mechanism. Using renice(1) from outside is not only cumbersome and error-prone but very much unusable for the use cases I am contemplating.

* Moreover, as seen in the Priorities wiki page [2], there exists an extension providing a set_backend_priority() function, to be called as set_backend_priority(pg_backend_pid(), 20). This approach is, sadly, not portable to non-POSIX operating systems (e.g. Windows), and IMO quite convoluted to use and tied to actual implementation details.

* I have been playing with some code which uses a GUC for this purpose, though only defining/supporting three different priorities would make sense for the final implementation IMO: NORMAL, LOW_PRIORITY, IDLE. Checked platform compatibility too: this behaviour can be implemented on Windows, too. For everything else, there's nice(2).

However, there is a relatively minor catch here which is the reason behind this e-mail: user interface - Inventing a new command seems overkill to me. 
Plus, I don't know what we could model it on --- given that the real solution for this problem would be a fully featured priority manager.

- I have been playing with a GUC that ignores being reset --- so as to comply with nice's specification when not running as a privileged user --- but I reckon that this behaviour might be surprising at best:

SET session_priority TO 'low';   -- Ok, low priority
VACUUM FREEZE my_test_table;
RESET session_priority;          -- Nope, still low prio. Emit notice?

The only way to reset the priority would be to reconnect, and this is my main pain point. However, this approach does fulfill my needs and ---it seems--- the OP's needs: being able to run a maintenance task at a low priority (i.e. disturbing other concurrent queries as little as possible). Expected use case: a cronjob running psql -c 'SET session_priority TO low; REINDEX blabla CONCURRENTLY; VACUUM foobar;' All suggestions welcome. I'll be wrapping up a more-or-less-done patch on Monday if somebody wants to take a look and criticize the actual code (I won't be working on this tomorrow), unless somebody points me at a better solution. Thanks, / J.L. [0] https://wiki.postgresql.org/wiki/Todo - Miscellaneous performance [1] http://archives.postgresql.org/pgsql-general/2007-02/msg00493.php [2] https://wiki.postgresql.org/wiki/Priorities [3] http://docs.oracle.com/cd/E11882_01/server.112/e25494/dbrm.htm [4] http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.doc/com.ibm.db2.luw.doc-gentopic6.html [5] https://msdn.microsoft.com/en-us/library/bb933866.aspx
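The proposed GUC boils down to mapping three priority levels onto nice increments. The sketch below is illustrative only: the level names follow the proposal, but the specific increments (10 and 19) are assumptions, not values from any patch. It also shows why RESET cannot work for unprivileged users: os.nice() (like setpriority(2)) only allows them to raise the nice value, never lower it back.

```python
import os

# Hypothetical mapping from the three proposed priorities to nice increments;
# the numbers are illustrative assumptions.
PRIORITY_NICE = {"normal": 0, "low_priority": 10, "idle": 19}

def nice_increment(session_priority):
    """Translate a session_priority setting into a nice increment."""
    try:
        return PRIORITY_NICE[session_priority.lower()]
    except KeyError:
        raise ValueError("invalid session_priority: %r" % session_priority)

def apply_session_priority(session_priority):
    # An unprivileged process can only increase its nice value, which is
    # exactly why a later RESET cannot restore the old priority without
    # reconnecting (a fresh backend process).
    increment = nice_increment(session_priority)
    if increment > 0:
        os.nice(increment)
```

The test exercises only the mapping, since actually renicing the test process would be a side effect.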
Re: [HACKERS] Division by zero in selfuncs.c:estimate_hash_bucketsize()
Piotr Stefaniak postg...@piotr-stefaniak.me writes: the two asserts below will fail with this query (run against the regression db): I've applied fixes for this and the other thing. Thanks for the report! regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Row-Level Security Policies (RLS)
On 07/30/2015 06:52 AM, Dean Rasheed wrote: On 30 July 2015 at 01:35, Joe Conway m...@joeconway.com wrote: On 06/01/2015 02:21 AM, Dean Rasheed wrote: While going through this, I spotted another issue --- in a DML query with additional non-target relations, such as UPDATE t1 .. FROM t2 .., the old code was checking the UPDATE policies of both t1 and t2, but really I think it ought to be checking the SELECT policies of t2 (in the same way as this query requires SELECT table permissions on t2, not UPDATE permissions). I've changed that and added new regression tests to test that change. I assume the entire refactoring patch needs a fair bit of work to rebase against current HEAD, Actually, there haven't been any conflicting changes so far, so a git rebase was able to automatically merge correctly -- new patch attached, with some minor comment rewording (not affecting the bug-fix part). Good to hear. Even so, I agree that it makes sense to apply the bug-fix separately, since it's not really anything to do with the refactoring. but I picked out the attached to address just the above issue. Does this look correct, and if so does it make sense to apply at least this part right now? Looks correct to me. 
Thanks -- committed and pushed to HEAD and 9.5 -- Joe Conway
Re: [HACKERS] dblink: add polymorphic functions.
Tom Lane wrote: Joe Conway m...@joeconway.com writes: What about just TYPE then? SELECT x::TYPE(some_expression) FROM ... SELECT CAST (x AS TYPE(some_expression)) FROM ... Yeah, that would work. Quick-hack proof-of-concept patch attached. I'm amazed that this works without hacking the grammar itself, but in retrospect that's expected.

+ else if (pstate != NULL &&
+          list_length(typeName->typmods) == 1 &&
+          list_length(typeName->names) == 1 &&
+          strcmp(strVal(linitial(typeName->names)), "type") == 0)
+ {
+     /* TYPE(expression) notation */
+     Node *typexpr = (Node *) linitial(typeName->typmods);

This is rather ugly, but I guess not untenable. I suppose we don't care about any actual typmod, do we? Will this be of any use with the PostGIS types and such, for which the typmod is not merely a size limit? Also INTERVAL has some funny typmod rules, not sure if that affects usage of this construct. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Planner debug views
On 7/29/15 2:40 PM, Alvaro Herrera wrote: Qingqing Zhou wrote: Can we simplify above with foreign table methods? There are two major concerns about this method per previous discussions: security and usability. I think the main cause is the sharing foreign table design. I think foreign data wrappers are great. I do not think that we should try to shape every problem to look like foreign data so that we can solve it with a foreign data wrapper. I am a bit nervous that this keeps being brought up. Agreed. I think a better option would be shoving it into a backend tuplestore and just leaving it there (maybe with a command to clear it for the paranoid). That gives a relation you can query against, insert into another table, etc. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] dblink: add polymorphic functions.
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Yeah, that would work. Quick-hack proof-of-concept patch attached. This is rather ugly, but I guess not untenable. I suppose we don't care about any actual typmod, do we? We get the type and typmod both from the expression. Example:

regression=# create table ref (f1 varchar, f2 varchar(3));
CREATE TABLE
regression=# insert into ref values('1','2');
INSERT 0 1
regression=# select '1234567890'::type(f1) from ref;
    type
------------
 1234567890
(1 row)

regression=# select '1234567890'::type(f2) from ref;
 type
------
 123
(1 row)

Will this be of any use with the PostGIS types and such, for which the typmod is not merely a size limit? Don't see why not. They still have to follow the rule that typmod is a static property of an expression. Also INTERVAL has some funny typmod rules, not sure if that affects usage of this construct. I would not think so. The weirdness about INTERVAL mainly has to do with SQL's, ahem, inventive collection of ways to write an interval constant, and that wouldn't really be an issue for any practical use of this construct AFAICS. Whatever you write as the expression, we're going to be able to reduce to a type OID and typmod. regards, tom lane
Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Bottom line is that somebody failed to consider the possibility of a null comparison value reaching the BRIN index lookup machinery. The code stanza that's failing supposes that only IS NULL or IS NOT NULL tests could have SK_ISNULL set, but that's just wrong. I think the easiest way to solve this is to consider that all indexable operators are strict, and have the function return false in that case. The attached patch implements that. This looks fine to me as a localized fix. I was wondering whether we could short-circuit the index lookup further upstream, but I take it from your comment about _bt_preprocess_keys that BRIN has no convenient place for that today. (Even if it did, I'd still vote for making this change, for safety's sake.) Yeah, it doesn't currently. Hopefully we will improve that in the future. Particularly with regards to array keys I think there's a lot to be done there. I pushed this as is. I hesitated about adding a regression test, but it didn't seem worthwhile in the end, because if in the future we improve scan key analysis, we will need much more extensive testing, and this doesn't look like the type of bug that we could reintroduce on a whim. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Fri, Jul 31, 2015 at 8:30 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Fri, Jul 24, 2015 at 4:05 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Jul 24, 2015 at 7:11 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Thu, Jul 2, 2015 at 2:03 PM, Simon Riggs si...@2ndquadrant.com wrote: Looks functionally complete. Need a test to show that ALTER TABLE works on views, as discussed on this thread. And confirmation that pg_dump is not broken by this. Message-ID: 20140321205828.gb3969...@tornado.leadboat.com Added more test cases to cover ALTER TABLE on views. I'm thinking about the isolation tests; what about adding another 'alter-table' spec for isolation tests enabling and disabling 'autovacuum' options? Yes, please. Added. I really don't know if my isolation tests are completely correct; it's my first time writing this kind of test. This patch size has increased from 16k to 157k because of the output of the isolation tests you just added. This is definitely too large and actually the test coverage is rather limited. Hence I think that they should be changed as follows:

- Use only one table; the locks of tables a and b do not conflict, and that's what we want to look at here.
- Check the locks of all the relation parameters, by that I mean as well fillfactor and user_catalog_table which still take AccessExclusiveLock on the relation.
- Use custom permutations. I doubt that there is much sense to test all the permutations in this context, and this will reduce the expected output size drastically.

On further notice, I would recommend not to use the same string name for the session and the query identifiers. And I think that inserting only one tuple at initialization is just fine. 
Here is an example of such a spec:

setup
{
  CREATE TABLE a (id int PRIMARY KEY);
  INSERT INTO a SELECT generate_series(1,100);
}

teardown
{
  DROP TABLE a;
}

session "s1"
step "b1"   { BEGIN; }
# TODO add one query per parameter
step "at11" { ALTER TABLE a SET (fillfactor=10); }
step "at12" { ALTER TABLE a SET (autovacuum_vacuum_scale_factor=0.001); }
step "c1"   { COMMIT; }

session "s2"
step "b2"   { BEGIN; }
step "wx1"  { UPDATE a SET id = id + 1; }
step "c2"   { COMMIT; }

And by testing permutations like this, for example, it is possible to see which session is waiting for what. Input:

permutation "b1" "b2" "at11" "wx1" "c1" "c2"

Output where session 2 waits for the lock taken after the fillfactor update:

step b1: BEGIN;
step b2: BEGIN;
step at11: ALTER TABLE a SET (fillfactor=10);
step wx1: UPDATE a SET id = id + 1; <waiting ...>
step c1: COMMIT;
step wx1: <... completed>
step c2: COMMIT;

Be careful as well not to include incorrect permutations in the expected output file; the isolation tester is smart enough to ping you about that.

+GetRelOptionsLockLevel(List *defList)
+{
+    LOCKMODE    lockmode = NoLock;

Shouldn't this default to AccessExclusiveLock instead of NoLock? No, because it will break the logic below (get the highest lock level according to the defList), but I force it to return AccessExclusiveLock if defList == NIL. Yep, OK with this change. The rest of the patch looks good to me, so once the isolation test coverage is done I think that it could be put in the hands of a committer. Regards -- Michael
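The GetRelOptionsLockLevel logic under discussion can be sketched in a few lines: take the strongest lock required by any option in the list, and fall back to the strongest lock when the list is empty. This is an illustrative model, not the patch itself; the option-to-lock table and the numeric lock ordering are simplifying assumptions.

```python
# Illustrative lock levels, ordered weakest to strongest (a simplification
# of PostgreSQL's real lock modes).
NO_LOCK, SHARE_UPDATE_EXCLUSIVE, ACCESS_EXCLUSIVE = 0, 4, 8

# Hypothetical per-option lock requirements, mirroring the discussion:
# autovacuum options can use a weaker lock, while fillfactor and
# user_catalog_table still take AccessExclusiveLock.
OPTION_LOCK_LEVEL = {
    "autovacuum_enabled": SHARE_UPDATE_EXCLUSIVE,
    "autovacuum_vacuum_scale_factor": SHARE_UPDATE_EXCLUSIVE,
    "fillfactor": ACCESS_EXCLUSIVE,
    "user_catalog_table": ACCESS_EXCLUSIVE,
}

def get_reloptions_lock_level(def_list):
    # Empty list: force the strongest lock, matching the defList == NIL case.
    if not def_list:
        return ACCESS_EXCLUSIVE
    lockmode = NO_LOCK
    for option in def_list:
        # Unknown options conservatively require the strongest lock.
        lockmode = max(lockmode, OPTION_LOCK_LEVEL.get(option, ACCESS_EXCLUSIVE))
    return lockmode
```

Starting from NoLock and taking the maximum is why defaulting the variable itself to AccessExclusiveLock would break the computation.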
Re: [HACKERS] Remaining 'needs review' patchs in July commitfest
On Wed, Jul 29, 2015 at 5:08 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Jul 28, 2015 at 3:51 PM, Heikki Linnakangas hlinn...@iki.fi wrote: plpgsql raise statement with context Impasse. Everyone wants this feature in some form, but no consensus on whether to do this client-side or server-side. +1 for server-side. Does anyone other than you even think that the client side is a reasonable way to go? Yes. This is presupposing on the server side what the client will want to display. Fair enough. I'm still not convinced we're doing anything other than complicating what ought to be a simple matter. It is just a fact that logging tracing messages in PL/pgsql functions is a pain in the butt right now in some situations because you get a huge number of CONTEXT lines that you don't want. Can we agree on some solution to that problem without over-engineering this to infinity? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_dump quietly ignore missing tables - is it bug?
On 07/25/2015 07:08 PM, Pavel Stehule wrote: I am sending a new patch - without checking wildcard chars. The documentation says the option is called --strict-names, while the code has --strict-mode. I like --strict-names more, mode seems redundant, and it's not clear what it's strict about. For symmetry, it would be good to also support this option in pg_restore. It seems even more useful there. Can we do better than issuing a separate query for each table/schema name? The performance of this isn't very important, but still it seems like you could fairly easily refactor the code to avoid that. Perhaps return an extra constant for part of the UNION to distinguish which result row came from which pattern, and check that at least one row is returned for each. - Heikki
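Heikki's suggestion, in other words: run one UNION ALL query where each branch selects its pattern's index as a constant column, then check client-side that every index appears in the result. The sketch below models only that client-side check; the function name and the error type are illustrative, not from the patch.

```python
def check_strict_names(patterns, matched_rows):
    """Verify every name pattern matched at least one object.

    matched_rows holds (pattern_index, object_name) pairs, as would come back
    from a single UNION ALL query in which each branch tags its rows with the
    index of the pattern it was built from. Raises LookupError naming the
    patterns that matched nothing.
    """
    seen = {idx for idx, _name in matched_rows}
    missing = [p for i, p in enumerate(patterns) if i not in seen]
    if missing:
        raise LookupError("no matching tables were found for: "
                          + ", ".join(missing))
```

One round trip replaces a separate existence query per pattern, and the tag column is what makes it possible to tell which pattern an unmatched error should name.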
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 30 July 2015 at 08:00, Heikki Linnakangas hlinn...@iki.fi wrote: Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory array must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. Agreed. This is a good optimization for the common case where tuples are mostly already in order. We could increase the usefulness of this by making UPDATE pick blocks that are close to a tuple's original block, rather than putting them near the end of a relation. So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. It's about use of memory. If you have multiple runs on tape, then they will need to be merged and you need memory to do that efficiently. If there are tuples in the last batch still in memory then it can work, but it depends upon how full memory is from the last batch and how many batches there are. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Autonomous Transaction is back
On 28 July 2015 03:21, Josh Berkus wrote: On 07/27/2015 02:47 AM, Rajeev rastogi wrote: Why have any fixed maximum? Since we are planning to have nested autonomous transactions, it is required to have a limit on this so that resources can be controlled. Is there a particular reason why this limit wouldn't just be max_stack_depth? We will need to allocate some initial resources in order to handle all nested autonomous transactions. So I think it is better to have a separate configuration parameter. Thanks and Regards, Kumar Rajeev Rastogi
Re: [HACKERS] Configurable location for extension .control files
On 07/02/2015 11:37 PM, Oskari Saarenmaa wrote: I'm somewhat interested in both #1 and #2 for other projects, but I wrote this patch to address #3, i.e. to simplify the test setup we have in place for pgmemcache (https://github.com/ohmu/pgmemcache/blob/master/localtests.sh) and other extensions. I'd like to be able to set up a local PG cluster in /tmp or some other location and load the extensions I just built in there. Now that's a laudable goal. It indeed would be nice to be able to do make check to test an extension, using pgxs. The way make check within the PostgreSQL source tree works is that it performs make install to install PostgreSQL to a temporary location, and installs the extension to that. We can't use make install to create a new PostgreSQL installation in an extension, but perhaps you could have a substitute of that that copies an existing installation to a temporary location. I hacked that pgmemcache localtests.sh script to do just that, see attached. That's usable for pgmemcache as it is, but for a general facility, you'd need to put that logic into pgxs instead, and it should also take care of running initdb and starting and stopping the cluster. But pg_regress can already do those things, so that should be easy. So, I think that's the direction this should be going. In summary, the goal is to make make check work for extensions, and the way to do that is to teach pgxs to create a temporary installation. 
- Heikki diff --git a/.travis.yml b/.travis.yml index b30af7e..9197690 100644 --- a/.travis.yml +++ b/.travis.yml @@ -11,7 +11,7 @@ script: - PATH=/usr/lib/postgresql/$PGVER/bin/:$PATH ./localtests.sh after_failure: - - cat regressiondata/regression.diffs + - cat regression.diffs env: - PGVER=9.1 diff --git a/localtests.sh b/localtests.sh index d43e15f..956e300 100755 --- a/localtests.sh +++ b/localtests.sh @@ -2,17 +2,55 @@ # Create a clean PostgreSQL cluster for our testing -TESTDIR=$(pwd)/regressiondata +TESTDIR=$(pwd)/tmp_check export PGPORT=$((10240 + RANDOM / 2)) -export PGDATA=$TESTDIR/pg +export PGDATA=$TESTDIR/data rm -rf $TESTDIR + +# Create a minimal PostgreSQL installation, by copying an existing +# installation. This is a replacement for doing make install in the +# PostgreSQL source tree, for when you don't have the source tree available. +# +# The minimum we need to copy from the existing installation are server 'lib' +# and 'share' directories, and a few binaries. +# +# Note that 'pg_config --libdir' is the path to client-side libraries, i.e. +# libpq, and we don't want to copy that. (on many distributions, libdir points +# to just /usr/lib, and we certainly don't want to copy that in whole). +# Also note that we cannot use symlinks for the binaries, because the binaries +# look at the absolute path they are run from to find the rest of the files +# they need. 
+ +TMP_INSTALL=$TESTDIR/install + +bindir=`pg_config --bindir` +pkglibdir=`pg_config --pkglibdir` +sharedir=`pg_config --sharedir` + +mkdir -p $TMP_INSTALL +mkdir -p $TMP_INSTALL$bindir +mkdir -p $TMP_INSTALL$pkglibdir +mkdir -p $TMP_INSTALL$sharedir + +cp -a $bindir/postgres $TMP_INSTALL$bindir +cp -a $bindir/initdb $TMP_INSTALL$bindir +cp -a $bindir/pg_ctl $TMP_INSTALL$bindir +cp -a $pkglibdir/* $TMP_INSTALL$pkglibdir +cp -a $sharedir/* $TMP_INSTALL$sharedir + +export PATH=$TMP_INSTALL$bindir:$PATH + +# Install pgmemcache to the temporary installation +make install DESTDIR=$TMP_INSTALL + +# Set up a temporary cluster mkdir -p $PGDATA initdb -E UTF-8 --no-locale +# XXX: Should use pg_config --config-auth to lock down the cluster sed -e s%^#port =.*%port = $PGPORT% \ -e s%^#\(unix_socket_director[a-z]*\) =.*%\1 = '$PGDATA'% \ --e s%^#dynamic_library_path = .*%dynamic_library_path = '$(pwd):\$libdir'% \ -e s%^#fsync = .*%fsync = off% \ -e s%^#synchronous_commit = .*%synchronous_commit = off% \ -i $PGDATA/postgresql.conf @@ -20,15 +58,5 @@ pg_ctl -l $PGDATA/log start while [ ! -S $PGDATA/.s.PGSQL.$PGPORT ]; do sleep 2; done trap pg_ctl stop -m immediate EXIT -# It's not possible to override the extension path, so we'll just execute -# the extension SQL directly after mangling it a bit with sed - -cp -a Makefile test.sql sql/ expected/ $TESTDIR -sed -e s%MODULE_PATHNAME%pgmemcache% \ --e /CREATE EXTENSION/d -e /^--/d -e /^$/d \ -ext/pgmemcache.sql $TESTDIR/sql/init.sql -cp $TESTDIR/sql/init.sql $TESTDIR/expected/init.out - -# Run the actual tests - -make -C $TESTDIR installcheck REGRESS_OPTS=--host=$PGDATA --port=$PGPORT +# Run the actual tests in the temporary cluster. +make installcheck REGRESS_OPTS=--host=$PGDATA --port=$PGPORT -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/07/07 19:15, Etsuro Fujita wrote: On 2015/07/06 9:42, Kouhei Kaigai wrote: However, we need to pay attention on advantages towards the alternatives. Hooks on add_paths_to_joinrel() enables to implement identical stuff, with less complicated logic to reproduce left / right relations from RelOptInfo of the joinrel. (Note that RelOptInfo-fdw_private enables to avoid path- construction multiple times.) I'm uncertain why this API change is necessary to fix up the problem around EvalPlanQual. Yeah, maybe we wouldn't need any API change. I think we would be able to fix this by complicating add_path as you pointed out upthread. I'm not sure that complicating it is a good idea, though. I think that it might be possible that the callback in standard_join_search would allow us to fix this without complicating the core path-cost-comparison stuff such as add_path. I noticed that what I proposed upthread doesn't work properly, though. To resolve this issue, I tried to make the core create an alternative plan that will be used in an EvalPlanQual recheck, instead of a foreign scan that performs a foreign join remotely (ie, scanrelid = 0). But I changed that idea. Instead, I'd like to propose that it's the FDW's responsibility to provide such a plan. Specifically, I'd propose that (1) we add a new Path field, say subpath, to the ForeignPath data structure and that (2) when generating a ForeignPath node for a foreign join, an FDW must provide the subpath Path node by itself. As before, it'd be recommended to use ForeignPath * create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel, double rows, Cost startup_cost, Cost total_cost, List *pathkeys, Relids required_outer, Path *subpath, List *fdw_private) where subpath is the subpath Path node that has the pathkeys and the required_outer rels. (subpath is NULL if scanning a base relation.) 
Also, it'd be recommended that an FDW generate such ForeignPath nodes by considering, for each of the paths in the rel's pathlist, whether to push down that path (to generate a ForeignPath node for a foreign join that has the same pathkeys and parameterization as that path). So, if generating the ForeignPath node, that path could be used as the subpath Path node. (I think the current postgres_fdw patch only considers an unsorted, unparameterized path for performing a foreign join remotely, but I think we should also consider presorted and/or parameterized paths.) I think this idea would apply to the API location that you proposed. However, ISTM that this idea would work better for the API call from standard_join_search because the rel's pathlist at that point has more paths worthy of consideration, in view of not only costs and sizes but pathkeys and parameterization. I think the subplan created from the subpath Path node could be used in an EvalPlanQual recheck, instead of a foreign scan that performs a foreign join remotely, as discussed previously. Comments welcome! Best regards, Etsuro Fujita
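The shape of the proposal can be modeled abstractly: each pushed-down foreign-join path carries a local subpath with the same pathkeys, kept around as the recheck plan. The classes, field names, and the flat cost discount below are illustrative assumptions, not the actual planner data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Path:
    pathkeys: tuple          # sort order this path delivers
    total_cost: float

@dataclass
class ForeignPath(Path):
    subpath: Path = None     # local join plan retained for EPQ rechecks
    fdw_private: dict = field(default_factory=dict)

def foreign_join_paths(local_join_paths, remote_cost_factor=0.5):
    # For each local join path in the rel's pathlist, consider a pushed-down
    # foreign join delivering the same pathkeys, and keep the local path as
    # the subpath to fall back on during an EvalPlanQual recheck. The cost
    # factor is an arbitrary stand-in for real remote-cost estimation.
    return [ForeignPath(pathkeys=p.pathkeys,
                        total_cost=p.total_cost * remote_cost_factor,
                        subpath=p)
            for p in local_join_paths]
```

Building one candidate per local path is what lets presorted and parameterized variants be pushed down too, rather than only the unsorted, unparameterized join.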
Re: [HACKERS] We need to support ForeignRecheck for late row locking, don't we?
On 2015/07/27 18:16, Kouhei Kaigai wrote: On 2015/07/24 23:51, Kouhei Kaigai wrote: On 2015/07/22 19:10, Etsuro Fujita wrote: While working on the issue Foreign join pushdown vs EvalPlanQual, I happened to notice odd behaviors of late row locking in FDWs. I think the reason for that is that we don't check pushed-down quals inside an EPQ testing even if what was fetched by RefetchForeignRow was an updated version of the tuple rather than the same version previously obtained. So, to fix this, I'd like to propose that pushed-down quals be checked in ForeignRecheck. * I've modified ForeignRecheck so as to check pushed-down quals whether doing late locking or early locking. Isn't it an option to put a new callback in ForeignRecheck? The FDW driver knows that its private data structure includes the expression node that was pushed down to the remote side. So, it seems to me the best way is to consult the FDW driver on whether the supplied tuple should be visible according to the pushed-down qualifier. More or less, this fix needs a new interface contract around the EvalPlanQual logic. It is better to give the FDW driver more flexibility in its private data structure and in the way it processes the recheck logic, rather than a special-purpose variable. If an FDW driver managed the pushed-down expression in its own format, the pushedDownQual requirement would make it hold the qualifier redundantly. The callback approach does not have that kind of concern. That might be an idea, but is there any performance disadvantage, as discussed in [1]? It looks like that needs to perform another remote query to see if the supplied tuple satisfies the pushed-down quals during EPQ testing. I expect the callback of ForeignRecheck to run ExecQual() over the qualifier expression pushed down but saved in the private data of ForeignScanState. It does not need to issue another remote query (unless the FDW driver is designed to do so), so the performance disadvantage is none or quite limited. The advantages are not clear to me. 
I think the callback approach would be a good idea if FDWs were able to do the recheck more efficiently in their own ways than the core, for example. Best regards, Etsuro Fujita
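The recheck being argued for reduces to a purely local evaluation: take the quals that were shipped to the remote server, and re-apply them to the refetched row without another round trip. In this hedged sketch, plain Python predicates stand in for the saved expression trees that the backend would run through ExecQual().

```python
def foreign_recheck(pushed_down_quals, refetched_tuple):
    # Each qual stands in for a pushed-down expression tree saved in the
    # ForeignScanState's private data. Evaluating them locally against the
    # refetched tuple avoids issuing a second remote query during EPQ testing.
    return all(qual(refetched_tuple) for qual in pushed_down_quals)
```

If the refetched row is a newer version that no longer satisfies the shipped quals, the recheck returns False and the EPQ machinery discards it, which is exactly the behavior the unpatched code misses.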
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 07/30/2015 01:47 PM, Simon Riggs wrote: On 30 July 2015 at 08:00, Heikki Linnakangas hlinn...@iki.fi wrote: So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. It's about use of memory. If you have multiple runs on tape, then they will need to be merged and you need memory to do that efficiently. If there are tuples in the last batch still in memory then it can work, but it depends upon how full memory is from the last batch and how many batches there are. True, you need a heap to hold the next tuple from each tape in the merge step. To avoid exceeding work_mem, you'd need to push some tuples from the in-memory array to the tape to make room for that. In practice, though, the memory needed for the merge step's heap is tiny. Even if you merge 1000 tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll need some logic to share/divide the in-memory array between the heap and the in-memory tail of the last tape. - Heikki
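The structure being discussed can be sketched in miniature: quicksort the in-memory tail and present it to the merge as if it were one more run on tape. This is an illustrative sketch, not tuplesort code; Python lists stand in for tapes, list.sort() for the quicksort, and heapq.merge hides the small one-entry-per-run heap that the merge step maintains.

```python
import heapq

def final_merge(tape_runs, in_memory_tail):
    # tape_runs: sorted runs already spilled to tape.
    # in_memory_tail: the remainder of the final run still held in work_mem;
    # sort it and treat it as one more run, instead of heapsorting it out to
    # the last tape and reading it back.
    in_memory_tail.sort()
    runs = list(tape_runs) + [in_memory_tail]
    # heapq.merge keeps a tiny heap holding the next tuple from each run,
    # mirroring the merge step's per-tape heap Heikki describes.
    return list(heapq.merge(*runs))
```

In the single-run case the tape list is empty and this degenerates into a plain quicksort of the in-memory array, which is the original optimization.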
[HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 12:09 PM, Heikki Linnakangas hlinn...@iki.fi wrote: True, you need a heap to hold the next tuple from each tape in the merge step. To avoid exceeding work_mem, you'd need to push some tuples from the in-memory array to the tape to make room for that. In practice, though, the memory needed for the merge step's heap is tiny. Even if you merge 1000 tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll need some logic to share/divide the in-memory array between the heap and the in-memory tail of the last tape. It's a bit worse than that because we buffer up a significant chunk of the tape to avoid randomly seeking between tapes after every tuple read. But I think in today's era of large memory we don't need anywhere near the entire work_mem just to buffer to avoid random access. Something simple like a fixed buffer size per tape, probably much less than 1MB/tape, would do. I'm a bit confused where the big win comes from though. Is what's going on that the external sort only exceeded memory by a small amount, so nearly all the tuples are still in memory? -- greg
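The fixed per-tape buffer Greg describes can be sketched as follows. This is a toy model, not PostgreSQL's logtape code: an iterator stands in for a tape, the tiny buffer_size stands in for the fixed per-tape buffer, and each refill models one seek-and-read against the tape.

```python
class BufferedTape:
    """Read tuples from a tape in fixed-size chunks, so the merge does not
    seek to a different tape after every single tuple it consumes."""

    def __init__(self, tape, buffer_size=4):
        self._tape = iter(tape)
        self._buffer = []
        self._buffer_size = buffer_size
        self.refills = 0   # each refill models one random seek on the tape

    def next_tuple(self):
        if not self._buffer:
            # Pull up to buffer_size tuples from the tape in one go.
            self._buffer = [t for _, t in
                            zip(range(self._buffer_size), self._tape)]
            if not self._buffer:
                return None   # run exhausted
            self.refills += 1
        return self._buffer.pop(0)
```

With a buffer of B tuples, a run of N tuples costs roughly N/B seeks instead of N, which is why a modest fixed buffer per tape, rather than a large share of work_mem, is enough to keep the merge mostly sequential.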