Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Mon, Sep 8, 2014 at 10:39 AM, Amit Kapila amit.kapil...@gmail.com wrote:

On Sat, Jul 26, 2014 at 9:32 PM, Robert Haas robertmh...@gmail.com wrote:
On Fri, Jul 25, 2014 at 4:16 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
On Fri, Jul 25, 2014 at 02:11:32PM -0400, Robert Haas wrote:

+	pq_mq_busy = true;
+
+	iov[0].data = &msgtype;
+	iov[0].len = 1;
+	iov[1].data = s;
+	iov[1].len = len;
+
+	Assert(pq_mq_handle != NULL);
+	result = shm_mq_sendv(pq_mq_handle, iov, 2, false);
+
+	pq_mq_busy = false;

Don't you need a PG_TRY block here to reset pq_mq_busy?

No. If shm_mq_sendv is interrupted, we can't use the shm_mq any more. But since that should only happen if an interrupt arrives while the queue is full, I think that's OK.

I think it is not only an interrupt: any other error in the shm_mq_sendv() path (one example is WaitLatch()) could lead to the same behaviour.

(Think about the alternatives: if the queue is full, we have no way of notifying the launching process without waiting for it to retrieve the results, but it might not do that right away, and if we've been killed we need to die *now*, not later.)

So in such cases, what is the advice to users? Currently they will see the message below:

postgres=# select * from pg_background_result(5124) as (x int);
ERROR:  lost connection to worker process with PID 5124

One way is to ask them to check the logs, but what if they want to handle the error and take some action?

Another point about error handling: to execute the SQL in pg_background_worker_main(), it starts a transaction which I think doesn't get aborted if an error occurs. For this I was just referring to the error handling code of StartBackgroundWorker(); however, during shutdown of the process it will call AbortOutOfAnyTransaction(), which will take care of aborting the transaction. Similarly, handling for timeout seems to be missing in the error path.
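For reference, the PG_TRY-based alternative Alvaro is suggesting would look roughly like this (a non-runnable pseudocode sketch of PostgreSQL's error-cleanup idiom, not code from the patch):

```
pq_mq_busy = true;

PG_TRY();
{
	iov[0].data = &msgtype;
	iov[0].len = 1;
	iov[1].data = s;
	iov[1].len = len;

	Assert(pq_mq_handle != NULL);
	result = shm_mq_sendv(pq_mq_handle, iov, 2, false);
}
PG_CATCH();
{
	/* reset the flag even if shm_mq_sendv() errors out */
	pq_mq_busy = false;
	PG_RE_THROW();
}
PG_END_TRY();

pq_mq_busy = false;
```

Robert's point is that this cleanup buys nothing here: once shm_mq_sendv() has failed, the queue cannot be used any more, so leaving pq_mq_busy set is harmless.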
As we are anyway going to exit on error, I am not sure this will be required; however, having it for a clean exit seems better. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
2014-09-08 6:27 GMT+02:00 Stephen Frost sfr...@snowman.net: * Pavel Stehule (pavel.steh...@gmail.com) wrote:

ignore_nulls in array_to_json_pretty probably is not necessary. On the other hand, the cost is zero, and we can have it for API consistency.

I'm willing to be persuaded either way on this, really. I do think it would be nice for both array_to_json and row_to_json to be single functions which take default values, but as for whether array_to_json has an ignore_nulls option, I'm on the fence and would defer to people who use that function regularly (I don't today). Beyond that, I'm pretty happy moving forward with this patch.

OK. Regards, Pavel

Thanks, Stephen
Re: [HACKERS] gist vacuum gist access
On 09/07/2014 05:11 PM, Костя Кузнецов wrote:

Hello. I reworked vacuum for the GiST index. All tests pass. I also tested vacuum on a table with 2 million rows; all is OK. On my machine the old vacuum takes about 9 seconds; this version takes about 6-7 seconds. Please review.

If I'm reading this correctly, the patch changes gistbulkdelete to scan the index in physical order, while the old code starts from the root and scans the index from left to right, in logical order. Scanning the index in physical order is wrong if any index pages are split while vacuum runs. A page split could move some tuples to a lower-numbered page, so that the vacuum will not scan those tuples.

In the b-tree code, we solved that problem back in 2006, so it can be done, but it requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
I wrote: I gave it a spin and could not find any undesirable behaviour, and the output of EXPLAIN ANALYZE looks like I'd expect. I noticed that you use the list length of fdw_private to check if the UPDATE or DELETE is pushed down to the remote server or not. While this works fine, I wonder if it wouldn't be better to have some explicit flag in fdw_private for that purpose. Future modifications that change the list length might easily overlook that it is used for this purpose, thereby breaking the code. Other than that it looks alright to me. Maybe I should have mentioned that I have set the patch to Waiting for Author because I'd like to hear your opinion on that, but I'm prepared to set it to Ready for Committer soon. Yours, Laurenz Albe
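For illustration, the explicit flag Laurenz suggests could be stored as a dedicated member of the fdw_private list, e.g. (a hypothetical pseudocode sketch using PostgreSQL's generic List/node helpers; the actual layout is the patch author's choice):

```
/* at plan time: record explicitly whether the statement is pushed down */
fdw_private = list_make2(makeInteger(pushed_down ? 1 : 0),
                         other_private_state);

/* at execution time: test the flag instead of the list length */
pushed_down = intVal(list_nth(fdw_private, 0)) != 0;
```

This keeps the check robust even if later patches append more items to the list.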
Re: [HACKERS] gist vacuum gist access
On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 09/07/2014 05:11 PM, Костя Кузнецов wrote: Hello. I reworked vacuum for the GiST index. All tests pass. I also tested vacuum on a table with 2 million rows; all is OK. On my machine the old vacuum takes about 9 seconds; this version takes about 6-7 seconds. Please review.

If I'm reading this correctly, the patch changes gistbulkdelete to scan the index in physical order, while the old code starts from the root and scans the index from left to right, in logical order. Scanning the index in physical order is wrong if any index pages are split while vacuum runs. A page split could move some tuples to a lower-numbered page, so that the vacuum will not scan those tuples. In the b-tree code, we solved that problem back in 2006, so it can be done, but it requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID.

The idea is right. But in fact, does GiST ever recycle any page? It has the F_DELETED flag, but ISTM this flag is never set. So, I think it's possible that this patch is working correctly. However, GiST probably sometimes leaves a new page unused due to a server crash. Anyway, I'm not a fan of committing the patch in this shape. We need to let GiST recycle pages first, then implement VACUUM similar to b-tree. -- With best regards, Alexander Korotkov.
Re: [HACKERS] gist vacuum gist access
On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 09/07/2014 05:11 PM, Костя Кузнецов wrote: Hello. I reworked vacuum for the GiST index. All tests pass. I also tested vacuum on a table with 2 million rows; all is OK. On my machine the old vacuum takes about 9 seconds; this version takes about 6-7 seconds. Please review.

If I'm reading this correctly, the patch changes gistbulkdelete to scan the index in physical order, while the old code starts from the root and scans the index from left to right, in logical order. Scanning the index in physical order is wrong if any index pages are split while vacuum runs. A page split could move some tuples to a lower-numbered page, so that the vacuum will not scan those tuples. In the b-tree code, we solved that problem back in 2006, so it can be done, but it requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID.

The idea is right. But in fact, does GiST ever recycle any page? It has the F_DELETED flag, but ISTM this flag is never set. So, I think it's possible that this patch is working correctly. However, GiST probably sometimes leaves a new page unused due to a server crash. Anyway, I'm not a fan of committing the patch in this shape. We need to let GiST recycle pages first, then implement VACUUM similar to b-tree.

Another note: assuming we have the NSN, which can play the role of the vacuum cycle ID, can we implement a sequential (with possible jump-back) index scan for GiST? -- With best regards, Alexander Korotkov.
Re: [HACKERS] gist vacuum gist access
On 09/08/2014 11:08 AM, Alexander Korotkov wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 09/07/2014 05:11 PM, Костя Кузнецов wrote: Hello. I reworked vacuum for the GiST index. All tests pass. I also tested vacuum on a table with 2 million rows; all is OK. On my machine the old vacuum takes about 9 seconds; this version takes about 6-7 seconds. Please review.

If I'm reading this correctly, the patch changes gistbulkdelete to scan the index in physical order, while the old code starts from the root and scans the index from left to right, in logical order. Scanning the index in physical order is wrong if any index pages are split while vacuum runs. A page split could move some tuples to a lower-numbered page, so that the vacuum will not scan those tuples. In the b-tree code, we solved that problem back in 2006, so it can be done, but it requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID.

The idea is right. But in fact, does GiST ever recycle any page? It has the F_DELETED flag, but ISTM this flag is never set. So, I think it's possible that this patch is working correctly. However, GiST probably sometimes leaves a new page unused due to a server crash.

Hmm. Its correctness depends on the fact that when a page is split, the new right page is always put on a higher-numbered page than the old page, which hinges on the fact that we never recycle pages. We used to delete pages in VACUUM FULL, but that got removed when old-style VACUUM FULL was changed to just rewrite the whole heap. So if you have pg_upgraded from an old index, it's possible that you still have some F_DELETED pages in the tree. And then there's the case of unused pages after a crash that you mentioned. So no, you can't really rely on that. We could of course remove the code that checks the FSM altogether, and always extend the relation on page split. But of course the long-term solution is to allow recycling pages.

Anyway, I'm not a fan of committing the patch in this shape. We need to let GiST recycle pages first, then implement VACUUM similar to b-tree.

+1. Although I guess we could implement the b-tree style strategy first, and implement page recycling later. It's just a bit hard to test that it's correct when there is no easy way to get deleted pages into the index to begin with. - Heikki
Re: [HACKERS] gist vacuum gist access
On 09/08/2014 11:19 AM, Alexander Korotkov wrote: On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

In the b-tree code, we solved that problem back in 2006, so it can be done, but it requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID.

...

Another note: assuming we have the NSN, which can play the role of the vacuum cycle ID, can we implement a sequential (with possible jump-back) index scan for GiST?

Yeah, I think it would work. It's pretty straightforward: the page split code already sets the NSN just when we need it. Vacuum needs to memorize the current NSN when it begins, and whenever it sees a page with a higher NSN (or with the FOLLOW_RIGHT flag set), follow the right-link if it points to a lower-numbered page. - Heikki
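Heikki's description of the b-tree-style strategy, adapted to GiST's NSN, can be summarized in pseudocode (a sketch of the idea only, not code from any patch):

```
vacuum_nsn = current NSN at the start of vacuum

for blkno in 0 .. last block of index:          /* physical order */
    page = read block blkno
    vacuum the tuples on page
    /* a concurrent split may have moved tuples to a lower-numbered page */
    while page's NSN > vacuum_nsn or FOLLOW_RIGHT is set on page:
        next = right-link of page
        if next > blkno:
            break           /* the physical scan will reach it anyway */
        page = read block next
        vacuum the tuples on page
```

The jump-back happens only when the right-link points below the current scan position; pages split to higher block numbers are simply picked up later by the physical scan.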
Re: [HACKERS] add modulo (%) operator to pgbench
Hi, Here is the review result.

#1. Patch compatibility
There is a little bit of hunk offset, but it applies to the latest master.

#2. Functionality
No problem.

#3. Documentation
I think the modulo operator explanation should be put last in the doc line, because the other operators are more frequently used.

#4. Algorithm
You proposed three modulo algorithms: 1. truncated ("general") modulo, 2. floored modulo, and 3. Euclidean modulo. They calculate different values when the divisor or the dividend is a negative number. Example calculations are here:

1. truncated modulo (patch1): 5 % 3 = 2, 5 % -3 = 2, -5 % 3 = -2 (remainder keeps the sign of the dividend)
2. floored modulo (patch2, 3): 5 % 3 = 2, 5 % -3 = -1, -5 % 3 = 1 (remainder keeps the sign of the divisor)
3. Euclidean modulo (patch2): 5 % 3 = 2, 5 % -3 = 2, -5 % 3 = 1 (remainder is always non-negative)

That's all. I think if we want an equal-probability and intuitive random generator, we should select floored modulo, as you can see in the example above: it can create the contrapositive random number; methods 1 and 3 cannot. I think the Euclidean modulo is not needed by most people; if we add it, many people will be confused, because they don't know the mathematical algorithms. So I like patch3, which is simple and practical. If you agree with or reply to my comments, I will mark the patch Ready for Committer.

Best Regards, -- Mitsumasa KONDO
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
(2014/09/08 16:18), Albe Laurenz wrote: I wrote: I gave it a spin and could not find any undesirable behaviour, and the output of EXPLAIN ANALYZE looks like I'd expect.

Thank you for the review!

I noticed that you use the list length of fdw_private to check if the UPDATE or DELETE is pushed down to the remote server or not. While this works fine, I wonder if it wouldn't be better to have some explicit flag in fdw_private for that purpose. Future modifications that change the list length might easily overlook that it is used for this purpose, thereby breaking the code. Other than that it looks alright to me. Maybe I should have mentioned that I have set the patch to Waiting for Author because I'd like to hear your opinion on that, but I'm prepared to set it to Ready for Committer soon.

I agree with you on that point. So, I've updated the patch to have the explicit flag, as you proposed. Attached is the updated version of the patch. In this version, I've also revised the code and its comments a bit. Sorry for the delay.

Best regards, Etsuro Fujita

*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
*** 188,197 **** is_foreign_expr(PlannerInfo *root,
  	if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt))
  		return false;
  
- 	/* Expressions examined here should be boolean, ie noncollatable */
- 	Assert(loc_cxt.collation == InvalidOid);
- 	Assert(loc_cxt.state == FDW_COLLATE_NONE);
- 
  	/*
  	 * An expression which includes any mutable functions can't be sent over
  	 * because its result is not stable.  For example, sending now() remote
--- 188,193 ----

*** 927,932 **** deparseUpdateSql(StringInfo buf, PlannerInfo *root,
--- 923,982 ----
  }
  
  /*
+  * deparse remote UPDATE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root,
+ 					   Index rtindex, Relation rel,
+ 					   List *remote_conds,
+ 					   List *targetlist,
+ 					   List *targetAttrs,
+ 					   List *returningList,
+ 					   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	List	   *params_list = NIL;
+ 	deparse_expr_cxt context;
+ 	bool		first;
+ 	ListCell   *lc;
+ 
+ 	/* Set up context struct for recursion */
+ 	context.root = root;
+ 	context.foreignrel = baserel;
+ 	context.buf = buf;
+ 	context.params_list = &params_list;
+ 
+ 	appendStringInfoString(buf, "UPDATE ");
+ 	deparseRelation(buf, rel);
+ 	appendStringInfoString(buf, " SET ");
+ 
+ 	first = true;
+ 	foreach(lc, targetAttrs)
+ 	{
+ 		int			attnum = lfirst_int(lc);
+ 		TargetEntry *tle = get_tle_by_resno(targetlist, attnum);
+ 
+ 		if (!first)
+ 			appendStringInfoString(buf, ", ");
+ 		first = false;
+ 
+ 		deparseColumnRef(buf, rtindex, attnum, root);
+ 		appendStringInfo(buf, " = ");
+ 		deparseExpr((Expr *) tle->expr, &context);
+ 	}
+ 	if (remote_conds)
+ 		appendWhereClause(buf, root, baserel, remote_conds,
+ 						  true, &params_list);
+ 
+ 	deparseReturningList(buf, root, rtindex, rel, false,
+ 						 returningList, retrieved_attrs);
+ }
+ 
+ /*
   * deparse remote DELETE statement
   *
   * The statement text is appended to buf, and we also create an integer List

*** 949,954 **** deparseDeleteSql(StringInfo buf, PlannerInfo *root,
--- 999,1031 ----
  }
  
  /*
+  * deparse remote DELETE statement
+  *
+  * The statement text is appended to buf, and we also create an integer List
+  * of the columns being retrieved by RETURNING (if any), which is returned
+  * to *retrieved_attrs.
+  */
+ void
+ deparseDirectDeleteSql(StringInfo buf, PlannerInfo *root,
+ 					   Index rtindex, Relation rel,
+ 					   List *remote_conds,
+ 					   List *returningList,
+ 					   List **retrieved_attrs)
+ {
+ 	RelOptInfo *baserel = root->simple_rel_array[rtindex];
+ 	List	   *params_list = NIL;
+ 
+ 	appendStringInfoString(buf, "DELETE FROM ");
+ 	deparseRelation(buf, rel);
+ 	if (remote_conds)
+ 		appendWhereClause(buf, root, baserel, remote_conds,
+ 						  true, &params_list);
+ 
+ 	deparseReturningList(buf, root, rtindex, rel, false,
+ 						 returningList, retrieved_attrs);
+ }
+ 
+ /*
   * Add a RETURNING clause, if needed, to an INSERT/UPDATE/DELETE.
   */
  static void

*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
*** 998,1004 ****
--- 998,1025 ----
  (3 rows)
  
  INSERT INTO ft2 (c1,c2,c3) VALUES (1104,204,'ddd'), (1105,205,'eee');
+ EXPLAIN (verbose, costs off)
+ UPDATE ft2 SET c2 = c2 + 300, c3 = c3 || '_update3' WHERE c1 % 10 = 3;  -- can be pushed down
+                 QUERY PLAN
+
Re: [HACKERS] pg_receivexlog and replication slots
On Wed, Sep 3, 2014 at 11:40 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Aug 31, 2014 at 9:45 AM, Magnus Hagander mag...@hagander.net wrote:

Do we really want those Asserts? There is not a single Assert in bin/pg_basebackup today - as is the case for most things in bin/. We typically use regular if statements for things that can happen, and just ignore the others, I think - since the callers are fairly simple to trace.

I have no opinion on whether we want these particular Assert() calls, but I note that using Assert() in front-end code only became possible in February of 2013, as a result of commit e1d25de35a2b1f809e8f8d7b182ce0af004f3ec9. So the lack of assertions there may not be so much because people thought it was a bad idea as that it didn't use to work. Generally, I favor the use of Assert() in front-end code in the same scenarios in which we use it in back-end code: for checks that shouldn't burden production builds but are useful during development.

Well, that was exactly why they were added in the first place. The assertions have been placed in some functions to check for incompatible combinations of argument values when those functions are called. I don't mind re-adding them if people agree that they make sense. IMO they do, and they will help the development of future utilities using the replication protocol in src/bin/pg_basebackup, as much as the refactoring work done on this thread does. Regards, -- Michael
Re: [HACKERS] Spinlocks and compiler/memory barriers
On 2014-09-04 14:19:47 +0200, Andres Freund wrote: Yes. I plan to push the patch this weekend. Sorry for the delay. I'm about to push this. Is it ok to first push it to master and only backpatch once a couple buildfarm cycles haven't complained? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] gist vacuum gist access
On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/08/2014 11:19 AM, Alexander Korotkov wrote: On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

In the b-tree code, we solved that problem back in 2006, so it can be done, but it requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID.

...

Another note: assuming we have the NSN, which can play the role of the vacuum cycle ID, can we implement a sequential (with possible jump-back) index scan for GiST?

Yeah, I think it would work. It's pretty straightforward: the page split code already sets the NSN just when we need it. Vacuum needs to memorize the current NSN when it begins, and whenever it sees a page with a higher NSN (or with the FOLLOW_RIGHT flag set), follow the right-link if it points to a lower-numbered page.

I mean that a full index scan feature for SELECT queries might be implemented as well, not only a sequential VACUUM. -- With best regards, Alexander Korotkov.
Re: [HACKERS] postgres_fdw behaves oddly
(2014/09/02 18:55), Etsuro Fujita wrote: (2014/09/01 20:15), Etsuro Fujita wrote: While working on [1], I've found that postgres_fdw behaves oddly:

I've revised the patch a bit further. Please find attached a patch.

Thanks,

Best regards, Etsuro Fujita

*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
*** 252,257 **** foreign_expr_walker(Node *node,
--- 252,265 ----
  			if (var->varno == glob_cxt->foreignrel->relid &&
  				var->varlevelsup == 0)
  			{
+ 				/*
+ 				 * System columns can't be sent to remote.
+ 				 *
+ 				 * XXX: we should probably send ctid to remote.
+ 				 */
+ 				if (var->varattno < 0)
+ 					return false;
+ 
  				/* Var belongs to foreign table */
  				collation = var->varcollid;
  				state = OidIsValid(collation) ? FDW_COLLATE_SAFE : FDW_COLLATE_NONE;

*** 1261,1266 **** deparseVar(Var *node, deparse_expr_cxt *context)
--- 1269,1279 ----
  	if (node->varno == context->foreignrel->relid &&
  		node->varlevelsup == 0)
  	{
+ 		/*
+ 		 * System columns shouldn't be examined here.
+ 		 */
+ 		Assert(node->varattno >= 0);
+ 
  		/* Var belongs to foreign table */
  		deparseColumnRef(buf, node->varno, node->varattno, context->root);
  	}

*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*** 20,25 ****
--- 20,26 ----
  #include <math.h>
  
  #include "access/skey.h"
+ #include "access/sysattr.h"
  #include "catalog/pg_class.h"
  #include "foreign/fdwapi.h"
  #include "miscadmin.h"

*** 1945,1950 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
--- 1946,1953 ----
  	RelOptInfo *rel = best_path->path.parent;
  	Index		scan_relid = rel->relid;
  	RangeTblEntry *rte;
+ 	Bitmapset  *attrs_used = NULL;
+ 	ListCell   *lc;
  	int			i;
  
  	/* it should be a base rel... */

*** 1993,2008 **** create_foreignscan_plan(PlannerInfo *root, ForeignPath *best_path,
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
  	 */
  	scan_plan->fsSystemCol = false;
  	for (i = rel->min_attr; i < 0; i++)
  	{
! 		if (!bms_is_empty(rel->attr_needed[i - rel->min_attr]))
  		{
  			scan_plan->fsSystemCol = true;
  			break;
  		}
  	}
  
  	return scan_plan;
  }
--- 1996,2030 ----
  	 * bit of a kluge and might go away someday, so we intentionally leave it
  	 * out of the API presented to FDWs.
  	 */
+ 
+ 	/*
+ 	 * Add all the attributes needed for joins or final output.  Note: we must
+ 	 * look at reltargetlist, not the attr_needed data, because attr_needed
+ 	 * isn't computed for inheritance child rels.
+ 	 */
+ 	pull_varattnos((Node *) rel->reltargetlist, rel->relid, &attrs_used);
+ 
+ 	/* Add all the attributes used by restriction clauses. */
+ 	foreach(lc, rel->baserestrictinfo)
+ 	{
+ 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ 
+ 		pull_varattnos((Node *) rinfo->clause, rel->relid, &attrs_used);
+ 	}
+ 
+ 	/* Are any system columns requested from rel? */
  	scan_plan->fsSystemCol = false;
  	for (i = rel->min_attr; i < 0; i++)
  	{
! 		if (bms_is_member(i - FirstLowInvalidHeapAttributeNumber, attrs_used))
  		{
  			scan_plan->fsSystemCol = true;
  			break;
  		}
  	}
+ 
+ 	bms_free(attrs_used);
+ 
  	return scan_plan;
  }
Re: [HACKERS] Spinlocks and compiler/memory barriers
On Mon, Sep 8, 2014 at 8:07 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-04 14:19:47 +0200, Andres Freund wrote: Yes. I plan to push the patch this weekend. Sorry for the delay. I'm about to push this. Is it ok to first push it to master and only backpatch once a couple buildfarm cycles haven't complained? That will have the disadvantage that src/tools/git_changelog will show the commits separately instead of grouping them together; so it's probably best not to make a practice of it. But I think it's up to your discretion how to handle it in any particular case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] add modulo (%) operator to pgbench
Hello Mitsumasa-san,

#3. Documentation
I think the modulo operator explanation should be put last in the doc line, because the others are more frequently used.

So I like patch3, which is simple and practical.

Ok.

If you agree or reply to my comment, I will mark it ready for committer.

Please find attached v4, which is v3 plus improved documentation that is clearer about the sign of the remainder.

-- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..a815ad3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -1574,6 +1574,22 @@ top:
 				}
 				snprintf(res, sizeof(res), INT64_FORMAT, ope1 / ope2);
 			}
+			else if (strcmp(argv[3], "%") == 0)
+			{
+				int64_t		remainder;
+
+				if (ope2 == 0)
+				{
+					fprintf(stderr, "%s: division by zero in modulo\n", argv[0]);
+					st->ecnt++;
+					return true;
+				}
+				/* divisor-signed remainder */
+				remainder = ope1 % ope2;
+				if ((ope2 > 0 && remainder < 0) ||
+					(ope2 < 0 && remainder > 0))
+					remainder += ope2;
+				snprintf(res, sizeof(res), INT64_FORMAT, remainder);
+			}
 			else
 			{
 				fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index b479105..66ec622 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -735,7 +735,9 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
       Each <replaceable>operand</> is either an integer constant or a
       <literal>:</><replaceable>variablename</> reference to a variable
       having an integer value.  The <replaceable>operator</> can be
-      <literal>+</>, <literal>-</>, <literal>*</>, or <literal>/</>.
+      <literal>+</>, <literal>-</>, <literal>*</>, <literal>%</> or <literal>/</>.
+      The modulo operation (<literal>%</>) is based on floored division, so
+      that the remainder keeps the sign of the divisor.
      </para>
Re: [HACKERS] Spinlocks and compiler/memory barriers
Andres Freund and...@2ndquadrant.com writes: On 2014-09-04 14:19:47 +0200, Andres Freund wrote: Yes. I plan to push the patch this weekend. Sorry for the delay. I'm about to push this. Is it ok to first push it to master and only backpatch once a couple buildfarm cycles haven't complained? It makes for a cleaner commit history if you push concurrently into all the branches you intend to patch. That also gives more buildfarm runs, which seems like a good thing for this sort of patch. That is, assuming that we ought to backpatch at all, which to my mind is debatable. regards, tom lane
Re: [HACKERS] Spinlocks and compiler/memory barriers
On Mon, Sep 8, 2014 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote: It makes for a cleaner commit history if you push concurrently into all the branches you intend to patch. That also gives more buildfarm runs, which seems like a good thing for this sort of patch. That is, assuming that we ought to backpatch at all, which to my mind is debatable. We're not going to backpatch the main patch to make spinlock primitives act as compiler barriers - or at least, I will object loudly. But what we're talking about here is a bug fix for Sparc. And surely we ought to back-patch that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] PL/pgSQL 2
On Fri, Sep 5, 2014 at 6:18 PM, Andrew Dunstan and...@dunslane.net wrote: On 09/05/2014 12:37 PM, Merlin Moncure wrote: On Thu, Sep 4, 2014 at 6:40 PM, Florian Pflug f...@phlo.org wrote: Cost of hidden IO cast is negative too. If we can change it, then we can increase a sped. But the whole power of PL/pgSQL comes from the fact that it allows you to use the full set of postgres data types and operatores, and that it seamlessly integrated with SQL. Without that, PL/pgSQL is about as appealing as BASIC as a programming language... Right, and it's exactly those types and operators that are the cause of the performance issues. A compiled pl/pgsql would only get serious benefit for scenarios involving tons of heavy iteration or funky local data structure manipulation. Those scenarios are somewhat rare in practice for database applications and often better handled in a another pl should they happen. plv8 is emerging as the best non-sql it's JIT compiled by the plv8 runtime, the javascript language is designed for embedding. and the json data structure has nice similarities with postgres's arrays and types. In fact, if I *were* to attempt pl/pgsql compiling, I'd probably translate the code to plv8 and hand it off to the llvm engine. You'd still have to let postgres handle most of the operator and cast operations but you could pull some things into the plv8 engine. Probably, this would be a net loser since plv8 (unlike plpgsql) has to run everything through SPI. plpgsql makes extensive use of SPI. Just look at the source code if you don't believe me. oh, certainly. pl/pgsql also has the ability to bypass SPI for many simple expressions. Other pls generally don't do this because they can't if they want to guarantee SQL semanticsthat's ok then because they don't have to as the language runtime handles operations local to the function and everything runs under that language's rules. 
In a nutshell, my thinking here is to translate pl/pgsql to pl/v8 javascript and then let the optimizing v8 runtime take it from there. This is IMNSHO a tiny challenge relative to writing an optimization engine for pl/pgsql by hand. Think of it as coffeescript for databases. It's a nice thought, but there's a lot of roadblocks to making it happen -- starting with the lack of a javascript library that would wrap the C postgres datatype routines so you wouldn't have to call in to SPI for every little thing; as you know even i := i + 1; can't be handled by native javascript operations. plv8 also has a nice find_function gadget that lets you find and call another plv8 function directly instead of having to use an SPI call. Yeah -- this is another reason why pl/v8 is nice as a compilation target. javascript as we all know is a language with a long list of pros and cons but it's designed for embedding. merlin
Re: [HACKERS] Spinlocks and compiler/memory barriers
On Mon, Sep 8, 2014 at 10:08:04AM -0400, Robert Haas wrote: On Mon, Sep 8, 2014 at 8:07 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-04 14:19:47 +0200, Andres Freund wrote: Yes. I plan to push the patch this weekend. Sorry for the delay. I'm about to push this. Is it ok to first push it to master and only backpatch once a couple buildfarm cycles haven't complained? That will have the disadvantage that src/tools/git_changelog will show the commits separately instead of grouping them together; so it's probably best not to make a practice of it. But I think it's up to your discretion how to handle it in any particular case. Uh, git_changelog's timespan check is 24 hours, so if the delay is less than 24 hours, I think we are ok, e.g.: # Might want to make this parameter user-settable. my $timestamp_slop = 24 * 60 * 60; -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] add modulo (%) operator to pgbench
Hi Fabien-san, Thank you for your fast work! 2014-09-08 23:08 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr: Hello Mutsumara-san, #3. Documentation I think the modulo operator explanation should be put last in the doc, because the others are more frequently used. So I like patch3, which is simple and practical. Ok. If you agree or reply to my comment, I will mark it ready for committer. Please find attached v4, which is v3 plus improved documentation which is clearer about the sign of the remainder. The attached seems no problem, but I'd like to say something about the order of explanation in the five formulas. A fixed version is here. Please confirm, and I will mark it ready for committer. Best regards, -- Mitsumasa KONDO pgbench-modulo-4-1.patch Description: Binary data
Re: [HACKERS] Spinlocks and compiler/memory barriers
Robert Haas robertmh...@gmail.com writes: But what we're talking about here is a bug fix for Sparc. And surely we ought to back-patch that. Ah. OK, no objection. regards, tom lane
Re: [HACKERS] add modulo (%) operator to pgbench
The attached seems no problem, but I'd like to say something about the order of explanation in the five formulas. A fixed version is here. Please confirm, and I will mark it ready for committer. I'm ok with this version. -- Fabien.
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Wed, Sep 3, 2014 at 7:34 PM, Tomas Vondra t...@fuzzy.cz wrote: Well, I think you're certainly right that a hash table lookup is more expensive than modulo on a 32-bit integer; so much is obvious. But a hash join can increase the number of batches on the fly only by doubling it, so you might go from 4 batches to 8 when 5 would really have been enough. And a hash join also can't *reduce* the number of batches on the fly, which might matter a lot. Getting the number of batches right avoids I/O, which is a lot more expensive than CPU. Regarding the estimates, I don't see much difference between the two approaches when handling this issue. It's true you can wait with deciding how many partitions (aka batches) to create until work_mem is full, at which point you have more information than at the very beginning. You know how many tuples you've already seen, how many tuples you expect (which is however only an estimate etc.). And you may use that to estimate the number of partitions to create. I think it's significantly better than that. The first point I'd make is that if work_mem never fills up, you don't need to batch anything at all. That's a potentially huge win over batching a join we thought was going to overrun work_mem, but didn't. But even if work_mem does fill up, I think we still come out ahead, because we don't necessarily need to dump the *entirety* of each batch to disk. For example, suppose there are 900 distinct values and only 300 of them can fit in memory at a time. We read the input until work_mem is full and we see a previously-unseen value, so we decide to split the input up into 4 batches. We now finish reading the input. Each previously-seen value gets added to an existing in-memory group, and each new value gets written into one of four disk files. At the end of the input, 300 groups are complete, and we have four files on disk each of which contains the data for 150 of the remaining 600 groups.
Now, the alternative strategy is to batch from the beginning. Here, we decide right from the get-go that we're using 4 batches, so batch #1 goes into memory and the remaining 3 batches get written to three different disk files. At the end of the input, 225 groups are complete, and we have three files on disk each of which contains the data for 225 of the remaining 675 groups. This seems clearly inferior, because we have written 675 groups to disk when it would have been possible to write only 600. The gains can be even more significant when the input data is skewed. For example, suppose things are as above, but ten values account for 90% of all the inputs, and the remaining 890 values account for the other 10% of the inputs. Furthermore, let's suppose we have no table statistics or they are totally wrong. In Jeff's approach, as long as each of those values occurs at least once before work_mem fills up, they'll all be processed in the initial pass through the data, which means we will write at most 10% of the data to disk. In fact it will be a little bit less, because batch 1 will have not only the 10 frequently-occurring values but also 290 others, so our initial pass through the data will complete 300 groups covering (if the less-frequent values occur with uniform frequency) 93.258% of the data. The remaining ~6.7% will be split up into 4 files which we can then reread and process. But if we use the other approach, we'll only get 2 or 3 of the 10 commonly-occurring values in the first batch, so we expect to write about 75% of the data out to one of our three batch files. That's a BIG difference - more than 10x the I/O load that Jeff's approach would have incurred. Now, admittedly, we could use a skew optimization similar to the one we use for hash joins to try to get the MCVs into the first batch, and that would help a lot when the statistics are right - but sometimes the statistics are wrong, and Jeff's approach doesn't care. It just keeps on working.
That however comes at a cost - it's not really a memory-bounded hash aggregate, because you explicitly allow exceeding work_mem as more tuples for existing groups arrive. Well, that would be true for now, but as has been mentioned, we can add new methods to the aggregate infrastructure to serialize and de-serialize transition states. I guess I agree that, in the absence of such infrastructure, your patch might be a better way to handle cases like array_agg, but I'm pretty happy to see that infrastructure get added. Hmm. It occurs to me that it could also be really good to add a merge transition states operator to the aggregate infrastructure. That would allow further improvements to Jeff's approach for cases like array_agg. If we serialize a transition state to disk because it's not fitting in memory, we don't need to reload it before continuing to process the group, or at least not right away. We can instead just start a new transitions state and then merge all of the accumulated
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
Hi I removed dynamic allocation and reduced the patch size. What I tested: the old unicode style is the same as the new unicode style; nothing was changed .. some fields are specified in the refresh_utf8format function Regards Pavel 2014-09-08 4:44 GMT+02:00 Stephen Frost sfr...@snowman.net: Pavel, * Pavel Stehule (pavel.steh...@gmail.com) wrote: 2014-07-23 8:38 GMT+02:00 Tomas Vondra t...@fuzzy.cz: OK, thanks. The new version seems OK to me. Thank you I've started looking over the patch and went back through the previous thread about it. For my part, I'm in favor of adding this capability, but I'm not terribly happy about how it was done. In particular, get_line_style() seems pretty badly hacked around, and I don't really like having the prepare_unicode_format call underneath it allocating memory and then passing back up the need to free that memory via a new field in the structure. Also, on a quick glance, are you sure that the new 'unicode' output matches the same as the old 'unicode' did (with pg_utf8format)? I would think we'd simply set up a structure which is updated when the linestyle is changed, which is surely going to happen much less frequently than the request for which linestyle to use, and handle all of the line styles in more-or-less the same way rather than doing something completely different for unicode than for the others.
Thanks, Stephen

commit 509f8a92525889651653a75356d3fa57b58f3141
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Mon Sep 8 17:18:43 2014 +0200

    remove palloc

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index db314c3..84233d0 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2299,6 +2299,42 @@ lo_import 152801
         </para>
         </listitem>
       </varlistentry>
+
+      <varlistentry>
+      <term><literal>unicode_border_style</literal></term>
+      <listitem>
+      <para>
+      Sets the border line drawing style to one
+      of <literal>single</literal> or <literal>double</literal>
+      This option only affects the <literal>unicode</>
+      linestyle
+      </para>
+      </listitem>
+      </varlistentry>
+
+      <varlistentry>
+      <term><literal>unicode_column_style</literal></term>
+      <listitem>
+      <para>
+      Sets the column line drawing style to one
+      of <literal>single</literal> or <literal>double</literal>
+      This option only affects the <literal>unicode</>
+      linestyle
+      </para>
+      </listitem>
+      </varlistentry>
+
+      <varlistentry>
+      <term><literal>unicode_header_style</literal></term>
+      <listitem>
+      <para>
+      Sets the header line drawing style to one
+      of <literal>single</literal> or <literal>double</literal>
+      This option only affects the <literal>unicode</>
+      linestyle
+      </para>
+      </listitem>
+      </varlistentry>

     </variablelist>
    </para>

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index a66093a..fd05aae 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1054,6 +1054,9 @@ exec_command(const char *cmd,
 		"footer", "format", "linestyle", "null",
 		"numericlocale", "pager", "recordsep",
 		"tableattr", "title", "tuples_only",
+		"unicode_border_linestyle",
+		"unicode_column_linestyle",
+		"unicode_header_linestyle",
 		NULL
 	};
@@ -2248,6 +2251,40 @@ _align2string(enum printFormat in)
 	return "unknown";
 }

+/*
+ * Parse entered unicode linestyle. Returns true, when entered string is
+ * known linestyle: single, double else returns false.
+ */
+static bool
+set_unicode_line_style(printQueryOpt *popt, const char *value, size_t vallen,
+					   unicode_linestyle *linestyle)
+{
+	if (pg_strncasecmp("single", value, vallen) == 0)
+		*linestyle = UNICODE_LINESTYLE_SINGLE;
+	else if (pg_strncasecmp("double", value, vallen) == 0)
+		*linestyle = UNICODE_LINESTYLE_DOUBLE;
+	else
+		return false;
+
+	/* input is ok, generate new unicode style */
+	refresh_utf8format(&(popt->topt));
+
+	return true;
+}
+
+static const char *
+_unicode_linestyle2string(int linestyle)
+{
+	switch (linestyle)
+	{
+		case UNICODE_LINESTYLE_SINGLE:
+			return "single";
+			break;
+		case UNICODE_LINESTYLE_DOUBLE:
+			return "double";
+			break;
+	}
+	return "unknown";
+}

 bool
 do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
@@ -2305,6 +2342,42 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
 	}

+	/* set unicode border line style */
+	else if (strcmp(param, "unicode_border_linestyle") == 0)
+	{
+		if (!value)
+			;
+		else if (!set_unicode_line_style(popt, value, vallen,
+										 &popt->topt.unicode_border_linestyle))
+		{
+			psql_error("\\pset: allowed unicode border linestyle are single, double\n");
+			return false;
+		}
+	}
+
+	/* set unicode column line style */
+	else if (strcmp(param, "unicode_column_linestyle") == 0)
+	{
Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract
On Thu, Sep 4, 2014 at 6:38 PM, David Johnston david.g.johns...@gmail.com wrote: The implied suggestion is that if I do find any other areas that look like they need fixing - even in the same file - I should separate them out into a separate patch. Yes. Though I have seen various while I was in there I also fixed such-and-such commits previously so the line is at least somewhat fluid. Yep, it's a judgement call. As a general rule though, I think there's often a pretty clear connection between the main topic of the commit and the changes folded in. Another way to think about this is that, at least for non-committers (and frequently also for committers), any patch someone writes is going to need an upvote from *at least* one other person in order to go in. If somebody can look at the patch and easily determine that it solves some problem and doesn't make anything worse, then they're likely to like it (and if they're a committer, maybe commit it). But if they see changes they like mixed in with changes they don't like, then they're likely to either bounce it back, or just say, hmm, this looks like it will take some time to deal with, let me put that on my TODO list. And of course sometimes it never makes it off the TODO list. Now, it's a fair point that this makes it hard to get large-scale changes done. But, to some extent, that's a good thing. Prudence, indeed, will dictate that source code or documentation long established should not be changed for light or transient causes. For the rest, if you feel a large scale change is really needed, it's best to start with a proposal: I think we need to rehash the documentation in section X because of reason Y. You've made a few proposals to rehash sections of the documentation on pgsql-hackers, but I haven't actually seen clear and compelling justification for those reworkings. 
Clearly, you like it better the new way, but the person who did it the existing way likely prefers their version, and they've been around longer than you. :-) By stating your objectives up-front, you can see whether people agree with those objectives or not. If they do, you can hope to find the final patch criticized only on fine details which are easily remedied; but if they don't, then some rethinking may be needed. This seems pointless. Of course general documentation will be less specific than documentation for specific functions. The existing wording was being verbose in order to be correct. In a summary like this I'd trade being reasonably accurate and general for the precision that is included elsewhere. This is an example of a goal where you might solicit people's general thoughts before starting. Maybe people will agree that removing details in a certain place is useful, or maybe they won't, but it needs discussion. One of the trade-offs I mentioned... it's more style than anything, but removing the parenthetical (if there is not error in the command) and writing it more directly seemed preferable in an overview such as this. Better: The function will either throw an error or return a PGresult object[...] Nope. This is not C++, nor is it the backend. It will not throw anything. + <para> + Second, the application should use the functions in this + section to receive data rows or transmit buffer loads. Buffer loads are + not guaranteed to be processed until the copy transfer is completed. + </para> The main change here vs. the existing text is that you're now using the phrase buffer loads to refer to what gets transmitted, and data rows to refer to what gets received. The existing text uses the term data rows for both, which seems correct to me. My first reaction on reading your revised text was wait, what's a buffer load?
So, my generalization policy working in reverse - since the transmit side does not have to be in complete rows, implying that they are here is (albeit acceptably) inaccurate. The existing wording doesn't say that each call to one of the functions in question must contain only whole data rows of itself. It merely says that these are the functions you need to use to send data rows, which is true. - At this point further SQL commands can be issued via - <function>PQexec</function>. (It is not possible to execute other SQL - commands using the same connection while the <command>COPY</command> - operation is in progress.) Removing this text doesn't seem like a good idea. It's a quite helpful clarification. The note you've added in its place doesn't seem like a good substitute for it, and more generally, I think we should avoid the overuse of constructs like <note>. Emphasis needs to be used minimally or it loses its meaning. Was trying to remove repetition here - happy to consider an alternative way of doing so if the note is objectionable. I guess it doesn't seem repetitive to me. It's great to be able to read the documentation like a book but it also wants to be useful for scanning and a
Re: [HACKERS] gist vacuum gist access
On 09/08/2014 03:26 PM, Alexander Korotkov wrote: On Mon, Sep 8, 2014 at 12:51 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/08/2014 11:19 AM, Alexander Korotkov wrote: On Mon, Sep 8, 2014 at 12:08 PM, Alexander Korotkov aekorot...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:13 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: In the b-tree code, we solved that problem back in 2006, so it can be done but requires a bit more code. In b-tree, we solved it with a vacuum cycle ID number that's set on the page halves when a page is split. That allows VACUUM to identify pages that have been split concurrently when it sees them, and jump back to vacuum them too. See commit http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5749f6ef0cc1c67ef9c9ad2108b3d97b82555c80. It should be possible to do something similar in GiST, and in fact you might be able to reuse the NSN field that's already set on the page halves on split, instead of adding a new vacuum cycle ID. ... Another note. Assuming we have an NSN which can play the role of vacuum cycle ID, can we implement a sequential (with possible jump back) index scan for GiST? Yeah, I think it would work. It's pretty straightforward, the page split code already sets the NSN, just when we need it. Vacuum needs to memorize the current NSN when it begins, and whenever it sees a page with a higher NSN (or the FOLLOW_RIGHT flag is set), follow the right-link if it points to a lower-numbered page. I mean a full index scan feature for SELECT queries might be implemented as well as sequential VACUUM. Oh, sorry, I missed that. If you implement a full-index scan like that, you might visit some tuples twice, so you'd have to somehow deal with the duplicates. For a bitmap index scan it would be fine. - Heikki
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Sun, Aug 17, 2014 at 7:46 PM, Fujii Masao masao.fu...@gmail.com wrote: Thanks for reviewing the patch! ISTM that I failed to make the patch from my git repository... Attached is the rebased version. I get some compiler warnings on v2 of this patch: reloptions.c:219: warning: excess elements in struct initializer reloptions.c:219: warning: (near initialization for 'intRelOpts[15]') Cheers, Jeff
Re: [HACKERS] [v9.5] Custom Plan API
On Thu, Sep 4, 2014 at 7:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Regarding to the attached three patches: [1] custom-path and hook It adds register_custom_path_provider() interface for registration of custom-path entrypoint. Callbacks are invoked on set_plain_rel_pathlist to offer alternative scan path on regular relations. I may need to explain the terms in use. I calls the path-node custom-path that is the previous step of population of plan-node (like custom-scan and potentially custom-join and so on). The node object created by CreateCustomPlan() is called custom-plan because it is abstraction for all the potential custom-xxx node; custom-scan is the first of all. I don't think it's a good thing that add_custom_path_type is declared as void (*)(void *) rather than having a real type. I suggest we add the path-creation callback function to CustomPlanMethods instead, like this: void (*CreateCustomScanPath)(PlannerInfo *root, RelOptInfo *baserel, RangeTblEntry *rte); Then, register_custom_path_provider() can just take CustomPathMethods * as an argument; and create_customscan_paths can just walk the list of CustomPlanMethods objects and call CreateCustomScanPath for each one where that is non-NULL. This conflates the path generation mechanism with the type of path getting generated a little bit, but I don't see any real downside to that. I don't see a reason why you'd ever want two different providers to offer the same type of custompath. Don't the changes to src/backend/optimizer/plan/createplan.c belong in patch #2? [2] custom-scan node It adds custom-scan node support. The custom-scan node is expected to generate contents of a particular relation or sub-plan according to its custom-logic. Custom-scan provider needs to implement callbacks of CustomScanMethods and CustomExecMethods. Once a custom-scan node is populated from custom-path node, the backend calls back these methods in the planning and execution stage. 
It looks to me like this patch is full of holdovers from its earlier life as a more-generic CustomPlan node. In particular, it contains numerous defenses against the case where scanrelid != 0. These are confusingly written as scanrelid > 0, but I think really they're just bogus altogether: if this is specifically a CustomScan, not a CustomPlan, then the relid should always be filled in. Please consider what can be simplified here. The comment in _copyCustomScan looks bogus to me. I think we should *require* a static method table. In create_custom_plan, you do if (IsA(custom_plan, CustomScan)) { lots of stuff; } else elog(ERROR, ...). I think it would be clearer to write if (!IsA(custom_plan, CustomScan)) elog(ERROR, ...); lots of stuff; Also, regarding the use-case of the multi-exec interface: below is an EXPLAIN output of PG-Strom. It shows the custom GpuHashJoin has two sub-plans; GpuScan and MultiHash. GpuHashJoin is stacked on the GpuScan. It is a case when these nodes utilize the multi-exec interface for more efficient data exchange between the nodes. GpuScan already keeps a data structure that is suitable to send to/recv from GPU devices and constructed on a memory segment being DMA available. If we have to form a tuple, pass it via a row-by-row interface, then deform it, it will become a major performance degradation in this use case.
postgres=# explain select * from t10 natural join t8 natural join t9 where x > 10;
                                   QUERY PLAN
---
 Custom (GpuHashJoin)  (cost=10979.56..90064.15 rows=333 width=49)
   pseudo scan tlist: 1:(t10.bid), 3:(t10.aid), 4:t10.x, 2:t8.data, 5:[t8.aid], 6:[t9.bid]
   hash clause 1: ((t8.aid = t10.aid) AND (t9.bid = t10.bid))
   ->  Custom (GpuScan) on t10  (cost=1.00..88831.26 rows=327 width=16)
         Host References: aid, bid, x
         Device References: x
         Device Filter: (x > 10::double precision)
   ->  Custom (MultiHash)  (cost=464.56..464.56 rows=1000 width=41)
         hash keys: aid, bid
         ->  Hash Join  (cost=60.06..464.56 rows=1000 width=41)
               Hash Cond: (t9.data = t8.data)
               ->  Index Scan using t9_pkey on t9  (cost=0.29..357.29 rows=1 width=37)
               ->  Hash  (cost=47.27..47.27 rows=1000 width=37)
                     ->  Index Scan using t8_pkey on t8  (cost=0.28..47.27 rows=1000 width=37)
 Planning time: 0.810 ms
(15 rows)

Why can't the Custom(GpuHashJoin) node build the hash table internally instead of using a separate node? Also, for this patch we are only considering custom scan. Custom join is another patch. We don't need to provide infrastructure for that patch in this one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Client Count / Patch_Ver (tps)    8      16      32      64      128
HEAD                              58614  107370  140717  104357  65010
Patch                             60092  113564  165014  213848  216065

This data is the median of 3 runs; a detailed report is attached with the mail. I have not repeated the test for all configurations, as there is no major change in design/algorithm which can affect performance. Mark has already taken tpc-b data which ensures that there is no problem with it, however I will also take it once with the latest version. Well, these numbers are pretty much amazing. Question: It seems there's obviously quite a bit of contention left; do you think there's still a significant amount of time in the clock sweep, or is the primary bottleneck the buffer mapping locks? merlin
Re: [HACKERS] pgcrypto: PGP signatures
On Fri, Sep 5, 2014 at 4:38 AM, Marko Tiikkaja ma...@joh.to wrote: Hi all, I've updated the patch with a number of changes: 1) I've documented the current limitations of signatures 2) I've expanded section F.25.3 to add information about signatures (though I'm not sure why this part is in the user-facing documentation in the first place). 3) I've changed the code to use ntohl() and pg_time_t as per Thomas' comments. 4) I've changed the code to consistently use while (1) instead of for (;;) (except for the math library, but I didn't touch that at all) I've also changed the behaviour when passing a message with a signature to the decrypt functions which don't verify signatures. They now report ERROR: Wrong key or corrupt data instead of decrypting and silently ignoring the signature. The behaviour is now backwards compatible, but I see two ways we could possibly improve this: 1) Produce a better error message (I'm sure most people don't know about the hidden debug=1 setting) 2) Provide an option to ignore the signature if decrypting the data is desirable even if the signature can't be verified If I understand the sequence here: the current git HEAD is that pgp_pub_decrypt would throw an error if given a signed and encrypted message, an earlier version of your patch changed that to decrypt the message and ignore the signature, and the current version went back to throwing an error. I think I prefer the middle of those behaviors. The original behavior seems like a bug to me, and I don't think we need to be backwards compatible with bugs. Why should a function called decrypt care if the message is also signed? That is not its job. If we decide to throw the error, a better error message certainly wouldn't hurt. And the output of 'debug=1' is generally not comprehensible unless you are familiar with the source code, so that is not a substitute. (By the way, there are now 2 patches in this series named pgcrypto_sigs.v3.patch--so be careful which one you look at.)
There seems to be a memory leak in pgp_sym_decrypt_verify that does not exist in pgp_sym_decrypt. It is about 58 bytes per decryption. Perl test script:

my $dbh = connect(...);
my $pub = `cat public.asc`;
my $pri = `cat private.asc`;
my $enc = $dbh->prepare("select armor(pgp_sym_encrypt_sign('asdlkfjsldkfjsadf',?,dearmor(?),'debug=1'))");
my $dec = $dbh->prepare("select pgp_sym_decrypt_verify(dearmor(?),?,dearmor(?),'debug=1')");
my $i = 1;
$enc->execute("foobar$i", $pri);
my ($message) = $enc->fetchrow();
foreach my $ii (1..100) {
    ## my $i = $ii;
    $dec->execute($message, "foobar$i", $pub);
    my ($message2) = $dec->fetchrow();
    die unless $message2 eq "asdlkfjsldkfjsadf";
    warn "$i\t", time() if $i % 1000 == 0;
};

Cheers, Jeff
Re: [HACKERS] implement subject alternative names support for SSL connections
On 09/05/2014 07:30 PM, Alexey Klyukin wrote: On Thu, Sep 4, 2014 at 10:23 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Hmm. Perhaps we should use X509_NAME_get_index_by_NID + X509_NAME_get_entry instead of X509_NAME_get_text_by_NID. You could then pass the ASN1_STRING object to the certificate_name_entry_validate_match() function, and have it do the ASN1_STRING_length() and ASN1_STRING_data() calls too. ... I think we should: 1. Check if there's a common name, and if so, print that 2. Check if there is exactly one SAN, and if so, print that 3. Just print an error without mentioning names. There's a lot of value in printing the name if possible, so I'd really like to keep that. But I agree that printing all the names if there are several would get complicated and the error message could become very long. Yeah, the error message might need to be different for cases 1 and 2. Or maybe phrase it server certificate's name \%s\ does not match host name \%s\, which would be reasonable for both 1. and 2. Thank you, I've implemented both suggestions in the attached new version of the patch. On the error message, I've decided to show either a single name, or the first examined name and the number of other names present in the certificate, i.e: psql: server name example.com and 2 other names from server SSL certificate do not match host name example-foo.com The error does not state whether the names comes from the CN or from the SAN section. I'd reword that slightly, to: psql: server certificate for example.com (and 2 other names) does not match host name example-foo.com I never liked the current wording, server name foo very much. I think it's important to use the word server certificate in the error message, to make it clear where the problem is. For translations, that message should be pluralized, but there is no libpq_ngettext macro equivalent to ngettext(), like libpq_gettext. 
I wonder if that was left out on purpose, or if we just haven't needed that in libpq before. Anyway, I think we need to add that for this. This version also checks for the availability of the subject name, it looks like RFC 5280 does not require it at all. 4.1.2.6. Subject The subject field identifies the entity associated with the public key stored in the subject public key field. The subject name MAY be carried in the subject field and/or the subjectAltName extension. Ok. The pattern of allocating the name in the subroutine and then reporting it (and releasing when necessary) in the main function is a little bit ugly, but it looks to me the least ugly of anything else I could come up (i.e. extracting those names at the time the error message is shown). I reworked that a bit, see attached. What do you think? I think this is ready for commit, except for two things: 1. The pluralization of the message for translation 2. I still wonder if we should follow the RFC 6125 and not check the Common Name at all, if SANs are present. I don't really understand the point of that rule, and it seems unlikely to pose a security issue in practice, because surely a CA won't sign a certificate with a bogus/wrong CN, because an older client that doesn't look at the SANs at all would use the CN anyway. But still... what does a Typical Web Browser do? 
- Heikki

diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c
index f950fc3..4f6f324 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -60,9 +60,13 @@
 #ifdef USE_SSL_ENGINE
 #include <openssl/engine.h>
 #endif
+#include <openssl/x509v3.h>

 static bool verify_peer_name_matches_certificate(PGconn *);
 static int	verify_cb(int ok, X509_STORE_CTX *ctx);
+static int	verify_peer_name_matches_certificate_name(PGconn *conn,
+									  ASN1_STRING *name,
+									  char **store_name);
 static void destroy_ssl_system(void);
 static int	initialize_SSL(PGconn *conn);
 static PostgresPollingStatusType open_client_SSL(PGconn *);
@@ -473,98 +477,229 @@ wildcard_certificate_match(const char *pattern, const char *string)

 /*
- * Verify that common name resolves to peer.
+ * Check if a name from a server's certificate matches the peer's hostname.
+ *
+ * Returns 1 if the name matches, and 0 if it does not. On error, returns
+ * -1, and sets the libpq error message.
+ *
+ * The name extracted from the certificate is returned in *store_name. The
+ * caller is responsible for freeing it.
  */
-static bool
-verify_peer_name_matches_certificate(PGconn *conn)
+static int
+verify_peer_name_matches_certificate_name(PGconn *conn, ASN1_STRING *name_entry,
+										  char **store_name)
 {
-	char
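For readers unfamiliar with the wildcard_certificate_match() helper referenced in the hunk header above, the single-label wildcard rule it embodies can be sketched in isolation. This is a simplified, standalone sketch — not libpq's actual code; the function name wildcard_name_match and its exact rules (wildcard only as the whole leftmost label, case-insensitive labels) are illustrative:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Simplified sketch of single-label wildcard matching, in the spirit of
 * libpq's wildcard_certificate_match().  Not the actual implementation.
 * A pattern like "*.example.com" matches exactly one leftmost label.
 * Returns 1 on match, 0 otherwise.
 */
static int
wildcard_name_match(const char *pattern, const char *hostname)
{
	const char *h_dot;
	size_t		i;

	/* Plain pattern: case-insensitive exact comparison */
	if (pattern[0] != '*')
	{
		if (strlen(pattern) != strlen(hostname))
			return 0;
		for (i = 0; pattern[i] != '\0'; i++)
			if (tolower((unsigned char) pattern[i]) !=
				tolower((unsigned char) hostname[i]))
				return 0;
		return 1;
	}

	/* Wildcard must be the entire first label: a "*." prefix */
	if (pattern[1] != '.')
		return 0;

	/* Hostname must have at least one label before the matched suffix */
	h_dot = strchr(hostname, '.');
	if (h_dot == NULL || h_dot == hostname)
		return 0;

	/* Compare ".example.com" against the hostname's own suffix */
	return wildcard_name_match(pattern + 1, h_dot);
}
```

Note that with these rules "*.example.com" matches "foo.example.com" but neither "example.com" (no label to consume) nor "a.b.example.com" (the wildcard spans only one label) — the same behavior the thread relies on.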
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Mon, September 8, 2014 18:02, Alvaro Herrera wrote: Here's version 18. I have renamed it: these are now BRIN indexes. I get into a BadArgument after:

$ cat crash.sql
-- drop table if exists t_100_000_000 cascade;
create table t_100_000_000 as
   select cast(i as integer) from generate_series(1, 100000000) as f(i);
-- drop index if exists t_100_000_000_i_brin_idx;
create index t_100_000_000_i_brin_idx on t_100_000_000 using brin(i);
select pg_size_pretty(pg_relation_size('t_100_000_000_i_brin_idx'));
select i from t_100_000_000 where i between 1 and 100;  -- ( + 99 )

Log file says:

TRAP: BadArgument("!(((context) != ((void *) 0) && ((((const Node *) ((context)))->type) == T_AllocSetContext)))", File: "mcxt.c", Line: 752)
2014-09-08 19:54:46.071 CEST 30151 LOG:  server process (PID 30336) was terminated by signal 6: Aborted
2014-09-08 19:54:46.071 CEST 30151 DETAIL:  Failed process was running: select i from t_100_000_000 where i between 1 and 100;

The crash is caused by the last select statement; the table and index create are OK. It only happens with a largish table; small tables are OK. Linux / Centos / 32 GB.
PostgreSQL 9.5devel_minmax_20140908_1809_0640c1bfc091 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit

         setting          |             current_setting
--------------------------+---------------------------------------------
 autovacuum               | off
 port                     | 6444
 shared_buffers           | 100MB
 effective_cache_size     | 4GB
 work_mem                 | 10MB
 maintenance_work_mem     | 1GB
 checkpoint_segments      | 20
 data_checksums           | on
 server_version           | 9.5devel_minmax_20140908_1809_0640c1bfc091
 pg_postmaster_start_time | 2014-09-08 19:53 (uptime: 0d 0h 6m 54s)

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--bindir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/bin' '--libdir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/lib' '--enable-depend' '--enable-cassert' '--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--with-extra-version=_minmax_20140908_1809_0640c1bfc091'

pgpatches/0095/minmax/20140908/minmax-18.patch

thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgcrypto: PGP signatures
On 2014-09-08 7:30 PM, Jeff Janes wrote: On Fri, Sep 5, 2014 at 4:38 AM, Marko Tiikkaja ma...@joh.to wrote: I've also changed the behaviour when passing a message with a signature to the decrypt functions which don't verify signatures. They now report ERROR: Wrong key or corrupt data instead of decrypting and silently ignoring the signature. The behaviour is now backwards compatible, but I see two ways we could possibly improve this: 1) Produce a better error message (I'm sure most people don't know about the hidden debug=1 setting) 2) Provide an option to ignore the signature if decrypting the data is desirable even if the signature can't be verified If I understand the sequence here: the current git HEAD is that pgp_pub_decrypt would throw an error if given a signed and encrypted message, an earlier version of your patch changed that to decrypt the message and ignore the signature, and the current version went back to throwing an error. You got that right, yes. I think I prefer the middle of those behaviors. The original behavior seems like a bug to me, and I don't think we need to be backwards compatible with bugs. Why should a function called decrypt care if the message is also signed? That is not its job. Yeah, that seems reasonable, I guess. I'm kind of torn between the two behaviours, to be honest. But perhaps it would make sense to change back to the previous behaviour (i.e. go back to the way this patch was earlier) and document that somewhere. There seems to be a memory leak in pgp_sym_decrypt_verify that does not exist in pgp_sym_decrypt. It is about 58 bytes per decryption. Interesting. Thanks! I'll have a look. .marko
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Erik Rijkers wrote: Log file says: TRAP: BadArgument("!(((context) != ((void *) 0) && ((((const Node *) ((context)))->type) == T_AllocSetContext)))", File: "mcxt.c", Line: 752) 2014-09-08 19:54:46.071 CEST 30151 LOG: server process (PID 30336) was terminated by signal 6: Aborted 2014-09-08 19:54:46.071 CEST 30151 DETAIL: Failed process was running: select i from t_100_000_000 where i between 1 and 100; A double-free mistake -- here's a patch. Thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c89a167..6ac012c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -388,10 +388,7 @@ bringetbitmap(PG_FUNCTION_ARGS)
 							   PointerGetDatum(key));
 				addrange = DatumGetBool(add);
 				if (!addrange)
-				{
-					brin_free_tuple(tup);
 					break;
-				}
 			}
 		}
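The fix above removes an explicit free because the same tuple is released again by later cleanup — a classic double free. A toy illustration of the ownership rule at work (this is not PostgreSQL's MemoryContext code; ToyContext and all names here are made up): once a chunk is owned by a context that frees everything at reset, an early free must also unregister the chunk, or the reset frees it a second time.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy "memory context": every allocation is registered with its context,
 * and ctx_reset() frees everything still registered.  Loosely mimics why
 * the explicit brin_free_tuple() before "break" was wrong: the tuple was
 * freed again by later cleanup.
 */
#define CTX_MAX 16

typedef struct ToyContext
{
	void	   *chunks[CTX_MAX];
	int			nchunks;
} ToyContext;

static void *
ctx_alloc(ToyContext *ctx, size_t size)
{
	void	   *p = malloc(size);

	assert(p != NULL && ctx->nchunks < CTX_MAX);
	ctx->chunks[ctx->nchunks++] = p;
	return p;
}

/* Free one chunk early AND forget it, so ctx_reset() won't free it again. */
static void
ctx_free_one(ToyContext *ctx, void *p)
{
	int			i;

	for (i = 0; i < ctx->nchunks; i++)
	{
		if (ctx->chunks[i] == p)
		{
			free(p);
			ctx->chunks[i] = ctx->chunks[--ctx->nchunks];
			return;
		}
	}
	/* A plain free() without this bookkeeping is the double-free bug. */
	assert(!"double free: chunk not owned by context");
}

static void
ctx_reset(ToyContext *ctx)
{
	while (ctx->nchunks > 0)
		free(ctx->chunks[--ctx->nchunks]);
}
```

The patch takes the simpler road available in a context-based allocator: skip the early free entirely and let the context reclaim everything at once.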
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from the above, I think for this patch a catversion bump is required, as I have modified the system catalog. However I have not done the same in the patch, as otherwise it will be a bit difficult to take performance data. One regression failed on Linux due to a spacing issue, which is fixed in the attached patch. I took another read through this patch. Here are some further review comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have both num_to_free and tmp_num_to_free. I'd get rid of tmp_num_to_free and move the declaration of num_to_free inside the outer loop. I'd also move the definitions of tmp_next_to_clean, tmp_recent_alloc, tmp_recent_backend_clocksweep into the innermost scope in which they are used.

2. Also in that function, I think the innermost bit of logic could be rewritten more compactly, and in such a way as to make it clearer for what set of instructions the buffer header will be locked.

LockBufHdr(bufHdr);
if (bufHdr->refcount == 0)
{
	if (bufHdr->usage_count > 0)
		bufHdr->usage_count--;
	else
		add_to_freelist = true;
}
UnlockBufHdr(bufHdr);
if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
	num_to_free--;

3. This comment is now obsolete:

+	/*
+	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+	 * not do so while holding freelist_lck; so set it after releasing the
+	 * freelist_lck. This is annoyingly tedious, but it happens at most once
+	 * per bgwriter cycle, so the performance hit is minimal.
+	 */

We're not actually holding any lock in need of releasing at that point in the code, so this can be shortened to "If bgwriterLatch is set, we need to waken the bgwriter."

* Ideally numFreeListBuffers should get called under freelist spinlock

That doesn't make any sense. numFreeListBuffers is a variable, so you can't call it.
The value should be *read* under the spinlock, but it is. I think this whole comment can be deleted and replaced with "If the number of free buffers has fallen below the low water mark, awaken the bgreclaimer to repopulate it."

4. StrategySyncStartAndEnd() is kind of a mess. One, it can return the same victim buffer that's being handed out - at almost the same time - to a backend running the clock sweep; if it does, they'll fight over the buffer. Two, the *end out parameter actually returns a count, not an endpoint. I think we should have BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the top of the inner loop rather than the bottom, and change StrategySyncStartAndEnd() so that it knows nothing about victimbuf_lck. Let's also change StrategyGetBuffer() to call StrategySyncNextVictimBuffer() so that the logic is centralized in one place, and rename StrategySyncStartAndEnd() to something that better matches its revised purpose. Maybe StrategyGetReclaimInfo().

5. Have you tested that this new bgwriter statistic is actually working? Because it looks to me like BgMoveBuffersToFreelist is changing BgWriterStats but never calling pgstat_send_bgwriter(), which I'm thinking will mean the counters accumulate forever inside the reclaimer but never get forwarded to the stats collector.

6. StrategyInitialize() uses #defines for HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000) for clamping. Let's have constants for all of those (and omit mention of the specific values in the comments).

7. The line you've added to the definition of view pg_stat_bgwriter doesn't seem to be indented the same way as all the others. Tab vs. space problem?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
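The rewrite Robert suggests in point 2 can be mocked up as a standalone loop. This is a simplified sketch only: the lock macros are no-ops here, and BufDesc, move_to_freelist, and sweep are illustrative stand-ins, not the real bufmgr code — the point is just the shape of the logic (decide under the header lock, act on the freelist after releasing it):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for a buffer header; not the real BufferDesc. */
typedef struct BufDesc
{
	int			refcount;		/* pins held */
	int			usage_count;	/* clock-sweep popularity counter */
	bool		on_freelist;
} BufDesc;

/* Stand-ins for LockBufHdr/UnlockBufHdr; a spinlock in the real code. */
#define LockBufHdr(b)	((void) 0)
#define UnlockBufHdr(b)	((void) 0)

static bool
move_to_freelist(BufDesc *buf)
{
	if (buf->on_freelist)
		return false;			/* already there: nothing freed */
	buf->on_freelist = true;
	return true;
}

/* One reclaimer pass: returns how many buffers were actually freed. */
static int
sweep(BufDesc *bufs, int nbufs, int num_to_free)
{
	int			freed = 0;
	int			i;

	for (i = 0; i < nbufs && freed < num_to_free; i++)
	{
		BufDesc    *buf = &bufs[i];
		bool		add_to_freelist = false;

		LockBufHdr(buf);
		if (buf->refcount == 0)
		{
			if (buf->usage_count > 0)
				buf->usage_count--;		/* give it another chance */
			else
				add_to_freelist = true; /* cold and unpinned: evict */
		}
		UnlockBufHdr(buf);

		/* freelist manipulation happens outside the header lock */
		if (add_to_freelist && move_to_freelist(buf))
			freed++;
	}
	return freed;
}
```

The compact form makes the locked region obvious: only the refcount/usage_count inspection happens under the header lock, while the freelist push is done afterwards.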
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra t...@fuzzy.cz wrote: as Heikki mentioned in his commitfest status message, this patch still has no reviewers :-( Is there anyone willing to pick that up, together with the 'dense allocation' patch (as those two are closely related)? I'm looking in Robert's direction, as he's the one who came up with the dense allocation idea, and also commented on the hashjoin bucket resizing patch ... I'd ask Pavel Stehule, but as he's sitting next to me in the office he's not really independent ;-) I'll do some reviews on the other patches over the weekend, to balance this of course.

Will any of those patches be, heh heh, mine?

I am a bit confused by the relationship between the two patches you posted. The combined patch applies, but the other one does not. If this is a sequence of two patches, it seems like it would be more helpful to post A and B rather than B and A+B. Perhaps the other patch is on a different thread, but there's not an obvious reference to said thread here.

In ExecChooseHashTableSize(), if buckets_bytes is independent of nbatch, could we just compute it before working out dbatch, and just deduct it from hash_table_bytes? That seems like it'd do the same thing for less work. Either way, what happens if buckets_bytes is already bigger than hash_table_bytes?

A few more stylistic issues that I see:

+ if ((hashtable->nbatch == 1)
+ if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+ if (size > (HASH_CHUNK_SIZE/8))

While I'm all in favor of using parentheses generously where it's needed to add clarity, these and similar usages seem over the top to me. On a related note, the first of these reads like this

if (stuff)
{
	if (more stuff)
	{
		do things
	}
}

which makes one wonder why we need two if statements.
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	HashChunk	chunks_batch;	/* one list for the whole batch */
+
 } HashJoinTableData;

If the original coding didn't have a blank line between the last structure member and the closing brace, it's probably not right to make it that way now. There are similar issues at the end of some functions you modified, and a few other places (like the new code in ExecChooseHashTableSize and chunk_alloc) where there are extra blank lines at the starts and ends of blocks.

+char * chunk_alloc(HashJoinTable hashtable, int size) {

Eh... no.

+ /* XXX maybe we should use MAXALIGN(size) here ... */

+1.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
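For readers following along, the "dense allocation" idea under review — packing tuples into large chunks that are freed wholesale instead of one palloc per tuple — can be sketched independently. All names and sizes below are illustrative (this is not the patch's actual chunk_alloc); in particular, the real code keeps dedicated oversized chunks out of the way of the current chunk, which this sketch glosses over:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative chunk size and alignment; not the patch's actual values. */
#define CHUNK_SIZE	(32 * 1024)
#define ALIGNOF_MAX	8
#define MAXALIGN(s)	(((s) + (ALIGNOF_MAX - 1)) & ~(size_t) (ALIGNOF_MAX - 1))

typedef struct Chunk
{
	struct Chunk *next;			/* singly linked list of chunks */
	size_t		maxlen;			/* capacity of data[] */
	size_t		used;			/* bytes handed out so far */
	char		data[];			/* tuples are packed in here */
} Chunk;

static Chunk *
chunk_create(Chunk **head, size_t datalen)
{
	Chunk	   *c = malloc(sizeof(Chunk) + datalen);

	assert(c != NULL);
	c->maxlen = datalen;
	c->used = 0;
	c->next = *head;
	*head = c;
	return c;
}

static char *
chunk_alloc(Chunk **head, size_t size)
{
	Chunk	   *c = *head;

	size = MAXALIGN(size);

	/* Oversized requests get a dedicated chunk of their own. */
	if (size > CHUNK_SIZE / 8)
	{
		Chunk	   *dedicated = chunk_create(head, size);

		dedicated->used = size;
		return dedicated->data;
	}

	if (c == NULL || c->used + size > c->maxlen)
		c = chunk_create(head, CHUNK_SIZE);

	c->used += size;
	return c->data + c->used - size;
}

/* The payoff: the whole batch is released by walking the chunk list. */
static void
chunk_free_all(Chunk **head)
{
	while (*head)
	{
		Chunk	   *next = (*head)->next;

		free(*head);
		*head = next;
	}
}
```

This is also why the bucket-resize patch builds on dense allocation: rebuilding buckets only needs a walk over the chunk list, and per-tuple allocator overhead (which the thread estimates as significant) disappears.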
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Mon, Sep 8, 2014 at 1:09 AM, Amit Kapila amit.kapil...@gmail.com wrote: Don't you need a PG_TRY block here to reset pq_mq_busy? No. If shm_mq_sendv is interrupted, we can't use the shm_mq any more. But since that should only happen if an interrupt arrives while the queue is full, I think that's OK. I think here not only on interrupt, but any other error in this function shm_mq_sendv() path (one example is WaitLatch()) could lead to the same behaviour. Probably true. But I think we generally count on that to be no-fail, or close to it, so I'm not really worried about it. (Think about the alternatives: if the queue is full, we have no way of notifying the launching process without waiting for it to retrieve the results, but it might not do that right away, and if we've been killed we need to die *now* not later.) So in such cases what is the advice to users? Currently they will see the below message: postgres=# select * from pg_background_result(5124) as (x int); ERROR: lost connection to worker process with PID 5124 One way is to ask them to check logs, but what about if they want to handle the error and take some action? They have to check the logs. To reiterate what I said before, there is no reasonable way to both have the background worker terminate quickly and also guarantee that the full error message is received by the process that started it. You have to pick one, and I stick by the one I picked. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Mon, Sep 8, 2014 at 2:05 AM, Amit Kapila amit.kapil...@gmail.com wrote: Another point about error handling is that to execute the sql in function pg_background_worker_main(), it starts the transaction which I think doesn't get aborted if error occurs For this I was just referring error handling code of StartBackgroundWorker(), however during shutdown of process, it will call AbortOutOfAnyTransaction() which will take care of aborting transaction. Yeah. and similarly handling for timeout seems to be missing in error path. As we are anyway going to exit on error, so not sure, if this will be required, however having it for clean exit seems to be better. Can you be more specific? I'm generally of the view that there's little point in spending time cleaning things up that will go away anyway when the process exits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak
Stephen Frost wrote: Alvaro, * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: ALTER TABLE ALL IN TABLESPACE xyz which AFAICS should work since ALL is already a reserved keyword. Pushed to master and REL9_4_STABLE. Thanks. One more tweak --- the whole reason for fiddling with this is to ensure that event triggers support this operation. Therefore this node should be handled by ProcessUtilitySlow, not standard_ProcessUtility, as in the attached patch. (I checked the documentation for necessary updates; turns out that the table in the event triggers chapter says that ddl_command_end etc. support the command ALTER TABLE, and since this is the tag returned by the new ALTER TABLE ALL IN TABLESPACE command, there is no update needed. In fact, one can argue that the table is wrong currently because it doesn't say that ALTER TABLE ALL IN TABLESPACE is not supported.) I propose this for 9.4 too. Apologies on it taking so long - things have been a bit interesting for me over the past month or two. :) I bet they have!
Have fun, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 40ac47f..e2c2d3d 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -507,10 +507,6 @@ standard_ProcessUtility(Node *parsetree,
 			AlterTableSpaceOptions((AlterTableSpaceOptionsStmt *) parsetree);
 			break;

-		case T_AlterTableMoveAllStmt:
-			AlterTableMoveAll((AlterTableMoveAllStmt *) parsetree);
-			break;
-
 		case T_TruncateStmt:
 			ExecuteTruncate((TruncateStmt *) parsetree);
 			break;
@@ -1296,6 +1292,10 @@ ProcessUtilitySlow(Node *parsetree,
 			AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
 			break;

+		case T_AlterTableMoveAllStmt:
+			AlterTableMoveAll((AlterTableMoveAllStmt *) parsetree);
+			break;
+
 		case T_DropStmt:
 			ExecDropStmt((DropStmt *) parsetree, isTopLevel);
 			break;
Re: [HACKERS] proposal: plpgsql - Assert statement
On Fri, Sep 5, 2014 at 2:16 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Assert is usually implemented as custom functions and used via a PERFORM statement now -- the usual current solution: PERFORM Assert(some expression) I would like to implement Assert as a plpgsql internal statement due to bigger possibilities to design syntax and internal implementation, now and in the future. More - as a plpgsql statement it should be compatible with any current code - because there should not be a collision between the SQL and PLpgSQL space. So this design doesn't break any current code. I propose the following syntax with the following ecosystem:

ASSERT [ NOTICE, WARNING, EXCEPTION ]
  [ string expression or literal - explicit message ]
  [ USING clause - same as RAISE stmt (possible in future) ]
  ( ROW_COUNT ( = | <> ) ( 1 | 0 ) |
    ( QUERY some query should not be empty ) |
    ( CHECK some expression should be true ) |
    ( IS NOT NULL expression should not be null ) )

Every variant (ROW_COUNT, QUERY, CHECK, IS NOT NULL) has its own default message. That's probably not the ugliest syntax I've *ever* seen, but it's definitely the ugliest syntax I've seen today. I previously proposed RAISE ASSERT ... WHERE, which seems like a nice variant of what we've already got, but perhaps this whole discussion merely illustrates that it's hard to get more than 1 vote for any proposal in this area. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak
Alvaro, * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: Stephen Frost wrote: * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: ALTER TABLE ALL IN TABLESPACE xyz which AFAICS should work since ALL is already a reserved keyword. Pushed to master and REL9_4_STABLE. Thanks. One more tweak --- the whole reason for fiddling with this is to ensure that event triggers support this operation. Therefore this node should be handled by ProcessUtilitySlow, not standard_ProcessUtility, as in the attached patch. Ah, right, of course. I recall looking for what else might need to be changed and apparently missed that distinction in the call sites. (I checked the documentation for necessary updates; turns out that the table in the event triggers chapter says that ddl_command_end etc. support the command ALTER TABLE, and since this is the tag returned by the new ALTER TABLE ALL IN TABLESPACE command, there is no update needed. In fact, one can argue that the table is wrong currently because it doesn't say that ALTER TABLE ALL IN TABLESPACE is not supported.) Heh, yes, true. I propose this for 9.4 too. Agreed. Looks pretty straight-forward, will update soon. Apologies on it taking so long - things have been a bit interesting for me over the past month or two. :) I bet they have! Have fun, Thanks! :) Stephen
Re: [HACKERS] Scaling shared buffer eviction
On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from the above, I think for this patch a catversion bump is required, as I have modified the system catalog. However I have not done the same in the patch, as otherwise it will be a bit difficult to take performance data. One regression failed on Linux due to a spacing issue, which is fixed in the attached patch. Here's a set of test results against this patch:

8-core AMD FX-9590, 32GB RAM

Config:
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
pgbench scale factor = 3000
run duration time = 5min

HEAD
Client Count/No. Of Runs (tps)
            8     16     32     64    128
Run-1   31568  41890  48638  49266  41845
Run-2   31613  41879  48597  49332  41736
Run-3   31674  41987  48647  49320  41745

Patch=scalable_buffer_eviction_v7.patch
Client Count/No. Of Runs (tps)
            8     16     32     64    128
Run-1   31880  42943  49359  49901  43779
Run-2   32150  43078  48934  49828  43769
Run-3   32323  42894  49481  49600  43529

-- Thom
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 8.9.2014 22:44, Robert Haas wrote: On Fri, Sep 5, 2014 at 3:23 PM, Tomas Vondra t...@fuzzy.cz wrote: as Heikki mentioned in his commitfest status message, this patch still has no reviewers :-( Is there anyone willing to pick that up, together with the 'dense allocation' patch (as those two are closely related)? I'm looking in Robert's direction, as he's the one who came up with the dense allocation idea, and also commented on the hashjoin bucket resizing patch ... I'd ask Pavel Stehule, but as he's sitting next to me in the office he's not really independent ;-) I'll do some reviews on the other patches over the weekend, to balance this of course. Will any of those patches be, heh heh, mine? I'll exchange some of the credit for +1. Deal? ;-) I am a bit confused by the relationship between the two patches you posted. The combined patch applies, but the other one does not. If this is a sequence of two patches, it seems like it would be more helpful to post A and B rather than B and A+B. Perhaps the other patch is on a different thread but there's not an obvious reference to said thread here. Yeah, it's probably a bit confusing. The thing is the dense allocation idea was discussed in a different thread, so I posted the patch there: http://www.postgresql.org/message-id/53cbeb8a@fuzzy.cz The patch tweaking hash join buckets builds on the dense allocation, because it (a) makes up for the memory used for more buckets and (b) it's actually easier to rebuild the buckets this way. So I only posted the separate patch for those who want to do a review, and then a complete patch with both parts combined. But it sure may be a bit confusing. In ExecChooseHashTableSize(), if buckets_bytes is independent of nbatch, could we just compute it before working out dbatch, and just deduct it from hash_table_bytes? That seems like it'd do the same thing for less work. Well, we can do that. But I really don't think that's going to make a measurable difference.
And I think the code is more clear this way. Either way, what happens if buckets_bytes is already bigger than hash_table_bytes? Hm, I don't see how that could happen. FWIW, I think the first buckets_bytes formula is actually wrong:

buckets_bytes = sizeof(HashJoinTuple) * my_log2(ntuples / NTUP_PER_BUCKET);

and should be

buckets_bytes = sizeof(HashJoinTuple) * (1 << my_log2(ntuples / NTUP_PER_BUCKET));

instead. Also, we might consider that we never use less than 1024 buckets (but that's a minor issue, I guess). But once we get into the "need batching" branch, we do this:

lbuckets = 1 << my_log2(hash_table_bytes / (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)));

which includes both the bucket (pointer) and tuple size, and IMHO guarantees that bucket_bytes will never be over work_mem (which is what hash_table_bytes is). The only case where this might happen is (tupsize < 8), thanks to the my_log2 (getting (50% work_mem + epsilon), doubled to 100% work_mem). But tupsize is defined as this:

tupsize = HJTUPLE_OVERHEAD + MAXALIGN(sizeof(MinimalTupleData)) + MAXALIGN(tupwidth);

and HJTUPLE_OVERHEAD alone is 12B, MinimalTupleData is 11B (ignoring alignment) etc. So I believe this is safe ... A few more stylistic issues that I see:

+ if ((hashtable->nbatch == 1)
+ if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+ if (size > (HASH_CHUNK_SIZE/8))

While I'm all in favor of using parentheses generously where it's needed to add clarity, these and similar usages seem over the top to me. It seemed more readable for me at the time of coding it, and it still seems better this way, but ... https://www.youtube.com/watch?v=CxK_nA2iVXw On a related note, the first of these reads like this if (stuff) { if (more stuff) { do things } } which makes one wonder why we need two if statements. We probably don't ...
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	HashChunk	chunks_batch;	/* one list for the whole batch */
+
 } HashJoinTableData;

If the original coding didn't have a blank line between the last structure member and the closing brace, it's probably not right to make it that way now. There are similar issues at the end of some functions you modified, and a few other places (like the new code in ExecChooseHashTableSize and chunk_alloc) where there are extra blank lines at the starts and ends of blocks. I'll fix that. FWIW, those issues seem to be in the 'dense allocation' patch. +char * chunk_alloc(HashJoinTable hashtable, int size) { Eh... no. + /* XXX maybe we should use MAXALIGN(size) here ... */ +1. I'm not sure what the 'no' is pointing at? Maybe that the parenthesis should be on the next line? And is the +1 about doing MAXALIGN? regards Tomas
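The difference between my_log2(n) and 1 << my_log2(n) that Tomas corrects above is easy to see in a standalone sketch. my_log2 below has the same shape as PostgreSQL's (smallest k with 2^k >= n); the buckets_bytes_for helper and its 1024-bucket floor are illustrative, not the patch's actual code:

```c
#include <assert.h>

#define NTUP_PER_BUCKET 10

/* Same shape as PostgreSQL's my_log2(): smallest k with 2^k >= num. */
static int
my_log2(long num)
{
	int			i = 0;
	long		limit = 1;

	while (limit < num)
	{
		limit <<= 1;
		i++;
	}
	return i;
}

/*
 * Bucket-array size for a given tuple count.  Multiplying by my_log2(n)
 * alone would multiply by the *exponent* (e.g. 10), not the bucket count
 * (e.g. 1 << 10 = 1024) -- the bug discussed in the thread.
 */
static long
buckets_bytes_for(long ntuples)
{
	long		nbuckets = 1L << my_log2(ntuples / NTUP_PER_BUCKET);

	if (nbuckets < 1024)		/* the minimum mentioned in the thread */
		nbuckets = 1024;
	return (long) sizeof(void *) * nbuckets;
}
```

With 10240 tuples the broken formula would charge sizeof(pointer) * 10 bytes for the bucket array, three orders of magnitude less than the sizeof(pointer) * 1024 actually needed — which is exactly why the accounting had to use the shifted value.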
Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract
On Mon, Sep 8, 2014 at 11:45 AM, Robert Haas [via PostgreSQL] ml-node+s1045698n5818200...@n5.nabble.com wrote: On Thu, Sep 4, 2014 at 6:38 PM, David Johnston [hidden email] wrote: One of the trade-offs I mentioned... it's more style than anything, but removing the parenthetical (if there is no error in the command) and writing it more directly seemed preferable in an overview such as this. Better: The function will either throw an error or return a PGresult object[...] Nope. This is not C++, nor is it the backend. It will not throw anything. The current sentence reads: The response to this (if there is no error in the command) will be a PGresult object bearing a status code of PGRES_COPY_OUT or PGRES_COPY_IN (depending on the specified copy direction). Maybe this is taken for granted by others, and thus can be excluded here, but I'm trying to specify what happens if there is an error in the command... Apparently one does not get back a PGresult object... Would simply saying: A successful response to this will be a PGresult object... be accurate (enough)? I appreciate the time you have taken on this and will look at my thoughts with the new understanding you have given me. Thank You! Thanks for getting involved. I apologize if I was brusque here or at other times in the past (or future). Unfortunately I've been insanely busy and it doesn't bring out the best in me, but I really do value your (and everyone's) efforts to move PostgreSQL forward. The tone of all your responses, including the first one pointing out my lack of supporting context and initially over-ambitious patch, has been very professional. A lot of what you've said resonates since I think subconsciously I understood that I needed to make fundamental changes to my approach and style but it definitely helps to have someone point out specific items and concerns to move those thoughts to the conscious part of the mind. Thank you again for doing that for me.
Ignoring style and such, did anything I wrote help to clarify your understanding of why pqPutCopyEnd does what it does? As I say this and start to try and read the C code, I think that it is a good source for understanding novice assumptions, but there is a gap between the docs and the code - though I'm not sure you've identified the only (or even proper) location. Unlike PQputCopyData there is no explicit logic (pqIsnonblocking(conn)) in PQputCopyEnd for non-blocking mode (there is an implicit one via pqFlush). This does seem like an oversight - if a minor one, since the likelihood of not being able to add the EOF to the connection's buffer seems highly unlikely - but I would expect, on the basis of symmetry alone, for both of them to have buffer-filled testing logic. And, depending on how large *errormsg is, the risk of being unable to place the data in the buffer isn't as small as in the expected EOF case. I'm getting way beyond my knowledge level here, but my assumption from the documentation was that the async mode behavior of returning zero revolved around retrying in order to place the data into the buffer - not retrying to make sure the buffer is empty. The caller deals with that on their own based upon their blocking mode decision. Though we would want to call pqFlush for blocking mode callers here since this function should block until the buffer is clear. So, I thought I agreed with your suggestion that if the final pqFlush returns a 1 then pqPutCopyEnd should return 0. Additionally, if in non-blocking mode, and especially if *errormsg is not NULL, the available connection buffer length logic in pqPutCopyData should be evaluated here as well. However, the 0 being returned due to the pqFlush really isn't anything special for a non-blocking mode user (they have already been told to call pqFlush themselves after calling pqPutCopyEnd) and doesn't apply for blocking mode.
Admittedly they could skip their call if they get a return value of 1 from pqPutCopyEnd but I'm not sure recommending that optimization has much going for it. And, again as you said (I am just discovering this for myself as much as possible), if the pqFlush caused the 0 you would not want to retry whereas in the filled buffer version you would. So we do have to pick one or the other situation and adjust the documentation accordingly. The most useful and compatible solution is to make pqPutCopyEnd synchronous regardless of the user selected blocking mode - which it mostly is now but the final pqFlush should be in a loop - and adjust the documentation in the areas noted in my patch-email to accommodate that fact. Regardless, the logic surrounding placing the *errormsg string into the buffer needs affirmation and a note whether it will block waiting to be put on a full connection buffer. Note that non-blocking users seeing a 1 on the pqPutCopyEnd probably still end up doing their own pqFlush looping calls but
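The flush-retry contract being debated above — a flush that returns 1 for "more to send, try again when the socket is writable" and 0 for "done" — can be mocked without libpq. Nothing below uses real libpq: mock_flush and finish_copy are hypothetical stand-ins whose return conventions merely mirror PQflush()'s (0 success, 1 retry, -1 error), and a pretend counter plays the role of buffered COPY data:

```c
#include <assert.h>

/* Pretend send buffer: number of chunks still queued for the socket. */
static int	pending_chunks = 3;

/*
 * Stand-in for PQflush(): drains one chunk per call, returning 1 while
 * data remains buffered, 0 once everything has been "sent".
 */
static int
mock_flush(void)
{
	if (pending_chunks > 0)
	{
		pending_chunks--;
		return pending_chunks > 0 ? 1 : 0;	/* 1 = call again later */
	}
	return 0;
}

/*
 * The caller-side loop the thread recommends after ending a COPY in
 * non-blocking mode: keep flushing until 0 (drained) or -1 (error).
 * Returns 0 on success, -1 on error, mirroring the libpq convention.
 */
static int
finish_copy(void)
{
	int			rc;

	while ((rc = mock_flush()) == 1)
	{
		/*
		 * Real code would wait here for the socket to become writable
		 * (select/poll/epoll) instead of spinning.
		 */
	}
	return rc;
}
```

The sketch makes the asymmetry in the discussion concrete: a return of 1 from the flush step means "retry the flush", whereas the question in the thread is whether the end-of-copy call itself should ever report that retryable state or absorb it internally.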
Re: [HACKERS] PQputCopyEnd doesn't adhere to its API contract
On Mon, Sep 8, 2014 at 6:19 PM, David Johnston david.g.johns...@gmail.com wrote: On Mon, Sep 8, 2014 at 11:45 AM, Robert Haas [via PostgreSQL] ml-node+s1045698n5818200...@n5.nabble.com wrote: On Thu, Sep 4, 2014 at 6:38 PM, David Johnston wrote:

One of the trade-offs I mentioned... it's more style than anything, but removing the parenthetical (if there is no error in the command) and writing it more directly seemed preferable in an overview such as this. Better: The function will either throw an error or return a PGresult object [...]

Nope. This is not C++, nor is it the backend. It will not throw anything.

The current sentence reads: The response to this (if there is no error in the command) will be a PGresult object bearing a status code of PGRES_COPY_OUT or PGRES_COPY_IN (depending on the specified copy direction). Maybe this is taken for granted by others, and thus can be excluded here, but I'm trying to specify what happens if there is an error in the command... Apparently one does not get back a PGresult object... Would simply saying: A successful response to this will be a PGresult object... be accurate (enough)?

Apparently, NULL is the correct answer. Can that just be assumed to be the case, or does saying that a failure scenario here returns NULL (or something other than a PGresult object) impart useful knowledge?

Dave
Re: [HACKERS] Scaling shared buffer eviction
On Mon, Sep 8, 2014 at 10:12 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Client Count:   8      16      32      64     128
HEAD        58614  107370  140717  104357   65010
Patch       60092  113564  165014  213848  216065

This data (tps) is the median of 3 runs; a detailed report is attached with the mail. I have not repeated the test for all configurations, as there is no major change in design/algorithm that could affect performance. Mark has already taken tpc-b data, which ensures there is no problem with it; however, I will also take it once with the latest version.

Well, these numbers are pretty much amazing.

Question: It seems there's obviously quite a bit of contention left; do you think there's still a significant amount of time in the clock sweep, or is the primary bottleneck the buffer mapping locks?

I think contention around the clock sweep is very minimal, and for buffer mapping locks it has reduced significantly (you can refer upthread to some of the LWLock stat data I posted after this patch), but maybe we can get more out of it by improving hash table concurrency. However, apart from both of the above, the next thing I have seen in profiles was palloc (at least that is what I remember from profiling I did a few months back during development of this patch). That seemed like a totally different optimisation at the time, so I left it for another patch. Another point is that the machine on which I am doing performance tests has 16 cores (64 hardware threads), so I am not sure how much scaling we can expect.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/05/2014 05:21 PM, Pavel Stehule wrote: *shrug* Doing it in SQL would probably break more stuff. I'm trying to contain the damage. And arguably, this is mostly only useful in PL/pgSQL.

I've wanted assertions in SQL enough that I often write trivial wrappers around `raise` in PL/pgSQL for use in `CASE` statements etc.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: plpgsql - Assert statement
On 09/09/2014 05:20 AM, Robert Haas wrote: I previously proposed RAISE ASSERT ... WHERE, which seems like a nice variant of what we've already got, but perhaps this whole discussion merely illustrates that it's hard to get more than one vote for any proposal in this area.

Well, you have Petr's for RAISE EXCEPTION ... WHEN, and I'd also like that or RAISE ASSERT ... WHEN. Much (much) saner than the other proposals on this thread, IMO.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services