procedures and plpgsql PERFORM
Hi,

We allow a function to be invoked as part of a PERFORM statement in plpgsql:

do $$
begin
    perform pg_relation_size('t1');
end; $$ language plpgsql;
DO

But we do not allow a procedure to be invoked this way:

create procedure dummy_proc(a int) as $$
begin
    null;
end; $$ language plpgsql;
CREATE PROCEDURE

do $$
begin
    perform dummy_proc(1);
end; $$ language plpgsql;
ERROR:  dummy_proc(integer) is a procedure
LINE 1: SELECT dummy_proc(1)
               ^
HINT:  To call a procedure, use CALL.
QUERY:  SELECT dummy_proc(1)
CONTEXT:  PL/pgSQL function inline_code_block line 2 at PERFORM

The documentation of PERFORM [1] says, "For any SQL command that does not return rows, for example INSERT without a RETURNING clause, you can execute the command within a PL/pgSQL function just by writing the command." Procedures fit that category, and like functions, I think we should allow them to be invoked directly, without any quoting or CALL decoration.

[1] https://www.postgresql.org/docs/devel/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-NORESULT

-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Walsender timeouts and large transactions
On 7 December 2017 at 01:22, Petr Jelinek wrote:
> On 05/12/17 21:07, Robert Haas wrote:
> >
> > Generally we write if (a && b) { ... } not if (a) { if (b) .. }
> >
> It's rather ugly with && because one of the conditions is two lines, but
> okay, here you go. I am keeping the brackets even though we normally don't
> for one-liners, because it's completely unreadable without them IMHO.

Yeah, that's why I passed on that FWIW. Sometimes breaking up a condition is nice. Personally I intensely dislike the convention of

    if (big_condition && big_condition)
        do_something;

as awfully unreadable, but I guess code convention means you live with things you don't like.

Anyway, I've just hit this bug in the wild for the umpteenth time this year, and I'd like to know what I can do to help progress it to commit+backport.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL
On Thu, Dec 14, 2017 at 08:51:06AM +0700, Ali Akbar wrote:
> 2017-12-13 15:37 GMT+07:00 Amit Langote:
> > On 2017/12/13 15:59, Ali Akbar wrote:
> > >
> > > Thanks for the link to those threads.
> > >
> > > Judging from the discussion there, it will be a long way to prevent
> > > DROP NOT NULL.
> >
> > Yeah, I remembered that discussion when writing my email, but was for
> > some reason convinced that everything's fine even without the elaborate
> > book-keeping of inheritance information for NOT NULL constraints.
> > Thanks Michael for reminding.
>
> Patch for adding the check in pg_upgrade. Going through git history, the
> check for inconsistency in NOT NULL constraints has been there since a
> long time ago. In this patch the check will be applied for all old
> cluster versions. I'm not sure in which version table inheritance was
> released.

Here are some spelling and grammar fixes to that patch:

  but nullabe in its children: nullable
  child column is not market: marked
  with adding: by adding
  and restart: and restarting
  the problem columns: the problematic columns
  9.5, 9.6, 10: 9.5, 9.6, and 10
  restore, that will cause error.: restore phase of pg_upgrade, that would cause an error.

Justin
pg_(total_)relation_size and partitioned tables
Hi. You may have guessed from $subject that the two don't work together.

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1, 2) partition by list (a);
create table p11 partition of p1 for values in (1);
create table p12 partition of p1 for values in (2);
insert into p select i % 2 + 1 from generate_series(1, 1000) i;

On HEAD:

select pg_relation_size('p');
 pg_relation_size
------------------
                0
(1 row)

select pg_total_relation_size('p');
 pg_total_relation_size
------------------------
                      0
(1 row)

After applying the attached patch:

select pg_relation_size('p');
 pg_relation_size
------------------
            49152
(1 row)

select pg_total_relation_size('p');
 pg_total_relation_size
------------------------
                  98304
(1 row)

The patch basically accumulates and returns the sizes of all leaf partitions when passed a partitioned table. Will add it to the next CF.

Thanks,
Amit

From 4a5db12ad0155e043500d87caf2cc9438eb87e23 Mon Sep 17 00:00:00 2001
From: amit
Date: Thu, 14 Dec 2017 13:56:08 +0900
Subject: [PATCH] Teach dbsize.c functions about partitioned tables

---
 src/backend/utils/adt/dbsize.c | 101 ++---
 1 file changed, 85 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index 58c0b01bdc..145f511c6c 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -17,6 +17,7 @@
 #include "access/htup_details.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_tablespace.h"
 #include "commands/dbcommands.h"
@@ -307,13 +308,17 @@ calculate_relation_size(RelFileNode *rfn, BackendId backend, ForkNumber forknum)
 	return totalsize;
 }

-Datum
-pg_relation_size(PG_FUNCTION_ARGS)
+/*
+ * Computes and stores in *size the file size of the specified fork of the
+ * relation with OID relOid.
+ *
+ * Returns true if size was successfully calculated and stored in *size, false
+ * otherwise.
+ */
+static bool
+pg_relation_size_internal(Oid relOid, char *forkName, int64 *size)
 {
-	Oid			relOid = PG_GETARG_OID(0);
-	text	   *forkName = PG_GETARG_TEXT_PP(1);
 	Relation	rel;
-	int64		size;

 	rel = try_relation_open(relOid, AccessShareLock);

@@ -321,17 +326,52 @@ pg_relation_size(PG_FUNCTION_ARGS)
 	 * Before 9.2, we used to throw an error if the relation didn't exist, but
 	 * that makes queries like "SELECT pg_relation_size(oid) FROM pg_class"
 	 * less robust, because while we scan pg_class with an MVCC snapshot,
-	 * someone else might drop the table. It's better to return NULL for
-	 * already-dropped tables than throw an error and abort the whole query.
+	 * someone else might drop the table. It's better to return that we
+	 * couldn't get the size for already-dropped tables than throw an error
+	 * and abort the whole query.
 	 */
 	if (rel == NULL)
-		PG_RETURN_NULL();
+		return false;
+
+	/*
+	 * For a partitioned table, get the size by accumulating that of its
+	 * leaf partitions.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		PartitionDesc partdesc = rel->rd_partdesc;
+		int			i;

-	size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
-								   forkname_to_number(text_to_cstring(forkName)));
+		*size = 0;
+		for (i = 0; i < partdesc->nparts; i++)
+		{
+			int64		partsize;
+
+			if (pg_relation_size_internal(partdesc->oids[i], forkName,
+										  &partsize))
+				*size += partsize;
+		}
+	}
+	else
+		*size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
+										forkname_to_number(forkName));

 	relation_close(rel, AccessShareLock);

+	return true;
+}
+
+Datum
+pg_relation_size(PG_FUNCTION_ARGS)
+{
+	Oid			relOid = PG_GETARG_OID(0);
+	text	   *forkName = PG_GETARG_TEXT_PP(1);
+	int64		size;
+
+	if (!pg_relation_size_internal(relOid, text_to_cstring(forkName),
+								   &size))
+		PG_RETURN_NULL();
+
 	PG_RETURN_INT64(size);
 }

@@ -508,22 +548,51 @@ calculate_total_relation_size(Relation rel)
 	return size;
 }

-Datum
-pg_total_relation_size(PG_FUNCTION_ARGS)
+static int64
+pg_total_relation_size_internal(Oid
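The accumulation logic of pg_relation_size_internal() above can be illustrated outside the backend. The sketch below is not PostgreSQL code: the partition tree is a plain dict standing in for rd_partdesc, and the leaf sizes are hypothetical byte counts chosen to match the example session.

```python
# Illustrative sketch (not PostgreSQL code): a partitioned table's size is
# the sum of its leaf partitions' sizes, computed recursively, mirroring
# what pg_relation_size_internal() does via rd_partdesc.
tree = {              # parent -> list of children; leaves have no entry
    "p": ["p1"],
    "p1": ["p11", "p12"],
}
leaf_sizes = {"p11": 24576, "p12": 24576}   # bytes, hypothetical

def relation_size(rel):
    children = tree.get(rel)
    if children is None:                # a leaf: report its own file size
        return leaf_sizes.get(rel, 0)
    # a partitioned table: accumulate over its (possibly nested) partitions
    return sum(relation_size(c) for c in children)

print(relation_size("p"))   # 49152, matching the patched example above
```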
incorrect error message, while dropping PROCEDURE
Hi,

Currently, if someone tries to drop a PROCEDURE without having the required privilege, or without being its owner, the error message still refers to the object as FUNCTION. Example:

postgres@37737=# drop procedure pro;
ERROR:  must be owner of function pro

This doesn't look correct, especially now that we have a separate object type, OBJECT_PROCEDURE. It seems we need to introduce a new AclObjectKind for PROCEDURE and make the necessary changes to pass the correct AclObjectKind. PFA patch, which introduces the new AclObjectKind (ACL_KIND_PROCEDURE) and a message for it, and passes it through at the appropriate places. Also updated the necessary "make check" expected output.

Regards,
Thanks,
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index e481cf3..680ef18 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -257,6 +257,7 @@ restrict_and_check_grant(bool is_grant, AclMode avail_goptions, bool all_privs,
 			whole_mask = ACL_ALL_RIGHTS_DATABASE;
 			break;
 		case ACL_KIND_PROC:
+		case ACL_KIND_PROCEDURE:
 			whole_mask = ACL_ALL_RIGHTS_FUNCTION;
 			break;
 		case ACL_KIND_LANGUAGE:
@@ -2565,7 +2566,8 @@ ExecGrant_Function(InternalGrant *istmt)
 		this_privileges =
 			restrict_and_check_grant(istmt->is_grant, avail_goptions,
 									 istmt->all_privs, istmt->privileges,
-									 funcId, grantorId, ACL_KIND_PROC,
+									 funcId, grantorId,
+									 (istmt->objtype == ACL_OBJECT_PROCEDURE) ? ACL_KIND_PROCEDURE : ACL_KIND_PROC,
 									 NameStr(pg_proc_tuple->proname),
 									 0, NULL);
@@ -3360,6 +3362,8 @@ static const char *const no_priv_msg[MAX_ACL_KIND] =
 	gettext_noop("permission denied for database %s"),
 	/* ACL_KIND_PROC */
 	gettext_noop("permission denied for function %s"),
+	/* ACL_KIND_PROCEDURE */
+	gettext_noop("permission denied for procedure %s"),
 	/* ACL_KIND_OPER */
 	gettext_noop("permission denied for operator %s"),
 	/* ACL_KIND_TYPE */
@@ -3412,6 +3416,8 @@ static const char *const not_owner_msg[MAX_ACL_KIND] =
 	gettext_noop("must be owner of database %s"),
 	/* ACL_KIND_PROC */
 	gettext_noop("must be owner of function %s"),
+	/* ACL_KIND_PROCEDURE */
+	gettext_noop("must be owner of procedure %s"),
 	/* ACL_KIND_OPER */
 	gettext_noop("must be owner of operator %s"),
 	/* ACL_KIND_TYPE */
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 9553675..d67a68b 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -2261,7 +2261,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 		case OBJECT_PROCEDURE:
 		case OBJECT_ROUTINE:
 			if (!pg_proc_ownercheck(address.objectId, roleid))
-				aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_PROC,
+				aclcheck_error(ACLCHECK_NOT_OWNER, objtype == OBJECT_PROCEDURE ? ACL_KIND_PROCEDURE : ACL_KIND_PROC,
							   NameListToString((castNode(ObjectWithArgs, object))->objname));
 			break;
 		case OBJECT_OPERATOR:
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 2a9c901..7541146 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -2270,7 +2270,7 @@ ExecuteCallStmt(ParseState *pstate, CallStmt *stmt)
 	aclresult = pg_proc_aclcheck(fexpr->funcid, GetUserId(), ACL_EXECUTE);
 	if (aclresult != ACLCHECK_OK)
-		aclcheck_error(aclresult, ACL_KIND_PROC, get_func_name(fexpr->funcid));
+		aclcheck_error(aclresult, ACL_KIND_PROCEDURE, get_func_name(fexpr->funcid));
 	InvokeFunctionExecuteHook(fexpr->funcid);
 	nargs = list_length(fexpr->args);
diff --git a/src/include/utils/acl.h b/src/include/utils/acl.h
index 254a811..44037a0 100644
--- a/src/include/utils/acl.h
+++ b/src/include/utils/acl.h
@@ -191,6 +191,7 @@ typedef enum AclObjectKind
 	ACL_KIND_SEQUENCE,		/* pg_sequence */
 	ACL_KIND_DATABASE,		/* pg_database */
 	ACL_KIND_PROC,			/* pg_proc */
+	ACL_KIND_PROCEDURE,		/* pg_proc procedure */
 	ACL_KIND_OPER,			/* pg_operator */
 	ACL_KIND_TYPE,			/* pg_type */
 	ACL_KIND_LANGUAGE,		/* pg_language */
diff --git a/src/test/regress/expected/create_procedure.out b/src/test/regress/expected/create_procedure.out
index 5538ef2..9a5be38 100644
--- a/src/test/regress/expected/create_procedure.out
+++ b/src/test/regress/expected/create_procedure.out
@@ -73,7 +73,7 @@ GRANT INSERT ON cp_test TO regress_user1;
 REVOKE EXECUTE ON PROCEDURE ptest1(text) FROM PUBLIC;
 SET ROLE regress_user1;
 CALL ptest1('a'); -- error
-ERROR:  permission denied for function ptest1
+ERROR:  permission denied for procedure ptest1
 RESET ROLE;
 GRANT EXECUTE ON PROCEDURE ptest1(text) TO regress_user1;
 SET ROLE regress_user1;
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index e6994f0..c92993d 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -696,7 +696,7 @@ ERROR:
Re: Protect syscache from bloating with negative cache entries
At Fri, 1 Dec 2017 14:12:20 -0800, Andres Freund wrote in <20171201221220.z5e6wtlpl264w...@alap3.anarazel.de>:
> On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
> > Andres Freund writes:
> > > On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> > >> I have no faith in either of these proposals, because they both assume
> > >> that the problem only arises over the course of many minutes. In the
> > >> recent complaint about pg_dump causing relcache bloat, it probably does
> > >> not take nearly that long for the bloat to occur.
> > >
> > > To me that's a bit of a different problem than what I was discussing
> > > here. It also actually doesn't seem that hard - if your caches are
> > > growing fast, you'll continually get hash-resizing of the various.
> > > Adding cache-pruning to the resizing code doesn't seem hard, and
> > > wouldn't add meaningful overhead.
> >
> > That's an interesting way to think about it, as well, though I'm not
> > sure it's quite that simple. If you tie this to cache resizing then
> > the cache will have to grow up to the newly increased size before
> > you'll prune it again. That doesn't sound like it will lead to nice
> > steady-state behavior.
>
> Yea, it's not perfect - but if we do pruning both at resize *and* on
> regular intervals, like once-a-minute as I was suggesting, I don't think
> it's that bad. The steady state won't be reached within seconds, true,
> but the negative consequences of only attempting to shrink the cache
> upon resizing when the cache size is growing fast anyway doesn't seem
> that large.
>
> I don't think we need to be super accurate here, there just needs to be
> *some* backpressure.
>
> I've had cases in the past where just occasionally blasting the cache
> away would've been good enough.

Thank you very much for the valuable suggestions.

I still would like to solve this problem, and the combination of a counter freely running at minute (or several-second) resolution with pruning of too-long-unaccessed entries on resizing seems to me to work well enough for at least several of the known bloat cases. This still has the defect of not handling very quick bloating; I'll try thinking about that remaining issue. If no one has an immediate objection to the direction, I'll come up with an implementation.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
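The scheme discussed in this exchange - a coarse, freely running access counter plus pruning of long-unaccessed entries when the table would otherwise grow - can be illustrated with a toy model. This is a hypothetical sketch, not PostgreSQL's catcache code; all names and thresholds here are invented for illustration.

```python
# Hypothetical sketch of "prune cold entries when the cache would resize":
# a coarse clock stamps each access; once the table reaches a growth
# threshold, entries not touched for max_age ticks are evicted first.
class PruningCache:
    def __init__(self, resize_threshold=4, max_age=2):
        self.data = {}            # key -> (value, last_access_tick)
        self.clock = 0            # coarse counter, e.g. bumped once a minute
        self.resize_threshold = resize_threshold
        self.max_age = max_age

    def tick(self):
        self.clock += 1           # the "freely running counter"

    def get(self, key):
        if key in self.data:
            value, _ = self.data[key]
            self.data[key] = (value, self.clock)   # refresh access stamp
            return value
        return None

    def put(self, key, value):
        if len(self.data) >= self.resize_threshold:
            self._prune()         # try shrinking before growing the table
        self.data[key] = (value, self.clock)

    def _prune(self):
        cutoff = self.clock - self.max_age
        for k in [k for k, (_, t) in self.data.items() if t < cutoff]:
            del self.data[k]

cache = PruningCache()
cache.put("a", 1); cache.put("b", 2)
cache.tick(); cache.tick(); cache.tick()   # "a" and "b" go cold
cache.put("c", 3); cache.put("d", 4)
cache.put("e", 5)                          # hits threshold: cold entries pruned
print(sorted(cache.data))                  # ['c', 'd', 'e']
```

As Tom notes in the quoted exchange, pruning only at resize lets the cache grow to the new size before the next prune; the periodic tick is what supplies backpressure in between.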
Re: [HACKERS] UPDATE of partition key
Thanks for the updated patches, Amit. Some review comments.

You forgot to remove the description of update_rri and num_update_rri in the header comment of ExecSetupPartitionTupleRouting().

+extern void pull_child_partition_columns(Relation rel,
+										  Relation parent,
+										  Bitmapset **partcols);

It seems you forgot to remove this declaration in partition.h, because I don't find it defined or used anywhere.

I think some of the changes that are currently part of the main patch are better taken out into their own patches, because having those diffs appear in the main patch is kind of distracting. Just like you now have a patch that introduces a PartitionTupleRouting structure. I know that leads to too many patches, but it helps to easily tell less substantial changes from the substantial ones.

1. Patch to rename partition_tupconv_maps to parentchild_tupconv_maps.
2. Patch that introduces has_partition_attrs() in place of is_partition_attr().
3. Patch to change the names of map_partition_varattnos() arguments.
4. Patch that does the refactoring involving ExecConstraints(), ExecPartitionCheck(), and the introduction of ExecPartitionCheckEmitError().

Regarding ExecSetupChildParentMap(), it seems to me that it could simply be declared as

static void ExecSetupChildParentMap(ModifyTableState *mtstate);

Looking at the places from where it's called, it seems that you're just extracting information from mtstate and passing the same for the rest of its arguments.

mt_is_tupconv_perpart seems like it's unnecessary. Its function could be fulfilled by inspecting the state of some other fields of ModifyTableState. For example, in the case of an update (operation == CMD_UPDATE), if mt_partition_tuple_routing is non-NULL, then we can always assume that mt_childparent_tupconv_maps has entries for all partitions. If it's NULL, then there would be only entries for partitions that have sub-plans.

tupconv_map_for_subplan() looks like it could be done as a macro.

Thanks,
Amit
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Thu, Dec 14, 2017 at 2:32 AM, Robert Haas wrote:
> On Tue, Dec 12, 2017 at 9:37 PM, Amit Kapila wrote:
>>> Uh, should I just revert that commit entirely first, and then we can
>>> commit the new fix afterward?
>>
>> Yes. I have already extracted the test case of that commit to the new
>> patch, which is what we need from that commit.
>
> OK, done.

Thanks. I think now we can proceed with fix_accum_instr_parallel_workers_v8.patch posted above, which will fix the original issue and the problem we have found in the sort and hash nodes.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel Index Scan vs BTP_DELETED and BTP_HALF_DEAD
On Thu, Dec 14, 2017 at 2:51 AM, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 12:35 AM, Kuntal Ghosh wrote:
>> I've also verified the backward scan case with the query provided by
>> Thomas. In standby,
>> 2. explain analyze select * from t1 where a+1>a order by a desc;
>> and the parallel workers hang.
>> The patch fixes the issue.
>
> Committed and back-patched to v10.

Thanks.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL
2017-12-13 15:37 GMT+07:00 Amit Langote:
> On 2017/12/13 15:59, Ali Akbar wrote:
> >
> > Thanks for the link to those threads.
> >
> > Judging from the discussion there, it will be a long way to prevent DROP
> > NOT NULL.
>
> Yeah, I remembered that discussion when writing my email, but was for some
> reason convinced that everything's fine even without the elaborate
> book-keeping of inheritance information for NOT NULL constraints. Thanks
> Michael for reminding.

Patch for adding the check in pg_upgrade. Going through git history, the check for inconsistency in NOT NULL constraints has been there since a long time ago. In this patch the check will be applied for all old cluster versions. I'm not sure in which version table inheritance was released.

Thanks,
Ali Akbar

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1b9083597c..29bafdff74 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -26,6 +26,7 @@ static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
+static void check_for_not_null_inheritance(ClusterInfo *cluster);
 static char *get_canonical_locale_name(int category, const char *locale);
@@ -99,6 +100,7 @@ check_and_dump_old_cluster(bool live_check)
 	check_for_prepared_transactions(&old_cluster);
 	check_for_reg_data_type_usage(&old_cluster);
 	check_for_isn_and_int8_passing_mismatch(&old_cluster);
+	check_for_not_null_inheritance(&old_cluster);

 	/*
 	 * Pre-PG 10 allowed tables with 'unknown' type columns and non WAL logged
@@ -1096,6 +1098,105 @@ check_for_pg_role_prefix(ClusterInfo *cluster)
 	check_ok();
 }

+/*
+ * check_for_not_null_inheritance()
+ *
+ * Currently pg_upgrade will fail if there are any inconsistencies in NOT NULL
+ * constraint inheritance: In Postgres version 9.5, 9.6, 10 we can have a column
+ * that is NOT NULL in parent, but nullabe in its children. But during schema
+ * restore, that will cause error.
+ */
+static void
+check_for_not_null_inheritance(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	bool		found = false;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for NOT NULL inconsistencies in inheritance");
+
+	snprintf(output_path, sizeof(output_path), "not_null_inconsistent_columns.txt");
+
+	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		PGresult   *res;
+		bool		db_used = false;
+		int			ntups;
+		int			rowno;
+		int			i_nspname,
+					i_relname,
+					i_attname;
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+		PGconn	   *conn = connectToServer(cluster, active_db->db_name);
+
+		res = executeQueryOrDie(conn,
+								"WITH RECURSIVE parents AS ( "
+								"  SELECT i.inhrelid, i.inhparent "
+								"  FROM pg_catalog.pg_inherits i "
+								"  UNION ALL "
+								"  SELECT p.inhrelid, i.inhparent "
+								"  FROM parents p "
+								"  JOIN pg_catalog.pg_inherits i "
+								"    ON i.inhrelid = p.inhparent "
+								") "
+								"SELECT n.nspname, c.relname, ac.attname "
+								"FROM parents p, "
+								"     pg_catalog.pg_attribute ac, "
+								"     pg_catalog.pg_attribute ap, "
+								"     pg_catalog.pg_class c, "
+								"     pg_catalog.pg_namespace n "
+								"WHERE NOT ac.attnotnull AND "
+								"      ac.attrelid = p.inhrelid AND "
+								"      ap.attrelid = p.inhparent AND "
+								"      ac.attname = ap.attname AND "
+								"      ap.attnotnull AND "
+								"      c.oid = ac.attrelid AND "
+								"      c.relnamespace = n.oid");
+
+		ntups = PQntuples(res);
+		i_nspname = PQfnumber(res, "nspname");
+		i_relname = PQfnumber(res, "relname");
+		i_attname = PQfnumber(res, "attname");
+		for (rowno = 0; rowno < ntups; rowno++)
+		{
+			found = true;
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+				pg_fatal("could not open file \"%s\": %s\n", output_path,
+						 strerror(errno));
+			if (!db_used)
+			{
+				fprintf(script, "Database: %s\n", active_db->db_name);
+				db_used = true;
+			}
+			fprintf(script, "  %s.%s.%s\n",
+					PQgetvalue(res, rowno, i_nspname),
+					PQgetvalue(res, rowno, i_relname),
+					PQgetvalue(res, rowno, i_attname));
+		}
+
+		PQclear(res);
+
+		PQfinish(conn);
+	}
+
+	if (script)
+		fclose(script);
+
+	if (found)
+	{
+		pg_log(PG_REPORT, "fatal\n");
+		pg_fatal("Your installation contains has inconsistencies in NOT NULL\n"
+				 "constraint inheritance: child column is not market NOT NULL\n"
+				 "while parent column has NOT NULL constraint. You can fix the\n"
+				 "inconsistency with adding NOT NULL constraint and restart the\n"
+				 "upgrade.\n"
+				 "A list of the problem columns is in the file:\n"
+				 "%s\n\n", output_path);
+	}
+	else
+		check_ok();
+}

/*
 * get_canonical_locale_name
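The logic of the recursive CTE in the patch above - compute the transitive closure of pg_inherits, then flag child columns that are nullable while the same-named ancestor column is NOT NULL - can be sketched in miniature. This is an illustrative model only; the dicts below are hypothetical stand-ins for the catalogs, not pg_upgrade code.

```python
# Hypothetical sketch of the patch's check: walk inheritance transitively
# and report child columns that are nullable where an ancestor's
# same-named column is NOT NULL.
inherits = {"child": ["parent"], "grandchild": ["child"]}   # rel -> parents
not_null = {("parent", "a")}            # (relation, column) marked NOT NULL
columns = {"parent": ["a"], "child": ["a"], "grandchild": ["a"]}

def ancestors(rel):
    """Transitive closure over the inheritance graph (the recursive CTE)."""
    out = []
    for p in inherits.get(rel, []):
        out.append(p)
        out.extend(ancestors(p))
    return out

def inconsistencies():
    found = []
    for rel, cols in columns.items():
        for col in cols:
            if (rel, col) in not_null:
                continue                       # child already NOT NULL: fine
            if any((anc, col) in not_null for anc in ancestors(rel)):
                found.append((rel, col))       # nullable here, NOT NULL above
    return sorted(found)

print(inconsistencies())   # [('child', 'a'), ('grandchild', 'a')]
```

Note that the UNION ALL arm of the CTE is what makes the check catch the grandchild case: a column can be consistent with its direct parent yet inconsistent with a more distant ancestor.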
access/parallel.h lacks PGDLLIMPORT
Hi hackers,

I suppose that extensions are supposed to be allowed to use the facilities in access/parallel.h. I noticed in passing, when I wrote a throwaway test harness, that my Windows build drone complained:

test_sharedtuplestore.obj : error LNK2001: unresolved external symbol ParallelWorkerNumber [C:\projects\postgres\test_sharedtuplestore.vcxproj]
.\Release\test_sharedtuplestore\test_sharedtuplestore.dll : fatal error LNK1120: 1 unresolved externals [C:\projects\postgres\test_sharedtuplestore.vcxproj]

I suppose that all three of these might need PGDLLIMPORT, if they're part of the API for parallel worker management:

extern volatile bool ParallelMessagePending;
extern int ParallelWorkerNumber;
extern bool InitializingParallelWorker;

I'm less sure about the other two, but at least ParallelWorkerNumber is essential for anything that needs to coordinate access to input/output arrays or similar.

-- Thomas Munro http://www.enterprisedb.com
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
Hi David.

On 2017/12/13 18:48, David Rowley wrote:
> On 12 December 2017 at 22:13, Amit Langote wrote:
>> Attached updated patches.
>
> Thanks for sending the updated patches.
>
> I don't have a complete review at the moment, but the following code
> in set_append_rel_pathlist() should be removed.
>
> /* append_rel_list contains all append rels; ignore others */
> if (appinfo->parent_relid != parentRTindex)
>     continue;

Will fix that right away, thanks.

Thanks,
Amit
Re: [HACKERS] Parallel Hash take II
Hi,

Looking at the latest version of the tuplestore patch:

diff --git a/src/backend/utils/sort/sharedtuplestore.c b/src/backend/utils/sort/sharedtuplestore.c
new file mode 100644
index 000..d1233221a58
--- /dev/null
+++ b/src/backend/utils/sort/sharedtuplestore.c
@@ -0,0 +1,583 @@
+/*-
+ *
+ * sharedtuplestore.c
+ *	  Simple mechanism for sharing tuples between backends.
+ *
+ * This module provides a shared temporary tuple storage mechanism, providing
+ * a parallel-aware subset of the features of tuplestore.c. Multiple backends
+ * can write to a SharedTuplestore, and then multiple backends can later scan
+ * the stored tuples. Currently, the only scan type supported is a parallel
+ * scan where each backend reads an arbitrary subset of the tuples that were
+ * written.

Cool.

+/* Chunk written to disk. */
+typedef struct SharedTuplestoreChunk
+{
+	int			ntuples;		/* Number of tuples in this chunk. */
+	bool		overflow;		/* Continuation of previous chunk? */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} SharedTuplestoreChunk;

Ah. I was thinking we could have the 'overflow' variable be an int, indicating the remaining length of the oversized tuple. That'd allow us to skip ahead to the end of the oversized tuple in concurrent processes after hitting it.

+/*
+ * Write a tuple. If a meta-data size was provided to sts_initialize, then a
+ * pointer to meta data of that size must be provided.
+ */
+void
+sts_puttuple(SharedTuplestoreAccessor *accessor, void *meta_data,
+			 MinimalTuple tuple)
+{
+	size_t		size;
+
+	/* Do we have our own file yet? */
+	if (accessor->write_file == NULL)
+	{
+		SharedTuplestoreParticipant *participant;
+		char		name[MAXPGPATH];
+
+		/* Create one. Only this backend will write into it. */
+		sts_filename(name, accessor, accessor->participant);
+		accessor->write_file = BufFileCreateShared(accessor->fileset, name);
+
+		/* Set up the shared state for this backend's file. */
+		participant = &accessor->sts->participants[accessor->participant];
+		participant->writing = true;	/* for assertions only */
+	}
+
+	/* Do we have space? */
+	size = accessor->sts->meta_data_size + tuple->t_len;
+	if (accessor->write_pointer + size >= accessor->write_end)
+	{
+		if (accessor->write_chunk == NULL)
+		{
+			/* First time through. Allocate chunk. */
+			accessor->write_chunk = (SharedTuplestoreChunk *)
+				MemoryContextAllocZero(accessor->context,
+									   STS_CHUNK_PAGES * BLCKSZ);
+			accessor->write_chunk->ntuples = 0;
+			accessor->write_pointer = &accessor->write_chunk->data[0];
+			accessor->write_end = (char *)
+				accessor->write_chunk + STS_CHUNK_PAGES * BLCKSZ;
+		}
+		else
+		{
+			/* See if flushing helps. */
+			sts_flush_chunk(accessor);
+		}
+
+		/* It may still not be enough in the case of a gigantic tuple. */
+		if (accessor->write_pointer + size >= accessor->write_end)
+		{
+			size_t		written;
+
+			/*
+			 * We'll write the beginning of the oversized tuple, and then
+			 * write the rest in some number of 'overflow' chunks.
+			 */
+			if (accessor->write_pointer + accessor->sts->meta_data_size >=
+				accessor->write_end)
+				elog(ERROR, "meta-data too long");

That seems more like an Assert than a proper elog? Given that we're calculating size just a few lines above...

+			if (accessor->sts->meta_data_size > 0)
+				memcpy(accessor->write_pointer, meta_data,
+					   accessor->sts->meta_data_size);
+			written = accessor->write_end - accessor->write_pointer -
+				accessor->sts->meta_data_size;
+			memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
+				   tuple, written);

Also, shouldn't the same Assert() be here as well if you have it above?

+static MinimalTuple
+sts_read_tuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{
+	MinimalTuple tuple;
+	uint32		size;
+	size_t		remaining_size;
+	size_t		this_chunk_size;
+
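The suggestion in this review - make 'overflow' an int holding the remaining length of the oversized tuple, so a concurrent reader can skip ahead - can be sketched in miniature. All names and the tiny 8-byte chunk size below are hypothetical, chosen for illustration; they are not the patch's actual layout.

```python
import math

CHUNK_DATA = 8   # hypothetical tiny chunk payload size, for illustration

def write_tuple(chunks, payload):
    """Split an oversized payload across chunks; each continuation chunk
    records the length still remaining (the int 'overflow' idea)."""
    remaining = len(payload)
    offset = 0
    while remaining > 0:
        part = payload[offset:offset + CHUNK_DATA]
        # overflow = bytes of this tuple still left, counting this chunk;
        # 0 marks a chunk that starts a fresh tuple
        chunks.append({"overflow": 0 if offset == 0 else remaining,
                       "data": part})
        offset += len(part)
        remaining -= len(part)

def skip_count(chunk):
    """How many further chunks a parallel reader can skip after hitting a
    continuation chunk, without reading them."""
    return math.ceil(chunk["overflow"] / CHUNK_DATA) - 1

chunks = []
write_tuple(chunks, b"x" * 20)          # needs 3 chunks: 8 + 8 + 4 bytes
print([c["overflow"] for c in chunks])  # [0, 12, 4]
print(skip_count(chunks[1]))            # 1: jump straight past the next chunk
```

With a bool, a reader landing mid-tuple must walk every continuation chunk; with the remaining length it can compute the jump in one step, which is the point of the suggestion.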
Re: Top-N sorts verses parallelism
On Tue, Dec 12, 2017 at 10:46 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the comparison_cost * tuples * LOG2(2.0 * output_tuples)
> curve doesn't seem to correspond to reality.

Do we want the cost estimate to be accurate for the worst case (which is not terribly unlikely to happen: input order is the opposite of the desired output order), or for the average case?

Cheers,
Jeff
Re: Top-N sorts verses parallelism
On Wed, Dec 13, 2017 at 1:46 AM, Thomas Munro wrote:
> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the comparison_cost * tuples * LOG2(2.0 * output_tuples)
> curve doesn't seem to correspond to reality. Here's a contrived example
> that comes from a benchmark:
>
> create table foo as
>   select generate_series(1, 1000000)::int a, (random() * 1000)::int b;
> analyze foo;
> select * from foo order by b limit N;
>
> But the actual time doesn't really change on this random development
> system: it's around 300ms. I stopped at limit 133 because for this
> particular query, at 134 a Gather Merge plan kicked in and I got
> degree 3 parallel sorting and I got my answer just shy of three times
> faster. The speed-up definitely paid for the parallel_setup_cost and
> only one tuple was sent from each worker to the leader because the
> limit was pushed down.

I get:

rhaas=# explain analyze select * from foo order by b limit 1;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19425.00..19425.00 rows=1 width=8) (actual time=183.129..183.129 rows=1 loops=1)
   ->  Sort  (cost=19425.00..21925.00 rows=1000000 width=8) (actual time=183.128..183.128 rows=1 loops=1)
         Sort Key: b
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on foo  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.046..89.440 rows=1000000 loops=1)
 Planning time: 0.083 ms
 Execution time: 183.171 ms
(7 rows)

If we assume the costing of the Seq Scan node is right, then the startup cost of the Sort node is too low. For the Seq Scan node, each 1ms of execution time corresponds to 161.28 cost units. If that were accurate, then the 183.128 - 89.440 = 93.688 ms of startup time for the Sort ought to have a cost of 14425.00 + (93.688 * 161.28) = 29535.00, but the actual cost is only 19425.00. In other words, the ratio of the startup cost of the sort to the total cost of the seq scan ought to be about 2:1, to match the ratio of the execution times, but it's actually more like 4:3.

I also see that it switches to Gather Merge at LIMIT 134, but if I lower random_page_cost and seq_page_cost from the default values to 0.1, then for me it switches earlier, at 63. Of course, at that point the sort cost is now 3.5x the Seq Scan cost, so now things have gone too far in the other direction and we're still not getting the plan right, but I am out of time to investigate this for the moment.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
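The back-of-the-envelope calculation in this message can be reproduced directly; the numbers below come from the EXPLAIN ANALYZE output quoted in the thread (the cost figures are the planner's, the times are the reported actuals).

```python
# Re-doing the arithmetic from the EXPLAIN ANALYZE output above.
seq_scan_cost = 14425.00      # total cost of the Seq Scan node
seq_scan_ms = 89.440          # its actual execution time
sort_total_ms = 183.128       # actual time at which the Sort produced a row
sort_startup_ms = sort_total_ms - seq_scan_ms   # time spent inside the Sort

cost_per_ms = seq_scan_cost / seq_scan_ms       # cost units per ms, if the
                                                # Seq Scan costing is right
implied_sort_startup = seq_scan_cost + sort_startup_ms * cost_per_ms

print(round(cost_per_ms, 2))         # 161.28
print(round(implied_sort_startup))   # ~29535, vs the planner's 19425.00

# Ratio of sort startup cost to seq scan cost:
print(round(19425.00 / seq_scan_cost, 2))              # planner: ~1.35 (4:3)
print(round(implied_sort_startup / seq_scan_cost, 2))  # reality:  ~2.05 (2:1)
```

The gap between the two ratios is the underestimation being discussed: the planner charges the bounded sort too little startup cost relative to the scan feeding it.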
Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))
On Wed, Oct 18, 2017 at 12:45 PM, Peter Geoghegan wrote:
> Bringing it back to the concrete freeze-the-dead issue, and the
> question of an XID cutoff for safely interrogating CLOG: I guess it
> will be possible to assess a HOT chain as a whole. We can probably do
> this safely while holding a super-exclusive lock on the buffer. I can
> probably find a way to ensure this only needs to happen in a rare slow
> path, when it looks like the invariant might be violated but we need
> to make sure (I'm already following this pattern in a couple of
> places). Realistically, there will be some amount of "try it and see"
> here.

I would like to point out for the record/archives that I now believe that Andres' pending do-over fix for the "Freeze the dead" bug [1] will leave things in *much* better shape when it comes to verification. Andres' patch neatly addresses *all* of the concerns that I raised on this thread.

The high-level idea of relfrozenxid as an unambiguous cut-off point, at which it must be safe to interrogate the CLOG, is restored. Off hand, I'd say that the only interlock amcheck verification now needs when verifying heap pages against the CLOG is a VACUUM-style SHARE UPDATE EXCLUSIVE lock on the heap relation being verified. Every heap tuple must either be observed to be frozen, or must only have hint bits that are observably in agreement with the CLOG. The only complicated part is the comment that explains why this is comprehensive and correct (i.e. does not risk false positives or false negatives). We end up with something that is a bit like a "correct by construction" design.

The fact that Andres also proposes to add a bunch of new defensive "can't happen" hard elog()s (mostly by promoting assertions) should validate the design of tuple + multixact freezing, in the same way that I hope amcheck can.

[1] https://postgr.es/m/20171114030341.movhteyakqeqx...@alap3.anarazel.de

-- Peter Geoghegan
Re: [HACKERS] parallel.c oblivion of worker-startup failures
On Wed, Dec 13, 2017 at 1:41 AM, Amit Kapila wrote:
> I have tried to reproduce this situation by adding error case in
> worker (just before worker returns success ('X' message)) and by
> having breakpoint in WaitForParallelWorkersToFinish. What, I noticed
> is that worker is not allowed to exit till it receives some ack from
> master (basically worker waits at SendProcSignal after sending
> message) and error is reported properly.

SendProcSignal doesn't look to me like it can do anything that can block, so I don't understand this.

> So, if I understand correctly, this means that it will return an error
> even if there is a problem in one worker (exited or not started) even
> though other workers would have easily finished the work. Won't such
> an error can lead to situations where after running the query for one
> hour the user might see some failure just because one of the many
> workers is not started (or some similar failure) even though the query
> would have been completed without that? If so, that doesn't sound
> like a sensible behavior.

I think it's the *only* sensible behavior. parallel.c does not know that the query would have completed successfully without that worker, or at least it doesn't know that it would have returned the right answer. It *might* be the case that the caller wasn't depending on that worker to do anything, but parallel.c doesn't know that.

> This also doesn't appear to be completely safe. If we add
> proc_exit(1) after attaching to error queue (say after
> pq_set_parallel_master) in the worker, then it will lead to *hang* as
> anyone_alive will still be false and as it will find that the sender
> is set for the error queue, it won't return any failure. Now, I think
> even if we check worker status (which will be stopped) and break after
> the new error condition, it won't work as it will still return zero
> rows in the case reported by you above.

Hmm, there might still be a problem there.
I was thinking that once the leader attaches to the queue, we can rely on the leader reaching "ERROR: lost connection to parallel worker" in HandleParallelMessages. However, that might not work because nothing sets ParallelMessagePending in that case. The worker will have detached the queue via shm_mq_detach_callback, but the leader will only discover the detach if it actually tries to read from that queue.

> I think if before making an assumption not to scan locally if we have
> a check to see whether workers are started, then my patch should work
> fine and we don't need to add any additional checks. Also, it won't
> create any performance issue as we will perform such a check only when
> force_parallel_mode is not off or parallel leader participation is off
> which I think are not common cases.

My instinct is that we're going to have a lot of subtle bugs if clients of parallel.c can't count on nworkers_launched to be accurate. If clients didn't need that value, we could just have nworkers and let the callers figure it out, but nobody wants that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
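Why an inaccurate nworkers_launched is dangerous can be shown with a toy model (a hypothetical sketch, not parallel.c itself — the function and message format are invented for illustration): if the leader waits for one completion notification per launched worker, an overstated count means the wait can never be satisfied.

```python
def wait_for_workers(nworkers_launched, messages):
    """Wait for one 'done' notification per launched worker.

    messages: the notifications the leader actually receives. Returns
    True once every launched worker is accounted for; returns False
    where a real leader would block forever.
    """
    received = 0
    if received == nworkers_launched:
        return True  # nothing launched, nothing to wait for
    for msg in messages:
        if msg == "done":
            received += 1
        if received == nworkers_launched:
            return True
    return False  # overstated count: the wait is never satisfied
```

In the model, a worker that failed before sending its "done" message leaves the leader stuck unless either nworkers_launched is corrected downward or the leader separately detects the dead worker — which is the trade-off being debated above.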
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 12, 2017 at 9:37 PM, Amit Kapila wrote:
>> Uh, should I just revert that commit entirely first, and then we can
>> commit the new fix afterward?
>
> Yes. I have already extracted the test case of that commit to the new
> patch which is what we need from that commit.

OK, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] A design for amcheck heapam verification
On Mon, Dec 11, 2017 at 9:35 PM, Michael Paquier wrote:
> + /*
> +  * Generate random seed, or use caller's. Seed should always be a positive
> +  * value less than or equal to PG_INT32_MAX, to ensure that any random seed
> +  * can be recreated through callerseed if the need arises. (Don't assume
> +  * that RAND_MAX cannot exceed PG_INT32_MAX.)
> +  */
> + seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
>
> This could use pg_backend_random() instead.

I don't see the need for a cryptographically secure source of randomness for any current Bloom filter caller, including the test harness. There are many uses of random() just like this throughout the backend, and far more random() calls than pg_backend_random() calls. srandom() seeds random per backend, ensuring non-deterministic behavior across backends.

> +--
> +-- These tests don't produce any interesting output, unless they fail. For an
> +-- explanation of the arguments, and the values used here, see README.
> +--
> +SELECT test_bloomfilter(power => 23,
> +    nelements => 838861,
> +    seed => -1,
> +    tests => 1);
>
> This could also test the reproducibility of the tests with a fixed
> seed number and at least two rounds, a low number of elements could be
> more appropriate to limit the run time.

The runtime is already dominated by pg_regress overhead. As it says in the README, using a fixed seed in the test harness is pointless, because it won't behave in a fixed way across platforms. As long as we cannot ensure deterministic behavior, we may as well fully embrace non-determinism.

> I would vote for simplifying the initialization routine and just
> directly let the caller decide it. Are there implementation
> complications if this is not a power of two? I cannot guess one by
> looking at the code.

Yes, there are. It's easier to reason about the implementation with these restrictions.

> So I would expect people using this API
> are smart enough to do proper sizing.
> Note that some callers may
> accept even a larger false positive rate. In short, this basically
> brings us back to Thomas' point upthread with a size estimation
> routine, and an extra routine doing the initialization:
> https://www.postgresql.org/message-id/CAEepm=0dy53x1hg5dmyzmpv_kn99crxzqbto8gmiosxnyrx...@mail.gmail.com
> Did you consider it? Splitting the size estimation and the actual
> initialization has a lot of benefits in my opinion.

Callers don't actually need to worry about these details. I think it's the wrong call to complicate the interface to simplify the implementation. Thomas was not arguing for the caller being able to specify a false positive rate to work backwards from -- that's not an unreasonable thing, but it seems unlikely to be of use to amcheck, or parallel hash join. Thomas was arguing for making the Bloom filter suitable for use with DSM. I ended up incorporating most of his feedback. The only things that were not added were a DSM-orientated routine for getting the amount of memory required, as well as a separate DSM-orientated routine for initializing the caller's shared memory buffer. That's likely inconvenient for most callers, and could be restrictive if and when Bloom filters use resources other than memory, such as temp files.

> +/*
> + * What proportion of bits are currently set?
> + *
> + * Returns proportion, expressed as a multiplier of filter size.
> + *
> [...]
> + */
> +double
> +bloom_prop_bits_set(bloom_filter *filter)
>
> Instead of that, having a function that prints direct information
> about the filter's state would be enough with the real number of
> elements and the number of bits set in the filter. I am not sure that
> using rates is a good idea, sometimes rounding can cause confusion.

The Bloom filter doesn't track the number of elements itself. Callers that care can do it themselves. bloom_prop_bits_set() isn't very important, even compared to other types of instrumentation.
It's moderately useful as a generic indicator of whether or not the Bloom filter was totally overwhelmed. That's about it. -- Peter Geoghegan
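For intuition about what bloom_prop_bits_set() would report, the expected fraction of set bits can be derived analytically. This is a back-of-the-envelope sketch using standard Bloom filter math, not code from the patch: with m bits, k hash functions, and n inserted elements, the expected proportion of set bits is 1 - (1 - 1/m)^(k*n) ≈ 1 - e^(-kn/m). A value near 1.0 means the filter is saturated ("totally overwhelmed") and its false positive rate is useless.

```python
import math

def expected_bits_set(m_bits, k_hashes, n_elements):
    """Expected fraction of set bits in a Bloom filter.

    Uses the standard approximation 1 - exp(-k*n/m); a well-sized
    filter typically sits at or below ~0.5.
    """
    return 1.0 - math.exp(-k_hashes * n_elements / m_bits)
```

For example, the regression-test parameters quoted above (power => 23, i.e. 2^23 bits, with 838861 elements and, say, two hash functions — the hash count is an assumption here) give roughly 0.18, comfortably below saturation.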
Re: [HACKERS] Custom compression methods
On 12/13/2017 05:55 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>> On 12/13/2017 01:54 AM, Robert Haas wrote:
>>> 3. Compression is only applied to large-ish values. If you are just
>>> making the data type representation more compact, you probably want to
>>> apply the new representation to all values. If you are compressing in
>>> the sense that the original data gets smaller but harder to interpret,
>>> then you probably only want to apply the technique where the value is
>>> already pretty wide, and maybe respect the user's configured storage
>>> attributes. TOAST knows about some of that kind of stuff.
>>
>> Good point. One such parameter that I really miss is compression level.
>> I can imagine tuning it through CREATE COMPRESSION METHOD, but it does
>> not seem quite possible with compression happening in a datatype.
>
> Hmm, actually isn't that the sort of thing that you would tweak using a
> column-level option instead of a compression method?
>
>     ALTER TABLE ALTER COLUMN SET (compression_level=123)
>
> The only thing we need for this is to make tuptoaster.c aware of the
> need to check for a parameter.

Wouldn't that require some universal compression level, shared by all supported compression algorithms? I don't think there is such a thing.

Defining it should not be extremely difficult, although I'm sure there will be some cumbersome cases. For example, what if an algorithm "a" supports compression levels 0-10, and algorithm "b" only supports 0-3? You may define 11 "universal" compression levels, and map the four levels for "b" to that (how?). But then everyone has to understand how that "universal" mapping is defined.

Another issue is that there are algorithms without a compression level (e.g. pglz does not have one, AFAICS), or with a somewhat different definition (lz4 does not have levels, and instead has "acceleration", which may be an arbitrary positive integer, so not really compatible with a "universal" compression level).
So to me the ALTER TABLE ALTER COLUMN SET (compression_level=123) seems more like an unnecessary hurdle ...

>>> I don't think TOAST needs to be entirely transparent for the
>>> datatypes. We've already dipped our toe in the water by allowing some
>>> operations on "short" varlenas, and there's really nothing to prevent
>>> a given datatype from going further. The OID problem you mentioned
>>> would presumably be solved by hard-coding the OIDs for any built-in,
>>> privileged compression methods.
>>
>> Stupid question, but what do you mean by "short" varlenas?
>
> Those are varlenas with 1-byte header rather than the standard 4-byte
> header.

OK, that's what I thought. But that is still pretty transparent to the data types, no?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
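The mapping problem raised above can be illustrated concretely. This is a hypothetical sketch, not a proposal from the thread: squeezing a "universal" 0-10 compression level into an algorithm that only supports levels 0-3 necessarily lumps several universal levels together, which is exactly the ambiguity being objected to.

```python
def map_level(universal, algo_min, algo_max, uni_min=0, uni_max=10):
    """Linearly map a 'universal' level onto an algorithm's native range.

    Several universal levels inevitably collapse onto one native level
    when the native range is narrower -- the mapping is lossy and its
    exact shape (linear? stepped?) is an arbitrary convention.
    """
    if not uni_min <= universal <= uni_max:
        raise ValueError("universal level out of range")
    span = uni_max - uni_min
    return algo_min + round((universal - uni_min) * (algo_max - algo_min) / span)
```

With this (arbitrary) linear convention, universal levels 0-10 map onto lz4-style 0-3 so that, e.g., universal 3 and 4 both land on native 1 — every user and every tool would have to know this convention to interpret a stored level.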
Re: [HACKERS] Surjective functional indexes
Thank you for the review.

On 13.12.2017 14:29, Simon Riggs wrote:
> On 4 December 2017 at 15:35, Konstantin Knizhnik wrote:
>> On 30.11.2017 05:02, Michael Paquier wrote:
>>> Two months later, this patch is still waiting for a review (you are
>>> listed as well as a reviewer of this patch). The documentation of the
>>> patch has conflicts, please provide a rebased version. I am moving
>>> this patch to next CF with waiting on author as status.
>>
>> Attached please find new patch with resolved documentation conflict.
>
> I've not posted a review because it was my hope that I could fix up
> the problems with this clever and useful patch and just commit it. I
> spent 2 days trying to do that but some problems remain and I've run
> out of time.
>
> Patch 3 has got some additional features that on balance I don't
> think we need. Patch 1 didn't actually work which was confusing when
> I tried to go back to that.
>
> Patch 3 Issues
>
> * Auto tuning added 2 extra items to Stats for each relation. That is
> too high a price to pay, so we should remove that. If we can't do
> autotuning without that, please remove it.

Right now the size of this structure is 168 bytes. Adding two extra fields increases it to 184 bytes - 9%. Is it really so critical? What is the typical/maximal number of relations in a database? I can not believe that there can be more than a thousand non-temporary relations in any database. Certainly an application can generate an arbitrary number of temporary tables. But millions of such tables cause catalog bloat and will have an awful impact on performance in any case. And for any reasonable number of relations (< million), the extra memory used by this statistic is negligible.

Also I think that these two fields can be useful not only for projection indexes. They can also be used by DBAs and the optimizer, because they more precisely describe the relation's update pattern. I really love your idea with autotune and it will be a pity to reject it just because of extra 16 bytes per relation.

> * Patch needs a proper test case added. We shouldn't just have a
> DEBUG1 statement in there for testing - very ugly. Please rewrite
> tests using functions that show how many changes have occurred during
> the current statement/transaction.
>
> * Parameter coding was specific to that section of code. We need a
> capability to parse that parameter from other locations, just as we
> do with all other reloptions. It looks like it was coded that way
> because of difficulties with code placement. Please solve this with
> generic code, not just code that solves only the current issue. I'd
> personally be happier without any option at all, but if y'all want
> it, then please add the code in the right place.

Sorry, I do not completely understand what you mean by "parameter coding". Do you mean the IsProjectionFunctionalIndex function and filling relation options in it? It is the only reported issue which I do not understand how to fix.

> * The new coding for heap_update() uses the phrase "warm", which is
> already taken by another patch. Please avoid confusing things by
> re-using the same term for different things. The use of these
> technical terms like projection and surjective doesn't seem to add
> anything useful to the patch
>
> * Rename parameter to recheck_on_update
>
> * Remove random whitespace changes in the patch
>
> So at this stage the concept is good and works, but the patch is only
> just at the prototype stage and nowhere near committable. I hope you
> can correct these issues and if you do I will be happy to review and
> perhaps commit.

Thank you once again for your efforts in improving this patch. I will fix all reported issues.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
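The core idea behind the patch under review — skip an expression index update when the indexed value did not actually change — can be sketched in a few lines. This is a rough model of the assumed semantics, not the patch's C code; the function and row representation are invented for illustration.

```python
def needs_index_update(index_expr, old_row, new_row):
    """Decide whether an expression index must be updated for this UPDATE.

    Re-evaluate the index expression on the old and new tuple versions;
    if the computed values are equal, the index entry is unchanged and
    the (potentially expensive) index update can be skipped -- at the
    cost of evaluating the expression twice, which is the trade-off the
    recheck_on_update option is meant to control.
    """
    return index_expr(old_row) != index_expr(new_row)
```

For instance, with an index on lower(email), updating an unrelated column — or changing only the case of the address — leaves the indexed value intact, so the index update is skipped.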
Re: plpgsql test layout
On 12/12/17 22:59, Michael Paquier wrote:
> Attached is what I have come up with, based on Peter's v2.

This has been committed. Thanks!

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: That mode-700 check on DATADIR again
On 12/11/17 9:41 PM, Chapman Flack wrote:
> I have, more or less, this classic question:
> https://www.postgresql.org/message-id/4667C403.1070807%40t3go.de
>
> However, when you stat a file with a POSIX ACL, you get shown the ACL's
> 'mask' entry (essentially the ceiling of all the 'extra' ACL entries)
> in place of the old-fashioned group bits. So in a non-ACL-aware ls or
> stat, the above looks like:
>
> [data]# ls -ld
> drwxr-x---+ 22 postgres postgres 4096 Dec 11 18:14 .
>
> ... and postgres refuses to start because it mistakes the r-x for
> 'group' permissions. ACLs have added new semantics to POSIX
> permissions, and postgres doesn't understand them when it makes this
> hey-don't-shoot-your-foot check.

I'm working on a patch that allows $PGDATA to have group r-x so that a non-privileged user in the group can do a file-level backup. However, it looks like it would work for your case as well based on your ACL. I'm planning to have the patch ready sometime next week and I'll respond here once it goes into the CF. Reviews would be welcome!

Thanks,
--
-David
da...@pgmasters.net
Re: WIP: a way forward on bootstrap data
On 12/13/17 04:06, John Naylor wrote:
> There doesn't seem to be any interest in bootstrap data at the moment,
> but rather than give up just yet, I've added a couple features to make
> a data migration more compelling:

I took a brief look at your patches, and there appear to be a number of good cleanups in there at least. But could you please send patches in git format-patch format with commit messages, so we don't have to guess what each patch does?

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: Exclude unlogged tables from base backups
On 12/13/17 10:04 AM, Stephen Frost wrote:
> Just to be clear- the new base backup code doesn't actually *do* the
> non-init fork removal, it simply doesn't include the non-init fork in
> the backup when there is an init fork, right?

It does *not* do the unlogged non-init fork removal. The code I refactored in reinit.c is about identifying the forks, not removing them. That code is reused to determine what to exclude from the backup. I added the regression tests to ensure that the behavior of reinit.c is unchanged after the refactor.

> We certainly wouldn't want a basebackup actually running around
> removing the main fork for unlogged tables on a running and otherwise
> healthy system. ;)

That would not be good.

--
-David
da...@pgmasters.net
Re: server crash with CallStmt
On 12/13/17 04:46, Ashutosh Bapat wrote:
> I think it shouldn't reach this code. Like a normal function
> invocation, an aggregate invocation should not be allowed in a CALL
> statement. Here's a patch to fix it. I thought that window function
> invocations and casts had a similar problem, but they are prohibited
> by the backend grammar.

Committed your fix, thanks. I took out the hint, because calling an aggregate function is a bit more complicated than just replacing the CALL by SELECT.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Something for the TODO list: deprecating abstime and friends
> On Dec 13, 2017, at 12:07 AM, Andres Freund wrote:
>
> Hi,
>
> On 2017-07-17 12:54:31 -0700, Mark Dilger wrote:
>> These types provide a 4-byte datatype for storing real-world second
>> precision timestamps, as occur in many log files.
>
> These seem to be getting less common IME, most products have higher
> resolution these days. If this were nicely written, maintainable, and
> documented code my position would differ, but it really isn't.
>
>> That said, I am fully aware that these are deprecated and expect you
>> will remove them, at which time I'll have to keep them in my tree
>> and politely refuse to merge in your change which removes them.
>
> It'd be way less work to package abstime as an extension if you want to
> do so.

After proposing to submit a patch for the secondstamp datatype (which I mentioned upthread), Robert objected to the idea of data on disk changing meaning, which was a part of the idea that Tom seemed to be willing to accept. Since I couldn't get both Tom and Robert on board with any particular design, I silently withdrew from the development of any such patch.

This has happened on several proposals I have made on this list over the years. Unless there is fairly unanimous support from the committers, I don't bother following through with a proposal, given the improbability of it getting accepted. I would happily finish and submit that prior proposal if there were general agreement that it is a good design. I have no interest in making abstime into an extension, however.

mark
Re: PATCH: Exclude unlogged tables from base backups
David,

* David Steele (da...@pgmasters.net) wrote:
> On 12/12/17 8:48 PM, Stephen Frost wrote:
>> I don't think there is, because, as David points out, the unlogged
>> tables are cleaned up first and then WAL replay happens during recovery,
>> so the init fork will cause the relation to be overwritten, but then
>> later the logged 'drop table' and subsequent re-use of the relfilenode
>> to create a new table (or persistence change) will all be in the WAL and
>> will be replayed over top and will take care of this.
>
> Files can be copied in any order, so if an OID is recycled the backup
> could copy its first, second, or nth incarnation. It doesn't really
> matter since all of it will be clobbered by WAL replay.
>
> The new base backup code just does the non-init fork removal in advance,
> following the same rules that would apply on recovery given the same
> file set.

Just to be clear- the new base backup code doesn't actually *do* the non-init fork removal, it simply doesn't include the non-init fork in the backup when there is an init fork, right?

We certainly wouldn't want a basebackup actually running around removing the main fork for unlogged tables on a running and otherwise healthy system. ;)

Thanks!

Stephen
Re: [HACKERS] eval_const_expresisions and ScalarArrayOpExpr
> On 12 September 2017 at 15:29, Tom Lane wrote:
>
> This patch no longer applies cleanly on HEAD, so here's a rebased version
> (no substantive changes). As before, I think the most useful review task
> would be to quantify whether it makes planning noticeably slower.

I tried to experiment a bit with this patch, hope it may be helpful. Basically I've made some number of pgbench runs over an instance with/without the patch, assuming that, since the only difference would be in the planner, this will show the performance impact of the patch. Also, I've manually collected some statistics for "Planning time" from explain analyze (I suppose it's also a valid test). As a test target I've used queries like `select * from test where 1 in (1, 2)` (as a side note - the involved table was empty), for different numbers of elements in the array, and for situations when this condition is true/false.

So, for:

    pgbench -c 50 -s 100 -j 50 -n -t 1000 -f const_eval.sql

I've got this data (avg latency):

    type         w/ patch   w/o patch
    short array  5.203      5.564
    long array   9.293      9.667

Just out of curiosity I've also made the same test for constant arrays instead of integers, but in this case the numbers with and without the patch were almost identical.

For explain analyze (avg planning time):

    type         w/ patch   w/o patch
    short array  0.214      0.226
    long array   0.374      0.320

I'm not sure why, for the long array case with the patch, I have smaller latency (I thought that more complexity should have a negative impact on performance) but a bit longer planning time. But in the end it looks like the difference is not that big.

Speaking about the code, everything looks fine. As for me, some variables could be named in a more self-explanatory way (e.g. `ece_evaluate_expr`, where `ece` is `eval_const_expressions`, or `saop`, which is `ScalarArrayOp`), but it's minor.

I wonder if it makes sense in this patch to also deal with the following situation?
    explain analyze select * from test where 1 in (select 1);
                                              QUERY PLAN
    ---
     Result  (cost=0.02..35.52 rows=2550 width=4) (actual time=0.036..0.036 rows=0 loops=1)
       One-Time Filter: (hashed SubPlan 1)
       ->  Seq Scan on test  (cost=0.02..35.52 rows=2550 width=4) (actual time=0.009..0.009 rows=0 loops=1)
       SubPlan 1
         ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1)
     Planning time: 0.381 ms
     Execution time: 0.360 ms

It looks quite similar from a user perspective (although since a subplan is involved, I assume it may be quite different in terms of implementation).
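The constant-folding being benchmarked above can be modeled in miniature. This is a simplified illustration (not the planner's C code — the "constant" test and representation are invented for the sketch): when the left operand and every array element of an expression like `1 IN (1, 2)` are constants, the whole ScalarArrayOpExpr can be replaced by a constant boolean at plan time.

```python
def fold_in_list(lhs, elems):
    """Fold 'lhs IN (elems...)' to a constant if all operands are constant.

    Returns True/False when foldable; returns None when any operand is
    non-constant, i.e. the expression must be left for the executor.
    Callables stand in for non-constant expressions in this toy model.
    """
    is_const = lambda v: not callable(v)
    if is_const(lhs) and all(is_const(e) for e in elems):
        return lhs in elems  # evaluated once, at "plan time"
    return None
```

This also shows why the patch's cost is paid purely at planning time — exactly the thing the pgbench runs above try to measure.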
Re: PATCH: Exclude unlogged tables from base backups
On 12/12/17 8:48 PM, Stephen Frost wrote:
> Andres,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2017-12-12 18:04:44 -0500, David Steele wrote:
>>> If the forks are written out of order (i.e. main before init), which is
>>> definitely possible, then I think worst case is some files will be backed up
>>> that don't need to be. The main fork is unlikely to be very large at that
>>> point so it doesn't seem like a big deal.
>>>
>>> I don't see this as any different than what happens during recovery. The
>>> unlogged forks are cleaned / re-inited before replay starts which is the
>>> same thing we are doing here.
>>
>> It's quite different - in the recovery case there's no other write
>> activity going on. But on a normally running cluster the persistence of
>> existing tables can get changed, and oids can get recycled. What
>> guarantees that between the time you checked for the init fork the table
>> hasn't been dropped, the oid reused and now a permanent relation is in
>> its place?
>
> We *are* actually talking about the recovery case here because this is a
> backup that's happening and WAL replay will be happening after the
> pg_basebackup is done and then the backup restored somewhere and PG
> started up again.
>
> If the persistence is changed then the table will be written into the
> WAL, no? All of the WAL generated during a backup (which is what we're
> talking about here) has to be replayed after the restore is done and is
> before the database is considered consistent, so none of this matters,
> as far as I can see, because the drop table or alter table logged or
> anything else will be in the WAL that ends up getting replayed.

Yes - that's the way I see it. At least when I'm not tired from a day of coding like I was last night...

> I don't think there is, because, as David points out, the unlogged
> tables are cleaned up first and then WAL replay happens during recovery,
> so the init fork will cause the relation to be overwritten, but then
> later the logged 'drop table' and subsequent re-use of the relfilenode
> to create a new table (or persistence change) will all be in the WAL and
> will be replayed over top and will take care of this.

Files can be copied in any order, so if an OID is recycled the backup could copy its first, second, or nth incarnation. It doesn't really matter since all of it will be clobbered by WAL replay.

The new base backup code just does the non-init fork removal in advance, following the same rules that would apply on recovery given the same file set.

--
-David
da...@pgmasters.net
Re: [HACKERS] Custom compression methods
On Tue, 12 Dec 2017 15:52:01 -0500, Robert Haas wrote:
>> Yes. I wonder if \d or \d+ can show it somehow.

Yes, in the current version of the patch, \d+ shows the current compression. It can be extended to show a list of current compression methods.

Since we agreed on the ALTER syntax, I want to clear things up about CREATE. Should it be CREATE ACCESS METHOD .. TYPE COMPRESSION or CREATE COMPRESSION METHOD? I like the access method approach, and it simplifies the code, but I'm just not sure whether compression is an access method or not.

Current implementation
----------------------

To avoid extra patches I also want to clear things up about the current implementation. Right now there are two tables, "pg_compression" and "pg_compression_opt". When a compression method is linked to a column, it creates a record in pg_compression_opt. This record's Oid is stored in the varlena. These Oids are kept in the first column so I can move them in pg_upgrade, but in all other aspects they behave like usual Oids. Also it's easy to restore them.

Compression options are linked to a specific column. When a tuple is moved between relations it will be decompressed.

Also, in the current implementation SET COMPRESSION contains WITH syntax, which is used to provide extra options to the compression method.

What could be changed
---------------------

As Alvaro mentioned, a COMPRESSION METHOD is practically an access method, so it could be created as CREATE ACCESS METHOD .. TYPE COMPRESSION. This approach simplifies the patch, and the "pg_compression" table could be removed. So a compression method would be created with something like:

    CREATE ACCESS METHOD .. TYPE COMPRESSION HANDLER awesome_compression_handler;

The syntax of SET COMPRESSION changes to SET COMPRESSION .. PRESERVE, which is useful for controlling rewrites and for pg_upgrade, to make dependencies between moved compression options and compression methods from the pg_am table.
Default compression is always pglz, and if users want to change it they run:

    ALTER COLUMN SET COMPRESSION awesome PRESERVE pglz;

Without PRESERVE it will rewrite the whole relation using the new compression. Also, the rewrite removes all unlisted compression options, so their compression methods can be safely dropped.

The "pg_compression_opt" table could be renamed to "pg_compression", and compression options would be stored there.

I'd like to keep extra compression options; for example, pglz can be configured with them. The syntax would be slightly changed:

    SET COMPRESSION pglz WITH (min_comp_rate=25) PRESERVE awesome;

Setting the same compression method with different options will create a new compression options record for future tuples, but will not rewrite the table.

--
Regards,
Ildus Kurbangaliev
Re: [HACKERS] Surjective functional indexes
On 4 December 2017 at 15:35, Konstantin Knizhnik wrote:
> On 30.11.2017 05:02, Michael Paquier wrote:
>> On Wed, Sep 27, 2017 at 4:07 PM, Simon Riggs wrote:
>>> On 15 September 2017 at 16:34, Konstantin Knizhnik wrote:
>>>> Attached please find yet another version of the patch.
>>>
>>> Thanks. I'm reviewing it.
>>
>> Two months later, this patch is still waiting for a review (you are
>> listed as well as a reviewer of this patch). The documentation of the
>> patch has conflicts, please provide a rebased version. I am moving
>> this patch to next CF with waiting on author as status.
>
> Attached please find new patch with resolved documentation conflict.

I've not posted a review because it was my hope that I could fix up the problems with this clever and useful patch and just commit it. I spent 2 days trying to do that but some problems remain and I've run out of time.

Patch 3 has got some additional features that on balance I don't think we need. Patch 1 didn't actually work, which was confusing when I tried to go back to that.

Patch 3 Issues

* Auto tuning added 2 extra items to Stats for each relation. That is too high a price to pay, so we should remove that. If we can't do autotuning without that, please remove it.

* Patch needs a proper test case added. We shouldn't just have a DEBUG1 statement in there for testing - very ugly. Please rewrite tests using functions that show how many changes have occurred during the current statement/transaction.

* Parameter coding was specific to that section of code. We need a capability to parse that parameter from other locations, just as we do with all other reloptions. It looks like it was coded that way because of difficulties with code placement. Please solve this with generic code, not just code that solves only the current issue. I'd personally be happier without any option at all, but if y'all want it, then please add the code in the right place.

* The new coding for heap_update() uses the phrase "warm", which is already taken by another patch. Please avoid confusing things by re-using the same term for different things. The use of technical terms like projection and surjective doesn't seem to add anything useful to the patch.

* Rename parameter to recheck_on_update

* Remove random whitespace changes in the patch

So at this stage the concept is good and works, but the patch is only just at the prototype stage and nowhere near committable. I hope you can correct these issues and if you do I will be happy to review and perhaps commit.

Thanks

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev wrote:
>> 0002-pg-trgm-strict_word-similarity.patch – implementation of
>> strict_word_similarity() with comments, docs and tests.
>
> After some looking in
>
> 1)
> repeated piece of code:
>
> + if (strategy == SimilarityStrategyNumber)
> +     nlimit = similarity_threshold;
> + else if (strategy == WordSimilarityStrategyNumber)
> +     nlimit = word_similarity_threshold;
> + else /* strategy == StrictWordSimilarityStrategyNumber */
> +     nlimit = strict_word_similarity_threshold;
>
> Isn't it better to move that piece to separate function?

Good point. Moved to a separate function.

> 2)
> calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
>     bool check_only, bool word_bounds)
>
> Seems, the two bool args are replaceable with a bitwise-ORed flag. It
> will simplify adding new options in future.

Yep. I've introduced flags. Also, I've adjusted tests to make them stable (found an example where the TOP-8 distances are unique).

Please find the revised patch in the attachment.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] path toward faster partition pruning
On 12 December 2017 at 22:13, Amit Langote wrote:
> Attached updated patches.

Thanks for sending the updated patches. I don't have a complete review at the moment, but the following code in set_append_rel_pathlist() should be removed:

    /* append_rel_list contains all append rels; ignore others */
    if (appinfo->parent_relid != parentRTindex)
        continue;

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: server crash with CallStmt
On Wed, Dec 13, 2017 at 12:55 PM, Rushabh Lathia wrote:
> Hi,
>
> Consider the test below:
>
> postgres@110311=# call sum(2+2);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> Call stack:
>
> (gdb) bt
> #0  0x7fef9b5851d7 in raise () from /lib64/libc.so.6
> #1  0x7fef9b5868c8 in abort () from /lib64/libc.so.6
> #2  0x009ec5b5 in ExceptionalCondition (conditionName=0xb7e12b "!(((bool) 0))",
>     errorType=0xb7e04c "FailedAssertion", fileName=0xb7e040 "parse_agg.c",
>     lineNumber=349) at assert.c:54
> #3  0x005dc3b6 in check_agglevels_and_constraints (pstate=0x2d1ba90,
>     expr=0x2d1bd80) at parse_agg.c:349
> #4  0x005dc104 in transformAggregateCall (pstate=0x2d1ba90, agg=0x2d1bd80,
>     args=0x2d1c3c0, aggorder=0x0, agg_distinct=0 '\000') at parse_agg.c:237
> #5  0x005f4df7 in ParseFuncOrColumn (pstate=0x2d1ba90, funcname=0x2d10578,
>     fargs=0x2d1c3c0, last_srf=0x0, fn=0x2d10760, proc_call=1 '\001',
>     location=5) at parse_func.c:726
>
> It's hitting the Assert() here because pstate->p_expr_kind is set to
> EXPR_KIND_NONE during function resolution for the CallStmt.
> I think it shouldn't reach this code.

Like a normal function invocation, an aggregate invocation should not be allowed in a CALL statement. Here's a patch to fix that. I thought that window function invocations and casts had a similar problem, but those are already prohibited by the backend grammar.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 2f20516..5ba6065 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -336,6 +336,16 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 		Form_pg_aggregate classForm;
 		int			catDirectArgs;
 
+		if (proc_call)
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_FUNCTION),
+					 errmsg("%s is not a procedure",
+							func_signature_string(funcname, nargs,
+												  argnames,
+												  actual_arg_types)),
+					 errhint("To call an aggregate function, use SELECT."),
+					 parser_errposition(pstate, location)));
+
 		tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(funcid));
 		if (!HeapTupleIsValid(tup))	/* should not happen */
 			elog(ERROR, "cache lookup failed for aggregate %u", funcid);
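With that proc_call check in place, the crashing session from the report should fail with a clean error instead of an assertion failure. A sketch of the expected behaviour, with the error and hint text taken from the patch's message strings (exact psql formatting may differ):

```sql
CALL sum(2+2);
-- ERROR:  sum(integer) is not a procedure
-- HINT:  To call an aggregate function, use SELECT.

-- The supported way to invoke the aggregate:
SELECT sum(x) FROM (VALUES (2), (2)) AS t(x);
```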
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Wed, Dec 13, 2017 at 4:30 PM, Andres Freund wrote:
> On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
>> When we add extra blocks to a relation, do we access the disk? I
>> guess we just call lseek and write, and don't access the disk. If so,
>> the performance degradation might not be much.
>
> Usually changes in the file size require the filesystem to perform
> metadata operations, which in turn require journaling on most
> filesystems. That will often result in synchronous disk writes.

Thank you. I understand now why this measurement should use two different filesystems.

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Something for the TODO list: deprecating abstime and friends
Hi,

On 2017-07-17 12:54:31 -0700, Mark Dilger wrote:
> These types provide a 4-byte datatype for storing real-world second-
> precision timestamps, as occur in many log files.

These seem to be getting less common IME; most products have higher resolution these days. If this were nicely written, maintainable, and documented code my position would differ, but it really isn't.

> That said, I am fully aware that these are deprecated and expect you
> will remove them, at which time I'll have to keep them in my tree
> and politely refuse to merge in your change which removes them.

It'd be way less work to package abstime as an extension, if you want to do so.

Greetings,

Andres Freund