Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On 2018/07/12 14:32, Amit Langote wrote:
> Thanks Ashutosh for reporting and Dilip for the analysis and the patch.
>
> On 2018/07/11 21:39, Dilip Kumar wrote:
>> On Wed, Jul 11, 2018 at 5:36 PM, amul sul wrote:
>>> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar wrote:
>>
>>> I am not sure that I have understood the following comments:
>>>
>>> 11 +	 * Generate one prune step for the information derived from IS NULL,
>>> 12 +	 * if any.  To prune hash partitions, we must have found IS NULL
>>> 13 +	 * clauses for all partition keys.
>>> 14 	 */
>>>
>>> I am not sure that I have understood this -- no such restriction is
>>> required to prune the hash partitions, if I am not missing anything.
>>
>> Maybe it's not very clear, but these are the original comments, which I
>> retained; I just moved them out of the (!generate_opsteps) condition.
>>
>> To explain this comment, consider the example below:
>>
>> create table hp (a int, b text) partition by hash (a, b);
>> create table hp0 partition of hp for values with (modulus 4, remainder 0);
>> create table hp3 partition of hp for values with (modulus 4, remainder 3);
>> create table hp1 partition of hp for values with (modulus 4, remainder 1);
>> create table hp2 partition of hp for values with (modulus 4, remainder 2);
>>
>> postgres=# insert into hp values (1, null);
>> INSERT 0 1
>> postgres=# insert into hp values (2, null);
>> INSERT 0 1
>> postgres=# select tableoid::regclass, * from hp;
>>  tableoid | a | b
>> ----------+---+---
>>  hp1      | 1 |
>>  hp2      | 2 |
>> (2 rows)
>>
>> Now, if we query based on "b is null", then we cannot decide which
>> partition should be pruned, whereas in the case of the other schemes
>> such a row will go to the default partition, so we can prune all other
>> partitions.
>
> That's right.  By generating a pruning step with only nullkeys set, we are
> effectively discarding OpExprs that may have been found for some partition
> keys.
> That's fine for list/range partitioning, because nulls can only be
> found in a designated partition, so it's okay to prune all other
> partitions, and for that it's enough to generate the pruning step like
> that.  For hash partitioning, nulls could be contained in any partition,
> so it's not okay to discard OpExpr's like that.  We can generate pruning
> steps with a combination of null and non-null keys in the hash
> partitioning case if there are any OpExprs.
>
> I think your fix is correct.  I slightly modified it along with updating
> nearby comments and added regression tests.

I updated the regression tests to reduce the number of lines; there is no
point in repeating tests the way the v2 patch did.

Thanks,
Amit

diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..d9612bf65b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,42 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 	}

 	/*
-	 * If generate_opsteps is set to false it means no OpExprs were directly
-	 * present in the input list.
+	 * For list and range partitioning, null partition keys can only be found
+	 * in one designated partition, so if there are IS NULL clauses containing
+	 * partition keys we should generate a pruning step that gets rid of all
+	 * partitions but the special null-accepting partitions.  For range
+	 * partitioning, that means we will end up disregarding OpExpr's that may
+	 * have been found for some other keys, but that's fine, because it is not
+	 * possible to prune range partitions with a combination of null and
+	 * non-null keys.
+	 *
+	 * For hash partitioning however, it is possible to combine null and non-
+	 * null keys in a pruning step, so do this only if *all* partition keys
+	 * are involved in IS NULL clauses.
 	 */
-	if (!generate_opsteps)
+	if (!bms_is_empty(nullkeys) &&
+		(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+		 bms_num_members(nullkeys) == part_scheme->partnatts))
+	{
+		PartitionPruneStep *step;
+
+		step = gen_prune_step_op(context, InvalidStrategy,
+								 false, NIL, NIL, nullkeys);
+		result = lappend(result, step);
+	}
+	else if (!generate_opsteps && !bms_is_empty(notnullkeys))
 	{
 		/*
-		 * Generate one prune step for the information derived from IS NULL,
-		 * if any.  To prune hash partitions, we must have found IS NULL
-		 * clauses for all partition keys.
+		 * There are no OpExpr's, but there are IS NOT NULL clauses, which
+		 * can be used to eliminate the null-partition-key-only partition.
 		 */
-		if (!bms_is_empty(nullkeys) &&
-			(part_scheme->strategy !=
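The restriction being discussed -- hash partitions can be pruned on IS NULL clauses only when *all* partition keys are covered -- can be illustrated with a toy model of hash partition routing. This is purely an illustration, not PostgreSQL's actual hashing: the per-key hash and the combine step below are made-up assumptions; the one property borrowed from PostgreSQL is that a null key contributes a fixed value to the hash.

```python
def hash_key(value):
    # Toy per-key hash: a NULL key contributes a fixed value (0) to the
    # partition hash, loosely mimicking how PostgreSQL skips null keys.
    return 0 if value is None else value

def partition_for(row, modulus=4):
    # Combine the per-key hashes of all partition keys, then take the
    # remainder to pick a partition.
    h = 0
    for key in row:
        h = h * 31 + hash_key(key)
    return h % modulus

# "b is null" alone does not pin down a partition: the combined hash
# still depends on a, so (1, NULL) and (2, NULL) can land anywhere.
print(partition_for((1, None)))     # 31 % 4 = 3
print(partition_for((2, None)))     # 62 % 4 = 2

# But if *every* key is covered by an IS NULL clause, the hash is fully
# determined and a single partition survives pruning.
print(partition_for((None, None)))  # 0
```

Under list or range partitioning, by contrast, all null-keyed rows go to one designated partition, which is why a nullkeys-only pruning step is sufficient there.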
Re: Add SKIP LOCKED to VACUUM and ANALYZE
On Wed, Jun 13, 2018 at 08:29:12PM +0000, Bossart, Nathan wrote:
> Previous thread:
> https://postgr.es/m/4BC0F3CD-F4B5-4F23-AADB-80607F9E4B4E%40amazon.com
>
> This is a new thread for tracking the work to add SKIP LOCKED to
> VACUUM and ANALYZE.  I've attached a rebased set of patches.

I am beginning to look at this stuff, and have committed patches 4 and 5
along the way, as they make sense independently.  I am still reviewing the
rest, which applies with some fuzz, and the tests proposed still pass.
--
Michael
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
Thanks Ashutosh for reporting and Dilip for the analysis and the patch.

On 2018/07/11 21:39, Dilip Kumar wrote:
> On Wed, Jul 11, 2018 at 5:36 PM, amul sul wrote:
>> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar wrote:
>
>> I am not sure that I have understood the following comments:
>>
>> 11 +	 * Generate one prune step for the information derived from IS NULL,
>> 12 +	 * if any.  To prune hash partitions, we must have found IS NULL
>> 13 +	 * clauses for all partition keys.
>> 14 	 */
>>
>> I am not sure that I have understood this -- no such restriction is
>> required to prune the hash partitions, if I am not missing anything.
>
> Maybe it's not very clear, but these are the original comments, which I
> retained; I just moved them out of the (!generate_opsteps) condition.
>
> To explain this comment, consider the example below:
>
> create table hp (a int, b text) partition by hash (a, b);
> create table hp0 partition of hp for values with (modulus 4, remainder 0);
> create table hp3 partition of hp for values with (modulus 4, remainder 3);
> create table hp1 partition of hp for values with (modulus 4, remainder 1);
> create table hp2 partition of hp for values with (modulus 4, remainder 2);
>
> postgres=# insert into hp values (1, null);
> INSERT 0 1
> postgres=# insert into hp values (2, null);
> INSERT 0 1
> postgres=# select tableoid::regclass, * from hp;
>  tableoid | a | b
> ----------+---+---
>  hp1      | 1 |
>  hp2      | 2 |
> (2 rows)
>
> Now, if we query based on "b is null", then we cannot decide which
> partition should be pruned, whereas in the case of the other schemes
> such a row will go to the default partition, so we can prune all other
> partitions.

That's right.  By generating a pruning step with only nullkeys set, we are
effectively discarding OpExprs that may have been found for some partition
keys.
That's fine for list/range partitioning, because nulls can only be found
in a designated partition, so it's okay to prune all other partitions, and
for that it's enough to generate the pruning step like that.  For hash
partitioning, nulls could be contained in any partition, so it's not okay
to discard OpExpr's like that.  We can generate pruning steps with a
combination of null and non-null keys in the hash partitioning case if
there are any OpExprs.

I think your fix is correct.  I slightly modified it along with updating
nearby comments and added regression tests.

Thanks,
Amit

diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..d9612bf65b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,42 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 	}

 	/*
-	 * If generate_opsteps is set to false it means no OpExprs were directly
-	 * present in the input list.
+	 * For list and range partitioning, null partition keys can only be found
+	 * in one designated partition, so if there are IS NULL clauses containing
+	 * partition keys we should generate a pruning step that gets rid of all
+	 * partitions but the special null-accepting partitions.  For range
+	 * partitioning, that means we will end up disregarding OpExpr's that may
+	 * have been found for some other keys, but that's fine, because it is not
+	 * possible to prune range partitions with a combination of null and
+	 * non-null keys.
+	 *
+	 * For hash partitioning however, it is possible to combine null and non-
+	 * null keys in a pruning step, so do this only if *all* partition keys
+	 * are involved in IS NULL clauses.
 	 */
-	if (!generate_opsteps)
+	if (!bms_is_empty(nullkeys) &&
+		(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+		 bms_num_members(nullkeys) == part_scheme->partnatts))
+	{
+		PartitionPruneStep *step;
+
+		step = gen_prune_step_op(context, InvalidStrategy,
+								 false, NIL, NIL, nullkeys);
+		result = lappend(result, step);
+	}
+	else if (!generate_opsteps && !bms_is_empty(notnullkeys))
 	{
 		/*
-		 * Generate one prune step for the information derived from IS NULL,
-		 * if any.  To prune hash partitions, we must have found IS NULL
-		 * clauses for all partition keys.
+		 * There are no OpExpr's, but there are IS NOT NULL clauses, which
+		 * can be used to eliminate the null-partition-key-only partition.
 		 */
-		if (!bms_is_empty(nullkeys) &&
-			(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
-			 bms_num_members(nullkeys) == part_scheme->partnatts))
-		{
-			PartitionPruneStep *step;
+
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Thu, Jul 12, 2018 at 9:02 AM, Etsuro Fujita wrote:
> (2018/07/11 20:02), Ashutosh Bapat wrote:
>> On Wed, Jul 11, 2018 at 1:23 PM, Etsuro Fujita wrote:
>>> Actually, even if we could create such an index on the child table and
>>> the targetlist had the ConvertRowtypeExpr, the planner would still not
>>> be able to use an index-only scan with that index, because
>>> check_index_only would not consider an index-only scan possible for
>>> that index: it is an expression index, and that function currently
>>> does not consider that index expressions can be returned back in an
>>> index-only scan.  That behavior of the planner might be improved in
>>> future, though.
>
>> Right, and when we do so, not having ConvertRowtypeExpr in the
>> targetlist will be a problem.
>
> Yeah, but I don't think that that's unsolvable; because in that case the
> CRE as an index expression could be converted back to the whole-row Var's
> rowtype by adding another CRE to the index expression for that
> conversion, I suspect that that special handling could allow us to
> support an index-only scan even when having the whole-row Var instead of
> the CRE in the targetlist.

I am not able to understand this.  Can you please provide an example?

> (Having said that, I'm not 100% sure we need to solve that problem when
> we improve the planner, because there doesn't seem to me to be enough
> use-case to justify making the code complicated for that.)  Anyway, I
> think that that would be a matter of future versions of PG.

I am afraid that a fix which works only until we change the other parts of
the system will need to be replaced with a different fix at that point.
If that time is long enough away, that's OK.  In this case, I agree that
if we haven't heard from users so far that they need to create indexes on
whole-row expressions, there's a chance that we will never implement this
feature.
>>>>> At places in the planner we match equivalence members to the
>>>>> targetlist entries.  This matching will fail unexpectedly when
>>>>> ConvertRowtypeExpr is removed from a child's targetlist.  But again
>>>>> I couldn't reproduce a problem when such a mismatch arises.
>>>
>>> IIUC, I don't think the planner assumes that for an equivalence member
>>> there is a matching entry for that member in the targetlist; what I
>>> think the planner assumes is that an equivalence member is able to be
>>> computed from expressions in the targetlist.
>>
>> This is true.  However,
>>
>>> So, I think it is safe to have whole-row Vars instead of
>>> ConvertRowtypeExprs in the targetlist.
>>
>> when it's looking for an expression, it finds a whole-row expression,
>> so it thinks it needs to add a ConvertRowtypeExpr on top of that.  But
>> when the plan is created, there is a ConvertRowtypeExpr already, and
>> there is no way to know that a new ConvertRowtypeExpr is not needed
>> anymore.  So, we may end up with two ConvertRowtypeExprs, giving wrong
>> results.
>
> Maybe I'm missing something, but I don't think that we need to worry
> about that, because in the approach I proposed, we only add CREs above
> whole-row Vars in the targetlists for subplans of an Append/MergeAppend
> for a partitioned relation at plan creation time.

There's a patch in an adjacent thread started by David Rowley to rip out
the Append/MergeAppend when there is only one subplan, so your solution
won't work there.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
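The double-conversion worry in this exchange can be sketched with a toy model of what a ConvertRowtypeExpr does: it rearranges a child tuple's columns into the parent's column order. Applying that rearrangement twice -- once because the expression was already in the targetlist, and once more because the planner saw only a whole-row Var and added its own -- no longer yields a parent-order tuple. The mapping-based helper below is purely illustrative, not PostgreSQL's representation.

```python
def convert_rowtype(row, mapping):
    # mapping[i] = index in the input row that feeds output column i,
    # a toy stand-in for a ConvertRowtypeExpr's column translation.
    return tuple(row[j] for j in mapping)

child_row = ("b-val", "a-val")  # the child table stores columns as (b, a)
to_parent = (1, 0)              # the parent's column order is (a, b)

once = convert_rowtype(child_row, to_parent)
twice = convert_rowtype(once, to_parent)  # a second, redundant CRE

print(once)   # ('a-val', 'b-val') -- correct parent-order tuple
print(twice)  # ('b-val', 'a-val') -- double conversion scrambles it again
```

This is the sense in which two stacked ConvertRowtypeExprs would give wrong results, unless plan creation can tell that the conversion has already happened.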
Re: Negotiating the SCRAM channel binding type
On Wed, Jul 11, 2018 at 03:01:03PM +0300, Heikki Linnakangas wrote:
> That would be more complicated to parse.  Yeah, we might need further
> options for some SASL mechanisms in the future, but we can cross that
> bridge when we get there.  I don't see any need to complicate this case
> for that.

Okay, fine for me.

> I started digging into this more closely, and ran into a little problem.
> If channel binding is not used, the client sends a flag to the server to
> indicate if it's because the client doesn't support channel binding, or
> because it thought that the server doesn't support it.  This is the
> gs2-cbind-flag.  How should the flag be set, if the server supports
> channel binding type A, while the client supports only type B?  The
> purpose of the flag is to detect downgrade attacks, where a
> man-in-the-middle removes the PLUS variants from the list of supported
> mechanisms.  If we treat incompatible channel binding types as "client
> doesn't support channel binding", then the downgrade attack becomes
> possible (the attacker can replace the legit PLUS variants with bogus
> channel binding types that the client surely doesn't support).  If we
> treat it as "server doesn't support channel binding", then we won't
> downgrade to the non-channel-binding variant in the legitimate case that
> the client and server both support channel binding, but with
> incompatible types.
>
> What we'd really want is to include the full list of the server's
> supported mechanisms as part of the exchange, not just a boolean "y/n"
> flag.  But that's not what the spec says :-(.

Let's not break the spec :)  I understand from it that the client is in
charge of the choice, so we are rather free to choose the reaction the
client should have.  In the second phase of the exchange, the client
communicates back to the server the channel binding it has decided to
choose; it is not up to the server to pick one if the client thinks that
it can use multiple ones.
> I guess we should err on the side of caution and fail the authentication
> in that case.  That's unfortunate, but it's better than not negotiating
> at all.  At least with the negotiation, the authentication will work if
> there is any mutually supported channel binding type.

I think that in this case the client should throw an error unconditionally
if it wants to use channel binding type A but the server supports only B.
It is always possible for the client to adjust the channel binding type it
wants to use with the connection parameter scram_channel_binding, or to
recompile.  If there is an incompatibility between channel binding types,
it would actually mean that the client and the server are not compiled
with the same SSL implementation -- would that actually work?  Say a
server has only tls-unique with GNU's library, and the client uses JDBC,
which only has tls-server-end-point..

So, to keep things simple, it seems to me that we should just make the
server send the list, and then check at client level if the list sent by
the server includes what the client wants to use, complaining if the
option is not present.
--
Michael
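The client-side decision being proposed here -- use the configured channel binding type if the server's advertised list supports it, and fail outright rather than silently downgrade -- can be sketched as follows. This is a toy model: the function, exception, and the idea of the server advertising binding types alongside its SASL mechanisms are illustrative assumptions, not libpq's actual API.

```python
class DowngradeError(Exception):
    """Raised instead of silently falling back to a weaker exchange."""

def choose_sasl_mechanism(server_mechs, server_bindings, ssl_in_use,
                          wanted_binding):
    # wanted_binding models the scram_channel_binding connection
    # parameter, e.g. "tls-unique"; None means binding is disabled.
    if ssl_in_use and wanted_binding is not None:
        if "SCRAM-SHA-256-PLUS" not in server_mechs:
            # Could be a man-in-the-middle stripping the PLUS variant:
            # complain instead of quietly downgrading.
            raise DowngradeError("server did not offer SCRAM-SHA-256-PLUS")
        if wanted_binding not in server_bindings:
            # Incompatible binding types on each side: fail, per the
            # reasoning above, rather than guess.
            raise DowngradeError("server lacks channel binding type "
                                 + wanted_binding)
        return "SCRAM-SHA-256-PLUS", wanted_binding
    return "SCRAM-SHA-256", None
```

For example, a client wanting tls-unique against a server that only advertises tls-server-end-point would error out instead of proceeding without binding.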
Re: How can we submit code patches that implement our (pending) patents?
On Thu, Jul 12, 2018 at 09:33:21AM +0800, Craig Ringer wrote:
> On 12 July 2018 at 09:10, Tsunakawa, Takayuki
> <tsunakawa.ta...@jp.fujitsu.com> wrote:
> > From: Nico Williams [mailto:n...@cryptonector.com]
> > > On Thu, Jul 12, 2018 at 12:29:12AM +0000, Tsunakawa, Takayuki wrote:
> > > > How can one make defensive use of his patent if he allows everyone
> > > > to use it royalty-free?  Can he use his patent for cross-licensing
> > > > negotiation if some commercial database vendor accuses his company
> > > > that PostgreSQL unintentionally infringes that vendor's patent?
> > >
> > > https://en.wikipedia.org/wiki/Defensive_termination
> >
> > Thank you, this looks reasonable to give everyone the grant.  Then, I
> > wonder if the community can accept patented code if the patent holder
> > grants this kind of license.
>
> Personally, I'd be in favour, but the core team has spoken here.  I'm
> not privy to the discussion but the outcome seems firm.

Has the issue been considered in enough detail?

A thorough treatment of the issue would probably mean that PG should have
a contributor agreement, or at the very least the license that PG uses
should include generous royalty-free patent grants so as to force
contributors to make them.  Of course, any time you add contributor
agreements you add friction to the contribution process, and at least
one-time legal costs for the core.

Also, what about patents that are placed in the public domain?  Surely
code implementing those would not be rejected for involving patents...

I'm not sure that the widest grant with no-lawsuit defensive clauses
should be accepted, but it's not that far from the public domain case.
Even here there are legal costs: at least a one-time cost of hiring an IP
lawyer to write up patent grant language that the PG community would
insist on.  It could be well worth it.

Nico
--
Re: Cannot dump foreign key constraints on partitioned table
On Wed, Jul 11, 2018 at 03:49:59PM +0530, amul sul wrote:
> On the master head, the getConstraints() function skips FK constraints
> for a partitioned table because tbinfo->hastriggers is false.
>
> While creating FK constraints on the partitioned table, the FK triggers
> are only created on leaf partitions and skipped for the partitioned
> tables.

Oops.  Good catch.

> To fix this, either bypass the aforementioned condition of
> getConstraints() or set pg_class.relhastriggers to true for the
> partitioned table, which doesn't seem to be the right solution, IMO.
> Thoughts?

Changing pg_class.relhastriggers is out of scope, because as far as I know
partitioned tables have no triggers, so the current value is correct, and
that would be a catalog change at this stage which would cause any
existing deployments of v11 to complain about the inconsistency.  I think
that this should be fixed client-side, as the attached does.  I have just
stolen this SQL set from Alvaro to check the consistency of the dumps
created:

create table prim (a int primary key);
create table partfk (a int references prim) partition by range (a);
create table partfk1 partition of partfk for values from (0) to (100);
create table partfk2 partition of partfk for values from (100) to (200);

Thoughts?
--
Michael

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 463639208d..74a1270169 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -7131,7 +7131,12 @@ getConstraints(Archive *fout, TableInfo tblinfo[], int numTables)
 	{
 		TableInfo  *tbinfo = &tblinfo[i];

-		if (!tbinfo->hastriggers ||
+		/*
+		 * For partitioned tables, foreign keys have no triggers so they
+		 * must be included anyway in case some foreign keys are defined.
+		 */
+		if ((!tbinfo->hastriggers &&
+			 tbinfo->relkind != RELKIND_PARTITIONED_TABLE) ||
 			!(tbinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 			continue;
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
(2018/07/11 20:02), Ashutosh Bapat wrote:
> On Wed, Jul 11, 2018 at 1:23 PM, Etsuro Fujita wrote:
>> Actually, even if we could create such an index on the child table and
>> the targetlist had the ConvertRowtypeExpr, the planner would still not
>> be able to use an index-only scan with that index, because
>> check_index_only would not consider an index-only scan possible for
>> that index: it is an expression index, and that function currently
>> does not consider that index expressions can be returned back in an
>> index-only scan.  That behavior of the planner might be improved in
>> future, though.
>
> Right, and when we do so, not having ConvertRowtypeExpr in the
> targetlist will be a problem.

Yeah, but I don't think that that's unsolvable; because in that case the
CRE as an index expression could be converted back to the whole-row Var's
rowtype by adding another CRE to the index expression for that conversion,
I suspect that that special handling could allow us to support an
index-only scan even when having the whole-row Var instead of the CRE in
the targetlist.  (Having said that, I'm not 100% sure we need to solve
that problem when we improve the planner, because there doesn't seem to
me to be enough use-case to justify making the code complicated for
that.)  Anyway, I think that that would be a matter of future versions of
PG.

>>> At places in the planner we match equivalence members to the
>>> targetlist entries.  This matching will fail unexpectedly when
>>> ConvertRowtypeExpr is removed from a child's targetlist.  But again I
>>> couldn't reproduce a problem when such a mismatch arises.
>>
>> IIUC, I don't think the planner assumes that for an equivalence member
>> there is a matching entry for that member in the targetlist; what I
>> think the planner assumes is that an equivalence member is able to be
>> computed from expressions in the targetlist.
>
> This is true.  However,
>
>> So, I think it is safe to have whole-row Vars instead of
>> ConvertRowtypeExprs in the targetlist.
> when it's looking for an expression, it finds a whole-row expression,
> so it thinks it needs to add a ConvertRowtypeExpr on top of that.  But
> when the plan is created, there is a ConvertRowtypeExpr already, and
> there is no way to know that a new ConvertRowtypeExpr is not needed
> anymore.  So, we may end up with two ConvertRowtypeExprs, giving wrong
> results.

Maybe I'm missing something, but I don't think that we need to worry
about that, because in the approach I proposed, we only add CREs above
whole-row Vars in the targetlists for subplans of an Append/MergeAppend
for a partitioned relation at plan creation time.

Best regards,
Etsuro Fujita
Re: How can we submit code patches that implement our (pending) patents?
On Thu, Jul 12, 2018 at 01:10:33AM +0000, Tsunakawa, Takayuki wrote:
> From: Nico Williams [mailto:n...@cryptonector.com]
> > On Thu, Jul 12, 2018 at 12:29:12AM +0000, Tsunakawa, Takayuki wrote:
> > > How can one make defensive use of his patent if he allows everyone
> > > to use it royalty-free?  Can he use his patent for cross-licensing
> > > negotiation if some commercial database vendor accuses his company
> > > that PostgreSQL unintentionally infringes that vendor's patent?
> >
> > https://en.wikipedia.org/wiki/Defensive_termination
>
> Thank you, this looks reasonable to give everyone the grant.  Then, I
> wonder if the community can accept patented code if the patent holder
> grants this kind of license.

I'm not a core member, but personally I'd be inclined to accept a
royalty-free, non-exclusive, non-expiring, transferable, ..., grant where
the only condition is to not sue the patent holder.

I don't object to patents in general, but I think patent lifetimes are
way too long for software (because they are not commensurate with the
costs of developing software), and I understand that often one must
obtain patents for *defensive* purposes because there are sharks out
there.  So I'm inclined to accept patents with defensive but otherwise
very wide grants.

Nico
--
Re: TRUNCATE tables referenced by FKs on partitioned tables
On Wed, Jul 11, 2018 at 06:59:16PM -0400, Alvaro Herrera wrote:
> Anyway, this patch seems to fix it, and adds what I think is appropriate
> test coverage.

This looks good to me.  I am noticing that the documentation of TRUNCATE
does not mention that running the command on a partitioned table
automatically recurses to the children, even if CASCADE is not used and
each child partition is not listed.

What is the filler column added in truncpart used for?
--
Michael
bgworkers should share more of BackendRun / PostgresMain
Hi folks,

As I work on background workers almost exclusively at the moment, I keep
running into things that are just not initialized, or not supported, in
bgworkers.  Many are quite surprising.  I think bgworkers should just
share more of the main Pg startup path.

My latest surprise is that we don't re-initialize the RNG for bgworkers,
because it's done in BackendRun before calling PostgresMain, and the
bgworker code path diverges before we call BackendRun.

bgworkers also have to do all their own error handling, duplicating lots
of logic in PostgresMain, if they want to have any error recovery except
"die and restart".  bgworkers don't participate in the ProcSignal
mechanism, get query cancels, etc., unless they do extra setup.  There
are all sorts of other small surprises, like the way we don't set up the
dbname in shmem, so logs for bgworkers show it as empty.

There's no patch here because I'm not yet sure what the best approach is.
Some things seem clear, like moving the RNG init work from BackendRun a
bit earlier so all backends hit it.  Other parts are much less so.  Look
at the number of calls to InitProcess() from our profusion of custom
worker types, user backends, the bgworker infrastructure, etc.

I could just copy stuff from BackendRun / PostgresMain to
StartBackgroundWorker(), but I'm sure that's not the right approach.  I
think we need to decide what should be common and factor it out a bit.

So discussion/ideas appreciated.
--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: _isnan() on Windows
On Wed, Jul 11, 2018 at 09:13:40AM -0400, Alvaro Herrera wrote:
> I just pushed it before seeing your message.

Fine as well, thanks for picking this up.  The buildfarm shows no
failures related to this patch.
--
Michael
Re: Failure assertion in GROUPS mode of window function in current HEAD
On Wed, Jul 11, 2018 at 4:21 AM, Tom Lane wrote:
> Masahiko Sawada writes:
>> BTW, I found another, related issue.  We can get an assertion failure
>> also with the following query.
>>
>> =# select sum(c) over (partition by c order by c groups between 1
>> preceding and current row) from test;
>
> Oh, good point, that's certainly legal per spec, so we'd need to do
> something about it.
>
> The proximate cause of the problem, I think, is that the planner throws
> away the "order by c" as being redundant because it already sorted by c
> for the partitioning requirement.  So we end up with ordNumCols = 0
> even though the query had ORDER BY.

Yeah, I think so too.

> This breaks not only GROUPS cases, but also RANGE OFFSET cases, because
> the executor expects to have an ordering column.  I thought for a bit
> about fixing that by forcing the planner to emit the ordering column for
> RANGE OFFSET even if it's redundant.  But it's hard to make the existing
> logic in get_column_info_for_window do that --- it can't tell which
> partitioning column the ordering column was redundant with, and even if
> it could, that column might've been of a different data type.  So
> eventually I just threw away that redundant-key-elimination logic
> altogether.  I believe that we never thought it was really useful as an
> optimization, but way back when window functions were put in, we didn't
> have (or at least didn't think about) a way to identify the
> partitioning/ordering columns without reference to the input pathkeys.

Agreed.

> With this patch, WindowAggPath.winpathkeys is not used for anything
> anymore.  I'm inclined to get rid of it, though I didn't do so here.
> (If we keep it, we at least need to adjust the comment in relation.h
> that claims createplan.c needs it.)

+1 for removing.

> The other issue here is that nodeWindowAgg's setup logic is not quite
> consistent with update_frameheadpos and update_frametailpos about when
> to create tuplestore read pointers and slots.
> We might've prevented all the inconsistent cases with the parser and
> planner fixes, but it seems best to make them really consistent anyway,
> so I changed that.
>
> Draft patch attached.

Thank you for committing!  I think we can also update the documentation
to say that the GROUPS offset mode requires an ORDER BY clause.
Thoughts?  The attached patch (update_window_syntax_doc.patch) updates
it.

Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
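For readers less familiar with GROUPS mode: the offsets count peer groups of the ORDER BY value, which is why the mode is meaningless without an ORDER BY clause. A toy model (plain Python, not the executor's tuplestore machinery) of `sum(...) OVER (ORDER BY c GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW)` over a single partition might look like this:

```python
from itertools import groupby

def groups_frame_sums(ordered_values):
    # Split the ordered input into peer groups (runs of equal values);
    # without an ORDER BY there would be no peer groups to offset over.
    groups = [list(g) for _, g in groupby(ordered_values)]
    sums = []
    for i, g in enumerate(groups):
        # Frame: one peer group preceding through the end of the current
        # row's peer group (CURRENT ROW in GROUPS mode includes peers).
        frame = sum(sum(x) for x in groups[max(i - 1, 0):i + 1])
        sums.extend([frame] * len(g))
    return sums

print(groups_frame_sums([1, 1, 2, 3, 3]))  # [2, 2, 4, 8, 8]
```

Every row in a peer group gets the same frame sum, which is exactly the behavior that falls apart when the planner drops the ordering column and ordNumCols ends up 0.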
Re: [HACKERS] Replication status in logical replication
On Thu, Jul 12, 2018 at 10:22 AM, Michael Paquier wrote: > On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote: >> Thanks. If there are no objections, then I will try to wrap this stuff >> on Thursday my time. > > And done down to 9.4. Thank you! Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
pread() and pwrite()
Hello hackers,

A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
$SUBJECT.  Last year, Tobias Oberstein argued again that we should do
that[2].  At the end of that thread there was a +1 from multiple
committers in support of getting it done for PostgreSQL 12.  Since Oskari
hasn't returned, I decided to dust off his patch and start a new thread.

Here is a rebase and an initial review.  I plan to address the problems
myself, unless Oskari wants to do that, in which case I'll happily review
and hopefully eventually commit it.  It's possible I introduced new bugs
while rebasing since basically everything moved around, but it passes
check-world here with and without HAVE_PREAD and HAVE_PWRITE defined.

Let me summarise what's going on in the patch:

1.  FileRead() and FileWrite() are replaced with FileReadAt() and
FileWriteAt(), taking a new offset argument.  Now we can do one syscall
instead of two for random reads and writes.

2.  fd.c no longer tracks the seek position, so we lose a bunch of cruft
from the vfd machinery.  FileSeek() now only supports SEEK_SET and
SEEK_END.

This is taking the position that we no longer want to be able to do file
IO with implicit positioning at all.  I think that seems reasonable: it's
nice to drop all that position tracking and caching code, and the
seek-based approach would be incompatible with any multithreading plans.
I'm not even sure I'd bother adding "At" to the function names.  If there
are any extensions that want the old interface, they will fail to compile
either way.  Note that the BufFile interface remains the same, so
hopefully that covers many use cases.

I guess the only remaining reason to use FileSeek() is to get the file
size?  So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?
 	pgstat_report_wait_start(wait_event_info);
+#ifdef HAVE_PREAD
+	returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+	lseek(VfdCache[file].fd, offset, SEEK_SET);
 	returnCode = read(vfdP->fd, buffer, amount);
+#endif
 	pgstat_report_wait_end();

This obviously lacks error handling for lseek().  I wonder if anyone
would want separate wait_event tracking for the lseek() (though this
codepath would be used by almost nobody, especially if someone adds
Windows support, so it's probably not worth bothering with).

I suppose we could assume that if you have pread() you must also have
pwrite() and save on ./configure cycles.

I will add this to the next Festival of Commits, though clearly it needs
a bit more work before the festivities begin.

[1] https://www.postgresql.org/message-id/flat/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
[2] https://www.postgresql.org/message-id/flat/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
--
Thomas Munro
http://www.enterprisedb.com

0001-Use-pread-pwrite-instead-of-lseek-read-write-v1.patch
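The key semantic property being relied on -- positioned I/O that neither consults nor moves the kernel file offset, which is what lets fd.c drop its seek-position tracking -- can be demonstrated on a POSIX system through Python's thin wrappers over the same syscalls. This is a toy illustration, not the patch itself (os.pread/os.pwrite are not available on Windows):

```python
import os
import tempfile

# Create a scratch file with known contents.
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"0123456789", 0)  # one syscall, explicit offset

    # pread reads at an explicit offset in a single syscall...
    data = os.pread(fd, 4, 6)

    # ...and, unlike lseek()+read(), leaves the file position untouched,
    # so nobody has to cache or restore a seek position afterwards.
    pos = os.lseek(fd, 0, os.SEEK_CUR)

    print(data)  # b'6789'
    print(pos)   # 0
finally:
    os.close(fd)
    os.unlink(path)
```

The replaced lseek()+read() pair, by contrast, both costs an extra syscall and mutates shared per-descriptor state, which is exactly the cruft the patch removes from the vfd machinery.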
Re: Preferring index-only-scan when the cost is equal
On Wed, 11 Jul 2018 14:37:46 +0200 Tomas Vondra wrote: > > On 07/11/2018 01:28 PM, Ashutosh Bapat wrote: > > I don't think we should change add_path() for this. We will > > unnecessarily check that condition even for the cases where we do not > > create index paths. I think we should fix the caller of add_path() > > instead to add index only path before any index paths. For that the > > index list needs to be sorted by the possibility of using index only > > scan. > > > > But I think in your case, it might be better to first check whether > > there is any costing error because of which index only scan's path has > > the same cost as index scan path. Also I don't see any testcase which > > will show why index only scan would be more efficient in your case. > > May be provide output of EXPLAIN ANALYZE. > > > > I suspect this only happens due to testing on empty tables. Not only is > testing of indexes on small tables rather pointless in general, but more > importantly there will be no statistics. So we fall back to some default > estimates, but we also don't have relallvisible etc which is crucial for > estimating index-only scans. I'd bet that's why the cost estimates for > index scans and index-only scans are the same here. You are right. When the table has rows and is vacuumed, the index-only scan's cost is cheaper and it is chosen properly. Sorry, I jumped to a conclusion before confirming this. Thanks, -- Yugo Nagata
Re: How can we submit code patches that implement our (pending) patents?
On 12 July 2018 at 09:10, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > From: Nico Williams [mailto:n...@cryptonector.com] > > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote: > > > How can one make defensive use of his patent if he allows everyone to > > > use it royalty-free? Can he use his patent for cross-licensing > > > negotiation if some commercial database vendor accuses his company > > > that PostgreSQL unintentionally infringes that vendor's patent? > > > > https://en.wikipedia.org/wiki/Defensive_termination > > Thank you, this looks reasonable to give everyone the grant. Then, I > wonder if the community can accept patented code if the patent holder > grants this kind of license. Personally, I'd be in favour, but the core team has spoken here. I'm not privy to the discussion but the outcome seems firm. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)
On 12 July 2018 at 12:19, Haribabu Kommi wrote: > Yes, I agree that we may need a new counter that counts the buffers that > are just allocated (no read or no write). But currently, maybe the counter > value is very small, so people are not interested. The existing counters don't show up in EXPLAIN (ANALYZE, BUFFERS) when they're zero. The new counter would surely work the same way, so users wouldn't be burdened by additional explain output when they're not affected by it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Replication status in logical replication
On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote: > Thanks. If there are no objections, then I will try to wrap this stuff > on Thursday my time. And done down to 9.4. -- Michael signature.asc Description: PGP signature
Re: Allow to specify a index name as ANALYZE parameter
On Wed, 11 Jul 2018 14:26:03 +0300 Alexander Korotkov wrote: > On Wed, Jul 11, 2018 at 12:04 PM Yugo Nagata wrote: > > When we specify column names for ANALYZE, only the statistics for those > > columns are collected. Similarly, would it be useful to have an option to specify an > > index for ANALYZE, to collect only the statistics for the expressions in the > > specified index? > > > > A use case I suppose is when a new expression index is created and we need only the > > statistics for the new index. Although ANALYZE of all the indexes is usually fast > > because ANALYZE uses a random sampling of the table rows, ANALYZE on a specific > > index may still be useful if there are other indexes whose "statistics target" is > > large and/or whose expressions take time to compute, for example. > > > > Attached is a WIP patch to allow specifying an index name as an ANALYZE parameter. > > Documentation is not yet included. I would appreciate any feedback! > > I think this makes sense. Once we can collect statistics individually > for regular columns, we should be able to do the same for indexed > expressions. Please register this patch in the next commitfest. Thank you for your comment! I registered this in CF 2018-09. > Regarding the current implementation, I found the message "ANALYZE option must > be specified when a column list is provided" confusing. > Perhaps you've missed some negation in this message, since in fact > you disallow ANALYZE with a column list. > > However, since a multicolumn index may contain multiple expressions, I > think we should support specifying columns for the ANALYZE index clause. > We could support specifying columns by their numbers in the same way > we did for "ALTER INDEX index ALTER COLUMN colnum SET STATISTICS > number". Makes sense. I'll fix the patch to support specifying columns of an index. Thanks, -- Yugo Nagata
RE: How can we submit code patches that implement our (pending) patents?
From: Nico Williams [mailto:n...@cryptonector.com] > On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote: > > How can one make defensive use of his patent if he allows everyone to > > use it royalty-free? Can he use his patent for cross-licensing > > negotiation if some commercial database vendor accuses his company > > that PostgreSQL unintentionally infringes that vendor's patent? > > https://en.wikipedia.org/wiki/Defensive_termination Thank you, this looks reasonable to give everyone the grant. Then, I wonder if the community can accept patented code if the patent holder grants this kind of license. Regards Takayuki Tsunakawa
Upper limit arguments of pg_logical_slot_xxx_changes functions accept invalid values
Hi, While reading the replication slot code, I found a wrong assignment in the pg_logical_slot_get_changes_guts() function: if (PG_ARGISNULL(2)) upto_nchanges = InvalidXLogRecPtr; else upto_nchanges = PG_GETARG_INT32(2); Since upto_nchanges is an integer value, we should set it to 0 (meaning unlimited) instead of InvalidXLogRecPtr. Since InvalidXLogRecPtr is actually 0, the function has worked fine so far. I was also surprised that these functions accept negative values for upto_nchanges. The upto_lsn argument is the same; it also accepts an invalid LSN. For safety, maybe it's better to reject non-NULL invalid values. That way, the behavior of these functions is consistent with what the documentation says: If upto_lsn is non-NULL, decoding will include only those transactions which commit prior to the specified LSN. If upto_nchanges is non-NULL, decoding will stop when the number of rows produced by decoding exceeds the specified value. The attached patch fixes them. Feedback is very welcome. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center fix_logical_slot_changes_funcs.patch Description: Binary data
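For illustration, here is a hedged C sketch of the argument handling being argued for: NULL upto_nchanges becomes 0 (unlimited), and non-NULL invalid values are rejected. The typedef, function name, and the boolean-return error convention are stand-ins for this sketch, not the real server code, which would use PG_ARGISNULL/PG_GETARG and ereport().

```c
/* Sketch only: toy stand-ins for the real server types and error paths. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Returns false (standing in for ereport(ERROR)) on invalid input. */
bool
check_limit_args(bool nchanges_isnull, int32_t nchanges_arg,
				 bool lsn_isnull, XLogRecPtr lsn_arg,
				 int32_t *upto_nchanges, XLogRecPtr *upto_lsn)
{
	if (nchanges_isnull)
		*upto_nchanges = 0;		/* 0 means unlimited, not InvalidXLogRecPtr */
	else if (nchanges_arg < 0)
		return false;			/* reject negative row limits */
	else
		*upto_nchanges = nchanges_arg;

	if (lsn_isnull)
		*upto_lsn = InvalidXLogRecPtr;	/* NULL: decode to end of WAL */
	else if (lsn_arg == InvalidXLogRecPtr)
		return false;			/* reject a non-NULL invalid LSN */
	else
		*upto_lsn = lsn_arg;

	return true;
}
```

The key point is that the two "no limit" sentinels differ by type: 0 for the row count, InvalidXLogRecPtr for the LSN; they only coincide numerically by accident.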
Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints
Attached is an additional patch, as well as a new, rebased patch. This includes changes responsive to Álvaro Herrera's commentary about the SET CONSTRAINTS manual page. Nico -- >From e7838b60dbf0a8cd7f35591db2f9aab78d8903cb Mon Sep 17 00:00:00 2001 From: Nicolas Williams Date: Wed, 11 Jul 2018 19:53:01 -0500 Subject: [PATCH] Add ALWAYS DEFERRED option for CONSTRAINTs (CR) --- doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/ref/set_constraints.sgml | 10 ++ src/backend/catalog/index.c| 1 - src/backend/catalog/information_schema.sql | 8 +++- src/backend/commands/trigger.c | 1 - src/backend/parser/gram.y | 18 ++ 6 files changed, 20 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 291e6a9..4d42594 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -2245,7 +2245,7 @@ SCRAM-SHA-256$iteration count: Constraint deferral option: a = always deferred, d = deferrable, -d = deferrable initially deferred, +i = deferrable initially deferred, n = not deferrable diff --git a/doc/src/sgml/ref/set_constraints.sgml b/doc/src/sgml/ref/set_constraints.sgml index 671332a..390015e 100644 --- a/doc/src/sgml/ref/set_constraints.sgml +++ b/doc/src/sgml/ref/set_constraints.sgml @@ -34,11 +34,13 @@ SET CONSTRAINTS { ALL | name [, ... - Upon creation, a constraint is given one of three + Upon creation, a constraint is given one of four characteristics: DEFERRABLE INITIALLY DEFERRED, - DEFERRABLE INITIALLY IMMEDIATE, or - NOT DEFERRABLE. The third - class is always IMMEDIATE and is not affected by the + DEFERRABLE INITIALLY IMMEDIATE, + NOT DEFERRABLE, or ALWAYS DEFERRED. + The third + class is always IMMEDIATE, while the fourth class is + always DEFERRED, and neither affected by the SET CONSTRAINTS command. The first two classes start every transaction in the indicated mode, but their behavior can be changed within a transaction by SET CONSTRAINTS. 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 795a7a9..45b52b4 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -1070,7 +1070,6 @@ index_create(Relation heapRelation, recordDependencyOn(, , deptype); } - Assert((flags & INDEX_CREATE_ADD_CONSTRAINT) == 0); } /* Store dependency on parent index, if any */ diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql index bde6199..dd4792a 100644 --- a/src/backend/catalog/information_schema.sql +++ b/src/backend/catalog/information_schema.sql @@ -894,11 +894,9 @@ CREATE VIEW domain_constraints AS CAST(CASE WHEN condeferral = 'n' THEN 'NO' ELSE 'YES' END AS yes_or_no) AS is_deferrable, CAST(CASE WHEN condeferral = 'i' OR condeferral = 'a' THEN 'YES' ELSE 'NO' END - AS yes_or_no) AS initially_deferred - /* - * XXX Can we add is_always_deferred here? Are there - * standards considerations? - */ + AS yes_or_no) AS initially_deferred, + CAST(CASE WHEN condeferral = 'a' THEN 'YES' ELSE 'NO' END + AS yes_or_no) AS always_deferred FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t WHERE rs.oid = con.connamespace AND n.oid = t.typnamespace diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c index 41dc6a4..33b1095 100644 --- a/src/backend/commands/trigger.c +++ b/src/backend/commands/trigger.c @@ -3627,7 +3627,6 @@ typedef struct AfterTriggerSharedData TriggerEvent ats_event; /* event type indicator, see trigger.h */ Oid ats_tgoid; /* the trigger's ID */ Oid ats_relid; /* the relation it's on */ - bool ats_alwaysdeferred; /* whether this can be deferred */ CommandId ats_firing_id; /* ID for firing cycle */ struct AfterTriggersTableData *ats_table; /* transition table access */ } AfterTriggerSharedData; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index dab721a..9aaa2af 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -185,8 +185,8 @@ static 
void SplitColQualList(List *qualList, List **constraintList, CollateClause **collClause, core_yyscan_t yyscanner); static void processCASbits(int cas_bits, int location, const char *constrType, - char *deferral, - bool *not_valid, bool *no_inherit, core_yyscan_t yyscanner); + char *deferral, bool *not_valid, bool *no_inherit, + core_yyscan_t yyscanner); static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %} @@ -5579,13 +5579,13 @@ ConstraintAttributeSpec: (newspec & (CAS_INITIALLY_IMMEDIATE))) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), -
Re: Costing bug in hash join logic for semi joins
On 10 July 2018 at 22:21, David Rowley wrote: > I've done that in the attached. Also on reading the comment above, it > looks slightly incorrect. To me, it looks like it's applying a > twentieth of the cost and not a tenth as the comment claims. I > couldn't resist updating that too. I've added this patch to the September commit fest: https://commitfest.postgresql.org/19/1720/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: patch to allow disable of WAL recycling
On Tue, Jul 10, 2018 at 10:32 PM, Thomas Munro < thomas.mu...@enterprisedb.com> wrote: > On Wed, Jul 11, 2018 at 8:25 AM, Joshua D. Drake > wrote: > > On 07/10/2018 01:15 PM, Jerry Jelinek wrote: > >> > >> Thanks to everyone who took the time to look at the patch and send me > >> feedback. I'm happy to work on improving the documentation of this new > >> tunable to clarify when it should be used and the implications. I'm > trying > >> to understand more specifically what else needs to be done next. To > >> summarize, I think the following general concerns were brought up. > >> > >> For #6, there is no feasible way for us to recreate our workload on > other > >> operating systems or filesystems. Can anyone expand on what performance > data > >> is needed? > > > > I think a simple way to prove this would be to run BenchmarkSQL against > > PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem > > that is COW (zfs) and then run another test where pg_xlog/pg_wal is > patched > > with your patch and new behavior and then run the test again. > BenchmarkSQL > > is a more thorough benchmarking tool that something like pg_bench and is > > very easy to setup. > > I have a lowly but trusty HP Microserver running FreeBSD 11.2 with ZFS > on spinning rust. It occurred to me that such an anaemic machine > might show this effect easily because its cold reads are as slow as a > Lada full of elephants going uphill. Let's see... 
> > # os setup > sysctl vfs.zfs.arc_min=134217728 > sysctl vfs.zfs.arc_max=134217728 > zfs create zoot/data/test > zfs set mountpoint=/data/test zroot/data/test > zfs set compression=off zroot/data/test > zfs set recordsize=8192 zroot/data/test > > # initdb into /data/test/pgdata, then set postgresql.conf up like this: > fsync=off > max_wal_size = 600MB > min_wal_size = 600MB > > # small scale test, we're only interested in producing WAL, not db size > pgbench -i -s 100 postgres > > # do this a few times first, to make sure we have lots of WAL segments > pgbench -M prepared -c 4 -j 4 -T 60 postgres > > # now test... > > With wal_recycle=on I reliably get around 1100TPS and vmstat -w 10 > shows numbers like this: > > procs memory pagedisks faults cpu > r b w avm fre flt re pi pofr sr ad0 ad1 insycs us > sy id > 3 0 3 1.2G 3.1G 4496 0 0 052 76 144 138 607 84107 29713 55 > 17 28 > 4 0 3 1.2G 3.1G 2955 0 0 084 77 134 130 609 82942 34324 61 > 17 22 > 4 0 3 1.2G 3.1G 2327 0 0 0 0 77 114 125 454 83157 29638 68 > 15 18 > 5 0 3 1.2G 3.1G 1966 0 0 082 77 86 81 335 84480 25077 74 > 13 12 > 3 0 3 1.2G 3.1G 1793 0 0 0 533 74 72 68 310 127890 31370 77 > 16 7 > 4 0 3 1.2G 3.1G 1113 0 0 0 151 73 95 94 363 128302 29827 74 > 18 8 > > With wal_recycle=off I reliably get around 1600TPS and vmstat -w 10 > shows numbers like this: > > procs memory pagedisks faults cpu > r b w avm fre flt re pi pofr sr ad0 ad1 insycs us > sy id > 0 0 3 1.2G 3.1G 148 0 0 0 402 71 38 38 153 16668 5656 10 > 3 87 > 5 0 3 1.2G 3.1G 4527 0 0 050 73 28 27 123 123986 23373 68 > 15 17 > 5 0 3 1.2G 3.1G 3036 0 0 0 151 73 47 49 181 148014 29412 83 > 16 0 > 4 0 3 1.2G 3.1G 2063 0 0 0 233 73 56 54 200 143018 28699 81 > 17 2 > 4 0 3 1.2G 3.1G 1202 0 0 095 73 48 49 189 147276 29196 81 > 18 1 > 4 0 3 1.2G 3.1G 732 0 0 0 0 73 56 55 207 146805 29265 82 > 17 1 > > I don't have time to investigate further for now and my knowledge of > ZFS is superficial, but the patch seems to have a clear beneficial > effect, reducing disk 
IOs and page faults on my little storage box. > Obviously this isn't representative of a proper server environment, or > some other OS, but it's a clue. That surprised me... I was quietly > hoping it was going to be 'oh, if you turn off > compression and use 8kb it doesn't happen because the pages line up'. > But nope. > > -- > Thomas Munro > http://www.enterprisedb.com > > Hi Thomas, Thanks for testing! It's validating that you saw the same results. -- Dave
Re: How can we submit code patches that implement our (pending) patents?
On Thu, Jul 12, 2018 at 12:29:12AM +, Tsunakawa, Takayuki wrote: > From: Nico Williams [mailto:n...@cryptonector.com] > > My advice is to write up a patent grant that allows all to use the > > relevant patents royalty-free with a no-lawsuit covenant. I.e., make > > only defensive use of your patents. > > How can one make defensive use of his patent if he allows everyone to > use it royalty-free? Can he use his patent for cross-licensing > negotiation if some commercial database vendor accuses his company > that PostgreSQL unintentionally infringes that vendor's patent? https://en.wikipedia.org/wiki/Defensive_termination
RE: How can we submit code patches that implement our (pending) patents?
From: Nico Williams [mailto:n...@cryptonector.com] > You're proposing to include code that implements patented ideas with a > suitable patent grant. I would be free to not read the patent, but what > if the code or documents mention the relevant patented algorithms? > > If I come across something like this in the PG source code: > > /* The following is covered by patents US#... and so on */ I got it. Your concern makes sense. > My advice is to write up a patent grant that allows all to use the > relevant patents royalty-free with a no-lawsuit covenant. I.e., make > only defensive use of your patents. How can one make defensive use of his patent if he allows everyone to use it royalty-free? Can he use his patent for cross-licensing negotiation if some commercial database vendor accuses his company that PostgreSQL unintentionally infringes that vendor's patent? Regards Takayuki Tsunakawa
Re: patch to allow disable of WAL recycling
Hi, On 2018-07-10 14:15:30 -0600, Jerry Jelinek wrote: > Thanks to everyone who took the time to look at the patch and send me > feedback. I'm happy to work on improving the documentation of this new > tunable to clarify when it should be used and the implications. I'm trying > to understand more specifically what else needs to be done next. To > summarize, I think the following general concerns were brought up. > > 1) Disabling WAL recycling could have a negative performance impact on a > COW filesystem if all WAL files could be kept in the filesystem cache. > For #1, #2 and #3, I don't understand these concerns. It would be helpful > if these could be more specific. We perform more writes (new files are zeroed, which needs to be fsynced), and increase metadata traffic (creation of files), when not recycling. Regards, Andres
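As a rough illustration of the cost difference described here, below is a hypothetical C sketch contrasting the two ways a WAL segment can come into existence. SEG_SIZE and both function names are made up for the sketch (the real segment size is 16MB and the real code does considerably more, e.g. durable renames).

```c
/* Sketch only: illustrative names, tiny segment size. */
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define SEG_SIZE 8192			/* tiny stand-in for the 16MB WAL segment */

/* Non-recycling path: create and zero-fill a brand-new segment.
 * This costs data writes plus an fsync, plus directory metadata traffic. */
int
create_zeroed_segment(const char *path)
{
	char		zeros[1024];
	int			fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	memset(zeros, 0, sizeof(zeros));
	for (int written = 0; written < SEG_SIZE; written += (int) sizeof(zeros))
	{
		if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
		{
			close(fd);
			return -1;
		}
	}
	if (fsync(fd) != 0)			/* the extra fsync cost mentioned above */
	{
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Recycling path: rename an old segment into place -- metadata only,
 * no data blocks written.  (On a COW filesystem the recycled blocks get
 * rewritten anyway, which is the motivation for the proposed tunable.) */
int
recycle_segment(const char *oldpath, const char *newpath)
{
	return rename(oldpath, newpath);
}
```

On a non-COW filesystem the rename path is clearly cheaper; the thread's question is whether skipping recycling wins on COW filesystems, where the zeroed blocks are never rewritten in place.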
Re: patch to allow disable of WAL recycling
On Tue, Jul 10, 2018 at 1:34 PM, Alvaro Herrera wrote: > On 2018-Jul-10, Jerry Jelinek wrote: > > > 2) Disabling WAL recycling reduces reliability, even on COW filesystems. > > I think the problem here is that WAL recycling in normal filesystems > helps protect the case where filesystem gets full. If you remove it, > that protection goes out the window. You can claim that people needs to > make sure to have available disk space, but this does become a problem > in practice. I think the thing to do is verify what happens with > recycling off when the disk gets full; is it possible to recover > afterwards? Is there any corrupt data? What happens if the disk gets > full just as the new WAL file is being created -- is there a Postgres > PANIC or something? As I understand, with recycling on it is easy (?) > to recover, there is no PANIC crash, and no data corruption results. > If the result of hitting ENOSPC when creating or writing to a WAL file was that the database could become corrupted, then wouldn't that risk already be present (a) on any system, for the whole period from database init until the maximum number of WAL files was created, and (b) all the time on any copy-on-write filesystem? Thanks, Dave
RE: How can we submit code patches that implement our (pending) patents?
From: Nico Williams [mailto:n...@cryptonector.com] > On Wed, Jul 11, 2018 at 01:03:44AM +, Tsunakawa, Takayuki wrote: > > As a practical matter, when and where are you planning to post the > > project policy? How would you check and prevent patented code? > > PG may need a contributor agreement. All I can find on that is: > > https://www.postgresql.org/message-id/flat/54337476.3040605%402ndquadrant.com#9b1968ddc0fadfe225001adc1a821925 Yes, I thought we would probably need a CLA, but I was hesitant to mention it... This is somewhat off-topic, but a member of our IP department informed me that the TPL only states the disclaimer for the University of California, not for the PGDG or all copyright holders, unlike the Apache License. Is this intended? https://www.postgresql.org/about/licence/ -- IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. -- Regards Takayuki Tsunakawa
Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)
On Thu, Jul 12, 2018 at 8:32 AM Thomas Munro wrote: > On Thu, Jul 12, 2018 at 12:46 AM, Haribabu Kommi > wrote: > >> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote: > >> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show > up > >> >> as "reads" when in fact they are not reads at all: > >> >> > >> >> 1. Relation extension, which in fact writes a zero-filled block. > >> >> 2. The RBM_ZERO_* modes, which provoke neither read nor write. > > > > I checked the patch and I agree with change 1). And regarding change > 2), > > whether it is zeroing the contents of the page or not, it does read? > > because if it exists in the buffer pool, we are counting them as hits > > irrespective > > of the mode? Am I missing something? > > Further down in the function you can see that there is no read() > system call for the RBM_ZERO_* modes: > > if (mode == RBM_ZERO_AND_LOCK || mode == > RBM_ZERO_AND_CLEANUP_LOCK) > MemSet((char *) bufBlock, 0, BLCKSZ); > else > { > ... > smgrread(smgr, forkNum, blockNum, (char *) > bufBlock); > ... > } > Thanks for the details. I got your point. But we need to count read operations in the RBM_ZERO_ON_ERROR case; excluding the others is fine.
> Some more on my motivation: In our zheap prototype, when the system > is working well and we have enough space, we constantly allocate > zeroed buffer pages at the insert point (= head) of an undo log and > drop pages at the discard point (= tail) in the background; > effectively a few pages just go round and round via the freelist and > no read() or write() syscalls happen. That's something I'm very happy > about and it's one of our claimed advantages over the traditional heap > (which tends to read and dirty more pages), but EXPLAIN (BUFFERS) > hides this virtuous behaviour when comparing with the traditional > heap: it falsely and slanderously reports that zheap is reading undo > pages when it is not. Of course I don't intend to litigate zheap > design in this thread; I just figured that since this accounting is > wrong on principle and affects current PostgreSQL too (at least in > theory) I would propose this little patch independently. It's subtle > enough that I wouldn't bother to back-patch it though. > OK. Maybe it is better to implement the buffer-allocated counter along with zheap to provide better buffer results? Regards, Haribabu Kommi Fujitsu Australia
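A possible shape for the separate "allocated" counter discussed in this thread, as a hedged sketch: BufferUsage here is a toy stand-in and shared_blks_allocated is a hypothetical field, not an existing PostgreSQL counter. Note that RBM_ZERO_ON_ERROR still counts as a read, per the point above, since a real read() happens in that mode.

```c
/* Sketch only: toy types standing in for the real buffer-usage machinery. */
#include <stdbool.h>

typedef enum
{
	RBM_NORMAL,
	RBM_ZERO_AND_LOCK,
	RBM_ZERO_AND_CLEANUP_LOCK,
	RBM_ZERO_ON_ERROR
} ReadBufferMode;

typedef struct BufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
	long		shared_blks_allocated;	/* hypothetical new counter */
} BufferUsage;

void
account_buffer(BufferUsage *u, bool found_in_pool, bool is_extend,
			   ReadBufferMode mode)
{
	if (found_in_pool)
		u->shared_blks_hit++;
	else if (is_extend ||
			 mode == RBM_ZERO_AND_LOCK ||
			 mode == RBM_ZERO_AND_CLEANUP_LOCK)
		u->shared_blks_allocated++;	/* page zeroed in memory; no read() */
	else
		u->shared_blks_read++;		/* real synchronous read(), including
									 * the RBM_ZERO_ON_ERROR path */
}
```

With three-way accounting like this, zheap's round-and-round undo pages would show up as cheap allocations rather than being misreported as reads.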
Re: Tips on committing
On Tue, Jul 10, 2018 at 9:14 PM, Noah Misch wrote: > My rule has been to add to my private checklist anytime I mail or push a patch > containing a readily-checkable mistake. I go through the checklist before > mailing or pushing any patch. It has things in common with your list, plus > these: > > * Validate err*() calls against > https://www.postgresql.org/docs/devel/static/error-style-guide.html > * Validate *printf calls for trailing newlines > > * Spellcheck the patch > > * Verify long lines are not better broken > git diff origin/master | grep -E '^(\+|diff)' | sed 's/^+//' | expand -t4 | > awk "length > 78 || /^diff/" > > * Run pgindent on changed files; keep the changes I made necessary > > * Update version numbers, if needed > CATALOG_VERSION_NO, PG_CONTROL_VERSION, XLOG_PAGE_MAGIC, PGSTAT_FILE_FORMAT_ID I'm going to use some of these, myself. >> I'll try to do that, but I'd still recommend personalizing it. A lot >> of the stuff in there is specific to my own workflow and tool >> preferences, and my own personal working style. > > I agree we won't all want the exact same checklist. Still, it wouldn't hurt > to have a wiki page of checklist entry ideas from which folks cherry-pick the > entries they like. You've convinced me that we should definitely have such a list. I've put it on my TODO list. -- Peter Geoghegan
Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise
On Mon, Jul 24, 2017 at 11:12 AM, Peter Geoghegan wrote: > You can get about a 1/3 loss of space by inserting randomly, rather > than inserting in sorted order, which is what REINDEX will more or > less do for you. That's because random workloads almost entirely get > 50:50 page splits, whereas sorted input will always split the > rightmost page, and so will always get 90:10 splits. The space in the > random case isn't exactly wasted; it's there for the taking, for key > values that happen to fit on the page. I looked into this again. I decided to see for myself what was up. Actually, I used oltpbench [1], since that was the easiest TPC-C benchmark available. The order_line primary key may have been designed to be as unfriendly as possible to implementations that don't have good suffix truncation. This is also true of several of other TPC-C indexes. I initialized a scale 50 TPC-C database: pg@bat:~/code/oltpbench$ ./oltpbenchmark -b tpcc -c config/my_postgres_tpcc_config.xml --create=true --load=true 15:26:57,214 (DBWorkload.java:259) INFO - == Benchmark: TPCC {com.oltpbenchmark.benchmarks.tpcc.TPCCBenchmark} Configuration: config/sample_tpcc_config.xml Type: POSTGRES Driver:org.postgresql.Driver URL: jdbc:postgresql:tpcc Isolation: TRANSACTION_SERIALIZABLE Scale Factor: 50.0 Note that there is no garbage for VACUUM to clean up just yet. There hasn't been any deletes or updates yet. And yet, order_line_pkey is bloated. It's 783 MB, but if I run REINDEX it shrinks right down to 451 MB. This is on the master branch -- not with one of my patches. You can also arrange to make the index much smaller if you force the insertions to occur in a totally random order, rather than the order that the benchmark actually inserts them. The problem for us is that tuples are inserted in clustered/monotonically increasing order for the last 3 columns, while the first column (ol_w_id) is a low cardinality column whose values appear in random order, more or less. 
This isn't actually unrealistic - it makes sense that associated records would be inserted as per-warehouse groups like this, while the order of the groups remains unpredictable. I am working on adding suffix truncation at the moment. Right now, my patch doesn't help, because it isn't sophisticated enough about the choice of split point (we need to care about suffix truncation, and not just balancing space among the two halves of a split). It should try to find a split point that maximizes the chances of a new pivot tuple not having any low cardinality final column (ol_number). We should be able to mostly not have any of the last column in tuples that make it into the branch/internal nodes: pg@tpcc[1452]=# select count(*) from order_line ; count 15,001,784 (1 row) pg@tpcc[1452]=# select count(distinct(ol_w_id, ol_d_id, ol_o_id)) from order_line ; count ─── 1,500,000 (1 row) As I said, the final ol_number column makes the keyspace unbalanced by unnecessarily appearing in the internal/branch nodes: pg@tpcc[1452]=# select count(distinct(ol_number)) from order_line ; count ─── 15 (1 row) Since there can only be 15 distinct values for any given (ol_w_id, ol_d_id, ol_o_id, *), and since over 100 index tuples will fit on each leaf page, we just have to pick a split point that's between (x, y, z, 15) and (x, y, z + 1, 1). This makes it legal to truncate away the ol_number column, which allows us to balance the use of free space for items that are inserted after the split, by later transactions. [1] https://github.com/oltpbenchmark/oltpbench -- Peter Geoghegan
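The split-point reasoning above can be sketched as a tiny function: a truncated pivot tuple only needs the leading attributes that distinguish the last tuple on the left half from the first tuple on the right half. This is an illustrative sketch, not the actual patch code; keys are plain int arrays standing in for (ol_w_id, ol_d_id, ol_o_id, ol_number).

```c
/* Sketch only: integer keys stand in for multi-column index tuples. */
#include <stddef.h>

/* Number of leading attributes the new pivot must keep: one more than
 * the common prefix of the two boundary tuples.  Everything after that
 * can be suffix-truncated away. */
size_t
pivot_natts(const int *lastleft, const int *firstright, size_t natts)
{
	for (size_t i = 0; i < natts; i++)
	{
		if (lastleft[i] != firstright[i])
			return i + 1;
	}
	return natts;				/* all-equal boundary: nothing truncatable */
}
```

For a split chosen between (x, y, z, 15) and (x, y, z + 1, 1), this yields 3, so the low-cardinality ol_number column stays out of the internal pages; a split point chosen mid-order would yield 4 and keep it.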
Re: TRUNCATE tables referenced by FKs on partitioned tables
On 2018-Jul-11, Michael Paquier wrote: > > alvherre=# truncate table prim, partfk; > > ERROR: cannot truncate a table referenced in a foreign key constraint > > DETALLE: Table "partfk" references "prim". > > SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... > > CASCADE. > > Your first and second queries are the same :) Yeah, C failure :-( Anyway, this patch seems to fix it, and adds what I think is appropriate test coverage. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 07500a1d7d7e6e2a79514672a7b6441781baa1da Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 11 Jul 2018 18:54:00 -0400 Subject: [PATCH] fix truncate --- src/backend/catalog/heap.c | 7 +++- src/backend/commands/tablecmds.c | 2 +- src/test/regress/expected/truncate.out | 75 ++ src/test/regress/sql/truncate.sql | 47 + 4 files changed, 128 insertions(+), 3 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index d223ba8537..4cfc0c8911 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -3181,13 +3181,16 @@ heap_truncate_check_FKs(List *relations, bool tempTables) * Build a list of OIDs of the interesting relations. * * If a relation has no triggers, then it can neither have FKs nor be -* referenced by a FK from another table, so we can ignore it. +* referenced by a FK from another table, so we can ignore it. For +* partitioned tables, FKs have no triggers, so we must include them +* anyway. 
*/ foreach(cell, relations) { Relationrel = lfirst(cell); - if (rel->rd_rel->relhastriggers) + if (rel->rd_rel->relhastriggers || + rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) oids = lappend_oid(oids, RelationGetRelid(rel)); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7c0cf0d7ee..22e81e712d 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -1421,7 +1421,7 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged, Oid*logrelids; /* -* Open, exclusive-lock, and check all the explicitly-specified relations +* Check the explicitly-specified relations. * * In CASCADE mode, suck in all referencing relations as well. This * requires multiple iterations to find indirectly-dependent relations. At diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out index 735d0e862d..0c2ba173db 100644 --- a/src/test/regress/expected/truncate.out +++ b/src/test/regress/expected/truncate.out @@ -464,3 +464,78 @@ ERROR: cannot truncate only a partitioned table HINT: Do not specify the ONLY keyword, or use TRUNCATE ONLY on the partitions directly. TRUNCATE truncparted; DROP TABLE truncparted; +-- foreign key on partitioned table: partition key is referencing column. 
+-- Make sure truncate did execute on all tables +CREATE FUNCTION tp_ins_data() RETURNS void LANGUAGE plpgsql AS $$ + BEGIN + INSERT INTO truncprim VALUES (1), (100), (150); + INSERT INTO truncpart VALUES (1, 1), (100, 100), (150, 150); + END +$$; +CREATE FUNCTION tp_chk_data(OUT pktb regclass, OUT pkval int, OUT fktb regclass, OUT fkval int) + RETURNS SETOF record LANGUAGE plpgsql AS $$ + BEGIN +RETURN QUERY SELECT + pk.tableoid::regclass, pk.a, fk.tableoid::regclass, fk.a +FROM truncprim pk FULL JOIN truncpart fk USING (a) +ORDER BY 2, 4; + END +$$; +CREATE TABLE truncprim (a int PRIMARY KEY); +CREATE TABLE truncpart (a int REFERENCES truncprim, b int, filler text) + PARTITION BY RANGE (a); +CREATE TABLE truncpart_1 PARTITION OF truncpart FOR VALUES FROM (0) TO (100); +CREATE TABLE truncpart_2 PARTITION OF truncpart FOR VALUES FROM (100) TO (200) + PARTITION BY RANGE (a); +CREATE TABLE truncpart_2_1 PARTITION OF truncpart_2 FOR VALUES FROM (100) TO (150); +CREATE TABLE truncpart_2_d PARTITION OF truncpart_2 DEFAULT; +TRUNCATE TABLE truncprim; -- should fail +ERROR: cannot truncate a table referenced in a foreign key constraint +DETAIL: Table "truncpart" references "truncprim". +HINT: Truncate table "truncpart" at the same time, or use TRUNCATE ... CASCADE. +select tp_ins_data(); + tp_ins_data +- + +(1 row) + +-- should truncate everything +TRUNCATE TABLE truncprim, truncpart; +select * from tp_chk_data(); + pktb | pkval | fktb | fkval +--+---+--+--- +(0 rows) + +select tp_ins_data(); + tp_ins_data +- + +(1 row) + +-- should truncate everything +SET client_min_messages TO WARNING;-- suppress cascading notices +TRUNCATE TABLE truncprim
Segfault logical replication PG 10.4
We discovered our pg_wal partition was full a few days after setting up our first logical publication on a PG 10.4 instance. After that we could not synchronise our slave to the master: it triggered a segfault on the slave. We had to manually drop the subscription on the slave and the slot on the master. Then, wanting to find the cause of this bug, we stopped the connection between master and slave; after 30 minutes the slave segfaulted and could not synchronise. Why can the slave not synchronise after dropping the slot, without recreating the subscription from scratch? How should we manage the replication, given that we use cloud VMs and have network latency issues? Here are the details of the conf and error logs:

Conf on master: max_replication_slots = 10 max_sync_workers_per_subscription = 2 wal_receiver_timeout = 60s wal_keep_segments = 1000 wal_receiver_status_interval = 10 wal_retrieve_retry_interval = 5s max_logical_replication_workers = 4

Conf on slave: the same, except wal_keep_segments = 0

Error log on slave: LOG: logical replication apply worker for subscription « " has started DEBUG: connecting to publisher using connection string "postgresql://USER@IP" LOG: worker process: logical replication worker for subscription 132253 (PID 25359) was terminated by signal 11: Segmentation fault LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. 
LOG: all server processes terminated; reinitializing DEBUG: unregistering background worker "logical replication worker for subscription 132253" LOG: database system was interrupted; last known up at 2018-07-11 21:50:56 UTC DEBUG: checkpoint record is at 0/7DBFEF10 DEBUG: redo record is at 0/7DBFEF10; shutdown TRUE DEBUG: next transaction ID: 0:93714; next OID: 140237 DEBUG: next MultiXactId: 1; next MultiXactOffset: 0 DEBUG: oldest unfrozen transaction ID: 548, in database 1 DEBUG: oldest MultiXactId: 1, in database 1 DEBUG: commit timestamp Xid oldest/newest: 0/0 DEBUG: transaction ID wrap limit is 2147484195, limited by database with OID 1 DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1 DEBUG: starting up replication slots LOG: recovered replication state of node 2 to 0/0 LOG: recovered replication state of node 3 to 0/0 LOG: recovered replication state of node 4 to 0/0 LOG: recovered replication state of node 5 to 56A5/29ACA918 LOG: database system was not properly shut down; automatic recovery in progress THANK YOU
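For reference, the manual cleanup described above (dropping the subscription on the slave and the slot on the master) can be expressed as SQL along these lines; the subscription and slot names are placeholders, and detaching the slot first avoids DROP SUBSCRIPTION trying to drop the remote slot over a broken connection:

```sql
-- On the subscriber (names are placeholders):
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);  -- detach the remote slot
DROP SUBSCRIPTION mysub;

-- On the publisher, drop the now-orphaned slot:
SELECT pg_drop_replication_slot('mysub');
```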
Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
I wrote: > I propose to run through the system operator classes, find any for which > the comparison function isn't marked leakproof but the operators are, > and fix them. This is clearly appropriate for HEAD and maybe it's not > too late to force an initdb for v11 --- thoughts? I did that for the built-in btree opclasses. I decided that it's probably not worth forcing an initdb in v11 for, though. In principle, losing selectivity estimates because of non-leakproof functions should only happen in queries that are going to fail at runtime anyway. The real problem that ought to be addressed and perhaps back-patched is this: > Another question that could be raised is why we are refusing to use > stats for a child table when the caller has select on the parent. > It's completely trivial to extract data from a child table if you > have select on the parent, so it seems like we are checking the > wrong table's privileges. regards, tom lane
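The survey Tom describes — btree comparison support functions that are not marked leakproof while the family's operators are — can be approximated with a catalog query like the following sketch (an illustration, not the script used for the actual commit):

```sql
-- Sketch: for each btree opfamily, flag comparison support functions that
-- are not leakproof even though at least one matching operator's
-- implementation function is.
SELECT opf.opfname,
       cmp.proname      AS cmp_function,
       cmp.proleakproof AS cmp_leakproof
FROM pg_amproc ap
JOIN pg_opfamily opf ON opf.oid = ap.amprocfamily
JOIN pg_am      am  ON am.oid  = opf.opfmethod AND am.amname = 'btree'
JOIN pg_proc    cmp ON cmp.oid = ap.amproc
WHERE ap.amprocnum = 1              -- btree comparison function (BTORDER_PROC)
  AND NOT cmp.proleakproof
  AND EXISTS (SELECT 1
              FROM pg_amop ao
              JOIN pg_operator o    ON o.oid    = ao.amopopr
              JOIN pg_proc     opfn ON opfn.oid = o.oprcode
              WHERE ao.amopfamily   = ap.amprocfamily
                AND ao.amoplefttype = ap.amproclefttype
                AND ao.amoprighttype = ap.amprocrighttype
                AND opfn.proleakproof);
```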
Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)
On Thu, Jul 12, 2018 at 12:46 AM, Haribabu Kommi wrote: >> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote: >> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up >> >> as "reads" when in fact they are not reads at all: >> >> >> >> 1. Relation extension, which in fact writes a zero-filled block. >> >> 2. The RBM_ZERO_* modes, which provoke neither read nor write. > > I checked the patch and I agree with the change 1). And regarding change 2) > whether it is zeroing the contents of the page or not, it does read? > because if it exists in the buffer pool, we are counting them as hits > irrespective > of the mode? Am I missing something? Further down in the function you can see that there is no read() system call for the RBM_ZERO_* modes: if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) MemSet((char *) bufBlock, 0, BLCKSZ); else { ... smgrread(smgr, forkNum, blockNum, (char *) bufBlock); ... } I suppose someone might argue that even when it's not a hit and it's not a read, we might still want to count this buffer interaction in some other way. Perhaps there should be a separate counter? It may technically be a kind of cache miss, but it's nowhere near as expensive as a synchronous system call like read() so I didn't propose that. Some more on my motivation: In our zheap prototype, when the system is working well and we have enough space, we constantly allocate zeroed buffer pages at the insert point (= head) of an undo log and drop pages at the discard point (= tail) in the background; effectively a few pages just go round and round via the freelist and no read() or write() syscalls happen. That's something I'm very happy about and it's one of our claimed advantages over the traditional heap (which tends to read and dirty more pages), but EXPLAIN (BUFFERS) hides this virtuous behaviour when comparing with the traditional heap: it falsely and slanderously reports that zheap is reading undo pages when it is not. 
Of course I don't intend to litigate the zheap design in this thread; I just figured that since this accounting is wrong on principle and affects current PostgreSQL too (at least in theory), I would propose this little patch independently. It's subtle enough that I wouldn't bother to back-patch it though. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints
On Wed, Jul 11, 2018 at 03:13:30PM -0400, Alvaro Herrera wrote: > On 2018-Jun-06, Nico Williams wrote: > > I've finally gotten around to rebasing this patch and making the change > > that was requested, which was: merge the now-would-be-three deferral- > > related bool columns in various pg_catalog tables into one char column. > > > > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just > > (deferral). > > Nice stuff. > > Please add #defines for the chars in pg_constraint.h instead of using > the values directly in the source. Also, catalogs.sgml has a typo in > one of the values. > > What happens if you do SET CONSTRAINTS ALL IMMEDIATE and you have one of > these constraints? I expect that it silently does not alter that > constraint, but you didn't change that manpage. Correct, that's the idea, that it should be possible to write deferred constraints/triggers which users cannot make immediate. For example, an audit extension that logs changes via FOR EACH ROW ALWAYS DEFERRED, or my PoC COMMIT TRIGGER implementation (which depends on deferred triggers). I missed that there is a page for SET CONSTRAINTS! I'll update it. > I suggest not to include psql/tab-complete changes with your main patch. > If you want to update it, split it out to its own patch. Committer can > choose to include it in one commit or not (I'm mildly inclined not to, > but I'm probably inconsistent about it), but for review IMO it's better > not to mix things -- It's way too easy to get entangled in silly details > while editing that code, and it's not critical anyway. OK, sure, though, of course, the committer could always leave that out themselves, no? To me though it seems that the change should be complete. > > Incidentally, I had to do commit-by-commit rebasing to make the rebase > > easier. I have a shell function I use for this, if anyone wants a copy > > of it -- sometimes it's much easier to do this than to do one huge jump. > > I've done this manually a few times. 
> Please share, I'm curious. OK, attached is my latest version of that script, though this one is a bit changed from the one I used. This version tries to be faster / more efficient by first doing 1 commit, then 2, then 3, and so on, and on conflict aborts and halves N to try again. The idea is to only have to merge conflicts at each commit where conflicts arise, never resolving conflicts across more than one commit -- this makes it much easier to reason about conflicts! Note that the script is actually a shell function, and that it keeps state in shell variables. A better implementation would do the sort of thing that git(1) itself does to keep rebase state. Nico --

# Based on a shell function by Viktor Dukhovni
#
# slowrebase BRANCH_TO_REBASE ONTO
function slowrebase {
    typeset b N
    if (($# > 0)) && [[ $1 = -h || $1 = --help ]]; then
        printf 'Usage: slowrebase BRANCH_TO_REBASE ONTO_HEAD\n'
        printf '       slowrebase # to continue after resolving conflicts\n'
        printf '\n\tslowrebase is a shell function that uses the following\n'
        printf '\tglobal variables to keep state: $S $T $B ${C[@]}\n'
        printf '\t$slowrebase_window_sz\n'
        printf '\tDO NOT CHANGE THOSE VARIABLES.\n'
        return 0
    elif (($# > 0 && $# != 2)); then
        printf 'Usage: slowrebase BRANCH_TO_REBASE ONTO_HEAD\n' 1>&2
        printf '       slowrebase # to continue after resolving conflicts\n'
        printf '\n\tslowrebase is a shell function that uses the following\n' 1>&2
        printf '\tglobal variables to keep state: $S $T $B ${C[@]}\n' 1>&2
        printf '\t$slowrebase_window_sz\n' 1>&2
        printf '\tDO NOT CHANGE THOSE VARIABLES.\n' 1>&2
        return 1
    fi

    if (($# == 2)); then
        slowrebase_window_sz=1
        S=$1
        T=$2
        B=$(git merge-base "$S" "$T")
        C=( $(git log --oneline "$B".."$2" | awk '{print $1}') )
        set --

        # Prep
        git log -n1 "$S" > /dev/null || return 1
        if [[ $(git log --oneline -n1 HEAD) != $(git log --oneline -n1 "$S") ]]; then
            if (($(git status -sb | wc -l) != 1)); then
                printf 'Error: please clean your workspace\n'
                return 1
            fi
            git checkout "$S"
        elif (($(git status -sbuno | wc -l) != 1)); then
            printf 'Error: please clean your workspace\n'
            return 1
        fi
        # Fall through to get started
    elif [[ $(git log --oneline -n1 HEAD) != $(git log --oneline -n1 "$S") ]] &&
         ! git rebase --continue; then
        N=$(( ${#C[@]} - slowrebase_window_sz ))
        printf '\nConflicts while rebasing $S (%s) slowly onto $T (%s)\n' "$S" "$T"
        printf '${C[@]} has the commits left to process (%s left)\n' $N
        printf '$B is the commit we are rebasing onto right now: %s\n' "$B"
        printf '$b is the previous commit we had already
Re: make installcheck-world in a clean environment
Alexander Lakhin writes: > 06.07.2018 00:39, Peter Eisentraut wrote: >> Exactly what order of steps are you executing that doesn't work? > In Centos 7, using the master branch from git: > ./configure --enable-tap-tests > make install > make install -C contrib > chown -R postgres:postgres /usr/local/pgsql/ > /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data > /usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data > /make clean/ > # Also you can just install binary packages to get the same state. > make installcheck-world > # This check fails. I do not think that should be expected to work. It would require that "make installcheck" first invoke "make all" (to rebuild the stuff you threw away with "make clean"), which is rather antithetical to its purpose. Specifically, installcheck is supposed to be checking something you already built; so having it do a fresh build seems to introduce version-skew hazards that we don't need. regards, tom lane
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
can we try something like this? PGBENCH_ERROR_START(DEBUG_FAIL) { PGBENCH_ERROR("client %d repeats the failed transaction (try %d", Argh, no? I was thinking of something much more trivial: pgbench_error(DEBUG, "message format %d %s...", 12, "hello world"); If you really need some complex dynamic buffer, and I would prefer that you avoid that, then the fallback is: if (level >= DEBUG) { initPQstuff(); ... pgbench_error(DEBUG, "fixed message... %s\n", msg); freePQstuff(); } The point is to avoid building the message with dynamic allocation if in the end it is not used. -- Fabien.
Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints
On 2018-Jun-06, Nico Williams wrote: > I've finally gotten around to rebasing this patch and making the change > that was requested, which was: merge the now-would-be-three deferral- > related bool columns in various pg_catalog tables into one char column. > > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just > (deferral). Nice stuff. Please add #defines for the chars in pg_constraint.h instead of using the values directly in the source. Also, catalogs.sgml has a typo in one of the values. What happens if you do SET CONSTRAINTS ALL IMMEDIATE and you have one of these constraints? I expect that it silently does not alter that constraint, but you didn't change that manpage. I suggest not to include psql/tab-complete changes with your main patch. If you want to update it, split it out to its own patch. Committer can choose to include it in one commit or not (I'm mildly inclined not to, but I'm probably inconsistent about it), but for review IMO it's better not to mix things -- It's way too easy to get entangled in silly details while editing that code, and it's not critical anyway. > Incidentally, I had to do commit-by-commit rebasing to make the rebase > easier. I have a shell function I use for this, if anyone wants a copy > of it -- sometimes it's much easier to do this than to do one huge jump. I've done this manually a few times. Please share, I'm curious. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
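For illustration, the behaviour being asked about might look like this with the patch's proposed syntax (not committed PostgreSQL; the keyword placement follows the thread's wording, and the trigger and function names are hypothetical):

```sql
CREATE TABLE audited (a int);
CREATE CONSTRAINT TRIGGER audit_row AFTER INSERT ON audited
    ALWAYS DEFERRED FOR EACH ROW EXECUTE PROCEDURE audit_fn();

BEGIN;
SET CONSTRAINTS ALL IMMEDIATE;  -- expected to silently leave audit_row deferred
INSERT INTO audited VALUES (1); -- audit_fn() still fires at COMMIT, not here
COMMIT;
```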
Shouldn't validateForeignKeyConstraint() reset memory context?
Hi, while looking at the pluggable storage patch I noticed that validateForeignKeyConstraint() calls RI_FKey_check() for each row without resetting a memory / expression context. There's not too much leakage in the called code, but there's some I think. Greetings, Andres Freund
Re: [HACKERS] [PATCH] WIP Add ALWAYS DEFERRED option for constraints
On Tue, Jun 26, 2018 at 04:54:13PM -0400, Robbie Harwood wrote: > Nico Williams writes: > > > [Re-send; first attempt appears to have hit /dev/null somewhere. My > > apologies if you get two copies.] > > > > I've finally gotten around to rebasing this patch and making the change > > that was requested, which was: merge the now-would-be-three deferral- > > related bool columns in various pg_catalog tables into one char column. > > > > Instead of (deferrable, initdeferred, alwaysdeferred), now there is just > > (deferral). > > This design seems correct to me. I have a couple questions inline, and > some nits to go with them. Thanks. Replies below. > > All tests (make check) pass. > > Thank you for adding tests! Well, yeah :) > > diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml > > index 3ed9021..e82e39b 100644 > > --- a/doc/src/sgml/catalogs.sgml > > +++ b/doc/src/sgml/catalogs.sgml > > @@ -2239,17 +2239,15 @@ SCRAM-SHA-256$iteration > > count: > > > > > > > > - condeferrable > > - bool > > - > > - Is the constraint deferrable? > > - > > - > > - > > - condeferred > > - bool > > + condeferral > > + char > > > > - Is the constraint deferred by default? > > + Constraint deferral option: > > +a = always deferred, > > +d = deferrable, > > +d = deferrable initially deferred, > > From the rest of the code, I think this is supposed to be 'i', not 'd'. Oh yes, good catch. > > +n = not deferrable > > + > > > > > > > > > diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c > > index 8b276bc..795a7a9 100644 > > --- a/src/backend/catalog/index.c > > +++ b/src/backend/catalog/index.c > > @@ -1070,6 +1070,7 @@ index_create(Relation heapRelation, > > > > recordDependencyOn(, , > > deptype); > > } > > + Assert((flags & INDEX_CREATE_ADD_CONSTRAINT) == 0); > > What does this ensure, and why is it in this part of the code? 
We're in > the `else` branch here - isn't this always the case (i.e., Assert can > never fire), since `flags` isn't manipulated in this block? Oy, nothing. I think that's a cut-n-paste error. I'll remove it. > > } > > > > /* Store dependency on parent index, if any */ > > > diff --git a/src/backend/catalog/information_schema.sql > > b/src/backend/catalog/information_schema.sql > > index f4e69f4..bde6199 100644 > > --- a/src/backend/catalog/information_schema.sql > > +++ b/src/backend/catalog/information_schema.sql > > @@ -891,10 +891,14 @@ CREATE VIEW domain_constraints AS > > CAST(current_database() AS sql_identifier) AS domain_catalog, > > CAST(n.nspname AS sql_identifier) AS domain_schema, > > CAST(t.typname AS sql_identifier) AS domain_name, > > - CAST(CASE WHEN condeferrable THEN 'YES' ELSE 'NO' END > > + CAST(CASE WHEN condeferral = 'n' THEN 'NO' ELSE 'YES' END > > AS yes_or_no) AS is_deferrable, > > - CAST(CASE WHEN condeferred THEN 'YES' ELSE 'NO' END > > + CAST(CASE WHEN condeferral = 'i' OR condeferral = 'a' THEN > > 'YES' ELSE 'NO' END > > AS yes_or_no) AS initially_deferred > > + /* > > + * XXX Can we add is_always_deferred here? Are there > > + * standards considerations? > > + */ > > I'm not familiar enough to speak to this. Hopefully someone else can. > Absent other input, I think we should add is_always_deferred. (And if > we were building this today, probably just expose a single character > instead of three booleans.) I had found the answer ("yes") in the history of this file but forgot to remove the comment while rebasing. I'll remove the comment. 
> > FROM pg_namespace rs, pg_namespace n, pg_constraint con, pg_type t > > WHERE rs.oid = con.connamespace > >AND n.oid = t.typnamespace > > > diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c > > index 57519fe..41dc6a4 100644 > > --- a/src/backend/commands/trigger.c > > +++ b/src/backend/commands/trigger.c > > @@ -3632,6 +3627,7 @@ typedef struct AfterTriggerSharedData > > TriggerEvent ats_event; /* event type indicator, see trigger.h > > */ > > Oid ats_tgoid; /* the trigger's ID */ > > Oid ats_relid; /* the relation it's on > > */ > > + boolats_alwaysdeferred; /* whether this can be > > deferred */ > > "Whether this must be deferred"? Also, I'm not sure what this is for - > it doesn't seem to be used anywhere. Ah, this became unused during the rebase. I'll remove it. > > CommandId ats_firing_id; /* ID for firing cycle */ > > struct AfterTriggersTableData *ats_table; /* transition table > > access */ > > }
Re: Shared buffer access rule violations?
On Tue, Jul 10, 2018 at 8:33 PM, Tom Lane wrote: > Asim R P writes: > >> One can find several PageInit() calls with no content lock held. See, >> for example: > >> fill_seq_with_data() > > That would be for a relation that no one else can even see yet, no? Yes, when the sequence is being created. No, when the sequence is being reset, in ResetSequence(). > >> vm_readbuf() >> fsm_readbuf() > > In these cases I'd imagine that the I/O completion interlock is what > is preventing other backends from accessing the buffer. > What is the I/O completion interlock? I see no difference in initializing a visimap/fsm page and initializing a standard heap page. For standard heap pages, the code currently acquires the buffer pin as well as content lock for initialization. >> Moreover, fsm_vacuum_page() performs >> "PageGetContents(page))->fp_next_slot = 0;" without content lock. > > That field is just a hint, IIRC, and the possibility of a torn read > is explicitly not worried about. Yes, that's a hint. And ignoring the torn page possibility doesn't result in checksum failures because fsm_readbuf() passes RBM_ZERO_ON_ERROR to the buffer manager. The page will be zeroed out in the event of a checksum failure. Asim
Re: [PATCH] btree_gist: fix union implementation for variable length columns
On Wednesday, July 11, 2018 7:26:40 PM CEST Tom Lane wrote: > Pavel Raiskup writes: > > On Monday, July 9, 2018 7:41:59 PM CEST Tom Lane wrote: > >> Hi Pavel! For patches that purport to resolve bugs, we usually like to > >> add a regression test case that demonstrates the bug in unpatched code. > >> Can you provide a small test case that does so? (The BZ you pointed to > >> doesn't seem to address this...) > > > Turns out the problem is only related to bit/bit varying type (so the > > patch comments need to be reworded properly, at least) since those are the > > only types which have implemented the f_l2n() callback. > > What I'm failing to wrap my head around is why this code exists at all. > AFAICS, gbt_bit_xfrm just forces the bitstring to be zero-padded out to > an INTALIGN boundary, which it wouldn't necessarily be otherwise. > But why bother? That should have no effect on the behavior of bit_cmp(). The gbt_bit_xfrm() method actually also skips the varbit header (number of bits, stored in VarBit.bit_len), the magic is done by VARBITS macro. If the header is dropped, it's OK to just use existing byteacmp() for bit array comparison. Pavel
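A minimal setup that exercises the bit/varbit union path under discussion might look like this (a sketch; it requires the btree_gist extension, and the wrong union shows up as degraded index quality rather than a query error):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE bits (b varbit);
CREATE INDEX bits_gist ON bits USING gist (b);
-- variable-length bit strings force the gbt_bit_xfrm()/union code path
INSERT INTO bits SELECT repeat('1', i)::varbit FROM generate_series(1, 64) AS i;
SET enable_seqscan = off;
SELECT count(*) FROM bits WHERE b = repeat('1', 10)::varbit;  -- should use the gist index
```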
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Just a quick skim while refreshing what were those error reporting API changes about ... On 2018-May-21, Marina Polyakova wrote: > v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch > - a patch for the RandomState structure (this is used to reset a client's > random seed during the repeating of transactions after > serialization/deadlock failures). LGTM, though I'd rename the random_state struct members so that it wouldn't look as confusing. Maybe that's just me. > v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch > - a patch for the Variables structure (this is used to reset client > variables during the repeating of transactions after serialization/deadlock > failures). Please don't allocate Variable structs one by one. First time allocate some decent number (say 8) and then enlarge by doubling the size. That way you save realloc overhead. We use this technique everywhere else; no reason to do differently here. Other than that, LGTM. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Add missing type conversion functions for PL/Python
On 26/03/18 19:07, Nikita Glukhov wrote: Attached is the fixed 3rd version of the patch: Thanks, I'm reviewing this now. Nice speedup! There is no test coverage for some of the added code. You can get a code coverage report with: ./configure --enable-coverage ... make make -C src/pl/plpython check make coverage-html That produces a code coverage report in coverage/index.html. Please look at the coverage of the new functions, and add tests to src/pl/plpython/sql/plpython_types.sql so that all the new code is covered. In some places, where you've already checked the object type e.g. with PyFloat_Check(), I think you could use the less safe variants, like PyFloat_AS_DOUBLE() instead of PyFloat_AsDouble(). Since this patch is all about performance, seems worth doing. Some of the conversions seem a bit questionable. For example:

/*
 * Convert a Python object to a PostgreSQL float8 datum directly.
 * If we cannot convert it directly, fall back to PLyObject_ToScalar
 * to convert it.
 */
static Datum
PLyObject_ToFloat8(PLyObToDatum *arg, PyObject *plrv, bool *isnull, bool inarray)
{
    if (plrv == Py_None)
    {
        *isnull = true;
        return (Datum) 0;
    }

    if (PyFloat_Check(plrv) || PyInt_Check(plrv) || PyLong_Check(plrv))
    {
        *isnull = false;
        return Float8GetDatum((double) PyFloat_AsDouble(plrv));
    }

    return PLyObject_ToScalar(arg, plrv, isnull, inarray);
}

The conversion from Python int to C double is performed by PyFloat_AsDouble(). But there's no error checking. And wouldn't PyLong_AsDouble() be more appropriate in that case, since we already checked the python type? - Heikki
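The kind of coverage addition being requested might look like this in plpython_types.sql (a sketch; plpythonu is the Python 2 language name, adjust for plpython3u):

```sql
CREATE EXTENSION IF NOT EXISTS plpythonu;

CREATE FUNCTION test_int_to_float8() RETURNS float8 AS $$
return 10          # Python int, exercises the direct int -> float8 path
$$ LANGUAGE plpythonu;

SELECT test_int_to_float8();  -- expect 10
```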
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 2018-Jul-11, Marina Polyakova wrote: > can we try something like this? > > PGBENCH_ERROR_START(DEBUG_FAIL) > { > PGBENCH_ERROR("client %d repeats the failed transaction (try %d", > st->id, st->retries + 1); > if (max_tries) > PGBENCH_ERROR("/%d", max_tries); > if (latency_limit) > { > PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used", > getLatencyUsed(st, )); > } > PGBENCH_ERROR(")\n"); > } > PGBENCH_ERROR_END(); I didn't quite understand what these PGBENCH_ERROR() functions/macros are supposed to do. Care to explain? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: In pageinspect, perform clean-up after testing gin-related functions
Andres Freund writes: > On 2018-07-11 12:56:49 +0530, Amit Kapila wrote: >> Yeah, it is good practice to drop the objects at the end. It is >> strange that original commit adfb81d9e1 has this at the end of the >> test, but a later commit 367b99bbb1 by Tom has removed the Drop >> statement. AFAICS, this is just a silly mistake, but I might be >> missing something. Tom, do you remember any reason for doing so? If >> not, then I think we can revert back that change (aka commit Kuntal's >> patch). > We actually sometimes intentionally want to persist objects past the end > of the test. Allows to test pg_dump / pg_upgrade. Don't know whether > that's the case here, but it's worthwhile to note. I don't think our pg_dump testbed makes any use of contrib regression tests, so that's not the reason here. I believe I took out the DROP because it made it impossible to do additional manual tests after the end of an installcheck run without laboriously re-creating the test table. In other words, I disagree with Amit's opinion that it's good practice to drop everything at the end of a test script. There are often good reasons to leave the objects available for later use. regards, tom lane
Re: no partition pruning when partitioning using array type
On 2018-Jul-11, Amit Langote wrote: > On 2018/07/11 13:12, Alvaro Herrera wrote: > > On 2018-Jul-11, Amit Langote wrote: > > > >> What's the solution here then? Prevent domains as partition key? > > > > Maybe if a domain is used in a partition key somewhere, prevent > > constraints from being added? > > Maybe, but I guess you mean only prevent adding such a constraint > after-the-fact. Yeah, any domain constraints added before won't be a problem. Another angle on this problem is to verify partition bounds against the domain constraint being added; if they all pass, there's no reason to reject the constraint. But I worry that Tom was using domain constraints as just an example of a more general problem that we can get into. > If a domain has a constraint before creating any partitions based on the > domain, then partition creation command would check that the partition > bound values satisfy those constraints. Of course. > > Another thing worth considering: are you prevented from dropping a > > domain that's used in a partition key? If not, you get an ugly message > > when you later try to drop the table. > > Yeah, I was about to write a message about that. I think we need to teach > RemoveAttributeById, which dependency.c calls when dropping the > referencing domain with cascade option, to abort if the attribute passed > to it belongs to the partition key of the input relation. Actually, here's another problem: why are we letting a column on a domain become partition key, if you'll never be able to create a partition? It seems that for pg11 the right reaction is to check the partition key and reject this case. I'm not sure of this implementation -- I couldn't find any case where we reject the deletion on the function called from doDeletion (and we don't have a preliminary phase on which to check for this, either). Maybe we need a pg_depend entry to prevent this, though it seems a little silly. 
Maybe we should add a preliminary verification phase in deleteObjectsInList(), where we invoke object-type-specific checks. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
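A concrete sequence showing the hazard under discussion (hypothetical names; today the domain column is accepted as a partition key, which is why the options above are being weighed):

```sql
CREATE DOMAIN d_int AS int;
CREATE TABLE dp (a d_int) PARTITION BY LIST (a);
CREATE TABLE dp1 PARTITION OF dp FOR VALUES IN (1);

-- Adding a constraint afterwards silently invalidates the existing bound:
ALTER DOMAIN d_int ADD CONSTRAINT positive_big CHECK (VALUE > 1);
-- dp1's bound (a = 1) now violates the domain, yet nothing rechecks it.
```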
Re: [PATCH] btree_gist: fix union implementation for variable length columns
Pavel Raiskup writes: > On Monday, July 9, 2018 7:41:59 PM CEST Tom Lane wrote: >> Hi Pavel! For patches that purport to resolve bugs, we usually like to >> add a regression test case that demonstrates the bug in unpatched code. >> Can you provide a small test case that does so? (The BZ you pointed to >> doesn't seem to address this...) > Turns out the problem is only related to bit/bit varying type (so the > patch comments need to be reworded properly, at least) since those are the > only types which have implemented the f_l2n() callback. What I'm failing to wrap my head around is why this code exists at all. AFAICS, gbt_bit_xfrm just forces the bitstring to be zero-padded out to an INTALIGN boundary, which it wouldn't necessarily be otherwise. But why bother? That should have no effect on the behavior of bit_cmp(). So I'm speculating that the reason nobody has noticed a problem is that there is no problem because this code is useless and could be ripped out. It would be easier to figure this out if the btree_gist code weren't so desperately undocumented. Teodor, do you remember why it's like this? regards, tom lane
Costing bug in hash join logic for semi joins
There is a costing bug in the hash join logic that seems to have been introduced by the patch for the inner_unique enhancements (commit 9c7f5229ad68d7e0e4dd149e3f80257893e404d4). Specifically, "hashjointuples", which tracks the number of matches for hash clauses, is computed wrongly in the inner-unique scenario. This causes many semi-joins that have been converted to inner joins with a unique inner side to be costed incorrectly. The function "final_cost_hashjoin" has special handling to cost semi/anti joins, accounting for the early stop after the first match. The above commit extended that handling to the inner_unique scenario as well. However, the "hashjointuples" computation was not fixed to handle this new scenario: it incorrectly falls into the anti-join branch and assumes unmatched rows. The fix is to handle the inner_unique case when computing "hashjointuples". Here is an outline of the code showing the bug:

void
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
                    JoinCostWorkspace *workspace,
                    JoinPathExtraData *extra)
{
    ...
    /* CPU costs */
    if (path->jpath.jointype == JOIN_SEMI ||
        path->jpath.jointype == JOIN_ANTI ||
        extra->inner_unique)
    {
        ...
        /* Get # of tuples that will pass the basic join */
        if (path->jpath.jointype == JOIN_SEMI)
            hashjointuples = outer_matched_rows;
        else
            hashjointuples = outer_path_rows - outer_matched_rows;
    }
    else
    {
        ...
    }
}

Thanks, RK (Salesforce)
Re: cursors with prepared statements
On 07/06/18 22:42, Peter Eisentraut wrote: I have developed a patch that allows declaring cursors over prepared statements: DECLARE cursor_name CURSOR FOR prepared_statement_name [ USING param, param, ... ] This is an SQL standard feature. ECPG already supports it (with different internals). Internally, this just connects existing functionality in different ways, so it doesn't really introduce anything new. One point worth pondering is how to pass the parameters of the prepared statements. The actual SQL standard syntax would be DECLARE cursor_name CURSOR FOR prepared_statement_name; OPEN cursor_name USING param, param; But since we don't have the OPEN statement in direct SQL, it made sense to me to attach the USING clause directly to the DECLARE statement. Hmm. I'm not excited about adding PostgreSQL-extensions to the SQL standard. It's confusing, and risks conflicting with future additions to the standard. ECPG supports the actual standard syntax, with OPEN, right? So this wouldn't be consistent with ECPG, either. Curiously, the direct EXECUTE statement uses the non-standard syntax EXECUTE prep_stmt (param, param); instead of the standard EXECUTE prep_stmt USING param, param; I tried to consolidate this. But using DECLARE c CURSOR FOR p (foo, bar) leads to parsing conflicts (and looks confusing?), How about DECLARE c CURSOR FOR EXECUTE p (foo, bar) ? As a user, I'm already familiar with the "EXECUTE p (foo, bar)" syntax, so that's what I would intuitively try to use with DECLARE as well. In fact, I think I tried doing just that once, and was disappointed that it didn't work. and instead allowing EXECUTE + USING leads to a mess in the ECPG parser that exhausted me. So I'm leaving it as is for now and might give supporting EXECUTE + USING another try later on. The attached patch seems to do the trick, of allowing EXECUTE + USING. I'm not sure this is worth the trouble, though, since EXECUTE as a plain SQL command is a PostgreSQL-extension anyway. 
This also adds a test case for the existing "EXECUTE ()" syntax in ECPG. The current ECPG parsing of that is actually a bit weird: it allows "EXECUTE stmt (:param1) USING :param2", which seems unintentional. This patch rejects that syntax.

- Heikki

>From 3b6cad3f6ecb615442bd0d0f365fbdec91cf9317 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 11 Jul 2018 19:47:43 +0300
Subject: [PATCH 1/1] Add support for EXECUTE USING syntax.

This is the SQL-standard equivalent of "EXECUTE ()".

TODO: docs.
---
 src/backend/parser/gram.y                       |  1 +
 src/interfaces/ecpg/preproc/check_rules.pl      |  2 +-
 src/interfaces/ecpg/preproc/ecpg.addons         |  2 +-
 src/interfaces/ecpg/preproc/ecpg.trailer        |  9
 src/interfaces/ecpg/preproc/parse.pl            |  2 +-
 src/interfaces/ecpg/test/expected/sql-execute.c | 51 ++
 .../ecpg/test/expected/sql-execute.stderr       | 24 +++---
 .../ecpg/test/expected/sql-execute.stdout       |  1 +
 src/interfaces/ecpg/test/sql/execute.pgc        | 14 ++
 src/test/regress/expected/prepare.out           |  7 +++
 src/test/regress/sql/prepare.sql                |  3 ++
 11 files changed, 100 insertions(+), 16 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 90dfac2cb1..851363fa4e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10779,6 +10779,7 @@ ExecuteStmt: EXECUTE name execute_param_clause
 		;

 execute_param_clause: '(' expr_list ')'	{ $$ = $2; }
+			| USING expr_list	{ $$ = $2; }
 			| /* EMPTY */		{ $$ = NIL; }
 		;

diff --git a/src/interfaces/ecpg/preproc/check_rules.pl b/src/interfaces/ecpg/preproc/check_rules.pl
index 6c8b004854..ee67817be0 100644
--- a/src/interfaces/ecpg/preproc/check_rules.pl
+++ b/src/interfaces/ecpg/preproc/check_rules.pl
@@ -37,7 +37,7 @@ if ($verbose)
 my %replace_line = (
 	'ExecuteStmtEXECUTEnameexecute_param_clause' =>
-	  'EXECUTE prepared_name execute_param_clause execute_rest',
+	  'EXECUTE prepared_name execute_rest',
 	'ExecuteStmtCREATEOptTempTABLEcreate_as_targetASEXECUTEnameexecute_param_clause' =>
 	  'CREATE OptTemp TABLE create_as_target AS EXECUTE prepared_name execute_param_clause',

diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons
index ca3efadc48..9606ad4783 100644
--- a/src/interfaces/ecpg/preproc/ecpg.addons
+++ b/src/interfaces/ecpg/preproc/ecpg.addons
@@ -283,7 +283,7 @@ ECPG: PrepareStmtPREPAREprepared_nameprep_type_clauseASPreparableStmt block
 	$$.type = NULL;
 	$$.stmt = $4;
 }
-ECPG: ExecuteStmtEXECUTEprepared_nameexecute_param_clauseexecute_rest block
+ECPG: ExecuteStmtEXECUTEprepared_nameexecute_rest block
 {
 	$$ = $2;
 }
 ECPG: DeclareCursorStmtDECLAREcursor_namecursor_optionsCURSORopt_holdFORSelectStmt block
 {
diff --git
Re: In pageinspect, perform clean-up after testing gin-related functions
On 2018-07-11 12:56:49 +0530, Amit Kapila wrote: > On Wed, Jul 11, 2018 at 12:37 PM, Kuntal Ghosh > wrote: > > Hello all, > > > > In pageinspect/sql/gin.sql, we don't drop the table test1 at the end > > of the test. IMHO, we should clean-up at the end of a test. > > > > Yeah, it is good practice to drop the objects at the end. It is > strange that original commit adfb81d9e1 has this at the end of the > test, but a later commit 367b99bbb1 by Tom has removed the Drop > statement. AFAICS, this is just a silly mistake, but I might be > missing something. Tom, do you remember any reason for doing so? If > not, then I think we can revert back that change (aka commit Kuntal's > patch). We actually sometimes intentionally want to persist objects past the end of the test. That allows testing pg_dump / pg_upgrade. Don't know whether that's the case here, but it's worthwhile to note. Greetings, Andres Freund
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 11-07-2018 16:24, Fabien COELHO wrote: Hello Marina, * -d/--debug: I'm not in favor of requiring a mandatory text argument on this option. As you wrote in [1], adding an additional option is also a bad idea: Hey, I'm entitled to some internal contradictions:-) ... and discussions will continue forever %-) I'm sceptical of the "--debug-fails" options. ISTM that --debug is already there and should just be reused. I was thinking that you could just use the existing --debug, not change its syntax. My point was that --debug exists, and you could just print the messages when under --debug. Now I understand you better, thanks. I think it will be useful to receive only messages about failures, because they and progress reports can be lost in many other debug messages such as "client %d sending ..." / "client %d executing ..." / "client %d receiving". Maybe it's better to use an optional argument/arguments for compatibility (--debug[=fails] or --debug[=NUM])? But if we use the numbers, now I can see only 2 levels, and there's no guarantee that they will not change... Optional arguments to options (!) are not really clean things, so I'd like to avoid going down this path, esp. as I cannot see any other instance in pgbench or elsewhere in postgres, AFAICS they are used in pg_waldump (option --stats[=record]) and in psql (option --help[=topic]). and I personally consider these a bad idea. So if absolutely necessary, a new option is still better than changing the --debug syntax. If not necessary, then it is better:-) Ok! * I'm reserved about the whole ereport thing, see comments in other messages. Thank you, I'll try to implement the error reporting in the way you suggested. Dunno if it is a good idea either. The committer's word is the one that counts in the end:-) I agree with you that ereport has good reasons to be non-trivial in the backend, and it does not have the same reasons in pgbench... * doCustom changes. On CSTATE_FAILURE, the next command is possibly started.
Although there is some consistency with the previous point, I think that it totally breaks the state automaton where now a command can start while the whole transaction is in failing state anyway. There was no point in starting it in the first place. So, for me, the FAILURE state should record/count the failure, then skip to RETRY if a retry is decided, else proceed to ABORT. Nothing else. This is much clearer that way. Then RETRY should reinstate the global state and proceed to start the *first* command again. <...> It is unclear to me why backslash command errors are turned to FAILURE instead of ABORTED: there is no way they are going to be retried, so maybe they should/could skip directly to ABORTED? So do you propose to execute the command "ROLLBACK" without calculating its latency etc. if we are in a failed transaction and clear the conditional stack after each failure? Also just to be clear: do you want to have the state CSTATE_ABORTED for client abortion and another state for interrupting the current transaction? I do not understand what "interrupting the current transaction" means. A transaction is either committed or rollbacked, I do not know about "interrupted". I mean that IIUC the server usually only reports the error and you must manually send the command "END" or "ROLLBACK" to rollback a failed transaction. When it is rollbacked, probably some stats will be collected in passing, I'm fine with that. If there is an error in a pgbench script, the transaction is aborted, which means for me that the script execution is stopped where it was, and either it is restarted from the beginning (retry) or counted as failure (not retry, just aborted, really). If by interrupted you mean that one script begins a transaction and another ends it, as I said in the review I think that this strange case should be forbidden, so that all the code and documentation trying to manage that can be removed. Ok! 
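The failure-handling flow argued for above can be sketched as a tiny state machine. This is illustrative Python, not pgbench's actual C code; the state names merely echo pgbench's CSTATE_* constants, and `retries_left`, `stats`, and `client` are hypothetical stand-ins:

```python
from enum import Enum, auto

class State(Enum):
    EXECUTING = auto()
    FAILURE = auto()
    RETRY = auto()
    ABORTED = auto()

def on_failure(retries_left: int, stats: dict) -> State:
    """Proposed CSTATE_FAILURE handling: record/count the failure, then
    either retry or abort -- never start the next command of a
    transaction that is already failing."""
    stats["failures"] = stats.get("failures", 0) + 1
    if retries_left > 0:
        return State.RETRY
    return State.ABORTED

def on_retry(client: dict) -> State:
    """Proposed CSTATE_RETRY handling: reinstate the state saved at
    transaction start and proceed to the *first* command again."""
    client["command_index"] = 0
    return State.EXECUTING
```

The point of the sketch is that FAILURE does nothing but count and branch, and RETRY restarts the script from the beginning rather than from the failing command.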
The current RETRY state does memory allocations to generate a message with buffer allocation and so on. This looks like a costly and useless operation. If the user required "retries", then this is normal behavior, the retries are counted and will be printed out in the final report, and there is no point in printing out every single one of them. Maybe you want that debugging, but then costly operations should be guarded. I think we need these debugging messages because, for example, Debugging messages should cost only when under debug. When not under debug, there should be no debugging message, and there should be no cost for building and discarding such messages in the executed code path beyond testing whether the program is under debug. if you use the option --latency-limit, we will never know in advance whether the serialization/deadlock failure will be retried or not. ISTM that it will be shown in the final report. If I want debug, I ask for --debug, otherwise I think that the
Re: GiST VACUUM
Hi, Heikki! Thanks for looking into the patch! > On 11 Jul 2018, at 0:07, Heikki Linnakangas wrote: > > I'm now looking at the first patch in this series, to allow completely empty > GiST pages to be recycled. I've got some questions: > >> --- a/src/backend/access/gist/gist.c >> +++ b/src/backend/access/gist/gist.c >> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size >> freespace, GISTSTATE *giststate) >> GISTInsertStack *item; >> OffsetNumber downlinkoffnum; >> +if(GistPageIsDeleted(stack->page)) >> +{ >> +UnlockReleaseBuffer(stack->buffer); >> +xlocked = false; >> +state.stack = stack = stack->parent; >> +continue; >> +} >> downlinkoffnum = gistchoose(state.r, stack->page, itup, >> giststate); >> iid = PageGetItemId(stack->page, downlinkoffnum); >> idxtuple = (IndexTuple) PageGetItem(stack->page, iid); > > This seems misplaced. This code deals with internal pages, and as far as I > can see, this patch never marks internal pages as deleted, only leaf pages. > However, we should have something like this in the leaf-page branch, to deal > with the case that an insertion lands on a page that was concurrently deleted. That's a bug. Will fix this. > Did you have any tests, where an insertion runs concurrently with vacuum, > that would exercise this? Yes, I've tried to test this, but, obviously, not enough. I'll think more about how to deal with it. > > The code in gistbulkdelete() seems pretty expensive. In the first phase, it > records the parent of every empty leaf page it encounters. In the second > phase, it scans every leaf page of that parent, not only those leaves that > were seen as empty. Yes, in the first patch there is the simplest gistbulkdelete(); the second patch will remember the line pointers of downlinks to empty leaves. > > I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted > page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but > here it's used for a critical check.
This seems to be the same mechanism we > use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, > not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, > not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for > explanation. This patch is missing any comments to explain how this works in > GiST. Will look into this. I remember it was OK half a year ago, but now it is clear to me that I should have documented that part while I still understood it. > > If you crash in the middle of gistbulkdelete(), after it has removed the > downlink from the parent, but before it has marked the leaf page as deleted, > the leaf page is "leaked". I think that's acceptable, but a comment at least > would be good. I was considering doing reverse-split (page merge) concurrency like in Lanin and Shasha's paper, but it is just too complex for little benefit. Will add comments on possible orphan pages. Many thanks! I hope to post an updated patch series this week. Best regards, Andrey Borodin.
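For reference, the recycle-safety check being discussed ultimately rests on a wraparound-aware XID comparison, as in the B-tree mechanism Heikki points to: a deleted page may be recycled only once the XID stamped at deletion precedes the oldest transaction that could still see it. A Python sketch of that modular arithmetic (mirroring PostgreSQL's TransactionIdPrecedes(), which computes (int32)(id1 - id2) and tests the sign; the special XIDs below FirstNormalTransactionId are ignored here for brevity):

```python
XID_MODULUS = 2**32

def xid_precedes(id1: int, id2: int) -> bool:
    """Wraparound-aware 'id1 < id2' for 32-bit transaction IDs:
    the difference, taken modulo 2^32, is negative as a signed
    32-bit value exactly when id1 logically precedes id2."""
    diff = (id1 - id2) % XID_MODULUS
    return diff >= XID_MODULUS // 2

def page_is_recyclable(deleted_xid: int, oldest_xmin: int) -> bool:
    # A deleted page can be recycled once no running transaction can
    # still reach it, i.e. the XID stamped at deletion precedes the
    # oldest transaction horizon.
    return xid_precedes(deleted_xid, oldest_xmin)
```

Note that a plain `<` comparison would give the wrong answer once the XID counter wraps around, which is why the signed-difference form is used.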
Re: [PATCH] btree_gist: fix union implementation for variable length columns
Hi Tom, On Monday, July 9, 2018 7:41:59 PM CEST Tom Lane wrote: > Pavel Raiskup writes: > > while I tried to debug 'gcc -fstack-protector -O3' problems in [1], I > > noticed > > that gbt_var_union() mistreats the first vector element. Patch is attached. > > Hi Pavel! For patches that purport to resolve bugs, we usually like to > add a regression test case that demonstrates the bug in unpatched code. > Can you provide a small test case that does so? (The BZ you pointed to > doesn't seem to address this...) I've been looking at the reproducer for a while now... and I probably need a hint if that's necessary. Turns out the problem is only related to bit/bit varying type (so the patch comments need to be reworded properly, at least) since those are the only types which have implemented the f_l2n() callback. Considering testcase 'bit.sql' (bit(33) type), it's pretty clear that on little endian boxes the problematic strings would be '0011*' (int32 value for '33', because there's varbit header). Big endian boxes would have problems with char[] like {0, 0, 0, 33, ...}. But I fail to construct right order of correctly picked elements into 'bit.data'. The problem probably is (a) that picksplit reorders the elements very often, and probably that (b) random() brings non-determinism into the final tree shape (if the reproducer was to be based on many equal keys in the index). It's amazing to see how the index fixes itself :-) for this bug. Can you help me with the reproducer idea, resp. can we live without the reproducer in this particular case? Pavel
Re: How can we submit code patches that implement our (pending) patents?
On Wed, Jul 11, 2018 at 01:03:44AM +, Tsunakawa, Takayuki wrote: > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > > The core team has considered this matter, and has concluded that it's > > time to establish a firm project policy that we will not accept any code > > that is known to be patent-encumbered. The long-term legal risks and > > complications involved in doing that seem insurmountable, given the > > community's amorphous legal nature and the existing Postgres license > > wording (neither of which are open for negotiation here). Furthermore, > > Postgres has always been very friendly to creation of closed-source > > derivatives, but it's hard to see how inclusion of patented code would > > not cause serious problems for those. The potential benefits of > > accepting patented code just don't seem to justify trying to navigate > > these hazards. > > That decision is very unfortunate as a corporate employee on one hand, > but the firm cleanness looks a bit bright to me as one developer. > > As a practical matter, when and where are you planning to post the > project policy? How would you check and prevent patented code? PG may need a contributor agreement. All I can find on that is: https://www.postgresql.org/message-id/flat/54337476.3040605%402ndquadrant.com#9b1968ddc0fadfe225001adc1a821925 Nico --
Re: How can we submit code patches that implement our (pending) patents?
On Tue, Jul 10, 2018 at 09:47:09AM -0400, Tom Lane wrote: > The core team has considered this matter, and has concluded that it's > time to establish a firm project policy that we will not accept any code > that is known to be patent-encumbered. The long-term legal risks and > complications involved in doing that seem insurmountable, given the > community's amorphous legal nature and the existing Postgres license > wording (neither of which are open for negotiation here). Furthermore, > Postgres has always been very friendly to creation of closed-source > derivatives, but it's hard to see how inclusion of patented code would > not cause serious problems for those. The potential benefits of > accepting patented code just don't seem to justify trying to navigate > these hazards. +1. You should probably consider accepting code that involves patents that are in the public domain or expired by the time of release, though even that involves some legal costs you might not want to incur. E.g., what if a US patent is in the public domain but a corresponding EU patent is not? A search could be done, naturally, but professional patent searches are not cheap! Nico --
Re: How can we submit code patches that implement our (pending) patents?
On Tue, Jul 10, 2018 at 08:20:53AM +, Tsunakawa, Takayuki wrote: > From: Nico Williams [mailto:n...@cryptonector.com] > > On Sat, Jul 07, 2018 at 10:20:35AM -0700, Andres Freund wrote: > > > It's entirely possible to dual license contributions and everything. Why > > > are you making such aggressive statements about a, so far, apparently > > > good faith engagement? > > > > One problem is that many contributors would not want to be tainted by > > knowledge of the patents in question (since that leads to triple > > damages). > > > > How would you protect contributors and core developers from tainting? > > IIUC, you are concerned about the possibility that PG developers would > read the patent document (not the PG source code), and unconsciously > use the patented algorithm for other software that's not related to > PostgreSQL. That would only be helped by not reading the patent > document... BTW, are you relieved the current PostgreSQL doesn't > contain any patented code? As far as I know, PostgreSQL development > process doesn't have a step to check patent and copyright > infringement. You're proposing to include code that implements patented ideas with a suitable patent grant. I would be free to not read the patent, but what if the code or documents mention the relevant patented algorithms? If I come across something like this in the PG source code: /* The following is covered by patents US#... and so on */ now what? I could choose not to read it. But what if I have to touch that code in order to implement an unrelated feature due to some required refactoring of internal interfaces used by your code? Now I have to read it, and now I'm tainted, or I must choose to abandon my project. That is a heavy burden on the community. The community may want to extract a suitable patent grant to make that burden lighter. > > One possible answer is that you wouldn't. But that might reduce the > > size of the community, or lead to a fork. 
> > Yes, that's one unfortunate future, which I don't want to happen of > course. I believe PostgreSQL should accept patent for further > evolution, because PostgreSQL is now a popular, influential software > that many organizations want to join. I don't speak for the PG community, nor the core developers. Speaking for myself only, I hope that you can get, and PG accepts only, the widest possible royalty-free patent grant to the community, allowing others to not be constrained in making derivatives of PG. My advice is to write up a patent grant that allows all to use the relevant patents royalty-free with a no-lawsuit covenant. I.e., make only defensive use of your patents. Nico --
Re: Negotiating the SCRAM channel binding type
On Wed, Jul 11, 2018 at 04:00:47PM +0300, Heikki Linnakangas wrote: > In a nutshell, to get the token for tls-server-end-point, you need to get > the peer's certificate from the TLS library, in raw DER format, and > calculate a hash over it. The hash algorithm depends on the > signatureAlgorithm in the certificate, so you need to parse the certificate > to extract that. We don't want to re-implement X509 parsing, so > realistically we need the TLS library to have support functions for that. > > Looking at the GnuTLS docs, I believe it has everything we need. > gnutls_certificate_get_peers() and gnutls_certificate_get_ours() can be used > to get the certificate, and gnutls_x509_crt_get_signature_algorithm() gets > the signatureAlgorithm. > > The macOS Secure Transport documentation is a bit harder to understand, but > I think it has everything we need as well. > SSLCopyPeerTrust()+SecTrustGetCertificateAtIndex()+SecCertificateCopyData() > functions get you the certificate in DER format. You can get the signature > algorithm with SecCertificateCopyValues(), with the right constants. > > Am I missing something? I think we can support tls-server-end-point with all > TLS implementations we might care about. That seems right to me. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
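To make the procedure discussed above concrete: RFC 5929 defines tls-server-end-point as a hash of the server certificate's DER encoding, using the hash named in the certificate's signatureAlgorithm, except that MD5 and SHA-1 are replaced by SHA-256. A sketch using only the Python standard library (extracting the DER bytes and the signature hash name is the TLS library's job, per the discussion above; the function name here is made up for illustration):

```python
import hashlib

def tls_server_end_point_cb_data(cert_der: bytes, sig_hash_name: str) -> bytes:
    """RFC 5929 tls-server-end-point channel binding data: a hash of
    the certificate's raw DER encoding.  MD5 and SHA-1 signature
    hashes are upgraded to SHA-256; any other hash (sha256, sha384,
    ...) is used as-is."""
    name = sig_hash_name.lower().replace("-", "")
    if name in ("md5", "sha1"):
        name = "sha256"
    h = hashlib.new(name)
    h.update(cert_der)
    return h.digest()
```

The only non-trivial logic is the hash selection; everything else is a plain digest over the DER bytes, which is why the TLS library only needs to expose the certificate and its signatureAlgorithm.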
Re: [HACKERS] GnuTLS support
On 05/06/18 00:44, Peter Eisentraut wrote: On 6/2/18 16:50, Heikki Linnakangas wrote: On 08/03/18 14:13, Peter Eisentraut wrote: There are two failures in the SSL tests that I cannot explain. The tests are for some rather obscure configurations, so the changed behaviors are not obviously wrong, perhaps legitimate implementation differences. But someone wrote those tests with a purpose (probably), so we should have some kind of explanation for the regressions. I applied this over commit 4e0c743c18 (because this doesn't compile against current master, needs rebasing), and ran "make check" in src/test/ssl. All the tests passed. I'm using GnuTLS version 3.5.8. What failures did you see? The patch adjusts the expected test results so that the tests pass. Ah, gotcha. Look for the tests named - "connect with server CA cert, without root CA" So, in this test, the client puts the server's certificate in sslrootcert, but not the CA cert that the server's certificate was signed with. OpenSSL doesn't accept that, but apparently GnuTLS is OK with it. I think the GnuTLS behavior is reasonable, I was actually surprised that OpenSSL is so strict about that. If the user explicitly lists a server's certificate as trusted, by putting it in sslrootcert, it seems reasonable to accept it even if the CA cert is missing. - "CRL belonging to a different CA" Hmm. So in OpenSSL, when we load the CRL, we call X509_STORE_set_flags(cvstore, X509_V_FLAG_CRL_CHECK | X509_V_FLAG_CRL_CHECK_ALL). With that option, if a CRL for the server CA cannot be found (in this case, because the CRL is for a different CA), OpenSSL throws an error. Apparently, GnuTLS is more lenient. At a quick glance, I don't see an option in GnuTLS to change that behavior. But I think we can live with it, it's not wrong per se, just different. - Heikki
Re: [HACKERS] Two pass CheckDeadlock in contentent case
The patch still applies and it's part of this commitfest. On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura wrote: > On 2017-10-03 17:30, Sokolov Yura wrote: >> >> Good day, hackers. >> >> During hard workload sometimes process reaches deadlock timeout >> even if no real deadlock occurred. It is easily reproducible with >> pg_xact_advisory_lock on same value + some time consuming >> operation (update) and many clients. >> >> When backend reaches deadlock timeout, it calls CheckDeadlock, >> which locks all partitions of lock hash in exclusive mode to >> walk through graph and search for deadlock. >> >> If hundreds of backends reaches this timeout trying to acquire >> advisory lock on a same value, it leads to hard-stuck for many >> seconds, cause they all traverse same huge lock graph under >> exclusive lock. >> During this stuck there is no possibility to do any meaningful >> operations (no new transaction can begin). >> >> Attached patch makes CheckDeadlock to do two passes: >> - first pass uses LW_SHARED on partitions of lock hash. >> DeadLockCheck is called with flag "readonly", so it doesn't >> modify anything. >> - If there is possibility of "soft" or "hard" deadlock detected, >> ie if there is need to modify lock graph, then partitions >> relocked with LW_EXCLUSIVE, and DeadLockCheck is called again. >> >> It fixes hard-stuck, cause backends walk lock graph under shared >> lock, and found that there is no real deadlock. 
>> >> Test on 4 socket xeon machine: >> pgbench_zipf -s 10 -c 800 -j 100 -M prepared -T 450 -f >> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5 >> >> ycsb_read_zipf.sql: >> \set i random_zipfian(1, 10 * :scale, 1.01) >> SELECT abalance FROM pgbench_accounts WHERE aid = :i; >> ycsb_update_lock2_zipf.sql: >> \set i random_zipfian(1, 10 * :scale, 1.01) >> select lock_and_update( :i ); >> >> CREATE OR REPLACE FUNCTION public.lock_and_update(i integer) >> RETURNS void >> LANGUAGE sql >> AS $function$ >> SELECT pg_advisory_xact_lock( $1 ); >> UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1; >> $function$ >> >> Without attached patch: >> >> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757 >> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353 >> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017 >> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199 >> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570 >> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697 >> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116 >> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan >> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan >> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166 >> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan >> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625 >> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453 >> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741 >> >> (autovacuum led to trigger deadlock timeout, >> and therefore, to stuck) >> >> Patched: >> >> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980 >> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997 >> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222 >> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296 >> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720 >> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554 >> progress: 35.0 
s, 40035.8 tps, lat 19.343 ms stddev 97.787 >> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465 >> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114 >> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281 >> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487 >> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141 >> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080 >> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482 I am confused about interpreting these numbers. Does it mean that without the patch at the end of 70s, we have completed 40553.2 * 70 transactions and with the patch there are 70 * 38387.9 transactions completed in 70 seconds? I agree that with the patch there is smoother degradation when a lot of backends enter deadlock detection phase. That's nice, but if my interpretation of numbers is correct, that smoothness comes with a cost of decreased TPS. So, it would make sense to have a GUC controlling this behaviour. Talking about GUCs, deadlock_timeout [1] value decides when the deadlock check kicks in. In such cases probably increasing it would suffice and we may not need the patch. Anybody else can comment about how desired is this feature? > Excuse me, corrected version is in attach. > About the patch itself, I think releasing the shared lock and taking exclusive lock on lock partitions, may lead to starvation. If there are multiple backends that enter deadlock detection
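The two-pass structure the patch implements can be sketched abstractly. This is an illustrative Python model, not the actual C patch; `PartitionLock` and `deadlock_check` are toy stand-ins for the lock-hash partition LWLocks and DeadLockCheck():

```python
class PartitionLock:
    """Toy shared/exclusive lock standing in for one lock-hash partition."""
    def __init__(self):
        self.shared_acquires = 0
        self.exclusive_acquires = 0
    def acquire(self, exclusive: bool):
        if exclusive:
            self.exclusive_acquires += 1
        else:
            self.shared_acquires += 1
    def release(self):
        pass

def check_deadlock(partitions, deadlock_check):
    """Two-pass CheckDeadlock: pass 1 walks the wait graph read-only
    under shared partition locks; only if it reports a possible (soft
    or hard) deadlock do we relock exclusively and re-run the check,
    which may then modify the lock graph."""
    for p in partitions:
        p.acquire(exclusive=False)
    suspect = deadlock_check(readonly=True)
    for p in partitions:
        p.release()
    if not suspect:
        return False  # common contended case: no exclusive locks taken
    # The graph may have changed between passes, so re-check under
    # exclusive locks before rearranging anything.
    for p in partitions:
        p.acquire(exclusive=True)
    result = deadlock_check(readonly=False)
    for p in partitions:
        p.release()
    return result
```

The payoff is visible in the model: when many backends hit deadlock_timeout on the same hot lock and no real deadlock exists, all of them finish in the shared-lock pass and the exclusive path is never taken.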
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 2018-Jul-11, David Rowley wrote:

> On 6 July 2018 at 21:25, Kato, Sho wrote:
> > 2. 11beta2 + patch1 + patch2
> >
> > patch1: Allow direct lookups of AppendRelInfo by child relid
> > commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> > patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
> >
> >  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> > ----------+-------------+-------------+----------------+----------------+---------------
> >       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
> >       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
> >       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
> >       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
> >      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
> >      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
> >      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264
>
> Just a note to say that the "Allow direct lookups of AppendRelInfo by
> child relid" patch is already in master. It's much more relevant to be
> testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Hello Marina, * -d/--debug: I'm not in favor of requiring a mandatory text argument on this option. As you wrote in [1], adding an additional option is also a bad idea: Hey, I'm entitled to some internal contradictions:-) I'm sceptical of the "--debug-fails" options. ISTM that --debug is already there and should just be reused. I was thinking that you could just use the existing --debug, not change its syntax. My point was that --debug exists, and you could just print the messages when under --debug. Maybe it's better to use an optional argument/arguments for compatibility (--debug[=fails] or --debug[=NUM])? But if we use the numbers, now I can see only 2 levels, and there's no guarantee that they will not change... Optional arguments to options (!) are not really clean things, so I'd like to avoid going down this path, esp. as I cannot see any other instance in pgbench or elsewhere in postgres, and I personally consider these a bad idea. So if absolutely necessary, a new option is still better than changing the --debug syntax. If not necessary, then it is better:-) * I'm reserved about the whole ereport thing, see comments in other messages. Thank you, I'll try to implement the error reporting in the way you suggested. Dunno if it is a good idea either. The committer's word is the one that counts in the end:-) Thank you, I'll fix this. I'm sorry, I'll fix this. You do not have to thank me or be sorry on every comment I make; once is enough for the former, and there is no need for the latter. * doCustom changes. On CSTATE_FAILURE, the next command is possibly started. Although there is some consistency with the previous point, I think that it totally breaks the state automaton, where now a command can start while the whole transaction is in a failing state anyway. There was no point in starting it in the first place. So, for me, the FAILURE state should record/count the failure, then skip to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
This is much clearer that way. Then RETRY should reinstate the global state and proceed to start the *first* command again. <...> It is unclear to me why backslash command errors are turned to FAILURE instead of ABORTED: there is no way they are going to be retried, so maybe they should/could skip directly to ABORTED? So do you propose to execute the command "ROLLBACK" without calculating its latency etc. if we are in a failed transaction, and clear the conditional stack after each failure? Also, just to be clear: do you want to have the state CSTATE_ABORTED for client abortion and another state for interrupting the current transaction? I do not understand what "interrupting the current transaction" means. A transaction is either committed or rolled back; I do not know about "interrupted". When it is rolled back, probably some stats will be collected in passing; I'm fine with that. If there is an error in a pgbench script, the transaction is aborted, which means for me that the script execution is stopped where it was, and either it is restarted from the beginning (retry) or counted as a failure (not retried, just aborted, really). If by interrupted you mean that one script begins a transaction and another ends it, as I said in the review, I think that this strange case should be forbidden, so that all the code and documentation trying to manage it can be removed. The current RETRY state does memory allocations to generate a message, with buffer allocation and so on. This looks like a costly and useless operation. If the user requested "retries", then this is normal behavior: the retries are counted and will be printed out in the final report, and there is no point in printing out every single one of them. Maybe you want that for debugging, but then costly operations should be guarded. I think we need these debugging messages because, for example, Debugging messages should cost only when under debug. 
When not under debug, there should be no debugging messages, and there should be no cost for building and discarding such messages in the executed code path beyond testing whether the program is under debug. if you use the option --latency-limit, we will never know in advance whether the serialization/deadlock failure will be retried or not. ISTM that it will be shown in the final report. If I want debug, I ask for --debug; otherwise I think that the command should do what it was asked for, i.e. run scripts, collect performance statistics and show them at the end. In particular, when running with retries enabled, the user is expecting deadlock/serialization errors, so they are not "errors" as such for them. They also help to understand which limit of retries was violated or how close we were to these limits during the execution of a specific transaction. But I agree with you that they are costly and can be skipped if the failure type is never retried. Maybe it is better to split them
Re: Concurrency bug in UPDATE of partition-key
On 2018-Jul-11, Amit Kapila wrote: > Attached, please find an updated patch based on comments by Alvaro. > See, if this looks okay to you guys. LGTM as far as my previous comments are concerned. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: _isnan() on Windows
On 2018-Jul-11, Michael Paquier wrote: > On Tue, Jul 10, 2018 at 04:23:42PM -0400, Tom Lane wrote: > > Alvaro Herrera writes: > >> On 2018-Jul-10, Tom Lane wrote: > >>> I disagree --- including <float.h> in c.h, as this would have us do, > >>> seems like a huge expansion of the visibility of that header. Moreover, > >>> doing that only on Windows seems certain to lead to missing-header > >>> problems in the reverse direction, ie patches developed on Windows > >>> will fail elsewhere. > > > >> I don't think so, because that's only done for MSVC older than 2013. > >> Nobody uses that for new development anymore. > > > > Hm. OK, maybe it's all right given that. I'm still a bit worried > > about downsides, but no doubt the buildfarm will tell us. > > It seems to me that this patch is a good idea. Any objections if I take > care of it? I have a Windows VM with only MSVC 2015 if I recall > correctly though... I just pushed it before seeing your message. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add function to release an allocated SQLDA
On Tue, Jun 19, 2018 at 9:11 PM, Kato, Sho wrote: >>This is not clear to me. ECPGfreeSQLDA() releases a whole chain, but >>free() only releases a single SQLDA(), so they are obviously not >>interchangeable. When exactly should a user prefer one over the other? > > If an application uses FETCH ALL to get the result set at once, a user should > use ECPGfreeSQLDA(). > Because the user does not know that the SQLDA holds the result set in the > chain. > If it is not released, it will leak. > > Considering the use of people who know the structure of SQLDA, I described > the releasing method using free() in the document. I wonder how other ESQL/C implementations deal with this stuff. The DB2 SQLDA struct[1] doesn't have desc_next. The Informix one[2] does, but I'm not entirely sure what it's for since Informix apparently doesn't support EXEC SQL FETCH ALL. The Informix manual also lists a function SqlFreeMem() that is needed on Windows for the same reason we need a new function[3]. I don't think there is anyone left who cares about compatibility with Informix so (unless there is some guidance from the spec here) I think your proposed name ECPGfreeSQLDA() is OK. test/sql/sqlda.pgc includes a bit where SQLDA objects are still freed with free(). That's because the loop does extra processing with each SQLDA: while (outp_sqlda1) { blah blah... outp_sqlda1 = outp_sqlda1->desc_next; free(ptr); } To allow that type of usage, we would need two new functions: ECPGfreeSQLDA() to free just one, and ECPGfreeAllSQLDA() to free the whole chain. The minimum thing to back-patch would be an ECPGfreeSQLDA() function to free just one, since that fixes a potential bug for Windows users (the same reason we back-patched 4c8156d87108fa1f245bee13775e76819cd46a90). Then ECPGfreeAllSQLDA() (or better name) could be a separate patch to discuss for PG 12. Does that make sense? 
sqlda.c: +void +ECPGfreeSQLDA_informix(struct sqlda_compat *sqlda_ptr) +{ ecpglib.h: +void ECPGfreeSQLDA_compat(struct sqlda_compat *); I think these function names were supposed to match? [1] https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.db2.luw.apdv.api.doc/doc/r0001751.html [2] https://www.ibm.com/support/knowledgecenter/en/SSGU8G_12.1.0/com.ibm.esqlc.doc/ids_esqlc_0656.htm [3] https://www.ibm.com/support/knowledgecenter/en/SSGU8G_11.50.0/com.ibm.esqlc.doc/xbsql1007134.htm -- Thomas Munro http://www.enterprisedb.com
Re: Negotiating the SCRAM channel binding type
On 11/07/18 12:27, Heikki Linnakangas wrote: Based on recent discussions, it looks like there are going to be differences in this area [1]. OpenSSL can support both tls-unique and tls-server-end-point. Java only supports tls-server-end-point, while GnuTLS only supports tls-unique. And macOS Secure Transport supports neither one. Furthermore, it's not clear how TLS v1.3 affects this. tls-unique might no longer be available in TLS v1.3, but we might get new channel binding types to replace it. So this is about to get really messy, if there is no way to negotiate. (Yes, it's going to be messy even with negotiation.) I've been reading up on the discussions on GnuTLS and Secure Transport, as well as the specs for tls-server-end-point. In a nutshell, to get the token for tls-server-end-point, you need to get the peer's certificate from the TLS library, in raw DER format, and calculate a hash over it. The hash algorithm depends on the signatureAlgorithm in the certificate, so you need to parse the certificate to extract that. We don't want to re-implement X509 parsing, so realistically we need the TLS library to have support functions for that. Looking at the GnuTLS docs, I believe it has everything we need. gnutls_certificate_get_peers() and gnutls_certificate_get_ours() can be used to get the certificate, and gnutls_x509_crt_get_signature_algorithm() gets the signatureAlgorithm. The macOS Secure Transport documentation is a bit harder to understand, but I think it has everything we need as well. SSLCopyPeerTrust()+SecTrustGetCertificateAtIndex()+SecCertificateCopyData() functions get you the certificate in DER format. You can get the signature algorithm with SecCertificateCopyValues(), with the right constants. Am I missing something? I think we can support tls-server-end-point with all TLS implementations we might care about. - Heikki
Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)
On Mon, Apr 30, 2018 at 1:38 PM Thomas Munro wrote: > On Mon, Apr 30, 2018 at 3:13 PM, Andres Freund wrote: > > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote: > >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up > >> as "reads" when in fact they are not reads at all: > >> > >> 1. Relation extension, which in fact writes a zero-filled block. > >> 2. The RBM_ZERO_* modes, which provoke neither read nor write. > > > > Just for understanding: 2) can never happen for buffers showing up in > > EXPLAIN, right? > > > > I'm not saying you shouldn't fix the accounting... > > Maybe the hash AM can reach that in _hash_getinitbuf() while adding > overflow pages to bucket chains? Admittedly case 2 is obscure and > rare if not unreachable and probably no one would care too much about > that in practice (whereas case 1 can be seen by simply inserting stuff > into a new empty table). Other patches I'm working on for later > proposal do it more often (think accessing known-empty pages in a kind > of preallocated extent), and it occurred to me that it's clearly a bug > on principle, hence this patch. > I checked the patch and I agree with change 1). And regarding change 2), whether it is zeroing the contents of the page or not, does it still count as a read? Because if the page exists in the buffer pool, we count it as a hit irrespective of the mode. Am I missing something? Regards, Haribabu Kommi Fujitsu Australia
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Wed, Jul 11, 2018 at 5:36 PM, amul sul wrote: > On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar wrote: >> > I am not sure that I have understood the following comments > 11 +* Generate one prune step for the information derived from IS NULL, > 12 +* if any. To prune hash partitions, we must have found IS NULL > 13 +* clauses for all partition keys. > 14 */ > > I am not sure that I have understood this -- no such restriction is > required to prune the hash partitions, if I am not missing anything. Maybe it's not very clear, but these are the original comments, which I have retained. I just moved them out of the (!generate_opsteps) condition. Just to explain this comment, consider the example below: create table hp (a int, b text) partition by hash (a, b); create table hp0 partition of hp for values with (modulus 4, remainder 0); create table hp3 partition of hp for values with (modulus 4, remainder 3); create table hp1 partition of hp for values with (modulus 4, remainder 1); create table hp2 partition of hp for values with (modulus 4, remainder 2); postgres=# insert into hp values (1, null); INSERT 0 1 postgres=# insert into hp values (2, null); INSERT 0 1 postgres=# select tableoid::regclass, * from hp; tableoid | a | b --+---+--- hp1 | 1 | hp2 | 2 | (2 rows) Now, if we query based on "b is null" then we cannot decide which partition should be pruned, whereas in the case of other schemes, such a row will go to the default partition, so we can prune all other partitions. explain (costs off) select * from hp where b is null; -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Preferring index-only-scan when the cost is equal
On 07/11/2018 01:28 PM, Ashutosh Bapat wrote: On Wed, Jul 11, 2018 at 11:03 AM, Yugo Nagata wrote: Hi, I found that there is a situation where, even when an index only scan can be effective, the planner doesn't select it. The planner makes index paths in descending order of indexrelid, and the new path is discarded if its cost is not less than the existing paths' cost. As a result, an IndexOnlyScan path can be discarded even though it may be more effective than a normal IndexScan. Here is an example; =# create table test1 (i int, d int); CREATE TABLE =# create index on test1(i) include (d); CREATE INDEX =# explain select d from test1 where i = 0; QUERY PLAN -- Index Only Scan using test1_i_d_idx on test1 (cost=0.15..36.35 rows=11 width=4) Index Cond: (i = 0) (2 rows) =# create index on test1(i) ; CREATE INDEX =# explain select d from test1 where i = 0; QUERY PLAN --- Index Scan using test1_i_idx on test1 (cost=0.15..36.35 rows=11 width=4) Index Cond: (i = 0) (2 rows) This is not new for the covered index feature. We can see the same thing when using multi-column indexes. =# create table test2 (i int, d int); CREATE TABLE =# create index on test2(i,d); CREATE INDEX =# explain select d from test2 where i = 0; QUERY PLAN -- Index Only Scan using test2_i_d_idx on test2 (cost=0.15..36.35 rows=11 width=4) Index Cond: (i = 0) (2 rows) =# create index on test2(i); CREATE INDEX =# explain select d from test2 where i = 0; QUERY PLAN --- Index Scan using test2_i_idx on test2 (cost=0.15..36.35 rows=11 width=4) Index Cond: (i = 0) (2 rows) Attached is a patch to prefer index-only-scan when the cost is equal to another index path. Honestly, I'm not sure this is the best way. Any comments and advice would be appreciated. I don't think we should change add_path() for this. We will unnecessarily check that condition even for the cases where we do not create index paths. I think we should fix the caller of add_path() instead to add the index only path before any index paths. 
For that the index list needs to be sorted by the possibility of using index only scan. But I think in your case, it might be better to first check whether there is any costing error because of which index only scan's path has the same cost as index scan path. Also I don't see any testcase which will show why index only scan would be more efficient in your case. May be provide output of EXPLAIN ANALYZE. I suspect this only happens due to testing on empty tables. Not only is testing of indexes on small tables rather pointless in general, but more importantly there will be no statistics. So we fall back to some default estimates, but we also don't have relallvisible etc which is crucial for estimating index-only scans. I'd bet that's why the cost estimates for index scans and index-only scans are the same here. Yugo-san, have you observed this behavior on larger tables? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: patch to allow disable of WAL recycling
Alvaro, I'll perform several test runs with various combinations and post the results. Thanks, Jerry On Tue, Jul 10, 2018 at 2:34 PM, Alvaro Herrera wrote: > On 2018-Jul-10, Jerry Jelinek wrote: > > > 2) Disabling WAL recycling reduces reliability, even on COW filesystems. > > I think the problem here is that WAL recycling in normal filesystems > helps protect the case where filesystem gets full. If you remove it, > that protection goes out the window. You can claim that people needs to > make sure to have available disk space, but this does become a problem > in practice. I think the thing to do is verify what happens with > recycling off when the disk gets full; is it possible to recover > afterwards? Is there any corrupt data? What happens if the disk gets > full just as the new WAL file is being created -- is there a Postgres > PANIC or something? As I understand, with recycling on it is easy (?) > to recover, there is no PANIC crash, and no data corruption results. > > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar wrote: > > On Wed, Jul 11, 2018 at 4:20 PM, Dilip Kumar wrote: > > On Wed, Jul 11, 2018 at 3:06 PM, Ashutosh Bapat > > wrote: > >> Hi, > >> Consider following test case. > >> create table prt (a int, b int, c int) partition by range(a, b); > >> create table prt_p1 partition of prt for values (0, 0) to (100, 100); > >> create table prt_p1 partition of prt for values from (0, 0) to (100, 100); > >> create table prt_p2 partition of prt for values from (100, 100) to (200, > >> 200); > >> create table prt_def partition of prt default; > >> > > > --- a/src/backend/partitioning/partprune.c > > +++ b/src/backend/partitioning/partprune.c > > @@ -857,7 +857,7 @@ > > gen_partprune_steps_internal(GeneratePruningStepsContext *context, > > * If generate_opsteps is set to false it means no OpExprs were > > directly > > * present in the input list. > > */ > > - if (!generate_opsteps) > > + if (nullkeys || !generate_opsteps) > > { > > /* > > * Generate one prune step for the information derived > > from IS NULL, > > @@ -865,8 +865,7 @@ > > gen_partprune_steps_internal(GeneratePruningStepsContext *context, > > * clauses for all partition keys. > > */ > > if (!bms_is_empty(nullkeys) && > > - (part_scheme->strategy != PARTITION_STRATEGY_HASH || > > -bms_num_members(nullkeys) == > > part_scheme->partnatts)) > > + (part_scheme->strategy != PARTITION_STRATEGY_HASH)) > > { > > PartitionPruneStep *step; > > > > postgres=# explain verbose select * from prt where a is null and b = 100; > > QUERY PLAN > > -- > > Append (cost=0.00..35.51 rows=1 width=12) > >-> Seq Scan on public.prt_def (cost=0.00..35.50 rows=1 width=12) > > Output: prt_def.a, prt_def.b, prt_def.c > > Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100)) > > (4 rows) > > > > Above fix is just to show the root cause of the issue, I haven't > > investigated that what should be the exact fix for this issue. > > > > I think the actual fix should be as attached. 
> I am not sure that I have understood the following comments 11 +* Generate one prune step for the information derived from IS NULL, 12 +* if any. To prune hash partitions, we must have found IS NULL 13 +* clauses for all partition keys. 14 */ I am not sure that I have understood this -- no such restriction is required to prune the hash partitions, if I am not missing anything. Regards, Amul
Re: Negotiating the SCRAM channel binding type
On 11/07/18 14:37, Michael Paquier wrote: On Wed, Jul 11, 2018 at 12:27:33PM +0300, Heikki Linnakangas wrote: as the supported mechanisms, we change that into: SCRAM-SHA-256-PLUS tls-unique SCRAM-SHA-256-PLUS tls-server-end-point SCRAM-SHA-256 Can we say for sure that there won't be other options associated to a given SASL mechanism? It seems to me that something like the following is better long-term, with channel binding available as a comma-separated list of items: SCRAM-SHA-256-PLUS channel_bindings=tls-unique,tls-server-end-point SCRAM-SHA-256 That would be more complicated to parse. Yeah, we might need further options for some SASL mechanisms in the future, but we can cross that bridge when we get there. I don't see any need to complicate this case for that. I started digging into this more closely, and ran into a little problem. If channel binding is not used, the client sends a flag to the server to indicate if it's because the client doesn't support channel binding, or because it thought that the server doesn't support it. This is the gs2-cbind-flag. How should the flag be set, if the server supports channel binding type A, while client supports only type B? The purpose of the flag is to detect downgrade attacks, where a man-in-the-middle removes the PLUS variants from the list of supported mechanisms. If we treat incompatible channel binding types as "client doesn't support channel binding", then the downgrade attack becomes possible (the attacker can replace the legit PLUS variants with bogus channel binding types that the client surely doesn't support). If we treat it as "server doesn't support channel binding", then we won't downgrade to the non-channel-binding variant, in the legitimate case that the client and server both support channel binding, but with incompatible types. What we'd really want, is to include the full list of server's supported mechanisms as part of the exchange, not just a boolean "y/n" flag. But that's not what the spec says :-(. 
I guess we should err on the side of caution, and fail the authentication in that case. That's unfortunate, but it's better than not negotiating at all. At least with the negotiation, the authentication will work if there is any mutually supported channel binding type. - Heikki
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Wed, Jul 11, 2018 at 4:20 PM, Dilip Kumar wrote: > On Wed, Jul 11, 2018 at 3:06 PM, Ashutosh Bapat > wrote: >> Hi, >> Consider following test case. >> create table prt (a int, b int, c int) partition by range(a, b); >> create table prt_p1 partition of prt for values (0, 0) to (100, 100); >> create table prt_p1 partition of prt for values from (0, 0) to (100, 100); >> create table prt_p2 partition of prt for values from (100, 100) to (200, >> 200); >> create table prt_def partition of prt default; >> > --- a/src/backend/partitioning/partprune.c > +++ b/src/backend/partitioning/partprune.c > @@ -857,7 +857,7 @@ > gen_partprune_steps_internal(GeneratePruningStepsContext *context, > * If generate_opsteps is set to false it means no OpExprs were > directly > * present in the input list. > */ > - if (!generate_opsteps) > + if (nullkeys || !generate_opsteps) > { > /* > * Generate one prune step for the information derived > from IS NULL, > @@ -865,8 +865,7 @@ > gen_partprune_steps_internal(GeneratePruningStepsContext *context, > * clauses for all partition keys. > */ > if (!bms_is_empty(nullkeys) && > - (part_scheme->strategy != PARTITION_STRATEGY_HASH || > -bms_num_members(nullkeys) == part_scheme->partnatts)) > + (part_scheme->strategy != PARTITION_STRATEGY_HASH)) > { > PartitionPruneStep *step; > > postgres=# explain verbose select * from prt where a is null and b = 100; > QUERY PLAN > -- > Append (cost=0.00..35.51 rows=1 width=12) >-> Seq Scan on public.prt_def (cost=0.00..35.50 rows=1 width=12) > Output: prt_def.a, prt_def.b, prt_def.c > Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100)) > (4 rows) > > Above fix is just to show the root cause of the issue, I haven't > investigated that what should be the exact fix for this issue. > I think the actual fix should be as attached. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com generate_prunestep_for_isnull.patch Description: Binary data
Re: Negotiating the SCRAM channel binding type
On Wed, Jul 11, 2018 at 12:27:33PM +0300, Heikki Linnakangas wrote: > as the supported mechanisms, we change that into: > > SCRAM-SHA-256-PLUS tls-unique > SCRAM-SHA-256-PLUS tls-server-end-point > SCRAM-SHA-256 Can we say for sure that there won't be other options associated to a given SASL mechanism? It seems to me that something like the following is better long-term, with channel binding available as a comma-separated list of items: SCRAM-SHA-256-PLUS channel_bindings=tls-unique,tls-server-end-point SCRAM-SHA-256 > Thoughts? Unfortunately this breaks compatibility with current v11beta > clients, but IMHO we must fix this. If we want to alleviate that, and save a > few bytes of network traffic, we could decide that plain > "SCRAM-SHA-256-PLUS" implies tls-unique, so the list would be: > > SCRAM-SHA-256-PLUS > SCRAM-SHA-256-PLUS tls-server-end-point > SCRAM-SHA-256 > > That would be mostly compatible with current libpq clients. I am not sure it is worth caring about the compatibility at a beta2 stage, as v11 is not released yet. Still I agree that not specifying anything should mean the default, which is per the RFCs currently existing tls-unique. Another piece of the puzzle is here as well: https://www.postgresql.org/message-id/flat/20180122072936.gb1...@paquier.xyz We need to allow a given SSL implementation to specify what the list of channel bindings it supports is. -- Michael signature.asc Description: PGP signature
Re: Problem with tupdesc in jsonb_to_recordset
On Wed, Jul 11, 2018 at 11:38:46AM +0100, Andrew Gierth wrote: > My first approach - assuming that update_cached_tupdesc has good reason > to make a copy, which I'm not convinced is the case - would be to simply > make a per-query-context copy of the tupdesc to assign to rsi.setDesc in > order to conform to the call protocol. I see what you are coming at here, thanks for the input. I am not really convinced that update_cached_tupdesc needs to do a new copy either, but let's see what Tom has to say on the matter. That's his feature and his code. -- Michael signature.asc Description: PGP signature
Re: Preferring index-only-scan when the cost is equal
On Wed, Jul 11, 2018 at 11:03 AM, Yugo Nagata wrote: > Hi, > > I found that there is a situation where, even when an index only scan can be > effective, > the planner doesn't select it. The planner makes index paths in descending > order of indexrelid, and the new path is discarded if its cost is not less > than > the existing paths' cost. As a result, an IndexOnlyScan path can be discarded > even though it may be more effective than a normal IndexScan. > > Here is an example; > > =# create table test1 (i int, d int); > CREATE TABLE > =# create index on test1(i) include (d); > CREATE INDEX > > =# explain select d from test1 where i = 0; > QUERY PLAN > -- > Index Only Scan using test1_i_d_idx on test1 (cost=0.15..36.35 rows=11 > width=4) >Index Cond: (i = 0) > (2 rows) > > =# create index on test1(i) ; > CREATE INDEX > =# explain select d from test1 where i = 0; > QUERY PLAN > --- > Index Scan using test1_i_idx on test1 (cost=0.15..36.35 rows=11 width=4) >Index Cond: (i = 0) > (2 rows) > > > This is not new for the covered index feature. We can see the same thing when > using > multi-column indexes. > > > =# create table test2 (i int, d int); > CREATE TABLE > =# create index on test2(i,d); > CREATE INDEX > =# explain select d from test2 where i = 0; > QUERY PLAN > -- > Index Only Scan using test2_i_d_idx on test2 (cost=0.15..36.35 rows=11 > width=4) >Index Cond: (i = 0) > (2 rows) > > =# create index on test2(i); > CREATE INDEX > =# explain select d from test2 where i = 0; > QUERY PLAN > --- > Index Scan using test2_i_idx on test2 (cost=0.15..36.35 rows=11 width=4) >Index Cond: (i = 0) > (2 rows) > > > Attached is a patch to prefer index-only-scan when the cost is equal to > another index > path. Honestly, I'm not sure this is the best way. Any comments and advice > would > be appreciated. > I don't think we should change add_path() for this. We will unnecessarily check that condition even for the cases where we do not create index paths. 
I think we should fix the caller of add_path() instead to add index only path before any index paths. For that the index list needs to be sorted by the possibility of using index only scan. But I think in your case, it might be better to first check whether there is any costing error because of which index only scan's path has the same cost as index scan path. Also I don't see any testcase which will show why index only scan would be more efficient in your case. May be provide output of EXPLAIN ANALYZE. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Allow to specify a index name as ANALYZE parameter
Hi! On Wed, Jul 11, 2018 at 12:04 PM Yugo Nagata wrote: > When we specify column names for ANALYZE, only the statistics for those > columns > are collected. Similarly, is it useful if we have an option to specify an index > for ANALYZE to collect only the statistics for expressions in the specified > index? > > A use case I suppose is when a new expression index is created and > we need only the statistics for the new index. Although ANALYZE of all the > indexes > is usually fast because ANALYZE uses a random sampling of the table rows, > ANALYZE > on the specific index may still be useful if there are other indexes whose > "statistics > target" is large and/or whose expression takes time to compute, for example. > > Attached is a WIP patch to allow specifying an index name as an ANALYZE > parameter. > Documentation is not yet included. I would appreciate any feedback! I think this makes sense. Once we can collect statistics individually for regular columns, we should be able to do the same for indexed expressions. Please register this patch on the next commitfest. Regarding the current implementation, I found the message "ANALYZE option must be specified when a column list is provided" to be confusing. Perhaps you've missed some negation in this message, since in fact you disallow analyze with a column list. However, since a multicolumn index may contain multiple expressions, I think we should support specifying columns for the ANALYZE index clause. We could support specifying columns by their numbers in the same way we did for "ALTER INDEX index ALTER COLUMN colnum SET STATISTICS number". -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Wed, Jul 11, 2018 at 1:23 PM, Etsuro Fujita wrote: > > > Actually, even if we could create such an index on the child table and the > targetlist had the ConvertRowtypeExpr, the planner would still not be able > to use an index-only scan with that index; because check_index_only would > not consider that an index-only scan is possible for that index because that > index is an expression index and that function currently does not consider > that index expressions are able to be returned back in an index-only scan. > That behavior of the planner might be improved in future, though. > >> Pathkey points to an equivalence class, which contains equivalence >> members. A parent equivalence class member containing a whole-row >> reference gets translated into a child equivalence member containing a >> ConvertRowtypeExpr. Right, and when we do so, not having the ConvertRowtypeExpr in the targetlist will be a problem. > > > I think so too. > >> At places in planner we match equivalence members >> to the targetlist entries. This matching will fail unexpectedly when >> ConvertRowtypeExpr is removed from a child's targetlist. But again I >> couldn't reproduce a problem when such a mismatch arises. > > > IIUC, I don't think the planner assumes that for an equivalence member there > is a matching entry for that member in the targetlist; what I think the > planner assumes is: an equivalence member is able to be computed from > expressions in the targetlist. This is true. However, > So, I think it is safe to have whole-row > Vars instead of ConvertRowtypeExprs in the targetlist. when it's looking for an expression, it finds a whole-row expression, so it thinks it needs to add a ConvertRowtypeExpr on top of that. But when the plan is created, there is a ConvertRowtypeExpr already, and there is no way to know that a new ConvertRowtypeExpr is not needed anymore. So, we may have two ConvertRowtypeExprs giving wrong results. 
-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Jsonb transform for pl/python
On Wed, Jul 11, 2018 at 12:30 AM Tom Lane wrote: > Peter Eisentraut writes: > > On 6/23/18 01:44, Nikita Glukhov wrote: > >> We are simply trying first to convert numeric to int64 if it does not have > >> digits after the decimal point, and then construct Python Int instead of > >> Decimal. Standard Python json.loads() does the same for exact integers. > > > We just had a very similar but not identical discussion in the thread > > about the PL/Perl transforms, where we rejected the proposal. Other > > comments on this one? > > I can sympathize with the speed concern, but the proposed patch produces > a functional change (ie, you get a different kind of Python object, with > different behaviors, in some cases). That seems to me to be a good reason > to reject it. Nevertheless, as Nikita pointed out, json.loads() behaves the same way in Python. See an example for Python 2.7.10: >>> import json >>> type(json.loads('1')) <type 'int'> >>> type(json.loads('1.0')) <type 'float'> So, from the postgres point of view this behavior is wrong. But on the other hand, why not let the Python transform behave like Python? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Wed, Jul 11, 2018 at 3:06 PM, Ashutosh Bapat wrote: > Hi, > Consider the following test case. > create table prt (a int, b int, c int) partition by range(a, b); > create table prt_p1 partition of prt for values from (0, 0) to (100, 100); > create table prt_p2 partition of prt for values from (100, 100) to (200, 200); > create table prt_def partition of prt default; > > In a range partitioned table, a row with any partition key NULL goes > to the default partition if it exists. > insert into prt values (null, 1); > insert into prt values (1, null); > insert into prt values (null, null); > select tableoid::regclass, * from prt; > tableoid | a | b | c > --+---+---+--- > prt_def | | 1 | > prt_def | 1 | | > prt_def | | | > (3 rows) > > There's a comment in get_partition_for_tuple(), which says so. > /* > * No range includes NULL, so this will be accepted by the > * default partition if there is one, and otherwise rejected. > */ > > But when there is an IS NULL clause on any of the partition keys with > some condition on the other partition key, all the partitions are scanned. I > expected pruning to prune all the partitions except the default one.
> > explain verbose select * from prt where a is null and b = 100; > QUERY PLAN > -- > Append (cost=0.00..106.52 rows=3 width=12) >-> Seq Scan on public.prt_p1 (cost=0.00..35.50 rows=1 width=12) > Output: prt_p1.a, prt_p1.b, prt_p1.c > Filter: ((prt_p1.a IS NULL) AND (prt_p1.b = 100)) >-> Seq Scan on public.prt_p2 (cost=0.00..35.50 rows=1 width=12) > Output: prt_p2.a, prt_p2.b, prt_p2.c > Filter: ((prt_p2.a IS NULL) AND (prt_p2.b = 100)) >-> Seq Scan on public.prt_def (cost=0.00..35.50 rows=1 width=12) > Output: prt_def.a, prt_def.b, prt_def.c > Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100)) > (10 rows) > > I thought that the following code in get_matching_range_bounds() > /* > * If there are no datums to compare keys with, or if we got an IS NULL > * clause just return the default partition, if it exists. > */ > if (boundinfo->ndatums == 0 || !bms_is_empty(nullkeys)) > { > result->scan_default = partition_bound_has_default(boundinfo); > return result; > } > > would do the trick but through the debugger I saw that nullkeys is > NULL for this query. > > I didn't investigate further to see why nullkeys is NULL, but it looks > like that's the problem and we are missing an optimization. I think the problem is that gen_partprune_steps_internal expects that all the keys should match an IS NULL clause; only in such a case will the "partition pruning step" store the nullkeys. After a small change, it is able to prune the partitions. --- a/src/backend/partitioning/partprune.c +++ b/src/backend/partitioning/partprune.c @@ -857,7 +857,7 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context, * If generate_opsteps is set to false it means no OpExprs were directly * present in the input list. */ - if (!generate_opsteps) + if (nullkeys || !generate_opsteps) { /* * Generate one prune step for the information derived from IS NULL, @@ -865,8 +865,7 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context, * clauses for all partition keys.
*/ if (!bms_is_empty(nullkeys) && - (part_scheme->strategy != PARTITION_STRATEGY_HASH || -bms_num_members(nullkeys) == part_scheme->partnatts)) + (part_scheme->strategy != PARTITION_STRATEGY_HASH)) { PartitionPruneStep *step; postgres=# explain verbose select * from prt where a is null and b = 100; QUERY PLAN -- Append (cost=0.00..35.51 rows=1 width=12) -> Seq Scan on public.prt_def (cost=0.00..35.50 rows=1 width=12) Output: prt_def.a, prt_def.b, prt_def.c Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100)) (4 rows) The above fix is just to show the root cause of the issue; I haven't investigated what the exact fix should be. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
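The pruning decision being discussed can be sketched outside the planner; the names below are illustrative, not the actual partprune.c symbols. A null-keys-only pruning step is safe for list/range partitioning, because NULLs can only land in one designated (or default) partition; for hash partitioning it is only safe when all keys have IS NULL clauses, since NULL-bearing rows may otherwise be spread across partitions:

```python
# Illustrative sketch (not the real gen_partprune_steps_internal code) of
# when a pruning step based only on IS NULL keys may be generated.
def should_generate_nullkeys_step(strategy, nullkeys, partnatts):
    if not nullkeys:
        return False                       # no IS NULL clauses at all
    if strategy == 'hash':
        # Hash: rows with a NULL key can live in any partition, so the
        # step is only valid when every key is constrained by IS NULL.
        return len(nullkeys) == partnatts
    # List/range: any IS NULL key routes the row to the designated
    # null-accepting (or default) partition, so the step is always safe.
    return True

assert should_generate_nullkeys_step('range', {'a'}, 2) is True
assert should_generate_nullkeys_step('hash', {'b'}, 2) is False
assert should_generate_nullkeys_step('hash', {'a', 'b'}, 2) is True
```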
Re: Problem with tupdesc in jsonb_to_recordset
> "Michael" == Michael Paquier writes: >> I don't think that your solution is correct. From my read of >> 37a795a6, the tuple descriptor is moved from the query-lifespan >> memory context (ecxt_per_query_memory) to a function-level context, >> which could cause the descriptor to become busted as your test is >> pointing out. Tom? Michael> Hacking my way out I am finishing with the attached, which Michael> fixes the problem and passes all regression tests. I am not Michael> completely sure that this is the right approach though, I would Michael> need to think a bit more about it. What's happening in the original case is this: the SRF call protocol says that it's the executor's responsibility to free rsi.setDesc if it's not refcounted, under the assumption that in such a case it's in per-query memory (and not in either a shorter-lived or longer-lived context). The change to update_cached_tupdesc broke the protocol by causing populate_recordset_worker to set rsi.setDesc to a non-refcounted tupdesc allocated in a context that's not the per-query context. What's not clear here is why populate_recordset_record thinks it needs to update the tupdesc for every record in a single result set. The entire result set will ultimately be treated as having the same tupdesc regardless, so any per-record changes will break things later anyway. Your latest proposed fix isn't OK because it puts the wrong context in cache->fn_mcxt - data that's dangled off flinfo->fn_extra must NOT be in any context that has a different lifetime than flinfo->fn_mcxt (which might not be query lifetime), unless you're taking specific steps to free or invalidate it at the correct point. My first approach - assuming that update_cached_tupdesc has good reason to make a copy, which I'm not convinced is the case - would be to simply make a per-query-context copy of the tupdesc to assign to rsi.setDesc in order to conform to the call protocol. -- Andrew (irc:RhodiumToad)
Re: Negotiating the SCRAM channel binding type
On Wed, Jul 11, 2018 at 12:27:33PM +0300, Heikki Linnakangas wrote: > Currently, there is no negotiation of the channel binding type between > client and server. The server advertises that it supports channel binding, > or not, and the client decides what channel binding to use. If the server > doesn't support the binding type that the client chose, authentication will > fail. > > Based on recent discussions, it looks like there's going to be differences > in this area [1]. OpenSSL can support both tls-unique and > tls-server-end-point. Java only supports tls-server-end-point, while GnuTLS > only supports tls-unique. And Mac OS Secure Transports supports neither one. > Furthermore, it's not clear how TLS v1.3 affects this. tls-unique might no > longer be available in TLS v1.3, but we might get new channel binding types > to replace it. So this is about to get really messy, if there is no way to > negotiate. (Yes, it's going to be messy even with negotiation.) > > I think we must fix that before we release v11, because this affects the > protocol. There needs to be a way for the server to advertise what channel > binding types it supports. Yes, this does make sense. From a security perspective, it doesn't matter which channel binding method we use, assuming there are no unfixed exploits. The difference between the channel binding methods is explained in our PG 11 docs: https://www.postgresql.org/docs/11/static/sasl-authentication.html#SASL-SCRAM-SHA-256 -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Cannot dump foreign key constraints on partitioned table
Hi, On the master head, the getConstraints() function skips FK constraints for a partitioned table because tbinfo->hastriggers is false. While creating FK constraints on the partitioned table, the FK triggers are only created on leaf partitions and skipped for the partitioned tables. To fix this, either bypass the aforementioned condition of getConstraints() or set pg_class.relhastriggers to true for the partitioned table, which doesn't seem to be the right solution, IMO. Thoughts? Regards, Amul
partition pruning doesn't work with IS NULL clause in multikey range partition case
Hi, Consider the following test case. create table prt (a int, b int, c int) partition by range(a, b); create table prt_p1 partition of prt for values from (0, 0) to (100, 100); create table prt_p2 partition of prt for values from (100, 100) to (200, 200); create table prt_def partition of prt default; In a range partitioned table, a row with any partition key NULL goes to the default partition if it exists. insert into prt values (null, 1); insert into prt values (1, null); insert into prt values (null, null); select tableoid::regclass, * from prt; tableoid | a | b | c --+---+---+--- prt_def | | 1 | prt_def | 1 | | prt_def | | | (3 rows) There's a comment in get_partition_for_tuple(), which says so. /* * No range includes NULL, so this will be accepted by the * default partition if there is one, and otherwise rejected. */ But when there is an IS NULL clause on any of the partition keys with some condition on the other partition key, all the partitions are scanned. I expected pruning to prune all the partitions except the default one. explain verbose select * from prt where a is null and b = 100; QUERY PLAN -- Append (cost=0.00..106.52 rows=3 width=12) -> Seq Scan on public.prt_p1 (cost=0.00..35.50 rows=1 width=12) Output: prt_p1.a, prt_p1.b, prt_p1.c Filter: ((prt_p1.a IS NULL) AND (prt_p1.b = 100)) -> Seq Scan on public.prt_p2 (cost=0.00..35.50 rows=1 width=12) Output: prt_p2.a, prt_p2.b, prt_p2.c Filter: ((prt_p2.a IS NULL) AND (prt_p2.b = 100)) -> Seq Scan on public.prt_def (cost=0.00..35.50 rows=1 width=12) Output: prt_def.a, prt_def.b, prt_def.c Filter: ((prt_def.a IS NULL) AND (prt_def.b = 100)) (10 rows) I thought that the following code in get_matching_range_bounds() /* * If there are no datums to compare keys with, or if we got an IS NULL * clause just return the default partition, if it exists.
*/ if (boundinfo->ndatums == 0 || !bms_is_empty(nullkeys)) { result->scan_default = partition_bound_has_default(boundinfo); return result; } would do the trick but through the debugger I saw that nullkeys is NULL for this query. I didn't investigate further to see why nullkeys is NULL, but it looks like that's the problem and we are missing an optimization. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Problem with tupdesc in jsonb_to_recordset
> On Wed, 11 Jul 2018 at 08:27, Michael Paquier wrote: > > On Tue, Jul 10, 2018 at 10:39:28PM +0200, Dmitry Dolgov wrote: > > I've found out that currently in some situations jsonb_to_recordset can > > lead to > > a crash. Minimal example that I've managed to create looks like this: > > It would be better not to send the same email across multiple mailing > lists. Ok, sorry - next time I'll avoid this. > On Wed, Jul 11, 2018 at 03:27:13PM +0900, Michael Paquier wrote: > > I don't think that your solution is correct. From my read of 37a795a6, > > the tuple descriptor is moved from the query-lifespan memory context > > (ecxt_per_query_memory) to a function-level context, which could cause > > the descriptor to become busted as your test is pointing out. Tom? > > Hacking my way out I am finishing with the attached, which fixes the > problem and passes all regression tests. I am not completely sure that > this is the right approach though, I would need to think a bit more about > it. So, you removed the FreeTupleDesc call - does it mean that if io->tupdesc is not freed in some other situation, it's going to be a memory leak?
Negotiating the SCRAM channel binding type
Currently, there is no negotiation of the channel binding type between client and server. The server advertises that it supports channel binding, or not, and the client decides what channel binding to use. If the server doesn't support the binding type that the client chose, authentication will fail. Based on recent discussions, it looks like there's going to be differences in this area [1]. OpenSSL can support both tls-unique and tls-server-end-point. Java only supports tls-server-end-point, while GnuTLS only supports tls-unique. And Mac OS Secure Transports supports neither one. Furthermore, it's not clear how TLS v1.3 affects this. tls-unique might no longer be available in TLS v1.3, but we might get new channel binding types to replace it. So this is about to get really messy, if there is no way to negotiate. (Yes, it's going to be messy even with negotiation.) I think we must fix that before we release v11, because this affects the protocol. There needs to be a way for the server to advertise what channel binding types it supports. I propose that the server list each supported channel binding type as a separate SASL mechanism. For example, where currently the server lists: SCRAM-SHA-256-PLUS SCRAM-SHA-256 as the supported mechanisms, we change that into: SCRAM-SHA-256-PLUS tls-unique SCRAM-SHA-256-PLUS tls-server-end-point SCRAM-SHA-256 Thoughts? Unfortunately this breaks compatibility with current v11beta clients, but IMHO we must fix this. If we want to alleviate that, and save a few bytes of network traffic, we could decide that plain "SCRAM-SHA-256-PLUS" implies tls-unique, so the list would be: SCRAM-SHA-256-PLUS SCRAM-SHA-256-PLUS tls-server-end-point SCRAM-SHA-256 That would be mostly compatible with current libpq clients. [1] https://www.postgresql.org/message-id/flat/20180122072936.GB1772%40paquier.xyz - Heikki
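Heikki's proposal amounts to a simple client-side selection over the advertised mechanism list; a rough sketch follows (the mechanism strings mirror the message above, while the client preference order is an assumption for illustration, not anything libpq specifies):

```python
# Rough sketch of the negotiation proposed above: the server advertises
# one SASL mechanism per supported channel-binding type, and the client
# picks the first advertised mechanism it supports, in its own
# preference order (the order below is an illustrative assumption).
CLIENT_SUPPORTED = [
    "SCRAM-SHA-256-PLUS tls-server-end-point",  # e.g. what a Java client can do
    "SCRAM-SHA-256",                            # fallback: no channel binding
]

def choose_mechanism(server_advertised):
    for mech in CLIENT_SUPPORTED:       # client preference order
        if mech in server_advertised:
            return mech
    return None                         # no common mechanism: auth fails

server = [
    "SCRAM-SHA-256-PLUS tls-unique",
    "SCRAM-SHA-256-PLUS tls-server-end-point",
    "SCRAM-SHA-256",
]
assert choose_mechanism(server) == "SCRAM-SHA-256-PLUS tls-server-end-point"
```

With such a list, the mismatch cases in the message (a Java client against a GnuTLS server, say) degrade to plain SCRAM-SHA-256 instead of failing outright.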
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
On 09-07-2018 16:05, Fabien COELHO wrote: Hello Marina, Hello, Fabien! Here is a review for the last part of your v9 version. Thank you very much for this! Patch does not "git apply" (may anymore): error: patch failed: doc/src/sgml/ref/pgbench.sgml:513 error: doc/src/sgml/ref/pgbench.sgml: patch does not apply Sorry, I'll send a new version soon. However I could get it to apply with the "patch" command. Then patch compiles, global & pgbench "make check" are ok. :-) Feature === The patch adds the ability to restart transactions (i.e. the full script) on some errors, which is a good thing as it allows one to exercise postgres performance in more realistic scenarios. * -d/--debug: I'm not in favor of requiring a mandatory text argument on this option. It is not practical, the user has to remember it, and it is a change. I'm sceptical of the overall debug handling changes. Maybe we could have multiple -d which lead to higher debug levels, but I'm not sure that it can be made to work for this case and still be compatible with the previous behavior. Maybe you need a specific option for your purpose, eg "--debug-retry"? As you wrote in [1], adding an additional option is also a bad idea: I'm sceptical of the "--debug-fails" options. ISTM that --debug is already there and should just be reused. Maybe it's better to use an optional argument/arguments for compatibility (--debug[=fails] or --debug[=NUM])? But if we use the numbers, now I can see only 2 levels, and there's no guarantee that they will not change.. Code * The implementation is less complex than the previous submission, which is a good thing. I'm not sure that all the remaining complexity is still fully needed. * I'm reserved about the whole ereport thing, see comments in other messages. Thank you, I'll try to implement the error reporting in the way you suggested. Levels ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me. In particular, the "CLIENT" part is not very useful.
If the distinction makes sense, I would have kept "LOG" for the initial one and added other ones for ABORT and PGBENCH, maybe. Ok! * There are no comments about "retries" in StatData, CState and Command structures. * Also, for StatData, I would like to understand the logic between cnt, skipped, retries, retried, errors, ... so clear information about the expected invariant, if any, would be welcome. One has to go in the code to understand how these fields relate one to the other. <...> * How "errors" differs from "ecnt" is unclear to me. Thank you, I'll fix this. * commandFailed: I think that it should be kept much simpler. In particular, having errors on errors does not help much: on ELEVEL_FATAL, it ignores the actual reported error and generates another error of the same level, so that the initial issue is hidden. Even if these are can't happen cases, hiding the origin if it occurs looks unhelpful. Just print it directly, and maybe abort if you think that it is a can't happen case. Oh, thanks, my mistake( * copyRandomState: just use sizeof(RandomState) instead of making assumptions about the contents of the struct. Also, this function looks pretty useless, why not just do a plain assignment? * copyVariables: lacks comments to explain that the destination is cleaned up and so on. The cleanup phase could probably be in a distinct function, so that the code would be clearer. Maybe the function variable names are too long. Thank you, I'll fix this. if (current_source->svalue) in the context of a guard for a strdup, maybe: if (current_source->svalue != NULL) I'm sorry, I'll fix this. * I do not understand the comments on CState enum: "First, remember the failure in CSTATE_FAILURE. Then process other commands of the failed transaction if any" Why would other commands be processed at all if the transaction is aborted? For me any error must lead to the rollback and possible retry of the transaction. This comment needs to be clarified.
It should also say that on FAILURE, it will go either to RETRY or ABORTED. See below my comments about doCustom. It is unclear to me why there could be several failures within a transaction, as I would have thought that it would be aborted on the first one. * I do not understand the purpose of first_failure. The comment should explain why it would need to be remembered. From my point of view, I'm not fully convinced that it should. <...> * executeCondition: this hides client automaton state changes which were clearly visible beforehand in the switch, and the different handling of if & elif is also hidden. I'm against this unnecessary restructuring and hiding such information; all state changes should be clearly seen in the state switch so that it is easier to understand and follow. I do not see why touching the conditional stack on internal errors (evaluateExpr failure) brings anything, the whole transaction will be aborted anyway. *
Allow to specify an index name as ANALYZE parameter
Hi, When we specify column names for ANALYZE, only the statistics for those columns are collected. Similarly, is it useful if we have an option to specify an index for ANALYZE to collect only the statistics for the expression in the specified index? A usecase I suppose is when a new expression index is created and we need only the statistics for the new index. Although ANALYZE of all the indexes is usually fast because ANALYZE uses a random sampling of the table rows, ANALYZE on a specific index may still be useful if there are other indexes whose "statistics target" is large and/or whose expression takes time to compute, for example. Attached is the WIP patch to allow specifying an index name as an ANALYZE parameter. Documentation is not yet included. I would appreciate any feedback! Regards, -- Yugo Nagata diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 25194e8..058f242 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -84,7 +84,7 @@ static BufferAccessStrategy vac_strategy; static void do_analyze_rel(Relation onerel, int options, VacuumParams *params, List *va_cols, AcquireSampleRowsFunc acquirefunc, BlockNumber relpages, - bool inh, bool in_outer_xact, int elevel); + bool inh, bool in_outer_xact, int elevel, Oid indexid); static void compute_index_stats(Relation onerel, double totalrows, AnlIndexData *indexdata, int nindexes, HeapTuple *rows, int numrows, @@ -121,6 +121,7 @@ analyze_rel(Oid relid, RangeVar *relation, int options, AcquireSampleRowsFunc acquirefunc = NULL; BlockNumber relpages = 0; bool rel_lock = true; + Oid indexid = InvalidOid; /* Select logging level */ if (options & VACOPT_VERBOSE) @@ -253,6 +254,28 @@ analyze_rel(Oid relid, RangeVar *relation, int options, /* Also get regular table's size */ relpages = RelationGetNumberOfBlocks(onerel); } + else if (onerel->rd_rel->relkind == RELKIND_INDEX) + { + Relation rel; + + indexid = relid; + relid = IndexGetRelation(indexid, 
true); + + if (va_cols != NIL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("ANALYZE option must be specified when a column list is provided"))); + + rel = try_relation_open(relid, ShareUpdateExclusiveLock); + relation_close(onerel, ShareUpdateExclusiveLock); + + onerel = rel; + + /* Regular table, so we'll use the regular row acquisition function */ + acquirefunc = acquire_sample_rows; + /* Also get regular table's size */ + relpages = RelationGetNumberOfBlocks(onerel); + } else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE) { /* @@ -308,14 +331,14 @@ analyze_rel(Oid relid, RangeVar *relation, int options, */ if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) do_analyze_rel(onerel, options, params, va_cols, acquirefunc, - relpages, false, in_outer_xact, elevel); + relpages, false, in_outer_xact, elevel, indexid); /* * If there are child tables, do recursive ANALYZE. */ if (onerel->rd_rel->relhassubclass) do_analyze_rel(onerel, options, params, va_cols, acquirefunc, relpages, - true, in_outer_xact, elevel); + true, in_outer_xact, elevel, InvalidOid); /* * Close source relation now, but keep lock so that no one deletes it @@ -345,7 +368,7 @@ static void do_analyze_rel(Relation onerel, int options, VacuumParams *params, List *va_cols, AcquireSampleRowsFunc acquirefunc, BlockNumber relpages, bool inh, bool in_outer_xact, - int elevel) + int elevel, Oid indexid) { int attr_cnt, tcnt, @@ -445,7 +468,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params, } attr_cnt = tcnt; } - else + else if (!OidIsValid(indexid)) { attr_cnt = onerel->rd_att->natts; vacattrstats = (VacAttrStats **) @@ -459,6 +482,8 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params, } attr_cnt = tcnt; } + else + attr_cnt = 0; /* * Open all indexes of the relation, and see if there are any analyzable @@ -468,7 +493,7 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params, * all. 
*/ if (!inh) - vac_open_indexes(onerel, AccessShareLock, &nindexes, &Irel); + vac_open_indexes(onerel, AccessShareLock, &nindexes, &Irel, indexid); else { Irel = NULL; @@ -750,6 +775,7 @@ compute_index_stats(Relation onerel, double totalrows, int ind, i; + ind_context = AllocSetContextCreate(anl_context, "Analyze Index", ALLOCSET_DEFAULT_SIZES); diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index d90cb9a..b21811d 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -1606,7 +1606,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params) */ void vac_open_indexes(Relation relation, LOCKMODE lockmode, - int *nindexes, Relation ***Irel) + int *nindexes, Relation ***Irel, Oid idx) { List
Re: Excessive PostmasterIsAlive calls slow down WAL redo
On 11/07/18 04:16, Thomas Munro wrote: On Tue, Jul 10, 2018 at 11:39 PM, Heikki Linnakangas wrote: I don't have a FreeBSD machine at hand, so I didn't try fixing that patch. I updated the FreeBSD version to use the header test approach you showed, and pushed that too. FWIW the build farm has some FreeBSD animals with and without PROC_PDEATHSIG_CTL. Thanks! I suppose it's possible that we might want to reconsider the choice of signal in the future (SIGINFO or SIGPWR). We could reuse SIGUSR1 for this. If we set the flag in the SIGUSR1 handler, then some PostmasterIsAlive() calls would take the slow path unnecessarily, but it would probably be OK. The slow path isn't that slow. But using SIGINFO/SIGPWR seems fine. - Heikki
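The SIGUSR1 idea Heikki sketches is the usual flag-in-handler pattern; a rough analogue follows (illustrative only, not the actual postmaster-death code, and the function names are assumptions):

```python
import os
import signal

# Rough analogue of the idea above: a signal handler sets a flag, and the
# liveness check takes the expensive path only when the flag has been
# raised. At worst (if SIGUSR1 is shared with other uses) a check takes
# the slow path unnecessarily, which is the trade-off Heikki mentions.
postmaster_possibly_dead = False

def death_signal_handler(signum, frame):
    global postmaster_possibly_dead
    postmaster_possibly_dead = True

signal.signal(signal.SIGUSR1, death_signal_handler)

def postmaster_is_alive(slow_check):
    """Fast path unless the flag was raised; then do the real check."""
    global postmaster_possibly_dead
    if not postmaster_possibly_dead:
        return True                 # fast path: no signal seen since last check
    postmaster_possibly_dead = False
    return slow_check()             # slow path: actually probe pipe/parent pid

os.kill(os.getpid(), signal.SIGUSR1)            # simulate the death signal
assert postmaster_is_alive(lambda: False) is False   # slow path taken
assert postmaster_is_alive(lambda: False) is True    # flag cleared: fast path
```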
Re: Libpq support to connect to standby server as priority
On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe wrote: > Haribabu Kommi wrote: > > On Wed, Jan 24, 2018 at 9:01 AM Jing Wang wrote: > > > Hi All, > > > > > > Recently I put a proposal to support 'prefer-read' parameter in > target_session_attrs in libpq. Now I updated the patch with adding content > in the sgml and regression test case. > > > > > > Some people may have noticed there is already another patch ( > https://commitfest.postgresql.org/15/1148/ ) which looks similar with > this. But I would say this patch is more complex than my proposal. > > > > > > It is better to consider these 2 patches separately. > > > > I also feel prefer-read and read-only need to be treated as two > different options. > > prefer-read is simpler to support than read-only. > > > > Here I attached an updated patch that is rebased to the latest master > and also > > fixed some of the corner scenarios. > Thanks for the review. > The patch applies, builds and passes "make check-world". > > I think the "prefer-read" functionality is desirable: It is exactly what > you need > if you want to use replication for load balancing, and your application > supports > different database connections for reading and writing queries. > > "read-only" does not have a clear use case in my opinion. > > With the patch, PostgreSQL behaves as expected if I have a primary and a > standby and run: > > psql "host=/tmp,/tmp port=5433,5434 target_session_attrs=prefer-read" > > But if I stop the standby (port 5434), libpq goes into an endless loop. > There was a problem in reusing the primary host index, which led to the loop. The attached patch fixes the issue. > Concerning the code: > > - The documentation needs some attention. Suggestion: > >If this parameter is set to prefer-read, connections >where SHOW transaction_read_only returns on are > preferred. >If no such connection can be found, a connection that allows read-write >transactions will be accepted. > Updated as per your comment.
> - I think the construction with "read_write_host_index" makes the code > even more > complicated than it already is. > > What about keeping the first successful connection open and storing it > in a > variable if we are in "prefer-read" mode. > If we get the read-only connection we desire, close that cached > connection, > otherwise use it. > Even if we add a variable to cache the connection, I don't think the logic of checking the next host for a read-only host will change, but the extra connection request to the read-write host will be removed. Regards, Haribabu Kommi Fujitsu Australia
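The behaviour under review can be sketched in a few lines: try each host in order, prefer a standby (transaction_read_only = on), but remember the first read-write host as a fallback so no second connection attempt to it is needed. This is only an illustration of the design point above, not the libpq implementation:

```python
# Illustrative sketch of target_session_attrs=prefer-read (not libpq code).
def connect_prefer_read(hosts, probe):
    """probe(host) returns 'read-only', 'read-write', or None (unreachable)."""
    fallback = None
    for host in hosts:
        state = probe(host)
        if state == 'read-only':
            return host              # a standby: exactly what we prefer
        if state == 'read-write' and fallback is None:
            fallback = host          # remember, in case no standby answers
    return fallback                  # may be None: nothing was reachable

# Primary on h1, standby on h2: the standby is chosen; with the standby
# stopped, the primary is accepted instead of looping forever.
probe = {'h1': 'read-write', 'h2': 'read-only'}.get
assert connect_prefer_read(['h1', 'h2'], probe) == 'h2'
assert connect_prefer_read(['h1'], probe) == 'h1'
assert connect_prefer_read(['h3'], probe) is None
```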
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
(2018/07/10 19:58), Ashutosh Bapat wrote: On Tue, Jul 10, 2018 at 9:08 AM, Etsuro Fujita wrote: (2018/07/09 20:43), Ashutosh Bapat wrote: At the cost of having targetlist being type inconsistent. I don't have any testcase either to show that that's a problem in practice. As I said before, I don't see any issue in having such a targetlist, but I might be missing something, so I'd like to discuss a bit more about that. Could you tell me the logic/place in the PG code where you think the problem might occur? (IIRC, you mentioned something about that before (pathkeys? or index-only scans?), but sorry, I don't understand that.) IIUC, index-only scan will be used when all the required columns are covered by an index. If there is an index on the whole-row reference of the parent, it will be translated into a ConvertRowtypeExpr of the child when an index on the child is created. I think so too. If the targetlist doesn't have ConvertRowtypeExpr, as your patch does, the planner won't be able to use such an index on the child table. But I couldn't create an index with a whole-row reference in it. So, I think this isn't possible right now. Actually, even if we could create such an index on the child table and the targetlist had the ConvertRowtypeExpr, the planner would still not be able to use an index-only scan with that index; because check_index_only would not consider that an index-only scan is possible for that index because that index is an expression index and that function currently does not consider that index expressions are able to be returned back in an index-only scan. That behavior of the planner might be improved in future, though. Pathkey points to an equivalence class, which contains equivalence members. A parent equivalence class member containing a whole-row reference gets translated into a child equivalence member containing a ConvertRowtypeExpr. I think so too. At places in planner we match equivalence members to the targetlist entries.
This matching will fail unexpectedly when ConvertRowtypeExpr is removed from a child's targetlist. But again I couldn't reproduce a problem when such a mismatch arises. IIUC, I don't think the planner assumes that for an equivalence member there is a matching entry for that member in the targetlist; what I think the planner assumes is: an equivalence member is able to be computed from expressions in the targetlist. So, I think it is safe to have whole-row Vars instead of ConvertRowtypeExprs in the targetlist. Thanks for the explanation! Best regards, Etsuro Fujita
Re: How can we submit code patches that implement our (pending) patents?
On Wed, Jul 11, 2018 at 1:34 AM, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > From: Dave Page [mailto:dp...@pgadmin.org] > > SFLC have acted as the project's counsel in the past, so I'm not surprised > > they aren't talking to you; you won't be a known contact to them as a PG > > contributor, and as a Fujitsu employee there would likely be a conflict > > of interest for them to talk to you. > > I see. Then I hope for some reply saying so, so that I can give up my hope > that I might get a good idea from them... > > Who is a known contact to SFLC in PostgreSQL community? Can we expect > response from SFLC if he/she contacts them? > I am - however, as you've seen from Tom, the core team has already discussed this and come to the conclusion that the risks and downsides of accepting code under patent far outweigh the benefits, so I won't be contacting them about this. We do thank you (and Fujitsu) for your efforts though. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Problem with tupdesc in jsonb_to_recordset
On Wed, Jul 11, 2018 at 03:27:13PM +0900, Michael Paquier wrote: > I don't think that your solution is correct. From my read of 37a795a6, > the tuple descriptor is moved from the query-lifespan memory context > (ecxt_per_query_memory) to a function-level context, which could cause > the descriptor to become busted as your test is pointing out. Tom? Hacking my way out I am finishing with the attached, which fixes the problem and passes all regression tests. I am not completely sure that this is the right approach though, I would need to think a bit more about it. -- Michael diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c index e358b5ad13..b82060528b 100644 --- a/src/backend/utils/adt/jsonfuncs.c +++ b/src/backend/utils/adt/jsonfuncs.c @@ -2721,24 +2721,22 @@ JsValueToJsObject(JsValue *jsv, JsObject *jso) static void update_cached_tupdesc(CompositeIOData *io, MemoryContext mcxt) { - if (!io->tupdesc || - io->tupdesc->tdtypeid != io->base_typid || - io->tupdesc->tdtypmod != io->base_typmod) - { - TupleDesc tupdesc = lookup_rowtype_tupdesc(io->base_typid, - io->base_typmod); - MemoryContext oldcxt; + TupleDesc tupdesc; + MemoryContext oldcxt; - if (io->tupdesc) - FreeTupleDesc(io->tupdesc); + if (io->tupdesc != NULL && + io->tupdesc->tdtypeid == io->base_typid && + io->tupdesc->tdtypmod == io->base_typmod) + return; - /* copy tuple desc without constraints into cache memory context */ - oldcxt = MemoryContextSwitchTo(mcxt); - io->tupdesc = CreateTupleDescCopy(tupdesc); - MemoryContextSwitchTo(oldcxt); + tupdesc = lookup_rowtype_tupdesc(io->base_typid, io->base_typmod); - ReleaseTupleDesc(tupdesc); - } + /* copy tuple desc without constraints into cache memory context */ + oldcxt = MemoryContextSwitchTo(mcxt); + io->tupdesc = CreateTupleDescCopy(tupdesc); + MemoryContextSwitchTo(oldcxt); + + ReleaseTupleDesc(tupdesc); } /* recursively populate a composite (row type) value from json/jsonb */ @@ -3577,8 +3575,8 @@ 
populate_recordset_worker(FunctionCallInfo fcinfo, const char *funcname, if (!cache) { fcinfo->flinfo->fn_extra = cache = - MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt, sizeof(*cache)); - cache->fn_mcxt = fcinfo->flinfo->fn_mcxt; + MemoryContextAllocZero(rsi->econtext->ecxt_per_query_memory, sizeof(*cache)); + cache->fn_mcxt = rsi->econtext->ecxt_per_query_memory; if (have_record_arg) { signature.asc Description: PGP signature
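For anyone trying to follow along, the code path being patched is the one that builds and caches a tuple descriptor for the column definition list of jsonb_to_recordset. A minimal call of that shape looks like the following (an illustrative example, not the exact failing query from the bug report):

```sql
-- Illustrative only: jsonb_to_recordset builds a tuple descriptor for
-- the AS clause and caches it; the fix above concerns which memory
-- context that cache lives in, so it survives rescans within a query.
SELECT x.a, x.b
FROM jsonb_to_recordset('[{"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}]'::jsonb)
     AS x(a int, b text);
```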
Re: In pageinspect, perform clean-up after testing gin-related functions
On Wed, Jul 11, 2018 at 12:37 PM, Kuntal Ghosh wrote:
> Hello all,
>
> In pageinspect/sql/gin.sql, we don't drop the table test1 at the end
> of the test. IMHO, we should clean-up at the end of a test.

Yeah, it is good practice to drop the objects at the end. It is strange that the original commit adfb81d9e1 had this at the end of the test, but a later commit 367b99bbb1 by Tom removed the DROP statement. AFAICS, this is just a silly mistake, but I might be missing something. Tom, do you remember any reason for doing so? If not, then I think we can revert that change (aka commit Kuntal's patch).
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: TRUNCATE tables referenced by FKs on partitioned tables
On Tue, Jul 10, 2018 at 08:06:24PM -0400, Alvaro Herrera wrote:
> You can't truncate prim on its own. This is expected.
> alvherre=# truncate table prim, partfk;
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETAIL:  Table "partfk" references "prim".
> HINT:  Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

You mean this instead:

=# truncate table prim;
ERROR:  0A000: cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "partfk" references "prim".
HINT:  Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.
LOCATION:  heap_truncate_check_FKs, heap.c:3245

I agree that this should be an error.

> However, you can't do it even if you try to include partfk in the mix:
>
> alvherre=# truncate table prim, partfk;
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETAIL:  Table "partfk" references "prim".
> HINT:  Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

Your first and second queries are the same :)  And these ones work:

=# truncate table partfk;
TRUNCATE TABLE
=# truncate table partfk, partfk1;
TRUNCATE TABLE
=# truncate table partfk, partfk1, partfk2;
TRUNCATE TABLE
=# truncate table partfk, partfk2;
TRUNCATE TABLE

> Trying to list all the partitions individually is pointless:
>
> alvherre=# truncate table prim, partfk, partfk1, partfk2;
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETAIL:  Table "partfk" references "prim".
> HINT:  Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

Yes, I would expect this one to pass.

> CASCADE is also useless:
>
> alvherre=# truncate table prim cascade;
> NOTICE:  truncate cascades to table "partfk"
> NOTICE:  truncate cascades to table "partfk1"
> NOTICE:  truncate cascades to table "partfk2"
> ERROR:  cannot truncate a table referenced in a foreign key constraint
> DETAIL:  Table "partfk" references "prim".
> HINT:  Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

And this one as well.
--
Michael
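For readers without the upthread context, the schema under discussion is presumably along these lines (a hypothetical reconstruction; the exact column types and partition bounds are not shown in the thread):

```sql
-- Sketch of the assumed test setup: a plain table "prim" referenced by
-- a partitioned table "partfk" whose partitions are partfk1 and partfk2.
CREATE TABLE prim (a int PRIMARY KEY);
CREATE TABLE partfk (a int REFERENCES prim) PARTITION BY RANGE (a);
CREATE TABLE partfk1 PARTITION OF partfk FOR VALUES FROM (0) TO (100);
CREATE TABLE partfk2 PARTITION OF partfk FOR VALUES FROM (100) TO (200);
```

With that in place, the complaint is that TRUNCATE's FK check sees the reference coming from the partitioned parent "partfk", so listing the leaf partitions individually does not satisfy it.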
In pageinspect, perform clean-up after testing gin-related functions
Hello all,

In pageinspect/sql/gin.sql, we don't drop the table test1 at the end of the test. IMHO, we should clean up at the end of a test. I've attached a patch to do so.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

From 1e4be3749eed9ff9d59f775d2bd4ad2b409a6d3c Mon Sep 17 00:00:00 2001
From: Kuntal Ghosh
Date: Wed, 11 Jul 2018 12:21:13 +0530
Subject: [PATCH] In pageinspect, drop table after testing gin-related functions

---
 contrib/pageinspect/expected/gin.out | 1 +
 contrib/pageinspect/sql/gin.sql      | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/contrib/pageinspect/expected/gin.out b/contrib/pageinspect/expected/gin.out
index 82f63b2..ef7570b 100644
--- a/contrib/pageinspect/expected/gin.out
+++ b/contrib/pageinspect/expected/gin.out
@@ -35,3 +35,4 @@ FROM gin_leafpage_items(get_raw_page('test1_y_idx',
 -[ RECORD 1 ]
 ?column? | t

+DROP TABLE test1;
diff --git a/contrib/pageinspect/sql/gin.sql b/contrib/pageinspect/sql/gin.sql
index d516ed3..423f5c5 100644
--- a/contrib/pageinspect/sql/gin.sql
+++ b/contrib/pageinspect/sql/gin.sql
@@ -17,3 +17,5 @@ SELECT COUNT(*) > 0
 FROM gin_leafpage_items(get_raw_page('test1_y_idx',
 (pg_relation_size('test1_y_idx') / current_setting('block_size')::bigint)::int - 1));
+
+DROP TABLE test1;
--
1.8.3.1
Re: automatic restore point
On Wed, Jul 11, 2018 at 06:11:01AM +, Yotsunaga, Naoki wrote:
> I want to hear about the following in addition to the previous
> comment. What would you do if your customer dropped the table and
> asked you to restore it?

I can think of four approaches off the top of my head:
1) Don't do that.
2) Implement safe-guards using utility hooks or event triggers.
3) Have a delayed standby if you don't trust that your administrators
   are skilled enough.
4) Have backups and a WAL archive.

> Everyone is thinking what to do to avoid operation failure, but I'm
> thinking about after the user's failure.
> What I mean is that not all users will set up in advance.
> For example, if you make the settings described in the manual, you
> will not drop the table by operation failure. However, not all users
> do that setting.
> For such users, I think that it is necessary to have a function to
> easily restore data after failing operation without setting anything
> in advance. So I proposed this function.

Well, if you put the correct measures in place from the start you would not have this problem. It seems to me that there is no point in implementing something which is a solution for a very narrow case, where the user has shot themselves in the foot to begin with. Having backups is mandatory anyway; standby replicas are not backups.
--
Michael
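To illustrate point 2, a minimal event-trigger safe-guard could look like the following. This is a sketch with hypothetical function and trigger names, not something from the thread; the tag filter would need to be adapted to whatever commands a site wants to block:

```sql
-- Hypothetical sketch: refuse DROP TABLE outright via an event trigger.
CREATE FUNCTION forbid_drop() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Abort the command before it runs.
    RAISE EXCEPTION 'DROP TABLE is disabled on this database';
END;
$$;

CREATE EVENT TRIGGER no_drops
    ON ddl_command_start
    WHEN TAG IN ('DROP TABLE')
    EXECUTE PROCEDURE forbid_drop();
```

Point 3 is a matter of standby configuration: setting recovery_min_apply_delay on a replica gives the administrator a window to stop replay before a destructive command is applied there.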