Re: [HACKERS] Parallel Hash take II
On 2017-11-27 22:25:12 +1300, Thomas Munro wrote:
> On Thu, Nov 23, 2017 at 12:36 AM, Thomas Munro wrote:
> > Here's a new patch set with responses to the last batch of review comments.

Looking at 0004-Add-shared-tuplestores.patch

Comments:
- I'd rename mutex to lock. Seems quite possible that we end up with shared lockers too.
- Expand the "Simple mechanism for sharing tuples between backends." intro comment - that doesn't really seem like a meaningful description of the mechanism. Should probably mention that it's similar to tuplestores etc...
- I'm still concerned about the chunking mechanism. How about this sketch of an alternative solution? Chunks are always the same length. To avoid having to read the length from disk while holding a lock, introduce continuation chunks which store the amount of space needed by the overlarge tuple at the start. The reading process stays largely the same, except that if a backend reads a chunk that's a continuation, it reads the length of the continuation and skips ahead. That could mean that more than one backend reads continuation chunks, but the window is small and there's normally not going to be that many huge tuples (otherwise things are slow anyway).
- Why are we using a separate hardcoded 32 for sts names? Why not just go for NAMEDATALEN or such?
- I'd replace most of the "should"s in comments with "need".

Greetings,

Andres Freund
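The continuation-chunk sketch above can be modeled in a few lines. This is a toy illustration only — names, header layout, and sizes are invented here, not the actual SharedTuplestore code: fixed-size chunks carry a flag byte plus a "remaining bytes" field, and a reader that lands on a continuation chunk computes how far to skip without re-reading lengths while holding the lock.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sizes: a flag byte plus a 4-byte remaining-length field. */
enum { CHUNK_SIZE = 32, HDR = 5, PAYLOAD = CHUNK_SIZE - HDR };
enum { ORDINARY = 0, CONTINUATION = 1 };

/* Write one tuple starting at chunk index 'at'; returns chunks used.
 * The first chunk of a tuple is ORDINARY; overflow chunks are marked
 * CONTINUATION and store the bytes still remaining (incl. their own). */
static int
write_tuple(uint8_t *file, int at, const uint8_t *data, uint32_t len)
{
    int nchunks = 0;
    uint32_t done = 0;

    do
    {
        uint8_t *chunk = file + (size_t) (at + nchunks) * CHUNK_SIZE;
        uint32_t remaining = len - done;
        uint32_t fit = remaining < PAYLOAD ? remaining : PAYLOAD;

        chunk[0] = (nchunks == 0) ? ORDINARY : CONTINUATION;
        memcpy(chunk + 1, &remaining, sizeof(remaining));
        memcpy(chunk + HDR, data + done, fit);
        done += fit;
        nchunks++;
    } while (done < len);

    return nchunks;
}

/* If positioned on a continuation chunk, skip ahead to the next tuple
 * start using only the length stored in the chunk itself. */
static int
skip_to_next_tuple(const uint8_t *file, int at)
{
    const uint8_t *chunk = file + (size_t) at * CHUNK_SIZE;
    uint32_t remaining;

    if (chunk[0] == ORDINARY)
        return at;              /* already at a tuple start */
    memcpy(&remaining, chunk + 1, sizeof(remaining));
    return at + (int) ((remaining + PAYLOAD - 1) / PAYLOAD);
}
```

The point of the scheme is visible in `skip_to_next_tuple`: the skip distance is derived from data already in the chunk, so no extra disk read happens under the lock.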
Re: [HACKERS] Parallel Hash take II
Hi,

On 2017-11-27 22:25:12 +1300, Thomas Munro wrote:

Pushed 0003-Add-infrastructure-for-sharing-temporary-files-betwe.patch after minor adjustments (some after conversing with you). Changes:
- Changed an impossible elog() into an Assert().
- Changed SharedFileSet->counter, and the backend static variable, to uint32. It's not impossible for a 32bit int to overflow over the course of a few weeks, and a) imo we shouldn't unnecessarily rely on signed overflow being defined, b) a negative number would look weird, even if well defined (-fwrapv et al).
- Added a small comment about the arbitrariness of the 8 in Oid tablespaces[8].
- pgindent'ed

Questions:
- Right now RemovePgTempFilesInDir() will recurse into appropriately named directories, and when it recurses it doesn't require the same name pattern checks. I think that's good, but I think it'd be prudent to be a bit more paranoid and prevent recursing into symlinked subdirectories.
- As we don't clean temp files after crash-restarts, it isn't impossible to have a bunch of crash-restarts and end up with pids *and* per-pid shared file set counters reused. Which'd lead to conflicts. Do we care?

Greetings,

Andres Freund
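Point a) above is worth making concrete: unsigned arithmetic wraps modulo 2^32 by definition, while the same increment past INT32_MAX on a signed int is undefined behaviour unless -fwrapv is in effect — and even then yields an odd-looking negative value in a file name. A minimal illustration (not the actual SharedFileSet code; the function name is made up):

```c
#include <stdint.h>

/* A uint32 counter that overflows simply continues from 0 -- this is
 * well-defined in C, unlike signed overflow. */
static uint32_t
next_fileset_number(uint32_t counter)
{
    return counter + 1;     /* wraps cleanly to 0 at UINT32_MAX */
}
```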
Re: [HACKERS] pow support for pgbench
Hello Robert,

> The fact that the return type is not consistently of one type bothers me. I'm not sure pgbench's expression language is a good place for runtime polymorphism -- SQL doesn't work that way.

Sure. Pg has a NUMERIC adaptive-precision version, which is cheating, because it can return kind of an "int" or a "float", depending on whether there are digits after the decimal point or not. Pgbench does not have support for NUMERIC, just INT & DOUBLE, so the current version is an approximation of that. Now it is always possible to just do a DOUBLE version, but this won't match SQL behavior either.

> + /*
> +  * pow() for integer values with exp >= 0. Matches SQL pow() behaviour
> +  */
>
> What's the name of the backend function whose behavior this matches?

POW(numeric,numeric) -> numeric, which matches "numeric_power".

-- Fabien.
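For readers following along, the kind of integer pow() under discussion can be sketched with exponentiation by squaring. This is illustrative only — the actual pgbench patch may well differ, and overflow checking is deliberately omitted here:

```c
#include <stdint.h>

/* Integer pow() for exp >= 0 via exponentiation by squaring.
 * Hypothetical sketch; not the patch's actual implementation. */
static int64_t
ipow(int64_t base, int64_t exp)
{
    int64_t result = 1;

    while (exp > 0)
    {
        if (exp & 1)            /* odd exponent: fold in one factor */
            result *= base;
        base *= base;           /* square the base each round */
        exp >>= 1;
    }
    return result;
}
```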
Re: Protect syscache from bloating with negative cache entries
On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> >> I have no faith in either of these proposals, because they both assume that the problem only arises over the course of many minutes. In the recent complaint about pg_dump causing relcache bloat, it probably does not take nearly that long for the bloat to occur.
>
> > To me that's a bit of a different problem than what I was discussing here. It also actually doesn't seem that hard - if your caches are growing fast, you'll continually get hash-resizing of the various caches. Adding cache-pruning to the resizing code doesn't seem hard, and wouldn't add meaningful overhead.
>
> That's an interesting way to think about it, as well, though I'm not sure it's quite that simple. If you tie this to cache resizing then the cache will have to grow up to the newly increased size before you'll prune it again. That doesn't sound like it will lead to nice steady-state behavior.

Yea, it's not perfect - but if we do pruning both at resize *and* on regular intervals, like once-a-minute as I was suggesting, I don't think it's that bad. The steady state won't be reached within seconds, true, but the negative consequences of only attempting to shrink the cache upon resizing - when the cache size is growing fast anyway - don't seem that large. I don't think we need to be super accurate here, there just needs to be *some* backpressure. I've had cases in the past where just occasionally blasting the cache away would've been good enough.

Greetings,

Andres Freund
Re: Protect syscache from bloating with negative cache entries
Andres Freund writes:
> On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
>> I have no faith in either of these proposals, because they both assume that the problem only arises over the course of many minutes. In the recent complaint about pg_dump causing relcache bloat, it probably does not take nearly that long for the bloat to occur.

> To me that's a bit of a different problem than what I was discussing here. It also actually doesn't seem that hard - if your caches are growing fast, you'll continually get hash-resizing of the various caches. Adding cache-pruning to the resizing code doesn't seem hard, and wouldn't add meaningful overhead.

That's an interesting way to think about it, as well, though I'm not sure it's quite that simple. If you tie this to cache resizing then the cache will have to grow up to the newly increased size before you'll prune it again. That doesn't sound like it will lead to nice steady-state behavior.

regards, tom lane
Re: [HACKERS] proposal: psql command \graw
Pavel Stehule writes:
> 2017-12-01 16:36 GMT+01:00 Robert Haas :
>> I vote to reject this patch. It doesn't do anything that you can't already do; it just adds some syntactic sugar. And that syntactic sugar saves only a handful of keystrokes. If you want unaligned, tuples-only mode, you can set it in 5 keystrokes:

> When you don't want to see column names, then it is true.

If this is "\a\t except we aren't removing column headers", then it seems like "\graw" is a remarkably misleading name for the functionality. What I'd been assuming, from reading no more than the patch subject line, was that it was something like "dump the raw field data with no headers, footers, or separators", with a claimed use-case involving passing bytea data to some consumer or other.

FWIW, I concur with Robert's assessment, both that the use-case is pretty thin and that we do not want to get into the business of encouraging a boatload of \gxxx variants, especially not ones with randomly chosen names.

> For some usage - like printing via gnuplot - you would want to see column names, and it cannot be done with current settings.

That seems like an argument for new pset option(s) to suppress footers but not headers. It's not an argument for invoking such options through a \g variant; that's totally non-orthogonal.

regards, tom lane
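For the gnuplot use case being discussed, one can already get fairly close with existing psql settings — headers kept, footer suppressed, tab-separated output. This is a sketch using standard psql meta-commands; whether it fully replaces the proposed \graw depends on the exact output wanted:

```psql
-- headers but no footer, tab-separated -- roughly the desired \graw shape
\pset format unaligned
\pset fieldsep '\t'
\pset footer off
SELECT x, sin(x) FROM generate_series(0, 6) AS t(x) \g /tmp/data.tsv
```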
Re: [HACKERS] Custom compression methods
On Fri, 1 Dec 2017 16:38:42 -0300 Alvaro Herrera wrote:
> To me it makes sense to say "let's create this method which is for data compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by either "let's use this new compression method for the type tsvector" (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET COMPRESSION hyperz).

Hi, I think if CREATE ACCESS METHOD can be used for compression, then it could be nicer than CREATE COMPRESSION METHOD. I just don't know whether compression fits as an access method or not. Anyway it's easy to change the syntax, and I don't mind doing it if it's necessary for the patch to be committed.

-- Regards, Ildus Kurbangaliev
Re: [HACKERS] Custom compression methods
On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
> Honestly, if we can give everybody a 4% space reduction by switching to lz4, I think that's totally worth doing -- but let's not make people choose it, let's make it the default going forward, and keep pglz support around so we don't break pg_upgrade compatibility (and so people can continue to choose it if for some reason it works better in their use case). That kind of improvement is nothing special in a specific workload, but TOAST is a pretty general-purpose mechanism. I have become, through a few bitter experiences, a strong believer in the value of trying to reduce our on-disk footprint, and knocking 4% off the size of every TOAST table in the world does not sound worthless to me -- even though context-aware compression can doubtless do a lot better.

+1. It's also a lot faster, and I've seen way too many workloads with 50%+ time spent in pglz.

Greetings,

Andres Freund
Re: Protect syscache from bloating with negative cache entries
On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> Andres Freund writes:
> > On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
> >> Well, yeah, that would be insane. But I think even something very rough could work well enough. I think our goal should be to eliminate cache entries that have gone unused for many *minutes*, and there's no urgency about getting it to any sort of exact value. For non-idle backends, using the most recent statement start time as a proxy would probably be plenty good enough. Idle backends might need a bit more thought.
>
> > Our timer framework is flexible enough that we can install a once-a-minute timer without much overhead. That timer could increment a 'cache generation' integer. Upon cache access we write the current generation into relcache / syscache (and potentially also plancache?) entries. Not entirely free, but cheap enough. In those once-a-minute passes entries that haven't been touched in X cycles get pruned.
>
> I have no faith in either of these proposals, because they both assume that the problem only arises over the course of many minutes. In the recent complaint about pg_dump causing relcache bloat, it probably does not take nearly that long for the bloat to occur.

To me that's a bit of a different problem than what I was discussing here. It also actually doesn't seem that hard - if your caches are growing fast, you'll continually get hash-resizing of the various caches. Adding cache-pruning to the resizing code doesn't seem hard, and wouldn't add meaningful overhead.

Greetings,

Andres Freund
Re: Protect syscache from bloating with negative cache entries
On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
> Well, yeah, that would be insane. But I think even something very rough could work well enough. I think our goal should be to eliminate cache entries that have gone unused for many *minutes*, and there's no urgency about getting it to any sort of exact value. For non-idle backends, using the most recent statement start time as a proxy would probably be plenty good enough. Idle backends might need a bit more thought.

Our timer framework is flexible enough that we can install a once-a-minute timer without much overhead. That timer could increment a 'cache generation' integer. Upon cache access we write the current generation into relcache / syscache (and potentially also plancache?) entries. Not entirely free, but cheap enough. In those once-a-minute passes entries that haven't been touched in X cycles get pruned.

Greetings,

Andres Freund
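The generation scheme sketched above is simple enough to model in a few lines. All names here are illustrative, not actual PostgreSQL code: a timer bumps a global generation, every cache access stamps the entry, and a periodic sweep evicts entries untouched for some number of generations. Note the unsigned subtraction, which keeps the age computation correct even across counter wraparound.

```c
#include <stdint.h>
#include <stdbool.h>

#define PRUNE_AFTER 5                   /* generations, i.e. ~minutes */

typedef struct CacheEntry
{
    bool     valid;
    uint32_t last_used_gen;
} CacheEntry;

static uint32_t cache_generation = 0;

static void
timer_tick(void)
{
    cache_generation++;                 /* fired by the once-a-minute timer */
}

static void
touch_entry(CacheEntry *entry)
{
    entry->last_used_gen = cache_generation;    /* stamped on cache access */
}

/* Run from the timer pass (or, per the follow-up discussion, also when
 * the cache's hash table resizes).  Returns the number of evictions. */
static int
prune_cache(CacheEntry *entries, int n)
{
    int evicted = 0;

    for (int i = 0; i < n; i++)
    {
        if (entries[i].valid &&
            cache_generation - entries[i].last_used_gen >= PRUNE_AFTER)
        {
            entries[i].valid = false;
            evicted++;
        }
    }
    return evicted;
}
```

The per-access cost is a single store, which is why the proposal calls it "not entirely free, but cheap enough".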
Re: [HACKERS] Custom compression methods
On 12/01/2017 08:38 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>> On 11/30/2017 09:51 PM, Alvaro Herrera wrote:
>>> Just passing by, but wouldn't this fit in the ACCESS METHOD group of commands? So this could be simplified down to CREATE ACCESS METHOD ts1 TYPE COMPRESSION; we have that for indexes and there are patches flying for heap storage, sequences, etc.
>>
>> I think that would conflate two very different concepts. In my mind, access methods define how rows are stored.
>
> In mine, they define how things are accessed (i.e. more general than what you're thinking). We *currently* use them to store rows [in indexes], but there is no reason why we couldn't expand that.

Not sure I follow. My argument was not as much about whether the rows are stored as rows or in some other (columnar) format, but that access methods deal with "tuples" (i.e. rows in the "logical" sense). I assume that even if we end up implementing other access method types, they will still be tuple-based.

OTOH compression methods (at least as introduced by this patch) operate on individual values, and have very little to do with access to the value (in a sense it's a transparent thing).

> So we group access methods in "types"; the current type we have is for indexes, and methods in that type define how indexes are accessed. This new type would indicate how values would be compressed. I disagree that there is no parallel there.
>
> I'm trying to avoid pointless proliferation of narrowly defined DDL commands.

Of course, the opposite case is using the same DDL for very different concepts (although I understand you don't see it that way). But in fairness, I don't really care if we call this COMPRESSION METHOD or ACCESS METHOD or DARTH VADER ...

> Furthermore, the "TYPE" in CREATE COMPRESSION METHOD was meant to restrict the compression algorithm to a particular data type (so, if it relies on tsvector, you can't apply it to text columns).
> > Yes, of course. I'm saying that the "datatype" property of a > compression access method would be declared somewhere else, not in the > TYPE clause of the CREATE ACCESS METHOD command. Perhaps it makes sense > to declare that a certain compression access method is good only for a > certain data type, and then you can put that in the options clause, > "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)". > But many compression access methods would be general in nature and so > could be used for many datatypes (say, snappy). > > To me it makes sense to say "let's create this method which is for data > compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by > either "let's use this new compression method for the type tsvector" > (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new > compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET > COMPRESSION hyperz). > The WITH syntax does not seem particularly pretty to me, TBH. I'd be much happier with "TYPE tsvector" and leaving WITH for the options specific to each compression method. FWIW I think syntax is the least critical part of this patch. It's ~1% of the patch, and the gram.y additions are rather trivial. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
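Collecting the syntax variants from this subthread in one place may help. All of this is proposed syntax under discussion, not anything PostgreSQL accepts; "hyperz" and the table/column names are hypothetical:

```sql
-- Alvaro's proposal: compression as a new access method type ...
CREATE ACCESS METHOD hyperz TYPE COMPRESSION;

-- ... attached either to a type or to a column:
ALTER TYPE tsvector SET COMPRESSION hyperz;
ALTER TABLE t ALTER COLUMN tc SET COMPRESSION hyperz;

-- ... with any datatype restriction expressed in the options clause:
CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector);

-- The patch's original spelling, where TYPE restricts the datatype:
CREATE COMPRESSION METHOD hyperz TYPE tsvector;
```

The disagreement is thus not about capability but about where the datatype restriction lives (the TYPE clause vs. the WITH options) and whether a dedicated DDL command is warranted.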
Re: BUG #14941: Vacuum crashes
On 12/1/17, 3:04 PM, "Robert Haas" wrote:
> On Fri, Dec 1, 2017 at 1:53 PM, Bossart, Nathan wrote:
>> There is already an internal flag called VACOPT_NOWAIT that gives autovacuum this ability in certain cases (introduced in the aforementioned commit), and I recently brought up this potential use-case as justification for a patch [0]. I'd be happy to submit a patch for providing VACOPT_NOWAIT to users if others think it's something we should do.
>
> Seems entirely reasonable to me, provided that we only update the extensible-options syntax:
>
> VACUUM [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
>
> I don't want to add any more options to the older syntax:
>
> VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ table_and_columns [, ...] ]

Right. This seems like the right path forward to me, as the VACUUM documentation states that the unparenthesized syntax is deprecated, and the DISABLE_PAGE_SKIPPING option was not added to the old syntax, either.

> I am slightly confused as to how we got on to this topic since the subject is "Vacuum crashes", but perhaps Lyes was simply speaking imprecisely.

I'm hoping Lyes chimes in soon to clarify if I am interpreting the original report incorrectly.

Nathan
Re: [HACKERS] Custom compression methods
On Fri, Dec 1, 2017 at 4:06 PM, Tomas Vondra wrote:
>>> I agree with these thoughts in general, but I'm not quite sure what is your conclusion regarding the patch.
>>
>> I have not reached one. Sometimes I like to discuss problems before deciding what I think. :-)
>
> That's lame! Let's make decisions without discussion ;-)

Oh, right. What was I thinking?

>> It does seem to me that the patch may be aiming at a relatively narrow target in a fairly large problem space, but I don't know whether to label that as short-sightedness or prudent incrementalism.
>
> I don't know either. I don't think people will start switching their text columns to lz4 just because they can, or because they get 4% space reduction compared to pglz.

Honestly, if we can give everybody a 4% space reduction by switching to lz4, I think that's totally worth doing -- but let's not make people choose it, let's make it the default going forward, and keep pglz support around so we don't break pg_upgrade compatibility (and so people can continue to choose it if for some reason it works better in their use case). That kind of improvement is nothing special in a specific workload, but TOAST is a pretty general-purpose mechanism. I have become, through a few bitter experiences, a strong believer in the value of trying to reduce our on-disk footprint, and knocking 4% off the size of every TOAST table in the world does not sound worthless to me -- even though context-aware compression can doubtless do a lot better.

> But the ability to build per-column dictionaries seems quite powerful, I guess. And I don't think that can be easily built directly into JSONB, because we don't have a way to provide information about the column (i.e. how would you fetch the correct dictionary?).

That's definitely a problem, but I think we should mull it over a bit more before giving up.
I have a few thoughts, but the part of my life that doesn't happen on the PostgreSQL mailing list precludes expounding on them right this minute. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doc tweak for huge_pages?
On Sat, Dec 2, 2017 at 4:08 AM, Peter Eisentraut wrote:
> On 11/30/17 23:35, Thomas Munro wrote:
>> On Fri, Dec 1, 2017 at 5:04 PM, Justin Pryzby wrote:
>>> On Fri, Dec 01, 2017 at 04:01:24PM +1300, Thomas Munro wrote:
>>>> Hi hackers, The manual implies that only Linux can use huge pages. That is not true: FreeBSD, Illumos and probably others support larger page sizes using transparent page coalescing algorithms. On my FreeBSD box procstat -v often shows PostgreSQL shared buffers in "S"-flagged memory. I think we should adjust the manual to make clear that it's the *explicit request for huge pages* that is supported only on Linux (and hopefully soon Windows). Am I being too pedantic?
>>>
>>> I suggest to remove "other" and include Linux in the enumeration, since it also supports "transparent" hugepages.
>>
>> Hmm. Yeah, it does, but apparently it's not so transparent. So if we mention that we'd better indicate in the same paragraph that you probably don't actually want to use it. How about the attached?
>
> Part of the confusion is that the huge_pages setting is only for shared memory, whereas the kernel settings affect all memory.

Right. And more specifically, just the main shared memory area, not DSM segments. Updated to make this point.

(I have wondered whether DSM segments should respect this GUC: it seems plausible that they should when the size is a multiple of the huge page size, so that very large DSA areas finish up mostly backed by huge pages, so that very large shared hash tables would benefit from lower TLB miss rates. I have only read in an academic paper that this is a good idea, I haven't investigated whether that would really help us in practice, and the first problem is that Linux shm_open doesn't support huge pages anyway, so you'd need one of the other DSM implementation options, which are currently non-default.)

> Is the same true for the proposed Windows patch?

Yes.
It adds a flag to the request for the main shared memory area (after jumping through various permissions hoops). -- Thomas Munro http://www.enterprisedb.com huge-pages-doc-tweak-v3.patch Description: Binary data
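The "explicit request for huge pages" being distinguished from transparent coalescing boils down to a flag on the shared memory mapping. A rough sketch of the huge_pages=try style of behavior on Linux — illustrative only, not PostgreSQL's actual sysv_shmem.c code:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Ask the kernel for huge pages outright (Linux MAP_HUGETLB); fall back
 * to normal pages when huge pages are unavailable or unsupported. */
static void *
map_shared_memory(size_t size)
{
    void *ptr = MAP_FAILED;

#ifdef MAP_HUGETLB
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    if (ptr == MAP_FAILED)      /* e.g. no huge pages reserved by the OS */
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return ptr;
}
```

On FreeBSD or Illumos the second, plain mapping is the only one available, and the kernel's transparent coalescing may still promote it to larger pages — which is exactly the distinction the doc tweak is trying to draw.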
Re: [HACKERS] Custom compression methods
On Fri, Dec 1, 2017 at 2:38 PM, Alvaro Herrera wrote:
> In mine, they define how things are accessed (i.e. more general than what you're thinking). We *currently* use them to store rows [in indexes], but there is no reason why we couldn't expand that.
>
> So we group access methods in "types"; the current type we have is for indexes, and methods in that type define how indexes are accessed. This new type would indicate how values would be compressed. I disagree that there is no parallel there.

+1.

> I'm trying to avoid pointless proliferation of narrowly defined DDL commands.

I also think that's an important goal.

> Yes, of course. I'm saying that the "datatype" property of a compression access method would be declared somewhere else, not in the TYPE clause of the CREATE ACCESS METHOD command. Perhaps it makes sense to declare that a certain compression access method is good only for a certain data type, and then you can put that in the options clause, "CREATE ACCESS METHOD hyperz TYPE COMPRESSION WITH (type = tsvector)". But many compression access methods would be general in nature and so could be used for many datatypes (say, snappy).
>
> To me it makes sense to say "let's create this method which is for data compression" (CREATE ACCESS METHOD hyperz TYPE COMPRESSION) followed by either "let's use this new compression method for the type tsvector" (ALTER TYPE tsvector SET COMPRESSION hyperz) or "let's use this new compression method for the column tc" (ALTER TABLE ALTER COLUMN tc SET COMPRESSION hyperz).

+1 to this, too.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Custom compression methods
On 12/01/2017 08:20 PM, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra wrote:
>> It has very little impact on this patch, as it has nothing to do with columnar storage. That is, each value is compressed independently.
>
> I understand that this patch is not about columnar storage, but I think the idea that we may want to operate on the compressed data directly is not only applicable to that case.

Yeah. To clarify, my point was that column stores benefit from compressing many values at once, and then operating on this compressed vector. That is not what this patch is doing (or can do), of course.

But I certainly do agree that if the compression can be integrated into the data type, allowing processing on the compressed representation, then that will beat whatever this patch is doing, of course ...

>> I agree with these thoughts in general, but I'm not quite sure what is your conclusion regarding the patch.
>
> I have not reached one. Sometimes I like to discuss problems before deciding what I think. :-)

That's lame! Let's make decisions without discussion ;-)

> It does seem to me that the patch may be aiming at a relatively narrow target in a fairly large problem space, but I don't know whether to label that as short-sightedness or prudent incrementalism.

I don't know either. I don't think people will start switching their text columns to lz4 just because they can, or because they get 4% space reduction compared to pglz.

But the ability to build per-column dictionaries seems quite powerful, I guess. And I don't think that can be easily built directly into JSONB, because we don't have a way to provide information about the column (i.e. how would you fetch the correct dictionary?).

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #14941: Vacuum crashes
On Fri, Dec 1, 2017 at 1:53 PM, Bossart, Nathan wrote:
> There is already an internal flag called VACOPT_NOWAIT that gives autovacuum this ability in certain cases (introduced in the aforementioned commit), and I recently brought up this potential use-case as justification for a patch [0]. I'd be happy to submit a patch for providing VACOPT_NOWAIT to users if others think it's something we should do.

Seems entirely reasonable to me, provided that we only update the extensible-options syntax:

VACUUM [ ( option [, ...] ) ] [ table_and_columns [, ...] ]

I don't want to add any more options to the older syntax:

VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ table_and_columns [, ...] ]

I am slightly confused as to how we got on to this topic since the subject is "Vacuum crashes", but perhaps Lyes was simply speaking imprecisely.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Custom compression methods
On 12/01/2017 08:48 PM, Alvaro Herrera wrote:
> Ildus Kurbangaliev wrote:
>> If the table is big, decompression could take an eternity. That's why I decided only to disable it, and the data could be decompressed using compression options.
>>
>> My idea was to keep compression options forever, since there will not be many of them in one database. Still that requires that the extension is not removed.
>>
>> I will try to find a way to recompress data first in case it moves to another table.
>
> I think what you should do is add a dependency between a column that compresses using a method, and that method. So the method cannot be dropped and leave compressed data behind. Since the method is part of the extension, the extension cannot be dropped either. If you ALTER the column so that it uses another compression method, then the table is rewritten and the dependency is removed; once you do that for all the columns that use the compression method, the compression method can be dropped.

+1 to doing the rewrite, just like for other similar ALTER TABLE commands.

> Maybe our dependency code needs to be extended in order to support this. I think the current logic would drop the column if you were to do "DROP COMPRESSION .. CASCADE", but I'm not sure we'd see that as a feature. I'd rather have DROP COMPRESSION always fail instead until no columns use it. Let's hear others' opinions on this bit though.

Why should this behave differently compared to data types? Seems quite against POLA, if you ask me ...

If you want to remove the compression, you can do the SET NOT COMPRESSED (or whatever syntax we end up using), and then DROP COMPRESSION METHOD.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Dec 1, 2017 at 1:28 PM, Robert Haas wrote:
> [ lots of minor comments ]

When I took a break from sitting at the computer, I realized that I think this has a more serious problem: won't it permanently leak reference counts if someone hits ^C or an error occurs while the lock is held? I think it will -- it probably needs to do cleanup at the places where we do LWLockReleaseAll() that includes decrementing the shared refcount if necessary, rather than doing cleanup at the places we release heavyweight locks. I might be wrong about the details here -- this is off the top of my head.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Transaction control in procedures
On Fri, Dec 1, 2017 at 2:48 PM, Peter Eisentraut wrote:
> Here is a new patch, now on top of master. The main changes are that a lot of documentation has been added.

This feature doesn't have many tests. I think it should have a lot more of them. It's tinkering with the transaction control machinery of the system in a fairly fundamental way, and that could break things. I suggest, in particular, testing how it interacts with resources such as cursors and prepared statements. For example, what happens if you commit or roll back inside a cursor-for loop (given that the cursor is not holdable)? There are several kinds of cursor loops in PostgreSQL, plus there are cursors, prepared statements, and portals that can be created using SQL commands or protocol messages. I suspect that there are quite a few tricky interactions there.

Other things to think about:
- COMMIT or ROLLBACK inside a PLpgsql block with an attached EXCEPTION block, or when an SQL SAVEPOINT has been established previously.
- COMMIT or ROLLBACK inside a procedure with a SET clause attached, and/or while SET LOCAL is in effect either at the inner or outer level.
- COMMIT or ROLLBACK with open large objects.
- COMMIT inside a procedure fails because of a serialization failure, deferred constraint, etc.

In some cases, there are not only questions of correctness (it shouldn't crash/give wrong answers) but also definitional questions (what exactly should happen in that case, anyway?).

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
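To make the cursor-for-loop case concrete, this is the kind of test being suggested, written against the patch's proposed PL/pgSQL COMMIT (syntax per the v3 patch in this thread, not released PostgreSQL; the table and procedure names are made up):

```sql
CREATE TABLE target (i int);

CREATE PROCEDURE batch_load() LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    -- The FOR loop is driven by an implicit, non-holdable cursor ...
    FOR r IN SELECT i FROM generate_series(1, 1000) AS s(i) LOOP
        INSERT INTO target VALUES (r.i);
        IF r.i % 100 = 0 THEN
            COMMIT;   -- ... so what happens to that cursor here?
        END IF;
    END LOOP;
END;
$$;

CALL batch_load();
```

The definitional question is whether the loop's cursor survives the intermediate COMMIT (and if so, how), or whether this should raise an error.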
Re: [HACKERS] Additional logging for VACUUM and ANALYZE
On Fri, Dec 1, 2017 at 11:29 AM, Bossart, Nathan wrote:
> Thanks for the review, Robert. I've attached a new version that addresses your feedback.

Thanks. I think this looks fine now, except that (1) it needs a pgindent run and (2) I vote for putting the test case back. Michael thought the test case was too much because this is so obscure, but I think that's exactly why it needs a test case. Otherwise, somebody a few years from now may not even be able to figure out how to hit this message, and if it gets broken, we won't know. This code seems to be fairly easy to break in subtle ways, so I think more test coverage is good.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Transaction control in procedures
On 11/14/17 18:38, Peter Eisentraut wrote:
> On 10/31/17 15:38, Peter Eisentraut wrote:
>> Here is a patch that implements transaction control in PL/Python procedures. (This patch goes on top of "SQL procedures" patch v1.)
>
> Here is an updated patch, now on top of "SQL procedures" v2.

Here is a new patch, now on top of master. The main changes are that a lot of documentation has been added.

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 1a5e5fde2c0da663cc010b3e19418d0b2141304b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Fri, 1 Dec 2017 14:40:29 -0500
Subject: [PATCH v3] Transaction control in PL procedures

In each of the supplied procedural languages (PL/pgSQL, PL/Perl, PL/Python, PL/Tcl), add language-specific commit and rollback functions/commands to control transactions in procedures in that language. Add similar underlying functions to SPI. Some additional cleanup so that transaction commit or abort doesn't blow away data structures still used by the procedure call. Add execution context tracking to CALL and DO statements so that transaction control commands can only be issued in top-level procedure and block calls, not function calls or other procedure or block calls.
---
 doc/src/sgml/plperl.sgml | 49 +
 doc/src/sgml/plpgsql.sgml | 86 +
 doc/src/sgml/plpython.sgml | 34
 doc/src/sgml/pltcl.sgml | 37
 doc/src/sgml/spi.sgml | 219 ++
 src/backend/commands/functioncmds.c | 30 ++-
 src/backend/executor/spi.c | 92 +++--
 src/backend/tcop/utility.c | 4 +-
 src/backend/utils/mmgr/portalmem.c | 39 ++--
 src/include/commands/defrem.h | 4 +-
 src/include/executor/spi.h | 5 +
 src/include/executor/spi_priv.h | 4 +
 src/include/nodes/nodes.h | 3 +-
 src/include/nodes/parsenodes.h | 7 +
 src/pl/plperl/GNUmakefile | 2 +-
 src/pl/plperl/SPI.xs | 12 ++
 src/pl/plperl/expected/plperl_transaction.out | 94 ++
 src/pl/plperl/plperl.c | 6 +
 src/pl/plperl/sql/plperl_transaction.sql | 88 +
 src/pl/plpgsql/src/pl_exec.c | 48 -
 src/pl/plpgsql/src/pl_funcs.c | 44 +
 src/pl/plpgsql/src/pl_gram.y | 34
 src/pl/plpgsql/src/pl_handler.c | 8 +
 src/pl/plpgsql/src/pl_scanner.c | 2 +
 src/pl/plpgsql/src/plpgsql.h | 22 ++-
 src/pl/plpython/Makefile | 1 +
 src/pl/plpython/expected/plpython_test.out | 4 +-
 src/pl/plpython/expected/plpython_transaction.out | 96 ++
 src/pl/plpython/plpy_main.c | 7 +-
 src/pl/plpython/plpy_plpymodule.c | 28 +++
 src/pl/plpython/sql/plpython_transaction.sql | 80
 src/pl/tcl/Makefile | 2 +-
 src/pl/tcl/expected/pltcl_transaction.out | 63 +++
 src/pl/tcl/pltcl.c | 44 +
 src/pl/tcl/sql/pltcl_transaction.sql | 60 ++
 src/test/regress/expected/plpgsql.out | 110 +++
 src/test/regress/sql/plpgsql.sql | 95 ++
 37 files changed, 1461 insertions(+), 102 deletions(-)
 create mode 100644 src/pl/plperl/expected/plperl_transaction.out
 create mode 100644 src/pl/plperl/sql/plperl_transaction.sql
 create mode 100644 src/pl/plpython/expected/plpython_transaction.out
 create mode 100644 src/pl/plpython/sql/plpython_transaction.sql
 create mode 100644 src/pl/tcl/expected/pltcl_transaction.out
 create mode 100644 src/pl/tcl/sql/pltcl_transaction.sql

diff --git a/doc/src/sgml/plperl.sgml b/doc/src/sgml/plperl.sgml
index 100162dead..82ddf26606 100644
--- a/doc/src/sgml/plperl.sgml
+++
b/doc/src/sgml/plperl.sgml @@ -661,6 +661,55 @@ Database Access from PL/Perl + + + + spi_commit() + + spi_commit + in PL/Perl + + + + spi_rollback() + + spi_rollback + in PL/Perl + + + + + Commit or roll back the current transaction. This can only be called + in a procedure or anonymous code block called from the top level. + (Note that it is not possible to run the SQL + commands COMMIT or ROLLBACK + via spi_exec_query or similar. It has to be done + using these functions.) After a transaction is ended, a new + transaction is automatically started, so there is no separate function + for that. + +
Re: [HACKERS] Custom compression methods
On Fri, Dec 1, 2017 at 10:18 AM, Tomas Vondra wrote:
> It has very little impact on this patch, as it has nothing to do with
> columnar storage. That is, each value is compressed independently.

I understand that this patch is not about columnar storage, but I think the idea that we may want to operate on the compressed data directly is not only applicable to that case.

> I agree with these thoughts in general, but I'm not quite sure what is
> your conclusion regarding the patch.

I have not reached one. Sometimes I like to discuss problems before deciding what I think. :-) It does seem to me that the patch may be aiming at a relatively narrow target in a fairly large problem space, but I don't know whether to label that as short-sightedness or prudent incrementalism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Would a BGW need shmem_access or database_connection to enumerate databases?
On Fri, Dec 1, 2017 at 10:04 AM, Chapman Flack wrote:
> Can I call RegisterDynamicBackgroundWorker when not in the postmaster,
> but also not in a "regular backend", but rather another BGW?

I believe that doing it from another BGW works fine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Dec 1, 2017 at 7:41 AM, Ashutosh Bapat wrote:
> This code creates plans where there are multiple Gather nodes under an Append
> node.

We should avoid that. Starting and stopping workers is inefficient, and precludes things like turning the Append into a Parallel Append.

> AFAIU, the workers assigned to one gather node can be reused until that
> Gather node finishes. Having multiple Gather nodes under an Append mean that
> every worker will be idle from the time that worker finishes the work till the
> last worker finishes the work.

No, workers will exit as soon as they finish. They don't hang around idle.

> index b422050..1941468 100644
> --- a/src/tools/pgindent/typedefs.list
> +++ b/src/tools/pgindent/typedefs.list
> @@ -2345,6 +2345,7 @@ UnlistenStmt
>  UnresolvedTup
>  UnresolvedTupData
>  UpdateStmt
> +UpperPathExtraData
>  UpperRelationKind
>  UpperUniquePath
>  UserAuth
>
> Do we commit this file as part of the feature?

Andres and I regularly commit such changes; Tom rejects them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Runtime Partition Pruning
On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson wrote:
> David Q1:
> postgres=# explain analyse execute ab_q1 (3,3); --const
>                          QUERY PLAN
> -----------------------------------------------------------
>  Append (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
>    -> Seq Scan on ab_a3_b3 (cost=0.00..43.90 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
>          Filter: ((a = 3) AND (b = 3))
>  Planning time: 0.588 ms
>  Execution time: 0.043 ms
> (5 rows)

I think the EXPLAIN ANALYZE output should show something attached to the Append node so that we can tell that partition pruning is in use. I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes" or if we can give a few more useful details.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] pow support for pgbench
On Fri, Dec 1, 2017 at 4:57 AM, Raúl Marín Rodríguez wrote:
> I've rebased the patch so it can be applied cleanly on top of current
> master.

Please add the new function into the documentation table in alphabetical order.

The fact that the return type is not consistently of one type bothers me. I'm not sure pgbench's expression language is a good place for runtime polymorphism -- SQL doesn't work that way.

+ /*
+  * pow() for integer values with exp >= 0. Matches SQL pow() behaviour
+  */

What's the name of the backend function whose behavior this matches?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
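[Editorial aside: the integer branch under discussion -- pow() for integer values with exp >= 0 -- is conventionally implemented with exponentiation by squaring. This is a minimal sketch, not the patch's actual code; in particular overflow handling, which the real patch would have to define, is deliberately omitted.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Integer pow() for exp >= 0 via exponentiation by squaring.
 * Runs in O(log exp) multiplications.  Overflow is not handled.
 */
static int64_t
ipow(int64_t base, int exp)
{
    int64_t result = 1;

    assert(exp >= 0);
    while (exp > 0)
    {
        if (exp & 1)            /* multiply in the current bit's power */
            result *= base;
        base *= base;           /* square for the next bit */
        exp >>= 1;
    }
    return result;
}
```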
Re: Use get_greatest_modulus() in partition_bounds_equal()
On Fri, Dec 1, 2017 at 5:13 AM, Ashutosh Bapat wrote:
> partition_bounds_equal() accesses the last datum from the given
> partition bounds directly to compare their greatest moduli. The rest of
> the code uses get_greatest_modulus() to get the greatest modulus from
> a partition bound. partition_bounds_equal() should do the same
> for the sake of consistency, and to make sure that the code remains sane
> if we change the way we store the greatest modulus in
> PartitionBoundInfoData in the future.

Makes sense. Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BUG #14941: Vacuum crashes
On 12/1/17, 11:16 AM, "Tomas Vondra" wrote:
>> The behavior I would like to see is that the vacuum ignores this table and
>> moves to another instead of being blocked.
>
> I believe autovacuum should not block waiting for a heavy-weight lock on
> a table since this commit that went into 9.1:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=32896c40ca766146312b28a5a0eb3f66ca0300ed
>
> So I'm wondering what problem you're running into.

It sounds like Lyes is doing a periodic database-wide VACUUM that is getting blocked on certain relations that are already locked (perhaps because autovacuum is already working on it). IIUC, the ask is to provide a way to skip vacuuming relations that cannot be immediately locked. There is already an internal flag called VACOPT_NOWAIT that gives autovacuum this ability in certain cases (introduced in the aforementioned commit), and I recently brought up this potential use-case as justification for a patch [0]. I'd be happy to submit a patch for providing VACOPT_NOWAIT to users if others think it's something we should do.

As a side note, this seems more like a feature request than a bug report, so I'm adding pgsql-hackers@ here, too.

Nathan

[0] https://www.postgresql.org/message-id/875815E8-7A99-4488-AA07-F254C404E2CF%40amazon.com
Re: Unclear regression test for postgres_fdw
On Fri, Dec 1, 2017 at 4:01 AM, Antonin Houska wrote:
> I see no other problems here.

Committed, thanks for the report and review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Transform for pl/perl
A few very minor comments while quickly paging through:

1. non-ASCII tests are going to cause problems on one platform or another. Please don't include that one.

2. error messages
   a) please use ereport() not elog()
   b) conform to style guidelines: errmsg() starts with lowercase; others are complete phrases (start with uppercase, end with period)
   c) replace "The type you was trying to transform can't be represented in JSONB" maybe with errmsg("could not transform to type \"%s\"", "jsonb"), errdetail("The type you are trying to transform can't be represented in JSON")
   d) same errmsg() for the other error; figure out a suitable errdetail.

3. whitespace: don't add newlines to while, DirectFunctionCallN, pnstrdup.

4. the "relocatability" test seems pointless to me.

5. "#undef _" looks bogus. Don't do it.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Dec 1, 2017 at 10:14 AM, Masahiko Sawada wrote:
> The patched is 10 times faster than current HEAD.

Nifty.

The first hunk in monitoring.sgml looks unnecessary. The second hunk breaks the formatting of the documentation; you need to adjust the "morerows" value from 9 to 8 here:

Lock

And similarly make this one 18:

IPC

+* Relation extension locks. The relation extension lock manager is
+specialized in relation extensions. In PostgreSQL 11 relation extension
+lock has been moved out of regular lock. It's similar to regular locks
+but doesn't have full dead lock detection, group locking and multiple
+lock modes. When conflicts occur, lock waits are implemented using
+condition variables.

Higher up, it says that "Postgres uses four types of interprocess locks", but because you added this, it's now a list of five items. I suggest moving the section on relation locks to the end and rewriting the text here as follows: Only one process can extend a relation at a time; we use a specialized lock manager for this purpose, which is much simpler than the regular lock manager. It is similar to the lightweight lock mechanism, but is even simpler because there is only one lock mode and only one lock can be taken at a time. A process holding a relation extension lock is interruptible, unlike a process holding an LWLock.

+#define RelExtLockTargetTagToIndex(relextlock_tag) \
+(tag_hash((const void *) relextlock_tag, sizeof(RelExtLockTag)) \
+% N_RELEXTLOCK_ENTS)

How about using a static inline function for this?

+#define SET_RELEXTLOCK_TAG(locktag, d, r) \
+((locktag).dbid = (d), \
+ (locktag).relid = (r))

How about getting rid of this and just doing the assignments instead?

+#define RELEXT_VAL_LOCK ((uint32) ((1 << 25)))
+#define RELEXT_LOCK_MASK((uint32) ((1 << 25)))

It seems confusing to have two macros for the same value and an almost-interchangeable purpose. Maybe just call it RELEXT_LOCK_BIT?

+RelationExtensionLockWaiterCount(Relation relation)

Hmm.
This is sort of problematic, because with the new design we have no guarantee that the return value is actually accurate. I don't think that's a functional problem, but the optics aren't great.

+if (held_relextlock.nLocks > 0)
+{
+RelExtLockRelease(held_relextlock.relid, true);
+}

Excess braces.

+int
+RelExtLockHoldingLockCount(void)
+{
+return held_relextlock.nLocks;
+}

Maybe IsAnyRelationExtensionLockHeld(), returning bool?

+/* If the lock is held by me, no need to wait */

If we already hold the lock, no need to wait.

+ * Luckily if we're trying to acquire the same lock as what we
+ * had held just before, we don't need to get the entry from the
+ * array by hashing.

We're not trying to acquire a lock here. "If the last relation extension lock we touched is the same one for which we now need to wait, we can use our cached pointer to the lock instead of recomputing it."

+registered_wait_list = true;

Isn't it really registered_wait_count? The only list here is encapsulated in the CV.

+/* Before retuning, decrement the wait count if we had been waiting */

retuning -> returning, but I'd rewrite this as "Release any wait count we hold."

+ * Acquire the relation extension lock. If we're trying to acquire the same
+ * lock as what already held, we just increment nLock locally and return
+ * without touching the RelExtLock array.

"Acquire a relation extension lock." I think you can forget the rest of this; it duplicates comments in the function body.

+ * Since we don't support dead lock detection for relation extension
+ * lock and don't control the order of lock acquisition, it cannot not
+ * happen that trying to take a new lock while holding an another lock.

Since we don't do deadlock detection, caller must not try to take a new relation extension lock while already holding them.
+if (relid == held_relextlock.relid)
+{
+held_relextlock.nLocks++;
+return true;
+}
+else
+elog(ERROR,
+ "cannot acquire relation extension locks for multiple relations at the same");

I'd prefer if (relid != held_relextlock.relid) elog(ERROR, ...) to save a level of indentation for the rest.

+ * If we're trying to acquire the same lock as what we just released
+ * we don't need to get the entry from the array by hashing. we expect
+ * to happen this case because it's a common case in acquisition of
+ * relation extension locks.

"If the last relation extension lock we touched is the same one which we now need to acquire, we can use our cached pointer to the lock instead of recomputing it. This is likely to be a common case in practice."

+/* Could not got the lock, return iff in conditional locking */

"locking conditionally"

+ConditionVariableSleep(&(relextlock->cv),
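[Editorial aside: as a sketch of the static-inline suggestion above, this is roughly what the RelExtLockTargetTagToIndex() macro could look like as an inline function. The hash function here is a stand-in for the backend's tag_hash(), and N_RELEXTLOCK_ENTS, Oid, and RelExtLockTag are simplified re-declarations for illustration only.]

```c
#include <assert.h>
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024

typedef uint32_t Oid;            /* stand-in for the backend's Oid */

typedef struct RelExtLockTag
{
    Oid dbid;                    /* database OID, so OIDs can collide across DBs */
    Oid relid;                   /* relation OID */
} RelExtLockTag;

/*
 * Stand-in for tag_hash(); any reasonable mixing function works here,
 * since two tags sharing a slot is only a (false-contention) performance
 * issue, never a correctness issue.
 */
static inline uint32_t
relext_hash(const RelExtLockTag *tag)
{
    uint64_t h = ((uint64_t) tag->dbid << 32) | tag->relid;

    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    return (uint32_t) h;
}

/* Inline-function form of the macro: map a tag onto the fixed array. */
static inline uint32_t
RelExtLockTargetTagToIndex(const RelExtLockTag *tag)
{
    return relext_hash(tag) % N_RELEXTLOCK_ENTS;
}
```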
Re: Transform for pl/perl
Robert Haas writes:
> On Fri, Dec 1, 2017 at 12:30 AM, Michael Paquier wrote:
>> Patch moved to CF 2018-01. Perhaps a committer will look at it at some point.
>
> FWIW, although I like the idea, I'm not going to work on the patch. I
> like Perl, but I neither know its internals nor use plperl. I hope
> someone else will be interested.

I've been assuming (and I imagine other committers think likewise) that Peter will pick this up at some point, since the whole transform feature was his work to begin with. If he doesn't want to touch it, maybe he should say so explicitly so that other people will feel free to take it.

regards, tom lane
Re: [HACKERS] INSERT ON CONFLICT and partitioned tables
On Fri, Dec 1, 2017 at 2:44 AM, Amit Langote wrote:
> I forgot to consider the fact that mtstate could be NULL in
> ExecSetupPartitionTupleRouting(), so would result in dereferencing NULL
> pointer when called from CopyFrom(), which is fixed in the attached updated
> patch.

a ? b : false can more simply be spelled a && b. Committed after changing it like that, fixing the broken documentation build, and making minor edits to the comments and documentation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Issues with logical replication
On 30/11/17 11:48, Simon Riggs wrote:
> On 30 November 2017 at 11:30, Petr Jelinek wrote:
>> On 30/11/17 00:47, Andres Freund wrote:
>>> On 2017-11-30 00:45:44 +0100, Petr Jelinek wrote:
>>>> I don't understand. I mean sure the SnapBuildWaitSnapshot() can live with
>>>> it, but the problematic logic happens inside the XactLockTableInsert() and
>>>> SnapBuildWaitSnapshot() has no way of detecting the situation short of
>>>> reimplementing the XactLockTableInsert() instead of calling it.
>>>
>>> Right. But we fairly trivially can change that. I'm remarking on it
>>> because other people's, not yours, suggestions aimed at making this
>>> bulletproof. I just wanted to make clear that I don't think that's
>>> necessary at all.
>>
>> Okay, then I guess we are in agreement. I can confirm that the attached
>> fixes the issue in my tests. Using SubTransGetTopmostTransaction()
>> instead of SubTransGetParent() means 3 more ifs in terms of extra CPU
>> cost for other callers. I don't think it's worth worrying about given we
>> are waiting for a heavyweight lock, but if we did we can just inline the
>> code directly into SnapBuildWaitSnapshot().
>
> This will still fail an Assert in TransactionIdIsInProgress() when
> snapshots are overflowed.

Hmm, which one, why? I see 2 Asserts there. One is:

> Assert(nxids == 0);

which is inside the RecoveryInProgress() branch; surely on standbys there will still be no PGXACTs with assigned xids, so that should be fine. The other one is:

> Assert(TransactionIdIsValid(topxid));

which should again be fine: the toplevel xid of a toplevel xid is the same xid, which is a valid one. So I think we should be fine there.

--
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi wrote:
> On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas wrote:
>> OK, committed. This is a good example of how having good code
>> coverage doesn't necessarily mean you've found all the bugs. :-)
>
> As of now partition_join.sql is not having test cases covering cases
> where partition table have default partition, attaching a small test
> case patch to cover those.

That's not that small, and to me it looks like overkill.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw super user checks
On Fri, Dec 1, 2017 at 12:31 AM, Michael Paquier wrote:
> I am moving this patch to next CF 2018-01.

There now seems to be a consensus for superuser -> superuser_arg rather than what Jeff did originally; that approach has 4 votes and nothing else has more than 1. So, here's a patch that does it that way.

I tried to see if some documentation update was needed, but I think the documentation already reflects the proposed new behavior. It says: Only superusers may connect to foreign servers without password authentication, so always specify the password option for user mappings belonging to non-superusers. Currently, however, that's not accurate. Right now you need to specify the password option for user mappings that will be *used by* non-superusers, not user mappings *belonging to* non-superusers. So this patch is, I think, just making the actual behavior match the documented behavior.

Not sure if anyone has any other suggestions here. I think this is definitely a master-only change; should we try to insert some kind of warning into the back-branch docs? I definitely think this should be called out in the v11 release notes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

postgres-fdw-superuser.patch
Description: Binary data
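[Editorial aside: a hedged sketch of the superuser -> superuser_arg change being voted on -- the password requirement keys off the *owner* of the user mapping rather than the current user. The UserMapping/Oid types and the superuser_arg() behavior below are simplified stand-ins for illustration, not the real postgres_fdw code.]

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;                   /* stand-in for the backend's Oid */

typedef struct UserMapping
{
    Oid userid;                             /* role the mapping belongs to */
} UserMapping;

#define BOOTSTRAP_SUPERUSERID 10            /* assumption: only OID 10 is a superuser */

/* Stand-in for the backend's superuser_arg(): is the *given* role a superuser? */
static bool
superuser_arg(Oid roleid)
{
    return roleid == BOOTSTRAP_SUPERUSERID;
}

/*
 * Require a password unless the mapping's owner is a superuser.  The old
 * behavior effectively checked superuser() -- i.e. the *current* user --
 * instead of the mapping's owner.
 */
static bool
password_required(const UserMapping *um, bool mapping_has_password)
{
    if (superuser_arg(um->userid))
        return false;                       /* superuser-owned mappings may omit it */
    return !mapping_has_password;
}
```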
Re: Transform for pl/perl
On Fri, Dec 1, 2017 at 12:30 AM, Michael Paquier wrote:
> On Tue, Nov 28, 2017 at 5:14 PM, Aleksander Alekseev wrote:
>> The new status of this patch is: Ready for Committer
>
> Patch moved to CF 2018-01. Perhaps a committer will look at it at some point.

FWIW, although I like the idea, I'm not going to work on the patch. I like Perl, but I neither know its internals nor use plperl. I hope someone else will be interested.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] SQL procedures
On 2 December 2017 at 01:31, Peter Eisentraut wrote:
> On 11/30/17 15:50, Thomas Munro wrote:
>> postgres=# \df
>>                        List of functions
>>  Schema | Name | Result data type | Argument data types | Type
>> --------+------+------------------+---------------------+------
>>  public | bar  | integer          | i integer           | func
>>  public | foo  |                  | i integer           | proc
>> (2 rows)
>>
>> Should this now be called a "List of routines"?
>
> Maybe, but I hesitate to go around and change all mentions of "function"
> like that. That might just confuse people.

Agreed.

\dfn shows functions only, so we might want to consider having

\df say "List of functions and procedures"
\dfn say "List of functions"

and a new option to list procedures only

\dfp say "List of procedures"

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Constraint exclusion for partitioned tables
On Fri, Dec 1, 2017 at 12:21 AM, Michael Paquier wrote:
> On Wed, Sep 13, 2017 at 4:07 PM, Ashutosh Bapat wrote:
>> For a partitioned table, this patch saves the time to run constraint
>> exclusion on all the partitions if constraint exclusion succeeds on
>> the partitioned table. If constraint exclusion fails, we have wasted
>> CPU cycles on one run of constraint exclusion. The difference between
>> the time spent in the two scenarios increases with the number of
>> partitions. Practically, users will have a handful of partitions rather
>> than a couple, and thus the overhead of running constraint exclusion on
>> the partitioned table would be justified given the time it will save
>> when CE succeeds.
>
> Moved patch to next CF.

Committed after adding a comment. Generally, code changes should be accompanied by comment updates.

I tested this and found out that this is quite useful for cases where multiple levels of partitioning are in use. Consider creating 100 partitions like this:

#!/usr/bin/perl
use strict;
use warnings;
print "create table foo (a int, b int, c text) partition by list (a);\n";
for $a (1..10)
{
    print "create table foo$a partition of foo for values in ($a) partition by list (b);\n";
    for $b (1..10)
    {
        print "create table foo${a}_$b partition of foo$a for values in ($b);\n";
    }
}

Then consider this query: select * from foo where a = 5;

Without this patch, we have to reject 90 leaf partitions individually, but with the patch, we can reject the intermediate partitioned tables; each time we do, it substitutes for rejecting 10 children individually. This seems to me to be a case that is quite likely to come up in the real world.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] proposal: psql command \graw
2017-12-01 16:36 GMT+01:00 Robert Haas:
> On Fri, Dec 1, 2017 at 12:16 AM, Michael Paquier wrote:
>> On Sat, Nov 11, 2017 at 12:57 AM, Pavel Stehule wrote:
>>> 2017-11-10 16:38 GMT+01:00 Fabien COELHO:
>>>> So I switched the patch to "ready for committer".
>>>
>>> Thank you very much
>>
>> Patch moved to CF 2018-01 with same status: ready for committer.
>
> I vote to reject this patch. It doesn't do anything that you can't
> already do; it just adds some syntactic sugar. And that syntactic
> sugar saves only a handful of keystrokes. If you want unaligned,
> tuples-only mode, you can set it in 5 keystrokes:
>
> rhaas=# \a\t
> Output format is unaligned.
> Tuples only is on.

That is true when you don't want to see column names. For some usages - like plotting via gnuplot - you would want to see column names, and that cannot be done with the current settings.

> If you use this command, it takes 4 keystrokes; instead of ending your
> command with a semicolon (1 character) you end it with \graw (5
> characters).
>
> Now, granted, \graw lets you set those options for a single command
> rather than persistently, but I'm just not very interested in having a
> bunch of \g options that enable various combinations of
> options. Soon we'll have a thicket of \g variants that force
> whichever combinations of options particular developers like to use,
> and if \graw is any indication, the \g variant won't
> necessarily look anything like the normal way of setting those
> options. And that's not easy to fix, either: \graw could be spelled
> \gat since it forces \a on and \t on, but somebody's bound to
> eventually propose a variant that sets an option that has no
> single-character shorthand.
>
> I'm not going to be bitterly unhappy if somebody else decides to
> commit this, but to me it looks like it gains us very little.

This feature is marginal, but I don't think it is without value. It works well with gnuplot - and I hope that not only MacOS users will have a terminal with graphics capabilities - and then with this feature one can emulate a notebook interface.

Regards

Pavel

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
builddoc.pl for Windows outdated
src/tools/msvc/builddoc.pl and the associated documentation (in install-windows.sgml) are quite outdated, even for PG10. If someone has the required knowledge, please consider supplying an update.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] proposal: psql command \graw
On Fri, Dec 1, 2017 at 12:16 AM, Michael Paquier wrote:
> On Sat, Nov 11, 2017 at 12:57 AM, Pavel Stehule wrote:
>> 2017-11-10 16:38 GMT+01:00 Fabien COELHO:
>>> So I switched the patch to "ready for committer".
>>
>> Thank you very much
>
> Patch moved to CF 2018-01 with same status: ready for committer.

I vote to reject this patch. It doesn't do anything that you can't already do; it just adds some syntactic sugar. And that syntactic sugar saves only a handful of keystrokes. If you want unaligned, tuples-only mode, you can set it in 5 keystrokes:

rhaas=# \a\t
Output format is unaligned.
Tuples only is on.

If you use this command, it takes 4 keystrokes; instead of ending your command with a semicolon (1 character) you end it with \graw (5 characters).

Now, granted, \graw lets you set those options for a single command rather than persistently, but I'm just not very interested in having a bunch of \g options that enable various combinations of options. Soon we'll have a thicket of \g variants that force whichever combinations of options particular developers like to use, and if \graw is any indication, the \g variant won't necessarily look anything like the normal way of setting those options. And that's not easy to fix, either: \graw could be spelled \gat since it forces \a on and \t on, but somebody's bound to eventually propose a variant that sets an option that has no single-character shorthand.

I'm not going to be bitterly unhappy if somebody else decides to commit this, but to me it looks like it gains us very little.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Custom compression methods
On 12/01/2017 03:23 PM, Robert Haas wrote:
> On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra wrote:
>> OK. I think it's a nice use case (and nice gains on the compression
>> ratio), demonstrating the datatype-aware compression. The question is
>> why shouldn't this be built into the datatypes directly?
>
> Tomas, thanks for running benchmarks of this. I was surprised to see
> how little improvement there was from other modern compression
> methods, although lz4 did appear to be a modest win on both size and
> speed. But I share your intuition that a lot of the interesting work
> is in datatype-specific compression algorithms. I have noticed in a
> number of papers that I've read that teaching other parts of the
> system to operate directly on the compressed data, especially for
> column stores, is a critical performance optimization; of course, that
> only makes sense if the compression is datatype-specific. I don't
> know exactly what that means for the design of this patch, though.

It has very little impact on this patch, as it has nothing to do with columnar storage. That is, each value is compressed independently.

Column stores exploit the fact that they get a vector of values, compressed in some data-aware way. E.g. some form of RLE or dictionary compression, which allows them to evaluate expressions on the compressed vector. But that's irrelevant here, we only get row-by-row execution.

Note: The idea to build a dictionary for the whole jsonb column (which this patch should allow) does not make it "columnar compression" in the "column store" way. The executor will still get the decompressed value.

> As a general point, no matter which way you go, you have to somehow
> deal with on-disk compatibility. If you want to build in compression
> to the datatype itself, you need to find at least one bit someplace to
> mark the fact that you applied built-in compression.
> If you want to
> build it in as a separate facility, you need to denote the compression
> used someplace else. I haven't looked at how this patch does it, but
> the proposal in the past has been to add a value to vartag_external.

AFAICS the patch does that by setting a bit in the varlena header, and then adding the OID of the compression method after the varlena header. So you get (varlena header + OID + data).

This has good and bad consequences.

Good: It's transparent for the datatype, so it does not have to worry about the custom compression at all (and it may change arbitrarily).

Bad: It's transparent for the datatype, so it can't operate directly on the compressed representation.

I don't think this is an argument against the patch, though. If the datatype can support intelligent compression (and execution without decompression), it has to be done in the datatype anyway.

> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself. It's realistically hard to
> think of a data type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following. However, you'd have to come up with a different trick for
> each data type. Piggybacking on the TOAST machinery avoids that. It
> also implies that we only try to compress values that are "big", which
> is probably desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
> All of that is a somewhat separate question from whether we should
> have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
> the ACCESS METHOD stuff instead). Even if we agree that piggybacking
> on TOAST is a good way to implement pluggable compression methods, it
> doesn't follow that the compression method is something that should be
> attached to the datatype from the outside; it could be built into it
> in a deep way. For example, "packed" varlenas (1-byte header) are a
> form of compression, and the default functions for detoasting always
> produce unpacked values, but the operators for the text data type
> know how to operate on the packed representation. That's sort of a
> trivial example, but it might well be that there are other cases where
> we can do something similar. Maybe jsonb, for example, can compress
> data in such a way that some of the jsonb functions can operate
> directly on the compressed representation -- perhaps the number of
> keys is easily visible, for example, or maybe more. In this view of
> the world, each data type
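[Editorial aside: to make the layout discussed above concrete -- (varlena header + compression-method OID + data) -- here is a hedged sketch as a plain struct. The real varlena representation in postgres.h is accessed through bit-twiddling macros, not a struct like this; the names and the "one spare bit" flag are illustrative assumptions only.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;                   /* stand-in for the backend's Oid */

/*
 * Illustrative shape of a custom-compressed datum: the length word has
 * one spare bit repurposed to say "custom compression follows", and the
 * compression method's OID sits between the header and the payload.
 */
typedef struct CustomCompressedVarlena
{
    uint32_t va_header;                 /* length word; spare bit flags custom compression */
    Oid      cmoid;                     /* OID of the compression method used */
    char     data[];                    /* compressed payload follows */
} CustomCompressedVarlena;

/* The cost of the scheme is a fixed per-value overhead for the OID. */
enum { CUSTOM_COMPRESSION_OVERHEAD = offsetof(CustomCompressedVarlena, data) };
```

One design consequence visible even in the sketch: the overhead is paid per value, which is why tying this to TOAST (i.e. only to "big" values) keeps it cheap.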
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Fri, Dec 1, 2017 at 10:26 AM, Masahiko Sawada wrote: > On Fri, Dec 1, 2017 at 3:04 AM, Robert Haas wrote: >> On Thu, Nov 30, 2017 at 6:20 AM, Masahiko Sawada >> wrote: This code ignores the existence of multiple databases; RELEXTLOCK contains a relid, but no database OID. That's easy enough to fix, but it actually causes no problem unless, by bad luck, you have two relations with the same OID in different databases that are both being rapidly extended at the same time -- and even then, it's only a performance problem, not a correctness problem. In fact, I wonder if we shouldn't go further: instead of creating these RELEXTLOCK structures dynamically, let's just have a fixed number of them, say 1024. When we get a request to take a lock, hash the (database, relid) pair and take the result modulo 1024; lock the RELEXTLOCK at that offset in the array. >>> >>> Attached the latest patch incorporating comments except for the fix of >>> the memory size for relext lock. >> >> It doesn't do anything about the comment of mine quoted above. > > Sorry I'd missed the comment. > >> Since it's only possible to hold one relation extension lock at a time, we >> don't really need the hash table here at all. We can just have an >> array of 1024 or so locks and map every (database, relid) pair on to one of >> them by hashing. The worst thing we'll get is some false contention, >> but that doesn't seem awful, and it would permit considerable further >> simplification of this code -- and maybe make it faster in the >> process, because we'd no longer need the hash table, or the pin count, >> or the extra LWLocks that protect the hash table. All we would have >> is atomic operations manipulating the lock state, which seems like it >> would be quite a lot faster and simpler. > > Agreed. With this change, we will have an array of the struct that has > lock state and cv. The lock state has the wait count as well as the > status of the lock. 
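The fixed-array mapping proposed above can be sketched roughly as follows. The hash and struct here are invented for illustration (PostgreSQL itself would use hash_any on the lock tag and a pg_atomic_uint32 for the state; the real patch's layout may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-size table: false sharing between unrelated relations is possible
 * but only costs performance, never correctness. */
#define N_RELEXTLOCK_ENTS 1024

typedef struct RelExtLock
{
    uint32_t state;             /* would be a pg_atomic_uint32 for real */
} RelExtLock;

static RelExtLock RelExtLockArray[N_RELEXTLOCK_ENTS];

/* Toy hash combine of (database OID, relation OID). */
static uint32_t
relextlock_index(uint32_t dbid, uint32_t relid)
{
    uint32_t h = dbid * 0x9e3779b9u;

    h ^= relid + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h % N_RELEXTLOCK_ENTS;
}

/* Every (dbid, relid) pair deterministically maps to one slot, so no
 * dynamic allocation, hash table, or protective LWLocks are needed. */
static RelExtLock *
relextlock_slot(uint32_t dbid, uint32_t relid)
{
    return &RelExtLockArray[relextlock_index(dbid, relid)];
}
```

Two different relations may hash to the same slot and contend falsely, which is the "worst thing" the quoted text accepts as a deliberate trade for simplicity.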
> >> BTW, I think RelExtLockReleaseAll is broken because it shouldn't >> HOLD_INTERRUPTS(); I also think it's kind of silly to loop here when >> we know we can only hold one lock. Maybe RelExtLockRelease can take >> bool force and do if (force) held_relextlock.nLocks = 0; else >> held_relextlock.nLocks--. Or, better yet, have the caller adjust that >> value and then only call RelExtLockRelease() if we needed to release >> the lock in shared memory. That avoids needless branching. > > Agreed. I'd vote for the latter. > >> On a >> related note, is there any point in having both held_relextlock.nLocks >> and num_held_relextlocks? > > num_held_relextlocks is actually unnecessary, will be removed. > >> I think RelationExtensionLock should be a new type of IPC wait event, >> rather than a whole new category. > > Hmm, I thought the IPC wait event types seem related to events where > a process communicates with other processes for the same purpose, for example > parallel query, synchronous replication, etc. On the other hand, the relation > extension locks are one kind of lock mechanism. That's why I added > a new category. But maybe it can fit into the IPC wait event category. > Attached updated patch. I've done a performance measurement again on the same configuration as before since the acquiring/releasing procedures have been changed. 
- PATCHED -
tps = 162.579320 (excluding connections establishing)
tps = 162.144352 (excluding connections establishing)
tps = 160.659403 (excluding connections establishing)
tps = 161.213995 (excluding connections establishing)
tps = 164.560460 (excluding connections establishing)

- HEAD -
tps = 157.738645 (excluding connections establishing)
tps = 146.178575 (excluding connections establishing)
tps = 143.788961 (excluding connections establishing)
tps = 144.886594 (excluding connections establishing)
tps = 145.496337 (excluding connections establishing)

* micro-benchmark
PATCHED = 1.61757e+07 (cycles/sec)
HEAD = 1.48685e+06 (cycles/sec)

The patched version is about 10 times faster than current HEAD.

Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..7aa7981 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
       Heavyweight locks, also known as lock manager locks or simply locks,
       primarily protect SQL-visible objects such as tables. However,
       they are also used to ensure mutual exclusion for certain internal
-      operations such as relation extension. wait_event will
-      identify the type of lock awaited.
+      operations such as waiting for a transaction to finish.
+      wait_event will identify the type of lock awaited.
@@
Re: [HACKERS] SQL procedures
On Fri, Dec 1, 2017 at 9:31 AM, Peter Eisentraut wrote:
> On 11/30/17 15:50, Thomas Munro wrote:
>> postgres=# \df
>>                           List of functions
>>  Schema | Name | Result data type | Argument data types | Type
>> --------+------+------------------+---------------------+------
>>  public | bar  | integer          | i integer           | func
>>  public | foo  |                  | i integer           | proc
>> (2 rows)
>>
>> Should this now be called a "List of routines"?
>
> Maybe, but I hesitate to go around and change all mentions of "function"
> like that. That might just confuse people.

Yeah, this is not unlike the problems we have deciding whether to say "relation" or "table". It's a problem that comes up when most foos are bars but there are multiple types of exotic foo that are not bars. That's pretty much the case here -- most functions are probably just functions, but a few might be procedures or aggregates. I think leaving this and similar cases as "functions" is fine. I wonder whether it was really necessary for the SQL standards committee (or Oracle) to invent both procedures and functions to represent very similar things, but they did. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Doc tweak for huge_pages?
On 11/30/17 23:35, Thomas Munro wrote: > On Fri, Dec 1, 2017 at 5:04 PM, Justin Pryzby wrote: >> On Fri, Dec 01, 2017 at 04:01:24PM +1300, Thomas Munro wrote: >>> Hi hackers, >>> >>> The manual implies that only Linux can use huge pages. That is not >>> true: FreeBSD, Illumos and probably others support larger page sizes >>> using transparent page coalescing algorithms. On my FreeBSD box >>> procstat -v often shows PostgreSQL shared buffers in "S"-flagged >>> memory. I think we should adjust the manual to make clear that it's >>> the *explicit request for huge pages* that is supported only on Linux >>> (and hopefully soon Windows). Am I being too pedantic? >> >> I suggest to remove "other" and include Linux in the enumeration, since it >> also >> supports "transparent" hugepages. > > Hmm. Yeah, it does, but apparently it's not so transparent. So if we > mention that we'd better indicate in the same paragraph that you > probably don't actually want to use it. How about the attached? Part of the confusion is that the huge_pages setting is only for shared memory, whereas the kernel settings affect all memory. Is the same true for the proposed Windows patch? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Use of uninitialized variables in ExecFindPartition() for parent partition without leaves (HEAD only)
On Thu, Nov 30, 2017 at 11:54 PM, Amit Langote wrote: >> I have added a CF entry for this thread by the way >> (https://commitfest.postgresql.org/16/1392/), and marked the thing as >> ready for committer as we agree about the fix. Let's track properly >> this issue until it gets committed. > > Yeah, thanks. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Would a BGW need shmem_access or database_connection to enumerate databases?
On 11/29/2017 05:48 PM, Chapman Flack wrote: > I'm thinking of writing a background worker that will enumerate > the databases present, and spin off, for each one, another BGW > that will establish a connection and do stuff. Can I even do this? "Unlike RegisterBackgroundWorker, which can only be called from within the postmaster, RegisterDynamicBackgroundWorker must be called from a regular backend." Can I call RegisterDynamicBackgroundWorker when not in the postmaster, but also not in a "regular backend", but rather another BGW? Put another way, can a BGW be regarded as a "regular backend" for the purpose of this rule? -Chap
Re: Allowing SSL connection of v11 client to v10 server with SCRAM channel binding
On 11/30/17 00:36, Michael Paquier wrote: > On Wed, Nov 29, 2017 at 1:04 AM, Peter Eisentraut > wrote: >> On 11/22/17 21:08, Michael Paquier wrote: >>> Yes, agreed. This patch looks good to me. In fe-auth-scram.c, it would >>> be also nice to add a comment to keep in sync the logics in >>> build_client_first_message() and build_client_final_message() which >>> assign the cbind flag value. >> >> Could you clarify what comment you would like to have added or changed? > > Sure. Here is with the attached patch what I have in mind. The way > cbind-flag is assigned in the client-first message should be kept > in-sync with the way the client-final message builds the binding data > in c=. It could be possible to add more sanity-checks based on > assertions by keeping track of the cbind-flag assigned in the > client-first message as your upthread patch is doing in the backend > code, but I see a simple comment as a sufficient reminder. Committed with that comment, thanks. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Thu, Nov 30, 2017 at 11:39 PM, David Rowley wrote: > I feel like we could do better here with little extra effort. The > DETACH index feature does not really seem required for this patch. Because of the dump/restore considerations mentioned in http://postgr.es/m/ca+tgmobuhghg9v8saswkhbbfywg5a0qb+jgt0uovq5ycbdu...@mail.gmail.com I believe we need a way to create the index on the parent without immediately triggering index builds on the children, plus a way to create an index on a child after-the-fact and attach it to the parent. Detach isn't strictly required, but why support attach and not detach? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [POC] Faster processing at Gather node
On Sun, Nov 26, 2017 at 3:15 AM, Amit Kapila wrote: > Yeah and I think something like that can happen after your patch > because now the memory for tuples returned via TupleQueueReaderNext > will be allocated in ExecutorState and that can last for long. I > think it is better to free memory, but we can leave it as well if you > don't feel it important. In any case, I have written a patch, see if > you think it makes sense. Well, I don't really know. My intuition is that in most cases after ExecShutdownGatherMergeWorkers() we will very shortly thereafter call ExecutorEnd() and everything will go away. Maybe that's wrong, but Tom put that call where it is in 2d44c58c79aeef2d376be0141057afbb9ec6b5bc, and he could have put it inside ExecShutdownGatherMergeWorkers() instead. Now maybe he didn't consider that approach, but Tom is usually smart about stuff like that... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
On Fri, Dec 1, 2017 at 1:58 AM, Alvaro Herrera wrote: > On a *very* quick look, please use an enum to return from NextCopyFrom > rather than 'int'. The chunks that change bool to int are very > odd-looking. This would move the comment that explains the value from > copy.c to copy.h, obviously. Also, you seem to be using non-ASCII dashes > in the descriptions of those values; please don't. I will fix it, thank you. > > Or maybe I misunderstood the patch completely. > I hope so. Here are my thoughts on how it all works; please correct me where I am wrong: 1) First, I have simply changed the ereport level to WARNING for specific validations (extra or missing columns, etc.) if the IGNORE_ERRORS option is used. All these checks are inside NextCopyFrom. Thus, this patch performs here pretty much the same as before, except that it is possible to skip bad lines, and this part should be safe as well. 2) About PG_TRY/CATCH. I use it to catch only one specific function call inside NextCopyFrom--InputFunctionCall--which is used just to parse a datatype value from the input string. I have no idea how WAL write or trigger errors could get here. All of this is done before actually forming a tuple, putting it into the heap, firing insert-related triggers, etc. I am not trying to catch all errors during row processing, only input data errors. So why is it unsafe? Best, Alexey
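The narrow error-trap scope described above can be illustrated outside PostgreSQL with setjmp/longjmp standing in for ereport/PG_TRY. All names here are made up; the point is only that the guard covers just the input-parsing call, so a malformed field skips the line instead of aborting the whole load:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

static jmp_buf parse_error;     /* stands in for PostgreSQL's error stack */

/* Parse an integer field; "throws" on malformed input, the way an input
 * function would ereport(ERROR, ...). */
static long
parse_int_or_throw(const char *s)
{
    char *end;
    long v = strtol(s, &end, 10);

    if (end == s || *end != '\0')
        longjmp(parse_error, 1);
    return v;
}

/* The guard covers only the parse call, mirroring a PG_TRY wrapped
 * around just InputFunctionCall.  Returns 1 on success, 0 if the
 * line is skipped. */
static int
copy_one_field(const char *s, long *out)
{
    if (setjmp(parse_error) != 0)
        return 0;               /* bad input: skip, keep copying */
    *out = parse_int_or_throw(s);
    return 1;
}
```

Because nothing else (tuple formation, heap insertion, triggers, WAL) runs inside the guard, only parse errors can be swallowed -- which is the safety argument being made in the email.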
Re: [HACKERS] SQL procedures
On 11/30/17 15:50, Thomas Munro wrote:
> postgres=# \df
>                           List of functions
>  Schema | Name | Result data type | Argument data types | Type
> --------+------+------------------+---------------------+------
>  public | bar  | integer          | i integer           | func
>  public | foo  |                  | i integer           | proc
> (2 rows)
>
> Should this now be called a "List of routines"?

Maybe, but I hesitate to go around and change all mentions of "function" like that. That might just confuse people. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_basebackup --progress output for batch execution
On 11/21/17 05:16, Martín Marqués wrote: > El 21/11/17 a las 04:56, Arthur Zakirov escribió: >> On Mon, Nov 20, 2017 at 04:45:48PM -0300, Martín Marqués wrote: >>> New version of patch, without the --batch-mode option and using isatty() >>> >> >> Great! >> >>> + fprintf(stderr, "waiting for checkpoint"); >>> + if (isatty(fileno(stderr))) >>> + fprintf(stderr, "\n"); >>> + else >>> + fprintf(stderr, "\r"); >> >> Here the condition should be inverted I think. The next condition should be >> used: >> >>> if (!isatty(fileno(stderr))) >>> ... >> >> Otherwise pg_basebackup will insert '\n' in terminal instead '\r'. > > Ups! Attached the corrected version.:) > > Nice catch. I completely missed that. Thanks. Committed. I switched the if logic around even more, so that it is ! if (isatty(fileno(stderr))) ! fprintf(stderr, "\r"); ! else ! fprintf(stderr, "\n"); instead of ! if (!isatty(fileno(stderr))) ! fprintf(stderr, "\n"); ! else ! fprintf(stderr, "\r"); -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
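A minimal standalone sketch of the committed behavior (an assumed shape, not the actual pg_basebackup code): on a terminal the progress line is overwritten in place with a carriage return, while redirected output gets one line per update:

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Separator choice isolated in one place so it is easy to test. */
static char
progress_separator(int is_terminal)
{
    /* terminal: overwrite the line in place; pipe/file: one line each */
    return is_terminal ? '\r' : '\n';
}

/* Hypothetical report function in the spirit of the committed logic. */
static void
progress_report(const char *msg)
{
    fprintf(stderr, "%s%c", msg, progress_separator(isatty(fileno(stderr))));
}
```

Putting the positive `isatty()` case first, as Peter's committed version does, keeps the condition free of the easy-to-miss `!` that caused the bug in the first review round.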
Re: [HACKERS] Custom compression methods
On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra wrote: > OK. I think it's a nice use case (and nice gains on the compression > ratio), demonstrating the datatype-aware compression. The question is > why shouldn't this be built into the datatypes directly? Tomas, thanks for running benchmarks of this. I was surprised to see how little improvement there was from other modern compression methods, although lz4 did appear to be a modest win on both size and speed. But I share your intuition that a lot of the interesting work is in datatype-specific compression algorithms. I have noticed in a number of papers that I've read that teaching other parts of the system to operate directly on the compressed data, especially for column stores, is a critical performance optimization; of course, that only makes sense if the compression is datatype-specific. I don't know exactly what that means for the design of this patch, though. As a general point, no matter which way you go, you have to somehow deal with on-disk compatibility. If you want to build in compression to the datatype itself, you need to find at least one bit someplace to mark the fact that you applied built-in compression. If you want to build it in as a separate facility, you need to denote the compression used someplace else. I haven't looked at how this patch does it, but the proposal in the past has been to add a value to vartag_external. One nice thing about the latter method is that it can be used for any data type generically, regardless of how much bit-space is available in the data type representation itself. It's realistically hard to think of a data-type that has no bit space available anywhere but is still subject to data-type specific compression; bytea definitionally has no bit space but also can't benefit from special-purpose compression, whereas even something like text could be handled by starting the varlena with a NUL byte to indicate compressed data following. 
However, you'd have to come up with a different trick for each data type. Piggybacking on the TOAST machinery avoids that. It also implies that we only try to compress values that are "big", which is probably desirable if we're talking about a kind of compression that makes comprehending the value slower. Not all types of compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38, and for those that don't it probably makes more sense to just build it into the data type. All of that is a somewhat separate question from whether we should have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using the ACCESS METHOD stuff instead). Even if we agree that piggybacking on TOAST is a good way to implement pluggable compression methods, it doesn't follow that the compression method is something that should be attached to the datatype from the outside; it could be built into it in a deep way. For example, "packed" varlenas (1-byte header) are a form of compression, and the default functions for detoasting always produced unpacked values, but the operators for the text data type know how to operate on the packed representation. That's sort of a trivial example, but it might well be that there are other cases where we can do something similar. Maybe jsonb, for example, can compress data in such a way that some of the jsonb functions can operate directly on the compressed representation -- perhaps the number of keys is easily visible, for example, or maybe more. In this view of the world, each data type should get to define its own compression method (or methods) but they are hard-wired into the datatype and you can't add more later, or if you do, you lose the advantages of the hard-wired stuff. BTW, another related concept that comes up a lot in discussions of this area is that we could do a lot better compression of columns if we had some place to store a per-column dictionary. I don't really know how to make that work. 
We could have a catalog someplace that stores an opaque blob for each column configured to use a compression method, and let the compression method store whatever it likes in there. That's probably fine if you are compressing the whole table at once and the blob is static thereafter. But if you want to update that blob as you see new column values there seem to be almost insurmountable problems. To be clear, I'm not trying to load this patch down with a requirement to solve every problem in the universe. On the other hand, I think it would be easy to beat a patch like this into shape in a fairly mechanical way and then commit-and-forget. That might be leaving a lot of money on the table; I'm glad you are thinking about the bigger picture and hope that my thoughts here somehow contribute. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
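A toy static version of the per-column dictionary idea, just to make the trade-off concrete (the names and values are invented): a stored cell is either a small token into a fixed dictionary or a literal. The hard part Robert describes -- evolving the dictionary while old tuples still reference it -- is exactly what this static sketch sidesteps:

```c
#include <assert.h>
#include <string.h>

/* Invented static dictionary of frequent column values. */
static const char *column_dict[] = { "active", "inactive", "pending" };
#define DICT_SIZE ((int) (sizeof(column_dict) / sizeof(column_dict[0])))

/* Returns a token index for dictionary hits, or -1 meaning "store the
 * value as a literal". */
static int
dict_encode(const char *value)
{
    for (int i = 0; i < DICT_SIZE; i++)
        if (strcmp(column_dict[i], value) == 0)
            return i;
    return -1;
}

/* Decoding only needs the (immutable) dictionary.  If the dictionary
 * could change after tokens were written, every old token would have to
 * keep decoding to the same value forever -- the "insurmountable"
 * update problem mentioned in the thread. */
static const char *
dict_decode(int token)
{
    return (token >= 0 && token < DICT_SIZE) ? column_dict[token] : NULL;
}
```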
Re: [HACKERS] Custom compression methods
On 11/30/2017 09:51 PM, Alvaro Herrera wrote: > Tomas Vondra wrote: > >> On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote: > >>> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER >>> tsvector_compression_handler; >> >> Understood. Good to know you've considered it, and I agree it doesn't >> need to be there from the start (which makes the patch simpler). > > Just passing by, but wouldn't this fit in the ACCESS METHOD group of > commands? So this could be simplified down to > CREATE ACCESS METHOD ts1 TYPE COMPRESSION > we have that for indexes and there are patches flying for heap storage, > sequences, etc. I think that's simpler than trying to invent all new > commands here. Then (in a future patch) you can use ALTER TYPE to > define compression for that type, or even add a column-level option to > reference a specific compression method. > I think that would conflate two very different concepts. In my mind, access methods define how rows are stored. Compression methods are an orthogonal concept, e.g. you can compress a value (using a custom compression algorithm) and store it in an index (using whatever access method it's using). So not only do access methods operate on rows (while compression operates on varlena values), but you can combine those two things together. I don't see how you could do that if both are defined as "access methods" ... Furthermore, the "FOR type" in CREATE COMPRESSION METHOD was meant to restrict the compression algorithm to a particular data type (so, if it relies on tsvector, you can't apply it to text columns). Which is very different from "TYPE COMPRESSION" in CREATE ACCESS METHOD. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Additional logging for VACUUM and ANALYZE
On Thu, Nov 30, 2017 at 10:04 PM, Michael Paquier wrote: > On Wed, Nov 8, 2017 at 12:54 AM, Fabrízio de Royes Mello > wrote: >> Great. Changed status to ready for committer. > > The patch still applies, but no committer seems to be interested in the > topic, so moved to next CF. The general idea of this patch seems OK to me.

+    rel_lock = true;

I think it would look nicer to initialize this when you declare the variable, instead of having a separate line of code for that purpose.

+    if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+        elevel = LOG;
+    else if (!IsAutoVacuumWorkerProcess())
+        elevel = WARNING;
+    else
+        return;

This can be rewritten with only one call to IsAutoVacuumWorkerProcess() by reversing the order of the branches.

+    PopActiveSnapshot();
+    CommitTransactionCommand();
+    return false;

vacuum_rel already has too many copies of this logic -- can we try to avoid duplicating it into two more places? It seems like you could easily do that like this:

    int elevel = 0;

    if (relation != NULL)
    {
        /* maybe set elevel */
    }

    if (elevel != 0)
    {
        if (!rel_lock)
            /* emit message */
        else
            /* emit other message */
    }

This wouldn't be the first bit of code to assume that elevel == 0 can be used as a sentinel value meaning "none", so I think it's OK to do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
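Robert's suggested control flow, turned into a compilable sketch. The level values mirror elog.h's ordering but are otherwise illustrative, and choose_elevel is an invented name; elevel == 0 is the "no message" sentinel:

```c
#include <assert.h>

/* 0 doubles as a "no message" sentinel, as suggested in the review. */
enum
{
    NO_MESSAGE = 0,
    LOG_LEVEL = 15,
    WARNING_LEVEL = 19
};

/* Single test of the (hypothetical) autovacuum flag, with the branches
 * reversed relative to the patch so the flag is consulted only once. */
static int
choose_elevel(int is_autovacuum_worker, int log_min_duration)
{
    if (!is_autovacuum_worker)
        return WARNING_LEVEL;
    if (log_min_duration >= 0)
        return LOG_LEVEL;
    return NO_MESSAGE;
}
```

The caller then emits its "could not lock" or "skipping" message in one place whenever the returned level is nonzero, instead of duplicating the snapshot-pop/commit/return sequence at each early exit.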
Re: [HACKERS] pgbench more operators & functions
Here is a v13. No code changes, but TAP tests added to maintain pgbench coverage to green. Here is a v14, which is just a rebase after the documentation xml-ization. Regenerated v15 that applies cleanly on head. No changes.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 94b495e..cac257b 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -904,14 +904,32 @@ pgbench options d
 Sets variable varname to a value calculated from expression.
- The expression may contain integer constants such as 5432,
+ The expression may contain the NULL constant,
+ boolean constants TRUE and FALSE,
+ integer constants such as 5432,
 double constants such as 3.14159,
 references to variables :variablename,
- unary operators (+, -) and binary operators
- (+, -, *, /, %) with their usual precedence and associativity,
- function calls, and
- parentheses.
+ operators
+ with their usual SQL precedence and associativity,
+ function calls,
+ SQL CASE generic conditional
+ expressions and parentheses.
+
+ Functions and most operators return NULL on
+ NULL input.
+
+ For conditional purposes, non zero numerical values are
+ TRUE, zero numerical values and NULL
+ are FALSE.
+
+ When no final ELSE clause is provided to a
+ CASE, the default value is NULL.
@@ -920,6 +938,7 @@ pgbench options d
 \set ntellers 10 * :scale
 \set aid (1021 * random(1, 10 * :scale)) % \
 (10 * :scale) + 1
+\set divx CASE WHEN :x <> 0 THEN :y/:x ELSE NULL END
@@ -996,6 +1015,177 @@ pgbench options d
+
+ Built-In Operators
+
+ The arithmetic, bitwise, comparison and logical operators listed in
+ are built into pgbench
+ and may be used in expressions appearing in
+ \set. 
+
+ pgbench Operators by increasing precedence
+
+  Operator                     Description                    Example      Result
+
+  OR                           logical or                     5 or 0       TRUE
+  AND                          logical and                    3 and 0      FALSE
+  NOT                          logical not                    not false    TRUE
+  IS [NOT] (NULL|TRUE|FALSE)   value tests                    1 is null    FALSE
+  ISNULL|NOTNULL               null tests                     1 notnull    TRUE
+  =                            is equal                       5 = 4        FALSE
+  <>                           is not equal                   5 <> 4       TRUE
+  !=                           is not equal                   5 != 5       FALSE
+  <                            lower than                     5 < 4        FALSE
+  <=                           lower or equal                 5 <= 4       FALSE
+  >                            greater than                   5 > 4        TRUE
+  >=                           greater or equal               5 >= 4       TRUE
+  |                            integer bitwise OR             1 | 2        3
+  #                            integer bitwise XOR            1 # 3        2
+  &                            integer bitwise AND            1 & 3        1
+  ~                            integer bitwise NOT            ~ 1          -2
+  <<                           integer bitwise shift left     1 << 2       4
+  >>                           integer bitwise shift right    8 >> 2       2
+  +                            addition                       5 + 4        9
+  -                            subtraction                    3 - 2.0      1.0
+  *                            multiplication                 5 * 4        20
+  /                            division (integer truncates the results)   5 / 3   1
+  %                            modulo                         3 % 2        1
+  -                            opposite                       - 2.0        -2.0
+
+ Built-In Functions
@@ -1042,6 +1232,13 @@ pgbench options d
 5432.0
+  exp(x)                double    exponential          exp(1.0)                 2.718281828459045
+
+  greatest(a [, ... ] ) double if any a is double, else integer    largest value among arguments
@@ -1063,6 +1260,20 @@ pgbench options d
 2.1
+  ln(x)                 double    natural logarithm    ln(2.718281828459045)    1.0
+
+  mod(i, j)             integer   modulo               mod(54, 32)              22
+
+  pi()                  double    value of the constant PI
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index b3a2d9b..770be98 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,13 +19,17 @@
 PgBenchExpr *expr_parse_result;
 static PgBenchExprList *make_elist(PgBenchExpr *exp, PgBenchExprList *list);
+static PgBenchExpr *make_null_constant(void);
+static PgBenchExpr *make_boolean_constant(bool bval);
+static PgBenchExpr
Re: [HACKERS] Partition-wise aggregation/grouping
Continuing with review of 0007.

+    /* Copy input rel's relids to grouped rel */
+    grouped_rel->relids = input_rel->relids;

Isn't this done in fetch_upper_rel()? Why do we need it here? There's also a similar hunk in create_grouping_paths() which doesn't look appropriate. I guess you need relids in grouped_rel->relids for FDW. There are two ways to do this: 1. set grouped_rel->relids for the parent upper rel as well, but then we should pass relids to fetch_upper_rel() instead of setting those later. 2. For a parent upper rel, in create_foreignscan_plan(), set relids to all_baserels if upper_rel->relids is NULL, and don't set relids for a parent upper rel. I am fine with either of those.

+    /* partial phase */
+    get_agg_clause_costs(root, (Node *) partial_target->exprs,
+                         AGGSPLIT_INITIAL_SERIAL,
+                         _partial_costs);

IIUC, the costs for evaluating aggregates would not change per child. They won't be different for the parent and any of the children. So, we should be able to calculate those once, save them in "extra" and use them repeatedly.

+    if (can_sort)
+    {
+        Path *path = cheapest_path;
+
+        if (!(pathkeys_contained_in(root->group_pathkeys,
+                                    path->pathkeys)))
[ .. clipped patch .. ]
+                     NIL,
+                     dNumGroups));
+    }

We create two kinds of paths: partial paths for parallel query, and partial aggregation paths when group keys do not have partition keys. The comments and code use "partial" to mean both of those things, which is rather confusing. Maybe we should use the term "partial aggregation" explicitly wherever it means that, in comments and in variable names. I still feel that create_grouping_paths() and create_child_grouping_paths() have a lot of common code. While I can see that there are some pockets in create_grouping_paths() which are not required in create_child_grouping_paths() and vice-versa, maybe we should create only one function and move those pockets under "if (child)" or "if (parent)" kind of conditions. It will be a maintenance burden to keep these two functions in sync in future. 
If we do not keep them in sync, that will introduce bugs.

+/*
+ * create_partition_agg_paths
+ *
+ * Creates append path for all the partitions and adds into the grouped rel.

I think you want to write "Creates an append path containing paths from all the child grouped rels and adds it into the given parent grouped rel".

+ * For partial mode we need to add a finalize agg node over append path before
+ * adding a path to grouped rel.
+ */
+void
+create_partition_agg_paths(PlannerInfo *root,
+                           RelOptInfo *grouped_rel,
+                           List *subpaths,
+                           List *partial_subpaths,

Why do we have these two as separate arguments? I don't see any call to create_partition_agg_paths() through add_append_path() passing both of them non-NULL simultaneously. Maybe you want to use a single list of subpaths and another boolean indicating whether it's a list of partial paths or regular paths.

+/* For non-partial path, just create a append path and we are done. */

This is the kind of confusion I am talking about above. Here you have mentioned "non-partial path", which may mean a regular path, but what you actually mean by that term is a path representing partial aggregates.

+/*
+ * Partial partition-wise grouping paths. Need to create final
+ * aggregation path over append relation.
+ *
+ * If there are partial subpaths, then we need to add gather path before we
+ * append these subpaths.

More confusion here.

+ */
+if (partial_subpaths)
+{
+    ListCell *lc;
+
+    Assert(subpaths == NIL);
+
+    foreach(lc, partial_subpaths)
+    {
+        Path *path = lfirst(lc);
+        double total_groups = path->rows * path->parallel_workers;
+
+        /* Create gather paths for partial subpaths */
+        Path *gpath = (Path *) create_gather_path(root, grouped_rel, path,
+                                                  path->pathtarget, NULL,
+                                                  _groups);
+
+        subpaths = lappend(subpaths, gpath);

Using the argument variable is confusing, and that's why you need two different List variables. Instead you could probably have another variable, local to this function, to hold the gather subpaths. 
AFAIU, the Gather paths that this code creates have their parent set to the parent grouped rel. That's not correct. These partial paths come from children of the grouped rel, and each Gather is producing rows corresponding to one child of the grouped rel. So a Gather path's parent should be set to the corresponding child and not the parent grouped rel. This code creates plans where there are multiple Gather nodes under an
Re: [HACKERS] Runtime Partition Pruning
Hello,

PFA the updated patch rebased over Amit's v13 patches [1], part of which is committed. This also fixes a few bugs I found. The InitPlans require execPlan, which is not set during ExecInitAppend, and so the evaluation of extern quals is moved from ExecInitAppend to ExecAppend. This changes the output of explain, but only the correct partition(s) are scanned.

David Q1:
postgres=# explain analyse execute ab_q1 (3,3); --const
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
         Filter: ((a = 3) AND (b = 3))
 Planning time: 0.588 ms
 Execution time: 0.043 ms
(5 rows)

postgres=# explain analyse execute ab_q1 (3,3); --Param; only the ab_a3_b3 plan is executed
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119 rows=0 loops=1)
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.828 ms
 Execution time: 0.234 ms
(21 rows)

David Q1
postgres=# explain analyse execute ab_q1 (4); --Const
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Append  (cost=0.00..49.55 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = 4))
 Planning time: 0.501 ms
 Execution time: 0.039 ms
(5 rows)

postgres=# explain analyse execute ab_q1 (4); --Param
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Append  (cost=0.00..99.10 rows=2 width=8) (actual time=0.063..0.063 rows=0 loops=1)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
   ->  Seq Scan on ab_a5  (cost=0.00..49.55 rows=1 width=8) (never executed)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
 Planning time: 0.563 ms
 Execution time: 0.111 ms

I am still working on the patch to add more comments and regression tests, but comments on the code are welcome.

[1] https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0001-Implement-runtime-partiton-pruning_v4.patch
Description: Binary data
Re: [HACKERS] pow support for pgbench
Hi Fabien,

> The idea is that it would be relative to the "more functions and operators"
> patch, but I guess this is too much for people checking, so I'm fine with
> having it with the current base.

I tried applying the last "more functions and operators" patch
(pgbench-more-ops-funcs-14.patch) but it also stopped applying cleanly, so I
decided to go for master to avoid having too many issues. Let me know if you
rework that patch and I'll be happy to rebase mine on top.

--
Raúl Marín Rodríguez
carto.com
Use get_greatest_modulus() in partition_bounds_equal()
Hi all,

partition_bounds_equal() accesses the last datum from the given partition
bounds directly to compare their greatest moduli. The rest of the code uses
get_greatest_modulus() to get the greatest modulus from a partition bound.
partition_bounds_equal() should do the same for the sake of consistency, and
to make sure that the code remains sane if we change the way we store the
greatest modulus in PartitionBoundInfoData in the future.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

From 6aba3444b75f3420af52f602d7a2f53da8da0847 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Fri, 1 Dec 2017 15:36:56 +0530
Subject: [PATCH] Use get_greatest_modulus() in partition_bounds_equal()

partition_bounds_equal() accesses the last datum from the given
partition bounds directly to compare their greatest moduli. The rest of
the code uses get_greatest_modulus() to get the greatest modulus from a
partition bound. The above comparison should do the same for the sake of
consistency, and to make sure that the code remains sane if we change
the way we store the greatest modulus in PartitionBoundInfoData in the
future.

Ashutosh Bapat.
---
 src/backend/catalog/partition.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 2bf8117..dd4a8d3 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -751,15 +751,13 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
 
 	if (b1->strategy == PARTITION_STRATEGY_HASH)
 	{
-		int			greatest_modulus;
+		int			greatest_modulus = get_greatest_modulus(b1);
 
 		/*
 		 * If two hash partitioned tables have different greatest moduli,
-		 * their partition schemes don't match. For hash partitioned table,
-		 * the greatest modulus is given by the last datum and number of
-		 * partitions is given by ndatums.
+		 * their partition schemes don't match.
 		 */
-		if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
+		if (greatest_modulus != get_greatest_modulus(b2))
 			return false;
 
 		/*
@@ -773,7 +771,6 @@ partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
 		 * their indexes array will be same. So, it suffices to compare
 		 * indexes array.
 		 */
-		greatest_modulus = get_greatest_modulus(b1);
 		for (i = 0; i < greatest_modulus; i++)
 			if (b1->indexes[i] != b2->indexes[i])
 				return false;
-- 
1.7.9.5
Re: [HACKERS] pow support for pgbench
Hi,

I've rebased the patch so it can be applied cleanly on top of current master.

On Fri, Dec 1, 2017 at 3:55 AM, Michael Paquier wrote:
> On Tue, Nov 7, 2017 at 1:34 AM, Fabien COELHO wrote:
> > Let's now hope that a committer gets around to consider these patch some
> > day.
>
> Which is not the case yet, so moved to CF 2018-01. Please note that
> the patch proposed does not apply anymore, so its status is changed to
> "waiting on author" for a rebase.
> --
> Michael

--
Raúl Marín Rodríguez
carto.com

From b12a7a37b1af1ede1aa9d5d88d0918808c54e19f Mon Sep 17 00:00:00 2001
From: Raul Marin
Date: Fri, 1 Dec 2017 10:49:17 +0100
Subject: [PATCH] Add pow() support to pgbench

---
 doc/src/sgml/ref/pgbench.sgml                |  7 ++++
 src/bin/pgbench/exprparse.y                  |  3 ++
 src/bin/pgbench/pgbench.c                    | 62 ++++++++++++++++++++
 src/bin/pgbench/pgbench.h                    |  3 +-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 16 ++++++-
 5 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 94b495e606..769e604dd6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1099,6 +1099,13 @@ pgbench options d
        sqrt(2.0)
        1.414213562
 
+       pow(x, y)
+       integer if x and y are integers and y >= 0, else double
+       Numeric exponentiation
+       pow(2.0, 10)
+       1024.0
+
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index b3a2d9bfd3..4db3b215ab 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -191,6 +191,9 @@ static const struct
 	{
 		"random_exponential", 3, PGBENCH_RANDOM_EXPONENTIAL
 	},
+	{
+		"pow", 2, PGBENCH_POW
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index bd96eae5e6..5a5197cf63 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -746,6 +746,27 @@ getPoissonRand(TState *thread, int64 center)
 	return (int64) (-log(uniform) * ((double) center) + 0.5);
 }
 
+/*
+ * pow() for integer values with exp >= 0. Matches SQL pow() behaviour
+ */
+static int64
+ipow(int64 base, int64 exp)
+{
+	int64		result = 1;
+
+	Assert(exp >= 0);
+
+	while (exp)
+	{
+		if (exp & 1)
+			result *= base;
+		exp >>= 1;
+		base *= base;
+	}
+
+	return result;
+}
+
 /*
  * Initialize the given SimpleStats struct to all zeroes
  */
@@ -1673,6 +1694,47 @@ evalFunc(TState *thread, CState *st,
 				return true;
 			}
 
+		case PGBENCH_POW:
+			{
+				PgBenchValue *lval = &vargs[0];
+				PgBenchValue *rval = &vargs[1];
+
+				Assert(nargs == 2);
+
+				/*
+				 * If both operands are int and exp >= 0 use
+				 * the ipow() function, else use pow()
+				 */
+				if (lval->type == PGBT_INT &&
+					rval->type == PGBT_INT)
+				{
+
+					int64		li,
+								ri;
+
+					if (!coerceToInt(lval, &li) ||
+						!coerceToInt(rval, &ri))
+						return false;
+
+					if (ri >= 0)
+						setIntValue(retval, ipow(li, ri));
+					else
+						setDoubleValue(retval, pow(li, ri));
+				}
+				else
+				{
+					double		ld,
+								rd;
+
+					if (!coerceToDouble(lval, &ld) ||
+						!coerceToDouble(rval, &rd))
+						return false;
+
+					setDoubleValue(retval, pow(ld, rd));
+				}
+				return true;
+			}
+
 		default:
 			/* cannot get here */
 			Assert(0);
diff --git a/src/bin/pgbench/pgbench.h b/src/bin/pgbench/pgbench.h
index fd428af274..e0132b5fcf 100644
--- a/src/bin/pgbench/pgbench.h
+++ b/src/bin/pgbench/pgbench.h
@@ -75,7 +75,8 @@ typedef enum PgBenchFunction
 	PGBENCH_SQRT,
 	PGBENCH_RANDOM,
 	PGBENCH_RANDOM_GAUSSIAN,
-	PGBENCH_RANDOM_EXPONENTIAL
+	PGBENCH_RANDOM_EXPONENTIAL,
+	PGBENCH_POW
 } PgBenchFunction;
 
 typedef struct PgBenchExpr PgBenchExpr;
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index c095881312..fcb30cdde5 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -231,7 +231,14 @@ pgbench(
 		qr{command=18.: double 18\b},
 		qr{command=19.: double 19\b},
 		qr{command=20.: double 20\b},
-		qr{command=21.: int 9223372036854775807\b}, ],
+		qr{command=21.: int 9223372036854775807\b},
+		qr{command=23.: int -27\b},
+		qr{command=24.: double 1024\b},
+		qr{command=25.: int 1\b},
+		qr{command=26.: double 1\b},
+		qr{command=27.: double -0.125\b},
+		qr{command=28.: double -0.125\b},
+	],
 	'pgbench expressions',
 	{ '001_pgbench_expressions' => q{-- integer functions
 \set i1 debug(random(1, 100))
@@ -261,6 +268,13 @@ pgbench(
 \set maxint debug(:minint - 1)
 -- reset a variable
 \set i1 0
+--- pow() operator
+\set poweri debug(pow(-3,3))
+\set powerd debug(pow(2.0,10))
+\set poweriz debug(pow(0,0))
+\set powerdz debug(pow(0.0,0.0))
+\set powernegi debug(pow(-2,-3))
+\set
Re: Unclear regression test for postgres_fdw
Jeevan Chalke wrote:

> On Thu, Nov 30, 2017 at 3:44 PM, Jeevan Chalke wrote:
> > On Thu, Nov 30, 2017 at 1:36 AM, Antonin Houska wrote:
> > > The following test
> > >
> > > -- Input relation to aggregate push down hook is not safe to pushdown and thus
> > > -- the aggregate cannot be pushed down to foreign server.
> > > explain (verbose, costs off)
> > > select count(t1.c3) from ft1 t1, ft1 t2 where t1.c1 = postgres_fdw_abs(t1.c2);
> > >
> > > produces the following plan
> > >
> > >                                    QUERY PLAN
> > > ----------------------------------------------------------------------------------
> > >  Aggregate
> > >    Output: count(t1.c3)
> > >    ->  Nested Loop
> > >          Output: t1.c3
> > >          ->  Foreign Scan on public.ft1 t2
> > >                Remote SQL: SELECT NULL FROM "S 1"."T 1"
> > >          ->  Materialize
> > >                Output: t1.c3
> > >                ->  Foreign Scan on public.ft1 t1
> > >                      Output: t1.c3
> > >                      Remote SQL: SELECT c3 FROM "S 1"."T 1" WHERE (("C 1" = public.postgres_fdw_abs(c2)))
> > >
> > > which is not major problem as such, but gdb shows that the comment
> > > "aggregate cannot be pushed" is not correct. In fact,
> > > postgresGetForeignUpperPaths() *does* create the upper path.
> > >
> > > The reason that UPPERREL_GROUP_AGG is eventually not used seems to be that
> > > postgresGetForeignJoinPaths() -> add_foreign_grouping_paths() ->
> > > estimate_path_cost_size() estimates the join cost in rather generic way.
> > > While the remote server can push the join clause down to the inner
> > > relation of NL, the postgres_fdw cost computation assumes that the join
> > > clause is applied to each pair of output and input tuple.
> > >
> > > I don't think that the postgres_fdw's estimate can be fixed easily, but
> > > if the impact of "shipability" on (not) using the upper relation should
> > > be tested, we need a different test.
> >
> > Oops. My bad.
> > Agree with your analysis.
> > Will send a patch fixing this testcase.
>
> Attached patch to fix the test case. In new test case I am using a JOIN
> query where JOIN condition is not safe to push down and hence the JOIN
> itself is unsafe.
I've just verified that postgresGetForeignUpperPaths() does return here:

	/*
	 * If input rel is not safe to pushdown, then simply return as we cannot
	 * perform any post-join operations on the foreign server.
	 */
	if (!input_rel->fdw_private ||
		!((PgFdwRelationInfo *) input_rel->fdw_private)->pushdown_safe)
		return;

I see no other problems here. You may need to add the patch to the next CF so
it does not get lost.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] [PATCH] Incremental sort
Alexander Korotkov wrote:

> On Wed, Nov 22, 2017 at 1:22 PM, Antonin Houska wrote:
> > Alexander Korotkov wrote:
> > > Antonin Houska wrote:
> > > > * ExecIncrementalSort()
> > > >
> > > > ** if (node->tuplesortstate == NULL)
> > > >
> > > > If both branches contain the expression
> > > >
> > > > node->groupsCount++;
> > > >
> > > > I suggest it to be moved outside the "if" construct.
> > >
> > > Done.
> >
> > One more comment on this: I wonder if the field isn't incremented too
> > early. It seems to me that the value can end up non-zero if the input set
> > happens to be empty (not sure if it can happen in practice).
>
> That happens in practice. On empty input set, incremental sort would count
> exactly one group.
>
> # create table t (x int, y int);
> CREATE TABLE
> # create index t_x_idx on t (x);
> CREATE INDEX
> # set enable_seqscan = off;
> SET
> # explain (analyze, buffers) select * from t order by x, y;
>                                                      QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------
>  Incremental Sort  (cost=0.74..161.14 rows=2260 width=8) (actual time=0.024..0.024 rows=0 loops=1)
>    Sort Key: x, y
>    Presorted Key: x
>    Sort Method: quicksort  Memory: 25kB
>    Sort Groups: 1
>    Buffers: shared hit=1
>    ->  Index Scan using t_x_idx on t  (cost=0.15..78.06 rows=2260 width=8) (actual time=0.011..0.011 rows=0 loops=1)
>          Buffers: shared hit=1
>  Planning time: 0.088 ms
>  Execution time: 0.066 ms
> (10 rows)
>
> But from the perspective of how the code works, it's really 1 group. Tuple
> sort was defined, inserted no tuples, then sorted and got no tuples out of
> there. So, I'm not sure if it's really incorrect...

I expected the number of groups that actually appear in the output; you
consider it the number of groups started. I can't find a similar case
elsewhere in the code (e.g. the Agg node does not report this kind of
information), so I have no clue. Someone else will have to decide.

> > But there is IncrementalSort node on the remote side.
> > Let's see what happens.
> Idea of "CROSS JOIN, not pushed down" test is that cross join with ORDER BY
> LIMIT is not beneficial to push down, because LIMIT is not pushed down and
> remote side wouldn't be able to use top-N heapsort. But if remote side has
> incremental sort then it can be used, and fetching first 110 rows is cheap.
> Let's see plan of original "CROSS JOIN, not pushed down" test with
> incremental sort.
>
> # EXPLAIN (ANALYZE, VERBOSE) SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2
> t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

ok, understood, thanks. Perhaps it's worth a comment in the test script.

I'm afraid I still see a problem. The diff removes a query that (although a
bit different from the one above) lets the CROSS JOIN be pushed down and does
introduce the IncrementalSort in the remote database. This query is replaced
with one that does not allow for the join push down.

*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
*************** SELECT t1.c1 FROM ft1 t1 WHERE NOT EXIST
*** 510,517 ****
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
! SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;
--- 510,517 ----
  SELECT t1.c1 FROM ft1 t1 WHERE NOT EXISTS (SELECT 1 FROM ft2 t2 WHERE t1.c1 = t2.c2) ORDER BY t1.c1 OFFSET 100 LIMIT 10;
  -- CROSS JOIN, not pushed down
  EXPLAIN (VERBOSE, COSTS OFF)
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
! SELECT t1.c3, t2.c3 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c3, t2.c3 OFFSET 100 LIMIT 10;
  -- different server, not pushed down. No result expected.
  EXPLAIN (VERBOSE, COSTS OFF)
  SELECT t1.c1, t2.c1 FROM ft5 t1 JOIN ft6 t2 ON (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1 OFFSET 100 LIMIT 10;

Shouldn't the test contain *both* cases?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at