Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Thu, Oct 26, 2017 at 4:11 PM, Masahiko Sawada wrote: >> Why do we want the backend to linger behind, once it has added its >> foreign transaction entries in the shared memory and informed resolver >> about it? The foreign connections may take their own time and even >> after that there is no guarantee that the foreign transactions will be >> resolved in case the foreign server is not available. So, why to make >> the backend wait? > > Because I don't want to break the current user semantics. That is, > currently it's guaranteed that the subsequent reads can see the > committed result of previous writes even if the previous transactions > were distributed transactions. Right, this is very important, and having the backend wait for the resolver(s) is, I think, the right way to implement it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: BRIN bloom indexes
Hi, On 10/28/2017 02:41 AM, Nico Williams wrote: > On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote: >>> + * We use an optimisation that initially we store the uint32 values >>> directly, >>> + * without the extra hashing step. And only later filling the bitmap space, >>> + * we switch to the regular bloom filter mode. >>> >>> I don't think that optimization is worthwhile. If I'm going to be using >>> a Bloom filter, it's because I expect to have a lot of rows. >>> >>> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new >>> table just won't have lots of rows, then I might want this optimization, >>> but I can always just drop the Bloom filter index, or not include >>> indexes in the LIKE.) >> >> I think you're confusing "rows" and "distinct values". Furthermore, it's >> rather irrelevant how many rows are in the table because BRIN indexes >> split the table into "ranges" that are 1MB by default. And you can only >> squash a certain number of rows into such a range. >> >> The idea of the optimization is to efficiently support cases where each >> range contains only a small number of distinct values - which is quite >> common in the cases I described in my initial message (with bursts of >> rows with the same IP / client identifier etc.). > > What range? It's just bits to set. > > The point is that Bloom filters should ideally be about 50% full -- much > less than that and you are wasting space, much more than that and your > false positive rate becomes useless. The range defined by BRIN indexes. BRIN indexes split the table into a sequence of page ranges (128 pages by default, i.e. 1MB), and the bloom filters are built on those table "chunks". So it's irrelevant how many rows are in the table in total, what matters is just the page range. > >>> Filter compression is not worthwhile. You want to have a fairly uniform >>> hash distribution, and you want to end up with a Bloom filter that's not >>> much more than 50% full. 
That means that when near 50% full the filter >>> will not be very sparse at all. Optimizing for the not common case >>> doesn't seem worthwhile, and adds complexity. >> >> Properly sizing the bloom filter requires knowledge of many variables, >> in particular the number of distinct values expected to be added to the >> filter. But we don't really know that number, and we also don't even >> know many values useful for estimating that (how many rows will fit into >> a range, number of distinct values in the whole table, etc.) > > This is why Scalable Bloom filters exist: so you can adapt. > >> So the idea was to oversize the bloom filter, and then use the sparse >> representation to reduce the size. > > A space-efficient representation of sparse bitmaps is not as simple as a > Scalable Bloom filter. > > And a filter that requires user intervention to size correctly, or which > requires rebuilding when it gets too full, is also not as convenient as > a Scalable Bloom filter. > Maybe. For now I'm fine with the simple relopts-based approach and don't plan to spend much time on these improvements. Which is not to say that they may not be worthwhile, but it's not the only thing I'm working on. I also suspect there are practical implications, as some things are only possible with equally-sized bloom filters. I believe the "union" (missing in the current patch) will rely on that when merging bloom filters. >>> + * XXX We can also watch the number of bits set in the bloom filter, and >>> + * then stop using it (and not store the bitmap, to save space) when the >>> + * false positive rate gets too high. >>> >>> Ah, right, what you want is a "Scalable Bloom Filter". >> >> That's not what I had in mind. My idea was to size the bloom filter on >> "best effort" basis, and then if one range gets a bit too inaccurate >> then just get rid of the filter. If that happens for many ranges, we >> should probably allow specifying parameters as relopts for the index. 
> > Scalable Bloom filters are way more convenient than that. They're > always not-too-full, and only the open filter is less-than-full-enough. > > And since you should grow them somewhat exponentially (but not quite as > much as a doubling in each generation), there should never be too many > filters to search. But you can always "vacuum" (rebuild) the filter > starting with the size of the last filter added prior to the vacuum. > >> I think this is really over-engineering, and I certainly don't plan >> to extend the patch in this direction. > > Whereas I think a sparse bitmap representation is overly complex and > "over-engineering". Scalable Bloom filters are very well understood in > the literature -- there's nothing terribly complex to them. > That "sparse bitmap" is mentioned merely in an XXX comment as a thing to consider, not something I plan to work on right now. Perhaps it's not really a good idea in general, or maybe it's addressing a non-issue. In which case we don't need scalable bloom filters.
Re: [HACKERS] WIP: BRIN bloom indexes
On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote: > > + * We use an optimisation that initially we store the uint32 values > > directly, > > + * without the extra hashing step. And only later filling the bitmap space, > > + * we switch to the regular bloom filter mode. > > > > I don't think that optimization is worthwhile. If I'm going to be using > > a Bloom filter, it's because I expect to have a lot of rows. > > > > (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new > > table just won't have lots of rows, then I might want this optimization, > > but I can always just drop the Bloom filter index, or not include > > indexes in the LIKE.) > > I think you're confusing "rows" and "distinct values". Furthermore, it's > rather irrelevant how many rows are in the table because BRIN indexes > split the table into "ranges" that are 1MB by default. And you can only > squash a certain number of rows into such a range. > > The idea of the optimization is to efficiently support cases where each > range contains only a small number of distinct values - which is quite > common in the cases I described in my initial message (with bursts of > rows with the same IP / client identifier etc.). What range? It's just bits to set. The point is that Bloom filters should ideally be about 50% full -- much less than that and you are wasting space, much more than that and your false positive rate becomes useless. > > Filter compression is not worthwhile. You want to have a fairly uniform > > hash distribution, and you want to end up with a Bloom filter that's not > > much more than 50% full. That means that when near 50% full the filter > > will not be very sparse at all. Optimizing for the not common case > > doesn't seem worthwhile, and adds complexity. > > Properly sizing the bloom filter requires knowledge of many variables, > in particular the number of distinct values expected to be added to the > filter. 
But we don't really know that number, and we also don't even > know many values useful for estimating that (how many rows will fit into > a range, number of distinct values in the whole table, etc.) This is why Scalable Bloom filters exist: so you can adapt. > So the idea was to oversize the bloom filter, and then use the sparse > representation to reduce the size. A space-efficient representation of sparse bitmaps is not as simple as a Scalable Bloom filter. And a filter that requires user intervention to size correctly, or which requires rebuilding when it gets too full, is also not as convenient as a Scalable Bloom filter. > > + * XXX We can also watch the number of bits set in the bloom filter, and > > + * then stop using it (and not store the bitmap, to save space) when the > > + * false positive rate gets too high. > > > > Ah, right, what you want is a "Scalable Bloom Filter". > > That's not what I had in mind. My idea was to size the bloom filter on > "best effort" basis, and then if one range gets a bit too inaccurate > then just get rid of the filter. If that happens for many ranges, we > should probably allow specifying parameters as relopts for the index. Scalable Bloom filters are way more convenient than that. They're always not-too-full, and only the open filter is less-than-full-enough. And since you should grow them somewhat exponentially (but not quite as much as a doubling in each generation), there should never be too many filters to search. But you can always "vacuum" (rebuild) the filter starting with the size of the last filter added prior to the vacuum. > I think this is really over-engineering, and I certainly don't plan > to extend the patch in this direction. Whereas I think a sparse bitmap representation is overly complex and "over-engineering". Scalable Bloom filters are very well understood in the literature -- there's nothing terribly complex to them. 
> I do not expect these parameters (particularly the number of distinct > values in a range) to significantly change over time, so the easiest > solution is to provide a reloption to specify that number in > CREATE/ALTER index. Doesn't this depend on the use-case though? I think a self-tuning system is better than one that doesn't self-tune. Call that over-engineering if you like, but it doesn't make it not desirable :) > Alternatively, we could allow the summarization process to gather some > statistics (either on the whole range, or perhaps the whole table), and > store them somewhere for later use. For example to compute the number of > distinct values per range (in the existing data), and use that later. Again, Scalable Bloom filters automatically adapt without needing a statistics gathering exercise. All you need is a good hash function (that's another topic). Scalable Bloom filters are a trivial extension of Bloom filters. > > What I'm getting at is that the query planner really needs to understand > > that a Bloom filter is a probabilistic data structure. > > It does, and we never produce incorrect results.
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 02:13:27PM -0700, srielau wrote: > While the standard may not require a unique index for the ON clause I have > never seen a MERGE statement that did not have this property. So IMHO this > is a reasonable restriction. I don't understand how one could have a conflict upon which to turn INSERT into UPDATE without having a UNIQUE constraint violated... The only question is whether one should have control over -or have to specify- which constraint violations lead to UPDATE vs. which ones lead to failure vs. which ones lead to doing nothing. The row to update is the one that the to-be-inserted row conflicted with -- there can only have been one if the constraint violated was a PRIMARY KEY constraint, or if there is a PRIMARY KEY at all, but if there's no PRIMARY KEY, then there can have been more conflicting rows because of NULL columns in the to-be-inserted row. If the to-be-inserted row conflicts with multiple rows, then just fail, or don't allow MERGE on tables that have no PK (as you know, many think it makes no sense to not have a PK on a table in SQL). In the common case one does not care about which UNIQUE constraint is violated because there's only one that could have been violated, or because if the UPDATE should itself cause some other UNIQUE constraint to be violated, then the whole statement should fail. PG's UPSERT is fantastic -- it allows very fine-grained control, but it isn't as pithy as it could be when the author doesn't care to specify all that detail. Also, something like SQLite3's INSERT OR REPLACE is very convenient: pithy, INSERT syntax, upsert-like semantics[*]. I'd like to have this in PG:

    INSERT INTO .. ON CONFLICT DO UPDATE;  -- I.e., update all columns of the existing
                                           -- row to match the ones from the row that
                                           -- would have been inserted had there not been
                                           -- a conflict.
                                           --
                                           -- If an INSERTed row conflicts and then the
                                           -- UPDATE it devolves to also conflicts, then
                                           -- fail.

and

    INSERT INTO .. ON CONFLICT DO UPDATE   -- I.e., update all columns of the existing
                                           -- row to match the ones from the row that
                                           -- would have been inserted had there not been
                                           -- a conflict.
                ON CONFLICT DO NOTHING;    -- If an INSERTed row conflicts and then the
                                           -- UPDATE it devolves to also conflicts, then
                                           -- DO NOTHING.

[*] SQLite3's INSERT OR REPLACE is NOT an insert-or-update, but an insert-or-delete-and-insert, and any deletions that occur in the process do fire triggers. INSERT OR UPDATE would be much more useful. Nico
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 9:00 AM, Robert Haas wrote: > On Fri, Oct 27, 2017 at 4:41 PM, Simon Riggs wrote: >> I didn't say it but my intention was to just throw an ERROR if no >> single unique index can be identified. >> >> It could be possible to still run MERGE in that situation but we would >> need to take a full table lock at ShareRowExclusive. It's quite likely >> that such statements would throw duplicate update errors, so I >> wouldn't be aiming to do anything with that for PG11. > > Like Peter, I think taking such a strong lock for a DML statement > doesn't sound like a very desirable way forward. It means, for > example, that you can only have one MERGE in progress on a table at > the same time, which is quite limiting. It could easily be the case > that you have multiple MERGE statements running at once but they touch > disjoint groups of rows and therefore everything works. I think the > code should be able to cope with concurrent changes, if nothing else > by throwing an ERROR, and then if the user wants to ensure that > doesn't happen by taking ShareRowExclusiveLock they can do that via an > explicit LOCK TABLE statement -- or else they can prevent concurrency > by any other means they see fit. +1, I would expect users to run this query in parallel on the same table for multiple data sets. Peter has taken some time today to explain his arguments to me, and I agree that it does not sound very appealing to have constraint limitations for MERGE. In particular, using the existing ON CONFLICT structure gives the feeling of having two grammars for what's basically the same feature, with pretty much the same restrictions. By the way, this page sums up nicely the state of UPSERT implementations across many systems: https://en.wikipedia.org/wiki/Merge_(SQL) -- Michael
Re: [HACKERS] Reading timeline from pg_control on replication slave
On Fri, Oct 27, 2017 at 1:04 AM, Andrey Borodin wrote: > I'm working on backups from a replication slave in WAL-G [0] > Backups used to use the result of pg_walfile_name(pg_start_backup(...)). A call to > pg_start_backup() works nicely, but "pg_walfile_name() cannot be executed > during recovery." > This function has an LSN as argument and reads the TimeLineId from global state. > So I made a function[1] that, if on a replica, reads the timeline from the pg_control > file and formats the WAL file name as if it were produced by pg_walfile_name(lsn). ThisTimeLineID is not something you can rely on for standby backends as it is not set during recovery. That's the reason pg_walfile_name is disabled during recovery. There are three things popping to the top of my mind that one could think about: 1) Backups cannot be completed when started on a standby in recovery and when stopped after the standby has been promoted, meaning that its timeline has changed. 2) After a standby has been promoted, by using pg_start_backup, you issue a checkpoint which makes sure that the control file gets flushed with the new information, so when pg_start_backup returns to the caller you should have the correct timeline number because the outer function gets evaluated last. 3) Backups taken from cascading standbys, where a direct parent has been promoted. 1) and 2) are actually not problems per the restrictions I am giving above, but 3) is. If I recall correctly, when a streaming standby does a timeline jump, a restart point is not immediately generated, so you could have the timeline in the control file not updated to the latest timeline value, meaning that the WAL file name you use here could refer to a previous timeline and not the newest one. In short, yes, what you are doing is definitely risky in my opinion, particularly for complex cascading setups. 
-- Michael
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 2:42 PM, Peter Geoghegan wrote: On Fri, Oct 27, 2017 at 2:13 PM, srielau wrote: > While the standard may not require a unique index for the ON clause I have > never seen a MERGE statement that did not have this property. So IMHO this > is a reasonable restriction. The Oracle docs on MERGE say nothing about unique indexes or constraints. They don't even mention them in passing. They do say "This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements." SQL Server's MERGE docs do mention unique indexes, but only in passing, saying something about unique violations, and that unique violations *cannot* be suppressed in MERGE, even though that's possible with other DML statements (with something called IGNORE_DUP_KEY). What other systems *do* have this restriction? I've never seen one that did. Not clear what you are leading up to here. When I did MERGE in DB2 there was also no limitation: "Each row in the target can only be operated on once. A row in the target can only be identified as MATCHED with one row in the result table of the table-reference." What there was, however, was a significant amount of code I had to write and test to enforce that second sentence. IIRC it involved, in the absence of a proof that the join could not expand, adding a row_number() over() AS rn over the target leg of the join and then a row_number() over(partition by rn) > 1 THEN RAISE_ERROR() to catch violators. Maybe in PG there is a trivial way to detect an expanding join and block it at runtime. So the whole point I'm trying to make is that I haven't seen the need for the extra work I had to do once the feature appeared in the wild. Cheers Serge
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 2:13 PM, srielau wrote: > While the standard may not require a unique index for the ON clause I have > never seen a MERGE statement that did not have this property. So IMHO this > is a reasonable restriction. The Oracle docs on MERGE say nothing about unique indexes or constraints. They don't even mention them in passing. They do say "This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements." SQL Server's MERGE docs do mention unique indexes, but only in passing, saying something about unique violations, and that unique violations *cannot* be suppressed in MERGE, even though that's possible with other DML statements (with something called IGNORE_DUP_KEY). What other systems *do* have this restriction? I've never seen one that did. -- Peter Geoghegan
Re: [HACKERS] Proposal: Local indexes for partitioned table
Robert Haas writes: > On Fri, Oct 27, 2017 at 7:27 PM, Alvaro Herrera > wrote: >> I noticed that RelationBuildPartitionKey is generating a partition key >> in a temp context, then creating a private context and copying the key >> into that. That seems leftover from some previous iteration of some >> other patch; I think it's pretty reasonable to create the new context >> right from the start and allocate the key there directly instead. Then >> there's no need for copy_partition_key at all. > We could do that, but the motivation for the current system was to > avoid leaking memory in a long-lived context. I think the same > technique is used elsewhere for similar reasons. I admit I haven't > checked whether there would actually be a leak here if we did it as > you propose, but I wouldn't find it at all surprising. Another key point is to avoid leaving a corrupted relcache entry behind if you fail partway through. I think that the current coding in RelationBuildPartitionKey is safe, although it's far too undercommented for my taste. (The ordering of the last few steps is *critical*.) It might work to build the new key in a context that's initially a child of CurrentMemoryContext, then reparent it to be a child of CacheMemoryContext when done. But you'd have to look at whether that would end up wasting long-lived memory, if the build process generates any cruft in the same context. regards, tom lane
Re: [HACKERS] Causal reads take II
On Sun, Oct 1, 2017 at 10:03 AM, Thomas Munro wrote: >> I tried a few more times, and I've got it two times out of four attempts on a >> fresh >> installation (when all instances were on the same machine). But anyway I'll >> try >> to investigate, maybe it has something to do with my environment. >> >> ... >> >> 2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number in log >> segment 00010020, offset 10092544 > > Hi Dmitry, > > Thanks for testing. Yeah, it looks like the patch may be corrupting > the WAL stream in some case that I didn't hit in my own testing > procedure. I will try to reproduce these failures. Hi Dmitry, I managed to reproduce something like this on one of my home lab machines running a different OS. Not sure why yet and it doesn't happen on my primary development box, which is why I hadn't noticed it. I will investigate and aim to get a fix posted in time for the Commitfest. I'm also hoping to corner Simon at PGDay Australia in a couple of weeks to discuss this proposal... -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] MERGE SQL Statement for PG11
Simon, Nice writeup. While the standard may not require a unique index for the ON clause I have never seen a MERGE statement that did not have this property. So IMHO this is a reasonable restriction. In fact I have only ever seen two flavors of usage:

* Single row source (most often simply a VALUES clause) in OLTP. In that case there was lots of concurrency.
* Massive source which affects a significant portion of the target table in DW. In this case there were no concurrent MERGEs.

I believe support for returning rows at a later stage would prove to be very powerful, especially in combination with chaining MERGE statements in CTEs. To do that would require language extensions to pass the coloring of the source row through, especially for rows that fell into "do nothing".
[HACKERS] git down
git.postgresql.org is down/unreachable ( git://git.postgresql.org/git/postgresql.git )
Re: [HACKERS] Parallel safety for extern params
On Mon, Oct 16, 2017 at 12:59 PM, Amit Kapila wrote: >> I think the Param case should be mentioned after "... but" not before >> - i.e. referencing the child node's output... but setrefs.c might also >> have copied a Const or Param as-is. > > I am not sure if we can write the comment like that (.. copied a Const > or Param as-is.) because fix_upper_expr_mutator in setrefs.c has a > special handling for Var and Param where constants are copied as-is > via expression_tree_mutator. Also as proposed, the handling for > params is more like Var in exec_save_simple_expr. I committed fix_parallel_mode_nested_execution_v2.patch with some cosmetic tweaks. I back-patched it to 10 and 9.6, then had to fix some issues reported by Tom as followup commits. With respect to the bit quoted above, I rephrased the comment in a slightly different way that hopefully is a reasonable compromise, combined it with the main patch, and pushed it to master. Along the way I also got rid of the if statement you introduced and just made the Assert() more complicated instead, which seems better to me. When I tried back-porting the patch to v10 I discovered that the plpgsql changes conflict heavily and that ripping them out all the way produces regression failures under force_parallel_mode. I think I see why that's happening but it's not obvious to me how to adapt b73f1b5c29e0ace5a281bc13ce09dea30e1b66de to the v10 code. Do you want to have a crack at it or should we just leave this as a master-only fix? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Fri, Oct 27, 2017 at 7:27 PM, Alvaro Herrera wrote: > I noticed that RelationBuildPartitionKey is generating a partition key > in a temp context, then creating a private context and copying the key > into that. That seems leftover from some previous iteration of some > other patch; I think it's pretty reasonable to create the new context > right from the start and allocate the key there directly instead. Then > there's no need for copy_partition_key at all. We could do that, but the motivation for the current system was to avoid leaking memory in a long-lived context. I think the same technique is used elsewhere for similar reasons. I admit I haven't checked whether there would actually be a leak here if we did it as you propose, but I wouldn't find it at all surprising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: BRIN bloom indexes
Hi, On 10/27/2017 05:22 PM, Sokolov Yura wrote: > > Hi, Tomas > > BRIN bloom index is a really cool feature, that definitely should be in > core distribution (either in contrib or builtin)!!! > > Small suggestion for algorithm: > > It is a well-known practice not to calculate the whole hash function for every > point, but to use a double hashing approach. > Example of proving the double hashing approach for bloom filters: > https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf > > So, instead of:
>
>     for (i = 0; i < filter->nhashes; i++)
>     {
>         /* compute the hashes with a seed, used for the bloom filter */
>         uint32 h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;
>
>         /* set or check bit h */
>     }
>
> such a construction could be used:
>
>     /* compute the hash, used for the bloom filter */
>     uint32 big_h = ((uint32) DatumGetInt64(hash_uint32_extended(value, 0)));
>     uint32 h = big_h % filter->nbits;
>     /* ensures d is never >= filter->nbits */
>     uint32 d = big_h % (filter->nbits - 1);
>
>     for (i = 0; i < filter->nhashes; i++)
>     {
>         /* set or check bit h */
>
>         /* compute next bit h */
>         h += d++;
>         if (h >= filter->nbits) h -= filter->nbits;
>         if (d == filter->nbits) d = 0;
>     }
>
> Modulo of one 64-bit result by two coprime numbers (`nbits` and `nbits - 1`) > gives two good-quality functions usable for double hashing. > OK, thanks for the suggestion. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: BRIN bloom indexes
Hi, On 10/27/2017 07:17 PM, Nico Williams wrote: > On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote: > > A bloom filter index would, indeed, be wonderful. > > Comments: > > + * We use an optimisation that initially we store the uint32 values directly, > + * without the extra hashing step. And only later filling the bitmap space, > + * we switch to the regular bloom filter mode. > > I don't think that optimization is worthwhile. If I'm going to be using > a Bloom filter, it's because I expect to have a lot of rows. > > (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new > table just won't have lots of rows, then I might want this optimization, > but I can always just drop the Bloom filter index, or not include > indexes in the LIKE.) > I think you're confusing "rows" and "distinct values". Furthermore, it's rather irrelevant how many rows are in the table because BRIN indexes split the table into "ranges" that are 1MB by default. And you can only squash a certain number of rows into such a range. The idea of the optimization is to efficiently support cases where each range contains only a small number of distinct values - which is quite common in the cases I described in my initial message (with bursts of rows with the same IP / client identifier etc.). > + * XXX Perhaps we could save a few bytes by using different data types, but > + * considering the size of the bitmap, the difference is negligible. > > A bytea is all that's needed. See below. > > + * XXX We could also implement "sparse" bloom filters, keeping only the > + * bytes that are not entirely 0. That might make the "sorted" phase > + * mostly unnecessary. > > Filter compression is not worthwhile. You want to have a fairly uniform > hash distribution, and you want to end up with a Bloom filter that's not > much more than 50% full. That means that when near 50% full the filter > will not be very sparse at all. 
Optimizing for the not common case > doesn't seem worthwhile, and adds complexity. > Properly sizing the bloom filter requires knowledge of many variables, in particular the number of distinct values expected to be added to the filter. But we don't really know that number, and we also don't even know many values useful for estimating that (how many rows will fit into a range, number of distinct values in the whole table, etc.) So the idea was to oversize the bloom filter, and then use the sparse representation to reduce the size. > + * XXX We can also watch the number of bits set in the bloom filter, and > + * then stop using it (and not store the bitmap, to save space) when the > + * false positive rate gets too high. > > Ah, right, what you want is a "Scalable Bloom Filter". > That's not what I had in mind. My idea was to size the bloom filter on a "best effort" basis, and then if one range gets a bit too inaccurate then just get rid of the filter. If that happens for many ranges, we should probably allow specifying parameters as relopts for the index. > A Scalable Bloom filter is actually a series of Bloom filters where all > but the newest filter are closed to additions, and the newest filter is > where you do all the additions. You generally want to make each new > filter bigger than the preceding one because when searching a Scalable > Bloom filter you have to search *all* of them, so you want to minimize > the number of filters. > > Eventually, when you have enough sub-filters, you may want to re-write > the whole thing so that you start with a single sub-filter that is large > enough. > > The format of the bytea might then be something like: > > [[[...]] > > where the last bitmap is the filter that is open to additions. > I think this is really over-engineering, and I certainly don't plan to extend the patch in this direction. 
I do not expect these parameters (particularly the number of distinct values in a range) to significantly change over time, so the easiest solution is to provide a reloption to specify that number in CREATE/ALTER INDEX. Alternatively, we could allow the summarization process to gather some statistics (either on the whole range, or perhaps the whole table), and store them somewhere for later use. For example, to compute the number of distinct values per range (in the existing data), and use that later. > ... > > Of course, for something like: > > SELECT a.*, b.* FROM a a JOIN b b USING (some_col); > > a Bloom filter can only be used as an optimization to avoid using a > slower index (or heap scan) on the inner table source. > > What I'm getting at is that the query planner really needs to understand > that a Bloom filter is a probabilistic data structure. > It does, and we never produce incorrect results, which seems like the right thing to do. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list
Re: [HACKERS] logical decoding of two-phase transactions
On 2017-10-26 22:01, Sokolov Yura wrote: On 2017-09-27 14:46, Stas Kelvich wrote: On 7 Sep 2017, at 18:58, Nikhil Sontakke wrote: Hi, FYI all, wanted to mention that I am working on an updated version of the latest patch that I plan to submit to a later CF. Cool! So what kind of architecture do you have in mind? Same way as it was implemented before? As far as I remember there were two main issues: * Decoding of aborted prepared transactions. If such a transaction modified the catalog then we can’t read reliable info with our historic snapshot; since clog already has the abort bit set for our tx, it will break the visibility logic. There are some ways to deal with that — by doing a catalog seq scan two times and counting the number of tuples (details upthread) or by hijacking clog values in the historic visibility function. But ISTM it is better not to solve this issue at all =) In most cases the intended usage of decoding of 2PC transactions is to do some form of distributed commit, so naturally decoding will happen only with in-progress transactions, and commit/abort will happen only after it is decoded, sent, and a response is received. So we can just have an atomic flag that prevents commit/abort of the tx currently being decoded. And we can filter interesting prepared transactions based on GID, to prevent holding this lock for ordinary 2pc. * Possible deadlocks that Andres was talking about. I spent some time trying to find one, but didn’t find any. If locking pg_class in a prepared tx is the only example then (imho) it is better to just forbid preparing such transactions. Otherwise, if realistic examples that can block decoding actually exist, then we probably need to reconsider the way the tx is decoded. Anyway this part probably needs Andres' blessing. Just rebased patch logical_twophase_v6 to master. Fixed small issues: - XactLogAbortRecord wrote DBINFO twice, but it was decoded in ParseAbortRecord only once. The second DBINFO was parsed as ORIGIN. Fixed by removing the second write of DBINFO. 
- SnapBuildPrepareTxnFinish tried to remove the xid from `running` instead of `committed`. And it removed only the xid, without subxids. - test_decoding skipped returning "COMMIT PREPARED" and "ABORT PREPARED". The big issue was with decoding ddl-including two-phase transactions: - prepared.out was misleading. We could not reproduce decoding the body of "test_prepared#3" with logical_twophase_v6.diff. It was skipped if `pg_logical_slot_get_changes` was called without `twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED test_prepared#3" was decoded. The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare` twice: - first on "PREPARE" itself, - second on "COMMIT PREPARED". In v6, `pg_filter_prepare` without `with-catalog-changes` answered "true" the first time (ie it should not be decoded), and the second time (when the transaction had been committed) it answered "false" (ie it should be decoded). But the second time, in DecodePrepare, `ctx->snapshot_builder->start_decoding_at` is already in the future compared to `buf->origptr` (because it is at the "COMMIT PREPARED" lsn). Therefore DecodePrepare just called ReorderBufferForget. If `pg_filter_prepare` is called with `with-catalog-changes`, then it returns "false" both times, thus DecodePrepare decodes the transaction the first time, and calls `ReorderBufferForget` the second time. I didn't find a way to fix it gracefully. I just changed `pg_filter_prepare` to return the same answer both times: "false" if called `with-catalog-changes` (ie need to call DecodePrepare), and "true" otherwise. With this change, a catalog-changing two-phase transaction is decoded as a simple one-phase transaction, if `pg_logical_slot_get_changes` is called without `with-catalog-changes`. Small improvement compared to v7: - twophase_gid is written with alignment padding in the XactLogCommitRecord and XactLogAbortRecord. 
-- Sokolov Yura Postgres Professional: https://postgrespro.ru The Russian Postgres Company logical_twophase_v8.diff.gz Description: GNU Zip compressed data
Re: [HACKERS] WIP: BRIN bloom indexes
On 2017-10-27 20:17, Nico Williams wrote: On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote: A bloom filter index would, indeed, be wonderful. Comments: + * We use an optimisation that initially we store the uint32 values directly, + * without the extra hashing step. And only later filling the bitmap space, + * we switch to the regular bloom filter mode. I don't think that optimization is worthwhile. If I'm going to be using a Bloom filter, it's because I expect to have a lot of rows. (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new table just won't have lots of rows, then I might want this optimization, but I can always just drop the Bloom filter index, or not include indexes in the LIKE.) + * XXX Perhaps we could save a few bytes by using different data types, but + * considering the size of the bitmap, the difference is negligible. A bytea is all that's needed. See below. + * XXX We could also implement "sparse" bloom filters, keeping only the + * bytes that are not entirely 0. That might make the "sorted" phase + * mostly unnecessary. Filter compression is not worthwhile. You want to have a fairly uniform hash distribution, and you want to end up with a Bloom filter that's not much more than 50% full. That means that when near 50% full the filter will not be very sparse at all. Optimizing for the not common case doesn't seem worthwhile, and adds complexity. + * XXX We can also watch the number of bits set in the bloom filter, and + * then stop using it (and not store the bitmap, to save space) when the + * false positive rate gets too high. Ah, right, what you want is a "Scalable Bloom Filter". A Scalable Bloom filter is actually a series of Bloom filters where all but the newest filter are closed to additions, and the newest filter is where you do all the additions. 
You generally want to make each new filter bigger than the preceding one because when searching a Scalable Bloom filter you have to search *all* of them, so you want to minimize the number of filters. Eventually, when you have enough sub-filters, you may want to re-write the whole thing so that you start with a single sub-filter that is large enough. The format of the bytea might then be something like: [[[...]] where the last bitmap is the filter that is open to additions. I wonder if there are write concurrency performance considerations here... It might be better to have a bytea value per-sub-filter so that there is no lock contention for the closed filters. The closed sub-filters are constant, thus not even shared locks are needed for them, and especially not exclusive locks. Writing to the filter will necessitate locking the entire open filter, or else byte-range locking on it. Something to think about. Now, what about query performance? Unlike the "minmax" indexes, the "bloom" filter can only handle equality conditions. A Bloom filter has non-zero false positives for existence, but zero false positives for non-existence. This means that a Bloom filter index can only be used for: a) non-existence tests (with equality predicates, as you point out), b) as an optimization to avoid slower index checks (or heap scans) when the Bloom filter indicates non-existence. (b) is really just an internal application of (a). There might be applications where false positives might be ok in a query like: SELECT a.* FROM a a JOIN b b USING (some_col); but for most real-world queries like that, allowing false positives by default would be very bad. An option in the index declaration could be used to enable existence equality predicates, but I would not include such an option initially -- let's see if anyone actually has a use case for it first. 
Of course, for something like: SELECT a.*, b.* FROM a a JOIN b b USING (some_col); a Bloom filter can only be used as an optimization to avoid using a slower index (or heap scan) on the inner table source. What I'm getting at is that the query planner really needs to understand that a Bloom filter is a probabilistic data structure. Nico -- PostgreSQL has a lot of probabilistic indexes. Some GiST opclasses return false positives and tell the executor to recheck their results. A bitmap index scan, when there are a lot of result tuples, falls back to storing only page numbers, without actual tids, and the executor then scans heap pages to find the necessary tuples. A BRIN index as such just "recommends" heap pages for the executor to scan. Thus a BRIN index as a whole is just an optimisation (regardless of whether it is `minmax` or `bloom`). So that is OK. -- Sokolov Yura Postgres Professional: https://postgrespro.ru The Russian Postgres Company
Re: [HACKERS] Index only scan for cube and seg
Robert Haas writes: > On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin wrote: >> For cube there is a new default opclass. > I seem to recall that changing the default opclass causes unsolvable > problems with upgrades. You might want to check the archives for > previous discussions of this issue; unfortunately, I don't recall the > details off-hand. Quite aside from that, replacing the opclass with a new one creates user-visible headaches that I don't think are justified, i.e. having to reconstruct indexes in order to get the benefit. Maybe I'm missing something, but ISTM you could just drop the compress function and call it good. This would mean that an IOS scan would sometimes hand back a toast-compressed value, but what's the problem with that? (The only reason for making a decompress function that just detoasts is that your other support functions then do not have to consider the possibility that they're handed a toast-compressed value. I have not checked whether that really matters for cube.) regards, tom lane
[HACKERS] ALTER COLUMN TYPE vs. domain constraints
I found out that altering a column's type does not play nicely with domain constraints: tablecmds.c expects that only table constraints could depend on a column. Now, it's easy to hit that with domains over composite, so I propose to fix it in HEAD with the attached patch. However, if you really work at it, you can make it fail in the back branches too: regression=# create type comptype as (r float8, i float8); CREATE TYPE regression=# create domain silly as float8 check ((row(value,0)::comptype).r > 0); CREATE DOMAIN regression=# alter type comptype alter attribute r type varchar; ERROR: cache lookup failed for relation 0 Before commit 6784d7a1d, the ALTER actually went through, leaving a mess. Fortunately it doesn't actually crash afterwards, but you get things like regression=# select 0::silly; ERROR: ROW() column has type double precision instead of type character varying We could consider back-patching the attached to cover this, but I'm not entirely sure it's worth the trouble, because I haven't thought of any non-silly use-cases in the absence of domains over composite. Comments? 
regards, tom lane diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 3ab8087..fc93c4e 100644 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** static void ATPostAlterTypeParse(Oid old *** 426,431 --- 426,433 bool rewrite); static void RebuildConstraintComment(AlteredTableInfo *tab, int pass, Oid objid, Relation rel, char *conname); + static void RebuildDomainConstraintComment(AlteredTableInfo *tab, int pass, + Oid objid, List *domname, char *conname); static void TryReuseIndex(Oid oldId, IndexStmt *stmt); static void TryReuseForeignKey(Oid oldId, Constraint *con); static void change_owner_fix_column_acls(Oid relationOid, *** AlterTableGetLockLevel(List *cmds) *** 3319,3324 --- 3321,3327 case AT_ProcessedConstraint: /* becomes AT_AddConstraint */ case AT_AddConstraintRecurse: /* becomes AT_AddConstraint */ case AT_ReAddConstraint: /* becomes AT_AddConstraint */ + case AT_ReAddDomainConstraint: /* becomes AT_AddConstraint */ if (IsA(cmd->def, Constraint)) { Constraint *con = (Constraint *) cmd->def; *** ATRewriteCatalogs(List **wqueue, LOCKMOD *** 3819,3825 rel = relation_open(tab->relid, NoLock); foreach(lcmd, subcmds) ! ATExecCmd(wqueue, tab, rel, (AlterTableCmd *) lfirst(lcmd), lockmode); /* * After the ALTER TYPE pass, do cleanup work (this is not done in --- 3822,3830 rel = relation_open(tab->relid, NoLock); foreach(lcmd, subcmds) ! ATExecCmd(wqueue, tab, rel, ! castNode(AlterTableCmd, lfirst(lcmd)), ! 
lockmode); /* * After the ALTER TYPE pass, do cleanup work (this is not done in *** ATExecCmd(List **wqueue, AlteredTableInf *** 3936,3941 --- 3941,3953 ATExecAddConstraint(wqueue, tab, rel, (Constraint *) cmd->def, true, true, lockmode); break; + case AT_ReAddDomainConstraint: /* Re-add pre-existing domain check + * constraint */ + address = + AlterDomainAddConstraint(((AlterDomainStmt *) cmd->def)->typeName, + ((AlterDomainStmt *) cmd->def)->def, + NULL); + break; case AT_ReAddComment: /* Re-add existing comment */ address = CommentObject((CommentStmt *) cmd->def); break; *** ATPostAlterTypeCleanup(List **wqueue, Al *** 9616,9622 if (!HeapTupleIsValid(tup)) /* should not happen */ elog(ERROR, "cache lookup failed for constraint %u", oldId); con = (Form_pg_constraint) GETSTRUCT(tup); ! relid = con->conrelid; confrelid = con->confrelid; conislocal = con->conislocal; ReleaseSysCache(tup); --- 9628,9642 if (!HeapTupleIsValid(tup)) /* should not happen */ elog(ERROR, "cache lookup failed for constraint %u", oldId); con = (Form_pg_constraint) GETSTRUCT(tup); ! if (OidIsValid(con->conrelid)) ! relid = con->conrelid; ! else ! { ! /* must be a domain constraint */ ! relid = get_typ_typrelid(getBaseType(con->contypid)); ! if (!OidIsValid(relid)) ! elog(ERROR, "could not identify relation associated with constraint %u", oldId); ! } confrelid = con->confrelid; conislocal = con->conislocal; ReleaseSysCache(tup); *** ATPostAlterTypeParse(Oid oldId, Oid oldR *** 9753,9759 foreach(lcmd, stmt->cmds) { ! AlterTableCmd *cmd = (AlterTableCmd *) lfirst(lcmd); if (cmd->subtype == AT_AddIndex) { --- 9773,9779 foreach(lcmd, stmt->cmds) { ! AlterTableCmd *cmd = castNode(AlterTableCmd, lfirst(lcmd)); if (cmd->subtype == AT_AddIndex) { *** ATPostAlterTypeParse(Oid oldId, Oid oldR *** 9781,9789 } else if (cmd->subtype ==
Re: [HACKERS] Proposal: Local indexes for partitioned table
Alvaro Herrera wrote: > I'm now working on the ability to build unique indexes (and unique > constraints) on top of this. So I think there's not a lot of additional code required to support unique indexes with the restrictions mentioned; proof-of-concept (with several holes still) attached. As before, this is not finished, as there are a few things that are wrong (such as generateClonedIndexStmt taking a RangeVar), and others that I don't understand (such as why rd_partkey is NULL for partitioned partitions when DefineIndex cascades to them and the columns are checked). I noticed that RelationBuildPartitionKey is generating a partition key in a temp context, then creating a private context and copying the key into that. That seems leftover from some previous iteration of some other patch; I think it's pretty reasonable to create the new context right from the start and allocate the key there directly instead. Then there's no need for copy_partition_key at all. Anyway, here is a preliminary submission before I close shop for the week. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 37e683e352df5f3e6c110d94881abe888e36953e Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 16 Oct 2017 12:01:12 +0200 Subject: [PATCH v2 1/7] Tweak index_create/index_constraint_create APIs I was annoyed by index_create and index_constraint_create respective APIs. It's too easy to get confused when adding a new behavioral flag given the large number of boolean flags they already have. Turning them into a flags bitmask makes that code easier to read, as in the attached patch. I split the flags in two -- one set for index_create directly and another related to constraints. index_create() itself receives both, and then passes down the constraint one to index_constraint_create. It is shorter in LOC terms to create a single set mixing all the values, but it seemed to make more sense this way. 
Also, I chose not to turn the booleans 'allow_sytem_table_mods' and 'is_internal' into flag bits because of the different way they interact with this code. --- src/backend/catalog/index.c | 101 +++ src/backend/catalog/toasting.c | 3 +- src/backend/commands/indexcmds.c | 33 + src/backend/commands/tablecmds.c | 13 +++-- src/include/catalog/index.h | 29 ++- 5 files changed, 100 insertions(+), 79 deletions(-) diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index c7b2f031f0..17faeffada 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -680,19 +680,25 @@ UpdateIndexRelation(Oid indexoid, * classObjectId: array of index opclass OIDs, one per index column * coloptions: array of per-index-column indoption settings * reloptions: AM-specific options - * isprimary: index is a PRIMARY KEY - * isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION constraint - * deferrable: constraint is DEFERRABLE - * initdeferred: constraint is INITIALLY DEFERRED + * flags: bitmask that can include any combination of these bits: + * INDEX_CREATE_IS_PRIMARY + * the index is a primary key + * INDEX_CREATE_ADD_CONSTRAINT: + * invoke index_constraint_create also + * INDEX_CREATE_SKIP_BUILD: + * skip the index_build() step for the moment; caller must do it + * later (typically via reindex_index()) + * INDEX_CREATE_CONCURRENT: + * do not lock the table against writers. The index will be + * marked "invalid" and the caller must take additional steps + * to fix it up. + * INDEX_CREATE_IF_NOT_EXISTS: + * do not throw an error if a relation with the same name + * already exists. + * constr_flags: flags passed to index_constraint_create + * (only if INDEX_CREATE_ADD_CONSTRAINT is set) * allow_system_table_mods: allow table to be a system catalog - * skip_build: true to skip the index_build() step for the moment; caller - * must do it later (typically via reindex_index()) - * concurrent: if true, do not lock the table against writers. 
The index - * will be marked "invalid" and the caller must take additional steps - * to fix it up. * is_internal: if true, post creation hook for new index - * if_not_exists: if true, do not throw an error if a relation with - * the same name already exists. * * Returns the OID of the created index. */ @@ -709,15 +715,10 @@ index_create(Relation heapRelation, Oid *classObjectId, int16 *coloptions, Datum reloptions, -bool isprimary, -bool
Re: [HACKERS] WIP: BRIN bloom indexes
On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote: A bloom filter index would, indeed, be wonderful. Comments: + * We use an optimisation that initially we store the uint32 values directly, + * without the extra hashing step. And only later filling the bitmap space, + * we switch to the regular bloom filter mode. I don't think that optimization is worthwhile. If I'm going to be using a Bloom filter, it's because I expect to have a lot of rows. (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new table just won't have lots of rows, then I might want this optimization, but I can always just drop the Bloom filter index, or not include indexes in the LIKE.) + * XXX Perhaps we could save a few bytes by using different data types, but + * considering the size of the bitmap, the difference is negligible. A bytea is all that's needed. See below. + * XXX We could also implement "sparse" bloom filters, keeping only the + * bytes that are not entirely 0. That might make the "sorted" phase + * mostly unnecessary. Filter compression is not worthwhile. You want to have a fairly uniform hash distribution, and you want to end up with a Bloom filter that's not much more than 50% full. That means that when near 50% full the filter will not be very sparse at all. Optimizing for the not common case doesn't seem worthwhile, and adds complexity. + * XXX We can also watch the number of bits set in the bloom filter, and + * then stop using it (and not store the bitmap, to save space) when the + * false positive rate gets too high. Ah, right, what you want is a "Scalable Bloom Filter". A Scalable Bloom filter is actually a series of Bloom filters where all but the newest filter are closed to additions, and the newest filter is where you do all the additions. You generally want to make each new filter bigger than the preceding one because when searching a Scalable Bloom filter you have to search *all* of them, so you want to minimize the number of filters. 
Eventually, when you have enough sub-filters, you may want to re-write the whole thing so that you start with a single sub-filter that is large enough. The format of the bytea might then be something like: [[[...]] where the last bitmap is the filter that is open to additions. I wonder if there are write concurrency performance considerations here... It might be better to have a bytea value per-sub-filter so that there is no lock contention for the closed filters. The closed sub-filters are constant, thus not even shared locks are needed for them, and especially not exclusive locks. Writing to the filter will necessitate locking the entire open filter, or else byte-range locking on it. Something to think about. > Now, what about query performance? Unlike the "minmax" indexes, the > "bloom" filter can only handle equality conditions. A Bloom filter has non-zero false positives for existence, but zero false positives for non-existence. This means that a Bloom filter index can only be used for: a) non-existence tests (with equality predicates, as you point out), b) as an optimization to avoid slower index checks (or heap scans) when the Bloom filter indicates non-existence. (b) is really just an internal application of (a). There might be applications where false positives might be ok in a query like: SELECT a.* FROM a a JOIN b b USING (some_col); but for most real-world queries like that, allowing false positives by default would be very bad. An option in the index declaration could be used to enable existence equality predicates, but I would not include such an option initially -- let's see if anyone actually has a use case for it first. Of course, for something like: SELECT a.*, b.* FROM a a JOIN b b USING (some_col); a Bloom filter can only be used as an optimization to avoid using a slower index (or heap scan) on the inner table source. What I'm getting at is that the query planner really needs to understand that a Bloom filter is a probabilistic data structure. 
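The Scalable Bloom Filter structure described above — a series of sub-filters where only the newest accepts additions, each new one larger than the last, and where a lookup must consult every sub-filter — might be sketched like this (an illustrative toy in Python, not the proposed bytea format or any PostgreSQL code; all class and parameter names here are assumptions):

```python
import hashlib

class SimpleBloom:
    """One fixed-size sub-filter."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, value):
        # Derive k bit positions by salting a cryptographic hash; a real
        # implementation would use cheaper hashing.
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, value.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))

class ScalableBloom:
    """Series of sub-filters: only the last is open to additions, and a
    lookup must check every sub-filter -- hence the advice to keep their
    number small by growing each new one."""
    def __init__(self, m=1024, k=7, capacity=100, growth=2):
        self.k, self.capacity, self.growth = k, capacity, growth
        self.filters = [SimpleBloom(m, k)]
        self.count = 0  # additions made to the open sub-filter

    def add(self, value):
        if self.count >= self.capacity:
            # Close the current sub-filter and open a bigger one.
            self.filters.append(
                SimpleBloom(self.filters[-1].m * self.growth, self.k))
            self.capacity *= self.growth
            self.count = 0
        self.filters[-1].add(value)
        self.count += 1

    def might_contain(self, value):
        return any(f.might_contain(value) for f in self.filters)
```

The write-concurrency point above maps directly onto this structure: every sub-filter except `filters[-1]` is immutable, so only the open one would ever need an exclusive lock (or a per-sub-filter bytea, as suggested).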
Nico --
Re: [HACKERS] Index only scan for cube and seg
On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin wrote: > For cube there is a new default opclass. I seem to recall that changing the default opclass causes unsolvable problems with upgrades. You might want to check the archives for previous discussions of this issue; unfortunately, I don't recall the details off-hand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 4:41 PM, Simon Riggs wrote: > I didn't say it but my intention was to just throw an ERROR if no > single unique index can be identified. > > It could be possible to still run MERGE in that situation but we would > need to take a full table lock at ShareRowExclusive. It's quite likely > that such statements would throw duplicate update errors, so I > wouldn't be aiming to do anything with that for PG11. Like Peter, I think taking such a strong lock for a DML statement doesn't sound like a very desirable way forward. It means, for example, that you can only have one MERGE in progress on a table at the same time, which is quite limiting. It could easily be the case that you have multiple MERGE statements running at once but they touch disjoint groups of rows and therefore everything works. I think the code should be able to cope with concurrent changes, if nothing else by throwing an ERROR, and then if the user wants to ensure that doesn't happen by taking ShareRowExclusiveLock they can do that via an explicit LOCK TABLE statement -- or else they can prevent concurrency by any other means they see fit. Other problems with taking ShareRowExclusiveLock include (1) probable lock upgrade hazards and (2) do you really want MERGE to kick autovacuum off of your giant table? Probably not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 6:24 AM, Robert Haas wrote: > I think one of the reasons why Peter Geoghegan decided to pursue > INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL > syntax, he felt free to mandate a non-standard SQL requirement, namely > the presence of a unique index on the arbiter columns. That's true, but what I was really insistent on, more than anything else, was that the user would get a practical guarantee about an insert-or-update outcome under concurrency. There could be no "unprincipled deadlocks", nor could there be spurious unique violations. This is the kind of thing that the SQL standard doesn't really concern itself with, and yet it's of significant practical importance to users. Both Oracle and SQL Server allow these things that I specifically set out to avoid. I think that that's mostly a good thing, though; they do a really bad job of explaining what's what, and don't provide for a very real need ("upsert") in some other way, but their MERGE semantics do make sense to me. -- Peter Geoghegan
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 7:41 AM, Simon Riggs wrote: > Good points. > > I didn't say it but my intention was to just throw an ERROR if no > single unique index can be identified. You'd also throw an error when there were no "upsert compatible" join quals, I take it? I don't see the point in that. That's just mapping one syntax on to another. > It could be possible to still run MERGE in that situation but we would > need to take a full table lock at ShareRowExclusive. It's quite likely > that such statements would throw duplicate update errors, so I > wouldn't be aiming to do anything with that for PG11. I would avoid mixing up ON CONFLICT DO UPDATE and MERGE. The "anomalies" you describe in MERGE are not really anomalies IMV. They're simply how the feature is supposed to operate, and how it's possible to make MERGE use alternative join algorithms based only on the underlying cost. You might use a merge join for a bulk load use-case, for example. I think an SQL MERGE feature would be compelling, but I don't think that it should take much from ON CONFLICT. As I've said many times [1], they're really two different features (Teradata had features similar to each, for example). I suggest that you come up with something that has the semantics that the standard requires, and therefore makes none of the ON CONFLICT guarantees about an outcome under concurrency (INSERT or UPDATE). Those guarantees are basically incompatible with how MERGE needs to work. In case it matters, I think that the idea of varying relation heavyweight lock strength based on subtle semantics within a DML statement is a bad one. Frankly, I think that that's going to be a nonstarter. [1] https://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=yydgyactzg0xa8-1_fcvo5xm7hrel3...@mail.gmail.com -- Peter Geoghegan
Re: [HACKERS] WIP: BRIN bloom indexes
On 2017-10-19 23:15, Tomas Vondra wrote: Hi, The BRIN minmax opclasses work well only for data where the column is somewhat correlated to physical location in a table. So it works great for timestamps in append-only log tables, for example. When that is not the case (non-correlated columns) the minmax ranges get very "wide" and we end up scanning large portions of the tables. A typical example are columns with types that are inherently random (or rather non-correlated) like for example UUIDs, IP or MAC addresses, and so on. But it can just as easily happen even with simple IDs generated from a sequence. This WIP patch implements another type of BRIN index, with the "summary" being represented by a bloom filter. So instead of building a [min,max] range for each page range, the index stores a small bloom filter for it. The index is of course larger than the regular "minmax" BRIN index, but it's still orders of magnitude smaller than a btree index. Note: This is different from the index type implemented in the "bloom" extension. Those are not related to BRIN at all, and the index builds a bloom filter on individual rows (values in different columns). BTW kudos to everyone who worked on the BRIN infrastructure and API. I found it very intuitive and well-documented. Adding the new opclass was extremely straight-forward. To demonstrate this on a small example, consider this table: CREATE TABLE bloom_test (id uuid, padding text); INSERT INTO bloom_test SELECT md5((mod(i,100)/100)::text)::uuid, md5(i::text) FROM generate_series(1,2) s(i); VACUUM ANALYZE bloom_test; List of relations Schema |Name| Type | Owner | Size | Description ++---+---+---+- public | bloom_test | table | tomas | 16 GB | (1 row) The table was built so that each range contains a relatively small number of distinct UUID values - this is typical e.g. for user activity with "bursts" of actions from a particular user separated by long periods of inactivity (with actions from other users). 
Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id column. create index test_brin_idx on bloom_test using brin (id); create index test_bloom_idx on bloom_test using brin (id uuid_bloom_ops); create index test_btree_idx on bloom_test (id); which gives us this: List of relations Schema | Name | Type | Owner | Table| Size ++---+---++- public | test_bloom_idx | index | tomas | bloom_test | 12 MB public | test_brin_idx | index | tomas | bloom_test | 832 kB public | test_btree_idx | index | tomas | bloom_test | 6016 MB (3 rows) So yeah - the "bloom" index is larger than the default "minmax" index, but it's still orders of magnitude smaller than the regular btree one. Now, what about query performance? Unlike the "minmax" indexes, the "bloom" filter can only handle equality conditions. Let's see a query like this: select * from bloom_test where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'; The minmax index produces this plan QUERY PLAN Bitmap Heap Scan on bloom_test (cost=390.31..2130910.82 rows=20593 width=49) (actual time=197.974..22707.311 rows=2 loops=1) Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid) Rows Removed by Index Recheck: 19998 Heap Blocks: lossy=2061856 -> Bitmap Index Scan on test_brin_idx (cost=0.00..385.16 rows=5493161 width=0) (actual time=133.463..133.463 rows=20619520 loops=1) Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid) Planning time: 0.165 ms Execution time: 22707.891 ms (8 rows) Which is not that great, and the long duration is a direct consequence of "wide" ranges - the bitmap heap scan had to scan pretty much the whole table. (A sequential scan takes only about 15 seconds.) 
Now, the bloom index:

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on bloom_test  (cost=5898.31..2136418.82 rows=20593 width=49) (actual time=24.229..338.324 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 2500448
   Heap Blocks: lossy=25984
   ->  Bitmap Index Scan on test_bloom_idx  (cost=0.00..5893.16 rows=5493161 width=0) (actual time=22.811..22.811 rows=259840 loops=1)
         Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.201 ms
 Execution time: 338.946 ms
(8 rows)

So, way better. For comparison, a simple index scan / bitmap index scan using the btree takes only about ~10ms in this case, but that's mostly thanks to the whole dataset being entirely cached.
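The "not much more than 50% full" guideline raised earlier in the thread can be checked numerically. The standard textbook approximations (an illustration, not the patch's exact sizing math) for a filter of m bits with k hash functions over n distinct values are:

```python
import math

def bloom_stats(m_bits, k_hashes, n_values):
    """Expected fill fraction and false-positive rate of a classic
    bloom filter (textbook approximations, not the patch's exact math)."""
    fill = 1.0 - math.exp(-k_hashes * n_values / m_bits)
    fpr = fill ** k_hashes
    return fill, fpr

def optimal_k(m_bits, n_values):
    """Optimal number of hash functions, (m/n) * ln 2, which leaves the
    filter about half full at capacity."""
    return max(1, round(m_bits / n_values * math.log(2)))

# Hypothetical sizing: 1 kB of bits per page range, 1000 distinct values.
m, n = 8192, 1000
k = optimal_k(m, n)            # (8192/1000) * ln 2 ~ 5.7 -> 6
fill, fpr = bloom_stats(m, k, n)
```

With the optimal k the filter sits at roughly 50% fill and a ~2% false-positive rate; push n well past the design capacity and the fill fraction - and with it the fraction of page ranges needlessly rechecked - climbs quickly, which is why per-range capacity matters.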
Re: [HACKERS] MERGE SQL Statement for PG11
On 27 October 2017 at 15:24, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 10:55 AM, Simon Riggs wrote:
>> Questions?
>
> I think one of the reasons why Peter Geoghegan decided to pursue
> INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL
> syntax, he felt free to mandate a non-standard SQL requirement, namely
> the presence of a unique index on the arbiter columns. If MERGE's
> join clause happens to involve equality conditions on precisely the
> same set of columns as some unique index on the target table, then I
> think you can reuse the INSERT .. ON CONFLICT UPDATE infrastructure
> and I suspect there won't be too many problems.

Agreed

> However, if it
> doesn't, then what? You could decree that such cases will fail, but
> that might not meet your use case for developing the feature. Or you
> could try to soldier on without the INSERT .. ON CONFLICT UPDATE
> machinery, but that means, I think, that sometimes you will get
> serialization anomalies - at least, I think, you will sometimes obtain
> results that couldn't have been obtained under any serial order of
> execution, and maybe it would end up being possible to fail with
> serialization errors or unique index violations.
>
> In the past, there have been objections to implementations of MERGE
> which would give rise to such serialization anomalies, but I'm not
> sure we should feel bound by those discussions. One thing that's
> different is that the common and actually-useful case can now be made
> to work in a fairly satisfying way using INSERT .. ON CONFLICT UPDATE;
> if less useful cases are vulnerable to some weirdness, maybe it's OK
> to just document the problems.

Good points. I didn't say it but my intention was to just throw an ERROR if no single unique index can be identified. It could be possible to still run MERGE in that situation but we would need to take a full table lock at ShareRowExclusive.
It's quite likely that such statements would throw duplicate update errors, so I wouldn't be aiming to do anything with that for PG11.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
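The row-matching semantics MERGE implies can be sketched as follows (a naive, single-threaded Python illustration, not PostgreSQL's implementation - the function and key names are made up). Note that nothing here handles concurrency: two sessions running this logic at once could both see a source row as "not matched" and insert it twice, which is exactly the anomaly the unique-index arbitration in INSERT .. ON CONFLICT is designed to prevent:

```python
def merge(target, source, key):
    """Naive sketch of MERGE semantics: target and source are lists of
    dicts, matched on `key`. WHEN MATCHED -> update, WHEN NOT MATCHED
    -> insert. No concurrency handling whatsoever."""
    index = {row[key]: row for row in target}
    for src in source:
        match = index.get(src[key])
        if match is not None:          # WHEN MATCHED THEN UPDATE
            match.update(src)
        else:                          # WHEN NOT MATCHED THEN INSERT
            new = dict(src)
            target.append(new)
            index[new[key]] = new
    return target

tgt = [{"id": 1, "val": "old"}]
merge(tgt, [{"id": 1, "val": "new"}, {"id": 2, "val": "x"}], "id")
```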
Re: [HACKERS] Linking libpq statically to libssl
On Fri, Oct 27, 2017 at 2:37 PM, Tom Lane wrote:
> Daniele Varrazzo writes:
>> I have a problem building binary packages for psycopg2. Binary
>> packages ship with their own copies of libpq and libssl; however if
>> another python package links to libssl the library will be imported
>> twice with conflicting symbols, likely resulting in a segfault (see
>> https://github.com/psycopg/psycopg2/issues/543). This happens e.g. if
>> a python script both connects to postgres and opens an https resource.
>
> Basically, you're doing it wrong. Shipping your own copy of libssl,
> rather than depending on whatever packaging the platform provides,
> is just asking for pain --- and not only of this sort. You're also
> now on the hook to update your package whenever libssl fixes a bug
> or a security vulnerability, which happens depressingly often.
>
> The same applies to libpq, really. You don't want to be in the
> business of shipping bits that you are not the originator of.
>
> When I worked at Red Hat, there was an ironclad policy against
> building packages that incorporated other packages statically.
> I would imagine that other distros have similar policies for
> similar reasons. Just because you *can* ignore those policies
> doesn't mean you *should*.

Distros do compile the library from source and against the system package, and everyone using the package directly can still do so; the binary packages are only installed by the Python package manager. Psycopg is more complex to install than the average Python package (it needs python and postgres dev files, pg_config available somewhere etc) and a no-dependencies package provides a much smoother experience. It also happens that the libpq and libssl versions used tend to be more up-to-date than the system ones (people can already use the new libpq 10 features without waiting for debian packages).
I am confronted with the reality of Python developers as of the 201x's, and shipping binary packages has generally proven a positive feature, even factoring in the need to ship updated binary packages when the need arises. The benefit of a simple-to-use library is for the Postgres project at large; it is not for my personal gain.

So while I know the shortcomings of binary packages and static libraries, I would still be interested in knowing the best way to fix the problem in the subject. Feel free to write in private if you want to avoid the public shaming.

-- 
Daniele
Re: [HACKERS] proposal: schema variables
2017-10-27 15:38 GMT+02:00 Gilles Darold: > Le 26/10/2017 à 09:21, Pavel Stehule a écrit : > > Hi, > > > > I propose a new database object - a variable. The variable is > > persistent object, that holds unshared session based not transactional > > in memory value of any type. Like variables in any other languages. > > The persistence is required for possibility to do static checks, but > > can be limited to session - the variables can be temporal. > > > > My proposal is related to session variables from Sybase, MSSQL or > > MySQL (based on prefix usage @ or @@), or package variables from > > Oracle (access is controlled by scope), or schema variables from DB2. > > Any design is coming from different sources, traditions and has some > > advantages or disadvantages. The base of my proposal is usage schema > > variables as session variables for stored procedures. It should to > > help to people who try to port complex projects to PostgreSQL from > > other databases. > > > > The Sybase (T-SQL) design is good for interactive work, but it is > > weak for usage in stored procedures - the static check is not > > possible. Is not possible to set some access rights on variables. > > > > The ADA design (used on Oracle) based on scope is great, but our > > environment is not nested. And we should to support other PL than > > PLpgSQL more strongly. > > > > There is not too much other possibilities - the variable that should > > be accessed from different PL, different procedures (in time) should > > to live somewhere over PL, and there is the schema only. > > > > The variable can be created by CREATE statement: > > > > CREATE VARIABLE public.myvar AS integer; > > CREATE VARIABLE myschema.myvar AS mytype; > > > > CREATE [TEMP] VARIABLE [IF NOT EXISTS] name AS type > > [ DEFAULT expression ] [[NOT] NULL] > > [ ON TRANSACTION END { RESET | DROP } ] > > [ { VOLATILE | STABLE } ]; > > > > It is dropped by command DROP VARIABLE [ IF EXISTS] varname. 
> > > > The access rights is controlled by usual access rights - by commands > > GRANT/REVOKE. The possible rights are: READ, WRITE > > > > The variables can be modified by SQL command SET (this is taken from > > standard, and it natural) > > > > SET varname = expression; > > > > Unfortunately we use the SET command for different purpose. But I am > > thinking so we can solve it with few tricks. The first is moving our > > GUC to pg_catalog schema. We can control the strictness of SET > > command. In one variant, we can detect custom GUC and allow it, in > > another we can disallow a custom GUC and allow only schema variables. > > A new command LET can be alternative. > > > > The variables should be used in queries implicitly (without JOIN) > > > > SELECT varname; > > > > The SEARCH_PATH is used, when varname is located. The variables can be > > used everywhere where query parameters are allowed. > > > > I hope so this proposal is good enough and simple. > > > > Comments, notes? > > > > regards > > > > Pavel > > > > > > Great feature that will help for migration. How will you handle CONSTANT > declaration? With Oracle it is possible to declare a constant as follow: > > > varname CONSTANT INTEGER := 500; > > > for a variable that can't be changed. Do you plan to add a CONSTANT or > READONLY keyword or do you want use GRANT on the object to deal with > this case?

PL/pgSQL declarations support CONSTANT - I forgot it. Thank you.

Pavel

> Regards
>
> -- 
> Gilles Darold
> Consultant PostgreSQL
> http://dalibo.com - http://dalibo.org
[HACKERS] bgwriter_lru_maxpages range in postgresql.conf
With v10, the upper limit on bgwriter_lru_maxpages was changed from 1000 to INT_MAX / 2, but postgresql.conf.sample was not updated:

#bgwriter_lru_maxpages = 100            # 0-1000 max buffers written/round

I don't see any precedent for including INT_MAX-type limits in the sample config file, so maybe something like this:

#bgwriter_lru_maxpages = 100            # max buffers written/round, 0 to turn off

Cheers,
Jeff
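For reference, on a platform where C's int is 32 bits (an assumption - INT_MAX is platform-defined), the new upper limit works out to a value that clearly doesn't belong in a sample-file comment:

```python
# 32-bit signed INT_MAX, and the new bgwriter_lru_maxpages ceiling
# (C integer division of INT_MAX / 2).
INT_MAX = 2**31 - 1
new_upper_limit = INT_MAX // 2   # 1073741823, vs. the old limit of 1000
```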
Re: [HACKERS] proposal: schema variables
Le 26/10/2017 à 09:21, Pavel Stehule a écrit : > Hi, > > I propose a new database object - a variable. The variable is > persistent object, that holds unshared session based not transactional > in memory value of any type. Like variables in any other languages. > The persistence is required for possibility to do static checks, but > can be limited to session - the variables can be temporal. > > My proposal is related to session variables from Sybase, MSSQL or > MySQL (based on prefix usage @ or @@), or package variables from > Oracle (access is controlled by scope), or schema variables from DB2. > Any design is coming from different sources, traditions and has some > advantages or disadvantages. The base of my proposal is usage schema > variables as session variables for stored procedures. It should to > help to people who try to port complex projects to PostgreSQL from > other databases. > > The Sybase (T-SQL) design is good for interactive work, but it is > weak for usage in stored procedures - the static check is not > possible. Is not possible to set some access rights on variables. > > The ADA design (used on Oracle) based on scope is great, but our > environment is not nested. And we should to support other PL than > PLpgSQL more strongly. > > There is not too much other possibilities - the variable that should > be accessed from different PL, different procedures (in time) should > to live somewhere over PL, and there is the schema only. > > The variable can be created by CREATE statement: > > CREATE VARIABLE public.myvar AS integer; > CREATE VARIABLE myschema.myvar AS mytype; > > CREATE [TEMP] VARIABLE [IF NOT EXISTS] name AS type > [ DEFAULT expression ] [[NOT] NULL] > [ ON TRANSACTION END { RESET | DROP } ] > [ { VOLATILE | STABLE } ]; > > It is dropped by command DROP VARIABLE [ IF EXISTS] varname. > > The access rights is controlled by usual access rights - by commands > GRANT/REVOKE. 
The possible rights are: READ, WRITE
>
> The variables can be modified by SQL command SET (this is taken from
> standard, and it natural)
>
> SET varname = expression;
>
> Unfortunately we use the SET command for different purpose. But I am
> thinking so we can solve it with few tricks. The first is moving our
> GUC to pg_catalog schema. We can control the strictness of SET
> command. In one variant, we can detect custom GUC and allow it, in
> another we can disallow a custom GUC and allow only schema variables.
> A new command LET can be alternative.
>
> The variables should be used in queries implicitly (without JOIN)
>
> SELECT varname;
>
> The SEARCH_PATH is used, when varname is located. The variables can be
> used everywhere where query parameters are allowed.
>
> I hope so this proposal is good enough and simple.
>
> Comments, notes?
>
> regards
>
> Pavel

Great feature that will help for migration. How will you handle CONSTANT declaration? With Oracle it is possible to declare a constant as follow:

varname CONSTANT INTEGER := 500;

for a variable that can't be changed. Do you plan to add a CONSTANT or READONLY keyword or do you want use GRANT on the object to deal with this case?

Regards

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org
Re: [HACKERS] Linking libpq statically to libssl
Daniele Varrazzo writes:
> I have a problem building binary packages for psycopg2. Binary
> packages ship with their own copies of libpq and libssl; however if
> another python package links to libssl the library will be imported
> twice with conflicting symbols, likely resulting in a segfault (see
> https://github.com/psycopg/psycopg2/issues/543). This happens e.g. if
> a python script both connects to postgres and opens an https resource.

Basically, you're doing it wrong. Shipping your own copy of libssl, rather than depending on whatever packaging the platform provides, is just asking for pain --- and not only of this sort. You're also now on the hook to update your package whenever libssl fixes a bug or a security vulnerability, which happens depressingly often.

The same applies to libpq, really. You don't want to be in the business of shipping bits that you are not the originator of.

When I worked at Red Hat, there was an ironclad policy against building packages that incorporated other packages statically. I would imagine that other distros have similar policies for similar reasons. Just because you *can* ignore those policies doesn't mean you *should*.

regards, tom lane
Re: [HACKERS] [GENERAL] Postgres 10 manual breaks links with anchors
Peter Eisentraut writes:
> On 10/26/17 16:10, Tom Lane wrote:
>> In view of commit 1ff01b390, aren't we more or less locked into
>> lower-case anchors going forward?

> The details are more complicated. ...

Ah. I'd imagined that we were using the original case for the anchors, rather than smashing them to upper (or lower) case.

> So the options are simply
> 1) Use the patch and keep indefinitely, keeping anchors compatible back
> to forever and forward indefinitely.
> 2) Don't use the patch, breaking anchors from <=9.6, but keeping them
> compatible going forward.
>
> Considering how small the patch is compared to some other customizations
> we carry, #1 seems reasonable to me.

+1

regards, tom lane
Re: [HACKERS] MERGE SQL Statement for PG11
On Fri, Oct 27, 2017 at 10:55 AM, Simon Riggs wrote:
> Questions?

I think one of the reasons why Peter Geoghegan decided to pursue INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL syntax, he felt free to mandate a non-standard SQL requirement, namely the presence of a unique index on the arbiter columns. If MERGE's join clause happens to involve equality conditions on precisely the same set of columns as some unique index on the target table, then I think you can reuse the INSERT .. ON CONFLICT UPDATE infrastructure and I suspect there won't be too many problems.

However, if it doesn't, then what? You could decree that such cases will fail, but that might not meet your use case for developing the feature. Or you could try to soldier on without the INSERT .. ON CONFLICT UPDATE machinery, but that means, I think, that sometimes you will get serialization anomalies - at least, I think, you will sometimes obtain results that couldn't have been obtained under any serial order of execution, and maybe it would end up being possible to fail with serialization errors or unique index violations.

In the past, there have been objections to implementations of MERGE which would give rise to such serialization anomalies, but I'm not sure we should feel bound by those discussions. One thing that's different is that the common and actually-useful case can now be made to work in a fairly satisfying way using INSERT .. ON CONFLICT UPDATE; if less useful cases are vulnerable to some weirdness, maybe it's OK to just document the problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Re: Burst in WAL size when UUID is used as PK while full_page_writes are enabled
On Fri, Oct 27, 2017 at 5:46 PM, Thomas Kellerer wrote:
> akapila wrote:
>
>> You might want to give a try with the hash index if you are planning
>> to use PG10 and your queries involve equality operations.
>
> But you can't replace the PK index with a hash index, because hash indexes
> don't support uniqueness.

That's true, but it hasn't been mentioned in the mail that the usage of the hash index is for the primary key.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled
On Fri, Oct 27, 2017 at 5:36 PM, Alvaro Herrera wrote:
> Amit Kapila wrote:
>
>> You might want to give a try with the hash index if you are planning
>> to use PG10 and your queries involve equality operations.
>
> So, btree indexes on monotonically increasing sequences don't write tons
> of full page writes because typically the same page is touched many
> times by insertions on each checkpoint cycle; so only one or very few
> full page writes are generated for a limited number of index pages.
>
> With UUID you lose locality of access: each insert goes to a different
> btree page, so you generate tons of full page writes because the number
> of modified index pages is very large.
>
> With hash on monotonically increasing keys, my guess is that you get
> behavior similar to btrees on UUID: the inserts are all over the place
> in the index, so tons of full page writes. Am I wrong?
>
> With hash on UUID, the same thing should happen. Am I wrong?

If the bucket pages were decided merely based on the hash key, then what you are saying would be right. However, we mask the hash key with the high|low mask, due to which it falls in one of the existing pages of the hash index. Also, I have suggested this based on some of the tests we have done on a UUID column, and the result was that most of the time the hash index size was smaller than the btree size. See pages 15-17 of the hash index presentation from the last PGCon [1].

[1] - https://www.pgcon.org/2017/schedule/attachments/458_durable-hash-indexes-postgresql.pdf

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
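The high|low masking being referred to can be sketched like this (a simplified Python illustration modeled on the bucket-mapping logic in PostgreSQL's hash AM; the function name and the "simplification to two masks" are for exposition only):

```python
def hashkey_to_bucket(hashkey, maxbucket, highmask, lowmask):
    """Map a hash value to a bucket number, simplified sketch of how a
    hash index confines keys to the buckets that currently exist."""
    bucket = hashkey & highmask
    if bucket > maxbucket:
        # The index hasn't split this far yet: fall back to the
        # smaller mask so we land on an existing bucket.
        bucket = bucket & lowmask
    return bucket

# With 6 buckets (0..5): highmask = 7, lowmask = 3. Every hash value,
# monotonically increasing or random, lands on an existing bucket.
buckets = {hashkey_to_bucket(h, maxbucket=5, highmask=7, lowmask=3)
           for h in range(64)}
```

The point for the full-page-write question is that the mask, not the raw key, decides the page: keys are spread over the existing buckets regardless of whether the input is monotonic or random.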
Re: [HACKERS] WIP: BRIN bloom indexes
On Fri, Oct 27, 2017 at 2:55 PM, Alvaro Herrera wrote:
> I was rather thinking that if we can make this very robust against the
> index growing out of proportion, we should consider ditching the
> original minmax and replace it with multirange minmax, which seems like
> it'd have much better behavior.

If the multirange stuff can be done in such a way that it's just an updated version of the same opclass, and backward-compatible on disk, then I think this would be OK. But otherwise I don't think we should ditch what already exists. That would break upgrades via both pg_upgrade and pg_dump, which seems like too high a price to pay to get rid of some arguably-worse code. It's actually WORSE to drop an opclass (which will make dumps not restore) than to do something like bump HASH_VERSION (which doesn't affect pg_dump at all and for pg_upgrade only requires post-upgrade steps rather than failing outright).

> I don't see any reason to put any of this in contrib.

Well, for one thing, it makes it easier to drop stuff later if we decide we don't really want it. I think that whichever BRIN opclasses are thought to be high quality and of general utility can go into core, just as we've done with other index AMs. However, if we think that something is useful-ish but maybe not something to which we want to make a permanent commitment, putting it into contrib is good for that. Upgrades are easier for things in contrib, too, because there's a built-in mechanism for people to try updating the SQL extensions (ALTER EXTENSION .. UPDATE) and if it fails they can adjust things and try again. When you just make a hard change to SQL definitions in a new release, any failures that result from those changes just turn into upgrade failures, which is IMHO a lot more painful than a failure to update an extension version while the database is still up and usable the whole time.
For instance, if pg_stat_activity were bundled in an extension and we made the C code backward-compatible with old extension versions, then some of the upgrade pain users have had with that over the years could have been avoided. People could update to the new version without failures and then at their leisure try to update. If the update failed due to dependencies, then they would have time to figure out what to do about it and try again later; in the meantime, they'd be on the new version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Re: protocol version negotiation (Re: Libpq PGRES_COPY_BOTH - version compatibility)
On Thu, Oct 19, 2017 at 1:35 AM, Badrul Chowdhury wrote:
> The new functionality is for sending 64bit ints. I think 32bits is sufficient
> for the information we want to pass around in the protocol negotiation phase,
> so I left this part unchanged.

No, it isn't. That commit didn't add any new functionality, but it changed the preferred interfaces for assembling protocol messages. Your patch was still using the old ones.

Attached is an updated patch. I've made a few modifications:

- I wrote documentation. For future reference, that's really the job of the patch submitter.

- I changed the message type byte for the new message from 'Y' to 'v'. 'Y' is apparently used by logical replication as a "type message", but 'v' is not currently used for anything. It's also somewhat mnemonic.

- I deleted the minimum protocol version from the new message. I know there were a few votes for including it, but I think it's probably useless. The client should always request the newest version it can support; if that's not new enough for the server, then we're dead anyway and we might as well just handle that via ERROR. Moreover, it seems questionable whether we'd ever deprecate 3.0 support in the server anyway, or if we do, it'll probably be because 4.0 has been stable for a decade or so. Desupporting 3.0 while continuing to support 3.x,x>0 seems like a remote outcome (and, like I say, if it does happen, ERROR is a good-enough response). If there's some use case for having a client request an older protocol version than the newest one it can support, then this change could be revisited, or we can just handle that by retrying the whole connection attempt.

- I changed the test for whether to send NegotiateProtocolVersion to send it only when the client requests a version too new for the server. I think that if the client requests a version older than what the server could support, the server should just silently use the older version. That's arguable.
You could argue that the message should be sent anyway (at least to post-3.0 clients) as a way of reporting what happened with _pq_. parameters, but I have two counter-arguments. Number one, maybe clients just shouldn't send _pq_. parameters that the protocol version they're using doesn't support, or if they do, be prepared for them to have no effect. Number two, this is exactly the sort of thing that we can change in future minor protocol versions if we wish. For example, we could define protocol version 3.1 or 3.43 or whatever as always sending a NegotiateProtocolVersion message. There's no need for the code to make 3.0 compatible with future versions to decide what choices those future versions might make.

- Made the prefix matching check for "_pq_." rather than "_pq_". I think we're imagining extensions like _pq_.magic_fairy_dust, not _pq_magic_fairy_dust.

- I got rid of the optional_parameters thing in Port and just passed a list of unrecognized parameters around directly. Once some parameters are recognized, e.g. _pq_.magic_fairy_dust = 1, we'll probably have dedicated fields in the Port for them (i.e. int magic_fairy_dust) rather than digging them out of some list. Moreover, with your design, we'd end up having to make NegotiateServerProtocol exclude things that are actually recognized, which would be annoying.

- I got rid of the pq_flush(). There's no need for this because we're always going to send another protocol message (ErrorResponse or AuthenticationSomething) afterwards.

- I renamed NegotiateServerProtocol to SendNegotiateProtocolVersion and moved it to what I think is a more sensible place in the file.

- I removed the error check for 2.x, x != 0. I may have advocated for this before, but on reflection, I don't see much advantage in rejecting things that work today.

- I fixed the above-mentioned failure to use the new interface for assembling the NegotiateProtocolVersion message.

- I did some work on the comments.
Barring objections, I plan to commit this and back-patch it all the way. Of course, this is not a bug fix, but Tom previously proposed back-patching it and I think that's a good idea, because if we don't, it will be a very long time before servers with this code become commonplace in the wild. Back-patching doesn't completely fix that problem because not everybody applies upgrades and some people may be running EOL versions, but it will still help. Also attached is a patch I used for testing purposes, some version of which we might eventually use when we actually introduce version 3.1 of the protocol. It bumps the protocol version that libpq uses from 3.0 to 3.1 without changing what the server thinks the latest protocol version is, so the server always replies with a NegotiateProtocolVersion message. It also teaches libpq to ignore the NegotiateProtocolVersion message. With that patch applied, make check-world passes, which seems to show that the server-side changes are not totally broken.
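The server-side rule described above - reply with NegotiateProtocolVersion only when the client asks for a newer minor version than the server supports, and silently proceed otherwise - can be sketched like this (a Python illustration of the decision logic only, not the actual C code; legacy 2.x handling is deliberately omitted):

```python
SERVER_MAX = (3, 0)   # newest minor protocol version this server speaks

def negotiate(client_version, server_max=SERVER_MAX):
    """Decide the server's response to a StartupMessage protocol version.

    Returns a (response, version) pair; the response names are
    illustrative labels, not wire-format message names.
    """
    client_major, client_minor = client_version
    server_major, server_minor = server_max
    if client_major != server_major:
        # A different major version (e.g. a hypothetical 4.x) is a
        # hard failure, handled via ERROR.
        return ("ERROR", None)
    if client_minor > server_minor:
        # Client is too new: tell it the newest minor version we can do.
        return ("NegotiateProtocolVersion", server_max)
    # Client requested our version or older: silently use its version.
    return ("proceed", client_version)
```

So a future 3.1 libpq talking to this server would get a NegotiateProtocolVersion naming 3.0, while today's 3.0 clients see no change at all.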
Re: [HACKERS] WIP: BRIN bloom indexes
Tomas Vondra wrote:
> Not sure "a number of in-core opclasses" is a good reason to (not) add
> new ones. Also, we already have two built-in BRIN opclasses (minmax and
> inclusion).
>
> In general, "BRIN bloom" can be packed as a contrib module (at least I
> believe so). That's not the case for the "BRIN multi-range" which also
> requires some changes to some code in brin.c (but the rest can be moved
> to a contrib module, of course).

I was rather thinking that if we can make this very robust against the index growing out of proportion, we should consider ditching the original minmax and replacing it with multirange minmax, which seems like it'd have much better behavior. I don't see any reason to put any of this in contrib.

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] parallel.c oblivion of worker-startup failures
On Tue, Sep 19, 2017 at 8:47 AM, Amit Kapila wrote:
> On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane wrote:
>> Amit Kapila writes:
>>> Attached patch fixes these problems.
>>
>> Hmm, this patch adds a kill(notify_pid) after one call to
>> ForgetBackgroundWorker, but the postmaster has several more such calls.
>> Shouldn't they all notify the notify_pid? Should we move that
>> functionality into ForgetBackgroundWorker itself, so we can't forget
>> it again?
>
> Among other places, we already notify during
> ReportBackgroundWorkerExit(). However, it seems to me that all other
> places except where this patch has added a call to notify don't need
> such a call. The other places like in DetermineSleepTime and
> ResetBackgroundWorkerCrashTimes are called for a crashed worker, which I
> don't think requires notification to the backend as that backend
> itself would have restarted. The other place where we call
> ForgetBackgroundWorker is in maybe_start_bgworkers when rw_terminate
> is set to true, which again seems to be either the case of a worker crash
> or when someone has explicitly asked to terminate the worker, in which
> case we already send a notification.
>
> I think we need to notify the backend on start, stop and failure to
> start a worker. OTOH, if it is harmless to send a notification even
> after the worker has crashed, then we can just move that functionality
> into ForgetBackgroundWorker itself as that will simplify the code, and
> in fact that is the first thing that occurred to me as well, but I
> haven't done it that way as I was not sure if we want to send a
> notification in all kinds of cases.

The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track.

[1] - https://commitfest.postgresql.org/15/1341/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] MERGE SQL Statement for PG11
Simon Riggs wrote:
> Earlier thoughts on how this could/could not be done were sometimes
> imprecise or inaccurate, so I have gone through the command per
> SQL:2011 spec and produced a definitive spec in the form of an SGML
> ref page. This is what I intend to deliver for PG11.

Nice work. I didn't verify the SQL spec, just read your HTML page; some very minor comments based on that:

* Use "and", not "where", as the initial words in the "when_clause" and "merge_update" clause definitions.

* There is a missing word here: "the DELETE privilege on the if you specify".

* I think the word "match." is leftover from some editing in the phrase "that specifies which rows in the data_source match rows in the target_table_name. match." In the same paragraph, it is not clear whether all columns must be matched or it can be a partial match.

* In the when_clause note, it is not clear whether you can have multiple WHEN MATCHED and WHEN NOT MATCHED clauses. Obviously you can have one of each, but I think your doc says it is possible to have more than one of each, with different conditions (WHEN MATCHED AND foo THEN bar WHEN MATCHED AND baz THEN qux). No example shows more than one. On the same point: is there short-circuiting of such conditions, i.e. will the execution stop looking for further WHEN matches once some rule matches, or will it rather check all rules and raise an error if more than one WHEN rule matches a given row?

* Your last example uses ELSE but that appears nowhere in the synopsis.

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
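To make the short-circuiting question concrete, here is a sketch of one possible semantics - evaluate WHEN clauses in order, first fit wins (this is only one of the two interpretations being asked about, illustrated in Python; the function and clause encoding are invented for the example):

```python
def pick_when_clause(row, matched, clauses):
    """First-match-wins evaluation of WHEN [NOT] MATCHED [AND cond]
    clauses. clauses is a list of (kind, condition-or-None, action);
    whether MERGE should short-circuit like this, or instead error out
    when several clauses match one row, is exactly the open question."""
    wanted = "MATCHED" if matched else "NOT MATCHED"
    for kind, cond, action in clauses:
        if kind == wanted and (cond is None or cond(row)):
            return action
    return None   # no clause applies: the row is left alone

clauses = [
    ("MATCHED", lambda r: r["qty"] == 0, "DELETE"),
    ("MATCHED", None, "UPDATE"),       # catch-all for matched rows
    ("NOT MATCHED", None, "INSERT"),
]
```

Under this reading a matched row with qty = 0 satisfies both MATCHED clauses, but only the first (DELETE) fires; the alternative semantics would raise an error for that row instead.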
[HACKERS] Linking libpq statically to libssl
Hello, I have a problem building binary packages for psycopg2. Binary packages ship with their own copies of libpq and libssl; however, if another Python package links to libssl, the library will be imported twice with conflicting symbols, likely resulting in a segfault (see https://github.com/psycopg/psycopg2/issues/543). This happens e.g. if a Python script both connects to Postgres and opens an https resource.

A solution to work around the problem could be to change the libssl symbol names, in the binaries or in the source code (https://github.com/psycopg/psycopg2-wheels/issues/7). But maybe a simpler workaround would be just to build libpq linking to libssl statically. Is this possible?

...and other libs too, I guess, because the problem was found with libssl but conflicts with other libs are possible too; ldap comes to mind...

Any help to solve the problem would be appreciated. Thank you very much.

-- 
Daniele
[HACKERS] Re: Burst in WAL size when UUID is used as PK while full_page_writes are enabled
akapila wrote:
> You might want to give a try with the hash index if you are planning
> to use PG10 and your queries involve equality operations.

But you can't replace the PK index with a hash index, because hash indexes don't support uniqueness.

-- 
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled
Amit Kapila wrote:
> You might want to give a try with the hash index if you are planning
> to use PG10 and your queries involve equality operations.

So, btree indexes on monotonically increasing sequences don't write tons of full page writes because typically the same page is touched many times by insertions in each checkpoint cycle; so only one or very few full page writes are generated for a limited number of index pages.

With UUID you lose locality of access: each insert goes to a different btree page, so you generate tons of full page writes because the number of modified index pages is very large.

With hash on monotonically increasing keys, my guess is that you get behavior similar to btrees on UUID: the inserts are all over the place in the index, so tons of full page writes. Am I wrong?

With hash on UUID, the same thing should happen. Am I wrong?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
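Álvaro's locality argument can be checked with a back-of-the-envelope simulation. This is only a rough sketch: the page counts, keys-per-page figure, and the one-FPI-per-touched-page assumption are illustrative simplifications, not PostgreSQL internals.

```python
import random

def pages_touched(page_of, n_inserts):
    """Distinct index leaf pages touched in one checkpoint cycle; to a
    first approximation, each distinct page costs one full-page image."""
    return len({page_of(i) for i in range(n_inserts)})

N_INSERTS = 100_000
N_LEAF_PAGES = 50_000     # hypothetical index size, in leaf pages
KEYS_PER_PAGE = 100       # hypothetical: ~100 sequential keys per leaf page

# Sequential keys (BIGSERIAL-like): consecutive inserts keep hitting the
# current rightmost leaf page until it fills, so few pages are touched.
seq_fpis = pages_touched(lambda i: i // KEYS_PER_PAGE, N_INSERTS)

# Random keys (UUID-like): every insert lands on an effectively random
# leaf page, so most of the index gets touched and written out as FPIs.
rng = random.Random(42)
rand_fpis = pages_touched(lambda i: rng.randrange(N_LEAF_PAGES), N_INSERTS)

print(seq_fpis)    # 1000
print(rand_fpis)   # roughly 43000, i.e. about 50000 * (1 - e^-2)
```

The random case touches almost the whole index per cycle, which is the "tons of full page writes" effect described above.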
Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled
On Fri, Oct 27, 2017 at 11:26 AM, sanyam jain wrote:
> Hi,
>
> I was reading the blog
> https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes .
>
> My queries:
>
> How is the randomness of UUID likely to create new leaf pages in the btree index?
> In my understanding, as the size of UUID is 128 bits, i.e. twice that of BIGSERIAL,
> more pages will be required to store the same number of rows, and hence there
> can be an increase in WAL size due to FPW.
> When comparing the index sizes in a local setup, the UUID index is ~2x greater in size.

You might want to give a try with the hash index if you are planning to use PG10 and your queries involve equality operations.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [GENERAL] Postgres 10 manual breaks links with anchors
On 10/26/17 16:10, Tom Lane wrote:
> Peter Eisentraut writes:
>> On 10/16/17 03:19, Thomas Kellerer wrote:
>>> I don't know if this is intentional, but the Postgres 10 manual started to
>>> use lowercase IDs as anchors in the manual.
>
>> Here is a patch that can be applied to PG 10 to put the upper case
>> anchors back.
>> The question perhaps is whether we want to maintain this patch
>> indefinitely, or whether a clean break is better.
>
> In view of commit 1ff01b390, aren't we more or less locked into
> lower-case anchors going forward? I'm not sure I see the point
> of changing v10 back to the old way if v11 will be incompatible
> anyhow.

The details are more complicated. The IDs in DocBook documents have two purposes.

One is to ensure non-broken links between cross-referencing elements and their link targets. This is set up in the DTD and checked during parsing (validation, more precisely). In DocBook SGML, many things including tag names, attribute names, and IDs are case insensitive. But in DocBook XML, everything is case sensitive. So in order to make things compatible for a conversion, we had to consolidate some variant spellings that had accumulated in our sources. For simplicity, I have converted everything to lower case.

The other purpose is that the DocBook XSL and DSSSL stylesheets use the IDs for creating anchors in HTML documents (and also for the HTML file names themselves). This is merely a useful choice of those stylesheets.

In PG 9.6 and earlier, we used a straight SGML toolchain, using Jade and DSSSL. The internal representation of a DocBook SGML document after parsing converts all the case-insensitive bits to upper case. (This might be configurable somewhere; I'm not sure.) So the stylesheets see all the IDs as upper case to begin with, and that's why all the anchors come out in upper case in the HTML output.

In PG 10, the build first converts the SGML sources to XML, redeclares them as DocBook XML, then builds using XSLT.
Because DocBook XML requires lower-case tags and attribute names, we have to use the osx -x lower option to convert all the case-insensitive bits to lower case instead of the default upper case. That's why the XSLT stylesheets see the IDs as lower case and that's why they are like that in the output. (If there were options more fine-grained than -x lower, that could have been useful.)

The proposed patch works much later in the build process and converts IDs to upper case only when they are being considered for making an HTML anchor. The structure of the document as far as the XML parser is concerned stays the same.

For PG 11, the idea is to convert the sources to a pure XML document. XML is case sensitive, so the XML parser would see the IDs as what they are. Without the mentioned patch to convert all IDs to lower case in the source, the XSL processor would see the IDs in whatever case they are, and anchors would end up in the HTML output using whatever case they are. So the conversion to lower case in the source also ensured anchor compatibility with PG 10. Otherwise, someone might well have complained in a similar manner a year from now.

Applying the proposed patch to master/PG 11 would have the same effect as in PG 10. It would convert anchors to upper case in the HTML output but leave the logical and physical structure of the XML document alone.

So the options are simply:

1) Use the patch and keep it indefinitely, keeping anchors compatible back to forever and forward indefinitely.

2) Don't use the patch, breaking anchors from <=9.6, but keeping them compatible going forward.

Considering how small the patch is compared to some other customizations we carry, #1 seems reasonable to me. I just didn't know to what extent people had actually bookmarked fragment links.
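The toolchain difference Peter describes can be modeled roughly like this. This is a sketch only; the ID "sql-createtable" is an illustrative example, not taken from the actual patch.

```python
# DocBook SGML toolchain (Jade/DSSSL, PG <= 9.6): IDs are case
# insensitive and the parsed representation upper-cases them, so HTML
# anchors come out upper case no matter how the source spells the ID.
def sgml_anchor(source_id):
    return source_id.upper()

# DocBook XML toolchain (XSLT, PG 10): IDs are case sensitive and pass
# through unchanged; after osx -x lower they are all lower case.
def xml_anchor(source_id):
    return source_id

src_id = "sql-createtable"            # illustrative ID, post lower-casing

old_anchor = sgml_anchor(src_id)      # PG <= 9.6 style: SQL-CREATETABLE
new_anchor = xml_anchor(src_id)       # PG 10 style: sql-createtable

# The proposed patch upper-cases only at anchor-creation time, restoring
# the old anchors while leaving the XML document structure untouched.
patched_anchor = new_anchor.upper()
```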
-- 
Peter Eisentraut    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] inconsistency in process names - bgworker: logical replication launcher
2017-10-27 13:03 GMT+02:00 Peter Eisentraut <peter.eisentr...@2ndquadrant.com>:
> On 10/27/17 04:06, Pavel Stehule wrote:
> > Why buildin process has prefix bgworker?
>
> Implementation detail. This has been changed in master already.

ok, thank you

Pavel

> --
> Peter Eisentraut http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Add support for tuple routing to foreign partitions
On 2017/10/26 16:40, Etsuro Fujita wrote: Other changes I made to the executor are: (1) currently, we set the RT index for the root partitioned table to ri_RangeTableIndex of partitions' ResultRelInfos, but the proposed EXPLAIN requires that the partition's ri_RangeTableIndex is set to the RT index for that partition's RTE, to show that partition info in the output. So, I made that change. One thing I forgot to mention is: that would be also required to call BeginForeignModify, ExecForeignInsert, and EndForeignModify with the partition's ResultRelInfo. I updated docs in doc/src/sgml/ddl.sgml the same way as [1]. (I used only the ddl.sgml change proposed by [1], not all the changes.) I did some cleanup as well. Please find attached an updated version of the patch. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/b19a8e2b-e000-f592-3e0b-3e90ba0fa816%40lab.ntt.co.jp *** a/contrib/file_fdw/input/file_fdw.source --- b/contrib/file_fdw/input/file_fdw.source *** *** 176,182 COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ','); --- 176,188 SELECT tableoid::regclass, * FROM pt; SELECT tableoid::regclass, * FROM p1; SELECT tableoid::regclass, * FROM p2; + \t on + EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy'); + \t off INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR + \t on + EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy'); + \t off INSERT INTO pt VALUES (2, 'xyzzy'); SELECT tableoid::regclass, * FROM pt; SELECT tableoid::regclass, * FROM p1; *** a/contrib/file_fdw/output/file_fdw.source --- b/contrib/file_fdw/output/file_fdw.source *** *** 315,321 SELECT tableoid::regclass, * FROM p2; (0 rows) COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter ','); -- ERROR ! 
ERROR: cannot route inserted tuples to a foreign table CONTEXT: COPY pt, line 2: "1,qux" COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ','); SELECT tableoid::regclass, * FROM pt; --- 315,321 (0 rows) COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter ','); -- ERROR ! ERROR: cannot route copied tuples to a foreign table CONTEXT: COPY pt, line 2: "1,qux" COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ','); SELECT tableoid::regclass, * FROM pt; *** *** 341,348 SELECT tableoid::regclass, * FROM p2; p2 | 2 | qux (2 rows) INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR ! ERROR: cannot route inserted tuples to a foreign table INSERT INTO pt VALUES (2, 'xyzzy'); SELECT tableoid::regclass, * FROM pt; tableoid | a | b --- 341,366 p2 | 2 | qux (2 rows) + \t on + EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy'); + Insert on public.pt +Foreign Insert on public.p1 +Insert on public.p2 +-> Result + Output: 1, 'xyzzy'::text + + \t off INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR ! ERROR: cannot route inserted tuples to foreign table "p1" ! \t on ! EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy'); ! Insert on public.pt !Foreign Insert on public.p1 !Insert on public.p2 !-> Result ! Output: 2, 'xyzzy'::text ! ! 
\t off INSERT INTO pt VALUES (2, 'xyzzy'); SELECT tableoid::regclass, * FROM pt; tableoid | a | b *** a/contrib/postgres_fdw/expected/postgres_fdw.out --- b/contrib/postgres_fdw/expected/postgres_fdw.out *** *** 7029,7034 NOTICE: drop cascades to foreign table bar2 --- 7029,7236 drop table loct1; drop table loct2; -- === + -- test tuple routing for foreign-table partitions + -- === + create table pt (a int, b int) partition by list (a); + create table locp partition of pt for values in (1); + create table locfoo (a int check (a in (2)), b int); + create foreign table remp partition of pt for values in (2) server loopback options (table_name 'locfoo'); + explain (verbose, costs off) + insert into pt values (1, 1), (2, 1); +QUERY PLAN + - + Insert on public.pt +Insert on public.locp +Foreign Insert on public.remp + Remote SQL: INSERT INTO public.locfoo(a, b) VALUES ($1, $2) +-> Values Scan on "*VALUES*" + Output: "*VALUES*".column1, "*VALUES*".column2 + (6 rows) + + insert into pt values (1, 1), (2, 1); + select tableoid::regclass, * FROM pt; + tableoid | a | b + --+---+--- + locp | 1 | 1 + remp | 2 | 1 + (2 rows) + + select tableoid::regclass, * FROM locp; + tableoid | a | b + --+---+--- + locp | 1 | 1 + (1 row) + + select tableoid::regclass, * FROM remp; + tableoid | a | b +
Re: [HACKERS] inconsistency in process names - bgworker: logical replication launcher
On 10/27/17 04:06, Pavel Stehule wrote:
> Why buildin process has prefix bgworker?

Implementation detail. This has been changed in master already.

-- 
Peter Eisentraut    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Partition-wise aggregation/grouping
Hi,

Attached new patch-set here. Changes include:

1. Added a separate patch for costing the Append node, as discussed up-front in the patch-set.
2. Since we now cost the Append node, we don't need the partition_wise_agg_cost_factor GUC, so removed that. The remaining patch is hence merged into the main implementation patch.
3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks

On Wed, Oct 18, 2017 at 9:53 AM, Dilip Kumar wrote:
> On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke wrote:
>>
>> I didn't get what you mean by regression here. Can you please explain?
>>
>> I see that the PWA plan is selected over the regular plan when enabled, on
>> the basis of costing.
>> Regular planning needs a Result node, due to which costing increases, whereas
>> PWA doesn't need that and thus wins.
>
> Sorry for not explaining clearly; I meant that with a normal plan
> execution time is 263.678 ms whereas with PWA it's 339.929 ms.
>
> I only set enable_partition_wise_agg=on and it switched to PWA and
> execution time increased by 30%.
> I understand that this is the worst case for PWA, where
> FinalizeAggregate is getting all the tuples.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

partition-wise-agg-v6.tar.gz
Description: GNU Zip compressed data
Re: [HACKERS] Parallel safety for extern params
On Mon, Oct 16, 2017 at 4:29 PM, Amit Kapila wrote:
> On Sat, Oct 14, 2017 at 1:51 AM, Robert Haas wrote:
>
>> I think the bug is in ExecGather(Merge): it assumes that if we're in
>> parallel mode, it's OK to start workers. But actually, it shouldn't
>> do this unless ExecutePlan ended up with use_parallel_mode == true,
>> which isn't quite the same thing. I think we might need ExecutePlan
>> to set a flag in the estate that ExecGather(Merge) can check.
>
> Attached patch fixes the problem in the suggested way.

All the patches still apply and pass the regression tests. I have added this to the CF to avoid losing track of this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] path toward faster partition pruning
On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote wrote:
> 0001: added some new tests
> 0002: no change
> 0003: fixed issue that Rajkumar reported (cope with Params properly)
> 0004: no change
> 0005: fix the case to prune the default partition when warranted (the
> issue reported by Beena)

Thanks for the updated patch. I am getting a server crash with the below query.

CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY RANGE(c2);
CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY RANGE(c2);
CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);

INSERT INTO mp VALUES(10, 100, 10);
INSERT INTO mp VALUES(20, 200, 20);
INSERT INTO mp VALUES(21, 150, 30);
INSERT INTO mp VALUES(30, 200, 40);
INSERT INTO mp VALUES(31, 300, 30);
INSERT INTO mp VALUES(40, 400, 40);

EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND c2 < 300;
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Re: [HACKERS] MERGE SQL Statement for PG11
Hey,

It looks quite nice. Personally I'd also like to have the RETURNING clause, and to have the number of deleted and inserted rows as separate numbers in the output message.

regards
Szymon Lipiński

Fri, 27.10.2017, 10:56, Simon Riggs wrote:
> I'm working on re-submitting MERGE for PG11
>
> Earlier thoughts on how this could/could not be done were sometimes
> imprecise or inaccurate, so I have gone through the command per
> SQL:2011 spec and produced a definitive spec in the form of an SGML
> ref page. This is what I intend to deliver for PG11.
>
> MERGE will use the same mechanisms as INSERT ON CONFLICT, so
> concurrent behavior does not require further infrastructure changes,
> just detailed work on the statement itself.
>
> I'm building up the code from scratch based upon the spec, rather than
> trying to thwack earlier attempts into shape. This looks more likely
> to yield a committable patch.
>
> Major spanners or objections, please throw them in now cos I don't see any.
>
> Questions?
>
> --
> Simon Riggs    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: BRIN bloom indexes
hi,

On 10/27/2017 09:34 AM, Simon Riggs wrote:
> On 27 October 2017 at 07:20, Robert Haas wrote:
>> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra wrote:
>>> Let's see a query like this:
>>>
>>> select * from bloom_test
>>> where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>>
>>> The minmax index produces this plan
>>>
>>>    Heap Blocks: lossy=2061856
>>>    Execution time: 22707.891 ms
>>>
>>> Now, the bloom index:
>>>
>>>    Heap Blocks: lossy=25984
>>>    Execution time: 338.946 ms
>>
>> It's neat to see BRIN being extended like this. Possibly we could
>> consider making it a contrib module rather than including it in core,
>> although I don't have strong feelings about it.
>
> I see that SP-GIST includes two operator classes in core, one default.
>
> Makes sense to do the same thing with BRIN and add this new op class
> as a non-default option in core.

Not sure "a number of in-core opclasses" is a good reason to (not) add new ones. Also, we already have two built-in BRIN opclasses (minmax and inclusion).

In general, "BRIN bloom" can be packaged as a contrib module (at least I believe so). That's not the case for "BRIN multi-range", which also requires changes to some code in brin.c (but the rest can be moved to a contrib module, of course).

regards

-- 
Tomas Vondra    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
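For context, the sizing trade-off discussed upthread (a Bloom filter should end up roughly 50% full, and overfilling destroys its usefulness) follows from the standard Bloom-filter formulas. A quick sketch — the 1 kB filter and ~700 distinct values per range are made-up example numbers, not the patch's actual defaults:

```python
import math

def bloom_fpr(m_bits, n_values, k_hashes):
    """Expected false-positive rate of a Bloom filter with m bits,
    n distinct inserted values, and k hash functions."""
    fill = 1.0 - math.exp(-k_hashes * n_values / m_bits)  # fraction of bits set
    return fill ** k_hashes

m, n = 8 * 1024, 700                      # e.g. 1 kB filter, ~700 distinct values per range
k = max(1, round(m / n * math.log(2)))    # optimal hash count for this m/n

fill = 1.0 - math.exp(-k * n / m)         # with optimal k this lands near 50%
print(k, round(fill, 2), bloom_fpr(m, n, k))

# Overfilling the same filter (4x the distinct values) wrecks the
# false-positive rate, which is why per-range sizing matters.
print(bloom_fpr(m, 4 * n, k))
```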
Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled
On 10/27/2017 07:56 AM, sanyam jain wrote:
> Hi,
>
> I was reading the blog
> https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes .

For the record, I assume you're referring to this part:

    With BIGSERIAL new values are sequential, and so get inserted to the
    same leaf pages in the btree index. As only the first modification to
    a page triggers the full-page write, only a tiny fraction of the WAL
    records are FPIs. With UUID it's a completely different case, of
    course – the values are not sequential at all; in fact each insert is
    likely to touch a completely new index leaf page (assuming the index
    is large enough).

> My queries:
>
> How randomness of UUID will likely to create new leaf page in btree index?
> In my understanding as the size of UUID is 128 bits, i.e. twice that of
> BIGSERIAL, more pages will be required to store the same number of rows
> and hence there can be an increase in WAL size due to FPW.
> When compared the index size in local setup UUID index is ~2x greater in
> size.

Perhaps this is just a poor choice of words on my side, but I wasn't suggesting new leaf pages will be created, merely that the inserts will touch a different (possibly existing) leaf page each time. That's a direct consequence of the inherent UUID randomness.

regards

-- 
Tomas Vondra    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] MERGE SQL Statement for PG11
I'm working on re-submitting MERGE for PG11

Earlier thoughts on how this could/could not be done were sometimes imprecise or inaccurate, so I have gone through the command per the SQL:2011 spec and produced a definitive spec in the form of an SGML ref page. This is what I intend to deliver for PG11.

MERGE will use the same mechanisms as INSERT ON CONFLICT, so concurrent behavior does not require further infrastructure changes, just detailed work on the statement itself.

I'm building up the code from scratch based upon the spec, rather than trying to thwack earlier attempts into shape. This looks more likely to yield a committable patch.

Major spanners or objections, please throw them in now cos I don't see any.

Questions?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Title: MERGE

MERGE — insert, update, or delete rows of a table based upon source data

Synopsis

MERGE INTO target_table_name [ [ AS ] target_alias ]
USING data_source
ON join_condition
when_clause [...]

where data_source is

{ source_table_name | ( source_query ) } [ [ AS ] source_alias ]

where when_clause is

{ WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete } |
  WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING } }

where merge_update is

UPDATE SET { column_name = { expression | DEFAULT } |
             ( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...]

and merge_insert is

INSERT [( column_name [, ...] )]
[ OVERRIDING { SYSTEM | USER } VALUE ]
{ VALUES ( { expression | DEFAULT } [, ...] ) | DEFAULT VALUES }

and merge_delete is

DELETE

Description

MERGE performs actions that modify rows in the target_table_name, using the data_source. MERGE provides a single SQL statement that can conditionally INSERT, UPDATE or DELETE rows, a task that would otherwise require multiple procedural language statements.
First, the MERGE command performs a left outer join from data_source to target_table_name producing zero or more candidate change rows. For each candidate change row, WHEN clauses are evaluated in the order specified; if one of them is activated, the specified action occurs. No more than one WHEN clause can be activated for any candidate change row.

MERGE actions have the same effect as regular UPDATE, INSERT, or DELETE commands of the same names. The syntax of those commands is different, notably in that there is no WHERE clause and no table name is specified. All actions refer to the target_table_name, though modifications to other tables may be made using triggers.

The ON clause must join on all of the column(s) of the primary key, or, if other columns are specified, of another unique index on the target_table_name. Each candidate change row is locked via speculative insertion into the chosen unique index, ensuring that the status of MATCHED or NOT MATCHED is decided just once for each candidate change row, preventing interference from concurrent transactions. As a result of these requirements, MERGE cannot yet work against partitioned tables.

There is no MERGE privilege. You must have the UPDATE privilege on the column(s) of the target_table_name referred to in the SET clause if you specify an update action, the INSERT privilege on the target_table_name if you specify an insert action and/or the DELETE privilege on the if you specify a delete action on the target_table_name. Privileges are tested once at statement start and are checked whether or not particular WHEN clauses are activated during the subsequent execution. You will require the SELECT privilege on the data_source and any column(s) of the target_table_name referred to in a condition.

Parameters

target_table_name

The name (optionally schema-qualified) of the target table to merge into.

target_alias

A substitute name for the target table.
When an alias is provided, it completely hides the actual name of the table. For example, given MERGE foo AS f, the remainder of the MERGE statement must refer to this table as f not foo.

source_table_name

The name (optionally schema-qualified) of the source table, view or transition table.

source_query

A query (SELECT statement or VALUES statement) that supplies the rows to be merged into the target_table_name. Refer to the SELECT statement or VALUES statement for a description of the syntax.

source_alias

A substitute name for the data source. When an alias is provided, it completely hides whether a table or a query was specified.

join_condition

join_condition is an expression resulting in a value of type boolean (similar to a WHERE clause) that specifies which rows
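The candidate-change-row semantics the ref page describes (left outer join from source to target, WHEN clauses evaluated in order, at most one activated per row) can be sketched as a toy model. This is a hypothetical, greatly simplified illustration — no locking, triggers, or concurrency — not the proposed implementation:

```python
def merge(target, source, join_key, when_clauses):
    """Toy model of MERGE: join each source row to a target row on
    join_key, then evaluate WHEN clauses in order; the first clause whose
    MATCHED status and condition hold fires, and no other clause does."""
    by_key = {row[join_key]: row for row in target}
    for src in source:
        tgt = by_key.get(src[join_key])           # candidate change row
        for want_matched, cond, action in when_clauses:
            if want_matched != (tgt is not None):
                continue                          # WHEN (NOT) MATCHED mismatch
            if cond is not None and not cond(tgt, src):
                continue                          # AND condition not satisfied
            action(target, by_key, tgt, src)
            break                                 # at most one WHEN clause fires

def do_delete(target, by_key, tgt, src):
    target.remove(tgt); del by_key[tgt["id"]]

def do_update(target, by_key, tgt, src):
    tgt["qty"] = src["qty"]

def do_insert(target, by_key, tgt, src):
    row = dict(src); target.append(row); by_key[row["id"]] = row

stock = [{"id": 1, "qty": 10}, {"id": 2, "qty": 5}]
delta = [{"id": 1, "qty": 0}, {"id": 2, "qty": 7}, {"id": 3, "qty": 4}]

merge(stock, delta, "id", [
    (True,  lambda t, s: s["qty"] == 0, do_delete),  # WHEN MATCHED AND s.qty = 0 THEN DELETE
    (True,  None,                       do_update),  # WHEN MATCHED THEN UPDATE
    (False, None,                       do_insert),  # WHEN NOT MATCHED THEN INSERT
])

print(stock)   # [{'id': 2, 'qty': 7}, {'id': 3, 'qty': 4}]
```

Note how ordering the clauses answers Álvaro's short-circuiting question in one possible way: the first matching WHEN clause wins and later ones are never checked for that row.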
[HACKERS] inconsistency in process names - bgworker: logical replication launcher
Hi,

Why does a built-in process have the "bgworker" prefix?

 1907 ?        Ss    13:00 postgres: ides ides_immaj_prac 192.168.1.50(3524) idle
 1941 ?        Ss     0:05 postgres: ides ides_immaj_prac 192.168.1.50(3527) idle
 3706 ?        Ss     0:00 postgres: ides ides_immaj_prac 192.168.1.50(4012) idle
11924 pts/0    S+     0:00 grep postgres
18710 ?        Ss     0:01 postgres: logger process
18712 ?        Ss     0:06 postgres: checkpointer process
18713 ?        Ss     0:02 postgres: writer process
18714 ?        Ss     0:05 postgres: wal writer process
18715 ?        Ss     0:04 postgres: autovacuum launcher process
18716 ?        Ss     0:39 postgres: stats collector process
18717 ?        Ss     0:00 postgres: bgworker: logical replication launcher
32202 ?        Ss     0:00 postgres: ides postgres 192.168.1.50(3109) idle
32226 ?        Ss     0:00 postgres: ides ides_immaj_prac 192.168.1.50(3114) idle
32274 ?        Ss     0:00 postgres: ides ides_jmmaj_akt 192.168.1.50(3124) idle

regards

Pavel
Re: [HACKERS] path toward faster partition pruning
On Thu, Oct 26, 2017 at 4:47 PM, Amit Langote wrote:
>
> Meanwhile, attached updated set of patches including fixes for the typos
> you reported in the other message. Updated 0005 fixes the first bug (the
> Case 1 in your email), while the other patches 0002-0004 are updated mostly
> to fix the reported typos. A couple of tests are added in 0001 to test the
> default partition case a bit more.

Hi Amit,

While testing this feature further, I got a bug with partitions as foreign tables. Test case given below. Take a look.

CREATE EXTENSION postgres_fdw;
CREATE SERVER fp_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres', port '5432', use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER fp_server;

CREATE TABLE fplt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN ('', '0001', '0002', '0003') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p1');
CREATE TABLE fplt1_p2 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN ('0004', '0005', '0006', '0007') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p2');

INSERT INTO fplt1_p1 SELECT i, i, to_char(i/50, 'FM') FROM generate_series(0, 198, 2) i;
INSERT INTO fplt1_p2 SELECT i, i, to_char(i/50, 'FM') FROM generate_series(200, 398, 2) i;

--PG-HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
                  QUERY PLAN
----------------------------------------------
 Nested Loop
   ->  Append
         ->  Foreign Scan on ftplt1_p1 t1
         ->  Foreign Scan on ftplt1_p2 t1_1
   ->  Unique
         ->  Append
               ->  Foreign Scan on ftplt1_p1 t2
               ->  Foreign Scan on ftplt1_p2 t2_1
(8 rows)

--PG-HEAD + v5 patches
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
ERROR:  invalid expression for partition key

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
[HACKERS] Reading timeline from pg_control on replication slave
Hi, hackers!

I'm working on backups from a replication slave in WAL-G [0].

Backups used to use the result of pg_walfile_name(pg_start_backup(...)). The call to pg_start_backup() works nicely, but "pg_walfile_name() cannot be executed during recovery." This function takes an LSN as argument and reads the TimeLineID from global state.

So I made a function [1] that, if on a replica, reads the timeline from the pg_control file and formats the WAL file name as if it were produced by pg_walfile_name(lsn).

Are there any serious dangers? Obviously, this hack is not crisp and clear. Is the risk of reading a stale timeline really a problem? By reading the TimeLineID from a file I'm fighting those precautions in pg_walfile_name(..) which were implemented for a reason, I guess.

Thanks for reading this. I'll be happy to hear any comments on the matter.

[0] https://github.com/wal-g/wal-g/pull/32
[1] https://github.com/wal-g/wal-g/pull/32/files#diff-455316c14fb04c9f9748a0798ad151d9R158
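For reference, the formatting half of that trick is straightforward; a sketch of computing the segment file name from a TimeLineID and an LSN (assuming the default 16 MB segment size, and ignoring pg_walfile_name()'s quirk of reporting the previous segment for an LSN exactly on a segment boundary):

```python
def parse_lsn(text):
    """Parse the textual LSN form 'XXXXXXXX/XXXXXXXX' into an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def walfile_name(tli, lsn, wal_segment_size=16 * 1024 * 1024):
    """Format a WAL segment file name the way pg_walfile_name() does,
    but taking the TimeLineID as an explicit argument (e.g. read from
    pg_control on a standby, where pg_walfile_name() refuses to run)."""
    segno = lsn // wal_segment_size
    segs_per_xlogid = 0x100000000 // wal_segment_size  # 256 for 16 MB segments
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid,
                             segno % segs_per_xlogid)

# The classic example: LSN 0/16B3740 on timeline 1
print(walfile_name(1, parse_lsn("0/16B3740")))   # 000000010000000000000001
```

The danger, of course, is not the formatting but whether the timeline read from pg_control is current, which is exactly the question raised above.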
Re: [HACKERS] Re: Is anything preventing us from allowing write to foreign tables from standby?
On Thu, Oct 26, 2017 at 11:14 AM, Ashutosh Bapat wrote:
> In a nearby thread, we are discussing atomic commit of
> transactions involving foreign transactions. For maintaining
> consistency and atomicity of transactions writing to a foreign server,
> we will need to create local transactions. Will that be possible on a
> standby? Obviously, we could add a restriction that no consistency and
> atomic commit is guaranteed when foreign servers are written from a
> standby. I am not sure how easy it would be to impose such a restriction
> and whether such a restriction would be practical.

Yeah, that feature definitely can't work on a standby. But we could probably allow writes when that feature is not in use. One thing that bothers me about Alexander's patch is that there wouldn't be any way to distinguish between those two cases. Maybe we need a callback to ask the FDW "given the configured options, could you work on a standby?"

However, as Michael also points out, it's arguably wrong to allow a nominally read-only transaction to write data regardless of whether it works. In the case of a standby it could be argued that your transaction is only read-only because you had no other choice, but nonetheless that's how it is marked. I have a feeling that if we extend the definition of "read only" to mean "sometimes allow writes", we may regret it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] Typos in src/backend/optimizer/README
Hi,

This sentence in the Partition-wise joins section of src/backend/optimizer/README should be fixed:

"This technique of breaking down a join between partition tables into join between their partitions is called partition-wise join."

(1) s/a join between partition tables/a join between partitioned tables/
(2) s/join between their partitions/joins between their partitions/

It might be okay to leave #2 as-is, but I'd like to propose changing it as above to make the meaning clear. Attached is a patch for that.

Best regards,
Etsuro Fujita

diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 031e503..1e4084d 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -1088,8 +1088,8 @@ broken into joins between the matching partitions.  The resultant join is
 partitioned in the same way as the joining relations, thus allowing an N-way
 join between similarly partitioned tables having equi-join condition between
 their partition keys to be broken down into N-way joins between their matching
-partitions. This technique of breaking down a join between partition tables
-into join between their partitions is called partition-wise join. We will use
+partitions. This technique of breaking down a join between partitioned tables
+into joins between their partitions is called partition-wise join. We will use
 term "partitioned relation" for either a partitioned table or a join between
 compatibly partitioned tables.
 

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: BRIN bloom indexes
On 27 October 2017 at 07:20, Robert Haas wrote:
> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra wrote:
>> Let's see a query like this:
>>
>>     select * from bloom_test
>>      where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>
>> The minmax index produces this plan
>>
>>    Heap Blocks: lossy=2061856
>>  Execution time: 22707.891 ms
>>
>> Now, the bloom index:
>>
>>    Heap Blocks: lossy=25984
>>  Execution time: 338.946 ms
>
> It's neat to see BRIN being extended like this. Possibly we could
> consider making it a contrib module rather than including it in core,
> although I don't have strong feelings about it.

I see that SP-GiST includes two operator classes in core, one default. It makes sense to do the same thing with BRIN and add this new opclass as a non-default option in core.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
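For readers following the thread, the core data structure under discussion is easy to sketch. This is a generic, illustrative bloom filter — not the patch's implementation, whose sizing and hashing are tuned per BRIN page range — showing why membership tests can report false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash positions per item over an nbits bitmap."""

    def __init__(self, nbits: int, nhashes: int):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, item: str):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always definitive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The "about 50% full" guidance quoted above is about this bitmap: well under half full wastes space, well over half full makes `might_contain` return True for almost everything.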
Re: [HACKERS] Implementing pg_receivewal --no-sync
On Thu, Oct 26, 2017 at 10:46 PM, Robert Haas wrote:
> On Wed, Oct 25, 2017 at 10:03 PM, Michael Paquier wrote:
>> This sentence is actually wrong, a feedback message is never sent with
>> the feedback message.
>
> Eh? "A feedback message is never sent depending on the status interval".
>
> I think this looks basically fine, though I'd omit the short option
> for it. There are only so many letters in the alphabet, so let's not
> use them up for developer-convenience options.

No objections to that.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CurTransactionContext freed before transaction COMMIT ???
Thanks for your responses, Michael and Amit Kapila. Yes, we have incorporated your suggestions into our code and started testing to reproduce the issue. If we encounter this issue again, we will get back here.

Regards,
G. Sai Ram
Re: [HACKERS] proposal: schema variables
2017-10-27 7:47 GMT+02:00 Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com>:
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Pavel Stehule
>> I propose a new database object - a variable. A variable is a
>> persistent object that holds an unshared, session-based,
>> non-transactional, in-memory value of any type, like variables in
>> other languages. The persistence is required to make static checks
>> possible, but it can be limited to a session - the variables can be
>> temporary.
>>
>> My proposal is related to session variables from Sybase, MSSQL or
>> MySQL (based on the prefix usage @ or @@), or package variables from
>> Oracle (access is controlled by scope), or schema variables from DB2.
>> Each design comes from different sources and traditions and has some
>> advantages or disadvantages. The basis of my proposal is the usage of
>> schema variables as session variables for stored procedures. It should
>> help people who try to port complex projects to PostgreSQL from other
>> databases.
>
> Very interesting. I hope I can join the review and testing.

You are welcome. I wrote a prototype last year based on envelope functions, but the integration must be much closer to SQL for this feature to have a clear benefit, so there is a lot of work left. I hope to have a prototype after this winter; it is my plan for the winter.

> How do you think this would contribute to easing the port of Oracle
> PL/SQL procedures? Would the combination of orafce and this feature
> promote auto-translation of PL/SQL procedures? I'm curious what the
> major road blocks will be after adding the schema variable.

It depends on the creativity of PL/SQL developers. Usually about 80% of an application can be migrated with current GUCs - ora2pg does some of this work. But a GUC is a little slower (not too important), and there is no simple way to secure it.

Working with variables will be similar to working with GUCs, but significantly more natural (it is not necessary to build wrapper functions). It should be much better when the value is of some composite type. Migrations will still need some intelligence, but there will be less work, and the code will be more readable and cleaner.

I already talked about "schema-pinned" functions (schema private/public objects), but I haven't thought about it more deeply. There could be a special access right to schema variables, and the pinned schema could be preferred over search_path. With this feature, a schema would behave very similarly to Oracle modules. In other words, we could implement scope-based access rights on top of schemas. But that is a far horizon. What is important is that this proposal doesn't block any future enhancement in this area and is consistent with the current state. In the future you could work with schema-private functions, tables, variables, and sequences, so variables are nothing special.

Regards

Pavel

> Regards
> Takayuki Tsunakawa
Re: [HACKERS] path toward faster partition pruning
On 2017/10/27 13:57, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote wrote:
>>> I don't think we really want to get into theorem-proving here, because
>>> it's slow.
>>
>> Just to be clear, I'm saying we could use theorem-proving (if at all)
>> just for the default partition.
>
> I don't really see why it should be needed there either. We've got
> all the bounds in order, so we should know where there are any gaps
> that are covered by the default partition in the range we care about.

Sorry, I forgot to add: "...just for the default partition, for cases like the one in Beena's example." In normal cases, default partition selection doesn't require any theorem-proving. It proceeds in a straightforward manner, more or less like what you said it should.

After thinking more about it, I think there is a rather straightforward trick to implement the idea you mentioned and get this working for the case presented in Beena's example. It works as follows: for any non-root partitioned table, we add the list of its partition constraint clauses to the query-provided list of clauses and use the whole list to drive the partition-pruning algorithm. So, when partition pruning runs for tprt_1, along with (< 1), which the original query provides, we also have (>= 1), which comes from the partition constraint of tprt_1 (which is >= 1 and < 5). Note that there is a trick in the new code by which the (< 5) coming from the constraint is overridden by the more restrictive (< 1) coming from the original query.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
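The trick described above — folding each partition's constraint clauses into the query's clause list and pruning when the combined interval is empty — can be sketched for simple range constraints on a single column. The data structures here are hypothetical illustrations, not the patch's actual code:

```python
NEG_INF, POS_INF = float("-inf"), float("inf")

def clause_interval(clauses):
    """Combine (op, value) clauses on one column into a half-open (lo, hi).

    The tighter bound always wins, which is how a query's (< 1) overrides
    a partition constraint's (< 5)."""
    lo, hi = NEG_INF, POS_INF
    for op, val in clauses:
        if op == ">=":
            lo = max(lo, val)
        elif op == "<":
            hi = min(hi, val)
    return lo, hi

def prune(partitions, query_clauses):
    """partitions maps name -> partition constraint clauses.

    A partition survives only if its constraint clauses combined with the
    query's clauses still describe a non-empty interval."""
    survivors = []
    for name, constraint in partitions.items():
        lo, hi = clause_interval(list(constraint) + list(query_clauses))
        if lo < hi:  # non-empty intersection: partition may hold matches
            survivors.append(name)
    return survivors
```

With tprt_1 constrained to (>= 1, < 5), a query clause (< 1) yields the empty interval [1, 1), so tprt_1 and everything under it is pruned — the behavior described for Beena's example.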