Re: [HACKERS] TABLESAMPLE patch
On 19/04/15 01:24, Michael Paquier wrote:
> On Sat, Apr 18, 2015 at 8:38 PM, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> On Fri, Apr 17, 2015 at 10:54 PM, Petr Jelinek wrote:
>>> On 10/04/15 06:46, Michael Paquier wrote:
>>>> 13) Some regression tests with pg_tablesample_method would be welcome.
>>> Not sure what you mean by that.
>> I meant a sanity check on pg_tablesample_method to be sure that tsminit, tsmnextblock and tsmnexttuple are always defined, as they are mandatory functions. So the idea is to add a query like the following and be sure that it returns no rows:
>> SELECT tsmname FROM pg_tablesample_method
>>   WHERE tsminit IS NOT NULL OR tsmnextblock IS NOT NULL OR tsmnexttuple IS NOT NULL;
> Yesterday was a long day. I meant IS NULL and not IS NOT NULL, but I am sure you guessed it that way already.

Yes, I guessed that, and it's a very reasonable request. I guess it should look like the attached (I don't want to send a new version of everything just for this).

--
Petr Jelinek  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From d059c792a1e864e38f321cea51169ea0b3c5caab Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmo...@pjmodos.net>
Date: Wed, 22 Apr 2015 21:28:29 +0200
Subject: [PATCH 7/7] tablesample: add catalog regression test

---
 src/test/regress/expected/tablesample.out | 15 +++++++++++++++
 src/test/regress/sql/tablesample.sql      | 13 +++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/src/test/regress/expected/tablesample.out b/src/test/regress/expected/tablesample.out
index 271638d..04e5eb8 100644
--- a/src/test/regress/expected/tablesample.out
+++ b/src/test/regress/expected/tablesample.out
@@ -209,6 +209,21 @@ SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
 ERROR:  syntax error at or near "TABLESAMPLE"
 LINE 1: ...CT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPL...
                                                              ^
+-- catalog sanity
+SELECT *
+FROM pg_tablesample_method
+WHERE tsminit IS NULL
+   OR tsmseqscan IS NULL
+   OR tsmpagemode IS NULL
+   OR tsmnextblock IS NULL
+   OR tsmnexttuple IS NULL
+   OR tsmend IS NULL
+   OR tsmreset IS NULL
+   OR tsmcost IS NULL;
+ tsmname | tsmseqscan | tsmpagemode | tsminit | tsmnextblock | tsmnexttuple | tsmexaminetuple | tsmend | tsmreset | tsmcost
+---------+------------+-------------+---------+--------------+--------------+-----------------+--------+----------+---------
+(0 rows)
+
 -- done
 DROP TABLE test_tablesample CASCADE;
 NOTICE:  drop cascades to 2 other objects
diff --git a/src/test/regress/sql/tablesample.sql b/src/test/regress/sql/tablesample.sql
index 2f4b7de..7b3eb9b 100644
--- a/src/test/regress/sql/tablesample.sql
+++ b/src/test/regress/sql/tablesample.sql
@@ -57,5 +57,18 @@ SELECT * FROM query_select TABLESAMPLE BERNOULLI (5.5) REPEATABLE (1);
 SELECT q.* FROM (SELECT * FROM test_tablesample) as q TABLESAMPLE BERNOULLI (5);
 
+-- catalog sanity
+
+SELECT *
+FROM pg_tablesample_method
+WHERE tsminit IS NULL
+   OR tsmseqscan IS NULL
+   OR tsmpagemode IS NULL
+   OR tsmnextblock IS NULL
+   OR tsmnexttuple IS NULL
+   OR tsmend IS NULL
+   OR tsmreset IS NULL
+   OR tsmcost IS NULL;
+
 -- done
 DROP TABLE test_tablesample CASCADE;
--
1.9.1

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 20/04/15 17:50, Heikki Linnakangas wrote:
> On 03/15/2015 09:07 PM, Petr Jelinek wrote:
>> Slightly updated version of the patch. Mainly rebased against current master (there were several conflicts) and fixed some typos; no real functional change. I also attached an initial version of the API sgml doc.
> I finally got around to having another round of review of this. I fixed a couple of little bugs and did some minor editing on comments etc.; see attached. It is also available in my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "seqam", if you want to look at individual changes. It combines your patches 1 and 4; I think those need to be applied together. I haven't looked at the DDL changes yet.

Thanks!

> I'm fairly happy with the alloc API now. I'm not sure it's a good idea for the AM to access the sequence tuple directly, though. It would seem cleaner if it only manipulated the amdata Datum. But it's not too bad as it is.

Yeah, I was thinking about this myself; I just disliked sending 10 parameters to the function even more than this.

> The division of labour between sequence.c and the AM, in the init and the get/set_state functions, is a bit more foggy:
>
> * Do we really need a separate amoptions() method and an init() method, when the only caller of amoptions() is just before the init() method? The changes in extractRelOptions suggest that it would call am_reloptions for sequences too, but that doesn't actually seem to be happening.

Hmm, yes, it should, and I am sure it did at some point; I must have messed it up during one of the rebases :(  And it's the reason why we need a separate API function.

> postgres=# create sequence fooseq using local with (garbage=option);
> CREATE SEQUENCE
>
> Invalid options probably should throw an error.

> * Currently, the AM's init function is responsible for basic sanity checking, like the min/max values. It also has to extract the RESTART value from the list of options. I think it would make more sense to move that to sequence.c. The AM should only be responsible for initializing the 'amdata' portion, and for handling any AM-specific options. If the AM doesn't support some options, like MIN and MAX value, it can check them and throw an error, but it shouldn't be responsible for just passing them through from the arguments to the sequence tuple.

Well, then we need to pass restart as an additional parameter to the init function, as restart is normally not stored anywhere. The check is actually whether the new value is within min/max, but yes, that can be done inside sequence.c, I guess.

> * It might be better to form the sequence tuple before calling the init function, and pass the ready-made tuple to it. All the other API functions deal with the tuple (after calling sequence_read_tuple), so that would be more consistent. The init function would need to deconstruct it to modify it, but we're not concerned about performance here.

Right, this is actually more of a relic of the custom-columns implementation, where I didn't want to build the tuple with NULLs for columns that might have been specified as NOT NULL; with the amdata approach it can create the tuple safely.

> * The transformations of the arrays in the get_state() and set_state() functions are a bit complicated. The seqam_get_state() function returns two C arrays, and pg_sequence_get_state() turns them into a text[] array. Why not construct the text[] array directly in the AM? I guess it's a bit more convenient for the AM if pg_sequence_get_state() does that, but if that's an issue, you could create a generic helper function to turn two C string arrays into text[], and call that from the AM.

Yeah, that was exactly the reasoning. A helper function works for me (I will check what Álvaro suggested; maybe it can be moved somewhere and reused).
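For illustration only (this is not the actual C API being discussed, just the shape of the transformation): the helper in question would take the two parallel key/value string arrays that seqam_get_state() returns and flatten them into the single alternating text[] array that pg_sequence_get_state() exposes. A rough Python sketch, with hypothetical names:

```python
def state_to_text_array(keys, values):
    """Flatten parallel key/value arrays into one alternating array,
    the shape pg_sequence_get_state() would return as text[]."""
    if len(keys) != len(values):
        raise ValueError("keys and values must be the same length")
    out = []
    for k, v in zip(keys, values):
        out.append(k)
        out.append(v)
    return out

def text_array_to_state(arr):
    """Inverse transformation, as a set_state() counterpart would consume it."""
    if len(arr) % 2 != 0:
        raise ValueError("state array must have an even number of elements")
    return dict(zip(arr[0::2], arr[1::2]))
```

The point of making it generic is that any sequence AM, not just the local one, could reuse the same flattening.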
> 	seq = (FormData_pg_sequence *) GETSTRUCT(sequence_read_tuple(seqh));
>
> 	for (i = 0; i < count; i++)
> 	{
> 		if (pg_strcasecmp(keys[i], "last_value") == 0)
> 			seq->last_value = DatumGetInt64(DirectFunctionCall1(int8in,
> 											CStringGetDatum(values[i])));
> 		else if (pg_strcasecmp(keys[i], "is_called") == 0)
> 			seq->is_called = DatumGetBool(DirectFunctionCall1(boolin,
> 											CStringGetDatum(values[i])));
> 		else
> 			ereport(ERROR,
> 					(errcode(ERRCODE_SYNTAX_ERROR),
> 					 errmsg("invalid state key \"%s\" for local sequence",
> 							keys[i])));
> 	}
>
> 	sequence_save_tuple(seqh, NULL, true);
>
> If that error happens after having already processed a last_value or is_called entry, you have already modified the on-disk tuple. AFAICS that's the only instance of that bug, but the sequence_read_tuple - modify tuple in place - sequence_save_tuple pattern is quite unsafe in general. If you modify a tuple directly in a Buffer, you should have a critical section around it. It would make sense to start a critical section in sequence_read_tuple(), except that not all callers want to modify the tuple in place
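The hazard described above is generic: validating entries while mutating shared state in place means an error partway through leaves a half-updated tuple behind. A hedged sketch of the safer shape - parse and validate everything first, then apply in one step - in Python terms (the names are illustrative, not the patch's API):

```python
def set_state(state, keys, values):
    """Parse and validate every entry before touching `state`, so a bad
    key cannot leave a partial update behind."""
    parsed = {}
    for k, v in zip(keys, values):
        if k == "last_value":
            parsed[k] = int(v)        # may raise; `state` is untouched so far
        elif k == "is_called":
            parsed[k] = (v == "t")
        else:
            raise ValueError('invalid state key "%s" for local sequence' % k)
    state.update(parsed)              # single "commit" step at the end
    return state
```

In the C code the equivalent would be to build the new values locally and only write them into the buffer inside a critical section once all input has been validated.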
Re: [HACKERS] Replication identifiers, take 4
On 21/04/15 22:36, Andres Freund wrote:
> On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
>> On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <and...@anarazel.de> wrote:
>>> I've now named the functions:
>>> * pg_replication_origin_create
>>> * pg_replication_origin_drop
>>> * pg_replication_origin_get (map from name to id)
>>> * pg_replication_progress_setup_origin : configure session to replicate from a specific origin
>>> * pg_replication_progress_reset_origin
>>> * pg_replication_progress_setup_tx_details : configure per-transaction details (LSN and timestamp currently)
>>> * pg_replication_progress_is_replaying : is an origin configured for the session
>>> * pg_replication_progress_advance : manually set the replication progress to a value. Primarily useful for copying values from other systems and such.
>>> * pg_replication_progress_get : how far did replay progress for a certain origin
>>> * pg_get_replication_progress : SRF returning the replay progress for all origins
>>> Any comments?
>> Why are we using functions for this rather than DDL?
> Unless I miss something, the only two we really could use DDL for are pg_replication_origin_create/pg_replication_origin_drop. We could use DDL for them if we really want, but I'm not really seeing the advantage.

I think the only value of having DDL for this would be consistency (catalog objects are created via DDL), as it looks like something that will be called only by extensions, and not by users during normal operation.
Re: [HACKERS] Sequence Access Method WIP
On 20/04/15 12:05, Andres Freund wrote:
> On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote:
>> With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_indexam for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that?
> Hm. I'd mentally modeled it more as an extension of the 'relkind' column, i.e. it further specifies how exactly the relation behaves. Given that the field was added to pg_class and not pg_index, combined with it not having "index" in its name, that makes me think it actually was intended to be used the way it's done in the patch.

That's how I think about it too. It's also short for "access method"; nothing really suggests to me that it should be index-specific by design.
Re: [HACKERS] Replication identifiers, take 4
On 17/04/15 22:36, Simon Riggs wrote:
>> I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all the other system objects we have. Andres did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable; Andres seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them to the replication ids. That would be an even better use for them.
> The argument to move to 4 bytes is a poor one. If it were reasonable in terms of code or cosmetic value, then all values used in the backend would be 4 bytes. We wouldn't have any 2-byte values anywhere. But we don't do that. The change does nothing useful, since I doubt anyone will ever need 32768 nodes in their cluster. And if they did, there would be other, much bigger problems than the replication identifier being 16-bit (it's actually 65534, as it's unsigned, btw). Considering the importance and widespread use of replication, I think we should really make sure the related features have small overhead. Increasing WAL size by any non-zero amount is needlessly wasteful for a change with only cosmetic value. But for a change that has significant value for database resilience, it is a sensible use of bytes. +1 to Andres' very reasonable suggestion. Let's commit this and go home.

+1
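(The alignment effect mentioned above - 2-3 extra bytes bumping a record over the next boundary - is easy to see with MAXALIGN-style rounding. A sketch, assuming 8-byte alignment as on a typical 64-bit build:)

```python
def maxalign(size, alignment=8):
    """Round size up to the next multiple of `alignment`,
    the way WAL record lengths are padded."""
    return (size + alignment - 1) & ~(alignment - 1)

# A 38-byte record already pads to 40, so 2 extra bytes are "free" there;
# but on a 40-byte record those same 2 bytes cost a full 8 bytes of WAL.
```

This is why the measured overhead depends so heavily on the exact record sizes pgbench happens to generate.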
Re: [HACKERS] Replication identifiers, take 4
On 10/04/15 18:03, Andres Freund wrote:
> On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
>> I'm starting benchmarks now.
> What I'm benchmarking here is the WAL overhead, since that's what we're debating. The test setup I used was a pgbench scale 10 instance. I've run with full_page_writes=off to have more reproducible results. This of course over-emphasizes the overhead a bit. But for a long checkpoint interval and a memory-resident working set it's not that unrealistic. I ran 50k transactions in a single backend.
>
> baseline:
> - 20445024
> - 20437128
> - 20436864
> - avg: 20439672
> extern 2-byte identifiers:
> - 23318368
> - 23148648
> - 23128016
> - avg: 23198344
> - avg overhead: 13.5%
> padding 2-byte identifiers:
> - 21160408
> - 21319720
> - 21164280
> - avg: 21214802
> - avg overhead: 3.8%
> extern 4-byte identifiers:
> - 23514216
> - 23540128
> - 23523080
> - avg: 23525808
> - avg overhead: 15.1%
>
> To me that shows pretty clearly that a) reusing the padding is worthwhile, and b) even without that, using 2-byte instead of 4-byte identifiers is beneficial.

My opinion is that a 10% difference in WAL size is quite a high price to pay just so that we can keep the padding for some other, yet unknown feature that hasn't come up in several years and that would need those 2 bytes. But if we are willing to pay it, then we could really go all the way and just use OIDs...
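(As a quick sanity check, the quoted averages and overhead percentages follow directly from the per-run WAL sizes posted above:)

```python
runs = {
    "baseline":      [20445024, 20437128, 20436864],
    "extern 2byte":  [23318368, 23148648, 23128016],
    "padding 2byte": [21160408, 21319720, 21164280],
    "extern 4byte":  [23514216, 23540128, 23523080],
}

def avg(xs):
    return sum(xs) / len(xs)

base = avg(runs["baseline"])
# Overhead of each variant relative to baseline, in percent.
overhead = {name: 100.0 * (avg(sizes) - base) / base
            for name, sizes in runs.items() if name != "baseline"}
# Matches the quoted figures: ~13.5%, ~3.8%, ~15.1%.
```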
Re: [HACKERS] TABLESAMPLE patch
On 10/04/15 21:26, Peter Eisentraut wrote:
> On 4/9/15 8:58 PM, Petr Jelinek wrote:
>> Well, you can have two approaches to this: either you allow some specific set of keywords that can be used to specify the limit, or you let sampling methods interpret parameters; I believe the latter is more flexible. There is nothing stopping somebody from writing a sampling method which takes the limit as a number of rows, or anything else. Also, for example, for BERNOULLI to work correctly you'd need to convert the number of rows to a fraction of the table anyway (and that's exactly what the one database which has this feature does internally), and then it's no different from passing (SELECT 100/reltuples*number_of_rows FROM tablename) as a parameter.
> What is your intended use case for this feature? I know that "give me 100 random rows from this table quickly" is a common use case, but that's kind of cumbersome if you need to apply formulas like that. I'm not sure what the use of a percentage is. Presumably, the main use of this feature is on large tables. But then you might not even know how large it really is, and even saying 0.1% might be more than you wanted to handle.

My main intended use case is analytics on very big tables. The percentages of population vs. confidence levels are pretty well mapped there, and you can get quite big speedups if you are fine with getting results with slightly smaller confidence (i.e. you care about ballpark figures).

But this was not really my point. BERNOULLI just does not work well with a row limit by definition: it applies a probability to each individual row, and while you can get the probability from a percentage very easily (just divide by 100), to get it for a specific target number of rows you have to know the total number of source rows, and that's not something we can determine very accurately, so you won't get 500 rows but approximately 500 rows.

In any case, for "give me 500 somewhat random rows from the table quickly" you probably want SYSTEM sampling anyway, as it will be orders of magnitude faster on big tables, and yes, even 0.1% might be more than you wanted in that case. I am not against having row-limit input for methods which can work with it, like SYSTEM, but that's easily doable by adding a separate sampling method which accepts rows (even if the sampling algorithm itself is the same). In the current approach, all you'd have to do is write a different init function for the sampling method and register it under a new name (it won't be named SYSTEM but, for example, SYSTEM_ROWLIMIT).
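The point that BERNOULLI only approximates a target row count can be shown directly: keeping each row independently with probability target/total yields a binomially distributed result, not an exact count. A small, illustrative simulation (plain Python, not the server's PRNG):

```python
import random

def bernoulli_sample(rows, probability, seed):
    """Keep each row independently with the given probability,
    which is how BERNOULLI sampling decides per row."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < probability]

rows = range(10000)
p = 500 / 10000          # 5% -- and this already assumes we know the row count
sizes = {len(bernoulli_sample(rows, p, seed)) for seed in range(20)}
# The sample sizes cluster around 500 but vary from run to run.
```

Even with a perfectly known table size, repeated runs return "approximately 500" rows, never a guaranteed 500.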
Re: [HACKERS] TABLESAMPLE patch
On 10/04/15 22:16, Tomas Vondra wrote:
> On 04/10/15 21:57, Petr Jelinek wrote:
>> On 10/04/15 21:26, Peter Eisentraut wrote:
>> But this was not really my point. BERNOULLI just does not work well with a row limit by definition: it applies a probability to each individual row, and while you can get the probability from a percentage very easily (just divide by 100), to get it for a specific target number of rows you have to know the total number of source rows, and that's not something we can determine very accurately, so you won't get 500 rows but approximately 500 rows.
> It's actually even trickier. Even if you happen to know the exact number of rows in the table, you can't just convert that into a percentage like that and use it for BERNOULLI sampling. It may give you a different number of result rows, because each row is sampled independently. That is why we have Vitter's algorithm for reservoir sampling, which works very differently from BERNOULLI.

Hmm, this actually gives me an idea - perhaps we could expose Vitter's reservoir sampling as another sampling method for people who want the "give me 500 rows from the table fast" behaviour? We already have it implemented; it's just a matter of adding the glue.
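For reference, here is a minimal sketch of the reservoir-sampling idea (Vitter's Algorithm R), which yields exactly k rows in one pass without knowing the table size in advance. This is an illustration of the technique, not the actual PostgreSQL implementation:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """One-pass Algorithm R: returns exactly k items, each with equal
    probability, without knowing the stream length up front."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # replace with decreasing probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100000), 500, seed=42)
```

Unlike BERNOULLI, this guarantees the exact requested row count, which is what makes it attractive as a separate row-limit sampling method.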
Re: [HACKERS] TABLESAMPLE patch
On 09/04/15 11:37, Simon Riggs wrote:
> On 9 April 2015 at 04:52, Simon Riggs <si...@2ndquadrant.com> wrote:
>> TABLESAMPLE BERNOULLI could work in this case, or any other non-block-based sampling mechanism. Whether it does work yet is another matter. This query should be part of the test suite and should generate a useful message or work correctly.
> The SQL standard does allow the WITH query given. It makes no mention of the obvious point that SYSTEM-defined mechanisms might not work, but that is for the implementation to define, AFAICS.

Yes, the SQL standard allows this, and the reason why it doesn't define what happens with SYSTEM is that it actually doesn't define how SYSTEM should behave at all, except that it should return approximately the given percentage of rows; the actual behavior is left to the DBMS. The reason why other databases like MSSQL or DB2 have chosen block sampling here is that it makes the most sense (and is fastest) on tables, and those databases don't support TABLESAMPLE on anything else at all.

> On balance, in this release, I would be happier to exclude sampled results from queries, and only allow sampling against base tables.

I think so too, considering how late in the last CF we are, and especially given my note about MSSQL and DB2 above. In any case, I don't see any fundamental issues with extending the current implementation with subquery support. I think most of the work there is actually in the parser/analyzer and planner. The sampling methods will just not receive requests for the next blockid and the tupleids within that block when the source of the data is a subquery, and if they want to support a subquery as the source of sampling they will have to provide the examinetuple interface (which is already there and optional; the test/example custom sampling method uses it).
Re: [HACKERS] TABLESAMPLE patch
On 06/04/15 12:33, Amit Kapila wrote:
> On Sat, Apr 4, 2015 at 8:25 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> Yes, I want extensibility here. And I think the tablesample method arguments are the same thing as function arguments, given that in the end they are arguments for the init function of the tablesample method. I would be OK with just expr_list; naming parameters here isn't overly important, but the ability to have different types and numbers of parameters for custom TABLESAMPLE methods *is* important.
> Yeah, named parameters are one thing which I think won't be required for sample methods, and they aren't mentioned in the docs either (if the user is supposed to use them, what is the way to pass them?); another is the number of arguments for sampling methods, which currently seems to be the same as FUNC_MAX_ARGS. I think that is sufficient, but do we want to support that many arguments for a sampling method?

I think I'll go with a simple list of a_exprs. The reason is that I can foresee sampling methods that need multiple parameters, but I don't think naming them is very important. Also, support for named parameters can be added in the future, if we so decide, without breaking compatibility. A side benefit is that it makes hinting at what is wrong with the input somewhat easier. I don't think we need to come up with a different limit from FUNC_MAX_ARGS; I don't think any sampling method would need that many parameters, but I also don't see what an additional, smaller limit would give us.

> 2.
> postgres=# explain update test_tablesample TABLESAMPLE system(30) set id = 2;
> ERROR:  syntax error at or near "TABLESAMPLE"
> LINE 1: explain update test_tablesample TABLESAMPLE system(30) set i...
>
> postgres=# Delete from test_tablesample TABLESAMPLE system(30);
> ERROR:  syntax error at or near "TABLESAMPLE"
> LINE 1: Delete from test_tablesample TABLESAMPLE system(30);
>
> Isn't the TABLESAMPLE clause supposed to work with Update/Delete statements?

It's supported in the FROM part of UPDATE and the USING part of DELETE. I think that's sufficient.

> But I think Update on the target table with a sample scan is supported via views, which doesn't seem to be the right thing in case you just want to support it via FROM/USING. Example:
>
> postgres=# create view vw_test As select * from test_tablesample TABLESAMPLE system(30);
> postgres=# explain update vw_test set id = 4;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Update on test_tablesample  (cost=0.00..4.04 rows=4 width=210)
>    ->  Sample Scan on test_tablesample  (cost=0.00..4.04 rows=4 width=210)
> (2 rows)

Right, I'll make those views not auto-updatable. The standard is somewhat useless for UPDATE and DELETE, as it only defines quite limited syntax there. From what I've seen when doing research, MSSQL also only supports it in their equivalent of the FROM/USING list, Oracle does not seem to support their SAMPLE clause outside of SELECTs at all, and if I read the cryptic DB2 manual correctly, I think they don't support it outside of (sub)SELECTs either.

> By the way, what is the use case for supporting a sample scan in an Update or Delete statement?

Well, for the USING/FROM part the use case is the same as for SELECT - providing a sample of the data for the query (it can also be useful for getting pseudo-random rows fast). And if we didn't support it there, it could still be done using a sub-select, so why not have it directly?

> Also, isn't it better to mention it in the docs for Update and Delete in case we are going to support the tablesample clause for them?

Most of the other clauses that we support in FROM are not mentioned in the UPDATE/DELETE docs; both of those commands just say something like "refer to the SELECT FROM docs for more info". Do you think TABLESAMPLE deserves special treatment in this regard?

> 7.
> ParseTableSample()
> {
> ..
> arg = coerce_to_specific_type(pstate, arg, INT4OID, "REPEATABLE");
> ..
> }
>
> What is the reason for coercing the value of the REPEATABLE clause to INT4OID when a numeric value is expected for the clause? If the user gives the input value as -2.3, it will generate a seed, which doesn't seem right.

Because REPEATABLE is a numeric expression, so it can produce any number, but we need an int4 internally (well, float4 would also work, the code would just be slightly uglier).

> Okay, I understand that part. Here the real point is why not just expose it as int4 to the user, rather than saying in the docs that it is numeric when we actually neither expect nor use it as numeric. Even Oracle supports it as an int, with the description below:
> "The seed_value must be an integer between 0 and 4294967295."

Well, the thing with the SQL standard's "numeric value expression" is that it does not actually mean the numeric data type; it's just a simple arithmetic expression with some rules given by the standard, but the output data type can be either implementation
Re: [HACKERS] TABLESAMPLE patch
On 06/04/15 15:07, Amit Kapila wrote:
> On Mon, Apr 6, 2015 at 5:56 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> On 06/04/15 12:33, Amit Kapila wrote:
>>> But I think Update on the target table with a sample scan is supported via views, which doesn't seem to be the right thing in case you just want to support it via FROM/USING. Example:
>>>
>>> postgres=# create view vw_test As select * from test_tablesample TABLESAMPLE system(30);
>>> postgres=# explain update vw_test set id = 4;
>>>                              QUERY PLAN
>>> ---------------------------------------------------------------------
>>>  Update on test_tablesample  (cost=0.00..4.04 rows=4 width=210)
>>>    ->  Sample Scan on test_tablesample  (cost=0.00..4.04 rows=4 width=210)
>>> (2 rows)
>> Right, I'll make those views not auto-updatable. The standard is somewhat useless for UPDATE and DELETE, as it only defines quite limited syntax there. From what I've seen when doing research, MSSQL also only supports it in their equivalent of the FROM/USING list, Oracle does not seem to support their SAMPLE clause outside of SELECTs at all, and if I read the cryptic DB2 manual correctly, I think they don't support it outside of (sub)SELECTs either.
>>> By the way, what is the use case for supporting a sample scan in an Update or Delete statement?
>> Well, for the USING/FROM part the use case is the same as for SELECT - providing a sample of the data for the query (it can also be useful for getting pseudo-random rows fast). And if we didn't support it there, it could still be done using a sub-select, so why not have it directly?
> I can understand why someone wants to read sample data via SELECT, but I am not clearly able to understand why someone would want to Update or Delete random data in a table; and if there is a valid case, then why support it only via sub-selects used in the where clause or a table reference in the FROM/USING list? Can't we keep it simple, such that we either support Update/Delete based on the Tablesample clause or prohibit it in all cases?

Well, I don't understand why somebody would do it either, but then again, during research for this feature I've found questions on Stack Overflow and similar sites about how to do it, so people must have use cases. And in any case, as you say, a sub-select would work there, so there is no reason to explicitly disable it. Plus, there is already a difference between what can be the target table in DELETE/UPDATE and what can be in the FROM/USING clause, and I think the TABLESAMPLE behavior follows that separation nicely - it's well demonstrated by the fact that we would have to add explicit exceptions in some places in the code to disallow it.
Re: [HACKERS] TABLESAMPLE patch
On 06/04/15 11:02, Simon Riggs wrote:
> On 2 April 2015 at 17:36, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> so here is version 11.
> Looks great. Comment on docs: the SELECT docs refer only to SYSTEM and BERNOULLI. They don't mention that if other methods are available, those could be used also.

The phrasing was "sampling_method can be one of <list>". Will reword.

> Are we ready for a final detailed review and commit?

I plan to send v12 in the evening with some additional changes that came up from Amit's comments, plus some improvements to error reporting. I think it will be ready then.
Re: [HACKERS] TABLESAMPLE patch
the calculations much easier and faster than dealing with Numeric would.

> 9.
> postgres=# explain SELECT t1.id FROM test_tablesample as t1 TABLESAMPLE SYSTEM (10), test_tablesample as t2 TABLESAMPLE BERNOULLI (20);
>                              QUERY PLAN
> ------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..4.05 rows=100 width=4)
>    ->  Sample Scan on test_tablesample t1  (cost=0.00..0.01 rows=1 width=4)
>    ->  Sample Scan on test_tablesample t2  (cost=0.00..4.02 rows=2 width=0)
> (3 rows)
>
> Isn't it better to display the sample method name in explain? I think it can help in case of join queries. We can use the below format to display it:
> Sample Scan (System) on test_tablesample ...

Good idea.
Re: [HACKERS] PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)
On 05/04/15 01:03, Tomas Vondra wrote:
> Hi there, the changes in _hash_splitbucket() made nblkno unused when compiled without asserts. Attached is a simple fix to get rid of the warnings.

This has already been fixed; see b7e1652d5.
Re: [HACKERS] Abbreviated keys for Numeric
On 02/04/15 21:21, Robert Haas wrote:
> On Thu, Apr 2, 2015 at 2:07 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I think this is really nice work, so I have committed this version. I made a few fairly minor changes, hopefully without breaking anything in the process:
>> - I adjusted things for recent commits around INT{32,64}_{MIN,MAX}.
>> - I removed "(void) ssup;", which I do not think is normal PostgreSQL style.
>> - Removed the #if DEC_DIGITS != 4 check. The comment is great, but I think we don't need to protect against #if 0 code getting re-enabled.
>> - I removed the too-clever (at least IMHO) handling of TRACE_SORT in favor of just using #ifdef around each occurrence.
>> - I also declared trace_sort in guc.h, where various other GUCs are declared, instead of declaring it privately in each file that needed it.
>> - Changed some definitions to depend on SIZEOF_DATUM rather than USE_FLOAT8_BYVAL. Hopefully I didn't muff this; please check it.
>> - Fixed an OID conflict.
>> - And of course, bumped catversion.
> And that's broken the buildfarm. Argh.

That's because of this:

+#ifdef SIZEOF_DATUM == 8

You need just #if there (#ifdef only tests whether the macro is defined; it can't evaluate the comparison).
Re: [HACKERS] Abbreviated keys for Numeric
On 02/04/15 22:22, Robert Haas wrote: On Thu, Apr 2, 2015 at 4:21 PM, Petr Jelinek p...@2ndquadrant.com wrote: On 02/04/15 21:21, Robert Haas wrote: On Thu, Apr 2, 2015 at 2:07 PM, Robert Haas robertmh...@gmail.com wrote: I think this is really nice work, so I have committed this version. I made a few fairly minor changes, hopefully without breaking anything in the process: - I adjusted things for recent commits around INT{32,64}_{MIN,MAX}. - I removed (void) ssup; which I do not think is normal PostgreSQL style. - Removed the #if DEC_DIGITS != 4 check. The comment is great, but I think we don't need to protect against #if 0 code getting re-enabled. - I removed the too-clever (at least IMHO) handling of TRACE_SORT in favor of just using #ifdef around each occurrence. - I also declared trace_sort in guc.h, where various other GUCs are declared, instead of declaring it privately in each file that needed it. - Changed some definitions to depend on SIZEOF_DATUM rather than USE_FLOAT8_BYVAL. Hopefully I didn't muff this; please check it. - Fixed an OID conflict. - And of course, bumped catversion. And that's broken the buildfarm. Argh. That's because of this: +#ifdef SIZEOF_DATUM == 8 You need just #if there. Thanks. I actually pushed a fix for that about 25 minutes ago; hopefully that is all that is needed. Ok, git.pg.org was somewhat behind. It did fix it for me when I tested it locally. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Unused variable in hashpage.c
Hi, my compiler complains about an unused variable nblkno in _hash_splitbucket in a no-assert build. It looks like a relic of commit ed9cc2b5d, which removed the only use of that variable besides the Assert. Looking at the code: nblkno = start_nblkno; Assert(nblkno == BufferGetBlockNumber(nbuf)); I was originally thinking of simply changing the Assert to use start_nblkno, but then I noticed that start_nblkno is actually not used anywhere else in the function either (it's one of the input parameters). And given that the only caller of that function is getting the variable it sends as nbuf via start_nblkno, I think the Assert is somewhat pointless there anyway. So the best way to solve this seems to be removing start_nblkno from _hash_splitbucket altogether. The attached patch does just that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index 9a77945..2f82b1e 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -39,7 +39,6 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock, static void _hash_splitbucket(Relation rel, Buffer metabuf, Buffer nbuf, Bucket obucket, Bucket nbucket, - BlockNumber start_oblkno, BlockNumber start_nblkno, uint32 maxbucket, uint32 highmask, uint32 lowmask); @@ -666,8 +665,7 @@ _hash_expandtable(Relation rel, Buffer metabuf) /* Relocate records to the new bucket */ _hash_splitbucket(rel, metabuf, buf_nblkno, old_bucket, new_bucket, - start_oblkno, start_nblkno, - maxbucket, highmask, lowmask); + start_oblkno, maxbucket, highmask, lowmask); /* Release bucket locks, allowing others to access them */ _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE); @@ -758,13 +756,11 @@ _hash_splitbucket(Relation rel, Bucket obucket, Bucket nbucket, BlockNumber start_oblkno, - BlockNumber start_nblkno, uint32 maxbucket, uint32 highmask, uint32 lowmask) { 
BlockNumber oblkno; - BlockNumber nblkno; Buffer obuf; Page opage; Page npage; @@ -781,8 +777,6 @@ _hash_splitbucket(Relation rel, opage = BufferGetPage(obuf); oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); - nblkno = start_nblkno; - Assert(nblkno == BufferGetBlockNumber(nbuf)); npage = BufferGetPage(nbuf); /* initialize the new bucket's primary page */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 01/04/15 17:52, Robert Haas wrote: On Wed, Apr 1, 2015 at 9:49 AM, Amit Kapila amit.kapil...@gmail.com wrote: I am still not sure whether it is okay to move REPEATABLE from the unreserved to the other category. In fact, last weekend I spent some time looking at the exact reason for the shift/reduce errors and tried some approaches, but didn't find a way around them. I am planning to spend some more time on this, probably in the next few days, and if I still cannot find a way, I will share my findings and then re-review the changes made by Petr in the last version. I think overall the patch is in good shape now, although I haven't looked into the DDL support part of the patch, which I thought could be done in a separate patch as well. That seems like a legitimate concern. We usually try not to make keywords more reserved in PostgreSQL than they are in the SQL standard, and REPEATABLE is apparently non-reserved there: http://www.postgresql.org/docs/devel/static/sql-keywords-appendix.html This also makes method an unreserved keyword, which I'm not wild about either. Adding a new keyword doesn't cost *much*, but is this SQL-mandated syntax or something we created? If the latter, can we find something to call it that doesn't require new keywords? REPEATABLE is mandated by the standard. I did try for quite some time to make it unreserved but was not successful (I can only make it unreserved if I make it mandatory, but that's not a solution). In fact, I haven't even been able to find out what it actually conflicts with... METHOD is something I added. I guess we could find a way to name this differently if we really tried. The reason why I picked METHOD was that I had already added the same unreserved keyword in the sequence AMs patch, and in that one any other name does not really make sense. 
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 01/04/15 18:38, Robert Haas wrote: On Wed, Apr 1, 2015 at 12:15 PM, Petr Jelinek p...@2ndquadrant.com wrote: REPEATABLE is mandated by the standard. I did try for quite some time to make it unreserved but was not successful (I can only make it unreserved if I make it mandatory, but that's not a solution). In fact, I haven't even been able to find out what it actually conflicts with... Yeah, that can be hard to figure out. Did you run bison with -v and poke around in gram.output? Oh, no I didn't (I didn't know gram.output would be generated). This helped quite a bit. Thanks. I now found the reason: it's conflicting with the alias, but that's my mistake - the alias should come before the TABLESAMPLE clause per the standard, and I had put it after it in the parser. Now that I've put it in the correct place, REPEATABLE can be an unreserved keyword. This change requires making TABLESAMPLE a type_func_name_keyword, but that's probably not really an issue as TABLESAMPLE is a reserved keyword per the standard. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
So I took a more in-depth look; I do have a couple of comments. I would really like to have something like a Logical Replication Infrastructure doc section that would have both decoding and identifiers (and possibly even CommitTs) underneath. There is a typo in the docs: + para + The optional functionfilter_by_origin_cb/function callback + is called to determine wheter data that has been replayed wheter - whether And finally I have an issue with how the new identifiers are allocated. Currently, if you create identifier 'foo', remove identifier 'foo' and create identifier 'bar', the identifier 'bar' will have the same id as the old 'foo' identifier. This can be a problem because the identifier id is used as the origin of the data, and a replication solution using the replication identifiers can end up thinking that data came from node 'bar' even though it came from the node 'foo' which no longer exists. This can have bad effects, for example on conflict detection or on debugging replication problems. Maybe another reason to use standard Oids? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 24/03/15 16:33, Andres Freund wrote: Hi, Here's the next version of this patch. I've tried to address the biggest issue (documentation) and some more. Now that both the more flexible commit WAL record format and the BKI_FORCE_NOT_NULL thing are in, it looks much cleaner. Nice, I see you also did the closer integration with CommitTs. I only skimmed the patch so far, but it looks quite good; I'll take a closer look around the end of the week. I'd greatly appreciate some feedback on the documentation. I'm not entirely sure into how much detail to go, and where exactly in the docs to place it. I do wonder if we shouldn't merge this with the logical decoding section, and whether we could also document commit timestamps somewhere in there. Perhaps we should have some Logical replication developer documentation section and put all those three as subsections of that? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 22/03/15 10:35, Andres Freund wrote: On March 22, 2015 10:34:04 AM GMT+01:00, Michael Paquier michael.paqu...@gmail.com wrote: On Sun, Mar 22, 2015 at 6:22 PM, Andres Freund and...@2ndquadrant.com That's due to a different patch though, right? When I checked earlier only jacana had problems due to this, and it looked like random memory was being output. It's interesting that that's on the one windows (not cygwin) critter that does the 128bit dance... Yes, sorry, the e+000 stuff is from 959277a. This patch has visibly broken that: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2015-03-21%2003%3A01%3A21 That's the stuff looking like random memory that I talk about above... If you look at it closely, it's actually not random memory. At least in the first 2 failing tests, which are not obfuscated by aggregates on top of aggregates. It looks like the first NumericDigit is ok and the second one is corrupted (there are only 2 NumericDigits in those numbers). Of course the conversion to Numeric is done from the end, so it looks like only the last computation/pointer change/something stays ok while the rest got corrupted. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 22/03/15 13:59, Andreas Karlsson wrote: On 03/22/2015 11:47 AM, Petr Jelinek wrote: On 22/03/15 10:35, Andres Freund wrote: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2015-03-21%2003%3A01%3A21 That's the stuff looking like random memory that I talk about above... If you look at it closely, it's actually not random memory. At least in the first 2 failing tests, which are not obfuscated by aggregates on top of aggregates. It looks like the first NumericDigit is ok and the second one is corrupted (there are only 2 NumericDigits in those numbers). Of course the conversion to Numeric is done from the end, so it looks like only the last computation/pointer change/something stays ok while the rest got corrupted. Would this mean the bug is most likely somewhere in int128_to_numericvar()? Maybe that version of gcc has a bug in some __int128 operator, or I messed up the code there somehow. Yeah, that's what I was thinking also, and I went through the function and didn't find anything suspicious (besides being the same as the 64-bit version except for the int128 use). It really might be a compiler bug in some combination of the arithmetic plus the conversion to a 16-bit unsigned int. It might be worth producing a test case and trying it standalone? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INT64_MIN and _MAX
On 21/03/15 23:45, Andrew Gierth wrote: A couple of places (adt/timestamp.c and pgbench.c) have this: #ifndef INT64_MAX #define INT64_MAX INT64CONST(0x7FFFFFFFFFFFFFFF) #endif #ifndef INT64_MIN #define INT64_MIN (-INT64CONST(0x7FFFFFFFFFFFFFFF) - 1) #endif On the other hand, int8.c uses the INT64_MIN expression directly inline. On the third hand, INT64_MIN etc. would typically be defined in stdint.h if it exists. So wouldn't it make more sense to move these definitions into c.h and standardize their usage? I was thinking the same when I saw Peter's version of the Numeric abbreviations patch. So +1 for that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 16/03/15 00:37, Andreas Karlsson wrote: On 03/15/2015 09:47 PM, Petr Jelinek wrote: It's almost the same thing as you wrote but the 128 bit and 64 bit optimizations are put on the same level of optimized routines. But this is nitpicking at this point, I am happy with the patch as it stands right now. Do you think it is ready for committer? In my opinion, yes. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery_target_action = pause hot_standby = off
On 15/03/15 14:51, Magnus Hagander wrote: On Sun, Mar 15, 2015 at 2:27 PM, Andres Freund and...@2ndquadrant.com mailto:and...@2ndquadrant.com wrote: On 2015-03-12 15:52:02 +0100, Andres Freund wrote: /* * Override any inconsistent requests. Note that this is a change * of behaviour in 9.5; prior to this we simply ignored a request * to pause if hot_standby = off, which was surprising behaviour. */ if (recoveryTargetAction == RECOVERY_TARGET_ACTION_PAUSE && recoveryTargetActionSet && standbyState == STANDBY_DISABLED) recoveryTargetAction = RECOVERY_TARGET_ACTION_SHUTDOWN; While it's easy enough to fix, I rather dislike the whole intent here. *Silently* switching the mode of operation in a rather significant way seems like a bad idea to me. At the very least we need to emit a LOG message about this; but I think it'd be much better to error out instead. 9.5's behaviour was already quite surprising. But changing things to a different surprising behaviour seems like a bad idea. +1. Especially for sensitive operations like this, having predictable-behavior-or-error is usually the best choice. Thinking about it again now, it does seem that ignoring a user setting because it's in conflict with another user setting is a bad idea, and I think we in general throw errors in those cases. So +1 from me also. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 14/03/15 04:17, Andreas Karlsson wrote: On 03/13/2015 10:22 PM, Peter Geoghegan wrote: On Thu, Mar 12, 2015 at 6:23 PM, Andreas Karlsson andr...@proxel.se wrote: /* * Integer data types use Numeric accumulators to share code and avoid risk * of overflow. To speed up aggregation 128-bit integer accumulators are * used instead where sum(X) or sum(X*X) fit into 128-bits, and there is * platform support. * * For int2 and int4 inputs sum(X) will fit into a 64-bit accumulator, hence * we use faster special-purpose accumulator routines for SUM and AVG of * these datatypes. */ #ifdef HAVE_INT128 typedef struct Int128AggState Not quite. Refer to the 128-bit integer accumulators as special-purpose accumulator routines instead. Then, in the case of the extant 64-bit accumulators, refer to them by the shorthand integer accumulators. Otherwise it's the wrong way around. I disagree. The term integer accumulators is confusing. Right now I do not have any better ideas for how to write that comment, so I submit the next version of the patch with the comment as I wrote it above. Feel free to come up with a better wording; my English is not always up to par when writing technical texts. I don't like the term integer accumulators either, as integer is platform specific. I would phrase it like this: /* * Integer data types in general use Numeric accumulators to share code * and avoid risk of overflow. * * However for performance reasons some of the optimized special-purpose * accumulator routines are used when possible. * * On platforms with 128-bit integer support, the 128-bit routines will be * used when sum(X) or sum(X*X) fit into 128-bit. * * For int2 and int4 inputs, the N and sum(X) fit into 64-bit so the 64-bit * accumulators will be used for SUM and AVG of these data types. */ It's almost the same thing as you wrote, but the 128-bit and 64-bit optimizations are put on the same level of optimized routines. 
But this is nitpicking at this point, I am happy with the patch as it stands right now. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 22/02/15 01:59, Petr Jelinek wrote: On 21/02/15 22:09, Andrew Dunstan wrote: On 02/16/2015 09:05 PM, Petr Jelinek wrote: I found one more issue with the 1.2--1.3 upgrade script: the DROP FUNCTION pg_stat_statements(); should be DROP FUNCTION pg_stat_statements(bool); since in 1.2 the function identity has changed. I think all the outstanding issues are fixed in this patch. I think PGSS_FILE_HEADER should also be updated; otherwise it's good. I see you marked this as 'needs review'; I am marking it as 'ready for committer' as it looks good to me. Just don't forget to update PGSS_FILE_HEADER while committing (I think that one can be treated the same way as catversion, i.e. be updated by the committer and not in the patch submission). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery_target_action doesn't work for anything but shutdown
On 13/03/15 15:06, Petr Jelinek wrote: On 12/03/15 15:53, Andres Freund wrote: On 2015-03-12 15:52:02 +0100, Andres Freund wrote: I guess what you actually intended to test was StandbyModeRequested? Err, EnableHotStandby. Yeah, pause does not work currently. This change was made by committer... Actually, the code is from me; the committer just requested the change, sorry. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery_target_action doesn't work for anything but shutdown
On 12/03/15 15:53, Andres Freund wrote: On 2015-03-12 15:52:02 +0100, Andres Freund wrote: I guess what you actually intended to test was StandbyModeRequested? Err, EnableHotStandby. Yeah, pause does not work currently. This change was made by the committer, and I think the intended change was to make it a little safer by silently changing to shutdown instead of silently changing to promote, which is essentially what was happening before the patch. But yes, the check should have been against EnableHotStandby, it seems. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mogrify and indent features for jsonb
On 01/03/15 16:49, Andrew Dunstan wrote: On 03/01/2015 05:03 AM, Petr Jelinek wrote: On 23/02/15 18:15, Dmitry Dolgov wrote: Hi, Petr, thanks for the review. I think it would be better if the indent printing didn't put the start of array ([) and start of dictionary ({) on a separate line Did you mean this? [ { "a": 1, "b": 2 } ] I tried to verify this in several ways (http://jsonprettyprint.com/, JSON.stringify, json.tool from python), and I always get this result (new line after ([) and ({) ). No, I mean new lines before the ([) and ({) - try pretty printing something like '{"a": ["b", "c"], "d": {"e": "f"}}' using that tool you pasted and see what your patch outputs to see what I mean. Please try this patch and see if it's doing what you want. Yes, this produces output that looks like what javascript/python produce. (The other stuff I commented about, mainly the h_atoi, is still unfixed though.) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 10/03/15 10:54, Amit Kapila wrote: On Tue, Mar 10, 2015 at 3:03 PM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com wrote: Ok, now I think I finally understand what you are suggesting - you are saying let's go over the whole page while tsmnexttuple returns something, do the visibility check and other stuff in that code block under the buffer lock, cache the resulting valid tuples in some array, and then return those tuples one by one from that cache? Yes, this is what I am suggesting. And if the caller tries to do it in one step and cache the visibility info, then we'll end up with pretty much the same structure as rs_vistuples - there isn't a saner way to cache this info other than an ordered vector of tuple offsets, unless we assume that most pages have close to MaxOffsetNumber of tuples, which they don't - so why not just use heapgetpage directly and do the binary search over rs_vistuples. The downside of doing it via heapgetpage is that it will do the visibility test for tuples which we might not even need (I think we should do the visibility test only for tuples returned by tsmnexttuple). Well, heapgetpage can either read visibility data for the whole page or not, depending on whether we want pagemode reading. So we can use pagemode for sampling methods where it's feasible (like system) and not use pagemode where it's not (like bernoulli), and then either use rs_vistuples or call HeapTupleSatisfiesVisibility individually, again depending on whether the method is using pagemode. Yeah, but as mentioned above, this has some downside, but go for it only if you feel that the above suggestion is making the code complex, which I think should not be the case as we are doing something similar in acquire_sample_rows(). I think your suggestion is actually simpler code-wise; I am just somewhat worried by the fact that no other scan node uses that kind of caching, and there is probably a reason for that. 
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 10/03/15 04:43, Amit Kapila wrote: On Mon, Mar 9, 2015 at 3:08 PM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com wrote: On 09/03/15 04:51, Amit Kapila wrote: On Sat, Mar 7, 2015 at 10:37 PM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com Double checking for tuple visibility is the only downside I can think of. That will happen if we use heapgetpage the way the code is currently written in the patch; however, we can easily avoid double checking if we don't call heapgetpage and rather do the required work at the caller's place. What's the point of pagemode then, if the caller code does the visibility checks still one by one on each call? I thought one of the points of pagemode was to do this in one step (and one buffer lock). You only need one buffer lock for doing it at the caller's location as well, something like we do in acquire_sample_rows(). Ok, now I think I finally understand what you are suggesting - you are saying let's go over the whole page while tsmnexttuple returns something, do the visibility check and other stuff in that code block under the buffer lock, cache the resulting valid tuples in some array, and then return those tuples one by one from that cache? And if the caller tries to do it in one step and cache the visibility info, then we'll end up with pretty much the same structure as rs_vistuples - there isn't a saner way to cache this info other than an ordered vector of tuple offsets, unless we assume that most pages have close to MaxOffsetNumber of tuples, which they don't - so why not just use heapgetpage directly and do the binary search over rs_vistuples. The downside of doing it via heapgetpage is that it will do the visibility test for tuples which we might not even need (I think we should do the visibility test only for tuples returned by tsmnexttuple). Well, heapgetpage can either read visibility data for the whole page or not, depending on whether we want pagemode reading. 
So we can use the pagemode for sampling methods where it's feasible (like system) and not use pagemode where it's not (like bernoulli) and then either use the rs_vistuples or call HeapTupleSatisfiesVisibility individually again depending if the method is using pagemode or not. This is how sequential scan does it afaik. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: disallow operator => and use it for named parameters
On 10/03/15 17:01, Pavel Stehule wrote: 2015-03-10 16:50 GMT+01:00 Tom Lane t...@sss.pgh.pa.us mailto:t...@sss.pgh.pa.us: Robert Haas robertmh...@gmail.com mailto:robertmh...@gmail.com writes: Committed with a few documentation tweaks. Was there any consideration given to whether ruleutils should start printing NamedArgExprs with =>? Or do we think that needs to wait? I didn't think about it. I don't see any reason why it has to use the deprecated syntax. There is one: loading the output into an older version of Postgres. I don't know if that's an important one, though. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 09/03/15 13:39, Andreas Karlsson wrote: On 03/07/2015 07:18 PM, Petr Jelinek wrote: What I am wondering is if those numeric_int16_* functions that also deal with either the Int128AggState or NumericAggState should be renamed in similar fashion. You mean something like numeric_poly_sum instead of numeric_int16_sum? I personally am not fond of either name. While numeric_int16_* incorrectly implies we have a int16 SQL type numeric_poly_* does not tell us that this is an optimized version which uses a smaller state. Yes that's what I mean, since the int16 in the name is misleading given that in at least some builds the int16 won't be used. You could always have numeric function, int16 function and the poly function which decides between them but that's probably overkill. The worst part of writing this patch has always been naming functions and types. :) Naming is hard :) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 09/03/15 18:39, David Fetter wrote: On Mon, Mar 09, 2015 at 01:39:04PM +0100, Andreas Karlsson wrote: On 03/07/2015 07:18 PM, Petr Jelinek wrote: What I am wondering is if those numeric_int16_* functions that also deal with either the Int128AggState or NumericAggState should be renamed in similar fashion. You mean something like numeric_poly_sum instead of numeric_int16_sum? I personally am not fond of either name. While numeric_int16_* incorrectly implies we have a int16 SQL type numeric_poly_* does not tell us that this is an optimized version which uses a smaller state. Would it be simpler to write a separate patch to add an int16 SQL type so that this implication is correct? No, because then you'd need to emulate the type on platforms where it does not exist and the patch would be several times more complex for no useful reason. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 09/03/15 04:51, Amit Kapila wrote: On Sat, Mar 7, 2015 at 10:37 PM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com wrote: On 05/03/15 09:21, Amit Kapila wrote: On Tue, Feb 17, 2015 at 3:29 AM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com mailto:p...@2ndquadrant.com mailto:p...@2ndquadrant.com wrote: I didn't add the whole-page visibility caching, as the tuple ids we get from sampling methods don't map well to the visibility info we get from heapgetpage (it maps to the values in the rs_vistuples array, not to its indexes). Commented about it in code also. I think we should set pagemode for system sampling as it can have a dual benefit: one is that it will allow us caching tuples, and the other is that it can allow pruning of the page, which is done in heapgetpage(). Do you see any downside to it? Double checking for tuple visibility is the only downside I can think of. That will happen if we use heapgetpage the way the code is currently written in the patch; however, we can easily avoid double checking if we don't call heapgetpage and rather do the required work at the caller's place. What's the point of pagemode then, if the caller code does the visibility checks still one by one on each call? I thought one of the points of pagemode was to do this in one step (and one buffer lock). And if the caller tries to do it in one step and cache the visibility info, then we'll end up with pretty much the same structure as rs_vistuples - there isn't a saner way to cache this info other than an ordered vector of tuple offsets, unless we assume that most pages have close to MaxOffsetNumber of tuples, which they don't - so why not just use heapgetpage directly and do the binary search over rs_vistuples. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 04/03/15 23:51, Andreas Karlsson wrote:
> On 01/29/2015 12:28 AM, Peter Geoghegan wrote:
>> * I'm not sure about the idea of polymorphic catalog functions (that
>> return the type internal, but the actual struct returned varying based on
>> build settings). I tend to think that things would be better if there was
>> always a uniform return type for such internal type returning functions,
>> but that *its* structure varied according to the availability of int128
>> (i.e. HAVE_INT128) at compile time. What we should probably do is have a
>> third aggregate struct, that encapsulates this idea of (say) int2_accum()
>> piggybacking on one of either Int128AggState or NumericAggState directly.
>> Maybe that would be called PolyNumAggState. Then this common code is all
>> that is needed on both types of build (at the top of int4_accum(), for
>> example):
>>
>> PolyNumAggState *state;
>> state = PG_ARGISNULL(0) ? NULL : (PolyNumAggState *) PG_GETARG_POINTER(0);
>>
>> I'm not sure if I'd do this with a containing struct or a simple
>> "#ifdef HAVE_INT128 typedef #else ...", but I think it's an improvement
>> either way.
>
> Not entirely sure exactly what I mean but I have changed to something like
> this, relying on typedef, in the attached version of the patch. I think it
> looks better after the changes.

I think this made the patch much better indeed. What I am wondering is
whether those numeric_int16_* functions that also deal with either
Int128AggState or NumericAggState should be renamed in a similar fashion.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] TABLESAMPLE patch
On 05/03/15 09:21, Amit Kapila wrote:
> On Tue, Feb 17, 2015 at 3:29 AM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> I didn't add the whole page visibility caching as the tuple ids we get
>> from sampling methods don't map well to the visibility info we get from
>> heapgetpage (it maps to the values in the rs_vistuples array, not to its
>> indexes). Commented about it in code also.
>
> I think we should set pagemode for system sampling as it can have a dual
> benefit: one is that it will allow us to cache tuples, and the other is
> that it can allow pruning of the page, which is done in heapgetpage().
> Do you see any downside to it?

Double checking for tuple visibility is the only downside I can think of.
Plus some added code complexity, of course. I guess we could use binary
search on rs_vistuples (it's already sorted) so that the info won't be
completely useless. Not sure if it's worth it compared to a normal
visibility check, but it's not hard to do. I personally don't see the page
pruning in TABLESAMPLE as all that important, though, considering that in
practice it will only scan a tiny portion of a relation, and usually one
that does not get many updates (in practice the main use-case is BI).

> Few other comments:
>
> 1. Current patch fails to apply, minor change is required:
> patching file `src/backend/utils/misc/Makefile'
> Hunk #1 FAILED at 15.
> 1 out of 1 hunk FAILED -- saving rejects to src/backend/utils/misc/Makefile.rej

Ah, bitrot over time.

> 2. Few warnings in code (compiled on windows (msvc))
> 1>src\backend\utils\tablesample\bernoulli.c(217): warning C4305: '=' : truncation from 'double' to 'float4'
> 1>src\backend\utils\tablesample\system.c(177): warning C4305: '=' : truncation from 'double' to 'float4'

I think this is just compiler stupidity, but hopefully fixed (I don't have
msvc to check it, and the other compilers I tried don't complain).

> 3.
> +static void
> +InitScanRelation(SampleScanState *node, EState *estate, int eflags,
> +                 TableSampleClause *tablesample)
> {
> ..
> +    /*
> +     * Page at a time mode is useless for us as we need to check visibility
> +     * of tuples individually because tuple offsets returned by sampling
> +     * methods map to rs_vistuples values and not to its indexes.
> +     */
> +    node->ss.ss_currentScanDesc->rs_pageatatime = false;
> ..
> }
>
> Modifying the scan descriptor in nodeSamplescan.c looks slightly odd; do
> we modify it this way anywhere else? I think it is better to either teach
> the heap_beginscan_* APIs about it or expose it via a new API in heapam.c.

Yeah, I agree; I taught heap_beginscan_strat about it, as that one is the
advanced API.

> 4.
> +Datum
> +tsm_system_cost(PG_FUNCTION_ARGS)
> {
> ..
> +    *tuples = path->rows * samplesize;
> }
>
> It seems the above calculation considers all rows in the table to be of
> equal size, and hence multiplying directly with samplesize will give the
> estimated number of rows for the sample method; however, if row size
> varies then this calculation might not give the right results. I think if
> possible we should consider the possibility of rows with varying sizes,
> or else we can at least write a comment to indicate the possibility of
> improvement in the future.

I am not sure how we would know what size the tuples in the random blocks
we are going to read later will have. That said, I am sure the costing can
be improved, even if I don't know how myself.

> 6.
> @@ -13577,6 +13608,7 @@ reserved_keyword:
>  | PLACING
>  | PRIMARY
>  | REFERENCES
> +| REPEATABLE
>
> Have you tried to investigate why it gives a shift/reduce error for the
> unreserved category? If we are not able to resolve that, then at least we
> can try to put it in some less restrictive category. I tried (on windows)
> putting it in type_func_name_keyword and it seems to be working. To
> investigate, you can try the information at the link below:
> http://www.gnu.org/software/bison/manual/html_node/Understanding.html
> Basically I think we should try some more before concluding that we have
> to change the category of REPEATABLE, and especially before making it a
> RESERVED keyword.

Yes, it can be moved to type_func_name_keyword, which is not all that much
better, but at least something. I did try to play with this already during
the first submission but could not find a way to move it to something less
restrictive. I could not even pinpoint what exactly the shift/reduce issue
is, except that it has something to do with the fact that the REPEATABLE
clause is optional (at least I didn't have the problem when it wasn't
optional).

> On a related note, it seems you have agreed upthread with Kyotaro-san on
> the below change, but I don't see the same in the latest patch:
>
> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 4578b5e..8cf09d5 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -10590,7 +10590,7 @@ tablesample_clause:
>  ;
>  opt_repeatable_clause:
> - REPEATABLE '(' AexprConst ')' { $$ = (Node *) $3; }
> + REPEATABLE '(' a_expr ')' { $$ = (Node *) $3
Re: [HACKERS] proposal: searching in array function - array_position
On 22/02/15 12:19, Pavel Stehule wrote:
> 2015-02-22 3:00 GMT+01:00 Petr Jelinek <p...@2ndquadrant.com>:
>> On 28/01/15 08:15, Pavel Stehule wrote:
>>> 2015-01-28 0:01 GMT+01:00 Jim Nasby <jim.na...@bluetreble.com>:
>>>> On 1/27/15 4:36 AM, Pavel Stehule wrote:
>>>>> It is only partially identical - I would like to use a cache for
>>>>> array_offset, but it is not necessary for array_offsets .. depends on
>>>>> how we modify the current API to support externally cached data.
>>>>
>>>> Externally cached data?
>>>
>>> Some of these functions have their own caches to minimize access to the
>>> typcache (array_map and array_cmp, for example). In the first case I am
>>> trying to push this information from fn_extra; in the second case I
>>> don't, because I don't expect a repeated call (and I expect the type
>>> cache to be enough).
>>
>> You actually do caching via fn_extra in both cases and I think that's the
>> correct way, and yes, that part can be moved into a common function. I
>> also see that the documentation does not say what is returned by
>> array_offset if nothing is found (it's documented in code but not in
>> sgml).
>
> rebased + fixed docs

You still duplicate the type cache code, but many other array functions do
that too, so I will not hold it against you. (Maybe somebody should write a
separate patch that puts all that duplicate code into a common function?)
I think this patch is ready for committer, so I am going to mark it as such.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] mogrify and indent features for jsonb
On 23/02/15 18:15, Dmitry Dolgov wrote:
> Hi Petr, thanks for the review.
>
>> I think it would be better if the indent printing didn't put the start of
>> array ([) and start of dictionary ({) on a separate line
>
> Did you mean this?
>
> [
>     {
>         "a": 1,
>         "b": 2
>     }
> ]
>
> I tried to verify this in several ways (http://jsonprettyprint.com/,
> JSON.stringify, json.tool from python), and I always get this result
> (new line after ([) and ({)).

No, I mean new lines before the ([) and ({) - try pretty printing something
like '{"a": ["b", "c"], "d": {"e": "f"}}' using that tool you pasted and
compare it with what your patch outputs to see what I mean.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
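For reference, the convention the review is pointing at (as produced by JSON.stringify and Python's json module with a four-space indent, assuming key order is preserved) keeps each opening `[` or `{` on the same line as the key it belongs to; only the elements and the closing bracket get their own lines:

```json
{
    "a": [
        "b",
        "c"
    ],
    "d": {
        "e": "f"
    }
}
```

An indent printer that instead pushes the opening brackets onto their own lines would diverge from that layout, which is the inconsistency being discussed.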
Re: [HACKERS] Replication identifiers, take 4
On 22/02/15 09:57, Andres Freund wrote:
> On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:
>> On 16/02/15 10:46, Andres Freund wrote:
>>> On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
>>>> At a quick glance, this basic design seems workable. I would suggest
>>>> expanding the replication IDs to regular 4 byte oids. Two extra bytes
>>>> is a small price to pay, to make it work more like everything else in
>>>> the system.
>>>
>>> I don't know. Growing from 3 to 5 bytes of overhead per relevant record
>>> (or even 0 to 5 in case the padding is reused) is rather noticeable. If
>>> we later find it to be a limit (I seriously doubt that), we can still
>>> increase it in a major release without anybody really noticing.
>>
>> I agree that limiting the overhead is important. But I have a related
>> thought about this - I wonder if it's worth trying to map this more
>> directly to the CommitTsNodeId.
>
> Maybe. I'd rather go the other way round though;
> replication_identifier.c/h's stuff seems much more generic than
> CommitTsNodeId.

Probably not more generic, but definitely nicer, as the nodes are named,
and yes, it has a richer API.

>> I mean it is currently using that for the actual storage, but I am
>> thinking of having the Replication Identifiers be the public API and the
>> CommitTsNodeId stuff be just a hidden implementation detail. This thought
>> is based on the fact that CommitTsNodeId provides a somewhat meaningless
>> number, while Replication Identifiers give us a nice name for that
>> number. And if we do that, then the type should perhaps be the same for
>> both?
>
> I'm not sure. Given that this is included in a significant number of
> records I'd really rather not increase the overhead as described above.
> Maybe we can just limit CommitTsNodeId to the same size for now?

That would make sense.

I also wonder about the default nodeid infrastructure the committs
provides. I can't think of a use-case it can serve that isn't solved by
replication identifiers - in the end, the main reason I added that infra
was to make it possible to write something like replication identifiers as
part of an extension. In any case, I don't think the default nodeid can be
used in parallel with replication identifiers, since one will overwrite the
SLRU record for the other. Maybe it's enough if this is documented, but I
think it might be better if this patch removed the default committs nodeid
infrastructure. It's just a few lines of code which nobody should be using
yet.

Thinking about this some more and rereading the code, I see that you are
setting TransactionTreeSetCommitTsData during xlog replay, but not during
transaction commit; that does not seem correct, as the local records will
not have any nodeid/origin.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Redesigning checkpoint_segments
On 23/02/15 03:24, Robert Haas wrote:
> On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> I am wondering a bit about the interaction with wal_keep_segments. One
>> thing is that wal_keep_segments is still specified as a number of
>> segments and not in size units; maybe it would be worth changing that
>> too? And the other thing is that, if set, wal_keep_segments is the real
>> max_wal_size from the user's perspective (not from the perspective of the
>> algorithm in this patch, but the user does not really care about that),
>> which is somewhat weird given the naming.
>
> It seems like wal_keep_segments is more closely related to
> wal_*min*_size. The idea of both settings is that each is a minimum
> amount of WAL we want to keep around for some purpose. But they're not
> quite the same, I guess, because wal_min_size just forces us to keep that
> many files around - they can be overwritten whenever. wal_keep_segments
> is an amount of actual WAL data we want to keep around.

Err, yes, of course, min not max :)

> Would it make sense to require that wal_keep_segments <= wal_min_size?

It would to me; the patch as it stands is confusing in the sense that you
can set min and max, but then wal_keep_segments somewhat overrides those.
And BTW this brings up another point: I actually don't see a check for
min_wal_size <= max_wal_size anywhere in the patch.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 21/02/15 22:09, Andrew Dunstan wrote:
> On 02/16/2015 09:05 PM, Petr Jelinek wrote:
>> I found one more issue with the 1.2--1.3 upgrade script: the
>> DROP FUNCTION pg_stat_statements(); should be
>> DROP FUNCTION pg_stat_statements(bool); since in 1.2 the function
>> identity changed.
>
> I think all the outstanding issues are fixed in this patch.

I think PGSS_FILE_HEADER should also be updated; otherwise it's good.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Redesigning checkpoint_segments
On 13/02/15 18:43, Heikki Linnakangas wrote:
> Ok, I don't hear any loud objections to min_wal_size and max_wal_size, so
> let's go with that then. Attached is a new version of this. It now comes
> in four patches. The first three are just GUC-related preliminary work,
> the first of which I posted on a separate thread today.

The 0001 patch is very nice; I would go ahead and commit it. Not really
sure I see the need for 0002, but it should not harm anything, so why not.
The 0003 should be part of 0004 IMHO, as it does not really do anything by
itself.

I am wondering a bit about the interaction with wal_keep_segments. One
thing is that wal_keep_segments is still specified as a number of segments
and not in size units; maybe it would be worth changing that too? And the
other thing is that, if set, wal_keep_segments is the real max_wal_size
from the user's perspective (not from the perspective of the algorithm in
this patch, but the user does not really care about that), which is
somewhat weird given the naming.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] proposal: searching in array function - array_position
On 28/01/15 08:15, Pavel Stehule wrote:
> 2015-01-28 0:01 GMT+01:00 Jim Nasby <jim.na...@bluetreble.com>:
>> On 1/27/15 4:36 AM, Pavel Stehule wrote:
>>> It is only partially identical - I would like to use a cache for
>>> array_offset, but it is not necessary for array_offsets .. depends on
>>> how we modify the current API to support externally cached data.
>>
>> Externally cached data?
>
> Some of these functions have their own caches to minimize access to the
> typcache (array_map and array_cmp, for example). In the first case I am
> trying to push this information from fn_extra; in the second case I don't,
> because I don't expect a repeated call (and I expect the type cache to be
> enough).

You actually do caching via fn_extra in both cases and I think that's the
correct way, and yes, that part can be moved into a common function. I also
see that the documentation does not say what is returned by array_offset if
nothing is found (it's documented in code but not in sgml).

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Replication identifiers, take 4
Now that the issue with padding no longer exists, since the patch works
both with and without padding, I went through the code; here are some
comments I have (in no particular order).

In CheckPointReplicationIdentifier:
+ * FIXME: Add a CRC32 to the end.
The function already does it (I guess you forgot to remove the comment).

Using max_replication_slots as the limit for replication identifier states
does not really make sense to me, as replication identifiers track remote
info while slots are local, and in the case of master-slave replication you
need replication identifiers but don't need slots.

In bootstrap.c:
 #define MARKNOTNULL(att) \
 	((att)->attlen > 0 || \
 	 (att)->atttypid == OIDVECTOROID || \
-	 (att)->atttypid == INT2VECTOROID)
+	 (att)->atttypid == INT2VECTOROID || \
+	 strcmp(NameStr((att)->attname), "riname") == 0 \
+	)
Huh? Can this be solved in a nicer way?

Since we call XLogFlush with local_lsn as a parameter, shouldn't we check
that it's actually within the valid range? Currently we'll get errors like
this if it's set to an invalid value:
ERROR: xlog flush request 123/123 is not satisfied --- flushed only to 0/168FB18

In AdvanceReplicationIdentifier:
+ /*
+  * XXX: should we restore into a hashtable and dump into shmem only after
+  * recovery finished?
+  */
Probably not, given that the function is also callable via the SQL
interface.

As I wrote in another email, I would like to integrate the RepNodeId and
CommitTsNodeId into a single thing.

There are no docs for the new SQL interfaces. The replication_identifier.c
might also deserve some intro/notes text.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Bootstrap DATA is a pita
On 21/02/15 04:22, Peter Eisentraut wrote:
>> I violently support this proposal. Maybe something roughly like:
>>
>> # pg_type.data
>> CatalogData(
>>     'pg_type',
>>     [
>>         { oid => 2249,
>>           data => {typname => 'cstring', typlen => -2, typbyval => 1, fake => '...'},
>>           oiddefine => 'CSTRINGOID' }
>>     ]
>> );
>
> One concern I have with this is that in my experience different tools and
> editors have vastly different ideas on how to format these kinds of
> nested structures. I'd try out YAML, or even a homemade fake YAML, over
> this.

+1 for the idea and +1 for the YAML(-like) syntax.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
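For comparison, the same (deliberately fake) pg_type row from the Perl-ish example above might look like this in a YAML-like format. This is just a sketch of the alternative being discussed - the field subset and layout are invented here, reusing the values from the example:

```yaml
# pg_type.data -- illustrative only; layout invented, values from the example above
- oid: 2249
  oiddefine: CSTRINGOID
  typname: cstring
  typlen: -2
  typbyval: true
```

The appeal is that the indentation-based layout leaves editors little room to disagree about formatting, which was Peter's concern with nested Perl structures.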
Re: [HACKERS] proposal: disallow operator => and use it for named parameters
On 19/01/15 17:14, Pavel Stehule wrote:
> 2015-01-19 14:27 GMT+01:00 Robert Haas <robertmh...@gmail.com>:
>> On Mon, Jan 19, 2015 at 2:59 AM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>> I think you should just remove the WARNING, not change it to an error.
>> If somebody wants to quote the operator name to be able to continue
>> using it, I think that's OK.
>
> It looks like quoting doesn't help here:
>
> + CREATE OPERATOR => (
> +    leftarg = int8,        -- right unary
> +    procedure = numeric_fac
> + );
> + ERROR: syntax error at or near "("
> + LINE 1: CREATE OPERATOR => (
> +                            ^
>
>> Well then the error check is just dead code. Either way, you don't need
>> it.
>
> yes, I removed it

I am marking this as Ready for Committer; the patch is trivial and works as
expected, and there is nothing to be added to it IMHO. The => operator has
been deprecated for several years, so it should not be too controversial
either.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Replication identifiers, take 4
On 16/02/15 10:46, Andres Freund wrote:
> On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
>> At a quick glance, this basic design seems workable. I would suggest
>> expanding the replication IDs to regular 4 byte oids. Two extra bytes is
>> a small price to pay, to make it work more like everything else in the
>> system.
>
> I don't know. Growing from 3 to 5 bytes of overhead per relevant record
> (or even 0 to 5 in case the padding is reused) is rather noticeable. If we
> later find it to be a limit (I seriously doubt that), we can still
> increase it in a major release without anybody really noticing.

I agree that limiting the overhead is important. But I have a related
thought about this - I wonder if it's worth trying to map this more
directly to the CommitTsNodeId. I mean it is currently using that for the
actual storage, but I am thinking of having the Replication Identifiers be
the public API and the CommitTsNodeId stuff be just a hidden implementation
detail. This thought is based on the fact that CommitTsNodeId provides a
somewhat meaningless number, while Replication Identifiers give us a nice
name for that number. And if we do that, then the type should perhaps be
the same for both?

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Sequence Access Method WIP
On 17/02/15 23:11, Robert Haas wrote:
> On Sun, Feb 15, 2015 at 1:40 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> sending new version that is updated along the lines of what we discussed
>> at FOSDEM, which means:
>> - back to single bytea amdata column (no custom columns)

Well, the main argument is still the future possibility of moving into a
single table for storage. And when we discussed it in person, we agreed
that there is not too big an advantage in having separate columns, since
there need to be dump/restore state interfaces anyway, so you can see the
relevant state via those, as we made the output more human readable (and
the psql support reflects that).

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] mogrify and indent features for jsonb
Hi, I looked at the patch and have several comments.

First let me say that modifying the individual paths of the json is the
feature I miss the most in the current implementation, so I am happy that
this patch was submitted.

I would slightly prefer the patch split into two parts, one for the indent
printing and one with the manipulation functions, but it's not too big a
patch, so it's not too bad as it is.

There is one compiler warning that I see:
jsonfuncs.c:3533:1: warning: no previous prototype for ‘jsonb_delete_path’ [-Wmissing-prototypes]
 jsonb_delete_path(PG_FUNCTION_ARGS)

I think it would be better if the indent printing didn't put the start of
array ([) and start of dictionary ({) on a separate line, since most pretty
printing implementations outside of Postgres (like JSON.stringify or
python's json module) don't do that. This is purely about consistency with
the rest of the world.

The json_ident might be better named json_pretty, perhaps?

I don't really understand the point of the h_atoi() function. What's wrong
with using strtol like pg_atoi does? Also, there is no integer overflow
check anywhere in that function.

There is a ton of end-of-line whitespace mess in the jsonb_indent docs.

Otherwise everything I tried so far works as expected. The code looks OK as
well, except maybe replacePath could use a couple of comments (for example
on the line which uses h_atoi) to make it easier to follow.

About the:
+ /* XXX : why do we need this assertion? The functions is declared to take text[] */
+ Assert(ARR_ELEMTYPE(path) == TEXTOID);
I wonder about this also; some functions do that, some don't, and I am not
really sure what the rule there is myself.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Sequence Access Method WIP
On 18/02/15 02:59, Petr Jelinek wrote:
> On 17/02/15 23:11, Robert Haas wrote:
>> On Sun, Feb 15, 2015 at 1:40 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>>> sending new version that is updated along the lines of what we
>>> discussed at FOSDEM, which means:
>>> - back to single bytea amdata column (no custom columns)
>
> Well, the main argument is still the future possibility of moving into a
> single table for storage. And when we discussed it in person, we agreed
> that there is not too big an advantage in having separate columns, since
> there need to be dump/restore state interfaces anyway, so you can see the
> relevant state via those, as we made the output more human readable (and
> the psql support reflects that).

Also makes the patch a little simpler, obviously.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 17/02/15 16:12, Andres Freund wrote:
> On 2015-02-17 15:50:39 +0100, Petr Jelinek wrote:
>> On 17/02/15 03:07, Petr Jelinek wrote:
>>> On 17/02/15 03:03, Andrew Dunstan wrote:
>>>> On 02/16/2015 08:57 PM, Andrew Dunstan wrote:
>>>>>> Average of 3 runs of read-only pgbench on my system, all with
>>>>>> pg_stat_statements activated:
>>>>>> HEAD: 20631
>>>>>> SQRT: 20533
>>>>>> SQRTD: 20592
>>>>> So using sqrtd the cost is 0.18%. I think that's acceptable.
>>>> Actually, sqrt/sqrtd is not called in accumulating the stats, only in
>>>> the reporting function. So it looks like the difference here should be
>>>> noise. Maybe we need some longer runs.
>>> Yes there are variations between individual runs, so it might really be
>>> just that; I can leave it running for a much longer time tomorrow.
>> Ok so I let it run for more than an hour on a different system; the
>> difference is negligible - 14461 vs 14448 TPS. I think there is a bigger
>> difference between individual runs than between the two versions...
>
> These numbers sound like you measured them without concurrency, am I
> right? If so, the benchmark isn't that interesting - the computation
> happens while a spinlock is held, and that'll mainly matter if there are
> many backends running at the same time.

It's pgbench -j16 -c32

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 17/02/15 03:07, Petr Jelinek wrote:
> On 17/02/15 03:03, Andrew Dunstan wrote:
>> On 02/16/2015 08:57 PM, Andrew Dunstan wrote:
>>> On 02/16/2015 08:48 PM, Petr Jelinek wrote:
>>>> On 17/02/15 01:57, Peter Geoghegan wrote:
>>>>> On Mon, Feb 16, 2015 at 4:44 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>>>>>> We definitely want this feature, I wished to have this info many
>>>>>> times.
>>>>> I would still like to see a benchmark.
>>>> Average of 3 runs of read-only pgbench on my system, all with
>>>> pg_stat_statements activated:
>>>> HEAD: 20631
>>>> SQRT: 20533
>>>> SQRTD: 20592
>>> So using sqrtd the cost is 0.18%. I think that's acceptable.
>> Actually, sqrt/sqrtd is not called in accumulating the stats, only in the
>> reporting function. So it looks like the difference here should be noise.
>> Maybe we need some longer runs.
> Yes there are variations between individual runs, so it might really be
> just that; I can leave it running for a much longer time tomorrow.

Ok so I let it run for more than an hour on a different system; the
difference is negligible - 14461 vs 14448 TPS. I think there is a bigger
difference between individual runs than between the two versions...

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 21/01/15 17:32, Andrew Dunstan wrote:
> On 01/21/2015 11:21 AM, Arne Scheffer wrote:
>> Why is it a bad thing to call the column stddev_samp, analogous to the
>> aggregate function, or to make a note in the documentation that the
>> sample stddev is used to compute the solution?
>
> ... But I will add a note to the documentation, that seems reasonable.

I agree this is worth mentioning in the docs. In any case, here is some
review from me:

We definitely want this feature, I wished to have this info many times. The
patch has a couple of issues though:

The 1.2 to 1.3 upgrade file is named pg_stat_statements--1.2-1.3.sql and
should be renamed to pg_stat_statements--1.2--1.3.sql (two dashes).

There is a new sqrtd function for fast sqrt calculation, which I think is a
good idea, as it will reduce the overhead of the additional calculation,
and this is not something where a little loss of precision would matter too
much. The problem is that the new function is actually not used anywhere in
the code; I only see use of plain sqrt.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 17/02/15 01:57, Peter Geoghegan wrote:
> On Mon, Feb 16, 2015 at 4:44 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> We definitely want this feature, I wished to have this info many times.
>
> I would still like to see a benchmark.

Average of 3 runs of read-only pgbench on my system, all with
pg_stat_statements activated:
HEAD: 20631
SQRT: 20533
SQRTD: 20592

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 17/02/15 03:03, Andrew Dunstan wrote:
> On 02/16/2015 08:57 PM, Andrew Dunstan wrote:
>> On 02/16/2015 08:48 PM, Petr Jelinek wrote:
>>> On 17/02/15 01:57, Peter Geoghegan wrote:
>>>> On Mon, Feb 16, 2015 at 4:44 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>>>>> We definitely want this feature, I wished to have this info many
>>>>> times.
>>>> I would still like to see a benchmark.
>>> Average of 3 runs of read-only pgbench on my system, all with
>>> pg_stat_statements activated:
>>> HEAD: 20631
>>> SQRT: 20533
>>> SQRTD: 20592
>> So using sqrtd the cost is 0.18%. I think that's acceptable.
> Actually, sqrt/sqrtd is not called in accumulating the stats, only in the
> reporting function. So it looks like the difference here should be noise.
> Maybe we need some longer runs.

Yes there are variations between individual runs, so it might really be
just that; I can leave it running for a much longer time tomorrow.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On 17/02/15 02:57, Andrew Dunstan wrote: On 02/16/2015 08:48 PM, Petr Jelinek wrote: On 17/02/15 01:57, Peter Geoghegan wrote: On Mon, Feb 16, 2015 at 4:44 PM, Petr Jelinek p...@2ndquadrant.com wrote: We definitely want this feature, I wished to have this info many times. I would still like to see a benchmark. Average of 3 runs of read-only pgbench on my system, all with pg_stat_statements activated: HEAD: 20631 SQRT: 20533 SQRTD: 20592 So using sqrtd the cost is 0.18%. I think that's acceptable.

I think so too. I found one more issue with the 1.2--1.3 upgrade script: the DROP FUNCTION pg_stat_statements(); should be DROP FUNCTION pg_stat_statements(bool); since the function's identity changed in 1.2.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Logical Replication Helpers WIP for discussion
On 13/02/15 14:04, Petr Jelinek wrote: On 13/02/15 08:48, Michael Paquier wrote: Looking at this patch, I don't see what we actually gain much here except a decoder plugin that speaks a special protocol for a special background worker that has not been presented yet. What actually is the value of that defined as a contrib/ module in-core. Note that we have already test_decoding to basically test the logical decoding facility, used at least at the SQL level to get logical changes decoded. Based on those reasons I am planning to mark this as rejected (it has no documentation as well). So please speak up if you think the contrary, but it seems to me that this could live happily out of core.

I think you are missing the point of this; it's not meant to be committed in this form at all, and even less as a contrib module. It was meant as a basis for the in-core logical replication discussion, but sadly I didn't really have time to pursue it in this CF in the end.

That being said, and looking at the size of the February CF, I think I am fine with dropping this in the 9.5 cycle; it does not seem likely that anything useful will be done with this fast enough to get into 9.5, so there is no point in spending committer resources on it in the final CF. I will pick it up again after the CF is done.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Logical Replication Helpers WIP for discussion
On 13/02/15 08:48, Michael Paquier wrote: On Mon, Dec 22, 2014 at 10:26 PM, Robert Haas robertmh...@gmail.com mailto:robertmh...@gmail.com wrote: On Fri, Dec 19, 2014 at 8:40 AM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com wrote: What I hope to get from this is agreement on the general approach and protocol so that we can have common base which will both make it easier to create external logical replication solutions and also eventually lead to full logical replication inside core PostgreSQL. The protocol is a really important topic which deserves its own discussion. Andres has mentioned some of the ideas he has in mind - which I think are similar to what you did here - but there hasn't really been a thread devoted to discussing that topic specifically. I think that would be a good idea: lay out what you have in mind, and why, and solicit feedback. Looking at this patch, I don't see what we actually gain much here except a decoder plugin that speaks a special protocol for a special background worker that has not been presented yet. What actually is the value of that defined as a contrib/ module in-core. Note that we have already test_decoding to basically test the logical decoding facility, used at least at the SQL level to get logical changes decoded. Based on those reasons I am planning to mark this as rejected (it has no documentation as well). So please speak up if you think the contrary, but it seems to me that this could live happily out of core. I think you are missing point of this, it's not meant to be committed in this form at all and even less as contrib module. It was meant as basis for in-core logical replication discussion, but sadly I didn't really have time to pursue it in this CF in the end. 
Re: [HACKERS] Redesigning checkpoint_segments
On 03/02/15 16:50, Robert Haas wrote: On Tue, Feb 3, 2015 at 10:44 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: That's the whole point of this patch. max_checkpoint_segments = 10, or max_checkpoint_segments = 160 MB, means that the system will begin a checkpoint so that when the checkpoint completes, and it truncates away or recycles old WAL, the total size of pg_xlog is 160 MB. That's different from our current checkpoint_segments setting. With checkpoint_segments, the documented formula for calculating the disk usage is (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB. That's a lot less intuitive to set. Hmm, that's different from what I was thinking. We probably shouldn't call that max_checkpoint_segments, then. I got confused and thought you were just trying to decouple the number of segments that it takes to trigger a checkpoint from the number we keep preallocated. But I'm confused: how can we know how much new WAL will be written before the checkpoint completes?

The preallocation is based on the estimated size of the next checkpoint, which is basically a running average of the previous checkpoints with some additional adjustments for unsteady behavior (the last checkpoint has a higher weight in the formula). (We also still internally have the CheckPointSegments, which is calculated the way Heikki described above.)

In any case, I don't like the max_checkpoint_segments naming too much, and I don't even like the number of segments as a limit too much; I think the ability to set this as an actual size is a quite nice property of this patch, and as Heikki says the numbers don't map that well to the old ones in practice.

I did some code reading and I do like the patch. Basically the only negative thing I can say is that I am not a big fan of the _logSegNo variable name, but that's not new in this patch; we use it all over the place in xlog.
I would vote for a bigger default for checkpoint_wal_size (or whatever it ends up being named), though, since the current one is not much bigger in practice than what we have now, and that is way too conservative.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
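The point about the old setting being unintuitive can be made concrete with the documented worst-case formula quoted above; a small sketch (the function name is made up for illustration):

```c
#include <assert.h>

/*
 * Documented worst-case pg_xlog usage under the old setting:
 *   (2 + checkpoint_completion_target) * checkpoint_segments * 16 MB
 */
static double
old_xlog_usage_mb(int checkpoint_segments, double completion_target)
{
    return (2.0 + completion_target) * checkpoint_segments * 16.0;
}
```

So the old default of checkpoint_segments = 3 with completion_target = 0.5 means up to about 120 MB of pg_xlog, and you have to run this arithmetic backwards to size the disk; the proposed setting instead states the target total size directly.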
Re: [HACKERS] Getting rid of wal_level=archive and default to hot_standby + wal_senders
On 03/02/15 13:51, Magnus Hagander wrote: On Tue, Feb 3, 2015 at 1:43 PM, Andres Freund and...@2ndquadrant.com wrote: Hi, I think these days there's no reason for the split between the archive and hot_standby wal levels. The split was made out of volume and stability concerns. I think we can by now be confident about the wal_level = hot_standby changes (note I'm not proposing hot_standby = on). So let's remove the split. It just gives users a choice between two options that don't have a meaningful difference. +1.

+1 too

Additionally I think we should change the default for wal_level to hot_standby and max_wal_senders (maybe to 5). That way users can use pg_basebackup and set up streaming standbys without having to restart the primary. I think that'd be an important step in making setup easier. Yes, please! Those who want to optimize their WAL size can set it back to minimal, but let's make the default the one that makes life *easy* for people. The other option, which would be more complicated (I have a semi-finished patch that I never got time to clean up) would be for pg_basebackup to be able to dynamically raise the value of wal_level during its run. It would not help with the streaming standby part, but it would help with pg_basebackup. That could be useful independently, for those who prefer using wal_level=minimal and also pg_basebackup.

This is not that easy to do; let's do it one step at a time. Comments?

Additionally, more complex and further into the future, I wonder if we couldn't also get rid of wal_level = logical by automatically switching to it whenever logical slots are active. If it can be safely done online, I definitely think that would be a good goal to have. If we could do the same for hot_standby if you had physical slots, that might also be a good idea?

+many for the logical one; physical would be nice, but I think it's again in the not-so-easy category and maybe better as a next step, if at all.
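For concreteness, the proposal would amount to shipping defaults along these lines in postgresql.conf (a sketch; the exact default for max_wal_senders was still under discussion in this thread):

```
wal_level = hot_standby    # proposed default, instead of minimal
max_wal_senders = 5        # proposed default, instead of 0
```

With those defaults, pg_basebackup and streaming standby setup work without a restart of the primary, and users who want the smallest possible WAL volume can explicitly set wal_level back to minimal.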
Re: [HACKERS] TABLESAMPLE patch
On 31/01/15 14:27, Amit Kapila wrote: On Fri, Jan 23, 2015 at 5:39 AM, Petr Jelinek p...@2ndquadrant.com wrote: On 19/01/15 07:08, Amit Kapila wrote: On Sun, Jan 18, 2015 at 12:46 AM, Petr Jelinek p...@2ndquadrant.com wrote: I think that's actually good to have, because we still do costing and the partial index might help produce a better estimate of the number of returned rows for tablesample as well. I don't understand how it will help, because for a tablesample scan it doesn't consider the partial index at all as per the patch.

Oh, I think we were talking about different things then; I thought you were talking about the index checks that the planner/optimizer sometimes does to get more accurate statistics. I'll take another look then.

Well, similar, not the same, as we are not always fetching the whole page or doing visibility checks on all tuples in the page, etc. But I don't see why it can't be inside nodeSamplescan. If you look at bitmap heap scan, that one is also essentially a somewhat modified sequential scan and does everything within the node nodeBitmapHeapscan, because the algorithm is not used anywhere else, just like sample scan. I don't mind doing everything in nodeSamplescan; however, if you can split the function, it will be easier to read and understand - if you look in nodeBitmapHeapscan, that also has a function like bitgetpage(). This is just a suggestion, and if you think that it can be split, then it's okay; otherwise leave it as it is.

Yeah, I can split it into a separate function within nodeSamplescan, that sounds reasonable.

Refer to the parameter (HeapScanDesc->rs_syncscan) and syncscan.c. Basically, during sequential scans on the same table by different backends, we attempt to keep them synchronized to reduce the I/O.

Ah this, yes, it makes sense for bernoulli, not for system though. I guess it should be used for sampling methods that use the SAS_SEQUENTIAL strategy.
Have you taken care of this in your latest patch? Not yet. I think I will need to make the strategy a property of the sampling method, instead of returning it from the costing function, so that the info can be used by the scan.

Oh, and BTW, when I delete 50k of tuples (make them invisible) the results of 20 runs are between 30880 and 40063 rows. This is between 60% to 80%, lower than what is expected, but I guess we can't do much for this except for trying the reverse order for the visibility test and the sample tuple call; you can decide if you want to try that once just to see if that is better. No, that's because I can't write properly - the lower number was supposed to be 39880, which is well within the tolerance; sorry for the confusion (9 and 0 are just too close).

Anyway, attached is a new version with some updates that you mentioned (all_visible, etc). I also added an optional interface for the sampling method to access the tuple contents, as I can imagine sampling methods that will want to do that.

+/*
+ * Let the sampling method examine the actual tuple and decide if we
+ * should return it.
+ *
+ * Note that we let it examine even invisible tuples.
+ */
+if (OidIsValid(node->tsmreturntuple.fn_oid))
+{
+    found = DatumGetBool(FunctionCall4(&node->tsmreturntuple,
+                                       PointerGetDatum(node),
+                                       UInt32GetDatum(blockno),
+                                       PointerGetDatum(tuple),
+                                       BoolGetDatum(visible)));
+    /* XXX: better error */
+    if (found && !visible)
+        elog(ERROR, "Sampling method wanted to return invisible tuple");
+}

You have mentioned in a comment that we let it examine even invisible tuples, but it is not clear why you want to allow examining an invisible tuple and then later return an error; why can't it skip invisible tuples? Well, I think returning invisible tuples to the user is a bad idea, and that's why the check, but I also think it might make sense for the sampling function to examine the invisible tuple in case it wants to keep some kind of stats about the scan and use those when deciding whether to return other tuples.
The interface should probably be called tsmexaminetuple instead, to make the intention clearer.

1. How about statistics (pgstat_count_heap_getnext()) during SampleNext, as we do in heap_getnext? Right, will add.

2.
+static TupleTableSlot *
+SampleNext(SampleScanState *node)
+{
..
+    /*
+     * Lock the buffer so we can safely assess tuple
+     * visibility later.
+     */
+    LockBuffer(buffer, BUFFER_LOCK_SHARE);
..
}

When is this content lock released? Shouldn't we release it after checking the visibility of the tuple? Here,

+    if (!OffsetNumberIsValid(tupoffset))
+    {
+        UnlockReleaseBuffer(buffer);

but yes, you are correct, it should just be released there, and we can unlock already after the visibility check.

3. +static TupleTableSlot * +SampleNext(SampleScanState
Re: [HACKERS] TABLESAMPLE patch
On 28/01/15 09:41, Kyotaro HORIGUCHI wrote: As an issue related to size estimation, I got an explain result like the following,

=# explain (analyze on, buffers on) select a from t1 tablesample system(10) where a < 5;
QUERY PLAN
Sample Scan on t1 (cost=0.00..301.00 rows=1 width=4) (actual time=0.015..2.920 rows=4294 loops=1)
  Filter: (a < 5)
  Rows Removed by Filter: 5876
  Buffers: shared hit=45

The actual row count is nearly twice the estimate. tsm_system_cost estimates the number of result rows using baserel->tuples, not baserel->rows, so it doesn't reflect the scan quals. Did the code come from some other intention? No, that's a bug.

4. SampleNext() a. Isn't it better to code SampleNext() similar to SeqNext() and IndexNext(), which encapsulate the logic to get the tuple in another function (tbs_next() or something like that)? Possibly, my thinking was that unlike index_getnext() and heap_getnext(), this function would not be called from any other place, so adding one more layer of abstraction didn't seem useful. And it would mean creating a new ScanDesc struct, etc. I think adding additional abstraction would simplify the function and reduce the chance of discrepancy; I think we would otherwise almost create a duplicate version of the heapgetpage() code. Yeah, I agree that we need to build a structure like ScanDesc, but still it will be worth it to keep the code consistent.

Well, similar, not the same, as we are not always fetching the whole page or doing visibility checks on all tuples in the page, etc. But I don't see why it can't be inside nodeSamplescan. If you look at bitmap heap scan, that one is also essentially a somewhat modified sequential scan and does everything within the node nodeBitmapHeapscan, because the algorithm is not used anywhere else, just like sample scan.

As a general opinion, I'll vote for Amit's comment, since three or four similar instances seems to be enough reason to abstract it.
On the other hand, since the scan processes are dispatched in ExecProcNode by the nodeTag of the scan nodes, re-unifying control through an abstraction layer after that could be said to introduce useless annoyance. In short, abstraction at the level of *Next() would bring the necessity of additional changes around the execution node dispatch.

Yes, that's my view too. I would generally be for that change also, and it would be worth it if the code was used in more than one place, but as it is, it seems like it will just add code/complexity for no real benefit. It would make sense in case we used the sequential scan node instead of the new node, as Amit also suggested, but I remain unconvinced that mixing sampling and sequential scan into a single scan node would be a good idea.

How will it get smoothed out for cases when, let us say, more than 50% of tuples are not visible? I think it's due to Postgres's non-overwriting storage architecture that we maintain multiple versions of rows in the same heap; otherwise, for a different kind of architecture (like mysql/oracle) where multiple row versions are not maintained in the same heap, the same function (with the same percentage) can return an entirely different number of rows.

I don't know how else to explain it: if we loop over every tuple in the relation and return each with equal probability, then visibility checks don't matter, as the percentage of visible tuples will be the same in the result as in the relation. Surely it might yield effectively the same result. Even so, this code needs an explanatory comment about this behavior, or the code should be written so MVCC-minded people won't get confused, I suppose.

Yes it does, but as I am failing to explain even here, it's not clear to me what to write there. From my point of view it's just an effect of the essential property of bernoulli sampling, which is that every element has equal probability of being included in the sample.
It comes from the fact that we do a Bernoulli trial (in the code it's the while (sampler_random_fract(sampler->randstate) < probability) loop) on every individual tuple. This means that the ratio of visible to invisible tuples should be the same in the sample as it is in the relation. We then just skip the invisible tuples and get the correct percentage of the visible ones (this has the performance benefit of not having to do a visibility check on all tuples).

If that wasn't true, then the Bernoulli sampling would simply not work as intended, as the same property is the reason why it's used in statistics: the trends should look the same in the sample as they do in the source population. Obviously there is some variation in practice, as we don't have a perfect random generator, but that's independent of the algorithm.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7
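The argument above can be demonstrated with a toy simulation: run the Bernoulli trial on every tuple, visible or not, and check that the number of visible tuples picked matches the requested percentage of the visible population. This is only an illustration with a deterministic stand-in generator, not the actual sampler_random_fract() code:

```c
#include <assert.h>

/* Deterministic LCG standing in for sampler_random_fract() */
static unsigned long long rstate = 42;

static double
random_fract(void)
{
    rstate = rstate * 6364136223846793005ULL + 1442695040888963407ULL;
    /* top 53 bits mapped to [0, 1) */
    return (rstate >> 11) * (1.0 / 9007199254740992.0);
}

/*
 * Run a Bernoulli trial on every tuple, visible or not, and count how
 * many *visible* tuples end up in the sample.  Here every other tuple
 * is treated as invisible (a dead row version).
 */
static long
sample_visible(long ntuples, double probability)
{
    long picked_visible = 0;

    for (long i = 0; i < ntuples; i++)
    {
        int visible = (i % 2 == 0);   /* half the heap is dead */

        if (random_fract() < probability && visible)
            picked_visible++;
    }
    return picked_visible;
}
```

With 100000 tuples, half of them dead, and a 5% sampling probability, the sample should contain close to 2500 visible tuples - 5% of the 50000 visible ones - which is exactly the claim: the dead tuples dilute the sample and the live ones in it, equally.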
Re: [HACKERS] TABLESAMPLE patch
On 28/01/15 08:23, Kyotaro HORIGUCHI wrote: Hi, I took a look at this and found it nice. By the way, the parameter for REPEATABLE seems to be allowed to be an expression in ParseTableSample, but the grammar rejects it.

It wasn't my intention to support it, but you are correct, the code is generic enough that we can support it.

The following change seems to be enough. Seems about right, thanks.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Sequence Access Method WIP
On 28/01/15 18:09, Heikki Linnakangas wrote: On 01/23/2015 02:34 AM, Petr Jelinek wrote: On 22/01/15 17:02, Petr Jelinek wrote: The new version (the one that is not submitted yet) of gapless sequence is way uglier and probably not the best example either, but it does guarantee gaplessness (it stores the last value in its own value table). So I am not sure if it should be included either... Here it is as promised.

I generally like the division of labour between the seqam implementations and the API now. I don't like the default_sequenceam GUC. That seems likely to create confusion. And it's not really enough for something like a remote sequence AM anyway, which definitely needs some extra reloptions, like the hostname of the remote server. The default should be 'local', unless you specify something else with CREATE SEQUENCE USING - no GUCs.

Hmm, I am not too happy about this; I want SERIAL to work with custom sequences (as long as it's possible for the sequence to work that way, at least). That's actually an important feature for me. Your argument about this being a potential problem for some sequenceAMs is valid though.

Some comments on pg_dump:

* In pg_dump's dumpSequence function, check the server version number instead of checking whether the pg_sequence_dump_state() function exists. That's what we usually do when new features are introduced. And there's actually a bug there: you have the check backwards. (try dumping a database with any sequences in it; it fails) Right.

* pg_dump should not output a USING clause when a sequence uses the 'local' sequence AM. No point in adding such noise - making the SQL command not standard-compatible - for the 99% of databases that don't use other sequence AMs. Well, this only works if we remove the GUC. Because otherwise, if the GUC is there, then you always need to either add a USING clause or set the GUC in advance (like we do with search_path, for example).

* Be careful to escape all strings correctly in pg_dump.
I think you're missing escaping for the 'state' variable at least. Ouch.

In sequence_save_tuple:

else
{
    /*
     * New tuple was not sent, so the original tuple was probably just
     * changed inline, all we need to do is mark the buffer dirty and
     * optionally log the update tuple.
     */
    START_CRIT_SECTION();

    MarkBufferDirty(seqh->buf);

    if (do_wal)
        log_sequence_tuple(seqh->rel, seqh->tup, seqh->buf, page);

    END_CRIT_SECTION();
}

This is wrong when checksums are enabled and !do_wal. I believe this should use MarkBufferDirtyHint(). Oh ok, didn't realize.

Notable changes: - The gapless sequence rewritten to use the transactional storage as that's the only way to guarantee gaplessness between dump and restore. Can you elaborate?

Using the auxiliary seqam_gapless_values is a bit problematic. First of all, the event trigger on DROP SEQUENCE doesn't fire if you drop the whole schema containing the sequence with DROP SCHEMA CASCADE. Secondly, Yeah, but at worst there are some unused rows there; it's not too bad. I could also create a relation per sequence so that the dependency code would handle everything correctly, but it seems a bit too expensive to create not one but two relations per sequence... updating a row in a table for every nextval() call is pretty darn expensive. Yes it's expensive, but the gapless sequence also serializes access, so it will never be a speed demon.

But I don't actually see a problem with dump and restore. The problem is that the tuple stored in the sequence relation is always the one with the latest state (and with frozen xid), so pg_dump has no way of getting the value as it was at the time it took the snapshot. This means that after a dump/restore cycle, the sequence can be further ahead than the table it's attached to, and you end up with a gap there.

Can you rewrite it without using the values table? AFAICS, there are two problems to solve:

1. If the top-level transaction aborts, you need to restore the old value.
I suggest keeping two values in the sequence tuple: the old, definitely committed one, and the last value. The last value is only considered current if the associated XID committed; otherwise the old value is current. When a transaction is about to commit, it stores its top-level XID and the new value in the last field, and copies the previously current value to the old field. Yes, this is how the previous implementation worked. 2. You need to track the last value on a per-subtransaction basis, until the transaction commits/rolls back, in order to have rollback to savepoint to retreat the sequence's value. That can be done in backend-private memory, maintaining a stack of subtransactions and the last value of each. Use RegisterSubXactCallback to hook into subxact commit/abort. Just before top-level commit (in pre-commit callback), update the sequence tuple with the latest value, so that it gets WAL-logged
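Heikki's two-value scheme above can be sketched in a few lines; all names here are made up for illustration, and the real version would consult clog via TransactionIdDidCommit() rather than take a flag:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Sketch of the proposed sequence-tuple fields (illustrative names). */
typedef struct GaplessSeq
{
    int64_t         old_value;  /* value as of the last committed writer */
    int64_t         last_value; /* value written by last_xid */
    TransactionId   last_xid;   /* top-level xid that wrote last_value */
} GaplessSeq;

/*
 * The currently-live value: last_value counts only if the transaction
 * that wrote it committed; otherwise we fall back to old_value.
 */
static int64_t
seq_current_value(const GaplessSeq *seq, bool last_xid_committed)
{
    return last_xid_committed ? seq->last_value : seq->old_value;
}
```

A committing transaction would copy the current value into old_value and record its own xid with the new last_value, so an abort automatically "restores" the old value without any extra table.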
Re: [HACKERS] Re: Abbreviated keys for Numeric
On 27/01/15 00:51, Andres Freund wrote: On 2015-01-26 15:35:44 -0800, Peter Geoghegan wrote: On Mon, Jan 26, 2015 at 3:12 PM, Andrew Gierth and...@tao11.riddles.org.uk wrote: Obvious overheads in float8 comparison include having to check for NaN, and the fact that DatumGetFloat8 on 64bit doesn't get inlined and forces a store/load to memory rather than just using a register. Looking at those might be more beneficial than messing with abbreviations. Aren't there issues with the alignment of double precision floating point numbers on x86, too? Maybe my information there is at least partially obsolete. But it seems we'd have to control for this to be sure. I think getting rid of the function call for DatumGetFloat8() would be quite the win. On x86-64 the conversion then should amount to mov %rd?,-0x8(%rsp);movsd -0x8(%rsp),%xmm0 - that's pretty cheap. Both instructions have a cycle count of 1 + L1 access latency (4) + 2 because they use the same execution port. So it's about 12 fully pipelineable cycles; 2 if the pipeline can be kept busy otherwise. I doubt that'd be noticeable if the conversion were inlined.

IIRC DatumGetFloat8 was quite visible in perf when I was writing the array version of width_bucket. It was one of the motivations for making a special float8 version, since not having to call it had a significant effect. Sadly I don't remember if it was the function call itself or the conversion anymore.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
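On a 64-bit float8-by-value build, the conversion being discussed is just a bit-for-bit reinterpretation, which is why an inlined version should reduce to the mov/movsd pair Andres quotes. A hedged sketch of the idea (this is an illustration, not PostgreSQL's actual macro; memcpy is the standard way to express the reinterpretation without aliasing violations):

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t Datum;

/*
 * Reinterpret the 64-bit Datum as a double.  Compilers turn the memcpy
 * into a register move, so when inlined there is no real store/load,
 * let alone a function call.
 */
static inline double
datum_get_float8(Datum d)
{
    double  f;

    memcpy(&f, &d, sizeof(f));
    return f;
}

static inline Datum
float8_get_datum(double f)
{
    Datum   d;

    memcpy(&d, &f, sizeof(d));
    return d;
}
```

The round trip is exact because no numeric conversion ever happens; only the interpretation of the 64 bits changes.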
Re: [HACKERS] Sequence Access Method WIP
On 22/01/15 16:50, Heikki Linnakangas wrote: On 01/12/2015 11:33 PM, Petr Jelinek wrote: Second patch adds DDL support. I originally wanted to make it CREATE/DROP SEQUENCE ACCESS METHOD... but that would mean making ACCESS a reserved keyword, so I went for CREATE ACCESS METHOD FOR SEQUENCES, which does not need to change anything (besides adding METHOD to unreserved keywords). The DDL support uses the DefineStmt infra with some very small changes, as the sequence AMs are not schema-qualified, but I think it's acceptable and saves a considerable amount of boilerplate.

Do we need DDL commands for this at all? I could go either way on that question. We recently had a discussion on that wrt. index access methods [1], and Tom opined that providing DDL for creating index access methods is not worth it. The extension can just insert the rows into pg_seqam with INSERT. Do we expect sequence access methods as extensions to be more popular than index access methods? Maybe, because the WAL-logging problem doesn't exist. But OTOH, if you're writing something like a replication system that needs global sequences as part of it, there aren't that many of those, and the installation scripts will need to deal with more complicated stuff than inserting a row in pg_seqam.

I don't know about popularity, and I've seen the discussion about indexes. The motivation for DDL for me was handling dependencies correctly, that's all. If we say we don't care about that (and allow DROP EXTENSION even though the user has sequences that are using the AM), then we don't need DDL.

[1] http://www.postgresql.org/message-id/26822.1414516...@sss.pgh.pa.us

And the third patch is the gapless sequence implementation updated to work with the new DDL support, with some tests added for checking if dependencies work correctly. It also acts as an example of how to make custom AMs. I'll take a look at that to see how the API works, but we're not going to include it in the source tree, because it doesn't actually guarantee gaplessness.
That makes it a pretty dangerous example. I'm sure we can come up with a better example that might even be useful. How about a Lamport's clock sequence, which advances once per second, in addition to when anyone calls nextval()? Or a remote sequence that uses an FDW to call nextval() in a foreign server?

I have an updated patch ready; I just didn't submit it because I am otherwise busy this week. I hope to get to it today in the evening or tomorrow morning, so I'd wait for that before looking at the patch. The new version (the one that is not submitted yet) of the gapless sequence is way uglier and probably not the best example either, but it does guarantee gaplessness (it stores the last value in its own value table). So I am not sure if it should be included either...

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 23/01/15 00:40, Andreas Karlsson wrote: - Renamed some things from int12 to int128, there are still some places with int16 which I am not sure what to do with.

I'd vote for renaming them to int128 too; there are enough C functions that use int16 for a 16-bit integer that this is going to be confusing otherwise.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
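For context, the point of the 128-bit type here is that it can accumulate int64 values in sum()/avg() without any overflow handling; a minimal sketch assuming a compiler that provides __int128 (gcc/clang on 64-bit targets - the naming below is exactly the kind under discussion):

```c
#include <stdint.h>

/* Assumes compiler support for __int128 (gcc/clang on 64-bit targets). */
typedef __int128 int128;

/*
 * Accumulate an int64 array into a 128-bit total.  Even summing
 * INT64_MAX repeatedly cannot overflow until ~2^64 elements.
 */
static int128
int128_sum(const int64_t *vals, int n)
{
    int128 total = 0;

    for (int i = 0; i < n; i++)
        total += vals[i];
    return total;
}
```

This is why confusing int16 (the 16-bit C type) with a hypothetical "int16 = 16 bytes" naming convention would be unfortunate.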
Re: [HACKERS] tracking commit timestamps
On 05/01/15 17:50, Petr Jelinek wrote: On 05/01/15 16:17, Petr Jelinek wrote: On 05/01/15 07:28, Fujii Masao wrote: On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Pushed with some extra cosmetic tweaks. I got the following assertion failure when I executed pg_xact_commit_timestamp() in the standby server. =# select pg_xact_commit_timestamp('1000'::xid); TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) == ((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c, Line: 315) server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: 2014-12-04 12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by signal 6: Aborted 2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running: select pg_xact_commit_timestamp('1000'::xid); The way to reproduce this problem is #1. set up and start the master and standby servers with track_commit_timestamp disabled #2. enable track_commit_timestamp in the master and restart the master #3. run some write transactions #4. enable track_commit_timestamp in the standby and restart the standby #5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby BTW, at the step #4, I got the following log messages. This might be a hint for this problem. LOG: file pg_commit_ts/ doesn't exist, reading as zeroes CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09; inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache 45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16384 This problem still happens in the master. 
Regards, Attached patch fixes it; I am not sure how happy I am with the way I did it, though. Added more comments and made the error message a bit clearer.

Fujii, Alvaro, did one of you have a chance to look at this fix?

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] TABLESAMPLE patch
On 17/01/15 13:46, Amit Kapila wrote: On Sun, Jan 11, 2015 at 1:29 AM, Petr Jelinek p...@2ndquadrant.com mailto:p...@2ndquadrant.com wrote: In the second patch, which implements the TABLESAMPLE itself, I changed the implementation of the random generator because when I looked at the code again I realized the old one would produce wrong results if there were multiple TABLESAMPLE statements in the same query or multiple cursors in the same transaction. I have looked into this patch and would like to share my findings with you. Thanks a lot for this. 1. +static void +set_tablesample_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte) { .. +/* +* There is only one plan to consider but we still need to set +* parameters for RelOptInfo. +*/ +set_cheapest(rel); } It seems we already call set_cheapest(rel) in set_rel_pathlist() which is a caller of set_tablesample_rel_pathlist(), so why do we need it inside set_tablesample_rel_pathlist()? Ah, this changed after I started working on this patch and I didn't notice - previously all the set_something_rel_pathlist functions called set_cheapest() individually. I will change the code. 2. +static void +set_tablesample_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte) { .. +/* We only do sample scan if it was requested */ +add_path(rel, (Path *) create_samplescan_path(root, rel, required_outer)); } Do we need to add_path, if there is only one path? Good point, we can probably just set the pathlist directly in this case. 3. @@ -332,6 +334,11 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, /* Foreign table */ set_foreign_pathlist(root, rel, rte); } +else if (rte->tablesample != NULL) +{ +/* Build sample scan on relation */ +set_tablesample_rel_pathlist(root, rel, rte); +} Have you considered having a different RTE for this?
I assume you mean a different RTEKind; yes, I did, but the fact is that it's still a relation, just with a different scan type, and I didn't want to modify every piece of code which deals with RTE_RELATION to also deal with the new RTE for sample scans, as it seems unnecessary. I am not strongly opinionated about this one though. 4. SampleNext() a. Isn't it better to code SampleNext() similar to SeqNext() and IndexNext(), which encapsulate the logic to get the tuple in another function (tbs_next() or something like that)? Possibly; my thinking was that unlike index_getnext() and heap_getnext(), this function would not be called from any other place, so adding one more layer of abstraction didn't seem useful. And it would mean creating a new ScanDesc struct, etc. b. Also, don't we want to handle pruning of the page while scanning (heap_page_prune_opt())? And I observed that in the heap scan APIs, after the visibility check we do check for serializable conflicts (CheckForSerializableConflictOut()). Both good points, will add. Another thing is, don't we want to do anything about sync scans for these methods, as they are doing a heap scan? I don't follow this one tbh. c. For the bernoulli method, it will first get the tupoffset with the help of the function and then apply the visibility check; it seems that could reduce the sample set way lower than expected in case a lot of tuples are not visible. Shouldn't it be done in the reverse way (first the visibility check, then call the function to see if we want to include it in the sample)? I think something similar is done in acquire_sample_rows, which seems right to me. I don't think so; the way bernoulli works is that it returns every tuple with equal probability, so the visible tuples have the same chance of being returned as the invisible ones, and the issue should be smoothed away automatically (IMHO). acquire_sample_rows has a limit on the number of rows it returns, which is why it has to do the visibility check first - the problem you described applies there.
The reason for using a probability instead of a tuple limit is the fact that there is no way to accurately guess the number of tuples in the table at the beginning of the scan. This method should actually be better at returning the correct number of tuples without depending on how many of them are visible or how much space in the blocks is used. 5. CREATE TABLE test_tablesample (id INT, name text) WITH (fillfactor=10); INSERT INTO test_tablesample SELECT i, repeat(i::text, 200) FROM generate_series(0, 9) s(i) ORDER BY i; postgres=# SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (80); id 1 3 4 7 8 9 (6 rows) postgres=# SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (80); id 0 1 2 3 4 5 6 7 8 9 (10 rows) So the above test yields 60% of the rows the first time and 100% the next time, when the test requested 80%. I understand that the sample percentage will result in an approximate number of rows; however, I just wanted us to check whether the implementation has any problem. I think this is caused by the random generator not producing smooth enough random numbers.
Re: [HACKERS] Turning recovery.conf into GUCs
On 14/01/15 14:24, Robert Haas wrote: On Mon, Jan 5, 2015 at 8:43 PM, Peter Eisentraut pete...@gmx.net wrote: I have previously argued for a different approach: Read recovery.conf as a configuration file after postgresql.conf, but keep it as a file to start recovery. That doesn't break any old workflows, it gets you all the advantages of having recovery parameters in the GUC system, and it keeps all the options open for improving things further in the future. But this is confusing, too, because it means that if you set a parameter in both postgresql.conf and recovery.conf, you'll end up with the recovery.conf value of the parameter even after recovery is done. Until you restart, and then you won't. That's weird. I think your idea of adding read-only GUCs with the same names as all of the recovery.conf parameters is a clear win. Even if we do nothing more than that, it makes the values visible from the SQL level, and that's good. But I think we should go further and make them really be GUCs. Otherwise, if you want to be able to change one of those parameters other than at server startup time, you've got to invent a separate system for reloading them on SIGHUP. If you make them part of the GUC mechanism, you can use that. I think it's quite safe to say that the whole reason we *have* a GUC mechanism is so that all of our configuration can be done through one grand, unified mechanism rather than being fragmented across many files. This is basically what the patch which is in the commitfest does, no? Some renaming might be in order. Heikki previously suggested merging all of the recovery_target_whatever settings down into a single parameter recovery_target='kindofrecoverytarget furtherdetailsgohere', like recovery_target='xid 1234' or recovery_target='name bob'. Maybe that would be more clear (or not).
I was thinking about this a lot myself while reviewing the patch too; it seems strange to have multiple config parameters for what is essentially the same thing. My thinking was similar, except I'd use : as the separator ('kindofrecoverytarget:furtherdetailsgohere'). I think, though, that while it is technically nicer to do this, it might hurt usability for users. Maybe standby_mode needs a better name. I actually think standby_mode should be merged with hot_standby (as in standby_mode = 'hot'/'warm'/'off' or something). But I think the starting point for this discussion shouldn't be "why in the world would we merge recovery.conf into postgresql.conf?" but "why, when we've otherwise gone to such trouble to push all of our configuration through a single, unified mechanism that offers many convenient features, do we continue to suffer our recovery.conf settings to go through some other, less-capable mechanism?". +1 -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
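For concreteness, the merged parameter being discussed might look like this in postgresql.conf. This is purely hypothetical syntax, sketching both the space-separated form Heikki suggested and the colon-separated variant floated above:

```
# space-separated form suggested by Heikki:
recovery_target = 'xid 1234'
recovery_target = 'name bob'

# colon-separated variant proposed above:
recovery_target = 'xid:1234'

# possible standby_mode / hot_standby merge, also hypothetical:
standby_mode = 'hot'        # 'hot' / 'warm' / 'off'
```

Either form trades a family of recovery_target_* parameters for a single string whose first token selects the target kind; the usability question raised above is whether packing two values into one setting is worth that consolidation.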
Re: [HACKERS] Sequence Access Method WIP
On 13/01/15 13:24, Tomas Vondra wrote: On 12.1.2015 22:33, Petr Jelinek wrote: On 15/12/14 11:36, Petr Jelinek wrote: On 10/12/14 03:33, Petr Jelinek wrote: On 24/11/14 12:16, Heikki Linnakangas wrote: About the rough edges: - The AlterSequence is not the prettiest code around, as we now have to create a new relation when the sequence AM is changed and I don't know how to do that nicely - I am not sure if I did the locking, command order and dependency handling in AlterSequence correctly This version does AlterSequence differently and better. Didn't attach the gapless sequence again as that one is unchanged. And another version, separated into a patch set of 3 patches. The first patch contains the seqam patch itself, not many changes there, mainly docs/comments related. What I wrote in the previous email for version 3 still applies. I did a review of the first part today - mostly by reading through the diff. I plan to do a bit more thorough testing in a day or two. I'll also look at the two (much smaller) patches. Thanks! comments/questions/general nitpicking: (1) Why treat an empty string as equal to 'local'? Isn't enforcing the actual value a better approach? Álvaro also mentioned this on IM; I already changed it to a saner normal GUC with 'local' as the default value in my working copy (2) NITPICK: Maybe we could use access_method in the docs (instead of sequence_access_method), as the 'sequence' part is clear from the context? Yes. (3) Why index_reloptions / sequence_reloptions when both do exactly the same thing (call to common_am_reloptions)? I'd understand this if the patch(es) then change the sequence_reloptions() but that's not happening. Maybe that's expected to happen? That's leftover from the original design where the AM was supposed to call this; since this is not exposed to the AM anymore, I think it can be a single function now. (4) DOCS: Each sequence can only use one access method at a time. Does that mean a sequence can change the access method during its lifetime?
My understanding is that the AM is fixed after creating the sequence? Oh, I forgot to add ALTER SEQUENCE USING to the docs; you can change the AM, even though you probably don't want to do it often, but for migrations it's useful. (8) check_default_seqam without a transaction * If we aren't inside a transaction, we cannot do database access so * cannot verify the name. Must accept the value on faith. In which situation does this happen? Wouldn't it be better to simply enforce the transaction and fail otherwise? Reading postgresql.conf during startup, I don't think we want to fail in that scenario ;) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On 11/01/15 08:56, Kohei KaiGai wrote: 2015-01-11 10:40 GMT+09:00 Jim Nasby jim.na...@bluetreble.com: Yeah, there are actually several places in the code where relid means an index into the range table and not the OID of a relation; it still manages to confuse me. Nothing this patch can do about that. Well, since it's confused 3 of us now... should we change it (as a separate patch)? I'm willing to do that work but don't want to waste time if it'll just be rejected. Any other examples of this I should fix too? Sorry, to clarify... any other items besides Scan.scanrelid that I should fix? This naming is a little bit confusing; however, I don't think it should be changed, because this structure has been used for a long time, so reworking will prevent back-patching when we find bugs around scanrelid. We can still backpatch; it just requires more work. And how many bugs do we actually expect to find around this anyway? If folks think this just isn't worth fixing, fine, but I find the backpatching argument rather dubious. Even though there is no problem with the Scan structure itself, a bugfix may use the variable name scanrelid in its fix. If we rename it in v9.5, we would also need a little adjustment to apply such a bugfix to prior versions. That seems to me a waste of committers' time. I tend to agree, especially as there are multiple places in the code this would affect - RelOptInfo and RestrictInfo have the same issue, etc. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates
On 11/01/15 05:07, Andreas Karlsson wrote: On 01/11/2015 02:36 AM, Andres Freund wrote: @@ -3030,6 +3139,18 @@ int8_avg_accum(PG_FUNCTION_ARGS) Datum int2_accum_inv(PG_FUNCTION_ARGS) { +#ifdef HAVE_INT128 +Int16AggState *state; + +state = PG_ARGISNULL(0) ? NULL : (Int16AggState *) PG_GETARG_POINTER(0); + +/* Should not get here with no state */ +if (state == NULL) +elog(ERROR, "int2_accum_inv called with NULL state"); + +if (!PG_ARGISNULL(1)) +do_int16_discard(state, (int128) PG_GETARG_INT16(1)); +#else NumericAggState *state; state = PG_ARGISNULL(0) ? NULL : (NumericAggState *) PG_GETARG_POINTER(0); @@ -3049,6 +3170,7 @@ int2_accum_inv(PG_FUNCTION_ARGS) if (!do_numeric_discard(state, newval)) elog(ERROR, "do_numeric_discard failed unexpectedly"); } Hm. It might be nicer to move the if (!state) elog() outside the ifdef, and add curly parens inside the ifdef. The reason I did so was because the type of the state differs and I did not feel like having two ifdef blocks. I have no strong opinion about this though. I think Andres means you should move the NULL check before the ifdef and then use curly braces inside the ifdef/else so that you can define the state variable there. That can be done with a single ifdef. int2_accum_inv(PG_FUNCTION_ARGS) { ... null check ... { #ifdef HAVE_INT128 Int16AggState *state; ... #else NumericAggState *state; ... #endif PG_RETURN_POINTER(state); } } -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On 10/01/15 01:19, Kohei KaiGai wrote: 2015-01-10 8:18 GMT+09:00 Jim Nasby jim.na...@bluetreble.com: On 1/6/15, 5:43 PM, Kouhei Kaigai wrote: scan_relid != InvalidOid Ideally, they should be OidIsValid(scan_relid) Scan.scanrelid is an index into the range-table list, not an object ID. So InvalidOid and OidIsValid() are not a good choice. I think the name needs to change then; scan_relid certainly looks like the OID of a relation. scan_index? Yep, I had the same impression when I looked at the code the first time; however, it is defined as below. Not a matter of custom scan itself. /* * ========== * Scan nodes * ========== */ typedef struct Scan { Plan plan; Index scanrelid; /* relid is index into the range table */ } Scan; Yeah, there are actually several places in the code where relid means an index into the range table and not the OID of a relation; it still manages to confuse me. Nothing this patch can do about that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XLOG_PARAMETER_CHANGE handling of wal_log_hints
On 07/01/15 00:59, Michael Paquier wrote: On Wed, Jan 7, 2015 at 4:24 AM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, when I was fixing how commit_ts handles the XLOG_PARAMETER_CHANGE I noticed that for wal_log_hints we assign the value in ControlFile to the current value instead of the value that comes from WAL. ISTM it has just informational value so it should not have any practical impact, but it looks like a bug anyway, so here is a simple patch to change that. Right. That's something that should get fixed in 9.4 as well.. ControlFile->track_commit_timestamp = track_commit_timestamp; The new value of track_commit_timestamp should be taken as well from xlrec on master btw. That's part of the larger fix for CommitTs that I sent separately in response to the bug report from Fujii - there is much more to be done there for CommitTs than just this. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
On 07/01/15 00:05, Jim Nasby wrote: On 1/6/15, 8:17 AM, Kouhei Kaigai wrote: The attached patch is newer revision of custom-/foreign-join interface. Shouldn't instances of scan_relid 0 be scan_relid != InvalidOid Ideally, they should be OidIsValid(scan_relid) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] XLOG_PARAMETER_CHANGE handling of wal_log_hints
Hi, when I was fixing how commit_ts handles the XLOG_PARAMETER_CHANGE I noticed that for wal_log_hints we assign the value in ControlFile to the current value instead of the value that comes from WAL. ISTM it has just informational value so it should not have any practical impact, but it looks like a bug anyway, so here is a simple patch to change that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 5cc7e47..be67da4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8967,7 +8967,7 @@ xlog_redo(XLogReaderState *record) ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts; ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact; ControlFile->wal_level = xlrec.wal_level; - ControlFile->wal_log_hints = wal_log_hints; + ControlFile->wal_log_hints = xlrec.wal_log_hints; ControlFile->track_commit_timestamp = track_commit_timestamp; /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 05/01/15 16:17, Petr Jelinek wrote: On 05/01/15 07:28, Fujii Masao wrote: On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Pushed with some extra cosmetic tweaks. I got the following assertion failure when I executed pg_xact_commit_timestamp() in the standby server. =# select pg_xact_commit_timestamp('1000'::xid); TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) == ((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c, Line: 315) server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: 2014-12-04 12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by signal 6: Aborted 2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running: select pg_xact_commit_timestamp('1000'::xid); The way to reproduce this problem is #1. set up and start the master and standby servers with track_commit_timestamp disabled #2. enable track_commit_timestamp in the master and restart the master #3. run some write transactions #4. enable track_commit_timestamp in the standby and restart the standby #5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby BTW, at the step #4, I got the following log messages. This might be a hint for this problem. LOG: file pg_commit_ts/ doesn't exist, reading as zeroes CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09; inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache 45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16384 This problem still happens in the master. Regards, Attached patch fixes it, I am not sure how happy I am with the way I did it though. 
Added more comments and made the error message bit clearer. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index ca074da..59d19a0 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -557,6 +557,12 @@ StartupCommitTs(void) TransactionId xid = ShmemVariableCache-nextXid; int pageno = TransactionIdToCTsPage(xid); + if (track_commit_timestamp) + { + ActivateCommitTs(); + return; + } + LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE); /* @@ -571,14 +577,30 @@ StartupCommitTs(void) * This must be called ONCE during postmaster or standalone-backend startup, * when commit timestamp is enabled. Must be called after recovery has * finished. + */ +void +CompleteCommitTsInitialization(void) +{ + if (!track_commit_timestamp) + DeactivateCommitTs(true); +} + +/* + * This must be called when track_commit_timestamp is turned on. + * Note that this only happens during postmaster or standalone-backend startup + * or during WAL replay. + * + * The reason why this SLRU needs separate activation/deactivation functions is + * that it can be enabled/disabled during start and the activation/deactivation + * on master is propagated to slave via replay. Other SLRUs don't have this + * property and they can be just initialized during normal startup. * * This is in charge of creating the currently active segment, if it's not * already there. The reason for this is that the server might have been * running with this module disabled for a while and thus might have skipped * the normal creation point. 
*/ -void -CompleteCommitTsInitialization(void) +void ActivateCommitTs(void) { TransactionId xid = ShmemVariableCache-nextXid; int pageno = TransactionIdToCTsPage(xid); @@ -591,22 +613,6 @@ CompleteCommitTsInitialization(void) LWLockRelease(CommitTsControlLock); /* - * If this module is not currently enabled, make sure we don't hand back - * possibly-invalid data; also remove segments of old data. - */ - if (!track_commit_timestamp) - { - LWLockAcquire(CommitTsLock, LW_EXCLUSIVE); - ShmemVariableCache-oldestCommitTs = InvalidTransactionId; - ShmemVariableCache-newestCommitTs = InvalidTransactionId; - LWLockRelease(CommitTsLock); - - TruncateCommitTs(ReadNewTransactionId()); - - return; - } - - /* * If CommitTs is enabled, but it wasn't in the previous server run, we * need to set the oldest and newest values to the next Xid; that way, we * will not try to read data that might not have been set. @@ -641,6 +647,35 @@ CompleteCommitTsInitialization(void) } /* + * This must be called when track_commit_timestamp is turned off. + * Note that this only happens during postmaster or standalone-backend startup + * or during WAL replay. + * + * Resets CommitTs into invalid state
[HACKERS] event trigger pg_dump fix
Hi, Attached is a fix for http://www.postgresql.org/message-id/1420492505.23613.15.ca...@bloodnok.com It was dumping the comment with a NULL owner while the function expects an empty string for objects without an owner (there is pg_strdup down the line). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 6658fda..6541463 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -15106,7 +15106,7 @@ dumpEventTrigger(Archive *fout, DumpOptions *dopt, EventTriggerInfo *evtinfo) query->data, "", NULL, NULL, 0, NULL, NULL); dumpComment(fout, dopt, labelq->data, -NULL, NULL, +NULL, "", evtinfo->dobj.catId, 0, evtinfo->dobj.dumpId); destroyPQExpBuffer(query); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] event trigger pg_dump fix
On 06/01/15 01:02, Tom Lane wrote: Petr Jelinek p...@2ndquadrant.com writes: Attached is a fix for http://www.postgresql.org/message-id/1420492505.23613.15.ca...@bloodnok.com It was dumping the comment with a NULL owner while the function expects an empty string for objects without an owner (there is pg_strdup down the line). This isn't right either: event triggers do have owners. Bah, of course, stupid me. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 6658fda..00b87e7 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -15106,7 +15106,7 @@ dumpEventTrigger(Archive *fout, DumpOptions *dopt, EventTriggerInfo *evtinfo) query->data, "", NULL, NULL, 0, NULL, NULL); dumpComment(fout, dopt, labelq->data, -NULL, NULL, +NULL, evtinfo->evtowner, evtinfo->dobj.catId, 0, evtinfo->dobj.dumpId); destroyPQExpBuffer(query); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning recovery.conf into GUCs
On 24/12/14 20:11, Alex Shulgin wrote: Alex Shulgin a...@commandprompt.com writes: - the StandbyModeRequested and StandbyMode should be lowercased like the rest of the GUCs Yes, except that standby_mode is linked to StandbyModeRequested, not the other one. I can try to see if there's a sane way out of this. As I see it, renaming these will only add noise to this patch, and there is also ArchiveRecoveryRequested / InArchiveRecovery. This is going to be tricky and I'm not sure it's really worth the effort. I don't buy that, to be honest; most (if not all) places that would be affected are already in the diff as part of context around other renames anyway, and it just does not seem right to rename some variables and not the others. I still think there should be some thought given to merging hot_standby and standby_mode, but I think we'd need the opinion of more people (especially some committers) on this one. - The whole CopyErrorData and memory context logic is not needed in check_recovery_target_time() as the FlushErrorState() is not called there You must be right. I recall having some trouble with strings being free'd prematurely, but if it's ERROR, then there should be no need for that. I'll check again. Hrm, after going through this again I'm pretty sure that was correct: the only way to obtain the current error message is to use CopyErrorData(), but that requires you to switch back to a non-error memory context (via Assert). Right, my bad. The FlushErrorState() call is not there because it's handled by the hook caller (or it can exit via ereport() with elevel==ERROR). Yes, I've seen that; shouldn't you FreeErrorData(edata) then, though? I guess it does not matter too much considering that you are going to throw an error and die eventually anyway once you are in that code path - maybe just for clarity...
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] recovery_min_apply_delay with a negative value
On 05/01/15 20:44, Robert Haas wrote: On Sat, Jan 3, 2015 at 12:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: Of course, if recovery_min_apply_delay were a proper GUC, we'd just configure it with a minimum value of zero and be done :-( Amen. We should *really* convert all of the recovery.conf parameters to be GUCs. Well, there is an ongoing effort on that and I think the patch is very close to the state where a committer should take a look, IMHO; I have only a couple of gripes with it now and one of them needs the opinions of others anyway. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 05/01/15 07:28, Fujii Masao wrote: On Thu, Dec 4, 2014 at 12:08 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Dec 3, 2014 at 11:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Pushed with some extra cosmetic tweaks. I got the following assertion failure when I executed pg_xact_commit_timestamp() in the standby server. =# select pg_xact_commit_timestamp('1000'::xid); TRAP: FailedAssertion(!(((oldestCommitTs) != ((TransactionId) 0)) == ((newestCommitTs) != ((TransactionId) 0))), File: commit_ts.c, Line: 315) server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: 2014-12-04 12:01:08 JST sby1 LOG: server process (PID 15545) was terminated by signal 6: Aborted 2014-12-04 12:01:08 JST sby1 DETAIL: Failed process was running: select pg_xact_commit_timestamp('1000'::xid); The way to reproduce this problem is #1. set up and start the master and standby servers with track_commit_timestamp disabled #2. enable track_commit_timestamp in the master and restart the master #3. run some write transactions #4. enable track_commit_timestamp in the standby and restart the standby #5. execute select pg_xact_commit_timestamp('1000'::xid) in the standby BTW, at the step #4, I got the following log messages. This might be a hint for this problem. LOG: file pg_commit_ts/ doesn't exist, reading as zeroes CONTEXT: xlog redo Transaction/COMMIT: 2014-12-04 12:00:16.428702+09; inval msgs: catcache 59 catcache 58 catcache 59 catcache 58 catcache 45 catcache 44 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16384 This problem still happens in the master. Regards, Attached patch fixes it, I am not sure how happy I am with the way I did it though. 
And while at it I noticed that the redo code for XLOG_PARAMETER_CHANGE sets ControlFile->wal_log_hints = wal_log_hints; shouldn't it be ControlFile->wal_log_hints = xlrec.wal_log_hints; instead?

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index ca074da..fcfccf8 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -557,6 +557,12 @@ StartupCommitTs(void)
 	TransactionId xid = ShmemVariableCache->nextXid;
 	int			pageno = TransactionIdToCTsPage(xid);
 
+	if (track_commit_timestamp)
+	{
+		ActivateCommitTs();
+		return;
+	}
+
 	LWLockAcquire(CommitTsControlLock, LW_EXCLUSIVE);
 
 	/*
@@ -571,14 +577,25 @@ StartupCommitTs(void)
  * This must be called ONCE during postmaster or standalone-backend startup,
  * when commit timestamp is enabled. Must be called after recovery has
  * finished.
+ */
+void
+CompleteCommitTsInitialization(void)
+{
+	if (!track_commit_timestamp)
+		DeactivateCommitTs(true);
+}
+
+/*
+ * This must be called when track_commit_timestamp is turned on.
+ * Note that this only happens during postmaster or standalone-backend startup
+ * or during WAL replay.
  *
  * This is in charge of creating the currently active segment, if it's not
  * already there. The reason for this is that the server might have been
  * running with this module disabled for a while and thus might have skipped
  * the normal creation point.
  */
-void
-CompleteCommitTsInitialization(void)
+void ActivateCommitTs(void)
 {
 	TransactionId xid = ShmemVariableCache->nextXid;
 	int			pageno = TransactionIdToCTsPage(xid);
@@ -591,22 +608,6 @@ CompleteCommitTsInitialization(void)
 	LWLockRelease(CommitTsControlLock);
 
 	/*
-	 * If this module is not currently enabled, make sure we don't hand back
-	 * possibly-invalid data; also remove segments of old data.
-	 */
-	if (!track_commit_timestamp)
-	{
-		LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
-		ShmemVariableCache->oldestCommitTs = InvalidTransactionId;
-		ShmemVariableCache->newestCommitTs = InvalidTransactionId;
-		LWLockRelease(CommitTsLock);
-
-		TruncateCommitTs(ReadNewTransactionId());
-
-		return;
-	}
-
-	/*
 	 * If CommitTs is enabled, but it wasn't in the previous server run, we
 	 * need to set the oldest and newest values to the next Xid; that way, we
 	 * will not try to read data that might not have been set.
@@ -641,6 +642,35 @@ CompleteCommitTsInitialization(void)
 }
 
 /*
+ * This must be called when track_commit_timestamp is turned off.
+ * Note that this only happens during postmaster or standalone-backend startup
+ * or during WAL replay.
+ *
+ * Resets CommitTs into invalid state to make sure we don't hand back
+ * possibly-invalid data; also removes segments of old data.
+ */
+void
+DeactivateCommitTs(bool do_wal)
+{
+	TransactionId xid = ShmemVariableCache->nextXid
Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps
On 29/12/14 11:16, Andres Freund wrote: On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote: That's a little bit better, but I have to say I'm still not impressed. There are so many implicit assumptions in the system. The first assumption is that a 32-bit node id is sufficient. Seriously? Are we going to build facilities for replication systems with that many nodes? It seems absolutely unrealistic that a) somebody does this b) that we'll blindly meet the demands of such a super hypothetical scenario. +1, Honestly, if somebody will ever have setup with more nodes than what fits into 32bits they will run into bigger problems than nodeid being too small. Then there's the assumption that the node id should be sticky, i.e. it's set for the whole session. Again, I'm sure that's useful for many systems, but I could just as easily imagine that you'd want it to reset after every commit. It's trivial to add that/reset it manually. So what? Yes you can reset in the extension after commit, or you can actually override both commit timestamp and nodeid after commit if you so wish. To be honest, I think this patch should be reverted. Instead, we should design a system where extensions can define their own SLRUs to store additional per-transaction information. That way, each extension can have as much space per transaction as needed, and support functions that make most sense with the information. Commit timestamp tracking would be one such extension, and for this node ID stuff, you could have another one (or include it in the replication extension). If somebody wants that they should develop it. But given that we, based on previous discussions, don't want to run user defined code in the relevant phase during transaction commit *and* replay I don't think it'd be all that easy to do it fast and flexible. Right, I would love to have custom SLRUs but I don't see it happening given those two restrictions, otherwise I would write the CommitTs patch that way in the first place... 
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Commit timestamp abbreviations
On 24/12/14 15:15, Bruce Momjian wrote: On Tue, Dec 23, 2014 at 06:00:21PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: I noticed this when looking at the allocated shared memory structures in head: shared memory alignment 64-byte of CommitTs Ctl: 0 shared memory alignment 64-byte of CommitTs shared: 0 I thought we got rid of the idea that 'Ts' means timestamp. Was this part forgotten? Do you have a specific reference? That's not the concern I remember, and I sure don't want to re-read that whole thread again. I remember the issue of using _ts and 'ts' inconsistently, and I thought we were going to spell out timestamp in more places, but maybe I am remembering incorrectly. The change was from committs to commit_ts + CommitTs depending on place. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning recovery.conf into GUCs
On 12/12/14 16:34, Alex Shulgin wrote:
> Alex Shulgin a...@commandprompt.com writes:
>> Alex Shulgin a...@commandprompt.com writes:
>>> Here's an attempt to revive this patch.
>>
>> Here's the patch rebased against current HEAD, that is including the recently committed action_at_recovery_target option.
>>
>> The default for the new GUC is 'pause', as in HEAD, and pause_at_recovery_target is removed completely in favor of it.
>>
>> I've also taken the liberty to remove that part that errors out when finding $PGDATA/recovery.conf. Now get your rotten tomatoes ready. ;-)
>
> This was rather short-sighted, so I've restored that part. Also, rebased on current HEAD, following the rename of action_at_recovery_target to recovery_target_action.

Hi,

I had a quick look; the patch does not apply cleanly anymore, but it's just release notes so nothing too bad. I did not do any testing yet, but I have a few comments about the code:

- The patch mixes the functionality change with the lowercasing of the config variables. I wonder if those could be separated into 2 separate diffs - it would make review somewhat easier, but I can live with it as it is if it's too much work to separate into 2 patches.

- The StandbyModeRequested and StandbyMode should be lowercased like the rest of the GUCs.

- I am wondering if StandbyMode and hot_standby should be merged into one GUC if we are breaking backwards compatibility anyway.

- Could you explain the reasoning behind:
+	if ((*newval)[0] == 0)
+		xid = InvalidTransactionId;
+	else
  in check_recovery_target_xid() (and the same check in check_recovery_target_timeline())? Isn't the strtoul which is called later enough?

- The whole CopyErrorData and memory context logic is not needed in check_recovery_target_time() as the FlushErrorState() is not called there.

- The new code in StartupXLOG() like
+	if (recovery_target_xid_string != NULL)
+		InitRecoveryTarget(RECOVERY_TARGET_XID);
+
+	if (recovery_target_time_string != NULL)
+		InitRecoveryTarget(RECOVERY_TARGET_TIME);
+
+	if (recovery_target_name != NULL)
+		InitRecoveryTarget(RECOVERY_TARGET_NAME);
+
+	if (recovery_target_string != NULL)
+		InitRecoveryTarget(RECOVERY_TARGET_IMMEDIATE);
  would probably be better in a separate function that you then call from StartupXLOG(), as StartupXLOG() is crazy long already, so I think it's preferable to not make it worse.

- I wonder if we should rename trigger_file to standby_trigger_file.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] TABLESAMPLE patch
On 21/12/14 18:38, Tomas Vondra wrote:
> Hi,
>
> On 18.12.2014 13:14, Petr Jelinek wrote:
>> Hi,
>>
>> v2 version of this patch is attached.
>
> I did a review of this v2 patch today. I plan to do a bit more testing, but these are my comments/questions so far:

Thanks for looking at it!

> (0) There's a TABLESAMPLE page at the wiki, not updated since 2012:
>
>     https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
>
>     We should either update it or mark it as obsolete I guess. Also, I'd like to know what's the status regarding the TODO items mentioned there. Are those still valid with this patch?

I will have to look in more detail over the holidays and update it, but the general info about table sampling there applies and will apply to any patch trying to implement it. A quick look at the mail thread suggests that at least the concerns mentioned in the mailing list should not apply to this implementation. And looking at the patch, the problem with BERNOULLI sampling should not exist either, as I use a completely different implementation for that. I do also have some issues with joins which I plan to look at, but it's a different problem (my optimizer code overestimates the number of rows).

> (1) The patch adds a new catalog, but does not bump CATVERSION.

I thought this was always done by the committer?

> (2) The catalog naming (pg_tablesamplemethod) seems a bit awkward, as it squishes everything into a single chunk. That's inconsistent with naming of the other catalogs. I think pg_table_sample_method would be better.

Fair point, but perhaps pg_tablesample_method then, as tablesample is used as a single word everywhere including the standard.

> (3) There are a few more strange naming decisions, but that's mostly because the SQL standard requires that naming. I mean SYSTEM and BERNOULLI method names, and also the fact that the probability is specified as a 0-100 value, which is inconsistent with other places (e.g. percentile_cont uses the usual 0-1 probability notion). But I don't think this can be fixed, that's what the standard says.

Yeah, I don't exactly love that either, but what the standard says, the standard says.

> (4) I noticed there's an interesting extension in SQL Server, which allows specifying PERCENT or ROWS, so you can say
>
>     SELECT * FROM table TABLESAMPLE SYSTEM (25 PERCENT);
>
>     or
>
>     SELECT * FROM table TABLESAMPLE SYSTEM (2500 ROWS);
>
>     That seems handy, and it'd make migration from SQL Server easier. What do you think?

Well, doing it exactly this way somewhat kills the extensibility which was one of the main goals for me - I allow any kind of parameters for sampling, and the handling of those depends solely on the sampling method. This means that in my approach, if you'd want to change what you are limiting, you'd have to write a new sampling method, and the query would then look like

SELECT * FROM table TABLESAMPLE SYSTEM_ROWLIMIT (2500);

or some such (depending on how you name the sampling method). Or

SELECT * FROM table TABLESAMPLE SYSTEM (2500, 'ROWS');

if we chose to extend the SYSTEM sampling method; that would also be doable. The reason for this is that I don't want to limit too much what parameters can be sent to sampling (I envision even a SYSTEM_TIMED sampling method that will get the limit as a time interval, for example).

> (5) I envision a lot of confusion because of the REPEATABLE clause. With READ COMMITTED, it's not really repeatable because of changes done by the other users (and maybe things like autovacuum). Shall we mention this in the documentation?

Yes, the docs need improvement and this should be mentioned, also code-docs - I found a few places which I forgot to update when changing code and which now have misleading comments.

> (6) This seems slightly wrong, because of long/uint32 mismatch:
>
>     long seed = PG_GETARG_UINT32(1);
>
>     I think uint32 would be more appropriate, no?

Probably, although I need long later in the algorithm anyway.

> (7) NITPICKING: I think 'sample_rate' would be a better name here:
>
>     double percent = sampler->percent;

Ok.

> (8) NITPICKING: InitSamplingMethod contains a command with ';;'
>
>     fcinfo.arg[i] = (Datum) 0;;

Yeah :)

> (9) The current regression tests only use the REPEATABLE cases. I understand queries without this clause are RANDOM, but maybe we could do something like this:
>
>     SELECT COUNT(*) BETWEEN 5000 AND 7000 FROM (
>         SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50)
>     ) foo;
>
>     Granted, there's still a small probability of false positive, but maybe that's sufficiently small? Or is the amount of code this tests negligible?

Well, depending on fillfactor and limit it could be made quite reliable I think. I also want to add a test with a VIEW (I think v2 has a bug there) and perhaps some JOIN.

> (10) In the initial patch you mentioned it's possible to write custom
Re: [HACKERS] TABLESAMPLE patch
On 22/12/14 20:14, Jaime Casanova wrote: On Thu, Dec 18, 2014 at 7:14 AM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, v2 version of this patch is attached. a few more tests revealed that passing null as the sample size argument works, and it shouldn't. Fixed. in repeatable it gives an error if i use null as argument but it gives a syntax error, and it should be a data exception (data exception -- invalid repeat argument in a sample clause) according to the standard Also fixed. also you need to add CHECK_FOR_INTERRUPTS somewhere, i tried with a big table and had to wait a long time for it to finish Ah yeah, I can't rely on CHECK_FOR_INTERRUPTS in ExecScan because it might take a while to fetch a row if percentage is very small and table is big... Fixed. Attached is v3 which besides the fixes mentioned above also includes changes discussed with Tomas (except the CREATE/DROP TABLESAMPLE METHOD), fixes for crash with FETCH FIRST and is rebased against current master. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 01d24a5..250ae29 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -49,7 +49,7 @@ SELECT [ ALL | DISTINCT [ ON ( replaceable class=parameterexpression/replac phrasewhere replaceable class=parameterfrom_item/replaceable can be one of:/phrase -[ ONLY ] replaceable class=parametertable_name/replaceable [ * ] [ [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] ) ] ] +[ ONLY ] replaceable class=parametertable_name/replaceable [ * ] [ TABLESAMPLE replaceable class=parametersampling_method/replaceable ( replaceable class=parameterargument/replaceable [, ...] ) [ REPEATABLE ( replaceable class=parameterseed/replaceable ) ] ] [ [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] 
) ] ] [ LATERAL ] ( replaceable class=parameterselect/replaceable ) [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] ) ] replaceable class=parameterwith_query_name/replaceable [ [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] ) ] ] [ LATERAL ] replaceable class=parameterfunction_name/replaceable ( [ replaceable class=parameterargument/replaceable [, ...] ] ) @@ -317,6 +317,38 @@ TABLE [ ONLY ] replaceable class=parametertable_name/replaceable [ * ] /varlistentry varlistentry + termTABLESAMPLE replaceable class=parametersampling_method/replaceable ( replaceable class=parameterargument/replaceable [, ...] ) [ REPEATABLE ( replaceable class=parameterseed/replaceable ) ]/term + listitem + para +Table sample clause after +replaceable class=parametertable_name/replaceable indicates that +a replaceable class=parametersampling_method/replaceable should +be used to retrieve subset of rows in the table. +The replaceable class=parametersampling_method/replaceable can be +one of: +itemizedlist + listitem + paraliteralSYSTEM/literal/para + /listitem + listitem + paraliteralBERNOULLI/literal/para + /listitem +/itemizedlist +Both of those sampling methods currently accept only single argument +which is the percent (floating point from 0 to 100) of the rows to +be returned. +The literalSYSTEM/literal sampling method does block level +sampling with each block having same chance of being selected and +returns all rows from each selected block. +The literalBERNOULLI/literal scans whole table and returns +individual rows with equal probability. +The optional numeric parameter literalREPEATABLE/literal is used +as random seed for sampling. 
+ /para + /listitem + /varlistentry + + varlistentry termreplaceable class=parameteralias/replaceable/term listitem para diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 21721b4..595737c 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -8,6 +8,7 @@ subdir = src/backend/access top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist transam +SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ + transam tsm include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/tsm/Makefile b/src/backend/access/tsm/Makefile new file mode 100644 index 000..73bbbd7 --- /dev/null +++ b/src/backend/access/tsm/Makefile @@ -0,0 +1,17
Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps
On 10/12/14 16:03, Petr Jelinek wrote: On 10/12/14 04:26, Michael Paquier wrote: On Thu, Dec 4, 2014 at 9:26 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Yeah, it was raised. I don't think it was really addressed. There was lengthy discussion on whether to include LSN, node id, and/or some other information, but there was no discussion on: * What is a node ID? * How is it used? * Who assigns it? etc. None of those things are explained in the documentation nor code comments. Could it be possible to get a refresh on that before it bitrots too much or we'll simply forget about it? In the current state, there is no documentation able to explain what is the point of the nodeid stuff, how to use it and what use cases it is aimed at. The API in committs.c should as well be part of it. If that's not done, I think that it would be fair to remove this portion and simply WAL-log the commit timestamp its GUC is activated. I have this on my list so it's not being forgotten and I don't see it bitrotting unless we are changing XLog api again. I have bit busy schedule right now though, can it wait till next week? - I will post some code documentation patches then. Hi, as promised I am sending code-comment patch that explains the use of commit timestamps + nodeid C api for the conflict resolution, comments welcome. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index ca074da..fff1fdd 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -6,6 +6,62 @@ * This module is a pg_clog-like system that stores the commit timestamp * for each transaction. * + * When track_commit_timestamp is enabled, this module will keep track of + * commit timestamp for each transaction. It also provides API to for + * optionally storing nodeid (origin) of each transaction. 
The main purpose of
+ * this functionality is to help with conflict detection and resolution for
+ * replication systems.
+ *
+ * The following example shows how to use the API provided by this module to
+ * handle UPDATE conflicts coming from a replication stream:
+ *
+ * void
+ * update_tuple(Relation relation, HeapTuple remote_tuple,
+ *              TimestampTz remote_commit_ts, CommitTsNodeId remote_node_id)
+ * {
+ *     bool            exists;
+ *     HeapTupleData   local_tuple;
+ *
+ *     // Find existing tuple with same PK/unique index combination.
+ *     exists = find_local_tuple(relation, remote_tuple, &local_tuple);
+ *
+ *     // The tuple was found.
+ *     if (exists)
+ *     {
+ *         TransactionId   xmin;
+ *         TimestampTz     local_commit_ts;
+ *         CommitTsNodeId  local_node_id;
+ *
+ *         xmin = HeapTupleHeaderGetXmin(local_tuple.t_data);
+ *         TransactionIdGetCommitTsData(xmin, &local_commit_ts, &local_node_id);
+ *
+ *         // The new tuple is coming from a different node than the locally
+ *         // saved tuple and the remote commit timestamp is older than the
+ *         // local commit timestamp; this is an UPDATE/UPDATE conflict (a row
+ *         // being UPDATEd on different nodes at the same time).
+ *         if (remote_node_id != local_node_id &&
+ *             remote_commit_ts <= local_commit_ts)
+ *         {
+ *             if (remote_commit_ts < local_commit_ts)
+ *                 return; // Keep the local tuple.
+ *
+ *             // Handle the conflict in a consistent manner.
+ *         }
+ *         else
+ *         {
+ *             // The new tuple either comes from the same node as the old
+ *             // tuple and/or has a newer commit timestamp than the local
+ *             // tuple; apply the UPDATE.
+ *         }
+ *     }
+ *     else
+ *     {
+ *         // Tuple not found (possibly an UPDATE/DELETE conflict); handle it
+ *         // in a consistent manner.
+ *     }
+ * }
+ *
+ * See default_node_id and CommitTsSetDefaultNodeId for explanation of how to
+ * set the nodeid when applying transactions.
+ *
  * XLOG interactions: this module generates an XLOG record whenever a new
  * CommitTs page is initialized to zeroes.
Also, one XLOG record is * generated for setting of values when the caller requests it; this allows @@ -49,6 +105,15 @@ */ /* + * CommitTimestampEntry + * + * Record containing information about the transaction commit timestamp and + * the nodeid. + * + * The nodeid provides IO efficient way for replication systems to store + * information about origin of the transaction. Currently the nodeid is opaque + * value meaning of which is defined by the replication system itself. + * * We need 8+4 bytes per xact. Note that enlarging this struct might mean * the largest possible file name is more than 5 chars long; see * SlruScanDirectory. @@ -93,6 +158,21 @@ CommitTimestampShared *commitTsShared; /* GUC variable */ bool track_commit_timestamp; +/* + * The default_node_id
Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps
On 19/12/14 13:17, Michael Paquier wrote: On Fri, Dec 19, 2014 at 6:30 PM, Petr Jelinek p...@2ndquadrant.com wrote: On 10/12/14 16:03, Petr Jelinek wrote: On 10/12/14 04:26, Michael Paquier wrote: On Thu, Dec 4, 2014 at 9:26 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Yeah, it was raised. I don't think it was really addressed. There was lengthy discussion on whether to include LSN, node id, and/or some other information, but there was no discussion on: * What is a node ID? * How is it used? * Who assigns it? etc. None of those things are explained in the documentation nor code comments. Could it be possible to get a refresh on that before it bitrots too much or we'll simply forget about it? In the current state, there is no documentation able to explain what is the point of the nodeid stuff, how to use it and what use cases it is aimed at. The API in committs.c should as well be part of it. If that's not done, I think that it would be fair to remove this portion and simply WAL-log the commit timestamp its GUC is activated. I have this on my list so it's not being forgotten and I don't see it bitrotting unless we are changing XLog api again. I have bit busy schedule right now though, can it wait till next week? - I will post some code documentation patches then. as promised I am sending code-comment patch that explains the use of commit timestamps + nodeid C api for the conflict resolution, comments welcome. Having some documentation with this example in doc/ would be more fruitful IMO. I am not sure I see point in that, the GUC and SQL interfaces are documented in doc/ and we usually don't put documentation for C interfaces there. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical Replication Helpers WIP for discussion
On 15/12/14 19:42, Robert Haas wrote: On Mon, Dec 15, 2014 at 12:57 AM, Petr Jelinek p...@2ndquadrant.com wrote: we've made few helper functions for making logical replication easier, I bundled it into contrib module as this is mainly for discussion at this time (I don't expect this to get committed any time soon, but it is good way to iron out protocol, etc). I created sample logical decoding plugin that uses those functions and which can be used for passing DML changes in platform/version independent (hopefully) format. I will post sample apply BG worker also once I get some initial feedback about this. It's hard to write tests for this as the binary changes contain transaction ids and timestamps so the data changes constantly. This is of course based on the BDR work Andres, Craig and myself have been doing. I can't understand, either from what you've written here or the rather sparse comments in the patch, what this might be good for. What I tried to achieve here is to provide solution to many of the common problems faced by logical replication solutions. I believe the first step in designing the logical replication (now that we have the logical decoding) is making the output plugin and the efficient protocol so I started with that. The code itself provides two main parts: First is the lrep_utils common utility functions that solve things like transporting DML statements, and more importantly the changed data in efficient manner, trying to not do any conversion if not needed (when architecture/version matches) but falling back to binary/textual IO representation of individual types so that the cross platform/version replication works too. I think those should eventually end up in core (ie not in contrib) as they are helper functions likely to be shared by multiple extensions, but for now I keep them with the rest of the contrib module as I feel better experimenting inside that module. 
There are also read functions that show how the other side could look like, but they are currently unused as the example apply worker is not part of the submission yet. The second part is extensible output plugin which serves both as an example of the intended use of those common utility functions and also as actual working solution that can be used as base for several replication solutions. It provides hooks for the replication solutions built on top of it that can be used for deciding if to replicate specific action on specific object and also injecting additional information to both BEGIN and COMMIT message - this can be useful for example when you are forwarding changes from another node and you wish to pass the information about the original node to the target one. What I hope to get from this is agreement on the general approach and protocol so that we can have common base which will both make it easier to create external logical replication solutions and also eventually lead to full logical replication inside core PostgreSQL. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
Hi, v2 version of this patch is attached. On 16/12/14 09:31, Petr Jelinek wrote: On 16/12/14 08:43, Jaime Casanova wrote: Sadly when the jsonb functions patch was committed a few oids where used, so you should update the ones you are using. at least to make the patch easier for testing. Will do. Done. The test added for this failed, attached is the diff. i didn't looked up why it failed It isn't immediately obvious to me why, will look into it. Fixed. Finally, i created a view with a tablesample clause. i used the view and the tablesample worked, then dumped and restored and the tablesample clause went away... actually pg_get_viewdef() didn't see it at all. Yeah, as I mentioned in the submission the ruleutils support is not there yet, so that's expected. Also fixed. I also added proper costing/row estimation. I consider this patch feature complete now, docs could still use improvement though. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 01d24a5..250ae29 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -49,7 +49,7 @@ SELECT [ ALL | DISTINCT [ ON ( replaceable class=parameterexpression/replac phrasewhere replaceable class=parameterfrom_item/replaceable can be one of:/phrase -[ ONLY ] replaceable class=parametertable_name/replaceable [ * ] [ [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] ) ] ] +[ ONLY ] replaceable class=parametertable_name/replaceable [ * ] [ TABLESAMPLE replaceable class=parametersampling_method/replaceable ( replaceable class=parameterargument/replaceable [, ...] ) [ REPEATABLE ( replaceable class=parameterseed/replaceable ) ] ] [ [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] 
) ] ] [ LATERAL ] ( replaceable class=parameterselect/replaceable ) [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] ) ] replaceable class=parameterwith_query_name/replaceable [ [ AS ] replaceable class=parameteralias/replaceable [ ( replaceable class=parametercolumn_alias/replaceable [, ...] ) ] ] [ LATERAL ] replaceable class=parameterfunction_name/replaceable ( [ replaceable class=parameterargument/replaceable [, ...] ] ) @@ -317,6 +317,38 @@ TABLE [ ONLY ] replaceable class=parametertable_name/replaceable [ * ] /varlistentry varlistentry + termTABLESAMPLE replaceable class=parametersampling_method/replaceable ( replaceable class=parameterargument/replaceable [, ...] ) [ REPEATABLE ( replaceable class=parameterseed/replaceable ) ]/term + listitem + para +Table sample clause after +replaceable class=parametertable_name/replaceable indicates that +a replaceable class=parametersampling_method/replaceable should +be used to retrieve subset of rows in the table. +The replaceable class=parametersampling_method/replaceable can be +one of: +itemizedlist + listitem + paraliteralSYSTEM/literal/para + /listitem + listitem + paraliteralBERNOULLI/literal/para + /listitem +/itemizedlist +Both of those sampling methods currently accept only single argument +which is the percent (floating point from 0 to 100) of the rows to +be returned. +The literalSYSTEM/literal sampling method does block level +sampling with each block having same chance of being selected and +returns all rows from each selected block. +The literalBERNOULLI/literal scans whole table and returns +individual rows with equal probability. +The optional numeric parameter literalREPEATABLE/literal is used +as random seed for sampling. 
+ /para + /listitem + /varlistentry + + varlistentry termreplaceable class=parameteralias/replaceable/term listitem para diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 21721b4..595737c 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -8,6 +8,7 @@ subdir = src/backend/access top_builddir = ../../.. include $(top_builddir)/src/Makefile.global -SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist transam +SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ + transam tsm include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/tsm/Makefile b/src/backend/access/tsm/Makefile new file mode 100644 index 000..73bbbd7 --- /dev/null +++ b/src/backend/access/tsm/Makefile @@ -0,0 +1,17 @@ +#- +# +# Makefile-- +#Makefile for access/tsm
Re: [HACKERS] Combining Aggregates
On 17/12/14 11:02, Atri Sharma wrote: On Wed, Dec 17, 2014 at 3:23 PM, Simon Riggs si...@2ndquadrant.com mailto:si...@2ndquadrant.com wrote: KaiGai, David Rowley and myself have all made mention of various ways we could optimize aggregates. Following WIP patch adds an extra function called a combining function, that is intended to allow the user to specify a semantically correct way of breaking down an aggregate into multiple steps. Also, should we not have a sanity check for the user function provided? Looking at the code, yes, there seems to be XXX comment about that. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 17/12/14 19:55, Alvaro Herrera wrote: I noticed that this makes REPEATABLE a reserved keyword, which is currently an unreserved one. Can we avoid that? I added it because I did not find any other way to fix the shift/reduce conflicts that bison complained about. I am no bison expert though. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 16/12/14 08:43, Jaime Casanova wrote:
> On Wed, Dec 10, 2014 at 6:24 PM, Petr Jelinek <p...@2ndquadrant.com> wrote:
>> Hello,
>> Attached is a basic implementation of TABLESAMPLE clause. It's SQL standard
>> clause and couple of people tried to submit it before so I think I don't
>> need to explain in length what it does - basically returns random sample
>> of a table using a specified sampling method.
>
> Tablesample, yay!
>
> Sadly when the jsonb functions patch was committed a few oids were used,
> so you should update the ones you are using. at least to make the patch
> easier for testing.

Will do.

> The test added for this failed, attached is the diff. i didn't looked up
> why it failed

It isn't immediately obvious to me why, will look into it.

> Finally, i created a view with a tablesample clause. i used the view and
> the tablesample worked, then dumped and restored and the tablesample
> clause went away... actually pg_get_viewdef() didn't see it at all.

Yeah, as I mentioned in the submission the ruleutils support is not there
yet, so that's expected.

> will look at the patch more close tomorrow when my brain wake up ;)

Thanks!

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
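The semantics of the BERNOULLI method discussed in this thread can be simulated in a few lines: each row is included independently with the given probability, and a REPEATABLE-style seed makes the sample reproducible. This is a behavioral sketch only (invented names; it does not reflect how the patch scans heap pages):

```python
# Rough simulation of BERNOULLI sampling semantics: include each row
# independently with probability percent/100. Seeding the generator plays
# the role of REPEATABLE(seed) - the same seed reproduces the same sample.
import random

def bernoulli_sample(rows, percent, seed=None):
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < percent / 100.0]

rows = list(range(10000))
sample = bernoulli_sample(rows, 5, seed=1)
again = bernoulli_sample(rows, 5, seed=1)
assert sample == again          # same seed => identical sample
print(len(sample))              # roughly 5% of 10000, i.e. around 500
```

Note that, unlike a `WHERE random() < 0.05` filter, the real implementation can skip whole blocks, which is the point of having it as a scan-level clause.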
Re: [HACKERS] tracking commit timestamps
On 15/12/14 09:12, Michael Paquier wrote:
> On Wed, Dec 10, 2014 at 6:50 PM, Noah Misch <n...@leadboat.com> wrote:
>> On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:
>>> On 08/12/14 00:56, Noah Misch wrote:
>>>> The commit_ts test suite gives me the attached diff on a 32-bit MinGW
>>>> build running on 64-bit Windows Server 2003. I have not checked other
>>>> Windows configurations; the suite does pass on GNU/Linux.
>>>
>>> Hmm, I wonder if now() needs to be changed to <= now() in those queries
>>> to make them work correctly on that platform. I don't have a machine with
>>> that environment handy right now, so I would appreciate it if you could
>>> try that; in case you don't have time for that, I will try to set up
>>> something later...
>>
>> I will try that, though perhaps not until next week.
>
> FWIW, I just tried that with MinGW-32 and I can see the error on Win7. I
> also checked that changing now() to <= now() fixed the problem, so your
> assumption was right, Petr.
> Regards,

Cool, thanks, I think it was the time granularity problem in Windows.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
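The granularity problem can be made concrete with a toy quantized clock. On Windows the system clock historically advanced in coarse steps (on the order of 15 ms), so a commit timestamp taken just before a check can be *equal* to `now()` rather than strictly earlier; a strict `<` comparison then fails where `<=` succeeds. The tick size below is illustrative, not a Windows internal:

```python
# Simulate a clock that only advances in coarse ticks, to show why a
# strict "ts < now()" test is fragile while "ts <= now()" is robust.

TICK_MS = 15  # pretend the clock only moves in 15 ms increments

def quantized_now(real_ms):
    """Return the clock reading at real time real_ms, rounded down to a tick."""
    return (real_ms // TICK_MS) * TICK_MS

commit_ts = quantized_now(31)   # commit recorded at real time 31 ms
check_ts  = quantized_now(44)   # test query runs 13 ms later, same tick

assert not (commit_ts < check_ts)   # strict comparison fails...
assert commit_ts <= check_ts        # ...but <= still holds
```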
Re: [HACKERS] Sequence Access Method WIP
On 10/12/14 03:33, Petr Jelinek wrote:
> On 24/11/14 12:16, Heikki Linnakangas wrote:
> About the rough edges:
> - The AlterSequence is not the prettiest code around, as we now have to
>   create a new relation when the sequence AM is changed, and I don't know
>   how to do that nicely.
> - I am not sure if I did the locking, command order and dependency
>   handling in AlterSequence correctly.

This version does AlterSequence differently and better. Didn't attach the
gapless sequence again as that one is unchanged.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..818da15 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,7 @@
 subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global

-SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \
+			transam sequence

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c16b38e..35818c0 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -321,6 +321,7 @@
 static bool need_initialization = true;

 static void initialize_reloptions(void);
 static void parse_one_reloption(relopt_value *option, char *text_str,
 					int text_len, bool validate);
+static bytea *common_am_reloptions(RegProcedure amoptions, Datum reloptions,
+					bool validate);

 /*
  * initialize_reloptions
@@ -821,7 +822,8 @@
  * instead.
  *
  * tupdesc is pg_class' tuple descriptor.  amoptions is the amoptions regproc
- * in the case of the tuple corresponding to an index, or InvalidOid otherwise.
+ * in the case of the tuple corresponding to an index or sequence, InvalidOid
+ * otherwise.
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc, Oid amoptions)
@@ -854,6 +856,9 @@
 		case RELKIND_INDEX:
 			options = index_reloptions(amoptions, datum, false);
 			break;
+		case RELKIND_SEQUENCE:
+			options = sequence_reloptions(amoptions, datum, false);
+			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
 			break;
@@ -1299,13 +1304,31 @@
 /*
  * Parse options for indexes.
+ */
+bytea *
+index_reloptions(RegProcedure amoptions, Datum reloptions, bool validate)
+{
+	return common_am_reloptions(amoptions, reloptions, validate);
+}
+
+/*
+ * Parse options for sequences.
+ */
+bytea *
+sequence_reloptions(RegProcedure amoptions, Datum reloptions, bool validate)
+{
+	return common_am_reloptions(amoptions, reloptions, validate);
+}
+
+/*
+ * Parse options for indexes or sequences.
  *
  *	amoptions	Oid of option parser
  *	reloptions	options as text[] datum
  *	validate	error flag
  */
-bytea *
-index_reloptions(RegProcedure amoptions, Datum reloptions, bool validate)
+static bytea *
+common_am_reloptions(RegProcedure amoptions, Datum reloptions, bool validate)
 {
 	FmgrInfo	flinfo;
 	FunctionCallInfoData fcinfo;
diff --git a/src/backend/access/sequence/Makefile b/src/backend/access/sequence/Makefile
new file mode 100644
index 000..01a0dc8
--- /dev/null
+++ b/src/backend/access/sequence/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/sequence
+#
+# IDENTIFICATION
+#    src/backend/access/sequence/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/sequence
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = seqam.o seqlocal.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/sequence/seqam.c b/src/backend/access/sequence/seqam.c
new file mode 100644
index 000..ce57f16
--- /dev/null
+++ b/src/backend/access/sequence/seqam.c
@@ -0,0 +1,428 @@
+/*-------------------------------------------------------------------------
+ *
+ * seqam.c
+ *	  sequence access method routines
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/sequence/seqam.c
+ *
+ *
+ * Sequence access method allows the SQL Standard Sequence objects to be
+ * managed according to either the default access method or a pluggable
+ * replacement. Each sequence can only use one access method at a time,
+ * though different sequence access methods can be in use by different
+ * sequences at the same time.
+ *
+ * The SQL Standard assumes that each Sequence object is completely controlled
+ * from the current database node, preventing any form